Finding a Protein Motif

Motif Implies Function

As mentioned in “Translating RNA into Protein”, proteins perform every practical function in the cell. A structural and functional unit of the protein is a protein domain: in terms of the protein’s primary structure, the domain is an interval of amino acids that can evolve and function independently.
Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, etc.). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.
Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions.
A component of a domain essential for its function is called a motif, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.) Usually protein motifs are evolutionarily conservative, meaning that they appear without much change in different species.
Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.


To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means “either X or Y” and {X} means “any amino acid except X.” For example, the N-glycosylation motif is written as N{P}[ST]{P}.

You can see the complete description and features of a particular protein by its access ID “uniprot_id” in the UniProt database, by inserting the ID number into

Alternatively, you can obtain a protein sequence in FASTA format by following

For example, the data for protein B5ZC00 can be found at


At most 15 UniProt Protein Database access IDs.


For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.


For this problem, we will be using the python library requests. If you have problems while importing requests library, run the code below. Otherwise, run only the last line:

import sys
!{sys.executable} -m pip install requests
import requests

Let’s read the IDs and acquire the fasta files of the corresponding protein IDs:

url = ''

with open("rosalind_mprt.txt") as file:
    seqIDs ="\n", " ").split()
sequences = {}
for proID in seqIDs:
    goToURL = url+proID+".fasta"
    response = requests.get(goToURL)
    sequences[proID] = (response.text.split("\n"))
    sequences[proID] = "".join(sequences[proID][1::])

Now, we define the function that would seek for the N-glycosylation motif:

def N_gly_motif(ID, sequence):
    sequence = list(sequence)
    global result
    result = []
    for i in range(0, len(sequence)-3):
        seq = sequence[i:i+4]
        if (seq[0] == "N") and (seq[2] == "S" or seq[2] == "T") and (seq[1] and seq[3] != "P"):

Now, let’s run the function and print the output:

for key, value in sequences.items():
    N_gly_motif(key, value)
    if not result:
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Independent Alleles

Mendel’s Second Law

Recall that Mendel’s first law states that for any factor, an individual randomly assigns one of its two alleles to its offspring. Yet this law does not state anything regarding the relationship with which alleles for different factors will be inherited.
After recording the results of crossing thousands of pea plants for seven years, Mendel surmised that alleles for different factors are inherited with no dependence on each other. This statement has become his second law, also known as the law of independent assortment.
What does it mean for factors to be “assorted independently?” If we cross two organisms, then a shortened form of independent assortment states that if we look only at organisms having the same alleles for one factor, then the inheritance of another factor should not change.
For example, Mendel’s first law states that if we cross two Aa organisms, then 1/4 of their offspring will be aa, 1/4 will be AA, and 1/2 will be Aa. Now, say that we cross plants that are both heterozygous for two factors, so that both of their genotypes may be written as Aa Bb. Next, examine only Bb offspring: Mendel’s second law states that the same proportions of AA, Aa, and aa individuals will be observed in these offspring. The same fact holds for BB and bb offspring.
As a result, independence will allow us to say that the probability of an aa BB offspring is simply equal to the probability of an aa offspring times the probability of a BB organism, i.e., 1/16.
Because of independence, we can also extend the idea of Punnett squares to multiple factors. We now wish to quantify Mendel’s notion of independence using probability.


Two events A and B are independent if Pr(A and B) is equal to Pr(A) × Pr(B). In other words, the events do not influence each other, so that we may simply calculate each of the individual probabilities separately and then multiply.

More generally, random variables X and Y are independent if whenever A and B are respective events for X and Y, A and B are independent (i.e., Pr(A and B)=Pr(A) × Pr(B)).

As an example of how helpful independence can be for calculating probabilities, let X and Y represent the numbers showing on two six-sided dice. Intuitively, the number of pips showing on one die should not affect the number showing on the other die. If we want to find the probability that X + Y is odd, then we don’t need to draw a tree diagram and consider all possibilities. We simply first note that for X + Y to be odd, either X is even and Y is odd or X is odd and Y is even. In terms of probability, Pr(X + Y is odd) = Pr(X is even and Y is odd) + Pr(X is odd and Y is even). Using independence, this becomes [Pr(X is even) × Pr(Y is odd)] + [Pr(X is odd) × Pr(Y is even)] or



Two positive integers k (k ≤ 7) and N (N ≤ 2^k). In this problem, we begin with Tom, who in the 0th generation has genotype Aa Bb. Tom has two children in the 1st generation, each of whom has two children, and so on. Each organism always mates with an organism having genotype Aa Bb.


The probability that at least N Aa Bb organisms will belong to the k-th generation of Tom’s family tree (don’t count the Aa Bb mates at each level). Assume that Mendel’s second law holds for the factors.


For this problem, we will be using the python library scipy, specifically binom function. If you have problems while importing sys library, run the code below. Otherwise, run only the last line:

import sys
!{sys.executable} -m pip install scipy
from scipy.special import binom

Define the function for the probability calculation:

def mendel_2(k, N):
    def probability(n, k):
        return binom(2 ** k, n) * 0.25 ** n * 0.75 ** (2 ** k - n)
    return 1 - sum(probability(n, k) for n in range(N))

Run function and print the output. Remember to replace k and N values with the values that you are given in the dataset:

print(mendel_2(k, N))
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Finding a Shared Motif

Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.
The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.


A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.


A collection of k (k ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.


A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)


For this problem, we will be using the python library Bio, specifically SeqIO function:

from Bio import SeqIO                      
sequences = []                             
handle = open('rosalind_lcsm.txt', 'r')
for record in SeqIO.parse(handle, 'fasta'):
    sequence = []                          
    seq = ''                               
    for nt in record.seq:                  
        seq += nt                          

Search through the haystack and return the output:

srt_seq = sorted(sequences, key=len)     
short_seq = srt_seq[0]                   
comp_seq = srt_seq[1:]                   
motif = ''                               
for i in range(len(short_seq)):          
    for j in range(i, len(short_seq)):   
        m = short_seq[i:j + 1]           
        found = False                    
        for sequ in comp_seq:            
            if m in sequ:                
                found = True             
                found = False            
        if found and len(m) > len(motif):
            motif = m                    
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Calculating Expected Offspring

The Need for Averages

Averages arise everywhere. In sports, we want to project the average number of games that a team is expected to win; in gambling, we want to project the average losses incurred playing blackjack; in business, companies want to calculate their average expected sales for the next quarter.
Molecular biology is not immune from the need for averages. Researchers need to predict the expected number of antibiotic-resistant pathogenic bacteria in a future outbreak, estimate the predicted number of locations in the genome that will match a given motif, and study the distribution of alleles throughout an evolving population. In this problem, we will begin discussing the third issue; first, we need to have a better understanding of what it means to average a random process.


For a random variable X taking integer values between 1 and n, the expected value of X is

{E}(X)=\sum_{k=1}^{n} k \times {Pr}(X=k)

The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

As a motivating example, let X be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it’s not possible to roll a 3.5). The formula for expected value confirms that

{E}(X)=\sum_{k=1}^{6} k \times {Pr}(X=k) = 3.5

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a uniform random variable (in the die example, this “equal spacing” is equal to 1). We can generalize our die example to find that if X is a uniform random variable with minimum possible value a and maximum possible value b, then


You may also wish to verify that for the dice example, if Y is the random variable associated with the outcome of a second die roll, then E(X+Y)=7

{E}(X + Y)=7


Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

  • AA-AA
  • AA-Aa
  • AA-aa
  • Aa-Aa
  • Aa-aa
  • aa-aa

The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.


For this problem, we will be using a python library called six:

import six

def iev(a,b,c,d,e,f):
    return ((4 * a + 4 * b + 4 * c + 3 * d + 2 * e) / 2.0)

def main():
    line = six.moves.input()
    tokens = line.split(' ')
    a = int(tokens[0])
    b = int(tokens[1])
    c = int(tokens[2])
    d = int(tokens[3])
    e = int(tokens[4])
    f = int(tokens[5])
    answer = iev(a,b,c,d,e,f)

Call the function with the values given in the dataset:

iev(17849, 18970, 19661, 17001, 18174, 19701)
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Overlap Graphs

A Brief Introduction to Graph Theory

Networks arise everywhere in the practical world, especially in biology. Networks are prevalent in popular applications such as modeling the spread of disease, but the extent of network applications spreads far beyond popular science. Our first question asks how to computationally model a network without actually needing to render a picture of the network.
First, some terminology: graph is the technical term for a network; a graph is made up of hubs called nodes (or vertices), pairs of which are connected via segments/curves called edges. If an edge connects nodes v and w, then it is denoted by v, w (or equivalently w, v).
  • an edge v,w is incident to nodes v and w; we say that v and w are adjacent to each other;
  • the degree of v is the number of edges incident to it;
  • a walk is an ordered collection of edges for which the ending node of one edge is the starting node of the next (e.g., {v1,v2}, {v2,v3}, {v3,v4}, etc.);
  • a path is a walk in which every node appears in at most two edges;
  • path length is the number of edges in the path;
  • a cycle is a path whose final node is equal to its first node (so that every node is incident to exactly two edges in the cycle) and
  • the distance between two vertices is the length of the shortest path connecting them.

Graph theory is the abstract mathematical study of graphs and their properties.


A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v, w) (but not by (w, v)). A directed loop is a directed edge of the form (v, v).

For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as st; we demand st to prevent directed loops in the overlap graph (although directed cycles may be present).


A collection of DNA strings in FASTA format having total length at most 10 kbp.


The adjacency list corresponding to O3. You may return edges in any order.


For this problem, we will be using a python library called Bio, but only SeqIO package:

from Bio import SeqIO
k = 3

Open and read the FASTA file:

with open("rosalind_grph.txt") as file:
    fasta_sequences = list(SeqIO.parse(file, 'fasta'))

Now, read each FASTA sequence and return the adjacency list:

for fasta1 in fasta_sequences:
    for fasta2 in fasta_sequences:
        name1, sequence1 =, str(fasta1.seq)
        name2, sequence2 =, str(fasta2.seq)
        if sequence1 == sequence2:
        suffix = sequence1[-k:]
        prefix = sequence2[:k]
        if prefix == suffix:
            print(name1, name2)
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.