Independent Alleles

Mendel’s Second Law

Recall that Mendel’s first law states that for any factor, an individual randomly assigns one of its two alleles to its offspring. Yet this law does not state anything regarding the relationship with which alleles for different factors will be inherited.
After recording the results of crossing thousands of pea plants for seven years, Mendel surmised that alleles for different factors are inherited with no dependence on each other. This statement has become his second law, also known as the law of independent assortment.
What does it mean for factors to be “assorted independently?” If we cross two organisms, then a shortened form of independent assortment states that if we look only at organisms having the same alleles for one factor, then the inheritance of another factor should not change.
For example, Mendel’s first law states that if we cross two Aa organisms, then 1/4 of their offspring will be aa, 1/4 will be AA, and 1/2 will be Aa. Now, say that we cross plants that are both heterozygous for two factors, so that both of their genotypes may be written as Aa Bb. Next, examine only Bb offspring: Mendel’s second law states that the same proportions of AA, Aa, and aa individuals will be observed in these offspring. The same fact holds for BB and bb offspring.
As a result, independence will allow us to say that the probability of an aa BB offspring is simply equal to the probability of an aa offspring times the probability of a BB organism, i.e., 1/16.
Because of independence, we can also extend the idea of Punnett squares to multiple factors. We now wish to quantify Mendel’s notion of independence using probability.


Two events A and B are independent if Pr(A and B) is equal to Pr(A) × Pr(B). In other words, the events do not influence each other, so that we may simply calculate each of the individual probabilities separately and then multiply.

More generally, random variables X and Y are independent if whenever A and B are respective events for X and Y, A and B are independent (i.e., Pr(A and B)=Pr(A) × Pr(B)).

As an example of how helpful independence can be for calculating probabilities, let X and Y represent the numbers showing on two six-sided dice. Intuitively, the number of pips showing on one die should not affect the number showing on the other die. If we want to find the probability that X + Y is odd, then we don’t need to draw a tree diagram and consider all possibilities. We simply first note that for X + Y to be odd, either X is even and Y is odd or X is odd and Y is even. In terms of probability, Pr(X + Y is odd) = Pr(X is even and Y is odd) + Pr(X is odd and Y is even). Using independence, this becomes [Pr(X is even) × Pr(Y is odd)] + [Pr(X is odd) × Pr(Y is even)] or



Two positive integers k (k ≤ 7) and N (N ≤ 2^k). In this problem, we begin with Tom, who in the 0th generation has genotype Aa Bb. Tom has two children in the 1st generation, each of whom has two children, and so on. Each organism always mates with an organism having genotype Aa Bb.


The probability that at least N Aa Bb organisms will belong to the k-th generation of Tom’s family tree (don’t count the Aa Bb mates at each level). Assume that Mendel’s second law holds for the factors.


For this problem, we will be using the python library scipy, specifically binom function. If you have problems while importing sys library, run the code below. Otherwise, run only the last line:

import sys
!{sys.executable} -m pip install scipy
from scipy.special import binom

Define the function for the probability calculation:

def mendel_2(k, N):
    def probability(n, k):
        return binom(2 ** k, n) * 0.25 ** n * 0.75 ** (2 ** k - n)
    return 1 - sum(probability(n, k) for n in range(N))

Run function and print the output. Remember to replace k and N values with the values that you are given in the dataset:

print(mendel_2(k, N))
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Finding a Shared Motif

Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.
The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.


A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.


A collection of k (k ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.


A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)


For this problem, we will be using the python library Bio, specifically SeqIO function:

from Bio import SeqIO                      
sequences = []                             
handle = open('rosalind_lcsm.txt', 'r')
for record in SeqIO.parse(handle, 'fasta'):
    sequence = []                          
    seq = ''                               
    for nt in record.seq:                  
        seq += nt                          

Search through the haystack and return the output:

srt_seq = sorted(sequences, key=len)     
short_seq = srt_seq[0]                   
comp_seq = srt_seq[1:]                   
motif = ''                               
for i in range(len(short_seq)):          
    for j in range(i, len(short_seq)):   
        m = short_seq[i:j + 1]           
        found = False                    
        for sequ in comp_seq:            
            if m in sequ:                
                found = True             
                found = False            
        if found and len(m) > len(motif):
            motif = m                    
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Calculating Expected Offspring

The Need for Averages

Averages arise everywhere. In sports, we want to project the average number of games that a team is expected to win; in gambling, we want to project the average losses incurred playing blackjack; in business, companies want to calculate their average expected sales for the next quarter.
Molecular biology is not immune from the need for averages. Researchers need to predict the expected number of antibiotic-resistant pathogenic bacteria in a future outbreak, estimate the predicted number of locations in the genome that will match a given motif, and study the distribution of alleles throughout an evolving population. In this problem, we will begin discussing the third issue; first, we need to have a better understanding of what it means to average a random process.


For a random variable X taking integer values between 1 and n, the expected value of X is

{E}(X)=\sum_{k=1}^{n} k \times {Pr}(X=k)

The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

As a motivating example, let X be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it’s not possible to roll a 3.5). The formula for expected value confirms that

{E}(X)=\sum_{k=1}^{6} k \times {Pr}(X=k) = 3.5

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a uniform random variable (in the die example, this “equal spacing” is equal to 1). We can generalize our die example to find that if X is a uniform random variable with minimum possible value a and maximum possible value b, then


You may also wish to verify that for the dice example, if Y is the random variable associated with the outcome of a second die roll, then E(X+Y)=7

{E}(X + Y)=7


Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

  • AA-AA
  • AA-Aa
  • AA-aa
  • Aa-Aa
  • Aa-aa
  • aa-aa

The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.


For this problem, we will be using a python library called six:

import six

def iev(a,b,c,d,e,f):
    return ((4 * a + 4 * b + 4 * c + 3 * d + 2 * e) / 2.0)

def main():
    line = six.moves.input()
    tokens = line.split(' ')
    a = int(tokens[0])
    b = int(tokens[1])
    c = int(tokens[2])
    d = int(tokens[3])
    e = int(tokens[4])
    f = int(tokens[5])
    answer = iev(a,b,c,d,e,f)

Call the function with the values given in the dataset:

iev(17849, 18970, 19661, 17001, 18174, 19701)
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Overlap Graphs

A Brief Introduction to Graph Theory

Networks arise everywhere in the practical world, especially in biology. Networks are prevalent in popular applications such as modeling the spread of disease, but the extent of network applications spreads far beyond popular science. Our first question asks how to computationally model a network without actually needing to render a picture of the network.
First, some terminology: graph is the technical term for a network; a graph is made up of hubs called nodes (or vertices), pairs of which are connected via segments/curves called edges. If an edge connects nodes v and w, then it is denoted by v, w (or equivalently w, v).
  • an edge v,w is incident to nodes v and w; we say that v and w are adjacent to each other;
  • the degree of v is the number of edges incident to it;
  • a walk is an ordered collection of edges for which the ending node of one edge is the starting node of the next (e.g., {v1,v2}, {v2,v3}, {v3,v4}, etc.);
  • a path is a walk in which every node appears in at most two edges;
  • path length is the number of edges in the path;
  • a cycle is a path whose final node is equal to its first node (so that every node is incident to exactly two edges in the cycle) and
  • the distance between two vertices is the length of the shortest path connecting them.

Graph theory is the abstract mathematical study of graphs and their properties.


A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v, w) (but not by (w, v)). A directed loop is a directed edge of the form (v, v).

For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph Ok in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as st; we demand st to prevent directed loops in the overlap graph (although directed cycles may be present).


A collection of DNA strings in FASTA format having total length at most 10 kbp.


The adjacency list corresponding to O3. You may return edges in any order.


For this problem, we will be using a python library called Bio, but only SeqIO package:

from Bio import SeqIO
k = 3

Open and read the FASTA file:

with open("rosalind_grph.txt") as file:
    fasta_sequences = list(SeqIO.parse(file, 'fasta'))

Now, read each FASTA sequence and return the adjacency list:

for fasta1 in fasta_sequences:
    for fasta2 in fasta_sequences:
        name1, sequence1 =, str(fasta1.seq)
        name2, sequence2 =, str(fasta2.seq)
        if sequence1 == sequence2:
        suffix = sequence1[-k:]
        prefix = sequence2[:k]
        if prefix == suffix:
            print(name1, name2)
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Mortal Fibonacci Rabbits

Wabbit Season

In “Rabbits and Recurrence Relations”, we mentioned the disaster caused by introducing European rabbits into Australia. By the turn of the 20th Century, the situation was so out of control that the creatures could not be killed fast enough to slow their spread.
The conclusion? Build a fence! The fence, intended to preserve the sanctity of Western Australia, was completed in 1907 after undergoing revisions to push it back as the bunnies pushed their frontier ever westward. If it sounds like a crazy plan, the Australians at the time seem to have concurred.
By 1950, Australian rabbits numbered 600 million, causing the government to decide to release a virus (called myxoma) into the wild, which cut down the rabbits until they acquired resistance. In a final Hollywood twist, another experimental rabbit virus escaped in 1991, and some resistance has already been observed.
The bunnies will not be stopped, but they don’t live forever, and so in this problem, our aim is to expand Fibonacci’s rabbit population model to allow for mortal rabbits.


Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation


and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months.


Positive integers n ≤ 100 and m ≤ 20.


The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.


Change n and m with the values given in the dataset:

n = 89 #replace input                                                                        
m = 16 #replace input                                                                       
bunnies = [1, 1]                                                               
months = 2
count = []                                                                     
while months < n:                                                              
    if months < m:                                                             
        bunnies.append(bunnies[-2] + bunnies[-1])                              
    elif months == m or count == m + 1:                                        
        bunnies.append(bunnies[-2] + bunnies[-1] - 1)                          
        bunnies.append(bunnies[-2] + bunnies[-1] - bunnies[-(                  
            m + 1)])                                                           
    months += 1                                                                
Important note:

This problem is taken from Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.