Mortal Fibonacci Rabbits

Wabbit Season

In “Rabbits and Recurrence Relations”, we mentioned the disaster caused by introducing European rabbits into Australia. By the turn of the 20th Century, the situation was so out of control that the creatures could not be killed fast enough to slow their spread.
The conclusion? Build a fence! The fence, intended to preserve the sanctity of Western Australia, was completed in 1907 after undergoing revisions to push it back as the bunnies pushed their frontier ever westward. If it sounds like a crazy plan, the Australians at the time seem to have concurred.
By 1950, Australian rabbits numbered 600 million, causing the government to decide to release a virus (called myxoma) into the wild, which cut down the rabbits until they acquired resistance. In a final Hollywood twist, another experimental rabbit virus escaped in 1991, and some resistance has already been observed.
The bunnies will not be stopped, but they don’t live forever, and so in this problem, our aim is to expand Fibonacci’s rabbit population model to allow for mortal rabbits.

Problem

Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation

F_{n}=F_{n-1}+F_{n-2}

and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months.

Given

Positive integers n ≤ 100 and m ≤ 20.

Return

The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.


Solution

Change n and m with the values given in the dataset:

n = 89 #replace input                                                                        
m = 16 #replace input                                                                       
bunnies = [1, 1]                                                               
months = 2
count = []                                                                     
while months < n:                                                              
    if months < m:                                                             
        bunnies.append(bunnies[-2] + bunnies[-1])                              
    elif months == m or count == m + 1:                                        
        bunnies.append(bunnies[-2] + bunnies[-1] - 1)                          
    else:                                                                      
        bunnies.append(bunnies[-2] + bunnies[-1] - bunnies[-(                  
            m + 1)])                                                           
    months += 1                                                                
print(bunnies[-1])
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Consensus and Profile

Finding a Most Likely Common Ancestor

In “Counting Point Mutations”, we calculated the minimum number of symbol mismatches between two strings of equal length to model the problem of finding the minimum number of point mutations occurring on the evolutionary path between two homologous strands of DNA. If we instead have several homologous strands that we wish to analyze simultaneously, then the natural problem is to find an average-case strand to represent the most likely common ancestor of the given strands.

Problem

A matrix is a rectangular table of values divided into rows and columns. An m×n matrix has m rows and n columns. Given a matrix A, we write Ai,j to indicate the value found at the intersection of row i and column j.

Say that we have a collection of DNA strings, all having the same length n. Their profile matrix is a 4 × n matrix P in which P1,j represents the number of times that ‘A’ occurs in the jth position of one of the strings, P2,j represents the number of times that C occurs in the j-th position, and so on.

A consensus string c is a string of length n formed from our collection by taking the most common symbol at each position; the j-th symbol of c therefore corresponds to the symbol having the maximum value in the j-th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.


DNA strings
A T C C A G C T
G G G C A A C T
A T G G A T C T
A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T

Profile
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6

Consensus A T G C A A C T

Given

A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return

A consensus string and profile matrix for the collection (If several possible consensus strings exist, then you may return any one of them).


Solution

Let’s define a function that would find out the locations of substrings:

def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

Then, read the dataset and run the function:

data_list = []
file = open('main.out','w')
with open('rosalind_cons.txt') as fp: #replace filename with yours
    for name, seq in read_fasta(fp):
        data_list.append(seq)
length = len(data_list)

L = len(seq)

P = [[0 for x in xrange(L)] for y in xrange(4)] 

Q = ['A','C','G','T']

for x in range(L):
    for y in range(4):
        for z in range(length):
            P[y][x] = P[y][x] + data_list[z][x].count(Q[y])

domi = ""
for x in range(L):
    MAX = 0
    for y in range(4):  
        if P[y][x]>=P[MAX][x]:
            MAX = y       
            
    if MAX == 0:
        domi = domi+'A'
    elif MAX ==1:
        domi = domi+'C'
    elif MAX ==2:
        domi = domi+'G'
    elif MAX ==3:
        domi = domi+'T'

And finally, print the output:

file.write('%s\n%s\n%s\n%s\n%s' %(
    domi,'A: '+str(P[0]).strip('[]').replace(',',''),'C: '
    +str(P[1]).strip('[]').replace(',',''),'G: '
    +str(P[2]).strip('[]').replace(',',''),'T: '
    +str(P[3]).strip('[]').replace(',','')))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Finding a Motif in DNA

Combing Through the Haystack

Finding the same interval of DNA in the genomes of two different organisms (often taken from different species) is highly suggestive that the interval has the same function in both organisms.
We define a motif as such a commonly shared interval of DNA. A common task in molecular biology is to search an organism’s genome for a known motif.
The situation is complicated by the fact that genomes are riddled with intervals of DNA that occur multiple times (possibly with slight modifications), called repeats. These repeats occur far more often than would be dictated by random chance, indicating that genomes are anything but random and in fact illustrate that the language of DNA must be very powerful (compare with the frequent reuse of common words in any human language).
The most common repeat in humans is the Alu repeat, which is approximately 300 bp long and recurs around a million times throughout every human genome. However, Alu has not been found to serve a positive purpose, and appears in fact to be parasitic: when a new Alu repeat is inserted into a genome, it frequently causes genetic disorders.

Problem

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of ‘U’ in “AUGCUUCAGAAAGGUCUUACG” are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = “AUGCUUCAGAAAGGUCUUACG”, then s[2:5] = “UGCU”.

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s.

Given

Two DNA strings s and t (each of length at most 1 kbp).

Return

All locations of t as a substring of s.


Solution

Let’s define a function that would find out the locations of substrings:

def substring_loc(s, t):
    global result
    result = []
    t_len = len(t)
    s_len = len(s)
    for i in range(0, s_len - t_len + 1):
        if s[i:i+t_len] == t:
            result.append(i+1)        
    return result

Then, read the dataset:

with open("rosalind_subs.txt","r") as motif:
    motiff = str()
    for line in motif:
        motiff = motiff + line
    list = motiff.split("\n")

motiff_1 = list[0]
motiff_2 = list[1]

And finally, run the function to find the locations of substrings:

substring_loc(motiff_1,motiff_2)
for i in range(0, len(result)):
    print(result[i], end=" ")
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Translating RNA into Protein

The Genetic Code

Just as nucleic acids are polymers of nucleotides, proteins are chains of smaller molecules called amino acids; 20 amino acids commonly appear in every species. Just as the primary structure of a nucleic acid is given by the order of its nucleotides, the primary structure of a protein is the order of its amino acids. Some proteins are composed of several subchains called polypeptides, while others are formed of a single polypeptide.
Proteins power every practical function carried out by the cell, and so presumably, the key to understanding life lies in interpreting the relationship between a chain of amino acids and the function of the protein that this chain of amino acids eventually constructs. Proteomics is the field devoted to the study of proteins.
How are proteins created? The genetic code, discovered throughout the course of a number of ingenious experiments in the late 1950s, details the translation of an RNA molecule called messenger RNA (mRNA) into amino acids for protein creation. The apparent difficulty in translation is that somehow 4 RNA bases must be translated into a language of 20 amino acids; in order for every possible amino acid to be created, we must translate 3-nucleobase strings (called codons) into amino acids. Note that there are 4^3 = 64 possible codons, so that multiple codons may encode the same amino acid. Two special types of codons are the start codon (AUG), which codes for the amino acid methionine always indicates the start of translation, and the three stop codons (UAA, UAG, UGA), which do not code for an amino acid and cause translation to end.
The notion that protein is always created from RNA, which in turn is always created from DNA, forms the central dogma of molecular biology. Like all dogmas, it does not always hold; however, it offers an excellent approximation of the truth.
An organelle called a ribosome creates peptides by using a helper molecule called transfer RNA (tRNA). A single tRNA molecule possesses a string of three RNA nucleotides on one end (called an anticodon) and an amino acid at the other end. The ribosome takes an RNA molecule transcribed from DNA (see “Transcribing DNA into RNA”), called messenger RNA (mRNA), and examines it one codon at a time. At each step, the tRNA possessing the complementary anticodon bonds to the mRNA at this location, and the amino acid found on the opposite end of the tRNA is added to the growing peptide chain before the remaining part of the tRNA is ejected into the cell, and the ribosome looks for the next tRNA molecule.
Not every RNA base eventually becomes translated into a protein, and so an interval of RNA (or an interval of DNA translated into RNA) that does code for a protein is of great biological interest; such an interval of DNA or RNA is called a gene. Because protein creation drives cellular processes, genes differentiate organisms and serve as a basis for heredity, or the process by which traits are inherited.

Problem

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given

An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return

The protein string encoded by s.


Solution

Let’s define a dictionary of codons as keys with their corresponding amino acids (or START, STOP signals) as values:

map = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",}

Then, define a function to translate codons into proteins:

def translate_mrna(mRNA):
    start = mRNA.find("AUG")
    triplets = [mRNA[start:start+3] for start in range(start, len(mRNA), 3)]
    for triplet in triplets:
        if map.get(triplet) == "STOP":
            return
        else:
            print(map.get(triplet), end="")
    return

And finally, open the dataset and run the function to translate mRNA into a protein string:

with open("rosalind_prot.txt","r") as mrna_data:
    mRNA = str(mrna_data.readlines())
    translate_mrna(mRNA)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Mendel’s First Law

Introduction to Mendelian Inheritance

Modern laws of inheritance were first described by Gregor Mendel (an Augustinian Friar) in 1865. The contemporary hereditary model, called blending inheritance, stated that an organism must exhibit a blend of its parent’s traits. This rule is obviously violated both empirically (consider the huge number of people who are taller than both their parents) and statistically (over time, blended traits would simply blend into the average, severely limiting variation).
Mendel, working with thousands of pea plants, believed that rather than viewing traits as continuous processes, they should instead be divided into discrete building blocks called factors. Furthermore, he proposed that every factor possesses distinct forms, called alleles.
In what has come to be known as his first law (also known as the law of segregation), Mendel stated that every organism possesses a pair of alleles for a given factor. If an individual’s two alleles for a given factor are the same, then it is homozygous for the factor; if the alleles differ, then the individual is heterozygous. The first law concludes that for any factor, an organism randomly passes one of its two alleles to each offspring, so that an individual receives one allele from each parent.
Mendel also believed that any factor corresponds to only two possible alleles, the dominant and recessive alleles. An organism only needs to possess one copy of the dominant allele to display the trait represented by the dominant allele. In other words, the only way that an organism can display a trait encoded by a recessive allele is if the individual is homozygous recessive for that factor.
We may encode the dominant allele of a factor by a capital letter (e.g., A) and the recessive allele by a lower case letter (e.g., a). Because a heterozygous organism can possess a recessive allele without displaying the recessive form of the trait, we henceforth define an organism’s genotype to be its precise genetic makeup and its phenotype as the physical manifestation of its underlying traits.
The different possibilities describing an individual’s inheritance of two alleles from its parents can be represented by a Punnett square.

Problem

Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let X represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by Pr(X = red) = 35 and Pr(X = blue) = 25.

Random variables can be combined to yield new random variables. Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball). The probability of Y being red depends on whether the first ball was red or blue. To represent all outcomes of X and Y, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for X and Y, with outcomes at the endpoints (“leaves”) of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let A be the event “Y is blue.” Pr(A) is equal to the sum of the probabilities of two different outcomes:

{Pr}(X=\text { blue } and\text { } Y=\text { blue }) + {Pr}(X=\text { red } and\text { } Y=\text { blue })

or

\frac{3}{10}+\frac{1}{10}=\frac{2}{5}

Given

Three positive integers k, m, and n, representing a population containing k + m + n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return

The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.


Solution

Let’s define the function that would calculate the probability of recessive traits only but return the probability of dominant phenotype:

def mendel(x, y, z):
    #calculate the probability of recessive traits only
    total = x + y + z
    twoRecess = (z/total)*((z-1)/(total-1))
    twoHetero = (y/total)*((y-1)/(total-1))
    heteroRecess = (z/total)*(y/(total-1)) + (y/total)*(z/(total-1))
    recessProb = twoRecess + twoHetero*1/4 + heteroRecess*1/2
    print(1-recessProb)

Then, open and read the dataset:

with open ("rosalind_iprb.txt", "r") as file: #replace filename with your filename
    line = file.readline().split()
    x, y, z = [int(n) for n in line]
    print(x, y, z)

And finally, run the code to find out the probability:

print(mendel(x, y, z))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.