Locating Restriction Sites

The Billion-Year War

The war between viruses and bacteria has been waged for over a billion years. Viruses called bacteriophages (or simply phages) require a bacterial host to propagate, and so they must somehow infiltrate the bacterium; such deception can only be achieved if the phage understands the genetic framework underlying the bacterium’s cellular functions. The phage’s goal is to insert DNA that will be replicated within the bacterium and lead to the reproduction of as many copies of the phage as possible, which sometimes also involves the bacterium’s demise.
To defend itself, the bacterium must either obfuscate its cellular functions so that the phage cannot infiltrate it, or better yet, go on the counterattack by calling in the air force. Specifically, the bacterium employs aerial scouts called restriction enzymes, which operate by cutting through viral DNA to cripple the phage. But what kind of DNA are restriction enzymes looking for?
The restriction enzyme is a homodimer, which means that it is composed of two identical substructures. Each of these structures separates from the restriction enzyme in order to bind to and cut one strand of the phage DNA molecule; both substructures are pre-programmed with the same target string containing 4 to 12 nucleotides to search for within the phage DNA. The chance that both strands of phage DNA will be cut (thus crippling the phage) is greater if the target is located on both strands of phage DNA, as close to each other as possible. By extension, the best chance of disarming the phage occurs when the two target copies appear directly across from each other along the phage DNA, a phenomenon that occurs precisely when the target is equal to its own reverse complement. Eons of evolution have made sure that most restriction enzyme targets now have this form.

Problem

A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC.

Given

A DNA string of length at most 1 kbp in FASTA format.

Return

The position and length of every reverse palindrome in the string having length between 4 and 12. You may return these pairs in any order.


Solution

First, let’s read the FASTA file. Remember that first line includes a name, therefore, we skip the first line:

with open("rosalind_revp.txt") as f:
    for line in f:
        if line.startswith(">"):
            dna_str = str()
        else:
            dna_str += (line.strip("\n"))

To find its complementary sequence, we use the same function that we covered in “Complementing a String of DNA

def complementary(dna_sequence):
    replace_bases = {"A":"T","T":"A","G":"C","C":"G"}
    return ''.join([replace_bases[base] for base in reversed(dna_sequence)])

Then, we create another function to find out the location and the length of palindromes:

def LocatingRestrictionSites(dna):
    position_length = []
    for i in range(4, 13):
        for j in range(0, len(dna) - i + 1):
            if complementary(dna[j:j+i]) == dna[j:j+i]:
                position_length.append(str(j+1) + ' ' + str(i))
    return position_length

Finally, we print the position and length of the palindromes:

for pos_len in LocatingRestrictionSites(dna_str):
    print(pos_len)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Calculating Protein Mass

Chaining the Amino Acids

In “Translating RNA into Protein”, we examined the translation of RNA into an amino acid chain for the construction of a protein. When two amino acids link together, they form a peptide bond, which releases a molecule of water. Thus, after a series of amino acids have been linked together into a polypeptide, every pair of adjacent amino acids has lost one molecule of water, meaning that a polypeptide containing n amino acids has had n − 1 water molecules removed.
More generally, a residue is a molecule from which a water molecule has been removed; every amino acid in a protein are residues except the leftmost and the rightmost ones. These outermost amino acids are special in that one has an “unstarted” peptide bond, and the other has an “unfinished” peptide bond. Between them, the two molecules have a single “extra” molecule of water. Thus, the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule.
There are two standard ways of computing the mass of a residue by summing the masses of its individual atoms. Its monoisotopic mass is computed by using the principal (most abundant) isotope of each atom in the amino acid, whereas its average mass is taken by taking the average mass of each atom in the molecule (over all naturally appearing isotopes).
Many applications in proteomics rely on mass spectrometry, an analytical chemical technique used to determine the mass, elemental composition, and structure of molecules. In mass spectrometry, monoisotopic mass is used more often than average mass, and so all amino acid masses are assumed to be monoisotopic unless otherwise stated.
The standard unit used in mass spectrometry for measuring mass is the atomic mass unit, which is also called the dalton (Da) and is defined as one twelfth of the mass of a neutral atom of carbon-12. The mass of a protein is the sum of the monoisotopic masses of its amino acid residues plus the mass of a single water molecule (whose monoisotopic mass is 18.01056 Da).
In the following several problems on applications of mass spectrometry, we avoid the complication of having to distinguish between residues and non-residues by only considering peptides excised from the middle of the protein. This is a relatively safe assumption because in practice, peptide analysis is often performed in tandem mass spectrometry. In this special class of mass spectrometry, a protein is first divided into peptides, which are then broken into ions for mass analysis.

Problem

In a weighted alphabet, every symbol is assigned a positive real number called a weight. A string formed from a weighted alphabet is called a weighted string, and its weight is equal to the sum of the weights of its symbols.

The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid.

Given

A protein string P of length at most 1000 aa.

Return

The total weight of P. Consult the monoisotopic mass table.


Solution

Let’s create the monoisotopic mass table as a dictionary:

mass_table = {"A":71.03711, "C":103.00919,
              "D":115.02694, "E":129.04259,
              "F":147.06841, "G":57.02146,
              "H":137.05891, "I":113.08406,
              "K":128.09496, "L":113.08406,
              "M":131.04049, "N":114.04293,
              "P":97.05276, "Q":128.05858,
              "R":156.10111, "S":87.03203,
              "T":101.04768, "V":99.06841,
              "W":186.07931, "Y":163.06333,}

Then, create the function to calculate the total protein mass, which will return the results as in float with three decimals:

def protein_weight(protein_sequence):
    total_weight = 0
    for amino_acid in protein_sequence:
        total_weight += mass_table.get(amino_acid)
    print("%.3f" % total_weight)

All we need to do is to read the dataset, and call the function:

with open("rosalind_prtm.txt") as f:
    protein_string = f.read().strip()

Call the function:

protein_weight(protein_string)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Enumerating Gene Orders

Rearrangements Power Large-Scale Genomic Changes

Point mutations can create changes in populations of organisms from the same species, but they lack the power to create and differentiate entire species. This more arduous work is left to larger mutations called genome rearrangements, which move around huge blocks of DNA. Rearrangements cause major genomic change, and most rearrangements are fatal or seriously damaging to the mutated cell and its descendants (many cancers derive from rearrangements). For this reason, rearrangements that come to influence the genome of an entire species are very rare.
Because rearrangements that affect species evolution occur infrequently, two closely related species will have very similar genomes. Thus, to simplify comparison of two such genomes, researchers first identify similar intervals of DNA from the species, called synteny blocks; over time, rearrangements have created these synteny blocks and heaved them around across the two genomes (often separating blocks onto different chromosomes).
A pair of synteny blocks from two different species are not strictly identical (they are separated by the action of point mutations or very small rearrangements), but for the sake of studying large-scale rearrangements, we consider them to be equivalent. As a result, we can label each synteny block with a positive integer; when comparing two species’ genomes/chromosomes, we then only need to specify the order of its numbered synteny blocks.

Problem

A permutation of length n is an ordering of the positive integers {1, 2, …, n}. For example, π = (5, 3, 2, 1, 4) is a permutation of length 5.

Given

A positive integer n ≤ 7.

Return

The total number of permutations of length n, followed by a list of all such permutations (in any order).


Solution

Let’s import the required package for permutation calculation:

import itertools

Then, we read the dataset:

with open("rosalind_perm.txt") as file:
    N = int(file.read())

We need to define the loop for the permutations. However, in order to print the output as rosalind platform requires, I used a trick here:

def total_permutation(n):
    perm = itertools.permutations(list(range(1, n + 1)))
    print(len(list(perm)))

def permutation(n):
    perm = itertools.permutations(list(range(1, n + 1)))
    for i, j in enumerate(list(perm)):
        permutation_list = ''
        for item in j:
            permutation_list += str(item) + ' '
        print(permutation_list)

The functions above return the total number of permutations of length n and list of permutations, respectively. All we need to do is call these function in order:

def enumerating_gene_orders(n):
    total_permutation(n)
    permutation(n)

Finally, we call the main function and return the desired output.

enumerating_gene_orders(N)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Open Reading Frames

Transcription May Begin Anywhere

In “Transcribing DNA into RNA”, we discussed the transcription of DNA into RNA, and in “Translating RNA into Protein”, we examined the translation of RNA into a chain of amino acids for the construction of proteins. We can view these two processes as a single step in which we directly translate a DNA string into a protein string, thus calling for a DNA codon table.
However, three immediate wrinkles of complexity arise when we try to pass directly from DNA to proteins. First, not all DNA will be transcribed into RNA: so-called junk DNA appears to have no practical purpose for cellular function. Second, we can begin translation at any position along a strand of RNA, meaning that any substring of a DNA string can serve as a template for translation, as long as it begins with a start codon, ends with a stop codon, and has no other stop codons in the middle. As a result, the same RNA string can actually be translated in three different ways, depending on how we group triplets of symbols into codons. For example, …AUGCUGAC… can be translated as …AUGCUG…, …UGCUGA…, and …GCUGAC…, which will typically produce wildly different protein strings.

Problem

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

Given

A DNA string s of length at most 1 kbp in FASTA format.

Return

Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.


Solution

Let’s define the DNA codon map. This step will help bypassing mRNA transcription step:

dna_codon_table = {
    "TTT":"F", "CTT":"L", "ATT":"I", "GTT":"V",
    "TTC":"F", "CTC":"L", "ATC":"I", "GTC":"V",
    "TTA":"L", "CTA":"L", "ATA":"I", "GTA":"V",
    "TTG":"L", "CTG":"L", "ATG":"M", "GTG":"V",
    "TCT":"S", "CCT":"P", "ACT":"T", "GCT":"A",
    "TCC":"S", "CCC":"P", "ACC":"T", "GCC":"A",
    "TCA":"S", "CCA":"P", "ACA":"T", "GCA":"A",
    "TCG":"S", "CCG":"P", "ACG":"T", "GCG":"A",
    "TAT":"Y", "CAT":"H", "AAT":"N", "GAT":"D",
    "TAC":"Y", "CAC":"H", "AAC":"N", "GAC":"D",
    "TAA":"STOP", "CAA":"Q", "AAA":"K", "GAA":"E",
    "TAG":"STOP", "CAG":"Q", "AAG":"K", "GAG":"E",
    "TGT":"C", "CGT":"R", "AGT":"S", "GGT":"G",
    "TGC":"C", "CGC":"R", "AGC":"S", "GGC":"G",
    "TGA":"STOP", "CGA":"R", "AGA":"R", "GGA":"G",
    "TGG":"W", "CGG":"R", "AGG":"R", "GGG":"G"
}

For the complementary DNA strand, we can simply replace A, T, G, C with T, A, C, G, respectively:

def complementary_dna(dna_string):
    replace_bases = {"A":"T","T":"A","G":"C","C":"G"}
    return ''.join([replace_bases[base] for base in reversed(dna_string)])

Another function we need to define is the direct translation function using DNA string. The translation must start with methionine (ATG):

def translate(fragment):
    peptide = []
    methionine = fragment.find("ATG")
    codons = [fragment[methionine:methionine+3] for methionine in range(methionine, len(fragment), 3)]
    for codon in codons:
        peptide += dna_codon_table.get(codon)
    return ''.join(map(str, peptide)) 

But before we directly translate DNA into amino acid sequences, we need to find out possible fragments starting with methionine codon and ending with any of stop codons (TAA, TAG or TGA), while checking three bases (codon length) in between start and stop codons.

import re

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAA|TAG|TGA))')

fragments = []
for string in re.findall(pattern, dna_string):
    fragments.append(string)

for string in re.findall(pattern, complementary_dna(dna_string)):
    fragments.append(string)

Since we have returned DNA fragments to possibly be translated in fragments set, we can run the translate function to find out the possible peptide sequences.

with open("rosalind_orf.txt") as file:
    for line in file:
        if line.startswith(">"):
            nextline = str()
        else:
            nextline += (line.strip("\n"))
    dna_string = nextline

for string in set(fragments):
    print(translate(string))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Inferring mRNA from Protein

Pitfalls of Reversing Translation

When researchers discover a new protein, they would like to infer the strand of mRNA from which this protein could have been translated, thus allowing them to locate genes associated with this protein on the genome.
Unfortunately, although any RNA string can be translated into a unique protein string, reversing the process yields a huge number of possible RNA strings from a single protein string because most amino acids correspond to multiple RNA codons (see the RNA Codon Table).
Because of memory considerations, most data formats that are built into languages have upper bounds on how large an integer can be: in some versions of Python, an “int” variable may be required to be no larger than 2^31 − 1 or 2,147,483,647. As a result, to deal with very large numbers in Rosalind, we need to devise a system that allows us to manipulate large numbers without actually having to store large numbers.

Problem

For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11 = 7 because 29 = 11 × 2 + 7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if a mod n=b mod n; in this case, we use the notation ab mod n.

Two useful facts in modular arithmetic are that if a b mod n and c d mod n, then a + c b + d mod n and a × cb × d mod n. To check your understanding of these rules, you may wish to verify these relationships for a = 29, b = 73, c = 10, d = 32, and n = 11.

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.

Given

A protein string of length at most 1000 aa.

Return

The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don’t neglect the importance of the stop codon in protein translation.)


Solution

Let’s define the codon map as well as the functions first:

codon_map = {
    'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
    'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
    'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
    'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
    'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
    'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
    'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
    'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G' 
}

def codon_frequency():
    frequency = {}
    for k, v in codon_map.items():
        if v not in frequency:
            frequency[v] = 0
        frequency[v] += 1
    return (frequency)

def possible_sequences(sequence):
    f = codon_frequency()
    n = f['Stop']
    for seq in sequence:
        n *= f[seq]
    return (n % 1000000)                              

All we need to do is to read the dataset and return the results:

with open('rosalind_mrna.txt') as file: #replace filename with yours
    protein_seq = file.read().strip()
print(possible_sequences(protein_seq))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.