Calculating Protein Mass

Chaining the Amino Acids

In “Translating RNA into Protein”, we examined the translation of RNA into an amino acid chain for the construction of a protein. When two amino acids link together, they form a peptide bond, which releases a molecule of water. Thus, after a series of amino acids have been linked together into a polypeptide, every pair of adjacent amino acids has lost one molecule of water, meaning that a polypeptide containing n amino acids has had n − 1 water molecules removed.
More generally, a residue is a molecule from which a water molecule has been removed; every amino acid in a protein are residues except the leftmost and the rightmost ones. These outermost amino acids are special in that one has an “unstarted” peptide bond, and the other has an “unfinished” peptide bond. Between them, the two molecules have a single “extra” molecule of water. Thus, the mass of a protein is the sum of masses of all its residues plus the mass of a single water molecule.
There are two standard ways of computing the mass of a residue by summing the masses of its individual atoms. Its monoisotopic mass is computed by using the principal (most abundant) isotope of each atom in the amino acid, whereas its average mass is taken by taking the average mass of each atom in the molecule (over all naturally appearing isotopes).
Many applications in proteomics rely on mass spectrometry, an analytical chemical technique used to determine the mass, elemental composition, and structure of molecules. In mass spectrometry, monoisotopic mass is used more often than average mass, and so all amino acid masses are assumed to be monoisotopic unless otherwise stated.
The standard unit used in mass spectrometry for measuring mass is the atomic mass unit, which is also called the dalton (Da) and is defined as one twelfth of the mass of a neutral atom of carbon-12. The mass of a protein is the sum of the monoisotopic masses of its amino acid residues plus the mass of a single water molecule (whose monoisotopic mass is 18.01056 Da).
In the following several problems on applications of mass spectrometry, we avoid the complication of having to distinguish between residues and non-residues by only considering peptides excised from the middle of the protein. This is a relatively safe assumption because in practice, peptide analysis is often performed in tandem mass spectrometry. In this special class of mass spectrometry, a protein is first divided into peptides, which are then broken into ions for mass analysis.

Problem

In a weighted alphabet, every symbol is assigned a positive real number called a weight. A string formed from a weighted alphabet is called a weighted string, and its weight is equal to the sum of the weights of its symbols.

The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid.

Given

A protein string P of length at most 1000 aa.

Return

The total weight of P. Consult the monoisotopic mass table.


Solution

Let’s create the monoisotopic mass table as a dictionary:

mass_table = {"A":71.03711, "C":103.00919,
              "D":115.02694, "E":129.04259,
              "F":147.06841, "G":57.02146,
              "H":137.05891, "I":113.08406,
              "K":128.09496, "L":113.08406,
              "M":131.04049, "N":114.04293,
              "P":97.05276, "Q":128.05858,
              "R":156.10111, "S":87.03203,
              "T":101.04768, "V":99.06841,
              "W":186.07931, "Y":163.06333,}

Then, create the function to calculate the total protein mass, which will return the results as in float with three decimals:

def protein_weight(protein_sequence):
    total_weight = 0
    for amino_acid in protein_sequence:
        total_weight += mass_table.get(amino_acid)
    print("%.3f" % total_weight)

All we need to do is to read the dataset, and call the function:

with open("rosalind_prtm.txt") as f:
    protein_string = f.read().strip()

Call the function:

protein_weight(protein_string)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Enumerating Gene Orders

Rearrangements Power Large-Scale Genomic Changes

Point mutations can create changes in populations of organisms from the same species, but they lack the power to create and differentiate entire species. This more arduous work is left to larger mutations called genome rearrangements, which move around huge blocks of DNA. Rearrangements cause major genomic change, and most rearrangements are fatal or seriously damaging to the mutated cell and its descendants (many cancers derive from rearrangements). For this reason, rearrangements that come to influence the genome of an entire species are very rare.
Because rearrangements that affect species evolution occur infrequently, two closely related species will have very similar genomes. Thus, to simplify comparison of two such genomes, researchers first identify similar intervals of DNA from the species, called synteny blocks; over time, rearrangements have created these synteny blocks and heaved them around across the two genomes (often separating blocks onto different chromosomes).
A pair of synteny blocks from two different species are not strictly identical (they are separated by the action of point mutations or very small rearrangements), but for the sake of studying large-scale rearrangements, we consider them to be equivalent. As a result, we can label each synteny block with a positive integer; when comparing two species’ genomes/chromosomes, we then only need to specify the order of its numbered synteny blocks.

Problem

A permutation of length n is an ordering of the positive integers {1, 2, …, n}. For example, π = (5, 3, 2, 1, 4) is a permutation of length 5.

Given

A positive integer n ≤ 7.

Return

The total number of permutations of length n, followed by a list of all such permutations (in any order).


Solution

Let’s import the required package for permutation calculation:

import itertools

Then, we read the dataset:

with open("rosalind_perm.txt") as file:
    N = int(file.read())

We need to define the loop for the permutations. However, in order to print the output as rosalind platform requires, I used a trick here:

def total_permutation(n):
    perm = itertools.permutations(list(range(1, n + 1)))
    print(len(list(perm)))

def permutation(n):
    perm = itertools.permutations(list(range(1, n + 1)))
    for i, j in enumerate(list(perm)):
        permutation_list = ''
        for item in j:
            permutation_list += str(item) + ' '
        print(permutation_list)

The functions above return the total number of permutations of length n and list of permutations, respectively. All we need to do is call these function in order:

def enumerating_gene_orders(n):
    total_permutation(n)
    permutation(n)

Finally, we call the main function and return the desired output.

enumerating_gene_orders(N)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Open Reading Frames

Transcription May Begin Anywhere

In “Transcribing DNA into RNA”, we discussed the transcription of DNA into RNA, and in “Translating RNA into Protein”, we examined the translation of RNA into a chain of amino acids for the construction of proteins. We can view these two processes as a single step in which we directly translate a DNA string into a protein string, thus calling for a DNA codon table.
However, three immediate wrinkles of complexity arise when we try to pass directly from DNA to proteins. First, not all DNA will be transcribed into RNA: so-called junk DNA appears to have no practical purpose for cellular function. Second, we can begin translation at any position along a strand of RNA, meaning that any substring of a DNA string can serve as a template for translation, as long as it begins with a start codon, ends with a stop codon, and has no other stop codons in the middle. As a result, the same RNA string can actually be translated in three different ways, depending on how we group triplets of symbols into codons. For example, …AUGCUGAC… can be translated as …AUGCUG…, …UGCUGA…, and …GCUGAC…, which will typically produce wildly different protein strings.

Problem

Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is one which starts from the start codon and ends by stop codon, without any other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

Given

A DNA string s of length at most 1 kbp in FASTA format.

Return

Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.


Solution

Let’s define the DNA codon map. This step will help bypassing mRNA transcription step:

dna_codon_table = {
    "TTT":"F", "CTT":"L", "ATT":"I", "GTT":"V",
    "TTC":"F", "CTC":"L", "ATC":"I", "GTC":"V",
    "TTA":"L", "CTA":"L", "ATA":"I", "GTA":"V",
    "TTG":"L", "CTG":"L", "ATG":"M", "GTG":"V",
    "TCT":"S", "CCT":"P", "ACT":"T", "GCT":"A",
    "TCC":"S", "CCC":"P", "ACC":"T", "GCC":"A",
    "TCA":"S", "CCA":"P", "ACA":"T", "GCA":"A",
    "TCG":"S", "CCG":"P", "ACG":"T", "GCG":"A",
    "TAT":"Y", "CAT":"H", "AAT":"N", "GAT":"D",
    "TAC":"Y", "CAC":"H", "AAC":"N", "GAC":"D",
    "TAA":"STOP", "CAA":"Q", "AAA":"K", "GAA":"E",
    "TAG":"STOP", "CAG":"Q", "AAG":"K", "GAG":"E",
    "TGT":"C", "CGT":"R", "AGT":"S", "GGT":"G",
    "TGC":"C", "CGC":"R", "AGC":"S", "GGC":"G",
    "TGA":"STOP", "CGA":"R", "AGA":"R", "GGA":"G",
    "TGG":"W", "CGG":"R", "AGG":"R", "GGG":"G"
}

For the complementary DNA strand, we can simply replace A, T, G, C with T, A, C, G, respectively:

def complementary_dna(dna_string):
    replace_bases = {"A":"T","T":"A","G":"C","C":"G"}
    return ''.join([replace_bases[base] for base in reversed(dna_string)])

Another function we need to define is the direct translation function using DNA string. The translation must start with methionine (ATG):

def translate(fragment):
    peptide = []
    methionine = fragment.find("ATG")
    codons = [fragment[methionine:methionine+3] for methionine in range(methionine, len(fragment), 3)]
    for codon in codons:
        peptide += dna_codon_table.get(codon)
    return ''.join(map(str, peptide)) 

But before we directly translate DNA into amino acid sequences, we need to find out possible fragments starting with methionine codon and ending with any of stop codons (TAA, TAG or TGA), while checking three bases (codon length) in between start and stop codons.

import re

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAA|TAG|TGA))')

fragments = []
for string in re.findall(pattern, dna_string):
    fragments.append(string)

for string in re.findall(pattern, complementary_dna(dna_string)):
    fragments.append(string)

Since we have returned DNA fragments to possibly be translated in fragments set, we can run the translate function to find out the possible peptide sequences.

with open("rosalind_orf.txt") as file:
    for line in file:
        if line.startswith(">"):
            nextline = str()
        else:
            nextline += (line.strip("\n"))
    dna_string = nextline

for string in set(fragments):
    print(translate(string))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Inferring mRNA from Protein

Pitfalls of Reversing Translation

When researchers discover a new protein, they would like to infer the strand of mRNA from which this protein could have been translated, thus allowing them to locate genes associated with this protein on the genome.
Unfortunately, although any RNA string can be translated into a unique protein string, reversing the process yields a huge number of possible RNA strings from a single protein string because most amino acids correspond to multiple RNA codons (see the RNA Codon Table).
Because of memory considerations, most data formats that are built into languages have upper bounds on how large an integer can be: in some versions of Python, an “int” variable may be required to be no larger than 2^31 − 1 or 2,147,483,647. As a result, to deal with very large numbers in Rosalind, we need to devise a system that allows us to manipulate large numbers without actually having to store large numbers.

Problem

For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11 = 7 because 29 = 11 × 2 + 7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if a mod n=b mod n; in this case, we use the notation ab mod n.

Two useful facts in modular arithmetic are that if a b mod n and c d mod n, then a + c b + d mod n and a × cb × d mod n. To check your understanding of these rules, you may wish to verify these relationships for a = 29, b = 73, c = 10, d = 32, and n = 11.

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.

Given

A protein string of length at most 1000 aa.

Return

The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don’t neglect the importance of the stop codon in protein translation.)


Solution

Let’s define the codon map as well as the functions first:

codon_map = {
    'UUU': 'F',     'CUU': 'L',     'AUU': 'I',     'GUU': 'V',
    'UUC': 'F',     'CUC': 'L',     'AUC': 'I',     'GUC': 'V',
    'UUA': 'L',     'CUA': 'L',     'AUA': 'I',     'GUA': 'V',
    'UUG': 'L',     'CUG': 'L',     'AUG': 'M',     'GUG': 'V',
    'UCU': 'S',     'CCU': 'P',     'ACU': 'T',     'GCU': 'A',
    'UCC': 'S',     'CCC': 'P',     'ACC': 'T',     'GCC': 'A',
    'UCA': 'S',     'CCA': 'P',     'ACA': 'T',     'GCA': 'A',
    'UCG': 'S',     'CCG': 'P',     'ACG': 'T',     'GCG': 'A',
    'UAU': 'Y',     'CAU': 'H',     'AAU': 'N',     'GAU': 'D',
    'UAC': 'Y',     'CAC': 'H',     'AAC': 'N',     'GAC': 'D',
    'UAA': 'Stop',  'CAA': 'Q',     'AAA': 'K',     'GAA': 'E',
    'UAG': 'Stop',  'CAG': 'Q',     'AAG': 'K',     'GAG': 'E',
    'UGU': 'C',     'CGU': 'R',     'AGU': 'S',     'GGU': 'G',
    'UGC': 'C',     'CGC': 'R',     'AGC': 'S',     'GGC': 'G',
    'UGA': 'Stop',  'CGA': 'R',     'AGA': 'R',     'GGA': 'G',
    'UGG': 'W',     'CGG': 'R',     'AGG': 'R',     'GGG': 'G' 
}

def codon_frequency():
    frequency = {}
    for k, v in codon_map.items():
        if v not in frequency:
            frequency[v] = 0
        frequency[v] += 1
    return (frequency)

def possible_sequences(sequence):
    f = codon_frequency()
    n = f['Stop']
    for seq in sequence:
        n *= f[seq]
    return (n % 1000000)                              

All we need to do is to read the dataset and return the results:

with open('rosalind_mrna.txt') as file: #replace filename with yours
    protein_seq = file.read().strip()
print(possible_sequences(protein_seq))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Finding a Protein Motif

Motif Implies Function

As mentioned in “Translating RNA into Protein”, proteins perform every practical function in the cell. A structural and functional unit of the protein is a protein domain: in terms of the protein’s primary structure, the domain is an interval of amino acids that can evolve and function independently.
Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, etc.). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.
Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions.
A component of a domain essential for its function is called a motif, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.) Usually protein motifs are evolutionarily conservative, meaning that they appear without much change in different species.
Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.

Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means “either X or Y” and {X} means “any amino acid except X.” For example, the N-glycosylation motif is written as N{P}[ST]{P}.

You can see the complete description and features of a particular protein by its access ID “uniprot_id” in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.

Given

At most 15 UniProt Protein Database access IDs.

Return

For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.


Solution

For this problem, we will be using the python library requests. If you have problems while importing requests library, run the code below. Otherwise, run only the last line:

import sys
!{sys.executable} -m pip install requests
import requests

Let’s read the IDs and acquire the fasta files of the corresponding protein IDs:

url = 'https://www.uniprot.org/uniprot/'

with open("rosalind_mprt.txt") as file:
    seqIDs = file.read().replace("\n", " ").split()
sequences = {}
for proID in seqIDs:
    goToURL = url+proID+".fasta"
    response = requests.get(goToURL)
    sequences[proID] = (response.text.split("\n"))
    sequences[proID] = "".join(sequences[proID][1::])

Now, we define the function that would seek for the N-glycosylation motif:

def N_gly_motif(ID, sequence):
    sequence = list(sequence)
    global result
    result = []
    for i in range(0, len(sequence)-3):
        seq = sequence[i:i+4]
        if (seq[0] == "N") and (seq[2] == "S" or seq[2] == "T") and (seq[1] and seq[3] != "P"):
            result.append(i+1)

Now, let’s run the function and print the output:

for key, value in sequences.items():
    N_gly_motif(key, value)
    if not result:
        continue
    else:
        print(key)
        print(*result)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.