Computing GC Content

Identifying Unknown DNA Quickly

A quick method used by early computer software to determine the language of a given piece of text was to analyze the frequency with which each letter appeared in the text. This strategy was used because each language tends to exhibit its own letter frequencies, and as long as the text under consideration is long enough, software will correctly recognize the language quickly and with a very low error rate.
You may ask: what in the world does this linguistic problem have to do with biology? Although two members of the same species will have different genomes, they still share the vast percentage of their DNA; notably, 99.9% of the 3.2 billion base pairs in a human genome are common to almost all humans (i.e., excluding people having major genetic defects). For this reason, biologists will speak of the human genome, meaning an average-case genome derived from a collection of individuals. Such an average case genome can be assembled for any species, a challenge that we will soon discuss.
The biological analog of identifying unknown text arises when researchers encounter a molecule of DNA from an unknown species. Because of the base pairing relations of the two DNA strands, cytosine and guanine will always appear in equal amounts in a double-stranded DNA molecule. Thus, to analyze the symbol frequencies of DNA for comparison against a database, we compute the molecule’s GC-content, or the percentage of its bases that are either cytosine or guanine.
In practice, the GC-content of most eukaryotic genomes hovers around 50%. However, because genomes are so long, we may be able to distinguish species based on very small discrepancies in GC-content; furthermore, most prokaryotes have a GC-content significantly higher than 50%, so that GC-content can be used to quickly differentiate many prokaryotes and eukaryotes by using relatively small DNA samples.

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are ‘C’ or ‘G’. For example, the GC-content of “AGCTATAG” is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with ‘>’, followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with ‘>’ indicates the label of the next string.

In Rosalind’s implementation, a string in FASTA format will be labeled by the ID “Rosalind_xxxx”, where “xxxx” denotes a four-digit code between 0000 and 9999.

Given

At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return

The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Absolute error

We say that a number x is within an absolute error of y to a correct solution if x is within y of the correct solution. For example, if an exact solution is 6.157892, then for x to be within an absolute error of 0.001, we must have that |x − 6.157892| < 0.001, or 6.156892 < x <6.158892.

Error bounding is a vital practical tool because of the inherent round-off error in representing decimals in a computer, where only a finite number of decimal places are allotted to any number. After being compounded over a number of operations, this round-off error can become evident. As a result, rather than testing whether two numbers are equal with x = z, you may wish to simply verify that |xz| is very small.

The mathematical field of numerical analysis is devoted to rigorously studying the nature of computational approximation.


Solution

Let’s define the function that would calculate the GC content of a string:

def gc_content(fasta):
    count = [0, 0]
    for nuc in fasta:
        if nuc == "G":
            count[0] += 1
        elif nuc == "C":
            count[1] += 1
    return (count[0] + count[1]) / len(fasta)

Then, open and read the FASTA file:

with open("rosalind_gc.txt","r") as fasta:
    nextline = str()
    dict = {}
    for line in fasta:
        if line.startswith(">"):
            header = line.strip(">").strip("\n")
        else:
            nextline = (line.strip("\n") + nextline)
        dict[header] = nextline

And finally, run the code:

for key, value in dict.items():
    print(key + "\n" + str(gc_content(value) * 100))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Rabbits and Recurrence Relations

Wascally Wabbits

In 1202, Leonardo of Pisa (commonly known as Fibonacci) considered a mathematical exercise regarding the reproduction of a population of rabbits. He made the following simplifying assumptions about the population:
The population begins in the first month with a pair of newborn rabbits.
Rabbits reach reproductive age after one month.
In any given month, every rabbit of reproductive age mates with another rabbit of reproductive age.
Exactly one month after two rabbits mate, they produce one male and one female rabbit.
Rabbits never die or stop reproducing.
Fibonacci’s exercise was to calculate how many pairs of rabbits would remain in one year. We can see that in the second month, the first pair of rabbits reach reproductive age and mate. In the third month, another pair of rabbits is born, and we have two rabbit pairs; our first pair of rabbits mates again. In the fourth month, another pair of rabbits is born to the original pair, while the second pair reach maturity and mate (with three total pairs). After a year, the rabbit population boasts 144 pairs.
Although Fibonacci’s assumption of the rabbits’ immortality may seem a bit farfetched, his model was not unrealistic for reproduction in a predator-free environment: European rabbits were introduced to Australia in the mid 19th Century, a place with no real indigenous predators for them. Within 50 years, the rabbits had already eradicated many plant species across the continent, leading to irreversible changes in the Australian ecosystem and turning much of its grasslands into eroded, practically uninhabitable parts of the modern Outback. In this problem, we will use the simple idea of counting rabbits to introduce a new computational topic, which involves building up large solutions from smaller ones.

Problem

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence (π,−2‾√,0,π) and the infinite sequence of odd numbers (1,3,5,7,9,…). We use the notation an to represent the n-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci’s rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if Fn represents the number of rabbit pairs alive after the n-th month, then we obtain the Fibonacci sequence having terms Fn that are defined by the recurrence relation

F_{n}=F_{n-1}+F_{n-2}

(with F1=F2=1 to initiate the sequence). Although the sequence bears Fibonacci’s name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of ‘T’ in t with ‘U’ in u.

Given

Given: Positive integers n ≤ 40 and k ≤ 5.

Return

The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).


Solution

Change the n and m values in the dataset given to you.

n = 89                                                                   
m = 16                                                                      
bunnies = [1, 1]                                                               
months = 2
count = []                                                                     
while months < n:                                                              
    if months < m:                                                             
        bunnies.append(bunnies[-2] + bunnies[-1])                              
    elif months == m or count == m + 1:                                        
        bunnies.append(bunnies[-2] + bunnies[-1] - 1)                          
    else:
        bunnies.append(bunnies[-2] + bunnies[-1] - bunnies[-(m + 1)])                                                           
    months += 1
print(bunnies[-1])
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Complementing a Strand of DNA

The Secondary and Tertiary Structures of DNA

In “Counting DNA Nucleotides”, we introduced nucleic acids, and we saw that the primary structure of a nucleic acid is determined by the ordering of its nucleobases along the sugar-phosphate backbone that constitutes the bonds of the nucleic acid polymer. Yet primary structure tells us nothing about the larger, 3-dimensional shape of the molecule, which is vital for a complete understanding of nucleic acids.
The search for a complete chemical structure of nucleic acids was central to molecular biology research in the mid-20th Century, culminating in 1953 with a publication in Nature of fewer than 800 words by James Watson and Francis Crick. Consolidating a high resolution X-ray image created by Rosalind Franklin and Raymond Gosling with a number of established chemical results, Watson and Crick proposed the following structure for DNA:
The DNA molecule is made up of two strands, running in opposite directions. Each base bonds to a base in the opposite strand. Adenine always bonds with thymine, and cytosine always bonds with guanine; the complement of a base is the base to which it always bonds.
In light of Watson and Crick’s model, the bonding of two complementary bases is called a base pair (bp). Therefore, the length of a DNA molecule will commonly be given in bp instead of nt. By complementarity, once we know the order of bases on one strand, we can immediately deduce the sequence of bases in the complementary strand. These bases will run in the opposite order to match the fact that the two strands of DNA run in opposite directions.

Problem

In DNA strings, symbols ‘A’ and ‘T’ are complements of each other, as are ‘C’ and ‘G’.

The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of “GTCA” is “TGAC”).

Given

A DNA string s of length at most 1000 bp.

Return

The reverse complement sc of s.


Solution

This is a fairly simple task. Therefore, we can read the file, run the function and return the output altogether:

for N in open("/Users/cenkcelik/Downloads/rosalind_orf.txt","r").read()[::-1]:
    for pair in ["GC","AT"]:
        if N in pair: print("".join(set(N)^set(pair)),end="")
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Transcribing DNA to RNA

The Second Nucleic Acid

In “Counting DNA Nucleotides”, we described the primary structure of a nucleic acid as a polymer of nucleotide units, and we mentioned that the omnipresent nucleic acid DNA is composed of a varied sequence of four bases.
Yet a second nucleic acid exists alongside DNA in the chromatin; this molecule, which possesses a different sugar called ribose, came to be known as ribose nucleic acid, or RNA. RNA differs further from DNA in that it contains a base called uracil in place of thymine; structural differences between DNA and RNA are shown in Figure 1. Biologists initially believed that RNA was only contained in plant cells, whereas DNA was restricted to animal cells. However, this hypothesis dissipated as improved chemical methods discovered both nucleic acids in the cells of all life forms on Earth.
The primary structure of DNA and RNA is so similar because the former serves as a blueprint for the creation of a special kind of RNA molecule called messenger RNA, or mRNA. mRNA is created during RNA transcription, during which a strand of DNA is used as a template for constructing a strand of RNA by copying nucleotides one at a time, where uracil is used in place of thymine.
In eukaryotes, DNA remains in the nucleus, while RNA can enter the far reaches of the cell to carry out DNA’s instructions. In future problems, we will examine the process and ramifications of RNA transcription in more detail.

Problem

An RNA string is a string formed from the alphabet containing ‘A’, ‘C’, ‘G’, and ‘U’.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of ‘T’ in t with ‘U’ in u.

Given

A DNA string t having length at most 1000 nt.

Return

The transcribed RNA string of t.


Solution

Let’s read the file first:

with open("rosalind_rna.txt") as file:
     string = file.readlines()

Now, it is time to replace uracil with thymine:

print(string.replace("T", "U"))
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.

Counting DNA Nucleotides

A Rapid Introduction to Molecular Biology

Making up all living material, the cell is considered to be the building block of life. The nucleus, a component of most eukaryotic cells, was identified as the hub of cellular activity 150 years ago. Viewed under a light microscope, the nucleus appears only as a darker region of the cell, but as we increase magnification, we find that the nucleus is densely filled with a stew of macromolecules called chromatin. During mitosis (eukaryotic cell division), most of the chromatin condenses into long, thin strings called chromosomes.
One class of the macromolecules contained in chromatin are called nucleic acids. Early 20th century research into the chemical identity of nucleic acids culminated with the conclusion that nucleic acids are polymers, or repeating chains of smaller, similarly structured molecules known as monomers. Because of their tendency to be long and thin, nucleic acid polymers are commonly called strands.
The nucleic acid monomer is called a nucleotide and is used as a unit of strand length (abbreviated to nt). Each nucleotide is formed of three parts: a sugar molecule, a negatively charged ion called a phosphate, and a compound called a nucleobase (“base” for short). Polymerization is achieved as the sugar of one nucleotide bonds to the phosphate of the next nucleotide in the chain, which forms a sugar-phosphate backbone for the nucleic acid strand. A key point is that the nucleotides of a specific type of nucleic acid always contain the same sugar and phosphate molecules, and they differ only in their choice of base. Thus, one strand of a nucleic acid can be differentiated from another based solely on the order of its bases; this ordering of bases defines a nucleic acid’s primary structure.
A strand of deoxyribose nucleic acid (DNA), in which the sugar is called deoxyribose, and the only four choices for nucleobases are molecules called adenine (A), cytosine (C), guanine (G), and thymine (T).
For reasons we will soon see, DNA is found in all living organisms on Earth, including bacteria; it is even found in many viruses (which are often considered to be nonliving). Because of its importance, we reserve the term genome to refer to the sum total of the DNA contained in an organism’s chromosomes.

Problem

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols ‘A’, ‘C’, ‘G’, and ‘T’) is “ATGCTTCAGAAAGGTCTTACG.”

Given

A DNA string s of length at most 1000 nt.

Return

Four integers (separated by spaces) counting the respective number of times that the symbols ‘A’, ‘C’, ‘G’, and ‘T’ occur in s.


Solution

Let’s define the function:

def counting_nucleotides(string):
      return string.count("A"),
      string.count("G"),
      string.count("C"),
      string.count("T")

Now, let’s read the file:

with open("rosalind_dna.txt") as file:
      string = file.readlines()

Call the function and get the output:

counting_nucleotides(string)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.