Finding a Protein Motif

Motif Implies Function

As mentioned in “Translating RNA into Protein”, proteins perform every practical function in the cell. A structural and functional unit of the protein is a protein domain: in terms of the protein’s primary structure, the domain is an interval of amino acids that can evolve and function independently.
Each domain usually corresponds to a single function of the protein (e.g., binding the protein to DNA, creating or breaking specific chemical bonds, etc.). Some proteins, such as myoglobin and the Cytochrome complex, have only one domain, but many proteins are multifunctional and therefore possess several domains. It is even possible to artificially fuse different domains into a protein molecule with definite properties, creating a chimeric protein.
Just like species, proteins can evolve, forming homologous groups called protein families. Proteins from one family usually have the same set of domains, performing similar functions.
A component of a domain essential for its function is called a motif, a term that in general has the same meaning as it does in nucleic acids, although many other terms are also used (blocks, signatures, fingerprints, etc.). Protein motifs are usually evolutionarily conserved, meaning that they appear without much change in different species.
Proteins are identified in different labs around the world and gathered into freely accessible databases. A central repository for protein data is UniProt, which provides detailed protein annotation, including function description, domain structure, and post-translational modifications. UniProt also supports protein similarity search, taxonomy analysis, and literature citations.

Problem

To allow for the presence of its varying forms, a protein motif is represented by a shorthand as follows: [XY] means “either X or Y” and {X} means “any amino acid except X.” For example, the N-glycosylation motif is written as N{P}[ST]{P}.
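The shorthand maps almost directly onto regular-expression syntax: [XY] is already a valid character class, and {X} corresponds to the negated class [^X]. The following is a minimal sketch of that translation, not part of the original problem statement; the helper names motif_to_regex and find_motif are hypothetical. The pattern is wrapped in a lookahead so that overlapping occurrences are all reported, and positions are 1-based.

```python
import re

def motif_to_regex(motif):
    """Translate the shorthand into a regular expression.

    {X} ("any amino acid except X") becomes [^X]; [XY] is left unchanged.
    """
    return re.sub(r"\{([A-Z]+)\}", r"[^\1]", motif)

def find_motif(sequence, motif):
    """Return the 1-based start positions of every (possibly overlapping) match."""
    # A zero-width lookahead lets the scan advance one residue at a time,
    # so a match starting inside a previous match is not skipped.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() + 1 for m in pattern.finditer(sequence)]

print(motif_to_regex("N{P}[ST]{P}"))        # N[^P][ST][^P]
print(find_motif("NNTSYS", "N{P}[ST]{P}"))  # [1, 2]
```

In the short example string, the matches starting at positions 1 and 2 overlap, which is why a plain non-overlapping search would miss the second one.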

You can see the complete description and features of a particular protein by its access ID “uniprot_id” in the UniProt database, by inserting the ID number into

http://www.uniprot.org/uniprot/uniprot_id

Alternatively, you can obtain a protein sequence in FASTA format by following

http://www.uniprot.org/uniprot/uniprot_id.fasta

For example, the data for protein B5ZC00 can be found at http://www.uniprot.org/uniprot/B5ZC00.

Given

At most 15 UniProt Protein Database access IDs.

Return

For each protein possessing the N-glycosylation motif, output its given access ID followed by a list of locations in the protein string where the motif can be found.


Solution

For this problem, we will use the Python library requests. If importing requests fails, run all three lines below to install it first (the line beginning with ! works only inside a Jupyter/IPython notebook). Otherwise, run only the last line:

import sys
!{sys.executable} -m pip install requests
import requests

Let’s read the IDs and download the FASTA file for each protein ID:

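url = 'https://www.uniprot.org/uniprot/'

# Read the whitespace-separated access IDs from the Rosalind dataset.
with open("rosalind_mprt.txt") as file:
    seqIDs = file.read().split()

sequences = {}
for proID in seqIDs:
    # Assumption: IDs such as P07204_TRBM_HUMAN are looked up by the
    # accession before the first underscore.
    goToURL = url + proID.split("_")[0] + ".fasta"
    response = requests.get(goToURL)
    response.raise_for_status()  # stop on a failed download
    # Drop the FASTA header (first line) and join the sequence lines.
    sequences[proID] = "".join(response.text.split("\n")[1:])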
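
The split-and-join step relies on the FASTA layout: the first line is a header beginning with ">", and the remaining lines hold the sequence, wrapped across several lines. As a minimal offline sketch of that step (the helper name fasta_to_sequence and the example record are made up for illustration and do not come from UniProt):

```python
def fasta_to_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string, without the header."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    # Everything after the header is sequence; join the wrapped lines.
    return "".join(line.strip() for line in lines[1:])

# Made-up record, used only to show the header being dropped.
example = ">sp|XXXXXX|EXAMPLE made-up record\nMKNKS\nNATA\n"
print(fasta_to_sequence(example))  # MKNKSNATA
```

Checking for the ">" header is one way to notice early that a download returned an error page instead of a sequence.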

Now, we define the function that searches for the N-glycosylation motif:

def N_gly_motif(sequence):
    """Return the 1-based start positions of N{P}[ST]{P} in sequence."""
    result = []
    for i in range(len(sequence) - 3):
        seq = sequence[i:i+4]
        # N, then anything but P, then S or T, then anything but P.
        if seq[0] == "N" and seq[1] != "P" and seq[2] in "ST" and seq[3] != "P":
            result.append(i + 1)
    return result

Now, let’s run the function and print the output:

for key, value in sequences.items():
    result = N_gly_motif(value)
    # Only proteins that contain the motif are reported.
    if result:
        print(key)
        print(*result)

Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about bioinformatics problems. You may also click the links there for the definition of each term.