Finding a Shared Motif

Searching Through the Haystack

In “Finding a Motif in DNA”, we searched a given genetic string for a motif; however, this problem assumed that we know the motif in advance. In practice, biologists often do not know exactly what they are looking for. Rather, they must hunt through several different genomes at the same time to identify regions of similarity that may indicate genes shared by different organisms or species.
The simplest such region of similarity is a motif occurring without mutation in every one of a collection of genetic strings taken from a database; such a motif corresponds to a substring shared by all the strings. We want to search for long shared substrings, as a longer motif will likely indicate a greater shared function.

Problem

A common substring of a collection of strings is a substring of every member of the collection. We say that a common substring is a longest common substring if there does not exist a longer common substring. For example, “CG” is a common substring of “ACGTACGT” and “AACCGTATA”, but it is not as long as possible; in this case, “CGTA” is a longest common substring of “ACGTACGT” and “AACCGTATA”.

Note that the longest common substring is not necessarily unique; for a simple example, “AA” and “CC” are both longest common substrings of “AACC” and “CCAA”.

Given

A collection of k (k ≤ 100) DNA strings of length at most 1 kbp each in FASTA format.

Return

A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)


Solution

For this problem, we will be using the python library Bio, specifically SeqIO function:

from Bio import SeqIO                      
sequences = []                             
handle = open('rosalind_lcsm.txt', 'r')
for record in SeqIO.parse(handle, 'fasta'):
    sequence = []                          
    seq = ''                               
    for nt in record.seq:                  
        seq += nt                          
    sequences.append(seq)                  
handle.close()

Search through the haystack and return the output:

srt_seq = sorted(sequences, key=len)     
short_seq = srt_seq[0]                   
comp_seq = srt_seq[1:]                   
motif = ''                               
for i in range(len(short_seq)):          
    for j in range(i, len(short_seq)):   
        m = short_seq[i:j + 1]           
        found = False                    
        for sequ in comp_seq:            
            if m in sequ:                
                found = True             
            else:                        
                found = False            
                break                    
        if found and len(m) > len(motif):
            motif = m                    
print(motif)
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.