Enumerating k-mers Lexicographically

Organising Strings

When cataloguing a collection of genetic strings, we should have an established system by which to organize them. The standard method is to organize strings as they would appear in a dictionary, so that “APPLE” precedes “APRON”, which in turn comes before “ARMOR”.

Problem

Assume that an alphabet 𝒜 has a predetermined order; that is, we write the alphabet as a permutation 𝒜 = (a1, a2, …, ak), where a1 < a2 < ⋯ < ak. For instance, the English alphabet is organized as (A, B, …, Z).

Given two strings s and t having the same length n, we say that s precedes t in the lexicographic order (and write s < Lex_t) if the first symbol s[j] that doesn’t match t[j] satisfies sj < tj in 𝒜.

Given

A collection of at most 10 symbols defining an ordered alphabet, and a positive integer n (n ≤ 10).

Return

All strings of length n that can be formed from the alphabet, ordered lexicographically (use the standard order of symbols in the English alphabet).


Solution

First, let’s import the tools and read the FASTA file:

import itertools

with open("rosalind_lexf.txt") as f:
    data = f.read().split()
    letters = data[:-1]
    n = int(data[-1])

Then, we find the all possible strings of given length n:

perm = itertools.product(letters, repeat = n)
output = []
for i, j in enumerate(list(perm)):
    permutation = ''
    for item in j:
        permutation += str(item)
    output.append(permutation)

We need to sort the strings in the list alphabetically:

output.sort()

Finally, we print the result as in rosalind’s desired format:

for item in output:
    print(item, end="\n")
Important note:

This problem is taken from rosalind.info. Please visit ROSALIND to find out more about Bioinformatics problems. You may also clink onto links for the definitions of each terminology.