k-mer

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The term k-mer typically refers to all the possible substrings of length k that are contained in a string. In computational genomics, k-mers refer to all the possible subsequences (of length k) from a read obtained through DNA Sequencing. The amount of k-mers possible given a string of length L is , whilst the number of possible k-mers given n possibilities (4 in the case of DNA e.g. ACTG) is . K-mers are typically used during sequence assembly,[1] but can also be used in sequence alignment. In the context of the human genome, k-mers of various lengths have been used to explain variability in mutation rates.[2][3]

Sequence Assembly[edit]

Overview[edit]

In sequence assembly, k-mers are typically used during the construction of De Bruijn graphs. In order to create a De Bruijn Graph, the strings stored in each edge with length, , must overlap another string in another edge by in order to create a vertex. Reads generated from next-generation sequencing will typically have different read lengths being generated. For example, reads by Illumina’s sequencing technology capture reads of 100-mers. However, the problem with the sequencing is that only small fractions out of all the possible 100-mers that are present in the genome are actually generated. This is due to read errors, but more importantly, just simple coverage holes that occur during sequencing. The problem is though, that these small fractions of the possible k-mers violates the key assumption of de Bruijn graphs such that all the k-mer reads must overlap its adjoining k-mer in the genome by (which can’t occur when all the possible k-mers aren’t present). The solution to this problem is to break these k-mer sized reads into smaller k-mers, such that the resulting smaller k-mers will represent all the possible k-mers of that smaller size that are present in the genome.[1] Furthermore, splitting the k-mers into smaller sizes also helps alleviate the problem of different initial read lengths. An example of the solution of splitting the reads into smaller k-mers is shown in figure 1. In this example the 5 reads do not account for all the possible 7-mers of the genome, and as such, a de Bruijn graph cannot be created. But when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a de Bruijn graph.

This figure shows the process of splitting reads into smaller k-mers (4-mers in this case) in order to be able to be used in a de Bruijn graph. (A) Shows the initial segment of DNA being sequenced. (B) Shows the reads that were made output from sequencing and also shows how they align. The problem with this alignment though is that they overlap by k-2 not k-1 (which is needed in de Bruijn graphs). (C) Shows the reads being split into smaller 4-mers. (D) Discards the repeated 4-mers and then shows the alignment of them. Note that these k-mers overlap by k-1 and can then be used in a de Bruijn graph.

Choice of k-mer[edit]

The choice of the k-mer size has many different effects on the sequence assembly. These effects vary greatly between lower sized and larger sized k-mers. Therefore, an understanding of the different k-mer sizes must be known in order to choose a suitable size that balances the effects. The effects of the sizes are outlined below.

Lower k-mer sizes[edit]

  • A lower k-mer size will decrease the amount of edges stored in the graph, and as such, will help decrease the amount of space required to store DNA sequence.
  • Having smaller sizes will increase the chance for all the k-mers to overlap, and as such, have the required subsequences in order to construct the de Bruijn graph.[4]
  • However, by having smaller sized k-mers, you also risk having many vertices in the graph leading into a single k-mer. Therefore, this will make the reconstruction of the genome more difficult as there is a higher level of path ambiguities due to the larger amount of vertices that will need to be traversed.
  • Information is lost as the k-mers become smaller.
    • E.g. The possibility of AGTCGTAGATGCTG is lower than ACGT, and as such, holds a greater amount of information (refer to entropy (information theory) for more information).
  • Smaller k-mers also have the problem of not being able to resolve areas in the DNA where small microsatellites or repeats occur. This is because smaller k-mers will tend to sit entirely within the repeat region and is therefore hard to determine the amount of repetition that has actually taken place.
    • E.g. For the subsequence ATGTGTGTGTGTGTACG, the amount of repetitions of TG will be lost if a k-mer size less than 16 is chosen. This is because most of the k-mers will sit in the repeated region and may just be discarded as repeats of the same k-mer instead of referring the amount of repeats.

Higher k-mer sizes[edit]

  • Having larger sized k-mers will increase the amount of edges in the graph, which in turn, will increase the amount of memory needed to store the DNA sequence.
  • By increasing the size of the k-mers, the number of vertices will also decrease. This will help with the construction of the genome as there will be fewer paths to traverse in the graph.[4]
  • Larger k-mers also run a higher risk of not having outward vertices from every k-mer. This is due to larger k-mers increasing the risk that it will not overlap with another k-mer by . Therefore, this can lead to disjoints in the reads, and as such, can lead to a higher amount of smaller contigs.
  • Larger k-mer sizes help alleviate the problem of small repeat regions. This is due to the fact that the k-mer will contain a balance of the repeat region and the adjoining DNA sequences (given it are a large enough size) that can help to resolve the amount of repetition in that particular area.

Applications of k-mer in bioinformatics analysis[edit]

The frequency of a set of k-mers in a specie's genome, in a genomic region, in a class of sequences, can be used as a "signature" of the underlying sequence. Comparing these frequencies are computationally easier than sequence alignment, and is an important method in alignment-free sequence analysis. It can also be used as a first stage analysis before an alignment.

Pseudocode[edit]

Determining the possible k-mers of a read can be done by simply cycling over the string length by one and taking out each substring of length, k. The pseudocode to achieve this is as follows:

procedure k-mer(String, k : length of each k-mer)

     n = length(String)
     
     /* cycle over the length of String till k-mers of length, k, can still be made */
     for i = 1 to  n-k+1 inclusive do
          /* output each k-mer of length k, from i to i+k in String*/
          output String[i:i+k]
     end for

end procedure

Implementations[edit]

A number of implementations in different languages to find all the k-mers in a string are listed below.

APL[edit]

      k,/string  ⍝ returns an array of the k-mers of string
      4,/'ATCGAAGGTCGT'
 ATCG  TCGA  CGAA  GAAG  AAGG  AGGT  GGTC  GTCG  TCGT

OR
      kmers{>⍴,⍵:''  ,/,}  ⍝ handles the case where k is greater than the length of string 
      4 kmers 'ATCGAAGGTCGT'
 ATCG  TCGA  CGAA  GAAG  AAGG  AGGT  GGTC  GTCG  TCGT

Haskell[edit]

kmers :: Int -> [a] -> [[a]]
kmers k xs | k == length (take k xs) = take k xs : kmers k (drop 1 xs)
           | otherwise               = []

-- or

import Data.List

kmers' k xs = take (length x - (n - 1)) . map (take n) . tails $ xs

Python[edit]

def find_kmers(string, k):
    
      kmers = []
      n = len(string)

      for i in range(0, n-k+1):
           kmers.append(string[i:i+k])

      return kmers

R[edit]

find_kmers <- function(string, k){
  n <- nchar(string) - k + 1
  kmers <- substring(string, 1:n, 1:n + k - 1)
  return(kmers)
}

Ruby[edit]

class String
  # Iterate over each k-mer of this string
  def each_kmer k
    return enum_for(:each_kmer, k) unless block_given?
    (0 .. length - k).each { |i|
      yield self[i, k]
    }
  end
end

Java[edit]

private void getKmers(String seq)
{
	int seqLength = seq.length();
	if(seqLength > length)
	{
		for(int i = 0; i < seqLength - length + 1; i++)
		{
			System.out.println(seq.substring(i,length + i));
		}
	} else 
	{
		System.out.println(seq);
	}
}

Perl 6[edit]

my $seq = 'AGCTTTTCATTCTGACTGCAACGGG';
my $n   = $seq.chars - $k + 1;
my @kmers = gather for 0..^$n -> $i {
    take $seq.substr($i, $k);
}

Or:

my $k = 10;
dd 'AGCTTTTCATTCTGACTGCAACGGG'.comb.rotor($k => -1 * ($k - 1)).map(*.join);

C#[edit]

public static List<string> find_kmers(string Text, int k)
{
    var kmers = new List<string>();
    int n = Text.Length;
   
    for (int i = 0; i < n-k+1; i++)
    {                
        kmers.Add(Text.Substring(i, k));               
    }

    return kmers;
 }

Examples[edit]

Here are some examples showing the possible k-mers (given a specified k value) from DNA sequences:

Read:     AGATCGAGTG
3-mers: AGA GAT ATC TCG CGA GAG AGT GTG
Read:     GTAGAGCTGT
5-mers: GTAGA TAGAG AGAGC GAGCT AGCTG GCTGT

References[edit]

  1. ^ a b Compeau, P.; Pevzner, P.; Teslar, G. (2011). "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29: 987–991. doi:10.1038/nbt.2023. 
  2. ^ Samocha, Kaitlin E; Robinson, Elise B; Sanders, Stephan J; Stevens, Christine; Sabo, Aniko; McGrath, Lauren M; Kosmicki, Jack A; Rehnström, Karola; Mallick, Swapan; Kirby, Andrew; Wall, Dennis P; MacArthur, Daniel G; Gabriel, Stacey B; DePristo, Mark; Purcell, Shaun M; Palotie, Aarno; Boerwinkle, Eric; Buxbaum, Joseph D; Cook, Edwin H; Gibbs, Richard A; Schellenberg, Gerard D; Sutcliffe, James S; Devlin, Bernie; Roeder, Kathryn; Neale, Benjamin M; Daly, Mark J (2014). "A framework for the interpretation of de novo mutation in human disease". Nature Genetics. 46 (9): 944–950. ISSN 1061-4036. doi:10.1038/ng.3050. 
  3. ^ Aggarwala, Varun; Voight, Benjamin F (2016). "An expanded sequence context model broadly explains variability in polymorphism levels across the human genome". Nature Genetics. 48 (4): 349–355. ISSN 1061-4036. PMC 4811712Freely accessible. PMID 26878723. doi:10.1038/ng.3511. 
  4. ^ a b Zerbino, Daniel R.; Birney, Ewan (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. PMC 2336801Freely accessible. PMID 18349386. doi:10.1101/gr.074492.107. 
  5. ^ "Rachid Ounit, Steve Wanamaker, Timothy J Close and Stefano Lonardi" (2015). "CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers". BMC Genomics. 16: 236. PMC 4428112Freely accessible. PMID 25879410. doi:10.1186/s12864-015-1419-2. 
  6. ^ Dubinkina, Ischenko, Ulyantsev, Tyakht, Alexeev (2016). "Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis". BMC Bioinformatics. 17: 38. doi:10.1186/s12859-015-0875-7. 
  7. ^ Zhu,Zheng (2014). "Self-organizing approach for meta-genomes". Computational Biology and Chemistry. 53: 118–124. doi:10.1016/j.compbiolchem.2014.08.016. 
  8. ^ Chor,Horn, Goldman, Levy, Massingham (2009). "Genomic DNA k-mer spectra: models and modalities". Genome Biology. 10 (10): R108. doi:10.1186/gb-2009-10-10-r108. 
  9. ^ Meher,Sahu, Rao (2016). "Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier". Gene. 592 (2): 316–324. doi:10.1016/j.gene.2016.07.010. 
  10. ^ Li et al. (2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. PMC 2813482Freely accessible. PMID 20019144. doi:10.1101/gr.097261.109. 
  11. ^ Navarro-Gomez et al. (2015). "Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier". Bioinformatics. 31 (8): 1310–1312. doi:10.1093/bioinformatics/btu825. 
  12. ^ Phillippy,Schatz,Pop (2008). "Genome assembly forensics: finding the elusive mis-assembly". Bioinformatics. 9 (3): R55. doi:10.1186/gb-2008-9-3-r55. 
  13. ^ Price,Jones,Pevzner (2005). "De novo identification of repeat families in large genomes". Bioinformatics. 21(supp 1): i351–i358. doi:10.1093/bioinformatics/bti1018. 
  14. ^ Newburger,Bulyk (2009). "UniPROBE: an online database of protein binding microarray data on protein–DNA interactions". Nucleic Acids Research. 37(supp 1): D77–D82. PMC 2686578Freely accessible. PMID 18842628. doi:10.1093/nar/gkn660. 
  15. ^ S. Burkhardt and J. Kärkkäinen (2002). "Better filtering with gapped q-grams". Fundamenta Informaticae. 56 (1–2): 51–70. doi:10.1007/3-540-45452-7_19. 
  16. ^ Keich et al. (2004). "On spaced seeds for similarity search". Discrete Applied Mathematics. 138 (3): 253–263. doi:10.1016/S0166-218X(03)00382-2. 
  17. ^ Ghandi et al. (2014). "Enhanced regulatory sequence prediction using gapped k-mer features". PLoS Computational Biology. 10 (7): e1003711. Bibcode:2014PLSCB..10E3711G. doi:10.1371/journal.pcbi.1003711. 
  18. ^ Nordstrom et al. (2013). "Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers". Nature Biotechnology. 31 (4): 325–330. doi:10.1038/nbt.2515. 
  19. ^ Chae et al. (2013). "Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes". Nucleic Acids Research. 41 (9): 4783–4791. doi:10.1093/nar/gkt144. 
  20. ^ Mohamed Hashim,Abdullah (2015). "Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter". Journal ofTheoretical Biology. 387: 88–100. doi:10.1016/j.jtbi.2015.09.014. 
  21. ^ Jaron,Moravec,Martinkova (2014). "SigHunt: horizontal gene transfer finder optimized for eukaryotic genomes". Bioinformatics. 30 (8): 1081–1086. PMID 24371153. doi:10.1093/bioinformatics/btt727. 
  22. ^ Delmont,Eren (2016). "Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies". PeerJ. 4: e1839. doi:10.7717/Fpeerj.1839. 
  23. ^ Bemm et al. (2016). "Genome of a tardigrade: Horizontal gene transfer or bacterial contamination?". Proceedings of National Academy of Sciences. 113: E3054-E3056. doi:10.1073/pnas.1525116113. 
  24. ^ Wang,Xu,Liu (2016). "Recombination spot identification Based on gapped k-mers". Scientific Report. 6: 23934. Bibcode:2016NatSR...623934W. doi:10.1038/srep23934. 
  25. ^ Hozza,Vinar,Brejova (2015). How big is that genome? estimating genomesize and coverage from k-mer abundance spectra. SPIRE 2015. 
  26. ^ Lamichhaney et al. (2016). "Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax)". Nature Genetics. 48: 84–88. PMID 26569123. doi:10.1038/ng.3430. 
  27. ^ Patro,Mount,Kingsford (2014). "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms". Nature Biotechnology. 32: 462–464. PMC 4077321Freely accessible. PMID 24752080. doi:10.1038/nbt.2862. 

External links[edit]