Revision as of 13:41, 27 September 2013

Alignment-free sequence analysis

The need to analyse the many types of data produced by biological research gave rise to the field of bioinformatics^[1]. Molecular sequence and structure data for DNA, RNA and proteins, gene expression profiles or microarray data, and metabolic pathway data are among the major types of data analysed in bioinformatics. Of these, sequence data is growing at an exponential rate owing to the advent of next-generation sequencing technologies. Since the origin of bioinformatics, sequence analysis has remained a major area of research, with a wide range of applications in database searching, genome annotation, comparative genomics, molecular phylogeny and gene prediction. The pioneering approaches to sequence analysis were based on sequence alignment, whether global or local, pairwise or multiple^[2] ^[3]. Alignment-based approaches generally give excellent results when the sequences under study are closely related and can be reliably aligned, but when the sequences are divergent a reliable alignment cannot be obtained, which limits the applicability of sequence alignment. A further limitation of alignment-based approaches is their computational complexity: they are time-consuming and therefore of limited use with large-scale sequence data^[4]. The volume of data generated by next-generation sequencing poses challenges for alignment-based algorithms in assembly, annotation and comparative studies. Alignment-free sequence analysis approaches therefore provide an attractive alternative to alignment-based approaches.

Alignment-free methods

Alignment-free methods can broadly be classified into four categories: a) methods based on k-mer/word frequency, b) methods based on substrings, c) methods based on information theory and d) methods based on graphical representation.

Methods based on k-mer/word frequency

Popular methods based on k-mer/word frequencies include the feature frequency profile (FFP)^[5] ^[6], composition vector (CV)^[7] ^[8], return time distribution (RTD)^[9] ^[10] and frequency chaos game representation (FCGR)^[11].

Feature frequency profile (FFP)

The FFP-based method starts by counting each possible k-mer in every sequence (the number of possible k-mers is 4^k for a nucleotide sequence and 20^k for a protein sequence). Each k-mer count in a sequence is then normalized by dividing it by the total count of all k-mers in that sequence, which converts each sequence into its feature frequency profile. The pairwise distance between two sequences is then calculated as the Jensen-Shannon (JS) divergence between their respective FFPs.
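The steps described above can be illustrated with a minimal Python sketch. It is not the published FFP implementation: the function names `ffp` and `js_divergence` are hypothetical, k-mers are counted with overlapping windows, profiles are stored sparsely (k-mers that never occur are treated as frequency 0), and the base-2 logarithm is an assumed choice that bounds the divergence between 0 and 1.

```python
from collections import Counter
from math import log2


def ffp(sequence, k):
    """Return a sparse feature frequency profile: k-mer -> relative frequency."""
    if k < 1 or len(sequence) < k:
        raise ValueError("sequence must contain at least one k-mer")
    # Count every overlapping k-mer in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    # Normalize each count by the total number of k-mers in the sequence.
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi = p.get(kmer, 0.0)
        qi = q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture of the two profiles
        # Terms with zero frequency contribute nothing (0 * log 0 := 0).
        if pi > 0:
            jsd += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            jsd += 0.5 * qi * log2(qi / mi)
    return jsd


if __name__ == "__main__":
    # Toy nucleotide sequences, chosen only for illustration.
    a = ffp("ACGTACGTAC", 3)
    b = ffp("ACGTTTGTAC", 3)
    print(js_divergence(a, b))
```

Storing only the observed k-mers avoids allocating all 4^k (or 20^k) entries, which matters as k grows; the union of keys in `js_divergence` ensures that k-mers present in only one sequence still contribute to the distance.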

Methods based on substrings

Methods based on information theory

Methods based on graphical representation

References

[1] Rothberg, J (2012 Sep). "Bioinformatics. Introduction". The Yale journal of biology and medicine. 85 (3): 305–8. PMC 3447194. PMID 23189382.
[2] Batzoglou, S (2005 Mar). "The many faces of sequence alignment". Briefings in bioinformatics. 6 (1): 6–22. doi:10.1093/bib/6.1.6. PMID 15826353.
[3] Mullan, L (2006 Mar). "Pairwise sequence alignment--it's all about us!". Briefings in bioinformatics. 7 (1): 113–5. doi:10.1093/bib/bbk008. PMID 16761368.
[4] Kemena, C (2009 Oct 1). "Upcoming challenges for multiple sequence alignment methods in the high-throughput era". Bioinformatics (Oxford, England). 25 (19): 2455–65. doi:10.1093/bioinformatics/btp452. PMC 2752613. PMID 19648142.
[5] Sims, GE (2009 Oct 6). "Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions". Proceedings of the National Academy of Sciences of the United States of America. 106 (40): 17077–82. PMID 19805074.
[6] Sims, GE (2011 May 17). "Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs)". Proceedings of the National Academy of Sciences of the United States of America. 108 (20): 8329–34. PMID 21536867.
[7] Gao, L (2007 Mar 15). "Whole genome molecular phylogeny of large dsDNA viruses using composition vector method". BMC evolutionary biology. 7: 41. PMID 17359548.
[8] Wang, H (2009 Aug 10). "A fungal phylogeny based on 82 complete genomes using the composition vector method". BMC evolutionary biology. 9: 195. PMID 19664262.
[9] Kolekar, P (2012 Nov). "Alignment-free distance measure based on return time distribution for sequence analysis: applications to clustering, molecular phylogeny and subtyping". Molecular phylogenetics and evolution. 65 (2): 510–22. PMID 22820020.
[10] Kolekar, PS (2011 Nov 30). "Genotyping of Mumps viruses based on SH gene: Development of a server using alignment-free and alignment-based methods". Immunome research. 7 (3): 1–7. PMID 22126822.
[11] Hatje, K (2012). "A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method". Frontiers in plant science. 3: 192. PMID 22952468.
