Sequence analysis

From Wikipedia, the free encyclopedia

In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. Methodologies used include sequence alignment, searches against biological databases, and others.[1] Since the development of methods for high-throughput production of gene and protein sequences, the rate at which new sequences are added to the databases has increased exponentially. Such a collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of the organism from which a new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by studying the similarities between the compared sequences. Many tools and techniques now exist that perform these sequence comparisons (sequence alignment) and analyze the resulting alignment to understand its biology.

Sequence analysis in molecular biology includes a very wide range of relevant topics:

  1. The comparison of sequences in order to find similarity, often to infer if they are related (homologous)
  2. Identification of intrinsic features of the sequence such as active sites, post-translational modification sites, gene structures, reading frames, distributions of introns and exons, and regulatory elements
  3. Identification of sequence differences and variations such as point mutations and single nucleotide polymorphisms (SNPs), for use as genetic markers
  4. Revealing the evolution and genetic diversity of sequences and organisms
  5. Identification of molecular structure from sequence alone

In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing".

In marketing, sequence analysis is often used in analytical customer relationship management applications, such as NPTB models (Next Product to Buy).

In sociology, sequence methods are increasingly used to study life-course and career trajectories, patterns of organizational and national development, conversation and interaction structure, and the problem of work/family synchrony. This body of research has given rise to the emerging subfield of social sequence analysis.


Since the very first sequences of the insulin protein were characterised by Fred Sanger in 1951, biologists have been trying to use this knowledge to understand the function of molecules.[2][3] Sanger also contributed to DNA sequencing: he and his colleagues sequenced the first DNA-based genome.[4] The method used in this study, known as the “Sanger method” or Sanger sequencing, was a milestone in sequencing long-strand molecules such as DNA, and it was eventually used in the Human Genome Project.[5] According to Michael Levitt, sequence analysis was born in the period from 1969 to 1977.[6] In 1969, the analysis of sequences of transfer RNAs was used to infer residue interactions from correlated changes in the nucleotide sequences, giving rise to a model of the tRNA secondary structure.[7] In 1970, Saul B. Needleman and Christian D. Wunsch published the first computer algorithm for aligning two sequences.[8] Over this time, methods for obtaining nucleotide sequences improved greatly, leading to the publication of the first complete genome of a bacteriophage in 1977.[9] Robert Holley and his team at Cornell University are believed to have been the first to sequence an RNA molecule.[10]

Sequence alignment

Example multiple sequence alignment

There are millions of protein and nucleotide sequences known. These sequences fall into many groups of related sequences known as protein families or gene families. Relationships between these sequences are usually discovered by aligning them together and assigning this alignment a score. There are two main types of sequence alignment: pairwise sequence alignment compares only two sequences at a time, whereas multiple sequence alignment compares many sequences at once. Two important algorithms for aligning pairs of sequences are the Needleman–Wunsch algorithm and the Smith–Waterman algorithm. Popular tools for sequence alignment include:

A common use for pairwise sequence alignment is to take a sequence of interest and compare it to all known sequences in a database to identify homologous sequences. In general, the matches in the database are ordered so that the most closely related sequences appear first, followed by sequences with diminishing similarity. These matches are usually reported with a measure of statistical significance such as an expectation value (E-value).
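As an illustration of pairwise global alignment, the following is a minimal sketch in the spirit of the Needleman–Wunsch dynamic-programming approach. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for the example; real tools typically use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    The scoring scheme is a simple illustrative assumption.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, aligned_a, aligned_b = needleman_wunsch("GATTACA", "GATACA")
print(score)  # 5 under the scoring assumptions above
print(aligned_a)
print(aligned_b)
```

A database search repeats a comparison of this kind (usually with faster heuristics) against every stored sequence and sorts the hits by score.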

Profile comparison

In 1987, Michael Gribskov, Andrew McLachlan, and David Eisenberg introduced the method of profile comparison for identifying distant similarities between proteins.[11] Rather than using a single sequence, profile methods use a multiple sequence alignment to encode a profile which contains information about the conservation level of each residue. These profiles can then be used to search collections of sequences to find sequences that are related. Profiles are also known as Position Specific Scoring Matrices (PSSMs). In 1993, a probabilistic interpretation of profiles was introduced by David Haussler and colleagues using hidden Markov models.[12][13] These models have become known as profile-HMMs.
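The idea of a profile can be sketched as follows. This is a simplified illustration, not the method of the cited papers: it assumes an ungapped alignment, a uniform background distribution, a fixed pseudocount, and a DNA alphabet for brevity (a protein alphabet could be passed instead). The function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds position-specific scoring matrix.

    Assumes equal-length, ungapped aligned sequences and a uniform
    background frequency (both are simplifying assumptions).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # Log-odds of the smoothed column frequency against the background.
        pssm.append({
            sym: math.log2(((counts[sym] + pseudocount) / total) / background)
            for sym in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores of a window as long as the profile."""
    return sum(column[sym] for column, sym in zip(pssm, window))


def best_match(pssm, sequence):
    """Return (score, start) of the best-scoring window in sequence."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


profile = build_pssm(["TATAAT", "TATAAT", "TACAAT", "TATACT"])
best_score, start = best_match(profile, "GGCGTATAATGCC")
print(start)  # 4: the window matching the consensus of the alignment
```

Conserved columns contribute large positive scores for the conserved residue, so a sequence can score well against the profile even if it is not closely similar to any single member of the alignment.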

In recent years, methods have been developed that allow profiles to be compared directly with each other. These are known as profile-profile comparison methods.[14]

Sequence assembly

Sequence assembly refers to the reconstruction of a DNA sequence by aligning and merging small DNA fragments. It is an integral part of modern DNA sequencing. Since presently available DNA sequencing technologies are ill-suited for reading long sequences, large pieces of DNA (such as genomes) are often sequenced by (1) cutting the DNA into small pieces, (2) reading the small fragments, and (3) reconstituting the original DNA by merging the information from the various fragments.
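Step (3) can be illustrated with a toy greedy overlap-merge procedure. This is only a sketch under strong assumptions: reads are error-free, come from a single strand, and the minimum overlap length (`min_len`) is an arbitrary illustrative parameter. Production assemblers use more robust graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

Greedy merging can produce incorrect assemblies when the sequence contains repeats, which is one reason real genome assembly is considerably harder than this example suggests.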

Sequencing multiple species at one time has recently become a major research target. Metagenomics is the study of microbial communities obtained directly from the environment. Unlike microorganisms cultured in the laboratory, a wild sample usually contains dozens, sometimes even thousands, of types of microorganisms from their original habitats.[15] Recovering the original genomes is very challenging. Recent projects include:

Global Ocean survey (GOS)

Human Microbiome Project (HMP)

Earth Microbiome Project (EMP)

Gene prediction

Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. In general, the prediction of bacterial genes is significantly simpler and more accurate than the prediction of genes in eukaryotic species, which usually have complex intron/exon patterns. Identifying genes in long sequences remains a problem, especially when the number of genes is unknown. Hidden Markov models can be part of the solution.[16] Machine learning has played a significant role in predicting the sequence specificities of transcription factors.[17] Traditional sequence analysis focused on the statistical parameters of the nucleotide sequence itself (the most common programs used are listed in Table 4.1). Another approach is to identify homologous sequences based on other known gene sequences (for tools, see Table 4.3).[18] Both of these methods focus on sequence. However, the shape features of molecules such as DNA and proteins have also been studied and proposed to have an influence on the behavior of these molecules that is equivalent to, if not greater than, that of the sequence.[19]
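The simplest sequence-based signal for a protein-coding gene, particularly in bacteria where introns are not a concern, is an open reading frame (ORF). The sketch below scans the three forward reading frames for ATG...stop stretches. It is a deliberately naive illustration, not a gene finder: it ignores the reverse strand, alternative start codons, and all statistical modelling, and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon; end is exclusive and includes the stop codon. min_codons
    counts the codons before the stop (an illustrative threshold).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # close the candidate and keep scanning
    return orfs


sequence = "CCATGAAACCCGGGTAACC"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])  # 2 ATGAAACCCGGGTAA
```

In eukaryotic genomes, coding regions are interrupted by introns, so a scan of this kind is not sufficient and model-based approaches such as hidden Markov models become more useful.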

Protein structure prediction

Target protein structure (3dsm, shown in ribbons), with Cα backbones (in gray) of 354 predicted models for it submitted in the CASP8 structure-prediction experiment.

The 3D structures of molecules are of great importance to their functions in nature. Since structural prediction of large molecules at an atomic level is a largely intractable problem, some biologists introduced ways to predict 3D structure at the primary sequence level. These include biochemical or statistical analysis of amino acid residues in local regions and structural inference from homologs (or other potentially related proteins) with known 3D structures.

There have been a large number of diverse approaches to solving the structure prediction problem. To determine which methods were most effective, a structure prediction competition called CASP (Critical Assessment of Structure Prediction) was founded.[20]


The tasks that lie in the space of sequence analysis are often non-trivial to resolve and require the use of relatively complex approaches. Of the many types of methods used in practice, the most popular include:

See also


  1. Durbin, Richard M.; Eddy, Sean R.; Krogh, Anders; Mitchison, Graeme (1998), Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (1st ed.), Cambridge: Cambridge University Press, ISBN 0-521-62971-3, doi:10.2277/0521629713
  2. Sanger F; Tuppy H (September 1951). "The amino-acid sequence in the phenylalanyl chain of insulin. I. The identification of lower peptides from partial hydrolysates". Biochem. J. 49 (4): 463–81. PMC 1197535. PMID 14886310. doi:10.1042/bj0490463.
  3. Sanger F; Tuppy H (September 1951). "The amino-acid sequence in the phenylalanyl chain of insulin. 2. The investigation of peptides from enzymic hydrolysates". Biochem. J. 49 (4): 481–90. PMC 1197536. PMID 14886311. doi:10.1042/bj0490481.
  4. Sanger, F; Nicklen, S; Coulson, AR (December 1977). "DNA sequencing with chain-terminating inhibitors". Proc Natl Acad Sci U S A. 74 (12): 5463–5467. PMC 431765. PMID 271968. doi:10.1073/pnas.74.12.5463.
  5. Sanger, F; Air, GM; Barrell, BG; Brown, NL; Coulson, AR; Fiddes, CA; Hutchison, CA; Slocombe, PM; Smith, M (24 February 1977). "Nucleotide sequence of bacteriophage phi X174 DNA". Nature. 265 (5596): 687–695. PMID 870828. doi:10.1038/265687a0.
  6. Levitt M (May 2001). "The birth of computational structural biology". Nature Structural & Molecular Biology. 8 (5): 392–3. PMID 11323711. doi:10.1038/87545.
  7. Levitt M (November 1969). "Detailed molecular model for transfer ribonucleic acid". Nature. 224 (5221): 759–63. Bibcode:1969Natur.224..759L. PMID 5361649. doi:10.1038/224759a0.
  8. Needleman SB; Wunsch CD (March 1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". J. Mol. Biol. 48 (3): 443–53. PMID 5420325. doi:10.1016/0022-2836(70)90057-4.
  9. Sanger F, Air GM, Barrell BG, et al. (February 1977). "Nucleotide sequence of bacteriophage phi X174 DNA". Nature. 265 (5596): 687–95. Bibcode:1977Natur.265..687S. PMID 870828. doi:10.1038/265687a0.
  10. Holley, RW; Apgar, J; Everett, GA; Madison, JT; Marquisee, M; Merrill, SH; Penswick, JR; Zamir, A (19 March 1965). "Structure of a Ribonucleic Acid". Science. 147 (3664): 1462–1465. PMID 14263761. doi:10.1126/science.147.3664.1462.
  11. Gribskov M; McLachlan AD; Eisenberg D (July 1987). "Profile analysis: detection of distantly related proteins". Proc. Natl. Acad. Sci. U.S.A. 84 (13): 4355–8. Bibcode:1987PNAS...84.4355G. PMC 305087. PMID 3474607. doi:10.1073/pnas.84.13.4355.
  12. Brown M; Hughey R; Krogh A; Mian IS; Sjölander K; Haussler D (1993). "Using Dirichlet mixture priors to derive hidden Markov models for protein families". Proc Int Conf Intell Syst Mol Biol. 1: 47–55. PMID 7584370.
  13. Krogh A; Brown M; Mian IS; Sjölander K; Haussler D (February 1994). "Hidden Markov models in computational biology. Applications to protein modeling". J. Mol. Biol. 235 (5): 1501–31. PMID 8107089. doi:10.1006/jmbi.1994.1104.
  14. Ye X; Wang G; Altschul SF (December 2011). "An assessment of substitution scores for protein profile-profile comparison". Bioinformatics. 27 (24): 3356–63. PMC 3232366. PMID 21998158. doi:10.1093/bioinformatics/btr565.
  15. Wooley, JC; Godzik, A; Friedberg, I (26 February 2010). "A primer on metagenomics". PLoS Comput Biol. 6 (2): e1000667. PMC 2829047. PMID 20195499. doi:10.1371/journal.pcbi.1000667.
  16. Stanke, M; Waack, S (October 2003). "Gene prediction with a hidden Markov model and a new intron submodel". Bioinformatics. 19 (Suppl 2): ii215–25. PMID 14534192. doi:10.1093/bioinformatics/btg1080.
  17. Alipanahi, B; Delong, A; Weirauch, MT; Frey, BJ (August 2015). "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning". Nat Biotechnol. 33 (8): 831–8. PMID 26213851. doi:10.1038/nbt.3300.
  18. Wooley, JC; Godzik, A; Friedberg, I (26 February 2010). "A primer on metagenomics". PLoS Comput Biol. 6 (2): e1000667. PMC 2829047. PMID 20195499. doi:10.1371/journal.pcbi.1000667. Retrieved 15 November 2016.
  19. Abe, N; Dror, I; Yang, L; Slattery, M; Zhou, T; Bussemaker, HJ; Rohs, R; Mann, RS (9 April 2015). "Deconvolving the recognition of DNA shape from sequence". Cell. 161 (2): 307–18. PMC 4422406. PMID 25843630. doi:10.1016/j.cell.2015.02.008.
  20. Moult J; Hubbard T; Bryant SH; Fidelis K; Pedersen JT (1997). "Critical assessment of methods of protein structure prediction (CASP): round II". Proteins. Suppl 1: 2–6. PMID 9485489. doi:10.1002/(SICI)1097-0134(1997)1+<2::AID-PROT2>3.0.CO;2-T.