European Nucleotide Archive: Difference between revisions

From Wikipedia, the free encyclopedia
expand
fixing lead
Line 19: Line 19:
The '''European Nucleotide Archive''' (ENA) is an open access, annotated collection of all publicly available [[nucleic acid sequence|nucleotide sequence]]s.<ref name="pmid20972220">{{cite journal |author=Leinonen R, Akhtar R, Birney E, ''et al.'' |title=The European Nucleotide Archive |journal=Nucleic Acids Res. |volume=39 |issue=Database issue |pages=D28–31 |year=2011 |month=January |pmid=20972220 |pmc=3013801 |doi=10.1093/nar/gkq967 |url=}}</ref> The collection is composed of three main databases: the [[Sequence Read Archive]] (SRA), the Trace Archive and EMBL-bank. The ENA is produced and maintained by the [[European Bioinformatics Institute]] and is a member of the [[International Nucleotide Sequence Database Collaboration]] (INSDC) along with the [[DNA Data Bank of Japan]] and [[GenBank]].


Sequence data from the ENA and its INSDC partners are used in biological and medical research around the world, and the data are accessed millions of times every month. As of June 2012 the EMBL-Bank section of the ENA contains 250 million records including over 444 billion nucleotides, and the data are increasing exponentially with a doubling time of approximately 10 months.<ref name="CochraneAlako2012">{{cite journal|last1=Cochrane|first1=G.|last2=Alako|first2=B.|last3=Amid|first3=C.|last4=Bower|first4=L.|last5=Cerdeno-Tarraga|first5=A.|last6=Cleland|first6=I.|last7=Gibson|first7=R.|last8=Goodgame|first8=N.|last9=Jang|first9=M.|last10=Kay|first10=S.|last11=Leinonen|first11=R.|last12=Lin|first12=X.|last13=Lopez|first13=R.|last14=McWilliam|first14=H.|last15=Oisel|first15=A.|last16=Pakseresht|first16=N.|last17=Pallreddy|first17=S.|last18=Park|first18=Y.|last19=Plaister|first19=S.|last20=Radhakrishnan|first20=R.|last21=Riviere|first21=S.|last22=Rossello|first22=M.|last23=Senf|first23=A.|last24=Silvester|first24=N.|last25=Smirnov|first25=D.|last26=ten Hoopen|first26=P.|last27=Toribio|first27=A.|last28=Vaughan|first28=D.|last29=Zalunin|first29=V.|title=Facing growth in the European Nucleotide Archive|journal=Nucleic Acids Research|volume=41|issue=D1|year=2012|pages=D30–D35|issn=0305-1048|doi=10.1093/nar/gks1175}}</ref>
Sequence data from the ENA and its INSDC partners are used in biological and medical research around the world, and the data are accessed millions of times every month. As of early 2012, the ENA contains complete [[genome]]s of 5,682 organisms and sequence data for almost 700,000.<ref name="CochraneCook2012">{{cite journal|last1=Cochrane|first1=Guy|last2=Cook|first2=Charles E|last3=Birney|first3=Ewan|title=The future of DNA sequence archiving|journal=GigaScience|volume=1|issue=1|year=2012|pages=2|issn=2047-217X|doi=10.1186/2047-217X-1-2}}</ref>
Further, the data are [[Exponential growth|increasing exponentially]] with a doubling time of approximately 10 months.<ref name="CochraneAlako2012">{{cite journal|last1=Cochrane|first1=G.|last2=Alako|first2=B.|last3=Amid|first3=C.|last4=Bower|first4=L.|last5=Cerdeno-Tarraga|first5=A.|last6=Cleland|first6=I.|last7=Gibson|first7=R.|last8=Goodgame|first8=N.|last9=Jang|first9=M.|last10=Kay|first10=S.|last11=Leinonen|first11=R.|last12=Lin|first12=X.|last13=Lopez|first13=R.|last14=McWilliam|first14=H.|last15=Oisel|first15=A.|last16=Pakseresht|first16=N.|last17=Pallreddy|first17=S.|last18=Park|first18=Y.|last19=Plaister|first19=S.|last20=Radhakrishnan|first20=R.|last21=Riviere|first21=S.|last22=Rossello|first22=M.|last23=Senf|first23=A.|last24=Silvester|first24=N.|last25=Smirnov|first25=D.|last26=ten Hoopen|first26=P.|last27=Toribio|first27=A.|last28=Vaughan|first28=D.|last29=Zalunin|first29=V.|title=Facing growth in the European Nucleotide Archive|journal=Nucleic Acids Research|volume=41|issue=D1|year=2012|pages=D30–D35|issn=0305-1048|doi=10.1093/nar/gks1175}}</ref>


==History==

Revision as of 19:32, 6 January 2013

European Nucleotide Archive (ENA)

Content
Description: Nucleotide sequences from all publicly available sources with supporting bibliographic and biological annotation.
Data types captured: Nucleotide sequence
Organisms: all

Contact
Research center: European Bioinformatics Institute
Primary citation: PMID 20972220
Release date: 1982

Access
Data format: XML, ASN.1, EMBL-Bank format
Website: [1]
Download URL: [2]

Tools
Web: BLAST
Standalone: BLAST

Miscellaneous
License: Public domain

The European Nucleotide Archive (ENA) is an open-access, annotated collection of all publicly available nucleotide sequences.[1] The collection is composed of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank.

Sequence data from the ENA and its INSDC partners are used in biological and medical research around the world, and the data are accessed millions of times every month. As of early 2012, the ENA contained complete genomes of 5,682 organisms and sequence data for almost 700,000 organisms.[2] Further, the data are increasing exponentially with a doubling time of approximately 10 months.[3]
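The 10-month doubling time can be turned into a simple projection. The following is a minimal sketch that assumes growth stays purely exponential, which is an idealisation rather than a claim made by the cited sources; the function names `growth_factor` and `project` are illustrative only.

```python
def growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Factor by which the data grow over `months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


def project(initial_size: float, months: float, doubling_months: float = 10.0) -> float:
    """Projected size after `months`, starting from `initial_size` (any unit)."""
    return initial_size * growth_factor(months, doubling_months)


if __name__ == "__main__":
    # With a 10-month doubling time, one year of growth is a factor of 2 ** 1.2.
    print(f"Growth over 12 months: x{growth_factor(12):.2f}")
    print(f"Growth over 5 years:   x{growth_factor(60):.0f}")
```

Under this assumption the archive more than doubles every year and grows 64-fold in five years, which illustrates why storage is treated as a significant challenge.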

History

The EMBL Data Library was established in 1982 at the European Molecular Biology Laboratory (EMBL) Heidelberg and was later renamed the EMBL Nucleotide Sequence Database.

With the advancement of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) began cataloguing sequence reads along with quality information in a database called The Trace Archive.[1] In 2008, the European Bioinformatics Institute (EBI) combined the Trace Archive, EMBL Nucleotide Sequence Database and the new Short Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive.[1]

The EBI, which hosts the ENA, at the Wellcome Trust Genome Campus in Hinxton, UK.

EMBL-Bank

EMBL-Bank is the part of the ENA dedicated to annotated and assembled nucleotide sequence entries.

Release 112, published on 31 May 2012, contained 247,335,689 sequence entries comprising 429,512,389,024 nucleotides, and the database continues to grow rapidly.

Sequence Read Archive

The ENA operates an instance of the Sequence (or Short) Read Archive (SRA). The SRA is an archival repository of sequence reads and analyses which are intended for public release.[4] Currently, the archive accepts both sequence read data and analysis data (e.g. BAM alignments and VCF variation files) generated by next-generation sequencing platforms such as 454, Illumina Genome Analyzer and ABI SOLiD. The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration.[4]

In 2010 the SRA made up approximately 95% of the base pair data available through the ENA,[1] encompassing over 500 billion sequence reads made up of over 60 trillion base pairs.[4] Almost half of these data were deposited in relation to the 1000 Genomes Project.[4]

Storage

The ENA handles large volumes of data which pose a significant storage challenge.[3][5] To cope with these storage requirements, the ENA discards less-valuable sequencing platform data and implements advanced compression strategies.[4] The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements.[3][6]
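The general idea behind reference-based compression can be illustrated with a toy example: instead of storing every base of an aligned read, store where the read aligns on a reference sequence and only the bases that differ from it. The sketch below is not the CRAM format or its toolkit API; the function names, the tuple layout and the restriction to substitutions (no insertions, deletions or quality scores) are simplifying assumptions made for illustration.

```python
from typing import List, Tuple

Mismatch = Tuple[int, str]                 # (offset within the read, observed base)
EncodedRead = Tuple[int, int, List[Mismatch]]  # (position, length, mismatches)


def encode_read(reference: str, read: str, position: int) -> EncodedRead:
    """Encode a read as its alignment position, its length and its mismatches."""
    segment = reference[position:position + len(read)]
    if position < 0 or len(segment) != len(read):
        raise ValueError("read does not fit on the reference at this position")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return position, len(read), mismatches


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    read = "GTATGT"  # differs from reference[2:8] ("GTACGT") at one base
    encoded = encode_read(reference, read, position=2)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

A read that matches the reference exactly is reduced to a position and a length, so the saving grows with how closely the reads resemble the reference.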

References

  1. Leinonen R, Akhtar R, Birney E, et al. (2011). "The European Nucleotide Archive". Nucleic Acids Res. 39 (Database issue): D28–31. doi:10.1093/nar/gkq967. PMC 3013801. PMID 20972220.
  2. Cochrane, Guy; Cook, Charles E; Birney, Ewan (2012). "The future of DNA sequence archiving". GigaScience. 1 (1): 2. doi:10.1186/2047-217X-1-2. ISSN 2047-217X.
  3. Cochrane, G.; Alako, B.; Amid, C.; Bower, L.; Cerdeno-Tarraga, A.; Cleland, I.; Gibson, R.; Goodgame, N.; Jang, M.; Kay, S.; Leinonen, R.; Lin, X.; Lopez, R.; McWilliam, H.; Oisel, A.; Pakseresht, N.; Pallreddy, S.; Park, Y.; Plaister, S.; Radhakrishnan, R.; Riviere, S.; Rossello, M.; Senf, A.; Silvester, N.; Smirnov, D.; ten Hoopen, P.; Toribio, A.; Vaughan, D.; Zalunin, V. (2012). "Facing growth in the European Nucleotide Archive". Nucleic Acids Research. 41 (D1): D30–D35. doi:10.1093/nar/gks1175. ISSN 0305-1048.
  4. Leinonen R, Sugawara H, Shumway M (2011). "The sequence read archive". Nucleic Acids Res. 39 (Database issue): D19–21. doi:10.1093/nar/gkq1019. PMC 3013647. PMID 21062823.
  5. Cochrane, G.; Akhtar, R.; Bonfield, J.; Bower, L.; Demiralp, F.; Faruque, N.; Gibson, R.; Hoad, G.; Hubbard, T.; Hunter, C.; Jang, M.; Juhos, S.; Leinonen, R.; Leonard, S.; Lin, Q.; Lopez, R.; Lorenc, D.; McWilliam, H.; Mukherjee, G.; Plaister, S.; Radhakrishnan, R.; Robinson, S.; Sobhany, S.; Hoopen, P. T.; Vaughan, R.; Zalunin, V.; Birney, E. (2009). "Petabyte-scale innovations at the European Nucleotide Archive". Nucleic Acids Research. 37 (Database): D19–D25. doi:10.1093/nar/gkn765. ISSN 0305-1048.
  6. Hsi-Yang Fritz, M.; Leinonen, R.; Cochrane, G.; Birney, E. (2011). "Efficient storage of high throughput DNA sequencing data using reference-based compression". Genome Research. 21 (5): 734–740. doi:10.1101/gr.114819.110. ISSN 1088-9051.

External links