Earth Microbiome Project
The Earth Microbiome Project (EMP) is an initiative to collect natural samples and to analyze microbial communities around the globe.
Microbes are highly abundant and diverse, and they play an important role in the ecological system. There are an estimated 1.3 × 10²⁸ archaeal cells, 3.1 × 10²⁸ bacterial cells, and 1 × 10³⁰ virus particles in the ocean. The bacterial diversity, a measure of the number of types of bacteria in a community, is estimated to be about 160 for a mL of ocean water, 6,400–38,000 for a g of soil, and 70 for a mL of sewage works. Yet as of 2010, it was estimated that the total global environmental DNA sequencing effort had produced less than 1 percent of the total DNA found in a liter of seawater or a gram of soil, and the specific interactions between microbes are largely unknown.
The EMP aims to process as many as 200,000 samples in different biomes, generating a complete database of microbes on earth to characterize environments and ecosystems by microbial composition and interaction. Using these data, new ecological and evolutionary theories can be proposed and tested.
The primary goal of EMP is to survey microbial composition in many environments across the planet, across time as well as space, using a standard set of protocols. The development of standardized protocols is vital, because variations in sample extraction, amplification, sequencing and analysis introduce biases that would invalidate comparisons of microbial community structure.
Another important goal is to determine how reconstruction of microbial communities is affected by analytic biases. The rate of technological advance is rapid, and it is necessary to understand how data using updated protocols will compare with data collected using earlier techniques. Information from this project will be archived in a database to facilitate analysis. Other outputs will include a global atlas of protein function and a catalog of reassembled genomes classified by their taxonomic distributions.
The large amounts of sequence data generated from analyzing diverse microbial communities are a challenge to store, organize and analyze. The problem is exacerbated by the short reads provided by the high-throughput sequencing platform that will be the standard instrument used in the EMP. Improved algorithms, improved analysis tools, huge amounts of computer storage, and access to many thousands of hours of supercomputer time will be necessary.
Another challenge will be the large number of sequencing errors that are expected. Next-generation sequencing technologies provide enormous throughput but lower accuracies than older sequencing methods. When sequencing a single genome, the intrinsic lower accuracy of these methods is far more than compensated for by the ability to cover the entire genome multiple times in opposite directions from multiple start points, but this capability provides no improvement in accuracy when sequencing a diverse mixture of genomes. The question will be, how can sequencing errors be distinguished from actual diversity in the collected microbial samples?
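The effect of repeated coverage described above can be illustrated with a minimal sketch. It assumes the reads have already been aligned to the same coordinates, and the function name `consensus` is hypothetical rather than part of any EMP pipeline. A simple majority vote per position removes isolated errors when all reads come from one genome:

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over pre-aligned, equal-length reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    result = []
    for column in zip(*aligned_reads):
        # Ignore gap characters when voting on a column.
        bases = [base for base in column if base != "-"]
        result.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(result)

# Five reads of the same region; two of them carry a single error each.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACGTTC"]
print(consensus(reads))  # ACGTAC
```

In a mixed community the same vote would also erase a genuine minority variant, which is why this strategy does not carry over directly to metagenomic samples.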
Despite the issuance of standard protocols, systematic biases from lab to lab are expected. The need to amplify DNA from samples with low biomass will introduce additional distortions of the data. Assembly of genomes of even the dominant organisms in a diverse sample of organisms requires gigabytes of sequence data.
The EMP must avoid a problem that has become prevalent in the public sequence databases. With the advancement in high-throughput sequencing technologies, many sequences are entering public databases with no experimentally determined function, having been annotated only on the basis of observed homology with a known sequence. Ideally, an experimentally characterized sequence is used to annotate an unknown sequence. In practice, however, that newly annotated sequence is then used to annotate a second unknown sequence, and so on, so that annotations are propagated along a chain without experimental support. Sequence homology is only a modestly reliable predictor of function.
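To make the risk of chained annotation concrete, the following is a rough sketch under a strong simplifying assumption: each homology-based transfer is correct with the same probability, independently of the others. The function name and the 0.9 figure are hypothetical and chosen only for illustration; they are not measured values.

```python
def chained_annotation_reliability(per_transfer_accuracy, transfers):
    """Probability that an annotation is still correct after a chain of transfers.

    Assumes every transfer succeeds independently with the same probability.
    """
    if not 0.0 <= per_transfer_accuracy <= 1.0:
        raise ValueError("per_transfer_accuracy must be between 0 and 1")
    if transfers < 0:
        raise ValueError("transfers must be non-negative")
    return per_transfer_accuracy ** transfers

# Illustrative value only: 0.9 is an assumed accuracy, not an empirical one.
for depth in (1, 3, 5):
    print(depth, round(chained_annotation_reliability(0.9, depth), 3))
```

Under these assumptions the reliability falls geometrically with the length of the chain, which is the reason annotations are best traced back to an experimentally characterized sequence.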
Samples will be collected using appropriate methods from various environments including deep ocean, fresh water lakes, desert sand, and soil. Standardized collection protocols will be used when possible, so that the results are comparable. Microbes from natural samples cannot always be cultured. Because of this, metagenomic methods will be employed to sequence all the DNA or RNA in a sample in a culture-independent fashion.
The wet lab usually needs to perform a series of procedures to select and purify the microbial portion of the samples. The purification process may be very different according to the type of sample. DNA will be extracted from soil particles, or microbes will be concentrated using a series of filtration techniques. In addition, various amplification techniques may be used to increase DNA yield. For example, non-PCR-based multiple displacement amplification is preferred by some researchers. DNA extraction, the use of primers, and PCR all need to follow carefully standardized protocols in order to avoid bias.
Depending on the biological question, researchers can choose to sequence a metagenomic sample using two main approaches. If the biological question to be resolved is what types of organisms are present and in what abundance, the preferred approach would be to target and amplify a specific gene that is highly conserved among the species of interest. The 16S ribosomal RNA gene for bacteria and the 18S ribosomal RNA gene for protists are often used as target genes for this purpose. The advantage of targeting a specific gene is that the gene can be amplified and sequenced at a very high coverage. This approach is called "deep sequencing", and it allows rare species to be identified in a sample. However, this approach will not enable assembly of any whole genomes, nor will it provide information on how organisms may interact with each other. The second approach is called shotgun metagenomics, in which all the DNA in the sample is sheared and the random fragments sequenced. In principle, this approach allows for the assembly of whole microbial genomes, and it allows inference of metabolic relationships. However, if most of the microbes in a given environment are uncharacterised, de novo assembly will be computationally expensive.
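The link between sequencing depth and the detection of rare species can be sketched with a simple probability model. This is an illustration only: it assumes that each read is drawn independently and that a species with relative abundance f contributes a fraction f of the reads, which ignores amplification bias. The function name is hypothetical.

```python
def detection_probability(relative_abundance, reads):
    """Chance of seeing at least one read from a species, P = 1 - (1 - f)^N."""
    if not 0.0 <= relative_abundance <= 1.0:
        raise ValueError("relative_abundance must be between 0 and 1")
    if reads < 0:
        raise ValueError("reads must be non-negative")
    return 1.0 - (1.0 - relative_abundance) ** reads

# A species making up 0.1% of the community, at increasing depth.
for depth in (100, 1_000, 10_000):
    print(depth, round(detection_probability(0.001, depth), 3))
```

Under this model, a species at 0.1% abundance is likely to be missed at 100 reads but is almost certain to appear at 10,000 reads, which is the sense in which deep coverage of a single marker gene reveals rare members of a community.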
Data analysis usually includes the following steps: (1) data clean-up, a preprocessing step in which reads with low quality scores and any sequences containing "N" or other ambiguous nucleotides are removed; and (2) assignment of taxonomy to the sequences, which is usually done using tools such as BLAST or RDP. Very often, novel sequences are discovered which cannot be mapped to existing taxonomy. In this case, a phylogenetic tree is created with the novel sequences and a pool of closely related known sequences. One can then derive the taxonomy of the novel sequences based on the phylogenetic tree.
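The clean-up step can be sketched as follows. This is a minimal example under stated assumptions: reads are given as (identifier, sequence, quality string) tuples, quality strings use the Phred+33 encoding common in FASTQ files, and the mean-quality threshold of 20 is an arbitrary illustrative default rather than an EMP-specified value. The function names are hypothetical.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)

def clean_reads(records, min_mean_quality=20):
    """Keep reads that have only A/C/G/T bases and a sufficient mean quality."""
    kept = []
    for read_id, sequence, quality in records:
        # Drop reads containing "N" or any other ambiguous nucleotide code.
        if not sequence or set(sequence.upper()) - set("ACGT"):
            continue
        # Drop reads whose average quality is below the threshold.
        if mean_phred(quality) < min_mean_quality:
            continue
        kept.append((read_id, sequence, quality))
    return kept

records = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # high quality, unambiguous: kept
    ("read2", "ACGTNCGT", "IIIIIIII"),  # contains N: removed
    ("read3", "ACGTACGT", "!!!!!!!!"),  # very low quality: removed
]
print([read_id for read_id, _, _ in clean_reads(records)])  # ['read1']
```

Real pipelines often trim low-quality ends instead of discarding whole reads; the all-or-nothing filter here is chosen only for brevity.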
Depending on the sequencing technology and the underlying biological question, additional methods may be employed. For example, if the sequenced reads are too short to infer any useful information, an assembly will be required. An assembly can also be used to construct whole genomes, which will provide useful information on the species. Furthermore, if the metabolic relationships within a microbial metagenome are to be understood, the DNA sequences need to be translated into amino acid sequences using gene prediction tools such as GeneMark or FragGeneScan.
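The translation step itself can be illustrated with a short sketch based on the standard genetic code. It is not a substitute for gene prediction tools such as GeneMark or FragGeneScan, which also decide where genes start and stop and which reading frame is used; this example simply translates a sequence in all six reading frames. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(sequence, frame=0):
    """Translate DNA from the given frame offset (0, 1 or 2); '*' marks a stop."""
    sequence = sequence.upper()
    codons = (sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def reverse_complement(sequence):
    """Reverse complement of an unambiguous DNA sequence."""
    return sequence.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translations(sequence):
    """Translations of the three forward and three reverse reading frames."""
    reverse = reverse_complement(sequence)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(sequence, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames

print(translate("ATGGCCTAA"))  # MA*
```

Fragment-oriented tools go further by modelling sequencing errors and codon usage, which matters for short, error-prone reads.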
Four key outputs from the EMP have been defined:
- Ultimately, all primary data generated from the Earth Microbiome Project, regardless of their degree of conclusiveness, will be stored in a centralized database called the "Gene Atlas" (GA). The GA will hold sequence data, annotations and environmental metadata. Known as well as unknown sequences, i.e. "Dark Matter", will be included in the hope that, given time, the unknown sequences may eventually be characterized.
- Assembled genomes, annotated using an automated pipeline, will be stored in "Earth Microbiome Assembled Genomes" (EM-AG) in public repositories. These will enable comparative genomic analysis.
- Interactive visualizations of the data will be provided through the "Earth Microbiome Visualization Portal" (EM-VIP), which will allow the relationship between microbial makeup, environmental parameters, and genomic function to be viewed.
- Reconstructed metabolic profiles will be offered through "Earth Microbiome Metabolic Reconstruction" (EMMR).
- Suttle, C. A. (2007). "Marine viruses — major players in the global ecosystem". Nature Reviews Microbiology 5 (10): 801–812. doi:10.1038/nrmicro1750. PMID 17853907.
- Curtis, T. P.; Sloan, W. T.; Scannell, J. W. (2002). "Estimating prokaryotic diversity and its limits". Proceedings of the National Academy of Sciences 99 (16): 10494–10499. doi:10.1073/pnas.142680199. PMC 124953. PMID 12097644.
- Gilbert, J. A.; Meyer, F.; Antonopoulos, D.; Balaji, P.; Brown, C. T.; Brown, C. T.; Desai, N.; Eisen, J. A.; Evers, D.; Field, D.; Feng, W.; Huson, D.; Jansson, J.; Knight, R.; Knight, J.; Kolker, E.; Konstantindis, K.; Kostka, J.; Kyrpides, N.; MacKelprang, R.; McHardy, A.; Quince, C.; Raes, J.; Sczyrba, A.; Shade, A.; Stevens, R. (2010). "Meeting Report: The Terabase Metagenomics Workshop and the Vision of an Earth Microbiome Project". Standards in Genomic Sciences 3 (3): 243–248. doi:10.4056/sigs.1433550. PMC 3035311. PMID 21304727.
- Gilbert, J. A.; O'Dor, R.; King, N.; Vogel, T. M. (2011). "The importance of metagenomic surveys to microbial ecology: Or why Darwin would have been a metagenomic scientist". Microbial Informatics and Experimentation 1: 5. doi:10.1186/2042-5783-1-5.
- Gilbert, J.A.; Meyer, F. (2012). "Modeling the Earth Microbiome". Microbe Magazine 7 (2): 64–69. Retrieved 2012-03-06.
- Jansson, Janet (2011). "Towards "Tera-Terra": Terabase Sequencing of Terrestrial Metagenomes". Microbe Magazine 6 (7): 309–15. Retrieved 2012-03-07.
- Gilbert, J. A.; Dupont, C. L. (2011). "Microbial Metagenomics: Beyond the Genome". Annual Review of Marine Science 3: 347–371. doi:10.1146/annurev-marine-120709-142811. PMID 21329209.
- "Earth Microbiome Project / Standard Protocols". Retrieved 2012-03-07.
- "BLAST: Basic Local Alignment Search Tool".
- "Ribosomal Database Project". Retrieved 2012-03-06.
- Meyer, F.; Paarmann, D.; d'Souza, M.; Olson, R.; Glass, E. M.; Kubal, M.; Paczian, T.; Rodriguez, A.; Stevens, R.; Wilke, A.; Wilkening, J.; Edwards, R. A. (2008). "The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes". BMC Bioinformatics 9: 386. doi:10.1186/1471-2105-9-386. PMC 2563014. PMID 18803844.
- "GeneMark - Free gene prediction software". Retrieved 2012-03-06.
- "FragGeneScan". Retrieved 2012-03-06.
- "Earth Microbiome Project / Defining the Tasks". Retrieved 2012-03-07.