REVIEW Open Access

Metagenomics - a guide from sampling to data analysis

Torsten Thomas 1*, Jack Gilbert 2,3 and Folker Meyer 2,4

* Correspondence: [email protected]
1 School of Biotechnology and Biomolecular Sciences & Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW 2052, Australia
Full list of author information is available at the end of the article

Thomas et al. Microbial Informatics and Experimentation 2012, 2:3
http://www.microbialinformaticsj.com/content/2/1/3

© 2012 Thomas et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Metagenomics applies a suite of genomic technologies and bioinformatics tools to directly access the genetic content of entire communities of organisms. The field of metagenomics has been responsible for substantial advances in microbial ecology, evolution, and diversity over the past 5 to 10 years, and many research laboratories are actively engaged in it now. With the growing numbers of activities also comes a plethora of methodological knowledge and expertise that should guide future developments in the field. This review summarizes the current opinions in metagenomics, and provides practical guidance and advice on sample processing, sequencing technology, assembly, binning, annotation, experimental design, statistical analysis, data storage, and data sharing. As more metagenomic datasets are generated, the availability of standardized procedures and shared data storage and analysis becomes increasingly important to ensure that output of individual projects can be assessed and compared.

Keywords: sampling, sequencing, assembly, binning, annotation, data storage, data sharing, DNA extraction, microbial ecology, microbial diversity

Introduction

Arguably, one of the most remarkable events in the field of microbial ecology in the past decade has been the advent and development of metagenomics. Metagenomics is defined as the direct genetic analysis of genomes contained within an environmental sample. The field initially started with the cloning of environmental DNA, followed by functional expression screening [1], and was then quickly complemented by direct random shotgun sequencing of environmental DNA [2,3]. These initial projects not only showed proof of principle of the metagenomic approach, but also uncovered an enormous functional gene diversity in the microbial world around us [4].

Metagenomics provides access to the functional gene composition of microbial communities and thus gives a much broader description than phylogenetic surveys, which are often based only on the diversity of one gene, for instance the 16S rRNA gene. On its own, metagenomics gives genetic information on potentially novel biocatalysts or enzymes, genomic linkages between function and phylogeny for uncultured organisms, and evolutionary profiles of community function and structure. It can also be complemented with metatranscriptomic or metaproteomic approaches to describe expressed activities [5,6]. Metagenomics is also a powerful tool for generating novel hypotheses of microbial function; the remarkable discoveries of proteorhodopsin-based photoheterotrophy or ammonia-oxidizing Archaea attest to this fact [7,8].

The rapid and substantial cost reduction in next-generation sequencing has dramatically accelerated the development of sequence-based metagenomics. In fact, the number of metagenome shotgun sequence datasets has exploded in the past few years. In the future, metagenomics will be used in the same manner as 16S rRNA gene fingerprinting methods to describe microbial community profiles. It will therefore become a standard tool for many laboratories and scientists working in the field of microbial ecology.

This review gives an overview of the field of metagenomics, with particular emphasis on the steps involved in a typical sequence-based metagenome project (Figure 1). We describe and discuss sample processing,
sequencing technology, assembly, binning, annotation, experimental design, statistical analysis, and data storage and sharing. Clearly, any kind of metagenomic dataset will benefit from the rich information available from other metagenome projects, and it is hoped that common, yet flexible, standards and interactions among scientists in the field will facilitate this sharing of information. This review article summarizes the current thinking in the field and introduces current practices and key issues that those scientists new to the field need to consider for a successful metagenome project.

Sampling and processing

Sample processing is the first and most crucial step in any metagenomics project. The DNA extracted should be representative of all cells present in the sample, and sufficient amounts of high-quality nucleic acids must be obtained for subsequent library production and sequencing. Processing requires specific protocols for each sample type, and various robust methods for DNA extraction are available (e.g. [3,9,10]). Initiatives are also under way to explore the microbial biodiversity from tens of thousands of ecosystems using a single DNA extraction technology to ensure comparability [11].

If the target community is associated with a host (e.g. an invertebrate or plant), then either fractionation or selective lysis might be suitable to ensure that minimal host DNA is obtained (e.g. [9,12]). This is particularly important when the host genome is large and hence might “overwhelm” the sequences of the microbial community in the subsequent sequencing effort. Physical fractionation is also applicable when only a certain part of the community is the target of analysis, for example, viruses in seawater samples. Here a range of selective filtration or centrifugation steps, or even flow cytometry, can be used to enrich the target fraction [3,13,14]. Fractionation steps should be checked to ensure that sufficient enrichment of the target is achieved and that minimal contamination of non-target material occurs.

Physical separation and isolation of cells from the samples might also be important to maximize DNA yield or avoid coextraction of enzymatic inhibitors (such as humic acids) that might interfere with subsequent processing. This situation is particularly relevant for soil metagenome projects, and substantial work has been done in this field to address the issue ([10] and references therein). Direct lysis of cells in the soil matrix versus indirect lysis (i.e. after separation of cells from the soil) has a quantifiable bias in terms of microbial diversity, DNA yield, and resulting sequence fragment length [10]. The extensive work on soil highlights the need to ensure that extraction procedures are well benchmarked and that multiple methods are compared to ensure representative extraction of DNA.

Figure 1 Flow diagram of a typical metagenome project. Dashed arrows indicate steps that can be omitted.

Certain types of samples (such as biopsies or groundwater) often yield only very small amounts of DNA [15]. Library production for most sequencing technologies requires high nanogram or microgram amounts of DNA (see below), and hence amplification of starting material might be required. Multiple displacement amplification (MDA) using random hexamers and phage phi29 polymerase is one option employed to increase DNA yields. This method can amplify femtograms of DNA to produce micrograms of product and thus has been widely used in single-cell genomics and to a certain extent in metagenomics [16,17]. As with any amplification method, there are potential problems associated with reagent contaminations, chimera formation and sequence bias in the amplification, and their impact will depend on the amount and type of starting material and the required number of amplification rounds to produce sufficient amounts of nucleic acids. These issues can have significant impact on subsequent metagenomic community analysis [15], and so it will be necessary to consider whether amplification is permissible.
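The concern raised in this section, that a large host genome can swamp the microbial signal, can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, not a method from the cited literature: the function name and the example cell counts and genome sizes are hypothetical, and it assumes that all cells lyse with equal efficiency and that reads are sampled in proportion to the amount of DNA.

```python
def microbial_read_fraction(host_cells, host_genome_bp,
                            microbe_cells, microbe_genome_bp):
    """Expected fraction of shotgun reads derived from microbial DNA.

    Simplifying assumptions (hypothetical model): equal lysis efficiency
    for all cells, and reads sampled in proportion to total base pairs.
    """
    host_bp = host_cells * host_genome_bp
    microbe_bp = microbe_cells * microbe_genome_bp
    total_bp = host_bp + microbe_bp
    if total_bp == 0:
        raise ValueError("sample contains no DNA")
    return microbe_bp / total_bp


# Illustrative numbers only: a 500 Mbp host genome, 5 Mbp microbial genomes.
before = microbial_read_fraction(10, 500_000_000, 100, 5_000_000)
after = microbial_read_fraction(1, 500_000_000, 100, 5_000_000)
print(f"microbial fraction before host depletion: {before:.3f}")  # 0.091
print(f"microbial fraction after 10-fold depletion: {after:.3f}")  # 0.500
```

Even though microbial cells outnumber host cells ten to one in the first case, under these assumptions fewer than one in ten reads would be microbial, which is why fractionation or selective lysis is worth checking before sequencing.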

Sequencing technology

Over the past 10 years metagenomic shotgun sequencing has gradually shifted from classical Sanger sequencing technology to next-generation sequencing (NGS). Sanger sequencing, however, is still considered the gold standard for sequencing, because of its low error rate, long read length (> 700 bp) and large insert sizes (e.g. > 30 Kb for fosmids or bacterial artificial chromosomes (BACs)). All of these aspects will improve assembly outcomes for shotgun data, and hence Sanger sequencing might still be applicable if generating close-to-complete genomes in low-diversity environments is the objective [18]. Drawbacks of Sanger sequencing are the labor-intensive cloning process, with its associated bias against genes toxic for the cloning host [19], and the overall cost per gigabase (approx. USD 400,000).

Of the NGS technologies, both the 454/Roche and the Illumina/Solexa systems have now been extensively applied to metagenomic samples. Excellent reviews of these technologies are available [20,21], but a brief summary is given here with particular attention to metagenomic applications.

The 454/Roche system applies emulsion polymerase chain reaction (ePCR) to clonally amplify random DNA fragments, which are attached to microscopic beads. Beads are deposited into the wells of a picotitre plate and then individually and in parallel pyrosequenced. The pyrosequencing process involves the sequential addition of all four deoxynucleoside triphosphates, which, if complementary to the template strand, are incorporated by a DNA polymerase. This polymerization reaction releases pyrophosphate, which is converted via two enzymatic reactions to produce light. Light production of ~1.2 million reactions is detected in parallel via a charge-coupled device (CCD) camera and converted to the actual sequence of the template. Two aspects are important in this process with respect to metagenomic applications. First, the ePCR has been shown to produce artificial replicate sequences, which will impact any estimates of gene abundance. Understanding the amount of replicate sequences is crucial for the data quality of sequencing runs, and replicates can be identified and filtered out with bioinformatics tools [22,23]. Second, the intensity of light produced when the polymerase runs through a homopolymer is often difficult to correlate to the actual number of nucleotide positions. Typically, this results in insertion or deletion errors in homopolymers and can hence cause reading frameshifts, if protein coding sequences (CDSs) are called on a single read. This type of error can, however, be incorporated into models of CDS prediction, thus resulting in high, albeit not perfect, accuracy [24]. Despite these disadvantages, the much lower cost of ~USD 20,000 per gigabase pair has made 454/Roche pyrosequencing a popular choice for shotgun-sequencing metagenomics. In addition, the 454/Roche technology produces an average read length between 600-800 bp, which is long enough to cause only minor loss in the number of reads that can be annotated [25]. Sample preparation has also been optimized so that tens of nanograms of DNA are sufficient for sequencing single-end libraries [26,27], although paired-end sequencing might still require microgram quantities. Moreover, the 454/Roche sequencing platform offers multiplexing, allowing for up to 12 samples to be analyzed in a single run of ~500 Mbp.

The Illumina/Solexa technology immobilizes random DNA fragments on a surface and then performs solid-surface PCR amplification, resulting in clusters of identical DNA fragments. These are then sequenced with reversible terminators in a sequencing-by-synthesis process [28]. The cluster density is enormous, with hundreds of millions of reads per surface channel and 16 channels per run on the HiSeq2000 instrument. Read length is now approaching 150 bp, and clustered fragments can be sequenced from both ends. Continuous sequence information of nearly 300 bp can be obtained from two overlapping 150 bp paired reads from a single insert. Yields of ~60 Gbp can therefore be typically expected in a single channel. While Illumina/Solexa has limited systematic errors, some datasets have shown high error rates at the tail ends of reads [29]. In general, clipping reads has proven to be a good strategy for eliminating the error in “bad” datasets; however, sequence quality values should also be used to detect “bad” sequences. The lower costs of this technology (~USD 50 per Gbp) and recent success in its application to metagenomics, and even the generation of draft genomes from complex datasets [30,31], are currently making the Illumina technology an increasingly popular choice. As with 454/Roche sequencing, starting material can be as low as 20 nanograms, but larger amounts (500-1000 ng) are required when mate-pair libraries for longer inserts are made. The limited read length of the Illumina/Solexa technology means that a greater proportion of unassembled reads might be too short for functional annotation than with 454/Roche technology [25]. While assembly might be advisable in such a case, potential bias, such as the suppression of low-abundance species (which cannot be assembled), should be considered, as should the fact that some current software packages (e.g. MG-RAST) are capable of analyzing unassembled Illumina reads of 75 bp and longer. Multiplexing of samples is also available for individual sequencing channels, with more than 500 samples multiplexed per lane. Another important factor to consider is run time, with a 2 × 100 bp paired-end sequencing analysis taking approx. 10 days of HiSeq2000 instrument time, in contrast to 1 day for the 454/Roche technology. However, faster run time (albeit at a higher cost per Gbp of approx. USD 600) can be achieved with the new Illumina MiSeq instrument. This smaller version of Illumina/Solexa technology can also be used to test-run sequencing libraries before analysis on a HiSeq instrument for deeper sequencing.

A few additional sequencing technologies are available that might prove useful for metagenomic applications, now or in the near future. The Applied Biosystems SOLiD sequencer has been extensively used, for example, in genome resequencing [32]. SOLiD arguably provides the lowest error rate of any current NGS sequencing technology; however, it does not achieve reliable read length beyond 50 nucleotides. This will limit its applicability for direct gene annotation of unassembled reads or for assembly of large contigs. Nevertheless, for assembly or mapping of metagenomic data against a reference genome, recent work showed encouraging outcomes [33]. Roche is also marketing a smaller-scale sequencer based on pyrosequencing with about 100 Mbp output and low per-run costs. This system might be useful, because relatively low coverage of metagenomes can establish meaningful gene profiles [34]. Ion Torrent (and more recently Ion Proton) is another emerging technology and is based on the principle that protons released during DNA polymerization can be used to detect nucleotide incorporation. This system promises read lengths of > 100 bp and throughput on the order of magnitude of the 454/Roche sequencing systems. Pacific Biosciences (PacBio) has released a sequencing technology based on single-molecule, real-time detection in zero-mode waveguide wells. Theoretically, this technology on its RS1 platform should provide much greater read lengths than the other technologies mentioned, which would facilitate annotation and assembly. In addition, a process called strobing will mimic paired-end reads. However, accuracy of single reads with PacBio is currently only at 85%, and random reads are “dropped,” making the instrument unusable in its current form for metagenomic sequencing [35]. Complete Genomics is offering a technology based on sequencing DNA nanoballs with combinatorial probe-anchor ligation [36]. Its read length of 35 nucleotides is rather limited and so might be its utility for de novo assemblies. While none of the emerging sequencing technologies have been thoroughly applied and tested with metagenomics samples, they offer promising alternatives and even further cost reduction.
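The read-clipping strategy mentioned above for Illumina data, where error rates rise towards the tail ends of reads, can be illustrated with a short sketch. This is a simplified illustration rather than the procedure of any particular tool: the function names and the quality threshold of 20 are hypothetical choices, and the code assumes Phred quality strings with an ASCII offset of 33 (some older Illumina pipelines used an offset of 64 instead).

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_string]


def clip_low_quality_tail(sequence, quality_string, min_quality=20, offset=33):
    """Trim bases from the 3' end until a base with quality >= min_quality.

    Returns the clipped sequence and the matching clipped quality string.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string, offset)
    end = len(sequence)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read; 0.0 for an empty read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


# 'I' encodes Phred 40 and '#' encodes Phred 2 at offset 33.
read, quals = clip_low_quality_tail("ACGTACGTAC", "IIIIIIII##")
print(read, quals)  # ACGTACGT IIIIIIII
```

Clipping only addresses the tail of a read; a mean-quality filter such as `mean_quality` can additionally flag reads that are poor throughout, in line with the advice to use quality values to detect “bad” sequences.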

Assembly

If the research aims at recovering the genome of uncultured organisms or obtaining full-length CDSs for subsequent characterization rather than a functional description of the community, then assembly of short read fragments will be performed to obtain longer genomic contigs. The majority of current assembly programs were designed to assemble single, clonal genomes and their utility for complex pan-genomic mixtures should be approached with caution and critical evaluation.

Two strategies can be employed for metagenomics samples: reference-based assembly (co-assembly) and de novo assembly.

Reference-based assembly can be done with software packages such as Newbler (Roche), AMOS (http://sourceforge.net/projects/amos/), or MIRA [37]. These software packages include algorithms that are fast and memory-efficient, and hence assembly can often be performed on laptop-sized machines in a couple of hours. Reference-based assembly works well if the metagenomic dataset contains sequences for which closely related reference genomes are available. However, differences between the true genome of the sample and the reference, such as a large insertion, deletion, or polymorphisms, can mean that the assembly is fragmented or that divergent regions are not covered.

De novo assembly typically requires larger computational resources. Thus, a whole class of assembly tools based on de Bruijn graphs was specifically created to handle very large amounts of data [38,39]. Machine requirements for the de Bruijn assemblers Velvet [40] or SOAP [41] are still significantly higher than for reference-based assembly (co-assembly), often requiring hundreds of gigabytes of memory in a single machine, with run times frequently being days.

The fact that most (if not all) microbial communities include significant variation on a strain and species level makes the use of assembly algorithms that assume clonal genomes less suitable for metagenomics. The “clonal” assumptions built into many assemblers might lead to suppression of contig formation for certain heterogeneous taxa at specific parameter settings. Recently, two de Bruijn-type assemblers, MetaVelvet and Meta-IDBA [42], have been released that deal explicitly with the non-clonality of natural populations. Both assemblers aim to identify within the entire de Bruijn graph a subgraph that represents related genomes. Alternatively, the metagenomic sequence mix can be partitioned into “species bins” via k-mer binning (Titus Brown, personal communications). Those subgraphs or subsets are then resolved to build a consensus sequence of the genomes. For Meta-IDBA an improvement in terms of N50 and maximum contig length has been observed when compared to “classical” de Bruijn assemblers (e.g. Velvet or SOAP; results from the personal experience of the authors; data not shown here). The development of “metagenomic assemblers” is, however, still at an early stage, and it is difficult to assess their accuracy for real metagenomic data as typically no references exist to compare the results to. A true gold standard (i.e. a real dataset for a diverse microbial community with known reference sequences) that assemblers can be evaluated against is thus urgently required.

Several factors need to be considered when exploring the reasons for assembling metagenomic data; these can be condensed to two important questions. First, what is the length of the sequencing reads used to generate the metagenomic dataset, and are longer sequences required for annotation? Some approaches, e.g. IMG/M, prefer assembled contigs; other pipelines such as MG-RAST [43] require only 75 bp or longer for gene prediction or similarity analysis that provides taxonomic binning and functional classification. On the whole, however, the longer the sequence information, the better the ability to obtain accurate information. One obvious impact is on annotation: the longer the sequence, the more information provided, making it easier to compare with known genetic data (e.g. via homology searches [25]). Annotation issues will be discussed in the next section. Binning and classification of DNA fragments for phylogenetic or taxonomic assignment also benefit from long, contiguous sequences, and certain tools (e.g. Phylopythia) work reliably only over a specific cut-off point (e.g. 1 Kb) [44]. Second, is the dataset assembled to reduce data-processing requirements? Here, as an alternative to assembling reads into contigs, clustering near-identical reads with cd-hit [45] or uclust [46] will provide clear benefits in data reduction. The MG-RAST pipeline also uses clustering as a data reduction strategy.

Fundamentally, assembly is also driven by the specific problem that single reads have generally lower quality and hence lower confidence in accuracy than do multiple reads that cover the same segment of genetic information. Therefore, merging reads increases the quality of information. Obviously, in a complex community with low sequencing depth or coverage, it is unlikely that many reads will cover the same fragment of DNA. Hence assembly may be of limited value for metagenomics.

Unfortunately, without assembly, longer and more complex genetic elements (e.g. CRISPRs) cannot be analyzed. Hence there is a need for metagenomic assembly to obtain high-confidence contigs that enable the study of, for example, major repeat classes. However, none of the current assembly tools is bias-free. Several strategies have been proposed to increase assembly accuracy [38], but strategies such as removal of rare k-mers are no longer considered adequate, since rare k-mers do not represent sequence errors (as initially assumed), but instead represent reads from less abundant pan-genomes in the metagenomic mix.
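Because N50 and maximum contig length are used above to compare assemblers, a short sketch of how these summary statistics are computed may be helpful. The code follows the standard definition of N50 (the contig length at which contigs of that length or longer contain at least half of all assembled bases); the function name and the example contig lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly contains no bases")
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies of different contiguity.
fragmented = [100, 200, 300, 400, 500]
dominated = [10, 10, 10, 10, 1000]
print(n50(fragmented), max(fragmented))  # 400 500
print(n50(dominated), max(dominated))    # 1000 1000
```

The second example shows why N50 should be read with care for metagenomes: a single long contig from one abundant genome can dominate the statistic while most of the community remains unassembled.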

Binning

Binning refers to the process of sorting DNA sequences into groups that might represent an individual genome or genomes from closely related organisms. Several algorithms have been developed, which employ two types of information contained within a given DNA sequence. Firstly, compositional binning makes use of the fact that genomes have conserved nucleotide composition (e.g. a certain GC content or a particular abundance distribution of k-mers), and this will also be reflected in sequence fragments of the genomes. Secondly, the unknown DNA fragment might encode a gene, and the similarity of this gene to known genes in a reference database can be used to classify and hence bin the sequence.

Composition-based binning algorithms include Phylopythia [44], S-GSOM [47], PCAHIER [48,49] and TACOA [49], while examples of purely similarity-based binning software include IMG/M [50], MG-RAST [43], MEGAN [51], CARMA [52], SOrt-ITEMS [53] and MetaPhyler [54]. There are also a number of binning algorithms that consider both composition and similarity, including the programs PhymmBL [55] and MetaCluster [56]. All these tools employ different methods of grouping sequences, including self-organising maps (SOMs) or hierarchical clustering, and are operated in either an unsupervised manner or with input from the user (supervised) to define bins.

Important considerations for using any binning algorithm are the type of input data available and the existence of suitable training datasets or reference genomes. In general, composition-based binning is not reliable for short reads, as they do not contain enough information. For example, a 100 bp read can at best contain fewer than half of all 256 possible 4-mers, and this is not sufficient to determine a 4-mer distribution that will reliably relate this read to any other read. Compositional assignment can, however, be improved if training datasets (e.g. a long DNA fragment of known origin) exist that can be used to define a compositional classifier [44]. These “training” fragments can either be derived from assembled data or from sequenced fosmids and should ideally contain a phylogenetic marker (such as an rRNA gene) that can be used for high-resolution, taxonomic assignment of the binned fragments [57].

Short reads may contain similarity to a known gene, and this information can be used to putatively assign the read to a specific taxon. This taxonomic assignment obviously requires the availability of reference data. If the query sequence is only distantly related to known reference genomes, only a taxonomic assignment at a very high level (e.g. phylum) is possible. If the metagenomic dataset, however, contains two or more genomes that would fall into this high taxon assignment, then “chimeric” bins might be produced. In this case, the two genomes might be separated by additional binning based on compositional features. In general, however, this might again require that the unknown fragments have a certain length.

Binning algorithms will obviously benefit in the future from the availability of a greater number and phylogenetic breadth of reference genomes, in particular for similarity-based assignment to low taxonomic levels. Post-assembly, the binning of contigs can lead to the generation of partial genomes of yet-uncultured or unknown organisms, which in turn can be used to perform similarity-based binning of other metagenomic datasets. Caution should, however, be taken to ensure the validity of any newly created genome bin, as “contaminating” fragments can rapidly propagate into false assignments in subsequent binning efforts. Prior to assembly with clonal assemblers, binning can be used to reduce the complexity of an assembly effort and might reduce computational requirements.

As major annotation pipelines like IMG/M or MG-RAST also perform taxonomic assignments of reads, one needs to carefully weigh the additional computational demands of the particular binning algorithm chosen against the added value they provide.
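The argument that short reads carry too little compositional signal can be checked with a small sketch. A read of length L contains L - k + 1 overlapping k-mers, so a 100 bp read has 97 windows and can therefore show at most 97 of the 256 possible 4-mers. The code below is an illustration only; the function names are hypothetical and it is not taken from any of the binning tools cited above.

```python
from collections import Counter


def kmer_profile(sequence, k=4):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def max_distinct_kmers(read_length, k=4):
    """Upper bound on distinct k-mers observable in a read of this length."""
    return max(0, min(read_length - k + 1, 4 ** k))


def kmer_coverage(sequence, k=4):
    """Fraction of all 4**k possible k-mers seen at least once."""
    return len(kmer_profile(sequence, k)) / 4 ** k


print(max_distinct_kmers(100))        # 97
print(max_distinct_kmers(100) / 256)  # 0.37890625
read = "ACGT" * 25                    # a repetitive 100 bp example read
print(len(kmer_profile(read)))        # 4
```

In practice most short reads show far fewer distinct 4-mers than the upper bound, which is one reason why composition-based classifiers are typically trained and applied on longer fragments.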

Annotation
For the annotation of metagenomes, two different initial pathways can be taken. First, if reconstructed genomes are the objective of the study and assembly has produced large contigs, it is preferable to use existing pipelines for genome annotation, such as RAST [58] or IMG [59]. For this approach to be successful, a minimum contig length of 30,000 bp or longer is required. Second, annotation can be performed on the entire community, relying on unassembled reads or short contigs. Here the tools for genome annotation are significantly less useful than those specifically developed for metagenomic analyses. Annotation of metagenomic sequence data has in general two steps. First, features of interest (genes) are identified (feature prediction) and, second, putative gene functions and taxonomic neighbors are assigned (functional annotation).

Feature prediction is the process of labeling sequences as genes or genomic elements. For completed genome sequences a number of algorithms have been developed [60,61] that identify coding sequences (CDS) with more than 95% accuracy and a low false negative ratio. A number of tools were specifically designed to handle metagenomic prediction of CDS, including FragGeneScan (FGS) [24], MetaGeneMark [62], MetaGeneAnnotator (MGA)/Metagene [63] and Orphelia [64,65]. All of these tools use internal information (e.g. codon usage) to classify sequence stretches as either coding or non-coding; however, they differ from each other in the quality of the training sets used and in their usefulness for short or error-prone sequences. FragGeneScan is currently the only algorithm known to the authors that explicitly models sequencing errors and thus results in gene prediction errors of only 1-2%. True positive rates of FragGeneScan are around 70% (better than most other methods), which means that even this tool still misses a significant subset of genes. These missing genes can potentially be identified by BLAST-based searches; however, the size of current metagenomic datasets often makes this computationally expensive step prohibitive.

A number of tools also exist for the prediction
of non-protein-coding genes and other sequence features, such as tRNAs [66,67], signal peptides [68] or CRISPRs [69,70]; however, they might require significant computational resources or long contiguous sequences. Clearly, subsequent analysis depends on the initial identification of features, and users of annotation pipelines need to be aware of the specific prediction approaches used. MG-RAST uses a two-step approach for feature identification: FragGeneScan (FGS) and a similarity search for ribosomal RNAs against a non-redundant integration of the SILVA [71], Greengenes [72] and RDP [73] databases. CAMERA’s RAMMCAP pipeline [74] uses FGS and MetaGeneAnnotator (MGA), while IMG/M employs a combination of tools, including FGS and MGA [58,59].

Functional annotation represents a major computational challenge for most metagenomic projects and therefore deserves much attention now and over the next years. Current estimates are that only 20 to 50% of metagenomic sequences can be annotated [75], leaving the immediate question of the importance and function of the remaining genes. We note that annotation is not done de novo, but via mapping to gene or protein libraries with existing knowledge (i.e., a non-redundant database). Any sequences that cannot be mapped to the known sequence space are referred to as ORFans. These ORFans are responsible for the seemingly never-ending genetic novelty in microbial metagenomics (e.g. [76]). Three hypotheses exist for the existence of this unknown fraction. First, ORFans might simply reflect erroneous CDS calls caused by imperfect detection algorithms. Second, these ORFans are real genes, but encode unknown biochemical functions. Third, ORFan genes have no sequence homology with known genes, but might have structural homology with known proteins, thus representing known protein families or folds. Future work will likely reveal that the truth lies somewhere between these hypotheses [77]. To improve the
annotation of ORFan genes, we will rely on the challenging and labor-intensive task of protein structure analysis (e.g. via NMR and X-ray crystallography) and on biochemical characterization.

Currently, metagenomic annotation relies on classifying sequences to known functions or taxonomic units based on homology searches against available “annotated” data. Conceptually, the annotation is relatively simple, and for small datasets (< 10,000 sequences) manual curation can be used to increase the accuracy of any automated annotation. Metagenomic datasets are typically very large, so manual annotation is not possible. Automated annotation therefore has to become more accurate and computationally less expensive. Currently, running a BLASTX similarity search is computationally expensive, costing as much as ten times the sequencing itself [78]. Unfortunately, computationally less demanding methods that detect feature composition in genes [44] have limited success for short reads. With growing dataset sizes, faster algorithms are urgently needed, and several programs for similarity searches have been developed to resolve this issue [46,79-81].

Many reference databases are available to give functional context to metagenomic datasets, such as KEGG [82], eggNOG [83], COG/KOG [84], PFAM [85], and TIGRFAM [86]. However, since no reference database covers all biological functions, the ability to visualize and merge the interpretations of all database searches within a single framework is important, as implemented in the most recent versions of MG-RAST and IMG/M. It is essential that metagenome analysis platforms be able to share data in ways that map and visualize data in the framework of other platforms. These metagenomic exchange languages should also reduce the burden associated with re-processing large datasets, minimizing the redundancy of searching and enabling the sharing of annotations that can be mapped to different ontologies and nomenclatures, thereby allowing multifaceted interpretations. The Genomic Standards Consortium (GSC) with the M5 project is providing a prototypical standard for exchange of computed metagenome analysis results, one cornerstone of these exchange languages.

Several large-scale databases are available that process
and deposit metagenomic datasets. MG-RAST, IMG/M, and CAMERA are three prominent systems [43,50,74]. MG-RAST is a data repository, an analysis pipeline and a comparative genomics environment. Its fully automated pipeline provides quality control, feature prediction and functional annotation and has been optimized for achieving a trade-off between accuracy and computational efficiency for short reads using BLAT (Kent, 2002). Results are expressed in the form of abundance profiles for specific taxa or functional annotations. The system supports the comparison of NCBI taxonomies derived from 16S rRNA gene or whole genome shotgun data and the comparison of relative abundance for KEGG, eggNOG, COG and SEED subsystems on multiple levels of resolution. Users can also download all data products generated by MG-RAST, share them and publish within the portal. The MG-RAST web interface allows comparison using a number of statistical techniques and allows for the incorporation of metadata into the statistics. As of December 2011, MG-RAST had more than 7000 users, > 38,000 uploaded and analyzed metagenomes (of which 7000 were publicly accessible) and 9 Terabases analyzed. These statistics demonstrate a move by the scientific community to centralize resources and standardize annotation.

IMG/M also provides a standardized pipeline, but with
“higher” sensitivity as it performs, for example, hidden Markov model (HMM) and BLASTX searches at substantial computational cost. In contrast to MG-RAST, comparisons in IMG/M are not performed on an abundance table level, but are based on an all vs. all genes comparison. Therefore IMG/M is the only system that integrates all datasets into a single protein level abstraction. Both IMG/M and MG-RAST provide the ability to use stored computational results for comparison, enabling comparison of novel metagenomes with a rich body of other datasets without requiring the end-user to provide the computational means for reanalysis of all datasets involved in their study. Other systems, such as CAMERA [74], offer more flexible annotation schema but require that individual researchers understand the annotation of data and analytical pipelines well enough to be confident in their interpretation. Also, for comparison, all datasets need to be analyzed using the same workflow, which adds to the computational requirements. CAMERA allows the publication of datasets and was the first to support the Genomic Standards Consortium’s Minimal Information checklists for metadata in its web interface [87].

MEGAN is another tool used for visualizing annotation results derived from BLAST searches in a functional or taxonomic dendrogram [51]. The use of dendrograms to display metagenomic data provides a collapsible network of interpretation, which makes analysis of particular functional or taxonomic groups visually easy.
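The idea of a collapsible taxonomic dendrogram can be sketched in a few lines. The example below is only an illustration and not MEGAN's implementation: it assumes that every read has already been assigned a lineage (for instance from similarity-search hits), and the lineages, the three-level depth and the function names `rollup`, `collapse` and `render` are all hypothetical.

```python
from collections import Counter

# Hypothetical per-read lineages (domain, phylum, class). A shorter tuple
# means the read could only be assigned at a higher taxonomic level.
read_lineages = [
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria"),
    ("Bacteria", "Proteobacteria", "Alphaproteobacteria"),
    ("Bacteria", "Proteobacteria"),
    ("Bacteria", "Firmicutes", "Bacilli"),
    ("Archaea", "Euryarchaeota"),
]


def rollup(lineages):
    """Count reads at every node of the tree, including all ancestor nodes."""
    counts = Counter()
    for lineage in lineages:
        for depth in range(1, len(lineage) + 1):
            counts[lineage[:depth]] += 1
    return counts


def collapse(counts, max_depth):
    """Hide every node deeper than max_depth; ancestor totals are unchanged."""
    return {node: n for node, n in counts.items() if len(node) <= max_depth}


def render(counts):
    """Format the tree as indented text, one node per line."""
    lines = []
    for node in sorted(counts):
        lines.append("  " * (len(node) - 1) + f"{node[-1]} ({counts[node]})")
    return "\n".join(lines)


counts = rollup(read_lineages)
print(render(counts))                # fully expanded view
print(render(collapse(counts, 1)))   # collapsed to domain level
```

Because each node carries the total of everything beneath it, collapsing a branch hides detail without changing the counts shown at higher levels.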

Experimental Design and Statistical Analysis
Owing to the high costs, many of the early metagenomic shotgun-sequencing projects were not replicated or were focused on targeted exploration of specific organisms (e.g. uncultured organisms in low-diversity acid mine drainage [2]). Reduction of sequencing cost (see above) and a much wider appreciation of the utility of
metagenomics to address fundamental questions in microbial ecology now require proper experimental designs with appropriate replication and statistical analysis. These design and statistical aspects, while obvious, are often not properly implemented in the field of microbial ecology [88]. However, many suitable approaches and strategies are readily available from the decades of research in quantitative ecology of higher organisms (e.g. animals, plants). In a simplistic way, the data from multiple metagenomic shotgun-sequencing projects can be reduced to tables, where the columns represent samples, the rows indicate either a taxonomic group or a gene function (or groups thereof), and the fields contain abundance or presence/absence data. This is analogous to species-sample matrices in the ecology of higher organisms, and hence many of the statistical tools available to identify correlations and statistically significant patterns are transferable. However, as metagenomic data often contain many more species or gene functions than the number of samples taken, appropriate corrections for multiple hypothesis testing have to be implemented (e.g. Bonferroni correction for t-test based analyses).

The Primer-E package [89] is a well-established tool,
allowing for a range of multivariate statistical analyses, including the generation of multidimensional scaling (MDS) plots, analysis of similarities (ANOSIM), and identification of the species or functions that contribute to the difference between two samples (SIMPER). Recently, multivariate statistics were also incorporated in a web-based tool called Metastats [90], which revealed with high confidence discriminatory functions between the replicated metagenome datasets of the gut microbiota of lean and obese mice [91]. In addition, the ShotgunFunctionalizeR package provides several statistical procedures for assessing functional differences between samples, both for individual genes and for entire pathways, using the popular R statistical environment [92].

Ideally, and in general, experimental design should be
driven by the question asked (rather than by technical or operational restrictions). For example, if a project aims to identify unique taxa or functions in a particular habitat, then suitable reference samples for comparison should be taken and processed in a consistent manner. In addition, variation between sample types can be due to true biological variation (something biologists would be most interested in) and to technical variation, and this should be carefully considered when planning the experiment. One should also be aware that many microbial systems are highly dynamic, so temporal aspects of sampling can have a substantial impact on data analysis and interpretation. While the number of replicates required is often difficult to predict prior to the final statistical analysis, small-scale experiments are often useful to
understand the magnitude of variation inherent in a system. For example, a small number of samples could be selected and sequenced to shallower depth, then analyzed to determine if a larger sample size or greater sequencing effort is required to obtain statistically meaningful results [88]. Also, the level at which replication takes place must be chosen so that it does not lead to false interpretation of the data. For example, if one is interested in the level of functional variation of the microbial community in habitat A, then multiple samples from this habitat should be taken and processed completely separately, but in the same manner. Taking just one sample and splitting it up prior to processing will provide information only about technical, but not biological, variation in habitat A. Taking multiple samples and then pooling them will lose all information on variability and hence will be of little use for statistical purposes. Ultimately, good experimental design of metagenomic projects will facilitate integration of datasets into new or existing ecological theories [93].

As metagenomics gradually moves through a range of
explorative biodiversity surveys, it will also prove itself extremely valuable for manipulative experiments. These will allow for observation of treatment impact on the functional and phylogenetic composition of microbial communities. Initial experiments have already shown promising results [94]. However, careful experimental planning and interpretation should be paramount in this field.

One of the ultimate aims of metagenomics is to link functional and phylogenetic information to the chemical, physical, and other biological parameters that characterize an environment. While measuring all these parameters can be time-consuming and cost-intensive, it allows retrospective correlation analysis of metagenomic data that was perhaps not part of the initial aim of the project or might be of interest for other research questions. The value of such metadata cannot be overstated and, in fact, providing it has become a mandatory or optional part of depositing metagenomic data in some databases [50,74].
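The sample-by-function table and the multiple-testing correction described earlier in this section can be made concrete with a short sketch. Everything in it is assumed for illustration: the counts, the sample names and the raw p-values are invented, and `relative_abundance` and `bonferroni` are hypothetical helper names rather than functions of Primer-E, Metastats or ShotgunFunctionalizeR.

```python
# Hypothetical abundance table: rows are gene functions, columns are samples
# (two replicates each from habitats A and B).
samples = ["A1", "A2", "B1", "B2"]
table = {
    "func_1": [120, 110, 40, 35],
    "func_2": [30, 45, 32, 41],
    "func_3": [50, 45, 128, 124],
}


def relative_abundance(table):
    """Scale each sample (column) so that its counts sum to 1."""
    n = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n)]
    return {name: [row[j] / totals[j] for j in range(n)]
            for name, row in table.items()}


def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant after Bonferroni correction.

    With m tests, each raw p-value is compared against alpha / m.
    """
    m = len(p_values)
    return {name: p <= alpha / m for name, p in p_values.items()}


rel = relative_abundance(table)

# Invented raw p-values, standing in for one t-test per function (A vs. B).
raw_p = {"func_1": 0.004, "func_2": 0.60, "func_3": 0.03}
significant = bonferroni(raw_p)

for name in table:
    print(name, [round(x, 3) for x in rel[name]], "significant:", significant[name])
```

In this made-up case `func_3` would pass an uncorrected threshold of 0.05 but not the corrected threshold of 0.05/3, which is the kind of result the correction is meant to guard against when thousands of functions are tested.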

Sharing and Storage of Data
Data sharing has a long tradition in the field of genome research, but for metagenomic data this will require a whole new level of organization and collaboration to provide metadata and centralized services (e.g., IMG/M, CAMERA and MG-RAST) as well as sharing of both data and computational results. In order to enable sharing of computed results, some aspects of the various analytical pipelines mentioned above will need to be coordinated, a process currently under way under the auspices of the GSC. Once this has been achieved, researchers will be able to download intermediate and
processed results from any one of the major repositories for local analysis or comparison.

A suite of standard languages for metadata is currently provided by the Minimum Information about any (x) Sequence checklists (MIxS) [95]. MIxS is an umbrella term to describe MIGS (the Minimum Information about a Genome Sequence), MIMS (the Minimum Information about a Metagenome Sequence) and MIMARKS (Minimum Information about a MARKer Sequence) [87] and contains standard formats for recording environmental and experimental data. The latest of these checklists, MIMARKS, builds on the foundation of the MIGS and MIMS checklists by including an expansion of the rich contextual information about each environmental sample.

The question of centralized versus decentralized storage is also one of “who pays for the storage,” which is a matter with no simple answer. The US National Center for Biotechnology Information (NCBI) is mandated to store all metagenomic data; however, the sheer volume of data being generated means there is an urgent need for appropriate ways of storing vast amounts of sequences. As the cost of sequencing continues to drop while the cost of analysis and storage remains more or less constant, selection of data storage in either biological (i.e. the sample that was sequenced) or digital form in (de-)centralized archives might be required. Ongoing work and successes in compression of (meta-)genomic data [96], however, might mean that digital information can still be stored cost-efficiently in the near future.
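To give a feel for why sequence data compress well, the sketch below packs nucleotides into two bits each, a quarter of the space of one-byte-per-base text. This is a generic textbook encoding chosen for illustration and is not the method of [96]; it assumes sequences contain only A, C, G and T, and it ignores quality scores and ambiguity codes such as N. The names `pack` and `unpack` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string (A, C, G, T only) into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a final chunk that holds fewer than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


read = "ACGTTGCAAC"
packed = pack(read)
print(len(read), "bases stored in", len(packed), "bytes")
assert unpack(packed, len(read)) == read
```

The original length has to be stored alongside the packed bytes, because the padding in the last byte is otherwise indistinguishable from trailing A bases.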

Conclusion
Metagenomics has benefited in the past few years from many visionary investments in both financial and intellectual terms. To ensure that those investments are utilized in the best possible way, the scientific community should aim to share, compare, and critically evaluate the outcomes of metagenomic studies. As datasets become increasingly complex and comprehensive, novel tools for analysis, storage, and visualization will be required. These will ensure the best use of metagenomics as a tool to address fundamental questions of microbial ecology, evolution and diversity and to derive and test new hypotheses. Metagenomics will be employed as commonly and frequently as any other laboratory method, and “metagenomizing” a sample might become as colloquial as “PCRing.” It is therefore also important that metagenomics be taught to students and young scientists in the same way that other techniques and approaches have been in the past.

Acknowledgements
This work was supported by the Australian Research Council and the U.S. Dept. of Energy under Contract DE-AC02-06CH11357.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Author details
1 School of Biotechnology and Biomolecular Sciences & Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW 2052, Australia. 2 Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA. 3 Department of Ecology and Evolution, University of Chicago, 5640 South Ellis Avenue, Chicago, IL 60637, USA. 4 Computation Institute, University of Chicago, 5640 South Ellis Avenue, Chicago, IL 60637, USA.

Authors’ contributions
All authors contributed to the conception and writing of the review article. All authors have read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Received: 13 October 2011 Accepted: 9 February 2012 Published: 9 February 2012

References
1. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 1998, 5(10):R245-249.

2. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428(6978):37-43.

3. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004, 304(5667):66-74.

4. Simon C, Daniel R: Metagenomic analyses: past and future trends. Appl Environ Microbiol 2011, 77(4):1153-1161.

5. Wilmes P, Bond PL: Metaproteomics: studying functional gene expression in microbial ecosystems. Trends Microbiol 2006, 14(2):92-97.

6. Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I: Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One 2008, 3(8):e3042.

7. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF: Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 2000, 289(5486):1902-1906.

8. Nicol GW, Schleper C: Ammonia-oxidising Crenarchaeota: important players in the nitrogen cycle? Trends Microbiol 2006, 14(5):207-212.

9. Burke C, Kjelleberg S, Thomas T: Selective extraction of bacterial DNA from the surfaces of macroalgae. Appl Environ Microbiol 2009, 75(1):252-256.

10. Delmont TO, Robe P, Clark I, Simonet P, Vogel TM: Metagenomic comparison of direct and indirect soil DNA extraction approaches. J Microbiol Methods 2011, 86(3):397-400.

11. Knight R, Desai N, Field D, Fierer N, Fuhrman J, Gordon J, Hu B, Hugenholtz P, Jansson J, Meyer F, Stevens R, Bailey M, Kowalchuk G, Gilbert J: Designing Better Metagenomic Surveys: The role of experimental design and metadata capture in making useful metagenomic datasets for ecology and biotechnology. Nature Biotechnology, in review.

12. Thomas T, Rusch D, DeMaere MZ, Yung PY, Lewis M, Halpern A, Heidelberg KB, Egan S, Steinberg PD, Kjelleberg S: Functional genomic signatures of sponge bacteria reveal unique and shared features of symbiosis. ISME J 2010, 4(12):1557-1567.

13. Palenik B, Ren Q, Tai V, Paulsen IT: Coastal Synechococcus metagenome reveals major roles for horizontal gene transfer and plasmids in population diversity. Environ Microbiol 2009, 11(2):349-359.

14. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The marine viromes of four oceanic regions. PLoS Biol 2006, 4(11):e368.

15. Abbai NS, Govender A, Shaik R, Pillay B: Pyrosequence analysis of unamplified and whole genome amplified DNA from hydrocarbon-contaminated groundwater. Mol Biotechnol 2011.

16. Lasken RS: Genomic DNA amplification by the multiple displacement amplification (MDA) method. Biochem Soc Trans 2009, 37(Pt 2):450-453.

17. Ishoey T, Woyke T, Stepanauskas R, Novotny M, Lasken RS: Genomic sequencing of single microbial cells from environmental samples. Curr Opin Microbiol 2008, 11(3):198-204.

18. Goltsman DS, Denef VJ, Singer SW, VerBerkmoes NC, Lefsrud M, Mueller RS, Dick GJ, Sun CL, Wheeler KE, Zemla A, Baker BJ, Hauser L, Land M, Shah MB, Thelen MP, Hettich RL, Banfield JF: Community genomic and proteomic analyses of chemoautotrophic iron-oxidizing “Leptospirillum rubarum” (Group II) and “Leptospirillum ferrodiazotrophum” (Group III) bacteria in acid mine drainage biofilms. Appl Environ Microbiol 2009, 75(13):4599-4615.

19. Sorek R, Zhu Y, Creevey CJ, Francino MP, Bork P, Rubin EM: Genome-wide experimental determination of barriers to horizontal gene transfer. Science 2007, 318(5855):1449-1452.

20. Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet 2010, 11(1):31-46.

21. Mardis ER: The impact of next-generation sequencing technology on genetics. Trends Genet 2008, 24(3):133-141.

22. Niu B, Fu L, Sun S, Li W: Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics 2010, 11:187.

23. Teal TK, Schmidt TM: Identifying and removing artificial replicates from 454 pyrosequencing data. Cold Spring Harb Protoc 2010, 2010(4):pdb prot5409.

24. Rho M, Tang H, Ye Y: FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res 2010, 38(20):e191.

25. Wommack KE, Bhavsar J, Ravel J: Metagenomics: read length matters. Appl Environ Microbiol 2008, 74(5):1453-1463.

26. White RA, Blainey PC, Fan HC, Quake SR: Digital PCR provides sensitive and absolute calibration for high throughput sequencing. BMC Genomics 2009, 10:116.

27. Adey A, Morrison HG, Asan Xun X, Kitzman JO, Turner EH, Stackhouse B, MacKenzie AP, Caruccio NC, Zhang X, Shendure J: Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol 2010, 11(12):R119.

28. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53-59.

29. Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, Ishikawa S, Linak MC, Hirai A, Takahashi H, Altaf-Ul-Amin M, Ogasawara N, Kanaya S: Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 2011, 39(13):e90.

30. Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T, Mackie RI, Pennacchio LA, Tringe SG, Visel A, Woyke T, Wang Z, Rubin EM: Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 2011, 331(6016):463-467.

31. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Le Paslier D, Linneberg A, Nielsen HB, Pelletier E, Renault P, et al: A human gut microbial gene catalogue established by metagenomic sequencing. Nature 2010, 464(7285):59-65.

32. Gulig PA, de Crecy-Lagard V, Wright AC, Walts B, Telonis-Scott M, McIntyre LM: SOLiD sequencing of four Vibrio vulnificus genomes enables comparative genomic analysis and identification of candidate clade-specific virulence genes. BMC Genomics 2010, 11:512.

33. Tyler HL, Roesch LF, Gowda S, Dawson WO, Triplett EW: Confirmation of the sequence of ‘Candidatus Liberibacter asiaticus’ and assessment of microbial diversity in Huanglongbing-infected citrus phloem using a metagenomic approach. Mol Plant Microbe Interact 2009, 22(12):1624-1634.

34. Kunin V, Raes J, Harris JK, Spear JR, Walker JJ, Ivanova N, von Mering C, Bebout BM, Pace NR, Bork P, Hugenholtz P: Millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. Mol Syst Biol 2008, 4:198.

35. Rasko DA, Webster DR, Sahl JW, Bashir A, Boisen N, Scheutz F, Paxinos EE, Sebra R, Chin CS, Iliopoulos D, Klammer A, Peluso P, Lee L, Kislyuk AO, Bullard J, Kasarskis A, Wang S, Eid J, Rank D, Redman JC, Steyert SR, Frimodt-Moller J, Struve C, Petersen AM, Krogfelt KA, Nataro JP, Schadt EE, Waldor MK: Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany. N Engl J Med 2011, 365(8):709-717.

36. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, Staker B, Pant KP, Baccash J, Borcherding AP, Brownley A, Cedeno R, Chen L, Chernikoff D, Cheung A, Chirita R, Curson B, Ebert JC, Hacker CR, Hartlage R, Hauser B, Huang S, Jiang Y, Karpinchyk V, et al: Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 2010, 327(5961):78-81.

37. Chevreux B, Wetter T, Suhai S: Genome Sequence Assembly Using Trace Signals and Additional Sequence Information Computer Science and Biology. Proceedings of the German Conference on Bioinformatics 1999, 99:45-56.

38. Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics 2010, 95(6):315-327.

39. Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 2001, 98(17):9748-9753.

40. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18(5):821-829.

41. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24(5):713-714.

42. Peng Y, Leung HC, Yiu SM, Chin FY: Meta-IDBA: a de Novo assembler for metagenomic data. Bioinformatics 2011, 27(13):i94-101.

43. Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F: Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc 2010, 2010(1):pdb prot5368.

44. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4(1):63-72.

45. Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658-1659.

46. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26(19):2460-2461.

47. Chan CK, Hsu AL, Halgamuge SK, Tang SL: Binning sequences using very sparse labels within a metagenome. BMC Bioinformatics 2008, 9:215.

48. Zheng H, Wu H: Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis. J Bioinform Comput Biol 2010, 8(6):995-1011.

49. Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW: TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics 2009, 10:56.

50. Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I, Lykidis A, Mavromatis K, Hugenholtz P, Kyrpides NC: IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 2008, 36(Database):D534-538.

51. Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data. Genome Res 2007, 17(3):377-386.

52. Krause L, Diaz NN, Goesmann A, Kelley S, Nattkemper TW, Rohwer F, Edwards RA, Stoye J: Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res 2008, 36(7):2230-2239.

53. Monzoorul Haque M, Ghosh TS, Komanduri D, Mande SS: SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences. Bioinformatics 2009, 25(14):1722-1730.

54. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M: Accurate and fastestimation of taxonomic profiles from metagenomic shotgun sequences.BMC Genomics 2011, 12(Suppl 2):S4.

55. Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogeneticclassification with interpolated Markov models. Nat Methods 2009,6(9):673-676.

56. Leung HC, Yiu SM, Yang B, Peng Y, Wang Y, Liu Z, Chen J, Qin J, Li R, Chin FY: A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics 2011, 27(11):1489-1495.

57. Yung PY, Burke C, Lewis M, Egan S, Kjelleberg S, Thomas T: Phylogenetic screening of a bacterial, metagenomic library using homing endonuclease restriction and marker insertion. Nucleic Acids Res 2009, 37(21):e144.

58. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9:75.

59. Markowitz VM, Mavromatis K, Ivanova NN, Chen IM, Chu K, Kyrpides NC: IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics 2009, 25(17):2271-2278.

60. Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 1998, 26(4):1107-1115.

61. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27(23):4636-4641.

62. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4(1):63-72.

63. Noguchi H, Taniguchi T, Itoh T: MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res 2008, 15(6):387-396.

64. Hoff KJ, Lingner T, Meinicke P, Tech M: Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res 2009, 37(Web Server issue):W101-105.

65. Yok NG, Rosen GL: Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinformatics 2011, 12:20.

66. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A: Rfam: updates to the RNA families database. Nucleic Acids Res 2009, 37(Database issue):D136-140.

67. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997, 25(5):955-964.

68. Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 2004, 340(4):783-795.

69. Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, Kyrpides NC, Hugenholtz P: CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 2007, 8:209.

70. Grissa I, Vergnaud G, Pourcel C: CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 2007, 35(Web Server issue):W52-57.

71. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 2007, 35(21):7188-7196.

72. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006, 72(7):5069-5072.

73. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 2009, 37(Database issue):D141-145.

74. Sun S, Chen J, Li W, Altintas I, Lin A, Peltier S, Stocks K, Allen EE, Ellisman M, Grethe J, Wooley J: Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource. Nucleic Acids Res 2011, 39(Database issue):D546-551.

75. Gilbert JA, Field D, Swift P, Thomas S, Cummings D, Temperton B, Weynberg K, Huse S, Hughes M, Joint I, Somerfield PJ, Muhling M: The taxonomic and functional diversity of microbes at a temperate coastal site: a ‘multi-omic’ study of seasonal and diel temporal variation. PLoS One 2010, 5(11):e15545.

76. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama ST, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, et al: The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 2007, 5(3):e16.

77. Godzik A: Metagenomics and the protein universe. Curr Opin Struct Biol 2011, 21(3):398-403.

78. Wilkening J, Desai N, Meyer F, Wilke A: Using clouds for metagenomics - case study. IEEE Cluster 2009.

79. Ye Y, Choi JH, Tang H: RAPSearch: a fast protein similarity search tool for short reads. BMC Bioinformatics 2011, 12:159.

80. Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res 2002, 12(4):656-664.

81. Wang W, Zhang P, Liu X: Short read DNA fragment anchoring algorithm. BMC Bioinformatics 2009, 10(Suppl 1):S17.

82. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32(Database issue):D277-280.

83. Muller J, Szklarczyk D, Julien P, Letunic I, Roth A, Kuhn M, Powell S, von Mering C, Doerks T, Jensen LJ, Bork P: eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res 2010, 38(Database issue):D190-195.

84. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41.

85. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Res 2010, 38(Database issue):D211-222.

86. Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O: TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 2007, 35(Database issue):D260-264.

87. Field D, Amaral-Zettler L, Cochrane G, Cole JR, Dawyndt P, Garrity GM, Gilbert J, Glockner FO, Hirschman L, Karsch-Mizrachi I, Klenk HP, Knight R, Kottmann R, Kyrpides N, Meyer F, San Gil I, Sansone SA, Schriml LM, Sterk P, Tatusova T, Ussery DW, White O, Wooley J, Yilmaz P, Gilbert JA, Johnston A, Vaughan R, Hunter C, Park J, Morrison N, et al: The Genomic Standards Consortium: Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. PLoS Biol 2011, 9(6):e1001088.

88. Prosser JI: Replicate or lie. Environ Microbiol 2010, 12(7):1806-1810.

89. Clarke KR: Non-parametric multivariate analyses of changes in community structure. Australian J Ecology 1993, 18:117-143.

90. White JR, Nagarajan N, Pop M: Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol 2009, 5(4):e1000352.

91. Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, Sogin ML, Jones WJ, Roe BA, Affourtit JP, Egholm M, Henrissat B, Heath AC, Knight R, Gordon JI: A core gut microbiome in obese and lean twins. Nature 2009, 457(7228):480-484.

92. Kristiansson E, Hugenholtz P, Dalevi D: ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics 2009, 25(20):2737-2738.

93. Burke C, Steinberg P, Rusch D, Kjelleberg S, Thomas T: Bacterial community assembly based on functional genes rather than species. Proc Natl Acad Sci USA 2011, 108(34):14288-14293.

94. Mou X, Sun S, Edwards RA, Hodson RE, Moran MA: Bacterial carbon processing by generalist species in the coastal ocean. Nature 2008, 451(7179):708-711.

95. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, Vaughan R, Hunter C, Park J, Morrison N, Rocca-Serra P, Sterk P, Arumugam M, Bailey M, Baumgartner L, Birren BW, Blaser MJ, Bonazzi V, Booth T, Bork P, Bushman FD, Buttigieg PL, Chain PS, Charlson E, Costello EK, Huot-Creasy H, et al: Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 2011, 29(5):415-420.

96. Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011, 21(5):734-740.

doi:10.1186/2042-5783-2-3

Cite this article as: Thomas et al.: Metagenomics - a guide from sampling to data analysis. Microbial Informatics and Experimentation 2012, 2:3.
