Novogene META Demo Report
1 Introduction
Microbial populations exist in almost every ecological community on Earth, from
the insect gut to the oceans and the sediment beneath them. For most of its
history, life on Earth consisted solely of microscopic life forms, and microbial
life still dominates Earth in many respects. Microbes are not only ubiquitous but
also essential to all forms of life, as the primary source of nutrients and the
primary recyclers of dead matter back to available organic form [1].
For a long time, microbial ecologists were mostly restricted to pure cultures of
cultivable isolates to shed light on the diversity and functions of environmental
microbes [2]. Pure cultures allow the study of an isolate’s metabolism and of its gene
repertoire by genome sequencing. Both provide valuable information for extrapolating
on the isolate’s ecophysiological role. However, the cultivability of environmental
microbes often ranges below 1% of the total bacteria, which limits research on
microbial diversity.
We now have the ability to obtain genomic information directly from microbial
communities in their natural habitats. Instead of looking at a few species individually,
we are able to study tens of thousands all together. Metagenomics is defined as the
direct genetic analysis of the genomes contained in an environmental sample (the
community) [3]. It therefore also covers genomic DNA from microorganisms that
cannot be cultured in the laboratory. Metagenomics provides access to the
functional gene composition of microbial communities, and thus gives a much
broader description than phylogenetic surveys, which are mostly based on the
diversity of a single gene, for instance the 16S rRNA gene. The classical
metagenome approach involves cloning of environmental DNA into vectors with the
help of ultra-competent bioengineered host strains. The obtained clone libraries are
subsequently screened either for dedicated marker genes (sequence-driven approach)
or metabolic functions (function-driven approach)[2].
Nowadays, shotgun metagenomics is commonly used to study the gene
inventories of microbial communities [4,5,6]. With the rapid development of
next-generation sequencing, large metagenomes and multi-metagenome studies have
been generated. There have been many pioneering metagenomic studies in different
environments, such as the NIH Human Microbiome Project
(HMP, http://www.hmpdacc.org/) and the Earth Microbiome
Project (EMP, http://www.earthmicrobiome.org/).
2 Project Process
2.1 Experimental Procedures
After the DNA sample(s) are delivered, we first conduct a sample quality test.
We then use the qualified DNA sample(s) to construct the library(ies). Finally, the
qualified library(ies) are used for sequencing, and the bioinformatics analysis is
carried out on the sequencing data. From samples to raw data, every aspect of sample
testing, library construction and sequencing can affect the quality and quantity of
the data, and data quality directly affects the results of the follow-up
bioinformatics analysis. To ensure the accuracy and reliability of the
sequencing data from the beginning, each step of sample testing, library construction
and sequencing is strictly controlled by Novogene, which underpins the output of
high-quality data.
2.1.1 Quality Control (QC)
There are three key methods in QC for DNA samples:
(1) DNA degradation degree and potential contamination were monitored on 1%
agarose gels.
(2) DNA purity (OD260/OD280, OD260/OD230) was checked using the
NanoPhotometer® spectrophotometer (IMPLEN, CA, USA).
(3) DNA concentration was measured using Qubit® dsDNA Assay Kit in Qubit® 2.0
Fluorometer (Life Technologies, CA, USA).
DNA samples with an OD value between 1.8 and 2.0 and a DNA content above 1 μg
were used for library construction.
2.1.2 Library Construction
A total amount of 1 μg DNA per sample was used as input material for the DNA
sample preparation. Sequencing libraries were generated using NEBNext® Ultra™
DNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer’s
recommendations, and index codes were added to attribute sequences to each sample.
Briefly, the DNA sample was fragmented to 300 bp by sonication; the resulting
DNA fragments were then end-polished, A-tailed, and ligated with the full-length
adaptor for Illumina sequencing, followed by PCR amplification. Finally, PCR
products were purified (AMPure XP system), and libraries were analyzed for size
distribution on an Agilent 2100 Bioanalyzer and quantified by real-time PCR.
2.1.3 Sequencing
The clustering of the index-coded samples was performed on a cBot Cluster
Generation System according to the manufacturer’s instructions. After cluster
generation, the prepared libraries were sequenced on an Illumina HiSeq2500 platform
and paired-end reads were generated.
2.2 Bioinformatics Analysis
The initial step of metagenomic data analysis requires the execution of certain
pre-filtering steps.
This is followed by assembling DNA sequence reads into contiguous consensus sequences
(contigs), annotating taxonomy and abundance, and predicting genes [7].
Functional annotation is then accomplished by analyzing Metabolic Pathways,
Clusters of Orthologous Groups (COG) and Carbohydrate-active Enzymes.
To determine the similarity or difference of taxonomic and functional components
between different samples, Clustering analysis and Principal Component
Analysis (PCA) were performed. In addition, a series of advanced analysis
items is available to explore the environmental samples in depth, such as LEfSe,
Significant Difference, CCA/RDA, NMDS, prediction of Secreted proteins and Type
III secretion system effectors, annotation against VFDB, ARDB and PHI-base, and
comparative metabolic pathway analysis of multiple samples.
3 Analysis Results
3.1 Sequence Pre-filtering
The first step of data analysis requires the execution of certain pre-filtering steps,
including:
a) Remove low-quality reads in which the ratio of low-quality bases (quality ≤ 5) is over 40%
b) Remove N-rich reads in which the ratio of N bases is over 10%
c) Remove adapter-polluted reads which overlap the adapter by more than 15 bp
d) Remove host contamination (reads whose identity with the host
genome is higher than 90%)
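Rules (a) and (b) above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the function names are hypothetical, quality values are assumed to be already decoded Phred scores, and adapter and host filtering (rules c and d) are left out because they require an aligner.

```python
def low_quality_fraction(quals, max_low_q=5):
    """Fraction of bases whose Phred score is <= max_low_q (rule a)."""
    return sum(q <= max_low_q for q in quals) / len(quals)


def n_fraction(seq):
    """Fraction of bases called as N (rule b)."""
    return seq.upper().count("N") / len(seq)


def keep_read(seq, quals, max_low_frac=0.40, max_n_frac=0.10):
    """Return True if the read passes rules (a) and (b).

    seq   : read sequence as a string
    quals : list of integer Phred scores, one per base (assumed decoded)
    """
    if not seq or len(seq) != len(quals):
        return False
    if low_quality_fraction(quals) > max_low_frac:
        return False
    if n_fraction(seq) > max_n_frac:
        return False
    return True
```

A read is dropped only when a ratio is strictly over the threshold, which mirrors the wording "is over 40%" and "is over 10%" in the list.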
Table 3.1 Statistics of data pre-filtering
#Sample: sample name; InsertSize: the insert size of the library; SeqStrategy: sequencing
strategy, e.g. (125:125) means paired-end reads of 125 bp; RawData: raw sequencing data in Mbp;
CleanData: effective data obtained from pre-filtering in Mbp; Clean_Q20/Clean_Q30: the
proportion of bases in the clean data having a quality score equal to or higher than 20 (error
rate ≤1%) and 30 (error rate ≤0.1%) respectively; Clean_GC(%): the percentage of G and C bases
in the clean data; Effective(%): the ratio of effective data to raw data.
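The columns of Table 3.1 follow from simple counts over the clean reads. The sketch below shows one way to compute them; the function names and the input layout (pairs of sequence and decoded Phred scores) are assumptions made for illustration. The error rates quoted in the legend come from the Phred definition Q = -10 log10(p).

```python
def phred_to_error(q):
    """Error probability implied by a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def clean_stats(reads):
    """Compute Clean_Q20, Clean_Q30 and Clean_GC as percentages.

    reads: iterable of (sequence, list of Phred scores) pairs.
    """
    total = q20 = q30 = gc = 0
    for seq, quals in reads:
        total += len(seq)
        q20 += sum(q >= 20 for q in quals)
        q30 += sum(q >= 30 for q in quals)
        gc += sum(base in "GCgc" for base in seq)
    if total == 0:
        raise ValueError("no bases to summarise")
    return {
        "Clean_Q20": 100 * q20 / total,
        "Clean_Q30": 100 * q30 / total,
        "Clean_GC": 100 * gc / total,
    }


def effective_rate(raw_bases, clean_bases):
    """Effective(%): clean data as a percentage of raw data."""
    return 100 * clean_bases / raw_bases
```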
3.2 Assembly
The SOAPdenovo [8] package was used to perform metagenomic assembly with
different K values (default 49, 55 and 59). The assembly with the largest
Scaffold N50 was selected for the subsequent analysis. The Scaftig
numbers and length distribution were then obtained and are shown in Table 3.2 and
Figure 3.2 respectively.
Table 3.2 Statistics of Scaftigs of each sample
#SampleID: sample name; Total Len.: the total length of all Scaftigs; Num.: the total number of
all Scaftigs; Average Len.: the average length of all Scaftigs; N50/90 Len.: the length of
N50/N90; Max Len.: the length of longest scaftig.
Figure 3.2 The length distribution of Scaftigs of each sample. The Y1-axis, titled “Frequence (#)”,
shows the number of Scaftigs of a given length; the Y2-axis, titled “Percentage (%)”, shows the
percentage of all Scaftigs that have a given length; the X-axis, titled
“ScaftigsLength(bp)”, indicates the length of Scaftigs.
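The N50 and N90 values used in this section can be computed from a list of sequence lengths as sketched below. N50 is the length L such that sequences of length at least L together cover at least half of the total assembly length. The function names and the dictionary layout mapping a K value to its sequence lengths are hypothetical.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx length (N50 for fraction=0.5, N90 for fraction=0.9)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequences given")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # not reached for 0 < fraction <= 1


def best_k_by_n50(assemblies):
    """Pick the K value whose assembly has the largest N50.

    assemblies: dict mapping K value -> list of sequence lengths.
    """
    return max(assemblies, key=lambda k: nxx(assemblies[k]))
```

For example, `best_k_by_n50({49: lengths_k49, 55: lengths_k55, 59: lengths_k59})` returns the K value of the assembly that would be carried forward under the selection rule described above.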
3.3 Species Annotation and Abundance
3.3.1 Species Annotation
CD-HIT was used to cluster the Scaftigs derived from assembly with a default identity of
0.95. To analyze the relative abundance of Scaftigs in each sample, the
clean reads after pre-processing were first mapped to the non-redundant Scaftigs dataset
with SoapAligner [10]. Scaftigs with a total depth of 0 were then
filtered out, yielding the abundance table of the filtered Scaftigs.
The corresponding Scaftigs were aligned against the Bacteria, Fungi, Archaea and
Viruses sequences extracted from the NCBI NT database (E-value ≤ 1e-05). The LCA
algorithm (Lowest Common Ancestor, as applied in the MEGAN [11] software system) was
used to ensure the annotation significance by picking the lowest common
classified ancestor for final display.
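The core idea of the LCA step can be illustrated as follows: given the taxonomic lineages of all significant hits for one Scaftig, keep the deepest taxon that every hit shares. This is a minimal sketch, not MEGAN's implementation, which applies further score and support filters; the function name and the toy lineages are assumptions.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of root-to-leaf taxon lists, one per database hit.
    """
    if not lineages:
        return None
    common = []
    # Walk down the ranks in parallel until the hits disagree.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common[-1] if common else None


# Toy lineages (simplified ranks) for two hits of the same Scaftig.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
]
assigned = lowest_common_ancestor(hits)
```

Here the two hits disagree at the genus level, so the Scaftig is assigned to the class they share rather than to either genus.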
According to the results of LCA annotation and Scaftigs abundance, relative
abundance tables at different taxonomic levels were calculated. The top ten phyla were
selected and are displayed in Figure 3.3.1.
Figure 3.3.1 Phylum distribution of all samples. Plotted with “Relative Abundance”
along the Y-axis and “Samples Name” on the X-axis, where “Others” represents the
total relative abundance of all phyla besides the top 10 phyla.
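The relative abundance table with an "Others" bucket can be sketched for a single sample as below. The function name and the input dictionary are hypothetical, and the report may rank the top ten phyla across all samples rather than per sample, so treat this as an illustration of the arithmetic only.

```python
def top_taxa_with_others(abundance, top_n=10):
    """Convert raw abundances to relative abundances, keep the top_n taxa,
    and pool everything else under "Others".

    abundance: dict mapping taxon name -> raw abundance in one sample.
    """
    total = sum(abundance.values())
    if total == 0:
        raise ValueError("sample has zero total abundance")
    relative = {taxon: value / total for taxon, value in abundance.items()}
    ranked = sorted(relative.items(), key=lambda item: item[1], reverse=True)
    result = dict(ranked[:top_n])
    remainder = ranked[top_n:]
    if remainder:
        result["Others"] = sum(value for _, value in remainder)
    return result
```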
3.3.2 Species Abundance Cluster
The dominant 35 genera among all samples were selected based on the results of
species annotation and the abundance information of all samples at the genus level. The
abundance distribution of these dominant 35 genera is displayed in the species
abundance heat-map shown in Figure 3.3.2.
Figure 3.3.2 Species abundance Heat-map. Plotted by sample name on the X-axis and
selected genera on the Y-axis. The absolute value of 'Z' represents the distance
between the raw score and the population mean in units of the standard deviation. 'Z'
is negative when the raw score is below the mean, positive when above.
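The 'Z' value described in the caption is the standard score, Z = (x - mean) / standard deviation. A small sketch follows; it assumes that each genus is standardised across samples and that the population standard deviation is used, neither of which is stated explicitly in the report.

```python
from statistics import mean, pstdev


def row_z_scores(values):
    """Standardise one row of the heat-map (one genus across all samples)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A genus with identical abundance everywhere has no deviation.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]
```

Values below the row mean give negative scores and values above it give positive scores, which is what the colour scale of the heat-map encodes.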
3.4 Gene Prediction and Abundance
3.4.1 Gene Prediction
Scaftigs over 300 bp were selected for gene prediction with the MetaGeneMark [13]
software, which is widely used in metagenomic analysis and in gene prediction for
unknown prokaryotes and is based on a heuristic algorithm and a large
training set. The result is shown in Table 3.4.1.
Table 3.4.1 Gene prediction basic information
#ORF NO.: the number of ORFs (open reading frames); Total Len. (Mbp): the total length of all
ORFs; Average Len. (Mbp): the average length of all ORFs; GC Percent: the GC content of the
ORFs.
Figure 3.4.1 The length distribution of the predicted genes. Plotted by the number of the
genes along the Y1-axis, percentage of genes along the Y2-axis, and the length of
genes along the X-axis.
3.4.2 Eliminating Redundancy of Genes
The protein sequences of the genes were clustered at a threshold of 95% [14] sequence
similarity with the CD-HIT [15,16] software, and the longest sequence in each cluster
was selected as the representative sequence. The length distribution of the
representative sequences is shown in Figure 3.4.2.
Figure 3.4.2 The length distribution of the representative genes. Plotted by the number
of the genes along the Y1-axis, percentage of genes along the Y2-axis, and the length
of genes along the X-axis.
3.4.3 Gene Abundance
To analyze the gene abundance of each sample, the start
and end positions of each gene on its Scaftig were combined with the per-base depth of
that Scaftig, which gives the abundance table of the representative genes in
each sample. Part of the representative gene abundance table is shown in Table 3.4.3.
Table 3.4.3 Part of representative genes abundance
Gene IDs are listed vertically and sample names horizontally. Each cell
gives the abundance of the given gene in the corresponding sample.
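One plausible way to turn per-base depth and gene coordinates into a gene abundance value is to average the depth over the gene's interval. The report does not give the exact formula, so the averaging rule, the function name and the 1-based inclusive coordinate convention below are all assumptions.

```python
def gene_mean_depth(depth, start, end):
    """Mean per-base depth over a gene.

    depth : list of per-base sequencing depths for one Scaftig
    start : 1-based start position of the gene on the Scaftig (inclusive)
    end   : 1-based end position of the gene on the Scaftig (inclusive)
    """
    if not (1 <= start <= end <= len(depth)):
        raise ValueError("gene coordinates fall outside the Scaftig")
    window = depth[start - 1:end]
    return sum(window) / len(window)
```

Running this for every representative gene in every sample would fill a table with the same layout as Table 3.4.3, with genes as rows and samples as columns.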
3.5 Functional Annotation and Abundance Analysis
Sequences were annotated to functional categories against the KEGG, eggNOG and
CAZy databases using BLAST, and the hit with the minimum E-value was
selected. The BCR (BLAST Coverage Ratio) of both the reference and the query gene
was required to be ≥ 40%. The BCR is defined as:
BCR(Ref.) = (Match / Length(R)) × 100%
BCR(Que.) = (Match / Length(Q)) × 100%
where Match is the alignment length between the reference and the query
gene, Length(R) is the length of the reference gene, and Length(Q) is the length of
the query gene.
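The two BCR formulas translate directly into code. In the sketch below the function names are hypothetical, and requiring both ratios to reach the 40% cutoff is an interpretation of the text rather than a documented rule.

```python
def bcr(match_len, ref_len, query_len):
    """Return (BCR(Ref.), BCR(Que.)) as percentages."""
    if ref_len <= 0 or query_len <= 0:
        raise ValueError("gene lengths must be positive")
    return 100 * match_len / ref_len, 100 * match_len / query_len


def passes_bcr(match_len, ref_len, query_len, cutoff=40.0):
    """True if both coverage ratios reach the cutoff (assumed rule)."""
    ref_cov, query_cov = bcr(match_len, ref_len, query_len)
    return ref_cov >= cutoff and query_cov >= cutoff
```

For instance, an alignment of 300 positions between a 600-position reference and a 400-position query gives BCR(Ref.) = 50% and BCR(Que.) = 75%, so the hit is kept.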
3.5.1 Functional Annotation
KEGG is a database resource for understanding high-level functions and utilities of
the biological systems, such as the cell, the organism and the ecosystem, from
genomic and molecular-level information (http://www.genome.jp/kegg/). KEGG is
the reference knowledge base that integrates current knowledge on molecular
interaction networks such as pathways and complexes (PATHWAY database),
information about genes and proteins generated by genome projects
(GENES/SSDB/KO databases) and information about biochemical compounds and
reactions (COMPOUND/GLYCAN/REACTION databases). The PATHWAY database
is a collection of manually drawn maps called the KEGG reference pathway maps,
each corresponding to a known network of functional significance. Reflecting the map
resolution and functional modules at different levels, these pathway maps are
hierarchically classified. There are seven categories in the top level (Metabolism,
Genetic Information Processing, Environmental Information Processing, Cellular
Processes, Organismal Systems, Human Diseases and Drug Development) and 54
subcategories in the second level. The third level in the hierarchy corresponds to
individual pathway maps. The fourth level corresponds to KO (KEGG Orthology)
entries [17-19].
The eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups)
database is a database of orthologous groups of genes. The orthologous groups are
annotated with functional description lines (derived by identifying a common
denominator for the genes based on their various annotations) and with functional
categories (derived from the original COG/KOG categories) (http://eggnog.embl.de).
The eggNOG database (v4.0) counts 1.7 million orthologous groups in 3686 species,
covering over 7.7 million proteins (built from 9.6 million proteins) [20].
The CAZy database describes the families of structurally related catalytic and
carbohydrate-binding modules (or functional domains) of enzymes that degrade,
modify, or create glycosidic bonds (http://www.cazy.org/) [21,22]. The Enzyme
Classes currently include Glycoside Hydrolases (GHs), Glycosyl Transferases
(GTs), Polysaccharide Lyases (PLs), Carbohydrate Esterases (CEs), and Auxiliary
Activities (AAs). The Associated Modules currently cover the
Carbohydrate-Binding Modules (CBMs).
Fig 3.5.1-1 Abundance of genes annotated to the functional databases. Top panel:
KEGG database. Middle panel: eggNOG database. Bottom panel: CAZy database.
After annotation against the KEGG database, genes were displayed on KEGG maps. One of
the maps is shown below.
Fig 3.5.1-2 Map of TCA cycle pathway
3.5.2 Abundance Analysis of Functional Genes
Genes were annotated to functional databases at different levels. Relative abundance of
annotated genes by the top level of each functional database is shown below.
Fig 3.5.2 Relative abundance of annotated genes by the top level of different functional
databases. Top panel: KEGG database. Middle panel: eggNOG database. Bottom panel:
CAZy database.
3.5.3 Abundance Heat-map of Functional Genes
The abundance distribution of the top 35 functional genes among all samples is
displayed in the following heat-map.
Fig 3.5.3 Abundance heat-map of functional genes. Plotted by sample name on X-axis
and genes on Y-axis. The absolute value of 'Z' represents the distance between the raw
score and the population mean in units of the standard deviation. 'Z' is negative when
the raw score is below the mean, positive when above.
3.6 Comparative Analysis (between samples)
3.6.1 PCA on Communities
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal components [23]. The
first principal component accounts for as much of the variability in the data as
possible, and each succeeding component accounts for as much of the remaining
variability as possible.
Figure 3.6.1 Principal component analysis on relative abundance at the phylum level.
Each point represents a sample, plotted by the second principal component on the
Y-axis and the first principal component on the X-axis, and colored by group.
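To make the definition of the first principal component concrete, the sketch below finds it by power iteration on the sample covariance matrix and reports the fraction of total variance it explains. This is a teaching sketch in pure Python, not the method used to draw the figures: the function name is hypothetical, the start vector is arbitrary, and a numerical library with a full eigendecomposition or SVD would normally be preferred.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained-variance fraction) of the first PC.

    rows: list of samples, each a list of feature values
          (e.g. relative abundances of phyla).
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two samples")
    # Centre each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    if total_variance == 0:
        raise ValueError("data has no variance")
    # Power iteration: repeated multiplication by the covariance matrix
    # converges towards the eigenvector with the largest eigenvalue.
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("start vector is orthogonal to the data")
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the matching eigenvalue.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    return vec, eigenvalue / total_variance
```

Projecting each centred sample onto the returned direction gives its coordinate on the first axis of a PCA plot. The second component can be obtained in the same way after removing the variance explained by the first.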
3.6.2 PCA on Functional Genes
Figure 3.6.2 Principal component analysis on the relative abundance of functional
genes from three databases. Top panel: KEGG database. Middle panel: eggNOG
database. Bottom panel: CAZy database. Each point represents a sample, plotted by the
second principal component on the Y-axis and the first principal component on the
X-axis, and colored by group.
3.6.3 Cluster Analysis on Communities
The Bray-Curtis index is a statistic used to quantify the compositional dissimilarity
between two different samples, based on the counts in each sample. In ecology and
biology, the Bray-Curtis dissimilarity is bounded between 0 and 1, where “0”
means the two sites have the same composition (they share all the species),
and “1” means the two sites do not share any species. At sites where BC is
intermediate (e.g. BC = 0.5), this index differs from other commonly used indices.
Figure 3.6.3 Clustering tree based on Bray–Curtis distance. Plotted with clustering
tree on the left and the relative phylum-level abundance map on the right side.
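The Bray-Curtis dissimilarity between two samples can be computed as the sum of absolute count differences divided by the sum of all counts. The sketch below assumes both samples are given as count vectors over the same ordered list of taxa; the function name is hypothetical.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two samples.

    counts_a, counts_b: non-negative counts for the same ordered taxa.
    Returns 0 for identical composition and 1 when no taxon is shared.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same taxa")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(a + b for a, b in zip(counts_a, counts_b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator
```

Computing this value for every pair of samples yields the distance matrix from which a clustering tree such as the one in Figure 3.6.3 can be built.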
3.6.4 Cluster Analysis on Functional Genes
Figure 3.6.4 Clustering tree based on Bray–Curtis distance. Plotted with clustering
tree in the centre and the functional genes relative abundance from top level of three
databases in the outer ring. Top panel: KEGG database. Middle panel: eggNOG
database. Bottom panel: CAZy database.
References
[1] Wooley J C, Godzik A, Friedberg I. A primer on metagenomics[J]. PLoS
computational biology, 2010, 6(2): e1000667.
[2] Teeling H, Glöckner F O. Current opportunities and challenges in microbial
metagenome analysis-a bioinformatic perspective[J]. Briefings in bioinformatics,
2012: bbs039.
[3] Thomas T, Gilbert J, Meyer F. Metagenomics-a guide from sampling to data
analysis[J]. Microb Inform Exp, 2012, 2(3).
[4] Tringe, S. G., Von Mering, C., Kobayashi, A., Salamov, A. A., Chen, K., Chang,
H. W., ... & Rubin, E. M. (2005). Comparative metagenomics of microbial
communities. Science, 308(5721), 554-557.
[5] Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K. S., Manichanh,
C., ... & Weissenbach, J. (2010). A human gut microbial gene catalogue established by
metagenomic sequencing. Nature, 464(7285), 59-65.
[6] Raes, J., Foerstner, K. U., & Bork, P. (2007). Get the most out of your
metagenome: computational analysis of environmental sequence data. Current
opinion in microbiology, 10(5), 490-498.
[7] Mende DR, Waller AS, Sunagawa S, Jarvelin AI, Chan MM, Arumugam M, Raes J, Bork P:
Assessment of metagenomic assembly using simulated next generation sequencing data.
PLoS One 2012, 7(2):e31386.
[8] Luo et al.: SOAPdenovo2: an empirically improved memory-efficient short-read
de novo assembler. GigaScience 2012 1:18.
[9] Edgar, R. C.(2010). Search and clustering orders of magnitude faster than BLAST.
Bioinformatics 26: 2460–2461
[10] Liu P, Fang X, Feng Z, Guo YM, et al. 2011. Direct sequencing and characterization
of a clinical isolate of Epstein-Barr virus from nasopharyngeal carcinoma tissue by
using next-generation sequencing technology. J. Virol. 85:11291–11299.
[11] Huson, Daniel H., et al. "Integrative analysis of environmental sequences using
MEGAN4." Genome research 21.9 (2011): 1552-1560.
[12] Yok NG, Rosen GL: Combining gene prediction methods to improve
metagenomic gene annotation. BMC Bioinformatics 2011, 12:20.
[13] Zhu, Wenhan, Alexandre Lomsadze, and Mark Borodovsky. "Ab initio gene
identification in metagenomic sequences." Nucleic acids research 38.12 (2010):
e132-e132.
[14] Karlsson FH, Tremaroli V, Nookaew I, Bergstrom G, Behre CJ, Fagerberg B,
Nielsen J, Backhed F: Gut metagenome in European women with normal, impaired
and diabetic glucose control. Nature 2013, 498(7452):99-103.
[15] Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets
of protein or nucleotide sequences. Bioinformatics 2006, 22(13):1658-1659.
[16] Fu L, Niu B, Zhu Z, Wu S, Li W: CD-HIT: accelerated for clustering the
next-generation sequencing data. Bioinformatics 2012, 28(23):3150-3152.
[17] Kanehisa M (1997). A database for post-genome analysis. Trends Genet 13 (9):
375–376.
[18] Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004). The KEGG
resource for deciphering the genome. Nucleic Acids Res 32 (Database issue):
D277–280.
[19] Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al
(2006). From genomics to chemical genomics: new developments in KEGG. Nucleic
Acids Res 34(Database issue): D354–357.
[20] Powell S, Forslund K, Szklarczyk D, et al (2014). eggNOG v4.0: nested
orthology inference across 3686 organisms. Nucleic Acids Res 42 (Database issue):
D231–239.
[21] Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B
(2009). The Carbohydrate-Active EnZymes database (CAZy): an expert resource for
Glycogenomics. Nucleic Acids Res 37:D233-238.
[22] Lombard V, Ramulu HG, Drula E, Coutinho PM, and Henrissat B (2014). The
carbohydrate-active enzymes database (CAZy) in 2013. Nucleic Acids Res 42
(Database issue): D490–495.
[23] Ekaterina A, Frisli T, and Rudi K. De novo Semi-alignment of 16S rRNA Gene
Sequences for Deep Phylogenetic Characterization of Next Generation Sequencing
Data. Microbes and Environments 28.2 (2013): 211-216.