rdp release 11 - xander assembler...
TRANSCRIPT
-
Xander – Gene Targeted Metagenomic Assembler
Qiong Wang
Center for Microbial Ecology Dept. of Plant, Soil and Microbial Sciences
Michigan State University
June 4th, 2015
-
Genome Assembly
• Genome assembly
  – Repeats are a major problem
  – Short reads with different error profiles
• Metagenomic bulk assembly
  – Assumes uniform abundance, which does not hold for metagenomic data
  – Metagenomes are highly diverse
  – Big data: space and time complexity force low-abundance reads to be discarded before assembly
-
Profile Hidden Markov Models
• Widely used in many fields
  – e.g. voice recognition
• Protein and nucleic acid HMMs
  – Powerful gene search and assignment tool
  – Built from multiple sequence alignments
• Probabilistic models of a linear system that changes state according to a transition rule
• The next state depends only on the current state, independent of any earlier states
• A profile HMM has three state types (match, insert, delete), 7 transitions between states, and transition and emission probabilities
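The state and transition counts above can be made concrete with a small sketch. It assumes the seven transitions are those of the Plan 7 architecture used by HMMER (M→M, M→I, M→D, I→M, I→I, D→M, D→D); the probabilities are invented for illustration and do not come from any real model.

```python
# Illustrative transition table for one model position.
# Assumption: Plan 7 layout (no I->D or D->I); probabilities are made up.
TRANSITIONS = {
    "M": {"M": 0.90, "I": 0.05, "D": 0.05},
    "I": {"M": 0.60, "I": 0.40},
    "D": {"M": 0.70, "D": 0.30},
}


def count_transitions(transitions):
    """Count the allowed state-to-state transitions."""
    return sum(len(targets) for targets in transitions.values())


def path_probability(path, transitions):
    """Probability of a state path under the Markov property:
    each step depends only on the current state."""
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= transitions[current].get(nxt, 0.0)  # disallowed step -> 0
    return prob


if __name__ == "__main__":
    print(count_transitions(TRANSITIONS))        # three state types, seven transitions
    print(path_probability("MIM", TRANSITIONS))  # M->I then I->M
```

Emission probabilities are left out here; a full profile HMM would also score the residue emitted in each match or insert state.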
-
Protein Profile Hidden Markov Model
Add insert states for extra residues
[Figure: profile HMM with insert, match and delete states; example residue column: I L R K V −]
-
Xander: Gene-Targeted Assembler Combining de Bruijn Graph and HMM
[Figure: a de Bruijn graph (kmers gagccg, ccggga, ccgagc, …) and a profile hidden Markov model (states M, I and D at positions 56 to 58, …) are combined into the Xander weighted assembly graph. Each vertex of the combined graph pairs an HMM state and position with a kmer: (M, gagccg, 57) connects to (I, ccggga, 57), (M, ccggga, 58), (I, ccgagc, 57), (M, ccgagc, 58) and (D, gagccg, 58).]
Wang et al., 2015, Xander: Employing a Novel Method for Efficient Gene-Targeted Metagenomic Assembly. In revision
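The combined graph on this slide can be sketched in a few lines. This is a minimal illustration and not Xander's implementation: it assumes a vertex is a (state, kmer, position) triple, that a match or insert step advances the kmer by one codon (three bases), and that a delete step advances only the model position. The function names are hypothetical, and HMM transition restrictions and edge weights are ignored.

```python
from itertools import product


def build_kmer_set(reads, k):
    """All kmers of length k seen in the reads (a de Bruijn graph as a set)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def codon_successors(kmer, graph):
    """Kmers reached by three single-base extensions that all stay in the graph."""
    found = []
    for codon in product("acgt", repeat=3):
        current = kmer
        for base in codon:
            current = current[1:] + base
            if current not in graph:
                break
        else:
            found.append(current)
    return sorted(found)


def neighbors(vertex, graph):
    """Expand one combined vertex (state, kmer, model position)."""
    _state, kmer, pos = vertex
    result = [("D", kmer, pos + 1)]          # delete: same kmer, next position
    for nxt in codon_successors(kmer, graph):
        result.append(("M", nxt, pos + 1))   # match: next kmer, next position
        result.append(("I", nxt, pos))       # insert: next kmer, same position
    return result


if __name__ == "__main__":
    graph = build_kmer_set(["gagccggga", "gagccgagc"], k=6)
    for v in neighbors(("M", "gagccg", 57), graph):
        print(v)
```

With the two toy reads above, expanding (M, gagccg, 57) yields the same five neighbors that appear in the slide's figure.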
-
HMM-Guided Graph Search
-
HMP Defined Community
http://rdp.cme.msu.edu
Organism Name                          Strain                 Accession Number
Streptococcus mutans                   NN2025 DNA             NC_013928 (AP010655) †
Listeria monocytogenes                 L99 serovar 4a         NC_003210 (FM211688) †
Acinetobacter baumannii                ATCC 17978             NC_009085.1 (CP000521)
Actinomyces odontolyticus              ATCC 17982             DS264586.1
Bacillus cereus                        ATCC 10987             AE017194.1
Bacteroides vulgatus                   ATCC 8482              CP000139.1
Candida albicans*                      SC5314 Assembly 21     N/A
Clostridium beijerinckii               NCIMB 8052             CP000721.1
Deinococcus radiodurans                R1 chromosome 1        AE000513.1
Enterococcus faecalis                  OG1RF chromosome       CP002621.1
Escherichia coli                       K12                    NC_000913.2
Helicobacter pylori                    26695                  NC_000915.1
Lactobacillus gasseri                  ATCC 33323             NC_008530.1
Methanobrevibacter smithii*            ATCC 35061             NC_009515.1
Neisseria meningitidis                 MC58                   NC_003112.2
Propionibacterium acnes                KPA171202              NC_006085.1
Pseudomonas aeruginosa                 PAO1                   NC_002516.2
Rhodobacter sphaeroides                2.4.1 chromosome 1     NC_007493.1
Staphylococcus aureus subsp. aureus    USA300 TCH1516         NC_010079.1
Staphylococcus epidermidis             ATCC 12228             NC_004461.1
Streptococcus agalactiae               2603V/R                NC_004116.1
Streptococcus pneumoniae               TIGR4                  NC_003028.3
-
Xander Validation
Dataset: HMP defined community data (SRR172902, SRR172903), 1,037 Mbp of 75 bp Illumina reads
Conclusion: kmer length 45, prune 20 and count 1 work well
Count: minimum occurrence of kmers to be included in the graph
Prune: stop the search if the score has not improved in # of vertices
Accuracy measurements:
1. Number of errors found
2. Number of chimeric contigs formed
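The count and prune parameters can be illustrated with a short sketch. It is a simplified stand-in and not Xander's code: `graph_kmers` keeps kmers seen at least `count` times, and `pruned_walk` scans a precomputed list of path scores (a hypothetical input) and stops once `prune` consecutive vertices bring no improvement.

```python
from collections import Counter


def graph_kmers(reads, k, count=1):
    """Kmers that occur at least `count` times in the reads."""
    seen = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, n in seen.items() if n >= count}


def pruned_walk(scores, prune):
    """Return (index of best score, vertices visited).

    The walk stops after `prune` consecutive vertices without a new best score.
    """
    best, best_index, since_improved, visited = float("-inf"), -1, 0, 0
    for index, score in enumerate(scores):
        visited += 1
        if score > best:
            best, best_index, since_improved = score, index, 0
        else:
            since_improved += 1
            if since_improved >= prune:
                break
    return best_index, visited


if __name__ == "__main__":
    print(sorted(graph_kmers(["acgta", "acgtt"], k=4, count=2)))
    print(pruned_walk([1, 2, 3, 2, 2, 2, 10], prune=3))
    print(pruned_walk([1, 2, 3, 2, 2, 2, 10], prune=20))
```

A small prune value ends the search early and can miss a later, better-scoring extension; a larger value explores further at the cost of more work.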
-
Comparison to SAT-Assembler
SAT-Assembler (Zhang Y et al., PLoS Comput. Biology. 2014)
Target gene: 50S ribosomal subunit protein L2 (rplB) (average length 825 bp)
Xander was run with the prune 20, kmer 45 and count 1 setting.
• Xander recovered full or near full-length genes (94.6% median coverage) of 4 HMP defined members.
• SAT recovered at most 79.9% of the gene for only 3 members. All contigs missed both ends.
Sample                          HMP (14.5 M reads)    HMP & Corn (24.7 M reads)
                                SAT       Xander      SAT       Xander
# contigs                       4         6           *
# members recovered             3         4           *         9
Median gene coverage (%)        75.7      94.6        *         100
Max gene coverage (%)           79.9      100         *         100
Median % nucleotide identity    97.8      99.8        *         90.3
Max % nucleotide identity       99.8      100         *         100
Time (min)                      12 (a)    5 (a)       *         738 (b)
* SAT did not complete after 100 h.
(a) on iMac, 3.2 GHz Intel Core i5
(b) MSU HPCC network drive
-
Biofuel Crops and Nitrogen Cycling Genes
Crops: Miscanthus, Switchgrass, Corn
amoA: ammonia monooxygenase
nifH: nitrogen fixation
nirK / nirS: nitrite reductase
norB: nitric oxide reductase
nosZ: nitrous oxide reductase
rplB: 50S ribosomal subunit protein L2
-
Rhizosphere Soil Data, Bulk Assembly, nirK
Sample Name                        Corn    Miscanthus    Switchgrass
File size (GB)                     349     325           277
Data size (Gbp)                    293     275           233
# protein contig clusters (99%)    41      37            39
# OTUs at 95% aa identity          38      33            34
Median length (aa)                 131     115           130
Max length (aa)                    234     252           301
Median % aa identity               75.6    79.6          73.3
Max % aa identity                  95.1    94.3          92.1
# reads covering kmers             105     123           106
Gene Abundance                     0.25    0.25          0.3
7 replicates from each crop from the KBS intensive site, one sample per lane of Illumina HiSeq; replicates were pooled before assembly.
Assembled using the khmer protocol (provided by Jiarong Guo; Howe et al., 2014. PNAS).
-
Rhizosphere Soil Data, Xander Assembly
Gene                         nirK                     nifH                  rplB
Crop                         C      M      S          C      M      S       C       M       S
# chimeric clusters          16     207    11         0      1      0       14      28      44
# protein contig clusters    1993   1807   1581       39     57     41      19287   20463   17334
# OTUs at 95% aa identity    741    674    582        14     24     17      6100    6887    6004
Median (aa)                  215    230    208        294    256    255     274     274     274
Longest (aa)                 380    372    370        296    296    296     285     285     284
Median % aa identity         88.3   84.7   87.8       92.7   91.9   91.6    77.7    75.8    76.3
Max % aa identity            100    99.4   98.6       100    100    100     100     100     100
# reads covering kmers       27404  19815  16661      411    534    461     225985  179867  149661
Gene Abundance               0.121  0.11   0.111      0.002  0.003  0.003
C = Corn, M = Miscanthus, S = Switchgrass
Use the rplB gene to normalize gene abundance.
Read ratio: # reads covering kmers in gene contigs / # reads covering kmers in rplB contigs
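As a worked check of the read ratio, the sketch below recomputes the Gene Abundance row from the "# reads covering kmers" row of the table. The read counts are copied from the slide; the dictionary layout and function name are only illustrative.

```python
# "# reads covering kmers" per gene and crop, copied from the slide
# (C = corn, M = Miscanthus, S = switchgrass).
READS = {
    "nirK": {"C": 27404, "M": 19815, "S": 16661},
    "nifH": {"C": 411, "M": 534, "S": 461},
    "rplB": {"C": 225985, "M": 179867, "S": 149661},
}


def read_ratio(gene, crop, reads=READS):
    """Reads covering kmers in the gene's contigs divided by the rplB count."""
    return reads[gene][crop] / reads["rplB"][crop]


if __name__ == "__main__":
    for gene in ("nirK", "nifH"):
        print(gene, [round(read_ratio(gene, crop), 3) for crop in "CMS"])
```

Rounded to three decimals, the ratios agree with the Gene Abundance values reported for nirK and nifH.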
-
nirK Kmer Abundance
Kmer abundance of nitrite reductase gene (nirK) representative contigs assembled by Xander from the pooled rhizosphere samples. More than 35% of kmers of length 45 in the contigs occurred only once in the reads.
[Figure: fraction of kmers (log scale, 1x10-6 to 1x10+0) against kmer abundance (1 to 201) for Corn, Miscanthus and Switchgrass]
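The quantity plotted here, the fraction of contig kmers at each read abundance, can be computed as in the sketch below. It is a minimal illustration: `read_counts` is assumed to be a precomputed mapping from kmer to its number of occurrences in the reads, and the toy inputs are invented.

```python
from collections import Counter


def abundance_fractions(contigs, read_counts, k):
    """Fraction of distinct contig kmers observed at each abundance in the reads."""
    kmers = {seq[i:i + k] for seq in contigs for i in range(len(seq) - k + 1)}
    histogram = Counter(read_counts.get(kmer, 0) for kmer in kmers)
    return {abundance: n / len(kmers) for abundance, n in sorted(histogram.items())}


if __name__ == "__main__":
    toy_counts = {"acg": 1, "cgt": 1, "gta": 5}  # invented read counts
    print(abundance_fractions(["acgta"], toy_counts, k=3))
```

The value at abundance 1 is the share of kmers that occurred only once in the reads, the figure the slide reports as more than 35% for nirK.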
-
Mean Kmer Coverage
[Figure: two panels, nirK and rplB, showing number of contigs against mean kmer coverage for Corn, Miscanthus and Switchgrass]
Mean kmer coverage of a contig: mean number of reads containing each kmer in a contig. Counts for kmers that occurred in multiple contigs were equally divided. Representative contigs were chosen from clusters at 99% aa identity.
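The definition above translates into a short calculation. This sketch assumes `read_kmer_counts` already holds, for each kmer, the number of reads containing it; the names and toy data are illustrative and not taken from the Xander code.

```python
from collections import Counter


def mean_kmer_coverage(contigs, read_kmer_counts, k):
    """Mean read count per kmer for each contig.

    Counts for kmers shared by several contigs are divided equally among them.
    """
    contig_kmers = {
        name: [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for name, seq in contigs.items()
    }
    # Number of contigs that contain each kmer.
    sharing = Counter()
    for kmers in contig_kmers.values():
        sharing.update(set(kmers))

    coverage = {}
    for name, kmers in contig_kmers.items():
        total = sum(read_kmer_counts.get(km, 0) / sharing[km] for km in kmers)
        coverage[name] = total / len(kmers) if kmers else 0.0
    return coverage


if __name__ == "__main__":
    contigs = {"a": "acgt", "b": "cgta"}
    counts = {"acg": 4, "cgt": 6, "gta": 2}  # invented read counts
    print(mean_kmer_coverage(contigs, counts, k=3))
```

In the toy example the kmer cgt occurs in both contigs, so each contig is credited with half of its six reads.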
-
Taxonomic Abundance rplB
[Figure panels: Xander, rplB; Shotgun, 16S]
Acidobacteria has few (
-
Taxonomic Abundance nirK
[Figure: percent abundance (0% to 100%) for Corn, Miscanthus and Switchgrass. Legend: Fungi, Thermobaculum, Firmicutes, Spirochaetes, Bacteroidetes, Environmental, Chloroflexi, Verrucomicrobia, Deltaproteobacteria, Gammaproteobacteria, Betaproteobacteria, Alphaproteobacteria]
15% of the nirK contigs were closest match to rplB from Bradyrhizobium japonicum USDA 110.
The other top matches were: Ralstonia pickettii 12J, Rhodanobacter fulvus Jip2
-
PCA Analysis using OTU abundance at 95% aa identity
[Figure: two PCA plots of the crop samples (C, M, S), panels nirK and rplB. Axes: PC1 8.54% and PC2 5.91% in the first plot, PC1 6.58% and PC2 5.33% in the second]
-
Multipath to Find Sequence Heterogeneity
Xander can find multiple paths using Yen’s k shortest path algorithm
1 starting kmer, 1000 paths, 37 unique contigs
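To show what "k shortest paths" means here, the sketch below lists the k lowest-cost loop-free paths in a small weighted graph. It is not Yen's algorithm: it is a simpler best-first enumeration of partial paths that gives the same answer on small graphs with non-negative weights, but scales far worse. The graph and its weights are invented.

```python
import heapq


def k_shortest_paths(graph, source, target, k):
    """Return up to k (cost, path) pairs in order of increasing cost.

    graph: {node: {neighbor: non-negative weight}}.
    """
    heap = [(0.0, [source])]
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in path:  # keep paths loop-free
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return found


if __name__ == "__main__":
    toy = {"s": {"a": 1, "b": 2}, "a": {"t": 1, "b": 1}, "b": {"t": 1}}
    for cost, path in k_shortest_paths(toy, "s", "t", k=3):
        print(cost, "-".join(path))
```

In an assembly setting, distinct paths can spell the same sequence, which is consistent with the slide's 1000 paths collapsing to 37 unique contigs.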
-
Xander Gene-targeted Assembly Processing Statistics
Sample Name                  Mock    K312    C1      7 Corns
Data size (GB)               1.7     7.4     46      349
Build graph (GB)             1       8       50      200
Build graph time (h)         0.3     0.4     6.4     41
Find starting kmers (h) *    0.1     0.5     3.6     27.0
Search contigs *             min     min     h       h
  nifH                       0.3     0.1     0.02    0.1
  nirK                       NA      0.7     0.8     36.7
  rplB                       1.1     7       3.8     49.4
Processing time measured on the MSU HPCC network drive, single CPU.
* Can be multithreaded or run in parallel.
1 lane of an Illumina HiSeq run is processed in < 20 h.
-
Xander Gene Assembly Workflow
[Workflow diagram with boxes: Modified HMMER3, Xander Build, Xander Search, Quality Filtering, Quality-filtered Genes, Post-Assembly Analysis]
Post-Assembly Analysis: read mapping, gene coverage, nearest neighbor assignments, taxonomic abundance, …
-
Xander Assembly Prep Steps
1. Build specialized forward and reverse HMMs
  • Input: a small set of aligned seed sequences (using original HMMER3 and HMMs from FunGene)
  • Output: forward and reverse HMMs for Xander, built using our modified HMMER3 (HMMER3-mod), which is tuned to detect close homologs
2. Identify starting kmers
  • Input 1: a larger set of reference sequences (covering all possible diversity), aligned by the forward HMMs using HMMER3-mod
  • Input 2: read files
  • Output: starting nucleotide kmers, alignment positions, HMM states
  • Multiple genes can be run together
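Step 2 can be pictured with the sketch below. It rests on a simplifying assumption that is not stated on the slide: a read kmer qualifies as a starting kmer when its protein translation exactly matches a peptide of the same length in an aligned reference sequence, and the peptide's offset serves as the alignment position. Only the forward strand and ungapped references are handled, and all function names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nt):
    """Translate a nucleotide string in frame 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3))


def starting_kmers(reads, reference_protein, k):
    """Return (nucleotide kmer, reference position) for kmers whose translation
    matches a peptide of length k // 3 in the reference protein."""
    aa_len = k // 3
    ref_positions = {}
    for pos in range(len(reference_protein) - aa_len + 1):
        ref_positions.setdefault(reference_protein[pos:pos + aa_len], pos)
    hits = []
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            pos = ref_positions.get(translate(kmer))
            if pos is not None:
                hits.append((kmer, pos))
    return hits


if __name__ == "__main__":
    print(starting_kmers(["ATGAAAGTT"], "MKV", k=6))
```

Mapping the offset to an HMM state would additionally need the alignment of the reference to the model, which this sketch leaves out.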
-
Xander Assembly Steps
3. Build de Bruijn graph
  • Input: read files
  • Output: de Bruijn graph structure
4. Assemble one path for each direction for each start, then combine into one contig
  • Input 1: forward and reverse HMMs
  • Input 2: de Bruijn graph
  • Input 3: starting kmers
  • Output: nucleotide and protein contigs
5. Quality filter
  • Length cutoff and HMM score cutoff
  • Cluster at 99%, choose longest contigs (RDP mcClust)
  • Chimera removal (UCHIME)
  • Outputs: quality-filtered contigs
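The first two parts of the quality filter in step 5 can be sketched as follows. The clustering and chimera checks themselves are done by RDP mcClust and UCHIME and are not reproduced here: the `cluster` field is assumed to hold a cluster id assigned elsewhere, and the `Contig` class, cutoffs and toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    protein: str
    hmm_score: float
    cluster: int  # hypothetical id from clustering at 99% identity, done elsewhere


def quality_filter(contigs, min_length, min_score):
    """Apply length and HMM score cutoffs, then keep the longest contig per cluster."""
    longest = {}
    for contig in contigs:
        if len(contig.protein) < min_length or contig.hmm_score < min_score:
            continue
        current = longest.get(contig.cluster)
        if current is None or len(contig.protein) > len(current.protein):
            longest[contig.cluster] = contig
    return [longest[cluster] for cluster in sorted(longest)]


if __name__ == "__main__":
    toy = [
        Contig("c1", "M" * 150, 120.0, cluster=0),
        Contig("c2", "M" * 200, 110.0, cluster=0),
        Contig("c3", "M" * 90, 300.0, cluster=1),   # too short
        Contig("c4", "M" * 180, 20.0, cluster=2),   # score too low
    ]
    print([c.name for c in quality_filter(toy, min_length=100, min_score=50.0)])
```

Suitable cutoffs depend on the gene and dataset, so the values used here are placeholders.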
-
Xander Post-Assembly Analysis
6. Read mapping (RDP KmerFilter)
  • Input: quality-filtered contigs
  • Output: contig coverage, kmer abundance
7. Nearest neighbor assignment, taxonomic abundance
  • Input: quality-filtered contigs
  • Input: reference seqs
  • Input: contig coverage
  • Output: nearest matches
  • Taxonomic abundance adjusted by coverage
8. Beta-diversity analysis (multiple samples)
  • Input: quality-filtered aligned protein contigs
  • Input: contig coverage
  • Output: coverage-adjusted OTU abundance matrix
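The coverage-adjusted OTU abundance matrix of step 8 amounts to summing contig coverage per OTU in each sample. The sketch below assumes that each contig has already been assigned to an OTU and given a coverage value; the input layout and toy numbers are invented.

```python
from collections import defaultdict


def otu_abundance_matrix(samples):
    """Build a sample-by-OTU matrix of summed contig coverage.

    samples: {sample name: [(otu id, contig coverage), ...]}.
    Returns the sorted OTU ids and one row of totals per sample.
    """
    otus = sorted({otu for rows in samples.values() for otu, _ in rows})
    matrix = {}
    for sample, rows in samples.items():
        totals = defaultdict(float)
        for otu, coverage in rows:
            totals[otu] += coverage
        matrix[sample] = [totals[otu] for otu in otus]
    return otus, matrix


if __name__ == "__main__":
    toy = {
        "corn": [("otu1", 2.5), ("otu1", 1.5), ("otu2", 3.0)],
        "switchgrass": [("otu2", 1.0), ("otu3", 4.0)],
    }
    otus, matrix = otu_abundance_matrix(toy)
    print(otus)
    for sample, row in matrix.items():
        print(sample, row)
```

A matrix of this shape is the usual input to ordination methods such as the PCA shown earlier.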
-
Xander – User Efficient
• Set up Xander
  – GitHub repo https://github.com/rdpstaff/Xander_assembler
  – Step-by-step instructions
  – Preconfigured with the rplB gene and nitrogen cycling genes including nirK, nirS, nifH, nosZ, norB and amoA
• Prepare the HMMs; this step requires biological insight!
  – Get reference sequences for gene(s) (FunGene, or literature search)
  – Build specialized HMMs for Xander
• Get metagenomic data
• Run the Xander assembly
  – Choose the right parameters for your dataset, see instructions
-
Summary
• Compared with a recent targeted-gene assembler and a recent bulk assembly method, Xander assembled more gene contigs, which were longer and shared higher % aa identity with known references.
• Detects low-abundance genes and low-abundance organisms.
• Provides gene abundance and kmer abundance estimates.
• HMMs can be tailored to the targeted genes, allowing flexibility to improve annotation over generic annotation pipelines.
• A larger kmer size improves quality by reducing chimeras, but may result in shorter contigs.
-
Acknowledgements
James Cole, James Tiedje, Qiong Wang, Jordan Fish, Mariah Gilman, Yanni Sun, C. Titus Brown, Jiarong Guo