TRANSCRIPT
National Center for Emerging and Zoonotic Infectious Diseases
Division of Scientific Resources
Sequencing and Bioinformatics in the CDC Biotechnology Core Facility Branch
Computational Lab
• Scott Sammons, Team Lead
• Kevin Tang
• Kristen Knipe
Sequencing Lab
• Mike Frace, Team Lead
• Lori Rowe
• Marina Khristova
• Mark Burroughs
• Milli Sheth
Genome Sequencing Lab sequencing platforms
• PacBio SMRT sequencer
• Ion Torrent PGM
• Illumina MiSeq
• Roche 454 Titanium + Illumina GA IIx
• Illumina 2500
Building 23 Server Room – Main ISLE
High Performance Computing Cluster (Aspen)
• What is it?
– 35 compute nodes each with 12 processor cores, total of 420 cores, 110GB of memory, and 2 Tesla 2050 GPU cards
• What can it do today?
– 40 cluster applications are currently enabled, including MatLab, Beast, MrBayes, Blast, MPI Blast, PacBio analysis tools, Celera Assembler, CLC Server, Geneious Server
Isilon
• What is it?
– High speed, scalable, and redundant Network Attached Storage
– Connected to both the CDC network and the Aspen HPC cluster utilizing Infiniband
– Total of 500TB usable space
• What can it do today?
– It provides user workspace for end-users and HPC applications
– Solves the problem of being out of disk space on individual servers
• What are we doing with it?
– Data warehouse for all scientific equipment
– Central network share for all scientific users
– Integrating directly with ITSO’s Active Directory forest
Private Cloud
• What is it?
– Support science through front-end and back-end services
– Implementation of virtualized infrastructure
– Currently in the process of being deployed
• What can it do today?
– Provide test environments for scientific projects
– Lay the foundation for hardware consolidation and migration
• What are we doing with it?
– Standardize platforms
– Centralize management
Sequencing Lab Origins
• Began in 2001
• Mission: sequence 8 human smallpox viruses before the WHO revisits destruction of all smallpox stocks
• By 2005, had sequenced over 150 smallpox and related poxvirus genomes
• 2006: Roche 454, focus moved to small bacterial genomes
• 2010: Illumina GAIIx
• 2011: Ion Torrent, PacBio
[Figure: position of E-PCR overlapping amplicons A1 to A18, running from End-L to End-R, drawn against the HindIII map (fragments D, P, O, C, E, K, H, M, L, I, F, N, A, J, B, G, S, R, Q)]
Primers designed using VAR-BSH and VAC-CPN sequences
Primers target genes involved in reproduction & host response
Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites
PCR uses minimal DNA amounts, often no need to grow virus
PCR uses hifi expand long-template Taq & Pwo enzymes (Roche)
Sequencing: extended PCR
Sequencing Assembly: Phred/Phrap/Consed
Gene Prediction
• Heuristic algorithm to assign quality scores to ORFs (from 1 to 100)
• Quality scores are based on a number of factors including
– Gene predictions (glimmer, genemark, getorf)
– Primary sequence homology to known genes (BLAST)
– Presence of predicted promoter (MEME/MAST)
– Size of predicted ORF
– Presence of transcription termination signals
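The scoring factors listed above could be combined roughly as in the following sketch. The slides do not give the actual formula, so the weights, the `OrfEvidence` fields, and the `full_size_nt` cutoff are all hypothetical; only the factor names and the 1 to 100 range come from the slide.

```python
from dataclasses import dataclass

# Hypothetical weights (sum to 100); the real heuristic is not described in the slides.
WEIGHTS = {
    "predictors": 40.0,  # agreement among glimmer, genemark, getorf
    "homology": 30.0,    # BLAST homology to a known gene
    "promoter": 10.0,    # predicted promoter (MEME/MAST)
    "size": 10.0,        # size of the predicted ORF
    "terminator": 10.0,  # transcription termination signal
}


@dataclass
class OrfEvidence:
    predictors_agreeing: int  # how many of the 3 gene predictors called this ORF
    has_blast_hit: bool
    has_promoter: bool
    length_nt: int
    has_terminator: bool


def orf_quality(e: OrfEvidence, full_size_nt: int = 300) -> int:
    """Combine the evidence into an integer score clamped to 1..100."""
    score = WEIGHTS["predictors"] * e.predictors_agreeing / 3
    score += WEIGHTS["homology"] * e.has_blast_hit
    score += WEIGHTS["promoter"] * e.has_promoter
    # ORFs at or above full_size_nt (an assumed cutoff) get full size credit.
    score += WEIGHTS["size"] * min(e.length_nt / full_size_nt, 1.0)
    score += WEIGHTS["terminator"] * e.has_terminator
    return max(1, min(100, round(score)))


strong = orf_quality(OrfEvidence(3, True, True, 900, True))
weak = orf_quality(OrfEvidence(0, False, False, 60, False))
```

A weighted sum keeps each factor's contribution easy to inspect, which matters when a curator has to decide why a borderline ORF was kept or dropped.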
Visualizing Gene Predictions and Differences
[Figure: ORFs of CPXVs from 4 different clades; labels: ITR (both genome ends), crm-D]
45 Smallpox Strains
A. West African int. CFR ~10%
B. American alastrim minor CFR <1%
C. Asian major CFR ~5 - 35%
C-1. non-West-African African int. CFR ~10%
C-2. non-West-African African minor CFR <1%
Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein
[Figure: unrooted tree with tips labeled by GenBank accession number. Labeled groups: Cowpox clade I, Cowpox clade II, Cowpox clade III (CPXV91_ger3), Cowpox clade IV (CPXV90_ger2), Vaccinia, Monkeypox, Camelpox, Taterapox, Variola, Ectromelia]
GSL sequencing 2013
NCEZID: Vibrio cholerae, Vibrio spp., Campylobacter, Salmonella, Bacillus anthracis, Listeria, Burkholderia spp., Yersinia pestis, Brucella spp., Klebsiella pneumoniae, fungal meningitis, Rift Valley fever virus, Lujo virus, Marburg virus, CCHF virus, Lassa fever virus, clinical sample metagenomics
NCIRD: Haemophilus influenzae, Legionella pneumophila, Legionella spp., Mycoplasma pneumoniae, water cooling tower metagenomics, respiratory filter metagenomics, Bordetella spp., tick metagenomics
NCHHSTP: Neisseria spp., Hepatitis, Mycobacterium tuberculosis
INFLUENZA
CGH: Rhodococcus, Cryptosporidium, Fasciola spp., Balamuthia spp.
Next-Gen Diagnostic Sequencing Applications
‘Massively parallel’ sequencing not only produces throughput, it provides sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low expression quasi-species or mutations which may signal growing drug or vaccine resistance, a process called ultra-deep or amplicon sequencing.
Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug.
Complex clinical, laboratory or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms, an approach called metagenomic sequencing.
Examples: tissue culture, soil, blood serum, sputum, stool
Shotgun / Paired-End Sequencing: random shearing of DNA, even sequence coverage over entire genome.
Shotgun / Paired-End Sequencing
De novo Assembly
• Newbler
• CLCBio
• Mira
• Geneious
• Velvet
• Celera Assembler
Reference Mapping
• Newbler
• CLCBio
• Mira
• Geneious
• BWA
• Bowtie
Genome Assembly Visualization
Genome Comparison
HGAP – Hierarchical Genome Assembly Process
• PreAssembly
– Generation of long accurate reads
• Assembly
– Choice of assemblers, but OLC (Overlap Layout Consensus) assemblers such as MIRA and Celera Assembler are best
• Consensus Polishing
– Quiver, a quality-aware consensus algorithm, maps all reads back to the assembly and creates a new consensus
HGAP: PreAssembly
30X
HGAP: PreAssembly/Assembly
• Correct seed reads with short reads
• Assemble with Celera Assembler or MIRA
HGAP - Quiver
• To reduce the remaining InDel and base substitution errors in the draft assembly, we use PacBio Quiver, a quality-aware consensus algorithm. Four different per-base Quality Values (QV scores) represent the intrinsically calculated error probabilities for inserted, deleted, substituted and merged base calls in single pass reads. These values allow Quiver to generate a highly accurate consensus for the final assembly, which frequently exceeds QV50 (99.999% accuracy).
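The QV50 figure quoted above follows from the standard Phred scale, where QV = -10 · log10(error probability). A minimal sketch of the conversion in both directions (the function names are illustrative, not part of any PacBio tool):

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Phred-scaled quality value -> per-base accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-qv / 10)


def error_rate_to_qv(p_error: float) -> float:
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p_error)


acc_qv50 = qv_to_accuracy(50)      # one expected error per 100,000 bases
qv_from_1e5 = error_rate_to_qv(1e-5)
```

On this scale each additional 10 QV points means a tenfold drop in expected errors, so QV50 corresponds to roughly one error per 100,000 consensus bases.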
HGAP Example
HGAP Confirmation with Physical Mapping
HGAP Assembly Structural Confirmation
HGAP Sequence Confirmation with Illumina reads
Amplicon (deep) sequencing project
• Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient
• Pox antiviral ST-246 administered; it targets pox gene F13L, which encodes a major envelope protein that mediates production of extracellular virus
• Oral ST-246 given daily and vaccination site sampled over a 3-week period
Li, Damon - NCZEID/DVRD/PRB
A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000)
[Figure panels: deep-sequencing results for the F13L amplicon]
• Control swab prior to ST-246
• 2 weeks after ST-246: C > T 869, T > A 943
• 3 weeks after ST-246: C > T 869, T > A 943
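Detecting minority variants like those in the panels above comes down to counting bases column by column across many reads of the same amplicon. The sketch below assumes the reads are already aligned to the reference without gaps and start at reference position 1; the toy sequences, the `min_fraction` threshold, and the function name are hypothetical and not taken from the project.

```python
from collections import Counter


def variant_frequencies(reference, reads, min_fraction=0.01):
    """Return {(position, ref_base, alt_base): fraction}; positions are 1-based."""
    variants = {}
    for i, ref_base in enumerate(reference):
        # Collect the base each read shows at this column, skipping N or short reads.
        column = [r[i] for r in reads if i < len(r) and r[i] in "ACGT"]
        if not column:
            continue
        for base, n in Counter(column).items():
            fraction = n / len(column)
            if base != ref_base and fraction >= min_fraction:
                variants[(i + 1, ref_base, base)] = fraction
    return variants


# Toy data: 5 of 100 reads carry a C > T change at position 5.
reference = "ACGTCA"
reads = ["ACGTCA"] * 95 + ["ACGTTA"] * 5
result = variant_frequencies(reference, reads)
```

In practice the threshold has to sit above the platform's per-base error rate, otherwise sequencing errors are reported as low-frequency variants.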
What is Metagenomics?
• The genomic study of DNA from uncultured microorganisms, generally from environmental samples
• Related
– Metatranscriptomics
– Metaproteomics
Sample Coverage: Rarefaction Curves
[Figure: rarefaction curves; x-axis: samples]
Wooley JC, Godzik A, Friedberg I, 2010. A Primer on Metagenomics. PLoS Comput Biol 6(2)
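A rarefaction curve plots how many distinct taxa have been seen as more of the sample is examined; a curve that flattens suggests the sample has been covered. The following is a minimal sketch using one random ordering of toy taxon labels (a real analysis would average over many random subsamples, and the labels and `step` value here are invented for illustration).

```python
import random


def rarefaction_curve(taxon_labels, step, seed=0):
    """Return [(items_examined, distinct_taxa_seen), ...] for one random ordering."""
    rng = random.Random(seed)
    shuffled = list(taxon_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, taxon in enumerate(shuffled, start=1):
        seen.add(taxon)
        # Record a point every `step` items and at the very end.
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Toy community: one dominant taxon and three progressively rarer ones.
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 15 + ["D"] * 5
curve = rarefaction_curve(labels, step=10)
```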
Classification Techniques
• Supervised Taxonomic Classification
  • Homology-based
    • Database searching by similarity (BLAST, SW)
    • BLAST, BLASTX: GenBank; specialized DBs: NCBI-ENV-NT, NCBI-ENV-NR
  • Composition-based
    • N-mer frequency
    • Markov Models, Support Vector Machines (SVM); need training set
• Unsupervised Taxonomic Classification
  • Clustering methods
    • SOM – self-organizing maps
    • PCA – principal component analysis
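Composition-based methods start by turning each sequence into an N-mer frequency vector, which can then be fed to a trained classifier (SVM, Markov model) or to a clustering method (SOM, PCA). A minimal sketch of that feature-extraction step, with an illustrative function name and a toy sequence:

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return the frequency of every possible k-mer over ACGT, in a fixed order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count k-mers made purely of A, C, G, T (windows containing N are ignored).
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vec = kmer_frequencies("ACGTACGT", k=2)  # 16 dinucleotide frequencies
```

Because the vector has a fixed length of 4^k regardless of read length, reads and contigs of different sizes can be compared directly.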
Viral Metagenomic Pipeline (Wash U scripts implemented at CDC)
• Sample Collection, DNA, Library Construction, Sequencing, Basecalling, Vector Trimming, Assembly
• Contigs, Reads: remove redundant sequences
• Unique sequences: mask repetitive and low complexity seqs
• Good sequences: BLASTN against Human Genome (e ≤ 1e-10)
• Non-human sequences: BLASTN vs GB-viral, BLASTN vs nt, BLASTX vs nr
• Report Generation, Display in MEGAN, inspect top hits
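The first two filtering stages of the pipeline (removing redundant sequences, then handling low-complexity sequence) can be illustrated with a small sketch. This is not the Wash U implementation: the slide does not name the tools used, the sketch only removes exact duplicates, and it drops low-complexity sequences by a Shannon-entropy cutoff instead of masking them. The `min_entropy` value and the toy reads are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Base-composition entropy in bits (0 for a homopolymer, 2 for equal A/C/G/T)."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def prefilter(seqs, min_entropy=1.5):
    """Drop exact duplicates (case-insensitive), then drop low-complexity sequences."""
    unique = list(dict.fromkeys(s.upper() for s in seqs))  # keeps first-seen order
    return [s for s in unique if s and shannon_entropy(s) >= min_entropy]


reads = [
    "ACGTTGCAAGCT",
    "acgttgcaagct",   # duplicate of the first read
    "AAAAAAAAAAAA",   # homopolymer
    "ATATATATATAT",   # dinucleotide repeat
    "GATTACAGGCTA",
]
kept = prefilter(reads)
```

Doing the cheap filters first keeps the expensive BLAST searches in the later stages limited to sequences that can actually yield an informative hit.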
Ugandan Outbreak Samples
• 4 patients
• Total RNA from patient sera
• 2 samples per 454 run
• ~565,000 reads/sample, avg length = 235nt
• Sequences were screened for random library amplification primers and low quality
• Assembled each run de novo using the 454 gsAssembler
• Performed a blastx database search using the assembled contigs (overnight)
• Visualized the blast output using MEGAN
MEGAN (MetaGenomeANalyzer)
Ugandan Outbreak - results
• Run1 - 5 contigs (out of 2463 > 100nt) matched YF virus, covering 98% of the genome (10,441 of 10,823bp)
• Mapped each sample from Run1 using an Ethiopian YF virus as reference. 3229 individual reads from Sample 1 identified as YF.
• Run 2 – no YF reads found
Phylogenetic analysis of yellow fever virus sequences
Laura McMullan (DHPP/VSPB)
Comparative Metagenomics
• One 454 run
• Two samples
– Sample 1 – ~578,000 reads, avg read length 438 bases
– Sample 2 – ~550,000 reads, avg read length 425 bases
• Total number of bases sequenced: ~488,000,000
Sample 1 – Rarefaction Curve
Sample 1 Taxa tree (collapsed at the Order level)
Comparison of Sample 1 and 2
Bioinformatics Tools
• Bioinformatics Packages: EMBOSS, CLCbio, Geneious, LaserGene-Ngen, Galaxy
• General Tools/Languages: Java/BioJava, Perl/BioPerl, R, BLAST Suite, BioEdit
• Genome Comparison/Alignment Tools: Mavid, Mauve, Clustal, Muscle, MAFFT
• Gene Prediction: Glimmer, GeneMark
• Assembly/Mapping Tools: 454 Suite, Mosaik Tools, Mummer, BWA, Velvet, AHA (pacbio)
• Functional Annotation: Manatee
• Phylogenetics: Paup, Phylip, MrBayes, Beauti/Beast, MEGA, DnaSP
• Metagenomics: MEGAN, Galaxy, Carma
• In-House: WAMS, POCs/VOCs
Challenges
Data Management – image files are large, and moving these files around the network is slow
Assembly/Mapping Software – Some are provided with the instrument, but additional methods and algorithms are needed
Finishing Tools – gap filling, primer design
Visualization Tools – tools to graphically display contigs on reference sequence as well as genome multiple alignments
Generic Robust Annotation Tools – Researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank
What are the weaknesses of current next-gen sequencers?
• Complicated and time consuming library preparation
– Requires micrograms of DNA to begin
– 3 days to prepare library
• Requires amplification of library
– Low copy number polymorphisms may be missed
– Emulsion PCR is an inefficient, time consuming, oily mess
– Potential to introduce PCR bias into sample
• Instruments require repetitive sequential ‘flows’ of reagents
– Repetitive flows of nucleotides, blocking/unblocking chemistry, washing out reaction byproducts all slow synthesis and hinder read-length
– Consumes liters of reagents ($)
– Repetitive flows and imaging extend sequence runs to days (or weeks)
Pacific Biosciences SMRT sequencer (single-molecule sequencer)
Ion Torrent Personal Gene Machine (solid-state sequencer)
Nanopore sequencing
Pacific Biosciences SMRT sequencer
Sponsor: Influenza Research Agenda
[Figure labels: individual ZMW with attached polymerase and DNA strand; laser excitation/detection volume; glass]
Pacific Biosciences SMRT Technology
~50 nm
Functional volume (red) is in zL!
SMRTcell = 160,000 ZMW
SMRTcell array = 1.5 million ZMW
Nucleotide incorporation is a realtime data movie
100 ms
Pacific Biosciences Advantages
Read lengths of 1,000 – 10,000 bases
No reagent ‘flows’ = 10-fold increase in sequencing speed
Substitute reverse transcriptase for polymerase and sequence RNA directly
Bacterial genomes sequenced in hours
A sequence run costs $99 and takes 15 minutes to complete