National Center for Emerging and Zoonotic Infectious Diseases Division of Scientific Resources Sequencing and Bioinformatics in the CDC Biotechnology Core Facility Branch Computational Lab Scott Sammons, Team Lead Kevin Tang Kristen Knipe Sequencing Lab Mike Frace, Team Lead Lori Rowe Marina Khristova Mark Burroughs Milli Sheth

Upload: chrystal-watson

Post on 16-Dec-2015


TRANSCRIPT

Page 1: National Center for Emerging and Zoonotic Infectious Diseases Division of Scientific Resources Sequencing and Bioinformatics in the CDC Biotechnology Core

National Center for Emerging and Zoonotic Infectious Diseases

Division of Scientific Resources

Sequencing and Bioinformatics in the CDC Biotechnology Core Facility Branch

Computational Lab
• Scott Sammons, Team Lead
• Kevin Tang
• Kristen Knipe

Sequencing Lab
• Mike Frace, Team Lead
• Lori Rowe
• Marina Khristova
• Mark Burroughs
• Milli Sheth

Page 2

Genome Sequencing Lab sequencing platforms

• PacBio SMRT sequencer
• Ion Torrent PGM
• Illumina MiSeq
• Roche 454 Titanium + Illumina GA IIx
• Illumina 2500

Page 3

Building 23 Server Room – Main Aisle

Page 4

High Performance Computing Cluster (Aspen)

• What is it?
  • 35 compute nodes each with 12 processor cores, total of 420 cores, 110GB of memory, and 2 Tesla 2050 GPU cards
• What can it do today?
  • 40 cluster applications are currently enabled, including MATLAB, BEAST, MrBayes, BLAST, mpiBLAST, PacBio analysis tools, Celera Assembler, CLC Server, and Geneious Server

Page 5

Isilon

• What is it?
  • High speed, scalable, and redundant Network Attached Storage
  • Connected to both the CDC network and the Aspen HPC cluster utilizing InfiniBand
  • Total of 500TB usable space
• What can it do today?
  • It provides user workspace for end-users and HPC applications
  • Solves the problem of being out of disk space on individual servers
• What are we doing with it?
  • Data warehouse for all scientific equipment
  • Central network share for all scientific users
  • Integrating directly with ITSO’s Active Directory forest

Page 6

Private Cloud

• What is it?
  • Support science through front-end and back-end services
  • Implementation of virtualized infrastructure
  • Currently in the process of being deployed
• What can it do today?
  • Provide test environments for scientific projects
  • Lay the foundation for hardware consolidation and migration
• What are we doing with it?
  • Standardize platforms
  • Centralize management

Page 7

Sequencing Lab Origins

• Began in 2001
• Mission: sequence 8 human smallpox viruses before the WHO revisits destruction of all smallpox stocks
• By 2005, had sequenced over 150 smallpox and related poxvirus genomes.
• 2006: Roche 454, focus moved to small bacterial genomes
• 2010: Illumina GAIIx
• 2011: Ion Torrent, PacBio

Page 8

Sequencing: extended PCR

[Figure: position of E-PCR overlapping amplicons A1 to A18, from End-L to End-R, drawn over the HindIII map (fragments D P O C E K H ML I F N A J B GSRQ)]

Primers designed using VAR-BSH and VAC-CPN sequences
Primers target genes involved in reproduction & host response
Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites
PCR uses minimal DNA amounts, often no need to grow virus
PCR uses hifi expand long-template Taq & Pwo enzymes (Roche)

Page 9

Sequencing Assembly: Phred/Phrap/Consed

Page 10

Gene Prediction

• Heuristic algorithm to assign quality scores to ORFs (from 1 to 100)
• Quality scores are based on a number of factors including
  – Gene predictions (Glimmer, GeneMark, getorf)
  – Primary sequence homology to known genes (BLAST)
  – Presence of predicted promoter (MEME/MAST)
  – Size of predicted ORF
  – Presence of transcription termination signals
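A heuristic of the kind listed above might be sketched as follows. This is only an illustration: the slide gives the evidence types and the 1 to 100 range, but the weights, the 300 nt length cap, and every name in the code (`OrfEvidence`, `orf_quality`, `WEIGHTS`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    predictors_agreeing: int   # how many of Glimmer/GeneMark/getorf called the ORF (0-3)
    has_blast_homolog: bool    # primary sequence homology to a known gene
    has_promoter: bool         # predicted promoter upstream (e.g. from MEME/MAST)
    length_nt: int             # size of the predicted ORF
    has_terminator: bool       # transcription termination signal present


# Hypothetical weights chosen so that the maximum possible score is 100.
WEIGHTS = {"predictors": 30, "homology": 30, "promoter": 15, "length": 15, "terminator": 10}


def orf_quality(ev, full_credit_length=300):
    """Combine the evidence into a score clamped to the range 1..100."""
    score = WEIGHTS["predictors"] * min(ev.predictors_agreeing, 3) / 3
    score += WEIGHTS["homology"] * ev.has_blast_homolog
    score += WEIGHTS["promoter"] * ev.has_promoter
    score += WEIGHTS["length"] * min(ev.length_nt, full_credit_length) / full_credit_length
    score += WEIGHTS["terminator"] * ev.has_terminator
    return max(1, min(100, round(score)))
```

An ORF supported by every line of evidence scores 100 and one with no support scores 1, so the output can be used to rank candidate genes for manual review.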

Page 11

Visualizing Gene Predictions and Differences

Page 12

ORFs of CPXVs from 4 different clades

[Figure labels: ITR, ITR, crm-D]

Page 13

45 Smallpox Strains

[Figure: clade labels]
A. West African int. CFR ~10%
B. American alastrim minor CFR <1%
C. Asian major CFR ~5 - 35%
C-1. non-West-African African int. CFR ~10%
C-2. non-West-African African minor CFR <1%

Page 14

Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein

[Figure: unrooted tree whose tips are labeled with GenBank accession numbers. Labeled groups: Cowpox clade I, Cowpox clade II, Cowpox clade III (CPXV91_ger3), Cowpox clade IV (CPXV90_ger2), Vaccinia, Monkeypox, Camelpox, Taterapox, Variola, Ectromelia]

Page 15

GSL sequencing 2013

NCEZID: Vibrio cholerae, Vibrio spp., Campylobacter, Salmonella, Bacillus anthracis, Listeria, Burkholderia spp., Yersinia pestis, Brucella spp., Klebsiella pneumoniae, fungal meningitis, Rift Valley fever virus, Lujo virus, Marburg virus, CCHF virus, Lassa fever virus, clinical sample metagenomics

NCIRD: Haemophilus influenzae, Legionella pneumophila, Legionella spp., Mycoplasma pneumoniae, water cooling tower metagenomics, respiratory filter metagenomics, Bordetella spp., tick metagenomics

NCHHSTP: Neisseria spp., hepatitis, Mycobacterium tuberculosis

INFLUENZA

CGH: Rhodococcus, Cryptosporidium, Fasciola spp., Balamuthia spp.

Page 16

Next-Gen Diagnostic Sequencing Applications

‘Massively parallel’ sequencing not only produces throughput, it provides sequences of potentially millions of individual molecules (instant cloning). Sequencing a PCR reaction allows a detailed search for low expression quasi-species or mutations which may signal growing drug or vaccine resistance, a process called ultra-deep or amplicon sequencing.

Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug.

Complex clinical, laboratory or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms, an approach called metagenomic sequencing.

Examples: tissue culture, soil, blood serum, sputum, stool

Shotgun / Paired-End Sequencing: random shearing of DNA, even sequence coverage over entire genome.

Page 17

Shotgun / Paired-End Sequencing

De novo Assembly
• Newbler
• CLCBio
• Mira
• Geneious
• Velvet
• Celera Assembler

Reference Mapping
• Newbler
• CLCBio
• Mira
• Geneious
• BWA
• Bowtie

Page 18

Genome Assembly Visualization

Page 19

Genome Assembly Visualization

Page 20

Genome Comparison

Page 21

HGAP – Hierarchical Genome Assembly Process

• PreAssembly
  – Generation of long accurate reads
• Assembly
  – Choice of assemblers, but OLC (Overlap Layout Consensus) assemblers are best: MIRA and Celera Assembler
• Consensus Polishing
  – Quiver: a quality-aware consensus algorithm that maps all reads back to the assembly and creates a new consensus
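To make the overlap and layout idea behind OLC assemblers concrete, here is a toy Python sketch. It is not how MIRA or Celera Assembler work internally: it assumes error-free reads, uses exact suffix/prefix matching, merges greedily instead of building an overlap graph, and skips the consensus step entirely. The function names and the `min_len` threshold are made up for the illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three toy reads that tile an 11-base sequence.
print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))  # ['ATGCGTACGGA']
```

The all-against-all comparison is quadratic in the number of reads, which is one reason real assemblers index reads before computing overlaps.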

Page 22

HGAP: PreAssembly

30X

Page 23

HGAP: PreAssembly/Assembly

• Correct seed reads with short reads

• Assemble with Celera Assembler or MIRA

Page 24

HGAP - Quiver

• To reduce the remaining indel and base substitution errors in the draft assembly, we use PacBio's Quiver, a quality-aware consensus algorithm. Four different per-base Quality Values (QV scores) represent the intrinsically calculated error probabilities for inserted, deleted, substituted and merged base calls in single-pass reads. These values allow Quiver to generate a highly accurate consensus for the final assembly, which frequently exceeds QV50 (99.999% accuracy).
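The QV50 figure follows from the standard Phred relationship between a quality value and an error probability, QV = -10 · log10(P_error). A short Python check (the function names are just for this example):

```python
import math


def qv_to_accuracy(qv):
    """Accuracy (1 - error probability) implied by a Phred-scaled quality value."""
    return 1.0 - 10 ** (-qv / 10)


def error_prob_to_qv(p_error):
    """Phred-scaled quality value for a given error probability."""
    return -10 * math.log10(p_error)


for qv in (20, 30, 50):
    print(f"QV{qv}: {qv_to_accuracy(qv):.5%} accuracy")
```

QV50 corresponds to one expected error per 100,000 bases, which is the 99.999% quoted on the slide.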

Page 25

HGAP Example

Page 26

HGAP Confirmation with Physical Mapping

Page 27

HGAP Assembly Structural Confirmation

Page 28

HGAP Sequence Confirmation with Illumina reads

Page 29

Amplicon (deep) sequencing project

• Clinical case of progressive vaccinia infection from smallpox vaccination of an immunocompromised patient

• Pox antiviral ST-246 administered, which targets pox gene F13L, encoding a major envelope protein that mediates production of extracellular virus

• Oral ST-246 given daily and vaccination site sampled over a 3-week period

Li, Damon - NCEZID/DVRD/PRB

Page 30

A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000)

Control swab prior to ST-246
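The comparison against the reference comes down to counting, at each position of the amplicon, how many reads carry each base. The Python sketch below shows that counting step under simplifying assumptions: reads are already aligned without gaps and are given as (start, sequence) pairs in 1-based reference coordinates. The function names and the four toy reads are hypothetical and are not data from the project.

```python
from collections import Counter


def base_counts(aligned_reads, position):
    """Count the bases observed at a 1-based reference position.

    aligned_reads: iterable of (start, seq) pairs, where start is the 1-based
    reference coordinate of the read's first base and the alignment is ungapped.
    """
    counts = Counter()
    for start, seq in aligned_reads:
        offset = position - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    return counts


def variant_frequency(aligned_reads, position, alt_base):
    """Fraction of reads covering the position that carry alt_base."""
    counts = base_counts(aligned_reads, position)
    depth = sum(counts.values())
    return counts[alt_base] / depth if depth else 0.0


# Hypothetical toy reads covering reference position 869.
toy_reads = [(867, "ACCGT"), (868, "CTGTA"), (869, "CGTAA"), (865, "GGACTG")]
print(base_counts(toy_reads, 869))             # two reads show C, two show T
print(variant_frequency(toy_reads, 869, "T"))  # 0.5
```

With thousands of reads per position, the same calculation can reveal a variant present in only a small fraction of the viral population, provided its frequency is above the sequencing error rate.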

Page 31

2 weeks after ST-246

[Figure labels: C > T (869), T > A (943)]

Page 32

3 weeks after ST-246

[Figure labels: C > T (869), T > A (943)]

Page 33

What is Metagenomics?

• The genomic study of DNA from uncultured microorganisms, generally from environmental samples

• Related
  • Metatranscriptomics
  • Metaproteomics

Page 34

Sample Coverage: Rarefaction Curves

[Figure: rarefaction curves; x-axis: Samples]

Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2)
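A rarefaction curve plots how many distinct taxa have been seen as more reads are sampled; a curve that flattens suggests the sample has been covered well. The Python sketch below is a simplified version: it assumes each read already has a taxon label, and it uses a single random ordering of the reads, whereas published curves usually average over many random subsamples. The function name, `step`, and `seed` parameters are illustrative.

```python
import random


def rarefaction_curve(assignments, step, seed=0):
    """Return [(reads_sampled, distinct_taxa_seen), ...] along one random read order."""
    shuffled = list(assignments)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the curve is reproducible
    seen = set()
    curve = []
    for n, taxon in enumerate(shuffled, start=1):
        seen.add(taxon)
        if n % step == 0 or n == len(shuffled):
            curve.append((n, len(seen)))
    return curve


# Hypothetical community: 100 reads spread unevenly over 4 taxa.
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
for n_reads, n_taxa in rarefaction_curve(labels, step=10):
    print(n_reads, n_taxa)
```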

Page 35

Classification Techniques

• Supervised Taxonomic Classification
  • Homology-based
    • Database searching by similarity (BLAST, SW)
    • BLAST, BLASTX: GenBank, specialized DBs: NCBI-ENV-NT, NCBI-ENV-NR
  • Composition-based
    • N-mer frequency
    • Markov Models, Support Vector Machines (SVM), need training set
• Unsupervised Taxonomic Classification
  • Clustering methods
    • SOM - self-organizing maps
    • PCA – principal component analysis
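Composition-based methods start by turning each sequence into a vector of N-mer frequencies, which can then be fed to a trained classifier (such as an SVM) or to a clustering method (such as SOM or PCA). A minimal Python sketch of that feature extraction step, with a hypothetical function name and a default of k = 4:

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequency of every possible DNA k-mer, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases such as N are not in `kmers` and are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


profile = kmer_profile("ACGTACGTNNACGT", k=2)
print(len(profile))  # 16 possible dinucleotides
```

Every sequence maps to a vector of the same length (4^k), regardless of how long the read or contig is, which is what makes the profiles directly comparable.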

Page 36

Viral Metagenomic Pipeline (Wash U scripts implemented at CDC)

Sample preparation and sequencing:
• Sample Collection
• DNA
• Library Construction
• Sequencing
• Basecalling
• Vector Trimming
• Assembly

Sequence filtering and classification:
• Contigs, Reads: remove redundant sequences
• Unique sequences: mask repetitive and low complexity seqs
• Good sequences: BLASTN against Human Genome (e ≤ 1e-10)
• Non-human sequences: BLASTN vs nt, BLASTX vs nr, BLASTN vs GB-viral
• Report Generation, Display in MEGAN, inspect top hits
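Two of the filtering steps in this pipeline, removing redundant sequences and discarding sequences that hit the human genome at e ≤ 1e-10, can be sketched in Python as below. This is not the Wash U code: the function names are hypothetical, the masking step is left out, and the parser assumes BLAST's 12-column tabular output (`-outfmt 6`), in which column 1 is the query ID and column 11 is the e-value.

```python
def remove_redundant(records):
    """Keep the first record for each distinct sequence. records maps id -> sequence."""
    seen = set()
    unique = {}
    for seq_id, seq in records.items():
        if seq not in seen:
            seen.add(seq)
            unique[seq_id] = seq
    return unique


def human_hit_ids(blast_tabular_lines, max_evalue=1e-10):
    """Query IDs with at least one hit whose e-value is at or below max_evalue."""
    hits = set()
    for line in blast_tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 12 and float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def non_human(records, blast_tabular_lines, max_evalue=1e-10):
    """Drop every record whose ID had a significant hit against the human genome."""
    human = human_hit_ids(blast_tabular_lines, max_evalue)
    return {seq_id: seq for seq_id, seq in records.items() if seq_id not in human}
```

The surviving sequences would then go on to the BLASTN and BLASTX searches against nt, nr, and the viral database.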

Page 37

MEGAN

Page 38

Page 39

Ugandan Outbreak Samples

• 4 patients
• Total RNA from patient sera
• 2 samples per 454 run
• ~565,000 reads/sample, avg length = 235nt
• Sequences were screened for random library amplification primers and low quality
• Assembled each run de novo using the 454 gsAssembler
• Performed a blastx database search using the assembled contigs (overnight)
• Visualized the blast output using MEGAN.

Page 40

MEGAN (MetaGenomeANalyzer)

Page 41

Ugandan Outbreak - results

• Run 1 – 5 contigs (out of 2463 > 100nt) matched YF virus, covering 98% of the genome (10,441 of 10,823bp)

• Mapped each sample from Run 1 using an Ethiopian YF virus as reference. 3229 individual reads from Sample 1 identified as YF.

• Run 2 – no YF reads found

Page 42

Phylogenetic analysis of yellow fever virus sequences

Laura McMullan (DHPP/VSPB)

Page 43

Comparative Metagenomics

• One 454 run
• Two samples
  • Sample 1 – ~578,000 reads, avg read length 438 bases
  • Sample 2 – ~550,000 reads, avg read length 425 bases
• Total number of bases sequenced - ~488,000,000
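The total can be roughly reproduced from the per-sample figures by multiplying read count by average read length. Because the inputs on the slide are rounded, the result only approximates the stated total. A short Python check (variable and function names are illustrative):

```python
# (approximate read count, average read length in bases), as given on the slide
samples = {"Sample 1": (578_000, 438), "Sample 2": (550_000, 425)}


def total_bases(samples):
    """Sum of reads x average length over all samples."""
    return sum(reads * avg_len for reads, avg_len in samples.values())


print(f"{total_bases(samples):,}")  # 486,914,000
```

This gives about 487 million bases, in line with the ~488,000,000 reported.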

Page 44

Sample 1 – Rarefaction Curve

Page 45

Sample 1 Taxa tree (collapsed at the Order level)

Page 46

Comparison of Sample 1 and 2

Page 47

Bioinformatics Tools

• Bioinformatics Packages
  – EMBOSS
  – CLCbio
  – Geneious
  – LaserGene-Ngen
  – Galaxy
• General Tools/Languages
  – Java/BioJava
  – Perl/BioPerl
  – R
  – BLAST Suite
  – BioEdit
• Genome Comparison/Alignment Tools
  – Mavid
  – Mauve
  – Clustal
  – Muscle
  – MAFFT
• Gene Prediction
  – Glimmer
  – GeneMark
• Assembly/Mapping Tools
  – 454 Suite
  – Mosaik Tools
  – Mummer
  – BWA
  – Velvet
  – AHA (pacbio)
• Functional Annotation
  – Manatee
• Phylogenetics
  – Paup
  – Phylip
  – MrBayes
  – Beauti/Beast
  – MEGA
  – DnaSP
• Metagenomics
  – MEGAN
  – Galaxy
  – Carma
• In-House
  – WAMS
  – POCs/VOCs

Page 48

Challenges

Data Management – image files are large, and moving these files around the network is slow

Assembly/Mapping Software – Some are provided with the instrument, but additional methods and algorithms are needed

Finishing Tools – gap filling, primer design

Visualization Tools – tools to graphically display contigs on reference sequence as well as genome multiple alignments

Generic Robust Annotation Tools – Researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank

Page 49

Page 50

What are the weaknesses of current next-gen sequencers?

Complicated and time consuming library preparation
• Requires micrograms of DNA to begin
• 3 days to prepare library

Requires amplification of library
• Low copy number polymorphisms may be missed
• Emulsion PCR is an inefficient, time consuming, oily mess
• Potential to introduce PCR bias into sample

Instruments require repetitive sequential ‘flows’ of reagents
• Repetitive flows of nucleotides, blocking/unblocking chemistry, washing out reaction byproducts all slow synthesis and hinder read-length
• Consumes liters of reagents ($)
• Repetitive flows and imaging extend sequence runs to days (or weeks)

Page 51

Pacific Biosciences SMRT sequencer (single-molecule sequencer)

Ion Torrent Personal Gene Machine (solid-state sequencer)

Nanopore sequencing

Page 52

Pacific Biosciences SMRT sequencer

Sponsor: Influenza Research Agenda

Page 53

Pacific Biosciences SMRT Technology

[Figure: individual ZMW with attached polymerase and DNA strand, showing the laser excitation/detection volume. Labels: glass; ~ 50 nm; functional volume (red) is in zL]

SMRTcell = 160,000 ZMW
SMRTcell array = 1.5 million ZMW

Page 54

Nucleotide incorporation is a real-time data movie

100 ms

Page 55

Pacific Biosciences Advantages

Read lengths of 1,000 – 10,000 bases

No reagent ‘flows’ = 10-fold increase in sequencing speed

Substitute reverse transcriptase for polymerase and sequence RNA directly

Bacterial genomes sequenced in hours

Sequence run costs $99; takes 15 minutes to complete