Bioinformatics in the CDC Biotechnology Core Facility Branch


Upload: swain

Post on 25-Feb-2016


DESCRIPTION

Bioinformatics in the CDC Biotechnology Core Facility Branch . Computational Lab Scott Sammons Kevin Tang Chandni Desai. Sequencing Lab Mike Frace Missy Olsen-Rasmussen Marina Khristova Lori Rowe. Genome Sequencing Lab sequencing platforms – current and upcoming. AB 3730XL. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Bioinformatics in the CDC Biotechnology Core Facility Branch

Bioinformatics in the CDC Biotechnology Core Facility Branch

Computational Lab: Scott Sammons, Kevin Tang, Chandni Desai

Sequencing Lab: Mike Frace, Missy Olsen-Rasmussen, Marina Khristova, Lori Rowe

Page 2: Bioinformatics in the CDC Biotechnology Core Facility Branch

Pacific Biosciences SMRT sequencer

Ion Torrent Personal Genome Machine

AB 3730XL Roche 454 Titanium + Illumina GA IIx

Genome Sequencing Lab sequencing platforms – current and upcoming

Page 3: Bioinformatics in the CDC Biotechnology Core Facility Branch


Building 23 Server Room – Main Aisle

Page 4: Bioinformatics in the CDC Biotechnology Core Facility Branch

High Performance Computing Cluster (Aspen)

• What is it?
  – 35 compute nodes, each with 12 processor cores, 48 GB of memory, and 2 Tesla 2050 GPU cards
  – Currently in the final stages of development in preparation for code freeze and C&A
• What can it do today?
  – 25 cluster applications are currently enabled for our phase-one deployment, including MATLAB, Geneious, BEAST, BLAST, and PacBio
  – Collaboration with NCI via IAA will GPU-enable scientific applications even further
• How fast is it?
  – By example, a BLAST job that takes over 60 hours to complete on our old cluster takes 2 hours on the new cluster*
  – *NOT GPU OPTIMIZED CODE
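The figures on this slide imply some aggregate capacity numbers and a rough speedup. The short sketch below only multiplies out the values quoted above; the variable names are illustrative, and the "over 60 hours" figure is treated as a lower bound.

```python
# Values quoted on the slide; names are illustrative only.
NODES = 35
CORES_PER_NODE = 12
MEM_GB_PER_NODE = 48
GPUS_PER_NODE = 2

OLD_BLAST_HOURS = 60  # "over 60 hours" on the old cluster, so a lower bound
NEW_BLAST_HOURS = 2

total_cores = NODES * CORES_PER_NODE
total_mem_gb = NODES * MEM_GB_PER_NODE
total_gpus = NODES * GPUS_PER_NODE
min_speedup = OLD_BLAST_HOURS / NEW_BLAST_HOURS

print(f"{total_cores} cores, {total_mem_gb} GB memory, {total_gpus} GPUs")
print(f"BLAST example: at least {min_speedup:.0f}x faster")
```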

Page 5: Bioinformatics in the CDC Biotechnology Core Facility Branch

Isilon

• What is it?
  – High speed, scalable, and redundant Network Attached Storage
  – Currently in the process of being integrated with applications
  – Connected to both the CDC network and the Aspen HPC cluster utilizing InfiniBand
• What can it do today?
  – It provides user workspace for end-users and HPC applications
  – Solves the problem of being out of disk space on individual servers
• What are we doing with it?
  – Data warehouse for all scientific equipment
  – Central network share for all scientific users
  – Integrating directly with ITSO’s Active Directory forest

Page 6: Bioinformatics in the CDC Biotechnology Core Facility Branch

Private Cloud

• What is it?
  – Support science through front-end and back-end services
  – Implementation of virtualized infrastructure
  – Currently in the process of being deployed
• What can it do today?
  – Provide test environments for scientific projects
  – Lay the foundation for hardware consolidation and migration
• What are we doing with it?
  – Standardize platforms
  – Centralize management
  – Support ongoing growth within the scientific computing community while enabling science

Page 7: Bioinformatics in the CDC Biotechnology Core Facility Branch

Scientific Computing Infrastructure: The Server Room

• 2 Linux High Performance Computing Clusters (~40 nodes each)
• 1 Genomics Cluster
• 4 Solaris Servers
• 12 Stand-Alone Linux Servers
• 1 Stand-Alone Database Server
• 5 Stand-Alone Windows Servers
• Virtualized Cluster with 15 VMs
• 3 NAS Devices
• 2 Tape Libraries
• 2 Dedicated IP Subnets

• One C&A addressing all legacy production hardware (NCEZID), with several in process for systems currently under development (NCIRD)

Page 8: Bioinformatics in the CDC Biotechnology Core Facility Branch

GSL sequencing 2011

NCEZID: Vibrio cholerae, Vibrio spp., Cyclospora, Bacillus anthracis, Listeria, Yersinia pestis, Brucella spp., Klebsiella pneumoniae, Junin virus, Rift Valley fever virus, Lujo virus, Marburg virus, CCHF virus, Lassa fever virus, clinical sample metagenomics, tick metagenomics, soil metagenomics

NCIRD: Haemophilus influenzae, Legionella pneumophila, Legionella spp., Mycoplasma pneumoniae, water cooling tower metagenomics, respiratory filter metagenomics, bat metagenomics

CGH: Guinea worm, Taenia solium, Angiostrongylus

INFLUENZA

Page 9: Bioinformatics in the CDC Biotechnology Core Facility Branch

Sequencing: extended PCR

[Figure: Position of E-PCR overlapping amplicons (End-L, A1 to A18, End-R) drawn against the HindIII map (fragments D P O C E K H ML I F N A J B GSRQ)]

• Primers designed using VAR-BSH and VAC-CPN sequences
• Primers target genes involved in reproduction & host response
• Sequence sample: primers 40 sites, 1 enz. RFLP ~120 sites
• PCR uses minimal DNA amounts, often no need to grow virus
• PCR uses hifi expand long-template Taq & Pwo enzymes (Roche)

Page 10: Bioinformatics in the CDC Biotechnology Core Facility Branch

First Pass Assembly: Seqmerge

[Figure: coverage plot; y-axis "fold redundancy", ticks 4 to 16]

Page 11: Bioinformatics in the CDC Biotechnology Core Facility Branch

Sequencing Assembly: Phred/Phrap/Consed

Page 12: Bioinformatics in the CDC Biotechnology Core Facility Branch

Gene Prediction

• Heuristic algorithm to assign quality scores to ORFs (from 1 to 100)
• Quality scores are based on a number of factors, including:
  – Gene predictions (Glimmer, GeneMark, getorf)
  – Primary sequence homology to known genes (BLAST)
  – Presence of predicted promoter (MEME/MAST)
  – Size of predicted ORF
  – Presence of transcription termination signals
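As an illustration of how such a heuristic might combine the factors listed above into a 1 to 100 score, here is a minimal sketch. The weights, length thresholds, and the `OrfEvidence` and `score_orf` names are all hypothetical; the slide does not say how the actual scoring is weighted.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    predictors_agreeing: int  # how many of the 3 gene finders called this ORF
    has_blast_hit: bool       # homology to a known gene
    has_promoter: bool        # predicted promoter upstream
    length_nt: int            # ORF size in nucleotides
    has_terminator: bool      # transcription termination signal present


# Hypothetical weights (they sum to 100); not taken from the slide.
WEIGHTS = {"predictors": 40, "blast": 25, "promoter": 15, "size": 10, "terminator": 10}


def score_orf(e: OrfEvidence, min_len: int = 150, full_len: int = 900) -> int:
    """Combine the evidence into a score clamped to the range 1..100."""
    predictors = WEIGHTS["predictors"] * min(max(e.predictors_agreeing, 0), 3) / 3
    blast = WEIGHTS["blast"] if e.has_blast_hit else 0
    promoter = WEIGHTS["promoter"] if e.has_promoter else 0
    terminator = WEIGHTS["terminator"] if e.has_terminator else 0
    # Size contributes linearly between the two (assumed) length thresholds.
    size_fraction = (e.length_nt - min_len) / (full_len - min_len)
    size = WEIGHTS["size"] * min(max(size_fraction, 0.0), 1.0)
    total = predictors + blast + promoter + terminator + size
    return max(1, min(100, round(total)))


strong = OrfEvidence(3, True, True, 1200, True)
weak = OrfEvidence(0, False, False, 90, False)
print(score_orf(strong), score_orf(weak))
```

A weighted sum is only one possible design; a rule-based or trained classifier would fit the same description on the slide.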

Page 13: Bioinformatics in the CDC Biotechnology Core Facility Branch

Visualizing Gene Predictions and Differences

Page 14: Bioinformatics in the CDC Biotechnology Core Facility Branch

ORFs of CPXVs from 4 different clades

[Figure labels: ITR, ITR, crm-D]

Page 15: Bioinformatics in the CDC Biotechnology Core Facility Branch

45 Smallpox Strains

A. West African int. CFR ~10%

B. American alastrim minor CFR <1%

C. Asian major CFR ~5 - 35%

C-1. non-West-African African int. CFR ~10%

C-2. non-West-African African minor CFR <1%

Page 16: Bioinformatics in the CDC Biotechnology Core Facility Branch

Unrooted tree: phylogenetic relationships of the ORF encoding the hemagglutinin protein

[Figure: unrooted tree with GenBank accession numbers as tip labels. Labeled groups: Cowpox clade I, Cowpox clade II, Cowpox clade III (CPXV91_ger3), Cowpox clade IV (CPXV90_ger2), Vaccinia, Monkeypox, Camelpox, Taterapox, Variola, Ectromelia]

Page 17: Bioinformatics in the CDC Biotechnology Core Facility Branch

Next-Gen Diagnostic Sequencing Applications

‘Massively parallel’ sequencing not only produces throughput, it provides sequences of potentially millions of individual molecules (instant cloning). By sequencing a PCR reaction it allows the detailed search for low expression quasi-species or mutations which may signal growing drug or vaccine resistance – a process called ultra-deep or amplicon sequencing.

Example: clinical case of poxvirus infection with samples exhibiting a reduced sensitivity to an antiviral drug.

Complex clinical, laboratory or environmental samples can be sequenced to provide a diagnostic ‘snapshot’ of the resident organisms - an approach called metagenomic sequencing.

Examples: tissue culture, soil

Shotgun / Paired-End Sequencing: random shearing of DNA, even sequence coverage over entire genome.

Page 18: Bioinformatics in the CDC Biotechnology Core Facility Branch

Shotgun / Paired-End Sequencing

De novo Assembly
• Newbler
• CLCBio
• Mira
• Geneious
• Velvet
• Celera

Reference Mapping
• Newbler
• CLCBio
• Mosaik
• Mira
• Geneious
• BWA

Page 19: Bioinformatics in the CDC Biotechnology Core Facility Branch

Genome Assembly Visualization

Page 20: Bioinformatics in the CDC Biotechnology Core Facility Branch

Genome Assembly Visualization

Page 21: Bioinformatics in the CDC Biotechnology Core Facility Branch

Amplicon (deep) sequencing project

• Clinical case of progressive vaccinia infection following smallpox vaccination of an immunocompromised patient

• Pox antiviral ST-246 was administered; it targets pox gene F13L, which encodes a major envelope protein that mediates production of extracellular virus

• Oral ST-246 given daily and vaccination site sampled over a 3-week period

Li, Damon - NCEZID/DVRD/PRB

Page 22: Bioinformatics in the CDC Biotechnology Core Facility Branch

A region of gene F13L was amplified from clinical samples, deep sequenced, and compared to the smallpox vaccine reference sequence (Acambis 2000)

Control swab prior to ST-246
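The comparison of deep-sequenced amplicon reads against a reference, as described on this slide, can be sketched as a per-position base count. This is a toy illustration only: it assumes reads are already aligned to the reference and of equal length, and the function names, the 1% reporting threshold, and the example data are all hypothetical, not taken from the CDC analysis.

```python
from collections import Counter


def base_frequencies(reads, position):
    """Fraction of each base at a 0-based column of pre-aligned reads."""
    column = [r[position] for r in reads
              if position < len(r) and r[position] in "ACGT"]
    counts = Counter(column)
    depth = sum(counts.values())
    return {base: n / depth for base, n in counts.items()} if depth else {}


def minor_variants(reads, reference, min_fraction=0.01):
    """List (1-based position, ref base, variant base, fraction) tuples."""
    variants = []
    for pos, ref_base in enumerate(reference):
        for base, frac in sorted(base_frequencies(reads, pos).items()):
            if base != ref_base and frac >= min_fraction:
                variants.append((pos + 1, ref_base, base, frac))
    return variants


# Toy data: 2 of 10 reads carry a C > T change at position 2.
toy_reference = "ACGTAC"
toy_reads = ["ACGTAC"] * 8 + ["ATGTAC"] * 2
print(minor_variants(toy_reads, toy_reference))
```

In practice the threshold has to sit above the sequencing error rate, otherwise errors are reported as variants.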

Page 23: Bioinformatics in the CDC Biotechnology Core Facility Branch

2 weeks after ST-246

C > T869

T > A943

Page 24: Bioinformatics in the CDC Biotechnology Core Facility Branch

3 weeks after ST-246

C > T869

T > A943

Page 25: Bioinformatics in the CDC Biotechnology Core Facility Branch

What is Metagenomics?

• The genomic study of DNA from uncultured microorganisms, generally from environmental samples
• Related
  – Metatranscriptomics
  – Metaproteomics

Page 26: Bioinformatics in the CDC Biotechnology Core Facility Branch

Sample Coverage: Rarefaction Curves

[Figure: rarefaction curves; x-axis: Samples]

Wooley JC, Godzik A, Friedberg I (2010) A Primer on Metagenomics. PLoS Comput Biol 6(2)
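A rarefaction curve plots how many distinct taxa have been seen as more reads (or samples) are added; a curve that flattens suggests the sample has been covered. The sketch below is a simplified version that uses a single random ordering of taxon labels, whereas rarefaction in the strict sense averages over many random subsamples. The function name and the toy labels are illustrative.

```python
import random


def rarefaction_curve(taxon_labels, step=100, seed=0):
    """Return (reads sampled, distinct taxa seen) points for one random ordering."""
    rng = random.Random(seed)
    shuffled = list(taxon_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, label in enumerate(shuffled, start=1):
        seen.add(label)
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Toy community: three taxa with uneven abundance.
toy_labels = ["taxon_a"] * 50 + ["taxon_b"] * 30 + ["taxon_c"] * 20
for n_reads, n_taxa in rarefaction_curve(toy_labels, step=10):
    print(n_reads, n_taxa)
```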

Page 27: Bioinformatics in the CDC Biotechnology Core Facility Branch

Classification Techniques

• Supervised Taxonomic Classification
  – Homology-based
    • Database searching by similarity (BLAST, SW)
    • BLAST, BLASTX: GenBank, specialized DBs: NCBI-ENV-NT, NCBI-ENV-NR
  – Composition-based
    • N-mer frequency
    • Markov Models, Support Vector Machines (SVM), need training set
• Unsupervised Taxonomic Classification
  – Clustering methods
    • SOM – self-organizing maps
    • PCA – principal component analysis
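Composition-based methods start from an N-mer frequency vector for each sequence, which is then passed to a classifier (supervised) or a clustering method such as SOM or PCA (unsupervised). A minimal sketch of building that vector follows; the function name and the default k are illustrative choices, not taken from any of the tools named in this deck.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector in a fixed (lexicographic ACGT) order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing N or other ambiguity codes are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


profile = kmer_profile("ACGTACGT", k=2)
print(len(profile), round(sum(profile), 6))
```

Every sequence maps to a vector of the same length (4^k), so profiles from different reads or contigs can be compared directly.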

Page 28: Bioinformatics in the CDC Biotechnology Core Facility Branch

Viral Metagenomic Pipeline (Wash U scripts implemented at CDC)

• Sample Collection
• DNA
• Library Construction
• Sequencing
• Basecalling
• Vector Trimming
• Assembly → Contigs, Reads
• Remove redundant sequences → Unique sequences
• Mask repetitive and low complexity seqs → Good sequences
• BLASTN against Human Genome (e ≤ 1e-10) → Non-human sequences
• BLASTN vs nt; BLASTX vs nr; BLASTN vs GB-viral
• Report Generation, Display in MEGAN, inspect top hits
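Two of the pipeline stages above, removing redundant sequences and discarding reads that hit the human genome at e ≤ 1e-10, can be sketched as follows. This is not the Wash U code: the function names are hypothetical, redundancy is reduced to exact duplicates for simplicity, and the BLAST results are assumed to be in the standard 12-column tabular format (`-outfmt 6`), where the e-value is the 11th column.

```python
def remove_redundant(seqs):
    """Keep the first record for each distinct sequence (exact duplicates only)."""
    seen = set()
    unique = {}
    for name, seq in seqs.items():
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique[name] = seq
    return unique


def human_hit_ids(blast_tab_lines, max_evalue=1e-10):
    """Query IDs with at least one hit at or below the e-value cutoff."""
    ids = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:  # column 11 = evalue
            ids.add(fields[0])               # column 1 = query id
    return ids


def non_human(seqs, blast_tab_lines, max_evalue=1e-10):
    """Drop every sequence that had a qualifying hit against the human genome."""
    human = human_hit_ids(blast_tab_lines, max_evalue)
    return {name: seq for name, seq in seqs.items() if name not in human}


# Toy input: r2 duplicates r1; r3 has a strong human hit, r4 only a weak one.
toy_seqs = {"r1": "ACGT", "r2": "acgt", "r3": "GGGG", "r4": "TTTT"}
toy_blast = [
    "r3\tchr1\t99.0\t100\t1\t0\t1\t100\t5\t104\t1e-50\t180",
    "r4\tchr2\t85.0\t40\t6\t0\t1\t40\t9\t48\t0.002\t30",
]
print(sorted(non_human(remove_redundant(toy_seqs), toy_blast)))
```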

Page 29: Bioinformatics in the CDC Biotechnology Core Facility Branch

Software for Taxonomic Classification

• MEGAN – GUI for classification based on BLAST searches
• CARMA – web-based classification using the Pfam database and HMMER alignment of protein families
• MG-RAST – classification system utilizing protein-encoding databases and several ribosomal DBs. Can analyze user-provided datasets; web use only
• Geneious – commercial product
• NextGENe – commercial product
• Phymm, PhymmBL – composition-based classification system

Page 30: Bioinformatics in the CDC Biotechnology Core Facility Branch

Software for Comparative Metagenomics

• MEGAN – can display two metagenome populations on the same phylogenetic tree, uses BLAST file as input

• STAMP – calculates statistical differences between sets of metagenomes

• XIPE-TOTEC – performs pairwise comparisons of every metagenome in the two sets, creates a distance matrix which is then used for clustering and PCA analysis to calculate statistical values of relatedness

Page 31: Bioinformatics in the CDC Biotechnology Core Facility Branch

Megan

Page 32: Bioinformatics in the CDC Biotechnology Core Facility Branch
Page 33: Bioinformatics in the CDC Biotechnology Core Facility Branch

Ugandan Outbreak Samples

• 4 patients
• Total RNA from patient sera
• 2 samples per 454 run
• ~565,000 reads/sample, avg length = 235 nt
• Sequences were screened for random library amplification primers and low quality
• Assembled each run de novo using the 454 gsAssembler
• Performed a blastx database search using the assembled contigs (overnight)
• Visualized the BLAST output using MEGAN

Page 34: Bioinformatics in the CDC Biotechnology Core Facility Branch

MEGAN (MetaGenomeANalyzer)

Page 35: Bioinformatics in the CDC Biotechnology Core Facility Branch

Ugandan Outbreak - results

• Run 1 – 5 contigs (out of 2463 > 100 nt) matched YF virus, covering 98% of the genome (10,441 of 10,823 bp)

• Mapped each sample from Run 1 using an Ethiopian YF virus as reference. 3229 individual reads from Sample 1 identified as YF.

• Run 2 – no YF reads found

Page 36: Bioinformatics in the CDC Biotechnology Core Facility Branch

Phylogenetic analysis of yellow fever virus sequences

Laura McMullan (DHPP/VSPB)

Page 37: Bioinformatics in the CDC Biotechnology Core Facility Branch

Comparative Metagenomics – current work

• One 454 run
• Two samples
  – Sample 1 – ~578,000 reads, avg read length 438 bases
  – Sample 2 – ~550,000 reads, avg read length 425 bases
• Total number of bases sequenced – ~488,000,000
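The total follows from reads × average read length for each sample. The check below uses the rounded figures from the slide, so it lands slightly under the quoted ~488,000,000; the variable names are illustrative.

```python
# (approximate read count, average read length in bases), as quoted on the slide
samples = {
    "Sample 1": (578_000, 438),
    "Sample 2": (550_000, 425),
}

bases_per_sample = {name: reads * length for name, (reads, length) in samples.items()}
total_bases = sum(bases_per_sample.values())

for name, bases in bases_per_sample.items():
    print(f"{name}: {bases:,} bases")
print(f"Total: {total_bases:,} bases")
```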

Page 38: Bioinformatics in the CDC Biotechnology Core Facility Branch

Sample 1 – Rarefaction Curve

Page 39: Bioinformatics in the CDC Biotechnology Core Facility Branch

Sample 1 Taxa tree (collapsed at the Order level)

Page 40: Bioinformatics in the CDC Biotechnology Core Facility Branch

Comparison of Sample 1 and 2

Page 41: Bioinformatics in the CDC Biotechnology Core Facility Branch

Bioinformatics Tools

• Bioinformatics Packages: EMBOSS, BioInquiry
• General Tools: Java/BioJava, Perl/BioPerl, BLAST Suite, BioEdit, GFFtoPS
• Genome Comparison/Alignment Tools: Mavid, Mauve, Clustal, Muscle
• Gene Prediction: Glimmer, GeneMark
• Assembly/Mapping Tools: 454 Suite, Mosaik Tools, Mummer, CLC Bio, BWA, Velvet, AHA (PacBio)
• Functional Annotation: Manatee
• Phylogenetics: Paup, Phylip, MrBayes, Beauti/Beast, MEGA, DnaSP
• Metagenomics: MEGAN, Galaxy, Carma
• In-House: WAMS, POCs/VOCs

Page 42: Bioinformatics in the CDC Biotechnology Core Facility Branch

Challenges

Data Management – image files are large (1 run ~25 GB); moving these files around the network is slow

Assembly/Mapping Software – Some are provided with the instrument, but additional methods and algorithms are needed

Finishing Tools – gap filling, primer design

Visualization Tools – tools to graphically display contigs on reference sequence as well as genome multiple alignments

Generic Robust Annotation Tools – Researchers need tools to intelligently choose predicted ORFs as genes, assign function, and submit to GenBank

Page 43: Bioinformatics in the CDC Biotechnology Core Facility Branch
Page 44: Bioinformatics in the CDC Biotechnology Core Facility Branch

What are the weaknesses of current next-gen sequencers?

• Complicated and time-consuming library preparation
  – Requires micrograms of DNA to begin
  – 3 days to prepare library
• Requires amplification of library
  – Low copy number polymorphisms may be missed
  – Emulsion PCR is an inefficient, time-consuming, oily mess
  – Potential to introduce PCR bias into sample
• Instruments require repetitive sequential ‘flows’ of reagents
  – Repetitive flows of nucleotides, blocking/unblocking chemistry, and washing out reaction byproducts all slow synthesis and hinder read length
  – Consumes liters of reagents ($)
  – Repetitive flows and imaging extend sequence runs to days (or weeks)

Page 45: Bioinformatics in the CDC Biotechnology Core Facility Branch

Pacific Biosciences SMRT sequencer (single-molecule sequencer)

Ion Torrent Personal Genome Machine (solid-state sequencer)

Nanopore sequencing

Page 46: Bioinformatics in the CDC Biotechnology Core Facility Branch

Pacific Biosciences SMRT sequencer

Sponsor: Influenza Research Agenda

Page 47: Bioinformatics in the CDC Biotechnology Core Facility Branch

Pacific Biosciences SMRT Technology

[Figure: individual ZMW (~50 nm) on glass with attached polymerase and DNA strand; laser excitation/detection volume. Functional volume (red) is in zL!]

SMRTcell = 160,000 ZMW; SMRTcell array = 1.5 million ZMW

Page 48: Bioinformatics in the CDC Biotechnology Core Facility Branch

Nucleotide incorporation is a realtime data movie

100 ms

Page 49: Bioinformatics in the CDC Biotechnology Core Facility Branch

Pacific Biosciences Advantages

• Read lengths of 1,000 – 10,000 bases
• No reagent ‘flows’ = 10-fold increase in sequencing speed
• Substitute reverse transcriptase for polymerase and sequence RNA directly
• Bacterial genomes sequenced in hours
• Sequence run costs $99; takes 15 minutes to complete

Page 50: Bioinformatics in the CDC Biotechnology Core Facility Branch
Page 51: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing

• DNA Library Prep
• emPCR Amplification
• Sequencing
• Data Analysis

Page 52: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: DNA Prep

• Nebulization
  – DNA is sheared with high-pressure nitrogen to create fragments ~300-800 bases long
• Repair Ends
  – double-stranded pieces are purified, blunt ended, and phosphorylated
• Adaptor Ligation
  – two different adaptors, A and B, are ligated to the fragment
  – each adaptor is 44 bases long: 20-base PCR primer, 20-base sequencing primer, 4-base key
  – the B adaptor contains a biotin tag for immobilization
  – ligation forms 4 different strands: A-A, A-B, B-A, B-B
• Fragment Immobilization
  – fragments are immobilized on streptavidin-coated magnetic beads; A-A strands will not bind and are washed away
• Single-strand Isolation
  – bound fragments are denatured, and the released strands (containing both an A and a B tag) form a single-stranded template DNA library
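The selection logic of the ligation and immobilization steps can be enumerated in a few lines. The sketch below lists the four adaptor combinations and the fate of each. The A-A and A-B/B-A outcomes follow the slide; the B-B outcome (it stays on the bead because both strands carry biotin) is an assumption based on the general 454 protocol and is not stated on the slide. All names are illustrative.

```python
from itertools import product

ADAPTOR_LENGTH = 20 + 20 + 4  # PCR primer + sequencing primer + key, in bases

WASHED_AWAY = "washed away"      # never binds streptavidin
STAYS_ON_BEAD = "stays on bead"  # assumed: both strands are biotinylated
LIBRARY = "library"              # released strand carries both A and B


def fate(left, right):
    """Classify a ligated fragment by the adaptors on its two ends."""
    ends = {left, right}
    if "B" not in ends:
        return WASHED_AWAY    # A-A: no biotin tag
    if "A" not in ends:
        return STAYS_ON_BEAD  # B-B: assumption, see lead-in
    return LIBRARY            # A-B or B-A


outcomes = {f"{l}-{r}": fate(l, r) for l, r in product("AB", repeat=2)}
for fragment, result in outcomes.items():
    print(fragment, "->", result)
```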

Page 53: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: emulsion PCR

Emulsion-based clonal PCR
• Annealing
  – fragments are annealed to primer-tagged “catcher” beads
  – optimized to anneal a single strand to a single bead
• Distribution in a water-oil emulsion
  – the captured DNA and beads, along with amplification reagents, are placed in a water-oil mixture
  – each bead is captured in a “bubble” that forms its own small “micro-reactor”
  – the emulsion is thermocycled, creating millions of copies of a single clonal fragment in individual “micro-reactors”
  – cleaned up and denatured

Page 54: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: Sequencing by Synthesis

• Bead Preparation - sequencing primer attached and polymerase and cofactors are added

• Bead Deposition – beads are layered on a picotiter plate (wells are 44 μm), then enzyme beads and packing beads are added

Page 55: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: Sequencing by Synthesis (cont.)

• Sequencing
  – enzyme beads contain sulfurylase and luciferase; packing beads help keep reaction beads in position
  – a fluidics system delivers sequencing reagents, flowing the nucleotides one at a time in a specific order across the wells

Page 56: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: Sequencing by Synthesis (cont.)

• Sequencing
  – if a nucleotide is incorporated, a pyrophosphate is released, which is converted to ATP by the sulfurylase
  – the luciferase enzyme uses the ATP to oxidize luciferin, producing oxyluciferin and light
  – the light emission is measured with a CCD camera
  – light intensity indicates nucleotide incorporation
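To make the last point concrete, the sketch below turns a series of per-flow light intensities into bases. It assumes the signal has already been normalized so that 1.0 corresponds to one incorporated nucleotide, and that a homopolymer run of n bases gives a signal near n. The `TACG` flow order and the example intensities are assumptions for illustration, not values from the slide.

```python
def call_bases(flow_order, intensities):
    """Translate normalized per-flow light intensities into a base string."""
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up to the nearest whole number of incorporations.
        n_incorporated = max(0, int(signal + 0.5))
        bases.append(nucleotide * n_incorporated)
    return "".join(bases)


# Two cycles of four flows each: T, A, C, G, T, A, C, G.
toy_flowgram = [1.02, 0.05, 0.96, 2.10, 0.0, 1.1, 0.0, 0.9]
print(call_bases("TACG", toy_flowgram))
```

Because run length is inferred from signal strength, long homopolymers are the hardest case: the gap between, say, 7 and 8 incorporations is small relative to signal noise.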

Page 57: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: Sequencing by Synthesis (cont.)

• Characteristics
  – Flow of the four nucleotides is repeated for one hundred cycles, resulting in an average read length of 300-500 bases
  – system averages ~1,000,000 high-quality wells
  – therefore, a typical run yields over 400 million high-quality bases
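The yield figure is the product of the well count and the read length. A quick check with the slide's numbers, taking the midpoint of the 300-500 base range as the average (an assumption), with illustrative variable names:

```python
CYCLES = 100
NUCLEOTIDES_PER_CYCLE = 4
HIGH_QUALITY_WELLS = 1_000_000
READ_LENGTH_RANGE = (300, 500)  # bases

flows_per_run = CYCLES * NUCLEOTIDES_PER_CYCLE
mean_read_length = sum(READ_LENGTH_RANGE) / 2  # midpoint, assumed as the average
bases_per_run = int(HIGH_QUALITY_WELLS * mean_read_length)

print(f"{flows_per_run} nucleotide flows per run")
print(f"~{bases_per_run:,} bases per run")
```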

Page 58: Bioinformatics in the CDC Biotechnology Core Facility Branch

454 Sequencing: Paired End Protocol