rapid outbreak characterisation - uk genome sciences 2014 - wed 3 sep 2014

Rapid bacterial outbreak characterisation from whole genome sequencing

Torsten Seemann

Genome Science: Biology, Technology & Bioinformatics - Wed 3 Sep 2014 - Oxford, UK - #UKGS2014

About me

● Victorian Bioinformatics Consortium

o Monash University, Melbourne, Australia

● Microbial genomics

o bacterial pathogens; some parasites, viruses, fungi

● Tool development

o Prokka, Nesoni, VelvetOptimiser, Snippy, ...

Microbial Diagnostic Unit

● Oldest public health lab in Australia

o established 1897 in Melbourne

o large historical isolate collection back to 1950s

● National reference laboratory

o Salmonella, Listeria, EHEC

● WHO regional reference lab

o vaccine preventable invasive bacterial pathogens

New director

● Professor Ben Howden

o clinician, microbiologist, pathologist

o early adopter of genomics and bioinformatics

● Mandate

o modernise service delivery

o enhance research output and collaboration

o nationally lead the conversion to WGS

Outbreak scenario

● Receive samples (human, animal, enviro)

● Extract, culture, isolate

● Identification via phenotype, growth, media

● Typing: MLST, MLVA, PFGE, phage, sero, ...

● Screening: VITEK

● Report back to hospital, state government

Traditional typing

● Low resolution

o small subset of genome

MLST ~7 core genes

MLVA uses handful of VNTR regions

o requires constant curation of new genotypes

● Labour intensive

o time consuming

o relatively expensive

Whole Genome Sequencing

● Backward compatible

o can derive most traditional genotypes

● High resolution

o all variation, plasmids, AbR & virulence genes

● High throughput

o cheap, fast - one assay replaces many

Resistance to change

● Protecting empires

o “this is how we’ve always done it”, job redundancies

● Expense of instruments

o capital purchase, new staff, maintenance

● Lack of bioinformatics support

o infrastructure, software, training

● Legal requirements

o must do PFGE, validation, accreditation

A vision for Australia

● A common online system for all labs

o upload samples

o automated standard analysis pipelines

● Access control

o each lab controls their own data

o jurisdictions can share data in national outbreaks

● Deploy on our national research cloud

o no investment or expertise needed

o can deploy private version if desired

Suggested pipeline

● Input

o FASTQ files for each isolate

● Per isolate output

o de novo assembly & annotation

o typing (species dependent)

o antibiotic resistance & virulence genes

● Per outbreak output

o annotated phylogenomic tree

o SNP distances, clonality predictions
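The "SNP distances" output can be illustrated with a small sketch. It assumes each isolate has already been reduced to a string of alleles at the same core SNP positions. The isolate names and allele strings are made up for illustration and are not from the talk.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count positions where two equal-length allele strings differ."""
    if len(a) != len(b):
        raise ValueError("allele strings must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(alleles: dict) -> dict:
    """Pairwise SNP distance for every unordered pair of isolates."""
    return {
        (p, q): snp_distance(alleles[p], alleles[q])
        for p, q in combinations(sorted(alleles), 2)
    }


# Hypothetical core SNP alleles for three isolates
core = {"iso1": "ACGTA", "iso2": "ACGTT", "iso3": "TCGAT"}
distances = distance_matrix(core)
```

Small distances between isolates are one input to a clonality call. Any threshold would be species dependent and is not given in the slides.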

Design goals

● Speed

o multi-threaded wherever possible

● Modular

o Unix-style reusable components

● Deployable on cloud

o Amazon, Nectar (.au), CLIMB (.uk)

● Open source

o Auditable, community contribution

Progress

● Currently

o assessing existing components

o implementing new ones - all on GitHub

● No final product yet

o but some components are usable now

● Rolling out in 2015

o labs around Australia will opt in, most are keen

Identifying isolates

● De novo assembly approach

o assemble into contigs

o BLAST contigs against all microbial sequences

o best hits, highest coverage

● Assembly free method

o build index of all microbial k-mers w/ taxonomy

o scan k-mers from reads and tally

o Kraken, BioBloomTools, ...
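The assembly free idea (index k-mers with taxonomy, then scan reads and tally) can be sketched in a few lines. This is a toy illustration, not how Kraken or BioBloomTools are implemented. The k-mer size, taxon names and sequences are invented, and k-mers shared between taxa are simply skipped, where a real tool would resolve them through the taxonomy.

```python
from collections import Counter

K = 5  # toy k-mer size; real tools use much longer k-mers


def kmers(seq: str, k: int = K):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references: dict, k: int = K) -> dict:
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(reads: list, index: dict, k: int = K) -> Counter:
    """Tally k-mer hits per taxon; ambiguous or unknown k-mers are skipped."""
    tally = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            taxa = index.get(kmer, set())
            if len(taxa) == 1:
                tally[next(iter(taxa))] += 1
    return tally


# Hypothetical reference sequences and reads
refs = {"taxonA": "ACGTACGGTTCA", "taxonB": "TTGACCATGGCA"}
reads = ["ACGTACGG", "GGTTCA"]
tally = classify(reads, build_index(refs))
best = tally.most_common(1)[0][0]
```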

Kraken report

     %   clade  direct  rank  taxid    name
  1.04    1046    1046  U     0        unclassified
 98.96   99624     142  -     1        root
 98.81   99473       1  -     131567     cellular organisms
 98.81   99472     194  D     2            Bacteria
 98.57   99233     111  P     1224           Proteobacteria
 98.45   99110     318  C     1236             Gammaproteobacteria
 98.07   98728       0  O     91347              Enterobacteriales
 98.07   98728   52477  F     543                  Enterobacteriaceae
 44.95   45256     665  G     561                    Escherichia
 44.20   44498   33391  S     562                      Escherichia coli
  8.84    8899    8899  -     1274814                    Escherichia coli APEC O78
  0.29     287       0  -     244319                     Escherichia coli O26:H11
  0.29     287     287  -     573235                       Escherichia coli O26:H11 str 11368
  0.21     216     216  -     316401                     Escherichia coli ETEC H10407
  0.19     193       0  -     168807                     Escherichia coli O127:H6
  0.19     193     193  -     574521                       Escherichia coli O127:H6 str E2348/69

http://ccb.jhu.edu/software/kraken
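A report like this is easy to post-process. The sketch below assumes the tab-separated layout Kraken writes (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name) and picks the best-supported taxon at a given rank. The function names are hypothetical, and the three input rows are copied from the report above.

```python
def parse_kraken_report(text: str) -> list:
    """Parse tab-separated Kraken report lines into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, direct, rank, taxid, name = line.split("\t", 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # drop the indentation that encodes depth
        })
    return rows


def top_at_rank(rows: list, rank: str):
    """Return the row with the most clade reads at a rank code, or None."""
    candidates = [r for r in rows if r["rank"] == rank]
    return max(candidates, key=lambda r: r["clade_reads"]) if candidates else None


report = "\n".join([
    " 98.07\t98728\t52477\tF\t543\t            Enterobacteriaceae",
    " 44.95\t45256\t665\tG\t561\t              Escherichia",
    " 44.20\t44498\t33391\tS\t562\t                Escherichia coli",
])
species = top_at_rank(parse_kraken_report(report), "S")
```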

Assembill

● Decent automated assemblies

o only 3 parameters: outdir + R1.fq.gz + R2.fq.gz

o supports multithreading at all steps

● Main steps

o adaptor removal & quality trimming (Skewer)

o selection of K from k-mer spectra (KmerGenie)

o de novo assembly (Velvet, SPAdes)

o ordering of contigs against reference (MUMmer)

Prokka

● Prokaryotic Annotation

o only 2 parameters: outdir + contigs.fa

o scales to about 32 threads

● Finds

o CDS, tRNA, tmRNA, rRNA, some ncRNA

o CRISPR, signal peptides

● Produces

o GenBank, GFF3, Sequin, FASTA, ...

mlst

● Multi-Locus Sequence Typing

o only 2 parameters: scheme + contigs.fa

● Can mass-screen hundreds of assemblies

o comes bundled with PubMLST database

● Output

o tab/comma separated values

o file, scheme, ST, allele1, allele2, ...
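The typing step boils down to a table lookup: the allele number found at each locus forms a profile, and the profile maps to a sequence type (ST). The sketch below is an assumption-laden toy, not the mlst tool's code. The scheme name, loci and profile table are invented, whereas a real run would use a PubMLST scheme.

```python
LOCI = ("geneA", "geneB", "geneC")  # hypothetical scheme with three loci

# Hypothetical profile table: allele numbers per locus -> sequence type
PROFILES = {
    (1, 1, 1): 1,
    (1, 2, 1): 2,
    (3, 2, 5): 7,
}


def call_st(alleles: dict) -> str:
    """Look up the ST for a set of allele calls; '-' when the profile is unknown."""
    key = tuple(alleles.get(locus) for locus in LOCI)
    st = PROFILES.get(key)
    return str(st) if st is not None else "-"


def report_line(filename: str, scheme: str, alleles: dict) -> str:
    """Format one tab-separated row: file, scheme, ST, then one allele per locus."""
    fields = [filename, scheme, call_st(alleles)]
    fields += [str(alleles.get(locus, "-")) for locus in LOCI]
    return "\t".join(fields)


line = report_line("contigs.fa", "toy", {"geneA": 3, "geneB": 2, "geneC": 5})
novel = report_line("other.fa", "toy", {"geneA": 9, "geneB": 2, "geneC": 5})
```

An unknown profile is reported rather than dropped, which is where the "constant curation of new genotypes" cost of traditional schemes shows up.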

AbRicate

● Identify known antibiotic resistance (AbR) genes

o only 1 parameter: contigs.fa

● Only as good as the underlying database

o bundled with ResFinder

o does not detect resistance conferred by SNPs (point mutations)

● Output

o tab/comma separated table

Wombac

● Quickly identify core genome SNPs

● Efficiently use all CPUs and RAM

● Re-use previous reference alignments

● Cheap to calculate new core subsets

Read alignment

Use BWA MEM

● Do not need to clip reads

● Deduces the fragment library attributes

● Marks multi-mapping reads properly

● Scales linearly to >100 cores

● Outputs SAM directly

Sorted BAM

● No intermediate files

o use Unix pipes

● Multiple CPUs with SAMtools 0.1.19+

o use the -@ command line parameter

bwa → samtools view → samtools sort → BAM
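The "no intermediate files" pipeline can be driven from a script as well as from a shell. The sketch below is one possible arrangement and makes an assumption the slide does not: it targets samtools 1.3 or later, where `samtools sort` reads SAM on standard input and takes `-o`, so the separate `samtools view` step is folded away. The function names and file names are hypothetical.

```python
import subprocess


def alignment_commands(ref, r1, r2, out_bam, threads=8):
    """Build argv lists for `bwa mem` piped into `samtools sort`."""
    bwa = ["bwa", "mem", "-t", str(threads), ref, r1, r2]
    # "-" tells samtools to read the alignment stream from stdin
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa, sort


def run_alignment(ref, r1, r2, out_bam, threads=8):
    """Run both tools connected by a pipe, with no SAM file written to disk."""
    bwa_cmd, sort_cmd = alignment_commands(ref, r1, r2, out_bam, threads)
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    sort = subprocess.Popen(sort_cmd, stdin=bwa.stdout)
    bwa.stdout.close()  # so bwa sees a broken pipe if samtools exits early
    sort_rc = sort.wait()
    bwa_rc = bwa.wait()
    if bwa_rc != 0 or sort_rc != 0:
        raise RuntimeError(f"pipeline failed: bwa={bwa_rc}, samtools={sort_rc}")


# Only build the commands here; run_alignment() needs bwa and samtools installed
bwa_cmd, sort_cmd = alignment_commands("ref.fa", "R1.fq.gz", "R2.fq.gz", "out.bam", threads=4)
```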

SNP calling

● FreeBayes

o set in haploid mode (-p 1)

o set regular parameters (mindepth, minfrac)

o call variants in all samples jointly (more power)

o single multi-isolate VCF output

freebayes -f ref.fa -p 1 *.bam > all.vcf

Parallel FreeBayes

● FreeBayes is single threaded

o divide genome into regions

o run separate freebayes in parallel on each region

o merge the results

o scales nearly linearly!

fasta_generate_regions.py ref.fa.fai 100000 > regions.txt

freebayes-parallel regions.txt 32 -f ref.fa -p 1 *.bam > all.vcf
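The "divide genome into regions" step is simple enough to sketch. The function below mirrors the idea rather than the actual helper script: it takes contig lengths and a chunk size and emits `name:start-end` strings, assuming 0-based starts and exclusive ends. The contig names and lengths are invented.

```python
def generate_regions(contig_lengths: dict, size: int) -> list:
    """Split each contig into consecutive chunks of at most `size` bases."""
    if size <= 0:
        raise ValueError("size must be positive")
    regions = []
    for name, length in contig_lengths.items():
        for start in range(0, length, size):
            end = min(start + size, length)  # last chunk may be shorter
            regions.append(f"{name}:{start}-{end}")
    return regions


# Hypothetical reference with one chromosome and one plasmid
regions = generate_regions({"chr": 250, "plasmid": 80}, size=100)
```

Each region can then be handed to an independent variant-calling job and the per-region outputs merged in region order. Smaller chunks balance load better across cores at the cost of more job overhead.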

Select core SNPs

● Core SNPs

o position present in every isolate

o more than one allele (not wholly conserved)

o usually ignore indels and other odd genotypes

● Recombination

o not all core SNPs are real

o many result of recombination

o should be filtered out, could alter tree topology
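The core SNP rules above translate directly into a filter. This is a minimal sketch under stated assumptions, not Wombac's implementation: genotype calls are held as a hypothetical `calls[isolate][position] = allele` mapping, missing calls are marked with `-`, `N` or `.`, and recombination filtering is not attempted.

```python
MISSING = {"-", "N", "."}  # assumed markers for an uncalled position


def core_snp_positions(calls: dict) -> list:
    """Return positions called in every isolate that carry more than one allele."""
    isolates = list(calls)
    if not isolates:
        return []
    # Positions present in every isolate
    shared = set(calls[isolates[0]])
    for iso in isolates[1:]:
        shared &= set(calls[iso])
    core = []
    for pos in sorted(shared):
        alleles = {calls[iso][pos] for iso in isolates}
        if alleles & MISSING:
            continue  # not truly present in every isolate
        if any(len(a) != 1 for a in alleles):
            continue  # ignore indels and other multi-base genotypes
        if len(alleles) > 1:
            core.append(pos)  # variable, so informative for the tree
    return core


# Hypothetical calls for three isolates
calls = {
    "iso1": {10: "A", 20: "C", 30: "G", 40: "T"},
    "iso2": {10: "A", 20: "T", 30: "G", 40: "TA"},
    "iso3": {10: "A", 20: "C", 40: "T"},
}
core = core_snp_positions(calls)
subset = core_snp_positions({k: calls[k] for k in ("iso1", "iso3")})
```

Because the filter only reads existing calls, rerunning it on a subset of isolates is cheap, which is the property the talk relies on for recalculating a core tree.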

Wombac speed

● Example

o 130 E. coli isolates, MiSeq 300bp PE

o With 32 cores, used < 4GB RAM/core

o Took just over 1 hour

● Add a new sample

o Re-use existing alignments

o Will migrate to gVCF method that GATK will use

● Recalculate a core tree on subset

o < 1 minute, just reprocess the VCF file

Contact

Email [email protected]

Twitter @torstenseemann

Blog TheGenomeFactory.blogspot.com

Web bioinformatics.net.au
