rapid outbreak characterisation - uk genome sciences 2014 - wed 3 sep 2014
Rapid bacterial outbreak characterisation from whole genome sequencing
Torsten Seemann
Genome Science: Biology, Technology & Bioinformatics - Wed 3 Sep 2014 - Oxford, UK - #UKGS2014
About me
● Victorian Bioinformatics Consortium
o Monash University, Melbourne, Australia
● Microbial genomics
o bacterial pathogens; some parasites, viruses, fungi
● Tool development
o Prokka, Nesoni, VelvetOptimiser, Snippy, ...
Microbial Diagnostic Unit
● Oldest public health lab in Australia
o established 1897 in Melbourne
o large historical isolate collection back to 1950s
● National reference laboratory
o Salmonella, Listeria, EHEC
● WHO regional reference lab
o vaccine preventable invasive bacterial pathogens
New director
● Professor Ben Howden
o clinician, microbiologist, pathologist
o early adopter of genomics and bioinformatics
● Mandate
o modernise service delivery
o enhance research output and collaboration
o nationally lead the conversion to WGS
Outbreak scenario
● Receive samples (human, animal, enviro)
● Extract, culture, isolate
● Identification via phenotype, growth, media
● Typing: MLST, MLVA, PFGE, phage, sero, ...
● Screening: VITEK
● Report back to hospital, state government
Traditional typing
● Low resolution
o small subset of genome
MLST ~7 core genes
MLVA uses handful of VNTR regions
o requires constant curation of new genotypes
● Labour intensive
o time consuming
o relatively expensive
Whole Genome Sequencing
● Backward compatible
o can derive most traditional genotypes
● High resolution
o all variation, plasmids, AbR & virulence genes
● High throughput
o cheap, fast - one assay replaces many
Resistance to change
● Protecting empires
o “this is how we’ve always done it”, job redundancies
● Expense of instruments
o capital purchase, new staff, maintenance
● Lack of bioinformatics support
o infrastructure, software, training
● Legal requirements
o must do PFGE, validation, accreditation
A vision for Australia
● A common online system for all labs
o upload samples
o automated standard analysis pipelines
● Access control
o each lab controls their own data
o jurisdictions can share data in national outbreaks
● Deploy on our national research cloud
o no investment or expertise needed
o can deploy private version if desired
Suggested pipeline
● Input
o FASTQ files for each isolate
● Per isolate output
o de novo assembly & annotation
o typing (species dependent)
o antibiotic resistance & virulence genes
● Per outbreak output
o annotated phylogenomic tree
o SNP distances, clonality predictions
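The "SNP distances" output can be illustrated with a minimal Python sketch. It assumes each isolate has already been reduced to an equal-length string of core SNP alleles; the isolate names and sequences are made up for illustration, and the function names are hypothetical rather than part of any tool in this talk.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions where two equal-length core SNP strings differ."""
    if len(a) != len(b):
        raise ValueError("core SNP strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(core):
    """Return {(isolate_i, isolate_j): distance} for every unordered pair."""
    return {
        (i, j): snp_distance(core[i], core[j])
        for i, j in combinations(sorted(core), 2)
    }


# Toy data (assumption): three isolates, five core SNP sites.
core_snps = {"iso1": "ACGTA", "iso2": "ACGTA", "iso3": "TCGAA"}
dists = distance_matrix(core_snps)
for pair, d in sorted(dists.items()):
    print(pair, d)
```

Isolates at distance 0 (here iso1 and iso2) are candidates for the same clone; any real clonality threshold is species and outbreak dependent.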
Design goals
● Speed
o multi-threaded wherever possible
● Modular
o Unix-style reusable components
● Deployable on cloud
o Amazon, Nectar (.au), CLIMB (.uk)
● Open source
o auditable, community contribution
Progress
● Currently
o assessing existing components
o implementing new ones - all on GitHub
● No final product yet
o but some components are usable now
● Rolling out in 2015
o labs around Australia will opt in, most are keen
Identifying isolates
● De novo assembly approach
o assemble into contigs
o BLAST contigs against all microbial sequences
o best hits, highest coverage
● Assembly free method
o build index of all microbial k-mers w/ taxonomy
o scan k-mers from reads and tally
o Kraken, BioBloomTools, ...
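The assembly free idea (index k-mers with their taxon, then scan reads and tally) can be sketched in a few lines of Python. This is a deliberately simplified toy, not how Kraken is implemented: it only counts k-mers that occur in exactly one taxon, whereas Kraken assigns shared k-mers to a lowest common ancestor in the taxonomy. The reference sequences, taxon labels and the tiny k are all assumptions for illustration.

```python
from collections import Counter

K = 5  # toy k-mer size (assumption); real tools use much longer k-mers


def kmers(seq, k=K):
    """Yield every overlapping k-mer in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k=K):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(taxon)
    return index


def classify_read(read, index, k=K):
    """Tally k-mers unique to one taxon; return the top taxon or None."""
    tally = Counter()
    for km in kmers(read, k):
        taxa = index.get(km)
        if taxa and len(taxa) == 1:
            tally[next(iter(taxa))] += 1
    return tally.most_common(1)[0][0] if tally else None


# Toy references (assumption): two made-up taxa.
references = {
    "taxon_A": "ACGTACGGTTCAGGCTA",
    "taxon_B": "TTGACCGTAGGCATCGA",
}
index = build_index(references)
print(classify_read("CGGTTCAGG", index))
print(classify_read("CCGTAGGCA", index))
print(classify_read("GGGGGGGG", index))
```

Reads with no indexed k-mers come back as None, which corresponds to the "unclassified" row of a report.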
Kraken report
 1.04   1046   1046  U        0  unclassified
98.96  99624    142  -        1  root
98.81  99473      1  -   131567    cellular organisms
98.81  99472    194  D        2      Bacteria
98.57  99233    111  P     1224        Proteobacteria
98.45  99110    318  C     1236          Gammaproteobacteria
98.07  98728      0  O    91347            Enterobacteriales
98.07  98728  52477  F      543              Enterobacteriaceae
44.95  45256    665  G      561                Escherichia
44.20  44498  33391  S      562                  Escherichia coli
 8.84   8899   8899  -  1274814                    Escherichia coli APEC O78
 0.29    287      0  -   244319                    Escherichia coli O26:H11
 0.29    287    287  -   573235                      Escherichia coli O26:H11 str 11368
 0.21    216    216  -   316401                    Escherichia coli ETEC H10407
 0.19    193      0  -   168807                    Escherichia coli O127:H6
 0.19    193    193  -   574521                      Escherichia coli O127:H6 str E2348/69
http://ccb.jhu.edu/software/kraken
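A report like the one above is easy to post-process. The Python sketch below assumes six whitespace-separated columns (percent of reads, reads in the clade, reads assigned directly, rank code, taxonomy ID, name) and picks the species-rank row with the most reads. The function names are hypothetical, and the sample text is three rows copied from the report above.

```python
def parse_kraken_report(text):
    """Parse report lines into dicts; the name column may contain spaces."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, direct, rank, taxid, name = line.split(None, 5)
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade),
            "direct_reads": int(direct),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),
        })
    return rows


def top_species(rows):
    """Return the species-rank (S) row with the most clade reads, or None."""
    species = [r for r in rows if r["rank"] == "S"]
    return max(species, key=lambda r: r["clade_reads"]) if species else None


report = """\
98.07 98728 52477 F 543 Enterobacteriaceae
44.95 45256 665 G 561 Escherichia
44.20 44498 33391 S 562 Escherichia coli
"""
rows = parse_kraken_report(report)
best = top_species(rows)
print(best["name"], best["percent"])
```

In this example more than half of the reads stop at family level (52477 assigned directly to Enterobacteriaceae), so the species call rests on under half of the data.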
Assembill
● Decent automated assemblies
o only 3 parameters: outdir + R1.fq.gz + R2.fq.gz
o supports multithreading at all steps
● Main steps
o adaptor removal & quality trimming (Skewer)
o selection of K from k-mer spectra (KmerGenie)
o de novo assembly (Velvet, SPAdes)
o ordering of contigs against reference (MUMmer)
Prokka
● Prokaryotic Annotation
o only 2 parameters: outdir + contigs.fa
o scales to about 32 threads
● Finds
o CDS, tRNA, tmRNA, rRNA, some ncRNA
o CRISPR, signal peptides
● Produces
o Genbank, GFF3, Sequin, FASTA, ...
mlst
● Multi-Locus Sequence Typing
o only 2 parameters: scheme + contigs.fa
● Can mass-screen hundreds of assemblies
o comes bundled with PubMLST database
● Output
o tab/comma separated values
o file, scheme, ST, allele1, allele2, ...
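When hundreds of assemblies are screened, the tabular output lends itself to grouping isolates by sequence type. The sketch below assumes tab-separated rows in the column order listed above (file, scheme, ST, then alleles); the file names, scheme name, ST values and allele numbers are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def group_by_st(tsv_text):
    """Group file names by (scheme, ST) from tab-separated typing output."""
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        filename, scheme, st = row[:3]
        groups[(scheme, st)].append(filename)
    return dict(groups)


# Toy output (assumption): three assemblies typed with a made-up scheme.
typing_output = (
    "iso1.fa\ttoy_scheme\t1\t3\t7\t2\n"
    "iso2.fa\ttoy_scheme\t1\t3\t7\t2\n"
    "iso3.fa\ttoy_scheme\t2\t3\t9\t2\n"
)
groups = group_by_st(typing_output)
for key, files in sorted(groups.items()):
    print(key, files)
```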
AbRicate
● Identify known AB resistance genes
o only 1 parameter: contigs.fa
● Only as good as the underlying database
o bundled with ResFinder
o does not cover SNP-based (point mutation) resistance
● Output
o tab/comma separated table
Wombac
● Quickly identify core genome SNPs
● Efficiently use all CPUs and RAM
● Re-use previous reference alignments
● Cheap to calculate new core subsets
Read alignment
Use BWA MEM
● Do not need to clip reads
● Deduces the fragment library attributes
● Marks multi-mapping reads properly
● Scales linearly to >100 cores
● Outputs SAM directly
Sorted BAM
● No intermediate files
o use Unix pipes
● Multiple CPUs with SAMtools 0.1.19+
o use the -@ command line parameter
bwa → samtools view → samtools sort → BAM
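The flow above can be written as a single piped command with no intermediate files. The Python sketch below only builds the command string; it assumes the `samtools sort -o` output syntax of SAMtools 1.3 or later (the 0.1.19 series used an output prefix instead), and the helper name and file names are hypothetical.

```python
import shlex


def sorted_bam_pipeline(ref, r1, r2, out_bam, threads=4):
    """Build a 'bwa mem | samtools view | samtools sort' shell pipeline."""
    q = shlex.quote  # protect file names containing spaces or shell characters
    return (
        f"bwa mem -t {threads} {q(ref)} {q(r1)} {q(r2)}"
        f" | samtools view -u -"  # SAM on stdin to uncompressed BAM
        f" | samtools sort -@ {threads} -o {q(out_bam)} -"
    )


cmd = sorted_bam_pipeline("ref.fa", "R1.fq.gz", "R2.fq.gz", "iso1.bam", threads=8)
print(cmd)
```

With the tools installed, the string could be run with `subprocess.run(cmd, shell=True, check=True)`. Newer SAMtools versions can read SAM directly in `samtools sort`, so the `view` step is kept here only to mirror the slide.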
SNP calling
● FreeBayes
o set in haploid mode (p=1)
o set regular parameters (mindepth, minfrac)
o call variants in all samples jointly (more power)
o single multi-isolate VCF output
freebayes -f ref.fa -p 1 *.bam > all.vcf
Parallel Freebayes
● FreeBayes is single threaded
o divide genome into regions
o run separate freebayes in parallel on each region
o merge the results
o scales nearly linearly!
fasta_generate_regions.py ref.fa.fai 100000 > regions.txt
freebayes-parallel regions.txt 32 -f ref.fa -p 1 *.bam > all.vcf
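The "divide genome into regions" step amounts to chopping each contig into fixed-size windows. The Python sketch below shows the idea, assuming contig lengths are already known (for example from a FASTA index) and using `contig:start-end` strings with a 0-based start and exclusive end. The contig names, lengths and the function name are made up, and this is not the script shipped with FreeBayes.

```python
def generate_regions(contig_lengths, region_size):
    """Split each contig into windows of at most region_size bases."""
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    regions = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, region_size):
            end = min(start + region_size, length)  # last window may be short
            regions.append(f"{contig}:{start}-{end}")
    return regions


# Toy genome (assumption): one chromosome and one plasmid.
regions = generate_regions({"chr": 250, "plasmid": 80}, 100)
print("\n".join(regions))
```

Each region can then be handed to a separate variant-calling process and the per-region results merged in order, which is why the approach scales with the number of cores.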
Select core SNPs
● Core SNPs
o position present in every isolate
o more than one allele (not wholly conserved)
o usually ignore indels and other odd genotypes
● Recombination
o not all core SNPs are real
o many result from recombination
o should be filtered out, could alter tree topology
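The core SNP definition above maps directly onto a column filter. The Python sketch below assumes each isolate is represented as a string of base calls aligned to the same reference coordinates, with any character other than A, C, G or T meaning "no confident call". The data and function name are hypothetical, and no recombination filtering is attempted.

```python
def core_snp_sites(alignment):
    """Return positions called in every isolate that carry >1 allele."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all isolates must be aligned to the same length")
    sites = []
    for pos in range(length):
        column = [alignment[n][pos].upper() for n in names]
        present_everywhere = all(base in "ACGT" for base in column)
        if present_everywhere and len(set(column)) > 1:
            sites.append(pos)
    return sites


# Toy alignment (assumption): '-' marks a position missing in iso3.
alignment = {
    "iso1": "ACGTAC",
    "iso2": "ACGTTC",
    "iso3": "AC-TAG",
}
all_core = core_snp_sites(alignment)
subset_core = core_snp_sites({k: alignment[k] for k in ("iso1", "iso2")})
print(all_core, subset_core)
```

Recomputing on a subset of isolates only re-filters existing calls, which is why a new core set is cheap to derive once the alignments and variant calls exist.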
Wombac speed
● Example
o 130 E. coli isolates, MiSeq 300bp PE
o With 32 cores, used < 4GB RAM/core
o Took just over 1 hour
● Add a new sample
o Re-use existing alignments
o Will migrate to gVCF method that GATK will use
● Recalculate a core tree on subset
o < 1 minute, just reprocess the VCF file
Contact
Email [email protected]
Twitter @torstenseemann
Blog TheGenomeFactory.blogspot.com
Web bioinformatics.net.au