Whole-Genome Sequencing (WGS)
for Food Safety
Errol Strain, Ph.D.
Director, Biostatistics and Bioinformatics Staff
Center for Food Safety and Applied Nutrition
U.S. Food and Drug Administration
IFSH Meeting
5/22/2017
FDA Regulatory Use Cases
1. Do these new bacterial isolates from environmental/product testing match any clinical isolates in the DB?
– Is this product/facility causing illness?
2. Do these new clinical isolates match any environmental/food isolates in the DB?
– Should we test product/swab a facility?
3. Are isolates collected at different points in time from the same facility a match?
– Is there a problem w/ a resident pathogen, harborage?
GenomeTrakr Data Flow
[Figure: GenomeTrakr Labs & Collaborators; Salmonella; Listeria]
NGS-Based Surveillance (prior to NCBI Pathogen Detection)
Initial Clustering: PFGE, K-mer, MASH, BLAST
– Goal: Find a group of 10-200 closely related isolates
SNP Pipeline: Find phylogenetically informative SNPs, FASTA alignment
Construct Phylogeny
[Diagram labels: NCBI, FDA, FDA, Missing]
NCBI Pathogen Detection
CFSAN vs NCBI SNPs
Scientific Evidence – Daubert Standard
1. Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.
2. Whether it has been subjected to peer review and publication.
– Specific/Target Studies for pathogen have been published. Multiple software packages for mapping and calling SNPs.
3. The known or potential error rate.
– Well characterized at read level, less so for cluster analysis.
4. The existence and maintenance of standards and controls concerning its operation.
– Proficiency testing efforts through Global Microbial Identifier and also FDA GenomeTrakr network.
5. The degree to which the theory and technique is generally accepted by a relevant scientific community.
– Acceptance facilitated by open database (NCBI/SRA).
Why Build A Pipeline?
1. Regulatory Use and/or Accredited Labs
– NCBI methods not public and peer-reviewed
– Chain of custody – local computation
– Results needed immediately
2. Pathogen and/or data not at NCBI
– Mycobacterium, Legionella*
– Food Industry – private data
What Kind of Pipeline?
SNPs vs wgMLST
Unit of Measure
– SNPs: Single nucleotide substitutions (other types of mutations are excluded)
– wgMLST: Allele, a variant of a gene. Variation could arise from a number of sources, including SNPs, insertions, deletions, etc.
Requirements
– SNPs: Complete or high-quality reference genome for mapping
– wgMLST: Database of named alleles, must be actively maintained
Pros
– SNPs: Extremely high resolution, methods have been published and validated
– wgMLST: Relatively fast, not directly dependent upon reference genome
Cons
– SNPs: Requires reference genome, computationally intense, requires local bioinformatics expertise
– wgMLST: Allele database must be centralized, cannot compute novel wgMLST types locally. wgMLST schemas not easy to publicly access
FDA Pipeline Requirements
1. Public, Peer-Reviewed
– Results may be subject to legal scrutiny
– Accessible to FDA-regulated industries
2. Reproducible
3. Documentation & Validation
4. Platform independent (fastq)
5. Run Locally
Background: CFSAN SNP Pipeline
Mapping/Aligning (66+) SNP Detection (16+)
Samtools
SOAPsnp
GATK
SNVer
VarScan
SHORE
SMALT
MaCH
IMPUTE2
CLC Bio
QualitySNPng
DNABaser
SNPdetector
FreeBayes
SolSNP
DNAStar
Bowtie2 VarScan
CFSAN SNP Pipeline
Documentation: http://snp-pipeline.rtfd.org
Source Code: https://github.com/CFSAN-Biostatistics/snp-pipeline
Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. (2014) An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2:e620 http://dx.doi.org/10.7717/peerj.620
Davis S, Pettengill JB, Luo Y, Payne J, Shpuntoff A, Rand H, Strain E. (2015) CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data. PeerJ Computer Science 1:e20 https://dx.doi.org/10.7717/peerj-cs.20
FDA/CFSAN Validation Efforts
1. Technical Performance
Accuracy: Salmonella LT2 and Agona SL483
2. Intralaboratory variation, sequencing platform
Salmonella Montevideo (180+ runs), PacBio vs short reads
3. Interlaboratory variation
Salmonella Braenderup BAA-664 (PFGE control), ISO/CEN WG,
GenomeTrakr PT set (Salmonella & Listeria), Global Microbial
Identifier PT
4. Bioinformatics Pipeline
Software Validation, Benchmark bioinformatic data sets
Collaborations w/ Canada, CDC, NIH/NCBI
Proficiency Testing:
• GenomeTrakr 2014, 2015:
– Each lab in the GT network sequenced the same set of 8 strains. CFSAN PT analysis returned.
– Manuscript in preparation
• GMI (yearly since 2013)
– 2016 PT has wet and dry lab components
– 2016 PT includes K. pneumoniae, L. monocytogenes, C. jejuni, E. coli
• PulseNet/GenomeTrakr harmonized PT
– Early 2017
CFSAN Workflow
“Min-diff” – Minimum SNP distance to an isolate of a different sample type
– Food/Environmental vs Clinical (or Microbe)
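The "Min-diff" definition above can be illustrated with a minimal Python sketch. The function name `min_diff`, the dictionary layout, and the toy isolate names and distances are all hypothetical, chosen only for illustration; this is not how NCBI or CFSAN necessarily compute the value.

```python
from typing import Dict, Optional, Tuple

def min_diff(isolate: str,
             sample_type: Dict[str, str],
             snp_distance: Dict[Tuple[str, str], int]) -> Optional[int]:
    """Smallest SNP distance from `isolate` to any isolate of a different sample type.

    Returns None when no isolate of a different type has a known distance.
    """
    best: Optional[int] = None
    for other, other_type in sample_type.items():
        # Skip the isolate itself and isolates of the same sample type.
        if other == isolate or other_type == sample_type[isolate]:
            continue
        # Distances are symmetric; accept either key order.
        d = snp_distance.get((isolate, other), snp_distance.get((other, isolate)))
        if d is not None and (best is None or d < best):
            best = d
    return best

# Hypothetical toy data: two food/environmental and two clinical isolates.
sample_type = {
    "food_1": "food/environmental",
    "env_1": "food/environmental",
    "clin_1": "clinical",
    "clin_2": "clinical",
}
snp_distance = {
    ("food_1", "env_1"): 1,
    ("food_1", "clin_1"): 8,
    ("food_1", "clin_2"): 35,
    ("env_1", "clin_1"): 9,
    ("env_1", "clin_2"): 36,
    ("clin_1", "clin_2"): 30,
}

print(min_diff("food_1", sample_type, snp_distance))
```

In this toy data, `food_1` is only 1 SNP from `env_1`, but that pair shares a sample type and is ignored, so the min-diff is the distance to the nearest clinical isolate.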
8 SNPs Check SNP Cluster
CFSAN Workflow
CFSAN SNP Pipeline is run on NCBI SNP cluster
– Reference – prefer complete genomes, drafts work almost as well
– High-Density SNP regions are filtered
>3 SNPs in 1000 bases, phages/recombination/etc.
– Phylogenetic inference – Maximum Likelihood
Ambiguous sites are treated as missing data
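The high-density filter above (>3 SNPs in 1000 bases) might be sketched as follows. This is an illustration under stated assumptions, not the CFSAN SNP Pipeline's actual implementation: the function name `filter_dense_snps`, the half-open window, and the choice to drop every SNP in a dense window are all assumptions.

```python
from bisect import bisect_left
from typing import List

def filter_dense_snps(positions: List[int],
                      max_snps: int = 3,
                      window: int = 1000) -> List[int]:
    """Drop SNPs that fall in any `window`-base stretch holding more than `max_snps` SNPs."""
    pos = sorted(positions)
    dense = set()
    for i, start in enumerate(pos):
        # Count SNPs in the half-open interval [start, start + window).
        j = bisect_left(pos, start + window)
        if j - i > max_snps:
            dense.update(pos[i:j])
    return [p for p in pos if p not in dense]

# Hypothetical positions: four SNPs packed into 300 bases, plus two isolated SNPs.
print(filter_dense_snps([100, 5000, 5100, 5200, 5300, 9000]))
```

Starting each window at a SNP position is enough here, because any 1000-base stretch containing several SNPs can be shifted to begin at its first SNP without losing any of them.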
Which Reference?
[Figure: SNP Difference (y-axis, -5 to 10) vs % Divergence (x-axis: 0%, 0.1%, 0.5%, 1%, 2.5%, 5%); scale labels: Strain, Subtype, PFGE, Serotype, Subspecies, Species]
CFSAN SNP Pipeline: Listeria Draft vs PacBio Genome
[Figure panels: High-Quality Draft; Complete (PacBio)]
Interpretation
SNP Distance
– How close are the isolates? No single threshold for all species/types, rough guides
1. <= 20 SNPs: match, virtually identical
2. 20-100 SNPs: inconclusive
3. > 100 SNPs: exclude
Bootstrapping
– Do the isolates form a unique cluster w/ >= 95% support? Is the cluster distinct from other isolates in the tree?
Results are critically evaluated and not used blindly
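The rough guides above can be written down as a small Python sketch. The function names and the handling of the boundary at exactly 20 SNPs (treated here as a match, following "<=20") are assumptions, and the slide is explicit that no single threshold fits all species/types, so the output is a starting point for review, not a verdict.

```python
def classify_snp_distance(snps: int) -> str:
    """Map a pairwise SNP distance to the slide's rough interpretation guide."""
    if snps < 0:
        raise ValueError("SNP distance cannot be negative")
    if snps <= 20:
        return "match"          # virtually identical
    if snps <= 100:
        return "inconclusive"
    return "exclude"

def has_cluster_support(bootstrap_percent: float, minimum: float = 95.0) -> bool:
    """True when the cluster's bootstrap support meets the >= 95% guide."""
    return bootstrap_percent >= minimum

# Hypothetical example: 8 SNPs apart, cluster supported in 97% of bootstrap replicates.
print(classify_snp_distance(8), has_cluster_support(97.0))
```

The two checks are kept separate on purpose: a small SNP distance and a well-supported, distinct cluster are different questions, and both feed the critical evaluation.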
Forensic Needs
WGS (SRA) Database:
– Random survey of bacteria not possible, need to continue to grow database and curate genotypes
Thresholds for SNPs vs wgMLST:
– 1 SNP ≠ 1 INDEL ≠ 1 Recombination
– Well-Documented wgMLST databases
Example: E. coli & Flour
www.ncbi.nlm.nih.gov/pathogens/
0-3 SNPs to clinical isolates
0-3 SNPs to other food/env isolates
CFSAN SNP Pipeline
Future of GenomeTrakr & CFSAN SNP Pipeline
1. Local or web-based QA/QC and identification tools
– Detect sample mix-ups and low quality before data is submitted to NCBI/SRA, fix problems more quickly
2. Continue to build WGS databases
– Better thresholds for identity, increase odds of finding a match
3. Local SNP pipeline analysis
– Accredited labs don’t have to send out data
Snapshot of Data – 3/1 to 4/30
SNP/ERD Clusters
* 2 or more isolates within 50 SNPs
Organism           # SNP Clusters*   % isolates in SNP clusters (3/2017)   Total
Campylobacter      242               69                                    1054
E. coli/Shigella   221               59 (56%)                              1132
Listeria           87                91 (89%)                              356
Salmonella         439               83 (86%)                              2100
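The cluster definition in the footnote (2 or more isolates within 50 SNPs) and the "% isolates in SNP clusters" column can be illustrated with a short Python sketch. It assumes single-linkage grouping (an isolate joins a cluster if it is within the threshold of any member) and an inclusive threshold; the function names and toy distances are hypothetical, and pairs missing from the distance table are treated as farther apart than the threshold.

```python
from typing import Dict, List, Set, Tuple

def snp_clusters(isolates: List[str],
                 distances: Dict[Tuple[str, str], int],
                 max_snps: int = 50) -> List[Set[str]]:
    """Group isolates by single linkage; keep only groups with 2 or more members."""
    parent = {name: name for name in isolates}

    def find(x: str) -> str:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= max_snps:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in isolates:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= 2]

def percent_in_clusters(isolates: List[str], clusters: List[Set[str]]) -> float:
    """Share of all isolates that belong to some SNP cluster, in percent."""
    return 100.0 * sum(len(c) for c in clusters) / len(isolates)

# Hypothetical toy data: a, b, c chain together; d and e stay unclustered.
isolates = ["a", "b", "c", "d", "e"]
distances = {("a", "b"): 3, ("b", "c"): 40, ("a", "c"): 43,
             ("c", "d"): 120, ("d", "e"): 300}
clusters = snp_clusters(isolates, distances)
print(clusters, percent_in_clusters(isolates, clusters))
```

Under single linkage a cluster can contain pairs that are more than 50 SNPs apart, as long as a chain of closer isolates connects them, which is one reason cluster membership alone is a screening step.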
Acknowledgements
• FDA
– Center for Food Safety and Applied Nutrition
– Center for Veterinary Medicine
– Office of Regulatory Affairs
• National Institutes of Health
– National Center for Biotechnology Information
• State Health and University Labs
– Alaska, Arizona, California, Florida, Hawaii, Maryland, Minnesota, New Mexico, New York, South Dakota, Texas, Virginia, Washington
• USDA/FSIS
– Eastern Laboratory
• CDC
– Enteric Diseases Laboratory
• INEI-ANLIS “Carlos Malbrán Institute,” Argentina
• Centre for Food Safety, University College Dublin, Ireland
• Food and Environment Research Agency, UK
• Public Health England, UK
• WHO
• Illumina
• PacBio
• CLC Bio
• Other independent collaborators