whole-genome sequencing (wgs) for food safetymay 22, 2017 · whole-genome sequencing (wgs) for...

Whole-Genome Sequencing (WGS) for Food Safety

Errol Strain, Ph.D.

Director, Biostatistics and Bioinformatics Staff

Center for Food Safety and Applied Nutrition

U.S. Food and Drug Administration

IFSH Meeting

5/22/2017


FDA Regulatory Use Cases

1. Do these new bacterial isolates from environmental/product testing match any clinical isolates in the DB?

– Is this product/facility causing illness?

2. Do these new clinical isolates match any environmental/food isolates in the DB?

– Should we test product/swab a facility?

3. Are isolates collected at different points in time from the same facility a match?

– Is there a problem w/ a resident pathogen, harborage?

GenomeTrakr Data Flow

GenomeTrakr Labs & Collaborators

Salmonella

Listeria


NGS-Based Surveillance (prior to NCBI Pathogen Detection)

Initial Clustering: PFGE, K-mer, MASH, BLAST

– Goal: Find a group of 10-200 closely related isolates

SNP Pipeline: Find phylogenetically informative SNPs, FASTA alignment

Construct Phylogeny

[Diagram labels: NCBI, FDA, Missing]
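The initial clustering step can be illustrated with a k-mer comparison. The following is a minimal sketch, assuming an exact Jaccard distance over k-mer sets rather than the MinHash approximation that MASH uses. The function names, the value of k, and the toy sequences are hypothetical and are not taken from any FDA or NCBI tool.

```python
def kmer_set(seq, k=5):
    """Return the set of all k-mers in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=5):
    """1 minus (shared k-mers / all k-mers) for the two sequences."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy sequences (hypothetical); real genomes are millions of bases long.
genome_a = "ACGTACGTTAGCCGATAGCTAGGATCC"
genome_b = "ACGTACGTTAGCCGATAGCTAGGATCC"
genome_c = "TTTTGGGGCCCCAAAATTTTGGGGCCC"

d_ab = jaccard_distance(genome_a, genome_b)
d_ac = jaccard_distance(genome_a, genome_c)
print(d_ab, d_ac)
```

Isolates with a small distance would be grouped together and passed to the SNP pipeline; the cut-off used to form a group of 10-200 isolates is not given in the slides.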


NCBI Pathogen Detection


CFSAN vs NCBI SNPs


Scientific Evidence – Daubert Standard

1. Empirical testing: whether the theory or technique is falsifiable, refutable, and/or testable.

2. Whether it has been subjected to peer review and publication.

– Specific/Target Studies for pathogen have been published. Multiple software packages for mapping and calling SNPs.

3. The known or potential error rate.

– Well characterized at read level, less so for cluster analysis.

4. The existence and maintenance of standards and controls concerning its operation.

– Proficiency testing efforts through Global Microbial Identifier and also FDA GenomeTrakr network.

5. The degree to which the theory and technique is generally accepted by a relevant scientific community.

– Acceptance facilitated by open database (NCBI/SRA).


Why Build A Pipeline?

1. Regulatory Use and/or Accredited Labs

– NCBI methods not public and peer-reviewed

– Chain of custody – local computation

– Results needed immediately

2. Pathogen and/or data not at NCBI

– Mycobacterium, Legionella*

– Food Industry – private data


What Kind of Pipeline?


SNPs vs. wgMLST

Unit of Measure

– SNPs: Single nucleotide substitutions (other types of mutations are excluded)

– wgMLST: Allele, a variant of a gene. Variation could arise from a number of sources, including SNPs, insertions, deletions, etc.

Requirements

– SNPs: Complete or high-quality reference genome for mapping

– wgMLST: Database of named alleles, must be actively maintained

Pros

– SNPs: Extremely high resolution, methods have been published and validated

– wgMLST: Relatively fast, not directly dependent upon reference genome

Cons

– SNPs: Requires reference genome, computationally intense, requires local bioinformatics expertise

– wgMLST: Allele database must be centralized, cannot compute novel wgMLST types locally. wgMLST schemas not easy to publicly access
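The difference in unit of measure can be made concrete with toy data. This is a minimal sketch, assuming each locus is already aligned between the two isolates and that allele identity is decided by comparing the sequences directly. A real wgMLST scheme would instead look up named alleles in a curated database. All names and sequences are hypothetical.

```python
def snp_distance(aln_a, aln_b):
    """Count substitution sites between two aligned sequences.

    Gap columns ('-') are skipped, since only single nucleotide
    substitutions are counted as SNPs.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(
        1 for x, y in zip(aln_a, aln_b)
        if x != y and x != "-" and y != "-"
    )


def allele_distance(profile_a, profile_b):
    """Count shared loci whose allele differs between two isolates."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


# Hypothetical isolates with three aligned loci each.
isolate_1 = {"locusA": "ATGGCCTAA", "locusB": "ATGAAATAG", "locusC": "ATGCCCGGG"}
isolate_2 = {
    "locusA": "ATGGCTTAA",  # one substitution
    "locusB": "ATG---TAG",  # one 3-base deletion
    "locusC": "ATGCCCGGG",  # identical
}

snps = sum(snp_distance(isolate_1[locus], isolate_2[locus]) for locus in isolate_1)
alleles = allele_distance(isolate_1, isolate_2)
print(snps, alleles)
```

In this toy case the deletion changes the allele count but not the SNP count, which is why distances from the two approaches are not directly interchangeable.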


FDA Pipeline Requirements

1. Public, Peer-Reviewed

– Results may be subject to legal scrutiny

– Accessible to FDA-regulated industries

2. Reproducible

3. Documentation & Validation

4. Platform independent (fastq)

5. Run Locally


Background: CFSAN SNP Pipeline

Mapping/Aligning (66+ tools), SNP Detection (16+ tools)

Tools shown: Samtools, SOAPsnp, GATK, SNVer, VarScan, SHORE, SMALT, MaCH, IMPUTE2, CLC Bio, QualitySNPng, DNABaser, SNPdetector, FreeBayes, SolSNP, DNAStar

Highlighted: Bowtie2 (mapping/aligning), VarScan (SNP detection)


CFSAN SNP Pipeline

Documentation: http://snp-pipeline.rtfd.org

Source Code: https://github.com/CFSAN-Biostatistics/snp-pipeline

Pettengill JB, Luo Y, Davis S, Chen Y, Gonzalez-Escalona N, Ottesen A, Rand H, Allard MW, Strain E. (2014) An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella. PeerJ 2:e620 http://dx.doi.org/10.7717/peerj.620

Davis S, Pettengill JB, Luo Y, Payne J, Shpuntoff A, Rand H, Strain E. (2015) CFSAN SNP Pipeline: an automated method for constructing SNP matrices from next-generation sequence data. PeerJ Computer Science 1:e20 https://dx.doi.org/10.7717/peerj-cs.20



FDA/CFSAN Validation Efforts

1. Technical Performance

Accuracy: Salmonella LT2 and Agona SL483

2. Intralaboratory variation, sequencing platform

Salmonella Montevideo (180+ runs), PacBio vs short reads

3. Interlaboratory variation

Salmonella Braenderup BAA-664 (PFGE control), ISO/CEN WG, GenomeTrakr PT set (Salmonella & Listeria), Global Microbial Identifier PT

4. Bioinformatics Pipeline

Software Validation, Benchmark bioinformatic data sets

Collaborations w/ Canada, CDC, NIH/NCBI


Proficiency Testing:

• GenomeTrakr 2014, 2015:

– Each lab in the GT network sequenced the same set of 8 strains. CFSAN PT analysis returned.

– Manuscript in preparation

• GMI (yearly since 2013)

– 2016 PT has wet and dry lab components

– 2016 PT includes K. pneumoniae, L. monocytogenes, C. jejuni, E. coli

• PulseNet/GenomeTrakr harmonized PT

– Early 2017


CFSAN Workflow


“Min-diff” – Minimum SNP distance to an isolate of a different sample type

– Food/Environmental vs Clinical (or Microbe)
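The min-diff definition can be sketched in a few lines. This is an illustrative sketch only, assuming pairwise SNP distances are already available and that each isolate carries one sample-type label. The function name, data layout, isolate names and distances are hypothetical and do not reflect how NCBI Pathogen Detection computes the value internally.

```python
def min_diff(isolate, sample_types, distances):
    """Smallest SNP distance from `isolate` to an isolate of another sample type.

    Returns None when no isolate of a different sample type is present.
    """
    own_type = sample_types[isolate]
    candidates = [
        distances[frozenset((isolate, other))]
        for other, other_type in sample_types.items()
        if other != isolate and other_type != own_type
    ]
    return min(candidates) if candidates else None


# Hypothetical isolates and pairwise SNP distances.
sample_types = {
    "food_1": "food/environmental",
    "env_1": "food/environmental",
    "clin_1": "clinical",
    "clin_2": "clinical",
}
distances = {
    frozenset(("food_1", "env_1")): 1,
    frozenset(("food_1", "clin_1")): 8,
    frozenset(("food_1", "clin_2")): 3,
    frozenset(("env_1", "clin_1")): 9,
    frozenset(("env_1", "clin_2")): 4,
    frozenset(("clin_1", "clin_2")): 12,
}

for name in sample_types:
    print(name, min_diff(name, sample_types, distances))
```

Note that food_1 is only 1 SNP from env_1, but its min-diff is 3 because only isolates of the other sample type are considered.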


8 SNPs Check SNP Cluster


CFSAN Workflow

CFSAN SNP Pipeline is run on NCBI SNP cluster

– Reference – prefer complete genomes, drafts work almost as well

– High-Density SNP regions are filtered

>3 SNPs in 1000 bases, phages/recombination/etc.

– Phylogenetic inference – Maximum Likelihood

Ambiguous sites are treated as missing data
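The high-density filter (>3 SNPs in 1000 bases) can be sketched as a windowed count over sorted SNP positions. This is a minimal sketch under stated assumptions: windows are anchored at each SNP, and every SNP in an over-dense window is dropped. The function name and positions are hypothetical, and the pipeline's own implementation may define windows differently.

```python
from bisect import bisect_right


def filter_dense_snps(positions, window=1000, max_snps=3):
    """Drop SNPs that fall in any window holding more than `max_snps` SNPs.

    Each window starts at a SNP position and spans `window` bases.
    """
    pos = sorted(positions)
    dense = set()
    for i, start in enumerate(pos):
        # Index just past the last SNP inside [start, start + window - 1].
        j = bisect_right(pos, start + window - 1)
        if j - i > max_snps:
            dense.update(pos[i:j])
    return [p for p in pos if p not in dense]


# Hypothetical SNP positions on one contig: an isolated SNP, a run of five
# SNPs within 400 bases (e.g. a phage region), and two well-spaced SNPs.
snp_positions = [1200, 50000, 50100, 50200, 50300, 50400, 90000, 90500]
kept = filter_dense_snps(snp_positions)
print(kept)
```

Only the SNPs outside the dense run survive, so regions shaped by phages or recombination contribute nothing to the distance.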


Which Reference?

[Figure: SNP difference (y-axis, -5 to 10) versus % divergence (x-axis: 0%, 0.1%, 0.5%, 1%, 2.5%, 5%), annotated with the levels Strain, Subtype, PFGE, Serotype, Subspecies, Species]


CFSAN SNP Pipeline: Listeria Draft vs PacBio Genome

High-Quality Draft Complete (PacBio)


Interpretation

SNP Distance

How close are the isolates? No single threshold for all species/types, rough guides:

1. <=20 SNPs: match, virtually identical

2. 20-100 SNPs: inconclusive

3. >100 SNPs: exclude

Bootstrapping

Do the isolates form a unique cluster w/ >= 95% support? Is the cluster distinct from other isolates in the tree?

Results are critically evaluated and not used blindly
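The rough guides can be written down as a small helper. This is only a sketch of the rule of thumb on this slide, not a decision procedure. It assumes a distance of exactly 20 counts as a match and exactly 100 as inconclusive, since the slide's ranges overlap at 20. It reports the bootstrap check separately rather than combining the two, and the function name is hypothetical.

```python
def interpret(snp_distance, bootstrap_support):
    """Return (distance_call, well_supported) from the rough guides.

    distance_call: 'match' (<=20 SNPs), 'inconclusive' (21-100), 'exclude' (>100)
    well_supported: True when the cluster has >= 95% bootstrap support
    """
    if snp_distance <= 20:
        distance_call = "match"
    elif snp_distance <= 100:
        distance_call = "inconclusive"
    else:
        distance_call = "exclude"
    well_supported = bootstrap_support >= 95
    return distance_call, well_supported


for distance, support in [(3, 99), (20, 90), (21, 95), (100, 97), (101, 50)]:
    print(distance, support, interpret(distance, support))
```

The two outputs are kept apart on purpose: an analyst still has to weigh them together with the epidemiological evidence.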


Forensic Needs

WGS (SRA) Database:

Random survey of bacteria not possible, need to continue to grow database and curate genotypes

Thresholds for SNPs vs wgMLST:

1 SNP ≠ 1 INDEL ≠ 1 Recombination

Well-Documented wgMLST databases

Example: E. coli & Flour


www.ncbi.nlm.nih.gov/pathogens/


0-3 SNPs to clinical isolates

0-3 SNPs to other food/env isolates


CFSAN SNP Pipeline


Future of GenomeTrakr & CFSAN SNP Pipeline

1. Local or web-based QA/QC and identification tools

– Detect sample mix-ups and low quality before data is submitted to NCBI/SRA, fix problems more quickly

2. Continue to build WGS databases

– Better thresholds for identity, increase odds of finding a match

3. Local SNP pipeline analysis

– Accredited labs don’t have to send out data


Snapshot of Data – 3/1 to 4/30

SNP/ERD Clusters

* 2 or more isolates within 50 SNPs

Organism           # SNP Clusters   % isolates in SNP clusters (3/2017)   Total
Campylobacter      242              69                                    1054
E. coli/Shigella   221              59 (56%)                              1132
Listeria           87               91 (89%)                              356
Salmonella         439              83 (86%)                              2100
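The footnote's cluster definition (2 or more isolates within 50 SNPs) can be sketched with a small union-find. This sketch assumes single-linkage grouping, where an isolate joins a cluster if it is within the threshold of any member. That linkage rule is an assumption here, and the isolate names and distances are hypothetical.

```python
def snp_clusters(isolates, distances, threshold=50):
    """Group isolates by single linkage and keep groups of 2 or more."""
    parent = {name: name for name in isolates}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in isolates:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)


# Hypothetical isolates and pairwise SNP distances.
isolates = ["A", "B", "C", "D", "E"]
distances = {
    ("A", "B"): 10, ("B", "C"): 45, ("A", "C"): 55,
    ("A", "D"): 300, ("B", "D"): 310, ("C", "D"): 290,
    ("A", "E"): 500, ("B", "E"): 510, ("C", "E"): 495, ("D", "E"): 60,
}

clusters = snp_clusters(isolates, distances)
in_clusters = sum(len(c) for c in clusters)
print(clusters, f"{100 * in_clusters / len(isolates):.0f}% of isolates in clusters")
```

Under single linkage, A and C end up in the same cluster even though they are 55 SNPs apart, because both are within 50 SNPs of B.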


Acknowledgements

• FDA

– Center for Food Safety and Applied Nutrition

– Center for Veterinary Medicine

– Office of Regulatory Affairs

• National Institutes of Health

– National Center for Biotechnology Information

• State Health and University Labs

– Alaska, Arizona, California, Florida, Hawaii, Maryland, Minnesota, New Mexico, New York, South Dakota, Texas, Virginia, Washington

• USDA/FSIS

– Eastern Laboratory

• CDC

– Enteric Diseases Laboratory

• INEI-ANLIS “Carlos Malbrán Institute,” Argentina

• Centre for Food Safety, University College Dublin, Ireland

• Food and Environment Research Agency, UK

• Public Health England, UK

• WHO

• Illumina

• PacBio

• CLC Bio

• Other independent collaborators
