sept2016 plenary nist_intro
TRANSCRIPT
Genome in a Bottle Workshop
Justin Zook and Marc Salit
NIST Genome-Scale Measurements Group
JIMB
September 16, 2016
WELCOME
Today we’re releasing 4 new GIAB RM Genomes.
• PGP Human Genomes
– AJ son
– AJ trio
– Asian son
• Parents also characterized
• Available immediately
• New, reproducible methods applied to characterize high-confidence SNPs/indels in 85-90% of each genome
We’re also releasing a Microbial Genome RM
This Reference Material (RM) is intended for validation, optimization, process evaluation, and performance assessment of whole genome sequencing.
• Salmonella Typhimurium
• Pseudomonas aeruginosa
• Staphylococcus aureus
• Clostridium sporogenes
What’s JIMB?
• Joint Initiative for Metrology in Biology
– develop standards, methods, tools and measurement science
– make biology easier to engineer
– make reproducibility and reliability easier
• lower barriers to translation of innovation
• enable scaling through distribution of labor
Faculty
• Science
• Technology Development
• Innovation
NIST
• Metrology
• Standards Realization Lab
• Measurement Science
Trainees
• Postdocs
• Coursework
• Graduate Trainees
Commercial
• Customers
• Technology
• Metrology Training
• Workforce
is Genomics and Synthetic Biology.
DNA Read and Write.
Genome in a Bottle Consortium
Whole Genome Variant Calling
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference materials to evaluate performance
– materials certified for their variants against a reference sequence, with confidence estimates
• established consortium to develop reference materials, data, methods, performance metrics
• Characterized Pilot Genome NA12878
• Ashkenazim Trio, Asian son from PGP released today!
generic measurement process
Bringing Principles of Metrology to the Genome
• Reference materials
– DNA in a tube you can buy from NIST
– NA12878 pilot sample, now 2 PGP-sourced trios
• Extensive state-of-the-art characterization
– as good as we can get for small variants
– arbitrated “gold standard” calls for SNPs, small indels
• “Upgradable” as technology develops
• Analysis of all samples ongoing as technology develops
• PGP genomes suitable for commercial derived products
• Developing benchmarking tools and software
– with GA4GH
• Samples being used to develop and demonstrate new technology
We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference Consortium
• 1000 Genomes SV group
• CAP/CLIA
• Global Alliance for Genomics and Health Benchmarking Team
• ABRF
• FDA
• SEQC
• Global metrology system
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy' variants and regions of the genome
– Topic #2: Selecting future genomes for Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot Reference Material
• Discussion of plans to release pilot Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans and discussion
• Steering committee Overview
• First meeting of the Steering Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests otherwise. Please use #giab as the hashtag.
NIST Reference Materials
Genome | PGP ID | Coriell ID | NIST ID | NIST RM #
CEPH Mother/Daughter | N/A | GM12878 | HG001 | RM8398
AJ Son | huAA53E0 | GM24385 | HG002 | RM8391 (son)/RM8392 (trio)
AJ Father | hu6E4515 | GM24149 | HG003 | RM8392 (trio)
AJ Mother | hu8E87A9 | GM24143 | HG004 | RM8392 (trio)
Asian Son | hu91BD69 | GM24631 | HG005 | RM8393
Asian Father | huCA017E | GM24694 | N/A | N/A
Asian Mother | hu38168C | GM24695 | N/A | N/A
Data for GIAB PGP Trios
Dataset | Characteristics | Coverage | Availability | Most useful for…
Illumina Paired-end WGS | 150x150bp; 250x250bp | ~300x/individual; ~50x/individual | on SRA/FTP | SNPs/indels/some SVs
Complete Genomics |  | 100x/individual | on SRA/FTP | SNPs/indels/some SVs
SOLiD 5500W WGS | 50bp single end | 70x/son | on FTP | SNPs
Illumina Paired-end WES | 100x100bp | ~300x/individual | on SRA/FTP | SNPs/indels in exome
Ion Proton Exome |  | 1000x/individual | on SRA/FTP | SNPs/indels in exome
Illumina Mate pair | ~6000 bp insert | ~30x/individual | on FTP | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | on FTP | SVs/phasing/assembly
Complete Genomics LFR |  | 100x/individual | on SRA/FTP | SNPs/indels/phasing
10X | Pseudo-long reads | 30-45x/individual | on FTP | SVs/phasing/assembly
PacBio | ~10kb reads | ~70x on AJ son, ~30x on each AJ parent | on SRA/FTP | SVs/phasing/assembly/STRs
Oxford Nanopore | 5.8kb 2D reads | 0.02x on AJ son | on FTP | SVs/assembly
Nabsys 2.0 | ~100kbp N50 nanopore maps | 70x on AJ son |  | SVs/assembly
BioNano Genomics | 200-250kbp optical map reads | ~100x/AJ individual; 57x on Asian son | on FTP | SVs/assembly
Dataset | AJ Son | AJ Parents | Chinese son | Chinese parents | NA12878
Illumina Paired-end X X X X X
Illumina Long Mate pair X X X X X
Illumina “moleculo” X X X X X
Complete Genomics X X X X X
Complete Genomics LFR X X X
Ion exome X X X X
BioNano X X X X
10X X X X
PacBio X X X
SOLiD single end X X X
Illumina exome X X X X
Oxford Nanopore X
Paper describing data…
• 51 authors
• 14 institutions
• 12 datasets
• 7 genomes
• Data described in ISA-tab
[Figure: GIAB ftp site downloads/unique-IPs by month. x-axis: Month; y-axes: # downloads, # IPs]
Integration Methods to Establish Reference Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
NEW: Reproducible integration pipeline with new calls for NA12878 and PGP Trios!
New calls (v3.3) vs. old calls (v2.19)
V3.3
• 3441361 match PG
• 550982 PG calls outside high conf
• 124715 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 40 calls not in PG
– 60 extra PG calls
V2.19
• 3030717 match PG
• 1018795 PG calls outside high conf
• 122359 calls not in PG
• After excluding low confidence regions and regions around filtered PG calls:
– 87 calls not in PG
– 404 extra PG calls
More high-confidence calls match Platinum Genomes
Similar extra calls not in Platinum Genomes
~80% fewer differences from PG in high confidence regions
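The "~80%" figure can be checked against the counts on this slide. A minimal sketch, assuming "differences from PG" means the sum of "calls not in PG" and "extra PG calls" that remain after the exclusions (the variable names are illustrative):

```python
# Differences from Platinum Genomes (PG) that remain after excluding
# low-confidence regions and regions around filtered PG calls.
# Counts come from the slide; summing the two categories is an assumption.
diffs_v2_19 = 87 + 404   # calls not in PG + extra PG calls (v2.19)
diffs_v3_3 = 40 + 60     # calls not in PG + extra PG calls (v3.3)

reduction = 1 - diffs_v3_3 / diffs_v2_19
print(f"v2.19: {diffs_v2_19} differences, v3.3: {diffs_v3_3} differences")
print(f"reduction: {reduction:.1%}")  # prints "reduction: 79.6%"
```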
New calls (v3.3) vs. old calls (v2.19)
Example vcf (verily) Stratified
V3.3
• 17% of SNPs not assessed
– 23% of SNPs in RefSeq coding
– 53% of SNPs in “bad promoters”
• 78% of indels not assessed
– 0.7% difference rate
• 17% FP in regions homologous to decoy
V2.19
• 27% of SNPs not assessed
– 36% of SNPs in RefSeq coding
– 82% of SNPs in “bad promoters”
• 78% of indels not assessed
– 1.2% difference rate
• 0.2% FP in regions homologous to decoy
Principles of Integration Process
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Filter calls from each method with annotations unlike concordant calls
• Compare high-confidence calls to other callsets and manually inspect subset of differences
– vs. pedigree-based calls
– vs. common pipelines
– Trio analysis
• When benchmarking a new callset against ours, most putative FPs/FNs should actually be FPs/FNs
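The first two principles (sensitive calls per dataset, callable regions per callset) can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the GIAB integration pipeline: variants are plain tuples, callable regions are half-open intervals, the callset names and data are made up, and discordant sites are simply dropped rather than arbitrated using evidence of bias.

```python
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]   # (chrom, pos, ref, alt)
Interval = Tuple[str, int, int]       # (chrom, start, end), half-open

def is_callable(variant: Variant, regions: List[Interval]) -> bool:
    """True if the variant position falls inside any callable interval."""
    chrom, pos, _, _ = variant
    return any(c == chrom and start <= pos < end for c, start, end in regions)

def integrate(callsets: Dict[str, Set[Variant]],
              callable_regions: Dict[str, List[Interval]],
              min_callsets: int = 2) -> Set[Variant]:
    """Keep variants on which every callset that is callable at the site agrees."""
    candidates = set().union(*callsets.values())
    high_conf = set()
    for variant in candidates:
        # Callsets that can assess this site at all.
        informative = [n for n in callsets
                       if is_callable(variant, callable_regions[n])]
        # Of those, the callsets that actually made the call.
        supporting = [n for n in informative if variant in callsets[n]]
        if len(informative) >= min_callsets and supporting == informative:
            high_conf.add(variant)
    return high_conf

# Made-up example data (not real GIAB calls).
snp_a = ("1", 100, "A", "G")
snp_b = ("1", 200, "C", "T")
snp_c = ("1", 300, "G", "A")
callsets = {
    "tech1": {snp_a, snp_b, snp_c},
    "tech2": {snp_a},
    "tech3": {snp_a, snp_b},
}
callable_regions = {
    "tech1": [("1", 0, 1000)],
    "tech2": [("1", 0, 150)],   # tech2 cannot assess positions >= 150
    "tech3": [("1", 0, 1000)],
}
high_conf = integrate(callsets, callable_regions)
print(sorted(high_conf))
```

Here snp_b is kept even though tech2 did not call it, because tech2 is not callable at that position; snp_c is dropped because tech3 is callable there and disagrees.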
Criteria for including new callsets
• Form sensitive variant calls from each dataset
• Define “callable regions” for each callset
• Good coverage and MapQ
• Use knowledge about technology and manual inspection to exclude repetitive regions difficult for each dataset
• For new callsets, ensure most FNs in callable regions relative to current high-confidence calls are questionable in the current calls
• Filter calls from each method with annotations unlike concordant calls
– Annotations for which outliers are expected to indicate bias should be selected for each callset
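One way to picture "filter calls with annotations unlike concordant calls" is a percentile cutoff learned from the concordant calls. The sketch below is an assumption-laden illustration, not the GIAB method: the 5th/95th percentile thresholds, the use of a depth-like annotation named "DP", and all values are made up.

```python
from statistics import quantiles
from typing import Dict, List, Tuple

def outlier_bounds(concordant_values: List[float]) -> Tuple[float, float]:
    """Return the 5th and 95th percentiles of an annotation among concordant calls."""
    cuts = quantiles(concordant_values, n=20, method="inclusive")  # 19 cut points
    return cuts[0], cuts[-1]

def split_by_annotation(calls: List[Dict], concordant_values: List[float],
                        annotation: str) -> Tuple[List[Dict], List[Dict]]:
    """Split calls into (kept, filtered) by whether the annotation is an outlier."""
    low, high = outlier_bounds(concordant_values)
    kept = [c for c in calls if low <= c[annotation] <= high]
    filtered = [c for c in calls if not (low <= c[annotation] <= high)]
    return kept, filtered

# Made-up annotation values for concordant calls and three candidate calls.
concordant_depths = list(range(20, 41))
calls = [{"id": "a", "DP": 30}, {"id": "b", "DP": 95}, {"id": "c", "DP": 5}]

kept, filtered = split_by_annotation(calls, concordant_depths, "DP")
print([c["id"] for c in kept], [c["id"] for c in filtered])
```

Which annotations to use would be chosen per callset, as the slide says; depth is used here only because it is easy to read.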
Global Alliance for Genomics and Health Benchmarking Task Team
• Developed standardized definitions for performance metrics like TP, FP, and FN.
• Developing sophisticated benchmarking tools
• Integrated into a single framework with standardized inputs and outputs
• Standardized bed files with difficult genome contexts for stratification
Credit: GA4GH, Abby Beeler, Ellie Wood
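Once TP, FP, and FN are counted, the summary metrics follow directly, and stratification means repeating the calculation within each genome context. A minimal sketch with made-up counts; the GA4GH definitions are more detailed than this (for example in how truth and query calls are matched), and the stratum names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

# Made-up counts for illustration only, one entry per stratum.
strata = {
    "all": Counts(tp=9900, fp=100, fn=100),
    "tandem_repeats": Counts(tp=800, fp=200, fn=200),
}

for name, counts in strata.items():
    print(f"{name}: precision={counts.precision():.3f} "
          f"recall={counts.recall():.3f}")
```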
Stratification of FP Rates
Higher FP rates at Tandem Repeats
https://github.com/ga4gh/benchmarking-tools
Benchmarking Tools
Standardized comparison, counting, and stratification with Hap.py + vcfeval
https://precision.fda.gov/ https://github.com/ga4gh/benchmarking-tools
Microbial Genomic RM Characterization
PEPR Workflow
https://github.com/usnistgov/pepr
Acknowledgements
• NIST
– Marc Salit
– Jenny McDaniel
– Lindsay Vang
– David Catoe
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
• FDA
– Liz Mansfield
– Zivana Tevak
– David Litwack
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://www.nature.com/articles/sdata201625
Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools
Public workshops – Next one Sep 15-16 at NIST, MD, USA
NIST postdoc opportunities available!
Justin Zook: [email protected]
Marc Salit: [email protected]
NIST Microbial RMs: talk by Jason Kralj tomorrow at 5:15pm
Clinical Genome Sequencing Process
Preanalytical
Sequencing
Sequence Bioinformatics
Functional Variant Annotation
Clinical Variant Knowledgebase
Query
Clinical Interpretation Reporting
EHR Archival
What is the standards architecture to demonstrate safety and efficacy?
Analytical/Technical Performance Assessment