aug2014 giab status update and wg charge
Genome in a Bottle: Reference Materials to Enable Translation
August 2014
Justin Zook, Marc Salit, and the Genome in a Bottle Consortium
NIST Human Genome RMs in the pipeline
• All 10 µg samples of DNA isolated from multistage large growth cell cultures
– all are intended to act as stable, homogeneous references suitable for use in regulated applications
– all genomes also available from Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet planned as NIST RM
Homogeneity Analysis
First and last vial
• 3 libraries sequenced to ~33x each
• Use Varscan to detect differences in allele fraction of SNPs and indels between vials
– Significant differences only found in regions prone to alignment errors
• Use BIC-seq to detect differences in copy number between vials and libraries
– No consistent differences between comparisons of different libraries between vials
4 random vials
• 2 libraries sequenced to 12.5x each
• Use BIC-seq to detect differences in copy number between vials and libraries
– Only one difference with p<10^-8, which is in a region prone to mapping errors
• Sequence multiple libraries from multiple vials
• Use somatic mutation callers to detect differences in SNPs and CNVs
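As an illustration of what "differences in allele fraction between vials" can mean at a single site, the sketch below applies a two-sided Fisher's exact test to reference/alternate read counts from two vials. It is a minimal stand-alone example, not Varscan's implementation; the function name and the read-count layout are assumptions made for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical): a, b = ref/alt read counts in vial 1;
    c, d = ref/alt read counts in vial 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Example: 15 ref / 15 alt reads in one vial, 28 ref / 2 alt in the other
p_value = fisher_exact_two_sided(15, 15, 28, 2)
print(p_value)
```

In a genome-wide comparison the per-site p-values would need a stringent threshold or multiple-testing correction, which is consistent with the slide's use of a p<10^-8 cutoff for the copy-number comparison.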
Human/Bacterial RM Stability Study
Storage conditions
• 37°C: accelerated aging
• 4°C: intermediate storage condition
• -20°C: control, long-term storage condition
Handling conditions
• Freeze-thaw 2x
• Freeze-thaw 5x
• Vortex (10 sec)
• Vigorous pipetting (full volume, 10x)
• Shipping cross-country
[Study design diagram: samples assessed at time = 0 and after 2 and 8 weeks]
Run multiple gels for each condition
• Blinded qualitative analysis of gel by 5 NIST staff
• Consensus that only vials stored at 37°C for 8 weeks had significantly decreased size
Example Gel Images
Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform– take advantage of strengths of each platform
• Avoid bias towards any particular bioinformatics algorithms
Integrate 14 Datasets from 5 platforms
Integration of Data to Form Highly Confident Genotype Calls
• Candidate variants: find all possible variant sites
• Concordant variants: find concordant sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitrate using evidence of bias: for each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence level: even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats
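The arbitration step for a single site can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published integration method: the `Call` record, the single `atypicality` score, and the `typical_cutoff` and `min_typical` parameters are all hypothetical stand-ins for the multiple bias annotations used in practice.

```python
from dataclasses import dataclass

@dataclass
class Call:
    dataset: str
    genotype: str       # e.g. "0/1"
    atypicality: float  # hypothetical bias score; higher means more atypical

def arbitrate(calls, in_difficult_region=False, typical_cutoff=1.0, min_typical=2):
    """Drop the most atypical datasets until the remaining genotypes agree.

    Returns (genotype, confidence) where confidence is "high" or "uncertain".
    """
    if not calls:
        return None, "uncertain"
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the dataset with the most atypical characteristics
    genotype = remaining[0].genotype
    # Even when all datasets agree, require enough "typical" supporting datasets
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    confident = n_typical >= min_typical and not in_difficult_region
    return genotype, ("high" if confident else "uncertain")

calls = [
    Call("platform_A", "0/1", 0.1),
    Call("platform_B", "0/1", 0.3),
    Call("platform_C", "1/1", 2.5),
]
print(arbitrate(calls))
```

Passing `in_difficult_region=True` models the rule that sites in known segmental duplications, SVs, or long repeats are marked uncertain regardless of agreement.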
Integration Methods to Establish Reference Variant Calls
[Figure: integration workflow with stages Candidate variants, Concordant variants, Find characteristics of bias, Arbitrate using evidence of bias, Confidence Level]
Zook et al., Nature Biotechnology, 2014.
Pedigree calls
• RTG and Illumina Platinum Genomes developed these
• Sequence NA12878, husband, and 11 children to identify high confidence variants
– Identify cross-over events
– Determine if genotypes are consistent with inheritance
• Integrated these with NIST high-confidence genotypes
• Should we find larger families for future genomes?
Source: Mike Eberle, Illumina
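The "consistent with inheritance" check can be illustrated with a basic Mendelian test at one site. The sketch below assumes autosomal, diploid genotypes written as pairs of allele indices and ignores phasing and cross-over detection; the function names are hypothetical and this is not the RTG or Platinum Genomes pipeline.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child could inherit one allele from each parent.

    Genotypes are 2-tuples of allele indices, e.g. (0, 1) for a heterozygote.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def inconsistent_children(mother, father, children):
    """Indices of children whose genotype violates Mendelian inheritance."""
    return [i for i, child in enumerate(children)
            if not is_mendelian_consistent(child, mother, father)]

# Example site: mother homozygous reference, father heterozygous
mother, father = (0, 0), (0, 1)
children = [(0, 0), (0, 1), (1, 1), (0, 1)]
print(inconsistent_children(mother, father, children))
```

With many children per family, a site where every child passes this check provides much stronger evidence than a single trio, which is one motivation for the question about larger families.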
Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement
• At least some methods have no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be difficult for current technologies
• State reasons for lower confidence
• If a site is near a low confidence site, make it low confidence
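The last rule (downgrading sites near a low-confidence site) might look like the following sketch. The slide does not say how close "near" is, so the `window` of 50 bp is purely an assumption, as are the function name and the `(position, is_confident)` representation for sites on one chromosome.

```python
from bisect import bisect_left

def propagate_low_confidence(sites, window=50):
    """Mark confident sites as low confidence if a low-confidence site is nearby.

    sites: list of (position, is_confident) tuples on a single chromosome.
    window: hypothetical distance in bp; not specified in the source.
    Only originally low-confidence sites trigger a downgrade (no cascading).
    """
    low_positions = sorted(pos for pos, ok in sites if not ok)
    result = []
    for pos, ok in sites:
        if ok:
            # First low-confidence position at or after the start of the window
            i = bisect_left(low_positions, pos - window)
            if i < len(low_positions) and low_positions[i] <= pos + window:
                ok = False
        result.append((pos, ok))
    return result

sites = [(100, True), (120, False), (500, True)]
print(propagate_low_confidence(sites))
```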
Performance Metrics Specification
• Goal is to standardize performance metrics measured with respect to NIST RMs
• Licensing
• Definitions
• Input formats
• User interface
• Accuracy outputs
– FP, FN, Sens, Spec, etc.
– Stratification by variant type, by genome context, and by functional regions
• Characteristics of FP/FN
• Working with Global Alliance for Genomics and Health
• See draft at genomeinabottle.org
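To make the accuracy outputs concrete, the sketch below counts TP, FP, and FN per stratum and derives sensitivity and precision. It is not taken from the draft specification: the variant key `(chrom, pos, ref, alt)`, the exact-match comparison, and the SNP/indel stratifier are assumptions, and both inputs are presumed already restricted to the confident regions. Specificity is omitted because it also needs a count of true negatives.

```python
from collections import defaultdict

def variant_type(variant):
    """Hypothetical stratifier: SNP if ref and alt are single bases, else indel."""
    _chrom, _pos, ref, alt = variant
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "indel"

def stratified_metrics(truth, calls, stratum_of=variant_type):
    """Per-stratum TP/FP/FN counts with sensitivity and precision.

    truth, calls: sets of (chrom, pos, ref, alt) tuples.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for v in truth | calls:
        if v in truth and v in calls:
            counts[stratum_of(v)]["TP"] += 1
        elif v in calls:
            counts[stratum_of(v)]["FP"] += 1
        else:
            counts[stratum_of(v)]["FN"] += 1
    out = {}
    for stratum, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        out[stratum] = {
            **c,
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return out

truth = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("1", 300, "C", "T")}
calls = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("1", 400, "G", "GA")}
print(stratified_metrics(truth, calls))
```

The `stratum_of` argument can be swapped for a function that labels variants by genome context or functional region, which mirrors the stratification options listed above.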
Working Group Charges
RM Selection and Design
• Derivative products based on NIST RMs
• RMs for cancer and somatic variant calling?
• Do we need another large family and/or more diversity?
• What is the priority of transcriptome RMs?
• Meet in Lecture Room C
Characterization/Bioinformatics
• What are the barriers to submitting data via SRA?
• How should we use long read technologies?
• How should we call structural variants?
• Do we need targeted confirmation/validation of SNPs, indels, or SVs?
• Integration of data for PGP trios
• Meet in Lecture Room B in the morning and Lecture Room A in the afternoon
Working Group Charges
Performance Metrics
• How should we coordinate with the Global Alliance for Genomics and Health Benchmarking group?
• Feedback about Performance Metrics Specification
– Stratification of performance by type of error, variant type, genome context, and functional region
• Meet in Dining Room A&B