Aug 2014 GIAB Status Update and WG Charge

Genome in a Bottle: Reference Materials to Enable Translation

August 2014

Justin Zook, Marc Salit, and the Genome in a Bottle Consortium

NIST Human Genome RMs in the pipeline

• All 10 ug samples of DNA isolated from multistage large growth cell cultures

– all are intended to act as stable, homogeneous references suitable for use in regulated applications

– all genomes also available from Coriell repository

• Pilot Genome: ~8400 tubes

• Ashkenazim Jewish Trio: ~10000 son; ~2500 each parent

• Asian Trio: ~10000 son; parents not yet planned as NIST RM

Homogeneity Analysis

• First and last vial

– 3 libraries sequenced to ~33x each

– Use VarScan to detect differences in allele fraction of SNPs and indels between vials: significant differences only found in regions prone to alignment errors

– Use BIC-seq to detect differences in copy number between vials and libraries: no consistent differences between comparisons of different libraries between vials

• 4 random vials

– 2 libraries sequenced to 12.5x each

– Use BIC-seq to detect differences in copy number between vials and libraries: only one difference with p<10^-8, which is in a region prone to mapping errors

• Sequence multiple libraries from multiple vials

• Use somatic mutation callers to detect differences in SNPs and CNVs
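
The vial-to-vial comparison of allele fractions can be illustrated with a small sketch. It assumes each site is summarized by reference and alternate read counts in two vials and compares them with a two-sided Fisher's exact test. This is not VarScan's implementation; the function names and the `alpha` default are hypothetical and chosen only for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def allele_fraction_differs(ref_a, alt_a, ref_b, alt_b, alpha=0.01):
    """Hypothetical helper: flag a site whose alt fraction differs between vials.

    alpha is an assumed threshold, not a value taken from the slides.
    """
    return fisher_exact_two_sided(ref_a, alt_a, ref_b, alt_b) < alpha


# Read counts (ref, alt) in vial A followed by (ref, alt) in vial B.
print(allele_fraction_differs(17, 16, 15, 18))  # similar fractions in both vials
print(allele_fraction_differs(33, 0, 5, 28))    # very different fractions
```

In practice a genome-wide scan would also need a multiple-testing correction and a filter for regions prone to alignment errors, as the results above suggest.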

Human/Bacterial RM Stability Study

• Storage conditions

– 37°C: accelerated aging

– 4°C: intermediate storage condition

– -20°C: control, long term storage condition

• Handling conditions

– Freeze Thaw 2x and 5x

– Vortex (10 sec)

– Vigorous pipetting (full vol 10x)

– Shipping cross-country

• Time points: Time = 0, 2 weeks, and 8 weeks

• Run multiple gels for each condition

• Blinded qualitative analysis of gel by 5 NIST staff

• Consensus that only vials stored at 37°C for 8 weeks had significantly decreased size

Example Gel Images


Goals for Data to Accompany RM

• ~0 false positive AND false negative calls in confident regions

• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)

• Avoid bias towards any particular platform

– take advantage of strengths of each platform

• Avoid bias towards any particular bioinformatics algorithms


Integrate 14 Datasets from 5 platforms


Integration of Data to Form Highly Confident Genotype Calls

Find all possible variant sites

Find concordant sites across multiple datasets

Identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias

For each site, remove datasets with decreasingly atypical characteristics until all datasets agree

Even if all datasets agree, identify them as uncertain if few have typical characteristics, or if they fall in known segmental duplications, SVs, or long repeats

Step labels: Candidate variants; Concordant variants; Find characteristics of bias; Arbitrate using evidence of bias; Confidence level
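
The arbitration step can be sketched in a few lines. The sketch assumes each dataset's call at a site carries a single numeric "atypicality" score summarizing evidence of bias. The `Call` type, the score, and both thresholds are hypothetical, and this is an illustration of the idea rather than the method used for the published calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    dataset: str
    genotype: str       # e.g. "0/1"
    atypicality: float  # hypothetical bias score; higher means more atypical


def arbitrate(calls, max_typical_score=1.0, min_typical=2,
              in_difficult_region=False):
    """Drop the most atypical datasets until the remaining ones agree."""
    if not calls:
        raise ValueError("need at least one call")
    remaining = sorted(calls, key=lambda c: c.atypicality)  # most typical first
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()  # remove the most atypical remaining dataset
    # Even when datasets agree, require enough of them to look typical and
    # require the site to lie outside known difficult regions.
    n_typical = sum(c.atypicality <= max_typical_score for c in remaining)
    confident = n_typical >= min_typical and not in_difficult_region
    return remaining[0].genotype, confident, [c.dataset for c in remaining]


calls = [Call("A", "0/1", 0.1), Call("B", "0/1", 0.3), Call("C", "1/1", 2.5)]
print(arbitrate(calls))
```

Sorting once and popping from the end keeps the "decreasingly atypical" order explicit: the dataset with the strongest evidence of bias is always the next one removed.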

Integration Methods to Establish Reference Variant Calls

Step labels: Candidate variants; Concordant variants; Find characteristics of bias; Arbitrate using evidence of bias; Confidence level

Zook et al., Nature Biotechnology, 2014.


Pedigree calls

• RTG and Illumina Platinum Genomes developed these

• Sequence NA12878, husband, and 11 children to identify high confidence variants

– Identify cross-over events

– Determine if genotypes are consistent with inheritance

• Integrated these with NIST high-confidence genotypes

• Should we find larger families for future genomes?

Source: Mike Eberle, Illumina
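
The inheritance check can be illustrated with a minimal sketch. It assumes unphased, diploid, autosomal genotypes written as allele tuples, and it only tests whether each child's genotype can be formed from one allele per parent. The function names are hypothetical, and the pedigree methods described above go further by tracking cross-over events along the chromosome.

```python
from itertools import product


def mendelian_consistent(mother, father, child):
    """True if the child's genotype can be formed from one allele per parent."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


def site_consistent(mother, father, children):
    """True if every child's genotype at this site is consistent with the parents."""
    return all(mendelian_consistent(mother, father, c) for c in children)


# Alleles: 0 = reference, 1 = alternate.
print(site_consistent((0, 1), (0, 0), [(0, 0), (0, 1), (1, 0)]))
print(site_consistent((0, 0), (0, 0), [(0, 0), (0, 1)]))
```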

Assigning confidence to genotypes

High-confidence sites

• Sequencing/bioinformatics methods agree or we understand the biases causing disagreement

• At least some methods have no evidence of bias

• Inherited as expected

Less confident sites

• In a region known to be difficult for current technologies

• State reasons for lower confidence

• If a site is near a low confidence site, make it low confidence
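
The last rule, lowering confidence for sites near a low confidence site, can be sketched as a simple proximity check. The sketch assumes positions on a single chromosome; the `window` size of 50 bp is a hypothetical value, not one stated in the slides.

```python
from bisect import bisect_left


def propagate_low_confidence(sites, low_conf_sites, window=50):
    """Return the sites lying within `window` bp of any low confidence site."""
    low_sorted = sorted(low_conf_sites)
    flagged = set()
    for pos in sites:
        # First low confidence position at or after the start of the window.
        i = bisect_left(low_sorted, pos - window)
        if i < len(low_sorted) and low_sorted[i] <= pos + window:
            flagged.add(pos)
    return flagged


print(sorted(propagate_low_confidence([100, 130, 500], {120})))
```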

Performance Metrics Specification

• Goal is to standardize performance metrics measured with respect to NIST RMs

• Licensing

• Definitions

• Input formats

• User interface

• Accuracy outputs

– FP, FN, Sens, Spec, etc.

– Stratification by variant type, by genome context, and by functional regions

• Characteristics of FP/FN

• Working with Global Alliance for Genomics and Health

• See draft at genomeinabottle.org
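
As an illustration of stratified accuracy outputs, the sketch below assumes each compared call has already been labeled TP, FP, or FN and assigned to a stratum such as a variant type. The record format and function name are hypothetical and are not taken from the draft specification. Specificity is left out because it needs a true negative count, which depends on how the confident regions are defined.

```python
from collections import Counter, defaultdict


def stratified_metrics(records):
    """records: iterable of (stratum, outcome), with outcome in {"TP", "FP", "FN"}."""
    counts = defaultdict(Counter)
    for stratum, outcome in records:
        counts[stratum][outcome] += 1

    def ratio(num, den):
        return num / den if den else None  # undefined when there are no calls

    metrics = {}
    for stratum, c in counts.items():
        tp, fp, fn = c["TP"], c["FP"], c["FN"]
        metrics[stratum] = {
            "TP": tp, "FP": fp, "FN": fn,
            "sensitivity": ratio(tp, tp + fn),
            "precision": ratio(tp, tp + fp),
        }
    return metrics


records = ([("SNP", "TP")] * 8 + [("SNP", "FN")] * 2 + [("SNP", "FP")] * 2
           + [("indel", "TP")] * 3 + [("indel", "FN")])
for stratum, m in stratified_metrics(records).items():
    print(stratum, m)
```

The same function can be reused for genome context or functional region by changing what the stratum label encodes.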

Working Group Charges

RM Selection and Design

• Derivative products based on NIST RMs

• RMs for cancer and somatic variant calling?

• Do we need another large family and/or more diversity?

• What is the priority of transcriptome RMs?

• Meet in Lecture Room C

Characterization/Bioinformatics

• What are the barriers to submitting data via SRA?

• How should we use long read technologies?

• How should we call structural variants?

• Do we need targeted confirmation/validation of SNPs, indels, or SVs?

• Integration of data for PGP trios

• Meet in Lecture Room B in the morning and Lecture Room A in the afternoon

Working Group Charges

Performance Metrics

• How should we coordinate with the Global Alliance for Genomics and Health Benchmarking group?

• Feedback about Performance Metrics Specification

– Stratification of performance by type of error, variant type, genome context, and functional region

• Meet in Dining Room A&B
