Genomics for Everyone - Duke Center for Applied Genomics...
Confidential
James Lu MD PhD
Genomics for Everyone
Big Data is everywhere!
The amount of data about people will reach zettabyte scale (1 ZB = 1 trillion GB)
100 million to 2 billion people are expected to be sequenced by 2025
Data acquisition will not be an issue… we will be awash in data
1. Data QC
2. Data silos
3. Systematically under-analyzed
4. Methods (sophistication and computation)
5. Need to (re)engage participants and users to gather relevant datasets
Data
Methods
Patient Engagement
4 Data Stories
1. Robust variant calling (pooling similar data)
2. Identifying disease variants in rare disease (simple forms of two types of data)
3. Phenotype modeling (multi-modal data)
4. Patient centricity and precision medicine (mashing things up at scale)
And many more…
Aggregation of low-coverage sequencing enables low-cost mapping of human genetic diversity.
x thousands = more diverse haplotype references & improved imputation for GWAS
Durbin et al., Nature (2011), Abecasis et al, Nature (2012)
39 A, 1 T = A/A
25 A, 15 T = A/T
5 A, 35 T = T/T
Early variant calling was heuristic-based and used simple maximum-likelihood models
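The read-count examples above can be reproduced with a simple binomial model. This is a minimal sketch, not the caller used in any cited work: it assumes each read independently shows the alternate allele with probability `error_rate`, 0.5, or `1 - error_rate` under the three diploid genotypes. The function names and the 1% error rate are illustrative assumptions.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihood of the observed read counts under each diploid genotype."""
    n = ref_reads + alt_reads
    # Assumed probability that a single read shows the alternate allele.
    p_alt = {"Ref/Ref": error_rate, "Ref/Alt": 0.5, "Alt/Alt": 1.0 - error_rate}
    return {
        genotype: comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for genotype, p in p_alt.items()
    }


def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood (maximum-likelihood call)."""
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Read counts taken from the slide; A is treated as Ref and T as Alt.
    for ref, alt in [(39, 1), (25, 15), (5, 35)]:
        print(ref, "A,", alt, "T ->", call_genotype(ref, alt))
```

At a depth of 40 reads the three calls are unambiguous. With only four reads (3 A, 1 T) the same model prefers A/T, which shows how fragile a likelihood-only call becomes at low depth.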
What if your read depth is really low to none…?
3 A, 1 T = A/A ?
4 A, 2 T = A/T ?
3 A, 7 T = T/T ?
Unacceptable error rates with just maximum likelihood methods
[Bar chart: discordance rate (axis 0% to 30%) by genotype class (Alt/Alt, Ref/Alt, Ref/Ref) for two data types, LowCov and Exome. Plotted values: 1.63%, 3.59%, 0.12%; 1.10%, 28.16%, 0.41%.]
Build flexible models that incorporate more information (linkage disequilibrium + genotype priors). Posteriors then provide more accurate inference
~50-95% improvement vs. read depth alone
~8-30% better than other standard tools
Wang Y*, Lu JT*, Genome Research (2013)
P(Genotype = X/X | Data) ∝ P(Read Depth | Genotype = X/X) × P(LD | Genotype = X/X) × P(Genotype = X/X)
P(Read Depth | Genotype = X/X): genotype likelihood
P(LD | Genotype = X/X): statistical relationship between variants across samples
P(Genotype = X/X): prior
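The factorization above can be sketched numerically. This is an illustration under stated assumptions, not the published method: the read-depth term is a binomial likelihood, the prior is a Hardy-Weinberg prior built from an assumed alternate-allele frequency, and the LD term is passed in by the caller as a plain dict because the slides do not say how it is computed. All names and default values are hypothetical.

```python
from math import comb

GENOTYPES = ("Ref/Ref", "Ref/Alt", "Alt/Alt")


def read_likelihood(ref_reads, alt_reads, error_rate=0.01):
    """P(read counts | genotype) under an assumed binomial read model."""
    n = ref_reads + alt_reads
    p_alt = {"Ref/Ref": error_rate, "Ref/Alt": 0.5, "Alt/Alt": 1.0 - error_rate}
    return {
        g: comb(n, alt_reads) * p_alt[g] ** alt_reads * (1.0 - p_alt[g]) ** ref_reads
        for g in GENOTYPES
    }


def hardy_weinberg_prior(alt_freq):
    """P(genotype) from an assumed population alternate-allele frequency."""
    q = alt_freq
    p = 1.0 - q
    return {"Ref/Ref": p * p, "Ref/Alt": 2.0 * p * q, "Alt/Alt": q * q}


def genotype_posterior(ref_reads, alt_reads, alt_freq, ld_term=None, error_rate=0.01):
    """Multiply likelihood, LD term, and prior, then normalize to sum to 1."""
    likelihood = read_likelihood(ref_reads, alt_reads, error_rate)
    prior = hardy_weinberg_prior(alt_freq)
    # With no LD information, use a flat term that leaves the posterior unchanged.
    ld = ld_term if ld_term is not None else {g: 1.0 for g in GENOTYPES}
    unnormalized = {g: likelihood[g] * ld[g] * prior[g] for g in GENOTYPES}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


if __name__ == "__main__":
    # Low-depth example: 3 reference reads, 1 alternate read, rare variant.
    print(genotype_posterior(3, 1, alt_freq=0.01))
```

For a rare variant (assumed allele frequency 1%) the prior pulls the 3 A, 1 T observation toward the homozygous reference call, while an LD term that favors the heterozygote can pull it back. That is the sense in which the extra terms add information that read depth alone lacks.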
Framework also lets us combine other sources of data to improve inference
1) Exome Data
2) Microarray Data
Accommodating more data in flexible models improves accuracy, at no additional cost
[Bar chart: discordance rate (axis 0% to 30%) by genotype class (Alt/Alt, Ref/Alt, Ref/Ref) for LowCov, Exome, and Integration. Plotted values: 1.07%, 2.44%, 0.09%; 1.63%, 3.59%, 0.12%; 1.10%, 28.16%, 0.41%.]
Another 4-10% improvement in accuracy
*Unpublished work
4 Data Stories
1. Robust variant calling (pooling similar data)
2. Identifying disease variants in rare disease (simple forms of two types of data)
3. Phenotype modeling (multi-modal data)
4. Patient centricity and precision medicine (mashing things up at scale)
And many more…
Skeletal Dysmorphism: e.g. Yunis-Varon syndrome
• Cleidocranial dysplasia, digital anomalies and severe central / peripheral neurological involvement
• Enlarged vacuoles are found in neurons, muscle, and cartilage
• Negative RUNX2 test (classical cleidocranial dysplasia)
Campeau P*, Lu JT*, (2013)
Strategic use of large databases allows identification of primary causative alleles
~30% success rate
1. Highly conserved variant
2. Segregated amongst probands
3. Validated in functional studies
4. Phenotypic similarity in mice
Campeau P, Lu JT et al., AJHG (2013)
“Solving” additional cases requires:
1) Modeling phenotype vectors to identify clusters of rare cases that may share a genetic etiology
2) Learning from identified causative alleles to train novel algorithms that identify better candidates
New candidates published: AJHG (2012, 2013), Human Mutation (2012), NEJM (2013), Lancet Neurology (2013), etc…
Four stories
1. Robust variant calling
2. Identifying disease variants in rare disease
3. Phenotype modeling at scale
4. Patient centricity and precision medicine
4 Data Stories
1. Robust variant calling (pooling similar data)
2. Identifying disease variants in rare disease (simple forms of two types of data)
3. Phenotype modeling (multi-modal data)
4. Patient centricity and precision medicine (mashing things up at scale)
Phenotyping will be the bottleneck, but there is a massive opportunity to mine data that is collected incidentally (and is already available and free!)
Lu JT et al., NEJM (2013)
Two emerging sources:
1. EHR
2. Clinical Trials
EHR: Model & mine multi-modal datasets for novel findings
• Patient level (>240,000 patients)
  – Birthday, Death, Gender, Race, Ethnicity
• Visit level (4.4 million patient visits)
  – Average 18 measurements recorded per visit, including presence/absence of particular diseases (computed)
  – Encounter date (start, end)
  – Location (DHRH, DUH, DRH)
  – Path (ED -> inpatient, for example)
  – Inpatient / Outpatient
• Interventions (>60,000 types of observations)
  – CPT
  – ICD9 diagnoses
  – ICD9 procedures
  – Lab values
  – Medications
  – Vitals
Patients are composites of common and rare latent phenotypes (discovered by factor models)
ER/ EKG
Standard Labs (i.e. CBC/ BMP)
Kidney Disease
Hypertension
Surgical Patient
Identify known and unique clusters of patients for further study
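The idea of patients as mixtures of latent phenotypes can be illustrated with a small matrix factorization. The slides do not say which factor model was used, so this sketch substitutes non-negative matrix factorization (NMF) with multiplicative updates as a stand-in. The toy patient-by-observation matrix, the choice of two factors, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between V and W x H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (patients x observations) into W (patients x k) and H (k x observations)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Hypothetical data: rows are patients, columns are binary observations.
    visits = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 1, 1, 1],  # a patient who is a composite of both patterns
    ]
    loadings, factors = nmf(visits, k=2)
    for row in loadings:
        print([round(x, 2) for x in row])
```

Each row of `loadings` describes one patient as a non-negative mix of the learned factors, so patients with similar rows can be grouped into clusters for further study.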
Computational factors perform better than a validated clinical risk algorithm (UKPDS) when used to predict outcomes in type 2 diabetes (T2DM)
1) Local models outperform risk prediction algorithms
2) Local models identify patient clusters at higher risk
#2: Clinical trial data is now becoming available: huge opportunities for new science
Typically, blinded data is not very informative
Unblinded: huge divergence in the (-) arms between treatment and placebo
Proof of Concept: Can we predict clinical trial outcomes?
[Two line charts of change from baseline at Baseline, Day 45, and Day 90. One panel (y-axis -2 to 4) shows four series: (-) Tx, (-) Placebo, (+) Tx, (+) Placebo. The other panel (y-axis -1.5 to 1.5) shows two series: (-) and (+).]
Data from a Phase 2b study. Unpublished and intentionally blinded.
Analysis of individual-level trends identifies sub-groups of responders
Group (n)      (-) Tx      (-) P       (+) Tx      (+) P
Low (26)       -60% (2)    73% (7)     17% (11)    -21% (6)
Medium (77)    0% (23)     0% (19)     -7% (18)    8% (17)
High (37)      27% (14)    -33% (6)    -4% (9)     5% (8)
Can we guess the results of the Phase 3?
Simulate weighted randomization based on Phase 2b to “guess” which individual patients are in the treatment or placebo arm
Simulate trial outcomes (s=10000) and estimate outcome of full study.
Difference between Tx and Placebo arm    Likelihood
>2                                       53.8%
>3                                       22.6%
>4                                       4.4%
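The simulation steps above can be sketched as a small Monte Carlo loop. This is a simplified illustration, not the analysis behind the reported percentages: it assumes each blinded patient has an observed response and an assumed probability of being on treatment (the "weight"), assigns arms at random in each simulated trial, and records the difference in mean response between arms. The function names, the example data, and the thresholds are hypothetical.

```python
import random


def simulate_arm_differences(responses, p_treatment, n_sims=10000, seed=0):
    """Return the simulated Tx-minus-placebo mean differences, one per simulated trial."""
    rng = random.Random(seed)
    differences = []
    for _ in range(n_sims):
        treatment, placebo = [], []
        for response, p in zip(responses, p_treatment):
            # Weighted randomization: guess the arm using the assumed probability.
            (treatment if rng.random() < p else placebo).append(response)
        if treatment and placebo:  # skip draws where one arm is empty
            differences.append(
                sum(treatment) / len(treatment) - sum(placebo) / len(placebo)
            )
    return differences


def exceedance_probabilities(differences, thresholds):
    """Fraction of simulated trials whose difference exceeds each threshold."""
    return {
        t: sum(d > t for d in differences) / len(differences) for t in thresholds
    }


if __name__ == "__main__":
    # Hypothetical blinded responses and assumed treatment probabilities.
    responses = [6.0, 5.5, 4.0, 3.5, 1.0, 0.5, 0.0, -0.5]
    weights = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    diffs = simulate_arm_differences(responses, weights, n_sims=10000)
    print(exceedance_probabilities(diffs, [2, 3, 4]))
```

The output has the same shape as the likelihood table: for each threshold, the share of simulated trials in which the treatment arm beats placebo by more than that amount.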
Unsupervised clustering of individuals' responses to drug over time in Phase 3 (not a snapshot, i.e. not an interim analysis)
Four stories
1. Robust variant calling (pooling similar data)
2. Identifying disease variants in rare disease (simple forms of two types of data)
3. Phenotype modeling (multi-modal data)
4. Patient centricity and precision medicine (mashing things up at scale)
1) Expensive 2) Hard to interpret 3) Not portable
Ecosystem of Partners
Logistics & Sequencing
Initial Purchase
Digital DNA
Two purchase points: Helix & your app
Helix handles shipping, saliva collection, exome+ sequencing and data storage
Partner accesses DNA data through Helix APIs, enabling rapid product development
Our announced partners
EXPLORAGEN
Clinical-quality genomic data, available via API
1) Pay as you go 2) Expert interpretation 3) User-centric
What if genomic data becomes ubiquitous?
1. Genetic “Testing” isn’t testing anymore
2. Massive creativity enabled by open data platforms
3. Genetics gains context (and loses its exceptionalism) by mashing it with other data
4. Cost-benefit assessments will change dramatically
5. Users will engage through applications they care about… and will be happy to contribute to research
#1: “Testing” changes
Health Intake
Diagnosis
Confirm
1) One-time event
2) Long wait times
3) Report is unintelligible
4) Data belongs to health system
5) Results are never updated
#1: “Testing” changes
Directed Health Intake
Diagnosis
Intervention
1) Normal to know about genetic risk
2) Near-instant data transmission
3) Data is made actionable for patient
4) Data belongs to patient
5) Results are regularly updated
#2: Creativity is enabled by open data platforms
Voice Recognition
Location
Genomics
?
Polygenic genetic risk is independent of lifestyle but modifiable by lifestyle choices
#3: Genetics will gain context and become more predictive and useful
Source: Khera et al., NEJM, 2016
e.g. CHD Risk ~ f (lifestyle factors & genetic risk)
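One simple way to read the relation above is as a multiplicative model, in which genetic risk and lifestyle each scale a baseline hazard independently. The sketch below is an assumption about the form of f, and the hazard ratios are placeholders chosen for illustration; they are not taken from Khera et al.

```python
# Hypothetical hazard ratios, relative to low genetic risk and a favorable lifestyle.
GENETIC_HR = {"low": 1.0, "intermediate": 1.4, "high": 1.9}
LIFESTYLE_HR = {"favorable": 1.0, "intermediate": 1.3, "unfavorable": 1.8}


def relative_chd_risk(genetic_risk, lifestyle):
    """Relative CHD risk under an assumed multiplicative (independent effects) model."""
    return GENETIC_HR[genetic_risk] * LIFESTYLE_HR[lifestyle]


if __name__ == "__main__":
    for genetic in GENETIC_HR:
        for lifestyle in LIFESTYLE_HR:
            print(genetic, lifestyle, round(relative_chd_risk(genetic, lifestyle), 2))
```

Under this form, a person at high genetic risk still lowers their relative risk by moving from an unfavorable to a favorable lifestyle, which is the sense in which genetic risk is modifiable.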
Customize recommendations
Compare by genotype
Encourage specific behaviors
… partner this information with other data to risk stratify and encourage specific behaviors
#4: Novel models will transform the cost-benefit discussion around genetic applications
Source: Khera et al., NEJM, 2016; Bennette et al., Genetics in Medicine, 2015
~$50,000 per QALY for an actionable medical panel (ACMG) priced at ~$250
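A cost-per-QALY figure is an incremental cost-effectiveness ratio (ICER): the extra cost of an intervention divided by the extra quality-adjusted life years it yields. The sketch below shows only that arithmetic. The inputs are hypothetical and are not the values modeled by Bennette et al., whose analysis also accounts for downstream costs and savings.

```python
def icer(cost_with, cost_without, qaly_with, qaly_without):
    """Incremental cost-effectiveness ratio: extra cost per extra QALY gained."""
    delta_cost = cost_with - cost_without
    delta_qaly = qaly_with - qaly_without
    if delta_qaly <= 0:
        raise ValueError("ICER is only meaningful here for a positive QALY gain")
    return delta_cost / delta_qaly


if __name__ == "__main__":
    # Hypothetical: a net extra cost of $250 per person and an average gain of
    # 0.005 QALYs per person (roughly two quality-adjusted days).
    print(round(icer(250.0, 0.0, 0.005, 0.0)))
```

The ratio is very sensitive to price: under the same assumed QALY gain, halving the net cost halves the cost per QALY, which is why cheaper panels change the cost-benefit discussion.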
#5: Individuals will engage through products they already care about…
... and will be willing to participate in research as long as they get something back of value
Believe they should be in control of who has access to their genetic data
Believe they should be informed of findings that resulted from access to their genetic data
90%
Source: Rock Health, “The Genomics Inflection Point: Implications for Healthcare” & 23andMe
89%
Are willing to participate in research that generates novel discovery
85%
More data
Drives Innovation
Enables Better Products
Attracts Users
Creating a virtuous cycle
Ultimately, this has the power to…
Eliminate guesswork
Improve outcomes
Enhance other data
Accelerate product development
Appendix
How I got here