genomics for everyone - duke center for applied genomics...

Confidential


James Lu MD PhD

Genomics for Everyone


Big Data is everywhere!


The amount of data about people will reach zettabyte scale (1 zettabyte is about 1 trillion GB)

100 million to 2 billion people are expected to be sequenced by 2025


Data acquisition will not be an issue… we will be awash in data

1. Data QC

2. Data silos

3. Systematically under-analyzed

4. Methods (sophistication and computation)

5. Need to (re)engage participants/users to gather relevant datasets

Data

Methods

Patient Engagement


4 Data Stories

1. Robust variant calling (pooling similar data)

2. Identifying disease variants in rare disease (simple forms of two types of data)

3. Phenotype modeling (multi-modal data)

4. Patient centricity and precision medicine (mashing things up at scale)

And many more…


Aggregation of low-coverage sequencing enables low-cost mapping of human genetic diversity.

× thousands = more diverse haplotype references and improved imputation for GWAS

Durbin et al., Nature (2011); Abecasis et al., Nature (2012)


39 A, 1 T = A/A
25 A, 15 T = A/T
5 A, 35 T = T/T

Early variant calling was heuristic-based and used simple maximum-likelihood models
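As a rough illustration of what a simple maximum-likelihood call from read counts can look like, here is a minimal sketch. It assumes a binomial read model with a fixed per-read error rate; the function name `call_genotype_ml` and the 1% error rate are assumptions made for this example, not the method used by any particular early caller.

```python
import math

def call_genotype_ml(ref_reads, alt_reads, error_rate=0.01):
    """Return the genotype with the highest binomial likelihood.

    Assumption: each read independently shows the alt allele with
    probability error_rate (Ref/Ref), 0.5 (Ref/Alt) or 1 - error_rate (Alt/Alt).
    """
    p_alt = {"Ref/Ref": error_rate, "Ref/Alt": 0.5, "Alt/Alt": 1.0 - error_rate}
    # The binomial coefficient is the same for every genotype, so it is omitted.
    log_lik = {
        genotype: ref_reads * math.log(1.0 - p) + alt_reads * math.log(p)
        for genotype, p in p_alt.items()
    }
    return max(log_lik, key=log_lik.get)

# The three high-depth examples from the slide, treating A as ref and T as alt.
for ref, alt in [(39, 1), (25, 15), (5, 35)]:
    print(ref, "A,", alt, "T ->", call_genotype_ml(ref, alt))
```

With deep coverage the likelihoods separate cleanly; with only a handful of reads the same rule becomes unstable, which is the problem the next slide raises.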


What if your read depth is really low to none…?

3 A, 1 T = A/A ?
4 A, 2 T = A/T ?
3 A, 7 T = T/T ?

Unacceptable error rates with just maximum likelihood methods

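[Figure: bar chart of discordance rate (axis 0% to 30%) by genotype class (Alt/Alt, Ref/Alt, Ref/Ref) for two series, LowCov and Exome. Bar values as extracted, in groups of three: 1.63%, 3.59%, 0.12%; 1.10%, 28.16%, 0.41%.]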


Build flexible models that incorporate more information (linkage disequilibrium + genotype priors). Posteriors then provide accurate inference

~50-95% improvement vs. read depth alone
~8-30% better than other standard tools

Wang Y*, Lu JT*, Genome Research (2013)

P(Genotype = X/X | Data) ∝ P(Read Depth | Genotype = X/X) × P(LD | Genotype = X/X) × P(Genotype = X/X)

P(Read Depth | Genotype = X/X): genotype likelihood
P(LD | Genotype = X/X): statistical relationship between variants across samples
P(Genotype = X/X): prior
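The product of likelihood, LD term and prior can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the published model: the read likelihood is binomial with a fixed error rate, the prior comes from Hardy-Weinberg proportions at a hypothetical alt-allele frequency, and the LD term is a caller-supplied placeholder (`ld_term`) standing in for whatever the real model learns across samples.

```python
def genotype_posterior(ref_reads, alt_reads, alt_freq, ld_term=None, error_rate=0.01):
    """Posterior over genotypes: likelihood x LD term x prior, then normalized."""
    genotypes = ("Ref/Ref", "Ref/Alt", "Alt/Alt")
    # Assumed probability that one read shows the alt allele, per genotype.
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    likelihood = [(1.0 - p) ** ref_reads * p ** alt_reads for p in p_alt]
    # Assumed prior: Hardy-Weinberg proportions at alt-allele frequency alt_freq.
    q = alt_freq
    prior = [(1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2]
    # Placeholder LD term: relative support for each genotype from linked variants.
    ld = ld_term if ld_term is not None else (1.0, 1.0, 1.0)
    unnormalized = [l * d * p for l, d, p in zip(likelihood, ld, prior)]
    total = sum(unnormalized)
    return {g: u / total for g, u in zip(genotypes, unnormalized)}

# Low-depth example (3 ref reads, 1 alt read) at a rare variant.
posterior = genotype_posterior(3, 1, alt_freq=0.01)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

At this depth the read likelihood alone favors a heterozygous call, while the rare-allele prior shifts the posterior toward Ref/Ref; an informative `ld_term` can shift it back.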


Framework also lets us combine other sources of data to improve inference

1) Exome Data

2) Microarray Data


Accommodating more data in flexible models improves accuracy, at no additional cost

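[Figure: bar chart of discordance rate (axis 0% to 30%) by genotype class (Alt/Alt, Ref/Alt, Ref/Ref) for three series, LowCov, Exome and Integration. Bar values as extracted, in groups of three: 1.07%, 2.44%, 0.09%; 1.63%, 3.59%, 0.12%; 1.10%, 28.16%, 0.41%.]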

Another 4-10% improvement in accuracy

*Unpublished work


Skeletal Dysmorphism: e.g. Yunis-Varon syndrome

• Cleidocranial dysplasia, digital anomalies and severe central/peripheral neurological involvement

• Enlarged vacuoles are found in neurons, muscle, and cartilage

• Negative RUNX2 test (classical cleidocranial dysplasia)

Campeau P*, Lu JT*, (2013)


Strategic use of large databases allows identification of primary causative alleles

~30% success rate

1. Highly conserved variant

2. Segregated amongst probands

3. Validated in functional studies

4. Phenotypic similarity in mice

Campeau P, Lu JT et al., AJHG (2013)


“Solving” additional cases requires:

1) Modeling phenotype vectors to identify clusters of rare cases that may share a genetic etiology

2) Learning from identified causative alleles to train novel algorithms that identify better candidates
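One simple way to picture the first requirement is to treat each case as a set of phenotype terms and group cases whose sets overlap strongly. The sketch below uses Jaccard similarity with single-linkage grouping; the case identifiers, the term lists and the 0.5 threshold are all hypothetical, and this is an illustration rather than the method used in the cited work.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of phenotype terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_cases(cases, threshold=0.5):
    """Single-linkage grouping: a case joins any cluster holding a similar case."""
    clusters = []
    for case_id, terms in cases.items():
        matches = [
            c for c in clusters
            if any(jaccard(terms, cases[member]) >= threshold for member in c)
        ]
        merged = {case_id}
        for c in matches:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters

# Hypothetical cases described by free-text phenotype terms.
cases = {
    "case_1": {"cleidocranial dysplasia", "digital anomalies", "neurological involvement"},
    "case_2": {"cleidocranial dysplasia", "digital anomalies", "enlarged vacuoles"},
    "case_3": {"hypertension", "kidney disease"},
}
print(cluster_cases(cases))
```

Cases that land in the same cluster become candidates for a shared genetic etiology and can be analyzed jointly.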

AJHG (2012, 2013), Hum Mutation (2012), NEJM (2013), Lancet Neurology (2013), etc…

New candidates Published


Four stories

1. Robust variant calling

2. Identifying disease variants in rare disease

3. Phenotype modeling at scale

4. Patient centricity and precision medicine


Phenotyping will be the bottleneck, but there is a massive opportunity to mine data that is collected incidentally (and is already available and free!)

Lu JT et al., NEJM (2013)

Two emerging sources:

1. EHR

2. Clinical Trials


EHR: Model & mine multi-modal datasets for novel findings

• Patient level (>240,000 patients)
  – Birthday, Death, Gender, Race, Ethnicity

• Visit level (4.4 million patient visits)
  – Average 18 measurements recorded per visit, including presence/absence of particular diseases (computed)
  – Encounter date (start, end)
  – Location (DHRH, DUH, DRH)
  – Path (ED -> inpatient, for example)
  – Inpatient / Outpatient

• Interventions (>60,000 types of observations)
  – CPT
  – ICD9 diagnoses
  – ICD9 procedures
  – Lab values
  – Medications
  – Vitals
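The three levels listed above map naturally onto a nested record structure. The sketch below is one possible in-memory layout using Python dataclasses; every class and field name here is hypothetical and chosen only to mirror the bullet points, not taken from the actual dataset schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class Observation:
    kind: str                      # e.g. "CPT", "ICD9 diagnosis", "lab", "medication", "vital"
    code: str                      # placeholder identifier for the observation type
    value: Optional[float] = None  # numeric result where one exists (labs, vitals)

@dataclass
class Visit:
    start: date
    end: date
    location: str                  # e.g. "DUH"
    path: str                      # e.g. "ED -> inpatient"
    inpatient: bool
    observations: List[Observation] = field(default_factory=list)

@dataclass
class Patient:
    patient_id: str
    birth_date: date
    death_date: Optional[date]
    gender: str
    race: str
    ethnicity: str
    visits: List[Visit] = field(default_factory=list)

def mean_observations_per_visit(patients):
    """Average number of observations recorded per visit across all patients."""
    visits = [v for p in patients for v in p.visits]
    return sum(len(v.observations) for v in visits) / len(visits) if visits else 0.0

# A made-up patient with one visit and two observations.
example = Patient(
    "p-001", date(1960, 1, 1), None, "F", "unspecified", "unspecified",
    visits=[Visit(date(2012, 3, 1), date(2012, 3, 4), "DUH", "ED -> inpatient", True,
                  [Observation("lab", "example-lab", 1.2), Observation("vital", "example-vital", 72.0)])],
)
print(mean_observations_per_visit([example]))
```

Keeping observations attached to visits preserves the encounter context (dates, location, path) that the multi-modal models draw on.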


Patients are composites of common and rare latent phenotypes (discovered by factor models)

ER/EKG

Standard Labs (i.e. CBC/BMP)

Kidney Disease

Hypertension

Surgical Patient

Identify known and unique clusters of patients for further study


Computational factors perform better than validated clinical risk algorithms (UKPDS) when used to predict outcomes (T2DM)

1) Local models outperform risk prediction algorithms and 2) identify patient clusters at higher risk


#2: Clinical trial data is now becoming available: huge opportunities for new science


Proof of Concept: Can we predict clinical trial outcomes?

Typically, blinded data is not very informative

[Figure, blinded panel: (-) and (+) groups at Baseline, Day 45 Change from Baseline and Day 90 Change from Baseline; y-axis from -1.5 to 1.5.]

Un-blinded, huge divergence in (-) arms between Treatment and Placebo

[Figure, un-blinded panel: (-) Tx, (-) Placebo, (+) Tx and (+) Placebo at the same three time points; y-axis from -2 to 4.]

Data from a Phase 2b study. Unpublished and intentionally blinded.


Analysis of individual-level trends identifies sub-groups of responders


Group (n)      (-) Tx      (-) P       (+) Tx      (+) P
Low (26)       -60% (2)    73% (7)     17% (11)    -21% (6)
Medium (77)    0% (23)     0% (19)     -7% (18)    8% (17)
High (37)      27% (14)    -33% (6)    -4% (9)     5% (8)


Can we guess the results of the Phase 3?

Simulate weighted randomization based on Phase 2b to “guess” which individual patients are in the treatment or placebo arm

Simulate trial outcomes (s=10000) and estimate the outcome of the full study.

Difference between Tx and Placebo arm    Likelihood
>2                                       53.8%
>3                                       22.6%
>4                                       4.4%

Unsupervised clustering of individuals' response to drug over time in Phase 3 (not a snapshot, i.e. not an interim analysis)
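The simulation idea on this slide can be sketched as a small Monte Carlo loop. This is an illustration under assumptions, not the actual analysis: `outcomes` are made-up blinded patient outcomes, `p_treatment` are hypothetical per-patient probabilities of being on treatment (standing in for weights learned from Phase 2b), and the function names are invented for the example.

```python
import random

def simulate_arm_difference(outcomes, p_treatment, n_sims=10_000, seed=0):
    """Repeatedly guess arm assignments and record mean(Tx) - mean(Placebo)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_sims):
        tx, placebo = [], []
        for outcome, p in zip(outcomes, p_treatment):
            # Weighted randomization: assign to treatment with probability p.
            (tx if rng.random() < p else placebo).append(outcome)
        if tx and placebo:  # skip draws where one arm ends up empty
            diffs.append(sum(tx) / len(tx) - sum(placebo) / len(placebo))
    return diffs

def exceedance(diffs, threshold):
    """Fraction of simulated trials whose arm difference exceeds the threshold."""
    return sum(d > threshold for d in diffs) / len(diffs)

# Hypothetical blinded outcomes and treatment-arm probabilities.
outcomes = [4.0, 3.5, 3.0, 0.5, 0.0, -0.5, 2.5, 1.0]
p_treatment = [0.8, 0.8, 0.7, 0.2, 0.2, 0.3, 0.6, 0.4]

diffs = simulate_arm_difference(outcomes, p_treatment)
for threshold in (2, 3, 4):
    print(f">{threshold}: {exceedance(diffs, threshold):.1%}")
```

The exceedance fractions play the role of the "Likelihood" column in the table: the share of simulated trials in which the treatment arm beats placebo by more than each threshold.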


1) Expensive 2) Hard to interpret 3) Not portable


Ecosystem of Partners

Logistics & Sequencing

Initial Purchase

Digital DNA

Two purchase points: Helix & your app

Helix handles shipping, saliva collection, exome+ sequencing and data storage

Partner accesses DNA data through Helix APIs, enabling rapid product development

Our announced partners

EXPLORAGEN

Clinical-quality genomic data, available via API

1) Pay as you go 2) Expert interpretation 3) User-centric


What if genomic data becomes ubiquitous?

1. Genetic “Testing” isn’t testing anymore

2. Massive creativity enabled by open data platforms

3. Genetics gains context (and loses its exceptionalism) by mashing it with other data

4. Cost-benefit assessments will change dramatically

5. Users will engage through applications they care about… and will be happy to contribute to research


#1: “Testing” changes

Health Intake

Diagnosis

Confirm

1) One-time event
2) Long wait times
3) Report is unintelligible
4) Data belongs to health system
5) Results are never updated


#1: “Testing” changes

Directed Health Intake

Diagnosis

Intervention

1) Normal to know about genetic risk
2) Near-instant data transmission
3) Data is made actionable for patient
4) Data belongs to patient
5) Results are regularly updated


#2: Creativity is enabled by open data platforms

Voice Recognition, Location, Genomics

?


Polygenic genetic risk is independent of lifestyle but modifiable by lifestyle choices

#3: Genetics will gain context and become more predictive and useful

Source: Khera et al., NEJM, 2016

e.g. CHD Risk ~ f(lifestyle factors, genetic risk)
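One simple reading of "independent of lifestyle but modifiable by lifestyle choices" is a multiplicative model, in which lifestyle scales risk by the same factor in every genetic-risk stratum. The sketch below illustrates that reading only; the multipliers are hypothetical placeholders, not values from Khera et al., and the function name is invented for the example.

```python
# Hypothetical relative-risk multipliers, for illustration only.
GENETIC_RR = {"low": 1.0, "intermediate": 1.4, "high": 1.9}
LIFESTYLE_RR = {"favorable": 1.0, "intermediate": 1.3, "unfavorable": 1.8}

def relative_chd_risk(genetic, lifestyle):
    """Relative CHD risk under an assumed multiplicative (independent) model."""
    return GENETIC_RR[genetic] * LIFESTYLE_RR[lifestyle]

# Print the full grid of genetic-risk by lifestyle combinations.
for genetic in GENETIC_RR:
    row = ", ".join(
        f"{lifestyle}: {relative_chd_risk(genetic, lifestyle):.2f}"
        for lifestyle in LIFESTYLE_RR
    )
    print(f"{genetic} genetic risk -> {row}")
```

Under this assumption, a favorable lifestyle lowers risk by the same proportion whatever the genetic stratum, which is why the larger absolute gain falls to people at high genetic risk.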


Customize recommendations

Compare by genotype

Encourage specific behaviors

… pair this information with other data to risk-stratify and encourage specific behaviors


#4: Novel models will transform the cost-benefit discussion around genetic applications


Source: Khera et al., NEJM, 2016; Bennette et al., Genetics in Medicine, 2015

~$50,000 per QALY for an actionable medical (ACMG) panel priced at ~$250
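Cost per QALY is conventionally reported as an incremental cost-effectiveness ratio (ICER): the extra cost of an intervention divided by the extra quality-adjusted life years it yields. The sketch below shows only the arithmetic; the input numbers are hypothetical and are not taken from the cited sources, whose models account for far more than the panel price.

```python
def icer(delta_cost, delta_qaly):
    """Incremental cost-effectiveness ratio: extra cost per extra QALY gained."""
    if delta_qaly <= 0:
        raise ValueError("delta_qaly must be positive for a meaningful ratio")
    return delta_cost / delta_qaly

# Hypothetical inputs: $250 net incremental cost, 0.005 QALYs gained per person.
example_icer = icer(250.0, 0.005)
print(f"~${example_icer:,.0f} per QALY")
```

Because the ratio scales linearly with incremental cost, lowering the price of a panel shifts it directly toward commonly used cost-effectiveness thresholds.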


#5: Individuals will engage through products they already care about…


... and will be willing to participate in research as long as they get something back of value

Believe they should be in control of who has access to their genetic data

Believe they should be informed of findings that resulted from access to their genetic data

90%

Source: Rock Health, “The Genomics Inflection Point: Implications for Healthcare” & 23andMe

89%

Are willing to participate in research that generates novel discovery

85%


More data

Drives Innovation

Enables Better Products

Attracts Users

Creating a virtuous cycle


Ultimately, this has the power to…

Eliminate guesswork

Improve outcomes

Enhance other data

Accelerate product development


Appendix


How I got here
