
Obtaining phenotype and outcome data from EHRs

Josh Denny, MD MS

Vanderbilt University Medical Center

3/26/2018

[Figure: EHR data from Vanderbilt biobank; annotations: BioVU start, Vanderbilt biobank enrollment]

EHR data are dense and efficient for discovery:

Vanderbilt’s experience (BioVU)

eMERGE Goals:

• To perform genomic studies using the EHR

• To implement genomic medicine

Making text documents useful for research

Clinical notes, test reports, etc

chief_complaint:

Shortness of Breath

history_present_illness:

Congestive Heart Failure

Type 2 diabetes, negated

mother_medical_history:

rheumatoid arthritis

Structured Output

certainty (positive, negated)

Who experienced it? (patient or family member?)

Structured Output

DrugName: atenolol

Strength: 50 mg

Frequency: daily

Research (de-identified note)

CC: SOB

HPI: Mr. **jones** is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

EHR (original note)

CC: SOB

HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… on atenolol 50mg daily… Mother had RA.

Medication extraction

Find biomedical concepts and qualifiers; create structured data

Customized classifiers (smoking status, etc)

Billing codes

Deidentify: remove HIPAA identifiers + ….
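To make the note-to-structured-output step concrete, here is a minimal sketch in Python. It is not the NLP system used at Vanderbilt: the tiny lexicons, the regular expression, and the function names are all assumptions chosen for illustration, and real clinical NLP needs far larger vocabularies and more careful negation handling. It only shows how the example note could yield the structured fields listed on the slide (concept, certainty, experiencer; DrugName, Strength, Frequency).

```python
import re

# Hypothetical, deliberately tiny lexicons (assumptions for illustration only).
CONCEPTS = {
    "chf": "Congestive Heart Failure",
    "dm2": "Type 2 diabetes",
    "ra": "rheumatoid arthritis",
}
NEGATION_CUES = ("no", "denies", "without")
FAMILY_CUES = ("mother", "father", "sister", "brother")

# Matches phrases such as "on atenolol 50mg daily".
MED_PATTERN = re.compile(
    r"\bon\s+(?P<drug>[a-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)"
    r"\s+(?P<freq>daily|bid|tid|qid)\b",
    re.IGNORECASE,
)


def extract_medications(note):
    """Return one dict per medication mention found in the note."""
    return [
        {
            "DrugName": m.group("drug").lower(),
            "Strength": f"{m.group('dose')} {m.group('unit').lower()}",
            "Frequency": m.group("freq").lower(),
        }
        for m in MED_PATTERN.finditer(note)
    ]


def extract_concepts(note):
    """Find known concepts and attach certainty and experiencer qualifiers."""
    findings = []
    # Split into rough clauses on periods, commas, semicolons and ellipses.
    for clause in re.split(r"[.,;…]+", note):
        tokens = re.findall(r"[a-z0-9/]+", clause.lower())
        for i, token in enumerate(tokens):
            if token not in CONCEPTS:
                continue
            preceding = tokens[:i]
            negated = any(t in NEGATION_CUES for t in preceding)
            family = any(t in FAMILY_CUES for t in preceding)
            findings.append(
                {
                    "concept": CONCEPTS[token],
                    "certainty": "negated" if negated else "positive",
                    "experiencer": "family member" if family else "patient",
                }
            )
    return findings


NOTE = (
    "CC: SOB HPI: Mr. Smith is a 65yo w/ h/o CHF, … no dm2… "
    "on atenolol 50mg daily… Mother had RA."
)

if __name__ == "__main__":
    print(extract_medications(NOTE))
    for finding in extract_concepts(NOTE):
        print(finding)
```

The clause-level cue lookup is the simplest possible design; it treats any negation or family cue earlier in the same clause as applying to the concept.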

Doesn’t have hypertension

Has hypertension

Finding a “simple” disease in the EHR: Who has hypertension?

Definition: SBP > 140 or DBP > 90

Patient 1

Patient 2

Our “simple” example: Hypertension

Multiple components are better (and blood pressure is the worst)

Teixeira, JAMIA 2016
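The point that multiple components outperform blood pressure alone can be sketched as a simple voting rule. This is not the algorithm from the cited paper: the "at least two of three components" rule, the "two or more codes / two or more elevated readings" thresholds, the medication list, and all names below are assumptions for illustration. The blood pressure cut-offs are the ones given on the slide (SBP > 140 or DBP > 90), and 401.x is the ICD-9 family for essential hypertension.

```python
from dataclasses import dataclass, field

HTN_ICD9_PREFIX = "401"  # ICD-9 essential hypertension codes start with 401
# Hypothetical short list of antihypertensive drugs (assumption).
HTN_MEDS = {"lisinopril", "amlodipine", "hydrochlorothiazide", "losartan"}


@dataclass
class PatientRecord:
    icd9_codes: list = field(default_factory=list)
    medications: list = field(default_factory=list)
    bp_readings: list = field(default_factory=list)  # (systolic, diastolic) pairs


def component_flags(rec):
    """Evaluate each evidence source separately."""
    n_codes = sum(code.startswith(HTN_ICD9_PREFIX) for code in rec.icd9_codes)
    n_elevated = sum(sbp > 140 or dbp > 90 for sbp, dbp in rec.bp_readings)
    return {
        "codes": n_codes >= 2,
        "meds": any(m.lower() in HTN_MEDS for m in rec.medications),
        "bp": n_elevated >= 2,
    }


def has_hypertension(rec, min_components=2):
    """Call a case when at least min_components evidence sources agree."""
    return sum(component_flags(rec).values()) >= min_components


if __name__ == "__main__":
    treated = PatientRecord(
        icd9_codes=["401.9", "401.9"],
        medications=["Lisinopril"],
        bp_readings=[(128, 78), (132, 80)],
    )
    one_off = PatientRecord(bp_readings=[(150, 95), (118, 76)])
    print(has_hypertension(treated), has_hypertension(one_off))
```

The two example patients illustrate why blood pressure is a weak single criterion: a treated patient has normal readings, while a single elevated reading does not establish the diagnosis.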

Algorithm Development and Implementation

Clinical Notes (NLP - natural language processing)

Billing codes: ICD9 & CPT

Medications: ePrescribing & NLP

Labs & test results: NLP

What we learned - Finding phenotypes in the EHR

• Identify phenotype of interest

• Case & control algorithm development and refinement

• Manual review; assess precision (true cases)

• Precision <95%: return to algorithm refinement

• Precision ≥95%: deploy in BioVU

• Genetic association tests
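The manual-review gate in this workflow amounts to computing precision over a reviewed sample and comparing it with the 95% threshold. A minimal sketch follows; the function names and the example counts are hypothetical.

```python
def precision(review_results):
    """Fraction of algorithm-identified cases confirmed on manual chart review.

    review_results: list of booleans, True when the reviewer confirmed the case.
    """
    if not review_results:
        raise ValueError("no reviewed records")
    return sum(review_results) / len(review_results)


def next_step(review_results, threshold=0.95):
    """Deploy when precision meets the threshold, otherwise keep refining."""
    return "deploy" if precision(review_results) >= threshold else "refine"


if __name__ == "__main__":
    # Hypothetical review of 50 charts, 48 of them confirmed.
    sample = [True] * 48 + [False] * 2
    print(precision(sample), next_step(sample))
```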

Early discovery science in eMERGE – Hypothyroidism

Am J Hum Genet. 2011;89:529-42

Algorithms can be deployed across multiple EHRs

Analyses can be performed using extant data

GWAS of QRS Duration in eMERGE

SCN5A/SCN10A

n=5,272

Ritchie et al., Circulation 2013

What happens in the “heart healthy” population?

Examined the n=5,272 “heart healthy” population

Followed for development of atrial fibrillation based on genotype

[Figure: atrial fibrillation-free survival by years since normal ECG (and no heart disease), for genotypes GG, AG and AA; HR=1.49 per G allele, p=0.001]

Ritchie et al., Circulation 2013

EHRs for drug response: Clopidogrel adverse events associated with CYP2C19

From clinical trials: Mega et al., NEJM 2009 (normal metabolizers vs. carriers)

From the EHR: Delaney et al. Clin Pharm Ther. 2012 (N=807, P=0.005)

Deep learning for Diabetic Retinopathy

Train a machine learning algorithm over >128k images

Gulshan et al. JAMA 2016

Phenome scanning (PheWAS) in the EHR

A genetic variant → Associated phenotypes (using the curated EHR-based phenome)

A phenotype → Associated genotypes (using dense genomic information)

Replications of GWAS associations via PheWAS

[Figure: replication results shown separately for binary traits and continuous traits]

P-value for replication:

• All - 210/751: 2x10^-98

• Powered - 51/77: 3x10^-47

Nat Biotech 2013; 31:1102-1111
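A phenome scan tests one genetic variant against every phenotype in the curated phenome. The sketch below is a simplification under stated assumptions: it computes only a 2x2 odds ratio per phenotype, with the Haldane-Anscombe correction (0.5 added to each cell), using the standard library. Published PheWAS analyses typically use regression with covariates and correct for multiple testing, which this sketch omits. The function name, patient IDs and phenotype labels are hypothetical.

```python
def phewas_odds_ratios(carrier_status, phenotypes):
    """Odds ratio of each phenotype in variant carriers vs. non-carriers.

    carrier_status: {patient_id: bool}
    phenotypes:     {patient_id: set of phenotype labels}
    """
    all_labels = set().union(*phenotypes.values())
    results = {}
    for label in sorted(all_labels):
        # 2x2 table: a = carrier case, b = carrier control,
        #            c = non-carrier case, d = non-carrier control
        a = b = c = d = 0
        for pid, is_carrier in carrier_status.items():
            is_case = label in phenotypes.get(pid, set())
            if is_carrier and is_case:
                a += 1
            elif is_carrier:
                b += 1
            elif is_case:
                c += 1
            else:
                d += 1
        # Haldane-Anscombe correction avoids division by zero for empty cells.
        results[label] = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    return results


if __name__ == "__main__":
    carriers = {"p1": True, "p2": True, "p3": False, "p4": False}
    phenome = {"p1": {"AF"}, "p2": {"AF", "HTN"}, "p3": {"HTN"}, "p4": set()}
    print(phewas_odds_ratios(carriers, phenome))
```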

PheWAS across all HLA types (n=37,270)

Karnes et al, Sci Trans Med 2017

The potential for “call back” deeper phenotyping: Long QT genes (SCN5A and KCNH2) in 2,200 sequenced patients in eMERGE

• 83 rare variants (MAF < 1%) in SCN5A, 45 in KCNH2

• 121/128 MAF < 0.5%, 92 singletons

• Three labs assessed known/likely pathogenicity

Lab 1: 16/121

Lab 2: 24/121

Lab 3: 17/121

Overlap (figure centre): 4

Van Driest et al, JAMA 2016

Calculating a Phenotype Risk Score (PheRS)

OMIM features 1, 2, ..., k → Human Phenotype Ontology → EHR phenotypes

For each record i, generate a PheRS:

\[ \mathrm{PheRS}_i = \sum_{j=1}^{k} x_{ij}\,\omega_j \]

• PheRS_i: score for subject i

• The sum adds up terms for the k phenotypes

• x_{ij}: 0 = phenotype j absent, 1 = phenotype j present

• ω_j: weight for phenotype j, derived from the entire EHR

Repeat this for all Mendelian diseases

Bastarache et al, Science 2018
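The score is a weighted count of the disease's features found in a patient's record. The sketch below assumes the weight for each phenotype is the log of its inverse prevalence in the whole EHR population, so that rarer phenotypes count more; the slide states only that weights are derived from the entire EHR, so treat that choice, along with the function names and toy data, as assumptions.

```python
import math


def phenotype_weights(records, features):
    """Weight each feature by log(N / n_with_feature) across all records.

    records:  {patient_id: set of EHR phenotypes}
    features: phenotypes mapped from the Mendelian disease description
    Features never observed in the population get no weight.
    """
    n_total = len(records)
    weights = {}
    for feature in features:
        n_with = sum(1 for phenos in records.values() if feature in phenos)
        if n_with:
            weights[feature] = math.log(n_total / n_with)
    return weights


def phers(patient_phenotypes, weights):
    """Sum the weights of the features present in one patient's record."""
    return sum(w for feature, w in weights.items() if feature in patient_phenotypes)


if __name__ == "__main__":
    ehr = {
        "p1": {"bronchiectasis", "pneumonia"},
        "p2": {"pneumonia"},
        "p3": {"asthma"},
        "p4": set(),
    }
    w = phenotype_weights(ehr, ["bronchiectasis", "pneumonia"])
    for pid, phenos in ehr.items():
        print(pid, round(phers(phenos, w), 2))
```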

Example: a phenotype risk score in Cystic Fibrosis

                       CF cases              CF controls
Age/Sex                18F  26M  29F  29M    18F  26M  29F  29M
Phenotype Risk Score   9.8  4.4  6.3  7.8    2.5  0.7  0.0  0.7

Phenotype rows in the figure (per-patient values not recoverable from the extracted text): Chronic airway obstruction; Pneumonia; Diseases of pancreas; Hypovolemia; Acute upper respiratory infections; Asthma; Bronchiectasis; Intestinal malabsorption; Hepatomegaly; Acute pulmonary heart disease


PheRS identified potentially pathogenic SNVs

N=21k on exome chip

6k SNVs


The All of Us Research Program – Breaking Down Data Silos


Overview of the All of Us approach and protocol

Participants: Direct Volunteers; Health Care Provider Organizations

Data types: Health Surveys; EHR data; Baseline measurements; Biospecimens; Smartphones & Wearables

Multiple data types linked together by semantic standards

From Healthcare Provider Orgs

All of Us will aggregate data from many sources

Meds, Billing codes, Labs

Version 1 (2018)

Clinical Notes & Reports

Clinical Messaging

Version 2

Raw Data Repository

Data added centrally by DRC

Death Index

Claims & Rx Data

…

Visits

Local Registries

Much longer term

Images

From Direct Volunteers

Sync for Science

Participant provided data (Health surveys, activity monitors, etc)

Participant exams and biospecimens

Curated Data Repository

APIs, Analysis tools, etc

Geospatial data

Health data aggregators (PicnicHealth)


Sync 4 Science (S4S) – a technology to share health data

S4S Pilot Sites

S4S:

- FHIR-based

- Starting with MU Common Clinical Data Set

Data Access is centralized in All of Us

Traditional Approach: Bring data to researchers (download from public repository)

Problems

• Data sharing = data copying

• Security (data handoffs)

• Huge infrastructure needed

• Siloed compute

AoU Approach: Bring researchers to the data (public cloud)

Advantages

• Cost

• Threat detection and auditing

• Increased Accessibility

• Shared compute

The power of a data biosphere of common semantics and APIs

Obtaining phenotype and outcome data from e-health records and digital platforms: the experience of UK Biobank

Cathie Sudlow

Professor of Neurology and Clinical Epidemiology

Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh

Director of Health Data Research UK Scottish substantive site

Chief Scientist, UK Biobank

International Cohorts Summit,

Durham, North Carolina

March 2018

UK Biobank in a nutshell

• 500,000 UK men and women aged 40-69 years when recruited during 2006-2010

• Consent for all types of health research by both academic and commercial researchers

• Extensive baseline questions and physical measures, with biological samples stored for future assays

• Subsequent enhancements in all or large subsets of participants:

– Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)

– Sample assays in all or large subsets:

Complete: genome-wide genotyping; biochemistry panel

Underway/planned: exome and whole genome sequencing; proteomics; infectious disease assays; stool microbiome

– Multimodal imaging of 100,000 (>22,000 so far)

– Web questionnaires

• Comprehensive, long term follow-up for a wide range of health-related outcomes

• Open access for approved research: see www.ukbiobank.ac.uk

Aim: identify a wide range of incident diseases and other health related outcomes

Active methods requiring participant re-engagement

• face to face reassessment

• postal or web-based surveys

• expensive

• prone to incomplete coverage & selective loss to follow-up

• miss cases emerging between assessments

Passive methods via linkages to national health records

• can follow all participants without need for re-engagement

• efficient and cost effective

• need adequate consent at recruitment

• rely on universal healthcare system & availability of relevant datasets

• can only detect cases of disease diagnosed in a healthcare setting

• data need to be accurate and sufficiently detailed for research studies

Follow-up of participants in very large prospective cohorts

Web questionnaires

• Using email and web questionnaires

– for more detailed assessment of exposures

– and to obtain information on outcomes that cannot be obtained through linking to health records

• Of 350,000 with email, >150,000 complete each questionnaire

– Details of dietary intake

– Cognitive function

– Mental health (thoughts and feelings)

– Gastrointestinal symptoms

Useful for following change over time…but beware selective attrition

Following the health of 0.5 million UK Biobank participants through linking to National Health Service (NHS) records

Scotland: 36,000 participants

England: 446,000 participants

Wales: 21,000 participants

Regularly updated information on a wide range of diseases from NHS datasets in all three countries:

• Deaths - date and cause of death

– for all participants

– >14,000 by early 2016

• Cancers – date, stage and grade of cancer

– >79,000 cancer cases by late 2015

• Admissions to hospital – dates, diagnoses, procedures

– 1000’s of cases of many incident diseases

• Primary care data – dates, diagnoses, symptoms, signs, referrals, prescriptions, labs etc

– for half of the participants

– 1000’s more cases of many incident diseases

Maximising the value of the linked healthcare data

• Messy ‘real world data’ - not collected primarily for research

• Not 100% accurate due to administrative and clinical error

• Mainly structured, coded datasets (ICD, OPCS4, Read…)

• Experts advising in a range of disease areas:

• Combine different linked data sources to create algorithmically derived disease status indicators

• Estimate the accuracy and completeness of these

• Consider limitations and potential additional sources of unstructured data
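An algorithmically derived disease status indicator can be as simple as "any matching code in any linked source, dated at the earliest occurrence". The sketch below illustrates that idea only; it is not UK Biobank's actual algorithm, and the event layout, source names and function name are assumptions.

```python
from datetime import date


def derive_disease_status(events, code_prefixes):
    """Combine coded events from several linked sources into one indicator.

    events: list of dicts with keys 'source', 'code' and 'date'
    code_prefixes: ICD-10 prefixes that define the disease
    """
    matches = [e for e in events if e["code"].startswith(tuple(code_prefixes))]
    if not matches:
        return {"case": False, "first_date": None, "sources": []}
    earliest = min(matches, key=lambda e: e["date"])
    return {
        "case": True,
        "first_date": earliest["date"],
        # Keeping the contributing sources lets accuracy be assessed per source.
        "sources": sorted({e["source"] for e in matches}),
    }


if __name__ == "__main__":
    participant_events = [
        {"source": "hospital", "code": "E11.9", "date": date(2012, 6, 1)},
        {"source": "hospital", "code": "I63.9", "date": date(2014, 3, 2)},
        {"source": "death", "code": "I63.9", "date": date(2015, 1, 10)},
    ]
    print(derive_disease_status(participant_events, ["I63"]))
```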

Disease areas: Cancer; Diabetes; Cardiac diseases; Stroke; Mental health disorders; Eye diseases; Neurodegenerative diseases; Chest diseases; Musculoskeletal conditions; Infections; Kidney diseases

Cancers in UK Biobank ascertained from the national cancer registries

                    Observed                            Predicted
                    By recruitment   Incident by 2015   Incident by 2022
Breast cancer       9,000            4,200              10,000
Colorectal cancer   2,300            2,500              7,000
Prostate cancer     3,000            4,300              9,000

→ Date, stage and grade of cancer

Beyond the structured registry data…exploring feasibility of retrieving additional

information for subtyping of identified cancer cases through regional linkages to:

• histopathology reports

• digitised histopathology slides

• tumour specimens

Exemplar non-cancer conditions in UK Biobank ascertained from baseline self report, hospital admissions and death registries

                        Observed                            Estimated effects of including primary care data
                        By recruitment   Incident by 2016   Incident by 2016
Myocardial infarction   12,000           7,400              8,100
Stroke                  8,000            4,600              6,900
Diabetes                26,000           9,000              18,000
COPD                    10,000           7,600              16,900
Asthma                  60,000           5,700              19,000
Dementia                200              1,800              3,600

Accuracy?

Limitations?

Dementia: positive predictive value of routine healthcare data

From published studies

[Figure: PPV (scale 0 to 1) by data source: Mortality; Hospital inpatient; Hospital in- & outpatient; Insurance; Outpatients; Primary care]

Wide variation but in most PPV >80%

[Figure: PPV (scale 0 to 1) for Alzheimer’s disease and vascular dementia]

PPV for AD generally higher than for vascular dementia

Dementia: positive predictive value of routine healthcare data

From comparison with expert review of free text electronic medical record in UK Biobank

Dementia: 80%

Alzheimer’s disease: 72%

Vascular dementia: 44%
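A positive predictive value from a validation sample is more informative when reported with an interval. The sketch below computes PPV with a Wilson score interval using only the standard library. The slide gives percentages without sample sizes, so the counts in the usage example are hypothetical.

```python
import math


def ppv_with_wilson_ci(confirmed, flagged, z=1.96):
    """PPV = confirmed / flagged, with a Wilson score interval (95% by default).

    confirmed: coded cases confirmed by expert review
    flagged:   all coded cases that were reviewed
    """
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    p = confirmed / flagged
    denom = 1 + z**2 / flagged
    centre = (p + z**2 / (2 * flagged)) / denom
    half = z * math.sqrt(p * (1 - p) / flagged + z**2 / (4 * flagged**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical counts: 80 of 100 reviewed dementia codes confirmed.
    ppv, low, high = ppv_with_wilson_ci(80, 100)
    print(f"PPV {ppv:.2f} (95% CI {low:.2f} to {high:.2f})")
```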

Beyond the linked coded healthcare data

Obtaining these data at national scale is challenging

To extract value from these data on 1000’s of outcomes across multiple diseases, we need scalable approaches: crowd sourcing, natural language processing, machine learning, artificial intelligence…

Structured, coded data from linked national healthcare datasets:

• Can ascertain cases of a wide range of diseases with acceptable accuracy

• Capture only 10-20% of the information from electronic medical records

• Are limited for detailed sub-phenotyping of disease

Deeper phenotyping of disease will require multiple unstructured data sources, including:

• Free text of electronic records

• Complex electrical signalling data (ECGs, EEGs etc)

• Histopathology slide sets

• Clinical imaging data

Acknowledgements

Biochemistry analyses in all 500,000 participants

Cardiovascular: Cholesterol; Direct LDL-c; HDL-c; Triglyceride; ApoA; ApoB; Lp(a); CRP

Cancer: SHBG; Testosterone; Oestradiol; IGF-I

Bone and joint: Vitamin D; Rheumatoid factor; Alkaline Phosphatase; Calcium

Liver: Albumin; Direct Bilirubin; Total Bilirubin; GGT; ALT; AST

Diabetes: HbA1c; Glucose

Renal: Creatinine; Cystatin C; Total protein; Urea; Phosphate; Urate

Urinary: Creatinine; Sodium; Potassium; Albumin

Note: Haematological assays were conducted during recruitment phase

500,000 participants

22 recruitment centres

89% England

7% Scotland

4% Wales

Industrial scale processes: samples during recruitment

• 700 participants per day

• 4,900 sample tubes per day

• 25,000 aliquots produced per day

15 million 0.85ml aliquots

• Blood: whole blood, serum, plasma, red cells, buffy coat

• Urine

• Saliva

Total > 15 million aliquots

Expected disease cases during follow-up

Condition 2012 2017 2022

Diabetes 10,000 25,000 40,000

Heart attack 7,000 17,000 28,000

Stroke 2,000 5,000 9,000

Chronic obstructive lung disease 3,000 8,000 14,000

Breast cancer 2,500 6,000 10,000

Colorectal cancer 1,500 3,500 7,000

Prostate cancer 1,500 3,500 7,000

Hip fracture 1,000 2,500 6,000

Alzheimer’s 1,000 3,000 9,000

Data from portable wearable devices

Accelerometry data: 100,000 participants

Continuous ECG monitoring: 20,000 + participants

Prospective design and large size enable reasonably well-powered studies of (causal) associations between accelerometry and cardiac rhythm measures and later onset disease

Sample analyses

• Genome-wide genotyping of all participants

• Standard panel of assays (e.g. lipids; hormones; metabolic) on samples from all participants

• Exome & whole genome sequencing, proteomics, metabolomics, infectious disease assays, stool microbiome…all underway/planned

Multimodal imaging of 100,000 participants

>22,000 imaged so far

Prospective design and large size enable well-powered studies of (causal) associations between structure and function of organs and later onset disease…but…need scalable methods of analysing complex data to derive measures for large scale analyses

Obtaining phenotype and outcome data from health records: China Kadoorie Biobank experience

Zhengming CHEN

CKB Principal Investigator

Professor of Epidemiology

Nuffield Dept. of Population Health

University of Oxford, UK

International Cohorts Summit, Duke University, USA,

26-27 March 2018

China Kadoorie Biobank (CKB)

• >512K recruited from 10 localities in 2004-08

• Participants interviewed, measured, and gave plasma and DNA (urine) for long-term storage

• All followed up indefinitely via electronic record linkage to deaths and ALL hospital episodes

• Periodic resurvey of 5% surviving participants (for enhancements and sources of variation)

• Informed consent for linkages to health records and unspecified research use of stored samples

CKB: Clinical stations at local assessment centre

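Registration, informed consent, distribution of field survey packs

Questionnaire (including general questionnaire, COPD questionnaire, CIDI questionnaire, and completeness check I)

Completion

· Completeness check II – check for any missed items

· Distribute physical examination results report

Physical examination: body fat; hand grip strength; height, waist and hip circumference

Cardiovascular examination: blood pressure; pulse wave velocity

Blood and urine sample collection and testing: blood collection and blood tests; urine collection and urine tests

Physiological examination: lung function; carbon monoxide concentration

Other examinations: carotid ultrasound; heel bone density; ECG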

1. Registration & consent

2. Physical exam.

3. Physical exam.

4. Physical exam.

5. Sample collection

7. Questionnaire (with recording)

8. Physical exam.

9. Clinic consultation

The clinic visit took 60-90 minutes, with daily statistical monitoring

Recruitment rate: 7~800 per day

CKB: Supported by >90 bespoke IT systems

CKB: Fully established with 10-year follow-up

Questionnaire: SES, smoking, alcohol, tea, diet, physical activity, indoor air pollution, sleep, reproductive patterns, medical history

Measurements: Blood pressure, height, weight, lung function, heart rate, bone density, exhaled CO, ECG, cIMT, ambient temperature, ambient air pollution, blood lipids, metabolites, proteomics, infectious markers, genetics

Electronic health records: >1,300 different diseases, 43K deaths, <5K lost to follow-up, ~0.9 million hospitalizations, >100 million chargeable items

www.ckbiobank.org

Data is growing rapidly


CKB: Follow-up through record linkages

Outcome follow-up in CKB:

• Active follow-up

• Disease registries

• National health insurance

• Death registries

National health insurance system in China (supplementing death and cancer registries)

• Introduced during 2004-6, with almost universal coverage in CKB areas by 2010

• Multiple disease diagnoses, with ICD-10 codes plus disease descriptions and >2,500 procedure codes

• Managed electronically at city level, with detailed chargeable items for reimbursement purposes

• Lacks certain details (e.g. cancer pathology) required for disease sub-phenotyping

Nearly all CKB participants now linked to the health insurance databases via unique national ID number
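Linkage through a unique identifier is a deterministic join. The sketch below shows the general shape of such a join, not CKB's actual systems: the field names, the function name and the decision to strip the identifier from the linked research copy are assumptions for illustration.

```python
def link_by_national_id(participants, claims):
    """Attach insurance claims to study participants via a shared identifier.

    participants: {study_id: national_id}
    claims: list of dicts, each with a 'national_id' key plus claim fields
    Returns (linked, unmatched): claims grouped by study_id with the
    identifier removed, and claims that matched no participant.
    """
    id_to_study = {national_id: study_id for study_id, national_id in participants.items()}
    linked = {study_id: [] for study_id in participants}
    unmatched = []
    for claim in claims:
        study_id = id_to_study.get(claim["national_id"])
        if study_id is None:
            unmatched.append(claim)
            continue
        # Keep only the claim content in the research copy.
        research_copy = {k: v for k, v in claim.items() if k != "national_id"}
        linked[study_id].append(research_copy)
    return linked, unmatched


if __name__ == "__main__":
    cohort = {"S001": "ID-A", "S002": "ID-B"}
    insurance = [
        {"national_id": "ID-A", "icd10": "I63.9", "year": 2012},
        {"national_id": "ID-Z", "icd10": "E11.9", "year": 2013},
    ]
    print(link_by_national_id(cohort, insurance))
```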

CKB: Participants with selected diseases in 10 years (43K deaths, 0.9M hospital admissions; 2017 HI data are being processed)

Disease (ICD-10 codes): number of participants
Ischaemic Stroke (I63): 46063
Diabetes (E10-E14): 32468
Cancer (C00-C97): 28274
COPD (J41-J44): 19796
Fractures (S02, S12, S22, S32, S42, S52, S62, S72, S82, S92): 15539
Cataract (H25-H28): 14277
Angina (I20): 13882
Haemorrhagic Stroke (I61): 11037
MI (I21-I23): 8035
Arrhythmia (I47-I49): 6270
Chronic liver disease (K70-K77, B18-B19, I85, Z22.5): 5904
Pulmonary heart disease (I26-I27): 5489
Heart failure (I50): 4632
Anxiety disorders (F40-F48): 4002
Tuberculosis (A15-A19, B90, J65): 3188
Asthma (J45-J46): 3169
CKD (N02-N03, N07, N11, N18): 2748
Osteoporosis (M80-M81): 2664
Coronary revascularisation: 2273
Rheumatoid arthritis (M05-M06, M45): 1730
Schizophrenia (F20-F29): 1316
Depression (F30-F39): 1192
Retinopathy (E10-4.3,H36.0): 1104
SAH (I60, I69.0): 1098
Parkinson disease (G20-G21): 822
(Venomous) snake bite (T63.0, X20): 425
Victim of earthquake (X34): 54
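Counts like these come from grouping ICD-10 codes into disease categories and counting distinct participants per category. The sketch below uses a handful of the code ranges listed on the slide; the function names and the input layout are assumptions, and it handles only three-character category ranges (not sub-code entries such as I69.0).

```python
from collections import defaultdict

# A few of the groupings from the slide, as (first, last) three-character categories.
DISEASE_GROUPS = {
    "Ischaemic stroke": [("I63", "I63")],
    "Haemorrhagic stroke": [("I61", "I61")],
    "Diabetes": [("E10", "E14")],
    "COPD": [("J41", "J44")],
    "MI": [("I21", "I23")],
}


def classify_icd10(code):
    """Return the disease group for an ICD-10 code, or None if not in any group."""
    category = code.strip().upper().replace(".", "")[:3]
    for name, ranges in DISEASE_GROUPS.items():
        # Letter plus two digits compare correctly as fixed-width strings.
        if any(first <= category <= last for first, last in ranges):
            return name
    return None


def participants_per_group(diagnoses):
    """Count distinct participants per group from (participant_id, code) pairs."""
    seen = defaultdict(set)
    for participant_id, code in diagnoses:
        group = classify_icd10(code)
        if group is not None:
            seen[group].add(participant_id)
    return {group: len(ids) for group, ids in seen.items()}


if __name__ == "__main__":
    episodes = [("p1", "I63.9"), ("p1", "I63.0"), ("p2", "E11.2"), ("p3", "I64")]
    print(participants_per_group(episodes))
```

Counting distinct participants rather than episodes matters here, since one participant can have many hospital admissions for the same disease.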

[Figure: panels for haemorrhagic stroke and NAFLD; x-axis: waist circumference (cm)]

CKB: Procedures for improving disease phenotyping

Pilot study of ~1000 cases for specific disease before deciding whether to undertake systematic adjudication

CKB: Disease standardisation and coding tool

CKB: Verifying reported diagnosis

CKB: Adjudicating & phenotyping major diseases (>70K adjudicated: 30K stroke, 25K IHD, 15K cancer, >3K CKD)

CKB: “traffic light” approach for outcome data

Future work for disease phenotyping

• Standardising and ICD-10 coding new events collected

• Processing and incorporating >100M chargeable items data to enhance disease phenotyping

• Extending outcome adjudication to several other diseases (e.g. heart failure, chronic liver disease)

• Developing automated algorithm to sub-phenotype stroke and other diseases according to clinical criteria

• Piloting collection of discharge summary pages and tumour tissue samples

CKB: Open data access platform (www.ckbiobank.org)
