Obtaining phenotype and outcome data from EHRs
Josh Denny, MD MS
Vanderbilt University Medical Center
3/26/2018
[Figure: EHR data from the Vanderbilt biobank; annotations: BioVU start, Vanderbilt biobank enrollment]
EHR data are dense and efficient for discovery:
Vanderbilt’s experience (BioVU)
eMERGE Goals:
• To perform genomic studies using the EHR
• To implement genomic medicine
Making text documents useful for research
Clinical notes, test reports, etc
chief_complaint:
Shortness of Breath
history_present_illness:
Congestive Heart Failure
Type 2 diabetes, negated
mother_medical_history:
rheumatoid arthritis
Structured Output
Certainty (positive, negated)
Who experienced it? (patient or family member?)
Structured Output
DrugName: atenolol
Strength: 50 mg
Frequency: daily
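As a rough illustration of the medication-extraction step, the sketch below turns a note fragment such as "on atenolol 50mg daily" into the DrugName / Strength / Frequency fields shown above. The regular expression, the tiny lexicons, and the function name `extract_medications` are assumptions made for illustration; they are not the tools used in the talk, and real systems rely on full drug vocabularies and richer NLP.

```python
import re

# Hypothetical mini-lexicons; a real system would use a full drug vocabulary.
DRUGS = ["atenolol", "metformin", "lisinopril"]
FREQUENCIES = ["daily", "twice daily", "bid", "weekly"]

# Longest frequency phrases first so "twice daily" wins over "daily".
PATTERN = re.compile(
    r"\b(?P<drug>" + "|".join(DRUGS) + r")\s+"
    r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\s+"
    r"(?P<freq>" + "|".join(sorted(FREQUENCIES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def extract_medications(note: str) -> list[dict]:
    """Return one structured record per medication mention found in the note."""
    records = []
    for m in PATTERN.finditer(note):
        records.append({
            "DrugName": m.group("drug").lower(),
            "Strength": f"{m.group('amount')} {m.group('unit').lower()}",
            "Frequency": m.group("freq").lower(),
        })
    return records

note = "HPI: 65yo w/ h/o CHF, no dm2, on atenolol 50mg daily. Mother had RA."
print(extract_medications(note))
```

A pattern like this only captures mentions that follow the drug, strength, frequency order; negation and family-history attribution need separate handling.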
Research
EHR
CC: SOB
HPI: Mr. **jones** is a
65yo w/ h/o CHF, … no
dm2…
on atenolol 50mg daily…
Mother had RA.
CC: SOB
HPI: Mr. Smith is a 65yo
w/ h/o CHF, … no dm2…
on atenolol 50mg daily…
Mother had RA.
Medication
extraction
Find biomedical concepts and
qualifiers; create structured data
Customized classifiers
(smoking status, etc)
Billing
codes
Deidentify: remove HIPAA
identifiers + ….
Doesn’t have hypertension
Has hypertension
Finding a “simple” disease in the EHR: Who has hypertension?
Definition: SBP > 140 or DBP > 90
Patient 1
Patient 2
Our “simple” example: Hypertension
Multiple components are better (and blood pressure is the worst)
Teixeira, JAMIA 2016
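The blood-pressure definition (SBP > 140 or DBP > 90) can be written as a one-line rule. The sketch below assumes readings are (SBP, DBP) pairs in mmHg; the function names and the two example patients are hypothetical. It shows how the same rule gives different answers from visit to visit, which may be part of why blood pressure alone performs poorly as a phenotype.

```python
def is_hypertensive_reading(sbp: float, dbp: float) -> bool:
    """Apply the slide's definition: SBP > 140 or DBP > 90 (mmHg)."""
    return sbp > 140 or dbp > 90

def fraction_elevated(readings: list[tuple[float, float]]) -> float:
    """Share of a patient's readings that meet the definition."""
    if not readings:
        return 0.0
    hits = sum(is_hypertensive_reading(s, d) for s, d in readings)
    return hits / len(readings)

# Hypothetical patients with four visits each.
patient_1 = [(150, 95), (138, 85), (145, 88), (132, 80)]
patient_2 = [(118, 76), (142, 84), (121, 79), (125, 80)]
print(fraction_elevated(patient_1))
print(fraction_elevated(patient_2))
```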
Algorithm Development and Implementation
Clinical Notes
(NLP - natural language
processing)
Billing codes
ICD9 & CPT
Medications
ePrescribing & NLP
Labs & test results
NLP
What we learned - Finding phenotypes in the EHR
True cases
Identify
phenotype
of interest
Case & control
algorithm
development
and refinement
Manual review;
assess
precision
Deploy
in BioVU
Genetic association tests
≥95%
<95%
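The review loop above can be sketched as a precision check against the 95% threshold. The counts and function names below are hypothetical; the slide states only that algorithms are refined until manual review shows precision of at least 95%.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision (positive predictive value) from manual chart review counts."""
    reviewed = true_positives + false_positives
    if reviewed == 0:
        raise ValueError("no reviewed cases")
    return true_positives / reviewed

def next_step(prec: float, threshold: float = 0.95) -> str:
    """Deploy at or above the threshold; otherwise go back and refine."""
    return "deploy" if prec >= threshold else "refine algorithm"

# Hypothetical review: 100 algorithm-identified cases, 92 confirmed by reviewers.
p = precision(true_positives=92, false_positives=8)
print(p, next_step(p))
```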
Early discovery science in eMERGE – Hypothyroidism
Am J Hum Genet. 2011;89:529-42
Algorithms can
be deployed
across
multiple EHRs
Analyses can
be performed
using extant
data
GWAS of QRS Duration in eMERGE
SCN5A/SCN10A
n=5,272
Ritchie et al., Circulation 2013
What happens in the “heart healthy” population?
Examined the n=5272
“heart healthy”
population
Followed for
development of atrial
fibrillation based on
genotype
Years since normal ECG
(and no heart disease)
Atrial fibrillation-free survival
HR=1.49 per G allele, p=0.001
Genotype curves: GG, AG, AA
Ritchie et al., Circulation 2013
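To read the per-allele hazard ratio: if each G allele multiplies the hazard by the same factor (a log-additive model, which the "per G allele" wording suggests but which is an assumption here), the hazard ratio relative to AA is 1.49 raised to the number of G alleles. The helper name below is hypothetical.

```python
HR_PER_G_ALLELE = 1.49  # value reported on the slide

def genotype_hazard_ratio(genotype: str, hr_per_allele: float = HR_PER_G_ALLELE) -> float:
    """Hazard ratio versus AA, assuming each G allele multiplies the hazard equally."""
    return hr_per_allele ** genotype.upper().count("G")

for g in ("AA", "AG", "GG"):
    print(g, round(genotype_hazard_ratio(g), 2))
```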
Mega et al., NEJM 2009
From clinical trials
Normal
metabolizers
Carriers
From the EHR
Delaney et al. Clin Pharm Ther. 2012
N=807, P=0.005
EHRs for drug response: Clopidogrel adverse events associated with CYP2C19
Deep learning for Diabetic Retinopathy
Train a machine
learning algorithm
over >128k images
Gulshan et al. JAMA 2016
Phenome scanning (PheWAS) in the EHR
A genetic variant
Associated phenotypes
The curated EHR-based phenome
A phenotype
Associated genotypes
Dense genomic information
Replications of
GWAS
associations
via PheWAS
Binary traits
Continuous traits
P-value for replication:
• All - 210/751: 2x10^-98
• Powered - 51/77: 3x10^-47
Nat Biotech 2013; 31:1102-1111
PheWAS across all HLA types (n=37,270)
Karnes et al, Sci Trans Med 2017
The potential for “call back” deeper phenotyping:
Long QT genes (SCN5A and KCNH2) in 2,200 sequenced patients in eMERGE
• 83 rare (MAF < 1%) in SCN5A, 45 in KCNH2
• 121/128 MAF < 0.5%, 92 singletons
• Three labs assessed known/likely pathogenicity
Lab 1: 16/121
Lab 2: 24/121
Lab 3: 17/121
4
Van Driest et al, JAMA 2016
Calculating a Phenotype Risk Score (PheRS)
OMIM feature 1, OMIM feature 2, ..., OMIM feature k
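For each record i, generate PheRS:
$$\mathrm{PheRS}_i = \sum_{j=1}^{k} x_{ij}\,\omega_j$$
• $\mathrm{PheRS}_i$: score for subject i
• The sum adds up terms for k phenotypes
• $x_{ij}$: 0 = phenotype j absent, 1 = phenotype j present
• $\omega_j$: weight for phenotype j, derived from entire EHR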
Human Phenotype Ontology
EHR phenotypes
Repeat this for all Mendelian diseases
Bastarache et al, Science 2018
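A minimal sketch of the score described above: sum the weights of the disease's k features that appear in a record. The slide says only that weights are derived from the entire EHR; the choice below (negative natural log of each phenotype's prevalence, so rarer phenotypes count more) is an assumption for illustration, as are the function names and the toy cohort.

```python
import math

def prevalence_weights(records: list[set[str]]) -> dict[str, float]:
    """Weight each phenotype by -ln(prevalence) across all records (an assumed choice)."""
    n = len(records)
    counts: dict[str, int] = {}
    for phenotypes in records:
        for p in phenotypes:
            counts[p] = counts.get(p, 0) + 1
    return {p: -math.log(c / n) for p, c in counts.items()}

def phers(patient: set[str], disease_features: list[str], weights: dict[str, float]) -> float:
    """Sum the weights of the disease's k features that are present in this record."""
    return sum(weights.get(f, 0.0) for f in disease_features if f in patient)

# Hypothetical toy cohort of 10 records.
cohort = ([{"pneumonia"}] * 4 + [{"asthma"}] * 3
          + [{"pneumonia", "bronchiectasis"}] + [set()] * 2)
weights = prevalence_weights(cohort)
cf_features = ["pneumonia", "bronchiectasis", "intestinal malabsorption"]
print(round(phers({"pneumonia", "bronchiectasis"}, cf_features, weights), 3))
```

Repeating the calculation with a different feature list per disease gives one score per Mendelian disease for each record.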
CF cases vs. CF controls
Age/Sex: 18F, 26M, 29F, 29M (cases); 18F, 26M, 29F, 29M (controls)
Phenotypes (figure rows): Chronic airway obstruction; Pneumonia; Diseases of pancreas; Hypovolemia; Acute upper respiratory infections; Asthma; Bronchiectasis; Intestinal malabsorption; Hepatomegaly; Acute pulmonary heart disease
Phenotype Risk Score: 9.8, 4.4, 6.3, 7.8 (cases); 2.5, 0.7, 0.0, 0.7 (controls)
Example: a phenotype risk score in Cystic Fibrosis
Bastarache et al, Science 2018
PheRS identified potentially pathogenic SNVs
N=21k on exome chip
6k SNVs
Bastarache et al, Science 2018
The All of Us Research Program – Breaking Down Data Silos
Overview of the All of Us approach and protocol
Direct Volunteers
Health Care Provider Organizations
Health Surveys
EHR data
Baseline measurements
Biospecimens
Smartphones & Wearables
Multiple data types linked together by semantic standards
From Healthcare Provider Orgs
All of Us will aggregate data from many sources
Meds, Billing codes, Labs
Version 1 (2018)
Clinical Notes & Reports
Clinical Messaging
Version 2
Raw Data Repository
Data added centrally by DRC
Death Index
Claims & Rx Data
…Visits
Local Registries
Much longer term
Images
From Direct Volunteers
Sync for Science
Participant provided data (Health surveys, activity monitors, etc)
Participant exams and biospecimens
Curated Data Repository
APIs, Analysis tools, etc
Geospatial data
Health data aggregators (PicnicHealth)
Sync 4 Science (S4S) – a technology to share health data
S4S Pilot
Sites
S4S:
- FHIR-based
- Starting with MU
Common Clinical Data set
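Because S4S is FHIR-based, data arrive as JSON resources. The sketch below flattens laboratory Observation resources from a FHIR Bundle into simple rows. The bundle is hand-written for illustration and the function name `observations` is hypothetical; no S4S endpoint, profile, or client library is assumed, only the general shape of FHIR Bundle and Observation JSON.

```python
import json

# Hypothetical, hand-written FHIR Bundle containing one laboratory Observation.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {
      "resourceType": "Observation",
      "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4",
                           "display": "Hemoglobin A1c"}]},
      "effectiveDateTime": "2018-01-15",
      "valueQuantity": {"value": 6.9, "unit": "%"}
    }}
  ]
}
"""

def observations(bundle: dict) -> list[dict]:
    """Flatten Observation resources in a Bundle into simple rows."""
    rows = []
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Observation":
            continue
        # Use the first coding if present; tolerate a missing or empty list.
        coding = (res.get("code", {}).get("coding") or [{}])[0]
        qty = res.get("valueQuantity", {})
        rows.append({
            "code": coding.get("code"),
            "display": coding.get("display"),
            "date": res.get("effectiveDateTime"),
            "value": qty.get("value"),
            "unit": qty.get("unit"),
        })
    return rows

print(observations(json.loads(bundle_json)))
```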
Data Access is centralized in All of Us
Traditional Approach: Bring data
to researchers
Problems
• Data sharing = data copying
• Security (data handoffs)
• Huge infrastructure needed
• Siloed compute
AoU Approach: Bring
researchers to the data
Data
Advantages
• Cost
• Threat detection and auditing
• Increased Accessibility
• Shared compute
Public Cloud
Download from public repository
The power of a data biosphere of common semantics and APIs
Obtaining phenotype and outcome data
from e-health records and digital platforms:
the experience of UK Biobank
Cathie Sudlow
Professor of Neurology and Clinical Epidemiology
Director, Centre for Medical Informatics, Usher Institute, University of Edinburgh
Director of Health Data Research UK Scottish substantive site
Chief Scientist, UK Biobank
International Cohorts Summit,
Durham, North Carolina
March 2018
UK Biobank in a nutshell
• 500,000 UK men and women aged 40-69 years when recruited during 2006-2010
• Consent for all types of health research by both academic and commercial researchers
• Extensive baseline questions and physical measures, with biological samples stored for
future assays
• Subsequent enhancements in all or large subsets of participants:
– Data from portable wearable devices (100,000 accelerometry; 20,000+ continuous ECG)
– Sample assays in all or large subsets:
Complete: genome-wide genotyping; biochemistry panel
Underway/planned: exome and whole genome sequencing; proteomics;
infectious disease assays; stool microbiome
– Multimodal imaging of 100,000 (>22,000 so far)
– Web questionnaires
• Comprehensive, long term follow-up for a wide range of health-related outcomes
• Open access for approved research: see www.ukbiobank.ac.uk
Aim: identify a wide range of incident diseases and other health related outcomes
Active methods requiring participant re-engagement
• face to face reassessment
• postal or web-based surveys
• expensive
• prone to incomplete coverage & selective loss to follow-up
• miss cases emerging between assessments
Passive methods via linkages to national health records
• can follow all participants without need for re-engagement
• efficient and cost effective
• need adequate consent at recruitment
• rely on universal healthcare system & availability of relevant datasets
• can only detect cases of disease diagnosed in a healthcare setting
• data need to be accurate and sufficiently detailed for research studies
Follow-up of participants in very large prospective cohorts
Web questionnaires
• Using email and web questionnaires
– for more detailed assessment of exposures
– and to obtain information on outcomes that cannot be obtained
through linking to health records
• Of 350,000 with email, >150,000 complete each questionnaire
– Details of dietary intake
– Cognitive function
– Mental health (thoughts and feelings)
– Gastrointestinal symptoms
Useful for following change
over time…but beware
selective attrition
Following the health of 0.5 million UK Biobank participants
through linking to National Health Service (NHS) records
Scotland: 36,000 participants
England: 446,000 participants
Wales: 21,000 participants
Regularly updated information on a wide range of diseases
from NHS datasets in all three countries:
• Deaths - date and cause of death
for all participants
>14,000 by early 2016
• Cancers – date, stage and grade of cancer
for all participants
>79,000 cancer cases by late 2015
• Admissions to hospital – dates, diagnoses, procedures
for all participants
1000’s of cases of many incident diseases
• Primary care data – dates, diagnoses, symptoms, signs,
referrals, prescriptions, labs etc
for half of the participants
1000’s more cases of many incident diseases
Maximising the value of the linked healthcare data
• Messy ‘real world data’ - not collected primarily for research
• Not 100% accurate due to administrative and clinical error
• Mainly structured, coded datasets (ICD, OPCS4, Read…)
• Experts advising in a range of disease areas:
– Cancer; Diabetes; Cardiac diseases; Stroke; Mental health disorders; Eye diseases
– Neurodegenerative diseases; Chest diseases; Musculoskeletal conditions; Infections; Kidney diseases
• Combine different linked data sources to create algorithmically derived disease status indicators
• Estimate the accuracy and completeness of these
• Consider limitations and potential additional sources of unstructured data
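One way to picture an algorithmically derived disease status indicator is to take the earliest evidence of a disease across the linked sources. The sketch below does this for stroke using baseline self-report, hospital admissions, and the death registry. The code list, record layout, and function name are assumptions for illustration, not UK Biobank's actual algorithm.

```python
from datetime import date
from typing import Optional

# Hypothetical code list; real algorithms use expert-curated code sets.
STROKE_ICD10 = {"I60", "I61", "I63", "I64"}

def first_stroke_date(self_report: Optional[date],
                      admissions: list[tuple[date, str]],
                      death: Optional[tuple[date, str]]) -> Optional[tuple[date, str]]:
    """Return (earliest date, source) of stroke evidence across linked sources, or None."""
    evidence = []
    if self_report is not None:
        evidence.append((self_report, "self-report"))
    for when, icd in admissions:
        if icd[:3] in STROKE_ICD10:  # match on the three-character ICD-10 category
            evidence.append((when, "hospital admission"))
    if death is not None and death[1][:3] in STROKE_ICD10:
        evidence.append((death[0], "death registry"))
    return min(evidence) if evidence else None

# Hypothetical participant: an MI admission, then a stroke admission, then stroke death.
admissions = [(date(2012, 3, 1), "I21.0"), (date(2014, 6, 9), "I63.9")]
print(first_stroke_date(None, admissions, (date(2015, 1, 2), "I63.9")))
```

Recording the source alongside the date makes it possible to estimate accuracy and completeness separately for each data source.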
Cancers in UK Biobank ascertained from the national cancer registries
                    Observed                             Predicted
                    By recruitment   Incident by 2015    Incident by 2022
Breast cancer       9,000            4,200               10,000
Colorectal cancer   2,300            2,500               7,000
Prostate cancer     3,000            4,300               9,000
→ Date, stage and grade of cancer
Beyond the structured registry data…exploring feasibility of retrieving additional
information for subtyping of identified cancer cases through regional linkages to:
• histopathology reports
• digitised histopathology slides
• tumour specimens
Exemplar non-cancer conditions in UK Biobank ascertained from baseline self report, hospital admissions and death registries
                        Observed                             Estimated incl. primary care
                        By recruitment   Incident by 2016    Incident by 2016
Myocardial infarction   12,000           7,400               8,100
Stroke                  8,000            4,600               6,900
Diabetes                26,000           9,000               18,000
COPD                    10,000           7,600               16,900
Asthma                  60,000           5,700               19,000
Dementia                200              1,800               3,600
Estimated effects of including primary care data
Accuracy?
Limitations?
Dementia: positive predictive value of routine healthcare data
From published studies
[Figure: PPV (0 to 1) by data source: mortality, hospital inpatient, hospital in- & outpatient, insurance, outpatients, primary care]
Wide variation, but PPV >80% in most studies
[Figure: PPV (0 to 1) for Alzheimer’s disease and vascular dementia]
PPV for AD generally higher than for vascular dementia
Dementia: positive predictive value of routine healthcare data
From comparison with expert review of free text electronic medical record in UK Biobank
Dementia: 80%
Alzheimer’s disease: 72%
Vascular dementia: 44%
Beyond the linked coded healthcare data
Obtaining these data at national scale is challenging
To extract value from these data on 1000’s of
outcomes across multiple diseases, we need scalable
approaches: crowd sourcing, natural language
processing, machine learning, artificial intelligence…
Structured, coded data from linked national healthcare datasets:
• Can ascertain cases of a wide range of diseases with acceptable accuracy
• Capture only 10-20% of the information from electronic medical records
• Are limited for detailed sub-phenotyping of disease
Deeper phenotyping of disease will require multiple
unstructured data sources, including:
• Free text of electronic records
• Complex electrical signalling data (ECG’s, EEG’s etc)
• Histopathology slide sets
• Clinical imaging data
Acknowledgements
Biochemistry analyses in all 500,000 participants
Cardiovascular
Cholesterol
Direct LDL-c
HDL-c
Triglyceride
ApoA
ApoB
Lp(a)
CRP
Cancer
SHBG
Testosterone
Oestradiol
IGF-I
Bone and joint
Vitamin D
Rheumatoid factor
Alkaline Phosphatase
Calcium
Liver
Albumin
Direct Bilirubin
Total Bilirubin
GGT
ALT
AST
Note: Haematological assays were conducted during recruitment phase
Diabetes
HbA1c
Glucose
Renal
Creatinine
Cystatin C
Total protein
Urea
Phosphate
Urate
Urinary:
• Creatinine
• Sodium
• Potassium
• Albumin
500,000 participants
22 recruitment centres
89% England
7% Scotland
4% Wales
Industrial scale processes: samples during recruitment
700 participants per day
4,900 sample tubes per day
25,000 aliquots produced per day
15 million 0.85ml aliquots
• Blood
whole blood
serum
plasma
red cells
buffy coat
• Urine
• Saliva
Total > 15 million aliquots
Expected disease cases during follow-up
Condition 2012 2017 2022
Diabetes 10,000 25,000 40,000
Heart attack 7,000 17,000 28,000
Stroke 2,000 5,000 9,000
Chronic obstructive lung disease 3,000 8,000 14,000
Breast cancer 2,500 6,000 10,000
Colorectal cancer 1,500 3,500 7,000
Prostate cancer 1,500 3,500 7,000
Hip fracture 1,000 2,500 6,000
Alzheimer’s 1,000 3,000 9,000
Data from portable wearable devices
Accelerometry data: 100,000 participants
Continuous ECG monitoring: 20,000 + participants
Prospective design and large size enable reasonably well-powered studies of (causal) associations between accelerometry and cardiac rhythm measures and later onset disease
Sample analyses
• Genome-wide genotyping of all participants
• Standard panel of assays (e.g. lipids; hormones; metabolic) on
samples from all participants
• Exome & whole genome sequencing, proteomics, metabolomics,
infectious disease assays, stool microbiome…all underway/planned
Multimodal imaging of 100,000 participants
>22,000 imaged
so far
Prospective design and large size enable well-powered studies of (causal) associations
between structure and function of organs and later onset disease…but…need scalable
methods of analysing complex data to derive measures for large scale analyses
Obtaining phenotype and outcome data from health records: China Kadoorie Biobank experience
Zhengming CHEN
CKB Principal Investigator
Professor of Epidemiology
Nuffield Dept. of Population Health
University of Oxford, UK
International Cohorts Summit, Duke University, USA,
26-27 March 2018
>512K recruited from 10 localities in 2004-08
Participants interviewed, measured, and gave
plasma and DNA (urine) for long-term storage
All followed up indefinitely via electronic record
linkage to deaths and ALL hospital episodes
Periodic resurvey of 5% surviving participants
(for enhancements and sources of variation)
China Kadoorie Biobank (CKB)
Informed consent for linkages to health records
and unspecified research use of stored samples
CKB: Clinical stations at local assessment centre
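Registration, informed consent, distribution of the field survey pack
Questionnaire
(including the general questionnaire, COPD questionnaire, CIDI questionnaire, and completeness check I)
Completion
· Completeness check II – check for any missed items
· Hand out the physical examination results report
Body fat
Hand grip strength
Physical examination
Height; waist and hip circumference
Blood pressure; pulse wave velocity
Cardiovascular examination
Blood collection and testing; urine collection and testing
Blood and urine sample collection and testing
Lung function; carbon monoxide concentration
Physiological examination
Carotid ultrasound; heel bone density; ECG
Other examinations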
1. Registration & consent
2. Physical exam.
3. Physical exam.
4. Physical exam.
5. Sample collection
7. Questionnaire (with recording)
8. Physical exam.
9. Clinic consultation
The clinic visit took 60-90 minutes, with daily statistical monitoring
Recruitment rate: 700-800 per day
CKB: Supported by >90 bespoke IT systems
CKB: Fully established with 10-year follow-up
Questionnaire
SES, smoking, alcohol, tea,
diet, physical activity, indoor air
pollution, sleep, reproductive
patterns, medical history
Measurements
Blood pressure, height, weight,
lung function, heart rate, bone
density, exhaled CO, ECG,
cIMT, ambient temperature,
ambient air pollution, blood
lipids, metabolites, proteomics,
infectious markers, genetics
Electronic health records
>1,300 different diseases, 43K
deaths, <5K lost to follow-up,
~0.9 million hospitalizations,
>100 million chargeable items
www.ckbiobank.org
Data is growing rapidly
Outcome follow-up in CKB: active follow-up, disease registries, national health insurance, death registries
CKB: Follow-up through record linkages
National health insurance system in China (supplementing death and cancer registries)
• Introduced during 2004-6, with almost universal
coverage in CKB areas by 2010
• Multiple disease diagnoses, with ICD-10 codes plus
disease descriptions and >2,500 procedure codes
• Managed electronically at city level, with detailed
chargeable items for reimbursement purposes
• Lacks certain details (e.g. cancer pathology) required
for disease sub-phenotyping
Nearly all CKB participants now linked to the health insurance databases via unique national ID number
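Linkage on a unique identifier amounts to a keyed join. The sketch below attaches insurance records to participants by national ID and sets aside records that match no participant. The field names (`study_id`, `national_id`, `icd10`, `admitted`) and the sample records are hypothetical, not CKB's actual schema.

```python
def link_by_national_id(participants: list[dict], claims: list[dict]) -> tuple[dict, list[dict]]:
    """Attach insurance claims to participants by ID; also return claims that match no one."""
    index = {p["national_id"]: p["study_id"] for p in participants}
    linked: dict[str, list[dict]] = {p["study_id"]: [] for p in participants}
    unmatched = []
    for claim in claims:
        study_id = index.get(claim["national_id"])
        if study_id is None:
            unmatched.append(claim)
        else:
            # Keep only the study ID downstream; drop the national ID.
            linked[study_id].append({k: v for k, v in claim.items() if k != "national_id"})
    return linked, unmatched

# Hypothetical records.
participants = [{"study_id": "P001", "national_id": "ID-A"},
                {"study_id": "P002", "national_id": "ID-B"}]
claims = [{"national_id": "ID-A", "icd10": "I63.9", "admitted": "2012-05-01"},
          {"national_id": "ID-Z", "icd10": "E11.9", "admitted": "2013-02-10"}]
linked, unmatched = link_by_national_id(participants, claims)
print(linked)
print(unmatched)
```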
CKB: Participants with selected diseases in 10 years (43K deaths, 0.9M hospital admissions; 2017 HI data are being processed)
Ischaemic Stroke (I63): 46,063
Diabetes (E10-E14): 32,468
Cancer (C00-C97): 28,274
COPD (J41-J44): 19,796
Fractures (S02, S12, S22, S32, S42, S52, S62, S72, S82, S92): 15,539
Cataract (H25-H28): 14,277
Angina (I20): 13,882
Haemorrhagic Stroke (I61): 11,037
MI (I21-I23): 8,035
Arrhythmia (I47-I49): 6,270
Chronic liver disease (K70-K77, B18-B19, I85, Z22.5): 5,904
Pulmonary heart disease (I26-I27): 5,489
Heart failure (I50): 4,632
Anxiety disorders (F40-F48): 4,002
Tuberculosis (A15-A19, B90, J65): 3,188
Asthma (J45-J46): 3,169
CKD (N02-N03, N07, N11, N18): 2,748
Osteoporosis (M80-M81): 2,664
Coronary revascularisation: 2,273
Rheumatoid arthritis (M05-M06, M45): 1,730
Schizophrenia (F20-F29): 1,316
Depression (F30-F39): 1,192
Retinopathy (E10-4.3, H36.0): 1,104
SAH (I60, I69.0): 1,098
Parkinson disease (G20-G21): 822
(Venomous) snake bite (T63.0, X20): 425
Victim of earthquake (X34): 54
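Counts like these depend on mapping each recorded ICD-10 code to a disease group. The sketch below does this for a few of the groups listed above by comparing the code's three-character category against inclusive ranges. The `GROUPS` table and the `classify` function are hypothetical; groups defined with four-character codes (e.g. Z22.5) would need a finer rule.

```python
# A few groups taken from the chart; ranges are inclusive ICD-10 categories.
GROUPS = {
    "Ischaemic stroke": [("I63", "I63")],
    "Diabetes": [("E10", "E14")],
    "COPD": [("J41", "J44")],
    "MI": [("I21", "I23")],
    "Tuberculosis": [("A15", "A19"), ("B90", "B90"), ("J65", "J65")],
}

def classify(icd10_code: str) -> list[str]:
    """Return every group whose range contains the code's three-character category."""
    category = icd10_code.strip().upper()[:3]
    return [name for name, ranges in GROUPS.items()
            if any(lo <= category <= hi for lo, hi in ranges)]

print(classify("I63.9"))
print(classify("E11.2"))
print(classify("X34"))
```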
[Figure: panels for haemorrhagic stroke and NAFLD; x-axis: waist circumference (cm)]
CKB: Procedures for improving disease
phenotyping
Pilot study of ~1000 cases for specific disease before
deciding whether to undertake systematic adjudication
CKB: Disease standardisation and coding tool
CKB: Verifying reported diagnosis
CKB: Adjudicating & phenotyping major diseases (>70K adjudicated: 30K stroke, 25K IHD, 15K cancer, >3K CKD)
CKB: “traffic light” approach for outcome data
Future work for disease phenotyping
• Standardising and ICD-10 coding new events collected
• Processing and incorporating >100M chargeable items
data to enhance disease phenotyping
• Extending outcome adjudication to several other
diseases (e.g. heart failure, chronic liver disease)
• Developing automated algorithm to sub-phenotype
stroke and other diseases according to clinical criteria
• Piloting collection of discharge summary pages and
tumour tissue samples
CKB: Open data access platform (www.ckbiobank.org)