Electronic phenotyping with APHRODITE and the Observational
Health Sciences and Informatics (OHDSI) data network
S16: Papers - EHR-based Phenotyping 2017 Joint Summits on Translational Science,
March 28th, 2017
J U A N M . B A N D A , P H D 1 , Y O N I H A L P E R N , P H D 2 , D A V I D S O N T A G , P H D 3 , N I G A M H . S H A H , P H D 1
1 S T A N F O R D C E N T E R F O R B I O M E D I C A L I N F O R M A T I C S R E S E A R C H , S T A N F O R D , C A ;2 N E W Y O R K U N I V . , N E W Y O R K , N Y ;
3 M I T , C A M B R I D G E , M A
DISCLOSURE:
SPEAKER DISCLOSES THAT HE HAS NO RELATIONSHIPS WITH COMMERCIAL INTERESTS.
1
Electronic phenotyping with APHRODITE and the Observational Health Sciences and Informatics (OHDSI) data network
Motivation
- Phenotyping is the beginning step to almost every biomedical study, yet in many cases is very complicated
- Popular modern approaches are rule-based which take very long to develop
- Can we build a statistical model which provides performance comparable to a consensus definition and requires less development effort? …. Turns out we can! (Agarwal V, et al. and Halpern Y, et al.)
- Can we port these two approaches to the Observational Health Sciences and Informatics (OHDSI) tech stack?
2
Agarwal V, et al. Learning statistical models of phenotypes using noisy labeled training data. JAMIA. 2016. 23 (6), 1166-1173Halpern Y, et al. Electronic medical record phenotyping using the anchor and learn framework. JAMIA 2016. 23 (4): 731-740
Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE)- Framework is designed to enable phenotyping via supervised (or semi-
supervised) learning of phenotype models
- Reads patient data from the OHDSI/OMOP CDM version 5
- Combines two recently published phenotyping approaches:- XPRESS (Agarwal V, et al.)- Achor learning (Halpern Y, et al.)
- In addition, to enable sharing and reproducibility of the underlying phenotype 'recipes', APHRODITE allows sharing of either the trained model or sharing of the configuration settings and anchor selections across multiple sites
Agarwal V, et al. Learning statistical models of phenotypes using noisy labeled training data. JAMIA. 2016. 23 (6), 1166-1173Halpern Y, et al. Electronic medical record phenotyping using the anchor and learn framework. JAMIA 2016. 23 (4): 731-740
3
4
+140 collaborators
16 countries
Common CDM and vocabulary
The patient network available includes 84 databases, both clinical and claims, totaling over 650 million patients
OHDSI /OMOP CDMv5
APHRODITE Overview
5
Step 1 – Build Keyword List
Function call:
wordLists <- buildKeywordList(conn, aphrodite_concept_name, cdmSchema, dbms)
Generated keywords for “Myocardial Infarction”
Vocabulary v5
6
Step 2 – Find Cases and Controls
CDMv5
Function call:
casesANDcontrolspatient_ids_df<- getdPatientCohort(conn, dbms,as.character(keywordList_FF$V3),as.character(ignoreList_FF$V3), cdmSchema,nCases,nControls)
7
Step 3 – Extract Patient Data
CDMv5
Function calls:
dataFcases <- getPatientData(conn, dbms, cases, as.character(ignoreList_FF$V3), flag, cdmSchema)dataFcontrols <- getPatientData(conn, dbms, controls, as.character(ignoreList_FF$V3), flag, cdmSchema)
8
Step 4 – Create Feature Vector
Function calls:
fv_all<-buildFeatureVector(flag, dataFcases,dataFcontrols)fv_full_data <- combineFeatureVectors(flag, data.frame(cases), controls, fv_all, outcomeName)
9
Step 5 – Build Model
Function calls:
model_predictors <- buildModel(flag, fv_full_data, outcomeName, folder)model<-model_predictors$model
predictorsNames<-model_predictors$predictorsNames 10
Step 6 – Get Evaluation Results
11
How do we know ‘noisy labeling’ works?Using the evaluation of Agarwal V., et al.:
- Selected one phenotype each from defined by the Electronic Medical Records and Genomics (eMERGE) and the Observational Medical Outcomes Partnership (OMOP) initiatives:- Myocardial infarction- Type 2 diabetes mellitus
- Created manually reviewed gold standard of cases and controls
Agarwal V., et al. Learning statistical models of phenotypes using noisy labeled training data. JAMIA. 2016. 23 (6), 1166-1173
Cases Controls Accuracy Recall PPV Cases Controls Accuracy Recall PPVSource Myocardial Infarction (MI) Type 2 Diabetes Mellitus (T2DM)
OMOP / PheKB Definition 94 94 0.87 0.91 0.84 152 152 0.92 0.88 0.96
XPRESS Noisy labels 94 94 0.85 0.93 0.8 152 152 0.89 0.99 0.81
APHRODITE Noisy labels 94 94 0.94 0.87 1.00 152 152 0.91 0.98 0.87
12
Building phenotype models with APHRODITEUsing the following keywords
Performance of noisly labeled trained models
Performance of gold-standard test set
Myocardial Infarction (MI) Type 2 diabetes mellitus (T2DM)Old myocardial infarction Type 2 diabetes mellitus with hyperosmolar coma
True posterior myocardial infarction Type 2 diabetes mellitusMyocardial infarction with complication Pre-existing type 2 diabetes mellitusMyocardial infarction in recovery phase Type 2 diabetes mellitus with multiple complications
Microinfarct of heart Type 2 diabetes mellitus in non-obese
Cases Cont. Acc. Recall PPV Acc. Recall PPVMyocardial Infarction (MI) Type 2 Diabetes Mellitus (T2DM)
XPRESS 750 750 0.86 0.89 0.84 0.88 0.89 0.87APHRODITE 750 750 0.9 0.92 0.89 0.89 0.92 0.87APHRODITE 1,500 1,500 0.9 0.93 0.9 0.91 0.93 0.88APHRODITE 10,000 10,000 0.91 0.93 0.91 0.92 0.94 0.89
Cases Cont. Acc. Recall PPV Cases Cont. Acc. Recall PPVSource Myocardial Infarction (MI) Type 2 Diabetes Mellitus (T2DM)
OMOP/PheKB 94 94 0.87 0.91 0.84 152 152 0.92 0.88 0.96XPRESS 94 94 0.89 0.93 0.86 152 152 0.89 0.88 0.9
APHRODITE (750) 94 94 0.91 0.93 0.90 152 152 0.91 0.95 0.88
APHRODITE (1,500) 94 94 0.92 0.93 0.91 152 152 0.92 0.95 0.89
APHRODITE (10,000) 94 94 0.92 0.94 0.91 152 152 0.93 0.96 0.89
13
15
Top 15 features for T2DM and MI
Type 2 diabetes mellitus (T2DM) Myocardial Infarction (MI)Rank Feature Concept Name Concept Vocabulary Rank Feature Concept Name Concept Vocabulary
1 drugEx:1503297 Metformin RxNorm 1 obs:40398391 Hypertensive disease SNOMED2 obs: 433736 Obesity SNOMED 2 obs:4243768 Auscultation SNOMED3 obs:16890009 Insulin measurement SNOMED 3 obs:4342792 Infarction NDFRT
4 obs:434005 Morbid Obesity SNOMED 4 obs:4191705Coronary heart disease review SNOMED
5 lab:4149519:NA Glucose measurement SNOMED 5 obs:4167217Family history of clinical finding SNOMED
6 drugEx:253182 Regular Insulin, Human RxNorm 6 obs:4282140 Chewable tablet SNOMED7 drugEx:1112807 Aspirin RxNorm 7 drugEx:1112807 Aspirin RxNorm8 lab: 4144235:NA Glucose measurement, blood SNOMED 8 obs:4325480 Aspirin NDFRT9 obs:45889912 Hemoglobin SNOMED 9 lab:3002030 Lymphocytes % LOINC
10 obs:45889669 Insulin CPT4 10 lab:3014576 Chloride serum/plasma LOINC11 lab:3004410 Hemoglobin A1c (Glycated) LOINC 11 obs:4329041 Pain SNOMED12 obs: 40398391 Hypertensive disease SNOMED 12 lab:40789893 Potassium | Bld-Ser-Plas LOINC13 obs:1503297 Metformin RxNorm 13 obs:433595 Edema SNOMED14 obs:4325480 Aspirin NDFRT 14 visit:77670 Chest Pain SNOMED15 obs:40301556 Past medical history SNOMED 15 obs:1126658 Hydromorphone RxNorm
14
Using Anchors with APHRODITE:Step 1 – Build Keyword List - Anchors
Function call:
wordLists <- buildKeywordList(conn, aphrodite_concept_name, cdmSchema, dbms)
Generated keywords for “Myocardial Infarction”
15
Step 2 – Find Anchors
CDMv5 Function call:
anchor_list <- getAnchors(conn, dbms, cdmSchema, cases, controls, as.character(ignoreList_FF$V3), studyName, outcomeName, flag, numAnchors=50)
16
Step 3 – Curate Anchors
Manual Step
17
Step 4 – Find Cases and Controls - Anchors
CDMv5
Function call:
Anchored_casesANDcontrolspatient_ids_df<- getPatientCohort_w_Anchors(conn, dbms,as.character(keywordList_FF$V3),as.character(ignoreList_FF$V3),
cdmSchema,nCases,nControls, flag) 18
Step 5 – Extract Patient Data - Anchors
CDMv5
Function calls:
dataFcases <- getPatientData(conn, dbms, cases, as.character(ignoreList_FF$V3), flag, cdmSchema)dataFcontrols <- getPatientData(conn, dbms, controls, as.character(ignoreList_FF$V3), flag, cdmSchema)
19
Step 6 – Create Feature Vector - Anchors
Function calls:
fv_all<-buildFeatureVector(flag, dataFcases,dataFcontrols)fv_full_data <- combineFeatureVectors(flag, data.frame(cases), controls, fv_all, outcomeName)
20
Step 7 – Build Model - Anchors
Function calls:
model_predictors <- buildModel(flag, fv_full_data, outcomeName, folder)model<-model_predictors$model
predictorsNames<-model_predictors$predictorsNames 21
Anchors sugested for MI and T2DM
Myocardial Infarction (MI) Type 2 diabetes mellitus (T2DM)
Source importance / rank
concept_name Source Importance /rank
concept_name
obs 1.9036 1 Renal function lab 0.5637 1 Serum HDL/non-HDL cholesterol ratio measurement
obs 1.8554 2 Every eight hours lab 0.5466 2 Glucose measurementobs 1.7831 3 Chest CT obs 0.5328 3 Lipid panelobs 1.7108 4 Cataract obs 0.5156 4 Asthmaobs 1.6386 5 Hypercholesterolemia obs 0.4984 5 Diabetes mellitusobs 1.5663 6 Osteopenia lab 0.4778 6 Hyaline castsobs 1.4699 7 Palpitations obs 0.4675 7 Pulmonary edemaobs 1.3735 8 Commode obs 0.4366 8 Obesityobs 1.3012 9 Tightness sensation quality obs 0.4228 9 Hypoglycemiaobs 1.253 10 Afternoon obs 0.3987 10 Metforminobs 1.012 11 Deficiency obs 0.3816 11 Duplex
drugEx 0.8916 12 ferrous sulfate obs 0.3712 12 PalpitationsdrugEx 0.7711 13 Dobutamine visit 0.3541 13 Obesity
obs 0.7229 14 Fibrovascular lab 0.3334 14 Glucose
lab 0.5542 15 Sodium [Moles/volume] in Blood lab 0.3059 15 Hemoglobin A1c
drugEx 0.4819 16 clopidogrel obs 0.2784 16 Subclavicular approachlab 0.3855 17 Lactic acid measurement obs 0.2303 17 Colonoscopyobs 0.3133 18 Pressure ulcer obs 0.2131 18 Cholesterol
drugEx 0.1928 19 zolpidem drugEx 0.1925 19 Insulin, Regular, Human
obs 0.0964 20 Aspirin 81 MG Enteric Coated Tablet
drugEx 0.0206 20 Oxycodone
22
APHRODITE anchored phenotype performance
Cases Cont. Acc. Recall PPV Cases Cont. Acc. Recall PPVMyocardial Infarction (MI) Type 2 Diabetes Mellitus (T2DM)
OMOP/PheKB 94 94 0.87 0.91 0.84 152 152 0.92 0.88 0.96
XPRESS 94 94 0.89 0.93 0.86 152 152 0.89 0.99 0.9
APHRODITE 94 94 0.91 0.93 0.9 152 152 0.91 0.98 0.88
APHRODITE (Anchors) 94 94 0.92 0.97 0.89 152 152 0.92 0.95 0.9
23
ConclusionsWe have successfully implemented the framework proposed by Agarwal et
al. and the Anchor learning framework by Halpern et al. to build and refine phenotype models
We have demonstrated that it is possible identify anchors during the model building process to generate a better labeled training set which leads to a better performing model than just using keyword for noisy labeling
Our main contribution is the APHRODITE package, which allows for the potential redistribution of locally validated phenotype models as well as the sharing of the workflows for learning phenotype models at multiple sites of the OHDSI data network
Agarwal V, et al. Learning statistical models of phenotypes using noisy labeled training data. JAMIA. 2016. 23 (6), 1166-1173Halpern Y, et al. Electronic medical record phenotyping using the anchor and learn framework. JAMIA 2016. 23 (4): 731-740
24
AcknowledgementsStanford- Nigam Shah- Vibhu Agarwal- Shah lab
New York University- Yoni Halpern
MIT- David Sontag
OHDSI Community
25
QUESTIONS?
Download Aphrodite:https://github.com/OHDSI/Aphrodite
Learn more:http://tinyurl.com/use-aphrodite
25
@drjmbanda
28
Why do noisy labels work
Learning theory - Generalization error of a model trained using data with inaccurate labels can be made small
Model trained using m observations with accurate labels has the same gen error as a model trained using m/(1 – 2τ) 2 observations with noisy labels.
Model inaccuracies in automatic labeling as random classification noise given by error rate τ
Unless τ is 0.5, having more observations will help The cost of getting additional observations is negligible
29
Performance results for data removal experiments Cases Cont. Acc. Recall PPV Cases Cont. Acc. Recall PPV
Source Myocardial Infarction (MI) Type 2 Diabetes Mellitus (T2DM)
APHRODITE 94 94 0.91 0.93 0.9 152 152 0.91 0.98 0.88
Observations removed 94 94 0.75 0.84 0.78 152 152 0.67 0.76 0.71
Labs removed 94 94 0.87 0.85 0.82 152 152 0.69 0.78 0.72
Drugs removed 94 94 0.85 0.85 0.82 152 152 0.83 0.9 0.84
Visits removed 94 94 0.89 0.88 0.86 152 152 0.86 0.91 0.86
APHRODITE w Anchors 94 94 0.92 0.97 0.89 152 152 0.92 0.95 0.9
Observations removed 94 94 0.77 0.89 0.79 152 152 0.7 0.77 0.73
Labs removed 94 94 0.89 0.88 0.81 152 152 0.71 0.79 0.75
Drugs removed 94 94 0.86 0.87 0.84 152 152 0.86 0.91 0.85
Visits removed 94 94 0.91 0.9 0.87 152 152 0.88 0.9 0.84