EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases)

Po-Hsiang (Barnett) Chiu


Post on 13-Sep-2020


Page 1

EHR-Based Phenotyping: Bulk Learning and Evaluation (with Infectious Diseases)

Po-Hsiang (Barnett) Chiu

Page 2

Phenotypes and phenotyping

•  Physically observable traits of genotypes (and their interactions with environments)
•  Diseases (and disease subtypes)
•  Attributions of diseases (e.g. susceptibility)
•  Biochemical or physiological properties, behavior, and products of behavior

Page 3

Data-Driven Phenotyping

•  Data-driven phenotyping
    –  Two main methodologies
        •  Rule-based approach (e.g. eMerge, https://emerge.mc.vanderbilt.edu)
        •  Predictive analytics
    –  Data sources:
        •  EHRs/EMRs: medicinal treatments, diagnoses, lab measurements, etc.
        •  Genomic data: SNP arrays, copy number variation (CNVs), etc.
    –  Phenotypes
        •  Diseases, subtypes, or variables attributed to disease predictions

Page 4

Diagnostic Concept Units

•  Various diseases share the same set of diagnostic concept units
•  Infectious diseases
    –  Lab tests
        •  Microorganism, blood, urine, body tissues, stool
    –  Medications
        •  Antibiotic, antiviral, anthelmintic
•  Build statistical models for each diagnostic component and combine them appropriately
    –  Ensemble learning

Page 5

Bulk Learning in a Nutshell…

Bulk Learning is a batch-phenotyping framework that uses multiple diseases collectively (i.e. the bulk learning set) as a substrate for model learning and evaluation. A given medical ontology is used to perform feature selection, and model stacking is used to construct an abstract feature representation of low sample complexity in order to reduce training requirements.

Key Concepts:

1. Build phenotyping models on top of multiple diseases
2. Automatic feature selection using an existing ontology
3. Models are combined via model stacking (a form of ensemble learning)
4. Abstract features
    –  Dimensionality reduction
5. Less labeled data required for model evaluations

Page 6

Phenotyping via Bulk Learning

•  Under model stacking, we then arrive at the notion of “concept-driven phenotyping”
    –  A subset or combinations of lab tests are more attributable to some diseases while the others are better explained by medications
•  In this study, infectious diseases associated with 100 ICD-9 codes serve as the domain of study for bulk learning
    –  For simplicity, consider different diagnostic codes as different diseases…
    –  Why 100 codes?
    –  Code selection strategy?

Page 7

Bulk Learning Basics I

•  Addresses two central issues in the predictive-analytics approach to computational phenotyping
    –  Feature engineering
        •  Medical ontology for feature decomposition
        •  Medical Entities Dictionary (http://med.dmi.columbia.edu)
    –  Data annotation
        •  Ensemble learning (e.g. stacked generalization [Wolpert 1992])
        •  Feature abstraction for dimensionality reduction

Page 8

Medical Ontology for Grouping Features

•  Snapshot of Medical Entities Dictionary (http://med.dmi.columbia.edu)
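As a rough illustration of how an ontology can delineate feature groups, the sketch below partitions raw feature names by concept. Everything in it is hypothetical: the mapping `TOY_ONTOLOGY`, the feature names, and the helper `group_features` are invented for illustration, and the Medical Entities Dictionary itself is not queried.

```python
from collections import defaultdict

# Hypothetical mapping from raw EHR feature names to concept groups.
# In the slides this role is played by the Medical Entities Dictionary;
# the entries below are invented for illustration only.
TOY_ONTOLOGY = {
    "culture_e_coli": "microbiology",
    "culture_staph_aureus": "microbiology",
    "ceftriaxone": "antibiotic",
    "vancomycin": "antibiotic",
    "wbc_count": "blood_test",
    "urine_nitrite": "urine_test",
}


def group_features(feature_names, ontology):
    """Partition feature names by concept group; unmapped names go to 'other'."""
    groups = defaultdict(list)
    for name in feature_names:
        groups[ontology.get(name, "other")].append(name)
    return dict(groups)


features = ["ceftriaxone", "wbc_count", "culture_e_coli", "urine_nitrite", "height_cm"]
grouped = group_features(features, TOY_ONTOLOGY)
print(grouped)
```

Each resulting group would then supply the inputs of one base model, so feature selection can run within a concept group rather than over the whole feature set.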

Page 9

Model Stacking

•  Why inspect multiple (infectious) diseases?
    –  Use multiple diseases as a substrate and identify their common elements
    –  Example stacking architecture (under the stacked generalization method)

[Figure: example stacking architecture]
•  Level 0: Antibiotic Measure; Urinary Chemistry Measure; Intravenous Chemistry Measure; Microbiology Measure; Other Phenotypic Measures (e.g. Antiviral)
•  Level 1: Attributes: Level-0 Probabilities and Indicators. Target: Diagnostic Codes (Silver Standard)
•  Level 2: Attributes: Level-1 Probabilities and ICD-9. Target: True Labels (Gold Standard)
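A minimal sketch of the first two tiers may help make the architecture concrete. It assumes a synthetic cohort, two concept groups, and a hand-written logistic unit trained by gradient descent; none of these names, data, or settings come from the slides. Following stacked generalization, the level-0 probabilities passed upward are computed out of fold, so each row is scored by a model that never saw it.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, row):
    # w holds one weight per feature followed by a bias term.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + w[-1])


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit a logistic unit by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            err = predict_proba(w, row) - label
            for j, value in enumerate(row):
                grad[j] += err * value
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def out_of_fold_probs(X, y, k=5):
    """Level-0 probability for every row, each from a model that never saw it."""
    probs = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            probs[i] = predict_proba(w, X[i])
    return probs


def draw(rng, coded, p_coded, p_not):
    """Binary feature that fires more often when the surrogate code is present."""
    return 1.0 if rng.random() < (p_coded if coded else p_not) else 0.0


# Synthetic cohort (invented): two concept groups, two binary features each.
rng = random.Random(0)
micro, abx, icd9 = [], [], []
for _ in range(200):
    coded = rng.random() < 0.5  # surrogate (silver-standard) label
    micro.append([draw(rng, coded, 0.8, 0.2), draw(rng, coded, 0.6, 0.3)])
    abx.append([draw(rng, coded, 0.7, 0.3), draw(rng, coded, 0.6, 0.4)])
    icd9.append(1.0 if coded else 0.0)

# Level 0: one base model per concept group, scored out of fold.
level1_rows = [list(pair) for pair in zip(out_of_fold_probs(micro, icd9),
                                          out_of_fold_probs(abx, icd9))]

# Level 1: meta model over the level-0 probabilities (the abstract features).
meta_w = fit_logistic(level1_rows, icd9)
predicted = [predict_proba(meta_w, row) >= 0.5 for row in level1_rows]
accuracy = sum(p == (t == 1.0) for p, t in zip(predicted, icd9)) / len(icd9)
print(f"in-sample agreement with surrogate labels: {accuracy:.2f}")
```

The level-1 rows have one column per concept group, which is the denser representation the stacking is meant to produce. A third tier would add the binarized ICD-9 code as a further attribute and train against the annotated gold-standard labels.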

Page 10

Surrogate Labels vs True Labels

•  Model stacking is used to:
    –  Improve upon base model performances
    –  Transform EHR data to a denser form
•  Uses diagnostic codes (e.g. ICD-9) as surrogate labels to establish “approximate predictive models.”
•  Why surrogate labels (e.g. ICD-9)?
    –  Features extracted from EHR can be large
    –  Used to derive a compact representation of the training data
    –  “Free” supervised signals that are sufficiently close but can be obtained without extra work
•  Objective: Build statistical models in abstract feature space
    –  Create a sparse annotation set (i.e. gold standard) that serves as a proxy dataset for downstream model evaluations
    –  83 annotated cases

Page 11

[Figure: Four Example Base Models. Raw features (f11, f12, f1j, f21, f2j, f31, f3j, f41) feed logistic units m1, a1, b1, u1, labeled microbiology, antibiotic, blood test, and urine test. Other node labels: m1(1), a1(1), b1(1), u1(1); m1g, a1g, b1g, u1g; global2; indices (i-1), (i), (i+1).]
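Read as an equation, and assuming each base model is a single logistic unit over the raw features of its own concept group, the microbiology unit could be written as follows. The weights $w_{1j}$, the bias $b_m$, and the symbol $\sigma$ are notation introduced here for illustration; they do not appear in the slides.

```latex
\[
m_1 = \sigma\Big(b_m + \sum_{j} w_{1j}\, f_{1j}\Big),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
```

Under the same assumption, a1, b1, and u1 would take the same form over the features f2j, f3j, and f4j respectively, each with its own weights and bias.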

Page 12

Performance Evaluations

•  How well does the model predict ICD-9s (using separate test data)?
•  How well does the model predict annotated data (assoc. with “true labels”)?
    –  (Binarized) ICD-9 becomes a candidate feature among abstract features (e.g. probability scores, indicators)
•  Annotated sample consists of randomly selected cases in which errors of ICD-9 coding are corrected
•  Data annotations and coding procedures are two independent processes
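The two evaluation questions can be sketched by scoring the same model outputs against each label set. The slides do not say which metric was used; AUC is an assumption made here for illustration, and the scores and labels below are invented.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for six cases, with invented labels.
scores     = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
icd9_codes = [1,   1,   1,   0,   0,   0]   # surrogate labels from coding
gold       = [1,   1,   0,   1,   0,   0]   # annotator-corrected labels

auc_vs_icd9 = auc(scores, icd9_codes)
auc_vs_gold = auc(scores, gold)
print(auc_vs_icd9, auc_vs_gold)
```

In this toy case the model ranks the coded cases perfectly but scores lower against the corrected labels, because two cases were coded incorrectly. That gap is the reason the annotated sample is evaluated separately.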

Page 13

Base Level Performances

Page 14

127.4 Enterobiasis
047.8 (Other) viral meningitis
009.1 Gastroenteritis...
053.9 Herpes zoster
117.9 Mycoses

Page 15

Page 16

Other Components

•  Semi-supervised learning and virtual annotation set
•  The 3rd tier in model stacking hierarchy
    –  Trade-off between learned abstract features and the ICD-9 codes as surrogate labels
    –  Performance evaluation on predicting annotated labels
•  Ontology-based feature engineering
•  Proper design of treatment and control (training) data

Page 17

Modeling Perspective

•  EHR data consist of observations and latent variables
    –  Observations can be directly answered via simple queries
        •  Did the patient have tests on E. coli?
        •  Did the patient take Ceftriaxone?
•  Latent variables represent quantities that cannot be directly observed in EHR or computed via simple queries
    –  Does the patient have an infection?
    –  Diagnostic questions: specifically, which infections does the patient have?
•  Learn classifiers to predict latent variables (with access only to observations)
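The distinction between observations and latent variables can be sketched with a toy record. The record layout, field names, and helper functions below are hypothetical and far simpler than a real EHR schema.

```python
# Hypothetical, simplified patient record; field names are invented.
record = {
    "lab_tests": [{"name": "urine culture", "organism": "E. coli"}],
    "medications": ["ceftriaxone", "acetaminophen"],
}


def had_test_for(record, organism):
    """Observation: answered by a direct lookup over recorded lab tests."""
    return any(t.get("organism") == organism for t in record["lab_tests"])


def took(record, drug):
    """Observation: answered by a direct lookup over recorded medications."""
    return drug.lower() in (m.lower() for m in record["medications"])


observations = {
    "tested_for_e_coli": had_test_for(record, "E. coli"),
    "took_ceftriaxone": took(record, "Ceftriaxone"),
}
print(observations)

# There is no field for "has an infection": that latent variable has to be
# predicted by a classifier from observations such as these.
```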

Page 18

Medical Perspective

•  Seemingly different infectious diseases may share similar sets of lab tests and medications
    –  Staph. aureus
        •  Skin infections, pneumonia, blood poisoning
    –  Ceftriaxone
        •  Meningitis
        •  Infections at different sites of the body (e.g. bloodstream, lungs, urinary tract)
•  Multiple classifiers for the same disease
    –  4 classifiers per ICD-9 code, each of which is a binary classifier
        •  400 classifiers at base level
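The base-level count follows from pairing every code with every concept group. The sketch below assumes the four groups are the ones named in the base-model figure (microbiology, antibiotic, blood test, urine test) and uses only the five codes listed in this transcript; the variable names are illustrative.

```python
from itertools import product

# Assumed concept groups, taken from the "Four Example Base Models" figure.
CONCEPT_GROUPS = ["microbiology", "antibiotic", "blood test", "urine test"]

# Five of the ICD-9 codes that appear elsewhere in this transcript.
example_codes = ["127.4", "047.8", "009.1", "053.9", "117.9"]

# One binary base classifier per (code, concept group) pair.
base_model_keys = list(product(example_codes, CONCEPT_GROUPS))
print(len(base_model_keys))  # 5 codes x 4 groups

# With the study's 100 codes the same pairing gives 100 x 4 classifiers.
full_count = 100 * len(CONCEPT_GROUPS)
print(full_count)
```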

Page 19

Data Distribution Perspective

“Can we build a joint model applicable to all diseases?”

Page 20

Abstract Feature Representation: Design Choices

•  Related work in constructing high-level features
    –  PCA, unsupervised feature learning, manifold learning, etc.
•  Design choices
    –  Data characteristics
    –  Interpretability
•  Deep Neural Network
    –  Linear combination
    –  Non-linear transformation (e.g. sigmoid, rectifier, etc.)
•  Feature set: continuous, dense, and “homogeneous”
    –  Image pixels
    –  Time series of lab measurements
    –  word2vec
•  EHR data, however, are very different
    –  Sparse and incomplete
    –  Consist of many different types (binary, categorical, continuous, etc.)
    –  Features associated with multiple concepts

Page 21

Moving Forward…

•  Summary
    –  Bulk learning is a framework with at least the following system choices
        •  The bulk learning set (of target conditions) => base models
        •  Classification algorithms (guideline: probabilistic classifiers + well-calibrated)
        •  Stacking architecture (multiple tiers => levels of abstraction)
        •  Strategy for combining individual (local) disease models into a global model
    –  Advantage: Can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data)
        •  83 clinical cases were labeled in this study
    –  Challenge: The model involving the interaction between abstract features and ICD-9 does not generalize well into the region of the data where the ICD-9 coding was incorrect
        •  Multiple types of surrogate labels
•  Ongoing and future work
    –  Semi-supervised learning, active learning
    –  Complex decision boundary?
    –  Other surrogate labels

[Figure: stacking diagram. Node labels: m1, a1, b1, u1; m1(1), a1(1), b1(1), u1(1); m1(i), a1(i), b1(i), u1(i); m1g, a1g, b1g, u1g; local2(i); global2; indices (i-1), (i), (i+1).]

Page 22

Reference

[1] D.H. Wolpert, Stacked generalization, Neural Networks. 5 (1992) 241–259.
[2] K.M. Ting, I.H. Witten, Issues in stacked generalization, J. Artif. Intell. Res. 10 (1999) 271–289.
[3] J. Jin Chen, C. Cheng Wang, R. Runsheng Wang, Using Stacked Generalization to Combine SVMs in Magnitude and Shape Feature Spaces for Classification of Hyperspectral Data, IEEE Trans. Geosci. Remote Sens. 47 (2009) 2193–2205.
[4] David Baorto, James Cimino, et al. Available: http://med.dmi.columbia.edu. Access date: Oct 20, 2016.
[5] T.A. Lasko, J.C. Denny, M.A. Levy, Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data, PLoS One. 8 (2013) e66341.

Page 23

Thank You

[Figure: raw features (f11, f12, f1j, f21, f2j, f31, f3j, f41) feeding units m1, a1, b1, u1.]

Page 24

[Figure: Level 0 and Level 1. Raw features (f11, f12, f1j, f21, f2j, f31, f3j, f41) feed logistic units m1, a1, b1, u1, labeled Microbiology, Antibiotic, Blood test, and Urine test; the unit outputs m1, a1, b1, u1 appear again at Level 1.]

Page 25

Example Features

Page 26

Page 27

1. Define Feature Groups Using Medical Ontology
    1a. Gather EHR data according to medical concepts
    1b. Use Medical Entities Dictionary to delineate feature scopes
    1c. Apply feature selection within each concept group
2. Compute Base Models
3. Compute Meta Models (via Ensemble Learning)
    3a. Per-disease ensembles: compute local level-1 models
    3b. Cross-disease ensemble: compute a global level-1 model

[Figure: overall pipeline. Raw features (f11, f12, f1j, f21, f2j, f31, f3j, f41) feed logistic units for microbiology, antibiotic, blood test, and urine test (Four Example Base Models). Panel labels: Level-1 abstract features; Individual Level-1 Local Units; Level-1 Global Unit; Global level-1 features. Node labels: m1, a1, b1, u1; m1(1), a1(1), b1(1), u1(1); m1(i), a1(i), b1(i), u1(i); m1g, a1g, b1g, u1g; local2(i); global2; indices (i-1), (i), (i+1).]