Tree-Based Methods (Ensemble Schemes) · 2020-01-15
TRANSCRIPT
Overview
• Decision Trees
  • Overview
  • Splitting nodes
  • Limitations
• Bagging (Bootstrap Aggregating) and Boosting
  • How bagging reduces variance
  • Boosting
• Random Forests
  • Overview
  • Why RF works
• Cancer Genomics Application
Classification
• Given a collection of records (the training set).
• Each record contains a set of attributes; one of the attributes is the class/label.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification (Illustration)

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Courtesy: www.cs.kent.edu/~jin/DM07/
Classification Examples
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Several algorithms: Decision Trees, Support Vector Machines, Rule-based Methods, etc.
Decision Tree (Example)

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (a decision tree over the splitting attributes):
• Refund = Yes → NO
• Refund = No → split on MarSt:
  • Married → NO
  • Single, Divorced → split on TaxInc:
    • < 80K → NO
    • > 80K → YES
Decision Tree (Another Example)

Same training data as above; a different tree also fits it:
• MarSt = Married → NO
• MarSt = Single, Divorced → split on Refund:
  • Refund = Yes → NO
  • Refund = No → split on TaxInc:
    • < 80K → NO
    • > 80K → YES

There could be more than one tree that fits the same data!
Applying the Model

Once the decision tree is built, the model can be used to classify an unseen record. Using the first tree (Refund, then MarSt, then TaxInc):

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund = No, then Marital Status = Married, so we assign Cheat = "No".
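In code, this model is just nested conditionals. A minimal sketch (my own, not from the lecture; attribute names follow the example and income is in thousands):

```python
def predict_cheat(record):
    """Route a record through the example tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: fall through to the Taxable Income test.
    # (The example record never reaches this test, so the 80K boundary
    # handling here is an arbitrary choice.)
    return "No" if record["TaxableIncome"] < 80 else "Yes"

# The test record from the slide: Refund=No, Married, 80K -> "No"
print(predict_cheat({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))
```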
Tumor Classification (Example)

[Figure: tumor (T) and normal (N) samples/patients plotted by expression in (gene1, gene2). A decision tree partitions the plane with axis-parallel tests such as x < 1, y < 1, and y < 0.5 (boundaries at x = 1, y = 1, y = 0.5); e.g., the point (1.5, 0.8) is routed through these tests to a T or N leaf.]
Decision Tree Algorithms
• Many algorithms:
  • Hunt's Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT
Hunt's Algorithm (General Idea)

Let Dt be the set of training records that reach a node t. General procedure:
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
• If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. (A minimal Python sketch follows the illustration below.)
Hunt's Algorithm (Illustration)

Applied to the Refund / Marital Status / Taxable Income training data from the earlier example, the tree grows in stages:
1. Start with a single leaf: Don't Cheat (the default/majority class).
2. Split on Refund: Yes → Don't Cheat; No → Don't Cheat.
3. Refine the Refund = No branch by splitting on Marital Status: Married → Don't Cheat; Single, Divorced → split further.
4. Split the Single/Divorced branch on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
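A minimal recursive sketch of this procedure (my own illustration; a real implementation would choose the splitting attribute with an impurity criterion, covered next, rather than taking the first remaining attribute):

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Hunt's algorithm sketch. Each record is (features_dict, label)."""
    if not records:                        # empty D_t -> leaf with default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # all records in one class -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(labels).most_common(1)[0][0]
    # Multi-way split on the chosen attribute; recurse on each subset.
    tree = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [r for r in records if r[0][attr] == value]
        tree[value] = hunt(subset, rest, majority)
    return (attr, tree)

# Tiny demo on three records:
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"], "No"))
```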
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Main questions:
  • How to split the records?
    • Binary or multi-way split?
    • How to determine the best split?
  • When to stop splitting?
Test Condition (Splitting Based on Nominal/Ordinal Attributes)
• Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
• Binary split: divide the values into two subsets, e.g. CarType → {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}. Need to find the optimal partitioning.
Test Condition (Splitting Based on Continuous Attributes)
• (i) Binary split: e.g. Taxable Income > 80K? → Yes / No
• (ii) Multi-way split: e.g. Taxable Income → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Binary vs. Multi-way Split: Which Is Best?
http://www.cse.msu.edu/~cse802/DecisionTrees.pdf
Determining Best Split

Before splitting: 10 records of class C0, 10 records of class C1. Candidate attributes:
• Own Car? (Yes / No): Yes → C0: 6, C1: 4; No → C0: 4, C1: 6
• Car Type? (Family / Sports / Luxury): Family → C0: 1, C1: 3; Sports → C0: 8, C1: 0; Luxury → C0: 1, C1: 7
• Student ID? (c1, …, c20): each value holds a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which attribute is the best for splitting?
Determining Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  • C0: 5, C1: 5 is non-homogeneous (high degree of impurity)
  • C0: 9, C1: 1 is homogeneous (low degree of impurity)
Measures of Node Impurity
• Gini index: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
• Entropy: $Entropy(t) = -\sum_j p(j \mid t)\, \log p(j \mid t)$
• Misclassification error: $Error(t) = 1 - \max_i P(i \mid t)$
Computing Measures of Node Impurity (Gain of a Split)

Before splitting, the parent node holds counts (C0: N00, C1: N01) and has impurity M0. Candidate split A? (Yes → Node N1, No → Node N2) produces children with counts (N10, N11) and (N20, N21) and impurities M1 and M2; candidate split B? (Yes → Node N3, No → Node N4) produces children with counts (N30, N31) and (N40, N41) and impurities M3 and M4. With M12 and M34 the weighted impurities of the children under A and B, compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.
Computing Measures of Node Impurity (Gini Index)
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
  (NOTE: p(j | t) is the relative frequency of class j at node t)
• Maximum (1 - 1/nc, with nc the number of classes) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.
Example (Gini Index of a Node)

Split A? sends the records to Node N1 (C0: 0, C1: 6) and Node N2 (C0: 1, C1: 5).

Node N1: P(C0) = 0/6 = 0, P(C1) = 6/6 = 1
Gini(N1) = 1 - P(C0)^2 - P(C1)^2 = 1 - 0 - 1 = 0

Node N2: P(C0) = 1/6, P(C1) = 5/6
Gini(N2) = 1 - (1/6)^2 - (5/6)^2 = 0.278
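A quick way to check these numbers: a small Python helper (illustrative, not from the slides) that computes the Gini index from per-class record counts.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # Node N1: 0.0
print(gini([1, 5]))  # Node N2: ~0.278
```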
Splitting Based on the Gini Index
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
  where n_i is the number of records at child i and n is the number of records at node p. For a binary split this is $\frac{n_L}{n} GINI(L) + \frac{n_R}{n} GINI(R)$, with n_L and n_R the number of records at the left and right child nodes.
• Split on the attribute that minimizes GINI_split, or equivalently maximizes the reduction Gini(parent) - GINI_split.
Splitting Based on the Gini Index (Example)
• Splits into two partitions.
• Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6; Gini(parent) = 0.500.
Split B? sends the records to Node N1 (C1: 5, C2: 2) and Node N2 (C1: 1, C2: 4).

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Gain: Gini_split(B) = 0.500 - 0.371 = 0.129
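The weighted-children computation in code (again an illustrative helper, not lecture code), reproducing the example above:

```python
def gini(counts):
    """Gini index of a node from per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]                          # Gini(parent) = 0.5
kids = [[5, 2], [1, 4]]                  # nodes N1 and N2
print(gini_split(kids))                  # ~0.371
print(gini(parent) - gini_split(kids))   # gain ~0.129
```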
Measures of Node Impurity (Entropy)
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
• Measures the homogeneity of a node.
• Maximum (log nc) when records are equally distributed among all classes, implying the least information.
• Minimum (0.0) when all records belong to one class, implying the most information.
• Entropy-based computations are similar to the GINI index computations.
Example (Entropy)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
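The same three values from a short Python helper (illustrative, with the usual convention 0 log 0 = 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class counts (0 log 0 taken as 0)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # ~0.65
print(entropy([2, 4]))  # ~0.92
```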
Splitting Based on Entropy: Information Gain
• Information gain:
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$
  where parent node p is split into k partitions and n_i is the number of records in partition i.
• Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
• Used in ID3 and C4.5.
• Disadvantage: tends to prefer splits that result in a large number of partitions, each small but pure (see the sketch below).
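A short illustration of both the formula and its disadvantage, reusing the counts from the "Determining Best Split" slide (my own code, not from the lecture):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    """Information gain = parent entropy minus weighted child entropy."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [10, 10]
car_type = [[1, 3], [8, 0], [1, 7]]         # 3 partitions, informative
student_id = [[1, 0]] * 10 + [[0, 1]] * 10  # 20 pure singleton partitions

print(gain(parent, car_type))    # ~0.62
print(gain(parent, student_id))  # 1.0: maximal gain, yet useless for generalization
```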
Measures of Node Impurity (Misclassification Error)
• Classification error at a node t:
  $Error(t) = 1 - \max_i P(i \mid t)$
• Measures the misclassification error made by a node.
• Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.
• Misclassification-based computations are similar to the GINI index and entropy computations.
Example (Misclassification Error)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
Comparison (Splitting Criteria)
[Figure: for a two-class problem, Entropy, Gini, and Misclassification error plotted as functions of p, the fraction of records in one class; all three peak at p = 0.5 and vanish at p = 0 and p = 1.]
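The three curves can be tabulated directly (my own snippet, matching the figure's setup):

```python
import numpy as np

# Two-class problem: impurity as a function of p = P(class 1) at the node.
p = np.linspace(0.1, 0.9, 9)
gini = 1 - p**2 - (1 - p)**2
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
misclass = 1 - np.maximum(p, 1 - p)

for row in zip(p, gini, entropy, misclass):
    print("p=%.2f  gini=%.3f  entropy=%.3f  error=%.3f" % row)
# All three peak at p = 0.5 and approach 0 at p = 0 or 1.
```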
Summary (Tree Induction)
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Questions:
  • How to split the records?
    • Binary or multi-way split? A multi-way split can be expressed as a sequence of binary splits.
    • How to determine the best splitting features? Gini / Entropy / Misclassification.
  • When to stop splitting?
When to Stop Splitting?
• When a maximal node depth is reached.
  • If splitting is stopped too early, the error on the training data is not sufficiently low and performance on the test data will suffer (underfitting).
• When the leaf nodes don't have any impurity.
  • If we continue to grow the tree fully, until each leaf node reaches the lowest impurity, the data have typically been overfit; in the limit, each leaf node holds only one pattern!
[Figure: training vs. test error as a function of the number of nodes. Tom Mitchell, 1997]
When to Stop Splitting? (continued)
• Stop splitting when the best candidate split at a node reduces the impurity by less than a preset amount (threshold).
  • How to set the threshold?
• Stop when splitting a node does not lead to an information gain.
  • This depends on the metric used for information gain.
Comparison (Gini vs. Misclassification)

Parent: C1 = 7, C2 = 3; Gini(parent) = 0.42.
Split A? sends the records to Node N1 (C1: 3, C2: 0) and Node N2 (C1: 4, C2: 3).

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini_split(A) = 0.42 - 0.342 = 0.078

MC(parent) = 1 - max(7/10, 3/10) = 3/10
MC(N1) = 1 - max(3/3, 0/3) = 0
MC(N2) = 1 - max(4/7, 3/7) = 3/7
MC_split(A) = 3/10 - (3/10)(0) - (7/10)(3/7) = 0

The Gini index rewards this split, while the misclassification error sees no improvement.
Comparison (Entropy vs. Misclassification)
[Figure: original tree example; see https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html]
Advantages & Disadvantages

Advantages:
• Inexpensive to construct if the classifying features are given or known
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy comparable to other classification techniques on many simple data sets

Disadvantages:
• Underfitting and overfitting
• Cost of construction is high, since finding the best tree is combinatorial in the features
• Sensitive to the data (high variance), giving less accurate predictions
When to Stop Splitting? (Validation)
• Use validation/cross-validation techniques:
  • Continue splitting until the error on the validation set reaches its minimum.
  • Cross-validation relies on several independently chosen subsets.
• Cross-validation methods (a scikit-learn sketch follows):
  • Hold-out method (50-70% training, 50-30% testing/validation); prone to selection bias
  • K-fold cross-validation
  • Leave-one-out cross-validation (lots of computation time)
  • Bootstrap methods
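Both the hold-out method and K-fold cross-validation are one-liners in scikit-learn. This example (mine, not the lecture's) uses sklearn's built-in breast cancer data set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold-out method: a single train/test split (prone to selection bias).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print("hold-out accuracy:", tree.score(X_te, y_te))

# K-fold cross-validation: average over k independently chosen subsets.
scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```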
Bootstrap Aggregation (Bagging)
• Key variance-reduction technique.
• If each classifier has high variance, the aggregated classifier has a smaller variance than each single classifier.
• The bagging classifier is like an approximation of the true average, computed by replacing the probability distribution with the bootstrap approximation.
[Figure: Classifier 1 and Classifier 2 draw different normal/tumor decision boundaries. A small change in the data can result in a radically different tree structure, i.e., large variance in classification; the variance is reduced by aggregating/averaging.]
Why Bagging Works

Suppose we had m independent training sets, each of k records, and trained one predictor per set:

$(x_1^{(1)}, y_1^{(1)}), (x_2^{(1)}, y_2^{(1)}), \ldots, (x_k^{(1)}, y_k^{(1)}) \leftrightarrow (x, y_1)$
$(x_1^{(2)}, y_1^{(2)}), (x_2^{(2)}, y_2^{(2)}), \ldots, (x_k^{(2)}, y_k^{(2)}) \leftrightarrow (x, y_2)$
$\vdots$
$(x_1^{(m)}, y_1^{(m)}), (x_2^{(m)}, y_2^{(m)}), \ldots, (x_k^{(m)}, y_k^{(m)}) \leftrightarrow (x, y_m)$

At a query point x, the predictors output $y_1, \ldots, y_m$, and we aggregate them: $Z = \frac{1}{m}\sum_{i=1}^{m} y_i$.

If the $y_i$'s are uncorrelated with common mean $E[y]$ and variance $Var(y)$:
$E[Z] = \frac{1}{m}\sum_i E[y_i] = \frac{1}{m}(m\,E[y]) = E[y]$
$Var(Z) = \frac{1}{m^2}\sum_i Var(y_i) = \frac{Var(y)}{m}$

So the aggregate targets the same mean but carries 1/m of the variance.
Why Bagging Works (continued)
• As m increases, the variance is reduced and the aggregated prediction is closer to the true value.
• Unfortunately, we DON'T have several independent sets of samples!
• What should we do?
• Sample uniformly, with replacement, from the one sample set we do have, and use each resampled set as a bootstrap sample set! (A small simulation follows.)
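A minimal numpy simulation of this idea (illustrative; the distribution and sizes are made up): averaging m bootstrap replicates of an estimator shrinks the resampling variance roughly like 1/m. Note the replicates share the same data, so the aggregate converges to the full-sample estimate rather than the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # the one observed sample set

def bagged_estimate(m):
    """Average m bootstrap estimates of the mean (each one a 'predictor')."""
    boots = [rng.choice(data, size=data.size, replace=True).mean()
             for _ in range(m)]
    return np.mean(boots)

for m in (1, 10, 100):
    reps = [bagged_estimate(m) for _ in range(500)]
    print("m=%3d  variance of aggregate: %.5f" % (m, np.var(reps)))
```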
Ensemble Classifiers (Bagging)
[Figure: from the training set, uniformly subsample with replacement to form bootstrap sets; train Classifier 1, Classifier 2, Classifier 3 on them; apply each to the testing set to obtain Prediction 1, Prediction 2, Prediction 3; output the consensus/aggregate of the predictions.]
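In practice this whole pipeline is a single estimator in scikit-learn (my example, not the lecture's; same built-in data set as before):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier()  # high-variance base learner
# BaggingClassifier's default base learner is a decision tree.
bagged = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```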
Bagging (Error Curves)
[Figure: bagging error curves; source: web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx]
Boosting
• Powerful technique for combining multiple "base" classifiers to form a committee whose performance can be significantly better than any of the base classifiers.
• AdaBoost (adaptive boosting) can give good results even if base classifier performance is only slightly better than random; hence the base classifiers are known as weak learners.
• Designed for classification; can be used for regression as well.
Boosting (Procedure)
[Figure: uniformly subsample the training set with replacement and train Classifier 1; test it on the training data and weight the records according to error; sequentially subsample with weights based on misclassification to train Classifier 2 and Classifier 3; at test time combine Prediction 1, Prediction 2, and Prediction 3 into the final prediction.]
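scikit-learn's AdaBoost implements this reweighting loop; a minimal example (mine, not the lecture's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The default weak learner is a decision stump (depth-1 tree): barely better
# than random on its own, but the weighted committee is strong.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
print("AdaBoost 5-fold CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```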
Bagging vs. Boosting Comparison
[Figure: bagging vs. boosting error curves; source: web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx]
Random Forest
• Bagging can be seen as a method to reduce the variance of an estimated prediction function; it mostly helps high-variance classifiers.
• Comparatively, boosting builds weak classifiers one by one, allowing the collection to evolve in the right direction.
• Random forest is a substantial modification of bagging:
  • Builds a collection of de-correlated trees.
  • Similar performance to boosting.
  • Simpler to train and tune than boosting.
Random Forest (Intuition)
Random Forest (Algorithm)
• For b = 1, ..., B: draw a bootstrap sample of the training data and grow a tree on it; at each node, choose the best split among a random subset of m of the p features.
• Aggregate the B trees: majority vote for classification, average for regression.
Why Random Forest Works?
• The random feature subset de-correlates the trees, so averaging them cuts variance further than bagging alone.
Bagging vs. Boosting vs. Random Forest Comparison
Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009, Chap. 15
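As a concrete baseline (my example with scikit-learn's built-in data, not from the lecture), a random forest with the usual sqrt(p) feature subsetting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features controls the random feature subset considered at each split;
# this is the de-correlation knob that distinguishes RF from plain bagging.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print("random forest 5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```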
Cancer Genomics Applications I: Somatic Score
• Few somatic mutations score below 15 (max 255).
• 47 is the score needed for validation, taken as a minimum threshold.
[Figure: distribution of somatic scores, validated vs. invalidated calls.]
Cancer Genomics Applications II: Determining Key Mutations
• How do we determine key mutations?
• Look into evolution for answers:
  "Nothing in Biology Makes Sense Except in the Light of Evolution" - Dobzhansky
Somatic Mutations
• All cancers arise because of somatically acquired changes.
• Not all somatic variations will result in cancer; the ones that do are called "drivers", while "passengers" are passive.
• The idea behind 'driver' and 'passenger' mutations is based on evolution.
• Certain mutations are causally selected in the tumor micro-environment because they lead to increased survival/reproduction.
• Selection. What does this mean?? *?#$%$%*&???!!!!
Non-selection
• Understanding selection requires understanding non-selection.
• Non-selection is an essentially stochastic process, e.g. a random walk or Brownian motion.
• It does not mean uncertainty, if uncertainty means the probability of observing an event is not 1 (e.g., the drunkard's walk).
• The expectation is zero at any instant of time.
• Genetic drift: a change in allele frequency due to randomness.
• Driver mutations are not outcomes of genetic drift, but of selection.
Natural Selection and Cancer
• Natural selection is a non-random* process by which phenotypes become fixated in a population as a function of differential reproduction/survival.
• Cells in pre-malignant and malignant states (tumors) evolve by selecting genes that favor tumorigenesis.
* As opposed to genetic drift, natural selection is deterministic and has direction.
Recurrent Mutations
[Figure: recurrently mutated genes such as TP53 and KRAS.]

Classic Approach to Finding Driver Genes
[Figure: driver mutations identified by recurrence across samples.]
Mutational Frequency
• Is mutational frequency a defining criterion?
CHASM
• Methods independent of the frequency of occurrence are required to determine driver genes.
• Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM)* is a machine learning algorithm designed to identify driver mutations.
• Based on a random forest classifier.
* Bozic I et al., PNAS, 2010
CHASM Feature Set
• For each mutation, 56 default features, such as:
  • PTM enzyme
  • SNP density
  • DNA binding
  • Ortholog-compatible amino acid
  • Exon conservation
  • Superfamily conservation
CHASM Overview
• CHASM has a curated set of 2,488 driver genes taken from COSMIC and other cancer data sets.
• CHASM simulates mutations based on given contexts and calls them passenger mutations; these are mutations that occur by random chance alone.
• A portion of these passenger mutations is isolated to form a null distribution.
CHASM Overview (Workflow)
[Figure: known driver mutations plus simulated passenger mutations (NULL) train a random forest classifier; our mutations are then classified; CHASM scores are computed for the NULL set and for our mutations; statistics (t-test) produce the reported CHASM scores, significance, and FDR.]
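To make the workflow concrete, here is a heavily simplified, hypothetical sketch of a CHASM-style scoring scheme. It is not the real CHASM code; every matrix, size, and threshold below is invented for illustration. It trains a random forest on drivers vs. simulated passengers, scores candidate mutations, and converts the scores to empirical p-values with a Benjamini-Hochberg FDR (the slides mention a t-test; BH on empirical p-values is a common alternative, swapped in here).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CHASM's inputs: feature matrices (one row per
# mutation, ~56 columns in the real tool) for curated drivers, simulated
# passengers, the mutations to be scored, and held-out passengers (the null).
X_driver = rng.normal(1.0, 1.0, size=(300, 56))
X_passenger = rng.normal(0.0, 1.0, size=(300, 56))
X_ours = rng.normal(0.5, 1.0, size=(50, 56))
X_null = rng.normal(0.0, 1.0, size=(500, 56))

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(np.vstack([X_driver, X_passenger]),
       np.r_[np.ones(300), np.zeros(300)])  # 1 = driver, 0 = passenger

# Score = predicted passenger probability (low score => driver-like).
null_scores = rf.predict_proba(X_null)[:, 0]
our_scores = rf.predict_proba(X_ours)[:, 0]

# Empirical p-value against the null, then Benjamini-Hochberg FDR.
p = (1 + (null_scores[None, :] <= our_scores[:, None]).sum(axis=1)) \
    / (1 + len(null_scores))
order = np.argsort(p)
ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
fdr = np.empty_like(p)
fdr[order] = np.minimum.accumulate(ranked[::-1])[::-1]
print("driver-like mutations at FDR < 0.2:", int((fdr < 0.2).sum()))
```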
CHASM Pipeline
1. Identify somatic mutations (validated mutations).
2. Compute the context table.
3. Get CHASM scores, significance, FDR.
4. Set a threshold and identify key genes.
5. Find pathways.
Summary
• Decision Trees
  • Overview
  • Splitting nodes
  • Limitations
• Bagging (Bootstrap Aggregating) and Boosting
  • How bagging reduces variance
  • Boosting
• Random Forests
  • Overview
  • Why RF works
• Cancer Genomics Application