Tree-Based Methods (Ensemble Schemes) · 2020-01-15
TRANSCRIPT
Overview
• Decision Trees
  • Overview
  • Splitting nodes
  • Limitations
• Bagging (Bootstrap Aggregating) and Boosting
  • How bagging reduces variance
  • Boosting
• Random Forests
  • Overview
  • Why RF works
• Cancer Genomics Application
Classification
• Given a collection of records (the training set).
• Each record contains a set of attributes; one of the attributes is the class/label.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Classification (Illustration)

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Courtesy: www.cs.kent.edu/~jin/DM07/
Classification Examples
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Several algorithms: Decision Trees, Support Vector Machines, Rule-based Methods, etc.
Decision Tree (Example)

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (a decision tree over the splitting attributes):
• Refund = Yes → NO
• Refund = No → split on MarSt:
  • Married → NO
  • Single, Divorced → split on TaxInc:
    • < 80K → NO
    • > 80K → YES
Decision Tree (Another Example)

Same training data as above; a different tree also fits it:
• MarSt = Married → NO
• MarSt = Single, Divorced → split on Refund:
  • Refund = Yes → NO
  • Refund = No → split on TaxInc:
    • < 80K → NO
    • > 80K → YES

There could be more than one tree that fits the same data!
Applying the Model

Once the decision tree is built, the model can be used to classify an unseen record. Using the first tree (Refund, then MarSt, then TaxInc):

Test Data:
Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund = No, then Marital Status = Married, so we assign Cheat = "No".
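In code, this model is just nested conditionals. A minimal sketch (my own, not from the lecture; attribute names follow the example and income is in thousands):

```python
def predict_cheat(record):
    """Route a record through the example tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: fall through to the Taxable Income test.
    # (The example record never reaches this test, so the 80K boundary
    # handling here is an arbitrary choice.)
    return "No" if record["TaxableIncome"] < 80 else "Yes"

# The test record from the slide: Refund=No, Married, 80K -> "No"
print(predict_cheat({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))
```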
Tumor Classification (Example)

[Figure: tumor (T) and normal (N) samples/patients plotted by expression in (gene1, gene2). A decision tree partitions the plane with axis-parallel tests such as x < 1, y < 1, and y < 0.5 (boundaries at x = 1, y = 1, y = 0.5); e.g., the point (1.5, 0.8) is routed through these tests to a T or N leaf.]
Decision Tree Algorithms
• Many algorithms:
  • Hunt's Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT
Hunt's Algorithm (General Idea)

Let Dt be the set of training records that reach a node t. General procedure:
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt.
• If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. (A minimal Python sketch follows the illustration below.)
Hunt's Algorithm (Illustration)

Applied to the Refund / Marital Status / Taxable Income training data from the earlier example, the tree grows in stages:
1. Start with a single leaf: Don't Cheat (the default/majority class).
2. Split on Refund: Yes → Don't Cheat; No → Don't Cheat.
3. Refine the Refund = No branch by splitting on Marital Status: Married → Don't Cheat; Single, Divorced → split further.
4. Split the Single/Divorced branch on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
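A minimal recursive sketch of this procedure (my own illustration; a real implementation would choose the splitting attribute with an impurity criterion, covered next, rather than taking the first remaining attribute):

```python
from collections import Counter

def hunt(records, attributes, default_class):
    """Hunt's algorithm sketch. Each record is (features_dict, label)."""
    if not records:                        # empty D_t -> leaf with default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # all records in one class -> leaf
        return labels[0]
    if not attributes:                     # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    attr, rest = attributes[0], attributes[1:]
    majority = Counter(labels).most_common(1)[0][0]
    # Multi-way split on the chosen attribute; recurse on each subset.
    tree = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [r for r in records if r[0][attr] == value]
        tree[value] = hunt(subset, rest, majority)
    return (attr, tree)

# Tiny demo on three records:
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No", "MarSt": "Married"}, "No"),
        ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"], "No"))
```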
Tree Induction
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Main questions:
  • How to split the records?
    • Binary or multi-way split?
    • How to determine the best split?
  • When to stop splitting?
Test Condition (Splitting Based on Nominal/Ordinal Attributes)
• Multi-way split: use as many partitions as distinct values, e.g. CarType → {Family}, {Sports}, {Luxury}.
• Binary split: divide the values into two subsets, e.g. CarType → {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}. Need to find the optimal partitioning.
Test Condition (Splitting Based on Continuous Attributes)
• (i) Binary split: e.g. Taxable Income > 80K? → Yes / No
• (ii) Multi-way split: e.g. Taxable Income → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
Binary vs. Multi-way Split: Which Is Best?
http://www.cse.msu.edu/~cse802/DecisionTrees.pdf
Determining Best Split

Before splitting: 10 records of class C0, 10 records of class C1. Candidate attributes:
• Own Car? (Yes / No): Yes → C0: 6, C1: 4; No → C0: 4, C1: 6
• Car Type? (Family / Sports / Luxury): Family → C0: 1, C1: 3; Sports → C0: 8, C1: 0; Luxury → C0: 1, C1: 7
• Student ID? (c1, …, c20): each value holds a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which attribute is the best for splitting?
Determining Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  • C0: 5, C1: 5 is non-homogeneous (high degree of impurity)
  • C0: 9, C1: 1 is homogeneous (low degree of impurity)
Measures of Node Impurity
• Gini index: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
• Entropy: $Entropy(t) = -\sum_j p(j \mid t)\, \log p(j \mid t)$
• Misclassification error: $Error(t) = 1 - \max_i P(i \mid t)$
Computing Measures of Node Impurity (Gain of a Split)

Before splitting, the parent node holds counts (C0: N00, C1: N01) and has impurity M0. Candidate split A? (Yes → Node N1, No → Node N2) produces children with counts (N10, N11) and (N20, N21) and impurities M1 and M2; candidate split B? (Yes → Node N3, No → Node N4) produces children with counts (N30, N31) and (N40, N41) and impurities M3 and M4. With M12 and M34 the weighted impurities of the children under A and B, compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.
Computing Measures of Node Impurity (Gini Index)
• Gini index for a given node t:
  $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
  (NOTE: p(j | t) is the relative frequency of class j at node t)
• Maximum (1 - 1/nc, with nc the number of classes) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.
Example (Gini Index of a Node)

Split A? sends the records to Node N1 (C0: 0, C1: 6) and Node N2 (C0: 1, C1: 5).

Node N1: P(C0) = 0/6 = 0, P(C1) = 6/6 = 1
Gini(N1) = 1 - P(C0)^2 - P(C1)^2 = 1 - 0 - 1 = 0

Node N2: P(C0) = 1/6, P(C1) = 5/6
Gini(N2) = 1 - (1/6)^2 - (5/6)^2 = 0.278
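A quick way to check these numbers: a small Python helper (illustrative, not from the slides) that computes the Gini index from per-class record counts.

```python
def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # Node N1: 0.0
print(gini([1, 5]))  # Node N2: ~0.278
```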
Splitting Based on the Gini Index
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
  where n_i is the number of records at child i and n is the number of records at node p. For a binary split this is $\frac{n_L}{n} GINI(L) + \frac{n_R}{n} GINI(R)$, with n_L and n_R the number of records at the left and right child nodes.
• Split on the attribute that minimizes GINI_split, or equivalently maximizes the reduction Gini(parent) - GINI_split.
Splitting Based on the Gini Index (Example)
• Splits into two partitions.
• Effect of weighing partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6; Gini(parent) = 0.500.
Split B? sends the records to Node N1 (C1: 5, C2: 2) and Node N2 (C1: 1, C2: 4).

Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
Gain: Gini_split(B) = 0.500 - 0.371 = 0.129
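The weighted-children computation in code (again an illustrative helper, not lecture code), reproducing the example above:

```python
def gini(counts):
    """Gini index of a node from per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]                          # Gini(parent) = 0.5
kids = [[5, 2], [1, 4]]                  # nodes N1 and N2
print(gini_split(kids))                  # ~0.371
print(gini(parent) - gini_split(kids))   # gain ~0.129
```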
Measures of Node Impurity (Entropy)
• Entropy at a given node t:
  $Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
• Measures the homogeneity of a node.
• Maximum (log nc) when records are equally distributed among all classes, implying the least information.
• Minimum (0.0) when all records belong to one class, implying the most information.
• Entropy-based computations are similar to the GINI index computations.
Example (Entropy)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
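The same three values from a short Python helper (illustrative, with the usual convention 0 log 0 = 0):

```python
from math import log2

def entropy(counts):
    """Entropy of a node from per-class counts (0 log 0 taken as 0)."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))  # 0.0
print(entropy([1, 5]))  # ~0.65
print(entropy([2, 4]))  # ~0.92
```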
Splitting Based on Entropy: Information Gain
• Information gain:
  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$
  where parent node p is split into k partitions and n_i is the number of records in partition i.
• Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
• Used in ID3 and C4.5.
• Disadvantage: tends to prefer splits that result in a large number of partitions, each small but pure (see the sketch below).
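A short illustration of both the formula and its disadvantage, reusing the counts from the "Determining Best Split" slide (my own code, not from the lecture):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain(parent, children):
    """Information gain = parent entropy minus weighted child entropy."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

parent = [10, 10]
car_type = [[1, 3], [8, 0], [1, 7]]         # 3 partitions, informative
student_id = [[1, 0]] * 10 + [[0, 1]] * 10  # 20 pure singleton partitions

print(gain(parent, car_type))    # ~0.62
print(gain(parent, student_id))  # 1.0: maximal gain, yet useless for generalization
```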
Measures of Node Impurity (Misclassification Error)
• Classification error at a node t:
  $Error(t) = 1 - \max_i P(i \mid t)$
• Measures the misclassification error made by a node.
• Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.
• Misclassification-based computations are similar to the GINI index and entropy computations.
Example (Misclassification Error)

C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
Comparison (Splitting Criteria)
[Figure: for a two-class problem, Entropy, Gini, and Misclassification error plotted as functions of p, the fraction of records in one class; all three peak at p = 0.5 and vanish at p = 0 and p = 1.]
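The three curves can be tabulated directly (my own snippet, matching the figure's setup):

```python
import numpy as np

# Two-class problem: impurity as a function of p = P(class 1) at the node.
p = np.linspace(0.1, 0.9, 9)
gini = 1 - p**2 - (1 - p)**2
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
misclass = 1 - np.maximum(p, 1 - p)

for row in zip(p, gini, entropy, misclass):
    print("p=%.2f  gini=%.3f  entropy=%.3f  error=%.3f" % row)
# All three peak at p = 0.5 and approach 0 at p = 0 or 1.
```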
Summary (Tree Induction)
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
• Questions:
  • How to split the records?
    • Binary or multi-way split? A multi-way split can be expressed as a sequence of binary splits.
    • How to determine the best splitting features? Gini / Entropy / Misclassification.
  • When to stop splitting?
When to Stop Splitting?
• When a maximal node depth is reached.
  • If splitting is stopped too early, the error on the training data is not sufficiently low and performance on the test data will suffer (underfitting).
• When the leaf nodes don't have any impurity.
  • If we continue to grow the tree fully, until each leaf node reaches the lowest impurity, the data have typically been overfit; in the limit, each leaf node holds only one pattern!
[Figure: training vs. test error as a function of the number of nodes. Tom Mitchell, 1997]
When to Stop Splitting? (continued)
• Stop splitting when the best candidate split at a node reduces the impurity by less than a preset amount (threshold).
  • How to set the threshold?
• Stop when splitting a node does not lead to an information gain.
  • This depends on the metric used for information gain.
Comparison (Gini vs. Misclassification)

Parent: C1 = 7, C2 = 3; Gini(parent) = 0.42.
Split A? sends the records to Node N1 (C1: 3, C2: 0) and Node N2 (C1: 4, C2: 3).

Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini_split(A) = 0.42 - 0.342 = 0.078

MC(parent) = 1 - max(7/10, 3/10) = 3/10
MC(N1) = 1 - max(3/3, 0/3) = 0
MC(N2) = 1 - max(4/7, 3/7) = 3/7
MC_split(A) = 3/10 - (3/10)(0) - (7/10)(3/7) = 0

The Gini index rewards this split, while the misclassification error sees no improvement.
Comparison (Entropy vs. Misclassification)
[Figure: original tree example; see https://sebastianraschka.com/faq/docs/decisiontree-error-vs-entropy.html]
Advantages & Disadvantages

Advantages:
• Inexpensive to construct if the classifying features are given or known
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy comparable to other classification techniques on many simple data sets

Disadvantages:
• Underfitting and overfitting
• Cost of construction is high, since finding the best tree is combinatorial in the features
• Sensitive to the data (high variance), giving less accurate predictions
When to Stop Splitting? (Validation)
• Use validation/cross-validation techniques:
  • Continue splitting until the error on the validation set reaches its minimum.
  • Cross-validation relies on several independently chosen subsets.
• Cross-validation methods (a scikit-learn sketch follows):
  • Hold-out method (50-70% training, 50-30% testing/validation); prone to selection bias
  • K-fold cross-validation
  • Leave-one-out cross-validation (lots of computation time)
  • Bootstrap methods
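Both the hold-out method and K-fold cross-validation are one-liners in scikit-learn. This example (mine, not the lecture's) uses sklearn's built-in breast cancer data set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold-out method: a single train/test split (prone to selection bias).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=4).fit(X_tr, y_tr)
print("hold-out accuracy:", tree.score(X_te, y_te))

# K-fold cross-validation: average over k independently chosen subsets.
scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```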
Bootstrap Aggregation (Bagging)
• Key variance-reduction technique.
• If each classifier has high variance, the aggregated classifier has a smaller variance than each single classifier.
• The bagging classifier is like an approximation of the true average, computed by replacing the probability distribution with the bootstrap approximation.
[Figure: Classifier 1 and Classifier 2 draw different normal/tumor decision boundaries. A small change in the data can result in a radically different tree structure, i.e., large variance in classification; the variance is reduced by aggregating/averaging.]
Why Bagging Works

Suppose we had m independent training sets, each of k records, and trained one predictor per set:

$(x_1^{(1)}, y_1^{(1)}), (x_2^{(1)}, y_2^{(1)}), \ldots, (x_k^{(1)}, y_k^{(1)}) \leftrightarrow (x, y_1)$
$(x_1^{(2)}, y_1^{(2)}), (x_2^{(2)}, y_2^{(2)}), \ldots, (x_k^{(2)}, y_k^{(2)}) \leftrightarrow (x, y_2)$
$\vdots$
$(x_1^{(m)}, y_1^{(m)}), (x_2^{(m)}, y_2^{(m)}), \ldots, (x_k^{(m)}, y_k^{(m)}) \leftrightarrow (x, y_m)$

At a query point x, the predictors output $y_1, \ldots, y_m$, and we aggregate them: $Z = \frac{1}{m}\sum_{i=1}^{m} y_i$.

If the $y_i$'s are uncorrelated with common mean $E[y]$ and variance $Var(y)$:
$E[Z] = \frac{1}{m}\sum_i E[y_i] = \frac{1}{m}(m\,E[y]) = E[y]$
$Var(Z) = \frac{1}{m^2}\sum_i Var(y_i) = \frac{Var(y)}{m}$

So the aggregate targets the same mean but carries 1/m of the variance.
Why Bagging Works (continued)
• As m increases, the variance is reduced and the aggregated prediction is closer to the true value.
• Unfortunately, we DON'T have several independent sets of samples!
• What should we do?
• Sample uniformly, with replacement, from the one sample set we do have, and use each resampled set as a bootstrap sample set! (A small simulation follows.)
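A minimal numpy simulation of this idea (illustrative; the distribution and sizes are made up): averaging m bootstrap replicates of an estimator shrinks the resampling variance roughly like 1/m. Note the replicates share the same data, so the aggregate converges to the full-sample estimate rather than the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # the one observed sample set

def bagged_estimate(m):
    """Average m bootstrap estimates of the mean (each one a 'predictor')."""
    boots = [rng.choice(data, size=data.size, replace=True).mean()
             for _ in range(m)]
    return np.mean(boots)

for m in (1, 10, 100):
    reps = [bagged_estimate(m) for _ in range(500)]
    print("m=%3d  variance of aggregate: %.5f" % (m, np.var(reps)))
```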
Ensemble Classifiers (Bagging)
[Figure: from the training set, uniformly subsample with replacement to form bootstrap sets; train Classifier 1, Classifier 2, Classifier 3 on them; apply each to the testing set to obtain Prediction 1, Prediction 2, Prediction 3; output the consensus/aggregate of the predictions.]
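In practice this whole pipeline is a single estimator in scikit-learn (my example, not the lecture's; same built-in data set as before):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier()  # high-variance base learner
# BaggingClassifier's default base learner is a decision tree.
bagged = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```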
Bagging (Error Curves)
[Figure: bagging error curves; source: web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx]
Boosting
• Powerful technique for combining multiple "base" classifiers to form a committee whose performance can be significantly better than any of the base classifiers.
• AdaBoost (adaptive boosting) can give good results even if base classifier performance is only slightly better than random; hence the base classifiers are known as weak learners.
• Designed for classification; can be used for regression as well.
Boosting (Procedure)
[Figure: uniformly subsample the training set with replacement and train Classifier 1; test it on the training data and weight the records according to error; sequentially subsample with weights based on misclassification to train Classifier 2 and Classifier 3; at test time combine Prediction 1, Prediction 2, and Prediction 3 into the final prediction.]
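scikit-learn's AdaBoost implements this reweighting loop; a minimal example (mine, not the lecture's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The default weak learner is a decision stump (depth-1 tree): barely better
# than random on its own, but the weighted committee is strong.
ada = AdaBoostClassifier(n_estimators=200, random_state=0)
print("AdaBoost 5-fold CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```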
Bagging vs. Boosting Comparison
[Figure: bagging vs. boosting error curves; source: web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx]
Random Forest
• Bagging can be seen as a method to reduce the variance of an estimated prediction function; it mostly helps high-variance classifiers.
• Comparatively, boosting builds weak classifiers one by one, allowing the collection to evolve in the right direction.
• Random forest is a substantial modification of bagging:
  • Builds a collection of de-correlated trees.
  • Similar performance to boosting.
  • Simpler to train and tune than boosting.
Random Forest (Intuition)
Random Forest (Algorithm)
• For b = 1, ..., B: draw a bootstrap sample of the training data and grow a tree on it; at each node, choose the best split among a random subset of m of the p features.
• Aggregate the B trees: majority vote for classification, average for regression.
Why Random Forest Works?
• The random feature subset de-correlates the trees, so averaging them cuts variance further than bagging alone.
Bagging vs. Boosting vs. Random Forest Comparison
Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009, Chap. 15
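As a concrete baseline (my example with scikit-learn's built-in data, not from the lecture), a random forest with the usual sqrt(p) feature subsetting:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_features controls the random feature subset considered at each split;
# this is the de-correlation knob that distinguishes RF from plain bagging.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print("random forest 5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```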
Cancer Genomics Applications I: Somatic Score
• Few somatic mutations score below 15 (max 255).
• 47 is the score needed for validation, taken as a minimum threshold.
[Figure: distribution of somatic scores, validated vs. invalidated calls.]
Cancer Genomics Applications II: Determining Key Mutations
• How do we determine key mutations?
• Look into evolution for answers:
  "Nothing in Biology Makes Sense Except in the Light of Evolution" - Dobzhansky
Somatic Mutations
• All cancers arise because of somatically acquired changes.
• Not all somatic variations will result in cancer; the ones that do are called "drivers", while "passengers" are passive.
• The idea behind 'driver' and 'passenger' mutations is based on evolution.
• Certain mutations are causally selected in the tumor micro-environment because they lead to increased survival/reproduction.
• Selection. What does this mean?? *?#$%$%*&???!!!!
Non-selection
• Understanding selection requires understanding non-selection.
• Non-selection is an essentially stochastic process, e.g. a random walk or Brownian motion.
• It does not mean uncertainty, if uncertainty means the probability of observing an event is not 1 (e.g., the drunkard's walk).
• The expectation is zero at any instant of time.
• Genetic drift: a change in allele frequency due to randomness.
• Driver mutations are not outcomes of genetic drift, but of selection.
Natural Selection and Cancer
• Natural selection is a non-random* process by which phenotypes become fixated in a population as a function of differential reproduction/survival.
• Cells in pre-malignant and malignant states (tumors) evolve by selecting genes that favor tumorigenesis.
* As opposed to genetic drift, natural selection is deterministic and has direction.
Recurrent Mutations
[Figure: recurrently mutated genes such as TP53 and KRAS.]

Classic Approach to Finding Driver Genes
[Figure: driver mutations identified by recurrence across samples.]
Mutational Frequency
• Is mutational frequency a defining criterion?
CHASM
• Methods independent of the frequency of occurrence are required to determine driver genes.
• Cancer-specific High-throughput Annotation of Somatic Mutations (CHASM)* is a machine learning algorithm designed to identify driver mutations.
• Based on a random forest classifier.
* Bozic I et al., PNAS, 2010
CHASM Feature Set
• For each mutation, 56 default features, such as:
  • PTM enzyme
  • SNP density
  • DNA binding
  • Ortholog-compatible amino acid
  • Exon conservation
  • Superfamily conservation
CHASM Overview
• CHASM has a curated set of 2,488 driver genes taken from COSMIC and other cancer data sets.
• CHASM simulates mutations based on given contexts and calls them passenger mutations; these are mutations that occur by random chance alone.
• A portion of these passenger mutations is isolated to form a null distribution.
CHASM Overview (Workflow)
[Figure: known driver mutations plus simulated passenger mutations (NULL) train a random forest classifier; our mutations are then classified; CHASM scores are computed for the NULL set and for our mutations; statistics (t-test) produce the reported CHASM scores, significance, and FDR.]
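To make the workflow concrete, here is a heavily simplified, hypothetical sketch of a CHASM-style scoring scheme. It is not the real CHASM code; every matrix, size, and threshold below is invented for illustration. It trains a random forest on drivers vs. simulated passengers, scores candidate mutations, and converts the scores to empirical p-values with a Benjamini-Hochberg FDR (the slides mention a t-test; BH on empirical p-values is a common alternative, swapped in here).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for CHASM's inputs: feature matrices (one row per
# mutation, ~56 columns in the real tool) for curated drivers, simulated
# passengers, the mutations to be scored, and held-out passengers (the null).
X_driver = rng.normal(1.0, 1.0, size=(300, 56))
X_passenger = rng.normal(0.0, 1.0, size=(300, 56))
X_ours = rng.normal(0.5, 1.0, size=(50, 56))
X_null = rng.normal(0.0, 1.0, size=(500, 56))

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(np.vstack([X_driver, X_passenger]),
       np.r_[np.ones(300), np.zeros(300)])  # 1 = driver, 0 = passenger

# Score = predicted passenger probability (low score => driver-like).
null_scores = rf.predict_proba(X_null)[:, 0]
our_scores = rf.predict_proba(X_ours)[:, 0]

# Empirical p-value against the null, then Benjamini-Hochberg FDR.
p = (1 + (null_scores[None, :] <= our_scores[:, None]).sum(axis=1)) \
    / (1 + len(null_scores))
order = np.argsort(p)
ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
fdr = np.empty_like(p)
fdr[order] = np.minimum.accumulate(ranked[::-1])[::-1]
print("driver-like mutations at FDR < 0.2:", int((fdr < 0.2).sum()))
```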
CHASM Pipeline
1. Identify somatic mutations (validated mutations).
2. Compute the context table.
3. Get CHASM scores, significance, FDR.
4. Set a threshold and identify key genes.
5. Find pathways.
Summary
• Decision Trees
  • Overview
  • Splitting nodes
  • Limitations
• Bagging (Bootstrap Aggregating) and Boosting
  • How bagging reduces variance
  • Boosting
• Random Forests
  • Overview
  • Why RF works
• Cancer Genomics Application