Lecture 2: Classification and Decision Trees
Sanjeev Arora, Elad Hazan
This lecture contains material from the T. Mitchell text "Machine Learning", and slides adapted from David Sontag, Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• Enrolling in the course – priorities
• Prerequisites reminder (calculus, linear algebra, discrete math, probability, data structures & graph algorithms)
• Movie tickets
• Slides
• Exercise 1 – theory – due in one week, in class
Agenda
This course: basic principles of how to design machines/programs that act "intelligently."
• Recall the crow/fox example.
• Start with a simpler task – classification, learning from examples
• Today: still naive AI. Next week: statistical learning theory
Classification
Goal: Find best mapping from domain (features) to output (labels)
• Given a document (email), classify it as spam or ham. Features = words, labels = {spam, ham}
• Given a picture, classify whether it contains a chair or not. Features = bits in a bitmap image, labels = {chair, no chair}
GOAL: automatic machine that learns from examples
Terminology for learning from examples:
• Set aside a "training set" of examples, train a classification machine
• Test on a "test set", to see how well the machine performs on unseen examples
Classifying fuel efficiency
From the UCI repository (thanks to Ross Quinlan)
• 40 data points
• Goal: predict MPG
• Need to find: f : X → Y
• Discrete data (for now)
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
(The attribute columns are X; the mpg column is the label Y.)
Decision trees for classification
• Why use decision trees?
• What is their expressive power?
• Can they be constructed automatically?
• How accurately can they classify?
• What makes a good decision tree besides accuracy on given examples?
Some real examples (from Russell & Norvig, Mitchell):
• BP's GasOIL system for separating gas and oil on offshore platforms – decision trees replaced a hand-designed rule system with 2500 rules. The C4.5-based system outperformed human experts and saved BP millions. (1986)
• Learning to fly a Cessna on a flight simulator by watching human experts fly the simulator (1992)
• Can also learn to play tennis, analyze C-section risk, etc.
Decision trees for classification
• Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful (expressive)
• C4.5 and CART – from the "top 10 data mining methods" – very popular
• This Thu: we'll see why not…
Decision trees for classification
Decision trees f : X → Y
• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify an input x: traverse the tree from root to leaf, output the label y (see the code sketch below)
• Can we construct a tree automatically?
[Figure: an example tree. The root tests Cylinders with branches 3, 4, 5, 6, 8; three branches end directly in good/bad leaves, the 4-cylinder branch tests Maker (america / asia / europe) and the 8-cylinder branch tests Horsepower (low / med / high), each ending in good/bad leaves.]
Human interpretable!
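As a concrete illustration of this traversal, here is a minimal Python sketch: a node structure, the root-to-leaf classification loop, and a fragment shaped like the MPG tree. The exact leaf labels below are illustrative assumptions, not the tree from the slide.

```python
# A minimal decision-tree sketch: internal nodes test one attribute,
# with one child per attribute value; leaves hold a class label.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute  # the attribute this node tests
        self.children = children    # dict: attribute value -> subtree

def classify(tree, x):
    """Traverse from root to leaf; x maps attribute names to values."""
    while isinstance(tree, Node):
        tree = tree.children[x[tree.attribute]]
    return tree.label

# Illustrative fragment (leaf labels are assumptions, not the slide's tree):
tree = Node("cylinders", {
    3: Leaf("good"),
    4: Node("maker", {"america": Leaf("bad"),
                      "asia": Leaf("good"),
                      "europe": Leaf("good")}),
    5: Leaf("bad"),
    6: Leaf("bad"),
    8: Node("horsepower", {"low": Leaf("good"),
                           "med": Leaf("bad"),
                           "high": Leaf("bad")}),
})

print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> good
```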
Expressive power of DTs
What kind of functions can they potentially represent?
• Boolean functions? F : {0,1}³ ↦ {0,1}

X1  X2  X3  F(X1,X2,X3)
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   0
1   0   0   0
1   0   1   0
1   1   0   0
1   1   1   0
[Figure: a complete decision tree for F – split on x1 at the root, then x2, then x3; the 0/1 branch labels lead to eight leaves, each holding the corresponding truth-table value (all 0 here).]
Is simpler better?
The same function F : {0,1}³ ↦ {0,1} (identically 0), with the same truth table as above.
[Figure: the simplest tree representing F – a single leaf labeled 0.]
What is the Simplest Tree?
(The MPG dataset table from before.)
Is this a good tree?
The simplest tree predicts mpg = bad on every input.
[22+, 18-] means: correct on 22 examples, incorrect on 18 examples.
Are all decision trees equal?
• Many trees can represent the same concept
• But not all trees will have the same size!
– e.g., (A and B) or (not A and C)
[Figure: two trees for (A and B) or (not A and C). The first splits on A at the root: if A = t, test B (+ / −); if A = f, test C (+ / −) – four leaves in total. The second splits on B at the root and must still test A (and C) on both branches, yielding a larger tree with repeated tests of A.]
• Which tree do we prefer?
Learning the simplest decision tree is NP-hard
• Formal justification – statistical learning theory (next lecture)
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
– Start from an empty decision tree
– Split on the next best attribute (feature)
– Recurse
Learning Algorithm for Decision Trees
Training set: S = {(x^1, y^1), ..., (x^N, y^N)}, where each example x = (x_1, ..., x_d) has binary features x_j ∈ {0,1} and label y ∈ {0,1}.
What happens if features are not binary? What about regression?
DT algorithms differ on this choice!
• ID3
• C4.5
• CART
A Decision Stump
Key idea: greedily learn trees using recursion
Take the original dataset and partition it according to the value of the attribute we split on: the records in which cylinders = 4, the records in which cylinders = 5, the records in which cylinders = 6, and the records in which cylinders = 8.
Recursive Step
For each partition (cylinders = 4, 5, 6, 8), build a tree from those records.
Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)
A full tree
Splitting: choosing a good attribute

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1: X1 = t gives Y=t: 4, Y=f: 0; X1 = f gives Y=t: 1, Y=f: 3.
Split on X2: X2 = t gives Y=t: 3, Y=f: 1; X2 = f gives Y=t: 2, Y=f: 2.

Would we prefer to split on X1 or X2?
Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
Measuring uncertainty
• Good split if we are more certain about the classification after the split
– Deterministic good (all true or all false)
– Uniform distribution bad
– What about distributions in between?

Uniform: P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4
Skewed:  P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8
Entropy
Entropy H(Y) of a random variable Y:
H(Y) = - Σ_y P(Y=y) log2 P(Y=y)
More uncertainty, more entropy!
Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
[Figure: entropy of a coin flip as a function of the probability of heads – zero at p = 0 and p = 1, maximal at p = 1/2.]
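A minimal sketch of this computation, estimating P(Y=y) from label counts; the function name and the coin-flip examples are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) * log2 P(Y=y), with P estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The coin-flip curve: zero entropy when deterministic, 1 bit at p = 1/2.
print(entropy(["heads", "tails"]))         # 1.0
print(entropy(["heads"] * 3 + ["tails"]))  # ~0.811
print(entropy(["heads"] * 4))              # -0.0, i.e. zero: no uncertainty
```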
• "High entropy"
– Y is from a uniform-like distribution
– Flat histogram
– Values sampled from it are less predictable
• "Low entropy"
– Y is from a varied (peaks and valleys) distribution
– Histogram has many lows and highs
– Values sampled from it are more predictable
(Slide from Vibhav Gogate)
Entropy Example

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F

P(Y=t) = 5/6, P(Y=f) = 1/6
H(Y) = - 5/6 log2 5/6 - 1/6 log2 1/6 ≈ 0.65
Conditional Entropy
Conditional entropy H(Y | X) of a random variable Y conditioned on a random variable X:
H(Y | X) = Σ_x P(X=x) H(Y | X=x)

Example (the same six records as above):
P(X1=t) = 4/6, P(X1=f) = 2/6
X1 = t gives Y=t: 4, Y=f: 0; X1 = f gives Y=t: 1, Y=f: 1
H(Y | X1) = - 4/6 (1 log2 1 + 0 log2 0) - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
= 2/6 ≈ 0.33
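The same computation in code – a sketch that groups the labels by the value of X and averages the per-group entropies, reproducing the 1/3 above (helper names are our own):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_x P(X=x) * H(Y | X=x)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

# The slide's example: H(Y|X1) = 4/6 * 0 + 2/6 * 1 = 1/3.
X1 = ["T", "T", "T", "T", "F", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]
print(conditional_entropy(X1, Y))  # ~0.333
```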
Information gain
• Decrease in entropy (uncertainty) after splitting:
IG(X) = H(Y) - H(Y | X)
In our running example:
IG(X1) = H(Y) - H(Y | X1) = 0.65 - 0.33 = 0.32
IG(X1) > 0 → we prefer the split!
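Putting the two together, a sketch of the information-gain computation (definitions repeated so the snippet is self-contained); it reproduces the 0.32 of the running example:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X): the drop in uncertainty from splitting on X."""
    return entropy(ys) - conditional_entropy(xs, ys)

X1 = ["T", "T", "T", "T", "F", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]
print(information_gain(X1, Y))  # ~0.32
```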
Learning decision trees
• Start from an empty decision tree
• Split on the next best attribute (feature)
– Use, for example, information gain to select the attribute: split on arg max_i IG(X_i) = arg max_i [H(Y) - H(Y | X_i)]
• Recurse
Look at all the information gains…
Suppose we want to predict MPG. [Figure: the information gain of each attribute on the MPG data.]
When to stop?
First split looks good! But when do we stop?
Base Case One: don't split a node if all matching records have the same output value.
Base Case Two: don't split a node if all data points are identical on the remaining attributes.
Base Cases: An idea
• Base Case One: if all records in the current data subset have the same output, then don't recurse
• Base Case Two: if all records have exactly the same set of input attributes, then don't recurse
Proposed Base Case 3: if all attributes have small information gain, then don't recurse.
• This is not a good idea
The problem with proposed case 3

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

y = a XOR b
The information gains: IG(a) = IG(b) = 0, since splitting on either attribute alone leaves a 50/50 distribution over y in both branches.
If we omit proposed case 3 (same data, y = a XOR b), the resulting decision tree splits on a and then on b, and classifies every record correctly.
Instead, perform pruning after building a tree.
Non-Boolean Features
• Real-valued features?
Real → threshold: test WEIGHT > t? (branches: No / Yes)
• Number of thresholds ≤ number of different values in the dataset
• Can choose the threshold based on information gain, as in the sketch below
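A sketch of this threshold choice: scan the midpoints between consecutive distinct values and keep the one whose induced binary split has the highest information gain. The weights, labels, and function names below are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)

def best_threshold(values, ys):
    """At most (#distinct values - 1) candidate thresholds need checking."""
    vs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
    return max(candidates,
               key=lambda t: information_gain([v > t for v in values], ys))

# Hypothetical weights with a clean cut near 3000:
weights = [2100, 2300, 2800, 3200, 3500, 4100]
labels  = ["good", "good", "good", "bad", "bad", "bad"]
print(best_threshold(weights, labels))  # 3000.0
```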
Summary: Building Decision Trees
BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
• If all input values are the same, return a leaf node that says "predict the majority output"
• Else find the attribute X with the highest InfoGain
• Suppose X has n_X distinct values (i.e., X has arity n_X):
– Create a non-leaf node with n_X children.
– The i-th child should be built by calling BuildTree(DS_i, Output), where DS_i contains the records in DataSet where X = the i-th value of X.
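A runnable Python sketch of this procedure, with records as attribute→value dicts; the tree representation (a label for a leaf, an (attribute, children) tuple for an internal node) is our own choice, not prescribed by the slides:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    groups = defaultdict(list)
    for r, y in zip(records, labels):
        groups[r[attr]].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(records, labels, attributes):
    # Base case one: all output values are the same -> predict that output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: all inputs identical (or no attributes left)
    # -> predict the majority output.
    if not attributes or all(r == records[0] for r in records):
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, labels, a))
    groups = defaultdict(list)
    for r, y in zip(records, labels):
        groups[r[best]].append((r, y))
    remaining = [a for a in attributes if a != best]
    children = {v: build_tree([r for r, _ in g], [y for _, y in g], remaining)
                for v, g in groups.items()}
    return (best, children)

# Tiny example: learns to split on x1, then x2.
records = [{"x1": "T", "x2": "T"}, {"x1": "T", "x2": "F"},
           {"x1": "F", "x2": "T"}, {"x1": "F", "x2": "F"}]
print(build_tree(records, ["T", "T", "T", "F"], ["x1", "x2"]))
# -> ('x1', {'T': 'T', 'F': ('x2', {'T': 'T', 'F': 'F'})})
```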
Machine Space Search
• ID3 / C4.5 / CART search for a succinct tree that perfectly fits the data.
• They are not going to find it in general (the problem is NP-hard).
• Entropy-guided splitting is a well-performing heuristic; others exist.
• Why should we search for a small tree at all?
MPG Test set error
The test set error is much worse than the training set error…
…why?
Decision trees will overfit
Fitting a polynomial
t = sin(2πx) + ε
Regression using a polynomial of degree M.
[Figures from "Pattern Recognition and Machine Learning", Bishop: polynomial fits of increasing degree M to noisy samples of sin(2πx); low degrees underfit, high degrees fit the noise.]
Overfitting
• Precise characterization – statistical learning theory
• Special techniques to prevent overfitting in DT learning
• Pruning the tree, e.g. "reduced error" pruning:
Do until further pruning is harmful:
1. Evaluate the impact on the validation (test) set of the data of pruning each possible node (and its subtree)
2. Greedily remove the one that most improves validation (test) error
Concepts we have encountered (and they will appear again in the course!)
• Greedy algorithms.
• Trying to find succinct machines.
• Computational efficiency of an algorithm (running time).
• Overfitting.
Questions
• Why are smaller trees preferable to larger trees (in theory)?
• When can a tree generalize to unseen examples, and how well?
• How many examples are needed to build a tree that generalizes well?
• How to identify and prevent overfitting?
• Are there other natural classification machines, and how do they compare?
Need to reason more generally about "what learning means" – TBC on Thu…