Lecture 2: Classification and Decision Trees
Sanjeev Arora, Elad Hazan
This lecture contains material from the T. Mitchell text "Machine Learning", and slides adapted from David Sontag, Luke Zettlemoyer, Carlos Guestrin, and Andrew Moore
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Admin
• Enrolling in the course – priorities
• Prerequisites reminder (calculus, linear algebra, discrete math, probability, data structures & graph algorithms)
• Movie tickets
• Slides
• Exercise 1 – theory – due in one week, in class
Agenda
This course: basic principles of how to design machines/programs that act "intelligently."
• Recall the crow/fox example.
• Start with a simpler task – classification, learning from examples
• Today: still naive AI. Next week: statistical learning theory
Classification
Goal: Find best mapping from domain (features) to output (labels)
• Given a document (email), classify it as spam or ham. Features = words, labels = {spam, ham}
• Given a picture, classify whether it contains a chair or not. Features = bits in a bitmap image, labels = {chair, no chair}
GOAL: automatic machine that learns from examples
Terminology for learning from examples:
• Set aside a "training set" of examples, train a classification machine
• Test on a "test set", to see how well the machine performs on unseen examples
Classifying fuel efficiency
From the UCI repository (thanks to Ross Quinlan)
• 40 data points
• Goal: predict MPG
• Need to find: f : X → Y
• Discrete data (for now)
mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
:     :          :             :           :       :             :          :
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe
(The attribute columns are X; the mpg column is the label Y.)
Decision trees for classification
• Why use decision trees?
• What is their expressive power?
• Can they be constructed automatically?
• How accurately can they classify?
• What makes a good decision tree besides accuracy on given examples?
Some real examples (from Russell & Norvig, Mitchell):
• BP's GasOIL system for separating gas and oil on offshore platforms – decision trees replaced a hand-designed rule system with 2500 rules. The C4.5-based system outperformed human experts and saved BP millions. (1986)
• Learning to fly a Cessna on a flight simulator by watching human experts fly the simulator (1992)
• Can also learn to play tennis, analyze C-section risk, etc.
Decision trees for classification
• Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
• Model discrete outcomes nicely
• Can be very powerful (expressive)
• C4.5 and CART – from the "top 10 data mining methods" – very popular
• This Thu: we'll see why not…
Decision trees for classification
Decision trees f : X → Y
• Each internal node tests an attribute x_i
• One branch for each possible attribute value x_i = v
• Each leaf assigns a class y
• To classify an input x: traverse the tree from root to leaf, output the label y (see the code sketch below)
• Can we construct a tree automatically?
[Figure: an example tree. The root tests Cylinders with branches 3, 4, 5, 6, 8; three branches end directly in good/bad leaves, the 4-cylinder branch tests Maker (america / asia / europe) and the 8-cylinder branch tests Horsepower (low / med / high), each ending in good/bad leaves.]
Human interpretable!
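As a concrete illustration of this traversal, here is a minimal Python sketch: a node structure, the root-to-leaf classification loop, and a fragment shaped like the MPG tree. The exact leaf labels below are illustrative assumptions, not the tree from the slide.

```python
# A minimal decision-tree sketch: internal nodes test one attribute,
# with one child per attribute value; leaves hold a class label.

class Leaf:
    def __init__(self, label):
        self.label = label

class Node:
    def __init__(self, attribute, children):
        self.attribute = attribute  # the attribute this node tests
        self.children = children    # dict: attribute value -> subtree

def classify(tree, x):
    """Traverse from root to leaf; x maps attribute names to values."""
    while isinstance(tree, Node):
        tree = tree.children[x[tree.attribute]]
    return tree.label

# Illustrative fragment (leaf labels are assumptions, not the slide's tree):
tree = Node("cylinders", {
    3: Leaf("good"),
    4: Node("maker", {"america": Leaf("bad"),
                      "asia": Leaf("good"),
                      "europe": Leaf("good")}),
    5: Leaf("bad"),
    6: Leaf("bad"),
    8: Node("horsepower", {"low": Leaf("good"),
                           "med": Leaf("bad"),
                           "high": Leaf("bad")}),
})

print(classify(tree, {"cylinders": 4, "maker": "asia"}))  # -> good
```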
Expressive power of DTs
What kind of functions can they potentially represent?
• Boolean functions? F : {0,1}³ ↦ {0,1}

X1  X2  X3  F(X1,X2,X3)
0   0   0   0
0   0   1   0
0   1   0   0
0   1   1   0
1   0   0   0
1   0   1   0
1   1   0   0
1   1   1   0
[Figure: a complete decision tree for F – split on x1 at the root, then x2, then x3; the 0/1 branch labels lead to eight leaves, each holding the corresponding truth-table value (all 0 here).]
Is simpler better?
The same function F : {0,1}³ ↦ {0,1} (identically 0), with the same truth table as above.
[Figure: the simplest tree representing F – a single leaf labeled 0.]
What is the Simplest Tree?
(The MPG dataset table from before.)
Is this a good tree?
The simplest tree predicts mpg = bad on every input.
[22+, 18-] means: correct on 22 examples, incorrect on 18 examples.
Are all decision trees equal?
• Many trees can represent the same concept
• But not all trees will have the same size!
– e.g., (A and B) or (not A and C)
[Figure: two trees for (A and B) or (not A and C). The first splits on A at the root: if A = t, test B (+ / −); if A = f, test C (+ / −) – four leaves in total. The second splits on B at the root and must still test A (and C) on both branches, yielding a larger tree with repeated tests of A.]
• Which tree do we prefer?
Learning the simplest decision tree is NP-hard
• Formal justification – statistical learning theory (next lecture)
• Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76]
• Resort to a greedy heuristic:
– Start from an empty decision tree
– Split on the next best attribute (feature)
– Recurse
Learning Algorithm for Decision Trees
Training set: S = {(x^1, y^1), ..., (x^N, y^N)}, where each example x = (x_1, ..., x_d) has binary features x_j ∈ {0,1} and label y ∈ {0,1}.
What happens if features are not binary? What about regression?
DT algorithms differ on this choice!
• ID3
• C4.5
• CART
A Decision Stump
Key idea: greedily learn trees using recursion
Take the original dataset and partition it according to the value of the attribute we split on: the records in which cylinders = 4, the records in which cylinders = 5, the records in which cylinders = 6, and the records in which cylinders = 8.
Recursive Step
For each partition (cylinders = 4, 5, 6, 8), build a tree from those records.
Second level of tree
Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia. (Similar recursion in the other cases.)
A full tree
Splitting: choosing a good attribute

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1: X1 = t gives Y=t: 4, Y=f: 0; X1 = f gives Y=t: 1, Y=f: 3.
Split on X2: X2 = t gives Y=t: 3, Y=f: 1; X2 = f gives Y=t: 2, Y=f: 2.

Would we prefer to split on X1 or X2?
Idea: use counts at leaves to define probability distributions, so we can measure uncertainty!
Measuring uncertainty
• Good split if we are more certain about the classification after the split
– Deterministic good (all true or all false)
– Uniform distribution bad
– What about distributions in between?

Uniform: P(Y=A) = 1/4, P(Y=B) = 1/4, P(Y=C) = 1/4, P(Y=D) = 1/4
Skewed:  P(Y=A) = 1/2, P(Y=B) = 1/4, P(Y=C) = 1/8, P(Y=D) = 1/8
Entropy
Entropy H(Y) of a random variable Y:
H(Y) = - Σ_y P(Y=y) log2 P(Y=y)
More uncertainty, more entropy!
Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
[Figure: entropy of a coin flip as a function of the probability of heads – zero at p = 0 and p = 1, maximal at p = 1/2.]
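A minimal sketch of this computation, estimating P(Y=y) from label counts; the function name and the coin-flip examples are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) * log2 P(Y=y), with P estimated from counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The coin-flip curve: zero entropy when deterministic, 1 bit at p = 1/2.
print(entropy(["heads", "tails"]))         # 1.0
print(entropy(["heads"] * 3 + ["tails"]))  # ~0.811
print(entropy(["heads"] * 4))              # -0.0, i.e. zero: no uncertainty
```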
• "High entropy"
– Y is from a uniform-like distribution
– Flat histogram
– Values sampled from it are less predictable
• "Low entropy"
– Y is from a varied (peaks and valleys) distribution
– Histogram has many lows and highs
– Values sampled from it are more predictable
(Slide from Vibhav Gogate)
Entropy Example

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F

P(Y=t) = 5/6, P(Y=f) = 1/6
H(Y) = - 5/6 log2 5/6 - 1/6 log2 1/6 ≈ 0.65
Conditional Entropy
Conditional entropy H(Y | X) of a random variable Y conditioned on a random variable X:
H(Y | X) = Σ_x P(X=x) H(Y | X=x)

Example (the same six records as above):
P(X1=t) = 4/6, P(X1=f) = 2/6
X1 = t gives Y=t: 4, Y=f: 0; X1 = f gives Y=t: 1, Y=f: 1
H(Y | X1) = - 4/6 (1 log2 1 + 0 log2 0) - 2/6 (1/2 log2 1/2 + 1/2 log2 1/2)
= 2/6 ≈ 0.33
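The same computation in code – a sketch that groups the labels by the value of X and averages the per-group entropies, reproducing the 1/3 above (helper names are our own):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    """H(Y|X) = sum_x P(X=x) * H(Y | X=x)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

# The slide's example: H(Y|X1) = 4/6 * 0 + 2/6 * 1 = 1/3.
X1 = ["T", "T", "T", "T", "F", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]
print(conditional_entropy(X1, Y))  # ~0.333
```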
Information gain
• Decrease in entropy (uncertainty) after splitting:
IG(X) = H(Y) - H(Y | X)
In our running example:
IG(X1) = H(Y) - H(Y | X1) = 0.65 - 0.33 = 0.32
IG(X1) > 0 → we prefer the split!
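Putting the two together, a sketch of the information-gain computation (definitions repeated so the snippet is self-contained); it reproduces the 0.32 of the running example:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    """IG(X) = H(Y) - H(Y|X): the drop in uncertainty from splitting on X."""
    return entropy(ys) - conditional_entropy(xs, ys)

X1 = ["T", "T", "T", "T", "F", "F"]
Y  = ["T", "T", "T", "T", "T", "F"]
print(information_gain(X1, Y))  # ~0.32
```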
Learning decision trees
• Start from an empty decision tree
• Split on the next best attribute (feature)
– Use, for example, information gain to select the attribute: split on arg max_i IG(X_i) = arg max_i [H(Y) - H(Y | X_i)]
• Recurse
Look at all the information gains…
Suppose we want to predict MPG. [Figure: the information gain of each attribute on the MPG data.]
When to stop?
First split looks good! But when do we stop?
Base Case One: don't split a node if all matching records have the same output value.
Base Case Two: don't split a node if all data points are identical on the remaining attributes.
Base Cases: An idea
• Base Case One: if all records in the current data subset have the same output, then don't recurse
• Base Case Two: if all records have exactly the same set of input attributes, then don't recurse
Proposed Base Case 3: if all attributes have small information gain, then don't recurse.
• This is not a good idea
The problem with proposed case 3

a  b  y
0  0  0
0  1  1
1  0  1
1  1  0

y = a XOR b
The information gains: IG(a) = IG(b) = 0, since splitting on either attribute alone leaves a 50/50 distribution over y in both branches.
If we omit proposed case 3 (same data, y = a XOR b), the resulting decision tree splits on a and then on b, and classifies every record correctly.
Instead, perform pruning after building a tree.
Non-Boolean Features
• Real-valued features?
Real → threshold: test WEIGHT > t? (branches: No / Yes)
• Number of thresholds ≤ number of different values in the dataset
• Can choose the threshold based on information gain, as in the sketch below
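A sketch of this threshold choice: scan the midpoints between consecutive distinct values and keep the one whose induced binary split has the highest information gain. The weights, labels, and function names below are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(xs, ys):
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return sum(len(g) / len(ys) * entropy(g) for g in groups.values())

def information_gain(xs, ys):
    return entropy(ys) - conditional_entropy(xs, ys)

def best_threshold(values, ys):
    """At most (#distinct values - 1) candidate thresholds need checking."""
    vs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(vs, vs[1:])]
    return max(candidates,
               key=lambda t: information_gain([v > t for v in values], ys))

# Hypothetical weights with a clean cut near 3000:
weights = [2100, 2300, 2800, 3200, 3500, 4100]
labels  = ["good", "good", "good", "bad", "bad", "bad"]
print(best_threshold(weights, labels))  # 3000.0
```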
Summary: Building Decision Trees
BuildTree(DataSet, Output)
• If all output values are the same in DataSet, return a leaf node that says "predict this unique output"
• If all input values are the same, return a leaf node that says "predict the majority output"
• Else find the attribute X with the highest InfoGain
• Suppose X has n_X distinct values (i.e., X has arity n_X):
– Create a non-leaf node with n_X children.
– The i-th child should be built by calling BuildTree(DS_i, Output), where DS_i contains the records in DataSet where X = the i-th value of X.
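A runnable Python sketch of this procedure, with records as attribute→value dicts; the tree representation (a label for a leaf, an (attribute, children) tuple for an internal node) is our own choice, not prescribed by the slides:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    groups = defaultdict(list)
    for r, y in zip(records, labels):
        groups[r[attr]].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(records, labels, attributes):
    # Base case one: all output values are the same -> predict that output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case two: all inputs identical (or no attributes left)
    # -> predict the majority output.
    if not attributes or all(r == records[0] for r in records):
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, labels, a))
    groups = defaultdict(list)
    for r, y in zip(records, labels):
        groups[r[best]].append((r, y))
    remaining = [a for a in attributes if a != best]
    children = {v: build_tree([r for r, _ in g], [y for _, y in g], remaining)
                for v, g in groups.items()}
    return (best, children)

# Tiny example: learns to split on x1, then x2.
records = [{"x1": "T", "x2": "T"}, {"x1": "T", "x2": "F"},
           {"x1": "F", "x2": "T"}, {"x1": "F", "x2": "F"}]
print(build_tree(records, ["T", "T", "T", "F"], ["x1", "x2"]))
# -> ('x1', {'T': 'T', 'F': ('x2', {'T': 'T', 'F': 'F'})})
```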
Machine Space Search
• ID3 / C4.5 / CART search for a succinct tree that perfectly fits the data.
• They are not going to find it in general (the problem is NP-hard).
• Entropy-guided splitting is a well-performing heuristic; others exist.
• Why should we search for a small tree at all?
MPG Test set error
The test set error is much worse than the training set error…
…why?
Decision trees will overfit
Fitting a polynomial
t = sin(2πx) + ε
Regression using a polynomial of degree M.
[Figures from "Pattern Recognition and Machine Learning", Bishop: polynomial fits of increasing degree M to noisy samples of sin(2πx); low degrees underfit, high degrees fit the noise.]
Overfitting
• Precise characterization – statistical learning theory
• Special techniques to prevent overfitting in DT learning
• Pruning the tree, e.g. "reduced error" pruning:
Do until further pruning is harmful:
1. Evaluate the impact on the validation (test) set of the data of pruning each possible node (and its subtree)
2. Greedily remove the one that most improves validation (test) error
Concepts we have encountered (and they will appear again in the course!)
• Greedy algorithms.
• Trying to find succinct machines.
• Computational efficiency of an algorithm (running time).
• Overfitting.
Questions
• Why are smaller trees preferable to larger trees (in theory)?
• When can a tree generalize to unseen examples, and how well?
• How many examples are needed to build a tree that generalizes well?
• How to identify and prevent overfitting?
• Are there other natural classification machines, and how do they compare?
Need to reason more generally about "what learning means" – TBC on Thu…