
Decision trees and empirical methodology (Sec 4.3, 5.1-5.4)

Review...

•Goal: want to find/replicate target function f()

•Candidates from hypothesis space, H

•“best” candidate measured by accuracy (for the moment)

•Decision trees built by greedy, recursive search

•Produces a piecewise constant, axis-orthogonal, hyperrectangular model (see the sketch after this list)

•Can handle continuous or categorical attributes

•Only categorical class labels

•Learning bias for small, well-balanced trees
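To make the "piecewise constant, axis-orthogonal" point concrete, here is a minimal sketch of the kind of model a learned tree encodes; the feature names, thresholds, and labels are hypothetical, not taken from the slides.

# Hypothetical learned tree: every internal node tests one attribute against
# a threshold (an axis-orthogonal split), so the input space is carved into
# hyperrectangles, and each hyperrectangle gets a single constant class label.
def classify(petal_length: float, sepal_width: float) -> str:
    if petal_length < 2.5:
        return "setosa"
    elif sepal_width < 3.0:
        return "versicolor"
    else:
        return "virginica"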

Splitting criteria

•What properties do we want our getBestSplitAttribute() function to have?

•Increase the purity of the data

•After split, new sets should be closer to uniform labeling than before the split

•Want the subsets to have roughly the same purity

•Want the subsets to be as balanced as possible

•These choices are designed to produce small trees

•Definition: Learning bias == tendency to find one class of solution out of H in preference to another

Entropy

•We’ll use entropy

•Consider a set of true/false labels

•Want our measure to be small when the set is pure (all true or all false), and large when the set is split almost evenly between the classes

•Expresses the amount of information in the set

•(Later we’ll use the negative of this function, so it’ll be better if the set is almost pure)

Entropy, cont’d

•Define: class fractions (a.k.a. class prior probabilities): for class $i$, $p_i = |\{(x, y) \in [X, Y] : y = i\}| \,/\, |Y|$

•Define: entropy of a set of true/false labels: $H(Y) = -p \log_2 p - (1 - p) \log_2 (1 - p)$, where $p$ is the fraction of true labels

•In general, for $k$ classes: $H(Y) = -\sum_{i=1}^{k} p_i \log_2 p_i$
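A quick numeric check with made-up numbers (not from the slides): a set of 4 labels containing 3 true and 1 false.

$$p_T = \tfrac{3}{4}, \quad p_F = \tfrac{1}{4}, \qquad H(Y) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.311 + 0.5 = 0.811 \text{ bits}$$

A pure set gives $H = 0$; an even 2-vs-2 split gives the maximum, $H = 1$ bit.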

The entropy curve

(Figure: plot of the binary entropy $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$ against the class fraction $p$; zero at $p = 0$ and $p = 1$, maximum of 1 bit at $p = 0.5$)

Entropy of a split

•A split produces a number of sets (one for each branch)

•Need a corresponding entropy of a split (i.e., entropy of a collection of sets)

•Definition: entropy of a split that produces label subsets $Y_1, \ldots, Y_k$: $H_{split}(Y_1, \ldots, Y_k) = \sum_{j=1}^{k} \frac{|Y_j|}{|Y|} H(Y_j)$

Information gain

•The last, easy step:

•Want to pick the attribute that decreases the information content of the data as much as possible

•Q: Why decrease?

•Define: gain of splitting data set [X,Y] on attribute a: $Gain(X, Y, a) = H(Y) - H_{split}(Y_1, \ldots, Y_k)$, where $Y_1, \ldots, Y_k$ are the label subsets produced by splitting on $a$
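Continuing the made-up 3-true/1-false example from above: suppose some attribute splits it into $\{T, T\}$ and $\{T, F\}$.

$$H_{split} = \tfrac{2}{4} \cdot 0 + \tfrac{2}{4} \cdot 1 = 0.5, \qquad Gain = 0.811 - 0.5 \approx 0.311 \text{ bits}$$

An attribute that separated the classes perfectly would achieve the full 0.811 bits; a split whose subsets keep the same class mix as the original set gains nothing.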

Final algorithm

•Now we have a complete alg for the getBestSplitAttribute() function:

Input:  InstanceSet X, LabelSet Y
Output: Attribute

baseInfo = entropy(Y);
foreach a in (X.attributes) {
    [X1,...,Xk, Y1,...,Yk] = splitData(X, Y, a);
    gain[a] = baseInfo - splitEntropy(Y1,...,Yk);
}
return argmax(gain);
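A runnable Python sketch of the same algorithm. The data representation (instances as dicts from attribute name to value, labels in a parallel list) and the snake_case helper names are assumptions made for illustration; they mirror the pseudocode's entropy(), splitData(), and splitEntropy().

import math
from collections import Counter, defaultdict

def entropy(labels):
    """H(Y) = -sum_i p_i log2 p_i over the class fractions p_i."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(label_subsets):
    """Weighted average entropy of the label subsets produced by a split."""
    total = sum(len(ys) for ys in label_subsets)
    return sum(len(ys) / total * entropy(ys) for ys in label_subsets)

def split_data(X, Y, a):
    """Group the labels by the value each instance takes on attribute a."""
    groups = defaultdict(list)
    for x, y in zip(X, Y):
        groups[x[a]].append(y)
    return list(groups.values())

def get_best_split_attribute(X, Y, attributes):
    """Return the attribute with the largest information gain on [X, Y]."""
    base_info = entropy(Y)
    gain = {a: base_info - split_entropy(split_data(X, Y, a)) for a in attributes}
    return max(gain, key=gain.get)

For example, with X = [{"outlook": "sunny", "windy": True}, ...] and Boolean labels Y, get_best_split_attribute(X, Y, ["outlook", "windy"]) returns whichever attribute yields the purer partition.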

DTs in practice...

•Growing to purity is bad (overfitting)

(Figure: decision-tree boundaries on the iris data; x1: petal length, x2: sepal width)

•Terminate growth early

•Grow to purity, then prune back

•Multiway splits are a pain

•Entropy is biased in favor of more splits

•Correct w/ gain ratio

•Real-valued attributes

•rules of form if (x1<3.4) { ... }

•How to pick the “3.4”? (see the sketch below)
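One standard way to pick the threshold (a sketch of the usual approach, not necessarily the exact recipe the slides intend): sort the instances by the real-valued attribute, take candidate cut points at midpoints between adjacent distinct values, and keep the candidate with the highest information gain. The entropy() helper is the same as in the earlier sketch, repeated here so the snippet stands alone.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Choose t for a rule of the form (x < t) by maximizing information gain
    over midpoints between consecutive sorted attribute values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    base_info = entropy([y for _, y in pairs])
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                      # no cut point between equal values
        t = (lo + hi) / 2.0
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        gain = base_info - (len(left) / n * entropy(left)
                            + len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain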


Measuring accuracy

•So now you have a DT -- what now?

•Usually, want to use it to classify new data (previously unseen)

•Want to know how well you should expect it to perform

•How do you estimate such a thing?

•Theoretically -- prove that you have the “right” tree

•Very, very hard in practice

•Measure it

•Trickier than it sounds!....

Testing with training data

•So you have a data set: $X = \{x_1, \ldots, x_N\}$

•and corresponding labels: $Y = \{y_1, \ldots, y_N\}$

•You build your decision tree:

•tree = buildDecisionTree(X, Y)

•What happens if you just do this:

acc = 0.0;
for (i = 1; i <= N; ++i) {
    acc += (tree.classify(X[i]) == Y[i]);
}
acc /= N;
return acc;

•?

Testing with training data

•Answer: you tend to overestimate real accuracy (possibly drastically)

(Figure: iris scatter plot with previously unseen query points marked "?"; axis x2: sepal width)

Separation of train & test

•Fundamental principle (1st amendment of ML):

•Don’t evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!

Holdout data

•Usual to “hold out” a separate set of data for testing; not used to train classifier

•A.k.a., test set, holdout set, evaluation set, etc.

•E.g., split the data into a training set $[X_{train}, Y_{train}]$ and a test set $[X_{test}, Y_{test}]$

•Accuracy measured on $[X_{train}, Y_{train}]$ is training set accuracy

•Accuracy measured on $[X_{test}, Y_{test}]$ is test set (or generalization) accuracy
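A minimal sketch of the holdout procedure. scikit-learn's DecisionTreeClassifier and the iris data stand in for the tree and data set discussed in these slides; the library choice and the 70/30 split are assumptions, not something the slides prescribe.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data; it is never shown to the learner during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

tree = DecisionTreeClassifier().fit(X_train, y_train)

print("training set accuracy:", tree.score(X_train, y_train))  # optimistic
print("test set accuracy:    ", tree.score(X_test, y_test))    # honest estimate of generalization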

Gotchas...

•What if you’re unlucky when you split data into train/test?

•E.g., all train data are class A and all test are class B?

•No “red” things show up in training data

•Best answer: stratification

•Try to make sure class (+feature) ratios are the same in train/test sets (and the same as in the original data)

•Why does this work?

•Almost as good: randomization

•Shuffle data randomly before split

•Why does this work?
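A sketch of stratified splitting done by hand (the train_test_split call above does the same job via its stratify argument; this version just makes the mechanism, and the role of shuffling, explicit). The 70/30 ratio and the list-based data representation are assumptions.

import random
from collections import defaultdict

def stratified_split(X, Y, train_frac=0.7, seed=0):
    """Shuffle within each class, then send train_frac of every class to the
    training set, so class ratios in train/test match the original data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(Y):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # randomization within each class
        cut = int(round(train_frac * len(indices)))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])

    take = lambda idx: ([X[i] for i in idx], [Y[i] for i in idx])
    return take(train_idx), take(test_idx)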