TRANSCRIPT
Machine Learning (CSE 446): Decision Trees
Sham M Kakade © 2018
University of Washington, [email protected]
1 / 18
Announcements
- First assignment posted. Due Thurs, Jan 18th. Remember the late policy (see the website).
- TA office hours posted. (Please check the website before you go, just in case of changes.)
- Midterm: Weds, Feb 7.
- Today: decision trees, our first supervised learning algorithm.
2 / 18
Features (a conceptual point)
Let φ be a function that maps inputs x to values. There could be many such functions; sometimes we write Φ(x) for the feature “vector” (it’s really a “tuple”).
- If φ maps to {0, 1}, we call it a “binary feature (function).”
- If φ maps to R, we call it a “real-valued feature (function).”
- φ could map to categorical values,
- or to ordinal values, integers, ...
Often, there isn’t much of a difference between x and the tuple of features.
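To make this concrete, here is a minimal Python sketch of feature functions; the names (is_american, weight_feature, features) and the dict encoding of an input x are invented for this illustration, not taken from the course code:

def is_american(x):
    # binary feature: maps x to {0, 1}
    return 1 if x["origin"] == "america" else 0

def weight_feature(x):
    # real-valued feature: maps x to a real number
    return float(x["weight"])

def features(x):
    # the feature "vector" Phi(x) -- really a tuple of feature values
    return (is_american(x), weight_feature(x), x["cylinders"])

car = {"origin": "america", "weight": 3504, "cylinders": 8}
print(features(car))  # (1, 3504.0, 8)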
3 / 18
Features
Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Columns: mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
Input: a row in this table. A feature mapping corresponds to a column.
Goal: predict whether mpg is below 23 (“bad” = 0) or above (“good” = 1) given the other attributes (other columns).
201 “good” and 197 “bad”; guessing the most frequent class (good) will get 50.5% accuracy.
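For concreteness, that baseline can be computed from the class counts alone; a minimal sketch (the data-loading step is omitted):

n_good, n_bad = 201, 197
baseline = max(n_good, n_bad) / (n_good + n_bad)
print(f"always guessing 'good' is right {baseline:.1%} of the time")  # 50.5%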
4 / 18
Let’s build a classifier!
- Let’s just try to build a classifier. (This is our first learning algorithm.)
- For now, let’s ignore the “test” set and the question of how to “generalize.”
- Let’s start by just looking at a simple classifier. What is a simple classification rule?
5 / 18
Contingency Table
              φ(x) = v1   φ(x) = v2   · · ·   φ(x) = vK
  y = 0
  y = 1

(each cell holds the count of examples with that label y and that value of feature φ)
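A minimal sketch of tallying such a table in Python; the tiny data list and the origin feature below are hypothetical stand-ins for the real dataset:

from collections import Counter

def contingency_table(data, phi):
    # count examples for every (feature value, label) cell
    counts = Counter()
    for x, y in data:
        counts[(phi(x), y)] += 1
    return counts

data = [({"origin": "america"}, 0), ({"origin": "america"}, 0),
        ({"origin": "europe"}, 1), ({"origin": "asia"}, 1)]
table = contingency_table(data, lambda x: x["origin"])
print(table[("america", 0)], table[("europe", 1)])  # 2 1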
6 / 18
Decision Stump Example
  maker       america   europe   asia
  y = 0           174       14      9
  y = 1            75       56     70
  predict →         0        1      1

  root (197:201, counts are y=0 : y=1), split on maker?
      america → 174:75
      europe  →  14:56
      asia    →   9:70
Errors: 75 + 14 + 9 = 98 (about 25%)
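A small sketch of where that count comes from, using the per-maker counts from the table above (the helper name stump_errors is invented for this illustration):

counts = {"america": (174, 75), "europe": (14, 56), "asia": (9, 70)}  # (y=0, y=1)

def stump_errors(counts):
    # predict the majority label under each value; errors are the minority counts
    return sum(min(n0, n1) for n0, n1 in counts.values())

total = sum(n0 + n1 for n0, n1 in counts.values())
print(stump_errors(counts), f"{stump_errors(counts) / total:.1%}")  # 98 24.6%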
7 / 18
Decision Stump Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184
      5 →   1:2
      6 →  73:11
      8 → 100:3
Errors: 1 + 20 + 1 + 11 + 3 = 36 (about 9%)
8 / 18
Key Idea: Recursion
A single feature partitions the data.
For each partition, we could choose another feature and partition further.
Applying this recursively, we can construct a decision tree.
9 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184, split on maker?
          america →  7:65
          europe  → 10:53
          asia    →  3:66
      5 →   1:2
      6 →  73:11
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184
      5 →   1:2
      6 →  73:11, split on maker?
          america → 67:7
          europe  →  3:1
          asia    →  3:3
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184, split on ϕ′?
          0 →  2:169
          1 → 18:15
      5 →   1:2
      6 →  73:11, split on ϕ?
          0 → 73:1
          1 →  0:10
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
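That question can be answered directly from the counts in the diagrams above; a small sketch, with the leaf counts copied from the slides and the helper name invented for this illustration:

def errors(*leaves):
    # a leaf labeled n:p predicts the majority class, so it makes min(n, p) mistakes
    return sum(min(n, p) for n, p in leaves)

stump = errors((3, 1), (20, 184), (1, 2), (73, 11), (100, 3))
# refine the 6-cylinder leaf (73:11) with the ϕ? split into 73:1 and 0:10
refined = errors((3, 1), (20, 184), (1, 2), (73, 1), (0, 10), (100, 3))
print(stump, refined, stump - refined)   # 36 26 10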
Decision Tree: Making a Prediction

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1, split on ϕ2?
          0 → n10:p10, split on ϕ3?
              0 → n100:p100
              1 → n101:p101
          1 → n11:p11, split on ϕ4?
              0 → n110:p110
              1 → n111:p111
Algorithm 1: DTreeTest
Data: decision tree t, input example x
Result: predicted class
if t has the form Leaf(y) then
    return y;
else
    # t.φ is the feature associated with t;
    # t.child(v) is the subtree for value v;
    return DTreeTest(t.child(t.φ(x)), x);
end
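A minimal, runnable Python rendering of Algorithm 1, assuming a hypothetical tree representation; the Leaf and Node containers below are invented for this sketch and are not the course's starter code:

from dataclasses import dataclass

@dataclass
class Leaf:
    y: int                 # predicted class at this leaf

@dataclass
class Node:
    phi: callable          # feature function, phi(x) -> value
    children: dict         # feature value -> subtree

def dtree_test(t, x):
    # Algorithm 1 (DTreeTest): walk down the tree following feature values
    if isinstance(t, Leaf):
        return t.y
    return dtree_test(t.children[t.phi(x)], x)

# usage: the maker stump from earlier, predicting 1 for European/Asian cars
stump = Node(phi=lambda x: x["maker"],
             children={"america": Leaf(0), "europe": Leaf(1), "asia": Leaf(1)})
print(dtree_test(stump, {"maker": "asia"}))  # 1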
Equivalent boolean formulas:
(φ1 = 0) ⇒ ⟦n0 < p0⟧
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 0) ⇒ ⟦n100 < p100⟧
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 1) ⇒ ⟦n101 < p101⟧
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 0) ⇒ ⟦n110 < p110⟧
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 1) ⇒ ⟦n111 < p111⟧

(Each formula says: on that branch, predict 1 exactly when the positives outnumber the negatives at the corresponding leaf.)
11 / 18
Tangent: How Many Formulas?
- Assume we have D binary features.
- Each feature could be set to 0, or set to 1, or excluded (wildcard / don’t care).
- That gives 3^D formulas.
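A quick sanity check of that count by brute-force enumeration, here for the made-up case D = 3:

from itertools import product

# each of the D features is set to 0, set to 1, or excluded ("*")
D = 3
formulas = list(product((0, 1, "*"), repeat=D))
print(len(formulas), 3 ** D)   # 27 27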
12 / 18
Building a Decision Tree

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1

We chose feature φ1. Note that n = n0 + n1 and p = p0 + p1.
We chose not to split the left partition. Why not?
13 / 18
Building a Decision Tree

Splitting the right partition further on ϕ2, and then its children on ϕ3 and ϕ4, gives the full tree from the prediction slide:

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1, split on ϕ2?
          0 → n10:p10, split on ϕ3?
              0 → n100:p100
              1 → n101:p101
          1 → n11:p11, split on ϕ4?
              0 → n110:p110
              1 → n111:p111

13 / 18
Greedily Building a Decision Tree (Binary Features)
Algorithm 2: DTreeTrain
Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return Leaf(y);
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values;
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
    end
    let φ∗ be the feature with the smallest number of mistakes;
    return Node(φ∗, {0 → DTreeTrain(D0, Φ \ {φ∗}), 1 → DTreeTrain(D1, Φ \ {φ∗})});
end
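In the same spirit, a compact Python sketch of Algorithm 2 for binary feature functions. The Leaf/Node containers repeat the ones from the prediction sketch so this snippet stands alone; the empty-partition guard is an assumption the pseudocode leaves implicit.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Leaf:
    y: int                 # predicted class at this leaf

@dataclass
class Node:
    phi: callable          # chosen feature function, phi(x) -> {0, 1}
    children: dict         # feature value -> subtree

def majority_label(examples):
    # most frequent label y among (x, y) pairs; ties broken arbitrarily
    return Counter(y for _, y in examples).most_common(1)[0][0]

def dtree_train(D, features):
    # Algorithm 2 (DTreeTrain): D is a list of (x, y) pairs,
    # features is a set of binary feature functions phi
    labels = {y for _, y in D}
    if len(labels) == 1 or not features:
        return Leaf(majority_label(D))

    def mistakes(phi):
        # stump error of phi: non-majority answers in D0 plus those in D1
        total = 0
        for v in (0, 1):
            ys = [y for x, y in D if phi(x) == v]
            if ys:
                total += len(ys) - Counter(ys).most_common(1)[0][1]
        return total

    best = min(features, key=mistakes)
    rest = features - {best}
    children = {}
    for v in (0, 1):
        Dv = [(x, y) for x, y in D if best(x) == v]
        # guard for an empty partition (left implicit in the pseudocode)
        children[v] = dtree_train(Dv, rest) if Dv else Leaf(majority_label(D))
    return Node(phi=best, children=children)

Together with dtree_test from the earlier sketch, this is enough to train a small tree on binarized columns and apply it to new examples.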
14 / 18
What could go wrong?
- Suppose we split on a variable with many values? (e.g. a continuous one like “displacement”; a thresholding sketch follows this list)
- Suppose we built our tree out to be very deep and wide?
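The slide does not prescribe a fix, but one standard way to handle a continuous column like displacement is to turn it into binary threshold features, so the greedy splitter above only ever sees {0, 1} values. A minimal sketch; the thresholds are made up for this illustration:

def make_threshold_feature(column, threshold):
    # binary feature: 1 if the (continuous) column exceeds the threshold
    def phi(x):
        return 1 if x[column] > threshold else 0
    return phi

displacement_features = [make_threshold_feature("displacement", t)
                         for t in (100, 200, 300)]   # illustrative thresholds
car = {"displacement": 250}
print([phi(car) for phi in displacement_features])   # [1, 1, 0]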
15 / 18
Danger: Overfitting
[Figure: error rate (lower is better) vs. depth of the decision tree. Training-data error keeps dropping as the tree gets deeper, while error on unseen data eventually goes back up; the growing gap is overfitting.]
16 / 18
Detecting Overfitting
If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide!
Solution: hold some out. This data is called development data. More terms:
- Decision tree max depth is an example of a hyperparameter.
- “I used my development data to tune the max-depth hyperparameter.”
Better yet, hold out two subsets, one for tuning and one for a true, honest-to-science test.
Splitting your data into training/development/test requires careful thinking. Starting point: randomly shuffle examples with an 80%/10%/10% split.
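A minimal sketch of that starting point (plain Python, not the course's utilities):

import random

def split_data(examples, seed=0):
    # shuffle, then take 80% / 10% / 10% for train / development / test
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_data(range(398))   # 398 examples, as in Auto MPG
print(len(train), len(dev), len(test))      # 318 39 41

The development set is then used to pick hyperparameters such as max depth; the test set is touched only once, for the final honest estimate.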
17 / 18
The “i.i.d.” Supervised Learning Setup
- Let ℓ be a loss function; ℓ(ŷ, y) is what we lose by outputting ŷ when y is the correct output. For classification:
  ℓ(ŷ, y) = ⟦ŷ ≠ y⟧
- Let 𝒟(x, y) define the true probability of input/output pair (x, y), in “nature.” We never “know” this distribution.
- The training data D = 〈(x1, y1), (x2, y2), . . . , (xN, yN)〉 are assumed to be independent, identically distributed (i.i.d.) samples from 𝒟.
- The test data are also assumed to be i.i.d. samples from 𝒟.
- The space of classifiers we’re considering is F; f is a classifier from F, chosen by our learning algorithm.
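A hedged addition, not stated on the slide but a standard way to say what this setup buys us: the quantity we care about is the expected loss under 𝒟, while all we can measure directly is the average loss on the sample.

  true risk:       ε(f) = E_{(x,y)∼𝒟} [ ℓ(f(x), y) ]
  training error:  ε̂(f) = (1/N) Σ_{i=1..N} ℓ(f(x_i), y_i)

Overfitting, in the sense of the earlier plot, is the situation where ε̂(f) is small but ε(f) is not; held-out development and test data give estimates of ε(f) precisely because they are fresh i.i.d. draws from 𝒟.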
18 / 18