TRANSCRIPT
Machine Learning (CSE 446): Decision Trees
Sham M Kakade © 2018
University of Washington, [email protected]
1 / 18
Announcements
- First assignment posted. Due Thurs, Jan 18th. Remember the late policy (see the website).
- TA office hours posted. (Please check the website before you go, just in case of changes.)
- Midterm: Weds, Feb 7.
- Today: decision trees, our first supervised learning algorithm.
2 / 18
Features (a conceptual point)
Let φ be a function that maps inputs x to values. There could be many such functions; sometimes we write Φ(x) for the feature “vector” (it’s really a “tuple”).
- If φ maps to {0, 1}, we call it a “binary feature (function).”
- If φ maps to R, we call it a “real-valued feature (function).”
- φ could map to categorical values,
- or to ordinal values, integers, ...
Often, there isn’t much of a difference between x and the tuple of features.
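To make this concrete, here is a minimal Python sketch of feature functions; the names (is_american, weight_feature, features) and the dict encoding of an input x are invented for this illustration, not taken from the course code:

def is_american(x):
    # binary feature: maps x to {0, 1}
    return 1 if x["origin"] == "america" else 0

def weight_feature(x):
    # real-valued feature: maps x to a real number
    return float(x["weight"])

def features(x):
    # the feature "vector" Phi(x) -- really a tuple of feature values
    return (is_american(x), weight_feature(x), x["cylinders"])

car = {"origin": "america", "weight": 3504, "cylinders": 8}
print(features(car))  # (1, 3504.0, 8)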
3 / 18
Features
Data derived from https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Columns: mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin
Input: a row in this table. A feature mapping corresponds to a column.
Goal: predict whether mpg is below 23 (“bad” = 0) or above (“good” = 1) given the other attributes (other columns).
201 “good” and 197 “bad”; guessing the most frequent class (good) will get 50.5% accuracy.
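For concreteness, that baseline can be computed from the class counts alone; a minimal sketch (the data-loading step is omitted):

n_good, n_bad = 201, 197
baseline = max(n_good, n_bad) / (n_good + n_bad)
print(f"always guessing 'good' is right {baseline:.1%} of the time")  # 50.5%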
4 / 18
Let’s build a classifier!
- Let’s just try to build a classifier. (This is our first learning algorithm.)
- For now, let’s ignore the “test” set and the question of how to “generalize.”
- Let’s start by just looking at a simple classifier. What is a simple classification rule?
5 / 18
Contingency Table
              φ(x) = v1   φ(x) = v2   · · ·   φ(x) = vK
  y = 0
  y = 1

(each cell holds the count of examples with that label y and that value of feature φ)
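A minimal sketch of tallying such a table in Python; the tiny data list and the origin feature below are hypothetical stand-ins for the real dataset:

from collections import Counter

def contingency_table(data, phi):
    # count examples for every (feature value, label) cell
    counts = Counter()
    for x, y in data:
        counts[(phi(x), y)] += 1
    return counts

data = [({"origin": "america"}, 0), ({"origin": "america"}, 0),
        ({"origin": "europe"}, 1), ({"origin": "asia"}, 1)]
table = contingency_table(data, lambda x: x["origin"])
print(table[("america", 0)], table[("europe", 1)])  # 2 1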
6 / 18
Decision Stump Example
  maker       america   europe   asia
  y = 0           174       14      9
  y = 1            75       56     70
  predict →         0        1      1

  root (197:201, counts are y=0 : y=1), split on maker?
      america → 174:75
      europe  →  14:56
      asia    →   9:70
Errors: 75 + 14 + 9 = 98 (about 25%)
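A small sketch of where that count comes from, using the per-maker counts from the table above (the helper name stump_errors is invented for this illustration):

counts = {"america": (174, 75), "europe": (14, 56), "asia": (9, 70)}  # (y=0, y=1)

def stump_errors(counts):
    # predict the majority label under each value; errors are the minority counts
    return sum(min(n0, n1) for n0, n1 in counts.values())

total = sum(n0 + n1 for n0, n1 in counts.values())
print(stump_errors(counts), f"{stump_errors(counts) / total:.1%}")  # 98 24.6%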
7 / 18
Decision Stump Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184
      5 →   1:2
      6 →  73:11
      8 → 100:3
Errors: 1 + 20 + 1 + 11 + 3 = 36 (about 9%)
8 / 18
Key Idea: Recursion
A single feature partitions the data.
For each partition, we could choose another feature and partition further.
Applying this recursively, we can construct a decision tree.
9 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184, split on maker?
          america →  7:65
          europe  → 10:53
          asia    →  3:66
      5 →   1:2
      6 →  73:11
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184
      5 →   1:2
      6 →  73:11, split on maker?
          america → 67:7
          europe  →  3:1
          asia    →  3:3
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
Decision Tree Example
  root (197:201), split on cylinders?
      3 →   3:1
      4 →  20:184, split on ϕ′?
          0 →  2:169
          1 → 18:15
      5 →   1:2
      6 →  73:11, split on ϕ?
          0 → 73:1
          1 →  0:10
      8 → 100:3
Error reduction compared to the cylinders stump?
10 / 18
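That question can be answered directly from the counts in the diagrams above; a small sketch, with the leaf counts copied from the slides and the helper name invented for this illustration:

def errors(*leaves):
    # a leaf labeled n:p predicts the majority class, so it makes min(n, p) mistakes
    return sum(min(n, p) for n, p in leaves)

stump = errors((3, 1), (20, 184), (1, 2), (73, 11), (100, 3))
# refine the 6-cylinder leaf (73:11) with the ϕ? split into 73:1 and 0:10
refined = errors((3, 1), (20, 184), (1, 2), (73, 1), (0, 10), (100, 3))
print(stump, refined, stump - refined)   # 36 26 10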
Decision Tree: Making a Prediction

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1, split on ϕ2?
          0 → n10:p10, split on ϕ3?
              0 → n100:p100
              1 → n101:p101
          1 → n11:p11, split on ϕ4?
              0 → n110:p110
              1 → n111:p111
Algorithm 1: DTreeTest
Data: decision tree t, input example x
Result: predicted class
if t has the form Leaf(y) then
    return y;
else
    # t.φ is the feature associated with t;
    # t.child(v) is the subtree for value v;
    return DTreeTest(t.child(t.φ(x)), x);
end
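A minimal, runnable Python rendering of Algorithm 1, assuming a hypothetical tree representation; the Leaf and Node containers below are invented for this sketch and are not the course's starter code:

from dataclasses import dataclass

@dataclass
class Leaf:
    y: int                 # predicted class at this leaf

@dataclass
class Node:
    phi: callable          # feature function, phi(x) -> value
    children: dict         # feature value -> subtree

def dtree_test(t, x):
    # Algorithm 1 (DTreeTest): walk down the tree following feature values
    if isinstance(t, Leaf):
        return t.y
    return dtree_test(t.children[t.phi(x)], x)

# usage: the maker stump from earlier, predicting 1 for European/Asian cars
stump = Node(phi=lambda x: x["maker"],
             children={"america": Leaf(0), "europe": Leaf(1), "asia": Leaf(1)})
print(dtree_test(stump, {"maker": "asia"}))  # 1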
Equivalent boolean formulas:
(φ1 = 0) ⇒ ⟦n0 < p0⟧
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 0) ⇒ ⟦n100 < p100⟧
(φ1 = 1) ∧ (φ2 = 0) ∧ (φ3 = 1) ⇒ ⟦n101 < p101⟧
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 0) ⇒ ⟦n110 < p110⟧
(φ1 = 1) ∧ (φ2 = 1) ∧ (φ4 = 1) ⇒ ⟦n111 < p111⟧

(Each formula says: on that branch, predict 1 exactly when the positives outnumber the negatives at the corresponding leaf.)
11 / 18
Tangent: How Many Formulas?
- Assume we have D binary features.
- Each feature could be set to 0, or set to 1, or excluded (wildcard / don’t care).
- That gives 3^D formulas.
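A quick sanity check of that count by brute-force enumeration, here for the made-up case D = 3:

from itertools import product

# each of the D features is set to 0, set to 1, or excluded ("*")
D = 3
formulas = list(product((0, 1, "*"), repeat=D))
print(len(formulas), 3 ** D)   # 27 27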
12 / 18
Building a Decision Tree

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1

We chose feature φ1. Note that n = n0 + n1 and p = p0 + p1.
We chose not to split the left partition. Why not?
13 / 18
Building a Decision Tree

Splitting the right partition further on ϕ2, and then its children on ϕ3 and ϕ4, gives the full tree from the prediction slide:

  root (n:p), split on ϕ1?
      0 → n0:p0
      1 → n1:p1, split on ϕ2?
          0 → n10:p10, split on ϕ3?
              0 → n100:p100
              1 → n101:p101
          1 → n11:p11, split on ϕ4?
              0 → n110:p110
              1 → n111:p111

13 / 18
Greedily Building a Decision Tree (Binary Features)
Algorithm 2: DTreeTrain
Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return Leaf(y);
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values;
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
    end
    let φ∗ be the feature with the smallest number of mistakes;
    return Node(φ∗, {0 → DTreeTrain(D0, Φ \ {φ∗}), 1 → DTreeTrain(D1, Φ \ {φ∗})});
end
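In the same spirit, a compact Python sketch of Algorithm 2 for binary feature functions. The Leaf/Node containers repeat the ones from the prediction sketch so this snippet stands alone; the empty-partition guard is an assumption the pseudocode leaves implicit.

from collections import Counter
from dataclasses import dataclass

@dataclass
class Leaf:
    y: int                 # predicted class at this leaf

@dataclass
class Node:
    phi: callable          # chosen feature function, phi(x) -> {0, 1}
    children: dict         # feature value -> subtree

def majority_label(examples):
    # most frequent label y among (x, y) pairs; ties broken arbitrarily
    return Counter(y for _, y in examples).most_common(1)[0][0]

def dtree_train(D, features):
    # Algorithm 2 (DTreeTrain): D is a list of (x, y) pairs,
    # features is a set of binary feature functions phi
    labels = {y for _, y in D}
    if len(labels) == 1 or not features:
        return Leaf(majority_label(D))

    def mistakes(phi):
        # stump error of phi: non-majority answers in D0 plus those in D1
        total = 0
        for v in (0, 1):
            ys = [y for x, y in D if phi(x) == v]
            if ys:
                total += len(ys) - Counter(ys).most_common(1)[0][1]
        return total

    best = min(features, key=mistakes)
    rest = features - {best}
    children = {}
    for v in (0, 1):
        Dv = [(x, y) for x, y in D if best(x) == v]
        # guard for an empty partition (left implicit in the pseudocode)
        children[v] = dtree_train(Dv, rest) if Dv else Leaf(majority_label(D))
    return Node(phi=best, children=children)

Together with dtree_test from the earlier sketch, this is enough to train a small tree on binarized columns and apply it to new examples.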
14 / 18
What could go wrong?
- Suppose we split on a variable with many values? (e.g. a continuous one like “displacement”; a thresholding sketch follows this list)
- Suppose we built our tree out to be very deep and wide?
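The slide does not prescribe a fix, but one standard way to handle a continuous column like displacement is to turn it into binary threshold features, so the greedy splitter above only ever sees {0, 1} values. A minimal sketch; the thresholds are made up for this illustration:

def make_threshold_feature(column, threshold):
    # binary feature: 1 if the (continuous) column exceeds the threshold
    def phi(x):
        return 1 if x[column] > threshold else 0
    return phi

displacement_features = [make_threshold_feature("displacement", t)
                         for t in (100, 200, 300)]   # illustrative thresholds
car = {"displacement": 250}
print([phi(car) for phi in displacement_features])   # [1, 1, 0]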
15 / 18
Danger: Overfitting
[Figure: error rate (lower is better) vs. depth of the decision tree. Training-data error keeps dropping as the tree gets deeper, while error on unseen data eventually goes back up; the growing gap is overfitting.]
16 / 18
Detecting Overfitting
If you use all of your data to train, you won’t be able to draw the red curve on the preceding slide!
Solution: hold some out. This data is called development data. More terms:
- Decision tree max depth is an example of a hyperparameter.
- “I used my development data to tune the max-depth hyperparameter.”
Better yet, hold out two subsets, one for tuning and one for a true, honest-to-science test.
Splitting your data into training/development/test requires careful thinking. Starting point: randomly shuffle examples with an 80%/10%/10% split.
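A minimal sketch of that starting point (plain Python, not the course's utilities):

import random

def split_data(examples, seed=0):
    # shuffle, then take 80% / 10% / 10% for train / development / test
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_data(range(398))   # 398 examples, as in Auto MPG
print(len(train), len(dev), len(test))      # 318 39 41

The development set is then used to pick hyperparameters such as max depth; the test set is touched only once, for the final honest estimate.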
17 / 18
The “i.i.d.” Supervised Learning Setup
- Let ℓ be a loss function; ℓ(ŷ, y) is what we lose by outputting ŷ when y is the correct output. For classification:
  ℓ(ŷ, y) = ⟦ŷ ≠ y⟧
- Let 𝒟(x, y) define the true probability of input/output pair (x, y), in “nature.” We never “know” this distribution.
- The training data D = 〈(x1, y1), (x2, y2), . . . , (xN, yN)〉 are assumed to be independent, identically distributed (i.i.d.) samples from 𝒟.
- The test data are also assumed to be i.i.d. samples from 𝒟.
- The space of classifiers we’re considering is F; f is a classifier from F, chosen by our learning algorithm.
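A hedged addition, not stated on the slide but a standard way to say what this setup buys us: the quantity we care about is the expected loss under 𝒟, while all we can measure directly is the average loss on the sample.

  true risk:       ε(f) = E_{(x,y)∼𝒟} [ ℓ(f(x), y) ]
  training error:  ε̂(f) = (1/N) Σ_{i=1..N} ℓ(f(x_i), y_i)

Overfitting, in the sense of the earlier plot, is the situation where ε̂(f) is small but ε(f) is not; held-out development and test data give estimates of ε(f) precisely because they are fresh i.i.d. draws from 𝒟.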
18 / 18