TRANSCRIPT
Lecture 1, Slide 1
Today’s Topics
• Read Chapter 3 & Section 4.1 (Skim Section 3.6 and rest of Chapter 4), Sections 5.1, 5.2, 5.3, 5.7, 5.8, & 5.9 (skim rest of Chapter 5) of textbook
• Reviewing the Info Gain Calc from Last Week
• HW0 due 11:55pm, HW1 due in one week (two with late days)
• Fun reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf
• Information Gain Derived (and Generalized to k Output Categories)
• Handling Numeric and Hierarchical Features
• Advanced Topic: Regression Trees
• The Trouble with Too Many Possible Values
• What if Measuring Features is Costly?
ID3 Info Gain Measure Justified (Ref: C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)
Definition of Information
The info conveyed by a message M depends on its probability, i.e.,
info(M) = -log2[ Prob(M) ]   (due to Claude Shannon)
Note: last week we used infoNeeded() as a more informative name for info()
The Supervised Learning Task
Select an example from a set S and announce that it belongs to class C.
The probability of this occurring is approximately f_C, the fraction of C's in S.
Hence the info in this announcement is, by definition, -log2(f_C).
Let there be K different classes in set S, namely C1, C2, …, CK.
What is the expected info from a message about the class of an example in set S?
info(S) is the average number of bits of information (obtained by looking at feature values) needed to classify a member of set S.
ID3 Info Gain Measure (cont.)
info(S) = - f_C1 log2(f_C1) - f_C2 log2(f_C2) - … - f_CK log2(f_CK)
        = - Σ (j = 1 to K) f_Cj log2(f_Cj)

where f_Cj = fraction of set S that are of class Cj
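A minimal Python sketch (illustrative only, not the course's homework code) of the K-class info(S) formula above and the resulting information gain for a split on a discrete feature; the tiny data set at the bottom is made up.

```python
# Entropy and info gain for K output categories, following the formula above.
from collections import Counter
from math import log2

def info(examples):
    """info(S) = - sum_j f_Cj * log2(f_Cj), over the classes present in S.
    Each example is a (feature_dict, class_label) pair."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_gain(examples, feature):
    """Gain(F) = info(S) - sum_v (|S_v| / |S|) * info(S_v), splitting on feature F."""
    subsets = {}
    for features, label in examples:
        subsets.setdefault(features[feature], []).append((features, label))
    remainder = sum(len(sub) / len(examples) * info(sub) for sub in subsets.values())
    return info(examples) - remainder

# Made-up example with K = 3 classes (A, B, C):
S = [({"Color": "Red"}, "A"), ({"Color": "Red"}, "B"),
     ({"Color": "Blue"}, "B"), ({"Color": "Blue"}, "C")]
print(info(S))                 # 1.5 bits
print(info_gain(S, "Color"))   # 0.5 bits
```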
Handling Hierarchical Features in ID3
Define a new feature for each level in the hierarchy, e.g.,
Let ID3 choose the appropriate level of abstraction!
[Hierarchy diagram: Shape at the root, with Circular and Polygonal beneath it]
Shape1 = { Circular, Polygonal }
Shape2 = { }
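One way to realize this, sketched below in Python: derive one feature per level of the hierarchy and let ID3's info-gain competition pick the level of abstraction. The specific shape subtypes are illustrative assumptions, since the slide leaves Shape2 unspecified.

```python
# Derive one feature per hierarchy level; the hierarchy itself is a made-up example.
HIERARCHY = {
    "Circular":  ["circle", "ellipse"],
    "Polygonal": ["triangle", "square", "hexagon"],
}

def shape1(raw_shape):
    """Shape1: the coarse category (Circular / Polygonal) the raw shape falls under."""
    for category, members in HIERARCHY.items():
        if raw_shape in members:
            return category
    raise ValueError(f"unknown shape: {raw_shape}")

def shape2(raw_shape):
    """Shape2: the most specific level, here just the raw value itself."""
    return raw_shape

example = {"Shape": "triangle"}
derived = {"Shape1": shape1(example["Shape"]), "Shape2": shape2(example["Shape"])}
print(derived)   # {'Shape1': 'Polygonal', 'Shape2': 'triangle'}
```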
Handling Numeric Features in ID3
On the fly, create binary features and choose the best (see the sketch after the note below).
Step 1: Plot current examples (green = pos, red = neg)
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = (F < 8) and feature_new2 = (F < 10)
Step 3: Choose the split with the best info gain (it competes with all other features)
[Plot: examples laid out along the value of the feature, at roughly 5, 7, 9, 11, 13]
Note: “On the fly” means in each recursive call to ID3
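A minimal Python sketch of Steps 1-3 above, reusing info_gain() from the earlier sketch; the data points are made up and only illustrate the midpoint-threshold idea.

```python
# Create candidate thresholds midway between consecutive differently-labeled points,
# then keep the binary test "F < t" with the best info gain.
def candidate_thresholds(points):
    """points: list of (numeric value of F, class label) pairs."""
    pts = sorted(points)
    return [(pts[i][0] + pts[i + 1][0]) / 2
            for i in range(len(pts) - 1)
            if pts[i][1] != pts[i + 1][1]]

def best_threshold(points):
    """Return (threshold, gain) for the binary test F < threshold with the highest gain."""
    best = None
    for t in candidate_thresholds(points):
        split = [({"F<t": value < t}, label) for value, label in points]
        gain = info_gain(split, "F<t")        # info_gain() as defined in the earlier sketch
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

print(best_threshold([(5, '+'), (7, '+'), (9, '-'), (11, '-'), (13, '+')]))
# -> about (8.0, 0.42): the test F < 8 wins among the candidate midpoints
```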
Handling Numeric Features (cont.)
Technical Note
[Example d-tree: F < 10 is tested at the root and F < 5 is tested again lower on the same path, with + and - leaves]
Cannot discard a numeric feature after using it in one portion of the d-tree; the same feature may be tested again, with a different threshold, elsewhere in the tree.
Advanced Topic: Regression Trees (assume features are numerically valued)
[Example regression tree: the root tests Age > 25 (Yes / No); one branch leads to a Gender test (M / F); the leaves return linear models such as Output = 4 f3 + 7 f5 - 2 f9, Output = 7 f6 - 2 f1 - 2 f8 + f7, and Output = 100 f4 - 2 f8]
We want to return real values at the leaves
- For each feature F, "split" as done in ID3
- Use the residual error remaining, say from Linear Least Squares (LLS), instead of info gain to score candidate splits
Why not use a weighted sum in the total error?
Commonly the models at the leaves are weighted sums of the features (y = mx + b); some approaches just place constants at the leaves (see the sketch below).
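Below is a minimal sketch (assuming NumPy) of fitting the weighted-sum leaf model mentioned above by linear least squares; the feature matrix and targets are made-up placeholders.

```python
# Fit Output = w . f + b on the examples that reach a leaf.
import numpy as np

def fit_leaf_model(X, y):
    """X: (n_examples, n_features); y: (n_examples,). Returns (weights, bias)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # extra column of 1s for the bias term
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[:-1], coeffs[-1]

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 4.5, 6.5, 9.0])
w, b = fit_leaf_model(X, y)
print(w, b)   # the leaf would predict w @ features + b for new examples
```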
Advanced Topic: Scoring “Splits” for Regression (Real-Valued) Problems
Error_F,i = Σ over examples ex in subset_i of [ output(ex) - LLS_i(ex) ]²

TotalError(F) = Σ over splits i of Error_F,i

[Plot: example outputs vs. the input X, with the fitted LLS line]
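A minimal sketch of the scoring rule above, assuming NumPy and the fit_leaf_model() helper from the previous sketch: fit an LLS model to each subset produced by a candidate split and sum the squared residuals; the split (and feature) with the smallest total error wins.

```python
# Score a candidate split by the residual error left after LLS fits on each subset.
import numpy as np

def subset_error(X, y):
    """Error_i = sum over examples of [ output(ex) - LLS_i(ex) ]^2."""
    w, b = fit_leaf_model(X, y)               # from the previous sketch
    predictions = X @ w + b
    return float(np.sum((y - predictions) ** 2))

def total_error(subsets):
    """TotalError(F) = sum over the subsets produced by splitting on F of Error_i."""
    return sum(subset_error(X, y) for X, y in subsets)

# Made-up split into two subsets of (X, y); compare against other candidate splits.
left  = (np.array([[1.0], [2.0], [3.0]]), np.array([1.0, 2.1, 2.9]))
right = (np.array([[5.0], [6.0], [7.0]]), np.array([9.8, 12.2, 13.9]))
print(total_error([left, right]))
```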
Unfortunate Characteristic Property of Using the Info-Gain Measure
FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)
Extreme Case:
At most one example per leaf, so every leaf's Info(., .) score equals zero and the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).
[Figure: splitting on Student ID gives one branch per ID value (1, …, 99, …, 999999); each leaf holds at most one example, e.g., 1+ 0-, 0+ 1-, 0+ 0-]
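A small made-up demonstration of this bias, reusing info_gain() from the earlier sketch: an ID-like feature with a unique value per example achieves the maximum possible gain even though it cannot generalize.

```python
# Six made-up examples; StudentID is unique per example, Color is a normal feature.
S = [({"StudentID": i, "Color": "Red" if i % 2 else "Blue"}, "+" if i < 3 else "-")
     for i in range(6)]
print(info_gain(S, "StudentID"))   # 1.0: every leaf is pure, a "perfect" score
print(info_gain(S, "Color"))       # about 0.08, even though Color may generalize better
```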
One Fix (used in HW0/HW1)
Convert all features to binary, e.g., Color = { Red, Blue, Green } (see the sketch below)
From one N-valued feature to N binary-valued features
Color = Red?
Color = Blue?
Color = Green?
Used in Neural Nets and SVMs
D-tree readability is probably reduced, but not necessarily.
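A minimal Python sketch of this conversion (illustrative, not the actual HW0/HW1 code): replace one N-valued feature with N Boolean "Feature = value?" features.

```python
# One-hot style conversion of a single N-valued feature into N binary features.
def to_binary_features(example, feature, possible_values):
    expanded = {k: v for k, v in example.items() if k != feature}
    for value in possible_values:
        expanded[f"{feature}={value}?"] = (example[feature] == value)
    return expanded

print(to_binary_features({"Color": "Blue", "Size": "Large"},
                         "Color", ["Red", "Blue", "Green"]))
# -> {'Size': 'Large', 'Color=Red?': False, 'Color=Blue?': True, 'Color=Green?': False}
```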
Considering the Cost of Measuring a Feature
• Want trees with high accuracy and whose tests are inexpensive to compute
– take temperature vs. do CAT scan
• Common Heuristic (see the sketch below)
– InformationGain(F)² / Cost(F)
– Used in medical domains as well as robot-sensing tasks
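A minimal sketch of the heuristic with made-up gain and cost numbers: rank features by InformationGain(F)² / Cost(F) so that cheap-to-measure tests are preferred when their gains are comparable.

```python
# Pick the feature with the best cost-adjusted information gain.
def cost_sensitive_score(gain, cost):
    return gain ** 2 / cost

features = {"temperature": (0.30, 1.0),    # (info gain, cost of measuring)
            "CAT scan":    (0.45, 50.0)}   # made-up numbers
best = max(features, key=lambda f: cost_sensitive_score(*features[f]))
print(best)   # "temperature": a bit less informative but far cheaper to measure
```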