
Page 1


Today’s Topics

• Read Chapter 3 & Section 4.1 (Skim Section 3.6 and rest of Chapter 4), Sections 5.1, 5.2, 5.3, 5.7, 5.8, & 5.9 (skim rest of Chapter 5) of textbook

• Reviewing the Info Gain Calc from Last Week

• HW0 due 11:55pm, HW1 due in one week (two with late days)

• Fun reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf

• Information Gain Derived (and Generalized to k Output Categories)

• Handling Numeric and Hierarchical Features

• Advanced Topic: Regression Trees

• The Trouble with Too Many Possible Values

• What if Measuring Features is Costly?

9/22/15 CS 540 - Fall 2015 (© Jude Shavlik), Lecture 5, Week 3

Page 2


ID3 Info Gain Measure Justified (Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of Information: The info conveyed by a message M depends on its probability, i.e.,

info(M) = -log2[Prob(M)]   (due to Claude Shannon)

Note: last week we used infoNeeded() as a more informative name for info()

The Supervised Learning Task: Select an example from a set S and announce that it belongs to class C

The probability of this occurring is approximately f_C, the fraction of C's in S

Hence the info in this announcement is, by definition, -log2(f_C)


Page 3


Let there be K different classes in set S, namely C1, C2, …, CK

What's the expected info from a message about the class of an example in set S?

info(S) is the average number of bits of information (by looking at feature values) needed to classify a member of set S

ID3 Info Gain Measure (cont.)

info(S) = - f_C1 log2(f_C1) - f_C2 log2(f_C2) - … - f_CK log2(f_CK)

info(S) = - Σ_{j=1..K} f_Cj log2(f_Cj),   where f_Cj = the fraction of set S that are of class Cj
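To make the formula concrete, here is a minimal Python sketch (the function names info and info_gain and the tiny data set are illustrative, not from the lecture). It computes info(S) from the class fractions and, echoing last week's review, a feature's information gain as info(S) minus the size-weighted info of the subsets produced by splitting on that feature.

```python
from collections import Counter
from math import log2

def info(labels):
    """info(S): expected number of bits needed to announce the class of a member of S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """Info gain of splitting on `feature`; each example is a dict of feature values."""
    n = len(labels)
    subsets = {}
    for ex, y in zip(examples, labels):
        subsets.setdefault(ex[feature], []).append(y)   # partition labels by feature value
    remainder = sum((len(sub) / n) * info(sub) for sub in subsets.values())
    return info(labels) - remainder

# Tiny illustrative (made-up) data set
examples = [{"Color": "Red"}, {"Color": "Red"}, {"Color": "Blue"}, {"Color": "Blue"}]
labels   = ["+", "+", "+", "-"]
print(round(info(labels), 3))                          # 0.811 bits
print(round(info_gain(examples, labels, "Color"), 3))  # 0.311 bits
```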


Page 4

[Slide carried over from 9/15/15 (Lecture 5, Week 2); no text was extracted for this page]

Page 5


Handling Hierarchical Features in ID3

Define a new feature for each level in hierarchy, e.g.,

Let ID3 choose the appropriate level of abstraction!

[Figure: a feature hierarchy with Shape at the root and Circular and Polygonal as its children]

Shape1 = { Circular, Polygonal }
Shape2 = { … }
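A minimal Python sketch of this trick (the lower-level shape values Circle, Ellipse, Triangle, and Square and the helper name are hypothetical, since the slide's figure does not list them): each example's most specific shape is mapped to one feature per level of the hierarchy, and ID3 can then split at whichever level of abstraction scores best.

```python
# Hypothetical hierarchy: each specific shape maps to its level-1 ancestor
LEVEL1_PARENT = {
    "Circle": "Circular", "Ellipse": "Circular",
    "Triangle": "Polygonal", "Square": "Polygonal",
}

def add_hierarchy_features(example):
    """Add one feature per level of the hierarchy, given a specific 'Shape' value."""
    specific = example["Shape"]
    return {
        **example,
        "Shape1": LEVEL1_PARENT[specific],  # coarse level: Circular vs. Polygonal
        "Shape2": specific,                 # fine level: the original value
    }

print(add_hierarchy_features({"Shape": "Ellipse"}))
# {'Shape': 'Ellipse', 'Shape1': 'Circular', 'Shape2': 'Ellipse'}
```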


Page 6


Handling Numeric Features in ID3

On the fly, create binary features and choose the best:

Step 1: Plot the current examples (green = pos, red = neg)

Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = (F < 8) and feature_new2 = (F < 10) (see the sketch below)

Step 3: Choose the split with the best info gain (it competes with all other features)

[Figure: the current examples plotted along an axis labeled "Value of Feature", with ticks at 5, 7, 9, 11, 13]


Note: “On the fly” means in each recursive call to ID3
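A small Python sketch of Step 2 (the helper name is an assumption, and the +/- labels are made up so that the midpoints come out at 8 and 10 as in the slide's example): sort the examples by the numeric feature and propose a threshold midway between every consecutive pair of points whose classes differ.

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)   # candidate binary feature: F < threshold
    return thresholds

# Made-up data along the slide's axis (ticks at 5, 7, 9, 11, 13)
values = [5, 7, 9, 11, 13]
labels = ["+", "+", "-", "+", "+"]
print(candidate_thresholds(values, labels))   # [8.0, 10.0]
```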

Page 7


Handling Numeric Features (cont.)

Technical Note

[Figure: a small d-tree whose root tests F < 10 and whose internal node tests F < 5, with T/F branches leading to + and - leaves]

Cannot discard a numeric feature after using it in one portion of the d-tree; as the figure shows, the same feature may usefully be split again, with a different threshold, lower in the tree


Page 8

Advanced Topic: Regression Trees (assume features are numerically valued)


[Figure: an example regression tree. The root tests Age > 25 (Yes/No); one branch then tests Gender (M/F); the leaves hold linear models such as Output = 4·f3 + 7·f5 - 2·f9, Output = 7·f6 - 2·f1 - 2·f8 + f7, and Output = 100·f4 - 2·f8]


Page 9

We want to return real values at the leaves

- For each feature F, "split" as done in ID3
- Use the residue (residual error) remaining after the fit, say using Linear Least Squares (LLS), instead of info gain to score candidate splits

Why not a weighted sum in total error?

Commonly the models at the leaves are weighted sums of the features (y = mx + b); some approaches just place constants at the leaves


Advanced Topic: Scoring “Splits” for Regression (Real-Valued) Problems

Error_F_i = Σ_{ex ∈ subset_i} [ out(ex) - LLS_i(ex) ]²

TotalError(F) = Σ_{i ∈ splits of F} Error_F_i

[Figure: a scatter plot of Output vs. X with its LLS fit line]
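A hedged Python sketch of this scoring rule (the function names and data layout are assumptions, and numpy's least-squares routine stands in for the slide's LLS): for a candidate split, fit a linear model to each resulting subset, sum the squared residuals, and prefer the split with the lowest total error.

```python
import numpy as np

def lls_error(X, y):
    """Sum of squared residuals of a linear least-squares fit (with intercept) on one subset."""
    A = np.column_stack([X, np.ones(len(X))])      # add a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(residuals @ residuals)

def total_error(X, y, feature_idx, threshold):
    """TotalError(F) for the candidate split 'feature < threshold'."""
    mask = X[:, feature_idx] < threshold
    return sum(lls_error(X[m], y[m]) for m in (mask, ~mask) if m.any())

# Made-up data: 6 examples, 2 numeric features, with a regime change at feature 0 = 3.5
X = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 9.0], [4.0, 11.0], [5.0, 13.0], [6.0, 15.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
print(total_error(X, y, feature_idx=0, threshold=3.5))   # ~0: this split separates the two regimes
```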


Page 10


Unfortunate Characteristic Property of Using the Info-Gain Measure

FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)

Extreme Case:

Splitting on something like a Student ID puts at most one example at each leaf, so every leaf's Info(·,·) score equals zero and the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data)

[Figure: a tree that splits on Student ID, with one branch per ID value (1, …, 99, …, 999999) and at most one example (e.g., 1+ 0- or 0+ 1-) at each leaf]
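A short self-contained Python check of this extreme case (the data are made up): an ID-like feature that takes a unique value on every example drives every leaf's info to zero, so its gain equals info(S), the maximum possible, while a genuinely useful feature scores lower.

```python
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    return info(labels) - sum(len(s) / n * info(s) for s in subsets.values())

labels     = ["+", "+", "+", "-", "-", "-"]
student_id = [1, 2, 3, 4, 5, 6]            # unique per example
useful     = ["a", "a", "a", "a", "b", "b"]

print(info(labels))                        # 1.0 bit
print(info_gain(student_id, labels))       # 1.0 bit: "perfect" gain, but it just memorizes the data
print(round(info_gain(useful, labels), 2)) # 0.46 bits
```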

Page 11


One Fix (used in HW0/HW1)

Convert all features to binary, e.g., Color = { Red, Blue, Green } (see the sketch below)

From one N-valued feature to N binary-valued features

Color = Red?

Color = Blue?

Color = Green?

Used in Neural Nets and SVMs

D-tree readability is probably reduced, but not necessarily
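A minimal Python sketch of the conversion (the helper name is illustrative): each N-valued feature becomes N yes/no features of the form "Feature = value?".

```python
def to_binary_features(example, value_sets):
    """Convert each N-valued feature into N binary 'Feature = value?' features."""
    binary = {}
    for feature, values in value_sets.items():
        for v in values:
            binary[f"{feature} = {v}?"] = (example.get(feature) == v)
    return binary

value_sets = {"Color": ["Red", "Blue", "Green"]}
print(to_binary_features({"Color": "Blue"}, value_sets))
# {'Color = Red?': False, 'Color = Blue?': True, 'Color = Green?': False}
```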


Page 12


Considering the Cost of Measuring a Feature

• Want trees with high accuracy and whose tests are inexpensive to compute

– take temperature vs. do CAT scan

• Common Heuristic

– InformationGain(F)² / Cost(F) (see the sketch below)

– Used in medical domains as well as robot-sensing tasks
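A one-function Python sketch of the heuristic (the function name and the gain/cost numbers are made up): features are ranked by squared information gain divided by measurement cost, so a cheap, moderately informative test can beat an expensive, somewhat more informative one.

```python
def cost_sensitive_score(gain, cost):
    """The slide's heuristic: InformationGain(F)^2 / Cost(F)."""
    return gain ** 2 / cost

# Hypothetical features: (information gain in bits, cost of measuring)
features = {"take temperature": (0.30, 1.0), "CAT scan": (0.45, 50.0)}
best = max(features, key=lambda f: cost_sensitive_score(*features[f]))
print(best)   # 'take temperature': the cheap test wins despite its lower gain
```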
