
Page 1


Today’s Topics

• Read Chapter 3 & Section 4.1 (Skim Section 3.6 and rest of Chapter 4), Sections 5.1, 5.2, 5.3, 5.7, 5.8, & 5.9 (skim rest of Chapter 5) of textbook

• Reviewing the Info Gain Calc from Last Week

• HW0 due 11:55pm, HW1 due in one week (two with late days)

• Fun reading: http://homes.cs.washington.edu/~pedrod/Prologue.pdf

• Information Gain Derived (and Generalized to k Output Categories)

• Handling Numeric and Hierarchical Features

• Advanced Topic: Regression Trees

• The Trouble with Too Many Possible Values

• What if Measuring Features is Costly?

9/22/15 CS 540 - Fall 2015 (© Jude Shavlik), Lecture 5, Week 3

Page 2


ID3 Info Gain Measure Justified (Ref. C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of Information: The info conveyed by a message M depends on its probability, i.e.,

info(M) = -log2[Prob(M)]   (due to Claude Shannon)

Note: last week we used infoNeeded() as a more informative name for info()

The Supervised Learning Task: Select an example from a set S and announce that it belongs to class C

The probability of this occurring is approximately f_C, the fraction of C's in S

Hence the info in this announcement is, by definition, -log2(f_C)


Page 3


Let there be K different classes in set S, namely C1, C2, …, CK

What's the expected info from a message about the class of an example in set S?

info(S) is the average number of bits of information (by looking at feature values) needed to classify a member of set S

ID3 Info Gain Measure (cont.)

info(S) = - f_C1 log2(f_C1) - f_C2 log2(f_C2) - … - f_CK log2(f_CK)

info(S) = - Σ_{j=1..K} f_Cj log2(f_Cj),   where f_Cj = the fraction of set S that are of class Cj
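To make the formula concrete, here is a minimal Python sketch (the function names info and info_gain and the tiny data set are illustrative, not from the lecture). It computes info(S) from the class fractions and, echoing last week's review, a feature's information gain as info(S) minus the size-weighted info of the subsets produced by splitting on that feature.

```python
from collections import Counter
from math import log2

def info(labels):
    """info(S): expected number of bits needed to announce the class of a member of S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, feature):
    """Info gain of splitting on `feature`; each example is a dict of feature values."""
    n = len(labels)
    subsets = {}
    for ex, y in zip(examples, labels):
        subsets.setdefault(ex[feature], []).append(y)   # partition labels by feature value
    remainder = sum((len(sub) / n) * info(sub) for sub in subsets.values())
    return info(labels) - remainder

# Tiny illustrative (made-up) data set
examples = [{"Color": "Red"}, {"Color": "Red"}, {"Color": "Blue"}, {"Color": "Blue"}]
labels   = ["+", "+", "+", "-"]
print(round(info(labels), 3))                          # 0.811 bits
print(round(info_gain(examples, labels, "Color"), 3))  # 0.311 bits
```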


Page 4

[Slide carried over from 9/15/15 (Lecture 5, Week 2); no text was extracted for this page]

Page 5


Handling Hierarchical Features in ID3

Define a new feature for each level in hierarchy, e.g.,

Let ID3 choose the appropriate level of abstraction!

[Figure: a feature hierarchy with Shape at the root and Circular and Polygonal as its children]

Shape1 = { Circular, Polygonal }
Shape2 = { … }
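A minimal Python sketch of this trick (the lower-level shape values Circle, Ellipse, Triangle, and Square and the helper name are hypothetical, since the slide's figure does not list them): each example's most specific shape is mapped to one feature per level of the hierarchy, and ID3 can then split at whichever level of abstraction scores best.

```python
# Hypothetical hierarchy: each specific shape maps to its level-1 ancestor
LEVEL1_PARENT = {
    "Circle": "Circular", "Ellipse": "Circular",
    "Triangle": "Polygonal", "Square": "Polygonal",
}

def add_hierarchy_features(example):
    """Add one feature per level of the hierarchy, given a specific 'Shape' value."""
    specific = example["Shape"]
    return {
        **example,
        "Shape1": LEVEL1_PARENT[specific],  # coarse level: Circular vs. Polygonal
        "Shape2": specific,                 # fine level: the original value
    }

print(add_hierarchy_features({"Shape": "Ellipse"}))
# {'Shape': 'Ellipse', 'Shape1': 'Circular', 'Shape2': 'Ellipse'}
```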


Page 6


Handling Numeric Features in ID3

On the fly, create binary features and choose the best:

Step 1: Plot the current examples (green = pos, red = neg)

Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = (F < 8) and feature_new2 = (F < 10) (see the sketch below)

Step 3: Choose the split with the best info gain (it competes with all other features)

[Figure: the current examples plotted along an axis labeled "Value of Feature", with ticks at 5, 7, 9, 11, 13]


Note: “On the fly” means in each recursive call to ID3
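A small Python sketch of Step 2 (the helper name is an assumption, and the +/- labels are made up so that the midpoints come out at 8 and 10 as in the slide's example): sort the examples by the numeric feature and propose a threshold midway between every consecutive pair of points whose classes differ.

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2)   # candidate binary feature: F < threshold
    return thresholds

# Made-up data along the slide's axis (ticks at 5, 7, 9, 11, 13)
values = [5, 7, 9, 11, 13]
labels = ["+", "+", "-", "+", "+"]
print(candidate_thresholds(values, labels))   # [8.0, 10.0]
```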

Page 7


Handling Numeric Features (cont.)

Technical Note

[Figure: a small d-tree whose root tests F < 10 and whose internal node tests F < 5, with T/F branches leading to + and - leaves]

Cannot discard a numeric feature after using it in one portion of the d-tree; as the figure shows, the same feature may usefully be split again, with a different threshold, lower in the tree


Page 8

Advanced Topic: Regression Trees (assume features are numerically valued)


[Figure: an example regression tree. The root tests Age > 25 (Yes/No); one branch then tests Gender (M/F); the leaves hold linear models such as Output = 4·f3 + 7·f5 - 2·f9, Output = 7·f6 - 2·f1 - 2·f8 + f7, and Output = 100·f4 - 2·f8]


Page 9

We want to return real values at the leaves

- For each feature F, "split" as done in ID3
- Use the residue (residual error) remaining after the fit, say using Linear Least Squares (LLS), instead of info gain to score candidate splits

Why not a weighted sum in total error?

Commonly the models at the leaves are weighted sums of the features (y = mx + b); some approaches just place constants at the leaves


Advanced Topic: Scoring “Splits” for Regression (Real-Valued) Problems

Error_F_i = Σ_{ex ∈ subset_i} [ out(ex) - LLS_i(ex) ]²

TotalError(F) = Σ_{i ∈ splits of F} Error_F_i

[Figure: a scatter plot of Output vs. X with its LLS fit line]
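A hedged Python sketch of this scoring rule (the function names and data layout are assumptions, and numpy's least-squares routine stands in for the slide's LLS): for a candidate split, fit a linear model to each resulting subset, sum the squared residuals, and prefer the split with the lowest total error.

```python
import numpy as np

def lls_error(X, y):
    """Sum of squared residuals of a linear least-squares fit (with intercept) on one subset."""
    A = np.column_stack([X, np.ones(len(X))])      # add a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    return float(residuals @ residuals)

def total_error(X, y, feature_idx, threshold):
    """TotalError(F) for the candidate split 'feature < threshold'."""
    mask = X[:, feature_idx] < threshold
    return sum(lls_error(X[m], y[m]) for m in (mask, ~mask) if m.any())

# Made-up data: 6 examples, 2 numeric features, with a regime change at feature 0 = 3.5
X = np.array([[1.0, 5.0], [2.0, 7.0], [3.0, 9.0], [4.0, 11.0], [5.0, 13.0], [6.0, 15.0]])
y = np.array([2.0, 4.0, 6.0, 20.0, 22.0, 24.0])
print(total_error(X, y, feature_idx=0, threshold=3.5))   # ~0: this split separates the two regimes
```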


Page 10


Unfortunate Characteristic Property of Using the Info-Gain Measure

FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values)

Extreme Case:

Splitting on something like a Student ID puts at most one example at each leaf, so every leaf's Info(·,·) score equals zero and the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data)

[Figure: a tree that splits on Student ID, with one branch per ID value (1, …, 99, …, 999999) and at most one example (e.g., 1+ 0- or 0+ 1-) at each leaf]
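A short self-contained Python check of this extreme case (the data are made up): an ID-like feature that takes a unique value on every example drives every leaf's info to zero, so its gain equals info(S), the maximum possible, while a genuinely useful feature scores lower.

```python
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    return info(labels) - sum(len(s) / n * info(s) for s in subsets.values())

labels     = ["+", "+", "+", "-", "-", "-"]
student_id = [1, 2, 3, 4, 5, 6]            # unique per example
useful     = ["a", "a", "a", "a", "b", "b"]

print(info(labels))                        # 1.0 bit
print(info_gain(student_id, labels))       # 1.0 bit: "perfect" gain, but it just memorizes the data
print(round(info_gain(useful, labels), 2)) # 0.46 bits
```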

Page 11


One Fix (used in HW0/HW1)

Convert all features to binary, e.g., Color = { Red, Blue, Green } (see the sketch below)

From one N-valued feature to N binary-valued features

Color = Red?

Color = Blue?

Color = Green?

Used in Neural Nets and SVMs

D-tree readability is probably reduced, but not necessarily
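A minimal Python sketch of the conversion (the helper name is illustrative): each N-valued feature becomes N yes/no features of the form "Feature = value?".

```python
def to_binary_features(example, value_sets):
    """Convert each N-valued feature into N binary 'Feature = value?' features."""
    binary = {}
    for feature, values in value_sets.items():
        for v in values:
            binary[f"{feature} = {v}?"] = (example.get(feature) == v)
    return binary

value_sets = {"Color": ["Red", "Blue", "Green"]}
print(to_binary_features({"Color": "Blue"}, value_sets))
# {'Color = Red?': False, 'Color = Blue?': True, 'Color = Green?': False}
```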


Page 12


Considering the Cost of Measuring a Feature

• Want trees with high accuracy and whose tests are inexpensive to compute

– take temperature vs. do CAT scan

• Common Heuristic

– InformationGain(F)² / Cost(F) (see the sketch below)

– Used in medical domains as well as robot-sensing tasks
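A one-function Python sketch of the heuristic (the function name and the gain/cost numbers are made up): features are ranked by squared information gain divided by measurement cost, so a cheap, moderately informative test can beat an expensive, somewhat more informative one.

```python
def cost_sensitive_score(gain, cost):
    """The slide's heuristic: InformationGain(F)^2 / Cost(F)."""
    return gain ** 2 / cost

# Hypothetical features: (information gain in bits, cost of measuring)
features = {"take temperature": (0.30, 1.0), "CAT scan": (0.45, 50.0)}
best = max(features, key=lambda f: cost_sensitive_score(*features[f]))
print(best)   # 'take temperature': the cheap test wins despite its lower gain
```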
