Information Theory, Classification & Decision Trees Ling 572 Advanced Statistical Methods in NLP January 5, 2012


TRANSCRIPT

Page 1:

Information Theory, Classification & Decision Trees
Ling 572: Advanced Statistical Methods in NLP
January 5, 2012

Page 2:

Information Theory

Page 3:

Entropy

Information-theoretic measure
Measures the information in a model
Conceptually, a lower bound on the number of bits needed to encode
Entropy H(X), where X is a random variable and p is its probability function:

H(X) = - sum_{x in X} p(x) log2 p(x)
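To make the definition concrete, a minimal Python sketch (mine, not from the slides); `probs` is assumed to be the distribution p as a list of probabilities summing to 1:

    import math

    def entropy(probs):
        """H(X) = -sum_x p(x) log2 p(x), in bits; 0 * log2(0) is treated as 0."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # A fair coin carries 1 bit of information; a biased coin carries less.
    print(entropy([0.5, 0.5]))    # 1.0
    print(entropy([0.25, 0.75]))  # ~0.811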

Page 4:

Cross-Entropy

Comparing models
The actual distribution p is unknown; use a simplified model m to estimate it
A closer match will have lower cross-entropy

H(p, m) = - sum_x p(x) log2 m(x)
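As an illustration only (not from the slides), a small sketch comparing two models against a true distribution, both assumed to be dicts over the same outcomes with m(x) > 0; the closer model gets the lower cross-entropy:

    import math

    def cross_entropy(p, m):
        """H(p, m) = -sum_x p(x) log2 m(x); p is the true distribution, m the model."""
        return -sum(px * math.log2(m[x]) for x, px in p.items() if px > 0)

    true_p = {"a": 0.5, "b": 0.5}
    good_m = {"a": 0.6, "b": 0.4}
    bad_m  = {"a": 0.9, "b": 0.1}
    print(cross_entropy(true_p, good_m))  # ~1.03 bits, close to H(p) = 1
    print(cross_entropy(true_p, bad_m))   # ~1.74 bits, a worse match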



Page 13:

Relative Entropy

Commonly known as Kullback-Leibler (KL) divergence
Expresses the difference between two probability distributions

KL(p || q) = sum_x p(x) log2 ( p(x) / q(x) )

Not a proper distance metric: asymmetric
KL(p||q) != KL(q||p)
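A small Python sketch (mine) that shows the asymmetry numerically; p and q are assumed to be dicts over the same support with q(x) > 0:

    import math

    def kl_divergence(p, q):
        """KL(p || q) = sum_x p(x) log2( p(x) / q(x) ); not symmetric in p and q."""
        return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

    p = {"a": 0.5, "b": 0.5}
    q = {"a": 0.9, "b": 0.1}
    print(kl_divergence(p, q))  # ~0.74
    print(kl_divergence(q, p))  # ~0.53 -- different, so KL is not symmetric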


Page 17:

Joint & Conditional Entropy

Joint entropy:
H(X, Y) = - sum_{x,y} p(x, y) log2 p(x, y)

Conditional entropy:
H(Y | X) = - sum_{x,y} p(x, y) log2 p(y | x) = H(X, Y) - H(X)
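An illustrative sketch (my own, with a made-up joint distribution) of both quantities, using the identity H(Y|X) = H(X,Y) - H(X) given above; the joint distribution is a dict {(x, y): p(x, y)}:

    import math

    def joint_entropy(joint):
        """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y)."""
        return -sum(p * math.log2(p) for p in joint.values() if p > 0)

    def conditional_entropy(joint):
        """H(Y|X) = H(X,Y) - H(X)."""
        px = {}
        for (x, _), p in joint.items():
            px[x] = px.get(x, 0.0) + p
        hx = -sum(p * math.log2(p) for p in px.values() if p > 0)
        return joint_entropy(joint) - hx

    joint = {("rain", "wet"): 0.4, ("rain", "dry"): 0.1,
             ("sun", "wet"): 0.1, ("sun", "dry"): 0.4}
    print(joint_entropy(joint))        # ~1.72
    print(conditional_entropy(joint))  # ~0.72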

Page 18:

Perplexity and Entropy

Given that the per-word cross-entropy of the language L under model P can be estimated as
H(L, P) = -(1/N) log2 P(w1 ... wN)

Consider the perplexity equation:
PP(W) = P(W)^(-1/N) = 2^(-(1/N) log2 P(W)) = 2^H(L,P)

where H is the entropy of the language L.
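A toy numeric check of that relationship (my own example): summing per-word log probabilities gives the cross-entropy estimate, and 2 raised to it gives the perplexity. `word_probs` is assumed to hold the model's probability for each word of the test sequence:

    import math

    def perplexity(word_probs):
        """PP(W) = P(W)^(-1/N) = 2^{-(1/N) sum_i log2 p(w_i)}."""
        n = len(word_probs)
        log_prob = sum(math.log2(p) for p in word_probs)
        return 2 ** (-log_prob / n)

    # A model that assigns every word probability 1/10 has perplexity 10.
    print(perplexity([0.1] * 20))  # 10.0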


Page 24:

Mutual Information

A measure of the information in common between two distributions

I(X; Y) = sum_{x,y} p(x, y) log2 ( p(x, y) / (p(x) p(y)) )

Symmetric: I(X;Y) = I(Y;X)

I(X;Y) = KL(p(x,y)||p(x)p(y))
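A small sketch (mine, not the course's) computing I(X;Y) from a joint distribution given as {(x, y): p(x, y)}:

    import math

    def mutual_information(joint):
        """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * math.log2(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)

    # Perfectly correlated binary variables share one full bit of information.
    print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))    # 1.0
    # Independent variables share none.
    print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                              (1, 0): 0.25, (1, 1): 0.25}))  # 0.0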

Page 25:

Decision Trees


Page 29:

Classification Task

Task:
  C is a finite set of labels (aka categories, classes)
  Given x, determine its category y in C

Instance: (x, y)
  x: the thing to be labeled/classified
  y: its label/class

Data: a set of instances
  Labeled data: y is known
  Unlabeled data: y is unknown

Training data, test data


Page 32:

Two Stages

Training:
  Learner: training data -> classifier
  Classifier: f(x) = y, where x is the input and y is in C

Testing:
  Decoder: test data + classifier -> classification output

Also: preprocessing, postprocessing, evaluation
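Purely to illustrate the two stages (the class, method names, and toy data below are mine, not the course's), a minimal learner whose training step produces a classifier f and whose testing step applies f(x):

    class MajorityClassifier:
        """Toy learner/classifier: training fits f, testing applies f(x) = y."""

        def train(self, labeled_data):
            # labeled_data: list of (x, y) instances; learn the most frequent label
            labels = [y for _, y in labeled_data]
            self.majority = max(set(labels), key=labels.count)
            return self

        def predict(self, x):
            # Decoder step: map an unlabeled instance to a label in C
            return self.majority

    clf = MajorityClassifier().train([({"f": 1}, "Burn"), ({"f": 2}, "None"), ({"f": 3}, "None")])
    print(clf.predict({"f": 4}))  # 'None'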


Page 34:

Roadmap

Decision Trees:
  Sunburn example
  Decision tree basics
  From trees to rules
  Key questions:
    Training procedure?
    Decoding procedure?
    Overfitting?
    Different feature types?
  Analysis: Pros & Cons

Page 35:

Sunburn Example

Name   Hair    Height   Weight   Lotion   Result
Sarah  Blonde  Average  Light    No       Burn
Dana   Blonde  Tall     Average  Yes      None
Alex   Brown   Short    Average  Yes      None
Annie  Blonde  Short    Average  No       Burn
Emily  Red     Average  Heavy    No       Burn
Pete   Brown   Tall     Heavy    No       None
John   Brown   Average  Heavy    No       None
Katie  Blonde  Short    Light    Yes      None

Page 36:

Learning about Sunburn

Goal:
  Train on labeled examples
  Predict Burn/None for new instances

Solution?
  Exact match: same features, same output
    Problem: 2 * 3^3 = 54 feature combinations; could be much worse
  Same label as the 'most similar' instance
    Problem: What counts as close? Which features matter?
    Many instances match on two features but differ in result

Page 37:

Learning about Sunburn

Better solution: decision tree

Training:
  Divide examples into subsets based on feature tests
  The sets of samples at the leaves define the classification

Prediction:
  Route a NEW instance through the tree to a leaf based on feature tests
  Assign the same value as the samples at that leaf

Page 38:

Sunburn Decision Tree

Hair Color
  Blonde -> Lotion Used
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

Page 39:

Decision Tree Structure

Internal nodes:
  Each node is a test
  Generally tests a single feature, e.g. Hair == ?
  Theoretically could test multiple features

Branches:
  Each branch corresponds to an outcome of the test
  E.g. Hair == Red; Hair != Blonde

Leaves:
  Each leaf corresponds to a decision
  Discrete class: classification/decision tree
  Real value: regression tree

Page 40:

From Trees to Rules

Tree:
  Each path from the root to a leaf = a sequence of tests => a classification
  Tests = the 'if' antecedents; leaf label = the consequent

All decision trees can be converted to rules; not all rule sets can be expressed as trees

Page 41:

From ID Trees to Rules

Hair Color
  Blonde -> Lotion Used
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))

Page 42:

Which Tree?

Many possible decision trees for any problem
How can we select among them?
What would be the 'best' tree?
  Smallest?
  Shallowest?
  Most accurate on unseen data?

Page 43:

Simplicity

Occam's Razor:
  The simplest explanation that covers the data is best

Occam's Razor for decision trees:
  The smallest tree consistent with the samples will be the best predictor for new data

Problem: finding all trees and finding the smallest is expensive!
Solution: greedily build a small tree

Page 44:

Building Trees: Basic Algorithm

Goal: build a small tree such that all samples at the leaves have the same class

Greedy solution:
  At each node, pick a test using the 'best' feature
  Split into subsets based on the outcomes of the feature test
  Repeat the process until a stopping criterion is met, i.e. until the leaves have the same class
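A compact sketch of this greedy loop (my own rendering, not the course's code); instances are (feature_dict, label) pairs, and `score(feature, instances)` is a pluggable stand-in for the 'best feature' criterion developed on the following slides:

    from collections import Counter

    def build_tree(instances, features, score):
        """Greedily grow a tree until each leaf is homogeneous (or features run out)."""
        labels = [y for _, y in instances]
        if len(set(labels)) == 1 or not features:        # stopping criterion
            return Counter(labels).most_common(1)[0][0]  # leaf: majority label
        best = max(features, key=lambda f: score(f, instances))
        tree = {"test": best, "branches": {}}
        for value in set(x[best] for x, _ in instances): # one branch per outcome
            subset = [(x, y) for x, y in instances if x[best] == value]
            rest = [f for f in features if f != best]
            tree["branches"][value] = build_tree(subset, rest, score)
        return tree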

Page 45:

Key Questions

Splitting: How do we select the 'best' feature?
Stopping: When do we stop splitting to avoid overfitting?
Features: How do we split on different types of features?
  Binary? Discrete? Continuous?

Page 46:

Building Decision Trees: I

Goal: build a small tree such that all samples at the leaves have the same class

Greedy solution:
  At each node, pick the test such that the branches are closest to having the same class
  Split into subsets where most instances are in a uniform class

Page 47:

Picking a Test

Hair Color
  Blonde -> Sarah: B, Dana: N, Annie: B, Katie: N
  Red    -> Emily: B
  Brown  -> Alex: N, Pete: N, John: N

Height
  Short   -> Alex: N, Annie: B, Katie: N
  Average -> Sarah: B, Emily: B, John: N
  Tall    -> Dana: N, Pete: N

Weight
  Light   -> Sarah: B, Katie: N
  Average -> Dana: N, Alex: N, Annie: B
  Heavy   -> Emily: B, Pete: N, John: N

Lotion
  No  -> Sarah: B, Annie: B, Emily: B, Pete: N, John: N
  Yes -> Dana: N, Alex: N, Katie: N

Page 48:

Picking a Test (within the Blonde subset)

Height
  Short   -> Annie: B, Katie: N
  Average -> Sarah: B
  Tall    -> Dana: N

Weight
  Light   -> Sarah: B, Katie: N
  Average -> Dana: N, Annie: B
  Heavy   -> (none)

Lotion
  No  -> Sarah: B, Annie: B
  Yes -> Dana: N, Katie: N

Page 49:

Measuring Disorder

Problem: In general, tests on large databases don't yield homogeneous subsets

Solution: a general information-theoretic measure of disorder
Desired features:
  Homogeneous set: least disorder = 0
  Even split: most disorder = 1

Page 50:

Measuring Entropy

If we split m objects into 2 bins of size m1 and m2, what is the entropy?

H = -(m1/m) log2 (m1/m) - (m2/m) log2 (m2/m)
  = - sum_i (m_i/m) log2 (m_i/m)

[Figure: disorder (entropy) plotted against m1/m, rising from 0 at m1/m = 0 to 1 at an even split (m1/m = 0.5) and falling back to 0 at m1/m = 1]

Page 51:

Measuring Disorder: Entropy

p_i = m_i / m : the probability of being in bin i

Entropy (disorder) of a split:
H = - sum_i p_i log2 p_i
where sum_i p_i = 1 and 0 <= p_i <= 1; assume 0 log2 0 = 0

p1    p2    Entropy
1/2   1/2   -1/2 log2 1/2 - 1/2 log2 1/2 = 1/2 + 1/2 = 1
1/4   3/4   -1/4 log2 1/4 - 3/4 log2 3/4 = 0.5 + 0.311 = 0.811
1     0     -1 log2 1 - 0 log2 0 = 0 - 0 = 0
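A quick check of the table (illustrative code, not from the slides):

    import math

    def binary_entropy(p1):
        """-p1 log2 p1 - p2 log2 p2 with p2 = 1 - p1, taking 0 log2 0 = 0."""
        return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

    for p1 in (0.5, 0.25, 1.0):
        print(p1, round(binary_entropy(p1), 3))  # 1.0, 0.811, 0.0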

Page 52:

Information Gain

InfoGain(Y|X): how many bits can we save if we know X?

InfoGain(Y|X) = H(Y) - H(Y|X)
(equivalent to InfoGain(Y,X))

Page 53:

Information Gain

InfoGain(S,A): expected reduction in the entropy of S due to splitting on attribute A

InfoGain(S,A) = H(S) - sum_{a in Values(A)} (|S_a| / |S|) H(S_a)

Select the A with maximum InfoGain, i.e. the one resulting in the lowest average entropy
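A minimal sketch of that computation (my own code); `xs` and `ys` are assumed to be parallel lists holding each instance's value for feature A and its class label:

    import math
    from collections import Counter

    def entropy(labels):
        """H(S) = -sum_i p_i log2 p_i over the class distribution of `labels`."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(xs, ys):
        """InfoGain(S, A) = H(S) - sum_a (|S_a| / |S|) * H(S_a)."""
        n = len(ys)
        remainder = 0.0
        for value in set(xs):
            subset = [y for x, y in zip(xs, ys) if x == value]
            remainder += (len(subset) / n) * entropy(subset)
        return entropy(ys) - remainder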

Page 54:

Computing Average Entropy

AvgEntropy = sum_i (fraction of samples down branch i) x (disorder of the class distribution on branch i)

[Diagram: |S| instances split into Branch 1 and Branch 2; S_a1,a and S_a1,b are the counts of classes a and b on branch 1, and S_a2,a and S_a2,b the counts on branch 2]

Page 55:

Entropy in Sunburn Example

S = [3B, 5N], so H(S) = 0.954

InfoGain(Hair color) = 0.954 - (4/8 (-2/4 log2 2/4 - 2/4 log2 2/4) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
InfoGain(Height) = 0.954 - 0.69 = 0.264
InfoGain(Weight) = 0.954 - 0.94 = 0.014
InfoGain(Lotion) = 0.954 - 0.61 = 0.344

Hair color has the highest gain, so it is the first split.
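For the record, a short script (mine, not the course's) that recomputes these gains from the Page 35 table; it reproduces the ranking exactly, with small differences in the third decimal place because the slide rounds the intermediate entropies:

    import math
    from collections import Counter

    data = [  # (Name, Hair, Height, Weight, Lotion, Result)
        ("Sarah", "Blonde", "Average", "Light",   "No",  "Burn"),
        ("Dana",  "Blonde", "Tall",    "Average", "Yes", "None"),
        ("Alex",  "Brown",  "Short",   "Average", "Yes", "None"),
        ("Annie", "Blonde", "Short",   "Average", "No",  "Burn"),
        ("Emily", "Red",    "Average", "Heavy",   "No",  "Burn"),
        ("Pete",  "Brown",  "Tall",    "Heavy",   "No",  "None"),
        ("John",  "Brown",  "Average", "Heavy",   "No",  "None"),
        ("Katie", "Blonde", "Short",   "Light",   "Yes", "None"),
    ]
    features = {"Hair": 1, "Height": 2, "Weight": 3, "Lotion": 4}

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(col):
        labels = [row[5] for row in data]
        gain = entropy(labels)
        for value in set(row[col] for row in data):
            subset = [row[5] for row in data if row[col] == value]
            gain -= (len(subset) / len(data)) * entropy(subset)
        return gain

    for name, col in features.items():
        print(name, round(info_gain(col), 3))
    # Hair 0.454, Height 0.266, Weight 0.016, Lotion 0.348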

Page 56:

Entropy in Sunburn Example (Blonde subset)

S = [2B, 2N], so H(S) = 1

InfoGain(Height) = 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
InfoGain(Weight) = 1 - (2/4 (-1/2 log2 1/2 - 1/2 log2 1/2) + 2/4 (-1/2 log2 1/2 - 1/2 log2 1/2)) = 1 - 1 = 0
InfoGain(Lotion) = 1 - 0 = 1

Lotion has the highest gain, so it is the next split.

Page 57:

Building Decision Trees with Information Gain

Until there are no inhomogeneous leaves:
  Select an inhomogeneous leaf node
  Replace that leaf node by a test node that creates the subsets yielding the highest information gain

Effectively creates a set of rectangular regions
  Repeatedly draws lines in different axes

Page 58:

Alternate Measures

Issue with Information Gain: favors features with more values

Option: Gain Ratio
GainRatio(S, A) = InfoGain(S, A) / SplitInfo(S, A)
SplitInfo(S, A) = - sum_{a in Values(A)} (|S_a| / |S|) log2 (|S_a| / |S|)
S_a: elements of S with value A = a
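A sketch of that computation (the formula above is the usual C4.5-style definition; the code is my own, with `xs` and `ys` as parallel feature-value and label lists):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_ratio(xs, ys):
        """GainRatio = InfoGain / SplitInfo; penalizes many-valued features."""
        n = len(ys)
        remainder, split_info = 0.0, 0.0
        for value, count in Counter(xs).items():
            subset = [y for x, y in zip(xs, ys) if x == value]
            remainder += (count / n) * entropy(subset)
            split_info -= (count / n) * math.log2(count / n)
        return (entropy(ys) - remainder) / split_info if split_info > 0 else 0.0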

Page 59:

Overfitting

Overfitting: the model fits the training data TOO well
  Fits noise and irrelevant details

Why is this bad?
  Harms generalization: fits the training data well, but fits new data badly

For a model m, define training_error(m) and D_error(m), where D = all data
m overfits if there is another model m' with training_error(m) < training_error(m') but D_error(m) > D_error(m')

Page 60:

Avoiding Overfitting

Strategies to avoid overfitting:

Early stopping:
  Stop when InfoGain < threshold
  Stop when the number of instances < threshold
  Stop when tree depth > threshold

Post-pruning:
  Grow the full tree, then remove branches

Which is better? Unclear; both are used. For some applications, post-pruning is better.

Page 61:

Post-Pruning

Divide the data into:
  Training set: used to build the original tree
  Validation set: used to perform pruning

Build the decision tree on the training data

Until further pruning reduces validation set performance:
  Compute performance for pruning each node (and its children)
  Greedily remove the nodes whose removal does not reduce validation set performance

Yields a smaller tree with the best performance

Page 62:

Performance Measures

Compute accuracy on:
  A validation set
  k-fold cross-validation

Weighted classification error cost:
  Weight some types of errors more heavily

Minimum description length:
  Favor good accuracy with compact models
  MDL = error(tree) + model_size(tree)

Page 63:

Rule Post-Pruning

Convert the tree to rules
Prune rules independently
Sort the final rule set

Probably the most widely used method (in toolkits)

Page 64:

Modeling Features

Different types of features need different tests

Binary: test branches on true/false
Discrete: one branch for each discrete value
Continuous: need to discretize
  Enumerating all values is not possible or desirable
  Pick a value x; branches: value < x and value >= x
  How can we pick split points?

Page 65:

Picking Splits

Need useful, sufficient split points
What's a good strategy?

Approach:
  Sort all values for the feature in the training data
  Identify adjacent instances of different classes
  Candidate split points lie between those instances
  Select the candidate with the highest information gain
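A sketch of that strategy (my own illustration) for one continuous feature with values `xs` and labels `ys`; each returned midpoint would then be scored by information gain:

    def candidate_splits(xs, ys):
        """Midpoints between adjacent sorted values whose instances differ in class."""
        pairs = sorted(zip(xs, ys))
        candidates = []
        for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
            if y1 != y2 and x1 != x2:
                candidates.append((x1 + x2) / 2)
        return candidates

    # Example: the label changes between 1.7 and 2.3, and between 2.9 and 3.4.
    print(candidate_splits([1.2, 1.7, 2.3, 2.9, 3.4], ["N", "N", "B", "B", "N"]))
    # [2.0, 3.15]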

Page 66:

Features in Decision Trees: Pros

Feature selection:
  Tests the features that yield low disorder
  E.g. selects the features that are important!
  Ignores irrelevant features

Feature type handling:
  Discrete type: one branch per value
  Continuous type: branch on >= value

Absent features: distribute uniformly

Page 67:

Features in Decision Trees: Cons

Features are assumed independent
  If you want a group effect, you must model it explicitly
  E.g. make a new feature AorB

Feature tests are conjunctive

Page 68:

Decision Trees

Train:
  Build the tree by forming subsets of least disorder

Predict:
  Traverse the tree based on feature tests
  Assign the label of the samples at the leaf node

Pros: robust to irrelevant features and some noise, fast prediction, perspicuous rule reading
Cons: poor handling of feature combinations and dependencies; building the optimal tree is intractable