
Information Theory, Classification & Decision Trees

Ling 572: Advanced Statistical Methods in NLP

January 5, 2012


Information Theory

Entropy
Information-theoretic measure
Measures the information in a model
Conceptually, a lower bound on the number of bits needed to encode the data
Entropy H(X), where X is a random variable and p is its probability function:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
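As a quick illustration of the definition above (my own sketch, not from the slides), entropy can be computed directly from a list of outcome probabilities:

import math

def entropy(probs):
    # Entropy in bits of a discrete distribution given as probabilities.
    # By convention, 0 * log2(0) is treated as 0.
    h = 0.0
    for p in probs:
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy([0.5, 0.5]))    # a fair coin: 1.0 bit
print(entropy([0.25, 0.75]))  # a biased coin: ~0.811 bits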


Cross-Entropy
Comparing models:
The actual distribution p is unknown; use a simplified model m to estimate it
The closer m matches p, the lower the cross-entropy
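The defining formula appears to have been an image on the original slide; the standard definition of the cross-entropy of the true distribution p under a model m is

H(p, m) = -\sum_{x} p(x) \log_2 m(x)

which satisfies H(p, m) >= H(p), with equality exactly when m = p.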


Relative Entropy
Commonly known as the Kullback-Leibler (KL) divergence
Expresses the difference between two probability distributions
Not a proper distance metric: it is asymmetric, KL(p||q) != KL(q||p)
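The formula itself did not survive the transcript; the standard definition is

KL(p || q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}

which equals the cross-entropy of q under p minus the entropy of p, i.e. KL(p || q) = H(p, q) - H(p).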


Joint & Conditional Entropy

Joint entropy:

Conditional entropy:
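Neither formula survived the transcript; the standard definitions (added here for reference) are

H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(x, y)

H(Y | X) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y | x) = H(X, Y) - H(X)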

Perplexity and Entropy
Given that the per-word entropy of a word sequence W of length N is estimated as H(L, P) = -\frac{1}{N} \log_2 P(W),
consider the perplexity equation:
PP(W) = P(W)^{-1/N} = 2^{-\frac{1}{N} \log_2 P(W)} = 2^{H(L,P)}
where H is the entropy of the language L


Mutual Information
Measure of the information in common between two distributions
Symmetric: I(X;Y) = I(Y;X)
I(X;Y) = KL(p(x,y) || p(x)p(y))
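Writing out that KL divergence gives the more familiar form

I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X | Y) = H(Y) - H(Y | X)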


Decision Trees


Classification Task
Task: C is a finite set of labels (a.k.a. categories, classes); given x, determine its category y in C
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class
Data: a set of instances; labeled data: y is known; unlabeled data: y is unknown
Training data, test data


Two Stages
Training: a learner maps training data to a classifier
Classifier: f(x) = y, where x is the input and y is in C
Testing: a decoder maps test data + the classifier to classification output
Also: preprocessing, postprocessing, evaluation


Roadmap
Decision Trees:
Sunburn example
Decision tree basics
From trees to rules
Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types?
Analysis: Pros & Cons

Sunburn Example

Name   Hair    Height   Weight   Lotion   Result
Sarah  Blonde  Average  Light    No       Burn
Dana   Blonde  Tall     Average  Yes      None
Alex   Brown   Short    Average  Yes      None
Annie  Blonde  Short    Average  No       Burn
Emily  Red     Average  Heavy    No       Burn
Pete   Brown   Tall     Heavy    No       None
John   Brown   Average  Heavy    No       None
Katie  Blonde  Short    Light    Yes      None
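For the worked examples later in the deck, it can help to have this table in machine-readable form; one possible Python encoding (the field names are my own, not from the slides):

# Sunburn training data as (features, label) pairs, in the order Sarah .. Katie.
SUNBURN_DATA = [
    ({"hair": "blonde", "height": "average", "weight": "light",   "lotion": "no"},  "burn"),   # Sarah
    ({"hair": "blonde", "height": "tall",    "weight": "average", "lotion": "yes"}, "none"),   # Dana
    ({"hair": "brown",  "height": "short",   "weight": "average", "lotion": "yes"}, "none"),   # Alex
    ({"hair": "blonde", "height": "short",   "weight": "average", "lotion": "no"},  "burn"),   # Annie
    ({"hair": "red",    "height": "average", "weight": "heavy",   "lotion": "no"},  "burn"),   # Emily
    ({"hair": "brown",  "height": "tall",    "weight": "heavy",   "lotion": "no"},  "none"),   # Pete
    ({"hair": "brown",  "height": "average", "weight": "heavy",   "lotion": "no"},  "none"),   # John
    ({"hair": "blonde", "height": "short",   "weight": "light",   "lotion": "yes"}, "none"),   # Katie
]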

Learning about Sunburn
Goal: train on labeled examples; predict Burn/None for new instances
Solution?
Exact match: same features, same output
Problem: 2 * 3^3 = 54 feature combinations, and it could be much worse
Same label as the 'most similar' instance
Problem: what counts as close? Which features matter?
Many instances match on two features but differ in result

Learning about Sunburn
Better solution: a decision tree
Training: divide the examples into subsets based on feature tests; the sets of samples at the leaves define the classification
Prediction: route a NEW instance through the tree to a leaf based on the feature tests; assign the same value as the samples at that leaf

Sunburn Decision Tree

Hair Color?
  Blonde -> Lotion Used?
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

Decision Tree Structure
Internal nodes: each node is a test; generally tests a single feature, e.g. Hair == ?; theoretically it could test multiple features
Branches: each branch corresponds to an outcome of the test, e.g. Hair == Red, Hair != Blonde
Leaves: each leaf corresponds to a decision
Discrete class: classification / decision tree; real value: regression

From Trees to Rules
Tree: each branch from root to leaf = a sequence of tests => a classification
Tests = if-antecedents; leaf label = consequent
All decision trees can be converted to rules; not all rule sets can be expressed as trees

From ID Trees to Rules

Hair Color?
  Blonde -> Lotion Used?
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))

Which Tree?
Many possible decision trees exist for any problem
How can we select among them? What would be the 'best' tree?
Smallest? Shallowest? Most accurate on unseen data?

Simplicity
Occam's Razor: the simplest explanation that covers the data is best
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data
Problem: finding all trees and finding the smallest one is expensive!
Solution: greedily build a small tree

Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution:
At each node, pick a test using the 'best' feature
Split into subsets based on the outcomes of the feature test
Repeat the process until the stopping criterion is met, i.e. until all leaves have samples of the same class

Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting to avoid overfitting?
Features: how do we split different types of features? Binary? Discrete? Continuous?

Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution:
At each node, pick a test such that the branches are as close as possible to having a single class
Split into subsets where most instances are in a uniform class

Picking a Test

Hair Color:  Blonde -> Sarah: B, Dana: N, Annie: B, Katie: N
             Red    -> Emily: B
             Brown  -> Alex: N, Pete: N, John: N

Height:      Short   -> Alex: N, Annie: B, Katie: N
             Average -> Sarah: B, Emily: B, John: N
             Tall    -> Dana: N, Pete: N

Weight:      Light   -> Sarah: B, Katie: N
             Average -> Dana: N, Alex: N, Annie: B
             Heavy   -> Emily: B, Pete: N, John: N

Lotion:      No  -> Sarah: B, Annie: B, Emily: B, Pete: N, John: N
             Yes -> Dana: N, Alex: N, Katie: N

Picking a Test (among the blonde-haired instances)

Height:  Short   -> Annie: B, Katie: N
         Average -> Sarah: B
         Tall    -> Dana: N

Weight:  Light   -> Sarah: B, Katie: N
         Average -> Dana: N, Annie: B
         Heavy   -> (none)

Lotion:  No  -> Sarah: B, Annie: B
         Yes -> Dana: N, Katie: N

Measuring Disorder
Problem: in general, tests on large databases don't yield homogeneous subsets
Solution: a general information-theoretic measure of disorder
Desired properties:
Homogeneous set: least disorder = 0
Even split: most disorder = 1

Measuring Entropy
If we split m objects into 2 bins of size m_1 and m_2, what is the entropy?

Disorder = -\frac{m_1}{m} \log_2 \frac{m_1}{m} - \frac{m_2}{m} \log_2 \frac{m_2}{m} = -\sum_i \frac{m_i}{m} \log_2 \frac{m_i}{m}

[Plot: disorder as a function of m_1/m on [0, 1], rising from 0, peaking at 1 when m_1/m = 0.5, and falling back to 0]

Measuring Disorder: Entropy
p_i = m_i / m is the probability of being in bin i, with \sum_i p_i = 1 and 0 <= p_i <= 1
Entropy (disorder) of a split: -\sum_i p_i \log_2 p_i
Assume 0 \log_2 0 = 0

p_1   p_2   Entropy
1/2   1/2   -1/2 log2(1/2) - 1/2 log2(1/2) = 1/2 + 1/2 = 1
1/4   3/4   -1/4 log2(1/4) - 3/4 log2(3/4) = 0.5 + 0.311 = 0.811
1     0     -1 log2(1) - 0 log2(0) = 0 - 0 = 0
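As a sanity check on the table above, a minimal Python sketch (mine, not from the slides) that reproduces the three entropy values from the bin sizes:

import math

def bin_entropy(m1, m2):
    # Entropy (disorder) in bits of splitting m = m1 + m2 objects into two bins.
    m = m1 + m2
    h = 0.0
    for mi in (m1, m2):
        if mi > 0:
            p = mi / m
            h -= p * math.log2(p)
    return h

for m1, m2 in [(1, 1), (1, 3), (4, 0)]:
    print(m1, m2, round(bin_entropy(m1, m2), 3))
# Prints 1.0, 0.811, and 0.0, matching the three rows of the table.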

Information Gain
InfoGain(Y|X): how many bits can we save if we know X?
InfoGain(Y|X) = H(Y) - H(Y|X)
(equivalently written InfoGain(Y, X))

Information Gain
InfoGain(S, A): the expected reduction in the entropy of sample set S due to splitting on attribute A
Select the A with maximum InfoGain, i.e. the one resulting in the lowest average entropy
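In the S_a notation introduced below (S_a = elements of S with value A = a), this is presumably the standard definition

InfoGain(S, A) = H(S) - \sum_{a \in values(A)} \frac{|S_a|}{|S|} H(S_a)

where the second term is the average entropy computed on the next slide.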

Computing Average Entropy

AvgEntropy = \sum_i \frac{|S_{a_i}|}{|S|} H(S_{a_i})

where |S_{a_i}| / |S| is the fraction of the |S| samples sent down branch i, and H(S_{a_i}) is the disorder of the class distribution on branch i. (For two branches and two classes a and b, branch 1 contains S_{a_1,a} and S_{a_1,b} instances of the two classes, and branch 2 contains S_{a_2,a} and S_{a_2,b}.)

Entropy in Sunburn Example

S = [3B, 5N], so H(S) = 0.954

InfoGain(Hair color) = 0.954 - (4/8 * (-2/4 log2(2/4) - 2/4 log2(2/4)) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
InfoGain(Height) = 0.954 - 0.69 = 0.264
InfoGain(Weight) = 0.954 - 0.94 = 0.014
InfoGain(Lotion) = 0.954 - 0.61 = 0.344

Entropy in Sunburn Example (blonde subset)

S = [2B, 2N], so H(S) = 1

InfoGain(Height) = 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
InfoGain(Weight) = 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2))) = 1 - 1 = 0
InfoGain(Lotion) = 1 - 0 = 1

Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node
Replace that leaf node by a test node creating the subsets that yield the highest information gain
Effectively creates a set of rectangular regions, repeatedly drawing lines in different axes
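A minimal Python sketch of this greedy, information-gain-based training loop on the sunburn data (my own illustration of the procedure described above, not course code; helper names such as info_gain and build_tree are invented):

import math
from collections import Counter

FEATURES = ["hair", "height", "weight", "lotion"]
ROWS = [  # same sunburn data as above: hair, height, weight, lotion, label (Sarah .. Katie)
    ("blonde", "average", "light",   "no",  "burn"),
    ("blonde", "tall",    "average", "yes", "none"),
    ("brown",  "short",   "average", "yes", "none"),
    ("blonde", "short",   "average", "no",  "burn"),
    ("red",    "average", "heavy",   "no",  "burn"),
    ("brown",  "tall",    "heavy",   "no",  "none"),
    ("brown",  "average", "heavy",   "no",  "none"),
    ("blonde", "short",   "light",   "yes", "none"),
]
DATA = [(dict(zip(FEATURES, row[:-1])), row[-1]) for row in ROWS]

def entropy(instances):
    # Entropy (in bits) of the label distribution over a set of (features, label) pairs.
    counts = Counter(label for _, label in instances)
    return -sum((c / len(instances)) * math.log2(c / len(instances)) for c in counts.values())

def info_gain(instances, feature):
    # H(S) minus the size-weighted average entropy of the subsets, one per feature value.
    subsets = {}
    for feats, label in instances:
        subsets.setdefault(feats[feature], []).append((feats, label))
    avg = sum(len(sub) / len(instances) * entropy(sub) for sub in subsets.values())
    return entropy(instances) - avg

def build_tree(instances, features):
    # Returns a label (leaf) or a (feature, {value: subtree}) test node.
    labels = {label for _, label in instances}
    if len(labels) == 1 or not features:  # homogeneous leaf, or no feature left to test
        return Counter(label for _, label in instances).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(instances, f))
    branches = {}
    for feats, label in instances:
        branches.setdefault(feats[best], []).append((feats, label))
    rest = [f for f in features if f != best]
    return (best, {value: build_tree(sub, rest) for value, sub in branches.items()})

print(build_tree(DATA, FEATURES))
# Expected shape: ('hair', {'blonde': ('lotion', {'no': 'burn', 'yes': 'none'}),
#                           'brown': 'none', 'red': 'burn'})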

Alternate Measures
Issue with information gain: it favors features with more values
Option: Gain Ratio
S_a: elements of S with value A = a
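The gain-ratio formula itself did not survive the transcript; the standard (C4.5-style) definition in this notation is

GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitInfo(S, A)}, \quad SplitInfo(S, A) = -\sum_{a} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}

so features that split S into many small subsets are penalized by a large SplitInfo term.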

Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details
Why is this bad? It harms generalization: the model fits the training data too well but fits new data badly
For a model m, consider training_error(m) and D_error(m), where D = all data
If m overfits, then for some other model m', training_error(m) < training_error(m') but D_error(m) > D_error(m')

Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold, when the number of instances < threshold, or when tree depth > threshold
Post-pruning: grow the full tree, then remove branches
Which is better? Unclear; both are used. For some applications, post-pruning is better

Post-Pruning
Divide the data into:
Training set: used to build the original tree
Validation set: used to perform pruning
Build the decision tree on the training data
While pruning does not reduce validation set performance:
Compute the validation performance of pruning each node (and its children)
Greedily remove the nodes that do not reduce validation set performance
Yields a smaller tree with the best performance

Performance Measures
Compute accuracy on a validation set, or via k-fold cross-validation
Weighted classification error cost: weight some types of errors more heavily
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree)

Rule Post-Pruning
Convert the tree to rules
Prune each rule independently
Sort the final rule set
Probably the most widely used method (in toolkits)

Modeling Features
Different types of features need different tests:
Binary: branch on true/false
Discrete: one branch for each discrete value
Continuous: need to discretize; enumerating all values is not possible or desirable, so pick a value x and branch on value < x vs. value >= x
How can we pick the split points?

Picking Splits
Need useful, sufficient split points
What's a good strategy?
Approach:
Sort all values of the feature in the training data
Identify adjacent instances with different classes
Candidate split points lie between those instances
Select the candidate with the highest information gain
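A minimal sketch of that strategy for a single continuous feature (illustrative only; the function name and toy data are my own):

def candidate_splits(values, labels):
    # Midpoints between adjacent sorted values whose neighboring instances differ in class.
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

# The class changes between 3 and 7 and again between 9 and 12, giving two candidates.
print(candidate_splits([1, 3, 7, 9, 12], ["a", "a", "b", "b", "a"]))  # [5.0, 10.5]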

Features in Decision Trees: Pros
Feature selection: tests the features that yield low disorder, i.e. selects the features that are important and ignores irrelevant ones
Feature type handling: discrete type: one branch per value; continuous type: branch on >= value
Absent features: distribute uniformly

Features in Decision Trees: Cons
Features are assumed independent; to capture a group effect, it must be modeled explicitly, e.g. by creating a new feature AorB
Feature tests are conjunctive

Decision Trees
Train: build the tree by forming subsets of least disorder
Predict: traverse the tree based on feature tests; assign the label of the samples at the leaf node
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rules can be read off the tree
Cons: poor at feature combination and dependency; building the optimal tree is intractable
