
Information Theory, Classification & Decision Trees

Ling 572: Advanced Statistical Methods in NLP

January 5, 2012


Information Theory

Entropy
Information-theoretic measure
Measures the information in a model
Conceptually, a lower bound on the number of bits needed to encode the data
Entropy H(X), where X is a random variable and p is its probability function:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
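As a quick illustration of the definition above (my own sketch, not from the slides), entropy can be computed directly from a list of outcome probabilities:

import math

def entropy(probs):
    # Entropy in bits of a discrete distribution given as probabilities.
    # By convention, 0 * log2(0) is treated as 0.
    h = 0.0
    for p in probs:
        if p > 0:
            h -= p * math.log2(p)
    return h

print(entropy([0.5, 0.5]))    # a fair coin: 1.0 bit
print(entropy([0.25, 0.75]))  # a biased coin: ~0.811 bits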


Cross-Entropy
Comparing models:
The actual distribution p is unknown; use a simplified model m to estimate it
The closer m matches p, the lower the cross-entropy
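The defining formula appears to have been an image on the original slide; the standard definition of the cross-entropy of the true distribution p under a model m is

H(p, m) = -\sum_{x} p(x) \log_2 m(x)

which satisfies H(p, m) >= H(p), with equality exactly when m = p.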


Relative Entropy
Commonly known as the Kullback-Leibler (KL) divergence
Expresses the difference between two probability distributions
Not a proper distance metric: it is asymmetric, KL(p||q) != KL(q||p)
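The formula itself did not survive the transcript; the standard definition is

KL(p || q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}

which equals the cross-entropy of q under p minus the entropy of p, i.e. KL(p || q) = H(p, q) - H(p).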


Joint & Conditional Entropy

Joint entropy:

Conditional entropy:
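Neither formula survived the transcript; the standard definitions (added here for reference) are

H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(x, y)

H(Y | X) = -\sum_{x}\sum_{y} p(x, y) \log_2 p(y | x) = H(X, Y) - H(X)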

Perplexity and Entropy
Given that the per-word entropy of a word sequence W of length N is estimated as H(L, P) = -\frac{1}{N} \log_2 P(W),
consider the perplexity equation:
PP(W) = P(W)^{-1/N} = 2^{-\frac{1}{N} \log_2 P(W)} = 2^{H(L,P)}
where H is the entropy of the language L


Mutual Information
Measure of the information in common between two distributions
Symmetric: I(X;Y) = I(Y;X)
I(X;Y) = KL(p(x,y) || p(x)p(y))
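Writing out that KL divergence gives the more familiar form

I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)} = H(X) - H(X | Y) = H(Y) - H(Y | X)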


Decision Trees


Classification Task
Task: C is a finite set of labels (a.k.a. categories, classes); given x, determine its category y in C
Instance: (x, y), where x is the thing to be labeled/classified and y is its label/class
Data: a set of instances; labeled data: y is known; unlabeled data: y is unknown
Training data, test data


Two Stages
Training: a learner maps training data to a classifier
Classifier: f(x) = y, where x is the input and y is in C
Testing: a decoder maps test data + the classifier to classification output
Also: preprocessing, postprocessing, evaluation


Roadmap
Decision Trees:
Sunburn example
Decision tree basics
From trees to rules
Key questions: Training procedure? Decoding procedure? Overfitting? Different feature types?
Analysis: Pros & Cons

Sunburn Example

Name   Hair    Height   Weight   Lotion   Result
Sarah  Blonde  Average  Light    No       Burn
Dana   Blonde  Tall     Average  Yes      None
Alex   Brown   Short    Average  Yes      None
Annie  Blonde  Short    Average  No       Burn
Emily  Red     Average  Heavy    No       Burn
Pete   Brown   Tall     Heavy    No       None
John   Brown   Average  Heavy    No       None
Katie  Blonde  Short    Light    Yes      None
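For the worked examples later in the deck, it can help to have this table in machine-readable form; one possible Python encoding (the field names are my own, not from the slides):

# Sunburn training data as (features, label) pairs, in the order Sarah .. Katie.
SUNBURN_DATA = [
    ({"hair": "blonde", "height": "average", "weight": "light",   "lotion": "no"},  "burn"),   # Sarah
    ({"hair": "blonde", "height": "tall",    "weight": "average", "lotion": "yes"}, "none"),   # Dana
    ({"hair": "brown",  "height": "short",   "weight": "average", "lotion": "yes"}, "none"),   # Alex
    ({"hair": "blonde", "height": "short",   "weight": "average", "lotion": "no"},  "burn"),   # Annie
    ({"hair": "red",    "height": "average", "weight": "heavy",   "lotion": "no"},  "burn"),   # Emily
    ({"hair": "brown",  "height": "tall",    "weight": "heavy",   "lotion": "no"},  "none"),   # Pete
    ({"hair": "brown",  "height": "average", "weight": "heavy",   "lotion": "no"},  "none"),   # John
    ({"hair": "blonde", "height": "short",   "weight": "light",   "lotion": "yes"}, "none"),   # Katie
]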

Learning about Sunburn
Goal: train on labeled examples; predict Burn/None for new instances
Solution?
Exact match: same features, same output
Problem: 2 * 3^3 = 54 feature combinations, and it could be much worse
Same label as the 'most similar' instance
Problem: what counts as close? Which features matter?
Many instances match on two features but differ in result

Learning about Sunburn
Better solution: a decision tree
Training: divide the examples into subsets based on feature tests; the sets of samples at the leaves define the classification
Prediction: route a NEW instance through the tree to a leaf based on the feature tests; assign the same value as the samples at that leaf

Sunburn Decision Tree

Hair Color?
  Blonde -> Lotion Used?
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

Decision Tree Structure
Internal nodes: each node is a test; generally tests a single feature, e.g. Hair == ?; theoretically it could test multiple features
Branches: each branch corresponds to an outcome of the test, e.g. Hair == Red, Hair != Blonde
Leaves: each leaf corresponds to a decision
Discrete class: classification / decision tree; real value: regression

From Trees to Rules
Tree: each branch from root to leaf = a sequence of tests => a classification
Tests = if-antecedents; leaf label = consequent
All decision trees can be converted to rules; not all rule sets can be expressed as trees

From ID Trees to Rules

Hair Color?
  Blonde -> Lotion Used?
    No  -> Sarah: Burn, Annie: Burn
    Yes -> Katie: None, Dana: None
  Red   -> Emily: Burn
  Brown -> Alex: None, John: None, Pete: None

(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))

Which Tree?
Many possible decision trees exist for any problem
How can we select among them? What would be the 'best' tree?
Smallest? Shallowest? Most accurate on unseen data?

Simplicity
Occam's Razor: the simplest explanation that covers the data is best
Occam's Razor for decision trees: the smallest tree consistent with the samples will be the best predictor for new data
Problem: finding all trees and finding the smallest one is expensive!
Solution: greedily build a small tree

Building Trees: Basic Algorithm
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution:
At each node, pick a test using the 'best' feature
Split into subsets based on the outcomes of the feature test
Repeat the process until the stopping criterion is met, i.e. until all leaves have samples of the same class

Key Questions
Splitting: how do we select the 'best' feature?
Stopping: when do we stop splitting to avoid overfitting?
Features: how do we split different types of features? Binary? Discrete? Continuous?

Building Decision Trees: I
Goal: build a small tree such that all samples at the leaves have the same class
Greedy solution:
At each node, pick a test such that the branches are as close as possible to having a single class
Split into subsets where most instances are in a uniform class

Picking a Test

Hair Color:  Blonde -> Sarah: B, Dana: N, Annie: B, Katie: N
             Red    -> Emily: B
             Brown  -> Alex: N, Pete: N, John: N

Height:      Short   -> Alex: N, Annie: B, Katie: N
             Average -> Sarah: B, Emily: B, John: N
             Tall    -> Dana: N, Pete: N

Weight:      Light   -> Sarah: B, Katie: N
             Average -> Dana: N, Alex: N, Annie: B
             Heavy   -> Emily: B, Pete: N, John: N

Lotion:      No  -> Sarah: B, Annie: B, Emily: B, Pete: N, John: N
             Yes -> Dana: N, Alex: N, Katie: N

Picking a Test (among the blonde-haired instances)

Height:  Short   -> Annie: B, Katie: N
         Average -> Sarah: B
         Tall    -> Dana: N

Weight:  Light   -> Sarah: B, Katie: N
         Average -> Dana: N, Annie: B
         Heavy   -> (none)

Lotion:  No  -> Sarah: B, Annie: B
         Yes -> Dana: N, Katie: N

Measuring Disorder
Problem: in general, tests on large databases don't yield homogeneous subsets
Solution: a general information-theoretic measure of disorder
Desired properties:
Homogeneous set: least disorder = 0
Even split: most disorder = 1

Measuring Entropy
If we split m objects into 2 bins of size m_1 and m_2, what is the entropy?

Disorder = -\frac{m_1}{m} \log_2 \frac{m_1}{m} - \frac{m_2}{m} \log_2 \frac{m_2}{m} = -\sum_i \frac{m_i}{m} \log_2 \frac{m_i}{m}

[Plot: disorder as a function of m_1/m on [0, 1], rising from 0, peaking at 1 when m_1/m = 0.5, and falling back to 0]

Measuring Disorder: Entropy
p_i = m_i / m is the probability of being in bin i, with \sum_i p_i = 1 and 0 <= p_i <= 1
Entropy (disorder) of a split: -\sum_i p_i \log_2 p_i
Assume 0 \log_2 0 = 0

p_1   p_2   Entropy
1/2   1/2   -1/2 log2(1/2) - 1/2 log2(1/2) = 1/2 + 1/2 = 1
1/4   3/4   -1/4 log2(1/4) - 3/4 log2(3/4) = 0.5 + 0.311 = 0.811
1     0     -1 log2(1) - 0 log2(0) = 0 - 0 = 0
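As a sanity check on the table above, a minimal Python sketch (mine, not from the slides) that reproduces the three entropy values from the bin sizes:

import math

def bin_entropy(m1, m2):
    # Entropy (disorder) in bits of splitting m = m1 + m2 objects into two bins.
    m = m1 + m2
    h = 0.0
    for mi in (m1, m2):
        if mi > 0:
            p = mi / m
            h -= p * math.log2(p)
    return h

for m1, m2 in [(1, 1), (1, 3), (4, 0)]:
    print(m1, m2, round(bin_entropy(m1, m2), 3))
# Prints 1.0, 0.811, and 0.0, matching the three rows of the table.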

Information Gain
InfoGain(Y|X): how many bits can we save if we know X?
InfoGain(Y|X) = H(Y) - H(Y|X)
(equivalently written InfoGain(Y, X))

Information Gain
InfoGain(S, A): the expected reduction in the entropy of sample set S due to splitting on attribute A
Select the A with maximum InfoGain, i.e. the one resulting in the lowest average entropy
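In the S_a notation introduced below (S_a = elements of S with value A = a), this is presumably the standard definition

InfoGain(S, A) = H(S) - \sum_{a \in values(A)} \frac{|S_a|}{|S|} H(S_a)

where the second term is the average entropy computed on the next slide.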

Computing Average Entropy

AvgEntropy = \sum_i \frac{|S_{a_i}|}{|S|} H(S_{a_i})

where |S_{a_i}| / |S| is the fraction of the |S| samples sent down branch i, and H(S_{a_i}) is the disorder of the class distribution on branch i. (For two branches and two classes a and b, branch 1 contains S_{a_1,a} and S_{a_1,b} instances of the two classes, and branch 2 contains S_{a_2,a} and S_{a_2,b}.)

Entropy in Sunburn Example

S = [3B, 5N], so H(S) = 0.954

InfoGain(Hair color) = 0.954 - (4/8 * (-2/4 log2(2/4) - 2/4 log2(2/4)) + 1/8 * 0 + 3/8 * 0) = 0.954 - 0.5 = 0.454
InfoGain(Height) = 0.954 - 0.69 = 0.264
InfoGain(Weight) = 0.954 - 0.94 = 0.014
InfoGain(Lotion) = 0.954 - 0.61 = 0.344

Entropy in Sunburn Example (blonde subset)

S = [2B, 2N], so H(S) = 1

InfoGain(Height) = 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 1/4 * 0 + 1/4 * 0) = 1 - 0.5 = 0.5
InfoGain(Weight) = 1 - (2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2)) + 2/4 * (-1/2 log2(1/2) - 1/2 log2(1/2))) = 1 - 1 = 0
InfoGain(Lotion) = 1 - 0 = 1

Building Decision Trees with Information Gain
Until there are no inhomogeneous leaves:
Select an inhomogeneous leaf node
Replace that leaf node by a test node creating the subsets that yield the highest information gain
Effectively creates a set of rectangular regions, repeatedly drawing lines in different axes
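A minimal Python sketch of this greedy, information-gain-based training loop on the sunburn data (my own illustration of the procedure described above, not course code; helper names such as info_gain and build_tree are invented):

import math
from collections import Counter

FEATURES = ["hair", "height", "weight", "lotion"]
ROWS = [  # same sunburn data as above: hair, height, weight, lotion, label (Sarah .. Katie)
    ("blonde", "average", "light",   "no",  "burn"),
    ("blonde", "tall",    "average", "yes", "none"),
    ("brown",  "short",   "average", "yes", "none"),
    ("blonde", "short",   "average", "no",  "burn"),
    ("red",    "average", "heavy",   "no",  "burn"),
    ("brown",  "tall",    "heavy",   "no",  "none"),
    ("brown",  "average", "heavy",   "no",  "none"),
    ("blonde", "short",   "light",   "yes", "none"),
]
DATA = [(dict(zip(FEATURES, row[:-1])), row[-1]) for row in ROWS]

def entropy(instances):
    # Entropy (in bits) of the label distribution over a set of (features, label) pairs.
    counts = Counter(label for _, label in instances)
    return -sum((c / len(instances)) * math.log2(c / len(instances)) for c in counts.values())

def info_gain(instances, feature):
    # H(S) minus the size-weighted average entropy of the subsets, one per feature value.
    subsets = {}
    for feats, label in instances:
        subsets.setdefault(feats[feature], []).append((feats, label))
    avg = sum(len(sub) / len(instances) * entropy(sub) for sub in subsets.values())
    return entropy(instances) - avg

def build_tree(instances, features):
    # Returns a label (leaf) or a (feature, {value: subtree}) test node.
    labels = {label for _, label in instances}
    if len(labels) == 1 or not features:  # homogeneous leaf, or no feature left to test
        return Counter(label for _, label in instances).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(instances, f))
    branches = {}
    for feats, label in instances:
        branches.setdefault(feats[best], []).append((feats, label))
    rest = [f for f in features if f != best]
    return (best, {value: build_tree(sub, rest) for value, sub in branches.items()})

print(build_tree(DATA, FEATURES))
# Expected shape: ('hair', {'blonde': ('lotion', {'no': 'burn', 'yes': 'none'}),
#                           'brown': 'none', 'red': 'burn'})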

Alternate Measures
Issue with information gain: it favors features with more values
Option: Gain Ratio
S_a: elements of S with value A = a
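The gain-ratio formula itself did not survive the transcript; the standard (C4.5-style) definition in this notation is

GainRatio(S, A) = \frac{InfoGain(S, A)}{SplitInfo(S, A)}, \quad SplitInfo(S, A) = -\sum_{a} \frac{|S_a|}{|S|} \log_2 \frac{|S_a|}{|S|}

so features that split S into many small subsets are penalized by a large SplitInfo term.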

Overfitting
Overfitting: the model fits the training data TOO well, fitting noise and irrelevant details
Why is this bad? It harms generalization: the model fits the training data too well but fits new data badly
For a model m, consider training_error(m) and D_error(m), where D = all data
If m overfits, then for some other model m', training_error(m) < training_error(m') but D_error(m) > D_error(m')

Avoiding Overfitting
Strategies to avoid overfitting:
Early stopping: stop when InfoGain < threshold, when the number of instances < threshold, or when tree depth > threshold
Post-pruning: grow the full tree, then remove branches
Which is better? Unclear; both are used. For some applications, post-pruning is better

Post-Pruning
Divide the data into:
Training set: used to build the original tree
Validation set: used to perform pruning
Build the decision tree on the training data
While pruning does not reduce validation set performance:
Compute the validation performance of pruning each node (and its children)
Greedily remove the nodes that do not reduce validation set performance
Yields a smaller tree with the best performance

Performance Measures
Compute accuracy on a validation set, or via k-fold cross-validation
Weighted classification error cost: weight some types of errors more heavily
Minimum description length: favor good accuracy on compact models; MDL = error(tree) + model_size(tree)

Rule Post-Pruning
Convert the tree to rules
Prune each rule independently
Sort the final rule set
Probably the most widely used method (in toolkits)

Modeling Features
Different types of features need different tests:
Binary: branch on true/false
Discrete: one branch for each discrete value
Continuous: need to discretize; enumerating all values is not possible or desirable, so pick a value x and branch on value < x vs. value >= x
How can we pick the split points?

Picking Splits
Need useful, sufficient split points
What's a good strategy?
Approach:
Sort all values of the feature in the training data
Identify adjacent instances with different classes
Candidate split points lie between those instances
Select the candidate with the highest information gain
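A minimal sketch of that strategy for a single continuous feature (illustrative only; the function name and toy data are my own):

def candidate_splits(values, labels):
    # Midpoints between adjacent sorted values whose neighboring instances differ in class.
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            splits.append((v1 + v2) / 2)
    return splits

# The class changes between 3 and 7 and again between 9 and 12, giving two candidates.
print(candidate_splits([1, 3, 7, 9, 12], ["a", "a", "b", "b", "a"]))  # [5.0, 10.5]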

Features in Decision Trees: Pros
Feature selection: tests the features that yield low disorder, i.e. selects the features that are important and ignores irrelevant ones
Feature type handling: discrete type: one branch per value; continuous type: branch on >= value
Absent features: distribute uniformly

Features in Decision Trees: Cons
Features are assumed independent; to capture a group effect, it must be modeled explicitly, e.g. by creating a new feature AorB
Feature tests are conjunctive

Decision Trees
Train: build the tree by forming subsets of least disorder
Predict: traverse the tree based on feature tests; assign the label of the samples at the leaf node
Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rules can be read off the tree
Cons: poor at feature combination and dependency; building the optimal tree is intractable
