
Page 1


ECE471-571 – Pattern Recognition

Lecture 13 – Decision Tree

Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi

Email: [email protected]

Pattern Classification

Statistical Approach
• Supervised
– Basic concepts: Bayesian decision rule (MPP, LR, Discriminant), Parameter estimation (ML, BL), Non-parametric learning (kNN), LDF (Perceptron)
– NN (BP), Support Vector Machine, Deep Learning (DL)
• Unsupervised
– Basic concepts: Distance, Agglomerative method
– k-means, Winner-takes-all, Kohonen maps, Mean-shift
• Dimensionality Reduction: FLD, PCA
• Performance Evaluation: ROC curve (TP, TN, FN, FP), cross validation
• Classifier Fusion: majority voting, NB, BKS
• Stochastic Methods: local opt (GD), global opt (SA, GA)

Non-Statistical Approach
• Decision-tree
• Syntactic approach

Review - Bayes Decision Rule

Maximum Posterior Probability:
P(ω_j|x) = p(x|ω_j) P(ω_j) / p(x)
For a given x, if P(ω1|x) > P(ω2|x), then x belongs to class 1, otherwise class 2.

Likelihood Ratio:
If p(x|ω1) / p(x|ω2) > P(ω2) / P(ω1), then x belongs to class 1, otherwise class 2.

Discriminant Function:
The classifier will assign a feature vector x to class ω_i if g_i(x) > g_j(x).
• Case 1: Minimum Euclidean Distance (Linear Machine), Σ_i = σ²I
• Case 2: Minimum Mahalanobis Distance (Linear Machine), Σ_i = Σ
• Case 3: Quadratic classifier, Σ_i = arbitrary

Estimate Gaussian (Maximum Likelihood Estimation, MLE), two-modal Gaussian

Dimensionality reduction

Performance evaluation: ROC curve

Non-parametric (kNN): For a given x, if k1/k > k2/k, then x belongs to class 1, otherwise class 2.

Page 2


Nominal Data

Descriptions that are discrete and without any natural notion of similarity or even ordering


Some Terminologies

Decision tree
• Root
• Link (branch) – directional
• Leaf
• Descendent node

CART

Classification and regression trees
A generic tree-growing methodology
Issues studied:
• How many splits from a node?
• Which property to test at each node?
• When to declare a leaf?
• How to prune a large, redundant tree?
• If the leaf is impure, how to classify?
• How to handle missing data?

Page 3


Number of Splits

Binary tree: expressive power and comparative simplicity in training

Node Impurity – Occam's Razor
• The fundamental principle underlying tree creation is that of simplicity: we prefer simple, compact trees with few nodes
• http://math.ucr.edu/home/baez/physics/occam.html
• Occam's (or Ockham's) razor is a principle attributed to the 14th-century logician and Franciscan friar William of Occam. Ockham was the village in the English county of Surrey where he was born.
• The principle states that "Entities should not be multiplied unnecessarily."
• "When you have two competing theories which make exactly the same predictions, the one that is simpler is the better."
• Stephen Hawking explains in A Brief History of Time: "We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam's razor and cut out all the features of the theory which cannot be observed."
• Everything should be made as simple as possible, but not simpler

Property Query and Impurity Measurement

We seek a property query T at each node N that makes the data reaching the immediate descendent nodes as pure as possible.
We want i(N) to be 0 if all the patterns that reach the node bear the same category label.

Entropy impurity (information impurity):
i(N) = −Σ_j P(ω_j) log2 P(ω_j)
where P(ω_j) is the fraction of patterns at node N that are in category ω_j.

Page 4


Other Impurity Measurements

Variance impurity (2-category case):
i(N) = P(ω1) P(ω2)

Gini impurity:
i(N) = Σ_{i≠j} P(ω_i) P(ω_j) = 1 − Σ_j P(ω_j)²

Misclassification impurity:
i(N) = 1 − max_j P(ω_j)
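A minimal MATLAB sketch (not from the slides) that evaluates these measures, plus entropy impurity, for a node described by per-class pattern counts; the variable names are illustrative:

% Impurity measures at a node, given per-class pattern counts
counts = [90 10];                      % e.g., 90 patterns in w1, 10 in w2
P = counts / sum(counts);              % class fractions P(w_j) at the node
Pnz = P(P > 0);                        % drop empty classes to avoid log2(0)
entropyImp = -sum(Pnz .* log2(Pnz));   % entropy (information) impurity
varImp  = prod(P);                     % variance impurity (2-category case only)
giniImp = 1 - sum(P.^2);               % Gini impurity
miscImp = 1 - max(P);                  % misclassification impurity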

Choose the Property Test?

Choose the query that decreases the impurity as much as possible

• N_L, N_R: left and right descendent nodes
• i(N_L), i(N_R): impurities
• P_L: fraction of patterns at node N that will go to N_L

Solve for extrema (local extrema)

Δi(N) = i(N) − P_L i(N_L) − (1 − P_L) i(N_R)

Example

Node N:
• 90 patterns in ω1
• 10 patterns in ω2

Split candidate:
• 70 ω1 patterns & 0 ω2 patterns to the right
• 20 ω1 patterns & 10 ω2 patterns to the left
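A worked computation for this candidate (not on the slide), using entropy impurity and the Δi(N) formula above:
• i(N) = −0.9 log2(0.9) − 0.1 log2(0.1) ≈ 0.469
• Left node (20 ω1, 10 ω2): P_L = 30/100 = 0.3, i(N_L) = −(2/3) log2(2/3) − (1/3) log2(1/3) ≈ 0.918
• Right node (70 ω1, 0 ω2): i(N_R) = 0
• Δi(N) ≈ 0.469 − 0.3(0.918) − 0.7(0) ≈ 0.194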

Page 5


When to Stop Splitting?

Two extreme scenarios
• Overfitting (each leaf is one sample)
• High error rate

Approaches
• Validation and cross-validation
– 90% of the data set as training data
– 10% of the data set as validation data
• Use threshold
– Unbalanced tree
– Hard to choose threshold
• Minimum description length (MDL)
– i(N) measures the uncertainty of the training data
– Size of the tree measures the complexity of the classifier itself

MDL = α · size + Σ_{leaf nodes} i(N)
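A tiny illustrative computation of the MDL trade-off (not from the slides); α, the tree size, and the leaf impurities are all hypothetical choices:

% MDL-style cost: weighted tree complexity plus total leaf impurity
alpha = 0.05;                   % hypothetical weight on tree size
treeSize = 5;                   % e.g., number of nodes in a candidate tree
leafImpurity = [0 0.918 0];     % i(N) at each leaf of that candidate tree
MDL = alpha * treeSize + sum(leafImpurity);   % smaller is better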

When to Stop Splitting? (cont'd)

Use a stopping criterion based on the statistical significance of the reduction of impurity
• Use the chi-square statistic
• Test whether the candidate split differs significantly from a random split

χ² = Σ_{i=1}^{2} (n_iL − n_ie)² / n_ie
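A worked check (not on the slide), reusing the 90/10 example: here n_iL is the number of ω_i patterns sent to the left, and n_ie = P_L · n_i is the count expected under a random split. With P_L = 0.3, the expected left counts are n_1e = 27 and n_2e = 3, while the candidate split sends n_1L = 20 and n_2L = 10 left, so χ² = (20 − 27)²/27 + (10 − 3)²/3 ≈ 18.1, far above the 3.84 critical value at one degree of freedom (0.05 level); the split differs significantly from a random one.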

Pruning

Another way to stop splitting
Horizon effect
• Lack of sufficient look-ahead

Let the tree fully grow, i.e., beyond any putative horizon; then all pairs of neighboring leaf nodes are considered for elimination

Page 6


Instability


Other Methods

• Quinlan's ID3
• C4.5 (successor and refinement of ID3)
• http://www.rulequest.com/Personal/

MATLAB Routine

• classregtree
• Classification tree vs. regression tree
– Depends on whether the target variable is categorical or numeric

http://www.mathworks.com/help/stats/classregtree.html
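A minimal sketch (not from the slides) of growing a classification tree with classregtree on MATLAB's built-in Fisher iris data; newer MATLAB releases deprecate classregtree in favor of fitctree, and the option names below follow the older Statistics Toolbox API:

% Grow a classification tree (the response is categorical, so classregtree
% builds a classification rather than a regression tree)
load fisheriris                           % meas: 150x4 features, species: labels
t = classregtree(meas, species, ...
    'names', {'SL' 'SW' 'PL' 'PW'});      % optional predictor names
view(t)                                   % display the fitted tree
yhat = eval(t, meas);                     % predicted labels (cell array)
resubErr = mean(~strcmp(yhat, species));  % resubstitution error rate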

Page 7



http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees

Random Forest

• Potential issue with decision trees
• Prof. Leo Breiman
• Ensemble learning methods

– Bagging (Bootstrap aggregating): Proposed by Breiman in 1994 to improve the classification by combining classifications of randomly generated training sets

– Random forest: bagging + random selection of features at each node to determine a split


MATLAB Implementation

B = TreeBagger(nTrees, train_x, train_y);
pred = B.predict(test_x);
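A fuller end-to-end sketch (not from the slides), again on the Fisher iris data; nTrees = 100 and the out-of-bag option are illustrative choices (the option is spelled 'OOBPred' in older releases):

% Random forest: bagged trees with random feature selection at each split
load fisheriris
nTrees = 100;                             % illustrative ensemble size
B = TreeBagger(nTrees, meas, species, ...
    'OOBPrediction', 'on');               % track out-of-bag error
pred = B.predict(meas);                   % cell array of predicted labels
oobErr = oobError(B);                     % OOB error vs. number of grown trees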


Page 8


References

• [CART] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.

• [Bagging] L. Breiman, “Bagging predictors,” Machine Learning, 24(2):123-140, August 1996. (citation: 16,393)

• [RF] L. Breiman, “Random forests,” Machine Learning, 45(1):5-32, October 2001.
