
Page 1:

Statistics 202: Statistical Aspects of Data Mining

Professor Rajan Patel

Lecture 10 = Finish Chapter 5

Agenda:
1) Assign Homework #4 (due Aug 13th)
2) Lecture over more of Chapter 5

Page 2:

Homework Assignment:

Homework #4 is due Monday 8/13 at 4:15 PM

Email it to [email protected] as a pdf file.

The assignment is posted at

http://sites.google.com/site/stats202/homework

*If emailing, please send a single file (pdf is preferred) and make sure your name is on the first page and in the body of the email. Also, the file name should include “homework4” and your name.

Page 3:

Introduction to Data Mining

by Tan, Steinbach, Kumar

Chapter 5: Classification: Alternative Techniques

Page 4:

Ensemble Methods (Section 5.6, page 276)

Ensemble methods aim at “improving classification accuracy by aggregating the predictions from multiple classifiers” (page 276)

One of the most obvious ways to do this is simply to average classifiers that make errors somewhat independently of each other
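As a quick illustration (the numbers here are assumed for the sketch, not taken from the slides): if 25 classifiers each err independently with probability 0.35, a short R computation shows how much a majority vote helps:

# Majority vote of 25 independent classifiers, each wrong with prob 0.35.
# The vote is wrong only when 13 or more of the 25 are wrong.
1-pbinom(12,size=25,prob=0.35)   # about 0.06, versus 0.35 for one classifier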

Page 5:

Ensemble Methods (Section 5.6, page 276)

Ensemble methods include:
- Bagging (page 283)
- Random Forests (page 290)
- Boosting (page 285)

Bagging builds different classifiers by training on repeated samples (with replacement) from the data

Random Forests averages many trees which are constructed with some amount of randomness

Boosting combines simple base classifiers by upweighting data points which are classified incorrectly
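As a minimal sketch of bagging (not part of the original slides; it reuses the sonar files and the rpart()/predict() pattern from the exercises below, where the class labels in column 61 are -1/+1):

library(rpart)
train<-read.csv("sonar_train.csv",header=FALSE)
test<-read.csv("sonar_test.csv",header=FALSE)
votes<-rep(0,nrow(test))
for(b in 1:101){                      # an odd number of trees avoids tied votes
  idx<-sample(nrow(train),replace=TRUE)          # bootstrap sample of the rows
  fit<-rpart(V61~.,data=train[idx,],method="class")
  votes<-votes+(-1+2*(predict(fit,test)[,2]>.5)) # each tree casts a -1/+1 vote
}
pred<-sign(votes)                     # majority vote
1-sum(pred==test[,61])/nrow(test)     # test misclassification error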

Page 6:

Random Forests (Section 5.6.6, page 290)

One way to create random forests is to grow decision trees top down but at each node consider only a random subset of attributes for splitting instead of all the attributes

Random Forests are a very effective technique

They are based on the paper

L. Breiman. Random forests. Machine Learning, 45:5-32, 2001

They can be fit in R using the function randomForest() in the library randomForest

Page 7:

In class exercise #39: Use randomForest() in R to fit the default Random Forest to the last column of the sonar training data at http://sites.google.com/site/stats202/data/sonar_train.csv. Compute the misclassification error for the test data at http://sites.google.com/site/stats202/data/sonar_test.csv.

Page 8:


Solution:

install.packages("randomForest")library(randomForest)train<-read.csv("sonar_train.csv",header=FALSE)test<-read.csv("sonar_test.csv",header=FALSE)y<-as.factor(train[,61])x<-train[,1:60]y_test<-as.factor(test[,61])x_test<-test[,1:60]fit<-randomForest(x,y)1-sum(y_test==predict(fit,x_test))/length(y_test)

Page 9:

Boosting (Section 5.6.5, page 285)

Boosting has been called the “best off-the-shelf classifier in the world”

There are a number of explanations for boosting, but it is not completely understood why it works so well

The most popular algorithm is AdaBoost, from Freund and Schapire (1996)

Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced.

Page 10:

Boosting (Section 5.6.5, page 285)

Boosting can use any classifier as its weak learner (base classifier) but decision trees are by far the most popular

Boosting usually achieves zero training error, yet it rarely overfits, which is very curious

Page 11:

Boosting (Section 5.6.5, page 285)

Boosting works by upweighting, at each iteration, the points which are misclassified

On paper, boosting looks like an optimization (similar to maximum likelihood estimation), but in practice it seems to benefit a lot from averaging, as Random Forests do

There exist R libraries for boosting, but these are written by statisticians who have their own views of boosting, so I would not encourage you to use them

The best thing to do is to write code yourself since the algorithms are very basic

Page 12:

AdaBoost

Here is a version of the AdaBoost algorithm
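A sketch of round m, written to match the R implementation in exercise #40 below (labels y_i in {-1,+1}, starting from F_0 = 0):

\begin{align*}
w_i &\propto e^{-y_i F_{m-1}(x_i)}, \qquad \textstyle\sum_i w_i = 1 && \text{(reweight the training points)} \\
e_m &= \textstyle\sum_i w_i \,\mathbf{1}[\, y_i\, g_m(x_i) < 0 \,] && \text{(weighted error of the base classifier } g_m\text{)} \\
\alpha_m &= \tfrac{1}{2}\log\frac{1-e_m}{e_m} && \text{(vote weight of } g_m\text{)} \\
F_m &= F_{m-1} + \alpha_m\, g_m
\end{align*}

where g_m is the base classifier fit to the weighted training data.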

The algorithm repeats until a chosen stopping time

The final classifier is based on the sign of F_m

Page 13:

In class exercise #40: Use R to fit the AdaBoost classifier to the last column of the sonar training data at http://sites.google.com/site/stats202/data/sonar_train.csv. Plot the misclassification error for the training data and for the test data at http://sites.google.com/site/stats202/data/sonar_test.csv as a function of the iterations. Run the algorithm for 500 iterations. Use default rpart() as the base learner.

Solution:

train<-read.csv("sonar_train.csv",header=FALSE)test<-read.csv("sonar_test.csv",header=FALSE)y<-train[,61]x<-train[,1:60]y_test<-test[,61]x_test<-test[,1:60]

Page 14:


Solution (continued):

train_error<-rep(0,500)   # Keep track of errors
test_error<-rep(0,500)
f<-rep(0,130)             # 130 pts in training data
f_test<-rep(0,78)         # 78 pts in test data
i<-1
library(rpart)

Page 15:


Solution (continued):

while(i<=500){
  w<-exp(-y*f)     # This is a shortcut to compute w
  w<-w/sum(w)
  fit<-rpart(y~.,x,w,method="class")
  g<--1+2*(predict(fit,x)[,2]>.5)            # make -1 or 1
  g_test<--1+2*(predict(fit,x_test)[,2]>.5)
  e<-sum(w*(y*g<0))

Page 16:


Solution (continued):

  alpha<-.5*log((1-e)/e)
  f<-f+alpha*g
  f_test<-f_test+alpha*g_test
  train_error[i]<-sum(1*f*y<0)/130
  test_error[i]<-sum(1*f_test*y_test<0)/78
  i<-i+1
}

Page 17:


Solution (continued):

plot(seq(1,500),test_error,type="l",
     ylim=c(0,.5),ylab="Error Rate",xlab="Iterations",lwd=2)
lines(train_error,lwd=2,col="purple")
legend(4,.5,c("Training Error","Test Error"),
       col=c("purple","black"),lwd=2)

Page 18:


Solution (continued):

[Plot: training error (purple) and test error (black) as a function of iterations; x-axis "Iterations" from 0 to 500, y-axis "Error Rate" from 0.0 to 0.5]

Page 19:

Naive Bayes Classifier (Section 5.3.3, page 231)

This is a simple method that is sometimes quite effective

To understand this, it is helpful to understand conditional probability, Bayes Theorem and (conditional) independence

These topics are discussed in your book in Section 5.3 and on the following slides

Page 20:

Conditional Probability

The conditional probability of event A given event B is written as P(A|B)

This is the probability of event A if/when/given event B happens

Example:

P(make) = 8/20 = .4
P(make|close) = 3/5 = .6

        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20

Page 21:

Conditional Probability

You can define the conditional probability P(A|B) as P(A,B) / P(B)

Example: P(make|close) = P(make,close) / P(close) = (3/20) / (5/20) = .15/.25 = .6

Bayes Theorem turns this around to solve for P(B|A) if you have P(A|B)

P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)

        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20

Page 22:

Example of Bayes Theorem

On Friday my friend said that there was a 50 percent chance he would go snowboarding over the weekend. He said that if he goes snowboarding there is about a 40% chance he will break his leg; whereas, if he does not go snowboarding there is only a 2% chance he will break his leg doing something else. If I see him on Monday and his leg is broken, what would I think is the probability he went snowboarding over the weekend?

Page 23:

Example of Bayes Theorem


P(S) = 0.5
P(B|S) = 0.4
P(B|not S) = 0.02

P(S|B) = P(B|S)*P(S) / P(B)
       = 0.4*0.5 / (0.4*0.5 + 0.02*0.5)
       = 0.2 / 0.21 = 0.95
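As a quick check (not on the slide), the same arithmetic in R:

p_S<-0.5            # P(snowboarding)
p_B_S<-0.4          # P(broken leg | snowboarding)
p_B_notS<-0.02      # P(broken leg | no snowboarding)
p_B<-p_B_S*p_S+p_B_notS*(1-p_S)   # total probability of a broken leg
p_B_S*p_S/p_B                     # P(S|B) = 0.952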

Page 24:

Independence

A and B are independent iff P(A,B) = P(A)*P(B)

Here the events are not independent:
P(make,far) = 5/20 = .25, but P(make)*P(far) = 8/20 * 15/20 = .30

        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20

Here the events are independent:
P(make,far) = 9/20 = .45, which equals P(make)*P(far) = 12/20 * 15/20 = .45

        far  close  total
make      9      3     12
miss      6      2      8
total    15      5     20

Page 25:

Conditional Independence

A and B are conditionally independent given C iff P(A,B|C) = P(A|C)*P(B|C)

Example: Height and reading ability are not independent but they are conditionally independent given the age level

young:
               short  tall  total
reads poorly      90     9     99
reads well        10     1     11
total            100    10    110

old:
               short  tall  total
reads poorly       2    20     22
reads well         8    80     88
total             10   100    110

all:
               short  tall  total
reads poorly      92    29    121
reads well        18    81     99
total            110   110    220
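A quick R check of the "young" table (a sketch, not from the slides):

# rows: reads poorly / reads well; columns: short / tall
young<-matrix(c(90,10,9,1),nrow=2)
n<-sum(young)                             # 110 children
young[1,1]/n                              # P(short, reads poorly) = 0.818
(sum(young[,1])/n)*(sum(young[1,])/n)     # P(short)*P(reads poorly) = 0.818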

Page 26:

Naive Bayes Classifier (Section 5.3.3, page 231)

The naive Bayes classifier assumes that the x attributes are conditionally independent given the class attribute y

Thus, P(Y|X) = P(Y) * P(X|Y) / P(X) = P(Y) * P(X1|Y) * ... * P(Xd|Y) / P(X)

Then for any x you choose the class y that gives you the largest numerator

You estimate the P(Xi|Y) values based on the data (see next slide)

Page 27:

How to Estimate the P(Xi|Y)

For categorical x’s, just use counts (although some people modify this to fix problems with zero or small counts, see page 236)

For continuous x’s, fit some distribution function. The normal distribution using the observed sample mean and observed sample standard deviation is popular

The normal probability density function is given by

p(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2}

where μ is the mean and σ is the standard deviation
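In R, this density is evaluated by dnorm(); for example, the two values used on page 29 below:

dnorm(120,mean=90,sd=5)       # 1.2*10^-9
dnorm(120,mean=110,sd=54.4)   # 0.0072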

Page 28:

Example of the Naive Bayes Classifier

For this data, use naive Bayes to classify an observation with

Tid  Refund  Marital Status  Taxable Income  Evade
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

X = (Refund = No, Marital Status = Married, Taxable Income = 120K)

Page 29:

Example of the Naive Bayes Classifier

For this data (the table on page 28), use naive Bayes to classify an observation with
X = (Refund = No, Marital Status = Married, Taxable Income = 120K)

P(refund=yes|yes) = 0/3      P(refund=yes|no) = 3/7
P(refund=no|yes)  = 3/3      P(refund=no|no)  = 4/7

P(ms=single|yes)   = 2/3     P(ms=single|no)   = 2/7
P(ms=divorced|yes) = 1/3     P(ms=divorced|no) = 1/7
P(ms=married|yes)  = 0/3     P(ms=married|no)  = 4/7

Given yes, income has mean = 90 and sd = 5, so P(120|yes) = dnorm(120,90,5) = 1.2*10^-9
Given no, income has mean = 110 and sd = 54.4, so P(120|no) = dnorm(120,110,54.4) = 0.0072

P(yes|X) = 3/10 * 1/P(X) * 3/3 * 0/3 * 1.2*10^-9
P(no|X)  = 7/10 * 1/P(X) * 4/7 * 4/7 * 0.0072

P(yes|X) < P(no|X), so we classify this X as "NO"
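A minimal R sketch (not from the slides) that reproduces these numbers from the table on page 28:

refund<-c("Yes","No","No","Yes","No","No","Yes","No","No","No")
marital<-c("Single","Married","Single","Married","Divorced",
           "Married","Divorced","Single","Married","Single")
income<-c(125,100,70,120,95,60,220,85,75,90)
evade<-c("No","No","No","No","Yes","No","No","Yes","No","Yes")

# Numerator P(y)*P(refund=No|y)*P(ms=Married|y)*p(income=120|y) for each class y
numer<-sapply(c("Yes","No"),function(cl){
  idx<-evade==cl
  mean(evade==cl)*
    mean(refund[idx]=="No")*
    mean(marital[idx]=="Married")*
    dnorm(120,mean(income[idx]),sd(income[idx]))  # sd is 5 for yes, about 54.5 for no
})
numer                     # the "No" numerator is larger
names(which.max(numer))   # returns "No"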