lecture 08: classification wrap-upjcrouser/sds293/lectures/08... · sds 293: machine learning. ......

10
LECTURE 08: CLASSIFICATION WRAP-UP October 4, 2017 SDS 293: Machine Learning

Upload: others

Post on 29-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

LECTURE 08:CLASSIFICATION WRAP-UPOctober 4, 2017SDS 293: Machine Learning

Page 2: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Outline

• Lab: comparing classification approaches• Evaluating models - Today: validation set-Monday: resampling

Page 3: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

60 min: comparing classification methods

• Work in teams of 2-4 people (3 is ideal)• Goal: to deepen our understanding of how the 4

classification methods we’ve discussed measure up- K-nearest neighbors- Logistic regression- LDA- QDA

• Instructions:http://www.science.smith.edu/~jcrouser/SDS293/labs/lab6.html

• You can check your intuition against p.153-4 of ISLR

Page 4: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Discussion: classification methodsLogistic

RegressionK-nearest neighbors LDA QDA

Scenario 1

Scenario 2

Scenario 3

Scenario 4

Scenario 5

Scenario 6

( )

( )

( )

( )

Page 5: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Rough guide to choosing a method

Is the decision boundary ~linear?

Are the observations normally distributed?

Do you have limited training data?

Start with

LDA

Start withLogistic

Regression

Start with

QDA

Start withK-NearestNeighbors

YES

NO

YES

NO

YES

NO

Linear methodsNon-linear methods

Page 6: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Test vs. training

• In all of these methods, we evaluate performance using test error rather than training error (why?)

• Real life: if the data comes in stages (i.e. trying to predict election results), we have a natural “test set”

• Question: when that’s not the case, what do we do?• Answer: set aside a “validation set”

Page 7: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

What to set aside?

• Smarket: used years 2001-2004 to train, tested on 2005• Caravan: used the first 1000 records as test

• Carseats: - some of you split the data in half- some sampled randomly- some split on one of the predictors (population>300, sales>7, etc.)

Question: did it make a difference?

Page 8: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Answer: absolutely

• Example: Auto dataset (392 observations)• Goal: use linear regression to predict mpg using

polynomial f (horsepower)

• Randomly split into 2 sets of 196 obs. (test and training)

2 4 6 8 10

16

18

20

22

24

26

28

Degree of Polynomial

Mean S

quare

d E

rror

2 4 6 8 10

16

18

20

22

24

26

28

Degree of Polynomial

Mean S

quare

d E

rror

2 4 6 8 10

16

18

20

22

24

26

28

Degree of Polynomial

Mean S

quare

d E

rror

2 4 6 8 10

16

18

20

22

24

26

28

Degree of Polynomial

Mean S

quare

d E

rror

^10x?

Huge variation in prediction accuracy!

Page 9: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Issues with the “validation set” approach

1. The test error rate depends heavily on which observations are used for training vs. testing

2. We’re only training on a subset of the data (why is this a problem?)

We need a new approach...

Page 10: LECTURE 08: CLASSIFICATION WRAP-UPjcrouser/SDS293/lectures/08... · SDS 293: Machine Learning. ... -Logistic regression-LDA-QDA ... Discussion: classification methods Logistic Regression

Coming up

• Monday: Resampling Methods• A2 due tonight

• A3 posted, due Wednesday Oct. 11th by 11:59pm