
Lecture 4: Linear Regression and Classification

STATS 202: Data Mining and Analysis

Linh Tran (tranlm@stanford.edu)

Department of Statistics, Stanford University

June 30, 2021


Announcements

- HW1 due next Monday.

- Regrade requests can be made up to a week after grades are released.

- Solutions will be posted on Wednesday.

- Section on R/Python programming for DS this Friday.

- Please enroll in Piazza/Gradescope.


Outline

- Regression issues

- Comparing linear regression to KNN

- More classification

- Logistic regression

- Linear/quadratic discriminant analysis


Recap

So far, we have:

- Defined Multiple Linear Regression

- Discussed how to estimate model parameters

- Discussed how to test the importance of variables

- Described one approach to choose a subset of variables

- Explained how to code dummy indicators

What are some potential issues?


Potential issues in linear regression

- Interactions between predictors

- Non-linear relationships

- Correlation of error terms

- Non-constant variance of error (heteroskedasticity)

- Outliers

- High leverage points

- Collinearity

- Mis-specification


Interactions between predictors

Linear regression has an additive assumption, e.g.:

sales = β0 + β1 · tv + β2 · radio + ε (1)

e.g. an increase of $100 in TV ads corresponds to a fixed increase in sales, independent of how much is spent on radio ads.

If we visualize the residuals, it is clear that this is false:

[Figure: fitted regression surface for Sales as a function of TV and Radio ad budgets]


Interactions between predictors

One way to deal with this:

- Include multiplicative variables (aka interaction variables) in the model:

sales = β0 + β1 · tv + β2 · radio + β3 · (tv × radio) + ε (2)

- Makes the effect of TV ads dependent on the radio ads (and vice versa)

- The interaction variable is high when both tv and radio are high


Interactions between predictors

Two ways of including interaction variables (in R):

- Create a new variable that is the product of the two

- Specify the interaction in the model formula (see the sketch below)
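For instance, a minimal sketch in R, assuming an advertising data frame with columns sales, tv, and radio (these names are hypothetical):

    # Two equivalent ways to fit the tv-by-radio interaction
    fit1 <- lm(sales ~ tv + radio + I(tv * radio), data = advertising)  # explicit product variable
    fit2 <- lm(sales ~ tv * radio, data = advertising)                  # formula shorthand: main effects + interaction
    summary(fit2)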


Non-linear relationships

[Figure: miles per gallon vs. horsepower (Auto data) with Linear, Degree 2, and Degree 5 polynomial fits]

Scatterplots between X and Y may reveal non-linear relationships.

- Solution: Include polynomial terms in the model (see the R sketch below)

MPG = β0 + β1 · horsepower + β2 · horsepower² + β3 · horsepower³ + · · · + ε (3)
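A minimal sketch in R, assuming the Auto data from the ISLR package (with columns mpg and horsepower):

    library(ISLR)  # provides the Auto data set
    fit_quad  <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # quadratic term
    fit_poly5 <- lm(mpg ~ poly(horsepower, 5), data = Auto)           # degree-5 orthogonal polynomial
    summary(fit_quad)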


Non-linear relationships

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Plot the residuals against the fitted values and look for a pattern:

[Figure: residuals vs. fitted values for the linear fit (left) and the quadratic fit (right); observations 323, 330, and 334 are labeled]


Correlation of error terms

We assumed that the errors for each sample are independent:

yi = f(xi) + εi,   εi ~ iid N(0, σ²) (4)

When it doesn't hold:

- Invalidates any assertions about standard errors, confidence intervals, and hypothesis tests

Example: Suppose that by accident, we double the data (i.e. we use each sample twice). Then, the standard errors would be artificially smaller by a factor of √2.


Correlation of error terms

Examples of when this happens:

- Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

- Spatial data: Each sample corresponds to a different location in space.

- Clustered data: Consider a study predicting height from weight at birth. If some of the subjects are in the same family, their shared environment could make them deviate from f(x) in similar ways.


Correlation of error terms

Simulations of time series with increasing correlations on εi .

[Figure: simulated residual time series for ρ = 0.0, ρ = 0.5, and ρ = 0.9, plotted against observation index]


Non-constant variance of error (heteroskedasticity)

The variance of the error depends on the input value.

To diagnose this, we can plot residuals vs. fitted values:

[Figure: residuals vs. fitted values for response Y (left) and for response log(Y) (right)]

Solution: If the trend in variance is relatively simple, we can transform the response using a logarithm, for example (see the sketch below).
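A minimal sketch in R (the response y, predictors x1 and x2, and data frame dat are hypothetical):

    fit_log <- lm(log(y) ~ x1 + x2, data = dat)   # model the log of the response
    plot(fitted(fit_log), resid(fit_log))         # re-check whether the residual spread is stable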


Outliers

Outliers are points with very large errors, e.g.

[Figure: scatterplot with an outlier (observation 20), residuals vs. fitted values, and studentized residuals vs. fitted values]

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions:

- If we believe an outlier is due to an error in data collection, we can remove it.

- An outlier might be evidence of a missing predictor, or the need to specify a more complex model.


High leverage points

Some samples with extreme inputs have a large effect on β̂.

[Figure: regression with a high-leverage point (observation 41) and an outlier (observation 20); scatter of X1 vs. X2; studentized residuals vs. leverage]

This can be measured with the leverage statistic or self-influence:

hii = ∂ŷi/∂yi = (X(XᵀX)⁻¹Xᵀ)ii ∈ [1/n, 1] (5)

where X(XᵀX)⁻¹Xᵀ is the hat matrix. Values closer to 1 have high leverage.
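In R the diagonal of the hat matrix is available directly; a sketch (the fit below is hypothetical, reusing the ISLR Auto data):

    library(ISLR)
    fit <- lm(mpg ~ horsepower + weight, data = Auto)
    h <- hatvalues(fit)            # leverage h_ii for each observation
    which(h > 2 * mean(h))         # a common rule of thumb for flagging high-leverage points
    plot(fit, which = 5)           # built-in residuals-vs-leverage diagnostic plot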


Studentized residuals

[Figure: same simulated data as above, with studentized residuals plotted against leverage; observations 20 and 41 are highlighted]

- The residual ε̂i = yi − ŷi is an estimate for the noise εi

- The standard error of ε̂i is σ̂ √(1 − hii)

- A studentized residual is ε̂i divided by its standard error

- It follows a Student-t distribution with n − p − 2 degrees of freedom
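These are available directly in R; a sketch, reusing the hypothetical fit from above:

    rstudent(fit)    # externally studentized residuals
    rstandard(fit)   # internally standardized residuals
    hatvalues(fit)   # the corresponding leverages h_ii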


Collinearity

Two predictors are collinear if one explains the other well, e.g.

limit = a × rating + b (6)

i.e. they contain the same information

[Figure: Credit data; Age vs. Limit (weak relationship) and Rating vs. Limit (nearly linear relationship)]


Collinearity

Problem: The coefficients become unidentifiable.

- i.e. different coefficients can give the same fit

Example: using two identical predictors (limit):

balance = β0 + β1 · limit + β2 · limit (7)

= β0 + (β1 + 100) · limit + (β2 − 100) · limit (8)

The fit (β0, β1, β2) is just as good as (β0, β1 + 100, β2 − 100)


Collinearity effect

Collinearity results in unstable estimates of β.

[Figure: contour plots of RSS as a function of βLimit and βAge (left) and of βLimit and βRating (right)]


Recognizing collinearity

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of q variables is multicollinear if these variables “contain less information” than q independent variables. Pairwise correlations may not reveal multicollinear variables.

The Variance Inflation Factor (VIF) measures how necessary a variable is, or how predictable it is given the other variables:

VIF(βj) = 1 / (1 − R²_{Xj|X−j}), (9)

where R²_{Xj|X−j} is the R² statistic for the multiple linear regression of the predictor Xj onto the remaining predictors.
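A sketch in R, assuming the Credit data from the ISLR package and the car package for vif():

    library(ISLR); library(car)
    fit <- lm(Balance ~ Limit + Rating + Age, data = Credit)
    vif(fit)                                               # variance inflation factors

    # computed by hand for one predictor:
    r2 <- summary(lm(Limit ~ Rating + Age, data = Credit))$r.squared
    1 / (1 - r2)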


Dealing with collinearity

Three primary ways:

1. Drop one of the correlated features (e.g. Ridge/LASSO).

2. Combine the correlated features (e.g. PCA).

3. More data.


Mis-specification

What if our true distribution P0 isn’t linear?

Estimates will still converge to a fixed value within our model, e.g.

θ0 := arg minθ D(Pθ, P0),   θ = (β0, ..., βp) (10)


Comparing Linear Regression to K-nearest neighbors

Linear regression: prototypical parametric method
KNN regression: prototypical nonparametric method

f̂(x) = (1/K) Σ_{i ∈ NK(x)} yi (11)


Examples of KNN with K = 1 (left) and K = 9 (right)
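A comparable simulation can be run in R; one option is knn.reg() from the FNN package (an assumption that it is installed):

    library(FNN)
    set.seed(1)
    x <- sort(runif(100, -1, 1)); y <- 2 + 2 * x + rnorm(100, sd = 0.3)   # simulated linear truth
    grid <- seq(-1, 1, length.out = 200)
    knn1 <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 1)$pred
    knn9 <- knn.reg(train = matrix(x), test = matrix(grid), y = y, k = 9)$pred
    lmfit <- predict(lm(y ~ x), newdata = data.frame(x = grid))           # linear fit for comparison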


Comparing Linear Regression to K-nearest neighbors

Linear regression: prototypical parametric method. KNN regression: prototypical nonparametric method. Long story short:

- KNN is only better when the function f0 is not linear (and there is plenty of data)

- Question: What if the true function f0 IS linear?

- When n is not much larger than p, even if f0 is nonlinear, linear regression can outperform KNN.

- KNN has smaller bias, but this comes at a price of (much) higher variance (c.f. overfitting)


Comparing Linear Regression to K-nearest neighbors

KNN estimates for a simulation from a linear model

- True function f0 is linear


KNN fits with K = 1 (left) and K = 9 (right)


Comparing Linear Regression to K-nearest neighbors

Linear models dominate KNN

- We're able to gain statistical efficiency by taking advantage of the linear association

[Figure: simulated data and fits (left); test mean squared error vs. 1/K for KNN compared to linear regression (right)]


Comparing Linear Regression to K-nearest neighbors

Increasing deviations from linearity

[Figure: two settings with increasing deviation from linearity (left column) and the corresponding test mean squared error vs. 1/K (right column)]


Comparing Linear Regression to K-nearest neighbors

When there are more predictors than observations, linear regression dominates

[Figure: test mean squared error vs. 1/K for KNN and linear regression with p = 1, 2, 3, 4, 10, 20 predictors]

When p ≫ n, each sample effectively has no nearby neighbors; this is known as the curse of dimensionality.

- The variance of KNN regression is very large


Classification problems

Recall: Supervised learning with a qualitative or categorical response. At least as common as regression.

- Medical diagnosis: Given the symptoms a patient shows, predict which of 3 conditions they are attributable to

- Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client's history, etc.

- Web searching: Based on a user's attributes and the string of a web search, predict which link a person will click

- Online advertising: Predict whether a user will click on an ad


Classification problems

Recall: In classification, the function f0 we care about is

f0 := P0[Y = y | X1, X2, ..., Xp] (12)

To get a prediction, we use the Bayes Classifier:

ŷ = arg max_y P0[Y = y | X1, X2, ..., Xp] (13)

Example: Suppose Y ∈ {0, 1}. We could use a linear model:

P[Y = 1|X] = β0 + β1X1 + · · · + βpXp (14)

Problems:

- This would allow probabilities < 0 and > 1

- Difficult to extend to more than 2 categories


Logistic regression

An idea: Let's apply a function to the result to keep it within [0, 1]

g⁻¹(z) = 1 / (1 + exp(−z)) (15)

i.e.

P[Y = 1|X] = 1 / (1 + exp(−(β0 + β1X1 + · · · + βpXp))) (16)

This is equivalent to modeling the log-odds, e.g.

log( P[Y = 1|X] / P[Y = 0|X] ) = β0 + β1X1 + · · · + βpXp (17)

n.b. exp(βj) is commonly referred to as the odds-ratio for Xj


Fitting a logistic regression

Let (x1, y1), (x2, y2), ..., (xn, yn) be our training data.

In the linear model

log( P[Y = 1|X] / P[Y = 0|X] ) = β0 + β1X1 + · · · + βpXp, (18)

we don't actually observe the left side:

- We observe Y ∈ {0, 1}, not probabilities

- This prevents us from using e.g. least squares to estimate our parameters


Fitting a logistic regression

Solution: Let's try to maximize the probability of our training data:

L(θ) = ∏_{i=1}^n P(Y = yi | X = xi) (19)
     = ∏_{i=1}^n pi^yi · (1 − pi)^(1−yi) (20)

where pi = g⁻¹(β0 + β1x1,i + · · · + βpxp,i)

- We look for θ such that L(θ) is maximized

- aka Maximum likelihood estimation (MLE)

- Has no closed form solution, so solved with numerical methods (e.g. Newton's method); see the sketch below
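A hand-rolled MLE in R via general-purpose optimization, as a sanity check against glm(); a sketch assuming the ISLR Default data:

    library(ISLR)
    negloglik <- function(beta, X, y) {
      p <- plogis(X %*% beta)                       # P(Y = 1 | X) under the logistic model
      -sum(y * log(p) + (1 - y) * log(1 - p))       # negative log-likelihood
    }
    X <- model.matrix(~ scale(balance) + scale(income), data = Default)
    y <- as.numeric(Default$default == "Yes")
    fit <- optim(rep(0, ncol(X)), negloglik, X = X, y = y, method = "BFGS")
    fit$par   # compare with coef(glm(y ~ X - 1, family = binomial))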


Fitting a logistic regression

Note: We typically deal with the log-likelihood:

ℓ(θ) = log L(θ) (21)
     = Σ_{i=1}^n [ yi log(pi) + (1 − yi) log(1 − pi) ] (22)
     = Σ_{k=0}^1 Σ_{i=1}^n I(Yi = k) log(P(k | X = xi)) (23)


Fitting a logistic regression

Given our loss function:

ℓ(θ) = Σ_{i=1}^n [ yi log(pi) + (1 − yi) log(1 − pi) ] (24)

where

pi = 1 / (1 + exp(−Zi)) (25)
Zi = Xiβ (26)

We can derive the gradient using the chain rule:

∂ℓ(θ)/∂β = ∂ℓ(θ)/∂pi × ∂pi/∂Zi × ∂Zi/∂β (27)
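Carrying out each factor for a single observation i and summing over observations gives the familiar closed form:

∂ℓ(θ)/∂pi = yi/pi − (1 − yi)/(1 − pi),   ∂pi/∂Zi = pi(1 − pi),   ∂Zi/∂β = Xiᵀ

so that

∂ℓ(θ)/∂β = Σ_{i=1}^n (yi − pi) Xiᵀ = Xᵀ(y − p) in matrix form.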


Logistic regression in R

Estimating uncertainty

- We can estimate the Standard Error of each coefficient (e.g. using Fisher's information)

IY(β) = −Eβ[∇²ℓ(β)] (28)

- The z-statistic (for logistic regression) is the equivalent of the t-statistic (in linear regression):

z = β̂j / SE(β̂j) (29)

- The p-values are tests of the null hypothesis βj = 0


Logistic regression in R

Example fit
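The original slide shows R model output here; a minimal sketch of how such a fit might be produced, assuming the ISLR Default data (columns default, student, balance, income):

    library(ISLR)
    fit <- glm(default ~ balance + income + student, data = Default, family = binomial)
    summary(fit)                           # coefficients, standard errors, z-statistics, p-values
    exp(coef(fit))                         # odds ratios exp(beta_j)
    head(predict(fit, type = "response"))  # fitted probabilities P(Y = 1 | X)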


Example: Predicting credit card default

Predictors:

- student: 1 if student, 0 otherwise

- balance: credit card balance

- income: person's income

In this dataset there is confounding, but little collinearity:

- Students tend to have higher balances. So, balance is explained by student, but not very well.

- People with a high balance are more likely to default.

- Among people with a given balance, students are less likely to default.


Example: Predicting credit card default


[Figure: default rate vs. credit card balance (left); credit card balance by student status (right)]


Example: Predicting credit card default


Some issues with logistic regression

- The coefficients become unstable when there is collinearity
  - This also affects the convergence of the fitting algorithm

- When the classes are well separated, the coefficients become unstable
  - This is always the case when p ≥ n − 1

- Sometimes may not converge
  - e.g. needs more iterations


Linear Discriminant Analysis

A linear model (like logistic regression). Unlike logistic regression:

- Does not become unstable when classes are well separated

- With small n and X approximately normal, is stable

- Popular when we have > 2 classes

High level idea: Model the distribution of X given Y, and apply Bayes' theorem, i.e.

P(Y = k | X = x) = πk fk(x) / Σ_{l=1}^K πl fl(x) (30)

- A common assumption is that fk(x) is Gaussian
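In R, LDA (and QDA below) are available in the MASS package; a sketch using the ISLR Default data:

    library(MASS); library(ISLR)
    lda_fit  <- lda(default ~ balance + student, data = Default)
    lda_pred <- predict(lda_fit)      # $class gives labels, $posterior gives P(Y = k | X)
    table(predicted = lda_pred$class, truth = Default$default)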


Linear Discriminant Analysis

Example: K = 2 with Gaussian fk(x) and common σ²

P(Y = k | X = x) = πk fk(x) / Σ_{l=1}^K πl fl(x) (31)
                 = πk (1/(√(2π)σ)) exp(−(x − µk)²/(2σ²)) / Σ_{l=1}^2 πl (1/(√(2π)σ)) exp(−(x − µl)²/(2σ²)) (32)

Taking the log and rearranging gives:

δk(x) = x · µk/σ² − µk²/(2σ²) + log(πk) (33)

If π1 = π2, our Bayes Classifier assigns class 1 when:


Linear Discriminant Analysis

Example of LDA

[Figure: simulated example in (X1, X2) illustrating LDA decision boundaries]


Quadratic Discriminant Analysis

Similar to LDA

- Assumes Gaussian fk(x)

- Unlike LDA:
  - Assumes each class has its own covariance matrix (Σk)

This results in a quadratic discriminant function:

δk(x) = −½ (x − µk)ᵀ Σk⁻¹ (x − µk) − ½ log|Σk| + log πk
      = −½ xᵀΣk⁻¹x + xᵀΣk⁻¹µk − ½ µkᵀΣk⁻¹µk − ½ log|Σk| + log πk (35)

This results in more parameters to fit:

- LDA: Kp parameters

- QDA: Kp(p + 1)/2 parameters
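A sketch in R, parallel to the LDA fit above (again assuming MASS and the ISLR Default data):

    qda_fit <- qda(default ~ balance + student, data = Default)   # each class gets its own covariance matrix
    table(predicted = predict(qda_fit)$class, truth = Default$default)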


Linear Discriminant Analysis vs Logistic regression

Assume a two-class setting with one predictor

Linear Discriminant Analysis:

log[ p1(x) / (1 − p1(x)) ] = c0 + c1x (36)

- c0 and c1 are computed using µ0, µ1, and σ²

Logistic regression:

log[ P[Y = 1|x] / (1 − P[Y = 1|x]) ] = β0 + β1x (37)

- β0 and β1 are estimated using MLE


Comparison of classification methods

[Figure: boxplots of test error rates for KNN-1, KNN-CV, LDA, Logistic, and QDA across six simulation scenarios]


References

[1] ISL. Chapters 3-4.

[2] ESL. Chapter 3.
