Lecture 4: Linear Regression and Classification
STATS 202: Data Mining and Analysis
Linh Tran ([email protected])
Department of Statistics, Stanford University
June 30, 2021
Announcements
- HW1 due next Monday.
- Can ask for regrades up to a week from grades being released.
- Solutions will be posted on Wednesday.
- Section on R/Python programming for DS this Friday.
- Please enroll in Piazza/Gradescope.
Outline
- Regression issues
- Comparing linear regression to KNN
- More classification
- Logistic regression
- Linear/quadratic discriminant analysis
Recap
So far, we have:
- Defined Multiple Linear Regression
- Discussed how to estimate model parameters
- Discussed how to test the importance of variables
- Described one approach to choose a subset of variables
- Explained how to code dummy indicators
What are some potential issues?
Potential issues in linear regression
- Interactions between predictors
- Non-linear relationships
- Correlation of error terms
- Non-constant variance of error (heteroskedasticity)
- Outliers
- High leverage points
- Collinearity
- Mis-specification
Interactions between predictors
Linear regression has an additive assumption, e.g.:
sales = β0 + β1 · tv + β2 · radio + ε (1)
e.g. An increase of $100 in TV ads is associated with a fixed increase in sales, regardless of how much is spent on radio ads.
If we visualize the residuals, it is clear that this is false:
[Figure: linear fit of Sales on TV and Radio advertising; the residuals show systematic structure.]
Interactions between predictors
One way to deal with this:
- Include multiplicative variables (aka interaction variables) in the model
sales = β0 + β1 · tv + β2 · radio + β3 · (tv × radio) + ε (2)
- Makes the effect of TV ads dependent on the radio ads (and vice versa)
- The interaction variable is high when both tv and radio are high
Interactions between predictors
Two ways of including interaction variables (in R):
- Create a new variable that is the product of the two
- Specify the interaction in the model formula (see the sketch below)
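A minimal sketch of both approaches in R, assuming an `Advertising` data frame with columns `sales`, `TV`, and `radio` (names are illustrative):

```r
# Way 1: create the product variable explicitly
Advertising$tv_radio <- Advertising$TV * Advertising$radio
fit1 <- lm(sales ~ TV + radio + tv_radio, data = Advertising)

# Way 2: specify the interaction in the model formula
# TV:radio adds only the interaction term; TV * radio expands to TV + radio + TV:radio
fit2 <- lm(sales ~ TV * radio, data = Advertising)

summary(fit2)  # the TV:radio coefficient estimates beta_3 in Eq. (2)
```

Both formulations give the same fit; the formula interface simply saves the explicit product column.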
Non-linear relationships
[Figure: Miles per gallon vs. Horsepower, with linear, degree-2, and degree-5 polynomial fits.]
Scatterplots between X and Y may reveal non-linear relationships.
- Solution: Include polynomial terms in the model (see the sketch below)
MPG = β0 + β1 · horsepower + β2 · horsepower² + β3 · horsepower³ + · · · + ε (3)
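A minimal sketch in R using the Auto data from the ISLR package (assumed installed); `poly()` adds orthogonal polynomial terms:

```r
library(ISLR)  # provides the Auto data set (mpg, horsepower, ...)

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)  # degree-2 fit
fit_five <- lm(mpg ~ poly(horsepower, 5), data = Auto)  # degree-5 fit

# Sequential F-tests comparing the nested fits
anova(fit_lin, fit_quad, fit_five)
```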
Non-linear relationships
In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?
Plot the residuals against the fitted values and look for a pattern:
[Figure: residuals vs. fitted values for a linear fit (left) and a quadratic fit (right); the linear fit shows a clear pattern in the residuals.]
Correlation of error terms
We assumed that the errors for each sample are independent:
yi = f(xi) + εi, with εi iid ∼ N(0, σ²) (4)
When it doesn't hold:
- Invalidates any assertions about standard errors, confidence intervals, and hypothesis tests
Example: Suppose that by accident, we double the data (i.e. we use each sample twice). Then the standard errors would be artificially smaller by a factor of √2.
Correlation of error terms
Examples of when this happens:
- Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.
- Spatial data: Each sample corresponds to a different location in space.
- Clustered data: Study on predicting height from weight at birth. Suppose some of the subjects in the study are in the same family; their shared environment could make them deviate from f(x) in similar ways.
Correlation of error terms
Simulations of time series with increasing correlations on εi .
[Figure: simulated residuals plotted against observation index for ρ = 0.0, 0.5, and 0.9; adjacent residuals track each other more closely as ρ increases.]
Non-constant variance of error (heteroskedasticity)
The variance of the error depends on the input value.
To diagnose this, we can plot residuals vs. fitted values:
[Figure: residuals vs. fitted values for response Y (funnel-shaped spread) and for response log(Y) (roughly constant spread).]
Solution: If the trend in variance is relatively simple, we can transform the response, for example using a logarithm (see the sketch below).
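A minimal sketch in R; `y` and `x` are assumed to be numeric vectors already in the workspace (illustrative names):

```r
fit_raw <- lm(y ~ x)
plot(fitted(fit_raw), resid(fit_raw))   # funnel shape suggests heteroskedasticity

fit_log <- lm(log(y) ~ x)               # log-transform the response (requires y > 0)
plot(fitted(fit_log), resid(fit_log))   # spread should now be more even
```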
Outliers
Outliers are points with very large errors, e.g.
[Figure: scatterplot of Y vs. X with outlying observation 20 highlighted, plus residuals and studentized residuals vs. fitted values.]
While they may not affect the fit, they might affect our assessment of model quality.
Possible solutions:
- If we believe an outlier is due to an error in data collection, we can remove it.
- An outlier might be evidence of a missing predictor, or the need to specify a more complex model.
High leverage points
Some samples with extreme inputs have a large effect on β.
[Figure: scatterplots of Y vs. X and of X2 vs. X1 highlighting high-leverage observation 41 and outlier 20, plus studentized residuals vs. leverage.]
This can be measured with the leverage statistic, or self-influence:
hii = ∂ŷi/∂yi = [X(XᵀX)⁻¹Xᵀ]i,i ∈ [1/n, 1], (5)
where X(XᵀX)⁻¹Xᵀ is the hat matrix. Values closer to 1 indicate high leverage (see the sketch below).
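A minimal sketch in R; `hatvalues()` returns the diagonal of the hat matrix for a fitted `lm` object (here `fit` is an assumed existing fit):

```r
# fit is an existing lm object, e.g. fit <- lm(y ~ x1 + x2, data = dat)
h <- hatvalues(fit)        # leverage statistic h_ii for each observation
mean(h)                    # equals (p + 1) / n for a model with p predictors
which(h > 3 * mean(h))     # flag observations with unusually high leverage
```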
Studentized residuals
[Figure: same data as above; studentized residuals vs. leverage, highlighting observations 20 and 41.]
- The residual ε̂i = yi − ŷi is an estimate of the noise εi
- The standard error of ε̂i is σ √(1 − hii)
- A studentized residual is ε̂i divided by its standard error
- It follows a Student-t distribution with n − p − 2 degrees of freedom (see the sketch below)
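A minimal follow-on to the leverage sketch above; `rstudent()` computes studentized residuals for the same assumed `fit`:

```r
rs <- rstudent(fit)        # externally studentized residuals
which(abs(rs) > 3)         # observations with unusually large residuals
plot(hatvalues(fit), rs)   # leverage vs. studentized residuals, as in the figure
```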
Collinearity
Two predictors are collinear if one explains the other well, e.g.
limit = a × rating + b (6)
i.e. they contain the same information
[Figure: scatterplots of Age vs. Limit (weak relationship) and Rating vs. Limit (nearly collinear).]
Collinearity
Problem: The coefficients become unidentifiable.
- i.e. different coefficients can give the same fit
Example: using two identical predictors (limit):
balance = β0 + β1 · limit + β2 · limit (7)
= β0 + (β1 + 100) · limit + (β2 − 100) · limit (8)
The fit (β0, β1, β2) is just as good as (β0, β1 + 100, β2 − 100).
Collinearity effect
Collinearity results in unstable estimates of β.
[Figure: contours of the RSS as a function of (βLimit, βAge) and of (βLimit, βRating); in the collinear case the contours form a long, narrow valley of nearly equivalent fits.]
Recognizing collinearity
If 2 variables are collinear, we can easily diagnose this using their correlation.
A group of q variables is multicollinear if these variables "contain less information" than q independent variables. Pairwise correlations may not reveal multicollinear variables.
The Variance Inflation Factor (VIF) measures how necessary a variable is, or how predictable it is given the other variables:
VIF(β̂j) = 1 / (1 − R²Xj|X−j), (9)
where R²Xj|X−j is the R² statistic for the multiple linear regression of the predictor Xj onto the remaining predictors (see the sketch below).
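A minimal sketch in R, assuming the ISLR and car packages are installed; `vif()` computes Eq. (9) for each predictor:

```r
library(ISLR)   # Credit data (Balance, Age, Limit, Rating, ...)
library(car)    # vif()

fit <- lm(Balance ~ Age + Limit + Rating, data = Credit)
vif(fit)        # Limit and Rating should show very large VIFs; Age should not
```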
Dealing with collinearity
Three primary ways:
1. Drop one of the correlated features (e.g. Ridge/LASSO).
2. Combine the correlated features (e.g. PCA).
3. More data.
Mis-specification
What if our true distribution P0 isn’t linear?
Estimates will still converge to a fixed value within our model, e.g.
θ0 ≜ arg minθ D(Pθ, P0), where θ = (β0, ..., βp) (10)
Comparing Linear Regression to K-nearest neighbors
Linear regression: prototypical parametric method.
KNN regression: prototypical nonparametric method:
f̂n(x) = (1/K) ∑_{i∈NK(x)} yi (11)
[Figure: examples of KNN with K = 1 (left) and K = 9 (right), with two predictors x1 and x2.]
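A minimal base-R sketch of the KNN regression estimate in Eq. (11); `x` and `y` are assumed to be a numeric training predictor and response (simulated here for illustration):

```r
# Predict at a single point x0 by averaging the K nearest training responses
knn_predict <- function(x0, x, y, K = 9) {
  d <- abs(x - x0)           # distances to all training points (1-D case)
  nbrs <- order(d)[1:K]      # indices of the K nearest neighbors, N_K(x0)
  mean(y[nbrs])              # average their responses
}

# Example with data simulated from a linear model
set.seed(1)
x <- runif(100, -1, 1)
y <- 1.5 + 2 * x + rnorm(100, sd = 0.3)
knn_predict(0.5, x, y, K = 9)   # compare with the true value 1.5 + 2 * 0.5 = 2.5
```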
Comparing Linear Regression to K-nearest neighbors
Linear regression: prototypical parametric method. KNN regression: prototypical nonparametric method. Long story short:
- KNN is only better when the function f0 is not linear (and there is plenty of data)
- Question: What if the true function f0 IS linear?
- When n is not much larger than p, even if f0 is nonlinear, linear regression can outperform KNN.
- KNN has smaller bias, but this comes at the price of (much) higher variance (cf. overfitting)
Comparing Linear Regression to K-nearest neighbors
KNN estimates for a simulation from a linear model
- True function f0 is linear
[Figure: KNN fits with K = 1 (left) and K = 9 (right) on data simulated from a linear model.]
Comparing Linear Regression to K-nearest neighbors
Linear models dominate KNN
- We're able to gain statistical efficiency by taking advantage of the linear association
[Figure: least squares fit to the simulated linear data (left); test mean squared error vs. 1/K for KNN regression (right).]
Comparing Linear Regression to K-nearest neighbors
Increasing deviations from linearity
[Figure: two simulations with increasing deviations from linearity; fitted curves (left) and test mean squared error vs. 1/K (right).]
Comparing Linear Regression to K-nearest neighbors
When there are more predictors than observations, linear regression dominates.
[Figure: test mean squared error vs. 1/K for KNN (with linear regression for comparison) as the number of predictors grows: p = 1, 2, 3, 4, 10, 20.]
When p >> n, each sample has no nearby neighbors; this is known as the curse of dimensionality.
- The variance of KNN regression is very large
Classification problems
Recall: Supervised learning with a qualitative or categorical response. As common as regression, if not more so.
- Medical diagnosis: Given the symptoms a patient shows, predict which of 3 conditions they are attributed to
- Online banking: Determine whether a transaction is fraudulent or not, on the basis of the IP address, client's history, etc.
- Web searching: Based on a user's attributes and the string of a web search, predict which link a person will click
- Online advertising: Predict whether a user will click on an ad
Classification problems
Recall: In classification, the function f0 we care about is
f0 ≜ P0[Y = y | X1, X2, ..., Xp] (12)
To get a prediction, we use the Bayes Classifier:
ŷ = arg maxy P0[Y = y | X1, X2, ..., Xp] (13)
Example: Suppose Y ∈ {0, 1}. We could use a linear model:
P[Y = 1|X] = β0 + β1X1 + · · · + βpXp (14)
Problems:
- This would allow probabilities < 0 and > 1
- Difficult to extend to more than 2 categories
Logistic regression
An idea: Let's apply a function to the result to keep it within [0, 1]:
g⁻¹(z) = 1 / (1 + exp(−z)) (15)
i.e.
P[Y = 1|X] = 1 / (1 + exp(−(β0 + β1X1 + · · · + βpXp))) (16)
This is equivalent to modeling the log-odds, e.g.
log[ P[Y = 1|X] / P[Y = 0|X] ] = β0 + β1X1 + · · · + βpXp (17)
n.b. exp(βj) is commonly referred to as the odds-ratio for Xj (see the sketch below)
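A minimal sketch in R; `glm()` with `family = binomial` fits the logistic regression in Eq. (16)-(17). The Default data from ISLR is used for illustration:

```r
library(ISLR)   # provides the Default data (default, student, balance, income)

fit <- glm(default ~ balance + student + income,
           family = binomial, data = Default)
summary(fit)             # coefficients are on the log-odds scale
exp(coef(fit))           # odds-ratios exp(beta_j)

# Predicted probability P(default = Yes) for each observation
p_hat <- predict(fit, type = "response")
head(p_hat)
```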
Fitting a logistic regression
Let (x1, y1), (x2, y2), ..., (xn, yn) be our training data.
In the linear model
log[ P[Y = 1|X] / P[Y = 0|X] ] = β0 + β1X1 + · · · + βpXp, (18)
we don't actually observe the left-hand side.
- We observe Y ∈ {0, 1}, not probabilities
- This prevents us from using e.g. least squares to estimate our parameters
Fitting a logistic regression
Solution: Let's try to maximize the probability of our training data:
L(θ) = ∏_{i=1}^{n} P(Y = yi | X = xi) (19)
     = ∏_{i=1}^{n} pi^{yi} · (1 − pi)^{1−yi} (20)
where pi = g⁻¹(β0 + β1x1,i + · · · + βpxp,i)
- We look for θ such that L(θ) is maximized
- aka maximum likelihood estimation (MLE)
- Has no closed-form solution, so it is solved with numerical methods (e.g. Newton's method)
Fitting a logistic regression
Note: We typically deal with the log-likelihood:
ℓ(θ) = log L(θ) (21)
     = ∑_{i=1}^{n} yi log(pi) + (1 − yi) log(1 − pi) (22)
     = ∑_{k=0}^{1} ∑_{i=1}^{n} I(yi = k) log(P(k | X = xi)) (23)
Fitting a logistic regression
Given our loss function:
ℓ(θ) = ∑_{i=1}^{n} yi log(pi) + (1 − yi) log(1 − pi) (24)
where
pi = 1 / (1 + exp(−Zi)) (25)
Zi = Xi β (26)
we can derive the gradient using the chain rule:
∂ℓ(θ)/∂β = ∂ℓ(θ)/∂pi × ∂pi/∂Zi × ∂Zi/∂β (27)
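A minimal sketch of gradient ascent on the log-likelihood, using the fact that the chain rule in Eq. (27) simplifies to ∂ℓ/∂β = Xᵀ(y − p); the data here are simulated purely for illustration:

```r
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                 # design matrix with an intercept column
beta_true <- c(-1, 2)
p_true <- 1 / (1 + exp(-X %*% beta_true))
y <- rbinom(n, 1, p_true)

beta <- c(0, 0)                         # start at zero
for (iter in 1:2000) {
  p <- 1 / (1 + exp(-X %*% beta))       # current fitted probabilities p_i
  grad <- t(X) %*% (y - p)              # gradient of the log-likelihood
  beta <- beta + 0.5 * grad / n         # small gradient-ascent step
}
beta   # should be close to coef(glm(y ~ X[, 2], family = binomial))
```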
Logistic regression in R
Estimating uncertainty
- We can estimate the standard error of each coefficient (e.g. using Fisher's information)
IY(β) = −Eβ[∇²ℓ(β)] (28)
- The z-statistic (for logistic regression) is the equivalent of the t-statistic (in linear regression):
z = β̂j / SE(β̂j) (29)
- The p-values are tests of the null hypothesis βj = 0
Logistic regression in R
Example fit:
[Figure: example logistic regression fit in R (model output).]
Example: Predicting credit card default
Predictors:
- student: 1 if student, 0 otherwise
- balance: credit card balance
- income: person's income
In this dataset there is confounding, but little collinearity.
- Students tend to have higher balances. So, balance is explained by student, but not very well.
- People with a high balance are more likely to default.
- Among people with a given balance, students are less likely to default.
Example: Predicting credit card default
[Figure: default rate vs. credit card balance (left); boxplots of credit card balance by student status (right).]
Some issues with logistic regression
- The coefficients become unstable when there is collinearity
  - This also affects the convergence of the fitting algorithm
- When the classes are well separated, the coefficients become unstable
  - This is always the case when p ≥ n − 1.
- Sometimes may not converge
  - e.g. needs more iterations
Linear Discriminant Analysis
A linear model (like logistic regression). Unlike logistic regression:
- Does not become unstable when classes are well separated
- With small n and X approximately normal, it is stable
- Popular when we have > 2 classes
High-level idea: Model the distribution of X given Y, and apply Bayes' theorem, i.e.
P(Y = k | X = x) = πk fk(x) / ∑_{l=1}^{K} πl fl(x) (30)
- A common assumption is that fk(x) is Gaussian (see the sketch below)
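A minimal sketch in R; `lda()` from the MASS package fits Eq. (30) with Gaussian class densities, again using the Default data from ISLR for illustration:

```r
library(ISLR)   # Default data
library(MASS)   # lda()

fit_lda <- lda(default ~ balance + student, data = Default)
fit_lda                  # prior probabilities pi_k and group means mu_k

pred <- predict(fit_lda)
head(pred$posterior)     # posterior probabilities P(Y = k | X = x)
table(pred$class, Default$default)   # confusion matrix on the training data
```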
Linear Discriminant Analysis
Example: K = 2 with Gaussian fk(x) and common σ²
P(Y = k | X = x) = πk fk(x) / ∑_{l=1}^{K} πl fl(x) (31)
              = πk (1/(√(2π) σ)) exp(−(x − µk)²/(2σ²)) / ∑_{l=1}^{2} πl (1/(√(2π) σ)) exp(−(x − µl)²/(2σ²)) (32)
Taking the log and rearranging gives:
δk(x) = x · µk/σ² − µk²/(2σ²) + log(πk) (33)
If π1 = π2, our Bayes Classifier is:
2x(µ1 − µ2) > µ1² − µ2² (34)
Linear Discriminant Analysis
Example of LDA
[Figure: example of LDA decision boundaries with two predictors X1 and X2.]
Quadratic Discriminant Analysis
Similar to LDA
- Assumes Gaussian fk(x)
- Unlike LDA:
  - Assumes each class has its own covariance matrix (Σk)
This results in a quadratic discriminant function:
δk(x) = −(1/2)(x − µk)ᵀ Σk⁻¹ (x − µk) − (1/2) log|Σk| + log πk
      = −(1/2) xᵀ Σk⁻¹ x + xᵀ Σk⁻¹ µk − (1/2) µkᵀ Σk⁻¹ µk − (1/2) log|Σk| + log πk (35)
This results in more parameters to fit:
- LDA: Kp parameters
- QDA: Kp(p + 1)/2 parameters
(see the sketch below)
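A minimal sketch in R; `qda()` from MASS fits class-specific covariance matrices (same illustrative Default data as before):

```r
library(ISLR)
library(MASS)

fit_qda <- qda(default ~ balance + income, data = Default)
pred_qda <- predict(fit_qda)
table(pred_qda$class, Default$default)   # training confusion matrix
```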
Linear Discriminant Analysis vs Logistic regression
Assume a two-class setting with one predictor
Linear Discriminant Analysis:
log[ p1(x) / (1 − p1(x)) ] = c0 + c1x (36)
- c0 and c1 computed using µ0, µ1, and σ²
Logistic regression:
log[ P[Y = 1|x] / (1 − P[Y = 1|x]) ] = β0 + β1x (37)
- β0 and β1 estimated using MLE
Comparison of classification methods
[Figure: boxplots of test error rates for KNN-1, KNN-CV, LDA, logistic regression, and QDA across six simulation scenarios.]
References
[1] ISL, Chapters 3-4.
[2] ESL, Chapter 3.