Lecture 5: Clustering, Linear Regression
Reading: Chapter 10, Sections 3.1-2
STATS 202: Data mining and analysis
Sergio Bacallado
September 19, 2018
Announcements
- Starting next week, Julia Fukuyama will be having her office hours after business hours for SCPD students (through Skype). The times will be updated on the website as usual.
- Probability problems.
- Homework 2 will go out this afternoon. It's a long one.
Hierarchical clustering
Most algorithms for hierarchical clustering are agglomerative.
[Figure: the dendrogram produced for 9 samples, alongside a scatterplot of the samples in the X1, X2 plane]
The output of the algorithm is a dendrogram. We must be careful about how we interpret the dendrogram.
Notion of distance between clusters
At each step, we link the 2 clusters that are "closest" to each other. Hierarchical clustering algorithms are classified according to the notion of distance between clusters:
- Complete linkage: The distance between 2 clusters is the maximum distance between any pair of samples, one in each cluster.
- Single linkage: The distance between 2 clusters is the minimum distance between any pair of samples, one in each cluster.
- Average linkage: The distance between 2 clusters is the average of all pairwise distances.
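Not part of the slides, but as a minimal sketch: all three linkages are available in base R's hclust, which takes a precomputed distance matrix. Simulated data stand in here for the 9-point example above.

```r
# Minimal sketch (simulated data): hierarchical clustering in R with the
# three linkages defined above.
set.seed(1)
x <- matrix(rnorm(18), ncol = 2)   # 9 samples in 2 dimensions
d <- dist(x)                       # Euclidean distances between samples

hc.complete <- hclust(d, method = "complete")
hc.single   <- hclust(d, method = "single")
hc.average  <- hclust(d, method = "average")

par(mfrow = c(1, 3))
plot(hc.average,  main = "Average Linkage")
plot(hc.complete, main = "Complete Linkage")
plot(hc.single,   main = "Single Linkage")

cutree(hc.complete, k = 2)         # cluster labels from cutting the dendrogram
```

cutree is one way to turn the dendrogram into a flat clustering; where to cut it is exactly the interpretation question raised above.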
Example
[Figure 10.12: the same data clustered with average, complete, and single linkage]
Clustering is riddled with questions and choices
- Is clustering appropriate? i.e. Could a sample belong to more than one cluster?
  - Mixture models, soft clustering, topic models.
- How many clusters are appropriate?
  - Choose subjectively; it depends on the inference sought.
  - There are formal methods based on gap statistics, mixture models, etc.
- Are the clusters robust?
  - Run the clustering on different random subsets of the data. Is the structure preserved?
  - Try different clustering algorithms. Are the conclusions consistent?
- Most important: temper your conclusions.
Clustering is riddled with questions and choices
- Should we scale the variables before doing the clustering?
  - Variables with larger variance have a larger effect on the Euclidean distance between two samples.

                 ( Area in acres, Price in US$, Number of houses )
    Property 1   ( 10,            450,000,      4 )
    Property 2   ( 5,             300,000,      1 )

- Does Euclidean distance capture dissimilarity between samples?
Correlation distance
Example: Suppose that we want to cluster customers at a store for market segmentation.
- Samples are customers.
- Each variable corresponds to a specific product and measures the number of items bought by the customer during a year.
[Figure: number of items purchased during the year against product index, for three customers (observations 1-3)]
Correlation distance
- Euclidean distance would cluster all customers who purchase few things (orange and purple).
- Perhaps we want to cluster customers who purchase similar things (orange and teal).
- Then, the correlation distance may be a more appropriate measure of dissimilarity between samples.
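A minimal sketch of the correlation distance in R (assuming, as in the example, rows are customers and columns are products): hclust accepts any dissimilarity produced by as.dist, so 1 − correlation between samples works directly.

```r
# Minimal sketch: correlation distance between samples (rows).
set.seed(2)
x <- matrix(rpois(3 * 20, lambda = 5), nrow = 3)  # 3 customers, 20 products

d.euclid <- dist(x)                 # Euclidean distance between rows
# cor() correlates columns, so transpose to correlate rows (customers):
d.cor    <- as.dist(1 - cor(t(x)))

hclust(d.cor, method = "average")
```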
Simple linear regression
[Figure 3.1: Sales against TV advertising budget, with the least squares fit]
The model is:
$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma)\ \text{i.i.d.} $$
The estimates β̂0 and β̂1 are chosen to minimize the residual sum of squares (RSS):
$$ \mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. $$
Estimates β̂0 and β̂1
A little calculus shows that the minimizers of the RSS are:
$$ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$
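A minimal sketch checking these formulas against R's lm on simulated data:

```r
# Minimal sketch: the closed-form estimates versus lm().
set.seed(3)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)
c(beta0.hat, beta1.hat)

coef(lm(y ~ x))   # identical estimates
```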
Assessing the accuracy of β̂0 and β̂1
[Figure 3.3: simulated data with the population regression line and least squares fits from repeated samples]
The Standard Errors for the parameters are:
$$ \mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right], \qquad \mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}. $$
The 95% confidence intervals:
$$ \hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0), \qquad \hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1). $$
Hypothesis test
H0: There is no relationship between X and Y, i.e. β1 = 0.
Ha: There is some relationship between X and Y, i.e. β1 ≠ 0.
Test statistic:
$$ t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}. $$
Under the null hypothesis, this has a t-distribution with n − 2 degrees of freedom.
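In R, summary() of a fitted model reports these standard errors, t-statistics, and p-values, and confint() gives the corresponding intervals (using the exact t quantile rather than the factor of 2). A minimal sketch on simulated data:

```r
# Minimal sketch: standard errors, t-statistics, and confidence intervals.
set.seed(4)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

summary(fit)$coefficients    # Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95)   # exact t-based version of beta-hat +/- 2*SE
```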
Interpreting the hypothesis test
- If we reject the null hypothesis, can we assume there is a linear relationship?
  - No. A quadratic relationship may be a better fit, for example.
- If we don't reject the null hypothesis, can we assume there is no relationship between X and Y?
  - No. This test is only powerful against certain monotone alternatives. There could be more complex non-linear relationships.
Multiple linear regression
[Figure 3.4: the least squares regression plane for Y as a function of X1 and X2]
$$ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)\ \text{i.i.d.} $$
or, in matrix notation:
$$ \mathbb{E}\,\mathbf{y} = \mathbf{X}\beta, $$
where y = (y1, . . . , yn)^T, β = (β0, . . . , βp)^T, and X is our usual data matrix with an extra column of ones on the left to account for the intercept.
Multiple linear regression answers several questions
- Is at least one of the variables Xi useful for predicting the outcome Y?
- Which subset of the predictors is most important?
- How good is a linear model for these data?
- Given a set of predictor values, what is a likely value for Y, and how accurate is this prediction?
The estimates β̂
Our goal again is to minimize the RSS:
$$ \mathrm{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i,1} - \cdots - \hat{\beta}_p x_{i,p})^2. $$
One can show that this is minimized by the vector β̂:
$$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. $$
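A minimal sketch verifying the matrix formula against lm on simulated data; note the column of ones prepended to form X:

```r
# Minimal sketch: least squares by the normal equations versus lm().
set.seed(5)
n <- 100
X0 <- matrix(rnorm(n * 3), n, 3)          # predictors
y  <- drop(1 + X0 %*% c(2, 0, -1) + rnorm(n))

X <- cbind(1, X0)                         # prepend the column of ones
beta.hat <- solve(t(X) %*% X, t(X) %*% y) # (X^T X)^{-1} X^T y
drop(beta.hat)

coef(lm(y ~ X0))                          # identical estimates
```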
Which variables are important?
Consider the hypothesis:
H0: The last q predictors have no relation with Y, i.e. βp−q+1 = βp−q+2 = · · · = βp = 0.
Let RSS0 be the residual sum of squares for the model which excludes these variables. The F-statistic is defined by:
$$ F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\mathrm{RSS}/(n - p - 1)}. $$
Under the null hypothesis, this has an F-distribution with q and n − p − 1 degrees of freedom.
Example: If q = p, we test whether any of the variables is important. In that case,
$$ \mathrm{RSS}_0 = \sum_{i=1}^n (y_i - \bar{y})^2. $$
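In R, this F-test is what anova() reports when comparing a reduced model to the full model. A minimal sketch on simulated data with p = 3 and q = 2:

```r
# Minimal sketch: the F-test for dropping the last q predictors.
set.seed(6)
n <- 100
X0 <- matrix(rnorm(n * 3), n, 3)
y  <- drop(1 + 2 * X0[, 1] + rnorm(n))

fit.full    <- lm(y ~ X0)           # all p = 3 predictors
fit.reduced <- lm(y ~ X0[, 1])      # excludes the last q = 2 predictors
anova(fit.reduced, fit.full)        # F = ((RSS0 - RSS)/q) / (RSS/(n - p - 1))
```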
Which variables are important?
A multiple linear regression in R has the following output:
[Table: R regression summary with coefficient estimates, standard errors, t-statistics, and p-values]
Which variables are important?
The t-statistic associated with the ith predictor is the square root of the F-statistic for the null hypothesis which sets only βi = 0.
A low p-value indicates that the predictor is important.
Warning: If there are many predictors, then even under the null hypothesis, some of the t-tests will have low p-values.
How many variables are important?
When we select a subset of the predictors, we have 2^p choices.
A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.
- Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step (see the sketch after this list).
- Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.
- Mixed selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
Choosing one model in the range produced is a form of tuning.
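A minimal sketch of forward selection using base R's step(). step() adds variables by AIC rather than RSS, but among candidate models of the same size the two criteria give the same ordering, so the variable added at each step is the same:

```r
# Minimal sketch: forward selection with step() on simulated data.
set.seed(7)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x3 + rnorm(n)

null.model <- lm(y ~ 1, data = dat)
step(null.model, scope = y ~ x1 + x2 + x3, direction = "forward")
```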
How good is the fit?
To assess the fit, we focus on the residuals.
- The RSS always decreases as we add more variables.
- The residual standard error (RSE) corrects this:
$$ \mathrm{RSE} = \sqrt{\frac{1}{n - p - 1}\,\mathrm{RSS}}. $$
- Visualizing the residuals can reveal phenomena that are not accounted for by the model; e.g. synergies or interactions:
[Figure: Sales as a function of TV and Radio, with the fitted regression surface]
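A minimal sketch of checking for such a synergy. It assumes a hypothetical data frame Advertising with columns Sales, TV, and Radio, as in the textbook's example; it is not bundled with base R:

```r
# Minimal sketch: residual diagnostics and an interaction (synergy) term.
# Assumes a data frame `Advertising` with columns Sales, TV, Radio.
fit.additive <- lm(Sales ~ TV + Radio, data = Advertising)
plot(fitted(fit.additive), resid(fit.additive))  # look for leftover structure

fit.synergy <- lm(Sales ~ TV * Radio, data = Advertising)  # adds TV:Radio
anova(fit.additive, fit.synergy)   # F-test for the interaction term
```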
How good are the predictions?
The function predict in R outputs predictions from a linear model.
Confidence intervals reflect the uncertainty in β̂.
Prediction intervals reflect uncertainty in β̂ and the irreducible error ε as well.
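A minimal sketch of both interval types on simulated data; note that the prediction intervals are wider:

```r
# Minimal sketch: confidence versus prediction intervals from predict().
set.seed(9)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 + 3 * dat$x + rnorm(100)
fit <- lm(y ~ x, data = dat)

new <- data.frame(x = c(-1, 0, 1))
predict(fit, newdata = new, interval = "confidence")  # uncertainty in beta-hat only
predict(fit, newdata = new, interval = "prediction")  # also includes epsilon
```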