Introduction to Linear Regression Analysis
Part 1
Simple linear regression
• Two continuous variables:
– response (dependent) variable (Y)
– predictor (independent) variable (X)
– each recorded for n observations (replicates)
• Predictor variable “influences” response variable
• Does not necessarily demonstrate causality
(depends on design of experiment or survey)
Scatterplot
[Figure: scatterplot of CWD (coarse woody debris) basal area (0–200) against riparian tree density (0–2500)]
Linear regression
• Description:
– relationship between response (Y) and
predictor (X) variables
• Explanation:
– how much of variation in Y explained by
linear relationship with X
• Prediction:
– new Y-values from new X-values
– precision of those estimates
Regression model
yi = β0 + β1xi + εi
(CWD basal area)i = β0 + β1(tree density)i + εi

yi  value of Y for the ith observation
xi  value of X for the ith observation
β0  population intercept (value of Y when X = 0)
β1  population slope (change in Y per unit change in X)
εi  error term (deviation of each yi from its predicted value; measures variation in Y at each xi)
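A minimal sketch (added here, not part of the original notes) of what this model means computationally, assuming Python with numpy; the parameter values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population parameters (made up for illustration)
beta0, beta1, sigma = 10.0, 0.05, 15.0   # intercept, slope, error SD

n = 30                                   # number of observations
x = rng.uniform(0, 2500, size=n)         # predictor, e.g. tree density
eps = rng.normal(0, sigma, size=n)       # error term eps_i
y = beta0 + beta1 * x + eps              # y_i = beta0 + beta1*x_i + eps_i
```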
Regression line
[Figure: straight line showing the intercept (value of Y at X = 0) and the slope (change in Y per unit change in X)]
Regression model
yi = β0 + β1xi + εi
E(yi) = β0 + β1xi
where:
• E(yi) = expected value of yi
• εi (error term) measures the difference between yi and E(yi) at each xi

Sample regression equation
ŷi = b0 + b1xi
• ŷi = predicted Y-value for xi; estimates E(yi)
• b0 = sample intercept; estimates β0
• b1 = sample regression slope; estimates β1
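As a sketch of how the sample estimates are obtained in practice (assuming scipy is available; the data are simulated as in the earlier sketch, not the CWD data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 2500, size=30)                   # simulated predictor
y = 10.0 + 0.05 * x + rng.normal(0, 15.0, size=30)  # simulated response

fit = stats.linregress(x, y)
b0, b1 = fit.intercept, fit.slope   # sample estimates of beta0, beta1
y_hat = b0 + b1 * x                 # predicted Y for each x_i: estimates E(y_i)
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")
```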
Regression line
[Figure: fitted regression line of CWD basal area against riparian tree density, with the intercept and slope marked]

The logic of the assessment of regression models – what to compare to
If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its SE, i.e. the interval ȳ ± t(s/√n).
[Figure: Longley.csv, employment (thousands, with confidence interval) against year, showing the mean ȳ = 65,317 at x̄ = 1954.5]

Now let's assume that we think there may be a relationship between year and employment.

Question: does the mean (or some other estimator that does not include the relationship between y and x) fit the data better than an estimator that includes the effect of x?

[Figure: employment (thousands) against year (1945–1965), fitted with the mean alone and with a regression line]
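A sketch of this comparison in Python; it assumes the Longley data are available via statsmodels (they ship with the package), where total employment is in column TOTEMP:

```python
import statsmodels.api as sm

# Longley data as bundled with statsmodels (assumed available)
data = sm.datasets.longley.load_pandas().data
year, employment = data["YEAR"], data["TOTEMP"]

# Estimator without x: the mean of y alone
ss_mean = ((employment - employment.mean()) ** 2).sum()

# Estimator including the effect of x: the regression on year
fit = sm.OLS(employment, sm.add_constant(year)).fit()
ss_line = (fit.resid ** 2).sum()

# The better-fitting estimator leaves less unexplained variation
print(f"SS around mean: {ss_mean:.0f}; SS around line: {ss_line:.0f}")
```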
Analysis of variance in Y
Total variation (Sum of Squares) in Y: SSTotal = Σ(yi − ȳ)²
= variation in Y explained by regression (SSRegression)
+ variation in Y unexplained by regression (SSResidual)

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

[Figure: least squares regression line through the data, showing for one observation the deviations (ŷi − ȳ), (yi − ŷi) and (yi − ȳ)]
Ordinary Least Squares (OLS)
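A minimal numpy sketch (simulated data, made-up parameters) verifying this decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 20)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 20)   # made-up example data

# OLS estimates from the normal equations
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)    # total variation in Y
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # explained by regression
ss_resid = np.sum((y - y_hat) ** 2)       # unexplained (residual)

# For an OLS fit with an intercept the identity holds exactly
assert np.isclose(ss_total, ss_reg + ss_resid)
```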
[Figure: four panels contrasting unexplained (residual) variation, (yi − ŷi), small vs. big, with explained variation, (ŷi − ȳ), small vs. big]
Analysis of variance

Source of variation   SS             df      Variance (= mean square)
Regression            SSRegression   1       SSRegression / 1
                      (variation in Y explained by regression)
Residual              SSResidual     n − 2   SSResidual / (n − 2)
                      (variation in Y unexplained by regression)

Why n − 2? Two parameters (the intercept and the slope) are estimated from the data.
Analysis of variance
It follows that if:
Variation in Y explained by regression >> Variation in Y
unexplained by regression (MSRegression >> MSResidual)
Then:
Regression function contributes to estimation of Y
(Slope: β1 > 0 or β1 < 0)
Slope = b1
[Figure: three sets of scatterplots]
• b1 > 0: ŷi ≠ ȳ for most xi, so variation in Y explained by regression > 0
• b1 < 0: ŷi ≠ ȳ for most xi, so variation in Y explained by regression > 0
• b1 = 0: ŷi ≈ ȳ for all xi, so variation in Y explained by regression = 0
Null hypothesis
• Null hypothesis: β1 = 0
• F-ratio statistic = MSRegression / MSResidual
– if H0 is true, the F-ratio follows an F distribution with dfRegression and dfResidual
• t-statistic = b1 / SE(b1)
– if H0 is true, the t-statistic follows a t distribution with df = n − 2
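A sketch computing both statistics (simulated data, scipy assumed); in simple regression the F-ratio is just the square of the t-statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 20
x = rng.uniform(0, 10, n)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, n)   # made-up example data

fit = stats.linregress(x, y)
t_stat = fit.slope / fit.stderr             # t = b1 / SE(b1), df = n - 2

y_hat = fit.intercept + fit.slope * x
ms_reg = np.sum((y_hat - y.mean()) ** 2) / 1      # df_Regression = 1
ms_resid = np.sum((y - y_hat) ** 2) / (n - 2)     # df_Residual = n - 2
f_ratio = ms_reg / ms_resid
p_value = stats.f.sf(f_ratio, 1, n - 2)           # upper-tail probability

assert np.isclose(f_ratio, t_stat ** 2)   # F = t^2 for simple regression
```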
Model comparisons
ANOVA for regression:
Total variation in Y (SSTotal) = variation explained by regression with X (SSRegression) + residual variation (SSResidual)
Full model
yi = β0 + β1xi + εi
• Unexplained variation in Y from the full model = SSResidual

Reduced model (H0 true)
• Reduced model (H0: β1 = 0 true): yi = β0 + εi (mean and error only)
• Unexplained variation in Y from the reduced model = SSTotal = Σ(yi − ȳ)²
Model comparison
• Difference in unexplained variation between full and reduced models:
SSTotal − SSResidual = SSRegression
• This is the variation explained by including β1 in the model

Explained variation
• Proportion of variation in Y explained by the linear relationship with X
• Termed r², the coefficient of determination:
r² = SSRegression / SSTotal
• r² is simply the square of the correlation coefficient (r) between X and Y.
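A short numpy check (simulated data) that the two routes to r² agree:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 25)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 25)   # made-up example data

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2_from_ss = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2   # square of correlation r

assert np.isclose(r2_from_ss, r2_from_corr)
```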
Which is the better model?
[Figure: scatterplots of Y1 against X (n = 26) and Y2 against X (n = 5), each with a fitted regression line]
Regression output for Y1:
Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085
Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450
X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009
Residual 1.81383E+05 24 7557.643448
Which is the better model?
[Figure: Y2 against X (n = 5) with a fitted regression line]
Regression output for Y2:
Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781
Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482
X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864
Residual 3388.617362 3 1129.539121
Which is the better model?
[Figure: Y1 against X: n = 26, P = 0.000009, r² = 0.551; Y2 against X: n = 5, P = 0.00386, r² = 0.942]
Which is the better model?
95% confidence bands (for slope)
[Figure: Y1 against X (n = 26, P = 0.000009, r² = 0.551) and Y2 against X (n = 5, P = 0.00386, r² = 0.942), each with 95% confidence bands around the fitted line]
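One way to compute such bands (a sketch assuming statsmodels; the data are simulated to loosely resemble the Y1 panel, not the actual values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 15000, 26))
y = 11.2 + 0.027 * x + rng.normal(0, 87.0, 26)   # loosely mimics Y1

fit = sm.OLS(y, sm.add_constant(x)).fit()

# 95% confidence band for the fitted line (mean response)
pred = fit.get_prediction(sm.add_constant(x))
lower, upper = pred.conf_int(alpha=0.05).T

# Narrow bands (large n, high r^2) indicate a more precise fitted line
```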
Assumptions
Normality
Y normally distributed at each value of X:
– boxplot of Y should be symmetrical; watch out for outliers and skewness
– transformations often help
Homogeneity of variance
Variance (spread) of Y should be constant for each value of xi:
– usually very difficult to assess (for models with only one value of Y per x)
– spread of residuals should be even when plotted against xi or the predicted ŷi
– transformations often help
– transformations that improve the normality of Y will usually also make the variance of Y more constant
[Figure: distributions of Y at x1 and x2, centred on the regression line E(yi) = β0 + β1xi, with equal spread]
Independence
Values of yi are independent of each other:
– watch out for data that form a time series on the same experimental or sampling units
– should be considered at the design stage

Linearity
For linear regression, the true population relationship between Y and X is linear:
– check a scatterplot of Y against X
– watch out for asymptotic or exponential patterns
– transformations of Y, or of Y and X, often help
– always look at the residuals
EDA and regression diagnostics
• Check assumptions
• Check fit of model
• Warn about influential observations and
outliers
EDA
• Boxplots of Y (and X):
– check for normality, outliers etc.
• Scatterplot of Y and X:
– check for linearity, homogeneity of
variance, outliers etc.
Anscombe (1973) data set
[Figure: four scatterplots with very different shapes that all give the same fitted regression]
For every one of the four data sets: R² = 0.667, ŷ = 3.0 + 0.5x, t = 4.24, P = 0.002
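The quartet is worth reproducing; a sketch with the first two published data sets (values from Anscombe 1973), showing near-identical fits despite very different scatter:

```python
import numpy as np
from scipy import stats

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])       # roughly linear
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])        # clearly curved

for name, y in [("set I", y1), ("set II", y2)]:
    fit = stats.linregress(x, y)
    # Both print b0 = 3.0, b1 = 0.5, r2 = 0.67: plot before you trust!
    print(f"{name}: b0 = {fit.intercept:.2f}, b1 = {fit.slope:.3f}, "
          f"r2 = {fit.rvalue ** 2:.2f}")
```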
Smoothers (limited or weighted data) – for data exploration, especially useful for model fitting
• Nonparametric description of the relationship between Y and X
– unconstrained by a specific model structure
• Useful exploratory technique:
– is a linear model appropriate?
– are particular observations influential?
Smoothers (limited or weighted data)
• Each observation is replaced by the mean or median of the surrounding observations
– or by the predicted value of a regression model fitted through the surrounding observations
• The surrounding observations fall in a window (or band)
– covering a range along the X-axis
– the size of the window (number of observations) is determined by the smoothing parameter
• Adjacent windows overlap
– the resulting line is smooth
– smoothness is controlled by the smoothing parameter (width of the window)
• Any section of the line is robust to extreme values in other windows
Types of smoothers (examples)
• Running (moving) means or medians:
– means or medians within each window
• Lo(w)ess:
– locally weighted regression scatterplot smoothing
– observations within a window are weighted differently
– observations are replaced by predicted values from the local regression line
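A sketch of both smoother types (assuming statsmodels for the lowess implementation; the data are simulated):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 50))
y = np.sin(x) + rng.normal(0, 0.3, 50)   # made-up nonlinear data

# Running mean: each value replaced by the mean of a 5-observation window
window = 5
running_mean = np.convolve(y, np.ones(window) / window, mode="valid")

# Lowess: frac is the smoothing parameter (fraction of observations
# falling in each window); returns columns [x, fitted y]
smoothed = lowess(y, x, frac=0.3)
```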
Residuals – very useful for examining regression assumptions
• Difference between the observed value and the value predicted (fitted) by the model
• Residual for each observation: the difference between the observed y and the value of y predicted by the linear regression equation:
ei = yi − ŷi
Studentised residuals
• residual / SE of the residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions
Observations with a large residual (or studentised residual) are outliers from the fitted model.
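A sketch of extracting studentised residuals with statsmodels (simulated data; the ±2 cut-off is a common rule of thumb, not from the notes):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 30)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 30)   # made-up example data

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

# Internally studentised residuals: residual / SE of the residuals
student_resid = influence.resid_studentized_internal

# Flag observations far from the fitted model (|r| > 2 as a rough guide)
outliers = np.where(np.abs(student_resid) > 2)[0]
```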
Plot residuals against predicted ŷi
[Figure: residuals scattered evenly within ±SE around zero; scatterplot with an even spread of Y around the line]
• No pattern in the residuals indicates the assumptions are OK
• Even spread of Y around the line
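A sketch of this diagnostic plot (matplotlib assumed; the data are simulated):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 40)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 40)   # made-up example data

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x
residuals = y - y_hat

# An even, patternless band around zero suggests the assumptions are OK
plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted y")
plt.ylabel("Residual")
plt.show()
```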
[Figure: residuals fanning out in a wedge against predicted ŷi; scatterplot with the spread of Y increasing along the line]
• Increasing spread of residuals (i.e. a wedge shape)
• Unequal variance in Y; uneven spread of Y around the line
• Skewed distribution of Y
• Transformation of Y helps
Other indicators
• Outliers
• Leverage
• Influence
Outliers
• Observations further from the fitted model than the remaining observations
– might be different from sample outliers in boxplots
• Large residual indicates an outlier; consider a robust estimator
Leverage
• How extreme an observation is for the X-variable
• Measures how much each xi influences the predicted ŷi
[Figure: a point far along the X-axis has large leverage]
Influence
• Cook's D statistic:
– incorporates leverage and residual
– identifies observations with a large influence on the estimated slope
– observations with D near or greater than 1 should be checked
[Figure: three labelled points on a scatterplot]
• Observation 1 is an X and Y outlier, but not influential
• Observation 2 has a large residual: an outlier
• Observation 3 is very influential (large Cook's D): also an outlier
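A sketch computing leverage and Cook's D with statsmodels; one deliberately influential point (high leverage plus a large residual, like observation 3) is appended to simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 30)
y = 3.0 + 0.5 * x + rng.normal(0, 1.0, 30)
x = np.append(x, 20.0)   # extreme on X (large leverage) ...
y = np.append(y, 2.0)    # ... and far from the line (large residual)

fit = sm.OLS(y, sm.add_constant(x)).fit()
influence = fit.get_influence()

leverage = influence.hat_matrix_diag     # extremeness on the X-variable
cooks_d, _ = influence.cooks_distance    # combines leverage and residual

# Check observations with D near or greater than 1
flagged = np.where(cooks_d > 1)[0]
print(flagged, cooks_d.max())
```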
Calculation of the linear regression equation
[Figure: scatterplot of the seven (x, y) points used in the worked example]
Calculate the mean x and y values
x̄ = 13.14, ȳ = 13.00

Calculate deviations from the mean x and y values
[Figure: deviations marked for example points, e.g. (xi − x̄) = 2.86 with (yi − ȳ) = 3, and (xi − x̄) = −4.14 with (yi − ȳ) = −5; the four quadrants around (x̄, ȳ) are labelled by the signs of the deviations: ++, +−, −+, −−]
Calculate the sum of the xy deviation cross-products: Σ(xi − x̄)(yi − ȳ)
[Figure: the cross-product (xi − x̄)(yi − ȳ) is positive in the ++ and −− quadrants around (x̄, ȳ), and negative in the +− and −+ quadrants]
Calculate the slope
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

   X      Y     (xi − x̄)(yi − ȳ)   (xi − x̄)²
   9      8         20.7143         17.1633
  11     12          2.1429          4.5918
  12     11          2.2857          1.3061
  13     14         −0.1429          0.0204
  14     12         −0.8571          0.7347
  16     17         11.4286          8.1633
  17     17         15.4286         14.8776
mean  13.14286  13
sum                 51.0000         46.8571

slope = 51.0000 / 46.8571 = 1.0884 ≈ 1.09
Solve for the intercept
It can be shown that the fitted line passes through (x̄, ȳ), so:
ȳ = b0 + b1x̄
Rearranging: b0 = ȳ − b1x̄
where b0 = intercept and b1 = slope:
13 = b0 + 1.09(13.14)
b0 = 13 − 1.09(13.14) = −1.32 (−1.30 using the unrounded slope 1.0884)
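The worked example can be verified in a few lines of numpy (a sketch; the data are the seven points from the table above):

```python
import numpy as np

# The seven observations from the worked example
x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

cross_products = (x - x.mean()) * (y - y.mean())   # CPXY column
ss_x = (x - x.mean()) ** 2                         # SSX column

b1 = cross_products.sum() / ss_x.sum()   # 51.0 / 46.8571 = 1.0884
b0 = y.mean() - b1 * x.mean()            # -1.30 (the slides round the
                                         # slope to 1.09, giving -1.32)
print(f"slope = {b1:.4f}, intercept = {b0:.2f}")
```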
Slopes
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
[Figure: regions of the cross-product around (x̄, ȳ) for three scatterplots]
• b1 > 0: most points lie in the ++ and −− quadrants (positive cross-products dominate)
• b1 < 0: most points lie in the +− and −+ quadrants (negative cross-products dominate)
• b1 = 0: points are spread evenly across all four quadrants (cross-products cancel)