Introduction to Linear regression analysis
Part 1 · 2020-01-25
Simple linear regression
• Two continuous variables:
– response (dependent) variable (Y)
– predictor (independent) variable (X)
– each recorded for n observations (replicates)
• Predictor variable “influences” response variable
• Does not necessarily demonstrate causality (depends on the design of the experiment or survey)
Scatterplot
[Scatterplot: CWD (coarse woody debris) basal area (0–200) against riparian tree density (0–2500)]
Linear regression
• Description:
– relationship between response (Y) and
predictor (X) variable
• Explanation:
– how much of variation in Y explained by
linear relationship with X
• Prediction:
– new Y-values from new X-values
– precision of those estimates
Regression model
yi = β0 + β1xi + εi
(CWD basal area)i = β0 + β1(tree density)i + εi
where:
• yi = value of Y for the ith observation
• xi = value of X for the ith observation
• β0 = population intercept (value of Y when X = 0)
• β1 = population slope (change in Y per unit change in X)
• εi = error term (variation in Y at each xi: the deviation of each yi from its predicted value)
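The model above can be fitted by least squares in a few lines. A minimal sketch (not part of the original slides), using the small X–Y data set worked through later in these notes rather than the CWD data, which is not tabulated here:

```python
import numpy as np

# Small data set from the worked slope example later in these notes.
x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

# Least squares estimates of slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # roughly -1.30 and 1.09
```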
Regression line
[Figure: regression line of Y against X, showing the intercept and the slope (change in Y per unit change in X)]
Regression model
yi = β0 + β1xi + εi
E(yi) = β0 + β1xi
where:
• E(yi) = expected value of yi
• εi (error term) measures the difference between yi and E(yi) at each xi

Sample regression equation
ŷi = b0 + b1xi
• ŷi = predicted Y-value for xi; estimates E(yi)
• b0 = sample intercept; estimates β0
• b1 = sample regression slope; estimates β1
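The distinction between population parameters (β0, β1) and sample estimates (b0, b1) can be illustrated by simulation. A sketch with made-up parameter values (β0 = 2.0, β1 = 0.5, σ = 1.0 are invented for illustration): data are generated from the population model and the sample estimates recover the parameters closely.

```python
import numpy as np

# Population model y_i = β0 + β1·x_i + ε_i with invented parameters.
rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 0.5
x = np.linspace(0, 10, 200)
y = beta0 + beta1 * x + rng.normal(0, 1.0, x.size)  # ε_i ~ N(0, σ²)

# Sample estimates b0, b1 from the simulated data.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
# b0 and b1 should lie close to β0 = 2.0 and β1 = 0.5
```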
Regression line
[Figure: fitted regression line of CWD basal area against riparian tree density (0–2500), with the slope and intercept marked; a second panel shows the mean ȳ with the interval ȳ ± t(s/√n)]
The logic of the assessment of regression models – what to compare to
If the data are normally distributed, then an unbiased estimate of the distribution of means can be obtained from ȳ and its standard error (SE).
[Figure: employment (thousands, with confidence interval) against year, from Longley.csv; x̄ = 1954.5, ȳ = 65,317]
Now let's assume that we think there may be a relationship between year and employment.
Question: does the mean (or some other estimator that does not include the relationship between y and x) fit the data better than an estimator that includes the effect of x?
[Figure: two plots of employment (thousands) against year, 1945–1965: one with the mean line (x̄ = 1954.5, ȳ = 65,317), one with the fitted regression line]
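The question above can be made concrete by comparing the unexplained variation left by each estimator. A sketch with invented year/employment numbers (not the actual Longley.csv values): the regression line always leaves no more residual variation than the mean alone, and far less when a real trend is present.

```python
import numpy as np

# Invented illustration data, loosely shaped like the Longley series.
year = np.array([1947, 1950, 1953, 1956, 1959, 1962], dtype=float)
emp = np.array([60300, 61900, 64000, 66200, 68100, 70500], dtype=float)

# Estimator 1: the mean of y (ignores x entirely).
ss_mean = np.sum((emp - emp.mean()) ** 2)

# Estimator 2: the least squares line (includes the effect of x).
b1 = np.sum((year - year.mean()) * (emp - emp.mean())) / np.sum((year - year.mean()) ** 2)
b0 = emp.mean() - b1 * year.mean()
ss_line = np.sum((emp - (b0 + b1 * year)) ** 2)

print(ss_line < ss_mean)  # True: the line leaves far less unexplained variation
```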
Analysis of variance in Y
Total variation (Sum of Squares) in Y: Σ(yi − ȳ)²
= variation in Y explained by regression (SSRegression): Σ(ŷi − ȳ)²
+ variation in Y unexplained by regression (SSResidual): Σ(yi − ŷi)²

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²

[Figure: least squares regression line through the points (xi, yi), showing for one observation the deviations (ŷi − ȳ), (yi − ŷi) and (yi − ȳ)]
Ordinary Least Squares (OLS)
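The partition of the total sum of squares holds exactly for an OLS fit, and is easy to check numerically. A sketch (not from the slides), using the small data set worked through later in these notes:

```python
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

# OLS fit.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # Σ(yi − ȳ)²
ss_reg = np.sum((yhat - y.mean()) ** 2)  # Σ(ŷi − ȳ)²
ss_res = np.sum((y - yhat) ** 2)         # Σ(yi − ŷi)²
assert np.isclose(ss_total, ss_reg + ss_res)  # the OLS decomposition holds
```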
Unexplained or residual variation: (yi − ŷi) small vs. (yi − ŷi) big
Explained variation: (ŷi − ȳ) small vs. (ŷi − ȳ) big
[Figure: four scatterplots contrasting small and large residual variation with small and large explained variation]
Analysis of variance

Source of variation   SS             df      Variance (= mean square)
Regression            SSRegression   1       SSRegression / 1
                                             (variation in Y explained by regression)
Residual              SSResidual     n − 2   SSResidual / (n − 2)
                                             (variation in Y unexplained by regression)

Why n − 2? Two parameters (the intercept and the slope) are estimated from the data, so two degrees of freedom are lost.
Analysis of variance
It follows that if:
Variation in Y explained by regression >> Variation in Y
unexplained by regression (MSRegression >> MSResidual)
Then:
Regression function contributes to the estimation of Y (slope β1 > 0, or β1 < 0)
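The mean squares above can be sketched numerically. An illustration (not from the slides), again using the small worked data set later in these notes, where n = 7 so dfResidual = n − 2 = 5:

```python
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ms_reg = np.sum((yhat - y.mean()) ** 2) / 1    # SSRegression / df = 1
ms_res = np.sum((y - yhat) ** 2) / (n - 2)     # SSResidual / (n − 2)
print(ms_reg > ms_res)  # True here: the regression explains most of the variation
```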
Slope = b1
[Figure: three scatterplots of y against x, with fitted lines]
• b1 > 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi
• b1 < 0: variation in Y explained by regression > 0; ŷi ≠ ȳ for most xi
• b1 = 0: variation in Y explained by regression = 0; ŷi ≈ ȳ for all xi
Null hypothesis
• Null hypothesis: β1 = 0
• F-ratio statistic = MSRegression / MSResidual
  – if H0 is true, the F-ratio follows an F distribution with dfRegression and dfResidual
• t-statistic = b1 / SE(b1)
  – if H0 is true, the t-statistic follows a t distribution with df = n − 2
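For simple regression the two tests are equivalent: the F-ratio equals the square of the slope's t-statistic. A sketch (not from the slides) checking this on the small worked data set from later in these notes:

```python
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)
n = x.size
sxx = np.sum((x - x.mean()) ** 2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ms_reg = np.sum((yhat - y.mean()) ** 2) / 1
ms_res = np.sum((y - yhat) ** 2) / (n - 2)

f_ratio = ms_reg / ms_res
se_b1 = np.sqrt(ms_res / sxx)   # standard error of the slope
t_stat = b1 / se_b1
assert np.isclose(f_ratio, t_stat ** 2)  # F = t² with df = 1 and n − 2
```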
Model comparisons
ANOVA for regression:
Total variation in Y (SSTotal)
= Variation explained by regression with X (SSRegression)
+ Residual variation (SSResidual)
Full model
yi = β0 + β1xi + εi
• Unexplained variation in Y from the full model = SSResidual

Reduced model (H0 true)
• Reduced model (H0: β1 = 0 true): yi = β0 + εi (mean and error)
• Unexplained variation in Y from the reduced model = SSTotal = Σ(yi − ȳ)²
Model comparison
• Difference in unexplained variation between
full and reduced models:
SSTotal - SSResidual
= SSRegression
• Variation explained by including b1 in model
Explained variation
• Proportion of variation in Y explained by
linear relationship with X
• Termed r², the coefficient of determination:
  r² = SSRegression / SSTotal
• r² is simply the square of the correlation coefficient (r) between X and Y.
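That identity is easy to verify numerically. A sketch (not from the slides) on the small worked data set from later in these notes:

```python
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
yhat = y.mean() + b1 * (x - x.mean())   # fitted values

# r² as SSRegression / SSTotal, and the Pearson correlation squared.
r2 = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r2, r ** 2)
```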
Which is the better model?
[Figure: scatterplots of Y1 against X and Y2 against X (X: 0–15000; Y: 0–500)]

Which is the better model?
[Scatterplot: Y1 against X with fitted regression line]
Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085
Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450
X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009
Residual 1.81383E+05 24 7557.643448
Which is the better model?
[Scatterplot: Y2 against X with fitted regression line]
Dep Var: Y2 N: 5 Multiple R: 0.978152 Squared multiple R: 0.956781
Adjusted squared multiple R: 0.942374 Standard error of estimate: 33.608617
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT -31.455158 32.524324 0.000000 . -0.96713 0.40482
X 0.033584 0.004121 0.978152 1.000000 8.14944 0.00386
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 7.50166E+04 1 7.50166E+04 66.413444 0.003864
Residual 3388.617362 3 1129.539121
Which is the better model?
[Figure: scatterplots of Y1 and Y2 against X with fitted lines]
Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551
Which is the better model?
[Figure: Y1 and Y2 against X with fitted lines and 95% confidence bands (for the slope)]
Y2: n = 5, P = 0.00386, r² = 0.942
Y1: n = 26, P = 0.000009, r² = 0.551
Assumptions
Normality
Y normally distributed at each value of X:
– Boxplot of y should be symmetrical - watch
out for outliers and skewness
– Transformations often help
Homogeneity of variance
Variance (spread) of Y should be constant
for each value of xi (homogeneity of
variance):
– usually very difficult to assess (for models with only one value of y per x)
[Figure: distributions of Y at x1 and x2, centred on the regression line ŷi = b0 + b1xi]
Homogeneity of variance
Variance (spread) of Y should be constant for
each value of xi (homogeneity of variance):
– usually very difficult to assess (for models with only one value of y per x)
– Spread of residuals should be even when
plotted against xi or predicted yi’s
– Transformations often help
– Transformations that improve normality of Y will
also usually make variance of Y more constant
Independence
Values of yi are independent of each
other:
– watch out for data that form a time series on the same experimental or sampling units
– should be considered at design stage
Linearity
For Linear regression: true population
relationship between Y and X is linear:
– scatterplot of Y against X
– watch out for asymptotic or exponential
patterns
– transformations of Y or Y and X often help
– Always look at residuals
EDA and regression diagnostics
• Check assumptions
• Check fit of model
• Warn about influential observations and
outliers
EDA
• Boxplots of Y (and X):
– check for normality, outliers etc.
• Scatterplot of Y and X:
– check for linearity, homogeneity of
variance, outliers etc.
Anscombe (1973) data set
[Figure: four scatterplots with very different patterns of points]
All four data sets give identical regression statistics: R² = 0.667, ŷ = 3.0 + 0.5x, t = 4.24, P = 0.002
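The first two of Anscombe's data sets make the point concretely: nearly identical fitted lines, completely different scatter (set II is strongly curved). A sketch (not part of the slides) fitting both; the data values are the published Anscombe (1973) sets I and II:

```python
import numpy as np

# Anscombe sets I and II share the same x values.
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(round(b0, 2), round(b1, 3))  # ≈ 3.0 and ≈ 0.5 for both sets
```

Identical summary statistics, so only a plot (or the residuals) reveals that a straight line is wrong for set II.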
Smoothers (for data exploration – especially useful for model fitting)
• Nonparametric description of the relationship between Y and X
  – unconstrained by a specific model structure
• Useful exploratory technique:
  – is a linear model appropriate?
  – are particular observations influential?
Smoothers
• Each observation is replaced by the mean or median of the surrounding observations
  – or by the predicted value of a regression model through the surrounding observations
• Surrounding observations fall within a window (or band)
  – the window covers a range along the X-axis
  – the size of the window (number of observations) is determined by a smoothing parameter

Smoothers
• Adjacent windows overlap
  – the resulting line is smooth
  – smoothness is controlled by the smoothing parameter (width of the window)
• Any section of the line is robust to extreme values in other windows
Types of smoothers (examples)
• Running (moving) means or medians:
  – means or medians within each window
• Lo(w)ess:
  – locally weighted regression scatterplot smoothing
  – observations within a window weighted differently
  – observations replaced by predicted values from the local regression line

Residuals – very useful for examining regression assumptions
• Difference between the observed value and the value predicted or fitted by the model
• Residual for each observation:
  – difference between the observed yi and the value of yi predicted by the linear regression equation: ei = yi − ŷi
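A running-mean smoother of the kind described above can be sketched in a few lines (an illustration, not code from the slides); each point is replaced by the mean of the observations inside a window of `k` neighbours along the X-axis:

```python
import numpy as np

def running_mean(x, y, k=3):
    """Running-mean smoother: window of up to k observations along X."""
    order = np.argsort(x)                      # windows run along the X-axis
    ys = np.asarray(y, dtype=float)[order]
    half = k // 2
    out = np.empty_like(ys)
    for i in range(ys.size):
        lo, hi = max(0, i - half), min(ys.size, i + half + 1)
        out[i] = ys[lo:hi].mean()              # mean within the window
    return np.asarray(x, dtype=float)[order], out

xs, smooth = running_mean([9, 11, 12, 13, 14, 16, 17],
                          [8, 12, 11, 14, 12, 17, 17], k=3)
```

Increasing `k` widens the window and gives a smoother (but flatter) line, which is exactly the role of the smoothing parameter described above.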
Studentised residuals
• residual / SE of residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions
Observations with a large residual (or studentised residual) are outliers from the fitted model.

Plot residuals against predicted ŷi
[Figure: residuals scattered evenly within ±SE around 0 across the predicted ŷi; scatterplot with an even spread of Y around the line]
• No pattern in residuals indicates the assumptions are OK
• Even spread of Y around the line
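One common form, the internally studentised residual, divides each residual by a standard error that accounts for that point's leverage. A sketch (an assumption about the exact variant the slides intend, which only say "residual / SE residuals"), on the small worked data set:

```python
import numpy as np

x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)
n = x.size
sxx = np.sum((x - x.mean()) ** 2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

ms_res = np.sum(resid ** 2) / (n - 2)
h = 1 / n + (x - x.mean()) ** 2 / sxx          # leverage of each x_i
student = resid / np.sqrt(ms_res * (1 - h))    # comparable across regressions
```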
[Figure: residuals against predicted ŷi fanning out in a wedge shape; scatterplot with an uneven spread of Y around the line]
• Increasing spread of residuals, i.e. a wedge shape
• Unequal variance in Y; uneven spread of Y around the line
• Skewed distribution of Y
• Transformation of Y helps

Other indicators
• Outliers
• Leverage
• Influence
Outliers
• Observations further from the fitted model than the remaining observations
  – might be different from sample outliers in boxplots
• Large residual
[Figure: scatterplot with one point far from the fitted line, marked as an outlier]
• Use a robust estimator.

Leverage
• How extreme an observation is for the X-variable
• Measures how much each xi influences the predicted ŷi
[Figure: scatterplot with one point at an extreme x-value, marked as large leverage]
Influence
• Cook’s D statistic:
  – incorporates leverage & residual
  – identifies observations with a large influence on the estimated slope
  – observations with D near or greater than 1 should be checked
[Figure: scatterplot with three marked observations]
• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual – an outlier
• Observation 3 is very influential (large Cook’s D) – also an outlier
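Leverage and Cook's D for simple regression can be sketched directly (an illustration, not from the slides). Here the last point is deliberately made extreme in both X and Y, so it resembles observation 3 above: large leverage and a discordant y-value, hence a dominant Cook's D:

```python
import numpy as np

# Worked data set with an invented extreme point appended in place of (17, 17).
x = np.array([9, 11, 12, 13, 14, 16, 30], dtype=float)   # x = 30: high leverage
y = np.array([8, 12, 11, 14, 12, 17, 5], dtype=float)    # y = 5: discordant
n, p = x.size, 2
sxx = np.sum((x - x.mean()) ** 2)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
ms_res = np.sum(resid ** 2) / (n - p)

h = 1 / n + (x - x.mean()) ** 2 / sxx                  # leverage
d = resid ** 2 / (p * ms_res) * h / (1 - h) ** 2       # Cook's D
print(d.argmax())  # the extreme point has by far the largest D
```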
Calculations of the Linear Regression equation
[Scatterplot: Y against X (both 5–20)]
33
Calculate mean x and y values
5 10 15 20
X
5
10
15
20
Y
x = 13.14
y = 13.00
Calculate deviations from mean x
and y values
5 10 15 20
X
5
10
15
20
Y
x = 13.14
y = 13.00}3
2.86
}
}
}-5
-4.14
+ +
+ -
- +
- -
x
y
![Page 34: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/34.jpg)
34
Calculate sum of xy deviation
cross-products
+ +
+ -
- +
- -
x
y
5 10 15 20
X
5
10
15
20
Y
x = 13.14
y = 13.00}3
2.86
}
}
}-5
-4.14
(xi - x ) (yi - y )
+
-
-
+
x
y
(xi - x ) (yi - y )(xi - x ) (yi - y )Deviations (x,y)
Calculate slope
(xi - x ) (yi - y )(xi - x )
2
Slope = = 1.09
X Y CPXY SSX
9 8 20.7143 17.1633
11 12 2.1429 4.5918
12 11 2.2857 1.3061
13 14 -0.1429 0.0204
14 12 -0.8571 0.7347
16 17 11.4286 8.1633
17 17 15.4286 14.8776
Mean 13.14286 13
sum 51.0000 46.8571
slope 1.0884
(xi - x ) 2
(xi - x ) (xi - x ) 2(xi - x ) (yi - y )(xi - x ) (yi - y )
0 5 10 15 20
X
-5
0
5
10
15
20
Y
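The table above can be reproduced in a few lines; a check of the worked example (the small rounding difference in the intercept, −1.30 vs. the slides' −1.32, comes from using the rounded slope 1.09 by hand):

```python
import numpy as np

# Data from the worked example above.
x = np.array([9, 11, 12, 13, 14, 16, 17], dtype=float)
y = np.array([8, 12, 11, 14, 12, 17, 17], dtype=float)

cpxy = np.sum((x - x.mean()) * (y - y.mean()))   # sum of cross-products: 51.0000
ssx = np.sum((x - x.mean()) ** 2)                # sum of squares of x: 46.8571
b1 = cpxy / ssx                                  # slope: 1.0884
b0 = y.mean() - b1 * x.mean()                    # intercept: ≈ −1.30
```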
Solve for intercept
ŷ = b0 + b1x
Rearrange: ȳ − b1x̄ = b0
where b0 = intercept and b1 = slope.
It can be shown that the least squares line passes through (x̄, ȳ), so:
13 = b0 + b1(13.14)
13 − 1.09(13.14) = b0
b0 = −1.32
[Scatterplot: Y against X with the fitted line crossing the y-axis at −1.32]

Slopes
Slope b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
[Figure: three scatterplots with the regions of cross-product (quadrant signs) marked]
• b1 > 0
• b1 < 0
• b1 = 0
![Page 36: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/36.jpg)
36
Regression line
X
Y
Intercept
Slope:
change in Y per unit
change in X
x1 x2 X
Y
Y1
Y2
b by iix +0 1
![Page 37: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/37.jpg)
37
Regression model
yi = b0 + b1xi + ei
E(yi) = b0 + b1xi
where:
• E(yi) = (yi)
• ei (error term) measures difference
between yi and (yi) at each xi
Sample regression equation
predicted Y-value for xi
estimates (yi)
sample intercept
estimates b0 sample regression slope
estimates b1
![Page 38: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/38.jpg)
38
Regression line
0 500 1000 1500 2000 2500
Riparian tree density
-100
0
100
200
CW
D b
asa
l a
rea
Slope
Intercept
0 500 1000 1500 2000 2500
Riparian tree density
-100
0
100
200
CW
D b
asa
l a
rea
-y ns /t( ) +y ns /t( )y
The logic of the assessment of
regression models – what to
compare to
If data are normally
distributed, then
unbiased estimate of
distribution of means
can be obtained from
y, (SE)
![Page 39: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/39.jpg)
39
TOTAL60000
62000
64000
66000
68000
70000
72000
Em
plo
ym
en
t (t
ho
usa
nd
s +
- C
I)
1945 1950 1955 1960 1965
Year
60000
62000
64000
66000
68000
70000
72000
Em
plo
ym
en
t (t
ho
usa
nds)
Now lets assume that we think there may be a
relationship between year and employment
5.1954
317,65
x
y
Longley.syd
Question: does the mean (or some other estimator that does
not include the relationship between y and x) fit the data
better than an estimator that includes the effect of x
5.1954
317,65
x
y
1945 1950 1955 1960 1965Year
60000
62000
64000
66000
68000
70000
72000
Em
plo
ym
ent
(thousands)
1945 1950 1955 1960 1965Year
60000
62000
64000
66000
68000
70000
72000
Em
plo
ym
ent
(thousands)
![Page 40: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/40.jpg)
40
Analysis of variance in Y
( )y yi -2
Total variation (Sum of Squares) in Y
Variation in Y explained
by regression
(SSRegression)
Variation in Y
unexplained by
regression (SSResidual)
Y
X
least squares
regression line
y
x
y i
yi
xi
y
})ˆ( i yy -}
)ˆ( ii yy -
)( i yy -}
222)ˆ()ˆ()( iiii yyyyyy -+-
Ordinary Least Squares (OLS)
![Page 41: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/41.jpg)
41
y yi i- small y yi i- big
Unexplained or residual variation
Explained variation
y yi - small
y
y yi - big
y
![Page 42: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/42.jpg)
42
Source of SS df Variance
variation (= mean square)
Regression 1 SSRegression / 1
Variation in Y explained by regression
Residual n-2 SSResidual / n-2
Variation in Y unexplained by regression
Analysis of variance
Why n-2?
Analysis of variance
It follows that if:
Variation in Y explained by regression >> Variation in Y
unexplained by regression (MSRegression >> MSResidual)
Then:
Regression function contributes to estimation of Y
(Slope = b1 > 0, or b1 < 0)
![Page 43: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/43.jpg)
43
> 0
x
y
xx
yy oo
oo
o
o o
oo
oo
oo
oo
o
o o
oo
oo
o
oo
o
o o
oo
oo
x
y
oo
oo
o
o o
oo
oo
x
y
xx
yy
oo
oo
o
o o
oo
oo
oo
oo
o
o o
oo
oo
o
oo
o
o o
oo
oo
o o
oo
o
o
o o o
x
y
xx
yyo o
oo
o
o
o o o
o o
oo
o
o
o o o
< 0
= 0
b1Slope =
Variation in Y explained
by regression
> 0
> 0
= 0
y yi For most xi
y yi For most xi
y yi For all xi~
Null hypothesis
• Null hypothesis: b1 = 0
• F-ratio statistic = MSRegression / MSResidual
– if H0 true, F-ratio follows F distribution with
dfRegression and dfResidual
• t-statistic = b1 / SE(b1)
– if H0 true, t-statistic follows t distribution
with df = n-2
![Page 44: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/44.jpg)
44
Model comparisons
ANOVA for regression
Total variation in Y
SSTotal
=
Variation explained by regression with X
SSRegression
+
Residual variation
SSResidual
![Page 45: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/45.jpg)
45
Full model
yi = b0 + b1xi + ei
• Unexplained variation in Y from full
model = SSResidual
Reduced model (H0 true)
• Reduced model (H0: b1 = 0 true):
yi = b0 + ei
• Unexplained variation in Y from reduced
model = SSTotal
(Mean and error)
( )y yi -2
![Page 46: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/46.jpg)
46
Model comparison
• Difference in unexplained variation between
full and reduced models:
SSTotal - SSResidual
= SSRegression
• Variation explained by including b1 in model
Explained variation
• Proportion of variation in Y explained by
linear relationship with X
• Termed r2, coefficient of determination:
SS Regression
SS Total
• r2 is simply square of correlation
coefficient (r) between X and Y.
![Page 47: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/47.jpg)
47
0 5000 10000 15000
X
0
100
200
300
400
500
Y1
0 5000 10000 15000
X
0
100
200
300
400
500
Y2
Which is the better model??
Which is the better model??
0 5000 10000 15000
X
0
100
200
300
400
500
Y1
Dep Var: Y1 N: 26 Multiple R: 0.754377 Squared multiple R: 0.569085
Adjusted squared multiple R: 0.551131 Standard error of estimate: 86.934708
Effect Coefficient Std Error Std Coef Tolerance t P(2 Tail)
CONSTANT 11.207815 30.277197 0.000000 . 0.37017 0.71450
X 0.026573 0.004720 0.754377 1.000000 5.62987 0.00001
Analysis of Variance
Source Sum-of-Squares df Mean-Square F-ratio P
Regression 2.39543E+05 1 2.39543E+05 31.695479 0.000009
Residual 1.81383E+05 24 7557.643448
![Page 48: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/48.jpg)
48
Which is the better model??
[Scatterplot of Y2 vs X with fitted regression line]
Dep Var: Y2   N: 5   Multiple R: 0.978152   Squared multiple R: 0.956781
Adjusted squared multiple R: 0.942374   Standard error of estimate: 33.608617

Effect     Coefficient   Std Error   Std Coef   Tolerance   t         P(2 Tail)
CONSTANT   -31.455158    32.524324   0.000000   .           -0.96713  0.40482
X          0.033584      0.004121    0.978152   1.000000    8.14944   0.00386

Analysis of Variance
Source       Sum-of-Squares   df   Mean-Square   F-ratio     P
Regression   7.50166E+04      1    7.50166E+04   66.413444   0.003864
Residual     3388.617362      3    1129.539121
Which is the better model??
[Scatterplots of Y1 and Y2 vs X with fitted lines]
Y2: n = 5, P = 0.00386, r2 = 0.942
Y1: n = 26, P = 0.000009, r2 = 0.551
![Page 49: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/49.jpg)
49
Which is the better model??
[Scatterplots of Y1 and Y2 vs X with fitted lines and 95% confidence bands (for slope)]
Y2: n = 5, P = 0.00386, r2 = 0.942
Y1: n = 26, P = 0.000009, r2 = 0.551
Assumptions
![Page 50: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/50.jpg)
50
Normality
Y normally distributed at each value of X:
– Boxplot of Y should be symmetrical - watch
out for outliers and skewness
– Transformations often help
Homogeneity of variance
Variance (spread) of Y should be constant
for each value of xi (homogeneity of
variance):
– Very difficult to assess usually (for
models with only one value of y per x).
![Page 51: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/51.jpg)
51
[Diagram: normal distributions of Y at x1 and x2, centred on the regression line ŷi = b0 + b1xi]
Homogeneity of variance
Variance (spread) of Y should be constant for
each value of xi (homogeneity of variance):
– Very difficult to assess usually (for models with
only one value of y per x).
– Spread of residuals should be even when
plotted against xi or predicted yi’s
– Transformations often help
– Transformations that improve normality of Y will
also usually make variance of Y more constant
![Page 52: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/52.jpg)
52
Independence
Values of yi are independent of each
other:
– watch out for data that form a time series
on the same experimental or sampling units
– should be considered at design stage
Linearity
For linear regression, the true population
relationship between Y and X must be linear:
– scatterplot of Y against X
– watch out for asymptotic or exponential
patterns
– transformations of Y or Y and X often help
– Always look at residuals
![Page 53: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/53.jpg)
53
EDA and regression diagnostics
• Check assumptions
• Check fit of model
• Warn about influential observations and
outliers
EDA
• Boxplots of Y (and X):
– check for normality, outliers etc.
• Scatterplot of Y and X:
– check for linearity, homogeneity of
variance, outliers etc.
![Page 54: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/54.jpg)
54
Anscombe (1973) data set
[Four scatterplots of the Anscombe quartet: identical summary statistics, very different patterns]
All four data sets: R2 = 0.667, y = 3.0 + 0.5*x, t = 4.24, P = 0.002
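The point is easy to reproduce from the published Anscombe values. Sets I and II share the fitted line and r2, yet set II is a smooth curve for which a straight line is clearly the wrong model:

```python
import numpy as np

# Anscombe (1973) quartet, sets I and II (the canonical published values)
x = np.array([10.0, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit(x, y):
    """Least-squares intercept, slope and r2 for simple linear regression."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return b0, b1, r2
```

Both `fit(x, y1)` and `fit(x, y2)` return b0 ≈ 3.0, b1 ≈ 0.5 and r2 ≈ 0.67, which is why the scatterplot, not the summary statistics, must decide whether the linear model is appropriate.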
![Page 55: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/55.jpg)
55
Smoothers (for data exploration;
especially useful before model fitting)
• Nonparametric description of
relationship between Y and X
– unconstrained by specific model structure
• Useful exploratory technique:
– is linear model appropriate?
– are particular observations influential?
![Page 56: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/56.jpg)
56
Smoothers
• Each observation replaced by the mean or median of surrounding observations
– or by the predicted value of a regression model through the surrounding observations
• Surrounding observations lie in a window (or band)
– covers a range along the X-axis
– window size (number of observations) set by the smoothing parameter
Smoothers
• Adjacent windows overlap
– resulting line is smooth
– smoothness controlled by smoothing
parameter (width of window)
• Any section of line robust to extreme
values in other windows
![Page 57: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/57.jpg)
57
Types of smoothers (examples)
• Running (moving) means or medians:
– mean or median of observations within each window
• Lo(w)ess:
– locally weighted regression scatterplot
smoothing
– observations within a window weighted
differently
– observations replaced by predicted values
from local regression line
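A running mean is the simplest case and fits in a few lines. A sketch (the function name and window convention are illustrative):

```python
import numpy as np

def running_mean(x, y, window=3):
    """Running-mean smoother: replace each y by the mean of the
    observations inside a window centred on it along the X-axis."""
    order = np.argsort(x)          # windows are defined along X
    xs, ys = x[order], y[order]
    half = window // 2
    smooth = np.array([ys[max(0, i - half): i + half + 1].mean()
                       for i in range(len(ys))])
    return xs, smooth
```

Loess works the same way window by window, but replaces each observation with the prediction from a weighted local regression rather than a plain mean, so it tracks trends within each window better.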
Residuals – very useful for examining
regression assumptions
• Difference between observed value and the
value predicted (fitted) by the model
• Residual for each observation:
– difference between observed yi and the value
predicted by the linear regression equation:
ei = yi - ŷi
![Page 58: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/58.jpg)
58
Studentised residuals
• residual / SE of residuals
• follow a t-distribution
• studentised residuals can be compared between different regressions
Observations with a large residual (or studentised residual) are outliers from the fitted model.
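A sketch of (internally) studentised residuals via the hat matrix, with made-up data in which the last observation is deliberately off the line:

```python
import numpy as np

# Made-up data; the last observation is deliberately far from the trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0, 12.1, 13.8, 30.0])

n = len(x)
X = np.column_stack([np.ones(n), x])            # design matrix [1, x]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages (hat values)
s2 = np.sum(resid ** 2) / (n - 2)               # residual mean square
stud = resid / np.sqrt(s2 * (1 - h))            # studentised residuals
```

The outlying observation ends up with by far the largest studentised residual, which is what makes these residuals useful for flagging outliers across different regressions.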
Plot residuals against predicted yi
[Residual plot: residuals scattered evenly between -se and +se around zero; scatterplot shows even spread of Y around the line]
• No pattern in residuals indicates assumptions OK
• Even spread of Y around the line
![Page 59: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/59.jpg)
59
[Residual plot: wedge-shaped spread widening with predicted yi; scatterplot shows uneven spread of Y around the line]
• Increasing spread of residuals, i.e. wedge shape
• Unequal variance in Y
• Skewed distribution of Y
• Transformation of Y helps
Other indicators
• Outliers
• Leverage
• Influence
![Page 60: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/60.jpg)
60
Outliers
• Observations further from the fitted model than remaining observations
– might be different from sample outliers in boxplots
• Large residual
[Scatterplot: one point labelled "outlier", far from the fitted line]
• Use a robust estimator.
Leverage
• How extreme an observation is for the X-variable
• Measures how much each xi influences the predicted ŷi
[Scatterplot: extreme-x point labelled "large leverage"]
![Page 61: Introduction to Linear regression analysis Part 1 · 2020-01-25 · 1 Introduction to Linear regression analysis Part 1 Simple linear regression • Two continuous variables: –response](https://reader034.vdocument.in/reader034/viewer/2022042712/5f9f497bc6264f79747e67f1/html5/thumbnails/61.jpg)
61
Influence
• Cook's D statistic:
– incorporates leverage and residual
– flags observations with large influence on the estimated slope
– observations with D near or greater than 1 should be checked
[Scatterplot with three labelled points illustrating the cases below]
• Observation 1 is an X and Y outlier but not influential
• Observation 2 has a large residual - an outlier
• Observation 3 is very influential (large Cook's D) - also an outlier
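Leverage and Cook's D can be computed together from the hat matrix. A sketch with made-up data in which the last observation is extreme in X and off the trend of the rest, i.e. influential (like observation 3 above):

```python
import numpy as np

# Made-up data: observation at x = 19 has high leverage AND a large
# residual relative to the trend of the rest, so it is influential
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 19.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 45.0])

n, p = len(x), 2                               # p = parameters (b0, b1)
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage of each observation
s2 = np.sum(resid ** 2) / (n - p)
stud = resid / np.sqrt(s2 * (1 - h))           # studentised residuals
cooks_d = (stud ** 2 / p) * (h / (1 - h))      # Cook's D combines both
```

The last observation gets both the largest leverage and the largest Cook's D; a high-leverage point that sits on the common trend (observation 1 above) would get large h but small D.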