- parameter point & interval estimates...1 a refresher in applied statistics model fitting -...
TRANSCRIPT
![Page 1: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/1.jpg)
1
A refresher in applied statistics
Model fitting
- parameter point & interval estimates
Simple and multiple linear regression
ANOVA and ANCOVA
Beate Sick
![Page 2: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/2.jpg)
We use R for performing statistical data analysis
Recommended environment: RStudio
Main reasons:
• open source
• powerful
• widespread
• reproducible
• transparent
2
![Page 3: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/3.jpg)
3
probability world ↔ data world
![Page 4: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/4.jpg)
Statistics connects data with models
Describe data: visualize & summarize
Statistical inference (inductive statistics, from data/sample to model):
- Model choice
- Parameter estimation
- Confidence intervals
- Tests
- Regression, ANOVA
Prediction (probability calculus, from model to data)
4
![Page 5: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/5.jpg)
We use a sample to learn about the population
Results from statistical inference are only valid if the sample is representative.
A sample is representative if it does not systematically differ from the population
(e.g. the percentages of males and females in the sample are similar to those in the population).
![Page 6: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/6.jpg)
The world of data and the world of models
data/reality ↔ model
sample ↔ population
discrete data/features (numeric or categorical) ↔ discrete random variable (numeric)
continuous data/features (numeric) ↔ continuous random variable (numeric)
observation ↔ random variable
relative frequency ↔ probability (P)
histogram (scaled) ↔ density of a continuous distribution
bar plot of relative frequencies at discrete features (scaled) ↔ probability distribution of a discrete distribution
average $\bar{x}$ ↔ expected value $\mu$
sample variance $s^2$ ↔ variance $\sigma^2$
![Page 7: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/7.jpg)
The expected value = population mean
• The expected value of a random variable is the average that we would get with an infinitely large sample.
• It measures the location of the random variable.
• It corresponds to the centre of mass of the density (see red line).
• It often determines the parameter of the model.
• The expected value can also be calculated from the density.
[Figure: density f(x) of an exponential distribution over x]
[Figure: probabilities P of the Poisson distribution Po($\lambda$=2.5) over my.x]
7
Discrete random variable: $$E(X) = \sum_{i=1}^{n} x_i \, P(X = x_i)$$
Continuous random variable: $$E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
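The two formulas for the expected value can be checked numerically. The following is a minimal sketch, assuming a Poisson distribution with $\lambda$ = 2.5 (as in the plot) and an exponential distribution with rate 4 (an assumed value); the truncation of the infinite sum at 100 is also an assumption.

```r
# Discrete case: E(X) = sum of x_i * P(X = x_i) for X ~ Po(lambda = 2.5).
# The support is infinite, so the sum is truncated at 100 (assumption:
# the remaining probability mass is negligible).
lambda <- 2.5
k <- 0:100
ev_pois <- sum(k * dpois(k, lambda))

# Continuous case: E(X) = integral of x * f(x) dx for X ~ Exp(rate = 4).
rate <- 4  # assumed rate, chosen for illustration
ev_exp <- integrate(function(x) x * dexp(x, rate = rate),
                    lower = 0, upper = Inf)$value

c(poisson = ev_pois, exponential = ev_exp)
```

The results should be close to the theoretical values $\lambda$ and $1/\lambda$ for the respective distribution.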
![Page 8: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/8.jpg)
8
The most famous discrete distributions/models
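| Name of the distribution | Possible values x | P(X=k) | Expected value $\mu$ | Variance $\sigma^2$ | Application |
|---|---|---|---|---|---|
| Bernoulli, $X \sim \text{Bern}(p)$ | {0,1} | $P(X=1) = p$, $P(X=0) = 1-p$ | $\mu = E(X) = p$ | $\sigma^2 = Var(X) = p(1-p)$ | X indicates whether an event occurs or not |
| Binomial, $X \sim B(n,p)$ | {0,1,…,n} | $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$ | $\mu = E(X) = n p$ | $\sigma^2 = Var(X) = n p (1-p)$ | X: number of successes in n independent Bernoulli trials |
| Poisson, $X \sim \text{Po}(\lambda)$ | {0,1,...} | $P(X=k) = \frac{\lambda^k}{k!} e^{-\lambda}$ | $\mu = E(X) = \lambda$ | $\sigma^2 = Var(X) = \lambda$ | X: number of events in a certain interval or time bin |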
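The moments of the discrete distributions can be verified directly from their probability functions. This is a small sketch for the binomial case; the values n = 10 and p = 0.3 are hypothetical and chosen only for illustration.

```r
n <- 10; p <- 0.3            # hypothetical parameter values
k <- 0:n

pk <- dbinom(k, size = n, prob = p)              # P(X = k) from R
manual <- choose(n, k) * p^k * (1 - p)^(n - k)   # binomial formula by hand

mu <- sum(k * pk)                # expected value from the definition
sigma2 <- sum((k - mu)^2 * pk)   # variance from the definition

c(mean = mu, n_times_p = n * p,
  var = sigma2, n_p_1_minus_p = n * p * (1 - p))
```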
![Page 9: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/9.jpg)
9
The most important continuous distributions/models
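| Name of the distribution (parameter) | Domain | Density f | Distribution function F | Expected value | Variance | Application |
|---|---|---|---|---|---|---|
| Uniform, $X \sim U(a,b)$ | $\mathbb{R}$ | $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, otherwise $f(x) = 0$ | $F(x) = \frac{x-a}{b-a}$ for $a \le x \le b$ | $E(X) = \frac{a+b}{2}$ | $Var(X) = \frac{(b-a)^2}{12}$ | if all events have the same probability, or if the probability is not known at all |
| Exponential, $X \sim \text{Exp}(\lambda)$ | $\mathbb{R}_0^+$ | $f(x) = \lambda e^{-\lambda x}$ | $F(x) = 1 - e^{-\lambda x}$ | $E(X) = \frac{1}{\lambda}$ | $Var(X) = \frac{1}{\lambda^2}$ | waiting times, time to failure |
| Normal, $X \sim N(\mu,\sigma^2)$ | $\mathbb{R}$ | $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $F(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x'-\mu)^2}{2\sigma^2}} dx'$ | $E(X) = \mu$ | $Var(X) = \sigma^2$ | typical measurements (affected symmetrically by various factors), asymptotic approximation for other distributions |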
![Page 10: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/10.jpg)
10
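Parameter estimation for the most important distributions

| Distribution family V, X~V(parameter set) | Relation parameter - E(X) - Var(X) | Parameter estimator as function of the data |
|---|---|---|
| Normal, $X \sim N(\mu,\sigma^2)$ | $E(X) = \mu$, $Var(X) = \sigma^2$ | $\hat\mu = \hat{E}(X) = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\hat\sigma^2 = \widehat{var}(X) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ |
| Binomial, $X \sim B(n,p)$ | $E(X) = n p$, $Var(X) = n p (1-p)$ | $\hat{p} = \bar{x}/n$ (average number of successes divided by the n trials) |
| Poisson, $X \sim \text{Po}(\lambda)$ | $E(X) = \lambda$, $Var(X) = \lambda$ | $\hat\lambda = \hat{E}(X) = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ |
| Exponential, $X \sim \text{Exp}(\lambda)$ | $E(X) = \frac{1}{\lambda}$, $Var(X) = \frac{1}{\lambda^2}$ | $\hat\lambda = \frac{1}{\hat{E}(X)} = \frac{1}{\bar{x}}$ |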
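The estimators above can be tried out on simulated data. The following is a minimal sketch; all "true" parameter values and the sample size of 200 are assumptions made for illustration, not values from the slides.

```r
set.seed(1)
m <- 200                                   # assumed sample size
x_norm <- rnorm(m, mean = 5, sd = 2)       # N(mu = 5, sigma^2 = 4)
x_bin  <- rbinom(m, size = 10, prob = 0.3) # B(n = 10, p = 0.3)
x_pois <- rpois(m, lambda = 2.5)           # Po(lambda = 2.5)
x_exp  <- rexp(m, rate = 4)                # Exp(lambda = 4)

est <- c(
  mu_hat          = mean(x_norm),
  sigma2_hat      = var(x_norm),           # var() uses the 1/(n-1) divisor
  p_hat           = mean(x_bin) / 10,      # average successes divided by the 10 trials
  lambda_hat_pois = mean(x_pois),
  lambda_hat_exp  = 1 / mean(x_exp)
)
round(est, 3)
```

Each estimate should land near the parameter used in the simulation, with some sampling variation.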
![Page 11: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/11.jpg)
11
The probability density function (pdf)
[Figure: density function of an exponential distribution, $f(x) = \lambda e^{-\lambda x}$; x-axis: x (waiting times), y-axis: density; the interval from a = 0.3 to b = 0.6 is marked]
$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a), \qquad \text{here: } P(0.3 \le X \le 0.6) = F(0.6) - F(0.3)$$
The probability of getting a result between a and b is equal to the area
under the density function above the interval [a,b]. The probability is
calculated by integrating the density function over the interval [a,b].
pexp(0.6,rate=4) - pexp(0.3,rate=4)
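As a cross-check, the same probability can be obtained in three ways: from the distribution function, by numerical integration of the density, and from the closed-form expression $e^{-\lambda a} - e^{-\lambda b}$. This sketch uses the slide's values (rate 4, a = 0.3, b = 0.6).

```r
rate <- 4; a <- 0.3; b <- 0.6

p_cdf     <- pexp(b, rate = rate) - pexp(a, rate = rate)           # F(b) - F(a)
p_int     <- integrate(dexp, lower = a, upper = b, rate = rate)$value  # area under f
p_formula <- exp(-rate * a) - exp(-rate * b)                       # closed form

c(cdf = p_cdf, integral = p_int, formula = p_formula)
```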
![Page 12: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/12.jpg)
12
Data vary! That’s why statisticians can find jobs ;-)
![Page 13: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/13.jpg)
If I measured everyone.
A random sample with n=30
Example: height of 12-year-old New Zealand schoolgirls
Where is the center of the population?
13
![Page 14: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/14.jpg)
14
Visualize boxplot with memory
Source: http://www.stat.auckland.ac.nz/~wild/09.USCOTSTalk.html
Where is the center of the population?
![Page 15: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/15.jpg)
![Page 16: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/16.jpg)
![Page 17: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/17.jpg)
![Page 18: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/18.jpg)
![Page 19: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/19.jpg)
![Page 20: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/20.jpg)
Source: http://www.stat.auckland.ac.nz/~wild/09.USCOTSTalk.html
Where is the center of the population?
We get more certain with increasing sample size
20
![Page 21: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/21.jpg)
21
Confidence intervals
![Page 22: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/22.jpg)
22
How sure can I be about the true parameter value?
Goal:
We would like to determine from our sample/observations an interval
that covers the true parameter value with a probability of 95%.
Example: the notch of a boxplot, median +/- 1.58 IQR/sqrt(n), covers the
median «quite certainly».
boxplot(x,notch=TRUE)
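The notch limits can be reproduced by hand. This is a minimal sketch on simulated data (the sample below is an assumption, not the data shown on the slide); note that R's `boxplot.stats()` measures the box width between the hinges returned by `fivenum()`, which can differ slightly from `IQR()`.

```r
set.seed(42)
x <- rnorm(50, mean = 150, sd = 7)   # simulated sample, for illustration only
n <- length(x)

five <- fivenum(x)                   # min, lower hinge, median, upper hinge, max
box_width <- five[4] - five[2]       # hinge-based IQR used by boxplot()
notch <- five[3] + c(-1.58, 1.58) * box_width / sqrt(n)

notch
boxplot.stats(x)$conf                # notch limits as computed by R
```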
![Page 23: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/23.jpg)
23
Sample Variation & Central Limit Theorem
population
Sample with
sample-size n=10
Distribution of many
sample means
Animation: http://onlinestatbook.com/stat_sim/sampling_dist/index.html
Because of sampling variation, derived statistics such as the mean also
vary from sample to sample.
The sample mean is an unbiased
estimator of the population mean E(X).
CLT: The sample mean is approximately normally
distributed around the population mean,
and its variation decreases with
increasing sample size:
$$\bar{X} \overset{a}{\sim} N\left(\mu_x, \frac{\sigma_x^2}{n}\right) = N\left(E(X), \frac{Var(X)}{n}\right)$$
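The statement of the CLT can be illustrated with a small simulation. This sketch assumes a skewed population (exponential with rate 1, so E(X) = 1 and Var(X) = 1), sample size n = 10 as on the slide, and 5000 repeated samples (an arbitrary choice).

```r
set.seed(1)
n <- 10            # sample size, as on the slide
n_samples <- 5000  # number of repeated samples (assumption)

# Draw many samples from an Exp(1) population and keep each sample mean
means <- replicate(n_samples, mean(rexp(n, rate = 1)))

c(mean_of_means = mean(means),   # should be close to E(X) = 1
  sd_of_means   = sd(means),     # should be close to sqrt(Var(X) / n)
  theory_sd     = sqrt(1 / n))
```

A histogram of `means` (for example with `hist(means)`) would show a roughly bell-shaped distribution even though the population itself is strongly skewed.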
![Page 24: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/24.jpg)
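Construction of an exact CI for the expected value $\mu$
Assumption: $X_i$ i.i.d. $\sim N(\mu_x, \sigma_x^2)$ for $i \in \{1, 2, \dots, n\}$, hence $\bar{X} \sim N\left(\mu_x, \frac{\sigma_x^2}{n}\right)$.
The construction of the CI is based on the distribution of T under $H_0: \mu = \mu_x$.
Known $\sigma$:
$$T = \frac{\bar{X} - \mu_x}{\sigma_x / \sqrt{n}} \sim N(0,1)$$
Exact 95% CI for $\mu_x$: $\bar{X} \pm q^{z}_{0.975} \cdot \frac{\sigma_x}{\sqrt{n}}$
Estimated $\sigma$:
Test statistic or pivot: $$T = \frac{\bar{X} - \mu_x}{\hat\sigma_x / \sqrt{n}} \sim t_{df = n-1}$$
$$P\left(q^{t_{n-1}}_{\alpha/2} \le \frac{\bar{X} - \mu_x}{\hat\sigma_x / \sqrt{n}} \le q^{t_{n-1}}_{1-\alpha/2}\right) = 1 - \alpha$$
$$P\left(\bar{X} - q^{t_{n-1}}_{1-\alpha/2} \cdot \frac{\hat\sigma_x}{\sqrt{n}} \le \mu_x \le \bar{X} + q^{t_{n-1}}_{1-\alpha/2} \cdot \frac{\hat\sigma_x}{\sqrt{n}}\right) = 1 - \alpha$$
Exact 95% CI for $\mu_x$: $\bar{X} \pm q^{t_{n-1}}_{0.975} \cdot \frac{\hat\sigma_x}{\sqrt{n}}$
[Figure: distribution of T under H0 with central area $1-\alpha$ between the quantiles $q^{t_{n-1}}_{\alpha/2}$ and $q^{t_{n-1}}_{1-\alpha/2}$, and area $\alpha/2$ in each tail]
24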
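The t-based interval for the case of an estimated standard deviation can be computed by hand and compared with `t.test()`. A minimal sketch on a simulated sample (the data are an assumption for illustration):

```r
set.seed(7)
x <- rnorm(30, mean = 150, sd = 7)   # hypothetical sample
n <- length(x)

se <- sd(x) / sqrt(n)                # estimated standard error of the mean
q  <- qt(0.975, df = n - 1)          # 97.5% quantile of the t distribution
ci_manual <- mean(x) + c(-1, 1) * q * se

ci_ttest <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(manual = ci_manual, t.test = ci_ttest)
```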
![Page 25: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/25.jpg)
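Construction of an approximate CI for $\mu$
University of Zurich, Department of Biostatistics
Central Limit Theorem: for $X_i$ i.i.d., $i = 1, \dots, n$, with $n$ larger than about 25, $E(X) = \mu_x$ and $Var(X) = \sigma_x^2$:
$$\bar{X} \overset{a}{\sim} N\left(\mu_x, \frac{\sigma_x^2}{n}\right)$$
The construction of the CI is based on the distribution of T under $H_0: \mu = \mu_x$.
Test statistic or pivot: $$T = \frac{\bar{X} - \mu_x}{\hat\sigma_x / \sqrt{n}} \overset{a}{\sim} N(0,1)$$
$$P\left(z_{\alpha/2} \le \frac{\bar{X} - \mu_x}{\hat\sigma_x / \sqrt{n}} \le z_{1-\alpha/2}\right) \approx 1 - \alpha$$
$$P\left(\bar{X} - z_{1-\alpha/2} \cdot \frac{\hat\sigma_x}{\sqrt{n}} \le \mu_x \le \bar{X} + z_{1-\alpha/2} \cdot \frac{\hat\sigma_x}{\sqrt{n}}\right) \approx 1 - \alpha$$
Approx. 95% CI for $\mu_x$: $\bar{X} \pm z_{0.975} \cdot \frac{\hat\sigma_x}{\sqrt{n}} = \bar{X} \pm 1.96 \cdot \frac{\hat\sigma_x}{\sqrt{n}}$, where $\frac{\hat\sigma_x}{\sqrt{n}}$ is the standard error $se(\bar{x})$
[Figure: distribution of T under H0 with central area $1-\alpha$ between $z_{\alpha/2}$ and $z_{1-\alpha/2}$, and area $\alpha/2$ in each tail]
25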
![Page 26: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/26.jpg)
26
The CI is as random as the sample
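Each CI is computed as $\bar{x} \pm 2 \cdot \frac{sd(x)}{\sqrt{n}}$; the true value is $\mu = 5$.
In the simulation shown, 95 out of 100 95% CIs for $\mu$ cover the true population parameter $\mu = 5$;
the 100 random samples were drawn from a population following $N(\mu = 5, \sigma^2)$.
With a 95% CI we accept a risk of 5% that the true population parameter is not
contained in the CI.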
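The coverage property can be simulated. The sketch below assumes $\sigma$ = 2, n = 20 and 1000 repetitions (none of these values are given on the slide) and uses the t-based 95% CI.

```r
set.seed(123)
mu <- 5
sigma <- 2; n <- 20; n_sim <- 1000   # assumed values, not from the slide

covers <- replicate(n_sim, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  ci[1] <= mu && mu <= ci[2]         # TRUE if this CI covers the true mean
})

coverage <- mean(covers)             # fraction of CIs covering mu
coverage
```

The observed coverage should be near 0.95, but it varies from run to run because each CI is as random as its sample.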
![Page 27: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/27.jpg)
27
Source: The Cartoon Guide to Statistics, Larry Gonick and Woollcott Smith
![Page 28: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/28.jpg)
For what purpose do we develop a statistical model?
• Description:
Describe data by a statistical model.
This is done well with a conventional statistical model.
• Explanation:
Search for the “true” model to understand and
causally explain the relationships between
variables and to plan interventions.
Difficult with observational data; in medicine we run randomized controlled trials (RCTs) to learn about causal effects.
• Prediction:
Use the model to make reliable predictions.
To evaluate and tune a good prediction model, it is best to work with training, validation and test data.
28
![Page 29: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/29.jpg)
29
Simple linear regression
1 explanatory variable
![Page 30: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/30.jpg)
30
Example:
In our first example we investigate how the height of trees
depends on the pH level of the soil.
In India, it was observed that alkaline soil hampers plant growth. This gave rise
to a search for tree species that show high tolerance to high pH values.
An outdoor trial was performed, in which 120 trees of a particular species were
planted on a large field with varying pH values.
After 3 years of growth, every tree's height was measured. Additionally, the pH
value of the soil in the vicinity of each tree was determined and recorded.
Simple Linear Regression
![Page 31: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/31.jpg)
31
Scatterplot: Tree Height vs. pH-value
[Figure: scatterplot "Tree Height vs. pH-Value" (x-axis: phvalue, y-axis: height) with ph=7.9 marked]
Which height would we expect at ph=7.9?
![Page 32: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/32.jpg)
32
[Figure: three panels "Tree Height vs. pH-Value" (x-axis: phvalue, y-axis: height), each with a different fitted model]
Systematic Relation: What is a good model?
What is a good model for the relation between pH-value and tree height?
The first model fits the training data perfectly but probably overfits the data.
To evaluate the performance of a model we can use cross-validation: leave out
each data point in turn, fit the model to the remaining data and
use the model to predict the left-out value. The best model is the one that produces the
best predictions on new or left-out data points.
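The leave-one-out procedure described above can be sketched as follows. The data are simulated as a stand-in for the tree data (which are not reproduced here), and the helper name `loocv_mse` and the degree-10 polynomial used as an overly flexible competitor are assumptions made for illustration.

```r
set.seed(1)
# Simulated stand-in for the tree data (assumption, not the original data set)
n <- 60
ph <- runif(n, 7.2, 8.8)
height <- 28.7 - 3 * ph + rnorm(n, sd = 1)
dat <- data.frame(ph = ph, height = height)

# Leave-one-out cross-validation: refit without point i, then predict point i
loocv_mse <- function(formula, data) {
  errs <- sapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    data$height[i] - predict(fit, newdata = data[i, , drop = FALSE])
  })
  mean(errs^2)   # mean squared prediction error on the left-out points
}

mse_linear <- loocv_mse(height ~ ph, dat)
mse_poly   <- loocv_mse(height ~ poly(ph, 10), dat)  # deliberately flexible model

c(linear = mse_linear, poly10 = mse_poly)
```

The model with the smaller cross-validated error is the one that would be preferred for prediction.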
![Page 33: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/33.jpg)
33
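Reading the R-summary of a linear model
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797,
Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF,
p-value: < 2.2e-16
How to read the output:
• Estimate: the estimated intercept and slope, giving the fitted line $\hat{y}_i = 28.7 - 3 \cdot x_i$.
• Std. Error: the standard error $se(\hat\beta)$ of each estimated coefficient $\hat\beta$.
• t-value: the test value $t = \frac{\hat\beta - 0}{se(\hat\beta)}$.
• Pr(>|t|): the p-value of the test of $H_0: \beta = 0$ for the respective coefficient.
• Residual standard error: $\hat\sigma$, the estimated standard deviation of the errors; degrees of freedom = #obs. - #(estimated parameters).
• R2: in simple regression equals corr(x,y)^2. Adj. R2: use in multiple regression.
• F-statistic: global test for the model (is the full model better than only using the intercept?).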
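The quantities in the summary can be recomputed from their definitions. This sketch uses simulated data with 123 observations as a stand-in (the slide's `treeheight` data set is not available here), so the numbers will differ from the output above.

```r
set.seed(2)
# Simulated stand-in data (assumption); 123 observations give 121 residual df
phvalue <- runif(123, 7.2, 8.8)
height  <- 28.7 - 3 * phvalue + rnorm(123, sd = 1)
fit <- lm(height ~ phvalue)
s <- summary(fit)

coefs <- coef(s)   # matrix with Estimate, Std. Error, t value, Pr(>|t|)

# t value = (estimate - 0) / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
# two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)
# degrees of freedom = #obs. - #(estimated parameters)
df_manual <- length(height) - length(coef(fit))
# R^2 in simple regression equals the squared correlation
r2_manual <- cor(phvalue, height)^2

c(sigma_hat = s$sigma, df = df_manual, r2 = r2_manual)
```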
![Page 34: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/34.jpg)
34
Linear regression: a traditional view as seen in many textbooks
The regression line will not run through all the data points.
Thus, there are random errors:
$$Y_i = a + b X_i + \varepsilon_i, \qquad \varepsilon_i \ \text{i.i.d.} \sim N(0, \sigma^2)$$
Here $a + b X_i$ is the systematic part of the model and $\varepsilon_i$ is the random part of the model (random errors).
Meaning of variables/parameters:
$y_i$ is the response variable (height) of observation $i$.
$x_i$ is the predictor variable (pH-value) of observation $i$.
$a, b$ are the regression coefficients. They are unknown
beforehand and need to be estimated from the data.
$\varepsilon_i$ is the residual or error, i.e. the random difference between
observation and regression line.
![Page 35: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/35.jpg)
Linear regression: a more general view
Model for the conditional probability distribution (CPD):
$$Y_{X_i} = (Y \mid X_i) \sim N(\mu_{x_i}, \sigma^2)$$
The predicted value of the linear regression
gives only one of the parameters of the CPD: $\mu_x$,
which depends on the predictor values. The
second parameter of the CPD ($\sigma^2$) is assumed
to be independent of the predictor values and
defines the variance of the error term $\varepsilon$.
$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i, \qquad \varepsilon_i \ \text{i.i.d.} \sim N(0, \sigma^2)$$
(i.i.d. = independent and identically distributed)
$$E(Y_{X_i}) = E(Y \mid X = x_i) = \mu_{x_i} = \beta_0 + \beta_1 x_{i1}$$
$$Var(Y_{X_i}) = Var(Y \mid X = x_i) = Var(\varepsilon_i) = \sigma^2$$
$Y \sim V$ (continuous, arbitrary): Y is continuous and can have an arbitrary marginal probability distribution (MPD).
[Figure: regression line $\mu_x$ with three conditional distributions (CPD) at different x values and the marginal distribution (MPD) of Y]
35
![Page 36: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/36.jpg)
36
We need to fit a straight line that
fits the data well.
Many possible solutions exist,
some are good, some are worse.
Our paradigm is to fit the line such
that the squared errors are
minimized.
Least Squares Fitting
We minimize the sum of squared residuals, with intercept $a$ and slope $b$:
$$\sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (a + b\,x_i)\right)^2 \rightarrow \min!$$
http://hspm.sph.sc.edu/courses/J716/demos/LeastSquares/LeastSquaresDemo.html
Remark: According to the Gauss-Markov theorem, the OLS (ordinary least squares) fitting procedure
yields the best linear unbiased estimators (BLUE) of the regression parameters.
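For simple linear regression the minimizer of the sum of squared residuals has a closed form, which can be compared with `lm()`. A minimal sketch on simulated data (an assumption; the helper name `rss` is hypothetical):

```r
set.seed(3)
x <- runif(50, 7.2, 8.8)                 # simulated predictor
y <- 28.7 - 3 * x + rnorm(50, sd = 1)    # simulated response

# Sum of squared residuals as a function of intercept par[1] and slope par[2]
rss <- function(par) sum((y - (par[1] + par[2] * x))^2)

# Closed-form least-squares solution
b_hat <- cov(x, y) / var(x)
a_hat <- mean(y) - b_hat * mean(x)

fit <- lm(y ~ x)
rbind(closed_form = c(a_hat, b_hat), lm = unname(coef(fit)))
```

Any other line, for example one with a slightly shifted intercept, gives a larger value of `rss()`.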
![Page 37: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/37.jpg)
37
Linear regression in R
> summary(fit)
Call: lm(formula = height ~ phvalue, data = treeheight)
Coefficients: Estimate Std. Error t-value Pr(>|t|)
(Intercept) 28.7227 2.2395 12.82 <2e-16 ***
phvalue -3.0034 0.2844 -10.56 <2e-16 ***
Residual stand. err.: 1.008 on 121 degrees of freedom
Multiple R-squared: 0.4797,
Adjusted R-squared: 0.4754
F-statistic: 111.5 on 1 and 121 DF,
p-value: < 2.2e-16
The rows give, for the intercept and the slope, the estimate $\hat\beta$, its standard error $se(\hat\beta)$, the test value $t = \frac{\hat\beta - 0}{se(\hat\beta)}$ and the p-value.
Fitted line: $\hat{y}_i = 28.7 - 3 \cdot x_i$
Multiple R-squared: R2
F-statistic: global test for the model (will see later)
![Page 38: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/38.jpg)
38
[Figure: scatterplot "Tree Height vs. pH-Value" (x-axis: pH-Value, y-axis: Tree Height) with the fitted regression line and ph=8 marked]
Least Squares Regression Model
$$\text{height}(ph) = 28.7 - 3 \cdot ph$$
$$\text{height}(8) = 28.7 - 24 = 4.7$$
![Page 39: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/39.jpg)
Confidence- and “prediction” intervals
The expected value of y at position x is covered by the confidence interval
with 95% certainty.
95% of all individual observations y (of the training data set) are
contained in the “prediction” interval.
39
[Figure: regression model $E(y)(x)$ in the x-y plane with the confidence interval band and the wider prediction interval band (lower and upper prediction limit)]
![Page 40: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/40.jpg)
Interesting intervals when doing regression
Model: $Y_i = a + b X_i + \varepsilon_i$
CI for the slope: $\hat{b} \pm q^{t_{n-2}}_{1-\alpha/2} \cdot se(\hat{b})$
confint(fit, parm=2, level=0.95)
2.5 % 97.5 %
x 124.4983 153.0250
CI for y-intercept: $\hat{a} \pm q^{t_{n-2}}_{1-\alpha/2} \cdot se(\hat{a})$
confint(fit, parm=1, level=0.95)
2.5 % 97.5 %
(Intercept) -1513786 -881578.2
CI for E(yk): $\hat{y}_k \pm q^{t_{n-2}}_{1-\alpha/2} \cdot se(\hat{y}_k)$
predict(fit, new, se.fit=TRUE, interval="confidence")
fit lwr upr
1855075 1829770 1880379
Prediction interval for yk: $\hat{y}_k \pm q^{t_{n-2}}_{1-\alpha/2} \cdot se(y_k)$
predict(fit, new, se.fit=TRUE, interval="prediction")
fit lwr upr
1855075 1728713 1981436
40
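The interval formulas can be checked against `confint()` and `predict()`. The sketch below uses simulated data and a hypothetical new predictor value x = 5; it does not reproduce the numbers printed on the slide.

```r
set.seed(4)
x <- runif(40, 0, 10)                 # simulated predictor (assumption)
y <- 2 + 0.5 * x + rnorm(40)          # simulated response (assumption)
fit <- lm(y ~ x)
n <- length(y)
q <- qt(0.975, df = n - 2)            # t quantile with n - 2 degrees of freedom

# CI for the slope by hand vs. confint()
b_hat <- unname(coef(fit)["x"])
se_b  <- coef(summary(fit))["x", "Std. Error"]
ci_slope_manual <- b_hat + c(-1, 1) * q * se_b
ci_slope <- as.numeric(confint(fit, parm = "x", level = 0.95))

# CI for the expected value and prediction interval at a new x
new <- data.frame(x = 5)              # hypothetical new observation
ci_mean <- predict(fit, new, interval = "confidence")
pi_new  <- predict(fit, new, interval = "prediction")

rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ])
```

The prediction interval is wider than the confidence interval because it also accounts for the variation of an individual observation around its expected value.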
![Page 41: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/41.jpg)
41
Residual Analysis = Checking the model assumptions
![Page 42: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/42.jpg)
Before we continue to look into the results, we need to check if the
modelling assumptions are met!
Why? Because otherwise we draw invalid conclusions from the results.
The assumption we made here is that the errors are $\varepsilon_i$ i.i.d. $\sim N(0, \sigma^2)$.
We use the observed residuals as estimates of the unobserved errors.
This implies four things for the residuals:
a) The expected value of $r_i$ is 0: $E(r_i) = 0$.
b) All $r_i$ have the same variance: $Var(r_i) = \hat\sigma^2$.
c) The $r_i$ are normally distributed.
d) The $r_i$ are independent of each other.
Assumptions of a linear regression model
42
![Page 43: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/43.jpg)
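Observed residuals serve as estimate for the error
[Figure: scatterplot (x-axis: independent variable x, y-axis: dependent variable y) with the fitted line $\hat\beta_0 + \hat\beta_1 x$, an observed point $(x_i, y_i)$, its fitted point $(x_i, \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i)$ and the residual $r_i = y_i - \hat{y}_i$ between them]
$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$$
$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1}$$
$$\varepsilon_i = y_i - (\beta_0 + \beta_1 x_{i1})$$
$$r_i = \hat\varepsilon_i = y_i - (\hat\beta_0 + \hat\beta_1 x_{i1}) = y_i - \hat{y}_i$$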
43
![Page 44: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/44.jpg)
Standardized and Studentized residuals
The standardized residual $\tilde{r}_i$ can be derived from the raw residual $r_i$ by dividing it by an estimate of its standard deviation:
$$\tilde{r}_i = \frac{r_i}{\hat\sigma_E \sqrt{1 - H_{ii}}}$$
where $\hat\sigma_E$ is the residual standard error and $H$ is the hat matrix.
With the same formula we get Studentized residuals if the estimate of the residual standard error $\hat\sigma_E$ is obtained by ignoring the $i$-th data point.
44
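Both kinds of residuals can be computed from the formula and compared with R's built-in `rstandard()` and `rstudent()`. A minimal sketch on simulated data (an assumption):

```r
set.seed(5)
x <- runif(40, 7.2, 8.8)             # simulated predictor
y <- 28.7 - 3 * x + rnorm(40)        # simulated response
fit <- lm(y ~ x)

r <- residuals(fit)                  # raw residuals r_i
h <- hatvalues(fit)                  # diagonal elements H_ii of the hat matrix
sigma_E <- summary(fit)$sigma        # residual standard error

# Standardized residuals: divide by the estimated standard deviation of r_i
r_std <- r / (sigma_E * sqrt(1 - h))

# Studentized residuals: same formula, but sigma is estimated without point i
sigma_loo <- influence(fit)$sigma
r_stud <- r / (sigma_loo * sqrt(1 - h))

head(cbind(raw = r, standardized = r_std, studentized = r_stud))
```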
![Page 45: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/45.jpg)
There are 4 "standard plots" in R:
- Residuals vs. Fitted, aka Tukey-Anscombe-Plot
- Normal Plot (uses standardized residuals)
- Scale-Location-Plot (uses standardized residuals)
- Leverage-Plot (uses standardized residuals)
In R: > plot(fit)
Model checking: residual analysis in R
45
![Page 46: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/46.jpg)
Model checking: residual analysis in R
46
![Page 47: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/47.jpg)
The residual vs. fitted plot, aka Tukey-Anscombe plot
The Tukey-Anscombe diagram plots the residuals against the fitted values and
is the most important model checking tool. This plot is ideal for checking whether
assumptions a) and b) (and partially d)) are met.
A perfect Tukey-Anscombe plot shows a horizontal smoother at height 0, around
which the residuals scatter with the same variance at each point.
47
![Page 48: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/48.jpg)
Source: http://www.uni-forst.gwdg.de/~dgaffre/elan/institut_fbi/skripten/statistica/v10/v10a.html
Examples of Tukey-Anscombe-Plots
[Figure: four example Tukey-Anscombe plots; y-axis: residuals]
48
![Page 49: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/49.jpg)
The Normal Plot
With the Normal Plot we check if the residuals show strong deviations from a Normal distribution. In a perfect Normal plot all points are close to a straight line.
49
![Page 50: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/50.jpg)
Draw data from a normal distribution and generate normal quantile-quantile plots
# normal Q-Q plot w/o CI
qqnorm(residuals(fit))
# normal Q-Q plot with CI
library(car)
qqPlot(residuals(fit), distribution="norm")
50
![Page 51: - parameter point & interval estimates...1 A refresher in applied statistics Model fitting - parameter point & interval estimates Simple and multiple linear regression ANOVA and ANCOVA](https://reader033.vdocument.in/reader033/viewer/2022051907/5ff9df7842bdea253c1e1e5d/html5/thumbnails/51.jpg)
The Scale-Location Plot
Here we plot $|\tilde{r}_i|$ vs. $\hat{y}_i$ and check for constant variance: the spread
of the absolute residuals should not change over the range of fitted values.
A perfect Scale-Location Plot shows a smooth horizontal line.
The Leverage Plot
With the Leverage Plot we check for influential points with large Cook’s distances.
A Leverage Plot without points beyond the dashed level curves is fine.
Points with a Cook's distance larger than 0.5 or 1 must be checked further.
What is meant by leverage points?
A high leverage of an observation $y_i$ means that the data point has extreme predictor values and has the potential to force the regression relation to strongly adapt to that data point. The leverage is simply given by the $i$-th diagonal element of the hat matrix $H$, since $H_{ii}\,\Delta y_i$ is the change in $\hat{y}_i$ if $y_i$ changes by $\Delta y_i$. The average leverage in a regression model with $p$ estimated coefficients and 1 intercept is given by $(p+1)/n$. We say a data point has high leverage if $H_{ii} \ge 2 \cdot (p+1)/n$.
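The leverage rule of thumb can be checked numerically. The following is a minimal sketch, assuming the built-in `mtcars` data and an arbitrary model `mpg ~ wt + hp` as stand-ins (neither appears in the slides):

```r
# Hypothetical example data: the built-in 'mtcars' data set (not from the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)           # number of observations
p <- length(coef(fit)) - 1  # number of estimated slopes (intercept excluded)

# Leverages = diagonal elements H_ii of the hat matrix
h <- hatvalues(fit)

# The leverages sum to p + 1, so their mean is (p + 1) / n
avg_leverage <- (p + 1) / n

# Rule of thumb from the text: high leverage if H_ii >= 2 * (p + 1) / n
high_leverage <- h[h >= 2 * avg_leverage]
print(avg_leverage)
print(high_leverage)
```

Observations listed in `high_leverage` are candidates for a closer look; whether they are actually influential also depends on their residuals.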
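What is Cook's Distance measuring?
With Cook's Distance we estimate the potential change in all the fitted values if the $i$-th data point is omitted from the analysis.
$$D_i = \frac{\sum_k \left(\hat{y}_k^{[-i]} - \hat{y}_k\right)^2}{(p+1)\,\hat{\sigma}_E^2} = \frac{H_{ii}}{1-H_{ii}} \cdot \frac{\tilde{r}_i^2}{p+1}$$
Here $\hat{y}_k^{[-i]}$ is the fitted value for observation $k$ when the $i$-th data point is left out, and $\tilde{r}_i$ is the standardized residual.
If $D_i \ge 0.5$, the $i$-th data point is called influential.
If $D_i \ge 1$, the $i$-th data point might be really dangerous.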
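Both forms of Cook's distance can be compared with R's built-in `cooks.distance()`. This is a sketch under the assumption that the built-in `mtcars` data and the model `mpg ~ wt + hp` serve as a stand-in example; they are not from the slides:

```r
# Hypothetical example data: built-in 'mtcars' (not from the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

h      <- hatvalues(fit)     # leverages H_ii
r_std  <- rstandard(fit)     # standardized residuals
n_coef <- length(coef(fit))  # p + 1 (slopes plus intercept)

# Cook's distance from the leverage / standardized-residual form
d_manual <- h / (1 - h) * r_std^2 / n_coef

# Built-in computation for comparison
d_builtin <- cooks.distance(fit)

# Leave-one-out form for a single observation (here i = 1)
i <- 1
fit_wo_i   <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
y_hat      <- fitted(fit)
y_hat_wo_i <- predict(fit_wo_i, newdata = mtcars)
sigma2     <- summary(fit)$sigma^2
d_loo      <- sum((y_hat_wo_i - y_hat)^2) / (n_coef * sigma2)

# Observations that the 0.5 threshold would flag as influential
print(d_builtin[d_builtin >= 0.5])
```

The leave-one-out version refits the model once per observation, so in practice the closed form based on $H_{ii}$ is the one that gets used.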
Plot residuals vs predictors
Residuals should not show structure when
plotted vs. fitted values or any of the predictors.
$\varepsilon_i \sim N(0, \sigma^2)$, i.i.d.
-> residual plots look o.k.
Tukey-Anscombe plot
What to do if the model
assumptions are violated?
Should we apply a transformation to outcome and predictor?
See the in-class exercises on transformations for the effect of applying
a non-linear transformation on the response or predictor variable.
How can a linear regression model be improved?
Improve the structural form of the model
- add missing predictors
- add interactions
- apply transformations to predictors (change the form of the relationship between y and predictor)
- apply a transformation to the outcome (to stabilize the residual variance & change the form)
Handle extreme values, leverage points, and outliers
- outliers and leverage points should be identified with diagnostic plots
- check if the usage of robust methods is necessary
- use transformations to make variable distributions less skewed
Make sure that observations within a group are independent
- unrecorded predictors or inhomogeneous population
- observations coming from a matched study are not independent
(analyze such data e.g. with mixed models)
- subjects may influence other subjects under study
Consider using another model
- e.g. Poisson regression in case of count data
First-Aid Transformations (variance-stabilizing transformations)
Always apply these (unless there are practical reasons against it)
to both response and predictors:
- Absolute values and concentrations: log transformation, $y \to \log(y)$
- Count data: square-root transformation, $y \to \sqrt{y}$
- Proportions: arcsine transformation, $y \to \arcsin(\sqrt{y})$
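In R the three first-aid transformations are one-liners. A minimal sketch with made-up example values (all variable names and numbers here are hypothetical):

```r
# Hypothetical example values (not from the slides)
concentration <- c(0.5, 2, 10, 150)   # positive amounts / concentrations
counts        <- c(0, 1, 4, 25)       # count data
proportion    <- c(0, 0.25, 0.5, 1)   # proportions in [0, 1]

log_conc    <- log(concentration)       # log transformation
sqrt_counts <- sqrt(counts)             # square-root transformation
asin_prop   <- asin(sqrt(proportion))   # arcsine transformation

print(log_conc)
print(sqrt_counts)
print(asin_prop)
```

Note that `log()` needs strictly positive values, so zeros in amounts or concentrations have to be handled before transforming.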
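Motivation for the partial residual plot
The partial residuals $r_{j,\text{partial}}$ give insight into the relationship between predictor $x_j$ and the “adjusted outcome”, which is corrected for the effect of all other predictors.
A perfect partial residual plot shows a linear relation between $r_{j,\text{partial}}$ and $x_j$.
Partial residuals are the residuals we get when omitting the term $\hat{\beta}_j x_j$ from the fitted model equation.
$$r_{j,\text{partial}} = y - \sum_{k \ne j} \hat{\beta}_k x_k = \hat{y} + r - \sum_{k \ne j} \hat{\beta}_k x_k = \hat{\beta}_j x_j + r$$
(The sums run over all terms except $j$, including the intercept.)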
Partial residual plots in R:
- library(car); crPlots(...)
- library(faraway); prplot(...)
- residuals(fit, type="partial")
[Figure: two panels, "Mortality vs. log(NOx)" (Mortality against log(NOx)) and "Partial Residual Plot for log(NOx)" (partial residuals for log(NOx) against log(NOx))]
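The partial residuals returned by `residuals(fit, type="partial")` can be reproduced by hand. The sketch below assumes the built-in `mtcars` data and the model `mpg ~ wt + hp` as a stand-in for the mortality example, which is not available here. R centers each term, so its partial residuals differ from $\hat{\beta}_j x_j + r$ only by a constant shift:

```r
# Hypothetical example data: built-in 'mtcars' (not the mortality data from the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Partial residuals as returned by R: a matrix with one column per predictor
pr <- residuals(fit, type = "partial")

# By hand for predictor 'wt': ordinary residual plus the fitted 'wt' term.
# R centers each term, so the term is beta_wt * (wt - mean(wt)).
pr_wt_manual <- residuals(fit) +
  coef(fit)[["wt"]] * (mtcars$wt - mean(mtcars$wt))

print(head(cbind(builtin = pr[, "wt"], manual = pr_wt_manual)))
```

Plotting `pr[, "wt"]` against `mtcars$wt` gives the partial residual plot for `wt`; the slope of a straight line fitted through these points equals the coefficient of `wt` in the full model.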
Marginal relationship ≠ relation in partial residual plot
The marginal plot of outcome vs predictor does not take into account the
influence of all other predictors on the outcome and is therefore not
appropriate if we are interested in the additional influence of a predictor on
the outcome given all other predictors are already in the model.
[Figure: two panels, "Mortality vs. log(NOx)" (marginal plot) and "Partial Residual Plot for log(NOx)"]
Partial residual plots help to find transformations for predictors
We want to check if the adjusted relation between an untransformed
predictor and the outcome is linear. Hence, we use the partial
residual plot, which only shows the "isolated" influence of that
predictor on the response.
The observed shape in the partial residual plot indicates if, and which,
transformation we should use for the selected predictor.
[Figure: three sketched partial residual plots of $r_{j,\text{partial}}$ vs. $x_j$, with the captions:]
- A quadratic transformation of $x_j$ might be appropriate.
- No transformation required.
- Predictor $x_j$ has no additional explanatory power when added to the model.
ANOVA = ANalysis Of VAriance
- Total sample variability: TSS
- Variability explained by the model: SSmodel
- Unexplained (or error) variability: RSS
The total variability splits into these two parts: TSS = SSmodel + RSS.
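The decomposition of the total variability can be verified directly from a fitted model. A minimal sketch, assuming the built-in `PlantGrowth` data (a response `weight` and a factor `group`) as a stand-in example that is not part of the slides:

```r
# Hypothetical example data: built-in 'PlantGrowth' (not from the slides)
fit <- lm(weight ~ group, data = PlantGrowth)
y   <- PlantGrowth$weight

tss      <- sum((y - mean(y))^2)            # total sample variability
ss_model <- sum((fitted(fit) - mean(y))^2)  # variability explained by the model
rss      <- sum(residuals(fit)^2)           # unexplained (error) variability

print(c(TSS = tss, SSmodel = ss_model, RSS = rss))
print(anova(fit))  # the "Sum Sq" column lists SSmodel and RSS
```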
Example with one factorial predictor
Do medical doctors spend less time with obese patients?
In an observational study, it was measured
how much time doctors spend with a patient.
Do medical doctors spend less time with obese patients?
How can we test this with linear regression and ANOVA?
An ANOVA with 1 factor with 2 levels is equivalent to a two-sample t-test (with pooled variance).
Normality check: passed
# note: t.test() defaults to the Welch test; var.equal=TRUE gives the
# pooled-variance test that matches the ANOVA exactly
t.test(TIME~WEIGHT, data=dat)
# t = 2.9, df = 67, p-value = 0.0057
# alternative hypothesis: true difference in
# means is not equal to 0
# 95 percent confidence interval:
# 2 11
# sample estimates:
# mean of x mean of y
# 31 25
# do it by regression with one factorial predictor:
fit=lm(TIME~WEIGHT, data=dat)
anova(fit)  # get anova-table from lm-object
# Response: TIME
#            Df Sum Sq Mean Sq F value Pr(>F)
# WEIGHT      1    776     776    8.16 0.0057 **
# Residuals  69   6561      95
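The equivalence between the pooled-variance t-test and the one-factor ANOVA can be seen on any two-group data set. The sketch below uses simulated data, since the study data are not available; `dat_sim`, `group` and `time` are hypothetical names and the numbers are arbitrary:

```r
# Simulated, hypothetical data (not the study data from the slides)
set.seed(1)
dat_sim <- data.frame(
  group = factor(rep(c("a", "b"), times = c(30, 40))),
  time  = c(rnorm(30, mean = 31, sd = 10), rnorm(40, mean = 25, sd = 10))
)

# Two-sample t-test with pooled variance (var.equal = TRUE)
tt <- t.test(time ~ group, data = dat_sim, var.equal = TRUE)

# Same comparison as a regression with one two-level factor
fit_sim <- lm(time ~ group, data = dat_sim)
tab     <- anova(fit_sim)

# The squared t statistic equals the F statistic, and the p-values agree
t_squared <- unname(tt$statistic)^2
f_value   <- tab[["F value"]][1]
print(c(t_squared = t_squared, F = f_value))
print(c(p_ttest = tt$p.value, p_anova = tab[["Pr(>F)"]][1]))
```

With the default Welch test (`var.equal = FALSE`) the degrees of freedom and the p-value generally differ slightly from the ANOVA.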
How to test for an effect between >2 groups?
Applying 1-way ANOVA with >2 levels
Here, we want to investigate whether three different treatments
result in different levels of the outcome: folate in red blood cells.
We can apply a regression with the group factor as predictor
to investigate this question, given that the folate values y in each
group are i.i.d. normally distributed (check not shown).
fit=lm(folate~group, data=dat)
anova(fit) # p=0.044
Since p<5%, we can conclude that there are differences,
i.e. the folate level is not the same in all groups.
Remark: If there is only 1 factor as predictor, like treatment group, we talk about
1-way ANOVA regardless of the number of groups.
The ANOVA is significant
Between which groups are the differences?
We can perform three pair-wise t-tests.
Only the t-test comparing group 1
versus 2 is significant.
We need to correct for multiple testing,
e.g. by Bonferroni correction. Here, this
correction leads to non-significance for
all 3 tests.
The significant ANOVA result only tells us that there are some differences.
We need to perform post-hoc tests to investigate between which groups
we can really find differences.
List of post-hoc tests (from wiki)
• Fisher's least significant difference: LSD
• Bonferroni correction
• Duncan's new multiple range test
• Friedman test
• Newman–Keuls method
• Scheffé's method
• Tukey's range test
• Dunnett's test
Result of (uncorrected) pair-wise t-tests:
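Uncorrected and Bonferroni-corrected pair-wise t-tests can be obtained with `pairwise.t.test()`. A sketch, assuming the built-in `PlantGrowth` data (three groups) as a stand-in for the folate data, which are not available here:

```r
# Hypothetical example data: built-in 'PlantGrowth' with 3 groups (not the folate data)
raw  <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "none")
bonf <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                        p.adjust.method = "bonferroni")

# p-values of the 3 pairwise comparisons (lower triangle of the matrix)
p_raw  <- raw$p.value[!is.na(raw$p.value)]
p_bonf <- bonf$p.value[!is.na(bonf$p.value)]

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_manual <- pmin(1, p_raw * length(p_raw))

print(raw$p.value)
print(bonf$p.value)
```

Other corrections from the list above are available as well, e.g. `TukeyHSD(aov(weight ~ group, data = PlantGrowth))` for Tukey's range test.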
Multiple linear regression: ≥2 explanatory variables
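Multiple linear regression: interpretation of coefficients
as used in descriptive modelling
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2)$$
$$\beta_k = \hat{y}\,\big|_{x_k + 1} - \hat{y}\,\big|_{x_k}$$
The coefficient $\beta_k$ gives the change of the outcome y, given the explanatory
variable $x_k$ is increased by one unit and all other variables are held constant.
In matrix notation:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
with
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$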
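The "one unit more, everything else held constant" reading of a coefficient can be illustrated with two predictions. A minimal sketch, assuming the built-in `mtcars` data and the model `mpg ~ wt + hp` as a stand-in example; `car_a` and `car_b` are hypothetical observations:

```r
# Hypothetical example data: built-in 'mtcars' (not from the slides)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only in 'wt', by exactly one unit
car_a <- data.frame(wt = 3, hp = 120)
car_b <- data.frame(wt = 4, hp = 120)

# Difference in predicted outcome when wt increases by 1 and hp is held constant
diff_pred <- predict(fit, newdata = car_b) - predict(fit, newdata = car_a)

print(unname(diff_pred))
print(unname(coef(fit)["wt"]))
```

The two printed values coincide: the difference between the predictions is exactly the estimated coefficient of `wt`.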
Modeling HDL as example of multiple regression:
            Estimate  Std. Error  t-value  Pr(>|t|)
Intercept    1.16448     0.28804     4.04    <.0001
AGE         -0.00092     0.00125    -0.74    0.4602
BMI         -0.01205     0.00295    -4.08    <.0001
BLC          0.05055     0.02215     2.28    0.0239
PRSSY       -0.00041     0.00044    -0.95    0.3436
DIAST        0.00255     0.00103     2.47    0.0147
GLUM        -0.00046     0.00018    -2.50    0.0135
SKINF        0.00147     0.00183     0.81    0.4221
LCHOL        0.31109     0.10936     2.84    0.0051
The predictors of log(HDL) are age, body mass index, blood vitamin C,
systolic and diastolic blood pressures, skinfold thickness, and the log of
total cholesterol. The equation is:
log(HDL) = 1.16 - 0.00092(Age) - 0.012(BMI) + … + 0.311(LCHOL)
The meanings of the coefficients in the HDL example
log(HDL) = 1.16 - 0.00092(Age) - 0.012(BMI) + … + 0.311(LCHOL)
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i$$
Interpretation of the coefficients on the previous slide:
The entire equation is needed for making predictions.
Each coefficient $\beta_j$ measures the difference in expected log(HDL) between 2
subjects if the factor $x_j$ differs by 1 unit between the two subjects, and if
all other factors are the same.
E.g., the expected log(HDL) is 0.012 lower in a subject whose BMI is 1 unit
greater than that of another subject with the same values on all other factors.
The meanings of the p-values and the coefficients
in the multiple linear regression output
The p-values measure the significance of the association of a factor with log(HDL)
in the presence of all other predictors of the model.
This is sometimes expressed as “after accounting for other factors” or “adjusting for
other factors”, and is called independent association.
SKINF alone probably is associated. However, its p=0.42 suggests that it provides little
additional information that helps to predict log(HDL), after accounting for other
factors such as BMI.
The p-value and also the coefficient value of a predictor depend in general not only on the
association with the outcome variable but also on the other predictors in the model.
Only if all predictors are uncorrelated does multiple regression lead to the same
coefficients as p simple regressions, each with only one predictor; the p-values can
still differ because the residual variance changes.
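How coefficients and p-values behave when predictors are uncorrelated can be tried out on a small constructed example. This is a sketch with hypothetical toy data (the design, the effect sizes 2 and 3, and the noise values are all made up): the coefficient of `x1` is the same in the simple and the multiple regression, while its p-value still changes.

```r
# Hypothetical toy data with two uncorrelated (orthogonal) predictors
x1    <- rep(c(-1, 1), each = 4)
x2    <- rep(c(-1, 1), times = 4)
noise <- c(0.3, -0.1, 0.2, -0.4, 0.1, 0.5, -0.3, -0.3)
y     <- 2 * x1 + 3 * x2 + noise

fit_simple   <- lm(y ~ x1)        # simple regression on x1 only
fit_multiple <- lm(y ~ x1 + x2)   # multiple regression

# With uncorrelated predictors the x1 coefficient is identical in both fits
print(coef(fit_simple)["x1"])
print(coef(fit_multiple)["x1"])

# The p-values of x1 still differ, because x2 reduces the residual variance
p_simple   <- summary(fit_simple)$coefficients["x1", "Pr(>|t|)"]
p_multiple <- summary(fit_multiple)$coefficients["x1", "Pr(>|t|)"]
print(c(simple = p_simple, multiple = p_multiple))
```

With correlated predictors, as in most observational data, the coefficient itself would change as well when other predictors enter the model.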
Significance vs. Relevance
The larger a sample, the smaller the p-value for the very
same predictor effect. Thus, do not confuse a small p-value
with an important predictor effect!
More important than p-values:
• Look at absolute values of (significant) coefficients.
• Look at confidence intervals!
ANCOVA = ANalysis of COVAriance
Linear Regression with continuous and factorial predictors
Output: hours: lifetime of a cutting tool
Predictor 1: rpm: speed of the machine in rpm
Predictor 2: tool: tool type A or B
fit1 <- lm(hours ~ rpm + tool, data=my.dat)
[Figure: "Durability of Lathe Cutting Tools", hours vs. rpm (500 to 1000), points labelled by tool type A or B]
We have an additive model: the difference between the tools is a shift.
What does interaction mean?
Different slopes of continuous variables at different levels of a factor
Do not allow for interaction:
fit1=lm(hours ~ rpm + tool, data=my.dat)
[Figure: "Durability of Lathe Cutting Tools", hours vs. rpm, points labelled by tool type A or B]
Interaction is allowed:
fit2=lm(hours ~ rpm * tool, data=my.dat)
[Figure: "Durability of Lathe Cutting Tools: with Interaction", hours vs. rpm, points labelled by tool type A or B]
In case of interaction, the slope of the predictor “rpm” changes for different levels of
the second predictor “tool”.
Do we get the same slope in rpm for tool A and tool B?
Is there an interaction between rpm and tool?
fit2 <- lm(hours ~ rpm * tool, data=my.dat)
summary(fit2)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.774760   4.633472   7.073 2.63e-06 ***
rpm         -0.020970   0.006074  -3.452  0.00328 **
toolB       23.970593   6.768973   3.541  0.00272 **
rpm:toolB   -0.011944   0.008842  -1.351  0.19553
---
Residual standard error: 2.968 on 16 degrees of freedom
Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
$$\widehat{\text{hours}} = 32.8 - 0.02 \cdot \text{rpm} + 24 \cdot \text{toolB} - 0.01 \cdot (\text{rpm} \cdot \text{toolB})$$
The main effects are hard to interpret in case of interactions.
Here the interaction does not seem to be significant. With ANOVA we can test for
nested models whether the more complex model leads to a significant improvement.
How to read a model with interaction?
[Figure: "Durability of Lathe Cutting Tools: with Interaction" (interaction is allowed), hours vs. rpm, points labelled by tool type A or B]
In case of interaction, the slope of the predictor “rpm” changes for different levels of
the second predictor “tool”; the intercept also changes between the two tools.
$$\widehat{\text{hours}} = 32.8 - 0.02 \cdot \text{rpm} + 24 \cdot \text{toolB} - 0.01 \cdot (\text{rpm} \cdot \text{toolB})$$
toolA (toolB = 0):
$$\widehat{\text{hours}} = 32.8 - 0.02 \cdot \text{rpm} + 24 \cdot 0 - 0.01 \cdot (\text{rpm} \cdot 0) = 32.8 - 0.02 \cdot \text{rpm}$$
toolB (toolB = 1):
$$\widehat{\text{hours}} = 32.8 - 0.02 \cdot \text{rpm} + 24 \cdot 1 - 0.01 \cdot (\text{rpm} \cdot 1) = 56.8 - 0.03 \cdot \text{rpm}$$
Remark: In case of interaction between two continuous predictors, slope (and intercept) of one predictor changes
continuously with a continuously changing value of the other predictor and vice versa.
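The two tool-specific lines follow from simple arithmetic on the four coefficients. A minimal sketch that uses the estimates printed in the `summary(fit2)` output; the variable names are hypothetical:

```r
# Coefficient estimates copied from the summary(fit2) output in the slides
b <- c("(Intercept)" = 32.774760, "rpm" = -0.020970,
       "toolB" = 23.970593, "rpm:toolB" = -0.011944)

# Tool A is the reference level (toolB = 0)
intercept_A <- b[["(Intercept)"]]
slope_A     <- b[["rpm"]]

# Tool B (toolB = 1): add the toolB shift and the interaction term
intercept_B <- b[["(Intercept)"]] + b[["toolB"]]
slope_B     <- b[["rpm"]] + b[["rpm:toolB"]]

print(round(c(intercept_A = intercept_A, slope_A = slope_A), 3))
print(round(c(intercept_B = intercept_B, slope_B = slope_B), 3))
```

Working with the unrounded estimates gives a tool B intercept of about 56.75 and a slope of about -0.033, so sums of the rounded values shown on the slide can differ in the last digit.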
Do we need the complex model with the interaction?
Do not allow for interaction:
fit1=lm(hours ~ rpm + tool, data=my.dat)
[Figure: "Durability of Lathe Cutting Tools", hours vs. rpm, points labelled by tool type A or B]
Interaction is allowed:
fit2=lm(hours ~ rpm * tool, data=my.dat)
[Figure: "Durability of Lathe Cutting Tools: with Interaction", hours vs. rpm, points labelled by tool type A or B]
# compare the nested models (smaller model first)
anova(fit1, fit2, test="F")
# p>5%: no significant improvement, therefore the interaction is not needed
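The comparison of nested models can be tried out on simulated data. In the sketch below all names (`sim`, `fit_add`, `fit_int`) and numbers are hypothetical and only mimic the structure of the lathe example; the original data are not available here:

```r
# Simulated, hypothetical data (not the lathe data from the slides)
set.seed(42)
n_per_tool <- 10
sim <- data.frame(
  rpm  = rep(seq(500, 950, length.out = n_per_tool), times = 2),
  tool = factor(rep(c("A", "B"), each = n_per_tool))
)
sim$hours <- 35 - 0.02 * sim$rpm + 15 * (sim$tool == "B") +
  rnorm(nrow(sim), sd = 3)

fit_add <- lm(hours ~ rpm + tool, data = sim)   # additive model
fit_int <- lm(hours ~ rpm * tool, data = sim)   # model with interaction

# Partial F-test for the nested models (smaller model first)
cmp <- anova(fit_add, fit_int, test = "F")
print(cmp)

# With one extra coefficient, the F-test p-value equals the t-test p-value
# of the interaction term in summary(fit_int)
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(fit_int)$coefficients["rpm:toolB", "Pr(>|t|)"]
print(c(p_f = p_f, p_t = p_t))
```

The F-test becomes genuinely more informative than the single t-test when the larger model adds several coefficients at once, e.g. an interaction with a factor that has more than two levels.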
Steps in linear modelling
0) Preprocessing
- learn the meaning of all variables, check for correlations
- give short and informative names
- check for impossible values and errors
- if they exist (missing, error): set them to NA
- consider imputation methods, but be careful
1) First-aid transformations
- bring all variables to a suitable scale (use also field knowledge)
- routinely apply the first-aid transformations
2) Find a good model
- start with a model including important confounders
- perform a residual analysis
- improve the model by transformations or by adding better predictors
- reduce complexity step by step (be aware of introduced biases)
- use your specific knowledge to choose between variables
Limits of linear regression
If your residuals do not follow a Normal distribution (even after
transformations), use generalized linear modeling
(glm, e.g. logistic regression).
If your predictors show a strong correlation, use shrinkage methods
(e.g. lasso).
If your data are not independent, use mixed models or methods for
time series.
If you do not have a linear relation, use non-linear regression
(e.g. nls), generalized additive models (e.g. gam), or tree models.