Post on 19-Dec-2015
Multiple regression
Problem: to draw a straight line through the points that best explains the variance
Regression
[Figure: scatter plot; y-axis 0 to 9, x-axis 0 to 6]
Test with F, just like ANOVA:
F = (Variance explained by x-variable / df) / (Variance still unexplained / df)
Regression
[Figure: two scatter plots, each with y-axis 0 to 9 and x-axis 0 to 6. One panel shows the variance explained (change in line lengths, squared); the other shows the variance unexplained (residual line lengths, squared).]
In regression, each x-variable will normally have 1 df
Essentially a cost-benefit analysis: is the benefit in variance explained worth the cost in using up degrees of freedom?
Total variance for 32 data points is 300 units.
An x-variable is then regressed against the data, accounting for 150 units of variance.
1. What is the R²?
2. What is the F ratio?
Regression example
R² = 150/300 = 0.5
F(1,30) = (150/1) / (150/30) = 150/5 = 30
Why is df error = 30?
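The arithmetic of the example can be checked with a few lines of Python. This is only a sketch: it assumes the "units of variance" are sums of squares, and that the error df comes from the 32 data points minus 1 df for the overall mean and 1 df for the x-variable, which is one reading that gives 30. The variable names are illustrative.

```python
# Worked example from the slide: 32 points, total variance 300, explained 150.
n_points = 32
total_ss = 300.0       # total variation (assumed to be a sum of squares)
explained_ss = 150.0   # variation accounted for by the x-variable

df_model = 1                          # one x-variable uses 1 df
df_error = n_points - 1 - df_model    # 32 - 1 (mean) - 1 (x-variable)

unexplained_ss = total_ss - explained_ss
r_squared = explained_ss / total_ss
f_ratio = (explained_ss / df_model) / (unexplained_ss / df_error)

print(f"R^2 = {r_squared}")
print(f"F({df_model},{df_error}) = {f_ratio}")
```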
Multiple regression
[Figure: herbivore damage plotted against tree age, with points marked as higher nutrient trees or lower nutrient trees]
Damage = m1*age + b
[Figure: herbivore damage plotted against tree age, and the residuals of herbivore damage plotted against tree nutrient concentration]
Damage = m1*age + m2*nutrient + b
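The two-step picture (fit age, then look at what nutrient explains in the residuals) and the two-variable model can be sketched with NumPy's least-squares solver. The tree data here are synthetic and invented purely for illustration, with assumed true values m1 = 2, m2 = 5 and b = 1; nothing below comes from the original data set.

```python
import numpy as np

# Synthetic, illustrative data (not from the slides).
rng = np.random.default_rng(0)
n_trees = 40
age = rng.uniform(1, 20, n_trees)
nutrient = rng.uniform(0.5, 3.0, n_trees)
damage = 2.0 * age + 5.0 * nutrient + 1.0 + rng.normal(0, 1.0, n_trees)

# Step 1: Damage = m1*age + b, then inspect the residuals.
X_age = np.column_stack([age, np.ones(n_trees)])
coef_age, *_ = np.linalg.lstsq(X_age, damage, rcond=None)
residuals = damage - X_age @ coef_age
resid_vs_nutrient = np.corrcoef(nutrient, residuals)[0, 1]

# Step 2: Damage = m1*age + m2*nutrient + b, fitted in one go.
X_full = np.column_stack([age, nutrient, np.ones(n_trees)])
(m1, m2, b), *_ = np.linalg.lstsq(X_full, damage, rcond=None)

print("correlation of age-only residuals with nutrient:", resid_vs_nutrient)
print("m1, m2, b:", m1, m2, b)
```

In this synthetic case the residuals from the age-only fit still track nutrient concentration, which is the variation the second term picks up.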
Damage = m1*age + m2*nutrient + m3*age*nutrient + b
[Figure: two panels plotting y, one labelled "No interaction (additive)" and one labelled "Interaction (non-additive)"]
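An interaction term is just one more column in the model: the product of the two x-variables. The sketch below compares the additive and interaction models on synthetic data (invented for illustration, with an assumed true m3 of 0.5) and tests the extra term with a 1-df F ratio. The helper `fit` is a hypothetical name, not something from the slides.

```python
import numpy as np

# Synthetic, illustrative data with a built-in age*nutrient interaction.
rng = np.random.default_rng(1)
n = 60
age = rng.uniform(1, 20, n)
nutrient = rng.uniform(0.5, 3.0, n)
damage = (1.0 * age + 2.0 * nutrient + 0.5 * age * nutrient + 3.0
          + rng.normal(0, 1.0, n))

def fit(columns, y):
    """Least-squares fit with an intercept; returns coefficients and residual SS."""
    X = np.column_stack(columns + [np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss

coef_add, rss_add = fit([age, nutrient], damage)                  # additive
coef_int, rss_int = fit([age, nutrient, age * nutrient], damage)  # with m3

# F for the interaction: variance it explains (1 df) over what is left.
df_error = n - 4   # three slopes plus the intercept
f_interaction = ((rss_add - rss_int) / 1) / (rss_int / df_error)

print("m3 estimate:", coef_int[2])
print(f"F(1,{df_error}) for interaction:", f_interaction)
```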
Non-linear regression?
Just a special case of multiple regression!
Y = m1*x + m2*x² + b

X    X²    Y
1    1     1.1
2    4     2.0
3    9     3.6
4    16    3.1
5    25    5.2
6    36    6.7
7    49    11.3

Treat X as x1 and X² as x2:
Y = m1*x1 + m2*x2 + b
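Using the seven (X, Y) pairs from the table, the point can be checked directly: fitting Y on the two columns x1 = X and x2 = X² by ordinary multiple regression should give the same curve as a quadratic fit. A minimal sketch, assuming NumPy is available:

```python
import numpy as np

# Data from the table above.
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([1.1, 2.0, 3.6, 3.1, 5.2, 6.7, 11.3])

# Multiple regression with x1 = X and x2 = X squared.
x1 = x
x2 = x ** 2
X = np.column_stack([x1, x2, np.ones_like(x)])
(m1, m2, b), *_ = np.linalg.lstsq(X, y, rcond=None)

# The same model fitted as a degree-2 polynomial, for comparison.
quad = np.polyfit(x, y, 2)   # returns [m2, m1, b]

print("multiple regression:", m1, m2, b)
print("polynomial fit:     ", quad[1], quad[0], quad[2])
```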
STEPWISE REGRESSION
[Figure: jump height (how high the ball can be raised off the ground), measured in feet off the ground; scale marks at 8, 9, 10 and 11]
Total SS = 11.11
[Figure: jump (ft), 7 to 11, plotted against height (ft), 4.5 to 8.5]

X variable          parameter   SS     F1,13   p
Height of player    +0.943      9.96   112     <0.0001
[Figure: jump (ft), 7 to 11, plotted against weight (lbs), 105 to 205]

X variable          parameter   SS     F1,13   p
Weight of player    +0.040      7.92   32      <0.0001
Why do you think weight is positively correlated with jump height?
An idea
Perhaps if we took two people of identical height, the lighter one might actually jump higher? Excess weight may reduce ability to jump high…
How could we test this idea?
[Figure: jump (ft), 7 to 11, plotted against height (ft), 4 to 8, with lighter and heavier players distinguished]

X variable   parameter   SS      F     p
Height       +2.133      9.956   803   <0.0001
Weight       -0.059      1.008   81    <0.0001
Questions:
• Why did the parameter estimates change?
• Why did the F tests change?
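Heavy people are often tall (and tall people are often heavy).
Tall people can jump higher.
People who are light for their height can jump a bit more.
[Diagram: Weight and Height are positively related (+); Height has a positive effect on Jump (+); Weight has a negative effect on Jump (-)]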
The problem:
The parameter estimate and significance of an x-variable are affected by the x-variables already in the model!
How do we know which variables are significant, and in which order to enter them into the model?
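The sign flip for weight can be reproduced on made-up data. The sketch below does not use the players from the slides: it generates synthetic heights and weights that are correlated, and a jump height built with an assumed positive height effect and an assumed negative weight effect. The helper `slopes` is a hypothetical name.

```python
import numpy as np

# Synthetic, illustrative data: heavier people tend to be taller.
rng = np.random.default_rng(2)
n = 200
height = rng.normal(6.0, 0.5, n)                  # ft
weight = 30.0 * height + rng.normal(0, 8.0, n)    # lbs, correlated with height
jump = 2.0 * height - 0.03 * weight + rng.normal(0, 0.1, n)

def slopes(predictors, y):
    """Least-squares slopes (intercept fitted but not returned)."""
    X = np.column_stack(predictors + [np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1]

(weight_alone,) = slopes([weight], jump)                      # weight only
height_joint, weight_joint = slopes([height, weight], jump)   # both together

print("weight slope, weight alone:     ", weight_alone)
print("weight slope, height in model:  ", weight_joint)
print("height slope, weight in model:  ", height_joint)
```

On its own, weight stands in for height and gets a positive slope; once height is in the model, the weight slope reflects only the effect of being heavy for one's height.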
Solutions
1) Use a logical order. For example, in ANCOVA it makes sense to test the interaction first.
2) Stepwise regression: “tries out” various orders of entering and removing variables.
Stepwise regression
Enters or removes variables in order of significance, and checks after each step whether the significance of other variables has changed
Enters one by one: forward stepwise
Enters all, removes one by one: backwards stepwise
Forward stepwise regression
• Enter the variable with the highest correlation with the y-variable first (if p < p enter).
• Next enter the variable that explains the most residual variation (if p < p enter).
• Remove variables that become insignificant (p > p leave) due to other variables being added. And so on…
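The steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the procedure of any particular statistics package: it uses F-to-enter and F-to-leave thresholds (4.0 and 3.9, assumed values) in place of p enter and p leave so that no p-value library is needed, and it runs on synthetic data that includes a made-up irrelevant variable, `shoe_size`. All function names are hypothetical.

```python
import numpy as np

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X = np.column_stack(list(columns) + [np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

def partial_f(data, y, others, name):
    """1-df F ratio for adding `name` to a model that already holds `others`."""
    rss_without = rss([data[v] for v in others], y)
    rss_with = rss([data[v] for v in others + [name]], y)
    df_error = len(y) - len(others) - 2   # n - predictors - intercept
    return (rss_without - rss_with) / (rss_with / df_error)

def forward_stepwise(data, y, f_enter=4.0, f_leave=3.9):
    selected = []
    for _ in range(2 * len(data)):        # bounded, to guard against cycling
        candidates = [v for v in data if v not in selected]
        if not candidates:
            break
        # Enter the candidate that explains the most residual variation.
        best = max(candidates, key=lambda v: partial_f(data, y, selected, v))
        if partial_f(data, y, selected, best) < f_enter:
            break
        selected.append(best)
        # Remove variables that no longer earn their degree of freedom.
        for v in list(selected):
            rest = [u for u in selected if u != v]
            if partial_f(data, y, rest, v) < f_leave:
                selected.remove(v)
    return selected

# Synthetic, illustrative data (not the players from the slides).
rng = np.random.default_rng(3)
n = 200
height = rng.normal(6.0, 0.5, n)
weight = 30.0 * height + rng.normal(0, 8.0, n)
shoe_size = rng.normal(10.0, 1.0, n)      # unrelated to jump by construction
jump = 2.0 * height - 0.03 * weight + rng.normal(0, 0.1, n)

data = {"height": height, "weight": weight, "shoe_size": shoe_size}
selected = forward_stepwise(data, jump)
print("variables kept, in order of entry:", selected)
```

Keeping the leave threshold slightly below the enter threshold means a variable that has just entered cannot be removed in the same step.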