Multiple linear regression
Tron Anders Moger
11.10.2006
Example:
[Scatter plot: birth weight (0–5000 grams) against mother's weight in pounds (50–250)]
Repetition: Simple linear regression
• We define a model
  Yi = β0 + β1xi + εi
  where the εi are independent, normally distributed, with equal variance σ²
• We wish to fit a line as close to the observed data (two normally distributed variables) as possible
• Example: Birth weight = β0 + β1*mother's weight
Least squares regression
[Scatter plot: birth weight (0–5000 grams) against mother's weight in pounds (50–250), with fitted least squares line; R Sq Linear = 0.035]
How to compute the line fit with the least squares method?
• Let (x1, y1), (x2, y2), ..., (xn, yn) denote the points in the plane.
• Find a and b so that the line y = a + bx fits the points by minimizing
  S = (y1 − a − bx1)² + (y2 − a − bx2)² + ... + (yn − a − bxn)² = Σi (yi − a − bxi)²
• Solution:
  b = (n Σi xiyi − Σi xi Σi yi) / (n Σi xi² − (Σi xi)²) = (Σi xiyi − n x̄ȳ) / (Σi xi² − n x̄²)
  a = ȳ − b x̄
  where x̄ = (1/n) Σi xi, ȳ = (1/n) Σi yi, and all sums are done for i = 1, ..., n.
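These formulas translate directly into code. A minimal Python sketch (numpy assumed available; the data values below are made-up illustration numbers, not the real birth weight data):

    import numpy as np

    def least_squares(x, y):
        # b and a minimizing S = sum_i (y_i - a - b*x_i)^2
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        n = len(x)
        b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
        a = y.mean() - b * x.mean()
        return a, b

    # hypothetical example: mother's weight (pounds) vs birth weight (grams)
    a, b = least_squares([100, 120, 140, 160, 180], [2800, 3000, 3100, 3300, 3400])
    print(a, b)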
How do you get this answer?
• Differentiate S with respect to a and b, and set the results to 0:
  ∂S/∂a = −2 Σi (yi − a − bxi) = 0
  ∂S/∂b = −2 Σi xi(yi − a − bxi) = 0
We get:
  na + b Σi xi − Σi yi = 0
  a Σi xi + b Σi xi² − Σi xiyi = 0
This is two equations with two unknowns, and solving them gives the answer.
How close are the data to the fitted line? R2
• Define
  – SSE: Error sum of squares, SSE = Σi (yi − a − bxi)²
  – SSR: Regression sum of squares, SSR = Σi (a + bxi − ȳ)²
  – SST: Total sum of squares, SST = Σi (yi − ȳ)²
• We can show that SST = SSR + SSE
• Define
  R² = SSR/SST = 1 − SSE/SST = corr(x, y)²
• R2 is the "coefficient of determination"
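As a quick check of these identities, a small sketch continuing the hypothetical data above (values are made up):

    import numpy as np

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    sse = np.sum((y - y_hat)**2)          # error sum of squares
    ssr = np.sum((y_hat - y.mean())**2)   # regression sum of squares
    sst = np.sum((y - y.mean())**2)       # total sum of squares
    # all three expressions for R^2 agree:
    print(ssr / sst, 1 - sse / sst, np.corrcoef(x, y)[0, 1]**2)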
What is the logic behind R2?
[Diagram: a fitted line ŷi = a + bxi through the data; at a point xi, SSE measures the distances yi − ŷi, SSR the distances ŷi − ȳ, and SST the distances yi − ȳ]
Example: Regression of birth weight with mother's weight as independent variable

Model Summary (dependent variable: birthweight; predictors: (Constant), weight in pounds):
  R = 0.186   R Square = 0.035   Adjusted R Square = 0.029   Std. Error of the Estimate = 718.243

Coefficients (dependent variable: birthweight):
                     B          Std. Error   Beta    t        Sig.    95% CI for B
  (Constant)         2369.672   228.431              10.374   0.000   (1919.040, 2820.304)
  weight in pounds   4.429      1.713        0.186   2.586    0.010   (1.050, 7.809)
Interpretation:
• We have fitted the line
  Birth weight = 2369.672 + 4.429*mother's weight
• If mother's weight increases by 20 pounds, what is the predicted impact on the infant's birth weight? 4.429*20 ≈ 89 grams
• What is the predicted birth weight of an infant with a 150 pound mother? 2369.672 + 4.429*150 ≈ 3034 grams
But how to answer questions like:
• Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?
• What is a confidence interval for the estimated slope?
• What is the prediction, with uncertainty, at a new x value?
Confidence intervals for simple regression
• In a simple regression model,
  – a estimates β0
  – b estimates β1
  – σ̂² = SSE/(n − 2) estimates σ²
• Also,
  (b − β1)/sb ~ t(n−2)
  where sb² = σ̂²/((n − 1)sx²) estimates the variance of b
• So a confidence interval for β1 is given by
  b ± t(n−2, α/2) sb
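A sketch of this interval computation (scipy assumed for the t quantile; same made-up data as before):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    sse = np.sum((y - (a + b * x))**2)
    sigma2_hat = sse / (n - 2)                                  # estimates sigma^2
    s_b = np.sqrt(sigma2_hat / ((n - 1) * np.var(x, ddof=1)))   # estimated sd of b
    t_crit = stats.t.ppf(0.975, df=n - 2)                       # t_{n-2, alpha/2} for 95%
    print(b - t_crit * s_b, b + t_crit * s_b)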
Hypothesis testing for simple regression
• Choose hypotheses:
  H0: β1 = 0   vs   H1: β1 ≠ 0
• Test statistic:
  b/sb ~ t(n−2) under H0
• Reject H0 if b/sb > t(n−2, α/2) or b/sb < −t(n−2, α/2)
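In practice this test is built in; scipy's linregress, for instance, returns the slope, its standard error, and the two-sided p-value for H0: β1 = 0 (same made-up data):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    res = stats.linregress(x, y)
    # res.slope / res.stderr is the t statistic; compare res.pvalue to alpha
    print(res.slope, res.stderr, res.pvalue)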
Prediction from a simple regression model
• A regression model can be used to predict the response at a new value xn+1
• The uncertainty in this prediction comes from two sources:
  – The uncertainty in the regression line
  – The uncertainty of any response, given the regression line
• A confidence interval for the prediction:
  a + bxn+1 ± t(n−2, α/2) σ̂ √(1 + 1/n + (xn+1 − x̄)²/Σi(xi − x̄)²)
Example: The confidence interval for the predicted birth weight of an infant with a 150 pound mother

• Found that the predicted weight was 3034 grams
• The confidence interval for the prediction is:
  2369.67 + 4.43*150 ± t(187, 0.025)*718.24*√(1 + 1/189 + (150 − 129.81)²/175798.52)
  with t(187, 0.025) ≈ 1.96 and σ̂ = 718.24 (the standard error of the estimate)
• Which becomes approximately (1620.8, 4447.2)
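A sketch of the full prediction interval computation under the formula above (made-up data again, so the numbers will differ from the birth weight example):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    sigma_hat = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))  # std. error of the estimate
    x_new = 150.0
    pred = a + b * x_new
    half = stats.t.ppf(0.975, n - 2) * sigma_hat * np.sqrt(
        1 + 1/n + (x_new - x.mean())**2 / np.sum((x - x.mean())**2))
    print(pred - half, pred + half)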
More than one independent variable: Multiple regression
• Assume we have data of the type (x11, x12, x13, y1), (x21, x22, x23, y2), ...
• We want to "explain" y from the x-values by fitting the following model:
  y = a + bx1 + cx2 + dx3
• Just like before, one can produce formulas for a, b, c, d minimizing the sum of the squares of the "errors".
• x1, x2, x3 can be transformations of different variables, or transformations of the same variable
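Such formulas are rarely written out by hand; in code, the usual route is a design matrix and a linear solver. A sketch with numpy (the data are synthetic placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2, x3 = rng.normal(size=(3, 20))            # synthetic explanatory variables
    y = 1.0 + 2.0*x1 - 0.5*x2 + 0.3*x3 + rng.normal(scale=0.1, size=20)

    X = np.column_stack([np.ones(20), x1, x2, x3])   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes the sum of squared errors
    a, b, c, d = coef
    print(a, b, c, d)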
Multiple regression model
  yi = β0 + β1x1i + β2x2i + ... + βKxKi + εi
• The errors εi are independent random (normal) variables with expectation zero and variance σ²
• The explanatory variables x1i, x2i, ..., xKi cannot be linearly related
New example: Traffic deaths in 1976 (from file crash on textbook CD)
• Want to find if there is any relationship between the highway death rate (deaths per 1000 per state) in the U.S. and the following variables:
  – Average car age (in months)
  – Average car weight (in 1000 pounds)
  – Percentage light trucks
  – Percentage imported cars
• All data are per state
First: Scatter plots:
[Four scatter plots of deaths (0.05–0.35) against carage (69–71.5), impcars (0–30), lghttrks (5–35), and vehwt (3.0–3.8)]
Univariate effects (one independent variable at a time!):
• Hence: If all else is equal, if average car age increases by one month, you get 0.062 fewer deaths per 1000 inhabitants; increase age by 12 months, and you get 12*0.062 ≈ 0.74 fewer deaths per 1000 inhabitants
Deaths per 1000 = a + b*car age (in months)

Model Summary (dependent variable: deaths; predictors: (Constant), carage):
  R = 0.492   R Square = 0.242   Adjusted R Square = 0.226   Std. Error of the Estimate = 0.05206

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   4.516    1.134                 3.981    0.000   (2.233, 6.800)
  carage       -0.062   0.016        -0.492   -3.834   0.000   (-0.094, -0.029)
Deaths per 1000 = a + b*car weight (in 1000 pounds)

Model Summary (dependent variable: deaths; predictors: (Constant), vehwt):
  R = 0.281   R Square = 0.079   Adjusted R Square = 0.059   Std. Error of the Estimate = 0.05740

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta    t        Sig.    95% CI for B
  (Constant)   -0.271   0.221                -1.227   0.226   (-0.716, 0.174)
  vehwt        0.124    0.062        0.281   1.983    0.053   (-0.002, 0.249)
Univariate effects cont’d (one independent variable at a time!):
Model Summary (dependent variable: deaths; predictors: (Constant), lghttrks):
  R = 0.716   R Square = 0.512   Adjusted R Square = 0.501   Std. Error of the Estimate = 0.04178

Coefficients (dependent variable: deaths):
               B       Std. Error   Beta    t       Sig.    95% CI for B
  (Constant)   0.046   0.018                2.478   0.017   (0.009, 0.083)
  lghttrks     0.007   0.001        0.716   6.947   0.000   (0.005, 0.010)

Model Summary (dependent variable: deaths; predictors: (Constant), impcars):
  R = 0.308   R Square = 0.095   Adjusted R Square = 0.075   Std. Error of the Estimate = 0.05690

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   0.206    0.020                 10.462   0.000   (0.166, 0.246)
  impcars      -0.004   0.002        -0.308   -2.193   0.033   (-0.007, 0.000)

Hence: Increasing the percentage of light trucks by 20 means 20*0.007 = 0.14 more deaths per 1000 inhabitants.
Predicted number of deaths per 1000 if the percentage of imported cars is 10%: 0.206 − 0.004*10 ≈ 0.17
Building a multiple regression model:
• Forward regression: Try all independent variables, one at a time; keep the variable with the lowest p-value
• Repeat step 1, with the independent variable from the first round now included in the model
• Repeat until no more variables can be added to the model (no more significant variables)
• Backward regression: Include all independent variables in the model; remove the variable with the highest p-value
• Continue until only significant variables are left
• However: These methods are not always correct to use in practice!
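A rough sketch of forward selection in Python using statsmodels (the DataFrame df, the response name, and the alpha cutoff are illustration assumptions; remember the caveat above about using these procedures blindly):

    import statsmodels.api as sm

    def forward_select(df, response, alpha=0.05):
        # greedily add the variable with the lowest p-value until none is significant
        chosen = []
        remaining = [c for c in df.columns if c != response]
        while remaining:
            pvals = {}
            for cand in remaining:
                X = sm.add_constant(df[chosen + [cand]])
                pvals[cand] = sm.OLS(df[response], X).fit().pvalues[cand]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            chosen.append(best)
            remaining.remove(best)
        return chosen

    # usage (hypothetical): forward_select(crash_df, "deaths")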
For the traffic deaths, end up with:
• Deaths per 1000=2.7-0.037*car age +0.006*perc. light trucks
Model Summary (dependent variable: deaths; predictors: (Constant), lghttrks, carage):
  R = 0.768   R Square = 0.590   Adjusted R Square = 0.572   Std. Error of the Estimate = 0.03871

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   2.668    0.895                 2.981    0.005   (0.865, 4.470)
  carage       -0.037   0.013        -0.295   -2.930   0.005   (-0.063, -0.012)
  lghttrks     0.006    0.001        0.622    6.181    0.000   (0.004, 0.009)
Conclusion: Did a multiple linear regression on traffic deaths, with car age, car weight, percentage light trucks, and percentage imported cars as independent variables. Car age (in months, β = −0.037, 95% CI = (−0.063, −0.012)) and percentage light trucks (β = 0.006, 95% CI = (0.004, 0.009)) were significant at the 5% level.
Check of assumptions:
[Histogram of regression standardized residuals (dependent variable: deaths); Mean = 2.23E-17, Std. Dev. = 0.978, N = 48]
[Normal P-P plot of regression standardized residuals (dependent variable: deaths)]
Check of assumptions cont’d:
[Scatterplot of regression standardized residuals against regression standardized predicted values (dependent variable: deaths)]
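These plots are straightforward to reproduce; a sketch with matplotlib (the residuals here are random placeholders standing in for the fitted model's standardized residuals):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    fitted = rng.normal(size=48)    # placeholder standardized predicted values
    resid = rng.normal(size=48)     # placeholder standardized residuals

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(resid, bins=12)        # should look roughly normal
    ax1.set_xlabel("standardized residual")
    ax2.scatter(fitted, resid)      # should show no trend and equal spread
    ax2.axhline(0, color="gray")
    ax2.set_xlabel("standardized predicted value")
    plt.show()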
Least squares estimation
• For the model yi = β0 + β1x1i + β2x2i + ... + βKxKi + εi, the least squares estimates of β0, β1, ..., βK are the values b0, b1, ..., bK minimizing
  SSE = Σi (b0 + b1x1i + b2x2i + ... + bKxKi − yi)²
• They can be computed with formulas similar to, but more complex than, those for simple regression
Explanatory power
• Defining the fitted values
  ŷi = b0 + b1x1i + b2x2i + ... + bKxKi
  SST = Σi (yi − ȳ)²
  SSE = Σi (yi − ŷi)²
  SSR = Σi (ŷi − ȳ)²
• We get as before
  SST = SSR + SSE
• We define the coefficient of determination
  R² = SSR/SST = 1 − SSE/SST
• We also get that
  R = Corr(y, ŷ)
Adjusted coefficient of determination
• Adding more independent variables will generally increase SSR and decrease SSE
• Thus the coefficient of determination will tend to indicate that models with many variables always fit better.
• To avoid this effect, the adjusted coefficient of determination may be used:
  adjusted R² = 1 − (SSE/(n − K − 1)) / (SST/(n − 1))
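As a one-line helper (a sketch; sse and sst as defined above, K explanatory variables, n observations):

    def adjusted_r2(sse, sst, n, K):
        # penalizes extra variables via the degrees-of-freedom ratio
        return 1 - (sse / (n - K - 1)) / (sst / (n - 1))

    # e.g. adjusted_r2(sse=0.5, sst=2.0, n=48, K=2) -> about 0.739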
Drawing inference about the model parameters
• Similar to simple regression, we get that the following statistic has a t distribution with n − K − 1 degrees of freedom:
  t = (bj − βj)/sbj
  where bj is the least squares estimate for βj and sbj is its estimated standard deviation
• sbj is computed from SSE and the correlations between the independent variables
Confidence intervals and hypothesis tests
• A confidence interval for βj becomes
  bj ± t(n−K−1, α/2) sbj
• Testing the hypothesis H0: βj = 0 vs H1: βj ≠ 0:
  – Reject H0 if bj/sbj > t(n−K−1, α/2) or bj/sbj < −t(n−K−1, α/2)
Testing sets of parameters
• We can also test the null hypothesis that a specific set of the betas are simultaneously zero. The alternative hypothesis is that at least one beta in the set is nonzero.
• The test statistic has an F distribution, and is computed by comparing the SSE in the full model, and the SSE when setting the parameters in the set to zero.
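A sketch of that comparison: with q betas set to zero, SSE_restricted from the reduced fit and SSE_full from the full fit of K variables on n observations, the statistic is F = ((SSE_restricted − SSE_full)/q) / (SSE_full/(n − K − 1)):

    from scipy import stats

    def f_test(sse_restricted, sse_full, q, n, K):
        # F test that q of the betas are simultaneously zero
        F = ((sse_restricted - sse_full) / q) / (sse_full / (n - K - 1))
        p = stats.f.sf(F, q, n - K - 1)   # upper-tail p-value
        return F, p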
Making predictions from the model
• As in simple regression, we can use the estimated coefficients to make predictions
• As in simple regression, the uncertainty in the predictions has two sources:
  – The variance around the regression estimate
  – The variance of the estimated regression model
What if the relationship is non-linear: Transformed variables
• The relationship between variables may not be linear
• Example: The natural model may be
  y = a·e^(bx)
• We want to find a and b so that this curve approximates the points as well as possible
[Plot: points following the curve y = a·e^(bx), x from 15 to 30, y from 0.05 to 0.20]
Example (cont.)
• When y = a·e^(bx), then log(y) = log(a) + bx
• Use the standard formulas on the pairs (x1, log(y1)), (x2, log(y2)), ..., (xn, log(yn))
• We get estimates for log(a) and b, and thus a and b
[Plot: the same data with the fitted curve y = a·e^(bx), x from 15 to 30, y from 0.05 to 0.20]
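A sketch of this transform-then-fit recipe (np.polyfit does the simple regression; the data are invented to roughly follow y = a·e^(bx)):

    import numpy as np

    x = np.array([15, 18, 21, 24, 27, 30.0])
    y = np.array([0.05, 0.07, 0.09, 0.12, 0.16, 0.20])

    # regress log(y) on x: the slope is b, the intercept is log(a)
    b, log_a = np.polyfit(x, np.log(y), 1)
    a = np.exp(log_a)
    print(a, b)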
Another example of transformed variables
• Another natural model may be
  y = a·x^b
• We get that
  log(y) = log(a) + b·log(x)
• Use the standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ..., (log(xn), log(yn))
[Plot: points following the curve y = a·x^b, x from 0 to 8, y from 0.008 to 0.016]
Note: In this model, the curve goes through (0, 0)
A third example:
• Assume data (x1, y1), ..., (xn, yn) seem to follow a third degree polynomial
  y = a + bx + cx² + dx³
• We use multiple regression on (x1, x1², x1³, y1), (x2, x2², x2³, y2), ...
• We get estimates a, b, c, d in a third degree polynomial curve
[Plot: points following a third degree polynomial, x from 0 to 3, y from −3 to 0]
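A sketch of the cubic fit set up as a multiple regression (synthetic data; np.polyfit(x, y, 3) would give the same coefficients):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 3.0, 20)
    y = 0.1 - 1.2*x + 0.8*x**2 - 0.3*x**3 + rng.normal(scale=0.05, size=20)

    X = np.column_stack([np.ones_like(x), x, x**2, x**3])  # treat x, x^2, x^3 as three variables
    (a, b, c, d), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a, b, c, d)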
Doing a regression analysis
• Plot the data first, to investigate whether there is a natural relationship
• Linear or transformed model?
• Are there outliers which will unduly affect the result?
• Fit a model. Different models with the same number of parameters may be compared with R2
• Check the assumptions!
• Make tests / confidence intervals for parameters
• A lot of practice is needed!
Conclusion and further options
• Regression versus correlation:
  – Can include more independent variables in regression
  – Gives a more detailed picture of the effect an independent variable has on the dependent variable
• What if the dependent variable only has two possible values? Logistic regression
  – Similar to linear regression
  – But the interpretation of the β's is different: exp(β) is interpreted as an odds ratio instead of the slope of a line