Multiple linear regression
Tron Anders Moger
11.10.2006
Example:
[Scatter plot: birth weight (0–5000 grams) against mother's weight in pounds (50–250)]
Repetition: Simple linear regression
• We define a model
  Yi = β0 + β1xi + εi
  where the εi are independent, normally distributed, with equal variance σ²
• We wish to fit a line as close to the observed data (two normally distributed variables) as possible
• Example: Birth weight = β0 + β1*mother's weight
Least squares regression
[Scatter plot: birth weight (0–5000 grams) against mother's weight in pounds (50–250), with fitted least squares line; R Sq Linear = 0.035]
How to compute the line fit with the least squares method?
• Let (x1, y1), (x2, y2), ..., (xn, yn) denote the points in the plane.
• Find a and b so that the line y = a + bx fits the points by minimizing
  S = (y1 − a − bx1)² + (y2 − a − bx2)² + ... + (yn − a − bxn)² = Σi (yi − a − bxi)²
• Solution:
  b = (n Σi xiyi − Σi xi Σi yi) / (n Σi xi² − (Σi xi)²) = (Σi xiyi − n x̄ȳ) / (Σi xi² − n x̄²)
  a = ȳ − b x̄
  where x̄ = (1/n) Σi xi, ȳ = (1/n) Σi yi, and all sums are done for i = 1, ..., n.
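These formulas translate directly into code. A minimal Python sketch (numpy assumed available; the data values below are made-up illustration numbers, not the real birth weight data):

    import numpy as np

    def least_squares(x, y):
        # b and a minimizing S = sum_i (y_i - a - b*x_i)^2
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        n = len(x)
        b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
        a = y.mean() - b * x.mean()
        return a, b

    # hypothetical example: mother's weight (pounds) vs birth weight (grams)
    a, b = least_squares([100, 120, 140, 160, 180], [2800, 3000, 3100, 3300, 3400])
    print(a, b)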
How do you get this answer?
• Differentiate S with respect to a and b, and set the results to 0:
  ∂S/∂a = −2 Σi (yi − a − bxi) = 0
  ∂S/∂b = −2 Σi xi(yi − a − bxi) = 0
We get:
  na + b Σi xi − Σi yi = 0
  a Σi xi + b Σi xi² − Σi xiyi = 0
This is two equations with two unknowns, and solving them gives the answer.
How close are the data to the fitted line? R2
• Define
  – SSE: Error sum of squares, SSE = Σi (yi − a − bxi)²
  – SSR: Regression sum of squares, SSR = Σi (a + bxi − ȳ)²
  – SST: Total sum of squares, SST = Σi (yi − ȳ)²
• We can show that SST = SSR + SSE
• Define
  R² = SSR/SST = 1 − SSE/SST = corr(x, y)²
• R2 is the "coefficient of determination"
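As a quick check of these identities, a small sketch continuing the hypothetical data above (values are made up):

    import numpy as np

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    sse = np.sum((y - y_hat)**2)          # error sum of squares
    ssr = np.sum((y_hat - y.mean())**2)   # regression sum of squares
    sst = np.sum((y - y.mean())**2)       # total sum of squares
    # all three expressions for R^2 agree:
    print(ssr / sst, 1 - sse / sst, np.corrcoef(x, y)[0, 1]**2)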
What is the logic behind R2?
[Diagram: a fitted line ŷi = a + bxi through the data; at a point xi, SSE measures the distances yi − ŷi, SSR the distances ŷi − ȳ, and SST the distances yi − ȳ]
Example: Regression of birth weight with mother's weight as independent variable

Model Summary (dependent variable: birthweight; predictors: (Constant), weight in pounds):
  R = 0.186   R Square = 0.035   Adjusted R Square = 0.029   Std. Error of the Estimate = 718.243

Coefficients (dependent variable: birthweight):
                     B          Std. Error   Beta    t        Sig.    95% CI for B
  (Constant)         2369.672   228.431              10.374   0.000   (1919.040, 2820.304)
  weight in pounds   4.429      1.713        0.186   2.586    0.010   (1.050, 7.809)
Interpretation:
• We have fitted the line
  Birth weight = 2369.672 + 4.429*mother's weight
• If mother's weight increases by 20 pounds, what is the predicted impact on the infant's birth weight? 4.429*20 ≈ 89 grams
• What is the predicted birth weight of an infant with a 150 pound mother? 2369.672 + 4.429*150 ≈ 3034 grams
But how to answer questions like:
• Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?
• What is a confidence interval for the estimated slope?
• What is the prediction, with uncertainty, at a new x value?
Confidence intervals for simple regression
• In a simple regression model,
  – a estimates β0
  – b estimates β1
  – σ̂² = SSE/(n − 2) estimates σ²
• Also,
  (b − β1)/sb ~ t(n−2)
  where sb² = σ̂²/((n − 1)sx²) estimates the variance of b
• So a confidence interval for β1 is given by
  b ± t(n−2, α/2) sb
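A sketch of this interval computation (scipy assumed for the t quantile; same made-up data as before):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    sse = np.sum((y - (a + b * x))**2)
    sigma2_hat = sse / (n - 2)                                  # estimates sigma^2
    s_b = np.sqrt(sigma2_hat / ((n - 1) * np.var(x, ddof=1)))   # estimated sd of b
    t_crit = stats.t.ppf(0.975, df=n - 2)                       # t_{n-2, alpha/2} for 95%
    print(b - t_crit * s_b, b + t_crit * s_b)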
Hypothesis testing for simple regression
• Choose hypotheses:
  H0: β1 = 0   vs   H1: β1 ≠ 0
• Test statistic:
  b/sb ~ t(n−2) under H0
• Reject H0 if b/sb > t(n−2, α/2) or b/sb < −t(n−2, α/2)
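In practice this test is built in; scipy's linregress, for instance, returns the slope, its standard error, and the two-sided p-value for H0: β1 = 0 (same made-up data):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    res = stats.linregress(x, y)
    # res.slope / res.stderr is the t statistic; compare res.pvalue to alpha
    print(res.slope, res.stderr, res.pvalue)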
Prediction from a simple regression model
• A regression model can be used to predict the response at a new value xn+1
• The uncertainty in this prediction comes from two sources:
  – The uncertainty in the regression line
  – The uncertainty of any response, given the regression line
• A confidence interval for the prediction:
  a + bxn+1 ± t(n−2, α/2) σ̂ √(1 + 1/n + (xn+1 − x̄)²/Σi(xi − x̄)²)
Example: The confidence interval for the predicted birth weight of an infant with a 150 pound mother

• Found that the predicted weight was 3034 grams
• The confidence interval for the prediction is:
  2369.67 + 4.43*150 ± t(187, 0.025)*718.24*√(1 + 1/189 + (150 − 129.81)²/175798.52)
  with t(187, 0.025) ≈ 1.96 and σ̂ = 718.24 (the standard error of the estimate)
• Which becomes approximately (1620.8, 4447.2)
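A sketch of the full prediction interval computation under the formula above (made-up data again, so the numbers will differ from the birth weight example):

    import numpy as np
    from scipy import stats

    x = np.array([100, 120, 140, 160, 180.0])
    y = np.array([2800, 3000, 3100, 3300, 3400.0])
    n = len(x)
    b = (n * np.sum(x * y) - x.sum() * y.sum()) / (n * np.sum(x**2) - x.sum()**2)
    a = y.mean() - b * x.mean()
    sigma_hat = np.sqrt(np.sum((y - (a + b * x))**2) / (n - 2))  # std. error of the estimate
    x_new = 150.0
    pred = a + b * x_new
    half = stats.t.ppf(0.975, n - 2) * sigma_hat * np.sqrt(
        1 + 1/n + (x_new - x.mean())**2 / np.sum((x - x.mean())**2))
    print(pred - half, pred + half)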
More than one independent variable: Multiple regression
• Assume we have data of the type (x11, x12, x13, y1), (x21, x22, x23, y2), ...
• We want to "explain" y from the x-values by fitting the following model:
  y = a + bx1 + cx2 + dx3
• Just like before, one can produce formulas for a, b, c, d minimizing the sum of the squares of the "errors".
• x1, x2, x3 can be transformations of different variables, or transformations of the same variable
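Such formulas are rarely written out by hand; in code, the usual route is a design matrix and a linear solver. A sketch with numpy (the data are synthetic placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    x1, x2, x3 = rng.normal(size=(3, 20))            # synthetic explanatory variables
    y = 1.0 + 2.0*x1 - 0.5*x2 + 0.3*x3 + rng.normal(scale=0.1, size=20)

    X = np.column_stack([np.ones(20), x1, x2, x3])   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # minimizes the sum of squared errors
    a, b, c, d = coef
    print(a, b, c, d)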
Multiple regression model
  yi = β0 + β1x1i + β2x2i + ... + βKxKi + εi
• The errors εi are independent random (normal) variables with expectation zero and variance σ²
• The explanatory variables x1i, x2i, ..., xKi cannot be linearly related
New example: Traffic deaths in 1976 (from file crash on textbook CD)
• Want to find if there is any relationship between the highway death rate (deaths per 1000 per state) in the U.S. and the following variables:
  – Average car age (in months)
  – Average car weight (in 1000 pounds)
  – Percentage light trucks
  – Percentage imported cars
• All data are per state
First: Scatter plots:
[Four scatter plots of deaths (0.05–0.35) against carage (69–71.5), impcars (0–30), lghttrks (5–35), and vehwt (3.0–3.8)]
Univariate effects (one independent variable at a time!):
• Hence: If all else is equal, if average car age increases by one month, you get 0.062 fewer deaths per 1000 inhabitants; increase age by 12 months, and you get 12*0.062 ≈ 0.74 fewer deaths per 1000 inhabitants
Deaths per 1000 = a + b*car age (in months)

Model Summary (dependent variable: deaths; predictors: (Constant), carage):
  R = 0.492   R Square = 0.242   Adjusted R Square = 0.226   Std. Error of the Estimate = 0.05206

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   4.516    1.134                 3.981    0.000   (2.233, 6.800)
  carage       -0.062   0.016        -0.492   -3.834   0.000   (-0.094, -0.029)
Deaths per 1000 = a + b*car weight (in 1000 pounds)

Model Summary (dependent variable: deaths; predictors: (Constant), vehwt):
  R = 0.281   R Square = 0.079   Adjusted R Square = 0.059   Std. Error of the Estimate = 0.05740

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta    t        Sig.    95% CI for B
  (Constant)   -0.271   0.221                -1.227   0.226   (-0.716, 0.174)
  vehwt        0.124    0.062        0.281   1.983    0.053   (-0.002, 0.249)
Univariate effects cont’d (one independent variable at a time!):
Model Summary (dependent variable: deaths; predictors: (Constant), lghttrks):
  R = 0.716   R Square = 0.512   Adjusted R Square = 0.501   Std. Error of the Estimate = 0.04178

Coefficients (dependent variable: deaths):
               B       Std. Error   Beta    t       Sig.    95% CI for B
  (Constant)   0.046   0.018                2.478   0.017   (0.009, 0.083)
  lghttrks     0.007   0.001        0.716   6.947   0.000   (0.005, 0.010)

Model Summary (dependent variable: deaths; predictors: (Constant), impcars):
  R = 0.308   R Square = 0.095   Adjusted R Square = 0.075   Std. Error of the Estimate = 0.05690

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   0.206    0.020                 10.462   0.000   (0.166, 0.246)
  impcars      -0.004   0.002        -0.308   -2.193   0.033   (-0.007, 0.000)

Hence: Increasing the percentage of light trucks by 20 means 20*0.007 = 0.14 more deaths per 1000 inhabitants.
Predicted number of deaths per 1000 if the percentage of imported cars is 10%: 0.206 − 0.004*10 ≈ 0.17
Building a multiple regression model:
• Forward regression: Try all independent variables, one at a time; keep the variable with the lowest p-value
• Repeat step 1, with the independent variable from the first round now included in the model
• Repeat until no more variables can be added to the model (no more significant variables)
• Backward regression: Include all independent variables in the model; remove the variable with the highest p-value
• Continue until only significant variables are left
• However: These methods are not always correct to use in practice!
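A rough sketch of forward selection in Python using statsmodels (the DataFrame df, the response name, and the alpha cutoff are illustration assumptions; remember the caveat above about using these procedures blindly):

    import statsmodels.api as sm

    def forward_select(df, response, alpha=0.05):
        # greedily add the variable with the lowest p-value until none is significant
        chosen = []
        remaining = [c for c in df.columns if c != response]
        while remaining:
            pvals = {}
            for cand in remaining:
                X = sm.add_constant(df[chosen + [cand]])
                pvals[cand] = sm.OLS(df[response], X).fit().pvalues[cand]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            chosen.append(best)
            remaining.remove(best)
        return chosen

    # usage (hypothetical): forward_select(crash_df, "deaths")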
For the traffic deaths, end up with:
• Deaths per 1000=2.7-0.037*car age +0.006*perc. light trucks
Model Summary (dependent variable: deaths; predictors: (Constant), lghttrks, carage):
  R = 0.768   R Square = 0.590   Adjusted R Square = 0.572   Std. Error of the Estimate = 0.03871

Coefficients (dependent variable: deaths):
               B        Std. Error   Beta     t        Sig.    95% CI for B
  (Constant)   2.668    0.895                 2.981    0.005   (0.865, 4.470)
  carage       -0.037   0.013        -0.295   -2.930   0.005   (-0.063, -0.012)
  lghttrks     0.006    0.001        0.622    6.181    0.000   (0.004, 0.009)
Conclusion: Did a multiple linear regression on traffic deaths, with car age, car weight, percentage light trucks, and percentage imported cars as independent variables. Car age (in months, β = −0.037, 95% CI = (−0.063, −0.012)) and percentage light trucks (β = 0.006, 95% CI = (0.004, 0.009)) were significant at the 5% level.
Check of assumptions:
[Histogram of regression standardized residuals (dependent variable: deaths); Mean = 2.23E-17, Std. Dev. = 0.978, N = 48]
[Normal P-P plot of regression standardized residuals (dependent variable: deaths)]
Check of assumptions cont’d:
[Scatterplot of regression standardized residuals against regression standardized predicted values (dependent variable: deaths)]
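These plots are straightforward to reproduce; a sketch with matplotlib (the residuals here are random placeholders standing in for the fitted model's standardized residuals):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    fitted = rng.normal(size=48)    # placeholder standardized predicted values
    resid = rng.normal(size=48)     # placeholder standardized residuals

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(resid, bins=12)        # should look roughly normal
    ax1.set_xlabel("standardized residual")
    ax2.scatter(fitted, resid)      # should show no trend and equal spread
    ax2.axhline(0, color="gray")
    ax2.set_xlabel("standardized predicted value")
    plt.show()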
Least squares estimation
• For the model yi = β0 + β1x1i + β2x2i + ... + βKxKi + εi, the least squares estimates of β0, β1, ..., βK are the values b0, b1, ..., bK minimizing
  SSE = Σi (b0 + b1x1i + b2x2i + ... + bKxKi − yi)²
• They can be computed with formulas similar to, but more complex than, those for simple regression
Explanatory power
• Defining the fitted values
  ŷi = b0 + b1x1i + b2x2i + ... + bKxKi
  SST = Σi (yi − ȳ)²
  SSE = Σi (yi − ŷi)²
  SSR = Σi (ŷi − ȳ)²
• We get as before
  SST = SSR + SSE
• We define the coefficient of determination
  R² = SSR/SST = 1 − SSE/SST
• We also get that
  R = Corr(y, ŷ)
Adjusted coefficient of determination
• Adding more independent variables will generally increase SSR and decrease SSE
• Thus the coefficient of determination will tend to indicate that models with many variables always fit better.
• To avoid this effect, the adjusted coefficient of determination may be used:
  adjusted R² = 1 − (SSE/(n − K − 1)) / (SST/(n − 1))
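As a one-line helper (a sketch; sse and sst as defined above, K explanatory variables, n observations):

    def adjusted_r2(sse, sst, n, K):
        # penalizes extra variables via the degrees-of-freedom ratio
        return 1 - (sse / (n - K - 1)) / (sst / (n - 1))

    # e.g. adjusted_r2(sse=0.5, sst=2.0, n=48, K=2) -> about 0.739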
Drawing inference about the model parameters
• Similar to simple regression, we get that the following statistic has a t distribution with n − K − 1 degrees of freedom:
  t = (bj − βj)/sbj
  where bj is the least squares estimate for βj and sbj is its estimated standard deviation
• sbj is computed from SSE and the correlations between the independent variables
Confidence intervals and hypothesis tests
• A confidence interval for βj becomes
  bj ± t(n−K−1, α/2) sbj
• Testing the hypothesis H0: βj = 0 vs H1: βj ≠ 0:
  – Reject H0 if bj/sbj > t(n−K−1, α/2) or bj/sbj < −t(n−K−1, α/2)
Testing sets of parameters
• We can also test the null hypothesis that a specific set of the betas are simultaneously zero. The alternative hypothesis is that at least one beta in the set is nonzero.
• The test statistic has an F distribution, and is computed by comparing the SSE in the full model, and the SSE when setting the parameters in the set to zero.
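A sketch of that comparison: with q betas set to zero, SSE_restricted from the reduced fit and SSE_full from the full fit of K variables on n observations, the statistic is F = ((SSE_restricted − SSE_full)/q) / (SSE_full/(n − K − 1)):

    from scipy import stats

    def f_test(sse_restricted, sse_full, q, n, K):
        # F test that q of the betas are simultaneously zero
        F = ((sse_restricted - sse_full) / q) / (sse_full / (n - K - 1))
        p = stats.f.sf(F, q, n - K - 1)   # upper-tail p-value
        return F, p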
Making predictions from the model
• As in simple regression, we can use the estimated coefficients to make predictions
• As in simple regression, the uncertainty in the predictions has two sources:
  – The variance around the regression estimate
  – The variance of the estimated regression model
What if the relationship is non-linear: Transformed variables
• The relationship between variables may not be linear
• Example: The natural model may be
  y = a·e^(bx)
• We want to find a and b so that this curve approximates the points as well as possible
[Plot: points following the curve y = a·e^(bx), x from 15 to 30, y from 0.05 to 0.20]
Example (cont.)
• When y = a·e^(bx), then log(y) = log(a) + bx
• Use the standard formulas on the pairs (x1, log(y1)), (x2, log(y2)), ..., (xn, log(yn))
• We get estimates for log(a) and b, and thus a and b
[Plot: the same data with the fitted curve y = a·e^(bx), x from 15 to 30, y from 0.05 to 0.20]
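A sketch of this transform-then-fit recipe (np.polyfit does the simple regression; the data are invented to roughly follow y = a·e^(bx)):

    import numpy as np

    x = np.array([15, 18, 21, 24, 27, 30.0])
    y = np.array([0.05, 0.07, 0.09, 0.12, 0.16, 0.20])

    # regress log(y) on x: the slope is b, the intercept is log(a)
    b, log_a = np.polyfit(x, np.log(y), 1)
    a = np.exp(log_a)
    print(a, b)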
Another example of transformed variables
• Another natural model may be
  y = a·x^b
• We get that
  log(y) = log(a) + b·log(x)
• Use the standard formulas on the pairs (log(x1), log(y1)), (log(x2), log(y2)), ..., (log(xn), log(yn))
[Plot: points following the curve y = a·x^b, x from 0 to 8, y from 0.008 to 0.016]
Note: In this model, the curve goes through (0, 0)
A third example:
• Assume data (x1, y1), ..., (xn, yn) seem to follow a third degree polynomial
  y = a + bx + cx² + dx³
• We use multiple regression on (x1, x1², x1³, y1), (x2, x2², x2³, y2), ...
• We get estimates a, b, c, d in a third degree polynomial curve
[Plot: points following a third degree polynomial, x from 0 to 3, y from −3 to 0]
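A sketch of the cubic fit set up as a multiple regression (synthetic data; np.polyfit(x, y, 3) would give the same coefficients):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 3.0, 20)
    y = 0.1 - 1.2*x + 0.8*x**2 - 0.3*x**3 + rng.normal(scale=0.05, size=20)

    X = np.column_stack([np.ones_like(x), x, x**2, x**3])  # treat x, x^2, x^3 as three variables
    (a, b, c, d), *_ = np.linalg.lstsq(X, y, rcond=None)
    print(a, b, c, d)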
Doing a regression analysis
• Plot the data first, to investigate whether there is a natural relationship
• Linear or transformed model?
• Are there outliers which will unduly affect the result?
• Fit a model. Different models with the same number of parameters may be compared with R2
• Check the assumptions!
• Make tests / confidence intervals for parameters
• A lot of practice is needed!
Conclusion and further options
• Regression versus correlation:
  – Can include more independent variables in regression
  – Gives a more detailed picture of the effect an independent variable has on the dependent variable
• What if the dependent variable only has two possible values? Logistic regression
  – Similar to linear regression
  – But the interpretation of the β's is different: exp(β) is interpreted as an odds ratio instead of the slope of a line