AMS 572 Group #2: Multiple Linear Regression

Upload: ellen-cameron

Post on 01-Apr-2015


AMS 572 Group #2

Multiple Linear Regression


Outline

• Jinmiao Fu: Introduction and History

• Ning Ma: Establishing and Fitting the Model

• Ruoyu Zhou: Multiple Regression Model in Matrix Notation

• Dawei Xu and Yuan Shang: Statistical Inference for Multiple Regression

• Yu Mu: Regression Diagnostics

• Chen Wang and Tianyu Lu: Topics in Regression Modeling

• Tian Feng: Variable Selection Methods

• Hua Mo: Chapter Summary and Modern Application


Introduction

• Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every set of values of the independent variables x is associated with a value of the dependent variable y.

Example: The relationship between an adult's health and his or her daily intake of wheat, vegetables, and meat.

History

Karl Pearson (1857–1936): lawyer, Germanist, eugenicist, mathematician and statistician

• Correlation coefficient, method of moments, Pearson's system of continuous curves

• Chi distance, p-value

• Statistical hypothesis testing theory, statistical decision theory

• Pearson's chi-square test, principal component analysis

Sir Francis Galton FRS (16 February 1822 – 17 January 1911)

Fields: anthropology and polymathy

Doctoral student: Karl Pearson

In the late 1860s, Galton conceived the standard deviation. He created the statistical concept of correlation and also discovered the properties of the bivariate normal distribution and its relationship to regression analysis


Galton invented the use of the regression line (Bulmer 2003, p. 184), and was the first to describe and explain the common phenomenon of regression toward the mean, which he first observed in his experiments on the size of the seeds of successive generations of sweet peas.


The publication by his cousin Charles Darwin of The Origin of Species in 1859 was an event that changed Galton's life. He came to be gripped by the work, especially the first chapter on "Variation under Domestication" concerning the breeding of domestic animals.


Adrien-Marie Legendre (18 September 1752 – 10 January 1833) was a French mathematician. He made important contributions to statistics, number theory, abstract algebra and mathematical analysis.

He developed the least squares method, which has broad application in linear regression, signal processing, statistics, and curve fitting.


Johann Carl Friedrich Gauss (30 April 1777 – 23 February 1855) was a German mathematician and scientist who contributed significantly to many fields, including number theory, statistics, analysis, differential geometry, geodesy, geophysics, electrostatics, astronomy and optics.

Gauss, who was 23 at the time, heard about the problem of recovering the orbit of the newly discovered Ceres and tackled it. After three months of intense work, he predicted a position for Ceres in December 1801, just about a year after its first sighting, and this turned out to be accurate within a half-degree. In the process, he so streamlined the cumbersome mathematics of 18th century orbital prediction that his work, published a few years later as Theory of Celestial Movement, remains a cornerstone of astronomical computation.


It introduced the Gaussian gravitational constant, and contained an influential treatment of the method of least squares, a procedure used in all sciences to this day to minimize the impact of measurement error. Gauss was able to prove the method in 1809 under the assumption of normally distributed errors (see Gauss–Markov theorem; see also Gaussian). The method had been described earlier by Adrien-Marie Legendre in 1805, but Gauss claimed that he had been using it since 1795.


Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962) was an English statistician, evolutionary biologist, eugenicist and geneticist. He was described by Anders Hald as "a genius who almost single-handedly created the foundations for modern statistical science," and Richard Dawkins described him as "the greatest of Darwin's successors".

In addition to "analysis of variance", Fisher invented the technique of maximum likelihood and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminant and Fisher information.

Establishing and Fitting the Model

Probabilistic Model

$y_i$: the observed value of the random variable (r.v.) $Y_i$

$Y_i$ depends on fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$, $i = 1, 2, 3, \ldots, n$:

$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$

$\beta_0, \beta_1, \ldots, \beta_k$: unknown model parameters

$\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$

n is the number of observations.

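Fitting the model

• LS provides estimates of the unknown model parameters $\beta_0, \beta_1, \ldots, \beta_k$, which minimize Q:

$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) \right]^2$

• The LS estimates are found by setting the partial derivatives of Q with respect to $\beta_0$ and $\beta_j$ (j=1,2,…,k) equal to zero.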

Tire tread wear vs. mileage (Example 11.1 in textbook)

Mileage (in 1000 miles)    Groove Depth (in mils)
 0                         394.33
 4                         329.50
 8                         291.00
12                         255.17
16                         229.33
20                         204.83
24                         179.00
28                         163.83
32                         150.33

• The table gives the measurements on the groove of one tire after every 4000 miles.

• Our Goal: to build a model to find the relation between the mileage and groove depth of the tire.

SAS code: fitting the model

data example;
  input mile depth @@;
  sqmile = mile*mile;
  datalines;
0 394.33 4 329.5 8 291 12 255.17 16 229.33 20 204.83 24 179 28 163.83 32 150.33
;
run;

proc reg data=example;
  model depth = mile sqmile;
run;

Fitted model: depth = 386.26 - 12.77 mile + 0.172 sqmile

Goodness of Fit of the Model

• Residuals: $e_i = y_i - \hat{y}_i \quad (i = 1, 2, \ldots, n)$

• $\hat{y}_i$ are the fitted values: $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} \quad (i = 1, 2, \ldots, n)$

An overall measure of the goodness of fit:

Error sum of squares (SSE): $Q_{\min} = SSE = \sum_{i=1}^{n} e_i^2$

Total sum of squares (SST): $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Regression sum of squares (SSR): $SSR = SST - SSE$
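The sums of squares can be illustrated on the tire data. The following is a minimal sketch that assumes the rounded coefficients reported for the quadratic fit (386.26, -12.77, 0.172), so the resulting values are approximate; the variable names are illustrative, and the ratio SSR/SST is the r² used later in the slides.

```python
# Tire tread wear data from Example 11.1
# (mileage in 1000 miles, groove depth in mils).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

# Assumption: rounded coefficients of the fitted quadratic model,
# so SSE is slightly larger than the exact least squares minimum.
b0, b1, b2 = 386.26, -12.77, 0.172

fitted = [b0 + b1 * x + b2 * x * x for x in mileage]
residuals = [y - f for y, f in zip(depth, fitted)]

y_bar = sum(depth) / len(depth)
sse = sum(e * e for e in residuals)          # error sum of squares
sst = sum((y - y_bar) ** 2 for y in depth)   # total sum of squares
ssr = sst - sse                              # regression sum of squares
r_squared = ssr / sst

print(f"SSE = {sse:.2f}, SST = {sst:.2f}, SSR = {ssr:.2f}, r^2 = {r_squared:.4f}")
```

An r² close to 1 indicates that the quadratic model accounts for nearly all of the variation in groove depth.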

Multiple Regression Model in Matrix Notation

1. Transform the Formulas to Matrix Notation

• The first column of X, a column of 1's, corresponds to the constant term $\beta_0$ (we can treat this as $\beta_0 x_{i0}$ with $x_{i0} = 1$).

• Finally, let $\beta = (\beta_0, \beta_1, \ldots, \beta_k)^T$ and $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)^T$, the (k+1)×1 vectors of unknown parameters and of their LS estimates, respectively.

• The formula $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$ becomes

$Y = X\beta + \epsilon$

• Simultaneously, the linear equations obtained by setting the partial derivatives of Q to zero are changed to

$X^T X \beta = X^T y$

Solving this equation with respect to $\beta$, we get

$\hat\beta = (X^T X)^{-1} X^T y$

(if the inverse of the matrix $X^T X$ exists).

2. Example 11.2 (Tire Wear Data: Quadratic Fit Using Hand Calculations)

• We will do Example 11.1 again in this part using the matrix approach.

• For the quadratic model to be fitted, $y = \beta_0 + \beta_1 x + \beta_2 x^2$, the ith row of X is $(1, x_i, x_i^2)$ and y is the vector of groove depths.

• According to the formula $\hat\beta = (X^T X)^{-1} X^T y$, we need to calculate $X^T X$ first and then invert it to get $(X^T X)^{-1}$.

• Finally, we calculate the vector of LS estimates $\hat\beta$.

• Therefore, the LS quadratic model is

$\hat{y} = 386.26 - 12.77x + 0.172x^2$

This model is the same as the one we obtained in Example 11.1.
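The matrix calculation of Example 11.2 can be sketched in a few lines. This is an illustrative sketch, not the textbook's hand calculation: it solves the normal equations by Gaussian elimination rather than forming the inverse explicitly, and the helper names `solve`, `XtX` and `Xty` are assumptions introduced here.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

# Tire tread wear data (mileage in 1000 miles, groove depth in mils).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

# Design matrix for the quadratic model: columns 1, x, x^2.
X = [[1.0, x, x * x] for x in mileage]
p = len(X[0])

# X^T X and X^T y, built entry by entry.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * y for row, y in zip(X, depth)) for i in range(p)]

# LS estimates: the solution of the normal equations (X^T X) beta = X^T y.
beta_hat = solve(XtX, Xty)
print("X^T X =", XtX)
print("beta_hat =", [round(b, 4) for b in beta_hat])
```

Solving the normal equations directly gives the same estimates as multiplying by the inverse, and it avoids one source of rounding error.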

Statistical Inference for Multiple Regression

• Determine which predictor variables have statistically significant effects

• We test the hypotheses: $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$

• If we can't reject $H_{0j}$, then $x_j$ is not a significant predictor of y.

Statistical Inference on $\beta$'s

• Review statistical inference for Simple Linear Regression:

$\hat\beta_1 \sim N\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right) \;\Rightarrow\; \frac{\hat\beta_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0,1)$

$W = \frac{(n-2)S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$

$t = \frac{(\hat\beta_1 - \beta_1)/(\sigma/\sqrt{S_{xx}})}{\sqrt{W/(n-2)}} = \frac{\hat\beta_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}$

• What about Multiple Regression?

• The steps are similar:

$\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj}) \;\Rightarrow\; \frac{\hat\beta_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1)$

$W = \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$

$t = \frac{(\hat\beta_j - \beta_j)/(\sigma\sqrt{V_{jj}})}{\sqrt{W/[n-(k+1)]}} = \frac{\hat\beta_j - \beta_j}{S\sqrt{V_{jj}}} \sim t_{n-(k+1)}$

• What's $V_{jj}$? Why $\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj})$?

1. Mean

Recall from simple linear regression that the least squares estimators for the regression parameters $\beta_0$ and $\beta_1$ are unbiased.

Here, the vector $\hat\beta$ of least squares estimators is also unbiased:

$E(\hat\beta) = \left(E(\hat\beta_0), E(\hat\beta_1), \ldots, E(\hat\beta_k)\right)^T = (\beta_0, \beta_1, \ldots, \beta_k)^T = \beta$

2. Variance

• Constant variance assumption: $Var(\epsilon_i) = \sigma^2$

Since the errors are independent, the covariance matrix of Y is diagonal:

$var(Y) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_n$

• Let $V_{jj}$ be the jth diagonal entry of the matrix $V = (X^T X)^{-1}$ (j = 0, 1, …, k).

$\hat\beta = (X^T X)^{-1} X^T Y = cY, \quad c = (X^T X)^{-1} X^T$

$var(\hat\beta) = var(cY) = c\, var(Y)\, c^T = (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$

$\Rightarrow\; var(\hat\beta_j) = \sigma^2 V_{jj}$

To sum up, $E(\hat\beta_j) = \beta_j$ and $var(\hat\beta_j) = \sigma^2 V_{jj}$, and we get

$\hat\beta_j \sim N(\beta_j, \sigma^2 V_{jj}) \;\Rightarrow\; \frac{\hat\beta_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1)$

• Like simple linear regression, the unbiased estimator of the unknown error variance $\sigma^2$ is given by

$S^2 = \frac{\sum e_i^2}{n-(k+1)} = \frac{SSE}{n-(k+1)} = MSE$, where n-(k+1) is the error d.f.

$W = \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$

and $S^2$ and $\hat\beta_j$ are statistically independent.

• Therefore,

$\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1), \quad \frac{[n-(k+1)]S^2}{\sigma^2} \sim \chi^2_{n-(k+1)}$

$t = \frac{(\hat\beta_j - \beta_j)/(\sigma\sqrt{V_{jj}})}{\sqrt{\frac{[n-(k+1)]S^2}{\sigma^2} \Big/ [n-(k+1)]}} = \frac{\hat\beta_j - \beta_j}{S\sqrt{V_{jj}}} \sim t_{n-(k+1)}$

$t = \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \sim t_{n-(k+1)}, \quad \text{where } SE(\hat\beta_j) = s\sqrt{v_{jj}}$

• Derivation of the confidence interval of $\beta_j$:

$P\left(-t_{n-(k+1),\alpha/2} \le \frac{\hat\beta_j - \beta_j}{SE(\hat\beta_j)} \le t_{n-(k+1),\alpha/2}\right) = 1 - \alpha$

$P\left(\hat\beta_j - t_{n-(k+1),\alpha/2}\, SE(\hat\beta_j) \le \beta_j \le \hat\beta_j + t_{n-(k+1),\alpha/2}\, SE(\hat\beta_j)\right) = 1 - \alpha$

The 100(1-α)% confidence interval for $\beta_j$ is

$\hat\beta_j \pm t_{n-(k+1),\alpha/2}\, SE(\hat\beta_j)$
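As an illustration of these formulas, the sketch below computes the standard errors, t statistics and 95% confidence intervals for the quadratic tire model. It is a hedged example rather than textbook output: the helpers `solve` and `inverse` are written here for self-containment, and the critical value 2.447 is the tabulated $t_{6, 0.025}$, entered by hand because the standard library has no t quantile function.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def inverse(a):
    """Invert a square matrix by solving a x = e_j for each unit vector e_j."""
    n = len(a)
    cols = [solve(a, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

X = [[1.0, x, x * x] for x in mileage]
n, p = len(X), len(X[0])          # p = k + 1 parameters
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * y for row, y in zip(X, depth)) for i in range(p)]

V = inverse(XtX)                  # V = (X^T X)^{-1}
beta_hat = solve(XtX, Xty)
residuals = [y - sum(b * x for b, x in zip(beta_hat, row)) for row, y in zip(X, depth)]
s2 = sum(e * e for e in residuals) / (n - p)      # MSE with n-(k+1) d.f.

se = [math.sqrt(s2 * V[j][j]) for j in range(p)]  # SE(beta_j) = s * sqrt(v_jj)
t_stats = [b / s for b, s in zip(beta_hat, se)]   # test of H0j: beta_j = 0

T_CRIT = 2.447  # assumption: tabulated t_{6, 0.025}
ci = [(b - T_CRIT * s, b + T_CRIT * s) for b, s in zip(beta_hat, se)]

for j in range(p):
    print(f"beta_{j}: est={beta_hat[j]:.4f} SE={se[j]:.4f} "
          f"t={t_stats[j]:.2f} CI=({ci[j][0]:.4f}, {ci[j][1]:.4f})")
```

A coefficient whose interval excludes zero is significant at the 5% level, which is the same conclusion as comparing |t| with the critical value.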

An α-level test of the hypotheses

$H_{0j}: \beta_j = \beta_j^0$ vs. $H_{1j}: \beta_j \neq \beta_j^0$

$P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true}) = P(|t_j| > c) = \alpha \;\Rightarrow\; c = t_{n-(k+1),\alpha/2}$

• Reject $H_{0j}$ if

$|t_j| = \frac{|\hat\beta_j - \beta_j^0|}{SE(\hat\beta_j)} > t_{n-(k+1),\alpha/2}$


Prediction of Future Observation

• Having fitted a multiple regression model, suppose we wish to predict the future value of Y for a specified vector of predictor variables x*=(x0*,x1*,…,xk*)

• One way is to estimate E(Y*) by a confidence interval(CI).

Let $\mu^* = E(Y^*)$. Its estimate is

$\hat\mu^* = \hat{Y}^* = \hat\beta_0 + \hat\beta_1 x_1^* + \cdots + \hat\beta_k x_k^* = (x^*)^T \hat\beta$

$Var[(x^*)^T \hat\beta] = (x^*)^T Var(\hat\beta)\, x^* = \sigma^2 (x^*)^T (X^T X)^{-1} x^* = \sigma^2 (x^*)^T V x^*$

Replacing $\sigma^2$ by its estimate $s^2 = MSE$, which has n-(k+1) d.f., and using the same methods as in Simple Linear Regression, a (1-α)-level CI for $\mu^*$ is given by

$\hat\mu^* - t_{n-(k+1),\alpha/2}\, s \sqrt{(x^*)^T V x^*} \;\le\; \mu^* \;\le\; \hat\mu^* + t_{n-(k+1),\alpha/2}\, s \sqrt{(x^*)^T V x^*}$

• F-Test for $\beta_j$'s

Consider:

$H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_1$: at least one $\beta_j \neq 0$.

Here $H_0$ is the overall null hypothesis, which states that none of the x variables are related to y. The alternative states that at least one of them is related.

How to Build an F-Test

• The test statistic F = MSR/MSE follows an F-distribution with k and n-(k+1) d.f. under $H_0$. The α-level test rejects $H_0$ if

$F = \frac{MSR}{MSE} > f_{k,\, n-(k+1),\, \alpha}$

Recall that the MSE (error mean square) is

$MSE = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)}$

with n-(k+1) degrees of freedom.

The relation between F and r²

F can be written as a function of r². By using the formulas

$SSR = r^2\, SST; \quad SSE = (1 - r^2)\, SST,$

F can be written as

$F = \frac{r^2\,[n-(k+1)]}{k\,(1 - r^2)}$

We see that F is an increasing function of r², so the F-test also tests the significance of r².
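A quick numeric check shows that the two expressions for F agree. The numbers below (SST, SSE, n, k) are hypothetical values chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def f_from_sums(ssr, sse, n, k):
    """Overall F statistic as MSR / MSE."""
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return msr / mse

def f_from_r2(r2, n, k):
    """The same F statistic written as a function of r^2."""
    return r2 * (n - (k + 1)) / (k * (1.0 - r2))

# Hypothetical illustrative values, not taken from the slides.
sst, sse, n, k = 500.0, 100.0, 25, 4
ssr = sst - sse
r2 = ssr / sst

print(f_from_sums(ssr, sse, n, k), f_from_r2(r2, n, k))
```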

Analysis of Variance (ANOVA)

The relation between SST, SSR and SSE:

$SST = SSR + SSE$

where they are respectively equal to:

$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2; \quad SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2; \quad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The corresponding degrees of freedom (d.f.) are:

$d.f.(SST) = n - 1; \quad d.f.(SSR) = k; \quad d.f.(SSE) = n - (k+1)$

ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)        F
Regression            SSR                   k                           MSR = SSR/k             F = MSR/MSE
Error                 SSE                   n-(k+1)                     MSE = SSE/[n-(k+1)]
Total                 SST                   n-1

This table gives us a clear view of the analysis of variance for Multiple Regression.

Extra Sum of Squares Method for Testing Subsets of Parameters

Previously, we considered the full model with k predictors. Now we consider the partial model:

$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i \quad (i = 1, 2, \ldots, n)$

in which the remaining m coefficients are set to zero. We can test whether these m coefficients are significant:

$H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_1$: at least one of $\beta_{k-m+1}, \ldots, \beta_k \neq 0$.

Building the F-test by Using the Extra Sum of Squares Method

Let $SSR_{k-m}$ and $SSE_{k-m}$ be the regression and error sums of squares for the partial model. Since SST is fixed regardless of the particular model,

$SST = SSR_{k-m} + SSE_{k-m} = SSR_k + SSE_k$

then we have:

$SSE_{k-m} - SSE_k = SSR_k - SSR_{k-m}$

The α-level F-test rejects the null hypothesis if

$F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} > f_{m,\, n-(k+1),\, \alpha}$

Remarks on the F-test

The numerator d.f. is m, the number of coefficients set to zero, while the denominator d.f. is n-(k+1), the error d.f. for the full model. The MSE in the denominator is the normalizing factor, which is an estimate of σ² for the full model. If the ratio is large, we reject $H_0$.

Links between ANOVA and Extra Sum of Squares Method

Letting m = k, so that the partial model contains only the intercept, we have:

$SSE_0 = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SST, \quad SSE_k = SSE$

From above we can derive:

$SSE_0 - SSE_k = SST - SSE = SSR$

Hence, the F-ratio equals:

$F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{MSR}{MSE}$

with k and n-(k+1) d.f.
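The extra sum of squares F statistic is simple to compute once the two error sums of squares are known. The sketch below uses hypothetical numbers (not from the slides) and an illustrative function name, and it also checks the link to the overall ANOVA F-test by setting m = k with the reduced-model SSE equal to SST.

```python
def partial_f(sse_reduced, sse_full, m, n, k):
    """Extra sum of squares F statistic.

    sse_reduced: SSE of the partial model with k - m predictors
    sse_full:    SSE of the full model with k predictors
    m:           number of coefficients set to zero under H0
    n:           number of observations
    """
    numerator = (sse_reduced - sse_full) / m
    denominator = sse_full / (n - (k + 1))   # MSE of the full model
    return numerator / denominator

# Hypothetical illustrative values.
n, k = 25, 4
sst, sse_full = 500.0, 100.0

# Testing m = 2 coefficients: the partial model has SSE = 160.
f_subset = partial_f(160.0, sse_full, m=2, n=n, k=k)

# Setting m = k: the partial model has only an intercept, so its SSE is SST,
# and the statistic reduces to the overall F = MSR / MSE.
f_overall = partial_f(sst, sse_full, m=k, n=n, k=k)

print(f_subset, f_overall)
```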

5 Regression Diagnostics

5.1 Checking the Model Assumptions

• Plots of the residuals against individual predictor variables: check for linearity.

• A plot of the residuals against fitted values: check for constant variance.

• A normal plot of the residuals: check for normality.

• A run chart of the residuals: check if the random errors are autocorrelated.

• Plots of the residuals against any omitted predictor variables: check if any of the omitted predictor variables should be included in the model.

Example: plots of the residuals against individual predictor variables (with SAS code)

Example: plot of the residuals against fitted values (with SAS code)

Example: normal plot of the residuals (with SAS code)

5.2 Checking for Outliers and Influential Observations

• Standardized residuals:

$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1 - h_{ii}}}$

Large values of $|e_i^*|$ indicate outlier observations.

• Hat matrix:

$H = X (X^T X)^{-1} X^T$

If the hat matrix diagonal $h_{ii} > \frac{2(k+1)}{n}$, then the ith observation is influential.
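To make the leverage rule concrete, the sketch below computes the hat matrix diagonal and the standardized residuals for the quadratic tire model. It is an illustrative sketch with helper functions written for this example (`solve`, `inverse`); the diagonal entries are computed as $h_{ii} = x_i^T (X^T X)^{-1} x_i$, which equals the ith diagonal entry of H without forming the full n×n matrix.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def inverse(a):
    """Invert a square matrix column by column."""
    n = len(a)
    cols = [solve(a, [1.0 if i == j else 0.0 for i in range(n)]) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
depth = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

X = [[1.0, x, x * x] for x in mileage]
n, p = len(X), len(X[0])                     # p = k + 1
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * y for row, y in zip(X, depth)) for i in range(p)]
V = inverse(XtX)
beta_hat = solve(XtX, Xty)

residuals = [y - sum(b * x for b, x in zip(beta_hat, row)) for row, y in zip(X, depth)]
s = math.sqrt(sum(e * e for e in residuals) / (n - p))

# Hat matrix diagonal: h_ii = x_i^T V x_i.
leverage = [sum(row[a] * V[a][b] * row[b] for a in range(p) for b in range(p))
            for row in X]

# Standardized residuals e_i* = e_i / (s * sqrt(1 - h_ii)).
std_resid = [e / (s * math.sqrt(1.0 - h)) for e, h in zip(residuals, leverage)]

cutoff = 2.0 * p / n                         # 2(k+1)/n
influential = [i for i, h in enumerate(leverage) if h > cutoff]

print("leverage:", [round(h, 3) for h in leverage])
print("standardized residuals:", [round(r, 2) for r in std_resid])
print("cutoff:", round(cutoff, 3), "influential:", influential)
```

The leverages always sum to k+1, so the cutoff 2(k+1)/n flags observations with more than twice the average leverage.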

Example: graphical exploration of outliers

Example: leverage plot

5.3 Data transformation

Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant error variance. Many seemingly nonlinear models can be written in the multiple linear regression model form after making a suitable transformation. For example,

$y = \beta_0 x_1^{\beta_1} x_2^{\beta_2}$

after transformation:

$\log y = \log \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2$

or

$y^* = \beta_0^* + \beta_1^* x_1^* + \beta_2^* x_2^*$
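A small numeric check shows why the log transformation works: data generated from the multiplicative model lie exactly on a plane in the log scale. The parameter values and predictor values below are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical parameters of the multiplicative model y = b0 * x1^b1 * x2^b2.
beta0, beta1, beta2 = 2.0, 1.5, -0.5

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [beta0 * a ** beta1 * b ** beta2 for a, b in zip(x1, x2)]

# Transformed variables: y* = log y, x1* = log x1, x2* = log x2.
y_star = [math.log(v) for v in y]
x1_star = [math.log(v) for v in x1]
x2_star = [math.log(v) for v in x2]

# Linear form: y* = log(b0) + b1 * x1* + b2 * x2*.
linear = [math.log(beta0) + beta1 * u + beta2 * v for u, v in zip(x1_star, x2_star)]

max_gap = max(abs(a - b) for a, b in zip(y_star, linear))
print("largest difference between log y and the linear form:", max_gap)
```

After the transformation, the intercept estimates log β0 rather than β0, so it must be exponentiated to return to the original scale.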


Topics in Regression Modeling


Multicollinearity

• Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.

• Examples of multicollinear predictors are the height and weight of a person, years of education and income, and the assessed value and square footage of a home.

• Consequences of high multicollinearity:
  a. Increased standard errors of the estimates of the β's
  b. Often confusing and misleading results

Detecting Multicollinearity

• Easy way: compute the correlations between all pairs of predictors. If some r are close to 1 or -1, remove one of the two correlated predictors from the model.

Correlations (diagonal entries are equal to 1):

Variable   X1     X2     X3
X1         1      r12    r13
X2         r21    1      r23
X3         r31    r32    1

Diagram labels: X1 and X2 collinear; X3 and X2 independent.

Detecting Multicollinearity

• Another way: calculate the variance inflation factor for each predictor xj:

$VIF_j = \frac{1}{1 - R_j^2}$

where $R_j^2$ is the coefficient of determination of the regression of xj on all the other predictors.

• If VIFj≥10, then there is a problem of multicollinearity.
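Both detection methods can be sketched for the simplest case of two predictors, where the R² from regressing one predictor on the other is just the squared pairwise correlation. The data below are hypothetical and deliberately nearly collinear; the function names are illustrative.

```python
import math

def pearson_r(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_uv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    s_uu = sum((a - u_bar) ** 2 for a in u)
    s_vv = sum((b - v_bar) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)

def vif(r2_j):
    """Variance inflation factor from R_j^2 (x_j regressed on the other predictors)."""
    return 1.0 / (1.0 - r2_j)

# Hypothetical predictors: x2 is roughly twice x1, so they are nearly collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

r12 = pearson_r(x1, x2)
# With only two predictors, R_1^2 = R_2^2 = r12^2.
vif_12 = vif(r12 ** 2)

print(f"r12 = {r12:.4f}, VIF = {vif_12:.1f}, problem: {vif_12 >= 10}")
```

With more than two predictors, each $R_j^2$ would come from a separate regression of xj on the remaining predictors, so pairwise correlations alone can miss multicollinearity that the VIF detects.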

Multicollinearity: Example

• See Example 11.5 on page 416. The response is the heat of cement on a per gram basis (y), and the predictors are tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3) and dicalcium silicate (x4).

• Estimated parameters in the first order model: $\hat{y} = 62.4 + 1.55x_1 + 0.510x_2 + 0.102x_3 - 0.144x_4$.

• F = 111.48 with p-value below 0.0001. Individual t-statistics and p-values: 2.08 (0.071), 0.7 (0.501), 0.14 (0.896) and -0.20 (0.844).

• Note that the sign of the estimate of β4 is the opposite of what is expected. Also, the very high F would suggest more than just one significant predictor, yet none of the individual t-tests is significant at the 0.05 level.

• Correlations were r13 = -0.824 and r24 = -0.973. Also, the VIFs were all greater than 10. So there is a multicollinearity problem in this model, and we need a variable selection algorithm to help us choose the necessary variables.

Multicollinearity: Subset Selection

• Algorithms for selecting subsets
  – All possible subsets
    • Only feasible with a small number of potential predictors (maybe 10 or fewer)
    • Can then use one or more of the possible numerical criteria to find the overall best
  – Leaps and bounds method
    • Identifies the best subsets for each value of p
    • Requires fewer variables than observations
    • Can be quite effective for medium-sized data sets
    • Advantage of having several slightly different models to compare

  – Forward stepwise regression
    • Start with no predictors
      – First include the predictor with the highest correlation with the response
      – In subsequent steps add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation
      – Stop when the numerical criterion signals a maximum (minimum)
      – Sometimes eliminate variables when their t value gets too small
    • Only possible method for very large predictor pools
    • Local optimization at each step, no guarantee of finding the overall optimum
  – Backward elimination
    • Start with all predictors in the equation
      – Remove the predictor with the smallest t value
      – Continue until the numerical criterion signals a maximum (minimum)
    • Often produces a different final model than the forward stepwise method

Multicollinearity: Best Subsets Criteria

• Numerical criteria for choosing the best subsets
  – No single generally accepted criterion
    • Should not be followed too mindlessly
  – Most common criteria combine measures of fit with added penalties for increasing complexity (number of predictors)
  – Coefficient of determination
    • Ordinary multiple R-square
    • Always increases with an increasing number of predictors, so not very good for comparing models with different numbers of predictors
  – Adjusted R-square
    • Will decrease if the increase in R-square with increasing p is small

  – Residual mean square (MSEp)
    • Equivalent to adjusted R-square, except that we look for the minimum
    • The minimum occurs when an added variable doesn't decrease the error sum of squares enough to offset the loss of an error degree of freedom
  – Mallows' Cp statistic
    • Should be about equal to p; look for small values near p
    • Need to estimate the overall error variance
  – PRESS statistic
    • The model associated with the minimum value of PRESSp is chosen
    • Intuitively easier to grasp than the Cp-criterion
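Two of these criteria are easy to write down as functions. The sketch below rests on an assumption that the slides do not state: p counts the fitted parameters including the intercept, which is the convention under which Cp should be about equal to p. The formulas used are the common textbook forms, the function names are illustrative, and the numbers are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-square; p = number of fitted parameters including the intercept
    (assumption: this convention is not spelled out in the slides)."""
    return 1.0 - (n - 1) / (n - p) * (1.0 - r2)

def mallows_cp(sse_p, sigma2_hat, n, p):
    """Mallows' Cp for a subset model with p parameters.

    sse_p:      error sum of squares of the subset model
    sigma2_hat: estimate of the error variance (e.g. MSE of the full model)
    """
    return sse_p / sigma2_hat - n + 2 * p

# Hypothetical illustrative values.
n, p = 21, 5
print(adjusted_r2(0.90, n, p))          # penalized version of R-square
print(mallows_cp(48.0, 3.0, n, p))      # close to p suggests little bias
```

Unlike ordinary R-square, the adjusted version can decrease when a predictor adds little, because the factor (n - 1)/(n - p) grows with p.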

Page 81: AMS 572 Group #2 1. 2 3 Outline Jinmiao Fu—Introduction and History Ning Ma—Establish and Fitting of the model Ruoyu Zhou—Multiple Regression Model

Muticollinearity-Forward Stepwise

• First include predictor with highest correlation with response

>FIN=4 81

Multicollinearity-Forward Stepwise

• In subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation. (If Fi > FIN = 4, enter Xi; if Fi < FOUT = 4, remove Xi.)

[Figure annotation: F > FIN = 4]

Multicollinearity-Forward Stepwise

[Figure annotations: F < FOUT = 4; F > FIN = 4]

Multicollinearity-Forward Stepwise

• Summarize the stepwise algorithms

• Therefore our “Best Model” should only include x1 and x2, which is y=52.5773+1.4683x1+0.6623x2

Multicollinearity-Forward Stepwise

• Check the significance of the model and the individual parameters again. We find that the p-values are all small and each VIF is far less than 10.

Multicollinearity-Best Subsets

• Also we can stop when numerical criterion signals maximum (minimum) and sometimes eliminate variables when t value gets too small.

Multicollinearity-Best Subsets

• The largest R squared value, 0.9824, is associated with the full model.

• The best subset which minimizes the Cp-criterion includes x1, x2.

• The subset which maximizes adjusted R squared, or equivalently minimizes MSEp, is x1, x2, x4. The adjusted R squared increases only from 0.9744 to 0.9763 by the addition of x4 to the model already containing x1 and x2.

• Thus the simpler model chosen by the Cp-criterion is preferred; the fitted model is

y=52.5773+1.4683x1+0.6623x2

Polynomial model

• Polynomial models are useful in situations where the analyst knows that curvilinear effects are present in the true response function.

• We can do this with more than one explanatory variable using Polynomial regression model:

Multicollinearity-Polynomial Models

• Multicollinearity is a problem in polynomial regression (with terms of second and higher order): $x$ and $x^2$ tend to be highly correlated.

• A special solution in polynomial models is to use $z_i = x_i - \bar{x}_i$ instead of just $x_i$. That is, first subtract the mean from each predictor and then use the deviations in the model.
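The effect of centering can be checked numerically. The sketch below is a minimal illustration in plain Python (not part of the original deck); it uses the values x = 2, 3, 4, 5, 6 from the deck's own example, and the helper name `pearson_r` is an assumption introduced here.

```python
from math import sqrt


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


x = [2, 3, 4, 5, 6]
x_bar = sum(x) / len(x)
z = [xi - x_bar for xi in x]          # centered predictor

r_raw = pearson_r(x, [xi ** 2 for xi in x])        # correlation of x and x^2
r_centered = pearson_r(z, [zi ** 2 for zi in z])   # correlation of z and z^2
print(round(r_raw, 3), round(r_centered, 3))
```

The raw correlation is close to 1, while the centered pair is uncorrelated because z is symmetric about zero; with less symmetric data, centering usually reduces the correlation rather than removing it.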

Multicollinearity – Polynomial model

• Example: $x = 2, 3, 4, 5, 6$ and $x^2 = 4, 9, 16, 25, 36$. As $x$ increases, so does $x^2$: $r_{x,x^2} = 0.98$.

• $\bar{x} = 4$, so $z = -2, -1, 0, 1, 2$ and $z^2 = 4, 1, 0, 1, 4$. Thus, $z$ and $z^2$ are no longer correlated: $r_{z,z^2} = 0$.

• We can get the estimates of the β’s from the estimates of the γ’s. Since

Dummy Predictor Variable

The dummy variable is a simple and useful method of introducing into a regression analysis information contained in variables that are not conventionally measured on a numerical scale, e.g., race, gender, region, etc.


Dummy Predictor Variable

• The categories of an ordinal variable could be assigned suitable numerical scores.

• A nominal variable with c≥2 categories can be coded using c – 1 indicator variables, X1,…,Xc-1, called dummy variables.

• Xi = 1 for the ith category and 0 otherwise

• X1 = … = Xc-1 = 0 for the cth category
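The c – 1 coding rule can be sketched in a few lines of plain Python. This is only an illustration: the function name `dummy_code` and the region data are hypothetical, and the last listed category is taken as the reference level that receives all zeros.

```python
def dummy_code(values, categories):
    """Code a nominal variable with c categories as c - 1 indicator columns.

    `categories` lists the c levels; the last one is the reference level,
    for which every indicator is 0.
    """
    indicator_levels = categories[:-1]
    return [[1 if v == level else 0 for level in indicator_levels]
            for v in values]


regions = ["north", "south", "west", "south"]   # hypothetical observations
levels = ["north", "south", "west"]             # c = 3; "west" is the reference
rows = dummy_code(regions, levels)
for region, row in zip(regions, rows):
    print(region, row)
```

Using c – 1 rather than c indicators avoids an exact linear dependence between the indicators and the intercept column.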

Dummy Predictor Variable

• If y is a worker’s salary and

$D_i = 1$ if a non-smoker, $D_i = 0$ if a smoker,

we can model this in the following way:

$$y_i = \alpha + \beta D_i + u_i$$

Dummy Predictor Variable

• Equally we could have used the dummy variable in a model with other explanatory variables. In addition to the dummy variable we could also add years of experience (x), to give:

$$y_i = \alpha + \beta D_i + \gamma x_i + u_i$$

For smoker: $E(y_i) = \alpha + \gamma x_i$

For non-smoker: $E(y_i) = (\alpha + \beta) + \gamma x_i$

Dummy Predictor Variable

[Figure: y against x, with one line for non-smokers and one for smokers; the intercepts are marked α and α+β]

Dummy Predictor Variable

• We can also add the interaction between smoking and experience with respect to their effects on salary.

$$y_i = \alpha + \beta D_i + \gamma x_i + \delta D_i x_i + u_i$$

For non-smoker: $E(y_i) = (\alpha + \beta) + (\gamma + \delta) x_i$

For smoker: $E(y_i) = \alpha + \gamma x_i$

Dummy Predictor Variable

[Figure: y against x, with one line for non-smokers and one for smokers; the intercepts are marked α and α+β]


Standardized Regression Coefficients

• We typically want to compare predictors in terms of the magnitudes of their effects on the response variable.

• We use standardized regression coefficients to judge the effects of predictors with different units

Standardized Regression Coefficients

• They are the LS parameter estimates obtained by running a regression on standardized variables, defined as follows:

$$y_i^* = \frac{y_i - \bar{y}}{s_y}, \qquad x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{s_{x_j}} \quad (i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, k)$$

• where $s_y$ and $s_{x_j}$ are the sample SD’s of $y_i$ and $x_j$.

Standardized Regression Coefficients

• Let $\hat\beta_0^* = 0$

• and $\hat\beta_j^* = \hat\beta_j \left(\dfrac{s_{x_j}}{s_y}\right) \quad (j = 1, 2, \ldots, k)$

• The magnitudes of the $\hat\beta_j^*$ can be directly compared to judge the relative effects of the $x_j$ on y.

Standardized Regression Coefficients

• Since $\hat\beta_0^* = 0$, the constant can be dropped from the model. Let $y^*$ be the vector of the $y_i^*$'s and $x^*$ be the matrix of the $x_{ij}^*$'s. Then

$$\frac{1}{n-1}\, x^{*\prime} x^* = \begin{pmatrix} 1 & r_{x_1x_2} & \cdots & r_{x_1x_k} \\ r_{x_2x_1} & 1 & \cdots & r_{x_2x_k} \\ \vdots & \vdots & \ddots & \vdots \\ r_{x_kx_1} & r_{x_kx_2} & \cdots & 1 \end{pmatrix} = R$$

$$\frac{1}{n-1}\, x^{*\prime} y^* = \begin{pmatrix} r_{yx_1} \\ r_{yx_2} \\ \vdots \\ r_{yx_k} \end{pmatrix} = r$$
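The identity between the scaled cross-products of standardized variables and the correlation matrix R can be checked numerically. The sketch below is an illustration in plain Python, not part of the deck; the two short predictor columns are the first five x1 and x2 values of the cement data used later in the deck, chosen only as sample numbers, and the helper name `standardize` is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample SD (n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


x1 = [7, 1, 11, 11, 7]        # illustrative values only
x2 = [26, 29, 56, 31, 52]
n = len(x1)

z1, z2 = standardize(x1), standardize(x2)

# Entries of (1 / (n - 1)) * x*' x*
diag_11 = sum(a * a for a in z1) / (n - 1)
off_12 = sum(a * b for a, b in zip(z1, z2)) / (n - 1)

# Pearson correlation computed directly from the raw columns
m1, m2 = mean(x1), mean(x2)
r_12 = (sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
        / sqrt(sum((a - m1) ** 2 for a in x1) * sum((b - m2) ** 2 for b in x2)))

print(diag_11, off_12, r_12)
```

The diagonal entry comes out as 1 and the off-diagonal entry equals the ordinary sample correlation, which is why all entries of R lie between -1 and 1.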

Standardized Regression Coefficients

• So we can get

$$\hat\beta^* = \begin{pmatrix} \hat\beta_1^* \\ \vdots \\ \hat\beta_k^* \end{pmatrix} = (x^{*\prime} x^*)^{-1} x^{*\prime} y^* = R^{-1} r$$

• This method of computing the $\hat\beta_j^*$'s is numerically more stable than computing the $\hat\beta_j$'s directly, because all entries of R and r are between -1 and 1.

Standardized Regression Coefficients

• Example (Given in page 424)

• From the calculation, we can obtain that $\hat\beta_1 = 0.19244$, $\hat\beta_2 = 0.3406$, and the sample standard deviations of x1, x2 and y are $s_{x_1} = 6.830$, $s_{x_2} = 0.641$, $s_y = 1.501$.

Then we have

$$\hat\beta_1^* = \hat\beta_1 \left(\frac{s_{x_1}}{s_y}\right) = 0.875, \qquad \hat\beta_2^* = \hat\beta_2 \left(\frac{s_{x_2}}{s_y}\right) = 0.105$$

Note that $\hat\beta_1^* > \hat\beta_2^*$, although $\hat\beta_1 < \hat\beta_2$. Thus x1 has a larger effect than x2 on y.

Standardized Regression Coefficients

• We can also use the matrix method to compute standardized regression coefficients.

• First we compute the correlation matrix between x1, x2 and y:

        x1      x2
x2    0.913
y     0.971   0.904

• Then we have

$$R = \begin{pmatrix} 1 & 0.913 \\ 0.913 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0.971 \\ 0.904 \end{pmatrix}$$

• Next calculate

$$R^{-1} = \frac{1}{1 - r_{x_1x_2}^2} \begin{pmatrix} 1 & -r_{x_1x_2} \\ -r_{x_1x_2} & 1 \end{pmatrix} = \begin{pmatrix} 6.009 & -5.486 \\ -5.486 & 6.009 \end{pmatrix}$$

• Hence

$$\hat\beta^* = \begin{pmatrix} \hat\beta_1^* \\ \hat\beta_2^* \end{pmatrix} = R^{-1} r = \begin{pmatrix} 0.875 \\ 0.105 \end{pmatrix}$$

• which is the same result as before.
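The matrix calculation above is small enough to reproduce by hand or in a few lines of code. The sketch below is an illustration in plain Python using the three correlations quoted in the example (0.913, 0.971, 0.904); the function name `standardized_coefficients_2` is an assumption, and the closed-form 2×2 inverse only covers the two-predictor case.

```python
def standardized_coefficients_2(r12, ry1, ry2):
    """Solve R b = r for two predictors, where R = [[1, r12], [r12, 1]]."""
    det = 1.0 - r12 ** 2                 # determinant of R
    b1 = (ry1 - r12 * ry2) / det         # first row of R^{-1} r
    b2 = (ry2 - r12 * ry1) / det         # second row of R^{-1} r
    return b1, b2


b1_star, b2_star = standardized_coefficients_2(0.913, 0.971, 0.904)
print(round(b1_star, 3), round(b2_star, 3))
```

Rounded to three decimals, the two values agree with the 0.875 and 0.105 reported in the example.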

Variable Selection Methods

How to decide their salaries?

              Lionel Messi                   Carles Puyol
Salary        10,000,000 EURO/yr             5,000,000 EURO/yr
Age           23                             32
Position      Attacker                       Defender
Experience    5 years                        11 years
Goals         more than 20 goals per year    less than 1 goal per year

How to select variables?

• 1) Stepwise Regression

• 2) Best Subset Regression


Stepwise Regression

• Partial F-test

• Partial Correlation Coefficients

• How to do it by SAS?

• Drawbacks

Partial F-test

(p-1)-Variable Model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i$$

p-Variable Model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{ip} + \varepsilon_i$$

How to do the test?

$H_{0p}: \beta_p = 0$ vs $H_{1p}: \beta_p \neq 0$

We reject $H_{0p}$ in favor of $H_{1p}$ at level α if

$$F_p = \frac{(SSE_{p-1} - SSE_p)/1}{SSE_p/[n-(p+1)]} > f_{1,\,n-(p+1),\,\alpha}$$

Another way to interpret the test:

• test statistic:

$$t_p = \frac{\hat\beta_p}{SE(\hat\beta_p)}, \qquad t_p^2 = F_p$$

• We reject $H_{0p}$ at level α if

$$|t_p| > t_{n-(p+1),\,\alpha/2}$$

Partial Correlation Coefficients

$$r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}} = \frac{SSE(x_1,\ldots,x_{p-1}) - SSE(x_1,\ldots,x_p)}{SSE(x_1,\ldots,x_{p-1})}$$

test statistic:

$$F_p = t_p^2 = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$$

*Add $x_p$ to the regression equation that includes $x_1, \ldots, x_{p-1}$ only if $F_p$ is large enough.
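The partial F statistic and the partial correlation coefficient are two views of the same SSE comparison. The sketch below illustrates this in plain Python; the SSE values, n and p are hypothetical numbers chosen only to make the arithmetic easy, the function names are assumptions, and the entry threshold 4.0 mirrors the FIN = 4 rule quoted earlier in the deck.

```python
def partial_f(sse_reduced, sse_full, n, p):
    """Partial F for adding the p-th predictor to a (p - 1)-variable model."""
    return (sse_reduced - sse_full) / (sse_full / (n - (p + 1)))


def partial_r2(sse_reduced, sse_full):
    """Squared partial correlation of y with x_p given x_1, ..., x_{p-1}."""
    return (sse_reduced - sse_full) / sse_reduced


def f_from_partial_r2(r2, n, p):
    """The same F statistic, written in terms of the squared partial correlation."""
    return r2 * (n - (p + 1)) / (1.0 - r2)


# Hypothetical values: SSE drops from 120 to 60 when x_p enters, n = 13, p = 2.
sse_reduced, sse_full, n, p = 120.0, 60.0, 13, 2

f_stat = partial_f(sse_reduced, sse_full, n, p)
r2 = partial_r2(sse_reduced, sse_full)
f_check = f_from_partial_r2(r2, n, p)

F_IN = 4.0                      # entry threshold, as in the FIN = 4 rule
enter_variable = f_stat > F_IN
print(f_stat, r2, f_check, enter_variable)
```

Because the two formulas give the same F, ranking candidate predictors by partial correlation and ranking them by partial F select the same variable at each step.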

How to do it by SAS? (Ex 9, continuation of Ex 5)

No.   X1   X2   X3   X4   Y
1     7    26   6    60   78.5
2     1    29   15   52   74.3
3     11   56   8    20   104.3
4     11   31   8    47   87.6
5     7    52   6    33   95.9
6     11   55   9    22   109.2
7     3    71   17   6    102.7
8     1    31   22   44   72.5
9     2    54   18   22   93.1
10    21   47   4    26   115.9
11    1    40   23   34   83.8
12    11   66   9    12   113.3
13    10   68   8    12   109.4

The table shows data on the heat evolved in calories during the hardening of cement on a per gram basis (y) along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

SAS Code

data example1;
input x1 x2 x3 x4 y;
datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;

proc reg data=example1;
model y = x1 x2 x3 x4 / selection=stepwise;
run;


SAS output


Interpretation

• At the first step, x4 is chosen into the equation as it has the largest correlation with y among the 4 predictors;

• At the second step, we choose x1 into the equation for it has the highest partial correlation with y controlling for x4;

• At the third step, since $r_{yx_2|x_4,x_1}$ is greater than $r_{yx_3|x_4,x_1}$, x2 is chosen into the equation rather than x3.

Interpretation

• At the 4th step, we removed x4 from the model since its partial F-statistic is too small.

• From Ex11.5, we know that x4 is highly correlated with x2. Note that in Step 4, the R-Square is 0.9787, which is slightly higher than 0.9725, the R-Square of Step 2. It indicates that even though x4 is the best single predictor of y, the pair (x1,x2) is a better predictor than the pair (x1,x4).
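The R-Square values quoted for Steps 2 and 4 can be cross-checked outside SAS. The sketch below is a plain-Python illustration, not the deck's SAS workflow: it fits least squares through the normal equations on the cement data from the SAS data step, and the helper names `ols_fit`, `r_squared` and `fit_subset` are assumptions introduced here.

```python
# Cement data from the deck's SAS data step: (x1, x2, x3, x4, y).
DATA = [
    (7, 26, 6, 60, 78.5), (1, 29, 15, 52, 74.3), (11, 56, 8, 20, 104.3),
    (11, 31, 8, 47, 87.6), (7, 52, 6, 33, 95.9), (11, 55, 9, 22, 109.2),
    (3, 71, 17, 6, 102.7), (1, 31, 22, 44, 72.5), (2, 54, 18, 22, 93.1),
    (21, 47, 4, 26, 115.9), (1, 40, 23, 34, 83.8), (11, 66, 9, 12, 113.3),
    (10, 68, 8, 12, 109.4),
]


def ols_fit(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)]
         + [sum(xi[a] * yi for xi, yi in zip(X, y))]
         for a in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (A[i][k] - s) / A[i][i]
    return beta


def r_squared(rows, y, beta):
    """Coefficient of determination 1 - SSE/SST for the fitted model."""
    fitted = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


def fit_subset(cols):
    """Fit y on the predictors whose 0-based column indices are in `cols`."""
    rows = [[rec[c] for c in cols] for rec in DATA]
    y = [rec[4] for rec in DATA]
    beta = ols_fit(rows, y)
    return beta, r_squared(rows, y, beta)


beta_12, r2_12 = fit_subset([0, 1])   # predictors x1, x2
beta_14, r2_14 = fit_subset([0, 3])   # predictors x1, x4
print([round(b, 4) for b in beta_12], round(r2_12, 4), round(r2_14, 4))
```

Solving the normal equations directly is adequate for a 13-row example; for larger or ill-conditioned problems a QR-based solver is the more usual choice.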

Drawbacks

• The final model is not guaranteed to be optimal in any specified sense.

• It yields a single final model, while in practice there are often several equally good models.


Best Subset Regression

• Comparison to Stepwise Method

• Optimality Criteria

• How to do it by SAS?

Comparison to Stepwise Regression

• In best subsets regression, a subset of variables is chosen that optimizes a well-defined objective criterion.

• The best subsets regression algorithm permits determination of a specified number of best subsets, from which the investigator can choose the final model.

Optimality Criteria

$r_p^2$ Criterion

$$r_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST}$$

Adjusted $r_p^2$ Criterion

$$r_{adj,p}^2 = 1 - \frac{SSE_p/(n-(p+1))}{SST/(n-1)} = 1 - \frac{MSE_p}{MST}$$

Optimality Criteria

$C_p$ Criterion

Standardized mean square error of prediction:

$$\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left[\hat{Y}_{ip} - E(Y_i)\right]^2$$

$\Gamma_p$ involves unknown parameters such as the $\beta_j$'s, so we minimize a sample estimate of $\Gamma_p$. Mallows' $C_p$ statistic:

$$C_p = \frac{SSE_p}{\hat\sigma^2} + 2(p+1) - n$$
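The adjusted $r_p^2$ and $C_p$ formulas translate directly into code. The sketch below is an illustration in plain Python with assumed function names. The first call reuses the deck's own figures (R-square 0.9787 for the x1, x2 model with n = 13) by setting SST to 1 so that SSE equals 1 - R-square; the $C_p$ inputs are hypothetical numbers.

```python
def adjusted_r2(sse_p, sst, n, p):
    """Adjusted r^2 for a p-predictor model: 1 - MSE_p / MST."""
    return 1.0 - (sse_p / (n - (p + 1))) / (sst / (n - 1))


def mallows_cp(sse_p, sigma2_hat, n, p):
    """Mallows' C_p = SSE_p / sigma_hat^2 + 2(p + 1) - n."""
    return sse_p / sigma2_hat + 2 * (p + 1) - n


# Deck figures: r^2 = 0.9787 for (x1, x2), n = 13. With SST = 1, SSE = 1 - r^2.
adj_12 = adjusted_r2(1.0 - 0.9787, 1.0, n=13, p=2)

# Hypothetical C_p inputs: full model with k = 4 predictors and SSE = 48,
# so sigma_hat^2 = MSE_full = 48 / (13 - 5) = 6.
n, k = 13, 4
sse_full = 48.0
sigma2_hat = sse_full / (n - (k + 1))
cp_full = mallows_cp(sse_full, sigma2_hat, n, k)     # full model
cp_sub = mallows_cp(60.0, sigma2_hat, n, 2)          # a 2-predictor subset

print(round(adj_12, 4), cp_full, cp_sub)
```

When the error variance is estimated by the full-model MSE, the full model's $C_p$ always equals k + 1, so the statistic only discriminates among proper subsets.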

Optimality Criteria

• In practice, we use the $C_p$ criterion because of its ease of computation and its ability to judge the predictive power of a model.

How to do it by SAS? (Ex11.9)

proc reg data=example1;
model y = x1 x2 x3 x4 / selection=adjrsq mse cp;
run;


SAS output

Interpretation

• The best subset which minimizes the $C_p$ criterion is x1, x2, which is the same model selected using stepwise regression in the former example.

• The subset which maximizes $r_{adj,p}^2$ is x1, x2, x4. However, $r_{adj,p}^2$ increases only from 0.9744 to 0.9763 by the addition of x4 to the model which already contains x1 and x2.

• Thus, the model chosen by the $C_p$ criterion is preferred.

Chapter Summary and Modern Application

Multiple Regression Model

Model (Extension of Simple Regression):

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

where $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters

Fitting the MLR Model

Least squares method:

$$Q = \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right]^2$$

$$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right] = 0$$

$$\frac{\partial Q}{\partial \beta_j} = -2 \sum_{i=1}^{n} \left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right] x_{ij} = 0 \quad (j = 1, \ldots, k)$$

Goodness of fit of the model:

$$r^2 = \frac{SSR}{SST}$$

MLR Model in Matrix Notation

$$Y = X\beta + \varepsilon, \qquad \hat\beta = (X'X)^{-1} X'Y$$

Statistical Inference for Multiple Regression

Statistical Inference on $\beta$'s

Hypotheses: $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$

Test statistic:

$$T = \frac{Z}{\sqrt{W/[n-(k+1)]}} = \frac{\hat\beta_j - \beta_j}{S\sqrt{v_{jj}}} \sim T_{n-(k+1)}$$

Hypotheses: $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_a$: At least one $\beta_j \neq 0$

Test statistic:

$$F = \frac{MSR}{MSE} = \frac{r^2\{n-(k+1)\}}{k(1-r^2)}$$

Regression Diagnostics

Residual Analysis

Data Transformation
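The overall F statistic in the summary above can be computed from $r^2$ alone. The sketch below is a plain-Python illustration with an assumed function name; it plugs in the R squared of 0.9824 reported earlier in the deck for the full cement model, with n = 13 observations and k = 4 predictors. Because 0.9824 is a rounded value, the result is only approximate.

```python
def overall_f(r2, n, k):
    """Overall F statistic MSR/MSE written in terms of r^2."""
    return r2 * (n - (k + 1)) / (k * (1.0 - r2))


f_full = overall_f(0.9824, n=13, k=4)
print(round(f_full, 2))
```

The statistic is compared with the upper α critical point of the F distribution with k and n - (k + 1) degrees of freedom.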

The General Hypothesis Test:

Compare the full model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i$$

the partial model:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \varepsilon_i$$

Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_a: \beta_j \neq 0$ for at least one $j \in \{k-m+1, \ldots, k\}$

Test statistic:

$$F_0 = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} \sim f_{m,\,n-(k+1)}$$

Reject $H_0$ when $F_0 > f_{m,\,n-(k+1),\,\alpha}$

Estimating and Predicting Future Observations:

Let $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$ and $\mu^* = E(Y^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^* = x^{*\prime}\beta$

Test statistic:

$$T = \frac{\hat\mu^* - \mu^*}{s\sqrt{x^{*\prime} V x^*}} \sim T_{n-(k+1)}$$

CI for the estimated mean $\mu^*$:

$$\hat\mu^* \pm t_{n-(k+1),\,\alpha/2}\; s\sqrt{x^{*\prime} V x^*}$$

PI for the estimated $Y^*$:

$$\hat{Y}^* \pm t_{n-(k+1),\,\alpha/2}\; s\sqrt{1 + x^{*\prime} V x^*}$$

Topics in regression modeling

•Multicollinearity

•Polynomial Regression

•Dummy Predictor Variables

•Logistic Regression Model

Variable Selection Methods

•Stepwise Regression:

partial F-test

$$F_p = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$$

partial Correlation Coefficient

$$r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}}$$

•Stepwise Regression Algorithm

•Best Subsets Regression

Strategy for building a MLR model

Application of the MLR model

Linear regression is widely used in biology, chemistry, finance and the social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.


Multiple linear regression

Chemistry

heredity

Financial market

biology

Housing price

Example

Broadly speaking, an asset pricing model can be expressed as:

$$r_i = a_i + b_{1j} f_1 + b_{2j} f_2 + \cdots + b_{kj} f_k + \varepsilon_i$$

where $r_i$, $f_k$ and $k$ denote the expected return on asset i, the kth risk factor and the number of risk factors, respectively.

$\varepsilon_i$ denotes the specific return on asset i.


The equation can also be expressed in the matrix notation:

is called the factor loading

What are the most important factors?

Interest rate

Inflation rate

Employment rate

Rate of return on the market portfolio

Government policies

GDP

Method

• Step 1: Find the efficient factors (EM algorithms, maximum likelihood)

• Step 2: Fit the model and estimate the factor loading (Multiple linear regression)

• By fitting the multiple linear regression to the data in SAS, we can get the factor loading and the coefficient of multiple determination $r^2$

• From the SAS output we can identify the factors that most affect the return, and then build the appropriate multiple factor models

• We can use the model to predict the future return and make a good choice!


Questions

Thank you