
PubH 7405: REGRESSION ANALYSIS

SLR: INFERENCES, Part I

Normal Error Regression Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

The Mean Response:

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

ESTIMATED SLOPE

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{\sum (x_i - \bar{x})\, y_i}{\sum (x_i - \bar{x})^2} = \sum k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}$$

Reason: inspecting the numerator of b1,

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x})\, y_i - \bar{y} \sum (x_i - \bar{x}) = \sum (x_i - \bar{x})\, y_i$$

since $\sum (x_i - \bar{x}) = 0$.
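The two equivalent forms of b1 can be checked numerically. A minimal sketch in Python (used here purely for illustration; the course tools are Excel and SAS), on the first few Toluca observations that appear later in these notes:

```python
# Slope b1 computed two equivalent ways: as the ratio of centered
# cross-products, and as the linear combination sum(k_i * y_i).
x = [80, 30, 50, 90, 70, 60]
y = [399, 121, 221, 376, 361, 224]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)

# Formula 1: ratio of centered cross-products
b1_ratio = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx

# Formula 2: weighted sum of the y's, with k_i = (x_i - xbar)/ssx
k = [(xi - xbar) / ssx for xi in x]
b1_linear = sum(ki * yi for ki, yi in zip(k, y))

print(b1_ratio, b1_linear)  # the two agree
```

The weights $k_i$ also satisfy $\sum k_i = 0$ and $\sum k_i x_i = 1$, which is what makes the later unbiasedness argument work.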

SAMPLING DISTRIBUTION

Under the "Normal Error Regression Model" $Y = \beta_0 + \beta_1 x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, the sampling distribution of the estimated slope b1 is Normal with Mean and Variance:

$$E(b_1) = \beta_1, \qquad \sigma^2(b_1) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$$

The sampling distribution of the estimated slope, b1, is “normal” because b1 is a linear combination of the observations yi and the distribution of each observation is normal under the “normal error regression model”.

$$b_1 = \frac{\sum (x_i - \bar{x})\, y_i}{\sum (x_i - \bar{x})^2} = \sum k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}$$

We can show that:

$$\sum k_i = 0, \qquad \sum k_i x_i = 1, \qquad \sum k_i^2 = \frac{1}{\sum (x_i - \bar{x})^2}$$

For the second result, note that

$$\sum (x_i - \bar{x})\, x_i = \sum (x_i - \bar{x})(x_i - \bar{x}) = \sum (x_i - \bar{x})^2$$

so that $\sum k_i x_i = \sum (x_i - \bar{x})\, x_i \big/ \sum (x_i - \bar{x})^2 = 1$.

Next step:

b1 is an UNBIASED ESTIMATOR

$$E(b_1) = \sum k_i E(y_i) = \sum k_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum k_i + \beta_1 \sum k_i x_i = \beta_1$$

using the results we have: $\sum k_i = 0$ and $\sum k_i x_i = 1$.
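Unbiasedness can also be seen in simulation. A hedged sketch in Python (illustration only, with made-up parameter values): generate many samples from the normal error model and average the resulting slope estimates.

```python
import random

# Monte Carlo check: simulate Y = beta0 + beta1*x + eps repeatedly
# and average b1; the average should sit very close to beta1.
random.seed(1)
beta0, beta1, sigma = 10.0, 2.0, 1.0   # hypothetical true values
x = list(range(10))                    # fixed design
xbar = sum(x) / len(x)
ssx = sum((xi - xbar) ** 2 for xi in x)

b1_draws = []
for _ in range(2000):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    ybar = sum(y) / len(y)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
    b1_draws.append(b1)

print(sum(b1_draws) / len(b1_draws))  # close to beta1 = 2
```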

VARIANCE & STANDARD ERROR

$$Var(b_1) = Var\!\left(\sum k_i y_i\right) = \sum k_i^2\, Var(y_i) = \sigma^2 \sum k_i^2 = \frac{\sigma^2}{\sum (x_i - \bar{x})^2}$$

Substituting MSE for the unknown $\sigma^2$:

$$s^2(b_1) = \frac{MSE}{\sum (x_i - \bar{x})^2}, \qquad SE(b_1) = s(b_1) = \sqrt{\frac{MSE}{\sum (x_i - \bar{x})^2}}$$

Design Implication: The larger the sum of squares of X, the more precise the estimate of the Slope.
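A minimal sketch of this standard error in Python (illustrative only; the data below are hypothetical):

```python
import math

# SE(b1) = sqrt(MSE / SSX), computed from scratch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # made-up data
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar

# MSE = SSE / (n - 2), where SSE is the sum of squared residuals
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
se_b1 = math.sqrt(mse / ssx)
print(b1, se_b1)
```

Doubling the spread of the x's would quadruple SSX and halve SE(b1), which is the design implication stated above.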

MORE ON SAMPLING DISTRIBUTION

$$\frac{b_1 - \beta_1}{s(b_1)} = \frac{b_1 - \beta_1}{\sigma(b_1)} \div \frac{s(b_1)}{\sigma(b_1)}$$

The numerator is distributed as $N(0,1)$; the squared denominator is a $\chi^2$ variable divided by its degrees of freedom, since $(n-2)\,MSE/\sigma^2 \sim \chi^2$ with $df = n - 2$.

Theorem: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with (n − 2) degrees of freedom.

CONFIDENCE INTERVALS

Theorem: $\dfrac{b_1 - \beta_1}{s(b_1)}$ is distributed as "t" with (n − 2) degrees of freedom.

A $100(1-\alpha)\%$ Confidence Interval for $\beta_1$ is therefore:

$$b_1 \pm t(1 - \alpha/2;\, n-2)\; s(b_1)$$

where $t(1-\alpha/2;\, n-2)$ is the $100(1-\alpha/2)$ percentile of the "t" distribution with (n − 2) degrees of freedom.

THE TEST FOR INDEPENDENCE

The Mean Response: $E(Y \mid X = x) = \beta_0 + \beta_1 x$. Since $(b_1 - \beta_1)/s(b_1)$ is distributed as "t" with (n − 2) degrees of freedom, we test $H_0: \beta_1 = 0$ with the "t" test at (n − 2) degrees of freedom:

$$t = \frac{b_1}{s(b_1)}$$

which is identical to the test using "r":

$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$
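The algebraic identity between the slope form and the correlation form of the t statistic can be verified numerically. A sketch in Python (hypothetical data, for illustration):

```python
import math

# t for H0: beta1 = 0, computed two ways: from b1/s(b1) and from
# t = r*sqrt(n-2)/sqrt(1-r^2). The two are algebraically identical.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]   # made-up data
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx
b0 = ybar - b1 * xbar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
t_slope = b1 / math.sqrt(mse / sxx)

r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(t_slope, t_corr)  # identical
```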

STATISTICAL POWER The “power”, which is (1 – probability of Type II error), of the t-test for independence can be obtained, if needed, from Appendix B5 of the text

$$H_0: \text{Slope} = 0 \quad \text{vs.} \quad H_A: \text{Slope} = \beta_1 \neq 0$$

The test statistic is

$$t = \frac{b_1}{s(b_1)}, \qquad df = n - 2$$

and

$$\text{Power} = \Pr\big(|t| > t(1-\alpha/2;\, n-2)\,;\; \delta\big), \qquad \delta = \frac{|\beta_1|}{\sigma(b_1)}$$

where $\delta$ is the noncentrality measure: a standardized measure of how far the true value of the slope (under $H_A$) is from zero.

More details on pages 50-51 of your textbook

In practical applications, you will rarely have to deal with this issue of statistical power. When and if you do, specifying an alternative hypothesis is not easy.

Excel: SCATTER DIAGRAM
Create a scatter diagram using Chart Wizard.

Excel: REGRESSION LINE
Steps: (a) Click on the new Chart (scatter diagram) to make it active; (b) click on Chart (on the top-row menu); (c) a box appears letting you choose "Add Trendline".

Excel: ANALYSIS (1) Click Tools, then (2) Data Analysis; among the functions available, choose Regression.

A box appears, use the cursor to fill in the ranges of Y and X's. The results include all items needed, including regression estimates of coefficients, their standard errors, and their 95% confidence intervals. And much more.

BASIC “SAS” PROGRAM

data tc;
input x y;
label x = 'Lot Size'
      y = 'Work Hrs';
cards;
80 399
30 121
50 221
90 376
70 361
60 224
120 546
80 352
100 353
50 157
40 160
70 252
90 389
20 113
110 435
100 420
30 212
50 268
90 377
110 421
30 273
90 468
40 244
80 342
70 323
;
proc REG data = tc;
model y = x;
plot y*x;
run;

(Toluca Company data; the data lines go in the middle, between "cards;" and the closing ";".)

EXAMPLE #1: Birth weight data

x (oz)   y (%)
112      63
111      66
107      72
119      52
92       75
80       118
81       120
84       114
118      42
106      72
103      90
94       91

              Coefficients   Standard Error   t Stat      P-value       Lower 95%      Upper 95%
Intercept     255.971912     19.04537379      13.44011    9.99506E-08   213.536175     298.4076492
X Variable 1  -1.7370861     0.187689258      -9.255117   3.21622E-06   -2.155283843   -1.31888839

For the slope:

$$t = \frac{-1.7371}{.1877} = -9.255$$

$$95\%\ \text{C.I.}: \; -1.7371 \pm (2.2281)(.1877) = (-2.155, -1.319)$$

EXAMPLE #2: Age and SBP

Age (x)   SBP (y)
42        130
46        115
42        148
71        100
80        156
74        162
70        151
80        156
85        162
72        158
64        155
81        160
41        125
61        150
75        165

              Coefficients   Standard Error   t Stat     P-value    Lower 95%    Upper 95%
Intercept     99.9585145     19.25516927      5.191256   0.000174   58.3602504   141.556779
X Variable 1  0.70490069     0.286078656      2.46401    0.028454   0.08686533   1.32293605

For the slope:

$$t = \frac{.7049}{.2861} = 2.464$$

$$95\%\ \text{C.I.}: \; .7049 \pm (2.1604)(.2861) = (.087, 1.323)$$

EXAMPLE #3: Toluca Company Data (Description on page 19 of Text)

LotSize   WorkHours
80        399
30        121
50        221
90        376
70        361
60        224
120       546
80        352
100       353
50        157
40        160
70        252
90        389
20        113
110       435
100       420
30        212
50        268
90        377
110       421
30        273
90        468
40        244
80        342
70        323

              Coefficients   Standard Error   t Stat     P-value    Lower 95%    Upper 95%
Intercept     62.3658586     26.17743389      2.382428   0.025851   8.21371106   116.518006
X Variable 1  3.57020202     0.346972157      10.28959   4.45E-10   2.85243543   4.28796861

For the slope:

$$t = \frac{3.5702}{.3470} = 10.290$$

$$95\%\ \text{C.I.}: \; 3.5702 \pm (2.0687)(.3470) = (2.852, 4.288)$$
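As a cross-check of Example #3, the same numbers can be reproduced from the raw Toluca data with plain Python (illustration only; the multiplier 2.0687 = t(.975; 23) is taken from the slide, everything else is computed):

```python
import math

# Reproduce the Toluca slope, its SE, the t statistic, and the 95% CI.
data = [(80, 399), (30, 121), (50, 221), (90, 376), (70, 361), (60, 224),
        (120, 546), (80, 352), (100, 353), (50, 157), (40, 160), (70, 252),
        (90, 389), (20, 113), (110, 435), (100, 420), (30, 212), (50, 268),
        (90, 377), (110, 421), (30, 273), (90, 468), (40, 244), (80, 342),
        (70, 323)]
x = [p[0] for p in data]
y = [p[1] for p in data]
n = len(data)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
se_b1 = math.sqrt(mse / ssx)

t_stat = b1 / se_b1
ci = (b1 - 2.0687 * se_b1, b1 + 2.0687 * se_b1)
# matches the output above: b1 = 3.5702, SE = .3470, t = 10.29
print(round(b1, 4), round(se_b1, 4), round(t_stat, 2), ci)
```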

UNBIASED LINEAR ESTIMATORS

We consider the family of all unbiased linear estimators and prove that b1 is a member of this family with minimum variance:

Recall $b_1 = \sum k_i y_i$. Consider any linear estimator of the slope,

$$\hat{\beta}_1 = \sum c_i y_i$$

Its expectation is

$$E(\hat{\beta}_1) = \sum c_i E(y_i) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_0 \left(\sum c_i\right) + \beta_1 \left(\sum c_i x_i\right)$$

Conditions for unbiasedness: $\sum c_i = 0$ and $\sum c_i x_i = 1$.

b1 satisfies these two conditions with ci = ki

Write $c_i = k_i + d_i$. We want to prove that $\sum k_i d_i = 0$:

$$\sum k_i d_i = \sum k_i (c_i - k_i) = \sum k_i c_i - \sum k_i^2 = \frac{\sum c_i (x_i - \bar{x})}{\sum (x_i - \bar{x})^2} - \frac{1}{\sum (x_i - \bar{x})^2}$$

Since $\sum c_i (x_i - \bar{x}) = \sum c_i x_i - \bar{x} \sum c_i = 1 - 0 = 1$, the two terms cancel and $\sum k_i d_i = 0$.

$$\sigma^2(\hat{\beta}_1) = \sigma^2 \sum c_i^2 = \sigma^2 \sum (k_i + d_i)^2 = \sigma^2 \left[\sum k_i^2 + 2 \sum k_i d_i + \sum d_i^2\right] = \sigma^2(b_1) + \sigma^2 \sum d_i^2 \;\geq\; \sigma^2(b_1)$$

b1 is the member of the family with minimum variance.
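A concrete illustration (a sketch, with a hypothetical competitor estimator): the "two endpoint" estimator $(y_n - y_1)/(x_n - x_1)$ is also linear and unbiased, but its weight vector has a larger sum of squares, hence a larger variance.

```python
# Compare sum(c_i^2) for the endpoint estimator against sum(k_i^2)
# for least squares. Since Var = sigma^2 * sum(weights^2), the
# smaller sum of squares means the smaller variance.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n = len(x)
xbar = sum(x) / n
ssx = sum((xi - xbar) ** 2 for xi in x)

k = [(xi - xbar) / ssx for xi in x]     # least-squares weights
c = [0.0] * n                            # endpoint-estimator weights
c[0] = -1.0 / (x[-1] - x[0])
c[-1] = 1.0 / (x[-1] - x[0])

# both weight vectors satisfy the unbiasedness conditions
assert abs(sum(c)) < 1e-12
assert abs(sum(ci * xi for ci, xi in zip(c, x)) - 1) < 1e-12

print(sum(ki ** 2 for ki in k), sum(ci ** 2 for ci in c))  # k_i wins
```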

ESTIMATED INTERCEPT

Recall:

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

SAMPLING DISTRIBUTION

Under the "Normal Error Regression Model" $Y = \beta_0 + \beta_1 x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, the sampling distribution of the estimated intercept b0 is Normal with Mean and Variance:

$$E(b_0) = \beta_0, \qquad \sigma^2(b_0) = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]$$

Next step:

$$b_0 = \bar{y} - b_1 \bar{x} = \sum \frac{1}{n} y_i - \bar{x} \sum k_i y_i = \sum \left(\frac{1}{n} - \bar{x} k_i\right) y_i$$

The sampling distribution of b0 is “normal” because b0, like b1, is a linear combination of the observations yi and the distribution of each observation is normal under the “normal error regression model”:
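That linear-combination form can be verified numerically. A small sketch in Python (hypothetical data, for illustration):

```python
# Verify that b0 = ybar - b1*xbar equals the linear combination
# sum((1/n - xbar*k_i) * y_i).
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 4.0]   # made-up data
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
k = [(xi - xbar) / ssx for xi in x]

b1 = sum(ki * yi for ki, yi in zip(k, y))
b0_direct = ybar - b1 * xbar
b0_linear = sum((1 / n - xbar * ki) * yi for ki, yi in zip(k, y))
print(b0_direct, b0_linear)  # equal
```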

b0 is an UNBIASED ESTIMATOR

$$E(b_0) = E(\bar{y}) - \bar{x}\, E(b_1) = \frac{1}{n} \sum E(y_i) - \bar{x} \beta_1 = \frac{1}{n} \sum (\beta_0 + \beta_1 x_i) - \bar{x} \beta_1 = (\beta_0 + \beta_1 \bar{x}) - \beta_1 \bar{x} = \beta_0$$

VARIANCE & STANDARD ERROR

Since $b_0 = \bar{y} - b_1 \bar{x}$, and $\bar{y}$ and $b_1$ are uncorrelated:

$$Var(b_0) = Var(\bar{y}) + \bar{x}^2\, Var(b_1) = \frac{\sigma^2}{n} + \bar{x}^2 \frac{\sigma^2}{\sum (x_i - \bar{x})^2} = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]$$

Substituting MSE for the unknown $\sigma^2$:

$$s^2(b_0) = MSE \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right], \qquad SE(b_0) = \sqrt{MSE \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]}$$

DESIGN IMPLICATION

$$SE(b_1) = \sqrt{\frac{MSE}{\sum (x_i - \bar{x})^2}}, \qquad SE(b_0) = \sqrt{MSE \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]}$$

These standard errors, for a given n, are affected by the spacing of the X levels in the data: the larger the sum of squares of X, the more precise the estimates of the Slope and the Intercept.
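The design implication can be illustrated with a small sketch in Python (σ² = 1 here purely for illustration; the two designs are hypothetical):

```python
import math

# With the same n, same mean, and same error variance, spreading the
# X levels out shrinks both standard errors.
def se_factors(x, sigma2=1.0):
    n = len(x)
    xbar = sum(x) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    se_b1 = math.sqrt(sigma2 / ssx)
    se_b0 = math.sqrt(sigma2 * (1 / n + xbar ** 2 / ssx))
    return se_b1, se_b0

clustered = [4.0, 4.5, 5.0, 5.5, 6.0]   # narrow design
spread = [1.0, 3.0, 5.0, 7.0, 9.0]      # wide design, same xbar = 5
print(se_factors(clustered))  # larger SEs
print(se_factors(spread))     # smaller SEs
```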

MORE ON SAMPLING DISTRIBUTION

$$\frac{b_0 - \beta_0}{s(b_0)} = \frac{b_0 - \beta_0}{\sigma(b_0)} \div \frac{s(b_0)}{\sigma(b_0)}$$

The numerator is distributed as $N(0,1)$; the squared denominator is a $\chi^2$ variable divided by its degrees of freedom, since $(n-2)\,MSE/\sigma^2 \sim \chi^2$ with $df = n - 2$.

Theorem: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with (n − 2) degrees of freedom.

CONFIDENCE INTERVALS

Theorem: $\dfrac{b_0 - \beta_0}{s(b_0)}$ is distributed as "t" with (n − 2) degrees of freedom.

A $100(1-\alpha)\%$ Confidence Interval for $\beta_0$ is therefore:

$$b_0 \pm t(1 - \alpha/2;\, n-2)\; s(b_0)$$

where $t(1-\alpha/2;\, n-2)$ is the $100(1-\alpha/2)$ percentile of the "t" distribution with (n − 2) degrees of freedom.

EXAMPLE #1: Birth weight data (same data and Excel output as before). For the intercept:

$$t = \frac{255.9719}{19.0454} = 13.440$$

$$95\%\ \text{C.I.}: \; 255.9719 \pm (2.2281)(19.0454) = (213.536, 298.408)$$

EXAMPLE #2: Age and SBP (same data and Excel output as before). For the intercept:

$$t = \frac{99.9585}{19.2552} = 5.191$$

$$95\%\ \text{C.I.}: \; 99.9585 \pm (2.1604)(19.2552) = (58.360, 141.557)$$

EXAMPLE #3: Toluca Company Data (same data and Excel output as before). For the intercept:

$$t = \frac{62.3659}{26.1774} = 2.382$$

$$95\%\ \text{C.I.}: \; 62.3659 \pm (2.0687)(26.1774) = (8.218, 116.518)$$

DEPARTURE FROM NORMALITY

Normal Error Regression Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

If the probability distributions of Y are not exactly normal but do not depart from normality seriously, the sampling distributions of b0 and b1 will still be approximately normal, with very little effect on the level of significance of the t-test for independence and on the coverage of the confidence intervals. Even if the probability distributions of Y are far from normal, the effects are still minimal provided the sample sizes are sufficiently large; i.e., the sampling distributions of b0 and b1 are asymptotically normal.

Sometimes it is known, a priori, that the true intercept is zero; the regression function is linear but the line goes through the origin (0,0):

Regression through the origin:

$$Y = \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

The Mean Response: $E(Y \mid X = x) = \beta_1 x$

RESULTS: Let b1 be the estimate of the slope β1; the "sum of squared errors" becomes $Q = \sum (y - \beta_1 x)^2$. The new "Least Squares Estimate" is:

$$b_1 = \frac{\sum xy}{\sum x^2}, \qquad \hat{Y} = b_1 x$$

All inferences are still drawn through the use of the “t” distribution but with (n-1) degrees of freedom; for example:

$$MSE = \frac{\sum e_i^2}{n - 1}, \qquad s^2(b_1) = \frac{MSE}{\sum x^2}$$
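A minimal sketch of regression through the origin in Python (hypothetical data, for illustration), using the (n − 1) degrees of freedom noted above:

```python
import math

# Least squares through the origin: b1 = sum(x*y) / sum(x^2),
# with MSE on (n - 1) degrees of freedom.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.2, 3.9, 6.1, 8.0]   # made-up data
n = len(x)
sxx = sum(xi ** 2 for xi in x)
b1 = sum(xi * yi for xi, yi in zip(x, y)) / sxx
mse = sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
se_b1 = math.sqrt(mse / sxx)
print(b1, se_b1)
```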

Besides “Least Squares”, parameters can be estimated using the method of “Maximum Likelihood”; results are called “MLE” – maximum likelihood estimators/estimates.

Model: $Y = \beta_0 + \beta_1 x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$

Density Function for Y:

$$f(y) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y - \beta_0 - \beta_1 x)^2\right]$$

Likelihood Function:

$$L = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right]$$

SLOPE & INTERCEPT
• The maximum likelihood estimates of the Intercept and the Slope are identical to the Least Squares estimates.
• Their variances, obtained from the inverse of the Fisher’s Information Matrix, are also identical.
• As least squares estimates, the estimated Intercept and Slope have the properties of least squares estimates: (1) they are unbiased, and (2) they have minimum variance in the class of all linear unbiased estimators.
• In addition, as MLEs for the normal error regression model: (3) they are consistent and (4) they are sufficient.
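The coincidence of the MLE with least squares can be checked numerically: for fixed σ², the log-likelihood is maximized exactly where the sum of squared errors is minimized. A sketch in Python (hypothetical data, for illustration):

```python
import math

# The log-likelihood of the normal error model is maximized at the
# least-squares estimates: nudging either coefficient only lowers it.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]   # made-up data
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
ssx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ssx
b0 = ybar - b1 * xbar

def loglik(beta0, beta1, sigma2=1.0):
    sse = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return -n / 2 * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

best = loglik(b0, b1)
for d in (0.05, -0.05):
    assert loglik(b0 + d, b1) < best   # moving the intercept hurts
    assert loglik(b0, b1 + d) < best   # moving the slope hurts
print("least squares maximizes the likelihood here")
```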

Readings & Exercises
• Readings: A thorough reading of the text’s sections 2.1-2.3 (pp. 40-51) is highly recommended.
• Exercises: The following exercises are good for practice, all from chapter 2 of the text: 2.1-2.6.
• Due as Homework: 2.4 and 2.6.

Due As Homework

#6.1 Refer to dataset “Infants”, with X = Gestational Weeks and Y = Birth Weight:

a) Obtain the 95% confidence interval for the slope and interpret your result. Does your confidence interval include zero? What would be your conclusion about the possible linear relationship between X and Y?

b) Using the t statistic, test whether or not a linear association exists between X and Y.

c) Does your conclusion in part (b) agree with your conclusion in part (a)? Which result, in (a) or in (b), could tell you more about the strength of the relationship between X and Y?

#6.2 Answer the 3 questions of Exercise 6.1 using dataset “Vital Capacity” with X = Age and Y = (100)(Vital Capacity).