lecture 2 linear regression: a model for the mean
TRANSCRIPT
Lecture 2 Linear Regression: A Model for the Mean
Sharyn O’Halloran
Spring 2005 2U9611
Closer Look at:
- Linear Regression Model
- Least squares procedure
- Inferential tools
- Confidence and Prediction Intervals
- Assumptions
- Robustness
- Model checking
- Log transformation (of Y, X, or both)
Linear Regression: Introduction
Data: (Yi, Xi) for i = 1,...,n
Interest is in the probability distribution of Y as a function of X
Linear Regression model: the mean of Y is a straight-line function of X, plus an error term or residual. The goal is to find the best-fit line, the one that minimizes the sum of the squared errors.
Estimated regression line: Steer example (see Display 7.3, p. 177)
[Figure: PH vs. ltime with fitted values; intercept = 6.98, slope = -.73]
Equation for estimated regression line (the fitted line; the vertical deviations are the error terms):
Ŷ = 6.98 - .73X
Create a new variable ltime = log(time), then run the regression analysis.
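As a rough sketch of that variable-creation step outside Stata (the time values below are made-up placeholders, not the course data):

```python
import math

# Hypothetical time-after-slaughter values in hours (illustrative only)
time_hr = [1.0, 2.0, 4.0, 6.0, 8.0]

# Create the new variable ltime = log(time), as on the slide
ltime = [math.log(t) for t in time_hr]
```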
Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables:
µ{Y | X}
Regression model: an ideal formula to approximate the regression.
Simple linear regression model:
µ{Y | X} = β0 + β1X
where β0 (intercept) and β1 (slope) are the unknown parameters; µ{Y | X} reads "mean of Y given X" or "regression of Y on X".
Y's probability distribution is to be explained by X.
β0 and β1 are the regression coefficients. (See Display 7.5, p. 180)
Note: Y = β0 + β1X is NOT simple regression.
Y: response variable, explained variable, dependent variable
X: control variable, explanatory variable, independent variable
Regression Terminology: Estimated coefficients
β0 + β1X : the unknown regression line
β̂0 + β̂1X : the estimated regression line
Choose β̂0 and β̂1 to make the residuals small.
Regression Terminology
Fitted value for obs. i is its estimated mean:
Ŷi = fit_i = µ̂{Y|X} = β̂0 + β̂1X_i
Residual for obs. i:
res_i = Y_i - fit_i, i.e. e_i = Y_i - Ŷ_i
Least Squares: the statistical estimation method that finds the estimates minimizing the sum of squared residuals,
Σ_{i=1..n} (y_i - (β̂0 + β̂1x_i))²
Solution (from calculus) on p. 182 of Sleuth.
Least Squares Procedure
The least-squares procedure obtains estimates of the linear equation coefficients β0 and β1, in the model
ŷ_i = β0 + β1x_i
by minimizing the sum of the squared residuals or errors (e_i):
SSE = Σ e_i² = Σ (y_i - ŷ_i)²
This results in a procedure stated as: choose β0 and β1 so that
SSE = Σ e_i² = Σ (y_i - (β0 + β1x_i))²
is minimized.
Least Squares Procedure
The slope coefficient estimator is
β̂1 = Σ_{i=1..n} (x_i - X̄)(y_i - Ȳ) / Σ_{i=1..n} (x_i - X̄)² = r_xy · (s_Y / s_X)
that is, the correlation between X and Y times the standard deviation of Y over the standard deviation of X.
And the constant or intercept estimator is
β̂0 = Ȳ - β̂1X̄
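The closed-form estimators above can be checked numerically; a minimal Python sketch, using made-up toy data (not from the lecture):

```python
# Least-squares estimates via the closed-form formulas on this slide.
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Numerator: sum of cross-products about the means
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
# Denominator: sum of squared deviations of x
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept: the line passes through (X̄, Ȳ)
```

With these four points the slope is 1.9 and the intercept is 0, and the fitted line passes through the point of means, as the next slide notes.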
Least Squares Procedure (cont.)
Note that the regression line always goes through the mean (X̄, Ȳ).
[Figure: Relation Between Yield (Bushel/Acre) and Fertilizer (lb/Acre), with trend line]
That is, for any value of the independent variable there is a single most likely value for the dependent variable. Think of this regression line as the expected value of Y for a given value of X.
Tests and Confidence Intervals for β0, β1
Degrees of freedom: (n-2) = sample size - number of coefficients
Variance estimate: σ̂² = (sum of squared residuals)/(n-2)
Standard errors (p. 184). Under the ideal normal model, the sampling distributions of β̂0 and β̂1 have the shape of a t-distribution on (n-2) d.f.
Do t-tests and CIs as usual (df = n-2).
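A short sketch of these quantities, continuing the same invented toy data (the t statistic would then be compared with a t distribution on n-2 d.f.):

```python
import math

# Toy data (illustrative); least-squares fit as before
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Variance estimate: sum of squared residuals / (n - 2)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e ** 2 for e in resid) / (n - 2)

# Standard error of the slope, and the t statistic for H0: beta1 = 0
se_b1 = math.sqrt(sigma2 / sxx)
t_stat = b1 / se_b1
```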
Confidence intervals
P-values for H0: β = 0
Inference Tools
Hypothesis Test and Confidence Interval for the mean of Y at some X:
Estimate the mean of Y at X = X0 by
µ̂{Y|X0} = β̂0 + β̂1X0
Standard error of µ̂{Y|X0}:
SE[µ̂{Y|X0}] = σ̂ · sqrt(1/n + (X0 - X̄)² / ((n-1)s_x²))
Conduct t-test and confidence interval in the usual way (df = n-2).
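The standard-error formula can be sketched directly, again on the invented toy data (X0 = 3 is an arbitrary choice):

```python
import math

# SE of the estimated mean of Y at X0, for the toy data used earlier
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)   # equals (n-1) * s_x^2
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sigma_hat = math.sqrt(
    sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
)

x0 = 3.0
mu_hat = b0 + b1 * x0                       # estimated mean at X0
se_mean = sigma_hat * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
```

Note the (X0 - X̄)² term: the SE grows as X0 moves away from X̄, which is exactly the hourglass shape of the confidence bands on the next slide.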
Confidence bands for conditional means
The lfitci command automatically calculates and graphs the confidence bands.
Confidence bands in simple regression have an hourglass shape, narrowest at the mean of X.
Prediction
Prediction of a future Y at X = X0:
Pred(Y|X0) = µ̂{Y|X0}
Standard error of prediction:
SE[Pred(Y|X0)] = sqrt(σ̂² + SE[µ̂{Y|X0}]²)
(σ̂² captures the variability of Y about its mean; SE[µ̂{Y|X0}]² the uncertainty in the estimated mean)
95% prediction interval:
Pred(Y|X0) ± t_df(.975) × SE[Pred(Y|X0)]
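A sketch of the prediction standard error on the same toy data; it combines the two sources of uncertainty and is therefore always larger than the SE of the mean:

```python
import math

# Standard error for predicting a NEW observation Y at X0 = 3:
# SE[Pred]^2 = sigma^2 (variability of Y about its mean)
#            + SE[mu_hat]^2 (uncertainty in the estimated mean)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sigma2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 3.0
se_mean2 = sigma2 * (1 / n + (x0 - xbar) ** 2 / sxx)  # SE[mu_hat]^2
se_pred = math.sqrt(sigma2 + se_mean2)
# 95% PI: (b0 + b1*x0) ± t_{n-2}(.975) * se_pred
```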
Residuals vs. predicted values plot
After any regression analysis we can automatically draw a residual-versus-fitted plot just by typing rvfplot.
Predicted values (yhat)
After any regression, the predict command can create a new variable yhat containing the predicted Y values.
Residuals (e)
The predict command with the resid option can create a new variable e containing the residuals.
The residual-versus-predicted-values plot could be drawn "by hand" using these commands.
Second type of confidence interval for regression prediction: the "prediction band".
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with known X value.
Command: lfitci with the stdf option.
Additional note: predict can generate two kinds of standard errors for the predicted y value, which have two different applications.
[Figure: Distance vs. VELOCITY, two panels. Left: confidence bands for conditional means (stdp). Right: confidence bands for individual-case predictions (stdf).]
[Figure repeated: the stdp band gives the 95% confidence interval for µ{Y|1000}; the stdf band gives the 95% prediction interval for Y at X = 1000.]
Confidence band: a set of confidence intervals for µ{Y|X0}.
Calibration interval: values of X for which Y0 is in a prediction interval.
Notes about confidence and prediction bands
Both are narrowest at the mean of X. Beware of extrapolation.
The width of the confidence interval shrinks to zero as n grows large; this is not true of the prediction interval.
Review of simple linear regression
1. Model with constant variance:
µ{Y|X} = β0 + β1X
var{Y|X} = σ²
2. Least squares: choose estimators β̂0 and β̂1 to minimize the sum of squared residuals:
β̂1 = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / Σ_{i=1..n} (X_i - X̄)²
β̂0 = Ȳ - β̂1X̄
res_i = Y_i - β̂0 - β̂1X_i  (i = 1,…,n)
σ̂² = Σ_{i=1..n} res_i² / (n-2)
3. Properties of estimators:
SE(β̂1) = σ̂ / sqrt((n-1)s_x²)
SE(β̂0) = σ̂ · sqrt(1/n + X̄² / ((n-1)s_x²))
Assumptions of Linear Regression
A linear regression model assumes:
- Linearity: µ{Y|X} = β0 + β1X
- Constant variance: var{Y|X} = σ²
- Normality: the distribution of Y's at any X is normal
- Independence: given the X_i's, the Y_i's are independent
Examples of Violations: Non-Linearity
The true relation between the independent and dependent variables may not be linear.
For example, consider campaign fundraising and the probability of winning an election: the probability of winning increases with each additional dollar spent and then levels off after $50,000.
[Figure: P(w) vs. Spending, "Probability of Winning an Election".]
Consequences of violation of linearity
If "linearity" is violated, misleading conclusions may occur (however, the degree of the problem depends on the degree of non-linearity).
Examples of Violations: Constant Variance
Constant Variance or Homoskedasticity. The homoskedasticity assumption implies that, on average, we do not expect to get larger errors in some cases than in others.
Of course, due to the luck of the draw, some errors will turn out to be larger than others. But homoskedasticity is violated only when this happens in a predictable manner.
Example: income and spending on certain goods. People with higher incomes have more choices about what to buy. We would expect that their consumption of certain goods is more variable than for families with lower incomes.
Violation of constant variance
[Figure: Spending vs. Income scatter with fitted line; observations 1-10 with residuals marked, e.g. ε9 = Y9 - (a + bX9) and ε6 = Y6 - (a + bX6).]
As income increases so do the errors (vertical distances from the predicted line), so the relation between income and spending violates homoskedasticity.
Consequences of non-constant variance
If “constant variance” is violated, LS estimates are still unbiased but SEs, tests, Confidence Intervals, and Prediction Intervals are incorrect
However, the degree depends…
Violation of Normality: Non-Normality
Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day: an example of a bimodal distribution.
[Figure: Frequency of nicotine use.]
Consequence of non-Normality
If "normality" is violated:
- LS estimates are still unbiased
- tests and CIs are quite robust
- PIs are not
Of all the assumptions, this is the one that we need to be least worried about violating. Why?
Violation of Independence: Non-Independence
The independence assumption means that the error terms of different observations do not influence one another.
Technically, the RESIDUALS or error terms are uncorrelated.
The most common violation occurs with data that are collected over time, i.e. time series analysis.
Example: high tariff rates in one period are often associated with very high tariff rates in the next period.
Example: nominal GNP and consumption; the residuals of GNP and consumption over time are highly correlated.
Consequence of non-independence
If "independence" is violated:
- LS estimates are still unbiased
- everything else can be misleading
[Figure: Log Height vs. Log Weight; plotting code is litter (5 mice from each of 5 litters). Note that mice from litters 4 and 5 have higher weight and height.]
Robustness of least squares
The "constant variance" assumption is important.
Normality is not too important for confidence intervals and p-values, but is important for prediction intervals.
Long-tailed distributions and/or outliers can heavily influence the results.
Non-independence problems: serial correlation (Ch. 15) and cluster effects (we deal with this in Ch. 9-14).
Strategy for dealing with these potential problems:
- Plots; residual plots; consider outliers (more in Ch. 11)
- Log transformations (Display 8.6)
Tools for model checking
- Scatterplot of Y vs. X (see Display 8.6, p. 213)*
- Scatterplot of residuals vs. fitted values*
*Look for curvature, non-constant variance, and outliers.
- Normal probability plot (p. 224): sometimes useful for checking whether the distribution is symmetric or normal (i.e. for PIs).
- Lack-of-fit F-test when there are replicates (Section 8.5).
Scatterplot of Y vs. X
Command: graph twoway Y X. Case study: 7.01, page 175.
Scatterplot of residuals vs. fitted values
Command: rvfplot, yline(0)… Case study: 7.01, page 175.
Normal probability plot (p. 224)
Command: qnorm variable, grid. Case study: 7.01, page 175.
Quantile-normal plots compare quantiles of a variable's distribution with quantiles of a normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of the distribution.
Diagnostic plots of residuals
- Plot residuals versus fitted values almost always: for simple regression this is about the same as residuals vs. X. Look for outliers, curvature, increasing spread (funnel or horn shape); then take appropriate action.
- If data were collected over time, plot residuals versus time: check for time trend and serial correlation.
- If normality is important, use a normal probability plot: a straight line is expected if the distribution is normal.
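The residual-vs-fitted pairs behind such a plot can be computed by hand; a sketch on the invented toy data used earlier (least-squares residuals always sum to zero, so the diagnostic looks for patterns, not level):

```python
# Residual-vs-fitted values, computed by hand (what rvfplot displays).
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
# Plot resid against fitted and look for curvature or a funnel shape;
# the residuals themselves always average zero by construction.
```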
Voltage Example (Case Study 8.1.2)
Goal: to describe the distribution of breakdown time of an insulating fluid as a function of the voltage applied to it.
Y = breakdown time; X = voltage
Statistical illustrations:
- Recognizing the need for a log transformation of the response from the scatterplot and the residual plot
- Checking the simple linear regression fit with a lack-of-fit F-test
Stata output follows.
Simple regression: the residuals vs. fitted values plot presents increasing spread with increasing fitted values.
Next step: we try with Y logged, i.e. log(time).
Simple regression with Y logged
The residuals vs. fitted values plot does not present any obvious curvature or trend in spread.
Interpretation after log transformations

Model        Dependent Var.   Independent Var.   Interpretation of β1
Level-level  Y                X                  ∆y = β1·∆x
Level-log    Y                log(X)             ∆y = (β1/100)·%∆x
Log-level    log(Y)           X                  %∆y = (100·β1)·∆x
Log-log      log(Y)           log(X)             %∆y = β1·%∆x
Dependent variable logged
µ{log(Y)|X} = β0 + β1X is the same as
Median{Y|X} = e^(β0 + β1X)
(if the distribution of log(Y), given X, is symmetric).
As X increases by 1, what happens?
Median{Y|X = x+1} = e^(β0 + β1(x+1)) = e^(β0 + β1x) · e^(β1) = e^(β1) · Median{Y|X = x}
Interpretation of Y logged
"As X increases by 1, the median of Y changes by the multiplicative factor of e^(β1)."
Or, better:
If β1 > 0: "As X increases by 1, the median of Y increases by (e^(β1) - 1)·100%."
If β1 < 0: "As X increases by 1, the median of Y decreases by (1 - e^(β1))·100%."
Example: µ{log(time)|voltage} = β0 + β1·voltage; 1 - e^(-.507) ≈ .4
Estimated model: µ{log(time)|voltage} = 18.96 - .507·voltage
It is estimated that the median breakdown time decreases by 40% with each 1 kV increase in voltage.
[Figure: log of time until breakdown vs. VOLTAGE with fitted values; breakdown time (minutes) vs. VOLTAGE with fitted values.]
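The 40% figure is just arithmetic on the fitted slope, which can be checked directly:

```python
import math

# With slope -0.507 on log(time), each 1 kV increase multiplies
# the median breakdown time by e^(-0.507).
slope = -0.507
factor = math.exp(slope)                # multiplicative factor per kV
pct_decrease = (1 - factor) * 100       # percent decrease per kV
```

The factor is about 0.60, i.e. roughly a 40% decrease per kV, matching the slide's interpretation.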
If the explanatory variable (X) is logged
If µ{Y|log(X)} = β0 + β1·log(X), then: "Associated with each two-fold increase (i.e. doubling) of X is a β1·log(2) change in the mean of Y."
An example follows:
Example with X logged (Display 7.3 – Case 7.1):
Y = pH; X = time after slaughter (hrs.); estimated model: µ{Y|log(X)} = 6.98 - .73·log(X).
Since -.73 × log(2) ≈ -.5: "It is estimated that for each doubling of time after slaughter (between 0 and 8 hours) the mean pH decreases by .5."
[Figure: pH vs. ltime with fitted line; pH vs. TIME with fitted curve.]
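The per-doubling change is again a one-line check on the fitted slope:

```python
import math

# With slope -0.73 on log(X), doubling X changes the mean of Y
# by beta1 * log(2) (natural log).
beta1 = -0.73
change_per_doubling = beta1 * math.log(2)
```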
Both Y and X logged
µ{log(Y)|log(X)} = β0 + β1·log(X) is the same as
Median{Y|X} = e^(β0) · X^(β1)
(if the distribution of log(Y), given X, is symmetric).
As X doubles, what happens?
If β1 > 0: "As X doubles, the median of Y increases by (e^(β1·log(2)) - 1)·100%."
If β1 < 0: "As X doubles, the median of Y decreases by (1 - e^(β1·log(2)))·100%."
Example with Y and X logged (Display 8.1, page 207)
Y: number of species on an island; X: island area
µ{log(Y)|log(X)} = β0 + β1·log(X)
Estimated: µ{log(Y)|log(X)} = 1.94 + .25·log(X). Since e^(.25·log(2)) - 1 ≈ .19:
"Associated with each doubling of island area is a 19% increase in the median number of bird species."
Example: Log-Log
In order to graph the log-log plot we need to generate two new variables (the natural logarithms of Y and X).
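A sketch of that step, plus a check of the 19%-per-doubling interpretation (the area and species values below are invented placeholders, not the Display 8.1 data):

```python
import math

# Generate the two logged variables for the log-log plot.
# Placeholder island data, for illustration only.
area = [10.0, 100.0, 1000.0]
species = [20.0, 35.0, 60.0]
log_area = [math.log(a) for a in area]
log_species = [math.log(s) for s in species]

# With beta1 = 0.25, doubling island area multiplies the median
# species count by 2^0.25, about a 19% increase.
beta1 = 0.25
pct_increase = (math.exp(beta1 * math.log(2)) - 1) * 100
```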