lecture 2 linear regression: a model for the mean
TRANSCRIPT
Lecture 2 Linear Regression: A Model for the Mean
Sharyn O’Halloran
Spring 2005 2U9611
Closer Look at:
- Linear Regression Model
- Least squares procedure
- Inferential tools
- Confidence and Prediction Intervals
- Assumptions
- Robustness
- Model checking
- Log transformation (of Y, X, or both)
Linear Regression: Introduction
Data: (Yi, Xi) for i = 1,...,n
Interest is in the probability distribution of Y as a function of X
Linear Regression model: the mean of Y is a straight-line function of X, plus an error term or residual. The goal is to find the best-fit line, the one that minimizes the sum of the squared errors.
Estimated regression line: Steer example (see Display 7.3, p. 177)
[Figure: PH vs. ltime with fitted values; intercept = 6.98, slope = -.73]
Equation for estimated regression line (the fitted line; the vertical deviations are the error terms):
Ŷ = 6.98 - .73X
Create a new variable ltime = log(time), then run the regression analysis.
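As a rough sketch of that variable-creation step outside Stata (the time values below are made-up placeholders, not the course data):

```python
import math

# Hypothetical time-after-slaughter values in hours (illustrative only)
time_hr = [1.0, 2.0, 4.0, 6.0, 8.0]

# Create the new variable ltime = log(time), as on the slide
ltime = [math.log(t) for t in time_hr]
```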
Regression Terminology
Regression: the mean of a response variable as a function of one or more explanatory variables:
µ{Y | X}
Regression model: an ideal formula to approximate the regression.
Simple linear regression model:
µ{Y | X} = β0 + β1X
where β0 (intercept) and β1 (slope) are the unknown parameters; µ{Y | X} reads "mean of Y given X" or "regression of Y on X".
Y's probability distribution is to be explained by X.
β0 and β1 are the regression coefficients. (See Display 7.5, p. 180)
Note: Y = β0 + β1X is NOT simple regression.
Y: response variable, explained variable, dependent variable
X: control variable, explanatory variable, independent variable
Regression Terminology: Estimated coefficients
β0 + β1X : the unknown regression line
β̂0 + β̂1X : the estimated regression line
Choose β̂0 and β̂1 to make the residuals small.
Regression Terminology
Fitted value for obs. i is its estimated mean:
Ŷi = fit_i = µ̂{Y|X} = β̂0 + β̂1X_i
Residual for obs. i:
res_i = Y_i - fit_i, i.e. e_i = Y_i - Ŷ_i
Least Squares: the statistical estimation method that finds the estimates minimizing the sum of squared residuals,
Σ_{i=1..n} (y_i - (β̂0 + β̂1x_i))²
Solution (from calculus) on p. 182 of Sleuth.
Least Squares Procedure
The least-squares procedure obtains estimates of the linear equation coefficients β0 and β1, in the model
ŷ_i = β0 + β1x_i
by minimizing the sum of the squared residuals or errors (e_i):
SSE = Σ e_i² = Σ (y_i - ŷ_i)²
This results in a procedure stated as: choose β0 and β1 so that
SSE = Σ e_i² = Σ (y_i - (β0 + β1x_i))²
is minimized.
Least Squares Procedure
The slope coefficient estimator is
β̂1 = Σ_{i=1..n} (x_i - X̄)(y_i - Ȳ) / Σ_{i=1..n} (x_i - X̄)² = r_xy · (s_Y / s_X)
that is, the correlation between X and Y times the standard deviation of Y over the standard deviation of X.
And the constant or intercept estimator is
β̂0 = Ȳ - β̂1X̄
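The closed-form estimators above can be checked numerically; a minimal Python sketch, using made-up toy data (not from the lecture):

```python
# Least-squares estimates via the closed-form formulas on this slide.
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Numerator: sum of cross-products about the means
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
# Denominator: sum of squared deviations of x
sxx = sum((xi - xbar) ** 2 for xi in x)

b1 = sxy / sxx          # slope estimate
b0 = ybar - b1 * xbar   # intercept: the line passes through (X̄, Ȳ)
```

With these four points the slope is 1.9 and the intercept is 0, and the fitted line passes through the point of means, as the next slide notes.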
Least Squares Procedure (cont.)
Note that the regression line always goes through the mean (X̄, Ȳ).
[Figure: Relation Between Yield (Bushel/Acre) and Fertilizer (lb/Acre), with trend line]
That is, for any value of the independent variable there is a single most likely value for the dependent variable. Think of this regression line as the expected value of Y for a given value of X.
Tests and Confidence Intervals for β0, β1
Degrees of freedom: (n-2) = sample size - number of coefficients
Variance estimate: σ̂² = (sum of squared residuals)/(n-2)
Standard errors (p. 184). Under the ideal normal model, the sampling distributions of β̂0 and β̂1 have the shape of a t-distribution on (n-2) d.f.
Do t-tests and CIs as usual (df = n-2).
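A short sketch of these quantities, continuing the same invented toy data (the t statistic would then be compared with a t distribution on n-2 d.f.):

```python
import math

# Toy data (illustrative); least-squares fit as before
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Variance estimate: sum of squared residuals / (n - 2)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2 = sum(e ** 2 for e in resid) / (n - 2)

# Standard error of the slope, and the t statistic for H0: beta1 = 0
se_b1 = math.sqrt(sigma2 / sxx)
t_stat = b1 / se_b1
```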
Confidence intervals
P-values for H0: β = 0
Inference Tools
Hypothesis Test and Confidence Interval for the mean of Y at some X:
Estimate the mean of Y at X = X0 by
µ̂{Y|X0} = β̂0 + β̂1X0
Standard error of µ̂{Y|X0}:
SE[µ̂{Y|X0}] = σ̂ · sqrt(1/n + (X0 - X̄)² / ((n-1)s_x²))
Conduct t-test and confidence interval in the usual way (df = n-2).
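The standard-error formula can be sketched directly, again on the invented toy data (X0 = 3 is an arbitrary choice):

```python
import math

# SE of the estimated mean of Y at X0, for the toy data used earlier
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)   # equals (n-1) * s_x^2
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sigma_hat = math.sqrt(
    sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
)

x0 = 3.0
mu_hat = b0 + b1 * x0                       # estimated mean at X0
se_mean = sigma_hat * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
```

Note the (X0 - X̄)² term: the SE grows as X0 moves away from X̄, which is exactly the hourglass shape of the confidence bands on the next slide.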
Confidence bands for conditional means
The lfitci command automatically calculates and graphs the confidence bands.
Confidence bands in simple regression have an hourglass shape, narrowest at the mean of X.
Prediction
Prediction of a future Y at X = X0:
Pred(Y|X0) = µ̂{Y|X0}
Standard error of prediction:
SE[Pred(Y|X0)] = sqrt(σ̂² + SE[µ̂{Y|X0}]²)
(σ̂² captures the variability of Y about its mean; SE[µ̂{Y|X0}]² the uncertainty in the estimated mean)
95% prediction interval:
Pred(Y|X0) ± t_df(.975) × SE[Pred(Y|X0)]
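A sketch of the prediction standard error on the same toy data; it combines the two sources of uncertainty and is therefore always larger than the SE of the mean:

```python
import math

# Standard error for predicting a NEW observation Y at X0 = 3:
# SE[Pred]^2 = sigma^2 (variability of Y about its mean)
#            + SE[mu_hat]^2 (uncertainty in the estimated mean)
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
sigma2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 3.0
se_mean2 = sigma2 * (1 / n + (x0 - xbar) ** 2 / sxx)  # SE[mu_hat]^2
se_pred = math.sqrt(sigma2 + se_mean2)
# 95% PI: (b0 + b1*x0) ± t_{n-2}(.975) * se_pred
```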
Residuals vs. predicted values plot
After any regression analysis we can automatically draw a residual-versus-fitted plot just by typing rvfplot.
Predicted values (yhat)
After any regression, the predict command can create a new variable yhat containing the predicted Y values.
Residuals (e)
The predict command with the resid option can create a new variable e containing the residuals.
The residual-versus-predicted-values plot could be drawn "by hand" using these commands.
Second type of confidence interval for regression prediction: the "prediction band".
This expresses our uncertainty in estimating the unknown value of Y for an individual observation with known X value.
Command: lfitci with the stdf option.
Additional note: predict can generate two kinds of standard errors for the predicted y value, which have two different applications.
[Figure: Distance vs. VELOCITY, two panels. Left: confidence bands for conditional means (stdp). Right: confidence bands for individual-case predictions (stdf).]
[Figure repeated: the stdp band gives the 95% confidence interval for µ{Y|1000}; the stdf band gives the 95% prediction interval for Y at X = 1000.]
Confidence band: a set of confidence intervals for µ{Y|X0}.
Calibration interval: values of X for which Y0 is in a prediction interval.
Notes about confidence and prediction bands
Both are narrowest at the mean of X. Beware of extrapolation.
The width of the confidence interval shrinks to zero as n grows large; this is not true of the prediction interval.
Review of simple linear regression
1. Model with constant variance:
µ{Y|X} = β0 + β1X
var{Y|X} = σ²
2. Least squares: choose estimators β̂0 and β̂1 to minimize the sum of squared residuals:
β̂1 = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / Σ_{i=1..n} (X_i - X̄)²
β̂0 = Ȳ - β̂1X̄
res_i = Y_i - β̂0 - β̂1X_i  (i = 1,…,n)
σ̂² = Σ_{i=1..n} res_i² / (n-2)
3. Properties of estimators:
SE(β̂1) = σ̂ / sqrt((n-1)s_x²)
SE(β̂0) = σ̂ · sqrt(1/n + X̄² / ((n-1)s_x²))
Assumptions of Linear Regression
A linear regression model assumes:
- Linearity: µ{Y|X} = β0 + β1X
- Constant variance: var{Y|X} = σ²
- Normality: the distribution of Y's at any X is normal
- Independence: given the X_i's, the Y_i's are independent
Examples of Violations: Non-Linearity
The true relation between the independent and dependent variables may not be linear.
For example, consider campaign fundraising and the probability of winning an election: the probability of winning increases with each additional dollar spent and then levels off after $50,000.
[Figure: P(w) vs. Spending, "Probability of Winning an Election".]
Consequences of violation of linearity
If "linearity" is violated, misleading conclusions may occur (however, the degree of the problem depends on the degree of non-linearity).
Examples of Violations: Constant Variance
Constant Variance or Homoskedasticity. The homoskedasticity assumption implies that, on average, we do not expect to get larger errors in some cases than in others.
Of course, due to the luck of the draw, some errors will turn out to be larger than others. But homoskedasticity is violated only when this happens in a predictable manner.
Example: income and spending on certain goods. People with higher incomes have more choices about what to buy. We would expect that their consumption of certain goods is more variable than for families with lower incomes.
Violation of constant variance
[Figure: Spending vs. Income scatter with fitted line; observations 1-10 with residuals marked, e.g. ε9 = Y9 - (a + bX9) and ε6 = Y6 - (a + bX6).]
As income increases so do the errors (vertical distances from the predicted line), so the relation between income and spending violates homoskedasticity.
Consequences of non-constant variance
If “constant variance” is violated, LS estimates are still unbiased but SEs, tests, Confidence Intervals, and Prediction Intervals are incorrect
However, the degree depends…
Violation of Normality: Non-Normality
Nicotine use is characterized by a large number of people not smoking at all and another large number of people who smoke every day: an example of a bimodal distribution.
[Figure: Frequency of nicotine use.]
Consequence of non-Normality
If "normality" is violated:
- LS estimates are still unbiased
- tests and CIs are quite robust
- PIs are not
Of all the assumptions, this is the one that we need to be least worried about violating. Why?
Violation of Independence: Non-Independence
The independence assumption means that the error terms of different observations do not influence one another.
Technically, the RESIDUALS or error terms are uncorrelated.
The most common violation occurs with data that are collected over time, i.e. time series analysis.
Example: high tariff rates in one period are often associated with very high tariff rates in the next period.
Example: nominal GNP and consumption; the residuals of GNP and consumption over time are highly correlated.
Consequence of non-independence
If "independence" is violated:
- LS estimates are still unbiased
- everything else can be misleading
[Figure: Log Height vs. Log Weight; plotting code is litter (5 mice from each of 5 litters). Note that mice from litters 4 and 5 have higher weight and height.]
Robustness of least squares
The "constant variance" assumption is important.
Normality is not too important for confidence intervals and p-values, but is important for prediction intervals.
Long-tailed distributions and/or outliers can heavily influence the results.
Non-independence problems: serial correlation (Ch. 15) and cluster effects (we deal with this in Ch. 9-14).
Strategy for dealing with these potential problems:
- Plots; residual plots; consider outliers (more in Ch. 11)
- Log transformations (Display 8.6)
Tools for model checking
- Scatterplot of Y vs. X (see Display 8.6, p. 213)*
- Scatterplot of residuals vs. fitted values*
*Look for curvature, non-constant variance, and outliers.
- Normal probability plot (p. 224): sometimes useful for checking whether the distribution is symmetric or normal (i.e. for PIs).
- Lack-of-fit F-test when there are replicates (Section 8.5).
Scatterplot of Y vs. X
Command: graph twoway Y X. Case study: 7.01, page 175.
Scatterplot of residuals vs. fitted values
Command: rvfplot, yline(0)… Case study: 7.01, page 175.
Normal probability plot (p. 224)
Command: qnorm variable, grid. Case study: 7.01, page 175.
Quantile-normal plots compare quantiles of a variable's distribution with quantiles of a normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of the distribution.
Diagnostic plots of residuals
- Plot residuals versus fitted values almost always: for simple regression this is about the same as residuals vs. X. Look for outliers, curvature, increasing spread (funnel or horn shape); then take appropriate action.
- If data were collected over time, plot residuals versus time: check for time trend and serial correlation.
- If normality is important, use a normal probability plot: a straight line is expected if the distribution is normal.
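The residual-vs-fitted pairs behind such a plot can be computed by hand; a sketch on the invented toy data used earlier (least-squares residuals always sum to zero, so the diagnostic looks for patterns, not level):

```python
# Residual-vs-fitted values, computed by hand (what rvfplot displays).
# Toy data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]
# Plot resid against fitted and look for curvature or a funnel shape;
# the residuals themselves always average zero by construction.
```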
Voltage Example (Case Study 8.1.2)
Goal: to describe the distribution of breakdown time of an insulating fluid as a function of the voltage applied to it.
Y = breakdown time; X = voltage
Statistical illustrations:
- Recognizing the need for a log transformation of the response from the scatterplot and the residual plot
- Checking the simple linear regression fit with a lack-of-fit F-test
Stata output follows.
Simple regression: the residuals vs. fitted values plot presents increasing spread with increasing fitted values.
Next step: we try with Y logged, i.e. log(time).
Simple regression with Y logged
The residuals vs. fitted values plot does not present any obvious curvature or trend in spread.
Interpretation after log transformations

Model        Dependent Var.   Independent Var.   Interpretation of β1
Level-level  Y                X                  ∆y = β1·∆x
Level-log    Y                log(X)             ∆y = (β1/100)·%∆x
Log-level    log(Y)           X                  %∆y = (100·β1)·∆x
Log-log      log(Y)           log(X)             %∆y = β1·%∆x
Dependent variable logged
µ{log(Y)|X} = β0 + β1X is the same as
Median{Y|X} = e^(β0 + β1X)
(if the distribution of log(Y), given X, is symmetric).
As X increases by 1, what happens?
Median{Y|X = x+1} = e^(β0 + β1(x+1)) = e^(β0 + β1x) · e^(β1) = e^(β1) · Median{Y|X = x}
Interpretation of Y logged
"As X increases by 1, the median of Y changes by the multiplicative factor of e^(β1)."
Or, better:
If β1 > 0: "As X increases by 1, the median of Y increases by (e^(β1) - 1)·100%."
If β1 < 0: "As X increases by 1, the median of Y decreases by (1 - e^(β1))·100%."
Example: µ{log(time)|voltage} = β0 + β1·voltage; 1 - e^(-.507) ≈ .4
Estimated model: µ{log(time)|voltage} = 18.96 - .507·voltage
It is estimated that the median breakdown time decreases by 40% with each 1 kV increase in voltage.
[Figure: log of time until breakdown vs. VOLTAGE with fitted values; breakdown time (minutes) vs. VOLTAGE with fitted values.]
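The 40% figure is just arithmetic on the fitted slope, which can be checked directly:

```python
import math

# With slope -0.507 on log(time), each 1 kV increase multiplies
# the median breakdown time by e^(-0.507).
slope = -0.507
factor = math.exp(slope)                # multiplicative factor per kV
pct_decrease = (1 - factor) * 100       # percent decrease per kV
```

The factor is about 0.60, i.e. roughly a 40% decrease per kV, matching the slide's interpretation.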
If the explanatory variable (X) is logged
If µ{Y|log(X)} = β0 + β1·log(X), then: "Associated with each two-fold increase (i.e. doubling) of X is a β1·log(2) change in the mean of Y."
An example follows:
Example with X logged (Display 7.3 – Case 7.1):
Y = pH; X = time after slaughter (hrs.); estimated model: µ{Y|log(X)} = 6.98 - .73·log(X).
Since -.73 × log(2) ≈ -.5: "It is estimated that for each doubling of time after slaughter (between 0 and 8 hours) the mean pH decreases by .5."
[Figure: pH vs. ltime with fitted line; pH vs. TIME with fitted curve.]
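The per-doubling change is again a one-line check on the fitted slope:

```python
import math

# With slope -0.73 on log(X), doubling X changes the mean of Y
# by beta1 * log(2) (natural log).
beta1 = -0.73
change_per_doubling = beta1 * math.log(2)
```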
Both Y and X logged
µ{log(Y)|log(X)} = β0 + β1·log(X) is the same as
Median{Y|X} = e^(β0) · X^(β1)
(if the distribution of log(Y), given X, is symmetric).
As X doubles, what happens?
If β1 > 0: "As X doubles, the median of Y increases by (e^(β1·log(2)) - 1)·100%."
If β1 < 0: "As X doubles, the median of Y decreases by (1 - e^(β1·log(2)))·100%."
Example with Y and X logged (Display 8.1, page 207)
Y: number of species on an island; X: island area
µ{log(Y)|log(X)} = β0 + β1·log(X)
Estimated: µ{log(Y)|log(X)} = 1.94 + .25·log(X). Since e^(.25·log(2)) - 1 ≈ .19:
"Associated with each doubling of island area is a 19% increase in the median number of bird species."
Example: Log-Log
In order to graph the log-log plot we need to generate two new variables (the natural logarithms of Y and X).
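A sketch of that step, plus a check of the 19%-per-doubling interpretation (the area and species values below are invented placeholders, not the Display 8.1 data):

```python
import math

# Generate the two logged variables for the log-log plot.
# Placeholder island data, for illustration only.
area = [10.0, 100.0, 1000.0]
species = [20.0, 35.0, 60.0]
log_area = [math.log(a) for a in area]
log_species = [math.log(s) for s in species]

# With beta1 = 0.25, doubling island area multiplies the median
# species count by 2^0.25, about a 19% increase.
beta1 = 0.25
pct_increase = (math.exp(beta1 * math.log(2)) - 1) * 100
```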