Week 4
Multiple Regression models Estimation Goodness of fit tests
Announcements Midterm on May 5th in Week 6
I sent an email to online students to remind them about registering for the exam!
I reserved computer lab 658 on the 6th floor from 7:30-9:00pm to review SAS procedures for regression analysis.
Outline Last Week:
Straight line regression
This Week: Multiple regression analysis with several predictors; significance tests on predictors; goodness of fit test
Next 2 Weeks: Residual analysis (model diagnostics); computing predictions; non-linear models; model selection techniques
Airline Index Price vs Jet Fuel Spot Prices
[Scatter plot of Jet Fuel Spot Price vs Airline Index Price; weekly data from Jan 2007 to August 2008]
Another graph to visualize the relationship
Corr(Airline Index, Fuel Price) = -0.972
Regression line: Airline Index = 87.55209 - 0.1816 * Fuel Price
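As a quick worked illustration (the fuel price value here is chosen only as an example within the plotted range): at a jet fuel spot price of 300, the fitted line predicts an Airline Index of about 87.55209 - 0.1816(300) ≈ 33.1.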
More general regression model
Consider one Y variable and k independent variables Xi, e.g. X1, X2, X3. Data on n tuples (yi, xi1, xi2, xi3). Scatter plots show linear association between Y and the X-variables. The observations on y can be assumed to satisfy the following model:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + e_i \quad \text{for } i = 1, \dots, n$
(Here $y_i$ is the observed data, $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$ is the prediction, and $e_i$ is the error.)
Assumptions on the regression model
1. The relationship between the Y-variable and the X-variables is linear.
2. The error terms ei (measured by the residuals)
   – have zero mean (E(ei) = 0)
   – have the same standard deviation σe for each fixed x
   – are approximately normally distributed (typically true for large samples!)
   – are independent (true if the sample is a S.R.S.)
Such assumptions are necessary to derive the inferential methods for testing and prediction (to be seen later)!
WARNING: if the sample size is small (n<50) and errors are not normal, you can’t use regression methods!
Parameter estimates
Suppose we have a random sample of n observations on Y and on k independent X-variables.
How do we estimate the values of the coefficients β's in the regression model of Y versus the X-variables?
The parameter estimates are those values for the β's that minimize the sum of the squared errors.
Thus the parameter estimates are those values for the β's that will make the model residuals as small as possible!
The fitted model used to compute predictions for Y is
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k$
Regression parameter estimates
The parameter estimates minimize the sum of squared errors
$\sum_i (y_i - \hat{y}_i)^2 = \sum_i \left[y_i - (\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})\right]^2$
Using Linear Algebra for model estimation (section 12.9)
Let $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)^T$ be the parameter vector for the regression model in k variables.
The parameter estimates for the betas can be efficiently found using linear algebra as
$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T Y$
where X is the n x (k+1) data matrix for the X-variables (with a leading column of 1s for the intercept) and Y is the data vector for the response variable. $X^T$ is the transpose of the matrix X and $(X^T X)^{-1}$ denotes the inverse of the matrix $X^T X$.
Hard to compute by hand – better use a computer!
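As a minimal sketch (not part of the original slides), the matrix formula can be evaluated directly in SAS with PROC IML, assuming the cpu dataset created later in this lecture; the names X, Y and beta_hat are illustrative.

proc iml;
  use cpu;
  read all var {linet step device} into X;   /* predictor columns */
  read all var {time} into Y;                /* response column */
  close cpu;
  n = nrow(X);
  X = j(n, 1, 1) || X;                       /* prepend a column of 1s for the intercept */
  beta_hat = inv(X` * X) * X` * Y;           /* (X^T X)^{-1} X^T Y */
  print beta_hat;
quit;

The result should match the parameter estimates reported by PROC REG for the same model.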
EXAMPLE: CPU usage
A study was conducted to examine what factors affect CPU usage. A set of 38 processes written in a programming language was considered. For each program, data were collected on:
Y = CPU usage (time) in seconds,
X1 = number of lines (linet), in thousands, generated by the process execution,
X2 = number of programs (step) forming the process,
X3 = number of mounted computer devices (device).
Problem: Estimate the regression model of Y on X1,X2 and X3
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\,\text{LINET} + \hat{\beta}_2\,\text{STEP} + \hat{\beta}_3\,\text{DEVICE}$
I) Exploratory data step: Are the associations between Y and the x-variables linear? Draw the scatter plot for each pair (Y, Xi).
[Scatter plots of CPU time vs. lines executed in process, number of programs, and mounted devices]
Do the plots show linearity?
PROC REG - SAS OUTPUT The REG Procedure
Parameter Estimates
Variable    Label   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   ---      1   0.00147              0.01071          0.14      0.8920
linet       ---      1   0.02109              0.00271          7.79      <.0001
step        ---      1   0.00924              0.00210          4.41      <.0001
device      ---      1   0.01218              0.00288          4.23      0.0002
The fitted regression model is
$\hat{y} = 0.0014 + 0.021\,\text{LINET} + 0.009\,\text{STEP} + 0.012\,\text{DEVICE}$
Fitted model
The β's estimated values measure the changes in Y for changes in the X's.
For instance, for each increase of 1000 lines executed by the process (keeping the other variables fixed), the CPU usage time will increase by 0.021 seconds.
Fixing the other variables, what happens on the CPU time if I add another device?
The fitted regression model is
$\hat{y} = 0.0014 + 0.021\,\text{LINET} + 0.009\,\text{STEP} + 0.012\,\text{DEVICE}$
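As an illustration (the predictor values here are made up for the example, not from the data): a process with LINET = 10 (i.e. 10,000 lines), STEP = 5 and DEVICE = 2 would have predicted CPU time ŷ ≈ 0.0014 + 0.021(10) + 0.009(5) + 0.012(2) ≈ 0.28 seconds. Adding one more device, with the other variables fixed, raises the prediction by the device coefficient, about 0.012 seconds.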
Interpretation of model parameters
In multiple regression
The coefficient value βi of Xi measures the predicted change in Y for any unit increase in Xi while the other independent variables stay constant.
In the model
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + e_i \quad \text{for } i = 1, \dots, n,$
for instance, β2 measures the change in Y for a unit increase of the variable X2 if the other x-variables X1 and X3 are fixed.
SAS procedure for regression analysis: PROC REG
PROC REG <DATA=dataset-name>;
MODEL yvar = xvar1 xvar2 xvar3 xvar4;
PLOT yvar*xvar1;
RUN;
--------------------------------------------
MODEL: defines the variables in the model
PLOT: defines plots of yvar vs the x-variables defined in the MODEL statement. It can create multiple plots. Use the syntax:
PLOT yvar*(xvar1 xvar2);
---------------------------------------------
SAS procedures for correlation values
The CORR option in PROC REG computes the correlation matrix for all the variables listed in MODEL.
PROC REG CORR;
MODEL yvar = xvar1 xvar2 xvar3;
RUN;
Alternatively, use PROC CORR to compute correlations:
PROC CORR;
VAR yvar xvar1 xvar2;   <- list of variables
RUN;
Correlation values are computed for each pair of variables in the list.
SAS procedure for scatterplots: PROC GPLOT
PROC GPLOT;
PLOT yvar*xvar;
RUN;
Creates a scatter plot of yvar versus xvar – both variables must be defined in the SAS dataset.
PROC GPLOT;
PLOT yvar*xvar=zvar;
RUN;
Creates a scatter plot of yvar versus xvar where points are labeled according to the zvar classifier.
SAS Code for the CPU usage data
data cpu;
  infile "cpudat.txt";
  input time line step device;
  linet = line/1000;
  label time   = "CPU time in seconds"
        line   = "lines in program execution"
        step   = "number of computer programs"
        device = "mounted devices"
        linet  = "lines in program (thousand)";
run;

/* Exploratory data analysis */
/* computes correlation values between all variables in the dataset */
proc corr data=cpu;
run;

/* creates scatterplots of time vs linet, time vs step and time vs device, respectively */
proc gplot data=cpu;
  plot time*(linet step device);
run;

/* Regression analysis: fits model to predict time using linet, step and device */
proc reg data=cpu;
  model time = linet step device;
  plot time*linet / nostat;
run; quit;
If you want to fit a model with no intercept use the following model statement:
model time=linet step device / noint;
Are the estimated values accurate?
Residual Standard Deviation (pg. 632) Testing effects of individual variables (pg. 652- 655)
How do we measure the accuracy of the estimated parameter values? (section 11.2)
For a simple linear regression with one X, the standard deviations of the parameter estimates are defined as:
$\sigma_{\hat{\beta}_1} = \sigma_e\sqrt{\dfrac{1}{\sum_i (x_i - \bar{x})^2}} \qquad \sigma_{\hat{\beta}_0} = \sigma_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$
They are both functions of the error variance $\sigma_e$, regarded as a sort of standard deviation (spread) of the points around the line! The error variance is estimated by the residual standard deviation $s_e$ (a.k.a. root mean square error):
$s_e = \sqrt{\dfrac{\sum_i (y_i - \hat{y}_i)^2}{n-2}}$
(the differences $y_i - \hat{y}_i$ are the residuals!)
How do we interpret the residual standard deviation? It is used as a coarse approximation of the prediction error in predicting Y-values from the regression model. The probable error in new predictions is
+/- 2 se
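For instance, for the CPU usage model fitted in this lecture, the SAS output reports Root MSE se = 0.03459, so the probable error in a new CPU-time prediction is roughly +/- 2(0.0346) ≈ +/- 0.07 seconds.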
se also used in the formula of standard errors of parameter estimates:
They can be computed from the data and measure the noise in the parameter estimates
$s_{\hat{\beta}_0} = s_e\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}} \qquad s_{\hat{\beta}_1} = s_e\sqrt{\dfrac{1}{\sum_i (x_i - \bar{x})^2}}$
For general regression models with k x-variables
For k predictors (or explanatory variables), the standard errors of the parameter estimates have a complicated form, but they still depend on the error standard deviation $\sigma_e$.
The residual standard deviation or root mean square error is defined as
$s_e = \sqrt{MS(\text{Residual})} = \sqrt{\dfrac{SS(\text{Residual})}{n-(k+1)}} = \sqrt{\dfrac{\sum_i (y_i - \hat{y}_i)^2}{n-(k+1)}}$
where k+1 = number of parameters β's.
This measures the precision of our predictions!
The REG Procedure
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3        0.59705         0.19902    166.38   <.0001
Error              34        0.04067         0.00120
Corrected Total    37        0.63772

Root MSE          0.03459    R-Square   0.9362
Dependent Mean    0.15710    Adj R-Sq   0.9306
Coeff Var        22.01536
The root mean square error for the CPU usage regression model is computed above. That gives an estimate of the error standard deviation
se=0.03459
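As a check on the formula above: $s_e = \sqrt{SS(\text{Residual})/(n-(k+1))} = \sqrt{0.04067/34} \approx 0.0346$, which matches the Root MSE reported by SAS (here n = 38 observations and k = 3 predictors, so n-(k+1) = 34).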
Inference about regression parameters!
Regression estimates are affected by random error. The concepts of hypothesis testing and confidence intervals apply! The t-distribution is used to construct significance tests and confidence intervals for the "true" parameters of the population.
Tests are used to select those x-variables that have a significant effect on Y.
Tests on the slope for straight line regression
Consider the simple straight line case. We often test the null hypothesis that the slope is equal to zero, which is equivalent to "X has no effect on Y"! Or in statistical terms:
$H_0: \beta_1 = 0$ vs $H_a: \beta_1 \neq 0$ (X has a significant effect), $\beta_1 < 0$ (X has a negative effect), or $\beta_1 > 0$ (X has a positive effect)
The test is computed using the t-statistic
$t = \dfrac{\hat{\beta}_1 - 0}{s.e.(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{s_e\sqrt{1/\sum_i (x_i - \bar{x})^2}}$
with t-distribution with n-2 degrees of freedom!
Tests on the parameters in multiple regression
The significance test on parameter βj tests the hypothesis "Xj has no effect on Y", or in statistical terms
$H_0: \beta_j = 0$ vs $H_a: \beta_j \neq 0$ (Xj has a significant effect on Y), $\beta_j < 0$ (Xj has a negative effect on Y), or $\beta_j > 0$ (Xj has a positive effect on Y)
The test is given by the t-statistic
$t = \dfrac{\hat{\beta}_j}{s.e.(\hat{\beta}_j)}$
with t-distribution with n-(k+1) degrees of freedom.
Computed by SAS
Assumptions on the data:
1. e1, e2, …, en are independent of each other.
2. The ei are normally distributed with mean zero and have common variance σe².
Tests in SAS
The test p-values for regression coefficients are computed by PROC REG
SAS will produce the two-sided p-value. If your alternative hypothesis is one-sided (either > or <), then find the one-sided p-value by dividing the p-value computed by SAS by 2:
one-sided p-value = (two-sided p-value)/2
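For example, in the CPU usage output the two-sided p-value for device is 0.0002, so the one-sided p-value for testing whether device has a positive effect is 0.0002/2 = 0.0001 (halving is appropriate here because the estimated coefficient is positive, i.e. in the direction of the alternative).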
SAS Output The REG Procedure
Parameter Estimates
Variable    Label   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   ---      1   0.00147              0.01071          0.14      0.8920
linet       ---      1   0.02109              0.00271          7.79      <.0001
step        ---      1   0.00924              0.00210          4.41      <.0001
device      ---      1   0.01218              0.00288          4.23      0.0002
(the t Value column gives the t-statistic value; the Pr > |t| column gives the p-value)
T-tests on individual parameter values show that all the x-variables in the model are significant at 5% level (p-values <0.05). We conclude that there is a significant association between Y and each x-variable.
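For example, for linet the t-statistic is the estimate divided by its standard error, 0.02109/0.00271 ≈ 7.78 (7.79 in the output, which uses unrounded values), and its two-sided p-value <.0001 comes from the t-distribution with n-(k+1) = 34 degrees of freedom.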
Test on the intercept β0
The REG Procedure
Parameter Estimates

Variable    Label   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   ---      1   0.00147              0.01071          0.14      0.8920
linet       ---      1   0.02109              0.00271          7.79      <.0001
step        ---      1   0.00924              0.00210          4.41      <.0001
device      ---      1   0.01218              0.00288          4.23      0.0002
The test on the intercept says that the null hypothesis β0 = 0 should be accepted. The test p-value is 0.8920.
This means that the model should have no intercept ! This is not recommended though – unless you know that Y=0 when all the x-variables are equal to zero.
What do we do if a model parameter is not significant? If the t-test on a parameter βj shows that the parameter value is not significantly different from zero, we should:
1. Refit the regression model without the x-variable corresponding to βj.
2. Check if the remaining variables are significant.
Next
Model diagnostics: Evaluate the goodness of fit for the observed data Analyze if model assumptions are satisfied
Using the model for predictions Computing predictions and evaluating predictions
accuracy
How is a Linear Regression Analysis done? A Protocol
Steps of Regression Analysis
1. Examine the scatterplot of the data.
   Does the relationship look linear? Are there points in locations they shouldn't be? Do we need a transformation?
2. Assuming a linear function looks appropriate, estimate the regression parameters.
   Least squares estimates (done).
3. Examine the residuals for systematic inadequacies in the linear model as fit to the data. (next week)
   Is there evidence that a more complicated relationship (say a polynomial) should be examined? (Residual analysis.) Are there data points which do not seem to follow the proposed relationship? (Influential values or outliers.)
4. Test whether there really is a statistically significant linear relationship.
   Does the model fit the data adequately? Goodness of fit test (F-test for variances) (next week)
5. Using the model for predictions (next week)
Residual analysis (Sections 11.5 and 13.4)
What are the residuals?
Standard residuals and standardized/studentized residuals
Standard residuals measure the difference between the actual y-values and the values predicted by the regression model:
$r_i = y_i - \hat{y}_i$
For analysis we often use the studentized residuals to identify outliers, i.e. points that do not appear to be consistent with the rest of the data.
A studentized residual is computed as the i-th residual divided by its standard error:
$e_i = \dfrac{y_i - \hat{y}_i}{s.e.(y_i - \hat{y}_i)} \approx \dfrac{y_i - \hat{y}_i}{\sqrt{MSE}}$
Residual analysis
The studentized residuals follow an approximate t-distribution (or N(0,1) in large datasets).
Detecting outliers: Observations with |studentized ei| > 2 (i.e. larger than 2 or smaller than -2) may be outliers.
Use scatter plots of studentized residuals vs predicted values, and studentized residuals vs x-values.
Residual analysis may display problems in the regression analysis and shows if there is some important variation in Y that is not explained by the regression model.
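A minimal SAS sketch (not part of the original lecture code) of how one might save studentized residuals with the OUTPUT statement of PROC REG and flag possible outliers, assuming the cpu dataset and model used in this lecture; the names cpu_resid, pred, resid, stud and possible_outliers are illustrative.

proc reg data=cpu;
  model time = linet step device;
  /* save predicted values, raw residuals and studentized residuals */
  output out=cpu_resid p=pred r=resid student=stud;
run; quit;

/* keep observations with |studentized residual| > 2 as possible outliers */
data possible_outliers;
  set cpu_resid;
  if abs(stud) > 2;
run;

proc print data=possible_outliers;
run;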
Residual Plots
The analysis of the residuals is useful to detect possible problems and anomalies in the regression
Residual plots are scatter plots of the regression residuals against the X variables or the predicted values.
If model assumptions hold, points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).
Checking Assumptions
How well does the model fit?
Do predicted values seem to be placed in the middle of the observed values? Do the data satisfy the regression assumptions? (Problems seen in the plot of X vs. Y will be reflected in the residual plot.)
• Constant variance?
• Systematic patterns suggesting lack of independence or a more complex model?
• Poorly fit observations?
[Residual vs. X plots. "Good case": points are randomly scattered around the zero line. "Bad cases": non-linear relationship, non-constant variance.]
How do I know I have a problem?
1) Plot predicted values versus residuals.
What is the pattern of the spread in the residuals as the predicted values increase?
• Spread constant: acceptable.
• Spread increases: problem.
• Spread decreases then increases: problem.
[Sketches of residuals vs. predicted values illustrating the acceptable and problem patterns]
Residuals are computed as "observed y – predicted y": $r_i = y_i - \hat{y}_i$
Homoscedasticity (constant variance)
Residual Analysis
Residual analysis may display problems in the regression analysis – it shows if there is some important variation in Y that is not explained by the regression model.
Plot residuals vs predicted values: to check the linearity assumption & constant variance for the errors;
  plot residual.*predicted.;  (PROC REG)   or   plot student.*predicted.;
Plot residuals vs each x-variable: to check the linearity assumption for Y and the x-variable;
  plot residual.*xvar;  (PROC REG)   or   plot student.*xvar;
Draw a normal probability plot of residuals: to check the normality assumption for the error terms; if the points lie close to a line, the errors can be assumed to be approximately normal. Otherwise the assumption of normality is not satisfied.
  plot npp.*residual.;  (PROC REG)   or   plot npp.*student.;
Note the "." in the plot keywords.
Plot residuals vs predicted values – to check the linearity assumption & constant variance for the errors:
  PLOT student.*predicted.;
Plot residuals vs each x-variable – to check the linearity assumption for Y and the x-variable:
  PLOT student.*linet;
Look for non-random patterns as they may be signs that model assumptions are not satisfied by the data
Residual plots using studentized residuals
Normal probability plot of residuals
Draw a normal probability plot of the residuals.
To check normality assumption for the error terms; if points lie close to a line, the errors can be assumed to be approximately normal. Otherwise the assumption of normality is not satisfied.
SAS: Plot npp.*student.; (PROC REG)
Points look fairly linear. Assumption of normality is satisfied!
SAS Code for the CPU usage data
data cpu;
  infile "cpudat.txt";
  input time line step device;
  linet = line/1000;
  label time   = "CPU time in seconds"
        line   = "lines in program execution"
        step   = "number of computer programs"
        device = "mounted devices"
        linet  = "lines in program (thousand)";
run;

/* Exploratory data analysis */
proc corr data=cpu;
run;

proc gplot data=cpu;
  plot time*(linet step device);
run;

/* Regression analysis */
proc reg data=cpu;
  model time = linet step device;   */noint;
  plot (residual. student.)*predicted. / nostat;
  plot student.*(linet step device) / nostat hplots=2 vplots=3;
  plot npp.*student. / nostat;
run; quit;
If you want to fit a model with no intercept use the following model statement:
model time=linet step device / noint;