
Post on 19-Dec-2015


Stat 112: Lecture 17 Notes

• Chapter 6.8: Assessing the Assumption that the Disturbances are Independent

• Chapter 7.1: Using and Interpreting Indicator Variables.

Time Series Data and Autocorrelation

• When Y is a variable collected for the same entity (person, state, country) over time, we call the data time series data.

• For time series data, we need to consider the independence assumption for the simple and multiple regression model.

• Independence Assumption: The residuals are independent of one another. This means that if the residual is positive this year, next year's residual must be equally likely to be positive or negative, i.e., there is no autocorrelation.

• Positive autocorrelation: Positive residuals are more likely to be followed by positive residuals than by negative residuals.

• Negative autocorrelation: Positive residuals are more likely to be followed by negative residuals than by positive residuals.

Ski Ticket Sales

• Christmas week is a critical period for most ski resorts.

• A ski resort in Vermont wanted to determine the effect that weather had on its sale of lift tickets during Christmas week.

• Data from the past 20 years:
Yi = lift tickets sold during Christmas week in year i
Xi1 = snowfall during Christmas week in year i
Xi2 = average temperature during Christmas week in year i

• Data in skitickets.JMP

Response Tickets

Parameter Estimates
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     8308.0114    903.7285    9.19      <.0001
Snowfall      74.593249    51.57483    1.45      0.1663
Temperature   -8.753738    19.70436    -0.44     0.6625

[Figure: Bivariate Fit of Residual Tickets By Year. Vertical axis: Residual Tickets (-3000 to 3000); horizontal axis: Year (0 to 20).]

Residuals suggest positive autocorrelation.

Durbin-Watson Test of Independence

• The Durbin-Watson test is a test of whether the residuals are independent.

• The null hypothesis is that the residuals are independent and the alternative hypothesis is that the residuals are autocorrelated (either positively or negatively).

• The test works by computing the correlation of consecutive residuals.

• To compute the Durbin-Watson test in JMP, after Fit Model, click the red triangle next to Response, click Row Diagnostics and click Durbin-Watson Test. Then click the red triangle next to Durbin-Watson to get the p-value.

• For the ski ticket data, p-value = 0.0002. Strong evidence of autocorrelation.

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.5931403       20               0.5914            0.0002
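As a rough illustration of what the statistic measures, the sketch below computes the Durbin-Watson statistic under its usual textbook definition: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 are consistent with no autocorrelation, values toward 0 point to positive autocorrelation, and values toward 4 point to negative autocorrelation. The residual values are made up for illustration (they are not the ski ticket data), the function names are hypothetical, and JMP's p-value calculation is not reproduced.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def lag1_autocorrelation(residuals):
    """Simple lag-1 autocorrelation estimate of the residual series."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

# Made-up residuals, ordered in time (assumption for illustration only).
trending = [-3.0, -2.0, -1.5, -0.5, 0.5, 1.0, 2.5, 3.0]      # positives follow positives
alternating = [2.0, -2.0, 2.0, -2.0, 2.0, -2.0, 2.0, -2.0]   # signs flip every period

print(durbin_watson(trending), lag1_autocorrelation(trending))        # 0.1875 0.625
print(durbin_watson(alternating), lag1_autocorrelation(alternating))  # 3.5 -0.875
```

The trending series gives a statistic far below 2, which matches the pattern in the ski ticket residual plot, where runs of residuals with the same sign suggest positive autocorrelation.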

Remedies for Autocorrelation

• Add a time variable to the regression.

• Add a lagged dependent (Y) variable to the regression. In JMP, we can do this by creating a new column, right-clicking it and clicking Formula, then clicking Row, clicking Lag and then clicking the Y variable.

• After adding these variables, refit the model and then recheck the Durbin-Watson statistic to see if autocorrelation has been removed.
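The remedies above can be sketched outside JMP as follows. This is a minimal sketch on synthetic data (an assumption for illustration, not the ski or sales data): the response is generated with a linear time trend, so omitting the time variable leaves trending residuals. The helper names `fit_ols` and `durbin_watson` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and residuals."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef

def durbin_watson(e):
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Synthetic series (assumption): y depends on x and on a linear trend in year.
rng = np.random.default_rng(0)
n = 40
year = np.arange(1, n + 1)
x = rng.normal(size=n)
y = 10 + 2 * x + 0.5 * year + rng.normal(scale=1.0, size=n)

# Model without the time variable: the trend ends up in the residuals.
_, resid_without = fit_ols(x, y)

# Remedy 1: add the time variable as an explanatory variable.
_, resid_with_time = fit_ols(np.column_stack([x, year]), y)

# Remedy 2: add the lagged response. The first observation has no lag,
# so the refit uses n - 1 rows.
y_lag = y[:-1]
_, resid_with_lag = fit_ols(np.column_stack([x[1:], y_lag]), y[1:])

print("DW without time:", durbin_watson(resid_without))
print("DW with time:   ", durbin_watson(resid_with_time))
print("DW with lag:    ", durbin_watson(resid_with_lag), "rows:", len(resid_with_lag))
```

Dropping the first row when a lag is added is why a lagged-variable fit reports one fewer observation than the original fit.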

Response Tickets

Parameter Estimates
Term          Estimate     Std Error   t Ratio   Prob>|t|
Intercept     5965.5876    631.2518    9.45      <.0001
Snowfall      70.183059    28.85142    2.43      0.0271
Temperature   -9.232802    11.01971    -0.84     0.4145
Year          229.96997    37.13209    6.19      <.0001

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
1.8849875       20               0.0405            0.3512

[Figure: Bivariate Fit of Residual Tickets 3 By Year. Vertical axis: Residual Tickets 3 (-2000 to 2000); horizontal axis: Year (0 to 20).]

No evidence of autocorrelation once Year has been added as an explanatory variable

Example 6.10 in book

Response SALES

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   -632.6945    47.27697    -13.38    <.0001
ADV         0.1772326    0.007045    25.16     <.0001

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation   Prob<DW
0.4672937       36               0.7091            <.0001

Strong evidence of autocorrelation.

[Figure: Bivariate Fit of Residual SALES By Year. Vertical axis: Residual SALES (-100 to 50); horizontal axis: Year (1965 to 2005).]

Response SALES

Parameter Estimates
Term           Estimate     Std Error   t Ratio   Prob>|t|
Intercept      -234.4752    78.06875    -3.00     0.0051
ADV            0.0630703    0.020228    3.12      0.0038
Lagged Sales   0.6751139    0.112302    6.01      <.0001

Durbin-Watson
Durbin-Watson   Number of Obs.   AutoCorrelation
2.3330219       35               -0.2063

Adding Lagged Sales removes the autocorrelation.

[Figure: Bivariate Fit of Residual SALES 2 By Year. Vertical axis: Residual SALES 2 (-100 to 40); horizontal axis: Year (1965 to 2005).]

Categorical variables

• Categorical (nominal) variables: Variables that define group membership, e.g., sex (male/female), color (blue/green/red), county (Bucks County, Chester County, Delaware County, Philadelphia County).

• How to use categorical variables as explanatory variables in regression analysis:

– If the variable has two categories (e.g., sex (male/female), rain or no rain, snow or no snow), we define a variable that equals 1 for one of the categories and 0 for the other category.

Predicting Emergency Calls to the AAA Club

Response Calls

Summary of Fit
RSquare                      0.692384
RSquare Adj                  0.584719
Root Mean Square Error       1735.151
Mean of Response             4318.75
Observations (or Sum Wgts)   28

Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175

Rain forecast = 1 if rain is in forecast, 0 if not
Snow forecast = 1 if snow is in forecast, 0 if not
Weekday = 1 if weekday, 0 if not

Comparing Toy Factory Managers

• An analysis has shown that the time required to complete a production run in a toy factory increases with the number of toys produced. Data were collected for the time required to process 20 randomly selected production runs as supervised by three managers (A, B and C). Data in toyfactorymanager.JMP.

• How do the managers compare?

Marginal Comparison

• Marginal comparison could be misleading. We know that large production runs with more toys take longer than small runs with few toys. How can we be sure that Manager C has not simply been supervising very small production runs?

• Solution: Run a multiple regression in which we include size of the production run as an explanatory variable along with manager, in order to control for size of the production run.

[Figure: Oneway Analysis of Time for Run By Manager. Vertical axis: Time for Run (150 to 300); horizontal axis: Manager (a, b, c).]

Including Categorical Variable in Multiple Regression: Wrong Approach

• We could assign codes to the managers, e.g., Manager A = 0, Manager B = 1, Manager C = 2.

• This model says that for the same run size, Manager B is 31 minutes faster than Manager A and Manager C is 31 minutes faster than Manager B.

• This model restricts the difference between Manager A and B to be the same as the difference between Manager B and C – we have no reason to do this.

• If we use a different coding for Manager, we get different results, e.g., Manager B=0, Manager A=1, Manager C=2

Parameter Estimates (coding Manager A = 0, Manager B = 1, Manager C = 2)
Term            Estimate     Std Error   t Ratio   Prob>|t|
Intercept       211.92804    7.212609    29.38     <.0001
Run Size        0.2233844    0.029184    7.65      <.0001
Managernumber   -31.03612    3.056054    -10.16    <.0001

Parameter Estimates (coding Manager B = 0, Manager A = 1, Manager C = 2)
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        188.63636    12.73082    14.82     <.0001
Run Size         0.2103122    0.048921    4.30      <.0001
Managernumber2   -5.008207    5.122956    -0.98     0.3324

Under the second coding, Manager A is estimated to be 5 minutes faster than Manager B.
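To see how the numeric coding drives the conclusions, the sketch below plugs the two sets of estimates reported above into the fitted equations. The run size of 100 is an arbitrary choice for illustration, and the function and variable names are hypothetical.

```python
def predicted_time(intercept, run_size_coef, manager_coef, run_size, code):
    """Fitted time when manager enters the model as a single numeric code."""
    return intercept + run_size_coef * run_size + manager_coef * code

# (intercept, Run Size coefficient, manager-code coefficient) from the two fits above.
fit_1 = (211.92804, 0.2233844, -31.03612)   # coding A=0, B=1, C=2
fit_2 = (188.63636, 0.2103122, -5.008207)   # coding B=0, A=1, C=2

coding_1 = {"A": 0, "B": 1, "C": 2}
coding_2 = {"B": 0, "A": 1, "C": 2}

run_size = 100  # arbitrary run size chosen for illustration
pred_1 = {m: predicted_time(*fit_1, run_size, c) for m, c in coding_1.items()}
pred_2 = {m: predicted_time(*fit_2, run_size, c) for m, c in coding_2.items()}

for name, pred in (("coding 1", pred_1), ("coding 2", pred_2)):
    print(name, {m: round(t, 2) for m, t in sorted(pred.items())})

# Under coding 1 the A-to-B and B-to-C gaps are forced to be identical.
print(round(pred_1["B"] - pred_1["A"], 5), round(pred_1["C"] - pred_1["B"], 5))
```

With the first coding the managers are ranked A, B, C from slowest to fastest with equal gaps; with the second coding B becomes the slowest. The data are the same in both fits, so the change comes entirely from the arbitrary codes.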

Including Categorical Variable in Multiple Regression: Right Approach

• Create an indicator (dummy) variable for each category.

• Manager[a] = 1 if Manager is A, 0 if Manager is not A

• Manager[b] = 1 if Manager is B, 0 if Manager is not B

• Manager[c] = 1 if Manager is C, 0 if Manager is not C

Response Time for Run

Expanded Estimates
Nominal factors expanded to all levels
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    176.70882    5.658644    31.23     <.0001
Run Size     0.243369     0.025076    9.71      <.0001
Manager[a]   38.409663    3.005923    12.78     <.0001
Manager[b]   -14.65115    3.031379    -4.83     <.0001
Manager[c]   -23.75851    2.995898    -7.93     <.0001

• For a run size of 100, the estimated times for a run under Managers A, B and C are:

\begin{align*}
\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager} = a) &= 176.71 + 0.24 \cdot 100 + 38.41 \cdot 1 - 14.65 \cdot 0 - 23.76 \cdot 0 \\
\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager} = b) &= 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 1 - 23.76 \cdot 0 \\
\hat{E}(\text{Time} \mid \text{Run size} = 100, \text{Manager} = c) &= 176.71 + 0.24 \cdot 100 + 38.41 \cdot 0 - 14.65 \cdot 0 - 23.76 \cdot 1
\end{align*}

• For the same run size, Manager A is estimated to be on average 38.41-(-14.65)=53.06 minutes slower than Manager B and 38.41-(-23.76)=62.17 minutes slower than Manager C.
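The estimated times and the manager differences can be reproduced directly from the Expanded Estimates table. The sketch below uses the unrounded coefficients from that table; the function and variable names are hypothetical.

```python
# Coefficients copied from the Expanded Estimates table.
intercept = 176.70882
run_size_coef = 0.243369
manager_effect = {"a": 38.409663, "b": -14.65115, "c": -23.75851}

def estimated_time(run_size, manager):
    """Estimated mean run time: only the indicator for `manager` equals 1."""
    return intercept + run_size_coef * run_size + manager_effect[manager]

times = {m: estimated_time(100, m) for m in manager_effect}
for m, t in times.items():
    print(f"Manager {m.upper()}: {t:.2f} minutes")

# Differences do not depend on run size, because the run-size term cancels.
print(f"A minus B: {times['a'] - times['b']:.2f}")   # 53.06
print(f"A minus C: {times['a'] - times['c']:.2f}")   # 62.17
```

Because the same run-size term appears in every prediction, the gaps between managers are the same at any run size under this model.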

Categorical Variables in Multiple Regression in JMP

• Make sure that the categorical variable is coded as nominal. To change the coding, right-click on the column of the variable, click Column Info and change Modeling Type to nominal.

• Use Fit Model and include the categorical variable in the multiple regression.

• After Fit Model, click red triangle next to Response and click Estimates, then Expanded Estimates (the initial output in JMP uses a different, more confusing coding of the dummy variables).

Equivalence of Using One 0/1 Dummy Variable and Two 0/1 Dummy Variables when Categorical Variable has two categories

Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             3628.7902   2153.788    1.68      0.1076
Average Temperature   -35.63182   51.52383    -0.69     0.4972
Range                 133.30434   50.85675    2.62      0.0164
Rain forecast         429.70588   1211.933    0.35      0.7266
Snow forecast         548.80038   1342.27     0.41      0.6870
Weekday               -1603.1     876.7378    -1.83     0.0824
Sunday                -1847.152   1212.612    -1.52     0.1433
Subzero               3857.6004   1489.803    2.59      0.0175

Expanded Estimates
Nominal factors expanded to all levels
Term                  Estimate
Intercept             4321.7173
Average Temperature   -35.63182
Range                 133.30434
Rain forecast[0]      -214.8529
Rain forecast[1]      214.85294
Snow forecast[0]      -274.4002
Snow forecast[1]      274.40019
Weekday[0]            801.55002
Weekday[1]            -801.55
Sunday[0]             923.57625
Sunday[1]             -923.5762
Subzero[0]            -1928.8
Subzero[1]            1928.8002

The two models give equivalent predictions. The difference in the mean number of emergency calls between a day with a rain forecast and a day without a rain forecast, holding all other variables fixed, is 214.85-(-214.85)=429.71.
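The equivalence can be checked numerically from the two tables above. In the expanded output each two-level variable has a [0] and a [1] estimate of opposite sign, so the 0/1 coefficient is their difference, and the 0/1 intercept (all indicators equal to 0) is the expanded intercept plus every [0] estimate. The sketch below is only an arithmetic check on the reported numbers; the variable names are hypothetical.

```python
# Estimates copied from the Expanded Estimates table: (level 0, level 1).
expanded_intercept = 4321.7173
expanded = {
    "Rain forecast": (-214.8529, 214.85294),
    "Snow forecast": (-274.4002, 274.40019),
    "Weekday": (801.55002, -801.55),
    "Sunday": (923.57625, -923.5762),
    "Subzero": (-1928.8, 1928.8002),
}

# 0/1 coefficient = effect at level 1 minus effect at level 0.
zero_one_coef = {name: lvl1 - lvl0 for name, (lvl0, lvl1) in expanded.items()}

# 0/1 intercept = prediction offset when every indicator equals 0.
zero_one_intercept = expanded_intercept + sum(lvl0 for lvl0, _ in expanded.values())

for name, coef in zero_one_coef.items():
    print(f"{name}: {coef:.4f}")
print(f"Intercept: {zero_one_intercept:.4f}")
```

Up to rounding in the printed output, the recovered values agree with the Parameter Estimates table (for example, 429.71 for Rain forecast and 3628.79 for the intercept).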

Effect Tests

• Effect test for manager: $H_0: \text{Manager}[a] = \text{Manager}[b] = \text{Manager}[c]$ vs. $H_a$: not all of Manager[a], Manager[b], Manager[c] are equal. The null hypothesis is that all managers are the same (in terms of mean run time) when run size is held fixed; the alternative hypothesis is that not all managers are the same (in terms of mean run time) when run size is held fixed.

• p-value for Effect Test <.0001. Strong evidence that not all managers are the same when run size is held fixed.

• Note: the null hypothesis is equivalent to $H_0: \text{Manager}[a] = \text{Manager}[b] = \text{Manager}[c] = 0$ because JMP has the constraint that Manager[a]+Manager[b]+Manager[c]=0.

• Effect test for Run Size tests the null hypothesis that the Run Size coefficient is 0 versus the alternative hypothesis that the Run Size coefficient isn't zero. Same p-value as the t-test.

Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
Run Size   1       1    25260.250        94.1906   <.0001
Manager    2       2    44773.996        83.4768   <.0001

Expanded Estimates
Nominal factors expanded to all levels
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    176.70882    5.658644    31.23     <.0001
Run Size     0.243369     0.025076    9.71      <.0001
Manager[a]   38.409663    3.005923    12.78     <.0001
Manager[b]   -14.65115    3.031379    -4.83     <.0001
Manager[c]   -23.75851    2.995898    -7.93     <.0001
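The numbers in the Effect Tests output can be tied together with a short arithmetic check. It assumes the usual form of an F ratio, (sum of squares / DF) divided by the mean squared error, and backs the mean squared error out of the Run Size row because it is not printed in the output shown. The variable names are hypothetical.

```python
# Values copied from the Effect Tests and Expanded Estimates tables.
ss_run_size, df_run_size, f_run_size = 25260.250, 1, 94.1906
ss_manager, df_manager = 44773.996, 2
run_size_estimate, run_size_std_error = 0.243369, 0.025076

# Mean squared error implied by the Run Size row (assumes F = (SS / DF) / MSE).
mse = (ss_run_size / df_run_size) / f_run_size

# F ratio for Manager: its mean square over the same error mean square.
f_manager = (ss_manager / df_manager) / mse

# For a single-parameter effect, the F ratio equals the squared t ratio.
t_run_size = run_size_estimate / run_size_std_error

print(f"Implied MSE: {mse:.3f}")
print(f"F ratio for Manager: {f_manager:.4f}")
print(f"t^2 for Run Size: {t_run_size ** 2:.3f} vs F = {f_run_size}")
```

The recovered Manager F ratio agrees with the reported 83.4768, and the squared t ratio for Run Size agrees with its F ratio up to rounding, which is why the effect test and the t-test give the same p-value for Run Size.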

• Effect tests show that the managers are not equal.

• For the same run size, Manager C is best (lowest mean run time), followed by Manager B and then Manager A.

• The above model assumes no interaction between Manager and run size: the difference between the mean run times of the managers is the same for all run sizes.


Election Equation

Goal: Predict the Incumbent Party's share of the vote

Data in Elections.JMP

Y = Incumbent party's share of Vote
X_1 = Nominal Variable for party in power
X_2 = Economic Growth
X_3 = Inflation
X_4 = Consecutive Quarters of Good News
X_5 = Duration Value
X_6 = President Running
X_7 = War