Chapter 7: Using Indicator and Interaction Variables
Terry Dielman, Applied Regression Analysis: A Second Course in Business and Economic Statistics, fourth edition.
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc.
7.1 Using and Interpreting Indicator Variables
Suppose some observations have a particular characteristic or attribute, while others do not.
We can include this information in the regression model by using dummy or indicator variables.
Add the information through a coding scheme
Use a binary (dummy) variable to "indicate" when the characteristic is present
$D_i = 1$ if observation $i$ has the attribute
$D_i = 0$ if observation $i$ does not have it
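The coding scheme above can be sketched in a few lines of Python. This is only an illustration: the function name `make_indicator` and the sample data are hypothetical and do not come from the slides.

```python
def make_indicator(has_attribute):
    """Return D_i = 1 where the attribute is present and D_i = 0 otherwise."""
    return [1 if flag else 0 for flag in has_attribute]

# Hypothetical data: True means observation i has the attribute.
has_attribute = [True, False, False, True]
D = make_indicator(has_attribute)
print(D)
```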
An Example
$D_i = 1$ if individual $i$ is employed
$D_i = 0$ if individual $i$ is not employed
We could do it the other way and use the "1" to indicate an unemployed individual.
Multiple Categories
For multiple categories, use multiple indicators.
For example, to indicate where a firm's stock is listed, we could define 3 indicator variables: one each for the NYSE, AMEX and NASDAQ.
For computational reasons, we would include only two of these in the regression: the three indicators always sum to 1, so together with the intercept they would be perfectly collinear.
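As a rough sketch of this coding, the Python snippet below builds an indicator for every category except a chosen base. The function name `indicator_columns`, the sample listings, and the choice of NASDAQ as the omitted category are assumptions made for illustration only.

```python
def indicator_columns(values, base):
    """Build one 0/1 column per category, omitting the base category."""
    categories = sorted(set(values) - {base})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical listings for four firms.
listing = ["NYSE", "AMEX", "NASDAQ", "NYSE"]
columns = indicator_columns(listing, base="NASDAQ")
for name, column in columns.items():
    print(name, column)
```

A firm in the omitted category is identified by zeros in both remaining columns, so no information is lost by leaving its indicator out.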
Example 7.1 Employment Discrimination
If two groups have apparently different salary structures, you first need to account for differences in education, training and experience before any claim of discrimination can be made.
Regression analysis with an indicator variable for the group is a way to investigate this.
Treasury Versus Harris
The data set HARRIS7 contains information on the salaries of 93 employees of the Harris Trust and Savings Bank. The bank was sued by the US Department of Treasury in 1981.
Here we examine how salary depends on education, also accounting for gender.
[Figure: Salary Versus Years of Education. Scatterplot of SALARY against EDUCAT; plotting symbols 0 and 1 distinguish the two groups.]
At all levels of education, the male salaries appear higher.
Regression Analysis
The regression equation is
SALARY = 4173 + 80.7 EDUCAT + 692 MALES

Predictor    Coef   SE Coef      T      P
Constant   4173.1     339.2  12.30  0.000
EDUCAT      80.70     27.67   2.92  0.004
MALES       691.8     132.2   5.23  0.000

S = 572.4   R-Sq = 36.3%   R-Sq(adj) = 34.9%
How do we interpret this equation?
An Intercept Adjuster
For an indicator variable, the $b_j$ is not really a slope.
To see this, evaluate the equation for the two groups.
FEMALES (MALES = 0)
SALARY = 4173 + 80.7 EDUCAT + 692 MALES
       = 4173 + 80.7 EDUCAT + 692 (0)
       = 4173 + 80.7 EDUCAT
MALES (MALES = 1)
SALARY = 4173 + 80.7 EDUCAT + 692 MALES
       = 4173 + 80.7 EDUCAT + 692 (1)
       = 4173 + 80.7 EDUCAT + 692
       = 4865 + 80.7 EDUCAT
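The two group equations can be checked numerically. This sketch hard-codes the rounded coefficients from the fitted equation; the function name `predicted_salary` and the education levels tried are illustrative choices, not part of the original analysis.

```python
def predicted_salary(educat, males):
    """Rounded fitted equation: SALARY = 4173 + 80.7 EDUCAT + 692 MALES."""
    return 4173 + 80.7 * educat + 692 * males

# The male-female gap is the same at every (illustrative) education level,
# which is why the indicator acts on the intercept rather than the slope.
for educat in (8, 12, 16):
    gap = predicted_salary(educat, 1) - predicted_salary(educat, 0)
    print(educat, round(gap, 1))
```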
[Figure: Parallel Salary Equations. Fitted SALARY versus EDUCAT lines for the two groups (plotting symbols 0 and 1); the two lines are parallel.]
Is The Difference Significant?
$H_0: \beta_{MALES} = 0$ (after accounting for years of education, there is no salary difference)
$H_a: \beta_{MALES} \neq 0$ (after accounting for education, there IS a salary difference)
Use $t = b / SE_b$ as usual
t = 5.23 is significant
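As a quick arithmetic check of this test, the snippet below recomputes the t ratio from the coefficient and standard error reported in the regression output. The critical value of about 1.99 (two-sided, 5% level, 90 degrees of freedom) is an approximate figure assumed from a standard t table, and the variable names are illustrative.

```python
coef_males = 691.8   # coefficient of MALES from the regression output
se_males = 132.2     # its standard error from the same output
t_stat = coef_males / se_males

n, k = 93, 2         # 93 employees, 2 explanatory variables (EDUCAT, MALES)
df = n - k - 1       # residual degrees of freedom
critical = 1.99      # approximate two-sided 5% t value for df = 90 (assumed)

print(round(t_stat, 2), df, abs(t_stat) > critical)
```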
What if the Coding Was Different?
If we had an indicator for females and used it, the equation would be:
SALARY = 4865 + 80.7 EDUCAT - 692 FEMALES
The difference between the groups is the same. For females, the intercept in the equation is 4865 - 692 = 4173
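A small sketch can confirm that the two codings describe the same fitted salaries. Both functions use the rounded coefficients shown on the slides; the function names and the education level are hypothetical.

```python
def salary_with_males(educat, males):
    """Original coding: SALARY = 4173 + 80.7 EDUCAT + 692 MALES."""
    return 4173 + 80.7 * educat + 692 * males

def salary_with_females(educat, females):
    """Alternative coding: SALARY = 4865 + 80.7 EDUCAT - 692 FEMALES."""
    return 4865 + 80.7 * educat - 692 * females

educat = 12  # illustrative education level
for males in (0, 1):
    females = 1 - males  # the two indicators are complements
    same = abs(salary_with_males(educat, males)
               - salary_with_females(educat, females)) < 1e-6
    print(males, females, same)
```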
Multiple Categories
Pick one category as the "base category".
Create one indicator variable for each other category.
In general, if there are $m$ categories, use $m - 1$ indicator variables.
Example 7.3 Meddicorp Sales
Y = Sales in one of 25 territories
$X_1$ = advertising in territory
$X_2$ = bonuses paid in territory
Also Region: 1 = South, 2 = West, 3 = Midwest
How do you use region?
What happens if you just put it in the model?
Sales = -84 + 1.55 ADV + 1.11 BONUS + 119 Region
$R^2$ = 92.0% and $S_e$ = 68.89
SE(Region) = 28.69, so $t_{stat}$ = 4.14 is significant
Region as an X
This implies the difference between Region 3 (MW) and Region 2 (W) = $b_3$ = 119
And the difference between Region 2 (W) and Region 1 (S) is also 119
The sales differences may not be equal, but this forces them to be estimated that way
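The forced equal spacing is easy to see by evaluating the fitted equation at the three region codes. In this sketch the advertising and bonus values are hypothetical and simply held fixed; the function name is also illustrative.

```python
def sales_numeric_region(adv, bonus, region):
    """Fitted equation with Region (1, 2, 3) entered as a single numeric x."""
    return -84 + 1.55 * adv + 1.11 * bonus + 119 * region

adv, bonus = 500, 250  # hypothetical values, held fixed across regions
south, west, midwest = (sales_numeric_region(adv, bonus, r) for r in (1, 2, 3))

# Both adjacent differences equal the Region coefficient, whatever the data say.
print(round(west - south, 2), round(midwest - west, 2))
```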
A more flexible approach
Use two indicator variables to tell the three regions apart
Can use any one of the three as the "base" category.
Here is what it looks like if Midwest is selected as the base.
Coding scheme

Region     D1 (South)   D2 (West)
SOUTH          1            0
WEST           0            1
MIDWEST        0            0
Results
SALES = 435 + 1.37 ADV + .975 BONUS - 258 South - 210 West
$R^2$ = 94.7% and $S_e$ = 57.63
Both indicators are significant
This Defines Three Equations
SALES = 435 + 1.37 ADV + .975 BONUS - 258 South - 210 West
S: SALES = 177 + 1.37 ADV + .975 BONUS
W: SALES = 225 + 1.37 ADV + .975 BONUS
MW: SALES = 435 + 1.37 ADV + .975 BONUS
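The three intercepts follow from adding each indicator coefficient to the base (Midwest) intercept. The sketch below reproduces that arithmetic with the rounded coefficients from the slide; the names `BASE_INTERCEPT`, `SHIFT` and `predicted_sales` are illustrative.

```python
BASE_INTERCEPT = 435  # Midwest is the base category
SHIFT = {"South": -258, "West": -210, "Midwest": 0}  # indicator coefficients

intercepts = {region: BASE_INTERCEPT + shift for region, shift in SHIFT.items()}

def predicted_sales(adv, bonus, region):
    """Common slopes for ADV and BONUS; only the intercept differs by region."""
    return intercepts[region] + 1.37 * adv + 0.975 * bonus

print(intercepts)
```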
Is Location Significant?
Because location is measured by two variables in a group, we need to do a partial F test.
The full model has ADV, BONUS, SOUTH and WEST and has $R^2$ = 94.7%
The reduced model has only ADV and BONUS, with $R^2$ = 85.5%
Output For F-Test

FULL MODEL
S = 57.63   R-Sq = 94.7%   R-Sq(adj) = 93.6%
Analysis of Variance
Source           DF       SS      MS      F      P
Regression        4  1182560  295640  89.03  0.000
Residual Error   20    66414    3321
Total            24  1248974

REDUCED MODEL
S = 90.75   R-Sq = 85.5%   R-Sq(adj) = 84.2%
Analysis of Variance
Source           DF       SS      MS      F      P
Regression        2  1067797  533899  64.83  0.000
Residual Error   22   181176    8235
Total            24  1248974
Partial F Computations
Here $K = 4$ and $L = 2$ are the numbers of explanatory variables in the full and reduced models.
$$F = \frac{(SSE_R - SSE_F)/(K - L)}{MSE_F} = \frac{(181176 - 66414)/(4 - 2)}{3321} = 17.3$$
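The same computation can be written as a small function. This is a sketch using the sums of squares and degrees of freedom from the two ANOVA tables; the function name `partial_f` is hypothetical, and the 5% critical value of about 3.49 for F with 2 and 20 degrees of freedom is an approximate figure assumed from a standard F table.

```python
def partial_f(sse_reduced, sse_full, k, l, df_error_full):
    """Partial F statistic; MSE_F is SSE_F divided by the full model's error df."""
    mse_full = sse_full / df_error_full
    return ((sse_reduced - sse_full) / (k - l)) / mse_full

f_stat = partial_f(sse_reduced=181176, sse_full=66414, k=4, l=2, df_error_full=20)
critical = 3.49  # approximate 5% point of F(2, 20), assumed from a table

print(round(f_stat, 1), f_stat > critical)
```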
7.2 Interaction Variables
Another type of variable used in regression models is an interaction variable.
This is usually formulated as the product of two variables; for example, $x_3 = x_1 x_2$
With this variable in the model, the level of $x_2$ changes how $x_1$ affects $Y$
Interaction Model
With two $x$ variables the model is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + e$$
If we factor out $x_1$ we get:
$$y = \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2 + e$$
so each value of $x_2$ yields a different slope in the relationship between $y$ and $x_1$
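To illustrate how the slope in $x_1$ depends on $x_2$, the sketch below evaluates an interaction model with made-up coefficients. The coefficient values and function names are purely hypothetical; only the model form comes from the slide.

```python
def y_hat(x1, x2, b0, b1, b2, b3):
    """Interaction model without the error term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def slope_in_x1(x2, b1, b3):
    """Slope of y with respect to x1 at a given x2: b1 + b3 * x2."""
    return b1 + b3 * x2

b0, b1, b2, b3 = 10.0, 2.0, 3.0, 0.5  # hypothetical coefficients
for x2 in (0, 1, 4):
    # Change in y for a one-unit increase in x1, holding x2 fixed.
    change = y_hat(1, x2, b0, b1, b2, b3) - y_hat(0, x2, b0, b1, b2, b3)
    print(x2, change, slope_in_x1(x2, b1, b3))
```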
Interaction Involving an Indicator
If one of the two variables is binary, the interaction produces a model with two different slopes.
When $x_2 = 0$:
$$y = \beta_0 + \beta_1 x_1 + e$$
When $x_2 = 1$:
$$y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + e$$
Example 7.4 Discrimination (again)
In the Harris Bank case, suppose we suspected that the salary difference by gender changed with different levels of education.
To investigate this, we created a new variable MSLOPE = EDUCAT*MALES and added it to the model.
Regression Output
The regression equation is
SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE

Predictor    Coef   SE Coef      T      P
Constant   4395.3     389.2  11.29  0.000
EDUCAT      62.13     31.94   1.95  0.055
MALES      -274.9     845.7  -0.32  0.746
MSLOPE      73.59     63.59   1.16  0.250

S = 571.4   R-Sq = 37.3%   R-Sq(adj) = 35.2%
How do we interpret the equation this time?
A Slope Adjuster
To see the interaction effect, once again evaluate the equation for the two groups.
FEMALES (MALES = 0)
SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE
       = 4395 + 62.1 EDUCAT - 275 (0) + 73.6 (EDUCAT*0)
       = 4395 + 62.1 EDUCAT
MALES (MALES = 1)
SALARY = 4395 + 62.1 EDUCAT - 275 MALES + 73.6 MSLOPE
       = 4395 + 62.1 EDUCAT - 275 (1) + 73.6 (EDUCAT*1)
       = 4395 + 62.1 EDUCAT - 275 + 73.6 EDUCAT
       = 4120 + 135.7 EDUCAT
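Subtracting the female equation from the male equation gives the fitted salary gap as a function of education, namely -275 + 73.6 EDUCAT. The sketch below evaluates it at a few education levels; the function name and the levels chosen are illustrative, and the coefficients are the rounded ones from the equation above.

```python
def salary_gap(educat):
    """Male minus female fitted salary: -275 + 73.6 * EDUCAT (rounded coefficients)."""
    return -275 + 73.6 * educat

# Illustrative education levels, in years.
for educat in (8, 12, 16):
    print(educat, round(salary_gap(educat), 1))
```

The negative MALES coefficient is the gap extrapolated to zero years of education; at the education levels tried here the fitted gap is positive and grows with education.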
[Figure: Lines With Two Different Slopes. Fitted SALARY versus EDUCAT lines for the two groups (plotting symbols 0 and 1).]
A bigger gap occurs at higher education levels
Tests in This Model
Although the slope adjuster implies the salary gap increases with education, this effect is not statistically significant ($t_{MSLOPE}$ = 1.16, p = 0.250).
The overall effect of gender is now contained in two variables, so a partial F test would be needed to test for differences between male and female salaries.
7.3 Seasonal Effects in Time Series Regression
Data collected over time (say quarterly)
If we think the Y variable depends on the calendar, we can do a kind of "seasonal adjustment" by adding quarter dummies
Q1 = 1 if the observation is from a first quarter, Q2 = 1 if from a second quarter, Q3 = 1 if from a third quarter (each is 0 otherwise)
Don't use Q4, since that is the "base"
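A minimal sketch of this quarterly coding follows. The function name `quarter_dummies` and the sample sequence of quarters are hypothetical; quarter 4 is the base, so it is coded as all zeros.

```python
def quarter_dummies(quarter):
    """Return (Q1, Q2, Q3) for a quarter in 1..4; quarter 4 is the base."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3 or 4")
    return tuple(int(quarter == k) for k in (1, 2, 3))

quarters = [1, 2, 3, 4, 1]  # hypothetical sequence of quarterly observations
rows = [quarter_dummies(q) for q in quarters]
print(rows)
```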
Example 7.5 ABX Company Sales
We fit a trend to these sales in Example 3.11 by regressing sales on a time index variable.
Because this company sells winter sports merchandise, including seasonal effects should markedly improve the fit.
[Figure: ABX Company Sales. Plot of SALES against TIME, with fourth-quarter observations labeled "4th qtr".]
Two Regressions

The regression equation is
SALES = 199 + 2.56 TIME

Predictor      Coef  SE Coef      T      P
Constant    199.017    5.128  38.81  0.000
TIME         2.5559   0.2180  11.73  0.000

S = 15.91   R-Sq = 78.3%   R-Sq(adj) = 77.8%

The regression equation is
SALES = 211 + 2.57 TIME + 3.75 Q1 - 26.1 Q2 - 25.8 Q3

Predictor      Coef  SE Coef      T      P
Constant    210.846    3.148  66.98  0.000
TIME        2.56610  0.09895  25.93  0.000
Q1            3.748    3.229   1.16  0.254
Q2          -26.118    3.222  -8.11  0.000
Q3          -25.784    3.217  -8.01  0.000

S = 7.190   R-Sq = 95.9%   R-Sq(adj) = 95.5%
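To see what the seasonal coefficients do, the sketch below evaluates both fitted equations using the coefficients from the output above. The function names and the time index are hypothetical, and passing the quarter separately from the time index is a simplification made for illustration (in the actual data the quarter is determined by the time period).

```python
def trend_only(time):
    """First fitted equation: SALES = 199.017 + 2.5559 TIME."""
    return 199.017 + 2.5559 * time

def trend_plus_seasonal(time, quarter):
    """Second fitted equation, with quarter 4 as the base category."""
    q1, q2, q3 = (int(quarter == k) for k in (1, 2, 3))
    return 210.846 + 2.56610 * time + 3.748 * q1 - 26.118 * q2 - 25.784 * q3

time = 20  # hypothetical time index
for quarter in (1, 2, 3, 4):
    print(quarter, round(trend_plus_seasonal(time, quarter), 2))
```

Each quarter coefficient is the fitted difference from the fourth quarter at the same time index, so second and third quarters sit roughly 26 units below the fourth-quarter line.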
Are the Seasonal Effects Significant?
The strong t-ratios for Q2 and Q3 say "yes", and the model $R^2$ increased by 17.6 percentage points (from 78.3% to 95.9%) when we added the seasonal indicators.
With evidence this strong we probably don't need to test further.
In general, however, we would need another partial F test to see if the overall seasonal effect is significant.