# Regression Assumptions


Best Linear Unbiased Estimate (BLUE)

If the following assumptions are met:

The model is
- Complete
- Linear
- Additive

Variables are
- measured at an interval or ratio scale
- measured without error

The regression error term
- is unrelated to the predictors
- is normally distributed
- has an expected value of 0
- is independent across cases (errors are independent)
- has constant variance (homoscedasticity)
- in a system of interrelated equations, is unrelated to the errors of the other equations

Characteristics of OLS if the sample is a probability sample:
- Unbiased
- Efficient
- Consistent

Unbiased: E(b) = β, where b is the sample coefficient and β is the true population coefficient. On average we are on target.

Efficient: The standard error will be at its minimum.

Consistent: As N increases, the standard error decreases and the estimate closes in on the population value.
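The unbiasedness and consistency claims above can be checked with a small simulation. This is only a sketch under assumed values: the true slope of 2, the error standard deviation of 3, and the helper names `ols_slope` and `simulate_slopes` are all hypothetical and not taken from the slides.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def simulate_slopes(n, reps, beta=2.0, seed=0):
    """Draw `reps` random samples of size n and return the estimated slopes."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(n)]
        # Assumed population model: Y = 1 + beta*X + e, with e ~ N(0, 3^2)
        y = [1.0 + beta * xi + rng.gauss(0, 3) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

small = simulate_slopes(n=20, reps=500)
large = simulate_slopes(n=500, reps=500)

# Unbiased: the average estimate sits near the true beta at either sample size.
print("mean slope, n=20: ", round(statistics.fmean(small), 3))
print("mean slope, n=500:", round(statistics.fmean(large), 3))
# Consistent: the spread of the estimates shrinks as N grows.
print("sd of slopes, n=20: ", round(statistics.stdev(small), 3))
print("sd of slopes, n=500:", round(statistics.stdev(large), 3))
```

Both averages should land close to 2, while the spread for n=500 should be a fraction of the spread for n=20.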

Meals

Parents’ education

    . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

          Source |       SS           df       MS      Number of obs   =     10082
    -------------+----------------------------------   F(6, 10075)     =   2947.08
           Model |  65503313.6         6  10917218.9   Prob > F        =    0.0000
        Residual |  37321960.3     10075  3704.41293   R-squared       =    0.6370
    -------------+----------------------------------   Adj R-squared   =    0.6368
           Total |   102825274     10081  10199.9081   Root MSE        =    60.864

    -------------+-----------------------------------------------------
           API13 |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
           MEALS |   .1843877   .0394747     4.67   0.000      .0508435
          AVG_ED |   92.81476   1.575453    58.91   0.000      .6976283
            P_EL |   .6984374   .0469403    14.88   0.000      .1225343
          P_GATE |   .8179836   .0666113    12.28   0.000      .0769699
            EMER |  -1.095043   .1424199    -7.69   0.000      -.046344
            DMOB |   4.715438   .0817277    57.70   0.000      .3746754
           _cons |   52.79082   8.491632     6.22   0.000             .
    -------------+-----------------------------------------------------

    . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

          Source |       SS           df       MS      Number of obs   =     10082
    -------------+----------------------------------   F(13, 10068)    =   1488.01
           Model |    67627352        13     5202104   Prob > F        =    0.0000
        Residual |  35197921.9     10068  3496.01926   R-squared       =    0.6577
    -------------+----------------------------------   Adj R-squared   =    0.6572
           Total |   102825274     10081  10199.9081   Root MSE        =    59.127

    -------------+-----------------------------------------------------
           API13 |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
           MEALS |    .370891   .0395857     9.37   0.000      .1022703
          AVG_ED |   89.51041   1.851184    48.35   0.000      .6727917
            P_EL |   .2773577   .0526058     5.27   0.000      .0486598
          P_GATE |   .7084009   .0664352    10.66   0.000      .0666584
            EMER |  -.7563048   .1396315    -5.42   0.000      -.032008
            DMOB |   4.398746   .0817144    53.83   0.000       .349512
          PCT_AA |  -1.096513   .0651923   -16.82   0.000     -.1112841
          PCT_AI |  -1.731408   .1560803   -11.09   0.000     -.0718944
          PCT_AS |   .5951273   .0585275    10.17   0.000      .0715228
          PCT_FI |   .2598189   .1650952     1.57   0.116      .0099543
          PCT_HI |   .0231088   .0445723     0.52   0.604      .0066676
          PCT_PI |  -2.745531   .6295791    -4.36   0.000     -.0274142
          PCT_MR |  -.8061266   .1838885    -4.38   0.000     -.0295927
           _cons |   96.52733   9.305661    10.37   0.000             .
    -------------+-----------------------------------------------------

Diagnosis: Theoretical

Remedy: Including new variables

Violation of linearity

An almost perfect relationship will appear as a weak one.

Almost all linear relations stop being linear at a certain point.

[Figure: three scatter plots against X (X axis from 200 to 1200). Panel 1: Y against X, Rsq = 0.9313. Panel 2: Y against X, Rsq = 0.1174. Panel 3: Z against X, Rsq = 0.6211.]
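To illustrate how an almost perfect relationship can appear as a weak one, the sketch below fits a straight line to made-up data in which Y is fully determined by X but follows a U shape. The data and the helper `r_squared` are assumptions for illustration and are not the data behind the slide's plots.

```python
def r_squared(x, y):
    """R-square of the simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

x = list(range(-10, 11))
y_linear = [3 + 2 * v for v in x]   # perfectly linear in X
y_curved = [v ** 2 for v in x]      # perfectly determined by X, but U-shaped

r2_linear = r_squared(x, y_linear)                    # straight line fits exactly
r2_curved = r_squared(x, y_curved)                    # straight line sees almost nothing
r2_fixed = r_squared([v ** 2 for v in x], y_curved)   # regress on X squared instead

print(r2_linear, r2_curved, r2_fixed)
```

The curved case yields an R-square of essentially zero even though X determines Y exactly; once X is transformed, the fit is perfect again.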

Diagnosis:
- Visual: scatter plots
- Comparing regressions with a continuous and with a dummied independent variable

Remedy: Use dummies

$Y = a + bX + e$ becomes $Y = a + b_1 D_1 + \dots + b_{k-1} D_{k-1} + e$, where X is broken up into k dummies ($D_i$) and k-1 of them are included. If the R-square of this equation is significantly higher than the R-square of the original, that is a sign of non-linearity. The pattern of the slopes ($b_i$) will indicate the shape of the non-linearity.
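The dummy-variable check can be sketched as follows, relying on the fact that a regression on an intercept plus k-1 dummies predicts each case by its category mean. The simulated U-shaped data, the five equal-width categories, and the helper names are hypothetical; a formal test of whether the R-square gain is significant is not shown.

```python
import random

def r2_linear(x, y):
    """R-square of the straight-line regression Y = a + bX + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

def group_means(groups, y):
    """Mean of Y within each category (the fitted values of the dummy model)."""
    sums, counts = {}, {}
    for g, yi in zip(groups, y):
        sums[g] = sums.get(g, 0.0) + yi
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def r2_dummies(groups, y):
    """R-square of Y regressed on an intercept and k-1 category dummies."""
    means = group_means(groups, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - means[g]) ** 2 for g, yi in zip(groups, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

rng = random.Random(1)
x = [rng.uniform(0, 100) for _ in range(2000)]
# Assumed U-shaped relation between X and Y, plus random noise
y = [0.02 * (xi - 50) ** 2 + rng.gauss(0, 5) for xi in x]
category = [min(int(xi // 20), 4) for xi in x]   # break X into k = 5 categories

r2_lin = r2_linear(x, y)
r2_dum = r2_dummies(category, y)
means = group_means(category, y)

print("R-square, continuous X:", round(r2_lin, 3))
print("R-square, dummied X:   ", round(r2_dum, 3))
print("category means:", [round(means[g], 1) for g in sorted(means)])
```

The category means fall and then rise again, which is the pattern of slopes that reveals the U shape.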

Transform the variables through a non-linear transformation, so that $Y = a + bX + e$ becomes:

- Quadratic: $Y = a + b_1 X + b_2 X^2 + e$
- Cubic: $Y = a + b_1 X + b_2 X^2 + b_3 X^3 + e$
- Kth degree polynomial: $Y = a + b_1 X + \dots + b_k X^k + e$
- Logarithmic: $Y = a + b \log(X) + e$
- Exponential: $\log(Y) = a + bX + e$, or $Y = \exp(a + bX + e)$
- Inverse: $Y = a + b/X + e$
- etc.

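    . regress API13 MEALS MEALS2, beta

          Source |       SS           df       MS      Number of obs   =     10242
    -------------+----------------------------------   F(2, 10239)     =   1614.05
           Model |  25677500.1         2    12838750   Prob > F        =    0.0000
        Residual |  81444841.5     10239   7954.3746   R-squared       =    0.2397
    -------------+----------------------------------   Adj R-squared   =    0.2396
           Total |   107122342     10241  10460.1447   Root MSE        =    89.187

    -------------+-----------------------------------------------------
           API13 |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
           MEALS |  -3.666183   .1316338   -27.85   0.000     -.9997207
          MEALS2 |   .0181756   .0011998    15.15   0.000      .5437501
           _cons |   922.5229   3.183358   289.80   0.000             .
    -------------+-----------------------------------------------------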

Inflection point (the turning point of the fitted parabola): $-b_1 / (2 b_2)$ = −(−3.666183) / (2 × .0181756) ≈ 100.85. As you approach 100% the negative effect disappears.

Meaningless!
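The turning-point arithmetic can be reproduced directly from the coefficients in the output above. Note the parentheses around `2 * b2`: written as `-b1/2*b2`, most languages would divide by 2 and then multiply by b2. The helper `marginal_effect` is a hypothetical name for the derivative of the fitted quadratic.

```python
# Coefficients copied from the regression of API13 on MEALS and MEALS2
a = 922.5229       # _cons
b1 = -3.666183     # MEALS
b2 = 0.0181756     # MEALS2 (MEALS squared)

# Value of MEALS at which the fitted parabola stops falling
turning_point = -b1 / (2 * b2)

def marginal_effect(meals):
    """Slope of predicted API13 with respect to MEALS at a given MEALS value."""
    return b1 + 2 * b2 * meals

print("turning point:", round(turning_point, 4))
for m in (0, 50, 100):
    print("MEALS =", m, "marginal effect =", round(marginal_effect(m), 3))
```

The marginal effect starts at b1 when MEALS is 0 and shrinks toward zero as MEALS approaches 100, which is the pattern the slide describes.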

$Y = a + b_1 X_1 + b_2 X_2 + e$

The assumption is that $X_1$ and $X_2$ each separately add to Y, regardless of the value of the other. E.g., $\text{Inc} = a + b_1 \text{Education} + b_2 \text{Citizenship} + e$

Imagine that the effect of $X_1$ depends on $X_2$:
- If Citizen: $\text{Inc} = a + b^{*}_1 \text{Education} + e^{*}$
- If Not Citizen: $\text{Inc} = a + b^{**}_1 \text{Education} + e^{**}$

where $b^{*}_1 > b^{**}_1$. You cannot simply add the two. If Citizenship takes only two values, their effect is multiplicative: $\text{Inc} = a + b_1 \text{Education} \cdot b_2 \text{Citizenship} + e$

There are many examples of the violation of additivity, e.g.:
- The effect of previous knowledge ($X_1$) and effort ($X_2$) on grades (Y)
- The effect of race and skills on income (discrimination)
- The effect of paternal and maternal education on academic achievement

Diagnosis: Try other functional forms and compare R-squares

Remedy: Introduce the multiplicative term as a new variable, so

$Y_i = a + b_1 X_1 + b_2 X_2 + e$ becomes

$Y_i = a + b_1 X_1 + b_2 X_2 + b_3 Z + e$, where $Z = X_1 X_2$

Or transform the equation into additive form: if $Y = a X_1^{b_1} X_2^{b_2} e$, then

$\log Y = \log(a) + b_1 \log(X_1) + b_2 \log(X_2) + \log(e)$

    Coefficients(a)

                                        Unstandardized        Standardized
                                         Coefficients         Coefficients
    Model                                B      Std. Error        Beta            t      Sig.
    1  (Constant)                     454.542      4.151                       109.497   .000
       AVG_ED                         107.938      1.481          .801          72.896   .000
       ESCHOOL                        145.801      5.386          .707          27.073   .000
       AVG_ED*ESCHOOL (interaction)   -33.145      1.885         -.495         -17.587   .000

    a. Dependent Variable: API13

    Model Summary

    Model      R       R Square   Adjusted R Square   Std. Error of the Estimate
    1       .730(a)      .533           .533                   69.867

    a. Predictors: (Constant), INTESXED, AVG_ED, ESCHOOL

    Coefficients(a)

                                        Unstandardized        Standardized
                                         Coefficients         Coefficients
    Model                                B      Std. Error        Beta            t      Sig.
    1  (Constant)                     510.030      2.738                       186.250   .000
       AVG_ED                          87.476       .930          .649          94.085   .000
       ESCHOOL                         54.352      1.424          .264          38.179   .000

    a. Dependent Variable: API13

    Model Summary

    Model      R       R Square   Adjusted R Square   Std. Error of the Estimate
    1       .720(a)      .519           .519                   70.918

    a. Predictors: (Constant), ESCHOOL, AVG_ED

Does parents’ education matter more in elementary school or later?

Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*ESCHOOLi + (-33.145)*AVG_EDi*ESCHOOLi

If ESCHOOL=1, i.e. the school is an elementary school:

Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*1 + (-33.145)*AVG_EDi*1
= 454.542 + 107.938*AVG_EDi + 145.801 + (-33.145)*AVG_EDi
= (454.542 + 145.801) + (107.938 - 33.145)*AVG_EDi
= 600.343 + 74.793*AVG_EDi

If ESCHOOL=0, i.e. the school is not an elementary but a middle or high school:

Pred(API13i) = 454.542 + 107.938*AVG_EDi + 145.801*0 + (-33.145)*AVG_EDi*0
= 454.542 + 107.938*AVG_EDi

The effect of parental education is larger after elementary school! Is this difference statistically significant?
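The two prediction lines can be recomputed from the reported coefficients. The sketch below also forms the t-ratio of the interaction term from its coefficient and standard error, which speaks to whether the two slopes differ. The names `coef`, `se_interaction` and `line_for` are hypothetical; the numbers are copied from the SPSS output.

```python
# Unstandardized coefficients from the model with the interaction term
coef = {
    "const": 454.542,
    "AVG_ED": 107.938,
    "ESCHOOL": 145.801,
    "AVG_ED*ESCHOOL": -33.145,
}
se_interaction = 1.885   # standard error of the interaction coefficient

def line_for(eschool):
    """Intercept and AVG_ED slope implied by the model for ESCHOOL = 0 or 1."""
    intercept = coef["const"] + coef["ESCHOOL"] * eschool
    slope = coef["AVG_ED"] + coef["AVG_ED*ESCHOOL"] * eschool
    return intercept, slope

elementary = line_for(1)   # elementary schools
later = line_for(0)        # middle and high schools

# The slope difference between the two groups IS the interaction coefficient,
# so its t-ratio tests whether the difference is statistically significant.
t_interaction = coef["AVG_ED*ESCHOOL"] / se_interaction

print("elementary: intercept %.3f, slope %.3f" % elementary)
print("later:      intercept %.3f, slope %.3f" % later)
print("t for the slope difference: %.2f" % t_interaction)
```

The t-ratio computed from the rounded figures differs slightly in the last digits from the reported -17.587 because the printed coefficient and standard error are themselves rounded.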

    Coefficients(a)

                                        Unstandardized        Standardized
                                         Coefficients         Coefficients
    Model                                B      Std. Error        Beta            t      Sig.
    1  (Constant)                     454.542      4.151                       109.497   .000
       AVG_ED                         107.938      1.481          .801          72.896   .000
       ESCHOOL                        145.801      5.386          .707          27.073   .000
       AVG_ED*ESCHOOL (interaction)   -33.145      1.885         -.495         -17.587   .000

    a. Dependent Variable: API13

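    . regress API13 P_EL AVG_ED, beta

          Source |       SS           df       MS      Number of obs   =     10173
    -------------+----------------------------------   F(2, 10170)     =   4726.92
           Model |  51193410.2         2  25596705.1   Prob > F        =    0.0000
        Residual |    55071454     10170  5415.08889   R-squared       =    0.4818
    -------------+----------------------------------   Adj R-squared   =    0.4817
           Total |   106264864     10172  10446.8014   Root MSE        =    73.587

    -------------+-----------------------------------------------------
           API13 |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
            P_EL |   1.364798   .0544075    25.08   0.000      .2362464
          AVG_ED |   111.0936   1.268689    87.57   0.000      .8246885
           _cons |   446.5234   4.420695   101.01   0.000             .
    -------------+-----------------------------------------------------

    . regress API13 P_EL AVG_ED INTELXED, beta

          Source |       SS           df       MS      Number of obs   =     10173
    -------------+----------------------------------   F(3, 10169)     =   3158.47
           Model |  51256452.7         3  17085484.2   Prob > F        =    0.0000
        Residual |  55008411.5     10169  5409.42192   R-squared       =    0.4823
    -------------+----------------------------------   Adj R-squared   =    0.4822
           Total |   106264864     10172  10446.8014   Root MSE        =    73.549

    -------------+-----------------------------------------------------
           API13 |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
            P_EL |   1.886735   .1622719    11.63   0.000      .3265936
          AVG_ED |   113.9676   1.522041    74.88   0.000      .8460229
        INTELXED |  -.2345271   .0686992    -3.41   0.001     -.0818319
           _cons |   439.2113   4.910184    89.45   0.000             .
    -------------+-----------------------------------------------------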

| Independent \ Dependent | Nominal: Dichotomous | Nominal: Polytomous | Ordinal | Interval/Ratio |
|---|---|---|---|---|
| Nominal: Dichotomous | 2x2 table; dummy variables with logit/probit | Kx2 table; dummy variables with multinomial logit/probit | Nx2 table; dummy variables with ordered logit/probit | Difference of means test; regression with dummy variables |
| Nominal: Polytomous | 2xK table; dummy variables with logit/probit | KxK table; dummy variables with multinomial logit/probit | NxK table; dummy variables with ordered logit/probit | ANOVA; regression with dummy variables |
| Ordinal | 2xN table; dummy variables with logit/probit, or just logit/probit | NxK table; dummy variables with multinomial logit/probit, or just multinomial logit/probit | KxK table; dummy variables with ordered logit/probit, or just ordered logit/probit | Regression with dummy variables, or just regression |
| Interval/Ratio | Logit/probit | Multinomial logit/probit | Ordered logit/probit | Regression |

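Take $Y = a + bX + e$. Suppose $X^{*} = X + u$, where X is the real value and u is a random measurement error.

We then estimate $Y = a + b'X^{*} + e'$. Substituting $X = X^{*} - u$ into the true model gives $Y = a + bX^{*} + E$, where $E = e - bu$.

The error increases as a result. Because E contains u, it is also correlated with $X^{*}$, so in large samples the estimated slope shrinks toward zero: $b' = b \cdot \mathrm{Var}(X) / (\mathrm{Var}(X) + \mathrm{Var}(u))$. The slope stays unchanged ($b' = b$) only when the random error is in Y rather than in X.

- Our R-square will be smaller
- Our standard errors will be larger
- t-values smaller
- significance smaller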
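Standard measurement-error theory says that purely random error in X lowers the R-square and also pulls the OLS slope toward zero by the factor Var(X) / (Var(X) + Var(error)). The simulation below is a sketch of that result under assumed values (true slope 2, error variance equal to the variance of X); the names `slope_and_r2`, `x_true` and `x_noisy` are hypothetical.

```python
import random

def slope_and_r2(x, y):
    """OLS slope and R-square of the regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sxx, sxy ** 2 / (sxx * syy)

rng = random.Random(42)
n = 20000
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x_true]
# Observed X* = X + random error, with error variance equal to Var(X),
# so the expected attenuation factor is 1 / (1 + 1) = 0.5.
x_noisy = [xi + rng.gauss(0, 1) for xi in x_true]

b_true, r2_true = slope_and_r2(x_true, y)
b_noisy, r2_noisy = slope_and_r2(x_noisy, y)

print("slope with true X:  %.3f  R-square %.3f" % (b_true, r2_true))
print("slope with noisy X: %.3f  R-square %.3f" % (b_noisy, r2_noisy))
```

With these settings the slope on the noisy measure should come out near 1 rather than 2, and the R-square should drop by about half.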

Suppose $X^{\#} = X + cW + e$, where W is a systematic measurement error and c is a weight.

Then $Y = a + b'X^{\#} + e'$, so $Y = a + b'(X + cW + e) + e' = a + b'X + b'cW + E$, where E again collects the random error terms.

$b' = b$ iff $r_{wx} = 0$ or $r_{wy} = 0$; otherwise $b' \neq b$, which means that the slope will change together with the increase in the error. Apart from the problems stated above, that means that our slope will be wrong.

Diagnosis: Look at the correlation of the measure with other measures of the same variable

Remedy:
- Use multiple indicators and structural equation models (AMOS)
- Confirmatory factor analysis
- Better measures

Our calculations of statistical significance depend on this assumption.

Statistical inference can be robust even when the error is non-normal.

Diagnosis: You can look at the distribution of the error. Because of the homoscedasticity assumption (see later), the error when summed up for each prediction should also be normal. (In principle, we have multiple observations for each prediction.)

Remember! Our measured variables (Y and X) do not have to have a normal distribution! Only the error for each prediction.

Remedy: Any non-linear transformation will change the shape of the distribution of the error.

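    . regress childs educ income06 sibs age, beta

          Source |       SS           df       MS      Number of obs   =      1751
    -------------+----------------------------------   F(4, 1746)      =    114.83
           Model |  1010.57637         4  252.644093   Prob > F        =    0.0000
        Residual |  3841.48416      1746  2.20016275   R-squared       =    0.2083
    -------------+----------------------------------   Adj R-squared   =    0.2065
           Total |  4852.06054      1750  2.77260602   Root MSE        =    1.4833

    -------------+-----------------------------------------------------
          childs |      Coef.  Std. Err.        t   P>|t|          Beta
    -------------+-----------------------------------------------------
            educ |  -.1196158   .0133317    -8.97   0.000     -.2169182
        income06 |   .0161114   .0067835     2.38   0.018      .0567489
            sibs |   .0860119   .0124929     6.88   0.000      .1557163
             age |   .0321178   .0020686    15.53   0.000      .3330204
           _cons |   1.405787   .2164236     6.50   0.000             .
    -------------+-----------------------------------------------------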

                                   N    Minimum   Maximum   Mean   Std. Deviation
    childs NUMBER OF CHILDREN    1751      0         8      1.89       1.665

Dependent variable:
- Underdispersion: variance < mean
- Overdispersion: variance > mean

The variance is 1.665² ≈ 2.77, which is larger than the mean of 1.89, so we have a case of (small) overdispersion.

    . nbreg childs educ income06 sibs age, dispersion(mean)

    Fitting Poisson model:

    Iteration 0:   log likelihood =  -2975.315
    Iteration 1:   log likelihood =  -2975.314
    Iteration 2:   log likelihood =  -2975.314

    Fitting constant-only model:

    Iteration 0:   log likelihood = -3262.2505
    Iteration 1:   log likelihood = -3153.7715
    Iteration 2:   log likelihood = -3153.2079
    Iteration 3:   log likelihood = -3153.2072
    Iteration 4:   log likelihood = -3153.2072

    Fitting full model:

    Iteration 0:   log likelihood =  -2990.357
    Iteration 1:   log likelihood = -2964.9271
    Iteration 2:   log likelihood = -2962.3446
    Iteration 3:   log likelihood = -2962.2649
    Iteration 4:   log likelihood = -2962.2647

    Negative binomial regression                    Number of obs   =       1751
                                                    LR chi2(4)      =     381.88
    Dispersion     = mean                           Prob > chi2     =     0.0000
    Log likelihood = -2962.2647                     Pseudo R2       =     0.0606

    -------------+----------------------------------------------------------------
          childs |      Coef.  Std. Err.        z   P>|z|    [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            educ |  -.0593967   .0069457    -8.55   0.000    -.0730099   -.0457835
        income06 |    .010126   .0037223     2.72   0.007     .0028304    .0174217
            sibs |   .0409034   .0060993     6.71   0.000     .0289491    .0528577
             age |   .0168531   .0011061    15.24   0.000     .0146852    .0190209
           _cons |   .2482068   .1169235     2.12   0.034      .019041    .4773726
    -------------+----------------------------------------------------------------
        /lnalpha |  -2.301191   .2291361                      -2.75029   -1.852093
    -------------+----------------------------------------------------------------
           alpha |   .1001395   .0229456                      .0639093    .1569084
    -------------+----------------------------------------------------------------
    Likelihood-ratio test of alpha=0:  chibar2(01) =   26.10 Prob>=chibar2 = 0.000

Log of expected counts is now the unit of the dependent variable

Poisson assumes mean = variance (no over- or underdispersion)

Negative Binomial does not make this assumption
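As a rough check on dispersion, the variance of the count can be compared with its mean using the descriptive statistics reported for `childs`. Because the model works on the log of the expected count, a coefficient can also be exponentiated to read it as a multiplicative effect. This is a sketch; the variable names are hypothetical and the numbers are copied from the outputs.

```python
import math

# Descriptive statistics reported for childs (number of children)
mean_childs = 1.89
sd_childs = 1.665

variance_childs = sd_childs ** 2
# A Poisson count has variance equal to its mean, so this ratio would be 1.
dispersion_ratio = variance_childs / mean_childs

# Negative binomial coefficient on educ, on the log-expected-count scale
b_educ = -0.0593967
# Multiplicative change in the expected count for one more year of education
rate_ratio_educ = math.exp(b_educ)

print("variance: %.3f  mean: %.2f  ratio: %.2f"
      % (variance_childs, mean_childs, dispersion_ratio))
print("exp(b_educ): %.4f" % rate_ratio_educ)
```

A ratio above 1 points to overdispersion, which agrees with the estimated alpha being greater than zero and with the likelihood-ratio test of alpha=0 in the negative binomial output.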

The solid line gives a negative mean error, the dotted line a positive one. This can happen when we have some selection problem.

Diagnosis: A visual scatter plot will not help unless we somehow know the true regression line in advance.

Remedy: If it is a selection problem, try to address it.

[Figure: scatter plot of Z against X (X axis from 200 to 1200), Rsq = 0.6211.]

Example 1: Suppose you take a survey of 10 people but you interview everyone 10 times. Now your N=100, but your errors are not independent. For the same person you will have similar errors.

Example 2: Suppose you take 10 countries and you observe them in 10 different time periods. Now your N=100, but your errors are not independent. For the same country you will have similar errors.

Example 3: Suppose you take 100 countries and you observe them only once. Now your N=100. But countries that are next to each other are often similar (same geography and climate, similar history, cooperation etc.). If your model underpredicts Denmark, it is likely to underpredict Sweden as well.

Example 4: Suppose you take 100 people but they are all couples, so what you really have is 50 couples. Husband and wife tend to be similar. If your model underestimates one, chances are it does the same for the other. Spouses have similar errors.

Statistical inference assumes that each case is independent of the others, and in the examples above that is not the case. In effect, your N < 100.

This biases your standard error because the formula is “tricked into believing” that you have a larger sample than you actually have, and larger samples give smaller standard errors and better statistical significance.

This may also bias your estimates of the intercept and the slope. Non-linearity is a special case of correlated errors.

It is called autocorrelation because the correlation is between cases and not variables, although autocorrelation can often be traced to certain variables such as a common geographic location, or the same country, person or family.
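The way repeated interviews "trick" the standard-error formula can be seen in a small simulation of Example 1. The sizes of the person-level and interview-level errors are assumptions made for illustration, and `one_survey` is a hypothetical helper.

```python
import random
import statistics

def one_survey(rng, persons=10, interviews=10):
    """All persons*interviews answers; each person carries a persistent error."""
    answers = []
    for _ in range(persons):
        person_effect = rng.gauss(0, 1)      # shared by all interviews of one person
        for _ in range(interviews):
            answers.append(person_effect + rng.gauss(0, 0.5))
    return answers

rng = random.Random(7)
sample_means, naive_ses = [], []
for _ in range(2000):
    answers = one_survey(rng)
    sample_means.append(statistics.fmean(answers))
    # Textbook standard error of a mean, which treats all 100 answers as independent
    naive_ses.append(statistics.stdev(answers) / len(answers) ** 0.5)

actual_sd = statistics.stdev(sample_means)   # how much the mean really varies
avg_naive_se = statistics.fmean(naive_ses)   # what the formula reports on average
ratio = actual_sd / avg_naive_se

print("actual spread of the sample mean: %.3f" % actual_sd)
print("average naive standard error:     %.3f" % avg_naive_se)
print("ratio: %.1f" % ratio)
```

Under these assumptions the naive standard error understates the real uncertainty by a factor of about three, as if the sample held far fewer than 100 independent cases.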

Diagnosis:
- Visual: scatter plot
- Checking groups of cases that are theoretically suspect
- Certain forms of serial or spatial autocorrelation can be diagnosed by calculating certain statistics (e.g., the Durbin-Watson test)

Remedy:
- You can include new variables in the equation
- E.g., for serial (temporal) correlation you can include the value of Y in t-1 as an independent variable
- For spatial correlation we can often model the relationships by introducing a weight matrix
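The Durbin-Watson statistic mentioned above is the sum of squared differences between successive residuals divided by the sum of squared residuals. It is close to 2 when neighbouring errors are unrelated and falls toward 0 under positive serial correlation. In the sketch below it is applied to simulated error series rather than to residuals from a fitted regression, and the AR(1) coefficient of 0.8 is an assumption.

```python
import random

def durbin_watson(residuals):
    """Durbin-Watson statistic for a time-ordered series of residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

rng = random.Random(3)

# Independent errors: each period's error is unrelated to the previous one
independent = [rng.gauss(0, 1) for _ in range(2000)]

# Serially correlated errors: each error carries over 80% of the previous one
correlated = [0.0]
for _ in range(1999):
    correlated.append(0.8 * correlated[-1] + rng.gauss(0, 1))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)

print("DW, independent errors: %.2f" % dw_independent)
print("DW, correlated errors:  %.2f" % dw_correlated)
```

For the correlated series the statistic should come out near 2 × (1 − 0.8) = 0.4.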

Homoscedasticity means equal variance. Heteroscedasticity means unequal variance.

We assume that each prediction is not just on target on average but also that we make the same amount of error.

Heteroscedasticity results in biased standard errors and statistical significance.

Diagnosis: Visual, scatter plot

Remedy: Introducing a weight matrix (e.g. using 1/X)
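One way to read the 1/X remedy is as weighted least squares with weight 1/X for each case, which is appropriate if the error variance is proportional to X. That proportionality is an assumption of this sketch, as are the simulated data and the helper `weighted_fit`.

```python
import random
import statistics

def weighted_fit(x, y, w):
    """Intercept and slope minimizing sum(w * (y - a - b*x)**2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(11)
ols_slopes, wls_slopes = [], []
for _ in range(1000):
    x = [rng.uniform(1, 100) for _ in range(50)]
    # Assumed heteroscedastic errors: Var(e) = 4 * X, so the spread grows with X
    y = [1.0 + 2.0 * xi + rng.gauss(0, 2 * xi ** 0.5) for xi in x]
    ols_slopes.append(weighted_fit(x, y, [1.0] * len(x))[1])         # equal weights = OLS
    wls_slopes.append(weighted_fit(x, y, [1.0 / xi for xi in x])[1])  # weights 1/X

sd_ols = statistics.stdev(ols_slopes)
sd_wls = statistics.stdev(wls_slopes)

print("mean slope  OLS: %.3f  WLS: %.3f"
      % (statistics.fmean(ols_slopes), statistics.fmean(wls_slopes)))
print("sd of slope OLS: %.3f  WLS: %.3f" % (sd_ols, sd_wls))
```

Both estimators centre on the true slope of 2, but the weighted estimates should vary less from sample to sample.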

Error represents all factors influencing Y that are not included in the regression equation. If an omitted variable is related to X, the assumption is violated. This is the same as the Completeness or Omitted Variable Problem.

Diagnosis:
- The estimated error (the residual) will ALWAYS be uncorrelated with X; there is no way to establish the TRUE error
- Theoretical

Remedy: Adding new variables to the model
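The point that the estimated error is always uncorrelated with X, even when the true error is not, can be seen by simulating an omitted variable W that is related to X. All numbers and names below are assumptions for illustration.

```python
import random

def fit_line(x, y):
    """OLS intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def correlation(u, v):
    """Pearson correlation between two lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    suu = sum((ui - mu) ** 2 for ui in u)
    svv = sum((vi - mv) ** 2 for vi in v)
    return suv / (suu * svv) ** 0.5

rng = random.Random(5)
n = 5000
w = [rng.gauss(0, 1) for _ in range(n)]               # omitted variable
x = [0.7 * wi + rng.gauss(0, 1) for wi in w]          # X is related to W
true_error = [3.0 * wi + rng.gauss(0, 1) for wi in w]  # W ends up in the error
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, true_error)]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

corr_resid = correlation(x, residuals)   # zero by construction of OLS
corr_true = correlation(x, true_error)   # clearly positive

print("estimated slope: %.3f (true slope is 2)" % b)
print("corr(X, residual):   %.6f" % corr_resid)
print("corr(X, true error): %.3f" % corr_true)
```

The residuals give no hint of the problem, while the estimated slope is pushed well above the true value, which is why the diagnosis has to be theoretical.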

We sometimes estimate more than one regression. Suppose $Y_t = a + b_1 X_{t-1} + b_2 Z_{t-1} + e$ but $X_t = a' + b'_1 Y_{t-1} + b'_2 Z_{t-1} + e'$.

e and e' will be correlated (whatever is omitted from both equations will show up in both e and e', making them correlated).

This is also the case in sample selection models:

$S = a + b_1 X + b_2 Z + e$, where S is whether one is selected into the sample

$Y = a + b'_1 X + b'_2 Z + b'_3 W + b'_4 V + e'$, where Y is the outcome of interest

e and e' will be correlated (whatever is omitted from both equations will show up in both e and e', making them correlated).