Testing the Assumptions of Linear Regression



    Quantitative models always rest on assumptions about the way the world works, and regression models are no exception. There are four principal assumptions which justify the use of linear regression models for purposes of prediction:

    (i) linearity of the relationship between dependent and independent variables
    (ii) independence of the errors (no serial correlation)
    (iii) homoscedasticity (constant variance) of the errors
        (a) versus time
        (b) versus the predictions (or versus any independent variable)
    (iv) normality of the error distribution.

    If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

    Violations of linearity are extremely serious--if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

    How to detect: nonlinearity is usually most evident in a plot of the observed versus predicted values or a plot of residuals versus predicted values, which are a part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.
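    As a rough illustration of these two plots (not part of the original text), here is a minimal Python sketch; statsmodels and matplotlib are assumed to be available, and the data are made up:

        import numpy as np
        import statsmodels.api as sm
        import matplotlib.pyplot as plt

        rng = np.random.default_rng(0)
        x = rng.uniform(0, 10, 200)
        y = 2 + 1.5 * x + rng.normal(0, 1, 200)        # toy data for illustration

        fit = sm.OLS(y, sm.add_constant(x)).fit()

        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.scatter(fit.fittedvalues, y)               # observed vs. predicted:
        lims = [y.min(), y.max()]
        ax1.plot(lims, lims, color="gray")             # points should hug the diagonal
        ax1.set_xlabel("predicted")
        ax1.set_ylabel("observed")

        ax2.scatter(fit.fittedvalues, fit.resid)       # residuals vs. predicted:
        ax2.axhline(0, color="gray")                   # look for a "bowed" pattern
        ax2.set_xlabel("predicted")
        ax2.set_ylabel("residual")
        plt.show()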

    How to fix: consider applying a nonlinear transformation to the dependent and/or independent variables--if you can think of a transformation that seems appropriate. For example, if the data are strictly positive, a log transformation may be feasible. Another possibility to consider is adding another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted suggests a parabolic curve, then it may make sense to regress Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.
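    A hedged sketch of the two fixes just mentioned (logging a strictly positive Y, and adding X^2 as a second regressor), again with invented data and assuming statsmodels is available:

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        X = rng.uniform(1, 10, 200)
        Y = 10 + 0.5 * X**2 + rng.normal(0, 2, 200)     # curved relationship (toy data)

        # Option 1: log the dependent variable (only sensible when Y is strictly positive)
        log_fit = sm.OLS(np.log(Y), sm.add_constant(X)).fit()

        # Option 2: regress Y on both X and X^2 (works even if X or Y can be negative)
        quad_fit = sm.OLS(Y, sm.add_constant(np.column_stack([X, X**2]))).fit()
        print(quad_fit.params)                          # constant, X, and X^2 coefficients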

    Violations of independence are also very serious in time series regression models: serial correlation in the residuals means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly mis-specified model, as we saw in the auto sales example. Serial correlation is also sometimes a byproduct of a violation of the linearity assumption--as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time.

    How to detect: The best test for residual autocorrelation is to look at an autocorrelation plot of the residuals. (If this is not part of the standard output for your regression procedure, you can save the RESIDUALS and use another procedure to plot the autocorrelations.) Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus-or-minus 2 over the square root of n, where n is the sample size. Thus, if the sample size is 50, the autocorrelations should be between +/- 0.3. If the sample size is 100, they should be between +/- 0.2. Pay especially close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period, because these are probably not due to mere chance and are also fixable. The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW stat is approximately equal to 2(1-a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.0--say, between 1.4 and 2.6 for a sample size of 50.
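    The band and the Durbin-Watson approximation can be checked numerically; a minimal sketch (not from the original text), assuming statsmodels and using a made-up residual series of length 50:

        import numpy as np
        from statsmodels.tsa.stattools import acf
        from statsmodels.stats.stattools import durbin_watson

        rng = np.random.default_rng(2)
        resid = np.zeros(50)
        for t in range(1, 50):                    # toy residuals with some lag-1 correlation
            resid[t] = 0.4 * resid[t - 1] + rng.normal()

        band = 2 / np.sqrt(len(resid))            # rough 95% band, about 0.3 when n = 50
        r = acf(resid, nlags=12)                  # autocorrelations at lags 0..12
        flagged = [lag for lag in range(1, 13) if abs(r[lag]) > band]
        print("lags outside +/-%.2f: %s" % (band, flagged))

        dw = durbin_watson(resid)                 # approximately 2 * (1 - lag-1 autocorrelation)
        print("Durbin-Watson: %.2f" % dw)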

    How to fix: Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables. Or, if you have ARIMA options available, try adding an AR=1 or MA=1 term. (An AR=1 term in Statgraphics adds a lag of the dependent variable to the forecasting equation, whereas an MA=1 term adds a lag of the forecast error.) If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.
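    As a small illustration of "adding lags" (the data and variable names below are invented, and this is only one way to set it up), a lag of the dependent variable and a lag of a regressor can be added as columns before refitting:

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm

        rng = np.random.default_rng(3)
        n = 100
        x = rng.normal(size=n)
        y = np.zeros(n)
        for t in range(1, n):                              # toy series with carry-over from period to period
            y[t] = 0.5 * y[t - 1] + 2 * x[t] + rng.normal()

        df = pd.DataFrame({"y": y, "x": x})
        df["y_lag1"] = df["y"].shift(1)                    # lag of the dependent variable
        df["x_lag1"] = df["x"].shift(1)                    # lag of an independent variable
        subset = df.dropna()
        fit = sm.OLS(subset["y"], sm.add_constant(subset[["x", "y_lag1", "x_lag1"]])).fit()
        print(fit.params)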

    If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3 or DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to patterns of negative correlation that lagged variables cannot correct for.

    If there is significant correlation at the seasonal period (e.g., at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted), or (ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!), or (iii) add seasonal dummy variables to the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.). The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
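    A minimal sketch of option (iii), the dummy-variable approach, for quarterly data (the data and column names are invented; pandas and statsmodels are assumed):

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm

        rng = np.random.default_rng(4)
        n = 40
        quarter = np.tile([1, 2, 3, 4], n // 4)               # quarter of each observation
        y = 10 + np.array([0.0, 3.0, -2.0, 1.0])[quarter - 1] + rng.normal(0, 1, n)

        dummies = pd.get_dummies(quarter, prefix="Q", drop_first=True).astype(float)
        X = sm.add_constant(dummies)                          # Q_2, Q_3, Q_4 versus baseline Q_1
        fit = sm.OLS(y, X).fit()
        print(fit.params)        # a different additive constant is estimated for each quarter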

    Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating.

    Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

    How to detect: look at plots of residuals versus time and residuals versus predicted value, and be alert for evidence of residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)
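    A very rough numeric companion to these plots (purely illustrative, not a formal test and not part of the original text): compare the residual spread over the lower and upper halves of the predicted values, here on made-up data whose error variance grows with x:

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(5)
        x = np.linspace(1, 10, 200)
        y = 2 + 3 * x + rng.normal(0, 0.5 * x)           # toy data: error spread grows with x

        fit = sm.OLS(y, sm.add_constant(x)).fit()
        resid = fit.resid[np.argsort(fit.fittedvalues)]  # residuals ordered by predicted value
        half = len(resid) // 2
        print("residual SD, smaller predictions: %.2f" % resid[:half].std())
        print("residual SD, larger predictions:  %.2f" % resid[half:].std())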

    How to fix: In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case. Stock market data may show periods of increased or decreased volatility over time--this is normal and is often modeled with so-called ARCH (auto-regressive conditional heteroscedasticity) models in which the error variance is fitted by an autoregressive model. Such models are beyond the scope of this course--however, a simple fix would be to work with shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.

    Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumptions of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

    How to detect: the best test for normally distributed errors is a normal probability plot of the residuals. This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis--i.e., there are either too many or too few large errors in both directions.
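    A minimal sketch of a normal probability plot of the residuals (statsmodels' qqplot is used here; the heavy-tailed toy data are invented to produce the kind of S-shape described above):

        import numpy as np
        import statsmodels.api as sm
        import matplotlib.pyplot as plt

        rng = np.random.default_rng(6)
        x = rng.normal(size=150)
        y = 1 + 2 * x + rng.standard_t(df=3, size=150)   # heavy-tailed errors (toy data)

        resid = sm.OLS(y, sm.add_constant(x)).fit().resid
        sm.qqplot(resid, line="45", fit=True)            # points should hug the diagonal;
        plt.show()                                       # an S-shape suggests excess kurtosis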

    How to fix: violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.

    http://www.duke.edu/~rnau/testing.htm

    Normality

    You also want to check that your data is normally distributed. To do this,

    you can construct histograms and "look" at the data to see its

    distribution. Often the histogram will include a line that depicts what the

    shape would look like if the distribution were truly normal (and you can "eyeball" how much the actual distribution deviates from this line). This

    histogram shows that age is normally distributed:
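    A minimal sketch of this kind of check, using an invented "age" variable with a normal curve overlaid (scipy and matplotlib are assumed):

        import numpy as np
        from scipy import stats
        import matplotlib.pyplot as plt

        rng = np.random.default_rng(7)
        age = rng.normal(45, 12, 300)                   # toy data standing in for age

        plt.hist(age, bins=20, density=True)            # the actual distribution
        grid = np.linspace(age.min(), age.max(), 200)
        plt.plot(grid, stats.norm.pdf(grid, age.mean(), age.std()))   # normal reference curve
        plt.xlabel("age")
        plt.show()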


    You can also construct a normal probability plot. In this plot, the actual

    scores are ranked and sorted, and an expected normal value is computed

    and compared with an actual normal value for each case. The expected

    normal value is the position a case with that rank holds in a normal distribution. The normal value is the position it holds in the actual

    distribution. Basically, you would like to see your actual values lining up

    along the diagonal that goes from lower left to upper right. This plot also

    shows that age is normally distributed:


    You can also test for normality within the regression analysis by looking

    at a plot of the "residuals." Residuals are the difference between obtained and predicted DV scores. (Residuals will be explained in more

    detail in a later section.) If the data are normally distributed, then

    residuals should be normally distributed around each predicted DV

    score. If the data (and the residuals) are normally distributed, the

    residuals scatterplot will show the majority of residuals at the center of

    the plot for each value of the predicted score, with some residuals

    trailing off symmetrically from the center. You might want to do the

    residual plot before graphing each variable separately because if this

    residuals plot looks good, then you don't need to do the separate plots.

    Below is a residual plot of a regression where age of patient and time (in

    months since diagnosis) are used to predict breast tumor size. These data

    are not perfectly normally distributed in that the residuals above the zero line appear slightly more spread out than those below the zero line. Nevertheless, they do appear to be fairly normally distributed.

    In addition to a graphic examination of the data, you can also

    statistically examine the data's normality. Specifically, statistical

    programs such as SPSS will calculate the skewness and kurtosis for each

    variable; an extreme value for either one would tell you that the data are

    not normally distributed. "Skewness" is a measure of how symmetrical

    the data are; a skewed variable is one whose mean is not in the middle of

    the distribution (i.e., the mean and median are quite different).

    "Kurtosis" has to do with how peaked the distribution is, either toopeaked or too flat. "Extreme values" for skewness and kurtosis are

    values greater than +3 or less than -3. If any variable is not normally

    distributed, then you will probably want to transform it (which will be

    discussed in a later section). Checking for outliers will also help with the

    normality problem.
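    A minimal sketch of this numeric check with scipy (the variable below is invented and deliberately right-skewed):

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(8)
        income = rng.lognormal(mean=3, sigma=0.8, size=500)    # toy, right-skewed variable

        print("skewness: %.2f" % stats.skew(income))
        print("kurtosis: %.2f" % stats.kurtosis(income))       # excess kurtosis (normal = 0)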


    Linearity

    Regression analysis also has an assumption of linearity. Linearity means

    that there is a straight line relationship between the IVs and the DV. This

    assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV. Any nonlinear

    relationship between the IV and DV is ignored. You can test for linearity

    between an IV and the DV by looking at a bivariate scatterplot (i.e., a

    graph with the IV on one axis and the DV on the other). If the two

    variables are linearly related, the scatterplot will be oval.

    Looking at the above bivariate scatterplot, you can see that friends is

    linearly related to happiness. Specifically, the more friends you have, the

    greater your level of happiness. However, you could also imagine that

    there could be a curvilinear relationship between friends and happiness,

    such that happiness increases with the number of friends to a point.


    Beyond that point, however, happiness declines with a larger number of friends. This is demonstrated by the graph below:

    You can also test for linearity by using the residual plots described

    previously. This is because if the IVs and DV are linearly related, then

    the relationship between the residuals and the predicted DV scores will

    be linear. Nonlinearity is demonstrated when most of the residuals are

    above the zero line on the plot at some predicted values, and below the

    zero line at other predicted values. In other words, the overall shape of

    the plot will be curved, instead of rectangular. The following is a

    residuals plot produced when happiness was predicted from number of friends and age. As you can see, the data are not linear:


    The following is an example of a residuals plot, again predicting

    happiness from friends and age. But, in this case, the data are linear:

    If your data are not linear, then you can usually make them linear by transforming the IVs or the DV so that there is a linear relationship between them. Sometimes transforming one variable won't work; the IV and DV

    are just not linearly related. If there is a curvilinear relationship between

    the DV and IV, you might want to dichotomize the IV because a

    dichotomous variable can only have a linear relationship with another

    variable (if it has any relationship at all). Alternatively, if there is a curvilinear relationship between the IV and the DV, then you might need

    to include the square of the IV in the regression (this is also known as a

    quadratic regression).

    The failure of linearity in regression will not invalidate your analysis so

    much as weaken it; the linear regression coefficient cannot fully capture

    the extent of a curvilinear relationship. If there is both a curvilinear and

    a linear relationship between the IV and DV, then the regression will at least capture the linear relationship.

    Homoscedasticity

    The assumption of homoscedasticity is that the residuals are

    approximately equal for all predicted DV scores. Another way of

    thinking of this is that the variability in scores for your IVs is the same at

    all values of the DV. You can check homoscedasticity by looking at the

    same residuals plot talked about in the linearity and normality sections. Data are homoscedastic if the residuals plot is the same width for all

    values of the predicted DV. Heteroscedasticity is usually shown by a

    cluster of points that is wider as the values for the predicted DV get

    larger. Alternatively, you can check for homoscedasticity by looking at a

    scatterplot between each IV and the DV. As with the residuals plot, you

    want the cluster of points to be approximately the same width all over.

    The following residuals plot shows data that are fairly homoscedastic. In

    fact, this residuals plot shows data that meet the assumptions of

    homoscedasticity, linearity, and normality (because the residual plot is

    rectangular, with a concentration of points along the center):


    Heteroscedasticity may occur when some variables are skewed and

    others are not. Thus, checking that your data are normally distributed

    should cut down on the problem of heteroscedasticity. Like the assumption of linearity, violation of the assumption of homoscedasticity

    does not invalidate your regression so much as weaken it.

    Multicollinearity and Singularity

    Multicollinearity is a condition in which the IVs are very highly

    correlated (.90 or greater) and singularity is when the IVs are perfectly

    correlated and one IV is a combination of one or more of the other IVs.

    Multicollinearity and singularity can be caused by high bivariate

    correlations (usually of .90 or greater) or by high multivariate

    correlations. High bivariate correlations are easy to spot by simply

    running correlations among your IVs. If you do have high bivariate

    correlations, your problem is easily solved by deleting one of the two

    variables, but you should check your programming first; often this is a mistake made when you created the variables. It's harder to spot high

    multivariate correlations. To do this, you need to calculate the SMC for

    each IV. SMC is the squared multiple correlation (R^2) of the IV when it serves as the DV predicted by the rest of the IVs. Tolerance, a related concept, is calculated as 1 - SMC. Tolerance is the proportion of a variable's variance that is not accounted for by the other IVs in the

    equation. You don't need to worry too much about tolerance in that most

    programs will not allow a variable to enter the regression model if

    tolerance is too low.
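    A minimal sketch of the SMC/tolerance calculation (together with the closely related variance inflation factor, VIF = 1/tolerance, which many programs report instead); the data and column names below are invented, and statsmodels is assumed:

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        rng = np.random.default_rng(9)
        iv = pd.DataFrame({"x1": rng.normal(size=200)})
        iv["x2"] = 0.95 * iv["x1"] + 0.05 * rng.normal(size=200)   # nearly collinear with x1
        iv["x3"] = rng.normal(size=200)

        for col in iv.columns:
            others = sm.add_constant(iv.drop(columns=col))
            smc = sm.OLS(iv[col], others).fit().rsquared           # R^2 of this IV on the other IVs
            print("%s  SMC: %.3f  tolerance: %.3f" % (col, smc, 1 - smc))

        X = sm.add_constant(iv)
        vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]
        print("VIFs:", [round(v, 1) for v in vifs])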

    Statistically, you do not want singularity or multicollinearity because

    calculation of the regression coefficients is done through matrix

    inversion. Consequently, if singularity exists, the inversion is impossible, and if multicollinearity exists the inversion is unstable.

    Logically, you don't want multicollinearity or singularity because if they

    exist, then your IVs are redundant with one another. In such a case, one

    IV doesn't add any predictive value over another IV, but you do lose a

    degree of freedom. As such, having multicollinearity/singularity can weaken your analysis. In general, you probably wouldn't want to include two IVs that correlate with one another at .70 or greater.

    Transformations