10 inference for regression part2

Upload: rama-dulce

Post on 06-Jul-2018


8/17/2019 10 Inference for Regression Part2

    INFERENCE FOR REGRESSION –  PART 2

    Topics Outline

    Analysis of Variance (ANOVA)

     F  Test for the Slope  

    Calculating Confidence and Prediction Intervals

     Normal Probability Plots

    Analysis of Variance (ANOVA)

    Question: If the least squares fit is the best fit, how good is it?

    The answer to this question depends on the variability in the values of the response variable,

that is, on the deviations of the observed yᵢ's from their mean ȳ:

    Figure 1 Decomposition of total variation

The total variation yᵢ − ȳ for an observed yᵢ can be decomposed into two parts:

    yᵢ − ȳ  =  (ŷᵢ − ȳ)  +  (yᵢ − ŷᵢ)

    total variation = variation explained by the regression line + unexplained variation

If we square these deviations and add them up, we obtain the following three sources of variability:


Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

    Σᵢ₌₁ⁿ (yᵢ − ȳ)²  =  Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²  +  Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

    SST = SSR + SSE

If we divide SST by n − 1, we get the sample variance of the observations y₁, y₂, …, yₙ:

    s_y² = Σᵢ₌₁ⁿ (yᵢ − ȳ)² / (n − 1) = SST / (n − 1)

So the total sum of squares SST really is a measure of total variation. It has n − 1 degrees of freedom.

    The regression sum of squares SSR  represents variation due to the relationship between x and y.

    It has 1 degree of freedom.

The error sum of squares SSE measures the amount of variability in the response variable due to factors other than the relationship between x and y. It has n − 2 degrees of freedom.

    For the car plant electricity usage example (see Excel output on pages 5 and 6),

    SST = SSR + SSE

    1.5115 = 1.2124 + 0.2991

By themselves, SSR, SSE, and SST provide little that can be directly interpreted. However, a simple ratio of the regression sum of squares SSR to the total sum of squares SST provides a measure of the goodness of fit for the estimated regression equation. This ratio is the coefficient of determination:

    r² = SSR / SST = 1 − SSE / SST

    The coefficient of determination is the proportion of the total sum of squares that can be

explained by the sum of squares due to regression. In other words, r² measures the proportion of variation in the response variable y that can be explained by y's linear dependence on x in the

    regression model.

For our data,

    r² = SSR / SST = 1.2124 / 1.5115 = 0.802

    This high proportion of explained variation indicates that the estimated regression equation provides a good fit and can be very useful for predictions of electricity usage.
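The whole ANOVA decomposition can be reproduced directly from the data. A minimal sketch in Python (assuming numpy is available), using the monthly data from Example 1 in these notes:

```python
import numpy as np

# Monthly data from Example 1 (car plant electricity usage):
# x = production ($ million), y = electricity usage (million kWh)
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83,
              4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46,
              3.03, 3.26, 2.67, 2.53])

# Least squares fit: slope b, intercept a
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares

r_sq = SSR / SST                       # coefficient of determination
```

Running this reproduces the values quoted above: SST ≈ 1.5115, SSE ≈ 0.2991, and r² ≈ 0.802.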


If we divide the sums of squares for regression and error by their degrees of freedom, we obtain the regression mean square MSR and the error mean square MSE:

    MSR = SSR / 1 = SSR          (variance due to regression)

    MSE = SSE / (n − 2)          (variance due to error)

Taking the square root of the variance due to error, we obtain the regression standard error, or standard error of estimate:

    sₑ = √MSE = √( SSE / (n − 2) )

Note: Recall that

    sₑ = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) )

is the standard deviation of the residuals, and it estimates
the standard deviation of the errors in the model.

For our example,

    MSR = SSR / 1 = 1.2124

    MSE = SSE / (n − 2) = 0.2991 / (12 − 2) = 0.02991
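These mean squares, and the regression standard error, can be checked directly from the sums of squares quoted in the Excel ANOVA table. A small Python sketch:

```python
n = 12
SSR, SSE = 1.212382, 0.299110   # sums of squares from the Excel ANOVA table

MSR = SSR / 1                   # regression mean square (1 df)
MSE = SSE / (n - 2)             # error mean square (n - 2 df)
s_e = MSE ** 0.5                # regression standard error (standard error of estimate)
```

The result matches the "Standard Error" entry of the regression output, 0.172948.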

    F  Test for the Slope

The ratio of the mean squares provides the F-statistic:

    F = MSR / MSE

For our example,

    F = MSR / MSE = 1.2124 / 0.02991 = 40.53

It can be proved that the F-statistic follows an F distribution with 1 and (n − 2) degrees of freedom and can be used to test the hypothesis of a linear relationship between x and y.


Recall that β₁ represents the slope of the true unknown regression line

    E(y) = β₀ + β₁x

The null and alternative hypotheses for the slope are stated as follows:

    H₀: β₁ = 0        Hₐ: β₁ ≠ 0

    If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.

For our example, the corresponding F distribution has df1 = 1 and df2 = n − 2 = 12 − 2 = 10 degrees of freedom. To calculate the P-value associated with the value F = 40.53 of the test statistic, Excel can be used as follows:

    P-value = FDIST(test statistic, df1, df2) = FDIST(40.53, 1, 10) = 0.000082

    Figure 2 Testing for significance of slope using F  distribution with 1 and 10 degrees of freedom

With a P-value this small, we reject the null hypothesis and conclude that there is a significant linear relationship between the electricity usage and the production levels.
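The same F-statistic and P-value can be reproduced outside Excel; here is a sketch using scipy (assuming scipy is available — its survival function `f.sf` gives the same upper-tail area as Excel's FDIST):

```python
from scipy import stats

MSR, MSE = 1.212382, 0.029911   # mean squares from the ANOVA table
F = MSR / MSE                   # F-statistic, about 40.53

# Upper-tail probability of F(1, 10); matches Excel's FDIST(F, 1, 10)
p_value = stats.f.sf(F, 1, 10)
```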

Notice that the P-value = 0.000082 for the F test of the slope is the same as the P-value for the t test of the slope performed earlier. Moreover, it can be shown that the square of a t distribution with n − 2 degrees of freedom equals the F distribution with 1 and n − 2 degrees of freedom:

    t²ₙ₋₂ = F₁, ₙ₋₂

For our example,

    t² = (b / SE_b)² = (0.498830 / 0.078352)² = 6.366551² = 40.53 = F

With only one explanatory variable, the F test will provide the same conclusion as the t test. But with more than one explanatory variable, only the F test can be used to test for an overall significant relationship.
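The t² = F identity is easy to verify numerically from the regression output:

```python
b, SE_b = 0.498830, 0.078352    # slope estimate and its standard error
F = 40.532970                   # F-statistic from the ANOVA table

t = b / SE_b                    # t-statistic for the slope, about 6.3666
# With a single explanatory variable, t squared equals the F-statistic
t_squared = t ** 2
```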


Example 1 (Car plant electricity usage)

The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production, based on the data for each month of the previous year:

    Month       Production (x)   Electricity usage (y)
    January     4.51             2.48
    February    3.58             2.26
    March       4.31             2.47
    April       5.06             2.77
    May         5.64             2.99
    June        4.99             3.05
    July        5.29             3.18
    August      5.83             3.46
    September   4.70             3.03
    October     5.61             3.26
    November    4.90             2.67
    December    4.20             2.53

    df SS MS F Significance F

    Regression 1 1.212382 1.212382 40.532970 0.000082

    Residual 10 0.299110 0.029911

    Total 11 1.511492

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 0.409048 0.385991 1.059736 0.314190 -0.450992 1.269089

    Production 0.498830 0.078352 6.366551 0.000082 0.324252 0.673409

[Scatter plot: Car Plant Electricity Usage. Electricity usage (million kWh) versus Production ($ million), with fitted trendline y = 0.4988x + 0.409 and R² = 0.8021]

     Regression Statistics

    Multiple R 0.895606

    R Square 0.802109

    Adjusted R Square 0.782320

    Standard Error 0.172948

    Observations 12


    Interpretation of Excel’s Regression Output 

Regression Statistics

    Multiple R          0.895606    (sign of b) · √r²
    R Square            0.802109    r² = SSR/SST = 1 − SSE/SST
    Adjusted R Square   0.782320    r²_adj = 1 − (SSE/(n − 2)) / (SST/(n − 1))
    Standard Error      0.172948    sₑ = √(SSE/(n − 2))
    Observations        12          n

ANOVA

    Source of    Degrees      Sum          Mean Square
    Variation    of Freedom   of Squares   (Variance)          F-statistic   P-value
    Regression   1            SSR          MSR = SSR/1         F = MSR/MSE   Prob > F
    Error        n − 2        SSE          MSE = SSE/(n − 2)
    Total        n − 1        SST

    df SS MS F Significance F

    Regression 1 1.212382 1.212382 40.532970 0.000082

    Residual 10 0.299110 0.029911

    Total 11 1.511492

                 Coefficients   Std Error   t Stat     P-value      Lower 95%     Upper 95%
    y-intercept  a              SE_a        a / SE_a   Prob > |t|   a − t*·SE_a   a + t*·SE_a
    x (slope)    b              SE_b        b / SE_b   Prob > |t|   b − t*·SE_b   b + t*·SE_b

(t* is the critical value of the t distribution with n − 2 degrees of freedom.)

                 Coefficients   Std Error   t Stat     P-value      Lower 95%     Upper 95%
    Intercept    0.409048       0.385991    1.059736   0.314190     -0.450992     1.269089
    Production   0.498830       0.078352    6.366551   0.000082     0.324252      0.673409


    Calculating Confidence and Prediction Intervals 

StatTools provides prediction intervals for individual values, but it does not provide confidence intervals for the mean of y, given a set of x's.

The Excel file ConfPredInt.xlsx (located in the Materials folder on Blackboard) computes the confidence interval and the prediction interval for the Car Plant Electricity Usage data. The same file can be used to calculate the confidence and prediction intervals for any other data after providing the information in the cells filled with yellow color.

Reminder: The prediction interval is always wider than the confidence interval because there is much more variation in predicting an individual value than in estimating a mean value.
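The standard textbook formulas behind such intervals can be sketched in Python (assuming numpy and scipy are available; the production level x0 = 5.0 is chosen purely for illustration). The only difference between the two half-widths is the extra "1" under the square root for the prediction interval:

```python
import numpy as np
from scipy import stats

# Car plant data (Example 1)
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83,
              4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46,
              3.03, 3.26, 2.67, 2.53])

n = len(x)
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # regression standard error
Sxx = np.sum((x - x.mean()) ** 2)
t_star = stats.t.ppf(0.975, n - 2)                  # critical value for 95%

x0 = 5.0                       # hypothetical production level (for illustration)
y0 = a + b * x0                # point estimate at x0

# Half-width for the mean of y at x0 (confidence interval)
half_ci = t_star * s_e * np.sqrt(1/n + (x0 - x.mean())**2 / Sxx)
# Half-width for an individual y at x0 (prediction interval): extra "1" inside
half_pi = t_star * s_e * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)
```

The prediction half-width is always the larger of the two, which is exactly the reminder above.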

    Normal Probability Plots

The third regression assumption states that the error terms are normally distributed. You can evaluate this assumption using a histogram or a normal probability plot of the residuals.

One common normal probability plot is called the quantile-quantile or Q-Q plot. It is a scatter plot of the standardized values from the data set being evaluated versus the values that would be expected if the data were perfectly normally distributed (with the same mean and standard deviation as in the data set). If the data are, in fact, normally distributed, the points in this plot tend to cluster around a 45° line. Any large deviation from a 45° line signals some type of non-normality.

The following figure illustrates the typical shapes of normal probability plots. If the data are left-skewed, the curve will rise more rapidly at first and then level off. If the data are normally distributed, the points will plot along an approximately straight line. If the data are right-skewed, the data will rise more slowly at first and then rise at a faster rate for higher values of the variable being plotted.

Figure 3 Normal probability plots for (a) left-skewed, (b) normal, and (c) right-skewed distributions.

Regression analysis procedures are very robust to modest departures from normality. Unless the distribution of the residuals is severely non-normal, the inferences made from the regression output are still approximately valid. In addition, some forms of non-normality can often be remedied by transformations of the response variable.

Although the Regression tool in Excel offers a normal probability plot, this is not the appropriate probability plot for the residuals, so do not use this option. With StatTools, select Q-Q Normal Plot from the Normality Tests dropdown list and check both options at the bottom of the dialog box.
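The Q-Q points can also be computed by hand; here is a sketch (assuming numpy and scipy are available — the (i − 0.5)/n plotting positions used below are one common convention, and software packages may use slightly different ones):

```python
import numpy as np
from scipy import stats

# Residuals from the car plant fit (recomputed here rather than read from output)
x = np.array([4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83,
              4.70, 5.61, 4.90, 4.20])
y = np.array([2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46,
              3.03, 3.26, 2.67, 2.53])
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

n = len(residuals)
std_res = (residuals - residuals.mean()) / residuals.std(ddof=1)

# Expected normal quantiles at plotting positions (i - 0.5)/n
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = stats.norm.ppf(probs)
observed = np.sort(std_res)
# Plotting observed against theoretical gives the Q-Q plot;
# points near the 45-degree line suggest approximately normal residuals
```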


Example 2 (Sunflowers Apparel)

The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years as the chain has expanded the number of stores. Until now, Sunflowers managers selected sites based on subjective factors, such as the availability of a good lease or the perception that a location seemed ideal for an apparel store. The new director of planning wants to develop a systematic approach that will lead to making better decisions during the site selection process. He believes that the size of the store significantly contributes to store sales, and wants to use this relationship in the decision-making process.

To examine the relationship between the store size and its annual sales, data were collected from a sample of 14 stores. The data are stored in Sunflowers_Apparel.xlsx.

(a) Use Excel's Regression tool or StatTools to run a linear regression. Does a straight line provide a useful mathematical model for this relationship?

Regression Statistics

Multiple R 0.950883

R Square 0.904179

    Adjusted R Square 0.896194

    Standard Error 0.966380

    Observations 14

    ANOVA

    df SS MS F Significance F

    Regression 1 105.747610 105.747610 113.233513 0.00000018

    Residual 12 11.206676 0.933890

    Total 13 116.954286

    Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

    Intercept 0.964474 0.526193 1.832927 0.091727 -0.182003 2.11095

    Square Feet 1.669862 0.156925 10.641124 0.000000 1.327951 2.01177

[Scatter plot: Square Feet Line Fit Plot. Annual Sales ($ million) versus Square Feet (thousands), with fitted trendline y = 1.6699x + 0.9645 and R² = 0.9042]


As the scatter plot and the high r = 0.95 show, as the size of the store increases, annual sales increase approximately as a straight line. Thus, we can assume that a straight line provides a useful mathematical model for this relationship.

(b) Can you safely predict the annual sales for a store whose size is 7 thousand square feet?

The square footage varies from 1.1 to 5.8 thousand square feet. Therefore, annual sales should be predicted only for stores whose size is between 1.1 and 5.8 thousand square feet. It would be improper to use the prediction line to forecast the sales for a new store containing 7,000 square feet because the relationship between sales and store size may have a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales becomes smaller and smaller.

    (c) What are the values of SST, SSR, and SSE? Please verify that SST = SSR + SSE.

    SST = 116.954286

SSR = 105.747610

SSE = 11.206676

    116.954286 = 105.747610 + 11.206676

(d) Use the above sums to calculate the coefficient of determination and interpret it.

    r² = SSR / SST = 105.747610 / 116.954286 = 0.904179

Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of the store as measured by the square footage. This large r² indicates a strong linear relationship between these two variables, because the use of a regression model has reduced the variability in predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses square footage.

    (e) Interpret the standard error of the estimate.

sₑ = 0.966380

Recall that the standard error of the estimate represents a measure of the variation around the prediction line. It is measured in the same units as the dependent variable y. Here, the typical difference between the actual annual sales at a store and the predicted annual sales using the regression equation is approximately 0.966380 million dollars, or $966,380.

(f) What are the expected annual sales and the residual value for the last data pair (x = 3, y = 4.1)? Interpret both of these values in business terms.

From the regression output, the expected (or predicted) annual sales equal 5.974061, indicating that we expect annual sales to be $5,974,061, on average, for a store with a size of 3,000 square feet. The residual equals −1.874061, indicating that for the store corresponding to the last pair in the data set the actual annual sales were $1,874,061 lower than expected.
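Both numbers follow directly from the fitted equation; a quick arithmetic check:

```python
a, b = 0.964474, 1.669862    # intercept and slope from the Sunflowers output
x0, y0_actual = 3.0, 4.1     # last data pair

y0_hat = a + b * x0          # predicted annual sales, in $ million
residual = y0_actual - y0_hat
```

With the rounded coefficients this gives 5.974060 and −1.874060, matching the values cited in the text up to rounding.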


    (g) Construct a residual plot.

(h) Use the residual plot to evaluate the regression model assumptions about linearity (mean of zero), independence, and equal spread of the residuals.

Linearity: Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and the xᵢ's. The residuals appear to be evenly spread above and below 0 for different values of x.

Independence: You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. If the values of y are part of a time series, one residual may sometimes be related to the previous residual. If this relationship exists between consecutive residuals (which violates the assumption of independence), the plot of the residuals versus the time in which the data were collected will often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the same time period, you can assume that the independence assumption is satisfied for these data.

Equal spread: There do not appear to be major differences in the variability of the residuals for different xᵢ values. Thus, you can conclude that there is no apparent violation of the assumption of equal spread at each level of x.

[Residual plot: Square Feet Residual Plot. Residuals versus Square Feet]


(i) Use StatTools to construct a normal probability plot and evaluate the regression model assumption about normality of the residuals.

From the Q-Q plot of the residuals, the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis to modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.

Note: Excel does not readily provide a normal probability plot of residuals, but it can be used to get one in the following way. Run a regression with y = Residuals and x = any numbers (for example, x = 1, 2, 3, ...) and check the Normal Probability Plots box. Here is the result for our example.

    (j) Find the standard error of the slope coefficient. What does this number indicate?

The standard error of the slope coefficient indicates the uncertainty in the estimated slope. It measures about how far the estimated slope (the regression coefficient computed from the sample) differs from the (idealized) true population slope, β₁, due to the randomness of sampling. Here, the estimated slope b = 1.669862, or $1,669,862, differs from the population slope by about SE_b = 0.156925, or $156,925.

[Q-Q Normal Plot of Residual / Data Set #2: Standardized Q-Value versus Z-Value]

[Normal Probability Plot: Residuals versus Sample Percentile]


(k) Find and interpret the 95% confidence interval for the slope coefficient. (Note: In economics language for the slope, this question sounds like the following: Find the 95% confidence interval for the expected marginal value of an additional 1,000 square feet to Sunflowers Apparel.)

The 95% confidence interval extends from 1.327951 to 2.011773, or $1,327,951 to $2,011,773. We are 95% confident that the additional annual sales, for an additional 1,000 square feet in store size, are between $1,327,951 and $2,011,773, on average. (In economics language: We are 95% sure that the expected marginal value of an additional 1,000 square feet is between $1,327,951 and $2,011,773.)

(l) Use the F statistic from the regression output to determine whether the slope is statistically significant.

The ANOVA table shows that the computed test statistic is F = 113.2335 and the P-value is approximately 0. Therefore, you reject the null hypothesis H₀: β₁ = 0 and conclude that the size of the store is significantly related to annual sales.

(m) Use the t statistic from the regression output to determine whether the slope is statistically significant.

The slope is significantly different from 0. This can be seen either from the P-value of the test statistic (for t = 10.6411, P-value ≈ 0) or from the confidence interval for the slope (1.33 to 2.01), which does not include 0. This says that the size of the store significantly contributes to store annual sales. Stores with larger size make larger annual sales, on average.

    (n) Use the ConfPredInt.xlsx to construct a 95% confidence interval of the mean annual sales

    for the entire population of stores that contain 4,000 square feet ( x = 4).

    The confidence interval is 6.971119 to 8.316727. Therefore, the mean annual sales are between $6,971,119 and $8,316,727 for the population of stores with 4,000 square feet.

    (o) Use the ConfPredInt.xlsx to construct a 95% prediction interval of the annual sales for anindividual store that contains 4,000 square feet ( x = 4).

The prediction interval is 5.433482 to 9.854364. Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet are between $5,433,482 and $9,854,364.

    (p) Compare the intervals constructed in (n) and (o).

If you compare the results of the confidence interval estimate and the prediction interval estimate, you see that the width of the prediction interval for an individual store is much wider than the confidence interval estimate for the mean.
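The comparison can be made precise from the two intervals quoted above. Both intervals are centered at the same point estimate ŷ at x = 4; only their widths differ:

```python
ci_lo, ci_hi = 6.971119, 8.316727   # 95% confidence interval for the mean (part n)
pi_lo, pi_hi = 5.433482, 9.854364   # 95% prediction interval for one store (part o)

ci_width = ci_hi - ci_lo            # about 1.35 ($ million)
pi_width = pi_hi - pi_lo            # about 4.42 ($ million)

# Both intervals share the same center: the predicted value at x = 4
ci_center = (ci_lo + ci_hi) / 2
pi_center = (pi_lo + pi_hi) / 2
```

The prediction interval is more than three times as wide here, reflecting the extra variability of an individual store's sales around the mean.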