07 simple linear regression part2

Upload: rama-dulce

Post on 06-Jul-2018

  • 8/17/2019 07 Simple Linear Regression Part2

    SIMPLE LINEAR REGRESSION – PART 2

    Topics Outline

    • Scatterplots and Correlation

    • The Least Squares Regression Line

    Example 1

    Sales versus Promotions at Pharmex

    Pharmex is a chain of drugstores that operates around the country. To see how effective its advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions. In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year. The data are listed in the file Drugstore_Sales.xlsx.

    There are two variables:

    – Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor
    – Sales: Pharmex’s sales as a percentage of those of the leading competitor

    Note that each of these variables is an index, not a dollar amount. For example, if Promote equals 95 for some region, this tells us only that Pharmex’s promotional expenditures in that region are 95% as large as those for the leading competitor in that region.

    The company expects that there is a positive relationship between these two variables, so that regions with relatively larger expenditures have relatively larger sales. However, it is not clear what the nature of this relationship is.

    Using StatTools
    ===============

    Define a StatTools data set: click anywhere within the data set → Data Set Manager → OK

    Get a scatterplot: Summary Graphs → Scatterplot → choose X and Y → OK

    Get a regression output: Regression and Classification → Regression → fill in the resulting dialog box → OK

    Using Excel
    ===========

    Getting a regression output in Excel: Data → Data Analysis, Regression, OK

    Input Y Range, Input X Range → check Labels, Line Fit Plots → OK

    Formatting the scatterplot: delete unwanted information (right-click, delete) → right-click, Add Trendline → check Display Equation on chart, Display R-squared value on chart → Close

    Formatting the axes: right-click → Format axis → enter desired values → Close

    Note 1: If the Analysis ToolPak is not installed in Excel, follow these steps: click the arrow in the upper left corner → More commands → Add-Ins → Analysis ToolPak → Go → Analysis ToolPak → OK

    Note 2: The slope and intercept of the least squares line can also be calculated directly in Excel using the functions SLOPE and INTERCEPT.
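
For readers working outside Excel, the calculation behind SLOPE and INTERCEPT can be sketched in a few lines of Python. This is a minimal illustration using the standard least squares formulas. The function names and the small data set are made up for the example and are not the Pharmex data.

```python
def slope(xs, ys):
    """Least squares slope: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def intercept(xs, ys):
    """Least squares intercept: y_bar - slope * x_bar."""
    return sum(ys) / len(ys) - slope(xs, ys) * sum(xs) / len(xs)


# Hypothetical illustration data (NOT the Pharmex data set).
promote = [80, 90, 100, 110, 120]
sales = [85, 95, 100, 112, 118]

b1 = slope(promote, sales)
b0 = intercept(promote, sales)
print(f"Predicted Sales = {b0:.2f} + {b1:.3f} Promote")
```

For these made-up values the slope works out to 0.83 and the intercept to 19, so the fitted line passes through the point of means (100, 102).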

    (a) What type of relationship, if any, is apparent from a scatterplot?

    This scatterplot indicates that there is a positive relationship between Promote and Sales – the points tend to rise from bottom left to top right – but the relationship is not perfect. If it were perfect, a given value of Promote would prescribe the value of Sales exactly. Clearly, this is not the case. For example, there are five regions with promotional values of 96 but all of them have different sales values. So the scatterplot indicates that while the variable Promote is helpful for predicting Sales, it does not lead to perfect predictions.

    (b) Can the drugstore manager conclude that larger promotional expenses cause larger sales values?

    No. Unless the data are obtained in a carefully controlled experiment – which is certainly not the case here – you can never be absolutely sure about causation. One reason is that you can’t always be sure which direction the causation goes. Does x cause y, or does y cause x? Another reason is that you can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables. Although this is unlikely in this drugstore example, it is still a possibility.

    (c) Calculate and interpret the correlation between Sales and Promote.

    To calculate a correlation between two variables, you can use Excel’s CORREL function or use the value called Multiple R (with an appropriate sign) from Excel’s Regression Analysis output.

    Alternatively, you can use StatTools to obtain a whole table of correlations between a set of variables.

    The correlation between Sales and Promote is positive – as the upward-sloping scatter of points suggests – and is equal to

    r = 0.673

    This is a moderately large correlation. It confirms the pattern in the scatterplot, namely, that the points increase linearly from left to right but with considerable variation around any particular straight line.

    Reminder: Correlations apply only to linear relationships. If a correlation is close to zero, you cannot automatically conclude that there is no relationship between the two variables. You should look at a scatterplot first. The chances are that the points are a shapeless swarm and that no relationship exists. But it is also possible that the points cluster around some curve.
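
The reminder above can be checked numerically. The sketch below computes the usual (Pearson) correlation, which is what CORREL reports, for two made-up data sets: one lying exactly on a straight line and one lying exactly on a symmetric curve. The function name and the data are assumptions for illustration only.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data.
x = [-3, -2, -1, 0, 1, 2, 3]
y_line = [2 * v + 1 for v in x]   # exact linear relationship
y_curve = [v ** 2 for v in x]     # exact, but curved, relationship

r_line = correlation(x, y_line)
r_curve = correlation(x, y_curve)
print(f"straight line: r = {r_line:.3f}")
print(f"parabola:      r = {r_curve:.3f}")
```

The parabola gives a correlation of zero even though y is completely determined by x, which is why a scatterplot should be examined before concluding that no relationship exists.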

    (d) What is the least squares line for sales as a function of promotional expenses at Pharmex?

    In the StatTools output, the intercept and slope of the least squares line appear under the Coefficient label in cells B18 and B19. They imply that the equation for the least squares line is

    Predicted Sales = 25.1264 + 0.7623 Promote

    (e) Interpret the slope of the regression line.

    The slope, 0.7623, indicates that the sales index tends to increase by about 0.76 for each one-unit increase in the promotional expenses index. Alternatively, if two regions are compared, where the second region spends one unit more than the first region, the predicted sales index for the second region is 0.76 larger than the predicted sales index for the first region.

    (f) Interpret the intercept of the regression line.

    The intercept is literally the predicted sales index for a region that does no promotions. However, no region in the sample has anywhere near a zero promotional value.

    Note: In many applications it makes no sense to have the explanatory variable(s) equal to zero. Then the intercept term has no practical or economic meaning. Therefore, in such situations, where the range of observed values for the explanatory variable does not include zero, it is best to think of the intercept term as simply an “anchor” for the least squares line that enables predictions of y values for the range of observed x values.

    (g) Interpret the coefficient of determination.

    The coefficient of determination r² is the square of the correlation between the observed y values and the fitted ŷ values. In simple linear regression it also equals the square of the correlation r between x and y: aside from rounding, the square of r = 0.673 is 0.453, which is shown as the R² value in the Excel output.

    r² = 0.453

    The explanatory variable Promote is able to explain only 45.3% of the variation in the Sales variable. This is not particularly good. There is still 54.7% of the variation left unexplained.

    Of course, we would like r² to be as close to 1 as possible. Usually, the only way to increase it is to use better and/or more explanatory variables.
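
As a rough check on these definitions, the sketch below computes r² three ways on a small made-up data set: as the fraction of variation explained (1 − SSE/SST, a standard formula not stated in this handout), as the squared correlation between x and y, and as the squared correlation between y and the fitted values. All names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data (NOT the Pharmex data set).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * v for v in x]
y_bar = sum(y) / len(y)

sse = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))  # unexplained variation
sst = sum((obs - y_bar) ** 2 for obs in y)                  # total variation

r2_from_sums = 1 - sse / sst
r2_from_xy = correlation(x, y) ** 2
r2_from_fitted = correlation(y, fitted) ** 2
print(r2_from_sums, r2_from_xy, r2_from_fitted)
```

All three values agree (0.6 for this data set), so 60% of the variation in the made-up y values is explained by x.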

    Note: If the correlation between two variables y and x is ±0.8, the regression of y on x will have an r² of 0.64; that is, the regression with x as the only explanatory variable will explain 64% of the variation in y.

    If the correlation drops to ±0.7, this percentage drops to 49%; if the correlation increases to ±0.9, the percentage increases to 81%. The point is that before a single variable x can explain a large percentage of the variation in some other variable y, the two variables must be highly correlated – in either a positive or negative direction.

    (h) What is the standard error of estimate? What does it measure?

    The standard error of estimate is approximately s_e = 7.39. It indicates the typical magnitude of error when using promotional expenses, via the regression equation, to predict sales. More specifically, if the regression equation is used to predict sales for many regions, about two-thirds of the predictions will be within s_e = 7.39 of the actual sales values, and about 95% of the predictions will be within two standard errors, or 2s_e = 14.78, of the actual sales values.
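
The handout reads s_e from the regression output and does not give a formula for it. As a sketch, assuming the standard definition for simple regression, s_e = sqrt(SSE / (n − 2)), it can be computed as follows. The function names and data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def standard_error_of_estimate(xs, ys):
    """s_e = sqrt(SSE / (n - 2)); n - 2 because two coefficients are estimated."""
    b0, b1 = fit_line(xs, ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical illustration data (NOT the Pharmex data set).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_e = standard_error_of_estimate(x, y)
print(f"s_e = {s_e:.4f}; two standard errors = {2 * s_e:.4f}")
```

With only five points the "two-thirds" and "95%" rules of thumb cannot be checked meaningfully; they describe what to expect over many predictions.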

    (i) Is this level of accuracy good?

    One way to measure the regression equation’s ability to predict is to compare the standard error of estimate, s_e, to the standard deviation of the response variable, s_y. The idea is that s_e is (essentially) the standard deviation of the residuals, whereas s_y is the standard deviation of residuals from a horizontal regression line at height ȳ, the sample mean of the response variable. Therefore, if s_e is small compared to s_y (that is, if s_e / s_y is small), the regression line is evidently doing a good job in explaining the variation of the response variable.

    The standard deviation of the Sales variable is s_y = 9.90. (This is obtained by the usual STDEV function applied to the observed sales values y.)

    It can be interpreted as the standard deviation of the residuals around a horizontal line positioned at the mean value ȳ of Sales. This is the relevant regression line if there are no explanatory variables – that is, if Promote is ignored. In other words, it is a measure of the prediction error if the sample mean ȳ of Sales is used as the prediction for every region and Promote is ignored.

    Unfortunately, the standard error of estimate, s_e = 7.39, is not much less than s_y = 9.90. This means that the Promote variable adds a relatively small amount to prediction accuracy. Predictions with it are not much better than predictions without it. A standard error of estimate well below 9.90 would certainly be preferred.
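
The comparison can be put into numbers with the two values quoted above. The short sketch below computes the ratio s_e / s_y and, as a separate illustration, shows that Python's statistics.stdev is the sample standard deviation (dividing by n − 1), the same quantity Excel's STDEV returns. The small list is made up and is not the Sales column.

```python
import statistics

s_e = 7.39  # standard error of estimate (from the text)
s_y = 9.90  # standard deviation of Sales (from the text)

ratio = s_e / s_y
reduction = 1 - ratio
print(f"s_e / s_y = {ratio:.3f}")
print(f"Reduction in typical prediction error from using Promote: {reduction:.1%}")

# Hypothetical values, only to show how a sample standard deviation is obtained.
example_values = [2, 4, 5, 4, 5]
example_sd = statistics.stdev(example_values)
print(f"sample standard deviation of example values = {example_sd:.4f}")
```

The ratio is roughly 0.75, so using Promote shrinks the typical prediction error by only about a quarter compared with predicting the mean for every region.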

    (j) If the expenditure index for a given region is 95, what would you predict this region’s sales index to be?

    Predicted Sales = 25.1264 + 0.7623 Promote
                    = 25.1264 + 0.7623(95) = 97.5449 ≈ 98

    (k) Find and interpret the residual value for region six.

    Residual = observed Sales – predicted Sales
             = 103 – 97.5449 = 5.4551 ≈ 5

    The sales index for region six was about 5 points higher than we would expect for a region with an expenditure index of 95.
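
The arithmetic in parts (j) and (k) can be reproduced with a short sketch. The coefficients and the region six values (Promote = 95, Sales = 103) come from the text; the function names are made up for the example.

```python
INTERCEPT = 25.1264  # from the StatTools output quoted in part (d)
SLOPE = 0.7623


def predict_sales(promote):
    """Predicted sales index for a given promotional expenditure index."""
    return INTERCEPT + SLOPE * promote


def residual(observed_sales, promote):
    """Residual = observed Sales - predicted Sales."""
    return observed_sales - predict_sales(promote)


predicted = predict_sales(95)
resid = residual(103, 95)
print(f"Predicted Sales = {predicted:.4f}")
print(f"Residual        = {resid:.4f}")
```

A positive residual means the region sold more than the line predicts; a negative residual means it sold less.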

    (l) What does the residual plot show?

    A scatterplot of residuals (on the vertical axis) versus fitted values is a useful graph in almost any regression analysis. You typically examine residual plots for any striking patterns. A good fit not only has small residuals, but it has residuals scattered randomly around zero with no apparent pattern. This appears to be the case for the Pharmex data.
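
The points on a residual plot are simply (fitted value, residual) pairs. The sketch below builds them for a made-up data set and checks two properties that hold for any least squares line with an intercept: the residuals sum to zero and are uncorrelated with the fitted values. The names and data are hypothetical, and no plotting library is assumed.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical illustration data (NOT the Pharmex data set).
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * v for v in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]

# Each (fitted value, residual) pair is one point on the residual plot.
for fit, res in zip(fitted, residuals):
    print(f"fitted = {fit:6.2f}   residual = {res:+.3f}")

residual_sum = sum(residuals)
cross_sum = sum(fit * res for fit, res in zip(fitted, residuals))
print(f"sum of residuals = {residual_sum:.2e}")
print(f"sum of fitted * residual = {cross_sum:.2e}")
```

Because these two sums are zero by construction, what matters in a residual plot is any remaining pattern, such as curvature or a fan shape.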

    Example 2

    Overhead Costs at Bendrix

    The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs. These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses. Some of these overhead costs are fixed in the sense that they do not vary appreciably with the volume of work being done, whereas others are variable and do vary directly with the volume of work. The fixed overhead costs tend to come from the supervision, depreciation, and miscellaneous categories, whereas the variable overhead costs tend to come from the indirect labor, supplies, payroll taxes, and overtime categories. However, it is not easy to draw a clear line between the fixed and variable overhead components.

    The Bendrix manager has tracked total overhead costs for the past 36 months. To help explain these, he has also collected data on two variables that are related to the amount of work done at the factory (see Overhead_Costs.xlsx):

    – MachHrs: number of machine hours used during the month
    – ProdRuns: the number of separate production runs during the month

    The first of these is a direct measure of the amount of work being done. To understand the second, we note that Bendrix manufactures parts in large batches. Each batch corresponds to a production run. Once a production run is completed, the factory must set up for the next production run. During this setup there is typically some downtime while the machinery is reconfigured for the part type scheduled for production in the next batch. Therefore, the manager believes that both of these variables could be responsible (in different ways) for variations in overhead costs.

    (a) Construct scatterplots to examine the relationships between each potential explanatory variable (MachHrs and ProdRuns) and the dependent variable (Overhead).

    Note: Because Overhead, MachHrs, and ProdRuns are time series variables, we should also be on the lookout for any relationships between these variables and the Month variable. That is, we should also investigate any time series behavior in these variables.

    Here are the two scatterplots of interest:

    [Figures: scatterplot of Overhead versus MachHrs; scatterplot of Overhead versus ProdRuns]

    These scatterplots show that Overhead tends to increase as either MachHrs increases or ProdRuns increases. However, both relationships are far from perfect.

    With time series data, as we have in this example, there is always the possibility that time itself is an explanatory variable. So it's a good idea to create one or more time series graphs.

    One of these, the time series graph for Overhead, is shown below.

    [Figure: time series graph of Overhead]

    It indicates a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern. You can check that time series graphs of the MachHrs and ProdRuns variables also indicate no obvious time series patterns.

    Finally, when there are multiple explanatory variables, we should check for relationships among them. The scatterplot of MachHrs versus ProdRuns appears below. (Either variable could be chosen for the vertical axis.)

    [Figure: scatterplot of MachHrs versus ProdRuns]

    This “cloud” of points indicates no relationship worth pursuing.

    In summary, the Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appears to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.

    (b) Calculate and interpret the correlation coefficients between the three pairs of variables.

    The scatterplots for the Bendrix manufacturing data indicate moderately large positive correlations, 0.632 and 0.521, between Overhead and MachHrs and between Overhead and ProdRuns. However, the correlation between MachHrs and ProdRuns, –0.229, is quite small and indicates almost no relationship between these two variables.

    (c) What are the least squares lines for regressing overhead expenses against machine hours and against production runs?

    The two least squares lines are:

    Predicted Overhead = 48621 + 34.7 MachHrs

    Predicted Overhead = 75606 + 655.1 ProdRuns

    Clearly, these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component. The first equation implies that the fixed component of overhead is about $48,621. Bendrix can expect to incur this amount even if zero machine hours are used. The variable component is the 34.7 MachHrs term. It implies that the expected overhead increases by about $35 for each extra machine hour.

    The second equation, on the other hand, breaks overhead down into a fixed component of $75,606 and a variable component of about $655 per production run.

    The difference between these two equations can be attributed to the fact that neither tells the whole story. If the manager's goal is to split overhead into a fixed component and a variable component, the variable component should include both of the measures of work activity (and maybe others) to give a more complete explanation of overhead. We will explain how to do this when this example is reanalyzed with multiple regression.
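
To see how differently the two equations split overhead, they can be evaluated side by side. The coefficients below are the ones given in part (c); the function names and the month's activity levels (1,500 machine hours, 40 production runs) are made up for illustration and are not taken from Overhead_Costs.xlsx.

```python
def overhead_from_machine_hours(mach_hrs):
    """Predicted Overhead = 48621 + 34.7 MachHrs (fixed + variable component)."""
    return 48621 + 34.7 * mach_hrs


def overhead_from_production_runs(prod_runs):
    """Predicted Overhead = 75606 + 655.1 ProdRuns (fixed + variable component)."""
    return 75606 + 655.1 * prod_runs


# Hypothetical month, chosen only to illustrate the calculation.
mach_hrs = 1500
prod_runs = 40

pred_mach = overhead_from_machine_hours(mach_hrs)
pred_runs = overhead_from_production_runs(prod_runs)

print(f"From MachHrs:  fixed $48,621 + variable ${34.7 * mach_hrs:,.0f} = ${pred_mach:,.0f}")
print(f"From ProdRuns: fixed $75,606 + variable ${655.1 * prod_runs:,.0f} = ${pred_runs:,.0f}")
```

Each equation attributes all variable overhead to a single activity measure, which is why their fixed components differ by so much.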

    (d) Use the standard error of estimate to judge which of the two potential regression equations is more useful for predicting Overhead.

    In general, the standard error of estimate indicates the level of accuracy of predictions made from the regression equation. The smaller it is, the more accurate predictions tend to be.

    We estimated two regression lines, one using MachHrs and one using ProdRuns. Their standard errors are approximately $8,585 and $9,457. These imply that MachHrs is a slightly better predictor of overhead. The predictions based on MachHrs will tend to be slightly more accurate than those based on ProdRuns. Of course, predictions based on both predictors should be even more accurate, as you will see when we discuss multiple regression for this example.

    (e) Interpret the coefficients of determination for the two regression lines.

    The r² values using MachHrs and ProdRuns as single explanatory variables are 39.9% and 27.1%, respectively. These provide one more piece of evidence that MachHrs is a slightly better predictor of Overhead than ProdRuns. Of course, they also suggest that the percentage of variation of Overhead explained could be increased by including both variables in a single equation.