07 simple linear regression part2

8/17/2019 07 Simple Linear Regression Part2


SIMPLE LINEAR REGRESSION – PART 2

Topics Outline

• Scatterplots and Correlation

• The Least Squares Regression Line

Example 1

Sales versus Promotions at Pharmex

Pharmex is a chain of drugstores that operate around the country. To see how effective its advertising and other promotional activities are, the company has collected data from 50 randomly selected metropolitan regions. In each region it has compared its own promotional expenditures and sales to those of the leading competitor in the region over the past year. The data are listed in the file Drugstore_Sales.xlsx.

There are two variables:
– Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor
– Sales: Pharmex’s sales as a percentage of those of the leading competitor

Note that each of these variables is an index, not a dollar amount. For example, if Promote equals 95 for some region, this tells us only that Pharmex’s promotional expenditures in that region are 95% as large as those for the leading competitor in that region.

The company expects that there is a positive relationship between these two variables, so that regions with relatively larger expenditures have relatively larger sales. However, it is not clear what the nature of this relationship is.

Using StatTools
===============

Define a StatTools data set: Click anywhere within the data set → Data Set Manager → OK

Get a scatterplot: Summary Graphs → Scatterplot → Choose X and Y → OK

Get a regression output: Regression and Classification → Regression → Fill in the resulting dialog box as shown to the right → OK



Using Excel
===========

Getting a regression output in Excel: Data → Data Analysis → Regression → OK → Input Y Range, Input X Range → Check Labels, Line Fit Plots → OK

Formatting the scatterplot: Delete unwanted information (right-click, delete) → Right-click, Add Trendline → Check Display Equation on chart and Display R-squared value on chart → Close

Formatting the axes: Right-click → Format axis → Enter desired values → Close

Note 1: If the Analysis ToolPak is not installed in Excel, follow these steps: Click the arrow in the upper left corner → More Commands → Add-Ins → Analysis ToolPak → Go → Analysis ToolPak → OK

Note 2: The slope and intercept of the least squares line can also be calculated directly in Excel using the functions SLOPE and INTERCEPT.
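For readers who want to see what such functions compute, here is a minimal Python sketch of the usual least squares formulas. The data are a small hypothetical set made up for illustration (they are not the Pharmex data), and the function names `slope` and `intercept` are simply chosen to mirror the Excel functions, with the y values passed first as in Excel.

```python
# Hypothetical toy data for illustration only (NOT the Pharmex data).
x = [90.0, 95.0, 100.0, 105.0, 110.0]   # explanatory variable
y = [93.0, 99.0, 100.0, 107.0, 108.0]   # response variable


def mean(values):
    return sum(values) / len(values)


def slope(ys, xs):
    """Least squares slope: sum of cross-deviations over sum of squared x-deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    return sxy / sxx


def intercept(ys, xs):
    """Least squares intercept: the fitted line passes through (x_bar, y_bar)."""
    return mean(ys) - slope(ys, xs) * mean(xs)


b = slope(y, x)
a = intercept(y, x)
print(f"Predicted y = {a:.4f} + {b:.4f} x")
```

With these toy numbers the fitted line has slope 0.76 and intercept 25.4.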



(a) What type of relationship, if any, is apparent from a scatterplot?

This scatterplot indicates that there is a positive relationship between Promote and Sales (the points tend to rise from bottom left to top right), but the relationship is not perfect. If it were perfect, a given value of Promote would prescribe the value of Sales exactly. Clearly, this is not the case. For example, there are five regions with promotional values of 96 but all of them have different sales values. So the scatterplot indicates that while the variable Promote is helpful for predicting Sales, it does not lead to perfect predictions.

(b) Can the drugstore manager conclude that larger promotional expenses cause larger sales values?

No. Unless the data are obtained in a carefully controlled experiment (which is certainly not the case here), you can never be absolutely sure about causation. One reason is that you can’t always be sure which direction the causation goes. Does x cause y, or does y cause x? Another reason is that you can almost never rule out the possibility that some other variable is causing the variation in both of the observed variables. Although this is unlikely in this drugstore example, it is still a possibility.

(c) Calculate and interpret the correlation between Sales and Promote.

To calculate a correlation between two variables, you can use Excel’s CORREL function or use the value called Multiple R (with an appropriate sign) from Excel’s Regression Analysis output.

Alternatively, you can use StatTools to obtain a whole table of correlations between a set of variables.

The correlation between Sales and Promote is positive, as the upward-sloping scatter of points suggests, and is equal to

r = 0.673



This is a moderately large correlation. It confirms the pattern in the scatterplot, namely, that the points increase linearly from left to right but with considerable variation around any particular straight line.

Reminder: Correlations apply only to linear relationships. If a correlation is close to zero, you cannot automatically conclude that there is no relationship between the two variables. You should look at a scatterplot first. The chances are that the points are a shapeless swarm and that no relationship exists. But it is also possible that the points cluster around some curve.
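The reminder about curved relationships can be illustrated with a small sketch. The data below are hypothetical: y is exactly x squared, so the two variables are perfectly related, yet the correlation is zero because the relationship is not linear. The helper `correlation` is written out by hand here and is only an illustration of the standard formula.

```python
import math


def correlation(xs, ys):
    """Pearson correlation: cross-deviations scaled by both spreads."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    syy = sum((yi - y_bar) ** 2 for yi in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfect, but curved (quadratic), relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print(f"r = {r:.3f}")  # the points lie exactly on a parabola, yet r is zero
```

A scatterplot of these points would show the parabola immediately, which is why the plot should be checked before concluding that a near-zero correlation means no relationship.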

(d) What is the least squares line for sales as a function of promotional expenses at Pharmex?

In the StatTools output, the intercept and slope of the least squares line appear under the Coefficient label in cells B18 and B19. They imply that the equation for the least squares line is

Predicted Sales = 25.1264 + 0.7623 Promote

(e) Interpret the slope of the regression line.

The slope, 0.7623, indicates that the sales index tends to increase by about 0.76 for each one-unit increase in the promotional expenses index. Alternatively, if two regions are compared, where the second region spends one unit more than the first region, the predicted sales index for the second region is 0.76 larger than the predicted sales index for the first region.

(f) Interpret the intercept of the regression line.

The intercept is literally the predicted sales index for a region that does no promotions. However, no region in the sample has anywhere near a zero promotional value.

Note: In many applications it makes no sense to have the explanatory variable(s) equal to zero. Then the intercept term has no practical or economic meaning. Therefore, in such situations, where the range of observed values for the explanatory variable does not include zero, it is best to think of the intercept term as simply an “anchor” for the least squares line that enables predictions of y values for the range of observed x values.

(g) Interpret the coefficient of determination.

The coefficient of determination $r^2$ is the square of the correlation between the observed y values and the fitted ŷ values. Aside from rounding, the square of r = 0.673 is 0.453, which is shown as the $R^2$ value in the Excel output.

$r^2 = 0.453$

The explanatory variable Promote is able to explain only 45.3% of the variation in the Sales variable. This is not particularly good. There is still 54.7% of the variation left unexplained.

Of course, we would like $r^2$ to be as close to 1 as possible. Usually, the only way to increase it is to use better and/or more explanatory variables.



Note: If the correlation between two variables y and x is ±0.8, the regression of y on x will have an $r^2$ of 0.64; that is, the regression with x as the only explanatory variable will explain 64% of the variation in y.

If the correlation drops to ±0.7, this percentage drops to 49%; if the correlation increases to ±0.9, the percentage increases to 81%. The point is that before a single variable x can explain a large percentage of the variation in some other variable y, the two variables must be highly correlated, in either a positive or negative direction.
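The arithmetic in this note is just squaring the correlation. A short sketch, using a hypothetical helper named `percent_explained`, reproduces the percentages quoted above and the 45.3% from the Pharmex example:

```python
def percent_explained(r):
    """Percentage of variation in y explained by x in a simple regression: 100 * r^2."""
    return round(100 * r * r, 1)


# The sign of r does not matter, because r is squared.
for r in (0.7, -0.7, 0.8, -0.8, 0.9, -0.9, 0.673):
    print(f"r = {r:+.3f}  ->  {percent_explained(r)}% of variation explained")
```

Note how slowly the percentage grows: even a correlation of 0.7 leaves about half of the variation unexplained.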

(h) What is the standard error of estimate? What does it measure?

The standard error of estimate is approximately $s_e$ = 7.39. It indicates the typical magnitude of error when using promotional expenses, via the regression equation, to predict sales. More specifically, if the regression equation is used to predict sales for many regions, about two-thirds of the predictions will be within $s_e$ = 7.39 of the actual sales values, and about 95% of the predictions will be within two standard errors, or $2s_e$ = 14.78, of the actual sales values.
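As a rough sketch of how a standard error of estimate is obtained, the function below uses the usual textbook definition for simple regression, the square root of the sum of squared residuals divided by n − 2. That this is exactly what the handout's software computes is an assumption, and the residuals shown are hypothetical values, not the Pharmex residuals.

```python
import math


def standard_error_of_estimate(residuals, n_coefficients=2):
    """s_e = sqrt(SSE / (n - k)); k = 2 (intercept and slope) for a straight line."""
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(residuals) - n_coefficients))


# Hypothetical residuals for illustration only (NOT from the Pharmex data).
toy_residuals = [-0.8, 1.4, -1.4, 1.8, -1.0]
toy_se = standard_error_of_estimate(toy_residuals)
print(f"toy s_e = {toy_se:.3f}")

# Rule-of-thumb bands using the value reported in the text.
s_e = 7.39
one_se_band = s_e        # about two-thirds of predictions fall within this
two_se_band = 2 * s_e    # about 95% of predictions fall within this
print(f"within {one_se_band:.2f} (about 2/3), within {two_se_band:.2f} (about 95%)")
```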

(i) Is this level of accuracy good?

One way to measure the regression equation’s ability to predict is to compare the standard error of estimate, $s_e$, to the standard deviation of the response variable, $s_y$. The idea is that $s_e$ is (essentially) the standard deviation of the residuals, whereas $s_y$ is the standard deviation of residuals from a horizontal regression line at height $\bar{y}$, the sample mean of the response variable. Therefore, if $s_e$ is small compared to $s_y$ (that is, if $s_e / s_y$ is small), the regression line is evidently doing a good job in explaining the variation of the response variable.

The standard deviation of the Sales variable is $s_y$ = 9.90. (This is obtained by the usual STDEV function applied to the observed sales values y.)

It can be interpreted as the standard deviation of the residuals around a horizontal line positioned at the mean value $\bar{y}$ of Sales. This is the relevant regression line if there are no explanatory variables, that is, if Promote is ignored. In other words, it is a measure of the prediction error if the sample mean $\bar{y}$ of Sales is used as the prediction for every region and Promote is ignored.

Unfortunately, the standard error of estimate, $s_e$ = 7.39, is not much less than $s_y$ = 9.90.

This means that the Promote variable adds a relatively small amount to prediction accuracy. Predictions with it are not much better than predictions without it. A standard error of estimate well below 9.90 would certainly be preferred.
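The comparison can be made concrete with a short sketch. It uses the values reported above and the sample size of 50 regions. The link to $r^2$ assumes the usual conventions (an n − 2 denominator in the standard error of estimate and an n − 1 denominator in the standard deviation), which is an assumption about how the reported figures were computed.

```python
s_e = 7.39   # standard error of estimate reported in the text
s_y = 9.90   # standard deviation of Sales reported in the text
n = 50       # number of regions in the sample

# Ratio near 1 means the regression barely improves on predicting with the mean.
ratio = s_e / s_y

# Under the assumed conventions: SSE = (n - 2) * s_e^2 and SST = (n - 1) * s_y^2,
# so r^2 = 1 - SSE / SST.
r2_implied = 1 - ((n - 2) * s_e ** 2) / ((n - 1) * s_y ** 2)

print(f"s_e / s_y = {ratio:.3f}")
print(f"implied r^2 = {r2_implied:.3f}")
```

The implied value is close to the reported 0.453 (the small gap comes from rounding in the inputs), so the two ways of judging the fit tell the same story.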



(j) If the expenditure index for a given region is 95, what would you predict this region’s sales index to be?

Predicted Sales = 25.1264 + 0.7623 Promote = 25.1264 + 0.7623(95) = 97.5449 ≈ 98

(k) Find and interpret the residual value for region six.

Residual = observed Sales - predicted Sales = 103 - 97.5449 = 5.4551 ≈ 5

The sales index for region six was about 5 points higher than we would expect for a region with an expenditure index of 95.
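The prediction and residual calculations in parts (j) and (k) can be checked with a few lines of Python. The coefficients and the observed value 103 are taken from the text; the function name `predict_sales` is just an illustrative choice.

```python
INTERCEPT = 25.1264   # from the regression output quoted in the text
SLOPE = 0.7623


def predict_sales(promote):
    """Predicted sales index for a given promotional expenditure index."""
    return INTERCEPT + SLOPE * promote


predicted = predict_sales(95)
observed = 103                      # region six, as given in the text
residual = observed - predicted     # positive: sales were higher than predicted

print(f"predicted = {predicted:.4f}, residual = {residual:.4f}")
```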

(l) What does the residual plot show?

A scatterplot of residuals (on the vertical axis) versus fitted values is a useful graph in almost any regression analysis. You typically examine residual plots for any striking patterns. A good fit not only has small residuals, but it has residuals scattered randomly around zero with no apparent pattern. This appears to be the case for the Pharmex data.



Example 2

Overhead Costs at Bendrix

The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs. These overhead costs include supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses. Some of these overhead costs are fixed in the sense that they do not vary appreciably with the volume of work being done, whereas others are variable and do vary directly with the volume of work. The fixed overhead costs tend to come from the supervision, depreciation, and miscellaneous categories, whereas the variable overhead costs tend to come from the indirect labor, supplies, payroll taxes, and overtime categories. However, it is not easy to draw a clear line between the fixed and variable overhead components.

The Bendrix manager has tracked total overhead costs for the past 36 months. To help explain these, he has also collected data on two variables that are related to the amount of work done at the factory (see Overhead_Costs.xlsx):

– MachHrs: number of machine hours used during the month
– ProdRuns: the number of separate production runs during the month

The first of these is a direct measure of the amount of work being done. To understand the second, we note that Bendrix manufactures parts in large batches. Each batch corresponds to a production run. Once a production run is completed, the factory must set up for the next production run. During this setup there is typically some downtime while the machinery is reconfigured for the part type scheduled for production in the next batch. Therefore, the manager believes that both of these variables could be responsible (in different ways) for variations in overhead costs.

(a) Construct scatterplots to examine the relationships between each potential explanatory variable (MachHrs and ProdRuns) and the dependent variable (Overhead).

Note: Because Overhead, MachHrs, and ProdRuns are time series variables, we should also be on the lookout for any relationships between these variables and the Month variable. That is, we should also investigate any time series behavior in these variables.

Here are the two scatterplots of interest:



These scatterplots show that Overhead tends to increase as either MachHrs increases or ProdRuns increases. However, both relationships are far from perfect.

With time series data, as we have in this example, there is always the possibility that time itself is an explanatory variable. So it's a good idea to create one or more time series graphs.

One of these, the time series graph for Overhead, is shown below.

It indicates a fairly random pattern through time, with no apparent upward trend or other obvious time series pattern. You can check that time series graphs of the MachHrs and ProdRuns variables also indicate no obvious time series patterns.

Finally, when there are multiple explanatory variables, we should check for relationships among them. The scatterplot of MachHrs versus ProdRuns appears below. (Either variable could be chosen for the vertical axis.)

This “cloud” of points indicates no relationship worth pursuing.

In summary, the Bendrix manager should continue to explore the positive relationship between Overhead and each of the MachHrs and ProdRuns variables. However, none of the variables appears to have any time series behavior, and the two potential explanatory variables do not appear to be related to each other.



(b) Calculate and interpret the correlation coefficients between the three pairs of variables.

The Bendrix manufacturing data show moderately large positive correlations: 0.632 between Overhead and MachHrs, and 0.521 between Overhead and ProdRuns. However, the correlation between MachHrs and ProdRuns, -0.229, is quite small and indicates almost no relationship between these two variables.

(c) What are the least squares lines for regressing overhead expenses against machine hours and against production runs?

The two least squares lines are:

Predicted Overhead = 48621 + 34.7 MachHrs

Predicted Overhead = 75606 + 655.1 ProdRuns

Clearly, these two equations are quite different, although each effectively breaks Overhead into a fixed component and a variable component. The first equation implies that the fixed component of overhead is about $48,621. Bendrix can expect to incur this amount even if zero machine hours are used. The variable component is the 34.7 MachHrs term. It implies that the expected overhead increases by about $35 for each extra machine hour.

The second equation, on the other hand, breaks overhead down into a fixed component of $75,606 and a variable component of about $655 per production run.

The difference between these two equations can be attributed to the fact that neither tells the whole story. If the manager's goal is to split overhead into a fixed component and a variable component, the variable component should include both of the measures of work activity (and maybe others) to give a more complete explanation of overhead. We will explain how to do this when this example is reanalyzed with multiple regression.
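To see the fixed and variable components side by side, the two fitted equations can be written as small functions. The coefficients come from the text; the month with 1,500 machine hours and 40 production runs is a hypothetical input chosen only for illustration, not a row from Overhead_Costs.xlsx.

```python
def overhead_from_machhrs(mach_hrs):
    """Predicted overhead: fixed component 48,621 plus 34.7 per machine hour."""
    return 48621 + 34.7 * mach_hrs


def overhead_from_prodruns(prod_runs):
    """Predicted overhead: fixed component 75,606 plus 655.1 per production run."""
    return 75606 + 655.1 * prod_runs


# Hypothetical month, for illustration only.
mach_hrs, prod_runs = 1500, 40

pred_a = overhead_from_machhrs(mach_hrs)
pred_b = overhead_from_prodruns(prod_runs)
print(f"using MachHrs:  ${pred_a:,.0f}")
print(f"using ProdRuns: ${pred_b:,.0f}")
```

The two equations give different "fixed" amounts because each one absorbs the effect of the omitted variable into its intercept.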

(d) Use the standard error of estimate to judge which of the two potential regression equations is more useful for predicting Overhead.

In general, the standard error of estimate indicates the level of accuracy of predictions made from the regression equation. The smaller it is, the more accurate predictions tend to be.

We estimated two regression lines, one using MachHrs and one using ProdRuns. Their standard errors are approximately $8,585 and $9,457. These imply that MachHrs is a slightly better predictor of overhead. The predictions based on MachHrs will tend to be slightly more accurate than those based on ProdRuns. Of course, predictions based on both predictors should be even more accurate, as you will see when we discuss multiple regression for this example.

(e) Interpret the coefficients of determination for the two regression lines.

The $r^2$ values using MachHrs and ProdRuns as single explanatory variables are 39.9% and 27.1%, respectively. These provide one more piece of evidence that MachHrs is a slightly better predictor of Overhead than ProdRuns. Of course, they also suggest that the percentage of variation of Overhead explained could be increased by including both variables in a single equation.
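As a quick consistency check, these percentages are the squares of the correlations reported in part (b). The sketch below uses only numbers quoted in the text; the dictionary name `correlations` is an illustrative choice.

```python
# Correlations with Overhead, as reported in part (b).
correlations = {"MachHrs": 0.632, "ProdRuns": 0.521}

for name, r in correlations.items():
    explained = round(100 * r ** 2, 1)          # percent of variation explained
    unexplained = round(100 - explained, 1)     # percent left unexplained
    print(f"{name}: r = {r}, explained = {explained}%, unexplained = {unexplained}%")
```

Either variable alone leaves well over half of the variation in Overhead unexplained.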
