06 Simple Linear Regression – Part 1

Upload: rama-dulce

Post on 06-Jul-2018


  • 8/17/2019 06 Simple Linear Regression Part1


    SIMPLE LINEAR REGRESSION – PART 1

    Topics Outline

    • Explanatory and Response Variables

    • Interpreting Scatterplots

    • Correlation

    • The Least Squares Regression Line

    Explanatory and Response Variables

    Regression analysis provides us with a regression equation describing the nature of the relationship between two (or more) variables. In addition, regression analysis supplies variance measures which allow us to assess the accuracy with which the regression equation can predict values of the response variable.

    Example 1 (Car plant electricity usage)

    The manager of a car plant wishes to investigate how the plant’s electricity usage depends upon the plant’s production, based on the data for each month of the previous year:

    Month       Production x ($ million)   Electricity usage y (million kWh)
    January     4.51                       2.48
    February    3.58                       2.26
    March       4.31                       2.47
    April       5.06                       2.77
    May         5.64                       2.99
    June        4.99                       3.05
    July        5.29                       3.18
    August      5.83                       3.46
    September   4.70                       3.03
    October     5.61                       3.26
    November    4.90                       2.67
    December    4.20                       2.53

    Questions:
    1. How are these two data sets related?
    2. Given an observation for the variable x, can we predict the value of the variable y?

    y is called the response (dependent, target, criterion) variable. The response variable measures an outcome of a study.

    x is called the explanatory (independent, predictor, regressor) variable. The explanatory variable explains or influences changes in the response variable.

    Regression can be simple or multiple. In simple regression, we have one explanatory variable x.

    In the case of multiple regression, we work with several explanatory variables x_1, x_2, …, x_n.


    Interpreting Scatterplots

    The easiest way to see how two numerical variables are related is to consider their scatterplot. Typically the explanatory variable is plotted on the x axis and the response variable is plotted on the y axis. Below is the scatter plot of our data:

    Figure 1 Scatter plot of car plant electricity usage

    After plotting two variables on a scatterplot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern and striking deviations from that pattern:

    Form:      linear, curved, clusters, no pattern
    Direction: positive, negative, no direction
    Strength:  how closely the points fit the “form”
    Outlier:   point that falls outside the overall pattern of the relationship

    The form of association in our example is linear. That is, the overall pattern follows a straight line. The direction of the association is positive. That is, high production values tend to accompany high electricity usage values. The association is quite strong: the data points do not lie on a straight line, but they appear to cluster very closely about a straight line. There are no apparent outliers in this example.

    Of course, not all relationships have a simple form and a clear direction that we can describe as positive association or negative association. Sometimes x and y vary independently, and knowing x tells you nothing about y.

    Correlation

    The correlation coefficient measures the direction and strength of the linear relationship between two numerical variables. It is calculated using the mean and the standard deviation of both the x and y variables:

    r = (1 / (n − 1)) · Σ_{i=1}^{n} [ (x_i − x̄) / s_x ] · [ (y_i − ȳ) / s_y ]
    That is, the correlation is an average of the products of the standardized values of each pair (x, y) in the data set.
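As a check, the correlation for Example 1 can be computed directly from this formula. A minimal Python sketch (plain lists, no libraries):

```python
# Example 1 data: monthly production ($ million) and electricity usage (million kWh)
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample standard deviations (denominator n - 1)
s_x = (sum((xi - x_bar) ** 2 for xi in x) / (n - 1)) ** 0.5
s_y = (sum((yi - y_bar) ** 2 for yi in y) / (n - 1)) ** 0.5

# Correlation: average (over n - 1) of the products of standardized values
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)

print(round(r, 4))  # 0.8956, the value used later in these notes
```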



    Facts about Correlation

    1. Correlation can only be used to describe numerical variables. Categorical variables do not have means and standard deviations.

    2. The value of the correlation coefficient r does not change if the explanatory and response variables are switched.

    3. Since r uses the standardized values of the observations, r does not change when we change the units of measurement of x, y, or both.

    4. Positive r indicates positive association between the variables, and negative r indicates negative association.

    5. The correlation r is always a number between −1 and 1. It is equal to −1 when the data points lie on a straight line with a downward slope, and r is equal to +1 when the data points lie on a straight line with an upward slope. Values of r close to −1 or 1 indicate that the points in a scatterplot lie close to a straight line. A value of r near 0 indicates at most a weak linear relationship between the data points.

    6. Correlation measures the strength of only the linear relationship between two variables. A correlation of r = 0 means that there is no linear relationship between the data points, although there might be a strong nonlinear relationship.

    7. Correlation is not a resistant measure: like the mean and standard deviation, it is strongly influenced by a few outlying observations.
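Facts 2 and 3 are easy to verify numerically. A sketch using the Example 1 data (the helper `corr` below is written here for illustration, not taken from a library):

```python
def corr(u, v):
    """Sample correlation coefficient of two equal-length lists of numbers."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    s_u = (sum((a - u_bar) ** 2 for a in u) / (n - 1)) ** 0.5
    s_v = (sum((b - v_bar) ** 2 for b in v) / (n - 1)) ** 0.5
    return sum((a - u_bar) * (b - v_bar)
               for a, b in zip(u, v)) / ((n - 1) * s_u * s_v)

x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

r = corr(x, y)
r_swapped = corr(y, x)                         # fact 2: roles of x and y switched
r_rescaled = corr([xi * 1000 for xi in x], y)  # fact 3: production in $ thousand
# All three agree, up to floating-point rounding.
```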

    Example 2

    The scatterplots in the figure to the right illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships.

    In general, it is not so easy to guess the value of r from the appearance of a scatterplot. Remember that changing the plotting scales in a scatterplot may mislead our eyes, but it does not change the correlation.


    The Least-Squares Regression Line

    In simple linear regression we suppose that there is an underlying linear relationship between the explanatory variable x and the response variable y.

    Example 1 (Continued)

    Consider the scatter plot of our data:

    Clearly, the data points do not lie on a straight line, but they appear to cluster about a straight line, which suggests a linear relationship between x and y. We want to fit a straight line to the data points.

    However, there are an infinite number of possible lines y = a + bx, differing in slope b and/or y-intercept a, that could be drawn through the cluster of our data points.

    The linear least squares fitting technique is the simplest and most commonly applied form of linear regression. The linear least-squares regression line (also known as the fitted, estimated, or predicted line) is the line

    ŷ = a + bx

    that makes the sum of the squares of the vertical distances of the data points (x_i, y_i) from the line as small as possible. It can be shown that the values of a and b that minimize the sum of the squared vertical distances are given by

    b = r · (s_y / s_x)

    a = ȳ − b · x̄

    where

    x̄ and ȳ are the means of variables x and y;

    s_x and s_y are the standard deviations of variables x and y;

    r is the correlation coefficient for variables x and y.
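These two formulas can be sketched in Python for the Example 1 data (plain lists, no libraries):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_x = (sum((xi - x_bar) ** 2 for xi in x) / (n - 1)) ** 0.5
s_y = (sum((yi - y_bar) ** 2 for yi in y) / (n - 1)) ** 0.5
r = sum((xi - x_bar) * (yi - y_bar)
        for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x       # slope
a = y_bar - b * x_bar   # intercept
print(round(b, 3), round(a, 3))  # 0.499 0.409
```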



    Example 1 (Continued)

    For our example,

    x̄ = 4.885      ȳ = 2.846

    s_x = 0.6655   s_y = 0.3707

    r = 0.8956

    The slope is therefore

    b = r · (s_y / s_x) = 0.8956 · (0.3707 / 0.6655) = (0.8956)(0.5570) = 0.49883 ≈ 0.499

    and the intercept is

    a = ȳ − b · x̄ = 2.846 − (0.49883)(4.885) = 0.409

    The least squares regression line is thus

    ŷ = 0.409 + 0.499x

    which is shown together with the data points in Figure 2.

    Figure 2 Fitted regression line for car plant electricity usage

    Notes:

    1. The regression line does not pass through even one of the original points, and yet it is the straight line that best approximates them.

    2. The regression line passes through the point

    (x̄, ȳ) = (4.885, 2.846)

    and it is not a coincidence. The regression line always passes through the point (x̄, ȳ). Why?
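One way to see why: by definition a = ȳ − b·x̄, so substituting x = x̄ into ŷ = a + bx gives ŷ = (ȳ − b·x̄) + b·x̄ = ȳ, whatever the slope is. A quick numeric check (means taken from Example 1):

```python
x_bar, y_bar = 4.885, 2.846   # means from Example 1

# The cancellation works for ANY slope, not just the least-squares one
for slope in [0.499, -3.0, 42.0]:
    intercept = y_bar - slope * x_bar       # a = y-bar - b * x-bar
    y_hat_at_mean = intercept + slope * x_bar
    assert abs(y_hat_at_mean - y_bar) < 1e-9
```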

    3. If we reverse the roles of x and y, we get a different least-squares regression line (see Figure 3).



    Figure 3 Fitted regression line if x and y are switched

    Interpretation of the slope: Variable “cost”
    How much will y increase/decrease if x increases by 1 unit?

    The slope of 0.499 means that for each increase of $1 million in production, the linear regression model predicts that the electricity usage increases by 0.499 (about half a) million kilowatt-hours.

    Interpretation of the intercept: Fixed “cost”
    What is the value of y if x is equal to 0 units?

    The intercept of 0.409 means that if x = 0 (that is, nothing is produced), the model predicts that the electricity usage is 0.409 million kWh.

    As the above example shows, the interpretation of the intercept in regression analysis does not always make sense in real life. Sometimes you might get a negative value for the intercept even though the variable y is such that it is always positive. The value of the intercept is meaningful in real life only when the explanatory variable x can actually take values close to zero.

    Making Predictions

    The regression line can be used to predict response values (y’s) at one or more values of the explanatory variable x within the range studied. This is called interpolation.

    If a production level of $5.5 million worth of cars is planned for next month, then the plant manager can predict that the electricity usage will be

    ŷ = 0.409 + (0.499)(5.5) = 3.1535

    We must be cautioned, though, against applying this equation for values of x which are beyond those used to develop the equation (that is, below 3.5 and above 6), for the relationship may not be linear for those values of x.

    The use of a regression line for predictions outside the range of the data from which the line was calculated is called extrapolation. Such predictions are often not accurate and should be avoided.
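A small helper can enforce this caution automatically. The function below is a sketch; its name, and the choice of range bounds (the smallest and largest observed production values), are my own, not from the notes:

```python
def predict_usage(production, a=0.409, b=0.499, lo=3.58, hi=5.83):
    """Predict electricity usage (million kWh) from production ($ million).

    Refuses to extrapolate: raises ValueError when `production` lies
    outside the observed production range [lo, hi] from Example 1.
    """
    if not lo <= production <= hi:
        raise ValueError(f"production {production} is outside [{lo}, {hi}]; "
                         "extrapolation is unreliable")
    return a + b * production

y_hat = predict_usage(5.5)
print(round(y_hat, 4))  # 3.1535
```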

    (In Figure 3, electricity usage in million kWh is plotted on the horizontal axis and production in $ million on the vertical axis; the fitted line shown there is y = 0.309 + 1.608x.)


    Residuals

    A residual is the difference between an observed value of y and the value of y predicted by the regression line:

    residual = (observed y) – (predicted y) = y –  ŷ  

    The observed y for the first x = 4.51 in our data set is 2.48.

    The predicted y for x = 4.51 is  ŷ  = 0.409 + 0.4988 x = 0.409 + 0.4988(4.51) = 2.66

    The residual for this observation is

    residual = (observed y) – (predicted y) = y −  ŷ = 2.48 – 2.66 = −0.18

    Thus, the observed electricity usage for the first month lies 0.18 million kWh below the least-squares line on the scatterplot.

    If we repeat this calculation eleven more times, we will get all the residuals:

    Observation  1 2 3 4 5 6 7 8 9 10 11 12

    Residual –0.18 0.07 –0.09 –0.16 –0.23 0.15 0.13 0.14 0.28 0.05 –0.18 0.03

    It can be shown that the mean of the least-squares residuals is always zero.

    The standard deviation of the residuals, denoted by s or s_e, is given by the following equation

    s_e = sqrt( (1 / (n − 2)) · Σ_{i=1}^{n} (residual_i − 0)² ) = sqrt( (1 / (n − 2)) · Σ_{i=1}^{n} (y_i − ŷ_i)² )

    and is referred to as the regression standard error (or standard error of estimate). Note that the squared residuals are averaged by dividing by n − 2 and not by the usual n − 1. The rule is to subtract the number of parameters being estimated from the sample size n to obtain the denominator. Here there are two parameters being estimated: the intercept and the slope.
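The residuals, their (numerically zero) mean, and s_e can be reproduced from the raw data. A sketch in Python; here the slope is computed as Sxy/Sxx, which is algebraically the same as b = r·(s_y/s_x):

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx           # equivalent to r * s_y / s_x
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_resid = sum(residuals) / n   # zero, up to floating-point rounding

# Regression standard error: divide the sum of squared residuals by n - 2
s_e = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
print(round(s_e, 2))  # 0.17
```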

    Since you usually want your forecasts and predictions to be as accurate as possible, you would be glad to find a small value for s_e. We judge the value of s_e by comparing it to the values of the response variable y, or more specifically to the sample mean ȳ. Because in our example,

    ȳ = 2.85 million kWh and s_e = 0.17 million kWh,

    it does appear that the standard error of estimate is small. This tells you that, for a typical month, the actual electricity usage was different from the predicted electricity usage (on the least squares line) by about 0.17 million kWh.

    If the residuals are approximately normally distributed, the 68% – 95% – 99.7% empirical rule for standard deviations can be applied to the standard error of estimate. For example, approximately 68% (or about two-thirds) of the residuals are typically within one standard error of their mean (which is zero). Stated another way, about 68% (or two-thirds) of the observed y values are typically within a distance s_e either above or below the regression line. Similarly, about 95% of the observed y values are typically within 2s_e of the corresponding fitted ŷ values, and so forth.
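With only 12 residuals the rule cannot hold exactly, but it comes close. Counting from the residual table above:

```python
# Residuals from the table above (rounded to two decimals)
residuals = [-0.18, 0.07, -0.09, -0.16, -0.23, 0.15,
             0.13, 0.14, 0.28, 0.05, -0.18, 0.03]
n = len(residuals)
s_e = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5   # about 0.17

# How many residuals fall within one standard error of their mean (zero)?
within_one = sum(1 for e in residuals if abs(e) <= s_e)
print(within_one, "of", n)  # 8 of 12 -- about two-thirds, as the rule suggests
```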

    A residual plot is a scatterplot of the residuals against the explanatory variable x or the predicted values ŷ. The horizontal line at zero residual corresponds to the fitted regression line. Ideally, the plot of the residuals should show truly random fluctuations around the zero residual line.


    Coefficient of Determination r²

    The total variation in the values of y can be decomposed into two parts: explained and unexplained variation:

    Figure 4 Decomposition of total variation

    The coefficient of determination r² is the square of the correlation coefficient r. It measures the fraction of the variation in the values of y that can be explained by y’s linear dependence on x in the regression model. The idea is that when there is a linear relationship, some of the variation in y is accounted for by the fact that as x changes it pulls y with it along the regression line.

    This coefficient always lies between 0 and 1. A value of r² near 1 indicates that changes in x explain almost 100% of the variation in y, and therefore the regression equation is extremely useful for making predictions. A value of r² near 0 indicates that the amount of unexplained variation in the regression model is big in relation to the explained variation. In this case, we should be cautious when using the regression equation for predictions.

    For our data, r² = (0.8956)² = 0.802. Evidently, the regression equation obtained in this example is quite useful for predicting the electricity usage, because about 80% of the variability in the electricity usage can be explained by changes in the production levels.
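The same number can be reached through the decomposition in Figure 4: r² = 1 − (unexplained variation)/(total variation). A sketch:

```python
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx
a = y_bar - b * x_bar

sst = sum((yi - y_bar) ** 2 for yi in y)                      # total variation
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # unexplained part
r2 = 1 - sse / sst
print(round(r2, 3))  # 0.802
```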

    Outliers and Influential Observations

    Recall that an outlier is an observation that lies outside the overall pattern of the other observations. An observation is called influential if removing it would significantly change the equation of the regression line. Points that are outliers in the x direction are often (but not always) influential.

    Caution about Correlation and Regression

    Association does not imply causation!

    The observation that two variables tend to vary simultaneously in the same direction does not imply a direct relationship between them. It would not be surprising, for example, to obtain a high positive correlation between the annual sales of chewing gum and the incidence of crime in cities of various sizes within the United States, but one cannot conclude that crime might be reduced by prohibiting the sale of chewing gum. Both variables depend upon the size of the population, and it is this mutual relationship with a third variable (population size) which produces the positive correlation. This third variable, called a lurking variable, is often overlooked when mistaken claims are made about x causing y.