Post on 18-Dec-2015
CORRELATION & REGRESSION
Correlation and regression are concerned with the investigation of relationships between two or more variables.
We consider just two associated variables.
We might want to know:
If a relationship exists between those variables
If so, how strong that relationship is
What form that relationship takes
Can we make use of that relationship for predictive purposes i.e. forecasting?
Correlation is used to find the strength of the relationship
Regression describes the relationship itself in the form of an equation which best fits the data
General method for investigating the relationship between 2 variables:
For an initial insight into the relationship between two variables: plot a scatter diagram
If there appears to be a linear relationship, quantify it: calculate the correlation coefficient
This is a measure of the strength of this linear relationship. Its symbol is 'r' and its value lies between -1 and +1
If the relationship is found to be significantly strong: find the equation of the ‘line of best fit’ through the data, using linear regression
The 'goodness of fit' statistic can be calculated to see how useful the regression equation is likely to be
Once defined by an equation, the relationship can be used for predictive purposes.
Example
The data represents a sample of advertising expenditures and sales for ten randomly selected months. See slide 12 for complete data.
Month   Advertising expenditure (£0,000's) x   Sales (£0,000's) y
1       1.2                                    101
2       0.8                                    92
3       1.0                                    110
etc.
Plot a scatter diagram of the data
[Figure: Plot of Sales (£0,000's) against Advertising Expenditure (£0,000's). x-axis: advertising (£0,000's), 0.6 to 1.3; y-axis: sales (£0,000's), 70 to 120.]
The graph suggests a linear relationship between sales and advertising expenditure.
The larger the amount spent on advertising the higher the sales in general.
Note: the scales do not start at zero.
If there is a relationship, we need to be able to measure the strength of that relationship.
i.e. calculate the value of the correlation coefficient
Pearson's Product Moment Correlation Coefficient (r)
is a measure of how close a linear relationship there is between x and y.
can be produced directly from a calculator in LR (linear regression) mode
For the sales and advertising data the correlation coefficient: r = 0.875
The value of r is always between + 1 and -1
[Figures: five example scatter diagrams of y against x, illustrating the following values of r]
r = -1 perfect negative correlation
r = -0.7
r = 0 no correlation
r = +0.8
r = +1 perfect positive correlation
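As a rough illustration of the extreme values, the sketch below computes r for three small made-up data sets (they are not from the slides). The function name `pearson_r` is an assumption chosen for this example; it uses the sums-of-squares formula for r that the slides give.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums Sxx, Syy and Sxy."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]                        # made-up x values

r_pos = pearson_r(xs, [3, 5, 7, 9, 11])     # y = 1 + 2x: every point on a rising line
r_neg = pearson_r(xs, [17, 14, 11, 8, 5])   # y = 20 - 3x: every point on a falling line
r_zero = pearson_r(xs, [1, 4, 5, 4, 1])     # symmetric hump: no linear trend

print(round(r_pos, 3), round(r_neg, 3), round(r_zero, 3))
```

Note that the third data set has a clear (curved) pattern even though r is zero: r only measures how close the points are to a straight line.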
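Formula for correlation coefficient, r
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$
where
$S_{xx} = \sum x^2 - \dfrac{\sum x \sum x}{n}$
$S_{yy} = \sum y^2 - \dfrac{\sum y \sum y}{n}$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n}$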
Longhand calculations for correlation coefficient r.
Step 1
Month    Advertising expenditure   Sales            x²      y²       xy
         £0,000's (x)              £0,000's (y)
1        1.2                       101              1.44    10201    121.2
2        0.8                       92               0.64    8464     73.6
3        1.0                       110              1.00    12100    110.0
4        1.3                       120              1.69    14400    156.0
5        0.7                       90               0.49    8100     63.0
6        0.8                       82               0.64    6724     65.6
7        1.0                       93               1.00    8649     93.0
8        0.6                       75               0.36    5625     45.0
9        0.9                       91               0.81    8281     81.9
10       1.1                       105              1.21    11025    115.5
Totals   9.4                       959              9.28    93569    924.8
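Step 2
Therefore:
$S_{xx} = \sum x^2 - \dfrac{\sum x \sum x}{n} = 9.28 - \dfrac{9.4 \times 9.4}{10} = 0.444$
$S_{yy} = \sum y^2 - \dfrac{\sum y \sum y}{n} = 93569 - \dfrac{959 \times 959}{10} = 1600.9$
$S_{xy} = \sum xy - \dfrac{\sum x \sum y}{n} = 924.8 - \dfrac{9.4 \times 959}{10} = 23.34$
Step 3
Therefore: $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{23.34}{\sqrt{0.444 \times 1600.9}} = 0.875$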
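The three steps of the longhand calculation can be checked with a short Python sketch. The data are the ten months from the table; the variable names are assumptions made for this example.

```python
from math import sqrt

# Advertising expenditure x and sales y (both in £0,000's) for the ten months.
x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]
n = len(x)

# Step 1: column totals.
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Step 2: the sums Sxx, Syy and Sxy.
sxx = sum_x2 - sum_x * sum_x / n
syy = sum_y2 - sum_y * sum_y / n
sxy = sum_xy - sum_x * sum_y / n

# Step 3: the correlation coefficient.
r = sxy / sqrt(sxx * syy)

print(f"Sxx = {sxx:.3f}, Syy = {syy:.1f}, Sxy = {sxy:.2f}, r = {r:.3f}")
```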
Hypothesis test for the value of r
We shall not go into the details here!
Null hypothesis (H0): A linear relationship does not exist between sales and advertising
Alternative hypothesis (H1): A linear relationship does exist between sales and advertising.
If we calculate a test statistic and critical value we discover that test statistic > critical value
so we reject H0
Conclude that a linear relationship exists between sales and amount spent on advertising.
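The slides do not say which test statistic is used. One common choice, assumed here purely for illustration, is the t-test for a correlation coefficient with n - 2 degrees of freedom; the 5% two-tailed significance level is also an assumption.

```python
from math import sqrt

r = 0.875   # correlation coefficient from the slides
n = 10      # number of months in the sample

# Assumed test statistic for H0 (no linear relationship): t with n - 2 df.
t_stat = r * sqrt(n - 2) / sqrt(1 - r * r)

# Two-tailed 5% critical value of Student's t with 8 degrees of freedom,
# read from standard tables (the 5% level is an assumption).
t_crit = 2.306

reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, critical value = {t_crit}, reject H0: {reject_h0}")
```

Under these assumptions the test statistic exceeds the critical value, which agrees with the conclusion stated on the slide.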
The Goodness of Fit Statistic (R²)
This also measures the closeness of the relationship between x and y
R² = 100r²
R² tells us what percentage of the total variation in y (here sales) is explained by the variation in x (here advertising expenditure)
If r = +1 or -1, then R² = 100%
So 100% of the variation in y is explained by the variation in x.
If r = 0, then R² = 0%
So none of the variation in y is explained by the variation in x.
For the data above the goodness of fit statistic R² = 100r² = 100 × 0.875² = 76.6%
Interpretation:
76.6% of the variation in sales is explained by the variation in the amount spent on advertising.
The remaining 23.4% of the variation is explained by other factors:
e.g. price
competitors' prices etc.
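To see why 100r² can be read as "percentage of variation explained", the sketch below compares it with a direct calculation: the share of the total variation in sales that is not left over as scatter around the least-squares line (the line derived in the regression slides). Variable names are assumptions made for this example.

```python
from math import sqrt

x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]   # advertising, £0,000's
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]          # sales, £0,000's
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Deviation forms of Sxx, Syy, Sxy (algebraically equal to the slide formulas).
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)            # total variation in y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares line y = a + b x.
b = sxy / sxx
a = mean_y - b * mean_x

# Variation in y left unexplained by the line (sum of squared residuals).
unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)
r_squared_direct = 100 * (1 - unexplained / syy)
r_squared_from_r = 100 * r * r

print(round(r_squared_direct, 1), round(r_squared_from_r, 1))
```

Both routes give the same percentage, so R² can be obtained from r alone without fitting the line first.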
Regression equation
Since we know, for the sample data, that there is a significant relationship between the two variables,
the next obvious step is to find its equation.
We can then add the regression line to the scatter diagram and use it to predict future sales, given advertising expenditure for a particular month.
The regression equation can be produced directly from a calculator in LR mode.
The regression line has the equation:
y = a + bx
x is the independent variable
y is the dependent variable
a is the intercept on the y-axis
b is the gradient or slope of the line.
For the sales and advertising data, the values of a and b are 46.5 and 52.6. So the regression equation is:
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 advertising
(a and b can be found using LR mode on your calculator or by calculation)
Formula for a and b
The line of best fit is found by calculating the squares of the differences between the actual and expected values of y.
We choose a and b so that the total of these squared differences is minimised:
$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$
where $\bar{x}$ and $\bar{y}$ are the means of the x and y data and the S's are defined as previously.
$(\bar{x}, \bar{y})$ is called the centroid.
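As a minimal check of the least-squares idea, the sketch below computes a and b from the formulas and confirms that nudging either value in any direction makes the total squared difference larger. The helper name `total_squared_difference` and the nudge size are assumptions made for this example.

```python
x = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]   # advertising, £0,000's
y = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]          # sales, £0,000's
n = len(x)

def total_squared_difference(a, b):
    """Sum of squared gaps between actual y and the line's value a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# a and b from the formulas b = Sxy / Sxx and a = mean(y) - b * mean(x).
sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
b_best = sxy / sxx
a_best = sum(y) / n - b_best * sum(x) / n

best = total_squared_difference(a_best, b_best)

# Move a or b up or down by 1 and recompute the total each time.
nudged = [
    total_squared_difference(a_best + da, b_best + db)
    for da, db in [(1, 0), (-1, 0), (0, 1), (0, -1)]
]

print(round(best, 1), [round(v, 1) for v in nudged])
```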
Calculations for the regression equation.
In the regression equation y = a + bx
$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{23.34}{0.444} = 52.6$
$a = \bar{y} - b\bar{x} = 95.9 - 52.6 \times 0.94 = 46.5$
(As $\bar{y} = \dfrac{\sum y}{n} = \dfrac{959}{10} = 95.9$ and $\bar{x} = \dfrac{\sum x}{n} = \dfrac{9.4}{10} = 0.94$)
Therefore the regression equation is
y = 46.5 + 52.6x
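The same arithmetic, starting from the totals already found in the longhand table, can be sketched as follows (variable names are assumptions made for this example):

```python
n = 10
sum_x, sum_y = 9.4, 959      # column totals from the longhand table
sxx, sxy = 0.444, 23.34      # values found in Step 2

b = sxy / sxx                # gradient
mean_x = sum_x / n
mean_y = sum_y / n
a = mean_y - b * mean_x      # intercept

print(f"y = {a:.1f} + {b:.1f}x")
```

Here b is kept unrounded when a is calculated; rounding b to 52.6 first, as on the slide, gives the same a to one decimal place.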
Plotting the regression equation on the scatter diagram.
The line y = a + bx can be plotted on the scatter diagram by plotting three points:
the centroid $(\bar{x}, \bar{y})$ and any other two points which satisfy the regression equation.
From the data $(\bar{x}, \bar{y})$ = (0.94, 95.9)
When x = 0.6, y = 46.5 + (52.6 × 0.6) = 78.06
When x = 1.2, y = 46.5 + (52.6 × 1.2) = 109.6
Plot (0.6, 78.06)
Plot (0.94, 95.9)
Plot (1.2, 109.6)
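The three points can be generated from the regression equation with a few lines of Python. This is only a sketch; the function name `fitted_sales` is an assumption made for this example.

```python
A, B = 46.5, 52.6            # intercept and gradient from the slides

def fitted_sales(advertising):
    """Value of the regression line y = A + B*x at the given x."""
    return A + B * advertising

# Two end points plus the mean advertising spend (x of the centroid).
points = [(x, round(fitted_sales(x), 1)) for x in (0.6, 0.94, 1.2)]
print(points)
```

With the rounded coefficients the line gives 95.944 at x = 0.94, which matches the centroid value 95.9 to one decimal place.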
[Figure: Plot of sales (£0,000's) against advertising expenditure (£0,000's), with the plotted points marked. x-axis: advertising, 0.6 to 1.3; y-axis: sales, 70 to 120.]
Note: the regression equation y = a + bx can only be used to calculate an estimate for y given the value of x
The linear relationship y = a + bx can only be assumed to exist between y and x for the range of values within the sample
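One way to respect this restriction in code is to refuse to predict outside the sampled range. The sketch below is an illustration only: the function name `predict_sales` and the choice to raise an error are assumptions, and the limits 0.6 and 1.3 are the smallest and largest advertising values in the sample.

```python
A, B = 46.5, 52.6            # regression coefficients from the slides
X_MIN, X_MAX = 0.6, 1.3      # range of advertising spend in the sample (£0,000's)

def predict_sales(advertising):
    """Estimate sales (£0,000's) for an advertising spend inside the sample range."""
    if not X_MIN <= advertising <= X_MAX:
        raise ValueError(
            f"x = {advertising} is outside the sampled range {X_MIN} to {X_MAX}"
        )
    return A + B * advertising

print(round(predict_sales(1.0), 1))   # inside the range: an estimate is returned

try:
    predict_sales(5)                  # far outside the range: rejected
except ValueError as err:
    print(err)
```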
Interpreting the coefficients in the regression equation - first the a value
The intercept (a) is the estimate of y when x = 0, but care is needed if using this: why?
y = 46.5 + 52.6x
Sales = 46.5 + 52.6 advertising
When x = 0, y = 46.5
i.e. When nothing is spent on advertising, sales would be expected on average to be 46.5 units = 46.5 × £10,000 = £465,000
the b value
y = 46.5 + 52.6x
If x = 0, y = 46.5, but care is needed here!
If x = 0.6, y = 46.5 + (52.6)(0.6) =
If x = 0.8, y = 46.5 + (52.6)(0.8) =
If x = 1, y = 46.5 + 52.6 =
If x = 1.2, y = 46.5 + (52.6)(1.2) =
If x = 2, y = 46.5 + 52.6 × 2, but care is needed here also!
etc.
So if advertising expenditure is increased by 1 unit, sales will be increased by 52.6 units on average.
For each additional £10,000 spent on advertising, sales will increase by 52.6 × £10,000 = £526,000 on average.
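The meaning of b can be checked numerically: whichever two x values inside the sample range are compared, the change in estimated sales per unit of advertising is the same. A minimal sketch, with variable names assumed for this example:

```python
A, B = 46.5, 52.6            # regression coefficients from the slides
UNIT = 10_000                # one unit of x or y is £10,000

def fitted_sales(advertising):
    return A + B * advertising

# Estimated sales at several advertising levels inside the sample range.
estimates = {x: round(fitted_sales(x), 2) for x in (0.6, 0.8, 1.0, 1.2)}
print(estimates)

# Change in estimated sales per unit change in advertising.
slope = (fitted_sales(1.2) - fitted_sales(0.8)) / (1.2 - 0.8)

# Converted to pounds: extra sales per extra £10,000 of advertising.
extra_sales_pounds = B * UNIT
print(round(slope, 1), round(extra_sales_pounds))
```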
But we cannot estimate sales outside the range:
E.g. we should not try to estimate sales for x = 5 using this method.