chapter4zica4.6
7/28/2019 chapter4ZICA4.6
4.6 Regression and Correlation Analysis
This section introduces regression analysis, a method used to describe the relationship between two variables, and then explains correlation analysis, which measures the strength of that relationship.
This manual uses Spearman's rank correlation coefficient and Pearson's product moment coefficient of correlation as measures of the strength of the relationship between two variables.
Regression analysis is concerned with estimating one variable (the dependent variable) on the basis of one or more other variables (the independent variables). If an analyst, for instance, is trying to predict the share price of a particular sector, there will be a whole range of independent variables to consider. In this manual, we restrict our attention to the particular case where a dependent variable y is related to a single independent variable x.
The Regression Equation
When only one independent variable is used in making a forecast, the technique used is called simple linear regression. The forecasts are made by means of a straight line using the equation

$$y = a + bx$$

where

$a$ = the $y$-intercept (the value of $y$ when $x = 0$)
$b$ = the slope (the amount that $y$ changes with a unit change in $x$)
The linear function is useful because it is mathematically simple and it can be shown to be a reasonably close approximation in many situations.
The first step in establishing whether there is a relationship between the variables is to draw a scatter diagram. This is a plot of the two variables on an x-y graph. Given that we believe there is a relationship between the two variables, the second step is to determine the form of this relationship.
Example 1
Consider the following data from a major appliance store: the daily high temperature and the number of air conditioning units sold for 8 randomly selected business days during the hot dry season.

Daily High Temperature (x) °C   27 35 18 20 46 36 26 23
Number of Units (y)              5  6  2  1  1  4  3  3
Draw a Scatter diagram for the data.
[Figure 1.0: Scatter diagram of number of units (y) against daily high temperature in °C (x).]
The distribution of points in the scatter diagram suggests that a straight line roughly fits these points.

The most straightforward method of fitting a straight line to the set of data points is by eye. The values of a and b can then be determined from the graph: a is the intercept on the vertical axis and b is the slope.

The other method is that of semi-averages. This technique consists of splitting the data into two equal groups, plotting the mean point for each group and joining these points with a straight line.
Example 2
Using data of Example 1, fit a straight line using the method of semi-averages.
The procedure for obtaining the y on x regression line is as follows:
Step 1
Sort the data into size order by x - value.
x   18 20 23 26 27 35 36 46
y    2  1  3  3  5  6  4  1
Step 2
Split the data into two equal groups, a lower half and an upper half (if there is an odd number of items, drop the central one).

            Lower half of data      Upper half of data
            x        y              x        y
            18       2              27       5
            20       1              35       6
            23       3              36       4
            26       3              46       1
Totals      87       9              144      16
Averages    21.75    2.25           36       4

Table 1.0

[Figure 2.0: Method of semi-averages for Example 1, showing the straight line through the two mean points.]
Step 3
Calculate the mean point for each group
Step 4
Plot the mean points from Step 3 on a graph with suitably scaled axes and join them with a straight line. This is the required y on x regression line.
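The four steps above can be sketched in Python. This is only an illustrative sketch: the function name `semi_average_line` and the variable names are assumptions made for this example, and the data are the sorted values of Example 2. Instead of plotting, the last step computes the slope and intercept of the line through the two mean points.

```python
def semi_average_line(xs, ys):
    """Return (a, b) for y = a + b*x fitted by the method of semi-averages."""
    pairs = sorted(zip(xs, ys))       # Step 1: sort the data by x-value
    half = len(pairs) // 2            # Step 2: two equal halves (a central item is dropped if n is odd)
    lower, upper = pairs[:half], pairs[-half:]
    # Step 3: mean point of each half
    x1 = sum(x for x, _ in lower) / half
    y1 = sum(y for _, y in lower) / half
    x2 = sum(x for x, _ in upper) / half
    y2 = sum(y for _, y in upper) / half
    # Step 4: straight line through the two mean points
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b * x1
    return a, b


# Data of Example 2 (hypothetical variable names)
temperature = [18, 20, 23, 26, 27, 35, 36, 46]
units = [2, 1, 3, 3, 5, 6, 4, 1]

a, b = semi_average_line(temperature, units)
print(f"y = {a:.3f} + {b:.3f}x")
```

The line passes through the mean points (21.75, 2.25) and (36, 4) by construction, which is a quick way to check a hand-drawn version.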
Least Square Line
Let us consider a typical data point with coordinates $(x_i, y_i)$ (see Figure 3.0). The error in the forecast (the $y$ coordinate of the data point minus the forecasted $y$ coordinate given by the straight line) is denoted by $e_i$. The line which minimizes the sum of the squared errors $\sum e_i^2$ is called the least squares line or the regression line. This can be shown by using calculus. Here we just give the best estimates of $a$ and $b$ by the following formulae.
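$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}$$

$$a = \bar{y} - b\bar{x}$$

where $n$ is the number of data points.

The values of $a$ and $b$ are then substituted into the equation $y = a + bx$.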
[Figure 3.0: The least squares line with the error term $e_i$.]
Example 3
Fit the least squares line to the data in Example 1.
Table 2 shows the calculations for the estimates of a and b.

x     y     x^2     y^2    xy
18    2     324     4      36
20    1     400     1      20
23    3     529     9      69
26    3     676     9      78
27    5     729     25     135
35    6     1225    36     210
36    4     1296    16     144
46    1     2116    1      46

$\sum x = 231$, $\sum y = 25$, $\sum x^2 = 7295$, $\sum y^2 = 101$, $\sum xy = 738$

$n = 8$, $\bar{y} = 3.125$, $\bar{x} = 28.875$
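$$b = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}} = \frac{738 - \dfrac{(231)(25)}{8}}{7295 - \dfrac{(231)^2}{8}} = \frac{16.125}{624.875} = 0.0258$$

$$a = \bar{y} - b\bar{x} = 3.125 - 0.0258(28.875) = 2.38$$

giving the equation for the regression line of $y = 2.38 + 0.0258x$.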
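The calculation in Example 3 can be checked with a short Python sketch. The function name `least_squares_line` is an assumption made for this illustration; it applies the formulae for a and b given in this section to the data of Table 2.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (sum xy - sum x * sum y / n) / (sum x^2 - (sum x)^2 / n)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # a = mean of y - b * mean of x
    a = sum_y / n - b * sum_x / n
    return a, b


# Data of Table 2 (hypothetical variable names)
temperature = [18, 20, 23, 26, 27, 35, 36, 46]
units = [2, 1, 3, 3, 5, 6, 4, 1]

a, b = least_squares_line(temperature, units)
print(f"y = {a:.2f} + {b:.4f}x")
```

Working with the column totals, as the function does, mirrors the layout of Table 2, so each intermediate sum can be compared with the hand calculation.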
Forecasting Using the Regression Line
Having obtained the regression line, it can be used to forecast the value of y for a given value of x. Suppose that we wish to determine the number of units sold if we have a daily high temperature of 42 °C.

From the regression line, the forecasted value of y is $y = 2.38 + 0.0258(42) = 3.46$, i.e. the expected number of units sold is 3.

Now suppose that we wish to determine the number of units sold if the temperature is 49 °C. The forecasted value of y is then given by $y = 2.38 + 0.0258(49) = 3.64$, i.e. the expected number of units sold is 4.

The two examples differ in that the first y value was forecasted from an x value within the range of x values in the original data set, while the second was forecasted from an x value outside that range. The first example is a case of interpolation and the second is that of extrapolation. With extrapolation, the assumption is that the relationship between the two variables continues to behave in the same way outside the given range of x values from which the least squares line was computed.
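The distinction between interpolation and extrapolation can be made explicit in a small Python sketch. The function name `forecast` and its parameters are assumptions made for this illustration; the coefficients are those of the regression line $y = 2.38 + 0.0258x$, and 18 and 46 are the smallest and largest x values in the original data.

```python
def forecast(a, b, x, x_min, x_max):
    """Forecast y = a + b*x and report whether x lies inside the observed range."""
    y = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


# Regression line of Example 3; observed temperatures range from 18 to 46
for temp in (42, 49):
    y, kind = forecast(2.38, 0.0258, temp, 18, 46)
    print(f"{temp} °C: forecast {y:.2f} units ({kind})")
```

Flagging extrapolated forecasts in this way is a reminder that they rest on the extra assumption that the linear relationship continues outside the observed range.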
Exercise 7
1. For the following data
x 2 5 6 8 10 11 13 16
y 2 3 4 5 6 8 9 10
a) Draw a scatter diagram
b) By eye, fit a straight line to the data (ensuring it passes through the mean value)
c) Fit the equation of the line by the method of semi-averages.
2. The following data have been collected regarding sales and advertising expenditure.

Sales (K m's)   Advertising expenditure (K '000s)
10.5            230
11.2            280
9.9             310
10.6            350
11.4            400
12.1            430
a) Plot the above data on a scatter diagram.
b) Fit the regression line using the method of least squares.
c) Estimate the sales if K530 000 is spent on advertising expenditure.
Note that advertising expenditure is the x variable and sales is the y variable.
3. Fit a least square line to the data in the table below.
x 5 7 8 10 11 13
y 4 5 6 8 7 10
4. The table below shows the final grades in Mathematics and Communication obtained by students selected at random from a large group of students.

a) Graph the data.
b) Fit a least-squares line.
c) If a student receives a grade of 85 in Mathematics, what is her expected grade in Communication?
d) If a student receives a grade of 65 in Communication, what is her expected grade in Mathematics?

Mathematics (x) 80 86 97 70 89 75 99 69 87 78
Communication (y) 75 65 80 65 80 70 79 45 70 80

5. The table below shows the birth rate per 1000 population during 1999-2005.
year 1999 2000 2001 2002 2003 2004 2005
Birth rate per 1000 14.6 14.5 13.8 13.4 13.6 12.8 12.6
a) Graph the data.
b) Find the least squares line fitting the data. Code the years 1999 to 2005 as the whole numbers 1 through 7.
c) Predict the birth rate in 2009, assuming the present trend continues.
4.7 Correlation Analysis
Correlation analysis is used to determine the degree of association between two variables. Having obtained the equation of the regression line, correlation analysis can be used to measure how well one variable is linearly related to another. The coefficient of correlation r can assume any value in the range -1 to +1 inclusive. When r is close to or equal to +1, we have a strong positive correlation; when r is close to or equal to -1, we have a strong negative correlation. The sign of the correlation coefficient is the same as the sign of the slope of the regression line.
The following scatter diagrams illustrate certain values of the correlation coefficient.

[Figure: three scatter diagrams illustrating r = 1, r = 0 and r = -1.]

The method of investigating whether a linear relationship exists between two variables x and y is by calculating Pearson's product moment correlation coefficient (PPMCC), denoted by r and given by the formula
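$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}}$$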
Example 4
By calculating the PPMCC, find the degree of association between weekly earnings and the amount of tax paid for each member of a group of 10 manual workers.
Weekly Wage (K '000) 79 81 87 88 91 92 98 98 103 113
Income Tax (K '000) 10 8 14 14 17 12 18 22 21 24
The PPMCC is calculated in the Table below
x      y     x^2      y^2    xy
79     10    6241     100    790
81     8     6561     64     648
87     14    7569     196    1218
88     14    7744     196    1232
91     17    8281     289    1547
92     12    8464     144    1104
98     18    9604     324    1764
98     22    9604     484    2156
103    21    10609    441    2163
113    24    12769    576    2712

$\sum x = 930$, $\sum y = 160$, $\sum x^2 = 87446$, $\sum y^2 = 2814$, $\sum xy = 15334$
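$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}} = \frac{15334 - \dfrac{(930)(160)}{10}}{\sqrt{\left(87446 - \dfrac{(930)^2}{10}\right)\left(2814 - \dfrac{(160)^2}{10}\right)}} = \frac{454}{\sqrt{(956)(254)}} = 0.921$$

r is near 1 and indicates a strong positive linear correlation between the two variables.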
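The PPMCC calculation can be reproduced with a short Python sketch. The function name `pearson_r` is an assumption made for this illustration; it uses the same column totals as the formula for r, applied to the wage and tax data of Example 4.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from column totals."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


# Data of Example 4 (hypothetical variable names), in K '000
weekly_wage = [79, 81, 87, 88, 91, 92, 98, 98, 103, 113]
income_tax = [10, 8, 14, 14, 17, 12, 18, 22, 21, 24]

r = pearson_r(weekly_wage, income_tax)
print(f"r = {r:.3f}")
```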
Example 5
Evaluate the PPMCC for the following data.
x 15 20 25 30 35
y 143 141 144 149 148
The PPMCC is calculated in the Table below.
x     y      x^2     y^2      xy
15    143    225     20449    2145
20    141    400     19881    2820
25    144    625     20736    3600
30    149    900     22201    4470
35    148    1225    21904    5180

$\sum x = 125$, $\sum y = 725$, $\sum x^2 = 3375$, $\sum y^2 = 105171$, $\sum xy = 18215$
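$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}\right)\left(\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}\right)}} = \frac{18215 - \dfrac{(125)(725)}{5}}{\sqrt{\left(3375 - \dfrac{(125)^2}{5}\right)\left(105171 - \dfrac{(725)^2}{5}\right)}} = \frac{90}{\sqrt{(250)(46)}} = 0.839$$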
The Coefficient of Determination
The coefficient of determination is the square of the coefficient of correlation r. In words, it gives the proportion of the variation in the y-values that is explained by the variation in the x-values.

In Example 5, the correlation coefficient is r = 0.839. Therefore the coefficient of determination is

$$r^2 = (0.839)^2 = 0.704 \quad \text{(3 decimals)}$$

This means that only 70.4% of the variation in the variable y is due to the variation in the variable x. Note that the coefficient of determination $r^2$ is between 0 and +1 inclusive.
Spearman's Rank Correlation Coefficient
An alternative method of measuring correlation is by means of Spearman's rank correlation coefficient, obtained by the formula

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d = difference between rankings and n = number of pairs of rankings.
Example 6
Two members of an interview panel have ranked seven applicants in order of preference for a specified post. Calculate the degree of agreement between the two members.
Applicant A B C D E F G
Interviewer X 1 2 3 4 5 6 7
Interviewer Y 4 3 1 2 5 7 6
The differences in rankings are shown below.

d      -3 -1 2 2 0 -1 1
d^2     9  1 4 4 0  1 1

$\sum d = 0$, $\sum d^2 = 20$

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(20)}{7(49 - 1)} = 1 - \frac{120}{336} = 1 - 0.3571 = 0.6429$$
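The rank correlation of Example 6 can be checked with a minimal Python sketch. The function name `spearman_from_ranks` is an assumption made for this illustration; it takes the two sets of rankings directly and applies the formula above.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation coefficient from two lists of rankings."""
    n = len(rank_x)
    # d is the difference between the two rankings of each item
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Rankings of applicants A to G in Example 6
interviewer_x = [1, 2, 3, 4, 5, 6, 7]
interviewer_y = [4, 3, 1, 2, 5, 7, 6]

r = spearman_from_ranks(interviewer_x, interviewer_y)
print(f"r = {r:.4f}")
```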
Example 7
The results of two tests taken by 10 employees are shown below (figures in %)
Employee A B C D E F G H I J
Test X 50 52 58 66 70 74 77 86 92 94
Test Y 56 51 53 65 64 81 76 78 80 92
Rank each employee in order of performance in the two tests and calculate the rank correlation coefficient.
Ranking the employees in each test we have
Employee A B C D E F G H I J
Test X 10 9 8 7 6 5 4 3 2 1
Test Y 8 10 9 6 7 2 5 4 3 1
d 2 -1 -1 1 -1 3 -1 -1 -1 0
d^2 4 1 1 1 1 9 1 1 1 0

$\sum d = 0$, $\sum d^2 = 20$

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(20)}{10(100 - 1)} = 1 - \frac{120}{990} = 1 - 0.1212 = 0.8788$$
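When the data are raw scores rather than rankings, as in Example 7, the scores must be ranked first. The sketch below shows one way to do this in Python; the function names are assumptions made for this illustration. Rank 1 is given to the highest score, as in the example. Tied scores are given the average of the ranks they occupy, which is a common convention and an assumption here, since Example 7 contains no ties.

```python
def rank_descending(values):
    """Rank values so that the highest gets rank 1; ties share the average rank."""
    order = sorted(values, reverse=True)
    # A value first found at 0-based position i and occurring c times occupies
    # ranks i+1 .. i+c, whose average is (2*i + c + 1) / 2.
    return [(2 * order.index(v) + order.count(v) + 1) / 2 for v in values]


def spearman_from_scores(xs, ys):
    """Spearman's rank correlation coefficient computed from raw scores."""
    rank_x, rank_y = rank_descending(xs), rank_descending(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Scores of employees A to J in Example 7 (figures in %)
test_x = [50, 52, 58, 66, 70, 74, 77, 86, 92, 94]
test_y = [56, 51, 53, 65, 64, 81, 76, 78, 80, 92]

print(rank_descending(test_y))
r = spearman_from_scores(test_x, test_y)
print(f"r = {r:.4f}")
```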
Exercise 8
1. Draw a scatter diagram of each of the sets of values given below, and calculate the PPMCC in each case.
a) x 6 7 8 9 10
y 3 6 9 12 15
b) x 1 3 5 7 9 11
y 8 7 6 5 4 3
c) x 2 4 6 8 10 12 14
y 12 8 8 14 9 7 13
2. The following table gives the percentage unemployment figures for males and females in 9 regions. Draw a scatter diagram of these data and calculate the PPMCC.

Region       Luapula Northern Eastern Central Lusaka Copperbelt N.Western Western Southern
Male (%)     3.4 3.5 4.5 4.4 12.5 12.8 3.2 4.2 4.8
Female (%)   3.2 3.8 4.6 3.8 11.8 11.5 4.0 3.8 3.5
3. In a job evaluation exercise an assessor ranks eight jobs in order of increasing health risk. The same jobs have also been ranked in decreasing order on the basis of the number of applicants attracted per advertised post.
Job A B C D E F G H
Health 1 2 3 4 5 6 7 8
Applicant 4 3 2 1 6 5 8 7
Calculate the rank correlation coefficient for this information.
4. The table below gives the shorthand and typing speeds (words/min) of a sample of seven secretaries.

Secretary   1 2 3 4 5 6 7
Typing      42 44 47 47 50 54 57
Shorthand   97 84 95 96 107 98 117
Calculate the degree of correlation between the two skills by:
a) the PPMCC, and
b) the rank correlation coefficient.
5. On ten different days (picked at random) the following values were obtained for the price of a share of a particular company together with the index on that day.

Share price (K)   260 250 350 200 150 100 115 120 135 145
Index             115 135 140 120 105 110 106 165 175 115

Calculate Spearman's rank correlation coefficient and indicate whether the index is a reasonable indicator for the price of the company's share.
EXAMINATION QUESTIONS WITH ANSWERS
Multiple Choice Questions
1.1 If $\sum d^2 = 10$ and $n = 8$, the Spearman's rank correlation coefficient to 3 decimal places will be?
A. 0.188 B. 0.841 C. 0.821 D. 0.881
(Natech , 1.2. Mathematics & Statistics, December 2004)
1.2 The prices of the following items are to be ranked prior to the calculation of Spearman's rank correlation coefficient. What is the rank of item G?

Item E F G H I J K L
Price 18 24 23 19 25

A. 5 B. 4 C. 3 D. 2.5
(Natech , 1.2. Mathematics & Statistics, December 2003)
1.3 [Scatter diagram: y-axis scaled from 0 to 8, x-axis scaled from 0 to 16.]

On the basis of the scatter diagram above, which of the following equations would best represent the regression line of Y on X?
A. y = x8 B. y = x + 8 C. y = x + 8
D. y = x 8
(Natech , 1.2. Mathematics & Statistics, December 2003)
1.4 An investigation is being carried out regarding the hypothesis that factor X is a cause of ailment Y. Which coefficient of correlation between X and Y gives most support to the hypothesis?
A. -0.9 B. -0.2 C. +0.8 D. 0
(Natech , 1.2. /B1Mathematics & Statistics, December 1999 (Rescheduled))
1.5 If $\sum x = 216$, $\sum y = 555$, $\sum x^2 = 10436$, $\sum y^2 = 46075$, $\sum xy = 19635$ and $n = 8$, then the value of r, the coefficient of correlation to two decimal places, is

A. 0.79 B. 0.62 C. 1.01 D. 1.02
(Natech , 1.2. Mathematics & Statistics, December 2001)
1.6 The Scatter diagram below shows
A. High positive correlation B. Very high correlation
C. Very high negative correlation D. Perfect correlation.
(Natech , 1.2. Mathematics & Statistics, June 2005)
1.7 Find the value of a in a regression equation if $b = 7$, $\sum x = 150$, $\sum y = 400$ and $n = 10$.

A. 145 B. -65 C. 25 D. -650
(Natech , 1.2. Mathematics & Statistics, June 2005)
1.8 In regression analysis, the variable whose value is estimated is referred to as the:
A. Simple variable B. Independent variable
C. Linear variable D. Dependent variable
1.9 The value of the coefficient of determination is interpreted as indicating
A. The proportion of unexplained variance
B. The proportion of explained variance
C. The extent of causation
D. The extent of relationship

1.10 Of the following coefficients of correlation, the one that is indicative of the greatest extent of relationship between the independent and dependent variables is
A. 0 B. +.20 C. 95. D. +.70
SECTION B
QUESTION ONE
a) Derive the product moment correlation coefficient from the following data and comment on your results.

Pupil                  A  B  C  D  E  F  G  H  I  J  K
Mathematics marks, x   41 37 38 39 49 47 42 34 36 48 29
Physics marks, y       36 20 31 24 37 35 42 26 27 29 23

b) Find the estimated line, by the method of least squares, fitting the following results from a Physics experiment.

Load, x (Newtons)    0.1 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.5
Extension, y (mm)    18 11 25 22 35 50 54 45 52 68
(Natech , 1.2. Mathematics & Statistics, June 2001)
c) A company has the following data on its profit (y) and advertising expenditure (x) over the last six years.

Profits (K million)   Advertising (K million)
11.3                  0.52
12.1                  0.61
14.1                  0.63
14.6                  0.70
15.1                  0.70
15.2                  0.75

i) Use two (2) methods to justify your assumption that there is a relationship between the two variables.
ii) Forecast the profits for next year if an advertising budget of K800 000 is allocated.
(Natech , 1.2. Mathematics & Statistics, December 2003)
QUESTION TWO
a) In the context of regression analysis explain what is meant by the following terms.
i) Regression coefficient
ii) Explanatory variable.

b) The following data shows the monthly imports (I) of apples and average prices (P) over a twelve-month period.

Monthly Imports (I) ('000 tonnes)   Average Monthly Prices (P) (K/tonne)
100                                 232
120                                 220
125                                 218
130                                 210
128                                 210
126                                 212
120                                 217
100                                 240
90                                  242
90                                  238
95                                  230
98                                  230

i) Determine the regression equation of imports (I) of apples on the price (P) and use it to forecast monthly imports when the average monthly price is K250 per tonne.
ii) If the correlation coefficient of the data is 0.95, interpret the results.
(Natech , 1.2. Mathematics & Statistics, December 2004)
QUESTION THREE
Hungry Lion is a major food retailing company, which has recently decided to open several new restaurants. In order to assist with the choice of siting these restaurants, the management of Fast Foods Limited wished to investigate the effect of income on eating habits. As part of their report, a marketing agency produced the following table showing the percentage of annual income spent on food, y, for a given annual family income, x (K).

x (K'000,000)   y
18              62
27              48
36              37
45              31
54              27
72              22
90              18
a) Plot, on separate scatter diagrams:

i) $y$ against $x$
ii) $\log_{10} y$ against $\log_{10} x$, and comment on the relationship between income and percentage of family income spent on food.

b) Use the method of least squares to fit the relationship $y = ax^b$ to the data. Estimate a and b.

c) Estimate the percentage of annual income spent on food by a family with an annual income of K64,800,000.
(Natech , 1.2. Mathematics & Statistics, December 2001)
QUESTION FOUR
a) Sales of product A between 0 and 4 years were as follows:
Year   Units sold ('000s)
2000   20
2001   18
2002   15
2003   14
2004   11

Required:

i) Calculate the correlation coefficient r.
ii) Comment on the result in (i) above.
iii) Calculate the coefficient of determination and comment.
iv) Use a regression equation to estimate the sales in the year 2005.

b) The table below shows the respective masses X and Y of a sample of 12 fathers and their oldest sons.
Mass X of father (kg)   65 63 67 64 68 62 70 66 68 67 69 71
Mass Y of son (kg)      68 66 68 65 69 66 68 65 71 67 68 70

From the data given above:

i) Construct a scatter diagram.
ii) Calculate the rank correlation coefficient using Spearman's method.
(Natech , 1.2. Mathematics & Statistics, June 2005)
c) Find the degree of correlation between the Bank of Zambia base lending rate and the dollar exchange rate taken over the past six months using:

i) The product moment coefficient of correlation.
ii) The coefficient of rank correlation.

Month                               Jan  Feb  Mar  Apr  May  Jun
Base % as on 1st of each month      14   14   13.5 12.5 12   12
Average rate ($)                    1.90 1.91 1.86 1.84 1.84 1.83
(Natech , 1.2. Mathematics & Statistics, Nov/Dec 2000)
QUESTION FIVE
a) The following table shows the number of units of a good produced and the total costs incurred.

Units Produced    100    200    300    400    500    600    700
Total Costs (K)   40 000 45 000 50 000 65 000 70 000 70 000 80 000

Draw a scatter diagram.

b) Find the appropriate least squares regression line so that the costs can be predicted from production levels and estimate the total costs when production is 250 units.

c) State the fixed costs of production.

d) Calculate r and explain how much of the variation in the dependent variable is explained by the variation of the independent variable.
(Natech , 1.2. Mathematics & Statistics, June 2002)
QUESTION SIX
a) A sample of eight employees is taken from the Production Department of an electronics factory. The data below relates to the number of weeks' experience in the soldering of components, and the number of components which were rejected as unsatisfactory last week.

Employee                  A  B  C  D  E  F  G  H
Weeks of experience (x)   4  5  7  9  10 11 12 14
No. of rejections (y)     21 22 15 18 14 14 11 13

i) Draw a scatter diagram of the data.

ii) Calculate a coefficient of correlation for these data and interpret its value.

iii) Find the least squares regression equation of rejects on experience. Predict the number of rejects you would expect from an employee with one week of experience.

(Natech , 1.2. Mathematics & Statistics, December, 1999 (Rescheduled))
b) i) Distinguish between regression and correlation.
ii) An experiment was conducted on 8 children to determine how a child's reading ability varied with his/her ability to write. The points awarded were as follows:

Child   A B C D E F G H
Writing 7 8 4 0 2 6 9 5
Reading 8 9 4 2 3 7 6 5

Calculate the coefficient of rank correlation and interpret the results.
(Natech , 1.2. Mathematics & Statistics, December, 2002)

c) The mass of a growing animal is measured, in g, on the same day each week for eight weeks. The results are given below.
Week x 1 2 3 4 5 6 7 8
Mass (g) y 480 504 560 616 666 702 759 801
i) Using 2cm to represent 1 week on the x-axis and 2cm to represent 100g on the y-axis, plot a scatter diagram of mass y against week x.

ii) Find the equation of the regression line of y on x.
(Natech , 1.2. Mathematics & Statistics, December, 1998)
QUESTION SEVEN
a) The following table gives the cost price and number of faults per annum experienced with seven brands of video recorders.

Video Recorders

Brand                     A   B   C   D   E   F   G
Price (K'000)             492 458 435 460 505 439 477
No. of Faults per Annum   2   6   7   4   3   5   1

i) Determine Spearman's rank correlation coefficient.

ii) Interpret your answer in (i) above.
(Natech , 1.2. Mathematics & Statistics, December,1998)
b) The following table gives a set of ten pairs of observations of inspection costs per thousand articles produced, recorded on a number of occasions at several factories controlled by a single group and producing comparable products.

Observation   Inspection costs per thousand articles   Number of defective articles per thousand
1             0.25                                     50
2             0.30                                     35
3             0.15                                     60
4             0.75                                     15
5             0.40                                     46
6             0.65                                     20
a) Draw a scatter plot of y against x.

b) Calculate the coefficient of correlation and interpret its value.

c) Find the least squares regression equation of the number of defectives on experience.

d) Estimate the number of defectives in a box inspected by a worker with 6 weeks of experience.
(Natech , 1.2. Mathematics & Statistics, December 1996)