Statistics lecture 11 (chapter 11)
Regression & Correlation
![Page 1: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/1.jpg)
![Page 2: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/2.jpg)
• Analyze the relationship between two quantitative variables
• Correlation determines the strength and direction of the relationship between the variables
• Regression determines a mathematical equation that describes the relationship
• The equation can be used for prediction
![Page 3: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/3.jpg)
• Regression Analysis
– X → independent variable
– Y → dependent variable
– The independent variable influences the dependent variable
– Sample consists of n pairs of observations
– Ascertain whether a relation exists
– Examine the nature of the relation
– Obtain an equation that relates Y to X
– Evaluate the magnitude of the change in one variable due to a change in the other variable
– Predict the value of Y for different values of X
![Page 4: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/4.jpg)
• Regression Analysis – scatter plot
– Effective way to display the relationship
– X variable on horizontal axis
– Y variable on vertical axis
– Plot a dot for each pair of observations
– Can determine the
  • Form
    – Linear or nonlinear
  • Direction
    – Positive or negative
  • Strength
    – Dots scattered close together – strong relation
    – Large scatter – weak relation
![Page 5: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/5.jpg)
• Regression Analysis – scatter plot
– Example
– Two variables
  • Cost of producing units
  • Number of units produced
– Cost depends on the number of units
| Number of units (x) | Cost per unit (y), in R |
|---|---|
| 10 | 10,00 |
| 20 | 8,80 |
| 30 | 7,90 |
| 50 | 6,20 |
| 60 | 5,00 |
| 80 | 4,00 |
| 100 | 3,50 |
| 120 | 2,00 |
[Figure: scatter plot "Relation between units produced and cost of production"; x-axis: Number of units (0 to 150); y-axis: Cost per unit (R) (0,00 to 12,00)]
From the graph there appears to be a negative relation between number of units and cost per unit: as more units are produced, the cost per unit decreases.
![Page 6: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/6.jpg)
• Simple linear regression analysis
– Which line fits the data best?
[Figure: scatter plot "Relation between units produced and cost of production"; x-axis: Number of units; y-axis: Cost per unit (R)]
![Page 7: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/7.jpg)
• Simple linear regression analysis
– Which line fits the data best?
– Method of least squares
– y = a + b x
  • b → slope
  • a → y intercept
– ∑ei = 0, where ei is the error for point i (observed y minus the y value on the line)
– ∑ei² measures the size of the set of errors
– Least squares method
  • Chooses the line for which the sum of squared errors is the smallest
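To make "the smallest sum of squared errors" concrete, the sketch below compares two candidate lines on the units/cost data from this lecture. The helper name `sse` and the second, hand-picked line are illustrative assumptions; the first line uses the rounded coefficients that the lecture derives later on.

```python
# Units produced (x) and cost per unit in rand (y), from the lecture example.
xs = [10, 20, 30, 50, 60, 80, 100, 120]
ys = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]


def sse(a, b):
    """Sum of squared errors for the candidate line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Line reported in the lecture (rounded coefficients).
sse_lecture = sse(10.04, -0.07)
# Hand-picked alternative line (hypothetical, for comparison only).
sse_other = sse(9.00, -0.05)

print(f"lecture line: sum of squared errors = {sse_lecture:.4f}")
print(f"other line:   sum of squared errors = {sse_other:.4f}")
```

The lecture's line gives the smaller sum here; any other straight line can be checked the same way by passing its intercept and slope to `sse`.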
![Page 8: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/8.jpg)
• Least squares regression model
– Population regression model
  • Y = α + βx + ε
  • ε → random error
– Sample regression model
  • ŷ = a + b x
  • b → change in y for a one-unit change in x
  • a → value of y when x = 0
![Page 9: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/9.jpg)
• Least squares regression model
– ŷ = a + b x
| Number of units (x) | Cost per unit (y), in R |
|---|---|
| 10 | 10,00 |
| 20 | 8,80 |
| 30 | 7,90 |
| 50 | 6,20 |
| 60 | 5,00 |
| 80 | 4,00 |
| 100 | 3,50 |
| 120 | 2,00 |

∑x = 470, ∑y = 47,4
∑x² = 38300, ∑y² = 335,54
∑xy = 2033
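$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

where

$$S_{xx} = \sum x^2 - \tfrac{1}{n}\Big(\sum x\Big)^2, \qquad S_{yy} = \sum y^2 - \tfrac{1}{n}\Big(\sum y\Big)^2, \qquad S_{xy} = \sum xy - \tfrac{1}{n}\Big(\sum x\Big)\Big(\sum y\Big)$$

$$\bar{x} = 58{,}75 \qquad \bar{y} = 5{,}925$$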
![Page 10: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/10.jpg)
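• Least squares regression model
– ŷ = a + b x
– For the same units (x) and cost per unit (y) data, fill in: ∑x = ?, ∑y = ?, ∑x² = ?, ∑y² = ?, ∑xy = ?

$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

where

$$S_{xx} = \sum x^2 - \tfrac{1}{n}\Big(\sum x\Big)^2, \qquad S_{yy} = \sum y^2 - \tfrac{1}{n}\Big(\sum y\Big)^2, \qquad S_{xy} = \sum xy - \tfrac{1}{n}\Big(\sum x\Big)\Big(\sum y\Big)$$

– Calculate Sxx, Syy, Sxy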
![Page 11: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/11.jpg)
• Least squares regression model
– ŷ = a + b x
– For the same data (n = 8): ∑x = 470, ∑y = 47,4, ∑x² = 38300, ∑y² = 335,54, ∑xy = 2033

$$\bar{x} = 58{,}75 \qquad \bar{y} = 5{,}925$$

$$S_{xx} = 38300 - \tfrac{1}{8}(470)^2 = 10687{,}5$$

$$S_{yy} = 335{,}54 - \tfrac{1}{8}(47{,}4)^2 = 54{,}695$$

$$S_{xy} = 2033 - \tfrac{1}{8}(470)(47{,}4) = -751{,}75$$

$$b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$$
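The sums and the three S quantities can be checked with a few lines of code. This is a minimal sketch using only the data from the table; the variable names are illustrative, and the code writes decimals with a point where the slides use a comma.

```python
# Units produced (x) and cost per unit in rand (y), from the lecture example.
xs = [10, 20, 30, 50, 60, 80, 100, 120]
ys = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(xs)

# Raw sums used in the shortcut formulas.
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Shortcut formulas for Sxx, Syy and Sxy.
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

print(f"sum x = {sum_x}, sum y = {sum_y:.1f}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2:.2f}, sum xy = {sum_xy:.0f}")
print(f"Sxx = {s_xx:.1f}, Syy = {s_yy:.3f}, Sxy = {s_xy:.2f}")
```

Note the negative sign of Sxy: it is what makes the slope negative.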
![Page 12: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/12.jpg)
• Least squares regression model
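$$S_{xx} = 10687{,}5 \qquad S_{yy} = 54{,}695 \qquad S_{xy} = -751{,}75 \qquad \bar{x} = 58{,}75 \qquad \bar{y} = 5{,}925$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{-751{,}75}{10687{,}5} = -0{,}07$$

$$a = \bar{y} - b\bar{x} = 5{,}925 - (-0{,}07)(58{,}75) = 10{,}0375$$

→ ŷ = 10,0375 – 0,07x
Note: Syy is not used here, but we will use it later!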
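As a check on the arithmetic, the sketch below reproduces the slope and intercept from Sxx, Sxy and the two means. It follows the slide in rounding b to two decimals before computing a; the variables `b_exact` and `a_exact` are an added illustration of what happens when the slope is not rounded first.

```python
# Quantities taken from the slide.
s_xx = 10687.5
s_xy = -751.75
x_bar = 58.75
y_bar = 5.925

# Slope: unrounded, and rounded to two decimals as on the slide.
b_exact = s_xy / s_xx
b = round(b_exact, 2)

# Intercept computed with the rounded slope (the slide's approach).
a = y_bar - b * x_bar

# Intercept computed with the unrounded slope (illustration only).
a_exact = y_bar - b_exact * x_bar

print(f"b = {b:.2f} (unrounded {b_exact:.5f})")
print(f"a = {a:.4f} (with unrounded slope: {a_exact:.4f})")
```

The two intercepts differ in the second decimal, so rounding early has a small but visible effect on the fitted line.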
![Page 13: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/13.jpg)
• Least squares regression model
– ŷ = a + b x
– ŷ = 10,0375 – 0,07x
– Sign of the slope b (three sketches of y against x):
  • b > 0: positive linear relation
  • b < 0: negative linear relation
  • b = 0: no relation
![Page 14: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/14.jpg)
• Plot least squares regression model
– ŷ = 10,04 – 0,07x
[Figure: scatter plot "Relation between units produced and cost of production"; x-axis: Number of units; y-axis: Cost per unit (R)]
If x = 30: ŷ = 10,04 – 0,07(30) = 7,94
If x = 90: ŷ = 10,04 – 0,07(90) = 3,74
![Page 15: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/15.jpg)
EXAMPLE
A car manufacturing business wants to find out how the price of its car models depreciates with age. The business took a sample of 8 models and collected the following information on age (years) and price (R1000):

| Age (years) | 8 | 3 | 6 | 9 | 2 | 5 | 6 | 3 |
|---|---|---|---|---|---|---|---|---|
| Price (R1000) | 16 | 74 | 38 | 19 | 102 | 36 | 33 | 69 |

Find the equation of the regression line with price as the dependent variable and age as the independent variable.
![Page 16: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/16.jpg)
Example answer
Example 11.4, textbook, part 2, page 383
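For readers without the textbook at hand, the following sketch applies the same least squares formulas to the age and price data. It is not the textbook's worked answer; the function name `least_squares` is an assumption made for illustration.

```python
# Age in years (x) and price in R1000 (y) for the 8 sampled car models.
ages = [8, 3, 6, 9, 2, 5, 6, 3]
prices = [16, 74, 38, 19, 102, 36, 33, 69]


def least_squares(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b


a, b = least_squares(ages, prices)
print(f"price = {a:.2f} + ({b:.2f}) * age")
```

Under these formulas the slope comes out negative: each additional year of age is associated with a lower price.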
![Page 17: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/17.jpg)
PREDICTIONS IN REGRESSION ANALYSIS
• A sample regression line is usually obtained for the purpose of prediction
• That is, to estimate the value of Y corresponding to a selected value of x
• Two ways to estimate y:
– Point estimate
– Confidence interval
![Page 18: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/18.jpg)
• Prediction with regression model
– Point estimate using ŷ = 10,04 – 0,07x
– What will be the estimated cost per unit if 60 units are produced?
– ŷ = 10,04 – 0,07(60) = R5,84
– What will be the estimated cost per unit if 25 units are produced?
– ŷ = 10,04 – 0,07(25) = R8,29
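A point estimate is just the fitted line evaluated at the chosen x. The sketch below wraps the lecture's rounded line in a small function; the name `predict_cost` and its default arguments are illustrative assumptions.

```python
def predict_cost(units, a=10.04, b=-0.07):
    """Point estimate of cost per unit (in rand) from the line y = a + b*x."""
    return a + b * units


for units in (60, 25):
    print(f"{units} units -> estimated cost per unit R{predict_cost(units):.2f}")
```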
![Page 19: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/19.jpg)
ERRORS
• Once the regression line has been estimated, every observed value has a corresponding predicted value
• Predicted values all fall exactly on the regression line
• Not all observed values fall on the regression line
• The difference between an observed value and its predicted value is known as an ERROR and is denoted by ei
![Page 20: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/20.jpg)
ERRORS
• Since the observed values deviate from the predicted values, the regression equation is not a perfect predictor
• We need to assess how accurately the regression line predicts the values; this is done by analysing the errors ei
• The standard deviation of the errors measures how widely the observed values are spread around the regression line
• The smaller the standard deviation, the closer the points cluster around the line
![Page 21: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/21.jpg)
• Standard deviation of random errors
– ŷ = 10,04 – 0,07x
– ei indicates how the observed and predicted values differ
– Standard deviation of errors measures spread around the line
  • Smaller: points closer to line
– For example, ŷ = 10,04 – 0,07(10) = 9,34 and ŷ = 10,04 – 0,07(20) = 8,64
| Number of units (x) | Cost per unit (y) | Predicted cost per unit (ŷ) | Difference ei = yi - ŷi |
|---|---|---|---|
| 10 | 10,00 | 9,34 | 0,66 |
| 20 | 8,80 | 8,64 | 0,16 |
| 30 | 7,90 | 7,94 | -0,04 |
| 50 | 6,20 | 6,54 | -0,34 |
| 60 | 5,00 | 5,84 | -0,84 |
| 80 | 4,00 | 4,44 | -0,44 |
| 100 | 3,50 | 3,04 | 0,46 |
| 120 | 2,00 | 1,64 | 0,36 |
![Page 22: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/22.jpg)
• Standard deviation of random errors
– A small standard deviation means the values lie close to the line
$$s_e = \sqrt{\frac{S_{yy} - bS_{xy}}{n-2}} = \sqrt{\frac{54{,}695 - (-0{,}07)(-751{,}75)}{8-2}} = 0{,}588$$
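The sketch below recomputes the errors ei and the standard deviation of the errors. It evaluates the shortcut formula twice: once with b rounded to -0,07 as on the slide, and once with the unrounded slope, which is an added illustration and not part of the lecture. The variable names are assumptions.

```python
import math

# Units produced (x) and cost per unit in rand (y), from the lecture example.
xs = [10, 20, 30, 50, 60, 80, 100, 120]
ys = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(xs)

# Rounded line and S quantities as used on the slides.
a, b = 10.04, -0.07
s_xx, s_yy, s_xy = 10687.5, 54.695, -751.75

# Errors: observed minus predicted cost per unit.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Shortcut formula with the rounded slope (the slide's calculation).
s_e_slide = math.sqrt((s_yy - b * s_xy) / (n - 2))

# Same formula with the unrounded slope (illustration only).
b_full = s_xy / s_xx
s_e_full = math.sqrt((s_yy - b_full * s_xy) / (n - 2))

print([round(e, 2) for e in residuals])
print(f"s_e with b = -0.07:      {s_e_slide:.3f}")
print(f"s_e with unrounded b:    {s_e_full:.3f}")
```

The formula subtracts two numbers of similar size, so it is sensitive to rounding b: the slide's 0,588 becomes roughly 0,550 when the slope is carried at full precision. The rest of the lecture continues with 0,588.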
![Page 23: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/23.jpg)
CONFIDENCE INTERVAL FOR PREDICTION
• Different samples from the same population will give different point estimates
• It is likely that different samples from the same population will give different estimated regression lines
• We therefore need to construct a confidence interval for Y, based on one sample, that will give a more reliable estimate of Y
• Generally called a PREDICTION INTERVAL
![Page 24: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/24.jpg)
• Confidence interval for prediction
– Point estimate for 60 units
  • ŷ = 10,04 – 0,07(60) = R5,84
– Rather calculate a confidence interval for the mean value of y for a given x value
– Use the t-distribution
– Confidence interval for the mean of y, given x = x0

$$\text{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$$

where

$$s_{\hat{y}|x_0} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}$$
![Page 25: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/25.jpg)
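• Confidence interval for prediction

$$\text{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$$

where

$$s_{\hat{y}|x_0} = s_e\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} = 0{,}588\sqrt{\frac{1}{8} + \frac{(60 - 58{,}75)^2}{10687{,}5}} = 0{,}2080$$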
![Page 26: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/26.jpg)
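• Confidence interval for prediction
– 95% confidence interval if x = 60

$$\text{CONF}(\mu_{y|x_0})_{1-\alpha} = a + bx_0 \pm t_{n-2;\,1-\alpha/2}\; s_{\hat{y}|x_0}$$

$$= 10{,}04 - 0{,}07(60) \pm t_{8-2;\,1-0{,}025}\,(0{,}2080)$$

$$= 5{,}84 \pm 2{,}447(0{,}2080)$$

$$= 5{,}84 \pm 0{,}508976$$

$$= (5{,}33\ ;\ 6{,}35)$$

– 95% sure the mean cost per unit for 60 units will be between R5,33 and R6,35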
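The interval can be reproduced with the sketch below. The critical value 2,447 is copied from the slide (t with 6 degrees of freedom) and not computed, since the Python standard library has no t-distribution; the function name `mean_cost_interval` is an assumption.

```python
import math

# Values taken from the slides.
n = 8
x_bar = 58.75
s_xx = 10687.5
s_e = 0.588
a, b = 10.04, -0.07
t_crit = 2.447  # t(6 df; 0.975), as quoted on the slide


def mean_cost_interval(x0):
    """95% confidence interval for the mean of y at x = x0."""
    y_hat = a + b * x0
    s_y_hat = s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    margin = t_crit * s_y_hat
    return y_hat - margin, y_hat + margin


low, high = mean_cost_interval(60)
print(f"95% interval at x = 60: ({low:.2f} ; {high:.2f})")
```

Because of the (x0 - x̄)² term, the interval widens as x0 moves away from the mean of x; calling `mean_cost_interval(120)` shows this.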
![Page 27: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/27.jpg)
• Inferences about β (population slope)
– b is the point estimate of β
– The t-distribution is used to make inferences about β
– Confidence interval for β

$$\text{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b \quad \text{where} \quad s_b = \frac{s_e}{\sqrt{S_{xx}}}$$

– If the confidence interval includes 0: no linear relation
– If the confidence interval does not include 0: there might be a linear relation
![Page 28: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/28.jpg)
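• Inferences about β (population slope)
– Confidence interval for β

$$\text{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b \quad \text{where} \quad s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{0{,}588}{\sqrt{10687{,}5}} = 0{,}00569$$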
![Page 29: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/29.jpg)
• Inferences about β (population slope)
– Confidence interval for β

$$\text{CONF}(\beta)_{1-\alpha} = b \pm t_{n-2;\,1-\alpha/2}\; s_b = -0{,}07 \pm 2{,}447(0{,}00569) = (-0{,}0839\ ;\ -0{,}0561)$$

– 95% sure the population slope will be between -0,0839 and -0,0561
– Interval does not include 0
– There might be a linear relation
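A short sketch of the same calculation, using the values from the slides. The critical value 2,447 is again copied from the slide, and the variable names are illustrative.

```python
import math

# Values taken from the slides.
s_e = 0.588
s_xx = 10687.5
b = -0.07
t_crit = 2.447  # t(6 df; 0.975), as quoted on the slide

# Standard error of the slope and the 95% confidence interval for beta.
s_b = s_e / math.sqrt(s_xx)
margin = t_crit * s_b
low, high = b - margin, b + margin

print(f"s_b = {s_b:.5f}")
print(f"95% interval for the slope: ({low:.4f} ; {high:.4f})")
print("interval contains 0:", low <= 0 <= high)
```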
![Page 30: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/30.jpg)
• Inferences about β (population slope)
– Hypothesis test concerning β

Testing H0: β = 0 for n < 30

| Alternative hypothesis | Decision rule: reject H0 if |
|---|---|
| H1: β ≠ 0 | $\lvert t \rvert \ge t_{n-2;\,1-\alpha/2}$ |
| H1: β > 0 | $t \ge t_{n-2;\,1-\alpha}$ |
| H1: β < 0 | $t \le -t_{n-2;\,1-\alpha}$ |

Test statistic:

$$t = \frac{b}{s_b} \quad \text{with} \quad s_b = \frac{s_e}{\sqrt{S_{xx}}}$$
![Page 31: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/31.jpg)
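• Solution
– H0 : β = 0
– H1 : β ≠ 0
– α = 0,05
– Test statistic:

$$s_b = \frac{s_e}{\sqrt{S_{xx}}} = \frac{0{,}588}{\sqrt{10687{,}5}} = 0{,}00569 \qquad t = \frac{b}{s_b} = \frac{-0{,}07}{0{,}00569} = -12{,}30$$

– Critical values: -2,447 and +2,447 (reject H0 if t ≤ -2,447 or t ≥ +2,447; otherwise do not reject H0)
– Reject H0, since t = -12,30 < -2,447
At α = 0,05 the slope is not zero: there is a linear relation between number of units and cost per unit
If H1 : β > 0: test for positive slope
If H1 : β < 0: test for negative slope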
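The two-sided test can be sketched as follows, using the rounded values b = -0,07 and s_b = 0,00569 from the slides and the critical value 2,447 quoted there. With these inputs the quotient b / s_b evaluates to about -12,30. The variable names are assumptions.

```python
# Rounded values taken from the slides.
b = -0.07
s_b = 0.00569
t_crit = 2.447  # t(6 df; 0.975), as quoted on the slide

# Test statistic for H0: beta = 0 against H1: beta != 0.
t_stat = b / s_b
reject_h0 = abs(t_stat) >= t_crit

print(f"t = {t_stat:.2f}, critical value = {t_crit}")
print("reject H0" if reject_h0 else "do not reject H0")
```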
![Page 32: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/32.jpg)
• Correlation Analysis
– Strength of linear relationship
– Direction of linear relationship
  • Positive
  • Negative
– Population correlation coefficient ρ (rho)
– Sample correlation coefficient r
– r always between -1 and +1
  • r = 1 perfect positive
  • r = -1 perfect negative
  • r = 0 no linear relationship
  • near 0 weak relationship
  • near -1 or +1 strong relationship
![Page 33: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/33.jpg)
Coefficient of correlation
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1 (positive association) every point falls on the regression line.
– If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
![Page 34: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/34.jpg)
[Figure: six scatter plots of Y against X: perfect positive (r = +1), high positive (r = +0,9), low positive (r = +0,3), perfect negative (r = -1), high negative (r = -0,8), no correlation (r = 0)]
![Page 35: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/35.jpg)
• Correlation coefficient r
– For the units (x) and cost per unit (y) data (n = 8): ∑x = 470, ∑y = 47,4, ∑x² = 38300, ∑y² = 335,54, ∑xy = 2033, x̄ = 58,75, ȳ = 5,925

$$S_{xx} = 38300 - \tfrac{1}{8}(470)^2 = 10687{,}5$$

$$S_{yy} = 335{,}54 - \tfrac{1}{8}(47{,}4)^2 = 54{,}695$$

$$S_{xy} = 2033 - \tfrac{1}{8}(470)(47{,}4) = -751{,}75$$

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{-751{,}75}{\sqrt{10687{,}5\,(54{,}695)}} = -0{,}98$$

– Strong negative relationship
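The correlation coefficient can be computed directly from the raw data. A minimal sketch; the function name `correlation` is an assumption, and the shortcut formulas are the ones used throughout this lecture.

```python
import math

# Units produced (x) and cost per unit in rand (y), from the lecture example.
xs = [10, 20, 30, 50, 60, 80, 100, 120]
ys = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]


def correlation(xs, ys):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(xs, ys)
print(f"r = {r:.2f}")
```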
![Page 36: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/36.jpg)
• Coefficient of determination r²
– Measures the proportion of changes in the dependent variable y that can be explained by the independent variable x
– % of total variation in y that is explained by the regression model

$$r^2 = (-0{,}98)^2 = 0{,}9604 = 96{,}04\%$$
– 96% of the variation in the cost of units is explained by the variation in the number of units produced
– 4% is unexplained
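The "explained variation" reading of r² can be checked numerically: for the least squares line, r² equals 1 minus the ratio of the sum of squared errors to Syy. The sketch below uses unrounded values throughout, which is an assumption that departs from the slide, where r is rounded to -0,98 before squaring.

```python
# Units produced (x) and cost per unit in rand (y), from the lecture example.
xs = [10, 20, 30, 50, 60, 80, 100, 120]
ys = [10.00, 8.80, 7.90, 6.20, 5.00, 4.00, 3.50, 2.00]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviation forms of Sxx, Syy and Sxy (equal to the shortcut formulas).
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares line with unrounded coefficients.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = s_xy ** 2 / (s_xx * s_yy)
explained = 1 - sse / s_yy

print(f"r^2 from the correlation:      {r_squared:.4f}")
print(f"1 - SSE/Syy (explained share): {explained:.4f}")
print(f"slide value, (-0.98)^2:        {(-0.98) ** 2:.4f}")
```

The unrounded r² comes out slightly above the slide's 96,04%, because the slide squares the rounded r; the interpretation is unchanged.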
![Page 37: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/37.jpg)
• Hypothesis test concerning the correlation coefficient ρ

Testing H0: ρ = 0 for n < 30

| Alternative hypothesis | Decision rule: reject H0 if |
|---|---|
| H1: ρ ≠ 0 | $\lvert t \rvert \ge t_{n-2;\,1-\alpha/2}$ |

Test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
![Page 38: Statistics lecture 11 (chapter 11)](https://reader034.vdocument.in/reader034/viewer/2022051323/5487b5b0b4af9f9b0d8b5520/html5/thumbnails/38.jpg)
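• Solution
– H0 : ρ = 0
– H1 : ρ ≠ 0
– α = 0,05
– Test statistic:

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0{,}98}{\sqrt{\dfrac{1 - (-0{,}98)^2}{8 - 2}}} = -12{,}06$$

– Critical values: -2,447 and +2,447 (reject H0 if t ≤ -2,447 or t ≥ +2,447; otherwise do not reject H0)
– Reject H0, since t = -12,06 < -2,447
At α = 0,05 the correlation coefficient is not zero: there is a linear relation between number of units and cost per unit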
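The test on ρ follows the same pattern as the test on the slope. A minimal sketch with the rounded r = -0,98 from the slides and the quoted critical value 2,447; the variable names are illustrative.

```python
import math

# Values taken from the slides.
r = -0.98
n = 8
t_crit = 2.447  # t(6 df; 0.975), as quoted on the slide

# Test statistic for H0: rho = 0 against H1: rho != 0.
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))
reject_h0 = abs(t_stat) >= t_crit

print(f"t = {t_stat:.2f}, critical value = {t_crit}")
print("reject H0" if reject_h0 else "do not reject H0")
```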