Introduction to Probability and Statistics
Thirteenth Edition
Chapter 12
Linear Regression and Correlation
Correlation & Regression
Univariate & Bivariate Statistics
U: frequency distribution, mean, mode, range, standard deviation
B: correlation – two variables
Correlation
linear pattern of relationship between one variable (x) and another variable (y) – an association between two variables
graphical representation of the relationship between two variables
Warning: No proof of causality. Cannot assume x causes y.
1. Correlation Analysis
• The correlation coefficient measures the strength of the relationship between x and y.
Sample Pearson’s correlation coefficient:
r = S_xy / √(S_xx S_yy)
where
S_xx = Σx² − (Σx)²/n
S_yy = Σy² − (Σy)²/n
S_xy = Σxy − (Σx)(Σy)/n
Pearson’s Correlation Coefficient
“r” indicates the strength of the relationship (strong, weak, or none) and its direction:
positive (direct) – variables move in same direction
negative (inverse) – variables move in opposite directions
r ranges in value from –1.0 to +1.0
Strong Negative No Rel. Strong Positive
-1.0 0.0 +1.0
Limitations of Correlation
linearity: can’t describe non-linear relationships, e.g., the relation between anxiety & performance
no proof of causation: cannot assume x causes y
[Scatterplots: linear relationships vs. curvilinear relationships]
Some Correlation Patterns
[Scatterplots: strong relationships vs. weak relationships]
Some Correlation Patterns
Example
The table shows the heights and weights of n = 10 randomly selected college football players.
Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175
r = S_xy / √(S_xx S_yy) = 328 / √((60.4)(2610)) = .8261
[Scatterplot of Weight vs Height]
r = .8261
Strong positive correlation
As the player’s height increases, so does his weight.
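The correlation in this example can be verified with a short script (a sketch that simply applies the sums-of-squares formulas from this chapter; the variable names are my own):

```python
import math

# Heights and weights of the n = 10 football players from the example
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(x)

# Sums of squares, computed exactly as defined in the slides
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

r = Sxy / math.sqrt(Sxx * Syy)
print(round(Sxx, 1), round(Syy, 1), round(Sxy, 1))  # 60.4 2610.0 328.0
print(round(r, 4))  # 0.8261
```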
Example – scatter plot
Inference using r
• The population coefficient of correlation is called ρ (“rho”). We can test for a significant correlation between x and y using a t test:
H0: ρ = 0 versus Ha: ρ ≠ 0
Test statistic: t = r √(n − 2) / √(1 − r²)
Reject H0 if t > t_α/2 or t < −t_α/2 with df = n − 2.
Is there a significant positive correlation between weight and height in the population of all college football players?
r = .8261
H0: ρ = 0 versus Ha: ρ > 0
Test statistic: t = r √(n − 2) / √(1 − r²) = .8261 √8 / √(1 − .8261²) = 4.15
Use the t-table with n-2 = 8 df to bound the p-value as p-value < .005.
There is a significant positive correlation between weight and height in the population of all college football players.
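This t statistic is easy to reproduce numerically (a sketch; the names are mine, and the table value 3.355 is the one-tailed t critical value for df = 8 at α = .005):

```python
import math

r, n = 0.8261, 10  # sample correlation and sample size from the example

t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 2))  # 4.15
# Since t = 4.15 exceeds the one-tailed critical value 3.355 (df = 8,
# alpha = .005), the p-value is bounded as p < .005, as stated above.
```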
Example
2. Linear Regression
Regression: Correlation + Prediction
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
◦ Dependent variable: denoted Y
◦ Independent variables: denoted X1, X2, …, Xk
Example
Let y be the monthly sales revenue for a company. This might be a function of several variables:
◦ x1 = advertising expenditure
◦ x2 = time of year
◦ x3 = state of economy
◦ x4 = size of inventory
We want to predict y using knowledge of x1, x2, x3 and x4.
Some Questions
Which of the independent variables are useful and which are not?
How could we create a prediction equation that allows us to predict y using knowledge of x1, x2, x3, etc.?
How good is this prediction?
We start with the simplest case, in which the response y is a function of a single independent variable, x.
A statistical model separates the systematic component of a relationship from the random component.
Statistical model: Data = Systematic component + Random errors
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
Model Building
Explanatory and Response Variables are Numeric
Relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (straight line).
Model: Y = β0 + β1x + ε, ε ~ N(0, σ²)
• β1 > 0 ⇒ Positive Association
• β1 < 0 ⇒ Negative Association
• β1 = 0 ⇒ No Association
A Simple Linear Regression Model
Picturing the Simple Linear Regression Model
[Regression plot: y versus x; β0 = intercept, β1 = slope, ε = error]
Simple Linear Regression Analysis
Variables: x = independent variable, y = dependent variable
Parameters: α = y-intercept, β = slope, ε ~ normal distribution with mean 0 and variance σ²
Model: y = α + βx + ε (actual value of a score)
ŷ = a + bx (predicted value)
Simple Linear Regression Model…
[Line plot: ŷ = a + bx, with b = slope = Δy/Δx and a = intercept]
The Method of Least Squares
The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to estimate α and β so that the vertical distances of the points from the line are minimized.
Best fitting line: ŷ = a + bx
Choose a and b to minimize SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:
S_xx = Σx² − (Σx)²/n
S_yy = Σy² − (Σy)²/n
S_xy = Σxy − (Σx)(Σy)/n
Best fitting line: ŷ = a + bx, where b = S_xy / S_xx and a = ȳ − b x̄
Example
The table shows the IQ scores for a random sample of n = 10 college freshmen, along with their final calculus grades.
Student 1 2 3 4 5 6 7 8 9 10
IQ Scores, x 39 43 21 64 57 47 28 75 34 52
Calculus grade, y 65 78 52 82 92 89 73 98 56 75
Use your calculator to find the sums and sums of squares.
Σx = 460    Σy = 760
Σx² = 23634  Σy² = 59816
Σxy = 36854
S_xx = 23634 − (460)²/10 = 2474
S_yy = 59816 − (760)²/10 = 2056
S_xy = 36854 − (460)(760)/10 = 1894
b = S_xy / S_xx = 1894/2474 = .76556 and a = ȳ − b x̄ = 76 − .76556(46) = 40.78
Best fitting line: ŷ = 40.78 + .77x
Example
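The least squares estimates for this example can be verified with a few lines of code (a sketch that follows the formulas above; the variable names are mine):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
n = len(x)

# Sums of squares needed for the slope and intercept
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

b = Sxy / Sxx                 # slope
a = sum(y)/n - b * sum(x)/n   # intercept: a = ybar - b*xbar
print(round(b, 5), round(a, 2))  # 0.76556 40.78
```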
The Analysis of Variance
The total variation in the experiment is measured by the total sum of squares:
Total SS = S_yy = Σ(y − ȳ)²
The Total SS is divided into two parts:
SSR (sum of squares for regression): measures the variation explained by using x in the model.
SSE (sum of squares for error): measures the leftover variation not explained by x.
We calculate
SSR = (S_xy)² / S_xx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = S_yy − (S_xy)²/S_xx = 2056 − 1449.9741 = 606.0259
The Analysis of Variance
The ANOVA Table
Total df = n − 1
Regression df = 1
Error df = n − 1 − 1 = n − 2
Mean Squares: MSR = SSR/1, MSE = SSE/(n − 2)

Source       df      SS        MS            F
Regression   1       SSR       SSR/1         MSR/MSE
Error        n − 2   SSE       SSE/(n − 2)
Total        n − 1   Total SS
The Calculus Problem
Source       df   SS          MS          F
Regression   1    1449.9741   1449.9741   19.14
Error        8    606.0259    75.7532
Total        9    2056.0000
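The ANOVA quantities follow directly from the sums of squares (a sketch; the hard-coded values come from the calculus example, and the names are my own):

```python
Syy, Sxx, Sxy, n = 2056.0, 2474.0, 1894.0, 10  # sums of squares and sample size

SSR = Sxy**2 / Sxx   # variation explained by the regression
SSE = Syy - SSR      # leftover (error) variation
MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
print(round(SSR, 4), round(SSE, 4), round(F, 2))  # 1449.9741 606.0259 19.14
```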
You can test the overall usefulness of the model using an F test. If the model is useful, MSR will be large compared to the unexplained variation, MSE.
This test is exactly equivalent to the t-test, with t2 = F.
Hypothesis:
H0: the model is not useful in predicting y
Test Statistic: F = MSR / MSE
Reject H0 if F > F_α with df = 1 and n − 2.
Testing the Usefulness of the Model (The F Test)
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x

Predictor  Coef    SE Coef  T     P
Constant   40.784  8.507    4.79  0.001
x          0.7656  0.1750   4.38  0.002

S = 8.70363  R-Sq = 70.5%  R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1450.0  1450.0  19.14  0.002
Residual Error  8   606.0   75.8
Total           9   2056.0
Minitab Output
Annotations: regression coefficients a and b; least squares regression line; MSE; to test H0: β = 0, use the t in the row for x (note that t² = F).
Testing the Usefulness of the Model
• The first question to ask is whether the independent variable x is of any use in predicting y.
• If it is not, then the value of y does not change, regardless of the value of x. This implies that the slope of the line, β, is zero.
H0: β = 0 versus Ha: β ≠ 0
The test statistic is a function of b, our best estimate of β. Using MSE as the best estimate of the random variation σ², we obtain a t statistic.
Test statistic: t = b / √(MSE / S_xx), which has a t distribution with df = n − 2,
or a (1 − α)100% confidence interval: b ± t_α/2 √(MSE / S_xx)
Testing the Usefulness of the Model
H0: β = 0 versus Ha: β ≠ 0
The Calculus Problem
•Is there a significant relationship between the calculus grades and the IQ scores at the 5% level of significance?
H0: β = 0 versus Ha: β ≠ 0
Test statistic: t = b / √(MSE / S_xx) = .7656 / √(75.7532 / 2474) = 4.38
Reject H0 when |t| > 2.306. Since t = 4.38 falls into the rejection region, H0 is rejected.
There is a significant linear relationship between the calculus grades and the IQ scores for the population of college freshmen.
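The slope t test can be reproduced numerically (a sketch using the estimates computed in the example; the names are my own):

```python
import math

b, MSE, Sxx = 0.76556, 75.7532, 2474.0  # slope estimate, mean squared error, S_xx

t = b / math.sqrt(MSE / Sxx)  # t = b / sqrt(MSE / S_xx)
print(round(t, 2))  # 4.38
# As noted earlier, this test is equivalent to the F test: t^2 = F = 19.14.
```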
Measuring the Strength of the Relationship
• If the independent variable x is useful in predicting y, you will want to know how well the model fits.
• The strength of the relationship between x and y can be measured using:
Correlation coefficient: r = S_xy / √(S_xx S_yy)
Coefficient of determination: r² = (S_xy)² / (S_xx S_yy) = SSR / Total SS
Measuring the Strength of the Relationship
• Since Total SS = SSR + SSE, r² = SSR/Total SS measures
- the proportion of the total variation in the responses that can be explained by using the independent variable x in the model, and
- the percent reduction in the total variation achieved by using the regression equation rather than just using the sample mean ȳ to estimate y.
For the calculus problem, r² = .705, or 70.5%, meaning that 70.5% of the variability in calculus scores can be explained by the model.
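r² can be read straight off the ANOVA decomposition (a sketch; the numbers are the SSR and Total SS from the calculus example):

```python
SSR, total_SS = 1449.9741, 2056.0  # from the ANOVA table for the calculus problem

r_sq = SSR / total_SS  # proportion of variation explained by the model
print(round(r_sq, 3))  # 0.705
```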
Estimation and Prediction
To estimate the average value of y when x = x0 (confidence interval):
ŷ ± t_α/2 √( MSE (1/n + (x0 − x̄)² / S_xx) )

To predict a particular value of y when x = x0 (prediction interval):
ŷ ± t_α/2 √( MSE (1 + 1/n + (x0 − x̄)² / S_xx) )
The Calculus Problem
Estimate the average calculus grade for students whose IQ score is 50 with a 95% confidence interval.
Calculate ŷ = 40.78424 + .76556(50) = 79.06.
ŷ ± t_α/2 √( MSE (1/n + (x0 − x̄)² / S_xx) )
= 79.06 ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )
= 79.06 ± 6.55, or 72.51 to 85.61.
Estimate the calculus grade for a particular student whose IQ score is 50 with a 95% confidence interval.
Calculate ŷ = 40.78424 + .76556(50) = 79.06.
ŷ ± t_α/2 √( MSE (1 + 1/n + (x0 − x̄)² / S_xx) )
= 79.06 ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )
= 79.06 ± 21.11, or 57.95 to 100.17.
Notice how much wider this interval is!
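Both intervals can be computed directly from the formulas above (a sketch; the names and hard-coded values come from the calculus example):

```python
import math

# Estimates and summary values from the calculus example
a, b = 40.78424, 0.76556
MSE, Sxx, xbar, n = 75.7532, 2474.0, 46.0, 10
t_crit = 2.306   # t_.025 with df = 8
x0 = 50          # IQ score of interest

y_hat = a + b * x0
ci_half = t_crit * math.sqrt(MSE * (1/n + (x0 - xbar)**2 / Sxx))      # confidence
pi_half = t_crit * math.sqrt(MSE * (1 + 1/n + (x0 - xbar)**2 / Sxx))  # prediction

print(round(y_hat, 2))                                       # 79.06
print(round(y_hat - ci_half, 2), round(y_hat + ci_half, 2))  # 72.51 85.61
print(round(y_hat - pi_half, 2), round(y_hat + pi_half, 2))  # 57.95 100.17
```

The prediction interval differs from the confidence interval only by the extra "1 +" under the square root, which accounts for the variation of an individual response around the line.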
The Calculus Problem
Predicted Values for New Observations
New Obs  Fit    SE Fit  95.0% CI        95.0% PI
1        79.06  2.84    (72.51, 85.61)  (57.95, 100.17)

Values of Predictors for New Observations
New Obs  x
1        50.0
Minitab Output
Green prediction bands are always wider than red confidence bands.
Both intervals are narrowest when x = x-bar.
Confidence and prediction intervals when x = 50
[Fitted line plot: y = 40.78 + 0.7656 x, with 95% CI and 95% PI bands; S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]
Estimation and Prediction
• Once you have determined that the regression line is useful and have used the diagnostic plots to check for violation of the regression assumptions,
• you are ready to use the regression line to:
Estimate the average value of y for a given value of x
Predict a particular value of y for a given value of x.
• The best estimate of either E(y) or y for a given value x = x0 is
• Particular values of y are more difficult to predict, requiring a wider range of values in the prediction interval.
ŷ = a + bx0
Estimation and Prediction
Regression Assumptions
1. The relationship between x and y is linear, given by y = α + βx + ε.
2. The random error terms ε are independent and, for any value of x, have a normal distribution with mean 0 and constant variance, σ².
•Remember that the results of a regression analysis are only valid when the necessary assumptions have been satisfied.
Assumptions:
Diagnostic Tools
1. Normal probability plot or histogram of residuals
2. Plot of residuals versus fit or residuals versus variables
3. Plot of residual versus order
Residuals
• The residual error is the “leftover” variation in each data point after the variation explained by the regression model has been removed.
• If all assumptions have been met, these residuals should be normal, with mean 0 and variance σ².
Residual = yᵢ − ŷᵢ = yᵢ − (a + bxᵢ)
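Two quick sanity checks on the residuals for the calculus example (a sketch; a and b are the estimates computed earlier, and the names are my own):

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]  # IQ scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]  # calculus grades
a, b = 40.78424, 0.76556                      # least squares estimates

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# Least squares residuals always sum to (essentially) zero,
# and their squared sum reproduces SSE from the ANOVA table (~606.0).
print(round(sum(residuals), 6))
print(round(sum(e**2 for e in residuals), 1))  # 606.0
```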
If the normality assumption is valid, the plot should resemble a straight line, sloping upward to the right.
If not, you will often see the pattern fail in the tails of the graph.
Normal Probability Plot
[Normal probability plot of the residuals (response is y)]
If the equal variance assumption is valid, the plot should appear as a random scatter around the zero center line.
If not, you will see a pattern in the residuals.
Residuals versus Fits
[Plot of residuals versus the fitted values (response is y)]