Correlation & Regression
• A correlation and regression analysis involves investigating the relationship between two (or more) quantitative variables of interest.
• The goal of such an investigation is typically to estimate (predict) the value of one variable based on the observed value of the other variable (or variables).
Quantitative Variables
• Dependent Variable (Y)
  • the variable being predicted
  • called the response variable
• Independent Variable (X)
  • the variable used to explain or predict Y
  • called the explanatory or predictor variable
Correlation & Regression
• Correlation
  • Addresses the questions:
    “Is there a relationship between X and Y?”
    “If so, how strong is it?”
• Regression
  • Addresses the question:
    “What is the relationship between X and Y?”
Simple Linear Relationship
• A linear (straight-line) relationship between Y and a single X.
• The form of the equation is Y = b0 + b1 X, where b0 is the y-intercept and b1 is the slope.
• A scatter plot of Y against X is useful for spotting linear relationships and obvious departures from linearity.
• Always start with a scatter plot!
Correlation
• A correlation exists between two variables when they are related in some way.
• Linear Correlation Coefficient (r)
  • measures the strength of the linear relationship between X and Y
• Properties of r
  • -1 ≤ r ≤ 1
  • r = 1 for a perfect positive linear relationship
  • r = -1 for a perfect negative linear relationship
  • r = 0 if there is no linear relationship
Sample Correlation Coefficient
• A statistic that is useful for estimating the linear correlation coefficient:
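$$r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}, \quad \text{where}$$

$$S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}, \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}$$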
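As a rough illustration of the formula above, the sketch below computes r directly from S_xy, S_xx and S_yy in plain Python. The helper names (`corrected_sum`, `sample_correlation`) and the five data points are made up for this example and do not come from the slides.

```python
import math


def corrected_sum(u, v):
    """S_uv = sum(u*v) - sum(u)*sum(v)/n (gives S_xx when u and v are both x)."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


def sample_correlation(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xy = corrected_sum(x, y)
    s_xx = corrected_sum(x, x)
    s_yy = corrected_sum(y, y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(sample_correlation(x, y), 4))  # 0.7746
```

Here S_xy = 6, S_xx = 10 and S_yy = 6, so r = 6 / sqrt(60), a moderately strong positive linear relationship.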
Coefficient of Determination
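• The coefficient of determination is the proportion of variability in Y that can be explained by its linear relationship to X.
• Computed by squaring the sample correlation coefficient (r²):

$$r^2 = \frac{S_{xy}^2}{S_{xx}\, S_{yy}} = 1 - \frac{SSE}{TSS}$$

where SSE is the sum of squared errors about the regression line and TSS = S_yy is the total sum of squares of Y about its mean.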
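The two expressions for r² can be checked against each other numerically. The sketch below is one way to do that in plain Python; it assumes the least squares line with slope S_xy / S_xx to obtain SSE, and the function names and data are hypothetical.

```python
def corrected_sum(u, v):
    """S_uv = sum(u*v) - sum(u)*sum(v)/n."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


def r_squared_two_ways(x, y):
    """Return (S_xy^2 / (S_xx * S_yy), 1 - SSE / TSS)."""
    n = len(x)
    s_xy = corrected_sum(x, y)
    s_xx = corrected_sum(x, x)
    s_yy = corrected_sum(y, y)

    # Least squares line, needed to measure the errors about the line.
    b1 = s_xy / s_xx
    b0 = sum(y) / n - b1 * sum(x) / n

    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = s_yy  # total sum of squares of y about its mean
    return s_xy ** 2 / (s_xx * s_yy), 1 - sse / tss


# Hypothetical illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
from_r, from_sse = r_squared_two_ways(x, y)
print(round(from_r, 4), round(from_sse, 4))  # 0.6 0.6
```

For this data, 60% of the variability in Y is explained by the linear relationship with X, whichever form is used.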
Hypothesis Testing of the Linear Correlation Coefficient
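• Appropriate Hypotheses (ρ is the population linear correlation coefficient):

$$H_0: \rho = 0 \quad \text{(No Linear Relationship)}$$
$$H_1: \rho \neq 0 \quad \text{(Linear Relationship)}$$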
Testing r
• Test Statistic:

$$t = r \sqrt{\frac{n-2}{1-r^2}}, \qquad df = n - 2$$

• Rejection Region (3 cases of H1)
  1. Two-tailed: For H1: ρ ≠ 0, Reject H0 for |t| ≥ tα/2
  2. Left-tailed: For H1: ρ < 0, Reject H0 for t ≤ -tα
  3. Right-tailed: For H1: ρ > 0, Reject H0 for t ≥ tα
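The test can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, the critical value is assumed to be looked up by the reader in a t table (tα/2 for the two-tailed case, tα for the one-tailed cases), and the numbers in the usage lines are hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """Return (t, df) with t = r * sqrt((n - 2) / (1 - r^2)) and df = n - 2."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return t, n - 2


def reject_h0(t, t_crit, tail="two"):
    """Apply the rejection region for the chosen form of H1.

    t_crit is the positive table value: t_{alpha/2} for tail="two",
    t_{alpha} for tail="left" or tail="right".
    """
    if tail == "two":
        return abs(t) >= t_crit
    if tail == "left":
        return t <= -t_crit
    if tail == "right":
        return t >= t_crit
    raise ValueError("tail must be 'two', 'left' or 'right'")


# Hypothetical example: r = 0.7746 from n = 5 observations.
t, df = correlation_t_statistic(0.7746, 5)
print(round(t, 2), df)  # 2.12 3

# From a standard t table, t_{0.025} with 3 degrees of freedom is 3.182.
print(reject_h0(t, 3.182, tail="two"))  # False
```

With so few observations, even r ≈ 0.77 is not enough to reject H0 at the 5% level in a two-tailed test.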
Simple Linear Regression
• The Least Squares Regression line is our "best" line for explaining the relationship between Y and X.
• It minimizes the sum of squared errors (the squared vertical distances between the observed values and the values predicted by the line):

$$f(b_0, b_1) = \sum_{i=1}^{n} \left(y_i - b_0 - b_1 x_i\right)^2$$

• The predicted value of Y for any X can be found by plugging X into the least squares regression line.
Simple Linear Regression Line
• The equation is:

$$\hat{y} = b_0 + b_1 x$$

where

$$b_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}$$
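A minimal sketch of fitting the line with these formulas is shown below. The function names (`fit_line`, `sum_squared_error`) and the data are hypothetical; the second function evaluates the sum of squared errors so that the fitted line can be compared with a slightly perturbed one.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    b1 = s_xy / s_xx
    b0 = sum(y) / n - b1 * sum(x) / n
    return b0, b1


def sum_squared_error(x, y, b0, b1):
    """Sum over i of (y_i - b0 - b1 * x_i)^2 for a candidate line."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 4), round(b1, 4))  # 2.2 0.6

# Predicted value of Y at X = 3: plug X into the fitted line.
print(round(b0 + b1 * 3, 4))  # 4.0

# Any other line gives a larger sum of squared errors.
best = sum_squared_error(x, y, b0, b1)
print(best < sum_squared_error(x, y, b0 + 0.1, b1))  # True
print(best < sum_squared_error(x, y, b0, b1 - 0.1))  # True
```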
Proper Use of Correlation & Regression
• Correlation does not imply causation.
• Simple linear regression is appropriate only if the data cluster about a line.
• Do not extrapolate beyond the range of the observed X values.
• Do not apply the model to other populations.
• For multiple regression, the size of a parameter does not indicate its importance.
Effect of Extreme Values
• Extreme values can have a very large effect on correlation and regression analysis.
• Influential outliers can strongly affect the model fit.
• Regression Applet by Webster West
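To make the point concrete, the sketch below computes the sample correlation for a small made-up data set, then again after adding a single extreme point. The data and the function name are hypothetical and chosen only to show how far one observation can move r.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data with a clear positive trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_without = sample_correlation(x, y)

# The same data plus one extreme point at (10, 0).
r_with = sample_correlation(x + [10], y + [0])

print(r_without > 0)  # True
print(r_with < 0)     # True: one point reverses the sign of r
```

A scatter plot would show the extreme point immediately, which is one reason to look at the plot before computing anything.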
Model Assumptions for Inference
The difference between the observed and the model-predicted values is called the residual, and is denoted by e:

$$e = y - \hat{y}$$

The residuals are assumed to be independent and identically normally distributed with mean 0 and standard deviation σe.
So for a particular X, the distribution of Y can be described as normal with mean equal to the predicted value of Y for that X, and standard deviation equal to σe.
Inference about the Simple Linear Regression Model Parameters
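Is there a significant relationship between X and Y?
H0: β1 = 0 versus H1: β1 ≠ 0, where β1 is the true slope estimated by b1.
• Test Statistic:

$$T = \frac{b_1 - \beta_1}{s / \sqrt{S_{xx}}}, \qquad df = n - 2,$$

where

$$s = \sqrt{\frac{SSE}{n-2}}, \qquad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

Under H0, β1 = 0, so the statistic reduces to T = b1 / (s / √Sxx).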
Inference about the Simple Linear Regression Model Parameters
• Rejection Region for the true slope β1 (3 cases of H1)
  1. Two-tailed: For H1: β1 ≠ 0, Reject H0 for |T| ≥ tα/2
  2. Left-tailed: For H1: β1 < 0, Reject H0 for T ≤ -tα
  3. Right-tailed: For H1: β1 > 0, Reject H0 for T ≥ tα
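A sketch of the slope test statistic, computed from S_xx, S_yy and S_xy, might look like the following. The function name and data are hypothetical, and the statistic is evaluated under H0 (true slope equal to 0).

```python
import math


def slope_t_statistic(x, y):
    """Return (T, df) for testing a zero slope: T = b1 / (s / sqrt(S_xx))."""
    n = len(x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

    b1 = s_xy / s_xx
    sse = s_yy - s_xy ** 2 / s_xx      # sum of squared errors
    s = math.sqrt(sse / (n - 2))       # estimated standard deviation
    return b1 / (s / math.sqrt(s_xx)), n - 2


# Hypothetical illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T, df = slope_t_statistic(x, y)
print(round(T, 4), df)  # 2.1213 3
```

For simple linear regression this value coincides with the t statistic for testing the correlation coefficient computed from the same data, so the two tests lead to the same decision.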
Inference about the Simple Linear Regression Model Parameters
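Is there a non-zero y-intercept in the linear relationship between X and Y?
H0: β0 = 0 versus H1: β0 ≠ 0, where β0 is the true y-intercept estimated by b0.
• Test Statistic:

$$T = \frac{b_0 - \beta_0}{s \sqrt{\dfrac{\sum x^2}{n\, S_{xx}}}}, \qquad df = n - 2,$$

where

$$s = \sqrt{\frac{SSE}{n-2}}, \qquad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

Under H0, β0 = 0, so the numerator is simply b0.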
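The intercept test follows the same pattern with a different standard error. The sketch below assumes the standard error s · sqrt(Σx² / (n · S_xx)) and evaluates the statistic under H0 (true intercept equal to 0); the function name and data are hypothetical.

```python
import math


def intercept_t_statistic(x, y):
    """Return (T, df) for testing a zero intercept.

    T = b0 / (s * sqrt(sum(x^2) / (n * S_xx))), with df = n - 2.
    """
    n = len(x)
    sum_x2 = sum(a * a for a in x)
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum_x2 - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

    b1 = s_xy / s_xx
    b0 = sum(y) / n - b1 * sum(x) / n
    s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))
    standard_error = s * math.sqrt(sum_x2 / (n * s_xx))
    return b0 / standard_error, n - 2


# Hypothetical illustrative data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T, df = intercept_t_statistic(x, y)
print(round(T, 4), df)  # 2.3452 3
```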
Inference about a Regression Line
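E(Y) is the expected value of Y. For a given X, E(Y) is estimated by evaluating the simple linear regression equation at X. A t-distribution with n - 2 degrees of freedom gives a confidence interval for the true mean value of Y at a given X = x*:

$$\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}, \qquad \text{where } \hat{y}^* = b_0 + b_1 x^*$$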
Inference about Y for a Given X
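The predicted value of a single observation of Y for a given X is equal to the estimate of E(Y). A t-distribution with n - 2 degrees of freedom allows the construction of a prediction interval for a single observation at a particular value X = x*:

$$\hat{y}^* \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}, \qquad \text{where } \hat{y}^* = b_0 + b_1 x^*$$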
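The confidence interval for the mean of Y and the prediction interval for a single observation differ only by the extra 1 under the square root. The sketch below computes both half-widths side by side; the function name, the data, and the choice of x* are hypothetical, and the critical value tα/2 is assumed to be read from a t table with n - 2 degrees of freedom.

```python
import math


def regression_intervals(x, y, x_star, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at X = x_star.

    t_crit is the table value t_{alpha/2} with n - 2 degrees of freedom.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    s_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    s_yy = sum(b * b for b in y) - sum(y) ** 2 / n

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

    y_hat = b0 + b1 * x_star
    spread = 1 / n + (x_star - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(spread)       # mean of Y at x_star
    pi_half = t_crit * s * math.sqrt(1 + spread)   # single observation
    return y_hat, ci_half, pi_half


# Hypothetical data; 3.182 is t_{0.025} with 3 degrees of freedom.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
y_hat, ci_half, pi_half = regression_intervals(x, y, 3, 3.182)
print(round(y_hat, 4), round(ci_half, 4))  # 4.0 1.2728
print(pi_half > ci_half)  # True
```

The prediction interval is always the wider of the two, because it accounts for the scatter of an individual observation in addition to the uncertainty in the estimated line.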
Residual Analysis
Can be useful for checking the model assumptions, which for the linear regression model are:
• Independent observations
• Residuals have N(0, σ²) distribution
Plots can be useful for spotting model inadequacy.
Variable Selection in Multiple Regression
• Compare all possible regressions
• Backward elimination
• Forward selection
• Stepwise elimination