BINF 702 Chapter 11 Regression and Correlation Methods
TRANSCRIPT
Chapter 11 Regression and Correlation Methods (SPRING 2014)
Section 11.1 Introduction
Example 11.1 Obstetrics Obstetricians sometimes order tests for estriol levels from 24-hour urine specimens taken from pregnant women who are near term, since the level of estriol has been found to be related to the birthweight of the infant. The test can provide indirect evidence of an abnormally small fetus. The relationship between estriol level and birthweight can be quantified by fitting a regression line that relates the two variables.
Example 11.2 Hypertension Much discussion has taken place in the literature concerning the familial aggregation of blood pressure. In general, children whose parents have high blood pressure tend to have higher blood pressure than their peers. One way of expressing this relationship is to compute a correlation coefficient relating the blood pressure of parents and children over a large collection of families.
Section 11.2 General Concepts
Let us return to our consideration of the relationship between estriol level and birthweight data. Let x = estriol level and y = birthweight. We might posit a relationship such as
Eq. 11.1 E(y|x) = α + βx
Our regression line is defined as
Def. 11.1 – y = α + βx, where α is the y-intercept and β is the slope.
It is expected, of course, that our regression line will not fit exactly; there will be some error associated with the fit.
Eq. 11.2 y = α + βx + e, where e ~ N(0, σ^2), x is the independent variable, and y is the dependent variable.
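To make the model concrete, here is a minimal simulation sketch (the parameter values and seed below are hypothetical choices for illustration, not from the text): generate data from y = α + βx + e with normal errors and check that lm() recovers the parameters.

```r
# Hypothetical illustration of Eq. 11.2: y = alpha + beta*x + e, e ~ N(0, sigma^2).
# Parameter values and the seed are arbitrary choices, not from the text.
set.seed(702)
x <- seq(5, 30, length.out = 200)         # simulated independent variable
y <- 21.5 + 0.6 * x + rnorm(200, sd = 2)  # true alpha = 21.5, beta = 0.6
fit <- lm(y ~ x)
coef(fit)  # estimates should land near 21.5 and 0.6
```

With 200 points and moderate noise, the least-squares estimates sit close to the true parameters; rerunning with a larger sd shows the estimates (especially the intercept) becoming noisier.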
Section 11.2 General Concepts
A linear regression fit for our birthweight data
Section 11.2 General Concepts
Some nuances of the fit
We can vary noise.
b may vary.
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Def. 11.3 – The least-squares line, or estimated regression line, is the line y = a + bx minimizing the sum of squared distances of the sample points from the line, given by
S = Σ(i=1..n) di^2
where di is the vertical distance from the ith sample point to the line.
Eq. 11.3 Estimation of the Least-Squares Line The coefficients of the least-squares line y = a + bx are given by
b = Lxy/Lxx and a = (Σ yi − b·Σ xi)/n = ȳ − b·x̄
We choose this criterion because the math is tractable.
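As a numerical check on Eq. 11.3, the coefficients can be computed by hand from Lxy/Lxx and the sample means, then compared with lm(). This sketch uses the estriol/birthweight data of Example 11.8; the helper names (Lxx, Lxy) mirror the text's notation.

```r
# Least-squares coefficients computed directly from Eq. 11.3 and checked
# against lm(). Data: estriol (es) and birthweight (bw) from Example 11.8.
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n   <- length(es)
Lxx <- sum(es^2) - sum(es)^2 / n             # corrected sum of squares for x
Lxy <- sum(es * bw) - sum(es) * sum(bw) / n  # corrected cross-product
b <- Lxy / Lxx                 # estimated slope
a <- mean(bw) - b * mean(es)   # estimated intercept
fit <- lm(bw ~ es)
c(a = a, b = b)                # should match coef(fit): about 21.523 and 0.608
```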
Section 11.3 – Fitting Regression Lines The Method of Least Squares
where
Lxx = Σ(i=1..n) xi^2 − (Σ(i=1..n) xi)^2 / n
Lxy = Σ(i=1..n) xi·yi − (Σ(i=1..n) xi)(Σ(i=1..n) yi) / n
DEF. 11.6 The predicted, or average, value of y for a given value of x, as estimated from the fitted regression line, is denoted by ŷ = a + bx.
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Regression in R
lm {stats}
R Documentation
Fitting Linear Models
Description
lm is used to fit linear models. It can be used to
carry out regression, single stratum analysis of
variance and analysis of covariance (although aov may
provide a more convenient interface for these).
Usage
lm(formula, data, subset, weights, na.action, method =
"qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Regression in R (The Arguments)
formula: a symbolic description of the model to be fit. The details of model specification are given below.
data: an optional data frame containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
weights: an optional vector of weights to be used in the fitting process. If specified, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.
na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The "factory-fresh" default is na.omit. Another possible value is NULL, no action.
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Regression in R (The Arguments)
method: the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).
model, x, y, qr: logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.
singular.ok: logical. If FALSE (the default in S but not in R) a singular fit is an error.
contrasts: an optional list. See the contrasts.arg of model.matrix.default.
offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. An offset term can be included in the formula instead or as well, and if both are specified their sum is used.
...: additional arguments to be passed to the low level regression fitting functions (see below).
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Regression in R (Some of the Details) Models for lm are specified symbolically. A typical model has the form
response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second. If response is a matrix a linear model is fitted to each column of the matrix. See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula.
A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae.
lm calls the lower level functions lm.fit, etc, see below, for the actual numerical computations. For programming only, you may consider doing likewise.
All of weights, subset and offset are evaluated in the same way as variables in formula, that is first in data and then in the environment of formula.
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Regression in R (Some of the Details) lm returns an object of class "lm" or for multiple responses of class
c("mlm", "lm"). The functions summary and anova are used to obtain and print a summary and analysis of variance table of the results.
The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm. An object of class "lm" is a list containing at least the following components:
coefficients: a named vector of coefficients.
residuals: the residuals, that is response minus fitted values.
fitted.values: the fitted mean values.
rank: the numeric rank of the fitted linear model.
weights: (only for weighted fits) the specified weights.
df.residual: the residual degrees of freedom.
call: the matched call.
terms: the terms object used.
contrasts: (only where relevant) the contrasts used.
xlevels: (only where relevant) a record of the levels of the factors used in fitting.
y: if requested, the response used.
x: if requested, the model matrix used.
model: if requested (the default), the model frame used.
Section 11.3 – Fitting Regression Lines The Method of Least Squares
Example 11.8 Obstetrics Birthweight as a function of estriol in R.
es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
library(stats)
bw.lm = lm(bw ~ es)
bw.lm$coefficients
(Intercept) es
21.5234286 0.6081905
plot(es,bw)
lines(es, 0.6081905 * es + 21.5234286)
Section 11.4 Inferences About Parameters from Regression Lines
EQ 11.5 Decomposition of the Total Sum of Squares into Regression and Residual Components
Σ(i=1..n) (yi − ȳ)^2 = Σ(i=1..n) (ŷi − ȳ)^2 + Σ(i=1..n) (yi − ŷi)^2
Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares
A good-fitting regression line will have regression components that are large in absolute value relative to the residual components, whereas the opposite is true for poor-fitting lines.
Check out Figure 11.6
11.4.1 F Test for Simple Linear Regression
We will use the ratio of the regression sum of squares to the residual sum of squares as a regression test. A large ratio indicates a good fit. We are testing H0: β = 0 versus H1: β != 0, where β is the slope of the regression line.
Some helpful notation
Regression mean square (Reg MS) is (Reg SS)/k, where k is the number of predictors in the model (k = 1 for simple linear regression).
Residual mean square (Res MS) is (Res SS)/(n − k − 1), where n − k − 1 = Res df is the degrees of freedom of the residual sum of squares. In the literature Res MS is often written s^2(y·x).
Reg SS = b·Lxy = b^2·Lxx = Lxy^2/Lxx
Res SS = Total SS − Reg SS = Lyy − Lxy^2/Lxx
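These identities are easy to verify numerically; a sketch on the Example 11.8 data (variable names are mine):

```r
# Verify Total SS = Reg SS + Res SS and the shortcut formulas
# Reg SS = Lxy^2/Lxx and Res SS = Lyy - Lxy^2/Lxx (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n   <- length(es)
Lxx <- sum(es^2) - sum(es)^2 / n
Lyy <- sum(bw^2) - sum(bw)^2 / n
Lxy <- sum(es * bw) - sum(es) * sum(bw) / n
fit      <- lm(bw ~ es)
total.ss <- sum((bw - mean(bw))^2)            # equals Lyy
reg.ss   <- sum((fitted(fit) - mean(bw))^2)
res.ss   <- sum(resid(fit)^2)
c(total = total.ss, reg = reg.ss, res = res.ss)
```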
11.4.1 F Test for Simple Linear Regression
Eq. 11.7 F Test for Simple Linear Regression To test H0: β = 0 versus H1: β != 0, use the following procedure:
1) Compute the test statistic
F = Reg MS/Res MS = (Lxy^2/Lxx) / [(Lyy − Lxy^2/Lxx)/(n − 2)]
which follows an F(1, n−2) distribution under H0.
2) For a two-sided test with significance level α, if F > F(1, n−2, 1−α) then reject H0; if F <= F(1, n−2, 1−α) then accept H0.
3) The exact p-value is given by P(F(1, n−2) > F).
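The F statistic can be assembled by hand from the mean squares and checked against anova(); a sketch on the Example 11.8 data:

```r
# F statistic for simple linear regression computed by hand (Eq. 11.7)
# and compared with the anova() table (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n <- length(es); k <- 1                       # one predictor
fit    <- lm(bw ~ es)
reg.ms <- sum((fitted(fit) - mean(bw))^2) / k
res.ms <- sum(resid(fit)^2) / (n - k - 1)
F <- reg.ms / res.ms
p <- pf(F, k, n - k - 1, lower.tail = FALSE)  # P(F_{1,n-2} > F)
c(F = F, p = p)   # about 17.16 and 0.00027, matching the output shown later
```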
11.4.1 F Test for Simple Linear Regression
Def. 11.14 R^2 is defined as (Reg SS)/(Total SS).
Interpretation of R^2:
R^2 can be thought of as the proportion of the variance of y that can be explained by the variable x.
R^2 = 1: all of the data points fall on the regression line.
R^2 = 0: x gives no information about the variance of y.
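A quick check of Def. 11.14 on the Example 11.8 data; in simple linear regression R^2 should also equal the squared Pearson correlation:

```r
# R^2 from Def. 11.14, checked against summary()$r.squared and cor()^2
# (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
fit <- lm(bw ~ es)
r2 <- sum((fitted(fit) - mean(bw))^2) / sum((bw - mean(bw))^2)
r2   # about 0.3718, matching Multiple R-Squared in summary(fit)
```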
11.4.1 F Test for Simple Linear Regression
The obstetrics data revisited in R
> summary(bw.lm)
Call:
lm(formula = bw ~ es)
Residuals:
Min 1Q Median 3Q Max
-8.12000 -2.03810 -0.03810 3.35371 6.88000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***
es 0.6082 0.1468 4.143 0.000271 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 3.821 on 29 degrees of freedom
Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501
F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712
11.4.1 F Test for Simple Linear Regression
Using aov in R to perform the regression fit on the obstetrics data
> summary(aov(bw ~ es))
Df Sum Sq Mean Sq F value Pr(>F)
es 1 250.57 250.57 17.162 0.0002712 ***
Residuals 29 423.43 14.60
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
11.4.2 t Test for Simple Linear Regression
EQ 11.8 t Test for Simple Linear Regression To test the hypothesis H0: β = 0 versus H1: β != 0, use the following procedure:
1) Compute the test statistic
t = b / (s^2(y·x)/Lxx)^(1/2)
2) For a two-sided test with significance level α, if
t > t(n−2, 1−α/2) or t < t(n−2, α/2) = −t(n−2, 1−α/2)
then reject H0; if −t(n−2, 1−α/2) <= t <= t(n−2, 1−α/2)
then accept H0.
3) The p-value is given by
p = 2 × (area to the left of t under a t(n−2) distribution) if t < 0
p = 2 × (area to the right of t under a t(n−2) distribution) if t >= 0
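The t statistic of Eq. 11.8 can likewise be computed by hand; on the Example 11.8 data it should reproduce the es row of the summary output, and t^2 should equal the F statistic:

```r
# t statistic for the slope computed by hand (Eq. 11.8) and compared with
# the es row of summary(lm(bw ~ es)) (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n <- length(es)
fit   <- lm(bw ~ es)
Lxx   <- sum(es^2) - sum(es)^2 / n
s2.yx <- sum(resid(fit)^2) / (n - 2)   # residual mean square
b <- unname(coef(fit))[2]
t <- b / sqrt(s2.yx / Lxx)
p <- 2 * pt(-abs(t), df = n - 2)
c(t = t, p = p)   # about 4.143 and 0.000271
```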
11.4.2 t Test for Simple Linear Regression The R output of the obstetrics data revisited
> summary(bw.lm)
Call:
lm(formula = bw ~ es)
Residuals:
Min 1Q Median 3Q Max
-8.12000 -2.03810 -0.03810 3.35371 6.88000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***
es 0.6082 0.1468 4.143 0.000271 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 3.821 on 29 degrees of freedom
Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501
F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712
11.5 Interval Estimation for Linear Regression
11.5.1 Interval Estimates for Regression Parameters
Under certain assumptions, how well can we quantify the uncertainty in our estimates of the slope and y-intercept?
11.5.2 Interval Estimation for Predictions Made from Regression Line
Under certain assumptions, how well can we quantify the uncertainty in our estimates of the predicted values?
11.5 Interval Estimation for Linear Regression – 11.5.1 Interval Estimates for Regression Parameters
Eq. 11.9 Standard Errors of Estimated Parameters in Simple Linear Regression
se(b) = sqrt(s^2(y·x)/Lxx)
se(a) = s(y·x) · sqrt(1/n + x̄^2/Lxx)
where s^2(y·x) is the residual mean square (Res MS).
11.5 Interval Estimation for Linear Regression – 11.5.1 Interval Estimates for Regression Parameters
Eq. 11.10 Two-Sided 100% × (1 − α) Confidence Intervals for the Parameters of a Regression Line If b and a are, respectively, the estimated slope and intercept of a regression line as given on the previous slide, and se(b) and se(a) are their estimated standard errors, then the two-sided 100% × (1 − α) confidence intervals for β and α are given by
b ± t(n−2, 1−α/2) · se(b)
a ± t(n−2, 1−α/2) · se(a)
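A sketch checking Eq. 11.9 and Eq. 11.10 against R's confint() on the Example 11.8 data (helper names are mine):

```r
# Eq. 11.10 confidence intervals built from the Eq. 11.9 standard errors,
# compared with confint() (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n <- length(es)
fit  <- lm(bw ~ es)
Lxx  <- sum(es^2) - sum(es)^2 / n
s.yx <- sqrt(sum(resid(fit)^2) / (n - 2))
se.b <- s.yx / sqrt(Lxx)                        # Eq. 11.9
se.a <- s.yx * sqrt(1 / n + mean(es)^2 / Lxx)   # Eq. 11.9
tc   <- qt(0.975, df = n - 2)                   # alpha = 0.05
ci.a <- coef(fit)[1] + c(-1, 1) * tc * se.a
ci.b <- coef(fit)[2] + c(-1, 1) * tc * se.b
rbind(a = ci.a, b = ci.b)   # should agree with confint(fit)
```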
11.5.1 Interval Estimates for Regression Parameters
Confidence intervals on regression parameters in R
> summary(bw.lm)
Call:
lm(formula = bw ~ es)
Residuals:
Min 1Q Median 3Q Max
-8.12000 -2.03810 -0.03810 3.35371 6.88000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.5234 2.6204 8.214 4.68e-09 ***
es 0.6082 0.1468 4.143 0.000271 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 3.821 on 29 degrees of freedom
Multiple R-Squared: 0.3718, Adjusted R-squared: 0.3501
F-statistic: 17.16 on 1 and 29 DF, p-value: 0.0002712
11.5.2 Interval Estimation for Predictions Made from Regression Lines
A pedagogical example Forced expiratory volume (FEV) is a standard measure of pulmonary function. To identify people with abnormal pulmonary function, standards of FEV for normal people must be established. One problem here is that FEV is related to both age and height. Let us focus on boys who are ages 10-15 and postulate a regression model of the form FEV = α + β(height) + e. Data were collected on FEV and height for 655 boys in this age group residing in Tecumseh, Michigan. The mean FEV in liters is presented for each of twelve 4-cm height groups in the table below. Find the best-fitting regression line and test for statistical significance. What proportion of the variance of FEV can be explained by height?
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Our FEV pedagogical example continued.
Height (cm)  Mean FEV (L)
134  1.7
138  1.9
142  2.0
146  2.1
150  2.2
154  2.5
158  2.7
162  3.0
166  3.1
170  3.4
174  3.8
178  3.9
11.5.2 Interval Estimation for Predictions Made from Regression Lines
11.5.2 Interval Estimation for Predictions Made from Regression Lines
EX. 11.17 Pulmonary Function Suppose we wish to use the FEV-height regression line computed previously to develop normal ranges for 10- to 15-year-old boys of particular heights. In particular, consider John H., who is 12 years old and 160 cm tall and whose FEV is 2.5 L. Can his FEV be considered abnormal for his age and height?
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Eq. 11.11 Predictions Made from Regression Lines for Individual Observations Suppose we wish to make predictions from a regression line for an individual observation with independent variable x that was not used in constructing the regression line. The distribution of observed y values for the subset of individuals with independent variable x is normal with mean
ŷ = a + bx
and standard deviation given by
se1(ŷ) = s(y·x) · sqrt(1 + 1/n + (x − x̄)^2/Lxx)
Furthermore, 100% × (1 − α) of the observed values will fall within the interval
ŷ ± t(n−2, 1−α/2) · se1(ŷ)
This interval is sometimes called a 100% × (1 − α) prediction interval for y.
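Fitting the twelve group means from the FEV table above reproduces the predict() output shown on the next slide; here is a hand computation of the Eq. 11.11 interval at x = 160 cm (helper names are mine):

```r
# Eq. 11.11 prediction interval at x = 160 cm computed by hand and compared
# with predict(..., interval = 'prediction'). Data: the twelve FEV group means.
ht  <- seq(134, 178, by = 4)
fev <- c(1.7,1.9,2.0,2.1,2.2,2.5,2.7,3.0,3.1,3.4,3.8,3.9)
n <- length(ht)
fit  <- lm(fev ~ ht)
Lxx  <- sum(ht^2) - sum(ht)^2 / n
s.yx <- sqrt(sum(resid(fit)^2) / (n - 2))
x0   <- 160
yhat <- unname(coef(fit)[1] + coef(fit)[2] * x0)
se1  <- s.yx * sqrt(1 + 1 / n + (x0 - mean(ht))^2 / Lxx)
tc   <- qt(0.975, df = n - 2)
c(fit = yhat, lwr = yhat - tc * se1, upr = yhat + tc * se1)
# about 2.897, 2.617, 3.177 -- matching the predict() output on the next slide
```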
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Predicted Confidence Intervals in R
> new = list(ht=160)
> predict(fev.lm,new,interval='prediction')
fit lwr upr
[1,] 2.896911 2.616527 3.177295
We note that John's observed value of 2.5 does not fall within the prediction interval. John merits follow-up.
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Suppose we wish to assess the mean FEV value for a large number of boys with the same x value.
Eq. 11.12 Standard Error and Confidence Interval for Predictions Made from Regression Lines for the Average Value of y for a Given x The best estimate of the average value of y for a given x is
ŷ = a + bx
Its standard error is given by
se2(ŷ) = s(y·x) · sqrt(1/n + (x − x̄)^2/Lxx)
Furthermore, a two-sided 100% × (1 − α) confidence interval for the average value of y is
ŷ ± t(n−2, 1−α/2) · se2(ŷ)
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Predicted Confidence Intervals in R for the average value of y
> predict(fev.lm,new,interval='confidence')
fit lwr upr
[1,] 2.896911 2.81621 2.977613
This is sometimes denoted within the statistics community as the confidence interval for the regression function.
11.5.2 Interval Estimation for Predictions Made from Regression Lines
Example 11.21
11.6 Assessing the Goodness of Fit of Regression Lines
Eq. 11.13 Assumptions Made in Linear-Regression Models
1) For any given value of x, the corresponding value of y has an average value of α + βx, which is a linear function of x.
2) For any given value of x, the corresponding value of y is normally distributed about α + βx with the same variance σ^2 for any x.
3) For any two data points (x1, y1), (x2, y2), the error terms e1, e2, are independent of each other.
11.6 Assessing the Goodness of Fit of Regression Lines
The simplest type of diagnostic plot.
There may be more variability for larger values of es. Which assumption is this violating?
11.6 Assessing the Goodness of Fit of Regression Lines
Eq. 11.14 Standard Deviation of Residuals About the Fitted Regression Line Let (xi, yi) be a sample point used in estimating the regression line y = α + βx. If ŷ = a + bx is the estimated regression line, and
êi = yi − (a + b·xi) = residual for the point (xi, yi) about the estimated regression line,
then
sd(êi) = s(y·x) · sqrt(1 − 1/n − (xi − x̄)^2/Lxx)
The Studentized residual corresponding to the point (xi, yi) is given by êi / sd(êi).
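The Eq. 11.14 formula can be checked against R's built-in rstandard(); a sketch on the Example 11.8 data:

```r
# Studentized residuals from the Eq. 11.14 formula, checked against
# rstandard() (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n <- length(es)
fit  <- lm(bw ~ es)
Lxx  <- sum(es^2) - sum(es)^2 / n
s.yx <- sqrt(sum(resid(fit)^2) / (n - 2))
sd.e <- s.yx * sqrt(1 - 1 / n - (es - mean(es))^2 / Lxx)
stud <- resid(fit) / sd.e
head(stud)   # identical to head(rstandard(fit))
```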
11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - I)
11.6 Assessing the Goodness of Fit of Regression Lines (Regression Diagnostic Plots in R - II)
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
Assessing uniformity of variance and linearity of residual structure.
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
Assessing normality of residual structure with QQ plots.
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
A few EDA type plots for assessment of normality.
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
QQ plots for various types of distributions.
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
Cook's Distance for the i-th observation is based on the differences between the predicted responses from the model constructed from all of the data and the predicted responses from the model constructed by setting the i-th observation aside. For each observation, the sum of squared residuals is divided by (p+1) times the Residual Mean Square from the full model. Some analysts suggest investigating observations for which Cook's distance is greater than 1. Others suggest looking at a dot plot to find extreme values.
Cook's distance plots.
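The leave-one-out description above can be verified directly against cooks.distance(); a sketch on the Example 11.8 data (the loop refits the model n times, which is fine for small n):

```r
# Cook's distance computed from the leave-one-out definition in the text
# and compared with cooks.distance() (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
dat <- data.frame(es, bw)
n <- nrow(dat); p <- 1                        # one predictor
fit    <- lm(bw ~ es, data = dat)
ms.res <- sum(resid(fit)^2) / (n - p - 1)     # residual mean square, full model
d <- numeric(n)
for (i in seq_len(n)) {
  fit.i  <- lm(bw ~ es, data = dat[-i, ])     # refit with observation i set aside
  yhat.i <- predict(fit.i, newdata = dat)     # predictions at all n points
  d[i]   <- sum((fitted(fit) - yhat.i)^2) / ((p + 1) * ms.res)
}
which(d > 1)   # the rough screen suggested in the text
```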
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
A pedagogical example. Age is age at first word (x-values) and gesell (y-values) is the Gesell adaptive score.
age = c(15,26,10,9,15,20,18,11,8,20,7,9,10,11,11,10,12,42,17,11,10)
gesell = c(95,71,83,91,102,87,93,100,104,94,113,96,83,84,102,100,105,57,121,86,100)
> plot(gesell ~ age)
> identify(gesell ~ age)
[1] 2 18 19
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
Gesell example continued
11.6 Assessing the Goodness of Fit of Regression Lines (Interpreting the Regression Diagnostic Plots in R)
11.7 The Correlation Coefficient
The sample correlation coefficient offers an alternative way to measure the linear association between two variables; one can use it rather than the regression coefficient. The sample (Pearson) correlation coefficient is given by
r = Lxy/sqrt(Lxx*Lyy)
Properties of r:
r > 0: positively correlated
r < 0: negatively correlated
r = 0: uncorrelated
11.7 The Correlation Coefficient
Relationship between the sample correlation coefficient r and the population correlation coefficient ρ = Cov(x, y)/(σx·σy): since Lxy/(n − 1) is the sample covariance and sx = sqrt(Lxx/(n − 1)) and sy = sqrt(Lyy/(n − 1)) are the sample standard deviations,
r = Lxy/sqrt(Lxx·Lyy) = [Lxy/(n − 1)]/(sx·sy)
so r is the sample analogue of ρ.
11.7 The Correlation Coefficient
There is actually a simple relationship between the sample correlation coefficient and the regression coefficient:
b = r · (sy/sx)
So these two quantities really are just rescaled versions of one another.
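Both identities can be confirmed numerically on the Example 11.8 data:

```r
# Check r = Lxy/sqrt(Lxx*Lyy) = [Lxy/(n-1)]/(sx*sy) against cor(), and the
# relationship b = r*(sy/sx) against the lm() slope (Example 11.8 data).
es <- c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
bw <- c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
n   <- length(es)
Lxx <- sum(es^2) - sum(es)^2 / n
Lyy <- sum(bw^2) - sum(bw)^2 / n
Lxy <- sum(es * bw) - sum(es) * sum(bw) / n
r <- Lxy / sqrt(Lxx * Lyy)
b <- unname(coef(lm(bw ~ es)))[2]
c(r = r, b = b, b.from.r = r * sd(bw) / sd(es))   # r about 0.6097
```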
11.7 The Correlation Coefficient
The sample Pearson correlation coefficient, r, in R
Example 11.24
> es = c(7,9,9,12,14,16,16,14,16,16,17,19,21,24,15,16,17,25,27,15,15,15,16,19,18,17,18,20,22,25,24)
> bw = c(25,25,25,27,27,27,24,30,30,31,30,31,30,28,32,32,32,32,34,34,34,35,35,34,35,36,37,38,40,39,43)
> cor(es,bw,method='pearson')
[1] 0.6097313
11.8 Statistical Inference for Correlation Coefficients : One-Sample t-Test for a Correlation Coefficient
Eq. 11.20 One-Sample t Test for a Correlation Coefficient To test the hypothesis H0: ρ = 0 versus H1: ρ != 0, use the following procedure:
1) Compute the sample correlation coefficient r.
2) Compute the test statistic
t = r(n − 2)^(1/2) / (1 − r^2)^(1/2)
which under H0 follows a t distribution with n − 2 df.
3) For a two-sided level α test, if
t > t(n−2, 1−α/2) or t < −t(n−2, 1−α/2) then reject H0; if −t(n−2, 1−α/2) <= t <= t(n−2, 1−α/2) then accept H0.
4) The p-value is given by
p = 2 × (area to the left of t under a t(n−2) distribution) if t < 0
p = 2 × (area to the right of t under a t(n−2) distribution) if t >= 0
5) We assume an underlying normal distribution for each of the random variables used to compute r.
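A hand computation of the Eq. 11.20 statistic, checked against cor.test(), using the Problem 11.36 data shown on the next slide:

```r
# Eq. 11.20 t statistic computed by hand and compared with cor.test()
# (Problem 11.36 data: log mortality vs. log cigarette consumption).
logmort <- c(-2.35,-2.20,-2.12,-1.95,-1.85,-1.80,-1.70,-1.58)
logcig  <- c(-0.26,-0.03,0.30,0.37,0.40,0.50,0.55,0.55)
n <- length(logmort)
r <- cor(logmort, logcig)
t <- r * sqrt(n - 2) / sqrt(1 - r^2)      # Eq. 11.20
p <- 2 * pt(-abs(t), df = n - 2)
c(r = r, t = t, p = p)   # about 0.9300, 6.1981, 0.0008128
```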
11.8 Statistical Inference for Correlation Coefficients : One-Sample t-Test for a Correlation Coefficient
Problem 11.36 pg. 505 in R
> logmort = c(-2.35, -2.20, -2.12,-1.95,-1.85,-1.80,-1.70,-1.58)
> logcig = c(-0.26,-0.03,0.30,0.37,0.40,0.50,0.55,0.55)
> cor(logmort,logcig)
[1] 0.9300082
> cor.test(logmort,logcig)
Pearson's product-moment correlation
data: logmort and logcig
t = 6.1981, df = 6, p-value = 0.0008128
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.653812 0.987513
sample estimates:
cor
0.9300082
11.8 Statistical Inference for Correlation Coefficients : One-Sample z-Test for a Correlation Coefficient
Eq. 11.22 One-Sample z Test for a Correlation Coefficient To test the hypothesis H0: ρ = ρ0 versus H1: ρ != ρ0, use the following procedure:
1) Compute the sample correlation coefficient r and the z transformation of r.
2) Compute the test statistic
λ = (z − z0) · sqrt(n − 3)
3) If λ > z(1−α/2) or λ < −z(1−α/2) reject H0. If −z(1−α/2) <= λ <= z(1−α/2) accept H0.
4) The exact p-value is given by
p = 2 × Φ(λ) if λ <= 0
p = 2 × [1 − Φ(λ)] if λ > 0
5) Assume an underlying normal distribution for each of the random variables used to compute r and z.
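A sketch of the Eq. 11.22 procedure in R; the null value ρ0 = 0.85 is a hypothetical choice for illustration (it does not appear in the text), and the data are from Problem 11.36:

```r
# One-sample z test for a correlation coefficient (Eq. 11.22).
# rho0 = 0.85 is a hypothetical null value chosen only for illustration.
logmort <- c(-2.35,-2.20,-2.12,-1.95,-1.85,-1.80,-1.70,-1.58)
logcig  <- c(-0.26,-0.03,0.30,0.37,0.40,0.50,0.55,0.55)
n <- length(logmort)
r <- cor(logmort, logcig)
rho0 <- 0.85                                  # hypothetical null value
z  <- 0.5 * log((1 + r) / (1 - r))            # Fisher z transform; equals atanh(r)
z0 <- 0.5 * log((1 + rho0) / (1 - rho0))
lambda <- (z - z0) * sqrt(n - 3)
p <- 2 * (1 - pnorm(abs(lambda)))
c(lambda = lambda, p = p)
```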
11.8 Statistical Inference for Correlation Coefficients : One-Sample z-Test for a Correlation Coefficient
The Fisher z transformation of r is
z = (1/2) · ln[(1 + r)/(1 − r)]
which under H0 approximately follows an N(z0, 1/(n − 3)) distribution, where
z0 = (1/2) · ln[(1 + ρ0)/(1 − ρ0)]
11.8 Statistical Inference for Correlation Coefficients : One-Sample z-Test for a Correlation Coefficient
There is no direct implementation of this test in R, but this method is used to compute confidence intervals when the number of observations is larger than 6 when one calls cor.test.
11.9 Multiple Regression
Consider Ex 11.38 on pg. 466 of the text.
Eq. 11.28 y = α + β1x1 + β2x2 + e, where y is the systolic blood pressure, x1 is the birthweight, x2 is the age in days, and e ~ N(0, σ^2). We use the method of least squares to minimize the sum of [y − (a + b1x1 + b2x2)]^2.
In general, if we have k independent variables x1, …, xk, then a linear-regression model relating y to x1, …, xk is of the form
EQ. 11.29
y = α + Σ(j=1..k) βj·xj + e, e ~ N(0, σ^2)
11.9 Multiple Regression
Def. 11.16 In the multiple-regression model
y = α + Σ(j=1..k) βj·xj + e
the βj are referred to as partial regression coefficients.
11.9 Multiple Regression
Def. 11.17 The standardized regression coefficient bs is given by
b * (sx/sy)
11.9.2 Hypothesis Testing
Eq. 11.31 F Test for Testing the Hypothesis H0: β1 = β2 = … = βk = 0 versus H1: at least one of the βj != 0 in Multiple Regression
1) Fit the regression parameters using the method of least squares, and compute Reg SS and Res SS, where
Res SS = Σ(i=1..n) (yi − ŷi)^2
Reg SS = Total SS − Res SS
Total SS = Σ(i=1..n) (yi − ȳ)^2
ŷi = a + Σ(j=1..k) bj·xij
xij = jth independent variable for the ith subject, j = 1, …, k; i = 1, …, n
11.9.2 Hypothesis Testing
Eq. 11.31 F Test for Testing the Hypothesis H0: β1 = β2 = … = βk = 0 versus H1: at least one of the βj != 0 in Multiple Regression
2) Compute Reg MS = Reg SS/k and Res MS = Res SS/(n − k − 1)
3) Compute the test statistic
F = Reg MS/Res MS
which follows an F(k, n−k−1) distribution under H0.
4) For a level α test, if F > F(k, n−k−1, 1−α) then reject H0; if F <= F(k, n−k−1, 1−α) then accept H0.
5) The exact p-value is given by the area to the right of F under an F(k, n−k−1) distribution = P(F(k, n−k−1) > F).
11.9.2 Hypothesis Testing
Eq. 11.32 t Test for Testing the Hypothesis H0: βl = 0, all other βj != 0 versus H1: βl != 0, all other βj != 0 in Multiple Linear Regression
1) Compute
t = bl / se(bl)
2) If
t < t(n−k−1, α/2) or t > t(n−k−1, 1−α/2) then reject H0;
if t(n−k−1, α/2) <= t <= t(n−k−1, 1−α/2) then accept H0.
3) The exact p-value is given by
2 × P(t(n−k−1) > t) if t >= 0
2 × P(t(n−k−1) <= t) if t < 0
11.9 Multiple Regression (EX. 11.39 in R)
> bwmv = c(135,120,100,105,130,125,125,105,120,90,120,95,120,150,160,125)
> agemv = c(3,4,3,2,4,5,2,3,5,4,2,3,3,4,3,3)
> bpmv = c(89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97, 92,88)
> bpmv.lm = lm(bpmv ~ bwmv + agemv)
> summary(bpmv.lm)
Call:
lm(formula = bpmv ~ bwmv + agemv)
Residuals:
Min 1Q Median 3Q Max
-4.0438 -1.3481 -0.2395 0.9688 6.6964
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.45019 4.53189 11.794 2.57e-08 ***
bwmv 0.12558 0.03434 3.657 0.00290 **
agemv 5.88772 0.68021 8.656 9.34e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 2.479 on 13 degrees of freedom
Multiple R-Squared: 0.8809, Adjusted R-squared: 0.8626
F-statistic: 48.08 on 2 and 13 DF, p-value: 9.844e-07
11.9.3 Regression Diagnostics
11.9 Multiple Regression (EX. 11.39 in R)
11.9 Multiple Regression (EX. 11.39 in R)
Chapter 11 Homework
11.1 – 11.8; 11.17 – 11.20, 11.42 – 11.44