Introduction to regression
July 13, 2011
Karen Bandeen-Roche, PhD, Department of Biostatistics
Johns Hopkins University
Introduction to Statistical Measurement and Modeling
Data motivation
Temperature modeling. Scientific question: Can we accurately and precisely model geographic variation in temperature?
Some related statistical questions:
How does temperature vary as a function of latitude and longitude (what is the “shape”)?
Does the temperature variation with latitude differ by longitude? Or vice versa?
Once we have a set of predictions: How accurate and precise are they?
United States temperature map
http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html
Temperature data
[Scatterplot matrix: temperature (temp) versus longitude and latitude.]
Data examples: Boxing and neurological injury
Scientific question: Does amateur boxing lead to decline in neurological performance?
Some related statistical questions:
Is there a dose-response increase in the rate of cognitive decline with increased boxing exposure?
Is boxing-associated decline independent of initial cognition and age?
Is there a threshold of boxing that initiates harm?
Boxing data
[Scatterplot: change in neurological performance (diff) versus number of bouts (0-400), with lowess smoother, bandwidth = 0.8.]
Outline
Regression
Model / interpretation
Estimation / parameter inference
Direct / indirect effects
Nonlinear relationships / splines
Model checking
Precision of prediction
Regression
Studies the mean of an outcome or response variable “Y” as a function of predictor or covariate variables “X”: E[Y|X]
Note directionality: Y is the dependent variable, X the independent variable
Contrast to covariance, correlation: Cov(X,Y)=Cov(Y,X)
[Boxing data scatterplot repeated: diff versus bouts, with lowess smoother, bandwidth = 0.8.]
Regression: Purposes
Describing relationships between variables
Making inferences about relationships between variables
Predicting future outcomes
Investigating interplay among multiple variables
Regression: Model
Y = systematic part g(x) + random error
$Y = E[Y|x] + \varepsilon$ (population model)
$Y_i = E[Y|x_i] + \varepsilon_i$, $i = 1, \dots, n$ (sample from population)
$Y_i = \mathrm{fit}(x_i) + e_i$ (sample model)
This course: the linear regression model
$\mathrm{fit}(X_1, \dots, X_p; \beta) = E[Y|X_1, \dots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
Simple linear regression: one X, yielding
$E[Y|X] = \beta_0 + \beta_1 X$
Regression: Model
$E[Y|X_1, \dots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
Interpretation of $\beta_0$: mean response in the subpopulation for whom all covariate values equal 0
Interpretation of $\beta_j$: the difference in mean response comparing subpopulations whose $x_j$ values differ by one unit and whose values on the other covariates do not differ
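To make the interpretation concrete, here is a minimal Python/numpy sketch fitting a simple linear regression to simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from E[Y|X] = beta0 + beta1 * X with beta0 = 2, beta1 = 0.5
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# Least squares fit of E[Y|X] = b0 + b1 * X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# b1 estimates the difference in mean Y per one-unit difference in X;
# b0 estimates the mean Y in the subpopulation with X = 0
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```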
Modeling geographical variation:
Latitude and Longitude
http://www.enchantedlearning.com/usa/activity/latlong/
Regression – Matrix Specification
Notation:
n-vector $v = (v_1, \dots, v_n)'$ (a column), where $'$ denotes transpose
$n \times p$ matrix $A$ ($n$ rows, $p$ columns) $= [a_{ij}]$
$E[Z] = (E[Z_1], \dots, E[Z_M])'$ for a random M-vector Z
Model: $Y = X\beta + \varepsilon$, with Y, $\varepsilon$ = $n \times 1$ vectors (n “records,” “people,” “units,” etc.)
X = $n \times (p+1)$ “design matrix”
Columns: a list of ones (codes the intercept) and then the p Xs
$\beta$ = (p+1)-vector (intercept and then coefficients $\beta_1, \dots, \beta_p$)
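A minimal numpy sketch of the design matrix layout, using made-up covariate values:

```python
import numpy as np

# Two covariates measured on n = 5 units (values illustrative)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

# Design matrix X is n x (p+1): a column of ones (the intercept),
# then one column per covariate
X = np.column_stack([np.ones(len(x1)), x1, x2])
print(X.shape)  # (5, 3): n = 5 records, p + 1 = 3 columns
```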
Estimation: Principle of least squares
Independently discovered by Gauss (1803) & Legendre (1805)
$\hat\beta$ minimizes
$$\sum_{i=1}^{n}\left[Y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})\right]^2 = (Y - X\beta)'(Y - X\beta)$$
Resulting estimator: $\hat\beta = (X'X)^{-1}X'Y$
“Fit” $= \hat Y = X\hat\beta = X(X'X)^{-1}X'Y = HY$
“Residual” $= e = Y - \hat Y = (I - H)Y$
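The estimator, fit, and residuals are easy to compute directly; a minimal numpy sketch on simulated data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: Y = X beta + eps with a known beta (values illustrative)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(0, 1, n)

# Least squares estimator: beta_hat = (X'X)^{-1} X'Y
# (np.linalg.solve is preferred to forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Hat matrix H = X (X'X)^{-1} X'; fit = H Y; residual = (I - H) Y
H = X @ np.linalg.solve(X.T @ X, X.T)
fit = H @ Y
resid = Y - fit  # same as (np.eye(n) - H) @ Y
print(beta_hat)  # close to (1.0, 2.0, -0.5)
```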
Least squares fit - Intuition
Simple linear regression (SLR): the fitted slope is a weighted average of the slopes between each $(x_i, Y_i)$ and $(\bar x, \bar Y)$, i.e.
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{S_{XY}}{S_{XX}} = \sum_{i=1}^{n} w_i \, \frac{Y_i - \bar Y}{x_i - \bar x}, \qquad w_i = \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n} (x_j - \bar x)^2}$$
Least squares fit - Intuition
The weight $w_i$ increases with the distance of $x_i$ from the sample mean of the xs
The LS line must go through $(\bar x, \bar Y)$: picture equally loaded springs attaching each point to a line with its fulcrum at $(\bar x, \bar Y)$
Multiple linear regression “projects” Y into the model space = all linear combinations of the columns of X
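The weighted-average identity can be verified numerically; a small numpy sketch on simulated data (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.7 * x + rng.normal(0, 1, 30)

# Usual least squares slope: Sxy / Sxx
sxx = np.sum((x - x.mean()) ** 2)
slope_ls = np.sum((x - x.mean()) * (y - y.mean())) / sxx

# The same slope as a weighted average of pairwise slopes through
# (x_bar, y_bar), weights proportional to (x_i - x_bar)^2
# (assumes no x_i equals the sample mean exactly)
w = (x - x.mean()) ** 2 / sxx
pairwise_slopes = (y - y.mean()) / (x - x.mean())
slope_wavg = np.sum(w * pairwise_slopes)

print(np.isclose(slope_ls, slope_wavg))  # True
```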
Why does LS yield a good estimator?
Need some preliminaries and assumptions
A preliminary: the variance of a random m-vector z is an (m × m) matrix, $\Sigma = \mathrm{Var}(z)$
1. Diagonal elements = variances: $\sigma_{ii} = \mathrm{Var}(Z_i)$
2. (i, j) off-diagonal element: $\sigma_{ij} = \mathrm{Cov}(Z_i, Z_j)$
3. $\sigma_{ij} = \sigma_{ji}$ (symmetric)
4. $\Sigma$ is positive semi-definite: $a'\Sigma a \ge 0$ for any m-vector $a$
5. For an (n × m) matrix A, $\mathrm{Var}(Az) = A\,\mathrm{Var}(z)\,A'$
Why does LS yield a good estimator?
Assumptions – Random part of the model
A1: Linear model: $Y = X\beta + \varepsilon$
A2: Mean-0, exogenous errors: $E[\varepsilon|X] = E[\varepsilon] = 0$
A3: Constant variance: $\mathrm{Var}(\varepsilon|X)$ has all diagonal entries $= \sigma^2$
A4: Random sampling: the errors are statistically independent
A5: Probability distribution: $\varepsilon$ is distributed as (~) multivariate normal (MVN)
Why does LS yield a good estimator?
1. Unbiasedness (“accuracy”): $E[\hat\beta] = E[(X'X)^{-1}X'Y] = \beta$
2. Standard errors for beta:
a. $\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$, so $\mathrm{Var}(\hat\beta_j) = \sigma^2\,[(X'X)^{-1}]_{jj}$
b. To estimate: will need an estimate for $\sigma^2$
3. BLUE: LS produces the Best Linear Unbiased Estimator
a. SLR meaning: if $\tilde\beta$ is a competing linear unbiased estimator, then $\mathrm{Var}(\hat\beta) \le \mathrm{Var}(\tilde\beta)$
b. MLR, matrix statement: $\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)$ is positive semidefinite
i. e.g. $\mathrm{Var}\left(\sum_{j=0}^{p} c_j \hat\beta_j\right) \le \mathrm{Var}\left(\sum_{j=0}^{p} c_j \tilde\beta_j\right)$ for any choice of $c_0, \dots, c_p$
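Unbiasedness and the variance formula can be checked by simulation; a minimal numpy sketch under assumptions A1-A4 (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design, known beta and error sd (all values illustrative)
n, sigma = 40, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the errors many times and refit: the average of beta_hat should be
# near beta (unbiasedness) and its covariance near sigma^2 (X'X)^{-1}
fits = []
for _ in range(5000):
    Y = X @ beta + rng.normal(0, sigma, n)
    fits.append(np.linalg.solve(X.T @ X, X.T @ Y))
fits = np.array(fits)

print(fits.mean(axis=0))   # approximately beta
print(np.cov(fits.T))      # approximately sigma^2 (X'X)^{-1}
print(sigma**2 * XtX_inv)
```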
Inference on coefficients
We have $\mathrm{Var}(\hat\beta_j) = \sigma^2\,[(X'X)^{-1}]_{jj}$. To assess this: will need an estimate for $\sigma^2 = \mathrm{Var}(\varepsilon_i)$
$e_i$ approximates $\varepsilon_i$
Estimate $\sigma^2$ by the ECDF variance of the $e_i$:
$$\hat\sigma^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} e_i^2$$
(will justify the $n-p-1$ divisor shortly)
Estimated standard error of $\hat\beta_j$:
$$\widehat{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[(X'X)^{-1}]_{jj}}$$
Inference on coefficients
$\widehat{SE}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[(X'X)^{-1}]_{jj}}$
$100(1-\alpha)\%$ confidence interval for $\beta_j$:
$$\hat\beta_j \pm t_{1-\alpha/2}(n-p-1)\,\widehat{SE}(\hat\beta_j)$$
Hypothesis testing: $H_0\colon \beta_j = \beta_j^*$ versus $H_1\colon \beta_j \neq \beta_j^*$
Test statistic:
$$T = \frac{\hat\beta_j - \beta_j^*}{\widehat{SE}(\hat\beta_j)} \sim t(n-p-1),$$
i.e., T follows a t distribution with n − p − 1 degrees of freedom.
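These quantities can all be computed by hand; a minimal numpy/scipy sketch on simulated data (all values illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(0, 1, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# sigma^2 estimate with n - p - 1 degrees of freedom (p covariates + intercept)
df = n - p - 1
sigma2_hat = np.sum(resid**2) / df

# Standard errors: square roots of the diagonal of sigma2_hat * (X'X)^{-1}
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

# t statistics for H0: beta_j = 0, and 95% confidence intervals
t_stats = beta_hat / se
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(t_stats, ci, sep="\n")
```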
Direct and Indirect Associations in Regression
Coefficients’ magnitude and interpretation may vary according to what other predictors are in the model.
Terminology: consider the two-covariate model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
$\beta_1$ = “direct association” of $x_1$ with Y, controlling for $x_2$
The SLR coefficient $\beta_1^*$ in $Y = \beta_0^* + \beta_1^* x_1 + \varepsilon^*$ = “total association” of $x_1$ with Y
The difference $\beta_1^* - \beta_1$ = “indirect association” of $x_1$ with Y through $x_2$
Direct and Indirect Associations in Regression
The fitted total association of $x_1$ with Y involves the fitted direct association and the regression of $x_2$ on $x_1$. For SLR,
$$\hat\beta_1^* = \frac{S_{X_1 Y}}{S_{X_1 X_1}} = \hat\beta_1 + \hat\beta_2\,\hat\gamma_1,$$
where $\hat\gamma_1$ is the slope in the LS SLR of $x_2$ on $x_1$:
$$\hat\gamma_1 = \frac{S_{X_2 X_1}}{S_{X_1 X_1}} = \frac{\sum_{i=1}^{n} (x_{i1} - \bar x_1)(x_{i2} - \bar x_2)}{\sum_{i=1}^{n} (x_{i1} - \bar x_1)^2}$$
Total and direct associations are equal only if $\hat\beta_2 = 0$ or $\hat\gamma_1 = 0$
Generalizes to more than two covariates
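The decomposition can be confirmed numerically; a minimal numpy sketch with correlated simulated covariates (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Correlated covariates: x2 depends on x1
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(0, 0.5, n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(0, 1, n)

def ls(X, y):
    # least squares coefficients
    return np.linalg.solve(X.T @ X, X.T @ y)

ones = np.ones(n)
b_full = ls(np.column_stack([ones, x1, x2]), y)  # direct: beta1_hat
b_slr = ls(np.column_stack([ones, x1]), y)       # total: beta1*_hat
gamma = ls(np.column_stack([ones, x1]), x2)      # regression of x2 on x1

# Identity: beta1*_hat = beta1_hat + beta2_hat * gamma1_hat
print(b_slr[1], b_full[1] + b_full[2] * gamma[1])  # equal up to rounding
```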
Model Checking with Residuals
Two areas of concern:
Widespread failure of assumptions A1-A5
Isolated points: potential to unduly influence the fit
Model checking method: fit the full model, calculate
$$e_i = y_i - \hat y_i, \quad i = 1, \dots, n,$$
and plot the $e_i$ in various ways
Rationale: if the model fits, $e_i$ mimics $\varepsilon_i$, so the residuals should have mean 0, equal variance, etc.
Model Checking with Residuals
Checking A1-A3: scatterplots of residuals versus (i) fitted values $\hat y_i$; (ii) each continuous covariate $x_{ij}$, with a reference line at y = 0
Good fit: flat, unpatterned cloud about the reference line
A1-A2 violation: curvilinear pattern about the reference line
Influential points: close to the reference line at the expense of the overall shape
A3 violation: tendency for the degree of vertical “spread” about the reference line to vary systematically with $\hat y_i$ or $x_{ij}$
Checking A5: stem-and-leaf plot, ECDF-versus-normal plot
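A minimal matplotlib/scipy sketch of two such diagnostic plots on simulated data (all values illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta_hat
resid = y - fitted

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
# Residuals vs. fitted: a flat, unpatterned cloud about y = 0 is consistent with A1-A3
axes[0].scatter(fitted, resid, s=10)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="fitted value", ylabel="residual")
# Check A5 by comparing residuals with normal quantiles
stats.probplot(resid, dist="norm", plot=axes[1])
plt.show()
```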
Nonlinear relationships: Curve fitting
Method 1: Polynomials. Replace $\beta_j x_j$ by
$$\beta_{j1} x_j + \beta_{j2} x_j^2 + \cdots + \beta_{jK} x_j^K$$
Curve Method 2: “Splines” = piecewise polynomials
Feature 1: “order” = the order of the polynomials being joined
Feature 2: “knots” = where the polynomials are joined
How to fit splines
Linear spline with K knots at positions $k_1, \dots, k_K$:
$$Y = \beta_0 + \beta_1 x + \beta_2 (x - k_1)_+ + \beta_3 (x - k_2)_+ + \cdots + \beta_{K+1} (x - k_K)_+ + \varepsilon,$$
where $(x - k)_+ = x - k$ if $(x - k) > 0$ and $= 0$ otherwise.
Spline of order M:
$$Y = \beta_0 + \sum_{p=0}^{K} \sum_{m=1}^{M} \beta_{pm} (x - k_p)_+^m + \varepsilon,$$
where $(x - k_0)_+ := x$.
To ensure a smooth curve, only include the plus functions for the highest-order terms.
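Fitting a linear spline amounts to building the basis columns and running least squares; a minimal numpy sketch with made-up knots (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 10, n)
# Piecewise-linear truth with a single bend (slope change of 1.5) at x = 5
y = 1.0 + 0.5 * x + 1.5 * np.clip(x - 5, 0, None) + rng.normal(0, 1, n)

# Linear spline basis with knots k1,...,kK: columns 1, x, (x-k1)+, ..., (x-kK)+
knots = [3.0, 5.0, 7.0]
X = np.column_stack([np.ones(n), x] + [np.clip(x - k, 0, None) for k in knots])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# beta_hat[1] is the slope below the first knot; each later coefficient is
# the change in slope at its knot (near 1.5 at the knot 5, near 0 elsewhere)
print(np.round(beta_hat, 2))
```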
Spline interpretation, linear spline case
$\beta_0$ = mean outcome in the subpopulation with x = 0
$\beta_1$ = mean Y vs. x slope on $\{x: x < k_1\}$
$\beta_2$ = amount by which the mean Y vs. x slope on $\{x: k_1 \le x < k_2\}$ differs from the slope on $\{x: x < k_1\}$
$\beta_q$ = amount by which the mean Y vs. x slope on $\{x: k_{q-1} \le x < k_q\}$ differs from the slope on $\{x: k_{q-2} \le x < k_{q-1}\}$
Checking goodness of fit of the curve: The partial residual plot
Larsen W, McCleary S. “The use of partial residual plots in regression analysis.” Technometrics 1972; 14: 781-90.
Purpose: assess the degree to which a fitted curve describing the association between Y and X matches the empirical shape of the association.
Basic idea: if there were no covariates except the one for which a curve is being fit, we would plot Y versus X and overlay with $\hat Y$ versus X.
Proposal: subtract out the contributions of the other covariates.
The partial residual plot - Method
Suppose the model is $Y = \beta_0 + \beta_1' X_1 + f(x_2; \beta_2) + \varepsilon$, where
$x_2$ = the variable of interest & $f(x_2)$ = a vector of terms involving $x_2$,
$\beta_1' X_1$ represents the “adjustment” for covariates other than $x_2$
Fit the whole model (e.g. get $\hat\beta_0, \hat\beta_1, \hat\beta_2$);
Calculate partial residuals $r_p = Y - \hat\beta_0 - \hat\beta_1' X_1$;
Plot $r_p$ versus $x_2$, overlaying with the plot of $f(x_2; \hat\beta_2)$ versus $x_2$
Another way to think about it: the partial residuals are $f(x_2; \hat\beta_2)$ + the overall model residuals.
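A minimal numpy/matplotlib sketch of the method, using a simulated example in which a quadratic is (deliberately) too simple for the true curve:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
n = 300
x1 = rng.normal(size=n)              # adjustment covariate
x2 = rng.uniform(0, 10, n)           # variable of interest
y = 1.0 + 0.5 * x1 + np.sin(x2) + rng.normal(0, 0.3, n)

# Fit f(x2) with a quadratic (deliberately too simple), adjusting for x1
X = np.column_stack([np.ones(n), x1, x2, x2**2])
b = np.linalg.solve(X.T @ X, X.T @ y)

# Partial residuals: subtract out the intercept and x1 contributions
r_p = y - b[0] - b[1] * x1
f_hat = b[2] * x2 + b[3] * x2**2     # fitted curve in x2

order = np.argsort(x2)
plt.scatter(x2, r_p, s=8, alpha=0.5)
plt.plot(x2[order], f_hat[order], color="red")
plt.xlabel("x2")
plt.ylabel("partial residual")
plt.show()  # systematic departures of the points from the curve flag lack of fit
```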
United States temperature map
http://green-enb150.blogspot.com/2011/01/isorhythmic-map-united-states-weather.html
Data motivation
Temperature modeling - Statistical questions:
How does temperature vary as a function of latitude and longitude (what is the “shape”)? - DONE
Does the temperature variation with latitude differ by longitude? Or vice versa?
Once we have a set of predictions: How accurate and precise are they?
Does the temperature variation with latitude differ by longitude?
Preliminary: Does temperature vary across latitude and longitude categories?
Usual device = dummy variables
Suppose the categorical covariate (X) has K categories
Choose one category (say, the Kth) as the “reference”
For the other K − 1 categories, create new variables $X_1, \dots, X_{K-1}$ with
$X_{ki} = 1$ if $X_i$ = category k; 0 otherwise
Rather than include $X_i$ in the MLR, include $X_{i1}, \dots, X_{i,K-1}$.
If there is more than one categorical variable, create as many sets of dummy variables as necessary (one set per variable)
Then $\beta_k = E[Y|\text{category } k] - E[Y|\text{category } K]$
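A minimal numpy sketch of the dummy-variable coding (category labels illustrative):

```python
import numpy as np

# A categorical covariate with K = 3 categories; category 3 is the reference
cat = np.array([1, 2, 3, 1, 2, 3, 2, 1])

# K - 1 dummy variables: X_k = 1 if the unit is in category k, else 0
X1 = (cat == 1).astype(float)
X2 = (cat == 2).astype(float)

# Include X1 and X2 (not cat itself) in the MLR; coefficient of X_k is then
# E[Y | category k] - E[Y | category K], the contrast with the reference
X = np.column_stack([np.ones(len(cat)), X1, X2])
print(X)
```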
Does the temperature variation with latitude differ by longitude?
Formal wording: Is there interaction in the association of latitude and longitude with temperature? Synonyms: effect modification, moderation
Covariates A and B interact if the magnitude of the association between A and Y varies across levels of B
i.e. B “modifies” or “moderates” the association between A & Y
Interaction Modeling
Interactions are typically coded as “main effects” + “interaction terms”
Example: two categorical predictors (factors), with dummy coding
Factor level 1: $x_{11} = 0$, $x_{12} = 0$; Factor level 2: $x_{11} = 1$, $x_{12} = 0$; Factor level 3: $x_{11} = 0$, $x_{12} = 1$
Model:
$$Y = \beta_0 + \underbrace{\beta_{11} x_{11} + \beta_{12} x_{12}}_{\text{Main}} + \underbrace{\beta_2 x_2}_{\text{Main}} + \underbrace{\beta_{31} x_{11} x_2 + \beta_{32} x_{12} x_2}_{\text{Interaction}} + \varepsilon$$
Mean Y given predictors:
Predictor 1 level | $x_2 = 0$ | $x_2 = 1$
Factor level 1 | $\beta_0$ | $\beta_0 + \beta_2$
Factor level 2 | $\beta_0 + \beta_{11}$ | $\beta_0 + \beta_{11} + \beta_2 + \beta_{31}$
Factor level 3 | $\beta_0 + \beta_{12}$ | $\beta_0 + \beta_{12} + \beta_2 + \beta_{32}$
Interaction Modeling: Interpretations
$$Y = \beta_0 + \beta_{11} x_{11} + \beta_{12} x_{12} + \beta_2 x_2 + \beta_{31} x_{11} x_2 + \beta_{32} x_{12} x_2 + \varepsilon$$
$\beta_0 = E[Y|x_{11}=0, x_{12}=0, x_2=0]$ = mean response at the “reference” factor level combination
$\beta_{11} = E[Y|x_{11}=1, x_{12}=0, x_2=0] - E[Y|x_{11}=0, x_{12}=0, x_2=0]$ = amount by which the mean response at the second level of the first factor differs from the mean response at the first level of the first factor, among those at the first level of the second factor
$\beta_{31} = \{E[Y|x_{11}=1, x_{12}=0, x_2=1] - E[Y|x_{11}=1, x_{12}=0, x_2=0]\} - \{E[Y|x_{11}=0, x_{12}=0, x_2=1] - E[Y|x_{11}=0, x_{12}=0, x_2=0]\}$
= amount by which the (difference in mean response across levels of the second factor) differs across levels of the first factor
= amount by which the second factor’s effect varies over levels of the first factor
How precise are our predictions?
A natural measure: the ECDF correlation $\mathrm{Corr}(Y, \hat Y) = R$; “R-squared” = $R^2$
Temperatures example: $R^2$ = 0.90 (spline model), R = 0.95
An extremely precise prediction (nearly a straight line)
Caution: this reflects precision for the SAME data used to build the model, and is generally an overestimate of precision for predicting future data
Cross-validation needed: evaluate the fit in a separate sample
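A minimal numpy sketch of the in-sample optimism, comparing $R^2$ on the training data with $R^2$ on a fresh test sample (model and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)

def r_squared(y, y_hat):
    # squared ECDF correlation of outcome and prediction
    return np.corrcoef(y, y_hat)[0, 1] ** 2

def f(x):
    return 1.0 + 2.0 * x  # true mean function

# Small training sample, flexible (degree-8 polynomial) model
n, deg = 30, 8
x_train, x_test = rng.uniform(-1, 1, n), rng.uniform(-1, 1, 1000)
y_train = f(x_train) + rng.normal(0, 0.5, n)
y_test = f(x_test) + rng.normal(0, 0.5, 1000)

X_train, X_test = np.vander(x_train, deg + 1), np.vander(x_test, deg + 1)
b = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

print("in-sample R2: ", r_squared(y_train, X_train @ b))   # optimistic
print("out-of-sample R2:", r_squared(y_test, X_test @ b))  # typically lower
```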
Main Points
Linear regression model
Random part: errors distributed as independent, identically distributed $N(0, \sigma^2)$
Systematic part: $E[Y] = X\beta$
Direct versus total effects
Differences in means
Nonlinear relationships: polynomials, splines
Interactions
Main Points
Estimation: least squares
Unbiased
BLUE: Best Linear Unbiased Estimator
Inference: $\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$
Standard errors = square roots of the diagonal elements
Testing, confidence intervals as in Lecture 2
Model fit: residual analysis
$R^2$ estimates precision of predicting individual outcomes