semi possible model semiparametric modeling, penalized ... · semi 11 slide 21 selecting the number...
TRANSCRIPT
Semi 1
Slide 1
Semiparametric Modeling, PenalizedSplines, and Mixed Models
David Ruppert
Cornell University
http://www.orie.cornell.edu/~davidr
January 2004
Joint work with Babette Brumback, Ray Carroll, Brent
Coull, Ciprian Crainiceanu, Matt Wand, Yan Yu, and
others
Slide 2
Example (data from Hastie and James, this
analysis in RWC)
age (years)
spin
al b
one
min
eral
den
sity
10 15 20 25
0.6
0.8
1.0
1.2
1.4
Semi 2
Slide 3
Possible Model
SBMDi,j is spinal bone mineral density on ith subject at
age equal to agei,j.
SBMDi,j = Ui + m(agei,j) + εi,j,
i = 1, . . . , m = 230, j = i, . . . , ni.
Ui is the random intercept for subject i.
{Ui} are assumed i.i.d. N(0, σ2U).
Slide 4
Underlying philosophy
1. minimalist statistics
• keep it as simple as possible
2. build on classical parametric statistics
3. modular methodology
Semi 3
Slide 5
Reference
Semiparametric Regression by Ruppert, Wand, and
Carroll (2003)
• Lots of examples from biostatistics.
Slide 6
Recent Example — April 17, 2003
Canfield et al. (2003) — Intellectual impairment
and blood lead.
• longitudinal (mixed model)
• nine covariates (modelled linearly)
• effect of lead modelled as a spline (semiparametric
model)
– disturbing conclusion
Semi 4
Slide 7
0 5 10 15 20 25 30 3560
70
80
90
100
110
120
130
lead (microgram/deciliter)
IQ
Quadratic
Spline
Thanks to Rich Canfield for data and estimates.
Slide 8
Semiparametric regression
Partial linear or partial spline model:
Yi = WTi βββW + m(Xi) + εi.
m(x) = XTi βββX + BT(x)b.
BT(x) = ( B1(x) · · · BK(x) ) .
E.g.,
XTi = ( Xi · · · Xp
i )
BT(x) = { (x− κ1)p+ · · · (x− κK)p
+ }
Semi 5
Slide 9
Example
m(x) = β0 + β1x + b1(x− κ1)+ + · · ·+ bK(x− κK)+
• slope jumps by bk at κk
Slide 10
Linear “plus” function
0 0.5 1 1.5 2 2.5 3
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2 plus fn.derivative
Semi 6
Slide 11
Fitting LIDAR data with plus functions
range
log
ratio
400 500 600 700
-1.0
-0.8
-0.6
-0.4
-0.2
0.0
Slide 12
Generalization
m(x) = β0+β1x+· · ·+βpxp+b1(x−κ1)
p++· · ·+bK(x−κK)p
+
• pth derivative jumps by p! bk at κk
• first p− 1 derivatives are continuous
Semi 7
Slide 13
Quadratic “plus” function
0 0.5 1 1.5 2 2.5 30
0.5
1
1.5
2
2.5
3
3.5
4 plus fn.derivative2nd derivative
Slide 14
Ordinary Least Squares
400 600−1
−0.8
−0.6
−0.4
−0.2
0
Raw Data
400 600−1
−0.8
−0.6
−0.4
−0.2
0
2 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
3 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
5 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
10 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
20 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
50 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
100 knots
Semi 8
Slide 15
Penalized least-squares
Minimizen∑
i=1
{Y − (WT
i βββW + XTi βββX + BT(Xi)b)
}2+ λbTDb.
E.g.,
D = I.
Slide 16
Penalized Least Squares
400 600−1
−0.8
−0.6
−0.4
−0.2
0
Raw Data
400 600−1
−0.8
−0.6
−0.4
−0.2
0
2 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
3 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
5 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
10 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
20 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
50 knots
400 600−1
−0.8
−0.6
−0.4
−0.2
0
100 knots
Semi 9
Slide 17
Ridge Regression
From previous slide:
n∑i=1
{Y − (WT
i βββW + XTi βββX + BT(Xi)b)
}2+ λbTDb.
Let X have row (WTi XT
i BT(Xi) ). Then
βββW
βββX
b
=
{X TX + λ blockdiag(0,0,D)}−1X TY.
• Also, a BLUP in a mixed model and an empirical
Bayes estimator.
Slide 18
Linear Mixed Models
Y = Xβββ + Zb + ε
where b is N(0, σ2bΣΣΣb).
Xβββ are the “fixed effects” and Zb are the “random
effects.”
Henderson’s equations.
(βββ
b
)=
(XTX XTZ
ZTX ZTZ + λΣΣΣ−1b
)−1 (XTY
ZTY
).
λ =σ2
ε
σ2b
.
Semi 10
Slide 19
From previous slides:
Let X have row (WTi XT
i BT(Xi) ). Then
βββW
βββX
b
=
{X TX + λ blockdiag(0,0,D)}−1X TY.
Linear mixed model:(
βββ
b
)=
(XTX XTZ
ZTX ZTZ + λΣΣΣ−1b
)−1 (XTY
ZTY
)
={
(X Z )T (X Z ) + λ blockdiag(0, ΣΣΣ−1b )
}−1
(X Z )T Y
Slide 20
Selecting λ
1. cross-validation (CV)
2. generalized cross-validation (GCV)
3. ML or REML in mixed model framework
Semi 11
Slide 21
Selecting the Number of Knots
0 0.2 0.4 0.6 0.8 1−1
−0.5
0
0.5
1
1.5(a) SpaHet, j = 3, typical data set
y
Truefull−search
5 20 40 80 12095
100
105
110
115
K
rela
tive
MA
SE
(b) MASE comparisons
fixed nknotsmyopicfull−search
1 2 3 4 5 60
50
100
150
number of knots (coded)
freq
uenc
y
0 0.0125 0.0250
0.0125
0.025
ASE − K=5
AS
E −
K=
40
n = 200
Slide 22
0 0.2 0.4 0.6 0.8 1−0.5
0
0.5(a) SpaHetLS, j = 3, n = 2,000
y
Truefull−search
5 20 40 80 12095
100
105
110
115
K
rela
tive
MA
SE
(b) MASE comparisons
fixed nknotsmyopicfull−search
1 2 3 4 5 60
50
100
150
200
250
number of knots (coded)
freq
uenc
y
0 0.5 1 1.5
x 10−3
0
0.5
1
1.5x 10
−3
ASE − K=5
AS
E −
K=
40
n = 2, 000
Semi 12
Slide 23
0 5 10 15 20 250
1
2
x 10−4
dffit
(λ)
MS
E
MSE
Bias
Variance
Optimal
n = 10, 000, 20 knots, quadratic spline
Slide 24
Return to spinal bone mineral density study
age (years)
spin
al b
one
min
eral
den
sity
10 15 20 25
0.6
0.8
1.0
1.2
1.4
SBMDi,j = Ui + m(agei,j) + εi,j,
i = 1, . . . , m = 230, j = i, . . . , ni.
Semi 13
Slide 25 X =
1 age11...
...
1 age1n1
......
1 agem1...
...
1 agemnm
Slide 26 Z =
1 · · · 0 (age11 − κ1)+ · · · (age11 − κK)+
.... . .
......
. . ....
1 · · · 0 (age1n1− κ1)+ · · · (age1n1
− κK)+
......
......
. . ....
0 · · · 1 (agem1 − κ1)+ · · · (agem1 − κK)+
.... . .
......
. . ....
0 · · · 1 (agemnm− κ1)+ · · · (agemnm
− κK)+
Semi 14
Slide 27 u =
U1
...
Um
b1
...
bK
Slide 28
age (years)
spin
al b
one
min
eral
den
sity
10 15 20 25
0.6
0.8
1.0
Variability bars on m and estimated density of Ui
Semi 15
Slide 29
Broken down by ethnicity
0.6
0.8
1.0
1.2
1.4Asian
10 15 20 25
Black
Hispanic
0.6
0.8
1.0
1.2
1.4White
10 15 20 25
age (years)
spin
al b
one
min
eral
den
sity
Slide 30
Model with ethnicity effects
SBMDij = Ui + m(ageij) + β1blacki + β2hispanici
+β3whitei + εij, 1 ≤ j ≤ ni, 1 ≤ i ≤ m.
Asian is the reference group.
Semi 16
Slide 31
Only requires an expansion of the fixed effects by adding
the columns
black1 hispanic1 white1
......
...
black1 hispanic1 white1
......
...
blackm hispanicm whitem
......
...
blackm hispanicm whitem
Slide 32co
ntra
st w
ith A
sian
sub
ject
s0.
00.
050.
100.
15
Black Hispanic White
Semi 17
Slide 33
• In this model, the age effects curve for the four ethnic
groups are parallel.
• Could we model them as non-parallel?
• Might be problematic in this example because of the
small values of the ni.
• But the methodology should be useful in other
contexts.
Slide 34
• Add interactions between age and black, hispanic,
and white.
– These are fixed effects.
• Then add interactions between black, hispanic,
white, and asian and the linear plus functions in
age.
– These are mean-zero random effects with their own
variance component
– This variance component control the amount of
shrinkage of the enthicity-specific curves to the
overall effect.
Semi 18
Slide 35
Penalized Splines and Additive Models
Additive model:
Yi = m1(X1,i) + . . . + mP (XP,i) + εi
Slide 36
Bivariate additive spline model
Yi = β0+βx,1Xi+ bx,1(Xi−κx,1)++ · · ·+ bx,K(Xi−κx,Kx)+
+ βz,1Zi + bz,1(Zi − κz,1)+ + · · ·+ bz,K(Zi − κz,Kz)+ + εi
• no need for backfitting
• computation very rapid
• no identifiability issues
• inference is simple
Semi 19
Slide 37
Bayesian methods
The linear mixed model is half-Bayesian.
• The random effects have a prior.
• The parameters without a prior are:
– fixed effects
∗ give them diffuse normal priors
– variance components
∗ give them diffuse inverse gamma priors
Slide 38
Bayesian methods
Can be easily implemented in WinBUGS or programmed
in, say, MATLAB.
Allows Bayes rather than empirical Bayes inference.
• Uncertainty due to smoothing parameter selection is
taken into account.
Semi 20
Slide 39
The Bias-Variance Trade-off and Confidence Bands
.............................
...................
..........................................................................
............................
.
...
.
.......
..
.
.
.
.
.....
.
.
.
...
.
.
............
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
...
.
.
.
lambda=0
range
log
ratio
400 500 600 700
-0.8
-0.4
0.0
.............................
...................
..........................................................................
............................
.
...
.
.......
..
.
.
.
.
.....
.
.
.
...
.
.
............
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
...
.
.
.
lambda=10
range
log
ratio
400 500 600 700
-0.8
-0.4
0.0
.............................
...................
..........................................................................
............................
.
...
.
.......
..
.
.
.
.
.....
.
.
.
...
.
.
............
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
...
.
.
.
lambda=30
range
log
ratio
400 500 600 700
-0.8
-0.4
0.0
.............................
...................
..........................................................................
............................
.
...
.
.......
..
.
.
.
.
.....
.
.
.
...
.
.
............
.
.
..
.
.
.
.
.
...
.
.
..
.
.
.
.
.
.
...
.
.
.
lambda=1000
range
log
ratio
400 500 600 700
-0.8
-0.4
0.0
Slide 40
How does one adjust confidence intervals for bias?
• undersmooth — so variance dominates and bias can
be safetly ignored.
Semi 21
Slide 41
10−6
10−5
10−4
10−3
10−2
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
x 10−4
log(λ)
MS
E
MSE
Bias2 Variance
n=10,00020 knotsσ=.3
optimal
Slide 42
Adjustment for bias continued
• estimate bias by a higher order method and subtract
off bias (essentially the same as above)
• Wahba/Nychka Bayesian intervals
– bias is random so adds to posterior variance
– interval is widened but there is no “offset”.
Semi 22
Slide 43
Wahba/Nychka Bayesian Intervals
y = Xβββ + Zu + ε, Cov
[u
ε
]=
[σ2
uI 0
0 σ2εI
],
C = (X Z )
βββ and u are BLUPs.
Slide 44
Cov
([βββ
u
] ∣∣∣u)
= σ2ε(C
TC+ σ2ε
σ2uD)−1CTC(CTC+ σ2
ε
σ2uD)−1
(Frequentist variance. Ignores bias)
Cov
([βββ
u− u
])= σ2
ε(CTC + σ2
ε
σ2uD)−1.
(Bayesian posterior variance. Takes bias into account.)
Semi 23
Slide 45
age (million years)
stro
ntiu
m r
atio
95 100 105 110 115 120
0.70
720
0.70
725
0.70
730
0.70
735
0.70
740
0.70
745
0.70
750
Slide 46
Effect of measurement error
−4 −3 −2 −1 0 1 2 3 4 5−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
x plus error
y
W = X + error and Var(X) = Var(error).
Semi 24
Slide 47
Correction for measurement error
Relatively little research in this area.
• Fan and Truong (1993): deconvolution kernels
– first work
– inefficient in finite-sample studies
– no inference
– strictly for 1-dimensional smoothing
• Carroll, Maca, Ruppert
– functional SIMEX methods and structural spline
methods
– more efficient than Fan and Truong
Slide 48
• Berry, Carroll, and Ruppert (JASA, 2002)
– fully Bayesian
– smoothing or penalized splines
– rather efficient in finite-sample studies
– inference available
– scales up — semiparametric inference is easy
– structural
Semi 25
Slide 49
Berry, Carroll, and Ruppert
• starts with mixed-model spline formulation
– but fully Bayesian
• conjugate priors
• true covariates are i.i.d. normal
– but surprisingly robust
• normal measurement error
• in Gibbs, only sampling of true (unknown) covariates
requires a Hastings-Metropolis step
Slide 50
Effect of measurement error
−4 −3 −2 −1 0 1 2 3 4 5−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
x plus error
y
W = X + error and Var(X) = Var(error).
Semi 26
Slide 51
Correction for measurement error
−4 −3 −2 −1 0 1 2 3 4
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
Solid: true. Dotted: uncorrected. Dashed: corrected.
Slide 52
Measurement Error, continued
Ganguli, Staudenmayer, Wand:
• EM maximum likelihood estimation in BCR model.
• Works about as well as the fully Bayesian approach.
• Extension to additive models.
Semi 27
Slide 53
Generalized Regression
• Extension to non-Gaussian responses is conceptually
easy.
• Get a GLLM.
– However, GLIM’s are not trivial. Can use:
∗ Monte Carlo EM
∗ Or MCMC
Slide 54
Single-Index Models
Yi = g(XTi θθθ) + ZT
i βββ + εi.
Yu and Ruppert (2002, JASA).
Let
g(x) = γ0 + γ1x + · · ·+ γpxp
+c1(x− κ1)p+ + · · ·+ cK(x− κK)p
+.
Becomes a nonlinear regression model
Yi = m(Xi,Zi, θθθ, βββ, γγγ, c) + εi.