Post-stratification and calibration
Thomas Lumley
UW Biostatistics
WNAR—2008–6–22
What are they?
Post-stratification and calibration are ways to use auxiliary
information on the population (or the phase-one sample) to
improve precision.
They are closely related to the Augmented Inverse-Probability
Weighted estimators of Jamie Robins and coworkers, but are
easier to understand.
Estimating a total
Population size N , sample size n, sampling probabilities πi,
sampling indicators Ri.
Goal: estimate
T =N∑i=1
yi
Horvitz–Thompson estimator:
T̂ =∑Ri=1
1
πiyi
To estimate parameters θ replace yi by loglikelihood `i(θ) or
estimating functions Ui(θ).
Auxiliary information
HT estimator is inefficient when some additional population data
are available.
Suppose xi is known for all i
Fit y ∼ xβ by (probability-weighted) least squares to get β̂. Let
r2 be proportion of variation explained.
T̂reg =∑Ri=1
1
πi(yi − xiβ̂) +
N∑i=1
xiβ̂
ie, HT estimator for sum of residuals, plus population sum of
fitted values
Auxiliary information
Let β∗ be true value of β (ie, least-squares fit to whole
population).
Regression estimator
T̂reg =∑Ri=1
1
πi(yi − xiβ∗) +
N∑i=1
xi
β∗+N∑i=1
(1−
Riπi
)xi(β̂ − β∗)
compare to HT estimator
T̂ =∑Ri=1
1
πi(yi − xiβ∗) +
∑Ri=1
1
πixi
β∗
Second term uses known vs observed total of x, third term is
estimation error for β, of smaller order.
Auxiliary information
For large n, N and under conditions on moments and sampling
schemes
var[T̂reg
]= (1−r2) var
[T̂]+O(N/
√n) =
(1− r2 +O(n−1/2)
)var
[T̂]
and the relative bias is O(1/n)
The lack of bias does not require any assumptions about [Y |X]
β̂ is consistent for the population least squares slope β, for which
the mean residual is zero by construction.
Reweighting
Since β̂ is linear in y, we can write xβ̂ as a linear function of y
and so T̂reg is also a linear function of Y
T̂reg =∑Ri=1
wiyi =∑Ri=1
giπiyi
for some (ugly) wi or gi that depend only on the xs
For these weights
N∑i=1
xi =∑Ri=1
giπixi
T̂reg is an IPW estimator using weights that are ‘calibrated’ or
‘tuned’ (French: calage) so that the known population totals are
estimated correctly.
Calibration
The general calibration problem: given a distance function d(·, ·),find calibration weights gi minimizing∑
Ri=1
d(gi, 1)
subject to the calibration constraints
N∑i=1
xi =∑Ri=1
giπixi
Lagrange multiplier argument shows that gi = η(xiβ) for someη(), β; and γ can be computed by iteratively reweighted leastsquares.
For example, can choose d(, ) so that gi are bounded below (andabove).
[Deville et al JASA 1993; JNK Rao et al, Sankhya 2002]
Calibration
When the calibration model in x is saturated, the choice of
d(, ) does not matter: calibration equates estimated and known
category counts.
In this case calibration is also the same as estimating sampling
probabilities with logistic regression, which also equates esti-
mated and known counts.
Calibration to a saturated model gives the same analysis as
pretending the sampling was stratified on these categories: post-
stratification
Post-stratification is a much older method, and is computation-
ally simpler, but calibration can make more use of auxiliary data.
Standard errors
Standard errors come from the regression formulation
T̂reg =∑Ri=1
1
πi(yi − xiβ̂) +
N∑i=1
xiβ̂
The variance of the second term is of smaller order and is ignored.
The variance of the first term is the usual Horvitz–Thompson
variance estimator, applied to residuals from projecting y on the
calibration variables.
Computing
R provides calibrate() for calibration (and postStratify() for
post-stratification)
Three basic types of calibration
• Linear (or regression) calibration: identical to regression
estimator
• Raking: multiplicative model for weights, popular in US,
guarantees gi > 0
• Logit calibration: logit link for weights, popular in Europe,
provides upper and lower bounds for gi
Computing
Upper and lower bounds for gi can also be specified for linear
and raking calibration (these may not be achievable, but we
try). The user can specify other calibration loss functions (eg
Hellinger distance).
Computing
The calibrate() function takes three main arguments
• a survey design object
• a model formula describing the design matrix of auxiliary
variables
• a vector giving the column sums of this design matrix in the
population.
and additional arguments describing the type of calibration.
Computing
> data(api)
> dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
> pop.totals<-c(‘(Intercept)‘=6194, stypeH=755, stypeM=1018)
> (dclus1g<-calibrate(dclus1, ~stype, pop.totals))
1 - level Cluster Sampling design
With (15) clusters.
calibrate(dclus1, ~stype, pop.totals)
> svymean(~api00, dclus1g)
mean SE
api00 642.31 23.921
> svymean(~api00,dclus1)
mean SE
api00 644.17 23.542
Computing
> svytotal(~enroll, dclus1g)
total SE
enroll 3680893 406293
> svytotal(~enroll,dclus1)
total SE
enroll 3404940 932235
> svytotal(~stype, dclus1g)
total SE
stypeE 4421 1.118e-12
stypeH 755 4.992e-13
stypeM 1018 1.193e-13
Computing
> (dclus1g3 <- calibrate(dclus1, ~stype+api99,
c(pop.totals, api99=3914069)))
1 - level Cluster Sampling design
With (15) clusters.
calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069))
> svymean(~api00, dclus1g3)
mean SE
api00 665.31 3.4418
> svytotal(~enroll, dclus1g3)
total SE
enroll 3638487 385524
> svytotal(~stype, dclus1g3)
total SE
stypeE 4421 1.179e-12
stypeH 755 4.504e-13
stypeM 1018 9.998e-14
Computing
> range(weights(dclus1g3)/weights(dclus1))
[1] 0.4185925 1.8332949
> (dclus1g3b <- calibrate(dclus1, ~stype+api99,
c(pop.totals, api99=3914069),bounds=c(0.6,1.6)))
1 - level Cluster Sampling design
With (15) clusters.
calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069),
bounds = c(0.6, 1.6))
> range(weights(dclus1g3b)/weights(dclus1))
[1] 0.6 1.6
Computing
> svymean(~api00, dclus1g3b)
mean SE
api00 665.48 3.4184
> svytotal(~enroll, dclus1g3b)
total SE
enroll 3662213 378691
> svytotal(~stype, dclus1g3b)
total SE
stypeE 4421 1.346e-12
stypeH 755 4.139e-13
stypeM 1018 8.238e-14
Computing
> (dclus1g3c <- calibrate(dclus1, ~stype+api99, c(pop.totals,
+ api99=3914069), calfun="raking"))
1 - level Cluster Sampling design
With (15) clusters.
calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069),
calfun = "raking")
> range(weights(dclus1g3c)/weights(dclus1))
[1] 0.5342314 1.9947612
> svymean(~api00, dclus1g3c)
mean SE
api00 665.39 3.4378
Computing
> (dclus1g3d <- calibrate(dclus1, ~stype+api99, c(pop.totals,
+ api99=3914069), calfun="logit",bounds=c(0.5,2.5)))
1 - level Cluster Sampling design
With (15) clusters.
calibrate(dclus1, ~stype + api99, c(pop.totals, api99 = 3914069),
calfun = "logit", bounds = c(0.5, 2.5))
> range(weights(dclus1g3d)/weights(dclus1))
[1] 0.5943692 1.9358791
> svymean(~api00, dclus1g3d)
mean SE
api00 665.43 3.4325
Types of calibration
Post-stratification allows much more flexibility in weights, in
small samples can result in very influential points, loss of
efficiency.
Calibration allows for less flexibility (cf stratification vs regression
for confounding)
Different calibration methods make less difference
Example from Kalton & Flores-Cervantes (J. Off. Stat, 2003):
a 3× 4 table of values.
Types of calibration
●
●
●●
●●
●
●
●
●
●
●
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Index
Cal
ibra
tion
wei
ght (
g)
1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2 1.3 2.3 3.3 4.3
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Post−stratifiedLinearBounded linearRaking
Two-phase studies
Sample a cohort of N people from population and measure some
variables then subsample n of them and measure more variables
[genes, biomarkers, coding of open-text data, copies of original
medical records]
Includes nested case–control, case–cohort designs.
Better use of auxiliary information by either stratifying the
sampling or calibrating to full cohort data after sampling.
Calibration of second phase is just like calibration of a single-
phase design.
RRZ estimators
Robins, Rotnitzky & Zhao defined augmented IPW estimators
for two-phase designs
N∑i=1
RiπiUi(θ) +
N∑i=1
(1−
Riπi
)Ai(θ) = 0
where Ai() can be any function of phase-1 data. Equivalent to
calibration estimator T̂reg using Ai as calibration variable.
N∑i=1
Riπi
(Ui(θ)−Ai(θ)) +N∑i=1
Ai(θ) = 0
RRZ estimators
Includes the efficient estimator in the non-parametric phase-1
model (efficient design-based estimator) — the most efficient
estimator that is consistent for the same limit as if we had
complete data.
Typically not fully efficient if outcome-model assumptions are
imposed at phase 1.
Example: Cox model assumes infinitely many constraints at
phase 1, and efficient two-phase estimator is known (Nan 2004,
Can J Stat) and is more efficient than calibration estimator.
Estimated weights
RRZ also note that estimating π from phase-1 data gives better
precision than using true known π . Widely regarded as a
paradox.
Estimated weights (eg logistic regression) solve
N∑i=1
xiRi =N∑i=1
xipi
ie, equate observed and estimated population moments. For
discrete x this is exactly calibration, for continuous x it is
effectively equivalent.
Estimated weights
Gain of precision in calibration is not paradoxical: comes from
replaceing variance of Y with variance of residuals for a reduction
by (1− r2)– nothing to do with ’estimation’
Exactly same issue as gain of precision when adjusting random-
ized trial for baseline: can write randomized trial estimator as
calibration with counterfactuals.
Estimation error in weights does increase uncertainty, but this is
second order: for p predictors it is O(1 + p/n)
Calibration provides increased precision only when r2 is large
enough (compared to p/n).
[Judkins et al, Stat Med 26:1022-33]
Computing
calibrate() also works on two-phase design objects
Since the phase-one data are already stored in the object, thereis no need to specify population totals when calibrating.
It is necessary to specify phase=2.
This morning we had a two-phase case–control design
dccs2<-twophase(id=list(~seqno,~seqno),
strata=list(NULL,~interaction(rel,instit)),
data=nwtco, subset=~incc2)
Calibrating it to 16 strata of relapse×stage×institutional histol-ogy:
gccs8<-calibrate(dccs2, phase=2,
formula=~interaction(rel,stage,instit))
Logistic regression
As all the phase-one data are available we can also esti-
mate sampling weights by logistic regression, as suggested by
Robins,Rotnitzky & Zhao (JASA, 1994).
Either use calibrate with calfun="rrz" or estWeights.
estWeights takes a data frame with missing values as input
and produces a corresponding two-phase design with weights
estimated by logistic regression.
Choice of auxiliaries
The other heuristic gain from the calibration viewpoint is inchoosing predictors for estimating π.
The regression formulation shows that the predictors should havestrong linear relationships with Ui(θ).
If Ui(θ) is of a form such as
ziwi(yi − µi(θ))
then zi is approximately uncorrelated with Ui
So, don’t use a variable correlated with a phase-2 predictor asa calibration variable, use a variable correlated with the phase-2score function.
estWeights() can take a phase-one model as an argument and usethe estimating functions from that model as calibration variables.
More detail from Norm.