
Division of Pharmacoepidemiology And Pharmacoeconomics

Technical Report Series

Year: 2013 #007

Metrics for covariate balance in cohort studies of causal effects

Jessica M. Franklin (a), Jeremy A. Rassen (a), Diane C. Ackermann (b), Dorothee B. Bartels (b,c), Sebastian Schneeweiss (a)

a.) Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
b.) Department of Global Epidemiology, Boehringer Ingelheim GmbH, Ingelheim, Germany
c.) Department of Epidemiology, Hannover Medical School, Hannover, Germany

Series Editors: Sebastian Schneeweiss, MD, ScD; Jerry Avorn, MD; Robert J. Glynn, ScD, PhD; Niteesh K. Choudhry, MD, PhD; Jeremy A. Rassen, ScD; Josh Gagne, PharmD, ScD

Contact Information:
Division of Pharmacoepidemiology and Pharmacoeconomics
Department of Medicine
Brigham and Women’s Hospital and Harvard Medical School
1620 Tremont St., Suite 3030
Boston, MA 02120
Tel: 616-278-0930
Fax: 617-232-8602


Metrics for covariate balance in cohort studies of causal effects

Jessica A Myers* (1), Jeremy A Rassen (1), Diana Ackermann (2), Dorothee B Bartels (2,3), and Sebastian Schneeweiss (1)

1 Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine

Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
2 Department of Global Epidemiology, Boehringer Ingelheim GmbH, Ingelheim, Germany

3 Department of Epidemiology, Hannover Medical School, Hannover, Germany

May 21, 2013

Abstract

Inferring causation from non-randomized studies of exposure requires that exposure groups can be

balanced with respect to prognostic factors for the outcome. Although there is broad agreement in

the literature that balance should be checked, there is confusion regarding the appropriate metric.

We present a simulation study that compares several balance metrics with respect to the strength

of their association with bias in estimation of the effect of a binary exposure on a binary, count, or

continuous outcome. The simulations utilize matching on the propensity score with successively

decreasing calipers to produce datasets with varying covariate balance. We propose the novel use

of the C-statistic from the propensity score model estimated in the matched cohort as a balance

metric and found that it had consistently strong associations with estimation bias, even when the

propensity score model was misspecified, as long as the propensity score was estimated with

sufficient study size. This metric, along with the average standardized difference and the general

weighted difference, also introduced in this paper, outperformed all other metrics considered,

including the unstandardized absolute difference, Kolmogorov-Smirnov and Lévy distances,

overlapping coefficient, Mahalanobis balance, and L1 metrics. The C-statistic and general weighted

difference also have the advantage that they can evaluate balance on all covariates simultaneously.

Therefore, when combined with the usual practice of comparing covariate means and standard deviations across exposure groups, these metrics may provide useful summaries of the observed covariate imbalance.

* Address correspondence to: Dr. Jessica Franklin, 1620 Tremont St., Suite 3030, Boston, MA 02120; email: [email protected]; ph: 617-278-0675.

Keywords: bias; confounding factors; covariate balance; matching; propensity score

Financial disclosure: This research was funded by a contract from Boehringer Ingelheim

GmbH to the Brigham and Women’s Hospital.


Introduction

Inferring causation from studies of exposure requires that exposure groups can be

balanced with respect to prognostic factors for the outcome. In randomized experiments,

balance is achieved on average for all factors. In nonrandomized studies, prognostic

variables must be measured and balanced via covariate adjustments, such as matching,

stratification, regression, or weighting [1, 2]. Adjustments via a balancing score, such as

the propensity score (PS), can also produce exposure groups that are balanced on

measured predictors of outcome [3, 4]. Although each of these methods is guaranteed to produce unbiased estimates on average across studies under correctly specified modeling

assumptions, the bias in any particular study will depend on the balance achieved in that

study, not on expected balance [5].

With cohort study designs, covariate balance can be empirically verified prior to

analyzing outcomes [6]. If exposure groups remain unbalanced after covariate adjustment,

then modifications to the adjustment procedure may produce better balance and improved

estimates of treatment effect. If balance cannot be achieved via any adjustment methods,

then the two populations may not share sufficient overlap to be compared. Poor overlap of

covariates may be likely in the context of newly marketed medications, where patients and

prescribers are often hesitant to initiate the new treatment except as second-line therapy

or for narrowly defined subindications [7]. Investigators may consider postponing safety or

effectiveness comparisons between new and standard treatments until the patient

populations become more similar.

Although there is broad agreement in the literature that balance should be checked

[8, 9], there is confusion in the medical literature regarding an appropriate metric. Initial


attempts to characterize covariate imbalance, proposed in the context of PS stratification, relied on significance tests for a difference between exposure groups in each covariate [4].

These tests measure the evidence for imbalance in the populations from which study

samples were drawn and are strongly dependent on sample size. More recently, many

investigators rely on examining a simple difference in means with or without

standardization by the covariate standard deviation (SD) [10, 11]. Alternatively, metrics

that characterize balance on the full distribution of individual covariates have been

proposed, including the Kolmogorov-Smirnov distance [12, 13], the Lévy distance [13-15],

and the nonparametric overlapping coefficient [16, 17], also known as the proportion of

similar responses [18]. All of the above metrics measure imbalance on one covariate at a

time. Belitser et al. [19] compared their performance in a simulation study of independent,

normally distributed covariates and concluded that mean differences were superior for

predicting the bias of exposure effect estimates.

A variety of other measures quantify balance on several covariates simultaneously.

The Mahalanobis distance has been extensively discussed in the context of choosing

matches [20-23]. Gu and Rosenbaum [24] proposed a variation on this distance, which

they refer to as Mahalanobis balance, to measure the multivariate distance between

exposure groups, rather than between individuals. Applied studies that utilize PS

adjustment often report the C-statistic of the PS model, which measures the collective

ability of model covariates to discriminate between exposed and unexposed subjects [25].

Austin [26] considered the C-statistic of the PS model as a potential balance diagnostic and

found via simulation that it could not be used to diagnose an incorrect model specification.

However, refitting the PS model after covariate adjustment (for example, after matching)


could yield a C-statistic that summarizes the residual imbalance. Finally, Iacus et al. [27]

introduced the L1 balance metric, which measures the proportion of overlap in the exposed

and unexposed multi-dimensional histograms. While L1 has several desirable properties not available in other balance metrics, such as the automatic evaluation of balance in nonlinear and high-order interaction terms, its utility in predicting estimation bias has not

been demonstrated.

The objective of this paper is to compare balance metrics to determine which

metrics are most strongly associated with estimation bias. We present a Monte Carlo

simulation study that evaluates the balance metrics discussed above. We also propose and

evaluate a novel use of the C-statistic to measure covariate balance within a matched

sample as well as a new balance metric, the general weighted difference (GWD). Because

the associations between covariates and outcome determine the amount of bias caused by

a given imbalance, we consider a variety of covariate associations with outcome. We also

expand on past studies [19] by considering a wide spectrum of covariate distributions,

including binomial, multinomial, and skewed, as well as correlated covariates.

Methods

Bias and Balance

Let $\mathbf{X}_i$ be the covariate vector for subject i. We assume that this vector includes the confounders: factors that influence both the exposure, $T_i$, and the outcome, $Y_i$. Under a constant linear exposure effect, a model for the outcome is given by

$$E(Y_i \mid \mathbf{X}_i, T_i) = \beta_0 + \beta_T T_i + g(\mathbf{X}_i),$$


where g may be any function of the covariate vector. Let $D(X) = \bar{X}_1 - \bar{X}_0$, where $\bar{X}_T$ is the sample mean of the variable X in treatment group T. Conditional on the observed covariate distributions, the bias of the simple, unadjusted exposure effect estimator is given by

$$\mathrm{Bias}(D(Y) \mid \mathbf{X}) = E(\bar{Y}_1 \mid \mathbf{X}, T = 1) - E(\bar{Y}_0 \mid \mathbf{X}, T = 0) - \beta_T$$
$$= \left(\beta_0 + \beta_T + \frac{1}{n_1}\sum_{T_i = 1} g(\mathbf{X}_i)\right) - \left(\beta_0 + \frac{1}{n_0}\sum_{T_i = 0} g(\mathbf{X}_i)\right) - \beta_T$$
$$= D(g(\mathbf{X})),$$

where $n_T$ is the number of study subjects in treatment group T.

For example, if the value of covariate c for individual i is denoted by $X_{ci}$ and $g(\mathbf{X}_i) = \sum_c \beta_c X_{ci}$, indicating additive linear associations between covariates and outcome, then the bias is given by $\sum_c \beta_c D(X_c)$. If instead $g(\mathbf{X}_i) = \beta_{1,1} X_{1i}^2 + \beta_{1,2} X_{1i} X_{2i}$, then the bias is given by $\beta_{1,1} D(X_1^2) + \beta_{1,2} D(X_1 X_2)$. The derivation and examples show that bias depends on

the balance in the functional form of the covariate associations with outcome, as well as the

magnitude of the associations between covariates (or covariate terms) and outcome.

Therefore, it is most important to balance covariates that have the strongest effect on

outcome, and if any nonlinear or interaction terms are present in the outcome model, then

those terms should be balanced as well. When evaluating expected bias under other

outcome models, conclusions are similar (derivations available in the Web Appendix).
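As a quick numerical check of this identity, the following minimal R sketch (with arbitrary, made-up coefficient values that are not from the paper) simulates a confounded continuous outcome and compares the crude bias with D(g(X)); because of the residual error term, the two quantities agree only approximately in a finite sample.

```r
set.seed(1)
n   <- 100000
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.5)
trt <- rbinom(n, 1, plogis(0.5 * x1 + 1.0 * x2))   # confounded exposure
g   <- 2 * x1 + 1.5 * x2                           # g(X): covariate effects on outcome
y   <- 1 + 0.5 * trt + g + rnorm(n)                # true exposure effect beta_T = 0.5

crude_bias <- (mean(y[trt == 1]) - mean(y[trt == 0])) - 0.5
D_gX       <- mean(g[trt == 1]) - mean(g[trt == 0])
round(c(crude_bias = crude_bias, D_gX = D_gX), 3)  # approximately equal
```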

Balance Metrics

In this section, we briefly define the implementation of 10 potential balance metrics.

Metrics 1—5 measure imbalance one covariate (X) at a time. Metrics 6—10 measure

imbalance for all measured covariates (X) simultaneously. When comparing metrics 1—5

with metrics 6—10, we generally averaged across covariates to get a summary balance


measure; weighted averages, as suggested by Belitser et al. [19], are also possible if

outcome information is available.

Note that each of the balance metrics described below should be estimated in the sample that it is meant to describe. For example, to summarize the residual

imbalance that remains after matching, metrics should be calculated in the matched cohort;

in the context of stratification, metrics would be calculated within each stratum; and in the

context of weighting to achieve balance, metrics would be calculated using subject weights

(i.e., the absolute difference would be calculated using the weighted mean in each exposure

group to measure balance in the ‘pseudo-population’). R code for implementation of all

balance metrics is available in Web Appendix 1; an illustrative R sketch of several of the metrics follows the numbered list below.

1. The absolute difference is the absolute value of the difference in covariate means

between treatment groups: $|D(X)|$.

2. The standardized difference is the absolute difference, divided by the pooled within-

group covariate SD: $|D(X)| / \sqrt{(s_1^2 + s_0^2)/2}$, where $s_T^2$ is the sample variance of X in

exposure group T.

3. The overlapping coefficient (OVL) is the proportion of overlap in two density

functions, calculated by finding the area under the minimum of both curves:

$\int \min\{f_1(x), f_0(x)\}\,dx$, where $f_T(x)$ is the density function in exposure group T

estimated with a normal kernel density estimator and the bandwidth suggested by

Scott [28]. The OVL ranges from 0 to 1 with higher values indicating lower imbalance.

To make it comparable with the other metrics, we generally considered 1-OVL.


4. The Kolmogorov-Smirnov (K-S) distance is the maximum vertical distance between

two cumulative distribution functions: $\max_x |\hat{F}_1(x) - \hat{F}_0(x)|$, where $\hat{F}_T(x)$ is the

empirical cumulative distribution function in exposure group T.

5. The Lévy distance is the side length of the largest square that can be inscribed between

two cumulative distribution functions:

$\min_{\epsilon}\{\epsilon > 0 : \hat{F}_0(x - \epsilon) - \epsilon \le \hat{F}_1(x) \le \hat{F}_0(x + \epsilon) + \epsilon \text{ for all } x\}$

Both the K-S distance and the Lévy distance range from 0 to 1 with lower values

indicating better balance. See [19] for details on metrics 3—5.

6. The Mahalanobis balance is defined as $(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_0)' \Sigma^{-1} (\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_0)$, where $\bar{\mathbf{X}}_T$ is the

vector of covariate means in exposure group T and Σ is the sample variance-covariance

matrix of covariates. Lower values indicate better balance.

7. The L1 measure requires specification of a stratification for each continuous covariate

(for categorical covariates, the strata are given by the categories). The cross-tabulation

of the covariate-specific strata results in a set of multi-dimensional covariate bins ℋ,

and the L1 measure is calculated as the sum of the imbalances within each bin:

$0.5 \sum_{H \in \mathcal{H}} |f_1(H) - f_0(H)|$, where $f_T(H)$ is the proportion of subjects in exposure group

T that fall in bin H. This measure varies from 0 to 1 with 0 indicating perfect balance.

When applying the L1 measure, we used the default choice of stratifications as

implemented in the cem package in R [29].

8. The L1 median is a variation on the L1 measure that attempts to weaken its dependence

on a specific stratification. The L1 median is found by first drawing a random sample of

101 multi-dimensional covariate stratifications from the set of all potential

stratifications. The L1 measure is calculated using each of these stratifications in the


original (pre-adjustment) cohort, resulting in 101 L1 values. The stratification that

yields the median L1 value (51st out of 101) is used for calculating the L1 median in the

original dataset and all subsequent adjusted datasets (see [27, 29] for details).

9. The C-statistic is the area under the receiver-operating characteristic (ROC) curve from

the PS model. To utilize this statistic as a balance metric, we propose that it be

calculated in the sample that it is meant to describe, as with all other balance metrics.

Therefore, to describe the residual imbalance after PS matching, the C-statistic should

be re-estimated in the matched cohort. The C-statistic ranges from 0.5 to 1.0 with the

minimum indicating that the PS model has no ability to discriminate between treated

and untreated patients, i.e., perfect balance. To compare with other metrics, we

generally considered $c - 0.5$.

10. We also propose the general weighted difference (GWD), given by:

$\sum_{0 \le a \le b \le C} w_{ab} |D(X_a X_b)|$, where C is the number of measured covariates, $X_0$ is the unit vector, and $w_{ab}$ is a weight assigned to the covariate pair $X_a X_b$. This sum includes the

absolute difference in all individual covariates, all covariate squares, and all pairwise

interactions. Ideally, weights would be based on the strength of association between

covariates (or covariate terms) and outcome, and the empirically derived weights

suggested by Belitser et al. [19] could be extended to develop weights for this purpose.

In the simulations, we assumed that there was no prior knowledge of associations and

that outcome data remained hidden until after matching, as recommended by Rubin

[30]. Thus, as a general-purpose weight, we used $w_{ab} = 1/s_{ab}$ when $a = 0$ and $w_{ab} = 0.5/s_{ab}$ otherwise, giving full weight to differences in individual covariates and half weight to interaction and square terms, standardized by $s_{ab}$, the pooled within-group SD of $X_a X_b$.
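To make the definitions concrete, the sketch below implements several of the metrics in base R for a single cohort (e.g., a matched sample); it is a simplified illustration rather than the Web Appendix code, the exposure indicator `trt` and covariate matrix `X` are hypothetical inputs, and the OVL, Lévy, and L1 metrics are omitted for brevity.

```r
abs_diff <- function(x, trt) abs(mean(x[trt == 1]) - mean(x[trt == 0]))

std_diff <- function(x, trt) {
  s2 <- (var(x[trt == 1]) + var(x[trt == 0])) / 2            # pooled within-group variance
  abs_diff(x, trt) / sqrt(s2)
}

ks_dist <- function(x, trt) {
  # Kolmogorov-Smirnov distance between the two empirical CDFs
  unname(ks.test(x[trt == 1], x[trt == 0])$statistic)
}

mahalanobis_balance <- function(X, trt) {
  X <- as.matrix(X)
  d <- colMeans(X[trt == 1, , drop = FALSE]) - colMeans(X[trt == 0, , drop = FALSE])
  drop(t(d) %*% solve(cov(X)) %*% d)
}

c_statistic <- function(X, trt) {
  # Re-fit the PS model in the sample under evaluation and compute its AUC
  fit <- glm(trt ~ ., data = data.frame(trt = trt, X), family = binomial())
  ps  <- fitted(fit)
  u   <- wilcox.test(ps[trt == 1], ps[trt == 0])$statistic   # Mann-Whitney U
  as.numeric(u) / (sum(trt == 1) * sum(trt == 0))            # AUC = U / (n1 * n0)
}

gwd <- function(X, trt) {
  # General weighted difference with the default weights described above:
  # 1/s_ab for main terms (a = 0), 0.5/s_ab for squares and pairwise interactions
  X <- as.matrix(X); C <- ncol(X); total <- 0
  for (a in 0:C) {
    for (b in max(a, 1):C) {
      term <- (if (a == 0) 1 else X[, a]) * X[, b]
      s_ab <- sqrt((var(term[trt == 1]) + var(term[trt == 0])) / 2)
      if (is.na(s_ab) || s_ab == 0) next                     # skip degenerate terms
      total <- total + (if (a == 0) 1 else 0.5) / s_ab * abs_diff(term, trt)
    }
  }
  total
}
```

For example, the average standardized difference across the columns of a hypothetical matched covariate matrix `X_m` with exposure indicator `trt_m` would be `mean(apply(X_m, 2, std_diff, trt = trt_m))`.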

Example study

Design and analysis

We implemented the methods described in the previous section in a retrospective cohort

study of the short-term effects of nonsteroidal anti-inflammatory drug (NSAID) use on

gastrointestinal (GI) toxicity and myocardial infarctions (MI) [31-33]. Briefly, our study

population included patients enrolled in Medicare and a state drug insurance program for

the elderly provided by either Pennsylvania or New Jersey. We included all patients 65

years and older who initiated use of a nonselective NSAID (ns-NSAID) or the selective COX-2

inhibitor, celecoxib, at any point in 1999, the first year that celecoxib was available.

Exposure was classified as ns-NSAID or celecoxib, based on the first prescription.

Covariates were created to capture known risk factors of NSAID-associated gastrotoxicity

and acute MI and were assessed based on healthcare and prescription claims in the 365

days prior to the first NSAID prescription. Beginning on the day after the first NSAID fill,

patients were followed for outcomes until the first of death, loss of Medicare or drug

benefit eligibility, or December 31, 2005.

We evaluated balance before matching by calculating the mean of all covariates in

each treatment group and by calculating the balance summary using each of the 10 metrics

under study. We estimated crude treatment effects for celecoxib versus ns-NSAIDs on GI

and MI events using bivariate Cox proportional hazards models. We then estimated the PS

with a logistic regression including linear terms for all covariates and squared terms for all


continuous covariates and performed 1:1 nearest neighbor matching using a caliper of

0.028 (0.2 SDs of the PS), in line with recommendations [34]. We repeated the estimation

of balance and treatment effects within the matched cohort.
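A simplified sketch of this analysis in R, using the MatchIt and survival packages rather than the authors' original code, might look as follows; the data frame `cohort`, the covariate names, and the outcome columns are hypothetical stand-ins for the claims-derived variables described above.

```r
library(MatchIt)    # propensity score estimation and matching
library(survival)   # Cox proportional hazards models

# Hypothetical analytic file: one row per patient, exposure indicator `celecoxib`
# (1 = celecoxib initiator, 0 = ns-NSAID initiator), baseline covariates, and
# follow-up time / event indicators for the GI and MI outcomes.

# Crude hazard ratios before matching
crude_gi <- coxph(Surv(gi_days, gi_event) ~ celecoxib, data = cohort)
crude_mi <- coxph(Surv(mi_days, mi_event) ~ celecoxib, data = cohort)

# PS model with linear terms for all covariates and squared terms for the
# continuous ones, then 1:1 nearest-neighbor matching within a caliper of
# 0.2 SDs of the PS (MatchIt interprets `caliper` in SD units by default)
m <- matchit(celecoxib ~ age + I(age^2) + female + osteoarthritis + prior_gi_bleed +
               warfarin + n_generics + I(n_generics^2),
             data = cohort, method = "nearest", ratio = 1, caliper = 0.2)
matched <- match.data(m)

# Hazard ratios in the matched cohort
exp(coef(coxph(Surv(gi_days, gi_event) ~ celecoxib, data = matched)))
exp(coef(coxph(Surv(mi_days, mi_event) ~ celecoxib, data = matched)))
```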

Results

Table 1 presents results from the example study. Before matching, strong differences exist

between patients that initiated celecoxib and patients that initiated a ns-NSAID. By far the

largest difference occurs in the proportion of patients with a history of osteoarthritis (51%

of celecoxib users and 33% of ns-NSAID users). Although this difference is large in

magnitude, it is unlikely to be a major source of bias in the analysis since osteoarthritis is

not expected to have a strong association with either outcome. In contrast, the many small

differences that exist in the strongest markers of elevated GI risk (prior GI hemorrhage,

peptic ulcer disease, prior and current use of gastroprotective drugs, and prior use of oral

steroids) and elevated MI risk (coronary artery disease, peripheral vascular disease,

hypertension, congestive heart failure, and prior use of ARBs, beta-blockers, clopidogrel,

and warfarin) may be more concerning, particularly as all differences are in the same

direction, indicating that celecoxib users are at systematically higher risk of both outcomes.

Many of these differences remain, although reduced, after matching. A small difference in

age (celecoxib users are on average one half year older than ns-NSAID users in the matched

cohort) also remains, further indicating elevated outcome risk in patients using celecoxib.

In evaluating the balance metrics (Table 1), all metrics demonstrate improvement in

balance after matching. While the absolute difference, standardized difference, K-S

distance, and Lévy distance differed in absolute value (and are on different scales), they all

indicated approximately a 70% decrease in imbalance through matching. The Mahalanobis


metric decreased by 89%, the C-statistic decreased from 0.668 to 0.555 (a 67% reduction

in terms of distance from 0.5), and the GWD decreased by 65%. In contrast, 1-OVL, the L1

measure, and the L1 median decreased by 24%, 0.01%, and 2.4%, respectively.

Many of these metrics align well with the decrease in bias seen in the estimated

effects. The effect on GI events, known from randomized trials to be < 1.0 when comparing

celecoxib to ns-NSAIDs, was reduced from 1.20 to 1.14 after matching, although both

estimates likely suffer from unmeasured confounding [32]. Randomized trials evaluating

the comparative effect on cardiovascular events are ongoing [35], but a review of prior

studies found a rate ratio of 1.06 [36]. In the example study, estimated hazard ratios (HRs)

were reduced from 1.12 to 1.05 through matching.

Simulation Study

Data generation and analysis

We conducted a simulation study to evaluate the association between bias and covariate

imbalance as measured by the ten metrics outlined above. In order to produce datasets

with varying balance and bias, in each simulation scenario we simulated 1000 datasets

with strong imbalance on covariates and then matched each dataset multiple times via 1:1

nearest neighbor matching of exposure groups with successively decreasing calipers of 0.8,

0.4, 0.2, 0.1, and 0.05 SDs of the PS, resulting in a total of 6000 datasets with varying

covariate imbalance.

Datasets were simulated with 6 covariates intended to replicate a wide variety of

realistic measured covariates, shown in Table 2. The first three covariates were

continuous: X1 and X3 were normally distributed, and X2 was a right-skewed lognormal


variable. X1 and X2 were on similar scales (X1 had a SD of 1, and X2 was created with a SD of

0.5 on the normal scale, which resulted in a SD of approximately 1 on the lognormal scale),

while X3 was on a much larger scale (SD of 10). Covariates X4 and X5 were binary with

prevalences of 50% and 20%, respectively, and X4 was simulated via a logistic model

conditional on X1 with a log-odds ratio of 2.0 so that X4 was highly correlated with X1 in all

scenarios. Finally, X6 was simulated as an ordered categorical variable with prevalences of

50%, 30%, 10%, 5%, and 5% in categories 1—5, respectively. From these 6 measured

covariates, we created the nonlinear terms X7 = sin(X1) and X8 = X2², and the interaction terms X9 = X3 X4 and X10 = X4 X5. The nonlinear terms were intended to provide generic

nonlinear associations with outcome that might be expected when using covariates like

body mass index or seasonality.
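A minimal sketch of this covariate-generating step in R is shown below; the distributions, prevalences, and the log-odds ratio of 2.0 linking X4 to X1 come from the description above, while the function and object names are ours.

```r
simulate_covariates <- function(n) {
  x1 <- rnorm(n, 0, 1)                                    # continuous, SD 1
  x2 <- exp(rnorm(n, 0, 0.5))                             # right-skewed lognormal
  x3 <- rnorm(n, 0, 10)                                   # continuous, SD 10
  x4 <- rbinom(n, 1, plogis(2.0 * x1))                    # binary, ~50%, correlated with x1
  x5 <- rbinom(n, 1, 0.2)                                 # binary, 20% prevalence
  x6 <- sample(1:5, n, replace = TRUE,
               prob = c(0.50, 0.30, 0.10, 0.05, 0.05))    # ordered categorical
  data.frame(x1, x2, x3, x4, x5, x6,
             x7  = sin(x1),                               # nonlinear terms
             x8  = x2^2,
             x9  = x3 * x4,                               # interaction terms
             x10 = x4 * x5)
}
```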

Exposure, T, was simulated as a binary variable via the logistic model $\mathrm{logit}\{\Pr(T_i = 1)\} = \alpha_0 + \boldsymbol{\alpha}\mathbf{X}_i$, where $\mathbf{X}_i$ is the covariate vector (including squares and interactions) for subject i and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{10})$. The parameters in $\boldsymbol{\alpha}$ define the log odds ratios between covariates and exposure in the pre-matched dataset, and higher absolute values generally indicate more imbalance. In the primary simulation scenarios, the outcome, Y, was simulated as a binary variable via the logistic model $\mathrm{logit}\{\Pr(Y_i = 1)\} = \beta_0 + \boldsymbol{\beta}\mathbf{X}_i + \beta_T T_i$, where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{10})$. Additional simulation scenarios considered continuous outcomes generated via a linear model and Poisson event counts generated via a log-linear model (described in detail in the Web Appendix). The $\boldsymbol{\beta}$ parameters define the association between covariates and outcome as log-odds ratios, and $e^{\beta_T}$ is the true causal effect of exposure on outcome, expressed as an odds ratio.
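Continuing the hypothetical sketch above, exposure and a binary outcome can then be drawn from the two logistic models; `alpha` and `beta` are length-10 coefficient vectors standing in for the scenario-specific values in Table 2.

```r
simulate_exposure_outcome <- function(dat, alpha0, alpha, beta0, beta, beta_T = 0) {
  X <- as.matrix(dat[, paste0("x", 1:10)])
  # Exposure model: logit Pr(T = 1 | X) = alpha0 + alpha' X
  dat$trt <- rbinom(nrow(dat), 1, plogis(alpha0 + drop(X %*% alpha)))
  # Outcome model: logit Pr(Y = 1 | X, T) = beta0 + beta' X + beta_T * T
  dat$y <- rbinom(nrow(dat), 1, plogis(beta0 + drop(X %*% beta) + beta_T * dat$trt))
  dat
}
```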


In each dataset, we estimated two PS models via logistic regression. In the first

model (PS1), we include terms X1—X6 only (no nonlinear or interaction terms). We may

think of this as the “usual practice” setting, where investigators include only main effects of

covariates in the PS model. In the second model (PS2), we include all covariate terms X1—

X10. We may think of this as the “special knowledge” setting, where investigators have

some additional knowledge that these particular nonlinear terms and interactions are

important to balance. Matching was carried out using each of the two estimated PSs.
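For concreteness, the two PS models could be fit as in the following hedged sketch, reusing the hypothetical simulated data frame `dat` (exposure `trt`, covariates x1–x10) from the sketches above.

```r
# PS1 ("usual practice"): main effects of the measured covariates only
ps1_fit <- glm(trt ~ x1 + x2 + x3 + x4 + x5 + x6,
               family = binomial(), data = dat)
# PS2 ("special knowledge"): adds the nonlinear and interaction terms
ps2_fit <- glm(trt ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10,
               family = binomial(), data = dat)
dat$ps1 <- fitted(ps1_fit)
dat$ps2 <- fitted(ps2_fit)
```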

We measured covariate balance using each potential metric before and after each

round of matching. For metrics that measure balance only one covariate at a time, we

applied the metric to each measured covariate X1—X6, and then took the average across

covariates. For calculating the C-statistic, the PS that was used for matching was re-

estimated in the matched data in order to measure the ability of covariates to discriminate

between treated and untreated patients after matching (i.e., to measure balance in the

matched sample). Bias was also calculated before and after each round of matching as the

difference between the estimated crude odds ratio and the true exposure odds ratio, $e^{\beta_T}$.
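Putting these pieces together, one simulation replicate might be sketched as below; `simulate_covariates()`, `simulate_exposure_outcome()`, and `c_statistic()` are the hypothetical helpers from the earlier sketches, matching uses PS1 for illustration, and bias is recorded on the odds-ratio scale as described above.

```r
library(MatchIt)

one_replicate <- function(n, alpha0, alpha, beta0, beta, beta_T = 0) {
  dat <- simulate_exposure_outcome(simulate_covariates(n),
                                   alpha0, alpha, beta0, beta, beta_T)
  calipers <- c(0.8, 0.4, 0.2, 0.1, 0.05)                 # in SDs of the PS
  res <- lapply(calipers, function(cal) {
    m  <- matchit(trt ~ x1 + x2 + x3 + x4 + x5 + x6,      # PS1: main effects only
                  data = dat, method = "nearest", caliper = cal)
    md <- match.data(m)
    # Crude odds ratio in the matched sample and its bias relative to exp(beta_T)
    or   <- exp(coef(glm(y ~ trt, data = md, family = binomial()))["trt"])
    bias <- unname(or) - exp(beta_T)
    # C-statistic of the PS model re-estimated in the matched sample
    cstat <- c_statistic(md[, paste0("x", 1:6)], md$trt)
    c(caliper = cal, bias = bias, c_stat = cstat)
  })
  do.call(rbind, res)
}
```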

To measure the association between each balance metric and bias, we estimated a

separate linear model for each balance metric that included bias as the dependent variable

and linear and squared terms for balance as independent variables. From these models, we

extracted the proportion of variation explained (R2) to measure the strength of the

association. We also extracted the estimated intercept from each model as a measure of

the absolute magnitude of bias relative to measured imbalance; an intercept of 0 is

preferred, indicating that bias and imbalance approach 0 simultaneously. We do not

present the linear correlations between bias and balance, as done in other simulation


studies [19], as we have found that the relative linearity of the association between bias

and balance tends to depend on the region of bias that is explored. Furthermore, we

generally focus on the relative R2 across metrics within the simulation scenario, since the

magnitude of the R2 in a given scenario depends strongly on the proportion of variation

explained in the outcome-generating model. When the cumulative strength of association

between covariates and outcome is weaker, the R2 measuring the association between bias

and balance will also be weaker across all metrics.
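For a given metric, this step reduces to a quadratic regression of bias on the balance summary, as in the following minimal sketch with hypothetical vectors `bias` and `balance` collected over the 6,000 datasets.

```r
fit <- lm(bias ~ balance + I(balance^2))
summary(fit)$r.squared         # proportion of variation in bias explained (R^2)
coef(fit)["(Intercept)"]       # estimated bias when measured imbalance equals zero
```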

Simulation scenarios

We simulated data under 7 sets of values for the sample size, N, and the α and β parameters, as shown in Table 2. In order to avoid issues of ‘non-collapsibility’ of the odds ratio [37], we used $\beta_T = 0$, indicating no effect in all simulation scenarios. In all but one scenario, we used a sample size of N=5000 and chose values for $\alpha_0$ and $\beta_0$ so that the

overall prevalence of exposure is approximately 50% and the overall outcome rate is

approximately 5%. The other parameter values were chosen to create initial datasets with

high imbalance on covariates and relatively strong confounding.

In the base case, neither the outcome-generating nor the exposure-generating model contained any nonlinear or interaction covariate terms, X7—X10. The “base case” was repeated using a

lower exposure prevalence of approximately 20% (the “low exposure prevalence” case)

and using a smaller sample size of N=500 (the “small sample” case). In the “nonlinear

outcome” scenario, nonlinear and interaction terms were present in the outcome-

generating model, and in the “nonlinear outcome and exposure” case, these terms were

present in both outcome and exposure-generating models. In the “redundant covariates”

case, the covariates X1 and X4 were redundant in the sense that they were highly correlated


(as in all simulation scenarios), but only X1 had a direct effect on exposure or outcome. This

scenario was designed to understand the benefits of the Mahalanobis balance, which

utilizes the covariance among variables to avoid over-penalizing imbalance on multiple

covariates that are highly correlated.

The “instrumental variables” case evaluated the performance of balance metrics

when there are instrumental variables (IVs) present in the set of covariates to be balanced.

An IV (also known as an instrument) is a variable that influences exposure but has no

impact on outcome except through its association with exposure. We specified X2 to be an

instrument by setting the coefficients to zero on all terms in the outcome-generating model

involving X2. Past work has shown that balancing instruments can increase bias from

residual confounding [38, 39]. Since all of the balance metrics considered do not

incorporate information on covariate associations with outcome, we expected that all

metrics would perform poorly in this case.

Results

Figure 1 presents the PS distributions for one example dataset from the base case scenario.

As expected, in this case PS1 and PS2 provided almost identical distributions. On both PSs,

the unmatched data were highly imbalanced in the PS. As matching was performed with

increasingly tight calipers, balance on the PS improved incrementally, and it was nearly

perfect when using the smallest caliper. Results for all other simulation scenarios were

very similar to those presented for the base case.

Also for the base case, Figure 2 presents the mean bias on the x-axis versus the

mean imbalance as measured by each of the ten metrics on the y-axis. Means are taken

across the 1000 datasets in each round of matching, so that the right-most point in each


plot shows the average bias and balance in the generated datasets before matching, and the

point nearest to the origin (0,0) shows the average bias and balance for the 1000 datasets

created by matching with the tightest caliper. Bias and imbalance are both shown with

95% quantile bars to show the variation across datasets in each round of matching.

Nearly all of the ten imbalance measures were strongly associated with bias. In

particular, when matching on the correctly specified PS (PS1) both the GWD and the

standardized difference explained 89.2% of the variation in bias. The similarity between

these two measures was expected in this case, since there was no structural imbalance on

nonlinear and interaction terms. The C-statistic, the K-S distance, and the Lévy distance

were also associated with estimation bias, explaining 89.1% of the variation, but the

intercept for the C-statistic was much smaller than for the latter metrics, indicating better

correspondence between the C-statistic and bias. By contrast, both the L1 measure and the

L1 median were poor predictors of bias. In datasets matched with a caliper of 0.05 SDs of

PS1, bias was approximately zero, but the L1 measure with the default bin choice was on

average 0.9, indicating almost perfect separation. In addition, the Mahalanobis balance

indicated nearly perfect balance after matching with a caliper of 0.2 SDs of PS, even though

the average bias was relatively strong in that case. These issues are reflected in the

estimated intercepts far from 0.

Figure 2 also shows that the variation in effect estimates decreased as we matched

with successively smaller calipers, even though the sample size also decreased. This

phenomenon is due to the fact that increasing balance on prognostic covariates reduced the

variation in outcomes, which resulted in improved efficiency of effect estimates.


Figures showing the results of all simulation scenarios, as in Figure 2, are available

in the Web Appendix. Figure 3 summarizes these results. The top panel in Figure 3

presents the estimated intercept for each balance metric and the 5 main simulation

scenarios. The bottom panel presents the variation explained (R2) for each metric as

compared to that of the GWD. This figure shows that many of the balance metrics perform

similarly, but some perform better across simulation scenarios. Specifically, the

standardized difference, the C-statistic, and the GWD nearly always have the highest R2,

indicating strong associations with bias. In addition, these metrics have estimated

intercepts near 0, indicating that they accurately identify the zero-bias scenario. The K-S

distance and the Lévy distance generally have high R2 values, but tend to over-estimate

imbalance when there is 0 bias. The absolute difference was also often strongly associated

with bias, but this association was inconsistent across simulation scenarios and PS models.

The absolute difference, the OVL, and the Mahalanobis balance had consistently weaker

associations with bias than most other metrics, even in the “redundant covariates”

scenario, which was specifically designed to expose the strengths of the Mahalanobis

balance. The L1 measure and L1 median had the weakest associations with bias across all

simulation scenarios, and the results for the L1 measure are not shown, as its values were

outside the plotting range in all plots.

The scenarios not shown in Figure 3, which repeated the base case with a smaller

study size or lower exposure prevalence, had similar results to the scenarios shown, with a

few exceptions. Specifically, both of these scenarios resulted in lower performance for the C-statistic, due to the small number of exposed patients and the resulting imprecise estimation of the PS and, thus, of the C-statistic. However, performance was still comparable to other measures;


for example, in the low study size scenario, the C-statistic explained 72.1% of the variation

in bias, versus 72.2% explained by the best metrics in this scenario (the GWD, Mahalanobis

balance, and standardized difference). Complete results for these scenarios are also

available in the Web Appendix. Finally, results for simulations using other outcome types

(Poisson and continuous) were very similar to the results presented here and are available

in the Web Appendix.

Discussion

In this paper, we evaluated ten potential measures of imbalance with respect to their

correlation with estimation bias, including a new GWD measure and a novel use of the C-

statistic for measuring balance. We found that several measures were consistently good

predictors of bias, but the standardized difference, the C-statistic, and the GWD provided

the best performance. Based on these results, we recommend these measures for use in

measuring covariate imbalance in cohort studies. The GWD may be preferred when an

investigator has some additional knowledge of which covariates are the strongest

predictors of outcome, and thus, most important to balance, because this knowledge can be

incorporated into the specification of weights. The average standardized difference could

similarly be weighted across covariates, using the suggestions of Belitser et al., but these

weightings were not evaluated in this study. When information on outcome associations is

not available, the C-statistic may be preferred. Although the C-statistic depends on the

specification of the PS model, this metric performed well in the simulation studies even

when based on an under-specified PS (that was missing nonlinear and interaction terms)


or an over-specified PS (such as PS2 in the “instrumental variables” scenario). In addition,

the C-statistic has a finite scale that is already familiar to most investigators.

The simulation study also showed that the Mahalanobis balance was generally

highly correlated with bias, but it was nonetheless consistently outperformed by other

measures, even in the scenario specifically designed to demonstrate its strengths.

Furthermore, the L1 measures had consistently weaker associations with estimation bias

than all other metrics, indicating that the default stratification choices as implemented by

these measures do not necessarily perform well.

Although the relative performance of the various metrics was generally consistent

across the simulation scenarios considered, the specific results observed are dependent on

the data-generating process and parameter values chosen. Specifically, the good

performance of the standardized difference across scenarios is likely due, in part, to the

relatively weak effects on outcome for the nonlinear and interaction terms that were

simulated. While we attempted to choose effect sizes that would produce reasonable

outcome models, data generated with stronger effects of nonlinear or interaction terms

would probably reduce the association between the standardized difference and bias, since

this measure does not account for imbalance in these factors. Similarly, although the C-

statistic performed well in the case of a misspecified PS model, this metric is certain to

perform better when the PS is correctly specified with all covariates and covariate terms

affecting outcome; it is similarly certain to perform worse in the case of severe

misspecification. The success of the default weighting scheme in the GWD metric is also

likely somewhat subject to the simulation scenarios chosen, as many highly correlated

covariates may induce very large values of GWD that do not appropriately reflect likely


bias. However, across dozens of simulation scenarios examined (including scenarios not

presented in this paper), the performance of the default GWD weighting was surprisingly

robust.

Most importantly, we have assumed throughout that investigators have identified a

set of covariates that are related to outcome and need to be balanced. Prior work has

shown that balancing some covariates, including mediators (variables on the causal

pathway from exposure to outcome) and IVs of the exposure-outcome relationship, can

increase, rather than decrease, bias from unobserved confounders [38-43]. The

performance of all metrics may be worse if these variables are included in the list of

covariates to be balanced or if important confounders are omitted.

Finally, in this paper we focused on a single overall measure of the imbalance across

covariates. A single metric may be simpler to evaluate and may be particularly useful in

genetic matching, where matching is based on minimizing a loss function for covariate

imbalance [44]. However, the results of this study should not discourage investigators

from the common practice of evaluating balance one covariate at a time. Evaluating the

difference between exposure groups in the mean and SD of an individual covariate can

provide an assessment of balance on the scale of the covariate that can then be directly

interpreted by investigators as to its potential for confounding. In addition, examining

these differences across covariates, as in a forest plot [26], can identify whether the

imbalances are random or if they are clustered in a set of related covariates and represent

real uncontrolled confounding.
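As one hedged illustration of that practice (not code from this paper), per-covariate standardized differences could be displayed in a forest-style dot plot in base R; `std_diffs` is a hypothetical named vector, and the 0.1 reference line is only a commonly cited, essentially arbitrary rule of thumb.

```r
# std_diffs: named vector of standardized differences, one element per covariate
dotchart(sort(std_diffs), xlab = "Standardized difference", pch = 19)
abline(v = 0.1, lty = 2)       # conventional, arbitrary threshold for flagging imbalance
```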


References

1. Billewicz W. The efficiency of matched samples: An empirical investigation. Biometrics 1965; 21: 623-644.
2. Cochran WG. The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 1968; 24: 295-313.
3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41-55.
4. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association 1984; 79: 516-524.
5. Hirano K, Imbens GW, Ridder G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003; 71: 1161-1189.
6. Schneeweiss S, Avorn J. A review of uses of health care utilization databases for epidemiologic research on therapeutics. Journal of Clinical Epidemiology 2005; 58: 323-337.
7. Seeger JD, Williams PL, Walker AM. An application of propensity score matching using claims data. Pharmacoepidemiology and Drug Safety 2005; 14: 465-476.
8. Rubin DB. On principles for modeling propensity scores in medical research. Pharmacoepidemiology and Drug Safety 2004; 13: 855-857.
9. Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 2007; 15: 199-236.
10. Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiology and Drug Safety 2008; 17: 1218-1225.
11. Austin PC. Goodness-of-fit diagnostics for the propensity score model when estimating treatment effects using covariate adjustment with the propensity score. Pharmacoepidemiology and Drug Safety 2008; 17: 1202-1217.
12. Stephens MA. Use of the Kolmogorov-Smirnov, Cramer-Von Mises and related statistics without extensive tables. Journal of the Royal Statistical Society, Series B 1970; 32: 115-122.
13. Pestman WR. Mathematical statistics: An introduction. Walter De Gruyter Inc: Berlin, 1998.
14. Lévy P. Théorie de l'addition des variables aléatoires. Gauthier-Villars, 1937.
15. Zolotarev VM. Estimates of the difference between distributions in the Lévy metric. Trudy Matematicheskogo Instituta im. VA Steklova 1971; 112: 224-231.
16. Bradley E. Overlapping coefficient. Encyclopedia of Statistical Sciences 1985; 6: 546-547.
17. Inman HF, Bradley EL. The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities. Communications in Statistics-Theory and Methods 1989; 18: 3851-3874.
18. Rom DM, Hwang E. Testing for individual and population equivalence based on the proportion of similar responses. Statistics in Medicine 1996; 15: 1489-1505.


19. Belitser SV, Martens EP, Pestman WR, Groenwold RHH, Boer A, Klungel OH. Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and Drug Safety 2011.
20. Cochran WG, Rubin DB. Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A 1973; 35: 417-446.
21. Rubin DB. Bias reduction using Mahalanobis-metric matching. Biometrics 1980; 36: 293-298.
22. Rubin DB. Multivariate matching methods that are equal percent bias reducing, I: Some examples. Biometrics 1976; 32: 109-120.
23. Rubin DB. Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association 1979; 74: 318-328.
24. Gu XS, Rosenbaum PR. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics 1993; 2: 405-420.
25. Stürmer T, Joshi M, Glynn RJ, Avorn J, Rothman KJ, Schneeweiss S. A review of the application of propensity score methods yielded increasing use, advantages in specific settings, but not substantially different estimates compared with conventional multivariable methods. Journal of Clinical Epidemiology 2006; 59: 437.e1-437.e24.
26. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine 2009; 28: 3083-3107.
27. Iacus SM, King G, Porro G. Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association 2011; 106: 345-361.
28. Scott DW. Multivariate density estimation. Wiley Online Library, 1992.
29. Iacus SM, King G, Porro G. CEM: Coarsened exact matching software. Journal of Statistical Software 2009; 30: 1-27.
30. Rubin DB. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine 2007; 26: 20-36.
31. Brookhart MA, Wang P, Solomon DH, Schneeweiss S. Evaluating short-term drug effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology 2006; 17: 268-275.
32. Schneeweiss S, Glynn RJ, Tsai EH, Avorn J, Solomon DH. Adjusting for unmeasured confounders in pharmacoepidemiologic claims data using external information: the example of COX2 inhibitors and myocardial infarction. Epidemiology 2005; 16: 17-24.
33. Schneeweiss S, Solomon DH, Wang PS, Rassen J, Brookhart MA. Simultaneous assessment of short-term gastrointestinal benefits and cardiovascular risks of selective cyclooxygenase 2 inhibitors and nonselective nonsteroidal antiinflammatory drugs: an instrumental variable analysis. Arthritis & Rheumatism 2006; 54: 3390-3398.
34. Austin PC. Optimal caliper widths for propensity-score matching when estimating differences in means and differences in proportions in observational studies. Pharmaceutical Statistics 2011; 10: 150-161.
35. MacDonald TM, Mackenzie IS, Wei L, Hawkey CJ, Ford I, Hallas J, Webster J, Reid D, Ralston S, Walters M. Methodology of a large prospective, randomised, open, blinded endpoint streamlined safety study of celecoxib versus traditional non-steroidal anti-inflammatory drugs in patients with osteoarthritis or rheumatoid arthritis: protocol of the standard care versus celecoxib outcome trial (SCOT). BMJ Open 2013; 3.
36. White WB, Faich G, Borer JS, Makuch RW. Cardiovascular thrombotic events in arthritis trials of the cyclooxygenase-2 inhibitor celecoxib. The American Journal of Cardiology 2003; 92: 411.
37. Miettinen OS, Cook EF. Confounding: Essence and detection. American Journal of Epidemiology 1981; 114: 593-603.
38. Myers JA, Rassen JA, Gagne JJ, Huybrechts KF, Schneeweiss S, Rothman KJ, Joffe MM, Glynn RJ. Effects of adjusting for instrumental variables on bias and precision of effect estimates. American Journal of Epidemiology 2011; 174: 1213-1222.
39. Pearl J. On a class of bias-amplifying variables that endanger effect estimates. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010). AUAI Press, 2010.
40. Greenland S. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 2003; 14: 300-306.
41. Pearl J. Causality: models, reasoning, and inference. Cambridge Univ Press, 2000.
42. Schisterman EF, Cole SR, Platt RW. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology 2009; 20: 488-495.
43. Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. American Journal of Epidemiology 2006; 163: 1149-1156.
44. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. 2010.


Table 1. Covariate balance and treatment effect estimates in the example study. Matching was performed with a caliper of 0.028 (0.2 SDs of the PS).

                                         Before matching                After matching
                                         Celecoxib     ns-NSAID         Celecoxib     ns-NSAID
N                                        8,354         6,430            5,911         5,911

Demographics: mean (sd) / proportion
Age                                      79.9 (6.9)    78.2 (7.0)       79.1 (6.8)    78.6 (6.9)
Female                                   0.85          0.78             0.82          0.81
Black race                               0.09          0.13             0.10          0.11
Other race                               0.03          0.04             0.03          0.03

Comorbidities: proportion
Coronary artery disease                  0.48          0.43             0.45          0.43
Prior GI hemorrhage                      0.07          0.05             0.06          0.05
Peptic ulcer disease                     0.23          0.16             0.19          0.17
Peripheral vascular disease              0.21          0.17             0.19          0.18
Osteoarthritis                           0.51          0.33             0.41          0.35
Rheumatoid arthritis                     0.08          0.04             0.05          0.04
Diabetes mellitus                        0.31          0.31             0.31          0.31
Hyperlipidemia                           0.53          0.50             0.51          0.51
Transient ischemic attack                0.07          0.06             0.06          0.06
Stroke                                   0.10          0.09             0.09          0.09
Angina                                   0.09          0.09             0.09          0.09
New MI                                   0.04          0.04             0.04          0.04
Old MI                                   0.05          0.05             0.05          0.05
Hypertension                             0.80          0.78             0.79          0.78
Congestive heart failure                 0.26          0.21             0.23          0.22
COPD                                     0.24          0.22             0.23          0.22
Chronic kidney disease                   0.06          0.05             0.06          0.05

Medications: proportion
ACE inhibitors                           0.27          0.27             0.27          0.27
ARBs                                     0.09          0.07             0.08          0.07
Beta blockers                            0.34          0.32             0.33          0.33
Clopidogrel                              0.06          0.04             0.06          0.05
Oral steroids                            0.14          0.11             0.13          0.11
Diabetes drugs                           0.17          0.19             0.18          0.19
Gastroprotective drugs                   0.41          0.30             0.35          0.32
Other lipid lowering drugs               0.02          0.02             0.02          0.02
Statins                                  0.25          0.25             0.25          0.25
Warfarin                                 0.12          0.06             0.08          0.07
Concurrent gastroprotective drugs        0.25          0.17             0.20          0.18

Health services intensity: mean (sd)
Number of prior hospitalizations         0.54 (1.1)    0.48 (1.0)       0.50 (1.1)    0.48 (1.0)
Number of days hospitalized              4.4 (11.1)    3.9 (10.3)       4.0 (10.7)    3.8 (10.0)
Number of distinct generics              11.6 (6.1)    10.5 (5.6)       11.0 (5.8)    10.7 (5.7)
Number of days in a nursing home         1.77 (8.7)    1.43 (8.4)       1.56 (8.4)    1.44 (8.2)
Combined comorbidity score               1.76 (2.4)    1.52 (2.4)       1.60 (2.4)    1.54 (2.4)
Number of doctor visits                  11.2 (7.6)    10.1 (7.9)       10.7 (7.4)    10.3 (7.9)

Balance metrics                          Before matching                After matching
Absolute difference                      0.160                          0.049
Standardized difference                  0.095                          0.028
1 – OVL                                  0.065                          0.049
Kolmogorov-Smirnov distance              0.038                          0.012
Lévy distance                            0.036                          0.010
Mahalanobis balance                      0.339                          0.038
C-statistic – 0.5                        0.168                          0.055
L1 measure                               1.000                          0.999
L1 median                                0.851                          0.830
Generalized weighted difference          0.036                          0.012

Estimated treatment effects: HR (95% CI)
GI events                                1.20 (1.01—1.41)               1.14 (0.95—1.37)
MI events                                1.12 (1.00—1.25)               1.05 (0.93—1.18)

Table 2. Covariate terms and parameters for simulation studies. The variable column gives the mean and SD (on the normal distribution scale) for the continuous covariates and the prevalence for the binary covariates. The α values determine the association between covariates and exposure as log odds ratios. The β values determine the association between covariates and outcome as log odds ratios.

Values are given as α / β for each scenario.

Variable                      Base case      Nonlinear      Nonlinear       Redundant      Instrumental   Low exposure   Small study
                                             outcome        outcome and     covariates     variables      prevalence     size
                                                            exposure
                              (N=5000)       (N=5000)       (N=5000)        (N=5000)       (N=5000)       (N=5000)       (N=500)
Intercept                     -3.5 / -5      -3.5 / -3.5    -1 / -3.5       -1.3 / -3.7    -1.3 / -3.7    -5 / -3.3      -3.5 / -3.3
X1   Normal (0, 1)            1.0 / 0.5      1.0 / 0.4      0.8 / 0.4       0.8 / 0.4      0.8 / 0.4      1.0 / 0.5      1.0 / 0.5
X2   Lognormal (0, 0.5)       1.0 / 0.5      1.0 / 0.03     0.06 / 0.03     0.06 / 0.03    0.06 / 0       1.0 / 0.5      1.0 / 0.5
X3   Normal (0, 10)           0.1 / 0.05     0.1 / 0.03     0.06 / 0.03     0.1 / 0.05     0.06 / 0.03    0.1 / 0.05     0.1 / 0.05
X4   Binary (p=0.5)           2.0 / 1.0      2.0 / 0.75     1.5 / 0.75      0 / 0          1.5 / 0.75     2.0 / 1.0      2.0 / 1.0
X5   Binary (p=0.2)           2.0 / 1.0      2.0 / 0.75     1.5 / 0.75      2.0 / 1.0      1.5 / 0.75     2.0 / 1.0      2.0 / 1.0
X6   Ordinal categorical      0.4 / 0.2      0.4 / 0.2      0.4 / 0.2       0.4 / 0.3      0.4 / 0.2      0.4 / 0.2      0.4 / 0.2
X7   sin(X1)                  0 / 0          0 / 0.4        0.8 / 0.4       0.8 / 0.4      0.8 / 0.4      0 / 0          0 / 0
X8   X2²                      0 / 0          0 / 0.02       0.04 / 0.02     0.04 / 0.02    0.04 / 0       0 / 0          0 / 0
X9   X3 X4                    0 / 0          0 / 0.04       0.08 / 0.04     0 / 0          0.08 / 0.04    0 / 0          0 / 0
X10  X4 X5                    0 / 0          0 / 0.5        1.0 / 0.5       0 / 0          1.0 / 0.5      0 / 0          0 / 0


Figure 1: One example dataset from the base case simulation scenario before and after matching on PS1 (left) or PS2† (right). The PS distribution in exposed patients (dashed curve) and unexposed patients (solid curve) in the unmatched data is in the top panel and lower panels are after matching with calipers of 0.8, 0.4, 0.2, 0.1, and 0.05 SDs of the PS (in order from top to bottom). The average number of treated patients across simulations in each sample is shown in the upper left corner.

† PS1 is the estimated PS that includes covariate terms X1—X6 only. PS2 is the estimated PS that includes covariate terms X1—X10.


Figure 2: Base Case. Mean and 95% quantile bars for bias (x-axis) and covariate imbalance (y-axis). Means are taken across 1000 simulated datasets in unmatched data (right-most point) and each matched dataset (moving left as the caliper decreases). Datasets were matched on PS1 (left) and on PS2 (right), and the intercept ($\hat{\beta}_0$) and variation explained (R²) are in the lower right corner.


Figure 3: Results from the 5 main simulation scenarios. The intercept from each metric is plotted in the top panel, and the ratio comparing the variation in bias explained by each balance metric versus the variation explained by the GWD is plotted in the bottom panel. The left and right panels show results when matching on PS1 and PS2, respectively. The x-axis in each plot is the simulation scenario. Results for the L1 measure are not shown because they are outside of the plotting region for all scenarios.