Statistical Methods for Developmental Toxicity: Analysis of Clustered Multivariate Binary Data


LOUISE RYANᵃ AND GEERT MOLENBERGHSᵇ

ᵃ Harvard School of Public Health and Dana-Farber Cancer Institute, 44 Binney Street, Boston, MA 02115, USA

ᵇ Biostatistics, Center for Statistics, Limburgs Universitair Centrum, Universitaire Campus, B-3590 Diepenbeek, Belgium

ABSTRACT: This paper discusses some of the statistical issues that arise from developmental toxicity studies, wherein pregnant mice are exposed to chemicals in order to assess possible adverse effects on developing fetuses. We begin with a review of some current approaches to risk assessment, based on NOAELs, and provide justification for the use of methods based on dose-response models. Due to the hierarchical nature of the data, such models are more complicated in the present context than, say, in cancer studies. For example, multivariate binary outcomes arise when each fetus in a litter is assessed for the presence of malformations and/or low birth weight. We describe a multivariate exponential family model that works well for these data and that is flexible in terms of allowing response rates to depend on cluster size. Maximum likelihood estimation of model parameters and the construction of score tests for dose effect are briefly discussed. Results are illustrated with data from several NTP studies.

INTRODUCTION

Society is becoming increasingly concerned about environmental impacts on fertility and pregnancy, birth defects, and developmental abnormalities. Consequently, regulatory agencies such as the U.S. Environmental Protection Agency (EPA) and Food and Drug Administration (FDA) have placed an increased priority on researching and identifying causes of these problems to protect the public from environmental exposures that may contribute to these risks. Although standard study designs and statistical methods for quantitative risk assessment have emerged for evaluating cancer risks, the area of quantitative risk assessment for developmental and reproductive toxicity remains a relatively new field of study. In general, risk assessment for reproductive and developmental toxicity must rely heavily on data from controlled chemical experiments, since epidemiologic data tend to be limited in this area. Because such laboratory studies involve considerable time and expense, as well as large numbers of animals, it is essential that appropriate, efficient statistical models are used to assess these types of non-cancer risks.

ᵃ Address for correspondence: 617-632-3602 (voice); 617-632-2444 (fax). e-mail: [email protected]


197RYAN & MOLENBERGHS: STATISTICAL METHODS FOR TOXICITY

Regulatory approaches for developmental toxicity currently employed by both the EPA and the FDA are based on calculation of no-observed-adverse-effect levels (NOAELs). One of the underlying assumptions that has motivated use of NOAELs is that there is a threshold of exposure for each environmental agent below which developmental effects will not occur. The NOAEL is defined as the experimental dose level immediately below the lowest dose that produces a statistically or biologically significant increase in adverse effects as compared to the control group. An acceptably safe daily dose level for humans is then calculated by dividing the NOAEL by a safety factor, usually 100 or 1000, to account for sensitive subgroups of the population and extrapolation from animal data to human risk. The EPA refers to this safe daily concentration as the reference dose (RfD), whereas the FDA uses the term allowable daily intake (ADI). In the event that the lowest experimental dose shows significant changes from control, it is termed a LOAEL (lowest observed adverse effect level) and the safety factor used to determine the RfD or ADI is increased tenfold.
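The safety factor arithmetic described above can be illustrated with a minimal Python sketch. The function name `reference_dose` and the numerical values are hypothetical and serve only to show the calculation; they are not taken from any regulatory guideline or from the studies analyzed in this paper.

```python
def reference_dose(lowest_effect_free_dose, safety_factor=100.0, is_loael=False):
    """Divide a NOAEL (or LOAEL) by a safety factor to obtain an RfD or ADI.

    When the lowest experimental dose already differs from control (a LOAEL),
    the safety factor is increased tenfold, as described in the text.
    """
    if lowest_effect_free_dose <= 0 or safety_factor <= 0:
        raise ValueError("dose and safety factor must be positive")
    factor = safety_factor * (10.0 if is_loael else 1.0)
    return lowest_effect_free_dose / factor


# Hypothetical example: an experimental dose of 50 (arbitrary units).
rfd_from_noael = reference_dose(50.0, safety_factor=100.0)
rfd_from_loael = reference_dose(50.0, safety_factor=100.0, is_loael=True)
print(rfd_from_noael, rfd_from_loael)
```

The sketch makes plain that the result is a single number with no accompanying measure of statistical uncertainty.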

The use of the NOAEL-safety factor approach to determine reference doses has been widely acknowledged as being subject to a number of serious statistical drawbacks (see Refs. 1 and 2). Estimation of the NOAEL is highly sensitive to the experimental design, in terms of the number and spacing of dose groups and the total sample size. The dependence on sample size is anticonservative: large studies have higher power to detect small changes and, therefore, produce lower NOAELs than smaller studies, so that smaller and less informative studies tend to yield higher NOAELs. Another disadvantage of this approach is that it does not provide measures of the statistical variability in estimating the NOAEL. Thus, no estimates of the upper bound on the risk corresponding to the NOAEL or RfD are available. The actual risk levels of an adverse effect at the NOAEL or RfD may vary considerably from one developmental toxicity study to the next, making it difficult to compare environmental agents and prioritize risk management. Yet another problem with the NOAEL approach is that it typically relies on analysis of individual outcomes, rather than forming part of a risk assessment procedure that considers the entire process of fetal development. In other words, a separate NOAEL is computed for each adverse effect of interest (for example, death, malformation, and low birth weight), and the minimum NOAEL is selected for regulatory purposes. This strategy may appear to be conservative, but can actually be anticonservative in situations where there are subtle but consistent changes in a number of outcomes.

ANNALS NEW YORK ACADEMY OF SCIENCES

Because of the limitations in using the NOAEL safety factor approach, interest in developing techniques for dose-response modeling of developmental toxicity data has increased, and new regulatory guidelines [3] emphasize the use of quantitative methods for risk assessment similar to those developed for cancer risk assessment. An alternative approach to using NOAELs that has recently gained enthusiasm from both toxicologists and statisticians is estimation of benchmark doses (BD) [1]. An appropriate dose-response model is first fitted to the animal data to estimate the reference dose, or dose corresponding to a moderate increase (e.g., 1% or 10%) in risk over the background rate. The BD is then defined as the lower 95% or 99% confidence limit on the reference dose [1]. Although it is acknowledged that estimation of risk levels at very low doses can be very sensitive to the choice of a dose-response model, the BD generally occurs within the range of experimental data so that its estimation is fairly robust to model choice. Fitting dose-response models to data from developmental toxicity studies is somewhat more complicated than determining a NOAEL, but it offers a number of important advantages. First, estimated benchmark doses provide a measure of the degree of variability in risk estimation. Second, dose-response models are much more flexible in describing the fetal development process and can account for special features of developmental toxicity data, such as litter effects and multiple outcomes. Finally, the use of dose-response models allows for the incorporation of other covariates of interest that may affect the risk of adverse effects, such as the litter size or the duration and timing of exposure.
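To make the first step of the benchmark dose calculation concrete, the following Python sketch solves for the dose that produces a given increase in risk over background. It rests on several assumptions that are ours and not part of the approach as described: a simple logistic dose-response curve without litter effects, hypothetical coefficient values, and the "extra risk" definition [P(d) − P(0)] / [1 − P(0)] as the measure of increase over background. The second step, the lower confidence limit that defines the BD, is not shown.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def effective_dose(intercept, slope, extra_risk=0.10):
    """Dose d at which [P(d) - P(0)] / [1 - P(0)] equals extra_risk,
    assuming P(d) = expit(intercept + slope * d) with slope > 0."""
    if slope <= 0:
        raise ValueError("sketch assumes risk increases with dose")
    if not 0.0 < extra_risk < 1.0:
        raise ValueError("extra_risk must lie strictly between 0 and 1")
    background = expit(intercept)
    target = background + extra_risk * (1.0 - background)
    return (logit(target) - intercept) / slope


# Hypothetical fitted coefficients (not estimated from the NTP data).
a_hat, b_hat = -3.0, 20.0
ed10 = effective_dose(a_hat, b_hat, extra_risk=0.10)

# Check that the returned dose reproduces the requested extra risk.
p0, p_ed = expit(a_hat), expit(a_hat + b_hat * ed10)
achieved = (p_ed - p0) / (1.0 - p0)
print(ed10, achieved)
```

In practice the model would be fitted to clustered data and the confidence limit would have to reflect the litter effect, which is the subject of the remainder of the paper.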

Several experimental protocols are used in reproductive and developmental studies. Three test designs (Segments I, II, and III) were established by the U.S. FDA in 1966 to assess specific types of effects [4]. The Segment I design, or fertility and reproduction study, is designed to assess male and female fertility and general reproductive ability. Such studies are typically conducted in one species (usually the rat) and involve exposing males for 60 days and females for 14 days prior to mating. Females continue to be exposed after they have been mated, usually until mid-pregnancy [5]. The Segment III design, or peri- and postnatal study, focuses on effects later in gestation and involves exposing pregnant animals from day 15 of gestation through lactation [5].

We focus on data collected from a Segment II design, which is suitable when interest lies in the effects of exposure during the period of major organogenesis and structural development. These experiments have often been referred to as “teratology” studies, since historically, the primary goal was to study malformations. Rats, mice, and occasionally rabbits, are usually chosen as the animal model for these experiments. Administration of the exposure is generally by the clinical or environmental route(s) most closely mimicking human exposure (for example, via food or water, or by inhalation). Timed-pregnant animals (dams) are exposed during the critical period of major organogenesis (days 6–15 for mice and rats, 6–19 for rabbits) and sacrificed just prior to normal delivery, at which time the uterus is removed and the contents are thoroughly examined.

Dose levels for the Segment II design consist of a control group and three or four different dose groups exposed to the test substance. The standard recommendation is to choose the lowest dose to produce no observable maternal toxicity, and increase dose levels gradually to a maximum dose designed to produce some toxicity, but no more than ten percent maternal deaths [3]. Typically, doses are chosen at equally spaced intervals on a linear or logarithmic scale. Ordinarily between 20 and 30 pregnant dams are randomized to each dose group and control, and typical litter sizes (number of live-born offspring) for control animals range from about eight in the rabbit to about 12–14 in mice and rats.

To motivate the methods to be presented later, we consider five developmental toxicity studies conducted by the Research Triangle Institute under contract to the National Toxicology Program (NTP). The studies investigated the effects in mice of five different chemicals: di(2-ethylhexyl)-phthalate (DEHP) [6], ethylene glycol (EG) [7], triethylene glycol dimethyl ether (TGDM) [8], diethylene glycol dimethyl ether (DYME) [9], and theophylline (THEO) [10]. Each study involved a control group and three or four dosed groups, each including 20 to 30 dams with between 2 and 17 offspring per litter. For these experiments, malformations were classified as being external, visceral, and skeletal. Several animals were found to have more than one malformation type. TABLE 1 summarizes the data for each of the studies. The fetuses in a given dose group are cross-classified according to presence or absence of the three malformation indexes, thereby collapsing over litters. For example, the column labeled (1, 1, 1)′ denotes the number of fetuses with all three malformations. Similarly, the column labeled (1, 0, 0)′ denotes those with only external malformation.

Although the analysis might be simplified by collapsing to litter-specific summaries, it is natural here to consider statistical models that account for the multivariate nature of the response, as well as the litter effect induced by the clustering of offspring within dams. Furthermore, since the number of viable fetuses can sometimes be related to the response probability, a model should be flexible enough to allow cluster size to affect response probabilities.

TABLE 1. Developmental toxicity studies

                                    Malformation pattern (external, visceral, skeletal)
Study     Dose  Dams  Live fetuses  (1,1,1)  (1,1,0)  (1,0,1)  (1,0,0)  (0,1,1)  (0,1,0)  (0,0,1)  (0,0,0)
DEHP     0.000    30          330        0        0        0        0        0        5        4      321
         0.025    26          288        0        0        0        3        0        1        1      283
         0.050    26          277        0        5        2        8        2       13        8      239
         0.100    24          137        3        0        5       16        6       12       11       84
         0.150    25           50        8        7        8        4        1        9        7        6
EG           0    25          297        0        0        0        0        0        0        1      296
           750    24          276        0        0        1        2        0        0       23      250
          1500    23          229        0        0        1        3        0        2       83      140
          3000    23          226        1        0       12        3        8        0      105       97
TGDM         0    27          319        0        0        0        1        0        0        0      318
           250    26          275        0        0        0        0        0        0        0      275
           500    26          262        0        0        0        1        0        0        1      260
          1000    28          286        0        0        1       11        0        1       20      253
DYME      0.00    21          282        0        0        0        0        0        1        0      281
         62.50    20          225        0        0        0        0        0        0        0      225
        125.00    24          290        0        0        0        3        0        1        3      283
        250.00    23          261        0        0        2        5        0        2       50      202
        500.00    23          141       16        1       58       18       10        1       28        9
THEO     0.000    26          296        0        0        0        1        0        0        0      295
         0.075    26          278        0        0        0        2        0        0        0      276
         0.150    33          300        0        0        0        5        0        1        1      293
         0.200    23          197        0        0        0        4        0        1        0      192

NOTE: Cross-classification of all individual fetuses by study and dose group, with respect to external, visceral, and skeletal malformations, respectively. 1 denotes presence and 0 denotes absence.
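As a small illustration of how the cross-classification in TABLE 1 is read, the Python sketch below recovers the marginal malformation rates for one row, the DEHP group at dose 0.150. The counts are those reported in the table; the function name `marginal_rates` is ours. Because the table collapses over litters, these are crude fetus-level proportions that ignore clustering.

```python
# Column order of TABLE 1: (external, visceral, skeletal) patterns.
PATTERNS = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0),
            (0, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)]

# DEHP study, dose 0.150: 25 dams, 50 live fetuses.
dehp_top_dose = [8, 7, 8, 4, 1, 9, 7, 6]


def marginal_rates(counts):
    """Return ([external, visceral, skeletal] rates, rate of any malformation)."""
    total = sum(counts)
    rates = []
    for j in range(3):
        affected = sum(c for pattern, c in zip(PATTERNS, counts) if pattern[j] == 1)
        rates.append(affected / total)
    any_rate = 1.0 - counts[-1] / total  # last column is the (0, 0, 0) pattern
    return rates, any_rate


rates, any_rate = marginal_rates(dehp_top_dose)
print(rates, any_rate)
```

The gap between the three marginal rates and the rate of any malformation reflects the fetuses that carry more than one malformation type, which is what motivates a multivariate model.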

As a result of the research activity over the past 10 to 15 years, there are presently several different schools of thought concerning the best approach to the analysis of correlated binary data. Unlike in the normal setting, marginal, conditional, and random-effects approaches tend to give dissimilar results, as do likelihood, quasi-likelihood, and generalized estimating equations (GEE) based inferential methods (for excellent reviews, see Refs. 11–14).

Several likelihood based methods have been proposed. Fitzmaurice and Laird [15] incorporate marginal parameters for the main effects in this model and quantify the degree of association by means of conditional odds ratios. Fully marginal models are available that use marginal correlations [16,17], or a dichotomized version of a multivariate normal [18], to analyze multivariate binary data. Alternatively, marginal odds ratios can be used [19–21]. Cox [17] also describes a model whose parameters have interpretations in terms of conditional probabilities. Similar models were proposed by Rosner [22] and by Liang and Zeger [23]. Random-effects approaches have been studied by Stiratelli, Laird, and Ware [24], Zeger, Liang, and Albert [25], Breslow and Clayton [26], and by Wolfinger and O’Connell [27]. Generalized estimating equations were developed by Liang and Zeger [28].

The debate about the relative merits of the different approaches continues. For several years it seemed that marginal models, particularly GEEs, were the most popular, perhaps due to their relative computational ease and the availability of good software. It is noteworthy that the recent renewed interest in random-effects models is partly provoked by the availability of the NLMIXED procedure in SAS. There are merits and disadvantages to all three model families and generally no simple transformations between the three families exist. Arguably, model choice has to depend not only on the application of interest but also on the specific analysis goals.

Because of the need to account for litter effects, all these issues of modeling strategy arise with developmental toxicity data. Several additional issues complicate the analysis. For example, cluster sizes vary and can affect response rates, perhaps due to competition between littermates or underlying health of the mother [29]. Also, it is often important to account for the multivariate nature of the outcomes measured on each littermate. Random-effects models (beta-binomial [30]) were among the first proposed for developmental toxicity data [31]. However, they do not extend naturally to multivariate outcomes. Lefkopoulou, Moore, and Ryan [32] apply generalized estimating equations ideas to model multiple binary outcomes measured on clusters of individuals. Although their approach is simple to apply and leads to easily interpreted tests, a disadvantage is lack of a likelihood basis. Furthermore, there are some regions of the parameter space in which the method can be quite inefficient [33]. The approach does not lend itself well to quantitative risk assessment.

Due to the popularity of marginal (especially GEE) and random effects models for correlated binary data, conditional models have received relatively little attention, especially in the context of multivariate clustered data. A notable exception is Liang and Zeger [23], although this approach was criticized (see page 147 in Ref. 11) because the interpretation of the dose effect on the risk of one outcome is conditional on the responses of other outcomes for the same individual, outcomes of other individuals, and the litter size. As Molenberghs and Ryan [34] discuss, however, there are some advantages to conditional models and with appropriate care the disadvantages can be overcome. This paper aims to discuss conditional models and their relative merits. Molenberghs, Declerck, and Aerts [20] and Aerts, Declerck, and Molenberghs [35] have compared marginal, conditional, and random-effects models for univariate clustered data when the focus is on either testing for dose effect or on benchmark dose estimation. Their results are encouraging for the conditional model, which combines a likelihood basis with numerical stability and reasonable computing time. In addition to inferential agreement between the models, especially when the likelihood ratio statistic is chosen, an unrestricted parameter space and computational convenience are among the main advantages. The model proposed here exhibits a high flexibility in capturing different patterns of nonlinear dependencies of the marginal probabilities on the cluster size.

In the next section we present the conditional probability model of Molenberghs and Ryan [34], which allows for clustering, as well as for multiple outcomes. In the absence of clustering, the model reduces to the conditional model of Cox [17]. The third section describes likelihood based inferential techniques, including a score test. The fourth section applies the model to the NTP data introduced earlier, and compares the results with those obtained by applying other available techniques for multiple outcomes.

MODEL FORMULATION

Consider an experiment involving $N$ clusters, the $i$th of which contains $n_i$ individuals, each of whom is examined for the presence or absence of $M$ different responses. In the NTP data, we have $M = 3$, referring to external, visceral, and skeletal malformations. Suppose that $y_{ijk} = 1$ when the $k$th individual in cluster $i$ exhibits the $j$th response, and $-1$ otherwise. We use this coding rather than 1 and 0 since it provides a parametrization that more naturally leads to desirable properties when the roles of success and failure are reversed [36]. Let $\mathbf{Y}_i$ represent the vector of outcomes for the $i$th cluster, and $\mathbf{x}_i$ an associated vector of cluster level covariates. To develop our approach, we first consider a model for unclustered outcomes, then extend it to clustered outcomes in the univariate and multivariate settings. The joint densities are presented in this section and the log-likelihood is constructed in the next section.

No Clustering

First, suppose there is no clustering ($n_i = 1$; $i = 1, \dots, N$). Because $k \equiv 1$ in this setting, we drop this index temporarily from our notation. The observable outcome is thus $\mathbf{Y}_i = (y_{i1}, \dots, y_{iM})^T$. We assume the following probability mass function:

$$
f_Y(\mathbf{y}_i; \boldsymbol{\Theta}_i, n_i) = \exp\Biggl\{ \sum_{j=1}^{M} \theta_{ij} y_{ij} + \sum_{j<j'} \omega_{ijj'} y_{ij} y_{ij'} - A(\boldsymbol{\Theta}_i) \Biggr\}, \tag{1}
$$

where $\boldsymbol{\Theta}_i$ represents the vector of all unknown natural parameters and $A(\boldsymbol{\Theta}_i)$ is a normalizing constant, resulting from summing (1) over all possible outcomes. Covariate models for $\boldsymbol{\Theta}_i$ will be discussed in the next section.
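For small $M$ the normalizing constant in (1) can be obtained by brute-force enumeration of the $2^M$ outcome patterns. The Python sketch below does this under the −1/1 coding. The function name, the zero-based outcome indices, and the parameter values are illustrative assumptions and are not estimates from any data set.

```python
import itertools
import math


def pmf_no_clustering(theta, omega):
    """Probability of every outcome pattern y in {-1, 1}^M under model (1).

    theta: list of M main effects.
    omega: dict mapping a pair (j, j2) with j < j2 (zero-based) to its
           pairwise interaction parameter.
    """
    M = len(theta)

    def exponent(y):
        main = sum(theta[j] * y[j] for j in range(M))
        pairs = sum(w * y[j] * y[j2] for (j, j2), w in omega.items())
        return main + pairs

    outcomes = list(itertools.product((-1, 1), repeat=M))
    # A is the log of the sum of the unnormalized terms over all 2^M patterns.
    A = math.log(sum(math.exp(exponent(y)) for y in outcomes))
    return {y: math.exp(exponent(y) - A) for y in outcomes}


# Hypothetical parameter values for M = 3 outcomes.
pmf = pmf_no_clustering([-1.0, -0.5, 0.2], {(0, 1): 0.3, (0, 2): 0.1, (1, 2): 0.4})
total = sum(pmf.values())

# With M = 1 the model is a logistic model; under the -1/1 coding the
# success probability is expit(2 * theta).
theta1 = 0.7
pmf_single = pmf_no_clustering([theta1], {})
logistic = 1.0 / (1.0 + math.exp(-2.0 * theta1))
print(total, pmf_single[(1,)], logistic)
```

Enumeration is feasible here only because $M$ is small; it is meant as a check on the formula rather than as a fitting strategy.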


Expression (1) follows from the model proposed by Cox [17], by setting the higher order interactions to zero. This model was also used by Zhao and Prentice [37] and by Fitzmaurice, Laird, and Rotnitzky [12], who transformed to marginal parameters. Thélot [38] studied the case where $M = 2$. If $M = 1$, the model reduces to ordinary logistic regression.

Clustered Outcomes

Now consider a single clustered outcome. Because the index $j$ always equals 1, we drop it temporarily from our notation. However, we re-introduce the subscript $k$ to indicate an individual within a cluster. Using logic similar to that of the previous subsection, a natural candidate for the joint distribution of $\mathbf{Y}$ is:

$$
f_Y(\mathbf{y}_i; \tilde{\boldsymbol{\Theta}}_i, n_i) = \exp\Biggl\{ \sum_{k=1}^{n_i} \tilde{\theta}_i y_{ik} + \sum_{k<k'} \tilde{\delta}_i y_{ik} y_{ik'} - A(\tilde{\boldsymbol{\Theta}}_i) \Biggr\}, \tag{2}
$$

with $\tilde{\delta}_i$ describing the association between pairs of individuals within the $i$th cluster. If we define the number of individuals from cluster $i$ with positive response to be $z_i$, then (2) becomes

$$
f_Y(\mathbf{y}_i; \tilde{\boldsymbol{\Theta}}_i, n_i) = \exp\Biggl\{ \tilde{\theta}_i (2 z_i - n_i) + \tilde{\delta}_i \Biggl[ \binom{n_i}{2} - 2 z_i n_i + 2 z_i^2 \Biggr] - A(\tilde{\boldsymbol{\Theta}}_i) \Biggr\}.
$$

On absorbing constant terms into the normalizing constant and applying the trivial reparametrization $\theta_i = 2\tilde{\theta}_i$ and $\delta_i = 2\tilde{\delta}_i$, we obtain

$$
f_Y(\mathbf{y}_i; \tilde{\boldsymbol{\Theta}}_i, n_i) = \exp\Bigl\{ \theta_i z_i^{(1)} + \delta_i z_i^{(2)} - A(\tilde{\boldsymbol{\Theta}}_i) \Bigr\}, \tag{3}
$$

with $z_i^{(1)} = z_i$ and $z_i^{(2)} = -z_i (n_i - z_i)$. Independence corresponds to $\delta_i = 0$. Positive and negative values of $\delta_i$ correspond to over- and underdispersion, respectively. It is worth noting that even for underdispersion no restrictions are required on the parameter space. This feature is in contrast to other models for clustered data, for example the Bahadur [16] model and the beta-binomial model [39].

Model (3) has several desirable properties. First, it is clearly invariant to interchanging the codes of successes and failures; hence the tests, derived in the next section, will be invariant to this change as well. Second, consider the conditional probability of observing a positive response in a cluster of size $n_i$, given that the remaining littermates yield $z_i - 1$ successes:

$$
P(y_{ik} = 1 \mid z_i - 1, n_i) = \frac{\exp[\theta_i + \delta_i (2 z_i - n_i - 1)]}{1 + \exp[\theta_i + \delta_i (2 z_i - n_i - 1)]}, \tag{4}
$$

which decreases to zero when $n_i$ increases and $z_i$ is bounded, and approaches unity for increasing $n_i$ and bounded $n_i - z_i^{(1)}$, whenever there is a positive association between outcomes. From (4) it is clear that the conditional logit of an additional success, given $z_i - 1$ successes, equals $\theta_i + \delta_i (2 z_i - n_i - 1)$. Thus, on noting that the second term vanishes if $z_i - 1 = (n_i - 1)/2$, $\theta_i$ is seen to be the conditional logit for an additional success when about half of the littermates exhibit a success already. Similarly, the log odds ratio for the responses between two littermates is equal to $2\delta_i$, confirming the association parameter interpretation of the $\delta$-parameter.
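The relationship between (3) and (4) can be checked numerically. The Python sketch below is a minimal illustration under assumptions of our own: it works with the distribution of the success count $z_i$ (so each weight includes the binomial coefficient that counts exchangeable arrangements), and the function names and parameter values are hypothetical. It also computes the mean of $z_i/n_i$ as a function of cluster size.

```python
import math


def success_count_pmf(n, theta, delta):
    """P(Z = z), z = 0..n, implied by model (3) for a cluster of size n."""
    logs = [math.log(math.comb(n, z)) + theta * z - delta * z * (n - z)
            for z in range(n + 1)]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]


def conditional_success_prob(z, n, theta, delta):
    """Equation (4): probability of a success for one littermate, given that
    the remaining n - 1 littermates yield z - 1 successes."""
    eta = theta + delta * (2 * z - n - 1)
    return 1.0 / (1.0 + math.exp(-eta))


def mean_proportion(n, theta, delta):
    """Marginal expectation of z / n, a non-linear function of n."""
    pmf = success_count_pmf(n, theta, delta)
    return sum(z * p for z, p in enumerate(pmf)) / n


# Hypothetical parameter values; how the mean varies with litter size.
theta, delta = -1.0, 0.1
for n in (2, 5, 10, 15):
    print(n, round(mean_proportion(n, theta, delta), 4))
```

With $\delta_i = 0$ the count distribution is binomial and the mean no longer depends on $n_i$, which provides a simple sanity check on the code.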


Finally, because our model is conditional in nature, the marginal expectation of $z_i^{(1)}/n_i$ is clearly a (non-linear) function of $n_i$. This expectation can be easily calculated and plotted to explore the relationship between cluster size and response probability. Alternatively, methods similar to those of Cox and Wermuth [36] could be applied to develop approximate expressions for the marginal means and odds ratios. In any case, both (4) and the mean function can be investigated graphically to assess the plausibility of the fitted model.

Consider now an extension of the proposed model to multiple outcomes, with the additional subscript $j$ returning as in (1) to indicate outcome type. The joint distribution of the outcome vector $\mathbf{Y}$ is:

$$
f_Y(\mathbf{y}_i; \tilde{\boldsymbol{\Theta}}_i, n_i) = \exp\Biggl\{ \sum_{j=1}^{M} \sum_{k=1}^{n_i} \tilde{\theta}_{ij} y_{ijk} + \sum_{j=1}^{M} \sum_{k<k'} \tilde{\delta}_{ij} y_{ijk} y_{ijk'} + \sum_{j<j'} \sum_{k=1}^{n_i} \tilde{\omega}_{ijj'} y_{ijk} y_{ij'k} + \sum_{j<j'} \sum_{k \neq k'} \tilde{\gamma}_{ijj'} y_{ijk} y_{ij'k'} - A(\tilde{\boldsymbol{\Theta}}_i) \Biggr\}, \tag{5}
$$

where $A(\tilde{\boldsymbol{\Theta}}_i)$ is again a normalizing constant, resulting from summing (5) over all $2^{M n_i}$ possible outcomes. All model parameters depend on the subscript $i$, indicating that their values may vary from cluster to cluster through the covariate vector $\mathbf{x}_i$ for each cluster. The parameters $\theta_{ij}$ (main effect associated with the presence of outcome $j$ for an individual in cluster $i$), $\delta_{ij}$ (association between two different individuals for the same cluster) and $\omega_{ijj'}$ (association between outcomes $j$ and $j'$ for a single individual within cluster $i$) have been discussed previously, in (1) and (3). Model (5) incorporates one further association parameter ($\gamma_{ijj'}$) to characterize the relationship between outcomes $j$ and $j'$ for two different individuals in the same cluster. The absence of individual-specific subscripts reflects the implicit exchangeability assumption between any two individuals within the same cluster.

After reparametrization and absorption of constant terms into the normalizing constant, model (5) can be rewritten parsimoniously as:

$$
f_Y(\mathbf{y}_i; \tilde{\boldsymbol{\Theta}}_i, n_i) = \exp\Biggl\{ \sum_{j=1}^{M} \theta_{ij} z_{ij}^{(1)} + \sum_{j=1}^{M} \delta_{ij} z_{ij}^{(2)} + \sum_{j<j'} \omega_{ijj'} z_{ijj'}^{(3)} + \sum_{j<j'} \gamma_{ijj'} z_{ijj'}^{(4)} - A(\tilde{\boldsymbol{\Theta}}_i) \Biggr\}, \tag{6}
$$

where the summary statistics $z_{ij}^{(1)}$ to $z_{ijj'}^{(4)}$ are defined as follows. Let $z_{ij}$ be the number of individuals from cluster $i$ positive on outcome $j$ and let $z_{ijj'}$ be the number of individuals in cluster $i$ with both outcomes $j$ and $j'$. Then the summary statistics are

$$
\begin{aligned}
z_{ij}^{(1)} &= z_{ij}, \\
z_{ij}^{(2)} &= -z_{ij} (n_i - z_{ij}), \\
z_{ijj'}^{(3)} &= 2 z_{ijj'} - z_{ij} - z_{ij'}, \\
z_{ijj'}^{(4)} &= -z_{ij} (n_i - z_{ij'}) - z_{ij'} (n_i - z_{ij}) - z_{ijj'}^{(3)}.
\end{aligned} \tag{7}
$$

For the $i$th cluster, these can be thought of as arising from the set of two-by-two tables obtained by cross-classifying every pair of outcomes as in TABLE 2.
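The summary statistics in (7) are straightforward to compute from the outcome matrix of a single cluster. The Python sketch below assumes that the cluster is stored as an $M \times n_i$ NumPy array coded −1/1 with zero-based outcome indices; the function name and the simulated data are ours and purely illustrative. The demonstration also compares each statistic with the corresponding sum of products in (5), to which it should be related by a factor of two plus a constant that depends only on $n_i$.

```python
import numpy as np


def summary_statistics(y):
    """Summary statistics (7) for one cluster.

    y: array of shape (M, n), coded +1 (outcome present) / -1 (absent);
       rows are outcomes, columns are littermates.
    Returns z1, z2 (arrays of length M) and z3, z4 (dicts keyed by the
    outcome pair (j, j2) with j < j2).
    """
    y = np.asarray(y)
    M, n = y.shape
    present = (y == 1)
    z = present.sum(axis=1)          # z_ij: littermates positive on outcome j
    z1 = z.copy()
    z2 = -z * (n - z)
    z3, z4 = {}, {}
    for j in range(M):
        for j2 in range(j + 1, M):
            both = int(np.sum(present[j] & present[j2]))   # z_ijj'
            z3[(j, j2)] = 2 * both - z[j] - z[j2]
            z4[(j, j2)] = (-z[j] * (n - z[j2]) - z[j2] * (n - z[j])
                           - z3[(j, j2)])
    return z1, z2, z3, z4


# Simulated cluster: M = 3 outcomes, 7 littermates (illustrative only).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(3, 7)) * 2 - 1
z1, z2, z3, z4 = summary_statistics(y)
n = y.shape[1]

# Compare with the sums appearing in (5) for the outcome pair (0, 1).
same_fetus = int(np.sum(y[0] * y[1]))                   # sum over k
diff_fetus = int(y[0].sum() * y[1].sum()) - same_fetus  # sum over k != k'
print(same_fetus, 2 * z3[(0, 1)] + n)
print(diff_fetus, 2 * z4[(0, 1)] + n * (n - 1))
```

The factor of two in these identities is what the reparametrization from the tilde parameters in (5) to the parameters in (6) absorbs.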


In the most general case, it is useful to think in terms of the three different typesof associations captured in the model. They are depicted in FIGURE 1.

As discussed already in the single variable clustered case, model (6) is condition-al in nature. Indeed, the model implies conditional odds and odds-ratios that are log-linear in the natural parameters. Furthermore, the conditional logit associated withthe presence and absence of outcome j for an individual k in cluster i, depends oncluster size, as discussed above in the single outcome case. It also depends on theobserved pattern of the remaining outcomes. Similar to (4), let us construct the logitof a success, conditional on all other outcomes in the same cluster. Let κijk = 1 if thekth individual exhibits a success on the jth variable and 0 otherwise. Then

This function clearly depends on the cluster size. The conditional log odds ratios, as-sociated with any two components of the vector Yi, reduce to:

pr Y ijk 1|yij ′k ′ j′ j or k ′ k≠≠,=( )

pr Y ijk 1– |yij ′k ′ j′ j or k ′ k≠≠,=( )-----------------------------------------------------------------------------------log θij δij 2zij ni– 1–( )

ωijj ′ 2κij ′k 1–( )j ′ j≠∑ γ ijj ′ 2zij ′ ni– 2κij ′k 1+–( )

j ′ j≠∑ .

+

+ +

=

TABLE 2. Cross-classification of littermates with respect to outcome variable pairs

Outcome

Outcome

Absent Present

Absent

Present

j′

j

zijj ′ zij ′

zij ni

FIGURE 1. Association structure: outcomes and individuals and , cluster i.j j′ k k′

Page 10: Statistical Methods for Developmental Toxicity: Analysis of Clustered Multivariate Binary Data

205RYAN & MOLENBERGHS: STATISTICAL METHODS FOR TOXICITY

$$\log \mathrm{OR}(Y_{ijk}, Y_{ijk'} \mid i) = 2\delta_{ij},$$

$$\log \mathrm{OR}(Y_{ijk}, Y_{ij'k} \mid i) = 2\omega_{ijj'},$$

$$\log \mathrm{OR}(Y_{ijk}, Y_{ij'k'} \mid i) = 2\gamma_{ijj'}$$

($j \neq j'$, $k \neq k'$), where $\mid i$ symbolizes the conditioning on all other outcomes in litter $i$.

Model (6) includes only pairwise interactions between outcomes. Clearly, in the spirit of log-linear modeling, it could be extended with three-way and higher-order interactions. Furthermore, subscripting the parameters in (5) with the index $k$ allows the inclusion of fetus-specific covariates, such as sex or individual-specific dosing. However, Geys, Molenberghs, and Williams40 have shown that the marginal success probability of one fetus then depends on the covariates of the other fetuses. This would restrict such an extension to a very limited number of applications.

Liang and Zeger23 developed a similar model in which the number of outcomes per individual is variable as well, but neither cluster-level nor individual-level covariates are included. Since they do not code the responses as −1/1, their model is not invariant to reversing the coding.

MODEL FITTING AND INFERENCE

Consider now the issue of modelling covariate effects. A natural example occurs in the case of multigroup comparisons or dose-response modelling, where the covariate $x_i$ represents the treatment group or dose level for the $i$th cluster. In this section, we outline procedures for likelihood-based parameter estimation and testing in this setting.

The first step is to group the summary statistics from (7) into a new $q \times 1$ vector $z_i$, ordered according to the set of natural parameters in the vector $\Theta_i$. Let $\mu_i = E(z_i)$. Applying a linear link function, we assume $\Theta_i = X_i\beta$, with $X_i$ a $q \times p$ design matrix, constructed from the cluster-specific covariates $x_i$, and $\beta$ a $p \times 1$ vector of unknown regression coefficients. Assuming a Poisson or multinomial sampling scheme, the kernel of the log-likelihood is proportional to

$$\ell = \sum_{i=1}^{N} \left\{ z_i^T X_i \beta - A(X_i \beta) \right\},$$

whence the score function becomes

$$U(\beta) = \sum_{i=1}^{N} X_i^T (z_i - \mu_i).$$

The maximum likelihood estimator for $\beta$ is defined as the solution to $U(\beta) = 0$.

Molenberghs and Ryan34 derived a score test for the null hypothesis of no dose effect. Using logic similar to Rotnitzky and Jewell,41 the use of empirically adjusted variances ensures the score test will remain valid under certain model misspecifications, in the sense of White.42 In general, inference can be based on standard likelihood procedures, such as Wald and likelihood ratio statistics. General forms are implemented as GAUSS procedures.
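To make the score equation concrete, the following is a minimal sketch of a Newton-Raphson solution of $U(\beta) = 0$ in a deliberately simplified special case. It assumes a single 0/1-coded binary response per unit (so $q = 1$), for which $A(\theta) = \log(1 + e^{\theta})$ and $\mu = A'(\theta)$ is the logistic function. It is not the multivariate model of this paper, whose normalizing constant $A$ is more involved, and it is not the GAUSS implementation mentioned above. The data and the function names `score_and_information` and `fit` are hypothetical.

```python
import math

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def score_and_information(beta, rows):
    """Score U(beta) = sum_i x_i (z_i - mu_i) and information sum_i x_i x_i^T A''(theta_i).

    rows: list of (x, z) with x = (1, dose) and z in {0, 1} (toy assumption).
    """
    u = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, z in rows:
        theta = x[0] * beta[0] + x[1] * beta[1]   # natural parameter, linear in beta
        mu = expit(theta)                         # mu_i = A'(theta_i)
        w = mu * (1.0 - mu)                       # A''(theta_i)
        for a in range(2):
            u[a] += x[a] * (z - mu)
            for b in range(2):
                info[a][b] += x[a] * x[b] * w
    return u, info

def fit(rows, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations for the root of the score function."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        u, info = score_and_information(beta, rows)
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        # Solve the 2x2 system info * step = u explicitly.
        step0 = (info[1][1] * u[0] - info[0][1] * u[1]) / det
        step1 = (-info[1][0] * u[0] + info[0][0] * u[1]) / det
        beta = [beta[0] + step0, beta[1] + step1]
        if abs(step0) + abs(step1) < tol:
            break
    return beta

# Hypothetical data: response rates 1/4, 2/4, 3/4 at doses 0, 1, 2.
rows = [((1.0, float(d)), float(z))
        for d, zs in [(0, [0, 0, 0, 1]), (1, [0, 1, 0, 1]), (2, [1, 1, 0, 1])]
        for z in zs]
beta_hat = fit(rows)
print(beta_hat)
```

In this toy example the observed logits are exactly linear in dose, so the fitted intercept and slope are $-\log 3$ and $\log 3$ and the score vanishes at the solution. The same iteration applies to the general model once $\mu_i$ and the variance of $z_i$ are available.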


ANALYSIS OF NTP DATA

TABLE 3 summarizes the results of the model fitting for the DEHP data, a detailed tabulation of which can be found in Lefkopoulou and Ryan.33 The model includes main effects for each of the three outcomes, $\theta_{ij} = \sigma_{j1} + \sigma_{j2} n_i + \tau d_i$, including cluster size as a covariate and a common dose effect ($\tau$), as well as association between animals on the same outcome ($\delta_{ij} = \delta_j$), within-animal association parameters ($\omega_{ijj'} = \omega_{jj'}$), and association between animals on different outcomes ($\gamma_{ijj'} = \gamma_{jj'}$).

There is a strong dose effect. The association structure differs with the outcomes: there is a strong cluster effect for skeletal malformation, and the association between a pair of outcomes is higher if skeletal malformation is involved.

TABLE 4 shows the results of several different procedures for testing the null hypothesis of no dose effect ($\tau = 0$) in the DEHP data. Included are the results of likelihood ratio, Wald, and score tests, the latter two with either model-based or empirically adjusted variances. All tests give a similar qualitative impression of a strong dose effect, although the model-based tests tend to have more extreme $\chi^2$ values than their corresponding empirically adjusted counterparts.

TABLE 3. Parameter estimates for DEHP study

Parameter   Value      Naive s.e.   Naive pr > |Z|   Robust s.e.   Robust pr > |Z|
σ11         −4.3484    0.2473       0.0000           0.2822        0.0000
σ12         −0.6883    0.1776       0.0001           0.1983        0.0005
σ21         −4.0325    0.2230       0.0000           0.2659        0.0000
σ22         −0.5355    0.1607       0.0009           0.1812        0.0031
σ31         −4.3913    0.2363       0.0000           0.2996        0.0000
σ32         −0.2637    0.1612       0.1018           0.1663        0.1128
δ1          0.1934     0.1517       0.2021           0.1562        0.2154
δ2          0.2548     0.1275       0.0456           0.0890        0.0042
δ3          0.4461     0.0608       0.0000           0.0649        0.0000
ω12         0.0750     0.4043       0.8529           0.5023        0.8813
ω13         1.1088     0.4060       0.0063           0.4215        0.0085
ω23         0.6524     0.4174       0.1180           0.5353        0.2230
γ12         0.1195     0.1161       0.3034           0.1373        0.3843
γ13         −0.0651    0.0904       0.4712           0.1123        0.5621
γ23         −0.2285    0.0977       0.0193           0.1249        0.0673
τ           18.4194    2.9029       0.0000           4.0660        0.0000

Log-likelihood: −530.47
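As a small arithmetic check on how the Wald tests for $\tau = 0$ relate to the estimates above, the sketch below squares the ratio of the dose estimate to its naive and robust standard errors. It assumes the usual single-parameter Wald form, (estimate / s.e.)$^2$, referred to a $\chi^2$ distribution with one degree of freedom. The function names are illustrative only.

```python
import math

def wald_chi2(estimate, se):
    """Single-parameter Wald statistic: squared ratio of estimate to standard error."""
    return (estimate / se) ** 2

def chi2_1df_pvalue(stat):
    """Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))

# Dose effect and its standard errors as reported in TABLE 3.
tau_hat = 18.4194
se_naive = 2.9029
se_robust = 4.0660

wald_naive = wald_chi2(tau_hat, se_naive)
wald_robust = wald_chi2(tau_hat, se_robust)

print(round(wald_naive, 2), chi2_1df_pvalue(wald_naive))
print(round(wald_robust, 2), chi2_1df_pvalue(wald_robust))
```

The two statistics agree, to rounding, with the model-based and empirical Wald values of 40.26 and 20.52 reported for the DEHP study. The gap between them comes entirely from the larger robust standard error.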


TABLE 5 compares different score test procedures for a common dose parameter in all three outcomes for three of the five studies described in TABLE 1. The TGDM and THEO studies were omitted from this comparison after inspection of TABLE 1 revealed several marginal zeros. For instance, in the THEO study, there are no animals recorded with more than one malformation. This leads to estimates on the boundary of the parameter space. Due to this sparseness, exact inference would be more appropriate, a topic that is currently under investigation. For the three studies included in TABLE 5, using empirically corrected versions of the tests leads to similar results, whether applied to the Lefkopoulou and Ryan33 model or to that proposed in this paper. Again, model-based tests yield substantially higher $\chi^2$ values.

TABLE 4. DEHP study. Tests for $H_0: \tau = 0$

Test                  Statistic
Likelihood ratio      86.97
Wald, model based     40.26
Wald, empirical       20.52
Score, model based    46.85
Score, empirical      20.62

TABLE 5. Score test statistics ($\chi^2_1$) for a common dose effect

Method                      Study   Univariate extern   Univariate visc   Univariate skel   Univariate any   Multivariate
Lefkopoulou-Ryan            EG      7.16                7.42              29.21             30.34            26.21
Lefkopoulou-Ryan            DYME    22.97               11.32             29.18             32.56            29.05
Lefkopoulou-Ryan            DEHP    16.36               16.47             13.36             22.92            19.77
MR, empirically corrected   EG      5.32                6.12              25.78             26.28            22.82
MR, empirically corrected   DYME    14.58               10.38             21.99             23.71            21.96
MR, empirically corrected   DEHP    15.25               22.78             12.68             25.74            20.62
MR, model based             EG      10.64               11.47             46.59             45.60            46.06
MR, model based             DYME    40.87               8.12              48.55             65.45            52.96
MR, model based             DEHP    31.14               26.95             33.16             53.29            46.85

The observed discrepancy between model-based and empirical tests suggests substantial lack of fit. One likely explanation is that the model assumes that association parameters are constant across clusters. In reality, the degree of association often depends on dose level.43 This point is clearly illustrated with an example using the external and visceral outcomes from the DEHP study. Starting from a model with linear and quadratic dose effects on all parameters, a backward elimination procedure, forced to include all intercepts, yields the following model ($d_i$ represents dose level):

$$\theta_1(d_i) = -30.7 - 0.43 n_i + 0.30 d_i,$$

$$\theta_2(d_i) = -3.77 - 0.43 n_i + 0.60 d_i,$$

$$\delta_1 = 0.47,$$

$$\delta_2 = 0.36,$$

$$\omega(d_i) = -0.04 + 3.06 d_i - 0.92 d_i^2,$$

$$\gamma(d_i) = 0.12 - 0.25 d_i + 0.07 d_i^2.$$

Comparing this model to that with constant associations and linear-dose main effects gives a likelihood ratio statistic of 30.08 on four degrees of freedom, confirming that the constant association model is too simple for these data. Given this more complex model, we can construct a new test for the null hypothesis that $\theta_1(d_i)$ and $\theta_2(d_i)$ are dose independent. The model-based and empirically corrected score test statistics for a dose main effect are 30.07 and 17.90 on two degrees of freedom, respectively. For the underspecified model with no dependence on cluster size, the corresponding statistics are 164.64 and 22.60. Of course, these latter quantities are not well defined, because the corresponding models are not invariant to recoding of the outcome. The discrepancy that is still observed between the model-based and the robust version of the statistic could be due to over-simplification of the higher-order association.

DISCUSSION

We have discussed the analysis of multivariate clustered binary data of the kind that arises in developmental toxicity studies. We described the likelihood-based framework for clustered multivariate binary outcomes that has been proposed by Molenberghs and Ryan.34 Particular emphasis was given to the construction of a score test for assessing dose response. Model-based and empirically adjusted tests were considered. The model-based test is a natural extension of the Cochran-Armitage test for trend, and corresponds exactly to that test in the absence of clustering and for a single outcome. When all cluster sizes are equal, the empirically adjusted score test is identical to the GEE-based score test derived by Lefkopoulou and Ryan.33 For the variable cluster sizes (ranging between 2 and 12) encountered in the NTP data, the model-based score test with empirical variance adjustment was numerically similar to the test derived by Lefkopoulou and Ryan.33

The likelihood basis of the proposed model has both advantages and disadvantages. An important advantage of our approach is that it lends itself well to formulation of exact inferential procedures, a topic that we are presently investigating. Another advantage is that when the model is correctly specified, efficiency can be gained over other procedures such as GEE methods. This was seen with our example. A disadvantage is that in the absence of correct model specification, a model-based variance will not always guarantee valid inference. The fact that our score test reduces, in special cases, to the GEE-based test, however, suggests that there may be classes of models within which our methods may be relatively robust. Furthermore, this correspondence suggests that our derived score test will enjoy the same properties (including good power and an accurate type I error) as the GEE-based test.13 These observations highlight the importance of careful modelling, and suggest the development of model assessment tools as a useful avenue for further research.
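For readers unfamiliar with the Cochran-Armitage trend test referred to above, the following is a minimal sketch of that statistic for a single binary outcome without clustering, the special case to which the model-based score test is said to correspond. It uses the binomial-variance form, which is the score test for a zero dose slope in a logistic model; other texts use slightly different variance conventions. The dose scores and counts are hypothetical and are not taken from the NTP studies.

```python
def cochran_armitage(doses, n, y):
    """Cochran-Armitage chi-square trend statistic (1 df).

    doses: dose score per group; n: group sizes; y: responders per group.
    """
    total_n = sum(n)
    p_bar = sum(y) / total_n  # pooled response rate under the null hypothesis
    # Numerator: dose-weighted excess of observed over expected responders.
    t = sum(d * (yj - nj * p_bar) for d, nj, yj in zip(doses, n, y))
    # Null variance of t under independent binomial sampling.
    s1 = sum(nj * d for d, nj in zip(doses, n))
    s2 = sum(nj * d * d for d, nj in zip(doses, n))
    var_t = p_bar * (1.0 - p_bar) * (s2 - s1 * s1 / total_n)
    return t * t / var_t

# Hypothetical four-group experiment with increasing response rates.
doses = [0, 1, 2, 3]
n = [20, 20, 20, 20]
y = [2, 4, 8, 12]
stat = cochran_armitage(doses, n, y)
print(stat)
```

With litter effects present, the null variance in this formula understates the true variability, which is the motivation for the empirically adjusted versions of the score test.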

ACKNOWLEDGMENTS

We gratefully acknowledge support from National Institutes of Health Grant CA48061, the U.S. Environmental Protection Agency, and NATO Collaborative Research Grant 950648.

REFERENCES

1. CRUMP, K. 1984. A new method for determining allowable daily intakes. Fund. Appl. Toxicol. 4: 854–871.

2. LEISENRING, W. & L.M. RYAN. 1992. Statistical properties of the NOAEL. Regulatory Toxicology and Pharmacology 15: 161–171.

3. U.S. ENVIRONMENTAL PROTECTION AGENCY. 1991. Guidelines for developmental toxicity risk assessment. Federal Register 56: 63798–63826.

4. FOOD AND DRUG ADMINISTRATION. 1966. Guidelines for reproduction and studies for safety evaluation of drugs for human use. Bureau of Drugs. Rockville, MD.

5. KIMMEL, C.A. & C.J. PRICE. 1990. Developmental toxicity studies. In Handbook of in Vivo Toxicity Testing. 271–300. Academic Press.

6. TYL, R.W., M.C. PRICE, M.C. MARR & C.A. KIMMEL. 1988. Developmental toxicity evaluation of dietary di(2-ethylhexyl)phthalate in Fischer 344 rats and CD-1 mice. Fund. Appl. Toxicol. 10: 395–412.

7. PRICE, C.J., C.A. KIMMEL, J.D. GEORGE & M.C. MARR. 1987. The developmental toxicity of diethylene glycol dimethyl ether in mice. Fund. Appl. Toxicol. 8: 115–126.

8. GEORGE, J.D., C.J. PRICE, C.A. KIMMEL & M.C. MARR. 1987. The developmental toxicity of triethylene glycol dimethyl ether in mice. Fund. Appl. Toxicol. 9: 173–181.

9. PRICE, C.J., C.A. KIMMEL, R.W. TYL & M.C. MARR. 1985. The developmental toxicity of ethylene glycol in rats and mice. Toxicol. Appl. Pharmacol. 81: 113–127.

10. LINDSTROM, P., R.E. MORRISSEY, J.D. GEORGE, C.J. PRICE, M.C. MARR, C.A. KIMMEL & B.A. SCHWETZ. 1990. The developmental toxicity of orally administered theophylline in rats and mice. Fund. Appl. Toxicol. 14: 167–178.

11. DIGGLE, P.J., K.-Y. LIANG & S.L. ZEGER. 1994. Analysis of Longitudinal Data. Clarendon Press, Oxford.

12. FITZMAURICE, G.M., N.M. LAIRD & A. ROTNITZKY. 1993. Regression models for discrete longitudinal responses. Stat. Sci. 8: 284–309.

13. LEGLER, J.M., M. LEFKOPOULOU & L.M. RYAN. 1995. Efficiency and power of tests for multiple binary outcomes. J. Am. Stat. Assoc. 90: 680–693.

14. PENDERGAST, J.F., S.J. GANGE, M.A. NEWTON, M.J. LINDSTROM, M. PALTA & M.R. FISHER. 1996. A survey of methods for analyzing clustered binary response data. Int. Stat. Rev. 64: 89–118.

15. FITZMAURICE, G.M. & N.M. LAIRD. 1993. A likelihood-based method for analysing longitudinal binary responses. Biometrika 80: 141–151.


16. BAHADUR, R.R. 1961. A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction. H. Solomon, Ed.: Stanford Mathematical Studies in the Social Sciences VI. Stanford University Press, Stanford.

17. COX, D.R. 1972. The analysis of multivariate binary data. Appl. Stat. 21: 113–120.

18. ASHFORD, J.R. & R.R. SOWDEN. 1970. Multivariate probit analysis. Biometrics 26: 535–546.

19. DALE, J.R. 1986. Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics 42: 909–917.

20. MOLENBERGHS, G., L. DECLERCK & M. AERTS. 1997. Misspecifying the likelihood for clustered binary data. Computational Statistics and Data Analysis 26: 327–349.

21. MOLENBERGHS, G. & E. LESAFFRE. 1994. Marginal modelling of correlated ordinal data using a multivariate Plackett distribution. J. Am. Stat. Assoc. 89: 633–644.

22. ROSNER, B. 1984. Multivariate methods in ophthalmology with applications to other paired-data situations. Biometrics 40: 1025–1035.

23. LIANG, K.-Y. & S.L. ZEGER. 1989. A class of logistic regression models for multivariate binary time series. J. Am. Stat. Assoc. 84: 447–451.

24. STIRATELLI, R., N.M. LAIRD & J.H. WARE. 1984. Random-effects models for serial observations with binary response. Biometrics 40: 961–971.

25. ZEGER, S.L., K.-Y. LIANG & P.S. ALBERT. 1988. Models for longitudinal data: a generalized estimating equation approach. Biometrics 44: 1049–1060.

26. BRESLOW, N.E. & D.G. CLAYTON. 1993. Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88: 9–25.

27. WOLFINGER, R. & M. O’CONNELL. 1993. Generalized linear mixed models: a pseudo-likelihood approach. J. Stat. Comp. Simul. 48: 233–243.

28. LIANG, K.-Y. & S.L. ZEGER. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22.

29. RAI, K. & J. VAN RYZIN. 1985. A dose-response model for teratological experiments involving quantal responses. Biometrics 47: 825–839.

30. WILLIAMS, D.A. 1975. The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics 31: 949–952.

31. CHEN, J.J. & R.L. KODELL. 1989. Quantitative risk assessment for teratological effects. J. Am. Stat. Assoc. 84: 966–971.

32. LEFKOPOULOU, M., D. MOORE & L.M. RYAN. 1989. The analysis of multiple correlated binary outcomes: application to rodent teratology experiments. J. Am. Stat. Assoc. 84: 810–815.

33. LEFKOPOULOU, M. & L.M. RYAN. 1993. Global tests for multiple binary outcomes. Biometrics 49: 975–988.

34. MOLENBERGHS, G. & L.M. RYAN. 1998. Likelihood inference for clustered multivariate binary data. Environmetrics 10: 279–300.

35. AERTS, M., L. DECLERCK & G. MOLENBERGHS. 1997. Likelihood misspecification and safe dose determination for clustered binary data. Environmetrics 8: 613–627.

36. COX, D.R. & N. WERMUTH. 1994. A note on the quadratic exponential binary distribution. Biometrika 81: 403–408.

37. ZHAO, L.P. & R.L. PRENTICE. 1990. Correlated binary regression using a quadratic exponential model. Biometrika 77: 642–648.

38. THÉLOT, C. 1985. Lois logistiques à deux dimensions. Annales de l’Insée 58: 123–149.

39. KLEINMAN, J.C. 1973. Proportions with extraneous variance: single and independent samples. J. Am. Stat. Assoc. 68: 46–54.


40. GEYS, H., G. MOLENBERGHS & P. WILLIAMS. 1997. Analysis of clustered binary data with covariates specific to each observation. In Good Statistical Practice, Proc. 12th International Workshop on Statistical Modelling, Band 5. C.E. Minder and H. Friedl, Eds. Schriftenreihe der Österreichischen Statistischen Gesellschaft, Wien.

41. ROTNITZKY, A. & N.P. JEWELL. 1990. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika 77: 485–497.

42. WHITE, H. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25.

43. KUPPER, L.L., C. PORTIER, M.D. HOGAN & E. YAMAMOTO. 1986. The impact of litter effects on dose-response modeling in teratology. Biometrics 42: 85–98.