

STATISTICS IN MEDICINE
Statist. Med. 2008; 27:698–710
Published online 11 December 2007 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/sim.3053

Combining data from multiple sources, with applications toenvironmental risk assessment

Louise Ryan∗,†

Department of Biostatistics, Harvard School of Public Health, Boston, MA, U.S.A.

SUMMARY

The classical statistical paradigm emphasizes the development and application of methods to account for sampling variability. Many modern day applications, however, require consideration of other sources of uncertainty that are not so easy to quantify. This paper presents a case study involving an assessment of the impact of in-utero methylmercury exposure on the Intelligence Quotient (IQ) of young children. We illustrate how familiar techniques such as hierarchical modeling, Bayesian methods and sensitivity analysis can be used to aid decision making in settings that involve substantial uncertainty. Copyright © 2007 John Wiley & Sons, Ltd.

KEY WORDS: Bayesian hierarchical model; uncertainty analysis; dose–response modeling

1. INTRODUCTION

Environmental decision making is typically informed by imperfect data and subject to many sources of uncertainty. Because interest is often focused on regulating exposures associated with relatively rare outcomes such as cancer or birth defects, researchers sometimes rely on toxicological experiments where animals can be exposed to very high doses in order to increase the observed event rates and thereby improve statistical power. The downside of using the results from animal studies to set environmental standards is that there is always an element of uncertainty regarding the relevance of animal studies to human health. The use of epidemiological studies avoids the problems associated with animal-to-human extrapolation, but encounters many other issues, including confounding, measurement error and missing data. Indeed, the inherent noisiness of epidemiological data can seriously complicate the dose–response modeling effort needed to establish environmental standards [1]. In practice, environmental regulators need to consider and

∗Correspondence to: Louise Ryan, Department of Biostatistics, Harvard School of Public Health, Boston, MA, U.S.A.
†E-mail: [email protected]

Contract/grant sponsor: National Institutes of Health; contract/grant number: CA48061

Received 18 December 2006; Accepted 23 July 2007
Copyright © 2007 John Wiley & Sons, Ltd.


balance the results of multiple different studies, often with conflicting results, varying endpoints and study designs.

The classical statistical paradigm tends to break down when it comes to considering all these various sources of uncertainty. While it is of course an oversimplification of what happens in practice, the classical paradigm is based on the concept that observed data can be expressed as the sum of a predictable component plus random error. Statisticians are traditionally trained to think primarily in terms of the latter representing sampling error. Motivated largely by game theory, early probabilists developed the familiar probability laws based on the concept of random sampling. Laplace was one of the earliest proponents of the idea that while we can only ever hope to achieve partial knowledge about cause and effect, probability theory provides a means to account for that uncertainty. In general, however, there is relatively little thoughtful discussion in the statistics literature on the nature of the uncertainty grouped into the error term. Some epidemiologists have addressed the issue. For example, Savitz [2] distinguishes between random and systematic error. He describes statistically predictable random error as including sampling error and random allocation. Systematic error represents potentially measurable features that could affect the observable patterns in the data or change conclusions [3]. Some authors refer to knowledge uncertainty, rather than systematic error, dividing it into four main classes:

Completeness: Have all relevant factors been considered? In the epidemiological context, for example, this might translate into the question of whether all the appropriate confounders have been considered.

Scenarios: Is the model flexible enough to apply to other settings besides the one under study? Scenario uncertainty is particularly important in contexts where interest lies in predicting the future. For example, how long will it be until the next big earthquake in San Francisco? Or, what is the chance that a nuclear power plant will have a catastrophic failure in the next 50 years?

Model: Does the assumed model appropriately capture all key aspects of the scientific context of interest? Has the dose–response model been specified correctly?

Parameter: Have any required input parameters been extracted appropriately from the scientific literature or other experiments? This is especially the case for so-called mathematical models that are often theory driven rather than empirical.

Recent years have seen increasing interest in and awareness of the importance of accounting for some of these more general sources of error. For example, there has been extensive work on the topic of measurement error and its impact on statistical modeling. In terms of the taxonomy described above, measurement error might be considered as falling into the category of ‘completeness’ or perhaps ‘model’ uncertainty.

Many authors have also discussed the importance of quantifying different sources of random variation through the use of hierarchical modeling [4, 5] and consideration of variance components [6]. Other authors have discussed the use of model averaging techniques as a way of quantifying the uncertainty associated with model choice [1]. Meta-analysis and research synthesis have also emerged as powerful and popular techniques for combining data and information from multiple sources [7]. Many of these developments have been facilitated through advances in Bayesian computational methods, which now make hierarchical modeling and sensitivity analysis relatively straightforward to accomplish.

This paper presents an elaboration of the statistical issues associated with an effort to combine data from several different studies related to the effects of in-utero methylmercury exposure on childhood development. In a paper recently published in Environmental Health Perspectives (hereafter referred to as the EHP paper), Axelrad et al. [8] use Bayesian hierarchical modeling



techniques to combine dose–response coefficients from a variety of related outcomes measured in three different epidemiological studies. The analysis is fraught with difficulties, including small sample sizes, outliers and data limitations. We illustrate how Bayesian methods can be used to cast a helpful light on the problem and facilitate environmental regulation related to mercury contamination. The case study highlights the importance and, indeed, the necessity of relaxing traditional statistical rigor when faced with decision making under uncertainty, and instead thinking of statistical modeling as a tool for exploring the impact of that uncertainty. Sensitivity analysis plays an essential role.

2. BACKGROUND

Mercury exposure has been associated with a number of adverse health effects. Several notorious accidental poisoning incidents have unequivocally established that acute high-level exposures can lead to serious mental retardation, cerebral palsy, deafness and blindness [9]. However, the impact of chronic low-level exposures associated with the consumption of contaminated fish remains controversial, in large part because several well-designed epidemiological studies have led to differing conclusions. At the request of the United States Environmental Protection Agency (EPA), the National Academy of Sciences (NAS) formed a committee of experts to assess the evidence [9]. After reviewing the literature and all available data, the committee concluded that there were three epidemiologic studies of sufficient quality to consider as the basis of an environmental risk assessment: one conducted in the Faroe Islands [10], one in New Zealand [11, 12] and one in the Seychelles Islands [13, 14]. These three populations were selected for study because fish consumption was known to be high, and this is the predominant source of chronic exposure to methylmercury. All three studies had reliable measurements of prenatal exposure and assessed a variety of neurodevelopmental endpoints in the children. The Faroe Islands and New Zealand studies found a statistically significant relationship between higher prenatal exposures to mercury and poorer scores on tests of neurological function, but the Seychelles study did not. The NAS committee used a hierarchical model to perform a combined analysis of all three studies, concluding that the overall weight of evidence supported the conclusion that even moderate prenatal exposures could lead to measurable decrements in neurological development. Coull et al. [15, 16] present a more detailed description of the statistical approach used by the NAS.

While the NAS report supported the general principle of reducing mercury emissions, further analysis was needed in order to identify a specific targeted reduction. As it has done with a number of other high-profile environmental exposures, EPA set out to do a cost–benefit analysis, weighing the costs of reduced emissions against the benefits in terms of adverse effects avoided. Doing so required the estimation of a dose–response coefficient with respect to an outcome for which a dollar figure could be associated with each unit of decrement. The Intelligence Quotient (IQ) was selected as the endpoint of interest because data related to IQ were available from all three studies, and because methods for economic valuation of IQ decrements are well established and have been used successfully by EPA in a cost–benefit analysis for lead [17]. Performing a combined analysis of the IQ data from the three studies required an extension of the hierarchical modeling approach previously considered by the NAS. The next section describes the available data in more detail before describing our proposed analysis approach.


3. DATA

In the Faroe Islands study, mercury was measured in maternal hair and blood, and neurological examinations of 917 children were conducted at various ages, including the age of 7 years [10]. In the New Zealand study, mercury was measured in maternal hair, and neurological tests were administered to 237 participating children at ages 4 and 6 years [11, 12, 18]. In the Seychelles study, mercury was measured in maternal hair, and neurodevelopmental assessments were conducted at several ages. Most recently, 643 children were assessed at the age of 9 years [14]. Because the original data were not available for combined analysis, our modeling was based on estimated dose–response coefficients and standard errors reported in various publications from the three studies (more details presently). All cognitive endpoints reported in the Faroe Islands (age 7 years), New Zealand (age 6 years) and Seychelles (age 9 years) studies were considered for inclusion in the analysis. For the New Zealand and Seychelles studies, all information necessary for our modeling was obtained from published papers [11, 14]. EPA requested some additional analysis to be performed for the Faroes study [19]. There were a number of technical issues to be considered, for example, the fact that the Faroe Islands investigators chose to report their dose–response analyses only in terms of cord blood, while the Seychelles and New Zealand studies used maternal hair. As described in the EHP paper [8], this was addressed through application of a simple conversion factor to the estimated dose–response coefficient [20]. Another issue was that the Faroes study did not report a standard measure of full-scale IQ. Instead, their estimated dose–response coefficient associated with IQ was based on a structural equations analysis [21, 22] applied to three subtests of the Wechsler Intelligence Scales for Children (WISC) IQ test. This additional analysis, performed by the Seychelles group at the request of EPA, is described in a technical report available on the EPA website.

The EHP paper [8] describes the process whereby we reviewed the endpoints from each of the studies and selected those representing the Cognition/Achievement domain for inclusion in a statistical model. This produced a final set of endpoints for inclusion, shown in Table I, with four endpoints from the Faroe Islands study, four from the New Zealand study, and five from the Seychelles study. In addition to listing the various endpoints, Table I shows the scale factors needed to convert the estimated coefficients to be on the same scale as IQ. The table also shows the rescaled values of the estimated dose–response coefficients and their associated standard errors. Figure 1 shows a graphical representation of the data. The figure and table both give a general impression of a negative association between methylmercury and these cognitive endpoints. Estimates from the Seychelles study are the closest to zero and all associated confidence limits overlap 0. The Faroes estimates are greater in magnitude than those of the Seychelles study and their confidence intervals are slightly tighter, consistent with the fact that the Faroes investigators reported a significant result. Estimates from the New Zealand study are the most extreme, but this study has wide confidence intervals due to the relatively small sample size involved.

For the New Zealand study, two sets of dose–response coefficients were reported [11]: one based on the complete cohort and the other based on an analysis that excludes one very influential observation with unusually high maternal hair mercury. The NRC committee reviewed the influence of the one observation and determined that exclusion of this outlier was reasonable and appropriate [9]. The results shown in Table I are based on regression analyses with this extreme child excluded; coefficients based on an analysis including this child are reported in the EHP paper [8] as part of a sensitivity analysis.


Table I. Estimated regression coefficients and associated standard errors (b and s) for endpoints related to achievement and cognition.

                                                 Original scale                     IQ scale
Study          Endpoint                          b (s)            Scaling factor    b (s)
Seychelles∗    Full-scale IQ†                    −0.130 (0.100)   1.29              −0.17 (0.13)
               California verbal learning        0.013 (0.010)    15.42             0.20 (0.154)
               Boston naming test                −0.012 (0.046)   3.13              −0.038 (0.144)
               Visual motor integration          −0.010 (0.120)   5.17              −0.109 (0.150)
               WRAML‡                            −0.021 (0.029)   1.28              −0.013 (0.15)
Faroes§        Full-scale IQ¶                    −0.024 (0.011)   5.17              −0.124 (0.057)
               Bender visual (copying)‖          0.073 (0.059)    −1.42             −0.104 (0.083)
               Boston naming test                −0.190 (0.063)   1.37              −0.260 (0.086)
               California verbal learning        −0.058 (0.032)   2.91              −0.169 (0.093)
New Zealand∗∗  Full-scale IQ                     −0.18 (0.155)    0.94              −0.50 (0.282)
               Test of language development      −0.19 (0.145)    0.94              −0.51 (0.301)
               Performance IQ                    −0.12 (0.165)    1.5               −0.50 (0.301)
               McCarthy perceptual performance   −0.18 (0.108)    0.94              −0.90 (0.485)

Note: Scaling factor corresponds to 15 divided by the observed standard deviation of the endpoint of interest. Faroes scale factors also include division by a factor of 2 to account for the transformation from mercury in cord blood to mercury in hair.
∗Table 2 or 3 of Myers et al. [14].
†Based on the Wechsler Intelligence Scale for Children (WISC), full-scale assessment.
‡Wide Range Assessment of Memory and Learning.
§Table II of Budtz-Jorgensen et al. [19] (also see Table 4 of Grandjean et al. [10]).
¶An SEM-based combination of three WISC subscales (digit spans, similarities, block designs), scaled to match digit spans.
‖Denotes endpoints where a higher score is adverse.
∗∗Tables I and III of Crump et al. [11]. Estimated standard errors obtained by subtracting the estimated regression coefficient from the reported upper confidence limit and dividing by 2.
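As a concrete check of the rescaling described in the note above, the arithmetic can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the function names are ours):

```python
# Rescale a reported dose-response coefficient to the IQ scale, following the
# note to Table I: scale factor = 15 / (observed SD of the endpoint), with an
# extra division by 2 for Faroes coefficients to convert from cord-blood
# mercury to hair mercury.

def scale_factor(endpoint_sd, cord_blood=False):
    """15 / SD of the endpoint, halved again for cord-blood coefficients."""
    factor = 15.0 / endpoint_sd
    return factor / 2.0 if cord_blood else factor

def rescale(b, se, factor):
    """Put a coefficient and its standard error on the IQ scale."""
    return b * factor, se * factor

# Faroes full-scale IQ row of Table I: b = -0.024 (s = 0.011), net factor 5.17
b_iq, se_iq = rescale(-0.024, 0.011, 5.17)   # -> approx. (-0.124, 0.057), as in Table I
```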

For several tests and endpoints, results for multiple variations were reported by the investigators. For example, the Seychelles study presents regression coefficients for the effect of mercury on scores from the Boston Naming Test, administered with and without cues. In order to avoid over-representing any particular test, and also to avoid adding additional complexity to our modeling (associated with having to account for correlation between closely related scores from the same test), we chose only one score per test in such cases (see [8] for more detail).

4. STATISTICAL MODELING

To estimate the dose–response coefficient characterizing the impact of mercury exposure on child IQ, we use a hierarchical random effects model that includes study-to-study as well as endpoint-to-endpoint variability. Hierarchical models are commonly used in settings where the goal is to combine related information from several different sources.


[Figure 1 here. Caption: Estimated regression coefficients for mercury and 95 per cent confidence intervals (see Table I) for various endpoints in the New Zealand, Seychelles and Faroe Islands studies. Estimates corresponding to IQ are shown with solid lines.]

For example, Dominici et al. [23] used such an approach to combine dose–response data related to particulate matter from different U.S. cities. The approach used here extends the Dominici work by including random effects that reflect two types of variability. Our model also extends one described by Coull et al. [15] in their response to [23], and in a later elaboration [16], by assuming that some endpoints are common to more than one study.

Our analysis can be described as follows. Let b_ij and s_ij represent the estimated standardized regression coefficient and corresponding estimated standard error for the jth endpoint in the ith study. In addition, define another covariate endpoint_ij to indicate the particular developmental endpoint on which the coefficient b_ij was based. The covariate takes values 1, 2, ..., J, where J is the number of unique endpoints observed among the three studies (J = 9 for the endpoints described in Table I). We set the covariate endpoint equal to 1 if the coefficient corresponded to full-scale IQ, 2 if it corresponded to performance IQ, 3 to McCarthy perceptual performance, and so on. We then fit the model

b_ij = β + γ_i + δ_endpoint_ij + ε_ij    (1)

where γ_1, γ_2 and γ_3 are study-specific random effects assumed to be normally distributed with mean 0 and variance σ²_study; δ_1, δ_2, ..., δ_J are endpoint-specific random effects assumed to be normally distributed with mean 0 and variance σ²_endpoint; and ε_ij is a normally distributed random error term with variance assumed to be known and given by s²_ij. If each study had measured unique endpoints, then model (1) would correspond to a nested hierarchical model. If there were perfect overlap with respect to the endpoints measured in each study, then model (1) would correspond to a crossed random effects model. Since there is partial overlap, model (1) represents a hybrid of these two.
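To make the partial-overlap structure concrete, the following Python sketch simulates one draw of the coefficients b_ij from model (1). The layout of studies and endpoints follows Table I, but every numeric value (the overall mean, the variance components and the sampling standard errors) is invented for illustration:

```python
import random

random.seed(2008)

# Hypothetical simulation from model (1): b_ij = beta + gamma_i + delta_endpoint(ij) + eps_ij.
# Full-scale IQ is shared by all three studies, while e.g. the Bender test appears
# only in the Faroes study -- the partial overlap that makes the model a
# nested/crossed hybrid.  All parameter values below are invented.
beta = -0.15
sigma_study, sigma_endpoint = 0.10, 0.10

studies = {
    "Seychelles": ["full IQ", "CVLT", "Boston naming", "visual motor", "WRAML"],
    "Faroes": ["full IQ", "Bender", "Boston naming", "CVLT"],
    "New Zealand": ["full IQ", "TOLD", "performance IQ", "McCarthy"],
}
endpoints = sorted({e for eps in studies.values() for e in eps})  # J = 9 unique endpoints

gamma = {st: random.gauss(0.0, sigma_study) for st in studies}       # study effects
delta = {e: random.gauss(0.0, sigma_endpoint) for e in endpoints}    # endpoint effects

b = {(st, e): beta + gamma[st] + delta[e] + random.gauss(0.0, 0.1)   # s_ij taken as 0.1
     for st, eps in studies.items() for e in eps}                    # 13 coefficients
```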


While a maximum likelihood analysis of model (1) is technically possible, we took a Bayesian approach for several reasons. First, a Bayesian analysis is easily accomplished, for example, using the statistical package WinBUGS. More importantly, a Bayesian analysis facilitates a variety of sensitivity analyses, based on forcing various prior assumptions about the nature of the variance components in the model. In particular, because there are only three studies and only three endpoints (full-scale IQ, Boston Naming Test and California Verbal Learning Test) shared by two or more studies, there is little information available to reliably separate the study-to-study and endpoint-to-endpoint variance components. Our approach reduced the dimensionality of the model by assuming a multiplicative relationship between the variance components, σ²_study = R σ²_endpoint. The parameter R represents the ratio of study-to-study variability relative to endpoint-to-endpoint variability. We did not estimate R, but refitted the model under a feasible range of fixed values of R. The resulting model fits were computationally stable and also allowed us to explicitly evaluate the sensitivity of our results to the assumed prior distribution on the variance components.

A Bayesian analysis of model (1) requires specification of prior distributions for β as well as the variance component σ²_endpoint (assuming the multiplicative relationship just described). A typical approach for the parameter β is to specify a flat, but proper, prior by assuming that this parameter is normally distributed with mean 0 and a large variance. In WinBUGS, normal distributions are specified in terms of a precision parameter, which corresponds to the inverse of the variance. A popular choice for an approximately flat prior is a precision of 10⁻⁴. Convergence can be improved by absorbing β into the mean of one of the random effects [24]. As is well known [25], Bayesian estimation of main effect terms such as β tends to be relatively stable, and the results insensitive to changes in the specified priors. However, specifying appropriate priors for the variance components is more difficult. Traditionally, it has been common to use gamma priors on the inverse variance components, since the resulting posterior can be computed in closed form. The prior can be made effectively flat through an appropriate choice of hyperparameters that ensures a large variance. A commonly used choice is a gamma distribution on the precision with shape and rate parameters each equalling 0.001 (equivalently, an inverse gamma prior on the variance). This prior distribution has a mean of 1 and a variance of 1000, thus in theory allowing the variance component to vary over a wide range of values. In a setting like ours, however, where the amount of data is relatively small, results turn out to be quite sensitive to the choice of prior on the variance components. Part of the problem is that gamma distributions with large variances can have fairly extreme shapes. This can be seen quite easily by simulating data in R from a gamma distribution with shape and rate parameters equal to 0.001. While the mean of this distribution is indeed 1 and the variance is 1000, this is achieved by means of an extremely long tail. The bulk of the distribution is very close to 0 (90th percentile = 9.8e−44). In settings where the user has an a priori sense of the range in which a variance component should lie, it is difficult to find a suitable inverse gamma distribution that covers that range effectively. Gelman [25] argues that with modern-day advances in Bayesian computing, the conjugacy argument for using a gamma prior is less compelling, especially in light of its sensitivity to the chosen values of the hyperparameters. Appropriateness of the priors can be difficult to assess because it is less intuitive to think in terms of inverse variances. Gelman recommends use of a Cauchy distribution on the square root of the variance components. An advantage of this approach is that the user is much better able to interpret the results and assess sensitivity. We have found that a uniform distribution on the square root of the variance component also works very well. Many examples in the online WinBUGS manual take this approach.


5. RESULTS

To explore appropriate approximate ranges for prior distributions on the variance components in our assumed model, we plotted the profile log-likelihood obtained by treating the parameters R and σ_study as fixed and known, replacing the unknown overall mean (β) by its maximum likelihood estimator, expressed in terms of these parameters. Figure 2 shows the resulting profile log-likelihood surface, with pale and dark colors indicating high and low values, respectively, of the profile log-likelihood. Contours of equal log-likelihood are superimposed on the surface. The figure reveals a number of interesting features. First, it demonstrates the difficulty of trying to estimate both R and σ_study together, since the surface reveals a ridge region encompassing a range of parameter values that give essentially the same high profile likelihood. However, the figure also clearly shows that for any specific value of R the data are relatively informative about σ_study. For lower values of R (1–3), the figure suggests a likely range of 0–0.15 for σ_study. As R increases, there is less information about σ_study, although the figure still suggests a likely range of 0–0.2. These explorations led us to consider a uniform prior distribution on σ_study with range [0, 0.2]. We also ran sensitivity analyses changing the prior distribution to uniform [0, 0.3] and uniform [0, 0.4], and also varying R from 1.5 to 3. The results suggest that while the resulting mean parameters are stably estimated, the width of the confidence intervals varies, as expected.
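The profile log-likelihood surface can be reproduced, at least in spirit, from the IQ-scale summary statistics in Table I. The Python sketch below is our reconstruction, not the authors' code: the endpoint coding is ours, additive constants are dropped, and β is profiled out via its generalized least squares estimate for each fixed (R, σ_study):

```python
import numpy as np

# IQ-scale coefficients b and standard errors s from Table I, with study and
# endpoint index vectors.  Under model (1) the coefficient vector is normal
# with mean beta*1 and covariance
#   V = sigma2_study * Zs + (sigma2_study / R) * Ze + diag(s^2),
# where Zs / Ze indicate pairs sharing a study / an endpoint.
b = np.array([-0.17, 0.20, -0.038, -0.109, -0.013,    # Seychelles
              -0.124, -0.104, -0.260, -0.169,         # Faroes
              -0.50, -0.51, -0.50, -0.90])            # New Zealand
s = np.array([0.13, 0.154, 0.144, 0.150, 0.15,
              0.057, 0.083, 0.086, 0.093,
              0.282, 0.301, 0.301, 0.485])
study = np.array([0]*5 + [1]*4 + [2]*4)
# 0 = full-scale IQ, 1 = CVLT, 2 = Boston naming (the shared endpoints);
# 3-8 are study-specific endpoints.  The coding is ours.
endpoint = np.array([0, 1, 2, 3, 4, 0, 5, 2, 1, 0, 6, 7, 8])

def profile_loglik(R, sig_study):
    """Profile log-likelihood (constants dropped) for fixed R and sigma_study."""
    v_study = sig_study**2
    v_end = v_study / R                      # sigma2_study = R * sigma2_endpoint
    Zs = (study[:, None] == study[None, :]).astype(float)
    Ze = (endpoint[:, None] == endpoint[None, :]).astype(float)
    V = v_study * Zs + v_end * Ze + np.diag(s**2)
    Vinv = np.linalg.inv(V)
    one = np.ones_like(b)
    beta_hat = one @ Vinv @ b / (one @ Vinv @ one)   # GLS estimate of beta
    r = b - beta_hat * one
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (logdet + r @ Vinv @ r)

# Evaluating over a grid of (R, sigma_study) values traces out the surface.
grid = [(R, sig) for R in (1.0, 2.0, 4.0) for sig in (0.05, 0.1, 0.2)]
surface = {pt: profile_loglik(*pt) for pt in grid}
```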

Table II shows the estimated coefficients and 95 per cent confidence intervals for the relationship between prenatal methylmercury exposure and IQ, for σ_study distributed as uniform [0, 0.2] and for values of R (the ratio of study-to-study variability relative to endpoint-to-endpoint variability) ranging between 0.25 and 4.0. Regardless of the value of R, the central estimate of the dose–response coefficient was consistently between −0.15 and −0.19 IQ points per ppm of maternal hair mercury, and was statistically significant in all cases. These results suggest that the results are relatively insensitive

[Figure 2 here. Caption: Profile log-likelihood for fixed values of R and the standard deviation corresponding to study-to-study variability, σ_study (horizontal axis: σ_study from 0.0 to 0.8; vertical axis: R from 1 to 8).]


Table II. Results of fitting a hierarchical model for cognition/achievement-related scores.

R      σ̂_study (se)    β̂_IQ (se)        95 per cent hpd
4.0    0.118 (0.051)    −0.188 (0.096)    (−0.398, −0.010)
3.5    0.116 (0.050)    −0.182 (0.091)    (−0.390, −0.007)
3.0    0.112 (0.051)    −0.180 (0.092)    (−0.378, −0.009)
2.5    0.110 (0.051)    −0.183 (0.090)    (−0.384, −0.017)
2.0    0.107 (0.050)    −0.178 (0.088)    (−0.371, −0.012)
1.5    0.095 (0.053)    −0.168 (0.086)    (−0.360, −0.003)
1.0    0.086 (0.051)    −0.165 (0.080)    (−0.338, −0.015)
0.5    0.068 (0.046)    −0.160 (0.071)    (−0.321, −0.026)
0.25   0.049 (0.038)    −0.151 (0.061)    (−0.283, −0.033)

Note: Estimated study-to-study variance components, estimated dose–response coefficients (and their standard errors), as well as 95 per cent highest posterior density (hpd) intervals, are shown for different values of R (the ratio of study-to-study variability relative to endpoint-to-endpoint variability).
to the assumed value of R. However, in a setting like this where there are many different sources of uncertainty, it is generally important to consider sensitivity analyses exploring as many of these sources as possible. For example, the EHP paper [8] reports that the overall results are quite sensitive to whether the New Zealand analysis included or excluded the one child with a very high exposure level. When this child was included, the estimated dose–response coefficients from the New Zealand study were much closer to those obtained for the Seychelles study. As a result, the estimated overall dose–response coefficient for IQ was reduced to −0.12. However, because the study-to-study variance component was reduced, the precision associated with the coefficient increased, with a 95 per cent confidence interval of (−0.204, −0.025). The EHP paper [8] also shows that the assumed ratio of cord blood to hair mercury had a fairly substantial impact on the results.

5.1. Impact of working with summary statistics

Our analysis does not use the original individual-specific data, but is based on estimated regression coefficients reported in various published articles and technical reports. An important question concerns whether this approach is appropriate, as well as whether there is a need to account, in addition, for the fact that repeated outcomes assessed on the same child may be correlated. There are in fact several precedents for hierarchical models fit to estimated regression coefficients or other summary statistics computed at the first stage of the hierarchy. In their combined analysis of data related to the health effects of air pollution from 24 cities in the United States, Dominici et al. [23] combined only city-specific dose–response coefficients, despite having access to the raw data. Drawing on theoretical results [26], they argued that such an approach was not only valid and computationally appealing, but also resulted in little efficiency loss. While it is reasonable to expect that a similar logic might apply in our setting, it is not entirely clear that it does, because our model is more complicated and involves multiple endpoints within each study. In this section, we present a more detailed theoretical argument that explores the question.
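Before turning to the theory, it is worth noting what a completely naive combination of the Table I summary statistics gives. The Python sketch below (ours, for intuition only) pools the IQ-scale coefficients with fixed-effect inverse-variance weights; it is not the analysis of the paper, since it ignores the variance components of model (1) and the within-child correlation examined next:

```python
# Naive fixed-effect inverse-variance pooling of the IQ-scale coefficients in
# Table I.  Illustrative only: unlike model (1), this ignores study-to-study
# and endpoint-to-endpoint heterogeneity and any correlation among endpoints
# measured on the same children, so its standard error is too optimistic.
coefs = [(-0.17, 0.13), (0.20, 0.154), (-0.038, 0.144), (-0.109, 0.150),
         (-0.013, 0.15),                                                      # Seychelles
         (-0.124, 0.057), (-0.104, 0.083), (-0.260, 0.086), (-0.169, 0.093),  # Faroes
         (-0.50, 0.282), (-0.51, 0.301), (-0.50, 0.301), (-0.90, 0.485)]      # New Zealand

weights = [1.0 / se**2 for _, se in coefs]
pooled = sum(w * b for (b, _), w in zip(coefs, weights)) / sum(weights)
pooled_se = (1.0 / sum(weights)) ** 0.5
# pooled is around -0.14 IQ points per ppm, broadly in line with the
# hierarchical estimates in Table II, but with a much smaller standard error
```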

Let Y_ijk represent the jth of m_i outcomes measured on the kth child (k = 1, ..., n_i) in the ith study and let x_ik represent this child's observed exposure level. Note that since exposure is the

Copyright q 2007 John Wiley & Sons, Ltd. Statist. Med. 2008; 27:698–710DOI: 10.1002/sim



same for all endpoints measured on the same child, we do not include a j subscript on x. Ignoring for now the impact of any additional covariates, a natural model might take the form:

Y_ijk = α_ij + β_ij x_ik + ε_ijk    (2)

where α_ij and β_ij are the intercept and slope for the jth endpoint in the ith study and ε_ijk is an error term. Assuming that the outcomes have all been standardized to the same scale, it is reasonable to assume that the error terms all have the same variance, say σ²_error. To account for correlation between repeated measures taken on the same child, suppose that the vector of error terms for the kth child (ε_i1k, ..., ε_im_ik) is normally distributed with mean 0 and exchangeable covariance matrix Σ = σ²_error[(1 − ρ)I + ρJ], where I is the m_i × m_i identity matrix and J is a matrix of ones. Study-to-study and endpoint-to-endpoint variability are then accommodated by allowing the α_ij and β_ij terms to also have an appropriate bivariate normal distribution. Our analysis is equivalent to a two-stage procedure: first fitting study- and endpoint-specific slopes, treating the α's and β's as fixed but unknown parameters, then fitting a random effects model to the estimated values as described earlier. Assuming model (2), the estimated regression slope corresponding to the jth endpoint in the ith study is

b̂_ij = Σ_k (x_ik − x̄_i) Y_ijk / Σ_k (x_ik − x̄_i)²

For fixed values of α_ij and β_ij, it follows that b̂_ij is normally distributed with mean the true b_ij and variance

s²_ij = σ²_error / Σ_k (x_ik − x̄_i)²

Note that under this simplified setting, the standard errors of the estimated slopes are identical for all endpoints measured in the same study, s²_ij = s²_i. They may differ in practice, however, once additional covariates have been incorporated into the endpoint-specific models. The covariance between the estimated slopes for two different endpoints j and j′ in study i is given by

ρ σ²_error / Σ_k (x_ik − x̄_i)² = ρ s²_i
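These formulas for the slope estimator and its sampling variance can be checked numerically. The short Monte Carlo sketch below uses hypothetical values for the exposures and for σ_error, and verifies that the empirical variance of the least-squares slope matches σ²_error / Σ_k (x_ik − x̄_i)².

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
x = rng.uniform(0.0, 5.0, n)          # hypothetical exposure levels
xc = x - x.mean()
sxx = (xc ** 2).sum()

# Monte Carlo check of the slope estimator and its variance formula
slopes = []
for _ in range(20000):
    y = 2.0 + 0.5 * x + rng.normal(0.0, sigma, n)   # true slope 0.5
    slopes.append((xc * y).sum() / sxx)             # b-hat for this draw
emp_var = np.var(slopes)
theo_var = sigma ** 2 / sxx           # the s^2 formula above
```

The empirical mean of the simulated slopes also recovers the true value, confirming that the estimator is unbiased for fixed intercepts and slopes.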

so that the correlation between the two slopes is ρ. Analogously to what was described earlier, let endpoint_ij be a covariate indicating the outcome type for the jth outcome measured in study i.

Then, assuming the same random effects structure for β̂_ij as described earlier, we can write

β̂_ij = γ₀ + δ*_i + λ*_endpoint_ij + ε*_ij    (3)

where δ*_1, δ*_2 and δ*_3 are study-specific random effects, λ*_1, ..., λ*_J are endpoint-specific random effects and ε*_ij is an error term with mean zero and variance s²_ij = s²_i. The error terms corresponding to two different outcomes measured in the same study (say ε*_ij and ε*_ij′) have correlation ρ. Under this model, the marginal correlation between the slopes corresponding to two endpoints (j and j′) measured in the same study will be

(σ²_study + ρ s²_i) / (σ²_study + σ²_endpoint + s²_i)
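As a quick sanity check of this expression, the small helper below (hypothetical, not from the paper) evaluates the marginal correlation for given variance components; setting ρ = 0 recovers the correlation implied by the model that ignores within-child correlation.

```python
def slope_correlation(var_study, var_endpoint, s2, rho):
    """Marginal correlation between the estimated slopes of two endpoints
    measured in the same study, under the correlated-error model (3):
    (sigma2_study + rho * s2_i) / (sigma2_study + sigma2_endpoint + s2_i)."""
    return (var_study + rho * s2) / (var_study + var_endpoint + s2)
```

With var_study = var_endpoint = s2 = 1, for example, the correlation rises from 1/3 at ρ = 0 to 1/2 at ρ = 0.5, which is the sense in which ignoring ρ shifts apparent variability toward the study level.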




[Figure 3 appears here: the mercury coefficient plotted against within-individual correlation.]

Figure 3. Estimated regression coefficient for mercury (solid line) and 95 per cent highest posterior density interval (dashed lines) as a function of within-child correlation. Results are based on assuming a value of R = 3 for the ratio of study-to-study and endpoint-to-endpoint variance components.

In contrast, the correlation assumed by model (1) between two endpoints measured in the same study is

σ²_study / (σ²_study + σ²_endpoint + s²)

so that ignoring the correlation term ρ and fitting model (1) is likely to overestimate σ²_study and underestimate σ²_endpoint. Note that model (1) can easily be modified to be equivalent to model (3) by redefining the error terms so that the study-specific random effects δ_1, δ_2 and δ_3 have variance σ²_study + ρ s²_i and the error terms ε*_ij have variance (1 − ρ) s²_i. While the parameter ρ is not identifiable using only the summary data, it is useful to perform a sensitivity analysis to explore the impact of varying this parameter on the overall results. Figure 3 shows the results of fitting this modified model in WinBUGS for values of ρ varying between 0 and 0.5. The solid line shows the estimated value of the mercury coefficient for IQ, while the dashed lines show estimated 95 per cent highest posterior density intervals. The dotted horizontal line at zero indicates the no-effect level. The psychometrics literature suggests that 0.5 is a reasonable upper bound on likely values of ρ. Hence, despite the non-identifiability of ρ based on summary data alone, our sensitivity analysis suggests a relatively high degree of robustness of the results.
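A frequentist analogue of this sensitivity analysis can be sketched directly. The code below uses hypothetical numbers throughout and generalized least squares in place of the WinBUGS fit: for each value of ρ it builds the implied covariance matrix of the study- and endpoint-specific slopes and re-estimates the overall coefficient, and the standard error widens as ρ grows.

```python
import numpy as np

def gls_overall_slope(b, study, var_study, var_endpoint, s2, rho):
    """Overall slope by generalized least squares.  The covariance matrix
    follows the text: diagonal var_study + var_endpoint + s2_i, and
    within-study off-diagonals var_study + rho * s2_i (s2 is assumed
    constant within each study, as in the simplified setting above)."""
    b = np.asarray(b, float)
    n = len(b)
    V = np.zeros((n, n))
    for p in range(n):
        for q in range(n):
            if p == q:
                V[p, q] = var_study + var_endpoint + s2[p]
            elif study[p] == study[q]:
                V[p, q] = var_study + rho * s2[p]
    Vinv = np.linalg.inv(V)
    one = np.ones(n)
    info = one @ Vinv @ one
    return (one @ Vinv @ b) / info, 1.0 / np.sqrt(info)

# hypothetical endpoint-specific slopes from three studies
b = [-0.12, -0.10, -0.20, -0.18, -0.05, -0.08]
study = [0, 0, 1, 1, 1, 2]
s2 = [0.02] * 6
fits = [gls_overall_slope(b, study, 0.01, 0.01, s2, r) for r in (0.0, 0.25, 0.5)]
```

Because larger ρ induces stronger within-study correlation among the slopes, each study contributes less independent information and the interval around the overall coefficient widens, mirroring the behaviour in Figure 3.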

6. DISCUSSION

This paper has elaborated on some of the statistical issues underlying a recently reported analysis [8] that combines data on multiple endpoints from three different epidemiological studies designed




to assess the impact of in-utero methylmercury exposure on child development. This analysis played an important role in helping the US EPA to quantify risks from prenatal exposure to mercury and to estimate benefits from reductions in mercury exposure. Although the analysis used data from only three studies, we were able to 'borrow strength' from other endpoints in the cognition/achievement domain in order to estimate the association between mercury and IQ more precisely, the latter having been previously used successfully as the basis for quantifying the benefits of reducing childhood lead exposure.

The motivating setting for this paper is typical of what often happens in environmental risk assessment. While the question of interest (finding an appropriate standard for mercury exposure) is important and often urgent, available data are invariably inadequate and scattered across a variety of disparate sources of varying quality. In such cases, a highly rigorous analysis is often not possible. Instead, the task becomes one of doing the best job possible to synthesize the available data using simple but reasonable models. Inevitably, such analyses require a number of strong simplifying assumptions. While it will typically not be possible to rigorously assess whether these assumptions are reasonable, thoughtful sensitivity analyses can be used to gauge the impact of violating them.

Despite the relative simplicity of our approach and the paucity of available data, the results are quite compelling in suggesting that in-utero exposure to methylmercury has an adverse impact on child development in general, and IQ in particular. The analysis suggests a number of interesting and worthwhile avenues for further research. For example, it would be useful to gather raw data from the various studies and perform a more detailed analysis. The results of our combined analysis also point to the importance of further epidemiological work on the health effects of methylmercury exposure.

We used a Bayesian approach to fitting the proposed hierarchical model, via the statistical package WinBUGS. There were several reasons for this choice. First, it allowed us to quickly and easily program the various models. It was also easy to restrict the parameter space, for example, by fixing the ratio between the study-to-study and endpoint-to-endpoint variance components. Also, we believe that the Bayesian approach is more appropriate in small-sample settings such as ours, since it avoids the need for the standard large-sample approximations for statistical inference. The relative sparsity of the data meant that convergence was an issue, especially for the variance components. There was also a reasonable amount of sensitivity to prior specification. We addressed this problem by using an exploratory likelihood approach to identify a reasonable range for the variance components, then constructing priors that allowed the variance components to vary over this range.
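The exploratory likelihood step described above can be sketched as follows. With hypothetical slope estimates standing in for the real summary data, the code profiles the marginal log-likelihood of a simple random-effects model over a grid of between-study variances τ² and keeps the values falling within 2 log-likelihood units of the maximum; a prior supported on that range is one way to encode the result.

```python
import numpy as np

def re_loglik(tau2, b, s2):
    """Log-likelihood of a one-level random-effects model at a given
    between-study variance tau2, with the overall mean profiled out."""
    v = s2 + tau2
    w = 1.0 / v
    mu = (w * b).sum() / w.sum()
    return -0.5 * (np.log(v).sum() + ((b - mu) ** 2 / v).sum())

# hypothetical study-specific slopes and sampling variances
b = np.array([-0.12, -0.18, -0.05])
s2 = np.array([0.02, 0.03, 0.02])

grid = np.linspace(0.0, 0.2, 201)
ll = np.array([re_loglik(t, b, s2) for t in grid])
# retain tau2 values within 2 log-likelihood units of the maximum
support = grid[ll > ll.max() - 2.0]
lo, hi = support.min(), support.max()
```

The resulting interval [lo, hi] gives a data-driven range over which a prior for the variance component can be allowed to vary, rather than relying on an arbitrary diffuse specification.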

ACKNOWLEDGEMENTS

This paper is based on a presentation for the 2005 Armitage Lecture at the Medical Research Council, Cambridge, England. The research was supported by National Institute of Health Grant CA48061 and a contract from the United States Environmental Protection Agency.

REFERENCES

1. Morales KH, Ibrahim JG, Chen C-J, Ryan LM. Bayesian model averaging with applications to benchmark dose estimation for arsenic in drinking water. Journal of the American Statistical Association 2006; 101:9–17.

2. Savitz D. Interpreting Epidemiologic Evidence: Strategies for Study Design and Analysis. Oxford University Press: New York, 2003.

3. Mosleh A, Siu N, Smidts C, Lui C. Model Uncertainty: Its Characterization and Quantification. Center for Reliability Engineering, University of Maryland: College Park, MD, 1993.

4. Draper D. Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B 1995; 57:45–97.

5. Best NG, Spiegelhalter DJ. Bayesian approaches to multiple sources of evidence and uncertainty in complex cost-effectiveness modelling. Statistics in Medicine 2003; 22:3687–3709.

6. Carroll RJ. Variances are not always nuisance parameters. Biometrics 2003; 59:211–220.

7. Ades AE, Sutton AJ. Multiparameter evidence synthesis in epidemiology and medical decision-making: current approaches. Journal of the Royal Statistical Society, Series A 2006; 169:5–35.

8. Axelrad DA, Bellinger DC, Ryan LM, Woodruff TJ. Dose–response relationship of prenatal mercury exposure and IQ: an integrative analysis of epidemiologic data. Environmental Health Perspectives 2007; 115:609–615.

9. National Research Council. Toxicological Effects of Methylmercury. National Academy Press: Washington, DC, 2000.

10. Grandjean P, Weihe P, White RF, Debes F, Araki S, Yokoyama K et al. Cognitive deficit in 7-year-old children with prenatal exposure to methylmercury. Neurotoxicology and Teratology 1997; 19:417–428.

11. Crump KS, Kjellstrom T, Shipp AM, Silvers A, Stewart A. Influence of prenatal mercury exposure upon scholastic and psychological test performance: benchmark analysis of a New Zealand cohort. Risk Analysis 1998; 18:701–713.

12. Kjellstrom T, Kennedy P, Wallis P, Mantell C. Physical and mental development of children with prenatal exposure to mercury from fish. Stage 2: Interviews and psychological tests at age 6. Report 3642. National Swedish Environmental Protection Board: Solna, Sweden, 1989.

13. Davidson PW, Myers GJ, Cox C, Axtell C, Shamlaye C, Sloane-Reeves J. Effects of prenatal and postnatal methylmercury exposure from fish consumption on neurodevelopment: outcomes at 66 months of age in the Seychelles child development study. Journal of the American Medical Association 1998; 280:701–707.

14. Myers GJ, Davidson PW, Cox C, Shamlaye CF, Palumbo D, Cernichiari E et al. Prenatal methylmercury exposure from ocean fish consumption in the Seychelles child development study. The Lancet 2003; 361:1686–1692.

15. Coull BA, Mezzetti M, Ryan LM. Comment on 'Combining evidence on air pollution and daily mortality from the 20 largest US cities: a hierarchical modelling strategy'. Journal of the Royal Statistical Society, Series A 2000; 163:293.

16. Coull BA, Mezzetti M, Ryan LM. A Bayesian hierarchical model for risk assessment of methylmercury. Journal of Agricultural, Biological, and Environmental Statistics 2003; 8:253–270.

17. U.S. Environmental Protection Agency. The benefits and costs of the Clean Air Act, 1970–1990. Technical Report, U.S. Environmental Protection Agency, 1997.

18. Kjellstrom T, Kennedy P, Wallis S, Mantell C. Physical and Mental Development of Children with Prenatal Exposure to Mercury from Fish. Stage 1: Preliminary Tests at Age 4. Swedish National Environmental Protection Board: Solna, Sweden, 1986.

19. Budtz-Jorgensen E, Debes F, Weihe P, Grandjean P. Adverse mercury effects in 7-year old children expressed as loss in IQ. Technical Report, Report to the U.S. Environmental Protection Agency, 2005.

20. Budtz-Jorgensen E, Grandjean P, Jorgensen PJ, Weihe P, Keiding N. Association between mercury concentrations in blood and hair in methylmercury-exposed subjects at different ages. Environmental Research 2004; 95:385–393.

21. Budtz-Jorgensen E, Keiding N, Grandjean P, Weihe P, White RF. Estimation of health effects of prenatal mercury exposure using structural equation models. Environmental Health: A Global Access Science Source 2002; 1:2.

22. Sanchez BS, Budtz-Jorgensen E, Ryan LM, Hu H. Structural equation models: a review with applications to environmental epidemiology. Journal of the American Statistical Association 2005; 100:1443–1455.

23. Dominici F, Samet JM, Zeger SL. Combining evidence on air pollution and daily mortality from the 20 largest US cities: a hierarchical modelling strategy. Journal of the Royal Statistical Society, Series A 2000; 163:263–284.

24. Gelfand AE, Sahu SK, Carlin BP. Efficient parameterisations for normal linear mixed models. Biometrika 1995; 82:479–488.

25. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 2006; 1:515–533.

26. Daniels M, Kass R. A note on first-stage approximation in two-stage hierarchical models. Sankhya B 1998; 60:19–30.