Assessing consistent treatment effect in a multi-regional clinical trial: a systematic review

PHARMACEUTICAL STATISTICS
Pharmaceut. Statist. 9: 242–253 (2010)
Published online 28 June 2010 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/pst.438

Joshua Chen (1), Hui Quan (2), Bruce Binkowitz (1), S. Peter Ouyang (3), Yoko Tanaka (4), Gang Li (5), Shailendra Menjoge (6), and Ekopimo Ibia (1), for the Consistency Workstream of the PhRMA MRCT Key Issue Team

(1) Merck Research Labs, UG1CD-44, North Wales, PA, USA
(2) Sanofi-Aventis, Biostatistics and Programming, Bridgewater, NJ, USA
(3) Celgene, Summit, NJ, USA
(4) Eli Lilly, Indianapolis, IN, USA
(5) Johnson & Johnson, Raritan, NJ, USA
(6) Boehringer Ingelheim, Ridgefield, CT, USA

A key issue in multi-regional clinical trials (MRCTs) is how to assess the consistency of treatment effect across regions, although there is no a priori reason to believe that the treatment effect should vary across the regions. In this article, we define the research question as an assessment of overall consistency across all regions for which all regions are considered equally important. This is different from the region/country-specific analyses (e.g. US vs Non-US), which are frequently requested by local regulatory agencies and usually performed for multiple agencies. We provide a systematic review of methods that may potentially be used for assessing consistency across regions, including commonly used quantitative/qualitative interaction tests, Japanese Pharmaceutical Medical Device Agency (PMDA) Methods 1 & 2, and those proposed for different purposes (e.g. bridging studies, meta-analysis, and vaccine lot consistency, among others). These methods are classified into three groups: global methods, multivariate quantitative methods, and multivariate qualitative methods. A case study is used to illustrate these methods. We also provide recommendations on how to choose appropriate methods and incorporate them in the study design.

Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: multi-regional clinical trial; consistency; heterogeneity; quantitative; qualitative

This article is published in Pharmaceutical Statistics as a special issue on Multi-regional Clinical Trials: What are the challenges?, edited by Sue Jane Wang, Office of Biostatistics, Office of Translational Sciences, CDER, U.S. Food and Drug Administration, MD, USA.

Consistency Workstream of the PhRMA MRCT Key Issue Team: Hui Quan, Joshua Chen, Yoko Tanaka, Gang Li, Paul Gallo, Peter Ouyang, Xiaolong Luo, Shailendra Menjoge, Steven Talerico, Kimitoshi Ikeda, Bruce Binkowitz, Ekopimo Ibia.

Correspondence to: Joshua Chen, Merck Research Labs, PO Box 1000, UG1CD-44, North Wales, PA, USA. E-mail: [email protected]


1. INTRODUCTION

Multi-regional clinical trials (MRCTs) have become common practice in recent years. For example, of the 1926 clinical trials reviewed by the US Food and Drug Administration (FDA) during 2001–2007, 50% were multi-regional trials that included both US domestic and foreign sites [1]. Designed and conducted properly, MRCTs could accelerate the availability of important medical products to patients in need worldwide and increase efficiency in clinical development, which in the long run may help lower the cost of clinical development and hence the cost of products to patients and payers. For registration, some local regulatory agencies request data from their countries/areas relative to the overall data in the application package [2,3]. Nevertheless, designing and conducting MRCTs so that their results are acceptable for regulatory decision making presents significant challenges for industry and regulators [1,4].

One key question is the assessment of consistency of treatment effect across regions. By consistency, we specifically refer to the similarity of treatment effect (typically defined as a summary measure of a difference between treatment group and control group responses) across regions. The ICH E5 guideline [5] was adopted in 1998 to recommend a framework for evaluating the impact of ethnic factors on drug effects. In a Q&A format, the answer to ICH E5 Q11 states: ‘A multi-regional trial for the purpose of bridging could be conducted in the context of a global development program designed for near simultaneous world-wide registration. The objectives of such a study would be: (1) to show that the drug is effective in the region and (2) to compare the results of the study between the regions with the intent of establishing that the drug is not sensitive to ethnic factors.’ However, no definition of ‘not sensitive’, or of similar/consistent treatment effect, has been provided. In recently published MRCTs, treatment-by-region interaction tests are commonly used to assess heterogeneous treatment effect across regions [6–9]. A non-significant interaction test would lead to the conclusion that the treatment effect is consistent across regions. Recently, the Japanese health authority issued a guidance document with one Q&A specifically for consistency assessment [3]. Two methods to define consistent treatment effect between a Japanese cohort and the rest of the overall patient population in an MRCT are proposed and have been extensively discussed [10–12]. In light of ICH E5, the PMDA guidance document, and the recent CHMP position paper on Extrapolation of results from clinical studies conducted outside Europe to the EU population [CHMP/EWP/692702/08], it is important to research various ways to investigate, define, and assess consistency of treatment effects across regions. For this reason, the PhRMA MRCT Key Issues Team undertook a systematic review of existing methods that may be used for assessing consistency. Different methods could yield very different conclusions for the same study, which partly accounts for differing review outcomes when an MRCT is assessed by multiple health authorities using different methods. Sorting through these methods and understanding their properties should help make reviews more uniform across health authorities. While we recommend pre-specifying a primary approach for assessing consistency, more than one of these methods may be used in combination to help understand the nature of possible inconsistency.

In this paper, we provide a systematic review of existing methods that may potentially be used for assessing consistent treatment effect across regions. Some of these methods are proposed for different purposes (e.g., bridging studies, meta-analysis, vaccine lot consistency, among others). These methods are classified into three groups:

1. Global Methods, which are based on one single global statistic combining data across all regions. The commonly used treatment-by-region interaction tests are included in this group. Both quantitative and qualitative interaction tests are discussed.

2. Multivariate Quantitative Methods, which assess the quantitative differences among regions simultaneously. Consistency is concluded if all pairwise differences between regions are within pre-specified bounds.

3. Multivariate Qualitative Methods, which assess the qualitative differences among regions simultaneously. Consistency is concluded if all point estimates from individual regions are better than their corresponding thresholds. The two PMDA methods are included in this group.

The main difference between the global (Category 1) and multivariate methods (Categories 2 and 3) is that for global methods, inference/decisions are made based on one single combined statistic, while for the multivariate methods, decisions are made based on simultaneous assessment of the several region-wise estimates.

The paper is organized as follows. In Section 2, we describe the research question of interest and introduce notation. Section 3 provides a systematic review of potential methods. A case study is used to illustrate the definitions in Section 4. Finally, in Section 5, we provide a brief discussion and also make some recommendations.

2. SET-UP AND NOTATION

2.1. Research question

In the recent debate on the US findings from the MERIT-HF study [9,13], the investigators tested a treatment-by-country interaction across all countries, which resulted in a p-value of 0.22. However, by comparing US to non-US countries, the US FDA reviewer obtained a p-value of 0.003. The apparent discrepancy in conclusions arises because the two analyses were trying to address two different questions. The interaction test performed by the investigators was intended to assess overall consistency across countries. For this assessment, all countries are considered equally important. The FDA analysis, however, attempted to assess, for one particular country, whether the results/conclusions were consistent with those based on the rest of the overall population. Similarly, for assessing consistency in MRCTs where regions are well defined, there are two different research questions: overall consistency across all regions vs region-specific assessment of consistency. In this paper, we focus on the assessment of overall consistency across all regions. We assume that regions have been pre-defined, and that there is no a priori systematic or scientific reason to believe that the regions would be inconsistent.

2.2. Notation

Consider a continuous and normally distributed outcome measure in an MRCT. Let $s$ be the number of regions in the multi-regional trial and let $X_{ij}$ and $Y_{ij}$ denote, respectively, the placebo and active treatment group endpoint values for the $j$th patient within region $i$, assumed to have normal distributions: $X_{ij} \sim N(\mu_{iX}, \sigma_i^2)$ and $Y_{ij} \sim N(\mu_{iY}, \sigma_i^2)$, for $i = 1, \ldots, s$. For simplicity of presentation, we assume that within a region there are equal numbers of patients in each treatment group, and we further make the common assumption that the variances are the same across groups and regions, i.e. $\sigma_1^2 = \cdots = \sigma_s^2 = \sigma^2 = 1$. We let $N_i$ be the number of patients per treatment group for the $i$th region, and $N = \sum_{i=1}^{s} N_i$ is the total number of patients in the trial in each treatment arm. Let $\delta_i = \mu_{iY} - \mu_{iX}$ be the true treatment effect for Region $i$, with a positive value implying a better outcome. A commonly used estimate of $\delta_i$ is the difference in sample means, denoted by $\hat{\delta}_i = \bar{Y}_i - \bar{X}_i \sim N(\delta_i, 2/N_i)$. The overall treatment effect based on patients from all regions is $\delta = \sum_{i=1}^{s} f_i \delta_i$, where $f_i = N_i/N$ is the fraction of patients from Region $i$. The estimate of the overall treatment effect is $\hat{\delta} = \sum_{i=1}^{s} f_i \hat{\delta}_i \sim N(\delta, 2/N)$.
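As a brief worked step, the variance $2/N$ of the overall estimate follows from the regional variances $2/N_i$. The derivation below assumes, in addition to the set-up above, that the regional estimates are mutually independent; it is a sketch of the arithmetic rather than a statement taken from the paper.

```latex
\operatorname{Var}(\hat{\delta})
  = \sum_{i=1}^{s} f_i^{2}\,\operatorname{Var}(\hat{\delta}_i)
  = \sum_{i=1}^{s} \left(\frac{N_i}{N}\right)^{2} \frac{2}{N_i}
  = \frac{2}{N^{2}} \sum_{i=1}^{s} N_i
  = \frac{2}{N}.
```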

3. A SYSTEMATIC REVIEW

3.1. Global methods

3.1.1. Quantitative interaction tests

One example in this group is the treatment-by-region interaction test, which is based on a derived Chi-square statistic combining individual estimates from all regions. It has been commonly used to assess consistent treatment effect across regions in recently published clinical trials [6–9]. The null hypothesis is $H_0: \delta_1 = \delta_2 = \cdots = \delta_s = \delta$, vs the alternative hypothesis that at least one region is different. Consistency of treatment effect across regions is concluded if $H_0$ is not rejected. The global test statistic is defined as

$$Q = \sum_{i=1}^{s} \frac{(\hat{\delta}_i - \hat{\delta})^2}{2/N_i} = \frac{1}{2} \sum_{i=1}^{s} N_i (\hat{\delta}_i - \hat{\delta})^2$$

Under the null hypothesis, $Q$ follows a central chi-square distribution with $(s-1)$ degrees of freedom. A result of $Q \ge \chi^2_{s-1,1-\alpha}$ indicates that the test is statistically significant at level $\alpha$, where $\chi^2_{s-1,1-\alpha}$ is the $(1-\alpha)$ quantile of a central chi-square distribution with $(s-1)$ degrees of freedom. This $Q$ statistic is also called Cochran’s Q heterogeneity statistic in the context of meta-analyses [14–17].
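The statistic above can be sketched in a few lines of Python. This is a minimal illustration under the unit-variance set-up of Section 2.2, not the authors' implementation: the regional sample sizes and estimates are hypothetical, and `chi2_sf` is a small helper written here (closed-form chi-square upper tail for integer degrees of freedom) so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for integer df (closed form)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))
    extra = sum(h**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * extra

def cochran_q(delta_hat_i, n_i):
    """Q = (1/2) * sum_i N_i * (d_i - d)^2 and its chi-square p-value (sigma = 1)."""
    n_total = sum(n_i)
    overall = sum(n * d for n, d in zip(n_i, delta_hat_i)) / n_total
    q = 0.5 * sum(n * (d - overall) ** 2 for n, d in zip(n_i, delta_hat_i))
    return q, chi2_sf(q, len(n_i) - 1)

# Hypothetical per-arm sample sizes and regional estimates (illustration only).
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, -0.05]
q, p_value = cochran_q(delta_hat_i, n_i)
print(f"Q = {q:.2f}, p = {p_value:.3f}")
```

With these made-up inputs the overall estimate is 0.21 and Q is 6.45 on 3 degrees of freedom, so the p-value lies between 0.05 and 0.10: not significant at the conventional 0.05 level but significant at 0.10.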

When individual patient level data are available, one may consider regression models. Adding a treatment-by-region interaction term to the regression model enables a test of regional difference in treatment effect, similar to the treatment-by-centre interaction test [18]. When the common variance $\sigma^2$ is unknown, this regression model-based approach will provide slightly different results compared to that based on Cochran’s Q statistic. In Cochran’s Q approach, each of the study level estimates is assumed to follow a normal distribution with true variance equal to that estimated from the data, and as a result, the Q statistic has a chi-square distribution [17, Section 4.2.9]. However, the regression model-based approach takes the uncertainty of the variance estimate into consideration and the resulting heterogeneity test is based on an F-distribution [17, Section 5.2.3]. When the sample size is large, this regression model-based interaction test will provide essentially the same results as Cochran’s Q test. One of the benefits of using regression models is that the models can also adjust for other potential prognostic factors, such as baseline differences in patient characteristics [8].

A study is generally not designed (and/or powered) to detect a potential treatment-by-region interaction [19]. As a result, the statistical power for such an interaction test is typically low; in other words, the false-negative rate (type II error rate) could be high. Even when it is concluded that there is no regional inconsistency based on the treatment-by-region interaction test, the chance that there is in fact a regional difference could be substantial. Because of the lack of power for such an interaction test, some authors recommend a significance level of 0.10 rather than the conventional 0.05 [18]. Using a higher significance level for the interaction test can increase the power when there is in fact a treatment-by-region interaction, although the increase in power may still not be sufficient. However, this also increases the risk of drawing a false-positive conclusion (i.e. increased type I error rate).

3.1.2. Qualitative interaction tests

The quantitative Cochran’s Q describes the overall mean-square distance of individual regional estimates from the overall effect. The contribution of each individual region is its mean-square distance from the overall treatment effect. The actual treatment effect in this region, whether it is beneficial (positive) or harmful (negative), is not taken into consideration. For example, there could be a treatment that is highly efficacious in all regions, but there is a quantitative difference among them. Patients across all regions can benefit from this treatment regardless of the fact that the quantitative difference may be statistically significant based on the treatment-by-region interaction test. Sometimes a small quantitative difference among regions is acceptable and the focus of interest is whether patients in all regions can benefit from the experimental treatment. To address this, detecting a change in direction of the treatment effect, or a qualitative interaction, may be more important than detecting a difference in the size of the effect across regions, or a quantitative interaction. In such circumstances, qualitative interaction tests may be clinically more relevant [20–23]. The null hypothesis for these qualitative interaction tests is

$$H_0: \{\delta_i \ge 0, \text{ for all } i = 1, \ldots, s\} \cup \{\delta_i < 0, \text{ for all } i = 1, \ldots, s\}$$


For example, the Gail-Simon qualitative test [21] is described below. Define

$$Q^- = \sum_{i=1}^{s} \frac{\hat{\delta}_i^2}{2/N_i} I(\hat{\delta}_i > 0), \qquad Q^+ = \sum_{i=1}^{s} \frac{\hat{\delta}_i^2}{2/N_i} I(\hat{\delta}_i < 0)$$

The null hypothesis that there is no qualitative interaction between treatment and region is rejected if $\{\min(Q^+, Q^-) > c\}$, where the critical value $c$ is provided in [21]. Other qualitative tests of interaction, such as the range test, may also be used. The range test can be more powerful if the treatment effect is negative in only one region and positive in all other regions. If the number of regions is small (e.g. $s = 2$ or 3), the range test and the likelihood ratio test will have similar power [22,24].
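A small sketch of the two Gail-Simon sums follows, using hypothetical regional data. The critical value $c$ should be taken from [21]; the p-value computed here uses a chi-square mixture with binomial weights, which is an assumption recalled from the literature on this test rather than something stated in the text above, and should be checked against [21] before use. The helper `chi2_sf` is written here for self-containment.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for integer df (closed form)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))
    extra = sum(h**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * extra

def gail_simon(delta_hat_i, n_i):
    """Return (Q_minus, Q_plus, p_value) under the unit-variance set-up."""
    # Q_minus sums over regions with positive estimates, Q_plus over negative ones.
    q_minus = sum(n * d * d / 2 for n, d in zip(n_i, delta_hat_i) if d > 0)
    q_plus = sum(n * d * d / 2 for n, d in zip(n_i, delta_hat_i) if d < 0)
    stat = min(q_minus, q_plus)
    s = len(n_i)
    # Assumed mixture: sum over h = 1..s-1 of Bin(h; s-1, 1/2) * P(chi2_h > stat).
    p_value = sum(math.comb(s - 1, h) * 0.5 ** (s - 1) * chi2_sf(stat, h)
                  for h in range(1, s))
    return q_minus, q_plus, p_value

# Hypothetical data (illustration only): one small region with a negative estimate.
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, -0.05]
q_minus, q_plus, p_value = gail_simon(delta_hat_i, n_i)
print(f"Q- = {q_minus:.3f}, Q+ = {q_plus:.3f}, p = {p_value:.2f}")
```

Here the single negative region is small and its estimate is close to zero, so $Q^+$ (0.125) is far below $Q^-$ (28.375) and the test gives no indication of a qualitative interaction.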

3.1.3. Quantifying heterogeneity

When the sample size is large, the quantitative treatment-by-region interaction tests discussed in Section 3.1.1 could be statistically significant even when the actual difference among regions is small. Therefore, the p-value based on an interaction test does not describe the extent of heterogeneity in the results. Instead, the magnitude of heterogeneity should be quantified and its clinical relevance in terms of its impact on the conclusions should be assessed. Higgins $I^2$, derived from the Cochran’s Q statistic, was proposed to measure the degree of inconsistency across studies in the context of meta-analysis [16]. $I^2$ is defined as $I^2 = 100\% \times [Q - (s-1)]/Q$. Negative values of $I^2$ are set to zero so that $I^2$ lies between 0 and 100%. A value of 0% indicates no observed heterogeneity, whereas a larger value shows increasing heterogeneity. Empirically, $I^2$ values of 25, 50, and 75% are considered low, moderate, and high heterogeneity, respectively [16].

Assume the heterogeneity test based on Cochran’s Q gives a p-value of $p$. The Higgins $I^2$ can be written as $I^2 = [\chi^2_{s-1,1-p} - (s-1)]/\chi^2_{s-1,1-p}$. For example, when there are four regions, a p-value of 0.05 based upon Cochran’s Q is equivalent to a Higgins $I^2$ of 62%, and a p-value of 0.10 is equivalent to a Higgins $I^2$ of 52%. When there are fewer regions ($s = 2$), p-values of 0.05 and 0.10 correspond to $I^2$ of 74 and 63%, respectively.
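The correspondence between the Cochran’s Q p-value and Higgins $I^2$ can be checked numerically. The sketch below inverts the chi-square upper tail by bisection using only the standard library; the function names `chi2_sf`, `chi2_upper_quantile` and `i2_from_p` are introduced here for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for integer df (closed form)."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        return math.exp(-h) * sum(h**j / math.factorial(j) for j in range(df // 2))
    extra = sum(h**(j - 0.5) / math.gamma(j + 0.5) for j in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(h)) + math.exp(-h) * extra

def chi2_upper_quantile(p, df):
    """Solve chi2_sf(x, df) = p for x by bisection (x is the 1 - p quantile)."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def i2_from_p(p, s):
    """Higgins I^2 (as a fraction) implied by a Cochran's Q p-value with s regions."""
    q = chi2_upper_quantile(p, s - 1)
    return max(0.0, (q - (s - 1)) / q)

for s in (4, 2):
    for p in (0.05, 0.10):
        print(f"s = {s}, p = {p:.2f}: I2 = {100 * i2_from_p(p, s):.0f}%")
```

Run as written, this reproduces the four values quoted above (62% and 52% for four regions, 74% and 63% for two).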

3.2. Multivariate quantitative methods

Rather than basing the decision on one single global statistic, the multivariate quantitative methods discussed in this section allow assessment of overall consistency by considering all pairwise differences simultaneously and can also identify the source of inconsistency, if any, by identifying pairs that do not meet the consistency criteria. Consistency of treatment effect across regions is concluded if $\{|\hat{\delta}_i - \hat{\delta}_j| < b_{ij} : i, j = 1, \ldots, s;\ i \ne j\}$, where the $b_{ij}$’s are bounds for the pairwise differences in point estimates and may be different across pairs due to sample size differences.

3.2.1. Difference not statistically significant

When there are only two regions, conventional two-sample tests can be used to compare results between the two regions. The null hypothesis that there is no difference in treatment effect between the regions is not rejected if the two-sided test fails to reach statistical significance at the nominal level $\alpha$. The same idea can be extended to multiple regions. Consistency of treatment effect is concluded if none of the pairwise differences is statistically significant, i.e. $\{|\hat{\delta}_i - \hat{\delta}_j| < b_{ij} : i, j = 1, \ldots, s;\ i \ne j\}$, where $b_{ij} = z_{\alpha/2}\sqrt{2N^{-1}(f_i^{-1} + f_j^{-1})}$ and $z_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of a standard normal distribution. The bound $b_{ij}$ depends on the sample sizes in Regions $i$ and $j$, with a tighter bound for larger sample sizes. This indicates that it is easier to conclude consistency for regions with smaller sample sizes. Also, a larger $\alpha$ actually leads to a tighter bound, that is, a higher hurdle for consistency.
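The pairwise rule can be illustrated with a short sketch. It uses the identity $2N^{-1}(f_i^{-1} + f_j^{-1}) = 2/N_i + 2/N_j$ and hypothetical regional data; the function names are chosen here for illustration only.

```python
import math
from itertools import combinations
from statistics import NormalDist

def pairwise_bounds(n_i, alpha=0.05):
    """b_ij = z_{alpha/2} * sqrt(2/N_i + 2/N_j) for every pair of regions."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {(i, j): z * math.sqrt(2 / n_i[i] + 2 / n_i[j])
            for i, j in combinations(range(len(n_i)), 2)}

def inconsistent_pairs(delta_hat_i, n_i, alpha=0.05):
    """Pairs (i, j) whose observed difference reaches or exceeds the bound b_ij."""
    bounds = pairwise_bounds(n_i, alpha)
    return [(i, j) for (i, j), b in bounds.items()
            if abs(delta_hat_i[i] - delta_hat_i[j]) >= b]

# Hypothetical per-arm sample sizes and regional estimates (illustration only).
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, -0.05]
flagged_05 = inconsistent_pairs(delta_hat_i, n_i, alpha=0.05)
flagged_20 = inconsistent_pairs(delta_hat_i, n_i, alpha=0.20)
print(flagged_05, len(flagged_20))
```

With these inputs only the pair formed by the largest and smallest regions (indices 0 and 3) is flagged at $\alpha = 0.05$, whereas three pairs are flagged at $\alpha = 0.20$, which shows how a larger $\alpha$ tightens the bounds.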

An alternative approach is to generate confidence intervals (CIs) for individual regions, with overlapping CIs indicating consistent treatment effect. This can be considered a special case of the above method. For example, consider a simple case where the sample size is the same across regions. If two-sided $100(1-\alpha^*)\%$ CIs are generated for the individual treatment effects, the consistency criterion of overlapping CIs is equivalent to that based on the lack of statistical significance at nominal level $2\{1 - \Phi(\sqrt{2}\, z_{\alpha^*/2})\}$, where $\Phi$ is the cumulative standard normal distribution function. When $\alpha^* = 0.166$ (or, two-sided 83.3% CIs for individual estimates), this overlapping CI approach is equivalent to the lack of statistical significance method at level 0.05.
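The mapping from the CI level to the implied pairwise significance level can be verified directly. With equal per-arm size $n$ in two regions, the CIs overlap when $|\hat{\delta}_i - \hat{\delta}_j| < 2 z_{\alpha^*/2}\sqrt{2/n}$, while the standard error of the difference is $\sqrt{4/n}$, which gives the factor $\sqrt{2}$. The sketch below is illustrative and the function name is introduced here.

```python
import math
from statistics import NormalDist

def implied_pairwise_level(alpha_star):
    """Two-sided level implied by requiring 100(1 - alpha_star)% CIs to overlap,
    assuming equal sample sizes across regions."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha_star / 2)
    return 2 * (1 - nd.cdf(math.sqrt(2) * z))

level_166 = implied_pairwise_level(0.166)
level_05 = implied_pairwise_level(0.05)
print(f"{level_166:.3f}")
print(f"{level_05:.4f}")
```

The first value is approximately 0.05, matching the statement above. The second shows that requiring conventional 95% CIs to overlap corresponds to a much stricter pairwise test, at a level well below 0.01.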

3.2.2. Equivalence hypothesis tests

In the context of ‘bridging’ studies, a similarity measurement is defined and an equivalence test is proposed to test the null hypothesis of being not equivalent [25]. In the context of vaccine clinical development, demonstration of manufacturing lot consistency is generally required before licensure [26]. This equivalence test is similar to that described in [25] except that there are three pairs of lot-to-lot comparisons. These equivalence tests can be extended to multiple regions with a pre-defined equivalence margin $m$. Consistency of treatment effect is concluded if $\{|\hat{\delta}_i - \hat{\delta}_j| < b_{ij} : i, j = 1, \ldots, s;\ i \ne j\}$, where $b_{ij} = m - z_{\alpha/2}\sqrt{2N^{-1}(f_i^{-1} + f_j^{-1})}$. When the equivalence margin is tight (i.e. $m$ is small) and sample sizes are small for Regions $i$ and $j$, the bound $b_{ij}$ could be negative, which means assessment of consistency using this definition is not possible. This implies that the equivalence methods require a sufficiently large sample size in each region for any prespecified $m$.
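The sample-size requirement can be made concrete with a short sketch. For two regions with equal per-arm size $n$, the bound is positive only when $n > 4 z_{\alpha/2}^2 / m^2$. The margin of 0.5 used below is purely hypothetical, and the function name is introduced for illustration.

```python
import math
from statistics import NormalDist

def equivalence_bound(n_a, n_b, margin, alpha=0.05):
    """b_ij = m - z_{alpha/2} * sqrt(2/N_i + 2/N_j); negative means the
    equivalence criterion cannot be met for this pair of regions."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return margin - z * math.sqrt(2 / n_a + 2 / n_b)

# Hypothetical margin m = 0.5 on the treatment-effect scale (illustration only).
margin = 0.5
for n in (50, 61, 62, 400):
    print(n, round(equivalence_bound(n, n, margin), 4))

# Smallest equal per-arm size giving a positive bound: n > 4 * z^2 / m^2.
z = NormalDist().inv_cdf(0.975)
min_n = math.floor(4 * z * z / margin**2) + 1
print(min_n)
```

Under these assumptions the bound turns positive at 62 patients per arm per region; any smaller region makes the equivalence assessment impossible regardless of the observed estimates.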

3.3. Multivariate qualitative methods

Similar to the argument for qualitative interaction tests (Section 3.1.2), multivariate qualitative methods can be used to simultaneously assess whether patients across all regions can benefit from the treatment. Consistency of treatment effect is concluded if all individual point estimates $\hat{\delta}_i$ are better than their corresponding reference values, or $\{\hat{\delta}_i > p\hat{\delta} + c_i, \text{ for all } i = 1, \ldots, s\}$, where $p$ ($0 \le p \le 1$) is the fraction of the observed overall treatment effect $\hat{\delta}$ to be preserved, and $c_i$ ($-\infty < c_i < \infty$) represents the ‘burden of proof’ in addition to the observed fraction $p\hat{\delta}$. A positive $c_i$ means additional confidence above the target preserved effect $p\hat{\delta}$ is required for concluding consistency, while a negative $c_i$ means an allowance below the target preserved effect $p\hat{\delta}$ is accepted.

3.3.1. PMDA methods (extension)

PMDA Method 1 [3] requires that the observed treatment effect for Japanese patients is at least half of that observed for all patients. This idea has been extended to multiple regions [27], and consistency of treatment effect across all regions is concluded if $\{\hat{\delta}_i > p\hat{\delta}, \text{ for all } i = 1, \ldots, s\}$. When more than two regions are included in the assessment, a value of $p \ge 0.5$ may be too conservative and even impractical. A smaller value (e.g. $p = 1/s$) is recommended instead [27]. ‘PMDA Method 1’ in this paper is an extension of Method 1 in the original Japanese MHLW guidance, which only considers the Japanese cohort vs others.

Under PMDA Method 2 [3], consistency of treatment effect across regions is concluded if a ‘positive trend’ is observed, or $\{\hat{\delta}_i > 0, \text{ for all } i = 1, \ldots, s\}$. This is equivalent to concluding consistency if $\{Q^+ = 0\}$, where $Q^+$ is defined for the Gail-Simon qualitative interaction test in Section 3.1.2. This indicates that consistency concluded by PMDA Method 2 always implies a consistency conclusion by the Gail-Simon test.
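Both extended PMDA criteria are simple enough to sketch directly. The data below are hypothetical and chosen so that all regional estimates are positive but one is small; the function names are introduced for illustration.

```python
def overall_effect(delta_hat_i, n_i):
    """Overall estimate: sample-size-weighted average of regional estimates."""
    return sum(n * d for n, d in zip(n_i, delta_hat_i)) / sum(n_i)

def pmda_method_1(delta_hat_i, n_i, p):
    """Extended Method 1: every regional estimate exceeds p times the overall one."""
    threshold = p * overall_effect(delta_hat_i, n_i)
    return all(d > threshold for d in delta_hat_i)

def pmda_method_2(delta_hat_i):
    """Extended Method 2: every regional estimate is positive."""
    return all(d > 0 for d in delta_hat_i)

# Hypothetical per-arm sample sizes and regional estimates (illustration only).
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, 0.06]
s = len(n_i)
print(pmda_method_1(delta_hat_i, n_i, p=0.5))    # False: 0.10 and 0.06 fall below 0.1105
print(pmda_method_1(delta_hat_i, n_i, p=1 / s))  # True: all estimates exceed 0.05525
print(pmda_method_2(delta_hat_i))                # True: all estimates are positive
```

The example illustrates why $p = 0.5$ can be hard to satisfy with several regions, while $p = 1/s$ and the positive-trend criterion are met by the same data.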

3.3.2. Non-inferiority hypothesis tests

Some authors interpret the word ‘similarity’ in E5 as statistical equivalence and propose equivalence/non-inferiority tests for a bridging study [28]. To conclude ‘similar’ treatment effect in a new region, one needs to show non-inferiority in the new region compared to a preserved fraction of that from the original region. This can be extended to multiple regions. The null and alternative hypotheses are $H_0: \delta_1 \le p\delta$ or $\ldots$ or $\delta_s \le p\delta$, vs the alternative that all $\delta_i > p\delta$. Using a two-sided $100(1-\alpha)\%$ CI approach, consistency of treatment effect across regions is claimed if $\{\hat{\delta}_i > p\hat{\delta} + z_{\alpha/2}\sqrt{2N^{-1}(f_i^{-1} - 2p + p^2)},\ i = 1, \ldots, s\}$. The non-inferiority method is always more conservative than PMDA Method 1 if the same $p$ is used. The positive value of $c_i$, which is the half width of the CI for $(\delta_i - p\delta)$, indicates that additional confidence above the target preserved effect $p\hat{\delta}$ is required for concluding consistency. This additional confidence is usually a high burden. For example, even when $p = 0$, this method requires demonstrating statistical significance at level $\alpha$ (two-sided) within each region, when the sample size in each region is only a fraction of the total sample size. A possible modification to make this approach less conservative is to use a greater $\alpha$ for the non-inferiority test. For example, a two-sided 80% CI (or $\alpha = 0.20$) will lead to a lower reference value of $p\hat{\delta} + 1.28 \times \sqrt{2N^{-1}(f_i^{-1} - 2p + p^2)}$ for region $i$, which is less stringent than that obtained by replacing 1.28 with 1.96 based on the conventional 95% CI (or $\alpha = 0.05$). However, such a modification is still conservative. For example, even when $\alpha = 1.0$ (or 0.5 for one-sided; i.e. just comparing the point estimates), the lower reference value is $p\hat{\delta}$, which is the same as PMDA Method 1.
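The reference values of the non-inferiority criterion, and their two limiting cases mentioned above, can be sketched as follows. The data are hypothetical and the function name is introduced for illustration; the variance term $2N^{-1}(f_i^{-1} - 2p + p^2)$ is the variance of $\hat{\delta}_i - p\hat{\delta}$ under the unit-variance set-up of Section 2.2.

```python
import math
from statistics import NormalDist

def noninferiority_thresholds(delta_hat_i, n_i, p, alpha=0.05):
    """Reference values p*d + z_{alpha/2} * sqrt(2/N * (1/f_i - 2p + p^2))."""
    n_total = sum(n_i)
    overall = sum(n * d for n, d in zip(n_i, delta_hat_i)) / n_total
    z = NormalDist().inv_cdf(1 - alpha / 2)
    thresholds = []
    for n in n_i:
        f = n / n_total
        half_width = z * math.sqrt(2 / n_total * (1 / f - 2 * p + p * p))
        thresholds.append(p * overall + half_width)
    return thresholds

# Hypothetical per-arm sample sizes and regional estimates (illustration only).
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, 0.06]
t_95 = noninferiority_thresholds(delta_hat_i, n_i, p=0.25, alpha=0.05)
t_80 = noninferiority_thresholds(delta_hat_i, n_i, p=0.25, alpha=0.20)
t_pt = noninferiority_thresholds(delta_hat_i, n_i, p=0.25, alpha=1.0)
for row in zip(delta_hat_i, t_95, t_80, t_pt):
    print(" ".join(f"{v:.3f}" for v in row))
```

With $\alpha = 1.0$ the half width vanishes and every reference value equals $p\hat{\delta}$, the PMDA Method 1 threshold. With $p = 0$ the reference value for region $i$ reduces to $z_{\alpha/2}\sqrt{2/N_i}$, that is, significance within the region.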

3.3.3. CI covering a target value

In early clinical development, a decision to discontinue a clinical development program may be made if the CI for the treatment effect does not cover a target value of interest (e.g. a clinically meaningful minimum effect), or alternatively, the upper bound of the CI is less than the target value. A similar idea may apply to assessing consistency across regions. Consistency of treatment effect is concluded if the target preserved effect $p\hat{\delta}$ is covered by all the individual CIs, or the upper bounds of the CIs are greater than the target value, i.e. $\{\hat{\delta}_i > p\hat{\delta} - z_{\alpha/2}\sqrt{2N^{-1}f_i^{-1}},\ i = 1, \ldots, s\}$. In this approach, the target value $p\hat{\delta}$ is treated as a constant rather than a random observation. It is noted that the $c_i$’s are negative, which indicates that some allowance below the target preserved effect is accepted.

An alternative approach is to plot the confidence limits for the overall treatment effect scaled by the square root of sample size. As sample size increases along the X-axis, the confidence limits converge to the overall treatment effect, resulting in a funnel-shaped plot [29]. Consistency is concluded if point estimates from individual regions, plotted with x coordinates corresponding to their respective sample sizes, are no worse than the lower confidence limit. Specifically, the two-sided $100(1-\alpha)\%$ CI for the overall treatment effect is $(\hat{\delta} - z_{\alpha/2}\sqrt{2/N},\ \hat{\delta} + z_{\alpha/2}\sqrt{2/N})$. If we plot the confidence limits vs the sample size $N$, we will get a funnel plot since the distance $z_{\alpha/2}\sqrt{2/N}$ between the confidence limits and the center line $\hat{\delta}$ is proportional to $N^{-1/2}$. By requiring all point estimates from individual regions to be greater than the lower limit, that is, $\{\hat{\delta}_i > \hat{\delta} - z_{\alpha/2}\sqrt{2/(Nf_i)},\ i = 1, \ldots, s\}$, this is exactly the same as described above with $p = 1$. A similar approach was used in a recent regulatory review [29], where in addition to requiring all point estimates from individual regions to be greater than the lower confidence limit (i.e. no evidence that a certain region is far worse), the reviewer also compared them to the upper confidence limit to show that there was no evidence that a certain region was far better (or, an ‘outlier’).
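The funnel criterion amounts to evaluating the overall confidence limits at each region's own sample size. The sketch below does this for hypothetical data and reports whether each regional estimate lies above the lower limit and below the upper limit; the function name and data are illustrative assumptions.

```python
import math
from statistics import NormalDist

def funnel_limits(overall, n, alpha=0.05):
    """Confidence limits around the overall effect at per-arm sample size n."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(2 / n)
    return overall - half_width, overall + half_width

# Hypothetical per-arm sample sizes and regional estimates (illustration only).
n_i = [400, 300, 200, 100]
delta_hat_i = [0.30, 0.25, 0.10, -0.05]
overall = sum(n * d for n, d in zip(n_i, delta_hat_i)) / sum(n_i)

results = []
for n, d in zip(n_i, delta_hat_i):
    lower, upper = funnel_limits(overall, n)
    results.append((d > lower, d < upper))   # (not far worse, not far better)
    print(f"N_i = {n}: estimate {d:+.2f}, limits ({lower:+.3f}, {upper:+.3f})")

consistent = all(above_lower for above_lower, _ in results)
print(consistent)
```

In this example even the small region with a negative point estimate stays inside the funnel, because the allowance $z_{\alpha/2}\sqrt{2/N_i}$ is wide when $N_i$ is small.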

4. A CASE STUDY – PURSUIT

PURSUIT trial (Platelet Glycoprotein IIb/IIIa inUnstable Angina: Receptor Suppression UsingIntegrilin Therapy) compared the platelet glyco-protein IIb/IIIa inhibitor eptifibatide with placeboin patients who had acute coronary syndromes butdid not have persistent ST-segment elevation[8,19,20]. A total of 10 948 patients were enrolledfrom 726 hospitals in 28 countries within fourgeographic regions [19]. About 41% patients wereenrolled in North America (NA), 39% in WesternEurope (WE), 16% in Eastern Europe (EE), and4% in Latin America (LA). As compared with theplacebo group, the eptifibatide group had a 1.5%absolute reduction in the incidence of the primaryendpoint of death or nonfatal MI within 30 days(14.2 vs 15.7% in the placebo group; p5 0.042).However, the benefit was not apparent in the twosmaller regions of LA and EE. Table I shows theodds ratios and corresponding 95% CIs for theoverall and individual regions, respectively. Forour purpose, we will use the PURSUIT study toillustrate the various methods for assessingconsistency of treatment effect across regions.For discussion below, the treatment effect is


defined as {−log(odds ratio)}, since it can be better approximated by a normal distribution and also a greater value indicates better treatment effect.
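As a rough illustration of this definition, the sketch below converts the odds ratios in Table I to the −log(odds ratio) scale. The standard errors are not reported in the paper; here they are backed out of the published 95% CIs under the assumption that the intervals are symmetric on the log scale. The names `TABLE_I` and `effect_and_se` are hypothetical.

```python
import math

# Odds ratios and 95% CIs transcribed from Table I (eptifibatide vs placebo).
TABLE_I = {
    "Overall":        (0.89, 0.79, 0.99),
    "North America":  (0.75, 0.63, 0.91),
    "Western Europe": (0.92, 0.77, 1.11),
    "Latin America":  (1.03, 0.60, 1.76),
    "Eastern Europe": (1.09, 0.85, 1.39),
}

Z_975 = 1.959964  # normal quantile for a two-sided 95% interval


def effect_and_se(odds_ratio, ci_low, ci_high):
    """Return (-log(OR), approximate SE) from an odds ratio and its 95% CI."""
    effect = -math.log(odds_ratio)
    # Assumption: the CI is a Wald-type interval, symmetric on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_975)
    return effect, se


effects = {name: effect_and_se(*row) for name, row in TABLE_I.items()}

for name, (effect, se) in effects.items():
    print(f"{name:15s} effect = {effect:+.2f}  approx. SE = {se:.3f}")
```

On this scale a positive value favours eptifibatide, and the rounded effects agree with the point estimates listed in Table III.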

4.1. Global methods

Table II shows the quantitative interaction tests based on Cochran’s Q and the Breslow-Day test [30]. Both tests give similar results with a borderline two-sided p-value of 0.08. To help understand the clinical relevance of the apparent heterogeneity, the Higgins $I^2$ is also derived. A moderate heterogeneity was observed across regions (Higgins $I^2$ = 55%), but the precision of the estimate is limited (95% CI: 0–85%) due to insufficient sample size. Further investigation using the Gail-Simon approach did not show a qualitative difference among regions (p = 0.59) because the two regions with negative treatment effect, LA and EE, had much smaller sample size and their observed negative effects were not far in magnitude from 0. As a result, the $Q^+$ statistic is small with $Q^+$ = 0.48.
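The global statistics above can be approximated from the summary data in Table I. The following is a minimal sketch, not the authors' computation: standard errors are backed out of the rounded 95% CIs, so Cochran's Q and $I^2$ will only be close to the published values, and the Breslow-Day test (which needs the 2×2 tables) is not attempted. The helper `chi2_sf` and the names `q_benefit` and `q_harm` are hypothetical; the latter two appear to correspond to $Q^-$ and $Q^+$ in Table II.

```python
import math

# (odds ratio, 95% CI lower, 95% CI upper) per region, from Table I.
REGIONS = {
    "NA": (0.75, 0.63, 0.91),
    "WE": (0.92, 0.77, 1.11),
    "LA": (1.03, 0.60, 1.76),
    "EE": (1.09, 0.85, 1.39),
}
Z_975 = 1.959964


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    # Closed forms for df = 1 and df = 2, then the two-step recurrence in df.
    k = 1 if df % 2 else 2
    tail = math.erfc(math.sqrt(x / 2)) if k == 1 else math.exp(-x / 2)
    while k < df:
        tail += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return tail


# Treatment effect is -log(OR); SEs are an approximation derived from CI widths.
effects = {r: -math.log(o) for r, (o, lo, hi) in REGIONS.items()}
ses = {r: (math.log(hi) - math.log(lo)) / (2 * Z_975)
       for r, (o, lo, hi) in REGIONS.items()}

# Cochran's Q: inverse-variance weighted squared deviations from the pooled effect.
weights = {r: 1 / ses[r] ** 2 for r in REGIONS}
pooled = sum(weights[r] * effects[r] for r in REGIONS) / sum(weights.values())
q_stat = sum(weights[r] * (effects[r] - pooled) ** 2 for r in REGIONS)
df = len(REGIONS) - 1
q_pvalue = chi2_sf(q_stat, df)

# Higgins I^2: share of total variation attributed to heterogeneity.
i_squared = max(0.0, (q_stat - df) / q_stat)

# Gail-Simon: sum squared z-scores separately by the sign of the effect.
q_benefit = sum((effects[r] / ses[r]) ** 2 for r in REGIONS if effects[r] > 0)
q_harm = sum((effects[r] / ses[r]) ** 2 for r in REGIONS if effects[r] < 0)
gs_stat = min(q_benefit, q_harm)
s = len(REGIONS)
# p-value: binomial(s - 1, 1/2) mixture of chi-square tail probabilities.
gs_pvalue = sum(math.comb(s - 1, k) * 0.5 ** (s - 1) * chi2_sf(gs_stat, k)
                for k in range(1, s))

print(f"Cochran Q = {q_stat:.2f} (df = {df}), p = {q_pvalue:.2f}, I^2 = {i_squared:.0%}")
print(f"Gail-Simon: min({q_harm:.2f}, {q_benefit:.2f}), p = {gs_pvalue:.2f}")
```

The Gail-Simon statistic is driven by the smaller of the two sums, which is why two small regions with effects near zero contribute little evidence of a qualitative interaction.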

4.2. Multivariate quantitative methods

To help identify regions with inconsistent treatment effect, multivariate quantitative methods based on pairwise differences are applied to the PURSUIT example. Figure 1 shows the pairwise differences and their associated 95% CIs. For illustration purposes, a two-fold margin for odds ratios (or, equivalently, log(2) = 0.69 for the difference in −log(odds ratios)) is used in the equivalence approach. As shown in the graph, the upper bound of the 95% CI for the difference between NA and LA exceeds the equivalence margin and is therefore considered not consistent. The much smaller sample size in LA, which results in a wide CI, is the primary reason for this conclusion. The method of showing lack of statistical significance, which is equivalent to the CI covering 0, is also applied. As shown in Figure 1, the lower bound of the CI for the difference between NA and EE is above 0, indicating inconsistency in the EE region.
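The pairwise comparisons described above can be sketched from Table I as follows. This is an approximation under stated assumptions: regional estimates are treated as independent, and their standard errors are backed out of the rounded 95% CIs. The names `intervals`, `not_equivalent`, and `significant` are hypothetical.

```python
import itertools
import math

# (odds ratio, 95% CI lower, 95% CI upper) per region, from Table I.
REGIONS = {
    "NA": (0.75, 0.63, 0.91),
    "WE": (0.92, 0.77, 1.11),
    "LA": (1.03, 0.60, 1.76),
    "EE": (1.09, 0.85, 1.39),
}
Z_975 = 1.959964
MARGIN = math.log(2)  # two-fold margin on the odds-ratio scale, about 0.69

effects = {r: -math.log(o) for r, (o, lo, hi) in REGIONS.items()}
ses = {r: (math.log(hi) - math.log(lo)) / (2 * Z_975)
       for r, (o, lo, hi) in REGIONS.items()}

# 95% CI for each pairwise difference in -log(OR).
intervals = {}
for a, b in itertools.combinations(REGIONS, 2):
    diff = effects[a] - effects[b]
    # Assumption: regions enrol different patients, so estimates are independent.
    se_diff = math.hypot(ses[a], ses[b])
    intervals[(a, b)] = (diff - Z_975 * se_diff, diff + Z_975 * se_diff)

# Equivalence approach: the whole CI must lie inside (-MARGIN, +MARGIN).
not_equivalent = {pair for pair, (lo, hi) in intervals.items()
                  if lo < -MARGIN or hi > MARGIN}
# Lack-of-significance approach: the CI must cover 0.
significant = {pair for pair, (lo, hi) in intervals.items() if lo > 0 or hi < 0}

for pair, (lo, hi) in intervals.items():
    print(f"{pair[0]} - {pair[1]}: ({lo:+.2f}, {hi:+.2f})")
print("Outside equivalence margin:", sorted(not_equivalent))
print("CI excludes 0:", sorted(significant))
```

The two rules flag different pairs because the equivalence rule penalizes wide intervals (small regions), while the significance rule penalizes narrow intervals that sit away from 0.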

4.3. Multivariate qualitative methods

For a slightly different question of whether patients across regions can benefit from the treatment, the multivariate qualitative methods, which consider consistency as being ‘consistently good’, were applied. Table III shows the derived thresholds for concluding consistency based on different multivariate qualitative methods. PMDA Method 2 applies a constant of 0 as the lower threshold for individual estimates. The point estimates from NA and WE, 0.29 and 0.08, respectively, are greater than this threshold.

Table I. PURSUIT: Efficacy by Region.

Region            N        Odds ratio   95% confidence interval
Overall           10 948   0.89         (0.79, 0.99)
North America     4358     0.75         (0.63, 0.91)
Western Europe    4243     0.92         (0.77, 1.11)
Latin America     585      1.03         (0.60, 1.76)
Eastern Europe    1762     1.09         (0.85, 1.39)

N: sample size. Odds ratio: eptifibatide vs placebo, lower the better [19].

Table II. PURSUIT: Global Methods – Interaction Tests.

Method                           Chi-square statistic                                         Degrees of freedom   Two-sided p-value
Quantitative interaction tests
  Cochran’s Q                    6.64                                                         3                    0.08
  Breslow-Day                    6.72                                                         3                    0.08
Qualitative interaction test
  Gail-Simon                     min($Q^+$, $Q^-$), where $Q^-$ = 10.20, $Q^+$ = 0.48                              0.59

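Figure 1. PURSUIT: Multivariate Quantitative Methods. [Plot of pairwise differences between regions with 95% CIs; horizontal axis centered at 0.0 and labelled ‘better’/‘worse’; annotations: equivalence margin (+/- 0.69); lack of statistical significance if covering 0.]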


However, the point estimates from LA and EE are both below this threshold. Results from these two regions therefore cannot be considered consistent based on the PMDA Method 2. PMDA Method 1 applies a higher threshold of 0.06, which is half of the overall treatment effect. However, the conclusion remains the same since estimates from both NA and WE remain above this higher threshold. The non-inferiority approach leads to different thresholds among regions due to varying sample size, with higher thresholds for regions with smaller sample size. This approach is much more conservative compared to other methods in this category. Only the NA region can meet this conservative definition. On the other hand, the approach of CI covering the overall treatment effect tends to give more relaxed thresholds. In fact, the derived thresholds are all less than 0, with the lowest threshold for the smallest region (LA). This is in contrast with the non-inferiority approach where the threshold is the highest for the smallest region. All regions meet this definition.
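The comparisons in this section all apply the same rule, $\hat{\delta}_i > p\hat{\delta} + c_i$, with different choices of $p$ and $c_i$. The sketch below applies that rule to the values reported in Table III. The PMDA thresholds are computed from $p$ and the overall estimate; the non-inferiority and CI-coverage thresholds are copied from the table rather than re-derived, since their variance formulas are not given in this section. The names `ESTIMATES`, `THRESHOLDS`, and `regions_meeting` are hypothetical.

```python
# Point estimates of -log(odds ratio), as reported in Table III.
OVERALL = 0.12
ESTIMATES = {"NA": 0.29, "WE": 0.08, "LA": -0.03, "EE": -0.09}

# Lower thresholds p * OVERALL + c_i for each method.
THRESHOLDS = {
    # p = 1/2, c_i = 0
    "PMDA 1": {r: 0.5 * OVERALL for r in ESTIMATES},
    # p = 0, c_i = 0
    "PMDA 2": {r: 0.0 for r in ESTIMATES},
    # Values below are transcribed from Table III, not recomputed.
    "Non-inferiority": {"NA": 0.28, "WE": 0.28, "LA": 0.81, "EE": 0.38},
    "CI covering overall": {"NA": -0.07, "WE": -0.07, "LA": -0.42, "EE": -0.13},
}


def regions_meeting(thresholds):
    """Regions whose point estimate exceeds the method's lower threshold."""
    return [r for r, est in ESTIMATES.items() if est > thresholds[r]]


results = {method: regions_meeting(t) for method, t in THRESHOLDS.items()}
# Consistency is concluded only if every region clears its threshold.
consistent = {method: len(passed) == len(ESTIMATES)
              for method, passed in results.items()}

for method, passed in results.items():
    print(f"{method:20s} regions meeting threshold: {passed}  "
          f"consistent: {consistent[method]}")
```

Only the CI-coverage rule concludes consistency for all four regions here, which reflects how strongly the conclusion depends on the choice of $p$ and $c_i$.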

5. DISCUSSION

In this paper, we have reviewed methods for assessing the consistency of treatment effect across all regions in a MRCT, after the treatment effect based on the overall patient population has been demonstrated. For such an assessment, not one particular region is considered more important than the others. One of the main objectives for such an assessment is to support the robustness of the conclusions based on the overall data. In MRCTs, some degree of heterogeneity across regions is expected; however, a clinically relevant question is how such heterogeneity may affect the conclusion. Methods reviewed in this paper try to address this question, although they may approach it from different angles (e.g., global vs multivariate methods, quantitative vs qualitative assessment, etc.). In other scenarios, a certain region or even a country may be of special interest. Such a region/country-specific assessment of consistency is usually requested by a local regulatory agency to support registration, sometimes at the country level. Given that the protocol for a global study will need to be submitted to regulatory agencies around the world, it may be difficult to single out one (or a few) regions/countries as being more important and pre-specify region/country-specific assessment for these specific regions/countries. In addition, the premise of a global MRCT is that there is no a priori systematic or scientific reason to believe the regions would be inconsistent. Otherwise, different development strategies such as local stand-alone trials may be more appropriate. Therefore, the region/country-specific treatment effects likely are not different from each other, and potentially many such analyses as requested by the local agencies will have an increased chance of false-positive findings due to multiplicity [31].

Table III. PURSUIT: Multivariate Qualitative Methods Lower Thresholds For Consistency.

Consistency criteria: $\{\hat{\delta}_i > p\hat{\delta} + c_i,\ \text{for all } i = 1, \ldots, s\}$. Derived lower threshold ($p\hat{\delta} + c_i$) by different methods:

PMDA 1: $p = 1/2$, $c_i = 0$
PMDA 2: $p = 0$, $c_i = 0$
Non-inferiority†: $p = 1/2$; $c_i$ = half width of CI for ($\delta_i - p\delta$)
CI† covering $\hat{\delta}$: $p = 1$; $c_i$ = −{half width of CI for $\delta_i$}

Region                     Point estimate of     PMDA 1   PMDA 2   Non-inferiority†   CI† covering $\hat{\delta}$
                           treatment effect*
Overall $\hat{\delta}$     0.12
North America              0.29                  0.06     0        0.28               −0.07
Western Europe             0.08                  0.06     0        0.28               −0.07
Latin America              −0.03                 0.06     0        0.81               −0.42
Eastern Europe             −0.09                 0.06     0        0.38               −0.13

*Treatment effect is defined as −log(odds ratio), for which a larger value means greater benefit.
†Two-sided 95% confidence interval (CI).


Without appropriate control of the false-positive rate, potential regional findings from these exploratory region/country-specific analyses need to be carefully interpreted. If regional heterogeneity is unexpectedly observed, additional exploration of intrinsic/extrinsic factors that may contribute to such a regional variation may be needed to help interpret the results [8].

To strengthen the integrity of conclusions from an assessment of consistency across all regions in MRCTs, it is recommended to pre-define regions and pre-specify approaches for assessing consistency across all regions at the design stage. In many parts of this manuscript, we do not make a clear distinction between region and country (e.g. discussion of the MERIT-HF example in Section 2). In practice, country and region for a clinical trial are usually defined differently. Country represents a political division of a geographical region, while the definition of region in an MRCT is based upon, in addition to geographical consideration, many other factors including medical practice, ethnic factors, and other intrinsic/extrinsic factors. Another workstream of the PhRMA MRCT Key Issue Team is working on this issue with an objective to provide guidance and specific recommendations on how to define regions. One statistical consideration is that the number of regions should not be large since that can affect the power/precision of the statistical assessment [27]. As in the regional analysis of the EVEREST trial [7], 20 countries from 3 continents were grouped into 4 regions: North America, South America, Eastern Europe, and Western Europe. In some other examples, multi-national trials could include multiple countries from one single geographical region and have a more homogeneous patient population.

As with the definition of region, the methods to assess consistency are better pre-specified at the time of the design of a study. The treatment-by-region interaction tests are commonly used in the literature. Other methods may also provide additional useful information and, in some cases, more appropriate answers if the research questions are slightly different. For example, quantifying the potential heterogeneity (e.g. Higgins $I^2$) could provide useful information for assessing the clinical relevance of the potential regional difference.

Some authors argue that it is clinically more relevant to assess whether patients across all regions can benefit from the treatment, i.e. the true treatment effects are positive across all individual regions, even if there are potential quantitative differences among the regions [19]. To address this, a qualitative difference (e.g. Gail-Simon qualitative interaction test, multivariate qualitative methods, etc.) rather than quantitative variation may be more relevant. In some scenarios, one may prefer multivariate methods that allow identifying the source of inconsistency as compared to the global methods. While it is recommended to pre-specify a primary approach for assessing consistency, more than one of these methods may be used in combination to help understand the nature of possible inconsistency.

The choice of methodology will also have an impact on the false-positive rate, which is the error probability of wrongly concluding an overall regional difference while in fact there is none, and the power/precision of detecting/estimating the difference when there truly is a regional difference. For quantitative approaches (Sections 3.1 and 3.2), the false-positive rate can be clearly defined since ‘no quantitative difference’ among regions is specified as $\delta_1 = \cdots = \delta_s$. The interaction tests control the false-positive rate at the nominal level (e.g. 0.05). However, the multivariate quantitative methods discussed in Section 3.2, for which the false-positive rate is defined as $\Pr\{\cup_{i \neq j} \{|\hat{\delta}_i - \hat{\delta}_j| \geq b_{ij}\} \mid \delta_1 = \cdots = \delta_s\}$, can lead to a substantially greater false-positive rate due to the increased probability of at least one pair exceeding its bound by chance. Unlike for quantitative assessment, there is no simple specification of ‘no qualitative difference’ for the qualitative methods. For example, $\delta_i$’s > 0 is considered ‘no qualitative difference’ in the Gail-Simon test, while in the non-inferiority approach (Section 3.3.2) $\delta_i$’s > $p\delta$ is considered no qualitative difference. It is therefore difficult to compare the false-positive rate among qualitative methods unless the same definition of ‘no qualitative difference’ can be agreed upon. The statistical power of detecting a real regional difference can also be different depending on the choice of method. For example, the quantitative interaction tests (Section 3.1.1) usually


have a low statistical power to detect inconsistency if the study is not designed (and/or powered) to detect a potential treatment by region interaction [18]. The qualitative interaction tests (Section 3.1.2), including the Gail-Simon test, typically have very small statistical power under practical configurations [22]. Not showing inconsistency based on these low power statistical tests does not necessarily provide convincing evidence that there is no regional difference. However, the proposal of using a higher significance level such as 0.10 (vs conventional 0.05) may also have problems. While the power is higher with the increased significance level, there is no guarantee that the resulting power is sufficient. In addition, this can also increase the risk of wrongfully claiming inconsistency when in fact there is no regional difference at all.

For some multivariate methods, interpretation of the dependency between the reference value for a certain region and its corresponding sample size could be less intuitive. In the case study (Section 4.3), the non-inferiority approach requires higher thresholds for regions with smaller sample size while the coverage by CI approach tends to allow lower thresholds for the smaller regions. The apparent discrepancy between the two methods is due to the different beliefs regarding consistency of treatment effect. For the non-inferiority method, the belief (null hypothesis) is that the treatment effect is NOT consistent and therefore convincing evidence is required for concluding consistency. Smaller regions with less information would need to meet a higher threshold to be confirmed as consistent. For the coverage by CI approach, the belief is that the treatment effect is consistent and, unless there is strong evidence to contradict that, the results will be considered consistent. Therefore, smaller regions with less information usually cannot provide sufficient information to convince otherwise. This could affect the sample size calculation at the design stage and how to partition the total sample size into pre-defined regions. The non-inferiority approach, which needs sufficient information from individual regions to confirm consistency, usually requires a large sample size that may not be feasible [27]. For the coverage by CI approach, however, increasing the sample size in a region will make the CI tighter and decrease the probability of concluding consistency. The sample size planning may be based on the goal of achieving a target probability of concluding inconsistency when there is truly inconsistency among regions, similar to that for interaction tests. More details on sample size planning can be found in [27].
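The inflation of the false-positive rate for pairwise methods, discussed earlier in this section, can be illustrated with a small Monte Carlo sketch. The setup is hypothetical and not taken from the paper: four regions of equal size with identical true effects, regional estimates that are independent standard normal, and bounds $b_{ij}$ set at the unadjusted two-sided 5% critical value for each pairwise difference. The function name and its parameters are illustrative.

```python
import itertools
import random


def pairwise_false_positive_rate(n_regions=4, n_sims=20000,
                                 z_crit=1.959964, seed=2010):
    """Estimate P(at least one pairwise |difference| reaches its bound)
    when all regions share the same true treatment effect."""
    rng = random.Random(seed)
    # Each regional estimate has unit SE, so a pairwise difference has SE sqrt(2).
    bound = z_crit * 2 ** 0.5
    hits = 0
    for _ in range(n_sims):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_regions)]
        if any(abs(a - b) >= bound
               for a, b in itertools.combinations(estimates, 2)):
            hits += 1
    return hits / n_sims


rate = pairwise_false_positive_rate()
print(f"Estimated overall false-positive rate: {rate:.3f} (nominal per-pair level 0.05)")
```

Under these assumptions the overall rate is several times the nominal per-pair level, and it grows as `n_regions` increases, which is one reason to limit the number of regions or adjust the bounds for multiplicity.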

Despite all those limitations discussed above, region/country-specific consistency assessments are frequently performed, especially by local regulatory agencies. Sometimes there may be good reasons for such region/country-specific analyses, such as expected differences in potential intrinsic/extrinsic factors. The multivariate quantitative/qualitative methods (Sections 3.2 and 3.3) reviewed in this paper for assessing overall consistency across all regions can be easily extended to region/country-specific analysis. The global methods (Section 3.1) may not directly apply to region/country-specific consistency assessment.

ACKNOWLEDGEMENTS

The authors thank Dr. Sue-Jane Wang for the invitation to contribute to this special issue. We also thank the two referees for their careful review and insightful comments.

REFERENCES

1. O’Neill RT. Multi-regional clinical trials: a regulatory perspective on issues. Presented at the FDA/Industry Statistics Workshop, Washington, DC, September 24, 2009.

2. European Medicines Agency. Reflection paper on the extrapolation of results from clinical studies conducted outside Europe to the EU-population. EMEA Doc. Ref. CHMP/EWP/692702/2008. February 2009. Available at http://www.emea.europa.eu/pdfs/human/ewp/69270208en.pdf.

3. Ministry of Health, Labour and Welfare of Japan, Basic Principles on Global Clinical Trials. September 28, 2007.

4. Vickers A, Goyal N, Harland R, Rees R. Do certain countries produce only positive results? A systematic review of controlled trials. Controlled Clinical Trials 1998; 19:159–166.

5. ICH International Conference on Harmonization Tripartite Guidance E5: Ethnic Factor in the Acceptability of Foreign Data. The US Federal Register 1998; 83:31790–31796.


6. The Hirulog and Early Reperfusion or Occlusion (HERO)-2 Trial Investigators. Thrombin-specific anticoagulation with bivalirudin versus heparin in patients receiving fibrinolytic therapy for acute myocardial infarction: the HERO-2 randomised trial. Lancet 2001; 358:1855–1863.

7. Blair JEA, Zannad F, Konstam MA et al., for the EVEREST Investigators. Continental differences in clinical characteristics, management, and outcomes in patients hospitalized with worsening heart failure. Journal of the American College of Cardiology 2008; 52:1640–1648.

8. Akkerhuis KM, Deckers JW, Boersma E et al. for the PURSUIT Investigators. Geographic variability in outcomes within an international trial of glycoprotein IIb/IIIa inhibition in patients with acute coronary syndromes. European Heart Journal 2000; 21:371–381.

9. Wedel H, DeMets D, Deedwania P et al. Challenges of subgroup analyses in multinational clinical trials: experiences from the MERIT-HF trial. American Heart Journal 2001; 142:502–511.

10. Kawai N et al. An approach to rationalize partitioning sample size into individual regions in a multiregional trial. Drug Information Journal 2008; 42:139–147.

11. Quan H, Zhao PL, Zhang J et al. Sample size considerations for Japanese patients in a multi-regional trial based on MHLW guidance. Pharmaceutical Statistics, DOI: 10.1002/pst.380.

12. Uesaka H. Sample size allocation to regions in a multiregional trial. Journal of Biopharmaceutical Statistics 2009; 19:580–594.

13. Hjalmarson A, Goldstein S, Fagerberg B et al. Effects of controlled release metoprolol on total mortality, hospitalizations, and well-being in patients with heart failure: the Metoprolol CR/XL Randomized Intervention Trial in Congestive Heart Failure (MERIT-HF). JAMA 2000; 283:1295–1302.

14. Cochran WG. The combination of estimates from different experiments. Biometrics 1954; 10:101–129.

15. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986; 7:177–188.

16. Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. British Medical Journal 2003; 327:557–560.

17. Whitehead A. Meta-analysis of controlled clinical trials. Wiley: New York, 2002.

18. Fleiss. Analysis of data from multiclinic trials. Controlled Clinical Trials 1986; 7:267–275.

19. O’Shea JC, DeMets DL. Statistical issues relating to international differences in clinical trials. American Heart Journal 2001; 142:21–28.

20. The PURSUIT Trial Investigators. Inhibition of platelet glycoprotein IIb/IIIa with eptifibatide in patients with acute coronary syndromes. New England Journal of Medicine 1998; 339:436–443.

21. Gail M, Simon R. Testing for qualitative interaction between treatment effects and patient subsets. Biometrics 1985; 41:361–372.

22. Piantadosi S, Gail M. A comparison of the power of two tests for qualitative interactions. Statistics in Medicine 1993; 12:1239–1248.

23. Ciminera JL, Heyse JF, Nguyen HH, Tukey JW. Tests for qualitative treatment-by-centre interaction using a ‘Pushback’ procedure. Statistics in Medicine 1993; 12:1033–1045.

24. Chen YHJ, Liu GHF. Testing for crossover of two hazard functions using Gail and Simon’s method. Journal of Biopharmaceutical Statistics 2006; 16:313–326.

25. Chow SC, Shao J, Hu YP. Assessing sensitivity and similarity in bridging studies. Journal of Biopharmaceutical Statistics 2002; 12:385–400.

26. Lachenbruch PA, Rida W, Kou J. Lot consistency as an equivalence problem. Journal of Biopharmaceutical Statistics 2004; 14:275–290.

27. Quan H, Li M, Chen J et al. for the Consistency Workstream of the PhRMA MRCT Key Issue Team. Assessment of consistency of treatment effects in multi-regional clinical trials. Sanofi-Aventis Technical Report 034, September 2009.

28. Liu JP, Hsueh H, Chen J. Sample size requirements for evaluation of bridging evidence. Biometrical Journal 2002; 44:969–981.

29. FDA Medical Review of COZAAR™ Tablets (Losartan Potassium). NDA 20-386/SE1-028, March 2002. Available at http://www.fda.gov/ohrms/dockets/ac/02/briefing/3849b1_03_medicalreview.pdf.

30. Breslow NE, Day NE. Statistical methods in cancer research, volume 1: the analysis of case-control studies. International Agency for Research on Cancer: Lyon, 1980.

31. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Statistics in Medicine 2002; 21:2917–2930.
