response-adaptive randomization in clinical trials: …myths to practical considerations david s....

Response-adaptive randomization in clinical trials: from

myths to practical considerations

David S. Robertson1, Kim May Lee1, Boryana C. Lopez-Kolkovska1, and Sofıa S. Villar ∗1

1MRC Biostatistics Unit, University of Cambridge, Cambridge, UK

Abstract

Response-adaptive randomization (RAR) is part of a wider class of data-dependent sampling

algorithms, for which clinical trials have commonly been used as a motivating application. In that

context, patient allocation to treatments is defined using the accrued data on responses to alter

randomization probabilities, in order to achieve different experimental goals. RAR has received

abundant theoretical attention from the biostatistical literature since the 1930’s and has been the

subject of heated debates. Recently it has received renewed consideration from the applied com-

munity due to successful practical examples and its widespread use in machine learning. Many

position papers on the subject present a one-sided view on its use, which is of limited value for the

non-expert. This work aims to address this persistent gap by providing a critical, balanced and

updated review of methodological and practical issues to consider when debating the use of RAR

in clinical trials.

Keywords: Adaptive randomization; ethics; multi-armed bandits; patient allocation; sample size imbalance,

time trends, type I error control.

1 Introduction

Few topics in the biostatistical literature have been as heavily debated as response-adaptive random-

ization (RAR). The controversy about its value in clinical trials has persisted over the decades ever

since it was first proposed in the early 1930’s. This lack of consensus and the extreme opposing views

surrounding it does not reflect well upon the biostatistical community. A clinical colleague consider-

ing it for an experiment can easily encounter pieces of work from an utmost enthusiasts and a firm

opponent, both of whom can be respected, experienced and well-known biostatisticians.

∗Address correspondence to Sofıa S. Villar, MRC Biostatistics Unit, University of Cambridge, IPH Forvie Site,Robinson Way, Cambridge CB2 0SR, UK; E-mail: [email protected]

1

arX

iv:2

005.

0056

4v2

[st

at.M

E]

6 J

ul 2

020

These conflicting views are not only detrimental to the use of RAR in practice (which is perhaps

unfortunate given the potential advantages it can offer), but can also be perceived as very confusing

for those to whom RAR is relatively new. This situation has motivated us to write this paper with the

intention that it becomes a “must read” paper for those who, like ourselves, have found some of these

arguments to be partial or confusing. We also present a simulation study to compare some properties

of recently proposed response-adaptive procedures in the literature.

Our intention is that this paper stimulates deep and critical thinking about experimental goals in

challenging contexts, rather than merely following simplistic views on a complex subject. This paper

makes the case for the importance of careful thinking about how to deliver experimental goals through

the appropriate use of trial adaptations such as RAR.

1.1 Background

Randomization as a method to allocate patients to treatments in a clinical trial has long been con-

sidered a defining element of a well-conducted study, ensuring comparability of treatment groups,

mitigating selection bias, and additionally providing the basis for statistical inference (Rosenberger

and Lachin, 2016). In clinical trials practice, a fixed randomization probability through the trial (most

often an equal probability) is still the most popular randomization procedure in use. An alternative

mode of patient allocation is known as RAR, in which randomization probabilities are altered during

the trial based on the accrued data on responses, with the aim of achieving different experimental

objectives while ideally (though not necessarily) preserving inferential validity. Such experimental

objectives may include: early selection of a promising treatment among several candidates, increasing

the power of a specific treatment comparison and/or assigning more patients to an effective treatment

arm during a trial.

RAR (also known as outcome-adaptive randomization) has been a fertile area of methodological

research over the past three decades, with books and many papers in top statistical journals being

published on the subject (which becomes evident from the references section of this paper). Despite

this, the uptake of RAR in practice remains disproportionately slow in comparison with the theoretical

attention it has received, and it continues to stand as a controversial and highly debated issue within

the statistics community. These debates tend to intensify and multiply during health care crises

such as the Ebola outbreak (Brittain and Proschan, 2016; Berry, 2016) or the current COVID-19

pandemic (Proschan and Evans, 2020; Angus et al., 2020; Magaret, 2020). Unfortunately, such debates

are mostly geared towards presenting arguments to justify one-sided positions around the use of RAR

2

in clinical trials, and are usually highly technical which makes it challenging for a non-expert to

follow. None of these debates provide a fair, updated and balanced discussion over the merits and

disadvantages of using this wide adaptive design class. Such a well balanced discussion is very much

needed in practice for a non-expert to make a knowledgeable decision of its use for a specific experiment

and it is what we attempt to do in this piece of work.

Such conflicting and one-sided views, as published in the modern literature, are illustrated by the

quotations below and constitute one of the initial drivers and main motivations for writing this paper.

Outcome adaptive randomization has several undesirable properties. These include a high

probability of sample size imbalance in the wrong direction . . . it produces inferential

problems that decrease potential benefit to future patients, and may decrease benefit to

patients enrolled in the trial . . . For randomized comparative trials to obtain confirmatory

comparisons, designs with fixed randomization probabilities and group sequential decision

rules appear to be preferable to RAR, scientifically, and ethically. (Thall et al., 2015b)

. . . optimal response-adaptive randomization designs allow implementation of complex op-

timal allocations in multiple-objective clinical trials and provide valid tools to inference in

the end of the trial. In many instances they prove superior over traditional balanced ran-

domization designs in terms of both statistical efficiency and ethical criteria. (Rosenberger

et al., 2012)

Response adaptive randomization (RAR) is a noble attempt to increase the likelihood

that patients receive better performing treatments, but it causes numerous problems that

more than offset any potential benefits. We discourage the use of RAR in clinical trials.

(Proschan and Evans, 2020)

The lingering lack of consensus within the specialized literature is perhaps one of the most in-

fluential reasons to explain why the use of RAR procedures remains rare in clinical trials practice.

The one-sided positions on the use of these methods persist, despite recent methodological develop-

ments directly addressing past criticisms and providing guidance for selecting appropriate procedures

in practice. Specifically, over the last 10 years there have been additional theoretical advances which

– to the best of the authors’ knowledge – are currently not included in any published review of

response-adaptive methods.

In parallel to this, response-adaptive procedures have boomed in machine learning applications

(Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012; Kaufmann and Garivier, 2017; Kaibel and Biemann,

2019; Lattimore and Szepesvari, 2019) where the uptake and popularity of Bayesian RAR ideas (also

known as Thompson sampling) has been incredibly high. Their use in practice has been associated

with substantial gains in system performances. In the clinical trial community, a crucial development

3

has been the success of some well-known biomarker led trials, such as I-SPY 2 (Rugo et al., 2016; Park

et al., 2016) or BATTLE (Kim et al., 2011). The goal of these trials was to learn which subgroups (if

any) benefit from a therapy and to then change the randomization ratio to favour patient allocation in

that direction. These trials have set new precedents and expectations which, contrary to what ECMO

trials did to RAR in the 1980s (see the Section 2), are increasingly driving investigators to the use of

RAR to enhance the efficiency and ethics of their trial designs. These trials also show that practically

and logistically, the use of RAR is clearly feasible, at least in areas such as oncology.

Both in the machine learning literature and in the recent trials cited above, the methodology

used in these applications is a class of the larger family of adaptive methods (i.e. some of which are

response-adaptive but not randomized). However, most of the recent general criticisms and praise for

response-adaptive methods in clinical trials has been driven mainly by arguments that only apply to

a very specific subclass within these methods. This paper aims to contribute to the current discussion

by providing an updated and broad critical review, with a summary of some of our thoughts on

this debate. We also compare and discuss recently proposed RAR procedures with a new simulation

study that (to the best of our knowledge) is not found in the literature. We believe this work is

timely and highly needed to support those considering RAR as a potential defining element of clinical

trials to be run in the current COVID-19 pandemic, especially given the conflicting views in the

literature (Proschan and Evans, 2020; Magaret, 2020).

We start by providing an updated historical overview of RAR (Section 2) and continue by a sum-

mary of the classification and comparison of RAR procedures (Section 3). We then present popular

beliefs published in the literature about RAR and we critically discuss each of them (Section 4). Fi-

nally, we give a brief summary of our own opinions on the future of RAR related research and some

general considerations in Section 5.

2 A historical perspective on RAR

“Those who cannot learn from history are doomed to repeat it.” (Paraphrase of an apho-

rism by George Santayana)

We believe it is important to start our review by tracing and reporting the highlights of the historical

development of RAR. This history can be presented by splitting it into the tale of two very distinct ar-

eas: theory and practice. While a large amount of high quality theoretical work has accumulated over

the years, RAR in practice has been marked by only a few highly influential examples. In particular,

4

our view is that we should use both early controversies such as the infamous ECMO trial (see below)

and recent successes such as I-SPY 2 to learn when and how RAR might be used appropriately, rather

than dismissing or encouraging using any kind of RAR on the basis of a single example. Hence, in

this section, we give a high level historical overview of RAR in terms of key methodology and its use

in practice for clinical trials, with a timeline shown in Figure 1. We also include some more recent

developments, which have not been included in previously published historical overviews.

2.1 RAR methodology literature

The origins of response-adaptive procedures can be traced back to Thompson (1933), who first sug-

gested to allocate patients to the more effective treatment arm via a posterior probability computed

using interim data. This work motivated procedures (commonly known as Thompson sampling) which

are used to allocate resources in many modern application areas. Another early method was the play-

the-winner rule, proposed by Robbins (1952) and then Zelen (1969). In this non-randomized rule,

a success on one treatment leads to a subsequent patient being assigned to that treatment, while a

failure leads to a subsequent patient being assigned to the other treatment.

RAR also has roots in the methodology for sequential stopping problems (where the sample size

is random, but with a stopping boundary), as well as bandit problems (where resources are allocated

to maximize the expected reward). Since most of the work in these areas have been non-randomized

(i.e. deterministic), we do not describe their development here. For the interested reader, Rosenberger

and Lachin (2016, Section 10.2) gives a brief summary of the history of both of these areas, and an

overview of multi-arm bandit models is presented in the review paper of Villar et al. (2015a). For a

review of non-randomized algorithms for the two-arm bandit problem, see Jacko (2019).

One of the first explicit uses of randomization in a response-adaptive treatment allocation pro-

cedure was the randomized play-the-winner (RPW) rule proposed by Wei (1978). The RPW rule

can be described as an urn model : each treatment assignment is made by drawing a ball from an

urn (with replacement), where the composition of the urn is updated based on the patient responses.

In the following decades, many RAR designs based on urn models were proposed, with a particular

focus on generalizing the RPW rule. We refer the reader to Hu and Rosenberger (2006, Chapter 4)

and Rosenberger and Lachin (2016, Section 10.5) for a detailed description. One example was the urn

model proposed by Ivanova (2003), called the drop-the-loser rule.

These urn-based RAR procedures are intuitive, but are not optimal in a formal sense. However,

from the early 2000s another perspective on RAR emerged that was based on optimal allocation

5

targets, which are derived as a solution to a formal optimization problem. For two-arm trials, a

general optimization approach was proposed by Jennison and Turnbull (2000) for normally-distributed

outcomes, and helped lead to the development of a whole class of optimal RAR designs. One early

example is the work of Rosenberger et al. (2001) for trials with binary outcomes. Further examples of

optimal RAR designs can be found in Section 4.2. In order to achieve the desired optimal allocation

targets, a key development was the modification by Hu and Zhang (2004) of the double adaptive

biased coin design (DBCD), which was originally described by Eisele (1994). Subsequent theoretical

work by Hu and Rosenberger (2006) focused on asymptotically best RAR procedures, which led to

the development of the class of efficient response adaptive randomization designs (ERADE) proposed

by Hu et al. (2009).

All of the RAR procedures described above are myopic, in the sense that they only use past ob-

servations to determine the treatment allocation for the next patient, without considering the future

patients to be treated and the information they could provide. A more recent methodological devel-

opment has been the proposal of non-myopic or forward-looking RAR procedures, which are based

on solutions to the multi-bandit problem. The first such fully randomized procedure was proposed

by Villar et al. (2015b) for trials with binary responses, with subsequent work by Williamson et al.

(2017) accounting for an explicit finite time-horizon. Even more recently, forward-looking RAR pro-

cedures have been proposed for trials with normally-distributed outcomes as well (Williamson and

Villar, 2020).

2.2 RAR in clinical practice

One of the earliest uses of RAR in clinical practice was the Extracorporeal Circulation in Neonatal

Respiratory Failure (ECMO) trial, performed in Michigan by Bartlett (1985). This trial used the RPW

rule on a study of critically ill babies randomized either to ECMO or to the conventional treatment. In

total, 12 patients were observed: 1 in the control group, who died, and 11 in the ECMO group, who all

survived. This extreme imbalance in sample sizes was a motivation for running a second randomized

ECMO trial, carried out in Harvard by Ware (1989), where patients were assigned treatments using

fixed randomization but with early stopping for futility. The ECMO trials have been the focus of much

debate. Up to date, the two papers above have accrued over 1000 citations. Indeed, to this day the

first ECMO trial is regarded as a key example against the use of RAR in clinical practice, due to the

extreme treatment imbalance and highly controversial interpretation (Rosenberger and Lachin, 1993;

Burton et al., 1997). In large part due to the controversy around the ECMO trials, there was little use

6

of RAR in clinical trials for the subsequent 20 years. One exception was the Fluoxetine trial (Tamura

et al., 1994), which again used the RPW rule, but with a burn-in period to ensure that there would

not be too few controls. However, more recently there have been several high-profile clinical trials

that use Bayesian RAR, in the spirit of Thompson (1933).

Two important examples in oncology are the BATTLE trials and the I-SPY 2 trial. The BATTLE

trials (Zhou, 2008; Kim et al., 2011; Papadimitrakopoulou, 2016) used RAR based on a Bayesian hier-

archical model, where the randomization probabilities are proportional to the observed efficacy based

on the patient’s individual biomarker profiles. Similarly, the I-SPY 2 trial (Barker, 2009; Carey, 2016;

Rugo et al., 2016; Park et al., 2016) used RAR based on Bayesian posterior probabilities, which are

specific to different biomarker signatures. These oncology trials have generated valuable discussions

about the benefits and drawbacks in using RAR in clinical trials (Das, 2017; Korn, 2017; Marchenko,

2014; Siu, 2017). We discuss some of these further in Section 4.

1933

Thompsonsampling

1978

RPW rule

1985

ECMO trial

1994

Fluoxetinetrial

2000

J&Toptimizationapproach

2001

RSIHR paper

2004

DBCD

2008

BATTLE trial

2009

ERADE

2010

I-SPY 2 trial

2015

Non-myopicRAR

Figure 1: Timeline summarizing some of the key developments around the use of RAR in clinicaltrials. J&T = Jennison and Turnbull (2000), RSIHR = Rosenberger et al. (2001).

3 Classifying and comparing RAR procedures

As noted in the historical perspective provided in Section 2, the adoption of the RPW rule in the

ECMO trial has been used as a key example against the use of all RAR procedures in clinical trials.

In reality, RPW is just one example of a RAR procedure out of many other possible ones for a

clinical trial. Many papers have overlooked this fact when criticizing the use of RAR in general using

arguments that apply to a specific RAR procedure and hence, perhaps unintentionally, depreciated the

value of other RAR procedures that are markedly different from those criticized in that work (Villar et

al., 2020). More generally, the literature reviewed for writing this paper provided abundant examples

where, like with RPW and the ECMO trial, broad conclusions for the use of RAR were made despite

only being based on a subset of possible procedures.

Our aim for this section is to help correct for this imbalance by presenting several families of RAR

procedures, discussing how they fit different classification criteria and describing possible meaningful

7

ways to assess RAR procedures in the literature. We hope this section will provide a basis for readers

to navigate the many existing approaches and compare them when considering their use for a specific

application at hand. However, we must warn the reader that given the wealth and breadth of existing

RAR procedures, attempting to provide classification criteria that are exhaustive and completely dif-

ferentiates them all may well be an elusive goal. As will become apparent below, even if we could do

such a thing, it could be very quickly become obsolete again given the current pace of development in

the area (Villar et al., 2020).

3.1 Taxonomies of RAR

The vast number of different RAR procedures is a challenge that non-experts and experts alike may

face with when exploring the RAR literature. The specialized literature has accumulated over the

years using differing criteria and jargon to classify and describe RAR procedures. As well, the many

new developments over the past two decades has meant that even experts in the field can find it

difficult to continually keep up to date with the literature. All this can become confusing and lead to

over-generalizations very easily.

At the broadest level, adaptive randomization procedures have been classified by Hu and Rosen-

berger (2006) as follows: response-adaptive randomization (RAR), covariate-adaptive randomization

(CAR), and covariate-adjusted response-adaptive randomization (CARA). The data used to deter-

mine the allocation probabilities are reflected in the names of the procedures, i.e. using the accrued

information of the response variable and/or covariate(s), in light of the objectives of using a particular

randomization procedure. In this review, we focus specifically on RAR procedures. Of course, many

of the issues we subsequently discuss for RAR are applicable to some degree to CARA, but this is

beyond the scope of this paper. For a in-depth discussion of the use of covariates in randomization,

we refer the reader to the review paper by Rosenberger and Sverdlov (2008).

Optimal and design-driven RAR

For the class of RAR procedures, a classification into two families has previously been proposed

by Rosenberger and Lachin (2002); Hu and Zhang (2004):

(1) response adaptive randomization which is based on an optimal allocation target, where a specific

criterion is optimized based on a population response model.

E.g. The ‘optimal’ allocation of Rosenberger et al. (2001) for binary responses. A formal opti-

8

mization problem is defined based on the population response model and the inference at the

end of the trial: the power of the trial (using a Z-test for the difference in proportions) is fixed

and the expected number of treatment failures is minimized (see Section 4.2 for further details).

(2) design-driven response adaptive randomization, where rules are established with intuitive moti-

vation, but are not optimal in a formal sense.

E.g. The RPW rule for binary responses. The rules for computing and choosing the allocation

probability can be formulated using an intuitive urn-based model (see Wei (1978) for full details).

A key difference between the two families is in the characteristics of the criterion or rules that form

the basis of the computation of allocation probabilities. More specifically, approaches belonging to

family (1) rely on optimizing an objective function that describes aspects of the population response

model explicitly, whereas those belonging to family (2) rely on intuitive motivation that is not defined

analytically from the population response model.

However, while the classification above may have been suitable when it was first proposed, it is

arguably overly simplistic and somewhat obsolete given the methodological developments in the RAR

literature since then. There are many RAR procedures that do not fit neatly into either family. One

difficulty is around the use of ‘optimal’ and ‘optimized’ in the definitions of the two families, as we

now discuss.

Starting first with bandit-based designs, these are based on an optimization approach but do not

explicitly target a pre-specified optimal allocation ratio like in family (1). Rather, they seek to find

the best possible allocation to maximise the expected total reward (i.e. number of successes), which is

unknown beforehand. Considering the forward-looking Gittins index (FLGI) rule as an example (Villar

et al., 2015b), this trades off a small deviation in the objective function to give a fully randomized

RAR procedure, which further complicates the classification of this rule. Similarly, the bandit-based

design proposed by Williamson et al. (2017) is optimal in terms of expected number of successes,

subject to constraints.

Another concept is that of asymptotic optimality, of which there are several notions. An example

where this is relevant is the previously-mentioned Thompson sampling (see also Section 4.1). Rela-

tively recently, this was shown to be asymptotically optimal in terms of minimizing cumulative regret

(see e.g. Kaufmann et al. (2012)). In finite samples though, arguably Thompson sampling (and its

9

generalization proposed by Thall and Wathen (2007)) is closer to having an intuitive motivation based

around assigning more patients to the better arm, as calculated using the posterior distributions of

the success probabilities.

The discussion above demonstrates that the broad-brush classification of RAR procedures into

‘optimal’ vs ‘design-driven’ or ‘intuitive’ requires further clarification around what is meant by these

terms. Additionally, these subtle differences within one class may come with substantial differences

in the resulting operating characteristics in an identical trial context (see Section 4.1 for example).

An alternative classification scheme would be to describe RAR procedures in terms of the broad

methodological ‘families’ they belong to, e.g. urn models, bandit-based designs, procedures based on

Thompson sampling, or procedures that target a pre-specified (optimal) allocation ratio. However,

this classification is by no means watertight, as RAR procedures could conceivably belong to more

than one family, and new types of RAR are continuously being developed.

Bayesian and frequentist procedures

Another way of classifying RAR procedures is based on the school of statistics, either frequentist

or Bayesian, that is used for its design (see below for discussion about analysis). In general, a RAR

procedure depends on unknown parameter(s), which affects either the construction of the optimiza-

tion problem or the computation of the allocation probabilities. When this problem arises, a Bayes

approach could be employed at the design stage of the trials. We suggest the following definitions:

A randomization procedure can be classified as Bayesian when a prior distribution is explicitly

incorporated into the design criteria/optimization problem and/or into the calculation of the allocation

probability

E.g. the optimal RAR design proposed by Cheng and Berry (2007) is based on a Bayesian

decision-analytic approach; Sabo (2014) propose the use of a decreasingly informative

priors for Bayesian RAR.

Meanwhile, a randomization procedure can be classified as frequentist when no prior distribution is

incorporated into the design problem/ optimization problem and/or a frequentist approach is used for

estimating the unknown parameter(s) in the allocation probability.

E.g. the generalized biased coin design by Smith (1984) and Neyman allocation (see Sec-

tion 4.2).

This classification into Bayesian and frequentist designs does not completely differentiate RAR pro-

cedures. In some cases a frequentist approach (using the above definition) corresponds to a Bayesian

10

approach with a specific prior distribution, e.g. the RPW rule (Atkinson and Biswas, 2014, pg. 271).

This is analogous to the situation where the posterior mode coincides with the maximum likelihood

estimator (MLE) when a uniform prior probability is used.

Alternatively, RAR procedures could be defined as frequentist or Bayesian with respect to the

inference procedure used. In our opinion, this approach may not be helpful for someone who is not

familiar with the literature on RAR, since the inference procedure greatly depends on the goal of the

trial and is very much influenced by regulators’ preference between these two approaches. Moreover,

some innovative trial approaches can have Bayesian design aspects but the inference procedure focuses

on the frequentist operating characteristics. This does not add value for classifying RAR procedures

since the objective(s) of RAR hardly cover all aspects of the inference. Readers interested in un-

derstanding the pros and cons of frequentist and Bayesian inference are referred to materials such as

Press (2005); Wagenmakers et al. (2008); Samaniego (2010) as this falls outside the scope of our review.

Objectives of RAR procedures

Finally, an important aspect in which RAR procedures differ is the goal they are designed to

achieve. While some can consider competing objectives such as both efficiency and patient benefit (see

Section 3.2 for definitions), others might prioritize one over the other. Additionally, some procedures

might be non-myopic (i.e. accounting for future patients) while others might be myopic. We also note

that for some RAR procedures, such as the those targeting an optimal allocation, the optimization

problem can account for multiple objectives (see e.g. Hu et al. (2015)). There are many possible

goals that a RAR procedure can aim to achieve, which may also be constrained by e.g. computational

considerations.

We encourage the reader to consider what the different classifications of RAR procedures might

mean in the context of their specific trial. In the next section, we discuss various ways of assessing

the performance of RAR procedures.

3.2 Assessing the performance of RAR procedures

Another area where there has been a lack of consistency and clarity is around how RAR procedures are

assessed. Most metrics reported in the literature seem to be focusing on a specific goal, and there has

been a myriad of ways of describing the properties of a RAR design. This section reviews some of the

most commonly used options for evaluating RAR procedures. Rather than providing an exhaustive

list of all the possible metrics available for comparing variants of RAR, we debate their usefulness and

11

attempt to provide a basis for evaluating them.

An additional reason why we believe this debate is useful is that when comparing two published

results using RAR, an important caveat is that different metrics could have been used and this needs

to be factored in the conclusions of the comparison. This leads us to emphasize that it is extremely

important to report not only the details of the procedures used and how they have been defined, but

also to carefully describe the specification of the metrics used in their evaluation to allow for adequate

comparisons to be made. We also encourage investigators to emphasize when their findings may not

apply to all the different RAR families. This should reduce the chances of readers misunderstanding

the scope of the pros and cons for a family of RAR procedures.

One key example is the use of ‘efficiency’ and ‘patient benefit’ when describing RAR procedures.

Efficiency can correspond to the power of a trial (see below), but can also correspond to the precision

of the estimates for the treatment effectiveness (Flournoy et al., 2013). Similarly, patient benefit can

mean assigning patients to a more effective treatment arm based on the accrued data, or it can also

include the benefit for patients who are outside the trial (but who could potentially benefit from the

trial results). These are important caveats that can have a large impact on the choice of a RAR

procedure.

In Section 4, we aim to identify and clarify some frequent misconceptions around RAR. In partic-

ular, when addressing them we shall use the following most commonly reported metrics: power, type I

error, and bias. In the RAR literature, we find that ‘power’ is often perceived as a frequentist prop-

erty, i.e. the probability of rejecting a null hypothesis when the true parameter follows the alternative

distribution. Similarly, the ‘type I error’ is defined as the probability of rejecting a null hypothesis

when the true parameter follows the null distribution. Moving away from the frequentist definitions,

some authors defined power (respectively type I error) as the probability of satisfying a criterion that

reflects the goal of treatment comparisons. Typically, this is found using simulation studies under the

alternative (null) scenarios.

There are additional metrics that could be reported and have been nevertheless more infrequently

used. For example, in multi-arm trial settings, power can reflect the goal of a trial, such as selecting

the best experimental treatment or declaring at least one experimental treatment effective; with the

false positive rate or familywise error rate as the generalization of type I error. The ‘power’ of a

multi-armed trial can have multiple definitions, and needs to be clearly stated when reporting results.

For instance, pairwise power, marginal power, experiment-wise power and disjunctive power are all

used as definitions of ‘power’. However, it is possible for a RAR procedure to have a high power

according to one definition but not according to another.

12

On the other hand, ‘bias’ is often defined as a property of an estimator which reflects how different

an estimate is from the true underlying parameter. An estimate may be biased due to the rule used for

calculating the estimand, or due to the heterogeneity in the observed data. Heterogeneity corresponds

to the situation where the data may not come from the same underlying distribution. A key example

is the presence of time trends, which can cause a different treatment effect for patients who were

enrolled at different time points during the trial. We return to this issue in Section 4.4.

Apart from the properties described above, some authors report the following when presenting

their simulation results:

• the expected number of treatment failures/successes (for binary outcomes) or the expected total

response (for continuous outcomes) in the trial

• the probability of selecting a truly effective experimental arm

• summary statistics of the sample size per treatment arm, with a particular focus on the allocation

to a superior arm (where this exists)

• the probability of a treatment arm stopping early for futility or for efficacy

• the probability of sample size imbalance (see Section 4.1).

Summary

Unfortunately, there is no perfect RAR procedure, in the sense that it will be superior to any other

in terms of all of these operating characteristics and in any context. This fact makes the need for a

careful consideration when deciding which randomization approach is best suited for a specific clinical

trial (and its goals) even more important. A RAR procedure should be chosen carefully according to

the specific context and specific goals of a trial, in light of the practical challenges and constraints that

implementing RAR poses. We will discuss some practical issues when implementing a RAR procedure

in Section 4.5.

4 Popular beliefs about RAR

In this section, we critically examine a number of popular beliefs that have been published on RAR.

Some of these beliefs can be rationally justified in particular scenarios, but they most certainly do

not apply to all types of RAR procedures and/or all trial settings. Our aim is to reexamine some of

13

these statements to provide a more balanced view of the use of RAR procedures, which acknowledges

potential problems and disadvantages, but also emphasizes the potential solutions and advantages.

We also perform a new simulation study in Section 4.1 that compares and discusses some properties

of more recently proposed RAR procedures. By avoiding extreme views and generalizations (in either

direction), we hope to provide some clarity around the use of RAR in practice and avoid the creation

of myths around it.

In what follows, we tend to focus on specific examples of RAR procedures in order to illustrate

that the popular beliefs on RAR do not always hold in general. As already alluded to, the examples

used below are by no means presented with the intention of being seen as the ‘best’ RAR procedures,

or even necessarily recommended for use in practice – such judgments critically depend on the context

and goals of the specific trial under consideration. Apart from Section 4.1, we do not carry out new

simulations for each of the popular beliefs presented below, but instead point the reader to the many

existing simulation studies in the literature.

4.1 Does RAR lead to a substantial chance of allocating more patients to an

inferior treatment?

Thall et al. (2015a) described a number of undesirable properties of RAR, including the following:

. . . there may be a surprisingly high probability of a sample size imbalance in the wrong

direction, with a much larger number of patients assigned to the inferior treatment arm,

so that AR has an effect that is the opposite of what was intended.

Thall et al. (2015a) illustrated this through simulation studies of two-arm trials. They showed that

Thompson sampling can have a substantial chance (up to 14% for the parameter values considered)

of producing sample size imbalances of more than 20 patients in the wrong direction (i.e., the inferior

arm) out of a maximum sample size of 200. While this result is true for a single RAR procedure,

these conclusions may not hold in general for other types of RAR. Indeed, Thall and Wathen were

among the first to compute this metric of sample size imbalance. Much of the RAR literature has not

reported this metric (or related ones), and therefore it is unclear how other families of RAR procedures

perform with regard to sample size imbalance.

To help fill this gap in the literature, we perform a simulation study looking at a broad range of RAR

procedures in the simple two-arm trial setting with a binary outcome. Let pA and pB denote the true

probabilities of a successful outcome for patients on treatments A and B respectively. Similarly, let NA

14

and NB denote the total (random) number of patients assigned to each treatment, with n = NA +NB

denoting the total sample size (which is fixed). We compare the following patient allocation procedures:

• Fixed randomization: uses an equal, fixed probability to allocate patients to each treatment, i.e.

50% in a two-arm trial.

• Permuted block randomization [PBR] : patients are randomized in blocks to the two treatments,

so that exact balance is achieved at the completion of each block (and hence at the successful

completion of the trial).

• Thall and Wathen procedure [TW(c)] : given the data observed so far, the Thall and Wathen

procedure randomizes the next patient to treatment B with probability

rB =[P (B > A)]c

[P (B > A)]c + [1− P (B > A)]c

Here P (B > A) is the posterior probability that treatment B is better than treatment A.

The parameter c controls the variability of the resulting procedure. Setting c = 0 gives fixed

randomization, while setting c = 1 gives Thompson sampling. Thall and Wathen (2007) suggest

setting c equal to 1/2 or m/(2n), where m is the current sample size and n is the total (or

maximum) sample size of the trial.

• Randomized Play-the-Winner Rule (PRW): an urn-based design, where each patient allocation

is determined by drawing a ball from an urn (with replacement), and the composition of the urn

is updated based on patient responses. See Wei (1978) for full details.

• Drop-The-Loser rule (DTL): an urn model proposed by Ivanova (2003), which is a generalization

of the RPW rule.

• Doubly-Biased Coin Design (DBCD): a response-adaptive procedure targeting the optimal allo-

cation ratio of Rosenberger et al. (2001) (see Section 4.2). For full details, see Hu and Zhang

(2004).

• Efficient Response Adaptive Randomization Designs (ERADE): a response-adaptive procedure

targeting the optimal allocation ratio of Rosenberger et al. (2001), which attains the lower-bound

on the allocation variances. See Hu et al. (2009) for further details.

• Gittins Index (GI): gives the optimal assignment rule that maximizes the expected total (dis-

counted) number of successes (i.e. a two-arm bandit problem) in an infinite horizon. Note this is

15

a non-randomized (i.e. deterministic) rule. For further details, see e.g. the review paper by Villar

et al. (2015a).

• Forward-looking Gittins Index (FLGI): a modification of the Gittins index proposed in Villar

et al. (2015b), that gives a fully randomized RAR procedure with near-optimal patient benefit

properties. This depends on a block size b for randomization.

• Oracle: hypothethical rule that assigns all patients to the best-performing arm, requiring knowl-

edge of the true response probabilities.

In our simulations, we set pA = 0.25 and vary the values of pB and the total sample size n. Unlike

in Thall et al. (2015a), we do not include early stopping in order to isolate the effects of using RAR

procedures. We evaluate the performance of the procedures in terms of the mean (2.5 percentile, 97.5

percentile) of the sample size imbalance NB−NA; the probability of a imbalance of more than 10% of

the total sample size in the wrong direction π0.1 = Pr(NA > NB + 0.1n); and the expected number of

successes (ENS) and its standard deviation. Note that our measure of π0.1 coincides with the measure

used in Thall et al. (2015a) when n = 200.

Table 1 shows the results when pB = 0.35 and n ∈ {200, 654}. Starting with n = 200, we see that

Thompson sampling has a substantial probability (almost 14%) of a large imbalance in the wrong

direction, while using the Thall and Wathen procedure reduces this probability, which all agrees with

the results of Thall et al. (2015a). Unsurprisingly, the aggressive bandit-based procedures (GI and

FLGI) also have relatively large values of π0.1, although these are still less than that for Thompson

sampling. Note that even fixed randomization has about a 7% chance of a large imbalance in the

wrong direction, which arguably is a fairer baseline comparison for the procedures mentioned above.

In contrast, the RPW, DBCD, ERADE and DTL procedures all have negligible values of π0.1 of 1%

or less, which is also reflected in the confidence intervals for the sample size imbalance.

An important factor is the choice of the total sample size. Setting n = 200 means that the trial

has low power to declare treatment B superior to treatment A. Indeed, if the sample size is chosen so

that fixed randomization yields a power of greater than 80% (when using the standard Z-test), then

n needs to be at least 654. When n = 654, Table 1 shows that the values of π0.1 are substantially

reduced for Thompson sampling, the Thall and Wathen procedure and the bandit-based procedures.

Looking at the confidence intervals for sample size imbalance, we see that Thompson sampling and

the bandit-based procedures still have a small risk of getting ‘stuck’ on the wrong treatment arm.

Another important factor is the magnitude of the difference between pA and pB. The scenario

16

Table 1: Properties of various patient allocation procedures, where pA = 0.25 and pB = 0.35. Resultsare from 104 trial replicates.

n Procedure NB −NA π0.1 ENS

200 Fixed 0 (-28, 28) 0.069 60 (6.4)

PBR 0 0 60 (6.4)

Oracle 200 0 70 (6.7)

Thompson 95 (-182, 190) 0.137 65 (8.5)

GI 115 (-182, 190) 0.119 66 (8.5)

FLGI(b = 5) 114 (-176, 190) 0.111 66 (8.3)

FLGI(b = 10) 115 (-172, 190) 0.100 66 (8.2)

TW(1/2) 74 (-90, 174) 0.085 64 (7.5)

TW(m/2n) 49 (-20, 120) 0.037 64 (7.3)

RPW 14 (-16, 44) 0.011 61 (6.5)

DBCD 17 (-10, 46) 0.003 61 (6.4)

ERADE 16 (-6, 42) 0.000 61 (6.4)

DTL 14 (-4, 32) 0.000 61 (6.6)

654 Fixed 0 (-50, 50) 0.005 196 (11.7)

PBR 0 0 196 (11.6)

Oracle 654 0 229 (12.2)

Thompson 461 (-356, 640) 0.042 220 (17.0)

GI 497 (-630, 645) 0.067 222 (19.6)

FLGI(b = 5) 511 (-619, 645) 0.054 222 (18.5)

FLGI(b = 10) 511 (-617, 645) 0.051 222 (18.0)

TW(1/2) 384 (44, 594) 0.011 215 (14.2)

TW(m/2n) 272 (54, 456) 0.010 216 (14.1)

RPW 46 (-8, 100) 0.000 199 (11.8)

DBCD 55 (8, 106) 0.000 199 (11.8)

ERADE 54 (16, 96) 0.000 199 (11.7)

DTL 46 (14, 80) 0.000 198 (11.7)

considered above with pA = 0.25 and pB = 0.35 is a relatively small difference (as shown by the

large sample size required to achieve a power of 80%), and the more aggressive rules would not be

expected to perform well in terms of sample size imbalance in this case. Table 2 shows the results

when pB = 0.45 and n = 200. The values of π0.1 are substantially reduced for Thompson sampling

as well as the Thall and Wathen and bandit-based procedures, and are now much less than for fixed

17

randomization. In terms of the mean and confidence intervals for NB −NA, these are now especially

appealing for FLGI.

Table 2: Properties of various patient allocation procedures, where pA = 0.25 and pB = 0.45. Resultsare from 104 trial replicates.

n Procedure NB −NA π0.1 ENS

200 Fixed 0 (-28, 28) 0.069 70 (6.8)

PBR 0 0 70 (6.6)

Oracle 200 0 90 (7.0)

Thompson 147 (-54, 194) 0.032 85 (9.5)

GI 166 (18, 194) 0.020 87 (8.8)

FLGI(b = 5) 165 (48, 194) 0.014 86 (8.6)

FLGI(b = 10) 164 (56, 192) 0.014 86 (8.6)

TW(1/2) 124 (24, 182) 0.008 82 (8.2)

TW(m/2n) 123 (22, 182) 0.008 82 (8.3)

RPW 30 (-2, 64) 0.000 73 (7.1)

DBCD 30 (6, 58) 0.000 73 (6.6)

ERADE 29 (8, 52) 0.000 73 (6.6)

DTL 30 (10, 50) 0.000 73 (7.1)

Figure 2 extends the above analysis by considering the value of π0.1 for a range of values of pB from

0.25 to 0.85. We see that for pB greater than about 0.4, the probability of a substantial imbalance in

the wrong direction is higher for fixed randomization than all of the other RAR procedures considered.

Figure 2 also demonstrates another weakness of using π0.1 as a performance measure. This probability

of imbalance increases for the RAR procedures considered as the difference pB − pA decreases, but as

this difference decreases, so does the consequence for assigning patients to the wrong treatment arm.

Indeed, looking at Table 1 we can see some trade-offs between sample size imbalance (as measured by

NB −NA and π0.1) and the ENS. The most aggressive RAR procedures (Thompson and FLGI) have

the highest ENS, which are in fact close to the highest possible ENS (the ‘Oracle’ procedure). However,

these procedures also perform the worst in terms of sample size imbalance. This demonstrates our

general point that careful consideration is needed by looking at a variety of performance measures,

and not only focusing on a single measure such as π0.1.

Finally, we note in passing that if sample size imbalance is of particular concern in a trial con-

text, there may be ways of including constraints to avoid imbalance in a similar way to the DBCD

and ERADE procedures, or indeed permuted block randomization procedures. One way of doing so

18

Figure 2: Plot of π0.1 for various RAR procedures as a function of pB, where pA = 0.25 and n = 200.Each data point is the mean of 104 trial replicates.

for bandit-based designs (for example) is to use the constrained optimization approach as described

in Williamson et al. (2017). Recently Lee and Lee (2020) have also proposed an adaptive clip method

that can be used in conjunction with the Thompson sampling based approach to minimize the chance

of assigning patients to inferior arms during the early part of a trial.

Summary

In summary, RAR procedures do not necessarily have a high probability of a substantial sample

size imbalance in the wrong direction, particularly when compared with using fixed randomization.

This probability crucially depends on the true parameter values, as well as the sample size of the trial.

As well, potential sample size imbalances need to be carefully evaluated in light of other performance

measures, such as patient benefit properties.

4.2 Does the use of RAR reduce statistical power?

One of the most popular belief about RAR procedures is that their use can reduce statistical power,

as stated in Thall et al. (2015b):

19

Compared with an equally randomized design, outcome AR . . . [has] smaller power to

detect treatment differences.

Similar statements can be found in Korn and Freidlin (2011a) and Thall et al. (2015a). Through

simulation studies, these papers show that a fixed randomized design can have a higher power than

one using a particular RAR procedure (see below) when the sample sizes are the same, or equivalently

that a larger sample size is required for RAR to achieve the same target power and type I error rate

as a fixed randomized design.

A common feature of these papers is that they only consider the Bayesian RAR procedure proposed

by Thall and Wathen (2007) (see Section 4.1 for a formal definition), which includes Thompson

sampling as a special case. The Thall and Wathen procedure is well-known, and is sometimes referred

to as ‘Bayesian Adaptive Randomization’ (BAR) without qualification. However, although the Thall

and Wathen procedure attempts to assign more patients to the better treatment while preserving

power, this is only established in an intuitive way. Hence, it should not be a surprise that there are trial

designs where using the procedure results in a lower power compared with using fixed randomization.

However, extending this conclusion to RAR in general is an over-generalization.

In what follows, we focus solely on power considerations, but it is important to note that in prac-

tice, maximizing power might not be the only trial objective. Also, for now we assume the use of

standard inferential tests to make power comparisons, which we return to in Section 4.3.

Two-arm trials

As discussed in Section 3.1, there are RAR procedures that formally target optimality criterion

reflecting the trial’s objectives including power. We again consider the two-arm trial setting with

a binary outcome, as described in Rosenberger and Hu (2004). The true success probabilities are

denoted pA and pB (with corresponding total number of patients NA and NB), and we define qA =

1− pA and qB = 1− pB. We assume that the usual Z-test for the difference in proportions is used.

One strategy is to fix the power of the trial and find (NA, NB) to minimize the total sample size n,

or equivalently to fix the total sample size n and find (NA, NB) to maximize the power. This gives

the following allocation ratio:

NA

n=

√pAqA√

pAqA +√pBqB

, NB = n−NA

which is known as Neyman allocation. The result illustrates that in general, it is not true that equal

allocation in a trial maximizes the power for a given sample size. This is a popular belief that appears

20

without qualification in many papers, such as the following published in the BMJ:

Most randomized trials allocate equal numbers of patients to experimental and control

groups. This is the most statistically efficient randomization ratio as it maximizes statis-

tical power for a given total sample size. (Torgerson and Campbell, 2000)

Such a statement is only true in specific settings, such as when comparing the difference in means of

two normally-distributed outcomes with the same known variance.

An issue with Neyman allocation is that if pA + pB > 1, then more patients will be assigned to the

treatment with the smaller success probability. This is clearly an ethical problem, and again highlights

the potential trade-off between power and patient benefit. Rosenberger et al. (2001) resolve this by

modifying the optimization problem. The solution gives an ‘optimal’ allocation that minimizes the

expected number of treatment failures (ENF) given a fixed power, or equivalently fixes the ENF and

maximizes the power. If the response probability of the treatment arms were known, then using this

optimal allocation ratio would therefore guarantee that (on average) the power of the trial is preserved.

For binomial outcomes (as well as survival outcomes), the model parameters in the optimization

problem are unknown and need to be estimated from the accrued data. These estimates can then

be used (for example) in the doubly-adaptive biased coin design (DBCD) (Hu and Zhang, 2004),

or the efficient response adaptive randomization designs (ERADE) (Hu et al., 2009) to target the

optimal allocation ratio. Using the DBCD in this manner, Rosenberger and Hu (2004) found in their

simulation studies that it was

. . . as powerful or slightly more powerful than complete randomization in every case and

expected treatment failures were always less

This is consistent with a general set of guidelines given by Hu and Rosenberger (2006) on which RAR

procedures should be used in a clinical trial, one of which is that power should be preserved. When

following these guidelines, Rosenberger et al. (2012) states that

Response-adaptive randomization should ensure that the expected number of treatment

failures is reduced over standard randomization procedures, and that power should be

slightly enhanced or maintained.

RAR procedures that achieve this aim have also been derived (in a similar spirit to the optimal

allocation above) for continuous (Zhang and Rosenberger, 2006) and survival (Zhang and Rosenberger,

2007) outcomes.

In summary, targeting the Neyman allocation leads to a higher power than fixed randomization

for binary and survival outcomes, which is implemented in practice using a sequential RAR approach

21

such as the DBCD or ERADE. Targeting the optimal allocation leads to the same (or slightly greater)

power than fixed randomization, but is potentially more ethically attractive than targeting the Ney-

man allocation. In both cases, there is no power loss when compared with fixed randomization even

if within the two-armed trial scenario.

Multi-arm trials

There are similar concerns about the reduction in power of multi-arm RAR procedures. For

example, Wathen and Thall (2017) simulate a variety of five-arm trial scenarios and conclude

In multi-arm trials, compared to equal randomization, several commonly used adaptive

randomization methods give much lower probability of selecting superior treatments.

Similarly, Korn and Freidlin (2011b) simulate a four-arm trial and find that a larger average sample

size is required when using a RAR procedure compared with fixed 1:1:1:1 randomization in order to

achieve the same pairwise power. Lee et al. (2012) reaches similar conclusions in the three-arm setting

when considering disjunctive power. However, again these papers only explore generalizations of the

Thall and Wathen procedure (the “commonly used adaptive randomization methods” quoted above)

for multi-arm trials, and these conclusions may not hold for other types of RAR procedures.

The optimal allocation described above for the two-arm setting can be generalized for multi-arm

trials, under the null hypothesis of homogeneity. The allocation is optimal in that it fixes the power of

the test of homogeneity and minimizes the ENF. This was first derived by Tymofyeyev et al. (2007),

who showed through simulation that for three treatment arms, using the DBCD to target the optimal

allocation

. . . provides increases in power along the lines of 2–4% [in absolute terms]. The increase in

power contradicts the conclusions of other authors who have explored other randomization

procedures [for two-arm trials]

Similar conclusions (for three treatment arms) are given by Jeon and Hu (2010), Sverdlov and Rosen-

berger (2013a) and Bello and Sabo (2016).

These optimal allocation procedures maintain (or increase) the power of the test of homogeneity,

but may have low marginal powers compared with equal randomization in some scenarios, as shown

in Villar et al. (2015b). However, even considering the marginal power to reject the null hypothesis

for the best treatment, Villar et al. (2015b) propose non-myopic RAR procedures (i.e. FLGI rules)

that in some scenarios have both a higher marginal power and a higher expected number of treatment

successes when compared with fixed randomization with the same sample size.

22

Finally, many of the power comparisons made throughout this section have been against fixed

randomization. Arguably a more interesting comparison would be to consider group-sequential (GS)

and multi-arm multi-stage (MAMS) designs with a fixed allocation for each treatment in each stage.

It is still unclear as to how the power of different RAR procedures compare with well-chosen GS

and MAMS designs. Although only focusing on RAR based on the Thall and Wathen procedure,

both Wason and Trippa (2014) and Lin and Bunn (2017) show that these RAR procedures can have

a higher power than MAMS designs when there is a single effective treatment. A recent publication

by Viele et al. (2020) also shows that in a multi-arm setting, i.e. without early stopping, the control

allocation plays a part in achieving the power of a study when a variant of the Thall and Wathen

procedure is implemented. The same group of researchers also explore other design aspects in con-

junction with varying the control allocation, and find that RAR can lead to reasonable power under

some settings (Viele et al., 2020).

Summary

In conclusion, if the aim is to maintain (or even increase) power compared to an equally random-

ized design, in many trial scenarios this can be achieved using some kinds of RAR procedures, while

still reducing (or maintaining) the number of treatment failures. However, the choice of the RAR

procedure is crucial, and needs to be made with the objectives of the trial in mind. Of course the

power of the trial is not the only consideration, and sometimes (such as in the rare disease setting)

the patient benefit properties may be much more important. If maintaining power is a key concern

though, then this need not be the sole rationale to use fixed randomization instead of RAR.

4.3 Can robust statistical inference be performed after using RAR?

As noted in Proschan and Evans (2020), the Bayesian approach to statistical inference allows the

seamless analysis of results of a trial that uses RAR. However, when using frequentist inference,

challenges can occur:

The frequentist approach faces great difficulties in the setting of RAR . . . Use of response-

adaptive randomization eliminates the great majority of standard analysis methods . . .

Rosenberger and Lachin (2016) also note that using RAR in a trial can make the subsequent

(frequentist) statistical inference more challenging

Inference for response-adaptive randomization is very complicated because both the treat-

ment assignments and responses are correlated.

23

This raises a key question: how does an investigator analyze a trial using RAR when using frequentist

inference? In particular, can standard statistical tests and regression techniques be used without

inflating the type I error rate? And are standard estimators of the treatment effect(s) biased? Without

clear answers to these questions, it is unsurprising that the challenge of statistical inference (within

the frequentist framework) is still seen as a key barrier to the use of RAR in clinical practice. In this

section, we aim to show that valid statistical inference, especially in terms of type I error rate control

and unbiased estimation, is possible for a wide variety of RAR procedures. Note that in what follows,

we do not consider the issue of time trends and patient drift, a separate discussion of which is given

in Section 4.4.

Perhaps the most straightforward approach to frequentist inference following a trial using RAR is

to simply use standard statistical tests and estimators without adjustment, in contrast to the quotation

above from Proschan and Evans (2020). The justification is that the asymptotic properties of standard

estimators and tests are preserved for a large class of RAR procedures, including those for the multi-

arm setting. Firstly, Melfi and Page (2000) proved that any estimator that is consistent when responses

are independent and identically distributed will also be consistent for any RAR procedure (under the

assumption that the number of observations for each treatment tends to infinity). Secondly, Hu and

Rosenberger (2006) showed that when the responses follow an exponential family, simple conditions

on the RAR procedure ensure the asymptotic normality of the MLE. The basic condition is that the

allocation proportions for each treatment arm converge in probability to constants in (0, 1), which also

implies that the RAR procedure does not make a ‘choice’ or select a treatment during the trial (and

hence the sample size in each arm can tend to infinity). Since many test statistics are just functions of

the MLE, this result implies that the asymptotic null distribution of such test statistics is not affected

by the RAR. Further asymptotic results for urn-based procedures are given in Hu and Rosenberger

(2006) and Zhang et al. (2011).

These asymptotic results are the justification for the first guideline given by Hu and Rosenberger

(2006) on RAR procedures, which states that

Standard inferential tests can be used at the conclusion of the trial.

Of course, relying on asymptotic results to use standard tests and estimators may not be valid for

trials without a sufficiently large sample size, and the effect of a smaller sample size on inference

is greater for more aggressive RAR procedures (see for example the results in Williamson and Villar

(2020)). As noted by Rosenberger et al. (2012), for some RAR procedures in the two-arm trial setting,

there has been extensive literature investigating the accuracy of the large sample approximations under

24

moderate sample sizes using simulation (Hu and Rosenberger, 2003; Rosenberger and Hu, 2004; Zhang

and Rosenberger, 2006; Duan and Hu, 2009). These papers showed that for the DBCD, sample sizes

of n = 50 to 100 are sufficient, while for urn models reasonable convergence was achieved for a sample

size of n = 100. For these procedures, Gu and Lee (2010) explored which asymptotic test statistic to

use for a clinical trial with a small to medium sample size and binary responses.

If the asymptotic results above cannot be used, either because of small sample sizes or because

the conditions on the RAR procedures are not met, then alternative small sample methods for testing

and estimation have been proposed. We summarize the main methods below, concentrating on type I

error rate control and unbiased estimation.

Type I error rate

One common method for controlling the type I error rate, particularly for Bayesian RAR proce-

dures, is a simulation-based calibration approach. Given a trial design that incorporates RAR and an

analysis strategy (e.g. a test statistic and stopping boundary), a large number of trials are simulated

under the null hypothesis. Applying the analysis strategy to each of these simulated trial realizations

gives a Monte Carlo approximation of the relevant type I error rates. If necessary, the analysis strategy

(e.g. the stopping boundaries) can then be adjusted to satisfy the type I error constraints. Variations

of this approach have been used in Wason and Trippa (2014); Wathen and Thall (2017); Zhang et

al. (2019) for example, all in the context of calibrating multi-arm Bayesian RAR procedures to have

correct type I error control. Applying this approach can be computationally intensive however.

A related approach is to use a re-randomization test, which is also known as randomization-based

inference. In such a test, the observed outcome data are treated as fixed, but the randomization

sequence is regenerated many times using the RAR procedure. For each replicate, the test statistic is

recalculated, and a consistent estimator of the p-value is given by the proportion of the randomization

sequences that give a test statistic as (or more) extreme than that observed. Intuitively, this is

valid because under the null hypothesis of no treatment differences, the treatment assignments and

outcome data are independent. Simon and Simon (2011) give commonly held conditions under which

the re-randomization test guarantees the type I error rate. Galbete and Rosenberger (2016) showed

that 15, 000 replicates appears to be sufficient to accurately estimate even very small p-values. A

key advantage of using re-randomization tests is that they can protect against unknown time trends,

as we will discuss further in Section 4.4. However, re-randomization tests can suffer from a lower

power compared with using standard tests (Villar et al., 2018), particularly if the RAR procedure has

allocation probabilities that are highly variable (Proschan and Dodd, 2019).

25

Both of the methods above are simulation-based, and hence there may be concerns about Monte

Carlo error as well as the computation burden of such tests. There have been a few proposals that do

not rely on simulation and which can be used for type I error control. Robertson and Wason (2019)

proposed a re-weighting of the usual z-test that guarantees familywise error control for a large class of

RAR procedures for multi-arm trials with normally-distributed outcomes, although with a potential

substantial loss of power. Galbete et al. (2016) derived the exact distribution of a test statistic for a

family of RAR procedures in the context of a two-arm trial with binary outcomes, and hence showed

how to obtain exact p-values.

Estimation bias

Although the MLEs for the parameters of interest are typically consistent for a trial using RAR,

for finite samples they will be biased in general. This can be seen for a number of RAR procedures

for binary outcomes in the simulation results given in Villar et al. (2015a), and for procedures based

on Thompson sampling in Thall et al. (2015b). However, the latter point out that in their setting,

which incorporates early stopping with continuous monitoring,

. . . most of the bias appears to be due to continuous treatment comparison, rather than

AR per se.

Hence it is important to distinguish between bias induced by early stopping and that induced by the

use of the RAR procedure itself.

A simple formula for the bias of the MLE for the response probability is given in Bowden and

Trippa (2017), which is valid for general multi-arm RAR procedures without early stopping. In the

common case where RAR assigns more patients to treatments that appear to work well, these results

show that the bias of the MLE will be negative. In addition, the magnitude of this bias is decreasing

with the number of patients assigned to the treatment. When estimating the treatment difference

however, the bias can be either negative or positive, which agrees with the results in Thall et al.

(2015b). More general characterisations of the bias of the sample mean (even with early stopping) for

multi-arm bandit procedures is given in Shin et al. (2019, 2020).

Bowden and Trippa (2017) showed that when there is no early stopping, the magnitude of the

bias tends to be small for the RPW rule and the Bayesian RAR procedure proposed by Trippa et

al. (2012). For more aggressive RAR procedures, the bias can be larger however, see Williamson and

Villar (2020) for an example. As a solution, Bowden and Trippa (2017) proposed methods to correct

for the bias of the MLE, using inverse probability weighting and Rao-Blackwellization, although these

26

can be computationally intensive. For urn-based RAR procedures, Coad and Ivanova (2001) also

proposed bias-corrected estimators for the response probability. For sequential maximum likelihood

procedures and the DBCD, Wang et al. (2020) evaluate the bias issue and propose a solution to resolve

this problem.

Finally, adjusted confidence intervals for RAR procedures have received less attention in the lit-

erature. Rosenberger and Hu (1999) proposed a bootstrap procedure for general multi-arm RAR

procedures with binary responses, using a simple rank ordering. Meanwhile, Coad and Govindarajulu

(2000) proposed corrected confidence intervals following a sequential adaptive design for a two-arm

trial with binary responses. Recent work by Hadad et al. (2019) gives a strategy to construct asymp-

totically valid confidence intervals for a large class of adaptive experiments (including the use of RAR).

Summary

In conclusion, for trials with sufficiently large sample sizes, asymptotic results justify the use of

standard statistical tests and frequentist inference procedures when using many types of RAR. When

asymptotic results do not hold, inference does become more complicated compared with using fixed

randomization, but there is a growing body of literature demonstrating how to control the type I

error rate, and corrections for the bias of the MLE have been proposed. All this should give increased

confidence that the results from a trial using RAR, if analyzed appropriately, can be both valid and

convincing. Finally, we reiterate that from a Bayesian viewpoint, the use of RAR does not pose any

additional inferential challenges.

4.4 Can RAR be used if there is potential for time trends or patient drift?

The issue of time trends caused by changes in the standard of care or by patient drift (i.e. changes

in the characteristics of recruited patients over time) is seen as a major barrier to the use of RAR in

practice when compared with using fixed randomization:

One of the most prominent arguments against the use of AR is that it can lead to biased

estimates in the presence of parameter drift. (Thall et al., 2015b)

A more fundamental concern with adaptive randomization, which was noted when it was

first proposed, is the potential for bias if there are any time trends in the prognostic mix

of the patients accruing to the trial. In fact, time trends associated with the outcome

due to any cause can lead to problems with straightforward implementations of adaptive

randomization. (Korn and Freidlin, 2011a)

27

Both papers cited above show (for procedures based on Thompson sampling) that time trends can

dramatically inflate the type I error rate when using standard analysis methods, and induce bias into

the MLE. Further simulation results on the impact of time trends for BAR procedures (in the context

of two-arm trials with binary outcomes) are given in Jiang et al. (2020). In Villar et al. (2018), a

comprehensive simulation study is given for different time trend assumptions for a variety of RAR

procedures in trials with binary outcomes (including the multi-arm setting).

Although all these papers show that time trends can inflate the type I error rate when using RAR

procedures, there are two important caveats given in Villar et al. (2018). Firstly, they conclude that

a largely ignored but highly relevant issue to consider is the size of the trend and its likelihood of

occurrence in a specific trial:

. . . the magnitude of the temporal trend necessary to seriously inflate the type I error of

the patient benefit-oriented RAR rules need to be of an important magnitude (i.e. change

larger than 25% in its outcome probability) to be a source of concern.

Secondly, they also show (through simulation) that certain power-oriented RAR procedures are ef-

fectively immune to time trends. In particular, RAR procedures that protect the allocation to the

control arm in some way are particularly robust.

A more general issue around time trends is that they can invalidate the key assumption that

observations about treatments are exchangeable (i.e., that subjects receiving the same treatment arm

have the same probability of success). This in term can invalidate commonly used frequentist and

Bayesian models, and hence the inference of the trial data. Type I error inflation and estimation bias

can be seen as examples of this wider issue.

As pointed out in Proschan and Evans (2020), temporal trends seem more likely to occur in two

settings:

. . . 1) trials of long duration, such as platform trials in which treatments may continually

be added over many years and 2) trials in infectious diseases such as MERS, Ebola virus,

and coronavirus.

Despite this, little work has looked at estimating these trends, especially when doing so to inform a

trial design choice in the midst of an epidemic. Investigating both of these points would be essential

to be able to make a sound assessment as to the value of using RAR in a particular setting or not.

Furthermore, as we now discuss, there are analysis methods that can prevent the type I error inflation

that those trends could create when combined with a RAR design.

As mentioned in the previous section, one method to correct for type I error inflation is to use a

re-randomization test (or more generally, randomization-based inference). Simon and Simon (2011)

28

proved that using a re-randomization test (under commonly-held conditions) guarantees type I error

control even under arbitrary time trends. Simulation studies illustrating this can be found in Galbete

and Rosenberger (2016) and Villar et al. (2018). However, the latter shows that using a randomization-

based inference can come at the cost of a considerably reduced power compared with using an unad-

justed testing strategy.

An alternative to randomization-based inference is to use a blocking or stratified analysis at the

end of the trial, as proposed in e.g. Coad (1992); Karrison et al. (2003) and Korn and Freidlin (2011a).

These papers show (though simulation) that a stratified analysis can eliminate the type I error inflation

induced through time trends. However, Korn and Freidlin (2011a) also showed that using block

randomization and subsequently block-stratified analysis can reduce the trial efficiency, in terms of

increasing the required sample size and the chance of patients being assigned to the inferior treatment.

Another approach is to explicitly incorporate time-trend information into the regression analysis.

For example, Coad (1992) modified a class of sequential tests to incorporate a linear time trend

for normally-distributed outcomes. Meanwhile, Villar et al. (2018) assessed incorporating the time

trend into a logistic regression (for binary responses), and showed that this can alleviate the type I

error inflation of RAR procedures, if the trend is correctly specified and the associated covariates are

measured and available. However, this can result in a loss of power and complicate estimation (due

to the technical problem of separation).

Finally, it is possible to try control the impact of a time-trend during randomization. Rosenberger

et al. (2011) proposed a covariate-adjusted response-adaptive procedure for a two-armed trial that can

take a specific time trend as a covariate. More recently, Jiang et al. (2020) proposed a BAR procedure

that includes a time trend in a logistic regression model, and uses the resulting posterior probabilities

as the basis for the randomization probabilities. This model-based BAR procedure controls the type I

error rate and mitigates estimation bias, but at the cost of a reduced power.

Summary

In summary, large time trends can inflate the type I error rate when using RAR procedures. How-

ever, not all RAR procedures are affected in this way, with those that protect the allocation to the

control arm being particularly robust. For other types of RAR procedures, methods have been de-

veloped to mitigate the type I error inflation caused by time trends, although these tend to result in

a loss in power. Finally, it is important to note that time trends can affect inference in all types of

adaptive clinical trials, and not just those using RAR.

29

4.5 Practical considerations: Is RAR more challenging to implement?

If an investigator has decided to adopt a certain type of RAR, there is still a decision to be made as

to how to best implement it in the specific context at hand. There are plenty of issues to consider,

most of which are in common with other randomized designs (both adaptive and non-adaptive). In

this section, we focus on a few issues that are potentially different for RAR in particular, and hence

merit an additional discussion.

Measurement/classification error and missing data

The presence of measurement error (for continuous variables) or classification error (for binary

variables) and missing data are common in medical research. Many analysis approaches have been

proposed to reduce the impact of these issues on statistical inference (see e.g. Guolo (2008), Little

and Rubin (2002), and Blackwell et al. (2017)) but limited literature on overcoming these issues when

implementing a RAR procedure is available. The main concern when there are missing values and/or

measurement error is that the sequentially updated allocation probability may become biased. This

happens when the unobserved true values come from a distribution that is different to that of the

observed values.

To the best of our knowledge, the only work considering classification (or measurement) error is by

Li and Wang (2012) and Li and Wang (2013). They derived optimal allocation targets under a model

with constant misclassification probabilities, which could differ between the treatment arms. In the

latter paper, the effect of misclassification (in the two-arm setting) on the usual optimal allocation

designs was also explored through simulation.

As for missing data, Biswas and Rao (2004) considered the presence of missing responses for a

CARA design. For a two-arm setting with a normal outcome, a probit link function that depends on the

covariate adjusted unknown treatment effect parameters is used to construct the allocation probability.

Under the assumption of missing at random (see Rubin (1976)) and with a single imputation for the

missing responses, they found that the standard deviation and the mean of the proportion of patients

assigned to a treatment arm are not affected by the structure of the missing data mechanism. In a

thesis by Ma (2013), a new allocation function for CARA in the presence of missing covariates and

missing responses is defined.

For RAR, Williamson and Villar (2020) proposed a forward-looking bandit-based allocation pro-

cedure for Phase II cancer trials with normal outcome and an imputation method to facilitate the

implementation of the procedure when the outcome underlying the RECIST categories (Therasse et

30

al., 2000; Eisenhauer et al., 2009) is undefined, e.g. due to death or complete removal of a tumor.

They suggested to fill the incomplete data for these extreme cases with random samples drawn from

the lower tail and the upper tail of the distribution under the null and the alternative scenarios that

were used for sample size determination.

The investigation for the more complex scenarios, e.g. not missing at random, remains unexplored.

Nevertheless, these complex issues have rarely been considered at the design stage of a trial, except

for some simple setting such as the work by Lee et al. (2018).

Delayed responses

The use of RAR is clearly not appropriate in clinical trials where the patient outcomes are only

observed after all patients have been recruited and randomized. This may be the case when the

recruitment period is limited (e.g. due to a high recruitment rate), or when the outcome of interest

takes a long time to observe. One way to address the latter point is to use a surrogate outcome that

is more quickly observed. For example, Tamura et al. (1994) used a surrogate response to update the

urn when using a RPW rule in a trial treating patients with depressive disorder. Another possibility

when outcomes are delayed in some way is to use a randomization plan that is implemented in stages

as more date becomes available. As an example, Zhao and Durkalski (2014) described a two-arm trial

for patients suffering acute stroke which was divided into stages: stage 1 was a burn-in period with

equal randomization, stage 2 started to maintain covariate imbalance and only in stage 3 did the RAR

allocation begin.

In general, as long as some responses are available then RAR can be used in trials with delayed

responses, as stated in Hu and Rosenberger (2006, pg. 105):

From a practical perspective, there is no logistical difficulty in incorporating delayed re-

sponses into the response-adaptive randomization procedure, provided some responses be-

come available during the recruitment and randomization period. For urn models, the urn

is simply updated when responses become available (Wei, 1988). For procedures based on

sequential estimation, estimates can be updated when data become available. Obviously,

updates can be incorporated when groups of patients respond also, not just individuals.

Although practically speaking there are no difficulties in incorporating delayed responses into the

RAR procedures, from a hypothesis testing point of view, statistical inferences at the end of the trial

can be affected. As noted in Rosenberger et al. (2012), this has been explored theoretically for urn

models (Bai et al., 2002; Hu and Zhang, 2004; Zhang et al., 2007) as well as the DBCD (Hu et al.,

2008). These papers show that the asymptotic properties of these RAR procedures were preserved

31

under widely applicable conditions. In particular, when more than 60% of patient responses are avail-

able by the end of the recruitment period, simulations show that the power of the trial is essentially

unaffected for these procedures.

Patient consent

Patient consent is as much an ethical element of clinical trials as equipoise is. Informed consent

protects patients’ autonomy, and requires an appropriate balance between information disclosure and

understanding (Beauchamp, 1997). There is considerable evidence that the basic elements to ensure

informed consent (recall and understanding) can be very difficult to ensure even for traditional non-

adaptive studies (Sugarman, 1999; Dawson, 2009).

The added complexity of allocation probabilities that may (or may not change) in response to

accumulated data only makes achieving patient consent more challenging. Understanding the caveats

of each randomization algorithm has proven hard for statisticians, putting into context the challenge

of explaining these concepts to the regular public and expecting them to make an informed decision

when their health is at stake. Moreover, since these novel adaptive procedures are still rarely used in

real trials, there is little practical experience to draw upon.

RCTs also require a level of blinding that will only make the balance between the required disclosure

(for the purpose of understanding) and consenting patients during the trial even harder for studies

that use RAR, as this requires the use of accumulated data to alter the design and to adequately

inform patients as they are recruited.

All these challenges do not automatically imply that a trial design using RAR may not be the best

design option for a particular setting, they simply pose challenges that need to be properly addressed

and considered in conjunction with other reasons to use RAR or not. For a more in depth discussion

of these issues see Sim (2019).

Implementing randomization changes in practice

Randomization of patients in clinical trials, whether adaptive or not, must be done in accordance

with standards of good clinical practice. As such, randomization of patients in most clinical trials

worldwide is done through a dedicated web-based system that will typically be used by several members

of the trial team, including the trial manager, statistician, independent reviewers and those who use

it to randomize patients. Randomization has moved on from the times where it was done with paper

and envelopes, and now requires a system that can be backed up and maintained, and is easy to use,

secure and available 24/7.

32

In the UK, for example, most clinical trials unit will outsource their randomization to external

companies which in turn ensures compliance with good clinical practice. This outsourcing can even-

tually end up limiting the ways in which randomization can be implemented in a trial to those that

are currently offered by the companies in use. To date and to the best of the authors’ knowledge, in

the UK there is no company offering a RAR portfolio, and so every change in a randomization ratio

is treated as a trial change (and charged as such) rather than being considered an integral part of

the trial design. Beyond the extra costs that this brings, it can introduce unnecessary delays as the

randomization of patients needs to be stopped while the change is implemented. This can certainly

be a very important deterrent to the use of RAR in practice, as this can imply a much larger effort to

implement the randomization procedure and ensure that it is compliant with regulations.

A related issue is that of preserving blinding. Maintaining treatment blinding is key to protecting

the integrity of clinical trials. This is particularly important when applying a design that incorporates

RAR, as when an investigator is aware of what treatment is more likely to be allocated to a patient in

the future, selection bias is more likely to occur. Therefore, implementing a RAR algorithm in practice

places extra challenges to preserve the blinding of clinical trial staff to results of the trial and to avoid

biases. In most cases, preserving blindness will require the appointment of an independent statistician

(which requires extra resources) to handle the interim data and implement the randomization ratios,

or a data manager can provide clean data to an external randomization provider who can then update

the randomization probabilities independently of the clinical and statistical team. A further discussion

on some of the practical issues around handling interim data and unblinding can be found in Sverdlov

and Rosenberger (2013b).

4.6 Is using RAR in clinical trials (more) ethical?

Ethical reasons have been the most well-known and cited arguments in favor of using RAR over the

years.

Our explicit goal is to treat patients more effectively, but a happy side effect is that we

learn efficiently. (Berry, 2004)

Research in response-adaptive randomization developed as a response to a classical ethical

dilemma in clinical trials. (Hu and Rosenberger, 2006)

The goal of response-adaptive randomization is to achieve some ethical or statistical objec-

tive that might not be possible by fixing an allocation probability in advance. (Rosenberger

and Lachin, 2016)

33

Nevertheless, within the statistical community there are also positions arguing that RAR may not

be “ethical”.

For RCTs where treatment comparison is the primary scientific goal, it appears that in

most cases designs with fixed randomization probabilities and group sequential decision

rules are preferable to AR scientifically, ethically and logistically (Thall et al., 2015a)

Clinical research (implemented by clinical trials designed to create generalize knowledge) poses

several ethical questions. To start with, there is an inevitable tension between clinical research and

clinical practice, where the latter is concerned with best treating an individual patient. Such ethical

conflicts are becoming ever more discussed as treatment of patients becomes more linked to research

activities, as is currently observed in cancer research (London, 2018). Although the idea of “treating

more patients effectively” by using RAR appears to be ethically attractive, particularly from patients’

point of view, the extent to which these adaptive designs are truly more “ethical” than traditional

randomization designs is only recently starting to be formally addressed by ethicists.

It is our belief that a) this issue should receive more attention from ethicists and b) collaborations

between ethicists and statisticians are needed to fully address all the caveats and complexities that

this broad family of methods can distinctly have. Any argument (be that in favor or against) that

involves an ethical side should ideally be discussed with an ethicist and a statistician jointly. We

also believe that extreme positions regarding RAR based purely on statistical or ethical arguments

each considered in isolation are most likely to be wrong. Two reasons motivate our belief. First, as

illustrated in the previous sections of this paper, RAR is a broad family of methods to which almost

no generalization is true. More importantly, because the trial context to which RAR is applied to

matters significantly, making compromises between statistical and ethical objectives can have very

different implications under different settings. Therefore, in this section, we will not aim to answer if

RAR methods are ethical or not (a question that would need to be addressed for each method and

trial context specifically). However, we review key concepts that would affect this answer and give

some current discussions by ethicists that statisticians would benefit from reading.

The “equipoise” concept

An argument against the use of RAR has been that it violates the principle of equipoise on which

clinical trials (or medical research more generally) is based upon (Laage et al., 2017). Equipoise

is a state of uncertainty of the individual investigator regarding the relative merits of two or more

interventions for some population of patients. Such uncertainty justifies randomizing patients to

treatments as this does not imply knowingly disadvantaging patients by doing so. This concept

34

was considered too narrow and therefore extended to allow for randomizing patients when there is

“honest, professional disagreement among expert clinicians” about the relative clinical merits of some

interventions for a particular patient population (Freedman, 1987). This broader definition of equipoise

is known as ‘clinical equipoise’ while the first one is known as ‘theoretical equipoise’.

Changing the randomization probabilities in light of patients’ responses is viewed as disturbing

equipoise, because the updated allocation weights reflect the relative performance of the interventions

in question. Once the randomization weights become unbalanced, the study has a preferred treatment

and allocating participants to treatments regarded as inferior is unethical, as it violates the concern

for welfare. However, this argument that RAR is unethical because it breaks equipoise is based on

two assumptions: 1) randomization ratios reflect a single agent’s beliefs about the relative merits of

the interventions being tested in a study; and 2) equipoise is a state of belief in which the relevant

probabilities are assumed to be equally balanced. Neither of these two assumptions are consistent with

the definition of ‘clinical equipoise’ as the clinical community is certainly composed of more than one

agent and an honest disagreement among them will not necessarily take a 50%-50% split of opinions.

Patient horizon (individual and collective ethics)

The ethical value of RAR (and of any other design) also depends considerably on trial specific

considerations. A particular feature that can affect comparisons between design options relates to

disease prevalence. Suppose a clinical trial is being planned and let T be the size of the “patient

horizon” for that study, i.e. those patients within and outside of the trials who will benefit from

the conclusions of the trial. The concept of patient horizon can be traced back to Anscombe (1963)

and Colton (1963). The precise value of T is never known, but in extreme situations impact (or

should impact) the choice of design and sample size of that trial. A trial in a rare pediatric cancer (for

example) is likely to have a large proportion of the patient horizon included in the trial, while a trial

relevant to patients with coronary artery disease will have the vast majority of the patient horizon

outside of the trial. Similar thoughts apply in the context of emerging life-threatening diseases (e.g.

the recent Ebola crisis or the current COVID-19 pandemic), where the patient horizon can be short

for other reasons than prevalence.

Certainly, the impact of the patient horizon on the comparison of designs from an ethical point of

view depends on considerations around individual and collective ethics and potential conflicts between

these two. As Tamura et al. (1994, pg. 775) express it, RAR “represents a middle ground between the

community benefit and the individual patient benefit” and because of this “it is subject to attack from

either side”. This specific point has been very well discussed and formally studied in the statistical

35

literature (see Berry and Eick (1995); Berry (2004); Cheng et al. (2003) for some examples). Despite

this, prevalence of a disease is almost never taken into account, neither in practice when designing

real trials nor in a large number of articles comparing RAR methods from an ethical point of view.

The recent work by Lee and Lee (2020) is the only relevant work that we know of in the literature.

They propose a utility function for the evaluation of RAR, which includes a penalty for assigning pa-

tients to the inferior arm while accounting for patient benefit both within and beyond the trial horizon.

Two-armed versus multiple armed trials

A final point that can affect any ethical judgment about a trial design and of a RAR procedure

is the number of arms considered in the trial. In a two-armed setting, trade-offs between ethical and

statistical features are such that there is no scope for a design to be superior (or dominate) any other

in all aspects. In the multi-armed setting, this does not have to be the case, and depending on the

main objective of the trial (e.g. the relevant power definition used), designs using RAR can be superior

to a fixed randomized trial in the sense that they can achieve both efficiency and ethical gains over a

traditional RCT.

5 Discussion

RAR methods have received as much theoretical attention as they have generated heated debates,

especially during acute health crises like the one we are currently facing with COVID-19. This is not

surprising, as the main reason that drives the desire to change randomization ratios in clinical trials

is to better respond to very difficult ethical challenges under pressing conditions. However, for such

debates around its use in practice to be useful and meaningful, they have to remain level-headed and

bear in mind that generalizations within such a large class of methods are very likely to be partial

and/or misleading. Even within one family of RAR procedures, conclusions about “typical” issues can

be very different if one applies the procedure in a two-armed or in a multi-armed trial setting, with or

without early stopping, and if we are considering an early phase study (naturally to be followed by a

confirmatory trial) or a later phase study. In most cases, the relative performance of designs is highly

dependent on the preferred method of inference (frequentist or Bayesian).

We wish to emphasize that in this paper we are certainly not advocating the use of RAR in all trial

settings, as there will be many contexts where the use of a different form of trial adaptation or even a

fixed randomized design may be preferable for both methodological or practical reasons. Indeed, this

36

an issue to consider with adaptive trials in general – sometimes it may be better to ‘keep it simple’

and use traditional (non-adaptive) trial designs instead (Wason et al., 2019). In this review, where

we have chosen to focus on positive statements on the use of RAR procedures it was simply because

these arguments have often been neglected in broad comparisons presented in the literature.

Our recommendation is that if a trial design calls for the use of RAR as a possible option to

address a specific concern (be that ethical or not), careful consideration should be given to all the

issues mentioned in the present work. Our advice is to think of RAR as a long list of possible design

options rather than a simple unique technique to include or not. The number of possible ways to

implement RAR in a clinical trial is far too large to have a statement that will apply to all of them in

a given context. We also advise that simulations are carried out to widely explore the parameter space

(and not only subsets of interest). We encourage a broad look at multiple metrics when considering

the use of a specific RAR procedure, as the design may perform very well according to one metric but

very badly according to another. Trade-offs are ubiquitous and in many cases unavoidable, so the best

one can hope for is to be aware of them when making a design choice. Indeed, RAR procedures can

address a specific need at the expense of a cost in a different area, and this needs to be made explicit

at a design stage. There is no perfect trial design, but some trade-offs might be acceptable in certain

cases (or even necessary for a greater good) while the same trade-offs can be absolutely rejected in

other contexts.

We would like to end the paper with a short summary as to what we feel the future for RAR

methods research should bring to improve its usefulness in clinical practice. Our hope is that rather

than taking the route of explicitly “discouraging” or “encouraging” the use of RAR, researchers can

take the more constructive path of addressing these issues with new ideas that work to offer most

of the advantages that RAR can bring with fewer (or none) of its downsides while taking the trial

context into account.

Indeed, as RAR methods start being more commonly used in clinical trials, this will unlock demand

for applied research in terms of the analysis of such studies. Practical issues such as accounting for

missing data or measurement error when using a RAR method are still a big open question. In terms

of more theoretical work, we feel that the main issue to address is that of robust inference for the

broader class of methods. However, there are some design aspects that could still be addressed. For

instance, the RAR class has scope for addressing delicate issues when designing studies with composite

or complex endpoints, where there might be effects of the same direction or even of opposite ones.

Finally, the use of RAR in practice requires the availability of user-friendly software for both the

implementation of the randomization algorithm as well as for the analysis approaches that were men-

37

tioned in Section 4 of this paper. Applied statisticians could benefit from training specifically oriented

on how to use simulations for the evaluation of RAR techniques, which is crucial to understand their

potential limitations and benefits.

Acknowledgments

The authors thank Peter Jacko for many helpful and constructive comments on an earlier version of

this manuscript, Andi Zhang for providing code for the GI and FLGI procedures used in the simulation

in Section 4.1, Ayon Mukherjee for suggesting the use of the drop-the-loser rule in Section 4.1, Arina

Kazimianec for previous work looking at sample size imbalance which helped motivate Section 4.1,

and Nikolaos Skourlis for screening through the relevant literature on model-based adaptive random-

ization approaches. The authors acknowledge funding and support from the UK Medical Research

Council (grants MC UU 00002/3 (SSV, BCL-K), MC UU 00002/6 (DSR), MR/N028171/1 (KML)),

the Biometrika Trust (DSR) and the National Institute for Health Research Cambridge Biomedical

Research Centre at the Cambridge University Hospitals NHS Foundation Trust (DSR, KML, SSV).

References

Angus, D.C., Berry, S., Lewis, R.J., Al-Beidh, F., Arabi, Y., van Bentum-Puijk, W.,

Bhimani, Z., Bonten, M., Broglio, K., Brunkhorst, F., Cheng, A.C., Chiche, J.D., De

Jong, M., Detry, M., Goossens, H., Gordon, A., Green, C., Higgins, A.M., Hullegie,

S.J., Kruger, P., Lamontagne, F., Litton, E., Marshall, J., McGlothlin, A., McGuin-

ness, S., Mouncey, P., Murthy, S., Nichol, A., O’Neill, G.K., Parke, R., Parker, J.,

Rohde, G., Rowan, K., Turner, A., Young. P,, Derde, L., McArthur, C. and Webb,

S.A. (2020). The Randomized Embedded Multifactorial Adaptive Platform for Community-acquired

Pneumonia (REMAP-CAP) Study: Rationale and Design. Annals of the American Thoracic Soci-

ety, Advance access, doi:10.1513/AnnalsATS.202003-192SD.

Anscombe, F.J. (1963). Sequential medical trials. Journal of the American Statistical Association

58 365–383.

Atkinson, A.C. and Biswas, A. (2014). Randomized Response-Adaptive Designs in Clinical Trials.

CRC Press, Boca Raton.

38

Auer, P., Cesa-Bianchi N. and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit

Problem. Machine Learning 47(2–3) 235–256.

Barker, A.D., Sigman, C.C., Kelloff, G.J., Hylton, N.M., Berry, D.A. and Esserman,

L.J. (2009). I-SPY 2: An adaptive breast cancer trial design in the setting of neoadjuvant chemother-

apy. American Society for Clinical Pharmacology and Therapeutics 86 97–100.

Bartlett, R., Roloff, D., Cornell, R., Andrews, A., Dillon, P. and Zwischenberger,

J. (1985). Extracorporeal Circulation in Neonatal Respiratory Failure: A Prospective Randomized

Study. Pediatrics Journal 76(4) 479–487.

Beauchamp, T.L. Informed consent. In Medical Ethics, 2nd edition, 185-208. ed. Robert M. Veatch,

Boston: Jones and Bartlett.

Bello, G.A. and Sabo, R.T. (2016). Outcome-adaptive allocation with natural lead-in for three-

group trials with binary outcomes. Journal of Statistical Computation and Simulation 86(12) 2441–

2449.

Berry, D.A. and Eick, S.G. (1995). Adaptive assignment versus balanced randomization in clinical

trials: a decision analysis. Statistics in Medicine 14(3) 231–246.

Berry, D.A. (2004). Bayesian Statistics and the Efficiency and Ethics of Clinical Trials. Statistical

Science 19(1) 175–187.

Berry, D.A. (2011). Adaptive Clinical Trials: The Promise and the Caution. Journal of Clinical

Oncology 29(6) 606–609.

Berry, S.M., Petzold, E.A., Dull, P., Thielman, N.M., Cunningham, C.K., Corey, G.R.,

McClain, M.T., Hoover, D.L., Russell, J., Griffiss, J.M. and Woods, C.W. (2016). A

response adaptive randomization platform trial for efficient evaluation of Ebola virus treatments: A

model for pandemic response. Clinical Trials, 13(1), 2230.

Bai, Z.D., Hu, F. and Rosenberger, W.F. (2002). Asymptotic properties of adaptive designs for

clinical trials with delayed responses. Annals of Statistics, 30 122–139.

Biswas, A. and Rao, J. N. K. (2004). Missing responses in adaptive allocation design. Statistics &

Probability Letters, 70(1) 59–70.

Blackwell, M., Honaker,J. and King, G. (2017). A unified approach to measurement error and

missing data: overview and applications. Sociological Methods & Research, 46(3) 303–341.

39

Bowden, J. and Trippa, L. (2017). Unbiased estimation for response adaptive clinical trials. Statis-

tical Methods in Medical Research 26(5) 2376–2388.

Brittain, E.H. and Proschan, M.A. (2016). Comments on Berry et al.’s response-adaptive ran-

domization platform trial for Ebola. Clinical Trials 13(5) 566–567.

Bubeck, S. and Cesa-Bianchi (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed

Bandit Problems. Foundations and Trends in Machine Learning 5(1) 1–122.

Burton, P.R., Gurrina, L.C. and Hussey, M.H. (1997). Interpreting the clinical trials of ex-

tracorporeal membrane oxygenation in the treatment of persistent pulmonary hypertension of the

newborn. Seminars in Neonatology 2 69–79.

Carey, L.A. and Winer, E.P. (2016). I-SPY 2 – Toward More Rapid Progress in Breast Cancer

Treatment New England Journal of Medicine 375 83–84.

Cheng, Y., Su, F. and Berry, D.A. (2003). Choosing sample size for a clinical trial using decision

analysis. Biometrika 90(4) 923–936.

Cheng, Y. and Berry, D.A. (2007). Optimal adaptive randomized designs for clinical trials.

Biometrika 94 673–689.

Coad, D.S. (1991). Sequential tests for an unstable response variable. Biometrika 78(1) 113–121.

Coad, D.S. (1992). A comparative study of some data-dependent allocation rules for Bernoulli data.

Journal of Statistical Computation and Simulation 40(3–4), 219–231.

Coad, D.S. and Govindarajulu, Z. (2000). Corrected confidence intervals following a sequential

adaptive trial with binary response. Journal of Statistical Planning and Inference 91 53–64.

Coad, D.S. and Ivanova, A. (2001). Bias calculations for adaptive urn designs. Sequential Analysis

20 229–239.

Colton, T. (1963). 963). A model for selecting one of two treatments. Journal of the American

Statistical Association 58 388–400.

Das, S. and Lo, A.W. (2017). Re-inventing drug development: A case study of the I-SPY 2 breast

cancer clinical trials program Contemporary Clinical Trials 62 168–174.

Dawson, A. (2009). The normative status of the requirement to gain an informed consent in clinical

trials: Comprehension, obligations and empirical evidence. In The limits of consent: A sociolegal

40

approach to human subject research in medicine, ed. Oonagh Corrigan, John McMillan, Kathleen

Liddell, Martin Richards, and Charles Weijer, 99-113. Oxford: Oxford University Press.

Duan, L. and Hu, F. (2009). Doubly-adaptive biased coin designs with heterogenous responses.

Journal of Statistical Planning and Inference 139 3220–3230.

Eisele, J.R. (1994). The double adaptive biased coin design for sequential clinical trials. Journal of

Statistical Planning and Inference 38 249–261.

Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R.,

Dancey, J., Arbuck, S., Gwyther, S., Mooney, M., Rubinstein, L., Shankar, L., Dodd,

L., Kaplan, R., Lacombe, D. and Verweij, J. (2009). New response evaluation criteria in solid

tumours: revised RECIST guideline (version 1.1). European Journal of Cancer, 45(2) 228–247.

Fan, L., Yeatts, S.D., Wolf, B.J., McClure, L.A., Selim, M. and Palesch, Y.Y. (2018).

The impact of covariate misclassification using generalized linear regression under covariate-adaptive

randomization. Statistical Methods in Medical Research, 27(1) 20–34.

Flournoy, N., Haines, L.M. and Rosenberger, W.F. (2013). A Graphical Comparison of

Response-Adaptive Randomization Procedures. Statistics in Biopharmaceutical Research, 5(2) 126–

141.

Freedman, B. (1987). Equipoise and the Ethics of Clinical Research. New England Journal of

Medicine 317 141–145.

Galbete, A., Moler, J.A. and Plo, F. (2016). Randomization tests in recuresive response-adaptive

randomization procedures. Statistics 50(2) 418–434.

Galbete, A. and Rosenberger, W.F. (2016). On the use of randomization tests followign adaptive

designs. Journal of Biopharmaceutical Statistics 26(3) 466–474.

Gu, X. and Lee, J.J. (2010). A simulation sudy for comparing testing statistics in response-adaptive

randomization. BMC Medical Research Methodology 10 48.

Guolo, A. (2008). Robust techniques for measurement error correction: a review. Statistical Methods

in Medical Research, 17(6) 555–580.

Hadad, V., Hirshberg, D.A., Zhan, R., Wager, S. and Athey, S. (2019). Confidence Intervals

for Policy Evaluation in Adaptive Experiments. arXiv preprint arXiv:1911.02768.

41

Hu, F. and Rosenberger, W.F. (2003). Optimality, variability, power: Evaluating response-

adaptive randomization procedures for treatment comparisons. Journal of the American Statistical

Association 98 671–678.

Hu, F. and Zhang, L-X. (2004a). Asymptotic properties of doubly adaptive biased coin design for

multi-treatment clinical trials. The Annals of Statistics 32(1) 268–301.

Hu, F. and Zhang, L-X. (2004b). Asymptotic normality of adaptive designs with delayed response.

Bernoulli 10 447–463.

Hu, F. and Rosenberger, W.F. (2006). The Theory of Response-Adaptive Randomization in Clinical

Trials. Wiley Series in Probability and Statistics.

Hu, F., Zhang, L-X., Cheung, S.H. and Chan, W.S. (2008). Double-adaptive biased coin designs

with delayed responses. Canadian Journal of Statistics 36 541–559.

Hu F., Zhang L. and He X. (2009). Efficient Randomized-Adaptive Designs. The Annals of Statistics

37(5A) 2543–2560.

Hu, J., Zhu, H. and Hu, F. (2015). A Unified Family of Covariate-Adjusted Response-Adaptive

Designs Based on Efficiency and Ethics. Journal of the American Statistical Association. 110:509

357–367.

Ivanova, A. (2003). A play-the-winner-type urn design with reduced variability. Metrika 58(1) 1–13.

Jacko, P. (2019). The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Mul-

tidisciplinary Survey of the History, State of the Art, and Myths. arXiv preprint. arXiv:1906.10173.

Jennison, C. and Turnbull, B.W. (2000). Group Sequential Methods with Applications to Clinical

Trials. Chapman and Hall/CRC, Boca Raton, FL.

Jeon, Y. and Hu, F. (2010). Optimal Adaptive Designs for Binary Response Trials with Three

Treatments. Statistics in Biopharmaceutical Research. 2(3) 310–318.

Jiang, Y., Zhao W. and Durkalski-Mauldin, V. (2020). Time-trend impact on treatment estima-

tion in two-arm clinical trials with a binary outcome and Bayesian response adaptive randomization.

Journal of Biopharmaceutical Statistics. 30(1) 69–88.

Johnston, K., Bruno, A., Pauls, Q., Hall, C.E., Barrett, K.M., Barsan, W., Fansler, A.,

Van de Bruinhorst, K., Janis, S. and Durkalski-Mauldin, V.L. for the Neurological Emer-

gencies Treatment Trials Network and the SHINE Trial Investigators (2019). Intensive vs Standard

42

Treatment of Hyperglycemia and Functional Outcome in Patients With Acute Ischemic Stroke: The

SHINE Randomized Clinical Trial. Journal of the American Medical Association. 322(4) 326–335.

Kaibel, C. and Biemann, T. (2019). Rethinking the Gold Standard With Multi-armed Ban-

dits: Machine Learning Allocation Algorithms for Experiments. Organizational Research Methods

https://doi.org/10.1177/1094428119854153.

Karrison, T.G., Huo, D. and Chappell, R. (2003). A group sequential, response-adaptive design

for randomized clinical trials. Controlled Clinical Trials 24 506–522.

Kaufmann, E., Korda, N. and Munos, A. (2012). Thompson Sampling: An Asymptotically Op-

timal Finite-Time Analysis. International Conference on Algorithmic Learning Theory 2012: Pro-

ceedings 199–213.

Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: two bandit

frameworks. ESAIM: Procs 60 114–131.

Kim, E. S., Herbst, R.S., Wistuba, I.I., Lee, J.J., Blumenschein, G.R., Tsao, A., Stewart,

D.J., Hicks, M.E., Erasmus, J. Jr, Gupta, S., Alden, C.M., Liu, S., Tang, X., Khuri,

F.R., Tran, H.T., Johnson, B.E., Heymach, J.V., Mao, L., Fossella, F., Kies, M.S., Pa-

padimitrakopoulou, V., Davis, S.E., Lippman, S.M. and Hong, W.K. (2011). The BATTLE

trial: personalizing therapy for lung cancer. Cancer Discovery 1(1) 44–53.

Korn, E.L. and Freidlin, B. (2011a). Outcome-adaptive randomization: is it useful? Journal of

Clinical Oncology, 29 771–776.

Korn, E.L. and Freidlin, B. (2011b). Reply to Y. Yuan et al. Journal of Clinical Oncology 29 e393.

Korn, E.L. and Freidlin, B. (2017). Commentary. Adaptive Clinical Trials: Advantages and Dis-

advantages of Various Adaptive Design Elements. Journal of the National Cancer Institute 109(6)

djx013.

Laage, T., Loewy, J.W., Menon, S., Miller, E.R., Pulkstenis, E., Kan-Dobrosky, N.

and Coffey, C. (2017). Ethical Considerations in Adaptive Design Clinical Trials. Therapeutic

Innovation & Regulatory Science 51(2) 190–199.

Lattimore, T. and Szepesvari, C. (2019). Bandit Algorithms. In press https://tor-

lattimore.com/downloads/book/book.pdf.

43

Lee, J.J., Chen N. and Yin, G. (2012). Worth Adapting? Revisiting the Usefulness of Outcome-

Adaptive Randomization. Clinical Cancer Research 18(17) 4498–4507.

Lee, K.M. and Lee, J.J. (2020+). Evaluating Bayesian adaptive randomization procedures with

adaptive clip methods for multi-arm trials. Under review.

Lee, K.M., Mitra, R. and Biedermann, S. (2018). Optimal design when outcome values are not

missing at random. Statistica Sinica, 28(4) 1821–1838.

Li, X. and Wang, X. (2012). Variance-penalized response-adaptive randomization with mismeasure-

ment. Journal of Statistical Planning and Inference, 142 2128-2135.

Li, X. and Wang, X. (2013). Response adaptive designs with misclassified responses. Communication

in Statistics – Theory and Methods, 42 2071-2083.

Lin, J. and Bunn, V. (2017). Comparison of multi-arm multi-stage design and adaptive randomization

in platform clinical trials. Contemporary Clinical Trials 54 48–59.

Little, R.J. and Rubin, D. B. (2002). Bayes and multiple imputation. In: Statistical analysis with

missing data, 2nd edition, 200–220. New York (NY): Wiley.

London, A.J. (2018). Learning health systems, clinical equipoise and the ethics of response adaptive

randomization. Journal of Medical Ethics 44 409–415.

Ma, Z. (2013). Missing data and adaptive designs in clinical studies. PhD Thesis, University of

Virginia.

Magaret, A.S., Jacob, S.T., Halloran, M.E., Guthrie, K.A., Magaret C.A., Johnston,

C., Simon, N.R. and Wald, A. (2020). Multigroup, Adaptively Randomized Trials Are Advan-

tageous for Comparing Coronavirus Disease 2019 (COVID-19) Interventions. Annals of Internal

Medicine, Advance access, doi:10.7326/M20-2933.

Marchenko, O., Fedorov, V., Lee, J., Nolan, C. and Pinheiro, J. (2014). Adaptive Clinical

Trials: Overview of Early-Phase Designs and Challenges. Therapeutic Innovation & Regulatory

Science 48(1) 20–30.

Melfi, V.F. and Page, C. (2000). Estimation after adaptive allocation. Journal of Statistical Plan-

ning and Inference 87(2) 353–363.

44

Papadimitrakopoulou, v., Lee, J.J., Wistuba, I., Tsao, A. Fossella, F., Kalhor, N.,

Gupta, S., Averett Byers, L., Izzo, J., Gettinger, S., Goldbert, S., Tang, X., Miller,

V., Skoulidis, F., Gibbons, D., Shen, L., Wei, C., Diao, L., Peng, S. A., Wang, J., Tam,

A., Coombes, K., Koo, J. Mauro, D., Rubin, E., Heymach, J., Hong, W. and Herbst,

R. (2016). The BATTLE-2 Study: A Biomarker-Integrated Targeted Therapy Study in Previously

Treated Patients With Advanced Non-Small-Cell Lung Cancer. Journal of Clinical Oncology 34(30)

3638–3647.

Park, J. W., Liu, M.C., Yee, D., Yau, C., van’t Veer, L.J., Symmans, W.F., Paoloni, M.,

Perlmutter, J., Hylton, N.M., Hogarth, M., DeMichele, A., Buxton, M.B., Chien,

A.J., Wallace, A.M., Boughey, J.C., Haddad, T.C., Chui, S.Y., Kemmer, K.A., Kaplan,

H.G., Isaacs, C., Nanda, R., Tripathy, D., Albain, K.S., Edmiston, K.K., Elias, A.D.,

Northfelt, D.W, Pusztai, L., Moulder, S.L., Lang, J.E., Viscusi, R.K. Euhus, D.M.,

Haley, B.B., Khan, Q.J., Wood, W.C., Melisko, M., Schwab, R., Helsten, T., Lyandres,

J., Davis, S.E., Hirst, G.L., Sanil, A., Esserman, L.J. and Berry, D.A. for the I-SPY 2

Investigators (2016). Adaptive Randomization of Neratinib in Early Breast Cancer. New England

Journal of Medicine 375(1) 11–22.

Press, S. J. (2005). Applied multivariate analysis: using Bayesian and frequentist methods of infer-

ence. Courier Corporation.

Proschan, M.A. and Dodd, L.E. (2019). Re-randomization tests in clinical trials. Statistics in

Medicine 38, 2292–2302.

Proschan, M. and Evans, S. (2020). The Temptation of Response-Adaptive Randomization. Clinical

Infectious Diseases. Advance access, doi: 10.1093/cid/ciaa334.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American

Mathematical Society 58 527–535.

Robertson, D.S. and Wason, J.M.S. (2019). Familywise error control in multi-armed response-

adaptive trials. Biometrics 75(3) 885–894.

Rosenberger, W.F. and Lachin, J.M. (1993). The use of response-adaptive designs in clinical

trials. Controlled Clinical Trials 14(6) 471–484.

Rosenberger, W.F., Stallard, N., Ivanova, A., Harper, C.N. and Ricks, M.L. (2001).

Optimal adaptive designs for binary response trials. Biometrics 57 909–913.

45

Rosenberger, W.F., Vidyashankar, A.N. and Agarwal, D.K. (2001b). Covariate-adjusted

response-adaptive designs for binary response. Journal of Biopharmaceutical Statistics 11(4) 227–

236.

Rosenberger, W.F., Sverdlov, O. and Hu, F. (2012). Adaptive Randomization for Clinical

Trials. Journal of Biopharmaceutical Statistics 22 719–736.

Rosenberger, W.F. and Hu, F. (1999). Bootstrap methods for adaptive designs. Statistics in

Medicine 18 1757–1767.

Rosenberger, W.F. and Lachin, J.M. (2016). Randomization in Clinical Trials. Wiley Series in

Probability and Statistics, New York.

Rosenberger, W.F. and Hu, F. (2004). Maximising power and minimizing treatment failures in

clinical trials. Clincal Trials 1 141–147.

Rosenberger, W.F. and Sverdlov, O. (2008). Handling Covariates in the Design of Clinical Trials.

Statistical Science 23(3) 404–419.

Rosenberger, W.F. and Lachin, J.M. (2016). Randomization in Clinical Trials (Second Edition).

Wiley Series in Probability and Statistics, Hoboken, New Jersey.

Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.

Rugo, H. S. Olopade, O.I., DeMichele, A., Yau, C., van’t Veer, L.J., Buxton, M.B.,

Hogarth, M., Hylton, N.M., Paoloni, M., Perlmutter, J., Symmans, W.F., Yee, D.,

Chien, A.J., Wallace, A.M., Kaplan, H.G., Boughey, J.C., Haddad, T.C., Albain, K.S.,

Liu, M.C., Isaacs, C., Khan, Q.J., Lang, J.E., Viscusi, R.K., Pusztai, L., Moulder, S.L.,

Chui, S.Y., Kemmer, K.A., Elias, A.D., Edmiston, K.K., Euhus, D.M., Haley, B.B.,

Nanda, R., Northfelt, D.W., Tripathy, D., . Wood, W.C., Ewing, C., Schwab, R.,

Lyandres, J., Davis, S.E., Hirst, G.L., Sanil, A., Berry, D.A. and Esserman, L.J. for

the I-SPY 2 Investigators (2016). Adaptive Randomization of VeliparibCarboplatin Treatment in

Breast Cancer. New England Journal of Medicine 375(1) 23–34.

Sabo, R.T. (2014). Adaptive allocation for binary outcomes using decreasingly informative priors.

Journal of Biopharmaceutical Statistics 24(3) 569–578.

Samaniego, F. J. (2010). A comparison of the Bayesian and frequentist approaches to estimation.

Springer Science & Business Media.

46

Shin, J., Ramdas, A. and Rinaldo, A. (2019). Are sample means in multi-armed bandits positively

or negatively biased? NeurIPS 2019 arXiv:1905.11397.

Shin, J., Ramdas, A. and Rinaldo, A. (2020). On conditional versus marginal bias in multi-armed

bandits. arXiv preprint arXiv:2002.08422.

Sim, J. (2019). Outcome-adaptive randomization in clinical trials: issues of participant welfare and

autonomy. Theoretical Medicine and Bioethics 40(2) 83-101.

Simon, R. and Simon, N.R. (2011). Using randomization tests to preserve type I error with response

adaptive and covariate adaptive randomization. Statistics and Probability Letters 81(7) 767–772.

Smith, R.L. (1984). Properties of Biased Coin Designs in Sequential Clinical Trials. Annals of Statis-

tics 12(3) 1018–1034.

Siu, L. L., Ivy, S. P., Dixon, E. L., Gravell, A. E., Reeves, S. A. and Rosner, G. L. (2017).

Challenges and Opportunities in Adapting Clinical Trial Design of Immunotherapies. Clin Cancer

Res 23(17) 4950–4958.

Sugarman, J., Douglas C. McCrory, D.C., Powell, D., Krasny, A., Adams, B., Ball,

E. and Cassell, C. (1999). Empirical research on informed consent. Hastings Center Report

29(suppl) S1S42.

Sverdlov, O. and Rosenberger, W.F. (2013). On recent advances in optimal allocation designs

in clinical trials. Journal of Statistical Theory and Practice 7(4) 753–773.

Sverdlov, O. and Rosenberger, W.F. (2013). Randomization in clinical trials: can we eliminate

bias? Clinical Investigation Journal 3(1) 37–47.

Tamura, R.N., Faries, D.E., Andersen, J.S. and Heiligenstein, J.H. (1994). A Case Study of

an Adaptive Clinical Trial in the Treatment of Out-Patients with Depressive Disorder. Journal of

the American Statistical Association 89, 768–776.

Thall, P.F. and Wathen, J.K. (2007). Practical Bayesian adaptive randomization in clinical trials.

European Journal of Cancer, 43, 859–866.

Thall, P.F., Fox, P.S. and Wathen, J.K. (2015a). Some caveats for outcome adaptive random-

ization in clinical trials. CRC Press, Boca Raton.

47

Thall, P.F., Fox P. and Wathen, J. (2015b). Statistical controversies in clinical research: scientific

and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology

26 1621–1628.

Therasse, P., Arbuck, S.G., Eisenhauer, E.A., Wanders, J., Kaplan, R.S., Rubinstein,

L., Verweij, J., Van Glabbeke, M., van Oosterom, A.T., Christian, M.C. and Gwyther,

S.G. (2000). New guidelines to evaluate the response to treatment in solid tumors. Journal of the

National Cancer Institute, 92(3) 205–216.

Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view

of the evidence of two samples. Biometrika 25 285–294.

Torgerson, D.J. and Campbell, M.K. (2000). Use of unequal randomization to aid the economic

efficiency of clinical trials. BMJ 321 759.

Trippa, L., Lee, E.Q., Wen, P.Y., Batchelor, T.T., Cloughesy, T., Parmigiani, G. and

Alexander, B.M. (2012). Bayesian adaptive trial design for patients with recurrent gliobastoma.

Journal of Clinical Oncology 30 3258–3263.

Tymofyeyev, Y., Rosenberger W.F. and Hu, F. (2007). Implementing optimal allocation in

sequential binary response experiments. Journal of the American Statistical Association 102(477)

224–234.

Viele, K., Broglio, K., McGlothlin, A. and Saville, B.R. (2020) Comparison of methods for

control allocation in multiple arm studies using response adaptive randomization. Clinical Trials

17(1) 5260.

Viele, K., Saville, B.R., McGlothlin, A. and Broglio, K. (2020) Comparison of response

adaptive randomization features in multiarm clinical trials with control. Pharmaceutical Statistics,

Advance access, doi:10.1002/pst.2015

Villar, S.S., Bowden, J. and Wason, J. (2015a). Multi-armed Bandit Models for the Optimal

Design of Clinical Trials: Benefits and Challenges. Statistical Science 30(2) 199–215.

Villar, S.S., Wason J. and Bowden, J. (2015b). Response-adaptive randomization for multi-arm

clinical trials using the forward looking Gittins index rule. Biometrics 71(4) 969–978.

Villar, S.S., Bowden, J. and Wason, J. (2018). Response-adaptive designs for binary responses:

How to offer patient benefit while being robust to time trends? Pharmaceutical Statistics 17(2)

182–197.

48

Villar, S.S., Robertson, D.S. and Rosenberger, W.F. (2020). The Temptation of Overgener-

alizing Response-Adaptive Randomization. Clinical Infectious Diseases, in press.

Wang, Y., Zhu, H. and Lee, J.J. (2020) Evaluation of bias for outcome adaptive randomization

designs with binary endpoints. Statistics and Its Interface 13 287-315

Wagenmakers, E.J., Lee, M., Lodewyckx, T. and Iverson, G. (2008). Bayesian versus fre-

quentist inference. In Bayesian evaluation of informative hypotheses, 181–207. Springer, New York,

NY.

Ware, J.H. (1989). Investigating Therapies of Great Benefit: ECMO - with comments. Statistical

Science 4(4), 298–340.

Wason, J.M.S. and Trippa, L. (2014). A comparison of Bayesian adaptive randomization and multi-

stage designs for multi-arm clinical trials. Statistics in Medicine 33(13) 2206–2221.

Wason, J.M.S., Brocklehurst P. and Yap, C. (2019). When to keep it simple – adaptive designs

are not always useful. BMC Medicine 17:152.

Wathen, J.K. and Thall, P.F. (2017). A simulation study of outcome adaptive randomization in

multi-arm clinical trials. Clinical Trials 14(5) 432–440.

Wei, L.J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal

of Medical Statistics Association 73 840–843.

Wei, L.J. (1988). Exact two-sample permutation tests based on the randomized play-the-winner rule.

Biometrika 75 603–606.

Williamson, S.F., Jacko, P., Villar, S.S. and Jaki, T. (2017). A Bayesian adaptive design for

clinical trials in rare diseases. Computational Statistics and Data Analysis 113 136–153.

Williamson, S. F. and Villar, S. S. (2020). A ResponseAdaptive Randomization Procedure for

MultiArmed Clinical Trials with Normally Distributed Outcomes. Biometrics 76(1) 197–209

Zelen, M. (1969). Play the Winner Rule and the Controlled Clinical Trial. Journal of the American

Statistical Association 64 131–146.

Zhang, L. and Rosenberger, W.F. (2006). Response-Adaptive Randomization for Clinical Trials

with Continuous Outcomes. Biometrics 62 562–569.

49

Zhang, L. and Rosenberger, W.F. (2007a). Response-adaptive randomization for survival trials:

the parametric approach. Applied Statistics 56(2) 153–165.

Zhang, L., Chan, W.S., Cheung, S.H. and Hu, F. (2007b). A generalized urn model for clinical

trials with delayed responses. Statistica Sinica 17 387–409.

Zhang, L., Hu, F., Cheung, S.H. and Chan, W.S. (2011). Immigrated urn models – theoretical

properties and applications. Annals of Statistics 39 643–671.

Zhang, L., Trippa, L. and Parmigiani, G. (2019). Frequentist operating characteristics of Bayesian

optimal designs via simulation. Statistics in Medicine 38 4026–4039.

Zhao, W. and Durkalski, V. (2014). Managing competing demands in the implementation of

response-adaptive randomization in a large multicenter phase III acute stroke trial. Statistics in

Medicine 33(23) 40434052.

Zhou, X., Liu, S., Kim, E.S., Herbst, R.S. and Lee, J.J. (2008). Bayesian adaptive design for

targeted therapy development in lung cancer – a step toward personalized medicine. Clinical Trials

5(3) 181–193.

50

response-adaptive randomization in clinical trials: …myths to practical considerations david s....

Documents