response-adaptive randomization in clinical trials: …myths to practical considerations david s....
TRANSCRIPT
Response-adaptive randomization in clinical trials: from
myths to practical considerations
David S. Robertson1, Kim May Lee1, Boryana C. Lopez-Kolkovska1, and Sofıa S. Villar ∗1
1MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
Abstract
Response-adaptive randomization (RAR) is part of a wider class of data-dependent sampling
algorithms, for which clinical trials have commonly been used as a motivating application. In that
context, patient allocation to treatments is defined using the accrued data on responses to alter
randomization probabilities, in order to achieve different experimental goals. RAR has received
abundant theoretical attention from the biostatistical literature since the 1930’s and has been the
subject of heated debates. Recently it has received renewed consideration from the applied com-
munity due to successful practical examples and its widespread use in machine learning. Many
position papers on the subject present a one-sided view on its use, which is of limited value for the
non-expert. This work aims to address this persistent gap by providing a critical, balanced and
updated review of methodological and practical issues to consider when debating the use of RAR
in clinical trials.
Keywords: Adaptive randomization; ethics; multi-armed bandits; patient allocation; sample size imbalance,
time trends, type I error control.
1 Introduction
Few topics in the biostatistical literature have been as heavily debated as response-adaptive random-
ization (RAR). The controversy about its value in clinical trials has persisted over the decades ever
since it was first proposed in the early 1930’s. This lack of consensus and the extreme opposing views
surrounding it does not reflect well upon the biostatistical community. A clinical colleague consider-
ing it for an experiment can easily encounter pieces of work from an utmost enthusiasts and a firm
opponent, both of whom can be respected, experienced and well-known biostatisticians.
∗Address correspondence to Sofıa S. Villar, MRC Biostatistics Unit, University of Cambridge, IPH Forvie Site,Robinson Way, Cambridge CB2 0SR, UK; E-mail: [email protected]
1
arX
iv:2
005.
0056
4v2
[st
at.M
E]
6 J
ul 2
020
These conflicting views are not only detrimental to the use of RAR in practice (which is perhaps
unfortunate given the potential advantages it can offer), but can also be perceived as very confusing
for those to whom RAR is relatively new. This situation has motivated us to write this paper with the
intention that it becomes a “must read” paper for those who, like ourselves, have found some of these
arguments to be partial or confusing. We also present a simulation study to compare some properties
of recently proposed response-adaptive procedures in the literature.
Our intention is that this paper stimulates deep and critical thinking about experimental goals in
challenging contexts, rather than merely following simplistic views on a complex subject. This paper
makes the case for the importance of careful thinking about how to deliver experimental goals through
the appropriate use of trial adaptations such as RAR.
1.1 Background
Randomization as a method to allocate patients to treatments in a clinical trial has long been con-
sidered a defining element of a well-conducted study, ensuring comparability of treatment groups,
mitigating selection bias, and additionally providing the basis for statistical inference (Rosenberger
and Lachin, 2016). In clinical trials practice, a fixed randomization probability through the trial (most
often an equal probability) is still the most popular randomization procedure in use. An alternative
mode of patient allocation is known as RAR, in which randomization probabilities are altered during
the trial based on the accrued data on responses, with the aim of achieving different experimental
objectives while ideally (though not necessarily) preserving inferential validity. Such experimental
objectives may include: early selection of a promising treatment among several candidates, increasing
the power of a specific treatment comparison and/or assigning more patients to an effective treatment
arm during a trial.
RAR (also known as outcome-adaptive randomization) has been a fertile area of methodological
research over the past three decades, with books and many papers in top statistical journals being
published on the subject (which becomes evident from the references section of this paper). Despite
this, the uptake of RAR in practice remains disproportionately slow in comparison with the theoretical
attention it has received, and it continues to stand as a controversial and highly debated issue within
the statistics community. These debates tend to intensify and multiply during health care crises
such as the Ebola outbreak (Brittain and Proschan, 2016; Berry, 2016) or the current COVID-19
pandemic (Proschan and Evans, 2020; Angus et al., 2020; Magaret, 2020). Unfortunately, such debates
are mostly geared towards presenting arguments to justify one-sided positions around the use of RAR
2
in clinical trials, and are usually highly technical which makes it challenging for a non-expert to
follow. None of these debates provide a fair, updated and balanced discussion over the merits and
disadvantages of using this wide adaptive design class. Such a well balanced discussion is very much
needed in practice for a non-expert to make a knowledgeable decision of its use for a specific experiment
and it is what we attempt to do in this piece of work.
Such conflicting and one-sided views, as published in the modern literature, are illustrated by the
quotations below and constitute one of the initial drivers and main motivations for writing this paper.
Outcome adaptive randomization has several undesirable properties. These include a high
probability of sample size imbalance in the wrong direction . . . it produces inferential
problems that decrease potential benefit to future patients, and may decrease benefit to
patients enrolled in the trial . . . For randomized comparative trials to obtain confirmatory
comparisons, designs with fixed randomization probabilities and group sequential decision
rules appear to be preferable to RAR, scientifically, and ethically. (Thall et al., 2015b)
. . . optimal response-adaptive randomization designs allow implementation of complex op-
timal allocations in multiple-objective clinical trials and provide valid tools to inference in
the end of the trial. In many instances they prove superior over traditional balanced ran-
domization designs in terms of both statistical efficiency and ethical criteria. (Rosenberger
et al., 2012)
Response adaptive randomization (RAR) is a noble attempt to increase the likelihood
that patients receive better performing treatments, but it causes numerous problems that
more than offset any potential benefits. We discourage the use of RAR in clinical trials.
(Proschan and Evans, 2020)
The lingering lack of consensus within the specialized literature is perhaps one of the most in-
fluential reasons to explain why the use of RAR procedures remains rare in clinical trials practice.
The one-sided positions on the use of these methods persist, despite recent methodological develop-
ments directly addressing past criticisms and providing guidance for selecting appropriate procedures
in practice. Specifically, over the last 10 years there have been additional theoretical advances which
– to the best of the authors’ knowledge – are currently not included in any published review of
response-adaptive methods.
In parallel to this, response-adaptive procedures have boomed in machine learning applications
(Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012; Kaufmann and Garivier, 2017; Kaibel and Biemann,
2019; Lattimore and Szepesvari, 2019) where the uptake and popularity of Bayesian RAR ideas (also
known as Thompson sampling) has been incredibly high. Their use in practice has been associated
with substantial gains in system performances. In the clinical trial community, a crucial development
3
has been the success of some well-known biomarker led trials, such as I-SPY 2 (Rugo et al., 2016; Park
et al., 2016) or BATTLE (Kim et al., 2011). The goal of these trials was to learn which subgroups (if
any) benefit from a therapy and to then change the randomization ratio to favour patient allocation in
that direction. These trials have set new precedents and expectations which, contrary to what ECMO
trials did to RAR in the 1980s (see the Section 2), are increasingly driving investigators to the use of
RAR to enhance the efficiency and ethics of their trial designs. These trials also show that practically
and logistically, the use of RAR is clearly feasible, at least in areas such as oncology.
Both in the machine learning literature and in the recent trials cited above, the methodology
used in these applications is a class of the larger family of adaptive methods (i.e. some of which are
response-adaptive but not randomized). However, most of the recent general criticisms and praise for
response-adaptive methods in clinical trials has been driven mainly by arguments that only apply to
a very specific subclass within these methods. This paper aims to contribute to the current discussion
by providing an updated and broad critical review, with a summary of some of our thoughts on
this debate. We also compare and discuss recently proposed RAR procedures with a new simulation
study that (to the best of our knowledge) is not found in the literature. We believe this work is
timely and highly needed to support those considering RAR as a potential defining element of clinical
trials to be run in the current COVID-19 pandemic, especially given the conflicting views in the
literature (Proschan and Evans, 2020; Magaret, 2020).
We start by providing an updated historical overview of RAR (Section 2) and continue by a sum-
mary of the classification and comparison of RAR procedures (Section 3). We then present popular
beliefs published in the literature about RAR and we critically discuss each of them (Section 4). Fi-
nally, we give a brief summary of our own opinions on the future of RAR related research and some
general considerations in Section 5.
2 A historical perspective on RAR
“Those who cannot learn from history are doomed to repeat it.” (Paraphrase of an apho-
rism by George Santayana)
We believe it is important to start our review by tracing and reporting the highlights of the historical
development of RAR. This history can be presented by splitting it into the tale of two very distinct ar-
eas: theory and practice. While a large amount of high quality theoretical work has accumulated over
the years, RAR in practice has been marked by only a few highly influential examples. In particular,
4
our view is that we should use both early controversies such as the infamous ECMO trial (see below)
and recent successes such as I-SPY 2 to learn when and how RAR might be used appropriately, rather
than dismissing or encouraging using any kind of RAR on the basis of a single example. Hence, in
this section, we give a high level historical overview of RAR in terms of key methodology and its use
in practice for clinical trials, with a timeline shown in Figure 1. We also include some more recent
developments, which have not been included in previously published historical overviews.
2.1 RAR methodology literature
The origins of response-adaptive procedures can be traced back to Thompson (1933), who first sug-
gested to allocate patients to the more effective treatment arm via a posterior probability computed
using interim data. This work motivated procedures (commonly known as Thompson sampling) which
are used to allocate resources in many modern application areas. Another early method was the play-
the-winner rule, proposed by Robbins (1952) and then Zelen (1969). In this non-randomized rule,
a success on one treatment leads to a subsequent patient being assigned to that treatment, while a
failure leads to a subsequent patient being assigned to the other treatment.
RAR also has roots in the methodology for sequential stopping problems (where the sample size
is random, but with a stopping boundary), as well as bandit problems (where resources are allocated
to maximize the expected reward). Since most of the work in these areas have been non-randomized
(i.e. deterministic), we do not describe their development here. For the interested reader, Rosenberger
and Lachin (2016, Section 10.2) gives a brief summary of the history of both of these areas, and an
overview of multi-arm bandit models is presented in the review paper of Villar et al. (2015a). For a
review of non-randomized algorithms for the two-arm bandit problem, see Jacko (2019).
One of the first explicit uses of randomization in a response-adaptive treatment allocation pro-
cedure was the randomized play-the-winner (RPW) rule proposed by Wei (1978). The RPW rule
can be described as an urn model : each treatment assignment is made by drawing a ball from an
urn (with replacement), where the composition of the urn is updated based on the patient responses.
In the following decades, many RAR designs based on urn models were proposed, with a particular
focus on generalizing the RPW rule. We refer the reader to Hu and Rosenberger (2006, Chapter 4)
and Rosenberger and Lachin (2016, Section 10.5) for a detailed description. One example was the urn
model proposed by Ivanova (2003), called the drop-the-loser rule.
These urn-based RAR procedures are intuitive, but are not optimal in a formal sense. However,
from the early 2000s another perspective on RAR emerged that was based on optimal allocation
5
targets, which are derived as a solution to a formal optimization problem. For two-arm trials, a
general optimization approach was proposed by Jennison and Turnbull (2000) for normally-distributed
outcomes, and helped lead to the development of a whole class of optimal RAR designs. One early
example is the work of Rosenberger et al. (2001) for trials with binary outcomes. Further examples of
optimal RAR designs can be found in Section 4.2. In order to achieve the desired optimal allocation
targets, a key development was the modification by Hu and Zhang (2004) of the double adaptive
biased coin design (DBCD), which was originally described by Eisele (1994). Subsequent theoretical
work by Hu and Rosenberger (2006) focused on asymptotically best RAR procedures, which led to
the development of the class of efficient response adaptive randomization designs (ERADE) proposed
by Hu et al. (2009).
All of the RAR procedures described above are myopic, in the sense that they only use past ob-
servations to determine the treatment allocation for the next patient, without considering the future
patients to be treated and the information they could provide. A more recent methodological devel-
opment has been the proposal of non-myopic or forward-looking RAR procedures, which are based
on solutions to the multi-bandit problem. The first such fully randomized procedure was proposed
by Villar et al. (2015b) for trials with binary responses, with subsequent work by Williamson et al.
(2017) accounting for an explicit finite time-horizon. Even more recently, forward-looking RAR pro-
cedures have been proposed for trials with normally-distributed outcomes as well (Williamson and
Villar, 2020).
2.2 RAR in clinical practice
One of the earliest uses of RAR in clinical practice was the Extracorporeal Circulation in Neonatal
Respiratory Failure (ECMO) trial, performed in Michigan by Bartlett (1985). This trial used the RPW
rule on a study of critically ill babies randomized either to ECMO or to the conventional treatment. In
total, 12 patients were observed: 1 in the control group, who died, and 11 in the ECMO group, who all
survived. This extreme imbalance in sample sizes was a motivation for running a second randomized
ECMO trial, carried out in Harvard by Ware (1989), where patients were assigned treatments using
fixed randomization but with early stopping for futility. The ECMO trials have been the focus of much
debate. Up to date, the two papers above have accrued over 1000 citations. Indeed, to this day the
first ECMO trial is regarded as a key example against the use of RAR in clinical practice, due to the
extreme treatment imbalance and highly controversial interpretation (Rosenberger and Lachin, 1993;
Burton et al., 1997). In large part due to the controversy around the ECMO trials, there was little use
6
of RAR in clinical trials for the subsequent 20 years. One exception was the Fluoxetine trial (Tamura
et al., 1994), which again used the RPW rule, but with a burn-in period to ensure that there would
not be too few controls. However, more recently there have been several high-profile clinical trials
that use Bayesian RAR, in the spirit of Thompson (1933).
Two important examples in oncology are the BATTLE trials and the I-SPY 2 trial. The BATTLE
trials (Zhou, 2008; Kim et al., 2011; Papadimitrakopoulou, 2016) used RAR based on a Bayesian hier-
archical model, where the randomization probabilities are proportional to the observed efficacy based
on the patient’s individual biomarker profiles. Similarly, the I-SPY 2 trial (Barker, 2009; Carey, 2016;
Rugo et al., 2016; Park et al., 2016) used RAR based on Bayesian posterior probabilities, which are
specific to different biomarker signatures. These oncology trials have generated valuable discussions
about the benefits and drawbacks in using RAR in clinical trials (Das, 2017; Korn, 2017; Marchenko,
2014; Siu, 2017). We discuss some of these further in Section 4.
1933
Thompsonsampling
1978
RPW rule
1985
ECMO trial
1994
Fluoxetinetrial
2000
J&Toptimizationapproach
2001
RSIHR paper
2004
DBCD
2008
BATTLE trial
2009
ERADE
2010
I-SPY 2 trial
2015
Non-myopicRAR
Figure 1: Timeline summarizing some of the key developments around the use of RAR in clinicaltrials. J&T = Jennison and Turnbull (2000), RSIHR = Rosenberger et al. (2001).
3 Classifying and comparing RAR procedures
As noted in the historical perspective provided in Section 2, the adoption of the RPW rule in the
ECMO trial has been used as a key example against the use of all RAR procedures in clinical trials.
In reality, RPW is just one example of a RAR procedure out of many other possible ones for a
clinical trial. Many papers have overlooked this fact when criticizing the use of RAR in general using
arguments that apply to a specific RAR procedure and hence, perhaps unintentionally, depreciated the
value of other RAR procedures that are markedly different from those criticized in that work (Villar et
al., 2020). More generally, the literature reviewed for writing this paper provided abundant examples
where, like with RPW and the ECMO trial, broad conclusions for the use of RAR were made despite
only being based on a subset of possible procedures.
Our aim for this section is to help correct for this imbalance by presenting several families of RAR
procedures, discussing how they fit different classification criteria and describing possible meaningful
7
ways to assess RAR procedures in the literature. We hope this section will provide a basis for readers
to navigate the many existing approaches and compare them when considering their use for a specific
application at hand. However, we must warn the reader that given the wealth and breadth of existing
RAR procedures, attempting to provide classification criteria that are exhaustive and completely dif-
ferentiates them all may well be an elusive goal. As will become apparent below, even if we could do
such a thing, it could be very quickly become obsolete again given the current pace of development in
the area (Villar et al., 2020).
3.1 Taxonomies of RAR
The vast number of different RAR procedures is a challenge that non-experts and experts alike may
face with when exploring the RAR literature. The specialized literature has accumulated over the
years using differing criteria and jargon to classify and describe RAR procedures. As well, the many
new developments over the past two decades has meant that even experts in the field can find it
difficult to continually keep up to date with the literature. All this can become confusing and lead to
over-generalizations very easily.
At the broadest level, adaptive randomization procedures have been classified by Hu and Rosen-
berger (2006) as follows: response-adaptive randomization (RAR), covariate-adaptive randomization
(CAR), and covariate-adjusted response-adaptive randomization (CARA). The data used to deter-
mine the allocation probabilities are reflected in the names of the procedures, i.e. using the accrued
information of the response variable and/or covariate(s), in light of the objectives of using a particular
randomization procedure. In this review, we focus specifically on RAR procedures. Of course, many
of the issues we subsequently discuss for RAR are applicable to some degree to CARA, but this is
beyond the scope of this paper. For a in-depth discussion of the use of covariates in randomization,
we refer the reader to the review paper by Rosenberger and Sverdlov (2008).
Optimal and design-driven RAR
For the class of RAR procedures, a classification into two families has previously been proposed
by Rosenberger and Lachin (2002); Hu and Zhang (2004):
(1) response adaptive randomization which is based on an optimal allocation target, where a specific
criterion is optimized based on a population response model.
E.g. The ‘optimal’ allocation of Rosenberger et al. (2001) for binary responses. A formal opti-
8
mization problem is defined based on the population response model and the inference at the
end of the trial: the power of the trial (using a Z-test for the difference in proportions) is fixed
and the expected number of treatment failures is minimized (see Section 4.2 for further details).
(2) design-driven response adaptive randomization, where rules are established with intuitive moti-
vation, but are not optimal in a formal sense.
E.g. The RPW rule for binary responses. The rules for computing and choosing the allocation
probability can be formulated using an intuitive urn-based model (see Wei (1978) for full details).
A key difference between the two families is in the characteristics of the criterion or rules that form
the basis of the computation of allocation probabilities. More specifically, approaches belonging to
family (1) rely on optimizing an objective function that describes aspects of the population response
model explicitly, whereas those belonging to family (2) rely on intuitive motivation that is not defined
analytically from the population response model.
However, while the classification above may have been suitable when it was first proposed, it is
arguably overly simplistic and somewhat obsolete given the methodological developments in the RAR
literature since then. There are many RAR procedures that do not fit neatly into either family. One
difficulty is around the use of ‘optimal’ and ‘optimized’ in the definitions of the two families, as we
now discuss.
Starting first with bandit-based designs, these are based on an optimization approach but do not
explicitly target a pre-specified optimal allocation ratio like in family (1). Rather, they seek to find
the best possible allocation to maximise the expected total reward (i.e. number of successes), which is
unknown beforehand. Considering the forward-looking Gittins index (FLGI) rule as an example (Villar
et al., 2015b), this trades off a small deviation in the objective function to give a fully randomized
RAR procedure, which further complicates the classification of this rule. Similarly, the bandit-based
design proposed by Williamson et al. (2017) is optimal in terms of expected number of successes,
subject to constraints.
Another concept is that of asymptotic optimality, of which there are several notions. An example
where this is relevant is the previously-mentioned Thompson sampling (see also Section 4.1). Rela-
tively recently, this was shown to be asymptotically optimal in terms of minimizing cumulative regret
(see e.g. Kaufmann et al. (2012)). In finite samples though, arguably Thompson sampling (and its
9
generalization proposed by Thall and Wathen (2007)) is closer to having an intuitive motivation based
around assigning more patients to the better arm, as calculated using the posterior distributions of
the success probabilities.
The discussion above demonstrates that the broad-brush classification of RAR procedures into
‘optimal’ vs ‘design-driven’ or ‘intuitive’ requires further clarification around what is meant by these
terms. Additionally, these subtle differences within one class may come with substantial differences
in the resulting operating characteristics in an identical trial context (see Section 4.1 for example).
An alternative classification scheme would be to describe RAR procedures in terms of the broad
methodological ‘families’ they belong to, e.g. urn models, bandit-based designs, procedures based on
Thompson sampling, or procedures that target a pre-specified (optimal) allocation ratio. However,
this classification is by no means watertight, as RAR procedures could conceivably belong to more
than one family, and new types of RAR are continuously being developed.
Bayesian and frequentist procedures
Another way of classifying RAR procedures is based on the school of statistics, either frequentist
or Bayesian, that is used for its design (see below for discussion about analysis). In general, a RAR
procedure depends on unknown parameter(s), which affects either the construction of the optimiza-
tion problem or the computation of the allocation probabilities. When this problem arises, a Bayes
approach could be employed at the design stage of the trials. We suggest the following definitions:
A randomization procedure can be classified as Bayesian when a prior distribution is explicitly
incorporated into the design criteria/optimization problem and/or into the calculation of the allocation
probability
E.g. the optimal RAR design proposed by Cheng and Berry (2007) is based on a Bayesian
decision-analytic approach; Sabo (2014) propose the use of a decreasingly informative
priors for Bayesian RAR.
Meanwhile, a randomization procedure can be classified as frequentist when no prior distribution is
incorporated into the design problem/ optimization problem and/or a frequentist approach is used for
estimating the unknown parameter(s) in the allocation probability.
E.g. the generalized biased coin design by Smith (1984) and Neyman allocation (see Sec-
tion 4.2).
This classification into Bayesian and frequentist designs does not completely differentiate RAR pro-
cedures. In some cases a frequentist approach (using the above definition) corresponds to a Bayesian
10
approach with a specific prior distribution, e.g. the RPW rule (Atkinson and Biswas, 2014, pg. 271).
This is analogous to the situation where the posterior mode coincides with the maximum likelihood
estimator (MLE) when a uniform prior probability is used.
Alternatively, RAR procedures could be defined as frequentist or Bayesian with respect to the
inference procedure used. In our opinion, this approach may not be helpful for someone who is not
familiar with the literature on RAR, since the inference procedure greatly depends on the goal of the
trial and is very much influenced by regulators’ preference between these two approaches. Moreover,
some innovative trial approaches can have Bayesian design aspects but the inference procedure focuses
on the frequentist operating characteristics. This does not add value for classifying RAR procedures
since the objective(s) of RAR hardly cover all aspects of the inference. Readers interested in un-
derstanding the pros and cons of frequentist and Bayesian inference are referred to materials such as
Press (2005); Wagenmakers et al. (2008); Samaniego (2010) as this falls outside the scope of our review.
Objectives of RAR procedures
Finally, an important aspect in which RAR procedures differ is the goal they are designed to
achieve. While some can consider competing objectives such as both efficiency and patient benefit (see
Section 3.2 for definitions), others might prioritize one over the other. Additionally, some procedures
might be non-myopic (i.e. accounting for future patients) while others might be myopic. We also note
that for some RAR procedures, such as the those targeting an optimal allocation, the optimization
problem can account for multiple objectives (see e.g. Hu et al. (2015)). There are many possible
goals that a RAR procedure can aim to achieve, which may also be constrained by e.g. computational
considerations.
We encourage the reader to consider what the different classifications of RAR procedures might
mean in the context of their specific trial. In the next section, we discuss various ways of assessing
the performance of RAR procedures.
3.2 Assessing the performance of RAR procedures
Another area where there has been a lack of consistency and clarity is around how RAR procedures are
assessed. Most metrics reported in the literature seem to be focusing on a specific goal, and there has
been a myriad of ways of describing the properties of a RAR design. This section reviews some of the
most commonly used options for evaluating RAR procedures. Rather than providing an exhaustive
list of all the possible metrics available for comparing variants of RAR, we debate their usefulness and
11
attempt to provide a basis for evaluating them.
An additional reason why we believe this debate is useful is that when comparing two published
results using RAR, an important caveat is that different metrics could have been used and this needs
to be factored in the conclusions of the comparison. This leads us to emphasize that it is extremely
important to report not only the details of the procedures used and how they have been defined, but
also to carefully describe the specification of the metrics used in their evaluation to allow for adequate
comparisons to be made. We also encourage investigators to emphasize when their findings may not
apply to all the different RAR families. This should reduce the chances of readers misunderstanding
the scope of the pros and cons for a family of RAR procedures.
One key example is the use of ‘efficiency’ and ‘patient benefit’ when describing RAR procedures.
Efficiency can correspond to the power of a trial (see below), but can also correspond to the precision
of the estimates for the treatment effectiveness (Flournoy et al., 2013). Similarly, patient benefit can
mean assigning patients to a more effective treatment arm based on the accrued data, or it can also
include the benefit for patients who are outside the trial (but who could potentially benefit from the
trial results). These are important caveats that can have a large impact on the choice of a RAR
procedure.
In Section 4, we aim to identify and clarify some frequent misconceptions around RAR. In partic-
ular, when addressing them we shall use the following most commonly reported metrics: power, type I
error, and bias. In the RAR literature, we find that ‘power’ is often perceived as a frequentist prop-
erty, i.e. the probability of rejecting a null hypothesis when the true parameter follows the alternative
distribution. Similarly, the ‘type I error’ is defined as the probability of rejecting a null hypothesis
when the true parameter follows the null distribution. Moving away from the frequentist definitions,
some authors defined power (respectively type I error) as the probability of satisfying a criterion that
reflects the goal of treatment comparisons. Typically, this is found using simulation studies under the
alternative (null) scenarios.
There are additional metrics that could be reported and have been nevertheless more infrequently
used. For example, in multi-arm trial settings, power can reflect the goal of a trial, such as selecting
the best experimental treatment or declaring at least one experimental treatment effective; with the
false positive rate or familywise error rate as the generalization of type I error. The ‘power’ of a
multi-armed trial can have multiple definitions, and needs to be clearly stated when reporting results.
For instance, pairwise power, marginal power, experiment-wise power and disjunctive power are all
used as definitions of ‘power’. However, it is possible for a RAR procedure to have a high power
according to one definition but not according to another.
12
On the other hand, ‘bias’ is often defined as a property of an estimator which reflects how different
an estimate is from the true underlying parameter. An estimate may be biased due to the rule used for
calculating the estimand, or due to the heterogeneity in the observed data. Heterogeneity corresponds
to the situation where the data may not come from the same underlying distribution. A key example
is the presence of time trends, which can cause a different treatment effect for patients who were
enrolled at different time points during the trial. We return to this issue in Section 4.4.
Apart from the properties described above, some authors report the following when presenting
their simulation results:
• the expected number of treatment failures/successes (for binary outcomes) or the expected total
response (for continuous outcomes) in the trial
• the probability of selecting a truly effective experimental arm
• summary statistics of the sample size per treatment arm, with a particular focus on the allocation
to a superior arm (where this exists)
• the probability of a treatment arm stopping early for futility or for efficacy
• the probability of sample size imbalance (see Section 4.1).
Summary
Unfortunately, there is no perfect RAR procedure, in the sense that it will be superior to any other
in terms of all of these operating characteristics and in any context. This fact makes the need for a
careful consideration when deciding which randomization approach is best suited for a specific clinical
trial (and its goals) even more important. A RAR procedure should be chosen carefully according to
the specific context and specific goals of a trial, in light of the practical challenges and constraints that
implementing RAR poses. We will discuss some practical issues when implementing a RAR procedure
in Section 4.5.
4 Popular beliefs about RAR
In this section, we critically examine a number of popular beliefs that have been published on RAR.
Some of these beliefs can be rationally justified in particular scenarios, but they most certainly do
not apply to all types of RAR procedures and/or all trial settings. Our aim is to reexamine some of
13
these statements to provide a more balanced view of the use of RAR procedures, which acknowledges
potential problems and disadvantages, but also emphasizes the potential solutions and advantages.
We also perform a new simulation study in Section 4.1 that compares and discusses some properties
of more recently proposed RAR procedures. By avoiding extreme views and generalizations (in either
direction), we hope to provide some clarity around the use of RAR in practice and avoid the creation
of myths around it.
In what follows, we tend to focus on specific examples of RAR procedures in order to illustrate
that the popular beliefs on RAR do not always hold in general. As already alluded to, the examples
used below are by no means presented with the intention of being seen as the ‘best’ RAR procedures,
or even necessarily recommended for use in practice – such judgments critically depend on the context
and goals of the specific trial under consideration. Apart from Section 4.1, we do not carry out new
simulations for each of the popular beliefs presented below, but instead point the reader to the many
existing simulation studies in the literature.
4.1 Does RAR lead to a substantial chance of allocating more patients to an
inferior treatment?
Thall et al. (2015a) described a number of undesirable properties of RAR, including the following:
. . . there may be a surprisingly high probability of a sample size imbalance in the wrong
direction, with a much larger number of patients assigned to the inferior treatment arm,
so that AR has an effect that is the opposite of what was intended.
Thall et al. (2015a) illustrated this through simulation studies of two-arm trials. They showed that
Thompson sampling can have a substantial chance (up to 14% for the parameter values considered)
of producing sample size imbalances of more than 20 patients in the wrong direction (i.e., the inferior
arm) out of a maximum sample size of 200. While this result is true for a single RAR procedure,
these conclusions may not hold in general for other types of RAR. Indeed, Thall and Wathen were
among the first to compute this metric of sample size imbalance. Much of the RAR literature has not
reported this metric (or related ones), and therefore it is unclear how other families of RAR procedures
perform with regard to sample size imbalance.
To help fill this gap in the literature, we perform a simulation study looking at a broad range of RAR
procedures in the simple two-arm trial setting with a binary outcome. Let pA and pB denote the true
probabilities of a successful outcome for patients on treatments A and B respectively. Similarly, let NA
14
and NB denote the total (random) number of patients assigned to each treatment, with n = NA +NB
denoting the total sample size (which is fixed). We compare the following patient allocation procedures:
• Fixed randomization: uses an equal, fixed probability to allocate patients to each treatment, i.e.
50% in a two-arm trial.
• Permuted block randomization [PBR] : patients are randomized in blocks to the two treatments,
so that exact balance is achieved at the completion of each block (and hence at the successful
completion of the trial).
• Thall and Wathen procedure [TW(c)] : given the data observed so far, the Thall and Wathen
procedure randomizes the next patient to treatment B with probability
rB =[P (B > A)]c
[P (B > A)]c + [1− P (B > A)]c
Here P (B > A) is the posterior probability that treatment B is better than treatment A.
The parameter c controls the variability of the resulting procedure. Setting c = 0 gives fixed
randomization, while setting c = 1 gives Thompson sampling. Thall and Wathen (2007) suggest
setting c equal to 1/2 or m/(2n), where m is the current sample size and n is the total (or
maximum) sample size of the trial.
• Randomized Play-the-Winner Rule (PRW): an urn-based design, where each patient allocation
is determined by drawing a ball from an urn (with replacement), and the composition of the urn
is updated based on patient responses. See Wei (1978) for full details.
• Drop-The-Loser rule (DTL): an urn model proposed by Ivanova (2003), which is a generalization
of the RPW rule.
• Doubly-Biased Coin Design (DBCD): a response-adaptive procedure targeting the optimal allo-
cation ratio of Rosenberger et al. (2001) (see Section 4.2). For full details, see Hu and Zhang
(2004).
• Efficient Response Adaptive Randomization Designs (ERADE): a response-adaptive procedure
targeting the optimal allocation ratio of Rosenberger et al. (2001), which attains the lower-bound
on the allocation variances. See Hu et al. (2009) for further details.
• Gittins Index (GI): gives the optimal assignment rule that maximizes the expected total (dis-
counted) number of successes (i.e. a two-arm bandit problem) in an infinite horizon. Note this is
15
a non-randomized (i.e. deterministic) rule. For further details, see e.g. the review paper by Villar
et al. (2015a).
• Forward-looking Gittins Index (FLGI): a modification of the Gittins index proposed in Villar
et al. (2015b), that gives a fully randomized RAR procedure with near-optimal patient benefit
properties. This depends on a block size b for randomization.
• Oracle: hypothethical rule that assigns all patients to the best-performing arm, requiring knowl-
edge of the true response probabilities.
In our simulations, we set pA = 0.25 and vary the values of pB and the total sample size n. Unlike
in Thall et al. (2015a), we do not include early stopping in order to isolate the effects of using RAR
procedures. We evaluate the performance of the procedures in terms of the mean (2.5 percentile, 97.5
percentile) of the sample size imbalance NB−NA; the probability of a imbalance of more than 10% of
the total sample size in the wrong direction π0.1 = Pr(NA > NB + 0.1n); and the expected number of
successes (ENS) and its standard deviation. Note that our measure of π0.1 coincides with the measure
used in Thall et al. (2015a) when n = 200.
Table 1 shows the results when pB = 0.35 and n ∈ {200, 654}. Starting with n = 200, we see that
Thompson sampling has a substantial probability (almost 14%) of a large imbalance in the wrong
direction, while using the Thall and Wathen procedure reduces this probability, which all agrees with
the results of Thall et al. (2015a). Unsurprisingly, the aggressive bandit-based procedures (GI and
FLGI) also have relatively large values of π0.1, although these are still less than that for Thompson
sampling. Note that even fixed randomization has about a 7% chance of a large imbalance in the
wrong direction, which arguably is a fairer baseline comparison for the procedures mentioned above.
In contrast, the RPW, DBCD, ERADE and DTL procedures all have negligible values of π0.1 of 1%
or less, which is also reflected in the confidence intervals for the sample size imbalance.
An important factor is the choice of the total sample size. Setting n = 200 means that the trial
has low power to declare treatment B superior to treatment A. Indeed, if the sample size is chosen so
that fixed randomization yields a power of greater than 80% (when using the standard Z-test), then
n needs to be at least 654. When n = 654, Table 1 shows that the values of π0.1 are substantially
reduced for Thompson sampling, the Thall and Wathen procedure and the bandit-based procedures.
Looking at the confidence intervals for sample size imbalance, we see that Thompson sampling and
the bandit-based procedures still have a small risk of getting ‘stuck’ on the wrong treatment arm.
Another important factor is the magnitude of the difference between pA and pB. The scenario
16
Table 1: Properties of various patient allocation procedures, where pA = 0.25 and pB = 0.35. Resultsare from 104 trial replicates.
n Procedure NB −NA π0.1 ENS
200 Fixed 0 (-28, 28) 0.069 60 (6.4)
PBR 0 0 60 (6.4)
Oracle 200 0 70 (6.7)
Thompson 95 (-182, 190) 0.137 65 (8.5)
GI 115 (-182, 190) 0.119 66 (8.5)
FLGI(b = 5) 114 (-176, 190) 0.111 66 (8.3)
FLGI(b = 10) 115 (-172, 190) 0.100 66 (8.2)
TW(1/2) 74 (-90, 174) 0.085 64 (7.5)
TW(m/2n) 49 (-20, 120) 0.037 64 (7.3)
RPW 14 (-16, 44) 0.011 61 (6.5)
DBCD 17 (-10, 46) 0.003 61 (6.4)
ERADE 16 (-6, 42) 0.000 61 (6.4)
DTL 14 (-4, 32) 0.000 61 (6.6)
654 Fixed 0 (-50, 50) 0.005 196 (11.7)
PBR 0 0 196 (11.6)
Oracle 654 0 229 (12.2)
Thompson 461 (-356, 640) 0.042 220 (17.0)
GI 497 (-630, 645) 0.067 222 (19.6)
FLGI(b = 5) 511 (-619, 645) 0.054 222 (18.5)
FLGI(b = 10) 511 (-617, 645) 0.051 222 (18.0)
TW(1/2) 384 (44, 594) 0.011 215 (14.2)
TW(m/2n) 272 (54, 456) 0.010 216 (14.1)
RPW 46 (-8, 100) 0.000 199 (11.8)
DBCD 55 (8, 106) 0.000 199 (11.8)
ERADE 54 (16, 96) 0.000 199 (11.7)
DTL 46 (14, 80) 0.000 198 (11.7)
considered above with pA = 0.25 and pB = 0.35 is a relatively small difference (as shown by the
large sample size required to achieve a power of 80%), and the more aggressive rules would not be
expected to perform well in terms of sample size imbalance in this case. Table 2 shows the results
when pB = 0.45 and n = 200. The values of π0.1 are substantially reduced for Thompson sampling
as well as the Thall and Wathen and bandit-based procedures, and are now much less than for fixed
17
randomization. In terms of the mean and confidence intervals for NB −NA, these are now especially
appealing for FLGI.
Table 2: Properties of various patient allocation procedures, where pA = 0.25 and pB = 0.45. Resultsare from 104 trial replicates.
n Procedure NB −NA π0.1 ENS
200 Fixed 0 (-28, 28) 0.069 70 (6.8)
PBR 0 0 70 (6.6)
Oracle 200 0 90 (7.0)
Thompson 147 (-54, 194) 0.032 85 (9.5)
GI 166 (18, 194) 0.020 87 (8.8)
FLGI(b = 5) 165 (48, 194) 0.014 86 (8.6)
FLGI(b = 10) 164 (56, 192) 0.014 86 (8.6)
TW(1/2) 124 (24, 182) 0.008 82 (8.2)
TW(m/2n) 123 (22, 182) 0.008 82 (8.3)
RPW 30 (-2, 64) 0.000 73 (7.1)
DBCD 30 (6, 58) 0.000 73 (6.6)
ERADE 29 (8, 52) 0.000 73 (6.6)
DTL 30 (10, 50) 0.000 73 (7.1)
Figure 2 extends the above analysis by considering the value of π0.1 for a range of values of pB from
0.25 to 0.85. We see that for pB greater than about 0.4, the probability of a substantial imbalance in
the wrong direction is higher for fixed randomization than all of the other RAR procedures considered.
Figure 2 also demonstrates another weakness of using π0.1 as a performance measure. This probability
of imbalance increases for the RAR procedures considered as the difference pB − pA decreases, but as
this difference decreases, so does the consequence for assigning patients to the wrong treatment arm.
Indeed, looking at Table 1 we can see some trade-offs between sample size imbalance (as measured by
NB −NA and π0.1) and the ENS. The most aggressive RAR procedures (Thompson and FLGI) have
the highest ENS, which are in fact close to the highest possible ENS (the ‘Oracle’ procedure). However,
these procedures also perform the worst in terms of sample size imbalance. This demonstrates our
general point that careful consideration is needed by looking at a variety of performance measures,
and not only focusing on a single measure such as π0.1.
Finally, we note in passing that if sample size imbalance is of particular concern in a trial con-
text, there may be ways of including constraints to avoid imbalance in a similar way to the DBCD
and ERADE procedures, or indeed permuted block randomization procedures. One way of doing so
18
Figure 2: Plot of π0.1 for various RAR procedures as a function of pB, where pA = 0.25 and n = 200.Each data point is the mean of 104 trial replicates.
for bandit-based designs (for example) is to use the constrained optimization approach as described
in Williamson et al. (2017). Recently Lee and Lee (2020) have also proposed an adaptive clip method
that can be used in conjunction with the Thompson sampling based approach to minimize the chance
of assigning patients to inferior arms during the early part of a trial.
Summary
In summary, RAR procedures do not necessarily have a high probability of a substantial sample
size imbalance in the wrong direction, particularly when compared with using fixed randomization.
This probability crucially depends on the true parameter values, as well as the sample size of the trial.
As well, potential sample size imbalances need to be carefully evaluated in light of other performance
measures, such as patient benefit properties.
4.2 Does the use of RAR reduce statistical power?
One of the most popular belief about RAR procedures is that their use can reduce statistical power,
as stated in Thall et al. (2015b):
19
Compared with an equally randomized design, outcome AR . . . [has] smaller power to
detect treatment differences.
Similar statements can be found in Korn and Freidlin (2011a) and Thall et al. (2015a). Through
simulation studies, these papers show that a fixed randomized design can have a higher power than
one using a particular RAR procedure (see below) when the sample sizes are the same, or equivalently
that a larger sample size is required for RAR to achieve the same target power and type I error rate
as a fixed randomized design.
A common feature of these papers is that they only consider the Bayesian RAR procedure proposed
by Thall and Wathen (2007) (see Section 4.1 for a formal definition), which includes Thompson
sampling as a special case. The Thall and Wathen procedure is well-known, and is sometimes referred
to as ‘Bayesian Adaptive Randomization’ (BAR) without qualification. However, although the Thall
and Wathen procedure attempts to assign more patients to the better treatment while preserving
power, this is only established in an intuitive way. Hence, it should not be a surprise that there are trial
designs where using the procedure results in a lower power compared with using fixed randomization.
However, extending this conclusion to RAR in general is an over-generalization.
In what follows, we focus solely on power considerations, but it is important to note that in prac-
tice, maximizing power might not be the only trial objective. Also, for now we assume the use of
standard inferential tests to make power comparisons, which we return to in Section 4.3.
Two-arm trials
As discussed in Section 3.1, there are RAR procedures that formally target optimality criterion
reflecting the trial’s objectives including power. We again consider the two-arm trial setting with
a binary outcome, as described in Rosenberger and Hu (2004). The true success probabilities are
denoted pA and pB (with corresponding total number of patients NA and NB), and we define qA =
1− pA and qB = 1− pB. We assume that the usual Z-test for the difference in proportions is used.
One strategy is to fix the power of the trial and find (NA, NB) to minimize the total sample size n,
or equivalently to fix the total sample size n and find (NA, NB) to maximize the power. This gives
the following allocation ratio:
NA
n=
√pAqA√
pAqA +√pBqB
, NB = n−NA
which is known as Neyman allocation. The result illustrates that in general, it is not true that equal
allocation in a trial maximizes the power for a given sample size. This is a popular belief that appears
20
without qualification in many papers, such as the following published in the BMJ:
Most randomized trials allocate equal numbers of patients to experimental and control
groups. This is the most statistically efficient randomization ratio as it maximizes statis-
tical power for a given total sample size. (Torgerson and Campbell, 2000)
Such a statement is only true in specific settings, such as when comparing the difference in means of
two normally-distributed outcomes with the same known variance.
An issue with Neyman allocation is that if pA + pB > 1, then more patients will be assigned to the
treatment with the smaller success probability. This is clearly an ethical problem, and again highlights
the potential trade-off between power and patient benefit. Rosenberger et al. (2001) resolve this by
modifying the optimization problem. The solution gives an ‘optimal’ allocation that minimizes the
expected number of treatment failures (ENF) given a fixed power, or equivalently fixes the ENF and
maximizes the power. If the response probability of the treatment arms were known, then using this
optimal allocation ratio would therefore guarantee that (on average) the power of the trial is preserved.
For binomial outcomes (as well as survival outcomes), the model parameters in the optimization
problem are unknown and need to be estimated from the accrued data. These estimates can then
be used (for example) in the doubly-adaptive biased coin design (DBCD) (Hu and Zhang, 2004),
or the efficient response adaptive randomization designs (ERADE) (Hu et al., 2009) to target the
optimal allocation ratio. Using the DBCD in this manner, Rosenberger and Hu (2004) found in their
simulation studies that it was
. . . as powerful or slightly more powerful than complete randomization in every case and
expected treatment failures were always less
This is consistent with a general set of guidelines given by Hu and Rosenberger (2006) on which RAR
procedures should be used in a clinical trial, one of which is that power should be preserved. When
following these guidelines, Rosenberger et al. (2012) states that
Response-adaptive randomization should ensure that the expected number of treatment
failures is reduced over standard randomization procedures, and that power should be
slightly enhanced or maintained.
RAR procedures that achieve this aim have also been derived (in a similar spirit to the optimal
allocation above) for continuous (Zhang and Rosenberger, 2006) and survival (Zhang and Rosenberger,
2007) outcomes.
In summary, targeting the Neyman allocation leads to a higher power than fixed randomization
for binary and survival outcomes, which is implemented in practice using a sequential RAR approach
21
such as the DBCD or ERADE. Targeting the optimal allocation leads to the same (or slightly greater)
power than fixed randomization, but is potentially more ethically attractive than targeting the Ney-
man allocation. In both cases, there is no power loss when compared with fixed randomization even
if within the two-armed trial scenario.
Multi-arm trials
There are similar concerns about the reduction in power of multi-arm RAR procedures. For
example, Wathen and Thall (2017) simulate a variety of five-arm trial scenarios and conclude
In multi-arm trials, compared to equal randomization, several commonly used adaptive
randomization methods give much lower probability of selecting superior treatments.
Similarly, Korn and Freidlin (2011b) simulate a four-arm trial and find that a larger average sample
size is required when using a RAR procedure compared with fixed 1:1:1:1 randomization in order to
achieve the same pairwise power. Lee et al. (2012) reaches similar conclusions in the three-arm setting
when considering disjunctive power. However, again these papers only explore generalizations of the
Thall and Wathen procedure (the “commonly used adaptive randomization methods” quoted above)
for multi-arm trials, and these conclusions may not hold for other types of RAR procedures.
The optimal allocation described above for the two-arm setting can be generalized for multi-arm
trials, under the null hypothesis of homogeneity. The allocation is optimal in that it fixes the power of
the test of homogeneity and minimizes the ENF. This was first derived by Tymofyeyev et al. (2007),
who showed through simulation that for three treatment arms, using the DBCD to target the optimal
allocation
. . . provides increases in power along the lines of 2–4% [in absolute terms]. The increase in
power contradicts the conclusions of other authors who have explored other randomization
procedures [for two-arm trials]
Similar conclusions (for three treatment arms) are given by Jeon and Hu (2010), Sverdlov and Rosen-
berger (2013a) and Bello and Sabo (2016).
These optimal allocation procedures maintain (or increase) the power of the test of homogeneity,
but may have low marginal powers compared with equal randomization in some scenarios, as shown
in Villar et al. (2015b). However, even considering the marginal power to reject the null hypothesis
for the best treatment, Villar et al. (2015b) propose non-myopic RAR procedures (i.e. FLGI rules)
that in some scenarios have both a higher marginal power and a higher expected number of treatment
successes when compared with fixed randomization with the same sample size.
22
Finally, many of the power comparisons made throughout this section have been against fixed
randomization. Arguably a more interesting comparison would be to consider group-sequential (GS)
and multi-arm multi-stage (MAMS) designs with a fixed allocation for each treatment in each stage.
It is still unclear as to how the power of different RAR procedures compare with well-chosen GS
and MAMS designs. Although only focusing on RAR based on the Thall and Wathen procedure,
both Wason and Trippa (2014) and Lin and Bunn (2017) show that these RAR procedures can have
a higher power than MAMS designs when there is a single effective treatment. A recent publication
by Viele et al. (2020) also shows that in a multi-arm setting, i.e. without early stopping, the control
allocation plays a part in achieving the power of a study when a variant of the Thall and Wathen
procedure is implemented. The same group of researchers also explore other design aspects in con-
junction with varying the control allocation, and find that RAR can lead to reasonable power under
some settings (Viele et al., 2020).
Summary
In conclusion, if the aim is to maintain (or even increase) power compared to an equally random-
ized design, in many trial scenarios this can be achieved using some kinds of RAR procedures, while
still reducing (or maintaining) the number of treatment failures. However, the choice of the RAR
procedure is crucial, and needs to be made with the objectives of the trial in mind. Of course the
power of the trial is not the only consideration, and sometimes (such as in the rare disease setting)
the patient benefit properties may be much more important. If maintaining power is a key concern
though, then this need not be the sole rationale to use fixed randomization instead of RAR.
4.3 Can robust statistical inference be performed after using RAR?
As noted in Proschan and Evans (2020), the Bayesian approach to statistical inference allows the
seamless analysis of results of a trial that uses RAR. However, when using frequentist inference,
challenges can occur:
The frequentist approach faces great difficulties in the setting of RAR . . . Use of response-
adaptive randomization eliminates the great majority of standard analysis methods . . .
Rosenberger and Lachin (2016) also note that using RAR in a trial can make the subsequent
(frequentist) statistical inference more challenging
Inference for response-adaptive randomization is very complicated because both the treat-
ment assignments and responses are correlated.
23
This raises a key question: how does an investigator analyze a trial using RAR when using frequentist
inference? In particular, can standard statistical tests and regression techniques be used without
inflating the type I error rate? And are standard estimators of the treatment effect(s) biased? Without
clear answers to these questions, it is unsurprising that the challenge of statistical inference (within
the frequentist framework) is still seen as a key barrier to the use of RAR in clinical practice. In this
section, we aim to show that valid statistical inference, especially in terms of type I error rate control
and unbiased estimation, is possible for a wide variety of RAR procedures. Note that in what follows,
we do not consider the issue of time trends and patient drift, a separate discussion of which is given
in Section 4.4.
Perhaps the most straightforward approach to frequentist inference following a trial using RAR is
to simply use standard statistical tests and estimators without adjustment, in contrast to the quotation
above from Proschan and Evans (2020). The justification is that the asymptotic properties of standard
estimators and tests are preserved for a large class of RAR procedures, including those for the multi-
arm setting. Firstly, Melfi and Page (2000) proved that any estimator that is consistent when responses
are independent and identically distributed will also be consistent for any RAR procedure (under the
assumption that the number of observations for each treatment tends to infinity). Secondly, Hu and
Rosenberger (2006) showed that when the responses follow an exponential family, simple conditions
on the RAR procedure ensure the asymptotic normality of the MLE. The basic condition is that the
allocation proportions for each treatment arm converge in probability to constants in (0, 1), which also
implies that the RAR procedure does not make a ‘choice’ or select a treatment during the trial (and
hence the sample size in each arm can tend to infinity). Since many test statistics are just functions of
the MLE, this result implies that the asymptotic null distribution of such test statistics is not affected
by the RAR. Further asymptotic results for urn-based procedures are given in Hu and Rosenberger
(2006) and Zhang et al. (2011).
These asymptotic results are the justification for the first guideline given by Hu and Rosenberger
(2006) on RAR procedures, which states that
Standard inferential tests can be used at the conclusion of the trial.
Of course, relying on asymptotic results to use standard tests and estimators may not be valid for
trials without a sufficiently large sample size, and the effect of a smaller sample size on inference
is greater for more aggressive RAR procedures (see for example the results in Williamson and Villar
(2020)). As noted by Rosenberger et al. (2012), for some RAR procedures in the two-arm trial setting,
there has been extensive literature investigating the accuracy of the large sample approximations under
24
moderate sample sizes using simulation (Hu and Rosenberger, 2003; Rosenberger and Hu, 2004; Zhang
and Rosenberger, 2006; Duan and Hu, 2009). These papers showed that for the DBCD, sample sizes
of n = 50 to 100 are sufficient, while for urn models reasonable convergence was achieved for a sample
size of n = 100. For these procedures, Gu and Lee (2010) explored which asymptotic test statistic to
use for a clinical trial with a small to medium sample size and binary responses.
If the asymptotic results above cannot be used, either because of small sample sizes or because
the conditions on the RAR procedures are not met, then alternative small sample methods for testing
and estimation have been proposed. We summarize the main methods below, concentrating on type I
error rate control and unbiased estimation.
Type I error rate
One common method for controlling the type I error rate, particularly for Bayesian RAR proce-
dures, is a simulation-based calibration approach. Given a trial design that incorporates RAR and an
analysis strategy (e.g. a test statistic and stopping boundary), a large number of trials are simulated
under the null hypothesis. Applying the analysis strategy to each of these simulated trial realizations
gives a Monte Carlo approximation of the relevant type I error rates. If necessary, the analysis strategy
(e.g. the stopping boundaries) can then be adjusted to satisfy the type I error constraints. Variations
of this approach have been used in Wason and Trippa (2014); Wathen and Thall (2017); Zhang et
al. (2019) for example, all in the context of calibrating multi-arm Bayesian RAR procedures to have
correct type I error control. Applying this approach can be computationally intensive however.
A related approach is to use a re-randomization test, which is also known as randomization-based
inference. In such a test, the observed outcome data are treated as fixed, but the randomization
sequence is regenerated many times using the RAR procedure. For each replicate, the test statistic is
recalculated, and a consistent estimator of the p-value is given by the proportion of the randomization
sequences that give a test statistic as (or more) extreme than that observed. Intuitively, this is
valid because under the null hypothesis of no treatment differences, the treatment assignments and
outcome data are independent. Simon and Simon (2011) give commonly held conditions under which
the re-randomization test guarantees the type I error rate. Galbete and Rosenberger (2016) showed
that 15, 000 replicates appears to be sufficient to accurately estimate even very small p-values. A
key advantage of using re-randomization tests is that they can protect against unknown time trends,
as we will discuss further in Section 4.4. However, re-randomization tests can suffer from a lower
power compared with using standard tests (Villar et al., 2018), particularly if the RAR procedure has
allocation probabilities that are highly variable (Proschan and Dodd, 2019).
25
Both of the methods above are simulation-based, and hence there may be concerns about Monte
Carlo error as well as the computation burden of such tests. There have been a few proposals that do
not rely on simulation and which can be used for type I error control. Robertson and Wason (2019)
proposed a re-weighting of the usual z-test that guarantees familywise error control for a large class of
RAR procedures for multi-arm trials with normally-distributed outcomes, although with a potential
substantial loss of power. Galbete et al. (2016) derived the exact distribution of a test statistic for a
family of RAR procedures in the context of a two-arm trial with binary outcomes, and hence showed
how to obtain exact p-values.
Estimation bias
Although the MLEs for the parameters of interest are typically consistent for a trial using RAR,
for finite samples they will be biased in general. This can be seen for a number of RAR procedures
for binary outcomes in the simulation results given in Villar et al. (2015a), and for procedures based
on Thompson sampling in Thall et al. (2015b). However, the latter point out that in their setting,
which incorporates early stopping with continuous monitoring,
. . . most of the bias appears to be due to continuous treatment comparison, rather than
AR per se.
Hence it is important to distinguish between bias induced by early stopping and that induced by the
use of the RAR procedure itself.
A simple formula for the bias of the MLE for the response probability is given in Bowden and
Trippa (2017), which is valid for general multi-arm RAR procedures without early stopping. In the
common case where RAR assigns more patients to treatments that appear to work well, these results
show that the bias of the MLE will be negative. In addition, the magnitude of this bias is decreasing
with the number of patients assigned to the treatment. When estimating the treatment difference
however, the bias can be either negative or positive, which agrees with the results in Thall et al.
(2015b). More general characterisations of the bias of the sample mean (even with early stopping) for
multi-arm bandit procedures is given in Shin et al. (2019, 2020).
Bowden and Trippa (2017) showed that when there is no early stopping, the magnitude of the
bias tends to be small for the RPW rule and the Bayesian RAR procedure proposed by Trippa et
al. (2012). For more aggressive RAR procedures, the bias can be larger however, see Williamson and
Villar (2020) for an example. As a solution, Bowden and Trippa (2017) proposed methods to correct
for the bias of the MLE, using inverse probability weighting and Rao-Blackwellization, although these
26
can be computationally intensive. For urn-based RAR procedures, Coad and Ivanova (2001) also
proposed bias-corrected estimators for the response probability. For sequential maximum likelihood
procedures and the DBCD, Wang et al. (2020) evaluate the bias issue and propose a solution to resolve
this problem.
Finally, adjusted confidence intervals for RAR procedures have received less attention in the lit-
erature. Rosenberger and Hu (1999) proposed a bootstrap procedure for general multi-arm RAR
procedures with binary responses, using a simple rank ordering. Meanwhile, Coad and Govindarajulu
(2000) proposed corrected confidence intervals following a sequential adaptive design for a two-arm
trial with binary responses. Recent work by Hadad et al. (2019) gives a strategy to construct asymp-
totically valid confidence intervals for a large class of adaptive experiments (including the use of RAR).
Summary
In conclusion, for trials with sufficiently large sample sizes, asymptotic results justify the use of
standard statistical tests and frequentist inference procedures when using many types of RAR. When
asymptotic results do not hold, inference does become more complicated compared with using fixed
randomization, but there is a growing body of literature demonstrating how to control the type I
error rate, and corrections for the bias of the MLE have been proposed. All this should give increased
confidence that the results from a trial using RAR, if analyzed appropriately, can be both valid and
convincing. Finally, we reiterate that from a Bayesian viewpoint, the use of RAR does not pose any
additional inferential challenges.
4.4 Can RAR be used if there is potential for time trends or patient drift?
The issue of time trends caused by changes in the standard of care or by patient drift (i.e. changes
in the characteristics of recruited patients over time) is seen as a major barrier to the use of RAR in
practice when compared with using fixed randomization:
One of the most prominent arguments against the use of AR is that it can lead to biased
estimates in the presence of parameter drift. (Thall et al., 2015b)
A more fundamental concern with adaptive randomization, which was noted when it was
first proposed, is the potential for bias if there are any time trends in the prognostic mix
of the patients accruing to the trial. In fact, time trends associated with the outcome
due to any cause can lead to problems with straightforward implementations of adaptive
randomization. (Korn and Freidlin, 2011a)
27
Both papers cited above show (for procedures based on Thompson sampling) that time trends can
dramatically inflate the type I error rate when using standard analysis methods, and induce bias into
the MLE. Further simulation results on the impact of time trends for BAR procedures (in the context
of two-arm trials with binary outcomes) are given in Jiang et al. (2020). In Villar et al. (2018), a
comprehensive simulation study is given for different time trend assumptions for a variety of RAR
procedures in trials with binary outcomes (including the multi-arm setting).
Although all these papers show that time trends can inflate the type I error rate when using RAR
procedures, there are two important caveats given in Villar et al. (2018). Firstly, they conclude that
a largely ignored but highly relevant issue to consider is the size of the trend and its likelihood of
occurrence in a specific trial:
. . . the magnitude of the temporal trend necessary to seriously inflate the type I error of
the patient benefit-oriented RAR rules need to be of an important magnitude (i.e. change
larger than 25% in its outcome probability) to be a source of concern.
Secondly, they also show (through simulation) that certain power-oriented RAR procedures are ef-
fectively immune to time trends. In particular, RAR procedures that protect the allocation to the
control arm in some way are particularly robust.
A more general issue around time trends is that they can invalidate the key assumption that
observations about treatments are exchangeable (i.e., that subjects receiving the same treatment arm
have the same probability of success). This in term can invalidate commonly used frequentist and
Bayesian models, and hence the inference of the trial data. Type I error inflation and estimation bias
can be seen as examples of this wider issue.
As pointed out in Proschan and Evans (2020), temporal trends seem more likely to occur in two
settings:
. . . 1) trials of long duration, such as platform trials in which treatments may continually
be added over many years and 2) trials in infectious diseases such as MERS, Ebola virus,
and coronavirus.
Despite this, little work has looked at estimating these trends, especially when doing so to inform a
trial design choice in the midst of an epidemic. Investigating both of these points would be essential
to be able to make a sound assessment as to the value of using RAR in a particular setting or not.
Furthermore, as we now discuss, there are analysis methods that can prevent the type I error inflation
that those trends could create when combined with a RAR design.
As mentioned in the previous section, one method to correct for type I error inflation is to use a
re-randomization test (or more generally, randomization-based inference). Simon and Simon (2011)
28
proved that using a re-randomization test (under commonly-held conditions) guarantees type I error
control even under arbitrary time trends. Simulation studies illustrating this can be found in Galbete
and Rosenberger (2016) and Villar et al. (2018). However, the latter shows that using a randomization-
based inference can come at the cost of a considerably reduced power compared with using an unad-
justed testing strategy.
An alternative to randomization-based inference is to use a blocking or stratified analysis at the
end of the trial, as proposed in e.g. Coad (1992); Karrison et al. (2003) and Korn and Freidlin (2011a).
These papers show (though simulation) that a stratified analysis can eliminate the type I error inflation
induced through time trends. However, Korn and Freidlin (2011a) also showed that using block
randomization and subsequently block-stratified analysis can reduce the trial efficiency, in terms of
increasing the required sample size and the chance of patients being assigned to the inferior treatment.
Another approach is to explicitly incorporate time-trend information into the regression analysis.
For example, Coad (1992) modified a class of sequential tests to incorporate a linear time trend
for normally-distributed outcomes. Meanwhile, Villar et al. (2018) assessed incorporating the time
trend into a logistic regression (for binary responses), and showed that this can alleviate the type I
error inflation of RAR procedures, if the trend is correctly specified and the associated covariates are
measured and available. However, this can result in a loss of power and complicate estimation (due
to the technical problem of separation).
Finally, it is possible to try control the impact of a time-trend during randomization. Rosenberger
et al. (2011) proposed a covariate-adjusted response-adaptive procedure for a two-armed trial that can
take a specific time trend as a covariate. More recently, Jiang et al. (2020) proposed a BAR procedure
that includes a time trend in a logistic regression model, and uses the resulting posterior probabilities
as the basis for the randomization probabilities. This model-based BAR procedure controls the type I
error rate and mitigates estimation bias, but at the cost of a reduced power.
Summary
In summary, large time trends can inflate the type I error rate when using RAR procedures. How-
ever, not all RAR procedures are affected in this way, with those that protect the allocation to the
control arm being particularly robust. For other types of RAR procedures, methods have been de-
veloped to mitigate the type I error inflation caused by time trends, although these tend to result in
a loss in power. Finally, it is important to note that time trends can affect inference in all types of
adaptive clinical trials, and not just those using RAR.
29
4.5 Practical considerations: Is RAR more challenging to implement?
If an investigator has decided to adopt a certain type of RAR, there is still a decision to be made as
to how to best implement it in the specific context at hand. There are plenty of issues to consider,
most of which are in common with other randomized designs (both adaptive and non-adaptive). In
this section, we focus on a few issues that are potentially different for RAR in particular, and hence
merit an additional discussion.
Measurement/classification error and missing data
The presence of measurement error (for continuous variables) or classification error (for binary
variables) and missing data are common in medical research. Many analysis approaches have been
proposed to reduce the impact of these issues on statistical inference (see e.g. Guolo (2008), Little
and Rubin (2002), and Blackwell et al. (2017)) but limited literature on overcoming these issues when
implementing a RAR procedure is available. The main concern when there are missing values and/or
measurement error is that the sequentially updated allocation probability may become biased. This
happens when the unobserved true values come from a distribution that is different to that of the
observed values.
To the best of our knowledge, the only work considering classification (or measurement) error is by
Li and Wang (2012) and Li and Wang (2013). They derived optimal allocation targets under a model
with constant misclassification probabilities, which could differ between the treatment arms. In the
latter paper, the effect of misclassification (in the two-arm setting) on the usual optimal allocation
designs was also explored through simulation.
As for missing data, Biswas and Rao (2004) considered the presence of missing responses for a
CARA design. For a two-arm setting with a normal outcome, a probit link function that depends on the
covariate adjusted unknown treatment effect parameters is used to construct the allocation probability.
Under the assumption of missing at random (see Rubin (1976)) and with a single imputation for the
missing responses, they found that the standard deviation and the mean of the proportion of patients
assigned to a treatment arm are not affected by the structure of the missing data mechanism. In a
thesis by Ma (2013), a new allocation function for CARA in the presence of missing covariates and
missing responses is defined.
For RAR, Williamson and Villar (2020) proposed a forward-looking bandit-based allocation pro-
cedure for Phase II cancer trials with normal outcome and an imputation method to facilitate the
implementation of the procedure when the outcome underlying the RECIST categories (Therasse et
30
al., 2000; Eisenhauer et al., 2009) is undefined, e.g. due to death or complete removal of a tumor.
They suggested to fill the incomplete data for these extreme cases with random samples drawn from
the lower tail and the upper tail of the distribution under the null and the alternative scenarios that
were used for sample size determination.
The investigation for the more complex scenarios, e.g. not missing at random, remains unexplored.
Nevertheless, these complex issues have rarely been considered at the design stage of a trial, except
for some simple setting such as the work by Lee et al. (2018).
Delayed responses
The use of RAR is clearly not appropriate in clinical trials where the patient outcomes are only
observed after all patients have been recruited and randomized. This may be the case when the
recruitment period is limited (e.g. due to a high recruitment rate), or when the outcome of interest
takes a long time to observe. One way to address the latter point is to use a surrogate outcome that
is more quickly observed. For example, Tamura et al. (1994) used a surrogate response to update the
urn when using a RPW rule in a trial treating patients with depressive disorder. Another possibility
when outcomes are delayed in some way is to use a randomization plan that is implemented in stages
as more date becomes available. As an example, Zhao and Durkalski (2014) described a two-arm trial
for patients suffering acute stroke which was divided into stages: stage 1 was a burn-in period with
equal randomization, stage 2 started to maintain covariate imbalance and only in stage 3 did the RAR
allocation begin.
In general, as long as some responses are available then RAR can be used in trials with delayed
responses, as stated in Hu and Rosenberger (2006, pg. 105):
From a practical perspective, there is no logistical difficulty in incorporating delayed re-
sponses into the response-adaptive randomization procedure, provided some responses be-
come available during the recruitment and randomization period. For urn models, the urn
is simply updated when responses become available (Wei, 1988). For procedures based on
sequential estimation, estimates can be updated when data become available. Obviously,
updates can be incorporated when groups of patients respond also, not just individuals.
Although practically speaking there are no difficulties in incorporating delayed responses into the
RAR procedures, from a hypothesis testing point of view, statistical inferences at the end of the trial
can be affected. As noted in Rosenberger et al. (2012), this has been explored theoretically for urn
models (Bai et al., 2002; Hu and Zhang, 2004; Zhang et al., 2007) as well as the DBCD (Hu et al.,
2008). These papers show that the asymptotic properties of these RAR procedures were preserved
31
under widely applicable conditions. In particular, when more than 60% of patient responses are avail-
able by the end of the recruitment period, simulations show that the power of the trial is essentially
unaffected for these procedures.
Patient consent
Patient consent is as much an ethical element of clinical trials as equipoise is. Informed consent
protects patients’ autonomy, and requires an appropriate balance between information disclosure and
understanding (Beauchamp, 1997). There is considerable evidence that the basic elements to ensure
informed consent (recall and understanding) can be very difficult to ensure even for traditional non-
adaptive studies (Sugarman, 1999; Dawson, 2009).
The added complexity of allocation probabilities that may (or may not change) in response to
accumulated data only makes achieving patient consent more challenging. Understanding the caveats
of each randomization algorithm has proven hard for statisticians, putting into context the challenge
of explaining these concepts to the regular public and expecting them to make an informed decision
when their health is at stake. Moreover, since these novel adaptive procedures are still rarely used in
real trials, there is little practical experience to draw upon.
RCTs also require a level of blinding that will only make the balance between the required disclosure
(for the purpose of understanding) and consenting patients during the trial even harder for studies
that use RAR, as this requires the use of accumulated data to alter the design and to adequately
inform patients as they are recruited.
All these challenges do not automatically imply that a trial design using RAR may not be the best
design option for a particular setting, they simply pose challenges that need to be properly addressed
and considered in conjunction with other reasons to use RAR or not. For a more in depth discussion
of these issues see Sim (2019).
Implementing randomization changes in practice
Randomization of patients in clinical trials, whether adaptive or not, must be done in accordance
with standards of good clinical practice. As such, randomization of patients in most clinical trials
worldwide is done through a dedicated web-based system that will typically be used by several members
of the trial team, including the trial manager, statistician, independent reviewers and those who use
it to randomize patients. Randomization has moved on from the times where it was done with paper
and envelopes, and now requires a system that can be backed up and maintained, and is easy to use,
secure and available 24/7.
32
In the UK, for example, most clinical trials unit will outsource their randomization to external
companies which in turn ensures compliance with good clinical practice. This outsourcing can even-
tually end up limiting the ways in which randomization can be implemented in a trial to those that
are currently offered by the companies in use. To date and to the best of the authors’ knowledge, in
the UK there is no company offering a RAR portfolio, and so every change in a randomization ratio
is treated as a trial change (and charged as such) rather than being considered an integral part of
the trial design. Beyond the extra costs that this brings, it can introduce unnecessary delays as the
randomization of patients needs to be stopped while the change is implemented. This can certainly
be a very important deterrent to the use of RAR in practice, as this can imply a much larger effort to
implement the randomization procedure and ensure that it is compliant with regulations.
A related issue is that of preserving blinding. Maintaining treatment blinding is key to protecting
the integrity of clinical trials. This is particularly important when applying a design that incorporates
RAR, as when an investigator is aware of what treatment is more likely to be allocated to a patient in
the future, selection bias is more likely to occur. Therefore, implementing a RAR algorithm in practice
places extra challenges to preserve the blinding of clinical trial staff to results of the trial and to avoid
biases. In most cases, preserving blindness will require the appointment of an independent statistician
(which requires extra resources) to handle the interim data and implement the randomization ratios,
or a data manager can provide clean data to an external randomization provider who can then update
the randomization probabilities independently of the clinical and statistical team. A further discussion
on some of the practical issues around handling interim data and unblinding can be found in Sverdlov
and Rosenberger (2013b).
4.6 Is using RAR in clinical trials (more) ethical?
Ethical reasons have been the most well-known and cited arguments in favor of using RAR over the
years.
Our explicit goal is to treat patients more effectively, but a happy side effect is that we
learn efficiently. (Berry, 2004)
Research in response-adaptive randomization developed as a response to a classical ethical
dilemma in clinical trials. (Hu and Rosenberger, 2006)
The goal of response-adaptive randomization is to achieve some ethical or statistical objec-
tive that might not be possible by fixing an allocation probability in advance. (Rosenberger
and Lachin, 2016)
33
Nevertheless, within the statistical community there are also positions arguing that RAR may not
be “ethical”.
For RCTs where treatment comparison is the primary scientific goal, it appears that in
most cases designs with fixed randomization probabilities and group sequential decision
rules are preferable to AR scientifically, ethically and logistically (Thall et al., 2015a)
Clinical research (implemented by clinical trials designed to create generalize knowledge) poses
several ethical questions. To start with, there is an inevitable tension between clinical research and
clinical practice, where the latter is concerned with best treating an individual patient. Such ethical
conflicts are becoming ever more discussed as treatment of patients becomes more linked to research
activities, as is currently observed in cancer research (London, 2018). Although the idea of “treating
more patients effectively” by using RAR appears to be ethically attractive, particularly from patients’
point of view, the extent to which these adaptive designs are truly more “ethical” than traditional
randomization designs is only recently starting to be formally addressed by ethicists.
It is our belief that a) this issue should receive more attention from ethicists and b) collaborations
between ethicists and statisticians are needed to fully address all the caveats and complexities that
this broad family of methods can distinctly have. Any argument (be that in favor or against) that
involves an ethical side should ideally be discussed with an ethicist and a statistician jointly. We
also believe that extreme positions regarding RAR based purely on statistical or ethical arguments
each considered in isolation are most likely to be wrong. Two reasons motivate our belief. First, as
illustrated in the previous sections of this paper, RAR is a broad family of methods to which almost
no generalization is true. More importantly, because the trial context to which RAR is applied to
matters significantly, making compromises between statistical and ethical objectives can have very
different implications under different settings. Therefore, in this section, we will not aim to answer if
RAR methods are ethical or not (a question that would need to be addressed for each method and
trial context specifically). However, we review key concepts that would affect this answer and give
some current discussions by ethicists that statisticians would benefit from reading.
The “equipoise” concept
An argument against the use of RAR has been that it violates the principle of equipoise on which
clinical trials (or medical research more generally) is based upon (Laage et al., 2017). Equipoise
is a state of uncertainty of the individual investigator regarding the relative merits of two or more
interventions for some population of patients. Such uncertainty justifies randomizing patients to
treatments as this does not imply knowingly disadvantaging patients by doing so. This concept
34
was considered too narrow and therefore extended to allow for randomizing patients when there is
“honest, professional disagreement among expert clinicians” about the relative clinical merits of some
interventions for a particular patient population (Freedman, 1987). This broader definition of equipoise
is known as ‘clinical equipoise’ while the first one is known as ‘theoretical equipoise’.
Changing the randomization probabilities in light of patients’ responses is viewed as disturbing
equipoise, because the updated allocation weights reflect the relative performance of the interventions
in question. Once the randomization weights become unbalanced, the study has a preferred treatment
and allocating participants to treatments regarded as inferior is unethical, as it violates the concern
for welfare. However, this argument that RAR is unethical because it breaks equipoise is based on
two assumptions: 1) randomization ratios reflect a single agent’s beliefs about the relative merits of
the interventions being tested in a study; and 2) equipoise is a state of belief in which the relevant
probabilities are assumed to be equally balanced. Neither of these two assumptions are consistent with
the definition of ‘clinical equipoise’ as the clinical community is certainly composed of more than one
agent and an honest disagreement among them will not necessarily take a 50%-50% split of opinions.
Patient horizon (individual and collective ethics)
The ethical value of RAR (and of any other design) also depends considerably on trial specific
considerations. A particular feature that can affect comparisons between design options relates to
disease prevalence. Suppose a clinical trial is being planned and let T be the size of the “patient
horizon” for that study, i.e. those patients within and outside of the trials who will benefit from
the conclusions of the trial. The concept of patient horizon can be traced back to Anscombe (1963)
and Colton (1963). The precise value of T is never known, but in extreme situations impact (or
should impact) the choice of design and sample size of that trial. A trial in a rare pediatric cancer (for
example) is likely to have a large proportion of the patient horizon included in the trial, while a trial
relevant to patients with coronary artery disease will have the vast majority of the patient horizon
outside of the trial. Similar thoughts apply in the context of emerging life-threatening diseases (e.g.
the recent Ebola crisis or the current COVID-19 pandemic), where the patient horizon can be short
for other reasons than prevalence.
Certainly, the impact of the patient horizon on the comparison of designs from an ethical point of
view depends on considerations around individual and collective ethics and potential conflicts between
these two. As Tamura et al. (1994, pg. 775) express it, RAR “represents a middle ground between the
community benefit and the individual patient benefit” and because of this “it is subject to attack from
either side”. This specific point has been very well discussed and formally studied in the statistical
35
literature (see Berry and Eick (1995); Berry (2004); Cheng et al. (2003) for some examples). Despite
this, prevalence of a disease is almost never taken into account, neither in practice when designing
real trials nor in a large number of articles comparing RAR methods from an ethical point of view.
The recent work by Lee and Lee (2020) is the only relevant work that we know of in the literature.
They propose a utility function for the evaluation of RAR, which includes a penalty for assigning pa-
tients to the inferior arm while accounting for patient benefit both within and beyond the trial horizon.
Two-armed versus multiple armed trials
A final point that can affect any ethical judgment about a trial design and of a RAR procedure
is the number of arms considered in the trial. In a two-armed setting, trade-offs between ethical and
statistical features are such that there is no scope for a design to be superior (or dominate) any other
in all aspects. In the multi-armed setting, this does not have to be the case, and depending on the
main objective of the trial (e.g. the relevant power definition used), designs using RAR can be superior
to a fixed randomized trial in the sense that they can achieve both efficiency and ethical gains over a
traditional RCT.
5 Discussion
RAR methods have received as much theoretical attention as they have generated heated debates,
especially during acute health crises like the one we are currently facing with COVID-19. This is not
surprising, as the main reason that drives the desire to change randomization ratios in clinical trials
is to better respond to very difficult ethical challenges under pressing conditions. However, for such
debates around its use in practice to be useful and meaningful, they have to remain level-headed and
bear in mind that generalizations within such a large class of methods are very likely to be partial
and/or misleading. Even within one family of RAR procedures, conclusions about “typical” issues can
be very different if one applies the procedure in a two-armed or in a multi-armed trial setting, with or
without early stopping, and if we are considering an early phase study (naturally to be followed by a
confirmatory trial) or a later phase study. In most cases, the relative performance of designs is highly
dependent on the preferred method of inference (frequentist or Bayesian).
We wish to emphasize that in this paper we are certainly not advocating the use of RAR in all trial
settings, as there will be many contexts where the use of a different form of trial adaptation or even a
fixed randomized design may be preferable for both methodological or practical reasons. Indeed, this
36
an issue to consider with adaptive trials in general – sometimes it may be better to ‘keep it simple’
and use traditional (non-adaptive) trial designs instead (Wason et al., 2019). In this review, where
we have chosen to focus on positive statements on the use of RAR procedures it was simply because
these arguments have often been neglected in broad comparisons presented in the literature.
Our recommendation is that if a trial design calls for the use of RAR as a possible option to
address a specific concern (be that ethical or not), careful consideration should be given to all the
issues mentioned in the present work. Our advice is to think of RAR as a long list of possible design
options rather than a simple unique technique to include or not. The number of possible ways to
implement RAR in a clinical trial is far too large to have a statement that will apply to all of them in
a given context. We also advise that simulations are carried out to widely explore the parameter space
(and not only subsets of interest). We encourage a broad look at multiple metrics when considering
the use of a specific RAR procedure, as the design may perform very well according to one metric but
very badly according to another. Trade-offs are ubiquitous and in many cases unavoidable, so the best
one can hope for is to be aware of them when making a design choice. Indeed, RAR procedures can
address a specific need at the expense of a cost in a different area, and this needs to be made explicit
at a design stage. There is no perfect trial design, but some trade-offs might be acceptable in certain
cases (or even necessary for a greater good) while the same trade-offs can be absolutely rejected in
other contexts.
We would like to end the paper with a short summary as to what we feel the future for RAR
methods research should bring to improve its usefulness in clinical practice. Our hope is that rather
than taking the route of explicitly “discouraging” or “encouraging” the use of RAR, researchers can
take the more constructive path of addressing these issues with new ideas that work to offer most
of the advantages that RAR can bring with fewer (or none) of its downsides while taking the trial
context into account.
Indeed, as RAR methods start being more commonly used in clinical trials, this will unlock demand
for applied research in terms of the analysis of such studies. Practical issues such as accounting for
missing data or measurement error when using a RAR method are still a big open question. In terms
of more theoretical work, we feel that the main issue to address is that of robust inference for the
broader class of methods. However, there are some design aspects that could still be addressed. For
instance, the RAR class has scope for addressing delicate issues when designing studies with composite
or complex endpoints, where there might be effects of the same direction or even of opposite ones.
Finally, the use of RAR in practice requires the availability of user-friendly software for both the
implementation of the randomization algorithm as well as for the analysis approaches that were men-
37
tioned in Section 4 of this paper. Applied statisticians could benefit from training specifically oriented
on how to use simulations for the evaluation of RAR techniques, which is crucial to understand their
potential limitations and benefits.
Acknowledgments
The authors thank Peter Jacko for many helpful and constructive comments on an earlier version of
this manuscript, Andi Zhang for providing code for the GI and FLGI procedures used in the simulation
in Section 4.1, Ayon Mukherjee for suggesting the use of the drop-the-loser rule in Section 4.1, Arina
Kazimianec for previous work looking at sample size imbalance which helped motivate Section 4.1,
and Nikolaos Skourlis for screening through the relevant literature on model-based adaptive random-
ization approaches. The authors acknowledge funding and support from the UK Medical Research
Council (grants MC UU 00002/3 (SSV, BCL-K), MC UU 00002/6 (DSR), MR/N028171/1 (KML)),
the Biometrika Trust (DSR) and the National Institute for Health Research Cambridge Biomedical
Research Centre at the Cambridge University Hospitals NHS Foundation Trust (DSR, KML, SSV).
References
Angus, D.C., Berry, S., Lewis, R.J., Al-Beidh, F., Arabi, Y., van Bentum-Puijk, W.,
Bhimani, Z., Bonten, M., Broglio, K., Brunkhorst, F., Cheng, A.C., Chiche, J.D., De
Jong, M., Detry, M., Goossens, H., Gordon, A., Green, C., Higgins, A.M., Hullegie,
S.J., Kruger, P., Lamontagne, F., Litton, E., Marshall, J., McGlothlin, A., McGuin-
ness, S., Mouncey, P., Murthy, S., Nichol, A., O’Neill, G.K., Parke, R., Parker, J.,
Rohde, G., Rowan, K., Turner, A., Young. P,, Derde, L., McArthur, C. and Webb,
S.A. (2020). The Randomized Embedded Multifactorial Adaptive Platform for Community-acquired
Pneumonia (REMAP-CAP) Study: Rationale and Design. Annals of the American Thoracic Soci-
ety, Advance access, doi:10.1513/AnnalsATS.202003-192SD.
Anscombe, F.J. (1963). Sequential medical trials. Journal of the American Statistical Association
58 365–383.
Atkinson, A.C. and Biswas, A. (2014). Randomized Response-Adaptive Designs in Clinical Trials.
CRC Press, Boca Raton.
38
Auer, P., Cesa-Bianchi N. and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit
Problem. Machine Learning 47(2–3) 235–256.
Barker, A.D., Sigman, C.C., Kelloff, G.J., Hylton, N.M., Berry, D.A. and Esserman,
L.J. (2009). I-SPY 2: An adaptive breast cancer trial design in the setting of neoadjuvant chemother-
apy. American Society for Clinical Pharmacology and Therapeutics 86 97–100.
Bartlett, R., Roloff, D., Cornell, R., Andrews, A., Dillon, P. and Zwischenberger,
J. (1985). Extracorporeal Circulation in Neonatal Respiratory Failure: A Prospective Randomized
Study. Pediatrics Journal 76(4) 479–487.
Beauchamp, T.L. Informed consent. In Medical Ethics, 2nd edition, 185-208. ed. Robert M. Veatch,
Boston: Jones and Bartlett.
Bello, G.A. and Sabo, R.T. (2016). Outcome-adaptive allocation with natural lead-in for three-
group trials with binary outcomes. Journal of Statistical Computation and Simulation 86(12) 2441–
2449.
Berry, D.A. and Eick, S.G. (1995). Adaptive assignment versus balanced randomization in clinical
trials: a decision analysis. Statistics in Medicine 14(3) 231–246.
Berry, D.A. (2004). Bayesian Statistics and the Efficiency and Ethics of Clinical Trials. Statistical
Science 19(1) 175–187.
Berry, D.A. (2011). Adaptive Clinical Trials: The Promise and the Caution. Journal of Clinical
Oncology 29(6) 606–609.
Berry, S.M., Petzold, E.A., Dull, P., Thielman, N.M., Cunningham, C.K., Corey, G.R.,
McClain, M.T., Hoover, D.L., Russell, J., Griffiss, J.M. and Woods, C.W. (2016). A
response adaptive randomization platform trial for efficient evaluation of Ebola virus treatments: A
model for pandemic response. Clinical Trials, 13(1), 2230.
Bai, Z.D., Hu, F. and Rosenberger, W.F. (2002). Asymptotic properties of adaptive designs for
clinical trials with delayed responses. Annals of Statistics, 30 122–139.
Biswas, A. and Rao, J. N. K. (2004). Missing responses in adaptive allocation design. Statistics &
Probability Letters, 70(1) 59–70.
Blackwell, M., Honaker,J. and King, G. (2017). A unified approach to measurement error and
missing data: overview and applications. Sociological Methods & Research, 46(3) 303–341.
39
Bowden, J. and Trippa, L. (2017). Unbiased estimation for response adaptive clinical trials. Statis-
tical Methods in Medical Research 26(5) 2376–2388.
Brittain, E.H. and Proschan, M.A. (2016). Comments on Berry et al.’s response-adaptive ran-
domization platform trial for Ebola. Clinical Trials 13(5) 566–567.
Bubeck, S. and Cesa-Bianchi (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed
Bandit Problems. Foundations and Trends in Machine Learning 5(1) 1–122.
Burton, P.R., Gurrina, L.C. and Hussey, M.H. (1997). Interpreting the clinical trials of ex-
tracorporeal membrane oxygenation in the treatment of persistent pulmonary hypertension of the
newborn. Seminars in Neonatology 2 69–79.
Carey, L.A. and Winer, E.P. (2016). I-SPY 2 – Toward More Rapid Progress in Breast Cancer
Treatment New England Journal of Medicine 375 83–84.
Cheng, Y., Su, F. and Berry, D.A. (2003). Choosing sample size for a clinical trial using decision
analysis. Biometrika 90(4) 923–936.
Cheng, Y. and Berry, D.A. (2007). Optimal adaptive randomized designs for clinical trials.
Biometrika 94 673–689.
Coad, D.S. (1991). Sequential tests for an unstable response variable. Biometrika 78(1) 113–121.
Coad, D.S. (1992). A comparative study of some data-dependent allocation rules for Bernoulli data.
Journal of Statistical Computation and Simulation 40(3–4), 219–231.
Coad, D.S. and Govindarajulu, Z. (2000). Corrected confidence intervals following a sequential
adaptive trial with binary response. Journal of Statistical Planning and Inference 91 53–64.
Coad, D.S. and Ivanova, A. (2001). Bias calculations for adaptive urn designs. Sequential Analysis
20 229–239.
Colton, T. (1963). 963). A model for selecting one of two treatments. Journal of the American
Statistical Association 58 388–400.
Das, S. and Lo, A.W. (2017). Re-inventing drug development: A case study of the I-SPY 2 breast
cancer clinical trials program Contemporary Clinical Trials 62 168–174.
Dawson, A. (2009). The normative status of the requirement to gain an informed consent in clinical
trials: Comprehension, obligations and empirical evidence. In The limits of consent: A sociolegal
40
approach to human subject research in medicine, ed. Oonagh Corrigan, John McMillan, Kathleen
Liddell, Martin Richards, and Charles Weijer, 99-113. Oxford: Oxford University Press.
Duan, L. and Hu, F. (2009). Doubly-adaptive biased coin designs with heterogenous responses.
Journal of Statistical Planning and Inference 139 3220–3230.
Eisele, J.R. (1994). The double adaptive biased coin design for sequential clinical trials. Journal of
Statistical Planning and Inference 38 249–261.
Eisenhauer, E.A., Therasse, P., Bogaerts, J., Schwartz, L.H., Sargent, D., Ford, R.,
Dancey, J., Arbuck, S., Gwyther, S., Mooney, M., Rubinstein, L., Shankar, L., Dodd,
L., Kaplan, R., Lacombe, D. and Verweij, J. (2009). New response evaluation criteria in solid
tumours: revised RECIST guideline (version 1.1). European Journal of Cancer, 45(2) 228–247.
Fan, L., Yeatts, S.D., Wolf, B.J., McClure, L.A., Selim, M. and Palesch, Y.Y. (2018).
The impact of covariate misclassification using generalized linear regression under covariate-adaptive
randomization. Statistical Methods in Medical Research, 27(1) 20–34.
Flournoy, N., Haines, L.M. and Rosenberger, W.F. (2013). A Graphical Comparison of
Response-Adaptive Randomization Procedures. Statistics in Biopharmaceutical Research, 5(2) 126–
141.
Freedman, B. (1987). Equipoise and the Ethics of Clinical Research. New England Journal of
Medicine 317 141–145.
Galbete, A., Moler, J.A. and Plo, F. (2016). Randomization tests in recuresive response-adaptive
randomization procedures. Statistics 50(2) 418–434.
Galbete, A. and Rosenberger, W.F. (2016). On the use of randomization tests followign adaptive
designs. Journal of Biopharmaceutical Statistics 26(3) 466–474.
Gu, X. and Lee, J.J. (2010). A simulation sudy for comparing testing statistics in response-adaptive
randomization. BMC Medical Research Methodology 10 48.
Guolo, A. (2008). Robust techniques for measurement error correction: a review. Statistical Methods
in Medical Research, 17(6) 555–580.
Hadad, V., Hirshberg, D.A., Zhan, R., Wager, S. and Athey, S. (2019). Confidence Intervals
for Policy Evaluation in Adaptive Experiments. arXiv preprint arXiv:1911.02768.
41
Hu, F. and Rosenberger, W.F. (2003). Optimality, variability, power: Evaluating response-
adaptive randomization procedures for treatment comparisons. Journal of the American Statistical
Association 98 671–678.
Hu, F. and Zhang, L-X. (2004a). Asymptotic properties of doubly adaptive biased coin design for
multi-treatment clinical trials. The Annals of Statistics 32(1) 268–301.
Hu, F. and Zhang, L-X. (2004b). Asymptotic normality of adaptive designs with delayed response.
Bernoulli 10 447–463.
Hu, F. and Rosenberger, W.F. (2006). The Theory of Response-Adaptive Randomization in Clinical
Trials. Wiley Series in Probability and Statistics.
Hu, F., Zhang, L-X., Cheung, S.H. and Chan, W.S. (2008). Double-adaptive biased coin designs
with delayed responses. Canadian Journal of Statistics 36 541–559.
Hu F., Zhang L. and He X. (2009). Efficient Randomized-Adaptive Designs. The Annals of Statistics
37(5A) 2543–2560.
Hu, J., Zhu, H. and Hu, F. (2015). A Unified Family of Covariate-Adjusted Response-Adaptive
Designs Based on Efficiency and Ethics. Journal of the American Statistical Association. 110:509
357–367.
Ivanova, A. (2003). A play-the-winner-type urn design with reduced variability. Metrika 58(1) 1–13.
Jacko, P. (2019). The Finite-Horizon Two-Armed Bandit Problem with Binary Responses: A Mul-
tidisciplinary Survey of the History, State of the Art, and Myths. arXiv preprint. arXiv:1906.10173.
Jennison, C. and Turnbull, B.W. (2000). Group Sequential Methods with Applications to Clinical
Trials. Chapman and Hall/CRC, Boca Raton, FL.
Jeon, Y. and Hu, F. (2010). Optimal Adaptive Designs for Binary Response Trials with Three
Treatments. Statistics in Biopharmaceutical Research. 2(3) 310–318.
Jiang, Y., Zhao W. and Durkalski-Mauldin, V. (2020). Time-trend impact on treatment estima-
tion in two-arm clinical trials with a binary outcome and Bayesian response adaptive randomization.
Journal of Biopharmaceutical Statistics. 30(1) 69–88.
Johnston, K., Bruno, A., Pauls, Q., Hall, C.E., Barrett, K.M., Barsan, W., Fansler, A.,
Van de Bruinhorst, K., Janis, S. and Durkalski-Mauldin, V.L. for the Neurological Emer-
gencies Treatment Trials Network and the SHINE Trial Investigators (2019). Intensive vs Standard
42
Treatment of Hyperglycemia and Functional Outcome in Patients With Acute Ischemic Stroke: The
SHINE Randomized Clinical Trial. Journal of the American Medical Association. 322(4) 326–335.
Kaibel, C. and Biemann, T. (2019). Rethinking the Gold Standard With Multi-armed Ban-
dits: Machine Learning Allocation Algorithms for Experiments. Organizational Research Methods
https://doi.org/10.1177/1094428119854153.
Karrison, T.G., Huo, D. and Chappell, R. (2003). A group sequential, response-adaptive design
for randomized clinical trials. Controlled Clinical Trials 24 506–522.
Kaufmann, E., Korda, N. and Munos, A. (2012). Thompson Sampling: An Asymptotically Op-
timal Finite-Time Analysis. International Conference on Algorithmic Learning Theory 2012: Pro-
ceedings 199–213.
Kaufmann, E. and Garivier, A. (2017). Learning the distribution with largest mean: two bandit
frameworks. ESAIM: Procs 60 114–131.
Kim, E. S., Herbst, R.S., Wistuba, I.I., Lee, J.J., Blumenschein, G.R., Tsao, A., Stewart,
D.J., Hicks, M.E., Erasmus, J. Jr, Gupta, S., Alden, C.M., Liu, S., Tang, X., Khuri,
F.R., Tran, H.T., Johnson, B.E., Heymach, J.V., Mao, L., Fossella, F., Kies, M.S., Pa-
padimitrakopoulou, V., Davis, S.E., Lippman, S.M. and Hong, W.K. (2011). The BATTLE
trial: personalizing therapy for lung cancer. Cancer Discovery 1(1) 44–53.
Korn, E.L. and Freidlin, B. (2011a). Outcome-adaptive randomization: is it useful? Journal of
Clinical Oncology, 29 771–776.
Korn, E.L. and Freidlin, B. (2011b). Reply to Y. Yuan et al. Journal of Clinical Oncology 29 e393.
Korn, E.L. and Freidlin, B. (2017). Commentary. Adaptive Clinical Trials: Advantages and Dis-
advantages of Various Adaptive Design Elements. Journal of the National Cancer Institute 109(6)
djx013.
Laage, T., Loewy, J.W., Menon, S., Miller, E.R., Pulkstenis, E., Kan-Dobrosky, N.
and Coffey, C. (2017). Ethical Considerations in Adaptive Design Clinical Trials. Therapeutic
Innovation & Regulatory Science 51(2) 190–199.
Lattimore, T. and Szepesvari, C. (2019). Bandit Algorithms. In press https://tor-
lattimore.com/downloads/book/book.pdf.
43
Lee, J.J., Chen N. and Yin, G. (2012). Worth Adapting? Revisiting the Usefulness of Outcome-
Adaptive Randomization. Clinical Cancer Research 18(17) 4498–4507.
Lee, K.M. and Lee, J.J. (2020+). Evaluating Bayesian adaptive randomization procedures with
adaptive clip methods for multi-arm trials. Under review.
Lee, K.M., Mitra, R. and Biedermann, S. (2018). Optimal design when outcome values are not
missing at random. Statistica Sinica, 28(4) 1821–1838.
Li, X. and Wang, X. (2012). Variance-penalized response-adaptive randomization with mismeasure-
ment. Journal of Statistical Planning and Inference, 142 2128-2135.
Li, X. and Wang, X. (2013). Response adaptive designs with misclassified responses. Communication
in Statistics – Theory and Methods, 42 2071-2083.
Lin, J. and Bunn, V. (2017). Comparison of multi-arm multi-stage design and adaptive randomization
in platform clinical trials. Contemporary Clinical Trials 54 48–59.
Little, R.J. and Rubin, D. B. (2002). Bayes and multiple imputation. In: Statistical analysis with
missing data, 2nd edition, 200–220. New York (NY): Wiley.
London, A.J. (2018). Learning health systems, clinical equipoise and the ethics of response adaptive
randomization. Journal of Medical Ethics 44 409–415.
Ma, Z. (2013). Missing data and adaptive designs in clinical studies. PhD Thesis, University of
Virginia.
Magaret, A.S., Jacob, S.T., Halloran, M.E., Guthrie, K.A., Magaret C.A., Johnston,
C., Simon, N.R. and Wald, A. (2020). Multigroup, Adaptively Randomized Trials Are Advan-
tageous for Comparing Coronavirus Disease 2019 (COVID-19) Interventions. Annals of Internal
Medicine, Advance access, doi:10.7326/M20-2933.
Marchenko, O., Fedorov, V., Lee, J., Nolan, C. and Pinheiro, J. (2014). Adaptive Clinical
Trials: Overview of Early-Phase Designs and Challenges. Therapeutic Innovation & Regulatory
Science 48(1) 20–30.
Melfi, V.F. and Page, C. (2000). Estimation after adaptive allocation. Journal of Statistical Plan-
ning and Inference 87(2) 353–363.
44
Papadimitrakopoulou, v., Lee, J.J., Wistuba, I., Tsao, A. Fossella, F., Kalhor, N.,
Gupta, S., Averett Byers, L., Izzo, J., Gettinger, S., Goldbert, S., Tang, X., Miller,
V., Skoulidis, F., Gibbons, D., Shen, L., Wei, C., Diao, L., Peng, S. A., Wang, J., Tam,
A., Coombes, K., Koo, J. Mauro, D., Rubin, E., Heymach, J., Hong, W. and Herbst,
R. (2016). The BATTLE-2 Study: A Biomarker-Integrated Targeted Therapy Study in Previously
Treated Patients With Advanced Non-Small-Cell Lung Cancer. Journal of Clinical Oncology 34(30)
3638–3647.
Park, J. W., Liu, M.C., Yee, D., Yau, C., van’t Veer, L.J., Symmans, W.F., Paoloni, M.,
Perlmutter, J., Hylton, N.M., Hogarth, M., DeMichele, A., Buxton, M.B., Chien,
A.J., Wallace, A.M., Boughey, J.C., Haddad, T.C., Chui, S.Y., Kemmer, K.A., Kaplan,
H.G., Isaacs, C., Nanda, R., Tripathy, D., Albain, K.S., Edmiston, K.K., Elias, A.D.,
Northfelt, D.W, Pusztai, L., Moulder, S.L., Lang, J.E., Viscusi, R.K. Euhus, D.M.,
Haley, B.B., Khan, Q.J., Wood, W.C., Melisko, M., Schwab, R., Helsten, T., Lyandres,
J., Davis, S.E., Hirst, G.L., Sanil, A., Esserman, L.J. and Berry, D.A. for the I-SPY 2
Investigators (2016). Adaptive Randomization of Neratinib in Early Breast Cancer. New England
Journal of Medicine 375(1) 11–22.
Press, S. J. (2005). Applied multivariate analysis: using Bayesian and frequentist methods of infer-
ence. Courier Corporation.
Proschan, M.A. and Dodd, L.E. (2019). Re-randomization tests in clinical trials. Statistics in
Medicine 38, 2292–2302.
Proschan, M. and Evans, S. (2020). The Temptation of Response-Adaptive Randomization. Clinical
Infectious Diseases. Advance access, doi: 10.1093/cid/ciaa334.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American
Mathematical Society 58 527–535.
Robertson, D.S. and Wason, J.M.S. (2019). Familywise error control in multi-armed response-
adaptive trials. Biometrics 75(3) 885–894.
Rosenberger, W.F. and Lachin, J.M. (1993). The use of response-adaptive designs in clinical
trials. Controlled Clinical Trials 14(6) 471–484.
Rosenberger, W.F., Stallard, N., Ivanova, A., Harper, C.N. and Ricks, M.L. (2001).
Optimal adaptive designs for binary response trials. Biometrics 57 909–913.
45
Rosenberger, W.F., Vidyashankar, A.N. and Agarwal, D.K. (2001b). Covariate-adjusted
response-adaptive designs for binary response. Journal of Biopharmaceutical Statistics 11(4) 227–
236.
Rosenberger, W.F., Sverdlov, O. and Hu, F. (2012). Adaptive Randomization for Clinical
Trials. Journal of Biopharmaceutical Statistics 22 719–736.
Rosenberger, W.F. and Hu, F. (1999). Bootstrap methods for adaptive designs. Statistics in
Medicine 18 1757–1767.
Rosenberger, W.F. and Lachin, J.M. (2016). Randomization in Clinical Trials. Wiley Series in
Probability and Statistics, New York.
Rosenberger, W.F. and Hu, F. (2004). Maximising power and minimizing treatment failures in
clinical trials. Clincal Trials 1 141–147.
Rosenberger, W.F. and Sverdlov, O. (2008). Handling Covariates in the Design of Clinical Trials.
Statistical Science 23(3) 404–419.
Rosenberger, W.F. and Lachin, J.M. (2016). Randomization in Clinical Trials (Second Edition).
Wiley Series in Probability and Statistics, Hoboken, New Jersey.
Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.
Rugo, H. S. Olopade, O.I., DeMichele, A., Yau, C., van’t Veer, L.J., Buxton, M.B.,
Hogarth, M., Hylton, N.M., Paoloni, M., Perlmutter, J., Symmans, W.F., Yee, D.,
Chien, A.J., Wallace, A.M., Kaplan, H.G., Boughey, J.C., Haddad, T.C., Albain, K.S.,
Liu, M.C., Isaacs, C., Khan, Q.J., Lang, J.E., Viscusi, R.K., Pusztai, L., Moulder, S.L.,
Chui, S.Y., Kemmer, K.A., Elias, A.D., Edmiston, K.K., Euhus, D.M., Haley, B.B.,
Nanda, R., Northfelt, D.W., Tripathy, D., . Wood, W.C., Ewing, C., Schwab, R.,
Lyandres, J., Davis, S.E., Hirst, G.L., Sanil, A., Berry, D.A. and Esserman, L.J. for
the I-SPY 2 Investigators (2016). Adaptive Randomization of VeliparibCarboplatin Treatment in
Breast Cancer. New England Journal of Medicine 375(1) 23–34.
Sabo, R.T. (2014). Adaptive allocation for binary outcomes using decreasingly informative priors.
Journal of Biopharmaceutical Statistics 24(3) 569–578.
Samaniego, F. J. (2010). A comparison of the Bayesian and frequentist approaches to estimation.
Springer Science & Business Media.
46
Shin, J., Ramdas, A. and Rinaldo, A. (2019). Are sample means in multi-armed bandits positively
or negatively biased? NeurIPS 2019 arXiv:1905.11397.
Shin, J., Ramdas, A. and Rinaldo, A. (2020). On conditional versus marginal bias in multi-armed
bandits. arXiv preprint arXiv:2002.08422.
Sim, J. (2019). Outcome-adaptive randomization in clinical trials: issues of participant welfare and
autonomy. Theoretical Medicine and Bioethics 40(2) 83-101.
Simon, R. and Simon, N.R. (2011). Using randomization tests to preserve type I error with response
adaptive and covariate adaptive randomization. Statistics and Probability Letters 81(7) 767–772.
Smith, R.L. (1984). Properties of Biased Coin Designs in Sequential Clinical Trials. Annals of Statis-
tics 12(3) 1018–1034.
Siu, L. L., Ivy, S. P., Dixon, E. L., Gravell, A. E., Reeves, S. A. and Rosner, G. L. (2017).
Challenges and Opportunities in Adapting Clinical Trial Design of Immunotherapies. Clin Cancer
Res 23(17) 4950–4958.
Sugarman, J., Douglas C. McCrory, D.C., Powell, D., Krasny, A., Adams, B., Ball,
E. and Cassell, C. (1999). Empirical research on informed consent. Hastings Center Report
29(suppl) S1S42.
Sverdlov, O. and Rosenberger, W.F. (2013). On recent advances in optimal allocation designs
in clinical trials. Journal of Statistical Theory and Practice 7(4) 753–773.
Sverdlov, O. and Rosenberger, W.F. (2013). Randomization in clinical trials: can we eliminate
bias? Clinical Investigation Journal 3(1) 37–47.
Tamura, R.N., Faries, D.E., Andersen, J.S. and Heiligenstein, J.H. (1994). A Case Study of
an Adaptive Clinical Trial in the Treatment of Out-Patients with Depressive Disorder. Journal of
the American Statistical Association 89, 768–776.
Thall, P.F. and Wathen, J.K. (2007). Practical Bayesian adaptive randomization in clinical trials.
European Journal of Cancer, 43, 859–866.
Thall, P.F., Fox, P.S. and Wathen, J.K. (2015a). Some caveats for outcome adaptive random-
ization in clinical trials. CRC Press, Boca Raton.
47
Thall, P.F., Fox P. and Wathen, J. (2015b). Statistical controversies in clinical research: scientific
and ethical problems with adaptive randomization in comparative clinical trials. Annals of Oncology
26 1621–1628.
Therasse, P., Arbuck, S.G., Eisenhauer, E.A., Wanders, J., Kaplan, R.S., Rubinstein,
L., Verweij, J., Van Glabbeke, M., van Oosterom, A.T., Christian, M.C. and Gwyther,
S.G. (2000). New guidelines to evaluate the response to treatment in solid tumors. Journal of the
National Cancer Institute, 92(3) 205–216.
Thompson, W.R. (1933). On the likelihood that one unknown probability exceeds another in view
of the evidence of two samples. Biometrika 25 285–294.
Torgerson, D.J. and Campbell, M.K. (2000). Use of unequal randomization to aid the economic
efficiency of clinical trials. BMJ 321 759.
Trippa, L., Lee, E.Q., Wen, P.Y., Batchelor, T.T., Cloughesy, T., Parmigiani, G. and
Alexander, B.M. (2012). Bayesian adaptive trial design for patients with recurrent gliobastoma.
Journal of Clinical Oncology 30 3258–3263.
Tymofyeyev, Y., Rosenberger W.F. and Hu, F. (2007). Implementing optimal allocation in
sequential binary response experiments. Journal of the American Statistical Association 102(477)
224–234.
Viele, K., Broglio, K., McGlothlin, A. and Saville, B.R. (2020) Comparison of methods for
control allocation in multiple arm studies using response adaptive randomization. Clinical Trials
17(1) 5260.
Viele, K., Saville, B.R., McGlothlin, A. and Broglio, K. (2020) Comparison of response
adaptive randomization features in multiarm clinical trials with control. Pharmaceutical Statistics,
Advance access, doi:10.1002/pst.2015
Villar, S.S., Bowden, J. and Wason, J. (2015a). Multi-armed Bandit Models for the Optimal
Design of Clinical Trials: Benefits and Challenges. Statistical Science 30(2) 199–215.
Villar, S.S., Wason J. and Bowden, J. (2015b). Response-adaptive randomization for multi-arm
clinical trials using the forward looking Gittins index rule. Biometrics 71(4) 969–978.
Villar, S.S., Bowden, J. and Wason, J. (2018). Response-adaptive designs for binary responses:
How to offer patient benefit while being robust to time trends? Pharmaceutical Statistics 17(2)
182–197.
48
Villar, S.S., Robertson, D.S. and Rosenberger, W.F. (2020). The Temptation of Overgener-
alizing Response-Adaptive Randomization. Clinical Infectious Diseases, in press.
Wang, Y., Zhu, H. and Lee, J.J. (2020) Evaluation of bias for outcome adaptive randomization
designs with binary endpoints. Statistics and Its Interface 13 287-315
Wagenmakers, E.J., Lee, M., Lodewyckx, T. and Iverson, G. (2008). Bayesian versus fre-
quentist inference. In Bayesian evaluation of informative hypotheses, 181–207. Springer, New York,
NY.
Ware, J.H. (1989). Investigating Therapies of Great Benefit: ECMO - with comments. Statistical
Science 4(4), 298–340.
Wason, J.M.S. and Trippa, L. (2014). A comparison of Bayesian adaptive randomization and multi-
stage designs for multi-arm clinical trials. Statistics in Medicine 33(13) 2206–2221.
Wason, J.M.S., Brocklehurst P. and Yap, C. (2019). When to keep it simple – adaptive designs
are not always useful. BMC Medicine 17:152.
Wathen, J.K. and Thall, P.F. (2017). A simulation study of outcome adaptive randomization in
multi-arm clinical trials. Clinical Trials 14(5) 432–440.
Wei, L.J. and Durham, S. (1978). The randomized play-the-winner rule in medical trials. Journal
of Medical Statistics Association 73 840–843.
Wei, L.J. (1988). Exact two-sample permutation tests based on the randomized play-the-winner rule.
Biometrika 75 603–606.
Williamson, S.F., Jacko, P., Villar, S.S. and Jaki, T. (2017). A Bayesian adaptive design for
clinical trials in rare diseases. Computational Statistics and Data Analysis 113 136–153.
Williamson, S. F. and Villar, S. S. (2020). A ResponseAdaptive Randomization Procedure for
MultiArmed Clinical Trials with Normally Distributed Outcomes. Biometrics 76(1) 197–209
Zelen, M. (1969). Play the Winner Rule and the Controlled Clinical Trial. Journal of the American
Statistical Association 64 131–146.
Zhang, L. and Rosenberger, W.F. (2006). Response-Adaptive Randomization for Clinical Trials
with Continuous Outcomes. Biometrics 62 562–569.
49
Zhang, L. and Rosenberger, W.F. (2007a). Response-adaptive randomization for survival trials:
the parametric approach. Applied Statistics 56(2) 153–165.
Zhang, L., Chan, W.S., Cheung, S.H. and Hu, F. (2007b). A generalized urn model for clinical
trials with delayed responses. Statistica Sinica 17 387–409.
Zhang, L., Hu, F., Cheung, S.H. and Chan, W.S. (2011). Immigrated urn models – theoretical
properties and applications. Annals of Statistics 39 643–671.
Zhang, L., Trippa, L. and Parmigiani, G. (2019). Frequentist operating characteristics of Bayesian
optimal designs via simulation. Statistics in Medicine 38 4026–4039.
Zhao, W. and Durkalski, V. (2014). Managing competing demands in the implementation of
response-adaptive randomization in a large multicenter phase III acute stroke trial. Statistics in
Medicine 33(23) 40434052.
Zhou, X., Liu, S., Kim, E.S., Herbst, R.S. and Lee, J.J. (2008). Bayesian adaptive design for
targeted therapy development in lung cancer – a step toward personalized medicine. Clinical Trials
5(3) 181–193.
50