Efficient Provision of Experience Goods: Evidence from
Antidepressant Choice∗
Michael J. Dickstein†
Harvard University
January 17, 2011
JOB MARKET PAPER
Abstract
In markets for experience goods, consumers learn about the quality of available products
through experimentation. In their decisions, they balance the desire to choose the good with
the largest expected value against the incentive to try lesser-known but potentially superior
options. I develop a dynamic framework for analyzing demand that interacts pricing and
promotion with this learning process, and accommodates settings in which agents search over
large choice sets. I employ this approach to study the antidepressant market, testing whether
insurer policies can improve the quality of drug-patient matches and overall health. I find
that tools that selectively promote only the most effective products for a patient outperform
commonly-used tiered pharmaceutical policies. In addition, my estimates, built from a large
pool of patient episodes, identify treatments that induce stronger adherence and thus faster
recovery relative to alternatives recommended by clinical trial reviews. This result suggests
that treatment protocols built from observed quit rates can improve physician decision-making
and ultimately patient health.
Keywords: Learning, dynamic discrete choice, multi-armed bandits, pharmaceutical pricing
∗I owe special thanks to David Cutler, Guido Imbens, Ariel Pakes, and Dennis Yao for their guidance and suggestions. I thank Gary Chamberlain, Andrew Ching, Greg Lewis, Eduardo Morales, Julie Mortimer, Amanda Starc, and seminar participants at the Harvard Industrial Organization Lunch and Harvard Econometrics Lunch for helpful comments.
†Email: [email protected]; Baker Library 420M, Harvard Business School, Boston, MA 02163.
1 Introduction
In markets for experience goods, buyers learn about the quality of available products through
experimentation. They apply their experience to reevaluate the options, identifying the product
that best matches their preferences. In their next decision, they may choose this product or may
select a less well understood option to sharpen their opinions further. Many markets fit this
setting, including consumers' purchases of non-durable goods, manufacturers' choices of suppliers,
and physicians' choices of prescription drugs for patients1. The market for prescription drugs is
a particularly important example, because improvements in the physician’s learning process can
produce significant gains in patient health. The health benefits come in two forms: patients may
realize better outcomes while on more tailored treatment and may adhere to a drug regimen at
higher rates.
I develop a dynamic model of physician decision-making to ask: what set of insurer copayment policies and drug promotions can improve the efficiency of drug choice, maximizing patient
adherence and health while minimizing insurer cost? In the model, these policies enter both the
physician’s initial treatment choice and also interact with her learning process. I focus specifically
on the antidepressant market. The potential improvement in depression care is substantial, as few
patients continue treatment to the six month threshold recommended by the American Psychiatric
Association. In my empirical setting, half the patients discontinue treatment by the first month
and over 90% exit care within six months2.
In determining the value of insurer policies, I also tackle the issue of physician agency, a second
feature of the drug choice setting common to a broader set of markets. Agency relationships
among the physician, patient, and insurer complicate the prescribing choice. While the physician
ultimately chooses the treatment, she considers to varying degrees the patient’s health outcome and
the costs borne by the patient and the insurer. Careful incentive design must therefore anticipate
the physician’s reaction to the policy chosen.
To capture both agency and learning, I begin by using information on the first choice of antidepressant to identify the characteristics that physicians consider in their initial decision. I represent
the agency trade-off using a utility function for the physician that balances her concern for patient
1Erdem and Keane (1996) and Ackerberg (2003) study consumer products; Crawford and Shum (2005) focus on demand for prescription drugs.
2Inadequate treatment duration nearly doubles the rate of illness relapse (Melfi et al. (1998)), and depression itself involves serious pain: in surveys, patients equate 10 years living with depression to only 6 years living without it (Fryback et al. (1993)).
outcomes against private considerations, as in Hellerstein (1998). The private aspects that make
a drug more desirable for the physician to prescribe include lower risk of malpractice suits, fewer
interactions with other drugs, and possibly greater advertising. Physicians differ in their preferences over a rich set of drug characteristics, including price; this helps rationalize the diffuse market
shares at the initial decision.
After the first month, 20-25% of patients in my dataset exit care, 60-70% persist on the initial
treatment, and the balance switch to a new drug. In a model of learning, I focus on this continuation
decision. Two dimensions of uncertainty can enter the physician’s choice problem when she selects
among a set of experience goods. First, physicians may lack knowledge of the relative efficacy
of available drugs. They may use their experience across the group of patients they treat to
identify effective options. Since many of the antidepressants studied have been marketed for two
decades, this type of uncertainty plays only a minor role in my setting. Far more important to
depression care, physicians possess uncertainty about the quality of match between a particular
patient and the products available. Six subclasses of drugs compose the choice set, each using
a distinct biological mechanism of action in the brain; after adjusting for broad diagnosis classes,
physicians cannot predict ex ante which mechanism will provide the strongest antidepressant effect
for a particular patient. I focus on the physician’s effort to learn about the quality of a patient’s
match to a specific drug. Within the growing literature on physician behavior, Crawford and Shum
(2005) share this focus on the uncertainty in the patient-drug match. Unlike their specific setting,
however, the antidepressant market I study involves both a larger choice set and variation in price
and promotion across products, requiring a distinct solution.
I initialize the mean parameters of the physician’s decision problem via a standard discrete-
choice model. For the learning model, I develop a new methodology. I design a computationally
feasible approximation to the learning and experimentation that underlie the continuation decisions.
The focus on computation is particularly important in the antidepressant setting, where the large
choice set and relatively long time horizon can overwhelm existing approximation methods.
After setting the initial conditions, the dynamic model requires an explicit learning process for
the physician and a decision rule for her to employ after this updating. I choose a form for each
element based on the available variation in my data. I start by parameterizing each drug in terms
of characteristics that physicians suggest prompt patient complaints: (1) effi cacy, captured by the
drug’s subclass; (2) the side effect profile; and (3) the frequency of dosing per day. I allow patients
to differ in their sensitivity to these features; the physician’s goal is to learn these preferences from
the patient’s past experience.
The updating itself follows Bayes' rule, as in Crawford and Shum (2005) and Ching (2010).
However, unlike much of the existing work on learning, the agent updates a vector of coefficients
on the drug's characteristics rather than a single mean for each choice. This process permits correlated learning: in updating the coefficients on the characteristics of the drug tried, the physician
changes her priors on all drugs that share those characteristics. For example, physicians learn from
experience under drug A how the patient views drugs in A’s subclass as well as how he views any
drug with similar side effects to drug A.
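This correlated-updating step can be written as a conjugate normal regression update on the coefficient vector. Below is a minimal numerical sketch; the characteristic vectors, prior, and noise variance are hypothetical illustrations, not the paper's estimates:

```python
import numpy as np

# Hypothetical drug characteristics: [SSRI indicator, nausea score]
x_A = np.array([1.0, 0.8])   # drug A: an SSRI with high nausea
x_B = np.array([1.0, 0.7])   # drug B: same subclass, similar side effects
x_C = np.array([0.0, 0.1])   # drug C: different subclass, low nausea

mu = np.zeros(2)             # prior mean of patient-specific coefficients
Sigma = np.eye(2)            # prior covariance over the coefficients
sigma2 = 1.0                 # idiosyncratic outcome variance

# Patient reports a poor outcome (y = -1) after one period on drug A.
y = -1.0
S = Sigma @ x_A
k = S / (x_A @ S + sigma2)           # Bayesian (Kalman-style) gain
mu_post = mu + k * (y - x_A @ mu)    # posterior coefficient means
Sigma_post = Sigma - np.outer(k, S)  # posterior covariance

# Expected match quality for each drug before and after updating:
for name, x in [("A", x_A), ("B", x_B), ("C", x_C)]:
    print(name, x @ mu, "->", x @ mu_post)
```

Because the update operates on coefficients of shared characteristics, beliefs about drug B (same subclass, similar nausea profile) fall almost as much as beliefs about A, while beliefs about the dissimilar drug C barely move.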
After updating, the physician chooses the next period’s treatment, balancing the desire to
select the drug with the best expected outcome against the incentive to experiment with less well
understood options. I generate a rule that captures this trade-off by analyzing the physician’s
dynamic problem using the "multi-armed bandit" approach developed in the statistics literature
by Gittins and Jones (1979) and applied in economics contexts by Miller (1984) and Bergemann
and Välimäki (1996), among others. The solution takes an index form. Rather than requiring the
physician to forecast and calculate outcomes under all possible treatment paths, the multi-armed
bandit solution simplifies the J-dimensional dynamic programming problem into J one-dimensional
problems, one for each choice. These calculations lead to a single number, or "index", for each choice;
the physician then chooses the option with the largest index.
If the options were strictly independent, the drug choice setting would satisfy all the conditions
required for Gittins' index solution to equal the dynamic programming solution. However, antidepressants are correlated according to the set of characteristics outlined earlier, including their
subclass designation. As a result, I rely on the specific index rule derived by Lai (1987), which
permits correlated learning but only approximates the dynamic programming result.
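Lai's boundary has a specific functional form; as a stylized illustration of how an index rule of this general family trades off exploitation against exploration, consider an upper-confidence-style index (the bonus term below is illustrative, not Lai's exact formula, and all numbers are hypothetical):

```python
import numpy as np

def choice_index(post_mean, post_var, t, horizon=24):
    """Stylized index: exploitation value plus an exploration bonus
    that grows with posterior uncertainty and the remaining horizon."""
    bonus = np.sqrt(2.0 * post_var * np.log(max(horizon - t, 2)))
    return post_mean + bonus

# Hypothetical posterior beliefs over three drugs at decision point t=1:
means = np.array([0.4, 0.1, 0.3])
vars_ = np.array([0.01, 0.30, 0.05])  # drug 2 is poorly understood
idx = [choice_index(m, v, t=1) for m, v in zip(means, vars_)]
best = int(np.argmax(idx))            # drug 2 wins despite a lower mean
```

Even though drug 2 has the lowest posterior mean, its large posterior variance earns it the largest index, capturing the incentive to experiment with lesser-known options.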
From reduced form estimates of the initial period decision, I find that physicians respond to high
patient copayments by shifting prescriptions to cheaper options; they express excess attachment
to branded drugs generally, and to specific highly-promoted options in particular; and supply-side
incentives, such as paying physicians using a capitated rate, can shift prescribing toward cost-
effective options. Expanding to the second stage learning model, I judge the effect of insurer
copayment designs and promotion schemes both on the insurer’s costs and on patient adherence.
I translate adherence to predictions of symptomatic relief in dollar terms, as a measure of patient
welfare. I find "value-based" insurance designs that set lower copayments for drugs with higher
effectiveness—i.e. those with low population quit rates—can boost adherence, leading to a dollar gain
in patient health that exceeds the change in cost. These policies outperform commonly-used tiered
copayment schemes.
On promotion, I find the practice of academic detailing, in which specialists visit clinicians to
discuss the latest academic research on medications, can improve adherence and health if these
specialists promote drugs with higher tolerability and efficacy. In addition, the model allows me to
judge the welfare effects of manufacturer advertising itself. I find that promotion of highly effective
branded products may enhance patient welfare if the drug’s price to the insurer is relatively modest.
In one counterfactual, I simulate increased promotion for Zoloft (Sertraline), a drug with relatively
high rates of patient adherence. Increasing its advertising threefold leads to a 50% market share
for Zoloft in the focal month. Overall costs to the insurer increase, as physicians substitute toward
Zoloft from cheaper generics. However, gains in the dollar value of symptomatic relief eclipse the
increase in costs. In a second promotion experiment I run involving a less effective drug, the dollar
gains in health fall below the increase in drug costs.
Finally, I use the information on quit rates in my large census of patient episodes, filtered through
the structural model, to recommend a set of treatments. When patients leave treatment early,
holding price and other incentive effects constant, it reveals information on the effectiveness of the
drug chosen. The approach of judging efficacy using attrition rates follows from work by Chan
and Hamilton (2006) and Philipson and Hedges (1998) in the setting of randomized clinical trials.
I extend this idea to observational data. Relative to the managers of clinical trials, practicing
physicians exert far less effort to encourage compliance with the prescribed treatment. As a result,
the adherence rates observed in claims data, while lacking the internal validity of a randomized
trial, nonetheless provide an externally valid impression of patient satisfaction with a drug.
Based on observed switching patterns, the model identifies five treatments with 5-10% higher
rates of adherence and 10-20% fewer switches relative to other options. Existing protocols built
from clinical trial outcomes recommend a larger list of products, some of which far under-perform
the above treatments in terms of adherence. Thus, in the case of judging drug efficacy—and possibly
for policy evaluation more generally—my results suggest observed adherence rates provide valuable
information missed by treatment effects in randomized trials.
I proceed in the paper by first describing the market for depression care, the dataset, and
the available variation used to identify the parameters of the model (Section 2). I then describe
the reduced form approximation of the initial choice and provide estimation results (Section 3); I
provide detail on the learning model (Section 4) and the estimation algorithm used to recover its
parameters (Section 5); I describe the results and fit of the dynamic model (Section 6); finally, I
run simulations to test the effects of insurer policies on patient welfare and to develop a new clinical
protocol (Section 7).
2 Background and Data
2.1 Depression Care
Major depression affects 6.5% of adults in the United States each year. The illness causes patients
to suffer significant impairment in their productivity, with surveys finding 60% have symptoms
severe enough to keep them from performing daily tasks (Kessler et al. (2003)). I study patients
with one of five categories of depression: major depression, both current and recurrent; dysthymia
and depression with anxiety; prolonged depressive reaction; adjustment disorder with depressed
mood; and, depression not otherwise specified3. Each involves different clinical characteristics,
with standards set by the American Psychiatric Association.
The volume of diagnoses in the United States leads to a large market for drug treatment. In
2008, patients filled 164 million prescriptions for antidepressants, trailing as a class behind only
cholesterol regulators and pain medicines. Market sales reached $9.6 billion in 2008, roughly 3.3%
of all U.S. prescription drugs sales4. If advertising expenditure is roughly proportional to the class’
share in sales, manufacturers spent $370 million in 2008 on promotion of antidepressants, about
4% of sales revenue. I consider how this advertising might affect prescribing behavior in later
counterfactuals.
The antidepressants studied fall into six distinct subclasses according to their effect on concentrations of serotonin, norepinephrine, or dopamine in the brain. Patients react idiosyncratically
to changes in the concentrations of these chemicals, creating uncertainty regarding the efficacy of
any one biological mechanism for a patient (Murphy et al. (2009)). The medical literature groups
the subclasses themselves into two generations, with "second generation" antidepressants generally
dominating the older subclasses. Of the drugs I study, only tricyclic antidepressants (TCAs) fall
into the first generation. They act non-selectively on chemicals in the brain, leading to greater issues with tolerability and higher risk of overdose. As a result, "selective" second generation classes
3The depression diagnoses listed match the International Classification of Diseases (ICD-9-CM) codes of: 296.2, 296.3, 300.4, 309.0, 309.1, and 311. Melfi et al. (1998), Pomerantz et al. (2004), and Akincigil et al. (2007) use similar diagnostic codes in selecting a sample of depression sufferers.
4"Top Therapeutic Classes by US Sales", IMS National Sales Perspectives. IMS Health, 2008.
have largely replaced TCAs as the preferred treatment. The second generation includes: selective serotonin reuptake inhibitors (SSRIs); serotonin-norepinephrine reuptake inhibitors (SNRIs);
norepinephrine and dopamine reuptake inhibitors (NDRIs); noradrenergic and specific serotonergic
antidepressants (NASSAs); and serotonin modulators (SMs) (Gartlehner et al. (2007)).
2.2 Formation of the Sample
I draw a sample from Thomson Medstat's Commercial Claims and Encounters Marketscan Database, collecting its patient-level insurance claims data for outpatient visits and prescription drug
services. I link this data to patient demographics and plan design features from the Benefit Plan
Design Marketscan Database5. The individuals recorded in the data include active employees
working for a group of large US firms that contract with one of 100 participating payers; the
employees' dependents and some classes of retirees enter the database as well.
To form a sample, I identify patients in the outpatient dataset who receive a new diagnosis in
one of the depression categories listed earlier. Starting from this index office visit, I collect the
patient’s complete treatment sequence, including the identity of the prescriptions he fills, over the
remaining months of the sample. By conditioning on observed diagnoses in creating the sample, I
avoid observations that reflect off-label use of antidepressants6. I impose the following additional
screens in creating the analysis dataset: patients cannot have a concurrent diagnosis of bipolar
disorder or schizophrenia or receive drugs that signal these conditions, as these more complicated
illnesses lead to distinct prescribing behavior7; the patients' age must be between 18 and 64, the
range for which the data are complete; patients must visit a health professional with the ability to
prescribe all possible treatments in the choice set; and, patients must not be pregnant, a condition
that involves serious drug contraindications. To avoid episodes in which I observe care part way
into treatment, I eliminate from the sample patient episodes with treatments prescribed within the
first 6 months of data.
The initial filters lead to a dataset of 102,780 unique patient episodes of depression care, comprised of 267,390 observations on antidepressant prescriptions. The individuals are members of 307
insurance plans, each with a distinct set of required drug copayments. Plans also differ in their
5Thomson Reuters MarketScan Research Databases. Ann Arbor, MI: 2003-2005.
6In addition, I miss individuals in the data who suffer from depression but receive either another diagnosis or none at all. My sample thus underpredicts the use of the "outside option" of non-drug care and reflects care for patients with more identifiable depressive symptoms.
7Patients excluded due to comorbidities have a diagnosis in one of the following classes: bipolar and manic disorders (296.0, 296.1, 296.4-.8) and schizophrenic disorders (295.0-295.9).
physician payment schemes, with some offering capitation contracts to primary care physicians.
These contracts specify an annual lump-sum payment per patient treated rather than reimbursing
the physician for each office visit. I explore later how capitation arrangements change physician
prescribing behavior.
To define the end of a patient episode, I look for a gap of 90 days within the treatment history
where the patient neither fills a prescription nor visits his physician. If such a gap occurs, I
end the episode at the last prescription observed before the 90 day wash-out period. If the
individual appears again in the data after this period, I treat the new observations as composing an
independent episode8. Using this definition of an episode, 35% of patients do not fill a prescription
in their episode; 22% of patients fill one; 13% fill two; 9% fill three; and 6% fill four. 15% have
episodes involving between 5 and 30 filled prescriptions.
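The 90-day wash-out rule above amounts to a simple segmentation over a patient's claim dates. A sketch of that logic, using hypothetical dates (the variable names are illustrative, not fields in the Marketscan data):

```python
from datetime import date

def split_episodes(event_dates, gap_days=90):
    """Split a sorted list of claim dates (fills or office visits)
    into episodes: a new episode starts after a gap of more than
    `gap_days` with no observed activity."""
    episodes, current = [], [event_dates[0]]
    for d in event_dates[1:]:
        if (d - current[-1]).days > gap_days:
            episodes.append(current)   # end episode at the last event before the gap
            current = [d]              # later activity forms an independent episode
        else:
            current.append(d)
    episodes.append(current)
    return episodes

# Hypothetical patient history: three fills, then a 120-day gap, then one fill.
dates = [date(2004, 1, 5), date(2004, 2, 3), date(2004, 3, 4), date(2004, 7, 2)]
eps = split_episodes(dates)
# -> two episodes: the first with 3 events, the second with 1
```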
Note that, in defining the choice set, I include only drugs maintaining at least a 0.3% market share. Drugs omitted via the market share cutoff include extremely rare, older generation
treatments, such as MAOIs (monoamine oxidase inhibitors). Their shares are simply too small to
permit estimation and to collect price information across distinct plans. The choice set, however,
did evolve over the sample period. In July 2003, physicians chose among 17 products that maintained at least a 0.3% share in the overall market. Citalopram, the generic form of Celexa, entered
in November 2004, and a new SNRI, Cymbalta (Duloxetine), entered in August 2004, providing 19
drug options by the end of the sample period. I allow the patient to face the latest choice set at
each decision point in his observed episode.
2.3 Variation for Model of Initial Choice
The model of the physician’s initial treatment choice conditions on the insurer’s cost, the patient’s
copayment, drug effi cacy, side effects, and dosing convenience. The Marketscan database contains
the important price variables: its insurer cost variable captures the cost associated with a drug
excluding the dispensing fee, sales tax, and rebates to the payer; the copayment reported is the
plan-specific and drug-specific cost patients face when filling a prescription. In Figure 1, I plot
by drug and year the average patient copayment per day supply, taking the mean across plans in
the sample. The error bars showcase the standard deviation of the prices across available plans.
8Roughly 8% of the unique patient identifiers repeat in a new episode over the course of my sample period. The initial choice in these repeat episodes contributes to the first stage model, and may bias the mean estimates if they represent a selected sample of drugs. In the data, however, the repeat episodes begin with a distribution of drugs similar to drugs chosen at the start of non-repeated episodes.
Copayments vary along three dimensions. First, for most drugs, average copayments increase over
the sample period. Second, across products in a given year, copayments vary considerably, most
sharply when comparing branded drugs to generic drugs. The largest copayments tend to be for
branded drugs with available generic equivalents. Third, for a given product and year, the large
standard error bars suggest important variation across plans in the data. Average plans charge
patients $20-30 for a 30-day supply of a branded drug and about $5-10 for a 30-day supply of a
generic drug. The most generous plans require copayments of $0-5 while the least generous require
levels of $50 or more for branded drugs. Given this important variation, I use plan-specific
and year-specific prices in later estimation. The insurer cost varies similarly, as reported in Table
1. Generics cost as little as $5-8 per month supply, with an average near $40; patented drugs cost
two to three times as much.
Information on side effects and daily dosing comes from Gartlehner et al. (2007). They examine
2,099 relevant citations from the medical literature, collecting information on treatment effects and
side effects by clinical trial. While I use this information to create drug variables, I later compare
the quality of the prescribing advice in this evidence review with information contained in the
switching patterns in my individual-level dataset.
The dosing requirements appear in Table 1. I illustrate the important variation in side effects
by drug molecule in Figure 2. Side effects vary importantly across drug subclasses and, to a lesser
degree, among drugs in a given subclass. Choice behavior both initially and in the learning model
I build later will relate to this variation in side effects9.
Finally, in Panel 2 of Table 1, I include statistics on the individual characteristics reported in
the Marketscan Database that enter the longer specification of the initial choice model. 27% of
patients suffer from major depression, the most severe diagnosis examined; 26% of patients observed
visit a psychiatrist, with 47% visiting general practitioners and the remaining share visiting other
specialists, such as obstetricians. For plan types, about one third of patients receive coverage under
a capitated Health Maintenance Organization (HMO) plan, 14% have non-capitated HMO plans,
and the remaining 53% have Preferred Provider Organization (PPO) or comprehensive plans with
fewer restrictions on office visits and other aspects of care.
9When a patient reacts poorly to a drug with high nausea, the learning model downweights drugs with a similar nausea profile in the next period's decision. This reflects physician practice: when a patient complains of a side effect, physicians report that they look for an alternative less likely to produce the same complaint. Note that the biological mechanism that generates side effects is poorly understood. The behavior of physicians appears to be based on anecdote and/or malpractice concern rather than scientific evidence.
2.4 Variation for Learning Model
Identification of the learning model parameters requires a sufficiently rich set of observed switches
across products and subclasses with distinct characteristics. I summarize the available switching
in the panel data in Table 2. The table lists the fraction of individuals who (1) persist, (2) switch
to another molecule within class, (3) switch to another subclass, or (4) exit treatment at period
t, conditional on the subclass choice at (t − 1)10. The statistics show that individuals tend to
persist on treatment at a rate of 55-70%. The degree of persistence increases over the course of a
patient’s episode and depends importantly on the subclass of the initial choice. Individuals persist
at greater rates when beginning on an SSRI or SNRI, relative to NDRIs, TCAs, and NASSAs.
Of those who switch at t, 70-80% stop taking medications altogether, 15-20% switch to another
subclass, and the balance switch to a new molecule within the same subclass.
To focus on the timing of switches, I include in matrix form in Table 3 the share of patients
who remain on the initial choice at different decision points within an episode. Conditional on the
level of adherence, about 10-15% of patients switch at the first decision point. Physicians appear
to learn quickly about a patient’s match to a treatment. That the switching is slightly slower
in longer episodes suggests either better initial matching or that the physician faces greater noise
when trying to learn about a patient’s match from outcomes. I separate these two explanations in
the learning model.
3 Initial Condition
To initialize the learning process over the choice of drug regimen, I estimate a reduced form model of
behavior. The model focuses on the physician, but accounts for her attention to patient preferences,
as in Hellerstein (1998). I incorporate observable information on patient demographics, physician
specialty, and drug characteristics in the starting point for the agent’s learning process.
I relate the first period choice to the important drug features in a linear specification11. The
patient cares about an option’s: (1) copay; (2) effi cacy, summarized by drug subclass indicators; (3)
10Decision points, t, roughly correspond to one month of treatment. In cases where the physician writes a 90 day prescription, the decision point corresponds to 3 months. When a patient returns for follow-up before completing the initial prescription, the decision point corresponds to the observed number of days between office visits.
11Unlike earlier work on drug choice, including Crawford and Shum (2005), I choose a utility form for the physician that assumes risk neutrality. I model risk aversion indirectly via drug fixed effects, which capture the physician's concern about a drug's tendency to cause severe side effects or prompt increased rates of suicide. I choose this setup because switching patterns in the data may not provide the variation necessary to distinguish a risk aversion parameter from the idiosyncratic variance in outcomes that slows the agent's learning process.
side effect profile; (4) daily dosing frequency; and (5) branded status. To simplify computation,
I rely on a discretized version of the nausea variable to proxy for the overall side effect profile
of a drug12. The products are differentiated horizontally, as each patient faces a distinct vector
of copayments and patients differ in their sensitivity to price, efficacy, side effects, and dosing
convenience. I include these variables in $X^{Patient}_{ijt}$, multiplied against random coefficients. The
drug's branded status is constant over time and has a fixed coefficient; it enters $Z^{Patient}_{ij}$:

$$U^{Patient}_{ijt} = a_i X^{Patient}_{ijt} + b Z^{Patient}_{ij}$$
Ultimately, the physician makes a drug selection, incorporating the patient’s utility to a degree.
I define γ as the weight the physician places on the patient's utility, with γ → ∞ implying perfect
agency. In all other cases, she may consider her own private benefits from a treatment choice. I
capture the private benefits to prescribing drug j using product fixed effects. This can include
advertising for j or lower malpractice risk. For example, if a drug’s effects are so variable that
it might actually increase the risk of suicide for some patients, physicians will avoid it and the
estimation will identify a small or negative fixed effect.
Insurers can also alter the physician’s private benefits via supply-side incentives. They may
pay physicians using lump-sum capitated payments or may design bonus schemes that pay out
with reductions in the total costs of a physician’s prescribing. To capture this, I include in the
physician’s utility the insurer’s cost—with a random coeffi cient—and an interaction of a capitation
indicator with this cost. $X^{Physician}_{ijt}$ contains the insurer price; all other effects enter $Z^{Physician}_{ijt}$
with fixed coefficients13:

$$\begin{aligned} U^{Physician}_{ijt} &= c_i X^{Physician}_{ijt} + d Z^{Physician}_{ijt} + \gamma U^{Patient}_{ijt} + \varepsilon_{ijt} \\ &= (\gamma a_i, c_i) X_{ijt} + (\gamma b, d) Z_{ijt} + \varepsilon_{ijt} \\ &= \beta_i' X_{ijt} + \alpha' Z_{ijt} + \varepsilon_{ijt} \end{aligned}$$
Holding constant the observables, $(X_{ijt}, Z_{ijt})$, physicians may still choose a distribution of products
due to two features. First, I include an idiosyncratic extreme value error, $\varepsilon_{ijt}$, that captures
time-varying unobservables, such as whether the physician has manufacturer samples on hand.
12In interviews I conducted, physicians report switching drugs most often following complaints of nausea and insomnia. As a sensitivity, I replace nausea with insomnia in the specification. The results change very little.
13In the main specification, the fixed effects do not vary over time. I allow the fixed effects to vary by six month intervals as a sensitivity; the qualitative results are similar.
Second, I introduce individual heterogeneity via the $\beta_i$ coefficients on price, side effects, subclass
indicators, and dosing frequency. Mixing over the distribution of sensitivity parameters produces
correlations in the unobserved portion of utility for products with similar observed characteristics.
This breaks the Independence of Irrelevant Alternatives property of a simple logit model, which is
undesirable in this setting (McFadden and Train (2000)). I use a Bayesian Markov Chain Monte
Carlo (MCMC) procedure for estimation, as described in Section A of the Technical Appendix. To
reduce computation time, I estimate the model using a random sample of 50,000 patients drawn
from the raw data, holding on to the entire time series for the selected patients. Note that I cannot
separately identify the agency parameter, γ; in the estimates, I look for non-zero coefficients on the
fixed effects and the insurer price variable to suggest physicians fail to act as perfect agents.
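The mixed-logit structure described above can be sketched by simulation: mixing over draws of the random coefficients produces choice probabilities whose substitution patterns depend on shared characteristics. All numbers below are hypothetical illustrations (the characteristics, fixed effects, and coefficient moments are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical characteristics for 3 drugs: [log copayment, nausea score]
X = np.array([[2.0, 0.8],
              [2.5, 0.7],
              [1.0, 0.1]])
alpha_Z = np.array([0.5, 0.5, 0.0])   # fixed effects (e.g., branded status)

beta_mean = np.array([-0.4, -0.5])    # mean sensitivities to copay and nausea
beta_sd = np.array([0.3, 0.3])        # heterogeneity across physicians

# Mix over the coefficient distribution: simulate utility indices per draw,
# form logit probabilities conditional on each draw, and average.
draws = rng.normal(beta_mean, beta_sd, size=(10_000, 2))
V = draws @ X.T + alpha_Z             # utility index per draw and drug
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
shares = P.mean(axis=0)               # predicted market shares
```

Averaging logit probabilities over coefficient draws induces correlation in the unobserved utility of drugs with similar characteristics, which is what breaks the IIA property of the simple logit.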
Four key findings emerge from the parameter estimates, reported in Table 4. First, physicians
respond strongly to patient price: the coefficient on log copayment is -.38 and highly significant.
The estimated standard error of the random coefficient is .28, suggesting that 92% of the normally
distributed physician sensitivities fall below zero. Physicians may respond to copayment levels
due to requests from patients or after receiving so-called "blue letters"—warnings from an insurer
when a physician prescribes expensive medications too often. I report the average elasticities by
drug in Table 5. They fall near the center of the range of -.8 to -.1 identified in a meta-regression
analysis by Gemmill et al. (2007). Demand for off-patent branded drugs and older-generation
treatments, like TCAs, proves most elastic. The newest on-patent drugs and effective generics have
more inelastic demand.
Second, the coefficients on subclass indicators show that physicians have preferences over classes,
despite little evidence from clinical trials that second-generation antidepressants differ in efficacy.
Holding constant nausea side effects and dosing, physicians prefer "inside good" drug treatments
over the outside option of non-pharmacologic care, and they prefer SSRIs over other medications.
This preference may stem from unobserved drug features, like fewer interaction risks with other
medications. The mean effects will feed into the dynamic model and influence the design of a new
treatment protocol.
Third, paying physicians by capitation can influence decision making. Specification (2) includes
an interaction that allows capitation to change the physician’s sensitivity to insurer costs. The
coefficient of this interaction equals -.25 and is highly significant. The direction implies that
physicians paid by capitation are more sensitive to the total cost of a drug. I illustrate in Figure
3 the change in preferences from expensive branded medications towards effective generics, such as
Fluoxetine (Prozac), and to non-drug care when physicians accept capitated payments. There are
multiple hypotheses to explain precisely how capitation influences decisions. Dickstein (2010) tests
these hypotheses using both patient and physician-level data sources.
Fourth, the drug-specific fixed effects in the model suggest a possible role for sustained advertising
in the physician’s choice. That the coefficients are non-zero also provides some evidence
that physicians fail to act as perfect agents. In Table 6, I compare the levels of drug-specific fixed
effects with drug-level promotional data from the pre-sample years of 2001 to 2003^{14}. The data
include both manufacturer expenditures on detailing by drug and the volume of samples provided
to clinicians. A comparison of the fixed effects against advertising suggests a positive correlation of
about 0.44. Lexapro (Escitalopram), Zoloft (Sertraline), and Effexor (Venlafaxine) have among the
highest estimated fixed effects and also rank in the top three or four among antidepressants in both
detailing and sampling. The strong fixed effect for Fluoxetine, the generic version of Prozac, likely
reflects past advertising for Prozac before it lost patent protection in 2002, the year prior to my
sample. I exploit the correlation between advertising and the fixed effects in later counterfactuals.
4 Physician Learning
After the initial drug selection, the physician and patient face a continuation decision. The patient
may stop taking medication or may consult his physician, who can choose to switch the drug
treatment or recommend the patient continue on the initial choice. Capturing this decision as a
function of drug attributes and incentives is crucial in understanding how insurer policies influence
adherence and long-run treatment costs.
Fitting the switching behavior in the data requires three components. First, I specify a linear
utility function conditioned on outcomes and known drug features, like price. Second, I describe
the updating process the physician uses to learn from past experience. Finally, I specify a decision
rule for her to employ after updating^{15}.

^{14} Personal Selling Audit/Hospital Personal Selling Audit. Verispan, Inc., 2003. Also, Integrated Promotional Services, IMS Health, 2003.
^{15} Note, in the learning process, I assume the patient can remain in treatment or exit care, but cannot switch to another physician during a treatment episode. As a result, changes in medication reflect learning by the initial physician. I eliminate episodes from the sample that involve switching from a general practitioner to a specialist; this occurs in less than 1% of episodes.
4.1 Utility and Outcomes
I define total utility, y_{ijt}, for individual i under treatment j in period t as a function of: (1) "fixed"
elements, f_{ijt}; (2) a measure of the effectiveness of drug j for patient i, v_{ij}; and (3) a random noise
component, W_{it}:

y_{ijt} = f_{ijt} + v_{ij} + W_{it}
The variables in f_{ijt} include patient copayments, insurer prices, the branded status of j, and
drug fixed effects. The components of f_{ijt} are fixed, in that both the agent and the econometrician
know their values at time t—physicians do not learn about them from experience. I add an
idiosyncratic term, ε_{ijt}, to f_{ijt} that captures, for example, the availability of manufacturer samples.
The econometrician does not observe this element; it follows a standard Gumbel (extreme value)
distribution. I apply coefficients estimated in the initial stage model to form f_{ijt}, implying that
preferences are stable over time and well captured using the cross section of initial choices. The
fixed component itself varies over time due to the idiosyncratic term and to changes in copayments
and insurer prices. Given its components, f_{ijt} reflects the agency relationships between the patient,
physician, and insurer. Previous work on learning in drug choice often neglects this consideration
due to computational or data constraints.
The second component, v_{ij}, captures the patient’s match to drug j in terms of health. Since
I do not observe the patient’s medical record to identify his reaction to a drug, I break v_{ij} into a
set of covariates, U_j, to proxy for features that might cause the physician to reevaluate a patient’s
treatment regimen. In interviews I conducted, physicians outlined three reasons for switching: the
initial drug failed to improve the patient’s symptoms; it led to intolerable side effects; or, it was
inconvenient to take at the required daily intervals. Thus, the variables in U_j include: (1) drug
subclass variables to proxy for efficacy; (2) a discretized nausea side effect variable built from the
clinical trials literature^{16}; and (3) a dosing frequency indicator. I interpret switches according to
changes in these covariates. For example, an observed switch from subclass A to B implies the
physician now views subclass B as more effective for the patient following the initial experience.
I use a linear parameterization of the health outcome term: v_{ij} = U_j β_i. The characteristics of
drug j, U_j, are known to the physician and the econometrician. However, neither knows ex ante
how the patient will react to these characteristics—i.e., neither knows the true value of β_i. The
physician forms a prior distribution on the vector β_i to update with patient i’s experience. β_i has
a normal prior, with mean β_{i,0} set equal to the initial condition estimates^{17}, and with covariance
matrix V_0. The elements of β_i are independent, such that V_0 is a diagonal matrix. I choose a
normal distribution for β_i to build a Bayesian learning model in the style typical of the literature
on experience goods.

^{16} Again, I combine side effects into a single nausea-derived variable for computational purposes; the results change little by substituting insomnia for the nausea variable.
Finally, W_{it} is an unknown random shock to y_{ijt} that is independent of the components in
f_{ijt} and v_{ij} and independent across (i, j, t). It follows a normal distribution with variance σ²_W and
a mean set equal to zero without loss of generality. It reflects environmental noise affecting the
individual’s health outcome. For example, a patient might lose his job or have a family crisis
during the treatment period, changing his outcome independent of the treatment. As a result
of this noise, the physician learns only imperfectly from observed experience about a patient’s
sensitivity to attributes of the drug taken^{18}.
The sum of v_{ij} and W_{it} forms the uncertain outcome realization, which I denote ỹ_{ijt} = y_{ijt} − f_{ijt} =
v_{ij} + W_{it}. Conditional on β_i, it follows a normal distribution with variance σ²_W. Combining over
t = 1, ..., T independent realizations, the distribution appears as:

(ỹ_{i,1}, ..., ỹ_{i,T})′ | d_{i,1}, ..., d_{i,T}, β_i ∼ N(U β_i, σ²_W I_T)

where d_{i,t} is the identity of the individual’s choice at t, and row t of U holds U(d_{i,t}), the vector of
characteristics of that choice.
4.2 Bayesian Updating
The physician begins with uncertainty about the patient’s sensitivity to a set of drug characteristics,
β_i. She updates based on outcome realizations to find a posterior on β_i to use in selecting the
next treatment. Using Bayes’ rule, the posterior for β_i at time (T + 1) is proportional to the product
of two components: a likelihood on the T outcome realizations to date, Y_{i,T} = (ỹ_{i,1}, ..., ỹ_{i,T})′,
conditional on β_i, and a prior distribution on β_i:

P(β_i | Y_{i,T}, U) ∝ L(Y_{i,T} | β_i, U) g(β_i | β_{i,0}, V_0)

where U is a T×K matrix holding in row t = 1, ..., T the covariates for the drug chosen at that
time period. Recall, I specify a normal prior on the K-dimensional β_i and independent normal
distributions for W_{it}:

β_i ∼ N(β_{i,0}, V_0)
Y_{i,T} | U, β_i ∼ N(U β_i, σ²_W I_T)

^{17} β_{i,0} reflects the initial distribution of β_i conditional on individual i’s first period choice. This conditional distribution has no closed form; I take simulation draws from it via a procedure discussed in Section D of the Technical Appendix.
^{18} Randomness in outcomes may also be drug specific. For example, generic drugs may lead to more variable outcomes, if the manufacturing process of distinct generic suppliers changes the speed at which the drug circulates in the body. In the current model, I consider this variation a fixed drug characteristic; it enters the outcome through drug fixed effects and the branded indicator variable in f_{ijt}. An extension to the current approach might replace W_{it} with W_{ijt}, allowing distinct variances for branded and generic drugs.
Given the normal likelihood and prior, the posterior is normal with parameters (β*_{i,T}, V*_{i,T}) after
updating from T periods of experience:

β_i | U, Y_{i,T} ∼ N(β*_{i,T}, V*_{i,T})
V*_{i,T} = (V_0^{−1} + U′U/σ²_W)^{−1}
β*_{i,T} = V*_{i,T} [V_0^{−1} β_{i,0} + (U′U/σ²_W) β̂_i], where β̂_i = (U′U)^{−1} U′ Y_{i,T}
The updating process is akin to a Bayesian analysis of the classical regression model, here a
normal linear model (see Gelman et al. (2004) or Zellner (1971)). This appears clearly from the
manner in which Y_{i,T} enters the posterior mean—the predicted β̂_i uses the standard formula for
a regression coefficient. The updated variance, V*_{i,T}, combines two sources of variation. The
first, V_0^{−1}, reflects uncertainty about the true value of β_i; the second term is due to fundamental
variability from the model—Uβ_i cannot capture Y_{i,T} exactly.

The values of the posterior parameters over time depend on the rows of U and on the size of σ²_W.
First, if a patient takes drug j for multiple periods or takes drugs that share many characteristics,
the rows of U will be similar. Since the elements of U are indicator variables, U′U will be larger,
meaning V*_{i,T} will shrink. Second, the larger is σ²_W, the less information physicians gain from
experience, since large σ²_W implies noisy outcomes. From the expressions for (V*_{i,T}, β*_{i,T}), the terms
involving σ²_W will go to zero. The posteriors (β*_{i,T}, V*_{i,T}) then approximately equal their priors,
(β_{i,0}, V_0).
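The conjugate update above is straightforward to compute. Below is a minimal sketch in Python; all dimensions and parameter values are illustrative assumptions, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: K drug characteristics, T periods of experience.
K, T = 4, 3

# Prior on the patient's sensitivity vector beta_i ~ N(beta_0, V_0),
# with V_0 diagonal as in the model; sigma2_W is the outcome noise variance.
beta_0 = np.zeros(K)
V0 = np.diag(np.full(K, 4.0))
sigma2_W = 9.0

# U stacks, in row t, the (indicator) characteristics of the drug chosen at t;
# Y holds the de-meaned outcome realizations.
U = rng.integers(0, 2, size=(T, K)).astype(float)
beta_true = rng.normal(beta_0, np.sqrt(np.diag(V0)))
Y = U @ beta_true + rng.normal(0.0, np.sqrt(sigma2_W), size=T)

# Conjugate normal-normal update, matching the formulas in the text:
#   V*_{i,T}    = (V_0^{-1} + U'U / sigma2_W)^{-1}
#   beta*_{i,T} = V*_{i,T} (V_0^{-1} beta_0 + U'Y / sigma2_W)
# (Note U'Y = U'U @ beta_hat, so this equals the expression with beta_hat.)
V0_inv = np.linalg.inv(V0)
V_post = np.linalg.inv(V0_inv + U.T @ U / sigma2_W)
beta_post = V_post @ (V0_inv @ beta_0 + U.T @ Y / sigma2_W)
```

With more periods on similar drugs, U′U grows and the posterior variance shrinks; a large σ²_W slows this contraction, as the text describes.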
4.3 Physician’s Decision
After updating her beliefs on β_i conditional on outcome realizations, Y_{i,T}, from periods t = 1, ..., T,
the physician must choose an option in period T + 1. The simplest decision rule for the physician
maximizes current expected utility in choosing the next treatment:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T})

Note that at T + 1, only experiences from t = 1 up to T enter the update of β_i. I label this rule
"forward-myopic," since the physician learns from experience but fails to consider the future value
from experimenting with products with lower expected means but highly variable outcomes.
In the data, physicians sometimes switch patients to drugs that the initial distribution of parameters
suggests are poor substitutes. Patients often remain on them for several periods. To
rationalize this pattern, the forward-myopic model requires very large variances on the estimated
coefficients, β_i. That is, the patient’s outcome would have to be quite extreme for the physician to
update her priors in a way that raises the expected outcome of the weaker option above all others
in the choice set.
I rationalize the observed patterns using a second approach common in previous research on
dynamic demand for prescription drugs. Rather than selecting treatments myopically, the physician
looks forward, accounting for the patient’s outcome over his entire episode of care. She experiments
with lesser-known options that may produce superior outcomes for a patient. To capture this
trade-off, I add the variance of a choice to the decision rule:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T}) + h(V(y_{i,j,T+1} | U, Y_{i,T}), ·)

where the h(·) function increases in V(·), the posterior variance on outcomes. I generate formally
a decision rule that matches this form using a multi-armed bandit solution to the choice problem
of a forward-looking agent. Below, I walk briefly through the steps to rationalize use of this rule.
4.3.1 Dynamic Choice Problem Set-up
A forward-looking physician maximizes the expectation of the discounted sum of future utility flows,
Y_{jt}, by product and time period. This expected utility at t depends on the choices in previous
periods, since the physician uses experience to update her beliefs on the unknown coefficients in
the utility function. I summarize the parameters of the distribution of these coefficients at t in
θ_j(t) = (β*_{i,t−1}, V*_{i,t−1}), based on realizations from t − 1 periods. The learning process begins with
a specific prior distribution, Π, on θ_j = θ_j(1). Succinctly, the physician solves:

max_d ∫ E_θ [ Σ_{t=0}^{T−1} δ^t Σ_{j∈J} d_{jt} · Y_{jt}(θ_j(t)) ] dΠ(θ_1, ..., θ_J)

subject to the Bayesian updating rules defined earlier. Here, d_{jt} = 1 when the physician
chooses drug j at t, and δ is the discount rate.
One can use Bellman equations to solve this problem. With this solution concept, the physician
follows backward induction, considering all possible treatment paths and selecting an option today
with the highest expected future stream of rewards. As the number of options available increases,
it grows increasingly unrealistic to presume the decision maker can compute and rank all these
future paths. For the econometrician, even newer empirical techniques, including those surveyed by
Aguirregabiria and Mira (2010), quickly become computationally demanding given the discounted
infinite horizon and the 20-member choice set in my example.
4.3.2 Simplifying via Gittins’ Index
Rather than solving for the unknown parameters using the dynamic programming
equations directly, I rely on a clever simplification of the problem described by Gittins and
Jones (1979) and Gittins (1979). They prove that in an infinite horizon setting with independent
prior distributions on (θ_1, ..., θ_J), one can break the underlying J-dimensional problem into J
one-dimensional problems. That is, rather than considering all possible streams of future outcomes
in making today’s treatment choice, the physician can break the problem into J continue-quit
decisions, one for each choice. A single number or "index value" summarizes each continue-quit
decision; the physician needs only to order these J index values, choosing the option in each period
with the largest index. This procedure provides the same solution as the dynamic programming
approach if the following conditions hold: (1) the decision-maker selects only one option at t; (2)
options not chosen remain in their initial state; (3) each option is independent; and (4) options not
selected do not contribute to the individual’s outcome (Mahajan and Teneketzis (2007)).
Intuitively, this procedure works because the conditions allow a "forward-induction" process.
Since the options are independent and those not chosen initially will remain available at later dates
in the same condition, one can reorder the choice sequence without changing the overall outcomes.
That is, prescribing drug A, followed by B, would be equivalent to choosing B then A, up to the
discount rate^{19}. Thus, the physician needs to think only about the value of each drug if it were
taken for the optimal length of time. She then builds a treatment sequence, ordering the drugs
in terms of this value. With updating from past experience, the decision to exit treatment occurs
when the physician and patient determine that the indices for all inside goods fall below that of
the outside option.
4.3.3 Index Rule with Correlation
The antidepressant setting deviates from the classical setting in which index solutions exactly
solve the agent’s dynamic problem. Specifically, the choices are not independent but correlated
according to the covariates outlined earlier, including subclass indicators and side effects. The
existing literature provides solutions in this special case. Lai (1987) and Chang and Lai (1987)
show that with a finite set of distinct options, one can design an index rule that approximates
the exact solution provided by dynamic programming, even when choice-specific parameters are
correlated^{20}. They judge the approximation using a "regret" measure, discussed in Section B of
the Technical Appendix, that accounts for deviations between the sequence of options chosen and
the optimal sequence. The index rule they propose achieves a lower bound on regret—that is, it is
"asymptotically efficient," as defined by Lai and Robbins (1985)^{21}.
Lai’s rule matches the intuition described earlier—the physician balances her impulse to choose
the drug with the highest expected utility against the incentive to experiment with lesser-known
but potentially superior treatments:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T}) + σ_{i,j,T+1} · h(δ)

where E(y_{i,j,T+1} | ·) is option j’s expected utility at (T + 1) and σ_{i,j,T+1} is the posterior standard
deviation of j’s outcome. Index rules in general require the agent to solve an optimal stopping
problem for each choice; the function, h(δ), approximates the boundary to this continue-quit
decision, which depends on the discount rate. Lai (1987) and Chang and Lai (1987) develop
closed-form approximations to the boundary (see Technical Appendix, Section C). In the discounted
infinite horizon setting, the boundary function declines when physicians discount the future more
heavily. The standard deviation, σ_{i,j,T+1}, weakly decreases over time with experience. Thus, in
cases where the physician has a low discount rate or where previous experience diminished σ_{i,j,T+1},
the decision rule closely resembles the forward-myopic rule.

^{19} The drug choice setting satisfies the conditions required for Gittins’ solution precisely because drugs not chosen remain available in their initial states—the sequencing of the treatments does not change an individual’s reaction to a drug when taking it. This would not be true if the initial drug changed the individual’s physiology such that it foreclosed use of a second treatment. In the antidepressant example, ignoring minor discontinuation periods, physicians can prescribe drugs in any order without changing their treatment effects.
^{20} When options become indistinct, Rusmevichientong and Tsitsiklis (2010) provide an alternative dynamic allocation rule.
^{21} Brezzi and Lai (2002) quantify "regret" for a forward-myopic rule. While Lai (1987) possesses regret of logarithmic order as δ approaches 1, the forward-myopic rule has larger regret, of order (1 − δ)^{−1}. In cases where the future horizon is short, the two rules likely perform similarly. See Section B of the Technical Appendix for details on the approximation.
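A minimal sketch of this index rule follows. The function name and example values are hypothetical, and whether the idiosyncratic noise σ²_W enters the posterior outcome variance is my reading of the text's V(y) notation rather than a detail the paper pins down:

```python
import numpy as np

def index_values(f, U_chars, beta_post, V_post, sigma2_W, h_delta):
    """Index for each of J options: expected utility plus a bonus
    proportional to the posterior standard deviation of the outcome."""
    mean_util = f + U_chars @ beta_post
    # Posterior outcome variance: parameter uncertainty U V* U' plus,
    # by assumption here, the idiosyncratic noise sigma2_W.
    var_out = np.einsum('jk,kl,jl->j', U_chars, V_post, U_chars) + sigma2_W
    return mean_util + np.sqrt(var_out) * h_delta

# Hypothetical two-option example: option 0 has the less certain match.
f = np.array([1.0, 1.0])
U_chars = np.eye(2)
beta_post = np.zeros(2)
V_post = np.diag([4.0, 1.0])
idx = index_values(f, U_chars, beta_post, V_post, sigma2_W=1.0, h_delta=0.5)
# The physician chooses argmax(idx); with equal means, the less-known
# option 0 carries the larger experimentation bonus. As h_delta -> 0,
# the rule collapses to the forward-myopic rule.
```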
5 Algorithm
5.1 Forming Choice Probabilities
I combine information from the initial condition and the learning model to form choice probabilities.
The probabilities begin with the index rule from Lai (1987). Since the rule balances the expected
mean utility of a choice with its posterior variance, it depends directly on f_{ijt}, the fixed component of
mean utility; on (β_{i,0}, V_0), the priors in the effectiveness component; and on past random outcome
realizations, Y_{i,t−1}. Recall, f_{ijt} includes an additive extreme value error. The probabilities
therefore take a logit form:

h_{ijt}(β_{i,0}, V_0, Y_{i,t−1}) = exp(G_{ijt}(β_{i,0}, V_0, Y_{i,t−1})) / (1 + Σ_k exp(G_{ikt}(β_{i,0}, V_0, Y_{i,t−1})))

where G_{ijt} is the index rule^{22}. However, I cannot use this directly in estimation, since, as the
econometrician, I do not observe either Y_{i,t−1} or β_{i,0}. I therefore integrate out over their distributions.
I showed earlier that Y_{i,t−1} follows a normal distribution with variance σ²_W I_{t−1} and mean
Uβ_i. Integrating over its distribution, the choice probabilities appear as:

∫_{Y_{i,t−1}} h_{ijt}(β_{i,0}, V_0, Y_{i,t−1}) f(Y_{i,t−1} | β_{i,0}, σ²_W) dY
β_{i,0} should come from the initial stage mixed logit model. However, again I do not have a direct
estimate of it. Since the initial stage relies only on a single observed choice per individual, I can
consistently estimate the population parameters, (b, W), of the distribution of β_{i,0} but not β_{i,0}
itself—the number of choices observed per agent would have to increase to apply the Bernstein-von
Mises theorem. Instead, I use Bayes’ rule to transform the consistent population parameters
from the initial stage into draws from a distribution of β that conditions on the observed drug
choice in the initial period (d_{i1}):

k(β | d_{i1}, X_i, Z_i, b, W, α) = P(d_{i1} | X_i, Z_i, β, α) g(β | b, W) / ∫ P(d_{i1} | X_i, Z_i, β, α) g(β | b, W) dβ

^{22} I normalize the index for the outside option to zero, such that exp(G_{i,Outside,t}(·)) = 1.
The conditional posterior reflects the distribution of β for the subset of agents who would choose
d_{i1} when (X_i, Z_i) define the choice situation^{23}. This posterior does not have a closed form. I
collect R simulation draws from it via a Metropolis-Hastings algorithm, described in the Technical
Appendix (Section D). Collecting these draws requires no additional effort. Since I estimate the
initial stage mixed logit via a hierarchical MCMC procedure, I have in storage sequences of draws
from this conditional posterior as a by-product of the estimation routine.
Thus, the choice probabilities I use in the likelihood function involve simulation:

P_{ijt} = (1/R) Σ_{r=1}^{R} [ ∫_{Y_{i,t−1}} h_{ijt}(β^r_{i,0}, V_0, Y_{i,t−1}) f(Y_{i,t−1} | β^r_{i,0}, σ²_W) dY ]
I integrate out the unobserved outcomes, Y_{i,t−1}, over the smooth function h_{ijt}(·) via the Gauss-
Hermite quadrature method (see Judd (1998)).
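A one-dimensional sketch of this quadrature step, for a scalar past outcome (the paper integrates over the full vector Y_{i,t−1}; `G_fn` is a hypothetical stand-in for the index rule):

```python
import numpy as np

def expected_choice_probs(G_fn, mu, sigma, n_nodes=15):
    """Integrate logit choice probabilities over a scalar past outcome
    y ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    G_fn(y) returns the (J,) vector of inside-good index values given the
    realization y; the outside option's index is normalized to zero, so
    its exponential contributes the 1 in the denominator."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    probs = 0.0
    for x, w in zip(nodes, weights):
        y = mu + np.sqrt(2.0) * sigma * x   # change of variables
        G = G_fn(y)
        probs = probs + w * np.exp(G) / (1.0 + np.exp(G).sum())
    return probs / np.sqrt(np.pi)           # Gauss-Hermite normalization
```

For a constant index function G_fn(y) = (0, 0), the routine returns the plain logit probabilities (1/3, 1/3), which makes for a quick sanity check.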
5.2 Simulated Maximum Likelihood
With these simulated choice probabilities, I form the following log likelihood to maximize over
choices in periods 2 onward:

log(L) = Σ_{i=1}^{N} Σ_{j=1}^{J_i} Σ_{t=2}^{T_i} d_{ijt} log(P_{ijt})
I proceed with a derivative-free solver to optimize over the diagonal elements of V_0 and the
variance of the idiosyncratic outcome term, with the constraint that the variances be weakly positive^{24}.
I find the appropriate standard errors of the resulting estimates using a bootstrap procedure
described in the Technical Appendix (Section E). In brief, I take random draws from the initial
stage posterior distributions and use each in a bootstrap sampling procedure that computes the
learning model. I account for variation in β_{i,0} away from E(β_{i,0}) and for the fact that I estimate
the underlying population parameters of β_{i,0}’s normal distribution in the initial stage. The bootstrap
variances approximate the asymptotic covariance matrix for sequential estimators derived by
Newey (1984) in the moment condition setting.

^{23} As discussed by Train (2003), conditional on the underlying parameters of β, (b, W), the distribution k(·) is exact—there is no sampling variance. However, I estimate (b_0, W_0) as (b̂, Ŵ). In my bootstrap procedure to estimate the standard errors of the second stage parameters, I account for the variation of (b̂, Ŵ) around (b_0, W_0).
^{24} As a check on the consistency of my estimation via simulated maximum likelihood, I increase tenfold the number of simulation draws used in the optimization. The results change very little.
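The likelihood and positivity constraint can be sketched as follows. All names are hypothetical, and exponentiating the parameters is one standard way to impose weak positivity, not necessarily the paper's:

```python
import numpy as np

def sim_log_likelihood(theta, D, prob_fn):
    """Simulated log likelihood over observed choices from period 2 on.

    theta   : unconstrained parameters; exponentiating enforces the
              weak-positivity constraint on the variances
    D       : iterable of (i, j, t) indices of observed choices
    prob_fn : hypothetical routine returning the simulated probability
              P_ijt of choice (i, j, t) given the variance parameters"""
    variances = np.exp(theta)   # diagonal of V_0 and sigma^2_W, all > 0
    ll = 0.0
    for (i, j, t) in D:
        p = prob_fn(variances, i, j, t)
        ll += np.log(max(p, 1e-300))   # guard against log(0)
    return ll

# A derivative-free routine (e.g. Nelder-Mead) then minimizes
# -sim_log_likelihood over theta.
```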
5.3 Identification
The estimates in the learning model condition on the mean parameters from the initial stage
estimation. In the first period decision problem, each patient faces plan-specific choice characteristics
which also vary by the calendar year of the patient’s visit. The variation in the choice characteristics
produces distinct initial period drug shares, which identifies the fixed coefficients and the
population mean and variance of the random coefficient distribution^{25}.
Using the panel data, I estimate the variances on: (1) the coefficients in the parameterization of
outcomes, and (2) the random shock component of an individual’s outcome. Similar to Crawford
and Shum (2005), the identification of the variances on the outcome coefficients depends on how
drug shares across patients change from the first period of treatment through to later periods. If
the shares of a product change little, then the model would fit smaller variances on the coefficients
of characteristics that compose that product. For example, if a drug in the SNRI class maintains
roughly the same market share at each point across the sample of patient episodes, the variance
on the SNRI indicator’s coefficient will be small. Large changes in the share of a product across
decision points imply physicians knew little about its characteristics before experimentation.
The variance of the idiosyncratic shock, σ2W , relates to the overall rate of learning. With small
variance in the shock, switching will occur faster after the initial period. As the variance in the
idiosyncratic component grows larger, physicians and patients learn relatively less about the quality
of the match to a treatment from experience, and so will persist on it for longer.
6 Dynamic Model Estimation Results and Fit
The new parameters identified from the switching patterns in the panel data include the variance
of the idiosyncratic learning error and the variances of the coefficients in the parameterization of an
^{25} Bajari et al. (2009) provide a formal nonparametric identification argument for the distribution of the random coefficients in a mixed logit model.
individual’s health outcome. I report these estimated variances in Table 7, along with bootstrapped
standard errors.
The variances provide insight into the physician’s priors, conditional on the means fixed from
the initial condition. A large variance on the coefficient for a particular subclass, for example,
implies significant prior uncertainty about the efficacy of the class for any patient. Mechanically, the
larger the variance of such coefficients, the greater is the experimentation incentive in the index
rule for drugs with these characteristics^{26}. This invites physicians to try out the drug. Following
experience, rates of switching away from the treatment will also be high, since the index value
can drop substantially when the posterior variance of a drug’s effect shrinks. In addition, a large
variance on the idiosyncratic component of the outcome implies slow learning—the physician and
patient update very little from experience since the outcome is a noisy signal of match quality.
Looking at the estimates, the NDRI, NaSSA, and TCA subclasses have the largest variances,
much greater than those of SSRIs and SNRIs, holding constant nausea effects and dosing. These
results correspond to the tendency of patients to persist on SSRI and SNRI treatments more often
than other classes in the data, suggesting they offer a good efficacy-tolerability bundle. The
significant variance in the value of trying an "inside good" relates to the speed at which patients
exit drug care—there is a great deal of uncertainty among patients about the value of drug treatment
in general. The smaller but still significant variance on the side effect variable and on dosing—16.0
and 15.2, respectively—again suggests physicians begin with uncertainty about a specific patient’s
sensitivity to these drug characteristics. Finally, the size of the idiosyncratic variance fits the high
observed persistence on treatment of 60-70% across decision points. Experience offers too noisy a
signal of a patient’s true match to a drug to allow substantial updating over time.
I assess model fit in two ways. First, I use the model to predict the aggregate probabilities
of switching to the outside good and across subclasses. I compare predictions from the dynamic
model against both observed transitions and predictions from a static model. To determine the
static predictions, I rerun the mixed logit specification from the initial stage, feeding in the entire
sequence of observed choices for each patient’s episode. The procedure is static in that it ignores
the order of prescriptions chosen within an episode, instead treating each choice as independent.
As illustrated in Figure 4, the dynamic model matches fairly closely the observed pattern of
patient exit from drug care—with the exception of decision point 2. The static model misses both
at decision point 2 and at all future decision points: it predicts a constant share over time, which
the data clearly reject. The dynamic model’s over-prediction of the share of patients in drug care
at the first persist/switch decision suggests the model misses patients’ skepticism of the value of
drug care initially. To fix this, one could model the outside good at point 2 as distinct from the
outside good in later periods. I do not implement this extension due to computational burdens,
but point out that this problem commonly affects both static and dynamic approaches to discrete
choice^{27}.

^{26} Large variance in outcomes from a treatment might be undesirable from a malpractice perspective if it increases the risk of suicide or adverse reactions. As noted in the development of the initial period model, the level of the estimated fixed effect can account for the risk of rare side effects.
Table 8 illustrates a comparison of the static and dynamic models in predicting switching across
drug subclasses. The dynamic model does a fairly good job matching the observed persistence on
a drug class conditional on survival, particularly for SSRIs. It provides a less accurate prediction
for NDRIs. One reason for this may be the use of the mean estimates from the initial reduced form
model in the later learning model. Physicians appear to favor NDRIs for more severe cases; if a
patient’s diagnosis worsens in period 2 or later, NDRIs should have a larger mean and physicians
should prescribe them more often. That the fixed component of an individual’s mean, apart from
price effects, remains stable in the model means it may predict less persistence on NDRIs than
occurs in practice. Note that the static model misses the trend in persistence completely, as it
predicts shares by period independent of past decisions.
In a second fit measure, I compute how often the model correctly predicts an individual’s
observed choice over three decision points. The dynamic model is unlikely to fit perfectly, given
the restrictions placed on the covariate space. However, as shown in Table 9, the observed choice
matches one of the top three predictions for that patient at a rate of 65%. Expanding to the top
five choices, the model matches the observed choice in 75-80% of patients. To capture how well a
static model would perform, I randomly rank the choices in each period using market shares. This
random ranking amounts to sampling without replacement, where I take draws from a hypothetical
urn containing the names of available options in proportion to their market shares. I report these
probability calculations in Table 9. The dynamic model selects the observed choice nearly twice
as often as does the static approximation.
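The urn-style static benchmark described above can be sketched directly: rank the options by sampling without replacement in proportion to market share, then count how often the observed choice lands in the top k. The drug names and shares below are illustrative placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def share_weighted_ranking(shares, rng):
    """Rank the options by sampling without replacement from an 'urn'
    in which each option appears in proportion to its market share."""
    options = list(shares)
    ranking = []
    while options:
        w = np.array([shares[o] for o in options], dtype=float)
        pick = rng.choice(len(options), p=w / w.sum())
        ranking.append(options.pop(int(pick)))
    return ranking

def top_k_hit_rate(observed, shares, k, n_sims=2000, rng=rng):
    """Fraction of simulated rankings whose top k contain the observed choice."""
    hits = sum(observed in share_weighted_ranking(shares, rng)[:k]
               for _ in range(n_sims))
    return hits / n_sims

# Illustrative drugs and shares (hypothetical, not the paper's estimates)
shares = {"fluoxetine": 0.30, "sertraline": 0.25, "paroxetine": 0.20,
          "bupropion": 0.15, "venlafaxine": 0.10}
rate3 = top_k_hit_rate("bupropion", shares, k=3)
```

With five options, the top-5 hit rate is mechanically one, so the benchmark only bites for smaller k.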
Note that the model faces the most difficulty in predicting the exact date the patient will
exit treatment. In the second set of columns in Table 9, I omit from the statistics predictions
at decision points at which an individual exits treatment. For the subset of periods where the
27Specifying a distinct first period error requires replacing the idiosyncratic extreme value term with, say, a normal error that persists and influences later choices. This requires a multinomial probit approach, which I leave for future work.
individual chooses an inside good, the model’s top 3 and top 5 predicted choices match the observed
choice at rates of 80-85% and 85-90%, respectively.
7 Counterfactual Experiments and Patient Welfare
I return to the motivating public policy question, employing the dynamic model to predict the
impact of alternative copayment policies and promotion schemes on the insurer’s net drug costs
and on adherence. The predicted adherence rate depends directly on the learning model: patients
exit care as soon as past experience, along with an idiosyncratic shock, elevates the outside good’s
index value above all inside goods.
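The exit rule in this paragraph reduces to a simple index comparison. A minimal simulation sketch follows; the index values are hypothetical, and a Gumbel shock stands in for the model's extreme value errors.

```python
import numpy as np

rng = np.random.default_rng(1)

def adherence_rate(inside_index, outside_index, n_patients=10_000,
                   n_periods=3, rng=rng):
    """Share of simulated patients still in drug care after n_periods:
    a patient exits as soon as the outside good's index value plus an
    idiosyncratic shock exceeds the best inside good's index."""
    in_care = np.ones(n_patients, dtype=bool)
    for _ in range(n_periods):
        shock = rng.gumbel(size=n_patients)   # extreme value shock
        in_care &= (outside_index + shock) <= inside_index
    return in_care.mean()

# Hypothetical index values: inside goods clearly preferred
rate = adherence_rate(inside_index=3.0, outside_index=0.0)
```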
I simulate the experience of 10,000 sample patients diagnosed in 2005, the latest year in my
data. I run two sets of counterfactuals. In the first, I consider a series of copayment schemes, from
uniform pricing to more detailed "value-based" insurance designs, as described by Chernew et al.
(2007). I focus on copayments since the initial reduced form estimates illustrate that physicians
consider patient prices in their decisions. In the second set of counterfactuals, I simulate the impact
of promotion for generics and for the branded medications, Zoloft (Sertraline) and Wellbutrin
(Bupropion). I proxy for advertising by changing the product-level fixed effects, which correlate
with manufacturer promotion. The goal of these examinations is to test the effect of specific
promotion schemes given learning over time. I report the simulation results in Table 10. I use
these predictions later to build a new treatment protocol.
To quantify changes in patient health under the alternative policies, I translate the rate of
adherence to an expected number of weeks the patient suffers from depressive symptoms, assigning
a dollar value to the symptomatic relief. I describe the calculation in Section F of the Technical
Appendix. In brief, I begin by collecting expected rates of full and partial recovery from depressive
symptoms using Berndt et al. (2002). They survey a group of psychiatric specialists, collecting the
panel’s predictions on the recovery rates for various treatment categories and treatment durations.
I divide the patients in my sample into these same categories, using my model to predict the initial
drug each patient uses and his duration on that treatment. The number of patients in each category
will differ by counterfactual policy, since the policy alters the model’s predictions of an individual’s
treatment choice and adherence. The recovery rates for a patient determine the expected number
of weeks in the year following the diagnosis that he will suffer from depression. I rely on Lave et al.
(1998) and Cutler (2004) to translate this expected number of weeks into a dollar measure. In the
baseline case, the distribution of treatments prescribed leads to an average reduction in depressive
symptoms worth $366, versus a monthly drug cost bill of roughly $42. I compare the dollar value
of symptomatic relief under the counterfactual policies against the cost of drug care in Table 11.
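The mapping from adherence to dollars can be sketched with placeholder inputs. The paper's calibrated figures come from Berndt et al. (2002), Lave et al. (1998), and Cutler (2004); the recovery rates, weeks, and per-week values below are hypothetical.

```python
def expected_symptom_weeks(p_full, p_partial, weeks_in_year=52,
                           partial_relief=0.5):
    """Expected weeks of depressive symptoms in the year after diagnosis,
    given the probabilities of full and partial recovery for the patient's
    treatment category (partial recovery counted at half relief here)."""
    relief_share = p_full + partial_relief * p_partial
    return weeks_in_year * (1 - relief_share)

def relief_value(weeks_untreated, weeks_treated, value_per_week):
    """Dollar value of symptomatic relief relative to no treatment."""
    return (weeks_untreated - weeks_treated) * value_per_week

# Hypothetical inputs, not the paper's calibrated figures
w = expected_symptom_weeks(p_full=0.4, p_partial=0.3)  # 52 * (1 - 0.55) = 23.4
v = relief_value(weeks_untreated=40.0, weeks_treated=w, value_per_week=22.0)
```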
7.1 Insurance Design
I consider the utility of three alternative copayment policies in reducing insurer costs and improving
patient welfare. In the first, I set a uniform copayment of $0 for all treatments. In the second, I
set the copayments for generic drugs to $0 and branded drugs to the 95th percentile level, about
$50 per month supply. In the third, I set a "value-based" policy, as Chernew et al. (2007) suggest.
For the specific SSRIs and SNRIs that lead to the fewest switches and highest adherence in the
model, I set the copayment equal to $0. If any of the drugs in the initial tier are off-patent, I set
the copayment for the branded version to $15 per month supply to discourage their use relative to
the generic version. For all other non-recommended treatments, I set the copayments at $60 per
month supply, above the 95th percentile of observed copayments.
For all three policies, I simulate rates of adherence over three decision points and calculate
net costs to the insurer, which equal the cost charged by the manufacturer less the copayments
paid by patients.28 I translate adherence to a reduction in depressive symptoms, denominated in
dollar terms, and judge the performance of a policy by its effect on the insurer cost/patient health
trade-off.
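As a sketch, the value-based schedule amounts to a three-way tiering rule. The drug names here are hypothetical examples, not the model's recommended set.

```python
def value_based_copay(drug, recommended, off_patent_brands):
    """Monthly copayment under the value-based design: $0 for the
    recommended molecules, $15 for the branded version of an off-patent
    recommended drug, and $60 for all other treatments."""
    if drug in recommended:
        return 0
    if drug in off_patent_brands:
        return 15
    return 60

# Hypothetical tiering for illustration, not the model's recommended set
recommended = {"generic sertraline", "generic venlafaxine"}
off_patent_brands = {"Zoloft"}
copay = value_based_copay("Effexor XR", recommended, off_patent_brands)
```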
Looking first at the $0 uniform copayment policy, the main effect is to increase the use of branded
drugs relative to their generic counterparts. Physicians in the data express strong preferences for
brand names and so prescribe them readily when price effects disappear. On the cost/adherence
score, the policy leads to a 5.5% increase in adherence by period 3, stemming largely from the
increased use of effective SSRIs and SNRIs. However, the improvement comes at an added cost
equal to 50-60% of the baseline level. By lowering all copayments to zero, the insurer both loses
the revenue from these payments and fails to encourage the use of cost-effective treatments. On
welfare, the policy leads to a $8 increase in symptomatic relief above the baseline average of $366,
but at an added monthly cost of $21.
When the generic copayments are set equal to zero and branded rates to the 95th percentile,
the shifting primarily benefits generic SSRIs, which gain 14-23% in share over three periods—double
28Copayments typically return to the insurer indirectly via a bargaining process with the drug manufacturer. Levy (1999) discusses the interactions between insurers, manufacturers, and retail pharmacies in price setting.
the baseline. This increase comes at the expense of branded SSRIs, SNRIs, and NDRIs, which also
cede share to generics in other classes. Costs decrease by 18%, about $8 for a monthly prescription.
Adherence rates, however, improve only marginally, because the policy encourages the use of all
generics, even those in subclasses with poorer efficacy and tolerability. Symptomatic relief falls by
$5.60.
The "value-based" design offers the highest payoff in costs, adherence, and symptomatic relief.
Patients shift to the most effective SSRIs, incentivized by copayments of $0 per month relative
to $15 or $60 per month on alternatives. The share on classes excluding SSRIs and SNRIs falls
19%, about a quarter of the baseline level. As a result, adherence rates increase over 4% above
the baseline by the third decision point. The gain in symptomatic relief equals $11 while monthly
costs remain fairly constant, declining by $0.13. The predictions suggest a "value-based" design
outperforms existing tiered policies, as it boosts patient health without raising costs sharply.
7.2 Promotion
Promotion of particular products can influence physician decisions and alter the efficiency of the
depression care provided to patients. The goal of the following exercises is not to identify the
effect of any particular channel of manufacturer advertising—I lack information at the patient or
physician level on the types of advertising used—but to understand how changing the physician’s
prior information set interacts with her learning process. Increased promotion can in theory either
enhance or detract from patient health and costs. If manufacturer advertising skews prescribing
towards a less effective drug initially, generating poor adherence over time, public policy might
suitably aim to reduce those ads or counter them with promotion based on effectiveness studies.
If, on the other hand, promotion—even of inferior products—leads to greater initiation and therefore
learning about the value of drug treatment in general, adherence may increase. This can translate
to greater symptomatic relief and reduced rates of relapse.
To test the role of promotion, I implement three experiments. In each, I proxy for promotion by
adjusting drug fixed effects, which correlate with average advertising expenditure (see Table 6).29
First, I mimic promotion of Zoloft (Sertraline), a branded SSRI with generally high tolerability and
efficacy but also a high cost. I increase Zoloft’s fixed effect threefold, to proxy for a percentage
increase in advertising to match the drug with the largest observed monthly advertising in 2003.
29Note that the existing level of advertising may be endogenous: if advertising is less effective in encouraging the use of a treatment, manufacturers likely reduce promotion. My goal is to understand precisely why advertising may not change prescribing, looking at physician learning as an impediment.
Second, I increase promotion for Wellbutrin (Bupropion), an example of a branded treatment with
lower average tolerability. I increase Wellbutrin’s fixed effect by the same percentage as for Zoloft.
Third, I reduce the bonus physicians assign to branded drugs and raise the fixed effects for generics
to the mean level. This experiment mimics a scenario where public policy bans branded advertising.
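The two fixed-effect adjustments can be sketched as follows. The fixed-effect values are hypothetical, and the advertising-ban rule is a stylized reading of the experiment described above.

```python
def boost_drug(fixed_effects, drug, factor=3.0):
    """Scale one drug's fixed effect to proxy for increased advertising."""
    fe = dict(fixed_effects)
    fe[drug] = fe[drug] * factor
    return fe

def ban_branded_ads(fixed_effects, mean_level):
    """Stylized proxy for a branded-advertising ban: move every drug's
    fixed effect to the mean level, removing the branded bonus and
    lifting generics."""
    return {d: mean_level for d in fixed_effects}

# Hypothetical fixed effects, not estimates from the paper
fe = {"Zoloft": 0.5, "Wellbutrin": 0.4, "generic fluoxetine": 0.2}
boosted = boost_drug(fe, "Zoloft")
banned = ban_branded_ads(fe, mean_level=0.35)
```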
Increasing Zoloft advertising raises Zoloft’s share to a majority, 52%, from a baseline level of
8%. After three periods, that share remains above 50%. Adherence increases by 5.8% into period 3,
while costs rise by 13-28%. About four-fifths of this increase comes from the redistribution of
treatments toward Zoloft, which carries a higher manufacturer price (before rebates) than generic
alternatives, with the remaining fifth due to improved adherence. While the increase in adherence shows the
value of advertising for effective products, the cost/health trade-off compares unfavorably with
that of "value-based" copayment design: health, in dollars, increases by $13.70, but at an added
monthly cost of $5.60. In addition, if the insurer were to mimic the manufacturer’s advertising
to increase adherence, it would incur promotional expense. To provide some comparison, in 2003,
the manufacturer of Zoloft spent $8.75 million a month on detailing to clinicians, dropping off 1
million sample packages.30
The increase in Wellbutrin’s advertising in this setting leads to a boost in its share from a
baseline of 15% to 39% in period 1. Unlike with Zoloft, the rates of exit after starting on Wellbutrin
are high, as many users depart for SSRIs or the outside good after poor outcomes. Its share falls
from 39% to 20% at decision 2, and to 14% at decision 3. Overall, adherence improves 2.1% over
the baseline case. The simulation illustrates a positive health effect from advertising, even for
less effective products. More patients begin on active treatment, trying out Wellbutrin. In the
process, some learn positively about their match to Wellbutrin or to another inside good, and stick
with drug care as a result. Patients gain $7.50 from the value of reduced symptoms. The policy
is costly, however, since Wellbutrin itself is expensive: drug costs rise 23%, an $8.30 increase in the
average monthly drug bill over the baseline case.
In the final experiment, I find that banning branded promotion has little impact on the physician’s
decision to write a prescription and the patient’s decision to fill it once in hand. The share of the
outside alternative actually rises by about 1%. Instead, the main effect is to shift the distribution of
prescriptions. Generic SSRIs gain 27-40% in share, with less effective generic NDRIs gaining 11.5%
in share in the first period. The change in distribution leads to cost improvement of around 30%.
Adherence falls a bit because the policy fails to increase initiation and incentivizes all generics, even
30Personal Selling Audit/Hospital Personal Selling Audit. Verispan, Inc., 2003.
those with poorer properties. The value of symptomatic relief under the policy falls by $17, about
5% of the baseline.
7.3 Discussion of Policies
Two main findings emerge from the simulations. First, after accounting for learning, the largest
gains in patient outcomes occur when policies promote treatments with above-average efficacy and
tolerability. Costs improve when the policies use either advertisements or lower prices to encourage
the cheapest products within the set of effective options. This is precisely the approach of value-
based insurance design, on the price dimension, and "academic detailing", as described by Soumerai
and Avorn (1990), on the promotion dimension. However, the policies differ in costs. Given
information technology, the creation of multiple tiers for pricing involves little additional expense.
Advertising, however, is costly—a small-scale academic detailing program in Pennsylvania in 2006
required $1 million a year in funding.31 Sending academic detailing agents to clinicians nationwide
would require millions more.
Second, manufacturer promotion may be beneficial as a matchmaker, even if its purpose seems
more to persuade than inform (Anand and Shachar (forthcoming)). The results from the promotion
of Zoloft and Wellbutrin suggest that while promoting an effective product boosts adherence and
symptomatic relief more, promoting a lesser option can improve welfare if it nonetheless encourages
experimentation. Consumers then learn about their match to specific characteristics of the product,
finding better options and adhering at higher rates than if regulators restricted promotion.
Note that I neglect careful modeling of the manufacturer’s pricing and advertising decision, since
I lack both a measure of a drug’s price after rebates as well as advertising expenditures for the
time period of my patient-level data. However, roughly, I predict poorer performance by branded
manufacturers under the policies recommended above, since they encourage greater generic use. In
practice, firms will likely respond to insurer policies strategically. For example, in the depression
setting, manufacturers often introduce new products into their portfolio near the expiration of the
patent on their branded product. These are typically "reformulations," as discussed by Huskamp
et al. (2009). Manufacturers may create a new patented "extended release" form of their existing
product that requires fewer daily doses. The change may be even more subtle: in 2002, Forest
Laboratories introduced Lexapro (Escitalopram) a year prior to the patent expiration of one of its
31Scott Hensley, “Negative Advertising: As Drug Bill Soars, Some Doctors Get An ’Unsales’ Pitch,” The Wall Street Journal, March 13, 2006, sec. A, p. 1.
branded depression medications, Celexa (Citalopram). Lexapro has virtually the same molecular
form as Celexa, but simplifies the chemical structure to remove a redundant component. From my
model, the two drugs have near identical rates of adherence and predicted effectiveness; Lexapro
costs the insurer twice as much as the generic form of Celexa. While outside the scope of the
current paper, a model of the strategic behavior of manufacturers could provide additional insight
into the effect of pricing and promotion policies on social welfare, including manufacturer profits.
7.4 New Protocol
Finally, using the estimated rates of adherence by drug treatment from the baseline model, I build a
new treatment protocol. Rather than relying on estimated treatment effects in clinical trials, I use
the timing and identity of drug switches in the large census of patient experiences to recommend a
set of treatments.
Past theoretical and empirical work supports the notion that adherence or attrition rates can
provide valuable insight on effectiveness. Philipson and Hedges (1998) and Philipson and DeSimone
(1997), for example, argue that randomization and blinding in clinical trials may not actually
produce unbiased treatment effects when trial subjects are Bayesian and actively learn about the
effectiveness of the experimental drug. The subject’s behavior then is treatment dependent, and
will differ systematically between the control and treatment group. Attrition rates, in contrast,
impound all the unobserved aspects of the treatment that make it undesirable. Chan and Hamilton
(2006) judge the value of attrition rates empirically using the setting of a major clinical trial for an
HIV treatment. I extend this notion to adherence rates in a large pool of insurance claims. Since
physicians in practice expend considerably less effort to ensure patient adherence relative to the
organizers of clinical trials, the observed rates of quitting treatment should provide an externally
valid measure of a drug’s effectiveness.
In Table 12, I report the predicted rates of adherence by drug molecule along with the recom-
mendations for antidepressant treatment from three published protocols. In the final column, I
include my own protocol built from the dynamic model’s predictions.32 I use a quit rate of 20%
at the second decision point as an upper threshold for recommending a product. There are two
main distinctions between the model-based protocol and existing recommendations. First, the new
protocol de-emphasizes NDRIs, like Wellbutrin, and TCAs as preferred treatments. Even though
32As a sensitivity, I predict adherence in the setting where all copayments equal $0. The product recommendations built from these rates are similar to those in the main analysis.
clinical trials generally show little difference in efficacy between these treatments and SSRIs or
SNRIs, they prompt high rates of switching in the panel data. Second, while no existing protocol
differentiates among available SSRIs, the model suggests using off-patent molecules Fluoxetine
(Prozac) and Paroxetine (Paxil) over alternatives. This preference builds both from the lower
cost of these drugs and from greater unobserved tolerability. The comparative effectiveness review
by Gartlehner et al. (2007) provides some support for the model’s recommendation: Sertraline
(Zoloft) produced more gastrointestinal side effects than comparable SSRIs, and the model predicts
relatively weaker adherence for it.
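The protocol construction reduces to a threshold rule on predicted quit rates. A sketch follows; the quit rates below are hypothetical, not the values reported in Table 12.

```python
def build_protocol(quit_rates, threshold=0.20):
    """Recommend molecules whose predicted quit rate at the second
    decision point is at or below the threshold, ordered from lowest
    to highest quit rate."""
    kept = [(drug, q) for drug, q in quit_rates.items() if q <= threshold]
    return [drug for drug, _ in sorted(kept, key=lambda pair: pair[1])]

# Hypothetical predicted quit rates, not the values in Table 12
quit_rates = {"fluoxetine": 0.12, "paroxetine": 0.14, "sertraline": 0.19,
              "bupropion": 0.31, "nortriptyline": 0.35}
protocol = build_protocol(quit_rates)
```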
Note that I develop this protocol using the choice set and experience of patients treated in
2005. Recommendations for practice today might more appropriately build from newer data.
However, the results provide evidence that pharmacoeconomic analyses can enhance existing pro-
tocols. Crismon et al. (1999), in their algorithm design project, argued for using such analyses, but
found few credible studies to incorporate. The methodology developed here can help fill this evi-
dence gap cheaply: observational studies allow regulators to compare the effectiveness of different
treatments without the $20,000 to $50,000 per-patient cost common in full-scale trials.33
8 Conclusion
The estimates demonstrate that patient copayments, supply-side incentives, and promotion enter
the physician’s treatment decision. A static analysis that ignores patient and physician learning
would employ these estimates to advocate for high-powered incentives. Commonly-used tiered
copayment schemes, for example, can discourage physicians from prescribing expensive branded
treatments and so lower overall costs in the short term. However, using a dynamic analysis, I
measure the importance of a key trade-off in incentive design: policies which promote cheaper but
less effective options may lead to decreased adherence. When patients quit treatment early, they
suffer from depressive symptoms for longer periods; patient welfare declines and the insurer’s long-
run costs may rise. Managing this trade-off requires "value-based" copayment schemes or academic
detailing that promotes treatment regimens with the highest expected effectiveness and those most
likely to convince skeptical patients of the value of drug care. To define this set of treatments, I
find that adherence rates in observational data can provide guidance beyond outcomes in clinical
trials.
33Brian Gormley, “The Hidden Costs of Running a Biotech Company,” The Wall Street Journal, Sept 21, 2009.
On methodology, my algorithm design allows me to capture the richness of the physician-
patient agency relationship in an initial stage while focusing on the experience good nature of
antidepressants in a subsequent learning model. More generally, the estimation technique provides
a framework to use in judging the effect of policy interventions in other markets where agents
make dynamic purchase decisions. Furthermore, incorporating an index rule to solve the agent’s
dynamic problem reduces computation, particularly in settings where agents face large choice sets
and participate in the market over long time horizons.
References
Ackerberg, Daniel A., “Advertising, learning, and consumer choice in experience good markets:
an empirical examination,”International Economic Review, 2003, 44 (3), 1007—1040.
Aguirregabiria, Victor and Pedro Mira, “Dynamic discrete choice structural models: A sur-
vey,”Journal of Econometrics, 2010, 156 (1), 38—67.
Akincigil, Ayse, John R. Bowblis, Carrie Levin, James T. Walkup, Saira Jan, and
Stephen Crystal, “Adherence to antidepressant treatment among privately insured patients
diagnosed with depression,”Medical Care, 2007, 45 (4), 363.
Anand, Bharat and Ron Shachar, “Advertising the Matchmaker,” The RAND Journal of
Economics, forthcoming.
Bajari, Patrick, Jeremy T. Fox, Kyoo il Kim, and Stephen P. Ryan, “The random
coefficients logit model is identified,” NBER working paper, 2009, (No. 14934).
Bergemann, Dirk and Juuso Valimaki, “Learning and strategic pricing,”Econometrica, 1996,
64 (5), 1125—1149.
Berndt, Ernst R., Anupa Bir, Susan H. Busch, Richard G. Frank, and Sharon-Lise T.
Normand, “The medical treatment of depression, 1991-1996: productive inefficiency, expected
outcome variations, and price indexes,”Journal of Health Economics, 2002, 21 (3), 373—396.
Brezzi, Monica and Tze Leung Lai, “Optimal learning and experimentation in bandit prob-
lems,”Journal of Economic Dynamics and Control, 2002, 27 (1), 87—108.
Chan, Tat Y. and Barton H. Hamilton, “Learning, private information, and the economic
evaluation of randomized experiments,”Journal of Political Economy, 2006, 114 (6), 997—1040.
Chang, Fu and Tze Leung Lai, “Optimal stopping and dynamic allocation,” Advances in
Applied Probability, 1987, 19 (4), 829—853.
Chernew, Michael E., Allison B. Rosen, and A. Mark Fendrick, “Value-based insurance
design,”Health Affairs, 2007, 26 (2), 195.
Ching, Andrew T., “Consumer learning and heterogeneity: Dynamics of demand for prescription
drugs after patent expiration,”International Journal of Industrial Organization, 2010.
Crawford, Gregory S. and Matthew Shum, “Uncertainty and Learning in Pharmaceutical
Demand,”Econometrica, 2005, 73 (4), 1137—1173.
Crismon, M. Lynn, Madhukar Trivedi, Teresa A. Pigott, A. John Rush, Robert M. A.
Hirschfeld, David A. Kahn, Charles DeBattista, J. Craig Nelson, Andrew A. Nieren-
berg, Harold A. Sackeim, and Michael E. Thase, “The Texas Medication Algorithm
Project: report of the Texas consensus conference panel on medication treatment of major de-
pressive disorder,”Journal of Clinical Psychiatry, 1999, 60, 142—156.
Cutler, David M., Your money or your life, Oxford University Press, 2004.
Dickstein, Michael J., “Physician vs. patient incentives in prescription drug choice,”2010.
Efron, Bradley and Robert J. Tibshirani, An introduction to the bootstrap, Chapman and
Hall, 1993.
Erdem, Tulin and Michael P. Keane, “Decision-making under uncertainty: Capturing dynamic
brand choice processes in turbulent consumer goods markets,”Marketing Science, 1996, 15 (1),
1—20.
Fochtmann, Laura J. and Alan J. Gelenberg, Guideline Watch: Practice Guideline for the
Treatment of Patients with Major Depressive Disorder, 2nd Edition, Arlington, VA: American
Psychiatric Association, 2005.
Fryback, Dennis G., Eric J. Dasbach, Ronald Klein, Barbara E. K. Klein, Norma Dorn,
Kathy Peterson, and Patrica A. Martin, “The Beaver Dam Health Outcomes Study: initial
catalog of health-state quality factors,”Medical Decision Making, 1993, 13 (2), 89.
Gartlehner, Gerald, Richard A. Hansen, Patricia Thieda, Angela M. DeVeaugh-Geiss,
Bradley N. Gaynes, Erin E. Krebs, Linda J. Lux, Laura C. Morgan, Janelle A.
Shumate, and Loraine G. Monroe, Comparative effectiveness of second-generation antide-
pressants in the pharmacologic treatment of adult depression, Agency for Healthcare Research
and Quality (US), 2007.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin, Bayesian Data
Analysis: Texts in Statistical Science, Boca Raton, FL: Chapman and Hall/CRC, 2004.
Gemmill, Marin C., Joan Costa-Font, and Alistair McGuire, “In search of a corrected
prescription drug elasticity estimate: a meta-regression approach,”Health Economics, 2007, 16
(6), 627—643.
Gittins, J. C., “Bandit processes and dynamic allocation indices,”Journal of the Royal Statistical
Society. Series B (Methodological), 1979, pp. 148—177.
and D. M. Jones, “A dynamic allocation index for the discounted multiarmed bandit problem,”
Biometrika, 1979, 66 (3), 561—565.
Hellerstein, Judith K., “The Importance of the Physician in the Generic versus Trade-Name
Prescription Decision,”The RAND Journal of Economics, Spring 1998, 29 (1), 108—136.
Huskamp, Haiden A., Alisa B. Busch, Marisa E. Domino, and Sharon-Lise T. Nor-
mand, “Antidepressant reformulations: who uses them, and what are the benefits?,” Health
Affairs, 2009, 28 (3), 734.
Judd, Kenneth L., Numerical Methods in Economics, The MIT Press, 1998.
Karasu, T. Byram, Alan Gelenberg, Arnold Merriam, and Philip Wang, Practice Guide-
line for the Treatment of Patients With Major Depressive Disorder: Second Edition, Arlington,
VA: American Psychiatric Association, 2000.
Kessler, Ronald C., Patricia Berglund, Olga Demler, Robert Jin, Doreen Koretz,
Kathleen R. Merikangas, A. John Rush, Ellen E. Walters, and Phillip S. Wang,
“The epidemiology of major depressive disorder: results from the National Comorbidity Survey
Replication (NCS-R),”JAMA, 2003, 289 (23), 3095.
Lai, Tze Leung, “Adaptive treatment allocation and the multi-armed bandit problem,” The
Annals of Statistics, 1987, 15 (3), 1091—1114.
and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied
Mathematics, 1985, 6 (1), 4—22.
Lave, Judith R., Richard G. Frank, Herbert C. Schulberg, and Mark S. Kamlet, “Cost-
effectiveness of treatments for major depression in primary care practice,”Archives of General
Psychiatry, 1998, 55 (7), 645.
Levy, Roy, The pharmaceutical industry: a discussion of competitive and antitrust issues in an
environment of change, Washington, DC: Bureau of Economics/Federal Trade Commission staff
report, 1999.
Mahajan, Aditya and Demosthenis Teneketzis, “Multi-armed bandit problems,” in Foundations
and Applications of Sensor Management, Springer, 2007.
McFadden, Daniel and Kenneth Train, “Mixed MNL models for discrete response,”Journal
of Applied Econometrics, 2000, 15 (5), 447—470.
Melfi, Catherine A., Anita J. Chawla, Thomas W. Croghan, Mark P. Hanna, Sean
Kennedy, and Kate Sredl, “The effects of adherence to antidepressant treatment guidelines
on relapse and recurrence of depression,”Archives of General Psychiatry, 1998, 55 (12), 1128.
Miller, Robert A., “Job matching and occupational choice,”The Journal of Political Economy,
1984, 92 (6), 1086—1120.
Murphy, Kevin M. and Robert H. Topel, “Estimation and inference in two-step econometric
models,”Journal of Business and Economic Statistics, 2002, 20 (1), 88—97.
Murphy, Michael J., Ronald L. Cowan, and Lloyd I. Sederer, Blueprints: Psychiatry,
Philadelphia, PA: Lippincott Williams Wilkins, 2009.
Newey, Whitney K., “A method of moments interpretation of sequential estimators,”Economics
Letters, 1984, 14 (2-3), 201—206.
Philipson, Tomas and Jeffrey DeSimone, “Experiments and subject sampling,”Biometrika,
1997, 84 (3), 619.
and Larry V. Hedges, “Subject evaluation in social experiments,”Econometrica, 1998, 66
(2), 381—408.
Pomerantz, Jay M., Stan N. Finkelstein, Ernst R. Berndt, Amy W. Poret, Leon E.
Walker, Robert C. Alber, Vidya Kadiyam, Mitali Das, David T. Boss, and
Thomas H. Ebert, “Prescriber intent, off-label usage, and early discontinuation of antide-
pressants: a retrospective physician survey and data analysis,” Journal of Clinical Psychiatry,
2004, 65 (3), 395—404.
Rossi, Peter E., Greg M. Allenby, and Robert McCulloch, Bayesian Statistics and Mar-
keting, John Wiley and Sons, 2005.
Rusmevichientong, Paat and John N. Tsitsiklis, “Linearly parameterized bandits,”Mathe-
matics of Operations Research, 2010, 35 (2), 395.
Soumerai, Stephen B. and Jerry Avorn, “Principles of educational outreach (’academic de-
tailing’) to improve clinical decision making,”JAMA, 1990, 263 (4), 549.
Train, Kenneth, Discrete Choice Methods with Simulation, Cambridge Univ Press, 2003.
Whittle, P., “Multi-armed bandits and the Gittins index,” Journal of the Royal Statistical
Society. Series B (Methodological), 1980, 42 (2), 143—149.
Zellner, Arnold, An Introduction to Bayesian Inference in Econometrics, J. Wiley, 1971.
9 Technical Appendix
9.1 Section A: Mixed Logit via Bayesian methods
The likelihood of yi, the observed choice for individual i ∈ {1, ..., N}, takes the following form due
to the assumption of independent extreme value errors:
$$L(y_i \mid \alpha, \beta_i) = \prod_t \frac{\sum_k d_{ikt}\, \exp(\alpha' Z_{ikt} + \beta_i' X_{ikt})}{\sum_k \exp(\alpha' Z_{ikt} + \beta_i' X_{ikt})}$$
where d_{ikt} = 1 if individual i chooses drug k in period t, and d_{ikt} = 0 otherwise. I rewrite
the likelihood, integrating over β, to find the mixed logit probability, where β follows a normal
density, φ(·), with parameters (b, W):
$$L(y_i \mid \alpha, b, W) = \int L(y_i \mid \alpha, \beta)\, \phi(\beta \mid b, W)\, d\beta$$
The posterior of (b,W, α) is then:
$$K(b, W, \alpha \mid Y) \propto \prod_i L(y_i \mid b, W, \alpha)\, k(b, W)$$
Rather than proceed via a Metropolis-Hastings algorithm directly on the posterior distribution
of (b,W ) using the mixed logit probabilities, I specify a hierarchical distribution that treats βi as
a parameter as well. In notation, the desired posterior would appear as:
$$K(b, W, \beta_i\, \forall i, \alpha \mid Y) \propto \prod_i L(y_i \mid \beta_i, \alpha)\, \phi(\beta_i \mid b, W)\, k(b, W)$$
I employ the procedure described below to draw from this posterior, which simplifies compu-
tation and avoids simulating L(yi|α, β) for each i at each iteration of a direct Metropolis-Hastings
implementation (See Rossi et al. (2005) and Train (2003)). The steps include:
1. Specify Priors
I specify standard conjugate priors for (b, W): b ∼ N(b_0, V_{b0}) and W ∼ inverted Wishart
with K degrees of freedom and scale matrix I_K, where K is the dimension of b_0.
2. Combine the following sequential draws into a Gibbs Sampling routine:
(a) Draw b | W, β_i ∀ i. The posterior is normal, given conjugate priors and the assumption
that the β_i are N realizations from a normal:

$$b \mid W, \beta_i\, \forall i \;\sim\; N\!\left(\frac{\frac{1}{V_{b0}} b_0 + \frac{N}{W}\bar{\beta}}{\frac{1}{V_{b0}} + \frac{N}{W}},\; \frac{1}{\frac{1}{V_{b0}} + \frac{N}{W}}\right), \quad \text{where } \bar{\beta} = \frac{1}{N}\sum_i \beta_i$$
(b) Draw W | b, β_i for all i. The posterior is inverse Wishart with K + N degrees of freedom
and the following scale matrix:

$$\frac{K I_K + N s_1}{K + N}, \quad \text{where } s_1 = \frac{1}{N}\sum_i (\beta_i - b)(\beta_i - b)'$$
(c) Draw α | b, W, β_i for all i. Here, the posterior is K(α | β_i ∀ i) ∝ ∏_i L(y_i | α, β_i) k(α).
Draw from this distribution using a Metropolis-Hastings algorithm involving data pooled across
individuals.
(d) Draw β_i | b, W, α for all i. Here, the posterior is K(β_i | b, W, α, y_i) ∝ L(y_i | β_i, α) φ(β_i | b, W).
Draw from this distribution using a Metropolis-Hastings algorithm. Section D provides
additional detail.
I implement this MCMC method, experimenting with different starting values and parameters
in the prior distribution to achieve convergence. After dropping a burn-in sequence of 75,000
iterations, I retain every 100th draw from a sequence of 20,000 additional draws. The resulting
vectors approximate draws from the distribution of the parameters of interest. I use the draws from
the conditional posterior distribution on βi in step (d) as an input to the second stage estimation
routine.
9.2 Section B: Regret and Asymptotically Optimal Rules
9.2.1 Minimizing "Regret" vs. Maximizing Returns
The optimal adaptive allocation rule from Lai and Robbins (1985) and Lai (1987) involves a slight reformulation of Gittins's solution to the multi-armed bandit problem. The index rule no longer exactly matches the solution to the infinite horizon optimal stopping problem. To discuss the properties of this index rule, the authors rely on a "regret" function that sums the deviations of an agent's sequence of choices from the optimal choice. With the objective of minimizing expected regret, Lai and Robbins (1985) find a lower bound on regret and describe a policy rule that achieves it asymptotically.
To apply the results from Lai (1987) to my setting, I show that a policy that minimizes regret is
equivalent to one that solves the agent’s maximization problem. Thus, if a rule achieves the lower
bound on regret, it best approximates the solution from the traditional dynamic programming
approach amongst all possible index rules. Lai shows the connection between the approaches
generally; I illustrate the intuition below in a two-option example. Specifically, consider the case where Y_t is the outcome at time t. Y_t has a known distribution, f(Y; θ_j), where j ∈ {1, 2}; for example, Y_t = θ_j + ε_t, where ε_t ∼ N(0, 1). Let S_t = Y_1 + ... + Y_t denote the outcome stream, and let the policy rule at time t be ϕ_t, where 1{ϕ_t = j} = 1 when the policy picks j at time t. The policy rule ϕ_t depends on previous outcomes and choices. Let F_s denote the set of policies, ϕ_s, that may follow given past observations. Finally, I denote the mean outcome under j as μ(θ_j). With this set-up:
\[
E(S_t) = \sum_{s=1}^{t} E\bigl\{ E[Y_s(\theta_1)\,1\{\phi_s = 1\} \mid F_{s-1}] + E[Y_s(\theta_2)\,1\{\phi_s = 2\} \mid F_{s-1}] \bigr\}
= \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{t} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{t} 1\{\phi_s = 2\}\Bigr]
\]
The standard approach to the individual's problem involves maximizing these cumulative returns over the horizon, T, with respect to the prior distribution on Θ = (θ_1, θ_2):

\[
\max_{\phi_1,\ldots,\phi_T} \int E(S_T)\,dH(\theta)
= \max_{\phi_1,\ldots,\phi_T} \int \Bigl( \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 2\}\Bigr] \Bigr)\,dH(\theta)
\]
In contrast, Lai (1987) minimizes regret over the finite horizon:

\[ R_T(\theta) = T\,\mu^*(\theta) - E(S_T), \quad \text{where } \mu^*(\theta) = \max\{\mu(\theta_1), \mu(\theta_2)\} \]
For the minimization, integrating with respect to the prior distribution over Θ:

\[
\min_{\phi_1,\ldots,\phi_T} \int R_T(\theta)\,dH(\theta)
= \min_{\phi_1,\ldots,\phi_T} \int \bigl( T\,\mu^*(\theta) - E(S_T) \bigr)\,dH(\theta)
\]
\[
\implies \min_{\phi_1,\ldots,\phi_T} \; -\int \Bigl( \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 2\}\Bigr] \Bigr)\,dH(\theta)
\]

Since the T μ*(θ) term does not depend on the policy, minimizing expected regret is equivalent to maximizing expected cumulative returns.
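The equivalence can also be checked numerically. The sketch below simulates the two-option example with a simple upper-confidence index rule, a generic stand-in for Lai's rule rather than the paper's exact index, and confirms that the policy with the lower expected regret also earns the higher cumulative return; all names are illustrative.

```python
import numpy as np

def run(policy, theta=(0.2, 0.8), T=2000, seed=0):
    # Two Gaussian arms: Y_t = theta_j + eps_t, eps_t ~ N(0, 1).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    pulls = np.zeros(2)
    means = np.zeros(2)
    for t in range(T):
        j = policy(pulls, means, t, rng)
        y = theta[j] + rng.normal()
        pulls[j] += 1
        means[j] += (y - means[j]) / pulls[j]      # running sample mean
    returns = pulls @ theta                        # expected cumulative return
    regret = T * theta.max() - returns             # R_T = T*mu* - E(S_T)
    return returns, regret

def ucb(pulls, means, t, rng):
    # Upper-confidence index rule: try each arm once, then pick the arm
    # with the largest index (sample mean plus an exploration bonus).
    if pulls.min() == 0:
        return int(np.argmin(pulls))
    return int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / pulls)))

def uniform(pulls, means, t, rng):
    # Non-adaptive benchmark: pick an arm at random each period.
    return int(rng.integers(2))
```

By construction, returns + regret = T μ* for any policy, so the return ranking of two policies always mirrors their regret ranking; the adaptive rule's regret stays far below the uniform benchmark's roughly T·(μ* − μ̄).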
9.2.2 Lower Bound on Regret
In the infinite horizon discounted setting, Chang and Lai (1987) provide a lower bound on the
cumulative regret possible given an index rule that samples one of the available choices at each
time period. To simplify notation, with J possible options, let θj be the parameters of choice j,
with µ(θj) the expected outcome under j. The optimal choice, though not known initially, has
θj = θ∗. The goal of the index rule is to suggest a sequence of choices to minimize deviations from
the optimal choice.
Chang and Lai describe a lower bound on the cumulative regret when using an index rule, as δ approaches 1:

\[ \text{Regret}(\theta_1, \ldots, \theta_J) \ge |\log(1-\delta)| \sum_{j:\theta_j < \theta^*} \left\{ \frac{\mu(\theta^*) - \mu(\theta_j)}{I(\theta_j, \theta^*)} + o(1) \right\}, \quad \text{where } I(\theta, \theta^*) = E_\theta\!\left[\log\!\left(f_\theta(y)/f_{\theta^*}(y)\right)\right] \]
The lower bound on regret is specific to the choice of distribution for the outcome function,
fθ(y). The index rule proposed by Lai (1987) and Chang and Lai (1987) attains this asymptotic
lower bound. It achieves the lower bound not only for a fixed (θ1, ..., θJ) as δ approaches 1, but also
uniformly over a variety of prior distributions on (θ1, ..., θJ), including cases where the θ elements
are potentially correlated.
The expression for the lower bound provides the following intuition on the quality of the approximation under Lai's rule. With f_θ(y) a normal distribution, the information number I(θ_j, θ*) shrinks as the outcome distribution under θ_j becomes harder to distinguish from that under θ*; when such an option also carries a meaningful mean shortfall μ(θ*) − μ(θ_j), its term in the sum grows and the minimal regret rises. When the available choices all have θ_j close enough to θ* that the mean shortfalls are negligible, the expected cumulative regret falls toward 0.
9.3 Section C: Closed Form Approximations to the Boundary Function
Gittins and Jones (1979) show that the solution to the discounted infinite-horizon dynamic program-
ming problem follows an index rule, where the decision-maker need only: (1) solve J single-armed
bandit problems, calculating the dynamic allocation index for each choice, denoted Gk(Xk(t)); and
(2) choose the option with the largest dynamic allocation index at any time period t. Here, Xk(t)
reflects the state variables of the choice problem that evolve over time34. At t, the agent chooses
j if and only if:
Gj(Xj(t)) = maxk∈{1,...,J}
Gk(Xk(t))
The dynamic allocation index at t solves the following problem:

\[ G_j(x_j(t)) = \sup_{\tau \ge t} \frac{ E_t\!\left[ \sum_{r=t}^{\tau} \beta^{r-t} R_j(X_j(r)) \,\middle|\, x_j(t) \right] }{ E_t\!\left[ \sum_{r=t}^{\tau} \beta^{r-t} \,\middle|\, x_j(t) \right] } \]
Rj(Xj(r)) reflects the returns from option j given the state variables at time r. The index
itself, Gj(xj(t)), can be thought of intuitively as a retirement option, following the formulation of
Whittle (1980). The decision-maker considers the value of trying an option for τ periods against a
retirement option that makes him just indifferent. The index rule approach then involves choosing
the option with the highest retirement value.
In the case of Bayesian decision makers who update the underlying parameters of the choice
problem from experience at t, the decision-maker’s problem repeats itself at every instant of time.
Rather than hypothesizing playing an option until τ , the optimal action is to recalculate the index
values after each period of experience, again choosing the option with the maximal index.
To compute these indices, Lai (1987) and Brezzi and Lai (2002) provide a closed form solution, relying on a diffusion approximation that computes the Gittins index for a Wiener process. They use a change of variables to set up a Brownian motion problem with an optimal stopping boundary.

The closed form approximation to the optimal stopping boundary in the finite horizon case takes the form:

³⁴ In the drug choice setting, these are the mean and variance of drug j's outcome function, which the physician updates given experience.
\[
h(s) = \begin{cases}
\left\{ 2\log(s^{-1}) - \log\log(s^{-1}) - \log(16\pi) + .99\exp(-.038\,s^{-1/2}) \right\}^{1/2} & \text{if } 0 < s \le .01 \\
-1.58\sqrt{s} + 1.53 + 0.07\,s^{-1/2} & \text{if } .01 < s < .28 \\
-.576\,s^{3/2} + .299\,s^{1/2} + .403\,s^{-1/2} & \text{if } .28 < s \le .86 \\
s^{-1}(1-s)^{1/2}\left\{ .639 - .403\,(s^{-1} - 1) \right\} & \text{if } .86 < s \le 1
\end{cases}
\]

where s = t/T, with T the finite horizon and t the current point in the sequence to T.
In the infinite-horizon discounted case, Chang and Lai (1987) replace s with (1 − β)^{-1} and replace the closed-form h(s) with ψ(s):

\[
\psi(s) = \begin{cases}
\sqrt{s/2} & \text{if } s \le 0.2 \\
.49 - .11\,s^{-1/2} & \text{if } 0.2 < s \le 1 \\
.63 - .26\,s^{-1/2} & \text{if } 1 < s \le 5 \\
.77 - .58\,s^{-1/2} & \text{if } 5 < s \le 15 \\
\left\{ 2\log(s) - \log\log(s) - \log(16\pi) \right\}^{1/2} & \text{if } s > 15
\end{cases}
\]
See the references cited for more details on these approximations.
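The infinite-horizon boundary can be transcribed directly; the sketch below follows the piecewise form tabulated above, not a re-derivation.

```python
import math

def psi(s):
    # Piecewise closed-form approximation to the optimal stopping boundary
    # (Chang and Lai, 1987), as tabulated above; defined for s > 0.
    if s <= 0.2:
        return math.sqrt(s / 2.0)
    if s <= 1.0:
        return 0.49 - 0.11 / math.sqrt(s)
    if s <= 5.0:
        return 0.63 - 0.26 / math.sqrt(s)
    if s <= 15.0:
        return 0.77 - 0.58 / math.sqrt(s)
    return math.sqrt(2.0 * math.log(s)
                     - math.log(math.log(s))
                     - math.log(16.0 * math.pi))
```

Adjacent branches nearly agree at the interior cutpoints (for example, ψ(1) = 0.38 from the second branch versus 0.37 from the third), which is the sense in which the pieces approximate a single smooth boundary.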
9.4 Section D: Metropolis-Hastings Algorithm for Draws on βi
The goal is to approximate draws from the posterior distribution of β_i conditional on the observed first-period choice, d_i1:

\[ k(\beta \mid d_{i1}, X_i, Z_i, b, W, \alpha) = \frac{ P(d_{i1} \mid X_i, Z_i, \beta, \alpha)\, g(\beta \mid b, W) }{ \int P(d_{i1} \mid X_i, Z_i, \beta, \alpha)\, g(\beta \mid b, W)\, d\beta } \]
Algorithm, via Train (2003):
1. Start with a value β_i^0; I set it equal to the population-level mean, b, for all individuals.

2. Draw K independent values from a standard normal distribution and save them in η¹, where K is the dimension of β_i.

3. Create a trial value β̃_i^1 = β_i^0 + ρ·chol(W)·η¹, where W is the population-level variance and chol(x) denotes the Cholesky decomposition of matrix x. Note that I use Ŵ, as estimated in the initial stage, in place of W. ρ is a constant determined iteratively (see note below).
4. Draw a standard uniform, u¹.

5. Calculate the ratio of logit probabilities evaluated at β̃_i^1 and β_i^0, multiplied by the ratio of the normal probability density functions at these two vectors:

\[ F = \frac{ L(d_{i1} \mid \tilde{\beta}_i^1, \cdot)\, \phi(\tilde{\beta}_i^1 \mid b, W) }{ L(d_{i1} \mid \beta_i^0, \cdot)\, \phi(\beta_i^0 \mid b, W) } \]

6. If u¹ ≤ F, accept the trial and set β_i^1 = β̃_i^1; if not, reject it and set β_i^1 = β_i^0.

7. Repeat. For sufficiently large t, β_i^t approximates a draw from the conditional posterior distribution of interest.
Notes:
• I set ρ = .1 to begin, and update it iteratively. If the acceptance rate across individuals is
less than 30%, I shift ρ down by .001; if the acceptance rate is greater than 30%, I shift ρ up
by .001.
• Beginning with the (b, W) estimates from the mixed logit, I "burn-in" 500 draws and then collect 200 draws by retaining only every 100th of 20,000 subsequent draws, to reduce the serial correlation across draws.
9.5 Section E: Procedure for Second Stage Standard Errors
I design a procedure to approximate the 2-step maximum likelihood asymptotic variance-covariance
matrix derived in Newey (1984) and Murphy and Topel (2002). The two-step estimator satisfies
the following conditions on the log-likelihood of the outcomes, Y :
\[ \sum_i \frac{\partial L_1(Y_1; \theta_1)}{\partial \theta_1} = 0, \qquad \sum_i \frac{\partial L_2(Y_2; \theta_1, \theta_2)}{\partial \theta_2} = 0 \]
In cases where θ̂_1 is a consistent estimate of θ_1, with true parameters (θ_1^*, θ_2^*), the second stage maximization of (1/n)∑ L_2(Y_2; θ̂_1, θ_2) with respect to θ_2 is asymptotically equivalent to the maximization of (1/n)∑ L_2(Y_2; θ_1^*, θ_2). Using expansions around (θ_1^*, θ_2^*) and relying on the Central Limit Theorem to find the joint distribution of the partial derivatives of the likelihood functions, one can derive the variance of the limiting normal distribution for θ̂_2. That variance is a complex function of the partial derivatives of the likelihoods, and these derivatives are difficult to compute in my setting. I therefore approximate the solution using a bootstrap resampling procedure (see Efron and Tibshirani (1993)). Simulation here is particularly straightforward because I have in storage draws from the posterior distribution of the mixed logit parameters, a consequence of my Bayesian MCMC procedure from the initial stage. I carry out the following steps:
1. Save the estimated hyperparameters of the β_i normal distribution in the matrix θ, i.e., θ = (E(β_i), var(β_i)). From the Bayesian implementation of the mixed logit estimation, I have S draws on θ saved, along with S draws on the fixed coefficient vector, α. I set S = 100.

2. For each of the ss = 1, ..., S vectors of parameter draws on (θ, α), take a draw on β | d_i1, X_i, Z_i, b, W, α from its conditional posterior distribution using the Metropolis-Hastings approach described in Section D.

3. With (β_i, α) for draw ss on θ, maximize the second stage likelihood for nn = 1, ..., N_sample, using a population of size 3,000 drawn from the full sample of individuals via a multinomial distribution. I choose N_sample = 20.

4. Save the vector of model variances that the optimization finds for each of ss = 1, ..., S and nn = 1, ..., N_sample. The vector of model variances accounts for sampling variation in the initial stage mean parameters and in β_i around its mean. The standard deviation across these points approximates the standard error of the second stage model estimates.
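In pseudocode form, the resampling loop reduces to the sketch below; fit_stage2 is a placeholder for the second stage maximum likelihood routine, and the names are illustrative rather than the paper's actual code.

```python
import numpy as np

def second_stage_se(theta_draws, fit_stage2, n_resample=20, seed=0):
    # For each first-stage posterior draw (step 2), re-fit the second stage
    # on resampled populations (step 3) and take the spread of the resulting
    # estimates (step 4) as the bootstrap standard error.
    rng = np.random.default_rng(seed)
    estimates = [fit_stage2(theta, rng)
                 for theta in theta_draws          # S first-stage draws
                 for _ in range(n_resample)]       # N_sample refits per draw
    estimates = np.asarray(estimates)
    return estimates.std(axis=0, ddof=1)
```

The spread combines both sources of noise: variation across theta_draws captures first-stage sampling error, while variation across refits captures second-stage error.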
9.6 Section F: Calculating the Dollar Value of Symptomatic Relief
To assign a dollar value to the symptomatic relief provided by prescription drug treatments, I carry
out the following procedure.
First, I save the predicted sequence of treatments for 10,000 patients from the simulation of the
six counterfactual policies along with the baseline case. I divide the sample of patients in each
counterfactual into 5 categories based on the treatments used within their episode: (1) no drug
care; (2) use of a TCA for one month or less, followed by exit from drug care; (3) use of a TCA
for greater than one month; (4) use of any non-TCA drug class (SSRI/SNRI/NDRI/NaSSA/SM)
for one month or less; and, (5) use of any non-TCA drug class for greater than 1 month. I choose
these divisions to match a subset of the treatment categories studied by Berndt et al. (2002). They
employ a panel of medical experts to rate the likelihood of both full and partial recovery given a patient's background and a treatment regime and duration. For example, in Table 1 of Berndt et al. (2002), treatment under an SSRI with 1-3 office visits produces a probability of full remission of .20 and a probability of partial remission of .45. If the SSRI treatment lasts for more than 30 days, those probabilities increase to .28 and .60, respectively.
Given the probabilities of full and partial remission for each of the 5 categories above, I calculate
the expected number of weeks in the first year of treatment that an individual suffers either the full
symptoms of depression or partial symptoms. I follow a similar procedure to that of Cutler (2004).
Following the medical literature, I set the recovery rate at 0% in the first two weeks of treatment,
before an antidepressant takes full effect. In every week for the next year, I let the rates of full and
partial recovery for those who have yet to recover be the same as the median rate predicted by the
expert panel in Berndt et al. (2002) for the category of treatment. I then use these probabilities
to calculate the expected number of weeks without full recovery. In the case without drug care, the patient suffers from depression for 6.4 weeks, with partial recovery in an additional 13.8 weeks of the year. The expected weeks of depression and of partial depression under drug treatment differ by the type and duration of care: with less than one month of TCAs or SSRIs, the expected number of weeks equals approximately 4.9 and 12.6; for longer durations on TCAs or SSRIs, the expected number of weeks shrinks to 3.5 and 7.1.
Given the expected duration of complete or partial depressive symptoms, I multiply the savings in "depression weeks" by the quality disutility from depression, estimated in Lave et al. (1998) to equal −.41 for full depression. That is, patients equate 10 years living with depression to roughly 6 years living without depression. As in Cutler (2004), I set the disutility from partial depression at half the full depression rate. Multiplying this savings in utility by $100,000 as the value of a year of life, I find the per patient dollar value in each treatment category. I multiply this value by the share of patients in each category in a particular counterfactual, summing across categories to get a total dollar savings. Dividing by 10,000 yields the average patient's savings under the counterfactual policy from decreased depressive symptoms. I report these results in Table 11. Note that the distribution of treatment types promoted by a counterfactual policy drives the observed differences in the utility gain. When a policy increases adherence and promotes use of SSRIs, for example, patients shift toward treatment regimens with higher recovery probabilities; the effect is to increase the expected number of weeks with symptomatic relief in the counterfactual relative to the baseline, making the benefit from care under the policy higher.
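The core arithmetic of this procedure can be summarized as follows, using the expected-weeks figures quoted above. The category labels are illustrative, and the exact figures in Table 11 additionally average over the simulated patient mix, so this is a sketch of the conversion rather than a reproduction of the reported values.

```python
# Expected weeks of full and partial depressive symptoms in the first year,
# from the figures quoted in the text above.
WEEKS = {
    "no drug care": (6.4, 13.8),
    "short duration": (4.9, 12.6),   # < 1 month on TCAs or SSRIs
    "long duration": (3.5, 7.1),     # > 1 month on TCAs or SSRIs
}
DISUTILITY_FULL = 0.41               # Lave et al. (1998)
DISUTILITY_PARTIAL = DISUTILITY_FULL / 2
VALUE_LIFE_YEAR = 100_000.0          # dollar value of a year of life

def relief_value(category, baseline="no drug care"):
    # Dollar value of symptom-weeks avoided relative to the baseline:
    # (weeks saved) x (disutility per week) x ($ value of a life-year / 52).
    f0, p0 = WEEKS[baseline]
    f1, p1 = WEEKS[category]
    utility_saved = ((f0 - f1) * DISUTILITY_FULL
                     + (p0 - p1) * DISUTILITY_PARTIAL)
    return utility_saved * VALUE_LIFE_YEAR / 52.0
```

For example, relative to no drug care, a long-duration TCA/SSRI regimen saves 2.9 weeks of full and 6.7 weeks of partial symptoms, worth roughly $4,900 per patient at these rates.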
Figure 1: Patient Copayments by Product and Year. Patient copay per 30-day supply (in $), by product name (generic name if off patent), for 2003, 2004, and 2005; branded and generic products are distinguished. Error bars show the standard deviation in copayments across insurance plans in the sample.

Figure 2: Nausea and Insomnia Side Effect Reports from Clinical Trials. Percentage of trial participants reporting each side effect, by drug, for a selected group of subclasses (TCA, NDRI, SSRI, SNRI).
Figure 3: Change in Drug Shares under Capitation, for selected antidepressants. Drug share (in %) in non-capitated versus capitated plans, by molecule name (brand name; * if on patent): non-drug care, Fluoxetine (Prozac), Escitalopram (Lexapro)*, Sertraline (Zoloft)*, and Venlafaxine (Effexor-XR)*.

Figure 4: Share of Patients on Drug Care over Six Decision Points. Share of patients on drug care (in %) at decision points t = 2 through t = 6 within a patient's treatment episode. Bars compare the observed data against predictions from the dynamic model. The static model predicts a share of 55% at all points (shown as a line in the figure).
Table 1: Summary statistics on variables entering the estimation algorithm

Panel 1: Drug Characteristics

                                                 Monthly Insurer Cost ($)³   Monthly Drug Copayment ($)³   Market Share (%)⁴
Product Name       Subclass  Brand?  Daily dosing  2003   2004   2005         2003   2004   2005            2003   2004   2005
None               None      -       -             -      -      -            -      -      -               35.2   39.3   34.2
Citalopram HBr     SSRI      No      1             -      36.0   19.9         -      7.2    8.7             -      0.2    4.2
Celexa             SSRI      Yes     1             72.9   74.0   75.3         22.6   31.7   35.4            4.1    2.7    0.1
Lexapro            SSRI      Yes     1             63.2   63.3   68.9         22.7   24.8   28.8            13.8   13.0   12.0
Fluoxetine HCL     SSRI      No      1-2           30.9   17.7   17.0         9.9    6.9    7.4             7.3    11.5   10.6
Paroxetine HCL     SSRI      No      1             68.7   49.3   37.0         11.0   8.3    8.8             1.7    4.3    4.4
Paxil CR           SSRI      Yes     1             81.4   83.9   87.5         21.8   21.0   25.1            6.4    3.6    1.6
Zoloft             SSRI      Yes     1             79.9   81.9   85.0         20.4   25.6   28.2            12.5   10.6   9.6
Cymbalta           SNRI      Yes     1-2           -      103.8  111.8        -      27.8   34.6            -      0.5    2.4
Effexor-XR         SNRI      Yes     1             109.5  116.0  118.7        21.6   22.4   26.1            8.6    7.5    7.0
Bupropion HCL      NDRI      No      3             38.6   49.9   50.7         8.5    9.0    9.8             0.3    3.4    4.5
Wellbutrin XL      NDRI      Yes     1             104.8  104.1  113.5        21.2   21.2   26.8            6.5    5.7    5.1
Amitriptyline HCL  TCA       No      1             5.9    4.2    4.0          6.9    5.3    5.3             0.7    0.9    0.7
Mirtazapine        NaSSA     No      1             60.0   42.4   31.9         11.2   8.8    8.0             0.5    0.7    0.6
Trazodone HCL      SM        No      3             8.0    6.1    5.8          7.2    5.8    6.0             1.3    1.8    1.6

Total patient episodes (2003; 2004; 2005): 11,341; 41,285; 50,154
Total # unique patients: 102,780
Total # of observations: 267,390

Panel 2: Individual covariates in product choice model (%)

Patient diagnosed with major depressive disorder: 27.2
Patient's plan is a capitated Health Maintenance Organization (HMO): 33.2
Patient's plan is a non-capitated HMO: 13.8
Patient's plan is a Preferred Provider Organization (PPO): 53.0
Patient visits a general practitioner (internal medicine/family medicine): 46.7
Patient visits non-psychiatric specialist (obstetrician, etc.): 27.9
Patient visits psychiatrist: 25.5

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, years 2003-2005.
2. Sample collected for individuals newly diagnosed with depression in an outpatient office visit; the data include information on subsequent office visits and prescriptions dispensed over the patient's episode. Pregnant women and individuals with psychiatric comorbidities are excluded from the sample, as are patients whose episodes begin within the first 6 months of data (to prevent left-censoring).
3. Values reflect the mean across the 307 plans in the data. Each plan sets distinct copayments and records different insurer costs by drug.
4. Shares calculated using only the initial prescription written for each patient episode. Selected antidepressants shown for illustration.
Table 2: Observed continuation decisions (switch vs. persist)

                                     Fraction of decisions (%)
Time of      Drug Class       Persist  Switch        Switch to      Switch to        Total
decision (t) chosen at (t-1)           within class  another class  outside option   Obs.
t=2          SSRI             68.9     3.5           2.6            25.0             19,663
             NDRI             64.7     3.0           7.8            24.5             3,974
             SNRI             73.8     1.5           4.5            20.1             4,166
             TCA              57.9     0.2           12.7           29.2             497
             NaSSA            52.6     -             14.3           33.1             293
t=3          SSRI             74.3     2.5           2.6            20.6             13,289
             NDRI             69.6     2.3           5.9            22.2             2,668
             SNRI             77.2     0.7           3.8            18.3             3,081
             TCA              57.3     -             14.6           28.1             335
             NaSSA            67.7     -             13.4           18.9             164

Notes:
1. Observed switches recorded for the 50,000 patient sample drawn from Medstat's Marketscan Database for use in estimation. Roughly 35% of patients receive no drug care at their initial visit and so do not appear in the above switch statistics.
2. Selected antidepressant classes shown. The number of individual drugs modeled in each subclass are: SSRI, 8; NDRI, 2; SNRI, 2; TCA, 2; NaSSA, 1. Thus, no patients switch 'within class' when starting on a NaSSA.
Table 3: Percentage of patients remaining on the initial drug choice over the course of an episode

Statistics conditioned on the length of the treatment episode (measured in # of prescriptions dispensed overall)

Prescription count        Length of Treatment Episode (# monthly prescriptions dispensed)
within episode       1      2      3      4      5      6      7      8      9      10     11     12
 1                 100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
 2                         85.1   86.8   88.9   89.4   89.6   90.4   91.4   91.9   92.1   91.6   91.6
 3                                84.4   84.2   84.7   85.8   86.4   88.2   88.7   88.9   88.8   88.3
 4                                       82.3   82.0   83.0   83.9   86.4   86.4   86.9   87.7   85.5
 5                                              80.2   81.0   82.2   84.4   84.7   85.4   85.4   85.0
 6                                                     79.6   80.1   82.6   83.5   83.9   84.0   84.0
 7                                                            79.3   80.7   81.9   83.5   82.7   84.0
 8                                                                   79.9   81.2   81.7   80.9   82.4
 9                                                                          79.3   79.7   79.6   82.2
10                                                                                 79.2   79.3   81.4
11                                                                                        77.6   79.4
12                                                                                               79.6

Share of dataset:  56.7%  12.9%   9.3%   6.4%   4.4%   3.1%   2.1%   1.7%   1.3%   1.0%   0.7%   0.4%

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, years 2003-2005.
2. Sample contains 102,720 unique patients, with 267,390 observed prescriptions or office visits across patient episodes.
Table 4: Mixed logit estimates based on first choice data

                                                       (1)                       (2)
Type      Variable                              Est.    Std. Err  T-stat   Est.    Std. Err  T-stat
RC, Mean  1{TCA}                               -0.43    0.03      13.00   -0.28    0.02      11.96
RC, Mean  1{NDRI}                               0.95    0.28       3.34   -0.18    0.46       0.38
RC, Mean  1{SSRI}                               1.11    0.12       9.54    1.02    0.09      11.47
RC, Mean  1{SNRI}                               0.09    0.09       1.00   -0.12    0.04       3.36
RC, Mean  1{NaSSA}                              0.10    0.04       2.34   -0.03    0.05       0.55
RC, Mean  1{Inside good}                        3.61    0.48       7.51    6.86    0.66      10.32
RC, Mean  1{Dosing >1 pill per day}            -0.52    0.04      14.68   -0.40    0.04       9.77
RC, Mean  nausea reports >60th pctile         -14.17    2.09       6.77   -8.16    0.77      10.59
RC, Mean  log(insurer cost)                    -0.44    0.01      59.71   -0.23    0.01      26.25
RC, Mean  log(patient copayment)               -0.38    0.01      27.92   -0.34    0.01      49.26
RC, Var.  log(insurer cost)                     0.02    0.00      14.60    0.02    0.00      10.34
RC, Var.  log(patient copayment)                0.08    0.02       3.77    0.04    0.01       6.40
Fixed     1{branded}                            1.97    0.19      10.58    2.34    0.07      32.35
Fixed     log(insurer cost) × 1{Capitated}     -0.25    0.01      17.28   (reported in a single specification)

Product-specific fixed effects included?        Yes                        Yes
Interactions of diagnosis with subclass?        Yes                        Yes
Interactions of specialty with subclass?        No                         Yes

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, 2003-2005.
2. Estimation based on a random sample of 50,000 patients from the full dataset.
3. Estimation of the hierarchical model carried out using Markov Chain Monte Carlo methods.
4. 'RC, Mean' denotes the mean of a random coefficient; 'RC, Var.' denotes its variance.
Table 5: Elasticities from the mixed logit model

Product Name       Class   Brand?  On Patent?  Market Share (%)  Avg. copay ($/month supply)  Elasticity
Prozac             SSRI    Yes     No          0.38              34.26                        -0.90
Celexa             SSRI    Yes     No          1.55              32.77                        -0.63
Mirtazapine        NaSSA   No      -           0.62              8.89                         -0.51
Amitriptyline HCL  TCA     No      -           0.77              5.64                         -0.48
Nortriptyline HCL  TCA     No      -           0.34              6.82                         -0.47
Citalopram HBr     SSRI    No      -           2.19              4.96                         -0.45
Bupropion HCL      NDRI    No      -           3.39              10.09                        -0.44
Trazodone HCL      SM      No      -           1.51              6.38                         -0.44
Cymbalta           SNRI    Yes     Yes         1.30              21.62                        -0.42
Zoloft             SSRI    Yes     Yes         10.14             27.28                        -0.38
Fluoxetine HCL     SSRI    No      -           10.23             8.32                         -0.38
Paxil              SSRI    Yes     No          2.85              22.30                        -0.37
Paroxetine HCL     SSRI    No      -           3.96              9.21                         -0.36
Lexapro            SSRI    Yes     Yes         12.20             27.87                        -0.36
Wellbutrin XL      NDRI    Yes     Yes         5.45              23.83                        -0.32
Effexor-XR         SNRI    Yes     Yes         7.27              23.12                        -0.24

Average (weighted by share):                                                                  -0.37

Notes:
1. Elasticities and shares calculated using mixed logit specification (1) reported in Table 4.
2. Selected antidepressants in the choice set shown for illustration.
Table 6: Comparison of estimated drug fixed effects with advertising expenditures, years 2001-2003

                                                           Avg. monthly detailing     Avg. monthly volume of samples
                                                           expenditure¹ ($000s)       provided to physicians² (000s)
Subclass  Molecule                    Fixed     Std.       Yr 2003   Average,         Yr 2003   Average,
                                      Effect³   Error                2001-2003                  2001-2003
SSRI      Fluoxetine HCL (Prozac)      3.02     0.10       480       1,568            64        283
SNRI      Venlafaxine HCL (Effexor)    2.29     0.09       7,787     6,672            1,206     858
SSRI      Escitalopram (Lexapro)       2.25     0.26       16,506    5,254            1,055     324
SSRI      Citalopram HBr (Celexa)      2.19     0.10       593       7,375            39        521
SSRI      Sertraline HCL (Zoloft)      2.16     0.26       8,754     7,307            1,010     959
SSRI      Paroxetine HCL (Paxil)       1.80     0.19       9,220     9,071            938       1,073
SNRI      Duloxetine HCL (Cymbalta)    1.03     0.10       0         0                0         0
SM        Trazodone HCL (Desyrel)      0.92     0.07       0         0                0         0
TCA       Nortriptyline HCL (Pamelor) -0.46     0.10       0         0                0         0
NDRI      Bupropion HCL (Wellbutrin)  -1.21     0.22       4,293     3,591            508       425

Notes:
1. Detailing expenditure reflects professional promotion recorded in Verispan's Personal Selling Audit/Hospital Personal Selling Audit.
2. Sampling volume reflects the package volume of a product provided as samples to office-based physicians through various delivery methods (IMS Health).
3. Fixed effects estimated in first stage mixed logit model, specification (1).
4. Cymbalta entered the antidepressant market in 2003, after the period of the advertising data available for analysis.
Table 7: Model Estimates

Coefficients on Outcome Characteristics¹

Variable                              Estimate   Std. Error
Means
1{TCA}                                 -0.43      0.03
1{NDRI}                                 0.95      0.28
1{SSRI}                                 1.11      0.12
1{SNRI}                                 0.09      0.09
1{NaSSA}                                0.10      0.04
1{Inside good}                          3.61      0.48
1{Dosing >1 pill per day}              -0.52      0.04
1{nausea reports >60th pctile}        -14.17      2.09
Variances³
1{TCA}                                 43.8       7.7
1{NDRI}                                98.1       6.1
1{SSRI}                                 9.1       5.2
1{SNRI}                                 0.0       0.0
1{NaSSA}                               59.3       13.1
1{Inside good}                        102.8       9.2
1{Dosing >1 pill per day}              16.0       8.8
1{nausea reports >60th pctile}         15.2       8.4
Idiosyncratic variance in outcomes    142.3       21.4

Fixed Characteristics²

Variable                   Estimate   Std. Error
log(insurer cost)           -0.44      0.01
log(patient copayment)      -0.38      0.01
1{branded}                   1.97      0.19
1{Wellbutrin XL}            -1.21      0.22
1{Citalopram HBr}            2.19      0.10
1{Celexa}                   -0.02      0.27
1{Cymbalta}                  1.03      0.10
1{Lexapro}                   2.25      0.26
1{Fluoxetine HCL}            3.02      0.10
1{Prozac}                   -1.32      0.28
1{Nortriptyline HCL}        -0.46      0.10
1{Paroxetine HCL}            1.80      0.19
1{Zoloft}                    2.16      0.26
1{Trazodone HCL}             0.92      0.07
1{Effexor-XR}                2.29      0.09

Notes:
1. The means of the coefficients in the parameterization of outcomes are estimated in the initial stage model. The variances (lower panel) come from the subsequent maximum likelihood routine.
2. The fixed coefficients are estimated in the initial stage model. The model also includes interactions between patient diagnoses and subclass indicators; these coefficients are suppressed above for clarity.
3. Standard errors for the second stage parameters estimated via a bootstrap resampling procedure described in the Technical Appendix (Section E).
Table 8: Comparison of predicted transition probabilities with observed transitions, at the first decision point

Panel 1: Observed
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI          67.7    5.4    1.7
SSRI           1.0   72.4    0.8
SNRI           1.1    2.9   75.3

Panel 2: Dynamic Model
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI          47.5   39.1    2.3
SSRI           8.3   72.2    8.2
SNRI           2.9   26.2   64.0

Panel 3: Static Model
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI           7.4   37.5    7.1
SSRI           7.4   37.5    7.1
SNRI           7.4   37.5    7.1

Notes:
1. Markov transition table shown includes only selected antidepressant classes.
2. Observed and predicted transitions calculated for the sample of patients pursuing active treatment for at least 2 periods.
3. 'Static' model does not condition on past choices.
Table 9: Model fit - Matching predicted subclass and product choices to observed selections by patient

                                              For All Products         For Inside Goods Only
Examination Statistic                         t=2     t=3     t=4      t=2     t=3     t=4
% of patients for whom the observed choice equals one of the top 3 choices
  if 'top 3' ranked by model                  64.2    66.4    65.3     85.0    80.9    74.3
  if 'top 3' ranked randomly by share         33.5    32.4    31.8     30.7    30.6    30.5
% of patients for whom the observed choice equals one of the top 5 choices
  if 'top 5' ranked by the model              80.3    86.2    85.1     91.0    89.1    85.9
  if 'top 5' ranked randomly by share         46.6    45.3    44.7     44.7    44.6    44.4

Total patient observations:                   29,358  19,943  13,993   22,164  15,819  11,296

Notes:
1. Fit statistics rely on estimated parameters from steps 1-2 of the estimation procedure. The statistics record the percentage of predicted choice sequences that match the observed sequence in a given period, according to the criterion of the statistic. 't' denotes the period of the decision. After period 1, the predictions for periods t=2,3,4 condition on the observed choices from s=1,...,t-1. In this way, the predictions represent only 1 period ahead forecasts based on the prior periods' observed experience.
2. "Random" ranking probabilities calculated using the hypergeometric distribution (sampling without replacement) based on observed market shares.
Table 10: Counterfactual prescription drug costs to the insurer and adherence

                                                                Decision Point
Counterfactual Scenario                                         1         2         3
Baseline
  Total monthly drug costs¹ ($)                                 429,040   430,567   254,884
  Costs per patient² ($/day)                                    1.43      1.44      0.85
  % of patients adhering to treatment                           100.0%    87.4%     56.8%

Effects of Copayment Policies
All copayments set to $0
  Change in % adherence                                         0.0%      2.3%      5.5%
  % Change in monthly drug costs                                49.2%     49.2%     60.2%
    due to change in adherence                                  3.7%      3.7%      4.4%
    due to shifts in drugs chosen                               45.5%     45.4%     55.8%
Generic copays set to $0; branded copayments set to 95th pctile ($50)
  Change in % adherence                                         0.0%      1.5%      3.4%
  % Change in monthly drug costs                               -18.7%    -18.8%    -27.0%
    due to change in adherence                                  1.3%      1.3%      1.1%
    due to shifts in drugs chosen                              -19.9%    -20.0%    -28.1%
Copay for preferred SSRI/SNRI = $0, off-patent preferred SSRI/SNRI = $15, all others = $60
  Change in % adherence                                         0.0%      1.5%      4.4%
  % Change in monthly drug costs                               -0.3%     -0.3%      3.0%
    due to change in adherence                                  0.1%      0.1%      0.3%
    due to shifts in drugs chosen                              -0.4%     -0.4%      2.7%

Effects of Promotional Policies
Ban branded advertising (eliminate branded "bonus")
  Change in % adherence                                         0.0%     -0.6%     -1.4%
  % Change in monthly drug costs                               -30.0%    -30.2%    -36.9%
    due to change in adherence                                 -0.6%     -0.6%     -0.6%
    due to shifts in drugs chosen                              -29.4%    -29.4%    -36.4%
Advertise heavily for Zoloft
  Change in % adherence                                         0.0%      2.1%      5.8%
  % Change in monthly drug costs                                13.0%     12.9%     27.5%
    due to change in adherence                                  2.5%      2.5%      4.9%
    due to shifts in drugs chosen                               10.5%     10.4%     22.6%
Advertise heavily for Wellbutrin
  Change in % adherence                                         0.0%      0.8%      1.6%
  % Change in monthly drug costs                                19.3%     19.2%     12.7%
    due to change in adherence                                  1.3%      1.3%      0.3%
    due to shifts in drugs chosen                               17.9%     17.9%     12.4%

Notes:
1. Total monthly drug costs equal the recorded insurer cost per 30 day supply less the patient's drug copay per 30 day supply, for the set of patients on active treatment.
2. Costs per patient reflect monthly dollar costs divided by the total number of patients the insurer serves, whether on active treatment or not.
3. Values calculated for 10,000 sample patients at individual-specific prices.
Table 11: Patient welfare vs. drug costs, under counterfactual scenarios

Baseline
  Drug costs per patient, initial month ($):               42.90
  Value of utility gain from symptomatic recovery ($):     366.13

Effects of Copayment Policies
                                             Change in monthly drug    Change in symptomatic
                                             costs per patient ($)     recovery ($)
All copayments set to $0                     21.10                      8.36
Tiered pricing (branded vs. generic)         (8.00)                    (5.64)
Value-based design (SSRI/SNRI preferred)     (0.13)                    11.13

Effects of Promotional Policies
Ban branded advertising                      (12.89)                   (17.05)
Advertise heavily for Zoloft                 5.57                      13.68
Advertise heavily for Wellbutrin             8.27                      7.48

Notes:
1. Per patient monthly drug costs equal the average recorded insurer cost per 30 day supply less the patient's drug copay, for the set of patients on active treatment.
2. The dollar value of the utility gain from symptomatic relief is calculated via the procedure in the Technical Appendix (Section F). The change in symptomatic effect by policy stems from the policy's effect on both the initiation of treatment and adherence. This changes the distribution of treatments used and therefore the probability of symptomatic relief over time.
3. Values calculated for 10,000 sample patients at individual-specific prices.
Table 12: Treatment recommendations from published protocols vs. the dynamic model's recommendations

Recommendation sources: (1) Texas Medication Algorithm Project, Report for Major Depression (1998); (2) APA Guidelines, Second Edition (2000, 2005); (3) Comparative effectiveness review, AHRQ (2007); (4) Dynamic Model Predictions. X's within a row indicate the protocol recommends the drug molecule.

                                    % Patients quitting medication, model¹
Molecule                   Class    Decision Point 2   Decision Point 3   Recommendations
Celexa (Citalopram)        SSRI     20.3               33.0               X X X
Lexapro (Escitalopram)     SSRI     21.5               34.5               X X X
Paxil (Paroxetine)         SSRI     15.1               23.7               X X X X
Prozac (Fluoxetine)        SSRI     19.9               32.5               X X X X
Zoloft (Sertraline)        SSRI     21.4               34.4               X X X
Cymbalta (Duloxetine)      SNRI     16.3               25.0               X X X
Effexor (Venlafaxine)      SNRI     16.3               24.9               X X X X
Wellbutrin (Bupropion)     NDRI     21.8               35.1               X X X
Nortriptyline              TCA      24.9               38.5               X
Mirtazapine                NaSSA    23.7               37.1               X
Nefazodone                 SM       17.3               26.1               X X
Trazodone                  SM       25.8               39.5

Notes:
1. Percentage of patients who quit their medications at decision points 2 and 3 predicted using the experience of 10,000 sample patients applied against the dynamic model estimates. The recommendations built from adherence patterns differ very little when using observed copayments or when setting all copayments equal to $0.
2. Sources:
(1) Crismon, M. L., et al. (1999). The Texas medication algorithm project: Report of the Texas consensus conference panel on medication treatment of major depressive disorder. J Clin Psychiatry, 60, 142-156.
(2) Karasu, T. B., et al. (2000). Practice guideline for the treatment of patients with major depressive disorder: Second edition. Arlington, VA: American Psychiatric Association. Also, Fochtmann, L. J., & Gelenberg, A. J. (2005). Guideline watch: Practice guideline for the treatment of patients with major depressive disorder, 2nd edition. Arlington, VA: American Psychiatric Association.
(3) Gartlehner, G., et al. (2007). Comparative effectiveness of second-generation antidepressants in the pharmacologic treatment of adult depression. Agency for Healthcare Research and Quality, US Dept of Health and Human Services.