Efficient Provision of Experience Goods: Evidence from
Antidepressant Choice∗
Michael J. Dickstein†
Harvard University
January 17, 2011
JOB MARKET PAPER
Abstract
In markets for experience goods, consumers learn about the quality of available products
through experimentation. In their decisions, they balance the desire to choose the good with
the largest expected value against the incentive to try lesser-known but potentially superior
options. I develop a dynamic framework for analyzing demand that interacts pricing and
promotion with this learning process, and accommodates settings in which agents search over
large choice sets. I employ this approach to study the antidepressant market, testing whether
insurer policies can improve the quality of drug-patient matches and overall health. I find
that tools that selectively promote only the most effective products for a patient outperform
commonly-used tiered pharmaceutical policies. In addition, my estimates, built from a large
pool of patient episodes, identify treatments that induce stronger adherence and thus faster
recovery relative to alternatives recommended by clinical trial reviews. This result suggests
that treatment protocols built from observed quit rates can improve physician decision-making
and ultimately patient health.
Keywords: Learning, dynamic discrete choice, multi-armed bandits, pharmaceutical pricing
∗I owe special thanks to David Cutler, Guido Imbens, Ariel Pakes, and Dennis Yao for their guidance and suggestions. I thank Gary Chamberlain, Andrew Ching, Greg Lewis, Eduardo Morales, Julie Mortimer, Amanda Starc, and seminar participants at the Harvard Industrial Organization Lunch and Harvard Econometrics Lunch for helpful comments.
†Email: [email protected]; Baker Library 420M, Harvard Business School, Boston, MA 02163.
1 Introduction
In markets for experience goods, buyers learn about the quality of available products through
experimentation. They apply their experience to reevaluate the options, identifying the product
that best matches their preferences. In their next decision, they may choose this product or may
select a less well understood option to sharpen their opinions further. Many markets fit this
setting, including consumers' purchases of non-durable goods, manufacturers' choices of suppliers,
and physicians' choices of prescription drugs for patients1. The market for prescription drugs is
a particularly important example, because improvements in the physician’s learning process can
produce significant gains in patient health. The health benefits come in two forms: patients may
realize better outcomes while on more tailored treatment and may adhere to a drug regimen at
higher rates.
I develop a dynamic model of physician decision-making to ask: what set of insurer copayment policies and drug promotions can improve the efficiency of drug choice, maximizing patient
adherence and health while minimizing insurer cost? In the model, these policies enter both the
physician’s initial treatment choice and also interact with her learning process. I focus specifically
on the antidepressant market. The potential improvement in depression care is substantial, as few
patients continue treatment to the six month threshold recommended by the American Psychiatric
Association. In my empirical setting, half the patients discontinue treatment by the first month
and over 90% exit care within six months2.
In determining the value of insurer policies, I also tackle the issue of physician agency, a second
feature of the drug choice setting common to a broader set of markets. Agency relationships
among the physician, patient, and insurer complicate the prescribing choice. While the physician
ultimately chooses the treatment, she considers to varying degrees the patient’s health outcome and
the costs borne by the patient and the insurer. Careful incentive design must therefore anticipate
the physician’s reaction to the policy chosen.
To capture both agency and learning, I begin by using information on the first choice of antidepressant to identify the characteristics that physicians consider in their initial decision. I represent
the agency trade-off using a utility function for the physician that balances her concern for patient
1Erdem and Keane (1996) and Ackerberg (2003) study consumer products; Crawford and Shum (2005) focus on demand for prescription drugs.
2Inadequate treatment duration nearly doubles the rate of illness relapse (Melfi et al. (1998)), and depression itself involves serious pain: in surveys, patients equate 10 years living with depression to only 6 years living without it (Fryback et al. (1993)).
outcomes against private considerations, as in Hellerstein (1998). The private aspects that make
a drug more desirable for the physician to prescribe include lower risk of malpractice suits, fewer
interactions with other drugs, and possibly greater advertising. Physicians differ in their preferences over a rich set of drug characteristics, including price; this helps rationalize the diffuse market
shares at the initial decision.
After the first month, 20-25% of patients in my dataset exit care, 60-70% persist on the initial
treatment, and the balance switch to a new drug. In a model of learning, I focus on this continuation
decision. Two dimensions of uncertainty can enter the physician’s choice problem when she selects
among a set of experience goods. First, physicians may lack knowledge of the relative efficacy
of available drugs. They may use their experience across the group of patients they treat to
identify effective options. Since many of the antidepressants studied have been marketed for two
decades, this type of uncertainty plays only a minor role in my setting. Far more important to
depression care, physicians possess uncertainty about the quality of match between a particular
patient and the products available. Six subclasses of drugs compose the choice set, each using
a distinct biological mechanism of action in the brain; after adjusting for broad diagnosis classes,
physicians cannot predict ex ante which mechanism will provide the strongest antidepressant effect
for a particular patient. I focus on the physician’s effort to learn about the quality of a patient’s
match to a specific drug. Within the growing literature on physician behavior, Crawford and Shum
(2005) share this focus on the uncertainty in the patient-drug match. Unlike their specific setting,
however, the antidepressant market I study involves both a larger choice set and variation in price
and promotion across products, requiring a distinct solution.
I initialize the mean parameters of the physician’s decision problem via a standard discrete-
choice model. For the learning model, I develop a new methodology. I design a computationally
feasible approximation to the learning and experimentation that underlie the continuation decisions.
The focus on computation is particularly important in the antidepressant setting, where the large
choice set and relatively long time horizon can overwhelm existing approximation methods.
After setting the initial conditions, the dynamic model requires an explicit learning process for
the physician and a decision rule for her to employ after this updating. I choose a form for each
element based on the available variation in my data. I start by parameterizing each drug in terms
of characteristics that physicians suggest prompt patient complaints: (1) effi cacy, captured by the
drug’s subclass; (2) the side effect profile; and (3) the frequency of dosing per day. I allow patients
to differ in their sensitivity to these features; the physician’s goal is to learn these preferences from
the patient’s past experience.
The updating itself follows Bayes' rule, as in Crawford and Shum (2005) and Ching (2010).
However, unlike much of the existing work on learning, the agent updates a vector of coefficients
on the drug's characteristics rather than a single mean for each choice. This process permits correlated learning: in updating the coefficients on the characteristics of the drug tried, the physician
changes her priors on all drugs that share those characteristics. For example, physicians learn from
experience under drug A how the patient views drugs in A’s subclass as well as how he views any
drug with similar side effects to drug A.
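This correlated-updating step can be written as a conjugate normal regression update on the coefficient vector. Below is a minimal numerical sketch; the characteristic vectors, prior, and noise variance are hypothetical illustrations, not the paper's estimates:

```python
import numpy as np

# Hypothetical drug characteristics: [SSRI indicator, nausea score]
x_A = np.array([1.0, 0.8])   # drug A: an SSRI with high nausea
x_B = np.array([1.0, 0.7])   # drug B: same subclass, similar side effects
x_C = np.array([0.0, 0.1])   # drug C: different subclass, low nausea

mu = np.zeros(2)             # prior mean of patient-specific coefficients
Sigma = np.eye(2)            # prior covariance over the coefficients
sigma2 = 1.0                 # idiosyncratic outcome variance

# Patient reports a poor outcome (y = -1) after one period on drug A.
y = -1.0
S = Sigma @ x_A
k = S / (x_A @ S + sigma2)           # Bayesian (Kalman-style) gain
mu_post = mu + k * (y - x_A @ mu)    # posterior coefficient means
Sigma_post = Sigma - np.outer(k, S)  # posterior covariance

# Expected match quality for each drug before and after updating:
for name, x in [("A", x_A), ("B", x_B), ("C", x_C)]:
    print(name, x @ mu, "->", x @ mu_post)
```

Because the update operates on coefficients of shared characteristics, beliefs about drug B (same subclass, similar nausea profile) fall almost as much as beliefs about A, while beliefs about the dissimilar drug C barely move.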
After updating, the physician chooses the next period’s treatment, balancing the desire to
select the drug with the best expected outcome against the incentive to experiment with less well
understood options. I generate a rule that captures this trade-off by analyzing the physician’s
dynamic problem using the "multi-armed bandit" approach developed in the statistics literature
by Gittins and Jones (1979) and applied in economics contexts by Miller (1984) and Bergemann
and Välimäki (1996), among others. The solution takes an index form. Rather than requiring the
physician to forecast and calculate outcomes under all possible treatment paths, the multi-armed
bandit solution simplifies the J-dimensional dynamic programming problem into J one-dimensional
problems, one for each choice. These calculations lead to a single number, or "index", for each choice;
the physician then chooses the option with the largest index.
If the options were strictly independent, the drug choice setting would satisfy all the conditions
required for Gittins' index solution to equal the dynamic programming solution. However, antidepressants are correlated according to the set of characteristics outlined earlier, including their
subclass designation. As a result, I rely on the specific index rule derived by Lai (1987), which
permits correlated learning but only approximates the dynamic programming result.
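Lai's boundary has a specific functional form; as a stylized illustration of how an index rule of this general family trades off exploitation against exploration, consider an upper-confidence-style index (the bonus term below is illustrative, not Lai's exact formula, and all numbers are hypothetical):

```python
import numpy as np

def choice_index(post_mean, post_var, t, horizon=24):
    """Stylized index: exploitation value plus an exploration bonus
    that grows with posterior uncertainty and the remaining horizon."""
    bonus = np.sqrt(2.0 * post_var * np.log(max(horizon - t, 2)))
    return post_mean + bonus

# Hypothetical posterior beliefs over three drugs at decision point t=1:
means = np.array([0.4, 0.1, 0.3])
vars_ = np.array([0.01, 0.30, 0.05])  # drug 2 is poorly understood
idx = [choice_index(m, v, t=1) for m, v in zip(means, vars_)]
best = int(np.argmax(idx))            # drug 2 wins despite a lower mean
```

Even though drug 2 has the lowest posterior mean, its large posterior variance earns it the largest index, capturing the incentive to experiment with lesser-known options.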
From reduced form estimates of the initial period decision, I find that physicians respond to high
patient copayments by shifting prescriptions to cheaper options; they express excess attachment
to branded drugs generally, and to specific highly-promoted options in particular; and supply-side
incentives, such as paying physicians using a capitated rate, can shift prescribing toward cost-
effective options. Expanding to the second stage learning model, I judge the effect of insurer
copayment designs and promotion schemes both on the insurer’s costs and on patient adherence.
I translate adherence to predictions of symptomatic relief in dollar terms, as a measure of patient
welfare. I find "value-based" insurance designs that set lower copayments for drugs with higher
effectiveness—i.e. those with low population quit rates—can boost adherence, leading to a dollar gain
in patient health that exceeds the change in cost. These policies outperform commonly-used tiered
copayment schemes.
On promotion, I find the practice of academic detailing, in which specialists visit clinicians to
discuss the latest academic research on medications, can improve adherence and health if these
specialists promote drugs with higher tolerability and efficacy. In addition, the model allows me to
judge the welfare effects of manufacturer advertising itself. I find that promotion of highly effective
branded products may enhance patient welfare if the drug’s price to the insurer is relatively modest.
In one counterfactual, I simulate increased promotion for Zoloft (Sertraline), a drug with relatively
high rates of patient adherence. Increasing its advertising threefold leads to a 50% market share
for Zoloft in the focal month. Overall costs to the insurer increase, as physicians substitute toward
Zoloft from cheaper generics. However, gains in the dollar value of symptomatic relief eclipse the
increase in costs. In a second promotion experiment I run involving a less effective drug, the dollar
gains in health fall below the increase in drug costs.
Finally, I use the information on quit rates in my large census of patient episodes, filtered through
the structural model, to recommend a set of treatments. When patients leave treatment early,
holding price and other incentive effects constant, it reveals information on the effectiveness of the
drug chosen. The approach of judging efficacy using attrition rates follows from work by Chan
and Hamilton (2006) and Philipson and Hedges (1998) in the setting of randomized clinical trials.
I extend this idea to observational data. Relative to the managers of clinical trials, practicing
physicians exert far less effort to encourage compliance with the prescribed treatment. As a result,
the adherence rates observed in claims data, while lacking the internal validity of a randomized
trial, nonetheless provide an externally valid impression of patient satisfaction with a drug.
Based on observed switching patterns, the model identifies five treatments with 5-10% higher
rates of adherence and 10-20% fewer switches relative to other options. Existing protocols built
from clinical trial outcomes recommend a larger list of products, some of which far under-perform
the above treatments in terms of adherence. Thus, in the case of judging drug efficacy—and possibly
for policy evaluation more generally—my results suggest observed adherence rates provide valuable
information missed by treatment effects in randomized trials.
I proceed in the paper by first describing the market for depression care, the dataset, and
the available variation used to identify the parameters of the model (Section 2). I then describe
the reduced form approximation of the initial choice and provide estimation results (Section 3); I
provide detail on the learning model (Section 4) and the estimation algorithm used to recover its
parameters (Section 5); I describe the results and fit of the dynamic model (Section 6); finally, I
run simulations to test the effects of insurer policies on patient welfare and to develop a new clinical
protocol (Section 7).
2 Background and Data
2.1 Depression Care
Major depression affects 6.5% of adults in the United States each year. The illness causes patients
to suffer significant impairment in their productivity, with surveys finding 60% have symptoms
severe enough to keep them from performing daily tasks (Kessler et al. (2003)). I study patients
with one of five categories of depression: major depression, both current and recurrent; dysthymia
and depression with anxiety; prolonged depressive reaction; adjustment disorder with depressed
mood; and, depression not otherwise specified3. Each involves different clinical characteristics,
with standards set by the American Psychiatric Association.
The volume of diagnoses in the United States leads to a large market for drug treatment. In
2008, patients filled 164 million prescriptions for antidepressants, trailing as a class behind only
cholesterol regulators and pain medicines. Market sales reached $9.6 billion in 2008, roughly 3.3%
of all U.S. prescription drugs sales4. If advertising expenditure is roughly proportional to the class’
share in sales, manufacturers spent $370 million in 2008 on promotion of antidepressants, about
4% of sales revenue. I consider how this advertising might affect prescribing behavior in later
counterfactuals.
The antidepressants studied fall into six distinct subclasses according to their effect on concentrations of serotonin, norepinephrine, or dopamine in the brain. Patients react idiosyncratically
to changes in the concentrations of these chemicals, creating uncertainty regarding the efficacy of
any one biological mechanism for a patient (Murphy et al. (2009)). The medical literature groups
the subclasses themselves into two generations, with "second generation" antidepressants generally
dominating the older subclasses. Of the drugs I study, only tricyclic antidepressants (TCAs) fall
into the first generation. They act non-selectively on chemicals in the brain, leading to greater issues with tolerability and higher risk of overdose. As a result, "selective" second generation classes
3The depression diagnoses listed match the International Classification of Diseases (ICD-9-CM) codes of: 296.2, 296.3, 300.4, 309.0, 309.1, and 311. Melfi et al. (1998), Pomerantz et al. (2004), and Akincigil et al. (2007) use similar diagnostic codes in selecting a sample of depression sufferers.
4"Top Therapeutic Classes by US Sales", IMS National Sales Perspectives. IMS Health, 2008.
have largely replaced TCAs as the preferred treatment. The second generation includes: selective serotonin reuptake inhibitors (SSRIs); serotonin-norepinephrine reuptake inhibitors (SNRIs);
norepinephrine and dopamine reuptake inhibitors (NDRIs); noradrenergic and specific serotonergic
antidepressants (NASSAs); and serotonin modulators (SMs) (Gartlehner et al. (2007)).
2.2 Formation of the Sample
I draw a sample from Thomson Medstat's Commercial Claims and Encounters Marketscan Database, collecting its patient-level insurance claims data for outpatient visits and prescription drug
services. I link this data to patient demographics and plan design features from the Benefit Plan
Design Marketscan Database5. The individuals recorded in the data include active employees
working for a group of large US firms that contract with one of 100 participating payers; the
employees' dependents and some classes of retirees enter the database as well.
To form a sample, I identify patients in the outpatient dataset who receive a new diagnosis in
one of the depression categories listed earlier. Starting from this index office visit, I collect the
patient’s complete treatment sequence, including the identity of the prescriptions he fills, over the
remaining months of the sample. By conditioning on observed diagnoses in creating the sample, I
avoid observations that reflect off-label use of antidepressants6. I impose the following additional
screens in creating the analysis dataset: patients cannot have a concurrent diagnosis of bipolar
disorder or schizophrenia or receive drugs that signal these conditions, as these more complicated
illnesses lead to distinct prescribing behavior7; the patients' age must be between 18 and 64, the
range for which the data are complete; patients must visit a health professional with the ability to
prescribe all possible treatments in the choice set; and, patients must not be pregnant, a condition
that involves serious drug contraindications. To avoid episodes in which I observe care part way
into treatment, I eliminate from the sample patient episodes with treatments prescribed within the
first 6 months of data.
The initial filters lead to a dataset of 102,780 unique patient episodes of depression care, comprised of 267,390 observations on antidepressant prescriptions. The individuals are members of 307
insurance plans, each with a distinct set of required drug copayments. Plans also differ in their
5Thomson Reuters MarketScan Research Databases. Ann Arbor, MI: 2003-2005.
6In addition, I miss individuals in the data who suffer from depression but receive either another diagnosis or none at all. My sample thus underpredicts the use of the "outside option" of non-drug care and reflects care for patients with more identifiable depressive symptoms.
7Patients excluded due to comorbidities have a diagnosis in one of the following classes: bipolar and manic disorders (296.0, 296.1, 296.4-.8) and schizophrenic disorders (295.0-295.9).
physician payment schemes, with some offering capitation contracts to primary care physicians.
These contracts specify an annual lump-sum payment per patient treated rather than reimbursing
the physician for each office visit. I explore later how capitation arrangements change physician
prescribing behavior.
To define the end of a patient episode, I look for a gap of 90 days within the treatment history
where the patient neither fills a prescription nor visits his physician. If such a gap occurs, I
end the episode at the last prescription observed before the 90 day wash-out period. If the
individual appears again in the data after this period, I treat the new observations as composing an
independent episode8. Using this definition of an episode, 35% of patients do not fill a prescription
in their episode; 22% of patients fill one; 13% fill two; 9% fill three; and 6% fill four. 15% have
episodes involving between 5 and 30 filled prescriptions.
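The 90-day wash-out rule above amounts to a simple segmentation over a patient's claim dates. A sketch of that logic, using hypothetical dates (the variable names are illustrative, not fields in the Marketscan data):

```python
from datetime import date

def split_episodes(event_dates, gap_days=90):
    """Split a sorted list of claim dates (fills or office visits)
    into episodes: a new episode starts after a gap of more than
    `gap_days` with no observed activity."""
    episodes, current = [], [event_dates[0]]
    for d in event_dates[1:]:
        if (d - current[-1]).days > gap_days:
            episodes.append(current)   # end episode at the last event before the gap
            current = [d]              # later activity forms an independent episode
        else:
            current.append(d)
    episodes.append(current)
    return episodes

# Hypothetical patient history: three fills, then a 120-day gap, then one fill.
dates = [date(2004, 1, 5), date(2004, 2, 3), date(2004, 3, 4), date(2004, 7, 2)]
eps = split_episodes(dates)
# -> two episodes: the first with 3 events, the second with 1
```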
Note that, in defining the choice set, I include only drugs maintaining at least a 0.3% market share. Drugs omitted via the market share cutoff include extremely rare, older generation
treatments, such as MAOIs (monoamine oxidase inhibitors). Their shares are simply too small to
permit estimation and to collect price information across distinct plans. The choice set, however,
did evolve over the sample period. In July 2003, physicians chose among 17 products that maintained at least a 0.3% share in the overall market. Citalopram, the generic form of Celexa, entered
in November 2004, and a new SNRI, Cymbalta (Duloxetine), entered in August 2004, providing 19
drug options by the end of the sample period. I allow the patient to face the latest choice set at
each decision point in his observed episode.
2.3 Variation for Model of Initial Choice
The model of the physician’s initial treatment choice conditions on the insurer’s cost, the patient’s
copayment, drug effi cacy, side effects, and dosing convenience. The Marketscan database contains
the important price variables: its insurer cost variable captures the cost associated with a drug
excluding the dispensing fee, sales tax, and rebates to the payer; the copayment reported is the
plan-specific and drug-specific cost patients face when filling a prescription. In Figure 1, I plot
by drug and year the average patient copayment per day supply, taking the mean across plans in
the sample. The error bars showcase the standard deviation of the prices across available plans.
8Roughly 8% of the unique patient identifiers repeat in a new episode over the course of my sample period. The initial choice in these repeat episodes contributes to the first stage model, and may bias the mean estimates if they represent a selected sample of drugs. In the data, however, the repeat episodes begin with a distribution of drugs similar to drugs chosen at the start of non-repeated episodes.
Copayments vary along three dimensions. First, for most drugs, average copayments increase over
the sample period. Second, across products in a given year, copayments vary considerably, most
sharply when comparing branded drugs to generic drugs. The largest copayments tend to be for
branded drugs with available generic equivalents. Third, for a given product and year, the large
standard error bars suggest important variation across plans in the data. Average plans charge
patients $20-30 for a 30-day supply of a branded drug and about $5-10 for a 30-day supply of a
generic drug. The most generous plans require copayments of $0-5 while the least generous require
levels of $50 or more for branded drugs. Given this important variation, I use plan-specific
and year-specific prices in later estimation. The insurer cost varies similarly, as reported in Table
1. Generics cost as little as $5-8 per month supply, with an average near $40; patented drugs cost
two to three times as much.
Information on side effects and daily dosing comes from Gartlehner et al. (2007). They examine
2,099 relevant citations from the medical literature, collecting information on treatment effects and
side effects by clinical trial. While I use this information to create drug variables, I later compare
the quality of the prescribing advice in this evidence review with information contained in the
switching patterns in my individual-level dataset.
The dosing requirements appear in Table 1. I illustrate the important variation in side effects
by drug molecule in Figure 2. Side effects vary importantly across drug subclasses and, to a lesser
degree, among drugs in a given subclass. Choice behavior both initially and in the learning model
I build later will relate to this variation in side effects9.
Finally, in Panel 2 of Table 1, I include statistics on the individual characteristics reported in
the Marketscan Database that enter the longer specification of the initial choice model. 27% of
patients suffer from major depression, the most severe diagnosis examined; 26% of patients observed
visit a psychiatrist, with 47% visiting general practitioners and the remaining share visiting other
specialists, such as obstetricians. For plan types, about one third of patients receive coverage under
a capitated Health Maintenance Organization (HMO) plan, 14% have non-capitated HMO plans,
and the remaining 53% have Preferred Provider Organization (PPO) or comprehensive plans with
fewer restrictions on office visits and other aspects of care.
9When a patient reacts poorly to a drug with high nausea, the learning model downweights drugs with a similar nausea profile in the next period's decision. This reflects physician practice: when a patient complains of a side effect, physicians report that they look for an alternative less likely to produce the same complaint. Note that the biological mechanism that generates side effects is poorly understood. The behavior of physicians appears to be based on anecdote and/or malpractice concern rather than scientific evidence.
2.4 Variation for Learning Model
Identification of the learning model parameters requires a sufficiently rich set of observed switches
across products and subclasses with distinct characteristics. I summarize the available switching
in the panel data in Table 2. The table lists the fraction of individuals who (1) persist, (2) switch
to another molecule within class, (3) switch to another subclass, or (4) exit treatment at period
t, conditional on the subclass choice at (t − 1)10. The statistics show that individuals tend to
persist on treatment at a rate of 55-70%. The degree of persistence increases over the course of a
patient’s episode and depends importantly on the subclass of the initial choice. Individuals persist
at greater rates when beginning on an SSRI or SNRI, relative to NDRIs, TCAs, and NASSAs.
Of those who switch at t, 70-80% stop taking medications altogether, 15-20% switch to another
subclass, and the balance switch to a new molecule within the same subclass.
To focus on the timing of switches, I include in matrix form in Table 3 the share of patients
who remain on the initial choice at different decision points within an episode. Conditional on the
level of adherence, about 10-15% of patients switch at the first decision point. Physicians appear
to learn quickly about a patient’s match to a treatment. That the switching is slightly slower
in longer episodes suggests either better initial matching or that the physician faces greater noise
when trying to learn about a patient’s match from outcomes. I separate these two explanations in
the learning model.
3 Initial Condition
To initialize the learning process over the choice of drug regimen, I estimate a reduced form model of
behavior. The model focuses on the physician, but accounts for her attention to patient preferences,
as in Hellerstein (1998). I incorporate observable information on patient demographics, physician
specialty, and drug characteristics in the starting point for the agent’s learning process.
I relate the first period choice to the important drug features in a linear specification11. The
patient cares about an option’s: (1) copay; (2) effi cacy, summarized by drug subclass indicators; (3)
10Decision points, t, roughly correspond to one month of treatment. In cases where the physician writes a 90 day prescription, the decision point corresponds to 3 months. When a patient returns for follow-up before completing the initial prescription, the decision point corresponds to the observed number of days between office visits.
11Unlike earlier work on drug choice, including Crawford and Shum (2005), I choose a utility form for the physician that assumes risk neutrality. I model risk aversion indirectly via drug fixed effects, which capture the physician's concern about a drug's tendency to cause severe side effects or prompt increased rates of suicide. I choose this setup because switching patterns in the data may not provide the variation necessary to distinguish a risk aversion parameter from the idiosyncratic variance in outcomes that slows the agent's learning process.
side effect profile; (4) daily dosing frequency; and (5) branded status. To simplify computation,
I rely on a discretized version of the nausea variable to proxy for the overall side effect profile
of a drug12. The products are differentiated horizontally, as each patient faces a distinct vector
of copayments and patients differ in their sensitivity to price, efficacy, side effects, and dosing
convenience. I include these variables in $X^{Patient}_{ijt}$, multiplied against random coefficients. The
drug's branded status is constant over time and has a fixed coefficient; it enters $Z^{Patient}_{ij}$:

$$U^{Patient}_{ijt} = a_i X^{Patient}_{ijt} + b Z^{Patient}_{ij}$$
Ultimately, the physician makes a drug selection, incorporating the patient’s utility to a degree.
I define γ as the weight the physician places on the patient's utility, with γ → ∞ implying perfect
agency. In all other cases, she may consider her own private benefits from a treatment choice. I
capture the private benefits to prescribing drug j using product fixed effects. This can include
advertising for j or lower malpractice risk. For example, if a drug’s effects are so variable that
it might actually increase the risk of suicide for some patients, physicians will avoid it and the
estimation will identify a small or negative fixed effect.
Insurers can also alter the physician’s private benefits via supply-side incentives. They may
pay physicians using lump-sum capitated payments or may design bonus schemes that pay out
with reductions in the total costs of a physician’s prescribing. To capture this, I include in the
physician’s utility the insurer’s cost—with a random coeffi cient—and an interaction of a capitation
indicator with this cost. $X^{Physician}_{ijt}$ contains the insurer price; all other effects enter $Z^{Physician}_{ijt}$
with fixed coefficients13:

$$\begin{aligned} U^{Physician}_{ijt} &= c_i X^{Physician}_{ijt} + d Z^{Physician}_{ijt} + \gamma U^{Patient}_{ijt} + \varepsilon_{ijt} \\ &= (\gamma a_i, c_i) X_{ijt} + (\gamma b, d) Z_{ijt} + \varepsilon_{ijt} \\ &= \beta_i' X_{ijt} + \alpha' Z_{ijt} + \varepsilon_{ijt} \end{aligned}$$
Holding constant the observables, $(X_{ijt}, Z_{ijt})$, physicians may still choose a distribution of products
due to two features. First, I include an idiosyncratic extreme value error, $\varepsilon_{ijt}$, that captures
time-varying unobservables, such as whether the physician has manufacturer samples on hand.
12In interviews I conducted, physicians report switching drugs most often following complaints of nausea and insomnia. As a sensitivity, I replace nausea with insomnia in the specification. The results change very little.
13In the main specification, the fixed effects do not vary over time. I allow the fixed effects to vary by six month intervals as a sensitivity; the qualitative results are similar.
Second, I introduce individual heterogeneity via the $\beta_i$ coefficients on price, side effects, subclass
indicators, and dosing frequency. Mixing over the distribution of sensitivity parameters produces
correlations in the unobserved portion of utility for products with similar observed characteristics.
This breaks the Independence of Irrelevant Alternatives property of a simple logit model, which is
undesirable in this setting (McFadden and Train (2000)). I use a Bayesian Markov Chain Monte
Carlo (MCMC) procedure for estimation, as described in Section A of the Technical Appendix. To
reduce computation time, I estimate the model using a random sample of 50,000 patients drawn
from the raw data, holding on to the entire time series for the selected patients. Note that I cannot
separately identify the agency parameter, γ; in the estimates, I look for non-zero coefficients on the
fixed effects and the insurer price variable to suggest physicians fail to act as perfect agents.
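The mixed-logit structure described above can be sketched by simulation: mixing over draws of the random coefficients produces choice probabilities whose substitution patterns depend on shared characteristics. All numbers below are hypothetical illustrations (the characteristics, fixed effects, and coefficient moments are made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical characteristics for 3 drugs: [log copayment, nausea score]
X = np.array([[2.0, 0.8],
              [2.5, 0.7],
              [1.0, 0.1]])
alpha_Z = np.array([0.5, 0.5, 0.0])   # fixed effects (e.g., branded status)

beta_mean = np.array([-0.4, -0.5])    # mean sensitivities to copay and nausea
beta_sd = np.array([0.3, 0.3])        # heterogeneity across physicians

# Mix over the coefficient distribution: simulate utility indices per draw,
# form logit probabilities conditional on each draw, and average.
draws = rng.normal(beta_mean, beta_sd, size=(10_000, 2))
V = draws @ X.T + alpha_Z             # utility index per draw and drug
P = np.exp(V) / np.exp(V).sum(axis=1, keepdims=True)
shares = P.mean(axis=0)               # predicted market shares
```

Averaging logit probabilities over coefficient draws induces correlation in the unobserved utility of drugs with similar characteristics, which is what breaks the IIA property of the simple logit.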
Four key findings emerge from the parameter estimates, reported in Table 4. First, physicians
respond strongly to patient price: the coefficient on log copayment is -.38 and highly significant.
The estimated standard error of the random coefficient is .28, suggesting that 92% of the normally
distributed physician sensitivities fall below zero. Physicians may respond to copayment levels
due to requests from patients or after receiving so-called "blue letters"—warnings from an insurer
when a physician prescribes expensive medications too often. I report the average elasticities by
drug in Table 5. They fall near the center of the range of -.8 to -.1 identified in a meta-regression
analysis by Gemmill et al. (2007). Demand for off-patent branded drugs and older-generation
treatments, like TCAs, proves most elastic. The newest on-patent drugs and effective generics have
more inelastic demand.
Second, the coefficients on subclass indicators show that physicians have preferences over classes,
despite little evidence from clinical trials that second-generation antidepressants differ in efficacy.
Holding constant nausea side effects and dosing, physicians prefer "inside good" drug treatments
over the outside option of non-pharmacologic care, and they prefer SSRIs over other medications.
This preference may stem from unobserved drug features, like fewer interaction risks with other
medications. The mean effects will feed into the dynamic model and influence the design of a new
treatment protocol.
Third, paying physicians by capitation can influence decision making. Specification (2) includes
an interaction that allows capitation to change the physician’s sensitivity to insurer costs. The
coefficient of this interaction equals -.25 and is highly significant. The direction implies that
physicians paid by capitation are more sensitive to the total cost of a drug. I illustrate in Figure
3 the change in preferences from expensive branded medications towards effective generics, such as
Fluoxetine (Prozac), and to non-drug care when physicians accept capitated payments. There are
multiple hypotheses to explain precisely how capitation influences decisions. Dickstein (2010) tests
these hypotheses using both patient and physician-level data sources.
Fourth, the drug-specific fixed effects in the model suggest a possible role for sustained advertising
in the physician’s choice. That the coefficients are non-zero also provides some evidence
that physicians fail to act as perfect agents. In Table 6, I compare the levels of drug-specific fixed
effects with drug-level promotional data from the pre-sample years of 2001 to 2003^{14}. The data
include both manufacturer expenditures on detailing by drug and the volume of samples provided
to clinicians. A comparison of the fixed effects against advertising suggests a positive correlation of
about 0.44. Lexapro (Escitalopram), Zoloft (Sertraline), and Effexor (Venlafaxine) have among the
highest estimated fixed effects and also rank in the top three or four among antidepressants in both
detailing and sampling. The strong fixed effect for Fluoxetine, the generic version of Prozac, likely
reflects past advertising for Prozac before it lost patent protection in 2002, the year prior to my
sample. I exploit the correlation between advertising and the fixed effects in later counterfactuals.
4 Physician Learning
After the initial drug selection, the physician and patient face a continuation decision. The patient
may stop taking medication or may consult his physician, who can choose to switch the drug
treatment or recommend the patient continue on the initial choice. Capturing this decision as a
function of drug attributes and incentives is crucial in understanding how insurer policies influence
adherence and long-run treatment costs.
Fitting the switching behavior in the data requires three components. First, I specify a linear
utility function conditioned on outcomes and known drug features, like price. Second, I describe
the updating process the physician uses to learn from past experience. Finally, I specify a decision
rule for her to employ after updating^{15}.

^{14} Personal Selling Audit/Hospital Personal Selling Audit. Verispan, Inc., 2003. Also, Integrated Promotional Services, IMS Health, 2003.
^{15} Note, in the learning process, I assume the patient can remain in treatment or exit care, but cannot switch to another physician during a treatment episode. As a result, changes in medication reflect learning by the initial physician. I eliminate episodes from the sample that involve switching from a general practitioner to a specialist; this occurs in less than 1% of episodes.
4.1 Utility and Outcomes
I define total utility, y_{ijt}, for individual i under treatment j in period t as a function of: (1) "fixed"
elements, f_{ijt}; (2) a measure of the effectiveness of drug j for patient i, v_{ij}; and (3) a random noise
component, W_{it}:

y_{ijt} = f_{ijt} + v_{ij} + W_{it}
The variables in f_{ijt} include patient copayments, insurer prices, the branded status of j, and
drug fixed effects. The components of f_{ijt} are fixed, in that both the agent and the econometrician
know their values at time t—physicians do not learn about them from experience. I add an
idiosyncratic term, ε_{ijt}, to f_{ijt} that captures, for example, the availability of manufacturer samples.
The econometrician does not observe this element; it follows a standard Gumbel (extreme value)
distribution. I apply coefficients estimated in the initial stage model to form f_{ijt}, implying that
preferences are stable over time and well captured using the cross section of initial choices. The
fixed component itself varies over time due to the idiosyncratic term and to changes in copayments
and insurer prices. Given its components, f_{ijt} reflects the agency relationships between the patient,
physician, and insurer. Previous work on learning in drug choice often neglects this consideration
due to computational or data constraints.
The second component, v_{ij}, captures the patient’s match to drug j in terms of health. Since
I do not observe the patient’s medical record to identify his reaction to a drug, I break v_{ij} into a
set of covariates, U_j, to proxy for features that might cause the physician to reevaluate a patient’s
treatment regimen. In interviews I conducted, physicians outlined three reasons for switching: the
initial drug failed to improve the patient’s symptoms; it led to intolerable side effects; or, it was
inconvenient to take at the required daily intervals. Thus, the variables in U_j include: (1) drug
subclass variables to proxy for efficacy; (2) a discretized nausea side effect variable built from the
clinical trials literature^{16}; and (3) a dosing frequency indicator. I interpret switches according to
changes in these covariates. For example, an observed switch from subclass A to B implies the
physician now views subclass B as more effective for the patient following the initial experience.
I use a linear parameterization of the health outcome term: v_{ij} = U_j β_i. The characteristics of
drug j, U_j, are known to the physician and the econometrician. However, neither knows ex ante
how the patient will react to these characteristics—i.e., neither knows the true value of β_i. The
physician forms a prior distribution on the vector β_i to update with patient i’s experience. β_i has
a normal prior, with mean β_{i,0} set equal to the initial condition estimates^{17}, and with covariance
matrix V_0. The elements of β_i are independent, such that V_0 is a diagonal matrix. I choose a
normal distribution for β_i to build a Bayesian learning model in the style typical of the literature
on experience goods.

^{16} Again, I combine side effects into a single nausea-derived variable for computational purposes; the results change little by substituting insomnia for the nausea variable.
Finally, W_{it} is an unknown random shock to y_{ijt} that is independent of the components in
f_{ijt} and v_{ij} and independent across (i, j, t). It follows a normal distribution with variance σ²_W and
a mean set equal to zero without loss of generality. It reflects environmental noise affecting the
individual’s health outcome. For example, a patient might lose his job or have a family crisis
during the treatment period, changing his outcome independent of the treatment. As a result
of this noise, the physician learns only imperfectly from observed experience about a patient’s
sensitivity to attributes of the drug taken^{18}.
The sum of v_{ij} and W_{it} forms the uncertain outcome realization, which I denote ỹ_{ijt} = y_{ijt} − f_{ijt} =
v_{ij} + W_{it}. Conditional on β_i, it follows a normal distribution with variance σ²_W. Combining over
t = 1, ..., T independent realizations, the distribution appears as:

(ỹ_{i,1}, ..., ỹ_{i,T})′ | d_{i,1}, ..., d_{i,T}, β_i ∼ N(U β_i, σ²_W I_T)

where d_{i,t} is the identity of the individual’s choice at t, and row t of U holds U(d_{i,t}), the vector of
characteristics of that choice.
4.2 Bayesian Updating
The physician begins with uncertainty about the patient’s sensitivity to a set of drug characteristics,
β_i. She updates based on outcome realizations to find a posterior on β_i to use in selecting the
next treatment. Using Bayes’ rule, the posterior for β_i at time (T + 1) is proportional to the product
of two components: a likelihood on the T outcome realizations to date, Y_{i,T} = (ỹ_{i,1}, ..., ỹ_{i,T})′,
conditional on β_i, and a prior distribution on β_i:

P(β_i | Y_{i,T}, U) ∝ L(Y_{i,T} | β_i, U) g(β_i | β_{i,0}, V_0)

where U is a T×K matrix holding in row t = 1, ..., T the covariates for the drug chosen at that
time period. Recall, I specify a normal prior on the K-dimensional β_i and independent normal
distributions for W_{it}:

β_i ∼ N(β_{i,0}, V_0)
Y_{i,T} | U, β_i ∼ N(U β_i, σ²_W I_T)

^{17} β_{i,0} reflects the initial distribution of β_i conditional on individual i’s first period choice. This conditional distribution has no closed form; I take simulation draws from it via a procedure discussed in Section D of the Technical Appendix.
^{18} Randomness in outcomes may also be drug specific. For example, generic drugs may lead to more variable outcomes, if the manufacturing process of distinct generic suppliers changes the speed at which the drug circulates in the body. In the current model, I consider this variation a fixed drug characteristic; it enters the outcome through drug fixed effects and the branded indicator variable in f_{ijt}. An extension to the current approach might replace W_{it} with W_{ijt}, allowing distinct variances for branded and generic drugs.
Given the normal likelihood and prior, the posterior is normal with parameters (β*_{i,T}, V*_{i,T}) after
updating from T periods of experience:

β_i | U, Y_{i,T} ∼ N(β*_{i,T}, V*_{i,T})
V*_{i,T} = (V_0^{−1} + U′U/σ²_W)^{−1}
β*_{i,T} = V*_{i,T} [V_0^{−1} β_{i,0} + (U′U/σ²_W) β̂_i], where β̂_i = (U′U)^{−1} U′ Y_{i,T}
The updating process is akin to a Bayesian analysis of the classical regression model, here a
normal linear model (see Gelman et al. (2004) or Zellner (1971)). This appears clearly from the
manner in which Y_{i,T} enters the posterior mean—the predicted β̂_i uses the standard formula for
a regression coefficient. The updated variance, V*_{i,T}, combines two sources of variation. The
first, V_0^{−1}, reflects uncertainty about the true value of β_i; the second term is due to fundamental
variability from the model—Uβ_i cannot capture Y_{i,T} exactly.

The values of the posterior parameters over time depend on the rows of U and on the size of σ²_W.
First, if a patient takes drug j for multiple periods or takes drugs that share many characteristics,
the rows of U will be similar. Since the elements of U are indicator variables, U′U will be larger,
meaning V*_{i,T} will shrink. Second, the larger is σ²_W, the less information physicians gain from
experience, since large σ²_W implies noisy outcomes. From the expressions for (V*_{i,T}, β*_{i,T}), the terms
involving σ²_W will go to zero. The posteriors (β*_{i,T}, V*_{i,T}) then approximately equal their priors,
(β_{i,0}, V_0).
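The conjugate update above is straightforward to compute. Below is a minimal sketch in Python; all dimensions and parameter values are illustrative assumptions, not estimates from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: K drug characteristics, T periods of experience.
K, T = 4, 3

# Prior on the patient's sensitivity vector beta_i ~ N(beta_0, V_0),
# with V_0 diagonal as in the model; sigma2_W is the outcome noise variance.
beta_0 = np.zeros(K)
V0 = np.diag(np.full(K, 4.0))
sigma2_W = 9.0

# U stacks, in row t, the (indicator) characteristics of the drug chosen at t;
# Y holds the de-meaned outcome realizations.
U = rng.integers(0, 2, size=(T, K)).astype(float)
beta_true = rng.normal(beta_0, np.sqrt(np.diag(V0)))
Y = U @ beta_true + rng.normal(0.0, np.sqrt(sigma2_W), size=T)

# Conjugate normal-normal update, matching the formulas in the text:
#   V*_{i,T}    = (V_0^{-1} + U'U / sigma2_W)^{-1}
#   beta*_{i,T} = V*_{i,T} (V_0^{-1} beta_0 + U'Y / sigma2_W)
# (Note U'Y = U'U @ beta_hat, so this equals the expression with beta_hat.)
V0_inv = np.linalg.inv(V0)
V_post = np.linalg.inv(V0_inv + U.T @ U / sigma2_W)
beta_post = V_post @ (V0_inv @ beta_0 + U.T @ Y / sigma2_W)
```

With more periods on similar drugs, U′U grows and the posterior variance shrinks; a large σ²_W slows this contraction, as the text describes.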
4.3 Physician’s Decision
After updating her beliefs on β_i conditional on outcome realizations, Y_{i,T}, from periods t = 1, ..., T,
the physician must choose an option in period T + 1. The simplest decision rule for the physician
maximizes current expected utility in choosing the next treatment:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T})

Note that at T + 1, only experiences from t = 1 up to T enter the update of β_i. I label this rule
"forward-myopic," since the physician learns from experience but fails to consider the future value
from experimenting with products with lower expected means but highly variable outcomes.
In the data, physicians sometimes switch patients to drugs that the initial distribution of parameters
suggests are poor substitutes. Patients often remain on them for several periods. To
rationalize this pattern, the forward-myopic model requires very large variances on the estimated
coefficients, β_i. That is, the patient’s outcome would have to be quite extreme for the physician to
update her priors in a way that raises the expected outcome of the weaker option above all others
in the choice set.
I rationalize the observed patterns using a second approach common in previous research on
dynamic demand for prescription drugs. Rather than selecting treatments myopically, the physician
looks forward, accounting for the patient’s outcome over his entire episode of care. She experiments
with lesser-known options that may produce superior outcomes for a patient. To capture this
trade-off, I add the variance of a choice to the decision rule:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T}) + h(V(y_{i,j,T+1} | U, Y_{i,T}), ·)

where the h(·) function increases in V(·), the posterior variance on outcomes. I generate formally
a decision rule that matches this form using a multi-armed bandit solution to the choice problem
of a forward-looking agent. Below, I walk briefly through the steps to rationalize use of this rule.
4.3.1 Dynamic Choice Problem Set-up
A forward-looking physician maximizes the expectation of the discounted sum of future utility flows,
Y_{jt}, by product and time period. This expected utility at t depends on the choices in previous
periods, since the physician uses experience to update her beliefs on the unknown coefficients in
the utility function. I summarize the parameters of the distribution of these coefficients at t in
θ_j(t) = (β*_{i,t−1}, V*_{i,t−1}), based on realizations from t − 1 periods. The learning process begins with
a specific prior distribution, Π, on θ_j = θ_j(1). Succinctly, the physician solves:

max_d ∫ E_θ [ Σ_{t=0}^{T−1} δ^t Σ_{j∈J} d_{jt} · Y_{jt}(θ_j(t)) ] dΠ(θ_1, ..., θ_J)

subject to the Bayesian updating rules defined earlier. Here, d_{jt} = 1 when the physician
chooses drug j at t, and δ is the discount rate.
One can use Bellman equations to solve this problem. With this solution concept, the physician
follows backward induction, considering all possible treatment paths and selecting an option today
with the highest expected future stream of rewards. As the number of options available increases,
it grows increasingly unrealistic to presume the decision maker can compute and rank all these
future paths. For the econometrician, even newer empirical techniques, including those surveyed by
Aguirregabiria and Mira (2010), quickly become computationally demanding given the discounted
infinite horizon and the 20-member choice set in my example.
4.3.2 Simplifying via Gittins’ Index
Rather than solving for the unknown parameters using the dynamic programming
equations directly, I rely on a clever simplification of the problem described by Gittins and
Jones (1979) and Gittins (1979). They prove that in an infinite horizon setting with independent
prior distributions on (θ_1, ..., θ_J), one can break the underlying J-dimensional problem into J
one-dimensional problems. That is, rather than considering all possible streams of future outcomes
in making today’s treatment choice, the physician can break the problem into J continue-quit
decisions, one for each choice. A single number or "index value" summarizes each continue-quit
decision; the physician needs only to order these J index values, choosing the option in each period
with the largest index. This procedure provides the same solution as the dynamic programming
approach if the following conditions hold: (1) the decision-maker selects only one option at t; (2)
options not chosen remain in their initial state; (3) each option is independent; and (4) options not
selected do not contribute to the individual’s outcome (Mahajan and Teneketzis (2007)).
Intuitively, this procedure works because the conditions allow a "forward-induction" process.
Since the options are independent and those not chosen initially will remain available at later dates
in the same condition, one can reorder the choice sequence without changing the overall outcomes.
That is, prescribing drug A, followed by B, would be equivalent to choosing B then A, up to the
discount rate^{19}. Thus, the physician needs to think only about the value of each drug if it were
taken for the optimal length of time. She then builds a treatment sequence, ordering the drugs
in terms of this value. With updating from past experience, the decision to exit treatment occurs
when the physician and patient determine that the indices for all inside goods fall below that of
the outside option.
4.3.3 Index Rule with Correlation
The antidepressant setting deviates from the classical setting in which index solutions exactly
solve the agent’s dynamic problem. Specifically, the choices are not independent but correlated
according to the covariates outlined earlier, including subclass indicators and side effects. The
existing literature provides solutions in this special case. Lai (1987) and Chang and Lai (1987)
show that with a finite set of distinct options, one can design an index rule that approximates
the exact solution provided by dynamic programming, even when choice-specific parameters are
correlated^{20}. They judge the approximation using a "regret" measure, discussed in Section B of
the Technical Appendix, that accounts for deviations between the sequence of options chosen and
the optimal sequence. The index rule they propose achieves a lower bound on regret—that is, it is
"asymptotically efficient," as defined by Lai and Robbins (1985)^{21}.
Lai’s rule matches the intuition described earlier—the physician balances her impulse to choose
the drug with the highest expected utility against the incentive to experiment with lesser-known
but potentially superior treatments:

max_{j∈{1,...,J}} E(y_{i,j,T+1} | U, Y_{i,T}) + σ_{i,j,T+1} · h(δ)

where E(y_{i,j,T+1} | ·) is option j’s expected utility at (T + 1) and σ_{i,j,T+1} is the posterior standard
deviation of j’s outcome. Index rules in general require the agent to solve an optimal stopping
problem for each choice; the function, h(δ), approximates the boundary to this continue-quit
decision, which depends on the discount rate. Lai (1987) and Chang and Lai (1987) develop
closed-form approximations to the boundary (see Technical Appendix, Section C). In the discounted
infinite horizon setting, the boundary function declines when physicians discount the future more
heavily. The standard deviation, σ_{i,j,T+1}, weakly decreases over time with experience. Thus, in
cases where the physician has a low discount rate or where previous experience diminished σ_{i,j,T+1},
the decision rule closely resembles the forward-myopic rule.

^{19} The drug choice setting satisfies the conditions required for Gittins’ solution precisely because drugs not chosen remain available in their initial states—the sequencing of the treatments does not change an individual’s reaction to a drug when taking it. This would not be true if the initial drug changed the individual’s physiology such that it foreclosed use of a second treatment. In the antidepressant example, ignoring minor discontinuation periods, physicians can prescribe drugs in any order without changing their treatment effects.
^{20} When options become indistinct, Rusmevichientong and Tsitsiklis (2010) provide an alternative dynamic allocation rule.
^{21} Brezzi and Lai (2002) quantify "regret" for a forward-myopic rule. While Lai (1987) possesses regret of logarithmic order as δ approaches 1, the forward-myopic rule has larger regret, of order (1 − δ)^{−1}. In cases where the future horizon is short, the two rules likely perform similarly. See Section B of the Technical Appendix for details on the approximation.
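A minimal sketch of this index rule follows. The function name and example values are hypothetical, and whether the idiosyncratic noise σ²_W enters the posterior outcome variance is my reading of the text's V(y) notation rather than a detail the paper pins down:

```python
import numpy as np

def index_values(f, U_chars, beta_post, V_post, sigma2_W, h_delta):
    """Index for each of J options: expected utility plus a bonus
    proportional to the posterior standard deviation of the outcome."""
    mean_util = f + U_chars @ beta_post
    # Posterior outcome variance: parameter uncertainty U V* U' plus,
    # by assumption here, the idiosyncratic noise sigma2_W.
    var_out = np.einsum('jk,kl,jl->j', U_chars, V_post, U_chars) + sigma2_W
    return mean_util + np.sqrt(var_out) * h_delta

# Hypothetical two-option example: option 0 has the less certain match.
f = np.array([1.0, 1.0])
U_chars = np.eye(2)
beta_post = np.zeros(2)
V_post = np.diag([4.0, 1.0])
idx = index_values(f, U_chars, beta_post, V_post, sigma2_W=1.0, h_delta=0.5)
# The physician chooses argmax(idx); with equal means, the less-known
# option 0 carries the larger experimentation bonus. As h_delta -> 0,
# the rule collapses to the forward-myopic rule.
```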
5 Algorithm
5.1 Forming Choice Probabilities
I combine information from the initial condition and the learning model to form choice probabilities.
The probabilities begin with the index rule from Lai (1987). Since the rule balances the expected
mean utility of a choice with its posterior variance, it depends directly on f_{ijt}, the fixed component of
mean utility; on (β_{i,0}, V_0), the priors in the effectiveness component; and on past random outcome
realizations, Y_{i,t−1}. Recall, f_{ijt} includes an additive extreme value error. The probabilities
therefore take a logit form:

h_{ijt}(β_{i,0}, V_0, Y_{i,t−1}) = exp(G_{ijt}(β_{i,0}, V_0, Y_{i,t−1})) / (1 + Σ_k exp(G_{ikt}(β_{i,0}, V_0, Y_{i,t−1})))

where G_{ijt} is the index rule^{22}. However, I cannot use this directly in estimation, since, as the
econometrician, I do not observe either Y_{i,t−1} or β_{i,0}. I therefore integrate out over their distributions.
I showed earlier that Y_{i,t−1} follows a normal distribution with variance σ²_W I_{t−1} and mean
Uβ_i. Integrating over its distribution, the choice probabilities appear as:

∫_{Y_{i,t−1}} h_{ijt}(β_{i,0}, V_0, Y_{i,t−1}) f(Y_{i,t−1} | β_{i,0}, σ²_W) dY
β_{i,0} should come from the initial stage mixed logit model. However, again I do not have a direct
estimate of it. Since the initial stage relies only on a single observed choice per individual, I can
consistently estimate the population parameters, (b, W), of the distribution of β_{i,0} but not β_{i,0}
itself—the number of choices observed per agent would have to increase to apply the Bernstein-von
Mises theorem. Instead, I use Bayes’ rule to transform the consistent population parameters
from the initial stage into draws from a distribution of β that conditions on the observed drug
choice in the initial period (d_{i1}):

k(β | d_{i1}, X_i, Z_i, b, W, α) = P(d_{i1} | X_i, Z_i, β, α) g(β | b, W) / ∫ P(d_{i1} | X_i, Z_i, β, α) g(β | b, W) dβ

^{22} I normalize the index for the outside option to zero, such that exp(G_{i,Outside,t}(·)) = 1.
The conditional posterior reflects the distribution of β for the subset of agents who would choose
d_{i1} when (X_i, Z_i) define the choice situation^{23}. This posterior does not have a closed form. I
collect R simulation draws from it via a Metropolis-Hastings algorithm, described in the Technical
Appendix (Section D). Collecting these draws requires no additional effort. Since I estimate the
initial stage mixed logit via a hierarchical MCMC procedure, I have in storage sequences of draws
from this conditional posterior as a by-product of the estimation routine.
Thus, the choice probabilities I use in the likelihood function involve simulation:

P_{ijt} = (1/R) Σ_{r=1}^{R} [ ∫_{Y_{i,t−1}} h_{ijt}(β^r_{i,0}, V_0, Y_{i,t−1}) f(Y_{i,t−1} | β^r_{i,0}, σ²_W) dY ]
I integrate out the unobserved outcomes, Y_{i,t−1}, over the smooth function h_{ijt}(·) via the Gauss-
Hermite quadrature method (see Judd (1998)).
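A one-dimensional sketch of this quadrature step, for a scalar past outcome (the paper integrates over the full vector Y_{i,t−1}; `G_fn` is a hypothetical stand-in for the index rule):

```python
import numpy as np

def expected_choice_probs(G_fn, mu, sigma, n_nodes=15):
    """Integrate logit choice probabilities over a scalar past outcome
    y ~ N(mu, sigma^2) by Gauss-Hermite quadrature.

    G_fn(y) returns the (J,) vector of inside-good index values given the
    realization y; the outside option's index is normalized to zero, so
    its exponential contributes the 1 in the denominator."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    probs = 0.0
    for x, w in zip(nodes, weights):
        y = mu + np.sqrt(2.0) * sigma * x   # change of variables
        G = G_fn(y)
        probs = probs + w * np.exp(G) / (1.0 + np.exp(G).sum())
    return probs / np.sqrt(np.pi)           # Gauss-Hermite normalization
```

For a constant index function G_fn(y) = (0, 0), the routine returns the plain logit probabilities (1/3, 1/3), which makes for a quick sanity check.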
5.2 Simulated Maximum Likelihood
With these simulated choice probabilities, I form the following log likelihood to maximize over
choices in periods 2 onward:

log(L) = Σ_{i=1}^{N} Σ_{j=1}^{J_i} Σ_{t=2}^{T_i} d_{ijt} log(P_{ijt})
I proceed with a derivative-free solver to optimize over the diagonal elements of V_0 and the
variance of the idiosyncratic outcome term, with the constraint that the variances be weakly positive^{24}.
I find the appropriate standard errors of the resulting estimates using a bootstrap procedure
described in the Technical Appendix (Section E). In brief, I take random draws from the initial
stage posterior distributions and use each in a bootstrap sampling procedure that computes the
learning model. I account for variation in β_{i,0} away from E(β_{i,0}) and for the fact that I estimate
the underlying population parameters of β_{i,0}’s normal distribution in the initial stage. The bootstrap
variances approximate the asymptotic covariance matrix for sequential estimators derived by
Newey (1984) in the moment condition setting.

^{23} As discussed by Train (2003), conditional on the underlying parameters of β, (b, W), the distribution k(·) is exact—there is no sampling variance. However, I estimate (b_0, W_0) as (b̂, Ŵ). In my bootstrap procedure to estimate the standard errors of the second stage parameters, I account for the variation of (b̂, Ŵ) around (b_0, W_0).
^{24} As a check on the consistency of my estimation via simulated maximum likelihood, I increase tenfold the number of simulation draws used in the optimization. The results change very little.
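The likelihood and positivity constraint can be sketched as follows. All names are hypothetical, and exponentiating the parameters is one standard way to impose weak positivity, not necessarily the paper's:

```python
import numpy as np

def sim_log_likelihood(theta, D, prob_fn):
    """Simulated log likelihood over observed choices from period 2 on.

    theta   : unconstrained parameters; exponentiating enforces the
              weak-positivity constraint on the variances
    D       : iterable of (i, j, t) indices of observed choices
    prob_fn : hypothetical routine returning the simulated probability
              P_ijt of choice (i, j, t) given the variance parameters"""
    variances = np.exp(theta)   # diagonal of V_0 and sigma^2_W, all > 0
    ll = 0.0
    for (i, j, t) in D:
        p = prob_fn(variances, i, j, t)
        ll += np.log(max(p, 1e-300))   # guard against log(0)
    return ll

# A derivative-free routine (e.g. Nelder-Mead) then minimizes
# -sim_log_likelihood over theta.
```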
5.3 Identification
The estimates in the learning model condition on the mean parameters from the initial stage
estimation. In the first period decision problem, each patient faces plan-specific choice characteristics
which also vary by the calendar year of the patient’s visit. The variation in the choice characteristics
produces distinct initial period drug shares, which identifies the fixed coefficients and the
population mean and variance of the random coefficient distribution^{25}.
Using the panel data, I estimate the variances on: (1) the coefficients in the parameterization of
outcomes, and (2) the random shock component of an individual’s outcome. Similar to Crawford
and Shum (2005), the identification of the variances on the outcome coefficients depends on how
drug shares across patients change from the first period of treatment through to later periods. If
the shares of a product change little, then the model would fit smaller variances on the coefficients
of characteristics that compose that product. For example, if a drug in the SNRI class maintains
roughly the same market share at each point across the sample of patient episodes, the variance
on the SNRI indicator’s coefficient will be small. Large changes in the share of a product across
decision points imply physicians knew little about its characteristics before experimentation.
The variance of the idiosyncratic shock, σ2W , relates to the overall rate of learning. With small
variance in the shock, switching will occur faster after the initial period. As the variance in the
idiosyncratic component grows larger, physicians and patients learn relatively less about the quality
of the match to a treatment from experience, and so will persist on it for longer.
6 Dynamic Model Estimation Results and Fit
The new parameters identified from the switching patterns in the panel data include the variance
of the idiosyncratic learning error and the variances of the coefficients in the parameterization of an
^{25} Bajari et al. (2009) provide a formal nonparametric identification argument for the distribution of the random coefficients in a mixed logit model.
individual’s health outcome. I report these estimated variances in Table 7, along with bootstrapped
standard errors.
The variances provide insight into the physician’s priors, conditional on the means fixed from
the initial condition. A large variance on the coefficient for a particular subclass, for example,
implies significant prior uncertainty about the efficacy of the class for any patient. Mechanically, the
larger the variance of such coefficients, the greater is the experimentation incentive in the index
rule for drugs with these characteristics^{26}. This invites physicians to try out the drug. Following
experience, rates of switching away from the treatment will also be high, since the index value
can drop substantially when the posterior variance of a drug’s effect shrinks. In addition, a large
variance on the idiosyncratic component of the outcome implies slow learning—the physician and
patient update very little from experience since the outcome is a noisy signal of match quality.
Looking at the estimates, the NDRI, NaSSA, and TCA subclasses have the largest variances,
much greater than those of SSRIs and SNRIs, holding constant nausea effects and dosing. These
results correspond to the tendency of patients to persist on SSRI and SNRI treatments more often
than other classes in the data, suggesting they offer a good efficacy-tolerability bundle. The
significant variance in the value of trying an "inside good" relates to the speed at which patients
exit drug care—there is a great deal of uncertainty among patients about the value of drug treatment
in general. The smaller but still significant variance on the side effect variable and on dosing—16.0
and 15.2, respectively—again suggests physicians begin with uncertainty about a specific patient’s
sensitivity to these drug characteristics. Finally, the size of the idiosyncratic variance fits the high
observed persistence on treatment of 60-70% across decision points. Experience offers too noisy a
signal of a patient’s true match to a drug to allow substantial updating over time.
I assess model fit in two ways. First, I use the model to predict the aggregate probabilities
of switching to the outside good and across subclasses. I compare predictions from the dynamic
model against both observed transitions and predictions from a static model. To determine the
static predictions, I rerun the mixed logit specification from the initial stage, feeding in the entire
sequence of observed choices for each patient’s episode. The procedure is static in that it ignores
the order of prescriptions chosen within an episode, instead treating each choice as independent.
As illustrated in Figure 4, the dynamic model matches fairly closely the observed pattern of
patient exit from drug care—with the exception of decision point 2. The static model misses both
at decision point 2 and at all future decision points: it predicts a constant share over time, which
the data clearly reject. The dynamic model’s over-prediction of the share of patients in drug care
at the first persist/switch decision suggests the model misses patients’ skepticism of the value of
drug care initially. To fix this, one could model the outside good at point 2 as distinct from the
outside good in later periods. I do not implement this extension due to computational burdens,
but point out that this problem commonly affects both static and dynamic approaches to discrete
choice^{27}.

^{26} Large variance in outcomes from a treatment might be undesirable from a malpractice perspective if it increases the risk of suicide or adverse reactions. As noted in the development of the initial period model, the level of the estimated fixed effect can account for the risk of rare side effects.
Table 8 illustrates a comparison of the static and dynamic models in predicting switching across
drug subclasses. The dynamic model does a fairly good job matching the observed persistence on
a drug class conditional on survival, particularly for SSRIs. It provides a less accurate prediction
for NDRIs. One reason for this may be the use of the mean estimates from the initial reduced form
model in the later learning model. Physicians appear to favor NDRIs for more severe cases; if a
patient’s diagnosis worsens in period 2 or later, NDRIs should have a larger mean and physicians
should prescribe them more often. That the fixed component of an individual’s mean, apart from
price effects, remains stable in the model means it may predict less persistence on NDRIs than
occurs in practice. Note that the static model misses the trend in persistence completely, as it
predicts shares by period independent of past decisions.
In a second fit measure, I compute how often the model correctly predicts an individual’s
observed choice over three decision points. The dynamic model is unlikely to fit perfectly, given
the restrictions placed on the covariate space. However, as shown in Table 9, the observed choice
matches one of the top three predictions for that patient at a rate of 65%. Expanding to the top
five choices, the model matches the observed choice in 75-80% of patients. To capture how well a
static model would perform, I randomly rank the choices in each period using market shares. This
random ranking amounts to sampling without replacement, where I take draws from a hypothetical
urn containing the names of available options in proportion to their market shares. I report these
probability calculations in Table 9. The dynamic model selects the observed choice nearly twice
as often as does the static approximation.
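The urn-style static benchmark described above can be sketched directly: rank the options by sampling without replacement in proportion to market share, then count how often the observed choice lands in the top k. The drug names and shares below are illustrative placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def share_weighted_ranking(shares, rng):
    """Rank the options by sampling without replacement from an 'urn'
    in which each option appears in proportion to its market share."""
    options = list(shares)
    ranking = []
    while options:
        w = np.array([shares[o] for o in options], dtype=float)
        pick = rng.choice(len(options), p=w / w.sum())
        ranking.append(options.pop(int(pick)))
    return ranking

def top_k_hit_rate(observed, shares, k, n_sims=2000, rng=rng):
    """Fraction of simulated rankings whose top k contain the observed choice."""
    hits = sum(observed in share_weighted_ranking(shares, rng)[:k]
               for _ in range(n_sims))
    return hits / n_sims

# Illustrative drugs and shares (hypothetical, not the paper's estimates)
shares = {"fluoxetine": 0.30, "sertraline": 0.25, "paroxetine": 0.20,
          "bupropion": 0.15, "venlafaxine": 0.10}
rate3 = top_k_hit_rate("bupropion", shares, k=3)
```

With five options, the top-5 hit rate is mechanically one, so the benchmark only bites for smaller k.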
Note that the model faces the most difficulty in predicting the exact date the patient will
exit treatment. In the second set of columns in Table 9, I omit from the statistics predictions
at decision points at which an individual exits treatment. For the subset of periods where the
27Specifying a distinct first period error requires replacing the idiosyncratic extreme value term with, say, a normal error that persists and influences later choices. This requires a multinomial probit approach, which I leave for future work.
individual chooses an inside good, the model’s top 3 and top 5 predicted choices match the observed
choice at rates of 80-85% and 85-90%, respectively.
7 Counterfactual Experiments and Patient Welfare
I return to the motivating public policy question, employing the dynamic model to predict the
impact of alternative copayment policies and promotion schemes on the insurer’s net drug costs
and on adherence. The predicted adherence rate depends directly on the learning model: patients
exit care as soon as past experience, along with an idiosyncratic shock, elevates the outside good’s
index value above all inside goods.
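The exit rule in this paragraph reduces to a simple index comparison. A minimal simulation sketch follows; the index values are hypothetical, and a Gumbel shock stands in for the model's extreme value errors.

```python
import numpy as np

rng = np.random.default_rng(1)

def adherence_rate(inside_index, outside_index, n_patients=10_000,
                   n_periods=3, rng=rng):
    """Share of simulated patients still in drug care after n_periods:
    a patient exits as soon as the outside good's index value plus an
    idiosyncratic shock exceeds the best inside good's index."""
    in_care = np.ones(n_patients, dtype=bool)
    for _ in range(n_periods):
        shock = rng.gumbel(size=n_patients)   # extreme value shock
        in_care &= (outside_index + shock) <= inside_index
    return in_care.mean()

# Hypothetical index values: inside goods clearly preferred
rate = adherence_rate(inside_index=3.0, outside_index=0.0)
```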
I simulate the experience of 10,000 sample patients diagnosed in 2005, the latest year in my
data. I run two sets of counterfactuals. In the first, I consider a series of copayment schemes, from
uniform pricing to more detailed "value-based" insurance designs, as described by Chernew et al.
(2007). I focus on copayments since the initial reduced form estimates illustrate that physicians
consider patient prices in their decisions. In the second set of counterfactuals, I simulate the impact
of promotion for generics and for the branded medications, Zoloft (Sertraline) and Wellbutrin
(Bupropion). I proxy for advertising by changing the product-level fixed effects, which correlate
with manufacturer promotion. The goal of these examinations is to test the effect of specific
promotion schemes given learning over time. I report the simulation results in Table 10. I use
these predictions later to build a new treatment protocol.
To quantify changes in patient health under the alternative policies, I translate the rate of
adherence to an expected number of weeks the patient suffers from depressive symptoms, assigning
a dollar value to the symptomatic relief. I describe the calculation in Section F of the Technical
Appendix. In brief, I begin by collecting expected rates of full and partial recovery from depressive
symptoms using Berndt et al. (2002). They survey a group of psychiatric specialists, collecting the
panel’s predictions on the recovery rates for various treatment categories and treatment durations.
I divide the patients in my sample into these same categories, using my model to predict the initial
drug each patient uses and his duration on that treatment. The number of patients in each category
will differ by counterfactual policy, since the policy alters the model’s predictions of an individual’s
treatment choice and adherence. The recovery rates for a patient determine the expected number
of weeks in the year following the diagnosis that he will suffer from depression. I rely on Lave et al.
(1998) and Cutler (2004) to translate this expected number of weeks into a dollar measure. In the
baseline case, the distribution of treatments prescribed leads to an average reduction in depressive
symptoms worth $366, versus a monthly drug cost bill of roughly $42. I compare the dollar value
of symptomatic relief under the counterfactual policies against the cost of drug care in Table 11.
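The mapping from adherence to dollars can be sketched with placeholder inputs. The paper's calibrated figures come from Berndt et al. (2002), Lave et al. (1998), and Cutler (2004); the recovery rates, weeks, and per-week values below are hypothetical.

```python
def expected_symptom_weeks(p_full, p_partial, weeks_in_year=52,
                           partial_relief=0.5):
    """Expected weeks of depressive symptoms in the year after diagnosis,
    given the probabilities of full and partial recovery for the patient's
    treatment category (partial recovery counted at half relief here)."""
    relief_share = p_full + partial_relief * p_partial
    return weeks_in_year * (1 - relief_share)

def relief_value(weeks_untreated, weeks_treated, value_per_week):
    """Dollar value of symptomatic relief relative to no treatment."""
    return (weeks_untreated - weeks_treated) * value_per_week

# Hypothetical inputs, not the paper's calibrated figures
w = expected_symptom_weeks(p_full=0.4, p_partial=0.3)  # 52 * (1 - 0.55) = 23.4
v = relief_value(weeks_untreated=40.0, weeks_treated=w, value_per_week=22.0)
```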
7.1 Insurance Design
I consider the utility of three alternative copayment policies in reducing insurer costs and improving
patient welfare. In the first, I set a uniform copayment of $0 for all treatments. In the second, I
set the copayments for generic drugs to $0 and branded drugs to the 95th percentile level, about
$50 per month supply. In the third, I set a "value-based" policy, as Chernew et al. (2007) suggest.
For the specific SSRIs and SNRIs that lead to the fewest switches and highest adherence in the
model, I set the copayment equal to $0. If any of the drugs in the initial tier are off-patent, I set
the copayment for the branded version to $15 per month supply to discourage their use relative to
the generic version. For all other non-recommended treatments, I set the copayments at $60 per
month supply, above the 95th percentile of observed copayments.
For all three policies, I simulate rates of adherence over three decision points and calculate
net costs to the insurer, which equal the cost charged by the manufacturer less the copayments
paid by patients.28 I translate adherence to a reduction in depressive symptoms, denominated in
dollar terms, and judge the performance of a policy by its effect on the insurer cost/patient health
trade-off.
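As a sketch, the value-based schedule amounts to a three-way tiering rule. The drug names here are hypothetical examples, not the model's recommended set.

```python
def value_based_copay(drug, recommended, off_patent_brands):
    """Monthly copayment under the value-based design: $0 for the
    recommended molecules, $15 for the branded version of an off-patent
    recommended drug, and $60 for all other treatments."""
    if drug in recommended:
        return 0
    if drug in off_patent_brands:
        return 15
    return 60

# Hypothetical tiering for illustration, not the model's recommended set
recommended = {"generic sertraline", "generic venlafaxine"}
off_patent_brands = {"Zoloft"}
copay = value_based_copay("Effexor XR", recommended, off_patent_brands)
```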
Looking first at the $0 uniform copayment policy, the main effect is to increase the use of branded
drugs relative to their generic counterparts. Physicians in the data express strong preferences for
brand names and so prescribe them readily when price effects disappear. On the cost/adherence
score, the policy leads to a 5.5% increase in adherence by period 3, stemming largely from the
increased use of effective SSRIs and SNRIs. However, the improvement comes at an added cost
equal to 50-60% of the baseline level. By lowering all copayments to zero, the insurer both loses
the revenue from these payments and fails to encourage the use of cost-effective treatments. On
welfare, the policy leads to a $8 increase in symptomatic relief above the baseline average of $366,
but at an added monthly cost of $21.
When the generic copayments are set equal to zero and branded rates to the 95th percentile,
the shifting primarily benefits generic SSRIs, which gain 14-23% in share over three periods—double
28Copayments typically return to the insurer indirectly via a bargaining process with the drug manufacturer. Levy (1999) discusses the interactions between insurers, manufacturers, and retail pharmacies in price setting.
the baseline. This increase comes at the expense of branded SSRIs, SNRIs, and NDRIs, which also
cede share to generics in other classes. Costs decrease by 18%, about $8 for a monthly prescription.
Adherence rates, however, improve only marginally, because the policy encourages the use of all
generics, even those in subclasses with poorer efficacy and tolerability. Symptomatic relief falls by
$5.60.
The "value-based" design offers the highest payoff in costs, adherence, and symptomatic relief.
Patients shift to the most effective SSRIs, incentivized by copayments of $0 per month relative
to $15 or $60 per month on alternatives. The share on classes excluding SSRIs and SNRIs falls
19%, about a quarter of the baseline level. As a result, adherence rates increase over 4% above
the baseline by the third decision point. The gain in symptomatic relief equals $11 while monthly
costs remain fairly constant, declining by $0.13. The predictions suggest a "value-based" design
outperforms existing tiered policies, as it boosts patient health without raising costs sharply.
7.2 Promotion
Promotion of particular products can influence physician decisions and alter the efficiency of the
depression care provided to patients. The goal of the following exercises is not to identify the
effect of any particular channel of manufacturer advertising—I lack information at the patient or
physician level on the types of advertising used—but to understand how changing the physician’s
prior information set interacts with her learning process. Increased promotion can in theory either
enhance or detract from patient health and costs. If manufacturer advertising skews prescribing
towards a less effective drug initially, generating poor adherence over time, public policy might
suitably aim to reduce those ads or counter them with promotion based on effectiveness studies.
If, on the other hand, promotion—even of inferior products—leads to greater initiation and therefore
learning about the value of drug treatment in general, adherence may increase. This can translate
to greater symptomatic relief and reduced rates of relapse.
To test the role of promotion, I implement three experiments. In each, I proxy for promotion by
adjusting drug fixed effects, which correlate with average advertising expenditure (see Table 6).29
First, I mimic promotion of Zoloft (Sertraline), a branded SSRI with generally high tolerability and
efficacy but also a high cost. I increase Zoloft’s fixed effect threefold, to proxy for a percentage
increase in advertising to match the drug with the largest observed monthly advertising in 2003.
29Note that the existing level of advertising may be endogenous: if advertising is less effective in encouraging the use of a treatment, manufacturers likely reduce promotion. My goal is to understand precisely why advertising may not change prescribing, looking at physician learning as an impediment.
Second, I increase promotion for Wellbutrin (Bupropion), an example of a branded treatment with
lower average tolerability. I increase Wellbutrin’s fixed effect by the same percentage as for Zoloft.
Third, I reduce the bonus physicians assign to branded drugs and raise the fixed effects for generics
to the mean level. This experiment mimics a scenario where public policy bans branded advertising.
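The two fixed-effect adjustments can be sketched as follows. The fixed-effect values are hypothetical, and the advertising-ban rule is a stylized reading of the experiment described above.

```python
def boost_drug(fixed_effects, drug, factor=3.0):
    """Scale one drug's fixed effect to proxy for increased advertising."""
    fe = dict(fixed_effects)
    fe[drug] = fe[drug] * factor
    return fe

def ban_branded_ads(fixed_effects, mean_level):
    """Stylized proxy for a branded-advertising ban: move every drug's
    fixed effect to the mean level, removing the branded bonus and
    lifting generics."""
    return {d: mean_level for d in fixed_effects}

# Hypothetical fixed effects, not estimates from the paper
fe = {"Zoloft": 0.5, "Wellbutrin": 0.4, "generic fluoxetine": 0.2}
boosted = boost_drug(fe, "Zoloft")
banned = ban_branded_ads(fe, mean_level=0.35)
```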
Increasing Zoloft advertising raises Zoloft’s share to a majority, 52%, from a baseline level of
8%. After three periods, that share remains above 50%. Adherence increases by 5.8% into period 3,
while costs rise by 13-28%. About four-fifths of this increase comes from the redistribution of
treatments toward Zoloft, which carries a higher manufacturer price (before rebates) than generic
alternatives, with the remaining fifth due to improved adherence. While the increase in adherence shows the
value of advertising for effective products, the cost/health trade-off compares unfavorably with
that of "value-based" copayment design: health, in dollars, increases by $13.70, but at an added
monthly cost of $5.60. In addition, if the insurer were to mimic the manufacturer’s advertising
to increase adherence, it would incur promotional expense. To provide some comparison, in 2003,
the manufacturer of Zoloft spent $8.75 million a month on detailing to clinicians, dropping off 1
million sample packages.30
The increase in Wellbutrin’s advertising in this setting leads to a boost in its share from a
baseline of 15% to 39% in period 1. Unlike with Zoloft, the rates of exit after starting on Wellbutrin
are high, as many users depart for SSRIs or the outside good after poor outcomes. Its share falls
from 39% to 20% at decision 2, and to 14% at decision 3. Overall, adherence improves 2.1% over
the baseline case. The simulation illustrates a positive health effect from advertising, even for
less effective products. More patients begin on active treatment, trying out Wellbutrin. In the
process, some learn positively about their match to Wellbutrin or to another inside good, and stick
with drug care as a result. Patients gain $7.50 from the value of reduced symptoms. The policy
is costly, however, since Wellbutrin itself is expensive: drug costs rise 23%, an $8.30 increase in the
average monthly drug bill over the baseline case.
In the final experiment, I find that banning branded promotion has little impact on the physician’s
decision to write a prescription and the patient’s decision to fill it once in hand. The share of the
outside alternative actually rises by about 1%. Instead, the main effect is to shift the distribution of
prescriptions. Generic SSRIs gain 27-40% in share, with less effective generic NDRIs gaining 11.5%
in share in the first period. The change in distribution leads to cost improvement of around 30%.
Adherence falls a bit because the policy fails to increase initiation and incentivizes all generics, even
30Personal Selling Audit/Hospital Personal Selling Audit. Verispan, Inc., 2003.
those with poorer properties. The value of symptomatic relief under the policy falls by $17, about
5% of the baseline.
7.3 Discussion of Policies
Two main findings emerge from the simulations. First, after accounting for learning, the largest
gains in patient outcomes occur when policies promote treatments with above-average efficacy and
tolerability. Costs improve when the policies use either advertisements or lower prices to encourage
the cheapest products within the set of effective options. This is precisely the approach of value-
based insurance design, on the price dimension, and "academic detailing", as described by Soumerai
and Avorn (1990), on the promotion dimension. However, the policies differ in costs. Given
information technology, the creation of multiple tiers for pricing involves little additional expense.
Advertising, however, is costly—a small-scale academic detailing program in Pennsylvania in 2006
required $1 million a year in funding.31 Sending academic detailing agents to clinicians nationwide
would require millions more.
Second, manufacturer promotion may be beneficial as a matchmaker, even if its purpose seems
more to persuade than inform (Anand and Shachar (forthcoming)). The results from the promotion
of Zoloft and Wellbutrin suggest that while promoting an effective product boosts adherence and
symptomatic relief more, promoting a lesser option can improve welfare if it nonetheless encourages
experimentation. Consumers then learn about their match to specific characteristics of the product,
finding better options and adhering at higher rates than if regulators restricted promotion.
Note that I neglect careful modeling of the manufacturer’s pricing and advertising decision, since
I lack both a measure of a drug’s price after rebates as well as advertising expenditures for the
time period of my patient-level data. However, roughly, I predict poorer performance by branded
manufacturers under the policies recommended above, since they encourage greater generic use. In
practice, firms will likely respond to insurer policies strategically. For example, in the depression
setting, manufacturers often introduce new products into their portfolio near the expiration of the
patent on their branded product. These are typically "reformulations," as discussed by Huskamp
et al. (2009). Manufacturers may create a new patented "extended release" form of their existing
product that requires fewer daily doses. The change may be even more subtle: in 2002, Forest
Laboratories introduced Lexapro (Escitalopram) a year prior to the patent expiration of one of its
31Scott Hensley, “Negative Advertising: As Drug Bill Soars, Some Doctors Get An ’Unsales’ Pitch,” The Wall Street Journal, March 13, 2006, sec. A, p. 1.
branded depression medications, Celexa (Citalopram). Lexapro has virtually the same molecular
form as Celexa, but simplifies the chemical structure to remove a redundant component. From my
model, the two drugs have near identical rates of adherence and predicted effectiveness; Lexapro
costs the insurer twice as much as the generic form of Celexa. While outside the scope of the
current paper, a model of the strategic behavior of manufacturers could provide additional insight
into the effect of pricing and promotion policies on social welfare, including manufacturer profits.
7.4 New Protocol
Finally, using the estimated rates of adherence by drug treatment from the baseline model, I build a
new treatment protocol. Rather than relying on estimated treatment effects in clinical trials, I use
the timing and identity of drug switches in the large census of patient experiences to recommend a
set of treatments.
Past theoretical and empirical work supports the notion that adherence or attrition rates can
provide valuable insight on effectiveness. Philipson and Hedges (1998) and Philipson and DeSimone
(1997), for example, argue that randomization and blinding in clinical trials may not actually
produce unbiased treatment effects when trial subjects are Bayesian and actively learn about the
effectiveness of the experimental drug. The subject’s behavior then is treatment dependent, and
will differ systematically between the control and treatment group. Attrition rates, in contrast,
impound all the unobserved aspects of the treatment that make it undesirable. Chan and Hamilton
(2006) judge the value of attrition rates empirically using the setting of a major clinical trial for an
HIV treatment. I extend this notion to adherence rates in a large pool of insurance claims. Since
physicians in practice expend considerably less effort to ensure patient adherence relative to the
organizers of clinical trials, the observed rates of quitting treatment should provide an externally
valid measure of a drug’s effectiveness.
In Table 12, I report the predicted rates of adherence by drug molecule along with the recom-
mendations for antidepressant treatment from three published protocols. In the final column, I
include my own protocol built from the dynamic model’s predictions.32 I use a quit rate of 20%
at the second decision point as an upper threshold for recommending a product. There are two
main distinctions between the model-based protocol and existing recommendations. First, the new
protocol de-emphasizes NDRIs, like Wellbutrin, and TCAs as preferred treatments. Even though
32As a sensitivity, I predict adherence in the setting where all copayments equal $0. The product recommendations built from these rates are similar to those in the main analysis.
clinical trials generally show little difference in efficacy between these treatments and SSRIs or
SNRIs, they prompt high rates of switching in the panel data. Second, while no existing protocol
differentiates among available SSRIs, the model suggests using off-patent molecules Fluoxetine
(Prozac) and Paroxetine (Paxil) over alternatives. This preference builds both from the lower
cost of these drugs and from greater unobserved tolerability. The comparative effectiveness review
by Gartlehner et al. (2007) provides some support for the model’s recommendation: Sertraline
(Zoloft) produced more gastrointestinal side effects than comparable SSRIs, and the model predicts
relatively weaker adherence for it.
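The protocol construction reduces to a threshold rule on predicted quit rates. A sketch follows; the quit rates below are hypothetical, not the values reported in Table 12.

```python
def build_protocol(quit_rates, threshold=0.20):
    """Recommend molecules whose predicted quit rate at the second
    decision point is at or below the threshold, ordered from lowest
    to highest quit rate."""
    kept = [(drug, q) for drug, q in quit_rates.items() if q <= threshold]
    return [drug for drug, _ in sorted(kept, key=lambda pair: pair[1])]

# Hypothetical predicted quit rates, not the values in Table 12
quit_rates = {"fluoxetine": 0.12, "paroxetine": 0.14, "sertraline": 0.19,
              "bupropion": 0.31, "nortriptyline": 0.35}
protocol = build_protocol(quit_rates)
```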
Note that I develop this protocol using the choice set and experience of patients treated in
2005. Recommendations for practice today might more appropriately build from newer data.
However, the results provide evidence that pharmacoeconomic analyses can enhance existing pro-
tocols. Crismon et al. (1999), in their algorithm design project, argued for using such analyses, but
found few credible studies to incorporate. The methodology developed here can help fill this evi-
dence gap cheaply: observational studies allow regulators to compare the effectiveness of different
treatments without the $20,000 to $50,000 per-patient cost common in full-scale trials.33
8 Conclusion
The estimates demonstrate that patient copayments, supply-side incentives, and promotion enter
the physician’s treatment decision. A static analysis that ignores patient and physician learning
would employ these estimates to advocate for high-powered incentives. Commonly-used tiered
copayment schemes, for example, can discourage physicians from prescribing expensive branded
treatments and so lower overall costs in the short term. However, using a dynamic analysis, I
measure the importance of a key trade-off in incentive design: policies which promote cheaper but
less effective options may lead to decreased adherence. When patients quit treatment early, they
suffer from depressive symptoms for longer periods; patient welfare declines and the insurer’s long-
run costs may rise. Managing this trade-off requires "value-based" copayment schemes or academic
detailing that promotes treatment regimens with the highest expected effectiveness and those most
likely to convince skeptical patients of the value of drug care. To define this set of treatments, I
find that adherence rates in observational data can provide guidance beyond outcomes in clinical
trials.
33Brian Gormley, “The Hidden Costs of Running a Biotech Company,” The Wall Street Journal, Sept 21, 2009.
On methodology, my algorithm design allows me to capture the richness of the physician-
patient agency relationship in an initial stage while focusing on the experience good nature of
antidepressants in a subsequent learning model. More generally, the estimation technique provides
a framework to use in judging the effect of policy interventions in other markets where agents
make dynamic purchase decisions. Furthermore, incorporating an index rule to solve the agent’s
dynamic problem reduces computation, particularly in settings where agents face large choice sets
and participate in the market over long time horizons.
References
Ackerberg, Daniel A., “Advertising, learning, and consumer choice in experience good markets:
an empirical examination,”International Economic Review, 2003, 44 (3), 1007—1040.
Aguirregabiria, Victor and Pedro Mira, “Dynamic discrete choice structural models: A sur-
vey,”Journal of Econometrics, 2010, 156 (1), 38—67.
Akincigil, Ayse, John R. Bowblis, Carrie Levin, James T. Walkup, Saira Jan, and
Stephen Crystal, “Adherence to antidepressant treatment among privately insured patients
diagnosed with depression,”Medical Care, 2007, 45 (4), 363.
Anand, Bharat and Ron Shachar, “Advertising the Matchmaker,” The RAND Journal of
Economics, forthcoming.
Bajari, Patrick, Jeremy T. Fox, Kyoo il Kim, and Stephen P. Ryan, “The random
coefficients logit model is identified,” NBER working paper, 2009, (No. 14934).
Bergemann, Dirk and Juuso Valimaki, “Learning and strategic pricing,”Econometrica, 1996,
64 (5), 1125—1149.
Berndt, Ernst R., Anupa Bir, Susan H. Busch, Richard G. Frank, and Sharon-Lise T.
Normand, “The medical treatment of depression, 1991-1996: productive inefficiency, expected
outcome variations, and price indexes,”Journal of Health Economics, 2002, 21 (3), 373—396.
Brezzi, Monica and Tze Leung Lai, “Optimal learning and experimentation in bandit prob-
lems,”Journal of Economic Dynamics and Control, 2002, 27 (1), 87—108.
Chan, Tat Y. and Barton H. Hamilton, “Learning, private information, and the economic
evaluation of randomized experiments,”Journal of Political Economy, 2006, 114 (6), 997—1040.
Chang, Fu and Tze Leung Lai, “Optimal stopping and dynamic allocation,” Advances in
Applied Probability, 1987, 19 (4), 829—853.
Chernew, Michael E., Allison B. Rosen, and A. Mark Fendrick, “Value-based insurance
design,”Health Affairs, 2007, 26 (2), 195.
Ching, Andrew T., “Consumer learning and heterogeneity: Dynamics of demand for prescription
drugs after patent expiration,”International Journal of Industrial Organization, 2010.
Crawford, Gregory S. and Matthew Shum, “Uncertainty and Learning in Pharmaceutical
Demand,”Econometrica, 2005, 73 (4), 1137—1173.
Crismon, M. Lynn, Madhukar Trivedi, Teresa A. Pigott, A. John Rush, Robert M. A.
Hirschfeld, David A. Kahn, Charles DeBattista, J. Craig Nelson, Andrew A. Nieren-
berg, Harold A. Sackeim, and Michael E. Thase, “The Texas Medication Algorithm
Project: report of the Texas consensus conference panel on medication treatment of major de-
pressive disorder,”Journal of Clinical Psychiatry, 1999, 60, 142—156.
Cutler, David M., Your money or your life, Oxford University Press, 2004.
Dickstein, Michael J., “Physician vs. patient incentives in prescription drug choice,”2010.
Efron, Bradley and Robert J. Tibshirani, An introduction to the bootstrap, Chapman and
Hall, 1993.
Erdem, Tulin and Michael P. Keane, “Decision-making under uncertainty: Capturing dynamic
brand choice processes in turbulent consumer goods markets,”Marketing Science, 1996, 15 (1),
1—20.
Fochtmann, Laura J. and Alan J. Gelenberg, Guideline Watch: Practice Guideline for the
Treatment of Patients with Major Depressive Disorder, 2nd Edition, Arlington, VA: American
Psychiatric Association, 2005.
Fryback, Dennis G., Eric J. Dasbach, Ronald Klein, Barbara E. K. Klein, Norma Dorn,
Kathy Peterson, and Patrica A. Martin, “The Beaver Dam Health Outcomes Study: initial
catalog of health-state quality factors,”Medical Decision Making, 1993, 13 (2), 89.
Gartlehner, Gerald, Richard A. Hansen, Patricia Thieda, Angela M. DeVeaugh-Geiss,
Bradley N. Gaynes, Erin E. Krebs, Linda J. Lux, Laura C. Morgan, Janelle A.
Shumate, and Loraine G. Monroe, Comparative effectiveness of second-generation antide-
pressants in the pharmacologic treatment of adult depression, Agency for Healthcare Research
and Quality (US), 2007.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin, Bayesian Data
Analysis: Texts in Statistical Science, Boca Raton, FL: Chapman and Hall/CRC, 2004.
Gemmill, Marin C., Joan Costa-Font, and Alistair McGuire, “In search of a corrected
prescription drug elasticity estimate: a meta-regression approach,”Health Economics, 2007, 16
(6), 627—643.
Gittins, J. C., “Bandit processes and dynamic allocation indices,”Journal of the Royal Statistical
Society. Series B (Methodological), 1979, pp. 148—177.
and D. M. Jones, “A dynamic allocation index for the discounted multiarmed bandit problem,”
Biometrika, 1979, 66 (3), 561—565.
Hellerstein, Judith K., “The Importance of the Physician in the Generic versus Trade-Name
Prescription Decision,”The RAND Journal of Economics, Spring 1998, 29 (1), 108—136.
Huskamp, Haiden A., Alisa B. Busch, Marisa E. Domino, and Sharon-Lise T. Nor-
mand, “Antidepressant reformulations: who uses them, and what are the benefits?,” Health
Affairs, 2009, 28 (3), 734.
Judd, Kenneth L., Numerical Methods in Economics, The MIT Press, 1998.
Karasu, T. Byram, Alan Gelenberg, Arnold Merriam, and Philip Wang, Practice Guide-
line for the Treatment of Patients With Major Depressive Disorder: Second Edition, Arlington,
VA: American Psychiatric Association, 2000.
Kessler, Ronald C., Patricia Berglund, Olga Demler, Robert Jin, Doreen Koretz,
Kathleen R. Merikangas, A. John Rush, Ellen E. Walters, and Phillip S. Wang,
“The epidemiology of major depressive disorder: results from the National Comorbidity Survey
Replication (NCS-R),”JAMA, 2003, 289 (23), 3095.
Lai, Tze Leung, “Adaptive treatment allocation and the multi-armed bandit problem,” The
Annals of Statistics, 1987, 15 (3), 1091—1114.
and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied
Mathematics, 1985, 6 (1), 4—22.
Lave, Judith R., Richard G. Frank, Herbert C. Schulberg, and Mark S. Kamlet, “Cost-
effectiveness of treatments for major depression in primary care practice,”Archives of General
Psychiatry, 1998, 55 (7), 645.
Levy, Roy, The pharmaceutical industry: a discussion of competitive and antitrust issues in an
environment of change, Washington, DC: Bureau of Economics/Federal Trade Commission staff
report, 1999.
Mahajan, Aditya and Demosthenis Teneketzis, “Multi-armed bandit problems,” in Foundations
and Applications of Sensor Management, Springer, 2007.
McFadden, Daniel and Kenneth Train, “Mixed MNL models for discrete response,”Journal
of Applied Econometrics, 2000, 15 (5), 447—470.
Melfi, Catherine A., Anita J. Chawla, Thomas W. Croghan, Mark P. Hanna, Sean
Kennedy, and Kate Sredl, “The effects of adherence to antidepressant treatment guidelines
on relapse and recurrence of depression,”Archives of General Psychiatry, 1998, 55 (12), 1128.
Miller, Robert A., “Job matching and occupational choice,”The Journal of Political Economy,
1984, 92 (6), 1086—1120.
Murphy, Kevin M. and Robert H. Topel, “Estimation and inference in two-step econometric
models,”Journal of Business and Economic Statistics, 2002, 20 (1), 88—97.
Murphy, Michael J., Ronald L. Cowan, and Lloyd I. Sederer, Blueprints: Psychiatry,
Philadelphia, PA: Lippincott Williams Wilkins, 2009.
Newey, Whitney K., “A method of moments interpretation of sequential estimators,”Economics
Letters, 1984, 14 (2-3), 201—206.
Philipson, Tomas and Jeffrey DeSimone, “Experiments and subject sampling,”Biometrika,
1997, 84 (3), 619.
and Larry V. Hedges, “Subject evaluation in social experiments,”Econometrica, 1998, 66
(2), 381—408.
Pomerantz, Jay M., Stan N. Finkelstein, Ernst R. Berndt, Amy W. Poret, Leon E.
Walker, Robert C. Alber, Vidya Kadiyam, Mitali Das, David T. Boss, and
Thomas H. Ebert, “Prescriber intent, off-label usage, and early discontinuation of antide-
pressants: a retrospective physician survey and data analysis,” Journal of Clinical Psychiatry,
2004, 65 (3), 395—404.
Rossi, Peter E., Greg M. Allenby, and Robert McCulloch, Bayesian Statistics and Mar-
keting, John Wiley and Sons, 2005.
Rusmevichientong, Paat and John N. Tsitsiklis, “Linearly parameterized bandits,”Mathe-
matics of Operations Research, 2010, 35 (2), 395.
Soumerai, Stephen B. and Jerry Avorn, “Principles of educational outreach (’academic de-
tailing’) to improve clinical decision making,”JAMA, 1990, 263 (4), 549.
Train, Kenneth, Discrete Choice Methods with Simulation, Cambridge Univ Press, 2003.
Whittle, P., “Multi-armed bandits and the Gittins index,” Journal of the Royal Statistical
Society. Series B (Methodological), 1980, 42 (2), 143—149.
Zellner, Arnold, An Introduction to Bayesian Inference in Econometrics, J. Wiley, 1971.
9 Technical Appendix
9.1 Section A: Mixed Logit via Bayesian methods
The likelihood of yi, the observed choice for individual i ∈ {1, ..., N}, takes the following form due
to the assumption of independent extreme value errors:
$$L(y_i \mid \alpha, \beta_i) = \prod_t \frac{\sum_k d_{ikt}\, \exp(\alpha' Z_{ikt} + \beta_i' X_{ikt})}{\sum_k \exp(\alpha' Z_{ikt} + \beta_i' X_{ikt})}$$
where d_{ikt} = 1 if individual i chooses drug k in period t, and d_{ikt} = 0 otherwise. I rewrite
the likelihood, integrating over β, to find the mixed logit probability, where β follows a normal
density, φ(·), with parameters (b, W):
$$L(y_i \mid \alpha, b, W) = \int L(y_i \mid \alpha, \beta)\, \phi(\beta \mid b, W)\, d\beta$$
The posterior of (b,W, α) is then:
$$K(b, W, \alpha \mid Y) \propto \prod_i L(y_i \mid b, W, \alpha)\, k(b, W)$$
Rather than proceed via a Metropolis-Hastings algorithm directly on the posterior distribution
of (b,W ) using the mixed logit probabilities, I specify a hierarchical distribution that treats βi as
a parameter as well. In notation, the desired posterior would appear as:
$$K(b, W, \beta_i\, \forall i, \alpha \mid Y) \propto \prod_i L(y_i \mid \beta_i, \alpha)\, \phi(\beta_i \mid b, W)\, k(b, W)$$
I employ the procedure described below to draw from this posterior, which simplifies compu-
tation and avoids simulating L(yi|α, β) for each i at each iteration of a direct Metropolis-Hastings
implementation (See Rossi et al. (2005) and Train (2003)). The steps include:
1. Specify Priors
I specify standard conjugate priors for (b, W): b ∼ N(b_0, V_{b0}) and W ∼ inverted Wishart
with K degrees of freedom and scale matrix I_K, where K is the dimension of b_0.
2. Combine the following sequential draws into a Gibbs Sampling routine:
(a) Draw b | W, β_i ∀ i. The posterior is normal, given conjugate priors and the assumption
that the β_i are N realizations from a normal:

$$b \mid W, \beta_i\, \forall i \;\sim\; N\!\left(\frac{\frac{1}{V_{b0}} b_0 + \frac{N}{W}\bar{\beta}}{\frac{1}{V_{b0}} + \frac{N}{W}},\; \frac{1}{\frac{1}{V_{b0}} + \frac{N}{W}}\right), \quad \text{where } \bar{\beta} = \frac{1}{N}\sum_i \beta_i$$
(b) Draw W | b, β_i for all i. The posterior is inverse Wishart with K + N degrees of freedom
and the following scale matrix:

$$\frac{K I_K + N s_1}{K + N}, \quad \text{where } s_1 = \frac{1}{N}\sum_i (\beta_i - b)(\beta_i - b)'$$
(c) Draw α | b, W, β_i for all i. Here, the posterior is K(α | β_i ∀ i) ∝ ∏_i L(y_i | α, β_i) k(α).
Draw from this distribution using a Metropolis-Hastings algorithm involving data pooled across
individuals.
(d) Draw β_i | b, W, α for all i. Here, the posterior is K(β_i | b, W, α, y_i) ∝ L(y_i | β_i, α) φ(β_i | b, W).
Draw from this distribution using a Metropolis-Hastings algorithm. Section D provides
additional detail.
I implement this MCMC method, experimenting with different starting values and parameters
in the prior distribution to achieve convergence. After dropping a burn-in sequence of 75,000
iterations, I retain every 100th draw from a sequence of 20,000 additional draws. The resulting
vectors approximate draws from the distribution of the parameters of interest. I use the draws from
the conditional posterior distribution on βi in step (d) as an input to the second stage estimation
routine.
9.2 Section B: Regret and Asymptotically Optimal Rules
9.2.1 Minimizing "Regret" vs. Maximizing Returns
The optimal adaptive allocation rule from Lai and Robbins (1985) and Lai (1987) involves a slight reformulation of Gittins's solution to the multi-armed bandit problem. The index rule no longer exactly matches the solution to the infinite horizon optimal stopping problem. To discuss the properties of this index rule, the authors rely on a "regret" function that sums the deviations of an agent's sequence of choices from the optimal choice. With the objective of minimizing expected regret, Lai and Robbins (1985) find a lower bound on regret and describe a policy rule that achieves it asymptotically.
To apply the results from Lai (1987) to my setting, I show that a policy that minimizes regret is
equivalent to one that solves the agent’s maximization problem. Thus, if a rule achieves the lower
bound on regret, it best approximates the solution from the traditional dynamic programming
approach amongst all possible index rules. Lai shows the connection between the approaches
generally; I illustrate the intuition below in a two-option example. Specifically, consider the case where Y_t is the outcome at time t. Y_t has a known distribution, f(Y; θ_j), where j ∈ {1, 2}; for example, Y_t = θ_j + ε_t, where ε_t ∼ N(0, 1). Let S_t = Y_1 + ... + Y_t denote the outcome stream, and let the policy rule at time t be ϕ_t, where 1{ϕ_t = j} = 1 when the policy picks j at time t. The policy rule ϕ_t depends on previous outcomes and choices. Let F_s denote the set of policies, ϕ_s, that may follow given past observations. Finally, I denote the mean outcome under j as μ(θ_j). With this set-up:
\[
E(S_t) = \sum_{s=1}^{t} E\bigl\{ E[Y_s(\theta_1)\,1\{\phi_s = 1\} \mid F_{s-1}] + E[Y_s(\theta_2)\,1\{\phi_s = 2\} \mid F_{s-1}] \bigr\}
= \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{t} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{t} 1\{\phi_s = 2\}\Bigr]
\]
The standard approach to the individual's problem involves maximizing these cumulative returns over the horizon, T, with respect to the prior distribution on Θ = (θ_1, θ_2):

\[
\max_{\phi_1,\ldots,\phi_T} \int E(S_T)\,dH(\theta)
= \max_{\phi_1,\ldots,\phi_T} \int \Bigl( \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 2\}\Bigr] \Bigr)\,dH(\theta)
\]
In contrast, Lai (1987) minimizes regret over the finite horizon:

\[ R_T(\theta) = T\,\mu^*(\theta) - E(S_T), \quad \text{where } \mu^*(\theta) = \max\{\mu(\theta_1), \mu(\theta_2)\} \]
For the minimization, integrating with respect to the prior distribution over Θ:

\[
\min_{\phi_1,\ldots,\phi_T} \int R_T(\theta)\,dH(\theta)
= \min_{\phi_1,\ldots,\phi_T} \int \bigl( T\,\mu^*(\theta) - E(S_T) \bigr)\,dH(\theta)
\]
\[
\implies \min_{\phi_1,\ldots,\phi_T} \; -\int \Bigl( \mu(\theta_1)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 1\}\Bigr] + \mu(\theta_2)\,E\Bigl[\sum_{s=1}^{T} 1\{\phi_s = 2\}\Bigr] \Bigr)\,dH(\theta)
\]

Since the T μ*(θ) term does not depend on the policy, minimizing expected regret is equivalent to maximizing expected cumulative returns.
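The equivalence can also be checked numerically. The sketch below simulates the two-option example with a simple upper-confidence index rule, a generic stand-in for Lai's rule rather than the paper's exact index, and confirms that the policy with the lower expected regret also earns the higher cumulative return; all names are illustrative.

```python
import numpy as np

def run(policy, theta=(0.2, 0.8), T=2000, seed=0):
    # Two Gaussian arms: Y_t = theta_j + eps_t, eps_t ~ N(0, 1).
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    pulls = np.zeros(2)
    means = np.zeros(2)
    for t in range(T):
        j = policy(pulls, means, t, rng)
        y = theta[j] + rng.normal()
        pulls[j] += 1
        means[j] += (y - means[j]) / pulls[j]      # running sample mean
    returns = pulls @ theta                        # expected cumulative return
    regret = T * theta.max() - returns             # R_T = T*mu* - E(S_T)
    return returns, regret

def ucb(pulls, means, t, rng):
    # Upper-confidence index rule: try each arm once, then pick the arm
    # with the largest index (sample mean plus an exploration bonus).
    if pulls.min() == 0:
        return int(np.argmin(pulls))
    return int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / pulls)))

def uniform(pulls, means, t, rng):
    # Non-adaptive benchmark: pick an arm at random each period.
    return int(rng.integers(2))
```

By construction, returns + regret = T μ* for any policy, so the return ranking of two policies always mirrors their regret ranking; the adaptive rule's regret stays far below the uniform benchmark's roughly T·(μ* − μ̄).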
9.2.2 Lower Bound on Regret
In the infinite horizon discounted setting, Chang and Lai (1987) provide a lower bound on the
cumulative regret possible given an index rule that samples one of the available choices at each
time period. To simplify notation, with J possible options, let θj be the parameters of choice j,
with µ(θj) the expected outcome under j. The optimal choice, though not known initially, has
θj = θ∗. The goal of the index rule is to suggest a sequence of choices to minimize deviations from
the optimal choice.
Chang and Lai describe a lower bound on the cumulative regret when using an index rule, as δ approaches 1:

\[ \text{Regret}(\theta_1, \ldots, \theta_J) \ge |\log(1-\delta)| \sum_{j:\theta_j < \theta^*} \left\{ \frac{\mu(\theta^*) - \mu(\theta_j)}{I(\theta_j, \theta^*)} + o(1) \right\}, \quad \text{where } I(\theta, \theta^*) = E_\theta\!\left[\log\!\left(f_\theta(y)/f_{\theta^*}(y)\right)\right] \]
The lower bound on regret is specific to the choice of distribution for the outcome function,
fθ(y). The index rule proposed by Lai (1987) and Chang and Lai (1987) attains this asymptotic
lower bound. It achieves the lower bound not only for a fixed (θ1, ..., θJ) as δ approaches 1, but also
uniformly over a variety of prior distributions on (θ1, ..., θJ), including cases where the θ elements
are potentially correlated.
The expression for the lower bound provides the following intuition on the quality of the approximation under Lai's rule. With f_θ(y) a normal distribution, the information number I(θ_j, θ*) shrinks as the outcome distribution under θ_j becomes harder to distinguish from that under θ*; when such an option also carries a meaningful mean shortfall μ(θ*) − μ(θ_j), its term in the sum grows and the minimal regret rises. When the available choices all have θ_j close enough to θ* that the mean shortfalls are negligible, the expected cumulative regret falls toward 0.
9.3 Section C: Closed Form Approximations to the Boundary Function
Gittins and Jones (1979) show that the solution to the discounted infinite-horizon dynamic program-
ming problem follows an index rule, where the decision-maker need only: (1) solve J single-armed
bandit problems, calculating the dynamic allocation index for each choice, denoted Gk(Xk(t)); and
(2) choose the option with the largest dynamic allocation index at any time period t. Here, Xk(t)
reflects the state variables of the choice problem that evolve over time34. At t, the agent chooses
j if and only if:
Gj(Xj(t)) = maxk∈{1,...,J}
Gk(Xk(t))
The dynamic allocation index at t solves the following problem:

\[ G_j(x_j(t)) = \sup_{\tau \ge t} \frac{ E_t\!\left[ \sum_{r=t}^{\tau} \beta^{r-t} R_j(X_j(r)) \,\middle|\, x_j(t) \right] }{ E_t\!\left[ \sum_{r=t}^{\tau} \beta^{r-t} \,\middle|\, x_j(t) \right] } \]
Rj(Xj(r)) reflects the returns from option j given the state variables at time r. The index
itself, Gj(xj(t)), can be thought of intuitively as a retirement option, following the formulation of
Whittle (1980). The decision-maker considers the value of trying an option for τ periods against a
retirement option that makes him just indifferent. The index rule approach then involves choosing
the option with the highest retirement value.
In the case of Bayesian decision makers who update the underlying parameters of the choice
problem from experience at t, the decision-maker’s problem repeats itself at every instant of time.
Rather than hypothesizing playing an option until τ , the optimal action is to recalculate the index
values after each period of experience, again choosing the option with the maximal index.
To compute these indices, Lai (1987) and Brezzi and Lai (2002) provide a closed form solution, relying on a diffusion approximation that computes the Gittins index for a Wiener process. They use a change of variables to set up a Brownian motion problem with an optimal stopping boundary.

The closed form approximation to the optimal stopping boundary in the finite horizon case takes the form:

³⁴ In the drug choice setting, these are the mean and variance of drug j's outcome function, which the physician updates given experience.
\[
h(s) = \begin{cases}
\left\{ 2\log(s^{-1}) - \log\log(s^{-1}) - \log(16\pi) + .99\exp(-.038\,s^{-1/2}) \right\}^{1/2} & \text{if } 0 < s \le .01 \\
-1.58\sqrt{s} + 1.53 + 0.07\,s^{-1/2} & \text{if } .01 < s < .28 \\
-.576\,s^{3/2} + .299\,s^{1/2} + .403\,s^{-1/2} & \text{if } .28 < s \le .86 \\
s^{-1}(1-s)^{1/2}\left\{ .639 - .403\,(s^{-1} - 1) \right\} & \text{if } .86 < s \le 1
\end{cases}
\]

where s = t/T, with T the finite horizon and t the current point in the sequence to T.
In the infinite-horizon discounted case, Chang and Lai (1987) replace s with (1 − β)^{-1} and replace the closed-form h(s) with ψ(s):

\[
\psi(s) = \begin{cases}
\sqrt{s/2} & \text{if } s \le 0.2 \\
.49 - .11\,s^{-1/2} & \text{if } 0.2 < s \le 1 \\
.63 - .26\,s^{-1/2} & \text{if } 1 < s \le 5 \\
.77 - .58\,s^{-1/2} & \text{if } 5 < s \le 15 \\
\left\{ 2\log(s) - \log\log(s) - \log(16\pi) \right\}^{1/2} & \text{if } s > 15
\end{cases}
\]
See the references cited for more details on these approximations.
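The infinite-horizon boundary can be transcribed directly; the sketch below follows the piecewise form tabulated above, not a re-derivation.

```python
import math

def psi(s):
    # Piecewise closed-form approximation to the optimal stopping boundary
    # (Chang and Lai, 1987), as tabulated above; defined for s > 0.
    if s <= 0.2:
        return math.sqrt(s / 2.0)
    if s <= 1.0:
        return 0.49 - 0.11 / math.sqrt(s)
    if s <= 5.0:
        return 0.63 - 0.26 / math.sqrt(s)
    if s <= 15.0:
        return 0.77 - 0.58 / math.sqrt(s)
    return math.sqrt(2.0 * math.log(s)
                     - math.log(math.log(s))
                     - math.log(16.0 * math.pi))
```

Adjacent branches nearly agree at the interior cutpoints (for example, ψ(1) = 0.38 from the second branch versus 0.37 from the third), which is the sense in which the pieces approximate a single smooth boundary.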
9.4 Section D: Metropolis-Hastings Algorithm for Draws on βi
The goal is to approximate draws from the posterior distribution of β_i conditional on the observed first-period choice, d_i1:

\[ k(\beta \mid d_{i1}, X_i, Z_i, b, W, \alpha) = \frac{ P(d_{i1} \mid X_i, Z_i, \beta, \alpha)\, g(\beta \mid b, W) }{ \int P(d_{i1} \mid X_i, Z_i, \beta, \alpha)\, g(\beta \mid b, W)\, d\beta } \]
Algorithm, via Train (2003):
1. Start with a value β_i^0; I set it equal to the population-level mean, b, for all individuals.

2. Draw K independent values from a standard normal distribution and save them in η¹, where K is the dimension of β_i.

3. Create a trial value β̃_i^1 = β_i^0 + ρ·chol(W)·η¹, where W is the population-level variance and chol(x) denotes the Cholesky decomposition of matrix x. Note that I use Ŵ, as estimated in the initial stage, in place of W. ρ is a constant determined iteratively (see note below).
4. Draw a standard uniform, u¹.

5. Calculate the ratio of logit probabilities evaluated at β̃_i^1 and β_i^0, multiplied by the ratio of the normal probability density functions at these two vectors:

\[ F = \frac{ L(d_{i1} \mid \tilde{\beta}_i^1, \cdot)\, \phi(\tilde{\beta}_i^1 \mid b, W) }{ L(d_{i1} \mid \beta_i^0, \cdot)\, \phi(\beta_i^0 \mid b, W) } \]

6. If u¹ ≤ F, accept the trial and set β_i^1 = β̃_i^1; if not, reject it and set β_i^1 = β_i^0.

7. Repeat. For sufficiently large t, β_i^t approximates a draw from the conditional posterior distribution of interest.
Notes:
• I set ρ = .1 to begin, and update it iteratively. If the acceptance rate across individuals is
less than 30%, I shift ρ down by .001; if the acceptance rate is greater than 30%, I shift ρ up
by .001.
• Beginning with the (b, W) estimates from the mixed logit, I "burn-in" 500 draws and then collect 200 draws by retaining only every 100th of 20,000 subsequent draws, to reduce the serial correlation across draws.
9.5 Section E: Procedure for Second Stage Standard Errors
I design a procedure to approximate the 2-step maximum likelihood asymptotic variance-covariance
matrix derived in Newey (1984) and Murphy and Topel (2002). The two-step estimator satisfies
the following conditions on the log-likelihood of the outcomes, Y :
\[ \sum_i \frac{\partial L_1(Y_1; \theta_1)}{\partial \theta_1} = 0, \qquad \sum_i \frac{\partial L_2(Y_2; \theta_1, \theta_2)}{\partial \theta_2} = 0 \]
In cases where θ̂_1 is a consistent estimate of θ_1, with true parameters (θ_1^*, θ_2^*), the second stage maximization of (1/n)∑ L_2(Y_2; θ̂_1, θ_2) with respect to θ_2 is asymptotically equivalent to the maximization of (1/n)∑ L_2(Y_2; θ_1^*, θ_2). Using expansions around (θ_1^*, θ_2^*) and relying on the Central Limit Theorem to find the joint distribution of the partial derivatives of the likelihood functions, one can derive the variance of the limiting normal distribution for θ̂_2. That variance is a complex function of the partial derivatives of the likelihoods, and these derivatives are difficult to compute in my setting. I therefore approximate the solution using a bootstrap resampling procedure (see Efron and Tibshirani (1993)). Simulation here is particularly straightforward because I have in storage draws from the posterior distribution of the mixed logit parameters, a consequence of my Bayesian MCMC procedure from the initial stage. I carry out the following steps:
1. Save the estimated hyperparameters of the β_i normal distribution in the matrix θ, i.e., θ = (E(β_i), var(β_i)). From the Bayesian implementation of the mixed logit estimation, I have S draws on θ saved, along with S draws on the fixed coefficient vector, α. I set S = 100.

2. For each of the ss = 1, ..., S vectors of parameter draws on (θ, α), take a draw on β | d_i1, X_i, Z_i, b, W, α from its conditional posterior distribution using the Metropolis-Hastings approach described in Section D.

3. With (β_i, α) for draw ss on θ, maximize the second stage likelihood for nn = 1, ..., N_sample, using a population of size 3,000 drawn from the full sample of individuals via a multinomial distribution. I choose N_sample = 20.

4. Save the vector of model variances that the optimization finds for each of ss = 1, ..., S and nn = 1, ..., N_sample. The vector of model variances accounts for sampling variation in the initial stage mean parameters and in β_i around its mean. The standard deviation across these points approximates the standard error of the second stage model estimates.
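In pseudocode form, the resampling loop reduces to the sketch below; fit_stage2 is a placeholder for the second stage maximum likelihood routine, and the names are illustrative rather than the paper's actual code.

```python
import numpy as np

def second_stage_se(theta_draws, fit_stage2, n_resample=20, seed=0):
    # For each first-stage posterior draw (step 2), re-fit the second stage
    # on resampled populations (step 3) and take the spread of the resulting
    # estimates (step 4) as the bootstrap standard error.
    rng = np.random.default_rng(seed)
    estimates = [fit_stage2(theta, rng)
                 for theta in theta_draws          # S first-stage draws
                 for _ in range(n_resample)]       # N_sample refits per draw
    estimates = np.asarray(estimates)
    return estimates.std(axis=0, ddof=1)
```

The spread combines both sources of noise: variation across theta_draws captures first-stage sampling error, while variation across refits captures second-stage error.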
9.6 Section F: Calculating the Dollar Value of Symptomatic Relief
To assign a dollar value to the symptomatic relief provided by prescription drug treatments, I carry
out the following procedure.
First, I save the predicted sequence of treatments for 10,000 patients from the simulation of the
six counterfactual policies along with the baseline case. I divide the sample of patients in each
counterfactual into 5 categories based on the treatments used within their episode: (1) no drug
care; (2) use of a TCA for one month or less, followed by exit from drug care; (3) use of a TCA
for greater than one month; (4) use of any non-TCA drug class (SSRI/SNRI/NDRI/NaSSA/SM)
for one month or less; and, (5) use of any non-TCA drug class for greater than 1 month. I choose
these divisions to match a subset of the treatment categories studied by Berndt et al. (2002). They
employ a panel of medical experts to rate the likelihood of both full and partial recovery given a patient's background and a treatment regime and duration. For example, in Table 1 of Berndt et al. (2002), treatment under an SSRI with 1-3 office visits produces a probability of full remission of .20 and a probability of partial remission of .45. If the SSRI treatment lasts for more than 30 days, those probabilities increase to .28 and .60, respectively.
Given the probabilities of full and partial remission for each of the 5 categories above, I calculate
the expected number of weeks in the first year of treatment that an individual suffers either the full
symptoms of depression or partial symptoms. I follow a similar procedure to that of Cutler (2004).
Following the medical literature, I set the recovery rate at 0% in the first two weeks of treatment,
before an antidepressant takes full effect. In every week for the next year, I let the rates of full and
partial recovery for those who have yet to recover be the same as the median rate predicted by the
expert panel in Berndt et al. (2002) for the category of treatment. I then use these probabilities
to calculate the expected number of weeks without full recovery. In the case without drug care, the patient suffers from depression for 6.4 weeks, with partial recovery in an additional 13.8 weeks of the year. The expected weeks of depression and of partial depression under drug treatment differ by the type and duration of care: with less than one month of TCAs or SSRIs, the expected number of weeks equals approximately 4.9 and 12.6; for longer durations on TCAs or SSRIs, the expected number of weeks shrinks to 3.5 and 7.1.
Given the expected duration of complete or partial depressive symptoms, I multiply the savings in "depression weeks" by the quality disutility from depression, estimated in Lave et al. (1998) to equal −.41 for full depression. That is, patients equate 10 years living with depression to roughly 6 years living without depression. As in Cutler (2004), I set the disutility from partial depression at half the full depression rate. Multiplying this savings in utility by $100,000 as the value of a year of life, I find the per patient dollar value in each treatment category. I multiply this value by the share of patients in each category in a particular counterfactual, summing across categories to get a total dollar savings. Dividing by 10,000 yields the average patient's savings under the counterfactual policy from decreased depressive symptoms. I report these results in Table 11. Note that the distribution of treatment types promoted by a counterfactual policy drives the observed differences in the utility gain. When a policy increases adherence and promotes use of SSRIs, for example, patients shift toward treatment regimens with higher recovery probabilities; the effect is to increase the expected number of weeks with symptomatic relief in the counterfactual relative to the baseline, making the benefit from care under the policy higher.
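The core arithmetic of this procedure can be summarized as follows, using the expected-weeks figures quoted above. The category labels are illustrative, and the exact figures in Table 11 additionally average over the simulated patient mix, so this is a sketch of the conversion rather than a reproduction of the reported values.

```python
# Expected weeks of full and partial depressive symptoms in the first year,
# from the figures quoted in the text above.
WEEKS = {
    "no drug care": (6.4, 13.8),
    "short duration": (4.9, 12.6),   # < 1 month on TCAs or SSRIs
    "long duration": (3.5, 7.1),     # > 1 month on TCAs or SSRIs
}
DISUTILITY_FULL = 0.41               # Lave et al. (1998)
DISUTILITY_PARTIAL = DISUTILITY_FULL / 2
VALUE_LIFE_YEAR = 100_000.0          # dollar value of a year of life

def relief_value(category, baseline="no drug care"):
    # Dollar value of symptom-weeks avoided relative to the baseline:
    # (weeks saved) x (disutility per week) x ($ value of a life-year / 52).
    f0, p0 = WEEKS[baseline]
    f1, p1 = WEEKS[category]
    utility_saved = ((f0 - f1) * DISUTILITY_FULL
                     + (p0 - p1) * DISUTILITY_PARTIAL)
    return utility_saved * VALUE_LIFE_YEAR / 52.0
```

For example, relative to no drug care, a long-duration TCA/SSRI regimen saves 2.9 weeks of full and 6.7 weeks of partial symptoms, worth roughly $4,900 per patient at these rates.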
Figure 1: Patient Copayments by Product and Year. Patient copay per 30-day supply (in $), by product name (generic name if off patent), for 2003, 2004, and 2005; branded and generic products are distinguished. Error bars show the standard deviation in copayments across insurance plans in the sample.

Figure 2: Nausea and Insomnia Side Effect Reports from Clinical Trials. Percentage of trial participants reporting each side effect, by drug, for a selected group of subclasses (TCA, NDRI, SSRI, SNRI).
Figure 3: Change in Drug Shares under Capitation, for selected antidepressants. Drug share (in %) in non-capitated versus capitated plans, by molecule name (brand name; * if on patent): non-drug care, Fluoxetine (Prozac), Escitalopram (Lexapro)*, Sertraline (Zoloft)*, and Venlafaxine (Effexor-XR)*.

Figure 4: Share of Patients on Drug Care over Six Decision Points. Share of patients on drug care (in %) at decision points t = 2 through t = 6 within a patient's treatment episode. Bars compare the observed data against predictions from the dynamic model. The static model predicts a share of 55% at all points (shown as a line in the figure).
Table 1: Summary statistics on variables entering the estimation algorithm

Panel 1: Drug Characteristics

                                                 Monthly Insurer Cost ($)³   Monthly Drug Copayment ($)³   Market Share (%)⁴
Product Name       Subclass  Brand?  Daily dosing  2003   2004   2005         2003   2004   2005            2003   2004   2005
None               None      -       -             -      -      -            -      -      -               35.2   39.3   34.2
Citalopram HBr     SSRI      No      1             -      36.0   19.9         -      7.2    8.7             -      0.2    4.2
Celexa             SSRI      Yes     1             72.9   74.0   75.3         22.6   31.7   35.4            4.1    2.7    0.1
Lexapro            SSRI      Yes     1             63.2   63.3   68.9         22.7   24.8   28.8            13.8   13.0   12.0
Fluoxetine HCL     SSRI      No      1-2           30.9   17.7   17.0         9.9    6.9    7.4             7.3    11.5   10.6
Paroxetine HCL     SSRI      No      1             68.7   49.3   37.0         11.0   8.3    8.8             1.7    4.3    4.4
Paxil CR           SSRI      Yes     1             81.4   83.9   87.5         21.8   21.0   25.1            6.4    3.6    1.6
Zoloft             SSRI      Yes     1             79.9   81.9   85.0         20.4   25.6   28.2            12.5   10.6   9.6
Cymbalta           SNRI      Yes     1-2           -      103.8  111.8        -      27.8   34.6            -      0.5    2.4
Effexor-XR         SNRI      Yes     1             109.5  116.0  118.7        21.6   22.4   26.1            8.6    7.5    7.0
Bupropion HCL      NDRI      No      3             38.6   49.9   50.7         8.5    9.0    9.8             0.3    3.4    4.5
Wellbutrin XL      NDRI      Yes     1             104.8  104.1  113.5        21.2   21.2   26.8            6.5    5.7    5.1
Amitriptyline HCL  TCA       No      1             5.9    4.2    4.0          6.9    5.3    5.3             0.7    0.9    0.7
Mirtazapine        NaSSA     No      1             60.0   42.4   31.9         11.2   8.8    8.0             0.5    0.7    0.6
Trazodone HCL      SM        No      3             8.0    6.1    5.8          7.2    5.8    6.0             1.3    1.8    1.6

Total patient episodes (2003; 2004; 2005): 11,341; 41,285; 50,154
Total # unique patients: 102,780
Total # of observations: 267,390

Panel 2: Individual covariates in product choice model (%)

Patient diagnosed with major depressive disorder: 27.2
Patient's plan is a capitated Health Maintenance Organization (HMO): 33.2
Patient's plan is a non-capitated HMO: 13.8
Patient's plan is a Preferred Provider Organization (PPO): 53.0
Patient visits a general practitioner (internal medicine/family medicine): 46.7
Patient visits non-psychiatric specialist (obstetrician, etc.): 27.9
Patient visits psychiatrist: 25.5

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, years 2003-2005.
2. Sample collected for individuals newly diagnosed with depression in an outpatient office visit; the data include information on subsequent office visits and prescriptions dispensed over the patient's episode. Pregnant women and individuals with psychiatric comorbidities are excluded from the sample, as are patients whose episodes begin within the first 6 months of data (to prevent left-censoring).
3. Values reflect the mean across the 307 plans in the data. Each plan sets distinct copayments and records different insurer costs by drug.
4. Shares calculated using only the initial prescription written for each patient episode. Selected antidepressants shown for illustration.
Table 2: Observed continuation decisions (switch vs. persist)

                                     Fraction of decisions (%)
Time of      Drug Class       Persist  Switch        Switch to      Switch to        Total
decision (t) chosen at (t-1)           within class  another class  outside option   Obs.
t=2          SSRI             68.9     3.5           2.6            25.0             19,663
             NDRI             64.7     3.0           7.8            24.5             3,974
             SNRI             73.8     1.5           4.5            20.1             4,166
             TCA              57.9     0.2           12.7           29.2             497
             NaSSA            52.6     -             14.3           33.1             293
t=3          SSRI             74.3     2.5           2.6            20.6             13,289
             NDRI             69.6     2.3           5.9            22.2             2,668
             SNRI             77.2     0.7           3.8            18.3             3,081
             TCA              57.3     -             14.6           28.1             335
             NaSSA            67.7     -             13.4           18.9             164

Notes:
1. Observed switches recorded for the 50,000 patient sample drawn from Medstat's Marketscan Database for use in estimation. Roughly 35% of patients receive no drug care at their initial visit and so do not appear in the above switch statistics.
2. Selected antidepressant classes shown. The number of individual drugs modeled in each subclass are: SSRI, 8; NDRI, 2; SNRI, 2; TCA, 2; NaSSA, 1. Thus, no patients switch 'within class' when starting on a NaSSA.
Table 3: Percentage of patients remaining on the initial drug choice over the course of an episode

Statistics conditioned on the length of the treatment episode (measured in # of prescriptions dispensed overall)

Prescription count        Length of Treatment Episode (# monthly prescriptions dispensed)
within episode       1      2      3      4      5      6      7      8      9      10     11     12
 1                 100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
 2                         85.1   86.8   88.9   89.4   89.6   90.4   91.4   91.9   92.1   91.6   91.6
 3                                84.4   84.2   84.7   85.8   86.4   88.2   88.7   88.9   88.8   88.3
 4                                       82.3   82.0   83.0   83.9   86.4   86.4   86.9   87.7   85.5
 5                                              80.2   81.0   82.2   84.4   84.7   85.4   85.4   85.0
 6                                                     79.6   80.1   82.6   83.5   83.9   84.0   84.0
 7                                                            79.3   80.7   81.9   83.5   82.7   84.0
 8                                                                   79.9   81.2   81.7   80.9   82.4
 9                                                                          79.3   79.7   79.6   82.2
10                                                                                 79.2   79.3   81.4
11                                                                                        77.6   79.4
12                                                                                               79.6

Share of dataset:  56.7%  12.9%   9.3%   6.4%   4.4%   3.1%   2.1%   1.7%   1.3%   1.0%   0.7%   0.4%

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, years 2003-2005.
2. Sample contains 102,720 unique patients, with 267,390 observed prescriptions or office visits across patient episodes.
Table 4: Mixed logit estimates based on first choice data

                                                       (1)                       (2)
Type      Variable                              Est.    Std. Err  T-stat   Est.    Std. Err  T-stat
RC, Mean  1{TCA}                               -0.43    0.03      13.00   -0.28    0.02      11.96
RC, Mean  1{NDRI}                               0.95    0.28       3.34   -0.18    0.46       0.38
RC, Mean  1{SSRI}                               1.11    0.12       9.54    1.02    0.09      11.47
RC, Mean  1{SNRI}                               0.09    0.09       1.00   -0.12    0.04       3.36
RC, Mean  1{NaSSA}                              0.10    0.04       2.34   -0.03    0.05       0.55
RC, Mean  1{Inside good}                        3.61    0.48       7.51    6.86    0.66      10.32
RC, Mean  1{Dosing >1 pill per day}            -0.52    0.04      14.68   -0.40    0.04       9.77
RC, Mean  nausea reports >60th pctile         -14.17    2.09       6.77   -8.16    0.77      10.59
RC, Mean  log(insurer cost)                    -0.44    0.01      59.71   -0.23    0.01      26.25
RC, Mean  log(patient copayment)               -0.38    0.01      27.92   -0.34    0.01      49.26
RC, Var.  log(insurer cost)                     0.02    0.00      14.60    0.02    0.00      10.34
RC, Var.  log(patient copayment)                0.08    0.02       3.77    0.04    0.01       6.40
Fixed     1{branded}                            1.97    0.19      10.58    2.34    0.07      32.35
Fixed     log(insurer cost) × 1{Capitated}     -0.25    0.01      17.28   (reported in a single specification)

Product-specific fixed effects included?        Yes                        Yes
Interactions of diagnosis with subclass?        Yes                        Yes
Interactions of specialty with subclass?        No                         Yes

Notes:
1. Source - Medstat Marketscan Commercial Claims and Encounters Database, 2003-2005.
2. Estimation based on a random sample of 50,000 patients from the full dataset.
3. Estimation of the hierarchical model carried out using Markov Chain Monte Carlo methods.
4. 'RC, Mean' denotes the mean of a random coefficient; 'RC, Var.' denotes its variance.
Table 5: Elasticities from the mixed logit model

Product Name       Class   Brand?  On Patent?  Market Share (%)  Avg. copay ($/month supply)  Elasticity
Prozac             SSRI    Yes     No          0.38              34.26                        -0.90
Celexa             SSRI    Yes     No          1.55              32.77                        -0.63
Mirtazapine        NaSSA   No      -           0.62              8.89                         -0.51
Amitriptyline HCL  TCA     No      -           0.77              5.64                         -0.48
Nortriptyline HCL  TCA     No      -           0.34              6.82                         -0.47
Citalopram HBr     SSRI    No      -           2.19              4.96                         -0.45
Bupropion HCL      NDRI    No      -           3.39              10.09                        -0.44
Trazodone HCL      SM      No      -           1.51              6.38                         -0.44
Cymbalta           SNRI    Yes     Yes         1.30              21.62                        -0.42
Zoloft             SSRI    Yes     Yes         10.14             27.28                        -0.38
Fluoxetine HCL     SSRI    No      -           10.23             8.32                         -0.38
Paxil              SSRI    Yes     No          2.85              22.30                        -0.37
Paroxetine HCL     SSRI    No      -           3.96              9.21                         -0.36
Lexapro            SSRI    Yes     Yes         12.20             27.87                        -0.36
Wellbutrin XL      NDRI    Yes     Yes         5.45              23.83                        -0.32
Effexor-XR         SNRI    Yes     Yes         7.27              23.12                        -0.24

Average (weighted by share):                                                                  -0.37

Notes:
1. Elasticities and shares calculated using mixed logit specification (1) reported in Table 4.
2. Selected antidepressants in the choice set shown for illustration.
Table 6: Comparison of estimated drug fixed effects with advertising expenditures, years 2001-2003

                                                           Avg. monthly detailing     Avg. monthly volume of samples
                                                           expenditure¹ ($000s)       provided to physicians² (000s)
Subclass  Molecule                    Fixed     Std.       Yr 2003   Average,         Yr 2003   Average,
                                      Effect³   Error                2001-2003                  2001-2003
SSRI      Fluoxetine HCL (Prozac)      3.02     0.10       480       1,568            64        283
SNRI      Venlafaxine HCL (Effexor)    2.29     0.09       7,787     6,672            1,206     858
SSRI      Escitalopram (Lexapro)       2.25     0.26       16,506    5,254            1,055     324
SSRI      Citalopram HBr (Celexa)      2.19     0.10       593       7,375            39        521
SSRI      Sertraline HCL (Zoloft)      2.16     0.26       8,754     7,307            1,010     959
SSRI      Paroxetine HCL (Paxil)       1.80     0.19       9,220     9,071            938       1,073
SNRI      Duloxetine HCL (Cymbalta)    1.03     0.10       0         0                0         0
SM        Trazodone HCL (Desyrel)      0.92     0.07       0         0                0         0
TCA       Nortriptyline HCL (Pamelor) -0.46     0.10       0         0                0         0
NDRI      Bupropion HCL (Wellbutrin)  -1.21     0.22       4,293     3,591            508       425

Notes:
1. Detailing expenditure reflects professional promotion recorded in Verispan's Personal Selling Audit/Hospital Personal Selling Audit.
2. Sampling volume reflects the package volume of a product provided as samples to office-based physicians through various delivery methods (IMS Health).
3. Fixed effects estimated in first stage mixed logit model, specification (1).
4. Cymbalta entered the antidepressant market in 2003, after the period of the advertising data available for analysis.
Table 7: Model Estimates

Coefficients on Outcome Characteristics¹

Variable                              Estimate   Std. Error
Means
1{TCA}                                 -0.43      0.03
1{NDRI}                                 0.95      0.28
1{SSRI}                                 1.11      0.12
1{SNRI}                                 0.09      0.09
1{NaSSA}                                0.10      0.04
1{Inside good}                          3.61      0.48
1{Dosing >1 pill per day}              -0.52      0.04
1{nausea reports >60th pctile}        -14.17      2.09
Variances³
1{TCA}                                 43.8       7.7
1{NDRI}                                98.1       6.1
1{SSRI}                                 9.1       5.2
1{SNRI}                                 0.0       0.0
1{NaSSA}                               59.3       13.1
1{Inside good}                        102.8       9.2
1{Dosing >1 pill per day}              16.0       8.8
1{nausea reports >60th pctile}         15.2       8.4
Idiosyncratic variance in outcomes    142.3       21.4

Fixed Characteristics²

Variable                   Estimate   Std. Error
log(insurer cost)           -0.44      0.01
log(patient copayment)      -0.38      0.01
1{branded}                   1.97      0.19
1{Wellbutrin XL}            -1.21      0.22
1{Citalopram HBr}            2.19      0.10
1{Celexa}                   -0.02      0.27
1{Cymbalta}                  1.03      0.10
1{Lexapro}                   2.25      0.26
1{Fluoxetine HCL}            3.02      0.10
1{Prozac}                   -1.32      0.28
1{Nortriptyline HCL}        -0.46      0.10
1{Paroxetine HCL}            1.80      0.19
1{Zoloft}                    2.16      0.26
1{Trazodone HCL}             0.92      0.07
1{Effexor-XR}                2.29      0.09

Notes:
1. The means of the coefficients in the parameterization of outcomes are estimated in the initial stage model. The variances (lower panel) come from the subsequent maximum likelihood routine.
2. The fixed coefficients are estimated in the initial stage model. The model also includes interactions between patient diagnoses and subclass indicators; these coefficients are suppressed above for clarity.
3. Standard errors for the second stage parameters estimated via a bootstrap resampling procedure described in the Technical Appendix (Section E).
Table 8: Comparison of predicted transition probabilities with observed transitions, at the first decision point

Panel 1: Observed
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI          67.7    5.4    1.7
SSRI           1.0   72.4    0.8
SNRI           1.1    2.9   75.3

Panel 2: Dynamic Model
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI          47.5   39.1    2.3
SSRI           8.3   72.2    8.2
SNRI           2.9   26.2   64.0

Panel 3: Static Model
Choice at     Choice at Decision 2
Decision 1    NDRI   SSRI   SNRI
NDRI           7.4   37.5    7.1
SSRI           7.4   37.5    7.1
SNRI           7.4   37.5    7.1

Notes:
1. Markov transition table shown includes only selected antidepressant classes.
2. Observed and predicted transitions calculated for the sample of patients pursuing active treatment for at least 2 periods.
3. 'Static' model does not condition on past choices.
Table 9: Model fit - Matching predicted subclass and product choices to observed selections by patient

                                              For All Products         For Inside Goods Only
Examination Statistic                         t=2     t=3     t=4      t=2     t=3     t=4
% of patients for whom the observed choice equals one of the top 3 choices
  if 'top 3' ranked by model                  64.2    66.4    65.3     85.0    80.9    74.3
  if 'top 3' ranked randomly by share         33.5    32.4    31.8     30.7    30.6    30.5
% of patients for whom the observed choice equals one of the top 5 choices
  if 'top 5' ranked by the model              80.3    86.2    85.1     91.0    89.1    85.9
  if 'top 5' ranked randomly by share         46.6    45.3    44.7     44.7    44.6    44.4

Total patient observations:                   29,358  19,943  13,993   22,164  15,819  11,296

Notes:
1. Fit statistics rely on estimated parameters from steps 1-2 of the estimation procedure. The statistics record the percentage of predicted choice sequences that match the observed sequence in a given period, according to the criterion of the statistic. 't' denotes the period of the decision. After period 1, the predictions for periods t=2,3,4 condition on the observed choices from s=1,...,t-1. In this way, the predictions represent only 1 period ahead forecasts based on the prior periods' observed experience.
2. "Random" ranking probabilities calculated using the hypergeometric distribution (sampling without replacement) based on observed market shares.
Table 10: Counterfactual prescription drug costs to the insurer and adherence

                                                                Decision Point
Counterfactual Scenario                                         1         2         3
Baseline
  Total monthly drug costs¹ ($)                                 429,040   430,567   254,884
  Costs per patient² ($/day)                                    1.43      1.44      0.85
  % of patients adhering to treatment                           100.0%    87.4%     56.8%

Effects of Copayment Policies
All copayments set to $0
  Change in % adherence                                         0.0%      2.3%      5.5%
  % Change in monthly drug costs                                49.2%     49.2%     60.2%
    due to change in adherence                                  3.7%      3.7%      4.4%
    due to shifts in drugs chosen                               45.5%     45.4%     55.8%
Generic copays set to $0; branded copayments set to 95th pctile ($50)
  Change in % adherence                                         0.0%      1.5%      3.4%
  % Change in monthly drug costs                               -18.7%    -18.8%    -27.0%
    due to change in adherence                                  1.3%      1.3%      1.1%
    due to shifts in drugs chosen                              -19.9%    -20.0%    -28.1%
Copay for preferred SSRI/SNRI = $0, off-patent preferred SSRI/SNRI = $15, all others = $60
  Change in % adherence                                         0.0%      1.5%      4.4%
  % Change in monthly drug costs                               -0.3%     -0.3%      3.0%
    due to change in adherence                                  0.1%      0.1%      0.3%
    due to shifts in drugs chosen                              -0.4%     -0.4%      2.7%

Effects of Promotional Policies
Ban branded advertising (eliminate branded "bonus")
  Change in % adherence                                         0.0%     -0.6%     -1.4%
  % Change in monthly drug costs                               -30.0%    -30.2%    -36.9%
    due to change in adherence                                 -0.6%     -0.6%     -0.6%
    due to shifts in drugs chosen                              -29.4%    -29.4%    -36.4%
Advertise heavily for Zoloft
  Change in % adherence                                         0.0%      2.1%      5.8%
  % Change in monthly drug costs                                13.0%     12.9%     27.5%
    due to change in adherence                                  2.5%      2.5%      4.9%
    due to shifts in drugs chosen                               10.5%     10.4%     22.6%
Advertise heavily for Wellbutrin
  Change in % adherence                                         0.0%      0.8%      1.6%
  % Change in monthly drug costs                                19.3%     19.2%     12.7%
    due to change in adherence                                  1.3%      1.3%      0.3%
    due to shifts in drugs chosen                               17.9%     17.9%     12.4%

Notes:
1. Total monthly drug costs equal the recorded insurer cost per 30 day supply less the patient's drug copay per 30 day supply, for the set of patients on active treatment.
2. Costs per patient reflect monthly dollar costs divided by the total number of patients the insurer serves, whether on active treatment or not.
3. Values calculated for 10,000 sample patients at individual-specific prices.
Table 11: Patient welfare vs. drug costs, under counterfactual scenarios

Baseline
  Drug costs per patient, initial month ($):               42.90
  Value of utility gain from symptomatic recovery ($):     366.13

Effects of Copayment Policies
                                             Change in monthly drug    Change in symptomatic
                                             costs per patient ($)     recovery ($)
All copayments set to $0                     21.10                      8.36
Tiered pricing (branded vs. generic)         (8.00)                    (5.64)
Value-based design (SSRI/SNRI preferred)     (0.13)                    11.13

Effects of Promotional Policies
Ban branded advertising                      (12.89)                   (17.05)
Advertise heavily for Zoloft                 5.57                      13.68
Advertise heavily for Wellbutrin             8.27                      7.48

Notes:
1. Per patient monthly drug costs equal the average recorded insurer cost per 30 day supply less the patient's drug copay, for the set of patients on active treatment.
2. The dollar value of the utility gain from symptomatic relief is calculated via the procedure in the Technical Appendix (Section F). The change in symptomatic effect by policy stems from the policy's effect on both the initiation of treatment and adherence. This changes the distribution of treatments used and therefore the probability of symptomatic relief over time.
3. Values calculated for 10,000 sample patients at individual-specific prices.
Table 12: Treatment recommendations from published protocols vs. the dynamic model's recommendations

Recommendation sources: (1) Texas Medication Algorithm Project, Report for Major Depression (1998); (2) APA Guidelines, Second Edition (2000, 2005); (3) Comparative effectiveness review, AHRQ (2007); (4) Dynamic Model Predictions. X's within a row indicate the protocol recommends the drug molecule.

                                    % Patients quitting medication, model¹
Molecule                   Class    Decision Point 2   Decision Point 3   Recommendations
Celexa (Citalopram)        SSRI     20.3               33.0               X X X
Lexapro (Escitalopram)     SSRI     21.5               34.5               X X X
Paxil (Paroxetine)         SSRI     15.1               23.7               X X X X
Prozac (Fluoxetine)        SSRI     19.9               32.5               X X X X
Zoloft (Sertraline)        SSRI     21.4               34.4               X X X
Cymbalta (Duloxetine)      SNRI     16.3               25.0               X X X
Effexor (Venlafaxine)      SNRI     16.3               24.9               X X X X
Wellbutrin (Bupropion)     NDRI     21.8               35.1               X X X
Nortriptyline              TCA      24.9               38.5               X
Mirtazapine                NaSSA    23.7               37.1               X
Nefazodone                 SM       17.3               26.1               X X
Trazodone                  SM       25.8               39.5

Notes:
1. Percentage of patients who quit their medications at decision points 2 and 3 predicted using the experience of 10,000 sample patients applied against the dynamic model estimates. The recommendations built from adherence patterns differ very little when using observed copayments or when setting all copayments equal to $0.
2. Sources:
(1) Crismon, M. L., et al. (1999). The Texas medication algorithm project: Report of the Texas consensus conference panel on medication treatment of major depressive disorder. J Clin Psychiatry, 60, 142-156.
(2) Karasu, T. B., et al. (2000). Practice guideline for the treatment of patients with major depressive disorder: Second edition. Arlington, VA: American Psychiatric Association. Also, Fochtmann, L. J., & Gelenberg, A. J. (2005). Guideline watch: Practice guideline for the treatment of patients with major depressive disorder, 2nd edition. Arlington, VA: American Psychiatric Association.
(3) Gartlehner, G., et al. (2007). Comparative effectiveness of second-generation antidepressants in the pharmacologic treatment of adult depression. Agency for Healthcare Research and Quality, US Dept of Health and Human Services.