
Recent Advances in Outcome Weighted Learning for Precision Medicine

Michael R. Kosorok

Department of Biostatistics
Department of Statistics and Operations Research
University of North Carolina at Chapel Hill

September 15, 2017

Acknowledgments

- Many co-authors and collaborators helped with this work.

- Especially Guanhua Chen (UW-Madison), Yifan Cui (PhD student in Statistics and Operations Research at UNC-Chapel Hill), Anna Kahkoska (MD-PhD student at UNC-Chapel Hill), Eric Laber (NCSU), Daniel Luckett (PhD student in Biostatistics at UNC-Chapel Hill), David Maahs (Stanford), Elizabeth Mayer-Davis (UNC-Chapel Hill), Donglin Zeng (UNC-Chapel Hill), and Ruoqing Zhu (University of Illinois at Urbana-Champaign).

- This project was funded in part by US NIH grants P01 CA142538 and UL1 TR001111, and US NSF grant DMS-1407732, among others.

Michael R. Kosorok 2/ 47


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


Personalized vs. precision medicine

- Personalized medicine
  - The idea of developing targeted treatments that take into account patient heterogeneity
  - Clinicians have been doing this for centuries: not really new

- Precision medicine
  - The idea of developing targeted treatments that take into account patient heterogeneity
  - Empirically based, scientifically rigorous, reproducible, and generalizable (i.e., will work with future patients): very new

Ingredients of precision medicine

- Incorporates data on:
  - Genetic makeup, proteomics, metabolomics, metabiomics, etc.
  - Environmental factors, demographics, etc.
  - Lifestyle, phenotypic information, etc.

- Scientific tools:
  - Biomedical knowledge
  - Data (potentially integrated across many platforms)
  - Knowledge driven vs. data driven
  - Computational, mathematical, and statistical tools

Clinical focus

We want to make the best treatment decisions based on data.

- The single-decision setting:
  - A patient presents with a disease and we need to decide what treatment (or dose) to give from a list of choices.
  - We want to make the best decision (treatment regimen) based on available baseline patient-level feature data.

- The multi-decision setting:
  - Treat patients for diseases with multiple treatment decisions occurring over time.
  - Make the best decision based on historical patient-level data at each decision time (dynamic treatment regimen).
  - The best decisions take into account delayed effects.

- Real-time decision making in mHealth:
  - A large number of decisions need to be made in real time.
  - Technical and practical challenges for implementation.


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


Single decision setting

- Let X be the vector of patient tailoring variables, A the choice of treatment given, and R the clinical outcome (with larger being better).

- The "standard" approach is to first estimate the Q-function

  Q(x, a) = E[R | X = x, A = a],

  through regression of R on (X, A), and invert to obtain

  d(x) = argmax_a Q(x, a).

- Issue: This approach is indirect, since we must estimate Q(x, a) and invert.
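As an illustration of this indirect route, here is a hedged Python sketch on simulated data. The variable names, the random-forest regressor, and the toy outcome model are all assumptions for illustration, not part of the slides.

```python
# Illustrative sketch of the indirect (regression-based) approach:
# fit Q(x, a) = E[R | X = x, A = a] by regression, then invert via argmax over a.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                       # patient tailoring variables (toy)
A = rng.choice([-1, 1], size=n)                   # randomized binary treatment
R = X[:, 0] * A + rng.normal(scale=0.1, size=n)   # toy outcome: treat +1 iff X1 > 0

# Step 1: regress R on (X, A) to estimate the Q-function.
q_model = RandomForestRegressor(n_estimators=200, random_state=0)
q_model.fit(np.column_stack([X, A]), R)

# Step 2: invert -- recommend the treatment with the larger predicted outcome.
def d_hat(x):
    q_pos = q_model.predict(np.column_stack([x, np.ones(len(x))]))
    q_neg = q_model.predict(np.column_stack([x, -np.ones(len(x))]))
    return np.where(q_pos > q_neg, 1, -1)

x_new = np.array([[1.5, 0.0], [-1.5, 0.0]])
print(d_hat(x_new))   # the learned rule should follow the sign of X1
```

The two-step structure is exactly the "indirect" issue the slide raises: errors in estimating Q(x, a) propagate into the inverted rule.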


Value function and optimal individualized treatment rule

- Let P be the distribution of (X, A, R) with treatments randomized via π(A|X), and P^d the distribution of (X, A, R) with treatments chosen according to d. The value function of d (Qian & Murphy, 2011) is

  V(d) = E^d(R) = ∫ R dP^d = ∫ R (dP^d / dP) dP = E[ I(A = d(X)) R / π(A|X) ].

- Optimal Individualized Treatment Rule:

  d* ∈ argmax_d V(d).

  E(R | X, A = 1) > E(R | X, A = -1) ⇒ d*(X) = 1
  E(R | X, A = 1) < E(R | X, A = -1) ⇒ d*(X) = -1
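The change-of-measure identity above can be checked numerically. The sketch below (toy data, uniform randomization, illustrative names) estimates V(d) by the sample analogue of the right-hand side and compares two rules:

```python
# Hedged sketch: estimate V(d) = E[ I(A = d(X)) R / pi(A|X) ] by its sample mean.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.normal(size=n)
A = rng.choice([-1, 1], size=n)           # pi(A|X) = 0.5 (uniform randomization)
R = 1.0 + X * A + rng.normal(size=n)      # toy outcome; optimal rule is d*(X) = sign(X)

def value_ipw(d, X, A, R, pi=0.5):
    """Inverse-probability-weighted estimate of the value of rule d."""
    return np.mean((A == d(X)) * R / pi)

v_opt = value_ipw(np.sign, X, A, R)                          # follow d*(X) = sign(X)
v_all_pos = value_ipw(lambda x: np.ones_like(x), X, A, R)    # always assign A = 1
print(v_opt, v_all_pos)   # the optimal rule should achieve a clearly higher value
```

Only the treated-as-recommended observations contribute, reweighted by 1/π(A|X), which is what makes the estimator unbiased for V(d).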


Outcome weighted learning (OWL or O-learning)

Optimal Individualized Treatment Rule d*:

  Maximize the value                      Minimize the risk
  E[ I(A = d(X)) R / π(A|X) ]      ⟺     E[ I(A ≠ d(X)) R / π(A|X) ]

- For any rule d, d(X) = sign(f(X)) for some function f.

- Empirical approximation to the risk function:

  n^{-1} Σ_{i=1}^{n}  R_i / π(A_i|X_i) · I(A_i ≠ sign(f(X_i))).

- Computational challenges: non-convexity and discontinuity of the 0-1 loss.

Using a support vector machine (SVM) approach

Objective Function: Regularization Framework

  min_f { (1/n) Σ_{i=1}^{n}  R_i / π(A_i|X_i) · φ(A_i f(X_i)) + λ_n ‖f‖² }.   (1)

- φ(u) = (1 - u)_+ is the hinge loss surrogate, ‖f‖ is some norm for f, and λ_n controls the penalty on f.

- A linear decision rule: f(X) = X^T β + β_0, with ‖f‖ as the Euclidean norm of β.

- Estimated individualized treatment rule:

  d_n = sign(f_n(X)),

  where f_n is the solution to (1).
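One practical reading of (1) is weighted hinge-loss classification of the treatment label A with weights R_i/π(A_i|X_i). The sketch below uses scikit-learn's SVC with sample_weight under that reading; the simulated data and the use of SVC are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: solving the OWL objective (1) as a weighted linear SVM.
# Classifying A with sample weights R_i / pi(A_i|X_i) matches the hinge-loss
# surrogate above; C here plays the role of 1/lambda_n.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 2))
A = rng.choice([-1, 1], size=n)                         # pi(A|X) = 0.5
# toy positive outcome (OWL weights should be nonnegative); larger R is better
R = 2.0 + np.tanh(X[:, 0]) * A + 0.1 * rng.normal(size=n)

owl = SVC(kernel="linear", C=1.0)
owl.fit(X, A, sample_weight=R / 0.5)                    # weights R_i / pi(A_i|X_i)

d_hat = owl.predict(np.array([[2.0, 0.0], [-2.0, 0.0]]))
print(d_hat)   # the learned rule should treat according to the sign of X1
```

Replacing kernel="linear" with a Gaussian kernel gives the nonparametric rule mentioned on the next slides.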


Outcome Weighted Learning (OWL or O-Learning)

Figure 1: Zhao YQ, Zeng D, Rush AJ, and Kosorok MR (2012). Estimating individualized treatment rules using outcome weighted learning. JASA 107:1106-1118.

Results for O-Learning

- Can use the kernel trick to extend to nonparametric decision rules (e.g., the Gaussian kernel).
- Fisher consistent, asymptotically consistent, and model robust.
- Risk bounds and convergence rates similar to those observed in the SVM literature (Tsybakov, 2004).
- Excellent simulation results and data analysis of the Nefazodone-CBASP clinical trial (Keller et al., 2000).
- Promising performance overall (Y.Q. Zhao, et al., 2012).
- An example of a direct value function approach (see also B. Zhang, et al., 2012).
- Opens the door to a unique application of machine learning techniques to personalized medicine.

O-Learning Extensions

- Multiple decision times:
  - Zhao YQ, Zeng D, Laber EB, and Kosorok MR (2015). New statistical learning methods for estimating optimal dynamic treatment regimes. JASA 110:583-598.
  - Also works well, but a concern is the low fraction of data being used.

- Issue: O-learning is not invariant under constant shifts in the outcome R.
  - Zhou X, Mayer-Hamblett N, Khan U, and Kosorok MR (2017). Residual weighted learning for estimating individualized treatment rules. JASA 112:169-187.

O-Learning Extensions

- What about observational data?
  - One method that seems to work is to replace the known propensity score π(A_i|X_i) in (1) with an estimate π̂(A_i|X_i).

- More than two treatment options:
  - Ordinal treatment options (e.g., organized by dose level) (in progress)
  - Nonordinal treatments (in progress)
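A minimal sketch of the plug-in idea for observational data, assuming a logistic-regression propensity model (the confounded data-generating setup is illustrative):

```python
# Sketch: estimate pi_hat(A_i | X_i) for observational data, to replace the
# known randomization probability in objective (1).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 2))
# confounded assignment: patients with larger X1 more often receive A = 1
p_treat = 1.0 / (1.0 + np.exp(-X[:, 0]))
A = np.where(rng.uniform(size=n) < p_treat, 1, -1)

ps_model = LogisticRegression().fit(X, A)
probs = ps_model.predict_proba(X)       # columns follow ps_model.classes_ = [-1, 1]
# pi_hat(A_i | X_i): probability of the treatment actually received
pi_hat = np.where(A == 1, probs[:, 1], probs[:, 0])
print(pi_hat.mean())    # these weights then stand in for pi(A_i|X_i) in (1)
```

In practice one would guard against extreme weights (e.g., truncating very small pi_hat values) before plugging them into the OWL objective.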


O-Learning Extensions

- Censored data:
  - Zhao YQ, Zeng D, Laber EB, Song R, Yuan M, and Kosorok MR (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102:151-168.

- Two basic approaches:
  - Inverse weighting: requires correct modeling of the conditional distribution function of the censoring time given covariates.
  - Doubly robust: requires correct modeling of one of either the conditional censoring or conditional survival distribution functions (inverse weighting still used).

- Works quite well, but a concern is the need for inverse probability of censoring weighting.

O-Learning Extensions

- Continuous treatment options (e.g., dose or time):
  - Chen G, Zeng D, and Kosorok MR (2016). Personalized dose finding using outcome weighted learning (with discussion and rejoinder). Journal of the American Statistical Association 111:1509-1547.
  - V-learning for continuous time and mHealth (in progress)


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


A Tree Based Approach for Censored Data

- Cui Y, Zhu R, and Kosorok MR (In press). Tree based weighted learning for estimating individualized treatment rules with censored data. Electronic Journal of Statistics.

- Recall that (X, A, R) is the observed data in outcome weighted learning:
  - X is the vector of patient-level features.
  - A is the treatment indicator.
  - R is the positive outcome: in survival data this is the event time, which is censored.

A Tree Based Approach for Censored Data

Recall that for outcome weighted learning, the targeted ITR satisfies

  D*(X) = argmin_d E[ R 1{A ≠ d(X)} / π(A|X) ],   (2)

where we use the hinge loss surrogate to ensure convexity.

However, due to censoring, we do not observe R = T but only observe Y = T ∧ C and δ = 1{Y = T}. We will use two nonparametric estimates of T based on censored data:

- R1 = Ê(T | X, A) and
- R2 = 1{δ = 1} Y + 1{δ = 0} Ê(T | X, A, Y, T > Y),

where the Ê terms are estimated via random forests.
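The imputed outcome R2 can be assembled as below once conditional-mean estimates are in hand. The Ehat_tail line here is a toy memoryless stand-in for the random-forest estimate, purely to make the construction concrete; a real analysis would use RIST or a similar survival forest.

```python
# Illustrative construction of R2 = 1{delta=1} Y + 1{delta=0} Ehat(T | ..., T > Y)
# from right-censored data. All inputs are toy placeholders.
import numpy as np

rng = np.random.default_rng(4)
n = 10
T = rng.exponential(2.0, size=n)          # true event times (unobserved when censored)
C = rng.exponential(3.0, size=n)          # censoring times
Y = np.minimum(T, C)                      # observed follow-up: Y = T ^ C
delta = (T <= C).astype(int)              # event indicator: delta = 1{Y = T}

# toy stand-in for Ehat(T | X, A, Y, T > Y); Exp(2) is memoryless, so the
# conditional mean beyond Y is Y + 2 -- a real analysis would fit RIST here
Ehat_tail = Y + 2.0

R2 = np.where(delta == 1, Y, Ehat_tail)   # observed time if event, imputed tail mean if censored
print(R2)
```

Uncensored subjects keep their observed time; censored subjects are imputed with an estimate of their remaining life beyond Y, which avoids the inverse-probability-of-censoring weights discussed earlier.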


Recursively Imputed Survival Trees

We use recursively imputed survival trees (RIST) to estimate Ê:

- Zhu R and Kosorok MR (2013). Recursively imputed survival trees. JASA 107:331-340.

The basic ingredients of RIST:

- RIST is a random forest method (see, e.g., Breiman, 2001, Machine Learning) for right censored data (slightly modified).
- RIST uses a form of Monte Carlo EM (see, e.g., Wei and Tanner, 1990, JASA) to avoid inverse probability of censoring weighting (IPCW) as done in Hothorn et al (2006, Biostatistics), or single-step hazard estimation in terminal nodes (see, e.g., Ishwaran et al, 2008, Annals of Applied Statistics).

Theoretical Results

Theorem. Under reasonable regularity conditions, there exist sequences r_n, w_n > 0 converging to zero such that for each covariate value x and treatment choice a,

  pr{ |Ê(T | x, a) − E(T | x, a)| ≤ r_n } ≥ 1 − w_n

and

  pr{ |Ê(T | x, a, T > Y, Y) − E(T | x, a, T > Y, Y)| ≤ r_n } ≥ 1 − 2 w_n.

Theoretical Results

Let d*(x) = sign(f*(x)) be the decision function which optimizes the value function V(d) over all d, and let d̂_n(x) = sign(f̂_n(x)) be the estimated decision function.

Theorem. Under reasonable regularity conditions and assuming that the sequence λ_n > 0 satisfies λ_n → 0 and λ_n ln n → ∞, there exist sequences ε_n, w_n > 0 converging to zero such that

  pr{ V(f*) ≤ V(f̂_n) + ε_n } ≥ 1 − w_n.

Simulation Results

We considered several simulation scenarios, but present two here:

- Scenario 3: Failure time is non-linear in 5 covariates while censoring time follows a Cox model.
- Scenario 4: Failure time is AFT with 10 covariates while censoring time follows a Cox model.

Other features of both scenarios:

- 45% censoring rate with n = 200 sample size
- 500 replications and predictions based on log failure time
- Compare six methods: (1) oracle, (2) RIST-R1, (3) RIST-R2, (4) ICO, (5) DR, and (6) Cox
- Both linear and Gaussian kernels used (except for Cox)

Simulation Results

[Boxplot figure: mean log survival time by method (T, RIST-R1, RIST-R2, ICO, DR, Cox), with linear and Gaussian kernels, Scenario 3.]

Figure 2: Boxplots of mean log survival time for different treatment regimes. The black horizontal line is the theoretical optimal value.

Simulation Results

[Boxplot figure: mean log survival time by method (T, RIST-R1, RIST-R2, ICO, DR, Cox), with linear and Gaussian kernels, Scenario 4.]

Figure 3: Boxplots of mean log survival time for different treatment regimes. The black horizontal line is the theoretical optimal value.

Non-Small Cell Lung Cancer (NSCLC) Trial Data

We apply the proposed methods to the NSCLC randomized clinical trial data described in Socinski, et al (2002, Journal of Clinical Oncology 29:1335-1343):

- Two treatment arms with 114 patients each (228 total)
- Study duration of 104 weeks
- Five covariates:
  - performance status: ranging from 70% to 100%
  - cancer stage: 3 or 4
  - race: black, white, other
  - gender
  - age: ranging from 31 to 82
- We use the methods from the simulations with the value function based on log(T)

NSCLC Trial Data: Results

[Boxplot figure: cross-validated value by method (RIST-R1, RIST-R2, ICO, DR, Cox), with linear and Gaussian kernels.]

Figure 4: Boxplots of cross-validated (CV) value of survival weeks on the log scale. Four-fold CV randomly repeated 100 times.

Overall Results for Tree Based Approach

- The proposed outcome-weighted learning using RIST-R2 works well over a range of right-censored survival settings.
- The idea of using random forests to impute before downstream analyses seems beneficial.
- There remain several important open research questions about inference in this context.


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


Motivating example

Bronchopulmonary Dysplasia

The most common morbidity associated with premature birth is bronchopulmonary dysplasia (BPD).

  BPD --(20%)--> pulmonary arterial hypertension --(40%)--> death

Sildenafil

Sildenafil is a potent inhibitor of type 5 phosphodiesterase. It has been used for adults with pulmonary arterial hypertension.

Challenge: Researchers want to know the best dose of sildenafil for premature infants with BPD as a function of infant characteristics.

Traditional dose trial for sildenafil

- 100 patients randomized to:

  Dose (mg/kg/dose):                 0      0.75   1.50   2.25   3
  Rate of developing hypertension:   20%    15%    10%    5%     10%

- But 2.25 mg/kg/dose may not be optimal for all babies:
  - a very early preterm baby with moderate BPD: 1.5 mg/kg/dose.
  - an older baby with severe BPD: 5 mg/kg/dose.

What is the challenge?

- For a continuous dose, P(A = f(X)) = 0.

- Thus O-learning cannot be applied directly.

- Solution: Chen G, Zeng D, and Kosorok MR (2016). Personalized dose finding using outcome weighted learning (with discussion and rejoinder). JASA 111:1509-1547.

- Note that maximizing the value function is approximately equivalent to minimizing:

  R_φ(f) = E( R ℓ_φ(A − f(X)) / (φ p(A|X)) ),

  where ℓ_φ(A − f(X)) = min(|A − f(X)|/φ, 1).
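A minimal sketch of ℓ_φ exactly as defined above:

```python
# The bounded dose-finding loss: l_phi(u) = min(|u|/phi, 1).
# It is 0 when the dose matches the rule exactly, rises linearly, and
# saturates at 1 once |u| >= phi (a triangular-kernel shape).
import numpy as np

def ell_phi(u, phi):
    return np.minimum(np.abs(u) / phi, 1.0)

u = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(ell_phi(u, phi=1.0))   # -> [1.  0.5 0.  0.5 1. ]
```

Because the loss is capped at 1, a single patient with an extreme outcome cannot dominate the objective, which is the robustness point made on the next slide.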


The ℓ_φ Loss

[Figure: two panels plotting ℓ_φ against u, with the loss equal to 0 at u = 0, rising linearly to 1 at u = ±φ, and flat at 1 beyond.]

- Connection to kernel estimators: triangular kernel (left panel).
- ℓ_φ is a bounded loss and brings robustness.

Outcome weighted learning (O-Learning)

Objective Function: Loss + Penalty

  min_f { (1/n) Σ_{i=1}^{n}  R_i ℓ_φ(A_i − f(X_i)) / (2φ p(A_i|X_i)) + λ_n ‖f‖² }.   (3)

- ‖f‖ is some norm for f, and λ_n controls the penalty.

- A linear decision rule: f(X) = X^T w + b, with ‖f‖ being the Euclidean norm of w.

- The kernel trick generalizes to nonlinear/nonparametric rules.

- Estimated individualized treatment rule: f_n solves (3).

- (3) is not convex but is the difference of two convex functions, so we use the DC algorithm.
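To make objective (3) concrete, the sketch below evaluates its empirical version for a linear rule on toy data, without running the DC algorithm itself. Uniform dosing is assumed so p(A|X) is a known constant; the reward model and parameter values are illustrative assumptions.

```python
# Hedged sketch: the empirical dose-finding objective (3) for f(X) = X w + b.
import numpy as np

def ell_phi(u, phi):
    return np.minimum(np.abs(u) / phi, 1.0)

def objective(w, b, X, A, R, phi, p, lam):
    """Empirical version of (3): weighted l_phi loss plus ridge penalty on w."""
    fX = X @ w + b
    loss = np.mean(R * ell_phi(A - fX, phi) / (2.0 * phi * p))
    return loss + lam * np.dot(w, w)

rng = np.random.default_rng(5)
n = 500
X = rng.normal(size=(n, 1))
A = rng.uniform(0.0, 2.0, size=n)                 # doses ~ Uniform(0, 2), so p(A|X) = 0.5
opt_dose = 1.0 + 0.5 * np.tanh(X[:, 0])           # toy optimal dose, bounded in (0.5, 1.5)
R = 3.0 - (A - opt_dose) ** 2                     # positive reward, peaked at the optimal dose

good = objective(np.array([0.5]), 1.0, X, A, R, phi=0.5, p=0.5, lam=0.01)
bad = objective(np.array([0.0]), 2.0, X, A, R, phi=0.5, p=0.5, lam=0.01)
print(good, bad)   # a rule near the true optimal dose attains a lower objective
```

The DC algorithm mentioned on the slide would iteratively linearize one convex piece of ℓ_φ and solve the resulting convex subproblem; here we only evaluate the objective to show that rules closer to the optimal dose score better.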


ℓ_φ as D.C.

[Figure: two panels writing ℓ_φ as the difference of two convex piecewise-linear functions of u, each with kinks at ±φ.]

Results for Dose Finding

- Consistent, with a convergence rate able to get close to n^{-1/4}; this can be strengthened to nearly n^{-1/2} under a modified kernel and additional assumptions (Luedtke and van der Laan, 2016).
- In contrast, unbounded loss functions such as the absolute deviation loss will not yield consistent results.
- Generalizes to observational data via inverse propensity score weighting.
- Good simulation results: improves over other methods, especially indirect methods.
- Works well when applied to a Warfarin dosing data set.


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


Precision medicine in mHealth

Overall research goal:

- Develop estimation techniques (using data collected with mobile devices) for dynamic treatment regimes (which can be implemented as personalized mHealth interventions)

Motivating example: type 1 diabetes

- Understand type 1 diabetes (T1D) and how it is managed
- Develop tailored mHealth interventions for T1D management

What is T1D?

- Autoimmune disease wherein the pancreas produces insufficient insulin
- Daily regimen of monitoring glucose and replacing insulin
- Hypo- and hyperglycemia result from poor management
- Glucose levels are affected by insulin, diet, and physical activity

The glucose-insulin dynamical system

A day in the life of a T1D patient:

[Figure: glucose (mg/dL, roughly 50-250) over 48 half-hour intervals, plotted alongside insulin, physical activity, and food intake.]

Figure 5: Plot of glucose, insulin, physical activity, and food intake.

Mobile technology in T1D care

Mobile devices can be used to administer treatment and assist with data collection in an outpatient setting, including:

- Continuous glucose monitoring
- Accelerometers to track physical activity
- Insulin pumps to administer and log injections automatically

These technologies can be incorporated using mobile phones.

Research goals

Methodological goals:

- Estimate dynamic treatment regimes for use in mobile health
- Infinite time horizon, minimal modeling assumptions
- Observational data with minute-by-minute observations
- Online estimation to facilitate real-time decision making

Clinical goals:

- Provide patients information on the best actions to stabilize glucose
- Recommendations that are dynamic and personalized to the patient

Conceptual framework

- We use a Markov decision process (MDP) context.

- One potential approach is to use infinite-horizon Q-learning (models the value of each action in a state, assuming all future actions are optimal):
  - Ertefaie A (2014). Constructing dynamic treatment regimes in infinite-horizon settings. arXiv preprint arXiv:1406.0764.

- We developed V-learning, which uses a policy learning approach (models the state-value as a function of the policy):
  - Luckett DJ, Laber EB, Kahkoska AR, Maahs DM, Mayer-Davis E, Kosorok MR (2016). Estimating dynamic treatment regimes in mobile health using V-learning. arXiv preprint arXiv:1611.03531.
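V-learning itself is beyond a short sketch, but the underlying idea of modeling a fixed policy's state-value function can be illustrated with TD(0) policy evaluation on a toy two-state MDP. Everything below (the MDP, rewards, and tabular value model) is a toy stand-in, not the V-learning estimator.

```python
# Conceptual sketch: evaluate a fixed policy's state-value V(s) with TD(0)
# updates, and compare against the exact Bellman solution V = r + gamma P V.
import numpy as np

rng = np.random.default_rng(6)
gamma = 0.9
P = np.array([[0.7, 0.3],     # toy transition probabilities under the policy
              [0.4, 0.6]])
r = np.array([0.0, 1.0])      # expected reward by state (state 1 is "good")

# exact solution of the Bellman equation (I - gamma P) V = r
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)

# online TD(0) with a tabular value estimate (a stand-in for a parametric model)
V = np.zeros(2)
s, alpha = 0, 0.01
for _ in range(200_000):
    s_next = rng.choice(2, p=P[s])
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD(0) update
    s = s_next

print(V, V_exact)   # the online estimate should approach the exact values
```

The online, incremental nature of the update is what makes this style of estimator attractive for the real-time mHealth setting described above.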


Summary for V-Learning

- Advantages of V-learning include:
  - Flexibility in choosing a model for V(π, s; θ_π)
  - Online estimation, randomized decision rules
  - Flexibility in specifying the reference distribution
  - Parametric value estimates

- A tailored treatment regime delivered through mobile devices may help to reduce hypo- and hyperglycemia in T1D patients.


Outline

Introduction

Outcome weighted learning

A Tree Based Approach for Censored Data

Dose finding

Precision medicine in mHealth

Overall conclusions and future work


Overall Conclusions and Future Work

- This is an exciting time for precision medicine at the confluence of machine learning, Big Data, and statistics.
- Outcome weighted learning (O-learning), V-learning, and related methods bridge between machine learning and precision medicine.
- There are numerous open questions.
- This work is part of the emergence of a new (or renewed) discipline focused on data-driven decision making and precision medicine, and statistics is playing a key role.