
Predictive and Prescriptive Methods in Operations Research and Machine Learning: An Optimization Approach

by

Nishanth Mundru

B.Tech., Indian Institute of Technology Bombay (2012)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Operations Research

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Sloan School of Management, May 17, 2019

Certified by: Dimitris J. Bertsimas
Boeing Leaders for Global Operations Professor, Sloan School of Management
Co-Director, Operations Research Center
Thesis Supervisor

Accepted by: Patrick Jaillet
Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science
Co-Director, Operations Research Center


Predictive and Prescriptive Methods in Operations Research and

Machine Learning: An Optimization Approach

by

Nishanth Mundru

Submitted to the Sloan School of Management on May 17, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research

Abstract

The availability and prevalence of data have provided a substantial opportunity for decision makers to improve decisions and outcomes by using this data effectively. In this thesis, we propose approaches that start from data and lead to high-quality decisions and predictions in various application areas.

In the first chapter, we consider problems with observational data, and propose variants of machine learning (ML) algorithms that are trained by taking decision quality into account. The traditional approach to such a task has often been a two-step one, separating the estimation task from the subsequent optimization task that uses the estimated models; consequently, it can miss out on improvements in decision quality that become available by considering the two tasks jointly. Crucially, our joint approach leads to stronger prescriptive performance, particularly for smaller training set sizes, and improves decision quality by 3-5% over other state-of-the-art methods. We introduce the idea of uncertainty penalization to control the optimism of these methods, which improves their performance, and we propose finite-sample regret bounds. Through experiments on real and synthetic data sets, we demonstrate the value of this approach.

In the second chapter, we consider observational data with decision-dependent uncertainty; in particular, we focus on problems with a finite number of possible decisions (treatments). We present our method of prescriptive trees, which prescribes the best treatment option by learning from observational data while simultaneously predicting counterfactuals. We demonstrate the effectiveness of such an approach using real data for the problem of personalized diabetes management.

In the third chapter, we consider stochastic optimization problems for which the sample average approximation approach is computationally expensive. We introduce a novel measure, called the prescriptive divergence, that takes into account the decision quality of the scenarios, and we consider scenario reduction in this context. We demonstrate the power of this optimization-based approach on various examples.


In the fourth chapter, we present our work on a problem in predictive analytics, where we approach ML problems from a modern optimization perspective. For sparse shape-constrained regression problems, we propose modern-optimization-based algorithms that are scalable and that recover the true support with high accuracy and low false positive rates.

Thesis Supervisor: Dimitris J. Bertsimas
Title: Boeing Leaders for Global Operations Professor, Sloan School of Management; Co-Director, Operations Research Center


Acknowledgments

First of all, I would like to thank my advisor, Dimitris Bertsimas, for his constant guidance

and encouragement throughout the course of my PhD. His infectious passion, unmatched

ability to innovate, and attention to detail have significantly improved me as a researcher.

His willingness to challenge assumptions by asking the right questions and to choose practically

relevant problems has left an indelible impression on me. As I reflect

on my graduate school experience, I realize that he has profoundly shaped my views on

research. It has been a tremendous privilege to collaborate so closely with someone who sees

research as a powerful means to materially improve the human condition.

Next, I would like to express my gratitude towards the other two members of my thesis

committee: Nikos Trichakis and Colin Fogarty. Their insightful comments and feedback have

greatly improved this manuscript. Nikos: Thank you for your support with my applications.

I would also like to thank the following professors at MIT: Rob Freund (for serving on

my general examination committee and helping me with my applications), Georgia Perakis

(for her thoughtful words of encouragement), and John Tsitsiklis (for his role on my general

examination committee). Thank you to the faculty with whom I interacted and learned from:

Patrick Jaillet, Juan Pablo Vielma, David Gamarnik, Rahul Mazumder, and Jim Orlin. In

addition, I wish to thank the ORC staff – Laura, Andrew, and Nikki – for their prompt

help with all the paperwork and for ensuring that the ORC runs smoothly.

I would also like to express my sincere thanks to my collaborators: Velibor Mišić for

his enthusiastic insights and ideas in our work on the airlift problem, Allison Chang for

her guidance and expertise on the airlift project, Jack Dunn for his clarity of thought in our

work on the prescriptive trees paper and for helping me with the Engaging cluster, and Chris

McCord for his intuition and insights in our collaborations on prescriptive analytics. Each

one of you has been exceptionally supportive, and working with you has been an enriching

learning experience for me.

I also wish to thank my friends and first year qualifying exams study group members


– Daniel Chen, Martin Copenhaver, and Rajan Udwani. A special word of appreciation to

Martin for being a valuable source of support during some stressful times.

I would like to particularly thank some friends I’ve made at the ORC over the years: Scott

Hunter (for being a constant friendly presence throughout), Alex Weinstein (I particularly

enjoyed our lunches), Miles Lubin and Joey Huchette (for helping me with Julia/JuMP code

during my first few years), Divya and Sowmya Singhvi (for all our conversations), Colin

Pawlowski, Matthew Sobiesk and Daisy Zhuo (for being such a positive influence), and Eli

Gutin (for the fitness tips and shared workout sessions).

A word of thanks to all the other friends I’ve made at the ORC over the years: Andrew Li,

Andrew Vanden Berg, Arthur Delarue, Arthur Flajolet, Charlie Thraves, Chiwei Yan, Chris

Coey, Dan Schonfeld, Deeksha Sinha, Frans deRuiter, Ilias Zadik, Jackie Baek, Jehangir

Amjad, Jerry Kung, Joel Tay, Julia Yan, Kimia Ghobadi, Krish Rajagopalan (for inviting

me to your beautiful wedding), Lennart Baardman, Michael Hu, Nataly Youssef, Nikita

Korolko, Peng Shi, Rebecca Zhang, Rim Harris, Swati Gupta, Ted Papalexopoulos, Virgile

Galle, Yee Sian Ng, and Zach Saunders. Finally, I wish to thank the ORC community as

a whole – I have learnt an immense amount from each of you, and your generosity and

friendliness make the ORC a special place.

I would also like to thank all the friends I’ve made at Sidney Pacific and MIT: in

particular, Murali Vijayaraghavan (who was also my first roommate at Sidney Pacific and helped

me get used to life in Cambridge) and Sai Gautam (for Tamil movie recommendations and

our discussions on test cricket, the greatest sport on this planet). More generally, the MIT

community (along with the opportunities/facilities that MIT provides) as a whole has helped

me grow immeasurably as a person, and for that I am incredibly thankful.

Thanks to all my teachers (in particular, Professor Mani Bhushan for serving as a research

advisor during my last year, and Professors Hemant Nanavati and Jhumpa Adhikari for being

so accommodating of my course preferences) and friends from my undergraduate days at IIT

Bombay. Your support has contributed to making those four years a rewarding experience.

A word of appreciation to all my teachers and friends throughout my schooling in Hyderabad,


without whom I probably would not have been the person I am today.

Finally, I would like to express gratitude to my family for their unconditional support

both during the course of my PhD and over my whole life. In particular, I am indebted to

my parents, Murty and Devi, for their boundless encouragement and patient advice. Thank

you to my brother and maternal grandparents for always being there for me. I also wish to

thank my extended family here in the US for their warmth and hospitality.


Contents

1 Introduction 19

1.1 Prescriptive Analytics: Joint Learning and Optimization . . . . . . . . . . . . . 20

1.1.1 Prescriptive Analytics for Observational Data . . . . . . . . . . . . . . . 20

1.1.2 Optimal Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.1.3 Prescriptive Scenario Reduction for Stochastic Optimization . . . . . . 22

1.2 Predictive Analytics: Machine Learning from a Modern Optimization Lens . . 23

1.2.1 Sparse Convex Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2 Prescriptive Analytics for Observational Data 27

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.1.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.1.4 Structure of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.2 Outline of our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.3 Methods for Joint Predictive-Prescriptive Analytics . . . . . . . . . . . . . . . . 43

2.3.1 k Nearest Neighbors (kNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.3.2 Nadaraya-Watson Kernel Regression (KR) . . . . . . . . . . . . . . . . . 44

2.3.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.3.4 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


2.3.5 Penalizing the prediction error of f . . . . . . . . . . . . . . . . . . . . . 48

2.4 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.4.1 Greedy algorithm for learning trees . . . . . . . . . . . . . . . . . . . . . 49

2.4.2 Prescriptive Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.4.3 Local search algorithms for Prescriptive Trees . . . . . . . . . . . . . . . 50

2.5 Observational data with decision-dependent uncertainty . . . . . . . . . . . . . 52

2.5.1 Uncertainty penalization and Parameter tuning . . . . . . . . . . . . . . 54

2.5.2 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.5.3 Tractability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.6 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

2.6.1 Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

2.6.2 Newsvendor problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

2.6.3 Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.6.4 Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3 Optimal Prescriptive Trees 73

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.1.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

3.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

3.1.3 Structure of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.2 Review of Optimal Predictive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.3 Optimal Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.3.1 Optimal Prescriptive Trees with Constant Predictions . . . . . . . . . . 84

3.3.2 Training Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.3.3 Optimal Prescriptive Trees with Linear Predictions . . . . . . . . . . . . 87

3.4 Performance of OPTs on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 88

3.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89


3.4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.4.5 Multiple treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

3.4.6 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.5 Performance of OPTs on Real World Data . . . . . . . . . . . . . . . . . . . . . 101

3.5.1 Personalized Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . 101

3.5.2 Personalized Diabetes Management . . . . . . . . . . . . . . . . . . . . . 105

3.5.3 Personalized Job training . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

3.5.4 Estimating Personalized Treatment Effects for Infant Health . . . . . . 110

3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4 Prescriptive Scenario Reduction for Stochastic Optimization 113

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

4.1.1 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

4.1.2 Contributions and Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 116

4.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

4.2.1 Distance between distributions . . . . . . . . . . . . . . . . . . . . . . . . 118

4.2.2 Scenario reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

4.3 Prescriptive Scenario reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

4.3.1 Prescriptive divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

4.3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

4.3.3 Piecewise (separately) linear cost . . . . . . . . . . . . . . . . . . . . . . . 127

4.3.4 Piecewise bilinear cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.3.5 Prediction Error penalization . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.4 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

4.4.1 Alternating optimization framework . . . . . . . . . . . . . . . . . . . . . 129

4.5 Computational Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

4.5.1 Portfolio optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130


4.5.2 Newsvendor problem with budget constraints . . . . . . . . . . . . . . . 133

4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

5 Sparse Convex Regression 137

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.1.1 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.1.3 Structure of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.2 Optimization Algorithm for Convex Regression . . . . . . . . . . . . . . . . . . . 142

5.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.2.2 ℓ1 convex regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

5.3 Sparse Convex Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.3.1 Primal approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.3.2 Dual approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

5.3.3 Initialization heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

5.4 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

5.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.4.2 Comparison of initialization methods for the reduced master problem . 160

5.4.3 Run times of ℓ2 convex regression . . . . . . . . . . . . . . . . . . . . . . 162

5.4.4 Infeasibility as a function of iterations . . . . . . . . . . . . . . . . . . . . 164

5.4.5 Comparison with other state-of-the-art methods . . . . . . . . . . . . . . 165

5.4.6 Run times for ℓ1 convex regression . . . . . . . . . . . . . . . . . . . . . . 167

5.4.7 Experiments on real data . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5.4.8 Sparse convex regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

5.4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

6 Conclusions 181


A Supplement for Chapter 2 183

A.1 Optimization Algorithms for Joint Predictive and Prescriptive Analytics . . . 183

A.1.1 First order convex methods for local search procedure . . . . . . . . . . 188

A.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

A.3 Optimization with Linear Predictive Models . . . . . . . . . . . . . . . . . . . . 199

A.4 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

A.4.1 Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

A.4.2 Newsvendor problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

A.4.3 Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

A.4.4 Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

B Supplement for Chapter 5 205

B.1 Heuristic for generating upper bounds . . . . . . . . . . . . . . . . . . . . . . . . 205

B.1.1 Heuristics for norm bounded subgradients . . . . . . . . . . . . . . . . . 208

B.1.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210


List of Figures

2-1 Tree constructed by regressing Y vs. X on the training set. . . . . . . . . . . 29

2-2 A different decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3-1 Performance of classification methods averaged across 60 real-world datasets.

OCT and OCT-H refer to Optimal Classification Trees without and with

hyperplane splits, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3-2 Test prediction and personalization error as a function of µ . . . . . . . . . . . 85

3-3 Effect and Treatment accuracy results for Experiment 1. . . . . . . . . . . . . . 92

3-4 Tree constructed by OPT(0.5)-L for an instance of Experiment 1. . . . . . . . 93

3-5 Effect and Treatment accuracy results for Experiment 2. . . . . . . . . . . . . . 93

3-6 Tree constructed by OPT(0.5)-L for an instance of Experiment 2. . . . . . . . 95

3-7 Effect and Treatment accuracy results for Experiment 3. . . . . . . . . . . . . . 96

3-8 Outcome and Treatment accuracy results for Experiment 4 with three

treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3-9 Error in prescribed outcome due to incorrect prescription. . . . . . . . . . . . . 100

3-10 Misclassification rate for warfarin dosing prescriptions as a function of training

set size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104


3-11 Comparison of methods for personalized diabetes management. The leftmost

plot shows the overall mean change in HbA1c across all patients (lower is

better). The center plot shows the mean change in HbA1c across only those

patients whose prescription differed from the standard-of-care. The rightmost

plot shows the proportion of patients whose prescription was changed from

the standard-of-care. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

3-12 Out-of-sample average personalized income as a function of inclusion rate. . . 109

4-1 Average in-sample prescriptive performance for various methods as a function

of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . 132

4-2 Average out-of-sample prescriptive performance for various methods as a function

of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . 133

4-3 Average in-sample prescriptive performance for various methods as a function

of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . 134

4-4 Average out-of-sample prescriptive performance for various methods as a function

of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . 135

5-1 Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.1. . . . . . . . . . . . 164

5-2 Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.05. . . . . . . . . . . 165

5-3 Progress of Algorithm 1 for Tol = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . 168

5-4 Progress of Algorithm 1 for Tol = 0.05. . . . . . . . . . . . . . . . . . . . . . . . . 169

5-5 Accuracy and run times for varying SNR. . . . . . . . . . . . . . . . . . . . . . . 175

5-6 Accuracy and run times for varying correlation ρ. . . . . . . . . . . . . . . . . . 176

5-7 Accuracy and run times for varying dimension d. . . . . . . . . . . . . . . . . . . 176

5-8 Accuracy and run times for varying sparsity parameter k. . . . . . . . . . . . . 176


List of Tables

2.1 Average out-of-sample prescriptive performance for Predict and Optimize –

kNN, local Kernel Regression, Lasso, and Random forests as a function of n,

the size of the training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

2.2 Average out-of-sample prescriptive performance for various methods as a function

of n, the size of the training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

2.3 Average out-of-sample prescriptive performance for various methods as a function

of n, the size of the training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2.4 Average out-of-sample revenue on the pricing example for various PtP and

JPP methods as a function of n, the size of the training set. . . . . . . . . . . . 68

2.5 Average out-of-sample MSE on the Warfarin example for various PtP and

JPP methods as a function of n, the size of the training set. . . . . . . . . . . . 70

3.1 Average personalized income on the test set for various methods. . . . . . . . . 109

3.2 Average R2 on the test set for various methods for estimating the personalized

treatment effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5.1 The effect of the initialization method for (n, d) = (10^4, 10) in the ℓ2 convex

regression for tolerances Tol = 0.1 and 0.05. . . . . . . . . . . . . . . . . . . . . . 161

5.2 Run times for Tol = 0.1 and ℓ2 convex regression. . . . . . . . . . . . . . . . . 162

5.3 Run times for Tol = 0.05 and ℓ2 convex regression. . . . . . . . . . . . . . . . . 163

5.4 Comparison for ℓ2 convex regression between Algorithm 1 and ACP for Tol = 0.1. 166

5.5 Comparison for ℓ2 convex regression with ADMM. . . . . . . . . . . . . . . . . 167


5.6 ℓ1 convex regression - Run times for Tol = 0.1. . . . . . . . . . . . . . . . . . . 167

5.7 Accuracy% and Run times for Algorithm 2 for n = 50k. . . . . . . . . . . . . . . 172

5.8 Accuracy% and Run times for Algorithm 2 for n = 100k, d = 100. . . . . . . . . 173

5.9 Accuracy% and Run times for Algorithm 3 for n = 50k. . . . . . . . . . . . . . . 174

5.10 False Positive rate for Algorithm 3. . . . . . . . . . . . . . . . . . . . . . . . . . . 178

5.11 False Positive rate for Algorithm 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 178


Chapter 1

Introduction

Nowadays a decision maker typically has access to a wide range of data, which can be

used to significantly improve decision making. This data includes not only past samples

of uncertainty, but also auxiliary data (or side information/features/covariates) associated

with each observation. For instance, along with daily demand data (uncertainty), a retailer

could also collect side information such as weather data, temporal data (season, weekday

or weekend), product information, and macroeconomic trends. These features are typically

available at the time of decision making while uncertainty is observed only after the decision is

implemented. A notable point is that this data is often observational, where past decisions are

unknown functions of the covariates. Additionally, in many cases these decisions influence the

observed uncertainty. For example, in revenue management, a store owner must decide how

to price various products in order to maximize profit, but the observed demand (uncertainty)

is itself affected by the chosen price. Thus, a key question in the area of data-driven decision

making is:

How can we achieve better decisions than state-of-the-art methods by considering prediction

and optimization models jointly, while adjusting for the observational nature of data?

This thesis tackles various aspects of this problem by using ideas from the statistics and

Machine Learning (ML) literature along with techniques from mathematical optimization.

In particular, the first part of this thesis focuses on improving techniques for data-driven


decision making; then in the second part, we focus on developing techniques using modern

optimization for ML applications.

We provide a brief outline of each of these two areas and the specific problems comprising

them that we consider in this thesis, as well as our contributions, before describing them in

more detail in the subsequent chapters.

1.1 Prescriptive Analytics: Joint Learning and Optimization

1.1.1 Prescriptive Analytics for Observational Data

In Chapter 2, we consider problems where uncertainty affects the cost function, and propose

ML-based algorithms that compute prescriptive policies in a single step. Traditionally,

stochastic optimization based methods, which use past data samples of uncertainty along

with probabilistic assumptions, have been studied extensively for decision-making, but these

methods typically do not account for covariate data. A commonly used two-stage approach

(which is referred to as Predict and Optimize, or P&O) of first training an ML model to

predict the uncertainty from auxiliary data, followed by substituting this estimate in the

optimization problem directly, can often lead to suboptimal decisions. A key drawback of

P&O is that it does not take into account how the prediction uncertainty affects the objective

of the optimization (prescription) problem. We address this in Bertsimas et al. [2019b], where

we train ML models (local learning methods, decision trees, random forests) by explicitly

optimizing for the quality of their corresponding decisions.
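To make the P&O pipeline concrete, the following is a minimal Python sketch on a hypothetical newsvendor instance; the data-generating process, cost parameters, and query point below are invented purely for illustration and are not taken from the experiments of this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data: covariates X and demand d (the uncertainty).
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 3))
d = 50.0 + 30.0 * X[:, 0] + rng.normal(0.0, 5.0, size=n)

# Step 1 (predict): fit a linear model for demand by least squares.
A = np.c_[np.ones(n), X]
beta, *_ = np.linalg.lstsq(A, d, rcond=None)

# Step 2 (optimize): plug the point forecast into the newsvendor problem
#   min_w  E[ c_u * max(d - w, 0) + c_o * max(w - d, 0) ].
# Treating the forecast as the true demand makes the "optimal" order the
# forecast itself: the asymmetric costs c_u and c_o play no role, which is
# precisely the drawback of decoupling prediction from optimization.
c_u, c_o = 4.0, 1.0
x_new = np.array([1.0, 0.8, 0.2, 0.5])   # intercept + query covariates
order_pando = float(x_new @ beta)
```

In this sketch the P&O order is simply the demand forecast at the query point, regardless of how costly under-ordering is relative to over-ordering.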

A recent approach in Bertsimas and Kallus [2019] which relies on solving a covariate-

dependent SAA (Sample Average Approximation)-like problem (the objective is averaged

only over the relevant neighbors, which are determined by regressing uncertainty on the

covariates) often does better than P&O as it accounts for prediction uncertainty of the cost

function. However, while this approach accounts for the cost uncertainty, it still advocates a



two-step approach where ML and optimization are disjoint. As we demonstrate in Bertsimas

et al. [2019b], coupling prediction and optimization can lead to further gains, particularly

for smaller training set sizes.
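The covariate-dependent SAA idea can be sketched for the same kind of newsvendor instance; the kNN neighborhood and all parameter values below are illustrative assumptions, not the exact estimator of Bertsimas and Kallus [2019].

```python
import numpy as np

def knn_saa_newsvendor(X, d, x0, k=20, c_u=4.0, c_o=1.0):
    """Local SAA sketch: average the newsvendor cost only over the k training
    points whose covariates are closest to the query x0, then minimize."""
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    neighbors = d[idx]
    # The local empirical cost is piecewise linear in the order w, so some
    # observed neighbor demand always attains its minimum.
    best_w, best_cost = None, np.inf
    for w in neighbors:
        cost = np.mean(c_u * np.maximum(neighbors - w, 0.0)
                       + c_o * np.maximum(w - neighbors, 0.0))
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w

# Hypothetical data and query point, as before.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, 3))
d = 50.0 + 30.0 * X[:, 0] + rng.normal(0.0, 5.0, size=500)
w_saa = knn_saa_newsvendor(X, d, x0=np.array([0.8, 0.2, 0.5]))
```

Unlike the plain P&O order, this decision moves toward an upper quantile of the local demands when under-ordering is costlier, because the local SAA sees the full conditional cost rather than a point forecast.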

Additionally, our methods can also account for observational data where the observed

uncertainty can be affected by the decision. We use these ML methods to simultaneously prescribe decisions and impute counterfactual outcomes. We introduce the idea of uncertainty

penalization to control the optimism of these methods, which improves their performance, and we establish finite-sample regret bounds. Finally, we perform computational experiments and demonstrate the prescriptive power of our methods on synthetic

and real data on portfolio optimization, newsvendor, pricing, and personalized medicine

problems.

This chapter appears in large part in the submitted paper [Bertsimas et al., 2019b].

1.1.2 Optimal Prescriptive Trees

In a related problem, Chapter 3 considers observational data with decision-dependent uncertainty, but focuses on the case with a finite number of decisions (treatments). We present

our method of prescriptive trees, which prescribes the best treatment option by learning from observational data while simultaneously predicting the counterfactuals. This approach is

interpretable, and is applicable to problems with more than two treatment

choices. In the context of personalized medicine, it is essential that the personalization

method be interpretable, and able to handle the case of more than two treatment options

(unlike causal forests). Trees can greatly help in this setting, as the partitions can be used

to get a sense of the key characteristics that led to a patient being assigned a particular

prescription.

We demonstrate the performance of our methods on synthetic data and two real-world applications: personalized Warfarin dosing and personalized diabetes treatment (data from

the Boston Medical Center). Our methods outperform propensity-score-based methods, regress-and-compare methods (analogs of P&O for the discrete case, which involve estimating outcome functions for each treatment and choosing the treatment that leads to the best

predicted outcome), and causal forests. The key message remains the same – rather than

stipulating models that we estimate using data followed by using these estimated models

to arrive at final decisions, we employ a single-step framework for decision making from

observational data.

This chapter appears in the published paper [Bertsimas et al., 2019a].

1.1.3 Prescriptive Scenario Reduction for Stochastic Optimization

In Chapter 4, we consider data-driven stochastic optimization problems, where the data consist of n historically observed samples of the uncertainty. In this setting, the decision maker

aims to minimize a (convex) cost function averaged over the empirical sample distribution,

commonly referred to as the Sample Average Approximation (SAA), over a convex compact

uncertainty-independent set. However, when the number of samples n is large, solving the

SAA becomes computationally prohibitive. A classical solution framework is scenario reduction, where the empirical distribution is approximated by another discrete distribution with a smaller support, and the SAA problem on this reduced distribution becomes computationally tractable. In the course of computing this approximate distribution, a widely used

measure of closeness between two distributions is the Wasserstein distance, which, however, does not take into account the decision quality of these scenarios.
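For intuition about this decision-blind baseline: for one-dimensional samples, a Wasserstein-style reduction amounts to clustering, placing each of the m scenarios at a cluster mean and weighting it by the cluster's empirical share. The following is a minimal Lloyd-style sketch on hypothetical data, not the method developed in Chapter 4:

```python
import random

def reduce_scenarios(samples, m, iters=50):
    # Lloyd-style clustering: under squared (2-Wasserstein) transport cost in
    # one dimension, each reduced scenario sits at its cluster's mean and
    # carries probability equal to the cluster's empirical share.
    centers = sorted(random.sample(samples, m))
    clusters = [[] for _ in range(m)]
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for y in samples:
            j = min(range(m), key=lambda j: (y - centers[j]) ** 2)
            clusters[j].append(y)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    probs = [len(c) / len(samples) for c in clusters]
    return centers, probs

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
scenarios, probs = reduce_scenarios(data, m=3)
```

Replacing the squared-distance objective here with the cost induced on the downstream decision is, in spirit, the step from the Wasserstein distance to the Prescriptive divergence.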

We introduce a novel generalization of the Wasserstein distance, which we refer to

as the Prescriptive divergence, that quantifies the difference in decision quality between

two discrete distributions. We consider scenario reduction in this setting, and develop an

alternating-minimization based algorithm for computing discrete distributions (scenarios and

corresponding probabilities) that minimize this quantity. We demonstrate through computational examples that this approach can lead to significantly better decisions for constrained

newsvendor and portfolio optimization problems.

This chapter appears in Bertsimas and Mundru [2019], which will be submitted shortly.


1.2 Predictive Analytics: Machine Learning from a Modern Optimization Lens

In this section, we describe some work on machine learning problems from an optimization

lens.

1.2.1 Sparse Convex Regression

Estimating a regression function from data with shape constraints (convexity or concavity)

has many applications in operations research (reinforcement learning, resource allocation),

econometrics, geometric programming, image analysis, and target reconstruction. While this

functional optimization problem can be equivalently written as a finite-dimensional convex quadratic optimization problem, it has O(n^2) constraints, where n is the number of training points.

In Chapter 5, we develop a scalable cutting-plane-based algorithm for obtaining high quality solutions in practical times. Next, variable selection for regression has gained importance in the statistics and optimization communities, and is relevant when the number of features d can be large. As part of this work, we develop computational methods that select the best k out of the d features using an approach that combines first-order convex optimization methods, mixed-integer optimization techniques, and dual reformulations. We demonstrate that our methods scale to solve problems of sizes (n, d, k) = (10^4, 10^2, 10) in minutes and (n, d, k) = (10^5, 10^2, 10) in hours, and also control the false discovery rate effectively.
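To make the constraint count concrete: the finite-dimensional convex least squares problem fits a value f_i and a subgradient g_i at each training point, subject to f_i ≥ f_j + g_j·(x_i − x_j) for every ordered pair, i.e. n(n − 1) constraints. A cutting-plane method keeps only a working subset of these and repeatedly adds the most violated pair. A minimal sketch of that separation step for one-dimensional data (illustrative only, not the full algorithm of Chapter 5):

```python
def most_violated_pair(x, f, g, tol=1e-8):
    # Separation oracle for the convexity constraints
    #   f[i] >= f[j] + g[j] * (x[i] - x[j])  for all i != j:
    # scan all ordered pairs and return the most violated one, or None.
    best_pair, best_violation = None, tol
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            violation = f[j] + g[j] * (x[i] - x[j]) - f[i]
            if violation > best_violation:
                best_pair, best_violation = (i, j), violation
    return best_pair

# A genuinely convex fit (f = x^2, g = 2x) violates nothing; a dented one does.
xs = [0.0, 1.0, 2.0, 3.0]
convex_f = [v * v for v in xs]
convex_g = [2.0 * v for v in xs]
print(most_violated_pair(xs, convex_f, convex_g))   # None
dented_f = [0.0, 1.0, 0.5, 9.0]                     # non-convex values
print(most_violated_pair(xs, dented_f, convex_g))   # (2, 1)
```

In a cutting-plane loop, the quadratic program is re-solved with the returned pair added to the working constraint set until the oracle returns None.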

This chapter appears in the submitted paper [Bertsimas and Mundru, 2018].

1.3 Main Contributions

Our contributions in this thesis can be summarized as follows, listed by chapter.


Chapter 2: Prescriptive Analytics for Observational Data

• In this chapter, we propose a general approach for solving the prescriptive problem

with uncertain parameters in its objective as a single-step optimization problem. This

framework provides high quality prescriptions by learning from past data and accommodates powerful non-parametric machine learning methods such as k-nearest neighbors, kernel regression, decision trees, and random forests, which have been traditionally

used for prediction. That is, we directly train these machine learning methods for the

parameters that lead to the best decisions as opposed to predictions.

• Further, we develop an algorithmic framework for observational data-driven optimization that allows the decision variable to take values on continuous and multidimensional

sets.

• We analyze the power of these approaches theoretically, and present finite-sample regret

bounds on the performance of these methods.

• Finally, we demonstrate the performance of the methods developed through computational experiments. First, for the case where uncertainty is not affected by decisions,

we apply our methods on a portfolio optimization problem and a newsvendor problem, and provide evidence that they output superior data-driven decisions compared

to state-of-the-art methods, particularly for smaller training set sizes. Next, in

the case where uncertainty is affected by decisions, we consider applications in personalized medicine, in which the decision is the dose of Warfarin to prescribe to a patient,

and in pricing, in which the action is the list of prices for several products in a store.

Chapter 3: Optimal Prescriptive Trees

• In this chapter, we present a tree-based method that produces trees with partitions

that are parallel to the axis. Consequently, they are highly interpretable and provide

intuition on the important features that lead to a sample being assigned a particular


treatment.

• Similar to predictive trees [Bertsimas and Dunn, 2017, 2019, Dunn, 2018], prescriptive

trees scale to problems with n in the 100,000s and d in the 10,000s in seconds when they

use constant predictions in the leaves and in minutes when they use a linear model.

• Prescriptive trees can be applied with multiple treatments. An important desired

characteristic of a prescriptive algorithm is its generalizability to handle the case of

more than two possible arms. Rapid advances in technology have resulted in almost

all diseases having multiple drugs at the same stage of clinical development. This

emphasizes the importance of methods that can handle trials with more than two

treatment options.

• In a series of experiments, we demonstrate that prescriptive trees either outperform or are comparable out of sample with several state-of-the-art methods on synthetic and real-world data. Importantly, these methods tend to

perform well in the presence of limited data, which is often the case in practice in the

healthcare setting.

Chapter 4: Prescriptive Scenario Reduction for Stochastic Optimization

• In this chapter, we present a novel optimization-based approach for scenario reduction for stochastic optimization problems. As part of this approach, we introduce the “Prescriptive divergence,” which measures the difference in quality of decisions induced by

two discrete distributions and includes the Wasserstein distance as a special case.

• We propose scenario reduction approaches in this context, and present an iterative

algorithm for computing these scenarios and their corresponding probabilities for two

classes of cost functions. While the actual problem is nonconvex, we derive convex

upper bounds, which we optimize to estimate the scenarios. Our optimization approach relies on an alternating minimization algorithm, where we solve a sequence of

convex optimization problems for computing the scenarios.

• Finally, we present computational results where we apply these methods on constrained

newsvendor and portfolio optimization problems, and demonstrate that these methods

result in improved decisions. Importantly, these scenarios outperform the traditional

Wasserstein-distance-based scenario reduction approach both in sample and out of sample across various choices of the number of scenarios m. This results in better

performance with fewer scenarios, which leads to greater interpretability of decision-

making, and hence is valuable to practitioners.

Chapter 5: Sparse Convex Regression

• In this chapter, we consider the problem of convex regression, and develop a scalable

algorithm for obtaining high quality solutions in practical times that compare favorably

with other state-of-the-art methods. We show that by using a cutting-plane method, the least squares convex regression problem can be solved for sizes (n, d) = (10^4, 10) in minutes and (n, d) = (10^5, 10^2) in hours.

• We propose algorithms which iteratively solve for the best subset of features based on first-order and cutting-plane methods. To the best of our knowledge, these are the

first algorithms for sparse convex regression.

• We consider two variants of this problem, and develop algorithms for each of them. In

one variant, we consider the sparse problem with bounded subgradients, and develop

iterative mixed integer optimization based algorithms for solving it. In the second

variant, we consider the sparse problem with ridge regularization, and develop a binary

cutting plane method for this problem.

• With the help of computational experiments, we show that our methods are scalable

and obtain near-exact subset recovery for sizes (n, d, k) = (10^4, 10^2, 10) in minutes, and (n, d, k) = (10^5, 10^2, 10) in hours.


Chapter 2

Prescriptive Analytics for Observational Data

2.1 Introduction

One of the central goals of operations research/management science (OR/MS) and business

analytics is to make decisions which lead to lower costs and improved business outcomes.

These decisions (which we shall also refer to as prescriptions in this chapter) are typically

computed by solving a constrained optimization problem. However, a challenge is that some

parameters in the optimization problem are often unknown. Traditionally in operations

research, these uncertain parameters are estimated under a priori imposed assumptions, and

the decisions are then computed by solving the optimization problem with the estimated

parameters.

With the advent and proliferation of data and the improved ability to collect and store

large quantities of diverse information, there has been increased interest in using this rich

data to improve the quality of decisions. Data, rather than models or assumptions, should

guide the decision making process. This key principle has guided the machine learning (ML)

community to notable improvements in predictive analytics over the past decade. However,

most real world business analytics problems typically involve aspects of both prediction and


optimization [den Hertog and Postek, 2016]. Consequently, there has been increased interest

among the operations research and management science community to attack problems of

this flavor [Stubbs, 2016]. The applications are abundant and encompass several areas –

demand forecasting and price optimization [Ferreira et al., 2015], promotion planning [Cohen

et al., 2017], shipment decisions [Gallien et al., 2015], inventory management [Bertsimas

et al., 2016a], to name a few. The central goal of this chapter is to develop a framework in

which non-parametric machine learning techniques, originally designed for prediction, can

be adapted to provide high quality decisions for problems in OR/MS, which typically involve

mathematical optimization formulations.

Many important problems across a variety of fields fit into this framework. In healthcare,

for example, a doctor aims to prescribe drugs in specific dosages to regulate a patient’s vital

signs (outcome Y ). In such a setting, we have access to past data (X) about each patient

such as demographics, past medication history, genetic information, and what treatment (Z)

was administered. The patient outcomes are potentially affected by the patient character-

istics and choice of treatment. In revenue management, a store owner must decide how to

price various products in order to maximize profit. In online retail, companies decide which

products to display for a user to maximize sales. An online retailer can easily have access to

information about the customer, and may seek to price different products differently for various customers. In resource allocation problems, companies have to allocate finite resources

in order to minimize costs. For instance, consider a company with machines distributed

across the country, with each machine described by its state X – working, close to failure,

or failed. Such a company would want to use past data (such as machines’ historical failure

rate, features of the machines, relative importance of machines in the network) to decide

where to dispatch engineers in order to minimize the total cost of travel and cost incurred

due to potential disruptions in the network.

In this chapter, we emphasize the importance of optimizing the right objective, along with

appropriate parameter tuning for computing decisions. While cross validation is commonly

used to tune parameters in ML prediction problems, it can be slightly more challenging for


decision problems. The motivation behind tuning these parameters appropriately is that the

best predictive model might not always be the best model for decision making. We illustrate

this with a toy example. Consider a setting with a single covariate x ∼ U[0,1], uniformly

distributed between 0 and 1. The uncertainty of interest, Y, is a function of x given by

Y(x) =
  2,  if x ≤ 0.5,
  1,  if 0.5 < x ≤ 0.95,
  −1, if 0.95 < x ≤ 1.

In order to compute the decision, we have to solve the optimization problem:

min_{0 ≤ z ≤ 1} c(z; y) = ∣y + z∣.

Suppose we are given n points (X^1, Y^1), . . . , (X^n, Y^n) sampled randomly. As a starting approach, we regress Y ∼ X and suppose we obtain the following tree:

Figure 2-1: Tree constructed by regressing Y vs. X on the training set: if x ≤ 0.5, predict Y = 2; otherwise, predict Y = 1.

We see that for this tree, the out-of-sample R² = 0.63, and the corresponding decisions z(x) and costs incurred are given by:

0 ≤ x ≤ 0.5 ⟹ z(x) = 0, avg. cost = 0.5 × 2 = 1.00,
0.5 < x ≤ 0.95 ⟹ z(x) = 0, avg. cost = 0.45 × 1 = 0.45,
0.95 < x ≤ 1.0 ⟹ z(x) = 0, avg. cost = 0.05 × 1 = 0.05.

Consequently, we see that the out-of-sample prescriptive cost, which quantifies the performance of decisions prescribed by this tree, is 1.50. Now, consider the following different

tree:

Figure 2-2: A different decision tree: if x ≤ 0.95, predict Y = 1.5; otherwise, predict Y = −1.

This tree is worse in terms of its predictive performance for Y, as is evident from its out-of-sample R² of 0.56, which is lower than 0.63. Next, we consider the decisions z(x) and costs incurred:

x ≤ 0.95 ⟹ z(x) = 0, avg. cost = 0.5 × 2 + 0.45 × 1 = 1.45,
x > 0.95 ⟹ z(x) = 1, cost = 0.

Thus, the average out-of-sample prescriptive cost is 1.45, which is lower than 1.50; the second tree, though a worse predictor, yields better decisions.
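The arithmetic in the toy example can be checked directly. For the cost |y + z| with z ∈ [0, 1], the optimal decision for a prediction ŷ has the closed form z*(ŷ) = min(max(−ŷ, 0), 1), which the short script below uses:

```python
# Exact prescriptive cost of the two trees in the toy example, using
# cost(z; y) = |y + z| with z restricted to [0, 1].

def optimal_z(y_hat):
    # argmin over z in [0, 1] of |y_hat + z|, i.e. clamp(-y_hat, 0, 1).
    return min(max(-y_hat, 0.0), 1.0)

# True-outcome segments of x: (probability mass, true Y on that segment).
segments = [(0.50, 2.0), (0.45, 1.0), (0.05, -1.0)]

def prescriptive_cost(predictions):
    """predictions: the tree's predicted Y on each segment, in order."""
    return sum(p * abs(y + optimal_z(y_hat))
               for (p, y), y_hat in zip(segments, predictions))

tree1 = [2.0, 1.0, 1.0]    # split at x <= 0.5: predicts 2, else 1
tree2 = [1.5, 1.5, -1.0]   # split at x <= 0.95: predicts 1.5, else -1

print(round(prescriptive_cost(tree1), 2))  # 1.5
print(round(prescriptive_cost(tree2), 2))  # 1.45
```

The second tree's lone useful act is predicting a negative value on (0.95, 1], which flips its decision to z = 1 and eliminates that segment's cost entirely.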

2.1.1 Notation

Throughout this chapter, we use capital letters to refer to random quantities and lower case

letters to refer to deterministic quantities. The general problem we study is characterized

by the following components:

• Decision variable: z ∈ Z ⊂ R^{d_z},

• Outcome: Y(z) ∈ Y ⊂ R^{d_y} (we adopt the potential outcomes framework [Rosenbaum, 2002], in which Y(z) denotes the (random) quantity that would have been observed had decision z been chosen),

• Auxiliary covariates (also called side-information or context): x ∈ X ⊂ R^{d_x},

30

Page 31: Predictive and Prescriptive Methods in Operations Research

• Cost function: c(z; y) ∶ Z × Y → R.

Thus, we use Z to refer to the decision randomly assigned by the (unknown) historical policy

and z to refer to a specific action (or, decision). For a given auxiliary covariate vector, x,

and a proposed decision, z, the conditional expectation E[c(z; Y) ∣ X = x, Z = z] quantifies

the expectation of the cost function c(z;Y ) under the conditional measure in which X is

fixed as x and Z is fixed as z. We ignore details of measurability throughout and assume

this conditional expectation is well defined. Throughout this chapter, all norms are ℓ2 norms

unless otherwise specified. We use (X,Z) to denote vector concatenation.

2.1.2 Related Literature

In this section, we present an overview of some related approaches in the literature. Stochastic optimization attempts to solve the problem

min_{z∈Z} E[c(z; Y)], (2.1)

for some known convex cost function c(z; y) in z and convex feasible set Z, and where the

expectation is computed over the unknown distribution of Y . However, the distribution of

the random variable Y is typically unknown. As shown by Nemirovski and Shapiro [2006],

even estimating the objective for a given decision z can be a highly nontrivial problem.

Typically, we have access to data Y^1, . . . , Y^n, which represent historical observations of the uncertainty, rather than the distribution of Y. In this setting, the classical paradigm for data-driven stochastic optimization is sample average approximation (SAA) [Kleywegt et al., 2002, Shapiro and Nemirovski, 2005], where the empirical distribution over Y^1, . . . , Y^n

is used to approximate the full expectation in Problem (2.1). To be precise, SAA considers

the problem

min_{z∈Z} (1/n) ∑_{i=1}^{n} c(z; Y^i). (2.2)
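As a concrete instance of (2.2), consider a newsvendor cost c(z; y) = b·max(y − z, 0) + h·max(z − y, 0). The SAA objective is then convex and piecewise linear in z with breakpoints at the samples, so some sample Y^i attains the minimum and a search over the samples suffices. A minimal sketch with hypothetical cost parameters:

```python
import random

b, h = 4.0, 1.0  # hypothetical backorder and holding costs

def newsvendor_cost(z, y):
    return b * max(y - z, 0.0) + h * max(z - y, 0.0)

def saa_decision(samples):
    # The SAA objective (1/n) * sum_i c(z; Y^i) is piecewise linear in z,
    # so it is minimized at one of the samples themselves.
    def saa_objective(z):
        return sum(newsvendor_cost(z, y) for y in samples) / len(samples)
    return min(samples, key=saa_objective)

random.seed(0)
demand = [random.gauss(100.0, 20.0) for _ in range(499)]
z_saa = saa_decision(demand)
# For this cost, z_saa is the b/(b+h) = 0.8 empirical quantile of demand.
```

For large n this brute-force search is exactly the computational burden that motivates scenario reduction.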

Clearly, this can be considered as a stochastic optimization problem with the distribution of

Y approximated by a discrete distribution over the scenarios Y^1, . . . , Y^n, each with probability 1/n. In fact, as n increases to infinity, under some mild conditions, Problem (2.2)

can be shown to be equivalent to the original stochastic optimization problem (2.1) [Shapiro

et al., 2009a]. However, the classical stochastic optimization framework is unable to include

contextual information provided by observed covariates x. In settings where these covariates are known at the time of implementing the decision, using this additional

knowledge can add substantial value. Recent years have seen tremendous interest in the

area of data-driven optimization. Much of this work combines ideas from the statistics and

machine learning literature with techniques from mathematical optimization.

In this case, the problem we now consider is

z(x) ∈ argmin_{z∈Z} E[c(z; Y) ∣ X = x]. (2.3)

The optimized decision z(x) thus takes into account this potential knowledge about the

future uncertainty Y, and allows for higher quality decision making. Clearly, this is a generalization of the classical problem (2.1), where contextual information is ignored for decision

making.

To solve Problem (2.3), one commonly used approach in the literature is to employ a Predict and Optimize (P&O) framework. As the name indicates, this approach involves solving

the problem of generating prescriptions from data in two steps. In the first step, a machine

learning model f(x) that predicts y is trained using past data (X^1, Y^1), . . . , (X^n, Y^n). In the second step, when X^0 is given, the corresponding predicted uncertainty is computed according to the machine learning model, f(X^0), and this estimate is substituted into the optimization problem to solve for the decision z. To be precise, the decision z(X^0) is computed

by solving

z(X^0) ∈ argmin_{z∈Z} c(z; f(X^0)).

For learning the function f , any of the several machine learning techniques that have been

proposed in the literature can potentially be used (see Hastie et al. [2009] for an overview).
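The first drawback is easy to see in a newsvendor instance with cost c(z; y) = b·max(y − z, 0) + h·max(z − y, 0): plugging a point prediction ŷ into the optimization yields z = ŷ, whereas the cost-optimal order is the b/(b + h) quantile of Y given X = x, not its mean. A small sketch, where the "predictor" is simply the sample mean standing in for any trained f (hypothetical parameters):

```python
import random

b, h = 4.0, 1.0
random.seed(1)
y_train = [random.gauss(100.0, 20.0) for _ in range(2000)]

def avg_cost(z):
    return sum(b * max(y - z, 0.0) + h * max(z - y, 0.0)
               for y in y_train) / len(y_train)

# Step 1 (predict): any ML model would do; here, simply the sample mean.
y_hat = sum(y_train) / len(y_train)
# Step 2 (optimize): argmin_z c(z; y_hat) is attained at z = y_hat itself.
z_pando = y_hat

# Cost-aware benchmark: the b/(b+h) = 0.8 empirical quantile.
z_quantile = sorted(y_train)[int(0.8 * len(y_train))]

print(avg_cost(z_pando) > avg_cost(z_quantile))  # True: P&O ignores the cost asymmetry
```

Even a perfect conditional-mean predictor would make this mistake, since the error lies in the plug-in step, not in the prediction.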

However, one key drawback of this approach is that by substituting in the predicted y


directly, the optimization model does not take into account the uncertainty associated with

this prediction. Another key area where this approach could be potentially improved is

that the prediction model f is not aware of the downstream optimization model, which is a

consequence of the two-step solution approach. We point out that our work resolves both

issues. We next discuss some recent work that addresses some of these issues as well and

compare them with our proposed approach.

To resolve the second issue, Elmachtoub and Grigas [2017] propose an approach where

they also consider the problem of finding prediction functions f that lead to good prescriptions. This method is based on the P&O framework and is restricted to optimization problems with linear objective functions c(z; Y) = z′Y and linear predictive functions f(x) = Bx.

However, it is not clear how to extend this for nonlinear (in z) objectives c(z;Y ), or for

nonlinear prediction functions f . In other work, Tulabandhula and Rudin [2013] minimize

a combination of prediction loss along with the operational cost on an unlabeled dataset.

However, the operational cost is defined on the unlabeled data while the prediction loss is defined on the labeled data, and this approach still follows the P&O methodology. For the feature-based

newsvendor problem, Rudin and Vahn [2014] use machine learning methods to predict the

optimal decision as a direct function of the observed covariates x. While the optimization

is performed in sample, the predicted decisions can potentially be infeasible for some points

in the test dataset. Kao et al. [2009] propose a method which also predicts the decision as a

linear function of the covariates. The regression coefficients are chosen as a convex combination of the usual least squares coefficients (obtained by minimizing prediction loss) and the

coefficients obtained by solving the prescription problem, which in this case is assumed to be

an unconstrained convex quadratic minimization problem. This convex combination parameter is chosen by cross-validation. However, it is not clear how to extend this approach when

the optimization problem has constraints, or for the case of nonlinear predictive models.

Finally, we note that this approach is also based on the P&O framework. Another related

recent work is task-based end-to-end learning, where the authors focus on quadratic optimization prescriptive problems and propose neural-network-based approaches for computing


decisions [Donti et al., 2017].

Another recently proposed approach, called Predictive to Prescriptive (PtP) analytics [Bertsimas and Kallus, 2019], also uses a two-step approach, with the first step consisting of training supervised non-parametric machine learning methods (k-nearest neighbors, kernel

regression, trees and forests) to predict Y based on the covariates X. The key difference

from P&O is that in the second step, it does not directly substitute the predictions into the

optimization problem. Rather, it solves a weighted SAA with the weights dictated by the

prediction methods for that particular observation. For example, if f is a kNN predictor,

then this approach first finds the parameter k that results in the most accurate predictions of

y (minimizing the prediction error) over the training set (X^1, Y^1), . . . , (X^n, Y^n). Now, for

any x, they find the k nearest neighbors of x in the training set and solve an SAA over only

these k neighbors to compute the optimal decision z(x). They also show that this approach

is consistent, and essentially improves over the P&O by considering uncertainty in the cost

estimate E[c(z;Y )∣X = x] as opposed to substituting the estimate into the cost function as

c(z;E[Y ∣X = x]). However, this is still a two-step approach to learning decisions from data

where the procedure for computing the first step machine learning model does not take into

account the quality of decisions computed by the model.
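For intuition, the PtP pipeline with a kNN learner can be sketched in a few lines on the toy cost c(z; y) = |y + z| from the earlier example, with a one-dimensional covariate and a grid of candidate decisions (illustrative only):

```python
def y_true(x):
    # Toy outcome from the earlier example.
    if x <= 0.5:
        return 2.0
    if x <= 0.95:
        return 1.0
    return -1.0

def knn_saa_decision(x, data, k, candidates):
    # Step 1: the k nearest training covariates determine the weights.
    neighbors = sorted(data, key=lambda xy: abs(xy[0] - x))[:k]
    # Step 2: solve a weighted SAA over just those neighbors' outcomes.
    def weighted_cost(z):
        return sum(abs(y + z) for _, y in neighbors) / k
    return min(candidates, key=weighted_cost)

data = [(i / 100, y_true(i / 100)) for i in range(101)]
grid = [i / 10 for i in range(11)]   # candidate decisions in [0, 1]
print(knn_saa_decision(0.98, data, k=5, candidates=grid))  # 1.0
print(knn_saa_decision(0.20, data, k=5, candidates=grid))  # 0.0
```

Note that averaging the cost over the neighbors (rather than plugging in their average outcome) is what lets the query at x = 0.98 recover the decision z = 1.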

Our approach is similar to the PtP approach in that we use several non-parametric

machine learning algorithms as well for prediction. However, the key difference from the PtP

approach is that we find the best machine learning algorithm that leads to the best decisions

z, as opposed to the best predictions. Another way of interpreting this is that we generate

scenarios (Y, Z) jointly, while PtP, or in general the SAA approach, generates scenarios y for

computing the prescription z. To achieve this, the objective we use to train these machine

learning methods is directly based on a mix of the prescription cost and prediction error as

opposed to just the latter, which is the case in both standard P&O and PtP. The key insight

behind the prescription term is that it directly quantifies the cost of the decision making

framework induced by any predictive model f, and optimizing this avoids the two-step approach

employed by both P&O and PtP. Additionally, this incorporates the uncertainty associated


with the estimates given by the prediction methods into the optimization model, as we

consider an SAA-like weighted estimate of the expectation E[c(z;Y )∣X = x] rather than use

the point estimates in our proposed methods.

We also note the connection with the field of structured prediction, a subfield in machine

learning that seeks to predict structured objects such as sequences, images, and graphs from

feature data. This predicted output must satisfy some constraints (see Goh and Jaillet

[2016] for some examples). In our case, the structured objects are decision variables that are

input to an optimization problem, and we present non-parametric learning methods in this

setting.

We also consider the setting in which the decision affects the outcome. For many applications, such as pricing, the demand for a product is clearly affected by the price. Bertsimas

and Kallus [2017] later studied the limitations of predictive approaches to pricing problems.

In particular, they demonstrated that confounding in the data between the decision and outcome can lead to large optimality gaps if ignored. They proposed a kernel-based method for

data-driven optimization in this setting, but it does not scale well with the dimension of the

decision space. Mišić [2017] developed an efficient mixed integer optimization formulation

for problems in which the predicted cost is given by a tree ensemble model. This approach

scales fairly well with the dimension of the decision space but does not consider the need for

uncertainty penalization.

Another relevant area of research is causal inference (see Rosenbaum [2002] for an

overview), which concerns the study of causal effects from observational data. Much of

the work in this area has focused on determining whether a treatment has a significant effect

on the population as a whole. However, a growing body of work has focused on learning opti-

mal, personalized treatments from observational data. Athey and Wager [2017] proposed an

algorithm that achieves optimal (up to a constant factor) regret bounds in learning a treat-

ment policy when there are two potential treatments. Kallus [2017a] proposed an algorithm

to efficiently learn a treatment policy when there is a finite set of potential treatments.

Building on this approach, Bertsimas et al. [2019a] developed a tree-based algorithm that

35

Page 36: Predictive and Prescriptive Methods in Operations Research

learns to personalize treatment assignments from observational data. It is based on the op-

timal trees machine learning method [Bertsimas and Dunn, 2017], and has performed well in

experiments on both synthetic and real datasets. This approach involves minimizing a com-

posite objective which is a combination of prescriptive and predictive loss, which is analogous

to our objective that we consider in this chapter. In this setting, the decisions are finite,

and the objective is simply the outcome. Here, we allow continuous and multidimensional

decisions, along with potential constraints on the decisions.

Considerably less attention has been paid to problems with a continuous decision space.

Hirano and Imbens [2004] introduced the problem of inference with a continuous treatment,

and Flores [2007] studied the problem of learning an optimal policy in this setting. Recently,

Kallus and Zhou [2018] developed an approach to policy learning with a continuous decision variable that generalizes the idea of inverse propensity score weighting. Our approach

differs in that we focus on regression-based methods, which we believe scale better with the

dimension of the decision space and avoid the need for density estimation.

The idea of uncertainty penalization has been explored as an alternative to empirical risk

minimization in statistical learning, starting with Maurer and Pontil [2009]. Swaminathan

and Joachims [2015] applied uncertainty penalization to the offline bandit setting. Their setting is similar to the one we study. An agent seeks to minimize the prediction error of their

decision, but only observes the loss associated with the selected decision. They assumed that

the policy used in the training data is known, which allowed them to use inverse propensity

weighting methods. In contrast, we assume ignorability, but not knowledge of the historical

policy, and we allow for more complex decision spaces. We note that uncertainty penalization bears a superficial resemblance to the upper confidence bound (UCB) algorithms for

multi-armed bandits [Bubeck et al., 2012]. These algorithms choose the action with the

highest upper confidence bound on its predicted expected reward. Our approach, in con-

trast, chooses the action with the highest lower confidence bound on its predicted expected

reward (or lowest upper confidence bound on predicted expected cost). The difference is

that UCB algorithms choose actions with high upside to balance exploration and exploita-


tion in the online bandit setting, whereas we work in the offline setting and focus solely on exploitation.

2.1.3 Contributions

The key contributions of this work are as follows.

1. We propose a general approach for solving the prescriptive problem with uncertain

parameters in its objective as a single-step optimization problem. This framework

provides high quality prescriptions by learning from past data and accommodates

powerful non-parametric machine learning methods such as k nearest neighbors, kernel

regression, trees and forests, which have been traditionally used for prediction. That

is, we directly train these machine learning methods to find the parameters that lead to the best decisions, as opposed to the best predictions.

2. We adapt the coordinate descent approach of Dunn [2018], along with first order methods from convex optimization, to further improve the trees. We present algorithms to aid the scalability of our approach.

3. We develop an algorithmic framework for observational data-driven optimization that

allows the decision variable to take values on continuous and multidimensional sets.

4. We demonstrate the performance of the methods developed in computational experi-

ments. First, for the case where uncertainty is not affected by decisions, we apply our

methods on a portfolio optimization problem with synthetic data, and a newsvendor

problem with real data, and provide evidence that they output superior data-driven

decisions compared to state-of-the-art methods, particularly for smaller sizes of the

training set. Next, in the case where uncertainty is affected by decisions, we consider

applications in personalized medicine, in which the decision is the dose of Warfarin to

prescribe to a patient, and in pricing where the action is the list of prices for several

products in a store.


2.1.4 Structure of the chapter

The structure of this chapter is as follows. In Section 2.2, we present some background on

prescriptive analytics, and outline our approach in brief. In the first part of the chapter,

we consider the case where uncertainty Y is unaffected by the implemented decision Z. We

present more details on our approach for adapting various non-parametric learning methods

in Section 2.3, followed by algorithms for training these methods in Section 2.4. In the

second part of the chapter, we consider the case of observational data, where uncertainty

Y is affected by the implemented decision Z. We present our approach in greater detail in

Section 2.5, followed by theoretical motivation and finite-sample and generalization bounds in

Section 2.5.2. We provide computational evidence of the methods developed in this chapter

on real and synthetic data in Section 2.6, and present our conclusions in Section 2.7.

2.2 Outline of our approach

In this section, we present some background on prescriptive methods and outline our ap-

proach. We first focus on the setting in which the decision, z, does not affect the uncertainty,

y. The historical training data {(X^i, Y^i)}_{i=1}^n consists of n observations (also referred to as data points or samples). Each data point (X^i, Y^i) comprises the features (or covariates/contextual information/side information) X^i ∈ X ⊆ R^{d_x} of the ith observation and the realized uncertainty Y^i ∈ Y ⊆ R^{d_y}. When the uncertainty y is perfectly known in advance,

the decision maker has to solve a deterministic optimization problem, given by

min_{z∈Z} c(z; y)    (2.4)

to arrive at the decision z ∈ Z ⊆ R^{d_z}. However, the key challenge is that the uncertainty y

is not observed at the instant when the decision z needs to be implemented, and thus Prob-

lem (2.4) cannot be solved directly. At the time the decision needs to be made, the decision

maker has access to covariates x that potentially possess some prognostic information about


the unrealized y. In the presence of this additional knowledge, the decision maker seeks to

minimize the cost under the conditional expectation of Y ∣X = x, or equivalently, solve the

problem

min_{z∈Z} E[c(z; Y) ∣ X = x].    (2.5)

In this chapter, we consider the problem of finding a policy that, given new contextual

information x, outputs a high quality decision z(x) that leads to good prescriptive perfor-

mance, i.e., low cost c(z(x); y) when y is realized out of sample. As opposed to approaches

reliant on knowledge of the distributions of Y or Y ∣X = x, both of which are typically

unknown, we develop methods which rely on data as the starting point. As part of this ap-

proach, we adapt popular non-parametric machine learning methods – k Nearest Neighbors,

local kernel regression, decision trees, and random forests – to develop their corresponding

prescriptive methods that compute high quality decisions z directly from the covariates x.

We further illustrate this setting with an example. Consider a problem in which a portfo-

lio manager has to allocate finite capital to various stocks (or financial assets). The compli-

cation is that these allocations (or investments) z depend on the future returns of the assets

y, which are unknown at the time of deciding the allocation. But, the decision maker has

access to covariate information x at the time of making this decision, such as earnings, seasonality, Google or Twitter trends, performance of the S&P 500 index, past returns of other similar assets, and market sentiment, which could potentially contain a signal about future returns. Thus, the problem is to compute decisions z given past data (X^1, Y^1), . . . , (X^n, Y^n) and the current covariate information x.

Now, to make a decision z(x), we wish to solve Problem (2.5). Clearly, this conditional expectation is not known and needs to be estimated from the available past data. In

order to estimate this conditional expectation, we consider estimators of the form [Bertsimas

and Kallus, 2019]

f(x) = ∑_{i=1}^n w_i^f(x) c(z; Y^i),    (2.6)


where the weights are nonnegative and sum to one, i.e.,

w_i^f(x) ≥ 0  ∀ 1 ≤ i ≤ n,  and  ∑_{i=1}^n w_i^f(x) = 1.

These weights are determined by the non-parametric function f : R^{d_x} → R^{d_y}, the past training data (X^1, Y^1), . . . , (X^n, Y^n), and the observed covariate x. To be precise, we consider f such that its prediction of y, for any x, is given by

f(x) = ∑_{i=1}^n w_i^f(x) Y^i.    (2.7)

Intuitively, these weights encode the similarity between x and each of the training set covariates X^1, . . . , X^n. For example, suppose f is a tree-based estimator and x belongs to the leaf ℓ(x), which contains n(ℓ(x)) sample points, i.e.,

n(ℓ(x)) = ∣{j ∈ [n] : ℓ(X^j) = ℓ(x)}∣.

Then, the estimated conditional expectation of the cost in Equation (2.6) is given by

f(x) = (1/n(ℓ(x))) ∑_{j=1}^n c(z; Y^j) 1(ℓ(X^j) = ℓ(x)).    (2.8)

In this case, it is easy to see that the weights are given by

w_i^f(x) = 1/n(ℓ(x)) if ℓ(X^i) = ℓ(x), and 0 otherwise.

Note that this f also outputs a corresponding prediction of y for the observed x as

f(x) = (1/n(ℓ(x))) ∑_{j=1}^n Y^j 1(ℓ(X^j) = ℓ(x)).

Now with these weights, the decision z(f, x) is obtained by solving the corresponding


optimization problem

z(f, x) ∈ argmin_{z∈Z} E[c(z; Y) ∣ f, X = x],

or equivalently,

z(f, x) ∈ argmin_{z∈Z} ∑_{i=1}^n w_i^f(x) c(z; Y^i).    (2.9)

Note that since the weights are nonnegative (and provided c(z; y) is convex in z), Problem (2.9) is a convex minimization problem for each x.
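To make Problem (2.9) concrete, consider the newsvendor cost c(z; y) = h(z − y)⁺ + b(y − z)⁺, for which the weighted SAA admits a closed-form solution as a weighted quantile. The sketch below is illustrative only; the cost parameters and the function name `weighted_saa_newsvendor` are our own choices, not from the thesis.

```python
import numpy as np

def weighted_saa_newsvendor(Y, w, b=4.0, h=1.0):
    """Solve min_z sum_i w_i [h (z - Y_i)^+ + b (Y_i - z)^+].
    With nonnegative weights summing to one, a minimizer is the
    b/(b+h) weighted quantile of the Y_i."""
    order = np.argsort(Y)
    cum = np.cumsum(w[order])
    return Y[order][np.searchsorted(cum, b / (b + h))]

Y = np.array([2.0, 5.0, 7.0, 11.0])
w = np.full(4, 0.25)                 # uniform weights for illustration
z = weighted_saa_newsvendor(Y, w)    # the 0.8 weighted quantile of the sample
```

Since the objective is piecewise linear in z, an optimal solution always lies at one of the observed Y^i, so the quantile rule and a brute-force scan over the data points agree.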

The question now is how to choose the function f. We wish to ensure that the decision z(f, x) induced by f has good prescriptive performance, i.e., attains a low cost c(z(f, x); y) when the uncertainty y is realized. Thus, we formulate a problem to optimize for the function f that leads to good prescriptive performance of its decisions. We propose the following formulation in Problem (2.10), in which we optimize over functions f : R^{d_x} → R^{d_y} that lead to good prescriptive performance.

min_{f∈F, z(f,X^i)}  ∑_{i=1}^n c(z(f, X^i); Y^i)

subject to  z(f, X^i) ∈ argmin_{z∈Z} ∑_{j=1}^n w_j^f(X^i) c(z; Y^j)  ∀ 1 ≤ i ≤ n.    (2.10)

The central idea behind solving Problem (2.10) is that it directly optimizes the policy cost

of the prescriptive method used. Indeed, the ith term in the objective signifies the cost

incurred when decision z(f,X i) is implemented and uncertainty Y i is realized. The ith

constraint stipulates that each z(f,X i) is the optimal decision for X i under f and thus is

representative of the actual decision making process. In the previously described example

where f is a tree predictor, f(X i) can be written as 1n(`(x)) ∑j∶`(Xj)=`(x) c(z;Y j), where Xl

is the leaf of the tree in which X i falls into. On implementing this z(f,X i), we observe a

cost of c(z(f,X i);Y i), which depends on the realized uncertainty Y i. When we consider the

average cost of this policy imposed by f on the whole sample of n training points, we arrive

at the objective in Problem (2.10).

In other words, we train the function f while taking into account its prescriptive perfor-


mance, by noting that each z(f,X i) is the solution to an optimization problem that depends

on f itself. This is in direct contrast to traditional approaches which involve learning f based

on the predictive error, followed by solving an appropriate optimization problem over Z for

the best decision using the prediction or output of f .

We further impose the condition that the prediction function f also accurately estimates the uncertainty y. That is, we require that f deliver high quality prescriptions while its predictions also remain reasonably close to the actual values. We enforce this by choosing a loss function ℓ(⋅, ⋅), which we typically set to the squared loss, i.e., ℓ(x, y) = ∥x − y∥². We penalize the difference between the realized uncertainty Y^i and the predicted uncertainty f(X^i). Note that the predicted uncertainty is also a weighted

estimate of the training set uncertainties, with the same weights used for estimating the

conditional mean cost. We explain this penalization in greater detail in Section 2.3.5, where

we point out that in the absence of such a penalizing factor, f can become too “optimistic” in its prescriptions. Following this idea, we consider Problem (2.11), which balances both the prescription and the prediction error:

min_{f∈F, z(f,X^i)}  μ ∑_{i=1}^n c(z(f, X^i); Y^i) + (1 − μ) ∑_{i=1}^n ℓ(Y^i, f(X^i))

subject to  z(f, X^i) ∈ argmin_{z∈Z} ∑_{j=1}^n w_j^f(X^i) c(z; Y^j)  ∀ 1 ≤ i ≤ n,    (2.11)

where the prescription factor 0 < µ < 1 is a hyperparameter that controls the tradeoff between

prescription and prediction objectives. Thus, this approach unifies the two steps – prediction

and prescription – by treating this as a single-step problem. In fact, this approach can be

viewed as a generalization of Bertsimas et al. [2019a] for the case of f described by a tree function, where Z = {1, . . . , m}, with the ith constraint simply assigning unit X^i to the decision with the lowest average cost in the leaf to which X^i belongs.


2.3 Methods for Joint Predictive-Prescriptive Analyt-

ics

In this section, we describe four non-parametric machine learning methods, and how we

adapt them for the purpose of prescription.

2.3.1 k Nearest Neighbors (kNN)

In this section, we present the k Nearest Neighbors method for joint prescriptive analytics.

The classical kNN method for prediction only considers the k nearest neighbors of x in the

training set and ignores the rest [Altman, 1992]. The predicted outcome f(x) is

f(x) = (1/k) ∑_{i: X^i ∈ N_k(x)} Y^i,

where N_k(x) = {X^i : ∑_{j=1}^n 1[∥x − X^i∥ ≥ ∥x − X^j∥] ≤ k} is the set of k nearest neighbors

of x. In case of ties, we give priority to points with lower index values. In effect, the weights

w_i(x) are given by

w_i(x) = 1/k if X^i is among the k nearest neighbors of x, and 0 otherwise.

The distance metric ∥⋅∥ is usually chosen to be the Mahalanobis metric, which is

∥x − y∥²_{Σ⁻¹} = (x − y)^T Σ⁻¹ (x − y),    (2.12)

where x, y are any two points, and Σ is the sample covariance matrix of the training data.
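A minimal sketch of the Mahalanobis metric (2.12), with Σ estimated as the sample covariance of (synthetic) training data:

```python
import numpy as np

def mahalanobis_sq(x, y, Sigma_inv):
    """Squared Mahalanobis distance (x - y)^T Sigma^{-1} (x - y)."""
    d = x - y
    return float(d @ Sigma_inv @ d)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [0.5, 1.0]])  # correlated features
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = mahalanobis_sq(X[0], X[1], Sigma_inv)
```

In practice a pseudo-inverse is safer when the sample covariance is ill-conditioned.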

Applying this technique in our joint prescriptive analytics framework, for each k in a grid

of potential k values, we compute the objective

L_μ(k) = μ ∑_{i=1}^n c(z(k, X^i); Y^i) + (1 − μ) ∑_{i=1}^n ∥Y^i − (1/k) ∑_{j: X^j ∈ N_k^{−X^i}(X^i)} Y^j∥²


where

• z(k, X^i) is the optimal solution to the SAA over the k nearest neighbors of X^i, i.e.,

z(k, X^i) ∈ argmin_{z∈Z} (1/k) ∑_{j: X^j ∈ N_k^{−X^i}(X^i)} c(z; Y^j)  ∀ 1 ≤ i ≤ n,

• N_k^{−X^i}(X^i) is the set of k nearest neighbors of X^i in the training set excluding X^i. These neighbors are computed based on the Mahalanobis distance metric (Equation (2.12)).

Cross validation to compute k: Over a grid of μ values between 0 and 1, we compute L_μ(k), and find the best k for each μ as the value k*(μ) that leads to the smallest L_μ(k), i.e.,

k*(μ) = argmin_k L_μ(k).

Thus, for each μ we compute a k*(μ), and we denote this set of k values over the different μ by Ω. We then choose the final value k* as the value of k within Ω that minimizes the prescription error ∑_{i=1}^n c(z(k, X^i); Y^i), i.e.,

k* = argmin_{k∈Ω} ∑_{i=1}^n c(z(k, X^i); Y^i).    (2.13)
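The two-stage selection of k can be sketched as follows. For brevity, this toy version uses the Euclidean metric instead of (2.12) and the newsvendor cost, whose leave-one-out SAA over k neighbors reduces to a sample quantile; all names and parameter values are illustrative, not the thesis implementation.

```python
import numpy as np

def knn_loss(X, Y, k, mu, b=4.0, h=1.0):
    """L_mu(k) with leave-one-out k nearest neighbors; returns (total, prescription)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)                    # exclude X^i from its own neighbors
    presc = pred = 0.0
    q = min(int(np.ceil(b / (b + h) * k)) - 1, k - 1)
    for i in range(n):
        nbrs = np.argsort(D[i])[:k]
        z = np.sort(Y[nbrs])[q]                    # SAA solution over the neighbors
        presc += h * max(z - Y[i], 0.0) + b * max(Y[i] - z, 0.0)
        pred += (Y[i] - Y[nbrs].mean()) ** 2
    return mu * presc + (1 - mu) * pred, presc

rng = np.random.default_rng(2)
X = rng.uniform(size=(60, 2))
Y = 10.0 * X[:, 0] + rng.normal(size=60)
ks = [3, 5, 10, 20]
Omega = {min(ks, key=lambda k: knn_loss(X, Y, k, mu)[0]) for mu in (0.25, 0.5, 0.75)}
k_star = min(Omega, key=lambda k: knn_loss(X, Y, k, 1.0)[1])   # Equation (2.13)
```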

2.3.2 Nadaraya-Watson Kernel Regression (KR)

In this section, we present the Nadaraya-Watson kernel regression method for prescriptive

analytics. Nadaraya-Watson kernel regression (we shall refer to this as KR; [Nadaraya,

1964, Watson, 1964]) is a local predictive method, where the prediction for a given point x is

computed as a weighted combination of the training samples Y^i. These weights depend on how “similar” the corresponding training samples X^i are to the new point x. The prediction for x is

f(x) = ∑_{i=1}^n w_i(x, h) Y^i,

where the weights w_i(x, h) are given by

w_i(x, h) = K((X^i − x)/h) / ∑_{j=1}^n K((X^j − x)/h).

Here, h > 0 is the bandwidth parameter, which is typically tuned to a particular dataset.

K : R^{d_x} → R represents the kernel, which in this work we restrict to be nonnegative, i.e., K : R^{d_x} → R_+. Some commonly used nonnegative kernels are:

1. Uniform: K(x) = (1/2) 1[∥x∥ ≤ 1].

2. Epanechnikov: K(x) = (3/4)(1 − ∥x∥²) 1[∥x∥ ≤ 1].

3. Tricubic: K(x) = (70/81)(1 − ∥x∥³)³ 1[∥x∥ ≤ 1].

4. Gaussian: K(x) = (1/√(2π)) exp(−∥x∥²/2).
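As an illustration, the Gaussian-kernel weights can be computed as below; the normalization makes the 1/√(2π) constant irrelevant. This is a sketch on synthetic data, not the thesis code, and `nw_weights` is our own name.

```python
import numpy as np

def nw_weights(X_train, x, h):
    """Nadaraya-Watson weights w_i(x, h) with a Gaussian kernel."""
    u = np.linalg.norm((X_train - x) / h, axis=1)
    k = np.exp(-u ** 2 / 2.0)      # constant factor cancels after normalization
    return k / k.sum()

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
Y = X @ np.array([1.0, -2.0])
w = nw_weights(X, X[0], h=0.5)
f_x = w @ Y                        # kernel-regression prediction at X[0]
```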

Next, we discuss how to apply this technique within our single-step prescriptive analytics

framework. For each h, we compute the objective

L_μ(h) = μ ∑_{i=1}^n c(z(h, X^i); Y^i) + (1 − μ) ∑_{i=1}^n ∥Y^i − ∑_{j≠i} w_j(X^i, h) Y^j∥²,

where z(h, X^i) ∈ argmin_{z∈Z} ∑_{j≠i} w_j(X^i, h) c(z; Y^j)  ∀ 1 ≤ i ≤ n.

Cross validation to compute h: Over a grid of μ values between 0 and 1, we compute L_μ(h), and find the best h for each μ as the value h*(μ) that leads to the smallest L_μ(h), i.e.,

h*(μ) = argmin_h L_μ(h).

For each μ we thus compute an h*(μ), and we denote this set of h values by Ω. We then choose the final value h* as the value of h within Ω that leads to the smallest prescription error ∑_{i=1}^n c(z(h, X^i); Y^i). To be precise,

h* = argmin_{h∈Ω} ∑_{i=1}^n c(z(h, X^i); Y^i).


2.3.3 Trees

Traditionally, regression (or classification) trees for the purpose of prediction are trained

by choosing splits that lead to low prediction error. These trees are trained by recursively

partitioning the X space into leaves in order to minimize the least squared error, or some

other metric such as log deviance. In these trees, each leaf predicts the same value for all

the points falling in it. The predictions given by a tree τ with L leaves can be written as

τ(x) = ∑_{i=1}^L γ_i 1(x ∈ X_i),

where {(X_i, γ_i)}_{i=1}^L are the leaves and values that parametrize τ and are estimated from the data.

In this section, we outline the problem formulation for learning trees that lead to high

quality decisions directly from data. Given a tree τ with L leaves denoted by X1, . . . ,XL, and

a candidate x, the PtP approach dictates that we solve the following weighted SAA problem

to obtain the decision z(τ, x) as

z(τ, x) ∈ argmin_{z∈Z} (1/n(ℓ_τ(x))) ∑_{i=1}^n ∑_{j=1}^L 1[x ∈ X_j] 1[X^i ∈ X_j] c(z; Y^i).

Equivalently, we can write this as

z(τ, x) ∈ argmin_{z∈Z} (1/n(ℓ_τ(x))) ∑_{i: ℓ_τ(X^i) = ℓ_τ(x)} c(z; Y^i),

where

• ℓ_τ(x) denotes the leaf of the tree τ to which x belongs, and

• n(ℓ) is the number of training samples in the leaf ℓ.

Following our approach and using the above observation, the problem of learning the tree τ that leads to good decisions can be formulated as follows:

min_{τ, {z(τ,X^i)}_{i=1}^n}  L_μ(τ) = μ ∑_{i=1}^n c(z(τ, X^i); Y^i) + (1 − μ) ∑_{i=1}^n ∥Y^i − (1/n(ℓ_τ(X^i))) ∑_{j: ℓ_τ(X^j) = ℓ_τ(X^i)} Y^j∥²

subject to  z(τ, X^i) ∈ argmin_{z∈Z} ∑_{j: ℓ_τ(X^j) = ℓ_τ(X^i)} c(z; Y^j)  ∀ 1 ≤ i ≤ n.    (2.14)

Thus, the tree which includes the splits and decisions associated with each leaf is computed

by minimizing the net objective in Problem (2.14).

Due to the discrete nature of the tree τ, which splits the X space into leaves, each leaf ℓ has a decision z(τ, ℓ) associated with it, namely the solution of the SAA problem solved over the samples in that leaf. With this observation, Problem (2.14) can be equivalently written as

min_{τ, {z(τ,ℓ_j)}_{j=1}^L}  μ ∑_{j=1}^L ∑_{i ∈ ℓ_j} c(z(τ, ℓ_j); Y^i) + (1 − μ) ∑_{i=1}^n ∥Y^i − (1/n(ℓ(X^i))) ∑_{j ∈ ℓ(X^i)} Y^j∥²

subject to  z(τ, ℓ_j) ∈ Z  ∀ 1 ≤ j ≤ L.    (2.15)

2.3.4 Random Forests

We follow ideas from Breiman [2001], who extended decision trees by training several trees on randomly chosen subsamples of the data and aggregating their individual outputs in order to reduce the variance of the predictions. The decision prescribed by the forest T = {τ_k}_{k=1}^K, a collection of K trees, is given by

z(T, X^i) ∈ argmin_{z∈Z} (1/K) ∑_{k=1}^K (1/n(ℓ_k(X^i))) ∑_{j ∈ ℓ_k(X^i)} c(z; Y^j),

where ℓ_k(x) denotes the leaf of the kth tree to which x belongs.
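These forest weights can be sketched with scikit-learn, whose `apply` method returns the leaf index of each point in every tree. The weights here are computed over the full training set for simplicity (out-of-bag book-keeping and the thesis's own forest training are omitted), so this is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x):
    """Average over trees of the per-leaf uniform weights."""
    leaves_tr = forest.apply(X_train)             # shape (n, K): leaf per tree
    leaves_x = forest.apply(x.reshape(1, -1))[0]  # shape (K,)
    w = np.zeros(len(X_train))
    for k in range(leaves_tr.shape[1]):
        mask = leaves_tr[:, k] == leaves_x[k]
        w += mask / mask.sum()                    # 1/n(leaf_k) on matches
    return w / leaves_tr.shape[1]

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 3))
Y = X[:, 0] ** 2 + 0.1 * rng.normal(size=150)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y)
w = forest_weights(rf, X, X[0])
```

The prescription then solves the weighted SAA min_{z∈Z} ∑_i w_i c(z; Y^i) with these weights.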

Following our approach and using the above observation, the problem of solving for the prescriptive random forest T can be formulated as

min_T  μ ∑_{i=1}^n c(z(T, X^i); Y^i) + (1 − μ) ∑_{i=1}^n ∥Y^i − (1/K) ∑_{k=1}^K (1/n(ℓ_k(X^i))) ∑_{j ∈ ℓ_k(X^i)} Y^j∥²

subject to  z(T, X^i) ∈ argmin_{z∈Z} (1/K) ∑_{k=1}^K (1/n(ℓ_k(X^i))) ∑_{j ∈ ℓ_k(X^i)} c(z; Y^j)  ∀ 1 ≤ i ≤ n.    (2.16)

2.3.5 Penalizing the prediction error of f

In this section, we elaborate further on why the prediction error needs to be penalized along

with the prescriptive loss which quantifies the quality of the decisions induced by any f .

Note that in our main Problem (2.11), we minimize the objective with respect to both f

and each of the z(f,X i) variables. Suppose the cost function c is linear in the uncertainty

and decision variable, i.e., c(z; y) = y′z. As pointed out by Elmachtoub and Grigas [2017], under the P&O framework (which is identical to PtP when c is linear), the ith constraint stipulates that each z(f, X^i) is a solution of min_{η∈Z} η′f(X^i). Note that if f(x) ≡ 0, then the ith constraint reduces simply to z(f, X^i) ∈ Z. Each z(f, X^i) can then be chosen in argmin_{η∈Z} η′Y^i, which attains the smallest possible value of the prescription objective, and thus f = 0 is trivially optimal in the absence of a prediction penalty.

This problem persists even under the PtP framework, i.e., when z(f, X^i) is chosen as the solution of a re-weighted SAA with weights depending on X^i. Suppose f(x) = ∑_{i=1}^n w_i(x) Y^i, where w_i(x) ≥ 0 for all i. In this case, each z(f, x) is a solution of min_{η∈Z} η′(∑_{i=1}^n w_i(x) Y^i). Suppose these weights are chosen via trees. This could incentivize splits that lead to a leaf ℓ where ∑_{i: X^i ∈ ℓ} Y^i = 0, which again leads to the same issue mentioned above.

To mitigate this issue, we stipulate that along with good prescriptive performance of f ,

the predicted value f(x) be close to the true value y. We use cross validation to choose the

prescription factor 0 < µ < 1 that balances both these errors.


2.4 Optimization algorithms

In this section, we present algorithms for learning the prescriptive versions of the trees and

forests outlined in Section 2.3. We present two algorithms for training these trees: the first is a greedy algorithm based on the recursive CART heuristic [Breiman et al., 1984], and the second is based on its more recent coordinate descent improvement [Bertsimas and Dunn, 2019]. First, we present the greedy algorithm for training these trees in Algorithm (4),

and its extension to random forests in Algorithm (5). We only describe these algorithms

here, and defer the full details to the Appendix.

2.4.1 Greedy algorithm for learning trees

The greedy algorithm outlined in Algorithm (4) attempts to find a partition of the covariate

space in order to minimize the net loss. It does so by iterating over axis-parallel splits (splits

of the form xi ≤ a and xi > b) to find the best split at each level of the tree. This proceeds

recursively on the training set until either the maximum depth ∆max of the tree is reached,

or if any further split results in fewer than n_min samples in a leaf. The parameters n_min, Δ_max are hyperparameters that need to be set by the user or can be chosen via cross-validation. Note that while the routine GreedyTree is written in Algorithm (4) for any input data set S, we obtain the greedy tree optimized on the training set by calling GreedyTree(S_n = {(X^1, Y^1), . . . , (X^n, Y^n)}, 0).

Once this tree is trained, we use cost complexity pruning (as detailed in Section 2.4

of Dunn [2018]) to regularize the tree and control for overfitting. We omit this in the

presentation of our algorithm for the sake of brevity.
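The core of the greedy step, scanning axis-parallel splits against the composite objective, can be sketched as follows, again with the newsvendor cost and squared prediction loss standing in for the general c and ℓ; `leaf_cost`, `best_split`, and all parameter values are illustrative, not the thesis implementation.

```python
import numpy as np

def leaf_cost(Y, mu, b=4.0, h=1.0):
    """Composite cost of one leaf: SAA prescription cost plus squared prediction error."""
    if len(Y) == 0:
        return 0.0
    z = np.sort(Y)[min(int(np.ceil(b / (b + h) * len(Y))) - 1, len(Y) - 1)]
    presc = np.sum(h * np.maximum(z - Y, 0.0) + b * np.maximum(Y - z, 0.0))
    pred = np.sum((Y - Y.mean()) ** 2)
    return mu * presc + (1 - mu) * pred

def best_split(X, Y, mu, n_min=5):
    """Scan splits x_j <= t and return (cost, feature, threshold) minimizing leaf costs."""
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            if n_min <= left.sum() <= len(Y) - n_min:
                cost = leaf_cost(Y[left], mu) + leaf_cost(Y[~left], mu)
                if cost < best[0]:
                    best = (cost, j, t)
    return best

rng = np.random.default_rng(5)
X = rng.uniform(size=(80, 2))
Y = np.where(X[:, 0] > 0.5, 20.0, 5.0) + rng.normal(size=80)
cost, j, t = best_split(X, Y, mu=0.5)   # should recover the split on feature 0 near 0.5
```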

2.4.2 Prescriptive Random Forests

Next, we present our algorithm for training prescriptive random forests. Recall that a random

forest for predictive purposes such as classification or regression [Breiman, 2001] with K trees is

computed by training each tree greedily, and the final output is obtained by aggregating the


individual tree predictions. We follow a similar approach here as well, where we compute

each of the K trees by training them in a greedy manner (via Algorithm (4)). In effect,

this results in an approximate solution to Problem (2.16). The idea is that aggregating

individual trees can potentially reduce variance while simultaneously not increasing the bias

significantly, assuming the trees are sufficiently independent. For the sake of completeness,

we present the algorithm to train the random forests in Algorithm (5). The random forests

algorithm is similar to the one for trees, but with an additional parameter 0 < α < 1 which restricts the number of features available to each tree to ⌊α d_x⌋ (one can also consider √d_x) in order to promote independence among the trees for variance reduction.

We note that there is one key difference in generating prescriptions z(x) for any x,

compared to the traditional method of averaging the outputs of the individual trees in

regression. For any new X^0, we do not set the prescription z(X^0) equal to the average (1/K) ∑_{k=1}^K z(τ_k, X^0) (which could be infeasible); instead we obtain it by solving the weighted optimization problem

z(X^0) ∈ argmin_{z∈Z} (1/K) ∑_{k=1}^K (1/n(ℓ_k(X^0))) ∑_{j ∈ ℓ_k(X^0)} c(z; Y^j),

where X^0 falls in leaf ℓ_k(X^0) of tree τ_k for each k ∈ [K].

2.4.3 Local search algorithms for Prescriptive Trees

In this section, we describe the local search procedure to further improve the trees computed

via the greedy algorithm (4). We note that the top-down nature of Algorithm (4) can lead to locally optimal solutions, and the following approach iteratively improves the prescriptive tree until no further improvement can be found.

The local search algorithm takes as input a prescriptive tree, which we provide by calling

Algorithm (4). The search procedure iterates over a randomized ordering of the nodes of the

input tree, and each node is further improved by any of the following three steps (with the

first two steps applying to any non-leaf node):


• perturbing the split (both feature and threshold value) to improve the net objective,

or

• deleting the split, and replacing it with either its left or right children, or

• finally, if the node is a leaf, then creating a new split (provided the minimum leaf size

and maximum depth conditions are satisfied).

Once there is no further improvement, then we terminate the search and return the final tree.

Additionally, we run this search procedure from various starting points and choose the best

final solution out of all the potential solutions identified in this process. The local search

algorithm is detailed in Algorithm (6). As part of this algorithm, we call the subroutine

OptimizeNode which takes as input a candidate node, and the subset of training data that

falls into this node, and outputs an improved node by either perturbing the split, deleting

the node, or further branching if it is a leaf node. For perturbing the split and potentially

improving it, we define another subroutine PerturbSplit, that varies the split parameters,

and calculates the new error by updating both subtrees rooted at left and right children nodes

of the candidate node. Next, we present two additional ideas which further improve the local

search algorithm.

Pruning splits based on prediction error

This idea uses the fact that the objective comprises two parts: the prescription and the prediction error. For μ < 1, the prediction error contributes to the objective and is typically easier to compute than the prescription error, which involves solving constrained optimization problems. We use this term to narrow down the list of potential splits in our search. At each node, while choosing a potential split, we rank all the O(n d_x) candidate splits in increasing order of the resulting prediction error. Then, we compute the prescription term for only the top few of these splits, and finally choose the one that leads to the lowest composite objective.


Caching for warm starts

We store previously computed results in memory and use them when the same result is needed

again, rather than recomputing them. This is particularly useful when a lot of candidate

subproblems need to be evaluated and some of the computations may be repeated. Applying

this idea here, we store the optimal solutions z and the prescription and prediction errors for each of the candidate splits evaluated so far. Suppose we have computed and stored these values for each of M index sets I_1, . . . , I_M. When our algorithm wishes to evaluate the objective for a new set I, we use the stored solution with the least cost under the new objective as an initial starting point for the first order algorithms.

Finally, we need to compute a tree for different values of μ. Since μ only dictates how the prescription and prediction terms are weighted, we can reuse the individually stored values of these two terms for each split by looking up the values computed for previous choices of μ.
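The caching idea can be sketched as follows: the expensive prescription and prediction terms are stored per index set, so a cache hit (including a query for a new μ) recombines them without re-optimizing. The names and stand-in loss functions below are hypothetical.

```python
# Cache keyed by the (unordered) index set of a candidate leaf: store the two
# loss terms once, recombine them for any prescription factor mu.
cache = {}

def composite_loss(indices, mu, presc_fn, pred_fn):
    key = frozenset(indices)
    if key not in cache:
        cache[key] = (presc_fn(indices), pred_fn(indices))  # the expensive part
    presc, pred = cache[key]
    return mu * presc + (1 - mu) * pred

calls = []
presc_fn = lambda idx: calls.append("solve") or float(len(idx))  # stand-in SAA solve
pred_fn = lambda idx: float(sum(idx))                            # stand-in squared loss

l1 = composite_loss([1, 2, 3], 0.5, presc_fn, pred_fn)
l2 = composite_loss([3, 2, 1], 0.9, presc_fn, pred_fn)  # cache hit: no second solve
```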

The full algorithm involves computing a greedy tree with Algorithm (4) with caching and

sorting splits based on predictive error. Next, using this tree as a warm start, we run the

local search algorithm with caching (to compute the optimal decisions within each leaf). We

repeat this using different warm starts, and choose the best among all the generated trees.

The full algorithm, Algorithm (9), is presented in the Appendix.

2.5 Observational data with decision-dependent uncer-

tainty

In this section, we extend the approach developed so far to the setting in which the decision,

Z, affects the realization of the uncertainty, Y . From a mathematical standpoint, we study

the problem in which a decision maker seeks to optimize a known objective function that

depends on an uncertain quantity. We allow the auxiliary covariates X, decision variable

z, and outcome Y to take values on multi-dimensional, continuous sets. A decision-maker

52

Page 53: Predictive and Prescriptive Methods in Operations Research

seeks to minimize the conditional expected cost:

min_{z∈Z} E[c(z; Y(z)) ∣ X = x].    (2.17)

Since the distribution of Y (z) is unknown, it is not possible to solve Problem (2.17)

exactly. However, we assume that we have access to observational data, consisting of n

independent and identically distributed observations, (X i, Zi, Y i) for i = 1, . . . , n. Recall

that each of these observations consists of an auxiliary covariate vector, a decision, and an

observed outcome. This type of data presents two challenges that differentiate our prob-

lem from a predictive machine learning problem. First, it is incomplete. We only observe

Y i ∶= Y i(Zi), the outcome associated with the applied decision. We do not observe what

the outcome would have been under a different decision. Second, the decisions were not

necessarily chosen independently of the outcomes, as they would have been in a randomized

experiment, and we do not know how the decisions were assigned. Following common prac-

tice in the causal inference literature, we make the ignorability assumption of Hirano and

Imbens [2004].

Assumption 1 (Ignorability).

Y(z) ⊥⊥ Z ∣ X  ∀ z ∈ Z.

In other words, we assume that historically the decision Z has been chosen as a function

of the auxiliary covariates X. There were no unmeasured confounding variables that affected

both the choice of decision and the outcome. In fact, Bertsimas and Kallus [2017] show that

without this assumption, the problem is not well posed. Under this assumption, we are able

to rewrite the objective of (2.17) as

E[c(z;Y ) ∣X = x,Z = z].

This form of the objective is easier to learn because it depends only on the observed outcome,


not on the counterfactual outcomes. A direct approach to solve this problem is to use a

regression method to predict the cost as a function of x and z and then choose z to minimize

this predicted cost. If the selected regression method is uniformly consistent in z, then the

action chosen by this method will be asymptotically optimal under certain conditions. (We

will formalize this later.) However, this requires choosing a regression method that ensures

the optimization problem is tractable. For this work, we restrict our attention to linear and

tree-based methods, such as CART [Breiman et al., 1984] and random forests [Breiman,

2001], as they are both effective and tractable for many practical problems.

A key issue with the direct approach is that it tries to learn too much. It tries to learn

the expected outcome under every possible decision, and the level of uncertainty associated

with the predicted expected cost can vary between different decisions. This method can lead

us to select a decision which has a small point estimate of the cost, but a large uncertainty

interval, i.e., high variance of the cost.

Because of Assumption 1,

$\mathbb{E}[c(z; Y(z)) \mid X = x] = \mathbb{E}[c(z; Y) \mid X = x, Z = z],$

so we focus on learning c(z;Y ) as a function of x and z. We jointly learn and make decisions

as in (2.11), with a few modifications.

We emphasize that our methods assume access to past data and no additional information

about this distribution. Finally, we note that in this chapter we only consider the single

period problem and not the multiperiod one, in which the decision maker can make some

decisions after the uncertainty is realized. Also, we consider the setting where the constraint

space Z is unaffected by the uncertainty y, i.e., y only affects the cost c.

2.5.1 Uncertainty Penalization and Parameter Tuning

In addition, we introduce the idea of uncertainty penalization to prevent the method from

making decisions that have a small, but highly uncertain, predicted cost. We define the bias


and variance of the estimators, and introduce a composite objective (which we will describe in

greater detail in (2.18)) that penalizes the sum of bias and variance terms, each multiplied

by parameters λ1 and λ2. While this results in a different problem than in Section 2.3

and introduces additional parameters, we learn these the same way as before – tuning for

prescriptive performance. In order to tune the parameters λ1, λ2, and any other tuning

parameters associated with F , we perform cross validation. We split the data into a training

and validation set, train f on the training data, and compute the objective of (2.18) on the

validation data with fixed f . We repeat this for various combinations of tuning parameters

and then select the best combination.

Because Y is now a function of z, we do not know what the outcome would have been

if we had chosen a different decision. We need to estimate the counterfactuals with the

machine learning method. However, if we do not restrict the machine learning method, it

can choose trivial solutions that will lead to a very small objective value for (2.11). For

example, if $c(z; Y(z)) = Y(z) \ge 0$, an optimal tree may isolate a single training example with

very small Y (z) and then propose decisions such that all of the new observations fall in that

leaf with small predicted cost. This single example has little impact on the predictive error,

but an outsized impact on the prescriptive error. As before, we denote the machine learning

estimator, f ∈ F , as a linear combination of the training examples:

$f(x, z) = \sum_{i=1}^{n} w_i^f(x, z) \, c(z; Y^i).$

In order to prevent the learning from picking an estimator that is overly optimistic about

its predicted costs, we require that the weights, $w_i^f(x, z)$, satisfy a generalization of the honesty

property of Wager and Athey [2018].

Assumption 2 (Honesty). The model trained on $(X^1, Z^1, Y^1), \ldots, (X^n, Z^n, Y^n)$ is honest,

i.e., the weights, $w_i^f(x, z)$, are determined independently of the responses, $Y^1, \ldots, Y^n$.

For tree-based methods, this assumption can be satisfied by either ignoring the response

variables while building the tree, or by separating the data into two sets and using one set


to make splits and the other to make predictions in the leaves. If Assumption 2 holds, the

conditional variance of $f(x, z)$ given $(X^1, Z^1), \ldots, (X^n, Z^n)$ is given by

$V^f(x, z) := \sum_i (w_i^f(x, z))^2 \, \mathrm{Var}(c(z; Y^i) \mid X^i, Z^i).$

Because f(x, z) may be a biased predictor, we also introduce a term that penalizes the

conditional bias of the predicted cost given $(X^1, Z^1), \ldots, (X^n, Z^n)$. Since the true cost is

unknown, it is not always possible to exactly compute this bias. Instead, we compute an

upper bound under a Lipschitz assumption (details in Section 2.5.2).

$B^f(x, z) := \sum_i w_i^f(x, z) \, \|(X^i, Z^i) - (x, z)\|_2.$
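As an illustration, both penalty terms are simple functions of the weight vector. The following numpy sketch (with illustrative helper names, and assuming homoscedastic noise with known variance `sigma2`) evaluates the two penalties at a query point:

```python
import numpy as np

def penalties(w, XZ_train, xz, sigma2=1.0):
    """Evaluate the variance and bias penalty terms for honest weights.

    w        -- length-n vector of weights w_i^f(x, z)
    XZ_train -- (n x d) array stacking the training pairs (X^i, Z^i)
    xz       -- length-d vector stacking the query point (x, z)
    sigma2   -- assumed homoscedastic conditional variance of c(z; Y^i)
    """
    w = np.asarray(w, dtype=float)
    # V^f(x, z) = sum_i w_i^2 Var(c(z; Y^i) | X^i, Z^i), homoscedastic case
    V = sigma2 * np.sum(w ** 2)
    # B^f(x, z) = sum_i w_i * || (X^i, Z^i) - (x, z) ||_2
    B = np.sum(w * np.linalg.norm(XZ_train - xz, axis=1))
    return V, B
```

With uniform weights over a tree leaf, `V` shrinks as the leaf gains training points and `B` shrinks as the leaf's diameter shrinks, matching the tree intuition discussed in this section.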

With these penalty terms, we rewrite (2.11) as:

$\min_{f \in \mathcal{F},\; z(f, X^i)} \;\; \mu \sum_{i=1}^{n} c(z(f, X^i); Y^i) + (1 - \mu) \sum_{i=1}^{n} \ell(Y^i, f(X^i, Z^i))$

$\text{subject to} \;\; z(f, X^i) \in \operatorname*{argmin}_{z \in \mathcal{Z}} \; f(X^i, z) + \lambda_1 V^f(X^i, z) + \lambda_2 B^f(X^i, z) \quad \forall\, 1 \le i \le n, \qquad (2.18)$

where $\lambda_1$ and $\lambda_2$ are parameters that are tuned by cross validation. These modifications

to (2.11) are important because we use the machine learning estimator to impute counter-

factual outcomes. Otherwise, the method can be overly optimistic and perform poorly out

of sample. When µ = 0, there is no prescriptive component in the objective, so we fix the

values of λ1 and λ2 to be 0.

As a concrete example, we consider trees as the machine learning method. The variance

penalty term, $V^f(x, z)$, will be small when the proposed decision is in a leaf with many

training examples. The bias penalty term, $B^f(x, z)$, will be small when the proposed decision

is in a leaf with a small diameter, i.e., the training examples are close to the new observation

and proposed decision. Intuitively, it is clear why we would like both of these penalty terms to be small.

Before proceeding, we note that the variance terms, $\mathrm{Var}(c(z; Y^i) \mid X^i, Z^i)$, are usually


unknown in practice. In the absence of further knowledge, we assume homoscedasticity,

i.e., $\mathrm{Var}(Y^i \mid X^i, Z^i)$ is constant. It is possible to estimate this value by training a machine

learning model to predict $Y^i$ as a function of $(X^i, Z^i)$ and computing the mean squared error

on the training set. However, it may be advantageous to simply fold this value into the tuning parameter $\lambda_1$.

2.5.2 Theoretical Results

In this section, we describe the theoretical motivation for our approach and provide finite-

sample generalization and regret bounds. For notational convenience, we define the true conditional expected cost

$\bar{f}(x, z) := \mathbb{E}[c(z; Y) \mid X = x, Z = z].$

Before presenting the results, we first present a few additional assumptions.

Assumption 3 (Weights). For all $(x, z) \in \mathcal{X} \times \mathcal{Z}$, $\sum_{i=1}^{n} w_i^f(x, z) = 1$, and for all $i$, $w_i^f(x, z) \in [0, 1/\gamma_n]$. In addition, $\mathcal{X} \times \mathcal{Z}$ can be partitioned into $\Gamma_n$ regions such that if $(x, z)$ and $(x, z')$ are in the same region, $\|w^f(x, z) - w^f(x, z')\|_1 \le \alpha \|z - z'\|_2$.

This assumption is trivially satisfied with weight functions derived from a tree-based

machine learning method. Γn is the maximum number of leaves in the tree, γn is the

minimum number of training examples in a leaf, and α = 0.

Assumption 4 (Regularity). The set X ×Z is nonempty, closed, and bounded with diameter

D.

Assumption 5 (Objective Conditions). The objective function satisfies the following prop-

erties:

1. $|c(z; y)| \le 1$ for all $z, y$.

2. For all $y \in \mathcal{Y}$, $c(\cdot\,; y)$ is $L$-Lipschitz.

3. For any $x, x' \in \mathcal{X}$ and any $z, z' \in \mathcal{Z}$, $|\bar{f}(x, z) - \bar{f}(x', z')| \le L \|(x, z) - (x', z')\|$.


These assumptions provide some conditions under which the generalization and regret

bounds hold, but similar results hold under alternative sets of assumptions (e.g., if $c(z; Y) \mid Z$ is subexponential instead of bounded). With these additional assumptions, we have the

following generalization bound. All proofs are contained in the appendix.

Theorem 1. Suppose Assumptions 1-5 hold. Then, with probability at least $1 - \delta$,

$\bar{f}(x, z) - f(x, z) \le \frac{4}{3\gamma_n} \ln(K_n/\delta) + 2\sqrt{V^f(x, z) \ln(K_n/\delta)} + L \cdot B^f(x, z) \quad \forall z \in \mathcal{Z},$

where $K_n = \Gamma_n \left( 9D\gamma_n \left( \alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3) \right) \right)^{d_z}$.

This result uniformly bounds, with high probability, the true cost of action $z$ by the sum of the

predicted cost, $f(x, z)$, a term depending on the uncertainty of that predicted cost, $V^f(x, z)$, and a term proportional to the bias associated with that predicted cost, $B^f(x, z)$. It is easy

to see how this result motivates the approach described in (2.18). One can also verify that the

generalization bound still holds if $(X^1, Z^1), \ldots, (X^n, Z^n)$ are chosen deterministically, as long

as $Y^1, \ldots, Y^n$ are still independent. Using Theorem 1, we are able to derive a finite-sample

regret bound.

Theorem 2. Suppose Assumptions 1-5 hold. Define

$z^* \in \operatorname*{argmin}_{z} \; \bar{f}(x, z), \qquad \hat{z} \in \operatorname*{argmin}_{z} \; f(x, z) + \lambda_1 \sqrt{V^f(x, z)} + \lambda_2 B^f(x, z).$

If $\lambda_1 = 2\sqrt{\ln(2K_n/\delta)}$ and $\lambda_2 = L$, then with probability at least $1 - \delta$,

$\bar{f}(x, \hat{z}) - \bar{f}(x, z^*) \le \frac{2}{\gamma_n} \ln(2K_n/\delta) + 4\sqrt{V^f(x, z^*) \ln(2K_n/\delta)} + 2L \cdot B^f(x, z^*),$

where $K_n = \Gamma_n \left( 9D\gamma_n \left( \alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3) \right) \right)^{d_z}$.

By this result, the regret of the approach defined in (2.18) depends only on the variance

and bias terms of the optimal action, $z^*$. Because the predicted cost is penalized by $V^f(x, z)$

and $B^f(x, z)$, it does not matter how poor the prediction of the cost is at suboptimal actions. Theorem 2 immediately implies the following asymptotic result, assuming the auxiliary feature

space and decision space are fixed as the training sample size grows to infinity.

Corollary 1. In the setting of Theorem 2, if $\gamma_n = \Omega(n^\beta)$ for some $\beta > 0$, $\Gamma_n = O(n)$, and $B^f(x, z^*) \to_p 0$ as $n \to \infty$, then

$\bar{f}(x, \hat{z}) \to_p \bar{f}(x, z^*)$

as $n \to \infty$.

The assumptions can be satisfied, for example, with CART or random forest as the

learning algorithm with parameters set in accordance with Lemma 2 of Wager and Athey

[2018]. This next example demonstrates that there exist problems for which the regret of

the method with µ > 0 is strictly better, asymptotically, than the regret of the method with

µ = 0.

Example 1. Suppose there are m + 1 different actions and two possible, equally probable

states of the world. In one state, action 0 has a cost that is deterministically 1, and all

other actions have a random cost that is drawn from a $\mathcal{N}(0,1)$ distribution. In the other state,

action 0 has a cost that is deterministically 0, and all other actions have a random cost,

drawn from a N (1,1) distribution. Suppose the training data consists of m trials of each

action. If f(j) is the empirical average cost of action j, then the method with µ = 0 selects the

action that minimizes $f(j)$. The method with $\mu > 0$ adds a penalty of the form suggested by

Theorem 2, $\lambda_1 \sqrt{\sigma_j^2 \ln m / m}$. If $\lambda_1 \ge \sqrt{2}$, the (Bayesian) expected regret of the method with $\mu > 0$ is

asymptotically strictly less than the expected regret of the method with $\mu = 0$, i.e., $\mathbb{E}R_\mu = o(\mathbb{E}R_0)$, where the expectations are taken over both the training data and the unknown state of the

world.

This example is simple but demonstrates that there exist settings in which the method

with µ = 0 is asymptotically inferior to our method. In addition, the proof illustrates

how one can construct tighter regret bounds than the one in Theorem 2 for problems with

specific structure.


2.5.3 Tractability

The tractability of the method depends on the algorithm that is used as the predictive model.

For many kernel-based methods, the resulting optimization problems are highly nonlinear

and do not scale well when the dimension of the decision space is more than 2 or 3. For this

reason, we advocate for the use of tree-based and linear models as the predictive model. Tree-based

models partition the space $\mathcal{X} \times \mathcal{Z}$ into $\Gamma_n$ leaves, so there are only $\Gamma_n$ possible values

of $w^f(x, z)$. Therefore, we can solve the problem separately for each leaf. For $j = 1, \ldots, \Gamma_n$,

we solve

$\min_z \;\; f(x, z) + \lambda_1 \sqrt{V(x, z)} + \lambda_2 B(x, z) \quad \text{s.t.} \;\; z \in \mathcal{Z}, \;\; (x, z) \in L_j, \qquad (2.19)$

where $L_j$ denotes the subset of $\mathcal{X} \times \mathcal{Z}$ that makes up leaf $j$ of the tree. Because each split in the

tree is defined by a hyperplane, $L_j$ is an intersection of halfspaces and thus is a polyhedral

set. Clearly, B(x, z) is a convex function in z, as it is a nonnegative linear combination of

convex functions. If we assume homoscedasticity, then V (x, z) is constant for all (x, z) ∈ Lj.

If c(z; y) is convex in z and Z is a convex set, (2.19) is a convex optimization problem and

can be solved by convex optimization techniques. Furthermore, since the Γn instances of

(2.19) are all independent, we can solve them in parallel. Once (2.19) has been solved for all

leaves, we select the solution from the leaf with the overall minimal objective value.
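Schematically, this per-leaf decomposition is a loop over leaves, each contributing one convex subproblem. In the sketch below, `f` and `V` are the leaf's constant prediction and variance penalty (constant under homoscedasticity), and `solve_B` stands for a user-supplied convex solver for the bias term over the leaf's polyhedron; all names are illustrative:

```python
import numpy as np

def prescribe_tree(leaves, lam1, lam2):
    """Solve the per-leaf problem (2.19) and keep the best leaf.

    `leaves` is a list of dicts holding the leaf's constant predicted
    cost 'f', constant variance penalty 'V', and a callable 'solve_B'
    that minimizes the bias penalty B(x, z) over z in the leaf's
    polyhedron, returning (z_opt, B_min).  In practice the Gamma_n
    subproblems are independent and can be solved in parallel.
    """
    best_z, best_val = None, np.inf
    for leaf in leaves:
        z_opt, B_min = leaf['solve_B']()      # convex subproblem for this leaf
        val = leaf['f'] + lam1 * np.sqrt(leaf['V']) + lam2 * B_min
        if val < best_val:
            best_val, best_z = val, z_opt
    return best_z, best_val
```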

For tree ensemble methods, such as random forest [Breiman, 2001] or xgboost [Chen

and Guestrin, 2016], optimization is more difficult. We compute optimal decisions using a

coordinate descent heuristic. From a random starting action, we cycle through the decision

variables, holding all fixed except one and optimizing that coordinate over a discretization. We

repeat this until convergence, from several different random starting decisions. For linear

predictive models, the resulting problem is often a second order conic optimization problem,

which can be handled by off-the-shelf solvers (details given in the appendix).
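The coordinate descent heuristic for tree ensembles can be sketched as follows, where `predict_cost` stands in for the ensemble's penalized cost estimate at the query covariates (a sketch only; in practice it is restarted from several random initial decisions and the best result is kept):

```python
import numpy as np

def coordinate_descent(predict_cost, z0, grids, n_iters=20):
    """Cycle through the coordinates of the decision z, optimizing one
    at a time over a discretized grid while holding the others fixed.

    predict_cost -- callable z -> penalized cost estimate f(x, z)
    z0           -- initial decision vector
    grids        -- grids[j] is the candidate grid for coordinate j
    """
    z = np.array(z0, dtype=float)
    for _ in range(n_iters):
        changed = False
        for j, grid in enumerate(grids):
            costs = []
            for v in grid:                 # evaluate coordinate candidates
                z_try = z.copy()
                z_try[j] = v
                costs.append(predict_cost(z_try))
            best = grid[int(np.argmin(costs))]
            if best != z[j]:
                z[j] = best
                changed = True
        if not changed:                    # a full cycle made no change
            break
    return z
```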


2.6 Computational Experiments

In this section, we apply our proposed methods on two problems – the portfolio optimization

problem, and the newsvendor problem.

First, we explain the setup of our experiments. For each value of n, we generate a

training set $(X^1, Y^1), \ldots, (X^n, Y^n)$. Next, we generate the test set, which we denote with

subscript $T$, as $(X_T^1, Y_T^1), \ldots, (X_T^{n_T}, Y_T^{n_T})$ with $n_T = 10{,}000$. Note that both training and

test sets are generated via the same distribution. In our experiments, we calculate the

optimal decision $z(X_T^i)$ for each $X_T^i$ in the test set, and report the average out of sample

test set cost $c(z(X_T^i); Y_T^i)$. We further average this out of sample cost over twenty different

realizations of the training and test sets, and repeat this for various values of the training

set size n.

For each of these problems, we compare our methods with other methods – SAA, P&O,

PtP, along with the oracle. We denote our method as JPP, for Joint Predictive-Prescriptive

Analytics. We compare the prescriptive performance of the following five methods:

1. SAA: simply the SAA solution, where $z_{\mathrm{SAA}}$ is computed over the whole training set

as

$z_{\mathrm{SAA}} \in \operatorname*{argmin}_{z \in \mathcal{Z}} \; \sum_{i=1}^{n} c(z; Y^i),$

and the cost is calculated as $\frac{1}{n_T} \sum_{i=1}^{n_T} c(z_{\mathrm{SAA}}; Y_T^i)$ over the test set.

For each of the following three methods, we report results for four different learning models – k Nearest Neighbors

(kNN), Kernel Regression (KR), Trees (T), and Random Forests (RF). We also include

an example where the learning model is kNN to further clarify our experiments.

2. P&O: First, the machine learning model is trained to minimize the training loss, i.e.,

solve Problem (2.11) with µ = 0. Once this model is computed, it is used to compute

the decisions. When the model is kNN, the first step involves finding the k that leads

to the least in sample training loss for predicting y. Next, the test set cost is computed


by playing $z_{\mathrm{P\&O}}^i$ for each test point $X_T^i$, $1 \le i \le n_T$, obtained by

$\hat{Y}^i = \frac{1}{k} \sum_{j \in N_k(X_T^i)} Y^j, \qquad z_{\mathrm{P\&O}}^i \in \operatorname*{argmin}_{z \in \mathcal{Z}} \; c(z; \hat{Y}^i),$

with the cost calculated as $\frac{1}{n_T} \sum_{i=1}^{n_T} c(z_{\mathrm{P\&O}}^i; Y_T^i)$. We report the performance of this

method as P&O-kNN.

3. PtP: Similar to the P&O method, the machine learning model is trained by simply

minimizing the training loss, followed by using that model for decision making. When

the model is kNN, the optimal decisions are computed as

$z_{\mathrm{PtP}}^i \in \operatorname*{argmin}_{z \in \mathcal{Z}} \; \frac{1}{k} \sum_{j \in N_k(X_T^i)} c(z; Y^j)$

for each test point $X_T^i$, $1 \le i \le n_T$, where $k$ is the same value as used in the P&O

methodology. The performance metric of this method is the average out of sample

prescriptive cost $\frac{1}{n_T} \sum_{i=1}^{n_T} c(z_{\mathrm{PtP}}^i; Y_T^i)$, which we report as PtP-kNN.

4. JPP: This is the approach presented in this chapter. Here, the machine learning

model is trained to minimize the joint loss function by solving Problem (2.11), where

0 < µ < 1 is chosen via cross validation. In the case of JPP-kNN, where the learning

model is kNN, the parameter k is calculated to minimize the joint loss according to

Equation (2.13), while the decision $z_{\mathrm{JPP}}^i$ is computed for each test point $X_T^i$, $1 \le i \le n_T$,

as

$z_{\mathrm{JPP}}^i \in \operatorname*{argmin}_{z \in \mathcal{Z}} \; \frac{1}{k} \sum_{j \in N_k(X_T^i)} c(z; Y^j).$

The prescriptive performance of this method is captured by the cost $\frac{1}{n_T} \sum_{i=1}^{n_T} c(z_{\mathrm{JPP}}^i; Y_T^i)$,

which we report as JPP-kNN.

5. Oracle: Finally, as the name indicates, for this method we assume we have access to

the uncertainty $Y_T^i$ corresponding to each $X_T^i$ in the test set. The cost is calculated

over the test set as

$\frac{1}{n_T} \sum_{i=1}^{n_T} \min_{z \in \mathcal{Z}} c(z; Y_T^i).$

Note that this cost is the best any method can possibly achieve, and thus provides a

lower bound on the attainable test set performance of any prescriptive method.
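To make the contrast between the P&O and PtP rules concrete, the following sketch implements both for the kNN case on a discretized decision grid (illustrative helper names; in the experiments the inner optimization problems are solved exactly):

```python
import numpy as np

def knn_decisions(X_train, Y_train, x_new, k, costs, Z_grid):
    """Contrast the P&O and PtP decision rules for kNN.

    P&O first averages the neighbors' outcomes into a point forecast
    and optimizes against it; PtP optimizes the average cost over the
    neighbors directly.  `costs(z, y)` is the cost function c(z; y),
    and Z_grid is a discretized feasible set (illustrative).
    """
    d = np.linalg.norm(X_train - x_new, axis=1)
    nbrs = np.argsort(d)[:k]                  # indices of the k nearest points
    y_hat = Y_train[nbrs].mean(axis=0)        # P&O point prediction of Y
    z_po = min(Z_grid, key=lambda z: costs(z, y_hat))
    z_ptp = min(Z_grid,
                key=lambda z: np.mean([costs(z, Y_train[j]) for j in nbrs]))
    return z_po, z_ptp
```

On a bimodal neighborhood the two rules can prescribe very different decisions: P&O targets the averaged outcome, while PtP accounts for the asymmetry of the cost over the neighbors.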

2.6.1 Portfolio Optimization

We consider the same example mentioned in Bertsimas and Van Parys [2017], where the

problem is to allocate a limited budget among 6 different securities in an artificial portfolio.

Thus, the decision variable $z \in \mathbb{R}_+^6$ represents the asset allocations, while the uncertain returns

$y \in \mathbb{R}^6$ are unknown at the time of investing. We consider 3 different covariates that can

potentially influence the returns – the global S&P500 performance, inflation, and the amount

of Twitter chatter mentioning the hash tag #WAR – as x1, x2 and x3 respectively, where we

wish to use this additional side information to aid the decision making process.

We wish to maximize the mean return while at the same time minimizing the risk that

the loss $(-z^T y)^+ = \max\{-z^T y, 0\}$ is large. Employing a conditional value-at-risk (CVaR)

reformulation [Rockafellar and Uryasev, 2000] and using β as an auxiliary variable, we solve

the following convex minimization problem, for a given covariate $x$:

$(z^*(x), \beta^*(x)) \in \operatorname*{argmin}_{z \ge 0,\, \beta} \;\; \mathbb{E}\left[\beta + \frac{1}{\varepsilon}(-z^T y - \beta)^+ - \lambda z^T y \,\middle|\, X = x\right]$

$\text{subject to} \;\; \sum_{i=1}^{6} z_i = 1.$

Thus, the augmented vector $z = (z^T, \beta)^T \in \mathbb{R}^7$ is the decision variable. Here, $\lambda, \varepsilon$ are both

given parameters where λ ≥ 0 represents the tradeoff between expected risk and return, and

the risk term represents the expected tail loss occurring above the (1 − ε) quantile. For all

our experiments that follow, we fix these parameter values as ε = 0.05 and λ = 1. We include

the details on the data generation process in the appendix.
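Replacing the conditional expectation with a (weighted) sample average over training scenarios turns this CVaR problem into a linear program via the standard epigraph variables of Rockafellar and Uryasev [2000]. Below is a sketch using scipy with uniform scenario weights (an assumption for brevity; the thesis solves these problems with Gurobi):

```python
import numpy as np
from scipy.optimize import linprog

def cvar_portfolio(Y, eps=0.05, lam=1.0):
    """Sample-average LP reformulation of the CVaR portfolio problem.

    Y is an (n x d) matrix of return scenarios (e.g. the training
    outcomes, here equally weighted).  Returns the allocation z on the
    simplex and the auxiliary VaR level beta.  Decision vector layout:
    [z (d), beta (1), u (n)], where u_i >= (-z @ y_i - beta)^+ are the
    epigraph variables for the tail losses.
    """
    n, d = Y.shape
    # objective: beta + (1/(eps*n)) * sum u_i - (lam/n) * sum_i z @ y_i
    c = np.concatenate([-(lam / n) * Y.sum(axis=0), [1.0],
                        np.full(n, 1.0 / (eps * n))])
    # u_i >= -z @ y_i - beta   <=>   -y_i @ z - beta - u_i <= 0
    A_ub = np.hstack([-Y, -np.ones((n, 1)), -np.eye(n)])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(d), [0.0], np.zeros(n)]).reshape(1, -1)
    bounds = [(0, None)] * d + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:d], res.x[d]
```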

We consider various sizes of the training data, and use this to compute the best f . We


n      P&O-kNN   P&O-KR    P&O-Lasso   P&O-RF
100    1357.62   1304.33   1112.28     1216.10
200    1278.06   1212.23   1026.20     1109.17
500    1143.51   1108.24   966.31      1009.85
1000   1112.49   1049.91   923.96      945.10

Table 2.1: Average out of sample prescriptive performance for Predict and Optimize with kNN, local Kernel Regression, Lasso, and Random Forests as a function of n, the size of the training set.

n      SAA     PtP-kNN   JPP-kNN   PtP-KR   JPP-KR   PtP-OptTree   JPP-OptTree
100    91.53   146.15    70.27     84.52    62.18    94.88         107.09
200    73.81   90.27     50.21     53.51    36.97    71.58         62.09
500    63.69   48.16     34.53     25.06    14.49    35.75         31.99
1000   60.05   33.18     29.97     15.63    4.84     15.42         15.90

Table 2.2: Average out of sample prescriptive performance for various methods as a function of n, the size of the training set.

evaluate the performance of various methods by reporting the out of sample prescriptive cost

on the test set of size 10,000. For each value of n, we repeat this procedure twenty times

and average the cost over these instances. Note that the lower this value is, the better the

method.

Computational details: For the kNN method, we use the Mahalanobis distance metric

and for the Nadaraya-Watson kernel regression, we use the Gaussian kernel and the usual

Euclidean distance metric. We use leave one out cross validation to compute the parameters

k and h in kNN and KR respectively. For the trees, we use the projected subgradient descent

(Algorithm (A.1)) to update the solutions to various splits. The projection problem (A.2)

onto $\mathcal{Z} = \{z \in \mathbb{R}_+^{d_z} : \sum_i z_i = 1\}$, solving which is key to Algorithm (A.1), can be efficiently solved

in $O(d_z \log(d_z))$ time [Duchi et al., 2008]. Finally, we implement our algorithms in Python

3, with Gurobi [Gurobi] as the optimization solver.
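For completeness, the simplex projection of Duchi et al. [2008] admits a short sorting-based implementation; a sketch of the projection step (A.2):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {z >= 0, sum_i z_i = 1} in O(d log d), following Duchi et al. [2008]."""
    u = np.sort(v)[::-1]                       # sort entries in descending order
    css = np.cumsum(u) - 1.0
    # largest index rho with u_rho - (cumsum - 1)/rho > 0 (1-indexed rank)
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)             # shift that enforces sum-to-one
    return np.maximum(v - theta, 0.0)
```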

Table 2.1 shows the prescriptive performance of P&O policies for various machine learning

methods for different training set sizes. From the first column of Table 2.2, which shows the

performance of the SAA solution (which is averaged over the same training and test set

instances for each n, and hence is directly comparable), we see that the performance of

P&O methods is evidently not very strong for this problem. Even Bertsimas and Kallus

[2019] in their computations (Figure 2(a) in their paper) note the under-performance of

P&O methods for this particular problem. Furthermore, Table 2.2 also demonstrates that

while PtP methods offer a huge improvement over P&O methods, they are dominated by

the JPP based versions of these methods. This improvement holds even as n increases.

The improvement due to the best method, JPP-KR, over PtP-KR is statistically significant

by the Wilcoxon signed-rank test (p-values for $n = 100, 200, 500, 1000$ are $1.5 \times 10^{-16}$, $4.6 \times 10^{-16}$, $6.5 \times 10^{-18}$, and $1.3 \times 10^{-17}$, respectively). As a reference, we note that the oracle method

with perfect hindsight has test set performance between −527 and −528. Finally, we do not

include random forests as there are only three covariates.

2.6.2 Newsvendor problem

In this section, we consider the newsvendor problem with auxiliary information x about the

demand y. The cost function is piecewise linear convex, given by

$c(z; Y) = b \cdot (Y - z)^+ + h \cdot (z - Y)^+, \qquad (2.20)$

where y represents the uncertain demand, and the backorder cost b > 0 and holding cost

h > 0 are both known in advance. Consequently, the conditional expectation problem we

wish to solve to obtain the decision z∗(x), for any x, is:

$z^*(x) \in \operatorname*{argmin}_{z} \; \mathbb{E}[\, b \cdot (Y - z)^+ + h \cdot (z - Y)^+ \mid X = x \,]$

Clearly, if $y$ is known a priori, then the optimal decision is $z^* = y$, which leads to an optimal

oracle cost of zero. If the conditional distribution of Y ∣X = x is perfectly known, then the

classical result stipulates that the optimal decision z∗(x) is given by the quantile

$z^*(x) = \inf\left\{\, t : \mathbb{P}[Y \le t \mid X = x] \ge \frac{b}{b+h} \,\right\}.$


We also note that the optimal solution to the weighted SAA version of the newsvendor

problem, given by,

$\min_{z} \; \sum_{i=1}^{n} w_i^f(x) \, c(z; Y^i),$

can be computed efficiently in $O(n \log n)$ time as

$z^*(f, x) = \inf\left\{\, Y^{(j)} : \sum_{i=1}^{j} w_{(i)}^f(x) \ge \frac{b}{b+h} \,\right\},$

for each $x$, where the demands $(Y^1, \ldots, Y^n)$ are ordered in nondecreasing order as $Y^{(1)} \le Y^{(2)} \le \cdots \le Y^{(n)}$, and $w_{(i)}^f(x)$ denotes the weight attached to $Y^{(i)}$.
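In code, this weighted quantile rule is a sort followed by a cumulative-weight search; a minimal sketch (where `Y` and `w` are the training demands and their weights at the query point):

```python
import numpy as np

def weighted_newsvendor(Y, w, b, h):
    """Weighted-SAA newsvendor solution: the smallest order statistic
    Y_(j) at which the cumulative weight reaches the critical ratio
    b / (b + h).  Runs in O(n log n) due to the sort."""
    order = np.argsort(Y)                              # sort demands ascending
    cum_w = np.cumsum(np.asarray(w, dtype=float)[order])
    j = np.searchsorted(cum_w, b / (b + h))            # first index reaching ratio
    return Y[order][j]
```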

Computational details: In this section, we apply our methodology to real data from a

Mexican producer of bakery items. This data set is available at https://www.kaggle.com/c/

grupo-bimbo-inventory-demand, and consists of weekly sales data of more than one hundred

restrict our analysis to the top hundred products and the top five hundred clients. The

details on the data generation process can be found in the appendix. We train our models

on random samples of various sizes (n) from weeks 6 and 7, and evaluate our models on test

data from weeks 8 and 9. For each n, we average our results over five hundred randomly

chosen training samples, and report the average out of sample prescriptive cost

of various methods. Finally, we set the backorder and holding costs in Equation (2.20) as

b = 10.0 and h = 1.0.

Table 2.3 presents the performance of SAA followed by PtP and JPP variants of various

methods – Nadaraya-Watson kernel regression (KR), Optimal trees (OptTree), and Random

forests (RF). The kNN method performs similarly to the kernel regression method, and

we omit it for the sake of brevity. Clearly, JPP-RF is the best performer with the smallest

out of sample cost, closely followed by PtP-RF. We also note that the JPP versions of kNN,

Optimal trees, and RF perform better than their PtP versions at each value of n ≥ 200, which

shows the benefit of training the machine learning model keeping in mind the prescription

task. Finally, we note that all the methods perform better with more training data which


is expected, but the JPP methods consistently outperform their PtP counterparts. The

improvements due to JPP-RF over PtP-RF are all statistically significant for n ≥ 200 by

the Wilcoxon signed-rank test (p-values for $n = 200, 300, 400, 500, 600$ are $4.9 \times 10^{-6}$, $3.5 \times 10^{-21}$, $1.8 \times 10^{-21}$, $5.6 \times 10^{-20}$, and $2.8 \times 10^{-23}$, respectively). Note that the oracle method will always

have an average out of sample cost of 0.

n SAA PtP-KR JPP-KR PtP-OptTree JPP-OptTree PtP-RF JPP-RF

100 2065.45 2034.84 2032.73 2057.26 2022.10 1889.38 1890.36

200 2059.12 2024.07 2020.64 1983.34 1944.68 1830.66 1827.65

300 2056.60 2021.65 2016.68 1974.52 1923.52 1793.96 1783.24

400 2054.91 2021.66 2013.68 1954.00 1909.54 1760.35 1747.40

500 2054.33 2021.48 2012.30 1934.02 1892.65 1726.88 1714.18

600 2053.20 2022.98 2010.71 1924.11 1870.92 1692.28 1677.92

Table 2.3: Average out of sample prescriptive performance for various methods as a function of n, the size of the training set.

In the following sections, we demonstrate the effectiveness of our approach for problems

where Y is affected by Z with two examples. In the first, we consider a pricing problem

with synthetic data, while in the second, we use real patient data for personalized Warfarin

dosing. We compare our methods (JPP) with PtP methods for which µ = 0, and there is no

prescriptive component in the objective of (2.18).

2.6.3 Pricing

In this example, the decision variable, z ∈ R5, is a vector of prices for a collection of products.

The response, Y , is a vector of demands for those products. The auxiliary covariates, x,

may contain data on the weather and other exogenous factors that may affect demand. The

objective is to select prices to maximize revenue for a given vector of auxiliary covariates.

The demand for a single product is affected by the auxiliary covariates, the price of that

product, and the price of one or more of the other products, but the mapping is unknown


n      PtP-CART   JPP-CART   PtP-RF     JPP-RF
30     18630.37   55851.14   57259.44   57189.29
50     18479.17   56316.17   57757.40   57964.17
70     21228.90   56448.46   57949.27   58144.01
100    21426.18   56889.63   58085.08   58209.91
150    27961.63   57206.25   58285.22   58464.33
300    30751.34   57631.42   58415.60   58630.36
600    34894.42   57951.45   58373.28   58621.00
1000   37803.67   58163.99   58480.03   58713.35
2000   41503.90   58348.54   58449.98   58711.00

Table 2.4: Average out of sample revenue on the pricing example for various PtP and JPP methods as a function of n, the size of the training set.

to the algorithm. The details on the data generation process can be found in the appendix.

In Table 2.4, we compare the expected revenues of the strategies produced by several

algorithms. The PtP results refer to the methods that solve (2.18) with µ = 0. They are

trained to minimize predictive error, and the decision that minimizes the predicted cost is

then selected. The JPP results refer to the methods that solve (2.18) with µ = 0.5. For each

training sample size, n, we average our results over one hundred separate training sets of

size n. At a training size of 2000, the JPP random forest method improves expected revenue

by an average of $270 compared to the PtP RF method. This improvement is statistically

significant at the 0.05 significance level by the Wilcoxon signed-rank test (p-value $4.4 \times 10^{-18}$, testing the hypothesis that the mean improvement is 0 across 100 different training sets).

2.6.4 Warfarin Dosing

Warfarin is a commonly prescribed anticoagulant that is used to treat patients who have had

blood clots or who have a high risk of stroke. Determining the optimal maintenance dose

of Warfarin presents a challenge as the appropriate dose varies significantly from patient

to patient and is potentially affected by many factors including age, gender, weight, health

history, and genetics. However, this is a crucial task because a dose that is too low or too

high can put the patient at risk for clotting or bleeding. The effect of a Warfarin dose on a

patient is measured by the International Normalized Ratio (INR). Physicians typically aim


for patients to have an INR in a target range of 2-3.

In this example, we test the efficacy of our approach in learning optimal Warfarin dos-

ing with data from Consortium et al. [2009]. This publicly available data set contains the

optimal stable dose, found by experimentation, for a diverse set of 5410 patients. In addi-

tion, the data set contains a variety of covariates for each patient, including demographic

information, reason for treatment, medical history, current medications, and the genotype

variant at CYP2C9 and VKORC1. It is unique because it contains the optimal dose for

each patient, permitting the use of off-the-shelf machine learning methods to predict this

optimal dose as a function of patient covariates. We instead use this data to construct a

problem with observational data, which resembles the common problem practitioners face.

Our access to the true optimal dose for each patient allows us to evaluate the performance

of our method out-of-sample. This is a commonly used technique, and the resulting data

set is sometimes called semi-synthetic. Several researchers have used the Warfarin data for

developing personalized approaches to medical treatments. In particular, Kallus [2017b] and

Bertsimas et al. [2019a] tested algorithms that learned to treat patients from semi-synthetic

observational data. However, they both discretized the dosage into three categories, whereas

we treat the dosage as a continuous decision variable.

To begin, we split the data into a training set of 4000 patients and a test set of 1410

patients. We keep this split fixed throughout all of our experiments to prevent cheating

by using insights gained by visualization and exploration on the training set. Similar to

Kallus [2017b], we assume physicians prescribe Warfarin as a function of BMI. We assume

the response that the physicians observe is related to the difference between the dose a

patient was given and the true optimal dose for that patient. It is a noisy observation, but

on average it gives directional information (whether the dose was too high or too low) and

information on the magnitude of the distance from the optimal dose. The precise details

of how we generate the data are given in the supplementary materials. For all methods,

we repeat our work across 100 randomizations of assigned training doses and responses. To

measure the performance of our methods, we compute, on the test set, the mean squared


n      PtP-Lasso   JPP-Lasso   PtP-CART   JPP-CART   PtP-RF   JPP-RF
200    450.90      448.11      440.13     307.22     301.93   257.70
500    260.31      255.27      309.32     273.91     234.11   219.08
1000   300.82      286.32      269.92     254.24     220.43   211.43
1500   195.60      188.48      258.72     244.15     220.39   206.74
2000   180.44      174.44      247.76     239.52     215.56   199.27
2500   162.95      161.40      238.42     232.78     206.25   191.50
3000   161.22      159.41      230.18     222.63     211.11   193.54
3500   155.17      154.63      234.78     223.09     210.91   189.21
4000   154.33      153.51      221.87     216.86     205.13   187.35

Table 2.5: Average out of sample MSE on the Warfarin example for various PtP and JPP methods as a function of n, the size of the training set.

error (MSE) of the prescribed doses relative to the true optimal doses. Using the notation

described in Section 2.1, X^i ∈ R^99 represents the auxiliary covariates for patient i. We work in normalized units so the covariates all contribute equally to the bias penalty term. Z^i ∈ R represents the assigned dose for patient i, and Y^i ∈ R represents the observed response for patient i. The objective in this problem is to minimize (E[Y(z) | X = x])² with respect to the dose, z.¹
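As a concrete sketch of this evaluation metric, the function below computes the test-set MSE of prescribed doses against the true optimal doses; the function name and the toy doses are illustrative, not taken from the thesis code.

```python
def prescription_mse(prescribed, optimal):
    """Mean squared error of prescribed doses relative to the true
    optimal doses -- the out-of-sample metric reported in Table 2.5."""
    assert len(prescribed) == len(optimal)
    return sum((p - o) ** 2 for p, o in zip(prescribed, optimal)) / len(prescribed)

# Toy illustration with made-up weekly doses: errors of 2, -3, and 0
print(prescription_mse([30.0, 42.0, 55.0], [28.0, 45.0, 55.0]))  # (4 + 9 + 0) / 3
```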

Table 2.5 displays the results of several algorithms as a function of the number of training

examples. We compare PtP and JPP versions of CART, random forest, and Lasso algo-

rithms. We see consistent improvements in MSE with the JPP methods over the PtP meth-

ods. The lasso based method works best on this data set when the number of training samples

is large, but the random forest based method is best for smaller sample sizes. With the max-

imal training set size of 4000, the improvements of the CART, random forest, and lasso

uncertainty penalized methods over their unpenalized analogues (2.2%, 8.6%, 0.5% respec-

tively) are all statistically significant at the 0.05 family-wise error rate level by the Wilcoxon

signed-rank test with Bonferroni correction (adjusted p-values 2.1 × 10^-4, 4.3 × 10^-16, and 1.2 × 10^-6,

respectively).
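The significance test used above can be sketched as follows. This is a self-contained normal-approximation implementation we wrote for illustration (in practice one would call `scipy.stats.wilcoxon`), paired with a Bonferroni step for the three method comparisons; all names here are ours.

```python
import math

def wilcoxon_signed_rank_p(x, y):
    """Two-sided Wilcoxon signed-rank test p-value for paired samples,
    using the large-sample normal approximation. Zero differences are
    dropped; tied absolute differences get average ranks."""
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to ties on |d|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)  # rank sum of positive diffs
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Here `x` and `y` would hold the paired out-of-sample MSEs of a JPP method and its PtP analogue across the 100 randomizations.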

¹This objective differs slightly from the setting described in Section 2.5.2, in which the objective was to minimize the conditional expectation of a cost function. However, it is straightforward to modify the results to obtain the same regret bound (up to a few constant factors) when minimizing g(E[c(z; Y(z)) | X = x]) for a Lipschitz function g.


2.7 Conclusions

In this chapter, we consider the problem of computing decisions from data – a topic that

lies at the intersection of machine learning and operations research/management science.

In our setting, we assume the decision maker has access to n samples of past observational

data (X^i, Z^i, Y^i)_{i=1}^n comprising auxiliary covariates X^i, decisions Z^i, and outcomes Y^i.

We propose non-parametric ML-based methods that, given a new observation x, prescribe

a decision z(x). We compute these prescriptive policies in a single step, rather than the

usual two-step approach of learning a model to predict y from x and z, and then using

those predictions to compute decisions. A crucial component of our approach is that we

train these ML models by explicitly optimizing for the quality of their induced decisions.

Additionally, we prove finite sample generalization and regret bounds and provide a sufficient

set of conditions under which the resulting decisions are asymptotically optimal. Finally, we

perform computational experiments and demonstrate the prescriptive power of our methods

on synthetic and real data.


Chapter 3

Optimal Prescriptive Trees

3.1 Introduction

The proliferation in volume, quality, and accessibility of highly granular data has enabled

decision makers in various domains to seek customized decisions at the individual level.

This personalized decision making framework encompasses a multitude of applications. In

online advertising, internet companies display advertisements to users based on the user's search history, demographic information, geographic location, and other available data they

routinely collect from visitors of their website. Specifically targeting these advertisements

by displaying them to appropriate users can maximize the probability that they are clicked, and

can improve revenue. In personalized medicine, we want to assign different drugs/treatment

regimens/dosage levels to different patients depending on their demographics, past diagnosis

history and genetic information in order to maximize medical outcomes for patients. By

taking into account the heterogeneous responses to different treatments among different

patients, personalized medicine aspires to provide individualized, highly effective treatments.

In this chapter, we consider the problem of prescribing the best option from among a

set of predefined treatments to a given sample (patient or customer depending on context)

as a function of the sample’s features. We have access to observational data of the form

(x^i, y^i, z^i)_{i=1}^n, which comprises n observations. Each data point (x^i, y^i, z^i) corresponds


to the features x^i ∈ R^d of the i-th sample, the assigned treatment z^i ∈ [m] = {1, . . . , m}, and the

corresponding outcome y^i ∈ R. We use y(1), . . . , y(m) to denote the m “potential outcomes”

resulting from applying each of the m respective treatments.

There are three key challenges for designing personalized prescriptions for each sample

as a function of their observed features:

1. While we have observed the outcome of the administered treatment for each sample, we

have not observed the counterfactual outcomes, that is, the outcomes that would have occurred had another treatment been administered. Note that if this information were

known, then the prescription problem reduces to a standard multi-class classification

problem. We thus need to infer the counterfactual outcomes.

2. The vast majority of the available data is observational in nature as opposed to data

from randomized trials. In a randomized trial, different samples are randomly assigned

different treatments, while in an observational study, the assignment of treatments

potentially, and often, depends on features of the sample. Different samples are thus more or less likely to receive certain treatments, and samples that received one treatment may differ systematically from those that received another. Consequently, our approach needs to

take into account the bias inherent in observational data.

3. Especially for personalized medicine, the proposed approach needs to be interpretable,

that is easily understandable by humans. Even in high speed online advertising, one

needs to demonstrate that the approach is fair, appropriate, and does not discriminate against people based on certain features such as race, gender, or age. In our view, interpretability is always highly desirable, and a necessity in many contexts.

We seek a function τ : R^d → [m] that selects the best treatment τ(x) out of the m options

given the sample features x. In doing so, we need to be both “optimal” and “accurate”. We

thus consider two objectives:

1. Assuming that smaller outcomes y are preferable (for example, sugar levels for person-

alized diabetes management), we want to minimize E[y(τ(x))], where the expectation


is taken over the distribution of outcomes for a given treatment policy τ(x). Given

that we only have data, we rewrite this expectation as

∑_{i=1}^{n} ( y^i 1[τ(x^i) = z^i] + ∑_{t≠z^i} y^i(t) 1[τ(x^i) = t] ),    (3.1)

where y^i(t) denotes the unknown counterfactual outcome that would be observed if

sample i were to be assigned treatment t. We refer to the objective function (3.1) as

the prescription error.

2. We further want to design treatment τ(x) that accurately estimates the counterfactual

outcomes. For this reason, our second objective function is to minimize

∑_{i=1}^{n} ( y^i − y^i(z^i) )²,    (3.2)

that is, we seek to minimize the squared prediction error for the observed data.

Given our desire for optimality and accuracy, we propose in this chapter to seek a policy

τ(x) that optimizes a convex combination of the two objectives (3.1) and (3.2):

µ [ ∑_{i=1}^{n} ( y^i 1[τ(x^i) = z^i] + ∑_{t≠z^i} y^i(t) 1[τ(x^i) = t] ) ] + (1 − µ) [ ∑_{i=1}^{n} ( y^i − y^i(z^i) )² ],    (3.3)

where the prescription factor µ is a hyperparameter that controls the tradeoff between the

prescription and the prediction error.
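The combined objective can be evaluated directly once counterfactual estimates are in hand. A minimal sketch, assuming outcomes, treatments, a candidate policy, and per-treatment outcome estimates are given as plain lists (all names are ours):

```python
def combined_objective(mu, y, z, tau, y_hat):
    """Evaluate the convex combination (3.3) of the prescription
    error (3.1) and the squared prediction error (3.2).

    y[i]        : observed outcome of sample i
    z[i]        : treatment sample i actually received
    tau[i]      : treatment the candidate policy assigns to sample i
    y_hat[i][t] : estimated outcome of sample i under treatment t
    """
    # (3.1): observed outcome when the policy matches the data,
    # estimated counterfactual otherwise.
    prescription = sum(
        y[i] if tau[i] == z[i] else y_hat[i][tau[i]]
        for i in range(len(y))
    )
    # (3.2): squared error of the model on the observed outcomes.
    prediction = sum((y[i] - y_hat[i][z[i]]) ** 2 for i in range(len(y)))
    return mu * prescription + (1 - mu) * prediction
```

Setting µ = 1 recovers the pure prescription error and µ = 0 the pure prediction error.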

3.1.1 Related Literature

In this section, we present some related approaches to personalization in the literature and

how they relate to our work. We present some methodological papers by researchers in

statistics and operations research, followed by a few papers in the medical literature.

Learning the outcome function for each treatment: A common approach in the lit-

erature is to estimate each sample’s outcome under a particular treatment, and recommend


the treatment that predicts the best prognosis for that sample. Formally, this is equivalent

to estimating the conditional expectation E[Y | Z = t, X = x] for each t ∈ [m], and assigning

the treatment that predicts the lowest outcome to a sample. For instance, these conditional

means could be estimated by regressing the outcomes against the covariates of samples who

received treatment t separately. This approach has been followed historically by several authors in clinical research (e.g., Feldstein et al. [1978]), and more recently by researchers

in statistics [Qian and Murphy, 2011] and operations research [Bertsimas et al., 2017]. The

online version of this problem, called the contextual bandit problem, has been studied by

several authors [Bastani and Bayati, 2015, Goldenshluger and Zeevi, 2013, Li et al., 2010] in

the multi-armed bandit literature [Gittins, 1989]. These papers use variants of linear regres-

sion to estimate the treatment function for each arm all while ensuring sufficient exploration,

and picking the best treatment based on the m predictions for a given sample.

In the context of personalized diabetes management, Bertsimas et al. [2017] use carefully

constructed k-nearest neighbors to estimate the counterfactuals, and prescribe the treatment

option with the best predicted outcome if the expected improvement (over the status quo)

exceeds a threshold δ. The parameters, k and δ, used as part of this approach are themselves

learned from the data.

More generally in the fields of operations research and management science, Bertsimas

and Kallus [2019] consider the problem of prescribing optimal decisions by directly learning

from data. In this work, they adapt powerful machine learning methods and encode them

within an optimization framework to solve a wide range of decision problems. In the context

of revenue management and pricing, Bertsimas and Kallus [2017] consider the problem of

prescribing the optimal price by learning from historical demand and other side informa-

tion, but taking into account that the demand data is observational. Specifically, historical

demand data is available only for the observed price and is missing for the remaining price

levels.

Effectively, regress-and-compare approaches inherently encode a personalization frame-

work that consists of a (shallow) decision tree of depth one. To see this, consider a problem


with m arms, where this approach involves estimating functions f_i for computing the outcomes of samples that received arm i, for each 1 ≤ i ≤ m. This prescription mechanism can

be represented as splitting the feature space into m leaves, with the first leaf constituting

all the subjects who are recommended arm 1, and so on. The i-th leaf is given by the region {x ∈ R^d : f_i(x) < f_j(x) ∀ j ≠ i, 1 ≤ j ≤ m}. However, the individual functions f_i can be

highly nonlinear which hurts interpretability. Additionally, using only the samples who were

administered arm i to compute each fi results in using only a subset of the training data

for each of these computations and the f_i's not interacting with each other while learning,

which can potentially lead to less effective decision rules.
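The regress-and-compare scheme described above can be sketched compactly. Here we use a k-nearest-neighbour average over a one-dimensional covariate as an illustrative stand-in for whatever regressor is fit per arm; the function names and data layout are assumptions, not the thesis code.

```python
def regress_and_compare(train, x_new, k=3):
    """Fit one outcome model per arm using only that arm's samples,
    then prescribe the arm with the lowest predicted outcome.
    train: list of (x, z, y) triples with scalar covariate x."""
    arms = sorted({z for _, z, _ in train})

    def f(arm, x):
        # Per-arm regressor: average outcome of the k nearest samples
        # among those that received this arm.
        pts = sorted((abs(xi - x), yi) for xi, zi, yi in train if zi == arm)
        nearest = pts[:k]
        return sum(y for _, y in nearest) / len(nearest)

    # Smaller outcomes are preferable, so prescribe the argmin.
    return min(arms, key=lambda a: f(a, x_new))
```

Note how each f_i sees only its own arm's subset of the data, which is exactly the inefficiency discussed above.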

Statistical learning based approaches: Another relatively recent approach involves

reformulating this problem as a weighted multi-class classification problem based on im-

puted propensity scores, and using off the shelf methods/solvers available for such problems.

Propensity scores are defined as the conditional probability of a sample receiving a particu-

lar treatment given his/her features [Rosenbaum and Rubin, 1983]. Clearly, for a two arm

randomized control trial, these values are 0.5 for each sample. For two-armed studies where these scores are known, Zhou et al. [2017] propose a weighted SVM-based

approach to learn a classifier that prescribes one of the two treatment options. However, this

analysis is restricted to settings where these scores are perfectly known and predefined in the

trial design, e.g., randomized clinical trials (propensities are constant) or stratified designs

(where the dependence of the treatment assignment on the covariates is known a priori).

In observational studies, these probabilities are typically not known, and hence are usu-

ally estimated via maximum likelihood estimation. However, there are multiple proposed

methods for estimating these scores, e.g., using machine learning [Westreich et al., 2010]

or methods that primarily target covariate balancing [Imai and Ratkovic, 2014], and the choice of method is

not clear a priori. Once these probabilities are known or estimated, the average outcome is

computed using approaches based on the inverse probability of treatment weighting estima-

tor. This involves multiplying the observed outcome by the inverse of the propensity score

(this approach is also referred to as importance/rejection sampling in the machine learning


literature). While this method has desirable asymptotic properties and low bias, dividing

the outcome by the estimated probabilities may lead to unstable, high variance estimates

for small samples.
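The inverse-probability-of-treatment-weighting estimator just described can be sketched as follows; the function and argument names are ours, chosen for illustration.

```python
def ipw_policy_value(samples, policy, propensity):
    """IPW estimate of the average outcome under `policy`.

    samples         : list of (x, z, y) observations
    policy(x)       : treatment the candidate policy would assign
    propensity(x, z): probability of receiving z given x (known in a
                      randomized trial, estimated in observational data)
    """
    n = len(samples)
    total = 0.0
    for x, z, y in samples:
        # Only samples whose observed treatment matches the policy
        # contribute, reweighted by the inverse propensity.
        if policy(x) == z:
            total += y / propensity(x, z)
    return total / n
```

The division by the propensity is the source of the instability noted above: when an estimated propensity is close to zero, a single sample can dominate the estimate.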

Tree based approaches: Continuing in the spirit of adapting machine learning approaches,

Kallus [2017b] proposes personalization trees (and forests), which adapt regular classification

trees [Breiman et al., 1984] to directly optimize the prescription error. The key differences

from our approach are that we modify our objective to account for the prediction error, and

use the methodology of Bertsimas and Dunn [2017, 2019] to design near optimal trees, which

improves performance substantially. Athey and Imbens [2016] and Wager and Athey [2018]

also use a recursive splitting procedure of the feature space to construct causal trees and

causal forests, respectively, which estimate the causal effect of a treatment for a given sample,

or construct confidence intervals for the treatment effects, but not explicit prescriptions or

recommendations, which are the main point of this chapter. Also, causal trees (or forests)

are designed exclusively for studies comparing binary treatments. Additional methods that

build on causal forests are proposed in the recent work of Powers et al. [2017], who develop

nonlinear methods to provide better estimates of the personalized average treatment effect,

E[Y (1)∣X = x] − E[Y (0)∣X = x], for high dimensional covariates x. They adapt methods

such as random forests, boosting, and MARS (Multivariate Adaptive Regression Splines) and

develop their equivalents for treatment effect estimation – pollinated transformed outcome

(PTO) forests, causal boosting, and causal MARS. These methods rely on first estimating

the propensity score (by regressing historically assigned Z against X), followed by another

regression using those propensity adjustments. The causal MARS approach uses nonlinear

functions, which are added to the basis in a greedy manner, as regressors for predicting

outcomes via linear regression for each arm, but uses a common set of basis functions for

both arms.

One of the advantages of these recent approaches – weighted classification or tree based

methods – over regress and compare based approaches is that they use all of the training

data rather than breaking down the problem into m (where m is the number of arms)


subproblems, each using a separate subset of the data. This key modification increases the

efficiency of learning, which results in better estimates of personalized treatment effects for

smaller sizes of the training data.

Personalization in medicine: Heterogeneity in patient response and the potential benefits

of personalized medicine have also been discussed in the medical literature. As an illustration

of heterogeneity in responses, a certain drug that works for a majority of individuals may

not be appropriate for other subsets of patients, e.g., in general older patients tend to

have poor outcomes independent of any treatment [Lipkovich and Dmitrienko, 2014]. In

an example of breast cancer, Gort et al. [2007] find that even when patients receive identical

treatments, heterogeneity of the disease at the molecular level may lead to varying clinical

outcomes. Thus, personalized medicine can be thought of as a framework for utilizing all

this information, past data, and patient level characteristics to develop a rule that assigns

treatments best suited for each patient. These treatment rules have provided high quality

recommendations, e.g., in cystic fibrosis [Flume et al., 2007] and mental illness [Insel, 2009],

and can potentially lead to significant improvements in health outcomes and reduce costs.

3.1.2 Contributions

We propose an approach that generalizes our earlier work on prediction trees [Bertsimas and

Dunn, 2017, 2019, Dunn, 2018] to prescriptive trees that are interpretable, highly scalable,

generalizable to multiple treatments, and either outperform out of sample or are comparable

with several state of the art methods on synthetic and real world data. Specifically, our

contributions include:

Interpretability: Decision trees are highly interpretable (in the words of Leo Breiman: “On

interpretability, trees rate an A+”). Given that our method produces trees with partitions

that are parallel to the axis, they are highly interpretable and provide intuition on the

important features that lead to a sample being assigned a particular treatment.

Scalability: Similarly to predictive trees [Bertsimas and Dunn, 2017, 2019, Dunn, 2018],

prescriptive trees scale to problems with n in the 100,000s and d in the 10,000s in seconds when


they use constant predictions in the leaves and in minutes when they use a linear model.

Generalizable to multiple treatments: Prescriptive trees can be applied with multiple

treatments. An important desired characteristic of a prescriptive algorithm is its generaliz-

ability to handle the case of more than two possible arms. As an example, a recent review

by Baron et al. [2013] found that almost 18% of published randomized control trials (RCTs)

in 2009 were multi-arm clinical trials, where more than two new treatments are tested simul-

taneously. Multi-arm trials are attractive as they can greatly improve efficiency compared to

traditional two-arm RCTs by reducing costs, speeding up recruitment of participants, and

most importantly, increasing the chances of finding an effective treatment [Parmar et al.,

2014]. On the other hand, two-arm trials can force the investigator to make a potentially

incorrect series of decisions on treatment, dose or assessment duration [Parmar et al., 2014].

Rapid advances in technology have resulted in almost all diseases having multiple drugs at

the same stage of clinical development, e.g., 771 drugs for various kinds of cancer are cur-

rently in the clinical pipeline [Buffery, 2015]. This emphasizes the importance of methods

that can handle trials with more than two treatment options.

Highly effective prescriptions: In a series of experiments with real and synthetic data,

we demonstrate that prescriptive trees either outperform out of sample or are comparable

with several state of the art methods on synthetic and real world data.

Given their combination of interpretability, scalability, generalizability and performance,

it is our belief that prescriptive trees are an attractive alternative for personalized decision

making.

3.1.3 Structure of the Chapter

The structure of this chapter is as follows. In Section 3.2, we review optimal predictive trees

for classification and regression. In Section 3.3, we describe optimal prescriptive trees (OPTs)

and the algorithm we propose in greater detail. In Section 3.3.3, we present improvements

to the OPTs methodology using improved counterfactual estimates. We provide evidence of

the benefits of this method with the help of synthetic data in Section 3.4 and four real world


examples in Section 3.5. Finally, we present our conclusions in Section 3.6.

3.2 Review of Optimal Predictive Trees

Decision trees are primarily used for the tasks of classification and regression, which are

prediction problems where the goal is to predict the outcome y for a given point x. We

therefore refer to these trees as predictive trees. The problem we consider in this chapter is

prescription, where we use the point x and the observed outcomes y to prescribe the best

treatment for each point. We will adapt ideas from predictive trees in order to effectively

train prescriptive trees, where each leaf prescribes a treatment for the point and also predicts

the associated outcome for that treatment. In this section, we briefly review predictive trees,

and in particular, we give an overview of the Optimal Trees framework [Bertsimas and Dunn,

2019, Dunn, 2018] which is a novel approach for training predictive trees that have state-of-

the-art accuracy.

The traditional approach for training decision trees is to use a greedy heuristic to re-

cursively partition the feature space by finding the single split that locally optimizes the

objective function. This approach is used by methods like CART [Breiman et al., 1984] to

find classification and regression trees. The main drawback to this greedy approach is that

each split in the tree is determined in isolation without considering the possible impact of

future splits in the tree. This can lead to trees that do not capture well the underlying char-

acteristics of the dataset and can lead to weak performance on unseen data. The natural way

to resolve this problem is to consider forming the decision tree in a single step, determining

each split in the tree with full knowledge of all other splits.

Optimal Trees is a novel approach for constructing decision trees that substantially out-

performs existing decision tree methods [Bertsimas and Dunn, 2019, Dunn, 2018]. It uses

mixed-integer optimization (MIO) to formulate the problem of finding the globally optimal

decision tree, and solves this problem with coordinate descent to find optimal or near-optimal

solutions in practical times. The resulting predictive trees are often as powerful as state-of-

the-art methods like random forests or boosted trees, yet they maintain the interpretability


of a single decision tree, avoiding the need to choose between interpretability and state-of-

the-art accuracy.

The Optimal Trees framework is a generic approach for training decision trees according

to a loss function of the form

min_T  error(T, D) + α · complexity(T),    (3.4)

where T is the decision tree being optimized, D is the training data, error(T,D) is a

function measuring how well the tree T fits the training data D, complexity(T ) is a function

penalizing the complexity of the tree (for a tree with splits parallel to the axis, this is the

number of splits in the tree), and α is the complexity parameter that controls the tradeoff

between the quality of the fit and the size of the tree.

Previous attempts in the literature for finding globally optimal predictive trees [examples

include Bennett and Blue, 1996, Grubinger et al., 2014, Son, 1998] were not able to scale

to datasets of the size seen in practice, and as such did not deliver practical improvements

over greedy heuristics. The key development that allows Optimal Trees to scale is using

coordinate descent to train the decision trees towards global optimality. The algorithm

repeatedly optimizes the splits in the tree one-at-a-time, attempting to find changes that

improve the global objective value in Problem (3.4). At a high level, it visits the nodes of

the tree in a random order and considers the following modifications at each node:

• If the node is not a leaf, delete the split at that node;

• If the node is not a leaf, find the optimal split to use at that node and update the

current split;

• If the node is a leaf, create a new split at that node.

After each of the changes, the objective value of the tree with respect to Problem (3.4) is

calculated. If any of these changes improve the overall objective value of the tree, then the

modification is accepted. The algorithm continually visits the nodes in a random order until


Figure 3-1: Performance of classification methods averaged across 60 real-world datasets, plotting out-of-sample accuracy against maximum tree depth for CART, OCT, OCT-H, Random Forest, and Boosting. OCT and OCT-H refer to Optimal Classification Trees without and with hyperplane splits, respectively.

no possible improvements are found, meaning this tree is a local minimum. The problem is

non-convex, so this coordinate descent process is repeated from a variety of starting decision

trees that are generated randomly. From this set of trees, the one with the lowest overall

objective function is selected as the final solution. For a more comprehensive guide to the

coordinate descent process, we refer the reader to Bertsimas and Dunn [2019].
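The accept-if-it-improves character of this local search can be illustrated on a stripped-down problem: a single feature partitioned by axis-aligned thresholds, with loss (3.4) as fit error plus α times the number of splits. This toy stand-in for the full coordinate descent over tree nodes uses names of our own choosing.

```python
def sse(ys):
    """Sum of squared errors around the leaf mean (zero for an empty leaf)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def objective(splits, data, alpha):
    """Loss (3.4): fit error plus alpha times the number of splits.
    `splits` is a list of thresholds partitioning a 1-d feature."""
    bounds = [float("-inf")] + sorted(splits) + [float("inf")]
    error = sum(
        sse([y for x, y in data if lo < x <= hi])
        for lo, hi in zip(bounds, bounds[1:])
    )
    return error + alpha * len(splits)

def local_search(data, alpha, candidate_thresholds):
    """Repeatedly try toggling a single threshold in or out, accepting
    any change that lowers (3.4), until no single change improves the
    objective -- a local minimum, as in the coordinate descent above."""
    splits = []
    improved = True
    while improved:
        improved = False
        for t in candidate_thresholds:
            trial = sorted(set(splits) ^ {t})  # toggle threshold t
            if objective(trial, data, alpha) < objective(splits, data, alpha) - 1e-12:
                splits, improved = trial, True
    return splits
```

As in the full framework, the search is restarted from different initial trees in practice because the problem is non-convex; this sketch starts from the empty tree only.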

The coordinate descent algorithm is generic and can be applied to any objective function

in order to optimize a decision tree. For example, the Optimal Trees framework can train

Optimal Classification Trees by setting error(T,D) to be the misclassification error associ-

ated with the tree predictions made on the training data. Figure 3-1 shows a comparison of

performance between various classification methods from Bertsimas and Dunn [2019]. These

results demonstrate that the Optimal Tree methods outperform CART in producing a single

predictive tree that has accuracies comparable with some of the best classification methods.

In Section 3.3, we extend the Optimal Trees framework to generate prescriptive trees

using objective function (3.3).


3.3 Optimal Prescriptive Trees

In this section, we motivate and present the OPT algorithm that trains prescriptive trees

to directly minimize the objective presented in Problem (3.3) using a decision rule that

takes the form of a prescriptive tree; that is, a decision tree that in each leaf prescribes a

common treatment for all samples that are assigned to that leaf of the tree. Our approach

is to estimate the counterfactual outcomes using this prescriptive tree during the training

process, and therefore jointly optimize the prescription and the prediction error.

3.3.1 Optimal Prescriptive Trees with Constant Predictions

Observe that a decision tree divides the training data into neighborhoods where the samples

are similar. We propose using these neighborhoods as the basis for our counterfactual esti-

mation. More concretely, we will estimate the counterfactual yi(t) using the outcomes yj for

all samples j with zj = t that fall into the same leaf of the tree as sample i. An immediate

method for estimation is to simply use the mean outcome of the relevant samples in this

neighborhood, giving the following expression for yi(t):

y^i(t) = (1 / |{j : x^j ∈ X_l(i), z^j = t}|) ∑_{j : x^j ∈ X_l(i), z^j = t} y^j,    (3.5)

where X_l(i) is the leaf of the prescription tree into which x^i falls.

Substituting this back into Problem (3.3), we want to find a prescriptive tree τ that

solves the following problem:

min_{τ(·)}  µ [ ∑_{i=1}^{n} ( y^i 1[τ(x^i) = z^i] + ∑_{t≠z^i} ( ∑_{j : x^j ∈ X_l(i), z^j = t} y^j / |{j : x^j ∈ X_l(i), z^j = t}| ) 1[τ(x^i) = t] ) ]
            + (1 − µ) [ ∑_{i=1}^{n} ( y^i − (1 / |{j : x^j ∈ X_l(i), z^j = z^i}|) ∑_{j : x^j ∈ X_l(i), z^j = z^i} y^j )² ].    (3.6)

We note that when µ = 1, we obtain the same objective function as Kallus [2017b], which

means this objective is an unbiased and consistent estimator for the prescription error. We


Figure 3-2: Test prediction and personalization error as a function of µ.

note that in this work they attempted to solve Problem (3.6) to global optimality using a

MIO formulation based on an earlier version of Optimal Trees [Bertsimas and Dunn, 2017].

This approach did not scale beyond shallow trees and small datasets, and so they resorted to

using a greedy CART-like heuristic to solve the problem instead. The approach we describe,

using the latest version of Optimal Trees centered around coordinate descent, is practical

and scales to large datasets while solving in tractable times. When µ = 0, we obtain the

objective function in Bertsimas and Dunn [2017] that emphasizes prediction.

Empirically, when µ = 1, we have observed that the resulting prescriptive trees lead to a

high predictive error and an optimistic estimate of the prescriptive error that is not supported

in out of sample experiments. Allowing µ to vary ensures that the tree predictions lead to a

major improvement of the out of sample predictive and prescriptive error.

To illustrate this observation, Figure 3-2 shows the average prediction and prescription

errors as a function of µ for one of the synthetic experiments we conduct in Section 3.4. We

see that using µ = 1 leads to very high prediction errors, as the prescriptions are learned

without making sure the predicted outcomes are close to reality. More interestingly, we see

that the best prescription error is not achieved at µ = 1. Instead, varying µ leads to improved

prescription error, and for this particular example the lowest error is attained for µ in the

range 0.5–0.8. This gives clear evidence that our choice of objective function is crucial for

delivering better prescriptive trees.


3.3.2 Training Prescriptive Trees

We apply the Optimal Trees framework to solve Problem (3.6) and find OPTs. The core

of the algorithm remains as described in Section 3.2, and we set Problem (3.6) as the loss

function error(T,D). When evaluating the loss at each step of the coordinate descent,

we calculate the estimates of the counterfactuals by finding the mean outcome for each

treatment in each leaf among the samples in that leaf that received that treatment using

Equation (3.5). We determine the best treatment to assign at each leaf by summing up the

outcomes (observed or counterfactual as appropriate) of all samples for each treatment, and

then selecting the treatment with the lowest total outcome in the leaf. Finally, we calculate

the two terms of the objective using the means and best treatments in each leaf, and add

these terms with the appropriate weighting to calculate the total objective value.
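The per-leaf treatment choice just described can be sketched as follows, again representing a leaf as its (treatment, outcome) pairs; the names are ours, not the thesis implementation.

```python
def best_leaf_treatment(leaf, treatments):
    """Pick the treatment with the lowest total outcome in a leaf,
    summing observed outcomes for samples that received t and the
    leaf-mean counterfactual estimate for those that did not."""
    def mean_for(t):
        ys = [y for z, y in leaf if z == t]
        # A treatment with no samples in the leaf cannot be prescribed.
        return sum(ys) / len(ys) if ys else float("inf")

    def total(t):
        m = mean_for(t)
        return sum(y if z == t else m for z, y in leaf)

    return min(treatments, key=total)
```

In the full algorithm this choice would additionally respect the n_treatment threshold, skipping treatments whose counterfactual estimate rests on too few samples.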

The hyperparameters that control the tree training process are:

• nmin: the minimum number of samples required in each leaf;

• Dmax: the maximum depth of the prescriptive tree;

• α: the complexity parameter that controls the tradeoff between training accuracy and

tree complexity in Problem (3.4);

• ntreatment: the minimum number of samples of a treatment t we need at a leaf before we

are allowed to prescribe treatment t for that leaf. This is to avoid using counterfactual

estimates that are derived from relatively few samples;

• µ: the prescription factor that controls the tradeoff between prescription and prediction

errors in the objective function.

The first three parameters are parameters that appear in the general Optimal Trees

framework (for more detail see Bertsimas and Dunn [2019]), while the final two are specific

to OPTs.

In practice we have found that we can achieve good results for most problems by setting

nmin = 1, ntreatment = 10, and tuning Dmax and α using the procedure outlined in Section 2.4


of Dunn [2018]. We also have seen that setting µ = 0.5 typically gives good results, although

this may need to be tuned to achieve the best performance on a specific problem.

3.3.3 Optimal Prescriptive Trees with Linear Predictions

In Section 3.3.1, we trained OPTs by using the mean treatment outcomes in each leaf as the

counterfactual estimates for the other samples in that leaf. There is nothing special about

our choice to use the mean outcome other than ease of computation, and it seems intuitive

that a better predictive model for the counterfactual estimates could lead to a better final

prescriptive tree. In this section, we propose using linear regression methods as the basis for

counterfactual estimation inside the OPT framework.

Traditionally, regression trees have eschewed linear regression models in the leaves due to

the prohibitive cost of repeatedly fitting linear regression models during the training process,

and instead have preferred to use simpler methods such as predicting the mean outcome in

the leaf. However, the Optimal Trees framework contains approaches for training regression

trees with linear regression models with elastic net regularization [Zou and Hastie, 2005]

in each leaf. It uses fast updates and coordinate descent to minimize the computational

cost of fitting these models repeatedly, providing a practical and tractable way of generating

interpretable regression trees with more sophisticated prediction functions in each leaf.

We propose using this approach for fitting linear regression models from the Optimal

Trees framework for the estimation of counterfactuals in our OPTs. To do this, in each leaf

we fit a linear regression model for each treatment, using only the samples in that leaf that

received the corresponding treatment. We will then use these linear regression models to

estimate the counterfactuals for each sample/treatment pair as necessary, before proceeding

to determine the best treatment overall in the leaf using the same approach as in Section 3.3.

Concretely, in each leaf \ell of the tree we fit an elastic net model for each treatment t using the relevant points in the leaf, \{i : x_i \in X_\ell,\, z_i = t\}, to obtain regression coefficients \beta_\ell^t:

\min_{\beta_\ell^t}\; \frac{1}{2\,\lvert\{i : x_i \in X_\ell,\, z_i = t\}\rvert} \sum_{i : x_i \in X_\ell,\, z_i = t} \bigl(y_i - (\beta_\ell^t)^T x_i\bigr)^2 + \lambda P_\alpha(\beta_\ell^t), \tag{3.7}


where

P_\alpha(\beta) = (1 - \alpha)\,\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1. \tag{3.8}

We proceed to estimate the counterfactuals with the following equation:

\hat{y}_i(t) = (\beta_{\ell(i)}^t)^T x_i, \tag{3.9}

where \ell(i) is the leaf into which sample i falls. The overall objective function is therefore

\min_{\tau(\cdot),\,\beta}\;\; \mu \Biggl[\sum_{i=1}^{n} \Bigl(y_i\,\mathbb{1}[\tau(x_i) = z_i] + \sum_{t \neq z_i} (\beta_{\ell(i)}^t)^T x_i\,\mathbb{1}[\tau(x_i) = t]\Bigr)\Biggr] + (1-\mu) \Biggl[\sum_{i=1}^{n} \bigl(y_i - (\beta_{\ell(i)}^{z_i})^T x_i\bigr)^2 + \lambda \sum_{t=1}^{m} \sum_{\ell} P_\alpha(\beta_\ell^t)\Biggr], \tag{3.10}

where the regression models β are found by solving the elastic net problems (3.7) defined by

the prescriptive tree. Note that we have included the elastic net penalty in the prediction

accuracy term, mirroring the structure of the elastic net problem itself. This is so that our

objective accounts for overfitting the β coefficients in the same way as standard regression.

We solve this problem using the Optimal Regression Trees framework from Bertsimas and

Dunn [2019], modified to fit a regression model for each treatment in each leaf, rather than

just a single regression model per leaf.
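A stripped-down sketch of this per-leaf procedure is shown below. This is our own illustration, with plain least squares standing in for the elastic net of Problem (3.7) for brevity, and `leaf_linear_counterfactuals` is a hypothetical name.

```python
import numpy as np

def leaf_linear_counterfactuals(X, y, z, treatments):
    """Fit one linear model per treatment from the samples in a leaf and
    return a full matrix of (counter)factual estimates, as in Eq. (3.9).

    Regularization (the elastic net penalty of Problem (3.7)) is omitted
    here for brevity; least squares stands in for illustration."""
    betas = {}
    for t in treatments:
        mask = z == t
        # Fit beta_t on the samples in the leaf that received treatment t
        betas[t], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    # yhat[i, t] estimates sample i's outcome under treatment t
    yhat = np.column_stack([X @ betas[t] for t in treatments])
    return betas, yhat
```

Each column of `yhat` gives every sample's estimated outcome under one treatment, so observed and counterfactual estimates can be read from the same matrix.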

There are two additional hyperparameters in this model over the model in Section 3.3,

namely the degree of regularization in the elastic net λ and the parameter α controlling the

trade-off between `1 and `2 norms in (3.8). We have found that we obtain strong results using

only the `1 norm, and so this is what we use in all experiments. We select the regularization

parameter λ through validation.

3.4 Performance of OPTs on Synthetic Data

In this section, we design simulations on synthetic datasets to evaluate and compare the

performance of our proposed methods with other approaches. Since the data set is simulated,

the counterfactuals are fully known, which enables us to compare with the ground truth. In


the remainder of this section, we present our motivation behind these experiments, describe

the data generating process and the methods we compare, followed by computational results

and conclusions.

3.4.1 Motivation

The general motivation of these experiments is to investigate the performance of the OPT

method for various choices of synthetic data. Specifically, as part of these experiments, we

seek to answer the following questions.

1. How well does each method prescribe, i.e., recover the decision boundary \{x \in \mathbb{R}^d : y_0(x) = y_1(x)\}?

2. How accurate are the predicted outcomes?

3.4.2 Experimental Setup

Our experimental setup is motivated by that of Powers et al. [2017]. In our experiments, we generate n data points xi, i = 1, . . . , n, where each xi ∈ Rd. Each xi is generated i.i.d. such that the odd-numbered coordinates j are sampled from xij ∼ Normal(0,1), while the even-numbered coordinates j are sampled from xij ∼ Bernoulli(0.5).

Next, we simulate the observed outcomes under each treatment. We restrict the scope

of these simulations to two treatments (0 and 1) so that we can include in our comparison

methods those that only support two treatments. For each experiment, we define a baseline

function that gives the base outcome for each observation and an effect function that models

the effect of the treatment being applied. Both of these are functions of the covariates

x, centered and scaled to have zero mean and unit variance. The outcome yt under each

treatment t as a function of x is given by

y_0(x) = \text{baseline}(x) - \tfrac{1}{2}\,\text{effect}(x), \qquad y_1(x) = \text{baseline}(x) + \tfrac{1}{2}\,\text{effect}(x).


Finally, we assign treatments to each observation. In order to simulate an observational

study, we assign treatments based on the outcomes for each treatment so that treatment 1 is

typically assigned to observations with a large outcome under treatment 0, which are likely

to realize a greater benefit from this prescription. Concretely, we assign treatment 1 with

the following probability:

P(Z = 1 \mid X = x) = \frac{e^{y_0(x)}}{1 + e^{y_0(x)}}.

In the training set, we add noise εi ∼ Normal(0, σ2) to the outcomes yi corresponding to

the selected treatment.

We consider three different experiments with different forms for the baseline and effect

functions and differing levels of noise:

1. The first experiment has low noise, σ = 0.1, a linear baseline function, and a piecewise

constant effect function:

\text{baseline}(x) = x_1 + x_3 + x_5 + x_7 + x_8 + x_9 - 2, \qquad \text{effect}(x) = 5\cdot\mathbb{1}(x_1 > 1) - 5.

2. The second experiment has moderate noise, σ = 0.2, a constant baseline function, and

a piecewise linear effect function:

\text{baseline}(x) = 0, \qquad \text{effect}(x) = 4\cdot\mathbb{1}(x_1 > 1)\,\mathbb{1}(x_3 > 0) + 4\cdot\mathbb{1}(x_5 > 1)\,\mathbb{1}(x_7 > 0) + 2x_8x_9.

3. The third experiment has high noise, σ = 0.5, a piecewise constant baseline function,

and a quadratic effect function:

\text{baseline}(x) = 5\cdot\mathbb{1}(x_1 > 1) - 5, \qquad \text{effect}(x) = \tfrac{1}{2}\bigl(x_1^2 + x_2 + x_3^2 + x_4 + x_5^2 + x_6 + x_7^2 + x_8 + x_9^2 - 11\bigr).

For each experiment, we generate training data with n = 1,000 and d = 20 as described

above. We also generate a test set with n = 60,000 using the same process, without adding


noise. In the test set, we know the true outcome for each observation under each treatment,

so we can identify the correct prescription for each observation.
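The generating process for Experiment 1 can be sketched as follows. This is our own numpy sketch: `make_data` is a hypothetical name, and 0-based column indices stand in for the 1-based coordinates used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d, sigma):
    """Generate one instance of Experiment 1's observational data."""
    X = np.empty((n, d))
    X[:, 0::2] = rng.normal(size=(n, (d + 1) // 2))      # odd coordinates x1, x3, ...
    X[:, 1::2] = rng.binomial(1, 0.5, size=(n, d // 2))  # even coordinates x2, x4, ...
    # 0-based columns: x1..x9 of the text are X[:, 0]..X[:, 8]
    baseline = X[:, 0] + X[:, 2] + X[:, 4] + X[:, 6] + X[:, 7] + X[:, 8] - 2
    effect = 5.0 * (X[:, 0] > 1) - 5
    y0 = baseline - 0.5 * effect
    y1 = baseline + 0.5 * effect
    # Biased treatment assignment to mimic an observational study
    p1 = np.exp(y0) / (1 + np.exp(y0))
    z = rng.binomial(1, p1)
    # Observed outcome of the assigned arm, plus noise
    y = np.where(z == 1, y1, y0) + rng.normal(0, sigma, n)
    return X, y, z, y0, y1
```

The test set is generated with the same function but with the noise term dropped, since the noiseless y0 and y1 are returned alongside the observed data.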

For each method, we train a model using the training set, and then use the model to

make prescriptions on the test set. We consider the following metrics for evaluating the

quality of prescriptions:

• Treatment Accuracy: the proportion of the test set where the prescriptions are correct;

• Effect Accuracy: the R2 of the predicted effects, ŷ(1) − ŷ(0), made by the model for each observation in the test set, compared against the true effect for each observation.

We run 100 simulations for each experiment and report the average values of treatment

and effect accuracy on the test set.
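Both metrics can be computed directly when the true outcomes under both arms are known, as in our synthetic test sets. A minimal sketch (the function names are ours):

```python
import numpy as np

def treatment_accuracy(z_hat, y0, y1):
    """Fraction of test points prescribed the lower-outcome treatment."""
    z_star = (y1 < y0).astype(int)  # optimal prescription per observation
    return np.mean(z_hat == z_star)

def effect_accuracy(effect_hat, y0, y1):
    """R^2 of the predicted effects y(1) - y(0) against the true effects."""
    effect = y1 - y0
    ss_res = np.sum((effect - effect_hat) ** 2)
    ss_tot = np.sum((effect - effect.mean()) ** 2)
    return 1 - ss_res / ss_tot
```

Note that effect accuracy, being an out-of-sample R^2, can be negative when the predicted effects fit worse than the mean true effect.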

3.4.3 Methods

We compare the following methods:

• Prescription Trees: We include four prescriptive tree approaches:

– Personalization trees, denoted PT (recall that these are the same as OPT with

µ = 1 but trained with a greedy approach);

– OPT with µ = 1 and µ = 0.5, denoted OPT(1) and OPT(0.5), respectively;

– OPT with µ = 0.5 and with linear counterfactual estimation in each leaf, denoted

OPT(0.5)-L.

• Regress-and-compare: We include three regress-and-compare approaches where the

underlying regression uses either Optimal Regression Trees (ORT), LASSO regression

or random forests, denoted RC–ORT, RC–LASSO and RC–RF, respectively. For each

sample in the test set, we prescribe the treatment that leads to the lowest predicted

outcome.


[Figure: two panels, Effect Accuracy and Treatment Accuracy, comparing PT, OPT(1), OPT(0.5), OPT(0.5)−L, RC−ORT, RC−LASSO, RC−RF, and CF.]

Figure 3-3: Effect and Treatment accuracy results for Experiment 1.

• Causal Methods: We include the method of causal forests [Wager and Athey, 2018]

with the default parameters. While causal forests are intended to estimate the individual treatment effect, we use the sign of the estimated individual treatment effect

to determine the choice of treatment. Specifically, we prescribe 1 if the estimated

treatment effect for that sample is negative, and 0, otherwise.

We also tested causal MARS on all examples, but it performed similarly to causal

forests, and hence was omitted from the results for brevity.

Notice that causal forests and OPTs are joint learning methods—the training data for these

approaches is the whole sample that includes both the treatment and control groups, as

opposed to regress-and-compare methods which split the data and develop separate models

for observations with z = 0 and z = 1.
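The regress-and-compare baseline can be sketched as follows. This is our own illustration with plain least squares standing in for the ORT, LASSO, or random-forest regressors used in the experiments, and `regress_and_compare` is a hypothetical name.

```python
import numpy as np

def regress_and_compare(X, y, z, X_test):
    """Fit a separate outcome model per arm on its own subsample,
    then prescribe the treatment with the lower predicted outcome."""
    Xa = np.column_stack([np.ones(len(X)), X])          # add intercept
    Xt = np.column_stack([np.ones(len(X_test)), X_test])
    preds = []
    for t in (0, 1):
        # Each arm's model sees only the observations that received it
        beta, *_ = np.linalg.lstsq(Xa[z == t], y[z == t], rcond=None)
        preds.append(Xt @ beta)
    return np.argmin(np.column_stack(preds), axis=1)    # prescriptions
```

The split into per-arm subsamples is exactly what distinguishes this family from the joint learning methods above.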

3.4.4 Results

Figure 3-3 shows the results for Experiment 1. In this experiment, the boundary func-

tion is piecewise constant and the individual outcome functions are both piecewise linear.

The true decision boundary is x1 = 1, and the regions x1 > 1 and x1 ≤ 1 each have

constant treatment effect. The true response function in each of these regions is linear.

OPT(0.5)-L outperforms all the three regress-and-compare approaches and causal forests

(CF) both in treatment and effect accuracy. There is a marked improvement from OPT(0.5)


[Tree: a single split on x1 < 1.0011; one leaf prescribes treatment 0, the other treatment 1.]

Figure 3-4: Tree constructed by OPT(0.5)-L for an instance of Experiment 1.

[Figure: two panels, Effect Accuracy and Treatment Accuracy, comparing PT, OPT(1), OPT(0.5), OPT(0.5)−L, RC−ORT, RC−LASSO, RC−RF, and CF.]

Figure 3-5: Effect and Treatment accuracy results for Experiment 2.

to OPT(0.5)-L with the addition of linear regression in the leaves, which is unsurprising as

this models exactly the truth in the data. The poorest performing method is the greedy PT,

which has both low treatment accuracy, and very poor effect accuracy (note that the out of

sample R2 can be negative). OPT(1) improves slightly in the treatment accuracy, but the

effect accuracy is still poor. OPT(0.5) shows a large improvement in both the treatment

and effect accuracies over PT and OPT(1), which demonstrates the importance of considering both the prescriptive and predictive components with the prescriptive factor µ in the

objective function.

Figure 3-4 shows the tree for one of the instances of Experiment 1 under OPT(0.5)-L.

Recall, the boundary function for this experiment was simply x1 = 1, which is correctly

identified by the tree. This particular tree has a treatment accuracy of 0.99, reflecting

the accuracy of the boundary function, and an effect accuracy of 0.90, showing that the

linear regressions within each leaf provide high quality estimates of the outcomes for both

treatments.


The results for Experiment 2 are shown in Figure 3-5. This experiment has a piecewise

linear boundary with piecewise linear individual outcome functions, with moderate noise.

OPT(0.5)-L is again the strongest-performing method in both treatment and effect accuracies, followed by OPT(0.5) and causal forests.

treatment accuracy, showing that these tree models are able to effectively learn the indicator

terms in the outcome functions of both arms. We again see that OPT(0.5) and OPT(0.5)-L

improve upon the other tree methods, particularly in effect accuracy, as a consequence of

incorporating the predictive term in the objective. The linear trends in the outcome functions of this experiment are not as strong as in Experiment 1, and so the improvement of

OPT(0.5)-L over OPT(0.5) is not as large as before.

We observe that the joint learning methods perform better than the regress-and-compare

methods in this example even though the outcome functions for both the treatment options

do not have a common component (the baseline function is 0). We believe this is because

both the methods included here, causal forests and prescriptive trees, can learn local effects

effectively. Note that the structure of the boundary function is such that the function is

either constant or linear in different buckets.

We plot the tree from OPT(0.5)-L for an instance of this experiment in Figure 3-6. This

particular tree has a treatment accuracy of 0.925, which indicates that it has learned the

decision boundary effectively, along with an effect accuracy of 0.82. We make the following

observations from this tree.

1. Recall that the true boundary function for this experiment only includes the variables

from x1, x3, x5, x7, x8, and x9, and none of the remaining variables from x2 to x20. From

the figure, we see that the tree likewise splits on none of the remaining variables, i.e., it has a zero false positive rate.

2. By inspecting the splits on the variables x1, x3, x5 and x7, we note that the tree has

learned thresholds of close to 0 for x3 and x7, and 1 for x1 and x5, which matches with

the ground truth for these variables.

3. Examining the tree more closely, we see that the prescriptions reflect the reality of


[Tree: splits on x1 < 0.9971, x3 < −0.0123, x5 < 1.0005, x7 < 0.0008, x8 = 0, and thresholds on x9 (0.6408, 0.7220, 1.7479); each leaf prescribes treatment 0 or 1.]

Figure 3-6: Tree constructed by OPT(0.5)-L for an instance of Experiment 2.


[Figure: two panels, Effect Accuracy and Treatment Accuracy, comparing the same eight methods.]

Figure 3-7: Effect and Treatment accuracy results for Experiment 3.

which outcome is best. For example, when x1 ≥ 0.9971 and x3 ≥ −0.0123, the tree

prescribes 0. This corresponds to the ground truth of the 4·𝟙(x1 > 1)·𝟙(x3 > 0) term becoming active, which makes it likely that treatment 1 leads to larger (worse) outcomes. We also see that the linear component in the outcome functions is reflected in

the tree, as the tree assigns treatment 0 when x9 is larger, which corresponds to the

linear term in the outcome function being large.

4. Finally, we note that the tree has a split where both the terminal leaves prescribe the

same treatment, which can initially seem odd. However, recall that the objective term

contains both prescription and prediction errors, and a split like this can improve the

prediction term in the objective, and hence the overall objective value, even though

none of the prescriptions are changed.

Finally, Figure 3-7 shows the results from Experiment 3. This experiment has high noise

and a nonlinear quadratic boundary. Overall, regress-and-compare random forest and causal

forest are the best-performing methods, followed closely by OPT(0.5)-L, demonstrating that

all three methods are capable of learning complicated nonlinear relationships, both in the

outcome functions and in the decision boundary. The treatment accuracy is comparable

for all prescriptive tree methods, but PT and OPT(1) have very poor effect accuracy. This

again demonstrates the importance of controlling for the prediction error in the objective.

In this experiment, we see that regress-and-compare random forests performs comparably


to causal forests, which was not the case for the other two experiments. We believe that

this is because the baseline function is relatively simple compared to the effect function,

which leads to the absence of strong common trends within the two treatment outcome

functions. This could make it more difficult to effectively learn from both groups jointly,

mitigating the benefits of combining the groups in training. Consequently, in this setting

regress-and-compare methods could have performance closer to joint learning methods.

3.4.5 Multiple treatments

In this section, we consider a synthetic example with three treatments. We generate the n covariate vectors from the same distribution as before. We simulate the observed outcomes under

each treatment as

y_0(x) = \text{baseline}(x), \qquad y_1(x) = \text{baseline}(x) + \text{effect}_1(x), \qquad y_2(x) = \text{baseline}(x) + \text{effect}_2(x).

Finally, we assign treatments to each observation. As before, we typically assign treatment 0

to observations when the baseline is small, and typically assign 1 and 2 with equal probability

when the baseline is higher. Concretely, we assign treatments with the following probabilities:

P(Z = 0 \mid X = x) = \frac{1}{1 + e^{y_0(x)}}, \qquad P(Z = 1 \mid X = x) = P(Z = 2 \mid X = x) = \tfrac{1}{2}\bigl(1 - P(Z = 0 \mid X = x)\bigr).

We consider the following experiment with the baseline and two effect functions given


by:

\text{baseline}(x) = 4\cdot\mathbb{1}(x_1 > 1)\,\mathbb{1}(x_3 > 0) + 4\cdot\mathbb{1}(x_5 > 1)\,\mathbb{1}(x_7 > 0) + 2x_8x_9,
\text{effect}_1(x) = 5\cdot\mathbb{1}(x_1 > 1) - 5,
\text{effect}_2(x) = \tfrac{1}{2}\bigl(x_1^2 + x_2 + x_3^2 + x_4 + x_5^2 + x_6 + x_7^2 + x_8 + x_9^2 - 11\bigr),

and the noise level σ = 0.1.
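The three-arm assignment mechanism above can be sketched as follows (our own numpy sketch; `assign_three_treatments` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_three_treatments(y0):
    """Biased assignment for the three-treatment experiment:
    treatment 0 is favored when the baseline outcome y0 is small."""
    p0 = 1.0 / (1.0 + np.exp(y0))
    p1 = p2 = 0.5 * (1.0 - p0)       # treatments 1 and 2 split the rest
    probs = np.column_stack([p0, p1, p2])
    return np.array([rng.choice(3, p=p) for p in probs])
```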

We generate training data with n = 1,000 and d = 20 and add noise εi ∼ Normal(0, σ2) to the outcomes yi corresponding to the selected treatment. As before, we generate a test

set with n = 60,000 using the same process, without adding noise. In the test set, we know

the true outcome for each observation under each treatment, so we can identify the correct

prescription for each observation.

For each method, we train a model using the training set, and then use the model to

make prescriptions on the test set. We consider the following metrics for evaluating the

quality of prescriptions:

• Treatment Accuracy: as defined in Section 3.4.2;

• Outcome Accuracy: the R2 of the model's predicted outcome ŷ(ẑ) under its prescribed treatment ẑ, made for each observation in the test set, compared against the true outcome of that prescription, y(ẑ), for each observation.

We run 100 simulations for each experiment and report the average values of treatment and outcome accuracy on the test set. We include the same methods as for the previous experiments, with the exception of causal forests, as it only supports two treatments.

Results

Figure 3-8 shows the results for Experiment 4, where the baseline function is piecewise

constant and the individual effect functions are piecewise linear and nonlinear respectively.

OPT(0.5) and OPT(0.5)-L outperform all the other methods in both treatment and outcome accuracy. Importantly, both these methods have the highest treatment accuracy, which


[Figure: two panels, Outcome Accuracy and Treatment Accuracy, comparing PT, OPT(1), OPT(0.5), OPT(0.5)−L, RC−ORT, RC−LASSO, and RC−RF.]

Figure 3-8: Outcome and Treatment accuracy results for Experiment 4 with three treatments

indicates that they estimate the decision boundary reasonably well, unlike RC–RF, which has high outcome accuracy but low treatment accuracy. As in the experiments with two treatments, OPT(0.5) shows a large improvement in both the treatment

and outcome accuracies over PT and OPT(1), which again demonstrates the importance of

considering both the prescriptive and predictive components with the prescriptive factor µ in

the objective function. Overall, this experiment provides strong evidence that our approach

continues to perform well when there are more than two treatments.

Impact of incorrect prescriptions

In the context of Experiment 4, we will now investigate the impact of the various algorithms

making incorrect prescriptions. In particular, we are interested in how much the predicted

outcome can deviate from the actual truth when making an incorrect prescription, i.e., the seriousness of the mistake. To this end, we considered the results from Experiment 4 and, in every case where an algorithm made an incorrect prescription, calculated the difference

between the true outcome under an algorithm’s incorrect prescription and the true outcome

under the optimal prescription. Note that this difference is always nonnegative.

Figure 3-9 shows the distributions of these errors in outcomes under incorrect prescrip-

tions. We see that all algorithms have similar medians and spreads, with RC–RF having the

smallest spread. We also see that the upper tail of the error distribution is similar between


[Figure: box plots of the error in prescribed outcome for PT, OPT(1), OPT(0.5), OPT(0.5)−L, RC−ORT, RC−LASSO, and RC−RF.]

Figure 3-9: Error in prescribed outcome due to incorrect prescription.

PT, OPT(0.5)–L, RC–ORT and RC–RF, while it is higher for OPT(1), OPT(0.5) and RC–LASSO, indicating that incorrect prescriptions made by these methods could possibly be

more serious than the others in the very extreme cases. However, overall these results give

evidence that all of the methods are roughly similar in terms of the errors made as a result

of incorrect prescriptions.

3.4.6 Discussion and Conclusions

In terms of both prescriptive and predictive performance, we provide evidence that our method performs comparably with, or even outperforms, state-of-the-art methods, as measured by both treatment and effect accuracy. Additionally, the main advantage

of prescriptive trees is that they provide an explicit representation of the decision boundary,

as opposed to the other methods where the boundary is only implied by the learned outcome

functions. This lends credence to our claim that the trees are interpretable. In fact, from

our discussion of the trees obtained for Combinations 1 and 2 in Figures 3-4 and 3-6, the

trees correctly learn the true decision boundary in the data.

We also found that regress-and-compare methods that fit separate functions for each

treatment are generally outperformed by joint learning methods that learn from the entire


dataset. We note that if there were an infinite amount of data and the regress-and-compare

methods could learn the individual outcome functions perfectly, then they would also learn

the decision boundary perfectly. However, for practical problems with finite sample sizes,

we have strong evidence that the performance can be much worse than the joint learning

methods.

3.5 Performance of OPTs on Real World Data

In this section, we apply prescriptive trees to some real world problems to evaluate the

performance of our OPTs in a practical setting. The first two problems, personalized warfarin dosing and personalized diabetes management, belong to the area of personalized medicine. Next, we provide personalized job training recommendations to individuals,

and finally conclude with an example where we estimate the personalized treatment effect

of high quality child care specialist home visits on the future cognitive test scores of infants.

3.5.1 Personalized Warfarin Dosing

In this section, we test our algorithm in the context of personalized warfarin dosing. Warfarin

is the most widely used oral anticoagulant agent worldwide. Its appropriate dose can vary

by a factor of ten among patients and hence can be difficult to establish, with incorrect doses

contributing to severe adverse effects [Consortium et al., 2009]. Physicians who prescribe

warfarin to their patients must constantly balance the risks of bleeding and clotting. The

current guideline is to start the patient at 5 mg per day, and then vary the dosage based on

how the patient reacts until a stable therapeutic dose is reached [Jaffer and Bragg, 2003].

The publicly available dataset we use was collected and curated by staff at the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) and members of the International Warfarin Pharmacogenetics Consortium. One advantage of this dataset is that

it gives us access to counterfactuals—it contains the true stable dose for each patient found

by physician controlled experimentation for 5,528 patients. The patient covariates include


demographic information (sex, race, age, weight, height), diagnostic information (reason for

treatment, e.g., deep vein thrombosis etc.), pre-existing diagnoses (indicators for diabetes,

congestive heart failure, smoker status etc.), current medications (Tylenol etc.), and genetic

information (presence of genotype polymorphisms of CYP2C9 and VKORC1). The correct

dose of warfarin was split into three dose groups: low (≤ 3 mg/day), medium (> 3 and < 5mg/day), and high(≥ 5 mg/day), which we consider as our three possible treatments 0, 1,

and 2.

Our goal is to learn a policy that prescribes the correct dose of warfarin for each patient

in the test set. In this dataset, we know the correct dose for each patient, and so we consider

the following two approaches for learning the personalization policy.

Personalization when counterfactuals are known

Since we know the correct treatment z∗i for each patient, we can simply develop a prediction

model that predicts the optimal treatment z given covariates x. This is a standard multi-class classification problem, and so we can use off-the-shelf algorithms for it.

Solving this classification problem gives us a bound on the performance of our prescriptive

algorithms, as this is the best we could do if we had perfect information.

Personalization when counterfactuals are unknown

Since it is unlikely that a real world dataset will consist of these optimal prescriptions, we

reassign some patients in the training set to other treatments so that their assignment is no

longer optimal. To achieve this, we follow the setup of Kallus [2017b], and assume that the

doctor prescribes warfarin dosage according to the following probabilistic assignment model:

P(Z = t \mid X = x) = \frac{1}{S}\exp\!\left(\frac{(t-1)(\text{BMI} - \mu)}{\sigma}\right), \tag{3.11}


where \mu, \sigma are the population mean and standard deviation of patients' BMI, respectively, and the normalizing factor is

S = \sum_{t=1}^{3} \exp\!\left(\frac{(t-1)(\text{BMI} - \mu)}{\sigma}\right).

We use this probabilistic model to assign each patient i in the training set a new treatment z̃i, and then set yi = 0 if z̃i = z∗i , and yi = 1 otherwise. We proceed to train our methods using the training data (xi, yi, z̃i), i = 1, . . . , n. This allows us to evaluate the performance

of various prescriptive methods on data which is closer to real world observational data.
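A sketch of this reassignment follows. This is our own illustration of Eq. (3.11): `reassign_doses` is a hypothetical name, and the treatments are relabeled 0, 1, 2 so that the exponent (t − 1) in (3.11) becomes 0, 1, 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def reassign_doses(bmi, z_star):
    """Reassign training doses via the BMI-based model of Eq. (3.11)
    and build binary outcomes (0 = assigned dose matches the true one)."""
    mu, sigma = bmi.mean(), bmi.std()
    # Unnormalized scores exp((t-1)(BMI - mu)/sigma) for relabeled t = 0, 1, 2
    scores = np.exp(np.outer(bmi - mu, np.arange(3)) / sigma)
    probs = scores / scores.sum(axis=1, keepdims=True)   # divide by S
    z = np.array([rng.choice(3, p=p) for p in probs])
    y = (z != z_star).astype(int)  # 1 when the assigned dose is wrong
    return z, y
```

Higher-BMI patients are thus pushed toward higher dose groups, mimicking a physician's dosing heuristic.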

Experiments

In order to test the efficacy with which our algorithm learns from observational data, we

split the data into training and test sets, where we vary the size of the training set as n = 500, 600, . . . , 2500, and the size of the test set is fixed at ntest = 2500. We perform 100

replications of this experiment for each n, where we resample the training and test sets of

respective sizes without replacement each time. We report the misclassification (error) rate

on the test set, noting that the full counterfactuals are available on the test set.

We compare the methods described in Section 3.4.3, but do not include OPT(0.5)-L as we

did not observe any benefit when adding continuous estimates of the counterfactuals, possibly due to the discrete nature of the outcomes in the problem. We also do not include causal

forests as the problem has more than two treatments. Additionally, to evaluate the performance of prescriptions when all outcomes are known, we treat the problem as a multi-class

classification problem and solve using off-the-shelf algorithms as described in Section 3.5.1.

We use random forests [Breiman, 2001], denoted Class–RF, and logistic regression, denoted

Class–LR.

In Figure 3-10, we present the out-of-sample misclassification rates for each approach.

We see that, as expected, the classification methods perform the best with random forests

having the lowest overall error rate, reaching around 32% at n = 2,500. This provides a

concrete lower bound for the performance of the prescriptive approaches to be benchmarked


against.

[Figure: misclassification-rate curves versus training size for Class−LR, Class−RF, PT, OPT(1), OPT(0.5), RC−LASSO, and RC−RF.]

Figure 3-10: Misclassification rate for warfarin dosing prescriptions as a function of training set size.

The greedy PT approach has stronger performance than the OPT methods at low n, but

as n increases this advantage disappears. At n = 2,500, the OPT(1) algorithm outperforms PT

by about 0.6%, which shows the improvement that is gained by solving the prescriptive tree

problem holistically rather than in a greedy fashion. OPT(0.5) improves further upon this

by 0.6%, demonstrating the value achieved by accounting for the prediction error in addition

to the prescriptive error. The trees generated by OPT(1) and OPT(0.5) were also smaller

than those from PT, making them more easily interpretable.

Finally, the regress-and-compare approaches both perform similarly, outperforming all

prescriptive tree methods. We note that this is the opposite result to that found by Kallus

[2017b], where the prescriptive trees were the strongest. We suspect the discrepancy is because they did not include random forests or LASSO as regress-and-compare approaches,

only CART, k-NN, logistic regression and OLS regression which are all typically weaker

methods for regression, and so the regressions inside the regress-and-compare were not as

powerful, leading to diminished regress-and-compare performance. It is perhaps not surprising that the regress-and-compare approaches are more powerful in this example; they are

able to choose the best treatment for each patient based on which treatment has the best


prediction, whereas the prescription tree can only make prescriptions for each leaf, based on

which treatment works well across all patients in the leaf. This added flexibility leads to

more refined prescriptions, but at a complete loss of interpretability which is a crucial aspect

of the prescription tree.

Overall, our results show that there is a substantial advantage to both solving the pre-

scriptive tree problem with a view to global optimality, and accounting for the prediction

error as well as the prescription error while optimizing the tree.

3.5.2 Personalized Diabetes Management

In this section, we apply our algorithms to personalized diabetes management using patient

level data from Boston Medical Center (BMC). This dataset consists of electronic medical

records for more than 1.1 million patients from 1999 to 2014. We consider more than

100,000 patient visits for patients with type 2 diabetes during this period. Patient features

include demographic information (sex, race, gender, etc.), treatment history, and diabetes

progression. This dataset was first considered in Bertsimas et al. [2017], where the authors

propose a k-nearest neighbors (kNN) regress-and-compare approach to provide personalized

treatment recommendations for each patient from the 13 possible treatment regimens. We

compare our prescriptive trees method to several regress-and-compare based approaches,

including the previously proposed kNN approach.

We follow the same experimental design as in Bertsimas et al. [2017]. The data is split

50/50 into training and testing. The models are constructed using the training data and then

used to make prescriptions on the testing data. The quality of the predictions on the testing

data is evaluated using a kNN approach to impute the counterfactuals on the test set—we

also considered imputing the counterfactuals using LASSO and random forests and found

the results were not sensitive to the imputation method. We use the same three metrics

to evaluate the various methods: the mean HbA1c improvement relative to the standard

of care; the percentage of visits for which the algorithm’s recommendations differed from

the observed standard of care; and the mean HbA1c benefit relative to standard of care for


patients where the algorithm’s recommendation differed from the observed care.
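The counterfactual-imputation step described above can be sketched in a few lines. The following is a minimal NumPy version under simplifying assumptions (the function name and toy data are illustrative, not from the study, which imputes per treatment regimen on real clinical features):

```python
import numpy as np

def knn_impute(X_query, t, X_train, z_train, y_train, k=5):
    """Impute the outcome each query subject would have had under treatment t:
    average the outcomes of the k nearest training subjects who actually
    received treatment t (Euclidean distance on the features)."""
    mask = z_train == t
    Xt, yt = X_train[mask], y_train[mask]
    imputed = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        nearest = np.argsort(np.linalg.norm(Xt - x, axis=1))[:k]
        imputed[i] = yt[nearest].mean()
    return imputed
```

In the evaluation above, this is run once per candidate treatment so that every method's prescriptions can be scored against imputed outcomes on the test set.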

We varied the number of training samples from 1,000–50,000 (with the test set fixed)

to examine the effect of the amount of training data on out-of-sample performance. We

repeated this process for 100 different splittings of the data into training and testing to

minimize the effect of any individual split on our results.

In addition to methods defined in Section 3.4.3, we compare the following approaches:

• Baseline: The baseline method continues the current line of care for each patient.

• Oracle: For comparison purposes, we include an oracle method that selects the best

outcome for each patient using the imputed counterfactuals on the test set. This

method therefore represents the best possible performance on the data.

• Regress-and-compare: In addition to RC–LASSO and RC–RF, we include k-nearest neighbors regress-and-compare, denoted RC–kNN, to match the approaches used in Bertsimas et al. [2017].

The results of the experiments are shown in Figure 3-11. We see that our results for the

regress-and-compare methods mirror those of Bertsimas et al. [2017]; RC–kNN is the best

performing regression method for prescriptions, and the performance increases with more

training data. RC–LASSO increases in performance with more data as well, but performs

uniformly worse than kNN. RC–RF performs strongly with limited data, but does not im-

prove as more training data becomes available. OPT(0.5) offers the best performance across

all training set sizes. Compared to RC–kNN, OPT(0.5) is much stronger at smaller training

set sizes, supporting our intuition that it makes better use of the data by considering all

treatments simultaneously rather than partitioning based on treatment. At higher training

set sizes, the performance behaviors of RC–kNN and OPT(0.5) become similar, suggesting

that the methods may be approaching the performance limits of the dataset.

These computational experiments offer strong evidence that the prescriptions of OPT are

at least as strong as those from RC–kNN, and much stronger at smaller training set sizes.

The other critical advantage is the increased interpretability of OPT compared to RC–kNN,


[Figure 3-11: mean HbA1c change, conditional HbA1c change, and proportion differing from standard of care, each plotted against training size (10^3 to 10^4.5) for Baseline, Oracle, RC–kNN, RC–LASSO, RC–RF, and OPT(0.5).]

Figure 3-11: Comparison of methods for personalized diabetes management. The leftmostplot shows the overall mean change in HbA1c across all patients (lower is better). Thecenter plot shows the mean change in HbA1c across only those patients whose prescriptiondiffered from the standard-of-care. The rightmost plot shows the proportion of patientswhose prescription was changed from the standard-of-care.


which is itself already more interpretable than other regress-and-compare approaches. To

interpret the RC–kNN prescription for a patient, one must first find the set of nearest

neighbors to this point among each of the possible treatments. Then, in each group of

nearest neighbors, we must identify the set of common characteristics that determine the

efficacy of the corresponding treatment on this group of similar patients. When interpreting

the OPT prescription, the tree structure already describes the decision mechanism for the

treatment recommendation, and is easily visualizable and readily interpretable.

3.5.3 Personalized Job Training

In this section, we apply our methodology on the Jobs dataset [LaLonde, 1986], a widely

used benchmark dataset in the causal inference literature, where the treatment is job train-

ing and the outcomes are the annual earnings after the training program. This dataset is

obtained from a study based on the National Supported Work program and can be down-

loaded from http://users.nber.org/~rdehejia/nswdata2.html. This study consists of

297 and 425 individuals in the control and treated groups respectively, where the treatment indicator z_i is 1 if the subject received job training in 1976–77 and 0 otherwise. The dataset has seven covariates, which include age, education, race, marital status, whether the individual earned a degree, and prior earnings (earnings in 1975); the outcome y_i is 1978 annual earnings.

We split the full dataset into 70/30 training/testing samples, and averaged the results

over 100 such splits to plot the out-of-sample average personalized income. Since the counterfactuals are not known for this example, we employ a nearest-neighbor matching algorithm,

identical to the one used in Section 3.5.2, to impute the counterfactual values on the test

set. Using these imputed values, we compute the cost of policies prescribed by each of the

following methods. Note that for this example, the higher the out-of-sample income, the

better.

We compare the same methods as in Section 3.5.2, with the addition of causal forests (CF), since this

problem only has two treatment options.


Method        Average income ($)   Standard error ($)
Baseline      5467.09              10.81
CF            5908.23              17.92
RC–kNN        5913.44              17.79
RC–RF         5916.22              17.78
RC–LASSO      5990.85              18.94
OPT(0.5)-L    6000.02              18.07
Oracle        7717.96              17.16

Table 3.1: Average personalized income on the test set for various methods.

[Figure 3-12: mean out-of-sample income plotted against inclusion rate (0–1) for OPT(0.5)-L, RC–kNN, RC–RF, RC–LASSO, and CF.]

Figure 3-12: Out-of-sample average personalized income as a function of inclusion rate.

In Table 3.1, we present the average net personalized income on the test set, as prescribed

by each of the five methods. For each method, we only prescribe a treatment for an individual

in the test set if the predicted treatment effect for that individual exceeds a threshold δ > 0. We vary δ and choose the value that yields the highest predicted average test-set income. We find the best such δ for each instance, and average

the best prescription income over 100 realizations for each method. From the results, we see

that OPT(0.5)-L obtains an average personalized income of $6000, which is higher than the

other methods. The next closest method is RC–LASSO, which obtains an average income

of $5991.
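The δ-threshold selection described above can be sketched as follows. This is a hypothetical helper under simplifying assumptions (the function name and data layout are illustrative, not from the study): given predicted treatment effects and imputed outcomes under each arm, it sweeps candidate thresholds and keeps the one maximizing mean imputed income.

```python
import numpy as np

def best_threshold_policy(effect_pred, y_treat, y_control, deltas):
    """For each threshold delta, treat only subjects whose predicted treatment
    effect exceeds delta; return the delta maximizing mean imputed income.

    effect_pred: predicted income gain from treatment, per subject
    y_treat / y_control: imputed outcomes under treatment / control
    """
    best_delta, best_income = None, -np.inf
    for delta in deltas:
        treat = effect_pred > delta
        income = np.where(treat, y_treat, y_control).mean()
        if income > best_income:
            best_delta, best_income = delta, income
    return best_delta, best_income
```

Sweeping δ in this way also traces out the inclusion-rate curve reported in Figure 3-12, since each δ induces a fraction of treated subjects.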

In Figure 3-12, we present the out-of-sample incomes as a function of the fraction of


subjects for which the intervention is prescribed (the inclusion rate), which we obtain by

varying the threshold δ described above. We see that the average income in the test set is

highest for OPT(0.5)-L at all values of the inclusion rate, indicating that our OPT method

is best able to estimate the personalized treatment effect across all subjects. We also see

that the income peaks at a relatively low inclusion rate, showing that we are able to easily

identify a subset of the subjects with large treatment effect.

3.5.4 Estimating Personalized Treatment Effects for Infant Health

In this section, we apply our method for estimating the personalized treatment effect of high

quality child care specialist home visits on the future cognitive test scores of children. This

dataset is based on the Infant Health Development Program (IHDP) and was compiled by Hill

[2011]. This dataset is commonly used as a benchmark in the causal inference literature.

Several authors [Morgan and Winship, 2014, Wager and Athey, 2018, Zubizarreta, 2012] have since used it for benchmarking methods. Following Hill

[2011], the original randomized control trial was made imbalanced by removing a biased

subset of the group that had specialist home visits. The final dataset consists of 139 and 608

subjects in the treatment and control groups respectively, with zi = 1 indicating treatment

(specialist home visit), and a total of 25 covariates. These include child measurements such as birth weight, head circumference, weeks born pre-term, and sex; behaviors engaged in during the pregnancy, such as cigarette smoking and alcohol and drug consumption; and measurements on the mother at the time she gave birth, such as age, marital status, and educational attainment.

In this example we focus on estimating the individual treatment effect, since it has been

acknowledged that the program has been successful in raising test scores of treated children

compared to the control group (see references in Hill [2011]). The outcomes are simulated

in such a way that the average treatment effect on the control subjects is positive (setting B

in Hill [2011] with no overlap). However, note that even though the sign and magnitude of

the average treatment effect is known, there is still heterogeneity in the magnitudes of the


Method        Mean R²   Standard error
CF            0.543     0.015
RC–LASSO      0.639     0.018
RC–RF         0.704     0.013
OPT(0.5)-L    0.759     0.013

Table 3.2: Average R2 on the test set for various methods for estimating the personalizedtreatment effect.

individual treatment effects. In all our experiments, we split the data into training/test as

90/10, and compute the error of the treatment effect estimates on the test set compared to

the true noiseless outcomes (known). We average this value over 100 splits of the dataset,

and compare the test set performance for each method.

In Table 3.2, we present the means and standard errors of the R2 of the personalized

treatment effect estimates on the test set, given by each of the four methods. We see that

OPT(0.5)-L obtains the highest average R² value of 0.759, followed by RC–RF

with 0.704. This again gives strong evidence that our OPT methods can deliver high-quality

prescriptions whilst simultaneously maintaining interpretability.

3.6 Conclusions

In this chapter, we present an interpretable approach to personalizing treatments that learns from observational data. Our method relies on iterative splitting of the feature space, and can handle more than two treatment options. We apply this method to synthetic and real-world datasets, and illustrate its superior prescriptive power compared to other state-of-the-art methods.


Chapter 4

Prescriptive Scenario Reduction for

Stochastic Optimization

4.1 Introduction

A wide range of decision problems that involve optimization under uncertainty are formu-

lated as stochastic optimization problems. For instance, consider a production planning

problem, where the decision maker wishes to make strategic decisions on plant sizing and

allocating resources among plants. Later, when demand is realized, the decision maker has to make tactical decisions about storing, processing, and shipping these products to the markets, minimizing expected costs while satisfying the relevant plant capacity constraints, given the first-stage decision. Taking this second-stage decision-making into account can lead to better first-stage strategic decisions.

More generally, such problems fall in the setting where a practitioner aims to select the

best possible decision that satisfies certain constraints, but with the knowledge that the

outcome of this decision is influenced by the realization of a random event. The quality of

a decision is judged by averaging its cost over all possible realizations of this random event.

These models can be applied to formulate problems in various areas such as finance, energy,

fleet management, and supply chain optimization, to name a few. For a more comprehensive


list of applications, we refer the reader to Wallace and Ziemba [2005].

Traditional stochastic optimization formulates this as finding an optimal decision, which

among all feasible candidates in the set Z, has the lowest average cost when averaged over

all possible realizations of the uncertain parameter Y . In other words, these problems can

be formulated as

$$\min_{z \in Z} \ \mathbb{E}_Y[c(z; Y)]. \qquad (4.1)$$

For instance, in inventory management problems, the uncertainty Y may refer to demand

data, or time series of stock returns in portfolio optimization problems. We provide concrete

examples of such cost functions:

• Inventory management (Newsvendor problem):

$$c(z; Y) = \max\{b(Y - z),\, h(z - Y)\}, \qquad (4.2)$$

where Y refers to demand of the item and z is the amount of inventory (decision

variable). The values b > 0 and h > 0 are prespecified parameters that represent the

backorder and holding cost respectively.

• Portfolio optimization:

$$c((z, \beta); Y) = -\lambda z'Y + \beta + \frac{1}{\varepsilon} \max\{-z'Y - \beta,\, 0\}, \qquad (4.3)$$

where Y and z are vectors of stock returns and corresponding investments (decision

variable) respectively, and β is an auxiliary decision variable. Minimizing the cost

c(z;Y ) ensures high returns z′Y , while at the same time controlling the risk, which

here is given by CVaR (Conditional Value-at-Risk) of negative returns at level ε, as

$$\mathrm{CVaR}_{\varepsilon}(z'Y) = \inf_{\beta} \Bigl\{ \beta + \frac{1}{\varepsilon} \mathbb{E}\bigl[\max\{-z'Y - \beta,\, 0\}\bigr] \Bigr\}.$$

The quantities ε ∈ (0,1), which parametrizes the risk measure, and λ > 0, which trades off risk and return, are prespecified parameters.
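Both cost functions are straightforward to evaluate. A minimal NumPy sketch follows; the parameter values (b = 4, h = 1, λ = 1, ε = 0.05) are illustrative assumptions, not values from the text:

```python
import numpy as np

def newsvendor_cost(z, y, b=4.0, h=1.0):
    # c(z; Y) = max{b(Y - z), h(z - Y)}  -- Equation (4.2)
    return np.maximum(b * (y - z), h * (z - y))

def portfolio_cost(z, beta, y, lam=1.0, eps=0.05):
    # c((z, beta); Y) = -lam * z'Y + beta + (1/eps) * max{-z'Y - beta, 0}  -- Equation (4.3)
    ret = z @ y
    return -lam * ret + beta + np.maximum(-ret - beta, 0.0) / eps

# Backorder case of (4.2): demand 12 exceeds inventory 10, cost b*(12-10) = 8.0
backorder = newsvendor_cost(10.0, 12.0)
```

Note the asymmetry of both costs in the uncertain quantity; this asymmetry is what later separates prescriptive scenario reduction from the cost-agnostic Wasserstein approach.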


We assume Z, the set of feasible decisions, is a non-empty convex compact set, and is

independent of uncertainty Y .

While we wish to solve Problem (4.1), the true distribution of the uncertainty Y is

typically unknown. Even if it is fully known, solving the exact optimization problem may

not be tractable. In the context of data-driven stochastic optimization, where past data,

consisting of n samples ξ1, . . . , ξn, is assumed to be known, a commonly used approach to

approximate Problem (4.1) is Sample Average Approximation (SAA) [Shapiro et al., 2009a].

Under this approach, the problem we wish to solve is

$$\min_{z \in Z} \ \frac{1}{n} \sum_{i=1}^{n} c(z; \xi^i). \qquad (4.4)$$

It is easy to see that this approach, in effect, approximates the unknown full distribution

with the empirical distribution with each data point ξi equally probable. In fact, Kleywegt

et al. [2002] show that, under some regularity conditions, the optimal objective value and

solution of Problem (4.4) converge to their counterparts of Problem (4.1) as n increases,

regardless of the distribution of ξ. For more recent advances in SAA, we direct the reader

towards Homem-de Mello and Bayraksan [2014], Rahimian et al. [2018] and the references

therein.
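As a concrete illustration, the SAA problem (4.4) with the newsvendor cost (4.2) is a one-dimensional convex problem. A minimal sketch in Python follows, assuming synthetic gamma-distributed demand and illustrative costs b = 4, h = 1 (assumptions, not data from the text); the grid search is compared against the known closed form, the b/(b+h) sample quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.gamma(shape=2.0, scale=50.0, size=1000)  # synthetic demand samples
b, h = 4.0, 1.0  # backorder and holding costs (illustrative)

def saa_objective(z, xi, b, h):
    # (1/n) * sum over samples of c(z; xi_i) = max{b(xi_i - z), h(z - xi_i)}
    return np.mean(np.maximum(b * (xi - z), h * (z - xi)))

# Brute-force SAA: minimize over a fine grid of candidate order quantities
grid = np.linspace(xi.min(), xi.max(), 2001)
z_saa = grid[np.argmin([saa_objective(z, xi, b, h) for z in grid])]

# Closed form: the SAA minimizer is the b/(b+h) sample quantile of demand
z_quantile = np.quantile(xi, b / (b + h))
```

The two solutions agree up to the grid resolution, confirming the quantile characterization that Proposition 1 later exploits.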

In this chapter, we consider the approach of scenario reduction, which approximates the

empirical distribution with a smaller distribution with scenarios ζ1, . . . , ζm and corresponding

probabilities q_1, . . . , q_m, for m ≪ n. To be more precise, we use knowledge of the cost function

and constraints while computing this reduced distribution which, as we shall demonstrate,

results in higher quality decisions. In situations where n is very large and even the SAA

problem (4.4) is not tractable, such an approach can substantially improve tractability while

ensuring minimal loss in decision quality. Another key benefit for practitioners is interpretability: solutions of higher quality are computed with significantly fewer scenarios. In this chapter, we demonstrate that using optimization to compute this smaller set of scenarios, taking the cost function into account, can substantially increase accuracy and interpretability.


We note that scenario reduction with the Wasserstein distance and the Euclidean norm is

equivalent to clustering, with the problem reducing to assigning n points to m clusters with

the scenarios chosen as the cluster-mean. The corresponding scenario probability is simply

the size of the cluster divided by n. The central idea in our approach is that these scenarios

and assignments should be chosen keeping in mind their decision quality, rather than the

cost-agnostic least squares (or any general norm) error. We demonstrate that while this

approach leads to more complicated optimization problems, the resulting distributions often

have superior decision quality. This gap is particularly enhanced when the cost function is

not symmetric, unlike a norm which penalizes scenarios, irrespective of their decision quality,

simply based on the norm distance between the empirical and new scenarios.
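The equivalence with clustering noted above makes the cost-agnostic baseline easy to implement: continuous scenario reduction under the Euclidean Wasserstein distance is plain k-means. The following is a minimal Lloyd's-algorithm sketch; the helper name and initialization scheme are assumptions for illustration, not the heuristics of the cited works:

```python
import numpy as np

def kmeans_scenario_reduction(xi, m, iters=100, seed=0):
    """Continuous scenario reduction under the Wasserstein distance with the
    Euclidean norm: plain k-means. Returns the m scenarios (cluster means)
    and their probabilities |I_j| / n."""
    rng = np.random.default_rng(seed)
    n = len(xi)
    zeta = xi[rng.choice(n, size=m, replace=False)]  # initial centroids
    for _ in range(iters):
        # assignment step: each sample goes to its nearest scenario
        d2 = ((xi[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # update step: each scenario becomes its cluster mean
        new = np.array([xi[assign == j].mean(axis=0) if np.any(assign == j)
                        else zeta[j] for j in range(m)])
        if np.allclose(new, zeta):
            break
        zeta = new
    q = np.bincount(assign, minlength=m) / n
    return zeta, q
```

The prescriptive approach developed below keeps this alternating structure but replaces the squared-distance assignment and mean update with cost-aware counterparts.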

4.1.1 Related literature

In this section, we review related work on scenario reduction for stochastic optimization

problems. Dupačová et al. [2003] present theory and algorithms for scenario reduction using

probability metrics, while Heitsch and Römisch [2003] derive bounds for forward and back-

ward scenario selection heuristics. More recently, Rujeerapaiboon et al. [2017] analyze worst

case bounds of scenario reduction using the Wasserstein metric, and propose heuristics with

worst case approximation error guarantees and an exact mixed integer optimization formu-

lation. These heuristics are based on the alternating-minimization algorithm for k-means

clustering [Arya et al., 2004].

More generally, our work fits in the area of research demonstrating the advantages of

optimization over randomization. Some related work includes Bertsimas et al. [2015], which

demonstrates that using optimization to reduce discrepancy between groups, rather than

randomization, leads to stronger inference.

4.1.2 Contributions and Structure

The contributions of this work are as follows:

1. We present a novel optimization based approach for scenario reduction for stochastic


optimization problems. As part of this approach, we introduce “Prescriptive divergence”, which measures the difference in quality of decisions induced by two discrete

distributions, and includes the Wasserstein distance as a special case.

2. We propose scenario reduction in this context, and present algorithms for computing

these scenarios and corresponding probabilities. Our approach relies on an alternating

minimization algorithm, where we solve a sequence of convex optimization problems

for determining the scenarios.

3. Finally, with the help of computational results we demonstrate the effectiveness of

these methods on constrained newsvendor and portfolio optimization problems, both

in-sample and out-of-sample, compared to a traditional Wasserstein-distance based sce-

nario reduction approach. We note that our approach results in improved performance

with a smaller number of scenarios, which improves interpretability for practitioners.

4.1.3 Notation

Let e be the vector of all ones, and ei the ith standard basis vector of appropriate dimensions.

For any positive integer n, we define the set [n] = {1, . . . , n}. We denote a generic norm by ∥⋅∥, while ∥⋅∥_p denotes the p-norm, for p ≥ 1. Recall that the Euclidean norm of any vector x is defined as ∥x∥₂ = √(∑_i x_i²). For a set X ⊆ R^d, we define P(X, m) as the set of all probability distributions supported on at most m points belonging to X. The support of a probability distribution P is denoted by supp(P), and the Dirac delta “distribution” at ξ is denoted by δ(ξ). We define P_n(ξ^1, . . . , ξ^n) as the uniform distribution supported on the n distinct points ξ^i, which we equivalently represent as

$$\mathbb{P}_n(\xi^1, \ldots, \xi^n) = \sum_{i=1}^{n} \frac{1}{n}\, \delta(\xi^i).$$

Cost functions are denoted by c(z; y), where z ∈ R^{n_z} and y ∈ R^d represent the decision variable

and uncertainty respectively, and Z ⊆ Rd represents the nonempty convex set of feasible

decisions. For any given y, we assume that c(z; y) is a convex function of z. Finally, we


denote

$$c^*(\xi) = \min_{z \in Z} \ c(z; \xi),$$

the optimal objective value corresponding to the scenario ξ, where we assume c∗(ξ) to be

finite for every ξ.

4.2 Preliminaries

In this section, we discuss the notion of Wasserstein distance, which defines a distance

between two probability distributions, and the scenario generation problem.

4.2.1 Distance between distributions

Let P be a discrete probability distribution on scenarios ξ^1, . . . , ξ^n with corresponding probabilities p_1, . . . , p_n, and Q another discrete distribution on scenarios ζ^1, . . . , ζ^m with corresponding probabilities q_1, . . . , q_m. Note that

$$\sum_{i=1}^{n} p_i = 1 = \sum_{j=1}^{m} q_j.$$

Next, we define the Wasserstein distance between these two discrete probability distributions.

Definition 1. The Wasserstein distance (induced by the ℓ₂ norm) between two discrete

distributions P and Q, which we denote as dW (P,Q), is defined as the square root of the

optimal objective value of the following problem:

$$
\begin{aligned}
\min_{\pi \in \mathbb{R}^{n \times m}_{+}} \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij}\, \|\xi^i - \zeta^j\|^2 \\
\text{subject to} \quad & \sum_{j=1}^{m} \pi_{ij} = p_i \quad \forall i \in [n], \\
& \sum_{i=1}^{n} \pi_{ij} = q_j \quad \forall j \in [m].
\end{aligned}
\qquad (4.5)
$$

The linear optimization problem (4.5) used to define the Wasserstein distance can be


interpreted as a minimum-cost transportation problem from n sources to m destinations.

Here, π_ij represents the amount of probability mass shipped from ξ^i to ζ^j at a transportation cost of ∥ξ^i − ζ^j∥² per unit. Note that the probabilities π_ij sum to one, as

$$\sum_{i=1}^{n} \sum_{j=1}^{m} \pi_{ij} = \sum_{i=1}^{n} p_i = 1,$$

and hence is not included in Problem (4.5) since it is a redundant constraint. Therefore,

Problem (4.5) is an optimal transportation problem whose objective function is minimizing

the overall cost of moving probability mass from the initial distribution P to the target

distribution Q.
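Problem (4.5) is a small linear program and can be solved directly. A minimal sketch follows using `scipy.optimize.linprog`; the solver choice and helper name are implementation assumptions, not tools used in the text:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_sq(xi, p, zeta, q):
    """Squared Wasserstein distance of Problem (4.5) between two discrete
    distributions: support xi (n, d) with probabilities p, and support
    zeta (m, d) with probabilities q. Solves the transportation LP in pi."""
    n, m = len(xi), len(zeta)
    # cost_ij = ||xi_i - zeta_j||^2, flattened row-major to match pi.ravel()
    cost = ((xi[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2).ravel()
    A_eq, b_eq = [], []
    for i in range(n):  # row marginals: sum_j pi_ij = p_i
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(p[i])
    for j in range(m):  # column marginals: sum_i pi_ij = q_j
        row = np.zeros(n * m); row[j::m] = 1.0
        A_eq.append(row); b_eq.append(q[j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```

For example, transporting the uniform distribution on {0, 2} to a point mass at 1 costs 1 in squared distance, so d_W = 1.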

Next, we introduce the idea of scenario reduction, which approximates a distribution

supported on n scenarios by a different distribution supported on m scenarios, with m

typically chosen to be significantly smaller than n. As part of this approach, both the new

reduced set of scenarios ζjmj=1 and their corresponding probabilities qjmj=1 are estimated.

Next, we describe the two variants of scenario reduction: discrete and continuous.

4.2.2 Scenario reduction

In this section, we describe the scenario reduction problem. For notational convenience, we

denote Pn(ξ1, . . . , ξn) as Pn.

Definition 2. The discrete scenario reduction problem is defined as

$$D_W(\mathbb{P}_n, m) = \min_{\mathbb{Q}} \ \bigl\{ d_W(\mathbb{P}_n, \mathbb{Q}) : \mathbb{Q} \in \mathcal{P}(\mathrm{supp}(\mathbb{P}_n), m) \bigr\} \qquad (4.6)$$

Definition 3. The continuous scenario reduction problem is defined as

$$C_W(\mathbb{P}_n, m) = \min_{\mathbb{Q}} \ \bigl\{ d_W(\mathbb{P}_n, \mathbb{Q}) : \mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, m) \bigr\} \qquad (4.7)$$

In Problem (4.6), the new scenarios must be selected from among the support of the

empirical distribution, given by the set {ξ^1, . . . , ξ^n}. In contrast, the continuous scenario


reduction problem (4.7) allows the scenarios to be chosen from outside the set of observations,

and allows for greater flexibility and better approximation of the empirical distribution.

However, in both these settings, the approximate distributions are computed without

taking into account the cost function c(z; y) and the feasible set Z. We address this in the

following section, where we first define an extension of the Wasserstein distance between two

distributions, and use it to compute scenarios tailored for the optimization problem at hand.

In this chapter, we focus our attention on the continuous scenario reduction approach, but

note that these techniques can be adapted for the discrete problem as well.

4.3 Prescriptive Scenario Reduction

In this section, we describe our approach for generating scenarios for stochastic optimization

problems. We define z∗(η) as an optimal decision corresponding to the scenario η, which is given by

$$z^*(\eta) \in \operatorname*{argmin}_{z \in Z} \ c(z; \eta).$$

For simplicity, we assume that there exists a unique optimal solution for every possible

scenario η, but we relax this assumption later.

Next, we define a prescriptive variant of the Wasserstein distance metric between two

probability distributions P, Q with respect to the cost c and constraint set Z, denoted d(Q ∣ P; c, Z).

4.3.1 Prescriptive divergence

Definition 4. Let P and Q be two discrete probability distributions in Rd, given by

$$\mathbb{P} = \sum_{i=1}^{n} p_i\, \delta(\xi^i), \qquad \mathbb{Q} = \sum_{j=1}^{m} q_j\, \delta(\zeta^j)$$


respectively. Then, d(Q∣P; c,Z) is given by the square root of the optimal objective value of

the following linear optimization problem:

$$
\begin{aligned}
d^2(\mathbb{Q} \mid \mathbb{P}; c, Z) = \min_{\pi \in \mathbb{R}^{n \times m}_{+}} \quad & \sum_{j=1}^{m} \sum_{i=1}^{n} \pi_{ij} \bigl( c(z^*(\zeta^j); \xi^i) - c(z^*(\xi^i); \xi^i) \bigr) \\
\text{subject to} \quad & \sum_{j=1}^{m} \pi_{ij} = p_i \quad \forall\, 1 \le i \le n, \\
& \sum_{i=1}^{n} \pi_{ij} = q_j \quad \forall\, 1 \le j \le m.
\end{aligned}
\qquad (4.8)
$$

We denote this as the Prescriptive divergence between the two distributions P and Q,

with respect to the cost function c(z; y) and constraint set Z. It is a non-symmetric measure

of the difference between two probability distributions, and hence not a metric distance.

Specifically, it is a measure of the loss in decision quality when Q is used to approximate

P. We observe that the optimal value of the optimization problem (4.8) is guaranteed to be

nonnegative, as

$$c(z^*(\zeta^j); \xi^i) \ \ge\ c(z^*(\xi^i); \xi^i) = \min_{z \in Z} c(z; \xi^i),$$

and hence each term is nonnegative for any choice of ζ^j.
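For the unconstrained newsvendor, where z*(η) = η and c*(ξ) = 0, the divergence (4.8) is again a transportation LP with cost entries c(ζ^j; ξ^i). A minimal SciPy-based sketch (an implementation choice for illustration, not a tool from the text):

```python
import numpy as np
from scipy.optimize import linprog

def prescriptive_divergence_sq(xi, p, zeta, q, b=4.0, h=1.0):
    """d^2(Q | P; c, Z) of Problem (4.8) for the unconstrained newsvendor:
    z*(eta) = eta and c*(xi) = 0, so the transport cost matrix is
    C_ij = max{b(xi_i - zeta_j), h(zeta_j - xi_i)}."""
    n, m = len(xi), len(zeta)
    C = np.maximum(b * (xi[:, None] - zeta[None, :]),
                   h * (zeta[None, :] - xi[:, None])).ravel()
    A_eq, b_eq = [], []
    for i in range(n):  # sum_j pi_ij = p_i
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(p[i])
    for j in range(m):  # sum_i pi_ij = q_j
        row = np.zeros(n * m); row[j::m] = 1.0
        A_eq.append(row); b_eq.append(q[j])
    res = linprog(C, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```

As expected from the definition, the divergence of a distribution from itself is zero, and it is nonnegative for any reduced distribution.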

We note that the Wasserstein distance dW can be recovered as a special case of the

Prescriptive divergence when the cost is given by c(z; η) = ∥z − η∥₂², the squared Euclidean norm between z and η, and the constraint set is Z = R^d. That is,

$$d(\mathbb{Q} \mid \mathbb{P};\ \|z - y\|^2;\ \mathbb{R}^d) = d_W(\mathbb{P}, \mathbb{Q}).$$

To see this, we note that the optimal decision for scenario η is given by

$$z^*(\eta) \in \operatorname*{argmin}_{z \in \mathbb{R}^d} \|z - \eta\|^2 = \eta.$$

Hence,

$$c(z^*(\zeta); \xi) = \|z^*(\zeta) - \xi\|^2 = \|\zeta - \xi\|^2,$$


and

$$\min_{z} \ c(z; \xi) = \|z^*(\xi) - \xi\|^2 = 0,$$

and we conclude that Problem (4.8) is equivalent to Problem (4.5).

4.3.2 Problem Formulation

Analogous to Problem (4.7), we define the continuous prescriptive scenario reduction problem

as

CP (Pn,m; c,Z) =minQd(Q∣∣Pn; c,Z) ∶ Q ∈ P(Rd,m). (4.9)

We denote by B(I, m) the family of all m-set partitions of the set I, i.e.,

$$\mathcal{B}(I, m) = \bigl\{ \{I_1, \ldots, I_m\} : \emptyset \neq I_1, \ldots, I_m \subseteq I,\ \textstyle\bigcup_j I_j = I,\ I_i \cap I_j = \emptyset\ \forall i \neq j \bigr\}.$$

Also, we denote a specific m-set partition as {I_j} ∈ B(I, m). Next, we note the following

result, which is similar to Theorem 1 in Rujeerapaiboon et al. [2017], that reformulates the

continuous prescriptive scenario reduction problem (4.9) as a set partitioning problem.

Theorem 3. The prescriptive scenario reduction problem (4.9) can be written as the follow-

ing problem of finding an optimal m-set partition:

$$C_P^2(\mathbb{P}_n, m; c, Z) = \min_{\{I_j\} \in \mathcal{B}(I, m)} \ \frac{1}{n} \sum_{j=1}^{m} \min_{\zeta^j} \sum_{i \in I_j} \Bigl( c(z^*(\zeta^j); \xi^i) - \min_{z \in Z} c(z; \xi^i) \Bigr). \qquad (4.10)$$

Proof. Following the argument of Theorem 2 in Dupačová et al. [2003], we argue that the

optimal Prescriptive divergence between Pn and any distribution Q supported on a finite set

Ψ is given by

$$\min_{\mathbb{Q} \in \mathcal{P}(\Psi, \infty)} d^2(\mathbb{Q} \mid \mathbb{P}_n; c, Z) = \frac{1}{n} \sum_{i=1}^{n} \min_{\zeta \in \Psi} \bigl( c(z^*(\zeta); \xi^i) - c^*(\xi^i) \bigr),$$

where P(Ψ,∞) denotes the set of all probability distributions supported on the finite set Ψ.

The continuous scenario reduction problem (4.7), but with the Prescriptive divergence instead of the squared Euclidean distance, can be written as the following problem of finding the set Ψ with m elements that leads to the smallest objective value. Letting Ψ = {ζ^1, . . . , ζ^m}, we have

$$C_P^2(\mathbb{P}_n, m; c, Z) = \min_{\zeta^1, \ldots, \zeta^m} \ \frac{1}{n} \sum_{i=1}^{n} \min_{j \in [m]} \bigl( c(z^*(\zeta^j); \xi^i) - c^*(\xi^i) \bigr). \qquad (4.11)$$

Next, we show that Problem (4.11) is equivalent to Problem (4.10). Given an optimal

solution ζ^{1∗}, . . . , ζ^{m∗} to Problem (4.11), we construct a partition such that

$$I_j = \Bigl\{ i : c(z^*(\zeta^{j*}); \xi^i) = \min_{k \in [m]} c(z^*(\zeta^{k*}); \xi^i) \Bigr\},$$

which leads to Problem (4.10) having the same objective as Problem (4.11) (we break ties

arbitrarily). For the other direction, given an optimal partition I1, . . . , Im and corresponding

inner minimizing scenarios ζ1∗, . . . , ζm∗ of Problem (4.10), it is easy to see that these scenarios

will also be an optimal solution of Problem (4.11) with identical objective value. This

completes the proof.

We note that Problem (4.10) can also be interpreted as a clustering problem, where

the n points, ξ1, . . . , ξn, are partitioned into m clusters with centroids ζ1, . . . , ζm. Both

the cluster assignments and the centroids within each cluster are chosen to minimize the

cumulative prescriptive divergence to the n sample points. For the jth cluster, each optimal

scenario ζj∗ is chosen such that z∗(ζj) is close (or in some cases, equal) to the optimal SAA

solution for scenarios in Ij. In other words,

$$\zeta^{j*} \in \operatorname*{argmin}_{\zeta} \ \sum_{i \in I_j} c(z^*(\zeta); \xi^i).$$
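The clustering view above suggests alternating between the assignment step and the scenario-update step. The following is a minimal sketch for the unconstrained newsvendor, where z*(ζ) = ζ, c*(ξ) = 0, and the inner argmin reduces to a b/(b+h) quantile per cluster; the function name and initialization are illustrative assumptions, not the exact algorithm of the text:

```python
import numpy as np

def prescriptive_reduction_newsvendor(xi, m, b=4.0, h=1.0, iters=100):
    """Alternating minimization for Problem (4.10) on the unconstrained
    newsvendor. Assignment: send each sample to the scenario with lowest
    prescriptive cost. Update: each scenario becomes the b/(b+h) quantile
    of its cluster (the inner argmin for this cost)."""
    zeta = np.quantile(xi, np.linspace(0.1, 0.9, m))  # crude initialization
    for _ in range(iters):
        cost = np.maximum(b * (xi[:, None] - zeta[None, :]),
                          h * (zeta[None, :] - xi[:, None]))
        assign = cost.argmin(axis=1)
        new = np.array([np.quantile(xi[assign == j], b / (b + h))
                        if np.any(assign == j) else zeta[j] for j in range(m)])
        if np.allclose(new, zeta):
            break
        zeta = new
    q = np.bincount(assign, minlength=m) / len(xi)
    return zeta, q
```

With m = 1 this collapses to the single b/(b+h) sample quantile, which is exactly the behavior Proposition 1 establishes.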

Once the distribution Q (described by scenarios ζ^j and probabilities q_j = |I_j|/n) has been

computed, the decision z(Q) is given by optimizing for the cost under this reduced distribu-

tion Q, i.e.,

$$z(\mathbb{Q}) \in \operatorname*{argmin}_{z \in Z} \ \mathbb{E}_{Y \sim \mathbb{Q}}[c(z; Y)] = \operatorname*{argmin}_{z \in Z} \ \sum_{j=1}^{m} q_j\, c(z; \zeta^j).$$

We emphasize that while traditional scenario reduction aims to compute Q “close” to P,


prescriptive scenario reduction takes into account the quality of decisions induced and finds

Q such that z(Q) is “close to” z(P) in terms of decision quality.

When m = n, the scenarios satisfy ζ^i = ξ^i for all i, since d(P ∣ P; c, Z) = 0. Thus, the optimal

decision is the same as the SAA solution on the full n scenarios, which is the best decision

that can be computed using this data. In fact, as the following example shows, minimizing

this Prescriptive divergence with just m = 1 scenario finds the SAA solution which the

standard Wasserstein method would find with m = n.

Let us consider the simple unconstrained newsvendor problem, where the decision variable determines how much inventory to stock in the presence of uncertain demand, with cost function

c(z; ξ) = max{ b(ξ − z), h(z − ξ) }, (4.12)

for known b, h > 0. The parameters b and h can be interpreted as back-order and holding costs, incurred when the observed demand ξ exceeds the inventory z and when z exceeds ξ, respectively.

Proposition 1. For the unconstrained newsvendor problem with cost given by Equation (4.12), minimizing the prescriptive divergence with m = 1 recovers the optimal SAA solution.

Proof. When we perform traditional scenario reduction with m = 1, the Wasserstein metric selects the single scenario ζ = (1/n) ∑_{i=1}^n ξ^i, the sample mean. Next, we note that for any η, the corresponding optimal solution z*(η) is given by

z*(η) ∈ argmin_z c(z; η)
     ∈ argmin_z b(η − z)^+ + h(z − η)^+
     = η.


Thus, we see that using the divergence D(Q|P; c, Z) with m = 1 selects the scenario

ζ* ∈ argmin_ζ ∑_{i=1}^n b(ξ^i − z*(ζ))^+ + h(z*(ζ) − ξ^i)^+
   ∈ argmin_ζ ∑_{i=1}^n b(ξ^i − ζ)^+ + h(ζ − ξ^i)^+
   = Q^{(b/(b+h))}(ξ^1, …, ξ^n),

where Q^{(β)}(η^1, …, η^N) denotes the β-quantile of the sample η^1, …, η^N. In fact, we emphasize that Q^{(b/(b+h))}(ξ^1, …, ξ^n) is the optimal SAA solution, which the prescriptive scenario reduction method obtains with just one scenario.
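To make Proposition 1 concrete, the following sketch checks numerically that the b/(b+h) sample quantile attains the SAA-optimal empirical cost, while the single Wasserstein scenario (the sample mean) does not. The cost parameters and the gamma-distributed demand samples are illustrative, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
b, h = 4.0, 1.0                                   # back-order and holding costs (illustrative)
xi = rng.gamma(shape=2.0, scale=5.0, size=1000)   # n demand samples (illustrative)

def saa_cost(z, samples):
    """Empirical newsvendor cost (1/n) * sum_i max{b(xi_i - z), h(z - xi_i)}."""
    return np.mean(np.maximum(b * (samples - z), h * (z - samples)))

# Wasserstein reduction with m = 1 keeps the sample mean as the single scenario,
# and the induced decision is z*(mean) = mean.
z_wass = xi.mean()

# Prescriptive reduction with m = 1 selects the b/(b+h) sample quantile,
# which coincides with the SAA solution on all n scenarios.
z_presc = np.quantile(xi, b / (b + h))

assert saa_cost(z_presc, xi) <= saa_cost(z_wass, xi)
```

With b = 4 and h = 1 the prescriptive scenario sits at the 0.8-quantile, well above the mean, so the gap between the two decisions is substantial.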

However, we note that in the presence of constraints, or for general objective functions, z*(η) may not admit a closed-form expression, or may not even be unique. To address this issue, we introduce a variant of the prescriptive divergence in which we consider the worst case over the set of optimal solutions Z*(ζ), for every ζ. To be precise, we define

Z*(ζ) = { z ∈ Z : c(z; ζ) ≤ min_{z′∈Z} c(z′; ζ) }, (4.13)

and modify the definition presented in Equation (4.8) as

d_2(Q|P; c, Z) = min_{π∈R^{n×m}_+} ∑_{j=1}^m ∑_{i=1}^n π_ij max_{z∈Z*(ζ^j)} ( c(z; ξ^i) − c*(ξ^i) )

subject to ∑_{j=1}^m π_ij = p_i ∀ 1 ≤ i ≤ n,
∑_{i=1}^n π_ij = q_j ∀ 1 ≤ j ≤ m. (4.14)

Note that when the sets Z*(ζ^j), j ∈ [m], are each singletons, Equations (4.14) and (4.8) are identical.

Next, we present our approach to scenario reduction in this framework. Motivated by the newsvendor and CVaR objectives described in Equations (4.2) and (4.3) respectively, we consider the following two classes of cost functions:


1. Piecewise (separately) linear cost:

c(z; y) = max_{1≤t≤k} ( a_t′ z + b_t′ y ), (4.15)

for known vectors a_t, b_t of appropriate sizes, t ∈ [k].

2. Piecewise bilinear cost:

c(z; y) = max_{1≤t≤k} z′ A_t y, (4.16)

for known matrices A_t, t ∈ [k].

Given the points belonging to the j-th cluster I_j, in order to compute the scenario ζ^j we wish to solve the problem

min_ζ ∑_{i∈I_j} max_{z∈Z*(ζ)} c(z; ξ^i).

We note that this is not necessarily convex, and derive convex upper bound approximations

of this objective in the following section.

We first note the following result, which provides an upper bound for maxz∈Z∗(ζ) c(z; ξ).

Proposition 2. For any α ≥ 0, we have

max_{z∈Z*(ζ)} c(z; ξ) ≤ max_{z∈Z} ( c(z; ξ) − α c(z; ζ) ) + α min_{z∈Z} c(z; ζ).

Proof. For any fixed α ≥ 0, by the definition in Equation (4.13) and weak duality,

max_{z∈Z*(ζ)} c(z; ξ)
= max_{z∈Z} inf_{α≥0} [ c(z; ξ) + α( −c(z; ζ) + min_{z′∈Z} c(z′; ζ) ) ]
≤ inf_{α≥0} max_{z∈Z} [ c(z; ξ) + α( −c(z; ζ) + min_{z′∈Z} c(z′; ζ) ) ]
≤ max_{z∈Z} ( c(z; ξ) − α c(z; ζ) ) + α min_{z′∈Z} c(z′; ζ).


Using this result, we now focus our attention on the approximate problem

min_ζ ∑_{i∈I_j} max_{z∈Z} ( c(z; ξ^i) − α c(z; ζ) ) + α min_{z∈Z} c(z; ζ)

for each set I_j. We approximate the second term min_{z∈Z} c(z; ζ) by c(z*(I_j); ζ), which results in a further upper bound,

min_ζ ∑_{i∈I_j} max_{z∈Z} ( c(z; ξ^i) − α c(z; ζ) ) + α c(z*(I_j); ζ).

In the following sections, we discuss approximations of the first term, max_{z∈Z} ( c(z; ξ^i) − α c(z; ζ) ), for different cost functions.

4.3.3 Piecewise (separately) linear cost

In this case, the first term can be written as

max_{z∈Z} ( c(z; ξ^i) − α c(z; ζ) )
= max_{z∈Z} ( max_{t∈[k]} ( a_t′ z + b_t′ ξ^i ) − α max_{t∈[k]} ( a_t′ z + b_t′ ζ ) )
≤ max_{z∈Z} max_{t∈[k]} ( a_t′ (z − α z) + b_t′ (ξ^i − α ζ) ).

Choosing α = 1, we get the following convex approximate problem for ζ^j:

min_ζ ∑_{i∈I_j} ( max_{t∈[k]} b_t′ (ξ^i − ζ) + max_{t∈[k]} ( a_t′ z*(I_j) + b_t′ ζ ) ).

Note that the full problem of finding partitions and scenarios is given by

min_π ∑_{j=1}^m min_{ζ^j∈R^d} ∑_{i=1}^n π_ij ( max_{t=1,…,k} b_t′ (ξ^i − ζ^j) + max_{t=1,…,k} ( a_t′ z*(I_j) + b_t′ ζ^j ) )

subject to π e = e,
z*(I_j) ∈ argmin_{z∈Z} ∑_{i=1}^n π_ij c(z; ξ^i),
π ∈ {0,1}^{n×m}.


4.3.4 Piecewise bilinear cost

In this case, the first term can be written as

max_{z∈Z} ( c(z; ξ^i) − α c(z; ζ) )
= max_{z∈Z} ( max_{t∈[k]} z′ A_t ξ^i − α max_{t∈[k]} z′ A_t ζ )
≤ max_{z∈Z} max_{t∈[k]} z′ A_t (ξ^i − α ζ)
= max_{t∈[k]} max_{z∈Z} z′ A_t (ξ^i − α ζ).

Choosing α = 1, we get the following convex approximate problem for ζ^j:

min_ζ ∑_{i∈I_j} ( max_{t∈[k]} max_{z∈Z} z′ A_t (ξ^i − ζ) + max_{t∈[k]} z*(I_j)′ A_t ζ ).

Note that the full problem of finding partitions and scenarios is given by

min_π ∑_{j=1}^m min_{ζ^j∈R^d} ∑_{i=1}^n π_ij ( max_{t=1,…,k} max_{z∈Z} z′ A_t (ξ^i − ζ^j) + max_{t=1,…,k} z*(I_j)′ A_t ζ^j )

subject to π e = e,
z*(I_j) ∈ argmin_{z∈Z} ∑_{i=1}^n π_ij c(z; ξ^i),
π ∈ {0,1}^{n×m}.

4.3.5 Prediction error penalization

We note that this approach can often lead to overly “optimistic” distributions. One way of controlling for this is to penalize the prediction error within each cluster. This idea has been demonstrated to improve prescriptive performance in related problem settings [Bertsimas et al., 2019a,b]. In fact, Kao et al. [2009] develop an estimator that accounts for the decision objective when computing regression coefficients, and is a convex combination of the ordinary least squares and prescriptive losses. We now introduce the formulation, where the parameter µ ∈ [0,1] is chosen via cross-validation.

min_π ∑_{j=1}^m min_{ζ^j∈R^d} ∑_{i=1}^n π_ij ( µ F_ij + (1 − µ) ∥ξ^i − ζ^j∥^2 )

subject to π e = e,
z*(I_j) ∈ argmin_{z∈Z} ∑_{i=1}^n π_ij c(z; ξ^i),
π ∈ {0,1}^{n×m}, (4.17)

where F_ij is given by

F_ij = max_{t=1,…,k} b_t′ (ξ^i − ζ^j) + max_{t=1,…,k} ( a_t′ z*(I_j) + b_t′ ζ^j )

for c(z; y) = max_{t∈[k]} ( a_t′ z + b_t′ y ), or

F_ij = max_{t=1,…,k} max_{z∈Z} z′ A_t (ξ^i − ζ^j) + max_{t=1,…,k} z*(I_j)′ A_t ζ^j

for c(z; y) = max_{t∈[k]} z′ A_t y.
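As a sanity check on the piecewise linear penalty, the following sketch evaluates F_ij for randomly generated (illustrative) vectors a_t, b_t; when ζ^j = ξ^i the first term vanishes, so F_ij reduces to the SAA cost of z*(I_j) evaluated at ξ^i.

```python
import numpy as np

def F_piecewise_linear(xi_i, zeta_j, z_star, A, B):
    """F_ij = max_t b_t'(xi_i - zeta_j) + max_t (a_t' z_star + b_t' zeta_j),
    where the rows of A and B hold the vectors a_t and b_t, t = 1,...,k."""
    return np.max(B @ (xi_i - zeta_j)) + np.max(A @ z_star + B @ zeta_j)

rng = np.random.default_rng(1)
k, d, dz = 6, 4, 3                 # pieces, scenario dim, decision dim (illustrative)
A, B = rng.normal(size=(k, dz)), rng.normal(size=(k, d))
xi_i, zeta_j, z_star = rng.normal(size=d), rng.normal(size=d), rng.normal(size=dz)

val = F_piecewise_linear(xi_i, zeta_j, z_star, A, B)
# When zeta_j = xi_i, the first term is max_t b_t' 0 = 0, so F_ij equals
# the piecewise linear cost c(z_star; xi_i) = max_t (a_t' z_star + b_t' xi_i).
same = F_piecewise_linear(xi_i, xi_i, z_star, A, B)
```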

4.4 Optimization algorithms

In this section, we present an alternating optimization framework for solving Problem (4.17).

4.4.1 Alternating optimization framework

1. Given a candidate partition I1, . . . , Im, we solve m convex optimization problems, by

considering the m inner minimizations over ζj separately in Problem (4.17).

2. Given scenarios ζ^j, j ∈ [m], we assign each point ξ^i to cluster j(i), given by

j(i) = argmin_{1≤j≤m} ( µ F_ij + (1 − µ) ∥ξ^i − ζ^j∥^2 ),


and update the partition I1, . . . , Im.

3. Return to Step 1, and stop when there is no change in assignments, when the improvement in objective value is smaller than a prespecified tolerance, or after a maximum number of iterations.

The initial sets I_1, …, I_m are chosen randomly. To further improve this procedure, we run the algorithm from several random restarts and choose the solution with the least prescriptive cost on a validation set.
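The loop above can be sketched as follows in the µ = 0 special case, where the per-point penalty reduces to ∥ξ^i − ζ^j∥² and the scenario update is simply the cluster mean; in the general method the assignment cost becomes µF_ij + (1 − µ)∥ξ^i − ζ^j∥² and the scenario update is a convex optimization problem. The two-blob data below is illustrative.

```python
import numpy as np

def alternating_reduction(Xi, m, n_iter=50, seed=0):
    """Alternating heuristic in the mu = 0 special case (k-means-like):
    Step 1 updates each cluster's scenario (here: the centroid),
    Step 2 reassigns each point to its cheapest scenario,
    Step 3 stops when the assignments no longer change."""
    rng = np.random.default_rng(seed)
    n = Xi.shape[0]
    assign = rng.integers(m, size=n)                  # random initial partition
    for _ in range(n_iter):
        # Step 1: per-cluster scenario update (re-seed empty clusters randomly).
        zetas = np.stack([Xi[assign == j].mean(axis=0) if np.any(assign == j)
                          else Xi[rng.integers(n)] for j in range(m)])
        # Step 2: reassign each point to the scenario with the least penalty.
        dists = ((Xi[:, None, :] - zetas[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):        # Step 3: fixed point
            break
        assign = new_assign
    return zetas, assign

Xi = np.vstack([np.zeros((20, 2)), 10 + np.zeros((20, 2))])  # two separated blobs
zetas, assign = alternating_reduction(Xi, m=2)
```

On the toy data the heuristic recovers the two blob centers as the m = 2 reduced scenarios; in practice one would wrap this in random restarts and validate as described above.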

4.5 Computational Examples

For the prescriptive divergence, the reduced scenarios are computed using an alternating optimization heuristic with random restarts, similar in spirit to the k-means algorithm. In our computational results, we compare the prescriptive cost (for Q obtained by Wasserstein and prescriptive scenario reduction), computed as

(1/|S|) ∑_{i∈S} c(z(Q); ξ^i).

We note that this cost quantifies the decision quality of a distribution Q which generates

the decision z(Q). For each of the two scenario reduction methods, we report this metric

both in-sample and out-of-sample, i.e., S is either the training or test set respectively. We

denote the two methods as P-SR (Prescriptive Scenario Reduction) and W-SR (Wasserstein

Scenario Reduction with squared Euclidean norm).

4.5.1 Portfolio optimization

First, we consider a portfolio optimization problem, given by

(z*(Q), β*(Q)) ∈ argmin_{z∈R^d_+, β∈R} β + E_{Y∼Q}[ (1/ε) max{ −z′Y − β, 0 } − λ z′Y ]

subject to e′z = 1. (4.18)


We generate data sampled as

Y = µ + Σ^{1/2} ε,

where µ ∼ N(0, I_{d×d}), the noise vector ε has independent standard normal entries, and the covariance matrix Σ has entries

Σ_ij = ρ^{|i−j|} ∀ 1 ≤ i, j ≤ d.

We sample a training set of n points from this distribution, with n = 1000 and d = 20. We

perform both Wasserstein and Prescriptive scenario reduction for different choices of m. We

repeat this for 100 different instances, and report the mean prescriptive risk averaged over

these instances. We choose parameters ε, λ as ε = 0.05, λ = 0.01, and correlation parameter

ρ = 0.8. The parameter ρ controls the correlation levels of the stock returns, with ρ = 0

implying no correlation, while ρ closer to +1 (−1) results in more positively (negatively)

correlated returns. Finally, we ensure that both the mean returns and all the return values

each exceed −1.00.
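The data-generation step can be sketched as follows; we take the noise to be a vector of i.i.d. standard normals, use the Cholesky factor as one valid square root of Σ, and omit the truncation that keeps returns above −1.00.

```python
import numpy as np

def sample_returns(n, d, rho, seed=0):
    """Draw n return vectors Y = mu + Sigma^{1/2} eps, where
    Sigma_ij = rho^|i-j| and mu and eps are standard normal."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=d)
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz correlation
    L = np.linalg.cholesky(Sigma)                        # Sigma = L L'
    eps = rng.normal(size=(n, d))
    return mu + eps @ L.T

Y = sample_returns(n=1000, d=20, rho=0.8)
```

With ρ = 0.8 the empirical correlation between adjacent assets should be close to 0.8, matching the role of ρ described above.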

In Figure 4-1, we compare the average in-sample performance of the distributions produced by the Wasserstein and prescriptive scenario reduction algorithms. We see that the prescriptive scenario reduction method outperforms the standard Wasserstein method for different values of m, and the gap narrows as the number of scenarios increases. This trend repeats itself in Figure 4-2 as well, which plots the out-of-sample performance against m, reinforcing the fact that these new distributions lead to an improvement out-of-sample as well.


Figure 4-1: Average in-sample prescriptive performance for various methods as a function of m, the number of reduced scenarios.


Figure 4-2: Average out-of-sample prescriptive performance for various methods as a function of m, the number of reduced scenarios.

4.5.2 Newsvendor problem with budget constraints

In this example, we consider an inventory manager with multiple products and a capacity constraint on the total inventory. The complete problem is given by

z*(Q) ∈ argmin_{z∈R^d_+} E_{Y∼Q}[ max_{1≤j≤d} ( b(Y_j − z_j)^+ + h(z_j − Y_j)^+ ) ]

subject to ∑_{j=1}^d z_j ≤ U.

Demand Y ∈ R^d is generated as

Y = µ + ε,

with mean demands µ_j ∼ U[4,5] and noise ε_j ∼ N(0,1) ∀ j ∈ [d]. The cost


parameters were chosen as b = 10 and h = 1. We sample n points from this distribution, with

n = 1000, d = 5. Note that this means the number of pieces in the cost function, k, equals 10.

We perform both Wasserstein and Prescriptive scenario reduction for various choices of m.

The out-of-sample cost is calculated over a test set of n_test = 100,000 points generated

from the same distribution as the training set. We repeat this for 50 different instances, and

report the mean prescriptive risk averaged over these instances.
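A minimal sketch of this cost function and its empirical (SAA) objective follows; the capacity U and the candidate decision z below are illustrative, as the source does not fix them.

```python
import numpy as np

b, h, d, U = 10.0, 1.0, 5, 25.0          # costs and products from the text; U illustrative

def cost(z, Y):
    """c(z; Y) = max_j [ b(Y_j - z_j)^+ + h(z_j - Y_j)^+ ], per sample row of Y."""
    under = b * np.maximum(Y - z, 0.0)   # back-order piece
    over = h * np.maximum(z - Y, 0.0)    # holding piece
    return (under + over).max(axis=1)

rng = np.random.default_rng(0)
mu = rng.uniform(4, 5, size=d)
Y = mu + rng.normal(size=(1000, d))

# Empirical (SAA) objective of a naive feasible stocking decision.
z = np.minimum(mu, U / d)
saa_obj = cost(z, Y).mean()
```

Because each product contributes the two linear pieces b(Y_j − z_j) and h(z_j − Y_j), the inner maximum has k = 2d = 10 pieces for d = 5, as noted above.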

In Figure 4-3, we compare the average in-sample performance of the distributions produced by the Wasserstein and prescriptive scenario reduction algorithms. As in the first example, the prescriptive scenario reduction method outperforms the standard Wasserstein method for different values of m in terms of in-sample prescriptive performance, and the gap narrows as the number of scenarios increases. A similar trend is observed for the out-of-sample performance in Figure 4-4.

Figure 4-3: Average in-sample prescriptive performance for various methods as a function of m, the number of reduced scenarios.


Figure 4-4: Average out-of-sample prescriptive performance for various methods as a function of m, the number of reduced scenarios.

4.6 Conclusion

In this chapter, we introduced an optimization-based framework that combines ideas from scenario reduction and convex optimization to compute scenarios that lead to improved decisions. Unlike most existing approaches, our approach takes the cost function and constraints into account, is general, and applies in a wide range of settings. With the help of computational examples, we demonstrate the benefit of this approach over a commonly used cost-agnostic scenario reduction method. Our approach consistently outperforms standard Wasserstein-based scenario reduction methods across different choices of m, the number of scenarios. From a practitioner's perspective, achieving higher-quality decisions with fewer scenarios is highly desirable, as the scenarios can be inspected, which improves the interpretability of the decision-making process.


Chapter 5

Sparse Convex Regression

5.1 Introduction

Given data (x1, y1), …, (xn, yn), we consider the problem of finding a convex function of the features x ∈ R^d that best fits the dependent variable y ∈ R. Formally, we wish to estimate a function f : R^d → R where

y = f(x) + ε (5.1)

with the requirement that f be a convex function. Here the random noise ε is assumed

to have zero mean. Note that one can equivalently perform concave regression, as the

requirement that f is convex is identical to restricting −f to be concave. As we discuss next,

such convexity/concavity constraints arise naturally in several settings. Such problems fall

in the general class of shape constrained function estimation.

Shape constrained regression problems have many applications in various fields such as,

but not limited to, operations research, econometrics, geometric programming [Magnani and

Boyd, 2009], image analysis [Goldenshluger and Zeevi, 2006], and target reconstruction [Lele

et al., 1992]. In operations research, these problems arise in reinforcement learning [Shapiro

et al., 2009b], [Hannah et al., 2014], in resource allocation [Topaloglu and Powell, 2003], and

while analyzing performance measures of queueing networks [Chen and Yao, 2001]. In economics, such problems are encountered when demand [Varian, 1982], utility functions [Varian, 1984], and production functions [Allon et al., 2007] are assumed to be concave. For a

more detailed list of applications, see Lim and Glynn [2012] and Hannah and Dunson [2013].

The convex least squares estimator is the solution of the following generalized regression problem:

min_{f∈C} (1/2) ∑_{i=1}^n ( y_i − f(x_i) )^2, (5.2)

where C represents the space of convex functions on Rd. Note that Problem (5.2) is an

optimization problem over functions. Surprisingly, this can be written equivalently as a finite

dimensional convex quadratic optimization problem where the variables are the function

values and subgradients at each of the points x1, . . . ,xn [Boyd and Vandenberghe, 2004].

As part of the constraints, we enforce the convexity condition, i.e., the graph of the convex

function lies above each of its tangent planes. More precisely, this convexity condition implies

that given any point xi, the value of f at xi is greater or equal to the value of any tangent

hyperplane of f evaluated at xi. Clearly any convex function has a nonempty subdifferential

at every point, and the existence of such tangent planes is guaranteed. For this problem, it suffices to enforce this condition for the n(n − 1) ordered pairs of points (x_i, x_j), 1 ≤ i ≠ j ≤ n.

The resulting quadratic optimization problem, with variables (θ, {ξ_i}_{i=1}^n), is given as follows:

min_{θ, {ξ_i}} (1/2) ∑_{i=1}^n ( y_i − θ_i )^2

subject to θ_i + ξ_i^T (x_j − x_i) ≤ θ_j ∀ i, j,
θ ∈ R^n,
ξ_i ∈ R^d ∀ i. (5.3)

The variables θi represent the values of f(xi), and ξi belongs to the subdifferential set of

the convex function f at each xi. The solution to this problem θ∗ is referred to as the

convex least squares estimator (CLSE). Note that we recover the usual least squares linear

regression problem by setting ξi = ξ ∀i and θi = ξTxi ∀i.

We note that the feasible set of Problem (5.3) can be unbounded, which may lead to potential instability: there can be multiple values of the subgradients leading to the same objective value. Hence, we propose solving the following regularized optimization problem, for a given λ > 0:

min_{θ, {ξ_i}} (1/2) ∑_{i=1}^n ( y_i − θ_i )^2 + (λ/2) ∑_{i=1}^n ∥ξ_i∥^2

subject to θ_i + ξ_i^T (x_j − x_i) ≤ θ_j ∀ i, j,
θ ∈ R^n,
ξ_i ∈ R^d ∀ i. (5.4)

By adding a regularization term on the subgradients, which makes the objective strongly convex, the subgradients ξ_i can no longer take arbitrary values while attaining the same objective value and remaining feasible.

5.1.1 Related literature

In this section, we review the relevant literature. Recently, there has been considerable

interest in shape constrained regression among the statistics community. Seijo and Sen

[2011] and Lim and Glynn [2012] characterize and show consistency of the CLSE. Seijo and

Sen [2011] use off-the-shelf interior point solvers (such as MOSEK and cvx) for solving the problem, but these solvers do not scale well for n ≥ 300 due to the presence of O(n^2) constraints.

This motivated the recent work by Mazumder et al. [2018] which presents a first order

method based on the Alternating Direction Method of Multipliers (ADMM) to compute

the optimal solutions for the least squares convex regression problem. They demonstrate

the flexibility of their approach in the presence of monotonicity constraints, and bounded

subgradients. Their method solves instances of size n ≈ 1000 to an accuracy of 10^{-3} in a few seconds, and obtains moderate-accuracy solutions for n ≈ 5000 in a few minutes. However, their method cannot be easily extended to least absolute deviation convex regression (where the loss function is the ℓ1 norm rather than the ℓ2 least squares loss), or to any joint constraints

over the subgradients. Hannah and Dunson [2013] consider an approximation of the convex

regression problem which is based on iteratively partitioning the set of observations, and

report results for n of the order of 10,000 in a few minutes. Balázs et al. [2015] propose an aggregate cutting plane based method for solving the full convex regression problem along

with an approximate version, and they demonstrate via numerical experiments that their

algorithm solves instances of size n ≈ 500 in a few minutes. However, they do not perform large-scale computations to show how their method scales. Regarding statistical

results, Han and Wellner [2016] sharply characterize the rate of statistical convergence for

the minimax risk.

In the context of linear regression, the problem of sparse regression refers to finding the

optimal vector of coefficients β ∈ Rd which minimizes the sum of squares of the residuals,

with the additional restriction that β only have at most k (for some positive integer k < d)

elements different from zero. In high-dimensional settings where d ≫ n, such an assumption is valuable for conducting statistical inference, while for settings where d < n, sparsity improves the interpretability of the model. We explore the notion of sparsity in this setting: we impose

the restriction that the union of supports of the subgradients is a set with cardinality at

most k. We refer to this problem as the sparse convex regression problem. Sparsity and

variable selection for non-parametric regression models is a new and relatively unexplored

area. Recently, Xu et al. [2016] develop a method for high-dimensional sparse convex regression which solves an approximate problem, with the additional restriction that the target

convex function f itself be a sum of univariate convex functions. Additionally, they show

that under certain conditions on the samples, this approximation is accurate for the purpose

of variable selection.

However, such a cardinality constraint makes the sparse linear regression problem NP-hard [Natarajan, 1995], which has led to this problem being considered intractable. Nevertheless, there have been tremendous advances in computing power over the last thirty years, both in hardware and in optimization software (see Bixby [2012], Nemhauser [2013] for more details), which can computationally benefit such problems in statistics. Recently, there has been work proposing modern Mixed Integer Optimization (MIO) methods, along with tools from first order methods in convex optimization, for solving classical statistical problems such as best subset selection [Bertsimas et al., 2016b] and least quantiles regression [Bertsimas and Mazumder, 2014]. More recently, Bertsimas and Van Parys [2016] propose a reformulation of the sparse regression problem where they develop a cutting plane

algorithm using a duality perspective that solves problems with sizes of n, d in the order of

100,000s in a few seconds. We explore the use of such techniques while solving the sparse

convex regression problem, where we select the best subset of features whose cardinality is

bounded by k, and find the optimal convex function on this subset.

5.1.2 Contributions

In this section, we outline the main contributions of our work.

1. In this chapter, we consider the problem of convex regression and develop a scalable algorithm for obtaining high-quality solutions in practical times that compare favorably with other state-of-the-art methods. We show that by using a cutting plane method, the least squares convex regression problem can be solved for sizes (n, d) = (10^4, 10) in minutes and (n, d) = (10^5, 10^2) in hours. We emphasize that this approach can also be used for ℓ1 convex regression (where we minimize the ℓ1 norm of the residual vector y − θ), with similar scalability results.

2. We propose algorithms which iteratively solve for the best subset of features, based on first order and cutting plane methods. To the best of our knowledge, these are the first algorithms for sparse convex regression. We consider two variants of this problem and develop algorithms for each of them. In the first variant, we consider the sparse problem with bounded subgradients, and develop iterative mixed integer optimization based algorithms for solving it. In the second variant, we consider the sparse problem with ridge regularization, and develop a binary cutting plane method for this problem. With the help of computational experiments, we show that our methods are scalable and obtain near-exact subset recovery for sizes (n, d, k) = (10^4, 10^2, 10) in minutes, and (n, d, k) = (10^5, 10^2, 10) in hours.


5.1.3 Structure of this chapter

The structure of this chapter is as follows. In Section 5.2, we present the cutting plane algo-

rithm for solving the least squares convex regression problem and other variants. In Section

5.3, we define the sparse convex regression problem, and present our solution approach. We

illustrate the effectiveness of our approach and discuss computational results in Section 5.4.

5.1.4 Notation

For any positive integer n, we use [n] to denote the set of the first n positive integers, that is, [n] = {1, …, n}. The response vector is an n-dimensional vector of observations, and the covariates are each d-dimensional vectors, i.e., y ∈ R^n, x_i ∈ R^d ∀ i ∈ [n], where d ≥ 1. Also, ∥·∥_0 denotes the ℓ0 norm, given by the number of nonzero elements of a vector. Finally, Supp(x) denotes the set of indices of the vector x whose corresponding entries are nonzero.

5.2 Optimization Algorithm for Convex Regression

In this section, we propose an algorithm to solve the convex regression problem. Additionally,

we show that our algorithm can easily accommodate the case with an ℓ1 objective, as well as other constraints on the subgradients.

5.2.1 Algorithm

We present a cutting plane based algorithm for solving Problem (5.4). We now explain the

various steps in the algorithm in the following subsections.

Cutting plane algorithms

Cutting plane algorithms are an effective tool for solving large-scale optimization problems

where the number of constraints is very high. Before we proceed, we define some terminology


that is commonly used in the large-scale optimization literature. In this context, master

problem refers to the full formulation (5.4) with n(n−1) constraints, while the reduced master

problem refers to a problem with the same objective and variables, but with only a subset

of the constraints. The main idea behind these methods is to start by solving the problem with only a few constraints: the initial reduced master problem. We then find the violated

constraints, and iteratively add them in a delayed manner: at each iteration we solve a

reduced master problem (but with progressively more constraints than the initial reduced

master problem). Consequently, such methods are also referred to as delayed constraint

generation in the large-scale optimization literature [Bertsimas and Tsitsiklis, 1997]. The

success of this method depends greatly on the efficiency of finding the violated constraints.

Initial reduced master problem

We start with a fraction of the n(n − 1) constraints: an initial reduced master problem.

Typically only a small fraction of the n(n − 1) constraints will be active at the optimal

solution to the full problem, and solving the problem with only these active constraints is

clearly equivalent to solving the full problem. However, these active constraints are not

known beforehand. A key advantage of starting with a constraint set that is “close" to the

active constraint set is that it could substantially reduce the number of cuts added at later

iterations, and reduce the net computational burden.

We motivate our algorithm by the solution of the convex regression problem for d = 1, where the convexity condition need only be applied to immediately neighboring points. Recall that when d = 1, Problem (5.4) can be solved with only n − 1 constraints, i.e., by sorting the x_i and considering adjacent index pairs.

For d > 1, given x_1, …, x_n, we form a spanning path (SP) based on the Euclidean distances between these points. The construction works as follows: starting from x_{i_1} (say i_1 = 1), we find the closest point x_{i_2} to it (in the usual Euclidean distance) and add it as the next point. We then find the closest point x_{i_3} to x_{i_2} over all points excluding x_{i_1} and x_{i_2}, then the closest point x_{i_4} to x_{i_3} over all points excluding x_{i_1}, x_{i_2}, and x_{i_3}, and so on. We utilize the n − 1 edges of the spanning path through x_1, …, x_n as the initial

constraints. These n − 1 constraints initially form the reduced master problem:

min_{θ, ξ_1,…,ξ_n} (1/2) ∥y − θ∥^2 + (λ/2) ∑_{i=1}^n ∥ξ_i∥^2

subject to θ_{i_1} + ξ′_{i_1} (x_{i_2} − x_{i_1}) ≤ θ_{i_2},
θ_{i_2} + ξ′_{i_2} (x_{i_3} − x_{i_2}) ≤ θ_{i_3},
⋮
θ_{i_{n−1}} + ξ′_{i_{n−1}} (x_{i_n} − x_{i_{n−1}}) ≤ θ_{i_n},
∥ξ_j∥_∞ ≤ M* ∀ 1 ≤ j ≤ n, (5.5)

with solution θ, ξ_1, …, ξ_n. The last constraint bounds the feasible space, with M* obtained by solving Problems (5.14) and (5.15).

Alternatively, we have also computed the minimum spanning tree (MST) among x1, . . . ,xn

and used the n − 1 edges of the MST as initial constraints. We have also used randomly

chosen pairs of points (Method (R)) as the initial reduced master problem and also selected

the closest point for each point xi (Method (C)). For d = 1, we note that the MST and SP

methods coincide. In Section 5.4, we compare Methods SP, MST, R and C.
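The greedy construction above can be sketched as follows (a minimal implementation; the toy points are illustrative):

```python
import numpy as np

def spanning_path(X):
    """Greedy nearest-neighbor spanning path: start at x_1, repeatedly move to
    the closest unvisited point; return the n-1 index pairs that serve as the
    initial constraint set."""
    n = X.shape[0]
    visited = [0]
    remaining = set(range(1, n))
    while remaining:
        cur = visited[-1]
        # closest unvisited point in squared Euclidean distance
        nxt = min(remaining, key=lambda j: np.sum((X[j] - X[cur]) ** 2))
        visited.append(nxt)
        remaining.remove(nxt)
    return list(zip(visited[:-1], visited[1:]))

X = np.array([[0.0], [10.0], [1.0], [11.0]])
pairs = spanning_path(X)   # visits 0 -> 2 -> 1 -> 3 in nearest-neighbor order
```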

Delayed constraint generation

For any given solution to the reduced master problem (5.5) given by θ, ξ1, . . . , ξn, we need

to check if this is a feasible solution for the full problem. If it is indeed feasible, clearly

it is also optimal for the full problem. On the other hand, if it is not feasible, we need to

find a violated constraint efficiently. This problem of finding a violated constraint is also

referred to as the separation problem, as this amounts to finding a hyperplane that separates

θ, ξ_1, …, ξ_n from the feasible set [Bertsimas and Tsitsiklis, 1997]. Thus, for each i, the i-th separation problem is to find the index j(i) attaining the maximum violation,

j(i) = argmax_{1≤k≤n} { θ_i − θ_k + ξ′_i (x_k − x_i) }, (5.6)


and check if the corresponding largest value is greater than 0.

In practice, we only consider a constraint to be violated if it exceeds a given tolerance

Tol. In the case of such a violation, we add the constraint

θ_i + ξ′_i (x_{j(i)} − x_i) ≤ θ_{j(i)} (5.7)

to the reduced master problem for each i, and re-solve it. Let T_k denote the set of index pairs of the violated constraints added at the k-th iteration. Thus, at the k-th iteration, the problem we solve is given by

min_{θ, ξ_1,…,ξ_n} (1/2) ∥y − θ∥^2 + (λ/2) ∑_{i=1}^n ∥ξ_i∥^2

subject to θ_i + ξ_i^T (x_j − x_i) ≤ θ_j ∀ (i, j) ∈ T_0 ∪ T_1 ∪ ⋯ ∪ T_k. (5.8)

If max_{1≤k≤n} { θ_i − θ_k + ξ′_i (x_k − x_i) } ≤ Tol for all i ∈ [n], then the current solution is optimal (within tolerance) for the full problem (5.4) with n(n − 1) constraints, and the method terminates. The complete algorithm is as follows:
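The separation step can be vectorized, since the violation matrix V with V[i, k] = θ_i − θ_k + ξ′_i(x_k − x_i) is computable with a single matrix product; a sketch (the quadratic test data is illustrative):

```python
import numpy as np

def most_violated(theta, Xi_sub, X, tol=1e-6):
    """For each i, solve the separation problem (5.6):
    j(i) = argmax_k [theta_i - theta_k + xi_i'(x_k - x_i)], and return the
    pairs (i, j(i)) whose violation exceeds tol.
    Xi_sub is the n x d matrix whose rows are the subgradients xi_i."""
    G = Xi_sub @ X.T                                   # G[i, k] = xi_i' x_k
    V = theta[:, None] - theta[None, :] + G - np.diag(G)[:, None]
    np.fill_diagonal(V, -np.inf)                       # exclude k = i
    j = V.argmax(axis=1)
    viol = V[np.arange(len(theta)), j]
    return [(i, j[i]) for i in range(len(theta)) if viol[i] > tol]

# A convex fit (theta_i = x_i^2, xi_i = 2 x_i) violates no constraint, since a
# tangent plane of x^2 underestimates it; flipping the subgradients creates cuts.
x = np.linspace(-1, 1, 5)[:, None]
theta, Xi_sub = (x ** 2).ravel(), 2 * x
cuts = most_violated(theta, -Xi_sub, x)   # nonempty: flipped tangents overshoot
```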


Algorithm 1 Cutting plane algorithm for Problem (5.4)
Input: Data (y_i, x_i), i = 1, …, n, tolerance Tol > 0.
Output: An optimal solution (θ*, ξ*_1, …, ξ*_n) to Problem (5.4).
1: Solve the reduced master problem, i.e., Problem (5.8) with k = 0.
2: Set success ← 0.
3: while success == 0 do
4:   for 1 ≤ i ≤ n do
5:     For this i, solve the separation problem (5.6) to find j(i).
6:     Add the corresponding violated constraint (Eq. (5.7)), if any, to the reduced master problem.
7:   end for
8:   If there is no violated constraint within the tolerance Tol, set success ← 1.
9:   Else, re-solve Problem (5.8) with the new constraint set T_{k+1}, consisting of the additional constraint(s) added in Steps 4–7.
10:  k ← k + 1
11: end while

We also note that this cutting plane algorithm, by successively adding violated constraints

to the reduced master problem, is guaranteed to converge to an optimal solution in a finite

number of steps [Kelley, 1960].

Theorem 4. The cutting plane Algorithm 1 converges to an optimal solution of Problem

(5.4) in a finite number of iterations.

5.2.2 ℓ1 convex regression

Consider the problem of ℓ1 convex regression, given by

min_{f∈C} ∑_{i=1}^n | y_i − f(x_i) | (5.9)


where, as before, C is the space of convex functions on R^d. This variant is along the lines of linear regression with an ℓ1 loss, rather than the usual least squares loss. Problem (5.9) can be written as the equivalent finite-dimensional linear optimization problem (5.10), using additional auxiliary variables z ∈ R^n_+, as follows:

min_{θ, {ξ_i}, z} ∑_{i=1}^n z_i

subject to z_i ≥ y_i − θ_i ∀ i,
z_i ≥ −(y_i − θ_i) ∀ i,
θ_i + ξ_i^T (x_j − x_i) ≤ θ_j ∀ i, j,
θ, z ∈ R^n,
ξ_i ∈ R^d ∀ i ∈ [n]. (5.10)

We utilize the dual simplex algorithm because, when a new cut is introduced, the previous optimal basis remains dual feasible while the primal solution may become infeasible. As we illustrate in Section 5.4, this method is fast in practice and scales well.
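To make the size of the LP (5.10) concrete, the following sketch (plain Python, hypothetical helper name; the thesis itself builds the model in JuMP) assembles each constraint as a coefficient dictionary over the variables (θ, z, ξ), in the form row·v ≤ rhs.

```python
def l1_lp_rows(X, y):
    """Build the constraints of the LP (5.10) as (coefficients, rhs) pairs."""
    n, d = len(X), len(X[0])
    rows = []
    for i in range(n):
        # z_i >= y_i - theta_i   <=>   -theta_i - z_i <= -y_i
        rows.append(({("theta", i): -1.0, ("z", i): -1.0}, -y[i]))
        # z_i >= -(y_i - theta_i)  <=>  theta_i - z_i <= y_i
        rows.append(({("theta", i): 1.0, ("z", i): -1.0}, y[i]))
        for j in range(n):
            if j == i:
                continue
            # convexity: theta_i - theta_j + xi_i'(x_j - x_i) <= 0
            coef = {("theta", i): 1.0, ("theta", j): -1.0}
            for p in range(d):
                coef[("xi", i, p)] = X[j][p] - X[i][p]
            rows.append((coef, 0.0))
    return rows
```

For n points this produces 2n epigraph rows plus n(n−1) convexity rows, which is precisely why the reduced master problem starts from only a small subset of them.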

5.2.3 Extensions

Algorithm 1 can be extended to accommodate the following additional requirements on f(⋅):

a) The function f is coordinate-wise monotone, i.e., its subgradients satisfy either ξi ≥ 0 (non-decreasing) or ξi ≤ 0 (non-increasing) for all i.

b) The subgradients ξi are bounded, i.e., ∥ξi∥p ≤ L for all i, for some L and ℓp norm ∥⋅∥p. The usual cases p ∈ {1, 2, ∞} result in conic optimization problems and can be handled by this approach. Such constraints could be added to the reduced master problem all at once, or in a delayed manner as and when they are violated.


5.3 Sparse Convex Regression

In this section, we consider the problem of sparse convex regression, in which the union of the supports of the subgradients of f at each point x is a set whose cardinality is bounded by k. We formulate this as the following optimization problem over sets:

    min_{θ, {ξi}, S}  (1/2) Σ_{i=1}^n (yi − θi)² + (λ/2) Σ_{i=1}^n ∥ξi∥₂²
    subject to  θi + ξ′i(xj − xi) ≤ θj  ∀i, j,
                Supp(ξi) ⊆ S  ∀i,
                θ ∈ Rn,
                ξi ∈ Rd  ∀i,
                |S| ≤ k,  S ⊆ {1, …, d}.        (5.11)

5.3.1 Primal approach

In this section, we present a primal approach for finding the optimal subset of features for the convex regression problem. Consider the following mixed integer (binary) quadratic optimization (MIQO) problem

    min_{θ, z, {ξi}}  (1/2) Σ_{i=1}^n (yi − θi)² + (λ/2) Σ_{i=1}^n ∥ξi∥₂²
    subject to  θi + ξ′i(xj − xi) ≤ θj  ∀ 1 ≤ i, j ≤ n,
                |(ξi)j| ≤ M zj  ∀i ∈ [n], j ∈ [d],
                Σ_{j=1}^d zj ≤ k,
                z ∈ {0,1}^d,
                θ ∈ Rn,
                ξi ∈ Rd  ∀i ∈ [n],        (5.12)


for some positive constant M.

To solve this problem, we first develop heuristics based on convex optimization which generate solutions quickly in practice. We solve a reduced MIQO problem (using a commercial mixed integer optimization solver [Gurobi]) to generate lower bounds, which provide a guarantee on the quality of the heuristic solution. Bertsimas et al. [2016b] used commercial state-of-the-art MIO solvers to solve the sparse linear regression problem with considerable success. We present the details of this algorithm in the following section.

Algorithmic approach

In this section, we present an algorithm to solve Problem (5.12). In summary, our approach generates lower bounds by solving a reduced MIQO problem, improving the bound at each successive iteration, while heuristics quickly produce feasible solutions; the lower bounds then certify the quality of the proposed solution (in terms of the optimality gap) or help improve it further. We elaborate on the heuristics in Section 5.3.3. To generate the lower bounds, we solve the full sparse problem as an MIQO problem, but starting with only an initial reduced set of constraints. Whenever possible, we warm-start this problem with a feasible solution obtained via the heuristics. We then iteratively add violated constraints to Problem (5.12) to tighten the bounds, similar in spirit to the cutting plane approach. For the upper bound, we solve the full convex regression problem on the restricted support. To be precise, this problem is given by

    min_{θ, ξ1, …, ξn}  (1/2) Σ_{i=1}^n (yi − θi)² + (λ/2) Σ_{i=1}^n ∥ξi∥₂²
    subject to  θi + ξ′i((xj)S − (xi)S) ≤ θj  ∀i, j,
                ∥ξi∥∞ ≤ M  ∀i,
                ξi ∈ Rk  ∀i,  θ ∈ Rn,        (5.13)

where S is the support set obtained from the MIO solution, and vS is the vector v restricted

to the set S. The overall primal algorithm is as follows:


Algorithm 2 Primal approach
Input: Initial constraints C(0) (a subset of the n(n−1) constraints), tolerance ε > 0, a positive integer T.
Output: A sparse optimal solution to Problem (5.12).
1: Initialize Problem (5.12) with the initial constraints C(0).
2: Use the initialization heuristic (Section 5.3.3) to generate an initial solution S(0).
3: Set t ← 1.
4: while t ≤ T AND gap > ε do
5:   Solve Problem (5.12) with the reduced constraint set C(t) to obtain the support set S(t), possibly utilizing S(t−1) as a warm start.
6:   Set LB (lower bound) to the optimal objective of Problem (5.12).
7:   With the output support, solve Problem (5.13) on the support S(t).
8:   Update UB (upper bound) to be its optimal objective.
9:   Update gap ← (UB − LB)/LB.
10:  Add (at most) n constraints violated by this solution (one for each 1 ≤ i ≤ n) to the lower bound MIQO problem (5.12), forming C(t+1).
11:  Warm-start the next lower bound solve with the solution obtained by solving Problem (5.13) on the restricted support set S(t).
12:  t ← t + 1
13: end while

Computing the bound M

In this section, we describe how we compute bounds on the subgradient values. For some initial feasible solution θ⁰ and ξ⁰1, …, ξ⁰n for Problem (5.11), we solve the following problems,


for each 1 ≤ t ≤ n, 1 ≤ u ≤ d:

    min_{θ, ξ1, …, ξn}  ξt,u
    subject to  (1/2)∥y − θ∥² + (λ/2) Σ_{i=1}^n ∥ξi∥₂² ≤ (1/2)∥y − θ⁰∥² + (λ/2) Σ_{i=1}^n ∥ξ⁰i∥₂²,
                θi + ξ′i(xj − xi) ≤ θj  ∀i, j,        (5.14)

and

    max_{θ, ξ1, …, ξn}  ξt,u
    subject to  (1/2)∥y − θ∥² + (λ/2) Σ_{i=1}^n ∥ξi∥₂² ≤ (1/2)∥y − θ⁰∥² + (λ/2) Σ_{i=1}^n ∥ξ⁰i∥₂²,
                θi + ξ′i(xj − xi) ≤ θj  ∀i, j.        (5.15)

We note that the feasible regions of Problems (5.14) and (5.15) are bounded, and hence the optimal objective values of both problems are guaranteed to be finite.

Let M∗ be the maximum absolute value of the optimal objective values of (5.14) and (5.15) over all 1 ≤ t ≤ n and 1 ≤ u ≤ d. An optimal solution of (5.11) is clearly feasible for both (5.14) and (5.15). Therefore, using M∗ in the formulation (5.12) does not exclude optimal solutions of (5.11), and hence the optimal objective values of Problems (5.11) and (5.12) are equal.

5.3.2 Dual approach

In this section, we adapt the approach proposed by Bertsimas and Van Parys [2016] for

sparse linear regression to this convex regression setting. We solve the following regularized


problem, for a given λ > 0,

    min_{θ, ξ1, …, ξn}  (1/2)∥y − θ∥² + (λ/2) Σ_{i=1}^n ∥ξi∥²
    subject to  θi + ξ′i(xj − xi) ≤ θj  ∀i, j,
                Supp(ξi) ⊆ S  ∀i,
                θ ∈ Rn,  ξi ∈ Rd  ∀i,
                |S| ≤ k,  S ⊆ {1, …, d}.        (5.16)

Before we proceed, we introduce some notation. S^d_k denotes the set of d-dimensional binary vectors with at most k non-zero components, i.e.,

    S^d_k = { z ∈ {0,1}^d : Σ_{i=1}^d zi ≤ k }.

We next present the following result that transforms this problem to a binary optimization

problem with a convex objective function.

Theorem 5. Problem (5.16) is equivalent to solving the following binary optimization problem with a convex objective:

    min_{z ∈ S^d_k}  g(z),        (5.17)

where

    g(z) = max_{μ≥0}  −(1/2) Σ_{i=1}^n ( yi + Σ_{j=1}^n μji − Σ_{j=1}^n μij )² − (1/(2λ)) Σ_{i=1}^n Σ_{p=1}^d zp ( Σ_{j=1}^n μij (xjp − xip) )²,        (5.18)

and a subgradient of g is given by the vector whose pth element is

    (∂g(z))p = −(1/(2λ)) Σ_{i=1}^n ( Σ_{j=1}^n μij (xip − xjp) )²,        (5.19)

where μ is an optimal solution to the concave maximization problem in Eq. (5.18).


Proof. Using binary variables z ∈ {0,1}^d to denote the support set (zj = 0 ⟺ (ξi)j = 0 ∀i ∈ [n]), we write Problem (5.16) as

    min_{z ∈ S^d_k, Z = diag(z)}  min_{θ, ξ1, …, ξn}  (1/2)∥y − θ∥² + (λ/2) Σ_{i=1}^n ∥ξi∥²
    subject to  θi + ξ′i Z (xj − xi) ≤ θj  ∀i, j.        (5.20)

We take the dual of the inner convex optimization problem, which is given by

    max_{μ≥0}  −(1/2) Σ_{i=1}^n ( yi + Σj μji − Σj μij )² − (1/(2λ)) Σ_{i=1}^n ∥ Σj μij Z(xi − xj) ∥².

For brevity, let vi = Σj μij(xi − xj). Note that Z′Z = Z² = Z, so that ∥Z vi∥² = Σ_{p=1}^d zp (vi)²p, and thus we get

    max_{μ≥0}  −(1/2) Σ_{i=1}^n ( yi + Σj μji − Σj μij )² − (1/(2λ)) Σ_{i=1}^n Σ_{p=1}^d zp ( Σ_{j=1}^n μij (xjp − xip) )²,        (5.21)

and thus the result follows.
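The quantities in (5.18) and (5.19) are straightforward to evaluate once a dual point μ is available. The following is a minimal sketch (hypothetical helper, plain Python, not the thesis implementation) that evaluates the inner objective of (5.18) and the subgradient formula (5.19) at a given μ; in the algorithm itself, μ would be an optimal dual solution.

```python
# Evaluate the dual objective of (5.18) and the subgradient (5.19) at a given
# (not necessarily optimal) dual point mu. z is the 0/1 support indicator and
# lam the ridge weight.

def dual_objective_and_subgradient(mu, X, y, z, lam):
    n, d = len(X), len(X[0])
    obj = 0.0
    subgrad = [0.0] * d
    for i in range(n):
        # first term: -(1/2) * (y_i + sum_j mu_ji - sum_j mu_ij)^2
        a = y[i] + sum(mu[j][i] for j in range(n)) - sum(mu[i][j] for j in range(n))
        obj -= 0.5 * a * a
        for p in range(d):
            # v_ip = sum_j mu_ij (x_ip - x_jp); sign inside the square is immaterial
            v = sum(mu[i][j] * (X[i][p] - X[j][p]) for j in range(n))
            obj -= z[p] * v * v / (2.0 * lam)
            # subgradient element: -(1/(2*lam)) * sum_i v_ip^2
            subgrad[p] -= v * v / (2.0 * lam)
    return obj, subgrad
```

Each evaluation of g at the optimal μ also hands back, for free, the cut coefficients needed by the outer cutting plane loop.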

From Theorem 5, since g is convex in z, we use μ to compute a subgradient of g, which we then use to solve the outer binary minimization problem via cutting planes. This is equivalent to approximating the convex function g by a piecewise linear function formed from its lower tangents, while improving the outer approximation by adding a new tangent at each iteration. To be precise, we solve the outer problem as

    min_{z ∈ {0,1}^d}  max_{i=1,…,m}  { g(z(i)) + ∂g(z(i))′(z − z(i)) }
    subject to  Σ_{i=1}^d zi ≤ k,        (5.22)


or equivalently, in epigraph form,

    min_{z ∈ {0,1}^d, γ}  γ
    subject to  g(z(i)) + ∂g(z(i))′(z − z(i)) ≤ γ  ∀ 1 ≤ i ≤ m,
                Σ_{i=1}^d zi ≤ k,        (5.23)

where m is the number of cuts added.

While solving Problem (5.22), we use dynamic constraint generation, or lazy callbacks, which enables the solver to avoid building a new branch and bound tree each time a constraint is added to the problem; only a single branch and bound tree is built. Typically, lazy constraints are used when the full set of constraints is too large to enumerate explicitly. Under this scheme, cuts are added to the model whenever a binary feasible solution is found.

As mentioned in Bertsimas and Van Parys [2016] for the sparse linear regression case,

the linear relaxation of problem (5.17) provides strong warm starts to problem (5.16). This

motivates the following corollary.

Corollary 2. The linear relaxation of problem (5.17) is given by the following convex optimization problem with semi-infinite constraints:

    min_{μ≥0, γ}  (1/2) Σ_{i=1}^n ( yi + Σj μji − Σj μij )² + γ
    subject to  γ ≥ (1/(2λ)) Σ_{p=1}^d zp { Σ_{i=1}^n ( Σj μij (xip − xjp) )² }  ∀ z ∈ Δ_{k,d},        (5.24)

where Δ_{k,d} = { z ∈ Rd : 0 ≤ z ≤ 1, Σ_{i=1}^d zi ≤ k }.

We solve the relaxation to generate warm starts for the original binary optimization problem (5.22). Once again, we use cutting planes to solve this problem. At the optimal solution, the support set is taken to be the indices of the k largest values of the vector v, whose pth element is given by

    vp = Σ_{i=1}^n ( Σ_{j=1}^n μij (xip − xjp) )².        (5.25)

In practice, we have observed that this method does provide good quality warm starts.
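The warm-start rule just described can be sketched as follows (plain Python, hypothetical helper name): score each feature p by vp from Eq. (5.25) and keep the k largest.

```python
# Recover a warm-start support from a dual point mu via the scores of (5.25):
#   v_p = sum_i ( sum_j mu_ij (x_ip - x_jp) )^2,
# then keep the indices of the k largest scores.

def warm_start_support(mu, X, k):
    n, d = len(X), len(X[0])
    v = []
    for p in range(d):
        total = 0.0
        for i in range(n):
            s = sum(mu[i][j] * (X[i][p] - X[j][p]) for j in range(n))
            total += s * s
        v.append(total)
    # indices of the k largest scores, returned in ascending index order
    return sorted(sorted(range(d), key=lambda p: -v[p])[:k])
```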

Before we elaborate further, we introduce some terminology. Here, the outer problem refers to the binary minimization problem (5.22). As noted in the statement of Theorem 5, evaluating the function g requires solving an optimization problem (5.18), which we shall henceforth refer to as the inner problem.

Column generation methods for the inner problem

An issue with the above approach is that the inner problem has O(n²) variables μ, and is thus not practical for larger n. Hence, we propose a column generation approach for solving the inner problem (5.18). We start with a subset of all the n(n−1) variables μ (with the rest set to zero), and add the corresponding variables as we go along. From the KKT conditions, for a given dual optimal solution μ we recover the primal variables as:

    θi = yi + Σ_{j=1}^n μji − Σ_{j=1}^n μij  ∀i,
    ξi = (1/λ) Σ_{j=1}^n μij Z (xi − xj)  ∀i.        (5.26)

We then check whether the recovered primal solution satisfies all constraints, i.e., whether

    θi + ξ′i(xj − xi) − θj ≤ 0  ∀i, j.        (5.27)

If not, then for each violating i, we find the j∗ given by

    j∗ = argmax_{1≤j≤n} { θi − θj + ξ′i(xj − xi) },        (5.28)

add the variable μij∗ to the set of active variables, and re-solve problem (5.18).
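The primal-recovery step (5.26) and the pricing step (5.28) can be sketched together (plain Python, hypothetical helper name, not the thesis code); z is the 0/1 support indicator (so Z = diag(z)) and lam the ridge weight.

```python
# Recover (theta, xi) from a dual point mu via (5.26), then price: for each i,
# find the most violated constraint index j* of (5.28) and return the new
# variable indices (i, j*) to activate.

def recover_and_price(mu, X, y, z, lam):
    n, d = len(X), len(X[0])
    theta = [y[i] + sum(mu[j][i] for j in range(n))
                  - sum(mu[i][j] for j in range(n)) for i in range(n)]
    xi = [[z[p] * sum(mu[i][j] * (X[i][p] - X[j][p]) for j in range(n)) / lam
           for p in range(d)] for i in range(n)]
    new_cols = []
    for i in range(n):
        viol = [theta[i] - theta[j]
                + sum(xi[i][p] * (X[j][p] - X[i][p]) for p in range(d))
                for j in range(n)]
        j_star = max(range(n), key=lambda j: viol[j])
        if viol[j_star] > 0:
            new_cols.append((i, j_star))
    return theta, xi, new_cols
```

When `new_cols` is empty, no constraint of (5.27) is violated and the column generation loop stops.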


In practice, problem (5.18), while having relatively simple nonnegativity constraints, has a dense quadratic objective, which often results in larger solve times. Instead, we solve its dual, which is the inner minimization problem in Eq. (5.20). We use Algorithm 1 to solve this inner minimization problem and calculate the variables θ and ξ1, …, ξn, as well as the dual variables μij corresponding to the constraints in Eq. (5.20). Given these values and the expression in Eq. (5.19), we compute the subgradient of g at this value of z, and, in the case of a violation, add the corresponding constraint to the outer binary optimization problem. This dual approach differs from the method in Bertsimas et al. [2016b], which is a primal method. In Section 5.4, we observe that this dual approach has a significant edge over the primal one.

We now present the complete algorithm for the dual approach:

Algorithm 3 Cutting plane based algorithm for the dual approach
Input: λ > 0, tolerance ε > 0.
Output: Optimal support z∗.
1: Start with γ0 = 0 and some feasible z0.
2: t ← 0.
3: while γt < g(zt) − ε do
4:   Compute a subgradient of g at zt, using Theorem 5.
5:   Add the constraint g(zt) + ∂g(zt)′(z − zt) ≤ γ.
6:   Re-solve the outer problem (5.23), with solution given by (zt+1, γt+1).
7:   t ← t + 1.
8: end while

5.3.3 Initialization heuristics

In this section, we briefly describe a thresholding-based heuristic for the sparse convex regression problem. This method provides an alternative to solving the relaxation problem (5.24) for generating warm starts. For the sake of brevity, we do not include the ridge regularization term, but these methods can be easily adapted to include it. Consider the problem

    min_{θ, ξ}  (1/2)∥y − θ∥²
    subject to  Aθ + Σ_{i=1}^n Bi ξi ≤ 0,
                Supp(ξi) ⊆ S  ∀i,
                θ ∈ Rn,  ξi ∈ Rd  ∀i,
                |S| ≤ k,  S ⊆ {1, …, d},        (5.29)

where A and Bi are the full matrices representing all n(n−1) constraints. Typically, at any feasible solution, only a few of the constraints will be active. Let T denote the index set of the binding constraints, and let AT and BT,i be the corresponding submatrices. Thus, the problem can be written as

    min_{θ, ξ}  (1/2)∥y − θ∥²
    subject to  AT θ + Σ_{i=1}^n BT,i ξi ≤ 0,
                Supp(ξi) ⊆ S  ∀i,
                θ ∈ Rn,  ξi ∈ Rd  ∀i,
                |S| ≤ k,  S ⊆ {1, …, d}.        (5.30)

Dualizing the linear inequality constraints, the objective is given by

    f(θ, ξ) = max_{λ≥0}  (1/2)∥y − θ∥² + λ′( AT θ + Σ_{i=1}^n BT,i ξi ).        (5.31)

We smooth the objective function by subtracting a strongly convex term (τ/2)∥λ∥₂² for some fixed scalar τ > 0; note that we need to compute this function f efficiently for different values of θ, ξ. The smoothed convex objective is now

    fτ(θ, ξ) = max_{λ≥0}  (1/2)∥y − θ∥² + λ′( AT θ + Σ_{i=1}^n BT,i ξi ) − (τ/2)∥λ∥₂².        (5.32)

The gradient of fτ is Lipschitz continuous with parameter ℓ = λ_max(M′M)/τ [Nesterov, 2005]. The matrix M ∈ R^{m×(n+nd)}, where m is the number of rows of AT (the number of binding constraints), is given by

    M = [ AT  BT,1  …  BT,n ].

Now, the optimal λ∗τ can be computed as

    λ∗τ = (1/τ) ( AT θ + Σ_{i=1}^n BT,i ξi )₊.        (5.33)

We then apply an upper quadratic approximation to fτ(θ, ξ1, …, ξn), followed by an iterative thresholding procedure, while sequentially reducing the value of τ. The complete details of this algorithm can be found in the Appendix.

5.4 Computational Experiments

Our objectives in this section are:

1. To understand the scalability and run times of Algorithm 1 for convex regression on synthetic and real data.

2. To compare the performance of Algorithm 1 to other state-of-the-art methods.

3. To understand the scalability and run times of Algorithms 2 and 3 for sparse convex regression. Given that, to the best of our knowledge, there are no competing approaches for this problem, we do not include any comparisons.

The structure of this section is as follows. In Section 5.4.1, we discuss the data generation


mechanism, and compare various initialization schemes for Algorithm 1 in Section 5.4.2. We

then examine its run times on synthetic data in Section 5.4.3, and infeasibility of the solution

at each iteration of Algorithm 1 in Section 5.4.4. Next, we compare it with other approaches

in Section 5.4.5, and discuss the run times of Algorithm 1 applied to the convex regression

problem with an ℓ1 loss in Section 5.4.6. We then present the run times and infeasibility of Algorithm 1 on real data in Section 5.4.7. Next, we consider the sparse convex regression problem in Section 5.4.8, where we present the run times of Algorithm 2 (primal approach) and Algorithm 3 (dual approach) for various sizes. Additionally, we present the accuracy and run times of Algorithm 3 as a function of various parameters such as k, d, ρ, and SNR, and also present the false positive rates of both algorithms. We conclude by discussing our findings from these experiments in Section 5.4.9.

In all the experiments that follow, we use Gurobi 6.5.2 [Gurobi] as the optimization solver, within the Julia programming language [Bezanson et al., 2017] using the JuMP modeling language [Dunning et al., 2017]. All computations were performed on nodes of the Engaging cluster, which is a collaboration between the Massachusetts Green High Performance Computing Center (MGHPCC) and several of Boston's leading universities. Each compute node of the cluster had two 8-core, 2GHz Intel Xeon E2650 processors, 64GB of memory and 3.5TB of local disk.

5.4.1 Synthetic Data

In this section, we generate the X data from a standard Gaussian distribution, and use the convex function Φ(x) = ∥x∥₂², with yi = Φ(xi) + εi, 1 ≤ i ≤ n. The errors εi are assumed to be independent and identically distributed Gaussian, i.e., N(0, σ²), for i = 1, …, n. We scale the data appropriately so that the signal-to-noise ratio (SNR) is 3, i.e., Var(μ)/Var(ε) = 3. Finally, before feeding the data into the algorithm, we mean-center and normalize the features and the response vector to have unit ℓ2 norm.
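The generation mechanism just described can be sketched as follows (plain Python, hypothetical helper name; the thesis uses Julia). For brevity this sketch centers and normalizes only the response; the features are handled analogously.

```python
# Generate synthetic data: X standard Gaussian, signal Phi(x) = ||x||_2^2,
# Gaussian noise scaled so that Var(signal)/Var(noise) = snr, then
# mean-centering and unit-l2 normalization of the response.

import math
import random

def make_data(n, d, snr=3.0, seed=0):
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    mu = [sum(x * x for x in row) for row in X]            # Phi(x) = ||x||^2
    mbar = sum(mu) / n
    var_mu = sum((m - mbar) ** 2 for m in mu) / n
    sigma = math.sqrt(var_mu / snr)                        # Var(mu)/Var(eps) = snr
    y = [m + rng.gauss(0.0, sigma) for m in mu]
    # mean-center and scale the response to unit l2 norm
    ybar = sum(y) / n
    y = [v - ybar for v in y]
    norm = math.sqrt(sum(v * v for v in y))
    y = [v / norm for v in y]
    return X, y

X, y = make_data(200, 10)
```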

We report the number of blocks of cuts (iterations) added, along with another metric called primal infeasibility [Mazumder et al., 2018],

    Primal infeasibility = (1/n) ∥V∥_F,        (5.34)

where the matrix V has entries Vi,j = (θi + ξ′i(xj − xi) − θj)₊ for all 1 ≤ i, j ≤ n, with z₊ = max{z, 0}. Vi,j indicates the magnitude of violation of the corresponding constraint, and a value of 0 indicates no violation. Here, ∥⋅∥_F denotes the usual Frobenius norm, where

    ∥V∥²_F = Σ_{i=1}^n Σ_{j=1}^n V²i,j.        (5.35)

Finally, Tol is the threshold above which we report a constraint to be violated.
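The metric (5.34) is simple to compute directly; the following is a small illustrative sketch (plain Python, hypothetical helper name).

```python
# Compute the primal infeasibility metric (5.34):
#   (1/n) * Frobenius norm of V, with V_ij = (theta_i + xi_i'(x_j - x_i) - theta_j)_+

import math

def primal_infeasibility(theta, xi, X):
    n, d = len(theta), len(X[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            v = theta[i] - theta[j] + sum(
                xi[i][p] * (X[j][p] - X[i][p]) for p in range(d))
            if v > 0.0:
                total += v * v   # only violated constraints contribute
    return math.sqrt(total) / n
```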

5.4.2 Comparison of initialization methods for the reduced master problem

In this section, we apply Algorithm 1 to the least squares convex regression problem (5.4). We compare the run times of different ways of forming the reduced master problem for Problem (5.4), using the methods MST, SP, C, and R. MST refers to the Euclidean minimum spanning tree formed on the set of points x1, …, xn. SP refers to the spanning path approach described in Section 5.2.1. C refers to finding the point closest to each point, and adding that pair in that order; for example, if xj is the point closest to xi, we add the constraint

    θi + ξ′i(xj − xi) ≤ θj.

Finally, R refers to sampling, for each xi, a point at random from the remaining n − 1 points, and adding the resulting n constraints. The last four methods, 2-MST, 2-SP, 2-C, and 2-R, denote two-sided constraints, i.e., for each pair (xi, xj), we add both constraints

    θi + ξ′i(xj − xi) ≤ θj    and    θj + ξ′j(xi − xj) ≤ θi.

The term Tol in Table 5.1 refers to the tolerance to which each of the n(n−1) constraints is satisfied when Algorithm 1 terminates. The sizes of all instances are set to (n, d) = (10^4, 10), with the data generated using the convex function ∥x∥₂², and we use the least squares objective without the ridge regularization term on the subgradients. All entries in the table are averaged over the same twenty instances, and the numbers in parentheses indicate standard deviations.

Method  Tol   Cuts added (Blocks)  Primal Infeasibility  Run time (seconds)
MST     0.10  26 (2)               0.0104 (0.0001)       94.44 (20.826)
SP      0.10  9 (6)                0.0112 (0.0002)       21.36 (12.820)
C       0.10  25 (2)               0.0106 (0.0002)       55.93 (4.539)
R       0.10  6 (3)                0.0106 (0.0002)       15.39 (6.125)
2-MST   0.10  26 (2)               0.0098 (0.0002)       343.53 (62.363)
2-SP    0.10  15 (5)               0.0091 (0.0001)       35.66 (11.276)
2-C     0.10  26 (2)               0.0104 (0.0002)       131.09 (18.933)
2-R     0.10  21 (2)               0.0088 (0.0001)       46.47 (5.273)
MST     0.05  29 (2)               0.0073 (0.0002)       221.21 (40.035)
SP      0.05  25 (2)               0.0078 (0.0002)       56.75 (7.027)
C       0.05  30 (2)               0.0061 (0.0001)       117.07 (19.565)
R       0.05  26 (3)               0.0074 (0.0001)       57.95 (8.107)
2-MST   0.05  31 (3)               0.0055 (0.0001)       1448.46 (313.576)
2-SP    0.05  25 (2)               0.0068 (0.0001)       58.30 (7.041)
2-C     0.05  31 (2)               0.0059 (0.0001)       567.26 (121.269)
2-R     0.05  26 (2)               0.0064 (0.0001)       61.51 (6.861)

Table 5.1: The effect of the initialization method for (n, d) = (10^4, 10) in the ℓ2 convex regression for tolerances Tol = 0.1 and 0.05.


The results of Table 5.1 suggest that starting from a "good" initial reduced master problem can substantially impact the total run time of Algorithm 1. Both the spanning path (SP) and random (R) methods outperform the other methods. SP and R perform similarly, with the one-sided constraints being marginally better than the two-sided ones. We chose R in all of our further experiments.

5.4.3 Run times of ℓ2 convex regression

In this section, we report how Algorithm 1 scales for Problem (5.4) as n and d increase, for different tolerances, with the least squares objective. Table 5.2 presents the results obtained for a tolerance of 0.1, while Table 5.3 shows the results for a tolerance of 0.05.

n       d       Cuts (Blocks)  Infeasibility    Run time
10^3    10^1    24 (2)         0.0147 (0.0016)  2.4s (1.5s)
10^4    10^1    8 (5)          0.0106 (0.0002)  16.5s (8.7s)
10^4    10^2    14 (3)         0.0107 (0.0003)  169.2s (35.5s)
10^4    10^3    22 (6)         0.0107 (0.0002)  1.5h (0.4h)
10^5    10^1    5 (4)          0.0054 (0.0001)  1156.9s (859.4s)
10^5    10^2    5 (1)          0.0056 (0.0001)  3.8h (0.4h)
10^5    5×10^2  6 (1)          0.0056 (0.0001)  19.1h (3.0h)
5×10^5  10^1    5 (4)          0.0034 (0.0000)  20.2h (7.2h)

Table 5.2: Run times for Tol = 0.1 and ℓ2 convex regression.


n     d     Cuts (Blocks)  Infeasibility    Run time
10^3  10^1  36 (4)         0.0026 (0.0004)  58.0s (25.6s)
10^4  10^1  25 (3)         0.0074 (0.0001)  57.0s (8.4s)
10^4  10^2  110 (3)        0.0065 (0.0003)  1369.3s (91.7s)
10^5  10^1  11 (6)         0.0039 (0.0001)  1.0h (0.4h)
10^5  10^2  11 (1)         0.0040 (0.0000)  6.8h (0.7h)

Table 5.3: Run times for Tol = 0.05 and ℓ2 convex regression.

We make the following observations:

• As the number of dimensions increases, the problem becomes harder to solve, as each added constraint becomes denser. This is reflected in both Tables 5.2 and 5.3.

• The largest instances, (10^5, 500) and (5×10^5, 10), took almost a day on average to solve to the required tolerance. We tried solving them with Tol = 0.05, but the run time exceeded 24 hours, after which we terminated them. For such problems, the interior point solvers, even if they solve the initial reduced master problem, stall at subsequent iterations when the quadratic problem has close to a million constraints.

• When the tolerance is reduced to 0.05, the run time for (10^4, 10^2) increases from 2.5 minutes to 23 minutes, with the average number of iterations increasing by a factor of eight.

• To further aid in interpreting the results, we performed a linear regression of the run times on n, d, and Tol. A linear relationship between these variables has an R² of 0.96, which indicates a good fit. Regressing the logarithm of the run time on the logarithms of n and d yields that the run time scales as n^1.25 d^1.05, also with an R² of 0.96.


5.4.4 Infeasibility as a function of iterations

In this section, we aim to understand how the primal infeasibility changes as a function of the iterations for different values of the tolerance. In addition to the primal infeasibility defined in (5.34), we report the maximum violation, defined as

    max_{i∈[n], j∈[n]}  { θi − θj + ξ′i(xj − xi) },        (5.36)

as well as the number of constraints added at each iteration. We present two instances with (n, d) = (10^4, 10), with the tolerance set to 0.1 and 0.05 respectively, and illustrate the progress of the algorithm: the constraints added at each iteration, and the primal infeasibility and maximum violation at the end of each iteration.

[Figure omitted. Panels: (a) primal infeasibility (5.34) vs. iteration; (b) maximum violation (5.36) vs. iteration; (c) number of constraints added at each iteration.]

Figure 5-1: Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.1.


[Figure omitted. Panels: (a) primal infeasibility (5.34) vs. iteration; (b) maximum violation (5.36) vs. iteration; (c) number of constraints added at each iteration.]

Figure 5-2: Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.05.

Figures 5-1 and 5-2 suggest that Algorithm 1 makes rapid progress in decreasing infeasibility. It takes twenty to twenty-five iterations to decrease the infeasibility (and the maximum violation) to near zero. Moreover, the number of cuts added decreases substantially as Algorithm 1 progresses: in the final few iterations, when Algorithm 1 is close to convergence, it typically adds fewer than 5 constraints per iteration. Even for the larger sizes, we observe this trend of fewer constraints per iteration at later stages of the algorithm.

5.4.5 Comparison with other state of the art methods

In this section, we compare Algorithm 1 with two other recent methods proposed in the literature for the least squares convex regression Problem (5.4):

1. The cutting plane based method proposed in Balázs et al. [2015], referred to as aggregated cutting planes (ACP). The main difference from Algorithm 1 is that Balázs et al. [2015] use aggregated constraints in the reduced master problem.

2. The method in Mazumder et al. [2018], where the authors use an Alternating Direction Method of Multipliers (ADMM) framework to solve the least squares convex regression problem.

The ACP algorithm solves a variant of Problem (5.4) with bounds on both the function values and the subgradients, both of which we set to ∞ for comparison with Algorithm 1. Both the ACP algorithm and Algorithm 1 were run with an upper bound of 1000 on the number of iterations and Tol = 0.1. Each row with n < 10^5 was averaged over twenty independently generated random samples of the given size, while the larger ones (n ≥ 10^5) were averaged over ten independently generated samples.

In Table 5.4, we record the final values of primal infeasibility and total run times for Algorithm 1 and ACP, respectively. As far as solution quality is concerned, the final infeasibility indicates that the solutions obtained from the two methods are quite similar. However, Algorithm 1 is approximately twenty times faster than ACP as (n, d) increase. For (n, d) = (10^5, 10^2), while Algorithm 1 obtained solutions in a few hours, the ACP algorithm did not complete even after 24 hours, after which it was terminated. We remark that most of the time taken by ACP is in forming the initial aggregated constraints. The results followed a similar pattern for Tol = 0.05, and thus we omit them for the sake of brevity.

n     d     (Alg. 1) Inf.    (Alg. 1) Run time  (ACP) Inf.       (ACP) Run time
10^3  10^1  0.0143 (0.0012)  1.9 (0.6)          0.0168 (0.0011)  7.3 (0.8)
10^4  10^1  0.0106 (0.0002)  25.2 (10.8)        0.0099 (0.0002)  411.7 (26.6)
10^4  10^2  0.0107 (0.0002)  153.5 (20.3)       0.0097 (0.0003)  4785.7 (363.7)
10^5  10^1  0.0054 (0.0001)  1841.8 (230.9)     0.0050 (0.0001)  36842.7 (1391.03)

Table 5.4: Comparison for ℓ2 convex regression between Algorithm 1 and ACP for Tol = 0.1 (run times in seconds).

In Table 5.5, we present a comparison between ADMM and Algorithm 1 for instances with n = 10^3 and d = 10. For the larger size n = 10^4, the ADMM method ran into memory issues, and hence we do not report its performance for those cases. We set both the primal error and gradient error tolerances to 0.1 in the ADMM algorithm. We observe that the ADMM algorithm has an edge over Algorithm 1 in terms of infeasibility, whereas Algorithm 1 has the edge in terms of maximum violation. Algorithm 1 improves when the tolerance is reduced to 0.05, with primal infeasibility similar to the ADMM solution. Moreover, the maximum violation is guaranteed to be at most 0.05 for Algorithm 1, while this guarantee is not satisfied by the ADMM method. The ADMM solution can be improved by reducing the primal and gradient error tolerances, but the point we emphasize is that Algorithm 1 gives direct control over the maximum constraint violation.

n     d   Tol   (Alg. 1) Inf.  (Alg. 1) time  ADMM Inf.  ADMM time  ADMM Max viol.
10^3  10  0.1   0.0150         8.3            0.0059     47.8       0.0840
10^3  10  0.05  0.0029         142.7          0.0059     46.8       0.0885

Table 5.5: Comparison for ℓ2 convex regression with ADMM.

5.4.6 Run times for ℓ1 convex regression

In this section, we solve Problem (5.10), in which we minimize the ℓ1 loss rather than the usual least squares loss, and demonstrate how the algorithm scales in this context. Table 5.6 shows the run times and cuts added for a few instance sizes with the tolerance set to 0.1.

n     d     Cuts (Blocks)  Infeasibility    Run time (seconds)
10^3  10^1  24 (3)         0.0158 (0.0012)  2.9 (3.7)
10^4  10^1  10 (1)         0.0118 (0.0001)  25.3 (2.4)
10^4  10^2  168 (10)       0.0119 (0.0001)  2437.7 (470.3)
10^5  10^1  9 (1)          0.0056 (0.0001)  2501.9 (416.3)

Table 5.6: ℓ1 convex regression - Run times for Tol = 0.1.

We observe that, for the same tolerance of 0.1, the run times are higher than those obtained for ℓ2 regression (Table 5.2). Also, as d increases for a given n, the run times increase more than in the ℓ2 case.


5.4.7 Experiments on real data

In this section, we apply some of our methods to a real world data set. This data set, which was considered in Mekaroonreung and Johnson [2012], was downloaded from https://ampd.epa.gov/ampd/. The data consist of the amount of heat input (in MMBtu) and the following four covariates: the NOx emission rate, and the emissions of SO2, CO2, and NOx in tons. We consider nine years worth of data on electric utility units from 2000-2008, and after removing some rows with missing entries, we obtain a dataset with n = 28,063 and d = 4. We took a logarithmic transformation of the covariates, then centered and scaled them to have mean zero and standard deviation one. We ran the cutting plane algorithm for the least squares convex regression problem on this dataset, and present the results below.

[Figure omitted. Panels: (a) primal infeasibility vs. iteration; (b) maximum violation vs. iteration; (c) number of constraints added at each iteration.]

Figure 5-3: Progress of Algorithm 1 for Tol = 0.01.


Figure 5-4: Progress of Algorithm 1 for Tol = 0.05. (a) Primal infeasibility as a function of the number of iterations. (b) Maximum violation as a function of the number of iterations. (c) Number of constraints added at each iteration.

We make the following observations.

• Figures 5-3 and 5-4 suggest that Algorithm 1 makes rapid progress in decreasing infeasibility.

For a tolerance of 0.05, it reaches optimality fairly quickly, in around ten iterations,

while it takes around twenty iterations for the smaller tolerance of 0.01.

• As in the experiments on synthetic data, the number of cuts added decreases

substantially as Algorithm 1 progresses. The final few iterations each add only a very

small number of cuts.

• Finally, we include a note on the running times of the algorithm for this data. We

observe a run time of 20-30 minutes for a tolerance of 0.05, which is in line with

what we observe in Table 5.3 for synthetic data. On reducing the tolerance to 0.01, the

run time increases to 60-70 minutes, which is expected as the number of iterations

doubles in this case.


5.4.8 Sparse convex regression

In this section, we present the computational results for Algorithms 2 and 3 applied to

the problem of optimal subset selection in this setting. As in the continuous case, we

generate X from a Gaussian distribution, and randomly sample the support set S* of size k

from {1, . . . , d}. Specifically, we generate n d-dimensional vectors x_i, each drawn from a

Gaussian distribution with zero mean and correlation matrix Σ, where Σ_ij = ρ^|i−j|, 1 ≤ i, j ≤ d,

for some correlation 0 ≤ ρ ≤ 1. Note that when ρ = 0 the features are i.i.d., and a higher ρ

indicates stronger correlation among the features.

We use the convex function Φ(x) = ∑_{i∈S*} x_i², and the response data y_i = Φ(x_i) + ε_i, 1 ≤

i ≤ n. The errors ε_i are i.i.d. N(0, σ²), for all i = 1, . . . , n. We scale the data appropriately

so that the signal-to-noise ratio (SNR) is 3. Again, we mean-center and normalize the

features and response vectors to have unit ℓ2 norm before providing the data as an input

to Algorithms 2 and 3.
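The generation procedure above can be sketched as follows (a hedged illustration: the function name is ours, and the convention of choosing σ from the empirical variance of Φ so that Var(Φ)/σ² equals the target SNR is one natural reading of the text, not necessarily the thesis's exact scaling).

```python
import numpy as np

def make_sparse_convex_data(n, d, k, rho, snr=3.0, seed=0):
    """Synthetic data for sparse convex regression: AR(1)-correlated
    Gaussian features and responses y_i = Phi(x_i) + eps_i with
    Phi(x) = sum of x_j^2 over a random support of size k."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Sigma_ij = rho^|i-j|
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    support = np.sort(rng.choice(d, size=k, replace=False))
    phi = (X[:, support] ** 2).sum(axis=1)
    # Pick the noise level so the empirical ratio Var(Phi)/sigma^2 equals snr
    sigma = np.sqrt(phi.var() / snr)
    y = phi + rng.normal(0.0, sigma, size=n)
    return X, y, support
```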

First, we demonstrate the value of using an MIO solver, by iteratively adding constraints

to the primal problem according to Algorithm 2, and show the computational results in

Tables 5.7 and 5.8. Next, we present the results for Algorithm 3 in Table 5.9, where we

reformulate the sparse problem as minimizing a convex piecewise linear function over pure

binary variables. If S is the optimal set obtained by our algorithms, we define accuracy as

Accuracy = ∣S∗ ∩ S∣ / k, (5.37)

where S∗ is the true support. Next, we define the false positive rate, which is the fraction

of features from the recovered support that are outside the true support S∗, i.e.,

False Positive Rate = ∣S ∖ S∗∣ / ∣S∣. (5.38)
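These two metrics can be computed directly from the recovered and true supports; the helper below is an illustrative sketch (its name and signature are ours, not the thesis code).

```python
def support_metrics(S_hat, S_star, k):
    """Accuracy (5.37) and false positive rate (5.38) of a recovered
    support S_hat against the true support S_star of size k."""
    S_hat, S_star = set(S_hat), set(S_star)
    accuracy = len(S_hat & S_star) / k
    false_positive_rate = len(S_hat - S_star) / len(S_hat)
    return accuracy, false_positive_rate
```

For example, recovering {1, 2, 3, 9} against a true support {1, 2, 3, 4} gives an accuracy of 3/4 and a false positive rate of 1/4.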


A. Primal approach (Algorithm 2)

We present the results for the primal approach, as defined in Algorithm 2, in Tables 5.7 and

5.8 for n = 50k and 100k respectively. First, we discuss how the value of M is selected in

Problem (5.12) in the execution of Algorithm 2. In this primal approach, we solve the sparse

convex regression problem with `∞ norm bounds on the subgradients. Mazumder et al.

[2018] argue that the subgradients of the points near the boundary of Conv(x1, . . . ,xn) grow

large resulting in overfitting, and thus a bound on the subgradients is needed. Consequently,

we vary the value of M and select it via cross-validation. Let M* denote the maximum

absolute value of the optimal solutions of Problems (5.14) and (5.15); we set M = ηM* for

varying η, and calculate the validation error for each of these choices of M. For smaller

values of M the solution is too constrained, while for larger values overfitting tends to occur.

We use the one-standard-error rule [Hastie et al., 2009] when selecting the value of the

parameter M via cross-validation. When cross-validating, we select candidate values

M_1, . . . , M_s of the parameter, with corresponding mean validation errors E_1, . . . , E_s and

standard deviations of the mean error σ_1, . . . , σ_s, typically obtained by K-fold

cross-validation. The one-standard-error rule selects the parameter M = M_j, where j is the

smallest index in the set {i ∣ E_i ≤ E_{i*} + σ_{i*}} and i* = argmin_{i=1,...,s} E_i.
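This selection rule can be sketched as follows (the function name and array-based interface are our own; the sketch assumes candidate parameters are ordered so that smaller indices correspond to more constrained models).

```python
import numpy as np

def one_standard_error_rule(errors, stds):
    """Given mean validation errors E_1..E_s and standard errors
    sigma_1..sigma_s of the candidate parameters, return the smallest
    index i with E_i <= E_{i*} + sigma_{i*}, where i* = argmin_i E_i."""
    errors, stds = np.asarray(errors, float), np.asarray(stds, float)
    i_star = int(np.argmin(errors))
    threshold = errors[i_star] + stds[i_star]
    return int(np.flatnonzero(errors <= threshold)[0])
```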

For n ≤ 10^4, the thresholding heuristic described in Section 5.3.3 was used to provide

warm starts to Algorithm 2. For n > 10^4, we ran the same thresholding heuristic on a

sample of the points, and used the resulting support as a warm start. For n ≤ 10^4, we solve

Problems (5.14) and (5.15) to find the value M* and set M = ηM*, choosing η via

cross-validation from the set {10^-3, 10^-2, 10^-1, 0.5}. For n > 10^4, we avoid solving Problems

(5.14) and (5.15) due to high solve times and instead select M from the set {10^-3, 10^-2, 10^-1, 0.5} via

cross-validation. In this case, we set the ridge regularization parameter λ to zero.

For every row in all the following tables, we report the median run times and mean

accuracies over ten independently generated samples for the case when n = 50k, d = 100, and

five samples for instances where d = 500 or n = 100k with the standard deviation across these


samples in parentheses. The key finding, from comparing Tables 5.2 and 5.8, is that

sparse convex regression in fact solves faster than continuous convex regression, at least for

k = 10. Moreover, the resulting accuracy is at least 95%. Furthermore, Tables 5.7 and 5.8

indicate that the accuracy of Algorithm 2 increases with n and, beyond a

certain n, reaches a near-perfect 100%.

n = 50k

ρ      k    d     Accuracy %     Run time
0.0    10   100   100.0 (0.0)    1691.16 (284.8)
0.0    10   500   100.0 (0.0)    6522.37 (281.1)
0.0    20   100   98.0 (2.7)     2411.76 (323.0)
0.0    20   500   92.0 (2.3)     15276.47 (5862.6)
0.1    10   100   100.0 (0.0)    2778.83 (6369.8)
0.1    10   500   100.0 (0.0)    6326.79 (613.4)
0.1    20   100   99.0 (2.2)     2109.42 (292.8)
0.1    20   500   94.4 (2.2)     11589.47 (6883.6)
0.5    10   100   100.0 (0.0)    2062.20 (508.0)
0.5    10   500   98.0 (4.5)     6083.46 (441.1)
0.5    20   100   95.0 (3.5)     3158.10 (868.4)
0.5    20   500   93.3* (4.1)    25596.20 (3310.3)

Table 5.7: Accuracy % and run times for Algorithm 2 for n = 50k.


n = 100k, d = 100

ρ      k    Accuracy %     Run time
0.0    10   95.0 (12.2)    7665.68 (38.23)
0.0    20   100.0 (0.0)    6605.0 (357.8)
0.1    10   100.0 (0.0)    8939.83 (3575.0)
0.1    20   100.0 (0.0)    11638.24 (1950.7)
0.5    10   100.0 (0.0)    6823.41 (777.0)
0.5    20   96.0 (2.2)     11937.81 (2120.6)

Table 5.8: Accuracy % and run times for Algorithm 2 for n = 100k, d = 100.

We make the following observations:

• For n = 50k, increasing d from 100 to 500 increases the run time almost

fourfold. The accuracy, however, remains close to 100%. When we raised d

to 1,000, the machines ran out of memory.

• The median run times and accuracies for ρ = 0.1 remain comparable in magnitude with

the results for ρ = 0. For (50k,100,10) the run times were highly skewed with values

ranging from 2000 to a maximum of 15,000 seconds. Excluding the three values with

run times of over 10,000 seconds, the median run time of the remaining seven instances

was 2,509.64 seconds with a standard deviation of 413.9 seconds.

• For ρ = 0.5, the solver could not solve one instance out of the five samples for

(50k,500,20). We report the median over the remaining four instances. The median

run times, on average, increase with a corresponding increase in ρ.


B. Dual approach (Algorithm 3)

We present the results of the dual approach in Table 5.9 where we report the average accuracy

(in %) and the average run times (in seconds) to provable optimality (MIO optimality gap

of 1%). Each row for d = 100 is averaged over ten independently generated random instances

of that size, with five such samples for problems where d = 500. As before, we use the

one-standard-error rule during cross-validation [Hastie et al., 2009] to select the final value

of the parameter λ, varying it from 10^-3 to 10^-1. The tolerance parameter ε in Algorithm

3 is set to 10^-3 in all the experiments that follow.

n = 50k

ρ      k    d     Accuracy %     Run time (seconds)
0.0    10   100   97.0 (4.8)     2072.23 (236.5)
0.0    10   500   96.0 (5.5)     1785.43 (239.3)
0.0    20   100   94.0 (3.9)     2356.16 (318.7)
0.0    20   500   91.1 (4.9)     1633.81 (523.8)
0.1    10   100   97.0 (4.8)     2073.42 (434.3)
0.1    10   500   96.7 (5.8)     2178.06 (358.9)
0.1    20   100   91.0 (4.6)     2055.16 (608.3)
0.1    20   500   90.0 (0.0)     1493.74 (169.9)
0.5    10   100   92.0 (6.3)     1210.07 (171.4)
0.5    10   500   98.0 (4.5)     1858.62 (522.8)
0.5    20   100   91.0 (4.2)     1122.74 (149.8)
0.5    20   500   89.0 (5.5)     1685.85 (340.2)

Table 5.9: Accuracy % and run times for Algorithm 3 for n = 50k.

A key takeaway from Table 5.9 is that the dual Algorithm 3 is able to solve instances

with 50,000 sample points in a few minutes, for various values of the correlation, with high


support recovery rates. For n > 50,000 the key bottleneck is computing the objective g

and its subgradients. Recall that while evaluating g, Algorithm 3 requires the solution

of a continuous ridge regularized convex regression problem on a restricted support set

(Problem (5.18)) which has O(nk) terms in its objective. The relaxation problem (5.24),

which provides good quality warm starts, also becomes computationally expensive to solve

due to the presence of dense semi-infinite constraints.

To summarize, both the primal and dual methods achieve exact or near-exact recovery on

fairly noisy data (as evidenced by the low signal-to-noise ratio of 3). While

the primal approach seems to have an edge in scalability over the dual approach,

the dual approach is faster than the primal approach when (n, d, k) = (5 × 10^4, 500, 10).

C. Accuracy

In this section, we report the accuracy of the solutions obtained by Algorithm 3 as a function

of the parameters d, k, ρ, and SNR. We generate synthetic data for various values of each

of these parameters, varying one parameter at a time while keeping the rest

constant. We present the mean accuracy and run times averaged over fifteen independently

generated samples, along with their one-standard-deviation error bars.

Figure 5-5: Accuracy and run times for varying SNR: (a) √SNR = 3, (b) √SNR = 7, (c) √SNR = 20.

In Figures 5-5a–5-5c, we fix (d, k, ρ) = (100, 10, 0.1), and vary √SNR ∈ {3, 7, 20}.


(a) ρ = 0.0 (b) ρ = 0.1 (c) ρ = 0.5

Figure 5-6: Accuracy and run times for varying correlation ρ.

In Figures 5-6a–5-6c, we fix (d, k, √SNR) = (100, 10, 20), and vary ρ in the set {0, 0.1, 0.5}.

(a) d = 50 (b) d = 100 (c) d = 150

Figure 5-7: Accuracy and run times for varying dimension d.

In Figures 5-7a–5-7c, we fix (k, ρ, √SNR) = (5, 0.1, 20) and vary d in the set {50, 100, 150}.

(a) k = 5 (b) k = 10 (c) k = 15

Figure 5-8: Accuracy and run times for varying sparsity parameter k.

Finally, in Figures 5-8a–5-8c, we fix (d, ρ, √SNR) = (100, 0.1, 20) and vary the sparsity


level k in the set {5, 10, 15}. We solve the problems with a time cutoff of two, five, and ten

minutes for k = 5, 10, 15 respectively, and take the best solution obtained within that limit

whenever the incumbent solution has not been certified optimal by then.

We make the following observations:

(a) As n increases, the accuracy of Algorithm 3 increases and the running time decreases.

These observations are consistent with the findings of Bertsimas and Van Parys [2016]

in the context of sparse linear regression.

(b) As the SNR increases, we reach higher accuracy for smaller values of n; that is, the problem

becomes easier (see Figures 5-5a–5-5c).

(c) To reach an accuracy of 95%, we need n equal to 10,000, 12,000, and 15,000 for ρ equal to

0, 0.1, and 0.5, respectively (see Figures 5-6a–5-6c).

(d) To reach an accuracy of 95%, we need n equal to 3,000, 4,000, and 5,000 for d equal to 50,

100, and 150, respectively (see Figures 5-7a–5-7c).

(e) To reach an accuracy of 90%, we need n equal to 2,500, 8,000, and 10,000 for k equal to 5,

10, and 15, respectively (Figures 5-8a–5-8c).

D. False Positive rates

In this section, we investigate the false discovery rate of the estimator resulting from this

algorithm. So far, we have taken the sparsity parameter k as given in all of our experiments.

In reality, however, this value needs to be inferred from the data, usually by cross-validation.

Thus, it is imperative that the algorithm not only choose the relevant features,

but also avoid marking spurious features as relevant.

To check this, we performed an experiment with simulated data for (n, d) = (10000, 100),

with five features chosen randomly. We vary k in the set {3, . . . , 10}, and choose the best k

by five-fold cross-validation. We then run our algorithms for that value of k, and report the

median false positive rate over ten independently generated samples. We present our results

for both algorithms in Tables 5.10 (Algorithm 3) and 5.11 (Algorithm 2).


For the dual approach, we impose a time limit of 120 seconds and take the best solution

obtained by that point; for the primal method, we impose no such time

limit. We report the median false positive rate over ten independently generated samples for

different values of ρ and SNR. The results suggest that our algorithms not only pick the relevant

features, but are also able to control for spurious discoveries.

√SNR ρ = 0.0 ρ = 0.1 ρ = 0.5

3 0% 0% 0%

7 0% 0% 0%

20 0% 0% 0%

Table 5.10: False Positive rate for Algorithm 3.

√SNR ρ = 0.0 ρ = 0.1 ρ = 0.5

3 0% 0% 0%

7 0% 0% 0%

20 0% 0% 0%

Table 5.11: False Positive rate for Algorithm 2.

5.4.9 Discussion

(a) For the problem of convex regression, we see that Algorithm 1 has a significant edge

over other state-of-the-art methods in terms of run time and accuracy. Our approach

allows us to solve problems with n = 100,000 and d = 100 in hours. It is also flexible

enough to accommodate other constraints, such as coordinate-wise monotonicity and

norm-bounded subgradients.

(b) For the sparse convex regression problem, the dual approach (Algorithm 3) has an

edge over the primal method (Algorithm 2) in run times and scalability. Surprisingly,


Algorithm 3 solves the sparse convex regression problem in times comparable to the

continuous case, implying that the price of sparsity is small. Since we break new

ground in this area, we are unable to include any comparisons to other methods.

(c) For the sparse convex regression problem, the primal approach scales to problems of

size (n, d, k) = (10^5, 100, 10) in hours, while the dual approach scales to (n, d, k) = (5 × 10^4, 500, 10) in minutes. We perform various experiments by varying the degree

of correlation among the covariates ρ, signal to noise ratio (SNR), number of features

d, sparsity level k, and demonstrate that our algorithms achieve near perfect support

recovery as n increases. Also, we note that both Algorithms 2 and 3 limit the false

discovery rate.


Chapter 6

Conclusions

This thesis started with the motivation that the current state-of-the-art approaches in data-

driven decision making can be improved upon by considering prediction and prescription

jointly rather than separately. The current data-rich age has made such approaches possible,

and opened up exciting avenues in diverse application domains such as healthcare and retail.

For instance, personalizing treatment choices based on each patient's features is

a problem of tremendous interest to healthcare providers. This thesis considers such problems

from a broad perspective, develops approaches that rely heavily on optimization methods,

and demonstrates the merits of this approach.

In Chapters 2 and 3, we consider problems in prescriptive analytics, where the goal is

not just predicting uncertain quantities (demand, returns) but making decisions. We demonstrate

that jointly considering prediction and prescription can typically lead to better decisions and

outcomes. In particular, given the prevalence of observational data, we demonstrate that such

approaches can be used directly in various applications and add significant value.

In Chapter 4, we consider the classical technique of scenario reduction for stochastic

optimization, which relies on approximating the empirical distribution with a few scenarios

to improve tractability for large n. We demonstrate that taking into account the decision

quality of scenarios can result in better distributions tailored for decision-making, compared

to standard clustering based approaches with the Wasserstein distance. Crucially, achieving


higher quality decisions with fewer scenarios can significantly improve interpretability, which

is desirable for practitioners.

Finally, in Chapter 5, we apply modern optimization-based techniques for solving shape

constrained regression problems. We consider the problem of selecting a small subset of

features that leads to least error, and develop primal and dual approaches for solving this

problem. With the aid of computational examples on real and synthetic data, we demonstrate

that our techniques lead to improved tractability, high accuracy and low false positive rates.

In conclusion, this area of data-driven decision making lies at the intersection of opti-

mization and learning, and is a particularly exciting avenue for research. These four chapters

serve to illuminate just a few of the areas where optimization and data analytics can yield an

edge over current practice – in terms of interpretability, sparsity, tractability, and decision

quality – in a wide variety of applications.


Appendix A

Supplement for Chapter 2

A.1 Optimization Algorithms for Joint Predictive and

Prescriptive Analytics

Algorithm 5 Random forests algorithm
Input: Training data S = {(X^i, Y^i), i = 1, . . . , n}, parameters nmin, ∆max, K, α, µ.
Output: Random forest {τ^1, . . . , τ^K}.
1: procedure ComputeRandomForest
2:   for 1 ≤ t ≤ K do
3:     Sample S(t), a collection of n points with replacement from S.
4:     Sample ⌊α dx⌋ features from the full dx features and form the set S(t)_α.
5:     Compute the t-th tree, τ^t = GreedyTree(S(t)_α, 0)
6:   end for
7:   return Random forest {τ^1, . . . , τ^K}.
8: end procedure

Before defining the local search algorithms, we first define some notation which we use while

describing the algorithm.


Algorithm 4 Greedy recursive algorithm for training prescriptive trees
Input: Data S = {(X^i, Y^i), i = 1, . . . , n0}, current depth ∆, tuning parameters nmin, ∆max, µ.
Output: Greedy prescriptive tree τ.
1:  procedure GreedyTree(S, ∆)
2:    Solve the empirical SAA problem over S (Problem (2.2)) to obtain z*(S)
3:    Set τ(x) = z*(S)
4:    Also, store the optimal objective value of Problem (2.2) as cz(S)
5:    Compute predictive error, cy(S) = ∑_{i∈S} ∥Y^i − (1/n0) ∑_{j∈S} Y^j∥²
6:    Net split cost, c(S) = µ n0 cz(S) + (1 − µ) cy(S)
7:    Set success ← 0, and cmin ← c(S).
8:    if ∆ < ∆max and n0 ≥ 2 nmin then
9:      for each 1 ≤ p ≤ dx do
10:       Sort the covariate values {X^i_p : i ∈ S} of feature p in non-decreasing order
11:       Obtain the set of kp unique values for each p as π^p_1 < . . . < π^p_{kp}
12:     end for
13:     for 1 ≤ p ≤ dx and p : kp ≥ 2 do
14:       for 1 ≤ k < kp do
15:         SL = {x ∈ S : x_p ≤ (π^p_k + π^p_{k+1})/2}, let nL = |SL|.
16:         SR = {x ∈ S : x_p > (π^p_k + π^p_{k+1})/2}, let nR = |SR|.
17:         if nmin ≤ nL and nmin ≤ nR then
18:           Solve SAA problems over SL and SR to obtain z*(SL), z*(SR) respectively
19:           Also, compute the respective prescriptive costs as cz(SL), cz(SR)
20:           Next, compute the predictive costs as follows:
21:           cy(SL) = ∑_{i∈SL} ∥Y^i − (1/nL) ∑_{j∈SL} Y^j∥²
22:           cy(SR) = ∑_{i∈SR} ∥Y^i − (1/nR) ∑_{j∈SR} Y^j∥²
23:           New cost, c ← µ(nL cz(SL) + nR cz(SR)) + (1 − µ)(cy(SL) + cy(SR))
24:           if c < cmin then
25:             Set p* = p and s* = (π^p_k + π^p_{k+1})/2
26:             Update cmin ← c and success ← 1
27:           end if
28:         end if
29:       end for
30:     end for
31:     if success == 1 then
32:       SLeft = {x ∈ S : x_{p*} ≤ s*}
33:       SRight = {x ∈ S : x_{p*} > s*}
34:       τLeft = GreedyTree(SLeft, ∆ + 1)
35:       τRight = GreedyTree(SRight, ∆ + 1)
36:       τ(x) = 1(x_{p*} ≤ s*) τLeft(x) + 1(x_{p*} > s*) τRight(x).
37:     end if
38:   end if
39:   return τ.
40: end procedure


• τt denotes the subtree whose root is the tth node of a tree τ , and nodes(τ) refers to

the set of all nodes of the tree τ .

• For any index set I, XI, YI denote the subsets of the training data X, Y corresponding

to I.

• Shuffle(I) returns a randomized order of the index set I.

• For any subtree τ , Lµ(τ) denotes the combined objective (2.15) of the subtree evaluated

on the training set for a given value of µ.

• Also, minleafsize(τ) of any subtree τ denotes the minimum number of samples in a

leaf belonging to τ .

• Finally, we refer to the two descendants of a non-leaf node τ as left and right child

respectively, denoted by τL and τR.

Algorithm 6 Local search algorithm for training optimal prescriptive trees
Input: Data S = {(X^i, Y^i), i = 1, . . . , n}, initial prescriptive tree τ
Output: Locally optimal prescriptive tree τ
1:  procedure CoordinateDescentTree(τ)
2:    repeat
3:      cprev ← Lµ(τ)
4:      for all t ∈ Shuffle(nodes(τ)) do
5:        I ← {i : X^i is assigned by τ to a leaf contained in subtree τt}
6:        τt ← OptimizeNode(τt, XI, YI)
7:        Update τ by replacing the t-th node with τt
8:        Update ccurrent ← Lµ(τ)
9:      end for
10:   until cprev = ccurrent  ▷ Local optimality.
11:   return τ.
12: end procedure


Algorithm 7 Optimizing a node
Input: Training data S = {(X^i, Y^i), i = 1, . . . , n}, subtree τ to optimize.
Output: Optimized subtree τ.
1:  procedure OptimizeNode(τ)
2:    if τ is a branch then
3:      τ(1), error(1) ← PerturbSplit(τL, τR)
4:      error(2) ← Lµ(τL) with τ(2) ← τL
5:      error(3) ← Lµ(τR) with τ(3) ← τR
6:      Update cnew ← error(j*) and τnew ← τ(j*), where j* = argmin_{1≤j≤3} error(j)
7:    else  ▷ τ is a leaf node
8:      Create a new split with children τL, τR
9:      τnew, cnew ← PerturbSplit(τL, τR) to obtain the new error and subtree
10:   end if
11:   if cnew < Lµ(τ) then
12:     Update τ ← τnew
13:   end if
14:   return τ.
15: end procedure


Algorithm 8 Perturbing a split
Input: Left and right subtrees τL and τR to use as children of the new split, and S, the subset of training data that falls into the leaves of these two subtrees.
Output: Subtree τ with the best axis-parallel split at the root, and its corresponding loss.
1:  procedure PerturbSplit(τL, τR)
2:    Initialize error* ← ∞
3:    for p ∈ [dx] do
4:      Sort the values X^j_p, j ∈ S in non-decreasing order
5:      Obtain the set of kp unique values of (X^j_p, j ∈ S) for each p as π^p_1 < . . . < π^p_{kp}
6:      for 1 ≤ k < kp do
7:        Split value, γ = (π^p_k + π^p_{k+1})/2
8:        τ ← branch node X^i_p ≤ γ with left and right children τL, τR
9:        if minleafsize(τ) > nmin then  ▷ Split feasible
10:         if Lµ(τ) < error* then  ▷ Improvement
11:           error* ← Lµ(τ)
12:           τ* ← τ
13:         end if
14:       end if
15:     end for
16:   end for
17:   return τ*, error*
18: end procedure


Algorithm 9 Complete tree algorithm
Input: Training data S = {(X^i, Y^i), i = 1, . . . , n}, parameters nmin, ∆max, K, f1, µ.
Output: Optimized prescriptive tree τ*.
1:  procedure ComputeTree
2:    for 1 ≤ j ≤ K do
3:      τ^j_G = GreedyTree(S, 0)
4:      We make the following three modifications of Algorithm 4:
5:        Sort the splits in non-increasing order of prediction error
6:        Compute prescriptive costs of only the top f1% of the splits
7:        Save the values cz(S), z*(S) for various sets of indices S
8:      τ^j_L = CoordinateDescentTree(τ^j_G)
9:      We make the following two modifications of Algorithm 6:
10:       For each candidate split S, find the best z among the stored z*
11:       Use this z* as the starting point for the first-order methods
12:   end for
13:   Select the best tree τ* = τ^j_L where j ∈ argmin_{1≤k≤K} Lµ(τ^k_L)
14:   return τ*.
15: end procedure

A.1.1 First order convex methods for local search procedure

In the following, we present in greater detail the two first-order convex optimization

algorithms we use in the local search procedure. Recall that both Algorithms 4 and 6 rely on

evaluating the quality of various splits, which involves solving an empirical SAA problem over

different index sets. The core idea behind these methods is that we already have access

to good-quality solutions, which we iteratively improve to re-compute the optimal

solution for a new split.

In the part that follows, we assume we have access to an initial solution z(0), and the

optimization problem we solve is min_{z∈Z} f(z). In the case of trees, f is the empirical sum of the


costs c(z; y) over all the values of y in a leaf.

A. Projected subgradient descent: In this algorithm, we compute a subgradient at

each candidate solution z, with the update step

z(k+1) = PZ(z(k) − αk g(z(k))), (A.1)

where

• αk is the step size at the k-th iteration,

• g(z(k)) is a subgradient of f evaluated at z(k), and

• PZ(u) is the projection of u onto the set Z, given by the solution to the following convex

minimization problem:

PZ(u) = argmin_{z∈Z} (1/2)∥u − z∥₂². (A.2)

Since this is not necessarily a descent method when subgradients are used, we keep track

of the best iterate obtained so far at each step.
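A compact sketch of this scheme with the best-iterate bookkeeping follows (illustrative only: the step-size schedule αk = 1/√(k+1) and the toy box-constrained problem are our own choices, not from the thesis).

```python
import numpy as np

def projected_subgradient(f, subgrad, project, z0, steps=200):
    """Projected subgradient method (A.1) with best-iterate tracking:
    z_{k+1} = P_Z(z_k - alpha_k * g(z_k)) with alpha_k = 1/sqrt(k+1).
    Subgradient steps need not decrease f, so the best point seen
    so far is recorded and returned."""
    z = np.asarray(z0, dtype=float)
    best_z, best_f = z.copy(), f(z)
    for k in range(steps):
        z = project(z - subgrad(z) / np.sqrt(k + 1))
        fz = f(z)
        if fz < best_f:
            best_f, best_z = fz, z.copy()
    return best_z, best_f

# Toy problem: minimize |z1 - 2| + |z2 + 2| over the box Z = [-1, 1]^2;
# the optimum is z = (1, -1) with value 2, and projection is coordinate clipping.
f = lambda z: abs(z[0] - 2.0) + abs(z[1] + 2.0)
g = lambda z: np.array([np.sign(z[0] - 2.0), np.sign(z[1] + 2.0)])
proj = lambda z: np.clip(z, -1.0, 1.0)
z_best, f_best = projected_subgradient(f, g, proj, np.zeros(2))
```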

B. Algorithms on the smoothed function: An alternative approach is to smooth the

potentially nonsmooth function f, and then apply iterative algorithms to find the optimal

solution. For instance, when f can be represented as max_u u′Az, we define the smoothed

approximation fδ of f as

fδ(z) = max_{u∈Q} { u′Az − (δ/2)∥u∥² }, (A.3)

for some δ > 0 that controls the quality of the approximation. The set Q is a closed bounded

convex set, which denotes the dual space of the primal feasible set Z. The gradient of fδ is

given by A′u∗δ(z), where u∗δ(z) is the optimal solution to the maximization problem (A.3)

for any given z. For more details, we refer the reader to Nesterov [2005].


When the projection problem (A.2) onto Z is efficiently solvable, a projected

gradient-type algorithm (A.1) can be used. An alternative approach is to use a Frank-Wolfe

type algorithm, where the iterates are given by solving sequential linear optimization

problems:

z̄ ∈ argmin_{z∈Z} ∇fδ(z(k))′(z − z(k)),

z(k+1) = αk z(k) + (1 − αk) z̄. (A.4)

Here, αk is chosen between zero and one. This algorithm relies on being able to efficiently

minimize linear functions over the constraint set Z. For more details, we refer the reader

to Jaggi [2013] and the references therein.
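As an illustration, the following sketch runs the Frank-Wolfe update (A.4) on a toy problem where the linear subproblem is trivial: over the probability simplex, minimizing a linear function amounts to placing all mass on the coordinate with the smallest gradient entry. The smooth quadratic objective and the step schedule αk = k/(k+2) (the classical 2/(k+2) schedule rewritten in the notation of (A.4)) are our own choices; the thesis applies the iteration to the smoothed function fδ.

```python
import numpy as np

def frank_wolfe_simplex(grad, z0, steps=2000):
    """Frank-Wolfe iteration (A.4) over the probability simplex.
    The linear subproblem argmin_{z in Z} grad(z_k)'z is solved by
    placing all mass on the smallest gradient coordinate; the update
    is z_{k+1} = alpha_k z_k + (1 - alpha_k) z_bar, alpha_k = k/(k+2)."""
    z = np.asarray(z0, dtype=float)
    for k in range(steps):
        g = grad(z)
        z_bar = np.zeros_like(z)
        z_bar[int(np.argmin(g))] = 1.0      # simplex vertex minimizing g'z
        alpha = k / (k + 2.0)
        z = alpha * z + (1.0 - alpha) * z_bar
    return z

# Toy problem: minimize ||z - c||^2 over the simplex; since c lies in
# the simplex, the iterates should approach c.
c = np.array([0.2, 0.3, 0.5])
z = frank_wolfe_simplex(lambda z: 2.0 * (z - c), np.array([1.0, 0.0, 0.0]))
```

Note that the iterates stay feasible by construction, since each update is a convex combination of points in Z; this is the appeal of Frank-Wolfe when projections onto Z are expensive but linear minimization is cheap.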

A.2 Proofs

To begin, we prove the following lemma.

Lemma 1. Suppose Assumptions 1-5 hold. If (x, z) and (x, z′) are in the same partition of

X × Z, as specified by Assumption 3, then

∣Ψ(z, δ) − Ψ(z′, δ)∣ ≤ (α(LD + 1 + √(2λmax ln(1/δ))) + L(√(2 ln(1/δ)) + 3)) ∣∣z − z′∣∣,

where Ψ(z, δ) = f(x, z) − f̂(x, z) − (2/(3γn)) ln(1/δ) − √(2V(x, z) ln(1/δ)) − L ⋅ B(x, z).

Proof. We first note ∣f(x, z) − f(x, z′)∣ ≤ L∣∣z − z′∣∣ by the Lipschitz assumption on c(z; y).


Next, since (x, z) and (x, z′) are contained in the same partition,

∣f̂(x, z) − f̂(x, z′)∣ = ∣∑_i w_i(x, z) c(z; Y^i) − w_i(x, z′) c(z′; Y^i)∣

≤ ∣∑_i w_i(x, z) c(z; Y^i) − w_i(x, z) c(z′; Y^i)∣ + ∣∑_i w_i(x, z) c(z′; Y^i) − w_i(x, z′) c(z′; Y^i)∣

≤ L∣∣z − z′∣∣ + ∣∣w(x, z) − w(x, z′)∣∣₁

≤ (L + α)∣∣z − z′∣∣,

where we have used Hölder's inequality, the uniform bound on c, and Assumption 3.

Similarly, for the bias term,

∣L B(x, z) − L B(x, z′)∣ = L ∣∑_i w_i(x, z) ∣∣(X^i, Z^i) − (x, z)∣∣ − w_i(x, z′) ∣∣(X^i, Z^i) − (x, z′)∣∣∣

≤ L ∣∑_i w_i(x, z) ∣∣(X^i, Z^i) − (x, z)∣∣ − w_i(x, z) ∣∣(X^i, Z^i) − (x, z′)∣∣∣

+ L ∣∑_i w_i(x, z) ∣∣(X^i, Z^i) − (x, z′)∣∣ − w_i(x, z′) ∣∣(X^i, Z^i) − (x, z′)∣∣∣

≤ L ∑_i w_i(x, z) ∣ ∣∣(X^i, Z^i) − (x, z)∣∣ − ∣∣(X^i, Z^i) − (x, z′)∣∣ ∣

+ L ∣∣w(x, z) − w(x, z′)∣∣₁ sup_i ∣∣(X^i, Z^i) − (x, z)∣∣

≤ (L + LαD)∣∣z − z′∣∣.

Next, we consider the variance term. We let Σ(z) denote the diagonal matrix with entries

Var(c(z; Y^i) ∣ X^i, Z^i), i = 1, . . . , n. As before,

∣√V(x, z) − √V(x, z′)∣ = ∣√(∑_i w_i²(x, z) Var(c(z; Y^i)∣X^i, Z^i)) − √(∑_i w_i²(x, z′) Var(c(z′; Y^i)∣X^i, Z^i))∣

≤ ∣√(∑_i w_i²(x, z) Var(c(z; Y^i)∣X^i, Z^i)) − √(∑_i w_i²(x, z) Var(c(z′; Y^i)∣X^i, Z^i))∣

+ ∣√(∑_i w_i²(x, z) Var(c(z′; Y^i)∣X^i, Z^i)) − √(∑_i w_i²(x, z′) Var(c(z′; Y^i)∣X^i, Z^i))∣

= ∣√(w(x, z)ᵀ Σ(z) w(x, z)) − √(w(x, z)ᵀ Σ(z′) w(x, z))∣

+ ∣ ∣∣w(x, z)∣∣_{Σ(z′)} − ∣∣w(x, z′)∣∣_{Σ(z′)} ∣,

where ∣∣v∣∣_Σ = √(vᵀΣv). One can verify that, because Σ is positive semidefinite, ∣∣ ⋅ ∣∣_Σ is a

seminorm that satisfies the triangle inequality. Therefore, we can upper bound the latter

term by

√((w(x, z) − w(x, z′))ᵀ Σ (w(x, z) − w(x, z′))) ≤ ∣∣w(x, z) − w(x, z′)∣∣

≤ ∣∣w(x, z) − w(x, z′)∣∣₁

≤ α∣∣z − z′∣∣,

where we have used the assumption that ∣c(z; y)∣ ≤ 1.

The former term can again be upper bounded by the triangle inequality:

∣√(∑_i w_i²(x, z) Var(c(z; Y^i)∣X^i, Z^i)) − √(∑_i w_i²(x, z) Var(c(z′; Y^i)∣X^i, Z^i))∣

≤ √(∑_i w_i²(x, z) (√(Var(c(z; Y^i)∣X^i, Z^i)) − √(Var(c(z′; Y^i)∣X^i, Z^i)))²). (A.5)

Noting that √(Var(c(z; Y^i))) = ∣∣c(z; Y^i) − E[c(z; Y^i)]∣∣_{L2} (dropping conditioning for notational

convenience), we can apply the triangle inequality to the L2 norm:

(∣∣c(z; Y^i) − E[c(z; Y^i)]∣∣_{L2} − ∣∣c(z′; Y^i) − E[c(z′; Y^i)]∣∣_{L2})²

≤ ∣∣c(z; Y^i) − c(z′; Y^i) − E[c(z; Y^i) − c(z′; Y^i)]∣∣²_{L2}

≤ E[(c(z; Y^i) − c(z′; Y^i))²]

≤ L²∣∣z − z′∣∣².

Therefore, we can upper bound (A.5) by

√(∑_i w_i²(x, z) L²∣∣z − z′∣∣²) ≤ ∑_i w_i(x, z) L∣∣z − z′∣∣ = L∣∣z − z′∣∣,

where we have used the concavity of the square root function. Therefore,

∣√V(x, z) − √V(x, z′)∣ ≤ (α + L)∣∣z − z′∣∣.

Combining the three results with the triangle inequality yields the desired result.

Proof of Theorem 1. To derive a regret bound, we first restrict our attention to the fixed

design setting. Here, we condition on X^1, Z^1, . . . , X^n, Z^n and bound f̂(x, z) around its

expectation. To simplify notation, we write X to denote (X^1, . . . , X^n) and Z to denote

(Z^1, . . . , Z^n). Note that by the honesty assumption, in this setting, f̂ is a simple sum of

independent random variables. Applying Bernstein's inequality (see, for example, Boucheron

et al. [2013]), we have, for δ ∈ (0, 1),

P( E[f̂(x, z) ∣ X, Z] − f̂(x, z) ≤ (2/(3γn)) ln(1/δ) + √(2V(x, z) ln(1/δ)) ∣ X, Z ) ≥ 1 − δ.

Next, we need to bound the difference between E[f(x, z)∣X,Z] and f(x, z). By the honesty

193

Page 194: Predictive and Prescriptive Methods in Operations Research

assumption, Jensen’s inequality, and the Lipschitz assumption, we have

∣E[f(x, z) ∣X,Z] − f(x, z)∣ = ∣∑i

wi(x, z)(f(X i, Zi) − f(x, z))∣

≤∑i

wi(x, z)∣f(X i, Zi) − f(x, z)∣

≤ L∑i

wi(x, z)∣∣(X i, Zi) − (x, z)∣∣

= L ⋅B(x, z).

Combining this with the previous result, we have, with probability at least 1−δ (conditioned

on X and Z),

f(x, z) − f(x, z) ≤ 2

3γnln(1/δ) +

√2V (x, z) ln(1/δ) +L ⋅B(x, z) (A.6)

Next, we extend this result to hold uniformly over all $z \in \mathcal{Z}$. To do so, we partition $\mathcal{X} \times \mathcal{Z}$ into $\Gamma_n$ regions as in Assumption 3. For each region, we construct a $\nu$-net. Therefore, we have a set $z^1, \dots, z^{K_n}$ such that for any $z \in \mathcal{Z}$, there exists a $z^k$ such that $(x,z)$ and $(x,z^k)$ are contained in the same region with $\Vert z - z^k \Vert \le \nu$. For ease of notation, let $k : \mathcal{Z} \to \{1, \dots, K_n\}$ return an index that satisfies these criteria. By assumption, $\mathcal{Z} \subset \mathbb{R}^{d_z}$ has finite diameter $D$, so we can construct this set with $K_n \le \Gamma_n (3D/\nu)^{d_z}$ (e.g., Shalev-Shwartz and Ben-David [2014, pg. 337]).

By Lemma 1 (and using the notation therein), we have
\[
\Psi(z, \delta) \le \Psi(z^{k(z)}, \delta) + \nu \left( \alpha\left( LD + 1 + \sqrt{2\ln 1/\delta} \right) + L\left( \sqrt{2\ln 1/\delta} + 3 \right) \right).
\]
Taking the supremum over $z$ of both sides, we get
\[
\sup_z \Psi(z, \delta) \le \max_k \Psi(z^k, \delta) + \nu \left( \alpha\left( LD + 1 + \sqrt{2\ln 1/\delta} \right) + L\left( \sqrt{2\ln 1/\delta} + 3 \right) \right).
\]
If we let $\nu = \frac{1}{3\gamma_n} \left( \alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3) \right)^{-1}$, we have
\begin{align*}
P\left( \sup_z \Psi(z, \delta) > 0 \,\Big\vert\, X, Z \right)
&\le P\left( \max_k \Psi(z^k, \delta) + \nu \left( \alpha\left( LD + 1 + \sqrt{2\ln 1/\delta} \right) + L\left( \sqrt{2\ln 1/\delta} + 3 \right) \right) > 0 \,\Big\vert\, X, Z \right) \\
&\le P\left( \max_k \Psi(z^k, \delta) + \nu \left( \alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3) \right) \ln 1/\delta > 0 \,\Big\vert\, X, Z \right) \\
&\le \sum_k P\left( \Psi(z^k, \delta) + \frac{\ln 1/\delta}{3\gamma_n} > 0 \,\Big\vert\, X, Z \right) \\
&\le \sum_k P\left( \Psi(z^k, \sqrt{\delta}) > 0 \,\Big\vert\, X, Z \right) \\
&\le K_n \sqrt{\delta},
\end{align*}
where we have used the union bound and (A.6). Replacing $\delta$ with $\delta^2/K_n^2$ and integrating both sides to remove the conditioning completes the proof.

Proof of Theorem 2. By Theorem 1, with probability at least $1 - \delta/2$,
\begin{align*}
f(x, \hat{z}) &\le \hat{f}(x, \hat{z}) + \frac{4}{3\gamma_n}\ln(2K_n/\delta) + \lambda_1\sqrt{V(x, \hat{z})} + \lambda_2 B(x, \hat{z}) \\
&\le \hat{f}(x, z^*) + \frac{4}{3\gamma_n}\ln(2K_n/\delta) + \lambda_1\sqrt{V(x, z^*)} + \lambda_2 B(x, z^*),
\end{align*}
where the second inequality follows from the definition of $\hat{z}$. Using the same argument we used to derive (A.6), since $z^*$ is not a random quantity, we have, with probability at least $1 - \delta/2$,
\begin{align*}
\hat{f}(x, z^*) - f(x, z^*) &\le \frac{2}{3\gamma_n}\ln(2/\delta) + \sqrt{2V(x, z^*)\ln(2/\delta)} + L \cdot B(x, z^*) \\
&\le \frac{2}{3\gamma_n}\ln(2K_n/\delta) + \lambda_1\sqrt{V(x, z^*)} + \lambda_2 B(x, z^*).
\end{align*}
Combining the two inequalities with the union bound yields the desired result.

Proof of Corollary 1. We show $f(x, \hat{z}) - 2L B(x, z^*) \to_p f(x, z^*)$. The desired result follows from the assumption regarding $B(x, z^*)$ and Slutsky's theorem. First, we note, due to the assumption $\vert c(z;y) \vert \le 1$,
\[
V(x, z^*) = \sum_i w_i^2(x, z^*)\operatorname{Var}(c(z^*; Y^i) \mid X^i, Z^i) \le \frac{1}{\gamma_n}\sum_i w_i(x, z^*) = \frac{1}{\gamma_n}.
\]
We have, for any $\varepsilon > 0$,
\begin{align*}
P\left( \left\vert f(x, \hat{z}) - 2L B(x, z^*) - f(x, z^*) \right\vert > \varepsilon \right)
&\le P\left( f(x, \hat{z}) - 2L B(x, z^*) - f(x, z^*) > \varepsilon/2 \right) \\
&\quad + P\left( f(x, z^*) - f(x, \hat{z}) + 2L B(x, z^*) > \varepsilon/2 \right).
\end{align*}
By Theorem 2, for large enough $n$, the first term is upper bounded by
\begin{align*}
2K_n \exp\left( -\frac{\varepsilon^2}{4\left( 2/\gamma_n + 4\sqrt{V(x, z^*)} \right)^2} \right)
&\le 2K_n \exp\left( -\frac{\varepsilon^2}{4\left( 2/\sqrt{\gamma_n} + 4/\sqrt{\gamma_n} \right)^2} \right) \\
&= 2\Gamma_n \left( 9D\gamma_n \left( \alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3) \right) \right)^{d_z} \exp\left( -\frac{\gamma_n \varepsilon^2}{144} \right) \\
&\le C_1 n^{1+\beta} \exp\left( -C_2 n^\beta \right) \to 0.
\end{align*}
Because $f(x, z^*) \le f(x, \hat{z})$, the latter term is upper bounded by
\[
P\left( B(x, z^*) > \varepsilon/4L \right) \to 0.
\]

Proof of Example 1. First, we consider the case in which the zero-variance action has cost 0 and the other actions have cost 1 (call this event $A$). Because the cost of the optimal action is 0 and the cost of a suboptimal action is 1, the expected regret in this problem equals the probability of the algorithm selecting a suboptimal action. Noting that $\hat{f}(j) \sim \mathcal{N}(1, 1/m)$ for $j = 1, \dots, m$, we can express the expected regret of the $\mu = 0$ algorithm as
\[
\mathbb{E}[R_0 \mid A] = P\left( \hat{f}(j) < 0 \text{ for some } j \in \{1, \dots, m\} \mid A \right) = P\left( \max_j W_j > \sqrt{m} \right),
\]
where $W_1, \dots, W_m$ are i.i.d. standard normal random variables. Similarly, the expected regret of the $\mu > 0$ algorithm can be expressed as
\begin{align*}
\mathbb{E}[R_\mu \mid A] &= P\left( \hat{f}(j) < -\frac{\lambda\sqrt{\ln m}}{\sqrt{m}} \text{ for some } j \in \{1, \dots, m\} \,\Big\vert\, A \right) \\
&= P\left( \max_j W_j > \sqrt{m} + \lambda\sqrt{\ln m} \right).
\end{align*}
We can construct an upper bound on $\mathbb{E}[R_\mu \mid A]$ with the union bound and a concentration inequality (as in the proof of Theorem 1). Applying the Gaussian tail inequality (see, for example, Vershynin [2016, Proposition 2.1.2]), we have
\begin{align*}
\mathbb{E}[R_\mu \mid A] &\le m P\left( W_1 > \sqrt{m} + \lambda\sqrt{\ln m} \right) \\
&\le \frac{\sqrt{m}}{\sqrt{2\pi}}\exp\left( -\frac{1}{2}\left( \sqrt{m} + \lambda\sqrt{\ln m} \right)^2 \right) \\
&= \frac{\sqrt{m}}{m^{\lambda^2/2}\sqrt{2\pi}}\exp(-m/2)\exp\left( -\lambda\sqrt{m\ln m} \right) \\
&\le \frac{1}{\sqrt{m}\sqrt{2\pi}}e^{-m/2},
\end{align*}
where we have used the assumption $\lambda \ge \sqrt{2}$.

To lower bound the expected regret of the $\mu = 0$ algorithm, we can use a similar Gaussian tail inequality:
\begin{align*}
\mathbb{E}[R_0 \mid A] &= 1 - \left[ 1 - P\left( W_1 > \sqrt{m} \right) \right]^m \\
&\ge 1 - \left[ 1 - \left( 1 - \frac{1}{m} \right)\frac{1}{\sqrt{m}\sqrt{2\pi}}e^{-m/2} \right]^m \\
&\ge 1 - \left[ 1 - \frac{1}{2\sqrt{m}\sqrt{2\pi}}e^{-m/2} \right]^m \\
&\ge 1 - \left[ \left[ 1 - \frac{1}{2\sqrt{m}\sqrt{2\pi}}e^{-m/2} \right]^{2\sqrt{2\pi}\sqrt{m}\exp(m/2)} \right]^{\sqrt{m}\exp(-m/2)/2\sqrt{2\pi}},
\end{align*}
where the second inequality is valid for all $m \ge 2$. One can verify that $(1 - 1/n)^n$ is a monotonically increasing function that converges to $e^{-1}$. Therefore, for all $m \ge 2$,
\[
\mathbb{E}[R_0 \mid A] \ge 1 - \exp\left( -\frac{\sqrt{m}}{2\sqrt{2\pi}}\exp(-m/2) \right).
\]
Next, we use these bounds to compute the ratio $\mathbb{E}[R_\mu \mid A]/\mathbb{E}[R_0 \mid A]$ in the limit as $m \to \infty$:
\[
\frac{\mathbb{E}[R_\mu \mid A]}{\mathbb{E}[R_0 \mid A]} \le \frac{\frac{1}{\sqrt{m}\sqrt{2\pi}}e^{-m/2}}{1 - \exp\left( -\frac{\sqrt{m}}{2\sqrt{2\pi}}\exp(-m/2) \right)}.
\]
Applying L'Hôpital's rule, the limit of the right hand side is equal to the limit of
\[
\frac{2(2\pi)^{-1/2}\left( -m^{-3/2}e^{-m/2} - m^{-1/2}e^{-m/2} \right)}{(2\pi)^{-1/2}\left[ m^{-1/2}e^{-m/2} - m^{1/2}e^{-m/2} \right]\exp\left( -\frac{\sqrt{m}}{2\sqrt{2\pi}}e^{-m/2} \right)}
= 2\,\frac{-1 - m}{m - m^2}\cdot\exp\left( \frac{\sqrt{m}}{2\sqrt{2\pi}}e^{-m/2} \right) \to 0.
\]

Next, we consider the case that the zero-variance action has cost 1 and the other actions have cost 0. The expected regret equals the probability that the zero-variance action is selected. For sufficiently large $m$,
\begin{align*}
\mathbb{E}[R_\mu \mid A^c] &= P\left( \hat{f}(j) > 1 - \frac{\lambda\sqrt{\ln m}}{\sqrt{m}} \ \forall j \in \{1, \dots, m\} \,\Big\vert\, A^c \right) \\
&\le P\left( W_1 > \sqrt{m} - \lambda\sqrt{\ln m} \right)^m \\
&\le P\left( W_1 > \sqrt{m}/2 \right)^m \\
&\le \left( \frac{2}{\sqrt{2\pi}}e^{-m/8} \right)^m \\
&\le e^{-m^2/8} = o\left( \mathbb{E}[R_\mu \mid A] \right).
\end{align*}
Therefore, for sufficiently large $m$ and some constant $C$,
\[
\frac{\mathbb{E}[R_\mu]}{\mathbb{E}[R_0]} = \frac{\mathbb{E}[R_\mu \mid A] + \mathbb{E}[R_\mu \mid A^c]}{\mathbb{E}[R_0 \mid A] + \mathbb{E}[R_0 \mid A^c]} \le \frac{\mathbb{E}[R_\mu \mid A] + \mathbb{E}[R_\mu \mid A^c]}{\mathbb{E}[R_0 \mid A]} \le (1 + C)\frac{\mathbb{E}[R_\mu \mid A]}{\mathbb{E}[R_0 \mid A]} \to 0.
\]
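The vanishing ratio in this proof can be checked numerically using exact Gaussian tail probabilities. The sketch below is ours, not the thesis's: `gauss_tail` and `regret_ratio` are hypothetical helper names, and it evaluates the exact expression for $\mathbb{E}[R_0 \mid A]$ together with the union bound on $\mathbb{E}[R_\mu \mid A]$ with $\lambda = \sqrt{2}$.

```python
import math

def gauss_tail(t):
    # P(W > t) for a standard normal W, via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def regret_ratio(m, lam=math.sqrt(2.0)):
    # Exact E[R_0 | A]: probability that at least one of m estimates falls below 0
    r0 = 1.0 - (1.0 - gauss_tail(math.sqrt(m))) ** m
    # Union bound on E[R_mu | A] with threshold -lam * sqrt(ln m) / sqrt(m)
    rmu = m * gauss_tail(math.sqrt(m) + lam * math.sqrt(math.log(m)))
    return rmu / r0

r10, r50 = regret_ratio(10), regret_ratio(50)
```

Already at $m = 10$ the ratio is tiny, and it shrinks rapidly as $m$ grows, in line with the limit computed above.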

A.3 Optimization with Linear Predictive Models

Here, we detail the optimization of (2) with linear predictive models. We focus on the case that $c(z;Y) = Y$ for simplicity. For these models, we posit that the outcome is a linear function of the auxiliary covariates and decision. That is, there exists a $\beta$ such that, given $X = x$, $Y(z) = (x,z)^T\beta + \varepsilon$, where $\varepsilon$ is a mean 0 subgaussian noise term with variance $\sigma^2$. If we let $A$ denote the design matrix for the problem, a matrix with rows consisting of $(X^i, Z^i)$ for $i = 1, \dots, n$, then the ordinary least squares (OLS) estimator for $\beta$ is given by
\[
\hat{\beta}_{OLS} = (A^TA)^{-1}A^TY.
\]
The ordinary least squares estimator is unbiased, so when solving (2), we set $\lambda_2 = 0$. The variance of $(x,z)^T\hat{\beta}_{OLS}$ is given by $\sigma^2(x,z)^T(A^TA)^{-1}(x,z)$. Because $(A^TA)^{-1}$ is a positive semidefinite matrix, $\sqrt{V(x,z)}$ is convex. Therefore, (2) becomes
\[
\min_{z \in \mathcal{Z}} \ (x,z)^T\hat{\beta}_{OLS} + \lambda_1\sigma\sqrt{(x,z)^T(A^TA)^{-1}(x,z)},
\]
which is a second-order conic optimization problem if $\mathcal{Z}$ is polyhedral and can be solved efficiently by commercial solvers. Even if $\mathcal{Z}$ is a mixed-integer set, commercial solvers such as Gurobi [Gurobi] can still solve the problem for sizes of practical interest.
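As an illustration of this formulation (a sketch, not the thesis's code; the synthetic data, the box-shaped $\mathcal{Z}$, and the parameter values are our assumptions), the regularized objective can be evaluated directly. For a one-dimensional decision a grid search suffices; a second-order cone solver would be used for general $\mathcal{Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dz = 200, 2, 1
A = rng.normal(size=(n, dx + dz))             # design matrix with rows (X^i, Z^i)
beta_true = np.array([1.0, -2.0, 0.5])
Y = A @ beta_true + rng.normal(scale=0.1, size=n)

beta_ols = np.linalg.solve(A.T @ A, A.T @ Y)  # (A'A)^{-1} A'Y
G = np.linalg.inv(A.T @ A)                    # prediction-variance factor
sigma, lam1 = 0.1, 2.0
x = np.array([0.5, -0.5])                     # observed auxiliary covariates

def objective(z):
    # (x, z)' beta_ols + lam1 * sigma * sqrt((x, z)' (A'A)^{-1} (x, z))
    w = np.concatenate([x, np.atleast_1d(z)])
    return w @ beta_ols + lam1 * sigma * np.sqrt(w @ G @ w)

# grid search over the box Z = [-1, 1]
zs = np.linspace(-1.0, 1.0, 201)
z_star = zs[np.argmin([objective(z) for z in zs])]
```

The variance penalty makes the objective a convex function of $z$, so the grid minimizer approximates the conic optimum on this box.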

For regularized linear models such as ridge and lasso regression, we use a similar approach. Although these estimators are biased, we set $\lambda_2 = 0$ for computational reasons. The ridge estimator for $\beta$ has a form similar to the OLS estimator:
\[
\hat{\beta}_{Ridge} = (A^TA + \alpha I)^{-1}A^TY,
\]
for some $\alpha \ge 0$. The resulting optimization problem is essentially the same as with the OLS estimator. The lasso estimator does not have a closed form solution, but we can approximate it as in Tibshirani [1996]:
\[
P\hat{\beta}_{Lasso} \approx (PA^TAP^T + \alpha PWP^T)^{-1}PA^TY,
\]
where $W = \operatorname{diag}(1/\vert\beta^*_1\vert, \dots, 1/\vert\beta^*_{d+p}\vert)$, $\beta^*$ is the true lasso solution, and $P$ is a projection matrix that projects onto the nonzero components of $\beta^*$. (The zero components of $\beta^*$ remain 0 in the approximation.) With this approximation, the resulting optimization problem takes the same form as those for the OLS and ridge estimators.
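Because the ridge estimator shares the OLS closed form, both can be computed with one helper; a minimal sketch (our variable names, synthetic data) showing that $\alpha = 0$ recovers OLS and that $\alpha > 0$ shrinks the coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 5))
Y = A @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + rng.normal(scale=0.05, size=100)

def ridge(A, Y, alpha):
    # (A'A + alpha I)^{-1} A'Y; alpha = 0 gives the OLS estimator
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(d), A.T @ Y)

beta_ols = ridge(A, Y, 0.0)
beta_r = ridge(A, Y, 10.0)   # shrinkage toward zero
```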


A.4 Data Generation

A.4.1 Portfolio Optimization

In this example, we simulate the uncertainty as following a normal distribution, given by
\[
y(x) \sim \mathcal{N}\left( \mu_y + 0.1(x_1 - 1000)\cdot\mathbf{1}_6 + 1000\cdot x_2\cdot\mathbf{1}_6 + 10\cdot\log(x_3 + 1)\cdot\mathbf{1}_6, \ \Sigma_y \right),
\]
where $\mathbf{1}_6$ represents a vector of ones in $\mathbb{R}^6$, and the mean vector and covariance matrix $\mu_y, \Sigma_y$ are given by
\[
\mu_y = \begin{pmatrix} 86.8625 & 71.6059 & 75.3759 & 97.6258 & 52.7854 & 84.8973 \end{pmatrix}^T,
\]
\[
\Sigma_Y^{1/2} = \begin{pmatrix}
136.687 & * & * & * & * & * \\
8.79766 & 142.279 & * & * & * & * \\
16.1504 & 15.0637 & 122.613 & * & * & * \\
18.4944 & 15.6961 & 26.344 & 139.148 & * & * \\
3.41394 & 16.5922 & 14.8795 & 13.9914 & 151.732 & * \\
24.8156 & 18.7292 & 17.1574 & 6.36536 & 24.7703 & 144.672
\end{pmatrix}.
\]
Finally, each of the three covariates is distributed as
\begin{align*}
x_1 &\sim \mathcal{N}(1000, 50), \\
x_2 &\sim \mathcal{N}(0.02, 0.01), \\
\log(x_3) &\sim \mathcal{N}(0, 1).
\end{align*}
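This generation process can be sketched as follows (not the thesis's code; it assumes the second parameter of each $\mathcal{N}(\cdot,\cdot)$ is a variance, treats the $*$ entries of $\Sigma_Y^{1/2}$ as zeros of a lower-triangular square root, and uses our own function names):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_y = np.array([86.8625, 71.6059, 75.3759, 97.6258, 52.7854, 84.8973])
S = np.array([  # lower-triangular square root of Sigma_y, entries from the text
    [136.687, 0, 0, 0, 0, 0],
    [8.79766, 142.279, 0, 0, 0, 0],
    [16.1504, 15.0637, 122.613, 0, 0, 0],
    [18.4944, 15.6961, 26.344, 139.148, 0, 0],
    [3.41394, 16.5922, 14.8795, 13.9914, 151.732, 0],
    [24.8156, 18.7292, 17.1574, 6.36536, 24.7703, 144.672],
])

def sample_portfolio(n):
    x1 = rng.normal(1000, np.sqrt(50), size=n)    # assuming N(mean, variance)
    x2 = rng.normal(0.02, np.sqrt(0.01), size=n)
    x3 = np.exp(rng.normal(0, 1, size=n))         # log(x3) ~ N(0, 1)
    shift = 0.1 * (x1 - 1000) + 1000 * x2 + 10 * np.log(x3 + 1)
    mean = mu_y + shift[:, None]                  # shift applied to every component
    return mean + rng.normal(size=(n, 6)) @ S.T   # correlated Gaussian noise

Y = sample_portfolio(5)
```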

A.4.2 Newsvendor problem

In this example, for each week $t$, client $r$, and item $i$, we include as product features the past demands of product $i$ at client $r$ at times $t-1, \dots, t-5$; physical product characteristics such as weight and number of pieces per unit; and indicator variables encoding whether the product name contains strings such as "whole grain", "fiber", "multigrain", "chocolate", "vanilla", and "burrito". We also include client-specific information as features: aggregate demand over all products sold at client $r$ at times $t-1, \dots, t-5$, and indicator variables for the client type among the following categories: Walmart, Individual store, General Market, Supermarket, Small franchise, or NA/Other. The final data set includes twenty-five covariates.

A.4.3 Pricing

For our synthetic pricing example, we consider a store offering 5 products. We generate auxiliary covariates, $X^i$, from a $\mathcal{N}(10,1)$ distribution. We generate historical prices, $Z^i$, from a Gaussian distribution,
\[
\mathcal{N}\left( \begin{pmatrix}
1 & 0 \\
1 & 0 \\
0 & 1 \\
0 & 1 \\
0.5 & 0.5
\end{pmatrix} X^i, \ 100 I \right).
\]
We compute the expected demand for each product as
\[
\mu = \begin{pmatrix}
500 - (Z^i_1)^2/10 - X^i_1\cdot Z^i_1/10 - (X^i_1)^2/10 - Z^i_2 \\
500 - (Z^i_2)^2/10 - X^i_1\cdot Z^i_2/10 - (X^i_1)^2/10 - Z^i_1 \\
500 - (Z^i_3)^2/10 - X^i_2\cdot Z^i_3/10 - (X^i_2)^2/10 + Z^i_1 + Z^i_2 \\
500 - (Z^i_4)^2/10 - X^i_2\cdot Z^i_4/10 - (X^i_2)^2/10 + Z^i_1 + Z^i_2 \\
500 - (Z^i_5)^2/10 - X^i_2\cdot Z^i_5/20 - X^i_1\cdot Z^i_5/20 - (X^i_2)^2/10
\end{pmatrix},
\]
and generate $Y^i$ from a $\mathcal{N}(\mu, 2500 I)$ distribution. This example serves to simulate the situation in which some products are complements and some are substitutes.
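A sketch of this generator (our code, not the thesis's; it assumes $X^i \in \mathbb{R}^2$ with independent $\mathcal{N}(10,1)$ components, so the $5 \times 2$ loading matrix maps covariates to the five price means):

```python
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0.5, 0.5]])  # price-mean loadings

def sample_pricing(n):
    X = rng.normal(10, 1, size=(n, 2))               # auxiliary covariates
    Z = X @ B.T + rng.normal(0, 10, size=(n, 5))     # historical prices, var 100
    x1, x2 = X[:, 0], X[:, 1]
    mu = np.column_stack([                           # expected demands per product
        500 - Z[:, 0]**2/10 - x1*Z[:, 0]/10 - x1**2/10 - Z[:, 1],
        500 - Z[:, 1]**2/10 - x1*Z[:, 1]/10 - x1**2/10 - Z[:, 0],
        500 - Z[:, 2]**2/10 - x2*Z[:, 2]/10 - x2**2/10 + Z[:, 0] + Z[:, 1],
        500 - Z[:, 3]**2/10 - x2*Z[:, 3]/10 - x2**2/10 + Z[:, 0] + Z[:, 1],
        500 - Z[:, 4]**2/10 - x2*Z[:, 4]/20 - x1*Z[:, 4]/20 - x2**2/10,
    ])
    Y = mu + rng.normal(0, 50, size=(n, 5))          # N(mu, 2500 I)
    return X, Z, Y

X, Z, Y = sample_pricing(10)
```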


A.4.4 Warfarin Dosing

To simulate how physicians might assign Warfarin doses to patients, we compute a normalized BMI for each patient (i.e., body mass divided by height squared, normalized by the population standard deviation of BMI). For each patient, we then sample a dose (in mg/week), $Z^i$, from
\[
Z^i \sim \mathcal{N}\left( 30 + 15\cdot\mathrm{BMI}^i, \ 64 \right).
\]
If $Z^i$ is negative, we assign a dose drawn uniformly from $[0, 20]$. If the data does not contain the patient's height and/or weight, we assign a dose drawn uniformly from $[10, 50]$, a standard range for Warfarin doses.

To simulate the response that a physician observes for a particular patient, we compute the difference between the assigned dose and the true optimal dose for that patient, $Z^{i*}$, and add noise. We then cap the response so it is less than or equal to 40 in absolute value. The reasoning behind this construction is that the INR measurement gives the physician some idea of whether the assigned dose is too high or too low and whether it is close to the optimal dose. However, if the dose is very far from optimal, then the information INR provides is not very useful in determining the optimal dose (it is purely directional). The response of patient $i$ is given by
\[
Y^i = \begin{cases}
-40, & R^i < -40, \\
R^i, & -40 \le R^i \le 40, \\
40, & R^i > 40,
\end{cases}
\]
where $R^i \sim \mathcal{N}(Z^i - Z^{i*}, 400)$.
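The dose assignment and capped response above can be sketched as follows (our function names and the example optimal dose are illustrative; the normal distributions are parameterized by standard deviations $\sqrt{64} = 8$ and $\sqrt{400} = 20$):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dose(bmi_normalized):
    # dose in mg/week; fall back to Uniform[0, 20] if the draw is negative
    z = rng.normal(30 + 15 * bmi_normalized, 8)
    return rng.uniform(0, 20) if z < 0 else z

def simulate_response(z, z_opt):
    # noisy signal of the distance to the optimal dose, capped at +/- 40
    r = rng.normal(z - z_opt, 20)
    return float(np.clip(r, -40, 40))

responses = [simulate_response(simulate_dose(b), 35.0)
             for b in rng.normal(0, 1, 100)]
```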


Appendix B

Supplement for Chapter 5

B.1 Heuristic for generating upper bounds

In this section, we explain the heuristics referenced in Section 3.3 in greater detail. Recall the smoothed function given by
\[
f_\tau(\theta, \xi) = \max_{\lambda \ge 0} \ \frac{1}{2}\Vert y - \theta \Vert^2 + \lambda'\left( A_T\theta + \sum_{i=1}^n B_{T,i}\xi_i \right) + \rho\sum_{i=1}^n \Vert \xi_i \Vert_1 - \frac{\tau}{2}\Vert\lambda\Vert_2^2. \tag{B.1}
\]
The gradient of $f_\tau$ is Lipschitz continuous with parameter $\ell$, where $\ell = \lambda_{\max}(M'M)/\tau$ [Nesterov, 2005]. The matrix $M \in \mathbb{R}^{m \times (n + nd)}$, where $m$ is the number of rows of $A_T$ (the number of active equality constraints), is given by
\[
M = \left[ A_T \ \ B_{T,1} \ \dots \ B_{T,n} \right].
\]
Now, the optimal $\lambda^*_\tau$ can be computed by
\[
\lambda^*_\tau = \frac{1}{\tau}\left( A_T\theta + \sum_{i=1}^n B_{T,i}\xi_i \right)_+. \tag{B.2}
\]
Let $\Theta$ be the combined vector of $\theta$ and $\xi_1, \dots, \xi_n$. As mentioned previously, we minimize the upper convex quadratic envelope, which is a majorizer of the smoothed function, given by

\[
g_\tau(S, \Theta) = \frac{1}{2}\Vert y - \theta \Vert^2 + h_\tau(\Theta^{(t)}) + \langle \nabla h_\tau(\Theta^{(t)}), \Theta - \Theta^{(t)} \rangle + \frac{\ell}{2}\Vert \Theta - \Theta^{(t)} \Vert^2,
\]
where $h_\tau(\Theta)$ is given by
\begin{align*}
h_\tau(\Theta) &= \left\langle \lambda^*_\tau, \ A_T\theta + \sum_{i=1}^n B_{T,i}\xi_i \right\rangle - \frac{\tau}{2}\Vert\lambda^*_\tau\Vert_2^2 \\
&= \frac{1}{2\tau}\left\Vert \left( A_T\theta + \sum_{i=1}^n B_{T,i}\xi_i \right)_+ \right\Vert_2^2.
\end{align*}
Thus, the problem we aim to solve now is
\[
\begin{aligned}
\min_{S, \Theta} \quad & \frac{1}{2}\Vert y - \theta \Vert^2 + h_\tau(\Theta^{(t)}) + \langle \nabla h_\tau(\Theta^{(t)}), \Theta - \Theta^{(t)} \rangle + \frac{\ell}{2}\Vert \Theta - \Theta^{(t)} \Vert^2 \\
\text{subject to} \quad & \operatorname{Supp}(\xi_i) \subseteq S \ \forall i, \\
& \vert S \vert \le k, \\
& \Theta' = [\theta', \xi'_1, \dots, \xi'_n].
\end{aligned} \tag{B.3}
\]
After using the expressions for the gradient term and some algebra, the update scheme takes the following form:
\[
\begin{aligned}
\min_{S, \theta, \xi_1, \dots, \xi_n} \quad & \frac{\ell + 1}{2\ell}\Vert \theta - u \Vert^2 + \frac{1}{2}\sum_{j=1}^n \Vert \xi_j - v_j \Vert^2 \\
\text{subject to} \quad & \operatorname{Supp}(\xi_i) \subseteq S \ \forall i, \\
& \vert S \vert \le k,
\end{aligned} \tag{B.4}
\]
where the vectors $u$ and $v_j$, $1 \le j \le n$, are given by
\begin{align*}
u &= \frac{\ell}{\ell + 1}\theta^{(t)} - \frac{1}{\ell + 1}\left( A'_T\lambda^*_\tau - y \right), \\
v_j &= \xi^{(t)}_j - \frac{1}{\ell}\left( B'_{T,j}\lambda^*_\tau \right) \quad \forall 1 \le j \le n.
\end{align*}
The solution to this problem is simply to set $S$ to be the set of indices of the $k$ largest elements of
\[
\sum_{j=1}^n (v_j)^2_1, \ \dots, \ \sum_{j=1}^n (v_j)^2_d.
\]
We denote this complete procedure as
\[
(\theta^{(t+1)}, \xi^{(t+1)}) = H_{\tau k}(\theta^{(t)}, \xi^{(t)}).
\]
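The support-selection step of $H_{\tau k}$ can be sketched as follows (a minimal illustration with our own names): score each coordinate $i$ by $\sum_j (v_j)_i^2$, keep the $k$ highest-scoring coordinates, and zero out every $v_j$ on the complement.

```python
import numpy as np

def hard_threshold_support(V, k):
    """V: n x d matrix whose rows are the vectors v_j.
    Returns the selected support and the projected matrix."""
    scores = (V ** 2).sum(axis=0)                 # sum_j (v_j)_i^2 per coordinate i
    support = np.sort(np.argsort(scores)[-k:])    # indices of the k largest scores
    P = np.zeros_like(V)
    P[:, support] = V[:, support]                 # project each v_j onto the support
    return support, P

V = np.array([[3.0, 0.1, -2.0, 0.0],
              [1.0, 0.2,  2.0, 0.0]])
S, Xi = hard_threshold_support(V, 2)
```

Here the column scores are $(10, 0.05, 8, 0)$, so coordinates 0 and 2 are retained.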


Thus, we now arrive at our algorithms. Algorithm 10 presents the iterative thresholding

heuristic applied to the function gτ .

Algorithm 10 Hard thresholding algorithm for warm starts
Input: $(\theta^{(0)}, \xi^{(0)})$, with active constraint set $T^{(0)}$, tolerance $TOL > 0$, and iteration limit $MAX\_ITER$.
Output: an improved sparse feasible solution $(\theta^*, \xi^*_1, \dots, \xi^*_n)$ to Problem (29).
1: Set $t \leftarrow 0$.
2: while $t \le MAX\_ITER$ do
3:   Compute the dual vector $\lambda$ using equation (B.2).
4:   Using this dual value, compute the vectors $u, v_1, \dots, v_n$.
5:   Perform the thresholding $(\theta^{(t+1)}, \xi^{(t+1)}) = H_{\tau k}(\theta^{(t)}, \xi^{(t)})$.
6:   Terminate if $g_\tau(\theta^{(t)}, \xi^{(t)}) - g_\tau(\theta^{(t+1)}, \xi^{(t+1)}) \le TOL$.
7:   $t \leftarrow t + 1$
8: end while

Clearly, $\tau$ is a parameter that controls the degree of smoothness of the approximation, and as a heuristic we decrease it iteratively. Algorithm 11 presents a scheme in which $\tau$ is reduced by a factor of $\gamma$ at each iteration, combined with Algorithm 10.

Algorithm 11 Varying $\tau$ for iterative hard thresholding on smooth $g_\tau(\cdot)$
Input: $\theta^{(0)} \in \mathbb{R}^n$, $\xi^{(0)} \in \mathbb{R}^{n \times d}$, with active constraint set $T^{(0)}$, threshold $\tau_{MIN} > 0$.
Output: a sparse solution $(\theta^*, \xi^*_1, \dots, \xi^*_n)$ to Problem (29).
1: while $\tau > \tau_{MIN}$ do
2:   Apply Algorithm 10 to the smooth function $g_\tau(\theta_0, \xi_0)$. Let $(\theta^*_\tau, \xi^*_\tau)$ be the limiting solution.
3:   Decrease $\tau \leftarrow \gamma\tau$, for some damping factor $0 < \gamma < 1$.
4:   Go back to Step 2 with $(\theta_0, \xi_0) = (\theta^*_\tau, \xi^*_\tau)$.
5: end while
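The outer loop of this scheme is geometric annealing of $\tau$; a schematic skeleton (the inner solver is abstracted away, and all names and defaults below are our own):

```python
def anneal_tau(inner_solve, theta0, xi0, tau0=1.0, tau_min=1e-4, gamma=0.5):
    # Repeatedly solve the smoothed problem, warm-starting from the previous
    # solution while shrinking the smoothing parameter tau by a factor gamma.
    theta, xi, tau = theta0, xi0, tau0
    while tau > tau_min:
        theta, xi = inner_solve(theta, xi, tau)   # e.g., Algorithm 10 on g_tau
        tau *= gamma
    return theta, xi

# trivial stand-in inner solver, for illustration only
theta, xi = anneal_tau(lambda t, x, tau: (t, x), 0.0, 0.0)
```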


B.1.1 Heuristics for norm bounded subgradients

In this section, we emphasize that this heuristic can also be adapted to generate fast feasible solutions for the sparse regression problem with Lipschitz-bounded subgradients. To be precise, the full problem we consider is as follows:
\[
\begin{aligned}
\min_{\theta, \{\xi_i\}_{i=1}^n, S} \quad & \sum_{i=1}^n (y_i - \theta_i)^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T(x_j - x_i) \le \theta_j \ \forall i, j, \\
& \operatorname{Supp}(\xi_i) \subseteq S \ \forall i, \\
& \Vert \xi_i \Vert \le L \ \forall i, \\
& \theta \in \mathbb{R}^n, \ \xi_i \in \mathbb{R}^d \ \forall i, \\
& \vert S \vert \le k, \ S \subseteq \{1, \dots, n\}.
\end{aligned} \tag{B.5}
\]

Note that Nesterov smoothing [Nesterov, 2005] cannot be directly applied to this problem, due to the conic constraints. We resort to conic duality and, using ideas from Becker et al. [2011] and Nesterov smoothing, obtain the following smoothed objective function:
\[
f_\tau(\theta, \xi) = \max_{\lambda \ge 0, \mu_1, \dots, \mu_n} \ \frac{1}{2}\Vert y - \theta \Vert^2 + \lambda'\left( A_T\theta + \sum_{i=1}^n B_{T,i}\xi_i \right) + \sum_{j=1}^n \left( \mu'_j\xi_j - L\Vert\mu_j\Vert_* \right) - \frac{\tau}{2}\left( \Vert\lambda\Vert_2^2 + \sum_{j=1}^n \Vert\mu_j\Vert_2^2 \right),
\]
where $\lambda, \mu_1, \dots, \mu_n$ are the dual variables, and $\Vert\cdot\Vert_*$ is the dual norm of $\Vert\cdot\Vert$. To compute $\mu^*_{j,\tau}$ for any $1 \le j \le n$, we need to solve the following problem efficiently:
\[
\mu^*_{j,\tau} \in \operatorname*{argmin}_{\mu_j} \ -\mu'_j\xi_j + \frac{\tau}{2}\Vert\mu_j\Vert_2^2 + L\Vert\mu_j\Vert_*.
\]

$\ell_2$ norm bound

When the constraints are on the $\ell_2$ norm of $\xi_i$, as the $\ell_2$ norm is dual to itself, we need to solve the problem given by
\[
\mu^*_{j,\tau} \in \operatorname*{argmin}_{\mu_j} \ -\mu'_j\xi_j + \frac{\tau}{2}\Vert\mu_j\Vert_2^2 + L\Vert\mu_j\Vert_2.
\]
The solution to this problem can be computed analytically as
\[
\mu^*_{j,\tau} = \operatorname{Shrink}\left( \frac{1}{\tau}\xi_j, \ \frac{L}{\tau} \right),
\]
where $\operatorname{Shrink}$ is an $\ell_2$ shrinkage operation given by
\[
\operatorname{Shrink}(u, \gamma) = \max\left\{ 1 - \frac{\gamma}{\Vert u \Vert_2}, \ 0 \right\}\cdot u = \begin{cases}
0, & \text{if } \Vert u \Vert_2 \le \gamma, \\
\left( 1 - \frac{\gamma}{\Vert u \Vert_2} \right)u, & \text{else.}
\end{cases}
\]
The expression for $\lambda^*_\tau$ and the rest of the iterative hard thresholding algorithm follow similar steps as before.
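The $\ell_2$ shrinkage operation is a one-liner in code; a minimal sketch (our function name):

```python
import numpy as np

def shrink(u, gamma):
    # l2 shrinkage: 0 if ||u||_2 <= gamma, otherwise scale u toward the origin
    norm = np.linalg.norm(u)
    if norm <= gamma:
        return np.zeros_like(u)
    return (1.0 - gamma / norm) * u

a = shrink(np.array([3.0, 4.0]), 1.0)   # norm 5, so u is scaled by 4/5
b = shrink(np.array([0.1, 0.1]), 1.0)   # inside the ball, so the result is 0
```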

$\ell_\infty$ norm bound

When the $\ell_\infty$ norms of the $\xi_i$ are bounded, using the fact that the dual norm of $\ell_\infty$ is the $\ell_1$ norm, we need to solve the problem given by
\[
\mu^*_{j,\tau} \in \operatorname*{argmin}_{\mu_j} \ -\mu'_j\xi_j + \frac{\tau}{2}\Vert\mu_j\Vert_2^2 + L\Vert\mu_j\Vert_1.
\]
The solution to this problem is the well-known soft thresholding operator, given by
\[
\mu^*_{j,\tau} = P_{L/\tau}\left( \frac{1}{\tau}\xi_j \right),
\]
where $P_\gamma(\cdot)$ is an $\ell_1$ shrinkage operation, with $i$th element
\[
(P_\gamma(u))_i = \operatorname{sign}(u_i)(\vert u_i \vert - \gamma)_+ = \begin{cases}
u_i + \gamma, & \text{if } u_i \le -\gamma, \\
u_i - \gamma, & \text{if } u_i \ge \gamma, \\
0, & \text{else.}
\end{cases}
\]
The expression for $\lambda^*_\tau$ and the rest of the iterative hard thresholding algorithm follow similar steps as before.
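The soft thresholding operator $P_\gamma$ is equally direct to implement (again, the function name is ours):

```python
import numpy as np

def soft_threshold(u, gamma):
    # elementwise l1 shrinkage: sign(u_i) * max(|u_i| - gamma, 0)
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

out = soft_threshold(np.array([2.5, -0.3, -1.7]), 1.0)
```

Entries with magnitude at most $\gamma$ are set to zero; the others move toward zero by exactly $\gamma$.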

B.1.2 Implementation details

In this section, we outline some practical implementation details of the iterative thresholding

heuristic outlined above.

Computing the Lipschitz values for the heuristic

In this subsection, we provide some computational details on evaluating the Lipschitz values of the first order heuristic. Forming the matrix $M$ would require storing a matrix of dimensions $m \times (n + nd)$, which is not very practical for, say, $n = 10^5$, $d = 100$. Similarly, storing the matrices $A_T$ and $B_{T,i}$ would also require substantial memory. However, we utilize the structure of the problem and avoid storing these large matrices. We store a two-dimensional array $\Phi : [m] \to [n] \times [n]$, which for each constraint gives the two indices which are part of that constraint. For example, if the first constraint was
\[
\theta_2 + \xi'_2(x_5 - x_2) \le \theta_5,
\]
then $\Phi(1) = [2, 5]$, where $\Phi_1(1) = 2$ and $\Phi_2(1) = 5$. Thus the whole system of constraints represented by the matrices $A_T, B_{T,1}, \dots, B_{T,n}$ can be stored efficiently.

Now, in order to compute the vector $A'_T\lambda$, we first need to define the inverse map of $\Phi$. To be precise, let $\Psi_1 : [n] \to 2^{[m]}$ be a set-valued function which, when given an index $i \in [n]$, outputs the subset of constraints in which $i$ is the first index, i.e.,
\[
\Psi_1(i) = \{ j : \Phi_1(j) = i \},
\]
and similarly
\[
\Psi_2(i) = \{ j : \Phi_2(j) = i \}.
\]
Now, it is easy to see that the product $A'_T\lambda$ can be computed directly. The $i$th element of this vector is simply
\[
\sum_{q \in \Psi_1(i)} \lambda_q - \sum_{q \in \Psi_2(i)} \lambda_q.
\]
Similarly, the $i$th element of the $d$-dimensional vector $B'_{T,j}\lambda$ is given by
\[
\sum_{q \in \Psi_1(j)} (x_{\Phi_2(q)} - x_j)_i \lambda_q.
\]
In order to compute an estimate of the Lipschitz constant $\ell$, we use backtracking (see Beck and Teboulle [2009]), rather than computing the largest eigenvalue of the matrix $M'M$, which can potentially be computationally expensive.
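The matrix-vector product can thus be formed directly from the constraint index pairs without materializing $A_T$. A sketch (our variable names), checked against a dense $A_T$ for a tiny instance:

```python
import numpy as np

# constraints theta_i + xi_i'(x_j - x_i) <= theta_j, stored as index pairs (i, j)
pairs = [(0, 1), (1, 2), (0, 2)]
n, m = 3, len(pairs)
lam = np.array([1.0, 2.0, 3.0])

# i-th entry of A_T' lam: sum of lam_q over constraints where i is the first
# index, minus the sum over constraints where i is the second index
At_lam = np.zeros(n)
for q, (i, j) in enumerate(pairs):
    At_lam[i] += lam[q]
    At_lam[j] -= lam[q]

# dense check: row q of A_T has +1 in column i and -1 in column j
A_T = np.zeros((m, n))
for q, (i, j) in enumerate(pairs):
    A_T[q, i], A_T[q, j] = 1.0, -1.0
assert np.allclose(At_lam, A_T.T @ lam)
```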


Bibliography

Gad Allon, Michael Beenstock, Steven Hackman, Ury Passy, and Alexander Shapiro. Nonparametric estimation of concave production technologies by entropic methods. Journal of Applied Econometrics, 22:795–816, 2007.

Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.

Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.

Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

Gábor Balázs, András György, and Csaba Szepesvári. Near-optimal max-affine estimators for convex regression. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 38:56–64, 2015.

Gabriel Baron, Elodie Perrodeau, Isabelle Boutron, and Philippe Ravaud. Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC Medicine, 11(1):84, 2013.

Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015. Working paper.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Stephen R. Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.

Kristin P Bennett and J Blue. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214, 1996.

Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, pages 1–44, 2017.

Dimitris Bertsimas and Jack Dunn. Machine Learning under a Modern Optimization Lens. Dynamic Ideas, Belmont, 2019. To appear.

Dimitris Bertsimas and Nathan Kallus. Pricing from observational data. arXiv preprint arXiv:1605.02347, 2017. Under review.

Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 2019. To appear.

Dimitris Bertsimas and Rahul Mazumder. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 2014.

Dimitris Bertsimas and Nishanth Mundru. Sparse convex regression. INFORMS Journal on Computing, 2018. Minor revision.

Dimitris Bertsimas and Nishanth Mundru. Prescriptive scenario reduction for stochastic optimization. 2019. In preparation.

Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, 1997.

Dimitris Bertsimas and Bart Van Parys. Sparse high dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 2016. To appear.

Dimitris Bertsimas and Bart Van Parys. Bootstrap robust prescriptive analytics. arXiv preprint arXiv:1711.09974, 2017.

Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63(4):868–876, 2015.

Dimitris Bertsimas, Nathan Kallus, and Amjad Hussain. Inventory management in the era of big data. Production and Operations Management, 25(12):2006–2009, 2016a.

Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016b.

Dimitris Bertsimas, Nathan Kallus, Alex Weinstein, and Ying Daisy Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210–217, 2017.

Dimitris Bertsimas, Jack Dunn, and Nishanth Mundru. Optimal prescriptive trees. INFORMS Journal on Optimization, 2019a. In print.

Dimitris Bertsimas, Christopher McCord, and Nishanth Mundru. Prescriptive analytics for observational data. 2019b. Submitted.

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.

Robert E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optimization Stories, pages 107–121, 2012.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, California, 1984.

Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

Dalia Buffery. The 2015 oncology drug pipeline: innovation drives the race to cure cancer. American Health & Drug Benefits, 8(4):216, 2015.

Hong Chen and David D. Yao. Fundamentals of Queueing Networks: Performance, Asymptotics, and Optimization. Springer-Verlag, 2001.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

Maxime C Cohen, Ngai-Hang Zachary Leung, Kiran Panchamgam, Georgia Perakis, and Anthony Smith. The impact of linear optimization on promotion planning. Operations Research, 65(2):446–468, 2017.

International Warfarin Pharmacogenetics Consortium et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. New England Journal of Medicine, 2009(360):753–764, 2009.

Dick den Hertog and Krzysztof Postek. Bridging the gap between predictive and prescriptive analytics - new optimization methodology needed. Technical report, Tilburg University, Netherlands, 2016. Available at: http://www.optimization-online.org/DB_HTML/2016/12/5779.html.

Priya Donti, J Zico Kolter, and Brandon Amos. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5488–5498, 2017.

John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279. ACM, 2008.

Jack Dunn. Optimal Trees for Prediction and Prescription. PhD thesis, Massachusetts Institute of Technology, 2018. URL http://jack.dunn.nz/papers/Thesis.pdf.

Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.

Jitka Dupačová, Nicole Gröwe-Kuska, and Werner Römisch. Scenario reduction in stochastic programming. Mathematical Programming, 95(3):493–511, 2003.

Adam N Elmachtoub and Paul Grigas. Smart "predict, then optimize". arXiv preprint arXiv:1710.08005, 2017.

Michael L Feldstein, Edwin D Savlov, and Russell Hilf. A statistical model for predicting response of breast cancer patients to cytotoxic chemotherapy. Cancer Research, 38(8):2544–2548, 1978.

Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management, 18(1):69–88, 2015.

Carlos A Flores. Estimation of dose-response functions and optimal doses with a continuous treatment. University of Miami. Typescript, 2007.

Patrick A Flume, Brian P O'Sullivan, Karen A Robinson, Christopher H Goss, Peter J Mogayzel Jr, Donna Beth Willey-Courand, Janet Bujan, Jonathan Finder, Mary Lester, Lynne Quittell, et al. Cystic fibrosis pulmonary guidelines: chronic medications for maintenance of lung health. American Journal of Respiratory and Critical Care Medicine, 176(10):957–969, 2007.

Jérémie Gallien, Adam J Mersereau, Andres Garro, Alberte Dapena Mora, and Martín Nóvoa Vidal. Initial shipment decisions for new products at Zara. Operations Research, 63(2):269–286, 2015.

John C Gittins. Multi-Armed Bandit Allocation Indices. Wiley, Chichester, UK, 1989.

Chong Yang Goh and Patrick Jaillet. Structured prediction by least squares estimated conditional risk minimization. arXiv preprint arXiv:1611.07096, 2016.

Alexander Goldenshluger and Assaf Zeevi. Recovering convex boundaries from blurred andnoisy observations. The Annals of Statistics, 34:1375–1394, 2006.

Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. StochasticSystems, 3(1):230–261, 2013.

Marjan Gort, Manda Broekhuis, Renée Otter, and Niek S Klazinga. Improvement of bestpractice in early breast cancer: actionable surgeon and hospital factors. Breast cancerresearch and treatment, 102(2):219–226, 2007.

Thomas Grubinger, Achim Zeileis, and Karl-Peter Pfeiffer. evtree: Evolutionary learn-ing of globally optimal classification and regression trees in r. Journal of statisti-cal software, 61(1):1–29, 2014. ISSN 1548-7660. doi: 10.18637/jss.v061.i01. URLhttps://www.jstatsoft.org/v061/i01.

Gurobi. Gurobi Optimizer Reference Manual. http://www.gurobi.com, 2015.

Qiyang Han and Jon A Wellner. Multivariate convex regression: global risk bounds andadaptation. arXiv preprint arXiv:1601.06844, 2016.

Lauren A. Hannah and David B. Dunson. Multivariate convex regression with adaptivepartitioning. J. Mach. Learn. Res., 14:3261–3294, 2013.

Lauren A. Hannah, Warren B. Powell, and David B. Dunson. Semiconvex regression formetamodeling based optimization. SIAM Journal on Optimization, 24(2):573–597, 2014.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-ing. Springer Series in Statistics, Springer, New York, second edition, 2009.

Holger Heitsch and Werner Römisch. Scenario reduction algorithms in stochastic program-ming. Computational optimization and applications, 24(2-3):187–206, 2003.

Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Compu-tational and Graphical Statistics, 20(1):217–240, 2011.

Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments.Applied Bayesian modeling and causal inference from incomplete-data perspectives, 226164:73–84, 2004.

Tito Homem-de Mello and Güzin Bayraksan. Monte carlo sampling-based methods forstochastic optimization. Surveys in Operations Research and Management Science, 19(1):56–85, 2014.

Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the RoyalStatistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.

Thomas R Insel. Translating scientific opportunity into public health impact: a strategicplan for research on mental illness. Archives of General Psychiatry, 66(2):128–133, 2009.

217

Page 218: Predictive and Prescriptive Methods in Operations Research

Amir Jaffer and Lee Bragg. Practical tips for warfarin dosing and monitoring. Cleveland Clinic Journal of Medicine, 70(4):361–371, 2003.

Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.

Nathan Kallus. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384, 2017a.

Nathan Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning, pages 1789–1798, 2017b.

Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. arXiv preprint arXiv:1802.06037, 2018.

Yi-hao Kao, Benjamin Van Roy, and Xiang Yan. Directed regression. In Advances in Neural Information Processing Systems, pages 889–897, 2009.

James E Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

Anton J Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.

Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

Avinash S. Lele, Sanjeev R. Kulkarni, and Alan S. Willsky. Convex-polygon estimation from support-line measurements and applications to target reconstruction from laser-radar data. Journal of the Optical Society of America, Series A, 9:1693–1714, 1992.

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

Eunji Lim and Peter W. Glynn. Consistency of multidimensional convex regression. Operations Research, 60(1):196–208, 2012.

Ilya Lipkovich and Alex Dmitrienko. Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics, 24(1):130–153, 2014.

Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Rahul Mazumder, Arkopal Choudhury, Garud Iyengar, and Bodhisattva Sen. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, pages 1–14, 2018.

Maethee Mekaroonreung and Andrew L Johnson. Estimating the shadow prices of SO2 and NOx for US coal power plants: A convex nonparametric least squares approach. Energy Economics, 34(3):723–732, 2012.

Velibor V Mišić. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883, 2017.

Stephen L Morgan and Christopher Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.

Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.

George Nemhauser. Integer programming: The global impact. EURO, INFORMS, 2013. URL https://smartech.gatech.edu/handle/1853/49829.

Arkadi Nemirovski and Alexander Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2006.

Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

Mahesh KB Parmar, James Carpenter, and Matthew R Sydes. More multiarm randomised trials of superiority are needed. The Lancet, 384(9940):283, 2014.

Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah, Trevor Hastie, and Robert Tibshirani. Some methods for heterogeneous treatment effect estimation in high dimensions. arXiv preprint arXiv:1707.00102v1, 2017. Working paper.

Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2):1180, 2011.

Hamed Rahimian, Güzin Bayraksan, and Tito Homem-de-Mello. Identifying effective scenarios in distributionally robust stochastic programs with total variation distance. Mathematical Programming, pages 1–38, 2018.

R Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Paul R Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.

Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, pages 41–55, 1983.

Cynthia Rudin and Gah-Yi Vahn. The big data newsvendor: Practical insights from machine learning. 2014.

Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn, and Wolfram Wiesemann. Scenario reduction revisited: Fundamental limits and guarantees. Mathematical Programming, pages 1–36, 2017.

Emilio Seijo and Bodhisattva Sen. Nonparametric least squares estimation of a multivariate convex regression function. The Annals of Statistics, 39:1633–1657, 2011.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Alexander Shapiro and Arkadi Nemirovski. On complexity of stochastic programming problems. Continuous Optimization, pages 111–146, 2005.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009a.

Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. Society for Industrial and Applied Mathematics and the Mathematical Programming Society, 2009b.

Nguyen Hung Son. From optimal hyperplanes to optimal decision trees. Fundamenta Informaticae, 34(1,2):145–174, 1998.

Evan Stubbs. The value of business analytics. http://analytics-magazine.org/the-value-of-business-analytics/, 2016. Accessed: 2018-01-30.

Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization. In Proceedings of the 24th International Conference on World Wide Web, pages 939–941. ACM, 2015.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Huseyin Topaloglu and Warren B. Powell. An algorithm for approximating piecewise linear concave functions from sample gradients. Operations Research Letters, 31:66–76, 2003.

Theja Tulabandhula and Cynthia Rudin. Machine learning with operational costs. The Journal of Machine Learning Research, 14(1):1989–2028, 2013.

Hal R. Varian. The nonparametric approach to demand analysis. Econometrica, 50(4):945–973, 1982.

Hal R. Varian. The nonparametric approach to production analysis. Econometrica, 52(3):579–597, 1984.

Roman Vershynin. High-dimensional probability: An introduction with applications, 2016.

Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

Stein W Wallace and William T Ziemba. Applications of Stochastic Programming. SIAM, 2005.

Geoffrey S Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8):826, 2010.

Min Xu, Minhua Chen, and John Lafferty. Faithful variable screening for high-dimensional convex regression. The Annals of Statistics, 44(6):2624–2660, 2016.

Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

José R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.
