Predictive and Prescriptive Methods in Operations Research and Machine Learning: An Optimization Approach

by

Nishanth Mundru

B.Tech., Indian Institute of Technology Bombay (2012)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Operations Research

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2019

© Massachusetts Institute of Technology 2019. All rights reserved.
Author: Sloan School of Management, May 17, 2019

Certified by: Dimitris J. Bertsimas
Boeing Leaders for Global Operations Professor, Sloan School of Management
Co-Director, Operations Research Center
Thesis Supervisor

Accepted by: Patrick Jaillet
Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science
Co-Director, Operations Research Center
Predictive and Prescriptive Methods in Operations Research and
Machine Learning: An Optimization Approach
by
Nishanth Mundru
Submitted to the Sloan School of Management on May 17, 2019, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Operations Research
Abstract
The availability and prevalence of data have provided a substantial opportunity for decision makers to improve decisions and outcomes by effectively using this data. In this thesis, we propose approaches that start from data and lead to high-quality decisions and predictions in various application areas.
In the first chapter, we consider problems with observational data, and propose variants of machine learning (ML) algorithms that are trained by taking decision quality into account. The traditional approach to such a task has often been a two-step one, separating the estimation task from the subsequent optimization task that uses the estimated models; consequently, it can miss out on potential improvements in decision quality that come from considering the two tasks jointly. Crucially, our joint approach leads to stronger prescriptive performance, particularly for smaller training set sizes, and improves decision quality by 3-5% over other state-of-the-art methods. We introduce the idea of uncertainty penalization to control the optimism of these methods, which improves their performance, and propose finite-sample regret bounds. Through experiments on real and synthetic data sets, we demonstrate the value of this approach.
In the second chapter, we consider observational data with decision-dependent uncertainty; in particular, we focus on problems with a finite number of possible decisions (treatments). We present our method of prescriptive trees, which prescribes the best treatment option by learning from observational data while simultaneously predicting counterfactuals. We demonstrate the effectiveness of this approach using real data for the problem of personalized diabetes management.
In the third chapter, we consider stochastic optimization problems for which the sample average approximation approach is computationally expensive. We introduce a novel measure, called the Prescriptive divergence, which takes into account the decision quality of the scenarios, and consider scenario reduction in this context. We demonstrate the power of this optimization-based approach on various examples.
In the fourth chapter, we present our work on a problem in predictive analytics where we focus on ML problems from a modern optimization perspective. For sparse shape-constrained regression problems, we propose modern optimization-based algorithms that are scalable, and that recover the true support with high accuracy and low false positive rates.
Thesis Supervisor: Dimitris J. Bertsimas
Title: Boeing Leaders for Global Operations Professor, Sloan School of Management; Co-Director, Operations Research Center
Acknowledgments
First of all, I would like to thank my advisor, Dimitris Bertsimas, for his constant guidance
and encouragement throughout the course of my PhD. His infectious passion, unmatched
ability to innovate, and attention to detail have significantly improved me as a researcher.
His willingness to challenge assumptions by asking the right questions and to choose practically
relevant problems has left an indelible impression on me. As I evaluate and reflect back
on my graduate school experience, I realize that he has profoundly shaped my views on
research. It has been a tremendous privilege to collaborate so closely with someone who sees
research as a powerful means to materially improve the human condition.
Next, I would like to express my gratitude towards the other two members of my thesis
committee: Nikos Trichakis and Colin Fogarty. Their insightful comments and feedback have
greatly improved this manuscript. Nikos: Thank you for your support with my applications.
I would also like to thank the following professors at MIT: Rob Freund (for serving on
my general examination committee and helping me with my applications), Georgia Perakis
(for her thoughtful words of encouragement), and John Tsitsiklis (for his role on my general
examination committee). Thank you to the faculty I interacted with and learned from:
Patrick Jaillet, Juan Pablo Vielma, David Gamarnik, Rahul Mazumder, and Jim Orlin. In
addition, I wish to thank the ORC staff – Laura, Andrew, and Nikki – for their prompt
help with all the paperwork and for ensuring that the ORC runs smoothly.
I would also like to express my sincere thanks to my collaborators: Velibor Mišić for
his enthusiastic insights and ideas in our work on the airlift problem, Allison Chang for
her guidance and expertise on the airlift project, Jack Dunn for his clarity of thought in our
work on the prescriptive trees paper and for helping me with the Engaging cluster, and Chris
McCord for his intuition and insights in our collaborations on prescriptive analytics. Each
one of you has been exceptionally supportive, and working with you has been an enriching
learning experience for me.
I also wish to thank my friends and first year qualifying exams study group members
– Daniel Chen, Martin Copenhaver, and Rajan Udwani. A special word of appreciation to
Martin for being a valuable source of support during some stressful times.
I would like to particularly thank some friends I’ve made at the ORC over the years: Scott
Hunter (for being a constant friendly presence throughout), Alex Weinstein (I particularly
enjoyed our lunches), Miles Lubin and Joey Huchette (for helping me with Julia/JuMP code
during my first few years), Divya and Sowmya Singhvi (for all our conversations), Colin
Pawlowski, Matthew Sobiesk and Daisy Zhuo (for being such a positive influence), and Eli
Gutin (for the fitness tips and shared workout sessions).
A word of thanks to all the other friends I’ve made at the ORC over the years: Andrew Li,
Andrew Vanden Berg, Arthur Delarue, Arthur Flajolet, Charlie Thraves, Chiwei Yan, Chris
Coey, Dan Schonfeld, Deeksha Sinha, Frans deRuiter, Ilias Zadik, Jackie Baek, Jehangir
Amjad, Jerry Kung, Joel Tay, Julia Yan, Kimia Ghobadi, Krish Rajagopalan (for inviting
me to your beautiful wedding), Lennart Baardman, Michael Hu, Nataly Youssef, Nikita
Korolko, Peng Shi, Rebecca Zhang, Rim Harris, Swati Gupta, Ted Papalexopoulos, Virgile
Galle, Yee Sian Ng, and Zach Saunders. Finally, I wish to thank the ORC community as
a whole – I have learnt an immense amount from each of you, and your generosity and
friendliness make the ORC a special place.
I would also like to thank all the friends I’ve made at Sidney Pacific and MIT: in partic-
ular, Murali Vijayaraghavan (who was also my first roommate at Sidney Pacific and helped
me get used to life in Cambridge) and Sai Gautam (for Tamil movie recommendations and
our discussions on test cricket, the greatest sport on this planet). More generally, the MIT
community (along with the opportunities/facilities that MIT provides) as a whole has helped
me grow immeasurably as a person, and for that I am incredibly thankful.
Thanks to all my teachers (in particular, Professor Mani Bhushan for serving as a research
advisor during my last year, and Professors Hemant Nanavati and Jhumpa Adhikari for being
so accommodating of my course preferences) and friends from my undergraduate days at IIT
Bombay. Your support has contributed to making those four years a rewarding experience.
A word of appreciation to all my teachers and friends throughout my schooling in Hyderabad,
without whom I probably would not have been the person I am today.
Finally, I would like to express gratitude to my family for their unconditional support
both during the course of my PhD and over my whole life. In particular, I am indebted to
my parents, Murty and Devi, for their boundless encouragement and patient advice. Thank
you to my brother and maternal grandparents for always being there for me. I also wish to
thank my extended family here in the US for their warmth and hospitality.
Contents
1 Introduction 19
1.1 Prescriptive Analytics: Joint Learning and Optimization . . . . . . . . . . . . . 20
1.1.1 Prescriptive Analytics for Observational Data . . . . . . . . . . . . . . . 20
1.1.2 Optimal Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.1.3 Prescriptive Scenario Reduction for Stochastic Optimization . . . . . . 22
1.2 Predictive Analytics: Machine Learning from a Modern Optimization lens . . 23
1.2.1 Sparse Convex Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3 Main Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Prescriptive Analytics for Observational Data 27
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.1.2 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.1.4 Structure of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.2 Outline of our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Methods for Joint Predictive-Prescriptive Analytics . . . . . . . . . . . . . . . . 43
2.3.1 k Nearest Neighbors (kNN) . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.3.2 Nadaraya-Watson Kernel Regression (KR) . . . . . . . . . . . . . . . . . 44
2.3.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.4 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3.5 Penalizing the prediction error of f . . . . . . . . . . . . . . . . . . . . . 48
2.4 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Greedy algorithm for learning trees . . . . . . . . . . . . . . . . . . . . . 49
2.4.2 Prescriptive Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.3 Local search algorithms for Prescriptive Trees . . . . . . . . . . . . . . . 50
2.5 Observational data with decision-dependent uncertainty . . . . . . . . . . . . . 52
2.5.1 Uncertainty penalization and Parameter tuning . . . . . . . . . . . . . . 54
2.5.2 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5.3 Tractability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.6 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.6.1 Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.6.2 Newsvendor problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.6.3 Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.6.4 Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3 Optimal Prescriptive Trees 73
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.1.3 Structure of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2 Review of Optimal Predictive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3 Optimal Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.1 Optimal Prescriptive Trees with Constant Predictions . . . . . . . . . . 84
3.3.2 Training Prescriptive Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.3 Optimal Prescriptive Trees with Linear Predictions . . . . . . . . . . . . 87
3.4 Performance of OPTs on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 88
3.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4.5 Multiple treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.4.6 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.5 Performance of OPTs on Real World Data . . . . . . . . . . . . . . . . . . . . . 101
3.5.1 Personalized Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.5.2 Personalized Diabetes Management . . . . . . . . . . . . . . . . . . . . . 105
3.5.3 Personalized Job training . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.4 Estimating Personalized Treatment Effects for Infant Health . . . . . . 110
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4 Prescriptive Scenario Reduction for Stochastic Optimization 113
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.1.1 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.1.2 Contributions and Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.2.1 Distance between distributions . . . . . . . . . . . . . . . . . . . . . . . . 118
4.2.2 Scenario reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.3 Prescriptive Scenario reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3.1 Prescriptive divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.3.3 Piecewise (separately) linear cost . . . . . . . . . . . . . . . . . . . . . . . 127
4.3.4 Piecewise bilinear cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.3.5 Prediction Error penalization . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.4 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.4.1 Alternating optimization framework . . . . . . . . . . . . . . . . . . . . . 129
4.5 Computational Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.5.1 Portfolio optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.5.2 Newsvendor problem with budget constraints . . . . . . . . . . . . . . . 133
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5 Sparse Convex Regression 137
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.1.1 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1.3 Structure of this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2 Optimization Algorithm for Convex Regression . . . . . . . . . . . . . . . . . . . 142
5.2.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2.2 ℓ1 convex regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.3 Sparse Convex Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.3.1 Primal approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.3.2 Dual approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.3.3 Initialization heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.4 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.4.2 Comparison of initialization methods for the reduced master problem . 160
5.4.3 Run times of ℓ2 convex regression . . . . . . . . . . . . . . . . . . . . . . 162
5.4.4 Infeasibility as a function of iterations . . . . . . . . . . . . . . . . . . . . 164
5.4.5 Comparison with other state of the art methods . . . . . . . . . . . . . . 165
5.4.6 Run times for ℓ1 convex regression . . . . . . . . . . . . . . . . . . . . . . 167
5.4.7 Experiments on real data . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.4.8 Sparse convex regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.4.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6 Conclusions 181
A Supplement for Chapter 2 183
A.1 Optimization Algorithms for Joint Predictive and Prescriptive Analytics . . . 183
A.1.1 First order convex methods for local search procedure . . . . . . . . . . 188
A.2 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
A.3 Optimization with Linear Predictive Models . . . . . . . . . . . . . . . . . . . . 199
A.4 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
A.4.1 Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
A.4.2 Newsvendor problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
A.4.3 Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
A.4.4 Warfarin Dosing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
B Supplement for Chapter 5 205
B.1 Heuristic for generating upper bounds . . . . . . . . . . . . . . . . . . . . . . . . 205
B.1.1 Heuristics for norm bounded subgradients . . . . . . . . . . . . . . . . . 208
B.1.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
List of Figures
2-1 Tree constructed by regressing Y vs. X on the training set. . . . . . . . . . . . . 29
2-2 A different decision tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3-1 Performance of classification methods averaged across 60 real-world datasets.
OCT and OCT-H refer to Optimal Classification Trees without and with
hyperplane splits, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3-2 Test prediction and personalization error as a function of µ . . . . . . . . . . . 85
3-3 Effect and Treatment accuracy results for Experiment 1. . . . . . . . . . . . . . 92
3-4 Tree constructed by OPT(0.5)-L for an instance of Experiment 1. . . . . . . . 93
3-5 Effect and Treatment accuracy results for Experiment 2. . . . . . . . . . . . . . 93
3-6 Tree constructed by OPT(0.5)-L for an instance of Experiment 2. . . . . . . . 95
3-7 Effect and Treatment accuracy results for Experiment 3. . . . . . . . . . . . . . 96
3-8 Outcome and Treatment accuracy results for Experiment 4 with three treat-
ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3-9 Error in prescribed outcome due to incorrect prescription. . . . . . . . . . . . . 100
3-10 Misclassification rate for warfarin dosing prescriptions as a function of training
set size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3-11 Comparison of methods for personalized diabetes management. The leftmost
plot shows the overall mean change in HbA1c across all patients (lower is
better). The center plot shows the mean change in HbA1c across only those
patients whose prescription differed from the standard-of-care. The rightmost
plot shows the proportion of patients whose prescription was changed from
the standard-of-care. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3-12 Out-of-sample average personalized income as a function of inclusion rate. . . 109
4-1 Average in-sample prescriptive performance for various methods as a function
of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . . 132
4-2 Average out-of-sample prescriptive performance for various methods as a func-
tion of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . 133
4-3 Average in-sample prescriptive performance for various methods as a function
of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . . . . 134
4-4 Average out-of-sample prescriptive performance for various methods as a func-
tion of m, the number of reduced scenarios. . . . . . . . . . . . . . . . . . . . . . 135
5-1 Progress of Algorithm 1 for (n, d) = (10⁴, 10), Tol = 0.1. . . . . . . . . . . . . . 164
5-2 Progress of Algorithm 1 for (n, d) = (10⁴, 10), Tol = 0.05. . . . . . . . . . . . . 165
5-3 Progress of Algorithm 1 for Tol = 0.01. . . . . . . . . . . . . . . . . . . . . . . . . 168
5-4 Progress of Algorithm 1 for Tol = 0.05. . . . . . . . . . . . . . . . . . . . . . . . . 169
5-5 Accuracy and run times for varying SNR. . . . . . . . . . . . . . . . . . . . . . . 175
5-6 Accuracy and run times for varying correlation ρ. . . . . . . . . . . . . . . . . . 176
5-7 Accuracy and run times for varying dimension d. . . . . . . . . . . . . . . . . . . 176
5-8 Accuracy and run times for varying sparsity parameter k. . . . . . . . . . . . . 176
List of Tables
2.1 Average out of sample prescriptive performance for Predict and Optimize -
kNN, local Kernel Regression, Lasso, and Random forests as a function of n,
the size of training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.2 Average out of sample prescriptive performance for various methods as a func-
tion of n, the size of training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.3 Average out of sample prescriptive performance for various methods as a func-
tion of n, the size of training set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4 Average out of sample revenue on the pricing example for various PtP and
JPP methods as a function of n, the size of training set. . . . . . . . . . . . . . 68
2.5 Average out of sample MSE on the Warfarin example for various PtP and
JPP methods as a function of n, the size of training set. . . . . . . . . . . . . . 70
3.1 Average personalized income on the test set for various methods. . . . . . . . . 109
3.2 Average R² on the test set for various methods for estimating the personalized
treatment effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.1 The effect of the initialization method for (n, d) = (10⁴, 10) in the ℓ2 convex
regression for tolerances Tol = 0.1 and 0.05. . . . . . . . . . . . . . . . . . . . . . 161
5.2 Run times for Tol = 0.1 and ℓ2 convex regression. . . . . . . . . . . . . . . . . . 162
5.3 Run times for Tol = 0.05 and ℓ2 convex regression. . . . . . . . . . . . . . . . . . 163
5.4 Comparison for ℓ2 convex regression between Algorithm 1 and ACP for Tol = 0.1.166
5.5 Comparison for ℓ2 convex regression with ADMM. . . . . . . . . . . . . . . . . . 167
5.6 ℓ1 convex regression - Run times for Tol = 0.1. . . . . . . . . . . . . . . . . . . . 167
5.7 Accuracy% and Run times for Algorithm 2 for n = 50k. . . . . . . . . . . . . . . 172
5.8 Accuracy% and Run times for Algorithm 2 for n = 100k, d = 100. . . . . . . . . 173
5.9 Accuracy% and Run times for Algorithm 3 for n = 50k. . . . . . . . . . . . . . . 174
5.10 False Positive rate for Algorithm 3. . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.11 False Positive rate for Algorithm 2. . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Chapter 1
Introduction
Nowadays a decision maker typically has access to a wide range of data, which can be
used to significantly improve decision making. This data not only includes past samples
of uncertainty, but also auxiliary data (or side information/features/covariates) associated
with each observation. For instance, along with daily demand data (uncertainty), a retailer
could also collect side information such as weather data, temporal data (season, weekday
or weekend), product information, and macroeconomic trends. These features are typically
available at the time of decision making while uncertainty is observed only after the decision is
implemented. A notable point is that this data is often observational, where past decisions are
unknown functions of the covariates. Additionally, in many cases these decisions influence the
observed uncertainty. For example, in revenue management, a store owner must decide how
to price various products in order to maximize profit, but the observed demand (uncertainty)
is itself affected by the chosen price. Thus, a key question in the area of data-driven decision
making is:
How can we achieve better decisions than state-of-the-art methods by considering prediction
and optimization models jointly, while adjusting for the observational nature of the data?
This thesis tackles various aspects of this problem by using ideas from the statistics and
Machine Learning (ML) literature along with techniques from mathematical optimization.
In particular, the first part of this thesis focuses on improving techniques for data-driven
decision making; then, in the second part, we focus on developing techniques using modern
optimization for ML applications.
We provide a brief outline of each of these two areas and the specific problems comprising
them that we consider in this thesis, as well as our contributions, before describing them in
more detail in the subsequent chapters.
1.1 Prescriptive Analytics: Joint Learning and Opti-
mization
1.1.1 Prescriptive Analytics for Observational Data
In Chapter 2, we consider problems where uncertainty affects the cost function, and propose
ML-based algorithms for computing prescriptive policies in a single step. Traditionally,
stochastic optimization based methods, which use past data samples of uncertainty along
with probabilistic assumptions, have been studied extensively for decision-making, but these
methods typically do not account for covariate data. A commonly used two-stage approach
(which is referred to as Predict and Optimize, or P&O) of first training an ML model to
predict the uncertainty from auxiliary data, followed by substituting this estimate in the
optimization problem directly, can often lead to suboptimal decisions. A key drawback of
P&O is that it does not take into account how the prediction uncertainty affects the objective
of the optimization (prescription) problem. We address this in Bertsimas et al. [2019b], where
we train ML models (local learning methods, decision trees, random forests) by explicitly
optimizing for the quality of their corresponding decisions.
A recent approach in Bertsimas and Kallus [2019], which relies on solving a covariate-
dependent SAA (Sample Average Approximation)-like problem (the objective is averaged
only over the relevant neighbors, which are determined by regressing uncertainty on the
covariates) often does better than P&O as it accounts for the prediction uncertainty of the
cost function. However, this approach still advocates a two-step process in which ML and
optimization remain disjoint. As we demonstrate in Bertsimas
et al. [2019b], coupling prediction and optimization can lead to further gains, particularly
for smaller training set sizes.
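To make the contrast concrete, the following sketch compares a two-step P&O baseline with a covariate-dependent SAA on a toy newsvendor problem. It is an illustrative simplification, not the algorithms of Chapter 2: the function names, the equal-weight kNN neighborhood, and the underage/overage costs b and h are choices made here for exposition.

```python
import numpy as np

def predict_then_optimize(X_train, d_train, x_new, k=10):
    """Two-step (P&O-style) baseline: form a kNN point forecast of demand,
    then plug that single estimate into the newsvendor problem, which makes
    the prescribed order equal to the forecast itself."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    neighbors = np.argsort(dists)[:k]
    return d_train[neighbors].mean()

def weighted_saa(X_train, d_train, x_new, k=10, b=4.0, h=1.0):
    """Covariate-dependent SAA (in the spirit of Bertsimas and Kallus [2019]):
    minimize the average newsvendor cost b*(d - q)^+ + h*(q - d)^+ over the
    k nearest demand samples.  With equal weights, the minimizer is the
    b/(b+h) sample quantile of the neighbors' demands."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    neighbors = np.argsort(dists)[:k]
    d_local = np.sort(d_train[neighbors])
    idx = int(np.ceil(b / (b + h) * k)) - 1  # critical-fractile index
    return d_local[max(min(idx, k - 1), 0)]
```

When underage is costlier than overage (b > h), the two policies differ: P&O orders the point forecast, while the local SAA orders a high quantile of the nearby demand samples, reflecting the cost asymmetry.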
Additionally, our methods can also account for observational data where the observed
uncertainty can be affected by the decision. We use these ML methods to simultaneously
prescribe decisions and impute counterfactual outcomes. We introduce the idea of uncertainty
penalization to control the optimism of these methods, which improves their performance,
and propose finite-sample regret bounds on their performance. Finally, we perform
computational experiments that demonstrate the prescriptive power of our methods on synthetic
and real data on portfolio optimization, newsvendor, pricing, and personalized medicine
problems.
This chapter appears in large part in the submitted paper [Bertsimas et al., 2019b].
1.1.2 Optimal Prescriptive Trees
In a related problem, Chapter 3 considers observational data with decision-dependent
uncertainty, but focuses on the case with a finite number of decisions (treatments). We present
our method of prescriptive trees, which prescribes the best treatment option by learning from
observational data while simultaneously predicting the counterfactuals. This approach is
interpretable, and applies to problems with more than two treatment
choices. In the context of personalized medicine, it is essential that the personalization
method be interpretable and able to handle the case of more than two treatment options
(unlike causal forests). Trees can greatly help in this setting, as the partitions can be used
to get a sense of the key characteristics that led to a patient being assigned a particular
prescription.
We demonstrate the performance of our methods on synthetic data and two real-world
applications – personalized Warfarin dosing and personalized diabetes treatment (data from
the Boston Medical Center). Our methods outperform propensity score based methods,
regress-and-compare methods (analogs of P&O for the discrete case, which involve estimating
outcome functions for each treatment and choosing the treatment which leads to the best
predicted outcome), and causal forests. The key message remains the same – rather than
stipulating models that we estimate using data followed by using these estimated models
to arrive at final decisions, we employ a single-step framework for decision making from
observational data.
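For reference, the regress-and-compare baseline described above can be sketched as follows. This is a hypothetical minimal implementation (using a plain kNN regressor as the per-treatment outcome model, and assuming lower outcomes are better), not the estimators compared in this chapter:

```python
import numpy as np

def regress_and_compare(X, treatments, outcomes, x_new, k=5):
    """Regress-and-compare baseline: fit one outcome model per treatment
    (here, a simple kNN regressor), predict each treatment's outcome for
    x_new, and prescribe the treatment with the best (lowest) prediction."""
    predictions = {}
    for t in np.unique(treatments):
        Xt = X[treatments == t]           # covariates of units given t
        yt = outcomes[treatments == t]    # their observed outcomes
        dists = np.linalg.norm(Xt - x_new, axis=1)
        nn = np.argsort(dists)[: min(k, len(yt))]
        predictions[t] = yt[nn].mean()    # predicted outcome under t
    best = min(predictions, key=predictions.get)
    return best, predictions
```

Note that each outcome model is estimated on a separate slice of the data; with observational assignments, these slices are not comparable populations, which is one reason single-step methods such as prescriptive trees can do better.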
This chapter appears in the published paper [Bertsimas et al., 2019a].
1.1.3 Prescriptive Scenario Reduction for Stochastic Optimization
In Chapter 4, we consider data-driven stochastic optimization problems, where the data
consist of n historically observed samples of uncertainty. In this setting, the decision maker
aims to minimize a (convex) cost function averaged over the empirical sample distribution,
commonly referred to as the Sample Average Approximation (SAA), over a convex compact
uncertainty-independent set. However, when the number of samples n is large, solving the
SAA becomes computationally prohibitive. A classical solution framework is scenario reduc-
tion, where the empirical distribution is approximated by another discrete distribution with
a smaller support, and the SAA problem on this reduced distribution becomes computa-
tionally tractable. In computing this approximate distribution, a widely used measure of
closeness between two distributions is the Wasserstein distance, which, however, does not
take into account the decision quality of these scenarios.
We introduce a novel generalization of the Wasserstein distance, which we refer to
as the Prescriptive divergence, that quantifies the difference in decision quality between
two discrete distributions. We consider scenario reduction in this setting, and develop an
alternating-minimization based algorithm for computing discrete distributions (scenarios and
corresponding probabilities) that minimize this quantity. We demonstrate using computa-
tional examples that this approach can lead to significantly better decisions for constrained
newsvendor and portfolio optimization problems.
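As a rough illustration of the setting (and of what a decision-aware reduction must beat), the sketch below, which is not the algorithm developed in this thesis, reduces an empirical demand distribution with a plain 1-D k-means clustering (a Wasserstein-style, decision-agnostic reduction) and compares the resulting newsvendor decision against the full-sample SAA decision. The data, the costs, and the clustering scheme are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
demand = rng.lognormal(3.0, 0.5, size=10_000)   # n historical demand samples
h, b = 1.0, 4.0                                 # holding / backorder unit costs

def newsvendor_cost(z, y, p):
    """Weighted SAA cost of order quantity z over scenarios y with probabilities p."""
    return np.sum(p * (h * np.maximum(z - y, 0.0) + b * np.maximum(y - z, 0.0)))

def saa_solution(y, p):
    """SAA optimum for the newsvendor: the b/(b+h) quantile of the scenario set."""
    order = np.argsort(y)
    cum = np.cumsum(p[order])
    return y[order][np.searchsorted(cum, b / (b + h))]

# Full SAA over all n scenarios
p_full = np.full(demand.size, 1.0 / demand.size)
z_full = saa_solution(demand, p_full)

# Decision-agnostic scenario reduction: 1-D k-means (Lloyd's algorithm), m = 5
m = 5
centers = np.quantile(demand, np.linspace(0.1, 0.9, m))
for _ in range(50):
    assign = np.argmin(np.abs(demand[:, None] - centers[None, :]), axis=1)
    centers = np.array([demand[assign == j].mean() if (assign == j).any()
                        else centers[j] for j in range(m)])
probs = np.bincount(assign, minlength=m) / demand.size

z_reduced = saa_solution(centers, probs)

# Evaluate both decisions against the full empirical distribution; the reduced
# scenarios can only do as well as, or worse than, the full SAA in-sample
cost_full = newsvendor_cost(z_full, demand, p_full)
cost_reduced = newsvendor_cost(z_reduced, demand, p_full)
```

The gap between `cost_reduced` and `cost_full` is exactly the in-sample price of reducing scenarios without regard to decision quality, which is the quantity the Prescriptive divergence is designed to control.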
This chapter appears in Bertsimas and Mundru [2019], which will be submitted shortly.
1.2 Predictive Analytics: Machine Learning from a
Modern Optimization lens
In this section, we describe some work on machine learning problems from an optimization
lens.
1.2.1 Sparse Convex Regression
Estimating a regression function from data with shape constraints (convexity or concavity)
has many applications in operations research (reinforcement learning, resource allocation),
econometrics, geometric programming, image analysis, and target reconstruction. While this
functional optimization problem can be equivalently written as a finite dimensional convex
quadratic optimization problem, it has O(n2) constraints, where n is the number of training
points.
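The O(n^2) constraint count can be made concrete: a convex regression fit is parameterized by fitted values θ_i and subgradients ξ_i satisfying θ_j ≥ θ_i + ξ_i'(x_j − x_i) for every ordered pair (i, j). The numpy sketch below (illustrative only; it is not the cutting-plane algorithm of Chapter 5) counts violated pair constraints for a candidate fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))

def convexity_violations(theta, xi, X, tol=1e-9):
    """Count violated pairwise convexity constraints
    theta_j >= theta_i + xi_i . (x_j - x_i), of which there are O(n^2)."""
    diffs = X[None, :, :] - X[:, None, :]            # diffs[i, j] = x_j - x_i
    gaps = theta[None, :] - theta[:, None] - np.einsum('id,ijd->ij', xi, diffs)
    return int(np.sum(gaps < -tol))

# Sanity check with the convex f(x) = ||x||^2, whose gradient is 2x
theta = np.sum(X ** 2, axis=1)    # fitted values theta_i = f(x_i)
xi = 2 * X                        # subgradients xi_i
violations = convexity_violations(theta, xi, X)      # 0 for a convex fit
n_constraints = n * (n - 1)                          # 2450 pair constraints here
```

Even at n = 50 there are 2,450 constraints; at n = 10^4 there are on the order of 10^8, which is why a cutting-plane approach that adds violated constraints lazily is attractive.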
In Chapter 5, we develop a cutting plane-based scalable algorithm for obtaining high
quality solutions in practical times. Next, variable selection for regression has gained im-
portance in the statistics and optimization community, and is relevant when the number of
features d can be large. As part of this work, we develop computational methods that select
the best k out of the d features using an approach that combines first order convex opti-
mization based methods, mixed integer optimization techniques, and dual reformulations.
We demonstrate that our methods scale to solve problems of sizes (n, d, k) = (10^4, 10^2, 10) in minutes and (n, d, k) = (10^5, 10^2, 10) in hours, and also control the false discovery rate
effectively.
This chapter appears in the submitted paper [Bertsimas and Mundru, 2018].
1.3 Main Contributions
Our contributions in this thesis can be summarized as follows, listed by chapter.
Chapter 2: Prescriptive Analytics for Observational Data
• In this chapter, we propose a general approach for solving the prescriptive problem
with uncertain parameters in its objective as a single-step optimization problem. This
framework provides high quality prescriptions by learning from past data and accom-
modates powerful non-parametric machine learning methods such as k nearest neigh-
bors, kernel regression, decision trees and random forests, which have been traditionally
used for prediction. That is, we directly train these machine learning methods for the
parameters that lead to the best decisions as opposed to predictions.
• Further, we develop an algorithmic framework for observational data-driven optimiza-
tion that allows the decision variable to take values on continuous and multidimensional
sets.
• We analyze the power of these approaches theoretically, and present finite-sample regret
bounds on the performance of these methods.
• Finally, we demonstrate the performance of the methods developed through computa-
tional experiments. First, for the case where uncertainty is not affected by decisions,
we apply our methods on a portfolio optimization problem and a newsvendor prob-
lem, and provide evidence that they output superior data-driven decisions compared
to state of the art methods, particularly for smaller sizes of the training set. Next, in
the case where uncertainty is affected by decisions, we consider applications in person-
alized medicine, in which the decision is the dose of Warfarin to prescribe to a patient,
and in pricing, in which the action is the list of prices for several products in a store.
Chapter 3: Optimal Prescriptive Trees
• In this chapter, we present a tree-based method that produces trees with partitions
that are parallel to the axis. Consequently, they are highly interpretable and provide
intuition on the important features that lead to a sample being assigned a particular
treatment.
• Similar to predictive trees [Bertsimas and Dunn, 2017, 2019, Dunn, 2018], prescriptive
trees scale to problems with n in the 100,000s and d in the 10,000s in seconds when they
use constant predictions in the leaves and in minutes when they use a linear model.
• Prescriptive trees can be applied with multiple treatments. An important desired
characteristic of a prescriptive algorithm is its generalizability to handle the case of
more than two possible arms. Rapid advances in technology have resulted in almost
all diseases having multiple drugs at the same stage of clinical development. This
emphasizes the importance of methods that can handle trials with more than two
treatment options.
• In a series of experiments with real and synthetic data, we demonstrate that prescriptive
trees either outperform or are comparable out of sample with several state of the art
methods. Importantly, these methods tend to
perform well in the presence of limited data, which is often the case in practice in the
healthcare setting.
Chapter 4: Prescriptive Scenario Reduction for Stochastic Opti-
mization
• In this chapter, we present a novel optimization based approach for scenario reduction
for stochastic optimization problems. As part of this approach, we introduce the “Prescriptive
divergence”, which measures the difference in quality of decisions induced by
two discrete distributions and includes the Wasserstein distance as a special case.
• We propose scenario reduction approaches in this context, and present an iterative
algorithm for computing these scenarios and their corresponding probabilities for two
classes of cost functions. While the actual problem is nonconvex, we derive convex
upper bounds which we optimize for estimating the scenarios. Our optimization ap-
proach relies on an alternating minimization algorithm, where we solve a sequence of
convex optimization problems for computing the scenarios.
• Finally, we present computational results where we apply these methods on constrained
newsvendor and portfolio optimization problems, and demonstrate that these methods
result in improved decisions. Importantly, these scenarios outperform the traditional
Wasserstein-distance based scenario reduction approach both in-sample and out-of-
sample across various choices of the number of scenarios m. This results in better
performance with fewer scenarios, which leads to greater interpretability of decision-
making, and hence is valuable to practitioners.
Chapter 5: Sparse Convex Regression
• In this chapter, we consider the problem of convex regression, and develop a scalable
algorithm for obtaining high quality solutions in practical times that compare favorably
with other state of the art methods. We show that by using a cutting plane method,
the least squares convex regression problem can be solved for sizes (n, d) = (10^4, 10) in
minutes and (n, d) = (10^5, 10^2) in hours.
• We propose algorithms which iteratively solve for the best subset of features based on
first order and cutting plane methods. To the best of our knowledge, these are the
first algorithms for sparse convex regression.
• We consider two variants of this problem, and develop algorithms for each of them. In
one variant, we consider the sparse problem with bounded subgradients, and develop
iterative mixed integer optimization based algorithms for solving it. In the second
variant, we consider the sparse problem with ridge regularization, and develop a binary
cutting plane method for this problem.
• With the help of computational experiments, we show that our methods are scalable
and obtain near exact subset recovery for sizes (n, d, k) = (10^4, 10^2, 10) in minutes, and
(n, d, k) = (10^5, 10^2, 10) in hours.
Chapter 2
Prescriptive Analytics for
Observational Data
2.1 Introduction
One of the central goals of operations research/management science (OR/MS) and business
analytics is to make decisions which lead to lower costs and improved business outcomes.
These decisions (which we shall also refer to as prescriptions in this chapter) are typically
computed by solving a constrained optimization problem. However, a challenge is that some
parameters in the optimization problem are often unknown. Traditionally in operations
research, these uncertain parameters are estimated under a priori imposed assumptions, and
the decisions are then computed by solving the optimization problem with the estimated
parameters.
With the advent and proliferation of data and the improved ability to collect and store
large quantities of diverse information, there has been increased interest in using this rich
data to improve the quality of decisions. Data, rather than models or assumptions, should
guide the decision making process. This key principle has guided the machine learning (ML)
community to notable improvements in predictive analytics over the past decade. However,
most real world business analytics problems typically involve aspects of both prediction and
optimization [den Hertog and Postek, 2016]. Consequently, there has been increased interest
among the operations research and management science community to attack problems of
this flavor [Stubbs, 2016]. The applications are abundant and encompass several areas –
demand forecasting and price optimization [Ferreira et al., 2015], promotion planning [Cohen
et al., 2017], shipment decisions [Gallien et al., 2015], inventory management [Bertsimas
et al., 2016a], to name a few. The central goal of this chapter is to develop a framework in
which non-parametric machine learning techniques, originally designed for prediction, can
be adapted to provide high quality decisions for problems in OR/MS, which typically involve
mathematical optimization formulations.
Many important problems across a variety of fields fit into this framework. In healthcare,
for example, a doctor aims to prescribe drugs in specific dosages to regulate a patient’s vital
signs (outcome Y ). In such a setting, we have access to past data (X) about each patient
such as demographics, past medication history, genetic information, and what treatment (Z)
was administered. The patient outcomes are potentially affected by the patient character-
istics and choice of treatment. In revenue management, a store owner must decide how to
price various products in order to maximize profit. In online retail, companies decide which
products to display for a user to maximize sales. An online retailer can easily have access to
information about the customer, and may seek to price different products differently for var-
ious customers. In resource allocation problems, companies have to allocate finite resources
in order to minimize costs. For instance, consider a company with machines distributed
across the country, with each machine described by its state X – working, close to failure,
or failed. Such a company would want to use past data (such as machines’ historical failure
rate, features of the machines, relative importance of machines in the network) to decide
where to dispatch engineers in order to minimize the total cost of travel and cost incurred
due to potential disruptions in the network.
In this chapter, we emphasize the importance of optimizing the right objective, along with
appropriate parameter tuning for computing decisions. While cross validation is commonly
used to tune parameters in ML prediction problems, it can be slightly more challenging for
decision problems. The motivation behind tuning these parameters appropriately is that the
best predictive model might not always be the best model for decision making. We illustrate
this with a toy example. Consider a setting with a single covariate x ∼ U[0,1], uniformly
distributed between 0 and 1. The uncertainty of interest Y , is a function of x given by
Y(x) =
    2,   if x ≤ 0.5,
    1,   if 0.5 < x ≤ 0.95,
   −1,   if 0.95 < x ≤ 1.
In order to compute the decision, we have to solve the optimization problem:
min_{0 ≤ z ≤ 1} c(z; y) = |y + z|.
Suppose we are given n points (X^1, Y^1), . . . , (X^n, Y^n) sampled randomly. As a starting
approach, we regress Y ∼X and suppose we obtain the following tree:
[Tree: split on x ≤ 0.5; predict Y = 2 if true, Y = 1 otherwise.]
Figure 2-1: Tree constructed by regressing Y versus X on the training set.
We see that for this tree, the out-of-sample R^2 = 0.63, and the corresponding decisions
z(x) and costs incurred are given by:
0 ≤ x ≤ 0.5    ⟹ z(x) = 0, Avg. Cost = 0.5 × 2 = 1,
0.5 < x ≤ 0.95 ⟹ z(x) = 0, Avg. Cost = 0.45,
0.95 < x ≤ 1.0 ⟹ z(x) = 0, Avg. Cost = 0.05.
Consequently, we see that the out of sample prescriptive cost, which quantifies the perfor-
mance of decisions prescribed by this tree, is 1.50. Now, consider the following different
tree:
[Tree: split on x ≤ 0.95; predict Y = 1.5 if true, Y = −1 otherwise.]
Figure 2-2: A different decision tree.
This tree is worse in terms of its predictive performance of Y, as is evident from its out
of sample R^2 of 0.56, which is less than 0.63. Next, we consider the decisions z(x) and
costs incurred as follows:
x ≤ 0.95 ⟹ z(x) = 0, Avg. Cost = 0.5(2) + 0.45(1),
x > 0.95 ⟹ z(x) = 1, Cost = 0.
Thus, the average out of sample prescriptive cost = 1.45, which is lower than 1.50, and hence
better in terms of decisions.
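The toy example above can be checked numerically. The sketch below (with simulated data) hard-codes the two trees' predictions and uses the closed-form decision z = clip(−ŷ, 0, 1), which solves min_{0≤z≤1} |ŷ + z| for a point prediction ŷ:

```python
import numpy as np

def y_true(x):
    """The piecewise-constant ground truth Y(x) from the example."""
    return np.where(x <= 0.5, 2.0, np.where(x <= 0.95, 1.0, -1.0))

def best_z(y_hat):
    """argmin_{0 <= z <= 1} |y_hat + z| has the closed form clip(-y_hat, 0, 1)."""
    return np.clip(-y_hat, 0.0, 1.0)

def predict_tree1(x):   # split at 0.5: predicts 2 or 1 (the better R^2)
    return np.where(x <= 0.5, 2.0, 1.0)

def predict_tree2(x):   # split at 0.95: predicts 1.5 or -1 (the worse R^2)
    return np.where(x <= 0.95, 1.5, -1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200_000)
y = y_true(x)

cost1 = np.mean(np.abs(y + best_z(predict_tree1(x))))
cost2 = np.mean(np.abs(y + best_z(predict_tree2(x))))
# cost1 ≈ 1.50, cost2 ≈ 1.45: the worse predictor yields the better decisions
```

The simulation reproduces the analysis above: the tree with the lower R^2 incurs the lower prescriptive cost, because its split at 0.95 isolates the only region where a nonzero decision helps.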
2.1.1 Notation
Throughout this chapter, we use capital letters to refer to random quantities and lower case
letters to refer to deterministic quantities. The general problem we study is characterized
by the following components:
• Decision variable: z ∈ Z ⊂ R^{dz},
• Outcome: Y(z) ∈ Y ⊂ R^{dy} (we adopt the potential outcomes framework [Rosenbaum,
2002], in which Y(z) denotes the (random) quantity that would have been observed
had decision z been chosen),
• Auxiliary covariates (also called side information or context): x ∈ X ⊂ R^{dx},
• Cost function: c(z; y) ∶ Z × Y → R.
Thus, we use Z to refer to the decision randomly assigned by the (unknown) historical policy
and z to refer to a specific action (or, decision). For a given auxiliary covariate vector, x,
and a proposed decision, z, the conditional expectation E[c(z; Y) | X = x, Z = z] quantifies
the expectation of the cost function c(z;Y ) under the conditional measure in which X is
fixed as x and Z is fixed as z. We ignore details of measurability throughout and assume
this conditional expectation is well defined. Throughout this chapter, all norms are ℓ2 norms
unless otherwise specified. We use (X,Z) to denote vector concatenation.
2.1.2 Related Literature
In this section, we present an overview of some related approaches in the literature. Stochas-
tic optimization attempts to solve the problem
min_{z∈Z} E[c(z; Y)],   (2.1)
for some known convex cost function c(z; y) in z and convex feasible set Z, and where the
expectation is computed over the unknown distribution of Y . However, the distribution of
the random variable Y is typically unknown. As shown by Nemirovski and Shapiro [2006],
even estimating the objective for a given decision z can be a highly nontrivial problem.
Typically we have access to data, Y^1, . . . , Y^n, which represents historical observations of
the uncertainty y, rather than the distribution of Y . In this setting, the classical paradigm
for data-driven stochastic optimization is sample average approximation (or SAA) [Kleywegt
et al., 2002], [Shapiro and Nemirovski, 2005], where the empirical distribution over Y^1, . . . , Y^n
is used to approximate the full expectation in Problem (2.1). To be precise, SAA considers
the problem
min_{z∈Z} (1/n) ∑_{i=1}^{n} c(z; Y^i).   (2.2)
Clearly, this can be considered as a stochastic optimization problem with the distribution of
Y approximated by a discrete distribution over scenarios Y^1, . . . , Y^n with the probability of
each equal to 1/n. In fact, as n increases to infinity, under some mild conditions, Problem (2.2)
can be shown to be equivalent to the original stochastic optimization problem (2.1) [Shapiro
et al., 2009a]. However, the classical stochastic optimization framework is unable to include
contextual information provided by observed covariates x. In settings where knowledge of
these covariates is known at the time of implementing the decision, using this additional
knowledge can add substantial value. Recent years have seen tremendous interest in the
area of data-driven optimization. Much of this work combines ideas from the statistics and
machine learning literature with techniques from mathematical optimization.
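As a minimal, self-contained illustration of SAA (with an invented quadratic cost c(z; y) = (z − y)^2, for which the SAA minimizer is exactly the sample mean), Problem (2.2) can be approximated on a grid of decisions:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(10.0, 2.0, size=2_000)     # historical observations Y^1, ..., Y^n

# SAA objective (1/n) sum_i c(z; Y^i) with c(z; y) = (z - y)^2, minimized over
# a grid approximation of Z = [0, 20]; the exact minimizer is the sample mean
z_grid = np.linspace(0.0, 20.0, 2001)
saa_objective = ((z_grid[:, None] - samples[None, :]) ** 2).mean(axis=1)
z_saa = z_grid[np.argmin(saa_objective)]
# z_saa coincides with samples.mean() up to the grid resolution
```

For this cost the SAA decision converges to E[Y] as n grows, which is the equivalence to Problem (2.1) mentioned above; note that this decision ignores any covariate x, which is precisely the limitation the contextual Problem (2.3) addresses.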
In this case, the problem we now consider is
z(x) ∈ argmin_{z∈Z} E[c(z; Y) | X = x].   (2.3)
The optimized decision z(x) thus takes into account this potential knowledge about the
future uncertainty Y , and allows for higher quality decision making. Clearly, this is a gener-
alization of the classical problem (2.1), where contextual information is ignored for decision
making.
To solve Problem (2.3), one commonly used approach in the literature is to employ a Pre-
dict and Optimize (P&O) framework. As the name indicates, this approach involves solving
the problem of generating prescriptions from data in two steps. In the first step, a machine
learning model f(x) that predicts y is trained using past data (X^1, Y^1), . . . , (X^n, Y^n). In the second step, when X^0 is given, the corresponding predicted uncertainty is computed
according to the machine learning model, f(X^0), and this estimate is substituted into the
optimization problem to solve for the decision z. To be precise, the decision z(X^0) is computed
by solving

z(X^0) ∈ argmin_{z∈Z} c(z; f(X^0)).
For learning the function f , any of the several machine learning techniques that have been
proposed in the literature can potentially be used (see Hastie et al. [2009] for an overview).
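A bare-bones sketch of this two-step P&O pipeline follows, with invented data, a least-squares predictor for f, and a toy cost c(z; y) = 0.5z^2 − yz on Z = [0, 1], whose plug-in solution is clip(ŷ, 0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

# Step 1 (predict): fit f(x) = beta' x by least squares on (X^i, Y^i)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Step 2 (optimize): substitute the point prediction f(x0) into
#   min_{0 <= z <= 1} 0.5 z^2 - y z,  whose solution is clip(y, 0, 1)
def prescribe(x0):
    y_hat = x0 @ beta
    return np.clip(y_hat, 0.0, 1.0)

z0 = prescribe(np.array([0.5, -0.2, 0.1]))   # plug-in decision, near 0.95
```

Note that step 2 treats f(X^0) as if it were the true y: the prediction's uncertainty never enters the optimization, which is exactly the first drawback discussed next.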
However, one key drawback of this approach is that by substituting in the predicted y
directly, the optimization model does not take into account the uncertainty associated with
this prediction. Another key area where this approach could be potentially improved is
that the prediction model f is not aware of the downstream optimization model, which is a
consequence of the two-step solution approach. We point out that our work resolves both
issues. We next discuss recent work that addresses some of these issues and compare these
approaches with ours.
To resolve the second issue, Elmachtoub and Grigas [2017] propose an approach where
they also consider the problem of finding prediction functions f that lead to good prescrip-
tions. This method is based on the P&O framework and is restricted to optimization prob-
lems with linear objective functions c(z;Y ) = z′Y and linear predictive functions f(x) = Bx.
However, it is not clear how to extend this for nonlinear (in z) objectives c(z;Y ), or for
nonlinear prediction functions f . In other work, Tulabandhula and Rudin [2013] minimize
a combination of prediction loss along with the operational cost on an unlabeled dataset.
However, the operational cost is defined on the unlabeled data while the prediction loss is
defined on the labeled data, and this approach still follows the P&O methodology. For the feature-based
newsvendor problem, Rudin and Vahn [2014] use machine learning methods to predict the
optimal decision as a direct function of the observed covariates x. While the optimization
is performed in sample, the predicted decisions can potentially be infeasible for some points
in the test dataset. Kao et al. [2009] propose a method which also predicts the decision as a
linear function of the covariates. The regression coefficients are chosen as a convex combina-
tion of the usual least squares coefficients (obtained by minimizing prediction loss) and the
coefficients obtained by solving the prescription problem, which in this case is assumed to be
an unconstrained convex quadratic minimization problem. This convex combination param-
eter is chosen by cross validation. However, it is not clear how to extend this approach when
the optimization problem has constraints, or for the case of nonlinear predictive models.
Finally, we note that this approach is also based on the P&O framework. Another related
recent work is task based end to end learning, where the authors focus on quadratic opti-
mization prescriptive problems and propose neural network based approaches for computing
decisions [Donti et al., 2017].
Another recently proposed approach called Predictive to Prescriptive (PtP) analytics [Bert-
simas and Kallus, 2019] also uses a two-step approach, with the first step consisting of
training supervised non-parametric machine learning methods (k nearest neighbors, kernel
regression, trees and forests) to predict Y based on the covariates X. The key difference
from P&O is that in the second step, it does not directly substitute the predictions into the
optimization problem. Rather, it solves a weighted SAA with the weights dictated by the
prediction methods for that particular observation. For example, if f is a kNN predictor,
then this approach first finds the parameter k that results in the most accurate predictions of
y (minimizing the prediction error) over the training set (X^1, Y^1), . . . , (X^n, Y^n). Now, for
any x, they find the k nearest neighbors of x in the training set and solve an SAA over only
these k neighbors to compute the optimal decision z(x). They also show that this approach
is consistent, and essentially improves over P&O by considering uncertainty in the cost
estimate E[c(z;Y )∣X = x] as opposed to substituting the estimate into the cost function as
c(z;E[Y ∣X = x]). However, this is still a two-step approach to learning decisions from data
where the procedure for computing the first step machine learning model does not take into
account the quality of decisions computed by the model.
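A minimal sketch of the PtP idea for a kNN predictor follows (with invented newsvendor data; for this cost the weighted SAA over the k nearest neighbors reduces to a quantile of their Y values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
X = rng.uniform(0.0, 1.0, size=(n, 2))                       # covariates
Y = 50.0 + 40.0 * X[:, 0] + rng.normal(scale=5.0, size=n)    # demand depends on x

h, b = 1.0, 3.0   # newsvendor holding / backorder unit costs

def ptp_knn_prescription(x0, k=50):
    """Weighted SAA over the k nearest neighbors of x0 (uniform weights):
    minimize (1/k) sum_{i in N_k(x0)} [h (z - Y^i)+ + b (Y^i - z)+],
    whose minimizer is the b/(b+h) quantile of the neighbors' Y values."""
    idx = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    return np.quantile(Y[idx], b / (b + h))

z_low = ptp_knn_prescription(np.array([0.1, 0.5]))    # low-demand region
z_high = ptp_knn_prescription(np.array([0.9, 0.5]))   # high-demand region
# z_high > z_low: the prescription adapts to the covariate x
```

Unlike P&O, the decision here hedges over the k neighboring scenarios rather than a single point forecast; but k itself is still tuned for prediction accuracy, not decision quality, which is the gap our approach targets.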
Our approach is similar to the PtP approach in that we use several non-parametric
machine learning algorithms as well for prediction. However, the key difference from the PtP
approach is that we find the best machine learning algorithm that leads to the best decisions
z, as opposed to the best predictions. Another way of interpreting this is that we generate
scenarios (Y, Z) jointly, while PtP, or the SAA approach in general, generates scenarios y for
computing the prescription z. To achieve this, the objective we use to train these machine
learning methods is directly based on a mix of the prescription cost and prediction error as
opposed to just the latter, which is the case in both standard P&O and PtP. The key insight
behind the prescription term is that it directly quantifies the cost of the decision making
framework induced by any predictive f and optimizing this avoids the two-step approach
employed by both P&O and PtP. Additionally, this incorporates the uncertainty associated
with the estimates given by the prediction methods into the optimization model, as we
consider an SAA-like weighted estimate of the expectation E[c(z;Y )∣X = x] rather than use
the point estimates in our proposed methods.
We also note the connection with the field of structured prediction, a subfield in machine
learning that seeks to predict structured objects such as sequences, images, and graphs from
feature data. This predicted output must satisfy some constraints (see Goh and Jaillet
[2016] for some examples). In our case, the structured objects are decision variables that are
input to an optimization problem, and we present non-parametric learning methods in this
setting.
We also consider the setting in which the decision affects the outcome. For many appli-
cations, such as pricing, the demand for a product is clearly affected by the price. Bertsimas
and Kallus [2017] later studied the limitations of predictive approaches to pricing problems.
In particular, they demonstrated that confounding in the data between the decision and out-
come can lead to large optimality gaps if ignored. They proposed a kernel-based method for
data-driven optimization in this setting, but it does not scale well with the dimension of the
decision space. Mišić [2017] developed an efficient mixed integer optimization formulation
for problems in which the predicted cost is given by a tree ensemble model. This approach
scales fairly well with the dimension of the decision space but does not consider the need for
uncertainty penalization.
Another relevant area of research is causal inference (see Rosenbaum [2002] for an
overview), which concerns the study of causal effects from observational data. Much of
the work in this area has focused on determining whether a treatment has a significant effect
on the population as a whole. However, a growing body of work has focused on learning opti-
mal, personalized treatments from observational data. Athey and Wager [2017] proposed an
algorithm that achieves optimal (up to a constant factor) regret bounds in learning a treat-
ment policy when there are two potential treatments. Kallus [2017a] proposed an algorithm
to efficiently learn a treatment policy when there is a finite set of potential treatments.
Building on this approach, Bertsimas et al. [2019a] developed a tree-based algorithm that
learns to personalize treatment assignments from observational data. It is based on the op-
timal trees machine learning method [Bertsimas and Dunn, 2017], and has performed well in
experiments on both synthetic and real datasets. This approach involves minimizing a com-
posite objective which is a combination of prescriptive and predictive loss, which is analogous
to our objective that we consider in this chapter. In this setting, the decisions are finite,
and the objective is simply the outcome. Here, we allow continuous and multidimensional
decisions, along with potential constraints on the decisions.
Considerably less attention has been paid to problems with a continuous decision space.
Hirano and Imbens [2004] introduced the problem of inference with a continuous treatment,
and Flores [2007] studied the problem of learning an optimal policy in this setting. Recently,
Kallus and Zhou [2018] developed an approach to policy learning with a continuous deci-
sion variable that generalizes the idea of inverse propensity score weighting. Our approach
differs in that we focus on regression-based methods, which we believe scale better with the
dimension of the decision space and avoid the need for density estimation.
The idea of uncertainty penalization has been explored as an alternative to empirical risk
minimization in statistical learning, starting with Maurer and Pontil [2009]. Swaminathan
and Joachims [2015] applied uncertainty penalization to the offline bandit setting. Their set-
ting is similar to the one we study. An agent seeks to minimize the prediction error of their
decision, but only observes the loss associated with the selected decision. They assumed that
the policy used in the training data is known, which allowed them to use inverse propensity
weighting methods. In contrast, we assume ignorability, but not knowledge of the historical
policy, and we allow for more complex decision spaces. We note that uncertainty penaliza-
tion bears a superficial resemblance to the upper confidence bound (UCB) algorithms for
multi-armed bandits [Bubeck et al., 2012]. These algorithms choose the action with the
highest upper confidence bound on its predicted expected reward. Our approach, in con-
trast, chooses the action with the highest lower confidence bound on its predicted expected
reward (or lowest upper confidence bound on predicted expected cost). The difference is
that UCB algorithms choose actions with high upside to balance exploration and exploita-
tion in the online bandit setting, whereas we work in the offline setting with a focus on solely
exploitation.
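A toy sketch of the lower-confidence-bound flavor of uncertainty penalization described above follows (the cost samples are hand-picked for the example, not real data): penalizing each treatment's mean cost by a multiple of its standard error keeps a rarely observed but optimistically estimated treatment from being chosen.

```python
import numpy as np

# Cost samples observed under each treatment; treatment 2 is rarely observed,
# and its few samples happen to look good
obs = {
    0: np.full(200, 1.0) + np.tile([0.1, -0.1], 100),   # mean 1.0, many samples
    1: np.full(200, 1.1) + np.tile([0.1, -0.1], 100),   # mean 1.1, many samples
    2: np.array([0.2, 1.4, 0.5]),                       # mean 0.7, only 3 samples
}

lam = 2.0   # penalization strength (a tunable parameter)
def penalized_cost(z):
    """Mean cost plus lam times the standard error of the mean estimate."""
    c = obs[z]
    return c.mean() + lam * c.std(ddof=1) / np.sqrt(len(c))

z_naive = min(obs, key=lambda z: obs[z].mean())   # picks 2 out of sheer optimism
z_penalized = min(obs, key=penalized_cost)        # picks the well-observed 0
```

This mirrors the offline setting discussed above: since there is no opportunity to explore, the conservative (lower-confidence-bound) choice is the appropriate one, in contrast to the optimistic UCB rule used in online bandits.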
2.1.3 Contributions
The key contributions of this work are as follows.
1. We propose a general approach for solving the prescriptive problem with uncertain
parameters in its objective as a single-step optimization problem. This framework
provides high quality prescriptions by learning from past data and accommodates
powerful non-parametric machine learning methods such as k nearest neighbors, kernel
regression, trees and forests, which have been traditionally used for prediction. That
is, we directly train these machine learning methods for the parameters that lead to
the best decisions as opposed to predictions.
2. We adapt the coordinate descent approach of Dunn [2018], along with first order
methods from convex optimization to further improve the trees. We present algorithms
to aid in the scalability of our approach.
3. We develop an algorithmic framework for observational data-driven optimization that
allows the decision variable to take values on continuous and multidimensional sets.
4. We demonstrate the performance of the methods developed in computational experi-
ments. First, for the case where uncertainty is not affected by decisions, we apply our
methods on a portfolio optimization problem with synthetic data, and a newsvendor
problem with real data, and provide evidence that they output superior data-driven
decisions compared to state of the art methods, particularly for smaller sizes of the
training set. Next, in the case where uncertainty is affected by decisions, we consider
applications in personalized medicine, in which the decision is the dose of Warfarin to
prescribe to a patient, and in pricing where the action is the list of prices for several
products in a store.
2.1.4 Structure of the chapter
The structure of this chapter is as follows. In Section 2.2, we present some background on
prescriptive analytics, and outline our approach in brief. In the first part of the chapter,
we consider the case where uncertainty Y is unaffected by the implemented decision Z. We
present more details on our approach for adapting various non-parametric learning methods
in Section 2.3, followed by algorithms for training these methods in Section 2.4. In the
second part of the chapter, we consider the case of observational data, where uncertainty
Y is affected by the implemented decision Z. We present our approach in greater detail in
Section 2.5, followed by theoretical motivation and finite-sample and generalization bounds in
Section 2.5.2. We provide computational evidence of the methods developed in this chapter
on real and synthetic data in Section 2.6, and present our conclusions in Section 2.7.
2.2 Outline of our approach
In this section, we present some background on prescriptive methods and outline our ap-
proach. We first focus on the setting in which the decision, z, does not affect the uncertainty,
y. The historical training data (X^i, Y^i)_{i=1}^{n} comprises n observations (also referred to
as data points or samples). Each data point (X^i, Y^i) corresponds to the features (or covariates/contextual information/side information) X^i ∈ X ⊆ R^{dx} of the ith observation and the
realized uncertainty Y^i ∈ Y ⊆ R^{dy}. When the uncertainty y is perfectly known in advance,
the decision maker has to solve a deterministic optimization problem, given by
min_{z∈Z} c(z; y)   (2.4)
to arrive at the decision z ∈ Z ⊆ Rdz . However, the key challenge is that the uncertainty y
is not observed at the instant when the decision z needs to be implemented, and thus Prob-
lem (2.4) cannot be solved directly. At the time the decision needs to be made, the decision
maker has access to covariates x that potentially possess some prognostic information about
the unrealized y. In the presence of this additional knowledge, the decision maker seeks to
minimize the cost under the conditional expectation of Y ∣X = x, or equivalently, solve the
problem
$$\min_{z \in \mathcal{Z}} \; \mathbb{E}\bigl[c(z; Y) \mid X = x\bigr]. \tag{2.5}$$
In this chapter, we consider the problem of finding a policy that, given new contextual
information x, outputs a high quality decision z(x) that leads to good prescriptive perfor-
mance, i.e., low cost c(z(x); y) when y is realized out of sample. As opposed to approaches
reliant on knowledge of the distributions of Y or Y ∣X = x, both of which are typically
unknown, we develop methods which rely on data as the starting point. As part of this ap-
proach, we adapt popular non-parametric machine learning methods – k Nearest Neighbors,
local kernel regression, decision trees, and random forests – to develop their corresponding
prescriptive methods that compute high quality decisions z directly from the covariates x.
We further illustrate this setting with an example. Consider a problem in which a portfo-
lio manager has to allocate finite capital to various stocks (or financial assets). The compli-
cation is that these allocations (or investments) z depend on the future returns of the assets
y, which are unknown at the time of deciding the allocation. But, the decision maker has
access to covariate information x at the time of making this decision, such as earnings, sea-
sonality, Google or Twitter trends, performance of the S&P 500 index, past returns of other
similar assets, and market sentiment, which could potentially contain a signal about future re-
turns. Thus, the problem is to compute decisions z given past data (X^1, Y^1), . . . , (X^n, Y^n) and the current covariate information x.
Now, to make a decision z(x), we wish to solve Problem (2.5). Clearly, this conditional
expectation is not known and needs to be estimated from the past available data. In
order to estimate this conditional expectation, we consider estimators of the form [Bertsimas
and Kallus, 2019]
$$f(x) = \sum_{i=1}^{n} w_i^f(x)\, c(z; Y^i), \tag{2.6}$$
where the weights are nonnegative and sum to one, i.e.,
$$w_i^f(x) \ge 0 \;\; \forall\, 1 \le i \le n, \qquad \sum_{i=1}^{n} w_i^f(x) = 1.$$
These weights are determined by the non-parametric function f ∶ Rdx → Rdy , past training
data (X1, Y 1), . . . , (Xn, Y n), and the observed covariate x. To be precise, we consider f
such that its prediction of y, for any x, is given by
$$f(x) = \sum_{i=1}^{n} w_i^f(x)\, Y^i. \tag{2.7}$$
Intuitively, these weights encode the similarities between x and each of the corresponding
training set covariates X^1, . . . , X^n. For example, suppose f is a tree-based estimator and x
belongs to the leaf ℓ(x), which contains n(ℓ(x)) sample points, i.e.,
$$n(\ell(x)) = \bigl|\{\, j \in [n] : \ell(X^j) = \ell(x) \,\}\bigr|.$$
Then, the estimated conditional expectation of the cost in Equation (2.6) is given by
$$f(x) = \frac{1}{n(\ell(x))} \sum_{j=1}^{n} c(z; Y^j)\, \mathbb{1}\bigl(\ell(X^j) = \ell(x)\bigr). \tag{2.8}$$
In this case, it is easy to see that the weights are given by
$$w_i^f(x) = \begin{cases} \dfrac{1}{n(\ell(x))}, & \text{if } \ell(X^i) = \ell(x), \\ 0, & \text{otherwise}. \end{cases}$$
Note that this f also outputs a corresponding prediction of y for the observed x as
$$f(x) = \frac{1}{n(\ell(x))} \sum_{j=1}^{n} Y^j\, \mathbb{1}\bigl(\ell(X^j) = \ell(x)\bigr).$$
Now with these weights, the decision z(f, x) is obtained by solving the corresponding
optimization problem
$$z(f, x) \in \arg\min_{z \in \mathcal{Z}} \; \mathbb{E}\bigl[c(z; Y) \mid f, X = x\bigr],$$
or equivalently,
$$z(f, x) \in \arg\min_{z \in \mathcal{Z}} \; \sum_{i=1}^{n} w_i^f(x)\, c(z; Y^i). \tag{2.9}$$
Note that, as a consequence of the nonnegativity of the weights, Problem (2.9) is a convex
minimization problem for each x whenever c(·; y) is convex in z and Z is convex.
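Problem (2.9) is a weighted sample-average approximation. As a concrete illustration (an example of ours, not code from the thesis), consider the newsvendor cost c(z; y) = c_h·max(z − y, 0) + c_b·max(y − z, 0): for this cost, the weighted SAA minimizer is a weighted quantile of the demand samples and can be computed in closed form.

```python
def weighted_saa_newsvendor(weights, demands, c_hold=1.0, c_back=3.0):
    """Solve z(f, x) in argmin_z sum_i w_i(x) * c(z; Y^i) for the newsvendor
    cost c(z; y) = c_hold*max(z - y, 0) + c_back*max(y - z, 0).
    The minimizer is the weighted c_back/(c_back + c_hold)-quantile of the
    demand samples that carry positive weight."""
    tau = c_back / (c_back + c_hold)
    cum = 0.0
    for y, w in sorted(zip(demands, weights)):
        cum += w
        if cum >= tau - 1e-12:  # first sample where cumulative weight hits tau
            return y
    return max(demands)
```

With uniform weights this reduces to the plain SAA quantile; with kNN-style weights (1/k on neighbors, 0 elsewhere), only the neighbors' demand samples influence the decision.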
Now, the question arises: how do we choose this function f? We wish to ensure that
the decision z(f, x) induced by f has good prescriptive performance, or equivalently,
attains a low value of c(z(f, x); y), where y is the realized uncertainty. Thus, we propose
the following formulation, Problem (2.10), in which we optimize over functions
f : R^{d_x} → R^{d_y} that lead to good prescriptive performance of their induced decisions.
$$\begin{aligned}
\min_{f \in \mathcal{F},\, z(f, X^i)} \quad & \sum_{i=1}^{n} c\bigl(z(f, X^i); Y^i\bigr) \\
\text{subject to} \quad & z(f, X^i) \in \arg\min_{z \in \mathcal{Z}} \sum_{j=1}^{n} w_j^f(X^i)\, c(z; Y^j) \quad \forall\, 1 \le i \le n.
\end{aligned} \tag{2.10}$$
The central idea behind solving Problem (2.10) is that it directly optimizes the policy cost
of the prescriptive method used. Indeed, the ith term in the objective signifies the cost
incurred when decision z(f,X i) is implemented and uncertainty Y i is realized. The ith
constraint stipulates that each z(f,X i) is the optimal decision for X i under f and thus is
representative of the actual decision-making process. In the previously described example
where f is a tree predictor, the estimated cost at X^i can be written as
(1/n(ℓ(X^i))) Σ_{j: ℓ(X^j) = ℓ(X^i)} c(z; Y^j), where ℓ(X^i)
is the leaf of the tree into which X^i falls. On implementing this z(f, X^i), we observe a
cost of c(z(f,X i);Y i), which depends on the realized uncertainty Y i. When we consider the
average cost of this policy imposed by f on the whole sample of n training points, we arrive
at the objective in Problem (2.10).
In other words, we train the function f while taking into account its prescriptive perfor-
mance, by noting that each z(f,X i) is the solution to an optimization problem that depends
on f itself. This is in direct contrast to traditional approaches which involve learning f based
on the predictive error, followed by solving an appropriate optimization problem over Z for
the best decision using the prediction or output of f .
We further impose the condition that the prediction function f also accurately estimates
the uncertainty y. That is, we require that f deliver high-quality prescriptions while, at
the same time, remaining reasonably close to the actual values in terms of its predictions. We
enforce this by choosing a loss function ℓ(·, ·), which we typically set to the squared
loss, i.e., ℓ(x, y) = ‖x − y‖². We penalize the difference between the realized uncertainty Y^i and
the predicted uncertainty f(X i). Note that the predicted uncertainty is also a weighted
estimate of the training set uncertainties, with the same weights used for estimating the
conditional mean cost. We explain this penalization in greater detail in Section 2.3.5, where
we point out that in the absence of such a penalizing factor, f can become too “optimistic”
in its prescriptions. Following this idea, we consider Problem (2.11), which balances the
prescription and prediction errors:
$$\begin{aligned}
\min_{f \in \mathcal{F},\, z(f, X^i)} \quad & \mu \sum_{i=1}^{n} c\bigl(z(f, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \ell\bigl(Y^i, f(X^i)\bigr) \\
\text{subject to} \quad & z(f, X^i) \in \arg\min_{z \in \mathcal{Z}} \sum_{j=1}^{n} w_j^f(X^i)\, c(z; Y^j) \quad \forall\, 1 \le i \le n,
\end{aligned} \tag{2.11}$$
where the prescription factor 0 < µ < 1 is a hyperparameter that controls the tradeoff between
prescription and prediction objectives. Thus, this approach unifies the two steps – prediction
and prescription – by treating this as a single-step problem. In fact, this approach can be
viewed as a generalization of Bertsimas et al. [2019a] for the case of f described by a tree
function, where Z = {1, . . . , m}, with the ith constraint simply assigning unit X^i to the
decision with the lowest average cost in the leaf to which X^i belongs.
2.3 Methods for Joint Predictive-Prescriptive Analyt-
ics
In this section, we describe four non-parametric machine learning methods, and how we
adapt them for the purpose of prescription.
2.3.1 k Nearest Neighbors (kNN)
In this section, we present the k Nearest Neighbors method for joint prescriptive analytics.
The classical kNN method for prediction only considers the k nearest neighbors of x in the
training set and ignores the rest [Altman, 1992]. The predicted outcome f(x) is
$$f(x) = \frac{1}{k} \sum_{i:\, X^i \in N_k(x)} Y^i,$$
where
$$N_k(x) = \Bigl\{ X^i,\; i = 1, \ldots, n : \sum_{j=1}^{n} \mathbb{1}\bigl[\|x - X^i\| \ge \|x - X^j\|\bigr] \le k \Bigr\}$$
is the set of k nearest neighbors
of x. In case of ties, we give priority to points with lower index values. In effect, the weights
w_i(x) are given by
$$w_i(x) = \begin{cases} \dfrac{1}{k}, & \text{if } X^i \text{ is a } k\text{NN of } x, \\ 0, & \text{otherwise}. \end{cases}$$
The distance metric ∥ ⋅ ∥ is usually chosen to be the Mahalanobis metric, which is
$$\|x - y\|_{\Sigma^{-1}}^2 = (x - y)^{\top} \Sigma^{-1} (x - y), \tag{2.12}$$
where x, y are any two points, and Σ is the sample covariance matrix of the training data.
Applying this technique in our joint prescriptive analytics framework, for each k in a grid
of potential k values, we compute the objective
$$L_\mu(k) = \mu \sum_{i=1}^{n} c\bigl(z(k, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \Bigl\| Y^i - \frac{1}{k} \sum_{j:\, X^j \in N_k^{-X^i}(X^i)} Y^j \Bigr\|^2,$$
where
• z(k, X^i) is the optimal solution to the SAA problem over the k nearest neighbors of X^i, i.e.,
$$z(k, X^i) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{k} \sum_{j:\, X^j \in N_k^{-X^i}(X^i)} c(z; Y^j) \quad \forall\, 1 \le i \le n,$$
• N_k^{-X^i}(X^i) is the set of k nearest neighbors of X^i in the training set, excluding X^i itself. These
neighbors are computed based on the Mahalanobis distance metric (Equation (2.12)).
Cross validation to compute k: Over a grid of µ values between 0 and 1, we compute
L_µ(k), and find the best k for each µ as the k*(µ) that leads to the smallest L_µ(k), i.e.,
$$k^*(\mu) = \arg\min_{k}\, L_\mu(k).$$
Thus, for each µ we compute a k*(µ), and we denote this set of k values for the different µ as
Ω. Now, we choose the final value k* as the value of k within Ω that minimizes
the prescription error Σ_{i=1}^n c(z(k, X^i); Y^i), or
$$k^* = \arg\min_{k \in \Omega} \; \sum_{i=1}^{n} c\bigl(z(k, X^i); Y^i\bigr). \tag{2.13}$$
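As a small illustration of the weight computation (an example sketch, not the thesis's implementation), the code below evaluates the squared Mahalanobis distance of Equation (2.12) for a given inverse covariance matrix and returns the 1/k weights, breaking ties in favor of lower training indices as described above.

```python
def mahalanobis_sq(u, v, sigma_inv):
    """Squared Mahalanobis distance (Equation (2.12)); sigma_inv is the
    inverse sample covariance matrix, given as a list of lists."""
    d = [a - b for a, b in zip(u, v)]
    return sum(d[i] * sigma_inv[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

def knn_weights(x, X_train, k, sigma_inv):
    """kNN weights: 1/k on the k nearest training covariates, 0 elsewhere.
    Ties are broken in favor of lower training indices, as in the text."""
    order = sorted(range(len(X_train)),
                   key=lambda i: (mahalanobis_sq(x, X_train[i], sigma_inv), i))
    neighbors = set(order[:k])
    return [1.0 / k if i in neighbors else 0.0 for i in range(len(X_train))]
```

With these weights, z(k, X^i) is simply the unweighted SAA over the selected neighbors, as in the bullet above.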
2.3.2 Nadaraya-Watson Kernel Regression (KR)
In this section, we present the Nadaraya-Watson kernel regression method for prescriptive
analytics. Nadaraya-Watson kernel regression (we shall refer to this as KR; [Nadaraya,
1964, Watson, 1964]) is a local predictive method, where the prediction for a given point x is
computed as a weighted estimator of the training samples y. These weights depend on how
“similar” the corresponding training X samples are to the new point x. The prediction for
x is
$$f(x) = \sum_{i=1}^{n} w_i(x, h)\, Y^i,$$
where the weights w_i(x, h) are given by
$$w_i(x, h) = \frac{K\bigl((X^i - x)/h\bigr)}{\sum_{j=1}^{n} K\bigl((X^j - x)/h\bigr)}.$$
Here, h > 0 is the bandwidth parameter, which is typically tuned to a particular dataset.
K ∶ Rdx → R represents the kernel, which for this work we restrict to be a nonnegative one,
i.e., K ∶ Rdx → R+. Some commonly used nonnegative kernels are:
1. Uniform: K(x) = (1/2) 𝟙[‖x‖ ≤ 1].
2. Epanechnikov: K(x) = (3/4)(1 − ‖x‖²) 𝟙[‖x‖ ≤ 1].
3. Tricubic: K(x) = (70/81)(1 − ‖x‖³)³ 𝟙[‖x‖ ≤ 1].
4. Gaussian: K(x) = (1/√(2π)) exp(−‖x‖²/2).
Next, we discuss how to apply this technique within our single-step prescriptive analytics
framework. For each h, we compute the objective
$$L_\mu(h) = \mu \sum_{i=1}^{n} c\bigl(z(h, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \Bigl\| Y^i - \sum_{j \ne i} w_j(X^i, h)\, Y^j \Bigr\|^2,$$
where
$$z(h, X^i) \in \arg\min_{z \in \mathcal{Z}} \; \sum_{j \ne i} w_j(X^i, h)\, c(z; Y^j) \quad \forall\, 1 \le i \le n.$$
Cross validation to compute h: Over a grid of µ values between 0 and 1, we compute
L_µ(h), and find the best h for each µ as the h*(µ) that leads to the smallest L_µ(h), i.e.,
$$h^*(\mu) = \arg\min_{h}\, L_\mu(h).$$
Note that for each µ, we compute an h*(µ), and we denote this set of h values as Ω. Now,
we choose the final value h* as the value of h within Ω that leads to the smallest
prescription error Σ_{i=1}^n c(z(h, X^i); Y^i). To be precise,
$$h^* = \arg\min_{h \in \Omega} \; \sum_{i=1}^{n} c\bigl(z(h, X^i); Y^i\bigr).$$
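A minimal sketch of the Gaussian-kernel weights (scalar covariates for brevity; an illustrative example, not the thesis's implementation):

```python
import math

def kr_weights(x, X_train, h):
    """Nadaraya-Watson weights w_i(x, h) with a Gaussian kernel and
    bandwidth h; the 1/sqrt(2*pi) factor cancels in the normalization."""
    kvals = [math.exp(-((xi - x) / h) ** 2 / 2.0) for xi in X_train]
    total = sum(kvals)
    return [kv / total for kv in kvals]

def kr_prediction(x, X_train, Y_train, h):
    """Prediction f(x) = sum_i w_i(x, h) * Y^i."""
    return sum(w * y for w, y in zip(kr_weights(x, X_train, h), Y_train))
```

The same weights, plugged into the weighted SAA of Problem (2.9), yield the prescribed decision z(h, x).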
2.3.3 Trees
Traditionally, regression (or classification) trees for the purpose of prediction are trained
by choosing splits that lead to low prediction error. These trees are trained by recursively
partitioning the X space into leaves in order to minimize the least squared error, or some
other metric such as log deviance. In these trees, each leaf predicts the same value for all
the points falling in it. The predictions given by a tree τ with L leaves can be written as
$$\tau(x) = \sum_{i=1}^{L} \gamma_i\, \mathbb{1}(x \in \mathcal{X}_i),$$
where {(𝒳_i, γ_i)}_{i=1}^L are the leaves and values that parametrize τ and are estimated from the
data.
In this section, we outline the problem formulation for learning trees that lead to high-quality
decisions directly from data. Given a tree τ with L leaves denoted by 𝒳_1, . . . , 𝒳_L, and
a candidate x, the PtP approach dictates that we solve the following weighted SAA problem
to obtain the decision z(τ, x):
$$z(\tau, x) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{n(\ell_\tau(x))} \sum_{i=1}^{n} \sum_{j=1}^{L} \mathbb{1}[x \in \mathcal{X}_j]\, \mathbb{1}[X^i \in \mathcal{X}_j]\, c(z; Y^i).$$
Equivalently, we can write this as
$$z(\tau, x) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{n(\ell_\tau(x))} \sum_{i:\, \ell_\tau(X^i) = \ell_\tau(x)} c(z; Y^i),$$
where
• ℓ_τ(x) denotes the leaf of the tree τ to which x belongs, and
• n(ℓ) is the number of training samples in the leaf ℓ.
Following our approach and using the above observation, the problem of learning the tree τ
that leads to good decisions can be formulated as follows
$$\begin{aligned}
\min_{\tau,\, \{z(\tau, X^i)\}_{i=1}^{n}} \quad & L_\mu(\tau) = \mu \sum_{i=1}^{n} c\bigl(z(\tau, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \Bigl\| Y^i - \frac{1}{n(\ell_\tau(X^i))} \sum_{j:\, X^j \in \ell_\tau(X^i)} Y^j \Bigr\|^2 \\
\text{subject to} \quad & z(\tau, X^i) \in \arg\min_{z \in \mathcal{Z}} \sum_{j:\, X^j \in \ell_\tau(X^i)} c(z; Y^j) \quad \forall\, 1 \le i \le n.
\end{aligned} \tag{2.14}$$
Thus, the tree which includes the splits and decisions associated with each leaf is computed
by minimizing the net objective in Problem (2.14).
Due to the discrete nature of the tree τ where it splits the X space into leaves, each leaf l
will have a decision z(τ, l) associated with it, which is the solution to the SAA problem solved
over the samples in that leaf. With this observation, Problem (2.14) can be equivalently
written as
$$\begin{aligned}
\min_{\tau,\, \{z(\tau, \ell_j)\}_{j=1}^{L}} \quad & \mu \sum_{j=1}^{L} \sum_{i:\, X^i \in \ell_j} c\bigl(z(\tau, \ell_j); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \Bigl\| Y^i - \frac{1}{n(\ell_\tau(X^i))} \sum_{j:\, X^j \in \ell_\tau(X^i)} Y^j \Bigr\|^2 \\
\text{subject to} \quad & z(\tau, \ell_j) \in \mathcal{Z} \quad \forall\, 1 \le j \le L.
\end{aligned} \tag{2.15}$$
2.3.4 Random Forests
Breiman [2001] extended decision trees to reduce the variance of their
predictions by training several trees on randomly chosen subsamples of the data and aggre-
gating their individual outputs. Following this idea, the decision prescribed by a forest T = {τ_k}_{k=1}^K, a collection
of K trees, is given by
$$z(T, X^i) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n(\ell_k(X^i))} \sum_{j:\, X^j \in \ell_k(X^i)} c(z; Y^j),$$
where ℓ_k(x) denotes the leaf of the kth tree to which x belongs.
Following our approach and using the above observation, the problem of solving for the
prescriptive random forest T can be formulated as
$$\begin{aligned}
\min_{T} \quad & \mu \sum_{i=1}^{n} c\bigl(z(T, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \Bigl\| Y^i - \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n(\ell_k(X^i))} \sum_{j:\, X^j \in \ell_k(X^i)} Y^j \Bigr\|^2 \\
\text{subject to} \quad & z(T, X^i) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n(\ell_k(X^i))} \sum_{j:\, X^j \in \ell_k(X^i)} c(z; Y^j) \quad \forall\, 1 \le i \le n.
\end{aligned} \tag{2.16}$$
2.3.5 Penalizing the prediction error of f
In this section, we elaborate further on why the prediction error needs to be penalized along
with the prescriptive loss which quantifies the quality of the decisions induced by any f .
Note that in our main Problem (2.11), we minimize the objective with respect to both f
and each of the z(f,X i) variables. Suppose the cost function c is linear in the uncertainty
and decision variable, i.e., c(z; y) = y′z. As pointed out by Elmachtoub and Grigas [2017],
under the P&O framework (which is identical to PtP when c is linear), the ith constraint
stipulates that each z(f, X^i) is the solution to min_{η∈Z} η′f(X^i). Note that if f(x) = 0,
then the ith constraint reduces simply to z(f, X^i) ∈ Z, so the optimizer is free to set
z(f, X^i) ∈ argmin_{η∈Z} η′Y^i for each i. This attains the smallest possible objective value
for Problem (2.11), and thus f = 0 is trivially optimal.
This problem persists even under the PtP framework, i.e., if z(f, X^i) is chosen as the solution
to a re-weighted SAA with the weights depending on X^i. Suppose f(x) = Σ_{i=1}^n w_i(x) Y^i,
where w_i(x) ≥ 0 for all i. In this case, each z(f, x) is the solution to min_{η∈Z} η′(Σ_{i=1}^n w_i(x) Y^i). Suppose
these weights are chosen via trees. This could incentivize splits in the trees that lead
to a leaf ℓ where Σ_{i: X^i ∈ ℓ} Y^i = 0, which again leads to the same issue mentioned above.
To mitigate this issue, we stipulate that along with good prescriptive performance of f ,
the predicted value f(x) be close to the true value y. We use cross validation to choose the
prescription factor 0 < µ < 1 that balances both these errors.
2.4 Optimization algorithms
In this section, we present algorithms for learning the prescriptive versions of the trees and
forests outlined in Section 2.3. We present two algorithms for training these trees – the first
one a greedy algorithm based on the recursive CART heuristic Breiman et al. [1984], and the
second one based on its more recent coordinate descent based improvement Bertsimas and
Dunn [2019]. First, we present the greedy algorithm for training these trees in Algorithm (4),
and its extension to random forests in Algorithm (5). We only describe these algorithms
here, and defer the full details to the Appendix.
2.4.1 Greedy algorithm for learning trees
The greedy algorithm outlined in Algorithm (4) attempts to find a partition of the covariate
space in order to minimize the net loss. It does so by iterating over axis-parallel splits (splits
of the form x_j ≤ a versus x_j > a) to find the best split at each level of the tree. This proceeds
recursively on the training set until either the maximum depth Δ_max of the tree is reached,
or any further split would result in fewer than n_min samples in a leaf. The parameters n_min and Δ_max
are hyperparameters that need to be set by the user or can be chosen via cross-validation.
Note that while the routine GreedyTree is written in Algorithm (4) for any input data
set S, we obtain the greedy tree optimized on the training set by calling GreedyTree(S_n = {(X^1, Y^1), . . . , (X^n, Y^n)}, 0).
Once this tree is trained, we use cost complexity pruning (as detailed in Section 2.4
of Dunn [2018]) to regularize the tree and control for overfitting. We omit this in the
presentation of our algorithm for the sake of brevity.
2.4.2 Prescriptive Random Forests
Next, we present our algorithm for training prescriptive random forests. Recall that a random
forest for predictive purposes such as classification/regression Breiman [2001] with K trees is
computed by training each tree greedily, and the final output is obtained by aggregating the
individual tree predictions. We follow a similar approach here as well, where we compute
each of the K trees by training them in a greedy manner (via Algorithm (4)). In effect,
this results in an approximate solution to Problem (2.16). The idea is that aggregating
individual trees can potentially reduce variance while simultaneously not increasing the bias
significantly, assuming the trees are sufficiently independent. For the sake of completeness,
we present the algorithm to train the random forests in Algorithm (5). The random forests
algorithm is similar to the one for trees, but with an additional parameter 0 < α < 1 that
restricts the number of features available to each tree to ⌊α d_x⌋ (one can also consider √(d_x)) in
order to promote independence among the trees for variance reduction.
We note that there is one key difference in generating prescriptions z(x) for any x,
compared to the traditional method of averaging the outputs of the individual trees in
regression. For any new X^0, we do not set the prescription z(X^0) equal to the average
(1/K) Σ_{k=1}^K z(τ^k, X^0) (which could be infeasible), but instead we obtain it by solving the weighted
optimization problem
$$z(X^0) \in \arg\min_{z \in \mathcal{Z}} \; \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n(\ell_k(X^0))} \sum_{j:\, X^j \in \ell_k(X^0)} c(z; Y^j),$$
where X^0 falls in leaf ℓ_k(X^0) of tree τ^k for each k ∈ [K].
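Equivalently, the forest induces a single set of weights, w_i = (1/K) Σ_k 𝟙[ℓ_k(X^i) = ℓ_k(X^0)] / n(ℓ_k(X^0)), which can then be fed into one weighted SAA. The sketch below illustrates this aggregation (leaf assignments assumed precomputed; an example of ours, not the thesis's code):

```python
def forest_weights(train_leaves_per_tree, x_leaves):
    """Forest weights: each tree k contributes 1/(K * n(leaf_k(x))) to every
    training point that shares the query point's leaf in tree k.
    train_leaves_per_tree[k][i] = leaf of tree k containing X^i;
    x_leaves[k] = leaf of tree k containing the query point X^0."""
    K = len(train_leaves_per_tree)
    n = len(train_leaves_per_tree[0])
    w = [0.0] * n
    for k in range(K):
        members = [i for i in range(n)
                   if train_leaves_per_tree[k][i] == x_leaves[k]]
        for i in members:
            w[i] += 1.0 / (K * len(members))
    return w
```

Solving the weighted SAA with these aggregated weights, rather than averaging per-tree decisions, guarantees the prescription remains feasible in Z.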
2.4.3 Local search algorithms for Prescriptive Trees
In this section, we describe the local search procedure to further improve the trees computed
via the greedy algorithm (4). We note that the top-down nature of Algorithm (4) can lead
to locally optimal solutions, and the following approach iteratively improves the prescriptive
tree until no further improvement can be found.
The local search algorithm takes as input a prescriptive tree, which we provide by calling
Algorithm (4). The search procedure iterates over a randomized ordering of the nodes of the
input tree, and each node is further improved by any of the following three steps (with the
first two steps applying to any non-leaf node):
• perturbing the split (both feature and threshold value) to improve the net objective,
or
• deleting the split, and replacing it with either its left or right children, or
• finally, if the node is a leaf, then creating a new split (provided the minimum leaf size
and maximum depth conditions are satisfied).
Once there is no further improvement, then we terminate the search and return the final tree.
Additionally, we run this search procedure from various starting points and choose the best
final solution out of all the potential solutions identified in this process. The local search
algorithm is detailed in Algorithm (6). As part of this algorithm, we call the subroutine
OptimizeNode which takes as input a candidate node, and the subset of training data that
falls into this node, and outputs an improved node by either perturbing the split, deleting
the node, or further branching if it is a leaf node. For perturbing the split and potentially
improving it, we define another subroutine PerturbSplit, that varies the split parameters,
and calculates the new error by updating both subtrees rooted at left and right children nodes
of the candidate node. Next, we present two additional ideas which further improve the local
search algorithm.
Pruning splits based on prediction error
This idea uses the fact that the objective comprises two parts – prescription error and prediction
error. For µ < 1, we note that the prediction error contributes to the objective, and is typically
easier to compute than the prescription error which involves solving constrained optimization
problems. We use the presence of this term to narrow down the list of potential splits in
our search. At each node, while choosing a potential split, we rank all the O(np) candidate splits
in increasing order of the resulting prediction error. Then, we compute the prescription
term for only the top few of these splits, and finally choose the one that leads to the lowest
composite objective.
Caching for warm starts
We store previously computed results in memory and use them when the same result is needed
again, rather than recomputing them. This is particularly useful when a lot of candidate
subproblems need to be evaluated and some of the computations may be repeated. Applying
this idea here, we store the optimal solutions z, the prescription, and prediction errors for
each of the candidate splits evaluated so far. Suppose we have computed and stored these
values for each of M sets of indices I1, . . . , IM . Now, when our algorithm wishes to evaluate
the objective for a new set I, we use the solution with least cost for the new objective as an
initial starting point for the first order algorithms.
Finally, note that we need to compute a tree for different values of µ. Since µ only
dictates how the prescription and prediction terms are weighted, we can reuse
the individual values of these two terms for each split by looking up values computed for
previous choices of µ.
The full algorithm involves computing a greedy tree with Algorithm (4) with caching and
sorting splits based on predictive error. Next, using this tree as a warm start, we run the
local search algorithm with caching (to compute the optimal decisions within each leaf). We
repeat this using different warm starts, and choose the best among all the generated trees.
We present the full algorithm (9) in the Appendix.
2.5 Observational data with decision-dependent uncer-
tainty
In this section, we extend the approach developed so far to the setting in which the decision,
Z, affects the realization of the uncertainty, Y . From a mathematical standpoint, we study
the problem in which a decision maker seeks to optimize a known objective function that
depends on an uncertain quantity. We allow the auxiliary covariates X, decision variable
z, and outcome Y to take values on multi-dimensional, continuous sets. A decision-maker
seeks to minimize the conditional expected cost:
$$\min_{z \in \mathcal{Z}} \; \mathbb{E}\bigl[c(z; Y(z)) \mid X = x\bigr]. \tag{2.17}$$
Since the distribution of Y (z) is unknown, it is not possible to solve Problem (2.17)
exactly. However, we assume that we have access to observational data, consisting of n
independent and identically distributed observations, (X i, Zi, Y i) for i = 1, . . . , n. Recall
that each of these observations consists of an auxiliary covariate vector, a decision, and an
observed outcome. This type of data presents two challenges that differentiate our prob-
lem from a predictive machine learning problem. First, it is incomplete. We only observe
Y i ∶= Y i(Zi), the outcome associated with the applied decision. We do not observe what
the outcome would have been under a different decision. Second, the decisions were not
necessarily chosen independently of the outcomes, as they would have been in a randomized
experiment, and we do not know how the decisions were assigned. Following common prac-
tice in the causal inference literature, we make the ignorability assumption of Hirano and
Imbens [2004].
Assumption 1 (Ignorability).
$$Y(z) \perp\!\!\!\perp Z \mid X \quad \forall z \in \mathcal{Z}.$$
In other words, we assume that historically the decision Z has been chosen as a function
of the auxiliary covariates X. There were no unmeasured confounding variables that affected
both the choice of decision and the outcome. In fact, Bertsimas and Kallus [2017] show that
without this assumption, the problem is not well posed. Under this assumption, we are able
to rewrite the objective of (2.17) as
$$\mathbb{E}\bigl[c(z; Y) \mid X = x, Z = z\bigr].$$
This form of the objective is easier to learn because it depends only on the observed outcome,
not on the counterfactual outcomes. A direct approach to solve this problem is to use a
regression method to predict the cost as a function of x and z and then choose z to minimize
this predicted cost. If the selected regression method is uniformly consistent in z, then the
action chosen by this method will be asymptotically optimal under certain conditions. (We
will formalize this later.) However, this requires choosing a regression method that ensures
the optimization problem is tractable. For this work, we restrict our attention to linear and
tree-based methods, such as CART [Breiman et al., 1984] and random forests [Breiman,
2001], as they are both effective and tractable for many practical problems.
A key issue with the direct approach is that it tries to learn too much. It tries to learn
the expected outcome under every possible decision, and the level of uncertainty associated
with the predicted expected cost can vary between different decisions. This method can lead
us to select a decision which has a small point estimate of the cost, but a large uncertainty
interval, i.e., high variance of the cost.
Because of Assumption 1,
$$\mathbb{E}\bigl[c(z; Y(z)) \mid X = x\bigr] = \mathbb{E}\bigl[c(z; Y) \mid X = x, Z = z\bigr],$$
so we focus on learning c(z;Y ) as a function of x and z. We jointly learn and make decisions
as in (2.11), with a few modifications.
We emphasize that our methods assume access to past data and no additional information
about this distribution. Finally, we note that in this chapter we only consider the single
period problem and not the multiperiod one, in which the decision maker can make some
decisions after the uncertainty is realized. Also, we consider the setting where the constraint
space Z is unaffected by the uncertainty y, i.e., y only affects the cost c.
2.5.1 Uncertainty penalization and Parameter tuning
In addition, we introduce the idea of uncertainty penalization to prevent the method from
making decisions that have a small, but highly uncertain, predicted cost. We define the bias
and variance of the estimators, and introduce a composite objective (which we will describe in
greater detail in (2.18)) that penalizes the sum of bias and variance terms, each multiplied
by parameters λ1 and λ2. While this results in a different problem than in Section 2.3
and introduces additional parameters, we learn these the same way as before – tuning for
prescriptive performance. In order to tune the parameters λ1, λ2, and any other tuning
parameters associated with F , we perform cross validation. We split the data into a training
and validation set, train f on the training data, and compute the objective of (2.18) on the
validation data with fixed f . We repeat this for various combinations of tuning parameters
and then select the best combination.
Because Y is now a function of z, we do not know what the outcome would have been
if we had chosen a different decision. We need to estimate the counterfactuals with the
machine learning method. However, if we do not restrict the machine learning method, it
can choose trivial solutions that will lead to a very small objective value for (2.11). For
example, if c(z;Y (z)) = Y (z) ≥ 0, an optimal tree may isolate a single training example with
very small Y (z) and then propose decisions such that all of the new observations fall in that
leaf with small predicted cost. This single example has little impact on the predictive error,
but an outsized impact on the prescriptive error. As before, we denote the machine learning
estimator, f ∈ F , as a linear combination of the training examples:
$$f(x, z) = \sum_{i=1}^{n} w_i^f(x, z)\, c(z; Y^i).$$
In order to prevent the learning procedure from picking an estimator that is overly optimistic about
its predicted costs, we require that the weights, w_i^f(x, z), satisfy a generalization of the honesty
property of Wager and Athey [2018].
Assumption 2 (Honesty). The model trained on (X^1, Z^1, Y^1), . . . , (X^n, Z^n, Y^n) is honest,
i.e., the weights, w_i^f(x, z), are determined independently of the responses, Y^1, . . . , Y^n.
For tree-based methods, this assumption can be satisfied by either ignoring the response
variables while building the tree, or by separating the data into two sets and using one set
to make splits and the other to make predictions in the leaves. If Assumption 2 holds, the
conditional variance of f(x, z) given (X1, Z1), . . . , (Xn, Zn) is given by
$$V^f(x, z) := \sum_{i=1}^{n} \bigl(w_i^f(x, z)\bigr)^2\, \mathrm{Var}\bigl(c(z; Y^i) \mid X^i, Z^i\bigr).$$
Because f(x, z) may be a biased predictor, we also introduce a term that penalizes the
conditional bias of the predicted cost given (X1, Z1), . . . , (Xn, Zn). Since the true cost is
unknown, it is not always possible to exactly compute this bias. Instead, we compute an
upper bound under a Lipschitz assumption (details in Section 2.5.2).
$$B^f(x, z) := \sum_{i=1}^{n} w_i^f(x, z)\, \bigl\| (X^i, Z^i) - (x, z) \bigr\|_2.$$
With these penalty terms, we rewrite (2.11) as:
$$\begin{aligned}
\min_{f \in \mathcal{F},\, z(f, X^i)} \quad & \mu \sum_{i=1}^{n} c\bigl(z(f, X^i); Y^i\bigr) + (1 - \mu) \sum_{i=1}^{n} \ell\bigl(Y^i, f(X^i, Z^i)\bigr) \\
\text{subject to} \quad & z(f, X^i) \in \arg\min_{z \in \mathcal{Z}} \; f(X^i, z) + \lambda_1 \sqrt{V^f(X^i, z)} + \lambda_2 B^f(X^i, z) \quad \forall\, 1 \le i \le n,
\end{aligned} \tag{2.18}$$
where λ_1 and λ_2 are parameters that are tuned by cross validation. These modifications
to (2.11) are important because we use the machine learning estimator to impute counter-
factual outcomes. Otherwise, the method can be overly optimistic and perform poorly out
of sample. When µ = 0, there is no prescriptive component in the objective, so we fix the
values of λ1 and λ2 to be 0.
As a concrete example, we consider trees as the machine learning method. The variance
penalty term, V f(x, z), will be small when the proposed decision is in a leaf with many
training examples. The bias penalty term, Bf(x, z), will be small when the proposed decision
is in a leaf with a small diameter, i.e., the training examples are close to the new observation
and proposed decision. It makes sense intuitively why we would like both of these penalty
terms to be small.
Before proceeding, we note that the variance terms, Var(c(z;Y i)∣X i, Zi), are usually
unknown in practice. In the absence of further knowledge, we assume homoscedasticity,
i.e., Var(Y i∣X i, Zi) is constant. It is possible to estimate this value by training a machine
learning model to predict Y i as a function of (X i, Zi) and computing the mean squared error
on the training set. However, it may be advantageous to simply absorb this value into the
tuning parameter λ_1.
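Under the homoscedasticity assumption, both penalty terms reduce to simple functions of the weights. The sketch below illustrates this computation (an example of ours, not the thesis's code; `sigma2` stands for the assumed constant variance, which in practice can be absorbed into λ_1):

```python
import math

def penalties(weights, XZ_train, xz, sigma2=1.0):
    """Variance and bias penalty terms for a weighted estimator:
      V = sum_i (w_i)^2 * sigma2          (homoscedastic costs assumed)
      B = sum_i  w_i * ||(X^i, Z^i) - (x, z)||_2
    XZ_train is a list of concatenated (x, z) tuples."""
    V = sigma2 * sum(w * w for w in weights)
    B = sum(w * math.dist(p, xz) for w, p in zip(weights, XZ_train))
    return V, B
```

Consistent with the tree discussion above, V shrinks as the weight spreads over many training points, and B shrinks as the weighted points cluster near the candidate (x, z).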
2.5.2 Theoretical Results
In this section, we describe the theoretical motivation for our approach and provide finite-
sample generalization and regret bounds. For notational convenience, we denote the true conditional expected cost by
$$f^*(x, z) = \mathbb{E}\bigl[c(z; Y) \mid X = x, Z = z\bigr],$$
to distinguish it from the trained estimator f(x, z).
Before presenting the results, we first present a few additional assumptions.
Assumption 3 (Weights). For all (x, z) ∈ X × Z, Σ_{i=1}^n w_i^f(x, z) = 1 and, for all i, w_i^f(x, z) ∈ [0, 1/γ_n]. In addition, X × Z can be partitioned into Γ_n regions such that if (x, z) and (x, z′) are in the same region, ‖w^f(x, z) − w^f(x, z′)‖₁ ≤ α‖z − z′‖₂.
This assumption is trivially satisfied with weight functions derived from a tree-based
machine learning method. Γn is the maximum number of leaves in the tree, γn is the
minimum number of training examples in a leaf, and α = 0.
Assumption 4 (Regularity). The set X ×Z is nonempty, closed, and bounded with diameter
D.
Assumption 5 (Objective Conditions). The objective function satisfies the following prop-
erties:
1. ∣c(z; y)∣ ≤ 1 ∀z, y.
2. For all y ∈ Y, c(⋅; y) is L-Lipschitz.
3. For any x,x′ ∈ X and any z, z′ ∈ Z, ∣f(x, z) − f(x′, z′)∣ ≤ L∣∣(x, z) − (x′, z′)∣∣.
These assumptions provide conditions under which the generalization and regret bounds hold; similar results hold under alternative sets of assumptions (e.g., if c(z; Y) | Z is subexponential instead of bounded). With these additional assumptions, we have the following generalization bound. All proofs are contained in the appendix.
Theorem 1. Suppose Assumptions 1-5 hold. Then, with probability at least 1 − δ,

$$f(x, z) - \hat{f}(x, z) \le \frac{4}{3\gamma_n}\ln(K_n/\delta) + 2\sqrt{V^f(x, z)\ln(K_n/\delta)} + L \cdot B^f(x, z) \quad \forall z \in \mathcal{Z},$$

where $K_n = \Gamma_n \left(9D\gamma_n\left(\alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3)\right)\right)^{d_z}$.
This result uniformly bounds, with high probability, the true cost of action z by the predicted cost, f̂(x, z), plus a term depending on the uncertainty of that predicted cost, V^f(x, z), and a term proportional to the bias associated with that predicted cost, B^f(x, z). It is easy to see how this result motivates the approach described in (2.18). One can also verify that the generalization bound still holds if (X^1, Z^1), . . . , (X^n, Z^n) are chosen deterministically, as long as Y^1, . . . , Y^n are still independent. Using Theorem 1, we are able to derive a finite-sample regret bound.
Theorem 2. Suppose Assumptions 1-5 hold. Define

$$z^* \in \arg\min_z f(x, z), \qquad \hat{z} \in \arg\min_z \hat{f}(x, z) + \lambda_1\sqrt{V^f(x, z)} + \lambda_2 B^f(x, z).$$

If $\lambda_1 = 2\sqrt{\ln(2K_n/\delta)}$ and $\lambda_2 = L$, then with probability at least 1 − δ,

$$f(x, \hat{z}) - f(x, z^*) \le \frac{2}{\gamma_n}\ln(2K_n/\delta) + 4\sqrt{V(x, z^*)\ln(2K_n/\delta)} + 2L \cdot B(x, z^*),$$

where $K_n = \Gamma_n \left(9D\gamma_n\left(\alpha(LD + 1 + \sqrt{2}) + L(\sqrt{2} + 3)\right)\right)^{d_z}$.
By this result, the regret of the approach defined in (2.18) depends only on the variance and bias terms at the optimal action, z^*. Because the predicted cost is penalized by V(x, z) and B(x, z), it does not matter how poor the prediction of the cost is at suboptimal actions. Theorem 2 immediately implies the following asymptotic result, assuming the auxiliary feature space and decision space are fixed as the training sample size grows to infinity.
Corollary 1. In the setting of Theorem 2, if $\gamma_n = \Omega(n^\beta)$ for some β > 0, $\Gamma_n = O(n)$, and $B(x, z^*) \to_p 0$ as n → ∞, then

$$f(x, \hat{z}) \to_p f(x, z^*)$$

as n → ∞.
The assumptions can be satisfied, for example, with CART or random forests as the learning algorithm, with parameters set in accordance with Lemma 2 of Wager and Athey [2018]. The next example demonstrates that there exist problems for which the regret of the method with µ > 0 is asymptotically strictly smaller than the regret of the method with µ = 0.
Example 1. Suppose there are m + 1 different actions and two possible, equally probable states of the world. In one state, action 0 has a cost that is deterministically 1, and all other actions have random costs drawn from a N(0, 1) distribution. In the other state, action 0 has a cost that is deterministically 0, and all other actions have random costs drawn from a N(1, 1) distribution. Suppose the training data consist of m trials of each action. If f̂(j) is the empirical average cost of action j, then the method with µ = 0 selects the action that minimizes f̂(j). The method with µ > 0 adds a penalty of the form suggested by Theorem 2, $\lambda_1\sqrt{\sigma_j^2 \ln m / m}$. If $\lambda_1 \ge \sqrt{2}$, the (Bayesian) expected regret of the method with µ > 0 is asymptotically strictly less than the expected regret of the method with µ = 0, i.e., $ER_\mu = o(ER_0)$, where the expectations are taken over both the training data and the unknown state of the world.
This example is simple, but it demonstrates that there exist settings in which the method with µ = 0 is asymptotically suboptimal relative to our method. In addition, the proof illustrates how one can construct regret bounds tighter than the one in Theorem 2 for problems with specific structure.
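The gap in Example 1 can also be checked numerically. The following Monte-Carlo sketch uses illustrative values m = 5 and 20,000 simulations (our choices, not values from the text), comparing the empirical-minimum rule (µ = 0) against the penalized rule with λ1 = √2 and penalty λ1·√(σ_j² ln m / m).

```python
import numpy as np

# Monte-Carlo sketch of Example 1. Each trial draws the state, forms empirical
# mean costs from m observations per action, and compares the plain empirical
# minimizer (mu = 0) with the variance-penalized one (mu > 0).
rng = np.random.default_rng(0)
m, n_sims, lam1 = 5, 20000, np.sqrt(2.0)
sigma = np.concatenate(([0.0], np.ones(m)))          # action 0 is deterministic
penalty = lam1 * np.sqrt(sigma**2 * np.log(m) / m)   # zero for action 0

reg_plain, reg_pen = 0.0, 0.0
for _ in range(n_sims):
    state = rng.integers(2)                          # two equally likely states
    true_means = np.concatenate(([1.0 - state], np.full(m, float(state))))
    # empirical average cost of each action from m trials
    f_hat = true_means + sigma * rng.standard_normal(m + 1) / np.sqrt(m)
    best = true_means.min()
    reg_plain += true_means[np.argmin(f_hat)] - best
    reg_pen += true_means[np.argmin(f_hat + penalty)] - best

reg_plain /= n_sims
reg_pen /= n_sims
print(reg_pen, reg_plain)   # penalized regret is typically much smaller
```

At these illustrative settings the unpenalized rule is fooled roughly 6% of the time in the state where action 0 is optimal, while the penalty all but eliminates those errors.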
2.5.3 Tractability
The tractability of the method depends on the algorithm that is used as the predictive model.
For many kernel-based methods, the resulting optimization problems are highly nonlinear
and do not scale well when the dimension of the decision space is more than 2 or 3. For this
reason, we advocate the use of tree-based and linear models as the predictive model. Tree-based models partition the space X × Z into Γn leaves, so there are only Γn possible values of w^f(x, z). Therefore, we can solve the problem separately for each leaf. For j = 1, . . . , Γn, we solve

$$\begin{aligned} \min_z \quad & \hat{f}(x, z) + \lambda_1\sqrt{V(x, z)} + \lambda_2 B(x, z) \\ \text{s.t.} \quad & z \in \mathcal{Z}, \quad (x, z) \in L_j, \end{aligned} \qquad (2.19)$$
where L_j denotes the subset of X × Z that makes up leaf j of the tree. Because each split in the tree is a hyperplane split, L_j is defined by an intersection of halfspaces and is thus a polyhedral set. Clearly, B(x, z) is a convex function of z, as it is a nonnegative linear combination of
convex functions. If we assume homoscedasticity, then V (x, z) is constant for all (x, z) ∈ Lj.
If c(z; y) is convex in z and Z is a convex set, (2.19) is a convex optimization problem and
can be solved by convex optimization techniques. Furthermore, since the Γn instances of
(2.19) are all independent, we can solve them in parallel. Once (2.19) has been solved for all
leaves, we select the solution from the leaf with the overall minimal objective value.
For tree ensemble methods, such as random forest [Breiman, 2001] or xgboost [Chen
and Guestrin, 2016], optimization is more difficult. We compute optimal decisions using a
coordinate descent heuristic. From a random starting action, we cycle through holding all
decision variables fixed except for one and optimize that decision using discretization. We
repeat this until convergence from several different random starting decisions. For linear predictive models, the resulting problem is often a second-order cone optimization problem, which can be handled by off-the-shelf solvers (details are given in the appendix).
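The coordinate descent heuristic for tree ensembles can be sketched as follows; here `predict_cost` stands in for the ensemble's (penalized) predicted cost, and the per-coordinate grids are illustrative assumptions rather than thesis code.

```python
import numpy as np

def coordinate_descent(predict_cost, z0, grids, n_cycles=20, tol=1e-8):
    """Coordinate-descent heuristic for minimizing a black-box cost over Z.

    predict_cost: callable mapping a decision vector to a scalar cost
                  (e.g., a penalized tree-ensemble prediction).
    z0:           starting decision (one random restart).
    grids:        per-coordinate discretization of the feasible values.
    Holds all coordinates fixed except one, optimizes that coordinate over
    its grid, and cycles until no coordinate improves the cost by more than tol.
    """
    z = np.array(z0, dtype=float)
    best = predict_cost(z)
    for _ in range(n_cycles):
        improved = False
        for j, grid in enumerate(grids):
            trial = z.copy()
            costs = []
            for v in grid:
                trial[j] = v
                costs.append(predict_cost(trial))
            k = int(np.argmin(costs))
            if costs[k] < best - tol:
                z[j], best = grid[k], costs[k]
                improved = True
        if not improved:
            break
    return z, best

def multistart(predict_cost, grids, n_starts=5, seed=0):
    """Repeat from several random starting decisions and keep the best."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_starts):
        z0 = [rng.choice(g) for g in grids]
        results.append(coordinate_descent(predict_cost, z0, grids))
    return min(results, key=lambda r: r[1])
```

For a separable cost the heuristic recovers the grid optimum from any start; for a genuine ensemble prediction the restarts guard against local minima.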
2.6 Computational Experiments
In this section, we apply our proposed methods on two problems – the portfolio optimization
problem, and the newsvendor problem.
First, we explain the setup of our experiments. For each value of n, we generate a training set (X^1, Y^1), . . . , (X^n, Y^n). Next, we generate the test set, which we denote with subscript T, as (X_T^1, Y_T^1), . . . , (X_T^{n_T}, Y_T^{n_T}) with n_T = 10,000. Note that both training and test sets are generated from the same distribution. In our experiments, we calculate the optimal decision ẑ(X_T^i) for each X_T^i in the test set, and report the average out-of-sample test set cost c(ẑ(X_T^i); Y_T^i). We further average this out-of-sample cost over twenty different realizations of the training and test sets, and repeat this for various values of the training set size n.
For each of these problems, we compare our method with three others – SAA, P&O, and PtP – along with the oracle. We denote our method JPP, for Joint Predictive-Prescriptive Analytics. We compare the prescriptive performance of the following five methods:
1. SAA: simply the SAA solution, where z_SAA is computed over the whole training set as

$$z_{SAA} \in \arg\min_{z \in \mathcal{Z}} \sum_{i=1}^{n} c(z; Y^i),$$

and the cost is calculated as $\frac{1}{n_T}\sum_{i=1}^{n_T} c(z_{SAA}; Y_T^i)$ over the test set.
For each of the following three methods, we report results for four different learning models – k-nearest neighbors (kNN), kernel regression (KR), trees (T), and random forests (RF). We also include an example where the learning model is kNN to further clarify our experiments.
2. P&O: First, the machine learning model is trained to minimize the training loss, i.e., to solve Problem (2.11) with µ = 0. Once this model is computed, it is used to compute the decisions. When the model is kNN, the first step involves finding the k that leads to the least in-sample training loss for predicting y. Next, the test set cost is computed by playing z^i_{P&O}, for each test point X_T^i, 1 ≤ i ≤ n_T, obtained by

$$\hat{Y}^i = \frac{1}{k}\sum_{j \in N_k(X_T^i)} Y^j, \qquad z^i_{P\&O} \in \arg\min_{z \in \mathcal{Z}} c(z; \hat{Y}^i),$$

with the cost calculated as $\frac{1}{n_T}\sum_{i=1}^{n_T} c(z^i_{P\&O}; Y_T^i)$. We report the performance of this method as P&O-kNN.
3. PtP: As in the P&O method, the machine learning model is trained by simply minimizing the training loss, and that model is then used for decision making. When the model is kNN, the optimal decisions are computed as

$$z^i_{PtP} \in \arg\min_{z \in \mathcal{Z}} \frac{1}{k}\sum_{j \in N_k(X_T^i)} c(z; Y^j)$$

for each test point X_T^i, 1 ≤ i ≤ n_T, where k is the same value as used in the P&O methodology. The performance metric of this method is the average out-of-sample prescriptive cost $\frac{1}{n_T}\sum_{i=1}^{n_T} c(z^i_{PtP}; Y_T^i)$, which we report as PtP-kNN.
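A minimal sketch of the PtP-kNN decision rule above, with the feasible set Z discretized into a list of candidate decisions (an illustrative simplification; the function name and interface are our stand-ins, not the thesis code):

```python
import numpy as np

def ptp_knn_decision(x_new, X_train, Y_train, k, cost, candidates):
    """PtP with kNN: average the cost c(z; Y^j) over the k nearest training
    neighbors of x_new and pick the candidate decision minimizing that average.

    cost:       callable c(z, y)
    candidates: list of feasible decisions (a discretized stand-in for Z)
    """
    dists = np.linalg.norm(X_train - x_new, axis=1)
    neighbors = np.argsort(dists)[:k]               # indices of k nearest points
    avg_costs = [np.mean([cost(z, Y_train[j]) for j in neighbors])
                 for z in candidates]
    return candidates[int(np.argmin(avg_costs))]
```

With a symmetric newsvendor cost (b = h), for example, the rule recovers the median of the neighbors' demands.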
4. JPP: This is the approach presented in this chapter. Here, the machine learning model is trained to minimize the joint loss function by solving Problem (2.11), where 0 < µ < 1 is chosen via cross-validation. In the case of JPP-kNN, where the learning model is kNN, the parameter k is chosen to minimize the joint loss according to Equation (2.13), while the decision z^i_{JPP} is computed for each test point X_T^i, 1 ≤ i ≤ n_T, as

$$z^i_{JPP} \in \arg\min_{z \in \mathcal{Z}} \frac{1}{k}\sum_{j \in N_k(X_T^i)} c(z; Y^j).$$

The prescriptive performance of this method is captured by the cost $\frac{1}{n_T}\sum_{i=1}^{n_T} c(z^i_{JPP}; Y_T^i)$, which we report as JPP-kNN.
5. Oracle: Finally, as the name indicates, for this method we assume we have access to the uncertainty Y_T^i corresponding to each X_T^i in the test set. The cost is calculated over the test set as

$$\frac{1}{n_T}\sum_{i=1}^{n_T} \min_{z \in \mathcal{Z}} c(z; Y_T^i).$$

Note that this cost is the best any method can possibly achieve, and thus provides a lower bound on the attainable test set performance of any prescriptive method.
2.6.1 Portfolio Optimization
We consider the same example mentioned in Bertsimas and Van Parys [2017], where the problem is to allocate a limited budget among 6 different securities in an artificial portfolio. Thus, the decision variable z ∈ R^6_+ represents the asset allocations, while the uncertain returns y ∈ R^6 are unknown at the time of investing. We consider 3 different covariates that can potentially influence the returns – the global S&P 500 performance, inflation, and the amount of Twitter chatter mentioning the hashtag #WAR – denoted x_1, x_2, and x_3, respectively. We wish to use this additional side information to aid the decision-making process.
We wish to maximize the mean return while at the same time minimizing the risk that the loss $(-z^T y)_+ = \max\{-z^T y, 0\}$ is large. Employing a conditional value-at-risk (CVaR) reformulation [Rockafellar and Uryasev, 2000] and using β as an auxiliary variable, we solve the following convex minimization problem for a given covariate x:

$$(z^*(x), \beta^*(x)) \in \arg\min_{z \ge 0,\, \beta} \; E\left[\beta + \frac{1}{\epsilon}(-z^T y - \beta)_+ - \lambda z^T y \;\middle|\; X = x\right] \quad \text{subject to} \quad \sum_{i=1}^{6} z_i = 1.$$
Thus, the augmented vector z̃ = (z^T, β)^T ∈ R^7 is the decision variable. Here, λ and ε are both given parameters: λ ≥ 0 represents the tradeoff between expected risk and return, and the risk term represents the expected tail loss occurring above the (1 − ε) quantile. For all the experiments that follow, we fix these parameter values as ε = 0.05 and λ = 1. We include the details of the data generation process in the appendix.
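A sample-based sketch of this objective: for fixed weights z, the inner minimization over β is attained at the (1 − ε) quantile of the sample losses, so evaluating the objective reduces to a quantile computation. The two-asset grid search below is an illustrative stand-in for the full six-asset convex program, not the thesis implementation.

```python
import numpy as np

def sample_cvar_objective(z, Y, eps=0.05, lam=1.0):
    """Sample-average version of the CVaR objective above for fixed weights z.

    min over beta of beta + (1/eps) * mean((-Y z - beta)_+) is attained at the
    (1 - eps) quantile of the losses, so we plug that quantile in directly.
    Y has shape (n, d): one row of asset returns per sample.
    """
    losses = -Y @ z                                  # per-sample loss -z^T y_i
    beta = np.quantile(losses, 1.0 - eps)            # sample VaR at level 1 - eps
    cvar = beta + np.mean(np.maximum(losses - beta, 0.0)) / eps
    return cvar - lam * np.mean(Y @ z)               # risk term minus lam * mean return

def best_two_asset_weights(Y, eps=0.05, lam=1.0, steps=101):
    """Grid search over the two-asset simplex {(w, 1 - w)} - an illustrative
    stand-in for the full convex program over six assets."""
    grid = np.linspace(0.0, 1.0, steps)
    objs = [sample_cvar_objective(np.array([w, 1.0 - w]), Y, eps, lam) for w in grid]
    w = grid[int(np.argmin(objs))]
    return np.array([w, 1.0 - w])
```

If one asset dominates the other in every sample, the search puts all weight on it, as expected.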
We consider various sizes of the training data, and use this to compute the best f̂. We evaluate the performance of the various methods by reporting the out-of-sample prescriptive cost on the test set of size 10,000. For each value of n, we repeat this procedure twenty times and average the cost over these instances. Note that the lower this value is, the better the method.

n     P&O-kNN   P&O-KR    P&O-Lasso  P&O-RF
100   1357.62   1304.33   1112.28    1216.10
200   1278.06   1212.23   1026.20    1109.17
500   1143.51   1108.24   966.31     1009.85
1000  1112.49   1049.91   923.96     945.10

Table 2.1: Average out-of-sample prescriptive performance for Predict and Optimize with kNN, local kernel regression, Lasso, and random forests as a function of n, the size of the training set.

n     SAA     PtP-kNN  JPP-kNN  PtP-KR  JPP-KR  PtP-OptTree  JPP-OptTree
100   91.53   146.15   70.27    84.52   62.18   94.88        107.09
200   73.81   90.27    50.21    53.51   36.97   71.58        62.09
500   63.69   48.16    34.53    25.06   14.49   35.75        31.99
1000  60.05   33.18    29.97    15.63   4.84    15.42        15.90

Table 2.2: Average out-of-sample prescriptive performance for various methods as a function of n, the size of the training set.
Computational details: For the kNN method, we use the Mahalanobis distance metric, and for Nadaraya-Watson kernel regression, we use the Gaussian kernel with the usual Euclidean distance metric. We use leave-one-out cross-validation to compute the parameters k and h in kNN and KR, respectively. For the trees, we use projected subgradient descent (Algorithm (A.1)) to update the solutions at the various splits. The projection problem (A.2) onto Z = {z ∈ R^{d_z}_+ : ∑_i z_i = 1}, solving which is key to Algorithm (A.1), can be solved efficiently in O(d_z log(d_z)) time [Duchi et al., 2008]. Finally, we implement our algorithms in Python 3, with Gurobi [Gurobi] as the optimization solver.
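The simplex projection mentioned above can be sketched with the standard sort-based O(d log d) routine (a generic implementation in the spirit of Duchi et al. [2008], not the thesis code):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {z >= 0, sum(z) = 1} in O(d log d) time via a single sort."""
    d = v.size
    u = np.sort(v)[::-1]                              # sort in decreasing order
    css = np.cumsum(u)
    # largest index rho with u[rho] - (css[rho] - 1) / (rho + 1) > 0
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, d + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)            # shift making result sum to 1
    return np.maximum(v - theta, 0.0)
```

A point already on the simplex is its own projection, and any input is mapped to a nonnegative vector summing to one.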
Table 2.1 shows the prescriptive performance of P&O policies for various machine learning methods and training set sizes. Comparing against the first column of Table 2.2, which shows the performance of the SAA solution (averaged over the same training and test set instances for each n, and hence directly comparable), we see that the performance of the P&O methods is evidently not very strong for this problem. Even Bertsimas and Kallus [2019] note in their computations (Figure 2(a) in their paper) the under-performance of P&O methods for this particular problem. Furthermore, Table 2.2 also demonstrates that while PtP methods offer a large improvement over P&O methods, they are dominated by the JPP-based versions of these methods. This improvement holds even as n increases. The improvement of the best method, JPP-KR, over PtP-KR is statistically significant by the Wilcoxon signed-rank test (p-values for n = 100, 200, 500, 1000 are 1.5 × 10^−16, 4.6 × 10^−16, 6.5 × 10^−18, and 1.3 × 10^−17, respectively). As a reference, we note that the oracle method with perfect hindsight has test set performance between −527 and −528. Finally, we do not include random forests, as there are only three covariates.
2.6.2 Newsvendor problem
In this section, we consider the newsvendor problem with auxiliary information x about the
demand y. The cost function is piecewise linear and convex, given by

$$c(z; Y) = b \cdot (Y - z)_+ + h \cdot (z - Y)_+, \qquad (2.20)$$
where Y represents the uncertain demand, and the backorder cost b > 0 and holding cost h > 0 are both known in advance. Consequently, the conditional expectation problem we wish to solve to obtain the decision z^*(x), for any x, is

$$z^*(x) \in \arg\min_z E\left[b \cdot (Y - z)_+ + h \cdot (z - Y)_+ \mid X = x\right].$$

Clearly, if y is known a priori, then the optimal decision is z^* = y, which leads to an optimal oracle cost of zero. If the conditional distribution of Y | X = x is perfectly known, then the classical result stipulates that the optimal decision z^*(x) is given by the quantile

$$z^*(x) = \inf\left\{t : P_{Y|X=x}[Y \le t] \ge \frac{b}{b + h}\right\}.$$
We also note that the optimal solution to the weighted SAA version of the newsvendor problem, given by

$$\min_z \sum_{i=1}^{n} w_i^f(x)\, c(z; Y^i),$$

can be computed efficiently in O(n log(n)) time as

$$z^*(f, x) = \inf\left\{Y^{(j)} : \sum_{i=1}^{j} w_{(i)}^f(x) \ge \frac{b}{b + h}\right\},$$

for each x, where the demands (Y^1, . . . , Y^n) are ordered nondecreasingly as Y^{(1)} ≤ Y^{(2)} ≤ . . . ≤ Y^{(n)}, and w_{(i)}^f(x) denotes the weight of the ith smallest demand.
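This weighted quantile solution can be sketched in a few lines (the interface is ours; the cost is dominated by the sort):

```python
import numpy as np

def weighted_newsvendor(Y, w, b, h):
    """O(n log n) weighted-SAA newsvendor solution: sort the demands,
    accumulate the corresponding weights, and return the first order
    statistic at which the cumulative weight reaches b / (b + h)."""
    order = np.argsort(Y)                       # nondecreasing demand order
    cum_w = np.cumsum(w[order])
    j = int(np.searchsorted(cum_w, b / (b + h)))
    return Y[order][j]
```

With uniform weights and b = h this reduces to the sample median, and a large backorder cost pushes the order quantity toward the largest observed demand.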
Computational details: In this section, we apply our methodology to real data from a Mexican bakery items producer. This data set is available at https://www.kaggle.com/c/grupo-bimbo-inventory-demand, and consists of weekly sales data of more than a hundred of its products to various clients (stores across Mexico) over a period of nine weeks. We restrict our analysis to the top hundred products and the top five hundred clients. The details of the data generation process can be found in the appendix. We train our models on random samples of various sizes n from weeks 6 and 7, and evaluate our models on test data from weeks 8 and 9. For each n, we average our results over five hundred randomly chosen training samples, and report the average out-of-sample prescriptive cost of the various methods. Finally, we set the backorder and holding costs in Equation (2.20) as b = 10.0 and h = 1.0.
Table 2.3 presents the performance of SAA followed by the PtP and JPP variants of various methods – Nadaraya-Watson kernel regression (KR), optimal trees (OptTree), and random forests (RF). The kNN method performs similarly to the kernel regression method, and we omit it for brevity. Clearly, JPP-RF is the best performer with the smallest out-of-sample cost, closely followed by PtP-RF. We also note that the JPP versions of kNN, optimal trees, and RF perform better than their PtP versions at each value of n ≥ 200, which shows the benefit of training the machine learning model with the prescription task in mind. Finally, we note that all the methods perform better with more training data, which is expected, but the JPP methods consistently outperform their PtP counterparts. The improvements due to JPP-RF over PtP-RF are all statistically significant for n ≥ 200 by the Wilcoxon signed-rank test (p-values for n = 200, 300, 400, 500, 600 are 4.9 × 10^−6, 3.5 × 10^−21, 1.8 × 10^−21, 5.6 × 10^−20, and 2.8 × 10^−23, respectively). Note that the oracle method will always have an average out-of-sample cost of 0.
n SAA PtP-KR JPP-KR PtP-OptTree JPP-OptTree PtP-RF JPP-RF
100 2065.45 2034.84 2032.73 2057.26 2022.10 1889.38 1890.36
200 2059.12 2024.07 2020.64 1983.34 1944.68 1830.66 1827.65
300 2056.60 2021.65 2016.68 1974.52 1923.52 1793.96 1783.24
400 2054.91 2021.66 2013.68 1954.00 1909.54 1760.35 1747.40
500 2054.33 2021.48 2012.30 1934.02 1892.65 1726.88 1714.18
600 2053.20 2022.98 2010.71 1924.11 1870.92 1692.28 1677.92
Table 2.3: Average out of sample prescriptive performance for various methods as a functionof n, the size of training set.
In the following sections, we demonstrate the effectiveness of our approach on problems where Y is affected by Z, with two examples. In the first, we consider a pricing problem with synthetic data, while in the second, we use real patient data for personalized Warfarin dosing. We compare our methods (JPP) with PtP methods, for which µ = 0 and there is no prescriptive component in the objective of (2.18).
2.6.3 Pricing
In this example, the decision variable, z ∈ R5, is a vector of prices for a collection of products.
The response, Y , is a vector of demands for those products. The auxiliary covariates, x,
may contain data on the weather and other exogenous factors that may affect demand. The
objective is to select prices to maximize revenue for a given vector of auxiliary covariates.
The demand for a single product is affected by the auxiliary covariates, the price of that
product, and the price of one or more of the other products, but the mapping is unknown to the algorithm. The details of the data generation process can be found in the appendix.

n     PtP-CART  JPP-CART  PtP-RF    JPP-RF
30    18630.37  55851.14  57259.44  57189.29
50    18479.17  56316.17  57757.40  57964.17
70    21228.90  56448.46  57949.27  58144.01
100   21426.18  56889.63  58085.08  58209.91
150   27961.63  57206.25  58285.22  58464.33
300   30751.34  57631.42  58415.60  58630.36
600   34894.42  57951.45  58373.28  58621.00
1000  37803.67  58163.99  58480.03  58713.35
2000  41503.90  58348.54  58449.98  58711.00

Table 2.4: Average out-of-sample revenue on the pricing example for various PtP and JPP methods as a function of n, the size of the training set.
In Table 2.4, we compare the expected revenues of the strategies produced by several algorithms. The PtP results refer to methods that solve (2.18) with µ = 0: they are trained to minimize predictive error, and the decision that minimizes predicted cost is then selected. The JPP results refer to methods that solve (2.18) with µ = 0.5. For each training sample size n, we average our results over one hundred separate training sets of size n. At a training size of 2000, the JPP random forest method improves expected revenue by an average of $270 compared to the PtP-RF method. This improvement is statistically significant at the 0.05 significance level by the Wilcoxon signed-rank test (p-value 4.4 × 10^−18, testing the hypothesis that the mean improvement is 0 across 100 different training sets).
2.6.4 Warfarin Dosing
Warfarin is a commonly prescribed anticoagulant that is used to treat patients who have had
blood clots or who have a high risk of stroke. Determining the optimal maintenance dose
of Warfarin presents a challenge as the appropriate dose varies significantly from patient
to patient and is potentially affected by many factors including age, gender, weight, health
history, and genetics. However, this is a crucial task because a dose that is too low or too
high can put the patient at risk for clotting or bleeding. The effect of a Warfarin dose on a patient is measured by the International Normalized Ratio (INR). Physicians typically aim for patients to have an INR in a target range of 2-3.
In this example, we test the efficacy of our approach in learning optimal Warfarin dos-
ing with data from Consortium et al. [2009]. This publicly available data set contains the
optimal stable dose, found by experimentation, for a diverse set of 5410 patients. In addi-
tion, the data set contains a variety of covariates for each patient, including demographic
information, reason for treatment, medical history, current medications, and the genotype
variant at CYP2C9 and VKORC1. It is unique because it contains the optimal dose for
each patient, permitting the use of off-the-shelf machine learning methods to predict this
optimal dose as a function of patient covariates. We instead use this data to construct a
problem with observational data, which resembles the common problem practitioners face.
Our access to the true optimal dose for each patient allows us to evaluate the performance
of our method out-of-sample. This is a commonly used technique, and the resulting data
set is sometimes called semi-synthetic. Several researchers have used the Warfarin data for
developing personalized approaches to medical treatments. In particular, Kallus [2017b] and
Bertsimas et al. [2019a] tested algorithms that learned to treat patients from semi-synthetic
observational data. However, they both discretized the dosage into three categories, whereas
we treat the dosage as a continuous decision variable.
To begin, we split the data into a training set of 4000 patients and a test set of 1410
patients. We keep this split fixed throughout all of our experiments to prevent cheating
by using insights gained by visualization and exploration on the training set. Similar to
Kallus [2017b], we assume physicians prescribe Warfarin as a function of BMI. We assume
the response that the physicians observe is related to the difference between the dose a
patient was given and the true optimal dose for that patient. It is a noisy observation, but
it, on average, gives directional information (whether the dose was too high or too low) and
information on the magnitude of the distance from the optimal dose. The precise details
of how we generate the data are given in the supplementary materials. For all methods,
we repeat our work across 100 randomizations of assigned training doses and responses. To
measure the performance of our methods, we compute, on the test set, the mean squared error (MSE) of the prescribed doses relative to the true optimal doses. Using the notation described in Section 2.1, X^i ∈ R^99 represents the auxiliary covariates for patient i. We work in normalized units so that the covariates all contribute equally to the bias penalty term. Z^i ∈ R represents the assigned dose for patient i, and Y^i ∈ R represents the observed response for patient i. The objective in this problem is to minimize (E[Y(z) | X = x])^2 with respect to the dose, z.¹

n     PtP-Lasso  JPP-Lasso  PtP-CART  JPP-CART  PtP-RF  JPP-RF
200   450.90     448.11     440.13    307.22    301.93  257.70
500   260.31     255.27     309.32    273.91    234.11  219.08
1000  300.82     286.32     269.92    254.24    220.43  211.43
1500  195.60     188.48     258.72    244.15    220.39  206.74
2000  180.44     174.44     247.76    239.52    215.56  199.27
2500  162.95     161.40     238.42    232.78    206.25  191.50
3000  161.22     159.41     230.18    222.63    211.11  193.54
3500  155.17     154.63     234.78    223.09    210.91  189.21
4000  154.33     153.51     221.87    216.86    205.13  187.35

Table 2.5: Average out-of-sample MSE on the Warfarin example for various PtP and JPP methods as a function of n, the size of the training set.
Table 2.5 displays the results of several algorithms as a function of the number of training examples. We compare PtP and JPP versions of the CART, random forest, and Lasso algorithms. We see consistent improvements in MSE with the JPP methods over the PtP methods. The Lasso-based method works best on this data set when the number of training samples is large, but the random-forest-based method is best for smaller sample sizes. With the maximal training set size of 4000, the improvements of the CART, random forest, and Lasso uncertainty-penalized methods over their unpenalized analogues (2.2%, 8.6%, and 0.5%, respectively) are all statistically significant at the 0.05 family-wise error rate by the Wilcoxon signed-rank test with Bonferroni correction (adjusted p-values 2.1 × 10^−4, 4.3 × 10^−16, and 1.2 × 10^−6, respectively).
¹This objective differs slightly from the setting described in Section 2.5.2, in which the objective was to minimize the conditional expectation of a cost function. However, it is straightforward to modify the results to obtain the same regret bound (up to a few constant factors) when minimizing g(E[c(z; Y(z)) | X = x]) for a Lipschitz function g.
2.7 Conclusions
In this chapter, we consider the problem of computing decisions from data – a topic that
lies at the intersection of machine learning and operations research/management science.
In our setting, we assume the decision maker has access to n samples of past observational data (X^i, Z^i, Y^i), i = 1, . . . , n, comprising auxiliary covariates X^i, decisions Z^i, and outcomes Y^i. We propose non-parametric, ML-based methods that, given a new observation x, prescribe a decision ẑ(x). We compute these prescriptive policies in a single step, rather than the
usual two-step approach of learning a model to predict y from x and z, and then using
those predictions to compute decisions. A crucial component of our approach is that we
train these ML models by explicitly optimizing for the quality of their induced decisions.
Additionally, we prove finite sample generalization and regret bounds and provide a sufficient
set of conditions under which the resulting decisions are asymptotically optimal. Finally, we
perform computational experiments and demonstrate the prescriptive power of our methods
on synthetic and real data.
Chapter 3
Optimal Prescriptive Trees
3.1 Introduction
The proliferation in volume, quality, and accessibility of highly granular data has enabled
decision makers in various domains to seek customized decisions at the individual level.
This personalized decision making framework encompasses a multitude of applications. In
online advertising, internet companies display advertisements to users based on the users' search history, demographic information, geographic location, and other data they routinely collect from visitors to their websites. Specifically targeting these advertisements by displaying them to appropriate users can maximize the probability that they are clicked, and can improve revenue. In personalized medicine, we want to assign different drugs/treatment
regimens/dosage levels to different patients depending on their demographics, past diagnosis
history and genetic information in order to maximize medical outcomes for patients. By
taking into account the heterogeneous responses to different treatments among different
patients, personalized medicine aspires to provide individualized, highly effective treatments.
In this chapter, we consider the problem of prescribing the best option from among a
set of predefined treatments to a given sample (patient or customer depending on context)
as a function of the sample’s features. We have access to observational data of the form
(x_i, y_i, z_i), i = 1, . . . , n, which comprises n observations. Each data point (x_i, y_i, z_i) corresponds to the features x_i ∈ R^d of the ith sample, the assigned treatment z_i ∈ [m] = {1, . . . , m}, and the corresponding outcome y_i ∈ R. We use y(1), . . . , y(m) to denote the m "potential outcomes" resulting from applying each of the m respective treatments.
There are three key challenges for designing personalized prescriptions for each sample
as a function of their observed features:
1. While we have observed the outcome of the administered treatment for each sample, we have not observed the counterfactual outcomes, that is, the outcomes that would have occurred had another treatment been administered. Note that if this information were known, then the prescription problem would reduce to a standard multi-class classification problem. We thus need to infer the counterfactual outcomes.
2. The vast majority of the available data is observational in nature as opposed to data
from randomized trials. In a randomized trial, different samples are randomly assigned
different treatments, while in an observational study, the assignment of treatments
potentially, and often, depends on features of the sample. Different samples are thus
more or less likely to receive certain treatments and may have different outcomes than
others that were offered different treatments. Consequently, our approach needs to
take into account the bias inherent in observational data.
3. Especially for personalized medicine, the proposed approach needs to be interpretable, that is, easily understandable by humans. Even in high-speed online advertising, one needs to demonstrate that the approach is fair and appropriate, and does not discriminate against people based on certain features such as race, gender, or age. In our view, interpretability is always highly desirable, and a necessity in many contexts.
We seek a function τ ∶ Rd → [m] that selects the best treatment τ(x) out of the m options
given the sample features x. In doing so, we need to be both “optimal” and “accurate”. We
thus consider two objectives:
1. Assuming that smaller outcomes y are preferable (for example, sugar levels for personalized diabetes management), we want to minimize E[y(τ(x))], where the expectation is taken over the distribution of outcomes for a given treatment policy τ(x). Given that we only have data, we rewrite this expectation as

$$\sum_{i=1}^{n}\left(y_i\, \mathbb{1}[\tau(x_i) = z_i] + \sum_{t \ne z_i} y_i(t)\, \mathbb{1}[\tau(x_i) = t]\right), \qquad (3.1)$$

where y_i(t) denotes the unknown counterfactual outcome that would be observed if sample i were to be assigned treatment t. We refer to the objective function (3.1) as the prescription error.
2. We further want the policy τ(x) to be built on accurate estimates of the counterfactual outcomes. For this reason, our second objective is to minimize

$$\sum_{i=1}^{n}\left(y_i - \hat{y}_i(z_i)\right)^2, \qquad (3.2)$$

that is, we seek to minimize the squared prediction error over the observed data, where ŷ_i(z_i) denotes the model's prediction of the observed outcome.
Given our desire for optimality and accuracy, we propose in this chapter to seek a policy τ(x) that optimizes a convex combination of the two objectives (3.1) and (3.2):

$$\mu\left[\sum_{i=1}^{n}\left(y_i\, \mathbb{1}[\tau(x_i) = z_i] + \sum_{t \ne z_i} y_i(t)\, \mathbb{1}[\tau(x_i) = t]\right)\right] + (1 - \mu)\left[\sum_{i=1}^{n}\left(y_i - \hat{y}_i(z_i)\right)^2\right], \qquad (3.3)$$

where the prescription factor µ is a hyperparameter that controls the tradeoff between the prescription error and the prediction error.
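Objective (3.3) can be evaluated directly once counterfactual outcomes have been imputed. The following sketch assumes treatments indexed from 0 and an array y_hat of imputed outcomes (the interface is our illustration, not the thesis code):

```python
import numpy as np

def prescriptive_objective(tau, x, y, z, y_hat, mu):
    """Evaluate the combined objective (3.3) for a candidate policy tau.

    tau:   policy mapping a feature vector to a treatment in {0, ..., m-1}
    x:     (n, d) features;  y: (n,) observed outcomes
    z:     (n,) assigned treatments;  y_hat: (n, m) imputed outcomes,
           where y_hat[i, t] estimates the counterfactual outcome y_i(t)
    (treatments indexed from 0 for convenience; a sketch, not thesis code).
    """
    prescriptions = np.array([tau(xi) for xi in x])
    matched = prescriptions == z
    idx = np.arange(len(y))
    # prescription error: observed outcome where tau agrees with the data,
    # imputed counterfactual where it does not
    prescription_err = np.where(matched, y, y_hat[idx, prescriptions]).sum()
    # prediction error: squared error of the imputed model on observed pairs
    prediction_err = np.sum((y - y_hat[idx, z]) ** 2)
    return mu * prescription_err + (1.0 - mu) * prediction_err
```

Sweeping µ from 0 to 1 then trades pure predictive accuracy against pure prescriptive performance.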
3.1.1 Related Literature
In this section, we present some related approaches to personalization in the literature and
how they relate to our work. We present some methodological papers by researchers in
statistics and operations research, followed by a few papers in the medical literature.
Learning the outcome function for each treatment: A common approach in the lit-
erature is to estimate each sample’s outcome under a particular treatment, and recommend
the treatment that predicts the best prognosis for that sample. Formally, this is equivalent to estimating the conditional expectation E[Y | Z = t, X = x] for each t ∈ [m], and assigning the treatment with the lowest predicted outcome to a sample. For instance, these conditional means could be estimated by separately regressing the outcomes against the covariates of the samples that received each treatment t. This approach has been followed historically by several authors in clinical research (e.g., Feldstein et al. [1978]), and more recently by researchers
in statistics [Qian and Murphy, 2011] and operations research [Bertsimas et al., 2017]. The
online version of this problem, called the contextual bandit problem, has been studied by
several authors [Bastani and Bayati, 2015, Goldenshluger and Zeevi, 2013, Li et al., 2010] in the multi-armed bandit literature [Gittins, 1989]. These papers use variants of linear regression to estimate the outcome function for each arm while ensuring sufficient exploration, and pick the best treatment based on the m predictions for a given sample.
In the context of personalized diabetes management, Bertsimas et al. [2017] use carefully
constructed k-nearest neighbors to estimate the counterfactuals, and prescribe the treatment
option with the best predicted outcome if the expected improvement (over the status quo)
exceeds a threshold δ. The parameters, k and δ, used as part of this approach are themselves
learned from the data.
More generally in the fields of operations research and management science, Bertsimas
and Kallus [2019] consider the problem of prescribing optimal decisions by directly learning
from data. In this work, they adapt powerful machine learning methods and encode them
within an optimization framework to solve a wide range of decision problems. In the context
of revenue management and pricing, Bertsimas and Kallus [2017] consider the problem of
prescribing the optimal price by learning from historical demand and other side informa-
tion, but taking into account that the demand data is observational. Specifically, historical
demand data is available only for the observed price and is missing for the remaining price
levels.
Effectively, regress-and-compare approaches inherently encode a personalization frame-
work that consists of a (shallow) decision tree of depth one. To see this, consider a problem
with m arms, where this approach involves estimating a function f_i predicting the outcomes of samples that received arm i, for each 1 ≤ i ≤ m. This prescription mechanism can be represented as splitting the feature space into m leaves, with the first leaf comprising all the subjects who are recommended arm 1, and so on. The i-th leaf is given by the region $\{x \in \mathbb{R}^d : f_i(x) < f_j(x)\ \forall j \neq i,\ 1 \le j \le m\}$. However, the individual functions f_i can be highly nonlinear, which hurts interpretability. Additionally, using only the samples who were administered arm i to compute each f_i means that each model is fit on only a subset of the training data and that the f_i's do not interact with each other during learning, which can potentially lead to less effective decision rules.
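As a concrete illustration of this regress-and-compare template (a minimal sketch with names of our choosing, using ordinary least squares as the per-arm regression):

```python
import numpy as np

def fit_arm(X, y):
    # least-squares fit with an intercept column appended
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def regress_and_compare(X, y, z, m):
    """Fit f_1, ..., f_m, each only on the subset of samples that received
    that arm, and prescribe the arm with the smallest predicted outcome."""
    coefs = [fit_arm(X[z == t], y[z == t]) for t in range(m)]
    def prescribe(x_new):
        preds = [np.append(x_new, 1.0) @ c for c in coefs]
        return int(np.argmin(preds))  # smaller outcomes are preferable
    return prescribe
```

Note how the data is partitioned by arm before any fitting takes place, which is exactly the inefficiency discussed above.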
Statistical learning based approaches: Another relatively recent approach involves
reformulating this problem as a weighted multi-class classification problem based on im-
puted propensity scores, and using off the shelf methods/solvers available for such problems.
Propensity scores are defined as the conditional probability of a sample receiving a particular treatment given his/her features [Rosenbaum and Rubin, 1983]. For a two-arm randomized controlled trial with 1:1 allocation, these values are 0.5 for each sample. For two-armed studies where these scores are known, Zhou et al. [2017] propose a weighted SVM-based approach to learn a classifier that prescribes one of the two treatment options. However, this analysis is restricted to settings where these scores are perfectly known and predefined in the trial design, e.g., randomized clinical trials (where propensities are constant) or stratified designs (where the dependence of the treatment assignment on the covariates is known a priori).
In observational studies, these probabilities are typically not known, and hence are usu-
ally estimated via maximum likelihood estimation. However, there are multiple proposed
methods for estimating these scores, e.g., using machine learning [Westreich et al., 2010] or via covariate balancing [Imai and Ratkovic, 2014], and the choice of method is not clear a priori. Once these probabilities are known or estimated, the average outcome of a policy is computed using the inverse probability of treatment weighting (IPW) estimator, which weights each observed outcome by the inverse of the propensity score of the received treatment (this approach is also referred to as importance sampling in the machine learning literature). While this method has desirable asymptotic properties and low bias, dividing by small estimated probabilities may lead to unstable, high-variance estimates in small samples.
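A minimal sketch of the IPW policy-value estimator just described (names illustrative); note that small estimated propensities inflate the weights, which is the source of the variance issue:

```python
import numpy as np

def ipw_value(y, z, tau_z, e_hat):
    """IPW estimate of a candidate policy's mean outcome.

    y:     observed outcomes
    z:     administered treatments
    tau_z: treatments the candidate policy assigns to the same samples
    e_hat: estimated propensity of each sample's *observed* treatment
    """
    w = (z == tau_z).astype(float) / e_hat  # small e_hat => huge weights
    return float(np.mean(w * y))
```

In a randomized trial with constant propensities 0.5, each sample whose observed treatment matches the policy simply receives weight 2.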
Tree based approaches: Continuing in the spirit of adapting machine learning approaches,
Kallus [2017b] proposes personalization trees (and forests), which adapt regular classification
trees [Breiman et al., 1984] to directly optimize the prescription error. The key differences
from our approach are that we modify our objective to account for the prediction error, and
use the methodology of Bertsimas and Dunn [2017, 2019] to design near optimal trees, which
improves performance substantially. Athey and Imbens [2016] and Wager and Athey [2018] also use a recursive splitting procedure on the feature space to construct causal trees and causal forests, respectively, which estimate the causal effect of a treatment for a given sample and construct confidence intervals for treatment effects, but do not produce explicit prescriptions or recommendations, which are the main focus of this chapter. Moreover, causal trees (and forests) are designed exclusively for studies comparing two treatments. Additional methods that
build on causal forests are proposed in the recent work of Powers et al. [2017], who develop
nonlinear methods to provide better estimates of the personalized average treatment effect,
E[Y (1)∣X = x] − E[Y (0)∣X = x], for high dimensional covariates x. They adapt methods
such as random forests, boosting, and MARS (Multivariate Adaptive Regression Splines) and
develop their equivalents for treatment effect estimation – pollinated transformed outcome
(PTO) forests, causal boosting, and causal MARS. These methods rely on first estimating
the propensity score (by regressing historically assigned Z against X), followed by another
regression using those propensity adjustments. The causal MARS approach uses nonlinear
functions, which are added to the basis in a greedy manner, as regressors for predicting
outcomes via linear regression for each arm, but uses a common set of basis functions for
both arms.
One advantage of these recent approaches (weighted classification or tree-based methods) over regress-and-compare approaches is that they use all of the training data, rather than breaking the problem into m subproblems (where m is the number of arms), each using a separate subset of the data. This key modification increases the efficiency of learning, which results in better estimates of personalized treatment effects for smaller training set sizes.
Personalization in medicine: Heterogeneity in patient response and the potential benefits
of personalized medicine have also been discussed in the medical literature. As an illustration
of heterogeneity in responses, a certain drug that works for a majority of individuals may
not be appropriate for certain subsets of patients; e.g., older patients in general tend to have poorer outcomes independent of any treatment [Lipkovich and Dmitrienko, 2014]. In the context of breast cancer, Gort et al. [2007] find that even when patients receive identical
treatments, heterogeneity of the disease at the molecular level may lead to varying clinical
outcomes. Thus, personalized medicine can be thought of as a framework for utilizing all
this information, past data, and patient level characteristics to develop a rule that assigns
treatments best suited for each patient. These treatment rules have provided high quality
recommendations, e.g., in cystic fibrosis [Flume et al., 2007] and mental illness [Insel, 2009],
and can potentially lead to significant improvements in health outcomes and reduce costs.
3.1.2 Contributions
We propose an approach that generalizes our earlier work on prediction trees [Bertsimas and
Dunn, 2017, 2019, Dunn, 2018] to prescriptive trees that are interpretable, highly scalable, generalizable to multiple treatments, and that either outperform or are comparable out of sample with several state-of-the-art methods on synthetic and real-world data. Specifically, our contributions include:
Interpretability: Decision trees are highly interpretable (in the words of Leo Breiman: “On interpretability, trees rate an A+”). Given that our method produces trees with splits parallel to the axes, they are highly interpretable and provide intuition on the important features that lead to a sample being assigned a particular treatment.
Scalability: Similarly to predictive trees [Bertsimas and Dunn, 2017, 2019, Dunn, 2018], prescriptive trees scale to problems with n in the 100,000s and d in the 10,000s in seconds when they use constant predictions in the leaves, and in minutes when they use a linear model.
Generalizable to multiple treatments: Prescriptive trees can be applied with multiple
treatments. An important desired characteristic of a prescriptive algorithm is its generaliz-
ability to handle the case of more than two possible arms. As an example, a recent review
by Baron et al. [2013] found that almost 18% of published randomized control trials (RCTs)
in 2009 were multi-arm clinical trials, in which more than two treatments are tested simultaneously. Multi-arm trials are attractive as they can greatly improve efficiency compared to traditional two-arm RCTs by reducing costs, speeding up recruitment of participants, and, most importantly, increasing the chances of finding an effective treatment [Parmar et al., 2014]. On the other hand, two-arm trials can force the investigator to make a potentially incorrect series of decisions on treatment, dose, or assessment duration [Parmar et al., 2014].
Rapid advances in technology have resulted in almost all diseases having multiple drugs at
the same stage of clinical development, e.g., 771 drugs for various kinds of cancer are cur-
rently in the clinical pipeline [Buffery, 2015]. This emphasizes the importance of methods
that can handle trials with more than two treatment options.
Highly effective prescriptions: In a series of experiments with real and synthetic data,
we demonstrate that prescriptive trees either outperform or are comparable out of sample with several state-of-the-art methods.
Given their combination of interpretability, scalability, generalizability and performance,
it is our belief that prescriptive trees are an attractive alternative for personalized decision
making.
3.1.3 Structure of the Chapter
The structure of this chapter is as follows. In Section 3.2, we review optimal predictive trees
for classification and regression. In Section 3.3, we describe optimal prescriptive trees (OPTs)
and the algorithm we propose in greater detail. In Section 3.3.3, we present improvements
to the OPTs methodology using improved counterfactual estimates. We provide evidence of
the benefits of this method with the help of synthetic data in Section 3.4 and four real-world examples in Section 3.5. Finally, we present our conclusions in Section 3.6.
3.2 Review of Optimal Predictive Trees
Decision trees are primarily used for the tasks of classification and regression, which are
prediction problems where the goal is to predict the outcome y for a given point x. We
therefore refer to these trees as predictive trees. The problem we consider in this chapter is
prescription, where we use the point x and the observed outcomes y to prescribe the best
treatment for each point. We will adapt ideas from predictive trees in order to effectively
train prescriptive trees, where each leaf prescribes a treatment for the point and also predicts
the associated outcome for that treatment. In this section, we briefly review predictive trees,
and in particular, we give an overview of the Optimal Trees framework [Bertsimas and Dunn,
2019, Dunn, 2018] which is a novel approach for training predictive trees that have state-of-
the-art accuracy.
The traditional approach for training decision trees is to use a greedy heuristic to re-
cursively partition the feature space by finding the single split that locally optimizes the
objective function. This approach is used by methods like CART [Breiman et al., 1984] to
find classification and regression trees. The main drawback to this greedy approach is that
each split in the tree is determined in isolation without considering the possible impact of
future splits in the tree. This can produce trees that fail to capture the underlying characteristics of the dataset, leading to weak performance on unseen data. The natural way
to resolve this problem is to consider forming the decision tree in a single step, determining
each split in the tree with full knowledge of all other splits.
Optimal Trees is a novel approach for constructing decision trees that substantially out-
performs existing decision tree methods [Bertsimas and Dunn, 2019, Dunn, 2018]. It uses
mixed-integer optimization (MIO) to formulate the problem of finding the globally optimal
decision tree, and solves this problem with coordinate descent to find optimal or near-optimal
solutions in practical times. The resulting predictive trees are often as powerful as state-of-
the-art methods like random forests or boosted trees, yet they maintain the interpretability
of a single decision tree, avoiding the need to choose between interpretability and state-of-
the-art accuracy.
The Optimal Trees framework is a generic approach for training decision trees according
to a loss function of the form
\[
\min_{T} \ \text{error}(T, D) + \alpha \cdot \text{complexity}(T), \tag{3.4}
\]
where T is the decision tree being optimized, D is the training data, error(T,D) is a
function measuring how well the tree T fits the training data D, complexity(T ) is a function
penalizing the complexity of the tree (for a tree with splits parallel to the axis, this is the
number of splits in the tree), and α is the complexity parameter that controls the tradeoff
between the quality of the fit and the size of the tree.
Previous attempts in the literature for finding globally optimal predictive trees [examples
include Bennett and Blue, 1996, Grubinger et al., 2014, Son, 1998] were not able to scale
to datasets of the size seen in practice, and as such did not deliver practical improvements
over greedy heuristics. The key development that allows Optimal Trees to scale is using
coordinate descent to train the decision trees towards global optimality. The algorithm
repeatedly optimizes the splits in the tree one-at-a-time, attempting to find changes that
improve the global objective value in Problem (3.4). At a high level, it visits the nodes of
the tree in a random order and considers the following modifications at each node:
• If the node is not a leaf, delete the split at that node;
• If the node is not a leaf, find the optimal split to use at that node and update the
current split;
• If the node is a leaf, create a new split at that node.
After each of the changes, the objective value of the tree with respect to Problem (3.4) is
calculated. If any of these changes improve the overall objective value of the tree, then the
modification is accepted. The algorithm continually visits the nodes in a random order until
Figure 3-1: Performance of classification methods averaged across 60 real-world datasets. OCT and OCT-H refer to Optimal Classification Trees without and with hyperplane splits, respectively. (The figure plots out-of-sample accuracy against maximum tree depth for CART, OCT, OCT-H, random forests, and boosting.)
no possible improvements are found, meaning this tree is a local minimum. The problem is
non-convex, so this coordinate descent process is repeated from a variety of starting decision
trees that are generated randomly. From this set of trees, the one with the lowest overall objective value is selected as the final solution. For a more comprehensive guide to the
coordinate descent process, we refer the reader to Bertsimas and Dunn [2019].
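The accept-if-improves loop described above can be sketched generically. In the following illustration, a "coordinate" stands in for a tree node and `candidates` for the delete/change/create split modifications; this is a skeleton of our own devising, not the Optimal Trees implementation.

```python
import random

def coordinate_descent(state, objective, coords, candidates, seed=0):
    """Accept-if-improves local search. In OPT, `state` would be a decision
    tree, each coordinate a node, and `candidates` would generate the trees
    obtained by deleting, changing, or creating the split at that node.
    Here the interface is kept generic so the skeleton runs on toy problems."""
    rng = random.Random(seed)
    best = objective(state)
    improved = True
    while improved:
        improved = False
        order = list(coords)
        rng.shuffle(order)                 # visit coordinates in random order
        for c in order:
            for cand in candidates(state, c):
                val = objective(cand)
                if val < best:             # keep any strict improvement
                    state, best = cand, val
                    improved = True
    return state, best
```

Because Problem (3.4) is non-convex, this search is restarted from many randomly generated starting points and the best local minimum is kept.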
The coordinate descent algorithm is generic and can be applied to any objective function
in order to optimize a decision tree. For example, the Optimal Trees framework can train
Optimal Classification Trees by setting error(T,D) to be the misclassification error associ-
ated with the tree predictions made on the training data. Figure 3-1 shows a comparison of
performance between various classification methods from Bertsimas and Dunn [2019]. These
results demonstrate that the Optimal Tree methods outperform CART in producing a single
predictive tree that has accuracies comparable with some of the best classification methods.
In Section 3.3, we extend the Optimal Trees framework to generate prescriptive trees
using objective function (3.3).
3.3 Optimal Prescriptive Trees
In this section, we motivate and present the OPT algorithm, which trains prescriptive trees to directly minimize the objective in Problem (3.3); that is, decision trees in which each leaf prescribes a
common treatment for all samples that are assigned to that leaf of the tree. Our approach
is to estimate the counterfactual outcomes using this prescriptive tree during the training
process, and therefore jointly optimize the prescription and the prediction error.
3.3.1 Optimal Prescriptive Trees with Constant Predictions
Observe that a decision tree divides the training data into neighborhoods where the samples
are similar. We propose using these neighborhoods as the basis for our counterfactual esti-
mation. More concretely, we will estimate the counterfactual yi(t) using the outcomes yj for
all samples j with zj = t that fall into the same leaf of the tree as sample i. An immediate
method for estimation is to simply use the mean outcome of the relevant samples in this
neighborhood, giving the following expression for $\hat{y}_i(t)$:

\[
\hat{y}_i(t) = \frac{1}{\left| \{ j : x_j \in X_{l(i)},\ z_j = t \} \right|} \sum_{j : x_j \in X_{l(i)},\ z_j = t} y_j, \tag{3.5}
\]

where $X_{l(i)}$ is the leaf of the prescriptive tree into which $x_i$ falls.
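Equation (3.5) amounts to a group-by mean over (leaf, treatment) pairs; a direct, illustrative implementation (names ours):

```python
import numpy as np

def leaf_mean_counterfactuals(y, z, leaf, m):
    """Estimate y_i(t) as the mean outcome of same-leaf samples that
    received treatment t, as in Equation (3.5). Returns an (n, m) array,
    with NaN where a leaf contains no sample with treatment t."""
    y_hat = np.full((len(y), m), np.nan)
    for l in np.unique(leaf):
        in_leaf = (leaf == l)
        for t in range(m):
            sel = in_leaf & (z == t)
            if sel.any():
                y_hat[in_leaf, t] = y[sel].mean()
    return y_hat
```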
Substituting this back into Problem (3.3), we want to find a prescriptive tree τ that
solves the following problem:
\[
\begin{aligned}
\min_{\tau(\cdot)} \quad & \mu \left[ \sum_{i=1}^{n} \left( y_i \, \mathbb{1}[\tau(x_i) = z_i] + \sum_{t \neq z_i} \frac{\sum_{j : x_j \in X_{l(i)},\ z_j = t} y_j}{\left| \{ j : x_j \in X_{l(i)},\ z_j = t \} \right|} \, \mathbb{1}[\tau(x_i) = t] \right) \right] \\
& + (1 - \mu) \left[ \sum_{i=1}^{n} \left( y_i - \frac{1}{\left| \{ j : x_j \in X_{l(i)},\ z_j = z_i \} \right|} \sum_{j : x_j \in X_{l(i)},\ z_j = z_i} y_j \right)^2 \right].
\end{aligned} \tag{3.6}
\]
We note that when µ = 1, we obtain the same objective function as Kallus [2017b], which
means this objective is an unbiased and consistent estimator for the prescription error. We
Figure 3-2: Test prediction and personalization error as a function of µ.
note that in this work they attempted to solve Problem (3.6) to global optimality using a
MIO formulation based on an earlier version of Optimal Trees [Bertsimas and Dunn, 2017].
This approach did not scale beyond shallow trees and small datasets, and so they resorted to
using a greedy CART-like heuristic to solve the problem instead. The approach we describe, using the latest version of Optimal Trees centered around coordinate descent, is practical and scales to large datasets in tractable times. When µ = 0, we obtain the
objective function in Bertsimas and Dunn [2017] that emphasizes prediction.
Empirically, when µ = 1, we have observed that the resulting prescriptive trees lead to a high predictive error and an optimistic estimate of the prescriptive error that is not supported in out-of-sample experiments. Allowing µ to vary leads to a major improvement in both the out-of-sample predictive and prescriptive errors.
To illustrate this observation, Figure 3-2 shows the average prediction and prescription
errors as a function of µ for one of the synthetic experiments we conduct in Section 3.4. We
see that using µ = 1 leads to very high prediction errors, as the prescriptions are learned
without making sure the predicted outcomes are close to reality. More interestingly, we see
that the best prescription error is not achieved at µ = 1. Instead, varying µ leads to improved
prescription error, and for this particular example the lowest error is attained for µ in the
range 0.5–0.8. This gives clear evidence that our choice of objective function is crucial for
delivering better prescriptive trees.
3.3.2 Training Prescriptive Trees
We apply the Optimal Trees framework to solve Problem (3.6) and find OPTs. The core
of the algorithm remains as described in Section 3.2, and we set Problem (3.6) as the loss
function error(T,D). When evaluating the loss at each step of the coordinate descent,
we calculate the estimates of the counterfactuals by finding the mean outcome for each
treatment in each leaf among the samples in that leaf that received that treatment using
Equation (3.5). We determine the best treatment to assign at each leaf by summing up the
outcomes (observed or counterfactual as appropriate) of all samples for each treatment, and
then selecting the treatment with the lowest total outcome in the leaf. Finally, we calculate
the two terms of the objective using the means and best treatments in each leaf, and add
these terms with the appropriate weighting to calculate the total objective value.
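The per-leaf treatment choice just described can be sketched as follows. Since replacing each sample's unobserved outcome with the leaf mean makes the leaf's total outcome equal to (leaf size) × (treatment mean in the leaf), the sketch uses that shortcut; the minimum-count guard mirrors the n_treatment hyperparameter listed below, and all names are illustrative.

```python
import numpy as np

def leaf_prescriptions(y, z, leaf, m, n_treatment=1):
    """For each leaf, pick the treatment with the lowest total outcome,
    summing observed outcomes for samples that received the treatment and
    leaf-mean counterfactuals for the rest; this sum collapses to
    (leaf size) * (treatment mean in the leaf)."""
    presc = {}
    for l in np.unique(leaf):
        in_leaf = (leaf == l)
        best_t, best_total = None, np.inf
        for t in range(m):
            sel = in_leaf & (z == t)
            if sel.sum() < n_treatment:
                continue  # too few samples to trust this counterfactual
            total = in_leaf.sum() * y[sel].mean()
            if total < best_total:
                best_t, best_total = t, total
        presc[int(l)] = best_t
    return presc
```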
The hyperparameters that control the tree training process are:
• nmin: the minimum number of samples required in each leaf;
• Dmax: the maximum depth of the prescriptive tree;
• α: the complexity parameter that controls the tradeoff between training accuracy and
tree complexity in Problem (3.4);
• ntreatment: the minimum number of samples of a treatment t we need at a leaf before we
are allowed to prescribe treatment t for that leaf. This is to avoid using counterfactual
estimates that are derived from relatively few samples;
• µ: the prescription factor that controls the tradeoff between prescription and prediction
errors in the objective function.
The first three parameters are parameters that appear in the general Optimal Trees
framework (for more detail see Bertsimas and Dunn [2019]), while the final two are specific
to OPTs.
In practice we have found that we can achieve good results for most problems by setting
nmin = 1, ntreatment = 10, and tuning Dmax and α using the procedure outlined in Section 2.4
of Dunn [2018]. We also have seen that setting µ = 0.5 typically gives good results, although
this may need to be tuned to achieve the best performance on a specific problem.
3.3.3 Optimal Prescriptive Trees with Linear Predictions
In Section 3.3.1, we trained OPTs by using the mean treatment outcomes in each leaf as the
counterfactual estimates for the other samples in that leaf. There is nothing special about
our choice to use the mean outcome other than ease of computation, and it seems intuitive
that a better predictive model for the counterfactual estimates could lead to a better final
prescriptive tree. In this section, we propose using linear regression methods as the basis for
counterfactual estimation inside the OPT framework.
Traditionally, regression trees have eschewed linear regression models in the leaves due to
the prohibitive cost of repeatedly fitting linear regression models during the training process,
and instead have preferred to use simpler methods such as predicting the mean outcome in
the leaf. However, the Optimal Trees framework contains approaches for training regression
trees with linear regression models with elastic net regularization [Zou and Hastie, 2005]
in each leaf. It uses fast updates and coordinate descent to minimize the computational
cost of fitting these models repeatedly, providing a practical and tractable way of generating
interpretable regression trees with more sophisticated prediction functions in each leaf.
We propose using this approach for fitting linear regression models from the Optimal
Trees framework for the estimation of counterfactuals in our OPTs. To do this, in each leaf
we fit a linear regression model for each treatment, using only the samples in that leaf that
received the corresponding treatment. We will then use these linear regression models to
estimate the counterfactuals for each sample/treatment pair as necessary, before proceeding
to determine the best treatment overall in the leaf using the same approach as in Section 3.3.
Concretely, in each leaf ℓ of the tree we fit an elastic net model for each treatment t, using the relevant points in the leaf, $\{i : x_i \in X_\ell,\ z_i = t\}$, to obtain regression coefficients $\beta_\ell^t$:

\[
\min_{\beta_\ell^t} \ \frac{1}{2\left|\{i : x_i \in X_\ell,\ z_i = t\}\right|} \sum_{i : x_i \in X_\ell,\ z_i = t} \left( y_i - (\beta_\ell^t)^T x_i \right)^2 + \lambda P_\alpha(\beta_\ell^t), \tag{3.7}
\]

where

\[
P_\alpha(\beta) = (1 - \alpha)\,\tfrac{1}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1. \tag{3.8}
\]

We then estimate the counterfactuals as

\[
\hat{y}_i(t) = (\beta_{\ell(i)}^t)^T x_i, \tag{3.9}
\]

where ℓ(i) is the leaf into which sample i falls. The overall objective function is therefore

\[
\begin{aligned}
\min_{\tau(\cdot),\,\beta} \quad & \mu \left[ \sum_{i=1}^{n} \left( y_i \, \mathbb{1}[\tau(x_i) = z_i] + \sum_{t \neq z_i} (\beta_{\ell(i)}^t)^T x_i \, \mathbb{1}[\tau(x_i) = t] \right) \right] \\
& + (1 - \mu) \left[ \sum_{i=1}^{n} \left( y_i - (\beta_{\ell(i)}^{z_i})^T x_i \right)^2 + \lambda \sum_{t=1}^{m} \sum_{\ell} P_\alpha(\beta_\ell^t) \right],
\end{aligned} \tag{3.10}
\]
where the regression models β are found by solving the elastic net problems (3.7) defined by
the prescriptive tree. Note that we have included the elastic net penalty in the prediction
accuracy term, mirroring the structure of the elastic net problem itself. This is so that our
objective accounts for overfitting the β coefficients in the same way as standard regression.
We solve this problem using the Optimal Regression Trees framework from Bertsimas and
Dunn [2019], modified to fit a regression model for each treatment in each leaf, rather than
just a single regression model per leaf.
There are two additional hyperparameters in this model over the model in Section 3.3,
namely the degree of regularization in the elastic net λ and the parameter α controlling the
trade-off between `1 and `2 norms in (3.8). We have found that we obtain strong results using
only the `1 norm, and so this is what we use in all experiments. We select the regularization
parameter λ through validation.
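With α = 1, the per-leaf, per-treatment fit in (3.7) reduces to a Lasso problem. The following is an illustrative coordinate-descent solver and wrapper of our own; a production implementation would add intercepts, convergence tolerances, and warm starts.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent solver for
    min_b  1/(2n) ||y - X b||^2 + lam ||b||_1,
    matching the scaling of (3.7) with alpha = 1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ b + X[:, j] * b[j]        # partial residual
            rho = X[:, j] @ r_j / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

def leaf_linear_counterfactual(X, y, z, in_leaf, t, lam=0.1):
    """Fit the leaf's per-treatment model and return the linear
    counterfactual predictor x -> (beta_l^t)^T x, as in (3.9)."""
    sel = in_leaf & (z == t)
    beta = lasso_cd(X[sel], y[sel], lam)
    return lambda x_new: float(x_new @ beta)
```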
3.4 Performance of OPTs on Synthetic Data
In this section, we design simulations on synthetic datasets to evaluate and compare the
performance of our proposed methods with other approaches. Since the data set is simulated,
the counterfactuals are fully known, which enables us to compare with the ground truth. In
the remainder of this section, we present our motivation behind these experiments, describe
the data generating process and the methods we compare, followed by computational results
and conclusions.
3.4.1 Motivation
The general motivation of these experiments is to investigate the performance of the OPT
method for various choices of synthetic data. Specifically, as part of these experiments, we
seek to answer the following questions.
1. How well does each method prescribe, i.e., recover the decision boundary $\{x \in \mathbb{R}^d : y_0(x) = y_1(x)\}$?
2. How accurate are the predicted outcomes?
3.4.2 Experimental Setup
Our experimental setup is motivated by that of Powers et al. [2017]. In our experiments, we generate n data points x_i, i = 1, . . . , n, where each x_i ∈ R^d. Each x_i is generated i.i.d. such that the odd-numbered coordinates j are sampled as x_ij ∼ Normal(0, 1), while the even-numbered coordinates j are sampled as x_ij ∼ Bernoulli(0.5).
Next, we simulate the observed outcomes under each treatment. We restrict the scope
of these simulations to two treatments (0 and 1) so that we can include in our comparison
methods those that only support two treatments. For each experiment, we define a baseline
function that gives the base outcome for each observation and an effect function that models
the effect of the treatment being applied. Both of these are functions of the covariates
x, centered and scaled to have zero mean and unit variance. The outcome yt under each
treatment t as a function of x is given by
\[
y_0(x) = \text{baseline}(x) - \tfrac{1}{2}\,\text{effect}(x), \qquad
y_1(x) = \text{baseline}(x) + \tfrac{1}{2}\,\text{effect}(x).
\]
Finally, we assign treatments to each observation. In order to simulate an observational
study, we assign treatments based on the outcomes for each treatment so that treatment 1 is
typically assigned to observations with a large outcome under treatment 0, which are likely
to realize a greater benefit from this prescription. Concretely, we assign treatment 1 with
the following probability:
\[
\mathbb{P}(Z = 1 \mid X = x) = \frac{e^{y_0(x)}}{1 + e^{y_0(x)}}.
\]
In the training set, we add noise εi ∼ Normal(0, σ2) to the outcomes yi corresponding to
the selected treatment.
We consider three different experiments with different forms for the baseline and effect
functions and differing levels of noise:
1. The first experiment has low noise, σ = 0.1, a linear baseline function, and a piecewise
constant effect function:
\[
\text{baseline}(x) = x_1 + x_3 + x_5 + x_7 + x_8 + x_9 - 2, \qquad
\text{effect}(x) = 5\,\mathbb{1}(x_1 > 1) - 5.
\]
2. The second experiment has moderate noise, σ = 0.2, a constant baseline function, and
a piecewise linear effect function:
\[
\text{baseline}(x) = 0, \qquad
\text{effect}(x) = 4\,\mathbb{1}(x_1 > 1)\,\mathbb{1}(x_3 > 0) + 4\,\mathbb{1}(x_5 > 1)\,\mathbb{1}(x_7 > 0) + 2 x_8 x_9.
\]
3. The third experiment has high noise, σ = 0.5, a piecewise constant baseline function,
and a quadratic effect function:
\[
\text{baseline}(x) = 5\,\mathbb{1}(x_1 > 1) - 5, \qquad
\text{effect}(x) = \tfrac{1}{2}\left( x_1^2 + x_2 + x_3^2 + x_4 + x_5^2 + x_6 + x_7^2 + x_8 + x_9^2 - 11 \right).
\]
For each experiment, we generate training data with n = 1,000 and d = 20 as described
above. We also generate a test set with n = 60,000 using the same process, without adding
noise. In the test set, we know the true outcome for each observation under each treatment,
so we can identify the correct prescription for each observation.
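The generation process above can be sketched for Experiment 1 as follows (function names are ours; the centering and scaling of the baseline and effect functions is omitted for brevity):

```python
import numpy as np

def make_data(n=1000, d=20, sigma=0.1, seed=0):
    """Generate covariates, treatments, and outcomes for Experiment 1."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, d))
    # 1-indexed odd coordinates are 0-indexed even columns, and vice versa
    X[:, 0::2] = rng.normal(0.0, 1.0, (n, (d + 1) // 2))  # Normal(0, 1)
    X[:, 1::2] = rng.binomial(1, 0.5, (n, d // 2))        # Bernoulli(0.5)

    baseline = X[:, 0] + X[:, 2] + X[:, 4] + X[:, 6] + X[:, 7] + X[:, 8] - 2
    effect = 5.0 * (X[:, 0] > 1) - 5.0
    y0 = baseline - effect / 2
    y1 = baseline + effect / 2

    # outcome-dependent assignment, mimicking an observational study
    p1 = np.exp(y0) / (1 + np.exp(y0))
    z = rng.binomial(1, p1)
    y = np.where(z == 1, y1, y0) + rng.normal(0.0, sigma, n)  # noisy training outcome
    return X, y, z, y0, y1
```

The noiseless potential outcomes y0 and y1 are returned as well, since on the test set they identify the correct prescription for each observation.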
For each method, we train a model using the training set, and then use the model to
make prescriptions on the test set. We consider the following metrics for evaluating the
quality of prescriptions:
• Treatment Accuracy: the proportion of the test set where the prescriptions are correct;
• Effect Accuracy: the R² of the effects, y(1) − y(0), predicted by the model for each observation in the test set, compared against the true effect for each observation.
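Both metrics can be computed directly when, as here, the true potential outcomes on the test set are known (an illustrative implementation with names of our choosing):

```python
import numpy as np

def treatment_accuracy(presc, y0, y1):
    """Fraction of samples prescribed the truly better (lower-outcome) arm."""
    best = (y1 < y0).astype(int)
    return np.mean(presc == best)

def effect_accuracy(effect_hat, y0, y1):
    """Out-of-sample R^2 of the predicted effect y(1) - y(0); this can be
    negative when predictions are worse than the constant mean effect."""
    effect = y1 - y0
    ss_res = np.sum((effect - effect_hat) ** 2)
    ss_tot = np.sum((effect - effect.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```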
We run 100 simulations for each experiment and report the average values of treatment
and effect accuracy on the test set.
3.4.3 Methods
We compare the following methods:
• Prescription Trees: We include four prescriptive tree approaches:
– Personalization trees, denoted PT (recall that these are the same as OPT with
µ = 1 but trained with a greedy approach);
– OPT with µ = 1 and µ = 0.5, denoted OPT(1) and OPT(0.5), respectively;
– OPT with µ = 0.5 and with linear counterfactual estimation in each leaf, denoted
OPT(0.5)-L.
• Regress-and-compare: We include three regress-and-compare approaches where the
underlying regression uses either Optimal Regression Trees (ORT), LASSO regression
or random forests, denoted RC–ORT, RC–LASSO and RC–RF, respectively. For each
sample in the test set, we prescribe the treatment that leads to the lowest predicted
outcome.
Figure 3-3: Effect and Treatment accuracy results for Experiment 1. (The figure compares PT, OPT(1), OPT(0.5), OPT(0.5)-L, RC-ORT, RC-LASSO, RC-RF, and CF.)
• Causal Methods: We include the method of causal forests [Wager and Athey, 2018]
with the default parameters. While causal forests are intended to estimate the indi-
vidual treatment effect, we use the sign of the estimated individual treatment effect
to determine the choice of treatment. Specifically, we prescribe 1 if the estimated
treatment effect for that sample is negative, and 0, otherwise.
We also tested causal MARS on all examples, but it performed similarly to causal
forests, and hence was omitted from the results for brevity.
Notice that causal forests and OPTs are joint learning methods—the training data for these
approaches is the whole sample that includes both the treatment and control groups, as
opposed to regress-and-compare methods which split the data and develop separate models
for observations with z = 0 and z = 1.
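To make the contrast with joint learning concrete, the regress-and-compare pipeline can be sketched as follows. This is an illustrative sketch, not our experimental code: a toy 1-nearest-neighbour regressor stands in for ORT, LASSO, or random forests, and the function names are ours.

```python
import numpy as np

def fit_outcome_model(X, y):
    """Toy 1-nearest-neighbour regressor standing in for any learner
    (ORT, LASSO, random forests, ...)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    def predict(Xq):
        Xq = np.atleast_2d(np.asarray(Xq, dtype=float))
        d = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return y[d.argmin(axis=1)]
    return predict

def regress_and_compare(X, y, z, X_test, treatments=(0, 1)):
    """Fit one outcome model per treatment arm on that arm's data only,
    then prescribe the treatment with the lowest predicted outcome."""
    X, y, z = np.asarray(X, dtype=float), np.asarray(y, dtype=float), np.asarray(z)
    models = {t: fit_outcome_model(X[z == t], y[z == t]) for t in treatments}
    preds = np.column_stack([models[t](X_test) for t in treatments])
    return np.asarray(treatments)[preds.argmin(axis=1)]
```

The key structural point is the per-arm split in `regress_and_compare`: each model sees only the observations that received its treatment, whereas joint learning methods fit a single model on the pooled sample.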
3.4.4 Results
Figure 3-3 shows the results for Experiment 1. In this experiment, the boundary func-
tion is piecewise constant and the individual outcome functions are both piecewise linear.
The true decision boundary is x1 = 1, and the regions x1 > 1 and x1 ≤ 1 each have
constant treatment effect. The true response function in each of these regions is linear.
OPT(0.5)-L outperforms all the three regress-and-compare approaches and causal forests
(CF) both in treatment and effect accuracy. There is a marked improvement from OPT(0.5)
[Tree: a single split on x1 < 1.0011, with one leaf prescribing treatment 0 and the other treatment 1]
Figure 3-4: Tree constructed by OPT(0.5)-L for an instance of Experiment 1.
Figure 3-5: Effect and Treatment accuracy results for Experiment 2.
to OPT(0.5)-L with the addition of linear regression in the leaves, which is unsurprising as
this models exactly the truth in the data. The poorest performing method is the greedy PT,
which has both low treatment accuracy, and very poor effect accuracy (note that the out of
sample R2 can be negative). OPT(1) improves slightly in the treatment accuracy, but the
effect accuracy is still poor. OPT(0.5) shows a large improvement in both the treatment
and effect accuracies over PT and OPT(1), which demonstrates the importance of consid-
ering both the prescriptive and predictive components with the prescriptive factor µ in the
objective function.
Figure 3-4 shows the tree for one of the instances of Experiment 1 under OPT(0.5)-L.
Recall, the boundary function for this experiment was simply x1 = 1, which is correctly
identified by the tree. This particular tree has a treatment accuracy of 0.99, reflecting
the accuracy of the boundary function, and an effect accuracy of 0.90, showing that the
linear regressions within each leaf provide high quality estimates of the outcomes for both
treatments.
The results for Experiment 2 are shown in Figure 3-5. This experiment has a piecewise
linear boundary with piecewise linear individual outcome functions, with moderate noise.
OPT(0.5)-L is again the strongest performing method in both treatment and effect accu-
racies, followed by OPT(0.5) and Causal Forests. All prescriptive tree methods have good
treatment accuracy, showing that these tree models are able to effectively learn the indicator
terms in the outcome functions of both arms. We again see that OPT(0.5) and OPT(0.5)-L
improve upon the other tree methods, particularly in effect accuracy, as a consequence of
incorporating the predictive term in the objective. The linear trends in the outcome func-
tions of this experiment are not as strong as in Experiment 1, and so the improvement of
OPT(0.5)-L over OPT(0.5) is not as large as before.
We observe that the joint learning methods perform better than the regress-and-compare
methods in this example even though the outcome functions for both the treatment options
do not have a common component (the baseline function is 0). We believe this is because
both the methods included here, causal forests and prescriptive trees, can learn local effects
effectively. Note that the structure of the boundary function is such that the function is
either constant or linear in different buckets.
We plot the tree from OPT(0.5)-L for an instance of this experiment in Figure 3-6. This
particular tree has a treatment accuracy of 0.925, which indicates that it has learned the
decision boundary effectively, along with an effect accuracy of 0.82. We make the following
observations from this tree.
1. Recall that the true boundary function for this experiment only involves the variables
x1, x3, x5, x7, x8, and x9, and none of the remaining variables x2 to x20. From the
figure, we see that this tree likewise does not split on any of the irrelevant variables,
i.e., it has a zero false positive rate.
2. By inspecting the splits on the variables x1, x3, x5 and x7, we note that the tree has
learned thresholds of close to 0 for x3 and x7, and 1 for x1 and x5, which matches with
the ground truth for these variables.
3. Examining the tree more closely, we see that the prescriptions reflect the reality of
[Tree with splits on x1 < 0.9971, x3 < −0.0123, x5 < 1.0005, x7 < 0.0008, x8 = 0, and x9 (thresholds 0.6408, 0.7220, 1.7479); leaves prescribe treatments 0 and 1]
Figure 3-6: Tree constructed by OPT(0.5)-L for an instance of Experiment 2.
Figure 3-7: Effect and Treatment accuracy results for Experiment 3.
which outcome is best. For example, when x1 ≥ 0.9971 and x3 ≥ −0.0123, the tree
prescribes 0. This corresponds to the ground truth of the 4·1(x1 > 1)·1(x3 > 0) term
becoming active, which makes it likely that treatment 1 leads to larger (worse) outcomes.
We also see that the linear component in the outcome functions is reflected in
the tree, as the tree assigns treatment 0 when x9 is larger, which corresponds to the
linear term in the outcome function being large.
4. Finally, we note that the tree has a split where both the terminal leaves prescribe the
same treatment, which can initially seem odd. However, recall that the objective term
contains both prescription and prediction errors, and a split like this can improve the
prediction term in the objective, and hence the overall objective value, even though
none of the prescriptions are changed.
Finally, Figure 3-7 shows the results from Experiment 3. This experiment has high noise
and a nonlinear quadratic boundary. Overall, regress-and-compare random forest and causal
forest are the best-performing methods, followed closely by OPT(0.5)-L, demonstrating that
all three methods are capable of learning complicated nonlinear relationships, both in the
outcome functions and in the decision boundary. The treatment accuracy is comparable
for all prescriptive tree methods, but PT and OPT(1) have very poor effect accuracy. This
again demonstrates the importance of controlling for the prediction error in the objective.
In this experiment, we see that regress-and-compare random forests performs comparably
to causal forests, which was not the case for the other two experiments. We believe that
this is because the baseline function is relatively simple compared to the effect function,
which leads to the absence of strong common trends within the two treatment outcome
functions. This could make it more difficult to effectively learn from both groups jointly,
mitigating the benefits of combining the groups in training. Consequently, in this setting
regress-and-compare methods could have performance closer to joint learning methods.
3.4.5 Multiple treatments
In this section, we consider a synthetic example with three treatments. We generate the
covariates from the same distribution as before. We simulate the observed outcomes under
each treatment as
y0(x) = baseline(x),
y1(x) = baseline(x) + effect1(x),
y2(x) = baseline(x) + effect2(x).
Finally, we assign treatments to each observation. As before, we typically assign treatment 0
to observations when the baseline is small, and typically assign 1 and 2 with equal probability
when the baseline is higher. Concretely, we assign treatments with the following probabilities:
P(Z = 0 | X = x) = 1 / (1 + e^{y0(x)}),
P(Z = 1 | X = x) = P(Z = 2 | X = x) = (1/2) (1 − P(Z = 0 | X = x)).
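Given these probabilities, a treatment can be sampled per observation along the following lines (an illustrative sketch; the function name and the `rng` argument are ours):

```python
import numpy as np

def sample_treatment(baseline_value, rng):
    """Draw a treatment in {0, 1, 2}: treatment 0 is likely when the
    baseline outcome y0(x) is small, while treatments 1 and 2 share the
    remaining probability equally."""
    p0 = 1.0 / (1.0 + np.exp(baseline_value))
    probs = np.array([p0, 0.5 * (1.0 - p0), 0.5 * (1.0 - p0)])
    return int(rng.choice(3, p=probs))
```

A small baseline value drives p0 toward 1 and hence treatment 0; a large baseline value splits nearly all the probability mass between treatments 1 and 2.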
We consider the following experiment with the baseline and two effect functions given
by:
baseline(x) = 4·1(x1 > 1)·1(x3 > 0) + 4·1(x5 > 1)·1(x7 > 0) + 2 x8 x9,
effect1(x) = 5·1(x1 > 1) − 5,
effect2(x) = (1/2)(x1^2 + x2 + x3^2 + x4 + x5^2 + x6 + x7^2 + x8 + x9^2 − 11),
and the noise level σ = 0.1.
We generate training data with n = 1,000 and d = 20 and add noise εi ∼ Normal(0, σ²)
to the outcomes yi corresponding to the selected treatment. As before, we generate a test
set with n = 60,000 using the same process, without adding noise. In the test set, we know
the true outcome for each observation under each treatment, so we can identify the correct
prescription for each observation.
For each method, we train a model using the training set, and then use the model to
make prescriptions on the test set. We consider the following metrics for evaluating the
quality of prescriptions:
• Treatment Accuracy: as defined in Section 3.4.2;
• Outcome Accuracy: the R2 of the outcome under the prescribed treatment z as predicted
by the model for each observation in the test set, compared against the true outcome
of that prescription, y(z), for each observation.
We run 100 simulations for each experiment and report the average values of treatment
and outcome accuracy on the test set. We include the same methods as for the previous
experiments with the exception of causal forests, as they only support two treatments.
Results
Figure 3-8 shows the results for Experiment 4, where the baseline function is piecewise
constant and the individual effect functions are piecewise linear and nonlinear respectively.
OPT(0.5) and OPT(0.5)-L outperform all the other methods both in treatment and out-
come accuracy. Importantly, both these methods have the highest treatment accuracy, which
Figure 3-8: Outcome and Treatment accuracy results for Experiment 4 with three treatments.
indicates that they estimate the decision boundary reasonably well, unlike RC–RF, which
has high outcome accuracy but low treatment accuracy. As in the experiments
with two treatments, OPT(0.5) shows a large improvement in both the treatment
and outcome accuracies over PT and OPT(1), which again demonstrates the importance of
considering both the prescriptive and predictive components with the prescriptive factor µ in
the objective function. Overall, this experiment provides strong evidence that our approach
continues to perform well when there are more than two treatments.
Impact of incorrect prescriptions
In the context of Experiment 4, we will now investigate the impact of the various algorithms
making incorrect prescriptions. In particular, we are interested in how much the predicted
outcome can deviate from the actual truth when making an incorrect prescription, i.e. the
seriousness of the mistake. To this end, we considered the results from Experiment 4 and in
every case where an algorithm made an incorrect prescription we calculated the difference
between the true outcome under an algorithm’s incorrect prescription and the true outcome
under the optimal prescription. Note that this difference is always nonnegative.
Figure 3-9 shows the distributions of these errors in outcomes under incorrect prescrip-
tions. We see that all algorithms have similar medians and spreads, with RC–RF having the
smallest spread. We also see that the upper tail of the error distribution is similar between
Figure 3-9: Error in prescribed outcome due to incorrect prescription.
PT, OPT(0.5)–L, RC–ORT and RC–RF, while it is higher for OPT(1), OPT(0.5) and RC–
LASSO, indicating that incorrect prescriptions made by these methods could possibly be
more serious than the others in the very extreme cases. However, overall these results give
evidence that all of the methods are roughly similar in terms of the errors made as a result
of incorrect prescriptions.
3.4.6 Discussion and Conclusions
In terms of both prescriptive and predictive performance, we provide evidence that our
method performs comparably with, or even outperforms the state-of-the-art methods, as
evidenced by both treatment and effect accuracy metrics. Additionally, the main advantage
of prescriptive trees is that they provide an explicit representation of the decision boundary,
as opposed to the other methods where the boundary is only implied by the learned outcome
functions. This lends credence to our claim that the trees are interpretable. In fact, from
our discussion of the trees obtained for Experiments 1 and 2 in Figures 3-4 and 3-6, the
trees correctly learn the true decision boundary in the data.
We also found that regress-and-compare methods that fit separate functions for each
treatment are generally outperformed by joint learning methods that learn from the entire
dataset. We note that if there were an infinite amount of data and the regress-and-compare
methods could learn the individual outcome functions perfectly, then they would also learn
the decision boundary perfectly. However, for practical problems with finite sample sizes,
we have strong evidence that the performance can be much worse than the joint learning
methods.
3.5 Performance of OPTs on Real World Data
In this section, we apply prescriptive trees to some real world problems to evaluate the
performance of our OPTs in a practical setting. The first two problems belong to the area
of personalized medicine, which are personalized warfarin dosing and personalized diabetes
management. Next, we provide personalized job training recommendations to individuals,
and finally conclude with an example where we estimate the personalized treatment effect
of high quality child care specialist home visits on the future cognitive test scores of infants.
3.5.1 Personalized Warfarin Dosing
In this section, we test our algorithm in the context of personalized warfarin dosing. Warfarin
is the most widely used oral anticoagulant agent worldwide. Its appropriate dose can vary
by a factor of ten among patients and hence can be difficult to establish, with incorrect doses
contributing to severe adverse effects [Consortium et al., 2009]. Physicians who prescribe
warfarin to their patients must constantly balance the risks of bleeding and clotting. The
current guideline is to start the patient at 5 mg per day, and then vary the dosage based on
how the patient reacts until a stable therapeutic dose is reached [Jaffer and Bragg, 2003].
The publicly available dataset we use was collected and curated by staff at the
Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) and members of the
International Warfarin Pharmacogenetics Consortium. One advantage of this dataset is that
it gives us access to counterfactuals—it contains the true stable dose for each patient found
by physician controlled experimentation for 5,528 patients. The patient covariates include
demographic information (sex, race, age, weight, height), diagnostic information (reason for
treatment, e.g., deep vein thrombosis etc.), pre-existing diagnoses (indicators for diabetes,
congestive heart failure, smoker status etc.), current medications (Tylenol etc.), and genetic
information (presence of genotype polymorphisms of CYP2C9 and VKORC1). The correct
dose of warfarin was split into three dose groups: low (≤ 3 mg/day), medium (> 3 and
< 5 mg/day), and high (≥ 5 mg/day), which we consider as our three possible treatments
0, 1, and 2.
Our goal is to learn a policy that prescribes the correct dose of warfarin for each patient
in the test set. In this dataset, we know the correct dose for each patient, and so we consider
the following two approaches for learning the personalization policy.
Personalization when counterfactuals are known
Since we know the correct treatment z∗i for each patient, we can simply develop a prediction
model that predicts the optimal treatment z given covariates x. This is a standard multi-
class classification problem, and so we can use off-the-shelf algorithms for this problem.
Solving this classification problem gives us a bound on the performance of our prescriptive
algorithms, as this is the best we could do if we had perfect information.
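When the optimal dose z∗i is known for every training patient, this reduces to fitting any classifier from x to z∗ and measuring its test error. A minimal dependency-free sketch (a toy 1-nearest-neighbour classifier stands in for the off-the-shelf learners; all names are illustrative):

```python
import numpy as np

def fit_classifier(X, z_star):
    """Toy 1-NN stand-in for an off-the-shelf multi-class classifier
    trained to predict the optimal treatment from covariates."""
    X, z_star = np.asarray(X, dtype=float), np.asarray(z_star)
    def predict(Xq):
        Xq = np.atleast_2d(np.asarray(Xq, dtype=float))
        d = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return z_star[d.argmin(axis=1)]
    return predict

def misclassification_rate(predict, X_test, z_star_test):
    """Error rate of the learned policy against the known optimal doses."""
    return float(np.mean(predict(X_test) != np.asarray(z_star_test)))
```

The error rate of such a classifier, trained with full knowledge of the optimal treatments, serves as the benchmark against which the prescriptive methods are compared.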
Personalization when counterfactuals are unknown
Since it is unlikely that a real world dataset will consist of these optimal prescriptions, we
reassign some patients in the training set to other treatments so that their assignment is no
longer optimal. To achieve this, we follow the setup of Kallus [2017b], and assume that the
doctor prescribes warfarin dosage according to the following probabilistic assignment model:
P(Z = t | X = x) = (1/S) exp((t − 1)(BMI − µ)/σ),   (3.11)
where µ,σ are the population mean and standard deviation of patients’ BMI respectively,
and the normalizing factor
S = ∑_{t=1}^{3} exp((t − 1)(BMI − µ)/σ).
We use this probabilistic model to assign each patient i in the training set a new treatment
zi, and then set yi = 0 if zi = z∗i (the assigned dose is the correct one), and yi = 1
otherwise. We proceed to train our methods using the training data (xi, yi, zi), i = 1, . . . , n.
This allows us to evaluate the performance
of various prescriptive methods on data which is closer to real world observational data.
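The assignment probabilities in (3.11) can be sketched as follows (illustrative only; the function name is ours):

```python
import numpy as np

def dose_probs(bmi, mu, sigma):
    """P(Z = t | X = x) for t = 1, 2, 3 under model (3.11): the higher a
    patient's BMI relative to the population mean, the likelier a higher
    dose group."""
    logits = np.array([(t - 1) * (bmi - mu) / sigma for t in (1, 2, 3)])
    w = np.exp(logits)
    return w / w.sum()  # division by S normalizes the probabilities
```

At BMI equal to the population mean, the three dose groups are equally likely; as BMI rises, the distribution shifts toward the higher doses.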
Experiments
In order to test the efficacy with which our algorithm learns from observational data, we
split the data into training and test sets, where we vary the size of the training set as
n = 500, 600, . . . , 2500, and the size of the test set is fixed as ntest = 2500. We perform 100
replications of this experiment for each n, where we resample the training and test sets of
respective sizes without replacement each time. We report the misclassification (error) rate
on the test set, noting that the full counterfactuals are available on the test set.
We compare the methods described in Section 3.4.3, but do not include OPT(0.5)-L as we
did not observe any benefit when adding continuous estimates of the counterfactuals, possi-
bly due to the discrete nature of the outcomes in the problem. We also do not include causal
forests as the problem has more than two treatments. Additionally, to evaluate the perfor-
mance of prescriptions when all outcomes are known, we treat the problem as a multi-class
classification problem and solve using off-the-shelf algorithms as described in Section 3.5.1.
We use random forests [Breiman, 2001], denoted Class–RF, and logistic regression, denoted
Class–LR.
In Figure 3-10, we present the out-of-sample misclassification rates for each approach.
We see that, as expected, the classification methods perform the best with random forests
having the lowest overall error rate, reaching around 32% at n = 2,500. This provides a
concrete lower bound for the performance of the prescriptive approaches to be benchmarked
Figure 3-10: Misclassification rate for warfarin dosing prescriptions as a function of training set size.
against.
The greedy PT approach has stronger performance than the OPT methods at low n, but
as n increases this advantage disappears. At n = 2,500, the OPT(1) algorithm outperforms PT
by about 0.6%, which shows the improvement that is gained by solving the prescriptive tree
problem holistically rather than in a greedy fashion. OPT(0.5) improves further upon this
by 0.6%, demonstrating the value achieved by accounting for the prediction error in addition
to the prescriptive error. The trees generated by OPT(1) and OPT(0.5) were also smaller
than those from PT, making them more easily interpretable.
Finally, the regress-and-compare approaches both perform similarly, outperforming all
prescriptive tree methods. We note that this is the opposite result to that found by Kallus
[2017b], where the prescriptive trees were the strongest. We suspect the discrepancy is be-
cause they did not include random forests or LASSO as regress-and-compare approaches,
only CART, k-NN, logistic regression and OLS regression which are all typically weaker
methods for regression, and so the regressions inside the regress-and-compare were not as
powerful, leading to diminished regress-and-compare performance. It is perhaps not surpris-
ing that the regress-and-compare approaches are more powerful in this example; they are
able to choose the best treatment for each patient based on which treatment has the best
prediction, whereas the prescription tree can only make prescriptions for each leaf, based on
which treatment works well across all patients in the leaf. This added flexibility leads to
more refined prescriptions, but at a complete loss of interpretability which is a crucial aspect
of the prescription tree.
Overall, our results show that there is a substantial advantage to both solving the pre-
scriptive tree problem with a view to global optimality, and accounting for the prediction
error as well as the prescription error while optimizing the tree.
3.5.2 Personalized Diabetes Management
In this section, we apply our algorithms to personalized diabetes management using patient
level data from Boston Medical Center (BMC). This dataset consists of electronic medical
records for more than 1.1 million patients from 1999 to 2014. We consider more than
100,000 patient visits for patients with type 2 diabetes during this period. Patient features
include demographic information (sex, race, etc.), treatment history, and diabetes
progression. This dataset was first considered in Bertsimas et al. [2017], where the authors
propose a k-nearest neighbors (kNN) regress-and-compare approach to provide personalized
treatment recommendations for each patient from the 13 possible treatment regimens. We
compare our prescriptive trees method to several regress-and-compare based approaches,
including the previously proposed kNN approach.
We follow the same experimental design as in Bertsimas et al. [2017]. The data is split
50/50 into training and testing. The models are constructed using the training data and then
used to make prescriptions on the testing data. The quality of the predictions on the testing
data is evaluated using a kNN approach to impute the counterfactuals on the test set—we
also considered imputing the counterfactuals using LASSO and random forests and found
the results were not sensitive to the imputation method. We use the same three metrics
to evaluate the various methods: the mean HbA1c improvement relative to the standard
of care; the percentage of visits for which the algorithm’s recommendations differed from
the observed standard of care; and the mean HbA1c benefit relative to standard of care for
patients where the algorithm’s recommendation differed from the observed care.
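The kNN imputation of counterfactuals used for this evaluation could be sketched as follows (our own minimal version; the function and parameter names are illustrative):

```python
import numpy as np

def impute_counterfactuals(X, y, z, X_test, treatments, k=10):
    """For each test point and each treatment, impute the counterfactual
    outcome as the mean outcome of the k nearest training neighbours
    that actually received that treatment."""
    X, y, z = np.asarray(X, dtype=float), np.asarray(y, dtype=float), np.asarray(z)
    X_test = np.atleast_2d(np.asarray(X_test, dtype=float))
    out = np.empty((len(X_test), len(treatments)))
    for j, t in enumerate(treatments):
        Xt, yt = X[z == t], y[z == t]  # training patients who got treatment t
        d = ((X_test[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
        nn = np.argsort(d, axis=1)[:, :min(k, len(Xt))]
        out[:, j] = yt[nn].mean(axis=1)
    return out
```

With the imputed outcome matrix in hand, any candidate policy can be scored on the test set by reading off, for each patient, the imputed outcome under the prescribed treatment.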
We varied the number of training samples from 1,000–50,000 (with the test set fixed)
to examine the effect of the amount of training data on out-of-sample performance. We
repeated this process for 100 different splittings of the data into training and testing to
minimize the effect of any individual split on our results.
In addition to methods defined in Section 3.4.3, we compare the following approaches:
• Baseline: The baseline method continues the current line of care for each patient.
• Oracle: For comparison purposes, we include an oracle method that selects the best
outcome for each patient using the imputed counterfactuals on the test set. This
method therefore represents the best possible performance on the data.
• Regress-and-compare: In addition to RC–LASSO, RC–RF, we include k-nearest
neighbors regress-and-compare, denoted RC–kNN, to match the approaches used in Bertsimas
et al. [2017].
The results of the experiments are shown in Figure 3-11. We see that our results for the
regress-and-compare methods mirror those of Bertsimas et al. [2017]; RC–kNN is the best
performing regression method for prescriptions, and the performance increases with more
training data. RC–LASSO increases in performance with more data as well, but performs
uniformly worse than kNN. RC–RF performs strongly with limited data, but does not im-
prove as more training data becomes available. OPT(0.5) offers the best performance across
all training set sizes. Compared to RC–kNN, OPT(0.5) is much stronger at smaller training
set sizes, supporting our intuition that it makes better use of the data by considering all
treatments simultaneously rather than partitioning based on treatment. At higher training
set sizes, the performance behaviors of RC–kNN and OPT(0.5) become similar, suggesting
that the methods may be approaching the performance limits of the dataset.
These computational experiments offer strong evidence that the prescriptions of OPT are
at least as strong as those from RC–kNN, and much stronger at smaller training set sizes.
The other critical advantage is the increased interpretability of OPT compared to RC–kNN,
Figure 3-11: Comparison of methods for personalized diabetes management. The leftmost plot shows the overall mean change in HbA1c across all patients (lower is better). The center plot shows the mean change in HbA1c across only those patients whose prescription differed from the standard-of-care. The rightmost plot shows the proportion of patients whose prescription was changed from the standard-of-care.
which is itself already more interpretable than other regress-and-compare approaches. To
interpret the RC–kNN prescription for a patient, one must first find the set of nearest
neighbors to this point among each of the possible treatments. Then, in each group of
nearest neighbors, we must identify the set of common characteristics that determine the
efficacy of the corresponding treatment on this group of similar patients. When interpreting
the OPT prescription, the tree structure already describes the decision mechanism for the
treatment recommendation, and is easily visualizable and readily interpretable.
3.5.3 Personalized Job Training
In this section, we apply our methodology on the Jobs dataset [LaLonde, 1986], a widely
used benchmark dataset in the causal inference literature, where the treatment is job train-
ing and the outcomes are the annual earnings after the training program. This dataset is
obtained from a study based on the National Supported Work program and can be down-
loaded from http://users.nber.org/~rdehejia/nswdata2.html. This study consists of
297 and 425 individuals in the control and treated groups respectively, where the treatment
indicator zi is 1 if the subject received job training in 1976–77, and 0 otherwise. The dataset
has seven covariates, which include age, education, race, marital status, whether the individual
earned a degree, and prior earnings (earnings in 1975); the outcome yi is 1978
annual earnings.
We split the full dataset into 70/30 training/testing samples, and averaged the results
over 100 such splits to plot the out of sample average personalized income. Since the counter-
factuals are not known for this example we employ a nearest neighbor matching algorithm,
identical to the one used in Section 3.5.2, to impute the counterfactual values on the test
set. Using these imputed values, we compute the cost of policies prescribed by each of the
following methods. Note that for this example, the higher the out of sample income, the
better.
We compare the same methods as Section 3.5.2 with the addition of causal forests as this
problem only has two treatment options.
Method        Average income ($)   Standard error ($)
Baseline      5467.09              10.81
CF            5908.23              17.92
RC–kNN        5913.44              17.79
RC–RF         5916.22              17.78
RC–LASSO      5990.85              18.94
OPT(0.5)-L    6000.02              18.07
Oracle        7717.96              17.16

Table 3.1: Average personalized income on the test set for various methods.
Figure 3-12: Out-of-sample average personalized income as a function of inclusion rate.
In Table 3.1, we present the average net personalized income on the test set, as prescribed
by each of the five methods. For each method, we only prescribe a treatment for an individual
in the test set if the predicted treatment effect for that individual is higher than a certain
value δ > 0, whose value we vary and choose such that it leads to the highest possible
predicted average test set income. We find the best such δ for each instance, and average
the best prescription income over 100 realizations for each method. From the results, we see
that OPT(0.5)-L obtains an average personalized income of $6000, which is higher than the
other methods. The next closest method is RC–LASSO, which obtains an average income
of $5991.
In Figure 3-12, we present the out-of-sample incomes as a function of the fraction of
subjects for which the intervention is prescribed (the inclusion rate), which we obtain by
varying the threshold δ described above. We see that the average income in the test set is
highest for OPT(0.5)-L at all values of the inclusion rate, indicating that our OPT method
is best able to estimate the personalized treatment effect across all subjects. We also see
that the income peaks at a relatively low inclusion rate, showing that we are able to easily
identify a subset of the subjects with large treatment effect.
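The δ-thresholding procedure described above can be sketched as follows (illustrative; it assumes imputed outcomes under both arms are available for every test subject, and all names are ours):

```python
import numpy as np

def thresholded_policy(effect_pred, y_control, y_treated, delta):
    """Prescribe the intervention only where the predicted treatment
    effect (income gain) exceeds delta; return the inclusion rate and
    the mean realised income under the resulting policy."""
    effect_pred = np.asarray(effect_pred, dtype=float)
    treat = effect_pred > delta
    income = np.where(treat,
                      np.asarray(y_treated, dtype=float),
                      np.asarray(y_control, dtype=float))
    return float(treat.mean()), float(income.mean())
```

Sweeping delta from large to small traces out the inclusion-rate curve of Figure 3-12: a high threshold treats only the subjects with the largest predicted gains, and the best delta per instance is the one maximising the predicted average test-set income.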
3.5.4 Estimating Personalized Treatment Effects for Infant Health
In this section, we apply our method for estimating the personalized treatment effect of high
quality child care specialist home visits on the future cognitive test scores of children. This
dataset is based on the Infant Health Development Program (IHDP) and was compiled by Hill
[2011]. This dataset is commonly used as a benchmark in the causal inference literature.
Since its first usage, several authors [Morgan and Winship, 2014, Wager and Athey, 2018,
Zubizarreta, 2012] have used it in their research for benchmarking methods. Following Hill
[2011], the original randomized control trial was made imbalanced by removing a biased
subset of the group that had specialist home visits. The final dataset consists of 139 and 608
subjects in the treatment and control groups respectively, with zi = 1 indicating treatment
(specialist home visit), and a total of 25 covariates, which include child measurements such as
birth weight, head circumference, weeks born pre-term, and sex; behaviors engaged in during
the pregnancy, such as cigarette smoking and alcohol and drug consumption; and
measurements on the mother at the time she gave birth, such as age, marital status, and
educational attainment.
In this example we focus on estimating the individual treatment effect, since it has been
acknowledged that the program has been successful in raising test scores of treated children
compared to the control group (see references in Hill [2011]). The outcomes are simulated
in such a way that the average treatment effect on the control subjects is positive (setting B
in Hill [2011] with no overlap). However, note that even though the sign and magnitude of
the average treatment effect is known, there is still heterogeneity in the magnitudes of the
Method       Mean accuracy   Standard error
CF           0.543           0.015
RC–LASSO     0.639           0.018
RC–RF        0.704           0.013
OPT(0.5)-L   0.759           0.013

Table 3.2: Average R2 on the test set for various methods for estimating the personalized treatment effect.
individual treatment effects. In all our experiments, we split the data into training/test as
90/10, and compute the error of the treatment effect estimates on the test set compared to
the true noiseless outcomes (known). We average this value over 100 splits of the dataset,
and compare the test set performance for each method.
In Table 3.2, we present the means and standard errors of the R2 of the personalized
treatment effect estimates on the test set, given by each of the four methods. We see that
OPT(0.5)-L obtains the highest average R2 value of 0.759, followed by RC-Random forests
with 0.704. This again gives strong evidence that our OPT methods can deliver high-quality
prescriptions whilst simultaneously maintaining interpretability.
3.6 Conclusions
In this chapter, we present an interpretable approach of personalizing treatments that learns
from observational data. Our method relies on iterative splitting of the feature space, and
can handle the case of more than two treatment options. We apply this method on synthetic
and real world datasets, and illustrate its superior prescriptive power compared to other
state of the art methods.
111
112
Chapter 4
Prescriptive Scenario Reduction for
Stochastic Optimization
4.1 Introduction
A wide range of decision problems that involve optimization under uncertainty are formu-
lated as stochastic optimization problems. For instance, consider a production planning
problem, where the decision maker wishes to make strategic decisions on plant sizing and
allocating resources among plants. Later when demand is realized, the decision maker has to
make tactical decisions about storing, processing and shipping these products to the market
sources, all while ensuring minimal expected costs, satisfying relevant plant capacity con-
straints, and given the first stage decision. Taking this second stage decision-making into
account can ostensibly lead to better first-stage strategic decisions.
More generally, such problems fall in the setting where a practitioner aims to select the
best possible decision that satisfies certain constraints, but with the knowledge that the
outcome of this decision is influenced by the realization of a random event. The quality of
a decision is judged by averaging its cost over all possible realizations of this random event.
These models can be applied to formulate problems in various areas such as finance, energy,
fleet management, and supply chain optimization, to name a few. For a more comprehensive
113
list of applications, we refer the reader to Wallace and Ziemba [2005].
Traditional stochastic optimization formulates this as finding an optimal decision which, among all feasible candidates in the set Z, has the lowest cost when averaged over all possible realizations of the uncertain parameter Y. In other words, these problems can be formulated as

$$\min_{z \in \mathcal{Z}} \; \mathbb{E}_Y[c(z; Y)]. \qquad (4.1)$$
For instance, in inventory management problems, the uncertainty Y may refer to demand
data, or time series of stock returns in portfolio optimization problems. We provide concrete
examples of such cost functions:
• Inventory management (newsvendor problem):

$$c(z; Y) = \max\{\, b(Y - z),\; h(z - Y) \,\}, \qquad (4.2)$$

where Y refers to the demand of the item and z is the amount of inventory (decision variable). The values b > 0 and h > 0 are prespecified parameters that represent the backorder and holding cost respectively.
• Portfolio optimization:

$$c((z, \beta); Y) = -\lambda z' Y + \beta + \frac{1}{\varepsilon}\max\{\, -z'Y - \beta,\; 0 \,\}, \qquad (4.3)$$

where Y and z are vectors of stock returns and corresponding investments (decision variable) respectively, and β is an auxiliary decision variable. Minimizing the cost c((z, β); Y) ensures high returns z′Y, while at the same time controlling the risk, which here is given by the CVaR (Conditional Value-at-Risk) of negative returns at level ε, as

$$\mathrm{CVaR}_\varepsilon(z'Y) = \inf_{\beta}\left\{ \beta + \frac{1}{\varepsilon}\,\mathbb{E}\big[\max\{\, -z'Y - \beta,\; 0 \,\}\big] \right\}.$$

The quantities ε ∈ (0, 1), which parametrizes the risk measure, and λ > 0, the tradeoff between risk and return, are prespecified parameters.
We assume Z, the set of feasible decisions, is a non-empty convex compact set, and is
independent of uncertainty Y .
While we wish to solve Problem (4.1), the true distribution of the uncertainty Y is
typically unknown. Even if it is fully known, solving the exact optimization problem may
not be tractable. In the context of data-driven stochastic optimization, where past data,
consisting of n samples ξ1, . . . , ξn, is assumed to be known, a commonly used approach to
approximate Problem (4.1) is Sample Average Approximation (SAA) [Shapiro et al., 2009a].
Under this approach, the problem we wish to solve is

$$\min_{z \in \mathcal{Z}} \; \frac{1}{n}\sum_{i=1}^{n} c(z; \xi^i). \qquad (4.4)$$
It is easy to see that this approach, in effect, approximates the unknown full distribution
with the empirical distribution with each data point ξi equally probable. In fact, Kleywegt
et al. [2002] show that, under some regularity conditions, the optimal objective value and
solution of Problem (4.4) converge to their counterparts of Problem (4.1) as n increases,
regardless of the distribution of ξ. For more recent advances in SAA, we direct the reader
towards Homem-de Mello and Bayraksan [2014], Rahimian et al. [2018] and the references
therein.
In this chapter, we consider the approach of scenario reduction, which approximates the empirical distribution by a smaller distribution supported on scenarios ζ1, . . . , ζm with corresponding probabilities q1, . . . , qm, for m ≪ n. To be more precise, we use knowledge of the cost function
and constraints while computing this reduced distribution which, as we shall demonstrate,
results in higher quality decisions. In situations where n is very large and even the SAA
problem (4.4) is not tractable, such an approach can substantially improve tractability while
ensuring minimal loss in decision quality. Another key benefit accrued by practitioners, when solutions of higher quality are computed with significantly fewer scenarios, is interpretability. In this chapter, we demonstrate that using optimization to compute this smaller set of scenarios, in a way that takes the cost function into account, can substantially increase accuracy and interpretability.
We note that scenario reduction with the Wasserstein distance and the Euclidean norm is
equivalent to clustering, with the problem reducing to assigning n points to m clusters with
the scenarios chosen as the cluster-mean. The corresponding scenario probability is simply
the size of the cluster divided by n. The central idea in our approach is that these scenarios
and assignments should be chosen keeping in mind their decision quality, rather than the
cost-agnostic least squares (or any general norm) error. We demonstrate that while this
approach leads to more complicated optimization problems, the resulting distributions often
have superior decision quality. This gap is particularly pronounced when the cost function is not symmetric, unlike a norm, which penalizes scenarios simply based on the norm distance between the empirical and new scenarios, irrespective of their decision quality.
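With the squared Euclidean norm, the cost-agnostic reduction just described is exactly Lloyd's k-means algorithm; a minimal numpy sketch (data and sizes are illustrative):

```python
import numpy as np

def wasserstein_scenario_reduction(xi, m, n_iter=50, seed=0):
    """Lloyd's k-means on the sample points: the reduced scenarios are the
    cluster means and each probability is the cluster size divided by n."""
    rng = np.random.default_rng(seed)
    n = len(xi)
    centers = xi[rng.choice(n, size=m, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest scenario (squared Euclidean norm).
        d2 = ((xi[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(m):
            if (labels == j).any():
                centers[j] = xi[labels == j].mean(axis=0)
    probs = np.bincount(labels, minlength=m) / n
    return centers, probs

rng = np.random.default_rng(1)
xi = rng.normal(size=(200, 2))
zeta, q = wasserstein_scenario_reduction(xi, m=4)
assert np.isclose(q.sum(), 1.0)
```

The prescriptive approach developed below replaces the squared-norm assignment and centroid update with cost-aware counterparts.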
4.1.1 Related literature
In this section, we review related work on scenario reduction for stochastic optimization
problems. Dupačová et al. [2003] present theory and algorithms for scenario reduction using
probability metrics, while Heitsch and Römisch [2003] derive bounds for forward and back-
ward scenario selection heuristics. More recently, Rujeerapaiboon et al. [2017] analyze worst
case bounds of scenario reduction using the Wasserstein metric, and propose heuristics with
worst case approximation error guarantees and an exact mixed integer optimization formu-
lation. These heuristics are based on the alternating-minimization algorithm for k-means
clustering [Arya et al., 2004].
More generally, our work fits in the area of research demonstrating the advantages of
optimization over randomization. Some related work includes Bertsimas et al. [2015], which
demonstrates that using optimization to reduce discrepancy between groups, rather than
randomization, leads to stronger inference.
4.1.2 Contributions and Structure
The contributions of this work are as follows:
1. We present a novel optimization based approach for scenario reduction for stochastic
optimization problems. As part of this approach, we introduce the “Prescriptive divergence”, which measures the difference in quality of decisions induced by two discrete distributions, and includes the Wasserstein distance as a special case.
2. We propose scenario reduction in this context, and present algorithms for computing
these scenarios and corresponding probabilities. Our approach relies on an alternating
minimization algorithm, where we solve a sequence of convex optimization problems
for determining the scenarios.
3. Finally, with the help of computational results we demonstrate the effectiveness of
these methods on constrained newsvendor and portfolio optimization problems, both
in-sample and out-of-sample, compared to a traditional Wasserstein-distance based sce-
nario reduction approach. We note that our approach results in improved performance
with a smaller number of scenarios, which improves interpretability for practitioners.
4.1.3 Notation
Let e be the vector of all ones, and ei the ith standard basis vector, of appropriate dimensions. For any positive integer n, we define the set [n] = {1, . . . , n}. We denote a generic norm by ∥ · ∥, while ∥ · ∥p denotes the p-norm, for p ≥ 1. Recall that the Euclidean norm of any vector x is defined as $\|x\|_2 = \sqrt{\sum_i x_i^2}$. For a set X ⊆ Rd, we define P(X, m) as the set of all probability distributions supported on at most m points belonging to X. The support of a probability distribution P is denoted by supp(P), and the Dirac delta “distribution” at ξ is denoted by δ(ξ). We define Pn(ξ1, . . . , ξn) as the uniform distribution supported on the n distinct points ξi, which we equivalently represent as

$$\mathbb{P}_n(\xi^1, \dots, \xi^n) = \sum_{i=1}^{n} \frac{1}{n}\,\delta(\xi^i).$$

Cost functions are denoted by c(z; y), where z ∈ R^{n_z} and y ∈ Rd represent the decision variable and uncertainty respectively, and Z ⊆ R^{n_z} represents the nonempty convex set of feasible decisions. For any given y, we assume that c(z; y) is a convex function of z. Finally, we
denote by

$$c^*(\xi) = \min_{z \in \mathcal{Z}} c(z; \xi)$$

the optimal objective value corresponding to the scenario ξ, where we assume c∗(ξ) to be finite for every ξ.
4.2 Preliminaries
In this section, we discuss the notion of Wasserstein distance, which defines a distance
between two probability distributions, and the scenario generation problem.
4.2.1 Distance between distributions
Let P be a discrete probability distribution on scenarios ξ1, . . . , ξn with corresponding probabilities p1, . . . , pn, and Q another discrete distribution on scenarios ζ1, . . . , ζm with corresponding probabilities q1, . . . , qm. Note that

$$\sum_{i=1}^{n} p_i = 1 = \sum_{j=1}^{m} q_j.$$
Next, we define the Wasserstein distance between these two discrete probability distributions.
Definition 1. The Wasserstein distance (induced by the ℓ2 norm) between two discrete distributions P and Q, which we denote as dW(P, Q), is defined as the square root of the optimal objective value of the following problem:

$$\begin{aligned}
\min_{\pi \in \mathbb{R}^{n \times m}_+} \quad & \sum_{i=1}^{n}\sum_{j=1}^{m} \pi_{ij}\,\|\xi^i - \zeta^j\|^2 \\
\text{subject to} \quad & \sum_{j=1}^{m} \pi_{ij} = p_i \quad \forall i \in [n], \\
& \sum_{i=1}^{n} \pi_{ij} = q_j \quad \forall j \in [m]. \qquad (4.5)
\end{aligned}$$
The linear optimization problem (4.5) used to define the Wasserstein distance can be
interpreted as a minimum-cost transportation problem from n sources to m destinations.
Here, πij represents the amount of probability mass shipped from ξi to ζj at a transportation cost of ∥ξi − ζj∥² per unit. Note that the probabilities πij sum to one, as

$$\sum_{i=1}^{n}\sum_{j=1}^{m} \pi_{ij} = \sum_{i=1}^{n} p_i = 1,$$

and hence this constraint is not included in Problem (4.5), since it is redundant. Therefore, Problem (4.5) is an optimal transportation problem whose objective is to minimize the overall cost of moving probability mass from the initial distribution P to the target distribution Q.
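The transportation problem (4.5) is a small linear program and can be solved off the shelf; a sketch using scipy's linprog (the helper name and test data are our own):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_distance(xi, p, zeta, q):
    """Solve the transportation problem (4.5) with squared-Euclidean costs
    and return the square root of its optimal value."""
    n, m = len(xi), len(zeta)
    cost = ((xi[:, None, :] - zeta[None, :, :]) ** 2).sum(axis=2).ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):           # row sums: sum_j pi_ij = p_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):           # column sums: sum_i pi_ij = q_j
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return np.sqrt(res.fun)

# Moving a unit mass from 0 to 3 costs 3^2 = 9, so the distance is 3.
d = wasserstein_distance(np.array([[0.0]]), np.array([1.0]),
                         np.array([[3.0]]), np.array([1.0]))
assert abs(d - 3.0) < 1e-6
```

One of the equality constraints is redundant (both sets of marginals sum to one), which modern LP solvers handle during presolve.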
Next, we introduce the idea of scenario reduction, which approximates a distribution
supported on n scenarios by a different distribution supported on m scenarios, with m
typically chosen to be significantly smaller than n. As part of this approach, both the new reduced set of scenarios {ζ1, . . . , ζm} and their corresponding probabilities {q1, . . . , qm} are estimated. Next, we describe the two variants of scenario reduction: discrete and continuous.
4.2.2 Scenario reduction
In this section, we describe the scenario reduction problem. For notational convenience, we
denote Pn(ξ1, . . . , ξn) as Pn.
Definition 2. The discrete scenario reduction problem is defined as

$$D_W(\mathbb{P}_n, m) = \min_{\mathbb{Q}}\ \left\{\, d_W(\mathbb{P}_n, \mathbb{Q}) : \mathbb{Q} \in \mathcal{P}(\mathrm{supp}(\mathbb{P}_n), m) \,\right\}. \qquad (4.6)$$

Definition 3. The continuous scenario reduction problem is defined as

$$C_W(\mathbb{P}_n, m) = \min_{\mathbb{Q}}\ \left\{\, d_W(\mathbb{P}_n, \mathbb{Q}) : \mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, m) \,\right\}. \qquad (4.7)$$
In Problem (4.6), the new scenarios must be selected from the support of the empirical distribution, given by the set {ξ1, . . . , ξn}. In contrast, the continuous scenario
reduction problem (4.7) allows the scenarios to be chosen from outside the set of observations, offering greater flexibility and a better approximation of the empirical distribution.
However, in both these settings, the approximate distributions are computed without
taking into account the cost function c(z; y) and the feasible set Z. We address this in the
following section, where we first define an extension of the Wasserstein distance between two
distributions, and use it to compute scenarios tailored for the optimization problem at hand.
In this chapter, we focus our attention on the continuous scenario reduction approach, but
note that these techniques can be adapted for the discrete problem as well.
4.3 Prescriptive Scenario reduction
In this section, we describe our approach for generating scenarios for stochastic optimization
problems. We define z∗(η) as an optimal decision corresponding to the scenario η, given by

$$z^*(\eta) \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} c(z; \eta).$$

For simplicity, we assume that there exists a unique optimal solution for every possible scenario η; we relax this assumption later.
Next, we define a prescriptive variant of the Wasserstein distance between two probability distributions P, Q, with respect to the cost c and constraint set Z, which we denote as D(Q | P; c, Z).
4.3.1 Prescriptive divergence
Definition 4. Let P and Q be two discrete probability distributions in Rd, given by

$$\mathbb{P} = \sum_{i=1}^{n} p_i\,\delta(\xi^i), \qquad \mathbb{Q} = \sum_{j=1}^{m} q_j\,\delta(\zeta^j),$$

respectively. Then, d(Q | P; c, Z) is given by the square root of the optimal objective value of the following linear optimization problem:

$$\begin{aligned}
d^2(\mathbb{Q} \,|\, \mathbb{P}; c, \mathcal{Z}) = \min_{\pi \in \mathbb{R}^{n \times m}_+} \quad & \sum_{j=1}^{m}\sum_{i=1}^{n} \pi_{ij}\left( c(z^*(\zeta^j); \xi^i) - c(z^*(\xi^i); \xi^i) \right) \\
\text{subject to} \quad & \sum_{j=1}^{m} \pi_{ij} = p_i \quad \forall\, 1 \le i \le n, \\
& \sum_{i=1}^{n} \pi_{ij} = q_j \quad \forall\, 1 \le j \le m. \qquad (4.8)
\end{aligned}$$
We denote this as the Prescriptive divergence between the two distributions P and Q,
with respect to the cost function c(z; y) and constraint set Z. It is a non-symmetric measure of the difference between two probability distributions, and hence not a metric.
Specifically, it is a measure of the loss in decision quality when Q is used to approximate
P. We observe that the optimal value of the optimization problem (4.8) is guaranteed to be nonnegative, as

$$c(z^*(\zeta^j); \xi^i) \;\ge\; c(z^*(\xi^i); \xi^i) = \min_{z \in \mathcal{Z}} c(z; \xi^i),$$

and hence each term in the objective is nonnegative for any choice of ζj.
We note that the Wasserstein distance dW can be recovered as a special case of the Prescriptive divergence when the cost is given by c(z; η) = ∥z − η∥²₂, the squared Euclidean norm between z and η, and the constraint set is Z = Rd. That is,

$$d(\mathbb{Q} \,|\, \mathbb{P}; \|z - y\|_2^2; \mathbb{R}^d) = d_W(\mathbb{P}, \mathbb{Q}).$$

To see this, we note that the optimal decision for scenario η is given by

$$z^*(\eta) \in \operatorname*{arg\,min}_{z \in \mathbb{R}^d} \|z - \eta\|_2^2 = \eta.$$

Hence,

$$c(z^*(\zeta); \xi) = \|z^*(\zeta) - \xi\|_2^2 = \|\zeta - \xi\|_2^2,$$

and

$$\min_{z} c(z; \xi) = \|z^*(\xi) - \xi\|_2^2 = 0,$$

and we conclude that Problem (4.8) is equivalent to Problem (4.5).
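For intuition, the prescriptive divergence (4.8) can be computed with the same transportation machinery as the Wasserstein distance, only with the cost c(z∗(ζj); ξi) − c∗(ξi) in place of the norm distance. A sketch for the unconstrained newsvendor of Equation (4.2), for which z∗(η) = η and c∗(ξ) = 0 (the scenarios and parameters below are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

# Prescriptive divergence (4.8) for the unconstrained newsvendor:
# z*(eta) = eta and c*(xi) = 0, so the cost of moving mass from xi_i
# to zeta_j is c(z*(zeta_j); xi_i) = max{b(xi_i - zeta_j), h(zeta_j - xi_i)}.
b, h = 10.0, 1.0
xi = np.array([0.0, 1.0, 2.0])   # empirical scenarios, probabilities p
p = np.full(3, 1.0 / 3.0)
zeta = np.array([0.5, 1.8])      # candidate reduced scenarios, probabilities q
q = np.array([0.5, 0.5])

n, m = len(xi), len(zeta)
cost = np.maximum(b * (xi[:, None] - zeta[None, :]),
                  h * (zeta[None, :] - xi[:, None])).ravel()
A_eq = np.zeros((n + m, n * m))
for i in range(n):               # row sums: sum_j pi_ij = p_i
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):               # column sums: sum_i pi_ij = q_j
    A_eq[n + j, j::m] = 1.0
res = linprog(cost, A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))
d2 = res.fun                     # squared prescriptive divergence
assert d2 >= 0.0
```

Note the asymmetry of the transport costs here (b ≠ h), which is exactly what the norm-based Wasserstein reduction ignores.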
4.3.2 Problem Formulation
Analogous to Problem (4.7), we define the continuous prescriptive scenario reduction problem as

$$C_P(\mathbb{P}_n, m; c, \mathcal{Z}) = \min_{\mathbb{Q}}\ \left\{\, d(\mathbb{Q} \,\|\, \mathbb{P}_n; c, \mathcal{Z}) : \mathbb{Q} \in \mathcal{P}(\mathbb{R}^d, m) \,\right\}. \qquad (4.9)$$

We denote by B(I, m) the family of all m-set partitions of the set I, i.e.,

$$\mathcal{B}(I, m) = \left\{ \{I_1, \dots, I_m\} : \emptyset \neq I_1, \dots, I_m \subseteq I,\ \textstyle\bigcup_j I_j = I,\ I_i \cap I_j = \emptyset\ \forall i \neq j \right\}.$$
Also, we denote a specific m-set partition as {Ij} ∈ B(I, m). Next, we note the following result, similar to Theorem 1 in Rujeerapaiboon et al. [2017], which reformulates the continuous prescriptive scenario reduction problem (4.9) as a set partitioning problem.

Theorem 3. The prescriptive scenario reduction problem (4.9) can be written as the following problem of finding an optimal m-set partition:

$$C_P^2(\mathbb{P}_n, m; c, \mathcal{Z}) = \min_{\{I_j\} \in \mathcal{B}(I, m)} \frac{1}{n} \sum_{j=1}^{m} \min_{\zeta^j} \sum_{i \in I_j} \left( c(z^*(\zeta^j); \xi^i) - \min_{z \in \mathcal{Z}} c(z; \xi^i) \right). \qquad (4.10)$$
Proof. Following the argument of Theorem 2 in Dupačová et al. [2003], we argue that the optimal Prescriptive divergence between Pn and any distribution Q supported on a finite set Ψ is given by

$$\min_{\mathbb{Q} \in \mathcal{P}(\Psi, \infty)} d^2(\mathbb{Q} \,\|\, \mathbb{P}_n; c, \mathcal{Z}) = \frac{1}{n} \sum_{i=1}^{n} \min_{\zeta \in \Psi} \left( c(z^*(\zeta); \xi^i) - c^*(\xi^i) \right),$$

where P(Ψ, ∞) denotes the set of all probability distributions supported on the finite set Ψ. The continuous scenario reduction problem (4.7), but with the Prescriptive divergence instead of the squared Euclidean distance, can be written as the problem of finding the set Ψ with m elements that leads to the smallest objective value. Letting Ψ = {ζ1, . . . , ζm}, we have

$$C_P^2(\mathbb{P}_n, m; c, \mathcal{Z}) = \min_{\zeta^1, \dots, \zeta^m} \frac{1}{n} \sum_{i=1}^{n} \min_{j \in [m]} \left( c(z^*(\zeta^j); \xi^i) - c^*(\xi^i) \right). \qquad (4.11)$$

Next, we show that Problem (4.11) is equivalent to Problem (4.10). Given an optimal solution ζ1∗, . . . , ζm∗ to Problem (4.11), we construct a partition such that

$$I_j = \left\{ i : c(z^*(\zeta^{j*}); \xi^i) = \min_{k \in [m]} c(z^*(\zeta^{k*}); \xi^i) \right\},$$

breaking ties arbitrarily, which leads to Problem (4.10) having the same objective value as Problem (4.11). For the other direction, given an optimal partition I1, . . . , Im and corresponding inner minimizing scenarios ζ1∗, . . . , ζm∗ of Problem (4.10), it is easy to see that these scenarios are also an optimal solution of Problem (4.11) with identical objective value. This completes the proof.
We note that Problem (4.10) can also be interpreted as a clustering problem, where
the n points, ξ1, . . . , ξn, are partitioned into m clusters with centroids ζ1, . . . , ζm. Both
the cluster assignments and the centroids within each cluster are chosen to minimize the
cumulative prescriptive divergence to the n sample points. For the jth cluster, the optimal scenario ζj∗ is chosen such that z∗(ζj∗) is close (or, in some cases, equal) to the optimal SAA solution for the scenarios in Ij. In other words,

$$\zeta^{j*} \in \operatorname*{arg\,min}_{\zeta} \sum_{i \in I_j} c(z^*(\zeta); \xi^i).$$

Once the distribution Q (described by scenarios ζj and probabilities qj = |Ij|/n) has been computed, the decision z(Q) is given by optimizing the cost under this reduced distribution Q, i.e.,

$$z(\mathbb{Q}) \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} \ \mathbb{E}_{Y \sim \mathbb{Q}}[c(z; Y)] = \operatorname*{arg\,min}_{z \in \mathcal{Z}} \ \sum_{j=1}^{m} q_j\, c(z; \zeta^j).$$
We emphasize that while traditional scenario reduction aims to compute Q “close” to P, prescriptive scenario reduction takes into account the quality of the decisions induced, and finds Q such that z(Q) is “close to” z(P) in terms of decision quality.

When m = n, choosing the scenarios ζi = ξi for every i attains D(Pn | Pn; c, Z) = 0, and thus the optimal decision is the same as the SAA solution on the full n scenarios, which is the best decision that can be computed using this data. In fact, as the following example shows, minimizing this Prescriptive divergence with just m = 1 scenario can find the SAA solution, which the standard Wasserstein method would only find with m = n.
Let us consider the simple unconstrained newsvendor problem, where the decision variable determines how much inventory to stock in the presence of uncertain demand, with cost function

$$c(z; \xi) = \max\{\, b(\xi - z),\; h(z - \xi) \,\}, \qquad (4.12)$$

for known b, h > 0. The parameters h and b can be interpreted as the holding and backorder costs, which apply when the inventory z exceeds or falls below the observed demand ξ, respectively.
Proposition 1. For the unconstrained newsvendor problem with cost given by Equation (4.12),
minimizing the Prescriptive divergence with m = 1 finds the optimal SAA solution.
Proof. When we perform traditional scenario reduction with m = 1, the Wasserstein metric finds the single scenario given by the sample mean,

$$\zeta = \frac{1}{n}\sum_{i=1}^{n} \xi^i.$$

Next, we note that for any η, the corresponding optimal solution z∗(η) is given by

$$z^*(\eta) \in \operatorname*{arg\,min}_{z} c(z; \eta) = \operatorname*{arg\,min}_{z} \left\{ b(\eta - z)^+ + h(z - \eta)^+ \right\} = \eta.$$

Thus, we see that using the D(Q ∥ P; c, R) divergence with m = 1 finds the scenario

$$\begin{aligned}
\zeta^* &\in \operatorname*{arg\,min}_{\zeta} \sum_{i=1}^{n} b(\xi^i - z^*(\zeta))^+ + h(z^*(\zeta) - \xi^i)^+ \\
&= \operatorname*{arg\,min}_{\zeta} \sum_{i=1}^{n} b(\xi^i - \zeta)^+ + h(\zeta - \xi^i)^+ \\
&= Q^{\left(\frac{b}{b+h}\right)}(\xi^1, \dots, \xi^n),
\end{aligned}$$

where Q^{(β)}(η1, . . . , ηN) denotes the β-quantile of the sample {η1, . . . , ηN}. In fact, we emphasize that Q^{(b/(b+h))}(ξ1, . . . , ξn) is the optimal SAA solution, which is thus obtained with just one scenario using the prescriptive scenario reduction method.
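Proposition 1 is easy to verify numerically: on a skewed demand sample, the single prescriptive scenario (the b/(b+h) quantile) attains the SAA optimum, while the single Wasserstein scenario (the sample mean) does not. A sketch with illustrative parameters:

```python
import numpy as np

b, h = 10.0, 1.0
rng = np.random.default_rng(0)
xi = rng.exponential(scale=50.0, size=2000)   # skewed demand sample

def saa_cost(z, sample):
    """Empirical average newsvendor cost of stocking z."""
    return np.maximum(b * (sample - z), h * (z - sample)).mean()

z_wass = xi.mean()                        # m = 1 Wasserstein scenario
z_presc = np.quantile(xi, b / (b + h))    # m = 1 prescriptive scenario
# The single prescriptive scenario already attains the SAA optimum,
# so its induced decision has strictly lower empirical cost here.
assert saa_cost(z_presc, xi) < saa_cost(z_wass, xi)
```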
However, we note that in the presence of constraints, or for general objective functions, z∗(η) may not be given by a closed-form expression, or may not even be unique. To address this issue, we introduce a variant of the prescriptive divergence in which we consider the worst case over the set of optimal solutions Z∗(ζ), for every ζ. To be precise, we define

$$\mathcal{Z}^*(\zeta) = \left\{ z \in \mathcal{Z} : c(z; \zeta) \le \min_{z' \in \mathcal{Z}} c(z'; \zeta) \right\}, \qquad (4.13)$$
and modify the definition presented in Equation (4.8) as

$$\begin{aligned}
d^2(\mathbb{Q} \,|\, \mathbb{P}; c, \mathcal{Z}) = \min_{\pi \in \mathbb{R}^{n \times m}_+} \quad & \sum_{j=1}^{m}\sum_{i=1}^{n} \pi_{ij} \max_{z \in \mathcal{Z}^*(\zeta^j)} \left( c(z; \xi^i) - c^*(\xi^i) \right) \\
\text{subject to} \quad & \sum_{j=1}^{m} \pi_{ij} = p_i \quad \forall\, 1 \le i \le n, \\
& \sum_{i=1}^{n} \pi_{ij} = q_j \quad \forall\, 1 \le j \le m. \qquad (4.14)
\end{aligned}$$
Note that when the sets Z∗(ζj), for all j ∈ [m], are singletons, Equations (4.14) and (4.8) are identical.
Next, we present our approach of scenario reduction in this framework. Motivated by
the newsvendor and CVaR objectives described in Equations (4.2) and (4.3) respectively, we
consider the following two classes of cost functions:
1. Piecewise (separately) linear cost:

$$c(z; y) = \max_{1 \le t \le k} \left\{ a_t' z + b_t' y \right\}, \qquad (4.15)$$

for known vectors at, bt of appropriate sizes, t ∈ [k].

2. Piecewise bilinear cost:

$$c(z; y) = \max_{1 \le t \le k} \left\{ z' A_t y \right\}, \qquad (4.16)$$

for known matrices At, t ∈ [k].
Given the points belonging to the jth cluster Ij, in order to compute the scenario ζj we wish to solve the problem

$$\min_{\zeta} \sum_{i \in I_j} \max_{z \in \mathcal{Z}^*(\zeta)} c(z; \xi^i).$$

We note that this problem is not necessarily convex, and we derive convex upper-bound approximations of its objective in the following section.
We first note the following result, which provides an upper bound for max_{z∈Z∗(ζ)} c(z; ξ).

Proposition 2. For any α ≥ 0, we have

$$\max_{z \in \mathcal{Z}^*(\zeta)} c(z; \xi) \;\le\; \max_{z \in \mathcal{Z}} \left\{ c(z; \xi) - \alpha c(z; \zeta) \right\} + \alpha \min_{z \in \mathcal{Z}} c(z; \zeta).$$

Proof. Using the definition in Equation (4.13) and weak duality, we have

$$\begin{aligned}
\max_{z \in \mathcal{Z}^*(\zeta)} c(z; \xi)
&= \max_{z \in \mathcal{Z}} \inf_{\alpha \ge 0} \; c(z; \xi) + \alpha\left( -c(z; \zeta) + \min_{z' \in \mathcal{Z}} c(z'; \zeta) \right) \\
&\le \inf_{\alpha \ge 0} \max_{z \in \mathcal{Z}} \; c(z; \xi) + \alpha\left( -c(z; \zeta) + \min_{z' \in \mathcal{Z}} c(z'; \zeta) \right) \\
&\le \max_{z \in \mathcal{Z}} \left( c(z; \xi) - \alpha c(z; \zeta) \right) + \alpha \min_{z' \in \mathcal{Z}} c(z'; \zeta),
\end{aligned}$$

where the last inequality holds for any fixed α ≥ 0.
Using this result, we now focus our attention on the approximate problem

$$\min_{\zeta} \sum_{i \in I_j} \left[ \max_{z \in \mathcal{Z}} \left( c(z; \xi^i) - \alpha c(z; \zeta) \right) + \alpha \min_{z \in \mathcal{Z}} c(z; \zeta) \right]$$

for each set Ij. We approximate the second term by α c(z∗(Ij); ζ), which results in a further upper bound:

$$\min_{\zeta} \sum_{i \in I_j} \left[ \max_{z \in \mathcal{Z}} \left( c(z; \xi^i) - \alpha c(z; \zeta) \right) + \alpha\, c(z^*(I_j); \zeta) \right].$$

In the following section, we discuss approximations of the first term, max_{z∈Z} (c(z; ξi) − αc(z; ζ)), for different cost functions.
4.3.3 Piecewise (separately) linear cost
In this case, the first term can be written as

$$\begin{aligned}
\max_{z \in \mathcal{Z}} \left( c(z; \xi^i) - \alpha c(z; \zeta) \right)
&= \max_{z \in \mathcal{Z}} \left( \max_{t \in [k]} \{a_t' z + b_t' \xi^i\} - \alpha \max_{t \in [k]} \{a_t' z + b_t' \zeta\} \right) \\
&\le \max_{z \in \mathcal{Z}} \max_{t \in [k]} \left\{ (1 - \alpha)\, a_t' z + b_t'(\xi^i - \alpha\zeta) \right\}.
\end{aligned}$$

Choosing α = 1, we get the following convex approximate problem for ζj:

$$\min_{\zeta} \sum_{i \in I_j} \left( \max_{t \in [k]} b_t'(\xi^i - \zeta) + \max_{t \in [k]} \left\{ a_t' z^*(I_j) + b_t' \zeta \right\} \right).$$
Note that the full problem of finding the partitions and scenarios is given by

$$\begin{aligned}
\min_{\pi} \quad & \sum_{j=1}^{m} \min_{\zeta^j \in \mathbb{R}^d} \sum_{i=1}^{n} \pi_{ij} \left( \max_{t=1,\dots,k} b_t'(\xi^i - \zeta^j) + \max_{t=1,\dots,k} \left\{ a_t' z^*(I_j) + b_t' \zeta^j \right\} \right) \\
\text{subject to} \quad & \pi e = e, \\
& z^*(I_j) \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} \sum_{i=1}^{n} \pi_{ij}\, c(z; \xi^i), \\
& \pi \in \{0, 1\}^{n \times m}.
\end{aligned}$$
4.3.4 Piecewise bilinear cost
In this case, the first term can be written as

$$\begin{aligned}
\max_{z \in \mathcal{Z}} \left( c(z; \xi^i) - \alpha c(z; \zeta) \right)
&= \max_{z \in \mathcal{Z}} \left( \max_{t \in [k]} z' A_t \xi^i - \alpha \max_{t \in [k]} z' A_t \zeta \right) \\
&\le \max_{z \in \mathcal{Z}} \max_{t \in [k]} \; z' A_t (\xi^i - \alpha\zeta) \\
&= \max_{t \in [k]} \max_{z \in \mathcal{Z}} \; z' A_t (\xi^i - \alpha\zeta).
\end{aligned}$$

Choosing α = 1, we get the following convex approximate problem for ζj:

$$\min_{\zeta} \sum_{i \in I_j} \left( \max_{t \in [k]} \max_{z \in \mathcal{Z}} z' A_t (\xi^i - \zeta) + \max_{t \in [k]} z^*(I_j)' A_t \zeta \right).$$
Note that the full problem of finding the partitions and scenarios is given by

$$\begin{aligned}
\min_{\pi} \quad & \sum_{j=1}^{m} \min_{\zeta^j \in \mathbb{R}^d} \sum_{i=1}^{n} \pi_{ij} \left( \max_{t=1,\dots,k} \max_{z \in \mathcal{Z}} z' A_t (\xi^i - \zeta^j) + \max_{t=1,\dots,k} z^*(I_j)' A_t \zeta^j \right) \\
\text{subject to} \quad & \pi e = e, \\
& z^*(I_j) \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} \sum_{i=1}^{n} \pi_{ij}\, c(z; \xi^i), \\
& \pi \in \{0, 1\}^{n \times m}.
\end{aligned}$$
4.3.5 Prediction Error penalization
We note that this approach can often lead to distributions that are too “optimistic”. One way of controlling for this is to penalize the prediction error within each cluster. This idea has been demonstrated to improve prescriptive performance in related problem settings [Bertsimas et al., 2019a,b]. In fact, Kao et al. [2009] develop an estimator that accounts for the decision objective when computing regression coefficients, and is a convex combination of the ordinary least squares and prescriptive losses. We now introduce the formulation, where
the parameter µ ∈ [0, 1] is chosen via cross-validation:

$$\begin{aligned}
\min_{\pi} \quad & \sum_{j=1}^{m} \min_{\zeta^j \in \mathbb{R}^d} \sum_{i=1}^{n} \pi_{ij} \left( \mu F_{ij} + (1 - \mu)\|\xi^i - \zeta^j\|^2 \right) \\
\text{subject to} \quad & \pi e = e, \\
& z^*(I_j) \in \operatorname*{arg\,min}_{z \in \mathcal{Z}} \sum_{i=1}^{n} \pi_{ij}\, c(z; \xi^i), \\
& \pi \in \{0, 1\}^{n \times m}, \qquad (4.17)
\end{aligned}$$

where F_{ij} is given by

$$F_{ij} = \max_{t=1,\dots,k} b_t'(\xi^i - \zeta^j) + \max_{t=1,\dots,k} \left\{ a_t' z^*(I_j) + b_t' \zeta^j \right\}$$

for c(z; y) = max_{t∈[k]} {a_t'z + b_t'y}, or

$$F_{ij} = \max_{t=1,\dots,k} \max_{z \in \mathcal{Z}} z' A_t (\xi^i - \zeta^j) + \max_{t=1,\dots,k} z^*(I_j)' A_t \zeta^j$$

for c(z; y) = max_{t∈[k]} z'A_t y.
4.4 Optimization algorithms
In this section, we present an alternating optimization framework of solving Problem (4.17).
4.4.1 Alternating optimization framework
1. Given a candidate partition I1, . . . , Im, we solve m convex optimization problems, by
considering the m inner minimizations over ζj separately in Problem (4.17).
2. Given scenarios ζj, j ∈ [m], we assign each point ξi to the cluster j(i), given by

$$j(i) = \operatorname*{arg\,min}_{1 \le j \le m} \; \mu F_{ij} + (1 - \mu)\|\xi^i - \zeta^j\|^2,$$

and update the partition I1, . . . , Im.
3. Return to Step 1, stopping when there is no change in the assignments, when the improvement in objective value is smaller than a prespecified tolerance, or after a maximum number of iterations.
The initial sets I1, . . . , Im are chosen randomly. To further improve this procedure, we run the algorithm with different random restarts, and choose the solution with the smallest prescriptive cost on a validation set.
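The framework above can be sketched as follows. Note that Step 1 is simplified here: the per-cluster convex problems are replaced by cluster-mean updates, and the surrogate cost F_ij is passed in as a user-supplied function, so this is a structural sketch rather than the full method:

```python
import numpy as np

def alternating_reduction(xi, m, F, mu=0.5, n_iter=30, seed=0):
    """Sketch of the alternating scheme of Section 4.4.1. Step 1 is
    simplified to a cluster-mean update (the full method solves a convex
    problem per cluster); F(i, zeta) plays the role of the surrogate F_ij."""
    rng = np.random.default_rng(seed)
    n = len(xi)
    labels = rng.integers(0, m, size=n)          # random initial partition
    for _ in range(n_iter):
        centers = np.array([xi[labels == j].mean(axis=0)
                            if (labels == j).any() else xi[rng.integers(n)]
                            for j in range(m)])
        # Step 2: reassign by the mixed objective mu*F_ij + (1-mu)*||.||^2.
        obj = np.array([[mu * F(i, centers[j])
                         + (1.0 - mu) * ((xi[i] - centers[j]) ** 2).sum()
                         for j in range(m)] for i in range(n)])
        new_labels = obj.argmin(axis=1)
        if (new_labels == labels).all():         # Step 3: stop on no change
            break
        labels = new_labels
    return centers, np.bincount(labels, minlength=m) / n

# Demo with the newsvendor surrogate F_ij = c(z*(zeta_j); xi_i), b=10, h=1.
rng = np.random.default_rng(2)
xi = rng.normal(5.0, 1.0, size=(300, 1))
F = lambda i, z: max(10.0 * (xi[i, 0] - z[0]), 1.0 * (z[0] - xi[i, 0]))
zeta, q = alternating_reduction(xi, m=3, F=F)
assert np.isclose(q.sum(), 1.0)
```

Setting mu = 0 recovers a k-means-style Wasserstein reduction, while mu = 1 relies purely on the prescriptive surrogate.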
4.5 Computational Examples
For the prescriptive divergence, the reduced scenarios are computed using an alternating
optimization heuristic with random restarts, similar in spirit to the k-means algorithm. In
our computational results, we compare the Prescriptive cost (for the Q obtained by Wasserstein and Prescriptive scenario reduction), computed as

$$\frac{1}{|S|}\sum_{i \in S} c(z(\mathbb{Q}); \xi^i).$$
We note that this cost quantifies the decision quality of a distribution Q which generates
the decision z(Q). For each of the two scenario reduction methods, we report this metric
both in-sample and out-of-sample, i.e., S is either the training or test set respectively. We
denote the two methods as P-SR (Prescriptive Scenario Reduction) and W-SR (Wasserstein
Scenario Reduction with squared Euclidean norm).
4.5.1 Portfolio optimization
First, we consider a portfolio optimization problem, given by

$$\begin{aligned}
(z^*(\mathbb{Q}), \beta^*(\mathbb{Q})) \in \operatorname*{arg\,min}_{z \in \mathbb{R}^d_+,\ \beta \in \mathbb{R}} \quad & \beta + \mathbb{E}_{Y \sim \mathbb{Q}}\left[ \frac{1}{\varepsilon}\max\{\, -z'Y - \beta,\; 0 \,\} - \lambda z'Y \right] \\
\text{subject to} \quad & e'z = 1. \qquad (4.18)
\end{aligned}$$
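The SAA version of Problem (4.18) is a linear program once the max{·, 0} terms are modeled with auxiliary variables u_i ≥ −z′Y_i − β, u_i ≥ 0. A sketch using scipy's linprog (function name and data are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def cvar_portfolio(returns, eps=0.05, lam=0.01):
    """SAA of Problem (4.18) as an LP in (z, beta, u), where
    u_i >= -z'Y_i - beta and u_i >= 0 model the max{., 0} terms."""
    n, d = returns.shape
    c = np.concatenate([-lam * returns.mean(axis=0),   # -lambda * z'Ybar
                        [1.0],                         # beta
                        np.full(n, 1.0 / (n * eps))])  # (1/(n*eps)) sum u_i
    # -z'Y_i - beta - u_i <= 0 for each scenario i.
    A_ub = np.hstack([-returns, -np.ones((n, 1)), -np.eye(n)])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(d), [0.0], np.zeros(n)])[None, :]  # e'z = 1
    bounds = [(0, None)] * d + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds)
    return res.x[:d], res.fun

rng = np.random.default_rng(3)
Y = rng.normal(0.001, 0.02, size=(500, 5))   # synthetic return scenarios
z, obj = cvar_portfolio(Y)
assert abs(z.sum() - 1.0) < 1e-6
```

Replacing the 500 raw scenarios by a reduced set simply shrinks n in this LP, which is the computational payoff of scenario reduction.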
We generate data sampled as

$$Y = \mu + \Sigma^{1/2}\varepsilon,$$

where µ ∼ N(0, I_{d×d}), the noise ε is sampled from a standard normal distribution, and the covariance matrix Σ has entries given by

$$\Sigma_{ij} = \rho^{|i-j|} \quad \forall\, 1 \le i, j \le d.$$

We sample a training set of n points from this distribution, with n = 1000 and d = 20. We perform both Wasserstein and Prescriptive scenario reduction for different choices of m. We repeat this for 100 different instances, and report the mean prescriptive risk averaged over these instances. We choose the parameters ε = 0.05 and λ = 0.01, and the correlation parameter ρ = 0.8. The parameter ρ controls the correlation levels of the stock returns, with ρ = 0 implying no correlation, while ρ closer to +1 (−1) results in more positively (negatively) correlated returns. Finally, we ensure that the mean returns and all the return values each exceed −1.00.
In Figure 4-1, we compare the average in-sample performance of the distributions produced by the Wasserstein and prescriptive scenario reduction algorithms. We see that the prescriptive scenario reduction method outperforms the standard Wasserstein method for different values of m, and the gap narrows as the number of scenarios increases. This trend repeats itself in Figure 4-2, which plots the out-of-sample performance as a function of m, reinforcing the fact that these new distributions lead to an improvement out-of-sample as well.
Figure 4-1: Average in-sample prescriptive performance for various methods as a function ofm, the number of reduced scenarios.
Figure 4-2: Average out-of-sample prescriptive performance for various methods as a functionof m, the number of reduced scenarios.
4.5.2 Newsvendor problem with budget constraints
In this example, we consider the case of an inventory manager, with various products and a
capacity constraint on the total inventory. The complete problem is given by

$$\begin{aligned}
z^*(\mathbb{Q}) \in \operatorname*{arg\,min}_{z \in \mathbb{R}^d_+} \quad & \mathbb{E}_{Y \sim \mathbb{Q}}\left[ \max_{1 \le j \le d} \left\{ b(Y_j - z_j)^+ + h(z_j - Y_j)^+ \right\} \right] \\
\text{subject to} \quad & \sum_{j=1}^{d} z_j \le U.
\end{aligned}$$

The demand Y ∈ Rd is generated as

$$Y = \mu + \varepsilon,$$

with mean demands µj ∼ U[4, 5] and noise distributed as εj ∼ N(0, 1) for all j ∈ [d]. The cost parameters were chosen as b = 10 and h = 1. We sample n points from this distribution, with n = 1000 and d = 5. Note that this means the number of pieces in the cost function, k, equals 2d = 10.
We perform both Wasserstein and Prescriptive scenario reduction for various choices of m.
The out of sample cost is calculated over a test set of size ntest = 100,000 points generated
from the same distribution as the training set. We repeat this for 50 different instances, and
report the mean prescriptive risk averaged over these instances.
In Figure 4-3, we compare the expected in-sample performances of the distributions produced by the Wasserstein and prescriptive scenario reduction algorithms. As in the first example, the prescriptive scenario reduction method outperforms the standard Wasserstein method for different values of m in terms of in-sample prescriptive performance, and the gap narrows as the number of scenarios increases. A similar trend is observed for the out-of-sample performance in Figure 4-4.
Figure 4-3: Average in-sample prescriptive performance for various methods as a function ofm, the number of reduced scenarios.
Figure 4-4: Average out-of-sample prescriptive performance for various methods as a functionof m, the number of reduced scenarios.
4.6 Conclusion
In this chapter, we introduced an optimization-based framework that combines ideas from
scenario reduction and convex optimization to compute scenarios that lead to improved
decisions. Unlike most existing approaches, our approach takes the cost function and con-
straints into account, is general, and applies in a wide range of settings. With the help of
computational examples, we demonstrate the benefit of this approach over a commonly used
cost-agnostic scenario reduction method. Our approach consistently outperforms standard
Wasserstein-based scenario reduction methods across different choices of m, the number of
scenarios. From a practitioner’s perspective, achieving higher quality decisions with fewer
scenarios can be highly desirable as the scenarios can be inspected, which improves inter-
pretability of the decision-making process.
Chapter 5
Sparse Convex Regression
5.1 Introduction
Given data (x1, y1), . . . , (xn, yn), we consider the problem of finding a convex function of the variables x ∈ Rd (features) that best fits the dependent variable y ∈ R. Formally, we wish to estimate a function f : Rd → R where

$$y = f(\mathbf{x}) + \varepsilon \qquad (5.1)$$
with the requirement that f be a convex function. Here the random noise ε is assumed
to have zero mean. Note that one can equivalently perform concave regression, as the
requirement that f is convex is identical to restricting −f to be concave. As we discuss next,
such convexity/concavity constraints arise naturally in several settings. Such problems fall
in the general class of shape constrained function estimation.
Shape constrained regression problems have many applications in various fields such as,
but not limited to, operations research, econometrics, geometric programming [Magnani and
Boyd, 2009], image analysis [Goldenshluger and Zeevi, 2006], and target reconstruction [Lele
et al., 1992]. In operations research, these problems arise in reinforcement learning [Shapiro
et al., 2009b], [Hannah et al., 2014], in resource allocation [Topaloglu and Powell, 2003], and
while analyzing performance measures of queueing networks [Chen and Yao, 2001]. In eco-
nomics, such problems are encountered when demand [Varian, 1982], utility functions [Var-
ian, 1984], and production functions [Allon et al., 2007] are assumed to be concave. For a
more detailed list of applications, see Lim and Glynn [2012] and Hannah and Dunson [2013].
The convex least squares estimator is the solution of the following generalized regression problem:
$$\min_{f \in \mathcal{C}} \; \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2, \tag{5.2}$$
where C represents the space of convex functions on Rd. Note that Problem (5.2) is an
optimization problem over functions. Surprisingly, this can be written equivalently as a finite
dimensional convex quadratic optimization problem where the variables are the function
values and subgradients at each of the points x1, . . . ,xn [Boyd and Vandenberghe, 2004].
As part of the constraints, we enforce the convexity condition, i.e., the graph of the convex
function lies above each of its tangent planes. More precisely, this convexity condition implies
that given any point xi, the value of f at xi is greater or equal to the value of any tangent
hyperplane of f evaluated at xi. Clearly any convex function has a nonempty subdifferential
at every point, and the existence of such tangent planes is guaranteed. For this problem, it
suffices to enforce this condition for all n(n − 1) pairs of points xi,xj, 1 ≤ i, j ≤ n.
The resulting quadratic optimization problem with variables $(\theta, \{\xi_i\}_{i=1}^n)$ is given as follows:
$$\begin{aligned}
\min_{\theta,\, \{\xi_i\}_{i=1}^n} \quad & \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall i, j, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i.
\end{aligned} \tag{5.3}$$
The variables θi represent the values of f(xi), and ξi belongs to the subdifferential set of
the convex function f at each xi. The solution to this problem θ∗ is referred to as the
convex least squares estimator (CLSE). Note that we recover the usual least squares linear
regression problem by setting ξi = ξ ∀i and θi = ξTxi ∀i.
We note that the feasible set of Problem (5.3) can be unbounded, which may lead to potential instability: there can be multiple values of the subgradients leading to the same objective value. Hence, we propose solving the following regularized optimization problem, for a given λ > 0:
$$\begin{aligned}
\min_{\theta,\, \{\xi_i\}_{i=1}^n} \quad & \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall i, j, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i.
\end{aligned} \tag{5.4}$$
By adding a regularization term on the subgradients, which makes the objective strongly convex, the subgradients ξ_i can no longer take arbitrary values for a given objective value while remaining feasible.
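For intuition on formulation (5.4), the problem can be solved directly at small scale with a general-purpose solver. The sketch below is an illustration only, not the method developed in this chapter, and the function name `convex_lse` and use of SciPy's SLSQP are our own illustrative choices; it enumerates all n(n − 1) convexity constraints explicitly.

```python
import numpy as np
from scipy.optimize import minimize

def convex_lse(X, y, lam=1e-2):
    """Directly solve the regularized convex regression problem (5.4) for
    small n; variables are theta in R^n and subgradients xi in R^{n x d}."""
    n, d = X.shape

    def unpack(v):
        return v[:n], v[n:].reshape(n, d)

    def obj(v):
        theta, xi = unpack(v)
        return 0.5 * np.sum((y - theta) ** 2) + 0.5 * lam * np.sum(xi ** 2)

    # One convexity constraint per ordered pair (i, j):
    #   theta_j - theta_i - xi_i . (x_j - x_i) >= 0
    cons = [{"type": "ineq",
             "fun": (lambda v, i=i, j=j:
                     unpack(v)[0][j] - unpack(v)[0][i]
                     - unpack(v)[1][i] @ (X[j] - X[i]))}
            for i in range(n) for j in range(n) if i != j]

    v0 = np.concatenate([y, np.zeros(n * d)])  # start from theta = y, xi = 0
    res = minimize(obj, v0, method="SLSQP", constraints=cons,
                   options={"maxiter": 500})
    return unpack(res.x)
```

This approach scales only to a few hundred points because of the O(n²) constraints, which is exactly the bottleneck that motivates the cutting plane method of Section 5.2.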
5.1.1 Related literature
In this section, we review the relevant literature. Recently, there has been considerable
interest in shape constrained regression among the statistics community. Seijo and Sen
[2011] and Lim and Glynn [2012] characterize and show consistency of the CLSE. Seijo and
Sen [2011] use off-the-shelf interior point solvers (like MOSEK, cvx) for solving the problem.
But these solvers do not scale well for n ≥ 300 due to the presence of O(n2) constraints.
This motivated the recent work by Mazumder et al. [2018] which presents a first order
method based on the Alternating Direction Method of Multipliers (ADMM) to compute
the optimal solutions for the least squares convex regression problem. They demonstrate
the flexibility of their approach in the presence of monotonicity constraints, and bounded
subgradients. Their method solves instances of sizes of n ≈ 1000 to an accuracy of 10−3 in a
few seconds, and moderate accuracy solutions for n ≈ 5000 in a few minutes. However, their
method cannot be easily extended to least absolute deviation convex regression (where the loss function is the ℓ1 norm rather than the ℓ2 least squares loss), or to any joint constraints
over the subgradients. Hannah and Dunson [2013] consider an approximation of the convex
regression problem which is based on iteratively partitioning the set of observations, and
report results for n of the order of 10,000 in a few minutes. Balázs et al. [2015] propose an
aggregate cutting plane based method for solving the full convex regression problem along
with an approximate version, and they demonstrate via numerical experiments that their
algorithm solves instances with sizes of n ≈ 500 in a few minutes. However, they do not perform large-scale computations to show how their method scales. Regarding statistical
results, Han and Wellner [2016] sharply characterize the rate of statistical convergence for
the minimax risk.
In the context of linear regression, the problem of sparse regression refers to finding the
optimal vector of coefficients β ∈ Rd which minimizes the sum of squares of the residuals,
with the additional restriction that β only have at most k (for some positive integer k < d)
elements different from zero. In high dimensional settings where d ≫ n, such an assumption
is valuable for conducting statistical inference, and for settings where d < n sparsity improves
interpretability of the model. We explore the notion of sparsity in this setting - we impose
the restriction that the union of supports of the subgradients is a set with cardinality at
most k. We refer to this problem as the sparse convex regression problem. Sparsity and variable selection for non-parametric regression models are a relatively new and unexplored area. Recently, Xu et al. [2016] develop a method for high dimensional sparse convex regression which solves an approximate problem, with the additional restriction that the target
convex function f itself be a sum of univariate convex functions. Additionally, they show
that under certain conditions on the samples, this approximation is accurate for the purpose
of variable selection.
Such a cardinality constraint makes the sparse linear regression problem NP-hard [Natarajan, 1995], which has led to it being regarded as intractable. However, there have been tremendous advances in computing power over the last thirty years, both in hardware and in optimization software (see Bixby [2012], Nemhauser [2013] for more details), which can computationally benefit such problems in statistics. Recently, there has
been some work that propose using modern Mixed Integer Optimization (MIO) methods
along with tools from first order methods in convex optimization for solving classical sta-
tistical problems such as best subset selection [Bertsimas et al., 2016b] and least quantiles
regression [Bertsimas and Mazumder, 2014]. More recently, Bertsimas and Van Parys [2016] propose a reformulation of the sparse regression problem in which they develop a cutting plane
algorithm using a duality perspective that solves problems with sizes of n, d in the order of
100,000s in a few seconds. We explore the use of such techniques while solving the sparse
convex regression problem, where we select the best subset of features whose cardinality is
bounded by k, and find the optimal convex function on this subset.
5.1.2 Contributions
In this section, we outline the main contributions of our work.
1. In this chapter, we consider the problem of convex regression, and develop a scalable
algorithm for obtaining high quality solutions in practical times that compare favorably
with other state of the art methods. We show that by using a cutting plane method,
the least squares convex regression problem can be solved for sizes (n, d) = (10^4, 10) in minutes and (n, d) = (10^5, 10^2) in hours. We emphasize that this approach can also be used for ℓ1 convex regression (where we minimize the ℓ1 norm of the residual vector y − θ) with similar scalability results.
2. We propose algorithms which iteratively solve for the best subset of features based on
first order and cutting plane methods. To the best of our knowledge, these are the first
algorithms for sparse convex regression. We consider two variants of this problem, and
develop algorithms for each of them. In one variant, we consider the sparse problem
with bounded subgradients, and develop iterative mixed integer optimization based
algorithms for solving it. In the second variant, we consider the sparse problem with
ridge regularization, and develop a binary cutting plane method for this problem. With
the help of computational experiments, we show that our methods are scalable and
obtain near exact subset recovery for sizes (n, d, k) = (10^4, 10^2, 10) in minutes, and (n, d, k) = (10^5, 10^2, 10) in hours.
5.1.3 Structure of this chapter
The structure of this chapter is as follows. In Section 5.2, we present the cutting plane algo-
rithm for solving the least squares convex regression problem and other variants. In Section
5.3, we define the sparse convex regression problem, and present our solution approach. We
illustrate the effectiveness of our approach with computational results and discuss the results
in Section 5.4.
5.1.4 Notation
For any positive integer n, we use [n] to denote the set of the first n positive integers, that is, [n] = {1, . . . , n}. The response vector is an n-dimensional vector of observations, and the covariates are each d-dimensional vectors, i.e., y ∈ R^n, x_i ∈ R^d ∀i ∈ [n], where d ≥ 1. Also, ∥·∥_0 denotes the ℓ0 norm, given by the number of nonzero elements in a vector. Finally, Supp(x) denotes the set of indices of the vector x whose corresponding values are nonzero.
5.2 Optimization Algorithm for Convex Regression
In this section, we propose an algorithm to solve the convex regression problem. Additionally, we show that our algorithm can easily accommodate the case with an ℓ1 objective, as well as other constraints on the subgradients.
5.2.1 Algorithm
We present a cutting plane based algorithm for solving Problem (5.4). We now explain the
various steps in the algorithm in the following subsections.
Cutting plane algorithms
Cutting plane algorithms are an effective tool for solving large-scale optimization problems
where the number of constraints is very high. Before we proceed, we define some terminology
that is commonly used in the large-scale optimization literature. In this context, master
problem refers to the full formulation (5.4) with n(n−1) constraints, while the reduced master
problem refers to a problem with the same objective and variables, but with only a subset
of the constraints. The main idea behind these methods is to start solving the problem with
a few constraints initially - the initial reduced master problem. We then find the violated
constraints, and iteratively add them in a delayed manner - at each iteration we solve a
reduced master problem (but with progressively more constraints than the initial reduced
master problem). Consequently, such methods are also referred to as delayed constraint
generation in the large-scale optimization literature [Bertsimas and Tsitsiklis, 1997]. The
success of this method depends greatly on the efficiency of finding the violated constraints.
Initial reduced master problem
We start with a fraction of the n(n − 1) constraints - an initial reduced master problem.
Typically only a small fraction of the n(n − 1) constraints will be active at the optimal
solution to the full problem, and solving the problem with only these active constraints is
clearly equivalent to solving the full problem. However, these active constraints are not
known beforehand. A key advantage of starting with a constraint set that is “close” to the
active constraint set is that it could substantially reduce the number of cuts added at later
iterations, and reduce the net computational burden.
We motivate our algorithm from the solution to the convex regression problem for d = 1, where the convexity condition need only be applied to immediately neighboring points. Recall that when d = 1, Problem (5.4) can be solved with only n − 1 constraints, obtained by sorting the x_i's and considering adjacent index pairs.
For d > 1, given x1, . . . ,xn, we form a spanning path (SP) based on the Euclidean dis-
tances between these points. The algorithm works as follows - starting from xi1 (say i1 = 1),
we find the closest point (based on the usual Euclidean distance metric) xi2 to it, and add it
as the next point. We then find the closest point xi3 to xi2 over all the points excluding xi1
and xi2 , then we find the closest point xi4 to xi3 over all the points excluding xi1 , xi2 and
xi3 , and so on. We utilize the n − 1 edges in the spanning path among x1, . . . ,xn as initial
constraints. These n − 1 constraints initially form the reduced master problem:
$$\begin{aligned}
\min_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|^2 \\
\text{subject to} \quad & \theta_{i_1} + \xi_{i_1}'(x_{i_2} - x_{i_1}) \le \theta_{i_2}, \\
& \theta_{i_2} + \xi_{i_2}'(x_{i_3} - x_{i_2}) \le \theta_{i_3}, \\
& \quad \vdots \\
& \theta_{i_{n-1}} + \xi_{i_{n-1}}'(x_{i_n} - x_{i_{n-1}}) \le \theta_{i_n}, \\
& \|\xi_j\|_\infty \le M^\star \quad \forall\, 1 \le j \le n,
\end{aligned} \tag{5.5}$$
with solution θ, ξ_1, . . . , ξ_n. The last constraint bounds the feasible space, with M^⋆ obtained by solving Problems (5.14) and (5.15).
Alternatively, we have also computed the minimum spanning tree (MST) among x1, . . . ,xn
and used the n − 1 edges of the MST as initial constraints. We have also used randomly
chosen pairs of points (Method (R)) as the initial reduced master problem and also selected
the closest point for each point xi (Method (C)). For d = 1, we note that the MST and SP
methods coincide. In Section 5.4, we compare Methods SP, MST, R and C.
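The greedy spanning path construction described above can be sketched as a straightforward nearest-neighbor pass (function and variable names are illustrative):

```python
import numpy as np

def spanning_path(X):
    """Greedy nearest-neighbor spanning path over the rows of X.

    Starting from the first point, repeatedly append the closest
    not-yet-visited point (Euclidean distance). The n - 1 consecutive
    pairs of the returned order seed the initial reduced master problem.
    """
    n = X.shape[0]
    order = [0]
    remaining = set(range(1, n))
    while remaining:
        last = X[order[-1]]
        nxt = min(remaining, key=lambda j: np.linalg.norm(X[j] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return order  # edges: list(zip(order, order[1:]))
```

For d = 1 this recovers the sorted-neighbor structure discussed above, since the closest unvisited point on the line is the adjacent one in sorted order.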
Delayed constraint generation
For any given solution to the reduced master problem (5.5) given by θ, ξ1, . . . , ξn, we need
to check if this is a feasible solution for the full problem. If it is indeed feasible, clearly
it is also optimal for the full problem. On the other hand, if it is not feasible, we need to
find a violated constraint efficiently. This problem of finding a violated constraint is also
referred to as the separation problem, as this amounts to finding a hyperplane that separates
θ, ξ1, . . . , ξn from the feasible set [Bertsimas and Tsitsiklis, 1997]. Thus, for each i, the ith
separation problem is to find the maximizing index j(i), where
$$j(i) = \arg\max_{1 \le k \le n} \; \big\{ \theta_i - \theta_k + \xi_i'(x_k - x_i) \big\}, \tag{5.6}$$
and check if the corresponding largest value is greater than 0.
In practice, we only consider a constraint to be violated if it exceeds a given tolerance
Tol. In the case of such a violation, we add the constraint
$$\theta_i + \xi_i'(x_{j(i)} - x_i) \le \theta_{j(i)} \tag{5.7}$$
to the reduced master problem for each i, and re-solve it. Let T_k denote the set of index pairs of the violated constraints added at the kth iteration. Thus, at the kth iteration, the problem we solve is given by
$$\begin{aligned}
\min_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall (i, j) \in T_0, \\
& \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall (i, j) \in T_1, \\
& \quad \vdots \\
& \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall (i, j) \in T_k.
\end{aligned} \tag{5.8}$$
If $\max_{1 \le k \le n} \{\theta_i - \theta_k + \xi_i'(x_k - x_i)\} \le$ Tol for all i ∈ [n], then the current solution is in fact optimal for the full problem (5.4) with n(n − 1) constraints, and the method terminates. The complete algorithm is as follows:
Algorithm 1 Cutting plane algorithm for Problem (5.4)
Input: Data (y_i, x_i), i = 1, . . . , n, tolerance Tol > 0.
Output: An optimal solution (θ^∗, ξ^∗_1, . . . , ξ^∗_n) to Problem (5.4).
1: Solve the reduced master problem, i.e., Problem (5.8) with k = 0.
2: Set success ← 0.
3: while success == 0 do
4:   for 1 ≤ i ≤ n do
5:     For this i, solve the separation problem (5.6) to find j(i).
6:     Add the corresponding violated constraint (Eq. (5.7)) to the reduced master problem.
7:   end for
8:   If there is no violated constraint within the tolerance Tol, set success ← 1.
9:   Else, re-solve Problem (5.8) with the new constraint set T_{k+1}, consisting of the additional constraint(s) added in Steps 4-7.
10:  k ← k + 1
11: end while
We also note that this cutting plane algorithm, by successively adding violated constraints
to the reduced master problem, is guaranteed to converge to an optimal solution in a finite
number of steps [Kelley, 1960].
Theorem 4. The cutting plane Algorithm 1 converges to an optimal solution of Problem
(5.4) in a finite number of iterations.
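The separation step (5.6) is a simple vectorized computation; a sketch (with illustrative names, and the tolerance Tol as in the text):

```python
import numpy as np

def most_violated(theta, Xi, X, tol=1e-6):
    """Vectorized separation step: for each i, find j(i) maximizing
    theta_i - theta_j + xi_i'(x_j - x_i) as in (5.6), and return the
    (i, j(i)) pairs whose violation exceeds tol.

    theta: (n,) function values, Xi: (n, d) subgradients, X: (n, d) points.
    """
    n = len(theta)
    G = Xi @ X.T                                   # G[i, j] = xi_i . x_j
    V = (theta[:, None] - theta[None, :]
         + G - np.sum(Xi * X, axis=1)[:, None])    # V[i, j] = theta_i - theta_j + xi_i'(x_j - x_i)
    np.fill_diagonal(V, -np.inf)                   # ignore i == j
    jbest = V.argmax(axis=1)
    return [(i, int(jbest[i])) for i in range(n) if V[i, jbest[i]] > tol]
```

At a feasible (θ, ξ) this returns an empty list, which is exactly the termination test of Algorithm 1.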
5.2.2 ℓ1 convex regression
Consider the problem of ℓ1 convex regression, given by
$$\min_{f \in \mathcal{C}} \; \sum_{i=1}^{n} |y_i - f(x_i)|, \tag{5.9}$$
where, as before, C is the space of convex functions on R^d. Such a variant is along the lines of linear regression with an ℓ1 loss, rather than the usual least squares loss. Problem (5.9) can be written as an equivalent finite dimensional linear optimization problem (5.10), using additional auxiliary variables z ∈ R^n_+, as follows:
$$\begin{aligned}
\min_{\theta,\, \{\xi_i\}_{i=1}^n,\, z} \quad & \sum_{i=1}^{n} z_i \\
\text{subject to} \quad & z_i \ge y_i - \theta_i \quad \forall i, \\
& z_i \ge -(y_i - \theta_i) \quad \forall i, \\
& \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall i, j, \\
& \theta, z \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i \in [n].
\end{aligned} \tag{5.10}$$
We utilize the dual simplex algorithm: when a new cut is introduced, the previous optimal solution remains dual feasible (the optimality conditions are still satisfied) while possibly becoming primal infeasible, so re-optimization is cheap. As we illustrate in Section 5.4, this method is fast in practice and scales well.
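Formulation (5.10) can be illustrated compactly with an off-the-shelf LP interface. The sketch below is a small-scale illustration using SciPy's HiGHS solver, not the dual-simplex Gurobi implementation used in the experiments of this chapter; the function name and variable layout are our own.

```python
import numpy as np
from scipy.optimize import linprog

def l1_convex_regression(X, y):
    """Solve the l1 convex regression LP (5.10) for small n.
    Variable vector v = [theta (n) | xi (n*d, row-major) | z (n)]."""
    n, d = X.shape
    nv = n + n * d + n
    c = np.zeros(nv)
    c[n + n * d:] = 1.0                          # objective: sum_i z_i
    rows, rhs = [], []
    for i in range(n):
        r = np.zeros(nv); r[i] = -1.0; r[n + n * d + i] = -1.0
        rows.append(r); rhs.append(-y[i])        # z_i >= y_i - theta_i
        r = np.zeros(nv); r[i] = 1.0; r[n + n * d + i] = -1.0
        rows.append(r); rhs.append(y[i])         # z_i >= theta_i - y_i
        for j in range(n):
            if i == j:
                continue
            r = np.zeros(nv); r[i] = 1.0; r[j] = -1.0
            r[n + i * d: n + (i + 1) * d] = X[j] - X[i]
            rows.append(r); rhs.append(0.0)      # theta_i + xi_i'(x_j - x_i) <= theta_j
    bounds = [(None, None)] * (n + n * d) + [(0, None)] * n
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    theta = res.x[:n]
    xi = res.x[n:n + n * d].reshape(n, d)
    return theta, xi, res.fun
```

Building the dense constraint matrix explicitly is only viable for small n; at scale the delayed constraint generation of Algorithm 1 is essential.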
5.2.3 Extensions
Algorithm 1 can be extended to accommodate the following additional requirements on f(⋅):
a) The function f is coordinate-wise monotone, i.e., its subgradients satisfy ξ_i ≥ 0 (non-decreasing) or ξ_i ≤ 0 (non-increasing) for all i.
b) The subgradients ξ_i are bounded, i.e., ∥ξ_i∥_p ≤ L ∀i for some L and ℓp norm ∥·∥_p. The usual cases p ∈ {1, 2, ∞} result in conic optimization problems and can be handled by this approach. Such constraints could be added as part of the reduced master problem all at once, or in a delayed manner as and when they are violated.
5.3 Sparse Convex Regression
In this section, we consider the problem of sparse convex regression, in which the union of the supports of the subgradients of f at the points x_1, . . . , x_n is a set whose cardinality is bounded by k. We formulate this as the following optimization problem over sets:
$$\begin{aligned}
\min_{\theta,\, \{\xi_i\}_{i=1}^n,\, S} \quad & \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|_2^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall i, j, \\
& \mathrm{Supp}(\xi_i) \subseteq S \;\; \forall i, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i, \\
& |S| \le k, \;\; S \subseteq \{1, \ldots, d\}.
\end{aligned} \tag{5.11}$$
5.3.1 Primal approach
In this section, we present a primal-based approach to solving for the optimal subset of features for the convex regression problem. Consider the following mixed integer (binary) quadratic optimization (MIQO) problem, for some positive constant M:
$$\begin{aligned}
\min_{\theta,\, z,\, \{\xi_i\}_{i=1}^n} \quad & \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|_2^2 \\
\text{subject to} \quad & \theta_i + \xi_i^T (x_j - x_i) \le \theta_j \quad \forall\, 1 \le i, j \le n, \\
& |(\xi_i)_j| \le M z_j \quad \forall i \in [n],\, j \in [d], \\
& \sum_{j=1}^{d} z_j \le k, \\
& z \in \{0, 1\}^d, \quad \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i \in [n].
\end{aligned} \tag{5.12}$$
To solve this problem, we first develop heuristics based on convex optimization, which generate feasible solutions and are fast in practice. We solve a reduced MIQO problem (using a
commercial mixed integer optimization solver [Gurobi]) to generate lower bounds, which
provide a guarantee on the quality of this solution. Bertsimas et al. [2016b] used commercial
state of the art MIO solvers to solve the sparse linear regression problem with considerable
success. We present the details on this algorithm in the following section.
Algorithmic approach
In this section, we present an algorithm to solve Problem (5.12). To summarize, our solution
approach involves generating lower bounds by solving the reduced MIQO problem, and
improving this bound at each successive iteration. We use heuristics in order to find feasible
solutions fast, and generate lower bounds in order to determine the quality of the proposed
solution, or potentially improve it further. We elaborate in more detail on the heuristics
in Section 5.3.3. In order to determine the quality of our solution (in terms of optimality
gap), we generate lower bounds. For this, we solve the full sparse problem as an MIQO
problem, but with only the initial reduced set of constraints to start. Whenever possible,
we warm-start this problem with a feasible solution obtained via heuristics, which we briefly
discuss in Section 5.3.3. We then iteratively add the violated constraints to Problem (5.12)
to tighten the bounds, similar in spirit to the cutting plane approach. For the upper bound,
we solve the full convex regression problem on this restricted support. To be precise, this
problem is given by
$$\begin{aligned}
\min_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta_i)^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|_2^2 \\
\text{subject to} \quad & \theta_i + \xi_i'\big((x_j)_S - (x_i)_S\big) \le \theta_j \quad \forall i, j, \\
& \|\xi_i\|_\infty \le M \;\; \forall i, \\
& \xi_i \in \mathbb{R}^k \;\; \forall i, \quad \theta \in \mathbb{R}^n,
\end{aligned} \tag{5.13}$$
where S is the support set obtained from the MIO solution, and v_S is the vector v restricted to the set S. The overall primal algorithm is as follows:
Algorithm 2 Primal approach.
Input: Initial constraints C^(0) (a subset of the n(n − 1) constraints), tolerance ε > 0, a positive integer T.
Output: A sparse optimal solution to Problem (5.12).
1: Initialize Problem (5.12) with the initial constraints C^(0).
2: Use the initialization heuristic (Section 5.3.3) to generate an initial solution S^(0).
3: Set t ← 1.
4: while t ≤ T AND gap > ε do
5:   Solve Problem (5.12) with reduced constraint set C^(t) to obtain support set S^(t), possibly utilizing S^(t−1) as a warm-start.
6:   Set LB (lower bound) to the optimal objective of Problem (5.12).
7:   With the output support, solve Problem (5.13) on the support S^(t).
8:   Update UB (upper bound) to be the optimal objective.
9:   Update gap ← (UB − LB)/LB.
10:  Add (at most) n violated constraints (one for each 1 ≤ i ≤ n) at this solution to the lower bound MIQO problem (5.12), which forms C^(t+1).
11:  Warm-start the next lower bound problem with the solution of Problem (5.13) on the restricted support set S^(t).
12:  t ← t + 1
13: end while
Computing the bound M
In this section, we describe how we compute bounds on the subgradient values. For some
initial feasible solution θ0 and ξ01, . . . ,ξ0n for Problem (5.11), we solve the following problems,
for each 1 ≤ t ≤ n, 1 ≤ u ≤ d:
$$\begin{aligned}
\min_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \xi_{t,u} \\
\text{subject to} \quad & \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|_2^2 \le \frac{1}{2} \|y - \theta^0\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i^0\|_2^2, \\
& \theta_i + \xi_i'(x_j - x_i) \le \theta_j \quad \forall i, j,
\end{aligned} \tag{5.14}$$
and
$$\begin{aligned}
\max_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \xi_{t,u} \\
\text{subject to} \quad & \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|_2^2 \le \frac{1}{2} \|y - \theta^0\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i^0\|_2^2, \\
& \theta_i + \xi_i'(x_j - x_i) \le \theta_j \quad \forall i, j.
\end{aligned} \tag{5.15}$$
We note that the feasible regions of Problems (5.14) and (5.15) are bounded, and hence the optimal objective values of both problems are guaranteed to be finite.
Let M^∗ be the maximum of the absolute values of the optimal objectives of (5.14) and (5.15) over all 1 ≤ t ≤ n and 1 ≤ u ≤ d. An optimal solution of (5.11) is clearly feasible for both (5.14) and (5.15). Therefore, using M^∗ in the formulation (5.12) does not exclude optimal solutions to (5.11), and therefore the optimal values of Problems (5.11) and (5.12) are equal.
5.3.2 Dual approach
In this section, we adapt the approach proposed by Bertsimas and Van Parys [2016] for
sparse linear regression to this convex regression setting. We solve the following regularized
problem, for a given λ > 0,
$$\begin{aligned}
\min_{\theta,\, \xi_1, \ldots, \xi_n} \quad & \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|^2 \\
\text{subject to} \quad & \theta_i + \xi_i'(x_j - x_i) \le \theta_j \quad \forall i, j, \\
& \mathrm{Supp}(\xi_i) \subseteq S \;\; \forall i, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i, \\
& |S| \le k, \;\; S \subseteq \{1, \ldots, d\}.
\end{aligned} \tag{5.16}$$
Before we proceed, we introduce some notation. S^d_k denotes the set of d-dimensional binary vectors with at most k non-zero components, i.e.,
$$\mathcal{S}^d_k = \Big\{ z \in \{0, 1\}^d : \sum_{i=1}^{d} z_i \le k \Big\}.$$
We next present the following result that transforms this problem to a binary optimization
problem with a convex objective function.
Theorem 5. Problem (5.16) is equivalent to solving the following binary optimization problem with convex objective, given by
$$\min_{z \in \mathcal{S}^d_k} \; g(z), \tag{5.17}$$
where
$$g(z) = \max_{\mu \ge 0} \; -\frac{1}{2} \sum_{i=1}^{n} \Big( y_i + \sum_{j=1}^{n} \mu_{ji} - \sum_{j=1}^{n} \mu_{ij} \Big)^2 - \frac{1}{2\lambda} \sum_{i=1}^{n} \sum_{p=1}^{d} z_p \Big( \sum_{j=1}^{n} \mu_{ij} (x_j - x_i) \Big)_p^2, \tag{5.18}$$
and a subgradient of g is given by the vector with pth element
$$\big(\partial g(z)\big)_p = -\frac{1}{2\lambda} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} \mu_{ij} (x_{ip} - x_{jp}) \Big)^2, \tag{5.19}$$
where μ is an optimal solution to the concave maximization problem given in Eq. (5.18).
Proof. Using binary variables z ∈ {0, 1}^d to denote the support set (z_j = 0 ⟺ (ξ_i)_j = 0 ∀i ∈ [n]), we write Problem (5.16) as
$$\min_{z \in \mathcal{S}^d_k,\, Z = \mathrm{diag}(z)} \; \min_{\theta,\, \xi_1, \ldots, \xi_n} \; \frac{1}{2} \|y - \theta\|^2 + \frac{\lambda}{2} \sum_{i=1}^{n} \|\xi_i\|^2 \quad \text{subject to} \;\; \theta_i + \xi_i' Z (x_j - x_i) \le \theta_j \;\; \forall i, j. \tag{5.20}$$
We take the dual of the inner convex optimization problem, which is given by
$$\max_{\mu \ge 0} \; -\frac{1}{2} \sum_{i=1}^{n} \Big( y_i + \sum_{j} \mu_{ji} - \sum_{j} \mu_{ij} \Big)^2 - \frac{1}{2\lambda} \sum_{i=1}^{n} \Big\| \sum_{j} \mu_{ij} Z (x_i - x_j) \Big\|^2.$$
For brevity, let $v_i = \sum_j \mu_{ij}(x_i - x_j)$. Note that Z'Z = Z² = Z, and thus we get
$$\max_{\mu \ge 0} \; -\frac{1}{2} \sum_{i=1}^{n} \Big( y_i + \sum_{j} \mu_{ji} - \sum_{j} \mu_{ij} \Big)^2 - \frac{1}{2\lambda} \sum_{i=1}^{n} \sum_{p=1}^{d} z_p \Big( \sum_{j=1}^{n} \mu_{ij} (x_j - x_i) \Big)_p^2, \tag{5.21}$$
and thus, the result follows.
From Theorem 5, as g is convex in z, we use µ to compute the subgradient of g which we
use to solve the outer binary minimization problem using cutting planes. This is equivalent
to approximating the convex function g by a piecewise linear function of its lower tangents,
while improving the outer approximation by adding a new tangent at each iteration. To be
precise, we solve the outer problem as
$$\begin{aligned}
\min_{z \in \{0,1\}^d} \;\; & \max_{i=1,\ldots,m} \; \Big\{ g(z^{(i)}) + \partial g(z^{(i)})'(z - z^{(i)}) \Big\} \\
\text{subject to} \;\; & \sum_{i=1}^{d} z_i \le k,
\end{aligned} \tag{5.22}$$
or equivalently in epigraph form,
$$\begin{aligned}
\min_{z \in \{0,1\}^d,\, \gamma} \quad & \gamma \\
\text{subject to} \quad & g(z^{(i)}) + \partial g(z^{(i)})'(z - z^{(i)}) \le \gamma \quad \forall\, 1 \le i \le m, \\
& \sum_{i=1}^{d} z_i \le k,
\end{aligned} \tag{5.23}$$
where m is the number of cuts added.
While solving Problem (5.22) we use dynamic constraint generation, or lazy callbacks,
which enables the solver to avoid building multiple branch and bound trees each time a new
constraint is added to the problem. This leads to only one branch and bound tree being
built. Typically, lazy constraints are used when the full set of constraints is too large to
enumerate explicitly. Under this scheme, cuts are added to the model whenever a binary
feasible solution is found.
As mentioned in Bertsimas and Van Parys [2016] for the sparse linear regression case,
the linear relaxation of problem (5.17) provides strong warm starts to problem (5.16). This
motivates the following corollary.
Corollary 2. The linear relaxation of problem (5.17) is given by the following convex optimization problem with semi-infinite constraints:
$$\begin{aligned}
\min_{\mu \ge 0,\, \gamma} \quad & \frac{1}{2} \sum_{i=1}^{n} \Big( y_i + \sum_{j} \mu_{ji} - \sum_{j} \mu_{ij} \Big)^2 + \gamma \\
\text{subject to} \quad & \gamma \ge \frac{1}{2\lambda} \sum_{p=1}^{d} z_p \Big\{ \sum_{i=1}^{n} \Big( \sum_{j} \mu_{ij} (x_{ip} - x_{jp}) \Big)^2 \Big\} \quad \forall z \in \Delta_{k,d},
\end{aligned} \tag{5.24}$$
where $\Delta_{k,d} = \{ z \in \mathbb{R}^d : 0 \le z \le 1, \; \sum_{i=1}^{d} z_i \le k \}$.
We solve the relaxation to generate warm-starts for the original binary optimization
problem (5.22). Once again, we use cutting planes to solve this problem. At the optimal
solution, the support set would be the corresponding indices of the k largest values of the
vector v, with pth element given by
$$v_p = \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} \mu_{ij} (x_{ip} - x_{jp}) \Big)^2. \tag{5.25}$$
In practice, we have observed that this method does provide good quality warm-starts.
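The warm-start support selection via (5.25) is a direct computation; a sketch (illustrative names, with μ given as a dense n × n matrix whose (i, j) entry is μ_ij):

```python
import numpy as np

def warmstart_support(mu, X, k):
    """Pick the k indices with largest v_p (Eq. (5.25)), where
    v_p = sum_i (sum_j mu_ij (x_ip - x_jp))^2.  mu: (n, n), X: (n, d)."""
    diffs = X[:, None, :] - X[None, :, :]        # diffs[i, j] = x_i - x_j
    w = (mu[:, :, None] * diffs).sum(axis=1)     # w[i, p] = sum_j mu_ij (x_ip - x_jp)
    v = (w ** 2).sum(axis=0)
    return np.argsort(v)[::-1][:k]               # indices of the k largest v_p
```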
Before we elaborate further, we introduce some terminology. Here, the outer problem refers to the binary minimization problem (5.22). As we noted in the statement of Theorem 5, evaluating the function g requires solving an optimization problem (5.18), which we shall henceforth refer to as the inner problem.
Column generation methods for the inner problem
An issue with the above approach is that the number of variables μ in the inner problem is too large, i.e., O(n²), and is thus not practical for larger n. Hence, we propose a column
generation approach for solving the inner problem (5.18). We start with a subset of all the
n(n − 1) variables µ (with the rest set to zero), and add corresponding variables as we go
along. From the KKT conditions, for a given dual optimal solution µ we recover the primal
variables as:
$$\theta_i = y_i + \sum_{j=1}^{n} \mu_{ji} - \sum_{j=1}^{n} \mu_{ij} \;\; \forall i, \qquad \xi_i = \frac{1}{\lambda} \sum_{j=1}^{n} \mu_{ij} Z (x_i - x_j) \;\; \forall i. \tag{5.26}$$
We then check whether the constraints
$$\theta_i + \xi_i'(x_j - x_i) - \theta_j \le 0 \tag{5.27}$$
hold for all i, j. If not, then for any violating i, we find the j^∗ such that
$$j^∗ = \arg\max_{1 \le j \le n} \; \big\{ \theta_i - \theta_j + \xi_i'(x_j - x_i) \big\}, \tag{5.28}$$
add the variable μ_{ij^∗} to the set of active variables, and re-solve problem (5.18).
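The primal recovery (5.26) is a direct computation from the dual variables; as an illustrative sketch (with μ given as a dense n × n matrix, zeros for inactive entries, and z the binary support indicator so that multiplying by z plays the role of Z):

```python
import numpy as np

def recover_primal(mu, X, y, z, lam):
    """Recover (theta, xi) from the dual mu via the KKT conditions (5.26).
    mu: (n, n) with mu[i, j] = mu_ij; X: (n, d); z: (d,) support indicator."""
    theta = y + mu.sum(axis=0) - mu.sum(axis=1)           # y_i + sum_j mu_ji - sum_j mu_ij
    diffs = X[:, None, :] - X[None, :, :]                 # diffs[i, j] = x_i - x_j
    xi = (mu[:, :, None] * diffs).sum(axis=1) * z / lam   # (1/lam) sum_j mu_ij Z (x_i - x_j)
    return theta, xi
```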
In practice, problem (5.18), while having relatively simple nonnegativity constraints, has a dense quadratic objective, which often results in larger solve times. Instead, we solve its
dual, which is the inner minimization problem in Eq. (5.20). We use Algorithm 1 to solve
the inner minimization problem in (5.20) and calculate the variables θ and ξ1, . . . ,ξn, as well
as the dual variables µij corresponding to the constraints in Eq. (5.20). Given these values,
and the expression in Eq. (5.19), we compute the subgradient of g at this value of z, and
add the corresponding constraint to the outer binary optimization problem in the case of a
violation. This dual approach differs from the method in Bertsimas et al. [2016b], which is
a primal method. In Section 5.4, we observe that this dual approach has a significant edge
over the primal one.
We now present the complete algorithm for the dual approach:
Algorithm 3 Cutting plane based algorithm for the dual approach
Input: λ > 0, tolerance ε > 0.
Output: Optimal support z^∗.
1: Start with γ^0 = 0 and some feasible z^0.
2: t ← 0.
3: while γ^t < g(z^t) − ε do
4:   Compute a subgradient of g at z^t, using Theorem 5.
5:   Add the constraint g(z^t) + ∂g(z^t)'(z − z^t) ≤ γ.
6:   Re-solve the outer problem (5.23), with solution (z^{t+1}, γ^{t+1}).
7:   t ← t + 1.
8: end while
5.3.3 Initialization heuristics
In this section, we briefly describe a thresholding based heuristic for the sparse convex regression problem. This method provides an alternative to solving the relaxation problem (5.24) as a way of generating warm starts. For the sake of brevity, we do not include the ridge regularization term, but these methods can be easily adapted to include it as well. The unregularized sparse problem is
$$\begin{aligned}
\min_{\theta,\, \xi} \quad & \frac{1}{2} \|y - \theta\|^2 \\
\text{subject to} \quad & A\theta + \sum_{i=1}^{n} B_i \xi_i \le 0, \\
& \mathrm{Supp}(\xi_i) \subseteq S \;\; \forall i, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i, \\
& |S| \le k, \;\; S \subseteq \{1, \ldots, d\},
\end{aligned} \tag{5.29}$$
where A, B_i are the full matrices representing the total n(n − 1) constraints. Typically, at any feasible solution, only a few of the constraints will be active. Let the indices of the binding constraints be collected in T, and the corresponding submatrices be A_T, B_{T,i}. Thus, the problem can be written as
$$\begin{aligned}
\min_{\theta,\, \xi} \quad & \frac{1}{2} \|y - \theta\|^2 \\
\text{subject to} \quad & A_T\theta + \sum_{i=1}^{n} B_{T,i} \xi_i \le 0, \\
& \mathrm{Supp}(\xi_i) \subseteq S \;\; \forall i, \\
& \theta \in \mathbb{R}^n, \quad \xi_i \in \mathbb{R}^d \;\; \forall i, \\
& |S| \le k, \;\; S \subseteq \{1, \ldots, d\}.
\end{aligned} \tag{5.30}$$
Dualizing the linear inequality constraints, the objective is given by
$$f(\theta, \xi) = \max_{\lambda \ge 0} \; \frac{1}{2} \|y - \theta\|^2 + \lambda' \Big( A_T \theta + \sum_{i=1}^{n} B_{T,i} \xi_i \Big). \tag{5.31}$$
We smooth the objective function by subtracting a strongly convex term (τ/2)∥λ∥² for some fixed scalar τ > 0. Note that we need to efficiently compute this function f for different values of θ, ξ. The smooth convex objective is now
$$f_\tau(\theta, \xi) = \max_{\lambda \ge 0} \; \frac{1}{2} \|y - \theta\|^2 + \lambda' \Big( A_T \theta + \sum_{i=1}^{n} B_{T,i} \xi_i \Big) - \frac{\tau}{2} \|\lambda\|_2^2. \tag{5.32}$$
This function f_τ has a Lipschitz continuous gradient with parameter ℓ = λ_max(M'M)/τ [Nesterov, 2005]. The matrix M ∈ R^{m×(n+nd)}, where m is the number of rows of A_T (the number of binding constraints), is given by
$$M = [\,A_T \;\; B_{T,1} \;\; \ldots \;\; B_{T,n}\,].$$
Now, the optimal λ^∗_τ can be computed as
$$\lambda^*_\tau = \frac{1}{\tau} \Big( A_T \theta + \sum_{i=1}^{n} B_{T,i} \xi_i \Big)_+. \tag{5.33}$$
We then apply an upper quadratic approximation followed by an iterative thresholding procedure to f_τ(θ, ξ_1, . . . , ξ_n), while sequentially reducing the value of τ. The complete details of this algorithm can be found in the Appendix.
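The closed form (5.33) for the smoothed multiplier is a one-line computation; a sketch with illustrative dense inputs (A_T an m × n matrix, B_T a list of m × d matrices, one per point):

```python
import numpy as np

def lambda_star(A_T, B_T, theta, xis, tau):
    """Closed-form maximizer (5.33) of the smoothed dual (5.32):
    lambda* = (1/tau) * (A_T theta + sum_i B_{T,i} xi_i)_+ ,
    where (.)_+ is the elementwise positive part."""
    s = A_T @ theta
    for B_i, xi_i in zip(B_T, xis):
        s = s + B_i @ xi_i
    return np.maximum(s, 0.0) / tau
```

The positive part reflects the λ ≥ 0 constraint in (5.32): coordinates where the constraint slack A_T θ + Σ_i B_{T,i} ξ_i is nonpositive get a zero multiplier.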
5.4 Computational Experiments
Our objectives in this section are:
1. To understand the scalability and run times of Algorithm 1 for convex regression for
synthetic and real data.
2. To compare the performance of Algorithm 1 to other state of the art methods.
3. To understand the scalability and run times of Algorithms 2 and 3 for sparse convex
regression. Given that there are no competing approaches for this problem to the best
of our knowledge, we do not include any comparisons.
The structure of this section is as follows. In Section 5.4.1, we discuss the data generation
mechanism, and compare various initialization schemes for Algorithm 1 in Section 5.4.2. We
then examine its run times on synthetic data in Section 5.4.3, and infeasibility of the solution
at each iteration of Algorithm 1 in Section 5.4.4. Next, we compare it with other approaches
in Section 5.4.5, and discuss the run times of Algorithm 1 applied to the convex regression
problem with an `1 loss in Section 5.4.6. We then present the run times and infeasibility of
Algorithm 1 on real data in Section 5.4.7. Next, we consider the sparse convex regression
problem in Section 5.4.8, where we present the run times of Algorithm 2 (primal approach)
and Algorithm 3 (dual approach) for various sizes. Additionally, we present the accuracy
and run times of Algorithm 3 as a function of various parameters such as k, d, ρ, SNR, and
present the false positive rates of both the algorithms as well in this section. We conclude
by discussing our findings from these experiments in Section 5.4.9.
In all the experiments that follow, we use Gurobi 6.5.2 [Gurobi] as the optimization solver,
within the Julia programming language [Bezanson et al., 2017] using the JuMP modeling
language [Dunning et al., 2017]. All computations were performed on nodes of the Engag-
ing cluster, which is a collaboration between the Massachusetts Green High Performance
Computing Center (MGHPCC) and several of Boston’s leading universities. Each compute
node of the cluster had two 8-core, 2GHz Intel Xeon E2650 processors, 64GB of memory
and 3.5TB of local disk.
5.4.1 Synthetic Data
In this section, we generate the X data from a standard Gaussian distribution, and use the
convex function Φ(x) = ∥x∥₂², where y_i = Φ(x_i) + ε_i, 1 ≤ i ≤ n. The errors ε_i are assumed to
be independent and identically distributed Gaussian, i.e., N(0, σ²), for i = 1, . . . , n. We scale
the data appropriately so that the signal-to-noise ratio (SNR) is 3, i.e., Var(µ)/Var(ε) = 3. Finally,
before feeding the data into the algorithm, we mean-center and normalize the features and
response vectors to have unit ℓ2 norm.
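The data-generation recipe above can be sketched as follows (illustrative Python of our own, mirroring the description; the thesis's experiments were run in Julia):

```python
import numpy as np

def make_convex_data(n, d, snr=3.0, seed=0):
    """x_i ~ N(0, I_d), y_i = ||x_i||^2 + eps_i with Var(mu)/Var(eps) = snr,
    then mean-center and scale the features and response to unit l2 norm."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    mu = np.sum(X**2, axis=1)
    sigma = np.sqrt(mu.var() / snr)       # noise level chosen to hit the target SNR
    y = mu + sigma * rng.standard_normal(n)
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)     # each feature column has unit l2 norm
    y = y - y.mean()
    y = y / np.linalg.norm(y)
    return X, y
```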
We report the total number of blocks of cuts (iterations) added, along with another
metric called primal infeasibility [Mazumder et al., 2018],

Primal infeasibility = (1/n)∥V∥_F, (5.34)

where the matrix V has entries V_{i,j} = (θ_i + ξ′_i(x_j − x_i) − θ_j)₊ for all 1 ≤ i, j ≤ n, where
z₊ = max{z, 0}. V_{i,j} indicates the magnitude of violation of that constraint, and a value of
0 indicates no violation. Note that ∥ ⋅ ∥_F denotes the usual Frobenius norm, where

∥V∥²_F = ∑_{i=1}^{n} ∑_{j=1}^{n} V²_{i,j}. (5.35)
Finally, Tol is the threshold above which we report the constraint to be violated.
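As a concrete illustration, both the primal infeasibility (5.34) and the maximum violation can be computed from a candidate solution (θ, ξ) in a few vectorized lines. This is an illustrative NumPy sketch of ours, not the thesis's Julia implementation:

```python
import numpy as np

def primal_infeasibility(theta, xi, X):
    """Compute (5.34): V[i, j] = (theta_i + xi_i'(x_j - x_i) - theta_j)_+,
    returning ||V||_F / n together with the maximum violation over all pairs."""
    n = theta.shape[0]
    # G[i, j] = xi_i' x_j - xi_i' x_i
    G = xi @ X.T - np.sum(xi * X, axis=1, keepdims=True)
    V = np.maximum(theta[:, None] + G - theta[None, :], 0.0)
    return np.linalg.norm(V) / n, V.max()
```

For an exactly convex fit (e.g., θ_i = ∥x_i∥² with subgradients ξ_i = 2x_i), every constraint holds and both quantities vanish.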
5.4.2 Comparison of initialization methods for the reduced master
problem
In this section, we apply Algorithm 1 to the least squares convex regression problem (5.4).
We compare the run times of different ways of forming the reduced master problem for
Problem (5.4), using the methods MST, SP, C, and R. MST refers to the Euclidean minimum
spanning tree formed on the set of points x_1, . . . , x_n. SP refers to the spanning path approach
described in Section 5.2.1. C refers to finding the point closest to each point, and adding
that pair in that order. For example, if x_j is closest to x_i, we add the constraint

θ_i + ξ′_i(x_j − x_i) ≤ θ_j.
Finally, R refers to finding a point randomly sampled from the remaining n − 1 for each xi,
and adding the resulting n constraints. The last four methods, 2-MST, 2-SP, 2-C, and 2-R,
denote two-sided constraints, i.e., for each pair (x_i, x_j), we add both the constraints

θ_i + ξ′_i(x_j − x_i) ≤ θ_j and θ_j + ξ′_j(x_i − x_j) ≤ θ_i.
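The R and 2-R initializations are straightforward to generate; a small illustrative sketch (function name and structure are ours):

```python
import random

def random_init_pairs(n, two_sided=False, seed=0):
    """'R' initialization: pair each point i with one uniformly random j != i,
    giving n initial constraints; '2-R' adds the reversed pair as well."""
    rng = random.Random(seed)
    pairs = []
    for i in range(n):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # shift past i so a point is never paired with itself
        pairs.append((i, j))
        if two_sided:
            pairs.append((j, i))
    return pairs
```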
The term Tol in Table 5.1 refers to the tolerance to which each of the n(n − 1) constraints
is satisfied when Algorithm 1 terminates. The sizes for all the instances are set to (n, d) =
(10^4, 10), and we use the least squares objective with the convex function ∥x∥₂², without the
ridge regularization term on the subgradients. All entries in the table are averaged over the
same twenty instances. The numbers in parentheses indicate the standard deviation.
Method Tol Cuts added (Blocks) Primal Infeasibility Run time (seconds)
MST 0.10 26 (2) 0.0104 (0.0001) 94.44 (20.826)
SP 0.10 9 (6) 0.0112 (0.0002) 21.36 (12.820)
C 0.10 25 (2) 0.0106 (0.0002) 55.93 (4.539)
R 0.10 6 (3) 0.0106 (0.0002) 15.39 (6.125)
2-MST 0.10 26 (2) 0.0098 (0.0002) 343.53 (62.363)
2-SP 0.10 15 (5) 0.0091 (0.0001) 35.66 (11.276)
2-C 0.10 26 (2) 0.0104 (0.0002) 131.09 (18.933)
2-R 0.10 21 (2) 0.0088 (0.0001) 46.47 (5.273)
MST 0.05 29 (2) 0.0073 (0.0002) 221.21 (40.035)
SP 0.05 25 (2) 0.0078 (0.0002) 56.75 (7.027)
C 0.05 30 (2) 0.0061 (0.0001) 117.07 (19.565)
R 0.05 26 (3) 0.0074 (0.0001) 57.95 (8.107)
2-MST 0.05 31 (3) 0.0055 (0.0001) 1448.46 (313.576)
2-SP 0.05 25 (2) 0.0068 (0.0001) 58.30 (7.041)
2-C 0.05 31 (2) 0.0059 (0.0001) 567.26 (121.269)
2-R 0.05 26 (2) 0.0064 (0.0001) 61.51 (6.861)
Table 5.1: The effect of the initialization method for (n, d) = (10^4, 10) in the ℓ2 convex regression for tolerances Tol = 0.1 and 0.05.
The results of Table 5.1 suggest that starting from a "good" initial reduced master prob-
lem can substantially impact the total run time of Algorithm 1. Both the spanning path
(SP) and random (R) methods outperform the other methods. SP and R perform similarly,
with the one-sided variants being marginally better than the two-sided constraints. We chose
R in all of our further experiments.
5.4.3 Run times of ℓ2 convex regression

In this section, we report how Algorithm 1 scales as n and d increase for Problem (5.4), for
different tolerances with the least squares objective. Table 5.2 presents the results obtained
for a tolerance of 0.1, while Table 5.3 shows the results for a tolerance of 0.05.
n d Cuts (Blocks) Infeasibility Run time
10^3      10        24 (2)   0.0147 (0.0016)   2.4s (1.5s)
10^4      10        8 (5)    0.0106 (0.0002)   16.5s (8.7s)
10^4      10^2      14 (3)   0.0107 (0.0003)   169.2s (35.5s)
10^4      10^3      22 (6)   0.0107 (0.0002)   1.5h (0.4h)
10^5      10        5 (4)    0.0054 (0.0001)   1156.9s (859.4s)
10^5      10^2      5 (1)    0.0056 (0.0001)   3.8h (0.4h)
10^5      5 × 10^2  6 (1)    0.0056 (0.0001)   19.1h (3.0h)
5 × 10^5  10        5 (4)    0.0034 (0.0000)   20.2h (7.2h)

Table 5.2: Run times for Tol = 0.1 and ℓ2 convex regression.
n d Cuts (Blocks) Infeasibility Run time
10^3   10     36 (4)    0.0026 (0.0004)   58.0s (25.6s)
10^4   10     25 (3)    0.0074 (0.0001)   57.0s (8.4s)
10^4   10^2   110 (3)   0.0065 (0.0003)   1369.3s (91.7s)
10^5   10     11 (6)    0.0039 (0.0001)   1.0h (0.4h)
10^5   10^2   11 (1)    0.0040 (0.0000)   6.8h (0.7h)

Table 5.3: Run times for Tol = 0.05 and ℓ2 convex regression.
We make the following observations:
• As the number of dimensions increases, the problem becomes harder to solve as each
added constraint becomes more dense. This is reflected in both Tables 5.2 and 5.3.
• The largest instances, (10^5, 500) and (5 × 10^5, 10), took almost a day on average to
solve to the required tolerance. While we tried solving them with Tol = 0.05, the run
time exceeded 24 hours, after which we terminated them. For such problems,
the interior point solvers, even if they solve the initial reduced master problem, stall at
subsequent iterations when the quadratic problem has close to a million constraints.
• When the tolerance is reduced to 0.05, the run time of the (10^4, 10^2) instances increases
from roughly 2.5 minutes to 23 minutes, with the average number of iterations increasing
by a factor of eight.
• To further aid in interpreting the results, we performed a linear regression of the run
times versus n, d, and Tol. Our results indicate that a linear relationship between these
variables has an R² of 0.96, which indicates a good fit. Regressing the logarithm of the
run times on the logarithms of n and d yields that the run time scales as n^1.25 and
d^1.05, which also resulted in an R² value of 0.96.
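The log-log fit described above amounts to an ordinary least-squares regression on logarithms; a hypothetical sketch of ours:

```python
import numpy as np

def fit_runtime_exponents(n_vals, d_vals, times):
    """Fit log(time) = a*log(n) + b*log(d) + c by least squares and
    return the estimated exponents (a, b)."""
    A = np.column_stack([np.log(n_vals), np.log(d_vals), np.ones(len(times))])
    coef, *_ = np.linalg.lstsq(A, np.log(times), rcond=None)
    return coef[0], coef[1]
```

On noiseless data generated as time = C · n^a · d^b, the fit recovers the exponents exactly.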
5.4.4 Infeasibility as a function of iterations
In this section, we aim to understand how the primal infeasibility changes as a function of
the iterations for different values of tolerance. In addition to primal infeasibility defined in
(5.34), we report the maximum violation defined as

max_{i∈[n], j∈[n]} (θ_i − θ_j + ξ′_i(x_j − x_i)), (5.36)

as well as the constraints added at each iteration. We present two instances with (n, d) =
(10^4, 10), with tolerance set to 0.1 and 0.05 respectively, and illustrate the progress of
the algorithm: constraints added at each iteration, primal infeasibility, and the maximum
violation at the end of each iteration.
Figure 5-1: Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.1. (a) Primal infeasibility defined in (5.34) as a function of the number of iterations. (b) Maximum violation defined in (5.36) as a function of the number of iterations. (c) Number of constraints added at each iteration.
Figure 5-2: Progress of Algorithm 1 for (n, d) = (10^4, 10), Tol = 0.05. (a) Primal infeasibility defined in (5.34) as a function of the number of iterations. (b) Maximum violation defined in (5.36) as a function of the number of iterations. (c) Number of constraints added at each iteration.
Figures 5-1 and 5-2 suggest that Algorithm 1 makes rapid progress in decreasing infeasibility.
It takes twenty to twenty-five iterations to decrease the infeasibility (and violation) to near zero.
Moreover, the number of cuts added decreases substantially as Algorithm 1 progresses. In
the final few iterations, when Algorithm 1 is close to convergence, it typically adds fewer
than five constraints per iteration. Even for the larger sizes, we observe this trend of fewer
constraints per iteration at later stages of the algorithm.
5.4.5 Comparison with other state-of-the-art methods
In this section, we compare Algorithm 1 with two other recent methods proposed in the
literature for the least squares convex regression Problem (5.4):
1. The cutting plane based method proposed in Balázs et al. [2015], referred to as ag-
gregated cutting planes (ACP). The main difference from Algorithm 1 is that Balázs
et al. [2015] use aggregated constraints in the reduced master problem.
2. The method in Mazumder et al. [2018], where the authors use an Alternating Direction
Method of Multipliers (ADMM) framework to solve the least squares convex regression
problem.
The ACP algorithm solves a variant of Problem (5.4), with bounds on both the function
values and the subgradients, both of which we set to ∞ in Algorithm 1. Both the ACP algo-
rithm and Algorithm 1 were run with an upper bound of 1000 on the iteration limit and
Tol = 0.1. Each of the rows with n < 10^5 was averaged over twenty randomly and independently
generated samples of the given size, while the larger ones (n ≥ 10^5) were averaged over ten
independently generated samples.
In Table 5.4, we record the final values of primal infeasibility and total running times
for Algorithm 1 and ACP, respectively. As far as solution quality is concerned, the
final infeasibility indicates that the solutions obtained from both these methods are quite
similar. However, Algorithm 1 is approximately twenty times faster than the ACP algorithm as
(n, d) increase. For (n, d) = (10^5, 10^2), while Algorithm 1 obtained solutions in a few hours,
the ACP algorithm did not complete even after 24 hours, after which it was terminated.
We remark that most of the time the ACP algorithm takes is spent forming the initial aggregated
constraints. The results followed a similar pattern for Tol = 0.05, and thus we omit them for
the sake of brevity.
n d (Alg. 1) Inf. (Alg. 1) Run time (ACP) Inf. (ACP) Run time
10^3   10     0.0143 (0.0012)   1.9 (0.6)        0.0168 (0.0011)   7.3 (0.8)
10^4   10     0.0106 (0.0002)   25.2 (10.8)      0.0099 (0.0002)   411.7 (26.6)
10^4   10^2   0.0107 (0.0002)   153.5 (20.3)     0.0097 (0.0003)   4785.7 (363.7)
10^5   10     0.0054 (0.0001)   1841.8 (230.9)   0.0050 (0.0001)   36842.7 (1391.03)

Table 5.4: Comparison for ℓ2 convex regression between Algorithm 1 and ACP for Tol = 0.1.
In Table 5.5, we present a comparison between ADMM and Algorithm 1 for instances
with n = 10^3 and d = 10. For larger sizes of n = 10^4, the ADMM method ran into memory
issues, and hence we do not report its performance for those cases. We set both the primal
error and gradient error tolerances to 0.1 in the ADMM algorithm. We observe that the
ADMM algorithm has an edge over Algorithm 1 in terms of infeasibility, whereas Algorithm 1
has the edge in terms of maximum violation. Algorithm 1 improves when the tolerance
is reduced to 0.05, with primal infeasibility similar to the ADMM solution. However, the
maximum violation is guaranteed to be at most 0.05 for Algorithm 1, while this is not satisfied
by the ADMM method. The ADMM solution can be improved by reducing the primal and
gradient error tolerances, but the point we emphasize is that Algorithm 1 gives direct
control over the maximum constraint violation.
n     d    Tol    (Alg. 1) Inf.   (Alg. 1) time   ADMM Inf.   ADMM time   ADMM Max viol.
10^3  10   0.1    0.0150          8.3             0.0059      47.8        0.0840
10^3  10   0.05   0.0029          142.7           0.0059      46.8        0.0885

Table 5.5: Comparison for ℓ2 convex regression with ADMM.
5.4.6 Run times for ℓ1 convex regression

In this section, we solve Problem (5.10), where we minimize the ℓ1 loss rather than the usual
least squares loss, and demonstrate how the algorithm scales in this context. Table 5.6 shows
the run times and cuts added for a few instance sizes with the tolerance set to 0.1.
n d Cuts (Blocks) Infeasibility Run time (seconds)
10^3   10     24 (3)     0.0158 (0.0012)   2.9 (3.7)
10^4   10     10 (1)     0.0118 (0.0001)   25.3 (2.4)
10^4   10^2   168 (10)   0.0119 (0.0001)   2437.7 (470.3)
10^5   10     9 (1)      0.0056 (0.0001)   2501.9 (416.3)

Table 5.6: ℓ1 convex regression - Run times for Tol = 0.1.
We observe that for the same 0.1 tolerance, the run times are higher than those
obtained for ℓ2 regression (Table 5.2). Also, as d increases for a given n, the run times
increase more sharply than in the ℓ2 case.
5.4.7 Experiments on real data
In this section, we apply some of our methods to a real-world data set. This data set,
which was considered in Mekaroonreung and Johnson [2012], was downloaded from
https://ampd.epa.gov/ampd/. The data consist of the amount of heat input (in MMBtu) and
the following four covariates: the NOx emission rate, and the emissions of SO2, CO2, and NOx in
tons. We consider nine years' worth of data on electric utility units from 2000-2008, and after
removing some rows with missing entries, we obtain a dataset with n = 28,063 and d = 4.
We took a logarithmic transformation of the covariates, and centered and scaled them so that
they had mean zero and standard deviation one. We ran the cutting plane algorithm for
solving the least squares convex regression problem on this dataset, and present the results
below.
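This preprocessing can be sketched as follows (an illustrative Python snippet of ours, not the thesis's Julia code; it assumes strictly positive covariates, as holds for these emissions quantities):

```python
import numpy as np

def log_standardize(X):
    """Log-transform each covariate, then center and scale each column
    to mean 0 and standard deviation 1."""
    Z = np.log(X)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)
```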
Figure 5-3: Progress of Algorithm 1 for Tol = 0.01. (a) Primal infeasibility as a function of the number of iterations. (b) Maximum violation as a function of the number of iterations. (c) Number of constraints added at each iteration.
Figure 5-4: Progress of Algorithm 1 for Tol = 0.05. (a) Primal infeasibility as a function of the number of iterations. (b) Maximum violation as a function of the number of iterations. (c) Number of constraints added at each iteration.
We make the following observations.
• Figures 5-3 and 5-4 suggest that Algorithm 1 makes rapid progress in decreasing infeasibility.
For a tolerance of 0.05, it reaches optimality fairly quickly, in around ten iterations,
while it takes around twenty iterations for the smaller tolerance of 0.01.

• Similar to the experiments on synthetic data, the number of cuts added decreases
substantially as Algorithm 1 progresses. The final few iterations involve adding a very
small number of cuts at each iteration.

• Finally, we include a note on the running times of the algorithm for this data. We
observe a run time of 20-30 minutes for a tolerance of 0.05, which is along the lines of
what we observe in Table 5.3 for synthetic data. On reducing the tolerance to 0.01, the
run time increases to 60-70 minutes, which is expected as the number of iterations
doubles in this case.
5.4.8 Sparse convex regression
In this section, we present the computational results for Algorithms 2 and 3 applied to
the problem of optimal subset selection in this setting. As in the continuous case, we
generate X from a Gaussian distribution, and randomly sample the support set of size k
from {1, . . . , d}. We generate n d-dimensional vectors x_i, each of which is drawn from a
Gaussian distribution with zero mean and correlation matrix Σ, where Σ_ij = ρ^|i−j|, 1 ≤ i, j ≤ d,
for some correlation 0 ≤ ρ ≤ 1. Note that when ρ = 0, the features are i.i.d., and a higher ρ
indicates that the correlation among the features is larger.

We use the convex function Φ(x) = ∑_{i∈S*} x_i², and the response data y_i = Φ(x_i) + ε_i, 1 ≤
i ≤ n. The errors ε_i are i.i.d. N(0, σ²) for all i = 1, . . . , n. We scale the data appropriately
so that the signal-to-noise ratio (SNR) is 3. Again, we mean-center and normalize the
features and response vectors to have unit ℓ2 norm before providing the data as input
to Algorithms 2 and 3.
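The sparse data-generation scheme above can be sketched as follows (illustrative Python of ours; scaling the noise to the target SNR and the ℓ2 normalization proceed exactly as in the continuous case and are omitted here):

```python
import numpy as np

def make_sparse_design(n, d, k, rho, seed=0):
    """Correlated Gaussian design with Sigma_ij = rho^|i-j| and a random true
    support S* of size k; responses are y_i = sum_{j in S*} x_ij^2 (noiseless)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-style Toeplitz
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    support = sorted(rng.choice(d, size=k, replace=False))
    y = np.sum(X[:, support] ** 2, axis=1)
    return X, y, support
```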
First, we demonstrate the value of using an MIO solver, by iteratively adding constraints
to the primal problem according to Algorithm 2, and show the computational results in
Tables 5.7 and 5.8. Next, we present the results for Algorithm 3 in Table 5.9, where we
reformulate the sparse problem as minimizing a convex piecewise linear function over pure
binary variables. If S is the optimal support set obtained by our algorithms, we define accuracy as

Accuracy = |S* ∩ S| / k, (5.37)

where S* is the true support. Next, we define the false positive rate, which is the fraction
of features from the recovered support that are outside the true support S*, i.e.,

False Positive Rate = |S ∖ S*| / |S|. (5.38)
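These two metrics are simple set computations; an illustrative sketch:

```python
def support_metrics(S_hat, S_true, k):
    """Accuracy (5.37) = |S* ∩ S| / k and false positive rate (5.38)
    = |S \\ S*| / |S| for a recovered support S_hat and true support S_true."""
    S_hat, S_true = set(S_hat), set(S_true)
    return len(S_hat & S_true) / k, len(S_hat - S_true) / len(S_hat)
```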
A. Primal approach (Algorithm 2)
We present the results for the primal approach, as defined in Algorithm 2, in Tables 5.7 and
5.8 for n = 50k and 100k respectively. First, we discuss how the value of M in Problem (5.12)
is selected in the execution of Algorithm 2. In this primal approach, we solve the sparse
convex regression problem with ℓ∞ norm bounds on the subgradients. Mazumder et al.
[2018] argue that the subgradients of the points near the boundary of Conv(x_1, . . . , x_n) grow
large, resulting in overfitting, and thus a bound on the subgradients is needed. Consequently,
we vary the value of M and select it via cross-validation. With M* denoting the maximum
absolute value of the optimal solutions of Problems (5.14) and (5.15), we set M = ηM* for
varying η, and calculate the validation error for each of these choices of M. For smaller
values of M, the solution is too constrained, and for larger values, overfitting tends to occur.
We use the one-standard-error rule for cross-validation [Hastie et al., 2009] when se-
lecting the value of the parameter M. While performing cross-validation to find the best
hyperparameter, we typically select various values of the parameter M_1, . . . , M_s, with corre-
sponding mean errors and standard deviations of the mean error on the validation set given
by E_1, . . . , E_s and σ_1, . . . , σ_s respectively. Typically, these values are obtained by K-fold
cross-validation. The one-standard-error rule selects the parameter M = M_j, where j is the
smallest index in the set {i : E_i ≤ E_{i*} + σ_{i*}} and i* = argmin_{i=1,...,s} E_i.
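The rule itself is a short computation over the cross-validation curve; a hedged sketch (names are ours):

```python
def one_standard_error(params, mean_errors, std_errors):
    """Select the parameter by the one-standard-error rule: among indices whose
    mean CV error is within one standard error of the best, pick the smallest."""
    i_star = min(range(len(mean_errors)), key=lambda i: mean_errors[i])
    bound = mean_errors[i_star] + std_errors[i_star]
    j = min(i for i in range(len(mean_errors)) if mean_errors[i] <= bound)
    return params[j]
```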
For n ≤ 10^4, the thresholding heuristic described in Section 5.3.3 was used to provide
warm starts to Algorithm 2. For n > 10^4, we ran the same thresholding heuristic on a
sample of the points, and used the resulting support as a warm start. For n ≤ 10^4, we solve
Problems (5.14) and (5.15) to find the value M* and set M = ηM*. We choose η via
cross-validation from the set {10^-3, 10^-2, 10^-1, 0.5}. For n > 10^4, we avoid solving Problems
(5.14) and (5.15) due to high solve times, and select M from the set {10^-3, 10^-2, 10^-1, 0.5} via
cross-validation. In this case, we set the ridge regularization parameter λ to zero.
For every row in the following tables, we report the median run times and mean
accuracies over ten independently generated samples for the case n = 50k, d = 100, and
over five samples for instances where d = 500 or n = 100k, with the standard deviation across these
samples in the parentheses. The key finding, comparing Tables 5.2 and 5.8, is that
sparse convex regression in fact solves faster than convex regression, at least for
k = 10. Moreover, the resulting accuracy is at least 95%. Furthermore, as n increases, the
accuracy of Algorithm 2 increases to a near-perfect value of 100%. Tables 5.7 and 5.8
indicate that as n increases the accuracy increases, and beyond a certain n the accuracy
becomes 100%.
n = 50k

ρ     k    d     Accuracy %     Run time
0.0   10   100   100.0 (0.0)    1691.16 (284.8)
0.0   10   500   100.0 (0.0)    6522.37 (281.1)
0.0   20   100   98.0 (2.7)     2411.76 (323.0)
0.0   20   500   92.0 (2.3)     15276.47 (5862.6)
0.1   10   100   100.0 (0.0)    2778.83 (6369.8)
0.1   10   500   100.0 (0.0)    6326.79 (613.4)
0.1   20   100   99.0 (2.2)     2109.42 (292.8)
0.1   20   500   94.4 (2.2)     11589.47 (6883.6)
0.5   10   100   100.0 (0.0)    2062.20 (508.0)
0.5   10   500   98.0 (4.5)     6083.46 (441.1)
0.5   20   100   95.0 (3.5)     3158.10 (868.4)
0.5   20   500   93.3* (4.1)    25596.20 (3310.3)

Table 5.7: Accuracy % and run times for Algorithm 2 for n = 50k.
n = 100k, d = 100

ρ     k    Accuracy %     Run time
0.0   10   95.0 (12.2)    7665.68 (38.23)
0.0   20   100.0 (0.0)    6605.0 (357.8)
0.1   10   100.0 (0.0)    8939.83 (3575.0)
0.1   20   100.0 (0.0)    11638.24 (1950.7)
0.5   10   100.0 (0.0)    6823.41 (777.0)
0.5   20   96.0 (2.2)     11937.81 (2120.6)

Table 5.8: Accuracy % and run times for Algorithm 2 for n = 100k, d = 100.
We make the following observations:
• For n = 50k, increasing d from one hundred to five hundred increases the run time by
almost four times. The accuracy, however, remains close to 100%. When we raised d
to a thousand, the machines ran out of memory.

• The median run times and accuracies for ρ = 0.1 remain comparable in magnitude with
the results for ρ = 0. For (50k, 100, 10), the run times were highly skewed, with values
ranging from 2,000 to a maximum of 15,000 seconds. Excluding the three values with
run times of over 10,000 seconds, the median run time of the remaining seven instances
was 2,509.64 seconds with a standard deviation of 413.9 seconds.

• For ρ = 0.5, the solver could not solve one instance out of the five samples for
(50k, 500, 20). We report the median over the remaining four instances. The median
run times, on average, increase with a corresponding increase in ρ.
B. Dual approach (Algorithm 3)
We present the results of the dual approach in Table 5.9, where we report the average accuracy
(in %) and the average run times (in seconds) to provable optimality (MIO optimality gap
of 1%). Each row for d = 100 is averaged over ten independently generated random instances
of that size, with five such samples for problems where d = 500. As before, we use the one-
standard-error rule with cross-validation [Hastie et al., 2009] to select the final value
of the parameter λ, varying it from 10^-3 to 10^-1. The tolerance parameter ε in Algorithm 3
is set to 10^-3 in all the experiments that follow.
n = 50k

ρ     k    d     Accuracy %    Run time (seconds)
0.0   10   100   97.0 (4.8)    2072.23 (236.5)
0.0   10   500   96.0 (5.5)    1785.43 (239.3)
0.0   20   100   94.0 (3.9)    2356.16 (318.7)
0.0   20   500   91.1 (4.9)    1633.81 (523.8)
0.1   10   100   97.0 (4.8)    2073.42 (434.3)
0.1   10   500   96.7 (5.8)    2178.06 (358.9)
0.1   20   100   91.0 (4.6)    2055.16 (608.3)
0.1   20   500   90.0 (0.0)    1493.74 (169.9)
0.5   10   100   92.0 (6.3)    1210.07 (171.4)
0.5   10   500   98.0 (4.5)    1858.62 (522.8)
0.5   20   100   91.0 (4.2)    1122.74 (149.8)
0.5   20   500   89.0 (5.5)    1685.85 (340.2)

Table 5.9: Accuracy % and run times for Algorithm 3 for n = 50k.
A key takeaway from Table 5.9 is that the dual Algorithm 3 is able to solve instances
with 50,000 sample points in a few minutes for various values of the correlation, with high
support-recovery rates. For n > 50,000, the key bottleneck is computing the objective g
and its subgradients. Recall that while evaluating g, Algorithm 3 requires the solution
of a continuous ridge-regularized convex regression problem on a restricted support set
(Problem (5.18)), which has O(nk) terms in its objective. The relaxation problem (5.24),
which provides good-quality warm starts, also becomes computationally expensive to solve
due to the presence of dense semi-infinite constraints.

To summarize, both the primal and dual methods achieve exact or near-exact recovery on
fairly noisy data (as evidenced by the fairly low signal-to-noise ratio of 3). While
the primal approach seems to have an edge in terms of scalability over the dual approach,
the dual approach is faster than the primal approach when (n, d, k) = (5 × 10^4, 500, 10).
C. Accuracy
In this section, we report on the accuracy of the solutions obtained by Algorithm 3 as a function
of the parameters d, k, ρ, and SNR. We generate synthetic data for various values of each
of these parameters, varying one parameter at a time while keeping the rest
constant. We present the mean accuracy and run times averaged over fifteen independently
generated samples, along with their one-standard-deviation error bars.
Figure 5-5: Accuracy and run times for varying SNR: (a) √SNR = 3, (b) √SNR = 7, (c) √SNR = 20.
In Figures 5-5a–5-5c, we fix (d, k, ρ) = (100, 10, 0.1) and vary √SNR ∈ {3, 7, 20}.
Figure 5-6: Accuracy and run times for varying correlation ρ: (a) ρ = 0.0, (b) ρ = 0.1, (c) ρ = 0.5.
In Figures 5-6a–5-6c, we fix (d, k, √SNR) = (100, 10, 20) and vary ρ in the set {0, 0.1, 0.5}.
Figure 5-7: Accuracy and run times for varying dimension d: (a) d = 50, (b) d = 100, (c) d = 150.
In Figures 5-7a–5-7c, we fix (k, ρ, √SNR) = (5, 0.1, 20) and vary d in the set {50, 100, 150}.
Figure 5-8: Accuracy and run times for varying sparsity parameter k: (a) k = 5, (b) k = 10, (c) k = 15.
Finally, in Figures 5-8a–5-8c, we fix (d, ρ, √SNR) = (100, 0.1, 20) and vary the sparsity
level k in the set {5, 10, 15}. We solve the problems with time cutoffs of two, five, and ten
minutes for k = 5, 10, 15 respectively, and take the best solution obtained by that time in
case the incumbent solution has not been certified optimal by then.
We make the following observations:
(a) As n increases, the accuracy of Algorithm 3 increases and the running time decreases.
These observations are consistent with the findings of Bertsimas and Van Parys [2016]
in the context of sparse linear regression.

(b) As the SNR increases, we reach higher accuracy for smaller values of n; that is, the problem
becomes easier (see Figures 5-5a–5-5c).

(c) To reach an accuracy of 95%, we need n equal to 10,000, 12,000, and 15,000 for ρ equal to
0, 0.1, and 0.5, respectively (see Figures 5-6a–5-6c).

(d) To reach an accuracy of 95%, we need n equal to 3,000, 4,000, and 5,000 for d equal to 50,
100, and 150, respectively (see Figures 5-7a–5-7c).

(e) To reach an accuracy of 90%, we need n equal to 2,500, 8,000, and 10,000 for k equal to 5,
10, and 15, respectively (see Figures 5-8a–5-8c).
D. False Positive rates
In this section, we investigate the false positive rate of the estimator resulting from this
algorithm. So far, we have taken the sparsity parameter k as given in all of our experiments.
In reality, however, this value needs to be inferred from the data, usually by cross-
validation. Thus, it is imperative that the algorithm not only choose the relevant features,
but also pick no extra spurious ones and mark them as relevant.

To check this, we performed an experiment with simulated data for (n, d) = (10000, 100)
with five features chosen randomly. We vary k in the set {3, . . . , 10}, and choose the best k
by five-fold cross-validation. We then run our algorithms for that value of k, and report the
median false positive rate over ten independently generated samples. We present our results
for both the primal and dual algorithms in Tables 5.10 and 5.11 respectively.
For the dual approach, we impose a time limit of 120 seconds and take the best solution
obtained by that point in time; for the primal method, we do not impose any such time
limit. We report the median false positive rate over ten independently chosen samples for
different values of ρ and SNR. The results suggest that our algorithms not only pick the relevant
features, but are also able to control for spurious discoveries.
√SNR ρ = 0.0 ρ = 0.1 ρ = 0.5
3 0% 0% 0%
7 0% 0% 0%
20 0% 0% 0%
Table 5.10: False Positive rate for Algorithm 3.
√SNR ρ = 0.0 ρ = 0.1 ρ = 0.5
3 0% 0% 0%
7 0% 0% 0%
20 0% 0% 0%
Table 5.11: False Positive rate for Algorithm 2.
5.4.9 Discussion
(a) For the problem of convex regression, we see that Algorithm 1 has a significant edge
over other state-of-the-art methods in terms of run time and accuracy. Our approach
allows us to solve problems with n = 100,000 and d = 100 in hours. It is also flexible
enough to accommodate other constraints, such as coordinate-wise monotonicity and
norm-bounded subgradients.

(b) For the sparse convex regression problem, the dual approach (Algorithm 3) has an
edge over the primal method (Algorithm 2) in run times and scalability. Surprisingly,
Algorithm 3 solves the sparse convex regression problem in times comparable to the
continuous case, implying that the price of sparsity is small. Since we break new
ground in this area, we are unable to include any comparisons to other methods.

(c) For the sparse convex regression problem, the primal approach scales to problems of
size (n, d, k) = (10^5, 100, 10) in hours, while the dual approach scales to (n, d, k) =
(5 × 10^4, 500, 10) in minutes. We perform various experiments varying the degree
of correlation among the covariates ρ, the signal-to-noise ratio (SNR), the number of features
d, and the sparsity level k, and demonstrate that our algorithms achieve near-perfect support
recovery as n increases. We also note that both Algorithms 2 and 3 limit the false
positive rate.
Chapter 6
Conclusions
This thesis started with the motivation that the current state-of-the-art approaches in data-
driven decision making can be improved upon by considering prediction and prescription
jointly rather than separately. The current data-rich age has made such approaches possible,
and opened up exciting avenues in diverse application domains such as healthcare and retail.
For instance, personalizing treatment choices for each patient depending on his/her features is
a problem of tremendous interest to healthcare providers. This thesis considers such problems
from a broad perspective, and develops approaches that heavily rely on optimization methods
and demonstrates the merits of such an approach.
In Chapters 2 and 3, we consider problems in prescriptive analytics, where the goal is
not just to predict uncertain quantities (demand, returns) but to make good decisions. We demonstrate
that jointly considering prediction and prescription can typically lead to better decisions and
outcomes. In particular, with the prevalence of observational data, we demonstrate that such
approaches can be directly used in various applications and add significant value.
In Chapter 4, we consider the classical technique of scenario reduction for stochastic
optimization, which relies on approximating the empirical distribution with a few scenarios
to improve tractability for large n. We demonstrate that taking into account the decision
quality of scenarios can result in better distributions tailored for decision-making, compared
to standard clustering based approaches with the Wasserstein distance. Crucially, achieving
higher quality decisions with fewer scenarios can significantly improve interpretability, which
is desirable for practitioners.
Finally, in Chapter 5, we apply modern optimization-based techniques for solving shape
constrained regression problems. We consider the problem of selecting a small subset of
features that leads to least error, and develop primal and dual approaches for solving this
problem. With the aid of computational examples on real and synthetic data, we demonstrate
that our techniques lead to improved tractability, high accuracy and low false positive rates.
In conclusion, this area of data-driven decision making lies at the intersection of opti-
mization and learning, and is a particularly exciting avenue for research. These four chapters
serve to illuminate just a few of the areas where optimization and data analytics can yield an
edge over current practice – in terms of interpretability, sparsity, tractability, and decision
quality – in a wide variety of applications.
Appendix A
Supplement for Chapter 2
A.1 Optimization Algorithms for Joint Predictive and
Prescriptive Analytics
Algorithm 5 Random forests algorithm
Input: Training data S = {(X^i, Y^i)}, i = 1, ..., n; parameters n_min, Δ_max, K, α, µ.
Output: Random forest {τ^1, ..., τ^K}.
1: procedure ComputeRandomForest
2:   for 1 ≤ t ≤ K do
3:     Sample S^(t), a collection of n points drawn with replacement from S.
4:     Sample ⌊α d_x⌋ features from the full d_x features to form the set S^(t)_α.
5:     Compute the t-th tree, τ^t = GreedyTree(S^(t)_α, 0).
6:   end for
7:   return Random forest {τ^1, ..., τ^K}.
8: end procedure
Before presenting the local search algorithms, we first define some notation used in
describing them.
Algorithm 4 Greedy recursive algorithm for training prescriptive trees
Input: Data S = {(X^i, Y^i)}, i = 1, ..., n_0; current depth Δ; tuning parameters n_min, Δ_max, µ.
Output: Greedy prescriptive tree τ.
1: procedure GreedyTree(S, Δ)
2:   Solve the empirical SAA problem over S (Problem (2.2)) to obtain z*(S)
3:   Set τ(x) = z*(S)
4:   Also, store the optimal objective value of Problem (2.2) as c_z(S)
5:   Compute the predictive error, c_y(S) = ∑_{i∈S} ∥Y^i − (1/n_0) ∑_{j∈S} Y^j∥²
6:   Net split cost, c(S) = µ n_0 c_z(S) + (1 − µ) c_y(S)
7:   Set success ← 0, and c_min ← c(S).
8:   if Δ < Δ_max and n_0 ≥ 2 n_min then
9:     for each 1 ≤ p ≤ d_x do
10:      Sort the covariate values {X^i_p : i ∈ S} of feature p in non-decreasing order
11:      Obtain the set of k_p unique values for each p as π^p_1 < ... < π^p_{k_p}
12:    end for
13:    for 1 ≤ p ≤ d_x with k_p ≥ 2 do
14:      for 1 ≤ k < k_p do
15:        S_L = {x ∈ S : x_p ≤ (π^p_k + π^p_{k+1})/2}; let n_L = |S_L|.
16:        S_R = {x ∈ S : x_p > (π^p_k + π^p_{k+1})/2}; let n_R = |S_R|.
17:        if n_min ≤ n_L and n_min ≤ n_R then
18:          Solve SAA problems over S_L and S_R to obtain z*(S_L), z*(S_R) respectively
19:          Also, compute the respective prescriptive costs as c_z(S_L), c_z(S_R)
20:          Next, compute the predictive costs as follows:
21:            c_y(S_L) = ∑_{i∈S_L} ∥Y^i − (1/n_L) ∑_{j∈S_L} Y^j∥²
22:            c_y(S_R) = ∑_{i∈S_R} ∥Y^i − (1/n_R) ∑_{j∈S_R} Y^j∥²
23:          New cost, c ← µ(n_L c_z(S_L) + n_R c_z(S_R)) + (1 − µ)(c_y(S_L) + c_y(S_R))
24:          if c < c_min then
25:            Set p* = p and s* = (π^p_k + π^p_{k+1})/2
26:            Update c_min ← c and success ← 1
27:          end if
28:        end if
29:      end for
30:    end for
31:    if success == 1 then
32:      S_Left = {x ∈ S : x_{p*} ≤ s*}
33:      S_Right = {x ∈ S : x_{p*} > s*}
34:      τ_Left = GreedyTree(S_Left, Δ + 1)
35:      τ_Right = GreedyTree(S_Right, Δ + 1)
36:      τ(x) = 1(x_{p*} ≤ s*) τ_Left(x) + 1(x_{p*} > s*) τ_Right(x).
37:    end if
38:  end if
39:  return τ.
40: end procedure
• τt denotes the subtree whose root is the tth node of a tree τ , and nodes(τ) refers to
the set of all nodes of the tree τ .
• For any index set I, X_I, Y_I denote the subsets of the training data X, Y corresponding
to I.
• Shuffle(I) returns a randomized order of the index set I.
• For any subtree τ , Lµ(τ) denotes the combined objective (2.15) of the subtree evaluated
on the training set for a given value of µ.
• Also, minleafsize(τ) of any subtree τ denotes the minimum number of samples in a
leaf belonging to τ .
• Finally, we refer to the two descendants of a non-leaf node τ as left and right child
respectively, denoted by τL and τR.
Algorithm 6 Local search algorithm for training optimal prescriptive trees
Input: Data S = {(X^i, Y^i)}, i = 1, ..., n; initial prescriptive tree τ.
Output: Locally optimal prescriptive tree τ.
1: procedure CoordinateDescentTree(τ)
2:   repeat
3:     c_prev ← L_µ(τ)
4:     for all t ∈ Shuffle(nodes(τ)) do
5:       I ← {i : X^i is assigned by τ to a leaf contained in subtree τ_t}
6:       τ_t ← OptimizeNode(τ_t, X_I, Y_I)
7:       Update τ by replacing the t-th node with τ_t
8:       Update c_current ← L_µ(τ)
9:     end for
10:  until c_prev = c_current    ▷ Local optimality.
11:  return τ.
12: end procedure
Algorithm 7 Optimizing a node
Input: Training data S = {(X^i, Y^i)}, i = 1, ..., n; subtree τ to optimize.
Output: Optimized subtree τ.
1: procedure OptimizeNode(τ)
2:   if τ is a branch then
3:     τ(1), error(1) ← PerturbSplit(τ_L, τ_R)
4:     error(2) ← L_µ(τ_L), with τ(2) ← τ_L
5:     error(3) ← L_µ(τ_R), with τ(3) ← τ_R
6:     Update c_new ← error(j*) and τ_new ← τ(j*), where j* = argmin_{1≤j≤3} error(j)
7:   else    ▷ τ is a leaf node
8:     Create a new split with children τ_L, τ_R
9:     τ_new, c_new ← PerturbSplit(τ_L, τ_R) to obtain the new error and subtree
10:  end if
11:  if c_new < L_µ(τ) then
12:    Update τ ← τ_new
13:  end if
14:  return τ.
15: end procedure
Algorithm 8 Perturbing a split
Input: Left and right subtrees τ_L and τ_R to use as children of the new split, and S, the subset
of training data that falls into the leaves of these two subtrees.
Output: Subtree τ with the best axis-parallel split at the root, and its corresponding loss.
1: procedure PerturbSplit(τ_L, τ_R)
2:   Initialize error* ← ∞
3:   for p ∈ [d_x] do
4:     Sort the values {X^j_p : j ∈ S} in non-decreasing order
5:     Obtain the set of k_p unique values of {X^j_p : j ∈ S} for each p as π^p_1 < ... < π^p_{k_p}
6:     for 1 ≤ k < k_p do
7:       Split value, γ = (π^p_k + π^p_{k+1})/2
8:       τ ← branch node (x_p ≤ γ) with left and right children τ_L, τ_R
9:       if minleafsize(τ) > n_min then    ▷ Split feasible
10:        if L_µ(τ) < error* then    ▷ Improvement
11:          error* ← L_µ(τ)
12:          τ* ← τ
13:        end if
14:      end if
15:    end for
16:  end for
17:  return τ*, error*
18: end procedure
Algorithm 9 Complete tree algorithm
Input: Training data S = {(X^i, Y^i)}, i = 1, ..., n; parameters n_min, Δ_max, K, f_1, µ.
Output: Optimized prescriptive tree τ*.
1: procedure ComputeTree
2:   for 1 ≤ j ≤ K do
3:     τ^j_G = GreedyTree(S, 0)
4:     We make the following modifications of Algorithm 4:
5:       Sort the splits in non-increasing order of prediction error
6:       Compute the prescriptive costs of only the top f_1% of the splits
7:       Save the values c_z(S), z*(S) for the various index sets S
8:     τ^j_L = CoordinateDescentTree(τ^j_G)
9:     We make the following modifications of Algorithm 6:
10:      For each candidate split S, find the best z among the stored z*(S)
11:      Use this z* as the starting point for the first order methods
12:  end for
13:  Select the best tree τ* = τ^j_L, where j ∈ argmin_{1≤k≤K} L_µ(τ^k_L)
14:  return τ*.
15: end procedure
A.1.1 First order convex methods for local search procedure
In the following, we present in greater detail the two first order convex optimization algorithms we use in
the local search procedure. Recall that both Algorithms 4 and 6 rely on
evaluating the quality of various splits, which involves solving an empirical SAA problem over
different index sets. The core idea behind these methods is that we already have access
to good-quality solutions, which we iteratively improve to re-compute the optimal
solution for a new split.
In the part that follows, we assume we have access to an initial solution z^(0), and the opti-
mization problem we solve is min_{z∈Z} f(z). In the case of trees, f is the empirical sum of the
costs c(z; y) over all the values of y in a leaf.
A. Projected subgradient descent: In this algorithm, we compute the subgradient at
each candidate solution z, with the update step as
z^(k+1) = P_Z(z^(k) − α_k g(z^(k))),   (A.1)

where

• α_k is the step size at the k-th iteration,

• g(z^(k)) is a subgradient of f evaluated at z^(k), and

• P_Z(u) is the projection of u onto the set Z, given by the solution to the following convex
minimization problem:

P_Z(u) = argmin_{z∈Z} (1/2)∥u − z∥₂².   (A.2)
Since this is not necessarily a descent method when subgradients are used, we keep track
of the best iterate obtained during prior iterations at each step.
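As a concrete illustration, the projected subgradient scheme above can be sketched as follows. This is a toy instance of our own choosing (an ℓ₁ objective over a box, with a 1/√k diminishing step size), not the thesis implementation:

```python
import numpy as np

def projected_subgradient(subgrad, project, z0, steps=200):
    """Minimize a convex (possibly nonsmooth) f over Z with update (A.1),
    tracking the best iterate since subgradient steps need not descend."""
    z = z0.copy()
    best_z, best_val = z.copy(), np.inf
    for k in range(1, steps + 1):
        g, fval = subgrad(z)              # subgradient and function value at z
        if fval < best_val:
            best_z, best_val = z.copy(), fval
        alpha = 1.0 / np.sqrt(k)          # diminishing step size
        z = project(z - alpha * g)        # update step (A.1)
    return best_z, best_val

# Toy instance: minimize f(z) = ||z - c||_1 over the box Z = [0, 1]^2.
c = np.array([2.0, -1.0])
subgrad = lambda z: (np.sign(z - c), np.abs(z - c).sum())
project = lambda z: np.clip(z, 0.0, 1.0)  # projection (A.2) onto the box
z_best, f_best = projected_subgradient(subgrad, project, np.array([0.5, 0.5]))
# the minimizer over the box is z = (1, 0)
```

Tracking the best iterate matters here: the raw iterate can oscillate around the nonsmooth optimum, while the tracked value is monotone.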
B. Algorithms on the smoothed function: An alternative approach we present in this
section is to smooth the potentially nonsmooth function f , and then apply iterative algo-
rithms to find the optimal solution. For instance, when f can be represented as maxu u′Az,
then we define the smoothed approximation fδ of f as
fδ(z) =maxu∈Q
u′Az − δ
2∥u∥2, (A.3)
for some δ > 0 that controls the quality of the approximation. The set Q is a closed bounded
convex set, which denotes the dual space of the primal feasible set Z. The gradient of fδ is
given by A′u∗δ(z), where u∗δ(z) is the optimal solution to the maximization problem (A.3)
for any given z. For more details, we refer the reader to Nesterov [2005].
When the projection problem (A.2) onto Z is efficiently solvable, a projected
gradient algorithm of the form (A.1) can be used. An alternative approach is to use a Frank-
Wolfe type algorithm, where the iterates are given by solving sequential linear optimization
problems:

z̄ ∈ argmin_{z∈Z} ∇f_δ(z^(k))′(z − z^(k)),
z^(k+1) = α_k z^(k) + (1 − α_k) z̄.   (A.4)

Here, α_k is chosen between zero and one. This algorithm relies on being able to efficiently
minimize linear functions over the constraint set Z. For more details, we refer the reader
to Jaggi [2013] and the references therein.
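A minimal sketch of the Frank-Wolfe iteration (A.4), again on a toy instance of our choosing (a quadratic objective over the probability simplex, with the standard 2/(k+2) schedule):

```python
import numpy as np

def frank_wolfe(grad, lmo, z0, steps=2000):
    """Frank-Wolfe iteration (A.4): each step calls a linear minimization
    oracle over Z instead of computing a projection."""
    z = z0.copy()
    for k in range(steps):
        s = lmo(grad(z))              # zbar: minimizer of the linearization over Z
        gamma = 2.0 / (k + 2.0)       # corresponds to alpha_k = 1 - gamma in (A.4)
        z = (1.0 - gamma) * z + gamma * s
    return z

# Toy instance: minimize f(z) = 0.5 ||z - c||^2 over the probability simplex.
c = np.array([0.2, 0.5, 0.3])
grad = lambda z: z - c

def lmo(g):
    """Linear minimization over the simplex: the vertex with the smallest gradient."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

z = frank_wolfe(grad, lmo, np.ones(3) / 3.0)
# z converges toward c, which lies inside the simplex
```

The linear minimization oracle over the simplex is a single argmin, which is why Frank-Wolfe is attractive when projections onto Z are expensive.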
A.2 Proofs
To begin, we prove the following lemma.
Lemma 1. Suppose Assumptions 1-5 hold. If (x, z) and (x, z′) are in the same partition of
X × Z, as specified by Assumption 3, then

∣Ψ(z, δ) − Ψ(z′, δ)∣ ≤ ( α(LD + 1 + √(2λ_max ln(1/δ))) + L(√(2 ln(1/δ)) + 3) ) ∥z − z′∥,

where Ψ(z, δ) = f(x, z) − f̂(x, z) − (2/(3γ_n)) ln(1/δ) − √(2V(x, z) ln(1/δ)) − L·B(x, z).

Proof. We first note that ∣f(x, z) − f(x, z′)∣ ≤ L∥z − z′∥ by the Lipschitz assumption on c(z; y).
Next, since (x, z) and (x, z′) are contained in the same partition,

∣f̂(x, z) − f̂(x, z′)∣ = ∣ ∑_i w_i(x, z) c(z; Y^i) − ∑_i w_i(x, z′) c(z′; Y^i) ∣
  ≤ ∣ ∑_i w_i(x, z) c(z; Y^i) − ∑_i w_i(x, z) c(z′; Y^i) ∣
    + ∣ ∑_i w_i(x, z) c(z′; Y^i) − ∑_i w_i(x, z′) c(z′; Y^i) ∣
  ≤ L∥z − z′∥ + ∥w(x, z) − w(x, z′)∥₁
  ≤ (L + α)∥z − z′∥,

where we have used Holder's inequality, the uniform bound on c, and Assumption 3.
Similarly, for the bias term,

∣L·B(x, z) − L·B(x, z′)∣ = L ∣ ∑_i w_i(x, z)∥(X^i, Z^i) − (x, z)∥ − ∑_i w_i(x, z′)∥(X^i, Z^i) − (x, z′)∥ ∣
  ≤ L ∣ ∑_i w_i(x, z)∥(X^i, Z^i) − (x, z)∥ − ∑_i w_i(x, z)∥(X^i, Z^i) − (x, z′)∥ ∣
    + L ∣ ∑_i w_i(x, z)∥(X^i, Z^i) − (x, z′)∥ − ∑_i w_i(x, z′)∥(X^i, Z^i) − (x, z′)∥ ∣
  ≤ L ∑_i w_i(x, z) ∣ ∥(X^i, Z^i) − (x, z)∥ − ∥(X^i, Z^i) − (x, z′)∥ ∣
    + L ∥w(x, z) − w(x, z′)∥₁ sup_i ∥(X^i, Z^i) − (x, z)∥
  ≤ (L + LαD)∥z − z′∥.
Next, we consider the variance term. Let Σ(z) denote the diagonal matrix with entries
Var(c(z; Y^i) ∣ X^i, Z^i) for i = 1, ..., n. As before,
∣√V(x, z) − √V(x, z′)∣
  = ∣ √(∑_i w_i²(x, z) Var(c(z; Y^i) ∣ X^i, Z^i)) − √(∑_i w_i²(x, z′) Var(c(z′; Y^i) ∣ X^i, Z^i)) ∣
  ≤ ∣ √(∑_i w_i²(x, z) Var(c(z; Y^i) ∣ X^i, Z^i)) − √(∑_i w_i²(x, z) Var(c(z′; Y^i) ∣ X^i, Z^i)) ∣
    + ∣ √(∑_i w_i²(x, z) Var(c(z′; Y^i) ∣ X^i, Z^i)) − √(∑_i w_i²(x, z′) Var(c(z′; Y^i) ∣ X^i, Z^i)) ∣
  = ∣ √(w(x, z)ᵀ Σ(z) w(x, z)) − √(w(x, z)ᵀ Σ(z′) w(x, z)) ∣
    + ∣ ∥w(x, z)∥_{Σ(z′)} − ∥w(x, z′)∥_{Σ(z′)} ∣,
where ∥v∥_Σ = √(vᵀΣv). One can verify that, because Σ is positive semidefinite, ∥·∥_Σ is a
seminorm that satisfies the triangle inequality. Therefore, we can upper bound the latter
term by

√( (w(x, z) − w(x, z′))ᵀ Σ(z′) (w(x, z) − w(x, z′)) ) ≤ ∥w(x, z) − w(x, z′)∥
  ≤ ∥w(x, z) − w(x, z′)∥₁
  ≤ α∥z − z′∥,

where we have used the assumption that ∣c(z; y)∣ ≤ 1.
The former term can again be upper bounded by the triangle inequality:

∣ √(∑_i w_i²(x, z) Var(c(z; Y^i) ∣ X^i, Z^i)) − √(∑_i w_i²(x, z) Var(c(z′; Y^i) ∣ X^i, Z^i)) ∣
  ≤ √( ∑_i w_i²(x, z) ( √Var(c(z; Y^i) ∣ X^i, Z^i) − √Var(c(z′; Y^i) ∣ X^i, Z^i) )² ).   (A.5)
Noting that √Var(c(z; Y^i)) = ∥c(z; Y^i) − E[c(z; Y^i)]∥_{L²} (dropping the conditioning for notational
convenience), we can apply the triangle inequality to the L² norm:

( ∥c(z; Y^i) − E[c(z; Y^i)]∥_{L²} − ∥c(z′; Y^i) − E[c(z′; Y^i)]∥_{L²} )²
  ≤ ∥c(z; Y^i) − c(z′; Y^i) − E[c(z; Y^i) − c(z′; Y^i)]∥²_{L²}
  ≤ E[(c(z; Y^i) − c(z′; Y^i))²]
  ≤ L²∥z − z′∥².
Therefore, we can upper bound (A.5) by
√∑i
w2i (x, z)L2∣∣z − z′∣∣2
≤∑i
wi(x, z)L∣∣z − z′∣∣ = L∣∣z − z′∣∣,
where we have used the concavity of the square root function. Therefore,
∣√V (x, z) −
√V (x, z′)∣ ≤ (α +L)∣∣z − z′∣∣.
Combining the three results with the triangle inequality yields the desired result.
Proof of Theorem 1. To derive a regret bound, we first restrict our attention to the fixed
design setting. Here, we condition on X^1, Z^1, ..., X^n, Z^n and bound f̂(x, z) around its
expectation. To simplify notation, we write X to denote (X^1, ..., X^n) and Z to denote
(Z^1, ..., Z^n). Note that by the honesty assumption, in this setting, f̂ is a simple sum of
independent random variables. Applying Bernstein's inequality (see, for example, Boucheron
et al. [2013]), we have, for δ ∈ (0, 1),

P( E[f̂(x, z) ∣ X, Z] − f̂(x, z) ≤ (2/(3γ_n)) ln(1/δ) + √(2V(x, z) ln(1/δ)) ∣ X, Z ) ≥ 1 − δ.
Next, we need to bound the difference between E[f̂(x, z) ∣ X, Z] and f(x, z). By the honesty
assumption, Jensen's inequality, and the Lipschitz assumption, we have

∣E[f̂(x, z) ∣ X, Z] − f(x, z)∣ = ∣ ∑_i w_i(x, z)(f(X^i, Z^i) − f(x, z)) ∣
  ≤ ∑_i w_i(x, z) ∣f(X^i, Z^i) − f(x, z)∣
  ≤ L ∑_i w_i(x, z) ∥(X^i, Z^i) − (x, z)∥
  = L·B(x, z).
Combining this with the previous result, we have, with probability at least 1 − δ (conditioned
on X and Z),

f(x, z) − f̂(x, z) ≤ (2/(3γ_n)) ln(1/δ) + √(2V(x, z) ln(1/δ)) + L·B(x, z).   (A.6)
Next, we extend this result to hold uniformly over all z ∈ Z. To do so, we partition X × Z
into Γ_n regions as in Assumption 3. For each region, we construct a ν-net. Therefore, we have
a set z_1, ..., z_{K_n} such that for any z ∈ Z, there exists a z_k such that (x, z) and (x, z_k) are
contained in the same region with ∥z − z_k∥ ≤ ν. For ease of notation, let k : Z → {1, ..., K_n}
return an index that satisfies these criteria. By assumption, Z ⊂ R^{d_z} has finite diameter
D, so we can construct this set with K_n ≤ Γ_n (3D/ν)^{d_z} (e.g., Shalev-Shwartz and Ben-David
[2014, pg. 337]).
By Lemma 1 (and using the notation therein), we have

Ψ(z, δ) ≤ Ψ(z_{k(z)}, δ) + ν( α(LD + 1 + √(2 ln(1/δ))) + L(√(2 ln(1/δ)) + 3) ).
Taking the supremum over z on both sides, we get

sup_z Ψ(z, δ) ≤ max_k Ψ(z_k, δ) + ν( α(LD + 1 + √(2 ln(1/δ))) + L(√(2 ln(1/δ)) + 3) ).
If we let ν = (1/(3γ_n)) ( α(LD + 1 + √2) + L(√2 + 3) )^(−1), we have

P( sup_z Ψ(z, δ) > 0 ∣ X, Z )
  ≤ P( max_k Ψ(z_k, δ) + ν( α(LD + 1 + √(2 ln(1/δ))) + L(√(2 ln(1/δ)) + 3) ) > 0 ∣ X, Z )
  ≤ P( max_k Ψ(z_k, δ) + ν( α(LD + 1 + √2) + L(√2 + 3) ) ln(1/δ) > 0 ∣ X, Z )
  ≤ ∑_k P( Ψ(z_k, δ) + ln(1/δ)/(3γ_n) > 0 ∣ X, Z )
  ≤ ∑_k P( Ψ(z_k, √δ) > 0 ∣ X, Z )
  ≤ K_n √δ,
where we have used the union bound and (A.6). Replacing δ with δ²/K_n² and integrating
both sides to remove the conditioning completes the proof.
Proof of Theorem 2. By Theorem 1, with probability at least 1 − δ/2,

f(x, ẑ) ≤ f̂(x, ẑ) + (4/(3γ_n)) ln(2K_n/δ) + λ₁√V(x, ẑ) + λ₂ B(x, ẑ)
  ≤ f̂(x, z*) + (4/(3γ_n)) ln(2K_n/δ) + λ₁√V(x, z*) + λ₂ B(x, z*),

where the second inequality follows from the definition of ẑ. Using the same argument we
used to derive (A.6), since z* is not a random quantity, we have, with probability at least
1 − δ/2,

f̂(x, z*) − f(x, z*) ≤ (2/(3γ_n)) ln(2/δ) + √(2V(x, z*) ln(2/δ)) + L·B(x, z*)
  ≤ (2/(3γ_n)) ln(2K_n/δ) + λ₁√V(x, z*) + λ₂ B(x, z*).

Combining the two inequalities with the union bound yields the desired result.
Proof of Corollary 1. We show f(x, ẑ) − 2L·B(x, z*) →_p f(x, z*). The desired result then follows
from the assumption regarding B(x, z*) and Slutsky's theorem. First, we note that, due to the
assumption ∣c(z; y)∣ ≤ 1,

V(x, z*) = ∑_i w_i²(x, z*) Var(c(z*; Y^i) ∣ X^i, Z^i) ≤ (1/γ_n) ∑_i w_i(x, z*) = 1/γ_n.
We have, for any ε > 0,

P( ∣f(x, ẑ) − 2L·B(x, z*) − f(x, z*)∣ > ε )
  ≤ P( f(x, ẑ) − 2L·B(x, z*) − f(x, z*) > ε/2 )
  + P( f(x, z*) − f(x, ẑ) + 2L·B(x, z*) > ε/2 ).
By Theorem 2, for large enough n, the first term is upper bounded by

2K_n exp( −ε² / (4(2/γ_n + 4√V(x, z*))²) )
  ≤ 2K_n exp( −ε² / (4(2/√γ_n + 4/√γ_n)²) )
  = 2Γ_n ( 9Dγ_n( α(LD + 1 + √2) + L(√2 + 3) ) )^{d_z} exp( −γ_n ε²/144 )
  ≤ C₁ n^{1+β} exp(−C₂ n^β) → 0.

Because f(x, z*) ≤ f(x, ẑ), the latter term is upper bounded by

P( B(x, z*) > ε/(4L) ) → 0.
Proof of Example 1. First, we consider the case in which the zero-variance action has cost 0 and
the other actions have cost 1 (call this event A). Because the cost of the optimal action is
0 and the cost of a suboptimal action is 1, the expected regret in this problem equals the
probability that the algorithm selects a suboptimal action. Noting that f̂(j) ∼ N(1, 1/m)
for j = 1, ..., m, we can express the expected regret of the µ = 0 algorithm as

E[R₀ ∣ A] = P( f̂(j) < 0 for some j ∈ {1, ..., m} ∣ A ) = P( max_j W_j > √m ),
where W₁, ..., W_m are i.i.d. standard normal random variables. Similarly, the expected
regret of the µ > 0 algorithm can be expressed as

E[R_µ ∣ A] = P( f̂(j) < −λ√(ln m)/√m for some j ∈ {1, ..., m} ∣ A )
  = P( max_j W_j > √m + λ√(ln m) ).
We can construct an upper bound on E[R_µ ∣ A] with the union bound and a concentration in-
equality (as in the proof of Theorem 1). Applying the Gaussian tail inequality (see, for
example, Vershynin [2016, Proposition 2.1.2]), we have

E[R_µ ∣ A] ≤ m P( W₁ > √m + λ√(ln m) )
  ≤ (√m/√(2π)) exp( −(1/2)(√m + λ√(ln m))² )
  = ( √m / (m^{λ²/2} √(2π)) ) exp(−m/2) exp(−λ√(m ln m))
  ≤ (1/(√m √(2π))) e^{−m/2},

where we have used the assumption λ ≥ √2.
To lower bound the expected regret of the µ = 0 algorithm, we can use a similar Gaussian
tail inequality:

E[R₀ ∣ A] = 1 − [1 − P(W₁ > √m)]^m
  ≥ 1 − [1 − (1 − 1/m)(1/(√m √(2π))) e^{−m/2}]^m
  ≥ 1 − [1 − (1/(2√m √(2π))) e^{−m/2}]^m
  ≥ 1 − [ [1 − (1/(2√m √(2π))) e^{−m/2}]^{2√(2πm) exp(m/2)} ]^{√m exp(−m/2)/(2√(2π))},

where the second inequality is valid for all m ≥ 2. One can verify that (1 − 1/n)^n is a
monotonically increasing function that converges to e^{−1}. Therefore, for all m ≥ 2,

E[R₀ ∣ A] ≥ 1 − exp( −(√m/(2√(2π))) exp(−m/2) ).
Next, we use these bounds to compute the ratio E[R_µ ∣ A]/E[R₀ ∣ A] in the limit as m → ∞:

E[R_µ ∣ A]/E[R₀ ∣ A] ≤ [ (1/(√m √(2π))) e^{−m/2} ] / [ 1 − exp( −(√m/(2√(2π))) e^{−m/2} ) ].

Applying L'Hopital's rule, the limit of the right-hand side is equal to the limit of

[ 2(2π)^{−1/2}( −m^{−3/2}e^{−m/2} − m^{−1/2}e^{−m/2} ) ] / [ (2π)^{−1/2}( m^{−1/2}e^{−m/2} − m^{1/2}e^{−m/2} ) exp( −(√m/(2√(2π))) e^{−m/2} ) ]
  = 2 · (−1 − m)/(m − m²) · exp( (√m/(2√(2π))) e^{−m/2} ) → 0.
Next, we consider the case in which the zero-variance action has cost 1 and the other actions
have cost 0. The expected regret equals the probability that the zero-variance action is
selected. For sufficiently large m,

E[R_µ ∣ A^c] = P( f̂(j) > 1 − λ√(ln m)/√m ∀ j ∈ {1, ..., m} ∣ A^c )
  ≤ P( W₁ > √m − λ√(ln m) )^m
  ≤ P( W₁ > √m/2 )^m
  ≤ ( (2/√(2π)) e^{−m/8} )^m
  ≤ e^{−m²/8} = o(E[R_µ ∣ A]).
Therefore, for sufficiently large m and some constant C,

E[R_µ]/E[R₀] = ( E[R_µ ∣ A] + E[R_µ ∣ A^c] ) / ( E[R₀ ∣ A] + E[R₀ ∣ A^c] )
  ≤ ( E[R_µ ∣ A] + E[R_µ ∣ A^c] ) / E[R₀ ∣ A]
  ≤ (1 + C) E[R_µ ∣ A] / E[R₀ ∣ A] → 0.
A.3 Optimization with Linear Predictive Models
Here, we detail the optimization of (2) with linear predictive models. We focus on the case
that c(z; Y) = Y for simplicity. For these models, we posit that the outcome is a linear function
of the auxiliary covariates and the decision. That is, there exists a β such that, given X = x,
Y(z) = (x, z)ᵀβ + ε, where ε is a mean-zero subgaussian noise term with variance σ². If we let
A denote the design matrix for the problem, the matrix with rows (X^i, Z^i) for
i = 1, ..., n, then the ordinary least squares (OLS) estimator of β is given by

β̂_OLS = (AᵀA)⁻¹AᵀY.
The ordinary least squares estimator is unbiased, so when solving (2), we set λ₂ = 0. The vari-
ance of (x, z)ᵀβ̂_OLS is given by σ²(x, z)ᵀ(AᵀA)⁻¹(x, z). Since (AᵀA)⁻¹ is a positive semidefinite
matrix, √V(x, z) is convex. Therefore, (2) becomes

min_{z∈Z} (x, z)ᵀβ̂_OLS + λ₁σ√( (x, z)ᵀ(AᵀA)⁻¹(x, z) ),

which is a second order conic optimization problem if Z is polyhedral and can be solved
efficiently by commercial solvers. Even if Z is a mixed integer set, commercial solvers such
as Gurobi [Gurobi] can still solve the problem for sizes of practical interest.
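To make the objective concrete, the following NumPy sketch builds the OLS estimator and the uncertainty-penalized objective on synthetic data; in place of a conic solver, it minimizes over a finite candidate grid for z (the data-generating model, λ₁, and the grid are our assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: Y = (x, z)' beta + noise, with scalar covariate and decision.
n = 200
X = rng.normal(size=n)
Z = rng.normal(size=n)
beta = np.array([1.0, -2.0])
sigma = 0.5
Y = X * beta[0] + Z * beta[1] + sigma * rng.normal(size=n)

A = np.column_stack([X, Z])                     # design matrix with rows (X^i, Z^i)
beta_ols = np.linalg.solve(A.T @ A, A.T @ Y)    # OLS estimator (A'A)^{-1} A'Y
AtA_inv = np.linalg.inv(A.T @ A)

def penalized_cost(x, z, lam1=2.0):
    """Predicted cost plus the uncertainty penalty lam1 * sigma * sqrt(v'(A'A)^{-1} v)."""
    v = np.array([x, z])
    return v @ beta_ols + lam1 * sigma * np.sqrt(v @ AtA_inv @ v)

# Discrete Z = {-1, ..., 1} standing in for the conic feasible set.
x0 = 1.0
candidates = np.linspace(-1.0, 1.0, 201)
z_star = min(candidates, key=lambda z: penalized_cost(x0, z))
# since the fitted coefficient on z is negative, z_star sits at the upper bound
```

With a polyhedral Z, the same objective would instead be passed to an SOCP solver, as the text describes.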
For regularized linear models such as ridge and lasso regression, we use a similar approach.
Although these estimators are biased, we set λ₂ = 0 for computational reasons. The ridge
estimator of β has a form similar to the OLS estimator:

β̂_Ridge = (AᵀA + αI)⁻¹AᵀY,

for some α ≥ 0. The resulting optimization problem is essentially the same as with the OLS
estimator. The lasso estimator does not have a closed form solution, but we can approximate
it as in Tibshirani [1996]:

P β̂_Lasso ≈ (P AᵀA Pᵀ + α P W Pᵀ)⁻¹ P AᵀY,

where W = diag(1/∣β*₁∣, ..., 1/∣β*_{d+p}∣), β* is the true lasso solution, and P is a projection
matrix that projects onto the nonzero components of β*. (The zero components of β* remain
0 in the approximation.) With this approximation, the resulting optimization problem takes the same
form as those for the OLS and ridge estimators.
A.4 Data Generation
A.4.1 Portfolio Optimization
In this example, we simulate the uncertainty as following a Normal distribution, given by

y(x) ∼ N( µ_y + 0.1(x₁ − 1000)·1₆ + 1000·x₂·1₆ + 10·log(x₃ + 1)·1₆, Σ_y ),

where 1₆ represents a vector of ones in R⁶, and the mean vector and covariance matrix µ_y, Σ_y
are given by

µ_y = (86.8625  71.6059  75.3759  97.6258  52.7854  84.8973)ᵀ,

Σ_y^{1/2} =
( 136.687      ∗         ∗         ∗         ∗         ∗
  8.79766   142.279      ∗         ∗         ∗         ∗
  16.1504    15.0637  122.613      ∗         ∗         ∗
  18.4944    15.6961   26.344   139.148      ∗         ∗
  3.41394    16.5922   14.8795   13.9914  151.732      ∗
  24.8156    18.7292   17.1574    6.36536  24.7703  144.672 ).

Finally, each of the three covariates is distributed as

x₁ ∼ N(1000, 50),
x₂ ∼ N(0.02, 0.01),
log(x₃) ∼ N(0, 1).
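The generative model above can be sketched in NumPy. Two details are our assumptions: the second Normal parameter is read as a standard deviation, and the printed lower triangle of Σ_y^{1/2} is treated as a lower-triangular square root of the covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_y = np.array([86.8625, 71.6059, 75.3759, 97.6258, 52.7854, 84.8973])
# Assumed lower-triangular square root of the covariance matrix Sigma_y.
S = np.array([
    [136.687,  0.0,     0.0,     0.0,     0.0,     0.0],
    [8.79766,  142.279, 0.0,     0.0,     0.0,     0.0],
    [16.1504,  15.0637, 122.613, 0.0,     0.0,     0.0],
    [18.4944,  15.6961, 26.344,  139.148, 0.0,     0.0],
    [3.41394,  16.5922, 14.8795, 13.9914, 151.732, 0.0],
    [24.8156,  18.7292, 17.1574, 6.36536, 24.7703, 144.672],
])

def sample(n):
    """Draw n (x, y) pairs following the portfolio generative model."""
    x1 = rng.normal(1000.0, 50.0, size=n)
    x2 = rng.normal(0.02, 0.01, size=n)
    x3 = np.exp(rng.normal(0.0, 1.0, size=n))       # log(x3) ~ N(0, 1)
    shift = 0.1 * (x1 - 1000.0) + 1000.0 * x2 + 10.0 * np.log(x3 + 1.0)
    mean = mu_y[None, :] + shift[:, None]            # same shift on all 6 assets
    y = mean + rng.normal(size=(n, 6)) @ S.T         # noise with covariance S S'
    return np.column_stack([x1, x2, x3]), y

X_cov, Y_ret = sample(1000)
```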
A.4.2 Newsvendor problem
In this example, for each week t, client r, and item i, we include as product features the past
demands of product i at client r at times t − 1, ..., t − 5, physical product characteristics
such as weight and number of pieces per unit, and indicator variables to encode whether
its name contains strings such as "whole grain", "fiber", "multigrain", "chocolate", "vanilla",
and "burrito". We also include client-specific information as features: aggregate demand
over all products sold at client r at times t − 1, ..., t − 5, and indicator variables to encode
the client type among the following categories: Walmart, Individual store, General
Market, Supermarket, Small franchise, or NA/Other. The final data set includes twenty-five
covariates.
A.4.3 Pricing
For our synthetic pricing example, we consider a store offering 5 products. We generate
auxiliary covariates X^i ∈ R², with each component drawn from a N(10, 1) distribution. We generate
historical prices, Z^i, from a Gaussian distribution,

Z^i ∼ N( ( 1    0
           1    0
           0    1
           0    1
           0.5  0.5 ) X^i, 100 I ).

We compute the expected demand for each product as:

µ₁ = 500 − (Z^i₁)²/10 − X^i₁·Z^i₁/10 − (X^i₁)²/10 − Z^i₂,
µ₂ = 500 − (Z^i₂)²/10 − X^i₁·Z^i₂/10 − (X^i₁)²/10 − Z^i₁,
µ₃ = 500 − (Z^i₃)²/10 − X^i₂·Z^i₃/10 − (X^i₂)²/10 + Z^i₁ + Z^i₂,
µ₄ = 500 − (Z^i₄)²/10 − X^i₂·Z^i₄/10 − (X^i₂)²/10 + Z^i₁ + Z^i₂,
µ₅ = 500 − (Z^i₅)²/10 − X^i₂·Z^i₅/20 − X^i₁·Z^i₅/20 − (X^i₂)²/10,

and generate Y^i from a N(µ, 2500 I) distribution. This example serves to simulate the
situation in which some products are complements and some are substitutes.
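The pricing model above can be sketched as follows (our reading of the model: the covariate is two-dimensional, and the Normal's second parameter is a variance, so 100 gives noise standard deviation 10 and 2500 gives 50):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixing matrix mapping the 2-dim covariate X^i to the 5-dim mean price.
M = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]])

def sample(n):
    """Generate covariates X, historical prices Z, and demands Y."""
    X = rng.normal(10.0, 1.0, size=(n, 2))
    Z = X @ M.T + rng.normal(0.0, 10.0, size=(n, 5))   # Z^i ~ N(M X^i, 100 I)
    mu = np.empty((n, 5))
    mu[:, 0] = 500 - Z[:, 0]**2 / 10 - X[:, 0] * Z[:, 0] / 10 - X[:, 0]**2 / 10 - Z[:, 1]
    mu[:, 1] = 500 - Z[:, 1]**2 / 10 - X[:, 0] * Z[:, 1] / 10 - X[:, 0]**2 / 10 - Z[:, 0]
    mu[:, 2] = 500 - Z[:, 2]**2 / 10 - X[:, 1] * Z[:, 2] / 10 - X[:, 1]**2 / 10 + Z[:, 0] + Z[:, 1]
    mu[:, 3] = 500 - Z[:, 3]**2 / 10 - X[:, 1] * Z[:, 3] / 10 - X[:, 1]**2 / 10 + Z[:, 0] + Z[:, 1]
    mu[:, 4] = 500 - Z[:, 4]**2 / 10 - X[:, 1] * Z[:, 4] / 20 - X[:, 0] * Z[:, 4] / 20 - X[:, 1]**2 / 10
    Y = mu + rng.normal(0.0, 50.0, size=(n, 5))        # Y^i ~ N(mu, 2500 I)
    return X, Z, Y

X, Z, Y = sample(500)
```

Products 1 and 2 enter each other's demand negatively (substitutes), while products 3 and 4 benefit from higher prices of 1 and 2 (complements).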
A.4.4 Warfarin Dosing
To simulate how physicians might assign Warfarin doses to patients, we compute a nor-
malized BMI for each patient (i.e., body mass divided by height squared, normalized by
the population standard deviation of BMI). For each patient, we then sample a dose (in
mg/week), Z^i, from

Z^i ∼ N(30 + 15·BMI_i, 64).

If Z^i is negative, we assign a dose drawn uniformly from [0, 20]. If the data does not contain
the patient's height and/or weight, we assign a dose drawn uniformly from [10, 50], a standard
range for Warfarin doses.

To simulate the response that a physician observes for a particular patient, we compute
the difference between the assigned dose and the true optimal dose for that patient, Z^{i*},
and add noise. We then cap the response so it is less than or equal to 40 in absolute value.
The reasoning behind this construction is that the INR measurement gives the physician
some idea of whether the assigned dose is too high or too low and whether it is close to
the optimal dose. However, if the dose is very far from optimal, then the information INR
provides is not very useful in determining the optimal dose (it is purely directional). The
response of patient i is given by

Y^i = { −40 if R^i < −40;  R^i if −40 ≤ R^i ≤ 40;  40 if R^i > 40 },

where R^i ∼ N(Z^i − Z^{i*}, 400).
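The dose assignment and capped response above can be sketched as follows (our reading: the second Normal parameter is a variance, so 64 gives standard deviation 8 and 400 gives 20; the optimal dose is an input here):

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_dose(bmi_normalized):
    """Sample a weekly dose around 30 + 15*BMI; negative draws fall back to Unif[0, 20]."""
    z = rng.normal(30.0 + 15.0 * bmi_normalized, 8.0)   # variance 64 => std 8
    return float(rng.uniform(0.0, 20.0)) if z < 0 else float(z)

def response(dose, optimal_dose):
    """Noisy signal of the dose error, capped at +/- 40 (directional beyond that)."""
    r = rng.normal(dose - optimal_dose, 20.0)           # variance 400 => std 20
    return float(np.clip(r, -40.0, 40.0))

y = response(assign_dose(0.5), optimal_dose=35.0)
```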
Appendix B
Supplement for Chapter 5
B.1 Heuristic for generating upper bounds
In this section, we explain the heuristics referenced in Section 3.3 in greater detail. Recall
the smoothed function given by

f_τ(θ, ξ) = max_{λ≥0} (1/2)∥y − θ∥² + λ′(A_Tθ + ∑_{i=1}^n B_{T,i}ξ_i) + ρ ∑_{i=1}^n ∥ξ_i∥₁ − (τ/2)∥λ∥₂².   (B.1)
The gradient of f_τ is Lipschitz continuous with parameter ℓ = λ_max(M′M)/τ [Nesterov,
2005]. The matrix M ∈ R^{m×(n+nd)}, where m is the number of rows of A_T (the number of
active equality constraints), is given by

M = [A_T  B_{T,1}  ...  B_{T,n}].
Now, the optimal λ*_τ can be computed as

λ*_τ = (1/τ)( A_Tθ + ∑_{i=1}^n B_{T,i}ξ_i )₊.   (B.2)
Let Θ be the combined vector of θ and ξ₁, ..., ξ_n. As mentioned previously, we minimize
the upper convex quadratic envelope, which is a majorizer of the smoothed function, given
by

g_τ(S, Θ) = (1/2)∥y − θ∥² + h_τ(Θ^(t)) + ⟨∇h_τ(Θ^(t)), Θ − Θ^(t)⟩ + (ℓ/2)∥Θ − Θ^(t)∥²,
where h_τ(Θ) is given by

h_τ(Θ) = ⟨λ*_τ, A_Tθ + ∑_{i=1}^n B_{T,i}ξ_i⟩ − (τ/2)∥λ*_τ∥₂²
       = (1/(2τ)) ∥( A_Tθ + ∑_{i=1}^n B_{T,i}ξ_i )₊∥₂².
Thus, the problem we now aim to solve is

min_{S,Θ}   (1/2)∥y − θ∥² + h_τ(Θ^(t)) + ⟨∇h_τ(Θ^(t)), Θ − Θ^(t)⟩ + (ℓ/2)∥Θ − Θ^(t)∥²
subject to  Supp(ξ_i) ⊆ S  ∀i,
            ∣S∣ ≤ k,
            Θ′ = [θ′, ξ′₁, ..., ξ′_n].   (B.3)
After substituting the expression for the gradient term and some algebra, the update scheme takes
the following form:

min_{S, θ, ξ₁,...,ξ_n}   ((ℓ + 1)/(2ℓ)) ∥θ − u∥² + (1/2) ∑_{j=1}^n ∥ξ_j − v_j∥²
subject to               Supp(ξ_i) ⊆ S  ∀i,
                         ∣S∣ ≤ k,   (B.4)

where the vectors u and v_j, 1 ≤ j ≤ n, are given by

u = (ℓ/(ℓ + 1)) θ^(t) − (1/(ℓ + 1)) (A′_T λ*_τ − y),
v_j = ξ_j^(t) − (1/ℓ) B′_{T,j} λ*_τ   ∀ 1 ≤ j ≤ n.
The solution to this problem is simply to set S to be the set of indices of the k largest elements
of

∑_{j=1}^n (v_j)₁², ..., ∑_{j=1}^n (v_j)_d².

We denote this complete procedure by

(θ^(t+1), ξ^(t+1)) = H^τ_k(θ^(t), ξ^(t)).
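The support-selection step of this operator can be sketched as follows (only the ξ part is shown; the θ update to u is omitted for brevity):

```python
import numpy as np

def select_support(V, k):
    """Given the n x d matrix V whose rows are v_1, ..., v_n, return the support
    S of the k largest column scores sum_j (v_j)_i^2, and the matrix xi obtained
    by keeping columns in S and zeroing out the rest."""
    scores = (V ** 2).sum(axis=0)              # one score per feature i
    S = np.sort(np.argsort(scores)[-k:])       # indices of the k largest scores
    xi = np.zeros_like(V)
    xi[:, S] = V[:, S]                         # projection onto Supp(xi_j) within S
    return S, xi

V = np.array([[3.0, 0.1, -2.0, 0.0],
              [1.0, 0.2,  2.0, 0.1]])
S, xi = select_support(V, 2)
# columns 0 and 2 carry the largest energy, so S = [0, 2]
```

Because the (B.4) objective is separable, restricting the support to S and copying v_j on S is exactly optimal, which is what makes this hard-thresholding step cheap.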
Thus, we now arrive at our algorithms. Algorithm 10 presents the iterative thresholding
heuristic applied to the function gτ .
Algorithm 10 Hard thresholding algorithm for warm starts
Input: (θ^(0), ξ^(0)), with active constraint set T^(0); tolerance TOL > 0; iteration limit MAX_ITER.
Output: An improved sparse feasible solution (θ*, ξ*₁, ..., ξ*_n) to Problem (29).
1: Set t ← 0.
2: while t ≤ MAX_ITER do
3:   Compute the dual vector λ using equation (B.2).
4:   Using this dual value, compute the vectors u, v₁, ..., v_n.
5:   Perform the thresholding (θ^(t+1), ξ^(t+1)) = H^τ_k(θ^(t), ξ^(t)).
6:   Terminate if g_τ(θ^(t), ξ^(t)) − g_τ(θ^(t+1), ξ^(t+1)) ≤ TOL.
7:   t ← t + 1
8: end while
Clearly, τ is a parameter that controls the degree of smoothness of the approximation, and as
a heuristic we decrease it iteratively. Algorithm 11 presents a scheme in which τ is reduced by
a factor of γ at each iteration, combined with Algorithm 10.

Algorithm 11 Varying τ for iterative hard thresholding on the smooth g_τ(·)
Input: θ^(0) ∈ R^n, ξ^(0) ∈ R^{n×d}, with active constraint set T^(0); threshold τ_MIN > 0.
Output: A sparse solution (θ*, ξ*₁, ..., ξ*_n) to Problem (29).
1: while τ > τ_MIN do
2:   Apply Algorithm 10 to the smooth function g_τ, starting from (θ₀, ξ₀). Let (θ*_τ, ξ*_τ) be the
     limiting solution.
3:   Decrease τ ← γτ, for some damping factor 0 < γ < 1.
4:   Return to Step 2 with (θ₀, ξ₀) = (θ*_τ, ξ*_τ).
5: end while
B.1.1 Heuristics for norm bounded subgradients
In this section, we emphasize that this heuristic can also be adapted to generate fast feasible
solutions for the sparse regression problem with norm-bounded subgradients as well. To
be precise, the full problem we consider is as follows:

min_{θ, {ξ_i}_{i=1}^n, S}   ∑_{i=1}^n (y_i − θ_i)²
subject to   θ_i + ξ_iᵀ(x_j − x_i) ≤ θ_j  ∀ i, j,
             Supp(ξ_i) ⊆ S  ∀ i,
             ∥ξ_i∥ ≤ L  ∀ i,
             θ ∈ R^n,  ξ_i ∈ R^d  ∀ i,
             ∣S∣ ≤ k,  S ⊆ {1, ..., n}.   (B.5)
Note that Nesterov smoothing [Nesterov, 2005] cannot be directly applied to this problem,
due to the conic constraints. We resort to conic duality and, using ideas from Becker et al.
[2011] together with Nesterov smoothing, obtain the following smoothed objective function:

f_τ(θ, ξ) = max_{λ≥0, µ₁,...,µ_n}  (1/2)∥y − θ∥² + λ′(A_Tθ + ∑_{i=1}^n B_{T,i}ξ_i)
            + ∑_{j=1}^n (µ′_jξ_j − L∥µ_j∥_*) − (τ/2)( ∥λ∥₂² + ∑_{j=1}^n ∥µ_j∥₂² ),

where λ, µ₁, ..., µ_n are the dual variables, and ∥·∥_* is the dual norm of ∥·∥. To compute µ*_{j,τ}
for any 1 ≤ j ≤ n, we need to solve the following problem efficiently:

µ*_{j,τ} ∈ argmin_{µ_j}  −µ′_jξ_j + (τ/2)∥µ_j∥₂² + L∥µ_j∥_*.
`2 norm bound
When the constraints are on the ℓ₂ norm of ξ_i, since the ℓ₂ norm is dual to itself, we need to
solve the problem given by

µ*_{j,τ} ∈ argmin_{µ_j}  −µ′_jξ_j + (τ/2)∥µ_j∥₂² + L∥µ_j∥₂.

The solution to this problem can indeed be computed analytically as

µ*_{j,τ} = Shrink( (1/τ)ξ_j, L/τ ),

where Shrink is the ℓ₂ shrinkage operation given by

Shrink(u, γ) = max{1 − γ/∥u∥₂, 0}·u = { 0 if ∥u∥₂ ≤ γ;  (1 − γ/∥u∥₂)·u otherwise. }
The expression for λ*_τ and the rest of the iterative hard thresholding algorithm follow
the same steps as before.
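The ℓ₂ shrinkage operation is a one-liner; a direct transcription:

```python
import numpy as np

def shrink(u, gamma):
    """l2 shrinkage: scale u toward the origin, returning 0 when ||u||_2 <= gamma."""
    norm = np.linalg.norm(u)
    if norm <= gamma:
        return np.zeros_like(u)
    return (1.0 - gamma / norm) * u

# mu*_{j,tau} = shrink(xi_j / tau, L / tau) in the notation above.
```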
`∞ norm bound
When the `∞ norm of ξi are bounded, using the fact that the dual norm of `∞ is now the `1
norm, we need to solve the problem given by
µ∗j,τ ∈ argminµj
− µ′jξj +τ
2∥µj∥22 +L∥µj∥1.
The solution to this problem is the well known soft thresholding operator, given by
µ∗j,τ = PLτ(1τξj).
209
Pγ(.) is an `1 shrinkage operation, with ith element
(Pγ(u))i = sign(ui)(∣ui∣ − γ)+ =
⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩
ui + γ, if ui ≤ −γ,
ui − γ, if ui ≥ γ,
0, else.
The expression for λ*_τ and the rest of the iterative hard thresholding algorithm follow
the same steps as before.
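The soft thresholding operator P_γ likewise transcribes directly:

```python
import numpy as np

def soft_threshold(u, gamma):
    """Elementwise l1 shrinkage: sign(u_i) * max(|u_i| - gamma, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

# mu*_{j,tau} = soft_threshold(xi_j / tau, L / tau) in the notation above.
```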
B.1.2 Implementation details
In this section, we describe some practical implementation details of the iterative thresholding
heuristic presented above.
Computing the Lipschitz values for the heuristic
In this subsection, we provide some details about the computation of the
Lipschitz constant for the first order heuristic. Forming the matrix M would require storing
a matrix of dimensions m × (n + nd), which is impractical for, say, n = 10^5 and d = 100.
Similarly, storing the matrices A_T and B_{T,i} would require substantial memory. However,
we exploit the structure of the problem and avoid storing these large matrices. We store a
two-dimensional array Φ : [m] → [n] × [n], which for each constraint gives the two indices
that appear in that constraint. For example, if the first constraint is

θ₂ + ξ′₂(x₅ − x₂) ≤ θ₅,

then Φ(1) = [2, 5], where Φ₁(1) = 2 and Φ₂(1) = 5. Thus the whole system of constraints
represented by the matrices A_T, B_{T,1}, ..., B_{T,n} can be stored efficiently.
Now, in order to compute the vector A′_Tλ, we first define the inverse maps of Φ. To be
precise, let Ψ₁ : [n] → 2^[m] be the set-valued function which, given an index i ∈ [n], outputs
the subset of constraints in which i is the first index, i.e.,

Ψ₁(i) = {j : Φ₁(j) = i},

and similarly

Ψ₂(i) = {j : Φ₂(j) = i}.

The product A′_Tλ can then be computed easily: the i-th element of this vector is simply

∑_{q∈Ψ₁(i)} λ_q − ∑_{q∈Ψ₂(i)} λ_q.

Similarly, the i-th element of the d-dimensional vector B′_{T,j}λ is given by

∑_{q∈Ψ₁(j)} (x_{Φ₂(q)} − x_j)_i λ_q.
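The matrix-free product A′_Tλ can be sketched with the index pairs alone (the array `phi` below plays the role of Φ; no inverse maps need to be materialized, since unbuffered scatter-adds accumulate over repeated indices):

```python
import numpy as np

def a_transpose_lambda(phi, lam, n):
    """Compute (A_T' lam)_i = sum_{q: Phi_1(q)=i} lam_q - sum_{q: Phi_2(q)=i} lam_q
    without forming A_T. `phi` is an m x 2 integer array of constraint index pairs."""
    out = np.zeros(n)
    np.add.at(out, phi[:, 0], lam)       # +lam_q wherever i is the first index
    np.subtract.at(out, phi[:, 1], lam)  # -lam_q wherever i is the second index
    return out

# Two constraints involving theta_0: pairs (0, 1) and (0, 2).
phi = np.array([[0, 1], [0, 2]])
lam = np.array([1.0, 2.0])
print(a_transpose_lambda(phi, lam, 3))   # [ 3. -1. -2.]
```

Each constraint θ_i + ξ_i′(x_j − x_i) ≤ θ_j contributes +1 at position i and −1 at position j of the corresponding row of A_T, which is exactly what the two scatter operations encode.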
In order to compute an estimate of the Lipschitz constant ℓ, we use backtracking (see Beck
and Teboulle [2009]) rather than computing the largest eigenvalue of the matrix M′M, which
can be computationally expensive.
Bibliography
Gad Allon, Michael Beenstock, Steven Hackman, Ury Passy, and Alexander Shapiro. Nonparametric estimation of concave production technologies by entropic methods. Journal of Applied Econometrics, 22:795–816, 2007.
Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544–562, 2004.
Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
Gábor Balázs, András György, and Csaba Szepesvári. Near-optimal max-affine estimators for convex regression. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 38:56–64, 2015.
Gabriel Baron, Elodie Perrodeau, Isabelle Boutron, and Philippe Ravaud. Reporting of analyses from randomized controlled trials with multiple arms: a systematic review. BMC Medicine, 11(1):84, 2013.
Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015. Working paper.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Stephen R. Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
Kristin P Bennett and J Blue. Optimal decision trees. Rensselaer Polytechnic Institute Math Report, 214, 1996.
Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, pages 1–44, 2017.
Dimitris Bertsimas and Jack Dunn. Machine Learning under a Modern Optimization Lens. Dynamic Ideas, Belmont, 2019. To appear.
Dimitris Bertsimas and Nathan Kallus. Pricing from observational data. arXiv preprint arXiv:1605.02347, 2017. Under review.
Dimitris Bertsimas and Nathan Kallus. From predictive to prescriptive analytics. Management Science, 2019. To appear.
Dimitris Bertsimas and Rahul Mazumder. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 2014.
Dimitris Bertsimas and Nishanth Mundru. Sparse convex regression. INFORMS Journal on Computing, 2018. Minor revision.
Dimitris Bertsimas and Nishanth Mundru. Prescriptive scenario reduction for stochastic optimization. 2019. In preparation.
Dimitris Bertsimas and John N. Tsitsiklis. Introduction to Linear Optimization, volume 6. Athena Scientific, 1997.
Dimitris Bertsimas and Bart Van Parys. Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 2016. To appear.
Dimitris Bertsimas and Bart Van Parys. Bootstrap robust prescriptive analytics. arXiv preprint arXiv:1711.09974, 2017.
Dimitris Bertsimas, Mac Johnson, and Nathan Kallus. The power of optimization over randomization in designing experiments involving small samples. Operations Research, 63(4):868–876, 2015.
Dimitris Bertsimas, Nathan Kallus, and Amjad Hussain. Inventory management in the era of big data. Production and Operations Management, 25(12):2006–2009, 2016a.
Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852, 2016b.
Dimitris Bertsimas, Nathan Kallus, Alex Weinstein, and Ying Daisy Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210–217, 2017.
Dimitris Bertsimas, Jack Dunn, and Nishanth Mundru. Optimal Prescriptive Trees. INFORMS Journal on Optimization, 2019a. In print.
Dimitris Bertsimas, Christopher McCord, and Nishanth Mundru. Prescriptive analytics for observational data. 2019b. Submitted.
Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
Robert E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optim. Stories, pages 107–121, 2012.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, California, 1984.
Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
Dalia Buffery. The 2015 oncology drug pipeline: innovation drives the race to cure cancer. American Health & Drug Benefits, 8(4):216, 2015.
Hong Chen and David D. Yao. Fundamentals of Queueing Networks: Performance, Asymptotics, and Optimization. Springer-Verlag, 2001.
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
Maxime C Cohen, Ngai-Hang Zachary Leung, Kiran Panchamgam, Georgia Perakis, and Anthony Smith. The impact of linear optimization on promotion planning. Operations Research, 65(2):446–468, 2017.
International Warfarin Pharmacogenetics Consortium et al. Estimation of the warfarin dose with clinical and pharmacogenetic data. N Engl J Med, 2009(360):753–764, 2009.
Dick den Hertog and Krzysztof Postek. Bridging the gap between predictive and prescriptive analytics - new optimization methodology needed. Technical report, Tilburg University, Netherlands, 2016. Available at: http://www.optimization-online.org/DB_HTML/2016/12/5779.html.
Priya Donti, J Zico Kolter, and Brandon Amos. Task-based end-to-end model learning in stochastic optimization. In Advances in Neural Information Processing Systems, pages 5488–5498, 2017.
John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In International Conference on Machine Learning, pages 272–279. ACM, 2008.
Jack Dunn. Optimal Trees for Prediction and Prescription. PhD thesis, Massachusetts Institute of Technology, 2018. URL http://jack.dunn.nz/papers/Thesis.pdf.
Iain Dunning, Joey Huchette, and Miles Lubin. JuMP: A modeling language for mathematical optimization. SIAM Review, 59(2):295–320, 2017.
Jitka Dupačová, Nicole Gröwe-Kuska, and Werner Römisch. Scenario reduction in stochastic programming. Mathematical Programming, 95(3):493–511, 2003.
Adam N Elmachtoub and Paul Grigas. Smart “predict, then optimize”. arXiv preprint arXiv:1710.08005, 2017.
Michael L Feldstein, Edwin D Savlov, and Russell Hilf. A statistical model for predicting response of breast cancer patients to cytotoxic chemotherapy. Cancer Research, 38(8):2544–2548, 1978.
Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management, 18(1):69–88, 2015.
Carlos A Flores. Estimation of dose-response functions and optimal doses with a continuous treatment. University of Miami. Typescript, 2007.
Patrick A Flume, Brian P O’Sullivan, Karen A Robinson, Christopher H Goss, Peter J Mogayzel Jr, Donna Beth Willey-Courand, Janet Bujan, Jonathan Finder, Mary Lester, Lynne Quittell, et al. Cystic fibrosis pulmonary guidelines: chronic medications for maintenance of lung health. American Journal of Respiratory and Critical Care Medicine, 176(10):957–969, 2007.
Jérémie Gallien, Adam J Mersereau, Andres Garro, Alberte Dapena Mora, and Martín Nóvoa Vidal. Initial shipment decisions for new products at Zara. Operations Research, 63(2):269–286, 2015.
John C Gittins. Multi-Armed Bandit Allocation Indices. Wiley, Chichester, UK, 1989.
Chong Yang Goh and Patrick Jaillet. Structured prediction by least squares estimated conditional risk minimization. arXiv preprint arXiv:1611.07096, 2016.
Alexander Goldenshluger and Assaf Zeevi. Recovering convex boundaries from blurred and noisy observations. The Annals of Statistics, 34:1375–1394, 2006.
Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
Marjan Gort, Manda Broekhuis, Renée Otter, and Niek S Klazinga. Improvement of best practice in early breast cancer: actionable surgeon and hospital factors. Breast Cancer Research and Treatment, 102(2):219–226, 2007.
Thomas Grubinger, Achim Zeileis, and Karl-Peter Pfeiffer. evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1):1–29, 2014. ISSN 1548-7660. doi: 10.18637/jss.v061.i01. URL https://www.jstatsoft.org/v061/i01.
Gurobi. Gurobi Optimizer Reference Manual. http://www.gurobi.com, 2015.
Qiyang Han and Jon A Wellner. Multivariate convex regression: global risk bounds and adaptation. arXiv preprint arXiv:1601.06844, 2016.
Lauren A. Hannah and David B. Dunson. Multivariate convex regression with adaptive partitioning. Journal of Machine Learning Research, 14:3261–3294, 2013.
Lauren A. Hannah, Warren B. Powell, and David B. Dunson. Semiconvex regression for metamodeling based optimization. SIAM Journal on Optimization, 24(2):573–597, 2014.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics, Springer, New York, second edition, 2009.
Holger Heitsch and Werner Römisch. Scenario reduction algorithms in stochastic programming. Computational Optimization and Applications, 24(2-3):187–206, 2003.
Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
Keisuke Hirano and Guido W Imbens. The propensity score with continuous treatments. Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, 226164:73–84, 2004.
Tito Homem-de Mello and Güzin Bayraksan. Monte Carlo sampling-based methods for stochastic optimization. Surveys in Operations Research and Management Science, 19(1):56–85, 2014.
Kosuke Imai and Marc Ratkovic. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):243–263, 2014.
Thomas R Insel. Translating scientific opportunity into public health impact: a strategic plan for research on mental illness. Archives of General Psychiatry, 66(2):128–133, 2009.
Amir Jaffer and Lee Bragg. Practical tips for warfarin dosing and monitoring. Cleveland Clinic Journal of Medicine, 70(4):361–371, 2003.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
Nathan Kallus. Balanced policy evaluation and learning. arXiv preprint arXiv:1705.07384, 2017a.
Nathan Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning, pages 1789–1798, 2017b.
Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. arXiv preprint arXiv:1802.06037, 2018.
Yi-hao Kao, Benjamin Van Roy, and Xiang Yan. Directed regression. In Advances in Neural Information Processing Systems, pages 889–897, 2009.
James E Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
Anton J Kleywegt, Alexander Shapiro, and Tito Homem-de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
Robert J LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
Avinash S. Lele, Sanjeev R. Kulkarni, and Alan S. Willsky. Convex-polygon estimation from support-line measurements and applications to target reconstruction from laser-radar data. Journal of the Optical Society of America, Series A, 9:1693–1714, 1992.
Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
Eunji Lim and Peter W. Glynn. Consistency of multidimensional convex regression. Operations Research, 60(1):196–208, 2012.
Ilya Lipkovich and Alex Dmitrienko. Strategies for identifying predictive biomarkers and subgroups with enhanced treatment effect in clinical trials using SIDES. Journal of Biopharmaceutical Statistics, 24(1):130–153, 2014.
Alessandro Magnani and Stephen P. Boyd. Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17, 2009.
Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
Rahul Mazumder, Arkopal Choudhury, Garud Iyengar, and Bodhisattva Sen. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, pages 1–14, 2018.
Maethee Mekaroonreung and Andrew L Johnson. Estimating the shadow prices of SO2 and NOx for US coal power plants: a convex nonparametric least squares approach. Energy Economics, 34(3):723–732, 2012.
Velibor V Mišić. Optimization of tree ensembles. arXiv preprint arXiv:1705.10883, 2017.
Stephen L Morgan and Christopher Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.
Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24:227–234, 1995.
George Nemhauser. Integer programming: The global impact. EURO, INFORMS, 2013. URL https://smartech.gatech.edu/handle/1853/49829.
Arkadi Nemirovski and Alexander Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2006.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Mahesh KB Parmar, James Carpenter, and Matthew R Sydes. More multiarm randomised trials of superiority are needed. The Lancet, 384(9940):283, 2014.
Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah, Trevor Hastie, and Robert Tibshirani. Some methods for heterogeneous treatment effect estimation in high dimensions. arXiv preprint arXiv:1707.00102v1, 2017. Working paper.
Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. The Annals of Statistics, 39(2):1180, 2011.
Hamed Rahimian, Güzin Bayraksan, and Tito Homem-de Mello. Identifying effective scenarios in distributionally robust stochastic programs with total variation distance. Mathematical Programming, pages 1–38, 2018.
R Tyrrell Rockafellar and Stanislav Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
Paul R Rosenbaum. Observational studies. In Observational Studies, pages 1–17. Springer, 2002.
Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, pages 41–55, 1983.
Cynthia Rudin and Gah-Yi Vahn. The big data newsvendor: Practical insights from machine learning. 2014.
Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn, and Wolfram Wiesemann. Scenario reduction revisited: Fundamental limits and guarantees. Mathematical Programming, pages 1–36, 2017.
Emilio Seijo and Bodhisattva Sen. Nonparametric least squares estimation of a multivariate convex regression function. The Annals of Statistics, 39:1633–1657, 2011.
Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Alexander Shapiro and Arkadi Nemirovski. On complexity of stochastic programming problems. Continuous Optimization, pages 111–146, 2005.
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009a.
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on Stochastic Programming, Modeling and Theory. Society for Industrial and Applied Mathematics and the Mathematical Programming Society, 2009b.
Nguyen Hung Son. From optimal hyperplanes to optimal decision trees. Fundamenta Informaticae, 34(1, 2):145–174, 1998.
Evan Stubbs. The value of business analytics. http://analytics-magazine.org/the-value-of-business-analytics/, 2016. Accessed: 2018-01-30.
Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization. In Proceedings of the 24th International Conference on World Wide Web, pages 939–941. ACM, 2015.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
Huseyin Topaloglu and Warren B. Powell. An algorithm for approximating piecewise linear concave functions from sample gradients. Operations Research Letters, 31:66–76, 2003.
Theja Tulabandhula and Cynthia Rudin. Machine learning with operational costs. The Journal of Machine Learning Research, 14(1):1989–2028, 2013.
Hal R. Varian. The nonparametric approach to demand analysis. Econometrica, 50(4):945–973, 1982.
Hal R. Varian. The nonparametric approach to production analysis. Econometrica, 52(3):579–597, 1984.
Roman Vershynin. High-Dimensional Probability: An Introduction with Applications, 2016.
Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
Stein W Wallace and William T Ziemba. Applications of Stochastic Programming. SIAM, 2005.
Geoffrey S Watson. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, pages 359–372, 1964.
Daniel Westreich, Justin Lessler, and Michele Jonsson Funk. Propensity score estimation: machine learning and classification methods as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8):826, 2010.
Min Xu, Minhua Chen, and John Lafferty. Faithful variable screening for high-dimensional convex regression. The Annals of Statistics, 44(6):2624–2660, 2016.
Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169–187, 2017.
Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
José R Zubizarreta. Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500):1360–1371, 2012.