new interpretable ai - princeton university · 2018. 10. 29. · interpretable ai i ai and...

Interpretable AI

Dimitris Bertsimas

MIT

September 2018

1 / 24

Interpretable AI

I AI, and especially deep learning, has made significant progress in computer vision, automatic translation and voice recognition, all of which are affecting society.

I Deep learning suffers from lack of interpretability.

I A driverless car is involved in an accident with loss of life. Who is at fault? Can society tolerate not understanding?

I A student is not selected for freshman admissions. Is it an adequate response that an algorithm made the decision?

I Interpretability matters.

2 / 24

Goal: Develop AI algorithms that are interpretable and provide state-of-the-art performance.

Figure: The same patient record scored by a black-box model and by an interpretable model.

Patient info: Age: 30, Gender: male, Albumin: 2.8 g/dL, Sepsis: none, INR: 1.1, Diabetic: yes, …

Black-box model: Mortality risk: 26.4%

Interpretable model: Mortality risk: 26.4%, read off the tree

Age < 25?
  yes: 13.2%
  no: Male?
    yes: 26.4%
    no: 18.3%

3 / 24

Leo Breiman, On Interpretability Trees receive an A+

I Leo Breiman et al. (1984) introduced CART, a heuristic approach to make predictions (either binary or continuous) from data.

I Widespread use in academia and industry (~37,000 citations!)

I The Iris flower data set was introduced by Fisher (1936) to classify flowers based on four measurements: petal length/width and sepal length/width.

4 / 24

The Iris data set

Figure: Scatter plot of the Iris data. Horizontal axis: Sepal Length (4 to 8). Vertical axis: Sepal Width (2.0 to 4.5). Points are marked by Species (legend: setosa, virginica).

5 / 24

The Tree Representation

Figure: Decision tree on the Iris data.

Node 1: Sepal length < 5.75?
  yes: Node 2: Sepal width < 2.7?
    yes: V
    no (Sepal width ≥ 2.7): S
  no (Sepal length ≥ 5.75): V
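The tree on this slide can be read as nested if/else rules. A minimal sketch in Python, assuming that the left branch of each node is the "<" condition and keeping the leaf labels S and V exactly as they appear on the slide (the function name is hypothetical):

```python
def classify_iris(sepal_length, sepal_width):
    """Follow the two splits shown on the slide and return the leaf label."""
    if sepal_length < 5.75:          # node 1
        if sepal_width < 2.7:        # node 2
            return "V"
        return "S"
    return "V"


print(classify_iris(5.0, 3.5))  # short, wide sepal -> "S"
print(classify_iris(6.5, 3.0))  # long sepal -> "V"
```

Each prediction comes with its own explanation: the path of conditions that led to the leaf.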

6 / 24

Leo again ....

I CART is fundamentally greedy: it makes a series of locally optimal decisions, but the final tree could be far from optimal.

I "Finally, another problem frequently mentioned (by others, not by us) is that the tree procedure is only one-step optimal and not overall optimal. . . . If one could search all possible partitions . . . the two results might be quite different.

We do not address this problem. At this stage of computer technology, an overall optimal tree growing procedure does not appear feasible for any reasonably sized data set."

I On interpretability, trees receive an A+
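The gap between "one-step optimal" and "overall optimal" can be seen on a toy problem. The sketch below is an illustration only: the data set is made up, the greedy rule minimizes misclassification count rather than CART's actual impurity criterion, and brute-force enumeration stands in for a proper optimization method.

```python
from collections import Counter

# Toy data: rows are ((x1, x2, x3), y) with y = x1 XOR x2.
# x3 is a hypothetical distractor that agrees with y on 6 of the 8 rows.
DATA = [
    ((0, 0, 0), 0), ((0, 0, 0), 0),
    ((0, 1, 1), 1), ((0, 1, 1), 1),
    ((1, 0, 1), 1), ((1, 0, 0), 1),
    ((1, 1, 0), 0), ((1, 1, 1), 0),
]


def errors(rows):
    """Misclassified rows if this node predicts its majority label."""
    if not rows:
        return 0
    counts = Counter(y for _, y in rows)
    return len(rows) - max(counts.values())


def split(rows, f):
    """Partition rows on binary feature f."""
    return ([r for r in rows if r[0][f] == 0],
            [r for r in rows if r[0][f] == 1])


def greedy(rows, depth, n_features=3):
    """Training errors of a tree grown one locally best split at a time."""
    if depth == 0 or errors(rows) == 0:
        return errors(rows)
    best = min(range(n_features),
               key=lambda f: sum(errors(part) for part in split(rows, f)))
    return sum(greedy(part, depth - 1, n_features) for part in split(rows, best))


def exhaustive(rows, depth, n_features=3):
    """Training errors of the best tree of the given depth (brute force)."""
    if depth == 0 or errors(rows) == 0:
        return errors(rows)
    return min(sum(exhaustive(part, depth - 1, n_features)
                   for part in split(rows, f))
               for f in range(n_features))


print(greedy(DATA, 2))      # 2
print(exhaustive(DATA, 2))  # 0
```

The greedy procedure splits on x3 first because it gives the best immediate gain, and then cannot recover the XOR structure with the one remaining level. Searching over whole trees splits on x1 and then x2 and classifies every row correctly.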

7 / 24

B.+Dunn, “Optimal Trees”, Machine Learning, 2017

I Use Mixed-Integer Optimization (MIO) and local search to consider the entire decision tree problem at once and solve to obtain the Optimal Tree for both regression and classification.

I The algorithms scale to n = 1,000,000 and p = 10,000.

I Motivation: MIO is the natural form for the Optimal Tree problem:

  I Decisions: Which variable to split on, which label to predict for a region

  I Outcomes: Which region a point ends up in, whether a point is correctly classified

8 / 24

OCT-H

Figure: Scatter plot of the Iris data. Horizontal axis: Sepal Length (4 to 8). Vertical axis: Sepal Width (2.0 to 4.5). Points are marked by Species (legend: setosa, versicolor, virginica).
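OCT-H differs from OCT in the form of its splits: each split tests a linear combination of the features against a threshold instead of a single feature. A minimal sketch of the two kinds of test, where the weights and thresholds are hypothetical values chosen only for illustration (they are not taken from the fitted tree in the figure):

```python
def axis_split(x, feature, threshold):
    """Axis-aligned split (as in CART or OCT): test one feature."""
    return x[feature] < threshold


def hyperplane_split(x, weights, threshold):
    """Hyperplane split (as in OCT-H): test a weighted sum of all features."""
    return sum(w * xi for w, xi in zip(weights, x)) < threshold


# x = (sepal length, sepal width)
flower = (5.0, 3.5)

print(axis_split(flower, 0, 5.75))                   # True: 5.0 < 5.75
print(hyperplane_split(flower, (1.0, -1.0), 2.0))    # True: 5.0 - 3.5 = 1.5 < 2.0
```

An axis-aligned split is the special case in which exactly one weight is nonzero, so a tree with hyperplane splits can draw oblique boundaries in the scatter plot that would otherwise need several axis-aligned splits.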

9 / 24

Performance of Optimal Classification Trees

I Average out-of-sample accuracy across 60 real-world datasets:

Figure: Out-of-sample accuracy (vertical axis, 70 to 85) against maximum depth of tree (horizontal axis, 2 to 10) for CART, OCT and OCT-H.

10 / 24

Performance of Optimal Classification Trees

I Average out-of-sample accuracy across 60 real-world datasets:

Figure: Out-of-sample accuracy (vertical axis, 70 to 85) against maximum depth of tree (horizontal axis, 2 to 10) for CART, OCT, OCT-H, Random Forest and XGBoost.

10 / 24

How do trees compare with Deep Learning?

I B. + Mazumder + Sobiesk, 2018

I Theorem: Optimal classification and regression trees with hyperplanes are as powerful as classification and regression (feedforward, convolutional and recurrent) neural networks; that is, given a NN, we can find an OCT-H (or ORT-H) that has the same in-sample performance.

I Out-of-sample performance is very comparable on 7 popular data sets between NNs and OCT-Hs.

I Why is this result important?

11 / 24

Surgical Outcomes Prediction - used at MGH

Figure: Decision tree for predicting any complication post surgery.

12 / 24

Surgical Outcomes Prediction - App

Figure: Surgical outcome prediction questionnaire based on Optimal Trees.

13 / 24

Mortality Prediction in Cancer Patients - used at Dana-Farber

Figure: Decision tree for predicting 60-day mortality in breast cancer patients.

14 / 24

Mortality Prediction in Cancer Patients - App

Figure: Cancer mortality prediction questionnaire based on Optimal Trees.

15 / 24

Saving Lives in Liver Transplantation

Using OCT, we designed a new system for prioritizing liver transplantation recipients that averts 400 deaths per year in the US compared to current practice.

16 / 24

Critical Brain Injury

Using OCT, we can identify critical brain injury in children using 40% fewer CT scans than CART while missing only 5 children out of 337 (CART misses 9).

17 / 24

Designing financial plans from transactions

I Using OCT, we can accurately predict whether a person is likely to buy a house or open an educational account, based on transactional data (payroll, credit cards, ...).

I Based on these predictions, we create a financial plan that maximizes the probability of achieving the person's goals.

18 / 24

Optimal Prescriptive Trees

I B+Dunn+Mundru, Optimal Prescriptive Trees, 2018.

I Consider a healthcare setting (personalized medicine; many other applications).

I Historical observational data $(X_i, z_i, Y_i)$, $i = 1, \dots, n$.

I $X_i \in \mathbb{R}^d$: Features of patient $i$.

I $z_i \in \{1, 2, \dots, m\}$: Treatment assigned to patient $i$ by the doctor.

I $Y_i \in \mathbb{R}$: Recorded outcome of patient $i$ (the lower the better).

I Question: When a new patient comes in with features $x$, what treatment $\tau(x) \in \{1, 2, \dots, m\}$ is best for this person?

19 / 24

Can we use Machine Learning?

I For each patient $x_i$: If we knew the best treatment (the treatment out of $m$ options that leads to the best outcome), then it would be a standard multiclass classification problem.

I We could learn a classifier that predicts in $\{1, \dots, m\}$ given $x \in \mathbb{R}^d$ using this historical data.

I KEY CHALLENGE: But we only know the outcome for $z_i$ (the historically given treatment) and not for the others.

I We do not know what would have happened ("counterfactuals") to patient $i$ under the other $(m - 1)$ treatments.

20 / 24

Optimal Prescriptive Trees

I Objective: Determine τ(x) to minimize

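$$\mu \cdot (\text{Mean outcome}) + (1 - \mu) \cdot (\text{Prediction error}), \qquad 0 < \mu < 1$$

$$\mu \sum_{i=1}^{n} \Big( y_i \, \mathbb{I}[\tau(x_i) = z_i] + \sum_{t \neq z_i} \hat{y}_i(t) \, \mathbb{I}[\tau(x_i) = t] \Big) + (1 - \mu) \Big[ \sum_{i=1}^{n} \big( y_i - \hat{y}_i(z_i) \big)^2 \Big],$$

where $\hat{y}_i(t)$ denotes the estimated outcome of patient $i$ under treatment $t$.

I Need to predict counterfactuals.

1. For each subject $i$: if he/she received treatment 1, we know $Y_i = Y_i(1)$.

2. Estimate $Y_i(0)$ as the average outcome of the patients in that leaf who received treatment 0.

3. Can also use linear regression.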

I Use B+Dunn OCT or ORT algorithms.
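To make the objective concrete, the sketch below evaluates it for a single leaf and a single prescribed treatment, estimating unobserved outcomes by the leaf average as in step 2 above. It is a toy illustration under stated assumptions: the data values and function names are made up, and it does not search over trees, which is the job of the OCT/ORT algorithms.

```python
from statistics import mean


def leaf_estimates(leaf):
    """Mean observed outcome per treatment among the patients in one leaf.

    `leaf` is a list of (treatment, outcome) pairs.
    """
    by_treatment = {}
    for z, y in leaf:
        by_treatment.setdefault(z, []).append(y)
    return {t: mean(ys) for t, ys in by_treatment.items()}


def leaf_objective(leaf, prescribed, mu):
    """mu * (outcome under the prescription) + (1 - mu) * (prediction error).

    Assumes at least one patient in the leaf received `prescribed`;
    otherwise no leaf-average estimate exists and a KeyError is raised.
    """
    y_hat = leaf_estimates(leaf)
    # Observed outcome if the prescription matches history, estimate otherwise.
    outcome = sum(y if z == prescribed else y_hat[prescribed] for z, y in leaf)
    # Squared error of the estimates on the treatments actually received.
    error = sum((y - y_hat[z]) ** 2 for z, y in leaf)
    return mu * outcome + (1 - mu) * error


# Hypothetical leaf: (treatment, outcome), lower outcome is better.
leaf = [(0, 8.0), (0, 7.0), (1, 6.0), (1, 6.5)]

print(leaf_objective(leaf, prescribed=1, mu=0.5))  # 12.8125
print(leaf_objective(leaf, prescribed=0, mu=0.5))  # 15.3125
```

In this toy leaf, prescribing treatment 1 gives the lower objective. The prediction-error term is the same for both prescriptions here because it depends only on how the leaf groups patients, which is why it matters when the tree structure itself is being chosen.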

21 / 24

Personalized Diabetes Management

I Data from the Boston Medical Center, from 1999-2014.

I 100,000 patient visits for type 2 diabetes.

I 13 possible treatment options (regimens).

I Patient features include demographic information (sex, race, gender, etc.), treatment history, and diabetes progression.

I Outcome of interest: HbA1c level; the smaller the better.

I Varied the number of training samples from 1,000 to 50,000 to examine the effect on out-of-sample performance. Averaged this process over ten different splits of the data.

22 / 24

OPT has a Performance and Interpretability Edge

Figure: Three panels, each plotted against training size (horizontal axis, 10^3 to 10^4.5): mean HbA1c change (−0.6 to 0.0), conditional HbA1c change (−0.6 to 0.0), and proportion differing from SOC (0% to 100%). Methods compared: Baseline, Oracle, RC-kNN, RC-LASSO, RC-RF and OPT.

23 / 24

Conclusions

I OCT and OCT-H provide interpretable, state-of-the-art predictions.

I OPT provides state-of-the-art prescriptions directly from data.

I Exciting applications in medicine and many other fields: computer security, financial services, drug discovery, among many others.

I Rethink how we teach optimization.

I New Class: Machine Learning and Personalized Medicine

24 / 24
