
arXiv:1811.01988v4 [math.OC] 21 Jan 2020

Noname manuscript No. (will be inserted by the editor)

Strong mixed-integer programming formulations for trained neural networks*

Ross Anderson · Joey Huchette · Will Ma · Christian Tjandraatmadja · Juan Pablo Vielma

the date of receipt and acceptance should be inserted later

Abstract We present strong mixed-integer programming (MIP) formulations for high-dimensional piecewise linear functions that correspond to trained neural networks. These formulations can be used for a number of important tasks, such as verifying that an image classification network is robust to adversarial inputs, or solving decision problems where the objective function is a machine learning model. We present a generic framework, which may be of independent interest, that provides a way to construct sharp or ideal formulations for the maximum of d affine functions over arbitrary polyhedral input domains. We apply this result to derive MIP formulations for a number of the most popular nonlinear operations (e.g. ReLU and max pooling) that are strictly stronger than other approaches from the literature. We corroborate this computationally, showing that our formulations are able to offer substantial improvements in solve time on verification tasks for image classification networks.

Keywords Mixed-integer programming, Formulations, Deep learning

Mathematics Subject Classification (2010) 90C11

* An extended abstract version of this paper appeared in [4].

R. Anderson, E-mail: [email protected]

J. Huchette, E-mail: [email protected]

W. Ma, E-mail: [email protected]

C. Tjandraatmadja, E-mail: [email protected]

J. P. Vielma, E-mail: [email protected], [email protected]


Acknowledgements The authors gratefully acknowledge Yeesian Ng and Ondrej Sykora for many discussions on the topic of this paper, and for their work on the development of the tf.opt package used in the computational experiments.

1 Introduction

Deep learning has proven immensely powerful at solving a number of important predictive tasks arising in areas such as image classification, speech recognition, machine translation, and robotics and control [32,48]. The workhorse model in deep learning is the feedforward network $\mathrm{NN} : \mathbb{R}^{m_0} \to \mathbb{R}^{m_s}$ that maps a (possibly very high-dimensional) input $x^0 \in \mathbb{R}^{m_0}$ to an output $x^s = \mathrm{NN}(x^0) \in \mathbb{R}^{m_s}$. A feedforward network with $s$ layers can be recursively described as

$$x^i_j = \mathrm{NL}_{i,j}(w^{i,j} \cdot x^{i-1} + b^{i,j}) \quad \forall i \in \llbracket s\rrbracket,\ j \in \llbracket m_i\rrbracket \qquad (1)$$

where $\llbracket n\rrbracket \overset{\text{def}}{=} \{1, \dots, n\}$, $m_i$ is both the number of neurons in layer $i$ and the output dimension of the neurons in layer $i-1$ (with the input $x^0 \in \mathbb{R}^{m_0}$ considered to be the 0-th layer). Furthermore, for each $i \in \llbracket s\rrbracket$ and $j \in \llbracket m_i\rrbracket$, $\mathrm{NL}_{i,j} : \mathbb{R} \to \mathbb{R}$ is some simple nonlinear activation function that is fixed before training, and $w^{i,j}$ and $b^{i,j}$ are the weights and bias of an affine function which is learned during the training procedure. In its simplest and most common form, the activation function would be the rectified linear unit (ReLU), defined as $\mathrm{ReLU}(v) = \max\{0, v\}$.
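To make the recursion (1) concrete, here is a minimal NumPy sketch that evaluates a ReLU network layer by layer; the weights and inputs below are hypothetical and purely illustrative.

```python
import numpy as np

def relu_network_forward(x0, weights, biases):
    """Evaluate the recursion (1) for a feedforward ReLU network.

    weights[i] has shape (m_i, m_{i-1}) and biases[i] has shape (m_i,),
    so row j of weights[i] is w^{i,j} and biases[i][j] is b^{i,j}.
    """
    x = np.asarray(x0, dtype=float)
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)  # NL_{i,j} = ReLU, applied componentwise
    return x

# Tiny example with made-up parameters: m_0 = 3, m_1 = 2, m_2 = 1.
W1, b1 = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]), np.array([0.6, -0.3])
W2, b2 = np.array([[1.0, -1.0]]), np.array([0.0])
print(relu_network_forward([0.2, 0.4, -0.1], [W1, W2], [b1, b2]))
```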

Many standard nonlinearities NL, such as the ReLU, are piecewise linear: that is, there exists a partition $\{S^i \subseteq D\}_{i=1}^d$ of the domain and affine functions $\{f^i\}_{i=1}^d$ such that $\mathrm{NL}(x) = f^i(x)$ for all $x \in S^i$. If all nonlinearities describing NN are piecewise linear, then the entire network NN is piecewise linear as well, and conversely, any continuous piecewise linear function can be represented by a ReLU neural network [5]. Beyond the ReLU, we also investigate activation functions that can be expressed as the maximum of finitely many affine functions, a class which includes certain common operations such as max pooling and reduce max.

There are numerous contexts in which one may want to solve an optimization problem containing a trained neural network such as NN. For example, this paradigm arises in deep reinforcement learning problems with high dimensional action spaces and where any of the cost-to-go function, immediate cost, or the state transition functions are learned by a neural network [6,23,56,62,64,80]. More generally, this approach may be used to approximately optimize learned functions that are difficult to model explicitly [34,52,53]. Alternatively, there has been significant recent interest in verifying the robustness of trained neural networks deployed in systems like self-driving cars that are incredibly sensitive to unexpected behavior from the machine learning model [20,60,68]. Relatedly, a string of recent work has used optimization over neural networks trained for visual perception tasks to generate new images which are "most representative" for a given class [59], are "dreamlike" [57], or adhere to a particular artistic style via neural style transfer [30].


1.1 MIP formulation preliminaries

In this work, we study mixed-integer programming (MIP) approaches for optimization problems containing trained neural networks. In contrast to heuristic or local search methods often deployed for the applications mentioned above, MIP offers a framework for producing provably optimal solutions. This is particularly important, for example, in verification problems, where rigorous dual bounds can guarantee robustness in a way that purely primal methods cannot.

Let $f : D \subseteq \mathbb{R}^\eta \to \mathbb{R}$ be an $\eta$-variate affine function with domain $D$, $\mathrm{NL} : \mathbb{R} \to \mathbb{R}$ be a univariate nonlinear activation function, and $\circ$ be the standard function composition operator so that $(\mathrm{NL} \circ f)(x) = \mathrm{NL}(f(x))$. Functions of the form $(\mathrm{NL} \circ f)(x)$ precisely describe an individual neuron, with $\eta$ inputs and one output. A standard way to model a function $g : D \subseteq \mathbb{R}^\eta \to \mathbb{R}$ using MIP is to construct a formulation of its graph defined by $\mathrm{gr}(g; D) \overset{\text{def}}{=} \{ (x, y) \in \mathbb{R}^\eta \times \mathbb{R} \mid x \in D,\ y = g(x) \}$. We therefore focus on constructing MIP formulations for the graph of individual neurons:

$$\mathrm{gr}(\mathrm{NL} \circ f; D) \overset{\text{def}}{=} \{ (x, y) \in \mathbb{R}^\eta \times \mathbb{R} \mid x \in D,\ y = (\mathrm{NL} \circ f)(x) \}, \qquad (2)$$

because, as we detail at the end of this section, we can readily produce a MIP formulation for the entire network as the composition of formulations for each individual neuron.

Definition 1 Throughout, we will notationally use the convention that $x \in \mathbb{R}^\eta$ and $y \in \mathbb{R}$ are respectively the input and output variables of the function $g : D \subseteq \mathbb{R}^\eta \to \mathbb{R}$ we are modeling. In addition, $v \in \mathbb{R}^h$ and $z \in \mathbb{R}^q$ are respectively any auxiliary continuous and binary variables used to construct a formulation of $\mathrm{gr}(g; D)$. The orthogonal projection operator $\mathrm{Proj}$ will be subscripted by the variables to project onto, e.g. $\mathrm{Proj}_{x,y}(R) = \{ (x, y) \mid \exists v, z \text{ s.t. } (x, y, v, z) \in R \}$ is the orthogonal projection of $R$ onto the "original space" of $x$ and $y$ variables. We denote by $\mathrm{ext}(Q)$ the set of extreme points of a polyhedron $Q$.

Take $S \overset{\text{def}}{=} \mathrm{gr}(g; D) \subseteq \mathbb{R}^{\eta+1}$ to be the set we want to model, and a polyhedron $Q \subset \mathbb{R}^{\eta+1+h+q}$. Then:

– A (valid) mixed-integer programming (MIP) formulation of $S$ consists of the linear constraints on $(x, y, v, z) \in \mathbb{R}^{\eta+1+h+q}$ which define a polyhedron $Q$, combined with the integrality constraints $z \in \{0,1\}^q$, such that $S = \mathrm{Proj}_{x,y}\big( Q \cap (\mathbb{R}^{\eta+1+h} \times \{0,1\}^q) \big)$. We refer to $Q$ as the linear programming (LP) relaxation of the formulation.
– A MIP formulation is sharp if $\mathrm{Proj}_{x,y}(Q) = \mathrm{Conv}(S)$.
– A MIP formulation is hereditarily sharp if fixing any subset of binary variables $z$ to 0 or 1 preserves sharpness.
– A MIP formulation is ideal (or perfect) if $\mathrm{ext}(Q) \subseteq \mathbb{R}^{\eta+1+h} \times \{0,1\}^q$.
– The separation problem for a family of inequalities is to find a valid inequality violated by a given point or certify that no such inequality exists.


– An inequality is valid for the formulation if each integer feasible point in $\{ (x, y, v, z) \in Q \mid z \in \{0,1\}^q \}$ satisfies the inequality. Moreover, a valid inequality is facet-defining if the dimension of the points in this set satisfying the inequality at equality is exactly one less than the dimension of the set itself.

Note that ideal formulations are sharp, but the converse is not necessarily the case [72, Proposition 2.4]. In this sense, ideal formulations offer the tightest possible relaxation, and the integrality property in Definition 1 tends to lead to superior computational performance. Furthermore, note that a hereditarily sharp formulation is a formulation which retains its sharpness at every node in a branch-and-bound tree, and as such is potentially superior to a formulation which only guarantees sharpness at the root node [41,42]. Additionally, it is important to keep in mind that modern MIP solvers will typically require an explicit, finite list of inequalities defining $Q$.

Finally, for each $i \in \llbracket s\rrbracket$ and $j \in \llbracket m_i\rrbracket$ let $Q^{i,j} \subseteq \mathbb{R}^{m_{i-1}+1+h_{i,j}+q_{i,j}}$ be a polyhedron such that $\mathrm{gr}(\mathrm{NL}_{i,j} \circ f^{i,j}; D^{i,j}) = \mathrm{Proj}_{x,y}\big( Q^{i,j} \cap (\mathbb{R}^{m_{i-1}+1+h_{i,j}} \times \{0,1\}^{q_{i,j}}) \big)$, where each $D^{i,j}$ is the domain of neuron $\mathrm{NL}_{i,j}$ and $f^{i,j}(x) = w^{i,j} \cdot x + b^{i,j}$ is the learned affine function in network (1). Then, a formulation for the complete network (1) is given by

$$\big( x^{i-1}, x^i_j, v^{i,j}, z^{i,j} \big) \in \big( Q^{i,j} \cap (\mathbb{R}^{m_{i-1}+1+h_{i,j}} \times \{0,1\}^{q_{i,j}}) \big) \quad \forall i \in \llbracket s\rrbracket,\ j \in \llbracket m_i\rrbracket.$$

Any properties of the individual-neuron formulations $Q^{i,j}$ (e.g. being ideal) are not necessarily preserved for the combined formulation for the complete network. However, for similarly sized formulations, combining stronger formulations (e.g. ideal instead of sharp) usually leads to complete formulations that are more computationally effective [72]. Hence, our focus for all theoretical properties of the formulations will be restricted to individual-neuron formulations.

1.2 Our contributions

We highlight the contributions of this work as follows.

1. Generic recipes for building strong MIP formulations of the maximum of d affine functions for any bounded polyhedral input domain.
   – [Propositions 3 and 4] We derive both primal and dual characterizations for ideal formulations via the Cayley embedding.
   – [Propositions 5 and 6] We relax the Cayley embedding in a particular way, and use this to derive simpler primal and dual characterizations for (hereditarily) sharp formulations.
   – We discuss how to separate both dual characterizations via subgradient descent.
2. Simplifications for common special cases.
   – [Corollary 1] We show the equivalence of the ideal and sharp characterizations when d = 2 (i.e. the maximum of two affine functions).
   – [Proposition 7] We show that, if the input domain is a product of simplices, the separation problem of the sharp formulation can be reframed as a series of transportation problems.
   – [Corollaries 2 and 3, and Proposition 9] When the input domain is a product of simplices, and either (1) d = 2, or (2) each simplex is two-dimensional, we provide an explicit, finite description for the sharp formulation. Furthermore, none of these inequalities are redundant.
3. Application of these results to construct MIP formulations for neural network nonlinearities.
   – [Propositions 12 and 13] We derive an explicit ideal formulation for the ReLU nonlinearity over a box input domain, the most common case. Separation over this ideal formulation can be performed in time linear in the input dimension.
   – [Corollary 5] We derive an explicit ideal formulation for the ReLU nonlinearity where some (or all) of the input domain is one-hot encoded categorical or discrete data. Again, the separation can be performed efficiently, and none of the inequalities are redundant.
   – [Propositions 10 and 11] We present an explicit hereditarily sharp formulation for the maximum of d affine functions over a box input domain, and provide an efficient separation routine. Moreover, a subset of the defining constraints serve as a tightened big-M formulation.
   – [Proposition 14] We produce similar results for more exotic neurons: the leaky ReLU and the clipped ReLU (see the online supplement).
4. Computational experiments on verification problems arising from image classification networks trained on the MNIST digit data set.
   – We observe that our new formulations, along with just a few rounds of separation over our families of cutting planes, can improve the solve time of Gurobi on verification problems by orders of magnitude.

Our contributions are depicted in Figure 1. It serves as a roadmap of the paper.

1.3 Relevant prior work

In recent years a number of authors have used MIP formulations to model trained neural networks [19,22,24,29,40,47,54,64,66,67,70,80,81], mostly applying big-M formulation techniques to ReLU-based networks. When applied to a single neuron of the form (2), these big-M formulations will not be ideal or offer an exact convex relaxation; see Example 1 for an illustration. Additionally, a stream of literature in the deep learning community has studied convex relaxations [12,26,27,61,63], primarily for verification tasks. Moreover, some authors have investigated how to use convex relaxations within the training procedure in the hopes of producing neural networks with a priori robustness guarantees [25,78,79].

The usage of mathematical programming in deep learning, specifically for training predictive models, has also been investigated in [15]. Beyond mathematical programming and convex relaxations, a number of authors have investigated other algorithmic techniques for modeling trained neural networks in optimization problems, drawing primarily from the satisfiability, constraint programming, and global optimization communities [10,11,43,51,65]. Throughout this stream of the literature, there is discussion of the performance for specific subclasses of neural network models, including binarized [44] or input convex [3] neural networks. Improving our formulations for specific subclasses of neural networks would constitute interesting future work.

[Figure 1: a flowchart linking the technical results — the general recipe for ideal formulations of the maximum of $d$ affine functions (Prop. 3–4), the general recipe for sharp formulations (Prop. 5–6), the equivalence of the recipes for $d = 2$ (Lem. 1–2, Cor. 1), sharp formulations via transportation problems for $D = (\Delta^p)^\tau$ (Prop. 7), the explicit solution to the transportation problem for $D = (\Delta^p)^\tau$, $d = 2$ (Prop. 8, Cor. 2), and for $D = (\Delta^2)^\tau$ (Cor. 3) — to the applied formulations: the ideal formulation for $D = (\Delta^p)^\tau$, ReLU, with $O(p\tau)$ separation (Cor. 5); the sharp formulation for $D = [L,U]^\eta$ (Prop. 10) with $O(\eta d)$ separation (Prop. 11); the ideal formulation for $D = [L,U]^\eta$, ReLU (Prop. 12) with $O(\eta)$ separation (Prop. 13); and the ideal formulation for $D = [L,U]^\eta$, leaky ReLU, with $O(\eta)$ separation (Prop. 14).]

Fig. 1: A roadmap of our results. The single arrows depict dependencies between our technical results, while the double arrows depict the way in which we use these results to establish our applied formulations in the paper. Notationally, $d$ is the number of pieces of the piecewise linear function, $[L,U]^\eta$ is an $\eta$-dimensional box, and $(\Delta^p)^\tau$ refers to a product of $\tau$ simplices, each with $p$ components.

Outside of applications in machine learning, the formulations presented in this paper also have connections with existing structures studied in the MIP and constraint programming communities, like indicator variables, on/off constraints, and convex envelopes [7,13,17,37,38,69]. Finally, this paper has connections with distributionally robust optimization, in that the primal-dual pair of sharp formulations presented can be viewed as equivalent to the discrete sup-sup duality result [21] for the marginal distribution model [35,58,77].


1.4 Starting assumptions and notation

We use the following notational conventions throughout the paper.

– The nonnegative orthant: $\mathbb{R}_{\geq 0} \overset{\text{def}}{=} \{ x \in \mathbb{R} \mid x \geq 0 \}$.
– The $n$-dimensional simplex: $\Delta^n \overset{\text{def}}{=} \{ x \in \mathbb{R}^n_{\geq 0} \mid \sum_{i=1}^n x_i = 1 \}$.
– The set of integers from 1 to $n$: $\llbracket n\rrbracket = \{1, \dots, n\}$.
– "Big-M" coefficients: $M^+(f; D) \overset{\text{def}}{=} \max_{x \in D} f(x)$ and $M^-(f; D) \overset{\text{def}}{=} \min_{x \in D} f(x)$.
– The dilation of a set: if $z \in \mathbb{R}_{\geq 0}$ and $D \subseteq \mathbb{R}^\eta$, then $z \cdot D \overset{\text{def}}{=} \{ zx \mid x \in D \}$.¹

Furthermore, we will make the following simplifying assumptions.

Assumption 1 The input domain D is a bounded polyhedron.

While a bounded input domain assumption will make the formulations and analysis considerably more difficult than the unbounded setting (see [7] for a similar phenomenon), it ensures that standard MIP representability conditions are satisfied (e.g. [72, Section 11]). Furthermore, variable bounds are natural for many applications (for example in verification problems), and are absolutely essential for ensuring reasonable dual bounds.

Assumption 2 Each neuron is irreducible: for any $k \in \llbracket d\rrbracket$, there exists some $x \in D$ where $f^k(x) > f^\ell(x)$ for each $\ell \neq k$.

Observe that if a neuron is not irreducible, this means that it is unnecessarily complex, and one or more of the affine functions can be completely removed. Moreover, the assumption can be verified in polynomial time via $d$ LPs by checking if $\max_{x, \Delta}\{ \Delta \mid x \in D,\ \Delta \leq f^k(x) - f^\ell(x)\ \forall \ell \neq k \} > 0$ for each $k \in \llbracket d\rrbracket$. In the special case where $d = 2$ (e.g. ReLU) and $D$ is a box, this can be done in linear time. Finally, if the assumption does not hold, it will not affect the validity of the formulations or cuts derived in this work, though certain results pertaining to non-redundancy or facet-defining properties may no longer hold.
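A small sketch of that check, formulating the $d$ LPs with scipy.optimize.linprog for a domain given explicitly as $D = \{x : Ax \leq c\}$; the data below are hypothetical and only illustrate the procedure described above.

```python
import numpy as np
from scipy.optimize import linprog

def irreducibility_margins(W, b, A, c):
    """For f^k(x) = W[k] @ x + b[k] over D = {x : A @ x <= c}, solve the d LPs
    max {Delta : x in D, Delta <= f^k(x) - f^l(x) for all l != k}.
    The neuron satisfies Assumption 2 iff every returned margin is > 0."""
    d, n = W.shape
    margins = []
    for k in range(d):
        cost = np.zeros(n + 1); cost[-1] = -1.0          # maximize Delta
        A_ub = [np.hstack([A, np.zeros((A.shape[0], 1))])]
        b_ub = [c]
        for l in range(d):
            if l == k:
                continue
            # Delta - (w^k - w^l) . x <= b^k - b^l
            A_ub.append(np.hstack([-(W[k] - W[l]), [1.0]]).reshape(1, -1))
            b_ub.append([b[k] - b[l]])
        res = linprog(cost, A_ub=np.vstack(A_ub), b_ub=np.concatenate(b_ub),
                      bounds=[(None, None)] * (n + 1))
        margins.append(-res.fun)
    return margins

# Hypothetical data: three affine functions on the box [0, 1]^2.
A = np.vstack([np.eye(2), -np.eye(2)]); c = np.array([1.0, 1.0, 0.0, 0.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]); b = np.array([0.0, 0.0, -10.0])
print(irreducibility_margins(W, b, A, c))  # ~ [1.0, 1.0, -10.0]: the third function never attains the max
```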

2 Motivating example: The ReLU nonlinearity over a box domain

Since the work of Glorot et al. [31], the ReLU neuron has become the workhorse of deep learning models. Despite the simplistic ReLU structure, neural networks formed by ReLU neurons are easy to train, reason about, and can model any continuous piecewise linear function [5]. Moreover, they can approximate any continuous, non-linear function to an arbitrary precision under a bounded width [36].

Accordingly, the general problem of maximizing (or minimizing) the output of a trained ReLU network is NP-hard [43]. Nonetheless, in this section we present the strongest possible MIP formulations for a single ReLU neuron without continuous auxiliary variables. As discussed at the end of Section 1.1 and corroborated in the computational study in Section 6, this leads to faster solve times for the entire ReLU network in practice.

¹ Note that if $D = \{ x \in \mathbb{R}^\eta \mid Ax \leq b \}$ is polyhedral, then $z \cdot D = \{ x \in \mathbb{R}^\eta \mid Ax \leq bz \}$.


2.1 A big-M formulation

To start, we will consider the ReLU in the simplest possible setting: where the input is univariate. Take the two-dimensional set $\mathrm{gr}(\mathrm{ReLU}; [l, u])$, where $[l, u]$ is some interval in $\mathbb{R}$ containing zero. It is straightforward to construct an ideal formulation for this univariate ReLU.

Proposition 1 An ideal formulation for $\mathrm{gr}(\mathrm{ReLU}; [l, u])$ is:

$y \geq x$ (3a)
$y \leq x - l(1 - z)$ (3b)
$y \leq uz$ (3c)
$(x, y, z) \in \mathbb{R} \times \mathbb{R}_{\geq 0} \times [0, 1]$ (3d)
$z \in \{0, 1\}$. (3e)

Proof Follows from inspection, or as a special case of Proposition 12 to be presented in Section 5.2. □

A more realistic setting would have an $\eta$-variate ReLU nonlinearity whose input is some $\eta$-variate affine function $f : [L, U] \to \mathbb{R}$, where $L, U \in \mathbb{R}^\eta$ and $[L, U] \overset{\text{def}}{=} \{ x \in \mathbb{R}^\eta \mid L_i \leq x_i \leq U_i\ \forall i \in \llbracket \eta\rrbracket \}$ (i.e. the $\eta$-variate ReLU nonlinearity given by $\mathrm{ReLU} \circ f$). The box input $[L, U]$ corresponds to known (finite) bounds on each component, which can typically be efficiently computed via interval arithmetic or other standard methods.
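For a box domain these bounds have a simple closed form; a minimal sketch with illustrative numbers:

```python
import numpy as np

def box_bounds(w, b, L, U):
    """Interval-arithmetic values of M^-(f; [L,U]) and M^+(f; [L,U])
    for the affine function f(x) = w . x + b over the box [L, U]."""
    w, L, U = map(np.asarray, (w, L, U))
    m_minus = b + np.sum(np.where(w >= 0, w * L, w * U))  # minimize each term
    m_plus  = b + np.sum(np.where(w >= 0, w * U, w * L))  # maximize each term
    return m_minus, m_plus

print(box_bounds([1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0]))  # (-1.5, 0.5)
```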

Observe that we can model the graph of the $\eta$-variate ReLU neuron as a simple composition of the graph of a univariate ReLU activation function and an $\eta$-variate affine function:

$$\mathrm{gr}(\mathrm{ReLU} \circ f; [L, U]) = \left\{ (x, y) \in \mathbb{R}^{\eta+1} \;\middle|\; (f(x), y) \in \mathrm{gr}(\mathrm{ReLU}; [m^-, m^+]),\ L \leq x \leq U \right\}, \qquad (4)$$

where $m^- \overset{\text{def}}{=} M^-(f; [L, U])$ and $m^+ \overset{\text{def}}{=} M^+(f; [L, U])$. Using formulation (3) as a submodel, we can write a formulation for the ReLU over a box domain as:

$y \geq f(x)$ (5a)
$y \leq f(x) - M^-(f; [L, U]) \cdot (1 - z)$ (5b)
$y \leq M^+(f; [L, U]) \cdot z$ (5c)
$(x, y, z) \in [L, U] \times \mathbb{R}_{\geq 0} \times [0, 1]$ (5d)
$z \in \{0, 1\}$. (5e)

This is the approach taken recently in the bevy of papers referenced in Section 1.3. Unfortunately, after the composition with the affine function $f$ over a box input domain, this formulation is no longer sharp.
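As an illustration of how the per-neuron formulation (5) composes across a network (cf. Section 1.1), the following sketch builds the big-M MIP for a small two-layer ReLU network, assuming Gurobi's Python interface (gurobipy) is available; the weights are hypothetical and the bound propagation uses the interval arithmetic discussed above.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def add_relu_layer(model, x, L, U, W, b):
    """Add big-M ReLU neurons y_j = max(0, W[j] . x + b[j]) via (5); return (y, L', U')."""
    y, L_out, U_out = [], [], []
    for w_j, b_j in zip(W, b):
        m_minus = float(b_j) + sum(float(w) * (l if w >= 0 else u) for w, l, u in zip(w_j, L, U))
        m_plus  = float(b_j) + sum(float(w) * (u if w >= 0 else l) for w, l, u in zip(w_j, L, U))
        f = gp.quicksum(float(w) * xi for w, xi in zip(w_j, x)) + float(b_j)
        yj = model.addVar(lb=0.0, ub=max(m_plus, 0.0))
        zj = model.addVar(vtype=GRB.BINARY)
        model.addConstr(yj >= f)                        # (5a)
        model.addConstr(yj <= f - m_minus * (1 - zj))   # (5b)
        model.addConstr(yj <= m_plus * zj)              # (5c)
        y.append(yj); L_out.append(0.0); U_out.append(max(m_plus, 0.0))
    return y, L_out, U_out

model = gp.Model()
L0, U0 = [0.0, 0.0], [1.0, 1.0]
x = [model.addVar(lb=l, ub=u) for l, u in zip(L0, U0)]
W1, b1 = np.array([[1.0, 1.0], [1.0, -1.0]]), np.array([-1.5, 0.0])
W2, b2 = np.array([[1.0, 2.0]]), np.array([0.2])
h1, L1, U1 = add_relu_layer(model, x, L0, U0, W1, b1)
out, _, _ = add_relu_layer(model, h1, L1, U1, W2, b2)
model.setObjective(out[0], GRB.MAXIMIZE)  # e.g. maximize the network output over the box
model.optimize()
print([v.X for v in x], out[0].X)
```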


[Figure 2: two three-dimensional plots over the box $[0,1]^2$, with axes $x_1$, $x_2$, and $y$.]

Fig. 2: For $f(x) = x_1 + x_2 - 1.5$: (Left) $\mathrm{Conv}(\mathrm{gr}(\mathrm{ReLU} \circ f; [0,1]^2))$, and (Right) the projection of the big-M formulation (5) to $(x, y)$-space, where we mark the point $(x, y) = ((1, 0), 0.25)$ that is not in the convex hull, but is valid for the projection of the big-M LP relaxation (6a)–(6d).

Example 1 If $f(x) = x_1 + x_2 - 1.5$, formulation (5) for $\mathrm{gr}(\mathrm{ReLU} \circ f; [0,1]^2)$ is

$y \geq x_1 + x_2 - 1.5$ (6a)
$y \leq x_1 + x_2 - 1.5 + 1.5(1 - z)$ (6b)
$y \leq 0.5z$ (6c)
$(x, y, z) \in [0,1]^2 \times \mathbb{R}_{\geq 0} \times [0,1]$ (6d)
$z \in \{0, 1\}$. (6e)

The point $(x, y, z) = ((1, 0), 0.25, 0.5)$ is feasible for the LP relaxation (6a)–(6d). However, observe that the inequality $y \leq 0.5 x_2$ is valid for $\mathrm{gr}(\mathrm{ReLU} \circ f; [0,1]^2)$, but is violated by $(x, y)$. Therefore, the formulation does not offer an exact convex relaxation (and, hence, is not ideal). See Figure 2 for an illustration: on the left, of the big-M formulation projected to $(x, y)$-space, and on the right, the tightest possible convex relaxation.
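A quick arithmetic check of Example 1 (plain Python, no solver needed):

```python
x1, x2, y, z = 1.0, 0.0, 0.25, 0.5
f = x1 + x2 - 1.5
# The point satisfies the LP relaxation (6a)-(6d) ...
assert y >= f and y <= f + 1.5 * (1 - z) and y <= 0.5 * z and y >= 0
# ... yet violates the valid inequality y <= 0.5 * x_2.
assert not (y <= 0.5 * x2)
print("feasible for (6a)-(6d) but outside the convex hull")
```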

Moreover, the integrality gap of (5) can be arbitrarily bad, even in fixed dimension $\eta$.

Example 2 Fix $\gamma \in \mathbb{R}_{\geq 0}$ and even $\eta \in \mathbb{N}$. Take the affine function $f(x) = \sum_{i=1}^\eta x_i$, the input domain $[L, U] = [-\gamma, \gamma]^\eta$, and $\hat{x} = \gamma \cdot (1, -1, \dots, 1, -1)$ as a scaled vector of alternating positive and negative ones. We can check that $(x, y, z) = (\hat{x}, \tfrac{1}{2}\gamma\eta, \tfrac{1}{2})$ is feasible for the LP relaxation of the big-M formulation (5). Additionally, $f(\hat{x}) = 0$, and any $y$ such that $(\hat{x}, y) \in \mathrm{Conv}(\mathrm{gr}(\mathrm{ReLU} \circ f; [L, U]))$ necessarily has $y = 0$. Therefore, there exists a fixed point $\hat{x}$ in the input domain where the tightest possible convex relaxation (for example, from a sharp formulation) is exact, but the big-M formulation deviates from this value by at least $\tfrac{1}{2}\gamma\eta$.

Intuitively, this example suggests that the big-M formulation can be particularly weak around the boundary of the input domain, as it cares only about the value $f(x)$ of the affine function, and not the particular input value $x$.


2.2 An ideal extended formulation

It is possible to produce an ideal extended formulation for the ReLU neuron by introducing a modest number of auxiliary continuous variables:

$(x, y) = (x^0, y^0) + (x^1, y^1)$ (7a)
$y^0 = 0 \geq w \cdot x^0 + b(1 - z)$ (7b)
$y^1 = w \cdot x^1 + bz \geq 0$ (7c)
$L(1 - z) \leq x^0 \leq U(1 - z)$ (7d)
$Lz \leq x^1 \leq Uz$ (7e)
$z \in \{0, 1\}$. (7f)

This is the standard "multiple choice" formulation for piecewise linear functions [75], which can also be derived from techniques due to Balas [8,9].

Although the multiple choice formulation offers the tightest possible convex relaxation for a single neuron, it requires a copy $x^0$ of the input variables (the copy $x^1$ can be eliminated using the equations (7a)). This means that when this formulation is applied to every neuron in the network to formulate NN, the total number of continuous variables required is $m_0 + \sum_{i=1}^r (m_{i-1} + 1) m_i$, where $m_i$ is the number of neurons in layer $i$. In contrast, the big-M formulation requires only $m_0 + \sum_{i=1}^r m_i$ continuous variables to formulate the entire network. As we will see in Section 6, the quadratic growth in size of the extended formulation can quickly become burdensome. Additionally, a folklore observation in the MIP community is that multiple choice formulations tend to not perform as well as expected in simplex-based branch-and-bound algorithms, likely due to degeneracy introduced by the block structure of the formulation [74].
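To make the size comparison concrete, a two-line count for a hypothetical architecture (layer widths chosen only for illustration):

```python
def count_continuous_vars(widths):
    """widths = [m_0, m_1, ..., m_r]; returns (big-M count, multiple-choice count)."""
    big_m = widths[0] + sum(widths[1:])
    multiple_choice = widths[0] + sum((widths[i - 1] + 1) * widths[i] for i in range(1, len(widths)))
    return big_m, multiple_choice

print(count_continuous_vars([784, 100, 100, 10]))  # (994, 90394)
```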

2.3 An ideal MIP formulation without auxiliary continuous variables

In this work, our most broadly useful contribution is the derivation of an ideal MIP formulation for the ReLU nonlinearity over a box domain that is non-extended; that is, it does not require additional auxiliary variables as in formulation (7). We informally summarize this main result as follows.

Main result for ReLU networks (informal)

There exists an explicit ideal nonextended formulation for the $\eta$-variate ReLU nonlinearity over a box domain, i.e. it requires only a single auxiliary binary variable. It has an exponential (in $\eta$) number of inequality constraints, each of which is facet-defining. However, it is possible to separate over this family of inequalities in time scaling linearly in $\eta$.

We defer the formal statement and proof to Section 5.2, after we have derived the requisite machinery. This is also the main result of the extended abstract version of this work [4], where it is derived through alternative means.
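As a preview of what that separation looks like, here is a minimal sketch based on the family of inequalities stated in the extended abstract [4]; the precise in-paper statement is deferred to Propositions 12–13 in Section 5.2, so treat the constants below as an illustration of the linear-time routine rather than as the formal result.

```python
def separate_ideal_relu(w, b, L, U, x, y, z, tol=1e-9):
    """Try to find a violated inequality from the exponential family in [4]
    for y = max(0, w . x + b) over the box [L, U], in time linear in len(w).
    Returns (index set I, right-hand side) of a most violated cut, or None."""
    # L_hat[i] minimizes w_i * x_i over [L_i, U_i]; U_hat[i] maximizes it.
    L_hat = [Li if wi >= 0 else Ui for wi, Li, Ui in zip(w, L, U)]
    U_hat = [Ui if wi >= 0 else Li for wi, Li, Ui in zip(w, L, U)]
    I, rhs = [], b * z
    for i, wi in enumerate(w):
        in_term = wi * (x[i] - L_hat[i] * (1 - z))   # contribution if i is placed in I
        out_term = wi * U_hat[i] * z                 # contribution if i is left out
        if in_term < out_term:                       # greedily pick the smaller one
            I.append(i); rhs += in_term
        else:
            rhs += out_term
    return (I, rhs) if y > rhs + tol else None

# The LP-relaxation point from Example 1: w = (1, 1), b = -1.5 on [0, 1]^2.
print(separate_ideal_relu([1.0, 1.0], -1.5, [0.0, 0.0], [1.0, 1.0],
                          x=[1.0, 0.0], y=0.25, z=0.5))
```

On that point the sketch returns the set $I$ containing only the second coordinate, i.e. the inequality $y \leq x_2 - 0.5z$, which is valid for the ReLU graph and cuts off $((1,0), 0.25, 0.5)$.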


3 Our general machinery: Formulations for the maximum of d affine functions

We will state our main structural results in the following generic setting. Take the maximum operator $\mathrm{Max}(v_1, \dots, v_d) = \max_{i=1}^d v_i$ over $d$ scalar inputs. We study the composition of this nonlinearity with $d$ affine functions $f^i : D \to \mathbb{R}$, with $f^i(x) = w^i \cdot x + b^i$, all sharing some bounded polyhedral domain $D$:

$$S^{\max} \overset{\text{def}}{=} \mathrm{gr}(\mathrm{Max} \circ (f^1, \dots, f^d); D) \equiv \Big\{ (x, y) \in D \times \mathbb{R} \;\Big|\; y = \max_{i=1}^d f^i(x) \Big\}.$$

This setting subsumes the ReLU over a box input domain presented in Section 2 as a special case with $d = 2$, $f^2(x) = 0$, and $D = [L, U]$. It also covers a number of other settings arising in modern deep learning, either by making $D$ more complex (e.g. one-hot encodings for categorical features), or by increasing $d$ (e.g. max pooling neurons often used in convolutional networks for image classification tasks [18], or in maxout networks [33]).

In this section, we present structural results that characterize the Cayley embedding [39,73,74,76] of $S^{\max}$. Take the set

$$S^{\mathrm{cayley}} \overset{\text{def}}{=} \bigcup_{k=1}^d \big\{ (x, f^k(x), e^k) \;\big|\; x \in D|_k \big\},$$

where $e^k$ is the unit vector where the $k$-th element is 1, and for each $k \in \llbracket d\rrbracket$,

$$D|_k \overset{\text{def}}{=} \Big\{ x \in D \;\Big|\; k \in \arg\max_{\ell=1}^d f^\ell(x) \Big\} \equiv \big\{ x \in D \;\big|\; w^k \cdot x + b^k \geq w^\ell \cdot x + b^\ell\ \forall \ell \neq k \big\}$$

is the portion of the input domain $D$ where $f^k$ attains the maximum. The Cayley embedding of $S^{\max}$ is the convex hull of this set: $R^{\mathrm{cayley}} \overset{\text{def}}{=} \mathrm{Conv}(S^{\mathrm{cayley}})$. The following straightforward observation holds directly from definition.

Observation 1 The set $R^{\mathrm{cayley}}$ is a bounded polyhedron, and an ideal formulation for $S^{\max}$ is given by the system $\big\{ (x, y, z) \in R^{\mathrm{cayley}} \;\big|\; z \in \{0,1\}^d \big\}$.

Therefore, if we can produce an explicit inequality description for $R^{\mathrm{cayley}}$, we immediately have an ideal MIP formulation for $S^{\max}$. Indeed, we have already seen an extended representation in (7) when $d = 2$, which we now state in the more general setting (projecting out the superfluous copies of $y$).

Proposition 2 An ideal MIP formulation for $S^{\max}$ is:

$(x, y) = \sum_{k=1}^d \big( \tilde{x}^k,\ w^k \cdot \tilde{x}^k + b^k z_k \big)$ (8a)
$\tilde{x}^k \in z_k \cdot D|_k \quad \forall k \in \llbracket d\rrbracket$ (8b)
$z \in \Delta^d$ (8c)
$z \in \{0, 1\}^d$. (8d)

Denote its LP relaxation by $R^{\mathrm{extended}} = \big\{ (x, y, z, \tilde{x}^1, \dots, \tilde{x}^d) \;\big|\; \text{(8a)–(8c)} \big\}$. Then $\mathrm{Proj}_{x,y,z}(R^{\mathrm{extended}}) = R^{\mathrm{cayley}}$.


Proof Follows directly from results in [8,9]. □

Although this extended formulation is ideal and polynomially sized, it can exhibit poor practical performance, as noted in Section 2.2 and corroborated in the computational experiments in Section 6. Observe that, from the definition, constraint (8b) for a given $k \in \llbracket d\rrbracket$ is equivalent to the set of constraints $\tilde{x}^k \in z_k \cdot D$ and $w^k \cdot \tilde{x}^k + b^k z_k \geq w^\ell \cdot \tilde{x}^k + b^\ell z_k$ for each $\ell \neq k$.

3.1 A recipe for constructing ideal formulations

Our goal in this section is to derive generic tools that allow us to build ideal formulations for $S^{\max}$ via the Cayley embedding.

3.1.1 A primal characterization

Our first structural result provides a characterization for the Cayley embedding. Although it is not an explicit polyhedral characterization, we will subsequently see how it can be massaged into a more practically amenable form.

Take the system

$y \leq \overline{g}(x, z)$ (9a)
$y \geq \underline{g}(x, z)$ (9b)
$(x, y, z) \in D \times \mathbb{R} \times \Delta^d$, (9c)

where

$$\overline{g}(x, z) \overset{\text{def}}{=} \max_{\tilde{x}^1, \dots, \tilde{x}^d} \left\{ \sum_{k=1}^d w^k \cdot \tilde{x}^k + b^k z_k \;\middle|\; x = \textstyle\sum_k \tilde{x}^k,\ \tilde{x}^k \in z_k \cdot D|_k\ \forall k \in \llbracket d\rrbracket \right\}$$

$$\underline{g}(x, z) \overset{\text{def}}{=} \min_{\tilde{x}^1, \dots, \tilde{x}^d} \left\{ \sum_{k=1}^d w^k \cdot \tilde{x}^k + b^k z_k \;\middle|\; x = \textstyle\sum_k \tilde{x}^k,\ \tilde{x}^k \in z_k \cdot D|_k\ \forall k \in \llbracket d\rrbracket \right\},$$

and define the set $R^{\mathrm{ideal}} \overset{\text{def}}{=} \{ (x, y, z) \mid \text{(9)} \}$. Note that $R^{\mathrm{ideal}}$ can be thought of as (8) with the $\tilde{x}^k$ variables implicitly projected out.

Proposition 3 The set $R^{\mathrm{ideal}}$ is polyhedral, and $R^{\mathrm{ideal}} = R^{\mathrm{cayley}}$.

Proof By Proposition 2, it suffices to show that $R^{\mathrm{ideal}} = \mathrm{Proj}_{x,y,z}(R^{\mathrm{extended}})$. We start by observing that $\overline{g}$ (respectively $\underline{g}$) is concave (resp. convex) in its inputs, as it is the value function of a linear program (cf. [14, Theorem 5.1]). Therefore, the set of points satisfying (9) is convex.

Let $(x, y, z)$ be an extreme point of $R^{\mathrm{ideal}}$, which exists as $R^{\mathrm{ideal}}$ is convex. Then it must satisfy either (9a) or (9b) at equality, as otherwise it is a convex combination of $(x, y - \epsilon, z)$ and $(x, y + \epsilon, z)$ for some $\epsilon > 0$. Take $\tilde{x}^1, \dots, \tilde{x}^d$ that optimize $\overline{g}(x, z)$ or $\underline{g}(x, z)$, depending on which constraint is satisfied at equality. Then $(x, y, z, \tilde{x}^1, \dots, \tilde{x}^d) \in R^{\mathrm{extended}}$. In other words, $\mathrm{ext}(R^{\mathrm{ideal}}) \subseteq \mathrm{Proj}_{x,y,z}(R^{\mathrm{extended}})$, and thus $R^{\mathrm{ideal}} \subseteq \mathrm{Proj}_{x,y,z}(R^{\mathrm{extended}})$ by convexity.

Conversely, let $(x, y, z, \tilde{x}^1, \dots, \tilde{x}^d) \in R^{\mathrm{extended}}$. Then $y = \sum_{k=1}^d (w^k \cdot \tilde{x}^k + b^k z_k) \leq \overline{g}(x, z)$, as $\{\tilde{x}^k\}_{k=1}^d$ is feasible for the optimization problem in $\overline{g}(x, z)$. Likewise, $y \geq \underline{g}(x, z)$, and (9c) is trivially satisfied. Therefore, $(x, y, z) \in R^{\mathrm{ideal}}$.

Polyhedrality of $R^{\mathrm{ideal}}$ then follows as $R^{\mathrm{cayley}}$ is itself a polyhedron. □

Note that the input domain constraint $x \in D$ is implied by the constraints (9a)–(9b), and therefore does not need to be explicitly included in this description. However, we include it here in our description for clarity. Moreover, observe that $\overline{g}$ is a function from $D \times \Delta^d$ to $\mathbb{R} \cup \{-\infty\}$, since the optimization problem may be infeasible, but is always bounded from above since $D$ is bounded. Likewise, $\underline{g}$ is a function from $D \times \Delta^d$ to $\mathbb{R} \cup \{+\infty\}$.

3.1.2 A dual characterization

From Proposition 3, we can derive a more useful characterization by applying Lagrangian duality to the LPs describing the envelopes $\overline{g}(x, z)$ and $\underline{g}(x, z)$.

Proposition 4 The Cayley embedding $R^{\mathrm{cayley}}$ is equal to all $(x, y, z)$ satisfying

$y \leq \alpha \cdot x + \sum_{k=1}^d \big( \max_{x^k \in D|_k} \{ (w^k - \alpha) \cdot x^k \} + b^k \big) z_k \quad \forall \alpha \in \mathbb{R}^\eta$ (10a)
$y \geq \alpha \cdot x + \sum_{k=1}^d \big( \min_{x^k \in D|_k} \{ (w^k - \alpha) \cdot x^k \} + b^k \big) z_k \quad \forall \alpha \in \mathbb{R}^\eta$ (10b)
$(x, y, z) \in D \times \mathbb{R} \times \Delta^d$. (10c)

Proof Consider the upper bound inequalities for $y$ in Proposition 3, that is, (9a). Now apply the change of variables $x^k \leftarrow \tilde{x}^k / z_k$ for each $k \in \llbracket d\rrbracket$ and, for all $(x, z)$, take the Lagrangian dual of the optimization problem in $\overline{g}(x, z)$ with respect to the constraint $x = \sum_k x^k z_k$. Note that the duality gap is zero since the problem is an LP. We then obtain that

$$\overline{g}(x, z) = \min_\alpha \max_{x^k \in D|_k} \sum_{k=1}^d \big( w^k \cdot x^k + b^k \big) z_k + \alpha \cdot \Big( x - \sum_{k=1}^d x^k z_k \Big) = \min_\alpha\ \alpha \cdot x + \sum_{k=1}^d \Big( \max_{x^k \in D|_k} \{ (w^k - \alpha) \cdot x^k \} + b^k \Big) z_k.$$

In other words, we can equivalently express (9a) via the family of inequalities (10a). The same can be done with (9b), yielding the set of inequalities (10b). Therefore, $\{ (x, y, z) \mid \text{(10)} \} = \mathrm{Proj}_{x,y,z}(R^{\mathrm{extended}}) = R^{\mathrm{cayley}}$. □


This gives an exact outer description for the Cayley embedding in terms of an infinite number of linear inequalities. Despite this, the formulation enjoys a simple, interpretable form: we can view the inequalities as choosing coefficients on $x$ and individually tightening the coefficients on $z$ according to explicitly described LPs. Similar to the relationship between (8) and (9), (10) can be seen as a simplification of the standard cut generation LP for unions of polyhedra [8,9]. In later sections, we will see that this decoupling is helpful to simplify (a variant of) this formulation for special cases.

Separating a point $(x, y, z)$ over (10a) can be done by evaluating $\overline{g}(x, z)$ in the form (10a) (and, for (10b), the analogous form of $\underline{g}(x, z)$). As typically done when using Lagrangian relaxation, this optimization problem can be solved via a subgradient or bundle method, where each subgradient can be computed by solving the inner LP in (10a) for all $x^k \in D|_k$. Observe that any feasible solution $\alpha$ for the optimization problem in (10a) yields a valid inequality. However, this optimization problem is unbounded when $(x, z) \notin \mathrm{Proj}_{x,z}(R^{\mathrm{cayley}})$ (i.e. when the primal form of $\overline{g}(x, z)$ is infeasible). In other words, as illustrated in Figure 3, $\overline{g}$ is an extended real valued function such that $\overline{g}(x, z) \in \mathbb{R} \cup \{-\infty\}$, so care must be taken to avoid numerical instabilities when separating a point $(x, z)$ where $\overline{g}(x, z) = -\infty$.²

² As is standard in a Benders' decomposition approach, we can address this by adding a feasibility cut describing the domain of $\overline{g}$ (the region where it is finite valued) instead of an optimality cut of the form (10a).
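The following is a minimal sketch of that subgradient separation loop, assuming the domain is given explicitly as $D = \{x : Ax \leq c\}$ and using scipy.optimize.linprog for the inner LPs over $D|_k$; it illustrates the procedure described above rather than an optimized implementation.

```python
import numpy as np
from scipy.optimize import linprog

def separate_cayley_upper(W, b, A, c, x_hat, y_hat, z_hat, iters=200):
    """Subgradient sketch for separating (x_hat, y_hat, z_hat) over (10a).
    D = {x : A x <= c}; W[k], b[k] define f^k. Assumes each D|_k is nonempty
    (Assumption 2), so the inner LPs are always feasible and bounded.
    Returns (alpha, value) of a violated inequality, or None."""
    d, n = W.shape
    best_val, best_alpha = np.inf, None
    alpha = np.zeros(n)
    for t in range(1, iters + 1):
        val, subgrad = float(alpha @ np.asarray(x_hat)), np.array(x_hat, dtype=float)
        for k in range(d):
            # D|_k : A x <= c and (w^l - w^k) . x <= b^k - b^l for all l != k
            A_k = np.vstack([A] + [(W[l] - W[k]).reshape(1, -1) for l in range(d) if l != k])
            c_k = np.concatenate([c, [b[k] - b[l] for l in range(d) if l != k]])
            res = linprog(-(W[k] - alpha), A_ub=A_k, b_ub=c_k, bounds=[(None, None)] * n)
            val += z_hat[k] * (-res.fun + b[k])   # inner max over D|_k, plus b^k
            subgrad -= z_hat[k] * res.x
        if val < best_val:
            best_val, best_alpha = val, alpha.copy()
        alpha = alpha - (1.0 / t) * subgrad       # diminishing step size
    return (best_alpha, best_val) if y_hat > best_val + 1e-6 else None
```

Consistent with the discussion above, every iterate `alpha` already yields a valid inequality, so the loop can stop as soon as a violated one is found; when $(x, z)$ lies outside $\mathrm{Proj}_{x,z}(R^{\mathrm{cayley}})$ the best value keeps decreasing and a feasibility cut should be added instead.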

3.2 A recipe for constructing hereditarily sharp formulations

Although Proposition 4 gives a separation-based way to optimize over $S^{\max}$, there are two potential downsides to this approach. First, it does not give us an explicit, finite description for a MIP formulation that we can directly pass to a MIP solver. Second, the separation problem requires optimizing over $D|_k$, which may be substantially more complicated than optimizing over $D$ (for example, if $D$ is a box).

Therefore, in this section we set our sights slightly lower and present a similar technique to derive sharp MIP formulations for $S^{\max}$. Furthermore, we will see that our formulations trivially satisfy the hereditary sharpness property. In the coming sections, we will see how we can deploy these results in a practical manner, and study settings in which the simpler sharp formulation will also, in fact, be ideal.

3.2.1 A primal characterization

Consider the system

$y \leq h(x, z)$ (11a)
$y \geq w^k \cdot x + b^k \quad \forall k \in \llbracket d\rrbracket$ (11b)
$(x, y, z) \in D \times \mathbb{R} \times \Delta^d$, (11c)

where

$$h(x, z) \overset{\text{def}}{=} \max_{\tilde{x}^1, \dots, \tilde{x}^d} \left\{ \sum_{k=1}^d (w^k \cdot \tilde{x}^k + b^k z_k) \;\middle|\; x = \textstyle\sum_k \tilde{x}^k,\ \tilde{x}^k \in z_k \cdot D\ \forall k \in \llbracket d\rrbracket \right\}.$$

Take the set $R^{\mathrm{sharp}} \overset{\text{def}}{=} \{ (x, y, z) \mid \text{(11)} \}$.

It is worth dwelling on the differences between the systems (9) and (11). First, we have completely replaced the constraint (9b) with $d$ explicit linear inequalities (11b). Second, when replacing $\overline{g}$ with $h$ we have replaced the inner maximization over $D|_k$ with an inner maximization over $D$ (modulo constant scaling factors). As we will see in Section 5.1, this is particularly advantageous when $D$ is trivial to optimize over (for example, a simplex or a box), allowing us to write these constraints in closed form, whereas optimizing over $D|_k$ may be substantially more difficult (i.e. requiring an LP solve).

First, we have completely replaced the constraint (9b) with d explicit linearinequalities (11b). Second, when replacing g with h we have replaced the innermaximization over D|k with an inner maximization over D (modulo constantscaling factors). As we will see in Section 5.1, this is particularly advantageouswhen D is trivial to optimize over (for example, a simplex or a box), allowingus to write these constraints in closed form, whereas optimizing over D|k maybe substantially more difficult (i.e. requiring an LP solve).

Furthermore, we will show that while (11) is not ideal, it is hereditarilysharp, and so in general may offer a strictly stronger relaxation than a standardsharp formulation. In particular, the formulation may be stronger than a sharpformulation constructed by composing a big-M formulation, along with anexact convex relaxation in the px, yq-space produced, for example, by studyingthe upper concave envelope of the function Max ˝ pf1, . . . , fdq.

Proposition 5 The system describing $\big\{ (x, y, z) \in R^{\mathrm{sharp}} \;\big|\; z \in \{0,1\}^d \big\}$ is a hereditarily sharp MIP formulation of $S^{\max}$.

Proof For the result, we must show four properties: polyhedrality of $R^{\mathrm{sharp}}$, validity of the formulation whose LP relaxation is $R^{\mathrm{sharp}}$, sharpness, and then hereditary sharpness. We proceed in that order.

To show polyhedrality, consider a fixed value $(x, y, z)$ feasible for (11), and presume that we express the domain via the linear inequality constraints $D = \{ x \mid Ax \leq c \}$. First, observe that due to (11a) and (11b), $h(x, z)$ is bounded from below. Now, using LP duality, we may rewrite

$$h(x, z) = \max_{\tilde{x}^1, \dots, \tilde{x}^d} \left\{ \sum_{k=1}^d w^k \cdot \tilde{x}^k \;\middle|\; x = \textstyle\sum_k \tilde{x}^k,\ A\tilde{x}^k \leq z_k c\ \forall k \in \llbracket d\rrbracket \right\} + \sum_{k=1}^d b^k z_k = \min_{(\alpha, \beta^1, \dots, \beta^d) \in R} \left\{ \alpha \cdot x + \sum_{k=1}^d c \cdot \beta^k z_k \right\} + \sum_{k=1}^d b^k z_k,$$

where $R$ is a polyhedron that is independent of $x$ and $z$. Therefore, as (a) the above optimization problem is linear with $x$ and $z$ fixed, and (b) $h(x, z)$ is bounded from below, we may replace $R$ with $\mathrm{ext}(R)$ in the above optimization problem. In other words, $h(x, z)$ is equal to the minimum of a finite number of alternatives which are affine in $x$ and $z$. Therefore, $h$ is a concave continuous piecewise linear function, and so $R^{\mathrm{sharp}}$ is polyhedral.

To show validity, we must have that $\mathrm{Proj}_{x,y}\big( R^{\mathrm{sharp}} \cap (\mathbb{R}^\eta \times \mathbb{R} \times \{0,1\}^d) \big) = S^{\max}$. Observe that if $(x, y, z) \in R^{\mathrm{sharp}} \cap (\mathbb{R}^\eta \times \mathbb{R} \times \{0,1\}^d)$, then $z = e^\ell$ for some $\ell \in \llbracket d\rrbracket$, and

$$h(x, z) = \max_{\tilde{x}^1, \dots, \tilde{x}^d} \left\{ \sum_{k=1}^d w^k \cdot \tilde{x}^k + b^\ell \;\middle|\; x = \textstyle\sum_k \tilde{x}^k,\ \tilde{x}^\ell \in D,\ \tilde{x}^k = 0^\eta\ \forall k \neq \ell \right\} = w^\ell \cdot x + b^\ell,$$

where the first equality follows as $\tilde{x}^k = 0^\eta$ for each $k \neq \ell$ (recall that if $D$ is bounded, then $0 \cdot D = \{ x \in \mathbb{R}^\eta \mid Ax \leq 0 \} = \{0^\eta\}$). Along with (11b), this implies that $y = w^\ell \cdot x + b^\ell$, and that $y \geq w^k \cdot x + b^k$ for each $k \neq \ell$, giving the result.

To show sharpness, we must prove that $\mathrm{Proj}_{x,y}(R^{\mathrm{sharp}}) = \mathrm{Conv}(S^{\max})$. First, recall from Proposition 2 that $\mathrm{Conv}(S^{\max}) = \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$; thus, we state our proof in terms of $R^{\mathrm{extended}}$. We first show that $\mathrm{Proj}_{x,y}(R^{\mathrm{extended}}) \subseteq \mathrm{Proj}_{x,y}(R^{\mathrm{sharp}})$. Take $(x, y, z, \{x^k\}_{k=1}^d) \in R^{\mathrm{extended}}$. Then $y = \sum_{k=1}^d (w^k \cdot x^k + b^k z_k) \leq h(x, z)$, as $\{x^k\}_{k=1}^d$ is feasible for the optimization problem in $h(x, z)$. It also holds that $y \geq w^k \cdot x + b^k$ for all $k \in \llbracket d\rrbracket$ and $x \in D$ directly from the definition of $S^{\max}$, giving the result.

Next, we show that $\mathrm{Proj}_{x,y}(R^{\mathrm{sharp}}) \subseteq \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$. This proof is similar to the proof of Proposition 3, except that we choose $z$ that simplifies the constraints. It suffices to show that $\mathrm{ext}(\mathrm{Proj}_{x,y}(R^{\mathrm{sharp}})) \subseteq \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$.

Let $(x, y) \in \mathrm{ext}(\mathrm{Proj}_{x,y}(R^{\mathrm{sharp}}))$. Define $h(x) \overset{\text{def}}{=} \max_z \{ h(x, z) \mid z \in \Delta^d \}$. Then either $(x, y)$ satisfies $y = h(x)$, or it satisfies (11b) at equality for some $k \in \llbracket d\rrbracket$, since otherwise $(x, y)$ is a convex combination of the points $(x, y - \epsilon)$ and $(x, y + \epsilon)$ feasible for $\mathrm{Proj}_{x,y}(R^{\mathrm{sharp}})$ for some $\epsilon > 0$. We show that in either case, $(x, y) \in \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$.

Case 1: Suppose that for some $j \in \llbracket d\rrbracket$, $(x, y)$ satisfies the corresponding inequality in (11b) at equality; that is, $y = w^j \cdot x + b^j$. Then the point $(x, y, e^j, \{x^k\}_{k=1}^d) \in R^{\mathrm{extended}}$, where $x^j = x$ and $x^\ell = 0$ if $\ell \neq j$. Hence, $(x, y) \in \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$.

Case 2: Suppose $(x, y)$ satisfies $y = h(x)$. Let $z$ be an optimal solution for the optimization problem defining $h(x)$, and $\{x^k\}_{k=1}^d$ be an optimal solution for $h(x, z)$. By design, $(x, y, z, \{x^k\}_{k=1}^d)$ satisfies all constraints in $R^{\mathrm{extended}}$, except potentially constraint (8b).

We show that constraint (8b) is satisfied as well. Suppose not, for contradiction; that is, $w^k \cdot x^k + b^k z_k < w^\ell \cdot x^k + b^\ell z_k$ for some pair $k, \ell \in \llbracket d\rrbracket$, $\ell \neq k$. Consider the solution $(\{\hat{x}^k\}_{k=1}^d, \hat{z})$ identical to $(\{x^k\}_{k=1}^d, z)$ except that $\hat{z}_k = 0$, $\hat{x}^k = 0^\eta$, $\hat{z}_\ell = z_\ell + z_k$, and $\hat{x}^\ell = x^\ell + x^k$. By inspection, this solution is feasible for $h(x)$. The objective value changes by $-(w^k \cdot x^k + b^k z_k) + (w^\ell \cdot x^k + b^\ell z_k) > 0$, contradicting the optimality of $(\{x^k\}_{k=1}^d, z)$. Therefore, $(x, y, z, \{x^k\}_{k=1}^d) \in R^{\mathrm{extended}}$, and thus $(x, y) \in \mathrm{Proj}_{x,y}(R^{\mathrm{extended}})$.

Finally, we observe that hereditary sharpness follows from the definition of $h$. In particular, fixing any $z_k = 0$ implies that $\tilde{x}^k = 0^\eta$ in the maximization problem defining $h$. In other words, the variables $\tilde{x}^k$ and $z_k$ drop completely from the maximization problem defining $h$, meaning that it is equal to the corresponding version of $h$ with the function $k$ completely dropped as input. Additionally, if any $z_k = 1$, then $z_\ell = 0$ for each $\ell \neq k$ since $z \in \Delta^d$, and hence $\tilde{x}^\ell = 0^\eta$. In this case, $h(x, z) = w^k \cdot x + b^k$, which gives the result. □

3.2.2 A dual characterization

We can apply a duality-based approach to produce an (albeit infinite) linear inequality description for the set $R^{\mathrm{sharp}}$, analogous to Section 3.1.2.

Proposition 6 The set $R^{\mathrm{sharp}}$ is equal to all $(x, y, z)$ such that

$y \leq \alpha \cdot x + \sum_{k=1}^d \big( \max_{x^k \in D} \{ (w^k - \alpha) \cdot x^k \} + b^k \big) z_k \quad \forall \alpha \in \mathbb{R}^\eta$ (12a)
$y \geq w^k \cdot x + b^k \quad \forall k \in \llbracket d\rrbracket$ (12b)
$(x, y, z) \in D \times \mathbb{R} \times \Delta^d$. (12c)

Proof Follows in an analogous manner as in Proposition 4. □

Figure 3 depicts slices of the functions $\overline{g}(x, z)$, $\underline{g}(x, z)$, and $h(x, z)$, created by fixing some value of $z$ and varying $x$. Observe that $\overline{g}(x, z)$ can be viewed as the largest value for $y$ such that $(x, y)$ can be written as a convex combination of points in the graph using convex multipliers $z$. Likewise, $\underline{g}(x, z)$ can be interpreted as the minimum value for $y$. In $h(x, z)$, we relax $D|_k$ to $D$, and thus we can interpret it similarly to $\overline{g}(x, z)$, except that we may take convex combinations of points constructed by evaluating the affine functions at any point in the domain, not only those where the given function attains the maximum. Figure 3b shows that, in general, $h(x, z)$ can be strictly looser than $\overline{g}(x, z)$ for $(x, z) \in \mathrm{Proj}_{x,z}(R^{\mathrm{cayley}})$. A similar situation occurs for the lower envelopes as illustrated by Figure 3d. However, we prove in the next section that this does not occur for $d = 2$, along with other desirable properties in special cases.

4 Simplifications to our machinery under common special cases

In this section we study how our dual characterizations in Propositions 4 and 6 simplify under common special cases of the number of input affine functions $d$ and of the input domain $D$.

4.1 Simplifications when $d = 2$

When we consider taking the maximum of only two affine functions (i.e. $d = 2$), we can prove that $R^{\mathrm{sharp}}$ is, in fact, ideal.

We start by returning to Proposition 3 and show that it can be greatly simplified when $d = 2$. We first show that $\underline{g}(x, z)$ can be replaced by the maximum of the affine functions, as illustrated in Figure 3c, although this is not possible for $d > 2$, as seen in Figure 3d.


[Figure 3: four one-dimensional plots of upper and lower envelopes.]
(a) Upper bounds $\overline{g}(x, \bar{z})$ and $h(x, \bar{z})$ for $g(x) = \max\{0,\ x - 2\}$, $\bar{z} = (\tfrac{1}{2}, \tfrac{1}{2})$.
(b) Upper bounds $\overline{g}(x, \bar{z})$ and $h(x, \bar{z})$ for $g(x) = \max\{-x + 1,\ 0,\ x - 2\}$, $\bar{z} = (0, \tfrac{1}{2}, \tfrac{1}{2})$.
(c) Lower bounds $\underline{g}(x, \bar{z})$ for $g(x) = \max\{0,\ x - 2\}$, $\bar{z} = (\tfrac{1}{2}, \tfrac{1}{2})$.
(d) Lower bounds $\underline{g}(x, \bar{z})$ for $g(x) = \max\{-x + 1,\ 0,\ x - 2\}$, $\bar{z} = (\tfrac{1}{2}, 0, \tfrac{1}{2})$.

Fig. 3: Examples of the functions $\overline{g}(x, z)$, $h(x, z)$, and $\underline{g}(x, z)$ defined in (9) and (11) with some fixed value $\bar{z}$ for $z$. Here, $g(x)$ is the "maximum" function of interest, defined below each subfigure. Figures (a) and (c) depict the case $d = 2$ (maximum of two affine functions), while (b) and (d) illustrate when $d = 3$, each pair emphasizing upper and lower bounds respectively. Note that $\overline{g}(x, \bar{z})$ and $h(x, \bar{z})$ coincide in (a) for $x \in [1, 2.5]$, and in (b) for $x \in [2, 2.5]$. The thick solid lines represent $\overline{g}(x, \bar{z})$ in (a–b) and $\underline{g}(x, \bar{z})$ in (c–d), whereas the dashed lines correspond to $h(x, \bar{z})$. The thin solid lines represent $S^{\max}$ and the shaded region is the slice of $R^{\mathrm{cayley}}$ with $z = \bar{z}$.

Lemma 1 If $d = 2$, then at any values of $(x, z)$ where $\overline{g}(x, z) \geq \underline{g}(x, z)$ (i.e. there exists a $y$ such that $(x, y, z) \in R^{\mathrm{cayley}}$), we have

$$\underline{g}(x, z) = \max\{ w^1 \cdot x + b^1,\ w^2 \cdot x + b^2 \}.$$

Proof For convenience, we work with the change of variables $x^k \leftarrow \tilde{x}^k / z_k$ for each $k \in \llbracket 2\rrbracket$. Suppose without loss of generality that $x \in D|_2$, and consider any feasible solution $(x^1, x^2)$ to the optimization problem for $\underline{g}(x, z)$; such a solution exists by the assumption that $(x, z) \in \mathrm{Proj}_{x,z}(R^{\mathrm{cayley}})$. We will show that $(w^1 \cdot x^1 + b^1) z_1 + (w^2 \cdot x^2 + b^2) z_2 \geq w^2 \cdot x + b^2$. We assume that $z_2 > 0$, since otherwise $x = x^1 \in D|_1 \cap D|_2$ and the result is immediate.

Since $x = x^1 z_1 + x^2 z_2$ and $z \in \Delta^2$, the line segment joining $x^1$ to $x^2$ contains $x$. Furthermore, since $x^1 \in D|_1$ and $x^2 \in D|_2$, this line segment also intersects the hyperplane $\{ x \in \mathbb{R}^\eta \mid w^1 \cdot x + b^1 = w^2 \cdot x + b^2 \}$. Let $x'$ denote this point of intersection, and let $z' \in \Delta^2$ be such that $x' = x^1 z'_1 + x^2 z'_2$. Since $x \in D|_2$, we know that $x'$ is closer to $x^1$ than $x$, i.e. $z'_1 \geq z_1$. Moreover, take the point $x''$ on this line segment such that $x = x' z_1 + x'' z_2$, where $x'' = x^1 z''_1 + x^2 z''_2$ for some $z'' \in \Delta^2$. We have $z''_1 \leq z_1$ since $x''$ is further away from $x^1$ than $x$. Note that $x' \in D|_1 \cap D|_2$ while $x'' \in D|_2$, and thus $(x', x'')$ is feasible.

It can be computed that $z''_1 = \frac{z_1 z'_2}{z_2}$, which implies that $z_1 = z_1(z'_1 + z'_2) = z_1 z'_1 + z_2 z''_1$ and $z_2 = z_2(z''_1 + z''_2) = z_1 z'_2 + z_2 z''_2$. Using these two identities, we obtain

$$(w^1 \cdot x^1 + b^1) z_1 + (w^2 \cdot x^2 + b^2) z_2 = (w^1 \cdot x^1 + b^1)(z_1 z'_1 + z_2 z''_1) + (w^2 \cdot x^2 + b^2)(z_1 z'_2 + z_2 z''_2)$$
$$= \big( (w^1 \cdot x^1 + b^1) z'_1 + (w^2 \cdot x^2 + b^2) z'_2 \big) z_1 + \big( (w^1 \cdot x^1 + b^1) z''_1 + (w^2 \cdot x^2 + b^2) z''_2 \big) z_2$$
$$= \big( f(x^1) z'_1 + f(x^2) z'_2 \big) z_1 + \big( f(x^1) z''_1 + f(x^2) z''_2 \big) z_2,$$

where we let $f(x)$ denote the function $\max\{ w^1 \cdot x + b^1,\ w^2 \cdot x + b^2 \}$, recalling that $x^1 \in D|_1$ and $x^2 \in D|_2$. Since $f(x)$ is convex, by Jensen's inequality the preceding expression is at least $f(x^1 z'_1 + x^2 z'_2) z_1 + f(x^1 z''_1 + x^2 z''_2) z_2$. The preceding expression equals $(w^2 \cdot x' + b^2) z_1 + (w^2 \cdot x'' + b^2) z_2$ by the definitions of $x'$ and $x''$, and the fact that they both lie in $D|_2$. Recalling the equation $x = x' z_1 + x'' z_2$ used to select $x''$ completes the proof. □

Moreover, we show that when $d = 2$ we can replace $\overline{g}$ with $h$, as illustrated in Figure 3a. This property may not hold when $d > 2$, as shown in Figure 3b.

Lemma 2 If $d = 2$, then at any values of $(x, z)$ where $\overline{g}(x, z) \geq \underline{g}(x, z)$ (i.e. there exists a $y$ such that $(x, y, z) \in R^{\mathrm{cayley}}$), we have

$$\overline{g}(x, z) = \max_{\tilde{x}^1, \tilde{x}^2} \left\{ w^1 \cdot \tilde{x}^1 + b^1 z_1 + w^2 \cdot \tilde{x}^2 + b^2 z_2 \;\middle|\; x = \tilde{x}^1 + \tilde{x}^2,\ \tilde{x}^1 \in z_1 \cdot D,\ \tilde{x}^2 \in z_2 \cdot D \right\}. \qquad (13)$$

Proof We show that despite expanding the feasible set of the optimization problem in (13) by replacing $D|_k$ by $D$, its optimal value is no greater when $(x, z)$ is such that $\overline{g}(x, z) \geq \underline{g}(x, z)$. It suffices to show without loss of generality that $w^1 \cdot \tilde{x}^1 + b^1 z_1 \geq w^2 \cdot \tilde{x}^1 + b^2 z_1$ holds for any optimal $\tilde{x}^1, \tilde{x}^2$. By the assumption on $(x, z)$ and Lemma 1, we have $\overline{g}(x, z) \geq w^2 \cdot x + b^2$, which implies the existence of some optimal $\tilde{x}^1, \tilde{x}^2$. That is, we have $w^2 \cdot (x - \tilde{x}^1) + b^2 z_2 + w^1 \cdot \tilde{x}^1 + b^1 z_1 \geq w^2 \cdot x + b^2$, which is equivalent to $w^1 \cdot \tilde{x}^1 + b^1 z_1 \geq w^2 \cdot \tilde{x}^1 + b^2 z_1$. □

After observing that these simplifications are identical to those presented in Proposition 6, we obtain the following corollary promised at the beginning of the section.

Corollary 1 When $d = 2$, $\big\{ (x, y, z) \in R^{\mathrm{sharp}} \;\big|\; z \in \{0,1\}^d \big\}$ is an ideal MIP formulation of $S^{\max}$.


Proof Lemmas 1 and 2 imply that $R^{\mathrm{sharp}} = R^{\mathrm{ideal}}$, while Proposition 3 implies that $R^{\mathrm{ideal}} = R^{\mathrm{cayley}}$, completing the chain and giving the result. □

In later sections, we will study conditions under which we can produce an explicit inequality description for $R^{\mathrm{sharp}}$.

4.2 Simplifications when D is the product of simplices

In this section, we consider another important special case: when the input domain is the Cartesian product of simplices. Indeed, the box domain case introduced in Section 2 can be viewed as a product of two-dimensional simplices, and we will also see in Section 5.3 that this structure naturally arises in machine learning settings with categorical or discrete features.

When $D$ is the product of simplices, we can derive a finite representation for the set (12) (i.e. a finite representation for the infinite family of linear inequalities (12a)) through an elegant connection with the transportation problem. To do so, we introduce the following notation.

Definition 2 Suppose the input domain is $D = \prod_{i=1}^\tau \Delta^{p_i}$, with $p_1 + \cdots + p_\tau = \eta$. For notational simplicity, we re-organize the indices of $x$ and refer to its entries via $x_{i,j}$, where $i \in \llbracket \tau\rrbracket$ is the simplex index, and $j \in \llbracket p_i\rrbracket$ refers to the coordinate within simplex $i$. The domain for $x$ is then

$$D = \big\{ ((x_{i,j})_{j=1}^{p_i})_{i=1}^\tau \;\big|\; (x_{i,j})_{j=1}^{p_i} \in \Delta^{p_i}\ \forall i \in \llbracket \tau\rrbracket \big\}, \qquad (14)$$

where the rows of $x$ correspond to each simplex. Correspondingly, we re-index the weights of the affine functions so that for each $k \in \llbracket d\rrbracket$, we have $f^k(x) = \sum_{i=1}^\tau \sum_{j=1}^{p_i} w^k_{i,j} x_{i,j} + b^k$.

Using the notation from Definition 2, constraints (12a) can be written as

$$y \leq \sum_{i=1}^\tau \sum_{j=1}^{p_i} \alpha_{i,j} x_{i,j} + \sum_{k=1}^d \Big( \max_{x^k \in D} \sum_{i=1}^\tau \sum_{j=1}^{p_i} (w^k_{i,j} - \alpha_{i,j}) x^k_{i,j} + b^k \Big) z_k \quad \forall \alpha \in \mathbb{R}^\eta.$$

Since $D$ is a product of simplices, the maximization over $x^k \in D$ appearing in the right-hand side above is separable over each simplex $i \in \llbracket \tau\rrbracket$. Moreover, for each simplex $i$, the maximum value of $\sum_{j=1}^{p_i} (w^k_{i,j} - \alpha_{i,j}) x^k_{i,j}$, subject to the constraint $x^k \in D$, is obtained when $x^k_{i,j} = 1$ for some $j \in \llbracket p_i\rrbracket$. Therefore, the family of constraints (12a) is equivalent to

$$y \leq \min_\alpha \left( \sum_{i=1}^\tau \sum_{j=1}^{p_i} \alpha_{i,j} x_{i,j} + \sum_{k=1}^d \Big( \sum_{i=1}^\tau \max_{j=1}^{p_i} (w^k_{i,j} - \alpha_{i,j}) + b^k \Big) z_k \right) = \sum_{i=1}^\tau \min_{\alpha_{i,1}, \dots, \alpha_{i,p_i}} \left( \sum_{j=1}^{p_i} \alpha_{i,j} x_{i,j} + \sum_{k=1}^d z_k \cdot \max_{j=1}^{p_i} (w^k_{i,j} - \alpha_{i,j}) \right) + \sum_{k=1}^d b^k z_k. \qquad (15)$$

We show that the minimization problem in (15), for any $i$, is equivalent to a transportation problem defined as follows.


Definition 3 For any values $x \in \Delta^p$ and $z \in \Delta^d$, and arbitrary weights $w^k_j \in \mathbb{R}$ for all $j \in \llbracket p\rrbracket$ and $k \in \llbracket d\rrbracket$, define the max-weight transportation problem to be

$$\mathrm{Transport}(x, z; w^1, \dots, w^d) \overset{\text{def}}{=} \max_{\beta \geq 0} \left\{ \sum_{k=1}^d \sum_{j=1}^p w^k_j \beta^k_j \;\middle|\; \sum_{k=1}^d \beta^k_j = x_j\ \forall j \in \llbracket p\rrbracket,\ \ \sum_{j=1}^p \beta^k_j = z_k\ \forall k \in \llbracket d\rrbracket \right\}.$$

In the transportation problem, since $\sum_j x_j = 1 = \sum_k z_k$, it follows that $\beta^k_j \in [0, 1]$, and so this value can be interpreted as the percent of total flow "shipped" between $j$ and $k$. The relation to (15) is now established through LP duality.

Proposition 7 For any fixed $x \in \Delta^p$ and $z \in \Delta^d$,

$$\min_\alpha \left( \sum_{j=1}^p \alpha_j x_j + \sum_{k=1}^d z_k \cdot \max_{j=1}^p (w^k_j - \alpha_j) \right) = \mathrm{Transport}(x, z; w^1, \dots, w^d). \qquad (16)$$

Therefore, when $D$ is a product of simplices, the constraints (12a) can be replaced in (12) with the single inequality

$$y \leq \sum_{i=1}^\tau \mathrm{Transport}\big( (x_{i,j})_{j=1}^{p_i},\ z;\ (w^1_{i,j})_{j=1}^{p_i}, \dots, (w^d_{i,j})_{j=1}^{p_i} \big) + \sum_{k=1}^d b^k z_k. \qquad (17)$$

Proof By using a variable $\gamma_k$ to model the value of $\max_{j=1}^p (w^k_j - \alpha_j)$ for each $k \in \llbracket d\rrbracket$, the minimization problem on the LHS of (16) is equivalent to

$$\min_{\alpha, \gamma} \left\{ \sum_{j=1}^p \alpha_j x_j + \sum_{k=1}^d \gamma_k z_k \;\middle|\; \gamma_k \geq w^k_j - \alpha_j\ \forall k \in \llbracket d\rrbracket,\ j \in \llbracket p\rrbracket \right\},$$

which is a minimization LP with free variables $\alpha_j$ and $\gamma_k$. Applying LP duality completes the proof of equation (16). The inequality (17) then arises by substituting equation (16) into (15), for every simplex $i = 1, \dots, \tau$. □
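Since Definition 3 is itself a small LP, it can also be evaluated directly, which is convenient for sanity-checking the closed forms derived next; a minimal sketch with hypothetical weights, using scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def transport(x, z, W):
    """Evaluate Transport(x, z; w^1, ..., w^d) from Definition 3 as an LP.
    W has shape (d, p) with W[k, j] = w^k_j; variables are the d*p flows beta^k_j."""
    d, p = W.shape
    A_eq, b_eq = [], []
    for j in range(p):                       # sum_k beta^k_j = x_j
        row = np.zeros((d, p)); row[:, j] = 1.0
        A_eq.append(row.ravel()); b_eq.append(x[j])
    for k in range(d - 1):                   # sum_j beta^k_j = z_k (last one is implied)
        row = np.zeros((d, p)); row[k, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(z[k])
    res = linprog(-W.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * (d * p))
    return -res.fun                          # maximization

W = np.array([[1.0, -2.0, 0.5],              # w^1
              [0.0,  1.0, 1.0]])             # w^2
print(transport(x=[0.2, 0.3, 0.5], z=[0.4, 0.6], W=W))  # ~ 0.9
```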

4.3 Simplifications when both $d = 2$ and $D$ is the product of simplices

Proposition 7 shows that, when the input domain is a product of simplices, the tightest upper bound on $y$ can be computed through a series of transportation problems. We now leverage the fact that if either side of the transportation problem from Definition 3 (i.e. $p$ or $d$) has only two entities, then it reduces to a simpler fractional knapsack problem. Later, this will allow us to represent (15), in either of the cases $d = 2$ or $p_1 = \cdots = p_\tau = 2$, using an explicit finite family of linear inequalities in $x$ and $z$ which has a greedy linear-time separation oracle.


Proposition 8 Given data $w^1, w^2 \in \mathbb{R}^p$, take $\tilde{w}_j = w^1_j - w^2_j$ for all $j \in \llbracket p\rrbracket$, and suppose the indices have been sorted so that $\tilde{w}_1 \leq \cdots \leq \tilde{w}_p$. Then

$$\mathrm{Transport}(x, z; w^1, w^2) = \min_{J=1}^p \left( \tilde{w}_J z_1 + \sum_{j=J+1}^p (\tilde{w}_j - \tilde{w}_J) x_j \right) + \sum_{j=1}^p w^2_j x_j. \qquad (18)$$

Moreover, a $J \in \llbracket p\rrbracket$ that attains the minimum in the right-hand side of (18) can be found in $O(p)$ time.

Proof When $d = 2$, the transportation problem becomes

$$\max_{\beta^1, \beta^2 \geq 0} \left\{ \sum_{j=1}^p (w^1_j \beta^1_j + w^2_j \beta^2_j) \;\middle|\; \beta^1_j + \beta^2_j = x_j\ \forall j \in \llbracket p\rrbracket,\ \sum_{j=1}^p \beta^1_j = z_1 \right\}, \qquad (19)$$

where the constraint $\sum_{j=1}^p \beta^2_j = z_2$ is implied because $\sum_{j=1}^p \beta^2_j = \sum_{j=1}^p (x_j - \beta^1_j) = 1 - \sum_{j=1}^p \beta^1_j = 1 - z_1 = z_2$. Substituting $\beta^2_j = x_j - \beta^1_j$ for all $j \in \llbracket p\rrbracket$ and then omitting the superscript "1", (19) becomes

$$\max_\beta \left\{ \sum_{j=1}^p (w^1_j - w^2_j) \beta_j + \sum_{j=1}^p w^2_j x_j \;\middle|\; \sum_{j=1}^p \beta_j = z_1,\ 0 \leq \beta_j \leq x_j\ \forall j \in \llbracket p\rrbracket \right\},$$

which is a fractional knapsack problem that can be solved greedily.

An optimal solution to the fractional knapsack LP above, assuming the sorting $w^1_1 - w^2_1 \leq \cdots \leq w^1_p - w^2_p$, can be computed greedily. Let $J \in \llbracket p\rrbracket$ be the maximum index at which $\sum_{j=J}^p x_j \geq z_1$. We set $\beta_j = 0$ for all $j < J$, $\beta_J = z_1 - \sum_{j=J+1}^p x_j$, and $\beta_j = x_j$ for each $j > J$. The optimal cost is

$$\sum_{j=J+1}^p (w^1_j - w^2_j) x_j + (w^1_J - w^2_J)\Big( z_1 - \sum_{j=J+1}^p x_j \Big) + \sum_{j=1}^p w^2_j x_j,$$

which yields the desired expression after substituting $\tilde{w}_j = w^1_j - w^2_j$ for all $j \geq J$. Moreover, the index $J$ above can be found in $O(p)$ time by storing a running total for $\sum_{j=J}^p x_j$, completing the proof. □

Observe that the $O(p)$ runtime in Proposition 8 is non-trivial, as a naive implementation would run in time $O(p^2)$, since the inner sum is linear in $p$.
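A minimal sketch of that evaluation of (18), keeping running suffix sums exactly as in the proof (the one-time sort makes this $O(p \log p)$ end to end); the weights are hypothetical and the value agrees with the LP evaluation sketched after Proposition 7.

```python
def transport_two(x, z1, w1, w2):
    """Evaluate Transport(x, z; w^1, w^2) via the closed form (18),
    scanning candidate indices J with running suffix sums."""
    p = len(x)
    order = sorted(range(p), key=lambda j: w1[j] - w2[j])        # sort so w~ is ascending
    wt = [w1[j] - w2[j] for j in order]
    xs = [x[j] for j in order]
    base = sum(w2[j] * x[j] for j in range(p))
    suffix_x, suffix_wx, best = 0.0, 0.0, float("inf")
    for J in range(p - 1, -1, -1):                               # candidates J = p, ..., 1
        candidate = wt[J] * z1 + (suffix_wx - wt[J] * suffix_x)  # value of (18) at this J
        best = min(best, candidate)
        suffix_x += xs[J]
        suffix_wx += wt[J] * xs[J]
    return best + base

w1, w2 = [1.0, -2.0, 0.5], [0.0, 1.0, 1.0]
print(transport_two([0.2, 0.3, 0.5], 0.4, w1, w2))  # ~ 0.9, matching the LP value above
```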

Combining Propositions 7 and 8 immediately yields the following result.

Corollary 2 Suppose that $d = 2$ and that $D$ is a product of simplices. Let $z = z_1 \equiv 1 - z_2$. For each simplex $i \in \llbracket \tau\rrbracket$, take $\tilde{w}_{i,j} = w^1_{i,j} - w^2_{i,j}$ for all $j = 1, \dots, p_i$ and relabel the indices so that $\tilde{w}_{i,1} \leq \cdots \leq \tilde{w}_{i,p_i}$. Then, in the context of (12), the upper-bound constraints (12a) are equivalent to

$$y \leq \sum_{i=1}^\tau \left( \tilde{w}_{i,J(i)} z + \sum_{j=J(i)+1}^{p_i} (\tilde{w}_{i,j} - \tilde{w}_{i,J(i)}) x_{i,j} + \sum_{j=1}^{p_i} w^2_{i,j} x_{i,j} \right) + (b^1 - b^2) z + b^2 \qquad (20)$$
$$\forall \text{ mappings } J : \llbracket \tau\rrbracket \to \mathbb{Z} \text{ with } J(i) \in \llbracket p_i\rrbracket\ \forall i \in \llbracket \tau\rrbracket.$$

Moreover, given any point $(x, y, z) \in D \times \mathbb{R} \times [0, 1]$, feasibility can be verified or a most violated constraint can be found in $O(p_1 + \cdots + p_\tau)$ time.


Corollary 2 gives an explicit finite family of linear inequalities equivalent to (12). Moreover, we have already shown in Corollary 1 that $R^{\mathrm{sharp}}$ yields an ideal formulation when $d = 2$. Hence, we have an ideal nonextended formulation whose exponentially-many inequalities can be separated in $O(p_1 + \cdots + p_\tau)$ time, where the initial sorting requires $O(p_1 \log p_1 + \cdots + p_\tau \log p_\tau)$ time. Note that this sorting can be avoided: we may instead solve the fractional knapsack problem in the separation via weighted median in $O(p_1 + \cdots + p_\tau)$ time [46, Chapter 17.1].
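A sketch of that separation routine for (20), choosing the minimizing $J(i)$ independently for each simplex (here via sorting, so $O(\sum_i p_i \log p_i)$ rather than the weighted-median variant); the data are hypothetical.

```python
def separate_sharp_d2_simplices(xs, y, z, W1, W2, b1, b2, tol=1e-9):
    """Find a most violated inequality (20), if any, for d = 2 over a product of
    simplices. xs[i], W1[i], W2[i] are lists over the coordinates of simplex i.
    Returns (mapping J, right-hand side) if a constraint is violated, else None."""
    rhs, J = (b1 - b2) * z + b2, []
    for x_i, w1_i, w2_i in zip(xs, W1, W2):
        p = len(x_i)
        order = sorted(range(p), key=lambda j: w1_i[j] - w2_i[j])
        wt = [w1_i[j] - w2_i[j] for j in order]
        xo = [x_i[j] for j in order]
        best_val, best_pos, sx, swx = float("inf"), 0, 0.0, 0.0
        for Jpos in range(p - 1, -1, -1):
            cand = wt[Jpos] * z + (swx - wt[Jpos] * sx)   # per-simplex term of (20)
            if cand < best_val:
                best_val, best_pos = cand, Jpos
            sx += xo[Jpos]
            swx += wt[Jpos] * xo[Jpos]
        rhs += best_val + sum(w2_i[j] * x_i[j] for j in range(p))
        J.append(order[best_pos])          # minimizing index within simplex i
    return (J, rhs) if y > rhs + tol else None

# Two simplices with 3 and 2 coordinates (hypothetical weights).
xs = [[0.2, 0.3, 0.5], [1.0, 0.0]]
W1 = [[1.0, -2.0, 0.5], [2.0, 0.0]]
W2 = [[0.0, 1.0, 1.0], [0.0, 0.0]]
print(separate_sharp_d2_simplices(xs, y=3.0, z=0.4, W1=W1, W2=W2, b1=0.0, b2=0.0))
```

Because the right-hand side of (20) is separable over the simplices, picking the minimizing $J(i)$ per simplex yields a most violated constraint, as in Corollary 2.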

We can also show that none of the constraints in (20) are redundant.

Proposition 9 Consider the polyhedron $P$ defined as the intersection of all halfspaces corresponding to the inequalities (20). Consider some arbitrary mapping $J : \llbracket \tau\rrbracket \to \mathbb{Z}$ with $J(i) \in \llbracket p_i\rrbracket$ for each $i \in \llbracket \tau\rrbracket$. Then the inequality in (20) for $J$ is irredundant with respect to $P$. That is, removing the halfspace corresponding to mapping $J$ (note that this halfspace could also correspond to other mappings) from the description of $P$ will strictly enlarge the feasible set.

Proof Fix a mapping $J$. Consider the feasible points with $z = 1/2$, and $x_{i,j} = \mathbb{1}[j = J(i)]$ for each $i$ and $j$. At such points, the constraint corresponding to any mapping $J' \neq J$ in (20) is

$$y \leq \sum_{i=1}^\tau \left( \frac{\tilde{w}_{i,J'(i)}}{2} + \sum_{j=J'(i)+1}^{p_i} (\tilde{w}_{i,j} - \tilde{w}_{i,J'(i)}) \mathbb{1}[j = J(i)] + \sum_{j=1}^{p_i} w^2_{i,j} \mathbb{1}[j = J(i)] \right) + \frac{b^1 + b^2}{2}$$
$$= \sum_{i=1}^\tau \left( \frac{\tilde{w}_{i,J'(i)}}{2} + (\tilde{w}_{i,J(i)} - \tilde{w}_{i,J'(i)}) \mathbb{1}[J'(i) < J(i)] \right) + \sum_{i=1}^\tau w^2_{i,J(i)} + \frac{b^1 + b^2}{2}.$$

For any simplex $i$, recall that the indices are sorted so that $\tilde{w}_{i,1} \leq \cdots \leq \tilde{w}_{i,p_i}$. Thus, if $J'(i) \geq J(i)$, then the expression inside the outer parentheses equals $\frac{\tilde{w}_{i,J'(i)}}{2} \geq \frac{\tilde{w}_{i,J(i)}}{2}$. On the other hand, if $J'(i) < J(i)$, then the expression inside the outer parentheses can be re-written as $\frac{\tilde{w}_{i,J(i)}}{2} + \frac{\tilde{w}_{i,J(i)} - \tilde{w}_{i,J'(i)}}{2} \geq \frac{\tilde{w}_{i,J(i)}}{2}$. Therefore, setting $J'(i) = J(i)$ for every simplex $i$ achieves the tightest upper bound in (20), which simplifies to

$$y \leq \sum_{i=1}^\tau \frac{\tilde{w}_{i,J(i)}}{2} + \sum_{i=1}^\tau w^2_{i,J(i)} + \frac{b^1 + b^2}{2}. \qquad (21)$$

Now, suppose that the same upper bound on $y$ is achieved by a mapping $J'$ such that $J'(i) \neq J(i)$ on a simplex $i$. By the argument above, regardless of whether $J'(i) > J(i)$ or $J'(i) < J(i)$, the expression inside the outer parentheses can equal $\frac{\tilde{w}_{i,J(i)}}{2}$ only if $\tilde{w}_{i,J'(i)} = \tilde{w}_{i,J(i)}$. In this case, inspecting the term inside the summation in (20) for mappings $J$ and $J'$, we observe that regardless of the values of $x$ or $z$,

$$\tilde{w}_{i,J'(i)} z + \sum_{j=J'(i)+1}^{p_i} (\tilde{w}_{i,j} - \tilde{w}_{i,J'(i)}) x_{i,j} = \tilde{w}_{i,J(i)} z + \sum_{j=J(i)+1}^{p_i} (\tilde{w}_{i,j} - \tilde{w}_{i,J(i)}) x_{i,j}.$$

Therefore, for a mapping $J'$ to achieve the tightest upper bound (21), it must be the case that $\tilde{w}_{i,J'(i)} = \tilde{w}_{i,J(i)}$ on every simplex $i$, and so $J'$ and $J$ necessarily correspond to the same half-space in $D \times \mathbb{R} \times [0, 1]$, completing the proof. □

To close this section, we consider the setting where every simplex is 2-dimensional, i.e. $p_1 = \cdots = p_\tau = 2$, but the number of affine functions $d$ can be arbitrary. Note that, by contrast, the previous results (Proposition 8, Corollary 2, and Proposition 9) held in the setting where $d = 2$ but $p_1, \ldots, p_\tau$ were arbitrary. We exploit the symmetry of the transportation problem to immediately obtain the following analogous results, collected in Corollary 3.

Corollary 3 Given data $w^1, \ldots, w^d \in \mathbb{R}^2$, take $\tilde{w}^k = w^k_1 - w^k_2$ for all $k \in \llbracket d \rrbracket$, and suppose the indices have been sorted so that $\tilde{w}^1 \le \cdots \le \tilde{w}^d$. Then
$$
  \mathrm{Transport}(x, z; w^1, \ldots, w^d) = \min_{K=1}^{d} \left( \tilde{w}^K x_1 + \sum_{k=1}^{K} w^k_2 z_k + \sum_{k=K+1}^{d} \big( w^k_1 - \tilde{w}^K \big) z_k \right). \qquad \text{(22)}
$$
Moreover, a $K \in \llbracket d \rrbracket$ that attains the minimum in the right-hand side of (22) can be found in $O(d)$ time.

Therefore, suppose that $D$ is a product of $\tau$ simplices of dimensions $p_1 = \cdots = p_\tau = 2$. For each simplex $i \in \llbracket \tau \rrbracket$, take $\tilde{w}^k_i = w^k_{i,1} - w^k_{i,2}$ for all $k = 1, \ldots, d$ and relabel the indices so that $\tilde{w}^1_i \le \cdots \le \tilde{w}^d_i$. Then, in the context of (12), the upper-bound constraints (12a) are equivalent to
$$
  y \le \sum_{i=1}^{\tau} \left( \tilde{w}^{K(i)}_i x_{i,1} + \sum_{k=1}^{K(i)} w^k_{i,2} z_k + \sum_{k=K(i)+1}^{d} \big( w^k_{i,1} - \tilde{w}^{K(i)}_i \big) z_k \right) + \sum_{k=1}^{d} b^k z_k
  \quad \forall \ \text{mappings } K : \llbracket \tau \rrbracket \to \llbracket d \rrbracket. \qquad \text{(23)}
$$
Furthermore, none of the constraints in (23) are redundant.

Corollary 3 will be particularly useful in Section 5.1, where it will allow us to derive sharp formulations for the maximum of $d$ affine functions over a box input domain, in analogy to (20).
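Under the assumptions of Corollary 3, a minimizing index $K$ in (22) can be found with a single backward pass using running sums; a minimal sketch (Python; the function and variable names are ours) is shown below.

```python
def transport_value(w, x1, z):
    """Evaluate the right-hand side of (22) in O(d) time.

    w: list of pairs (w_1^k, w_2^k), pre-sorted so that w_1^k - w_2^k is
       nondecreasing in k.  x1 in [0, 1] and z in the d-simplex.
    Returns (best_K, best_value) with K 1-indexed as in the corollary.
    """
    d = len(w)
    # Running sums for K = d: every index is "charged" at the w_2 rate.
    below = sum(w2 * zk for (_, w2), zk in zip(w, z))  # sum_{k <= K} w_2^k z_k
    above_w1 = 0.0                                     # sum_{k > K} w_1^k z_k
    above_z = 0.0                                      # sum_{k > K} z_k
    best_K, best_value = None, float("inf")
    for K in range(d, 0, -1):
        w1K, w2K = w[K - 1]
        # value(K) = (w_1^K - w_2^K)(x1 - sum_{k>K} z_k) + below + above_w1
        value = (w1K - w2K) * (x1 - above_z) + below + above_w1
        if value < best_value:
            best_K, best_value = K, value
        # Move index K from the "below" sums to the "above" sums.
        below -= w2K * z[K - 1]
        above_w1 += w1K * z[K - 1]
        above_z += z[K - 1]
    return best_K, best_value
```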

5 Applications of our machinery

We are now prepared to return to the concrete goal of this paper: buildingstrong MIP formulations for nonlinearities used in modern neural networks.

5.1 A hereditarily sharp formulation for Max on box domains

We can now present a hereditarily sharp formulation, with exponentially many constraints that can be efficiently separated, for the maximum of $d > 2$ affine functions over a shared box input domain.


Proposition 10 For each $k, \ell \in \llbracket d \rrbracket$, take
$$
  N^{\ell,k} = \sum_{i=1}^{\eta} \max\big\{ (w^k_i - w^\ell_i) L_i,\ (w^k_i - w^\ell_i) U_i \big\}.
$$
A valid MIP formulation for $\mathrm{gr}(\mathrm{Max} \circ (f^1, \ldots, f^d); [L,U])$ is
$$y \le w^\ell \cdot x + \sum_{k=1}^{d} \big( N^{\ell,k} + b^k \big) z_k \quad \forall \ell \in \llbracket d \rrbracket \qquad \text{(24a)}$$
$$y \ge w^k \cdot x + b^k \quad \forall k \in \llbracket d \rrbracket \qquad \text{(24b)}$$
$$(x, y, z) \in [L,U] \times \mathbb{R} \times \Delta^d \qquad \text{(24c)}$$
$$z \in \{0,1\}^d. \qquad \text{(24d)}$$
Moreover, a hereditarily sharp formulation is given by (24), along with the constraints
$$
  y \le \sum_{i=1}^{\eta} \left( w^{I(i)}_i x_i + \sum_{k=1}^{d} \max\big\{ (w^k_i - w^{I(i)}_i) L_i,\ (w^k_i - w^{I(i)}_i) U_i \big\} z_k \right) + \sum_{k=1}^{d} b^k z_k
  \quad \forall \ \text{mappings } I : \llbracket \eta \rrbracket \to \llbracket d \rrbracket. \qquad \text{(25)}
$$
Furthermore, none of the constraints (24b) and (25) are redundant.

Proof The result directly follows from Corollary 3 if we make a change of variables to transform the box domain $[L,U]$ to the domain $(\Delta^2)^\eta$, where the $x$-coordinates are given by $x_{i,1}, x_{i,2} \ge 0$ over $i \in \llbracket \eta \rrbracket$, with $x_{i,1} + x_{i,2} = 1$ for each simplex $i$. For each $i$, let $w^k_{i,1} = w^k_i U_i$, $w^k_{i,2} = w^k_i L_i$, and let $\sigma^i : \llbracket d \rrbracket \to \llbracket d \rrbracket$ be a permutation such that $w^{\sigma^i(1)}_{i,1} - w^{\sigma^i(1)}_{i,2} \le \cdots \le w^{\sigma^i(d)}_{i,1} - w^{\sigma^i(d)}_{i,2}$. Fix a mapping $I : \llbracket \eta \rrbracket \to \llbracket d \rrbracket$ for (23). We make the change of variables $\xi_i \leftarrow (U_i - L_i) x_{i,1} + L_i$ and rewrite (23) from Corollary 3 to take the form (25), as follows:
$$
\begin{aligned}
y &\le \sum_{i=1}^{\eta} \left( w^{\sigma^i(I(i))}_i (U_i - L_i) x_{i,1} + \sum_{k=1}^{I(i)} w^{\sigma^i(k)}_i L_i z_k + \sum_{k=I(i)+1}^{d} \big( w^{\sigma^i(k)}_i U_i - w^{\sigma^i(I(i))}_i (U_i - L_i) \big) z_k \right) + \sum_{k=1}^{d} b^k z_k \\
  &= \sum_{i=1}^{\eta} \left( w^{\sigma^i(I(i))}_i \xi_i + \sum_{k=1}^{I(i)} \big( w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i \big) L_i z_k + \sum_{k=I(i)+1}^{d} \big( w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i \big) U_i z_k \right) + \sum_{k=1}^{d} b^k z_k \\
  &= \sum_{i=1}^{\eta} \left( w^{\sigma^i(I(i))}_i \xi_i + \sum_{k=1}^{d} \max\big\{ (w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i) L_i,\ (w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i) U_i \big\} z_k \right) + \sum_{k=1}^{d} b^k z_k \\
  &= \sum_{i=1}^{\eta} \left( w^{\sigma^i(I(i))}_i \xi_i + \sum_{k=1}^{d} \max\big\{ (w^k_i - w^{\sigma^i(I(i))}_i) L_i,\ (w^k_i - w^{\sigma^i(I(i))}_i) U_i \big\} z_k \right) + \sum_{k=1}^{d} b^k z_k,
\end{aligned} \qquad \text{(26)}
$$
where the first equality holds because $\xi_i - (U_i - L_i) x_{i,1} = L_i$ for each $i$ and $\sum_{k=1}^{d} z_k = 1$, the second equality holds because $(w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i) L_i \ge (w^{\sigma^i(k)}_i - w^{\sigma^i(I(i))}_i) U_i$ if $k \le I(i)$ (and a reverse argument can be made if $k > I(i)$), and the third equality holds as each $\sigma^i : \llbracket d \rrbracket \to \llbracket d \rrbracket$ is a bijection. Now, since (26) is taken over all mappings $I : \llbracket \eta \rrbracket \to \llbracket d \rrbracket$, it is equivalent to replace $\sigma^i(I(i))$ with $I(i)$ in (26). This yields (25) over $\xi$ instead of $x$.

The lower bound in the context of Corollary 3 can be expressed as $y \ge \sum_{i=1}^{\eta} \big( w^k_{i,1} x_{i,1} + w^k_{i,2} x_{i,2} \big) + b^k$ for all $k \in \llbracket d \rrbracket$. To transform it to (24b) over $\xi$, observe that $w^k_{i,1} x_{i,1} + w^k_{i,2} x_{i,2}$ can be re-written as $w^k_i \big( U_i x_{i,1} + L_i (1 - x_{i,1}) \big) = w^k_i \xi_i$. Finally, note that this transformation of variables is a bijection since each $x_{i,1}$ was allowed to range over $[0,1]$, and thus each $\xi_i$ is allowed to range over $[L_i, U_i]$. Hence the new formulation over $(\xi, y, z) \in [L,U] \times \mathbb{R} \times \Delta^d$ is also sharp, yielding the desired result.

The irredundancy of (25) also follows from Corollary 3. For the irredundancy of (24b), fix $k$ and take a point $(x, y, z)$ where $f^k(x) > f^\ell(x)$ for all $\ell \ne k$, which exists by Assumption 2. Since the inequalities (24b) are the only ones bounding $y$ from below, the point $(x, y - \epsilon, z)$ for some $\epsilon > 0$ satisfies all constraints except the inequality in (24b) corresponding to $k$, and therefore that inequality is not redundant. ⊓⊔

Observe that the constraints (24a) are a special case of (25), corresponding to the constant mappings $I(i) = \ell$ for each $i \in \llbracket \eta \rrbracket$. We note in passing that this big-$M$ formulation (24) may be substantially stronger than existing formulations appearing in the literature. For example, the tightest version of the formulation of Tjeng et al. [70] is equivalent to (24) with the coefficients $N^{\ell,k}$ replaced with
$$
  N^{\ell,k} = b^\ell - b^k + \max_{t \ne \ell} \left( \sum_{i=1}^{\eta} \Big( \max\{ w^t_i L_i,\ w^t_i U_i \} - \min\{ w^\ell_i L_i,\ w^\ell_i U_i \} \Big) \right).
$$
Note in particular that the inner maximization and minimization are completely decoupled, and that the outer maximization in the definition of $N^{\ell,k}$ is completely independent of $k$.

In Appendix A, we generalize (24) to provide a valid formulation for arbitrary polytope domains. We next emphasize that the hereditarily sharp formulation from Proposition 10 is particularly strong when $d = 2$.
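Before specializing to $d = 2$, we illustrate the coefficient comparison above with a short sketch (Python/NumPy; the function names are ours). It computes the tightened coefficients $N^{\ell,k}$ of Proposition 10 directly from the weights and box bounds, alongside looser coefficients of the decoupled form just discussed (our transcription), so the two big-$M$ formulations can be compared numerically.

```python
import numpy as np

def tightened_coefficients(W, b, L, U):
    """Big-M coefficients of formulation (24).

    W: (d, eta) array of weights w^k; b: (d,) biases; L, U: (eta,) box bounds.
    Returns an (d, d) array N with
    N[l, k] = sum_i max{(w^k_i - w^l_i) L_i, (w^k_i - w^l_i) U_i}.
    """
    diff = W[None, :, :] - W[:, None, :]          # diff[l, k, i] = w^k_i - w^l_i
    return np.maximum(diff * L, diff * U).sum(axis=2)

def decoupled_coefficients(W, b, L, U):
    """Looser coefficients in the decoupled style discussed above (d >= 2)."""
    d = W.shape[0]
    upper = np.maximum(W * L, W * U).sum(axis=1)  # max of w^t . x over the box
    lower = np.minimum(W * L, W * U).sum(axis=1)  # min of w^l . x over the box
    N = np.empty((d, d))
    for l in range(d):
        gap = max(upper[t] for t in range(d) if t != l) - lower[l]
        N[l, :] = b[l] - b + gap                  # independent of k up to b^l - b^k
    return N
```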

Corollary 4 The formulation given by (24) and (25) is ideal when $d = 2$. Moreover, the constraints in the families (25) and (24b) are facet-defining.

Proof Idealness follows directly from Corollary 1. Since the constraints (25) and (24b) are irredundant and the formulation is ideal, they must either be facet-defining or describe an implied equality. Given that the equality $\sum_{k=1}^{d} z_k = 1$ appears in (24c), it suffices to observe that the polyhedron defined by (24) and (25) has dimension $\eta + d$, which holds under Assumption 2. ⊓⊔

We can compute a most-violated inequality from the family (25) efficiently.

Proposition 11 Consider the family of inequalities (25). Take some point $(x, y, z) \in [L,U] \times \mathbb{R} \times \Delta^d$. If any constraint in the family is violated at the given point, a most-violated constraint can be constructed by selecting $I : \llbracket \eta \rrbracket \to \llbracket d \rrbracket$ such that
$$
  I(i) \in \arg\min_{\ell \in \llbracket d \rrbracket} \left( w^\ell_i x_i + \sum_{k=1}^{d} \max\big\{ (w^k_i - w^\ell_i) L_i,\ (w^k_i - w^\ell_i) U_i \big\} z_k \right) \qquad \text{(27)}
$$
for each $i \in \llbracket \eta \rrbracket$. Moreover, if the weights $w^k_i$ are sorted on $k$ for each $i \in \llbracket \eta \rrbracket$, this can be done in $O(\eta d)$ time.

Proof Follows directly from Corollary 3, which says that the minimization problem (27) can be solved in $O(d)$ time for any $i \in \llbracket \eta \rrbracket$. ⊓⊔

Note that, naïvely, the minimization problem (27) would take $O(d^2)$ time, because one has to check every $\ell \in \llbracket d \rrbracket$ and then sum over $k \in \llbracket d \rrbracket$ for every $\ell$. However, if we instead pre-sort the weights $w^1_i, \ldots, w^d_i$ for every $i \in \llbracket \eta \rrbracket$ in $O(\eta d \log d)$ time, we can use Corollary 3 to separate efficiently via a linear search. We note, however, that this pre-sorting step can potentially be obviated by solving the fractional knapsack problems that arise as weighted median problems, each of which can be solved in $O(d)$ time.
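A minimal separation sketch for (25) over a box domain, following Proposition 11 (Python/NumPy; the names are ours, and for brevity it evaluates each candidate $\ell$ directly in $O(\eta d^2)$ time rather than using the pre-sorting trick):

```python
import numpy as np

def separate_max_cut(W, b, L, U, x, y, z, tol=1e-9):
    """Most-violated inequality of family (25) at the point (x, y, z), if any.

    W: (d, eta) weights; b: (d,) biases; L, U: (eta,) box bounds;
    x: (eta,) point; z: (d,) point in the simplex.
    Returns the mapping I as an (eta,) index array, or None if no violation.
    """
    d, eta = W.shape
    I = np.empty(eta, dtype=int)
    rhs = float(b @ z)
    for i in range(eta):
        diff = W[:, None, i] - W[None, :, i]          # diff[k, l] = w^k_i - w^l_i
        terms = np.maximum(diff * L[i], diff * U[i])  # max{. L_i, . U_i}
        candidates = W[:, i] * x[i] + terms.T @ z     # objective of (27) for each l
        I[i] = int(np.argmin(candidates))
        rhs += candidates[I[i]]
    return I if y > rhs + tol else None
```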

5.2 The ReLU over a box domain

We can now present the results promised in Theorem 2.3. In particular, we derive a non-extended ideal formulation for the ReLU nonlinearity, stated only in terms of the original variables $(x, y)$ and the single additional binary variable $z$. Put another way, it is the strongest possible tightening that can be applied to the big-$M$ formulation (5), and so matches the strength of the multiple choice formulation without the growth in the number of variables remarked upon in Section 2.2. Notationally, for each $i \in \llbracket \eta \rrbracket$ take
$$
  \breve{L}_i = \begin{cases} L_i & \text{if } w_i \ge 0 \\ U_i & \text{if } w_i < 0 \end{cases}
  \qquad \text{and} \qquad
  \breve{U}_i = \begin{cases} U_i & \text{if } w_i \ge 0 \\ L_i & \text{if } w_i < 0. \end{cases}
$$

Proposition 12 Take some affine function $f(x) = w \cdot x + b$ over input domain $D = [L,U]$. The following is an ideal MIP formulation for $\mathrm{gr}(\mathrm{ReLU} \circ f; [L,U])$:
$$y \le \sum_{i \in I} w_i \big( x_i - \breve{L}_i (1 - z) \big) + \Big( b + \sum_{i \notin I} w_i \breve{U}_i \Big) z \quad \forall I \subseteq \llbracket \eta \rrbracket \qquad \text{(28a)}$$
$$y \ge w \cdot x + b \qquad \text{(28b)}$$
$$(x, y, z) \in [L,U] \times \mathbb{R}_{\ge 0} \times [0,1] \qquad \text{(28c)}$$
$$z \in \{0,1\}. \qquad \text{(28d)}$$
Furthermore, each inequality in (28a) and (28b) is facet-defining.


Proof This result is a special case of Corollary 4. (Alternatively, a constructive proof of validity and idealness using Fourier–Motzkin elimination is given in the extended abstract of this work [4, Proposition 1].) Observe that (28) is equivalent to (24) and (25) with $w^1_i = w_i$ and $w^2_i = 0$ for all $i \in \llbracket \eta \rrbracket$, $b^1 = b$, $b^2 = 0$, $z_1 = z$, and $z_2 = 1 - z$. The constraints (28a) are found by setting $I = \{\, i \in \llbracket \eta \rrbracket \mid I(i) = 1 \,\}$ for each mapping $I$ in (25). ⊓⊔

Formulation (28) has a number of constraints exponential in the input dimension $\eta$, so it will not be directly useful as a MIP formulation. However, it is straightforward to separate over the exponential family (28a) efficiently.

Proposition 13 Take $(x, y, z) \in [L,U] \times \mathbb{R}_{\ge 0} \times [0,1]$, along with the set
$$
  I = \Big\{\, i \in \llbracket \eta \rrbracket \;\Big|\; w_i x_i < w_i \big( \breve{L}_i (1 - z) + \breve{U}_i z \big) \,\Big\}.
$$
If
$$
  y > bz + \sum_{i \in I} w_i \big( x_i - \breve{L}_i (1 - z) \big) + \sum_{i \notin I} w_i \breve{U}_i z,
$$
then the constraint in (28a) corresponding to $I$ is the most violated in the family. Otherwise, no inequality in the family is violated at $(x, y, z)$.

Proof Follows as a special case of Proposition 11. ⊓⊔

Note that (5b) and (5c) correspond to (28a) with $I = \llbracket \eta \rrbracket$ and $I = \emptyset$, respectively. All this suggests an iterative approach to formulating ReLU neurons over box domains: start with the big-$M$ formulation (5), and use Proposition 13 to separate strengthening inequalities from (28a) as they are needed.
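The separation routine of Proposition 13 is a single pass over the input coordinates; a minimal sketch (Python; the function and variable names are ours) is:

```python
def separate_relu_cut(w, b, L, U, x, y, z, tol=1e-9):
    """Most-violated inequality of family (28a) at (x, y, z), if any.

    w, L, U, x are length-eta sequences; b is the bias; z is in [0, 1].
    Returns the index set I of the most violated cut, or None.
    """
    eta = len(w)
    # Breve bounds: swap L_i and U_i when the weight is negative.
    Lb = [L[i] if w[i] >= 0 else U[i] for i in range(eta)]
    Ub = [U[i] if w[i] >= 0 else L[i] for i in range(eta)]
    I, rhs = [], b * z
    for i in range(eta):
        if w[i] * x[i] < w[i] * (Lb[i] * (1 - z) + Ub[i] * z):
            I.append(i)
            rhs += w[i] * (x[i] - Lb[i] * (1 - z))
        else:
            rhs += w[i] * Ub[i] * z
    return I if y > rhs + tol else None
```

When a violated set $I$ is returned, the corresponding inequality from (28a) is added to the relaxation and the LP is re-solved, exactly as in the iterative approach suggested above.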

5.3 The ReLU with one-hot encodings

Although box domains are a natural choice for many applications, it is often the case that some (or all) of the first layer of a neural network will be constrained to be a product of simplices. The one-hot encoding is a standard technique used in the machine learning community to preprocess discrete or categorical data into a format more amenable for learning (see, for example, [16, Chapter 2.2]). More formally, if input $x$ is constrained to take categorical values $x \in C = \{c_1, \ldots, c_t\}$, the one-hot transformation encodes this as $\tilde{x} \in \{0,1\}^t$, where $\tilde{x}_i = 1$ if and only if $x = c_i$. In other words, the input is constrained such that $\tilde{x} \in \Delta^{\eta} \cap \{0,1\}^{\eta}$, the vertex set of the simplex (a short sketch of this encoding appears below). It is straightforward to construct a small ideal formulation for $\mathrm{gr}(\mathrm{ReLU} \circ f; \Delta^{\eta} \cap \{0,1\}^{\eta})$ as $\{\, (x, \sum_{i=1}^{\eta} \max\{0, w_i + b\}\, x_i) \mid x \in \Delta^{\eta} \,\}$. However, it is typically the case that multiple features will be present in the input, meaning that the input domain will consist of a product of (potentially many) simplices. For example, neural networks have proven well-suited for predicting the propensity of a given DNA sequence to bind with a given protein [2,83], where the network input consists of a sequence of $n$ base pairs, each of which can take 4 possible values. In this context, the input domain would be $\prod_{i=1}^{n} \Delta^4$.

In this section, we restate the general results presented in Section 2, specialized for the standard case of the ReLU nonlinearity.
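The one-hot encoding described above can be sketched as follows (Python/NumPy; the category list is illustrative):

```python
import numpy as np

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector on the simplex vertices."""
    encoding = np.zeros(len(categories))
    encoding[categories.index(value)] = 1.0
    return encoding

# A DNA base is one of four categories, so each position maps into {0,1}^4.
print(one_hot("G", ["A", "C", "G", "T"]))  # [0. 0. 1. 0.]
```

Stacking such encodings for $n$ categorical features yields an input in a product of simplices, the domain treated by the following corollary.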

Corollary 5 Presume that the input domain $D = \Delta^{p_1} \times \cdots \times \Delta^{p_\tau}$ is a product of $\tau$ simplices, and that $f(x) = \sum_{i=1}^{\tau} \sum_{j=1}^{p_i} w_{i,j} x_{i,j} + b$ is an affine function. Presume that, for each $i \in \llbracket \tau \rrbracket$, the weights are sorted such that $w_{i,1} \le \cdots \le w_{i,p_i}$. Then an ideal formulation for $\mathrm{gr}(\mathrm{ReLU} \circ f; D)$ is:
$$y \ge w \cdot x + b \qquad \text{(29a)}$$
$$y \le \sum_{i=1}^{\tau} \left( w_{i,J(i)}\, z + \sum_{j=J(i)+1}^{p_i} \big( w_{i,j} - w_{i,J(i)} \big) x_{i,j} \right) + bz
  \quad \forall \ \text{mappings } J : \llbracket \tau \rrbracket \to \mathbb{Z} \text{ with } J(i) \in \llbracket p_i \rrbracket \ \forall i \in \llbracket \tau \rrbracket \qquad \text{(29b)}$$
$$(x, y, z) \in D \times \mathbb{R}_{\ge 0} \times \{0,1\}. \qquad \text{(29c)}$$
Moreover, a most-violated constraint from the family (29b), if one exists, can be identified in $O(p_1 + \cdots + p_\tau)$ time. Finally, none of the constraints from (29b) are redundant.

Proof Follows directly from applying Corollary 2 to the set $R^{\mathrm{sharp}}$. By Corollary 1, this set actually leads to an ideal formulation, because we are taking the maximum of only two functions (one of them being zero). The statement about non-redundancy follows from Proposition 9. ⊓⊔

5.4 The leaky ReLU over a box domain

A slightly more exotic variant of the ReLU is the leaky ReLU, defined as $\mathrm{Leaky}(v; \alpha) = \max\{\alpha v, v\}$ for some constant $0 < \alpha < 1$. Instead of fixing any negative input to zero, the leaky ReLU scales it by a (typically small) constant $\alpha$. This has been empirically observed to help avoid the "vanishing gradient" problem during the training of certain networks [55,82]. We present results for the leaky ReLU analogous to those for the ReLU: an ideal MIP formulation with an efficient separation routine for the constraints.

Proposition 14 Take some affine function $f(x) = w \cdot x + b$ over input domain $D = [L,U]$. The following is a valid formulation for $\mathrm{gr}(\mathrm{Leaky} \circ f; [L,U])$:
$$y \ge f(x) \qquad \text{(30a)}$$
$$y \ge \alpha f(x) \qquad \text{(30b)}$$
$$y \le f(x) - (1 - \alpha) \cdot M^-(f; [L,U]) \cdot (1 - z) \qquad \text{(30c)}$$
$$y \le \alpha f(x) - (\alpha - 1) \cdot M^+(f; [L,U]) \cdot z \qquad \text{(30d)}$$
$$(x, y, z) \in [L,U] \times \mathbb{R} \times [0,1] \qquad \text{(30e)}$$
$$z \in \{0,1\}. \qquad \text{(30f)}$$


Moreover, an ideal formulation is given by (30), along with the constraints
$$
  y \le \left( \sum_{i \in I} w_i \big( x_i - \breve{L}_i (1 - z) \big) + \Big( b + \sum_{i \notin I} w_i \breve{U}_i \Big) z \right)
      + \alpha \left( \sum_{i \notin I} w_i \big( x_i - \breve{U}_i z \big) + \Big( b + \sum_{i \in I} w_i \breve{L}_i \Big) (1 - z) \right)
  \quad \forall I \subseteq \llbracket \eta \rrbracket. \qquad \text{(31)}
$$
Additionally, the most violated inequality from the family (31) can be separated in $O(\eta)$ time. Finally, each inequality in (30a)–(30d) and (31) is facet-defining.

Proof Follows as a special case of Corollary 4. ⊓⊔
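By analogy with Proposition 13, a most violated inequality from the family (31) can be found coordinate by coordinate, since its right-hand side decomposes across $i$ once $z$ is fixed; a minimal sketch (Python; our names, using the same breve-bound convention as the ReLU separation above) is:

```python
def separate_leaky_cut(w, b, L, U, alpha, x, y, z, tol=1e-9):
    """Most-violated inequality of family (31) at (x, y, z), if any."""
    eta = len(w)
    Lb = [L[i] if w[i] >= 0 else U[i] for i in range(eta)]
    Ub = [U[i] if w[i] >= 0 else L[i] for i in range(eta)]
    I, rhs = [], b * z + alpha * b * (1 - z)
    for i in range(eta):
        # Contribution of coordinate i to the right-hand side if i is in I ...
        in_I = w[i] * (x[i] - Lb[i] * (1 - z)) + alpha * w[i] * Lb[i] * (1 - z)
        # ... or if i is in the complement of I.
        out_I = w[i] * Ub[i] * z + alpha * w[i] * (x[i] - Ub[i] * z)
        if in_I < out_I:
            I.append(i)
            rhs += in_I
        else:
            rhs += out_I
    return I if y > rhs + tol else None
```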

6 Computational experiments

To conclude this work, we perform a preliminary computational study of our approaches for ReLU-based networks. We focus on verifying image classification networks trained on the canonical MNIST digit data set [49]. We train a neural network $f : [0,1]^{28 \times 28} \to \mathbb{R}^{10}$, where each of the 10 outputs corresponds to the logits for each of the digits from 0 to 9 (in this context, logits are the non-normalized predictions of the network, so that $x \in [0,1]^{28 \times 28}$ is predicted to be digit $i - 1$ with probability $\exp(f(x)_i) / \sum_{j=1}^{10} \exp(f(x)_j)$ [1]). Given a training image $x \in [0,1]^{28 \times 28}$, our goal is to prove that there does not exist a perturbation of $x$ such that the neural network $f$ produces a wildly different classification result. If $f(x)_i = \max_{j=1}^{10} f(x)_j$, then $x$ is placed in class $i$. Consider an input image with known label $i$. To evaluate robustness around $x$ with respect to class $j$, select some small $\epsilon > 0$ and solve the problem $\max_{a : \|a\|_\infty \le \epsilon} f(x + a)_j - f(x + a)_i$. If the optimal value (or a dual bound thereon) is less than zero, this verifies that our network is robust around $x$, as no small perturbation can flip the classification from $i$ to $j$. We note that there are a number of variants of the verification problem in the literature, produced by selecting a different objective function [50,81] or constraint set [28]. We use a model similar to that of Dvijotham et al. [25,26].
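For concreteness, a single verification query can be posed as follows (a minimal sketch in Python/NumPy; the helper name is ours, the MIP encoding of the network itself is left abstract, and we assume the perturbed image is also clipped to the network's input domain $[0,1]^{28 \times 28}$):

```python
import numpy as np

def verification_instance(x, eps, true_class, target_class):
    """Bounds and objective for one robustness query around the image x."""
    # Perturbed input x + a with ||a||_inf <= eps, kept inside [0, 1].
    lower = np.clip(x - eps, 0.0, 1.0)
    upper = np.clip(x + eps, 0.0, 1.0)

    def objective(logits):
        # Target-class logit minus true-class logit: a positive maximum means
        # some admissible perturbation flips the classification.
        return logits[target_class] - logits[true_class]

    return lower, upper, objective
```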

We train two models, each using the same architecture, with and without L1 regularization. The architecture starts with a convolutional layer with ReLU activation functions (4 filters, kernel size of 4, stride of 2), which has 676 ReLUs, then a linear convolutional layer (with the same parameters but without ReLUs) feeding into a dense layer of 16 ReLU neurons, and then a dense linear layer with one output per digit representing the logits. Finally, we have a softmax layer that is only enabled during training to normalize the logits and output probabilities. Such a network is smaller than those typically used for image classification tasks, but it is nonetheless capable of achieving near-perfect out-of-sample accuracy on the MNIST data set, and of presenting us with challenging optimization problems. We generate 100 instances for each network by randomly selecting images $x$ with true label $i$ from the test data, along with a random target adversarial class $j \ne i$.

As a general rule of thumb, larger networks will be capable of achieving higher accuracy, but will lead to larger optimization problems that are more difficult to solve. However, even with a fixed network architecture, there can still be dramatic variability in the difficulty of optimizing over different parameter realizations. We refer the interested reader to Ryu et al. [62] for an example of this phenomenon in reinforcement learning. Moreover, in the scope of this work we make no attempt to use recent techniques that train the networks to be verifiable [25,78,79,81].

For all experiments, we use the Gurobi v7.5.2 solver, running with a single thread on a machine with 128 GB of RAM and 32 CPUs at 2.30 GHz. We use a time limit of 30 minutes (1800 s) for each run. We perform our experiments using the tf.opt package for optimization over trained neural networks; tf.opt is under active development at Google, with the intention to open source the project in the future. The big-M + (28a) method is the big-M formulation (5) paired with separation over the exponential family (28a) (we use cut callbacks in Gurobi to inject the separated inequalities into the cut loop; while this offers little control over when the separation procedure is run, it allows us to take advantage of Gurobi's sophisticated cut management implementation), and with Gurobi's cutting plane generation turned off. Similarly, the big-M and the extended methods are the big-M formulation (5) and the extended formulation (7), respectively, with default Gurobi settings. Finally, the big-M + no cuts method turns off Gurobi's cutting plane generation without separating over (28a). As a preprocessing step, if we can infer that a neuron is linear from its bounds (e.g. a nonnegative lower bound or nonpositive upper bound on the input affine function of a ReLU), we transform it into a linear neuron, thus ensuring Assumption 2.
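This preprocessing amounts to interval arithmetic on each pre-activation over its box domain; a minimal sketch (Python/NumPy; the function names are ours) is:

```python
import numpy as np

def preactivation_bounds(w, b, L, U):
    """Interval bounds on w . x + b over the box [L, U]."""
    lower = b + np.minimum(w * L, w * U).sum()
    upper = b + np.maximum(w * L, w * U).sum()
    return lower, upper

def relu_status(w, b, L, U):
    """Classify a ReLU as linear (stably active/inactive) or genuinely nonlinear."""
    lower, upper = preactivation_bounds(w, b, L, U)
    if lower >= 0:
        return "active"    # ReLU(w.x + b) == w.x + b on the whole box
    if upper <= 0:
        return "inactive"  # ReLU(w.x + b) == 0 on the whole box
    return "nonlinear"     # needs a binary variable
```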

6.1 Network with standard training

We start with a model trained by a standard procedure, using the Adam algorithm [45] for 15 epochs with a learning rate of $10^{-3}$. The model attains 97.2% test accuracy. We select a perturbation ball radius of $\epsilon = 0.1$. We report the results in Table 1a and in Figure 4a. The big-M + (28a) method solves 7 times faster on average than the big-M formulation. Indeed, for 79 out of 100 instances the big-M method does not prove optimality after 30 minutes, and it is never the fastest choice (the "win" column). Moreover, the big-M + no cuts method times out on every instance, implying that using some cuts is important. The extended method is roughly 5 times slower than the big-M + (28a) method, but only exceeds the time limit on 19 instances, and so is substantially more reliable than the big-M method for a network of this size.


(a) Network with standard training.

  method             time (s)   optimality gap   win
  big-M + (28a)        174.49            0.53%    81
  big-M               1233.49            6.03%     0
  big-M + no cuts     1800.00           125.6%     0
  extended             890.21            1.26%     6

(b) Network trained with L1 regularization.

  method             time (s)   optimality gap   win
  big-M + (28a)          9.17               0%   100
  big-M                434.72            1.80%     0
  big-M + no cuts     1646.45           21.52%     0
  extended            1120.14            3.00%     0

Table 1: Shifted geometric mean of solve time and optimality gap over 100 instances (shift of 10 and 1, respectively). The "win" column is the number of (solved) instances on which the method is the fastest.

6.2 ReLU network with L1 regularization

The optimization problems studied in Section 6.1 are surprisingly difficult given the relatively small size of the networks involved. This can largely be attributed to the fact that the weights describing the neural network are almost completely dense. To remedy this, we train a second model, using the same network architecture, but with L1 regularization as suggested by Xiao et al. [81]. We again set a radius of $\epsilon = 0.1$, and train for 100 epochs with a learning rate of $5 \cdot 10^{-4}$ and a regularization parameter of $10^{-4}$. The network achieves a test accuracy of 98.0%. With regularization, 4% of the weights in the network are zero, compared to 0.3% in the previous network. Moreover, with regularization we can infer from variable bounds that, on average, 73.1% of the ReLUs are linear within the perturbation box of radius $\epsilon$, enabling us to eliminate the corresponding binary variables. In the first network, we can do so for only 27.8% of the ReLUs.

We report the corresponding results in Table 1b and Figure 4b. While the extended approach does not seem to be substantially affected by the network sparsity, the big-M-based approaches are able to exploit it to solve more instances more quickly. The big-M approach solves 70 of 100 instances to optimality within the time limit, though the mean solve time is still quite large. In contrast, the big-M + (28a) approach fully exploits the sparsity in the model, solving each instance strictly faster than each of the other approaches. Indeed, it solves 69 instances in less than 10 seconds, and solves all instances in under 120 seconds.


[Figure 4 consists of two cactus plots: (a) network with standard training and (b) network trained with L1 regularization. Each plot shows the number of instances solved (vertical axis, 0–100) against solve time in seconds on a log scale (horizontal axis), with one curve for each of the big-M, big-M + (28a), extended, and big-M + no cuts methods.]

Fig. 4: Number of instances solved within a given amount of time. Curves to the upper left are better, with more instances solved in less time.

References

1. URL https://developers.google.com/machine-learning/glossary/#logits
2. Alipanahi, B., Delong, A., Weirauch, M.T., Frey, B.J.: Predicting the sequence speci-

ficities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33,831–838 (2015)

3. Amos, B., Xu, L., Kolter, J.Z.: Input convex neural networks. In: D. Precup, Y.W. Teh(eds.) Proceedings of the 34th International Conference on Machine Learning, vol. 70,pp. 146–155. PMLR, International Convention Centre, Sydney, Australia (2017)

4. Anderson, R., Huchette, J., Tjandraatmadja, C., Vielma, J.P.: Strong mixed-integerprogramming formulations for trained neural networks. In: A. Lodi, V. Nagara-jan (eds.) Proceedings of the 20th Conference on Integer Programming and Com-binatorial Optimization, pp. 27–42. Springer International Publishing, Cham (2019).https://arxiv.org/abs/1811.08359


5. Arora, R., Basu, A., Mianjy, P., Mukherjee, A.: Understanding deep neural networkswith rectified linear units. arXiv preprint arXiv:1611.01491 (2016)

6. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: Deep reinforcementlearning: A brief survey. IEEE Signal Processing Magazine 34(6), 26–38 (2017)

7. Atamturk, A., Gomez, A.: Strong formulations for quadratic optimization with M-matrices and indicator variables. Mathematical Programming (2018)

8. Balas, E.: Disjunctive programming and a hierarchy of relaxations for discrete optimiza-tion problems. SIAM Journal on Algorithmic Discrete Methods 6(3), 466–486 (1985)

9. Balas, E.: Disjunctive programming: Properties of the convex hull of feasible points.Discrete Applied Mathematics 89, 3–44 (1998)

10. Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Neuron constraints to model com-plex real-world problems. In: International Conference on the Principles and Practiceof Constraint Programming, pp. 115–129. Springer, Berlin, Heidelberg (2011)

11. Bartolini, A., Lombardi, M., Milano, M., Benini, L.: Optimization and controlled sys-tems: A case study on thermal aware workload dispatching. In: Proceedings of theTwenty-Sixth AAAI Conference on Artificial Intelligence, pp. 427–433 (2012)

12. Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A.V., Criminisi, A.:Measuring neural net robustness with constraints. In: Advances in Neural InformationProcessing Systems, pp. 2613–2621 (2016)

13. Belotti, P., Bonami, P., Fischetti, M., Lodi, A., Monaci, M., Nogales-Gomez, A., Sal-vagnin, D.: On handling indicator constraints in mixed integer programming. Compu-tational Optimization and Applications 65(3), 545–566 (2016)

14. Bertsimas, D., Tsitsiklis, J.: Introduction to Linear Optimization. Athena Scientific(1997)

15. Bienstock, D., Munoz, G., Pokutta, S.: Principled deep neural network training throughlinear programming. arXiv preprint arXiv:1810.03218 (2018)

16. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
17. Bonami, P., Lodi, A., Tramontani, A., Wiese, S.: On mathematical programming with

indicator constraints. Mathematical Programming 151(1), 191–223 (2015)
18. Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recogni-

tion. In: IEEE Computer Society Conference on Computer Vision and Pattern Recog-nition, pp. 2559–2566 (2010)

19. Bunel, R., Turkaslan, I., Torr, P.H., Kohli, P., Kumar, M.P.: A unified view of piece-wise linear neural network verification. In: Advances in Neural Information ProcessingSystems (2018)

20. Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017IEEE Symposium on Security and Privacy (SP), pp. 39–57 (2017)

21. Chen, L., Ma, W., Natarajan, K., Simchi-Levi, D., Yan, Z.: Distributionally robustlinear and discrete optimization with marginals. Available at SSRN 3159473 (2018)

22. Cheng, C.H., Nührenberg, G., Ruess, H.: Maximum resilience of artificial neural networks. In: International Symposium on Automated Technology for Verification and Analysis. Springer, Cham (2017)

23. Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann,T., Weber, T., Degris, T., Coppin, B.: Deep reinforcement learning in large discreteaction spaces (2015). https://arxiv.org/abs/1512.07679

24. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Output range analysis for deep feedforward neural networks. In: NASA Formal Methods Symposium (2018)

25. Dvijotham, K., Gowal, S., Stanforth, R., Arandjelovic, R., O’Donoghue, B.,Uesato, J., Kohli, P.: Training verified learners with learned verifiers (2018).https://arxiv.org/abs/1805.10265

26. Dvijotham, K., Stanforth, R., Gowal, S., Mann, T., Kohli, P.: A dual approach toscalable verification of deep networks. In: Thirty-Fourth Conference Annual Conferenceon Uncertainty in Artificial Intelligence (2018)

27. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks. In:International Symposium on Automated Technology for Verification and Analysis.Springer, Cham (2017)

28. Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., Madry, A.: Exploring the landscapeof spatial robustness. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proceedings of the


36th International Conference on Machine Learning, Proceedings of Machine Learning

Research, vol. 97, pp. 1802–1811. PMLR, Long Beach, California, USA (2019). URLhttp://proceedings.mlr.press/v97/engstrom19a.html

29. Fischetti, M., Jo, J.: Deep neural networks and mixed integer linear optimization. Con-straints (2018)

30. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style (2015).https://arxiv.org/abs/1508.06576

31. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedingsof the fourteenth international conference on artificial intelligence and statistics, pp.315–323 (2011)

32. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning, vol. 1. MIT Press Cambridge(2016)

33. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout net-works. In: Proceedings of the 30th International Conference on Machine Learning,vol. 28, pp. 1319–1327 (2013)

34. Grimstad, B., Andersson, H.: ReLU networks as surrogate models in mixed-integerlinear programs. Computers & Chemical Engineering 131 (2019)

35. Haneveld, W.K.K.: Robustness against dependence in pert: An application of dualityand distributions with known marginals. In: Stochastic Programming 84 Part I, pp.153–182. Springer (1986)

36. Hanin, B.: Universal function approximation by deep neural nets with bounded widthand relu activations. arXiv preprint arXiv:1708.02691 (2017)

37. Hijazi, H., Bonami, P., Cornuejols, G., Ouorou, A.: Mixed-integer nonlinear programsfeaturing ”on/off” constraints. Computational Optimization and Applications 52(2),537–558 (2012)

38. Hijazi, H., Bonami, P., Ouorou, A.: A note on linear on/off constraints (2014).http://www.optimization-online.org/DB_FILE/2014/04/4309.pdf

39. Huber, B., Rambau, J., Santos, F.: The Cayley trick, lifting subdivisions and the Bohne-Dress theorem on zonotopal tilings. Journal of the European Mathematical Society 2(2), 179–198 (2000)

40. Huchette, J.: Advanced mixed-integer programming formulations: Methodology, com-putation, and application. Ph.D. thesis, Massachusetts Institute of Technology (2018)

41. Jeroslow, R., Lowe, J.: Modelling with integer variables. Mathematical ProgrammingStudy 22, 167–184 (1984)

42. Jeroslow, R.G.: Alternative formulations of mixed integer programs. Annals of Opera-tions Research 12, 241–276 (1988)

43. Katz, G., Barrett, C., Dill, D.L., Julian, K., Kochenderfer, M.J.: Reluplex: An effi-cient SMT solver for verifying deep neural networks. In: International Conference onComputer Aided Verification, pp. 97–117 (2017)

44. Khalil, E.B., Gupta, A., Dilkina, B.: Combinatorial attacks on binarized neural net-works. In: International Conference on Learning Representations (2019)

45. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2014).https://arxiv.org/abs/1412.6980

46. Korte, B., Vygen, J.: Combinatorial Optimization: Theory and Algorithms. Springer(2000)

47. Kumar, A., Serra, T., Ramalingam, S.: Equivalent and approximate transformations ofdeep neural networks (2019). https://arxiv.org/abs/1905.11428

48. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
49. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to

document recognition. In: Proceedings of the IEEE, vol. 86, pp. 2278–2324 (1998)
50. Liu, C., Arnon, T., Lazarus, C., Barrett, C., Kochenderfer, M.J.: Algorithms for veri-

fying deep neural networks (2019). https://arxiv.org/abs/1903.06758

51. Lombardi, M., Gualandi, S.: A lagrangian propagator for artificial neural networks inconstraint programming. Constraints 21(4), 435–462 (2016)

52. Lombardi, M., Milano, M.: Boosting combinatorial problem modeling with machinelearning. In: Proceedings IJCAI, pp. 5472–5478 (2018)

53. Lombardi, M., Milano, M., Bartolini, A.: Empirical decision model learning. Artif. Intell.244, 343–367 (2017)


54. Lomuscio, A., Maganti, L.: An approach to reachability analysis for feed-forward ReLUneural networks (2017). https://arxiv.org/abs/1706.07351

55. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural networkacoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language(2013)

56. Mladenov, M., Boutilier, C., Schuurmans, D., Elidan, G., Meshi, O., Lu, T.: Approxi-mate linear programming for logistic Markov decision processes. In: Proceedings of theTwenty-sixth International Joint Conference on Artificial Intelligence (IJCAI-17), pp.2486–2493. Melbourne, Australia (2017)

57. Mordvintsev, A., Olah, C., Tyka, M.: Inceptionism: Going deeper into neural networks (2015). https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

58. Natarajan, K., Song, M., Teo, C.P.: Persistency model and its applications in choicemodeling. Management Science 55(3), 453–469 (2009)

59. Olah, C., Mordvintsev, A., Schubert, L.: Feature visualization. Distill (2017).https://distill.pub/2017/feature-visualization

60. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The lim-itations of deep learning in adversarial settings. In: IEEE European Symposium onSecurity and Privacy, pp. 372–387 (2016)

61. Raghunathan, A., Steinhardt, J., Liang, P.: Semidefinite relaxations for certifying ro-bustness to adversarial examples. In: Proceedings of the 32nd International Conferenceon Neural Information Processing Systems, NIPS’18, pp. 10,900–10,910. Curran Asso-ciates Inc. (2018)

62. Ryu, M., Chow, Y., Anderson, R., Tjandraatmadja, C., Boutilier, C.: CAQL: Continuousaction Q-learning (2019). https://arxiv.org/abs/1909.12397

63. Salman, H., Yang, G., Zhang, H., Hsieh, C.J., Zhang, P.: A convex re-laxation barrier to tight robustness verification of neural networks (2019).https://arxiv.org/abs/1902.08722

64. Say, B., Wu, G., Zhou, Y.Q., Sanner, S.: Nonlinear hybrid planning with deep netlearned transition models and mixed-integer linear programming. In: Proceedings ofthe Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17,pp. 750–756 (2017)

65. Schweidtmann, A.M., Mitsos, A.: Global deterministic optimization with artificial neuralnetworks embedded. Journal of Optimization Theory and Applications (2018)

66. Serra, T., Ramalingam, S.: Empirical bounds on linear regions of deep rectifier networks(2018). https://arxiv.org/abs/1810.03370

67. Serra, T., Tjandraatmadja, C., Ramalingam, S.: Bounding and counting linear regionsof deep neural networks. In: Thirty-fifth International Conference on Machine Learning(2018)

68. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus,R.: Intriguing properties of neural networks. In: International Conference on LearningRepresentations (2014)

69. Tawarmalani, M., Sahinidis, N.: Convexification and Global Optimization in Contin-uous and Mixed-Integer Nonlinear Programming: Theory, Algorithms, Software andApplications, vol. 65. Springer Science & Business Media (2002)

70. Tjeng, V., Xiao, K., Tedrake, R.: Verifying neural networks with mixed integer pro-gramming. In: International Conference on Learning Representations (2019)

71. Trespalacios, F., Grossmann, I.E.: Improved big-M reformulation for generalized dis-junctive programs. Computers and Chemical Engineering 76, 98–103 (2015)

72. Vielma, J.P.: Mixed integer linear programming formulation techniques. SIAM Review57(1), 3–57 (2015)

73. Vielma, J.P.: Embedding formulations and complexity for unions of polyhedra. Man-agement Science 64(10), 4471–4965 (2018)

74. Vielma, J.P.: Small and strong formulations for unions of convex sets from the Cayleyembedding. Mathematical Programming (2018)

75. Vielma, J.P., Nemhauser, G.: Modeling disjunctive constraints with a logarithmic num-ber of binary variables and constraints. Mathematical Programming 128(1-2), 49–72(2011)


76. Weibel, C.: Minkowski sums of polytopes: Combinatorics and computation. Ph.D. thesis,Ecole Polytechnique Federale de Lausanne (2007)

77. Weiss, G.: Stochastic bounds on distributions of optimal value functions with applica-tions to pert, network flows and reliability. Operations Research 34(4), 595–605 (1986)

78. Wong, E., Kolter, J.Z.: Provable defenses against adversarial examples via the convexouter adversarial polytope. In: International Conference on Machine Learning (2018)

79. Wong, E., Schmidt, F., Metzen, J.H., Kolter, J.Z.: Scaling provable adversarial defenses.In: 32nd Conference on Neural Information Processing Systems (2018)

80. Wu, G., Say, B., Sanner, S.: Scalable planning with Tensorflow for hybrid nonlineardomains. In: Advances in Neural Information Processing Systems, pp. 6276–6286 (2017)

81. Xiao, K.Y., Tjeng, V., Shafiullah, N.M., Madry, A.: Training for faster adversarial ro-bustness verification via inducing ReLU stability. In: International Conference on Learn-ing Representations (2019)

82. Xu, B., Wang, N., Chen, T., Li, M.: Empirical evaluation of rectified activations inconvolution network (2015). https://arxiv.org/abs/1505.00853

83. Zeng, H., Edwards, M.D., Liu, G., Gifford, D.K.: Convolutional neural network archi-tectures for predicting DNA–protein binding. Bioinformatics 32(12), 121–127 (2016)

A A tight big-M formulation for Max on polyhedral domains

We present a tightened big-$M$ formulation for the maximum of $d$ affine functions over an arbitrary polytope input domain. We can view the formulation as a relaxation of the system in Proposition 4, where we select $d$ inequalities from each of (10a) and (10b): those corresponding to $\alpha, \bar{\alpha} \in \{w^1, \ldots, w^d\}$. This subset yields a valid formulation, and we obviate the need for direct separation. This formulation can also be viewed as an application of Proposition 6.2 of Vielma [72], and is similar to the big-$M$ formulations for generalized disjunctive programs of Trespalacios and Grossmann [71].

Proposition 15 Take coefficients $N$ such that, for each $\ell, k \in \llbracket d \rrbracket$ with $\ell \ne k$,
$$N_{\ell,k,+} \ge \max_{x^k \in D|_k} \big\{ (w^k - w^\ell) \cdot x^k \big\} \qquad \text{(32a)}$$
$$N_{\ell,k,-} \le \min_{x^k \in D|_k} \big\{ (w^k - w^\ell) \cdot x^k \big\}, \qquad \text{(32b)}$$
and $N_{k,k,+} = N_{k,k,-} = 0$ for all $k \in \llbracket d \rrbracket$. Then a valid formulation for $\mathrm{gr}(\mathrm{Max} \circ (f^1, \ldots, f^d); D)$ is:
$$y \le w^\ell \cdot x + \sum_{k=1}^{d} \big( N_{\ell,k,+} + b^k \big) z_k \quad \forall \ell \in \llbracket d \rrbracket \qquad \text{(33a)}$$
$$y \ge w^\ell \cdot x + \sum_{k=1}^{d} \big( N_{\ell,k,-} + b^k \big) z_k \quad \forall \ell \in \llbracket d \rrbracket \qquad \text{(33b)}$$
$$(x, y, z) \in D \times \mathbb{R} \times \Delta^d \qquad \text{(33c)}$$
$$z \in \{0,1\}^d \qquad \text{(33d)}$$

The tightest possible coefficients in (32) can be computed exactly by solving an LP for each pair of input affine functions $\ell \ne k$. While this might be exceedingly expensive computationally if $d$ is large, it is potentially viable if $d$ is a small fixed constant. For example, the max pooling neuron computes the maximum over a rectangular window of a larger array [32, Section 9.3], and is frequently used in image classification architectures. Typically, max pooling units work with a $2 \times 2$ or a $3 \times 3$ window, in which case $d = 4$ or $d = 9$, respectively.

In addition, if in practice we observe that the set $D|_k$ is empty, then we can infer that the neuron is not irreducible, as the $k$-th input function is never the maximum, and we can safely prune it. In particular, if we attempt to compute the coefficients for $z_k$ and the LP is proven infeasible, we can prune the $k$-th function.
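A minimal sketch of this LP-based computation (Python/SciPy; the function name is ours, and we read $D|_k$ as the subset of $D = \{x : Ax \le c\}$ on which the $k$-th affine function attains the maximum, which is our assumption on the notation):

```python
import numpy as np
from scipy.optimize import linprog

def big_m_coefficient(W, b, A, c, ell, k, sense="+"):
    """LP bound of (32a)/(32b): extremize (w^k - w^ell) . x over D|_k.

    Returns None if D|_k is empty, signalling that the k-th function can be
    pruned as discussed above.
    """
    d, eta = W.shape
    # f^k(x) >= f^t(x)  <=>  (W[t] - W[k]) . x <= b[k] - b[t]
    A_k = np.vstack([A] + [(W[t] - W[k]).reshape(1, -1) for t in range(d) if t != k])
    c_k = np.concatenate([c, [b[k] - b[t] for t in range(d) if t != k]])
    obj = W[k] - W[ell]
    # linprog minimizes; flip the sign to maximize for the "+" coefficient.
    res = linprog(-obj if sense == "+" else obj, A_ub=A_k, b_ub=c_k,
                  bounds=[(None, None)] * eta, method="highs")
    if res.status == 2:  # infeasible: the k-th function is never the maximum
        return None
    return -res.fun if sense == "+" else res.fun
```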