
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning
Ziming Zhang*, Yuanwei Wu* and Guanghui Wang (* denotes equal contributions)
[email protected], [email protected], [email protected]

Problem: Localizing global optima in deep learning.

Contributions:
1) Explored the possibility of locating global optimality in deep learning (DL) from the algorithmic perspective;
2) Proposed an efficient SGD-based solver, BPGrad, with Branch & Pruning;
3) Demonstrated empirically good performance of the BPGrad solver on object recognition, detection, and segmentation.

Branch & Pruning:

Approximate DL solver based on BPGrad:

A1. Minimizing distortion is more important than minimizing step sizes, i.e., $\gamma \ll 1$;
A2. $\mathcal{X}$ is sufficiently large, so there exists $\eta_t \geq 0$ such that $x_{t+1} = x_t - \eta_t \nabla \tilde{f}(x_t) \in \mathcal{X} \setminus \mathcal{X}_t^{\mathbb{R}}$ (i.e., outside the pruned regions) always holds, where $\tilde{f}$ denotes the stochastic (mini-batch) approximation of $f$;
A3. $\eta_t \geq 0$ is always sufficiently small for a local update;
A4. $x_{t+1}$ can be sampled based only on $x_t$ and $\nabla \tilde{f}(x_t)$.
A sketch of the resulting update rule is given below.
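Under A1–A4, enforcing the pruning constraint (4) only for the most recent sample and with equality yields a closed-form step size $\eta_t \approx \big(f(x_t) - \rho \min_i f(x_i)\big) / \big(L \|\nabla \tilde f(x_t)\|_2\big)$. The following is a minimal sketch based on that assumption; the function and variable names are illustrative and this is not the authors' released code.

```python
import numpy as np

def bpgrad_step(x, grad, f_x, f_min, L, rho=0.9, eps=1e-8):
    """One approximate BPGrad update (sketch).

    x     : current parameters (flat numpy array)
    grad  : (stochastic) gradient of f at x
    f_x   : current objective value f(x)
    f_min : smallest objective value observed so far
    L     : (estimated) Lipschitz constant of f
    rho   : pruning factor in (0, 1), assuming f >= 0
    """
    # Step size from constraint (4) with equality:
    # f(x) - L * eta * ||grad|| = rho * f_min
    eta = max(f_x - rho * f_min, 0.0) / (L * np.linalg.norm(grad) + eps)
    return x - eta * grad, eta
```

In a training loop, `f_min` is maintained as the running minimum of observed objective values, and momentum can be added to the step exactly as in SGD.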

Estimation of Lipschitz Constant $L$:
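The poster does not reproduce the estimation procedure here. A simple empirical estimate consistent with the Lipschitz continuity definition stated later on the poster is the largest observed ratio $|f(x_i) - f(x_j)| / \|x_i - x_j\|_2$ over sampled iterate pairs; this is an illustrative assumption, not necessarily the authors' exact estimator.

```python
import numpy as np

def estimate_lipschitz(xs, fs, eps=1e-12):
    """Empirical (lower) estimate of the Lipschitz constant L from
    sampled parameter vectors xs[i] and objective values fs[i]."""
    L = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dist = np.linalg.norm(xs[i] - xs[j])
            if dist > eps:
                L = max(L, abs(fs[i] - fs[j]) / dist)
    return L
```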

Momentum helps evolve toward global optima

Global Convergence of BPGrad Solver:

Object Recognition: CIFAR10, ImageNet, MNIST


• Numerical test (Li & Yuan, NIPS'17):
  * Two-layer neural network with 10,302 parameters
  * Trained on MNIST for 20 epochs with a batch size of 200
• SGD vs. BPGrad:
  * Momentum coefficient 0.9
  * Both solvers compared at their best performance
  * Euclidean distance between the two solutions is 0.6


[Figure: Two-layer network used in the numerical test (Li & Yuan, NIPS'17): Input $x$, hidden layer $\mathrm{ReLU}\big((I + W)^\top x\big)$ with an identity link $+x$, and linear output layer $W^\top x$.]
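For concreteness, a minimal sketch of such a two-layer network with an identity link follows; the names and framework choice are illustrative, and the original experiments were not necessarily implemented this way.

```python
import numpy as np

def two_layer_identity_net(x, W, w_out):
    """Forward pass of a two-layer ReLU network with an identity link (sketch).

    x     : input vector, shape (d,)
    W     : hidden weight matrix, shape (d, d)
    w_out : output weights, shape (d,)
    """
    # The identity link appears as the I in (I + W): ReLU((I + W)^T x)
    h = np.maximum((np.eye(len(x)) + W).T @ x, 0.0)
    return w_out @ h  # linear output layer
```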

Lipschitz Continuity: $|f(x_1) - f(x_2)| \leq L \|x_1 - x_2\|_2, \ \forall x_1, x_2 \in \mathcal{X}$.

For iterates $x_1, \dots, x_t$ and a global minimum $x^*$ with value $f^* = f(x^*)$,
$$f(x_i) - L \|x_i - x^*\|_2 \;\leq\; f^* \;\leq\; \min_{j=1,\dots,t} f(x_j), \quad \forall i \in [t],$$
which implies
$$\|x_i - x^*\|_2 \;\geq\; \frac{1}{L}\Big(f(x_i) - \min_{j=1,\dots,t} f(x_j)\Big).$$
Hence any solution inside the ball centered at $x_i$ with this radius can be safely removed (pruning).
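A minimal sketch of this pruning test (illustrative names; not the authors' implementation):

```python
import numpy as np

def pruning_radius(f_xi, f_best, L):
    """Radius of the ball around x_i that provably contains no point
    better than the current best value f_best (exact Lipschitz case)."""
    return max(f_xi - f_best, 0.0) / L

def can_prune(candidate, samples, f_values, f_best, L):
    """Return True if `candidate` lies strictly inside some pruned ball
    and can therefore be discarded."""
    for x_i, f_xi in zip(samples, f_values):
        if np.linalg.norm(candidate - x_i) < pruning_radius(f_xi, f_best, L):
            return True
    return False
```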

BPGrad Algorithm for Lipschitz functions:

$$\min_{x_{t+1} \in \mathcal{X},\; \eta_t \geq 0} \ \big\| x_{t+1} - \big(x_t - \eta_t \nabla \tilde{f}(x_t)\big) \big\|_2^2 + \gamma \eta_t^2, \qquad (5)$$
$$\text{s.t.} \quad \max_{i=1,\dots,t} \Big\{ f(x_i) - L \|x_i - x_{t+1}\|_2 \Big\} \;\leq\; \rho \min_{i=1,\dots,t} f(x_i). \qquad (4)$$
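A schematic loop for this constrained problem, under the simplification that $\eta_t$ is increased until the candidate $x_{t+1} = x_t - \eta_t \nabla\tilde f(x_t)$ satisfies constraint (4), i.e. leaves every pruned ball. This is a sketch under assumptions A1–A4 and a nonnegative objective, not the authors' exact algorithm.

```python
import numpy as np

def bpgrad_lipschitz(f, grad_f, x0, L, rho=0.9, T=100, eta0=1e-3, grow=2.0):
    """Schematic BPGrad loop for a nonnegative Lipschitz objective f (sketch).

    All visited samples and values are kept; each new candidate is pushed
    along the negative gradient until constraint (4) is met, i.e. it no
    longer lies strictly inside any pruned ball.
    """
    xs, fs = [np.asarray(x0, dtype=float)], [f(x0)]
    x = xs[0]
    for _ in range(T):
        g, f_best, eta = grad_f(x), min(fs), eta0
        cand = x - eta * g
        for _ in range(60):  # cap the step-size search
            # Constraint (4): max_i { f(x_i) - L ||x_i - cand|| } <= rho * min_i f(x_i)
            if max(fi - L * np.linalg.norm(xi - cand)
                   for xi, fi in zip(xs, fs)) <= rho * f_best:
                break
            eta *= grow
            cand = x - eta * g
        x = cand
        xs.append(x)
        fs.append(f(x))
    best = int(np.argmin(fs))
    return xs[best], fs[best]
```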

👍 Pros:
• Global optimality
• Tightness bound guarantee
• Provable convergence

👎 Cons:
• Tracking of infeasible solutions
• Exponential number of samples

[Figure: Branch and pruning for $\min_{x \in \mathcal{X}} f(x)$ over a partition of $\mathcal{X}$ into subregions $\mathcal{X}_1, \mathcal{X}_2, \dots$: at a leaf, update the incumbent $T^{\min} \leftarrow \min(T^{\min}, f(x))$; otherwise, a branch whose lower bound satisfies $B_v > T^{\min}$ is removed (pruned), while branches with $B_v \leq T^{\min}$ are explored further.]
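A generic branch-and-bound skeleton matching the diagram above; the split, leaf, and lower-bound rules are placeholders, not the paper's specific choices.

```python
def branch_and_bound(root, f, lower_bound, split, is_leaf, leaf_point):
    """Generic branch-and-bound for min_x f(x) over region `root` (sketch).

    lower_bound(region) -> a valid lower bound B_v on f over the region
    split(region)       -> list of subregions covering the region
    is_leaf(region)     -> True when the region is small enough to evaluate
    leaf_point(region)  -> representative point x of a leaf region
    """
    T_min, best_x = float("inf"), None
    stack = [root]
    while stack:
        region = stack.pop()
        if lower_bound(region) > T_min:
            continue                      # prune: cannot contain a better solution
        if is_leaf(region):
            x = leaf_point(region)
            fx = f(x)
            if fx < T_min:                # incumbent update: T_min = min(T_min, f(x))
                T_min, best_x = fx, x
        else:
            stack.extend(split(region))   # branch into subregions
    return best_x, T_min
```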

Assumptions:
1) The objective $f$ has both lower and upper bounds;
2) $f$ is differentiable in the parameter space;
3) $f$ is Lipschitz continuous, or can be approximated by Lipschitz functions, with constant $L \geq 0$.


[Figure: Global convergence verification on CIFAR10 with $L = 50$: values of the left-hand and right-hand sides of the convergence condition over 2000 iterations, with momentum coefficient 0.9 and with momentum coefficient 0.]

Object Detection and Segmentation:

[Figure: Average Precision (AP, %) on the VOC2007 test dataset for Adagrad, RMSProp, Adam, and BPGrad.]

[Figures: Object recognition curves on CIFAR10, ImageNet, and MNIST: training objective and test/validation top-1 error vs. training epochs (some panels on a log scale).]