
BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning
Ziming Zhang*, Yuanwei Wu* and Guanghui Wang (* denotes equal contributions)
[email protected], [email protected], [email protected]

Problem: Localizing global optima in deep learning.

Contributions:
1) Explored the possibility of locating global optimality in deep learning (DL) from the algorithmic perspective;
2) Proposed an efficient SGD-based solver, BPGrad, with Branch & Pruning;
3) Demonstrated empirically good performance of the BPGrad solver on object recognition, detection, and segmentation.

Branch & Pruning:

Approximate DL solver based on BPGrad:

A1. Minimizing distortion is more important than minimizing step sizes, i.e., $\gamma \ll 1$;
A2. $\mathcal{X}$ is sufficiently large, so there exists $\eta_t \geq 0$ such that $x_{t+1} = x_t - \eta_t \nabla \tilde{f}(x_t) \in \mathcal{X} \setminus \mathcal{X}_t^{\mathbb{R}}$ (i.e., outside the pruned regions) always holds, where $\tilde{f}$ denotes the stochastic (mini-batch) approximation of $f$;
A3. $\eta_t \geq 0$ is always sufficiently small for a local update;
A4. $x_{t+1}$ can be sampled based only on $x_t$ and $\nabla \tilde{f}(x_t)$.
A sketch of the resulting update rule is given below.
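Under A1–A4, enforcing the pruning constraint (4) only for the most recent sample and with equality yields a closed-form step size $\eta_t \approx \big(f(x_t) - \rho \min_i f(x_i)\big) / \big(L \|\nabla \tilde f(x_t)\|_2\big)$. The following is a minimal sketch based on that assumption; the function and variable names are illustrative and this is not the authors' released code.

```python
import numpy as np

def bpgrad_step(x, grad, f_x, f_min, L, rho=0.9, eps=1e-8):
    """One approximate BPGrad update (sketch).

    x     : current parameters (flat numpy array)
    grad  : (stochastic) gradient of f at x
    f_x   : current objective value f(x)
    f_min : smallest objective value observed so far
    L     : (estimated) Lipschitz constant of f
    rho   : pruning factor in (0, 1), assuming f >= 0
    """
    # Step size from constraint (4) with equality:
    # f(x) - L * eta * ||grad|| = rho * f_min
    eta = max(f_x - rho * f_min, 0.0) / (L * np.linalg.norm(grad) + eps)
    return x - eta * grad, eta
```

In a training loop, `f_min` is maintained as the running minimum of observed objective values, and momentum can be added to the step exactly as in SGD.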

Estimation of Lipschitz Constant $L$:
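The poster does not reproduce the estimation procedure here. A simple empirical estimate consistent with the Lipschitz continuity definition stated later on the poster is the largest observed ratio $|f(x_i) - f(x_j)| / \|x_i - x_j\|_2$ over sampled iterate pairs; this is an illustrative assumption, not necessarily the authors' exact estimator.

```python
import numpy as np

def estimate_lipschitz(xs, fs, eps=1e-12):
    """Empirical (lower) estimate of the Lipschitz constant L from
    sampled parameter vectors xs[i] and objective values fs[i]."""
    L = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dist = np.linalg.norm(xs[i] - xs[j])
            if dist > eps:
                L = max(L, abs(fs[i] - fs[j]) / dist)
    return L
```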

Momentum helps evolve toward global optima

Global Convergence of BPGrad Solver:

Object Recognition: CIFAR10, ImageNet, MNIST


• Numerical test (Li & Yuan, NIPS'17):
  * Two-layer neural network with 10,302 parameters
  * Trained on MNIST for 20 epochs with a batch size of 200
• SGD vs. BPGrad:
  * Momentum coefficient 0.9
  * Both solvers compared at their best performance
  * Euclidean distance between the two solutions is 0.6


[Figure: Two-layer network used in the numerical test (Li & Yuan, NIPS'17): Input $x$, hidden layer $\mathrm{ReLU}\big((I + W)^\top x\big)$ with an identity link $+x$, and linear output layer $W^\top x$.]
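For concreteness, a minimal sketch of such a two-layer network with an identity link follows; the names and framework choice are illustrative, and the original experiments were not necessarily implemented this way.

```python
import numpy as np

def two_layer_identity_net(x, W, w_out):
    """Forward pass of a two-layer ReLU network with an identity link (sketch).

    x     : input vector, shape (d,)
    W     : hidden weight matrix, shape (d, d)
    w_out : output weights, shape (d,)
    """
    # The identity link appears as the I in (I + W): ReLU((I + W)^T x)
    h = np.maximum((np.eye(len(x)) + W).T @ x, 0.0)
    return w_out @ h  # linear output layer
```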

Lipschitz Continuity: $|f(x_1) - f(x_2)| \leq L \|x_1 - x_2\|_2, \ \forall x_1, x_2 \in \mathcal{X}$.

For iterates $x_1, \dots, x_t$ and a global minimum $x^*$ with value $f^* = f(x^*)$,
$$f(x_i) - L \|x_i - x^*\|_2 \;\leq\; f^* \;\leq\; \min_{j=1,\dots,t} f(x_j), \quad \forall i \in [t],$$
which implies
$$\|x_i - x^*\|_2 \;\geq\; \frac{1}{L}\Big(f(x_i) - \min_{j=1,\dots,t} f(x_j)\Big).$$
Hence any solution inside the ball centered at $x_i$ with this radius can be safely removed (pruning).
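A minimal sketch of this pruning test (illustrative names; not the authors' implementation):

```python
import numpy as np

def pruning_radius(f_xi, f_best, L):
    """Radius of the ball around x_i that provably contains no point
    better than the current best value f_best (exact Lipschitz case)."""
    return max(f_xi - f_best, 0.0) / L

def can_prune(candidate, samples, f_values, f_best, L):
    """Return True if `candidate` lies strictly inside some pruned ball
    and can therefore be discarded."""
    for x_i, f_xi in zip(samples, f_values):
        if np.linalg.norm(candidate - x_i) < pruning_radius(f_xi, f_best, L):
            return True
    return False
```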

BPGrad Algorithm for Lipschitz functions:

$$\min_{x_{t+1} \in \mathcal{X},\; \eta_t \geq 0} \ \big\| x_{t+1} - \big(x_t - \eta_t \nabla \tilde{f}(x_t)\big) \big\|_2^2 + \gamma \eta_t^2, \qquad (5)$$
$$\text{s.t.} \quad \max_{i=1,\dots,t} \Big\{ f(x_i) - L \|x_i - x_{t+1}\|_2 \Big\} \;\leq\; \rho \min_{i=1,\dots,t} f(x_i). \qquad (4)$$
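A schematic loop for this constrained problem, under the simplification that $\eta_t$ is increased until the candidate $x_{t+1} = x_t - \eta_t \nabla\tilde f(x_t)$ satisfies constraint (4), i.e. leaves every pruned ball. This is a sketch under assumptions A1–A4 and a nonnegative objective, not the authors' exact algorithm.

```python
import numpy as np

def bpgrad_lipschitz(f, grad_f, x0, L, rho=0.9, T=100, eta0=1e-3, grow=2.0):
    """Schematic BPGrad loop for a nonnegative Lipschitz objective f (sketch).

    All visited samples and values are kept; each new candidate is pushed
    along the negative gradient until constraint (4) is met, i.e. it no
    longer lies strictly inside any pruned ball.
    """
    xs, fs = [np.asarray(x0, dtype=float)], [f(x0)]
    x = xs[0]
    for _ in range(T):
        g, f_best, eta = grad_f(x), min(fs), eta0
        cand = x - eta * g
        for _ in range(60):  # cap the step-size search
            # Constraint (4): max_i { f(x_i) - L ||x_i - cand|| } <= rho * min_i f(x_i)
            if max(fi - L * np.linalg.norm(xi - cand)
                   for xi, fi in zip(xs, fs)) <= rho * f_best:
                break
            eta *= grow
            cand = x - eta * g
        x = cand
        xs.append(x)
        fs.append(f(x))
    best = int(np.argmin(fs))
    return xs[best], fs[best]
```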

👍 Pros:
• Global optimality
• Tightness bound guarantee
• Provable convergence

👎 Cons:
• Tracking of infeasible solutions
• Exponential number of samples

[Figure: Branch and pruning for $\min_{x \in \mathcal{X}} f(x)$ over a partition of $\mathcal{X}$ into subregions $\mathcal{X}_1, \mathcal{X}_2, \dots$: at a leaf, update the incumbent $T^{\min} \leftarrow \min(T^{\min}, f(x))$; otherwise, a branch whose lower bound satisfies $B_v > T^{\min}$ is removed (pruned), while branches with $B_v \leq T^{\min}$ are explored further.]
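A generic branch-and-bound skeleton matching the diagram above; the split, leaf, and lower-bound rules are placeholders, not the paper's specific choices.

```python
def branch_and_bound(root, f, lower_bound, split, is_leaf, leaf_point):
    """Generic branch-and-bound for min_x f(x) over region `root` (sketch).

    lower_bound(region) -> a valid lower bound B_v on f over the region
    split(region)       -> list of subregions covering the region
    is_leaf(region)     -> True when the region is small enough to evaluate
    leaf_point(region)  -> representative point x of a leaf region
    """
    T_min, best_x = float("inf"), None
    stack = [root]
    while stack:
        region = stack.pop()
        if lower_bound(region) > T_min:
            continue                      # prune: cannot contain a better solution
        if is_leaf(region):
            x = leaf_point(region)
            fx = f(x)
            if fx < T_min:                # incumbent update: T_min = min(T_min, f(x))
                T_min, best_x = fx, x
        else:
            stack.extend(split(region))   # branch into subregions
    return best_x, T_min
```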

Assumptions:
1) The objective $f$ has both lower and upper bounds;
2) $f$ is differentiable in the parameter space;
3) $f$ is Lipschitz continuous, or can be approximated by Lipschitz functions, with constant $L \geq 0$.


[Figure: Global convergence verification on CIFAR10 with $L = 50$: values of the left-hand and right-hand sides of the convergence condition over 2000 iterations, with momentum coefficient 0.9 and with momentum coefficient 0.]

Object Detection and Segmentation:

[Figure: Average Precision (AP, %) on the VOC2007 test dataset for Adagrad, RMSProp, Adam, and BPGrad.]

[Figures: Object recognition curves on CIFAR10, ImageNet, and MNIST: training objective and test/validation top-1 error vs. training epochs (some panels on a log scale).]