BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning
Ziming Zhang*, Yuanwei Wu* and Guanghui Wang (* denotes equal contributions)
Problem: Localizing global optima in deep learning.
Contributions: 1) Explored the possibility of locating global optimality
in DL from the algorithmic perspective;2) Proposed an efficient SGD-based solver, BPGrad,
with Branch & Pruning;3) Empirically good performance of BPGrad solver on
object recognition, detection, and segmentation.
Branch & Pruning:
Approximate DL solver based on BPGrad:
A1. Minimizing distortion is more important than minimizing step sizes, i.e. $\gamma \ll 1$;
A2. $\mathcal{X}$ is sufficiently large so that $\exists\, \eta_t \ge 0$ with $x_{t+1} = x_t - \eta_t \nabla\tilde{f}(x_t) \in \mathcal{X} \setminus \mathcal{X}_R(t)$ always holding, where $\mathcal{X}_R(t)$ is the removable (pruned) parameter space at step $t$;
A3. $\eta_t \ge 0$ is always sufficiently small for local update;
A4. $x_{t+1}$ can be sampled based only on $x_t$ and $\nabla\tilde{f}(x_t)$.
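Under A1–A4, the constrained sampling problem collapses to a plain (stochastic) gradient step whose size is the smallest one satisfying the pruning constraint in Eq. (4) below. A minimal Python sketch of one such update; the function names and the toy quadratic are illustrative, not the authors' implementation:

```python
import numpy as np

def bpgrad_step(x, f, grad_f, L, rho, f_best):
    """One approximate BPGrad update (sketch, under A1-A4).

    The step size eta_t is the smallest one that moves x_{t+1} out of
    the pruning ball around x_t:
        eta_t = (f(x_t) - rho * f_best) / (L * ||grad||_2).
    """
    g = grad_f(x)
    eta = max(f(x) - rho * f_best, 0.0) / (L * np.linalg.norm(g) + 1e-12)
    x_next = x - eta * g                     # A2/A4: gradient-based sampling
    return x_next, min(f_best, f(x_next))    # track the running minimum

# Toy usage on f(x) = ||x||^2 (illustrative only):
f = lambda x: float(np.sum(x ** 2))
grad_f = lambda x: 2.0 * x
x = np.ones(3)
f_best = f(x)
for _ in range(100):
    x, f_best = bpgrad_step(x, f, grad_f, L=10.0, rho=0.9, f_best=f_best)
```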
Estimation of Lipschitz Constant $L$:
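The transcript omits the estimation procedure itself; one natural estimator implied by the Lipschitz definition (a sketch under that assumption, not necessarily the authors' method) lower-bounds $L$ by the largest ratio observed over random pairs:

```python
import numpy as np

def estimate_lipschitz(f, dim, n_samples=1000, scale=1.0, seed=0):
    """Empirical lower bound on the Lipschitz constant L (sketch).

    Samples random pairs (x1, x2) and returns the largest observed
    ratio |f(x1) - f(x2)| / ||x1 - x2||_2; any true L must be at
    least this large.
    """
    rng = np.random.default_rng(seed)
    L_hat = 0.0
    for _ in range(n_samples):
        x1 = rng.normal(scale=scale, size=dim)
        x2 = rng.normal(scale=scale, size=dim)
        dist = np.linalg.norm(x1 - x2)
        if dist > 1e-12:
            L_hat = max(L_hat, abs(f(x1) - f(x2)) / dist)
    return L_hat
```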
Momentum helps the solver evolve toward global optima:
Global Convergence of BPGrad Solver:
Object Recognition: CIFAR10, ImageNet, MNIST
• Numerical test (Li & Yuan, NIPS'17)
  * Two-layer neural network
  * 10,302 parameters
  * On MNIST with 20 epochs and batch size of 200
• SGD vs. BPGrad
  * Momentum coefficient 0.9
  * Best performance
  * Euclidean distance between the solutions is 0.6
[Figure: two-layer ReLU network with identity link (Li & Yuan, NIPS'17): Input $x$ → $W^\top x$ plus identity link $+x$ → ReLU$\big((I+W)^\top x\big)$ → Output $W^\top x$.]
Lipschitz Continuity: $|f(x_1) - f(x_2)| \le L\|x_1 - x_2\|_2, \ \forall x_1, x_2 \in \mathcal{X}$
$f(x_i) - L\|x_i - x^*\|_2 \le f^* \le \min_{j=1,\cdots,t} f(x_j), \ \forall i \in [t]$

[Figure: ball of radius $\frac{1}{L}\big(f(x_i) - \min_{j=1,\cdots,t} f(x_j)\big)$ centered at $x_i$, with the distance $\|x_i - x^*\|_2$ to the global minimizer $x^*$ marked.]

Any solution inside this ball can be safely removed (pruned): by Lipschitz continuity, its objective value cannot improve on the current minimum $\min_{j=1,\cdots,t} f(x_j)$.
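A minimal Python sketch of this pruning test (variable names are illustrative): a candidate closer to some evaluated $x_i$ than the ball radius cannot beat the current minimum and is discarded.

```python
import numpy as np

def can_prune(x, xs, fs, L):
    """True if candidate x lies inside some pruning ball (sketch).

    xs, fs: previously evaluated points and their objective values.
    If ||x - x_i||_2 < (f(x_i) - min_j f(x_j)) / L for some i, then
    Lipschitz continuity gives f(x) > min_j f(x_j), so x is prunable.
    """
    f_min = min(fs)
    return any(np.linalg.norm(x - x_i) < (f_i - f_min) / L
               for x_i, f_i in zip(xs, fs))
```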
BPGrad Algorithm for Lipschitz functions:
$\min_{x_{t+1}\in\mathcal{X},\,\eta_t\ge 0}\ \big\|x_{t+1} - \big(x_t - \eta_t\nabla\tilde{f}(x_t)\big)\big\|_2^2 + \gamma\eta_t^2, \quad (5)$

$\text{s.t.}\ \max_{i=1,\cdots,t}\big\{f(x_i) - L\|x_i - x_{t+1}\|_2\big\} \le \rho\min_{i=1,\cdots,t} f(x_i). \quad (4)$
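To see how constraint (4) pins down the step size along the gradient ray, substitute the update from A2, $x_{t+1} = x_t - \eta_t\nabla\tilde{f}(x_t)$, into (4) at $i = t$ (a sketch of the reasoning under A1–A3):

$f(x_t) - L\,\eta_t\,\|\nabla\tilde{f}(x_t)\|_2 \le \rho\min_{i=1,\cdots,t} f(x_i) \ \Longrightarrow\ \eta_t \ge \dfrac{f(x_t) - \rho\min_{i=1,\cdots,t} f(x_i)}{L\,\|\nabla\tilde{f}(x_t)\|_2},$

and A3 (take the smallest sufficient step) fixes $\eta_t$ at this bound, which is the step size used in the solver sketch above.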
👍
• Global optimality
• Tightness bound guarantee
• Provable convergence
👎
• Tracking of infeasible solutions
• Exponential samples
[Figure: branch & bound tree for $\min_{x\in\mathcal{X}} f(x)$: the space $\mathcal{X}$ is split into subspaces $\mathcal{X}_1, \mathcal{X}_2$ (branching); each node carries a lower bound $B_\ell$, and a branch with $B_\ell > T_{min}$ is pruned; at a leaf $x$, if $B_\ell \le T_{min}$ then $T_{min} = \min(T_{min}, f(x))$, otherwise $x$ is removed.]
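For contrast, a generic branch & bound loop matching the tree above (a Python sketch; the interval splitting and the Lipschitz lower bound are illustrative choices, not the paper's method):

```python
import math

def branch_and_bound(f, lower_bound, lo, hi, tol=1e-3):
    """Generic branch & bound over [lo, hi] (sketch).

    lower_bound(a, b) must underestimate f on the subinterval [a, b].
    Branches with bound B_l > T_min are pruned; at leaves (small
    intervals), the incumbent T_min is updated.
    """
    T_min = f((lo + hi) / 2)            # incumbent best value
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if lower_bound(a, b) > T_min:   # B_l > T_min: prune this branch
            continue
        if b - a < tol:                 # leaf: update incumbent
            T_min = min(T_min, f((a + b) / 2))
        else:                           # branch into two subspaces
            m = (a + b) / 2
            stack += [(a, m), (m, b)]
    return T_min

# Example: Lipschitz lower bound f(mid) - L*(b-a)/2 with L = 3.5.
f = lambda x: math.sin(3 * x) + 0.5 * x
lb = lambda a, b: f((a + b) / 2) - 3.5 * (b - a) / 2
print(branch_and_bound(f, lb, -2.0, 2.0))
```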
Assumptions:
1) The objective $f$ has both lower and upper bounds;
2) $f$ is differentiable in the parameter space;
3) $f$ is Lipschitz continuous, or can be approximated by Lipschitz functions, with constant $L \ge 0$.
[Figure: verification of global convergence on CIFAR10 with $L=50$: left-hand vs. right-hand side values of the convergence condition over 2000 iterations, with momentum coefficient 0.9 (left) and 0 (right).]
Object Detection & Segmentation:
[Figure: Average Precision (AP, %) on the VOC2007 test dataset, axis 25–85, comparing Adagrad, RMSProp, Adam, and BPGrad.]
[Figure: CIFAR10 — test top-1 error (0.1–0.5) and training objective (0–2) vs. epochs, 100 epochs.]
[Figure: ImageNet — training objective (1–6) and validation top-1 error (0.4–0.9) vs. epochs, 20 epochs.]
[Figure: MNIST — test top-1 error ($10^{-3}$–$10^{-1}$) and training objective ($10^{-3}$–$10^{0}$) vs. epochs, 50 epochs, log scale.]