
Channel Pruning and Other Methods for Compressing CNN

Yihui He

Xi’an Jiaotong Univ.

October 12, 2017


Background

Recent CNN acceleration works fall into three categories:
(1) optimized implementation (e.g., FFT, dictionary)
(2) quantization (e.g., BinaryNet)
(3) structured simplification (converting the CNN structure into a compact one)
We focus on the last one.


Background

[Figure: structured simplification methods that accelerate CNNs.] (a) A network with 3 conv layers. (b) Sparse connection deactivates some connections between channels. (c) Tensor factorization factorizes a convolutional layer into several pieces. (d) Channel pruning reduces the number of channels in each layer (the focus of this paper).


Sparse Connection

S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.


Tensor factorization

Accelerating very deep convolutional networks for classification and detection.
Speeding up convolutional neural networks with low rank expansions.


Our approach

[Figure: feature maps A, B, and C with nonlinearities in between; filters W with kernel size kh × kw take the c channels of B and produce the n channels of C; the dotted box marks the part our optimization operates on.]

We aim to reduce the width of feature map B while minimizing the reconstruction error on feature map C. Our optimization algorithm operates within the dotted box, which does not involve the nonlinearity. The figure illustrates the case where two channels are pruned from feature map B; thus the corresponding channels of the filters W can be removed. Furthermore, even though they are not directly optimized by our algorithm, the corresponding filters in the previous layer can also be removed (marked by dotted filters). c, n: number of channels of feature maps B and C; kh × kw: kernel size.


Optimization

Formally, to prune a feature map with c channels, we consider applying n × c × kh × kw convolutional filters W on N × c × kh × kw input volumes X sampled from this feature map, which produces an N × n output matrix Y. Here, N is the number of samples, n is the number of output channels, and kh, kw are the kernel size. For simplicity of presentation, the bias term is not included in our formulation. To prune the input channels from c down to a desired c′ (0 ≤ c′ ≤ c) while minimizing the reconstruction error, we formulate our problem as follows:

$$\arg\min_{\beta,\,W}\ \frac{1}{2N}\left\lVert Y-\sum_{i=1}^{c}\beta_i X_i W_i^{\top}\right\rVert_F^2 \quad \text{subject to}\quad \lVert\beta\rVert_0 \le c' \tag{1}$$

‖·‖F is the Frobenius norm. Xi is the N × khkw matrix sliced from the ith channel of the input volumes X, i = 1, ..., c. Wi is the n × khkw filter weights sliced from the ith channel of W. β is a coefficient vector of length c for channel selection, and βi is the ith entry of β. Notice that if βi = 0, Xi is no longer useful and can be safely pruned from the feature map; Wi can also be removed.
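For concreteness, the objective in Eqn. 1 can be written down directly in NumPy. The sketch below assumes the per-channel slices Xi and Wi are already available as arrays; the names reconstruction_error, X_slices, and W_slices are ours, not from the paper.

```python
# A minimal NumPy sketch of the objective in Eqn. 1 (illustrative names only).
import numpy as np

def reconstruction_error(Y, X_slices, W_slices, beta):
    """Return 1/(2N) * || Y - sum_i beta_i * X_i @ W_i^T ||_F^2.

    Y        : (N, n)   sampled output matrix
    X_slices : list of c arrays, each (N, kh*kw)  -- the i-th input channel
    W_slices : list of c arrays, each (n, kh*kw)  -- the i-th channel of W
    beta     : (c,)     channel-selection coefficients
    """
    N = Y.shape[0]
    approx = sum(b * Xi @ Wi.T for b, Xi, Wi in zip(beta, X_slices, W_slices))
    return np.linalg.norm(Y - approx, 'fro') ** 2 / (2 * N)
```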


Optimization

Solving the ℓ0-minimization problem in Eqn. 1 is NP-hard, so we relax the ℓ0 constraint to ℓ1 regularization:

$$\arg\min_{\beta,\,W}\ \frac{1}{2N}\left\lVert Y-\sum_{i=1}^{c}\beta_i X_i W_i^{\top}\right\rVert_F^2 + \lambda\lVert\beta\rVert_1 \quad \text{subject to}\quad \lVert\beta\rVert_0 \le c',\ \ \forall i\ \lVert W_i\rVert_F = 1 \tag{2}$$

λ is a penalty coefficient; by increasing λ, more entries of β become zero and one obtains a higher speed-up ratio. We also add the constraint ∀i ‖Wi‖F = 1 to this formulation, which avoids the trivial solution. We solve this problem in two alternating steps: first, we fix W and solve β for channel selection; second, we fix β and solve W to minimize the reconstruction error.
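A simplified skeleton of this alternating scheme is sketched below. It assumes NumPy arrays for the per-channel slices and takes the two sub-solvers as callables (sketched on the next slide); in practice λ would be increased gradually until ‖β‖0 ≤ c′ holds rather than running a fixed number of iterations, so this only illustrates the control flow.

```python
# A simplified skeleton of the alternating optimization (not the authors' code).
import numpy as np

def prune_layer(Y, X_slices, W_slices, c_prime, lam, solve_beta, solve_W, n_iter=10):
    """Alternate between the beta-subproblem (i) and the W-subproblem (ii)."""
    beta = np.ones(len(X_slices))
    for _ in range(n_iter):
        beta = solve_beta(Y, X_slices, W_slices, lam)   # (i) fix W, solve beta
        W_slices, beta = solve_W(Y, X_slices, beta)     # (ii) fix beta, solve W
        if np.count_nonzero(beta) <= c_prime:           # enough channels pruned
            break
    return beta, W_slices
```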


Optimization

(i) The subproblem of β. In this case W is fixed, and we solve β for channel selection:

$$\hat{\beta}^{\mathrm{LASSO}}(\lambda) = \arg\min_{\beta}\ \frac{1}{2N}\left\lVert Y-\sum_{i=1}^{c}\beta_i Z_i\right\rVert_F^2 + \lambda\lVert\beta\rVert_1 \quad \text{subject to}\quad \lVert\beta\rVert_0 \le c' \tag{3}$$

Here Zi = XiWi⊤ (of size N × n). The ith channel is ignored, and can be pruned, if βi = 0.
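One way to solve this subproblem is to vectorize each Zi and call an off-the-shelf LASSO solver. The sketch below uses scikit-learn's Lasso; the solver choice and scaling are our assumptions, not the implementation from the paper. It matches the solve_beta role in the skeleton above.

```python
# A possible implementation of subproblem (i) via scikit-learn (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

def lasso_select_channels(Y, X_slices, W_slices, lam):
    """Approximately solve Eqn. 3: argmin_beta ||Y - sum_i beta_i Z_i||_F^2 + lam*||beta||_1."""
    # Column i of Z holds vec(Z_i) = vec(X_i @ W_i^T), so the Frobenius-norm
    # problem becomes an ordinary vector LASSO over the c coefficients.
    Z = np.stack([(Xi @ Wi.T).ravel() for Xi, Wi in zip(X_slices, W_slices)], axis=1)
    # sklearn's Lasso minimizes 1/(2*n_samples)*||y - Z beta||_2^2 + alpha*||beta||_1,
    # which matches Eqn. 3 up to the 1/(2N) vs. 1/(2N*n) scaling of the data term,
    # so lam would need rescaling in practice.
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(Z, Y.ravel())
    return solver.coef_  # channels with beta_i == 0 can be pruned
```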

(ii) The subproblem of W. In this case β is fixed, and we use the selected channels to minimize the reconstruction error. The optimal solution can be found by least squares:

$$\arg\min_{W'}\ \left\lVert Y - X'\,(W')^{\top}\right\rVert_F^2 \tag{4}$$

Here X′ = [β1X1 β2X2 ... βiXi ... βcXc] (of size N × c·kh·kw), and W′ is the n × c·kh·kw reshaped W, W′ = [W1 W2 ... Wi ... Wc]. After obtaining W′, it is reshaped back to W. We then assign βi ← βi‖Wi‖F and Wi ← Wi/‖Wi‖F, so the constraint ∀i ‖Wi‖F = 1 is satisfied.
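This subproblem and the renormalization step can likewise be written in a few lines of NumPy. The sketch below (a hypothetical least_squares_reconstruct, matching the solve_W role above) uses a plain least-squares solve and is illustrative only.

```python
# A sketch of subproblem (ii): with beta fixed, solve Eqn. 4 by least squares,
# then renormalize so that ||W_i||_F = 1 (not the authors' implementation).
import numpy as np

def least_squares_reconstruct(Y, X_slices, beta):
    c = len(X_slices)
    khkw = X_slices[0].shape[1]
    # X' = [beta_1 X_1, ..., beta_c X_c], shape (N, c*kh*kw)
    X_prime = np.concatenate([b * Xi for b, Xi in zip(beta, X_slices)], axis=1)
    # Solve min_{W'} ||Y - X' W'^T||_F^2; lstsq returns W'^T of shape (c*kh*kw, n).
    W_prime_T, *_ = np.linalg.lstsq(X_prime, Y, rcond=None)
    W_slices, new_beta = [], np.array(beta, dtype=float)
    for i in range(c):
        Wi = W_prime_T[i * khkw:(i + 1) * khkw].T   # (n, kh*kw), i-th channel of W
        norm = np.linalg.norm(Wi, 'fro')
        if norm > 0:
            new_beta[i] *= norm                     # beta_i <- beta_i * ||W_i||_F
            Wi = Wi / norm                          # W_i    <- W_i / ||W_i||_F
        W_slices.append(Wi)
    return W_slices, new_beta
```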



Analysis

[Figure: increase of error (%) vs. speed-up ratio (1.0–4.0) for layers conv1_1, conv2_1, conv3_1, conv3_2, conv4_1, and conv4_2, comparing three channel-selection strategies: first k, max response, and ours.]



Results

Increase of top-5 error (1-view, baseline 89.9%):

Solution                                   2×      4×      5×
Jaderberg (Zhang 2015's impl.)             -       9.7     29.7
Asym.                                      0.28    3.84    -
Filter pruning (fine-tuned, our impl.)     0.8     8.6     14.6
Ours (without fine-tuning)                 2.7     7.9     22.0
Ours (fine-tuned)                          0       1.0     1.7

Table 1: Accelerating the VGG-16 model at speed-up ratios of 2×, 4×, and 5× (smaller is better).


Results

Increase of top-5 error (1-view, baseline 89.9%):

Solution                  4×     5×
Asym. 3D                  0.9    2.0
Asym. 3D (fine-tuned)     0.3    1.0
Our 3C                    0.7    1.3
Our 3C (fine-tuned)       0.0    0.3

Table 2: Performance of combined methods on the VGG-16 model at speed-up ratios of 4× and 5×. Our 3C solution outperforms previous approaches (smaller is better).


Absolute performance

Table 3: GPU acceleration comparison. We measure forward-pass time per image. Our approach generalizes well on GPU (smaller is better).


Accelerating Detection

2×, 4× acceleration for Faster R-CNN detection. mmAP is AP at IoU = .50:.05:.95 (the primary challenge metric of COCO [32]).
