
LeGR: Filter Pruning via Learned Global Ranking

Ting-Wu Chin, Ruizhou Ding, Diana Marculescu
ECE Dept., Carnegie Mellon University

{tingwuc, rding, dianam}@cmu.edu

Cha Zhang
Microsoft Research

[email protected]

Abstract

Filter pruning has been shown to be effective for learning resource-constrained convolutional neural networks (CNNs). However, prior methods for resource-constrained filter pruning have limitations that hinder their effectiveness and efficiency. When searching for constraint-satisfying CNNs, prior methods either alter the optimization objective or adopt local search algorithms with heuristic parameterization, which are sub-optimal, especially in the low-resource regime. From the efficiency perspective, prior methods are often costly when searching for constraint-satisfying CNNs. In this work, we propose learned global ranking, dubbed LeGR¹, which improves upon prior art in the two aforementioned dimensions. Inspired by theoretical analysis, LeGR is parameterized to learn layer-wise affine transformations over the filter norms to construct a learned global ranking. With a global ranking, resource-constrained filter pruning at various constraint levels can be done efficiently. We conduct extensive empirical analyses to demonstrate the effectiveness of the proposed algorithm with ResNet and MobileNetV2 networks on CIFAR-10, CIFAR-100, Bird-200, and ImageNet datasets. Code is publicly available at https://github.com/cmu-enyac/LeGR.

1 Introduction

With the ubiquity of mobile and edge devices, it has become desirable to bring the performance of convolutional neural networks (CNNs) to edge devices without going through the cloud, especially due to privacy and latency considerations. However, mobile and edge devices are often characterized by stringent resource constraints, such as energy consumption and model size. Furthermore, depending on the application domain, the number of floating-point operations (FLOPs) or latency may also need to be constrained. To simplify the discussion, we use FLOPs as the target resource throughout the paper, keeping in mind that it can be replaced with any other resource of interest (e.g., model size, inference runtime, energy consumption, etc.).

Filter pruning is one of the techniques that can trade off accuracy for FLOPs to arrive at CNNs with various FLOPs of interest. Filter pruning has received growing interest for three reasons. First, it is a simple and systematic way to explore the trade-offs between accuracy and FLOPs. Second, it is easy to apply to various CNNs. Lastly, the pruned network can be accelerated using modern deep learning frameworks without specialized hardware or software support. The key research question in the filter pruning literature is to provide a systematic algorithm that better trades accuracy for hardware metrics such as FLOPs.

Besides the quality of the trade-off curve given by a filter pruning algorithm, the cost of obtaining the curve is also of interest when using filter pruning algorithms in production. When building a system powered by CNNs, e.g., an autonomous robot, the CNN designer often does not know a priori what the most suitable constraint level is, but rather proceeds in a trial-and-error fashion with some constraint levels in mind. Thus, obtaining the trade-off curve faster is also of interest.

¹ Pronounced as “ledger”.

Preprint. Work in progress.

arXiv:1904.12368v1 [cs.CV] 28 Apr 2019


Most of the prior methods in filter pruning provide a non-intuitive tunable parameter that trades off accuracy for FLOPs [1, 2, 3, 4, 5, 6]. However, when the goal is to obtain a model at a target constraint level, these algorithms generally require many rounds of tuning of that parameter, which is time-consuming. Recently, algorithms targeting the resource-constrained setting have been proposed [7, 8]; however, several limitations hinder their effectiveness. Specifically, MorphNet [8] alters the optimization target by adding ℓ1 regularization when searching for a constraint-satisfying CNN. On the other hand, AMC [7] uses reinforcement learning to search for a constraint-satisfying CNN; however, its parameterization of the search space (i.e., the state space and action space) involves many human-induced heuristics. The aforementioned limitations are more pronounced in the low-FLOPs regime, which leads to less accurate CNNs given a target constraint. Additionally, prior art is costly in searching for constraint-satisfying CNNs.

In this research, we introduce a different parameterization for searching for constraint-satisfying CNNs. Our parameterization is inspired by a theoretical analysis of the loss difference between the pre-trained model and the pruned-and-fine-tuned model. Rather than searching for the percentage of filters to prune for each layer [7], we search for layer-wise affine transformations over filter norms such that the transformed filter norms can rank filters globally across layers. Beyond the better empirical results shown in Sec. 4, the global ranking structure provides an efficient way to explore CNNs at different constraint levels, which can be done simply by thresholding the bottom-ranked filters. We show empirically that our proposed method outperforms prior art in resource-constrained filter pruning. We justify our findings with extensive empirical analyses using ResNet and MobileNetV2 on CIFAR-10/100, Bird-200, and ImageNet datasets. The main contributions of this work are as follows:

• We propose a novel parameterization, dubbed LeGR, for searching for resource-constrained CNNs with filter pruning. LeGR is inspired by the theoretical analysis of the loss difference between the pre-trained model and the pruned-and-fine-tuned model.

• We show empirically that LeGR outperforms prior art in resource-constrained filter pruning with CIFAR, Bird-200, and ImageNet datasets using VGG, ResNet, and MobileNetV2.

2 Related Work

2.1 Neural Architecture Search

Recently, neural architecture search has been adopted for identifying neural networks that not only achieve good accuracy, but are also constrained by compute-related metrics, such as size or FLOPs. Within this domain there are various recent approaches [9, 10, 11, 12, 13, 14, 15, 16]; however, most of them are much more expensive compared to top-down approaches such as quantization [17, 18, 19] and the filter pruning discussed in this paper.

2.2 Filter Pruning

We group prior art in filter pruning into two categories depending on whether it learns non-trivial sparsity across layers. We note that the sparsity of a layer discussed in this work is defined as the percentage of pruned filters in a layer relative to its pre-trained model.

Expert-designed or Uniform Sparsity Most literature that falls within this category focuses on proposing a metric to evaluate the importance of filters within a layer. For example, past work has used the ℓ2-norm of filter weights [20], the ℓ1-norm of filter weights [21, 6], the error contribution toward the output of the current layer [4, 5], and the variance of the max activation [22]. While the aforementioned work proposes insightful metrics to determine filter importance within a layer, the number of filters or channels to be pruned for each layer is either hand-crafted, which is not systematic and requires expert knowledge, or uniformly distributed across layers, which is sub-optimal.
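As an illustration of the intra-layer norm metrics above, the following sketch computes per-filter ℓ1 and ℓ2 norms for a convolutional layer in PyTorch; the layer shapes and per-layer budget are placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical convolutional layer: 64 filters, 3 input channels, 3x3 kernels.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

# Flatten each filter (one row per output channel) and compute its norm.
weights = conv.weight.detach().view(conv.out_channels, -1)
l1_scores = weights.abs().sum(dim=1)      # l1-norm per filter, as in [21, 6]
l2_scores = weights.norm(p=2, dim=1)      # l2-norm per filter, as in [20]

# Within a layer, the lowest-scoring filters are the pruning candidates.
num_to_prune = 16                          # e.g., a hand-crafted per-layer budget
prune_idx = torch.argsort(l2_scores)[:num_to_prune]
print(prune_idx)
```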

Learned Sparsity In this category, there are two groups of work. The first group merges pruning with training using sparsity-induced regularization in the loss function, while the second group determines which filters to prune based on information extracted from the pre-trained model.

In the group of joint optimization, prior methods merge the training of a CNN with pruning by adding a sparsity-induced regularization term to the loss function. In this fashion, the sparsity can be learned as a by-product of the optimization [23, 24, 1, 2, 8]. With the regularizer controlling the final sparsity, tuning the hyper-parameter of the regularizer to achieve a specific target constraint can be expensive. Moreover, we find that obtaining networks in the low-FLOPs regime using larger ℓ1 regularization has large variance.

The second group of methods proceeds from a pre-trained model. Lin et al. [25] use a first-order Taylor approximation to globally rank and prune filters. Zhuang et al. [26] propose to use the impact on an auxiliary classification loss of each channel to evaluate the importance of that channel and prune gradually and greedily. Yang et al. [27] use ADMM to search for energy-constrained models. Yang et al. [28] use ℓ2 filter norms to determine which filters to prune for each layer and determine the best layer to prune by evaluating all the layers. Molchanov et al. [3] propose to use a scaled first-order Taylor approximation to evaluate the importance of a filter and prune one filter at a time. While pruning iteratively and greedily produces non-trivial layer sparsity, it is often costly and sub-optimal. To counteract greedy solutions, He et al. [7] use the ℓ2 norm to rank filters within each layer and use reinforcement learning to search for the layer-wise sparsity given a target constraint. However, their parameterization of the search space is based on heuristics. In this research, instead of searching for the layer-wise sparsity directly with a heuristic parameterization, we search for layer-wise affine transformations over filter norms, which is theoretically inspired as we illustrate in the following section.

3 Learned Global Ranking

Figure 1: Workflow of LeGR. ‖Θ‖2 represents the filter norm. LeGR uses an evolutionary algorithm to iteratively find better layer-wise affine transformations (an α-κ pair, as described in Sec. 3.2). The found affine transformations are used to obtain the pruned networks at various user-specified resource constraint levels.

In this section, we introduce the development of our proposed parameterization toward learning the sparsity for each layer. Specifically, we propose learned global ranking, or LeGR, which learns layer-wise affine transformations over filter norms such that the transformed filter norms can rank filters across layers. The overall pruning flow of LeGR is shown in Fig. 1. Given the learned affine transformations (an α-κ pair), the filter norms of a pre-trained model are transformed and compared across layers. With a user-defined constraint level, pruning simply thresholds out the bottom-ranked filters until the constraint is satisfied. The pruned network is then fine-tuned to obtain the final pruned network. We note that the affine transformations for a network only need to be learned once and can be re-used many times for different user-defined constraint levels.

The rationale behind the proposed algorithm stems from minimizing a surrogate of a derived upper bound for the loss difference between (1) the pruned-and-fine-tuned CNN and (2) the pre-trained CNN.

3.1 Problem Formulation

To develop such a method, we treat filter pruning as an optimization problem with the objective of minimizing the loss difference between (1) the pruned-and-fine-tuned model and (2) the pre-trained model. Concretely, we would like to solve for the filter masking binary variables z ∈ {0, 1}^K, with K being the number of filters. If a filter k is pruned, the corresponding mask will be zero (z_k = 0); otherwise it will be one (z_k = 1). Thus, we have the following optimization problem:

$$
\begin{aligned}
\min_{z}\quad & L\Big(\Theta \odot z - \eta \sum_{j=1}^{\tau} \Delta w^{(j)} \odot z\Big) - L(\Theta) \\
\text{s.t.}\quad & C(z) \le \zeta_{\mathrm{FLOPs}},
\end{aligned}
\tag{1}
$$

where Θ denotes all the filters of the CNN, $L(\Theta) = \frac{1}{|D|}\sum_{(x,y)\in D} \mathcal{L}(f(x\,|\,\Theta), y)$ denotes the loss over the filters, with x and y being the input and label, respectively. D denotes the training data, f is the CNN model, and $\mathcal{L}$ is the loss function for prediction (e.g., cross-entropy loss). η denotes the learning rate, τ denotes the number of gradient updates, ∆w^(j) denotes the gradient with respect to the filter weights computed at step j, and ⊙ denotes element-wise multiplication. On the constraint side, C(·) is the modeling function for FLOPs and ζFLOPs is the desired FLOPs constraint. By fine-tuning, we mean updating the filter weights with stochastic gradient descent (SGD) for τ steps.

Let us assume the loss function L is Ω_l-Lipschitz continuous for the l-th layer of the CNN; then the following holds:

$$
\begin{aligned}
& L\Big(\Theta \odot z - \eta \sum_{j=1}^{\tau} \Delta w^{(j)} \odot z\Big) - L(\Theta) \\
&\le L(\Theta \odot z) + \sum_{i=1}^{K} \Omega_{l(i)}\,\eta\,\Big\|\sum_{j=1}^{\tau} \Delta w_i^{(j)} z_i\Big\| - L(\Theta) \\
&\le \sum_{i=1}^{K} \Omega_{l(i)} \|\Theta_i\|\, h_i + \sum_{i=1}^{K} \Omega_{l(i)}^2\,\eta\tau\, z_i \\
&= \sum_{i=1}^{K} \Big(\big(\Omega_{l(i)} \|\Theta_i\| - \Omega_{l(i)}^2\,\eta\tau\big)\, h_i + \Omega_{l(i)}^2\,\eta\tau\Big),
\end{aligned}
\tag{2}
$$

where l(i) is the layer index for the i-th filter, h = 1 − z, and ‖·‖ denotes the ℓ2 norm.

On the constraint side of equation (1), let R̂_{l(i)} be the FLOPs of the layer l(i) in which filter i resides. Analytically, the FLOPs of a layer depend linearly on the number of unpruned filters in its preceding layer:

$$
\hat{R}_{l(i)} = u_{l(i)} \,\big\|\{z_j \;\forall\, j \in P(l(i))\}\big\|_0, \qquad u_{l(i)} \ge 0,
\tag{3}
$$

where P(l(i)) returns the set of filter indices for the layer that precedes layer l(i) and u_{l(i)} is a layer-dependent positive constant. Let R_{l(i)} denote the FLOPs of layer l(i) for the pre-trained network (z = 1); one can see from equation (3) that R̂_{l(i)} ≤ R_{l(i)} ∀ i, z. Thus, the following holds:

$$
C(1 - h) = \sum_{i=1}^{K} \hat{R}_{l(i)} (1 - h_i) \le \sum_{i=1}^{K} R_{l(i)} (1 - h_i).
\tag{4}
$$

Based on equations (2) and (4), instead of minimizing equation (1), we minimize its upper bound in a Lagrangian form. That is,

$$
\min_{h} \;\sum_{i=1}^{K} \big(\alpha_{l(i)} \|\Theta_i\| + \kappa_{l(i)}\big)\, h_i,
\tag{5}
$$

where α_{l(i)} = Ω_{l(i)} and κ_{l(i)} = ητΩ²_{l(i)} − λR_{l(i)}. To guarantee that the solution satisfies the constraint, we rank all the filters by their scores s_i = α_{l(i)} ‖Θ_i‖ + κ_{l(i)} ∀ i and threshold out the bottom-ranked (smallest-score) filters until the constraint C(1 − h) ≤ ζFLOPs is satisfied, pruning no more filters than necessary (i.e., ‖h‖0 is minimized). We term this process of thresholding until the constraint is met LeGR-Pruning.
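To make LeGR-Pruning concrete, below is a minimal sketch of the thresholding step in Python. It assumes the per-filter norms, per-layer α and κ values, and a per-filter FLOPs table are already available; the simplified per-filter FLOPs accounting is an assumption for illustration, not the paper's exact cost model.

```python
import numpy as np

def legr_pruning(filter_norms, layer_ids, alpha, kappa, flops_per_filter, zeta):
    """Prune the globally lowest-scored filters until FLOPs <= zeta * total FLOPs.

    filter_norms:     (K,) array of filter norms ||Theta_i||
    layer_ids:        (K,) array, layer index l(i) of each filter
    alpha, kappa:     (L,) arrays of per-layer affine parameters
    flops_per_filter: (K,) array, simplified per-filter FLOPs contribution (assumption)
    zeta:             target FLOPs ratio, e.g. 0.5
    """
    scores = alpha[layer_ids] * filter_norms + kappa[layer_ids]
    keep = np.ones(len(scores), dtype=bool)          # z: 1 = keep, 0 = prune
    budget = zeta * flops_per_filter.sum()

    # Threshold out filters from the bottom of the global ranking.
    for i in np.argsort(scores):
        if flops_per_filter[keep].sum() <= budget:
            break
        keep[i] = False
    return keep

# Example with random placeholders (not real network statistics).
rng = np.random.default_rng(0)
K, L = 200, 4
layer_ids = rng.integers(0, L, size=K)
mask = legr_pruning(rng.random(K), layer_ids,
                    alpha=np.ones(L), kappa=np.zeros(L),
                    flops_per_filter=rng.random(K), zeta=0.5)
print(mask.sum(), "filters kept")
```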

3.2 Learning Global Ranking

Since we do not know the layer-wise Lipschitz constants a priori, we treat α and κ as latent variables to be estimated. We assume that the more accurate the estimates are, the better the network obtained by LeGR-Pruning performs on the original objective, i.e., equation (1).

4

Page 5: LeGR: Filter Pruning via Learned Global Ranking · 2019. 4. 30. · LeGR: Filter Pruning via Learned Global Ranking Ting-Wu Chin, Ruizhou Ding, Diana Marculescu ECE Dept., Carnegie

Algorithm 1 LeGR
Input: model Θ, constraint level ζ, random walk size σ, total search iterations E, sample size S, mutation ratio u, population size P, fine-tune iterations τ̂
Output: α, κ
Initialize Pool to a size-P queue
for e = 1 to E do
    α = 1, κ = 0
    if Pool has S samples then
        V = Pool.sample(S)
        α, κ = argmax Fitness(V)
    end if
    Layer = sample u% of the layers to mutate
    for l ∈ Layer do
        std_l = computeStd([M_i ∀ i ∈ l])
        α_l = α_l × α̂_l, where α̂_l ∼ e^{N(0, σ²)}
        κ_l = κ_l + κ̂_l, where κ̂_l ∼ N(0, std_l)
    end for
    z = LeGR-Pruning(α, κ, ζ)
    Θ̂ = finetune(Θ ⊙ z, train_set, τ̂)
    Fitness = accuracy(Θ̂, val_set)
    Pool.replaceOldestWith(α, κ, Fitness)
end for

Specifically, to estimate the α-κ pair, we use the regularized evolutionary algorithm proposed in [29] for its effectiveness in the neural architecture search problem. We treat each α-κ pair of latent variables as a network architecture, which can be obtained by the flow introduced in Fig. 1. Once a pruned architecture is obtained, we fine-tune the resulting architecture for τ̂ gradient steps and use its accuracy on the validation set as the fitness for the corresponding α-κ pair. We note that we use τ̂ instead of τ for approximation, and we empirically find that τ̂ = 200 (200 gradient updates) works well under the pruning setting across the datasets and networks we study.
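The fitness evaluation described above can be sketched as follows in PyTorch; the data loaders, loss, and learning rate are placeholders rather than the paper's settings, and pruning is emulated by re-zeroing masked filters via an assumed `filter_mask.apply` helper rather than by physically removing them.

```python
import torch

def fitness(model, filter_mask, train_loader, val_loader, tau_hat=200, lr=0.01):
    """Fine-tune a masked model for tau_hat SGD steps and return validation accuracy."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    model.train()
    step = 0
    while step < tau_hat:
        for x, y in train_loader:
            if step >= tau_hat:
                break
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            opt.step()
            filter_mask.apply(model)   # re-zero pruned filters after each update (assumed helper)
            step += 1

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```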

Hence, in our regularized evolutionary algorithm, as shown in Algorithm 1, we first generate a pool of candidates (α and κ) and record the fitness for each candidate, and then repeat the following steps: (i) sample a subset of the candidates, (ii) identify the fittest candidate, (iii) generate a new candidate by mutating the fittest candidate and measure its fitness accordingly, and (iv) replace the oldest candidate in the pool with the generated one. To mutate the fittest candidate, we randomly select a subset of the layers Layer and conduct one step of random walk from their current values, i.e., α_l, κ_l ∀ l ∈ Layer.
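A compact sketch of this regularized-evolution loop is given below. It assumes an `evaluate(alpha, kappa)` callable (e.g., LeGR-Pruning followed by the short fine-tune sketched above), uses a fixed-size deque so that appending automatically evicts the oldest candidate, and fills in plumbing (data structures, per-layer norm standard deviations, constant σ) that the paper does not spell out.

```python
import math
import random
from collections import deque

import numpy as np

def search_alpha_kappa(layer_norms, evaluate, E=336, P=64, S=16, sigma=1.0, u=0.1):
    """Regularized evolution over per-layer (alpha, kappa) pairs.

    layer_norms: list of lists, filter norms per layer (sets the mutation scale)
    evaluate:    callable (alpha, kappa) -> fitness
    """
    num_layers = len(layer_norms)
    stds = [max(1e-8, float(np.std(n))) for n in layer_norms]  # per-layer norm spread
    pool = deque(maxlen=P)                                     # oldest entries fall out

    best = (None, None, -1.0)
    for e in range(E):
        alpha, kappa = [1.0] * num_layers, [0.0] * num_layers
        if len(pool) >= S:
            sample = random.sample(list(pool), S)
            a, k, _ = max(sample, key=lambda c: c[2])          # fittest of the sample
            alpha, kappa = list(a), list(k)
        # Mutate a random u-fraction of the layers with one random-walk step.
        for l in random.sample(range(num_layers), max(1, int(u * num_layers))):
            alpha[l] *= math.exp(random.gauss(0.0, sigma))
            kappa[l] += random.gauss(0.0, stds[l])
        fit = evaluate(alpha, kappa)
        pool.append((tuple(alpha), tuple(kappa), fit))
        if fit > best[2]:
            best = (alpha, kappa, fit)
    return best
```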

4 Evaluations

4.1 Datasets and Training Setting

Our work is evaluated on various image classification benchmarks including CIFAR-10/100 [30], ImageNet [31], and Bird-200 [32]. CIFAR-10/100 consists of 50k training images and 10k testing images with a total of 10/100 classes to be classified. ImageNet is a large-scale image classification dataset that includes 1.2 million training images and 50k testing images with 1k classes to be classified. We also benchmark the proposed algorithm in a transfer learning setting, since in practice we often want a small and fast model on some target dataset. Specifically, we use the Bird-200 dataset, which consists of 6k training images and 5.7k testing images covering 200 bird species.

For Bird-200, we use 10% of the training data as the validation set used for early stopping and to avoid over-fitting. The training scheme for CIFAR-10/100 follows [20], which uses stochastic gradient descent with Nesterov momentum [33], weight decay 5e−4, batch size 128, and an initial learning rate of 1e−1 decreased by 5x at epochs 60, 120, and 160, training for 200 epochs in total. For control experiments with CIFAR-100 and Bird-200, the fine-tuning-after-pruning setting is as follows: we keep all training hyper-parameters the same, but change the initial learning rate to 1e−2 and train for 60 epochs (i.e., τ ≈ 21k). We drop the learning rate by 10x proportionally at similar times, i.e., epochs 18, 36, and 48. To compare numbers with prior art on CIFAR-10 and ImageNet, we follow the number of iterations in [26]. Specifically, for CIFAR-10 we fine-tune for 400 epochs with initial learning rate 1e−2, dropped by 5x at epochs 120, 240, and 320. For ImageNet, we use pre-trained models and fine-tune the pruned models for 60 epochs with initial learning rate 1e−2, dropped by 10x at epochs 30 and 45.
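For reference, the CIFAR-10 fine-tuning schedule above can be expressed with a standard PyTorch optimizer and step scheduler as sketched below; only the learning rate, milestones, decay factor, and weight decay come from the text, while the momentum value and model plumbing are assumptions.

```python
import torch

def make_finetune_optimizer(model):
    # Schedule from the text: lr 1e-2, divided by 5 at epochs 120, 240, 320 (400 epochs total).
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=1e-2,
        momentum=0.9,            # assumed; the paper only states Nesterov SGD
        nesterov=True,
        weight_decay=5e-4,
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[120, 240, 320], gamma=0.2  # 5x drops
    )
    return optimizer, scheduler
```

In such a setup, `scheduler.step()` would be called once per epoch after the optimizer updates.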

For the hyper-parameters of LeGR, we select τ̂ = 200, meaning we fine-tune for 200 gradient steps before measuring the validation accuracy when searching for the α-κ pair. We note that we do the same for AMC [7] for a fair comparison. Moreover, we set the number of architectures explored to be the same as AMC, i.e., 400. Among the 400 searched architectures, we set the pool size P = 64 and the number of search iterations E = 336. We set the sample size of the evolutionary algorithm to S = 16, which follows [29]. The exploration σ is set to 1 with a linear decrease, and the mutation ratio u is set to 0.1, i.e., 10% of the layers are sampled for mutation. In the following experiments, we use the smallest ζFLOPs considered to search for the latent variables α and κ, and use the found α-κ pair to obtain the pruned networks at various constraint levels. For example, for ResNet-56 with CIFAR-100 (Fig. 3), we use ζFLOPs = 20% to obtain the α-κ pair and use the same α-κ pair to obtain the seven networks (ζFLOPs = 20%, ..., 80%) with the flow described in Fig. 1.
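Continuing the hypothetical helpers sketched in Sec. 3, reusing one learned α-κ pair across constraint levels amounts to the loop below; it is illustrative glue rather than a standalone script, and the α-κ values shown are placeholders.

```python
import numpy as np

# Suppose the search (Algorithm 1) at the tightest constraint returned these per-layer values.
alpha = np.array([1.3, 0.9, 1.1, 0.7])     # illustrative placeholders, not learned values
kappa = np.array([0.0, 0.2, -0.1, 0.4])

masks = {}
for zeta in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    # legr_pruning is the hypothetical helper sketched in Sec. 3.1.
    masks[zeta] = legr_pruning(filter_norms, layer_ids, alpha, kappa,
                               flops_per_filter, zeta)
# Each mask is then fine-tuned independently to obtain the final pruned network.
```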

4.2 LeGR vs. AMC

Figure 2: Comparison between AMC and (DDPG-)LeGR. The dotted black line marks the time at which we start training both the Actor and Critic neural networks.

We first discuss the parameterization of LeGR versus AMC using the same solver (DDPG [34]). We use DDPG in a sequential fashion that follows [7], while our state space and action space are different. LeGR requires two continuous actions (i.e., α_l and κ_l) for layer l, while AMC needs only one action (i.e., sparsity). We conduct the comparison by pruning ResNet-56 to 50% of its original FLOPs targeting CIFAR-100 with τ̂ = 0 and hyper-parameters following [7]. As shown in Fig. 2, while both LeGR and AMC outperform random search, LeGR converges faster to a better solution. Moreover, the solutions explored by LeGR are always tight to the user-defined constraint, while AMC spends time exploring solutions that are not tight. Since there is a trade-off between accuracy and FLOPs for filter pruning, it is better to explore more architectures that are tight to the constraint.

4.3 CIFAR-100 Results

Figure 3: The trade-off curve of pruning ResNet-56 and MobileNetV2 on CIFAR-100 using various methods. We average across three trials and plot the mean and standard deviation.

Figure 4: Training cost for learning seven resource-constrained CNNs using various methods targeting ResNet-56 on CIFAR-100. We report the average training cost considering seven constraint levels, i.e., 20% to 80% FLOPs in steps of 10%, on an NVIDIA GTX 1080 Ti.


We compare LeGR with resource-constrained filter pruning methods, i.e., MorphNet [8], AMC [7], and a baseline that prunes filters uniformly across layers, using ResNet-56 and MobileNetV2. As shown in Fig. 3, we find that all of the approaches outperform the uniform baseline in the high-FLOPs regime. However, both AMC and MorphNet have higher variance when pruning more aggressively. For both CNNs, LeGR outperforms the prior art, especially in the low-FLOPs regime.

On the efficiency side, we measure the average time an algorithm takes to generate a data point of Fig. 3 for ResNet-56 using our hardware (i.e., an NVIDIA GTX 1080 Ti). Figure 4 shows the efficiency of AMC, MorphNet, and the proposed LeGR. The learning cost can be dissected into two parts: (1) pruning: the time it takes to search for a constraint-satisfying network, and (2) fine-tuning: the time it takes to fine-tune the weights of a pruned network. For MorphNet, we consider three trials for each constraint level to find an appropriate hyper-parameter λ that satisfies the constraint. The numbers are normalized to the pruning time of AMC. In terms of pruning time, LeGR is 7x and 5x faster than AMC and MorphNet, respectively. Considering the total learning time, LeGR is 3x and 2x faster than AMC and MorphNet, respectively. The efficiency comes from the fact that LeGR only searches for the α-κ pair once and re-uses it for all seven constraint levels. In contrast, both AMC and MorphNet have to search for constraint-satisfying networks for every new constraint level.

4.4 CIFAR-10 Results

We also compare, in Table 1, LeGR with the prior art that reports results on CIFAR-10. First, for ResNet-56, we find that LeGR outperforms most of the prior art in both the FLOPs and accuracy dimensions and performs similarly to [20, 26] in accuracy but with fewer FLOPs. For VGG-13, LeGR achieves significantly better results compared to the prior art.

Table 1: Comparison with prior art on CIFAR-10. We group methods into sections according to different FLOPs. Values for our approach are averaged across three trials and we report the mean and standard deviation. We use bold face to denote the best numbers and use * to denote our implementation. The accuracy is reported in the format pre-trained → pruned-and-fine-tuned.

NETWORK    METHOD           ACC. (%)               MFLOPS
ResNet-56  PF [6]           93.0 → 93.0            90.9 (72%)
           Taylor [3]*      93.9 → 93.2            90.8 (72%)
           LeGR             93.9 → 94.1±0.0        87.8 (70%)
           DCP-Adapt [26]   93.8 → 93.8            66.3 (53%)
           CP [4]           92.8 → 91.8            62.7 (50%)
           AMC [7]          92.8 → 91.9            62.7 (50%)
           DCP [26]         93.8 → 93.5            62.7 (50%)
           SFP [20]         93.6±0.6 → 93.4±0.3    59.4 (47%)
           LeGR             93.9 → 93.7±0.2        58.9 (47%)
VGG-13     BC-GNJ [23]      91.9 → 91.4            141.5 (45%)
           BC-GHS [23]      91.9 → 91.0            121.9 (39%)
           VIBNet [24]      91.9 → 91.5            70.6 (22%)
           LeGR             91.9 → 92.4±0.2        70.3 (22%)

4.5 ImageNet Results

For ImageNet, we prune ResNet-50 and MobileNetV2 with LeGR to compare with prior art. As shown in Table 2, LeGR is superior to prior art reporting on ResNet-50. Specifically, when pruning to 73% of the FLOPs, LeGR achieves even higher accuracy than the pre-trained model. When pruning to 58% and 47%, our algorithm achieves better results compared to prior art. Similarly, for MobileNetV2, LeGR achieves superior results compared to prior art.

4.6 Bird-200 Results

We analyze how LeGR performs in a transfer learning setting where we have a model pre-trained on a large dataset, i.e., ImageNet, and we want to transfer its knowledge to adapt to a smaller dataset, i.e., Bird-200.


Table 2: Summary of pruning on ImageNet. The sections are defined based on the number of FLOPs left. The accuracy is reported in the format pre-trained → pruned-and-fine-tuned.

NETWORK      METHOD       TOP-1         TOP-1 DIFF   TOP-5         TOP-5 DIFF   FLOPS (%)
ResNet-50    NISP [35]    - → -         -0.2         - → -         -            73
             LeGR         76.1 → 76.2   +0.1         92.9 → 93.0   +0.1         73
             SSS [36]     76.1 → 74.2   -1.9         92.9 → 91.9   -1.0         69
             ThiNet [5]   72.9 → 72.0   -0.9         91.1 → 90.7   -0.4         63
             GDP [25]     75.1 → 72.6   -2.5         92.3 → 91.1   -1.2         58
             SFP [20]     76.2 → 74.6   -1.6         92.9 → 92.1   -0.8         58
             LeGR         76.1 → 75.7   -0.4         92.9 → 92.7   -0.2         58
             NISP [35]    - → -         -0.9         - → -         -            56
             CP [4]       - → -         -            92.2 → 90.8   -1.4         50
             SPP [37]     - → -         -            91.2 → 90.4   -0.8         50
             LeGR         76.1 → 75.3   -0.8         92.9 → 92.4   -0.5         47
             DCP [26]     76.0 → 74.9   -1.1         92.9 → 92.3   -0.6         44
MobileNetV2  AMC [7]      71.8 → 70.8   -1.0         - → -         -            70
             LeGR         71.8 → 71.4   -0.4         - → -         -            70
             LeGR         71.8 → 70.8   -1.0         - → -         -            60
             DCP [26]     70.1 → 64.2   -5.9         - → -         -            55
             LeGR         71.8 → 69.4   -2.4         - → -         -            50

Figure 5: The trade-off curves for pruning ResNet-50 and MobileNetV2 on Bird-200 using various methods.

We prune the fine-tuned network on the target dataset directly, instead of pruning on the large dataset before transferring, for two reasons: (1) the user only cares about the performance of the network on the target dataset rather than the source dataset, which means we need the accuracy-FLOPs trade-off curve on the target dataset, and (2) pruning on a smaller dataset is much more efficient than pruning on a large dataset. We note that pruning directly on the target dataset has been adopted in prior art as well [38, 5]. Also, to avoid over-fitting, we use 10% of the training data as a validation set to pick the best model for testing. We first obtain a fine-tuned MobileNetV2 and ResNet-50 on the Bird-200 dataset with top-1 accuracy 80.2% and 79.5%, respectively. These numbers are comparable to those reported for ResNet-101 [39], VGG-16, and DenseNet-121 [40] in prior art. As shown in Fig. 5, we find that LeGR outperforms Uniform and AMC, which is consistent with our previous analyses. Moreover, it is interesting that MobileNetV2, a more compact model, outperforms ResNet-50 in both the accuracy and FLOPs dimensions under this setting.

5 Conclusion

In this work, we propose LeGR, a novel parameterization for searching for constraint-satisfying CNNs with filter pruning. The rationale behind LeGR stems from minimizing the upper bound of the loss difference between the pre-trained model and the pruned-and-fine-tuned model. Our empirical results show that LeGR outperforms prior art in resource-constrained filter pruning, especially in the low-FLOPs regime. Additionally, LeGR can be 7x and 5x faster than AMC and MorphNet, respectively, in searching for constraint-satisfying CNNs when multiple constraint levels are considered. We verify the effectiveness of LeGR using two kinds of CNNs, ResNet and MobileNetV2, on various datasets such as CIFAR, Bird-200, and ImageNet.


References

[1] Jianbo Ye, Xin Lu, Zhe Lin, and James Z Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. International Conference on Learning Representations (ICLR), 2018.
[2] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2755–2763. IEEE, 2017.
[3] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. International Conference on Learning Representations (ICLR), 2017.
[4] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In International Conference on Computer Vision (ICCV), volume 2, 2017.
[5] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. arXiv preprint arXiv:1707.06342, 2017.
[6] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. International Conference on Learning Representations (ICLR), 2017.
[7] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. arXiv preprint arXiv:1802.03494, 2018.
[8] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. MorphNet: Fast & simple resource-constrained structure learning of deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. ChamNet: Towards efficient network design through platform-aware model adaptation. 2019.
[10] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332, 2018.
[11] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. DPP-Net: Device-aware progressive search for Pareto-optimal neural architectures. arXiv preprint arXiv:1806.08198, 2018.
[12] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
[13] Chi-Hung Hsu, Shu-Huan Chang, Da-Cheng Juan, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Shih-Chieh Chang. MONAS: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.
[14] Yanqi Zhou, Siavash Ebrahimi, Sercan Ö Arık, Haonan Yu, Hairong Liu, and Greg Diamos. Resource-efficient neural architect. arXiv preprint arXiv:1806.07912, 2018.
[15] Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-Path NAS: Designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877, 2019.
[16] Dimitrios Stamoulis, Ting-Wu Rudy Chin, Anand Krishnan Prakash, Haocheng Fang, Sribhuvan Sajja, Mitchell Bognar, and Diana Marculescu. Designing adaptive neural networks for energy-constrained image classification. In Proceedings of the International Conference on Computer-Aided Design, page 23. ACM, 2018.
[17] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing activation distribution for training binarized deep networks. 2019.
[18] Ruizhou Ding, Zeye Liu, Ting-Wu Chin, Diana Marculescu, et al. FLightNNs: Lightweight quantized deep neural networks for fast and accurate inference. 2019.
[19] Jungwook Choi, Pierce I-Jen Chuang, Zhuo Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). arXiv preprint arXiv:1807.06964, 2018.
[20] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, pages 2234–2240, 2018.
[21] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922, 2017.
[22] Hong-Jun Yoon, Sarah Robinson, J Blair Christian, John X Qiu, and Georgia D Tourassi. Filter pruning of convolutional neural networks for text classification: A case study of cancer pathology report comprehension. In Biomedical & Health Informatics (BHI), 2018 IEEE EMBS International Conference on, pages 345–348. IEEE, 2018.
[23] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
[24] Bin Dai, Chen Zhu, and David Wipf. Compressing neural networks using the variational information bottleneck. arXiv preprint arXiv:1802.10399, 2018.
[25] Shaohui Lin, Rongrong Ji, Yuchao Li, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Accelerating convolutional networks via global & dynamic filter pruning. In IJCAI, pages 2425–2432, 2018.
[26] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 883–894, 2018.
[27] Haichuan Yang, Yuhao Zhu, and Ji Liu. ECC: Energy-constrained deep neural network compression via a bilinear regression model. arXiv preprint arXiv:1812.01803, 2018.
[28] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. arXiv preprint arXiv:1804.03230, 2018.
[29] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
[30] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[32] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[33] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[34] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.
[35] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I. Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S. Davis. NISP: Pruning networks using neuron importance score propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[36] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision (ECCV), September 2018.
[37] Huan Wang, Qiming Zhang, Yuehai Wang, and Haoji Hu. Structured probabilistic pruning for convolutional neural network acceleration. 2018.
[38] Yang Zhong, Vladimir Li, Ryuzo Okada, and Atsuto Maki. Target aware network adaptation for efficient representation learning. arXiv preprint arXiv:1810.01104, 2018.
[39] Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, and Jun Huan. DELTA: Deep learning transfer using feature map with attention for convolutional networks. In International Conference on Learning Representations, 2019.
[40] Arun Mallya and Svetlana Lazebnik. PackNet: Adding multiple tasks to a single network by iterative pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.


A Filter Norm vs. Taylor Approximation

LeGR is a method that uses norm-based global ranking to conduct pruning, which relates to prior art [25] that uses a first-order Taylor approximation to globally rank the filters. We specifically compare the norm of the filter weights and the first-order Taylor approximation of the loss increase caused by pruning (no fine-tuning considered [3, 25]) as the global ranking metric. Due to the large number of filters removed, the Taylor approximation can be erroneous. In Fig. 6, we plot the accuracy after pruning and before fine-tuning. It is clear that ranking filters by their norms consistently outperforms using the first-order Taylor expansion.

Figure 6: Comparison between the norm of filter weights and its first-order approximation of the loss difference in the global pruning context. We use ResNet-56 on CIFAR-100 for this experiment.
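The two global ranking metrics compared above can be sketched as follows for a single convolutional layer; the Taylor score follows the general form |∂L/∂Θ_i · Θ_i| used in gradient-based filter ranking, and the single-batch gradient estimate with a toy loss is a simplification for illustration.

```python
import torch
import torch.nn as nn

def global_scores(conv: nn.Conv2d, loss: torch.Tensor):
    """Return per-filter norm scores and first-order Taylor scores for one layer."""
    loss.backward(retain_graph=True)                      # populate conv.weight.grad
    w = conv.weight                                       # shape (out, in, kH, kW)
    norm_scores = w.detach().view(w.size(0), -1).norm(p=2, dim=1)
    taylor_scores = (w.grad * w).detach().view(w.size(0), -1).sum(dim=1).abs()
    return norm_scores, taylor_scores

# Toy usage: a random layer and a scalar "loss" built from its output.
conv = nn.Conv2d(3, 8, 3)
out = conv(torch.randn(4, 3, 32, 32))
norms, taylors = global_scores(conv, out.mean())
print(norms.shape, taylors.shape)
```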

B Ablation Study

Figure 7: Pruning ResNet-56 for CIFAR-100 with LeGR using pre-trained models trained for different numbers of epochs. The black dotted line indicates the learning rate schedule for training the pre-trained model.

Figure 8: Pruning ResNet-56 for CIFAR-100 with LeGR by solving for α and κ with different τ̂ and ζFLOPs.

B.1 Pre-trained Epochs

So far, we have discussed filter pruning assuming we have a pre-trained model to begin with. To study the impact of the pre-trained model, we train the pre-trained model for various numbers of epochs and analyze how this affects the accuracy of the LeGR-pruned model. Specifically, we fix the hyper-parameters used for training the pre-trained model and conduct early stopping. As shown in Fig. 7, we find that the pre-trained model does not need to converge to be pruned. The pruned model can achieve 68.4% top-1 accuracy when using a pre-trained model that is only trained for half of the training time (i.e., 100 epochs). In comparison, the pruned model that stems from the fully-trained model achieves 68.2% top-1 accuracy. Interestingly, the pre-trained models at epochs 100 and 190 have an 8.6% accuracy difference (63.3 vs. 71.9). This empirical evidence suggests that one might be able to obtain effective pruned models without paying the cost of training the full-blown model, which is costly.

B.2 Fine-tune Iterations

Since we use τ̂ to approximate τ when searching for the α-κ pair, it is expected that the closer τ̂ is to τ, the better the α-κ pair LeGR can find. In this subsection, we are interested in how τ̂ affects the performance of LeGR. Specifically, we use LeGR to prune ResNet-56 for CIFAR-100 and solve for the latent variables at three constraint levels ζFLOPs ∈ {10%, 30%, 50%}. For τ̂, we experiment with {0, 50, 200, 500}. We note that once the α-κ pair is found, we use LeGR-Pruning to obtain the resource-constrained CNN and fine-tune it for τ steps. In this experiment, τ = 21120. As shown in Fig. 8, the results align with our intuition that there are diminishing returns in increasing τ̂. For the considered network and dataset, fine-tuning for 50 SGD steps is enough to guide the learning of α and κ.
