
Destruction and Construction Learning for Fine-grained Image Recognition

Yue Chen1∗, Yalong Bai2∗, Wei Zhang3, Tao Mei4 (∗ equal contribution)

JD AI Research, Beijing, China
[email protected], [email protected], [email protected], [email protected]

Abstract

Delicate feature representation of object parts plays a critical role in fine-grained recognition. For example, experts can even distinguish fine-grained objects relying only on object parts, according to professional knowledge. In this paper, we propose a novel "Destruction and Construction Learning" (DCL) method to enhance the difficulty of fine-grained recognition and exercise the classification model to acquire expert knowledge. Besides the standard classification backbone network, another "destruction and construction" stream is introduced to carefully "destruct" and then "reconstruct" the input image, for learning discriminative regions and features. More specifically, for "destruction", we first partition the input image into local regions and then shuffle them by a Region Confusion Mechanism (RCM). To correctly recognize these destructed images, the classification network has to pay more attention to discriminative regions for spotting the differences. To compensate for the noise introduced by RCM, an adversarial loss, which distinguishes original images from destructed ones, is applied to reject the noisy patterns. For "construction", a region alignment network, which tries to restore the original spatial layout of local regions, is appended to model the semantic correlation among local regions. By joint training with parameter sharing, our proposed DCL injects more discriminative local details into the classification network. Experimental results show that our proposed framework achieves state-of-the-art performance on three standard benchmarks. Moreover, our proposed method needs no external knowledge during training, and there is no computation overhead at inference time beyond the standard classification network feed-forward. Source code: https://github.com/JDAI-CV/DCL.

1. Introduction

In the past decade, generic object recognition has achieved steady progress with efforts on both large-scale annotated datasets and sophisticated model design. However, recognizing fine-grained object categories (e.g., bird species [3], car models [14] and aircraft [18]) is still a challenging task, which attracts extensive research attention. Although fine-grained objects are visually similar at a rough glimpse, they can be correctly recognized by details in discriminative local regions.

Learning discriminative feature representations from discriminative parts plays the key role in fine-grained image recognition. Existing fine-grained recognition methods can be roughly grouped into two categories, as illustrated in Figure 1. One group (a) first locates the discriminative object parts and then classifies based on the discriminative regions. These two-step methods [21, 11, 1] mostly need additional bounding box annotations on objects or parts, which are expensive to collect. The other group (b) tries to automatically localize discriminative regions via an attention mechanism in an unsupervised manner, and thus does not need extra annotations. However, these methods [7, 42, 41, 22] usually need an additional network structure (e.g., the attention mechanism), and thus introduce extra computation overhead in both the training and inference stages.

In this paper, we propose a novel fine-grained image recognition framework named "Destruction and Construction Learning" (DCL), as shown in Figure 1 (c). Besides the standard classification backbone network, we introduce a DCL stream to learn from discriminative regions automatically. An input image is first carefully destructed to emphasize discriminative local details, and then reconstructed to model the semantic correlation among local regions. On one hand, DCL automatically localizes discriminative regions, and thus does not need any extra knowledge during training. On the other hand, the DCL structure is only adopted at the training stage, and thus introduces no computational overhead at inference time.

For "Destruction", we propose a Region Confusion Mechanism (RCM) to deliberately "confuse" the global structure, which partitions the input image into local patches and then shuffles them randomly (Figure 3). For fine-grained recognition, local details play a more important role than global structures, since images from different fine-grained categories usually share the same global structure or shape, but only differ in local details.

Figure 1. Illustrations of two previous general frameworks (a, b) and our proposed framework (c) for fine-grained classification. (a) Two-stage part-detection-based framework. (b) Attention-based framework. (c) Our proposed destruction and construction learning framework. The network structures in dashed lines are disabled during inference.

Discarding global structure and keeping local details can force the network to identify and focus on the discriminative local regions for recognition. After all, the devil is in the details. Shuffling is also adopted in natural language processing [15] to make the neural network focus on discriminative words. Similarly, if local regions in an image are "shuffled", irrelevant regions that are non-critical to fine-grained recognition will be neglected, and the network will be forced to classify images based on the discriminative local details. With RCM, the visual appearance of the image is substantially changed. As shown in the bottom row of Figure 3, although recognition becomes more difficult, bird experts can still spot the difference easily. Car enthusiasts can distinguish car models by examining only parts of a car [34]. Similarly, the neural network also needs to learn expert knowledge to classify the destructed images.

It is worth noting that "destruction" is not always beneficial. As a side effect, RCM introduces several noisy visual patterns, as in Figure 3. To offset the negative impact, we apply an adversarial loss to distinguish original images from destructed ones. As a result, the effect of noisy patterns can be minimized, keeping only beneficial local details. Conceptually, the adversarial and classification losses work in an adversarial manner to carefully learn from "destruction".

For "Construction", a region alignment network is introduced to restore the original region arrangement, which acts in the opposite way of RCM. By learning to restore the original layout, as in [19, 6], the network needs to understand the semantics of each region, including the discriminative ones. Through "construction", the correlation between different local regions can be modeled.

The main contributions are summarized as follows:

• A novel "Destruction and Construction Learning" (DCL) framework is proposed for fine-grained recognition. For destruction, the Region Confusion Mechanism (RCM) forces the classification network to learn from discriminative regions, and the adversarial loss prevents over-fitting to the RCM-induced noisy patterns. For construction, the region alignment network restores the original region layout by modeling the semantic correlation among regions.

• State-of-the-art performances are reported on three standard benchmark datasets, where our DCL consistently outperforms existing methods.

• Compared to existing methods, our proposed DCL does not need extra part/object annotations and introduces no computational overhead at inference time.

2. Related Works

Research on fine-grained image recognition has mainly proceeded along two dimensions. One is learning better visual representations from the original image directly [26, 25, 28], and the other is using part/attention-based methods [41, 42, 7, 13] to obtain discriminative regions in images and learn region-based feature representations.

Owing to the success of deep learning, fine-grained recognition methods have shifted from multi-stage frameworks based on hand-crafted features [39, 36, 23, 10] to multi-stage frameworks with CNN features [13, 31, 29]. Second-order bilinear feature interactions were shown to significantly improve visual representation learning [16, 30]. This method was later extended in a series of related works with further improvements [12, 4, 8]. Deep metric learning is also used to capture subtle visual differences. Zhang et al. [40] introduced label structures and a generalization of the triplet loss to learn fine-grained feature representations. Chen et al. [27] investigated simultaneously predicting categories at different levels of the hierarchy and integrating this structured correlation information into the network by an embedding method. However, these pairwise neural network models often bring complex network computation.

There is also a large body of part-localization-based methods, motivated by the view that object parts are essential to learning discriminative features for fine-grained classification [32]. Fu et al. [7] proposed a reinforced attention proposal network to obtain discriminative attention regions and region-based feature representations at multiple scales. Sun et al. [20] proposed a one-squeeze multi-excitation module to learn multiple attention region features for each input image, and then applied a multi-attention multi-class constraint in a metric learning framework. Zheng et al. [42] adopted a channel grouping network to generate multiple parts by clustering, and then classified these part features to predict the categories of input images.

Figure 2. The framework of the proposed DCL method, which consists of four parts. (1) Region Confusion Mechanism: a module to shuffle the local regions of the input image. (2) Classification Network: the backbone classification network that classifies images into fine-grained categories. (3) Adversarial Learning Network: an adversarial loss is applied to distinguish original images from destructed ones. (4) Region Alignment Network: appended after the classification network to recover the spatial layout of local regions.

Compared with earlier part/attention-based methods, some of the recent methods tend to be weakly supervised and do not require annotations of parts or key areas [21, 35]. In particular, Peng et al. [21] proposed a part spatial constraint to make sure that the model selects discriminative regions, and a specialized clustering algorithm is used to integrate the features of these regions. Yang et al. [35] introduced a method that detects informative regions and then scrutinizes them to make final predictions. However, the correlation among regions, which is helpful for building a deep understanding of objects, is usually ignored by previous works. The research in [19] also shows that utilizing the location information of regions can enhance the visual representation ability of the neural network and improve performance on classification and detection tasks.

Our proposed method differs from previous works in three aspects. First, by training the classifier with our proposed RCM, the discriminative regions can be automatically detected without using any prior knowledge except object labels. Second, our formulation considers not only the fine-grained local region feature representations but also the semantic correlation among different regions in the whole image. Third, our proposed method is highly efficient: there is no additional overhead beyond the backbone network feed-forward at prediction time.

3. Proposed Method

In this section, we present our proposed Destruction and Construction Learning (DCL) method. As shown in Figure 2, the whole framework is composed of four parts. Please note that only the "classification network" is needed at inference time.

3.1. Destruction Learning

The devil is in the details. For fine-grained image recognition, local details are much more important than the global structure. In most cases, different fine-grained categories share a similar global structure and only differ in certain local details. In this work, we propose to carefully destruct the global structure by shuffling the local regions, for better identifying discriminative regions and learning discriminative features (Section 3.1.1). To prevent the network from learning noisy patterns introduced by destruction, an adversarial counterpart (Section 3.1.2) is proposed to reject RCM-induced patterns that are irrelevant to fine-grained classification.

3.1.1 Region Confusion Mechanism

As an analogy to natural language processing [15], shuffling words in a sentence forces the neural network to focus on discriminative words and neglect irrelevant ones. Similarly, if local regions in an image are "shuffled", the neural network is forced to learn from discriminative region details for classification.

As shown in Figure 3, our proposed Region Confusion Mechanism (RCM) is designed to disrupt the spatial layout of local image regions. Given an input image $I$, we first uniformly partition the image into $N \times N$ sub-regions denoted by $R_{i,j}$, where $i$ and $j$ are the horizontal and vertical indices respectively and $1 \le i, j \le N$. Inspired by [15], our proposed RCM shuffles these partitioned local regions within their 2D neighbourhood. For the $j$-th row of $R$, a random vector $q_j$ of size $N$ is generated, where the $i$-th element $q_{j,i} = i + r$, and $r \sim U(-k, k)$ is a random variable following a uniform distribution in the range $[-k, k]$. Here, $k$ is a tunable parameter ($1 \le k < N$) defining the neighbourhood range. Then we can get a new permutation $\sigma^{row}_j$ of the regions in the $j$-th row by sorting the array $q_j$, verifying the condition:

$$\forall i \in \{1, \dots, N\}, \quad \left|\sigma^{row}_j(i) - i\right| < 2k. \tag{1}$$

Similarly, we apply the permutation $\sigma^{col}_i$ to the regions column-wise, verifying the condition:

$$\forall j \in \{1, \dots, N\}, \quad \left|\sigma^{col}_i(j) - j\right| < 2k. \tag{2}$$

Therefore, the region at location $(i, j)$ in the original image is placed at a new coordinate:

$$\sigma(i, j) = \left(\sigma^{row}_j(i), \sigma^{col}_i(j)\right). \tag{3}$$

This shuffling method destructs the global structure and ensures that each local region jitters inside its neighbourhood, with a tunable size.
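For concreteness, the following is a minimal NumPy/PIL sketch of RCM. The function names are illustrative rather than taken from our released code, and the row and column jitters are applied sequentially, which guarantees that the result is a valid permutation of regions:

```python
import numpy as np
from PIL import Image

def _jitter_permutation(N: int, k: int) -> np.ndarray:
    """Sort q_i = i + U(-k, k) and return sigma with |sigma(i) - i| < 2k."""
    q = np.arange(N) + np.random.uniform(-k, k, N)
    return np.argsort(np.argsort(q))  # rank of each element = its new index

def rcm_shuffle(img: Image.Image, N: int = 7, k: int = 2) -> Image.Image:
    """Region Confusion Mechanism (sketch): partition the image into an
    N x N grid and jitter every region within its k-neighbourhood.
    Assumes the image width and height are divisible by N."""
    W, H = img.size
    rw, rh = W // N, H // N
    sigma_row = [_jitter_permutation(N, k) for _ in range(N)]  # one per grid row
    sigma_col = [_jitter_permutation(N, k) for _ in range(N)]  # one per grid column

    out = Image.new(img.mode, (W, H))
    for j in range(N):        # grid row index
        for i in range(N):    # grid column index
            ni = int(sigma_row[j][i])   # row-wise jitter: new column of (i, j)
            nj = int(sigma_col[ni][j])  # column-wise jitter on the shuffled grid
            region = img.crop((i * rw, j * rh, (i + 1) * rw, (j + 1) * rh))
            out.paste(region, (ni * rw, nj * rh))
    return out
```

The mapping $(i, j) \mapsto (n_i, n_j)$ realized above also provides the ground-truth coordinates needed by the region alignment loss in Section 3.2.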

The original image $I$, its destructed version $\phi(I)$ and its ground-truth one-vs-all label $l$ indicating the fine-grained category are coupled as $\langle I, \phi(I), l \rangle$ for training. The classification network maps an input image to a probability distribution vector $C(I, \theta_{cls})$, where $\theta_{cls}$ denotes all learnable parameters in the classification network. The loss function of the classification network $L_{cls}$ can be written as:

$$L_{cls} = -\sum_{I \in \mathbf{I}} l \cdot \log\left[C(I)\, C(\phi(I))\right], \tag{4}$$

where $\mathbf{I}$ is the image set for training. Since the global structure has been destructed, to recognize these randomly shuffled images, the classification network has to find the discriminative regions and learn the delicate differences among categories.
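Since the label term in Eq. (4) factorizes as $l \cdot \log C(I) + l \cdot \log C(\phi(I))$, $L_{cls}$ reduces to a standard cross-entropy applied to both versions of each image with shared parameters. A PyTorch-style sketch, assuming `backbone` is a classification network returning logits:

```python
import torch.nn.functional as F

def classification_loss(backbone, images, destructed, labels):
    """L_cls from Eq. (4): cross-entropy on the original images plus
    cross-entropy on their RCM-destructed versions (shared parameters)."""
    return (F.cross_entropy(backbone(images), labels)
            + F.cross_entropy(backbone(destructed), labels))
```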

3.1.2 Adversarial Learning

Destructing images with RCM does not always bring beneficial information for fine-grained classification. For example, in Figure 3, RCM also introduces noisy visual patterns as we shuffle the local regions. Features learned from these noisy visual patterns are harmful to the classification task. To this end, we propose an adversarial loss $L_{adv}$ to prevent the RCM-induced noise patterns from creeping into the feature space.

Considering the original images and the destructed ones as two domains, the adversarial loss and the classification loss work in an adversarial manner to 1) keep domain-invariant patterns, and 2) reject domain-specific patterns between $I$ and $\phi(I)$.

Figure 3. Example images for fine-grained recognition (top) and the corresponding "destructed" images produced by our proposed RCM (bottom).

We label each image with a one-hot vector $d \in \{0, 1\}^2$ indicating whether the image is destructed or not. A discriminator can be added as a new branch in the framework to judge whether an image $I$ is destructed or not:

$$D(I, \theta_{adv}) = \text{softmax}\left(\theta_{adv}\, C(I, \theta^{[1,m]}_{cls})\right), \tag{5}$$

where $C(I, \theta^{[1,m]}_{cls})$ is the feature vector extracted from the outputs of the $m$-th layer of the backbone classification network, $\theta^{[1,m]}_{cls}$ denotes the learnable parameters from the 1st to the $m$-th layer of the classification network, and $\theta_{adv} \in \mathbb{R}^{d \times 2}$ is a linear mapping. The loss of the discriminator network $L_{adv}$ can be computed as:

$$L_{adv} = -\sum_{I \in \mathbf{I}} d \cdot \log\left[D(I)\right] + (1 - d) \cdot \log\left[D(\phi(I))\right]. \tag{6}$$
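The discriminator in Eq. (5) is simply a linear mapping on top of an intermediate feature vector of the backbone. A sketch under the assumption that the pooled feature $C(I, \theta^{[1,m]}_{cls})$ has already been computed (all names are illustrative, and we encode $d$ as a class index instead of a one-hot vector):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """D(I, theta_adv) = softmax(theta_adv C(I, theta_cls^[1,m])), Eq. (5)."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.theta_adv = nn.Linear(feature_dim, 2)  # theta_adv in R^{d x 2}

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.theta_adv(features)  # logits; softmax is folded into the loss

def adversarial_loss(disc, feat_orig, feat_destr):
    """L_adv from Eq. (6), with class 0 = original and class 1 = destructed."""
    zeros = torch.zeros(feat_orig.size(0), dtype=torch.long, device=feat_orig.device)
    ones = torch.ones(feat_destr.size(0), dtype=torch.long, device=feat_destr.device)
    return (F.cross_entropy(disc(feat_orig), zeros)
            + F.cross_entropy(disc(feat_destr), ones))
```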

Justification. To better understand how the adversarial loss tunes feature learning, we further visualize the features of the backbone network ResNet-50 with and without the adversarial loss. Given an input image $I$, we denote the $k$-th feature map in the $m$-th layer by $F^k_m(I)$. For ResNet-50, we extract the feature for adversarial learning from the output of the last convolutional layer after average pooling, next to the last fully-connected layer. Thus, the response of the $k$-th filter in the last convolutional layer for ground-truth label $c$ can be measured by

$$r_k(I, c) = \bar{F}^k_m(I) \times \theta^{[m+1]}_{cls}[k, c],$$

where $\theta^{[m+1]}_{cls}[k, c]$ is the weight between the $k$-th feature map and the $c$-th output label.

We compare the responses of different filters to an original image and its destructed version in the scatter plots shown in Figure 4, where every filter with a positive response is mapped to the data point $(r(I, c), r(\phi(I), c))$. We find that the distribution of feature maps trained with $L_{cls}$ alone is more compact than that trained with $L_{cls} + L_{adv}$. This means that filters with large responses to the noise patterns introduced by RCM may also have large responses to the original image (as visualized in $A$, $B$ and $C$, many filters respond to edge-style visual patterns or irrelevant patterns introduced by RCM). These filters may mislead the predictions on the original image.

We also color the points in the scatter plot for the backbone network trained with $L_{cls} + L_{adv}$ according to the value of

$$\delta_k = \bar{F}^k_m(I) \times \theta_{adv}[k, 1] - \bar{F}^k_m(\phi(I)) \times \theta_{adv}[k, 2], \tag{7}$$

where $\theta_{adv}[k, 1]$ is the weight connecting feature map $F^k_m(\cdot)$ and the label representing the original image, and $\theta_{adv}[k, 2]$ is the weight connecting $F^k_m(\cdot)$ and the label representing the destructed image. $\delta_k$ evaluates whether the $k$-th filter tends to respond to visual patterns of the original image or not. It can be observed that filters responding to noisy visual patterns can be distinguished ($D$ vs. $F$) by using the adversarial loss. The points in the figure can be divided into three parts. $D$: filters that tend to respond to noisy patterns (RCM-induced image features); $F$: filters that tend to respond to global context descriptions (original-image-specific features); $E$: the vast majority of filters, which relate to the detailed local region descriptions enhanced by $L_{cls}$ (feature maps common to the original and destructed images).

$L_{cls}$ and $L_{adv}$ together contribute to the "destruction" learning, where only discriminative local details are enhanced and irrelevant features are filtered out.
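The quantities $r_k$ and $\delta_k$ used in this analysis can be computed directly from the pooled feature maps and the learned weights; the following diagnostic sketch is hypothetical (tensor names and shapes are assumptions):

```python
import torch

def filter_domain_scores(pooled_orig: torch.Tensor,
                         pooled_destr: torch.Tensor,
                         theta_adv: torch.Tensor) -> torch.Tensor:
    """delta_k from Eq. (7) for every filter k.

    pooled_orig / pooled_destr: average-pooled feature maps, i.e. F_bar_m(I)
        and F_bar_m(phi(I)), each of shape (num_filters,).
    theta_adv: discriminator weights of shape (num_filters, 2), column 0
        scoring "original" and column 1 scoring "destructed".
    Positive delta_k: the filter leans toward original-image patterns;
    negative delta_k: it leans toward RCM-induced noise."""
    return pooled_orig * theta_adv[:, 0] - pooled_destr * theta_adv[:, 1]
```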

3.2. Construction Learning

Considering that it is the combination of correlated regions in an image that constitutes complex and diverse visual patterns, we propose another learning method to model the correlation among local regions. Specifically, we propose a region alignment network with a region alignment loss $L_{loc}$, which measures the location accuracy of the different regions in an image, to induce the backbone network to model the semantic correlation among regions through end-to-end training.

Given an image $I$ and its corresponding destructed version $\phi(I)$, the region $R_{i,j}$ located at $(i, j)$ in $I$ corresponds to the region $R_{\sigma(i,j)}$ in $\phi(I)$. The region alignment network works on the output features of a convolutional layer of the classification network, $C(\cdot, \theta^{[1,n]}_{cls})$, where the $n$-th layer is a convolutional layer. The features are processed by a $1 \times 1$ convolution to obtain outputs with two channels. The outputs are then passed through a ReLU and an average pooling to get a map of size $2 \times N \times N$. The output of the region alignment network can be written as:

$$M(I) = h\left(C(I, \theta^{[1,n]}_{cls}), \theta_{loc}\right), \tag{8}$$

where the two channels of $M(I)$ correspond to the location coordinates of rows and columns respectively, $h$ is our proposed region alignment network, and $\theta_{loc}$ denotes its parameters. We denote the predicted location of $R_{\sigma(i,j)}$ in $\phi(I)$ as $M_{\sigma(i,j)}(\phi(I))$, and the predicted location of $R_{i,j}$ in $I$ as $M_{i,j}(I)$. The ground truth for both $M_{\sigma(i,j)}(\phi(I))$ and $M_{i,j}(I)$ should be $(i, j)$.

Figure 4. Visualization of filters learned using $L_{cls}$ and $L_{cls} + L_{adv}$ respectively. The first row shows the original image $I$ and its destructed version $\phi(I)$. The left side of the 2nd and 3rd rows shows the scatter plots of the filters' responses to $I$ and $\phi(I)$. The right side of the 2nd and 3rd rows shows visualizations of feature maps belonging to filters with various responses to $I$ and $\phi(I)$. $A, D$: filters with larger responses to $\phi(I)$. $C, F$: filters with larger responses to $I$. $B, E$: filters with large responses to both $I$ and $\phi(I)$. (The figure is best viewed in color.)

The region alignment loss $L_{loc}$ is defined as the $L_1$ distance between the predicted coordinates and the original coordinates:

$$L_{loc} = \sum_{I \in \mathbf{I}} \sum_{i=1}^{N} \sum_{j=1}^{N} \left| M_{\sigma(i,j)}(\phi(I)) - \begin{bmatrix} i \\ j \end{bmatrix} \right|_1 + \left| M_{i,j}(I) - \begin{bmatrix} i \\ j \end{bmatrix} \right|_1. \tag{9}$$

The region alignment loss helps to locate the main objects in images and tends to find the correlation among sub-regions. Through end-to-end training, it helps the classification backbone network build a deep understanding of objects and model structural information, such as the shape of objects and the semantic correlation among object parts.
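A sketch of the region alignment head of Eq. (8) and the loss of Eq. (9) is given below; the module and variable names are illustrative, `feat` stands for the $n$-th layer feature map, coordinates are 0-indexed, and the $L_1$ terms are mean-reduced rather than summed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAlignment(nn.Module):
    """h(.) in Eq. (8): 1x1 conv -> ReLU -> average pooling down to a
    2 x N x N map whose channels predict each region's row/column."""
    def __init__(self, in_channels: int, N: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 2, kernel_size=1)
        self.N = N

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return F.adaptive_avg_pool2d(F.relu(self.conv(feat)), self.N)  # (B, 2, N, N)

def alignment_loss(pred_orig, pred_destr, sigma_targets):
    """L_loc from Eq. (9): L1 distance between predicted and true coordinates.
    sigma_targets holds, at every grid cell of the destructed image, that
    region's pre-shuffle (i, j); for the original image the target is the
    identity grid."""
    B, _, N, _ = pred_orig.shape
    ii, jj = torch.meshgrid(torch.arange(N, dtype=torch.float32),
                            torch.arange(N, dtype=torch.float32), indexing="ij")
    identity = torch.stack([ii, jj]).unsqueeze(0).expand(B, -1, -1, -1)
    return F.l1_loss(pred_orig, identity) + F.l1_loss(pred_destr, sigma_targets)
```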


3.3. Destruction and Construction Learning

In our framework, the classification, adversarial and region alignment losses are trained in an end-to-end manner, so that the network can leverage both the enhanced local details and the well-modeled correlation among object parts for fine-grained recognition. Specifically, we minimize the following objective:

$$L = \alpha L_{cls} + \beta L_{adv} + \gamma L_{loc}. \tag{10}$$

Figure 2 shows the architecture of the DCL framework. The destruction learning mainly helps the network learn from discriminative regions, while the construction learning helps to re-arrange the learned local details according to the semantic correlation among regions. Hence, DCL yields a set of complex and diverse visual representations built on well-structured detail features from discriminative regions.

Note that only the classification network $f(\cdot, \theta^{[1,l]}_{cls})$ is used for predicting the category label of a given image. Thus, there is no extra computational overhead beyond the backbone classification network at inference.
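Putting the pieces together, one hypothetical end-to-end training step under Eq. (10) could look as follows, reusing the `adversarial_loss` and `alignment_loss` sketches above; the assumption that `backbone(x)` returns the logits, pooled feature vector and feature map at once is ours:

```python
import torch
import torch.nn.functional as F

def dcl_step(backbone, disc, align_head, optimizer, batch,
             alpha=1.0, beta=1.0, gamma=1.0):
    """One step on L = alpha*L_cls + beta*L_adv + gamma*L_loc (Eq. 10).
    `batch` provides original images, their RCM-destructed versions, class
    labels and the shuffle coordinates for Eq. (9); all names are placeholders."""
    images, destructed, labels, sigma_targets = batch
    logits_o, feat_o, fmap_o = backbone(images)
    logits_d, feat_d, fmap_d = backbone(destructed)

    l_cls = (F.cross_entropy(logits_o, labels)
             + F.cross_entropy(logits_d, labels))                    # Eq. (4)
    zeros = torch.zeros_like(labels)
    l_adv = (F.cross_entropy(disc(feat_o), zeros)
             + F.cross_entropy(disc(feat_d), zeros + 1))             # Eq. (6)
    l_loc = alignment_loss(align_head(fmap_o), align_head(fmap_d),
                           sigma_targets)                            # Eq. (9)

    loss = alpha * l_cls + beta * l_adv + gamma * l_loc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```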

4. Experiments

We evaluate the performance of our proposed DCL on three standard fine-grained object recognition datasets: CUB-200-2011 (CUB) [3], Stanford Cars (CAR) [14] and FGVC-Aircraft (AIR) [18]. We do not use any bounding box or part annotations in any of our experiments.

4.1. Implementation Details

Backbone network: We evaluate our proposed method on two widely used classification backbone networks: ResNet-50 [9] and VGG-16 [25]. Both networks are pre-trained on the ImageNet dataset. The category label of the image is the only annotation used for training. The input images are resized to a fixed size of 512 × 512 and randomly cropped to 448 × 448. Random rotation and random horizontal flipping are applied for data augmentation. All of the above settings are standard in the literature. To recognize high-resolution images with VGG-16 without sub-sampling, the first two fully connected layers in VGG-16 are each transformed into convolutional layers. For all experiments in this paper, the feature maps of the last convolutional layer of the backbone network are fed into the region alignment network, and the feature vector formed by the output of the average pooling following the last convolutional layer is fed into the adversarial learning network.
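These settings correspond to a standard augmentation pipeline; a torchvision sketch is given below (the rotation angle is not specified in the text, so the ±15° here is an assumed value):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((512, 512)),      # fixed resize, as described above
    transforms.RandomCrop(448),         # random 448 x 448 crop
    transforms.RandomRotation(15),      # random rotation; the angle is assumed
    transforms.RandomHorizontalFlip(),  # random horizontal flip
    transforms.ToTensor(),
])
```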

The number of regions N in RCM depends on the backbone network and the size of the input image. The width w and height h of each region should be divisible by the stride of the last convolutional layer, which is 32 for both VGG-16 and ResNet-50. Meanwhile, to ensure the feasibility of region alignment, the width and height of the input image should also be divisible by N. Unless mentioned otherwise, the default value of the division number N for RCM is set to 7 in this paper. The influence of the choice of N is discussed in Section 4.4.
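The two divisibility constraints above determine the feasible values of N directly; a small illustrative helper for a 448 × 448 input and a stride-32 backbone:

```python
def feasible_partition_numbers(image_size: int = 448, stride: int = 32):
    """N must divide the image size, and each region (image_size / N) must be
    divisible by the backbone stride, so N must divide image_size / stride."""
    grid = image_size // stride  # 448 / 32 = 14
    return [n for n in range(1, grid + 1) if grid % n == 0]

# feasible_partition_numbers() -> [1, 2, 7, 14], matching Table 3.
```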

All models in our experiments were trained for 180 epochs, with the learning rate decayed by a factor of 10 every 60 epochs. At test time, RCM is disabled, and the network structures for the adversarial loss and region alignment are removed. The input images are center-cropped and then fed into the backbone classification network for final predictions.
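A matching schedule sketch (the optimizer type, base learning rate and momentum are assumptions; the text fixes only the decay policy):

```python
import torch

def make_optimizer(model: torch.nn.Module):
    # Base optimizer and learning rate are assumed; the schedule follows the
    # text: decay the learning rate by 10x every 60 of the 180 epochs.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    return optimizer, scheduler
```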

4.2. Performance Comparison

The results on CUB-200-2011, Stanford Cars, and FGVC-Aircraft are presented in Table 1. Considering that some of the compared methods use image-level labels or bounding box annotations, the extra annotations are also indicated in parentheses for direct comparison. The single-model, single-crop performance of our proposed DCL achieves state-of-the-art results with no extra annotation on all three datasets.

We set α = β = 1 for all experiments reported in this paper. For non-rigid object recognition tasks such as CUB-200-2011, the correlation among different regions is important for building a deep understanding of objects; thus we set γ = 1. For rigid object recognition tasks such as Stanford Cars and FGVC-Aircraft, the parts of objects are discriminative and complementary, so object and part locations may play a significant role [34]. We therefore set γ = 0.01 for rigid object recognition tasks to highlight the role of destruction learning in learning detailed visual representations from discriminative regions. Different from other fine-grained categories such as birds and cars, the structure of aircraft can change significantly with their design [18]. For example, the number of wings, undercarriages, wheels per undercarriage, engines, etc. varies. Thus we set N = 2 for DCL on FGVC-Aircraft in Table 1 to retain the structural information to a certain extent.

Tables 1 and 2 show that our ResNet-50 baseline is already very competitive. Nevertheless, our proposed DCL still outperforms this strong baseline by a large margin (e.g., a 2.3% absolute improvement on average) on all three tasks.

4.3. Ablation Studies

We conduct ablation studies to understand the different components of our proposed DCL. We design different runs on the three datasets using ResNet-50 as the backbone network and report the results in Table 2. The results show that the proposed DCL boosts performance significantly. The improvement brought by destruction learning (DL) shows that a well-structured visual feature space, which distinguishes noisy, detailed and global visual patterns, is beneficial to the fine-grained recognition task. Likewise, the shape and constitution information of objects modeled by construction learning (CL) can further improve the performance of the fine-grained classification model. Moreover, the adversarial learning and the region alignment are highly complementary.


| Method | Base Model | CUB-200-2011 | Stanford Cars | FGVC-Aircraft |
|---|---|---|---|---|
| CoSeq (+BBox) [13] | VGG-19 | 82.8 | 92.8 | - |
| FCAN (+BBox) [17] | ResNet-50 | 84.7 | 93.1 | - |
| B-CNN [16] | VGGnet | 84.1 | 91.3 | 84.1 |
| HIHCA [2] | VGG-16 | 85.3 | 91.7 | 88.3 |
| RA-CNN [7] | VGG-19 | 85.3 | 92.5 | 88.2 |
| OPAM [21] | VGG-16 | 85.8 | 92.2 | - |
| Kernel-Pooling [5] | VGG-16 | 86.2 | 92.4 | 86.9 |
| Kernel-Pooling [5] | ResNet-50 | 84.7 | 91.1 | 85.7 |
| MA-CNN [42] | VGG-19 | 86.5 | 92.8 | 89.9 |
| DFL-CNN [33] | ResNet-50 | 87.4 | 93.1 | 91.7 |
| DCL | VGG-16 | 86.9 | 94.1 | 91.2 |
| DCL | ResNet-50 | 87.8 | 94.5 | 93.0 |

Table 1. Comparison of accuracy (%) on three standard datasets. Base Model is the backbone network used by each method.

| Method | CUB | CAR | AIR |
|---|---|---|---|
| ResNet-50 | 85.5 | 92.7 | 90.3 |
| + RCM | 86.2 | 93.4 | 89.9 |
| DL | 87.2 | 94.4 | 91.6 |
| CL | 86.7 | 94.1 | 90.7 |
| DCL | 87.8 | 94.5 | 92.2 |

Table 2. Ablation studies of the proposed method in terms of recognition accuracy (%) on three datasets. ResNet-50: ResNet-50 fine-tuned on the fine-grained tasks. + RCM: model trained with Lcls. DL: model trained with Lcls + Ladv. CL: model trained with Lcls + Lloc. DCL: model trained with L.

| N = Divisor(448/32) | CUB | CAR | AIR |
|---|---|---|---|
| 1 | 85.5% | 92.7% | 90.3% |
| 2 | 86.5% | 93.5% | 93.0% |
| 7 | 87.8% | 94.5% | 92.2% |
| 14 | 85.7% | 93.0% | 92.1% |

Table 3. Recognition accuracy of the proposed method on three datasets with different values of N.


4.4. Discussions

Partition Granularity (N): The number of partitions N in RCM is an important parameter of the proposed method. Table 3 shows the recognition accuracy on the three datasets for all feasible values of N with an input image size of 448 × 448.

It can be observed that the recognition accuracy first increases and then decreases as N grows. The best performance is achieved at N = 7 on CUB-200-2011 and Stanford Cars. In the experiments on FGVC-Aircraft, our proposed method still outperforms the state-of-the-art method by 0.5% absolute even when we set N = 7. In general, if we set N to a small number, the advantage of our proposed method is likely to be restricted. On the other hand, if we set N larger, the visual patterns that can be learned from regions become more limited, and the region alignment network becomes more difficult to converge. In particular, the performance of our proposed DCL is equivalent to the ResNet-50 baseline when N = 1.

Ratio of Destructed Images in a Mini-batch: The default ratio of original images to destructed images in a mini-batch is set to 1:1. Table 4 shows the recognition accuracy on CUB-200-2011 with this ratio ranging from 1:0 to 0:1. As shown, the performance decreases by a large margin when the ratio is set to 0:1, since there is then no global context information in the training data.

| Ratio | 1:0 | 1:1 | 1:2 | 1:3 | 0:1 |
|---|---|---|---|---|---|
| Accuracy (%) | 85.5 | 87.8 | 86.8 | 86.5 | 84.1 |

Table 4. Recognition accuracy on CUB-200-2011 of the model trained with different compositions of training samples. The ratio is the proportion of original images to RCM-destructed images in one mini-batch.

Feature Visualization: We visualize the feature maps of the last convolutional layer in Figure 5. Comparing the feature maps from the baseline model and the proposed method, we find that the feature map responses of DCL are more concentrated on discriminative regions. Under different shufflings, the discriminative regions are consistently highlighted by the DCL-based model, which demonstrates the robustness of our DCL method.

Object Localization: We also tested DCL on the weakly supervised object localization task on the VOC2007 dataset using SPN [43]. We choose Pointing Localization Accuracy (PLAcc) as the evaluation criterion, which measures whether the network can locate the correct regions of the target.


Figure 5. Visualization of feature maps from the last convolutional layer of ResNet-50. For each dataset, the first column shows the original image and two destructed versions; the 2nd and 3rd columns show the feature maps of two filters from the baseline ResNet-50; the 4th and 5th columns show the feature maps of two different filters from the ResNet-50 guided by DCL. This figure is best viewed in color.

The experimental results are shown in Table 5. We find that, after applying DCL, PLAcc improved from 87.5% to 88.1%, which serves as further numerical evidence that DCL helps the network learn the correct regions.

Destruction Hyperparameter (k): Since RCM in our proposed method requires the selection of a hyperparameter k, we conduct experiments to study the sensitivity of classification performance to the choice of k in Table 6. The recognition accuracy first improves and then decreases as k increases.

| Method | VGG-16 |
|---|---|
| Center [24] | 69.5 |
| Deconv [37] | 75.5 |
| Grad [24] | 76.0 |
| c-MWP [38] | 80.0 |
| SPN [43] | 87.5 |
| DCL | 88.1 |

Table 5. Pointing localization accuracy (%) on the VOC2007 test set. Center is a baseline method that uses the image centers as estimates of object centers.

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Acc. (%) | 85.5 | 86.7 | 87.8 | 87.6 | 87.4 | 87.3 | 87.2 |

Table 6. Recognition accuracy on CUB-200-2011 of the model trained with different values of k.

The best performance is obtained at k = 2. In particular, the accuracy decreases only slowly as k increases from 2 to 6, which indicates that our method is not particularly sensitive to k.

Model Complexity: During training, DCL only requires one simple operation (RCM) and two lightweight network structures (the adversarial learning network and the region alignment network). For ResNet-50 + DCL, DCL introduces 8,192 new parameters, only 0.034% more than the baseline ResNet-50. Since DCL adds only a negligible number of parameters, the network is efficient to train. Moreover, fine-tuning the network to convergence takes the same number of iterations as the baseline.

During testing, only the backbone classification network is activated. Compared with the ResNet-50 baseline, our method yields a significantly better result (+2.3%) at the same inference-time cost, which adds extra practical value to our proposed method.

5. Conclusion

In this paper, we propose a novel DCL framework for fine-grained image recognition. The destruction learning in DCL enhances the difficulty of recognition, guiding the network to learn expert knowledge for fine-grained recognition, while the construction learning models the semantic correlation among parts of the object. Our method does not require extra supervision information and can be trained end-to-end in one stage. Extensive experiments against state-of-the-art methods demonstrate the superior performance of our method on various fine-grained recognition tasks. Moreover, our proposed method is lightweight, easy to train, efficient at inference, and of good practical value.

Acknowledgement. This work is partly supported by National Natural Science Foundation of China No. 61602463.


References

[1] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2019–2026, June 2014.
[2] S. Cai, W. Zuo, and L. Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In 2017 IEEE International Conference on Computer Vision, pages 511–520, Oct 2017.
[3] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, 2011.
[4] C. Yu, X. Zhao, Q. Zheng, P. Zhang, and X. You. Hierarchical bilinear pooling for fine-grained visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 595–610, 2018.
[5] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 3049–3058, July 2017.
[6] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In 2015 IEEE International Conference on Computer Vision, pages 1422–1430, Dec 2015.
[7] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 4476–4484, July 2017.
[8] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 317–326, June 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, June 2016.
[10] C. Huang, Z. He, G. Cao, and W. Cao. Task-driven progressive part localization for fine-grained object recognition. IEEE Transactions on Multimedia, 18(12):2372–2383, Dec 2016.
[11] S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked CNN for fine-grained visual categorization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 1173–1182, June 2016.
[12] S. Kong and C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 7025–7034, July 2017.
[13] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 5546–5555, June 2015.
[14] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D object representations for fine-grained categorization. In 2013 IEEE International Conference on Computer Vision Workshops, pages 554–561, Dec 2013.
[15] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations, 2018.
[16] T. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In 2015 IEEE International Conference on Computer Vision, pages 1449–1457, Dec 2015.
[17] X. Liu, T. Xia, J. Wang, and Y. Lin. Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition. CoRR, abs/1603.06765, 2016.
[18] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. CoRR, abs/1306.5151, 2013.
[19] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision – ECCV 2016, pages 69–84. Springer International Publishing, 2016.
[20] M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 834–850, 2018.
[21] Y. Peng, X. He, and J. Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, March 2018.
[22] P. Rodríguez, J. M. Gonfaus, G. Cucurull, F. X. Roca, and J. Gonzàlez. Attend and rectify: A gated attention mechanism for fine-grained recovery. In Proceedings of the European Conference on Computer Vision (ECCV), pages 349–364, 2018.
[23] X. Shu, J. Tang, G. Qi, Z. Li, Y. Jiang, and S. Yan. Image classification with tailored fine-grained dictionaries. IEEE Transactions on Circuits and Systems for Video Technology, 28(2):454–467, Feb 2018.
[24] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, June 2015.
[27] T. Chen, W. Wu, Y. Gao, L. Dong, X. Luo, and L. Lin. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In The 26th ACM International Conference on Multimedia, MM '18, pages 2023–2031. ACM, 2018.
[28] J. Wang, J. Fu, Y. Xu, and T. Mei. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), August 2016.
[29] Y. Wang, J. Choi, V. I. Morariu, and L. S. Davis. Mining discriminative triplets of patches for fine-grained classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 1163–1172, June 2016.
[30] X. Wei, Y. Zhang, Y. Gong, J. Zhang, and N. Zheng. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 355–370, 2018.
[31] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, June 2015.
[32] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, June 2015.
[33] Y. Wang, V. I. Morariu, and L. S. Davis. Learning a discriminative filter bank within a CNN for fine-grained recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pages 4148–4157, 2018.
[34] L. Yang, P. Luo, C. C. Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, June 2015.
[35] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang. Learning to navigate for fine-grained classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 420–435, 2018.
[36] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3466–3473, June 2012.
[37] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[38] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
[39] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable part descriptors for fine-grained recognition and attribute prediction. In 2013 IEEE International Conference on Computer Vision, pages 729–736, Dec 2013.
[40] X. Zhang, F. Zhou, Y. Lin, and S. Zhang. Embedding label structures for fine-grained feature representation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 1114–1123, June 2016.
[41] B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia, 19(6):1245–1256, June 2017.
[42] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In 2017 IEEE International Conference on Computer Vision, pages 5219–5227, Oct 2017.
[43] Y. Zhu, Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Soft proposal networks for weakly supervised object localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1841–1850, 2017.