
Adversarial Attribute-Image Person Re-identification

Zhou Yin, Wei-Shi Zheng, Ancong Wu, Hong-Xing Yu, Hai Wan, Jianhuang Lai
School of Data and Computer Science, Sun Yat-Sen University

{yinzh,wuancong}@mail2.sysu.edu.cn,{zhwshi,wanhai,stsljh}@mail.sysu.edu.cn,[email protected]

Abstract

While attributes have been widely used for person re-identification (Re-ID), which matches the same person's images across disjoint camera views, they are used either as extra features or for multi-task learning to assist the image-image person matching task. However, how to find a set of person images according to a given attribute description, which is very practical in many surveillance applications, remains a rarely investigated cross-modal matching problem in person Re-ID. In this work, we present this challenge and employ adversarial learning to formulate the attribute-image cross-modal person Re-ID model. By imposing a semantic consistency constraint across modalities as a regularization, the adversarial learning enables generating image-analogous concepts for query attributes and matching them with images at both the global level and the semantic ID level. We conducted extensive experiments on three attribute datasets and demonstrated that the adversarial modelling is so far the most effective approach for the attribute-image cross-modal person Re-ID problem.

1 Introduction

Pedestrian attributes, e.g., age, gender and dressing, are searchable semantic elements to describe a person. As depicted in Figure 1, it is often required to search for a person image in a surveillance environment according to specific attribute descriptions provided by users. We call this the attribute-image person re-identification (Re-ID). This task is significant for finding missing people with the tens of thousands of surveillance cameras equipped in modern metropolises. Compared with conventional image-based Re-ID [Zhao et al., 2014; Yang et al., 2016], attribute-image Re-ID has the advantage that its query is much easier to obtain; for instance, it is more practical to search for criminal suspects when only verbal testimony about the suspects' appearances is available.

Despite its great significance, attribute-image person Re-ID is still a very open problem and has rarely been investigated before. While many attribute-based person Re-ID models

Figure 1: The attribute-image person Re-ID problem. The query is an attribute vector labeled by users, and the corresponding target person images that match the query are retrieved.

[Lin et al., 2017; Layne et al., 2012a; Su et al., 2016; Layne et al., 2012b; Layne et al., 2014b; Layne et al., 2014a; Su et al., 2015a] have been developed recently, they are mainly used either for multi-task learning or for supplying extra features so as to enhance the image-image person Re-ID model, but provide no solution for attribute-image person Re-ID. The most intuitive solution is to predict attributes for each person image and then search within the predicted attributes [Siddiquie et al., 2011; Vaquero et al., 2009; Scheirer et al., 2012]. If we could reliably recognize the attributes of each pedestrian image, this would be the best way to find the person corresponding to the query attributes. However, recognizing attributes from a person image is still an open issue, as pedestrian images from surveillance environments often suffer from low resolution, pose variations and illumination changes. The problem of imperfect recognition prevents such intuitive methods from being a good solution for mapping between heterogeneous domains. Very often, in a large-scale scenario, the predicted attributes of two different pedestrians are very similar, which results in a very small inter-class distance in the predicted attribute space. Therefore, the imperfect prediction deteriorates the reliability of these existing models.

In this paper, we formulate an adversarial attribute-image person Re-ID framework. Intuitively, when we hold some attribute description in mind, e.g., "dressed in red", we generate an obscure and vague imagination of how a person dressed in red may look, which we refer to as a concept. Once a concrete image is given, our vision system automatically processes the low-level features (e.g., color and edge) to obtain some perceptions, and then tries to judge whether the perceptions and the concept coincide with each other.

Formally, our adversarial attribute-image person Re-ID learns a semantically discriminative joint space, which can be regarded as a concept space (Figure 2), in which a generative adversarial architecture generates concepts that have a structure very similar to the features extracted from raw person images. However, a generic adversarial architecture still struggles to match two extremely discrepant domains (the attribute and image domains). For this problem, we impose a semantic consistency regularization across modalities to regularize the adversarial architecture, which strengthens the learned joint space as a bridge between the two modalities. In a word, our adversarial attribute-image model learns the semantically discriminative structure of low-level person images, and generates the corresponding aligned image-analogous concept for a high-level attribute description towards the image concept. By the proposed strategy, directly estimating the attributes of a person image is averted, and the problems of imperfect prediction and low semantic discriminability are naturally avoided, because we learn a semantically discriminative joint space rather than predicting and matching attributes.

We conducted experiments on three large-scale benchmark datasets, namely Duke Attribute [Lin et al., 2017], Market Attribute [Lin et al., 2017] and PETA [Deng et al., 2014], to validate our model. From our learning, some interesting findings are:
(1) Compared with other related cross-modal models, the regularized adversarial learning framework is so far the most effective for solving the attribute-image Re-ID problem;
(2) For achieving better cross-modal matching between attributes and person images, it is more effective to use the adversarial model to generate an image-analogous concept and match it with the image's concept, rather than doing the reverse;
(3) The semantic consistency regularization on adversarial learning is important for attribute-image Re-ID.

2 Related Works

2.1 Attribute-based Re-ID

While pedestrian attributes in most research have served as side information or mid-level representations to improve conventional image-image person Re-ID [Lin et al., 2017; Layne et al., 2012a; Su et al., 2016; Layne et al., 2012b; Layne et al., 2014b; Layne et al., 2014a; Su et al., 2015a; Su et al., 2015b; Su et al., 2018], very few works [Vaquero et al., 2009] have discussed the attribute-image person Re-ID problem. The work in [Vaquero et al., 2009] performs attribute-attribute matching. However, despite the improvement in performance achieved by attribute prediction methods [Li et al., 2015], directly retrieving people according to their predicted attributes is still challenging, because attribute prediction methods are still not robust to cross-view condition changes such as different lighting conditions and viewpoints.

In this work, for the first time, we extensively investigate the attribute-image person Re-ID problem under an adversarial framework. Rather than directly predicting the attributes of an image, we cast cross-view attribute-image matching as a cross-modality matching problem solved by adversarial learning.

2.2 Cross-modality Retrieval

Our work is related to cross-modal content search, which aims to bridge different modalities [Hotelling, 1936; Mineiro and Karampatziakis, 2014; Andrew et al., 2013; Kiros et al., 2014]. The most traditional and practical solution to this task is Canonical Correlation Analysis (CCA) [Hotelling, 1936], which projects two modalities into a common space that maximizes their correlation. The original CCA method, which uses a linear projection, was further improved by [Mineiro and Karampatziakis, 2014; Andrew et al., 2013]. Ngiam et al. and Feng et al. applied autoencoder-based methods to model the cross-modal correlation [Ngiam et al., 2011; Feng et al., 2014], and Wei et al. proposed a deep semantic matching method to address cross-modal retrieval for samples annotated with one or multiple labels [Wei et al., 2017]. Recently, Eisenschtat and Wolf designed a model of two tied pipelines that maximizes the projection correlation using a Euclidean loss, achieving state-of-the-art results on several datasets [Eisenschtat and Wolf, 2017]. The work most related to ours was proposed by [S.Li et al., 2017], which applies a neural attention mechanism to retrieve pedestrian images using language descriptions. Compared with this setting, our attribute-image Re-ID has its own strength in embedding more pre-defined attribute descriptions for obtaining better performance.

2.3 Distribution Alignment Methods

The adversarial model employed in this work is in line with GAN methods [Eric et al., 2017; Reed et al., 2016], which have their strength in domain alignment via a two-player minimax game. For performing domain alignment, there are also domain adaptation techniques [Tzeng et al., 2014; Long and Wang, 2015; Ganin and Lempitsky, 2015]. In domain adaptation, to align the distributions of data from two different domains, several works [Tzeng et al., 2014; Long and Wang, 2015] apply an MMD-based loss, which minimizes the norm of the difference between the means of the two distributions. Different from these methods, the deep Correlation Alignment (CORAL) [Sun and Saenko, 2016] method matches both the mean and the covariance of the two distributions. In this work, we examine all these approaches and find that adversarial modelling is more suitable for attribute-image person Re-ID, which could be due to its more powerful ability to align more discrepant data distributions. Our work differs from these methods in that we address attribute-based person Re-ID, which requires our model to generate a homogeneous structure from two discrepant modalities while at the same time keeping the semantic consistency across modalities.
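As a rough illustration of the two alignment baselines just described, the following PyTorch-style sketch (our own simplification, not the authors' code) computes a linear MMD term and a CORAL-style term between two batches of features; `feat_img` and `feat_attr` are hypothetical batch-by-dimension tensors, and the CORAL variant follows this paper's description of matching both mean and covariance.

```python
import torch

def linear_mmd(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Squared distance between the batch means of two feature sets."""
    return (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()

def coral(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """CORAL-style loss: match means and covariances of two batches."""
    def covariance(z: torch.Tensor) -> torch.Tensor:
        zc = z - z.mean(dim=0, keepdim=True)
        return zc.t() @ zc / (z.size(0) - 1)
    mean_term = (x.mean(dim=0) - y.mean(dim=0)).pow(2).sum()
    cov_term = (covariance(x) - covariance(y)).pow(2).sum()
    return mean_term + cov_term

# Example with hypothetical 128-dimensional concept batches.
feat_img, feat_attr = torch.randn(32, 128), torch.randn(32, 128)
align_loss = linear_mmd(feat_img, feat_attr)  # or coral(feat_img, feat_attr)
```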


Figure 2: The structure of our network. It aims to generate an image-analogous concept for an attribute input, optimized by three loss functions: the image concept extraction loss, the semantic consistency constraint, and the adversarial loss. See Sec. 3.3 for details about the network architecture.

3 Attribute-image Person Re-ID

Given an attribute description A_i, attribute-image person Re-ID aims at re-identifying the matched pedestrian images I_i from an image database I = {I_i}_{i=1}^{N} captured under a real surveillance environment, where N is the size of I. Since different person images could have the same attribute description, attribute-image person Re-ID uses a semantic ID (i.e., attribute description) to group person images; that is, people with the same attribute description share the same semantic ID.

The goal of our method is to learn two mappings Φ_I and Φ_A that respectively map person images and high-level semantic attributes into a joint discriminative space, which can be regarded as the concept space mentioned above. That is, C_{I_i} = Φ_I(I_i) and C_{A_i} = Φ_A(A_i), where C_{I_i} and C_{A_i} are called the mid-level concepts generated from the image I_i and the attribute A_i, respectively. To achieve this, we form an image-embedding network by a CNN and an attribute-embedding network by a deep fully connected network. We parameterize our model by Θ, and obtain Θ by optimizing a concept generation objective L_concept. Given training pairs of images and attributes (I_i, A_i), the optimization problem is formulated as

min_Θ L_concept = (1/N) ∑_{i=1}^{N} l_concept(Φ_I(I_i), Φ_A(A_i)).  (1)

In this paper, we design l_concept as a combination of several loss terms, each of which formulates a specific consideration, so that together they formulate our problem. The first consideration is that the concepts we extract from the low-level noisy person images should be semantically discriminative. We formulate it by the image concept extraction loss l_I. The second consideration is that the image-analogous concepts C_A generated from attributes and the image concepts C_I should be homogeneous. Inspired by the powerful ability of generative adversarial networks to close the gap between heterogeneous distributions, we propose to embed an adversarial learning strategy into our model. This is modelled by a concept generation objective l_CG, which aims to generate concepts that are not only discriminative but also homogeneous with the concepts extracted from images. Therefore, we have

l_concept = l_I + l_CG.  (2)

In the following, we describe each of them in detail.

3.1 Image Concept Extraction

Our image concept extraction loss l_I is based on softmax classification of the image concepts Φ_I(I). Since our objective is to learn semantically discriminative concepts that distinguish objects with different attributes, rather than to distinguish between specific persons, we re-assign a semantic ID y_{I_i} to each person image I_i according to its attributes rather than its real person ID, which means different people with the same attributes can share the same semantic ID. We define the image concept extraction loss as a softmax classification objective on semantic IDs. Denoting Ψ_I as the classifier and I as the input image, the image concept extraction loss is the negative log-likelihood of the predicted scores Ψ_I(Φ_I(I)):

l_I = ∑_i −log Ψ_I(Φ_I(I_i))_{y_{I_i}},  (3)

where I_i is the ith input image, y_{I_i} is the semantic ID of I_i, and Ψ_I(Φ_I(I_i))_k is the kth element of Ψ_I(Φ_I(I_i)).
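As a concrete illustration, a minimal PyTorch-style sketch of this loss could look as follows; it anticipates the ResNet-50 backbone with a 128-dimensional concept layer described in Sec. 3.3, while the batch shapes and the number of semantic IDs are placeholder assumptions rather than values prescribed by the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageConceptNet(nn.Module):
    """Phi_I followed by the semantic-ID classifier Psi_I (a rough sketch)."""
    def __init__(self, concept_dim: int = 128, num_semantic_ids: int = 300):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, concept_dim)  # 128-d concept layer
        self.phi_I = backbone
        self.psi_I = nn.Linear(concept_dim, num_semantic_ids)          # semantic-ID classifier

    def forward(self, images: torch.Tensor):
        concept = self.phi_I(images)   # C_I = Phi_I(I)
        logits = self.psi_I(concept)   # scores over semantic IDs
        return concept, logits

# Image concept extraction loss l_I (Eq. 3): negative log-likelihood of the
# semantic ID, i.e. standard cross-entropy over Psi_I(Phi_I(I)).
model = ImageConceptNet()
images = torch.randn(8, 3, 256, 128)           # hypothetical batch of person crops
semantic_ids = torch.randint(0, 300, (8,))
_, logits = model(images)
l_I = nn.functional.cross_entropy(logits, semantic_ids)
```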

3.2 Semantic-preserving Image-analogous Concept Generation

Image-analogous Concept Generation. We regard Φ_A as a generative process, just like the process of a person generating an imagination from an attribute description. As semantically discriminative latent concepts can be extracted from images, they can also serve as a guideline for learning the image-analogous concepts Φ_A(A) for attributes.

Basically, the generated image-analogous concepts should obey the same distribution as the image concepts, i.e., P_I(C) = P_A(C), where C denotes a concept in the joint concept space of Φ_I(I) and Φ_A(A), and P_I, P_A denote the distributions of image concepts and image-analogous concepts, respectively. We intend to use a function P̂_I to approximate the image concept distribution P_I, and make the image-analogous concepts Φ_A(A) obey the distribution P̂_I. This can be achieved by the adversarial training process of a GAN, in which the discriminator D is regarded as P̂_I and the generator G is regarded as the image-analogous concept generator Φ_A.

In the adversarial training process, we design a network structure (see Sec. 3.3) and train our concept generator Φ_A with the goal of fooling a skillful concept discriminator D that is trained to distinguish image-analogous concepts from image concepts, so that the generated image-analogous concept is aligned with the image concept. We design the discriminator network D with parameters θ_D and denote the parameters of Φ_A as θ_G. The adversarial min-max problem is formulated as

min_{θ_G} max_{θ_D} V(D, G) = E_{I∼p_I}[log D(Φ_I(I))] + E_{A∼p_A}[log(1 − D(Φ_A(A)))].  (4)


The above optimization problem is solved by iteratively optimizing θ_D and θ_G. Therefore, the objective can be decomposed into two loss terms, l^G_adv and l^D_adv, which are used for training the concept generator Φ_A and the discriminator D, respectively. The whole objective during adversarial training, l_adv, is then formed by:

l_adv = λ_D l^D_adv + λ_G l^G_adv,  (5)

where l^G_adv = −log D(Φ_A(A)) and l^D_adv = −(log D(Φ_I(I)) + log(1 − D(Φ_A(A)))).
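A minimal PyTorch sketch of these two adversarial terms, under the assumption that the discriminator is a small fully connected network over the 128-dimensional concepts (the paper does not specify its exact layout), might look as follows; the trade-off weights at the end are the λ_D and λ_G values reported in the implementation details.

```python
import torch
import torch.nn as nn

# Hypothetical concept discriminator D over 128-d concepts.
discriminator = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),   # probability of being a real image concept
)

def adversarial_losses(c_img: torch.Tensor, c_attr: torch.Tensor):
    """Return (l_adv_D, l_adv_G) from Eq. (5) for a batch of concepts."""
    eps = 1e-8
    d_img = discriminator(c_img)                 # D(Phi_I(I))
    d_attr = discriminator(c_attr.detach())      # D(Phi_A(A)), generator frozen
    l_adv_D = -(torch.log(d_img + eps) + torch.log(1 - d_attr + eps)).mean()

    d_attr_for_g = discriminator(c_attr)         # gradient flows into Phi_A here
    l_adv_G = -torch.log(d_attr_for_g + eps).mean()
    return l_adv_D, l_adv_G

# Usage: combine with the trade-off weights of Eq. (5).
c_img, c_attr = torch.randn(8, 128), torch.randn(8, 128, requires_grad=True)
l_adv_D, l_adv_G = adversarial_losses(c_img, c_attr)
l_adv = 0.5 * l_adv_D + 0.001 * l_adv_G          # lambda_D = 0.5, lambda_G = 0.001
```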

Semantic Consistency Constraint. The adversarial learning term l_adv is important for the generator Φ_A to generate image-analogous concepts with the same distribution as the image concepts Φ_I(I). Furthermore, we should generate meaningful concepts that preserve the semantic discriminability of the attribute modality, i.e., P^{sid}_I(C) = P^{sid}_A(C), where P^{sid}_I and P^{sid}_A denote the distributions of image concepts and image-analogous concepts of semantic ID sid. If we analyze the image concept extraction loss l_I in Equation (3) independently, Ψ_I can be regarded as a function approximating a set of distributions P^{sid}_I(C), one for each semantic ID sid. Under the assumption that the generated image-analogous concepts should lie in the same concept space as the image concepts, Ψ_I is shared by image concept extraction and image-analogous concept generation, so as to guarantee identical distributions of the two domains at the semantic ID level. We therefore integrate a semantic consistency constraint l_sc using the same classifier Ψ_I as for the image concepts:

l_sc = ∑_i −log Ψ_I(Φ_A(A_i))_{y_{A_i}},  (6)

where A_i is the ith input attribute, y_{A_i} is the semantic ID of A_i, and Ψ_I(Φ_A(A_i))_k is the kth element of Ψ_I(Φ_A(A_i)). Thus the overall concept generation objective for attributes, l_CG, becomes the sum of l_adv and l_sc:

l_CG = l_adv + l_sc.  (7)

In this way, we encourage our generation model to generate a more homogeneous image-analogous concept structure, while at the same time correlating image-analogous concepts with semantically matched image concepts by maintaining semantic discriminability in the learned structure.
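Following the notation above, the semantic consistency term simply reuses the image-branch semantic-ID classifier on the attribute-generated concepts; a hedged sketch (with `psi_I`, `phi_A`, the attribute dimensionality and the number of semantic IDs as hypothetical stand-ins) is:

```python
import torch
import torch.nn as nn

# Hypothetical modules: psi_I is the semantic-ID classifier shared with the
# image branch, phi_A is a toy stand-in for the attribute-embedding generator.
psi_I = nn.Linear(128, 300)                            # shared classifier Psi_I
phi_A = nn.Sequential(nn.Linear(27, 128), nn.Tanh())   # stand-in for Phi_A

attributes = torch.rand(8, 27)                         # batch of attribute vectors
semantic_ids = torch.randint(0, 300, (8,))

# l_sc (Eq. 6): cross-entropy of the attribute-generated concepts under the
# *image* classifier, which ties the two modalities at the semantic-ID level.
c_attr = phi_A(attributes)
l_sc = nn.functional.cross_entropy(psi_I(c_attr), semantic_ids)

# l_CG (Eq. 7): adversarial term plus the semantic consistency term,
# with l_adv computed as sketched in the previous snippet.
```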

3.3 The Network Architecture

Our network structure is shown in Figure 2. We particularly design the attribute part to have enough capacity to learn image-analogous concepts; thus, it contains several fully connected layers with batch normalization, the details of which are shown in Table 1. The image part of our network is obtained by removing the last softmax classification layer of ResNet-50 and adding a 128-dimensional feature layer as the learned image concept before the final classification into semantic IDs. We use the concepts of the last layer before classification as our learned latent space; the size of that layer is also set to 128 in the attribute pipeline.

As introduced above, we impose the semantic consistency constraint on the attribute branch, and thus the image-analogous concepts are also passed into the semantic ID classifier of the image channel.

Structure             Size
fc1                   attributeSize × 128
BatchNormalization    128
ReLU                  128
fc2                   128 × 256
BatchNormalization    256
ReLU                  256
fc3                   256 × 512
BatchNormalization    512
ReLU                  512
fc4                   512 × embeddingSize
Tanh                  128
fc5                   128 × SemanticIDSize

Table 1: The structure of the attribute part of our network. "fc" denotes a fully connected layer. The embedding size is set to 128 in our work.
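Table 1 translates almost directly into a small PyTorch module; the sketch below is our own reading of the table (with `attribute_size` and `num_semantic_ids` as placeholder values), not the authors' released code. Note that in the paper the final classifier is shared with the image branch (Ψ_I).

```python
import torch
import torch.nn as nn

class AttributeBranch(nn.Module):
    """Attribute pipeline of Table 1: fc/BN/ReLU stack, Tanh concept layer, fc5 classifier."""
    def __init__(self, attribute_size: int = 27, embedding_size: int = 128,
                 num_semantic_ids: int = 300):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(attribute_size, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 256), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Linear(256, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, embedding_size), nn.Tanh(),   # fc4 + Tanh -> concept
        )
        self.classifier = nn.Linear(embedding_size, num_semantic_ids)  # fc5 (shared Psi_I)

    def forward(self, attributes: torch.Tensor):
        concept = self.embed(attributes)    # image-analogous concept Phi_A(A)
        logits = self.classifier(concept)   # semantic-ID scores
        return concept, logits

branch = AttributeBranch()
concept, logits = branch(torch.rand(4, 27))
```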

At the inference stage, we rank the gallery pedestrian image concepts Φ_I(I) according to their cosine distances to the query image-analogous concept Φ_A(A) in the latent embedding space.

Implementation Details. Our source code is modified from Open-ReID¹, using the PyTorch framework. We first pre-trained our image network for 100 epochs using the semantic IDs, with an Adam optimizer [Kingma and Ba, 2015] with learning rate 0.01, momentum 0.9 and weight decay 5e-4. After that, we jointly train the whole network. We set λ_G in Eq. (5) to 0.001 and λ_D to 0.5, which will be discussed in Section 4.4. The total number of epochs was set to 300. During training, we set the learning rate of the attribute branch to 0.01, and the learning rate of the image branch to 0.001 because it had been pre-trained. These parameters are kept fixed across all the datasets in our comparisons.
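For completeness, here is a small sketch of the inference-time cosine ranking described at the start of this subsection (our own illustration; the tensors are hypothetical outputs of Φ_A and Φ_I):

```python
import torch
import torch.nn.functional as F

def rank_gallery(query_concept: torch.Tensor, gallery_concepts: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted from most to least similar to the query.

    query_concept:    (D,) image-analogous concept Phi_A(A) of the attribute query
    gallery_concepts: (N, D) image concepts Phi_I(I) of the gallery images
    """
    q = F.normalize(query_concept.unsqueeze(0), dim=1)   # unit-norm query
    g = F.normalize(gallery_concepts, dim=1)             # unit-norm gallery
    cosine_sim = (g @ q.t()).squeeze(1)                  # higher = smaller cosine distance
    return torch.argsort(cosine_sim, descending=True)

ranking = rank_gallery(torch.randn(128), torch.randn(1000, 128))
```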

4 Experiments

4.1 Datasets and Settings

Datasets. We evaluated our approach and compared it with related methods on three benchmark datasets, including Duke Attribute [Lin et al., 2017], Market Attribute [Lin et al., 2017], and PETA [Deng et al., 2014]. We tried to follow the settings used in the literature. The Duke Attribute dataset contains 16522 images for training and 19889 images for testing. Each person has 23 attributes. We labelled the images with semantic IDs according to their attributes. As a result, we have 300 semantic IDs for training and 387 semantic IDs for testing. Similar to Duke Attribute, Market Attribute has 27 attributes to describe a person, with 12141 images and 508 semantic IDs in the training set, and 15631 images and 484 semantic IDs in the test set. For the PETA dataset, each person has 65 attributes (61 binary and 4 multi-valued). We used 10500 images with 1500 semantic IDs for training, and 1500 images with 200 semantic IDs for testing.

Evaluation Metrics. We computed both the Cumulative Match Characteristic (CMC) and the mean average precision (mAP) as metrics to measure the performance of the compared models.
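As a reference for how these metrics behave, here is a hedged, self-contained sketch of rank-k accuracy and average precision for a single-query retrieval list (a simplification; the exact evaluation protocol used in the paper may differ, and mAP is the mean of the per-query AP values):

```python
from typing import List

def rank_k_hit(ranked_relevance: List[bool], k: int) -> bool:
    """True if at least one relevant gallery item appears in the top-k."""
    return any(ranked_relevance[:k])

def average_precision(ranked_relevance: List[bool]) -> float:
    """AP for one query: mean of precision values at each relevant position."""
    hits, precisions = 0, []
    for idx, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / idx)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: relevance flags of a ranked gallery for one attribute query.
relevance = [False, True, False, True, False]
print(rank_k_hit(relevance, 1), average_precision(relevance))  # False, 0.5
```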

4.2 Evaluation on the proposed model

Adversarial vs. other domain adaptation techniques. For our cross-modal person search, we employ the adversarial technique to align the image-analogous concepts generated from attributes with the image concepts. While CCA is also an option and will be discussed when comparing with DCCA later, we examine whether the other widely used alignment methods can work for our problem.

¹ https://cysu.github.io/open-reid/


Method                                  Market                          Duke                            PETA
                                        rank1  rank5  rank10  mAP      rank1  rank5  rank10  mAP      rank1  rank5  rank10  mAP
DeepCCAE [Wang et al., 2015]             8.12  23.97  34.55   9.72     33.28  59.35  67.64  14.95     14.24  22.09  29.94  14.45
DeepCCA [Andrew et al., 2013]           29.94  50.70  58.14  17.47     36.71  58.79  65.11  13.53     14.44  20.77  26.31  11.49
2WayNet [Eisenschtat and Wolf, 2017]    11.29  24.38  31.47   7.76     25.24  39.88  45.92  10.19     23.73  38.53  41.93  15.38
DeepMAR [Li et al., 2015]               13.15  24.87  32.90   8.86     36.60  57.70  67.00  14.34     17.80  25.59  31.06  12.67
CMCE [Li et al., 2017]                  35.04  50.99  56.47  22.80     39.75  56.39  62.79  15.40     31.72  39.18  48.35  26.23
ours w/o adv                            33.83  48.17  53.48  17.82     39.30  55.88  62.50  15.17     36.34  48.48  53.03  25.35
ours w/o sc                              2.08   4.80   4.80   1.00      5.26   9.37  10.87   1.56      3.43   4.15   4.15   5.80
ours w/o adv+MMD                        34.15  47.96  57.20  18.90     41.77  62.32  68.61  14.23     39.31  48.28  54.88  31.54
ours w/o adv+DeepCoral                  36.56  47.61  55.92  20.08     46.09  61.02  68.15  17.10     35.62  48.65  53.75  27.58
ours                                    40.26  49.21  58.61  20.67     46.60  59.64  69.07  15.67     39.00  53.62  62.20  27.86

Table 2: Comparison results on the three benchmark datasets. Performance is measured by the rank-1, rank-5 and rank-10 matching accuracy of the cumulative matching curve, as well as mAP. The best performance is indicated in red and the second best in blue.

We consider the MMD objective, which minimizes the difference between the means of two distributions, and DeepCoral [Sun and Saenko, 2016], which matches both the mean and the covariance of two distributions, as traditional and effective distribution alignment baselines. Since their original models cannot be directly applied, we modify our model for comparison; that is, we compare 1) our model without adversarial learning but with an MMD objective (ours w/o adv+MMD); 2) our model without adversarial learning but with a CORAL objective (ours w/o adv+DeepCoral). We also provide the baseline in which adversarial learning is not present, denoted as "ours w/o adv".

Compared with the model that does not use adversarial learning (ours w/o adv), all the other baselines, including our adversarial method, perform clearly better. Among all, the adversarial learning framework generally performs best (achieving the best and second best performance), although its performance is sometimes slightly inferior on Duke compared with ours w/o adv+DeepCoral, as shown in Table 2.

With vs. Without Semantic Consistency Constraint. In our framework, we tested the performance when the semantic consistency constraint is not used, denoted as "ours w/o sc". As reported in Table 2, without the semantic consistency constraint the performance drops sharply. This is because, although the two domains are aligned, the corresponding pairs are not correctly matched. Hence, the semantic consistency constraint actually regularizes the adversarial learning to avert this problem. As also shown, the model with the semantic consistency constraint but without adversarial learning (i.e., "ours w/o adv") clearly performed worse than our full model and is not superior to the other alignment techniques. All this suggests that a generic adversarial model by itself does not directly fit the task of aligning two highly discrepant domains, but the one regularized by the semantic consistency constraint does.

A2Img vs. Img2A. In our framework, we currently use the adversarial loss to align the generated image-analogous concept of an attribute towards the image concept; we call this case generation from attributes to images (A2Img). We now provide comparison results on generation from images to attributes (Img2A). As reported in Table 3, we find that Img2A is also effective, and it even outperformed A2Img on the PETA dataset. But on the larger datasets, Market and Duke, A2Img performed better. The reason may be that the distribution of semantic IDs is much sparser than the distribution of images. Thus, estimating the manifold of images from the training data is more reliable than estimating that of attributes.

Method                            Market   Duke    PETA
A2Img (proposed)                   40.3    46.6    39.0
Img2A (reverse of the proposed)    36.0    43.7    43.6
Real Images                         8.13   20.01   19.85

Table 3: The rank-1 matching accuracy of some variants of our model. "A2Img" denotes the proposed model, which generates concepts from attributes. "Img2A" does the reverse of "A2Img". "Real Images" denotes the model which generates images (rather than concepts) from attributes.

In PETA, however, the number of images is relatively small while semantic IDs are relatively abundant compared with the other two datasets. Moreover, PETA has more complicated sceneries and a larger number of attribute descriptions, which make it more challenging to learn discriminative concepts from images. Thus, generating attribute-analogous concepts and aligning them with attribute concepts provides more discriminative information, and so Img2A performs better on PETA.

Generation in concept space or in image space. We study the effectiveness of generating an image-analogous pattern for the attribute input in the concept space rather than in the low-level image space. To study this point, we used a conditional GAN to generate fake images, whose structure is aligned with real images, from our semantic attributes and a random noise input. After several training epochs, we added the semantic ID classification objective as in our original model. We used the cosine distance of the concepts from the penultimate layer of our semantic ID classification ResNet as the affinity between fake and real images.

The code we used to generate fake images is from Reed et al. [Reed et al., 2016], where we modified some input dimensions and added some convolution and deconvolution layers to fit our setting. We trained the generative model for 200 epochs, and then the classification loss was added for another 200 epochs of training, where the trade-off parameter of the generative loss λ_G was also set to 0.001, as in our setting.

We find the retrieval performance unsatisfactory, as shown in Table 3. Generating a whole pedestrian image in a complicated surveillance environment from the cross-modality setting is very difficult; moreover, for our problem, the process of adding style noise to generate noisy low-level images and then eliminating this noise to extract discriminative concepts is not necessary. Thus, generating an image-analogous pattern for the attribute input in the same discriminative concept space is more effective.


4.3 Comparison with related work

Comparing with Attribute Prediction Method. As mentioned above, an intuitive method for attribute-based person image retrieval is to predict attributes from person images and perform matching between the predicted attributes and the query attributes. We compared against a classical attribute recognition model, DeepMAR [Li et al., 2015], which formulates attribute recognition as a multi-task learning problem and acts as an off-the-shelf attribute predictor in our experiment. As shown in Table 2, our model outperformed DeepMAR, because DeepMAR still suffers from the low discriminability of the predicted attributes. Different from DeepMAR, we choose to learn latent representations as the bridge between the two modalities, where we successfully avert the problem caused by attribute prediction and learn more discriminative concepts using adversarial training.

Comparing with Cross-Modality Retrieval Models. Since our problem is essentially a cross-modal retrieval problem, we compared our model with the typical and commonly used Deep Canonical Correlation Analysis (DCCA) [Andrew et al., 2013], Deep Canonically Correlated Autoencoders (DCCAE) [Wang et al., 2015] and a state-of-the-art model, 2WayNet [Eisenschtat and Wolf, 2017]. DCCA applies the CCA objective in deep neural networks in order to maximize the correlation between two different modalities. DCCAE [Wang et al., 2015] jointly models the cross-modal correlation and reconstruction information in the joint space learning process. 2WayNet is a two-pipeline model that maximizes sample correlations by optimizing a Euclidean distance.

As reported in Table 2, DCCA typically performed better than the other cross-modality baselines, while 2WayNet largely outperforms the others on the small PETA dataset, which we think may benefit from its relatively small number of weights due to weight sharing. Our method outperformed all the cross-modality retrieval baselines on all three datasets by a large margin, as it is specifically designed to address the heterogeneous structure problem between the two modalities.

In addition, we compared with the most related work, cross-modality cross entropy (CMCE) [Li et al., 2017], which achieved state-of-the-art results in text-based person retrieval. We ran CMCE with semantic identities for a fair comparison. The results in Table 2 show that our model is more effective for solving the cross-modal person search problem.

4.4 More Evaluations

Firstly, we study the effect of the two important parameters λ_D and λ_G. We conducted the experiments on the Duke Attribute dataset, and the results are shown in Figure 3. We first study the effect of λ_G with λ_D set to 1; the result is shown in the left figure. Then we chose the best λ_G, varied λ_D, and obtained the results in the right figure. The best λ_G and λ_D are 0.001 and 0.5, respectively.

Secondly, we conduct qualitative evaluations of our proposed method. Figure 4 shows an example of the top-10 ranked images for a query attribute description in the Market Attribute dataset. We notice that fine-grained features of pedestrian images (e.g., the strap of a backpack) are the main cause of mistakes in our baseline method (see ours w/o adv in the second row of Figure 4).

Figure 3: Results of the experiment on the trade-off parameters λ_G and λ_D. We first set λ_D to 1 and change the value of λ_G, obtaining the results in the left image. Then we chose our best λ_G = 0.001 and changed λ_D, shown on the right.

Figure 4: Qualitative example in the Market Attribute dataset. The first row shows the results of our proposed method and the second row shows those of a baseline. To save space, we only list 6 attribute items among all the 27 in Market Attribute in the third row. Wrongly retrieved samples are marked by red rectangles in the figure.

With the adversarial objective, however, our model can form an intuition and generate the concept of what a person wearing a backpack would look like, and then concentrate more on possible fine-grained features.

5 Conclusion

The attribute-image person Re-ID problem differs from the previous attribute-based person Re-ID problem; it is a cross-modal matching problem that is realistic in practice. In this work, we have identified its challenges through experiments on three datasets. We have shown that an adversarial framework regularized by a semantic consistency constraint is so far the most effective way to solve the attribute-image person Re-ID problem. Also, from our study, we find that under the adversarial learning framework it is more useful to learn an image-analogous concept from the queried attributes and align it with the corresponding real image's concept, compared with the reverse.

References

[Andrew et al., 2013] Galen Andrew, Raman Arora, Karen Livescu, and Jeff Bilmes. Deep canonical correlation analysis. In ICML, 2013.

[Deng et al., 2014] Y. Deng, P. Luo, C. Loy, and X. Tang. Pedestrian attribute recognition at far distance. In ACMMM, 2014.

[Eisenschtat and Wolf, 2017] A. Eisenschtat and L. Wolf. Linking image and text with 2-way nets. In CVPR, 2017.

[Eric et al., 2017] Tzeng Eric, Hoffman Judy, Saenko Kate, and Darrell Trevor. Adversarial discriminative domain adaptation. In CVPR, 2017.


[Feng et al., 2014] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal retrieval with correspondence autoencoder. In ACMMM, 2014.

[Ganin and Lempitsky, 2015] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.

[Hotelling, 1936] Harold Hotelling. Relations between two sets of variates. Biometrika, 1936.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[Kiros et al., 2014] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In arXiv, 2014.

[Layne et al., 2012a] Ryan Layne, Tim Hospedales, and Shaogang Gong. Person re-identification by attributes. In BMVC, 2012.

[Layne et al., 2012b] Ryan Layne, Timothy M. Hospedales, and Shaogang Gong. Towards person identification and re-identification with attributes. In ECCV, 2012.

[Layne et al., 2014a] Ryan Layne, Tim Hospedales, and Shaogang Gong. Re-id: Hunting attributes in the wild. In BMVC, 2014.

[Layne et al., 2014b] Ryan Layne, Timothy M. Hospedales, and Shaogang Gong. Attributes-based re-identification. 2014.

[Li et al., 2015] D. Li, X. Chen, and K. Huang. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In ACPR, 2015.

[Li et al., 2017] Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, and Xiaogang Wang. Identity-aware textual-visual matching with latent co-attention. In ICCV, 2017.

[Lin et al., 2017] Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, and Yi Yang. Improving person re-identification by attribute and identity learning. In arXiv, 2017.

[Long and Wang, 2015] Mingsheng Long and Jianmin Wang. Learning transferable features with deep adaptation networks. In ICML, 2015.

[Mineiro and Karampatziakis, 2014] Paul Mineiro and Nikos Karampatziakis. A randomized algorithm for CCA. In CoRR, 2014.

[Ngiam et al., 2011] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In ICML, 2011.

[Reed et al., 2016] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.

[Scheirer et al., 2012] W. Scheirer, N. Kumar, P. Belhumeur, and T. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR, 2012.

[Siddiquie et al., 2011] B. Siddiquie, R. S. Feris, and L. Davis. Image ranking and retrieval based on multi-attribute queries. In CVPR, 2011.

[S.Li et al., 2017] S. Li, T. Xiao, H. Li, and X. Wang et al. Person search with natural language description. In CVPR, 2017.

[Su et al., 2015a] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao. Multi-task learning with low rank attribute embedding for person re-identification. In ICCV, 2015.

[Su et al., 2015b] Chi Su, Shiliang Zhang, Fan Yang, Guangxiao Zhang, Qi Tian, Wen Gao, and Larry S. Davis. Tracklet-to-tracklet person re-identification by attributes with discriminative latent space mapping. In ICMS, 2015.

[Su et al., 2016] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Deep attributes driven multi-camera person re-identification. In ECCV, 2016.

[Su et al., 2018] Chi Su, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Multi-type attributes driven multi-camera person re-identification. In PR, 2018.

[Sun and Saenko, 2016] Baochen Sun and Kate Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In ICCV, 2016.

[Tzeng et al., 2014] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, 2014.

[Vaquero et al., 2009] D. A. Vaquero, R. S. Feris, D. Tran, L. Brown, A. Hampapur, and M. Turk. Attribute-based people search in surveillance environments. In WACV, 2009.

[Wang et al., 2015] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, 2015.

[Wei et al., 2017] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, and S. Yan. Cross-modal retrieval with CNN visual features: A new baseline. IEEE T. CYBERN, 2017.

[Yang et al., 2016] Y. Yang, Z. Lei, S. Zhang, H. Shi, and S. Z. Li. Metric embedded discriminative vocabulary learning for high-level person representation. In AAAI, 2016.

[Zhao et al., 2014] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014.