
Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

Supplementary Material

Chufeng Tang1 Lu Sheng2 Zhaoxiang Zhang3 Xiaolin Hu1∗

1 State Key Laboratory of Intelligent Technology and Systems, Institute for Artificial Intelligence, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University
2 College of Software, Beihang University
3 Institute of Automation, Chinese Academy of Sciences
{tcf18@mails,xlhu@mail}.tsinghua.edu.cn [email protected] [email protected]

* Corresponding author.

1. Implementation Details

We adopt the BN-Inception model pretrained on ImageNet as the backbone network. The proposed framework is implemented in PyTorch and trained end-to-end with only image-level annotations. We use the Adam optimizer, which converges faster than SGD in our experiments, with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.0001 and the batch size is 32. For the RAP and PA-100K datasets, we train the model for 30 epochs and decay the learning rate by a factor of 0.1 every 10 epochs. For the smaller PETA dataset, we double the number of training epochs. For data preprocessing, we resize the input pedestrian images to 256 × 128 and apply random horizontal mirroring and data shuffling for augmentation.
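For concreteness, this training setup can be written as the following PyTorch sketch. The model variable and the data pipeline are placeholders rather than the authors' released code, and Adam's second-moment coefficient (beta2) is left at the PyTorch default, which the paper does not specify.

import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Linear(10, 10)  # placeholder for the BN-Inception-based network

# Optimization setup described above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,               # initial learning rate
    betas=(0.9, 0.999),    # beta1 = 0.9 matches the stated momentum; beta2 is the default
    weight_decay=5e-4,
)
# Decay the learning rate by 0.1 every 10 epochs (30 epochs on RAP and
# PA-100K, 60 epochs on the smaller PETA).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Preprocessing and augmentation: resize to 256 x 128 and random horizontal
# mirroring; data shuffling is handled by the DataLoader (shuffle=True).
train_transform = transforms.Compose([
    transforms.Resize((256, 128)),   # (height, width)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])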

2. Different Attribute-Specific Methods

In Section 4.4 of the main paper, we compare the proposed method against two other attribute-specific localization methods: visual attention and rigid parts. Different from most existing attribute-agnostic attention-based and part-based methods, we build two attribute-specific models based on these ideas for comparison. Here we give the details of the compared models.

2.1. Attention Masks Model

We replace the proposed ALM with a spatial attention module while keeping everything else unchanged for a fair comparison. The spatial attention module is implemented as a tiny 3-layer sub-network, as shown in Figure S2, which is inspired by HA-CNN [2]. The input features $X_i \in \mathbb{R}^{H \times W \times C}$ at the $i$-th level (a certain layer in the backbone network; three levels in total) are first fed into a cross-channel averaging layer. A $3 \times 3$ Conv-BatchNorm-ReLU block then generates the expected attention mask $S_i^m \in \mathbb{R}^{H \times W \times 1}$, which is used for localizing the $m$-th attribute at the $i$-th level. All channels share the same spatial attention mask. The attentive features are then obtained by channel-wise multiplication of the attention mask with the input features, and the corresponding prediction is calculated as

$$\hat{y}_i^m = f(S_i^m \cdot X_i), \qquad (1)$$

where $f$ denotes a fully-connected layer. Each spatial attention module serves only one attribute at a single level, the same as in Figure 2 of the main paper.
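A minimal PyTorch sketch of this module, following Figure S2 and Eq. (1), is given below. The class name is ours, and the global average pooling before the fully-connected layer is an assumption; the figure only shows an FC layer after the channel-wise multiplication.

import torch
import torch.nn as nn

class SpatialAttentionModule(nn.Module):
    """Sketch of the per-attribute spatial attention module (Figure S2, Eq. 1).

    One instance serves a single attribute at a single feature level.
    """

    def __init__(self, in_channels: int):
        super().__init__()
        # 3x3 Conv-BatchNorm-ReLU block producing a single-channel mask.
        self.mask_head = nn.Sequential(
            nn.Conv2d(1, 1, kernel_size=3, padding=1),
            nn.BatchNorm2d(1),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(in_channels, 1)  # f in Eq. (1), one attribute logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # cross-channel averaging: (N,C,H,W) -> (N,1,H,W)
        mask = self.mask_head(avg)          # S_i^m, shape (N, 1, H, W)
        attended = mask * x                 # broadcast: all channels share one mask
        pooled = attended.mean(dim=(2, 3))  # global average pooling (assumption)
        return self.fc(pooled)              # \hat{y}_i^m

For example, SpatialAttentionModule(256)(torch.randn(8, 256, 16, 8)) returns one logit per image for the attribute this instance serves.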

2.2. Rigid Parts Model

For the attribute-specific part-based model, we replace ALM with a body-parts guided module, as shown in Figure S3. The key idea is to associate each attribute with a predefined body region (head, torso, legs, or the whole image); e.g., the LongHair attribute is associated with the head part. Since body-part annotations are unavailable for most pedestrian attribute datasets, we adopt an external pose estimation model to localize the body parts, which is inspired by SpindleNet [3]. Specifically, we localize 14 human body keypoints for each pedestrian image using a pretrained pose estimation model [3]. The pedestrian image is then divided into three body-part regions based on these keypoints, as shown in Figure S1. In the body-parts guided module (Figure S3), the body-part-based local features are extracted from the input features $X_i$ through an RoI pooling layer [1]. For attribute prediction, the most relevant features are selected according to the attribute-region correspondence listed in Table S1, e.g., recognizing the Hat attribute using features only from the head part.
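A PyTorch sketch of this module is shown below. The class and argument names, the pooled output size, and the assumption that part boxes (including a whole-image box) are precomputed in feature-map coordinates are ours; torchvision's roi_pool stands in for the RoI pooling layer of [1].

import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class BodyPartsGuidedModule(nn.Module):
    """Sketch of the body-parts guided module (Figure S3) for one attribute."""

    def __init__(self, in_channels: int, region: str, pool_size: int = 3):
        super().__init__()
        self.region = region          # 'head', 'torso', 'legs', or 'whole' (Table S1)
        self.pool_size = pool_size
        self.fc = nn.Linear(in_channels * pool_size * pool_size, 1)

    def forward(self, x: torch.Tensor, part_boxes: dict) -> torch.Tensor:
        # x: (N, C, H, W) features at one level. part_boxes maps each region
        # name (including 'whole') to an (N, 5) tensor of RoIs in
        # (batch_index, x1, y1, x2, y2) format, precomputed from the 14 pose
        # keypoints and given in feature-map coordinates.
        rois = part_boxes[self.region]
        # RoI pooling [1]: fixed-size local features for the relevant region.
        local = roi_pool(x, rois, output_size=(self.pool_size, self.pool_size))
        return self.fc(local.flatten(1))  # attribute logit from the selected region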


Region   Attributes
Head     BaldHead, LongHair, BlackHair, Hat, Glasses, Muffler, Calling
Torso    Shirt, Sweater, Vest, TShirt, Cotton, Jacket, Suit-Up, Tight,
         ShortSleeve, LongTrousers, Skirt, ShortSkirt, Dress, Jeans,
         TightTrousers, CarryingbyArm, CarryingbyHand
Legs     LeatherShoes, SportShoes, Boots, ClothShoes, CasualShoes
Whole    Female, AgeLess16, Age17-30, Age31-45, BodyFat, BodyNormal,
         BodyThin, Customer, Clerk, Backpack, SSBag, HandBag, Box,
         PlasticBag, PaperBag, HandTrunk, OtherAttchment, Talking,
         Gathering, Holding, Pushing, Pulling

Table S1: Attribute-region correspondence on the RAP dataset.
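In code, Table S1 amounts to a lookup table from attribute name to region, used by the module sketched above to pick the relevant RoI. A short excerpt (illustrative, not the authors' code; the remaining entries follow the table):

ATTR_TO_REGION = {
    'Hat': 'head', 'LongHair': 'head', 'Glasses': 'head',
    'Jacket': 'torso', 'TShirt': 'torso', 'Dress': 'torso',
    'Boots': 'legs', 'CasualShoes': 'legs',
    'Female': 'whole', 'Backpack': 'whole', 'Pushing': 'whole',
}

# E.g., the Hat attribute is recognized using features from the head part only.
assert ATTR_TO_REGION['Hat'] == 'head'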

Figure S1: Illustration of body-parts generation. We divide a pedestrian image into three body-part regions (head, torso, and legs) based on 14 human body keypoints.
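One straightforward way to realize this division is to take the bounding box of the keypoints assigned to each part, as in the sketch below. The grouping of indices assumes an LSP-style joint ordering (0-5 legs, 2-3 hips, 8-9 shoulders, 12 neck, 13 head top), which the paper does not specify, and the margin is likewise an assumption.

import torch

# Hypothetical grouping of the 14 keypoints into parts (LSP-style ordering assumed).
PART_KEYPOINTS = {
    'head':  [12, 13],
    'torso': [2, 3, 8, 9],
    'legs':  [0, 1, 2, 3, 4, 5],
}

def part_box(keypoints: torch.Tensor, part: str, margin: float = 4.0) -> torch.Tensor:
    """Bounding box (x1, y1, x2, y2) around the keypoints assigned to a part.

    keypoints: (14, 2) tensor of (x, y) coordinates for one pedestrian image.
    margin: padding added around the tight keypoint box (an assumption).
    """
    pts = keypoints[PART_KEYPOINTS[part]]
    x1, y1 = (pts.min(dim=0).values - margin).unbind()
    x2, y2 = (pts.max(dim=0).values + margin).unbind()
    return torch.stack([x1, y1, x2, y2])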

2.3. More Results

We provide additional localization results for different attributes, as shown in Figure S4.

References

[1] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.

[2] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2285–2294, 2018.

[3] Haiyu Zhao, Maoqing Tian, Shuyang Sun, Jing Shao, Junjie Yan, Shuai Yi, Xiaogang Wang, and Xiaoou Tang. Spindle Net: Person re-identification with human body region guided feature decomposition and fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907–915, 2017.


Figure S2: Details of the spatial attention module for one attribute at a single level. The expected attention mask is generated by a cross-channel averaging layer followed by a 3 × 3 Conv-BN-ReLU block.

Figure S3: Details of the body-parts guided module for one attribute at a single level. The three body-part regions are computed from human body keypoints predicted by a pretrained pose estimation model. The local features belonging to different body parts are extracted by an RoI pooling layer, and the most relevant features are selected for attribute classification according to the predefined attribute-region correspondence (Table S1).


Figure S4: Case studies of different attribute-specific localization methods (attention masks, rigid parts, and the proposed attribute regions) for five attributes: Hat, Jacket, LongHair, BodyFat, and Pushing.