improving pedestrian attribute recognition with …...improving pedestrian attribute recognition...

1
Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization Chufeng Tang 1 , Lu Sheng 2 , Zhaoxiang Zhang 3 , Xiaolin Hu 1 * 1 Tsinghua University, 2 Beihang University, 3 Chinese Academy of Sciences Ø Problem: previous pedestrian attribute recognition methods failed to indicate the attribute-region correspondence Ø Contribution: performing attribute-specific localization at multiple scales to find the most discriminative region for each attribute in a weakly-supervised manner Ø Results: improvement across three datasets, end-to-end trainable, less computational cost Ø Top-down feature pyramid low-level details: feature learning high-level semantics: localization Ø Deep supervision for training 4 predictions are directly supervised by GT , trained insufficiently otherwise Ø Maximum voting for inference choosing the most confident prediction Effectiveness of each component Ø Attribute-agnostic attention: attend to a broad region, no attribute-region correspondence Ø Rigid body parts localization: simply fuse the local features, require extra computation Ø We need Attribute-Specific Localization maintain the attribute-region correspondence fully adaptive, without region annotations interpretable and computationally efficient Summary Methodology Ablation Study Motivation 1×1 Conv ReLU Global Pool 1×1 Conv Sigmoid Mul Add " FC Spatial Transformer ... FC Attribute Localization Module ... # $ % & $ ' ... ... ... 256 × 128 8×4 ... 32 × 16 ... 16 × 8 ... Attribute Localization Module Concat Concat up × 2 up × 2 ... ... Element-wise Maximum ... ... " # $ 1 2 M ... ... " # $ 1 2 M ... ... " # $ 1 2 M ... 1 2 M ) * + , Attribute Prediction 1 2 M 1 2 M - . + (-) . * (-) . ) (-) 1 + 1 * 1 ) 2 3 , 2 3 + 2 3 * 2 3 ) Quantitative Results Ø Spatial transformer simplified STN, learn to represent attribute region should be adaptive and differentiable (RoI pooling can’t) Ø Feature alignment a tiny channel attention sub-network, modulating the inter- channel dependencies, since features from different levels should contribute unequally (some need more details). Ø One for each attribute, but still light-weight Three different attribute-specific methods Ø Each attention mask corresponds to one attribute over-adaptive: try to cover all pixels but often failed, since there is no accurate localization labels. Ø Each attribute associated with predefined parts lack-adaptive: discard the adaptive factors, which are less robust to variances. Ø We achieve a balance between two extremes using attribute-specific bounding boxes, which relatively coarse but more interpretable. Input Rigid Parts Attribute Regions Attention Masks Boots Glasses Box weighted binary cross-entropy loss 0.5 0.6 0.7 0.8 0.9 1 Female Clerk Boots LongHair TightTrousers Jeans Vest Pusing Hat Skirt Customer Calling Dress Shirt AgeLess16 LongTrousers HandTrunk Backpack Glasses LeatherShoes CarryingbyHand Suit-Up ClothShoes Cotton SSBag Jacket ShortSleeve ShortSkirt HandBag Box Pulling SportShoes BaldHead BlackHair PlasticBag TShirt OtherAttchment PaperBag Muffler Tight Age31-45 Age17-30 Sweater CarryingbyArm CasualShoes BodyFat Gathering BodyThin Holding BodyNormal Talking Baseline Ours Input Ours Attention Body-parts Attribute: Longhair BodyFat Backpack PlasticBag Hat +3.1% +1.7% +1.3%

Upload: others

Post on 30-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improving Pedestrian Attribute Recognition With …...Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization Chufeng Tang1, Lu

Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

Chufeng Tang1, Lu Sheng2, Zhaoxiang Zhang3, Xiaolin Hu1*1Tsinghua University, 2Beihang University, 3Chinese Academy of Sciences

Ø Problem: previous pedestrian attributerecognition methods failed to indicate theattribute-region correspondence

Ø Contribution: performing attribute-specificlocalization at multiple scales to find the mostdiscriminative region for each attribute in aweakly-supervised manner

Ø Results: improvement across three datasets,end-to-end trainable, less computational cost

Ø Top-down feature pyramidlow-level details: feature learninghigh-level semantics: localization

Ø Deep supervision for training4 predictions are directly supervised byGT, trained insufficiently otherwise

Ø Maximum voting for inferencechoosing the most confident prediction

Effectiveness of each component

Ø Attribute-agnostic attention: attend to a broadregion, no attribute-region correspondence

Ø Rigid body parts localization: simply fuse thelocal features, require extra computation

Ø We need Attribute-Specific Localization√ maintain the attribute-region correspondence√ fully adaptive, without region annotations√ interpretable and computationally efficient

Summary Methodology Ablation Study

Motivation 1×1Conv

ReLU

GlobalPool

1×1Conv

Sigmoid

Mul Add

"FC

SpatialTransformer

...

FC

Attribute Localization Module

...

#$

%&$'

.........

256 × 128 8 × 4. ..

32 × 16. ..

16 × 8. ..

AttributeLocalizationModule

ConcatConcatup × 2 up × 2

...

...

Element-wiseMaximum

...

...

" # $

1 2 M

...

...

" # $

1 2 M

...

...

" # $

1 2 M

...

12

M

ℒ)ℒ*ℒ+

ℒ,

AttributePrediction

12

M

12

M

- .+(-) .*(-) .)(-)

1+ 1* 1)

23,

23+ 23* 23)

Quantitative Results

Ø Spatial transformersimplified STN, learn to represent attribute regionshould be adaptive and differentiable (RoI pooling can’t)

Ø Feature alignmenta tiny channel attention sub-network, modulating the inter-channel dependencies, since features from different levelsshould contribute unequally (some need more details).

Ø One for each attribute, but still light-weight

Three different attribute-specific methodsØ Each attention mask corresponds to one attribute

over-adaptive: try to cover all pixels but often failed,since there is no accurate localization labels.

Ø Each attribute associated with predefined partslack-adaptive: discard the adaptive factors, which areless robust to variances.

Ø We achieve a balancebetween two extremes usingattribute-specific boundingboxes, which relatively coarse but more interpretable.

Input Rigid PartsAttribute Regions Attention Masks

Boots

Glasses

Box

weighted binary cross-entropy loss

0.5

0.6

0.7

0.8

0.9

1

Female Cle

rkBoots

LongHair

TightTrousersJeans Ve

stPusing Ha

tSkirt

CustomerCallingDressShirt

AgeLess16

LongTrousers

HandTrunk

Backpack

Glasses

LeatherShoes

CarryingbyH

and

Suit-Up

ClothShoesCottonSSBagJacket

ShortSleeve

ShortSkirt

HandBag Bo

xPulling

SportShoes

BaldHead

BlackH

air

PlasticBagTShirt

OtherAttchment

PaperBag

MufflerTight

Age31-45

Age17-30

Sweater

CarryingbyA

rm

CasualShoes

BodyFat

Gathering

BodyThin

Holdin

g

BodyNormal

Talking

BaselineOurs

Input Ours Attention Body-parts

Attribute: Longhair

BodyFatBackpack PlasticBag Hat

+3.1%

+1.7%

+1.3%