improving pedestrian attribute recognition with …...improving pedestrian attribute recognition...

Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

Chufeng Tang1, Lu Sheng2, Zhaoxiang Zhang3, Xiaolin Hu1*1Tsinghua University, 2Beihang University, 3Chinese Academy of Sciences

Ø Problem: previous pedestrian attributerecognition methods failed to indicate theattribute-region correspondence

Ø Contribution: performing attribute-specificlocalization at multiple scales to find the mostdiscriminative region for each attribute in aweakly-supervised manner

Ø Results: improvement across three datasets,end-to-end trainable, less computational cost

Ø Top-down feature pyramidlow-level details: feature learninghigh-level semantics: localization

Ø Deep supervision for training4 predictions are directly supervised byGT, trained insufficiently otherwise

Ø Maximum voting for inferencechoosing the most confident prediction

Effectiveness of each component

Ø Attribute-agnostic attention: attend to a broadregion, no attribute-region correspondence

Ø Rigid body parts localization: simply fuse thelocal features, require extra computation

Ø We need Attribute-Specific Localization√ maintain the attribute-region correspondence√ fully adaptive, without region annotations√ interpretable and computationally efficient

Summary Methodology Ablation Study

Motivation 1×1Conv

ReLU

GlobalPool

1×1Conv

Sigmoid

Mul Add

"FC

SpatialTransformer

...

FC

Attribute Localization Module

...

#$

%&$'

.........

256 × 128 8 × 4. ..

32 × 16. ..

16 × 8. ..

AttributeLocalizationModule

ConcatConcatup × 2 up × 2

...

...

Element-wiseMaximum

...

...

" # $

1 2 M

...

...

" # $

1 2 M

...

...

" # $

1 2 M

...

12

M

ℒ)ℒ*ℒ+

ℒ,

AttributePrediction

12

M

12

M

- .+(-) .*(-) .)(-)

1+ 1* 1)

23,

23+ 23* 23)

Quantitative Results

Ø Spatial transformersimplified STN, learn to represent attribute regionshould be adaptive and differentiable (RoI pooling can’t)

Ø Feature alignmenta tiny channel attention sub-network, modulating the inter-channel dependencies, since features from different levelsshould contribute unequally (some need more details).

Ø One for each attribute, but still light-weight

Three different attribute-specific methodsØ Each attention mask corresponds to one attribute

over-adaptive: try to cover all pixels but often failed,since there is no accurate localization labels.

Ø Each attribute associated with predefined partslack-adaptive: discard the adaptive factors, which areless robust to variances.

Ø We achieve a balancebetween two extremes usingattribute-specific boundingboxes, which relatively coarse but more interpretable.

Input Rigid PartsAttribute Regions Attention Masks

Boots

Glasses

Box

weighted binary cross-entropy loss

0.5

0.6

0.7

0.8

0.9

1

Female Cle

rkBoots

LongHair

TightTrousersJeans Ve

stPusing Ha

tSkirt

CustomerCallingDressShirt

AgeLess16

LongTrousers

HandTrunk

Backpack

Glasses

LeatherShoes

CarryingbyH

and

Suit-Up

ClothShoesCottonSSBagJacket

ShortSleeve

ShortSkirt

HandBag Bo

xPulling

SportShoes

BaldHead

BlackH

air

PlasticBagTShirt

OtherAttchment

PaperBag

MufflerTight

Age31-45

Age17-30

Sweater

CarryingbyA

rm

CasualShoes

BodyFat

Gathering

BodyThin

Holdin

g

BodyNormal

Talking

BaselineOurs

Input Ours Attention Body-parts

Attribute: Longhair

BodyFatBackpack PlasticBag Hat

+3.1%

+1.7%

+1.3%

improving pedestrian attribute recognition with …...improving pedestrian attribute recognition...

Documents