improving pedestrian attribute recognition with …...improving pedestrian attribute recognition...
TRANSCRIPT
Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization
Chufeng Tang1, Lu Sheng2, Zhaoxiang Zhang3, Xiaolin Hu1*1Tsinghua University, 2Beihang University, 3Chinese Academy of Sciences
Ø Problem: previous pedestrian attributerecognition methods failed to indicate theattribute-region correspondence
Ø Contribution: performing attribute-specificlocalization at multiple scales to find the mostdiscriminative region for each attribute in aweakly-supervised manner
Ø Results: improvement across three datasets,end-to-end trainable, less computational cost
Ø Top-down feature pyramidlow-level details: feature learninghigh-level semantics: localization
Ø Deep supervision for training4 predictions are directly supervised byGT, trained insufficiently otherwise
Ø Maximum voting for inferencechoosing the most confident prediction
Effectiveness of each component
Ø Attribute-agnostic attention: attend to a broadregion, no attribute-region correspondence
Ø Rigid body parts localization: simply fuse thelocal features, require extra computation
Ø We need Attribute-Specific Localization√ maintain the attribute-region correspondence√ fully adaptive, without region annotations√ interpretable and computationally efficient
Summary Methodology Ablation Study
Motivation 1×1Conv
ReLU
GlobalPool
1×1Conv
Sigmoid
Mul Add
"FC
SpatialTransformer
...
FC
Attribute Localization Module
...
#$
%&$'
.........
256 × 128 8 × 4. ..
32 × 16. ..
16 × 8. ..
AttributeLocalizationModule
ConcatConcatup × 2 up × 2
...
...
Element-wiseMaximum
...
...
" # $
1 2 M
...
...
" # $
1 2 M
...
...
" # $
1 2 M
...
12
M
ℒ)ℒ*ℒ+
ℒ,
AttributePrediction
12
M
12
M
- .+(-) .*(-) .)(-)
1+ 1* 1)
23,
23+ 23* 23)
Quantitative Results
Ø Spatial transformersimplified STN, learn to represent attribute regionshould be adaptive and differentiable (RoI pooling can’t)
Ø Feature alignmenta tiny channel attention sub-network, modulating the inter-channel dependencies, since features from different levelsshould contribute unequally (some need more details).
Ø One for each attribute, but still light-weight
Three different attribute-specific methodsØ Each attention mask corresponds to one attribute
over-adaptive: try to cover all pixels but often failed,since there is no accurate localization labels.
Ø Each attribute associated with predefined partslack-adaptive: discard the adaptive factors, which areless robust to variances.
Ø We achieve a balancebetween two extremes usingattribute-specific boundingboxes, which relatively coarse but more interpretable.
Input Rigid PartsAttribute Regions Attention Masks
Boots
Glasses
Box
weighted binary cross-entropy loss
0.5
0.6
0.7
0.8
0.9
1
Female Cle
rkBoots
LongHair
TightTrousersJeans Ve
stPusing Ha
tSkirt
CustomerCallingDressShirt
AgeLess16
LongTrousers
HandTrunk
Backpack
Glasses
LeatherShoes
CarryingbyH
and
Suit-Up
ClothShoesCottonSSBagJacket
ShortSleeve
ShortSkirt
HandBag Bo
xPulling
SportShoes
BaldHead
BlackH
air
PlasticBagTShirt
OtherAttchment
PaperBag
MufflerTight
Age31-45
Age17-30
Sweater
CarryingbyA
rm
CasualShoes
BodyFat
Gathering
BodyThin
Holdin
g
BodyNormal
Talking
BaselineOurs
Input Ours Attention Body-parts
Attribute: Longhair
BodyFatBackpack PlasticBag Hat
+3.1%
+1.7%
+1.3%