Diversity in Fashion Recommendation Using Semantic Parsing
Sagar Verma1, Sukhad Anand1, Chetan Arora1, Atul Rai2
1Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Delhi.
2Staqu Technologies
ICIP, 2018
1/20
Problem Statement
Recommendation based on contextual similarity
[Figure: a query image with clothing parts labeled Hat, Dress, and Bag, shown alongside relevant item images.]
Images retrieved by finding similarity between features computed over the whole image, contrasted with images retrieved by finding similarity between features computed over semantically similar parts of the image.
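To make the contrast concrete, here is a minimal sketch of part-based retrieval scoring; the part_based_score helper, the cosine measure, and the matching of parts by name are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def part_based_score(query_parts, item_parts):
    """Score a catalog item against a query by comparing features of
    semantically matching parts (e.g. 'hat' vs. 'hat') instead of
    whole-image features. Both arguments map part name -> feature vector."""
    common = set(query_parts) & set(item_parts)
    if not common:
        return 0.0
    # Average similarity over the parts the two images share.
    return sum(cosine(query_parts[p], item_parts[p]) for p in common) / len(common)

# Illustrative usage with random features for a few parts.
rng = np.random.default_rng(0)
query = {p: rng.normal(size=128) for p in ("hat", "dress", "bag")}
item = {p: rng.normal(size=128) for p in ("hat", "dress")}
print(part_based_score(query, item))
```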
2/20
Contributions
1. Our method recommends similar images based on different parts of a query image.
2. To identify the different parts, we use attention and weakly labeled data.
3. Instead of features from standard pre-trained neural networks, we suggest using texture-based features.
4. We evaluate our method on the item recognition, consumer-to-shop retrieval, and in-shop retrieval tasks.
3/20
Related Work
▶ Cloth parsing,
▶ Clothing attribute recognition,
▶ Detecting fashion style, and
▶ Cross-domain image retrieval using Siamese and Triplet networks.
4/20
Proposed Architecture
[Architecture diagram: the input image passes through the first six convolution layers of StyleNet to produce a feature map fI; a Spatial Transformer (ST) LSTM produces an attention map Mk at each time step, and each attended region fIk is passed through a texture encoding layer to produce category-wise max-pooled labels.]
The localization network (ST-LSTM) focuses on a different clothing item present in the image at each time step.
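A rough PyTorch sketch of this recurrent attention loop, under loose assumptions: the LSTM cell, the affine localization head, and the stand-in texture encoder below are placeholders with invented sizes, not the paper's StyleNet-based implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentAttention(nn.Module):
    """Sketch of the recurrent attention loop: at each of K steps an LSTM
    state drives a spatial transformer that crops a clothing part from the
    global feature map; each crop is encoded and classified. All module
    internals here are placeholders, not the paper's code."""
    def __init__(self, feat_ch=256, hidden=512, n_classes=59, steps=5):
        super().__init__()
        self.steps = steps
        self.lstm = nn.LSTMCell(feat_ch, hidden)
        self.loc = nn.Linear(hidden, 6)            # affine params for the ST
        self.encode = nn.Linear(feat_ch, feat_ch)  # stands in for texture encoding
        self.classify = nn.Linear(feat_ch, n_classes)

    def forward(self, f_img):                      # f_img: (B, C, H, W)
        B = f_img.shape[0]
        h = f_img.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        x = f_img.mean(dim=(2, 3))                 # global context starts the loop
        logits = []
        for _ in range(self.steps):
            h, c = self.lstm(x, (h, c))
            theta = self.loc(h).view(B, 2, 3)      # per-step attention transform
            grid = F.affine_grid(theta, f_img.size(), align_corners=False)
            part = F.grid_sample(f_img, grid, align_corners=False)
            x = self.encode(part.mean(dim=(2, 3)))
            logits.append(self.classify(x))
        # Category-wise max-pool over the K per-step predictions.
        return torch.stack(logits, dim=1).max(dim=1).values
```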
5/20
CNN for Global Image Features
[Same architecture diagram, highlighting the CNN for global image features: the first six convolution layers of StyleNet compute the global feature map fI from the input image.]
6/20
Visual Attention Module
[Same architecture diagram, highlighting the visual attention module: the Spatial Transformer (ST) LSTM that attends to a different clothing item at each time step.]
7/20
Texture Encoding Layer
[Same architecture diagram, highlighting the texture encoding layers applied to the global feature map and to each attended part fIk before category-wise max-pooling.]
8/20
Multi-label classification loss
$$L_{cls} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( p_i^c - \hat{p}_i^c \right)^2 \qquad (1)$$

where $N$ is the number of training samples, $C$ is the total number of classes, $\hat{p}_i$ is the ground-truth label vector of sample $i$, and $p_i$ is the predicted label vector of sample $i$.
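A direct transcription of Eq. (1) in PyTorch; the (N, C) tensor layout is an assumption.

```python
import torch

def multilabel_mse_loss(pred, target):
    # Eq. (1): mean over the N samples of the per-sample squared error
    # between the predicted and ground-truth C-dim label vectors.
    # pred, target: (N, C) tensors.
    return ((pred - target) ** 2).sum(dim=1).mean()
```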
9/20
Diversity loss
The diversity loss is the correlation between adjacent attention maps:

$$L_{div} = \frac{1}{K-1} \sum_{k=2}^{K} \sum_{i=1}^{H \times W} l_{k-1,i} \, l_{k,i} \qquad (2)$$

where $K$ is the total number of recurrent attention steps, $H \times W$ is the spatial size (height × width) of the attention maps, and $l_k$ is the $k$-th attention map.
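Eq. (2) transcribed directly; the (K, H*W) layout of the flattened attention maps is an assumption.

```python
def diversity_loss(attn):
    # Eq. (2): sum over spatial locations of l_{k-1,i} * l_{k,i},
    # accumulated over adjacent steps k = 2..K and divided by K - 1.
    # attn: (K, H*W) tensor of flattened attention maps l_1..l_K.
    K = attn.shape[0]
    return (attn[:-1] * attn[1:]).sum() / (K - 1)
```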
10/20
Localisation loss
The localization loss $L_{loc}$ from [1] is used to remove redundant locations and force the localization network to look at small clothing parts.
11/20
Combined Loss
$$L = L_{cls} + \lambda_1 L_{div} + \lambda_2 L_{loc} \qquad (3)$$

where $\lambda_1$ and $\lambda_2$ are multiplicative factors; we set both to 0.01 in all our experiments.
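Putting the three terms together as in Eq. (3); the component losses are assumed to be precomputed scalar tensors (the localization loss of [1] is not reproduced here).

```python
def combined_loss(cls_loss, div_loss, loc_loss, lambda_1=0.01, lambda_2=0.01):
    # Eq. (3): weighted sum of the three objectives; both multiplicative
    # factors are set to 0.01 in all of the paper's experiments.
    return cls_loss + lambda_1 * div_loss + lambda_2 * loc_loss
```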
12/20
Datasets
▶ Fashion144K [2]
  ▶ 90,000 images with multi-label annotations.
  ▶ 128 classes.
  ▶ Image resolution is 384×256.
▶ Fashion550K [3]
  ▶ 66 classes.
▶ DeepFashion [4]
  ▶ 800,000 images.
  ▶ Similarity pairs are available for the consumer-to-shop and in-shop retrieval tasks.
13/20
Experiments
▶ The model is trained on the Fashion144K [2] dataset with 59 item labels; color labels were excluded.
▶ The item recognition task is evaluated on the Fashion144K [2] and Fashion550K [3] datasets.
▶ The consumer-to-shop and in-shop retrieval tasks are evaluated on the DeepFashion [4] dataset.
14/20
Results
                     Fashion144k [2]      Fashion550k [3]
Model                APall    mAP         APall    mAP
StyleNet [2]         65.6     48.34       69.53    53.24
Baseline [3]         62.23    49.66       69.18    58.68
Veit et al. [5]      NA       NA          78.92    63.08
Inoue et al. [3]     NA       NA          79.87    64.62
Ours                 82.78    68.38       82.81    57.93

Multi-label classification results on Fashion144k [2] and Fashion550k [3].
15/20
Results
[Plot (a), in-shop retrieval: average accuracy vs. number of retrieved images (k = 1, 2, ..., 50) for WTBI (0.506), DARN (0.67), FashionNet (0.764), and Ours (0.784).]
[Plot (b), consumer-to-shop retrieval: average accuracy vs. number of retrieved images (k = 1, 2, ..., 50) for WTBI (0.063), DARN (0.111), FashionNet (0.188), and Ours (0.253).]
Retrieval results for the in-shop and consumer-to-shop retrieval tasks on the DeepFashion dataset [4].
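A plausible reading of the accuracy-vs-k curves, sketched as a top-k retrieval metric; the query-gallery similarity-matrix layout and the single ground-truth match per query are assumptions on my part, not the paper's stated protocol.

```python
import numpy as np

def topk_retrieval_accuracy(sim, gt_index, k):
    # Fraction of queries whose true match appears among the k most
    # similar gallery items. sim: (Q, G) query-gallery similarity
    # matrix; gt_index: length-Q array of matching gallery indices.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())
```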
16/20
Results
[Figure: six query images and their retrieved results, annotated with detected parts such as Hat, Dress, Bag, Belt, Boots, Skirt, Glasses, Coat, Shoes, and Jeans.]
Semantically similar results for some of the query images from the Fashion144k dataset [2], using our method.
17/20
Conclusion
▶ Using clothing parts for recommendation gives much more variability in the recommendation results.
▶ Attention can be used to learn discriminative features from weak labels.
▶ Texture cues are important for learning different parts.
18/20
References
[1] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin, "Multi-label image recognition by recurrently discovering attentional regions," in ICCV, 2017.
[2] E. Simo-Serra and H. Ishikawa, "Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction," in CVPR, 2016, pp. 298–307.
[3] Naoto Inoue, Edgar Simo-Serra, Toshihiko Yamasaki, and Hiroshi Ishikawa, "Multi-label fashion image classification with minimal human supervision," in ICCVW, 2017, pp. 2261–2267.
[4] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in CVPR, 2016, pp. 1096–1104.
[5] Andreas Veit et al., "Learning from noisy large-scale datasets with minimal supervision," in CVPR, 2017.
19/20
Thank you!
Questions?
20/20