Diversity in Fashion Recommendation Using Semantic Parsing
Sagar Verma1, Sukhad Anand1, Chetan Arora1, Atul Rai2
1Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Delhi.
2Staqu Technologies
ICIP, 2018
1/20
Problem Statement
Recommendation based on contextual similarity
[Figure: a query image with clothing parts labeled Hat, Dress, and Bag, shown alongside relevant item images.]
Images retrieved by finding similarity between features computed over the whole image, contrasted with images retrieved by finding similarity between features computed over semantically similar parts of the image.
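To make the contrast concrete, here is a minimal sketch of part-based retrieval scoring; the part_based_score helper, the cosine measure, and the matching of parts by name are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def part_based_score(query_parts, item_parts):
    """Score a catalog item against a query by comparing features of
    semantically matching parts (e.g. 'hat' vs. 'hat') instead of
    whole-image features. Both arguments map part name -> feature vector."""
    common = set(query_parts) & set(item_parts)
    if not common:
        return 0.0
    # Average similarity over the parts the two images share.
    return sum(cosine(query_parts[p], item_parts[p]) for p in common) / len(common)

# Illustrative usage with random features for a few parts.
rng = np.random.default_rng(0)
query = {p: rng.normal(size=128) for p in ("hat", "dress", "bag")}
item = {p: rng.normal(size=128) for p in ("hat", "dress")}
print(part_based_score(query, item))
```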
2/20
Contributions
1. Our method recommends similar images based on different parts of a query image.
2. To identify the different parts, we use attention and weakly labeled data.
3. Instead of features from standard pre-trained neural networks, we suggest using texture-based features.
4. We evaluate our method on the item recognition, consumer-to-shop retrieval, and in-shop retrieval tasks.
3/20
Related Work
▶ Cloth parsing,
▶ Clothing attribute recognition,
▶ Detecting fashion style, and
▶ Cross-domain image retrieval using Siamese and Triplet networks.
4/20
Proposed Architecture
[Architecture diagram: the input image passes through the first six convolution layers of StyleNet to produce a feature map fI; a Spatial Transformer (ST) LSTM produces an attention map Mk at each time step, and each attended region fIk is passed through a texture encoding layer to produce category-wise max-pooled labels.]
The localization network (ST-LSTM) focuses on a different clothing item present in the image at each time step.
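A rough PyTorch sketch of this recurrent attention loop, under loose assumptions: the LSTM cell, the affine localization head, and the stand-in texture encoder below are placeholders with invented sizes, not the paper's StyleNet-based implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentAttention(nn.Module):
    """Sketch of the recurrent attention loop: at each of K steps an LSTM
    state drives a spatial transformer that crops a clothing part from the
    global feature map; each crop is encoded and classified. All module
    internals here are placeholders, not the paper's code."""
    def __init__(self, feat_ch=256, hidden=512, n_classes=59, steps=5):
        super().__init__()
        self.steps = steps
        self.lstm = nn.LSTMCell(feat_ch, hidden)
        self.loc = nn.Linear(hidden, 6)            # affine params for the ST
        self.encode = nn.Linear(feat_ch, feat_ch)  # stands in for texture encoding
        self.classify = nn.Linear(feat_ch, n_classes)

    def forward(self, f_img):                      # f_img: (B, C, H, W)
        B = f_img.shape[0]
        h = f_img.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        x = f_img.mean(dim=(2, 3))                 # global context starts the loop
        logits = []
        for _ in range(self.steps):
            h, c = self.lstm(x, (h, c))
            theta = self.loc(h).view(B, 2, 3)      # per-step attention transform
            grid = F.affine_grid(theta, f_img.size(), align_corners=False)
            part = F.grid_sample(f_img, grid, align_corners=False)
            x = self.encode(part.mean(dim=(2, 3)))
            logits.append(self.classify(x))
        # Category-wise max-pool over the K per-step predictions.
        return torch.stack(logits, dim=1).max(dim=1).values
```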
5/20
CNN for Global Image Features
[Same architecture diagram, highlighting the CNN for global image features: the first six convolution layers of StyleNet compute the global feature map fI from the input image.]
6/20
Visual Attention Module
[Same architecture diagram, highlighting the visual attention module: the Spatial Transformer (ST) LSTM that attends to a different clothing item at each time step.]
7/20
Texture Encoding Layer
[Same architecture diagram, highlighting the texture encoding layers applied to the global feature map and to each attended part fIk before category-wise max-pooling.]
8/20
Multi-label classification loss
$$L_{cls} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( p_i^c - \hat{p}_i^c \right)^2 \qquad (1)$$

where $N$ is the number of training samples, $C$ is the total number of classes, $\hat{p}_i$ is the ground-truth label vector of sample $i$, and $p_i$ is the predicted label vector of sample $i$.
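A direct transcription of Eq. (1) in PyTorch; the (N, C) tensor layout is an assumption.

```python
import torch

def multilabel_mse_loss(pred, target):
    # Eq. (1): mean over the N samples of the per-sample squared error
    # between the predicted and ground-truth C-dim label vectors.
    # pred, target: (N, C) tensors.
    return ((pred - target) ** 2).sum(dim=1).mean()
```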
9/20
Diversity loss
The diversity loss is the correlation between adjacent attention maps:

$$L_{div} = \frac{1}{K-1} \sum_{k=2}^{K} \sum_{i=1}^{H \times W} l_{k-1,i} \, l_{k,i} \qquad (2)$$

where $K$ is the total number of recurrent attention steps, $H \times W$ is the spatial size (height × width) of the attention maps, and $l_k$ is the $k$-th attention map.
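Eq. (2) transcribed directly; the (K, H*W) layout of the flattened attention maps is an assumption.

```python
def diversity_loss(attn):
    # Eq. (2): sum over spatial locations of l_{k-1,i} * l_{k,i},
    # accumulated over adjacent steps k = 2..K and divided by K - 1.
    # attn: (K, H*W) tensor of flattened attention maps l_1..l_K.
    K = attn.shape[0]
    return (attn[:-1] * attn[1:]).sum() / (K - 1)
```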
10/20
Localisation loss
The localization loss $L_{loc}$ from [1] is used to remove redundant locations and force the localization network to look at small clothing parts.
11/20
Combined Loss
$$L = L_{cls} + \lambda_1 L_{div} + \lambda_2 L_{loc} \qquad (3)$$

where $\lambda_1$ and $\lambda_2$ are multiplicative factors; we set both to 0.01 in all our experiments.
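Putting the three terms together as in Eq. (3); the component losses are assumed to be precomputed scalar tensors (the localization loss of [1] is not reproduced here).

```python
def combined_loss(cls_loss, div_loss, loc_loss, lambda_1=0.01, lambda_2=0.01):
    # Eq. (3): weighted sum of the three objectives; both multiplicative
    # factors are set to 0.01 in all of the paper's experiments.
    return cls_loss + lambda_1 * div_loss + lambda_2 * loc_loss
```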
12/20
Datasets
▶ Fashion144K [2]
  ▶ 90,000 images with multi-label annotations.
  ▶ 128 classes.
  ▶ Image resolution is 384×256.
▶ Fashion550K [3]
  ▶ 66 classes.
▶ DeepFashion [4]
  ▶ 800,000 images.
  ▶ Similarity pairs are available for the consumer-to-shop and in-shop retrieval tasks.
13/20
Experiments
▶ The model is trained on the Fashion144K [2] dataset with 59 item labels; color labels were excluded.
▶ The item recognition task is evaluated on the Fashion144K [2] and Fashion550K [3] datasets.
▶ The consumer-to-shop and in-shop retrieval tasks are evaluated on the DeepFashion [4] dataset.
14/20
Results
                     Fashion144k [2]      Fashion550k [3]
Model                APall    mAP         APall    mAP
StyleNet [2]         65.6     48.34       69.53    53.24
Baseline [3]         62.23    49.66       69.18    58.68
Veit et al. [5]      NA       NA          78.92    63.08
Inoue et al. [3]     NA       NA          79.87    64.62
Ours                 82.78    68.38       82.81    57.93

Multi-label classification results on Fashion144k [2] and Fashion550k [3].
15/20
Results
[Plot (a), in-shop retrieval: average accuracy vs. number of retrieved images (k = 1, 2, ..., 50) for WTBI (0.506), DARN (0.67), FashionNet (0.764), and Ours (0.784).]
[Plot (b), consumer-to-shop retrieval: average accuracy vs. number of retrieved images (k = 1, 2, ..., 50) for WTBI (0.063), DARN (0.111), FashionNet (0.188), and Ours (0.253).]
Retrieval results for the in-shop and consumer-to-shop retrieval tasks on the DeepFashion dataset [4].
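A plausible reading of the accuracy-vs-k curves, sketched as a top-k retrieval metric; the query-gallery similarity-matrix layout and the single ground-truth match per query are assumptions on my part, not the paper's stated protocol.

```python
import numpy as np

def topk_retrieval_accuracy(sim, gt_index, k):
    # Fraction of queries whose true match appears among the k most
    # similar gallery items. sim: (Q, G) query-gallery similarity
    # matrix; gt_index: length-Q array of matching gallery indices.
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.asarray(gt_index)[:, None]).any(axis=1)
    return float(hits.mean())
```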
16/20
Results
[Figure: six query images and their retrieved results, annotated with detected parts such as Hat, Dress, Bag, Belt, Boots, Skirt, Glasses, Coat, Shoes, and Jeans.]
Semantically similar results for some of the query images from the Fashion144k dataset [2], using our method.
17/20
Conclusion
▶ Using clothing parts for recommendation gives much more variability in the recommendation results.
▶ Attention can be used to learn discriminative features from weak labels.
▶ Texture cues are important for learning different parts.
18/20
References
[1] Zhouxia Wang, Tianshui Chen, Guanbin Li, Ruijia Xu, and Liang Lin, "Multi-label image recognition by recurrently discovering attentional regions," in ICCV, 2017.
[2] E. Simo-Serra and H. Ishikawa, "Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction," in CVPR, 2016, pp. 298–307.
[3] Naoto Inoue, Edgar Simo-Serra, Toshihiko Yamasaki, and Hiroshi Ishikawa, "Multi-label fashion image classification with minimal human supervision," in ICCVW, 2017, pp. 2261–2267.
[4] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in CVPR, 2016, pp. 1096–1104.
[5] Andreas Veit et al., "Learning from noisy large-scale datasets with minimal supervision," in CVPR, 2017.
19/20
Thank you!
Questions?
20/20