

CORE: Color Regression for Multiple Colors Fashion Garments

Alexandre Rame*1,2, Arthur Douillard1,2, and Charles Ollion2

1 Sorbonne Universite, CNRS, LIP6, 4 place Jussieu, 75005 Paris
2 Heuritech, 134 Rue Legendre, 75017 Paris

October 29, 2019

Abstract

Among all fashion attributes, color is challenging to detect due to its subjective perception. Existing classification approaches cannot go beyond a predefined list of discrete color names. In this paper, we argue that color detection is a regression problem. Thus, we propose a new two-stage architecture based on attention modules. The first stage corrects the image illumination while detecting the main discrete color name. The second stage combines a colorname-attention (dependent on the detected color) with an object-attention (dependent on the clothing category), and finally weights a spatial pooling over the image pixels' RGB values. We further extend our work to multiple-color garments. We collect a dataset where each fashion item is labeled with a continuous color palette, and we empirically show the benefits of our approach.

1. Introduction

Convolutional Neural Networks (CNNs) [26, 20] generated a lot of interest in the fashion industry. Recent datasets of fashion images [28, 49, 17] encouraged various approaches for attribute classification [7, 40], visual search [44, 19, 30, 13] and object detection [23, 34].

One of the key attributes of a fashion item is its color. However, colors are subjective properties of garments, as not all humans perceive colors the same way [32]: their automatic estimation is therefore challenging. So far, most approaches consider the problem through the angle of discrete color naming: a classifier chooses among 11 colors of the English language [4], based on features from histograms [2, 9, 39] or from CNNs [42, 33, 10, 46, 45, 21, 24]. Recent approaches increased the number of color names up to 28 [47] or 313 [48]. We take a step forward and tackle the problem through the angle of continuous color regression, to handle its inherent ambiguity and multimodality.

*Corresponding author: [email protected]

Rather than predicting an approximate discrete color name, we aim at predicting the exact continuous RGB color value. This refined representation is necessary for many industrial applications such as precise visual search and fine-grained trend detection.

Factors such as complex garment structure and varying illumination spark discrepancies between the raw color and what we perceive: the human brain automatically focuses on the right regions to detect the color and corrects the contrast. First, our approach needs to focus on the interesting regions of the image, reducing the impact of complex backgrounds or clutter [43, 10, 46]. Second, we need color constancy, so that the scene's illumination is ideal white light [5, 3, 29, 6, 36, 15, 22, 37].

Our architecture tackles all of the above challenges end-to-end and simultaneously: (1) a semantic segmenter learns to detect fashion garments, (2) a color constancy module learns to correct the illumination of the image, and finally (3) we predict colors' RGB values by refining the rough discrete color names through a weighted pooling over pixels.

2. Model

The network in Figure 1 is trained end-to-end on the task of continuous color regression, with neither illumination nor clothing segmentation ground-truth annotations.

2.1. Pixel Image Correction

Our model needs to correct the bias in the pixels of the image. All images are first contrast normalized in a preprocessing step by global histogram stretching, in order to be robust across various lighting conditions. Note that this does not change the relative contributions of the three RGB channels. Second, to remove illumination color casts, we need to detect the initial illumination of the image (a scalar value per channel) and then correct each pixel of the image using the Von Kries method [41]. Following [29], and because of its capacity to extract style attributes [16], we choose the VGG16 [38] neural architecture.
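A minimal NumPy sketch of these two corrections follows, assuming the per-channel illuminant comes from the network's illumination head; the function names and the example illuminant value are illustrative, not the paper's exact implementation.

```python
import numpy as np

def stretch_contrast(image):
    """Global histogram stretching: rescale intensities to [0, 255]
    using a single min/max over all channels, which preserves the
    relative contributions of the three RGB channels."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    return (image - lo) / max(hi - lo, 1e-6) * 255.0

def von_kries_correction(image, illuminant):
    """Von Kries correction: divide each RGB channel by the estimated
    illuminant (one scalar per channel, normalized so the strongest
    channel is left untouched)."""
    illuminant = np.asarray(illuminant, dtype=np.float32)
    return np.clip(image / (illuminant / illuminant.max()), 0.0, 255.0)

# Hypothetical usage: the illuminant would come from the VGG16
# illumination head described above (here, a slight color cast).
image = np.random.randint(0, 256, (224, 224, 3)).astype(np.float32)
corrected = von_kries_correction(stretch_contrast(image), [0.9, 1.0, 0.8])
```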


Figure 1. Color regressor components in the mono color setup: (1) a DeepLabv3 object-attention network selects pixels inside the garment; (2) a VGG16 with two heads: the first predicts the image's initial illumination (to be corrected with the Von Kries method), and the second detects the main discrete color name amongst 72 available colors. This main color is used to create a colorname-attention. Combined with the previous object-attention, it selects only pixels that are inside the clothing and that have an RGB value close to the RGB of the predicted discrete color name. These attentions are used for a weighted spatial pooling over the pixels of the image for color regression.

2.2. Spatial Pooling over Pixels for Continuous Color Regression

To sample only the interesting regions of the image, we apply a spatial pooling over the pixels of the image, weighted by two complementary attention modules: the first focuses on the clothing category, the second on the detected color name. These attention masks are multiplied to create the combined-attention.
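The following sketch illustrates this combination, assuming both attention maps have already been computed as (H, W) arrays; the function name is illustrative.

```python
import numpy as np

def weighted_color_pooling(image, object_attention, colorname_attention):
    """Multiply the two (H, W) attention maps, renormalize so the
    combined-attention sums to 1, and use it as weights for a spatial
    average over the image's RGB values."""
    combined = object_attention * colorname_attention
    combined = combined / combined.sum()
    # Weighted spatial pooling: a single RGB value for the garment.
    return (image * combined[..., None]).sum(axis=(0, 1))
```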

2.2.1 Through Object-Attention

Following the work from [46], we use the popular fully convolutional DeepLabv3 [8]. This attention network is pretrained on the task of semantic segmentation over the clothing crops from ModaNet [49] and therefore has an emphasis on the garment surface: it produces an object-attention different for each clothing category (top, coat, ...). This spatial prior is then fine-tuned and used to modulate the importance of each pixel: it identifies the regions which contain the relevant color information given a garment type.

2.2.2 Through Colorname-Attention

The previous object-attention alone fails for multiple-color garments. If the apparel can take a set of distinct and distant RGB values, the mean of this set will be predicted, which leads to unsaturated color predictions. In Figure 1, we need to discard the white background in the shirt, which would otherwise lead to predicting a light pink. Thus we add a colorname-attention, which only selects pixels in the image sufficiently close to the main color.

In the mono color setup, we first need to detect the main discrete color name of the garment. The VGG16 features are followed by a fully connected layer with a softmax activation function [42, 33, 10, 46, 45]: it predicts a distribution over the 72 color names, and is trained by minimizing a categorical cross-entropy loss. If the color name with maximum score has RGB value $RGB_c$, the colorname-attention $CA_p$ for a pixel $p$ of RGB value $RGB_p$ is:

$$CA_p \propto \exp\left(-\frac{T\,\lVert RGB_p - RGB_c\rVert^2}{127.5^2}\right). \tag{1}$$

It sums to 1 over all pixels, and its peakedness depends on $T$, the temperature of the RGB spatial softmax. $T$ is the only parameter of the colorname-attention module; it can be a predefined hyper-parameter, learned, or input-dependent and predicted with a fully connected layer from the VGG features, as sketched below.
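A NumPy sketch of the colorname-attention of Eq. (1), assuming the image is an (H, W, 3) array in [0, 255]; this illustrates the formula rather than reproducing the authors' code.

```python
import numpy as np

def colorname_attention(image, main_color_rgb, temperature=1.0):
    """RGB spatial softmax of Eq. (1): pixels whose RGB value is close
    to the detected main color name's RGB get a high weight.  A larger
    temperature yields a more peaked attention map."""
    diff = image.astype(np.float32) - np.asarray(main_color_rgb, np.float32)
    sq_dist = (diff ** 2).sum(axis=-1)        # (H, W) squared distances
    logits = -temperature * sq_dist / 127.5 ** 2
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    return weights / weights.sum()            # sums to 1 over all pixels
```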

2.3. Training Overview

The LAB colorspace best models perceptual distance in colorization [48, 11]: thus, the chosen regression objective is the Euclidean distance between the predicted and ground-truth colors converted to LAB. Our two losses, color naming and color regression, are backpropagated simultaneously.
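Below is a small illustration of this objective using skimage's RGB-to-LAB conversion; note that inside the actual training graph the conversion would need to be differentiable, which this NumPy sketch does not attempt.

```python
import numpy as np
from skimage import color

def lab_distance(pred_rgb, true_rgb):
    """Euclidean distance between two RGB colors after conversion to
    the LAB colorspace (skimage expects RGB scaled to [0, 1])."""
    to_lab = lambda c: color.rgb2lab(
        np.asarray(c, np.float32).reshape(1, 1, 3) / 255.0)
    return float(np.linalg.norm(to_lab(pred_rgb) - to_lab(true_rgb)))
```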

2.4. Multiple Colors

We can easily generalize our approach to multiple RGB colors. We predict multiple color names with a sigmoid [45], followed by a combination of categorical cross-entropy and binary cross-entropy [34]. Multiple colors share the same object-attention but have different colorname-attentions: therefore we predict different RGB values.

The challenge is then to know how many colors should be predicted. Following the work from [45], we explicitly learn the number of colors in each garment (1, 2 or 3 colors). Note that we apply class weights to handle unbalanced classes; a sketch of the two prediction heads follows below.
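A hedged Keras sketch of what such a two-head output could look like; the feature size and the class-weight values are illustrative assumptions, as the paper does not specify them.

```python
import tensorflow as tf

# Hypothetical heads on top of shared VGG16 features.
features = tf.keras.Input(shape=(512,), name="vgg_features")
color_names = tf.keras.layers.Dense(72, activation="sigmoid",
                                    name="color_names")(features)  # multi-label
color_count = tf.keras.layers.Dense(3, activation="softmax",
                                    name="color_count")(features)  # 1, 2 or 3
model = tf.keras.Model(features, [color_names, color_count])
model.compile(optimizer="adam",
              loss={"color_names": "binary_crossentropy",
                    "color_count": "categorical_crossentropy"})
# Illustrative class weights compensating for the rarity of 2- and
# 3-color garments; in practice they weight the color_count loss.
count_class_weights = {0: 1.0, 1: 3.0, 2: 6.0}
```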

| Method | Color Naming | Object-Attention | Illumination | T | Main Color (≤10 / ≤20 / ≤30 / ≤40) | Multiple Colors (≤10 / ≤20 / ≤30 / ≤40) |
|---|---|---|---|---|---|---|
| Unsupervised K-means Clustering [27] | | | | | 47 / 72 / 83 / 89 | 19 / 31 / 37 / 42 |
| Colorname RGB [45] | X | | | | 51 / 72 / 87 / 92 | 34 / 52 / 64 / 68 |
| Object-Attention | | X | | - | 50 / 78 / 90 / 95 | - / - / - / - |
| Colorname-Attention | X | | | Set to 1 | 48 / 76 / 88 / 93 | 35 / 56 / 65 / 68 |
| Combined-Attention | X | X | | Set to 1 | 59 / 87 / 93 / 95 | 44 / 64 / 69 / 71 |
| Combined-Attention | X | X | | Trainable | 62 / 87 / 93 / 95 | 46 / 65 / 69 / 71 |
| Combined-Attention & Illumination | X | X | X | Trainable | 67 / 89 / **94** / **96** | 48 / 66 / 70 / 72 |
| Combined-Attention & Illumination | X | X | X | Predicted | **73** / **90** / 93 / 95 | **54** / **68** / **71** / **73** |

Table 1. Scores for RGB regression: percentage of predictions closer to the ground-truth RGB than several thresholds (10, 20, 30, 40), according to the deltaE ciede2000 distance [14, 31]. Bold highlights the best score.

Selecting the color names with maximum scores would often predict the same major color several times: for example, a light red and a dark red would converge towards a medium red. Thus, we apply a non-maximum-suppression algorithm to delete predicted RGB values that are too close to each other (a sketch follows below). In Figure 1, the network's confusion between two different kinds of red would otherwise have prevented it from predicting white.
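A possible implementation of this non-maximum suppression over colors, using skimage's ciede2000 distance; the threshold value is an assumption.

```python
import numpy as np
from skimage import color

def color_nms(rgbs, scores, threshold=10.0):
    """Greedy non-maximum suppression over predicted colors: keep
    colors by decreasing score, and drop any color whose ciede2000
    distance to an already kept color falls below the threshold."""
    labs = color.rgb2lab(
        np.asarray(rgbs, np.float32).reshape(-1, 1, 3) / 255.0)
    kept = []
    for i in np.argsort(scores)[::-1]:
        if all(color.deltaE_ciede2000(labs[i], labs[j]).item() > threshold
               for j in kept):
            kept.append(int(i))
    return kept  # indices of the surviving, mutually distinct colors
```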

2.5. Discussion: Analogy with Faster R-CNN

Our two-stage approach is strongly inspired by Faster R-CNN [35]. The anchors of the Region Proposal Network are replaced by color names. Given a selected anchor (color name), the final regression adds a small continuous offset, leading to a precise color quantified as RGB values. In both cases, the first classifier gives a rough estimate, refined by the second regression stage.

3. Experiments

3.1. Setup

Dataset As far as we know, there is no dataset for the task of continuous color regression. Thus, we have collected a new dataset of 30,269 fashion garments: 5,363 coats, 8,166 dresses, 3,991 pants, 6,871 shoes and 5,878 tops. The validation and test datasets are composed of 2,000 images; the remaining images are used for training. Each garment was labeled by a single operator with its exact RGB color values, not as it appears in the image, but as the operator thought it would appear in ideal white light: the annotation process is therefore more complex and time-consuming than classical classification. The color names can be automatically derived by nearest neighbours in the LAB space. In the case of multiple-color items, the operators were asked to tag the colors in decreasing order of importance.
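The nearest-neighbour derivation of color names can be sketched as follows, assuming a palette of 72 reference RGB values; the function name is illustrative.

```python
import numpy as np
from skimage import color

def nearest_color_name(rgb, palette_rgbs, palette_names):
    """Derive a discrete color name from an annotated RGB value by
    nearest neighbour in LAB space over the 72-name palette."""
    lab = color.rgb2lab(np.asarray(rgb, np.float32).reshape(1, 1, 3) / 255.0)
    palette_lab = color.rgb2lab(
        np.asarray(palette_rgbs, np.float32).reshape(-1, 1, 3) / 255.0)
    dists = np.linalg.norm(palette_lab - lab, axis=-1)
    return palette_names[int(dists.argmin())]
```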

Implementation Our code is written in TensorFlow [1] and Keras [12]. We chose the Adam [25] optimizer with a learning rate of 0.0001 and a batch size of 16 images, training for 50 epochs. We applied standard data augmentation methods: random cropping and translation.
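A sketch of this configuration is shown below; the pad and crop sizes are assumptions, as the paper does not report input resolutions.

```python
import tensorflow as tf

def augment(image):
    """Standard augmentations used for training: random cropping and
    translation, implemented here as pad-then-random-crop.  The
    256/224 sizes are illustrative, not reported by the paper."""
    image = tf.image.resize_with_crop_or_pad(image, 256, 256)
    return tf.image.random_crop(image, size=(224, 224, 3))

# Reported optimization settings.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
BATCH_SIZE, EPOCHS = 16, 50
```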

3.2. Results

Table 1 shows our experiments, performed on our internal dataset for the regression task, in both the main color and the multiple colors setups.
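For reference, the evaluation metric of Table 1 can be computed as in this sketch.

```python
import numpy as np
from skimage import color

def pct_within(pred_rgbs, true_rgbs, threshold):
    """Table 1 metric: percentage of predictions whose ciede2000
    distance to the ground-truth color is below the threshold."""
    to_lab = lambda c: color.rgb2lab(
        np.asarray(c, np.float32).reshape(-1, 1, 3) / 255.0)
    dists = color.deltaE_ciede2000(to_lab(pred_rgbs), to_lab(true_rgbs))
    return 100.0 * float((dists <= threshold).mean())
```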

Baselines We compare our model against two existing baselines. First, we train an unsupervised K-means pixel clustering [27] directly in pixel space, after background removal by an external semantic segmentation model trained on ModaNet [49]. Colors are ranked according to the number of pixels in each cluster. The second baseline, Colorname RGB, is the direct extension of the multi-tagger approach from Yazici et al. [45]: it detects the color names (e.g. velvet red, dark purple, etc.) and then produces the associated RGB values (e.g. (134, 71, 71) for velvet red).

Attention First, fine-tuning the semantic segmentation model on the regression task (object-attention) already surpasses previous approaches in the mono color setup. The colorname-attention improves overall results in the multiple colors setup. These two attentions, when combined, are mutually reinforcing and complementary.

Temperature We show that the value of the temperature T is important: a bigger T leads to a sharper distribution and takes into account fewer pixels from the initial image than a lower T. Rather than grid-searching its optimal value, it can be learned for improved results. The optimal T depends on the image: therefore, the best results are obtained with T predicted from the VGG features.

Illumination Finally, the color constancy module improves performance. It may be even more impactful with pretraining on the Color Checker Dataset [18].

4. Conclusion

In this paper, we address the problem of detecting refined multiple continuous color values for fashion garments. We collect a unique dataset of 30,269 fashion garments and empirically show the benefits of our two-stage architecture. Finally, we hope to motivate researchers to investigate the different subjective properties of fashion garments.

Acknowledgements This work is supported by Heuritech, which forecasts fashion trends to produce more sustainably.


References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] N. Baek, S.-M. Park, K.-J. Kim, and S.-B. Park. Vehicle color classification based on the support vector machine method. In International Conference on Intelligent Computing, pages 1133–1139. Springer, 2007.
[3] J. T. Barron. Convolutional color constancy. In Proceedings of the IEEE International Conference on Computer Vision, pages 379–387, 2015.
[4] B. Berlin and P. Kay. Basic Color Terms: Their Universality and Evolution. Univ of California Press, 1991.
[5] S. Bianco, C. Cusano, and R. Schettini. Color constancy using CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 81–89, 2015.
[6] S. Bianco, C. Cusano, and R. Schettini. Single and multiple illuminant estimation using convolutional neural networks. IEEE Transactions on Image Processing, 26(9):4347–4362, 2017.
[7] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel classification with style. In Asian Conference on Computer Vision, pages 321–335. Springer, 2012.
[8] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[9] P. Chen, X. Bai, and W. Liu. Vehicle color recognition on urban road by feature context. IEEE Transactions on Intelligent Transportation Systems, 15(5):2340–2346, 2014.
[10] Z. Cheng, X. Li, and C. C. Loy. Pedestrian color naming via convolutional neural network. In ACCV, 2016.
[11] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 415–423, 2015.
[12] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
[13] C. Corbiere, H. Ben-Younes, A. Rame, and C. Ollion. Leveraging weakly annotated data for fashion image retrieval and label prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 2268–2274, 2017.
[14] M. D. Fairchild. Color Appearance Models. John Wiley & Sons, 2013.
[15] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Mixed pooling neural networks for color constancy. In 2016 IEEE International Conference on Image Processing (ICIP), pages 3997–4001. IEEE, 2016.
[16] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[17] Y. Ge, R. Zhang, L. Wu, X. Wang, X. Tang, and P. Luo. DeepFashion2: A versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. arXiv preprint arXiv:1901.07973, 2019.
[18] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp. Bayesian color constancy revisited. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[19] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In Proceedings of the IEEE International Conference on Computer Vision, pages 3343–3351, 2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[21] C. Hickey and B.-T. Zhang. Hierarchical color learning in convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 384–385, 2020.
[22] Y. Hu, B. Wang, and S. Lin. FC4: Fully convolutional color constancy with confidence-weighted pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4085–4094, 2017.
[23] M. Jia, Y. Zhou, M. Shi, and B. Hariharan. A deep-learning-based fashion attributes detection model. arXiv preprint arXiv:1810.10148, 2018.
[24] K. R. Jyothi and M. Okade. Computational color naming for human-machine interaction. In 2019 IEEE Region 10 Symposium (TENSYMP), pages 391–396, 2019.
[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[27] X. Lewis. Extracting colours from an image using k-means clustering. https://towardsdatascience.com/extracting-colours-from-an-image-using-k-means-clustering-9616348712be, 2018 (accessed October 6, 2020).
[28] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–1104, 2016.
[29] Z. Lou, T. Gevers, N. Hu, M. P. Lucassen, et al. Color constancy by deep learning.
[30] C. Lynch, K. Aryafar, and J. Attenberg. Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 541–548. ACM, 2016.
[31] W. Mokrzycki and M. Tatol. Colour difference ΔE: A survey.
[32] T. R. Partos, S. J. Cropper, and D. Rawlings. You don't see what I see: Individual differences in the perception of meaning from visual stimuli. PloS One, 11(3):e0150615, 2016.
[33] R. F. Rachmadi and I. Purnama. Vehicle color recognition using convolutional neural network. arXiv preprint arXiv:1510.07391, 2015.
[34] A. Rame, E. Garreau, H. Ben-Younes, and C. Ollion. OMNIA Faster R-CNN: Detection in the wild through dataset merging and soft distillation. arXiv preprint arXiv:1812.02611, 2018.
[35] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[36] W. Shi, C. C. Loy, and X. Tang. Deep specialized network for illuminant estimation. In European Conference on Computer Vision, pages 371–387. Springer, 2016.
[37] O. Sidorov. Artificial color constancy via GoogLeNet with angular loss function. arXiv preprint arXiv:1811.08456, 2018.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[39] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7):1512–1523, 2009.
[40] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, pages 4642–4650, 2015.
[41] J. von Kries. Chromatic adaptation. Festschrift der Albrecht-Ludwigs-Universität, 1902.
[42] Y. Wang, J. Liu, J. Wang, Y. Li, and H. Lu. Color names learning using convolutional neural networks. In 2015 IEEE International Conference on Image Processing (ICIP), pages 217–221. IEEE, 2015.
[43] W. Wieclawek and E. Pietka. Car segmentation and colour recognition. In 2014 Proceedings of the 21st International Conference Mixed Design of Integrated Circuits and Systems (MIXDES), pages 426–429. IEEE, 2014.
[44] K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In Proceedings of the IEEE International Conference on Computer Vision, pages 3519–3526, 2013.
[45] V. O. Yazici, J. van de Weijer, and A. Ramisa. Color naming for multi-color fashion items. In World Conference on Information Systems and Technologies, pages 64–73. Springer, 2018.
[46] L. Yu, Y. Cheng, and J. van de Weijer. Weakly supervised domain-specific color naming based on attention. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 3019–3024. IEEE, 2018.
[47] L. Yu, L. Zhang, J. van de Weijer, F. S. Khan, Y. Cheng, and C. A. Parraga. Beyond eleven color names for image understanding. Machine Vision and Applications, 29(2):361–373, 2018.
[48] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[49] S. Zheng, F. Yang, M. H. Kiapour, and R. Piramuthu. ModaNet: A large-scale street fashion dataset with polygon annotations. arXiv preprint arXiv:1807.01394, 2018.