

An Intriguing Influence of Visual Data in Learning a Representation

Aayush Bansal
Carnegie Mellon University

[email protected]

Abstract

We present an intriguing property of visual data that we observe in our attempt to isolate the influence of data for learning a visual representation. We observe that we can get better performance than an existing model by just conditioning the existing representation on a million unlabeled images, without any extra knowledge. As a by-product of this study, we achieve results better than the prior state of the art for surface normal estimation on the NYU-v2 depth dataset, and improved results for semantic segmentation using a self-supervised representation on the PASCAL VOC-2012 dataset.

1. Introduction

The big success of deep learning in computer vision has generally been attributed to the class-labeled image data in ImageNet [38]. A common recipe is to take a model [25, 40, 18] trained for ImageNet classification (images annotated with class labels), fine-tune it for a specific task (be it object detection [13], human pose estimation [6], or an endless list of other tasks [30, 8, 42]), and achieve state-of-the-art performance. The availability of big datasets [38, 29] has been fueling the deep engines that were prone to overfitting when dealing with smaller datasets [28, 10]. While big visual data has been considered a great contributing factor in this revolution in deep learning, it has never been clear what the big source of energy within the data itself is. Is it the visual part of the data, i.e., the images and their diversity, or is it the knowledge about the class labels?

In the recent past, there has been an increased effort toward learning a representation in a self-supervised manner, i.e., from images alone without class labels. A common theme in these works has been to use the images (without class labels) from ImageNet (or other related data sources) with some auxiliary task such as colorization [43, 26], context prediction [7], image in-painting [35], or image rotation [12]. While significant improvements over random Gaussian initialization have been shown by these approaches, the success is often attributed to the task, and it is still not clear how much of it is because of the images from ImageNet and how much is because of the task itself.

[Figure 1 plot: fraction of effort (y-axis, 0 to 1) spent on the model vs. the data, for the general trend and for this work.]

Figure 1. General Trend vs. This Work: Recent works in the computer vision literature primarily focus on designing better CNN architectures and optimization to improve the performance of tasks on various benchmarks. In this work, our focus is to use freely-available unlabeled images in the simplest possible setting to learn a better representation. The only effort spent on the model side is to look up the hyper-parameters from prior work and use them to train a new model with a million unlabelled images.

In this work, we want to isolate the influence of just the visual data (images alone) in learning a visual representation. There are multiple factors involved, ranging from hyper-parameters to the nature of the task, that could influence the representation. We propose a simple approach that attempts to isolate the influence of images alone on performance, free of these other factors. Figure 1 contrasts the fraction of effort required in our approach for data and model with contemporary approaches.

Another path that is less explored is to improve a visual representation trained from small datasets with millions of unlabeled images. Conventional wisdom might argue against such an approach: without any extra supervision or an auxiliary source of supervision, how could we get better performance than what we already have? In sync with this thought are the fundamental laws of conservation in nature and physics, be it conservation of energy in Newtonian mechanics or conservation of mass and energy in Einsteinian physics.



The basic premise of these theories is that we cannot generate more from less, and that there is some price paid for every gain. We question whether such conservation principles also hold in the computer vision world, or whether we can defy them and learn better performing models from a worse representation without any additional supervision. In this work, we observe that it is indeed possible to improve a visual representation by just using unlabeled images with the existing representation, without any additional knowledge or cost.

Contributions: We present an approach to isolate the influence of visual data in learning a representation. Our findings suggest that we can get better performing models by conditioning an existing representation on diverse unlabeled images. As a by-product of this study, we achieve performance better than the prior state of the art for surface normal estimation, and improved performance for semantic segmentation in the context of self-supervised representation learning.

2. Related Work

The new datasets [38, 29, 45] in computer vision have fueled the onset of a fourth industrial revolution powered by deep learning [27]. The visual data is analogous to the coal powering the deep steam engine. We have made significant progress in designing better architectures [25, 40, 18, 19] that can churn this visual data and lead to better performance. In our endeavor to improve the efficiency of deep models, we have largely neglected the source visual data itself and what makes it so powerful for learning a visual representation. A few works have explored the influence of data in the context of object recognition [2] or transfer learning [20]. However, these efforts have primarily sought to understand the influence of labeled ImageNet data. As a community, we have not been able to understand the full potential of visual data (images alone) in learning. In this work, our goal is to explore the influence of this visual data on learning a representation.

Self-Supervised Representation Learning: The success of supervised approaches has also attracted a sizable community exploring self-supervised ways to learn a visual representation. These approaches specifically use images [7, 35, 12, 33, 43] or videos [23, 14, 32, 34] to learn a base representation. Doersch et al. [7] used patches from ImageNet [38] for context prediction. Pathak et al. [35] proposed to learn a representation via image inpainting, [43, 26] used colorization, [12] used image rotation, and [33] used a jigsaw puzzle solver for the optimization. A common theme in all these approaches is the proposal of an auxiliary task that can use the image data without any additional labeling cost, and its use for training a model to learn a representation. While most of these approaches use images from ImageNet, it is still the task that overshadows the underlying images. We have not been able to isolate the influence of the visual data on learning the representation. In this work, we propose a careful study that aims to isolate this visual data amongst the other factors involved in learning the representation.

Weakly Supervised & Semi-Supervised Learning: The power of data has also been explored in the context of weakly supervised learning [24, 41, 22], where weak labels (such as user tags) are provided, or in a semi-supervised setting [44, 37, 31] with a little labeled data and largely unlabeled data. Our work is partially inspired by these weakly-supervised and semi-supervised approaches, as we simulate labels on a diverse set of unlabeled images to learn a better representation. Different from weakly supervised approaches, we do not use any additional source of knowledge with the images. Further, our work shares similarity with the recently proposed data distillation approach of Radosavovic et al. [37]. However, our approach is exceedingly simple: we do not assume a good teacher model, and the initial model used to simulate labels is trained on a small dataset. Despite this, we see a similar order of performance improvement for semantic segmentation and surface normal estimation.

Unreasonable Effectiveness of Data: While we live in a world of big data, the effectiveness [15] of visual data is still untapped. A great demonstration of the potential in visual data was made by Hays and Efros [17], who used millions of images for scene completion. Bansal et al. [5] used a simple pixel-wise nearest neighbor to synthesize multi-modal, high-frequency outputs from "incomplete" priors. Despite the availability of even more data in the last decade (since [17]), we have not been able to demonstrate its effectiveness in a simple manner for different tasks. We hope that our work inspires the community to explore these aspects of visual data and helps it achieve the status of a first-class citizen of computer vision.

3. Approach

A fundamental goal of this work is to isolate the influence of visual data, {X : X ∈ R^3}, for learning a mapping f : X → Y, where {Y : Y ∈ R^N} is the intended target.

Data: We poke X by varying its source and distribution. X_1 and X_2 are two data sources, and each comes from a different distribution. The samples in X_1 and X_2 are denoted x^1_s and x^2_t respectively. The number of samples in X_1 and X_2 is equal. The samples in Y are denoted y_s. There exists a paired-data correspondence between X_1 and Y, i.e., we have {(x^1_s, y_s)}. However, we have only {x^2_t} and no corresponding data in Y. Finally, X_1 comes from a constrained setting, whereas X_2 has great variety.

Learning a Mapping: We use {(x^1_s, y_s)} to learn a mapping f for this data (Figure 2-a). This mapping f is an instance of paired image-to-image translation.


[Figure 2 panels: (a) Learning on Paired Data; (b) No Labels Available; (c) Simulating Labels; (d) Learning on Simulated Data.]

Figure 2. Conditioning a representation on a different data distribution: The figure qualitatively shows how we condition a learnt representation on a different data distribution X_2 that has no labeled data. As shown in (a), we learn a mapping function f from the paired data {(x^1_s, y_s)}. Since there exists no labeled data for X_2 (as shown in (b)), we generate labels via f (as shown in (c)). Finally (d), we use these image and simulated-label data pairs to learn another visual representation. As shown in Eq. 2, g is trying to mimic f via samples in X_2.

We learn f by minimizing the reconstruction error on this paired data:

\min_f \sum_s \left\| y_s - f(x^1_s) \right\|^2 \quad (1)

Learning from Simulated Labels: We do not have any paired data for X_2 (Figure 2-b). We therefore simulate the labels by applying f, learned in Equation 1, to the samples {x^2_t} (Figure 2-c). This gives us paired data between X_2 and Y, {(x^2_t, f(x^2_t))}. We intend to learn a new mapping g (Figure 2-d) over the data pairs {(x^2_t, f(x^2_t))}. Since we now have labels for X_2, we can learn this mapping by minimizing the reconstruction error on the simulated data pairs:

\min_g \sum_t \left\| f(x^2_t) - g(x^2_t) \right\|^2 \quad (2)

More precisely, we are forcing g to learn f via the samples in X_2, i.e., {x^2_t}. Importantly, the number of samples in X_2 is sufficient to learn the parameters of f.

Point of Contention: We now have two mapping functions, f and g, where g is trying to mimic f by learning over the samples of X_2. If there were no role of X in learning this mapping, both f and g should behave similarly when fine-tuned for a particular task on different data sources. In fact, g should underperform, because it is an approximation of f. We construct a test scenario to see if this holds: we use f and g as initializations for a task whose data distribution is X_1 and which learns a mapping to Y. Against conventional wisdom, our findings suggest that g can perform significantly better than f. We also evaluate our approach in the light of self-supervised representation learning. We observe that there is something peculiar about the "visual data" that could impact learning a representation.
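To make the two-stage recipe concrete, here is a minimal training sketch in PyTorch. The model classes, optimizer settings, and data loaders are illustrative placeholders (the actual experiments use the PixelNet implementation described in Section 3.1); only the structure of Eqs. (1) and (2) follows the text.

```python
import torch
import torch.nn as nn

def train_f_on_paired_data(f_model, loader_x1_y, lr=1e-3, epochs=1):
    """Eq. (1): fit f on the labeled pairs {(x1_s, y_s)} from X1."""
    opt = torch.optim.SGD(f_model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.MSELoss()
    f_model.train()
    for _ in range(epochs):
        for x1, y in loader_x1_y:
            opt.zero_grad()
            loss = loss_fn(f_model(x1), y)      # ||y_s - f(x1_s)||^2
            loss.backward()
            opt.step()

def train_g_on_simulated_labels(f_model, g_model, loader_x2, lr=1e-3, epochs=1):
    """Eq. (2): train g from scratch so that it mimics f on samples from X2.
    The only "supervision" is f's own prediction on the unlabeled images."""
    opt = torch.optim.SGD(g_model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.MSELoss()
    f_model.eval()
    g_model.train()
    for _ in range(epochs):
        for x2 in loader_x2:
            with torch.no_grad():
                y_sim = f_model(x2)             # simulated label f(x2_t)
            opt.zero_grad()
            loss = loss_fn(g_model(x2), y_sim)  # ||f(x2_t) - g(x2_t)||^2
            loss.backward()
            opt.step()
```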

3.1. Implementation Details

We now explain the different components used in our experiments. We consider the task of surface normal estimation [11, 8, 4] for learning a representation, as it naturally provides the different data distributions described above. We use the NYU-v2 depth dataset [39] and ImageNet [38] for our experiments. The NYU-v2 depth dataset [39] consists of 220,000 video frames collected using a Kinect in indoor scenes. Each frame has a depth map from which a surface normal map can be computed. In our setting, the NYU-v2 depth dataset acts as the source for X_1 and Y. We use a random subset of ImageNet [38] for X_2. This subset of ImageNet contains the same number of images as X_1. The ImageNet dataset provides a variety of images, and does not have any corresponding depth/surface-normal labels. The two data sources are quite complementary: one focuses primarily on indoor scenes collected using the Kinect, while the other is primarily a collection of web images with a large proportion of outdoor scenes. Our goal in this work is to isolate the impact of visual data and its diversity, and we have therefore fixed the number of images in the two data sources to avoid any bias in our experiments.

Default Model: We use the model from Bansal et al. [4, 3] for surface normal estimation, which we briefly describe here. This network architecture, also known as PixelNet [3], consists of a VGG-16 style architecture [40] with a multi-layer perceptron (MLP) on top of it for pixel-level prediction. There are 13 convolutional layers and three fully connected (fc) layers in the VGG-16 architecture. The first two fcs are transformed to convolutional filters following [30]. We denote these transformed fc layers of VGG-16 as conv-6 and conv-7. All the layers are denoted as {1_1, 1_2, 2_1, 2_2, 3_1, 3_2, 3_3, 4_1, 4_2, 4_3, 5_1, 5_2, 5_3, 6, 7}.


We use hypercolumn features from conv-{1_2, 2_2, 3_3, 4_3, 5_3, 7}. An MLP is used over the hypercolumn features, with 3 fully connected layers of size 4,096 followed by ReLU [25] activations, where the last layer outputs predictions for the 3 outputs (n_x, n_y, n_z) with a Euclidean loss for regression. Finally, we use batch normalization [21] with each convolutional layer when training from scratch, for faster convergence. More details about the architecture/model can be obtained from [3].

Learning mapping f: We use the above model, initialize it with a random Gaussian distribution, and train it on the NYU-v2 depth dataset. The initial learning rate is set to ε = 0.001, and it drops by a factor of 10 at a step size of 50,000. The model is trained for 60,000 iterations. We use all the parameters from [3] and keep them fixed for all our experiments to avoid any bias due to hyper-parameter tuning.

Learning mapping g: First, we need to create labels for X_2 in order to learn g. We use the f trained above with the randomly subsampled images from ImageNet to create the training data pairs. We use this data to learn the mapping function g, which is trained from scratch and follows the same procedure as f.

Using a million images: Finally, we use the f trained above with a million images from ImageNet to create the training data pairs. We use this data to learn a mapping function h, which is trained from scratch and follows the same procedure as f (except that the step size is now 200,000 and we train it for 430,000 iterations)^1.
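For concreteness, a rough PyTorch sketch of such a hypercolumn head is shown below. This is not the released PixelNet code: it uses torchvision's VGG-16-BN as the backbone, omits the conv-7 (fc-as-conv) feature and the sparse pixel sampling of [3], and defaults to a narrower MLP so that a dense forward pass fits in memory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class HypercolumnNormalNet(nn.Module):
    """PixelNet-style sketch: hypercolumn features from a VGG-16 (with batch
    norm) backbone feed a per-pixel MLP, implemented here as 1x1 convolutions,
    that regresses the 3 surface-normal channels (nx, ny, nz)."""

    def __init__(self, hidden=1024):  # the paper uses 4,096-wide fc layers
        super().__init__()
        self.backbone = torchvision.models.vgg16_bn(weights=None).features
        # indices of the ReLUs after conv-1_2, 2_2, 3_3, 4_3, 5_3 in vgg16_bn
        self.taps = {5: 64, 12: 128, 22: 256, 32: 512, 42: 512}
        in_ch = sum(self.taps.values())
        self.mlp = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 1),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, out = [], x
        for i, layer in enumerate(self.backbone):
            out = layer(out)
            if i in self.taps:
                # upsample each tapped feature map to image resolution and
                # stack them into a per-pixel "hypercolumn" descriptor
                feats.append(F.interpolate(out, size=(h, w), mode='bilinear',
                                           align_corners=False))
        hypercolumn = torch.cat(feats, dim=1)
        return self.mlp(hypercolumn)  # (B, 3, H, W) surface normals

# Example: predict normals for a batch of two 224x224 images.
if __name__ == "__main__":
    net = HypercolumnNormalNet()
    normals = net(torch.randn(2, 3, 224, 224))
    print(normals.shape)  # torch.Size([2, 3, 224, 224])
```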

We have tried to make sure that the only thing that changes in these experiments is the data source (X_1 vs. X_2), and everything else is kept fixed to avoid any external influence. We will now evaluate f, g, and h on two tasks: (1) Surface normal estimation - We take our learnt mappings f, g, and h, and fine-tune them on the NYU-v2 depth dataset [39] for surface normal estimation. The goal of this experiment is to study the conservation principles for visual data. If we do not get any performance improvement, then everything is in sync with the world that we live in. On the other hand, we need to consider the influence of the visual data if we get better performance with this simple approach just by using an additional diverse set of images (without any labels). An important thing to consider here is that the data distribution for this task is closer to X_1 than it is to X_2. By conventional wisdom, the model trained with X_1 should do better than the one trained with X_2. Also note that all the hyper-parameters and settings are kept the same for all analyses. (2) Self-supervised representation learning - We use our learned mapping h for the illustrative task of semantic segmentation on the PASCAL VOC-2012 dataset [9], and compare it with prior work in this direction. We also improve the results further by going back to the unlabeled images and training a new representation, this time for semantic segmentation.

1. We arbitrarily stopped the training of this model after 2 epochs. Better models may be learned by running it for longer.

4. Experiments

We now quantitatively and qualitatively evaluate the hypothesis described in Section 3.

4.1. Surface Normal Estimation

We fine-tune f, g, and h on the NYU-v2 depth dataset [39] (described earlier in Section 3.1) for surface normal estimation. The initial learning rate is set to ε = 0.001, and it drops by a factor of 10 at a step size of 50,000. Each model is fine-tuned for 60,000 iterations. We use the 654 images from the test set of the NYU-v2 depth dataset [39] for evaluation. Following [4], we compute six statistics over the angular error between the predicted normals and the depth-based normals to evaluate performance: Mean, Median, RMSE, 11.25°, 22.5°, and 30°. The first three criteria capture the mean, median, and RMSE of the angular error, where lower is better. The last three criteria capture the percentage of pixels within a given angular error, where higher is better.
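For reference, these six statistics can be computed from the per-pixel angular error as in the following NumPy sketch. This is our own restatement of the standard protocol of [4]; the exact handling of pixels with invalid depth may differ in the original evaluation code.

```python
import numpy as np

def normal_error_stats(pred, gt, valid_mask):
    """Six statistics over the angular error (degrees) between predicted and
    depth-based normals. `pred` and `gt` are (H, W, 3) normal maps, and
    `valid_mask` is a (H, W) boolean mask of pixels with valid depth."""
    p = pred[valid_mask].astype(np.float64)
    g = gt[valid_mask].astype(np.float64)
    # normalize to unit vectors, then angular error = arccos of the dot product
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    cos = np.clip(np.sum(p * g, axis=1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    return {
        'mean': err.mean(),                        # lower is better
        'median': np.median(err),                  # lower is better
        'rmse': np.sqrt((err ** 2).mean()),        # lower is better
        '11.25': 100.0 * (err < 11.25).mean(),     # higher is better
        '22.5': 100.0 * (err < 22.5).mean(),       # higher is better
        '30': 100.0 * (err < 30.0).mean(),         # higher is better
    }
```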

Table 1 compares the performance of f, g, and h when fine-tuned on NYU-v2 for surface normal estimation; they are denoted f+FT, g+FT, and h+FT respectively. We observe that f+FT improves over f. More importantly, g+FT performs better than f+FT, and is comparable to the model fine-tuned from ImageNet with class labels. Further, we observe that f+FT saturates and does not improve further, whereas g+FT, when allowed to run for longer (120,000 iterations), keeps improving and can even surpass an ImageNet (with class labels) pre-trained model. Moreover, with an increase in the number of unlabeled images (h+FT), we achieve even better performance. A concurrent work [36] gets similar performance through careful multi-task optimization with ImageNet pre-trained models, whereas we achieve it simply by using a million unlabeled images. These results suggest that we can actually get better performance with a small labeled dataset and millions of unlabeled images. More importantly, this experiment suggests that there is something peculiar about the visual data that enables us to obtain better performing models from lower performing ones, just by using extra unlabeled images.

Figure 3 qualitatively compares the performance of the different models. Our approach is able to correct the normals where the previous model failed, and can also produce better outputs than the prior art.

Do we improve globally or locally? One may suspect that a model fine-tuned from ImageNet (with class labels) captures more local information, since its pre-training uses class labels. We analyzed whether g+FT is also able to capture these local aspects of the scene, or whether it captures mostly global information. Table 2 contrasts the performance of the two approaches on indoor furniture categories such as chair, sofa, and bed.


Figure 3. Influence of Unlabeled Data on Surface Normal Estimation: (a) 2D image; (b) Kinect; (c) ImageNet labels; (d) Scratch; (e) g+FT. For a given single 2D image (shown in (a)), we contrast the performance of various models. Shown in (c) are the results from prior work [4, 3] using ImageNet labels; (d) shows the model trained from scratch, f; and finally, (e) shows our approach, initialized using the representation conditioned on ImageNet (g). The influence of unlabeled data can be gauged by the improvement from (d) to (e). By just conditioning the learnt representation on diverse data, we get better performance without any additional cost. For reference, we also show the normals from the Kinect in (b).

We observe that despite being trained on one-sixth of the training data and without any explicit class labels, the performance of g+FT is competitive with (and sometimes even slightly better than) the model fine-tuned from ImageNet (with class labels). The performance on local objects improves further when training uses a million images (h+FT). This suggests that the proposed approach can capture both local and global information quite well.

4.2. Self-Supervised Representation Learning

We now evaluate h for semantic segmentation in the light of self-supervised visual representation learning. We fine-tune h using the training images from PASCAL VOC-2012 [9] for semantic segmentation, together with the additional labels collected on 8,498 images by Hariharan et al. [16]. We evaluate performance on the test set, which requires submission to the PASCAL web server [1]. We report results using the standard metric of region intersection over union (IoU) averaged over classes (higher is better).


Approach                    Mean   Median  RMSE   11.25°  22.5°  30°
ImageNet Labels [4, 3]      19.8   12.0    28.2   47.9    70.0   77.8
Scratch (f)                 21.2   13.4    29.6   44.2    66.6   75.1
f+FT                        20.4   12.6    28.7   46.3    68.2   76.4
g+FT                        19.8   12.0    28.0   47.7    69.4   77.5
h+FT                        18.9   11.1    27.2   50.4    71.3   78.9
g+FT (until convergence)    19.4   11.5    27.8   49.2    70.4   78.1
h+FT (until convergence)    18.7   10.8    27.2   51.3    71.9   79.3

Table 1. Influence of Unlabeled Visual Data on Surface Normal Estimation: We study the influence of data in this experiment. The top row shows the performance of surface normal estimation when a model pre-trained using ImageNet and class labels is used. The second row shows the performance when trained from scratch (initialized from a random Gaussian distribution); this is the f model in our setting. The next rows show f, g, and h fine-tuned on the NYU-v2 depth dataset for surface normal estimation. We observe that for the same compute, g+FT improves the performance over f+FT. Further, we observe that f+FT saturates, while g+FT keeps improving and reaches performance even better than the prior work (first row) using an ImageNet (with class labels) pre-trained model. We also demonstrate how performance can be further improved by using more unlabeled images: h+FT is trained using a million images, in contrast to g+FT, which uses 220,000 images.

                            Mean   Median  RMSE   11.25°  22.5°  30°
chair
ImageNet Labels [4, 3]      31.7   24.0    40.2   21.4    47.3   58.9
g+FT (until convergence)    32.4   25.2    40.5   19.1    45.2   57.3
h+FT (until convergence)    31.2   23.6    39.6   21.0    47.9   59.8
sofa
ImageNet Labels [4, 3]      20.6   15.7    26.7   35.5    66.8   78.2
g+FT (until convergence)    21.4   16.1    27.6   34.9    64.4   76.1
h+FT (until convergence)    20.0   15.2    26.1   37.5    67.5   79.4
bed
ImageNet Labels [4, 3]      19.3   13.1    26.6   44.0    70.2   80.0
g+FT (until convergence)    19.2   12.9    26.4   44.6    70.3   79.7
h+FT (until convergence)    18.4   12.3    25.5   46.5    72.7   81.7

Table 2. Performance for local objects: We contrast the performance of our approach with the model fine-tuned using ImageNet (with class labels) on furniture categories, i.e., chair, sofa, and bed. Without any explicit class information, g is still competitive with (and sometimes even slightly better than) the prior art trained using class labels. Further, when using a million images (h+FT), our approach exceeds the performance of the prior art.

We follow [3] for this experiment. The initial learning rate is set to ε = 0.001, and it drops by a factor of 10 at a step size of 100,000. The model is fine-tuned for 160,000 iterations. Table 3 contrasts the performance of our approach with other approaches. We observe a slight performance improvement over the prior work [3] that used normals for initialization, and a 7% improvement over the model trained from scratch. Finally, we follow an approach similar to the one used for surface normal estimation: we run the trained model on a million unlabeled images, train a new model from scratch for segmentation^2, and fine-tune it on the PASCAL dataset (using the same hyper-parameters as earlier). We observe a further 2.7% boost in performance, thereby closing the gap between an ImageNet pre-trained model and a self-supervised model to 3.6%. We hope that the use of more unlabeled images (perhaps ten or a hundred million) can drastically improve the performance further.
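A minimal PyTorch sketch of this second round of self-training for segmentation is given below; `teacher`, `student`, and the loaders are illustrative placeholders, and the schedule numbers come from the footnote rather than from any released code.

```python
import torch
import torch.nn as nn

def make_segmentation_pseudo_labels(teacher, unlabeled_loader):
    """Run the PASCAL-fine-tuned model on unlabeled images and keep the
    per-pixel argmax class as the simulated label. In practice the pairs
    would be streamed or written to disk rather than kept in memory."""
    teacher.eval()
    pairs = []
    with torch.no_grad():
        for images in unlabeled_loader:
            logits = teacher(images)          # (B, 21, H, W) class scores
            labels = logits.argmax(dim=1)     # (B, H, W) pseudo labels
            pairs.append((images.cpu(), labels.cpu()))
    return pairs

def train_student_from_scratch(student, pseudo_pairs, iters=300000, lr=1e-3):
    """Train a new segmentation model from scratch on the simulated pairs,
    dropping the learning rate by 10 at 250,000 iterations (footnote 2)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=250000, gamma=0.1)
    loss_fn = nn.CrossEntropyLoss()
    it = 0
    while it < iters:
        for images, labels in pseudo_pairs:
            opt.zero_grad()
            loss = loss_fn(student(images), labels)
            loss.backward()
            opt.step()
            sched.step()
            it += 1
            if it >= iters:
                break
```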

4.3. Why is it intriguing?

There is prior work in the computer vision literature that has used images (without class labels) from ImageNet for self-supervised representation learning via auxiliary tasks and has shown better performance. What makes the current observation intriguing, however, is that we are not using an explicit auxiliary task: we are just learning the parameters of another model by conditioning it on a random subset of ImageNet, and yet we are able to show a significant improvement over the baseline model. We posit that it is primarily the diverse data used in the experiment that helps in learning a better representation. Without any extra supervision or any other knowledge, we see better performance simply by conditioning the model on ImageNet data. We believe there is something intriguing about the visual data (images alone) in learning a representation that can get us to better performing models. This in some sense defies conventional wisdom, because it is not clear what cost we had to pay to get this better performance.

5. Discussions & Future Work

The current experiments suggest that using a million unlabelled images from ImageNet can help us get better performing models. Our observations are currently limited to surface normal estimation and self-supervised representation learning for semantic segmentation. Our choice of task was primarily motivated by the two different data distributions available for it. However, we hope that our work inspires the community to conduct experiments on more tasks, especially pixel-level tasks where it is hard to collect ground truth data. There is no scarcity of images available on the web, and it seems we can improve performance without any additional labeling expense. Finally, we hope that our experiments on obtaining better performing models from a low performing model, just by adding unlabeled diverse images, inspire a larger community working on unsupervised visual representation learning aimed at learning without any labeled data.

2. We used a batch size of 5. The initial learning rate is set to ε = 0.001, and it drops by a factor of 10 at a step size of 250,000. The model is trained for 300,000 iterations. More iterations may further improve performance. Finally, there may be a better choice of hyper-parameters that can give a larger boost in performance; we have not explored that space.


VOC 2012 test      aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  mbike  person  plant  sheep  sofa  train  tv    bg    mAP
Scratch [3]        62.3  26.8  41.4  34.9  44.8    72.2  59.5  56.0  16.2   49.9  45.0   49.7  53.3   63.6   65.4    26.5   46.9   37.6  57.0   40.4  85.2  49.3
Geometry [3]       71.8  29.7  51.8  42.1  47.8    77.9  65.9  59.7  19.7   50.8  45.9   55.0  59.1   68.2   69.3    32.5   54.3   42.1  60.8   43.8  87.6  54.1
Our Approach (h)   74.4  34.5  60.5  47.3  57.1    74.3  73.1  61.7  22.4   51.4  36.4   52.0  60.9   68.5   69.1    37.6   58.0   34.3  64.3   50.2  90.0  56.1
+ Final            82.2  35.1  62.0  47.4  62.1    76.6  74.1  62.7  23.9   49.9  47.0   55.5  58.0   74.9   73.9    40.1   56.4   43.6  65.4   52.8  90.9  58.8
ImageNet [3]       79.0  33.5  69.4  51.7  66.8    79.3  75.8  72.4  25.1   57.8  52.0   65.8  68.2   71.2   74.0    44.1   63.7   43.4  69.3   56.4  91.1  62.4

Table 3. Evaluation on VOC-2012: We compare the performance of the model fine-tuned from h with the model trained from scratch (random Gaussian initialization), and observe a significant 7% improvement. We also compare with the prior work [3] that used models trained on normals for NYU-v2 as an initialization, and observe a 2% improvement just by changing the underlying data used to learn the representation. We further improve performance by 2.7% by running the previous model on unlabeled images and training a model from scratch specifically for segmentation. Finally, we observe that our approach has closed the gap between the ImageNet (with class labels) pre-trained model and the self-supervised model to 3.6%.

References

[1] PASCAL VOC server. https://host.robots.ox.ac.uk:8080//.
[2] P. Agrawal, R. Girshick, and J. Malik. Analyzing the performance of multilayer neural networks for object recognition. In ECCV, 2014.
[3] A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017.
[4] A. Bansal, B. Russell, and A. Gupta. Marr Revisited: 2D-3D model alignment via surface normal prediction. In CVPR, 2016.
[5] A. Bansal, Y. Sheikh, and D. Ramanan. PixelNN: Example-based image synthesis. In ICLR, 2018.
[6] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[7] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[8] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
[10] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.
[11] D. F. Fouhey, A. Gupta, and M. Hebert. Data-driven 3D primitives for single image understanding. In ICCV, 2013.
[12] S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting image rotations. CoRR, abs/1803.07728, 2018.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[14] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised learning of spatiotemporally coherent metrics. In ICCV, 2015.
[15] A. Halevy, P. Norvig, and F. Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8-12, 2009.
[16] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[17] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics, 2007.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[19] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[20] M. Huh, P. Agrawal, and A. A. Efros. What makes ImageNet good for transfer learning? CoRR, abs/1608.08614, 2016.
[21] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[22] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann. Deep classifiers from image tags in the wild. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions. ACM, 2015.
[23] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015.
[24] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[26] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.
[27] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436-444, 2015.
[28] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[29] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014.
[30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional models for semantic segmentation. In CVPR, 2015.
[31] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. In CVPR, 2015.
[32] I. Misra, C. L. Zitnick, and M. Hebert. Shuffle and Learn: Unsupervised learning using temporal order verification. In ECCV, 2016.
[33] M. Noroozi and P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. CoRR, abs/1603.09246, 2016.
[34] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.
[35] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[36] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia. GeoNet: Geometric neural network for joint depth and surface normal estimation. In CVPR, 2018.
[37] I. Radosavovic, P. Dollar, R. B. Girshick, G. Gkioxari, and K. He. Data distillation: Towards omni-supervised learning. CoRR, abs/1712.04440, 2017.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[39] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[41] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[42] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
[43] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
[44] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML, 2016.
[45] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.