Interpreting Deep Visual Representations via Network Dissection
Bolei Zhou*, David Bau*, Aude Oliva, and Antonio Torralba
Abstract—The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs are often criticized as black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human-interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting network interpretability such as the number of training iterations, regularizations, different initializations, and the network depth and width. Finally we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure.
Index Terms—Convolutional Neural Networks, Network Interpretability, Visual Recognition, Interpretable Machine Learning.
1 INTRODUCTION
OBSERVATIONS of hidden units in large deep neural networks have revealed that human-interpretable concepts sometimes emerge as individual latent variables within those networks. For example, object detector units emerge within networks trained to recognize places [1]; part detectors emerge in object classifiers [2]; and object detectors emerge in generative video networks [3]. This internal structure has appeared in situations where the networks are not constrained to decompose problems in any interpretable way.
The emergence of interpretable structure suggests that deep networks may be learning disentangled representations spontaneously. While it is commonly understood that a network can learn an efficient encoding that makes economical use of hidden variables to distinguish the input, the appearance of a disentangled representation is not well understood. A disentangled representation aligns its variables with a meaningful factorization of the underlying problem structure, and encouraging disentangled representations is a significant area of research [4]. If the internal representation of a deep network is partly disentangled, one possible path for understanding its mechanisms is to detect disentangled structure, and simply read out the human-interpretable factors.
We address the following three key issues about deep visual representations in this work:
• What is a disentangled representation of neural networks, and how can its factors be quantified and detected?
• Do interpretable hidden units reflect a special alignment of feature space, or are interpretations a chimera?
• What differences in network architectures, data sources, and training conditions lead to internal representations with greater or lesser entanglement?
• B. Zhou and D. Bau contributed equally to this work.
• B. Zhou, D. Bau, A. Oliva, and A. Torralba are with CSAIL, MIT, MA, 02139.
To examine these issues, we propose a general analytic framework, Network Dissection, for interpreting deep visual representations and quantifying their interpretability. Using Broden, a broadly and densely labeled dataset, our framework identifies hidden units' semantics for any given CNN, then aligns them with human-interpretable concepts.
Building upon the preliminary result published in [5], we begin with a detailed description of the methodology of Network Dissection. We use the method to interpret a variety of deep visual representations trained with different network architectures (AlexNet, VGG, GoogLeNet, ResNet, DenseNet) and supervisions (supervised training on ImageNet for object recognition and on Places for scene recognition, along with various self-taught supervision tasks). We show that interpretability is an axis-aligned property of a representation that can be destroyed by rotation without affecting discriminative power. We further examine how interpretability is affected by different training datasets, training regularizations such as dropout [6] and batch normalization [7], and fine-tuning between different data sources. Our experiments reveal that units emerge as semantic detectors in the intermediate layers of most deep visual representations, while the degree of interpretability can vary widely across changes in architecture and training. We conclude that representations learned by deep networks are more interpretable than previously thought, and that measurements of interpretability provide insights about the structure of deep visual representations that are not revealed by their classification power alone.1
1.1 Related Work
Visualizing deep visual representations. Though CNN models are notoriously known as black boxes, a growing number of
1. Code, data, and more dissection results are available at the project page http://netdissect.csail.mit.edu/.
techniques have been developed to visualize the internal representations of convolutional neural networks. The behavior of a CNN can be visualized by sampling image patches that maximize activation of hidden units [1], [8], [9], or by using variants of backpropagation to identify or generate salient image features [8], [10], [11]. Back-propagation together with a natural image prior can be used to invert a CNN layer activation [12], and an image generation network can be trained to invert the deep features by synthesizing the input images [13]. [14] further synthesizes the prototypical images for individual units by learning a feature code for the image generation network from [13]. These visualizations reveal the image patterns that have been learned in a deep visual representation and provide a qualitative guide to the interpretation and interpretability of units. In [1], a quantitative measure of interpretability was introduced: human evaluation of visualizations to determine which individual units behave as object detectors in a network trained to classify scenes. However, human evaluation is not scalable to increasingly large networks such as ResNet [15], with more than 100 layers. Therefore the aim of the present work is to develop a scalable method to go from qualitative visualization to quantitative interpretation.
Analyzing the properties of deep visual representations. Various intrinsic properties of deep visual representations have been explored. Much research has focused on studying the power of CNN layer activations to be used as generic visual features for classification [16], [17]. The transferability of activations for a variety of layers has been analyzed, and it has been found that higher layer units are more specialized to the target task [18]. Susceptibility to adversarial input reveals that discriminative CNN models are fooled by particular image patterns [19], [20]. Analysis of the correlation between different randomly initialized networks reveals that many units converge to the same set of representations after training [21]. The question of how representations generalize has been investigated by showing that a CNN can easily fit a random labeling of training data even under explicit regularization [22]. Our work focuses on another less explored property of deep visual representations: interpretability.
Unsupervised learning of deep visual representations. Recent work on unsupervised or self-supervised learning exploits the correspondence structure that comes for free from unlabeled images to train networks from scratch [23], [24], [25], [26], [27]. For example, CNNs have been trained by predicting image context [23], by colorizing grayscale images [28], [29], by solving image puzzles [24], and by associating images with ambient sounds [30]. The resulting deep visual representations learned from different unsupervised learning tasks are compared by evaluating them as generic visual features on classification datasets such as Pascal VOC. Our work provides an alternative approach to compare deep visual representations in terms of their interpretability, beyond just their discriminative power.
2 FRAMEWORK OF NETWORK DISSECTION
The notion of a disentangled representation rests on the human perception of what it means for a concept to be mixed up. Therefore we define the interpretability of a deep visual representation in terms of its degree of alignment with a set of human-interpretable concepts.
Our quantitative measurement of interpretability for deep visual representations proceeds in three steps:
1) Identify a broad set of human-labeled visual concepts.
2) Gather the response of the hidden variables to known concepts.
3) Quantify alignment of hidden variable–concept pairs.

TABLE 1
Statistics of each label type included in the dataset.

Category   Classes   Sources                         Avg sample
scene      468       ADE [32]                        38
object     584       ADE [32], Pascal-Context [34]   491
part       234       ADE [32], Pascal-Part [35]      854
material   32        OpenSurfaces [33]               1,703
texture    47        DTD [36]                        140
color      11        Generated                       59,250
This three-step process of network dissection is reminiscent of the procedures used by neuroscientists to understand similar representation questions in biological neurons [31]. Since our purpose is to measure the level to which a representation is disentangled, we focus on quantifying the correspondence between a single latent variable and a visual concept.
In a fully interpretable local coding such as a one-hot encoding, each variable will match exactly one human-interpretable concept. Although we expect a network to learn partially nonlocal representations in interior layers [4], and past experience shows that an emergent concept will often align with a combination of several hidden units [2], [17], our present aim is to assess how well a representation is disentangled. Therefore we measure the alignment between single units and single interpretable concepts. This does not gauge the discriminative power of the representation; rather it quantifies its disentangled interpretability. As we will show in Sec. 3.2, it is possible for two representations of perfectly equivalent discriminative power to have very different levels of interpretability.
To assess the interpretability of any given CNN, we draw concepts from a new broadly and densely labeled image dataset that unifies labeled visual concepts from a heterogeneous collection of labeled data sources, described in Sec. 2.1. We then measure the alignment of each hidden unit of the CNN with each concept by evaluating the feature activation of each individual unit as a segmentation model for each concept. To quantify the interpretability of a layer as a whole, we count the number of distinct visual concepts that are aligned with a unit in the layer, as detailed in Sec. 2.2.
2.1 Broden: Broadly and Densely Labeled Dataset
To be able to ascertain alignment with both low-level concepts such as colors and higher-level concepts such as objects, we have assembled a new heterogeneous dataset.
The Broadly and Densely Labeled Dataset (Broden) unifies several densely labeled image datasets: ADE [32], OpenSurfaces [33], Pascal-Context [34], Pascal-Part [35], and the Describable Textures Dataset [36]. These datasets contain examples of a broad range of objects, scenes, object parts, textures, and materials in a variety of contexts. Most examples are segmented down to the pixel level, except textures and scenes, which are given for full images. In addition, every image pixel in the dataset is annotated with one of eleven common color names, following the human color-naming classification of van de Weijer [37]. Samples of the types of labels in the Broden dataset are shown in Fig. 1.
The purpose of Broden is to provide a ground truth set of exemplars for a broad set of visual concepts. The concept labels in Broden are normalized and merged from their original datasets so
that every class corresponds to an English word. Labels are merged based on shared synonyms, disregarding positional distinctions such as ‘left’ and ‘top’ and avoiding a blacklist of 29 overly general synonyms (such as ‘machine’ for ‘car’). Multiple Broden labels can apply to the same pixel: for example, a black pixel that has the Pascal-Part label ‘left front cat leg’ has three labels in Broden: a unified ‘cat’ label representing cats across datasets; a similar unified ‘leg’ label; and the color label ‘black’. Only labels with at least 10 image samples are included. Table 1 shows the number of classes per dataset and the average number of image samples per label class. In total, 1,197 visual concept classes are included.

Fig. 1. Samples from the Broden Dataset. The ground truth for each concept is a pixel-wise dense annotation. Samples shown: red and yellow (color); wrinkled and meshed (texture); wood and fabric (material); foot and door (part); airplane and waterfall (object); art studio and beach (scene).

Fig. 2. Scoring unit interpretability by evaluating the unit for semantic segmentation. The panels show the top activated images, the images segmented using the binarized unit activation map, the semantic segmentation annotations, and the segmented annotations.
2.2 Scoring Unit Interpretability
The proposed network dissection method evaluates every individual convolutional unit in a CNN as a solution to a binary segmentation task for every visual concept in Broden, as illustrated in Fig. 3. Our method can be applied to any CNN using a forward pass, without the need for training or backpropagation.
For every input image x in the Broden dataset, the activation map A_k(x) of every internal convolutional unit k is collected. Then the distribution of individual unit activations a_k is computed. For each unit k, the top quantile level T_k is determined such that P(a_k > T_k) = 0.005 over every spatial location of the activation map in the dataset.
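Concretely, T_k can be obtained by pooling every spatial activation of unit k over the whole dataset and taking the 99.5th percentile. The sketch below is one possible NumPy implementation; the array layout (images × units × height × width) is our assumption, not something the paper prescribes:

```python
import numpy as np

def unit_thresholds(activations, quantile=0.005):
    """Compute T_k such that P(a_k > T_k) = quantile for each unit k.

    activations: float array of shape (num_images, num_units, H, W),
                 the activation maps A_k(x) collected over the dataset.
    Returns an array of shape (num_units,) holding T_k.
    """
    n_images, n_units, h, w = activations.shape
    # Pool every spatial location of every image into one sample set per unit.
    pooled = activations.transpose(1, 0, 2, 3).reshape(n_units, -1)
    # T_k is the (1 - quantile) percentile of unit k's activation distribution.
    return np.quantile(pooled, 1.0 - quantile, axis=1)

# Synthetic check: the fraction of activations above T_k is close to 0.005.
rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 8, 7, 7))
T = unit_thresholds(acts)
print(T.shape)  # (8,)
```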
To compare a low-resolution unit's activation map to the input-resolution annotation mask L_c for some concept c, the activation map A_k(x) is scaled up to the mask resolution, giving S_k(x), using bilinear interpolation, anchoring interpolants at the center of each unit's receptive field.
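Ignoring the receptive-field anchoring detail, the bilinear upscaling from A_k(x) to S_k(x) can be approximated with a plain spline zoom (a sketch, not the authors' implementation; `scipy.ndimage.zoom` with `order=1` performs bilinear interpolation):

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_activation(A_k, mask_shape):
    """Bilinearly scale a low-resolution activation map A_k(x) up to the
    annotation-mask resolution, producing S_k(x).

    Note: this uniform zoom ignores the paper's extra step of anchoring
    interpolants at the center of each unit's receptive field.
    """
    zoom_factors = (mask_shape[0] / A_k.shape[0],
                    mask_shape[1] / A_k.shape[1])
    return zoom(A_k, zoom_factors, order=1)  # order=1 -> bilinear

A = np.arange(49, dtype=float).reshape(7, 7)   # a toy 7x7 activation map
S = upsample_activation(A, (224, 224))
print(S.shape)  # (224, 224)
```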
S_k(x) is then thresholded into a binary segmentation: M_k(x) ≡ S_k(x) ≥ T_k, selecting all regions for which the activation exceeds the threshold T_k. These segmentations are evaluated against every concept c in the dataset by computing intersections M_k(x) ∩ L_c(x) for every (k, c) pair.
The score of each unit k as segmentation for concept c is reported as the intersection over union score across all the images in the dataset,

IoU_{k,c} = Σ_x |M_k(x) ∩ L_c(x)| / Σ_x |M_k(x) ∪ L_c(x)|,   (1)
where | · | is the cardinality of a set. Because the dataset contains some types of labels which are not present on some subsets of inputs, the sums are computed only on the subset of images that have at least one labeled concept of the same category as c. The value of IoU_{k,c} is the accuracy of unit k in detecting concept c; we consider a unit k as a detector for concept c if IoU_{k,c} exceeds a threshold. Our qualitative results are insensitive to the IoU threshold: different thresholds denote different numbers of units as concept detectors across all the networks, but relative orderings remain stable. For our comparisons we report a detector if IoU_{k,c} > 0.04. Note that one unit might be the detector for multiple concepts; for the purpose of our analysis, we choose the top-ranked label. To quantify the interpretability of a layer, we count the number of unique concepts aligned with units. We call this the number of unique detectors.
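Under these definitions, the dataset-wide IoU of Eq. 1 and the unique-detector count reduce to a few lines. A schematic NumPy version (the array shapes are our assumptions, and the per-category filtering of images is assumed to happen upstream):

```python
import numpy as np

def iou_score(seg_masks, concept_masks):
    """IoU_{k,c} of Eq. 1: dataset-wide intersection over union between a
    unit's binary segmentations M_k(x) and a concept's masks L_c(x).

    Both arguments: boolean arrays of shape (num_images, H, W).  Images
    lacking any label of concept c's category should be excluded upstream.
    """
    inter = np.logical_and(seg_masks, concept_masks).sum()
    union = np.logical_or(seg_masks, concept_masks).sum()
    return inter / union if union > 0 else 0.0

def unique_detectors(iou_table, concept_names, threshold=0.04):
    """Count distinct concepts detected in a layer.

    iou_table: array of shape (num_units, num_concepts) with IoU_{k,c}.
    A unit is a detector for its top-ranked concept if that IoU exceeds
    the threshold; the result is the set of distinct detected concepts.
    """
    top = iou_table.argmax(axis=1)            # top-ranked label per unit
    qualifies = iou_table.max(axis=1) > threshold
    return {concept_names[c] for c, ok in zip(top, qualifies) if ok}

# Toy example: two units whose top-ranked concept is 'cat'.
table = np.array([[0.30, 0.01],
                  [0.25, 0.02]])
print(unique_detectors(table, ["cat", "grass"]))  # {'cat'}
```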
Figure 2 summarizes the whole process of scoring unit interpretability: by segmenting the annotation mask using the receptive field of units for the top activated images, we compute the IoU for each concept. The IoU evaluating the quality of the segmentation of a unit is an objective confidence score for interpretability that is comparable across networks. Thus this score enables us to compare the interpretability of different representations and lays the basis for the experiments below. Note that network dissection works only as well as the underlying dataset: if a unit matches a human-understandable concept that is absent in Broden, then it will not score well for interpretability. Future versions of Broden will be expanded to include more kinds of visual concepts.
3 INTERPRETING DEEP VISUAL REPRESENTATIONS
For testing we prepare a collection of CNN models with different network architectures and supervision of primary tasks, as listed in Table 2. The network architectures include AlexNet [38], GoogLeNet [39], VGG [40], ResNet [15], and DenseNet [41]. For supervised training, the models are trained from scratch (i.e., not pretrained) on ImageNet [42], Places205 [43], and Places365 [44]. ImageNet is an object-centric dataset, which contains 1.2
million images from 1000 classes. Places205 and Places365 are two subsets of the Places Database, which is a scene-centric dataset with categories such as kitchen, living room, and coast. Places205 contains 2.4 million images from 205 scene categories, while Places365 contains 1.6 million images from 365 scene categories. “Hybrid” refers to a combination of ImageNet and Places365. For self-supervised training tasks, we select several recent models trained on predicting context (context) [23], solving puzzles (puzzle) [24], predicting ego-motion (egomotion) [25], learning by moving (moving) [26], predicting video frame order (videoorder) [45] or tracking (tracking) [27], detecting object-centric alignment (objectcentric) [46], colorizing images (colorization) [28], inpainting (contextencoder) [47], predicting cross-channel (crosschannel) [29], predicting ambient sound from frames (audio) [30], and tracking invariant patterns in videos (transinv) [48]. The self-supervised models we analyze are comparable to each other in that they all use AlexNet or an AlexNet-derived architecture, with one exception, transinv [48], which uses VGG as the base network.

Fig. 3. Illustration of network dissection for measuring semantic alignment of units in a given CNN: the trained network's weights are frozen, the target layer is upsampled, and one unit's activation is evaluated on pixel-wise segmentation tasks (e.g., blue, fabric, door, grass, person, car). Here one unit of the last convolutional layer of a given CNN is probed by evaluating its performance on various segmentation tasks. Our method can probe any convolutional layer.

TABLE 2
Tested CNN Models

Training     Network        Dataset or task
none         AlexNet        random
Supervised   AlexNet        ImageNet, Places205, Places365, Hybrid
             GoogLeNet      ImageNet, Places205, Places365
             VGG-16         ImageNet, Places205, Places365, Hybrid
             ResNet-152     ImageNet, Places365
             DenseNet-161   ImageNet, Places365
Self         AlexNet        context, puzzle, egomotion, tracking, moving,
                            videoorder, audio, crosschannel, colorization,
                            objectcentric, transinv
In the following experiments, we begin by validating our method using human evaluation. Then, we use random unitary rotations of a learned representation to test whether interpretability of CNNs is an axis-independent property; we find that it is not, and we conclude that interpretability is not an inevitable result of the discriminative power of a representation. Next, we analyze all the convolutional layers of AlexNet as trained on ImageNet [38] and as trained on Places [43], and confirm that our method reveals detectors for higher-level concepts at higher layers and lower-level concepts at lower layers; and that more detectors for higher-level concepts emerge under scene training. Then, we show that different network architectures such as AlexNet, VGG, and ResNet yield different interpretability, while differently supervised training tasks and self-supervised training tasks also yield a variety of levels of interpretability. Additionally we show the impact of different training conditions, examine the relationship between discriminative power and interpretability, and investigate a possible way to improve the interpretability of CNNs by increasing their width. Finally we utilize the interpretable units as explanatory factors for the predictions given by a CNN.
3.1 Human Evaluation of Interpretations
Using network dissection, we analyze the interpretability of units within all the convolutional layers of Places-AlexNet and ImageNet-AlexNet, then compare with human interpretation. Places-AlexNet is trained for scene classification on Places205 [43], while ImageNet-AlexNet is the identical architecture trained for object classification on ImageNet [38].
Our evaluation was done by raters on Amazon Mechanical Turk (AMT). As a baseline description of unit semantics, we used human-written descriptions of each unit from [1]. These descriptions were collected by asking raters to write words or short phrases to describe the common meaning or pattern selected by each unit, based on a visualization of the top image patches. Three descriptions and a confidence were collected for each unit. As a canonical description we chose the most common description of a unit (when raters agreed), and the highest-confidence description (when raters did not agree). Some units may not be interpretable. To identify these, raters were shown the canonical descriptions of visualizations and asked whether they were descriptive. Units with validated descriptions are taken as the set of interpretable units.
To compare these baseline descriptions to network-dissection-derived labels, we ran the following experiment. Raters were shown a visualization of top image patches for an interpretable unit, along with a word or short phrase description, and they were asked to vote (yes/no) whether the given phrase was descriptive of most of the image patches. The baseline human-written descriptions were randomized with the labels derived using net dissection, and the origin of the labels was not revealed to the raters.
Table 3 summarizes the results. The number of interpretable units is shown for each layer, and average positive votes for descriptions of interpretable units are shown, both for human-written labels and network-dissection-derived labels. Human labels are most highly consistent for units of conv5, suggesting that humans have no trouble identifying high-level visual concept detectors, while lower-level detectors are more difficult to label. Similarly, labels given by network dissection are best at conv5, and are found to be less descriptive for lower layers.
Fig. 4. The annotation interface used by human raters on Amazon Mechanical Turk. Raters are shown descriptive text in quotes together with fifteen images, each with highlighted patches, and must evaluate whether the quoted text is a good description for the highlighted patches.
TABLE 3
Human evaluation of our Network Dissection approach.

                     conv1    conv2     conv3     conv4     conv5
Interpretable units  57/96    126/256   247/384   258/384   194/256
Human consistency    82%      76%       83%       82%       91%
Network Dissection   37%      56%       54%       59%       71%
Comparison of the human interpretation and the labels predicted by network dissection is plotted in Fig. 5. A sample of units is shown together with both automatically inferred interpretations and manually assigned interpretations taken from [1]. We can see that the predicted labels match the human annotation well, though sometimes they capture a different description of a visual concept, such as the ‘crosswalk’ predicted by the algorithm compared to ‘horizontal lines’ given by the human for the third unit in conv4 of Places-AlexNet in Fig. 5. Confirming intuition, color and texture concepts dominate at lower layers conv1 and conv2 while more object and part detectors emerge in conv5.
3.2 Measurement of Axis-aligned Interpretability
We conduct an experiment to determine whether it is meaningful to assign an interpretable concept to an individual unit. Two possible hypotheses can explain the emergence of interpretability in individual hidden layer units:

Hypothesis 1. Interpretability is a property of the representation as a whole, and individual interpretable units emerge because interpretability is a generic property of typical directions of representation space. Under this hypothesis, projecting to any direction would typically reveal an interpretable concept, and interpretations of single units in the natural basis would not be more meaningful than interpretations that can be found in any other direction.

Hypothesis 2. Interpretable alignments are unusual, and interpretable units emerge because learning converges to a special basis that aligns explanatory factors with individual units. In this model, the natural basis represents a meaningful decomposition learned by the network.
Hypothesis 1 is the default assumption: in the past it has been found [19] that with respect to interpretability “there is no distinction between individual high level units and random linear combinations of high level units.”
Network dissection allows us to re-evaluate this hypothesis. We apply random changes in basis to a representation learned by AlexNet. Under hypothesis 1, the overall level of interpretability should not be affected by a change in basis, even as rotations cause the specific set of represented concepts to change. Under hypothesis 2, the overall level of interpretability is expected to drop under a change in basis.
We begin with the representation of the 256 convolutional units of AlexNet conv5 trained on Places205 and examine the effect of a change in basis. To avoid any issues of conditioning or degeneracy, we change basis using a random orthogonal transformation Q. The rotation Q is drawn uniformly from SO(256) by applying Gram-Schmidt on a normally distributed QR = A ∈ R^{256×256} with positive-diagonal right-triangular R, as described by [49]. Interpretability is summarized as the number of unique visual concepts aligned with units, as defined in Sec. 2.2.
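This sampling scheme (QR of a Gaussian matrix with the sign of R's diagonal fixed, per [49]) can be sketched as follows; the final column flip forcing determinant +1 is our addition, since plain QR lands in O(n) rather than SO(n) half the time:

```python
import numpy as np

def random_special_orthogonal(n, rng):
    """Draw a rotation Q uniformly (Haar measure) from SO(n)."""
    A = rng.standard_normal((n, n))      # normally distributed A
    Q, R = np.linalg.qr(A)               # Gram-Schmidt: QR = A
    Q = Q * np.sign(np.diag(R))          # make R's diagonal positive
    if np.linalg.det(Q) < 0:             # reflect one axis to reach SO(n)
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(0)
Q = random_special_orthogonal(256, rng)
print(np.allclose(Q.T @ Q, np.eye(256)))  # True: Q is orthogonal
```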
Denoting AlexNet conv5 as f(x), we find that the number of unique detectors in Qf(x) is 80% fewer than the number of unique detectors in f(x). Our finding is inconsistent with hypothesis 1 and consistent with hypothesis 2.
We also test smaller perturbations of basis using Q^α for 0 ≤ α ≤ 1, where the fractional powers Q^α ∈ SO(256) are chosen to form a minimal geodesic gradually rotating from I to Q; these intermediate rotations are computed using a Schur decomposition. Fig. 6 shows that interpretability of Q^α f(x) decreases as larger rotations are applied. Fig. 7 shows some examples of the linearly combined units.
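The fractional powers Q^α can be computed from the real Schur form of Q: for an orthogonal Q the Schur factor is block diagonal with 2×2 planar-rotation blocks, and scaling each block's angle by α traces the minimal geodesic from I to Q. A sketch of this idea (it leaves the measure-zero case of real eigenvalues ±1 uninterpolated, which a Haar-random Q almost surely avoids):

```python
import numpy as np
from scipy.linalg import schur

def fractional_rotation(Q, alpha):
    """Q^alpha in SO(n), interpolating along the geodesic from I to Q."""
    T, U = schur(Q, output='real')       # Q = U T U^T, T block diagonal
    n = Q.shape[0]
    Ta = np.eye(n)
    i = 0
    while i < n:
        if i + 1 < n and abs(T[i + 1, i]) > 1e-12:
            # 2x2 block is a planar rotation by theta; scale theta by alpha.
            theta = alpha * np.arctan2(T[i + 1, i], T[i, i])
            c, s = np.cos(theta), np.sin(theta)
            Ta[i:i + 2, i:i + 2] = [[c, -s], [s, c]]
            i += 2
        else:
            Ta[i, i] = T[i, i]           # 1x1 block (eigenvalue +-1), rare
            i += 1
    return U @ Ta @ U.T

# Example: interpolate halfway to a random rotation in SO(8).
rng = np.random.default_rng(1)
Qm, R = np.linalg.qr(rng.standard_normal((8, 8)))
Qm = Qm * np.sign(np.diag(R))
if np.linalg.det(Qm) < 0:
    Qm[:, 0] = -Qm[:, 0]
half = fractional_rotation(Qm, 0.5)
print(np.allclose(half @ half, Qm))  # the half rotation composed twice recovers Qm
```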
Each rotated representation has exactly the same discriminative power as the original layer. Writing the original network as g(f(x)), note that g′(r) ≡ g(Q^T r) defines a neural network that processes the rotated representation r = Qf(x) exactly as the original g operates on f(x). We conclude that interpretability is neither an inevitable result of discriminative power, nor is it a prerequisite to discriminative power. Instead, we find that interpretability is a different quality that must be measured separately to be understood.
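The identity g′(r) ≡ g(Q^T r) with r = Qf(x) is easy to verify numerically. A toy sketch with a random two-layer "network" (all weights and sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: f maps a 10-d input to a 16-d representation, g classifies it.
W1 = rng.standard_normal((16, 10))
W2 = rng.standard_normal((3, 16))
f = lambda x: np.maximum(W1 @ x, 0.0)   # representation f(x)
g = lambda r: W2 @ r                    # head g on the natural basis

# A random orthogonal change of basis Q (via QR of a Gaussian matrix).
Q, R = np.linalg.qr(rng.standard_normal((16, 16)))
Q = Q * np.sign(np.diag(R))

# Head adapted to the rotated representation: g'(r) = g(Q^T r).
g_prime = lambda r: g(Q.T @ r)

x = rng.standard_normal(10)
r = Q @ f(x)                            # rotated representation
print(np.allclose(g_prime(r), g(f(x))))  # True: identical network output
```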
We repeat the complete rotation (α = 1) on Places365 and ImageNet 10 times; the result is shown in Fig. 8. We observe a drop of interpretability for both networks, with a larger drop for the AlexNet trained on Places365. This is because the AlexNet trained on Places365 starts with higher interpretability than the AlexNet trained on ImageNet, so the random rotation has more interpretability to destroy.
3.3 Network Architectures with Supervised Learning
How do different network architectures affect the disentangled interpretability of the learned representations? We apply network dissection to evaluate a range of network architectures trained on ImageNet and Places. For simplicity, the following experiments focus on the last convolutional layer of each CNN, where semantic detectors emerge most.
Results showing the number of unique detectors that emerge from various network architectures trained on ImageNet and Places are plotted in Fig. 9. In terms of network architecture, we find that interpretability of ResNet > DenseNet > VGG > GoogLeNet > AlexNet. Deeper architectures usually appear to allow greater interpretability, though individual layer structure is different across architectures. Comparing training datasets, we find Places > ImageNet. As discussed in [1], one scene is composed of multiple objects, so it may be beneficial for more object detectors to emerge in CNNs trained to recognize scenes.

Fig. 5. Comparison of the interpretability of all five convolutional layers of AlexNet, as trained on classification tasks for Places (top) and ImageNet (bottom). Four examples of units in each layer are shown with identified semantics. The segmentation generated by each unit is shown on the three Broden images with highest activation. Top-scoring labels are shown above to the left, and human-annotated labels are shown above to the right. Some disagreement can be seen in the dominant judgment of meaning. For example, human annotators mark the first conv4 unit on Places as a ‘windows’ detector, while the algorithm matches the ‘chequered’ texture.

Fig. 6. Interpretability over changes in basis of the representation of AlexNet conv5 trained on Places. The vertical axis shows the number of unique interpretable concepts that match a unit in the representation. The horizontal axis shows α, which quantifies the degree of rotation.
Fig. 10 shows the histogram of object detectors identified inside ResNet and DenseNet trained on ImageNet and Places respectively. DenseNet161-Places365 has the largest number of unique object detectors among all the networks. The emergent detectors differ across both training data source and architecture. The most frequent object detectors in the two networks trained on ImageNet are dog detectors, because there are more than 100 dog categories out of the 1000 classes in the ImageNet training set.
Fig. 11 shows examples of object detectors grouped by object category. For the same object category, the visual appearance of the unit as a detector varies not only within the same network but also across different networks. DenseNet and ResNet have particularly strong detectors for bus and airplane, with IoU above 0.25.
[Fig. 7 content — best single-unit vs. best linear-combination detectors: car, unit 87 IoU 0.16 vs. combination (closest to unit 173) IoU 0.06; skyscraper, unit 94 IoU 0.16 vs. combination (closest to unit 94) IoU 0.05; tree, unit 228 IoU 0.10 vs. combination IoU 0.02; head, unit 3 IoU 0.09 vs. combination (closest to unit 70) IoU 0.02; closet, unit 107 IoU 0.06 vs. combination (closest to unit 34) IoU 0.02.]
Fig. 7. Visualizations of the best single-unit concept detectors of five concepts taken from individual units of AlexNet conv5 trained on Places (left), compared with the best linear-combination detectors of the same concepts taken from the same representation under a random rotation (right). In each case, the highest-IoU-scoring unit that matches the given concept is shown. For most concepts, both the IoU and the visualization of the top activating image patches confirm that individual units match concepts better than linear combinations. In other cases (e.g., head detectors), the visualization of the linear combination appears highly consistent, but the IoU score reveals lower consistency when evaluated over the whole dataset.
Fig. 8. Complete rotation (α = 1) repeated on AlexNet trained on Places365 and ImageNet respectively. Rotation reduces the interpretability significantly for both networks.
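The complete rotation of Fig. 8 can be sketched by drawing a random orthogonal basis and re-expressing unit responses in it; the rotated "units" are linear combinations of the originals, with identical discriminative power but, as the figure shows, far lower interpretability. A minimal sketch, assuming responses are stored as a (units × samples) matrix:

```python
import numpy as np

def random_rotation(n_units, seed=0):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((n_units, n_units)))
    # Fix column signs so the factorization is unique.
    return q * np.sign(np.diag(r))

def rotate_responses(responses, q):
    """Re-express unit responses (units x samples) in the rotated basis.

    Each row of the result is a linear combination of the original units;
    interpretability can then be re-scored on these combinations.
    """
    return q @ responses
```

Intermediate rotations (0 < α < 1) would interpolate between the identity and such a matrix; only the complete rotation is sketched here.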
Fig. 9. Interpretability across different architectures trained on ImageNet and Places.
Fig. 12 showcases the unique interpretable units of all types on a variety of networks.
Fig. 13 shows the unique interpretable detectors over different layers for different network architectures trained on Places365. We observe that more object and scene detectors emerge at the higher layers across all architectures: AlexNet, VGG, GoogLeNet, and ResNet. This suggests that representation ability increases with layer depth. Because of the compositional structure of the CNN layers, the deeper layers should have higher capacity to represent concepts with larger visual complexity, such as objects and scene parts. Our measurements confirm this, and we conclude that greater network depth encourages the emergence of visual concepts with higher semantic complexity.
3.4 Representations from Self-supervised Learning

Recently, many works have explored a novel paradigm for unsupervised learning of CNNs without using millions of annotated images, namely self-supervised learning. For example, [23] trains deep CNNs to predict the spatial relationship between two image patches, while [28] trains networks by colorizing images. In total we investigate 12 networks trained for different self-supervised learning tasks. How do different supervisions affect the internal representations?
Here we compare the interpretability of the deep visual representations resulting from self-supervised learning and supervised learning. We keep the network architecture the same as AlexNet for each model (one exception is the recent model transinv, which uses VGG as the base network). Results are shown in Fig. 14. We observe that training on Places365 creates the largest number of unique detectors. Self-supervised models create many texture detectors but relatively few object detectors; apparently, supervision from a self-taught primary task is much weaker at inferring interpretable concepts than supervised training on a large annotated dataset. The form of self-supervision makes a difference: for example, the colorization model is trained on colorless images, and almost no color-detection units emerge. We hypothesize that emergent units represent concepts required to solve the primary task.
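The "unique detector" counts compared throughout these figures reduce to a simple aggregation over per-unit labels. A sketch, assuming each unit has already been assigned a (concept, category, IoU) triple and that 0.04 is the IoU threshold for counting a unit as a detector (the threshold value is an assumption stated here for concreteness):

```python
from collections import defaultdict

def count_unique_detectors(unit_labels, iou_threshold=0.04):
    """Count unique interpretable concepts per category.

    unit_labels: iterable of (concept, category, iou) triples, one per
                 unit, e.g. ("car", "object", 0.16). A unit counts as a
                 detector only if its best IoU clears the threshold;
                 several units detecting the same concept count once.
    Returns {category: number of distinct concepts detected}.
    """
    concepts = defaultdict(set)
    for concept, category, iou in unit_labels:
        if iou >= iou_threshold:
            concepts[category].add(concept)
    return {cat: len(names) for cat, names in concepts.items()}
```

The total detector count, by contrast, would count units rather than distinct concepts.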
Fig. 15 shows some typical visual detectors identified in the self-supervised CNN models. For the models audio and puzzle, some object and part detectors emerge. Those detectors may be useful for CNNs to solve the primary tasks: the audio model is trained to associate objects with a sound source, so it may be useful to recognize people and cars, while the puzzle model is trained to align the different parts of objects and scenes in an image. For colorization and tracking, recognizing textures might be good enough for the CNN to solve primary tasks such as colorizing a desaturated natural image; thus it is unsurprising that texture detectors dominate.
3.5 Training Conditions
Training conditions such as the number of training iterations, dropout [6], batch normalization [7], and random initialization [21] are known to affect the representation learning of neural networks. To analyze the effect of training conditions on interpretability, we take the Places205-AlexNet as the baseline model and prepare several variants of it, all using the same AlexNet architecture. For the variants Repeat1, Repeat2, and Repeat3, we randomly initialize the weights and train them with the same number of iterations. For the variant NoDropout, we remove the dropout in the FC layers of the baseline model. For the variant BatchNorm, we apply batch normalization at each convolutional layer of the baseline model. Repeat1, Repeat2, and Repeat3 all have nearly the same top-1 accuracy of 50.0% on the validation set. The variant without dropout has top-1 accuracy 49.2%; the variant with batch normalization has top-1 accuracy 50.5%.
Fig. 16 shows the interpretability of units in the CNNs under different training conditions. We find several effects: 1) Comparing different random initializations, the models converge to similar levels of interpretability, both in terms of the number of unique detectors and the total number of detectors; this matches the observations of convergent learning discussed in [21]. 2) For the network without dropout, more texture detectors emerge but fewer object detectors. 3) Batch normalization seems to decrease interpretability significantly.
The batch normalization result serves as a caution that discriminative power is not the only property of a representation that should be measured. Our intuition for the loss of interpretability under batch normalization is that batch normalization 'whitens' the activation at each layer, which smooths out scaling issues and allows a network to easily rotate the axes of intermediate representations during training. While whitening apparently speeds training, it may also have an effect similar to the random rotations analyzed in Sec. 3.2, which destroy interpretability. As discussed in Sec. 3.2, however, interpretability is neither a prerequisite nor an obstacle to discriminative power. Finding ways to capture the benefits of batch normalization without destroying interpretability is an important area for future work.
Fig. 17 plots the interpretability of snapshots of the baseline model at different training iterations, along with the accuracy on the validation set. We can see that object detectors and part detectors begin emerging at about 10,000 iterations (each iteration processes a batch of 256 images). We do not find evidence of transitions across different concept categories during training; for example, units in conv5 do not turn into texture or material detectors before becoming object or part detectors. In Fig. 18, we keep track of six units over different training iterations. We observe that some units start converging to their semantic concept at an early stage; for example, unit138 starts detecting mountain snowy as early as iteration 2446. We also observe that units evolve over time: unit74 and unit108 detect road first before they start detecting car and airplane respectively.
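Tracking unit evolution as in Fig. 18 (and in the fine-tuning analysis of Sec. 3.6) amounts to logging each unit's top label at every checkpoint and flagging the iterations where it flips. A small sketch over hypothetical checkpoint data:

```python
def label_flips(history):
    """Find where a unit's top label changes across checkpoints.

    history: list of (iteration, top_label) pairs, sorted by iteration.
    Returns a list of (iteration, old_label, new_label) flip events.
    """
    flips = []
    for (_, prev), (it, cur) in zip(history, history[1:]):
        if cur != prev:
            flips.append((it, prev, cur))
    return flips
```

Applied to the units above, such a log would show, e.g., a unit's label flipping from road to car partway through training.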
[Fig. 10 content — per-network histograms of object detectors: ResNet152-Places (61 unique objects), DenseNet-Places (73), ResNet152-ImageNet (62), DenseNet-ImageNet (57).]
Fig. 10. Histogram of the object detectors from the ResNet and DenseNet trained on ImageNet and Places respectively.
[Fig. 11 content — two highest-IoU units per network for each concept:
DenseNet-161: closet, layer161 units 1639 (IoU 0.225) / 1788 (0.201); dog, units 2035 (0.199) / 2028 (0.113); plant, units 1126 (0.076) / 1356 (0.067); bus, units 1492 (0.282) / 1519 (0.155); airplane, units 1518 (0.205) / 1512 (0.125).
ResNet-152: closet, res5c units 2011 (0.171) / 9 (0.161); dog, units 1573 (0.217) / 1718 (0.195); plant, units 264 (0.125) / 766 (0.094); bus, units 674 (0.265) / 74 (0.256); airplane, units 1243 (0.172) / 963 (0.156).
GoogLeNet: closet, inception_5b units 758 (0.159) / 235 (0.136); dog, inception_4e units 750 (0.203) / 225 (0.152); plant, inception_4e units 56 (0.139) / 714 (0.105); bus, inception_4e unit 824 (0.168) / inception_5b unit 603 (0.154); airplane, inception_4e units 92 (0.164) / 759 (0.144).
VGG-16: closet, conv5_3 units 213 (0.125) / 107 (0.065); dog, units 142 (0.205) / 491 (0.112); plant, conv5_3 unit 85 (0.086) / conv4_3 unit 336 (0.068); bus, conv5_3 units 191 (0.153) / 20 (0.149); airplane, units 151 (0.150) / 204 (0.077).
AlexNet: closet, conv5 unit 235 (0.017) / conv3 unit 255 (0.015); dog, conv5 units 180 (0.090) / 250 (0.051); plant, units 55 (0.087) / 16 (0.062); bus, units 10 (0.040) / 174 (0.029); airplane, units 13 (0.101) / 28 (0.049).]
Fig. 11. Comparison of several visual concept detectors identified by network dissection in DenseNet, ResNet, GoogLeNet, VGG, and AlexNet. Each network is trained on Places365. The two highest-IoU matches among the convolutional units of each network are shown. The segmentation generated by each unit is shown on the four maximally activating Broden images. Some units activate on concept generalizations, e.g., GoogLeNet 4e's unit 225 on horses and dogs, and unit 759 on white ellipsoids and jets.
[Fig. 12 content — per-network lists of unique detectors (objects, scenes, parts, materials, textures, colors) for ResNet152 (Places), GoogLeNet (Places), VGG16 (Places), AlexNet-GAP-Wide (Places), AlexNet (Places), AlexNet (ImageNet), AlexNet (Video Tracking), AlexNet (Ambient Sound), AlexNet (Puzzle Solving), and AlexNet (Egomotion).]
Fig. 12. Comparison of unique detectors of all types on a variety of architectures. More results are available at the project page.
[Fig. 13 content — per-layer counts of unique detectors for AlexNet (conv1 to conv5), VGG16 (conv3-3 to conv5-3), GoogLeNet (conv1-7x7-s2 to inception-5b), and ResNet152 (BN-1 to Eltwise-510), each trained on Places365 and on ImageNet.]
Fig. 13. Comparison of interpretability of the layers of AlexNet, VGG16, GoogLeNet, and ResNet152 trained on Places365. All five conv layers of AlexNet and selected layers of VGG, GoogLeNet, and ResNet are included.
Fig. 14. Semantic detectors emerge across different supervision of the primary training task. All these models use the AlexNet architecture and are tested at conv5.
[Fig. 15 content — top-ranked concepts per self-supervised network:
audio: chequered (texture) 0.102, car (object) 0.063, head (part) 0.061.
puzzle: head (part) 0.091, perforated (texture) 0.085, sky (object) 0.069.
colorization: dotted (texture) 0.140, head (part) 0.056, sky (object) 0.048.
tracking: chequered (texture) 0.167, grass (object) 0.120, red-c (color) 0.100.]
Fig. 15. The top-ranked concepts in the top three categories in four self-supervised networks. Some object and part detectors emerge in audio. Detectors for person heads also appear in puzzle and colorization. A variety of texture concepts dominate models with self-supervised training.
3.6 Transfer Learning between Places and ImageNet
Fine-tuning a pre-trained network to a target domain is commonly used in transfer learning. The deep features from the pre-trained network show good generalization across different domains. The pre-trained network also makes training converge faster and yields better accuracy, especially if there is not enough training data in the target domain. Here we analyze what happens inside
Fig. 16. Effect of regularizations on the interpretability of CNNs.
Fig. 17. The evolution of the interpretability of conv5 of Places205-AlexNet over 3,000,000 training iterations. The accuracy on the validation set at each iteration is also plotted. The baseline model is trained to 300,000 iterations (marked by the red line).
the representation and how the interpretations of the internal units evolve during transfer learning.
We run two sets of experiments: fine-tuning Places-AlexNet to ImageNet and fine-tuning ImageNet-AlexNet to Places. We want to see how individual units mutate across domains. The interpretability results of the model checkpoints at different fine-tuning iterations are plotted in Fig. 19. We can see that the training indeed converges faster compared to the network trained from scratch on Places in Fig. 17. The interpretations of the units also change over fine-tuning: for example, the number of unique object detectors first drops and then keeps increasing for the network fine-tuned from ImageNet to Places365, while it slowly drops for the network fine-tuned from Places to ImageNet.
Fig. 20 shows some examples of individual unit evolution in the network fine-tuned from ImageNet to Places365 and the network fine-tuned from Places365 to ImageNet. For each network, we show six units with their interpretations at the beginning and at the end of fine-tuning. For example, in the network fine-tuned from ImageNet to Places365, unit15, which first detects white dogs, mutates to detect waterfalls; unit136 and unit144, which first detect dogs, mutate to detect horses and cows respectively, as many scene categories in Places, such as pasture and corral, contain such animals. On the other hand, in the network fine-tuned from Places365 to ImageNet, many units mutate into various kinds of dog detectors. Interestingly, though those units mutate to detect different concepts, the concepts share low-level similarity such
Fig. 18. The interpretations of units change over iterations. Each row shows the interpretation of one unit.
Fig. 19. a) Fine-tuning AlexNet from ImageNet to Places365. b) Fine-tuning AlexNet from Places365 to ImageNet.
Fig. 20. Units mutate in a) the network fine-tuned from ImageNet to Places365 and b) the network fine-tuned from Places365 to ImageNet. Six units are shown with their semantics at the beginning and at the end of fine-tuning.
as colors and textures. In Fig. 21, we zoom into two units from each of the two fine-tuning processes and plot the history of concept evolution. We can see that some units switch their top-ranked label several times before converging to a concept: unit15 in the fine-tuning from ImageNet to Places365 flipped to white and crystalline before reaching the waterfall concept. Other units switch faster; for example, unit132 in the fine-tuning from Places365 to ImageNet switches from hair to dog at an early stage of fine-tuning.
3.7 Layer Width vs. Interpretability

From AlexNet to ResNet, CNNs for visual recognition have grown deeper in the quest for higher classification accuracy. Depth has been shown to be important to high discrimination ability, and we have seen in Sec. 3.3 that interpretability can increase with depth
Fig. 21. The history of unit mutation during the fine-tuning from ImageNet to Places365 (top) and from Places365 to ImageNet (bottom).
as well. However, the width of layers (the number of units per layer) has been less explored. One reason is that increasing the number of convolutional units in a layer significantly increases computational cost while yielding only marginal improvements in classification accuracy. Nevertheless, some recent work [50] shows that a carefully designed wide residual network can achieve classification accuracy superior to the commonly used thin and deep counterparts.
To explore how the width of layers affects the interpretability of CNNs, we run a preliminary experiment to test how width affects the emergence of interpretable detectors. We remove the FC layers of AlexNet and triple the number of units at conv5, i.e., from 256 to 768 units, to obtain AlexNet-GAP-Wide. For AlexNet-GAP-WideAll, we further triple the number of units at all the earlier conv layers except conv1. Finally, we put a global average pooling layer after conv5 and fully connect the pooled 768-feature activations to the final class prediction.
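The GAP head described above replaces the FC layers with global average pooling followed by a single linear classifier. A shape-level sketch in numpy (the 768-unit width and 365 classes follow the text; the random weights and 6×6 spatial size are placeholders for illustration):

```python
import numpy as np

def gap_head(conv5_features, weight, bias):
    """Global-average-pool conv features, then a linear class prediction.

    conv5_features: (batch, channels, H, W) activations, e.g. channels=768
                    for the widened conv5.
    weight: (num_classes, channels) linear layer, e.g. (365, 768).
    bias:   (num_classes,) bias vector.
    """
    pooled = conv5_features.mean(axis=(2, 3))   # (batch, channels)
    return pooled @ weight.T + bias             # (batch, num_classes)

# Hypothetical shapes matching AlexNet-GAP-Wide on Places365:
feats = np.random.randn(2, 768, 6, 6)
scores = gap_head(feats, np.random.randn(365, 768), np.zeros(365))
```

Because each pooled feature is the spatial mean of one unit's activation map, the class score is a weighted sum of per-unit activations, which is what makes the per-unit explanation in Sec. 3.9 possible.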
After training on Places365, AlexNet-GAP-Wide and AlexNet-GAP-WideAll obtain classification accuracy on the validation set similar to the standard AlexNet (about 0.5% top-1 accuracy lower and higher, respectively), but many more unique concept detectors emerge at conv5 for AlexNet-GAP-Wide and at all the conv layers for AlexNet-GAP-WideAll, as shown in Fig. 22. We have also increased the number of units at conv5 to 1024 and 2048, but the number of unique concepts does not increase significantly further. This may indicate a limit on the capacity of AlexNet to separate explanatory factors, or a limit on the number of disentangled concepts that are helpful for solving the primary task of scene classification.
Fig. 22. Comparison of the standard AlexNet, AlexNet-GAP-Wide, and AlexNet-GAP-WideAll. Widening the layers brings the emergence of more detectors. Networks are trained on Places365.
3.8 Discrimination vs. Interpretability
Activations from the higher layers of pre-trained CNNs are often used as generic visual features (referred to as deep features), generalizing very well to other image datasets [16], [43]. It is interesting to connect the generalization of deep visual representations as generic visual features with their interpretability.
Here we first benchmark the deep features from several networks on six image classification datasets for their discriminative power. For each network, we feed in the images and extract the activation at the last convolutional layer as the visual feature, then train a linear SVM with C = 0.001 on the train split and evaluate the performance on the test split. We compute the classification accuracy averaged across classes. We include the event8 [51], action40 [52], indoor67 [53], sun397 [54], caltech101 [55], and caltech256 [56] datasets.
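The benchmark metric above, accuracy averaged across classes, weights every class equally and so differs from overall accuracy when test classes are imbalanced. A minimal implementation:

```python
import numpy as np

def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, each class weighted equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))
```

For example, with three samples of class 0 (two correct) and one of class 1 (correct), the class-averaged accuracy is (2/3 + 1)/2 ≈ 0.833, while the overall accuracy would be 0.75.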
The classification accuracies on the six image datasets using the deep features are plotted in Fig. 23. We can see that the deep features from supervised networks perform much better than those from self-supervised networks. Networks trained on Places have better features for scene-centric datasets such as sun397 and indoor67, while networks trained on ImageNet have better features for object-centric datasets such as caltech101 and action40.
Fig. 24 plots the number of unique object detectors for each representation against that representation's classification accuracy on three selected datasets. We can see a positive correlation between them. Thus the supervision tasks that encourage the emergence of more concept detectors may also improve the discrimination ability of deep features. Interestingly, on some object-centric datasets the best discriminative representation is that of ResNet152-ImageNet, which has fewer unique object detectors than ResNet152-Places365. We hypothesize that the accuracy of a representation applied to a task depends not only on the number of concept detectors in the representation, but also on how well those detectors capture the characteristics of the hidden factors in the transferred dataset.
3.9 Explanatory Factors for the Deep Features
After interpreting the units inside the deep visual representation, we show that the unit activations, along with their interpreted labels, can be used as explanatory factors for analyzing the prediction given by the deep features. Previous work [57] uses a weighted sum of the unit activation maps to highlight which image regions are most informative to the prediction; here we further decouple this at the individual-unit level to segment the informative image regions.
We first examine class-specific units. After the linear SVM is trained, we rank the elements of the feature vector by their SVM weights to find the elements of the deep feature that contribute most to a given class. These elements are units that act as explanatory factors, and we call the top-ranked units associated with each output class class-specific units. Fig. 25 shows the class-specific units of ResNet152-ImageNet and ResNet152-Places365 for one class from action40 and one from sun397. For example, for the Walking the dog class from action40, the top three class-specific units of ResNet152-ImageNet are two dog detectors and one person detector; for the Picnic area class from sun397, the top three class-specific units of ResNet152-Places365 are a plant detector, a grass detector, and a fence detector. The intuitive match between visual detectors and the classes they explain suggests that the detectors in a CNN behave like bag-of-semantic-words visual features.
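A minimal sketch of extracting class-specific units from a trained linear SVM. The weight vector and the unit concept labels below are hypothetical; in practice `w` would be the SVM weights for one class and `unit_labels` would come from Network Dissection.

```python
import numpy as np

# Hypothetical SVM weight vector over N units for one class, and the
# concept label that Network Dissection assigned to each unit.
w = np.array([0.1, 1.4, -0.3, 0.9, 2.1])
unit_labels = ["grass", "dog", "sky", "person", "dog"]

# Class-specific units: the units with the largest positive SVM weight.
top_k = 3
top_units = np.argsort(w)[::-1][:top_k]
top_concepts = [unit_labels[i] for i in top_units]
# With these toy values, the class is dominated by dog and person detectors,
# analogous to the 'Walking the dog' example in the text.
```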
We further use the individual units identified as concept detectors to build an explanation of a classifier's prediction for an individual image. The procedure is as follows. Given an image, let the unit activations of the deep feature (for ResNet, the GAP activations) be [x1, x2, ..., xN], where each xn is the value summed over the activation map of unit n. Let the top prediction's SVM response be s = Σn wnxn, where [w1, w2, ..., wN] is the SVM's learned weight vector. We obtain the top-ranked units in Fig. 26 by sorting [w1x1, w2x2, ..., wNxN], the unit activations weighted by the SVM weights of the top predicted class. We then simply upsample the activation map of each top-ranked unit to segment the image.
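The ranking-and-segmentation step can be sketched as follows. The activation maps and SVM weights are synthetic stand-ins, and nearest-neighbour upsampling via `np.kron` substitutes for whatever interpolation the authors actually used; the weighted-contribution ranking itself follows the procedure in the text.

```python
import numpy as np

# Synthetic stand-ins: N last-conv activation maps (H x W each) for one
# image, and the SVM weights of the top predicted class.
rng = np.random.RandomState(0)
n_units, h, w_sp, img_size = 8, 7, 7, 224
act_maps = rng.rand(n_units, h, w_sp)
w = rng.randn(n_units)

# x_n: each unit's activation summed over its map (GAP-style pooling),
# so the SVM score is s = sum_n w_n * x_n.
x = act_maps.sum(axis=(1, 2))

# Rank units by their weighted contribution w_n * x_n to the score.
contrib = w * x
ranked = np.argsort(contrib)[::-1]
top_unit = ranked[0]

# Upsample the top unit's activation map to image resolution
# (nearest-neighbour here) and threshold it to segment the
# informative image region.
scale = img_size // h
upsampled = np.kron(act_maps[top_unit], np.ones((scale, scale)))
mask = upsampled > upsampled.max() * 0.5
```

The same upsample-and-threshold step applied to each of the top-ranked units yields the per-unit segmentations shown in Fig. 26.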
The image segmentations produced by individual unit activations are shown in Fig. 26a. The unit segmentations explain the predictions explicitly. For example, the prediction for the first image is Gardening, and the explanatory units detect plant, grass, person, flower, and pot. The prediction for the second image is Riding a horse, and the explanatory units detect horse, fence, and dog. We also plot some wrongly predicted samples in Fig. 26b. The segmentations give an intuition for why the classifier made mistakes: for the first image, the classifier predicts Cutting vegetables rather than the true label Gardening because the second unit wrongly interprets the ground as a table.
4 CONCLUSION
Network Dissection translates qualitative visualizations of representation units into quantitative interpretations and measurements of interpretability. We have found that the units of a deep representation are significantly more interpretable than would be expected for a random basis of the representation space. We have investigated the interpretability of deep visual representations resulting from different architectures, training supervisions, and training conditions. Furthermore, we have shown that the interpretability of deep visual representations is relevant to the power of the representation as a generalizable visual feature. We conclude that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure. Our work motivates future work toward building more interpretable and explainable AI systems.
ACKNOWLEDGMENTS
This work was partly supported by the National Science Foundation under Grants No. 1524817 to A.T., and No. 1532591 to A.T. and
Fig. 23. The classification accuracy of deep features on the six image datasets.
Fig. 24. The number of unique object detectors in the last convolutional layer compared to each representation's classification accuracy on three datasets. Supervised (in red) and unsupervised (in green) representations clearly form two clusters.
Fig. 25. Class-specific units from ResNet152-ImageNet and ResNet152-Places365 on one class from action40 and one from sun397. For each class, we show three sample images, followed by the top 3 units from ResNet152-ImageNet and ResNet152-Places365 ranked by the class weight of the linear SVM trained to predict that class. The SVM weight, detected concept name, and IoU score are shown above each unit.
A.O.; the Vannevar Bush Faculty Fellowship program sponsored by the Basic Research Office of the Assistant Secretary of Defense for Research and Engineering and funded by the Office of Naval Research through grant N00014-16-1-3116 to A.O.; the MIT Big Data Initiative at CSAIL; the Toyota Research Institute / MIT CSAIL Joint Research Center; Google and Amazon Awards; and a hardware donation from NVIDIA Corporation. B.Z. is supported by a Facebook Fellowship.
REFERENCES
[1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene cnns," International Conference on Learning Representations, 2015.
[2] A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, "Do semantic parts emerge in convolutional neural networks?" arXiv:1607.03738, 2016.
[3] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," arXiv:1609.02612, 2016.
[4] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[5] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, "Network dissection: Quantifying interpretability of deep visual representations," in Proc. CVPR, 2017.
[6] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[8] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Proc. ECCV, 2014.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[10] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," Proc. CVPR, 2015.
[11] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," International Conference on Learning Representations Workshop, 2014.
[12] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. CVPR, 2015.
[13] A. Dosovitskiy and T. Brox, "Generating images with perceptual similarity metrics based on deep networks," in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
[14] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Advances in Neural Information Processing Systems, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[16] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," arXiv:1403.6382, 2014.
[17] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," Proc. ECCV, 2014.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014.
[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv:1312.6199, 2013.
[20] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proc. CVPR, 2015.
[21] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft, "Convergent learning: Do different neural networks learn the same representations?" arXiv:1511.07543, 2015.
[22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," International Conference on Learning Representations, 2017.
[23] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proc. CVPR, 2015.
[24] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in Proc. ECCV, 2016.
[25] D. Jayaraman and K. Grauman, "Learning image representations tied to ego-motion," in Proc. ICCV, 2015.
[26] P. Agrawal, J. Carreira, and J. Malik, "Learning to see by moving," in Proc. ICCV, 2015.
[27] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in Proc. CVPR, 2015.
[28] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proc. ECCV. Springer, 2016.
[29] ——, "Split-brain autoencoders: Unsupervised learning by cross-channel prediction," in Proc. CVPR, 2017.
[30] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, "Ambient sound provides supervision for visual learning," in Proc. ECCV, 2016.
[31] R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried, "Invariant visual representation by single neurons in the human brain," Nature, vol. 435, no. 7045, pp. 1102–1107, 2005.
[32] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," Proc. CVPR, 2017.
[33] S. Bell, K. Bala, and N. Snavely, "Intrinsic images in the wild," ACM Trans. on Graphics (SIGGRAPH), 2014.
[34] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, "The role of context for object detection and semantic segmentation in the wild," in Proc. CVPR, 2014.
[35] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, "Detect what you can: Detecting and representing objects using holistic models and body parts," in Proc. CVPR, 2014.
[36] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in Proc. CVPR, 2014.
[37] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. CVPR, 2015.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[41] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," Proc. CVPR, 2017.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," Int'l Journal of Computer Vision, 2015.
[43] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014.
[44] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[45] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: unsupervised learning using temporal order verification," in Proc. ECCV, 2016.
[46] R. Gao, D. Jayaraman, and K. Grauman, "Object-centric representation learning from unlabeled videos," arXiv:1612.00500, 2016.
[47] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. CVPR, 2016.
Fig. 26. Segmenting images using the top activated units weighted by the SVM weights of the predicted class, from the ResNet152-Places365 deep feature. a) Correctly predicted samples. b) Wrongly predicted samples.
[48] X. Wang, K. He, and A. Gupta, "Transitive invariance for self-supervised visual representation learning," arXiv:1708.02901, 2017.
[49] P. Diaconis, "What is a random matrix?" Notices of the AMS, vol. 52, no. 11, pp. 1348–1349, 2005.
[50] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv:1605.07146, 2016.
[51] L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in Proc. ICCV, 2007.
[52] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proc. ICCV, 2011.
[53] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. CVPR, 2009.
[54] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "Sun database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010.
[55] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, 2007.
[56] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. CVPR, 2016.
Bolei Zhou is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology. He received an M.Phil. degree in Information Engineering from the Chinese University of Hong Kong and a B.Eng. degree in Biomedical Engineering from Shanghai Jiao Tong University in 2010. His research interests are computer vision and machine learning. He is an award recipient of the Facebook Fellowship, the Microsoft Research Asia Fellowship, and the MIT Greater China Fellowship.
David Bau is a Ph.D. student at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received an A.B. in Mathematics from Harvard in 1992 and an M.S. in Computer Science from Cornell in 1994. He coauthored a textbook on numerical linear algebra. He was a software engineer at Microsoft and Google and developed ranking algorithms for Google Image Search. His research interest is interpretable machine learning.
Aude Oliva is a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). After a French baccalaureate in Physics and Mathematics, she received two M.Sc. degrees and a Ph.D. in Cognitive Science from the Institut National Polytechnique de Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004 and CSAIL in 2012. Her research on vision and memory is cross-disciplinary, spanning human perception and cognition, computer vision, and human neuroscience. She received the 2006 National Science Foundation (NSF) Career award, the 2014 Guggenheim fellowship, and the 2016 Vannevar Bush fellowship.
Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he was a postdoctoral researcher at the Department of Brain and Cognitive Sciences and the Computer Science and Artificial Intelligence Laboratory, MIT. He is now a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT). Prof. Torralba is an Associate Editor of the International Journal of Computer Vision and served as program chair for the Computer Vision and Pattern Recognition conference in 2015. He received the 2008 National Science Foundation (NSF) Career award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).