
Interpreting Deep Visual Representations via Network Dissection

Bolei Zhou*, David Bau*, Aude Oliva, and Antonio Torralba

Abstract—The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. However, CNNs are often criticized as being black boxes that lack interpretability, since they have millions of unexplained model parameters. In this work, we describe Network Dissection, a method that interprets networks by providing labels for the units of their deep visual representations. The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and a set of visual semantic concepts. By identifying the best alignments, units are given human-interpretable labels across a range of objects, parts, scenes, textures, materials, and colors. The method reveals that deep representations are more transparent and interpretable than expected: we find that representations are significantly more interpretable than they would be under a random equivalently powerful basis. We apply the method to interpret and compare the latent representations of various network architectures trained to solve different supervised and self-supervised training tasks. We then examine factors affecting network interpretability such as the number of training iterations, regularizations, different initializations, and the network depth and width. Finally we show that the interpreted units can be used to provide explicit explanations of a prediction given by a CNN for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure.

Index Terms—Convolutional Neural Networks, Network Interpretability, Visual Recognition, Interpretable Machine Learning.


1 INTRODUCTION

OBSERVATIONS of hidden units in large deep neural networks have revealed that human-interpretable concepts sometimes emerge as individual latent variables within those networks. For example, object detector units emerge within networks trained to recognize places [1]; part detectors emerge in object classifiers [2]; and object detectors emerge in generative video networks [3]. This internal structure has appeared in situations where the networks are not constrained to decompose problems in any interpretable way.

The emergence of interpretable structure suggests that deep networks may be learning disentangled representations spontaneously. While it is commonly understood that a network can learn an efficient encoding that makes economical use of hidden variables to distinguish the input, the appearance of a disentangled representation is not well understood. A disentangled representation aligns its variables with a meaningful factorization of the underlying problem structure, and encouraging disentangled representations is a significant area of research [4]. If the internal representation of a deep network is partly disentangled, one possible path for understanding its mechanisms is to detect disentangled structure, and simply read out the human-interpretable factors.

We address the following three key issues about deep visual representations in this work:

• What is a disentangled representation of neural networks, and how can its factors be quantified and detected?

• Do interpretable hidden units reflect a special alignment of feature space, or are interpretations a chimera?

• What differences in network architectures, data sources, and training conditions lead to internal representations with greater or lesser entanglement?

• B. Zhou and D. Bau contributed equally to this work.
• B. Zhou, D. Bau, A. Oliva, and A. Torralba are with CSAIL, MIT, MA, 02139. E-mail: { }

To examine these issues, we propose a general analytic framework, Network Dissection, for interpreting deep visual representations and quantifying their interpretability. Using Broden, a broadly and densely labeled dataset, our framework identifies hidden units' semantics for any given CNN, then aligns them with human-interpretable concepts.

Building upon the preliminary result published in [5], we begin with a detailed description of the methodology of Network Dissection. We use the method to interpret a variety of deep visual representations trained with different network architectures (AlexNet, VGG, GoogLeNet, ResNet, DenseNet) and supervisions (supervised training on ImageNet for object recognition and on Places for scene recognition, along with various self-taught supervision tasks). We show that interpretability is an axis-aligned property of a representation that can be destroyed by rotation without affecting discriminative power. We further examine how interpretability is affected by different training datasets, training regularizations such as dropout [6] and batch normalization [7], and fine-tuning between different data sources. Our experiments reveal that units emerge as semantic detectors in the intermediate layers of most deep visual representations, while the degree of interpretability can vary widely across changes in architecture and training. We conclude that representations learned by deep networks are more interpretable than previously thought, and that measurements of interpretability provide insights about the structure of deep visual representations that are not revealed by their classification power alone.¹

1.1 Related Work

Visualizing deep visual representations. Though CNN models are notoriously known as black boxes, a growing number of

1. Code, data, and more dissection results are available at the project page http://netdissect.csail.mit.edu/.


techniques have been developed to visualize the internal representations of convolutional neural networks. The behavior of a CNN can be visualized by sampling image patches that maximize activation of hidden units [1], [8], [9], or by using variants of backpropagation to identify or generate salient image features [8], [10], [11]. Back-propagation together with a natural image prior can be used to invert a CNN layer activation [12], and an image generation network can be trained to invert the deep features by synthesizing the input images [13]. [14] further synthesizes the prototypical images for individual units by learning a feature code for the image generation network from [13]. These visualizations reveal the image patterns that have been learned in a deep visual representation and provide a qualitative guide to the interpretation and interpretability of units. In [1], a quantitative measure of interpretability was introduced: human evaluation of visualizations to determine which individual units behave as object detectors in a network trained to classify scenes. However, human evaluation is not scalable to increasingly large networks such as ResNet [15], with more than 100 layers. Therefore the aim of the present work is to develop a scalable method to go from qualitative visualization to quantitative interpretation.

Analyzing the properties of deep visual representations. Various intrinsic properties of deep visual representations have been explored. Much research has focused on studying the power of CNN layer activations to be used as generic visual features for classification [16], [17]. The transferability of activations for a variety of layers has been analyzed, and it has been found that higher layer units are more specialized to the target task [18]. Susceptibility to adversarial input reveals that discriminative CNN models are fooled by particular image patterns [19], [20]. Analysis of the correlation between differently initialized networks reveals that many units converge to the same set of representations after training [21]. The question of how representations generalize has been investigated by showing that a CNN can easily fit a random labeling of training data even under explicit regularization [22]. Our work focuses on another, less explored property of deep visual representations: interpretability.

Unsupervised learning of deep visual representations. Recent work on unsupervised learning or self-supervised learning exploits the correspondence structure that comes for free from unlabeled images to train networks from scratch [23], [24], [25], [26], [27]. For example, a CNN can be trained by predicting image context [23], by colorizing gray images [28], [29], by solving image puzzles [24], and by associating images with ambient sounds [30]. The resulting deep visual representations learned from different unsupervised learning tasks are compared by evaluating them as generic visual features on classification datasets such as Pascal VOC. Our work provides an alternative approach to compare deep visual representations in terms of their interpretability, beyond just their discriminative power.

2 FRAMEWORK OF NETWORK DISSECTION

The notion of a disentangled representation rests on the human perception of what it means for a concept to be mixed up. Therefore we define the interpretability of a deep visual representation in terms of the degree of alignment with a set of human-interpretable concepts.

Our quantitative measurement of interpretability for deep visual representations proceeds in three steps:

1) Identify a broad set of human-labeled visual concepts.

TABLE 1
Statistics of each label type included in the dataset.

Category   Classes   Sources                          Avg. samples
scene      468       ADE [32]                         38
object     584       ADE [32], Pascal-Context [34]    491
part       234       ADE [32], Pascal-Part [35]       854
material   32        OpenSurfaces [33]                1,703
texture    47        DTD [36]                         140
color      11        Generated                        59,250

2) Gather the response of the hidden variables to known concepts.
3) Quantify alignment of hidden variable–concept pairs.

This three-step process of network dissection is reminiscent of the procedures used by neuroscientists to understand similar representation questions in biological neurons [31]. Since our purpose is to measure the level to which a representation is disentangled, we focus on quantifying the correspondence between a single latent variable and a visual concept.

In a fully interpretable local coding such as a one-hot encoding, each variable will match exactly with one human-interpretable concept. Although we expect a network to learn partially nonlocal representations in interior layers [4], and past experience shows that an emergent concept will often align with a combination of several hidden units [2], [17], our present aim is to assess how well a representation is disentangled. Therefore we measure the alignment between single units and single interpretable concepts. This does not gauge the discriminative power of the representation; rather it quantifies its disentangled interpretability. As we will show in Sec. 3.2, it is possible for two representations of perfectly equivalent discriminative power to have very different levels of interpretability.

To assess the interpretability of any given CNN, we draw concepts from a new broadly and densely labeled image dataset that unifies labeled visual concepts from a heterogeneous collection of labeled data sources, described in Sec. 2.1. We then measure the alignment of each hidden unit of the CNN with each concept by evaluating the feature activation of each individual unit as a segmentation model for each concept. To quantify the interpretability of a layer as a whole, we count the number of distinct visual concepts that are aligned with a unit in the layer, as detailed in Sec. 2.2.

2.1 Broden: Broadly and Densely Labeled Dataset

To be able to ascertain alignment with both low-level concepts such as colors and higher-level concepts such as objects, we have assembled a new heterogeneous dataset.

The Broadly and Densely Labeled Dataset (Broden) unifies several densely labeled image datasets: ADE [32], OpenSurfaces [33], Pascal-Context [34], Pascal-Part [35], and the Describable Textures Dataset [36]. These datasets contain examples of a broad range of objects, scenes, object parts, textures, and materials in a variety of contexts. Most examples are segmented down to the pixel level except textures and scenes, which are given for full images. In addition, every image pixel in the dataset is annotated with one of the eleven common color names according to the human perceptions classified by van de Weijer [37]. Samples of the types of labels in the Broden dataset are shown in Fig. 1.

The purpose of Broden is to provide a ground truth set of exemplars for a broad set of visual concepts. The concept labels in Broden are normalized and merged from their original datasets so


Fig. 1. Samples from the Broden Dataset. The ground truth for each concept is a pixel-wise dense annotation. (Examples shown: red and yellow (color); wrinkled and meshed (texture); wood and fabric (material); foot and door (part); airplane and waterfall (object); art studio and beach (scene).)

Fig. 2. Scoring unit interpretability by evaluating the unit for semantic segmentation. (Panels: top activated images; segmented images using the binarized unit activation map; semantic segmentation annotations; segmented annotations.)

that every class corresponds to an English word. Labels are merged based on shared synonyms, disregarding positional distinctions such as 'left' and 'top' and avoiding a blacklist of 29 overly general synonyms (such as 'machine' for 'car'). Multiple Broden labels can apply to the same pixel: for example, a black pixel that has the Pascal-Part label 'left front cat leg' has three labels in Broden: a unified 'cat' label representing cats across datasets; a similar unified 'leg' label; and the color label 'black'. Only labels with at least 10 image samples are included. Table 1 shows the number of classes per dataset and the average number of image samples per label class. In total, 1,197 visual concept classes are included.
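As an illustration of this normalization step, the following minimal Python sketch strips positional qualifiers, merges synonyms, drops blacklisted labels, and keeps only concepts with at least 10 samples. The synonym map, blacklist entries, and function names here are hypothetical placeholders, not the actual curated lists used to build Broden.

```python
from collections import Counter

# Hypothetical stand-ins for the curated lists used to build Broden.
POSITIONAL_PREFIXES = ("left ", "right ", "top ", "bottom ", "front ", "back ")
SYNONYMS = {"automobile": "car"}   # illustrative synonym merge only
BLACKLIST = {"machine"}            # e.g. too general a synonym for "car"

def normalize_label(raw_label):
    """Strip positional qualifiers, merge synonyms, and drop blacklisted labels."""
    label = raw_label.lower().strip()
    stripped = True
    while stripped:
        stripped = False
        for prefix in POSITIONAL_PREFIXES:
            if label.startswith(prefix):
                label = label[len(prefix):]
                stripped = True
    label = SYNONYMS.get(label, label)
    return None if label in BLACKLIST else label

def merge_labels(samples_per_raw_label, min_samples=10):
    """Merge raw dataset labels into unified concepts with >= min_samples images."""
    counts = Counter()
    for raw_label, n_images in samples_per_raw_label.items():
        unified = normalize_label(raw_label)
        if unified is not None:
            counts[unified] += n_images
    return {concept: n for concept, n in counts.items() if n >= min_samples}
```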

2.2 Scoring Unit Interpretability

The proposed network dissection method evaluates every individual convolutional unit in a CNN as a solution to a binary segmentation task for every visual concept in Broden, as illustrated in Fig. 3. Our method can be applied to any CNN using a forward pass, without the need for training or backpropagation.

For every input image x in the Broden dataset, the activation map A_k(x) of every internal convolutional unit k is collected. Then the distribution of individual unit activations a_k is computed. For each unit k, the top quantile level T_k is determined such that P(a_k > T_k) = 0.005 over every spatial location of the activation map in the dataset.
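The per-unit threshold T_k can be read directly off the empirical activation distribution. Below is a minimal numpy sketch, assuming the activations of unit k over all spatial locations of all Broden images have been gathered into one array; an actual implementation would stream over the dataset rather than materializing everything in memory.

```python
import numpy as np

def top_quantile_threshold(unit_activations, quantile=0.005):
    """Return T_k such that P(a_k > T_k) = quantile over all spatial locations.

    `unit_activations` holds unit k's activation at every spatial location of
    every image in the dataset (any shape; it is flattened here).
    """
    a = np.asarray(unit_activations, dtype=np.float64).ravel()
    # The (1 - quantile)-quantile of the empirical distribution of a_k.
    return np.quantile(a, 1.0 - quantile)
```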

To compare a low-resolution unit's activation map to the input-resolution annotation mask L_c for some concept c, the activation map A_k(x) is scaled up to the mask resolution, giving S_k(x), using bilinear interpolation, anchoring interpolants at the center of each unit's receptive field.

S_k(x) is then thresholded into a binary segmentation M_k(x) ≡ S_k(x) ≥ T_k, selecting all regions for which the activation exceeds the threshold T_k. These segmentations are evaluated against every concept c in the dataset by computing the intersections M_k(x) ∩ L_c(x) for every (k, c) pair.

The score of each unit k as a segmentation for concept c is reported as the intersection over union across all the images in the dataset,

\mathrm{IoU}_{k,c} = \frac{\sum_x |M_k(x) \cap L_c(x)|}{\sum_x |M_k(x) \cup L_c(x)|},   (1)

where |·| is the cardinality of a set. Because the dataset contains some types of labels which are not present on some subsets of inputs, the sums are computed only on the subset of images that have at least one labeled concept of the same category as c. The value of IoU_{k,c} is the accuracy of unit k in detecting concept c; we consider one unit k as a detector for concept c if IoU_{k,c} exceeds a threshold. Our qualitative results are insensitive to the IoU threshold: different thresholds denote different numbers of units as concept detectors across all the networks, but relative orderings remain stable. For our comparisons we report a detector if IoU_{k,c} > 0.04. Note that one unit might be the detector for multiple concepts; for the purpose of our analysis, we choose the top-ranked label. To quantify the interpretability of a layer, we count the number of unique concepts aligned with units. We call this the number of unique detectors.
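The scoring itself reduces to accumulating intersections and unions per (unit, concept) pair and then thresholding the resulting IoU scores. The sketch below assumes the upsampled activation map S_k(x) and the concept mask L_c(x) are numpy arrays at the same resolution; the restriction of the sums to images containing at least one concept of the same category as c is omitted for brevity.

```python
import numpy as np

def update_iou_sums(sum_inter, sum_union, S_k, L_c, T_k):
    """Accumulate |M_k ∩ L_c| and |M_k ∪ L_c| for one image and one (k, c) pair."""
    M_k = S_k >= T_k                                   # binarized unit segmentation
    sum_inter += np.logical_and(M_k, L_c).sum()
    sum_union += np.logical_or(M_k, L_c).sum()
    return sum_inter, sum_union

def count_unique_detectors(iou, threshold=0.04):
    """Count unique concepts detected by a layer.

    `iou` is a (num_units, num_concepts) array of IoU_{k,c}; each unit is assigned
    its top-ranked concept and counted as a detector if that IoU exceeds the threshold.
    """
    top_concept = iou.argmax(axis=1)
    top_iou = iou.max(axis=1)
    return len({int(c) for c, v in zip(top_concept, top_iou) if v > threshold})
```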

Figure 2 summarizes the whole process of scoring unit interpretability: by segmenting the annotation mask using the receptive field of units for the top activated images, we compute the IoU for each concept. The IoU evaluating the quality of the segmentation of a unit is an objective confidence score for interpretability that is comparable across networks. Thus this score enables us to compare the interpretability of different representations and lays the basis for the experiments below. Note that network dissection works only as well as the underlying dataset: if a unit matches a human-understandable concept that is absent in Broden, then it will not score well for interpretability. Future versions of Broden will be expanded to include more kinds of visual concepts.

3 INTERPRETING DEEP VISUAL REPRESENTATIONS

For testing we prepare a collection of CNN models with different network architectures and supervision of primary tasks, as listed in Table 2. The network architectures include AlexNet [38], GoogLeNet [39], VGG [40], ResNet [15], and DenseNet [41]. For supervised training, the models are trained from scratch (i.e., not pretrained) on ImageNet [42], Places205 [43], and Places365 [44]. ImageNet is an object-centric dataset, which contains 1.2


Fig. 3. Illustration of network dissection for measuring semantic alignment of units in a given CNN. Here one unit of the last convolutional layer of a given CNN is probed by evaluating its performance on various segmentation tasks. Our method can probe any convolutional layer. (Pipeline: input image → network being probed, with trained weights frozen → upsample target layer → pixel-wise segmentation of one unit's activation → evaluate on segmentation tasks such as blue, fabric, door, grass, person, car.)

TABLE 2
Tested CNN Models

Training     Network        Dataset or task
none         AlexNet        random
Supervised   AlexNet        ImageNet, Places205, Places365, Hybrid
Supervised   GoogLeNet      ImageNet, Places205, Places365
Supervised   VGG-16         ImageNet, Places205, Places365, Hybrid
Supervised   ResNet-152     ImageNet, Places365
Supervised   DenseNet-161   ImageNet, Places365
Self         AlexNet        context, puzzle, egomotion, tracking, moving, videoorder, audio, crosschannel, colorization, objectcentric, transinv

million images from 1000 classes. Places205 and Places365 are two subsets of the Places Database, which is a scene-centric dataset with categories such as kitchen, living room, and coast. Places205 contains 2.4 million images from 205 scene categories, while Places365 contains 1.6 million images from 365 scene categories. "Hybrid" refers to a combination of ImageNet and Places365. For self-supervised training tasks, we select several recent models trained on predicting context (context) [23], solving puzzles (puzzle) [24], predicting ego-motion (egomotion) [25], learning by moving (moving) [26], predicting video frame order (videoorder) [45] or tracking (tracking) [27], detecting object-centric alignment (objectcentric) [46], colorizing images (colorization) [28], inpainting (contextencoder) [47], cross-channel prediction (crosschannel) [29], predicting ambient sound from frames (audio) [30], and tracking invariant patterns in videos (transinv) [48]. The self-supervised models we analyze are comparable to each other in that they all use AlexNet or an AlexNet-derived architecture, with the one exception of transinv [48], which uses VGG as the base network.

In the following experiments, we begin by validating our method using human evaluation. Then, we use random unitary rotations of a learned representation to test whether interpretability of CNNs is an axis-independent property; we find that it is not, and we conclude that interpretability is not an inevitable result of the discriminative power of a representation. Next, we analyze all the convolutional layers of AlexNet as trained on ImageNet [38] and as trained on Places [43], and confirm that our method reveals detectors for higher-level concepts at higher layers and lower-level concepts at lower layers; and that more detectors for higher-level concepts emerge under scene training. Then, we show that different network architectures such as AlexNet, VGG, and ResNet yield different interpretability, while differently supervised

training tasks and self-supervised training tasks also yield a variety of levels of interpretability. Additionally we show the impact of different training conditions, examine the relationship between discriminative power and interpretability, and investigate a possible way to improve the interpretability of CNNs by increasing their width. Finally we utilize the interpretable units as explanatory factors for the predictions given by a CNN.

3.1 Human Evaluation of Interpretations

Using network dissection, we analyze the interpretability of units within all the convolutional layers of Places-AlexNet and ImageNet-AlexNet, then compare with human interpretation. Places-AlexNet is trained for scene classification on Places205 [43], while ImageNet-AlexNet is the identical architecture trained for object classification on ImageNet [38].

Our evaluation was done by raters on Amazon Mechanical Turk (AMT). As a baseline description of unit semantics, we used human-written descriptions of each unit from [1]. These descriptions were collected by asking raters to write words or short phrases to describe the common meaning or pattern selected by each unit, based on a visualization of the top image patches. Three descriptions and a confidence were collected for each unit. As a canonical description we chose the most common description of a unit (when raters agreed), and the highest-confidence description (when raters did not agree). Some units may not be interpretable. To identify these, raters were shown the canonical descriptions of visualizations and asked whether they were descriptive. Units with validated descriptions are taken as the set of interpretable units.

To compare these baseline descriptions to network-dissection-derived labels, we ran the following experiment. Raters were shown a visualization of the top image patches for an interpretable unit, along with a word or short phrase description, and they were asked to vote (yes/no) whether the given phrase was descriptive of most of the image patches. The baseline human-written descriptions were randomized with the labels derived using network dissection, and the origin of the labels was not revealed to the raters.

Table 3 summarizes the results. The number of interpretable units is shown for each layer, and average positive votes for descriptions of interpretable units are shown, both for human-written labels and network-dissection-derived labels. Human labels are most highly consistent for units of conv5, suggesting that humans have no trouble identifying high-level visual concept detectors, while lower-level detectors are more difficult to label. Similarly, labels given by network dissection are best at conv5, and are found to be less descriptive for lower layers.


Fig. 4. The annotation interface used by human raters on Amazon Mechanical Turk. Raters are shown descriptive text in quotes together with fifteen images, each with highlighted patches, and must evaluate whether the quoted text is a good description for the highlighted patches.

TABLE 3
Human evaluation of our Network Dissection approach.

                      conv1    conv2     conv3     conv4     conv5
Interpretable units   57/96    126/256   247/384   258/384   194/256
Human consistency     82%      76%       83%       82%       91%
Network Dissection    37%      56%       54%       59%       71%

Comparison of the human interpretation and the labels predicted by network dissection is plotted in Fig. 5. A sample of units is shown together with both automatically inferred interpretations and manually assigned interpretations taken from [1]. We can see that the predicted labels match the human annotation well, though sometimes they capture a different description of a visual concept, such as the 'crosswalk' predicted by the algorithm compared to 'horizontal lines' given by the human for the third unit in conv4 of Places-AlexNet in Fig. 5. Confirming intuition, color and texture concepts dominate at lower layers conv1 and conv2 while more object and part detectors emerge in conv5.

3.2 Measurement of Axis-aligned Interpretability

We conduct an experiment to determine whether it is meaningful to assign an interpretable concept to an individual unit. Two possible hypotheses can explain the emergence of interpretability in individual hidden layer units:

Hypothesis 1. Interpretability is a property of the representation as a whole, and individual interpretable units emerge because interpretability is a generic property of typical directions of representation space. Under this hypothesis, projecting to any direction would typically reveal an interpretable concept, and interpretations of single units in the natural basis would not be more meaningful than interpretations that can be found in any other direction.

Hypothesis 2. Interpretable alignments are unusual, and interpretable units emerge because learning converges to a special basis that aligns explanatory factors with individual units. In this model, the natural basis represents a meaningful decomposition learned by the network.

Hypothesis 1 is the default assumption: in the past it has been found [19] that with respect to interpretability "there is no distinction between individual high level units and random linear combinations of high level units."

Network dissection allows us to re-evaluate this hypothesis. We apply random changes in basis to a representation learned by AlexNet. Under hypothesis 1, the overall level of interpretability should not be affected by a change in basis, even as rotations cause the specific set of represented concepts to change. Under hypothesis 2, the overall level of interpretability is expected to drop under a change in basis.

We begin with the representation of the 256 convolutional units of AlexNet conv5 trained on Places205 and examine the effect of a change in basis. To avoid any issues of conditioning or degeneracy, we change basis using a random orthogonal transformation Q. The rotation Q is drawn uniformly from SO(256) by applying Gram-Schmidt to a normally distributed A \in \mathbb{R}^{256 \times 256}, writing QR = A with positive-diagonal right-triangular R, as described by [49]. Interpretability is summarized as the number of unique visual concepts aligned with units, as defined in Sec. 2.2.

Denoting AlexNet conv5 as f(x), we find that the number of unique detectors in Qf(x) is 80% fewer than the number of unique detectors in f(x). Our finding is inconsistent with hypothesis 1 and consistent with hypothesis 2.

We also test smaller perturbations of basis using Q^α for 0 ≤ α ≤ 1, where the fractional powers Q^α ∈ SO(256) are chosen to form a minimal geodesic gradually rotating from I to Q; these intermediate rotations are computed using a Schur decomposition. Fig. 6 shows that the interpretability of Q^α f(x) decreases as larger rotations are applied. Fig. 7 shows some examples of the linearly combined units.
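A rotation with these properties can be sampled and interpolated in a few lines of numpy/scipy. The sketch below follows the construction described above; as a simplifying assumption it forms the geodesic through a matrix logarithm and exponential rather than an explicit Schur decomposition, and it discards the tiny imaginary round-off that logm can introduce.

```python
import numpy as np
from scipy.linalg import logm, expm

def random_rotation(dim, rng=None):
    """Draw Q from SO(dim): QR-factorize a Gaussian matrix, fix signs of diag(R)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((dim, dim))
    Q, R = np.linalg.qr(A)
    Q = Q @ np.diag(np.sign(np.diag(R)))   # make R's diagonal positive
    if np.linalg.det(Q) < 0:               # reflect one column to land in SO(dim)
        Q[:, 0] = -Q[:, 0]
    return Q

def fractional_rotation(Q, alpha):
    """Q**alpha along the geodesic from the identity (alpha=0) to Q (alpha=1)."""
    return np.real(expm(alpha * logm(Q)))
```

Sweeping alpha over a grid and re-scoring the rotated features with network dissection would correspond to the kind of interpolation plotted in Fig. 6.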

Each rotated representation has exactly the same discriminative power as the original layer. Writing the original network as g(f(x)), note that g'(r) ≡ g(Q^T r) defines a neural network that processes the rotated representation r = Qf(x) exactly as the original g operates on f(x). We conclude that interpretability is neither an inevitable result of discriminative power, nor is it a prerequisite to discriminative power. Instead, we find that interpretability is a different quality that must be measured separately to be understood.
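As a concrete check of this equivalence, the toy numpy example below composes a stand-in linear head with the inverse rotation and verifies that it yields identical outputs on the rotated features; the shapes and the linear head are placeholder assumptions, not the networks used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
f_x = rng.standard_normal((8, 256))      # stand-in for a batch of conv5 features f(x)
W = rng.standard_normal((256, 10))       # stand-in for the original head g (linear here)
Q, _ = np.linalg.qr(rng.standard_normal((256, 256)))   # an orthogonal change of basis

def g(features):                          # original head applied to f(x)
    return features @ W

def g_prime(r):                           # g'(r) = g(Q^T r), in row-vector convention
    return g(r @ Q)

r = f_x @ Q.T                             # rotated representation r = Q f(x)
np.testing.assert_allclose(g_prime(r), g(f_x), atol=1e-8)   # identical predictions
```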

We repeat the complete rotation (α = 1) on AlexNet trained on Places365 and on ImageNet 10 times each; the results are shown in Fig. 8. We observe a drop in interpretability for both networks, with a larger drop for AlexNet trained on Places365. This is because the interpretability of AlexNet on Places365 is higher to begin with, so the random rotation has more to destroy.

3.3 Network Architectures with Supervised Learning

How do different network architectures affect the disentangled interpretability of the learned representations? We apply network dissection to evaluate a range of network architectures trained on ImageNet and Places. For simplicity, the following experiments focus on the last convolutional layer of each CNN, where semantic detectors emerge most frequently.

Results showing the number of unique detectors that emerge from various network architectures trained on ImageNet and Places are plotted in Fig. 9. In terms of network architecture, we find that interpretability of ResNet > DenseNet > VGG > GoogLeNet > AlexNet. Deeper architectures usually appear to allow greater interpretability, though individual layer structure is


Fig. 5. Comparison of the interpretability of all five convolutional layers of AlexNet, as trained on classification tasks for Places (top) and ImageNet (bottom). Four examples of units in each layer are shown with identified semantics. The segmentation generated by each unit is shown on the three Broden images with highest activation. Top-scoring labels are shown above to the left, and human-annotated labels are shown above to the right. Some disagreement can be seen for the dominant judgment of meaning. For example, human annotators mark the first conv4 unit on Places as a 'windows' detector, while the algorithm matches the 'chequered' texture.

Fig. 6. Interpretability over changes in basis of the representation of AlexNet conv5 trained on Places. The vertical axis shows the number of unique interpretable concepts that match a unit in the representation. The horizontal axis shows α, which quantifies the degree of rotation.

different across architectures. Comparing training datasets, we find Places > ImageNet. As discussed in [1], one scene is composed of multiple objects, so it may be beneficial for more object detectors to emerge in CNNs trained to recognize scenes.

Fig. 10 shows the histogram of object detectors identified inside ResNet and DenseNet trained on ImageNet and Places respectively. DenseNet161-Places365 has the largest number of unique object detectors among all the networks. The emergent detectors differ across both training data source and architecture. The most frequent object detectors in the two networks trained on ImageNet are dog detectors, because there are more than 100 dog categories out of the 1000 classes in the ImageNet training set.

Fig. 11 shows examples of object detectors grouped by object category. For the same object category, the visual appearance of the detector unit varies not only within the same network but also across different networks. DenseNet and ResNet have particularly good detectors for bus and airplane, with IoU above 0.25.

Fig. 7. Visualizations of the best single-unit concept detectors of five concepts taken from individual units of AlexNet conv5 trained on Places (left), compared with the best linear-combination detectors of the same concepts taken from the same representation under a random rotation (right). In each case, the highest IoU scoring unit that matches the given concept is shown. For most concepts, both the IoU and the visualization of the top activating image patches confirm that individual units match concepts better than linear combinations. In other cases (e.g., head detectors), the visualization of the linear combination appears highly consistent, but the IoU score reveals lower consistency when evaluated over the whole dataset. (Examples: car, single unit IoU 0.16 vs. combination 0.06; skyscraper, 0.16 vs. 0.05; tree, 0.10 vs. 0.02; head, 0.09 vs. 0.02; closet, 0.06 vs. 0.02.)

Fig. 8. Complete rotation (α = 1) repeated on AlexNet trained on Places365 and ImageNet respectively. Rotation reduces the interpretability significantly for both of the networks.


Fig. 9. Interpretability across different architectures trained on ImageNet and Places. (Networks, ordered by number of unique detectors: ResNet152-Places365, DenseNet161-Places365, ResNet152-ImageNet, DenseNet161-ImageNet, VGG-Places205, VGG-Hybrid, VGG-Places365, GoogLeNet-Places365, GoogLeNet-Places205, GoogLeNet-ImageNet, VGG-ImageNet, AlexNet-Places365, AlexNet-Hybrid, AlexNet-Places205, AlexNet-ImageNet, AlexNet-random.)

Fig. 3.3 showcases the unique interpretable units of all types on a variety of networks.

Fig. 13 shows the unique interpretable detectors over different layers for different network architectures trained on Places365. We observe that more object and scene detectors emerge at the higher layers across all architectures: AlexNet, VGG, GoogLeNet, and ResNet. This suggests that representation ability increases over layer depth. Because of the compositional structure of the CNN layers, the deeper layers should have higher capacity to represent concepts with larger visual complexity such as objects and scene parts. Our measurements confirm this, and we conclude that higher network depth encourages the emergence of visual concepts with higher semantic complexity.

3.4 Representations from Self-supervised Learning

Recently, many works have explored a novel paradigm for unsupervised learning of CNNs without using millions of annotated images, namely self-supervised learning. For example, [23] trains deep CNNs to predict the neighborhoods of two image patches, while [28] trains networks by colorizing images. In total, we investigate 12 networks trained for different self-supervised learning tasks. How do different supervisions affect these internal representations?

Here we compare the interpretability of the deep visual representations resulting from self-supervised learning and supervised learning. We keep the network architecture the same as AlexNet for each model (one exception is the recent model transinv, which uses VGG as the base network). Results are shown in Fig. 14. We observe that training on Places365 creates the largest number of unique detectors. Self-supervised models create many texture detectors but relatively few object detectors; apparently, supervision from a self-taught primary task is much weaker at inferring interpretable concepts than supervised training on a large annotated dataset. The form of self-supervision makes a difference: for example, the colorization model is trained on colorless images, and almost no color detection units emerge. We hypothesize that emergent units represent concepts required to solve the primary task.

Fig. 15 shows some typical visual detectors identified in the self-supervised CNN models. For the models audio and puzzle, some object and part detectors emerge. Those detectors may be useful for CNNs to solve the primary tasks: the audio model is trained to associate objects with a sound source, so it may be useful to recognize people and cars; while the puzzle model is trained to align the different parts of objects and scenes in an image. For colorization and tracking, recognizing textures might be good enough for the CNN to solve primary tasks such as colorizing a desaturated natural image; thus it is unsurprising that the texture detectors dominate.

3.5 Training Conditions

Training conditions such as the number of training iterations, dropout [6], batch normalization [7], and random initialization [21] are known to affect the representation learning of neural networks. To analyze the effect of training conditions on interpretability, we take the Places205-AlexNet as the baseline model and prepare several variants of it, all using the same AlexNet architecture. For the variants Repeat1, Repeat2 and Repeat3, we randomly initialize the weights and train them with the same number of iterations. For the variant NoDropout, we remove the dropout in the FC layers of the baseline model. For the variant BatchNorm, we apply batch normalization at each convolutional layer of the baseline model. Repeat1, Repeat2, and Repeat3 all have nearly the same top-1 accuracy of 50.0% on the validation set. The variant without dropout has top-1 accuracy 49.2%. The variant with batch norm has top-1 accuracy 50.5%.

Fig. 16 shows the interpretability of units in the CNNs over different training conditions. We find several effects: 1) Comparing different random initializations, the models converge to similar levels of interpretability, both in terms of the unique detector number and the total detector number; this matches observations of convergent learning discussed in [21]. 2) For the network without dropout, more texture detectors emerge but fewer object detectors. 3) Batch normalization seems to decrease interpretability significantly.

The batch normalization result serves as a caution that discriminative power is not the only property of a representation that should be measured. Our intuition for the loss of interpretability under batch normalization is that batch normalization 'whitens' the activation at each layer, which smooths out scaling issues and allows a network to easily rotate axes of intermediate representations during training. While whitening apparently speeds training, it may also have an effect similar to the random rotations analyzed in Sec. 3.2, which destroy interpretability. As discussed in Sec. 3.2, however, interpretability is neither a prerequisite nor an obstacle to discriminative power. Finding ways to capture the benefits of batch normalization without destroying interpretability is an important area for future work.

Fig. 17 plots the interpretability of snapshots of the baseline model at different training iterations along with the accuracy on the validation set. We can see that object detectors and part detectors begin emerging at about 10,000 iterations (each iteration processes a batch of 256 images). We do not find evidence of transitions across different concept categories during training. For example, units in conv5 do not turn into texture or material detectors before becoming object or part detectors. In Fig. 18, we keep track of six units over different training iterations. We observe that some units start converging to their semantic concepts at an early stage. For example, unit138 starts detecting 'mountain snowy' as early as iteration 2446. We also observe that units evolve over time: unit74 and unit108 detect road first, before they start detecting car and airplane respectively.


Fig. 10. Histogram of the object detectors from ResNet and DenseNet trained on ImageNet and Places respectively. (Panels: ResNet-152 (Places), DenseNet (Places), ResNet-152 (ImageNet), DenseNet (ImageNet).)

Fig. 11. Comparison of several visual concept detectors identified by network dissection in DenseNet, ResNet, GoogLeNet, VGG, and AlexNet. Each network is trained on Places365. The two highest-IoU matches among the convolutional units of each network are shown, for the concepts closet, dog, plant, bus, and airplane. The segmentation generated by each unit is shown on the four maximally activating Broden images. Some units activate on concept generalizations, e.g., GoogLeNet 4e's unit 225 on horses and dogs, and 759 on white ellipsoids and jets.


[Figure (continued): per-network histograms of the unique interpretable detectors of each type (objects, scenes, parts, materials, textures, and colors) identified by network dissection, shown for ResNet-152 (Places) and the other tested networks.]

perforatedgrooved

fibrousgrid

lacelikescaly

porousmeshed

swirlypleated

gauzycrystalline

crosshatchedmattedpaisley

studdedpitted

bubblysprinkled

bumpymarbledbraided

red 279

160 scenes

14 parts

4 materials

39 textures1 color

Googlenet (Places)

VGG­16 (Places)

AlexNet­GAPWide (Places)

Alexnet (Places)

Alexnet (Imagenet)

Alexnet (Video Tracking)

Alexnet (Ambient Sound)

Alexnet (Puzzle Solving)

Alexnet (Egomotion)

Fig.12.Com

parisonofunique

detectorsofalltypes

ona

varietyofarchitectures.M

oreresults

areatthe

projectpage.


Fig. 13. Comparison of the interpretability of the layers for AlexNet, VGG16, GoogLeNet, and ResNet152 trained on Places365 and on ImageNet (panels: AlexNet on Places365/ImageNet, VGG16 on Places365/ImageNet, GoogLeNet on Places365/ImageNet, ResNet152 on Places365/ImageNet). All five conv layers of AlexNet and the selected layers of VGG, GoogLeNet, and ResNet are included.

Fig. 14. Semantic detectors emerge across different supervision of the primary training task. All these models use the AlexNet architecture and are tested at conv5.

audio: chequered (texture) 0.102, car (object) 0.063, head (part) 0.061
puzzle: head (part) 0.091, perforated (texture) 0.085, sky (object) 0.069
colorization: dotted (texture) 0.140, head (part) 0.056, sky (object) 0.048
tracking: chequered (texture) 0.167, grass (object) 0.120, red-c (color) 0.100

Fig. 15. The top ranked concepts in the three top categories in four self-supervised networks. Some object and part detectors emerge in audio. Detectors for person heads also appear in puzzle and colorization. A variety of texture concepts dominate models with self-supervised training.

3.6 Transfer Learning between Places and ImageNet

Fine-tuning a pre-trained network to a target domain is commonly used in transfer learning. The deep features from the pre-trained network generalize well across different domains. The pre-trained network also makes training converge faster and results in better accuracy, especially if there is not enough training data in the target domain. Here we analyze what happens inside

Fig. 16. Effect of regularizations on the interpretability of CNNs (conditions: baseline, repeat1-3, NoDropout, BatchNorm), measured by the number of detectors and the number of unique detectors.

Fig. 17. The evolution of the interpretability of conv5 of Places205-AlexNet over 3,000,000 training iterations. The accuracy on the validation set at each iteration is also plotted. The baseline model is trained to 300,000 iterations (marked at the red line).

the representation and how the interpretations of the internal units evolve during transfer learning.

We run two sets of experiments, fine-tuning Places-AlexNet to ImageNet and fine-tuning ImageNet-AlexNet to Places, to see how individual units mutate across domains. The interpretability results of the model checkpoints at different fine-tuning iterations are plotted in Fig. 19. We can see that training indeed converges faster than training from scratch on Places (Fig. 17). The interpretations of the units also change over fine-tuning: for example, the number of unique object detectors first drops and then keeps increasing for the network fine-tuned from ImageNet to Places365, while it slowly drops for the network fine-tuned from Places to ImageNet.
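To make the fine-tuning setup concrete, the following is a minimal PyTorch sketch of one plausible way to adapt an AlexNet pre-trained on one domain to the other label space; the torchvision weights, the optimizer settings, and the `target_loader` object are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_TARGET_CLASSES = 365  # e.g. fine-tuning ImageNet-AlexNet to Places365

# Start from source-domain weights and swap the final classification layer
# so it matches the target label space; all layers remain trainable.
model = models.alexnet(weights="IMAGENET1K_V1")
model.classifier[6] = nn.Linear(4096, NUM_TARGET_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in target_loader:  # hypothetical loader over the target dataset
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```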

Fig. 20 shows examples of individual units evolving in the network fine-tuned from ImageNet to Places365 and in the network fine-tuned from Places365 to ImageNet. For each network, we show six units with their interpretations at the beginning and at the end of fine-tuning. For example, in the network fine-tuned from ImageNet to Places365, unit15, which first detects white dogs, mutates to detect waterfalls; unit136 and unit144, which first detect dogs, mutate to detect horses and cows respectively, as many scene categories in Places, such as pasture and corral, contain such animals. On the other hand, in the network fine-tuned from Places365 to ImageNet, many units mutate into various kinds of dog detectors. Interestingly, though those units mutate to detect different concepts, the concepts share low-level similarities such


Fig. 18. The interpretations of units change over iterations. Each row shows the interpretation of one unit.

Fig. 19. a) Fine-tune AlexNet from ImageNet to Places365. b) Fine-tune AlexNet from Places365 to ImageNet. The number of unique detectors and the validation accuracy are plotted over training iterations.

Fig. 20. Units mutate in a) the network fine-tuned from ImageNet to Places365 and b) the network fine-tuned from Places365 to ImageNet. Six units are shown with their semantics at the beginning and at the end of the fine-tuning.

as colors and textures. In Fig. 21, we zoom into two units from each of the two fine-tuning processes and plot the history of concept evolution. We can see that some units switch their top ranked label several times before converging to a concept: unit15 in the fine-tuning from ImageNet to Places365 flipped to white and crystalline before reaching the waterfall concept. Other units switch faster; for example, unit132 in the fine-tuning from Places365 to ImageNet switches from hair to dog at an early stage of fine-tuning.

3.7 Layer Width vs. Interpretability

From AlexNet to ResNet, CNNs for visual recognition have grown deeper in the quest for higher classification accuracy. Depth has been shown to be important for high discrimination ability, and we have seen in Sec. 3.3 that interpretability can increase with depth

Fig. 21. The history of unit mutations during the fine-tuning from ImageNet to Places365 (top: unit15, unit100) and from Places365 to ImageNet (bottom: unit31, unit132).

as well. However, the width of layers (the number of units per layer) has been less explored. One reason is that increasing the number of convolutional units in a layer significantly increases computational cost while yielding only marginal improvements in classification accuracy. Nevertheless, some recent work [50] shows that a carefully designed wide residual network can achieve classification accuracy superior to the commonly used thin and deep counterparts.

To explore how the width of layers affects the interpretability of CNNs, we run a preliminary experiment testing how width affects the emergence of interpretable detectors. We remove the FC layers of AlexNet, then triple the number of units at conv5, i.e., from 256 to 768 units, to obtain AlexNet-GAP-Wide. We further triple the number of units of all the earlier conv layers except conv1 to obtain AlexNet-GAP-WideAll. Finally, we put a global average pooling layer after conv5 and fully connect the pooled 768-dimensional activations to the final class prediction, as sketched below.
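The sketch below is a minimal PyTorch rendering of the AlexNet-GAP-Wide variant, assuming the classic AlexNet channel sizes (96, 256, 384, 384, and 256 before widening conv5); the kernel and padding choices are illustrative and may differ from the original Caffe model.

```python
import torch.nn as nn

class AlexNetGAPWide(nn.Module):
    """Sketch of AlexNet-GAP-Wide: the AlexNet conv stack with conv5
    widened from 256 to 768 units, followed by global average pooling
    and a single linear classifier (the FC layers are removed)."""

    def __init__(self, num_classes=365):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),    # conv1
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),  # conv2
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv3
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv4
            nn.Conv2d(384, 768, kernel_size=3, padding=1), nn.ReLU(inplace=True), # conv5: 256 -> 768
        )
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling
        self.classifier = nn.Linear(768, num_classes)  # pooled 768-d feature -> classes

    def forward(self, x):
        x = self.features(x)
        x = self.gap(x).flatten(1)
        return self.classifier(x)
```

AlexNet-GAP-WideAll would additionally triple the widths of conv2 through conv4 in the same way.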

After training on Places365, AlexNet-GAP-Wide and AlexNet-GAP-WideAll obtain classification accuracy on the validation set similar to the standard AlexNet (about 0.5% top-1 accuracy lower and higher, respectively), but they have many more emergent unique concept detectors, at conv5 for AlexNet-GAP-Wide and at all the conv layers for AlexNet-GAP-WideAll, as shown in Fig. 22. We also increased the number of units at conv5 to 1024 and 2048, but the number of unique concepts does not increase significantly further. This may indicate a limit on the capacity of AlexNet to separate explanatory factors, or a limit on the number of disentangled concepts that are helpful for solving the primary task of scene classification.


Fig. 22. Comparison of the standard AlexNet, AlexNet-GAP-Wide, and AlexNet-GAP-WideAll. Widening the layer brings the emergence of more detectors. Networks are trained on Places365.

3.8 Discrimination vs. Interpretability

Activations from the higher layers of pre-trained CNNs are often used as generic visual features (referred to as deep features), generalizing very well to other image datasets [16], [43]. It is interesting to connect the generalization of deep visual representations as generic visual features with their interpretability.

Here we first benchmark the deep features from several networks on six image classification datasets to assess their discriminative power. For each network, we feed in the images and extract the activation at the last convolutional layer as the visual feature, then train a linear SVM with C = 0.001 on the train split and evaluate the performance on the test split, reporting the classification accuracy averaged across classes. We include event8 [51], action40 [52], indoor67 [53], sun397 [54], caltech101 [55], and caltech256 [56]. A sketch of this evaluation pipeline is shown below.
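The following is a minimal sketch of this benchmark, assuming hypothetical preprocessed tensors `train_images`, `train_labels`, `test_images`, and `test_labels` for one of the six datasets, and using the conv trunk of a torchvision AlexNet as an example backbone; spatially averaging the last conv-layer activations is a simplification of the feature extraction step.

```python
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Example backbone: the conv trunk of an ImageNet-pretrained AlexNet
# (its `features` module ends after the conv5 stage).
backbone = models.alexnet(weights="IMAGENET1K_V1").features.eval()

def extract_deep_features(images):
    """Average the last conv-layer activation maps into one vector per image."""
    with torch.no_grad():
        fmaps = backbone(images)               # (N, 256, H, W)
        return fmaps.mean(dim=(2, 3)).numpy()  # (N, 256)

# train_images/test_images: hypothetical (N, 3, 224, 224) tensors;
# train_labels/test_labels: hypothetical integer label arrays.
X_tr = extract_deep_features(train_images)
X_te = extract_deep_features(test_images)

clf = LinearSVC(C=0.001).fit(X_tr, train_labels)    # linear SVM with C = 0.001
cm = confusion_matrix(test_labels, clf.predict(X_te))
per_class_acc = cm.diagonal() / cm.sum(axis=1)      # accuracy of each class
print("class-averaged accuracy:", per_class_acc.mean())
```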

The classification accuracies on the six image datasets using the deep features are plotted in Fig. 23. We can see that the deep features from supervised trained networks perform much better than the ones from the self-supervised trained networks. Networks trained on Places have better features for scene-centric datasets such as sun397 and indoor67, while networks trained on ImageNet have better features for object-centric datasets such as caltech101 and action40.

Fig. 24 plots the number of unique object detectors for each representation against that representation's classification accuracy on three selected datasets. We can see a positive correlation between them. Thus the supervision tasks that encourage the emergence of more concept detectors may also improve the discriminative ability of deep features. Interestingly, on some of the object-centric datasets, the best discriminative representation is that of ResNet152-ImageNet, which has fewer unique object detectors than ResNet152-Places365. We hypothesize that the accuracy of a representation on a transferred task depends not only on the number of concept detectors in the representation, but also on how well those concept detectors capture the characteristics of the hidden factors in the transferred dataset.

3.9 Explanatory Factors for the Deep Features

After interpreting the units inside the deep visual representation, we show that the unit activations along with the interpreted labels can be used as explanatory factors for analyzing the prediction given by the deep features. Previous work [57] uses the weighted sum of the unit activation maps to highlight which image regions are most informative to the prediction; here we further decouple this at the individual-unit level to segment the informative image regions.

We first examine class-specific units. After the linear SVM is trained, we can rank the elements of the deep feature according to their SVM weights to obtain the elements that contribute most to a given class. Those elements are units that act as explanatory factors, and we call the top ranked units associated with each output class class-specific units. Fig. 25 shows the class-specific units of ResNet152-ImageNet and ResNet152-Places365 for one class from action40 and one from sun397. For example, for the Walking the dog class from action40, the top three class-specific units from ResNet152-ImageNet are two dog detection units and one person detection unit; for the Picnic area class from sun397, the top three class-specific units from ResNet152-Places365 are a plant detection unit, a grass detection unit, and a fence detection unit. The intuitive match between the visual detectors and the classes they explain suggests that the visual detectors inside CNNs behave like bag-of-semantic-words visual features.
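As a minimal, self-contained sketch, the class-specific units can be read directly off the weight matrix of the trained linear SVM; the random arrays below stand in for the deep-feature matrix and labels of a target dataset such as action40.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data standing in for deep features: 200 images, a
# 2048-dimensional deep feature (one dimension per unit), 10 classes.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 2048))
y_train = rng.integers(0, 10, size=200)

clf = LinearSVC(C=0.001).fit(X_train, y_train)

def class_specific_units(clf, class_index, top_k=3):
    """Indices of the units whose SVM weights contribute most to a class."""
    weights = clf.coef_[class_index]           # one weight per unit
    return np.argsort(weights)[::-1][:top_k]   # largest weights first

print(class_specific_units(clf, class_index=0))
```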

We further use the individual units identified as concept detectors to build an explanation of an individual image prediction given by a classifier. The procedure is as follows. Given an image, let the unit activations of the deep feature (for ResNet, the GAP activations) be $[x_1, x_2, \ldots, x_N]$, where each $x_n$ is the value summed up from the activation map of unit $n$. Let the top prediction's SVM response be $s = \sum_n w_n x_n$, where $[w_1, w_2, \ldots, w_N]$ is the SVM's learned weight vector. We obtain the top ranked units in Fig. 26 by ranking $[w_1 x_1, w_2 x_2, \ldots, w_N x_N]$, i.e., the unit activations weighted by the SVM weights for the top predicted class. We then simply upsample the activation map of each top ranked unit to segment the image, as in the sketch below.
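Below is a minimal sketch of this ranking-and-upsampling step, assuming the per-unit activation maps and the SVM weight vector of the predicted class are available as tensors; the 0.5 threshold used to binarize the upsampled map is an illustrative choice rather than a value specified in the paper.

```python
import torch
import torch.nn.functional as F

def explain_prediction(feature_maps, svm_weights, top_k=3, image_size=(224, 224)):
    """Rank units by their weighted contribution w_n * x_n to the top class
    score and return upsampled activation maps of the top ranked units,
    thresholded into masks that segment the informative image regions.

    feature_maps: (N, H, W) activation maps of the N units for one image.
    svm_weights:  (N,) SVM weight vector of the predicted class.
    """
    x = feature_maps.mean(dim=(1, 2))               # GAP activation x_n of each unit
    contribution = svm_weights * x                  # w_n * x_n
    top_units = torch.argsort(contribution, descending=True)[:top_k]

    masks = []
    for u in top_units:
        amap = feature_maps[u][None, None]          # (1, 1, H, W) for interpolation
        amap = F.interpolate(amap, size=image_size,
                             mode="bilinear", align_corners=False)[0, 0]
        amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
        masks.append(amap > 0.5)                    # illustrative threshold
    return top_units, masks

# Tiny demo with random tensors standing in for conv activations and SVM weights.
units, masks = explain_prediction(torch.rand(2048, 7, 7), torch.randn(2048))
print(units, masks[0].shape)
```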

The image segmentations using the individual unit activations are plotted in Fig. 26a. The unit segmentations explain the predictions explicitly. For example, the prediction for the first image is Gardening, and the explanatory units detect plant, grass, person, flower, and pot. The prediction for the second image is Riding a horse, and the explanatory units detect horse, fence, and dog. We also plot some wrongly predicted samples in Fig. 26b. The segmentations give an intuition as to why the classifier made mistakes. For example, for the first image the classifier predicts cutting vegetables rather than the true label gardening, because the second unit wrongly takes the ground for a table.

4 CONCLUSION

Network Dissection translates qualitative visualizations of representation units into quantitative interpretations and measurements of interpretability. We have found that the units of a deep representation are significantly more interpretable than expected for a basis of the representation space. We have investigated the interpretability of deep visual representations resulting from different architectures, training supervisions, and training conditions. Furthermore, we have shown that the interpretability of deep visual representations is relevant to the power of the representation as a generalizable visual feature. We conclude that interpretability is an important property of deep neural networks that provides new insights into their hierarchical structure. Our work motivates future work towards building more interpretable and explainable AI systems.

ACKNOWLEDGMENTS

This work was partly supported by the National Science Foundation under Grants No. 1524817 to A.T., and No. 1532591 to A.T. and


Fig. 23. The classification accuracy of deep features on the six image datasets (event8, action40, caltech101, indoor67, sun397, and caltech256).

Fig. 24. The number of unique object detectors in the last convolutional layer compared to each representation's classification accuracy on each dataset (event8, action40, caltech101, indoor67, sun397, caltech256). Supervised (in red) and unsupervised (in green) representations clearly form two clusters.


Fig. 25. Class-specific units from ResNet152-ImageNet and ResNet152-Places365 on one class from action40 (Walking the dog) and one from sun397 (Picnic area). For each class, we show three sample images, followed by the top 3 units from ResNet152-ImageNet and ResNet152-Places365 ranked by the class weight of the linear SVM for that class. The SVM weight, detected concept name, and IoU score are shown above each unit.

A.O.; the Vannevar Bush Faculty Fellowship program sponsored by the Basic Research Office of the Assistant Secretary of Defense for Research and Engineering and funded by the Office of Naval Research through grant N00014-16-1-3116 to A.O.; the MIT Big Data Initiative at CSAIL, the Toyota Research Institute MIT CSAIL Joint Research Center, Google and Amazon Awards, and a hardware donation from NVIDIA Corporation. B.Z. is supported by a Facebook Fellowship.

REFERENCES

[1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene CNNs," International Conference on Learning Representations, 2015.
[2] A. Gonzalez-Garcia, D. Modolo, and V. Ferrari, "Do semantic parts emerge in convolutional neural networks?" arXiv:1607.03738, 2016.
[3] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," arXiv:1609.02612, 2016.
[4] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[5] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, "Network dissection: Quantifying interpretability of deep visual representations," in Proc. CVPR, 2017.
[6] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[7] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167, 2015.
[8] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," Proc. ECCV, 2014.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[10] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," Proc. CVPR, 2015.
[11] K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," International Conference on Learning Representations Workshop, 2014.
[12] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. CVPR, 2015.
[13] A. Dosovitskiy and T. Brox, "Generating images with perceptual similarity metrics based on deep networks," in Advances in Neural Information Processing Systems, 2016, pp. 658–666.
[14] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Advances in Neural Information Processing Systems, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[16] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," arXiv:1403.6382, 2014.
[17] P. Agrawal, R. Girshick, and J. Malik, "Analyzing the performance of multilayer neural networks for object recognition," Proc. ECCV, 2014.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems, 2014.
[19] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv:1312.6199, 2013.
[20] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proc. CVPR, 2015.
[21] Y. Li, J. Yosinski, J. Clune, H. Lipson, and J. Hopcroft, "Convergent learning: Do different neural networks learn the same representations?" arXiv:1511.07543, 2015.
[22] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," International Conference on Learning Representations, 2017.
[23] C. Doersch, A. Gupta, and A. A. Efros, "Unsupervised visual representation learning by context prediction," in Proc. CVPR, 2015.
[24] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in Proc. ECCV, 2016.
[25] D. Jayaraman and K. Grauman, "Learning image representations tied to ego-motion," in Proc. ICCV, 2015.
[26] P. Agrawal, J. Carreira, and J. Malik, "Learning to see by moving," in Proc. ICCV, 2015.
[27] X. Wang and A. Gupta, "Unsupervised learning of visual representations using videos," in Proc. CVPR, 2015.
[28] R. Zhang, P. Isola, and A. A. Efros, "Colorful image colorization," in Proc. ECCV. Springer, 2016.
[29] ——, "Split-brain autoencoders: Unsupervised learning by cross-channel prediction," in Proc. CVPR, 2017.
[30] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba, "Ambient sound provides supervision for visual learning," in Proc. ECCV, 2016.
[31] R. Q. Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried, "Invariant visual representation by single neurons in the human brain," Nature, vol. 435, no. 7045, pp. 1102–1107, 2005.
[32] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," Proc. CVPR, 2017.
[33] S. Bell, K. Bala, and N. Snavely, "Intrinsic images in the wild," ACM Trans. on Graphics (SIGGRAPH), 2014.
[34] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille, "The role of context for object detection and semantic segmentation in the wild," in Proc. CVPR, 2014.
[35] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille, "Detect what you can: Detecting and representing objects using holistic models and body parts," in Proc. CVPR, 2014.
[36] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in Proc. CVPR, 2014.
[37] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1512–1523, 2009.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. CVPR, 2015.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556, 2014.
[41] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks," Proc. CVPR, 2017.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," Int'l Journal of Computer Vision, 2015.
[43] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014.
[44] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[45] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in Proc. ECCV, 2016.
[46] R. Gao, D. Jayaraman, and K. Grauman, "Object-centric representation learning from unlabeled videos," arXiv:1612.00500, 2016.
[47] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. CVPR, 2016.


Fig. 26. Segmenting images using the top activated units weighted by the class label from the ResNet152-Places365 deep feature. a) Correctly predicted samples. b) Wrongly predicted samples (correct labels: gardening, brushing).

[48] X. Wang, K. He, and A. Gupta, "Transitive invariance for self-supervised visual representation learning," arXiv preprint arXiv:1708.02901, 2017.
[49] P. Diaconis, "What is a random matrix?" Notices of the AMS, vol. 52, no. 11, pp. 1348–1349, 2005.
[50] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv:1605.07146, 2016.
[51] L.-J. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition," in Proc. ICCV, 2007.
[52] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proc. ICCV, 2011.
[53] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. CVPR, 2009.
[54] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "Sun database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010.
[55] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, 2007.
[56] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[57] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proc. CVPR, 2016.

Bolei Zhou is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology. He received an M.Phil. degree in Information Engineering from the Chinese University of Hong Kong and a B.Eng. degree in Biomedical Engineering from Shanghai Jiao Tong University in 2010. His research interests are computer vision and machine learning. He is an award recipient of the Facebook Fellowship, the Microsoft Research Asia Fellowship, and the MIT Greater China Fellowship.

David Bau is a Ph.D. student at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). He received an A.B. in Mathematics from Harvard in 1992 and an M.S. in Computer Science from Cornell in 1994. He coauthored a textbook on numerical linear algebra. He was a software engineer at Microsoft and Google and developed ranking algorithms for Google Image Search. His research interest is interpretable machine learning.

Aude Oliva is a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). After a French baccalaureate in Physics and Mathematics, she received two M.Sc. degrees and a Ph.D. in Cognitive Science from the Institut National Polytechnique of Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004 and CSAIL in 2012. Her research on vision and memory is cross-disciplinary, spanning human perception and cognition, computer vision, and human neuroscience. She received the 2006 National Science Foundation (NSF) Career award, the 2014 Guggenheim and the 2016 Vannevar Bush fellowships.

Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he received postdoctoral training at the Brain and Cognitive Sciences Department and the Computer Science and Artificial Intelligence Laboratory, MIT. He is now a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT). Prof. Torralba is an Associate Editor of the International Journal of Computer Vision, and has served as program chair for the Computer Vision and Pattern Recognition conference in 2015. He received the 2008 National Science Foundation (NSF) Career award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).