
The Incremental Multiresolution Matrix Factorization Algorithm

Vamsi K. Ithapu†, Risi Kondor§, Sterling C. Johnson†, Vikas Singh†
†University of Wisconsin-Madison, §University of Chicago
http://pages.cs.wisc.edu/~vamsi/projects/incmmf.html

    Abstract

Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices – an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct "global" factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.

    1. Introduction

Matrix factorization lies at the heart of a spectrum of computer vision problems. While factorization schemes have long seen wide-ranging and extensive use within structure from motion [38], face recognition [40] and motion segmentation [10], the last decade has brought renewed interest in these ideas. Specifically, the celebrated work on low rank matrix completion [6] has enabled deployments in a broad cross-section of vision problems, from independent components analysis [18] to dimensionality reduction [42] to online background estimation [43]. Novel extensions based on Robust Principal Components Analysis [13, 6] are being developed each year.

In contrast to factorization methods, a distinct and rich body of work rooted in early signal processing is arguably even more extensively utilized in vision. Specifically, Wavelets [34] and other related ideas (curvelets [5], shearlets [24]) that loosely fall under multiresolution analysis (MRA) drive an overwhelming majority of techniques within feature extraction [29] and representation learning [34]. Wavelets also remain the "go to" tool for image denoising, compression, inpainting, shape analysis and other applications in video processing [30]. SIFT features can be thought of as a special case of the so-called Scattering Transform (built on the theory of Wavelets) [4]. Remarkably, the "network" perspective of the Scattering Transform at least partly explains the invariances being identified by deep representations, further expanding the scope of multiresolution approaches informing vision algorithms.

The foregoing discussion raises the question of whether there are any interesting bridges between Factorization and Wavelets. This line of enquiry has recently been studied for the most common "discrete" object encountered in vision – graphs. Starting from the seminal work on Diffusion Wavelets [11], others have investigated tree-like decompositions of matrices [25] and ways of organizing them using wavelets [16]. While the topic is still nascent (but evolving), these non-trivial results suggest that the confluence of these seemingly distinct topics holds much promise for vision problems [17]. Our focus is to study this interface between Wavelets and Factorization, and to demonstrate the immediate set of problems that can potentially benefit. In particular, we describe an efficient (incremental) multiresolution matrix factorization algorithm.

To concretize the argument above, consider a representative example in vision and machine learning where a factorization approach may be deployed. Figure 1 shows a set of covariance matrices computed from the representations learned by AlexNet [23] and VGG-S [9] (on some ImageNet classes [35]) and from medical imaging data, respectively. As a first line of exploration, we may be interested in characterizing the apparent parsimonious "structure" seen in these matrices. We can easily verify that invoking de facto constructs like sparsity, low rank or a decaying eigen-spectrum cannot account for the "block" or cluster-like structures inherent in this data. Such block-structured kernels were the original motivation for block low-rank and hierarchical factorizations [36, 8] – but a multiresolution scheme is much more natural – in fact, ideal – if one can decompose the matrix in a way that the blocks automatically 'reveal' themselves at multiple resolutions. Conceptually, this amounts to a sequential factorization while accounting for the fact that each level of this hierarchy must correspond to approximating some non-trivial structure in the matrix. A recent result introduces precisely such a multiresolution matrix factorization (MMF) algorithm for symmetric matrices [21].

Figure 1. Left to right: example category (or class) covariances from AlexNet, VGG-S (of a few ImageNet classes) and medical imaging data.

Consider a symmetric matrix C ∈ R^{m×m}. PCA decomposes C as Q^T Λ Q, where Q is an orthogonal matrix which, in general, is dense. On the other hand, sparse PCA (sPCA) [46] imposes sparsity on the columns of Q, allowing only a few dimensions to interact, which may fail to capture global patterns. The factorization resulting from such individual low-rank decompositions cannot capture hierarchical relationships among data dimensions. Instead, MMF applies a sequence of carefully chosen sparse rotations Q^1, Q^2, ..., Q^L to factorize C in the form

$$C = (Q^1)^\top (Q^2)^\top \cdots (Q^L)^\top \, \Lambda \, Q^L \cdots Q^2 Q^1,$$

thereby uncovering a soft hierarchical organization of the different rows/columns of C. Typically the Q^ℓ are sparse k-th order rotations (orthogonal matrices that are the identity except for at most k of their rows/columns), leading to a hierarchical tree-like matrix organization. MMF was shown to be an efficient compression tool [39] and a preconditioner [21]. Randomized heuristics have been proposed to handle large matrices [22]. Nevertheless, the factorization involves searching a combinatorial space of row/column indices, which restricts the order of the rotations to be small (typically ≤ 3). Not allowing higher order rotations restricts the richness of the allowable block structure, resulting in a hierarchical decomposition that is "too localized" to be sensible or informative (reverting back to the issues with sPCA and other block low-rank approximations).
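To make the form of the factorization concrete, below is a minimal numpy sketch (not the authors' implementation) that composes a few random k-point rotations – each the identity except on k rows/columns – and verifies that C is exactly recovered from the core Λ = Q C Q^T. In a true MMF the rotations and Λ are constrained as described above; the helper names (`k_point_rotation`) are ours for illustration only.

```python
import numpy as np

def k_point_rotation(m, idx, rng):
    """Return an m x m rotation that is the identity except on the k
    rows/columns listed in `idx` (a random k-th order rotation)."""
    k = len(idx)
    O, _ = np.linalg.qr(rng.standard_normal((k, k)))  # random k x k orthogonal block
    Q = np.eye(m)
    Q[np.ix_(idx, idx)] = O
    return Q

rng = np.random.default_rng(0)
m, k, L = 8, 3, 4
C = rng.standard_normal((m, m)); C = C @ C.T          # symmetric test matrix

# Compose L sparse rotations Q = Q^L ... Q^2 Q^1 acting on random k-tuples.
Q = np.eye(m)
for _ in range(L):
    idx = rng.choice(m, size=k, replace=False)
    Q = k_point_rotation(m, idx, rng) @ Q

Lam = Q @ C @ Q.T                                      # the "core" Λ = Q C Q^T
assert np.allclose(C, Q.T @ Lam @ Q)                   # C = Q^T Λ Q
```

Note that here Λ is not forced to be core-diagonal; the point is only to show how a product of k-point rotations remains orthogonal while touching few coordinates at a time.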

A fundamental property of MMF is the sequential composition of rotations. In this paper, we exploit the fact that the factorization can be parameterized in terms of an MMF graph defined on a sequence of higher-order rotations. Unlike alternate batch-wise approaches [39], we start with a small, randomly chosen block of C and gradually 'insert' new rows into the factorization – hence we refer to this as an incremental MMF. We show that this insertion procedure manipulates the topology of the MMF graph, thereby providing an efficient algorithm for constructing higher order MMFs. Our contributions are: (A) we present a fast and efficient incremental procedure for constructing higher order (large k) MMFs on large dense matrices; (B) we evaluate the efficacy of the higher order factorizations for relative leveraging of sets of pixels/voxels in regression tasks in vision; and (C) using the output structure of incremental MMF, we visualize the semantics of categorical relationships inferred by deep networks, and, in turn, present some exploratory tools to adapt and modify the architectures.

    2. Multiresolution Matrix Factorization

Notation: We begin with some notation. Matrices are bold upper case, vectors are bold lower case and scalars are lower case. [m] := {1, ..., m} for any m ∈ N. Given a matrix C ∈ R^{m×m} and two sets of indices S1 = {r1, ..., rk} and S2 = {c1, ..., cp}, C_{S1,S2} denotes the block of C cut out by the rows S1 and the columns S2. C_{:,i} is the i-th column of C. I_m is the m-dimensional identity. SO(m) is the group of m-dimensional orthogonal matrices with unit determinant. R^m_S is the set of m-dimensional symmetric matrices which are diagonal except for their S × S block (S-core-diagonal matrices).
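For readers following along numerically, the block-indexing conventions map directly onto numpy; this is a small illustrative sketch (the variable names are ours).

```python
import numpy as np

C = np.arange(36, dtype=float).reshape(6, 6)
C = (C + C.T) / 2                 # a symmetric matrix in R^{m x m}, m = 6

S1, S2 = [0, 2, 5], [1, 3]
block = C[np.ix_(S1, S2)]         # C_{S1,S2}: rows S1, columns S2
col_i = C[:, 2]                   # C_{:,i}: the i-th column
I_m = np.eye(6)                   # I_m

# An S-core-diagonal matrix: diagonal everywhere except its S x S block.
S = [1, 4]
A = np.diag(np.arange(1.0, 7.0))
A[np.ix_(S, S)] = [[2.0, 0.5], [0.5, 3.0]]
```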

Multiresolution matrix factorization (MMF), introduced in [21, 22], retains the locality properties of sPCA while also capturing the global interactions provided by the many variants of PCA, by applying not one but multiple sparse rotation matrices to C in sequence. We have the following.

Definition. Given an appropriate class O ⊆ SO(m) of sparse rotation matrices, a depth parameter L ∈ N and a sequence of integers m = d_0 ≥ d_1 ≥ ... ≥ d_L ≥ 1, the multiresolution matrix factorization (MMF) of a symmetric matrix C ∈ R^{m×m} is a factorization of the form

$$\mathcal{M}(C) := Q^\top \Lambda Q \quad \text{with} \quad Q = Q^L \cdots Q^2 Q^1, \qquad (1)$$

where Q^ℓ ∈ O and the block (Q^ℓ)_{[m]∖S_{ℓ−1},\,[m]∖S_{ℓ−1}} is the identity, for some nested sequence of sets [m] = S_0 ⊇ S_1 ⊇ ... ⊇ S_L with |S_ℓ| = d_ℓ, and Λ ∈ R^m_{S_L}.

S_{ℓ−1} is referred to as the 'active set' at the ℓ-th level, since Q^ℓ acts non-trivially only on S_{ℓ−1} and is the identity on [m] ∖ S_{ℓ−1}. The nesting of the S_ℓ implies that after applying Q^ℓ at some level ℓ, the S_{ℓ−1} ∖ S_ℓ rows/columns are removed from the active set and are not operated on subsequently. This active-set trimming is done at all L levels, leading to a nested-subspace interpretation for the sequence of compressions C^ℓ = Q^ℓ C^{ℓ−1} (Q^ℓ)^⊤ (with C^0 = C and Λ = C^L). In fact, [21] has shown that, for a general class of symmetric matrices, an MMF as in the definition above entails a Mallat-style multiresolution analysis (MRA) [28]. Observe that depending on the choice of Q^ℓ, only a few dimensions of C^{ℓ−1} are forced to interact, and so the composition of rotations is hypothesized to extract subtle or softer notions of structure in C.


Since multiresolution is represented as a matrix factorization here (see (1)), the S_{ℓ−1} ∖ S_ℓ columns of Q correspond to "wavelets". While d_1, d_2, ... can be any monotonically decreasing sequence, we restrict ourselves to the simplest case of d_ℓ = m − ℓ. In this setting, the number of levels L is at most m − k + 1, and each level contributes a single wavelet. Given S_1, S_2, ... and O, the matrix factorization in (1) reduces to determining the rotations Q^ℓ and the residual Λ, which is usually done by minimizing the squared Frobenius norm error

$$\min_{Q^\ell \in \mathcal{O},\ \Lambda \in \mathbb{R}^m_{S_L}} \ \| C - \mathcal{M}(C) \|^2_{\mathrm{Frob}}. \qquad (2)$$

This objective can be decomposed as a sum of contributions from each of the L levels (see Proposition 1 in [21]), which suggests computing the factorization in a greedy manner as C = C^0 ↦ C^1 ↦ C^2 ↦ ... ↦ Λ. This error decomposition is what drives much of the intuition behind our algorithms.

After ℓ − 1 levels, C^{ℓ−1} is the compression and S_{ℓ−1} is the active set. In the simplest case where O is the class of so-called k-point rotations (rotations that affect at most k coordinates) and d_ℓ = m − ℓ, at level ℓ the algorithm needs to determine three things: (a) the k-tuple t^ℓ of rows/columns involved in the rotation, (b) the non-trivial part O := Q^ℓ_{t^ℓ,t^ℓ} of the rotation matrix, and (c) s^ℓ, the index of the row/column that is subsequently designated a wavelet and removed from the active set. Without loss of generality, let s^ℓ be the last element of t^ℓ. Then the contribution of level ℓ to the squared Frobenius norm error (2) is (see supplement)

$$E(C^{\ell-1}; O; t^\ell, s^\ell) \;=\; 2\sum_{i=1}^{k-1} \Big[\, O\, C^{\ell-1}_{t^\ell, t^\ell}\, O^\top \Big]^2_{k,i} \;+\; 2\,\Big[\, O B B^\top O^\top \Big]_{k,k}, \quad \text{where } B = C^{\ell-1}_{t^\ell,\, S_{\ell-1} \setminus t^\ell}, \qquad (3)$$

and, in the definition of B, t^ℓ is treated as a set. The factorization then works by minimizing this quantity in a greedy fashion, i.e.,

$$Q^\ell, t^\ell, s^\ell \;\leftarrow\; \operatorname*{argmin}_{O,\, t,\, s}\ E(C^{\ell-1}; O; t, s); \qquad S_\ell \leftarrow S_{\ell-1} \setminus \{s^\ell\}; \qquad C^\ell = Q^\ell C^{\ell-1} (Q^\ell)^\top. \qquad (4)$$
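For readers who want to trace the greedy step numerically, the following sketch (our own, simplified) evaluates the level error in (3) for every candidate k-tuple and applies the update in (4). It fixes s as the last element of the tuple and takes O to be the eigenvector basis of the k x k block rather than searching a rotation dictionary, so it is an approximation of the exhaustive procedure described later, not the authors' exact algorithm.

```python
import numpy as np
from itertools import combinations

def level_error(Cprev, active, t):
    """Contribution of one level to the squared Frobenius error, eq. (3).
    `t` is a k-tuple of active indices; its last element plays the role of s."""
    t = list(t)
    k = len(t)
    block = Cprev[np.ix_(t, t)]
    _, V = np.linalg.eigh(block)
    O = V.T                                      # surrogate for the best rotation O
    rest = [i for i in active if i not in t]
    B = Cprev[np.ix_(t, rest)]                   # B = C^{l-1}_{t, S_{l-1} \ t}
    D = O @ block @ O.T
    err = 2 * np.sum(D[k - 1, :k - 1] ** 2) + 2 * (O @ B @ B.T @ O.T)[k - 1, k - 1]
    return err, O

def greedy_level(Cprev, active, k):
    """One greedy level: pick the k-tuple minimizing (3), then apply (4)."""
    m = Cprev.shape[0]
    best = min(combinations(active, k),
               key=lambda t: level_error(Cprev, active, t)[0])
    _, O = level_error(Cprev, active, best)
    Q = np.eye(m)
    Q[np.ix_(best, best)] = O
    Cnext = Q @ Cprev @ Q.T                      # C^l = Q^l C^{l-1} (Q^l)^T
    s = best[-1]                                 # wavelet index leaves the active set
    return Cnext, [i for i in active if i != s], best, s
```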

    3. Incremental MMF

We now motivate our algorithm using (3) and (4). Solving (2) amounts to estimating the L different k-tuples t^1, ..., t^L sequentially. At each level, the selection of the best k-tuple is clearly combinatorial, making the exact MMF computation (i.e., explicitly minimizing (2)) very costly even for k = 3 or 4 (this has been independently observed in [39]). As discussed in Section 1, higher order MMFs (with large k) are nevertheless necessary to allow arbitrary interactions among dimensions (see the supplement for a detailed study), and our proposed incremental procedure exploits some interesting properties of the factorization error and other redundancies in the k-tuple computation. The core of our proposal is the following setup.

    3.1. Overview

Let C̃ ∈ R^{(m+1)×(m+1)} be the extension of C by a single new column w = [u^T, v]^T, which manipulates C as

$$\tilde{C} = \begin{bmatrix} C & u \\ u^\top & v \end{bmatrix}. \qquad (5)$$

The goal is to compute M(C̃). Since C and C̃ share all but one row/column (see (5)), if we have access to M(C), one should, in principle, be able to modify C's underlying sequence of rotations to construct M(C̃). This avoids having to recompute everything for C̃ from scratch, i.e., performing the greedy decompositions from (4) on the entire C̃.

The hypothesis that M(C) can be manipulated to compute M(C̃) comes from the precise computations involved in the factorization. Recall (3) and the discussion leading up to that expression. At level ℓ + 1, the factorization picks the 'best' candidate rows/columns of C^ℓ – those that correlate the most with each other – so that the resulting diagonalization induces the smallest possible off-diagonal error over the rest of the active set. The components contributing to this error are driven by the inner products (C^ℓ_{:,i})^T C^ℓ_{:,j} for columns i and j. In some sense, the most strongly correlated rows/columns get picked, and adding one new entry to C^ℓ_{:,i} may not change the range of these correlations. Extending this intuition across all levels, we argue that

$$\operatorname*{argmax}_{i,j}\ \tilde{C}_{:,i}^\top \tilde{C}_{:,j} \;\approx\; \operatorname*{argmax}_{i,j}\ C_{:,i}^\top C_{:,j}. \qquad (6)$$

Hence, the k-tuples computed from C's factorization remain reasonably good candidates even after introducing w. To formalize this idea, and in the process present our algorithm, we parameterize the output structure of M(C) in terms of the sequence of rotations and the wavelets.
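A small sketch of the intuition behind (5) and (6), under a toy setup of our own: we border a covariance matrix with one new row/column and check that the most strongly correlated column pair is (typically) unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
X = rng.standard_normal((200, m))
X[:, 5] += 0.9 * X[:, 11]                 # plant one strongly correlated pair
C = X.T @ X / X.shape[0]

u = C @ rng.standard_normal(m) * 0.05     # new off-diagonal block u (toy choice)
v = 1.0                                   # new diagonal entry
C_tilde = np.block([[C, u[:, None]],
                    [u[None, :], np.array([[v]])]])   # eq. (5)

def top_pair(A):
    """Index pair (i, j), i < j, maximizing the column inner products."""
    G = A.T @ A
    np.fill_diagonal(G, -np.inf)
    i, j = np.unravel_index(np.argmax(G), G.shape)
    return tuple(sorted((int(i), int(j))))

# With only one row/column added, the leading correlated pair rarely changes (eq. (6)).
print(top_pair(C), top_pair(C_tilde))
```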

    3.2. The graph structure of M(C)

If one has access to the sequence of k-tuples t^1, ..., t^L involved in the rotations and the corresponding wavelet indices s^1, ..., s^L, then the factorization is straightforward to compute, i.e., there is no greedy search anymore. Recall that by definition s^ℓ ∈ t^ℓ and s^ℓ ∉ S_ℓ (see (4)). To that end, for a given O and L, M(C) can be 'equivalently' represented using a depth-L MMF graph G(C). Each level of this graph shows the k-tuple t^ℓ involved in the rotation and the corresponding wavelet s^ℓ, i.e., G(C) := {t^ℓ, s^ℓ}_{1}^{L}. Interpreting the factorization in this way is notationally convenient for presenting the algorithm. More importantly, such an interpretation is central for visualizing hierarchical dependencies among the dimensions of C, and will be discussed in detail in Section 4.3. An example of such a 3rd order MMF graph constructed from a 5 × 5 matrix is shown in Figure 2 (the rows/columns are color coded for better visualization). At level ℓ = 1, s1, s2 and s3 are diagonalized while designating the rotated s1 as the wavelet. This process repeats for ℓ = 2 and 3. As shown by the color-coding of the different compositions, MMF gradually teases out higher-order correlations that can only be revealed after composing the rows/columns at one or more scales (levels here).

Figure 2. An example 5×5 matrix and its 3rd order MMF graph (better in color). Q^1, Q^2 and Q^3 are the rotations; s1, s5 and s2 are the wavelets (marked as black ellipses) at ℓ = 1, 2 and 3 respectively. The arrows indicate that a wavelet is not involved in future rotations.

For notational convenience, we denote the MMF graphs of C and C̃ as G := {t^ℓ, s^ℓ}_{1}^{L} and G̃ := {t̃^ℓ, s̃^ℓ}_{1}^{L+1}. Recall that G̃ will have one more level than G since the row/column w, indexed m + 1 in C̃, is being added (see (5)). The goal is to estimate G̃ without recomputing all the k-tuples using the greedy procedure from (4). This translates to inserting the new index m + 1 into the t^ℓ and modifying the s^ℓ accordingly. Following the discussion from Section 3.1, incremental MMF argues that inserting this one new element into the graph will not result in global changes in its topology. Clearly, in the pathological case G may change arbitrarily, but as argued earlier (see the discussion around (6)) the chance of this happening for non-random matrices with reasonably large k is small. The core operation then is to compare the new k-tuples resulting from the addition of w to the best ones from [m]^k provided via G. If a newer k-tuple gives a smaller error (see (3)), it knocks out an existing k-tuple. This constructive insertion and knock-out procedure is the incremental MMF.

    3.3. Inserting a new row/column

The basis for this incremental procedure is that one has access to G (i.e., the MMF of C). We first present the algorithm assuming that this "initialization" is provided, and revisit this aspect shortly. The procedure starts by setting t̃^ℓ = t^ℓ and s̃^ℓ = s^ℓ for ℓ = 1, ..., L. Let I be the set of elements (indices) that need to be inserted into G. At the start (the first level), I = {m + 1}, corresponding to w. Let t̃^1 = {p_1, ..., p_k}. The new k-tuples that account for inserting entries of I are {m + 1} ∪ t̃^1 ∖ {p_i} for i = 1, ..., k. These k new candidates are the probable alternatives to the existing t̃^1. Once the best among these k + 1 candidates is chosen, an existing p_i from t̃^1 may be knocked out.

If s̃^1 gets knocked out, then I = {s̃^1} for future levels. This follows from the MMF construction, where wavelets at the ℓ-th level are not involved in later levels. Since s̃^1 is knocked out, it becomes the new element to be inserted according to G. On the other hand, if one of the k − 1 scaling functions is knocked out, I is not updated. This simple process is repeated sequentially from ℓ = 1 to L. At level L + 1, there are no estimates for t̃^{L+1} and s̃^{L+1}, and so the procedure simply selects the best k-tuple from the remaining active set S̃_L. Algorithm 1 summarizes this insertion and knock-out procedure.

Algorithm 1 INSERTROW(C, w, {t^ℓ, s^ℓ}_{ℓ=1}^{L})
Output: {t̃^ℓ, s̃^ℓ}_{ℓ=1}^{L+1}
  C̃^0 ← C̃ as in (5); z^1 ← m + 1
  for ℓ = 1 to L − 1 do
    {t̃^ℓ, s̃^ℓ, z^{ℓ+1}, Q^ℓ} ← CHECKINSERT(C̃^{ℓ−1}; t^ℓ, s^ℓ, z^ℓ)
    C̃^ℓ = Q^ℓ C̃^{ℓ−1} (Q^ℓ)^⊤
  end for
  T ← GENERATETUPLES([m + 1] ∖ ∪_{ℓ=1}^{L−1} s̃^ℓ)
  {Õ, t̃^L, s̃^L} ← argmin_{O, t∈T, s∈t} E(C̃^{L−1}; O; t, s)
  Q^L = I_{m+1}; Q^L_{t̃^L, t̃^L} = Õ; C̃^L = Q^L C̃^{L−1} (Q^L)^⊤

Algorithm 2 CHECKINSERT(A, t̂, ŝ, z)
Output: t̃, s̃, z, Q
  T ← GENERATETUPLES(t̂, z)
  {Õ, t̃, s̃} ← argmin_{O, t∈T, s∈t} E(A; O; t, s)
  if s̃ ∈ z then
    z ← (z ∪ ŝ) ∖ s̃
  end if
  Q = I_{m+1}; Q_{t̃,t̃} = Õ
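To make the knock-out step concrete, here is a simplified sketch of the candidate generation and comparison inside CHECKINSERT (our own rendering; it takes the tuple-scoring routine as a parameter, e.g., the `level_error` sketch given after (4), and omits the bookkeeping of the rotation Q).

```python
import numpy as np

def check_insert(error_fn, t_old, s_old, insert_set):
    """One insertion/knock-out step (a sketch of CHECKINSERT).
    `error_fn(t)` scores a candidate k-tuple, e.g. via eq. (3)."""
    t_old = list(t_old)
    candidates = [tuple(t_old)]
    for z in insert_set:
        for p in t_old:                           # swap z in for each existing index p
            candidates.append(tuple([q for q in t_old if q != p] + [z]))
    errors = [error_fn(t) for t in candidates]
    t_new = candidates[int(np.argmin(errors))]
    s_new = t_new[-1]                             # wavelet index (last element, as in the text)
    insert_set = set(insert_set)
    if s_new in insert_set:                       # the previous wavelet gets knocked out
        insert_set = (insert_set - {s_new}) | {s_old}
    return t_new, s_new, insert_set
```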

    3.4. Incremental MMF Algorithm

Observe that Algorithm 1 is for the setting from (5), where one extra row/column is added to a given MMF; clearly, the incremental procedure can be repeated as more and more rows/columns are added. Algorithm 3 summarizes this incremental factorization for arbitrarily large and dense matrices. It has two components: an initialization on some randomly chosen small block (of size m̃ × m̃) of the entire matrix C, followed by insertion of the remaining m − m̃ rows/columns using Algorithm 1 in a streaming fashion (similar to w from (5)). The initialization entails computing a batch-wise MMF on this small block (m̃ ≥ k).

BATCHMMF: Note that at each level ℓ, the error criterion in (3) can be explicitly minimized via an exhaustive search over all possible k-tuples from S_{ℓ−1} (the active set) and a randomly chosen (using properties of the QR decomposition [31]) dictionary of k-th order rotations. If the dictionary is large enough, this exhaustive procedure leads to the smallest possible decomposition error (see (2)). However, it is easy to see that it is combinatorially expensive, with an overall complexity of O(n^k) [22], and will not scale well beyond k = 4 or so. Note from Algorithm 1 that the error criterion E(·) in the second stage, which inserts the remaining m − m̃ rows, performs an exhaustive search as well.

Algorithm 3 INCREMENTALMMF(C)
Output: M(C)
  C̄ = C_{[m̃],[m̃]}; L = m − k + 1
  {t^ℓ, s^ℓ}_{1}^{m̃−k+1} ← BATCHMMF(C̄)
  for j ∈ {m̃ + 1, ..., m} do
    {t^ℓ, s^ℓ}_{1}^{j−k+1} ← INSERTROW(C̄, C_{j,:}, {t^ℓ, s^ℓ}_{1}^{j−k})
    C̄ = C_{[j],[j]}
  end for
  M(C) := {t^ℓ, s^ℓ}_{1}^{L}
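An end-to-end skeleton of the streaming loop in Algorithm 3, written as a hedged sketch: `batch_mmf` and `insert_row` are hypothetical stand-ins for BATCHMMF and Algorithm 1 (they are assumed to return and accept the MMF graph {t^ℓ, s^ℓ}).

```python
import numpy as np

def incremental_mmf(C, m_init, batch_mmf, insert_row):
    """Skeleton of Algorithm 3: batch-factorize a small leading block, then
    stream in the remaining rows/columns one at a time.
    `batch_mmf(block)` and `insert_row(block, new_row, graph)` are assumed to
    implement BATCHMMF and INSERTROW respectively (hypothetical helpers)."""
    m = C.shape[0]
    graph = batch_mmf(C[:m_init, :m_init])   # initialization on an m~ x m~ block
    for j in range(m_init, m):
        new_row = C[j, :j + 1]               # row j restricted to the current block
        graph = insert_row(C[:j + 1, :j + 1], new_row, graph)
    return graph                              # M(C) parameterized as {t^l, s^l}
```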

Other Variants: There are two alternatives that avoid this exhaustive search. Since Q^ℓ's job is to diagonalize some k rows/columns (see the definition in Section 2), one can simply pick the relevant k × k block of C^{ℓ−1} and compute the best O (for a given t^ℓ). Hence the first alternative is to bypass the search over O in (4), and simply use the eigenvectors of C^{ℓ−1}_{t^ℓ,t^ℓ} for some tuple t^ℓ. Nevertheless, the search over S_{ℓ−1} for t^ℓ still makes this approximation reasonably costly. Instead, the k-tuple selection may be approximated while keeping the exhaustive search over O intact [22]. Since diagonalization effectively nullifies correlated dimensions, the best k-tuple can be taken to be the k rows/columns that are maximally correlated. This is done by choosing some s_1 ∼ S_{ℓ−1} (from the current active set), and picking the rest by

$$s_2, \ldots, s_k \;\leftarrow\; \operatorname*{argmax}_{s_i \sim S_{\ell-1} \setminus s_1}\ \sum_{i=2}^{k} \frac{(C^{\ell-1}_{:,s_1})^\top C^{\ell-1}_{:,s_i}}{\|C^{\ell-1}_{:,s_1}\|\,\|C^{\ell-1}_{:,s_i}\|}. \qquad (7)$$

This second heuristic (which is related to (6) from Section 3.1) has been shown to be robust [22]; however, for large k it might miss some k-tuples that are vital to the quality of the factorization. Depending on m̃ and the available computational resources, these alternatives can be used instead of the exhaustive procedure proposed earlier for the initialization. Overall, the incremental procedure scales efficiently to very large matrices, compared to using the batch-wise scheme on the entire matrix.
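A sketch of the two shortcuts (our own rendering, not the reference implementation): the rotation taken from the eigenvectors of the k × k block, and the correlation-based k-tuple selection of (7).

```python
import numpy as np

def rotation_from_block(Cprev, t):
    """First shortcut: skip the search over O and use the eigenvectors of the
    k x k block C^{l-1}_{t,t} as the non-trivial part of Q^l."""
    _, V = np.linalg.eigh(Cprev[np.ix_(t, t)])
    return V.T

def correlated_tuple(Cprev, active, k, rng):
    """Second shortcut (eq. (7)): anchor on a random active index s1 and pick
    the k-1 active columns with the largest normalized inner products with s1."""
    active = list(active)
    s1 = int(rng.choice(active))
    cols = Cprev[:, active]
    norms = np.linalg.norm(cols, axis=0) + 1e-12
    corr = (cols.T @ Cprev[:, s1]) / (norms * np.linalg.norm(Cprev[:, s1]))
    order = np.argsort(-corr)                       # descending correlation
    rest = [active[i] for i in order if active[i] != s1][:k - 1]
    return [s1] + rest
```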

    4. Experiments

We study various computer vision and medical imaging scenarios (see the supplement for details) to evaluate the quality of the incremental MMF factorization and show its utility. We first provide evidence for the factorization's efficacy in selecting the relevant features of interest for regression. We then show that the resulting MMF graph is a useful tool for visualizing/decoding the learned task-specific representations.

    4.1. Incremental versus Batch MMF

The first set of evaluations compares the incremental MMF to the batch version (including the exhaustive-search variant and the two approximate variants from Section 3.4). Recall that the MMF error is the off-diagonal norm of Λ, excluding the S_L × S_L block (see (1)); the smaller the error, the closer the factorization is to being exact (see (2)). We observed that the incremental MMFs incur approximately the same error as the batch versions, while achieving roughly a 20–25× speed-up compared to a single-core implementation of the batch MMF. Specifically, across 6 different toy examples and 3 covariance matrices constructed from real data, the loss in factorization error is at most about 4% of ‖C‖_Frob, with no strong dependence on the fraction of the initialization m̃ (see Algorithm 3). Due to space restrictions, these simulations are included in the supplement.

    4.2. MMF Scores

The objective value of MMF (see (2)) is the signal that is not accounted for by the k-th order rotations of MMF (it is 0 whenever C is exactly factorizable). Hence, ‖(C − M(C))_{i,:}‖ is a measure of the extra information in the i-th row that cannot be reproduced by hierarchical compositions of the rest. Such value-of-information summaries, referred to as MMF scores, computed for all the dimensions of C give an importance sampling distribution, similar to statistical leverage scores [3, 27]. Such samplers drive several regression tasks in vision, including gesture tracking [33], face alignment/tracking [7] and medical imaging [15]. Moreover, the authors of [27] have shown that statistical-leverage-type marginal importance samplers may not be optimal for regression. On the other hand, MMF scores give the conditional importance or "relative leverage" of each dimension/feature given the remaining ones. Because MMF encodes the hierarchical block structure in the covariance, the MMF scores provide better importance samplers than statistical leverages. We first demonstrate this on a large dataset with 80 predictors/features and 1300 instances. Figure 3(a,b) shows the instance covariance matrices after selecting the 'best' 5% of features. The block structure representing the two classes, diseased and non-diseased, is clearly more apparent with MMF score sampling (see the yellow block vs. the rest in Figure 3(b)).
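A sketch of how MMF scores could be computed and used as an importance sampler. It assumes some routine `mmf_reconstruct(C)` that returns the MMF approximation M(C); the function names are ours, for illustration.

```python
import numpy as np

def mmf_scores(C, mmf_reconstruct):
    """Row-wise residual norms ||(C - M(C))_{i,:}||, normalized into a
    sampling distribution over features (the 'relative leverage')."""
    R = C - mmf_reconstruct(C)               # signal not captured by the factorization
    scores = np.linalg.norm(R, axis=1)
    return scores / scores.sum()

# Usage: sample, say, 5% of the features proportionally to their MMF score.
# probs = mmf_scores(C, mmf_reconstruct)
# chosen = np.random.default_rng(0).choice(
#     C.shape[0], size=int(0.05 * C.shape[0]), replace=False, p=probs)
```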

We exhaustively compared leverage scores and MMF scores on a medical imaging regression task using region-of-interest (ROI) summaries from positron emission tomography (PET) images. The goal is to predict a cognitive score summary using the imaging ROIs (see Figure 3(c), and the supplement for details). Although the features have a high degree of block structure (covariance matrix in Figure 3(c)), this information is rarely, if ever, utilized within downstream models, say a multi-linear regression for predicting health status. Here we train a linear model using a fraction of these voxel ROIs sampled according to statistical leverages (from [3]) and the relative leverage from MMF scores. Note that, unlike LASSO, the feature samplers are agnostic to the responses (a setting similar to optimal experimental design [12]). The second row of Figure 3 shows the adjusted R² of the resulting linear models, and Figure 3(h,i) shows the corresponding F-statistic. The x-axis corresponds to the fraction of the ROIs selected using the leverage scores (black lines) and MMF scores (red lines). As shown by the red vs. black curves, the voxel ROIs picked by MMF are better both in terms of adjusted R² (the explainable variance of the data) and F-statistic (the overall significance). More importantly, the first few ROIs picked by MMF scores are more informative than those from leverage scores (left end of the x-axis in Figure 3(d–i)). Figure 3(j,k) shows the gain in the AUC of adjusted R² as the order of the MMF changes (x-axis). Clearly the performance gain of MMF scores is large. The error bars in these plots are omitted for clarity (see the supplement for details and other plots/comparisons). These results show that MMF scores can be used within numerous regression tasks where the number of predictors is large relative to the sample size.

    4.3. MMF graphs

The ability of a feature to succinctly represent the presence of an object/scene is, at least in part, governed by the relationship of the learned representations across multiple object classes/categories. Beyond object-specific information, such cross-covariate contextual dependencies have been shown to improve performance in object tracking and recognition [45] and in medical applications [20] (a motivating aspect of adversarial learning [26]). Visualizing histogram of oriented gradients (HOG) features is one such interesting result, demonstrating a scenario where a correctly learned representation leads to a false positive [41]; for instance, the HOG features of a duck image are similar to those of a car. [37, 14] have addressed similar aspects for deep representations by visualizing image classification and detection models, and there is recent interest in designing tools for visualizing what the network perceives when predicting a test label [44]. As shown in [1], the contextual images that a deep network (even one with good detection power) desires to see may not even correspond to real-world scenarios.

The evidence from these works motivates a simple question – do the semantic relationships learned by deep representations agree with those seen by humans? For instance, can such models infer that cats are closer to dogs than they are to bears, or that bread goes well with butter/cream rather than, say, salsa? Invariably, addressing these questions amounts to learning hierarchical and categorical relationships in the class-covariance of hidden representations. Classical techniques may not easily reveal interesting, human-relatable trends, as was shown very recently by [32]. There are at least a few reasons for this, but most importantly, the covariance of hidden representations (in general) has parsimonious structure with multiple compositions of blocks (the left two images in Figure 1 are from AlexNet and VGG-S). As motivated in Section 1, and later described in Section 3.2 using Figure 2, an MMF graph is the natural object with which to analyze such parsimonious structure.

    4.3.1 Decoding the deep

A direct application of MMF to the covariance of hidden representations reveals interesting hierarchical structure about the "perception" of deep networks. To walk through these compositions precisely, consider the last hidden layer (FC7, which feeds into the softmax) representations from a VGG-S network [9] corresponding to 12 different ImageNet classes, shown in Figure 4(a). Figure 4(b,c) visualizes a 5th order MMF graph learned on this class covariance matrix.

The semantics of breads and sides. The 5th order MMF says that the five categories – pita, limpa, chapati, chutney and bannock – are most representative of the localized structure in the covariance. Observe that these are four different flour-based main courses, and a side (chutney) that shared the strongest context with the images of chapati in the training data (similar to the body building and dumbbell images from [1]). MMF then picks the salad, salsa and saute representations at the 2nd level, claiming that they relate most strongly to the composition of breads and chutney from the previous level (see the visualization in Figure 4(b,c)). Observe that these are in fact the sides offered/served with bread. Although VGG-S was not trained to predict these relations, according to MMF the representations are inherently learning them anyway – a fascinating aspect of deep networks, i.e., they are seeing what humans may infer about these classes.

Any dressing? What are my dessert options? Let us move to the 3rd level in Figure 4(b,c). Margarine is a cheese-based dressing. Shortcake is a dessert-type dish made from strawberry (which shows up at the 4th level) and bread (the composition from the previous levels). That is the full course. The last level corresponds to ketchup, which is an outlier, distinct from the rest of the 10 classes – a typical order of dishes involving the chosen breads and sides does not include hot sauce or ketchup. Although shortcake is made up of strawberries, "conditioned" on the 1st and 2nd level dependencies it is less useful in summarizing the covariance structure. An interesting summary of this hierarchy from Figure 4(b,c) is that an order of pita with a side of ketchup or strawberries is atypical in the data seen by these networks.

Figure 3. Evaluating feature importance sampling with MMF scores vs. leverage scores. (a,b) Visualizing apparent (if any) blocks in instance covariance matrices using the best 5% of features (leverage score vs. MMF score sampling); (c) the regression setup – constructing the ROI data from amyloid PET images (note the structure in the feature covariance); (d–g) adjusted R² and (h,i) F-statistic of the linear models vs. the fraction of ROIs used; (j,k) gains in R² AUC vs. MMF order. Mdl1–Mdl4 are linear models constructed on different datasets (see supplement). m̃ = 0.1m (from Algorithm 3) for these evaluations.

    4.3.2 Are we reading tea leaves?

It is reasonable to ask whether this description is meaningful, since the semantics drawn above are subjective. We provide explanations below. First, the networks were not trained to learn the hierarchy of categories – the task was object/class detection. Hence, the relationships are entirely a by-product of the power of deep networks to learn contextual information, and of the ability of MMF to model these compositions by uncovering the structure in the covariance matrix. The supplement provides further evidence by visualizing such hierarchies for a few dozen other ImageNet classes. Second, one may ask whether the compositions are sensitive/stable with respect to the order k – a critical hyperparameter of MMF. Figure 4(d) uses a 4th order MMF, and the resulting hierarchy is similar to that from Figure 4(b). Specifically, the different breads and sides show up early, and the most distinct categories (strawberry and ketchup) appear at the higher levels. Similar patterns are seen for other choices of k (see supplement). Further, if the class hierarchy in Figures 4(b–d) is non-spurious, then similar trends should be implied by MMFs on different (higher) layers of VGG-S. Figure 4(e) shows the compositions from the 10th layer representations (the outputs of the 3rd convolutional layer of VGG-S) of the 12 classes in Figure 4(a). The strongest compositions, the 8 classes from ℓ = 1 and 2, are already picked up halfway through VGG-S, providing further evidence that the compositional structure implied by MMF is data-driven. We discuss this further in Section 4.3.3. Finally, we compared MMF's class compositions to the hierarchical clusters obtained from agglomerative clustering of the representations. The relationships in Figure 4(b–d) are not apparent in the corresponding dendrograms (see supplement, [32]) – for instance, the dependency of chutney/salsa/salad on several breads, or the disparity of ketchup from the others.

Overall, Figure 4(b–e) shows many of the summaries that a human may infer about the 12 classes in Figure 4(a). Apart from visualizing deep representations, such MMF graphs are vital exploratory tools for category/scene understanding from unlabeled representations in transfer and multi-domain learning [2]. This is because, by comparing the MMF graph prior to inserting a new unlabeled instance to the one after insertion, one can infer whether the new instance contains non-trivial information that cannot be expressed as a composition of the existing categories.

Figure 4. Hierarchy and compositions of VGG-S [9] representations inferred by MMF. (a) The 12 classes; (b,c) the hierarchical structure from a 5th order MMF on FC7 representations; (d) the structure from a 4th order MMF (FC7); (e) compositions from the 3rd convolutional layer of VGG-S; and (f) compositions from the inputs (pixel-level representations). m̃ = 0.1m (from Algorithm 3).

    4.3.3 The flow of MMF graphs: An exploratory tool

Figure 4(f) shows the compositions from the 5th order MMF on the input (pixel-level) data. These features are non-informative, and, as expected, the classes whose RGB values correlate appear at ℓ = 0 in Figure 4(f). Most importantly, comparing Figure 4(b,e) we see that ℓ = 1 and 2 have the same compositions. One can construct visualizations like Figure 4(b,e,f) for all the layers of the network. Using this trajectory of the class compositions, one can ask whether a new layer needs to be added to the network (a vital aspect of model selection in deep networks [19]). This is driven by the saturation of the compositions – if the last few levels' hierarchies are similar, then the network has already learned the information in the data; on the other hand, variance in the last levels of the MMFs implies that adding another network layer may be beneficial (a minimal sketch of such a comparison is given below). The saturation at ℓ = 1, 2 in Figure 4(b,e) (see the supplement for the remaining layers' MMFs) is one such example. If these 8 classes are a priority, then the predictions from the 3rd convolutional layer of VGG-S may already be good enough. Such constructs can be tested across other layers and architectures (see the supplement for MMFs from AlexNet, VGG-S and other networks).
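One way to operationalize this saturation check – a sketch of our own, not the authors' procedure – is to compare the level-wise compositions of MMF graphs computed from consecutive layers, e.g., with a Jaccard overlap, and flag the point at which the graphs stop changing.

```python
def graph_similarity(graph_a, graph_b):
    """Mean Jaccard overlap between the k-tuples at corresponding levels of two
    MMF graphs, each given as a list of (k_tuple, wavelet) pairs."""
    depth = min(len(graph_a), len(graph_b))
    overlaps = []
    for (t_a, _), (t_b, _) in zip(graph_a[:depth], graph_b[:depth]):
        a, b = set(t_a), set(t_b)
        overlaps.append(len(a & b) / len(a | b))
    return sum(overlaps) / len(overlaps)

# If graph_similarity(graph_layer_l, graph_layer_lplus1) is already close to 1
# for the last few layers, the class-composition structure has saturated and
# adding another layer may bring little benefit (per Section 4.3.3).
```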

    5. Conclusions

We present an algorithm that uncovers the multiscale structure of symmetric matrices by performing a matrix factorization. We showed that it is an efficient importance sampler for relative leveraging of features. We also showed how the factorization sheds light on the semantics of categorical relationships encoded in deep networks, and presented ideas to facilitate adapting/modifying their architectures.

Acknowledgments: The authors are supported by NIH AG021155, EB022883, AG040396, NSF CAREER 1252725, NSF AI117924 and 1320344/1320755.


References

[1] Inceptionism: Going deeper into neural networks. 2015.
[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[3] C. Boutsidis, P. Drineas, and M. W. Mahoney. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems, pages 153–161, 2009.
[4] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[5] E. J. Candes and D. L. Donoho. Curvelets: A surprisingly effective nonadaptive representation for objects with edges. Technical report, DTIC Document, 2000.
[6] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[7] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014.
[8] S. Chandrasekaran, M. Gu, and W. Lyons. A fast adaptive solver for hierarchically semiseparable representations. Calcolo, 42(3-4):171–185, 2005.
[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[10] A. M. Cheriyadat and R. J. Radke. Non-negative matrix factorization of partial track data for motion segmentation. In 2009 IEEE 12th International Conference on Computer Vision, pages 865–872. IEEE, 2009.
[11] R. R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
[12] P. F. de Aguiar, B. Bourguignon, M. Khots, D. Massart, and R. Phan-Than-Luu. D-optimal designs. Chemometrics and Intelligent Laboratory Systems, 30(2):199–210, 1995.
[13] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, pages 362–369. IEEE, 2001.
[14] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. arXiv preprint arXiv:1506.02753, 2015.
[15] K. J. Friston, A. P. Holmes, K. J. Worsley, J.-P. Poline, C. D. Frith, and R. S. Frackowiak. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping, 2(4):189–210, 1994.
[16] M. Gavish and R. R. Coifman. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3):354–369, 2012.
[17] W. Hwa Kim, H. J. Kim, N. Adluru, and V. Singh. Latent variable graphical model selection using harmonic analysis: applications to the human connectome project (HCP). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2443–2451, 2016.
[18] A. Hyvärinen. Independent component analysis: recent advances. Phil. Trans. R. Soc. A, 371(1984):20110534, 2013.
[19] V. K. Ithapu, S. N. Ravi, and V. Singh. On architectural choices in deep learning: From network structure to gradient convergence and parameter estimation. arXiv preprint arXiv:1702.08670, 2017.
[20] V. K. Ithapu, V. Singh, O. C. Okonkwo, et al. Imaging-based enrichment criteria using deep learning algorithms for efficient clinical trials in mild cognitive impairment. Alzheimer's & Dementia, 11(12):1489–1499, 2015.
[21] R. Kondor, N. Teneva, and V. Garg. Multiresolution matrix factorization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1620–1628, 2014.
[22] R. Kondor, N. Teneva, and P. K. Mudrakarta. Parallel MMF: a multiresolution approach to matrix computation. arXiv preprint arXiv:1507.04396, 2015.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[24] G. Kutyniok et al. Shearlets: Multiscale Analysis for Multivariate Data. Springer Science & Business Media, 2012.
[25] A. B. Lee, B. Nadler, and L. Wasserman. Treelets: an adaptive multi-scale basis for sparse unordered data. The Annals of Applied Statistics, pages 435–471, 2008.
[26] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647. ACM, 2005.
[27] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911, 2015.
[28] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
[29] B. S. Manjunath and W.-Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837–842, 1996.
[30] Y. Meyer. Wavelets: Algorithms and Applications. Society for Industrial and Applied Mathematics, 1993.
[31] F. Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.
[32] J. C. Peterson, J. T. Abbott, and T. L. Griffiths. Adapting deep network features to capture psychological representations. arXiv preprint arXiv:1608.02164, 2016.
[33] S. S. Rautaray and A. Agrawal. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 43(1):1–54, 2015.
[34] R. Rubinstein, A. M. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[36] B. Savas, I. S. Dhillon, et al. Clustered low rank approximation of graphs in information science applications. In SDM, pages 164–175. SIAM, 2011.
[37] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[38] P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective structure and motion. In European Conference on Computer Vision, pages 709–720. Springer, 1996.
[39] N. Teneva, P. K. Mudrakarta, and R. Kondor. Multiresolution matrix compression. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1441–1449, 2016.
[40] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[41] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
[42] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, pages 2080–2088, 2009.
[43] J. Xu, V. K. Ithapu, L. Mukherjee, J. M. Rehg, and V. Singh. GOSUS: Grassmannian online subspace updates with structured-sparsity. In Proceedings of the IEEE International Conference on Computer Vision, pages 3376–3383, 2013.
[44] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[45] T. Zhang, B. Ghanem, and N. Ahuja. Robust multi-object tracking via cross-domain contextual information for sports video analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 985–988. IEEE, 2012.
[46] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
