
The Incremental Multiresolution Matrix Factorization Algorithm

Vamsi K. Ithapu†, Risi Kondor§, Sterling C. Johnson†, Vikas Singh†
†University of Wisconsin-Madison, §University of Chicago
http://pages.cs.wisc.edu/~vamsi/projects/incmmf.html

    Abstract

Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices – an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct "global" factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.

    1. Introduction

Matrix factorization lies at the heart of a spectrum of computer vision problems. While factorization schemes have long seen wide-ranging and extensive use within structure from motion [38], face recognition [40] and motion segmentation [10], the last decade has brought renewed interest in these ideas. Specifically, the celebrated work on low rank matrix completion [6] has enabled deployments in a broad cross-section of vision problems, from independent components analysis [18] to dimensionality reduction [42] to online background estimation [43]. Novel extensions based on Robust Principal Components Analysis [13, 6] are being developed each year.

In contrast to factorization methods, a distinct and rich body of work rooted in early signal processing is arguably even more extensively utilized in vision. Specifically, Wavelets [34] and other related ideas (curvelets [5], shearlets [24]) that loosely fall under multiresolution analysis (MRA) drive an overwhelming majority of techniques within feature extraction [29] and representation learning [34]. Wavelets also remain the "go to" tool for image denoising, compression, inpainting, shape analysis and other applications in video processing [30]. SIFT features can be thought of as a special case of the so-called Scattering Transform (built on the theory of Wavelets) [4]. Remarkably, the "network" perspective of the Scattering Transform at least partly explains the invariances being identified by deep representations, further expanding the scope of multiresolution approaches informing vision algorithms.

The foregoing discussion raises the question of whether there are any interesting bridges between Factorization and Wavelets. This line of enquiry has recently been studied for the most common "discrete" object encountered in vision – graphs. Starting from the seminal work on Diffusion Wavelets [11], others have investigated tree-like decompositions of matrices [25] and ways of organizing them using wavelets [16]. While the topic is still nascent (but evolving), these non-trivial results suggest that the confluence of these seemingly distinct topics holds much promise for vision problems [17]. Our focus is to study this interface between Wavelets and Factorization, and to demonstrate the immediate set of problems that can potentially benefit. In particular, we describe an efficient (incremental) multiresolution matrix factorization algorithm.

To concretize the argument above, consider a representative example in vision and machine learning where a factorization approach may be deployed. Figure 1 shows a set of covariance matrices computed from the representations learned by AlexNet [23] and VGG-S [9] (on some ImageNet classes [35]) and from medical imaging data, respectively. As a first line of exploration, we may be interested in characterizing the apparent parsimonious "structure" seen in these matrices. We can easily verify that invoking de facto constructs like sparsity, low rank or a decaying eigen-spectrum cannot account for the "block" or cluster-like structures inherent in this data. Such block-structured kernels were the original motivation for block low-rank and hierarchical factorizations [36, 8] – but a multiresolution scheme is much more natural – in fact, ideal – if one can decompose the matrix in a way that the blocks automatically 'reveal' themselves at multiple resolutions. Conceptually, this amounts to a sequential factorization while accounting for the fact that each level of this hierarchy must correspond to approximating some non-trivial structure in the matrix. A recent result introduces precisely such a multiresolution matrix factorization (MMF) algorithm for symmetric matrices [21].

Figure 1. Left to right: example category (or class) covariances from AlexNet, VGG-S (of a few ImageNet classes) and medical imaging data.

Consider a symmetric matrix C ∈ R^{m×m}. PCA decomposes C as Q^T Λ Q, where Q is an orthogonal matrix which, in general, is dense. On the other hand, sparse PCA (sPCA) [46] imposes sparsity on the columns of Q, allowing only a few dimensions to interact, which may fail to capture global patterns. The factorization resulting from such individual low-rank decompositions cannot capture hierarchical relationships among data dimensions. Instead, MMF applies a sequence of carefully chosen sparse rotations Q^1, Q^2, ..., Q^L to factorize C in the form

$$C = (Q^1)^\top (Q^2)^\top \cdots (Q^L)^\top \, \Lambda \, Q^L \cdots Q^2 Q^1,$$

thereby uncovering a soft hierarchical organization of the different rows/columns of C. Typically the Q^ℓ are sparse k-th order rotations (orthogonal matrices that are the identity except for at most k of their rows/columns), leading to a hierarchical tree-like matrix organization. MMF was shown to be an efficient compression tool [39] and a preconditioner [21]. Randomized heuristics have been proposed to handle large matrices [22]. Nevertheless, the factorization involves searching a combinatorial space of row/column indices, which restricts the order of the rotations to be small (typically ≤ 3). Not allowing higher order rotations restricts the richness of the allowable block structure, resulting in a hierarchical decomposition that is "too localized" to be sensible or informative (reverting back to the issues with sPCA and other block low-rank approximations).
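To make the form of the factorization concrete, below is a minimal numpy sketch (not the authors' implementation) that composes a few random k-point rotations – each the identity except on k rows/columns – and verifies that C is exactly recovered from the core Λ = Q C Q^T. In a true MMF the rotations and Λ are constrained as described above; the helper names (`k_point_rotation`) are ours for illustration only.

```python
import numpy as np

def k_point_rotation(m, idx, rng):
    """Return an m x m rotation that is the identity except on the k
    rows/columns listed in `idx` (a random k-th order rotation)."""
    k = len(idx)
    O, _ = np.linalg.qr(rng.standard_normal((k, k)))  # random k x k orthogonal block
    Q = np.eye(m)
    Q[np.ix_(idx, idx)] = O
    return Q

rng = np.random.default_rng(0)
m, k, L = 8, 3, 4
C = rng.standard_normal((m, m)); C = C @ C.T          # symmetric test matrix

# Compose L sparse rotations Q = Q^L ... Q^2 Q^1 acting on random k-tuples.
Q = np.eye(m)
for _ in range(L):
    idx = rng.choice(m, size=k, replace=False)
    Q = k_point_rotation(m, idx, rng) @ Q

Lam = Q @ C @ Q.T                                      # the "core" Λ = Q C Q^T
assert np.allclose(C, Q.T @ Lam @ Q)                   # C = Q^T Λ Q
```

Note that here Λ is not forced to be core-diagonal; the point is only to show how a product of k-point rotations remains orthogonal while touching few coordinates at a time.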

A fundamental property of MMF is the sequential composition of rotations. In this paper, we exploit the fact that the factorization can be parameterized in terms of an MMF graph defined on a sequence of higher-order rotations. Unlike alternate batch-wise approaches [39], we start with a small, randomly chosen block of C and gradually 'insert' new rows into the factorization – hence we refer to this as an incremental MMF. We show that this insertion procedure manipulates the topology of the MMF graph, thereby providing an efficient algorithm for constructing higher order MMFs. Our contributions are: (A) we present a fast and efficient incremental procedure for constructing higher order (large k) MMFs on large dense matrices; (B) we evaluate the efficacy of the higher order factorizations for relative leveraging of sets of pixels/voxels in regression tasks in vision; and (C) using the output structure of incremental MMF, we visualize the semantics of categorical relationships inferred by deep networks, and, in turn, present some exploratory tools to adapt and modify the architectures.

    2. Multiresolution Matrix Factorization

Notation: We begin with some notation. Matrices are bold upper case, vectors are bold lower case and scalars are lower case. [m] := {1, ..., m} for any m ∈ N. Given a matrix C ∈ R^{m×m} and two sets of indices S1 = {r1, ..., rk} and S2 = {c1, ..., cp}, C_{S1,S2} denotes the block of C cut out by the rows S1 and the columns S2. C_{:,i} is the i-th column of C. I_m is the m-dimensional identity. SO(m) is the group of m-dimensional orthogonal matrices with unit determinant. R^m_S is the set of m-dimensional symmetric matrices which are diagonal except for their S × S block (S-core-diagonal matrices).
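For readers following along numerically, the block-indexing conventions map directly onto numpy; this is a small illustrative sketch (the variable names are ours).

```python
import numpy as np

C = np.arange(36, dtype=float).reshape(6, 6)
C = (C + C.T) / 2                 # a symmetric matrix in R^{m x m}, m = 6

S1, S2 = [0, 2, 5], [1, 3]
block = C[np.ix_(S1, S2)]         # C_{S1,S2}: rows S1, columns S2
col_i = C[:, 2]                   # C_{:,i}: the i-th column
I_m = np.eye(6)                   # I_m

# An S-core-diagonal matrix: diagonal everywhere except its S x S block.
S = [1, 4]
A = np.diag(np.arange(1.0, 7.0))
A[np.ix_(S, S)] = [[2.0, 0.5], [0.5, 3.0]]
```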

Multiresolution matrix factorization (MMF), introduced in [21, 22], retains the locality properties of sPCA while also capturing the global interactions provided by the many variants of PCA, by applying not one but multiple sparse rotation matrices to C in sequence. We have the following.

Definition. Given an appropriate class O ⊆ SO(m) of sparse rotation matrices, a depth parameter L ∈ N and a sequence of integers m = d_0 ≥ d_1 ≥ ... ≥ d_L ≥ 1, the multiresolution matrix factorization (MMF) of a symmetric matrix C ∈ R^{m×m} is a factorization of the form

$$\mathcal{M}(C) := Q^\top \Lambda Q \quad \text{with} \quad Q = Q^L \cdots Q^2 Q^1, \qquad (1)$$

where Q^ℓ ∈ O and the block (Q^ℓ)_{[m]∖S_{ℓ−1},\,[m]∖S_{ℓ−1}} is the identity, for some nested sequence of sets [m] = S_0 ⊇ S_1 ⊇ ... ⊇ S_L with |S_ℓ| = d_ℓ, and Λ ∈ R^m_{S_L}.

S_{ℓ−1} is referred to as the 'active set' at the ℓ-th level, since Q^ℓ acts non-trivially only on S_{ℓ−1} and is the identity on [m] ∖ S_{ℓ−1}. The nesting of the S_ℓ implies that after applying Q^ℓ at some level ℓ, the S_{ℓ−1} ∖ S_ℓ rows/columns are removed from the active set and are not operated on subsequently. This active-set trimming is done at all L levels, leading to a nested-subspace interpretation for the sequence of compressions C^ℓ = Q^ℓ C^{ℓ−1} (Q^ℓ)^⊤ (with C^0 = C and Λ = C^L). In fact, [21] has shown that, for a general class of symmetric matrices, an MMF as in the definition above entails a Mallat-style multiresolution analysis (MRA) [28]. Observe that depending on the choice of Q^ℓ, only a few dimensions of C^{ℓ−1} are forced to interact, and so the composition of rotations is hypothesized to extract subtle or softer notions of structure in C.


Since multiresolution is represented as a matrix factorization here (see (1)), the S_{ℓ−1} ∖ S_ℓ columns of Q correspond to "wavelets". While d_1, d_2, ... can be any monotonically decreasing sequence, we restrict ourselves to the simplest case of d_ℓ = m − ℓ. In this setting, the number of levels L is at most m − k + 1, and each level contributes a single wavelet. Given S_1, S_2, ... and O, the matrix factorization in (1) reduces to determining the rotations Q^ℓ and the residual Λ, which is usually done by minimizing the squared Frobenius norm error

$$\min_{Q^\ell \in \mathcal{O},\ \Lambda \in \mathbb{R}^m_{S_L}} \ \| C - \mathcal{M}(C) \|^2_{\mathrm{Frob}}. \qquad (2)$$

This objective can be decomposed as a sum of contributions from each of the L levels (see Proposition 1 in [21]), which suggests computing the factorization in a greedy manner as C = C^0 ↦ C^1 ↦ C^2 ↦ ... ↦ Λ. This error decomposition is what drives much of the intuition behind our algorithms.

After ℓ − 1 levels, C^{ℓ−1} is the compression and S_{ℓ−1} is the active set. In the simplest case where O is the class of so-called k-point rotations (rotations that affect at most k coordinates) and d_ℓ = m − ℓ, at level ℓ the algorithm needs to determine three things: (a) the k-tuple t^ℓ of rows/columns involved in the rotation, (b) the non-trivial part O := Q^ℓ_{t^ℓ,t^ℓ} of the rotation matrix, and (c) s^ℓ, the index of the row/column that is subsequently designated a wavelet and removed from the active set. Without loss of generality, let s^ℓ be the last element of t^ℓ. Then the contribution of level ℓ to the squared Frobenius norm error (2) is (see supplement)

$$E(C^{\ell-1}; O; t^\ell, s^\ell) \;=\; 2\sum_{i=1}^{k-1} \Big[\, O\, C^{\ell-1}_{t^\ell, t^\ell}\, O^\top \Big]^2_{k,i} \;+\; 2\,\Big[\, O B B^\top O^\top \Big]_{k,k}, \quad \text{where } B = C^{\ell-1}_{t^\ell,\, S_{\ell-1} \setminus t^\ell}, \qquad (3)$$

and, in the definition of B, t^ℓ is treated as a set. The factorization then works by minimizing this quantity in a greedy fashion, i.e.,

$$Q^\ell, t^\ell, s^\ell \;\leftarrow\; \operatorname*{argmin}_{O,\, t,\, s}\ E(C^{\ell-1}; O; t, s); \qquad S_\ell \leftarrow S_{\ell-1} \setminus \{s^\ell\}; \qquad C^\ell = Q^\ell C^{\ell-1} (Q^\ell)^\top. \qquad (4)$$
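For readers who want to trace the greedy step numerically, the following sketch (our own, simplified) evaluates the level error in (3) for every candidate k-tuple and applies the update in (4). It fixes s as the last element of the tuple and takes O to be the eigenvector basis of the k x k block rather than searching a rotation dictionary, so it is an approximation of the exhaustive procedure described later, not the authors' exact algorithm.

```python
import numpy as np
from itertools import combinations

def level_error(Cprev, active, t):
    """Contribution of one level to the squared Frobenius error, eq. (3).
    `t` is a k-tuple of active indices; its last element plays the role of s."""
    t = list(t)
    k = len(t)
    block = Cprev[np.ix_(t, t)]
    _, V = np.linalg.eigh(block)
    O = V.T                                      # surrogate for the best rotation O
    rest = [i for i in active if i not in t]
    B = Cprev[np.ix_(t, rest)]                   # B = C^{l-1}_{t, S_{l-1} \ t}
    D = O @ block @ O.T
    err = 2 * np.sum(D[k - 1, :k - 1] ** 2) + 2 * (O @ B @ B.T @ O.T)[k - 1, k - 1]
    return err, O

def greedy_level(Cprev, active, k):
    """One greedy level: pick the k-tuple minimizing (3), then apply (4)."""
    m = Cprev.shape[0]
    best = min(combinations(active, k),
               key=lambda t: level_error(Cprev, active, t)[0])
    _, O = level_error(Cprev, active, best)
    Q = np.eye(m)
    Q[np.ix_(best, best)] = O
    Cnext = Q @ Cprev @ Q.T                      # C^l = Q^l C^{l-1} (Q^l)^T
    s = best[-1]                                 # wavelet index leaves the active set
    return Cnext, [i for i in active if i != s], best, s
```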

    3. Incremental MMF

We now motivate our algorithm using (3) and (4). Solving (2) amounts to estimating the L different k-tuples t^1, ..., t^L sequentially. At each level, the selection of the best k-tuple is clearly combinatorial, making the exact MMF computation (i.e., explicitly minimizing (2)) very costly even for k = 3 or 4 (this has been independently observed in [39]). As discussed in Section 1, higher order MMFs (with large k) are nevertheless necessary to allow arbitrary interactions among dimensions (see the supplement for a detailed study), and our proposed incremental procedure exploits some interesting properties of the factorization error and other redundancies in the k-tuple computation. The core of our proposal is the following setup.

    3.1. Overview

Let C̃ ∈ R^{(m+1)×(m+1)} be the extension of C by a single new column w = [u^T, v]^T, which manipulates C as

$$\tilde{C} = \begin{bmatrix} C & u \\ u^\top & v \end{bmatrix}. \qquad (5)$$

The goal is to compute M(C̃). Since C and C̃ share all but one row/column (see (5)), if we have access to M(C), one should, in principle, be able to modify C's underlying sequence of rotations to construct M(C̃). This avoids having to recompute everything for C̃ from scratch, i.e., performing the greedy decompositions from (4) on the entire C̃.

The hypothesis that M(C) can be manipulated to compute M(C̃) comes from the precise computations involved in the factorization. Recall (3) and the discussion leading up to that expression. At level ℓ + 1, the factorization picks the 'best' candidate rows/columns of C^ℓ – those that correlate the most with each other – so that the resulting diagonalization induces the smallest possible off-diagonal error over the rest of the active set. The components contributing to this error are driven by the inner products (C^ℓ_{:,i})^T C^ℓ_{:,j} for columns i and j. In some sense, the most strongly correlated rows/columns get picked, and adding one new entry to C^ℓ_{:,i} may not change the range of these correlations. Extending this intuition across all levels, we argue that

$$\operatorname*{argmax}_{i,j}\ \tilde{C}_{:,i}^\top \tilde{C}_{:,j} \;\approx\; \operatorname*{argmax}_{i,j}\ C_{:,i}^\top C_{:,j}. \qquad (6)$$

Hence, the k-tuples computed from C's factorization remain reasonably good candidates even after introducing w. To formalize this idea, and in the process present our algorithm, we parameterize the output structure of M(C) in terms of the sequence of rotations and the wavelets.
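A small sketch of the intuition behind (5) and (6), under a toy setup of our own: we border a covariance matrix with one new row/column and check that the most strongly correlated column pair is (typically) unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20
X = rng.standard_normal((200, m))
X[:, 5] += 0.9 * X[:, 11]                 # plant one strongly correlated pair
C = X.T @ X / X.shape[0]

u = C @ rng.standard_normal(m) * 0.05     # new off-diagonal block u (toy choice)
v = 1.0                                   # new diagonal entry
C_tilde = np.block([[C, u[:, None]],
                    [u[None, :], np.array([[v]])]])   # eq. (5)

def top_pair(A):
    """Index pair (i, j), i < j, maximizing the column inner products."""
    G = A.T @ A
    np.fill_diagonal(G, -np.inf)
    i, j = np.unravel_index(np.argmax(G), G.shape)
    return tuple(sorted((int(i), int(j))))

# With only one row/column added, the leading correlated pair rarely changes (eq. (6)).
print(top_pair(C), top_pair(C_tilde))
```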

    3.2. The graph structure of M(C)

If one has access to the sequence of k-tuples t^1, ..., t^L involved in the rotations and the corresponding wavelet indices s^1, ..., s^L, then the factorization is straightforward to compute, i.e., there is no greedy search anymore. Recall that by definition s^ℓ ∈ t^ℓ and s^ℓ ∉ S_ℓ (see (4)). To that end, for a given O and L, M(C) can be 'equivalently' represented using a depth-L MMF graph G(C). Each level of this graph shows the k-tuple t^ℓ involved in the rotation and the corresponding wavelet s^ℓ, i.e., G(C) := {t^ℓ, s^ℓ}_{1}^{L}. Interpreting the factorization in this way is notationally convenient for presenting the algorithm. More importantly, such an interpretation is central for visualizing hierarchical dependencies among the dimensions of C, and will be discussed in detail in Section 4.3. An example of such a 3rd order MMF graph constructed from a 5 × 5 matrix is shown in Figure 2 (the rows/columns are color coded for better visualization). At level ℓ = 1, s1, s2 and s3 are diagonalized while designating the rotated s1 as the wavelet. This process repeats for ℓ = 2 and 3. As shown by the color-coding of the different compositions, MMF gradually teases out higher-order correlations that can only be revealed after composing the rows/columns at one or more scales (levels here).

Figure 2. An example 5×5 matrix and its 3rd order MMF graph (better in color). Q^1, Q^2 and Q^3 are the rotations; s1, s5 and s2 are the wavelets (marked as black ellipses) at ℓ = 1, 2 and 3 respectively. The arrows indicate that a wavelet is not involved in future rotations.

For notational convenience, we denote the MMF graphs of C and C̃ as G := {t^ℓ, s^ℓ}_{1}^{L} and G̃ := {t̃^ℓ, s̃^ℓ}_{1}^{L+1}. Recall that G̃ will have one more level than G since the row/column w, indexed m + 1 in C̃, is being added (see (5)). The goal is to estimate G̃ without recomputing all the k-tuples using the greedy procedure from (4). This translates to inserting the new index m + 1 into the t^ℓ and modifying the s^ℓ accordingly. Following the discussion from Section 3.1, incremental MMF argues that inserting this one new element into the graph will not result in global changes in its topology. Clearly, in the pathological case G may change arbitrarily, but as argued earlier (see the discussion around (6)) the chance of this happening for non-random matrices with reasonably large k is small. The core operation then is to compare the new k-tuples resulting from the addition of w to the best ones from [m]^k provided via G. If a newer k-tuple gives a smaller error (see (3)), it knocks out an existing k-tuple. This constructive insertion and knock-out procedure is the incremental MMF.

    3.3. Inserting a new row/column

The basis for this incremental procedure is that one has access to G (i.e., the MMF of C). We first present the algorithm assuming that this "initialization" is provided, and revisit this aspect shortly. The procedure starts by setting t̃^ℓ = t^ℓ and s̃^ℓ = s^ℓ for ℓ = 1, ..., L. Let I be the set of elements (indices) that need to be inserted into G. At the start (the first level), I = {m + 1}, corresponding to w. Let t̃^1 = {p_1, ..., p_k}. The new k-tuples that account for inserting entries of I are {m + 1} ∪ t̃^1 ∖ {p_i} for i = 1, ..., k. These k new candidates are the probable alternatives to the existing t̃^1. Once the best among these k + 1 candidates is chosen, an existing p_i from t̃^1 may be knocked out.

If s̃^1 gets knocked out, then I = {s̃^1} for future levels. This follows from the MMF construction, where wavelets at the ℓ-th level are not involved in later levels. Since s̃^1 is knocked out, it becomes the new element to be inserted according to G. On the other hand, if one of the k − 1 scaling functions is knocked out, I is not updated. This simple process is repeated sequentially from ℓ = 1 to L. At level L + 1, there are no estimates for t̃^{L+1} and s̃^{L+1}, and so the procedure simply selects the best k-tuple from the remaining active set S̃_L. Algorithm 1 summarizes this insertion and knock-out procedure.

Algorithm 1 INSERTROW(C, w, {t^ℓ, s^ℓ}_{ℓ=1}^{L})
Output: {t̃^ℓ, s̃^ℓ}_{ℓ=1}^{L+1}
  C̃^0 ← C̃ as in (5); z^1 ← m + 1
  for ℓ = 1 to L − 1 do
    {t̃^ℓ, s̃^ℓ, z^{ℓ+1}, Q^ℓ} ← CHECKINSERT(C̃^{ℓ−1}; t^ℓ, s^ℓ, z^ℓ)
    C̃^ℓ = Q^ℓ C̃^{ℓ−1} (Q^ℓ)^⊤
  end for
  T ← GENERATETUPLES([m + 1] ∖ ∪_{ℓ=1}^{L−1} s̃^ℓ)
  {Õ, t̃^L, s̃^L} ← argmin_{O, t∈T, s∈t} E(C̃^{L−1}; O; t, s)
  Q^L = I_{m+1}; Q^L_{t̃^L, t̃^L} = Õ; C̃^L = Q^L C̃^{L−1} (Q^L)^⊤

Algorithm 2 CHECKINSERT(A, t̂, ŝ, z)
Output: t̃, s̃, z, Q
  T ← GENERATETUPLES(t̂, z)
  {Õ, t̃, s̃} ← argmin_{O, t∈T, s∈t} E(A; O; t, s)
  if s̃ ∈ z then
    z ← (z ∪ ŝ) ∖ s̃
  end if
  Q = I_{m+1}; Q_{t̃,t̃} = Õ
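To make the knock-out step concrete, here is a simplified sketch of the candidate generation and comparison inside CHECKINSERT (our own rendering; it takes the tuple-scoring routine as a parameter, e.g., the `level_error` sketch given after (4), and omits the bookkeeping of the rotation Q).

```python
import numpy as np

def check_insert(error_fn, t_old, s_old, insert_set):
    """One insertion/knock-out step (a sketch of CHECKINSERT).
    `error_fn(t)` scores a candidate k-tuple, e.g. via eq. (3)."""
    t_old = list(t_old)
    candidates = [tuple(t_old)]
    for z in insert_set:
        for p in t_old:                           # swap z in for each existing index p
            candidates.append(tuple([q for q in t_old if q != p] + [z]))
    errors = [error_fn(t) for t in candidates]
    t_new = candidates[int(np.argmin(errors))]
    s_new = t_new[-1]                             # wavelet index (last element, as in the text)
    insert_set = set(insert_set)
    if s_new in insert_set:                       # the previous wavelet gets knocked out
        insert_set = (insert_set - {s_new}) | {s_old}
    return t_new, s_new, insert_set
```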

    3.4. Incremental MMF Algorithm

Observe that Algorithm 1 is for the setting from (5), where one extra row/column is added to a given MMF; clearly, the incremental procedure can be repeated as more and more rows/columns are added. Algorithm 3 summarizes this incremental factorization for arbitrarily large and dense matrices. It has two components: an initialization on some randomly chosen small block (of size m̃ × m̃) of the entire matrix C, followed by insertion of the remaining m − m̃ rows/columns using Algorithm 1 in a streaming fashion (similar to w from (5)). The initialization entails computing a batch-wise MMF on this small block (m̃ ≥ k).

BATCHMMF: Note that at each level ℓ, the error criterion in (3) can be explicitly minimized via an exhaustive search over all possible k-tuples from S_{ℓ−1} (the active set) and a randomly chosen (using properties of the QR decomposition [31]) dictionary of k-th order rotations. If the dictionary is large enough, this exhaustive procedure leads to the smallest possible decomposition error (see (2)). However, it is easy to see that it is combinatorially expensive, with an overall complexity of O(n^k) [22], and will not scale well beyond k = 4 or so. Note from Algorithm 1 that the error criterion E(·) in the second stage, which inserts the remaining m − m̃ rows, performs an exhaustive search as well.

Algorithm 3 INCREMENTALMMF(C)
Output: M(C)
  C̄ = C_{[m̃],[m̃]}; L = m − k + 1
  {t^ℓ, s^ℓ}_{1}^{m̃−k+1} ← BATCHMMF(C̄)
  for j ∈ {m̃ + 1, ..., m} do
    {t^ℓ, s^ℓ}_{1}^{j−k+1} ← INSERTROW(C̄, C_{j,:}, {t^ℓ, s^ℓ}_{1}^{j−k})
    C̄ = C_{[j],[j]}
  end for
  M(C) := {t^ℓ, s^ℓ}_{1}^{L}
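An end-to-end skeleton of the streaming loop in Algorithm 3, written as a hedged sketch: `batch_mmf` and `insert_row` are hypothetical stand-ins for BATCHMMF and Algorithm 1 (they are assumed to return and accept the MMF graph {t^ℓ, s^ℓ}).

```python
import numpy as np

def incremental_mmf(C, m_init, batch_mmf, insert_row):
    """Skeleton of Algorithm 3: batch-factorize a small leading block, then
    stream in the remaining rows/columns one at a time.
    `batch_mmf(block)` and `insert_row(block, new_row, graph)` are assumed to
    implement BATCHMMF and INSERTROW respectively (hypothetical helpers)."""
    m = C.shape[0]
    graph = batch_mmf(C[:m_init, :m_init])   # initialization on an m~ x m~ block
    for j in range(m_init, m):
        new_row = C[j, :j + 1]               # row j restricted to the current block
        graph = insert_row(C[:j + 1, :j + 1], new_row, graph)
    return graph                              # M(C) parameterized as {t^l, s^l}
```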

Other Variants: There are two alternatives that avoid this exhaustive search. Since Q^ℓ's job is to diagonalize some k rows/columns (see the definition in Section 2), one can simply pick the relevant k × k block of C^{ℓ−1} and compute the best O (for a given t^ℓ). Hence the first alternative is to bypass the search over O in (4), and simply use the eigenvectors of C^{ℓ−1}_{t^ℓ,t^ℓ} for some tuple t^ℓ. Nevertheless, the search over S_{ℓ−1} for t^ℓ still makes this approximation reasonably costly. Instead, the k-tuple selection may be approximated while keeping the exhaustive search over O intact [22]. Since diagonalization effectively nullifies correlated dimensions, the best k-tuple can be taken to be the k rows/columns that are maximally correlated. This is done by choosing some s_1 ∼ S_{ℓ−1} (from the current active set), and picking the rest by

$$s_2, \ldots, s_k \;\leftarrow\; \operatorname*{argmax}_{s_i \sim S_{\ell-1} \setminus s_1}\ \sum_{i=2}^{k} \frac{(C^{\ell-1}_{:,s_1})^\top C^{\ell-1}_{:,s_i}}{\|C^{\ell-1}_{:,s_1}\|\,\|C^{\ell-1}_{:,s_i}\|}. \qquad (7)$$

This second heuristic (which is related to (6) from Section 3.1) has been shown to be robust [22]; however, for large k it might miss some k-tuples that are vital to the quality of the factorization. Depending on m̃ and the available computational resources, these alternatives can be used instead of the exhaustive procedure proposed earlier for the initialization. Overall, the incremental procedure scales efficiently to very large matrices, compared to using the batch-wise scheme on the entire matrix.
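A sketch of the two shortcuts (our own rendering, not the reference implementation): the rotation taken from the eigenvectors of the k × k block, and the correlation-based k-tuple selection of (7).

```python
import numpy as np

def rotation_from_block(Cprev, t):
    """First shortcut: skip the search over O and use the eigenvectors of the
    k x k block C^{l-1}_{t,t} as the non-trivial part of Q^l."""
    _, V = np.linalg.eigh(Cprev[np.ix_(t, t)])
    return V.T

def correlated_tuple(Cprev, active, k, rng):
    """Second shortcut (eq. (7)): anchor on a random active index s1 and pick
    the k-1 active columns with the largest normalized inner products with s1."""
    active = list(active)
    s1 = int(rng.choice(active))
    cols = Cprev[:, active]
    norms = np.linalg.norm(cols, axis=0) + 1e-12
    corr = (cols.T @ Cprev[:, s1]) / (norms * np.linalg.norm(Cprev[:, s1]))
    order = np.argsort(-corr)                       # descending correlation
    rest = [active[i] for i in order if active[i] != s1][:k - 1]
    return [s1] + rest
```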

    4. Experiments

We study various computer vision and medical imaging scenarios (see the supplement for details) to evaluate the quality of the incremental MMF factorization and show its utility. We first provide evidence for the factorization's efficacy in selecting the relevant features of interest for regression. We then show that the resulting MMF graph is a useful tool for visualizing/decoding the learned task-specific representations.

    4.1. Incremental versus Batch MMF

The first set of evaluations compares the incremental MMF to the batch version (including the exhaustive-search variant and the two approximate variants from Section 3.4). Recall that the MMF error is the off-diagonal norm of Λ, excluding the S_L × S_L block (see (1)); the smaller the error, the closer the factorization is to being exact (see (2)). We observed that the incremental MMFs incur approximately the same error as the batch versions, while achieving roughly a 20–25× speed-up compared to a single-core implementation of the batch MMF. Specifically, across 6 different toy examples and 3 covariance matrices constructed from real data, the loss in factorization error is at most about 4% of ‖C‖_Frob, with no strong dependence on the fraction of the initialization m̃ (see Algorithm 3). Due to space restrictions, these simulations are included in the supplement.

    4.2. MMF Scores

The objective value of MMF (see (2)) is the signal that is not accounted for by the k-th order rotations of MMF (it is 0 whenever C is exactly factorizable). Hence, ‖(C − M(C))_{i,:}‖ is a measure of the extra information in the i-th row that cannot be reproduced by hierarchical compositions of the rest. Such value-of-information summaries, referred to as MMF scores, computed for all the dimensions of C give an importance sampling distribution, similar to statistical leverage scores [3, 27]. Such samplers drive several regression tasks in vision, including gesture tracking [33], face alignment/tracking [7] and medical imaging [15]. Moreover, the authors of [27] have shown that statistical-leverage-type marginal importance samplers may not be optimal for regression. On the other hand, MMF scores give the conditional importance or "relative leverage" of each dimension/feature given the remaining ones. Because MMF encodes the hierarchical block structure in the covariance, the MMF scores provide better importance samplers than statistical leverages. We first demonstrate this on a large dataset with 80 predictors/features and 1300 instances. Figure 3(a,b) shows the instance covariance matrices after selecting the 'best' 5% of features. The block structure representing the two classes, diseased and non-diseased, is clearly more apparent with MMF score sampling (see the yellow block vs. the rest in Figure 3(b)).
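A sketch of how MMF scores could be computed and used as an importance sampler. It assumes some routine `mmf_reconstruct(C)` that returns the MMF approximation M(C); the function names are ours, for illustration.

```python
import numpy as np

def mmf_scores(C, mmf_reconstruct):
    """Row-wise residual norms ||(C - M(C))_{i,:}||, normalized into a
    sampling distribution over features (the 'relative leverage')."""
    R = C - mmf_reconstruct(C)               # signal not captured by the factorization
    scores = np.linalg.norm(R, axis=1)
    return scores / scores.sum()

# Usage: sample, say, 5% of the features proportionally to their MMF score.
# probs = mmf_scores(C, mmf_reconstruct)
# chosen = np.random.default_rng(0).choice(
#     C.shape[0], size=int(0.05 * C.shape[0]), replace=False, p=probs)
```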

We exhaustively compared leverage scores and MMF scores on a medical imaging regression task using region-of-interest (ROI) summaries from positron emission tomography (PET) images. The goal is to predict a cognitive score summary using the imaging ROIs (see Figure 3(c), and the supplement for details). Although the features have a high degree of block structure (covariance matrix in Figure 3(c)), this information is rarely, if ever, utilized within downstream models, say a multi-linear regression for predicting health status. Here we train a linear model using a fraction of these voxel ROIs sampled according to statistical leverages (from [3]) and the relative leverage from MMF scores. Note that, unlike LASSO, the feature samplers are agnostic to the responses (a setting similar to optimal experimental design [12]). The second row of Figure 3 shows the adjusted R² of the resulting linear models, and Figure 3(h,i) shows the corresponding F-statistic. The x-axis corresponds to the fraction of the ROIs selected using the leverage scores (black lines) and MMF scores (red lines). As shown by the red vs. black curves, the voxel ROIs picked by MMF are better both in terms of adjusted R² (the explainable variance of the data) and F-statistic (the overall significance). More importantly, the first few ROIs picked by MMF scores are more informative than those from leverage scores (left end of the x-axis in Figure 3(d–i)). Figure 3(j,k) shows the gain in the AUC of adjusted R² as the order of the MMF changes (x-axis). Clearly the performance gain of MMF scores is large. The error bars in these plots are omitted for clarity (see the supplement for details and other plots/comparisons). These results show that MMF scores can be used within numerous regression tasks where the number of predictors is large relative to the sample size.

    4.3. MMF graphs

The ability of a feature to succinctly represent the presence of an object/scene is, at least in part, governed by the relationship of the learned representations across multiple object classes/categories. Beyond object-specific information, such cross-covariate contextual dependencies have been shown to improve performance in object tracking and recognition [45] and in medical applications [20] (a motivating aspect of adversarial learning [26]). Visualizing histogram of oriented gradients (HOG) features is one such interesting result, demonstrating a scenario where a correctly learned representation leads to a false positive [41]; for instance, the HOG features of a duck image are similar to those of a car. [37, 14] have addressed similar aspects for deep representations by visualizing image classification and detection models, and there is recent interest in designing tools for visualizing what the network perceives when predicting a test label [44]. As shown in [1], the contextual images that a deep network (even one with good detection power) desires to see may not even correspond to real-world scenarios.

The evidence from these works motivates a simple question – do the semantic relationships learned by deep representations agree with those seen by humans? For instance, can such models infer that cats are closer to dogs than they are to bears, or that bread goes well with butter/cream rather than, say, salsa? Invariably, addressing these questions amounts to learning hierarchical and categorical relationships in the class-covariance of hidden representations. Classical techniques may not easily reveal interesting, human-relatable trends, as was shown very recently by [32]. There are at least a few reasons for this, but most importantly, the covariance of hidden representations (in general) has parsimonious structure with multiple compositions of blocks (the left two images in Figure 1 are from AlexNet and VGG-S). As motivated in Section 1, and later described in Section 3.2 using Figure 2, an MMF graph is the natural object with which to analyze such parsimonious structure.

    4.3.1 Decoding the deep

A direct application of MMF to the covariance of hidden representations reveals interesting hierarchical structure about the "perception" of deep networks. To walk through these compositions precisely, consider the last hidden layer (FC7, which feeds into the softmax) representations from a VGG-S network [9] corresponding to 12 different ImageNet classes, shown in Figure 4(a). Figure 4(b,c) visualizes a 5th order MMF graph learned on this class covariance matrix.

The semantics of breads and sides. The 5th order MMF says that the five categories – pita, limpa, chapati, chutney and bannock – are most representative of the localized structure in the covariance. Observe that these are four different flour-based main courses, and a side (chutney) that shared the strongest context with the images of chapati in the training data (similar to the body building and dumbbell images from [1]). MMF then picks the salad, salsa and saute representations at the 2nd level, claiming that they relate most strongly to the composition of breads and chutney from the previous level (see the visualization in Figure 4(b,c)). Observe that these are in fact the sides offered/served with bread. Although VGG-S was not trained to predict these relations, according to MMF the representations are inherently learning them anyway – a fascinating aspect of deep networks, i.e., they are seeing what humans may infer about these classes.

Any dressing? What are my dessert options? Let us move to the 3rd level in Figure 4(b,c). Margarine is a cheese-based dressing. Shortcake is a dessert-type dish made from strawberry (which shows up at the 4th level) and bread (the composition from the previous levels). That is the full course. The last level corresponds to ketchup, which is an outlier, distinct from the rest of the 10 classes – a typical order of dishes involving the chosen breads and sides does not include hot sauce or ketchup. Although shortcake is made up of strawberries, "conditioned" on the 1st and 2nd level dependencies it is less useful in summarizing the covariance structure. An interesting summary of this hierarchy from Figure 4(b,c) is that an order of pita with a side of ketchup or strawberries is atypical in the data seen by these networks.

Figure 3. Evaluating feature importance sampling with MMF scores vs. leverage scores. (a,b) Visualizing apparent (if any) blocks in instance covariance matrices using the best 5% of features (leverage score vs. MMF score sampling); (c) the regression setup – constructing the ROI data from amyloid PET images (note the structure in the feature covariance); (d–g) adjusted R² and (h,i) F-statistic of the linear models vs. the fraction of ROIs used; (j,k) gains in R² AUC vs. MMF order. Mdl1–Mdl4 are linear models constructed on different datasets (see supplement). m̃ = 0.1m (from Algorithm 3) for these evaluations.

    4.3.2 Are we reading tea leaves?

It is reasonable to ask whether this description is meaningful, since the semantics drawn above are subjective. We provide explanations below. First, the networks were not trained to learn the hierarchy of categories – the task was object/class detection. Hence, the relationships are entirely a by-product of the power of deep networks to learn contextual information, and of the ability of MMF to model these compositions by uncovering the structure in the covariance matrix. The supplement provides further evidence by visualizing such hierarchies for a few dozen other ImageNet classes. Second, one may ask whether the compositions are sensitive/stable with respect to the order k – a critical hyperparameter of MMF. Figure 4(d) uses a 4th order MMF, and the resulting hierarchy is similar to that from Figure 4(b). Specifically, the different breads and sides show up early, and the most distinct categories (strawberry and ketchup) appear at the higher levels. Similar patterns are seen for other choices of k (see supplement). Further, if the class hierarchy in Figures 4(b–d) is non-spurious, then similar trends should be implied by MMFs on different (higher) layers of VGG-S. Figure 4(e) shows the compositions from the 10th layer representations (the outputs of the 3rd convolutional layer of VGG-S) of the 12 classes in Figure 4(a). The strongest compositions, the 8 classes from ℓ = 1 and 2, are already picked up halfway through VGG-S, providing further evidence that the compositional structure implied by MMF is data-driven. We discuss this further in Section 4.3.3. Finally, we compared MMF's class compositions to the hierarchical clusters obtained from agglomerative clustering of the representations. The relationships in Figure 4(b–d) are not apparent in the corresponding dendrograms (see supplement, [32]) – for instance, the dependency of chutney/salsa/salad on several breads, or the disparity of ketchup from the others.

Overall, Figure 4(b–e) shows many of the summaries that a human may infer about the 12 classes in Figure 4(a). Apart from visualizing deep representations, such MMF graphs are vital exploratory tools for category/scene understanding from unlabeled representations in transfer and multi-domain learning [2]. This is because, by comparing the MMF graph prior to inserting a new unlabeled instance to the one after insertion, one can infer whether the new instance contains non-trivial information that cannot be expressed as a composition of the existing categories.

Figure 4. Hierarchy and compositions of VGG-S [9] representations inferred by MMF. (a) The 12 classes; (b,c) the hierarchical structure from a 5th order MMF on FC7 representations; (d) the structure from a 4th order MMF (FC7); (e) compositions from the 3rd convolutional layer of VGG-S; and (f) compositions from the inputs (pixel-level representations). m̃ = 0.1m (from Algorithm 3).

    4.3.3 The flow of MMF graphs: An exploratory tool

Figure 4(f) shows the compositions from the 5th order MMF on the input (pixel-level) data. These features are non-informative, and, as expected, the classes whose RGB values correlate appear at ℓ = 0 in Figure 4(f). Most importantly, comparing Figure 4(b,e) we see that ℓ = 1 and 2 have the same compositions. One can construct visualizations like Figure 4(b,e,f) for all the layers of the network. Using this trajectory of the class compositions, one can ask whether a new layer needs to be added to the network (a vital aspect of model selection in deep networks [19]). This is driven by the saturation of the compositions – if the last few levels' hierarchies are similar, then the network has already learned the information in the data; on the other hand, variance in the last levels of the MMFs implies that adding another network layer may be beneficial (a minimal sketch of such a comparison is given below). The saturation at ℓ = 1, 2 in Figure 4(b,e) (see the supplement for the remaining layers' MMFs) is one such example. If these 8 classes are a priority, then the predictions from the 3rd convolutional layer of VGG-S may already be good enough. Such constructs can be tested across other layers and architectures (see the supplement for MMFs from AlexNet, VGG-S and other networks).
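One way to operationalize this saturation check – a sketch of our own, not the authors' procedure – is to compare the level-wise compositions of MMF graphs computed from consecutive layers, e.g., with a Jaccard overlap, and flag the point at which the graphs stop changing.

```python
def graph_similarity(graph_a, graph_b):
    """Mean Jaccard overlap between the k-tuples at corresponding levels of two
    MMF graphs, each given as a list of (k_tuple, wavelet) pairs."""
    depth = min(len(graph_a), len(graph_b))
    overlaps = []
    for (t_a, _), (t_b, _) in zip(graph_a[:depth], graph_b[:depth]):
        a, b = set(t_a), set(t_b)
        overlaps.append(len(a & b) / len(a | b))
    return sum(overlaps) / len(overlaps)

# If graph_similarity(graph_layer_l, graph_layer_lplus1) is already close to 1
# for the last few layers, the class-composition structure has saturated and
# adding another layer may bring little benefit (per Section 4.3.3).
```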

    5. Conclusions

We present an algorithm that uncovers the multiscale structure of symmetric matrices by performing a matrix factorization. We showed that it is an efficient importance sampler for relative leveraging of features. We also showed how the factorization sheds light on the semantics of categorical relationships encoded in deep networks, and presented ideas to facilitate adapting/modifying their architectures.

Acknowledgments: The authors are supported by NIH AG021155, EB022883, AG040396, NSF CAREER 1252725, NSF AI117924 and 1320344/1320755.


References

[1] Inceptionism: Going deeper into neural networks. 2015.
[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[3] C. Boutsidis, P. Drineas, and M. W. Mahoney. Unsupervised feature selection for the k-means clustering problem. In Advances in Neural Information Processing Systems, pages 153–161, 2009.
[4] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[5] E. J. Candes and D. L. Donoho. Curvelets: A surprisingly effective nonadaptive representation for objects with edges. Technical report, DTIC Document, 2000.
[6] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[7] X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, 2014.
[8] S. Chandrasekaran, M. Gu, and W. Lyons. A fast adaptive solver for hierarchically semiseparable representations. Calcolo, 42(3-4):171–185, 2005.
[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531, 2014.
[10] A. M. Cheriyadat and R. J. Radke. Non-negative matrix factorization of partial track data for motion segmentation. In 2009 IEEE 12th International Conference on Computer Vision, pages 865–872. IEEE, 2009.
[11] R. R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
[12] P. F. de Aguiar, B. Bourguignon, M. Khots, D. Massart, and R. Phan-Than-Luu. D-optimal designs. Chemometrics and Intelligent Laboratory Systems, 30(2):199–210, 1995.
[13] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, pages 362–369. IEEE, 2001.
[14] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. arXiv preprint arXiv:1506.02753, 2015.
[15] K. J. Friston, A. P. Holmes, K. J. Worsley, J.-P. Poline, C. D. Frith, and R. S. Frackowiak. Statistical parametric maps in functional imaging: a general linear approach. Human Brain Mapping, 2(4):189–210, 1994.
[16] M. Gavish and R. R. Coifman. Sampling, denoising and compression of matrices by coherent matrix organization. Applied and Computational Harmonic Analysis, 33(3):354–369, 2012.
[17] W. Hwa Kim, H. J. Kim, N. Adluru, and V. Singh. Latent variable graphical model selection using harmonic analysis: applications to the human connectome project (HCP). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2443–2451, 2016.
[18] A. Hyvärinen. Independent component analysis: recent advances. Phil. Trans. R. Soc. A, 371(1984):20110534, 2013.
[19] V. K. Ithapu, S. N. Ravi, and V. Singh. On architectural choices in deep learning: From network structure to gradient convergence and parameter estimation. arXiv preprint arXiv:1702.08670, 2017.
[20] V. K. Ithapu, V. Singh, O. C. Okonkwo, et al. Imaging-based enrichment criteria using deep learning algorithms for efficient clinical trials in mild cognitive impairment. Alzheimer's & Dementia, 11(12):1489–1499, 2015.
[21] R. Kondor, N. Teneva, and V. Garg. Multiresolution matrix factorization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1620–1628, 2014.
[22] R. Kondor, N. Teneva, and P. K. Mudrakarta. Parallel MMF: a multiresolution approach to matrix computation. arXiv preprint arXiv:1507.04396, 2015.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[24] G. Kutyniok et al. Shearlets: Multiscale Analysis for Multivariate Data. Springer Science & Business Media, 2012.
[25] A. B. Lee, B. Nadler, and L. Wasserman. Treelets: an adaptive multi-scale basis for sparse unordered data. The Annals of Applied Statistics, pages 435–471, 2008.
[26] D. Lowd and C. Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647. ACM, 2005.
[27] P. Ma, M. W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16:861–911, 2015.
[28] S. G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989.
[29] B. S. Manjunath and W.-Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837–842, 1996.
[30] Y. Meyer. Wavelets: Algorithms and Applications. Society for Industrial and Applied Mathematics, 1993.
[31] F. Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.
[32] J. C. Peterson, J. T. Abbott, and T. L. Griffiths. Adapting deep network features to capture psychological representations. arXiv preprint arXiv:1608.02164, 2016.
[33] S. S. Rautaray and A. Agrawal. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 43(1):1–54, 2015.
[34] R. Rubinstein, A. M. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
[35] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[36] B. Savas, I. S. Dhillon, et al. Clustered low rank approximation of graphs in information science applications. In SDM, pages 164–175. SIAM, 2011.
[37] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[38] P. Sturm and B. Triggs. A factorization based algorithm for multi-image projective structure and motion. In European Conference on Computer Vision, pages 709–720. Springer, 1996.
[39] N. Teneva, P. K. Mudrakarta, and R. Kondor. Multiresolution matrix compression. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 1441–1449, 2016.
[40] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[41] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
[42] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, pages 2080–2088, 2009.
[43] J. Xu, V. K. Ithapu, L. Mukherjee, J. M. Rehg, and V. Singh. GOSUS: Grassmannian online subspace updates with structured-sparsity. In Proceedings of the IEEE International Conference on Computer Vision, pages 3376–3383, 2013.
[44] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
[45] T. Zhang, B. Ghanem, and N. Ahuja. Robust multi-object tracking via cross-domain contextual information for sports video analysis. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 985–988. IEEE, 2012.
[46] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.
