
Unsupervised Co-Learning on G-Manifolds Across Irreducible Representations

Yifeng Fan1 Tingran Gao2 Zhizhen Zhao1

1 University of Illinois at Urbana-Champaign   2 University of Chicago
{yifengf2, zhizhenz}@illinois.edu   [email protected]

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We introduce a novel co-learning paradigm for manifolds naturally admitting an action of a transformation group G, motivated by recent developments on learning a manifold from attached fibre bundle structures. We utilize a representation theoretic mechanism that canonically associates multiple independent vector bundles over a common base manifold, which provides multiple views of the geometry of the underlying manifold. The consistency across these fibre bundles provides a common ground for performing unsupervised manifold co-learning through the redundancy created artificially across irreducible representations of the transformation group. We demonstrate the efficacy of the proposed algorithmic paradigm through drastically improved robust nearest neighbor identification in cryo-electron microscopy image analysis and improved clustering accuracy in community detection.

1 Introduction

Fighting the curse of dimensionality by leveraging low-dimensional intrinsic structures has become an important guiding principle in modern data science. Apart from the classical structural assumptions commonly employed in sparsity or low-rank models in high dimensional statistics [63, 11, 12, 49, 2, 64, 67], it has recently become of interest to leverage more intricate properties of the underlying geometric model, motivated by algebraic or differential geometry techniques, for efficient learning and inference from massive complex datasets [15, 16, 44, 46, 8]. The assumption that high dimensional datasets lie approximately on a low-dimensional manifold, known as the manifold hypothesis, has been the cornerstone for the development of manifold learning [62, 52, 18, 3, 4, 5, 17, 57, 66] over the past few decades.

In many real applications, the low-dimensional manifold underlying a dataset of high ambient dimensionality admits additional structures that can be fully leveraged to gain deeper insights into the geometry of the data. One class of such examples arises in scientific fields such as cryo-electron microscopy (cryo-EM), where large numbers of random projections of a three-dimensional molecule generate massive collections of images that can be determined only up to in-plane rotations [59, 72]. Another source of examples is applications in computer vision and robotics, where a major challenge is to recognize and compare three-dimensional spatial configurations up to the action of Euclidean or conformal groups [28, 10]. In these examples, the dataset of interest consists of images or shapes of potentially high spatial resolution, and admits a natural group action g ∈ G that plays the role of a nuisance or latent variable that needs to be "quotiented out" before useful information is revealed.

In geometric terms, on top of a differentiable manifold M underlying the dataset of interest, as assumed in the manifold hypothesis, we also assume the manifold admits a smooth right action of a Lie group G, in the sense that there is a smooth map φ : G × M → M satisfying φ(e, m) = m and φ(g_2, φ(g_1, m)) = φ(g_1 g_2, m) for all m ∈ M and g_1, g_2 ∈ G, where e is the identity element of G. A left action can be defined similarly. Such a group action reflects abundant information about the symmetry of the underlying manifold, with which one can study geometric and topological properties of the underlying manifold through the lens of the orbits, stabilizers, or induced finite- or infinite-dimensional representations of G. In the modern differential and symplectic geometry literature, a smooth manifold M admitting the action of a Lie group G is often referred to as a G-manifold (see e.g. [40, §6], [50, 1, 33] and references therein), and this transformation-centered methodology has proven fruitful [42, 53, 40, 30] across several generations of geometers and topologists.

Recent developments in manifold learning have started to digest and incorporate the additional information encoded in the G-actions on the low-dimensional manifold underlying high-dimensional data. In [36], the authors constructed a steerable graph Laplacian on the manifold of images, modeled as a rotationally invariant manifold (or U(1)-manifold in geometric terms), which serves the role of the graph Laplacian in manifold learning but with rotational invariance built in by construction. In [38], the authors proposed a principal bundle model for image denoising, which achieved state-of-the-art performance by combining patch-based image analysis with rotationally invariant distances in microscopy [47]. A major contribution of this paper is to provide deeper insights into the success of these group-transformation-based manifold learning techniques from the perspective of multi-view learning [56, 60, 37] or co-training [7], and to propose a family of new methods that systematically utilize this additional information by exploiting the inherent consistency across representation theoretic patterns. Motivated by the recent line of research bridging manifold learning with principal and associated fibre bundles [57, 58, 22, 20, 19], we point out that a G-manifold admitting a principal bundle structure is naturally associated with as many vector bundles as there are distinct irreducible representations of the transformation group G, and each of these vector bundles provides a separate "view" for unveiling the geometry of the common base manifold on which all the fibre bundles reside.

Specifically, the main contributions of this paper are summarized as follows: (1) We propose a new unsupervised co-learning paradigm on G-manifolds and introduce an optimal alignment affinity measure for high-dimensional data points that lie on or close to a lower dimensional G-manifold, using both the local cycle consistency of group transformations on the manifold (graph) and the algebraic consistency of the unitary irreducible representations of the transformations; (2) We introduce invariant-moment affinities in order to bypass the computationally intensive pairwise optimal alignment search and efficiently learn the underlying local neighborhood structure; and (3) We empirically demonstrate that our new framework is extremely robust to noise and apply it to improve cryo-EM image analysis and the clustering accuracy in community detection. Code is available at https://github.com/frankfyf/G-manifold-learning.

2 Related Work

Manifold Learning: After the ground-breaking works of [62, 52], [5, 56, 41] provided reproducing kernel Hilbert space frameworks for scalar- and vector-valued kernels and interpreted the manifold assumption as a specific type of regularization; [3, 4, 14] used the estimated eigenfunctions of the Laplace–Beltrami operator to parametrize the underlying manifold; [24, 25, 59] investigated the representation theoretic patterns of an integral operator acting on certain complex line bundles over the unit two-sphere that arises naturally in cryo-EM image analysis; [57, 58, 22] demonstrated the benefit of using differential operators defined on fibre bundles over the manifold, instead of the Laplace–Beltrami operator on the manifold itself, in manifold learning tasks. Recently, [20, 19, 23] proposed to utilize the consistency across multiple irreducible representations of a compact Lie group to improve spectral-decomposition-based algorithms.

Co-training and Multi-view Learning: In their seminal work [7], Blum and Mitchell demonstrated both in theory and in practice that distinct "views" of a dataset can be combined to improve the performance of learning tasks, through their complementary yet consistent predictions for unlabelled data. Similar ideas exploiting the consistency of the information contained in different sets of features have long existed in the statistical literature, for instance canonical correlation analysis [29]. Since then, multi-view learning has remained a powerful idea percolating through aspects of machine learning ranging from supervised and semi-supervised learning to active learning and transfer learning [21, 43, 61, 13, 55, 56, 34, 35]. See the surveys [60, 69, 70, 37] for more detailed accounts.

3 Geometric Motivation

We first provide a brief overview of the key concepts from elementary group representation theory used in this paper. Interested readers are referred to [54, 9] for more details.


Groups and Representations: A group G is a set with an operation G × G → G obeying the following axioms: (1) ∀g_1, g_2 ∈ G, g_1 g_2 ∈ G; (2) ∀g_1, g_2, g_3 ∈ G, g_1(g_2 g_3) = (g_1 g_2)g_3; (3) there is a unique e ∈ G, called the identity of G, such that eg = ge = g, ∀g ∈ G; (4) ∀g ∈ G, there is a corresponding element g^{-1} ∈ G, called the inverse of g, such that g g^{-1} = g^{-1} g = e. A d_ρ-dimensional representation of a group G over a field F is a matrix-valued function ρ : G → F^{d_ρ × d_ρ} such that ρ(g_1)ρ(g_2) = ρ(g_1 g_2), ∀g_1, g_2 ∈ G. In this paper, we assume F = C. A representation ρ is said to be unitary if ρ(g^{-1}) = ρ(g)^* for all g ∈ G, and ρ is said to be reducible if it can be decomposed into a direct sum of lower-dimensional representations as ρ(g) = Q^{-1}(ρ_1(g) ⊕ ρ_2(g))Q for some invertible matrix Q ∈ C^{d_ρ × d_ρ}; otherwise ρ is irreducible. The symbol ⊕ denotes the direct sum. For a compact group, there exists a complete set of inequivalent irreducible representations (in brevity: irreps), and any representation can be reduced to a direct sum of irreps.

Fourier Transform: In many applications of interest, the Lie group is compact and thus always admits irreps, and the concept of irreps allows generalizing the Fourier transform to any compact group. By the renowned Peter–Weyl theorem, any square integrable function f ∈ L²(G) can be decomposed as

    f(g) = Σ_{k=0}^{∞} d_k Tr[F_k ρ_k(g)],  and  F_k = ∫_G f(g) ρ_k^*(g) dµ_g,   (1)

where each ρ_k : G → C^{d_k × d_k} is a unitary irrep of G with dimension d_k ∈ N. This is the compact Lie group analogue of the standard Fourier series over the unit circle. The "generalized Fourier coefficient" F_k in (1) is defined by an integral taken with respect to the Haar measure on G.

Motivation: Motivated by [38, 36], we consider the principal bundle structures on a G-manifold M. Below we state the definitions of fibre bundle and principal bundle for convenience; see [6] for more details. Briefly speaking, a fibre bundle is a manifold which is locally diffeomorphic to a product space, and a principal fibre bundle is a fibre bundle with a natural group action on its "fibres."

Definition 1 (Fibre Bundle) Let M, B, F be three differentiable manifolds, and let π : M → B denote a smooth surjective map between M and B. We say that π : M → B (or just M for short) is a fibre bundle with typical fibre F over B if B admits an open cover U such that π^{-1}(U) is diffeomorphic to the product space U × F for any open set U ∈ U. For any x ∈ B, we denote F_x := π^{-1}(x) and call it the fibre over x.

Definition 2 (Principal Bundle) Let M be a fibre bundle, and G a Lie group. We call M a principal G-bundle if (1) M is a fibre bundle, (2) M admits a right action of G that preserves the fibres of M, in the sense that for any m ∈ M we have π(m) = π(g · m), and (3) for any two points p, q ∈ M on the same fibre of M, there exists a group element g ∈ G satisfying p · g = q.

If M is a principal G-bundle over B, any representation ρ of G on a vector space V induces an associated vector bundle over B with typical fibre V, denoted M ×_ρ V, defined as the quotient space M ×_ρ V := (M × V)/∼, where the equivalence relation is given by (m · g, v) ∼ (m, ρ(g)v) for all m ∈ M, g ∈ G, and v ∈ V. This construction gives rise to as many different associated vector bundles as the number of distinct representations of the Lie group G. It allows us to study the G-manifold M, as a principal G-bundle, through tools developed for learning an unknown manifold from attached vector bundle structures, such as vector diffusion maps (VDM) [57, 58]. We consider each of these associated vector bundles as a distinct "view" of the unknown data manifold M, since the representations inducing these vector bundles are different. In the rest of this paper, we illustrate with several examples how to design learning and inference algorithms that exploit the inherent consistency of these associated vector bundles by representation theoretic machinery. Unlike the co-training setting, where the consistency is induced from the labelled samples onto the unlabelled samples, in our unsupervised setting no labelled training data is provided and the consistency is induced purely from the geometry of the G-manifold.

4 Methods

Problem Setup: Given a collection of n data points {x_1, . . . , x_n} ⊂ R^l, we assume they lie on or close to a low dimensional smooth manifold M of intrinsic dimension d ≪ l, and that M is a G-manifold admitting the structure of a principal G-bundle with a compact Lie group G. The data space M is closed under the action of G; that is, g · x ∈ M for all group transformations g ∈ G and data points x ∈ M, where '·' denotes the group action. As an example, in a cryo-EM image dataset each image is a projection of a macromolecule at a random orientation; therefore M = SO(3), the 3-D rotation group, and G = SO(2), the group of in-plane rotations of the images. The G-invariant distance d_ij between two data points x_i and x_j is defined as

    d_ij = min_{g ∈ G} ‖x_i − g · x_j‖,  and  g_ij = argmin_{g ∈ G} ‖x_i − g · x_j‖,   (2)

where ‖·‖ is the Euclidean distance on the ambient space R^l and g_ij is the associated optimal alignment, which is assumed to be unique. We then build an undirected graph G = (V, E) whose nodes are the data points, with edges determined from d_ij using either the ε-neighborhood criterion, i.e. (i, j) ∈ E iff d_ij ≤ ε, or the κ-nearest-neighbor criterion, i.e. (i, j) ∈ E iff j is one of the κ nearest neighbors of i. The edge weights w_ij are defined by a kernel applied to d_ij, i.e. w_ij = K_σ(d_ij). The resulting graph G is defined on the quotient space B := M/G and is invariant to the group transformations g_ij between data points; e.g., for the viewing angles of cryo-EM images, B = SO(3)/SO(2) = S². In a noiseless world, G would be a neighborhood graph connecting only data points on B with small d_ij. In many applications, however, noise in the observed data severely degrades the estimates of the G-invariant distances d_ij and the optimal alignments g_ij. This leads to errors in the edge set of G, connecting data points that are far apart on B, i.e. whose underlying geodesic distances are large.
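The following is a minimal sketch (our own illustration, not the paper's code) of eq. (2) in the simplest setting: each data point is a signal sampled on a circle, G = SO(2) acts by cyclic shifts, and the optimal alignment is found by brute force over the sample grid:

```python
import numpy as np

def invariant_distance(xi, xj):
    """Return (d_ij, alpha_ij): the G-invariant distance and the optimal
    alignment angle minimizing ||x_i - g.x_j|| over cyclic shifts g."""
    L = len(xi)
    dists = np.array([np.linalg.norm(xi - np.roll(xj, s)) for s in range(L)])
    s_star = int(np.argmin(dists))
    return dists[s_star], 2*np.pi*s_star/L

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = np.roll(x, 10) + 0.05*rng.standard_normal(64)   # a "rotated", noisy copy of x
d, alpha = invariant_distance(y, x)
print(d, alpha)   # small distance; alpha is close to 2*pi*10/64
```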

Given the noisy graph, we consider the problem of removing the wrong connections and recovering the underlying clean graph structure on B, especially at high noise levels. We propose a robust, unsupervised co-learning framework to address this. It has two steps: first, it builds a series of weight matrices from different irreps and spectrally filters the original noisy graph for denoising; second, it evaluates the affinity between node pairs to identify true neighbors in the clean graph. The main intuition is to systematically exploit the consistency of the group transformations of the principal bundle across all irreps of G, which results in a more robust affinity measure (see Fig. 1).

Figure 1: Illustration of our co-learning paradigm. Given a graph with irreps ρ_k for k = 1, . . . , k_max, we identify neighbors by investigating the following consistencies: within each graph for a single irrep ρ_k, if nodes i, j and s are neighbors, the cycle consistency of the group transformations holds: ρ_k(g_js)ρ_k(g_si)ρ_k(g_ij) ≈ I_{d_k × d_k}; across different irreps, if i, j are neighbors, the transformation g_ij should be consistent algebraically (along the orange lines connecting the blue dots).

Weight Matrices Using Irreps: We start by building a series of weight matrices using multiple irreps of the compact Lie group G. Given the graph G = (V, E) with n nodes and the group transformations g ∈ G, we assign a weight to each edge (i, j) ∈ E by taking into account both the scalar edge weight w_ij and the associated alignment g_ij, using the unitary irreps ρ_k for k = 1, . . . , k_max. The resulting graph can be described by a set of weight matrices W_k:

    W_k(i, j) = w_ij ρ_k(g_ij)  if (i, j) ∈ E,  and  0  otherwise,   (3)

where w_ij = w_ji and ρ_k(g_ji) = ρ_k^*(g_ij) for all (i, j) ∈ E. Recall that the unitary irrep ρ_k(g_ij) ∈ C^{d_k × d_k} is a d_k × d_k matrix; therefore W_k is a block matrix with n × n blocks of size d_k × d_k. The corresponding degree matrix D_k is a block diagonal matrix with (i, i)-block

    D_k(i, i) = deg(i) I_{d_k × d_k},  where  deg(i) := Σ_{j:(i,j)∈E} w_ij.   (4)
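A minimal sketch of (3)–(4) for G = SO(2) (an assumption made for concreteness: d_k = 1 and ρ_k(α) = e^{ikα}, so each W_k is simply an n × n complex matrix; the edge list, weights, and angles are hypothetical inputs):

```python
import numpy as np
import scipy.sparse as sp

def build_weight_matrices(n, edges, w, alpha, k_max):
    """edges: list of (i, j) with i < j; w[e] and alpha[e] are numpy arrays holding
    the scalar weight w_ij and alignment angle alpha_ij of edge e.
    Returns [W_1, ..., W_kmax] and the degree matrix D."""
    rows = [i for i, _ in edges]
    cols = [j for _, j in edges]
    deg = np.zeros(n)
    for e, (i, j) in enumerate(edges):
        deg[i] += w[e]
        deg[j] += w[e]
    D = sp.diags(deg)
    Ws = []
    for k in range(1, k_max + 1):
        vals = w * np.exp(1j * k * alpha)              # w_ij * rho_k(g_ij)
        Wk = sp.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
        Ws.append(Wk + Wk.conj().T)                    # rho_k(g_ji) = conj(rho_k(g_ij))
    return Ws, D
```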

The Hilbert space H, as a unitary representation of the compact Lie group G, admits an isotypic decomposition H = ⊕_k H_k, where a function f belongs to H_k if and only if f(x · g) = ρ_k(g) f(x). Then, for each irrep ρ_k, we construct a normalized matrix A_k = D_k^{-1} W_k, which is an averaging operator for vector fields in H_k. That is, for any vector z_k ∈ H_k:

    (A_k z_k)(i) = (1/deg(i)) Σ_{j:(i,j)∈E} w_ij ρ_k(g_ij) z_k(j).   (5)

Notice that A_k is similar to a Hermitian matrix Ã_k:

    Ã_k = D_k^{1/2} A_k D_k^{-1/2} = D_k^{-1/2} W_k D_k^{-1/2},   (6)


Algorithm 1: Weight Matrices Filtering
Input: Initial graph G = (V, E) with n nodes; for each (i, j) ∈ E the scalar weight w_ij and alignment g_ij; maximum frequency k_max; cutoff parameters m_k for k = 1, . . . , k_max; spectral filter η_t
Output: The filtered weight matrices W_{k,t} for k = 1, . . . , k_max
1  for k = 1, . . . , k_max do
2      Construct the block weight matrix W_k of size n d_k × n d_k in (3) and the normalized Hermitian matrix Ã_k in (6)
3      Compute the largest m_k d_k eigenvalues λ_1^{(k)} ≥ λ_2^{(k)} ≥ · · · ≥ λ_{m_k d_k}^{(k)} of Ã_k and the corresponding eigenvectors {u_l^{(k)}}_{l=1}^{m_k d_k}
4      for i = 1, . . . , n do
5          Construct the G-equivariant mapping ψ_t^{(k)} : i ↦ [η_t(λ_1)^{1/2} u_1^{(k)}(i), . . . , η_t(λ_{m_k d_k})^{1/2} u_{m_k d_k}^{(k)}(i)]
6          (Optional) Compute the SVD ψ_t^{(k)}(i) = UΣV^* and the normalized mapping ψ̃_t^{(k)}(i) = UV^*
7      end
8      Vertically concatenate the ψ_t^{(k)}(i) (or ψ̃_t^{(k)}(i)) to form the matrix Ψ_t^{(k)} of size n d_k × m_k d_k
9      Construct the filtered and normalized weight matrix W_{k,t} = Ψ_t^{(k)} (Ψ_t^{(k)})^*
10 end
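Below is a minimal sketch of Algorithm 1 specialized to G = SO(2), so d_k = 1, under the assumptions that Wk is one of the sparse complex matrices from (3), deg is the degree vector from (4), and η_t(λ) = λ^{2t} as in VDM; the optional per-node normalization in line 6 is omitted:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def filter_weight_matrix(Wk, deg, m_k, t):
    """Return the filtered, normalized weight matrix W_{k,t} (dense n x n)."""
    Dinv = sp.diags(1.0 / np.sqrt(deg))
    A_tilde = (Dinv @ Wk @ Dinv).tocsc()                  # D^{-1/2} W_k D^{-1/2}, Hermitian
    lam, U = spla.eigsh(A_tilde.astype(complex), k=m_k, which='LA')  # top m_k eigenpairs
    order = np.argsort(-lam)
    lam, U = lam[order], U[:, order]
    Psi = U * (np.abs(lam)[None, :] ** t)   # eta_t(lambda)^{1/2} = |lambda|^t for eta_t(lambda) = lambda^{2t}
    return Psi @ Psi.conj().T               # W_{k,t} = Psi Psi^*
```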

The matrix Ã_k has real eigenvalues and orthonormal eigenvectors {λ_l^{(k)}, u_l^{(k)}}_{l=1}^{n d_k}, and all its eigenvalues lie in [−1, 1]. For simplicity, we assume the data points are uniformly distributed on B; if not, the normalization proposed in [17] can be applied to W_k. Now suppose there is a random walk on G with transition matrix A_0, built from the trivial representation ρ_0(g) = 1, ∀g ∈ G; then A_0^{2t}(i, j) is the transition probability from i to j in 2t steps. Due to the use of ρ_0(g_ij), A_0^{2t}(i, j) not only takes into account the connectivity between nodes i and j, but also checks the consistency of the transformations along all length-2t paths between i and j. More generally, for k ≥ 1, the block A_k^{2t}(i, j) still encodes such consistencies. Intuitively, if i and j are true neighbors on G, their transformations should be in agreement and we expect ‖A_k^{2t}(i, j)‖²_HS or ‖Ã_k^{2t}(i, j)‖²_HS to be large, where ‖·‖_HS is the Hilbert–Schmidt norm. Vector diffusion maps (VDM) [57, 58] consider only k = 1 and define the pairwise affinity as ‖Ã_1^{2t}(i, j)‖²_HS.

Weight Matrices Filtering: To denoise the graph, we generalize the VDM framework by first computing the filtered and normalized weight matrix W_{k,t} = η_t(Ã_k) for all irreps ρ_k, where η_t(·) denotes a spectral filter acting on the eigenvalues, for example η_t(λ) = λ^{2t} as in VDM. Moreover, since the small eigenvalues of Ã_k are more sensitive to noise, a truncation is applied by keeping only the top m_k d_k eigenvalues and eigenvectors. Specifically, we equally divide each u_l^{(k)} of length n d_k into n blocks and denote the i-th block by u_l^{(k)}(i). In this way, we define a G-equivariant mapping

    ψ_t^{(k)} : i ↦ [η_t(λ_1)^{1/2} u_1^{(k)}(i), . . . , η_t(λ_{m_k d_k})^{1/2} u_{m_k d_k}^{(k)}(i)] ∈ C^{d_k × m_k d_k},  i = 1, 2, . . . , n.   (7)

It can be further normalized to ensure that the diagonal blocks of W_{k,t} are identity matrices, i.e. W_{k,t}(i, i) = I_{d_k × d_k} for all nodes i. The steps of the weight matrix filtering are detailed in Alg. 1. The resulting denoised W_{k,t} is then used to build our affinity measures.

Optimal Alignment Affinity: At each irrep ρ_k, the filtered W_{k,t} encodes the transformation consistency of the graph represented by W_k and can, on its own, serve as an affinity measure. As in unsupervised multi-view learning, it is advantageous to boost this by coupling the information across different irreps to achieve a more accurate measurement (see Fig. 1). Furthermore, notice that if i and j are true neighbors, then for each irrep ρ_k the block W_{k,t}(i, j) should encode the same underlying alignment g_ij. Therefore, exploiting the algebraic relation among the W_{k,t} across all irreps, we define the optimal alignment affinity according to the generalized Fourier transform in (1) and the definition of the weight matrices in (3):

    S_t^{OA}(i, j) := max_{g ∈ G} (1/k_max) | Σ_{k=1}^{k_max} Tr[W_{k,t}(i, j) ρ_k^*(g)] |,   (8)
which can be evaluated using generalized FFTs [39]. Here both the cycle consistency within each graph and the algebraic relation across different irreps in Fig. 1 are taken into account.
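A minimal sketch of (8) for G = SO(2) (our own illustration): since d_k = 1, Tr[W_{k,t}(i, j) ρ_k^*(g)] = W_{k,t}(i, j) e^{−ikα}, so the objective is a trigonometric polynomial in α that can be maximized over a fine grid (or via an inverse FFT):

```python
import numpy as np

def optimal_alignment_affinity(W_list, i, j, n_grid=360):
    """W_list[k-1][i, j] holds the filtered entry W_{k,t}(i, j) for k = 1..k_max."""
    k_max = len(W_list)
    alphas = np.linspace(0.0, 2*np.pi, n_grid, endpoint=False)
    total = np.zeros(n_grid, dtype=complex)
    for k, Wk in enumerate(W_list, start=1):
        total += Wk[i, j] * np.exp(-1j*k*alphas)
    return np.max(np.abs(total)) / k_max
```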

Power Spectrum Affinity: Searching for the optimal alignment over all transformations as above can be computationally challenging and extremely time consuming. Therefore, invariant features can be used to speed up the computation. First we consider the power spectrum, which by the convolution theorem is the Fourier transform of the auto-correlation, defined as P_f(k) = F_k F_k^*. It is transformation invariant: under the right action of g ∈ G, the Fourier coefficients transform as F_k → F_k ρ_k(g), and P_{f·g}(k) = F_k ρ_k(g) ρ_k(g)^* F_k^* = P_f(k). Hence, for each k we compute the power spectrum P_k of W_{k,t} and combine them into the power spectrum affinity

    S_t^{power spec}(i, j) = (1/k_max) Σ_{k=1}^{k_max} Tr[P_k(i, j)],  with  P_k(i, j) = W_{k,t}(i, j) W_{k,t}(i, j)^*,   (9)

which does not require a search over optimal alignments and is thus computationally efficient. Recently, multi-frequency vector diffusion maps (MFVDM) [20] considered G = SO(2) and summed the power spectra at different irreps as the affinity; here, we extend this to a general compact Lie group.
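For G = SO(2), where d_k = 1, the power spectrum affinity (9) for all node pairs is simply an average of squared moduli (a sketch under the same assumptions as above, with dense n × n matrices):

```python
import numpy as np

def power_spectrum_affinity(W_list):
    """W_list[k-1] is the dense n x n filtered matrix W_{k,t}; returns the n x n affinity."""
    return sum(np.abs(Wk)**2 for Wk in W_list) / len(W_list)
```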

Bispectrum Affinity: Although the power spectrum affinity combines the information at different irreps, it does not couple them and loses the relative phase information, i.e. the transformation across different irreps ρ_k (see Fig. 1). Consequently, the affinity may be inaccurate at high noise levels. In order to systematically impose the algebraic consistency without solving the optimization problem in (8), we consider another invariant feature called the bispectrum, which is the Fourier transform of the triple correlation and has been used in several fields [32, 27, 72, 31]. Formally, consider two unitary irreps ρ_{k_1} and ρ_{k_2} of the compact Lie group G on finite dimensional vector spaces H_{k_1} and H_{k_2}. There is a unique decomposition of ρ_{k_1} ⊗ ρ_{k_2} into a direct sum of unitary irreps ρ_k, k ∈ N, where ⊗ is the Kronecker product of matrices and ⊕ denotes the direct sum. There exist G-equivariant maps from H_{k_1} ⊗ H_{k_2} → ⊕_k H_k, given by the generalized Clebsch–Gordan coefficients C_{k_1,k_2} of G, which satisfy

    ρ_{k_1}(g) ⊗ ρ_{k_2}(g) = C_{k_1,k_2} [ ⊕_{k ∈ k_1 ⊗ k_2} ρ_k(g) ] C_{k_1,k_2}^*.   (10)

Using (10) and the fact that C_{k_1,k_2} and the ρ_k are unitary matrices, we have

    [ρ_{k_1}(g) ⊗ ρ_{k_2}(g)] C_{k_1,k_2} [ ⊕_{k ∈ k_1 ⊗ k_2} ρ_k^*(g) ] C_{k_1,k_2}^* = I_{d_{k_1} d_{k_2} × d_{k_1} d_{k_2}}.   (11)

In particular, the triple correlation of a function f on G can be defined as a_{3,f}(g_1, g_2) = ∫_G f^*(g) f(g g_1) f(g g_2) dµ_g. The bispectrum is then defined as the Fourier transform of a_{3,f}:

    B_f(k_1, k_2) = [F_{k_1} ⊗ F_{k_2}] C_{k_1,k_2} [ ⊕_{k ∈ k_1 ⊗ k_2} F_k^* ] C_{k_1,k_2}^*.   (12)

Under the action of g, the Fourier coefficients of f satisfy: (1) F_k → F_k ρ_k(g), and (2) F_{k_1} ⊗ F_{k_2} → (F_{k_1} ρ_{k_1}(g)) ⊗ (F_{k_2} ρ_{k_2}(g)) = (F_{k_1} ⊗ F_{k_2})(ρ_{k_1}(g) ⊗ ρ_{k_2}(g)). Therefore, B_f is G-invariant by (11) and (12). By combining the bispectrum at different k_1 and k_2, we establish the bispectrum affinity

    S_t^{bispec}(i, j) = (1/k_max²) | Σ_{k_1=1}^{k_max} Σ_{k_2=1}^{k_max} Tr[B_{k_1,k_2,t}(i, j)] |,   (13)

with

    B_{k_1,k_2,t}(i, j) = [W_{k_1,t}(i, j) ⊗ W_{k_2,t}(i, j)] C_{k_1,k_2} [ ⊕_{k ∈ k_1 ⊗ k_2} W_{k,t}^*(i, j) ] C_{k_1,k_2}^*.   (14)
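For G = SO(2) the formulas above collapse considerably (a sketch under the assumptions stated earlier): d_k = 1, the Clebsch–Gordan coefficients are 1, and k_1 ⊗ k_2 = k_1 + k_2, so B_{k_1,k_2,t}(i, j) = W_{k_1,t}(i, j) W_{k_2,t}(i, j) conj(W_{k_1+k_2,t}(i, j)):

```python
import numpy as np

def bispectrum_affinity(W_list, k_max):
    """W_list[k-1] is the dense n x n filtered matrix W_{k,t}; frequencies up to
    2*k_max must be provided so that W_{k_1+k_2,t} is always available."""
    n = W_list[0].shape[0]
    S = np.zeros((n, n), dtype=complex)
    for k1 in range(1, k_max + 1):
        for k2 in range(1, k_max + 1):
            S += W_list[k1-1] * W_list[k2-1] * np.conj(W_list[k1+k2-1])
    return np.abs(S) / k_max**2
```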


Figure 2: Histograms of arccos⟨v_i, v_j⟩ between estimated nearest neighbors on M = SO(3), G = SO(2), and B = S² at different SNRs (panels p = 0.5, 0.1, 0.09, 0.08), comparing the power spectrum, bispectrum, and optimal alignment affinities (ours) with VDM. The clean histogram should peak at small angles. The lines of the bispectrum and the optimal alignment affinities almost overlap in all these plots. We set k_max = 6, m_k = 10 for all k, and t = 1.

If the transformations are consistent across different k's, the trace of B_{k_1,k_2,t} in (14) should be large. Therefore, this affinity not only takes into account the consistency of the transformation at each irrep, but also exploits the algebraic consistency across different irreps.

Higher Order Invariant Moments: The power spectrum and bispectrum are second- and third-order invariant moments; it is certainly possible to design affinities using higher order invariant features. For example, we can define order-(d+1) G-invariant features as

    M_{k_1,...,k_d} = [F_{k_1} ⊗ · · · ⊗ F_{k_d}] C_{k_1,...,k_d} [ ⊕_{k ∈ k_1 ⊗ ··· ⊗ k_d} F_k^* ] C_{k_1,...,k_d}^*,

where C_{k_1,...,k_d} is the extension of the Clebsch–Gordan coefficients. However, using higher order spectra dramatically increases the computational complexity. In practice, the bispectrum is sufficient to check the consistency of the group transformations between nodes and across all irreps.

Computational Complexity: Filtering the normalized weight matrix involves computing the top m_k d_k eigenvectors of the sparse Hermitian matrices Ã_k for k = 1, . . . , k_max, which can be done efficiently using the block Lanczos method [51] at a cost of O(n m_k d_k² (m_k + l_k)), where l_k is the average number of non-zero elements in each row of Ã_k. We compute the spectral decompositions for different k in parallel. Computing the power spectrum affinity for all pairs takes O(n² Σ_{k=1}^{k_max} d_k²) flops. The computational complexity of evaluating the bispectrum affinity is O(n² Σ_{k_1=0}^{k_max} Σ_{k_2=0}^{k_max} d_{k_1}² d_{k_2}²). For the optimal alignment affinity, the complexity depends on the cost C_a of the optimal alignment search, and the total cost is O(n² C_a). For certain group structures, where FFTs are available, the optimal alignment affinity can be efficiently and accurately approximated; however, C_a is still larger than the cost of computing the invariants.

Examples with G = SO(2) and SO(3): If the group transformation is a 2-D in-plane rotation, i.e. G = SO(2), the unitary irreps are ρ_k(α) = e^{ıkα}, where α ∈ (0, 2π] is the rotation angle. The irreps are one-dimensional (d_k = 1), and k_1 ⊗ k_2 = k_1 + k_2. The generalized Clebsch–Gordan coefficient is 1 for all (k_1, k_2) pairs. If G is the 3-D rotation group, i.e. G = SO(3), the unitary irreps are the Wigner D-matrices for ω ∈ SO(3) [68]. The dimensions of the irreps are d_k = 2k + 1, and k_1 ⊗ k_2 = {|k_1 − k_2|, . . . , k_1 + k_2}. The Clebsch–Gordan coefficients for all (k_1, k_2) pairs can be numerically precomputed [26]. These two classical examples arise frequently in real-world applications and are investigated in our experiments.
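For SO(3), one way to precompute the matrices C_{k_1,k_2} is to assemble them from the scalar coefficients ⟨k_1 m_1; k_2 m_2 | k m⟩; below is a small sketch using SymPy (our own illustration, not the paper's code; the row/column ordering conventions must match whichever Wigner-D implementation is used):

```python
import numpy as np
from sympy.physics.wigner import clebsch_gordan

def cg_matrix(k1, k2):
    """Assemble the (2k1+1)(2k2+1)-dimensional Clebsch-Gordan matrix of eq. (10) for SO(3)."""
    rows = [(m1, m2) for m1 in range(-k1, k1 + 1) for m2 in range(-k2, k2 + 1)]
    cols = [(k, m) for k in range(abs(k1 - k2), k1 + k2 + 1) for m in range(-k, k + 1)]
    C = np.zeros((len(rows), len(cols)))
    for r, (m1, m2) in enumerate(rows):
        for c, (k, m) in enumerate(cols):
            C[r, c] = float(clebsch_gordan(k1, k2, k, m1, m2, m))
    return C

C = cg_matrix(1, 1)
print(np.allclose(C.T @ C, np.eye(9)))   # True: C_{k1,k2} is orthogonal (unitary)
```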

5 Experiments

We evaluate our paradigm on three examples: (1) nearest neighbor (in brevity: NN) search on the 2-sphere S² with G = SO(2); (2) nearest viewing angle search for cryo-EM images; (3) spectral clustering with SO(2) or SO(3) transformations. We compare with the baseline vector diffusion maps (VDM) [57]. In particular, since the greatest advantage of our paradigm is its robustness to noise, we demonstrate this on datasets contaminated by extremely high levels of noise. The settings of the hyperparameters, e.g. k_max and m_k, are given in the captions; we note that our algorithm is not sensitive to the choice of parameters. The experiments are conducted in MATLAB on a computer with an Intel i7 7th generation quad core CPU.

NN Search for M = SO(3), G = SO(2), B = S²: We simulate n = 10⁴ points uniformly distributed over M = SO(3) according to the Haar measure. Each point can be represented by a 3 × 3 orthogonal matrix R = [R¹, R², R³] with determinant equal to 1, and the vector v = R³ can be viewed as a point on the unit 2-sphere (i.e. B = S²).


Figure 3: Nearest viewing angle search for cryo-EM images. Left: clean and noisy (SNR = 0.01) sample projection images, and the reference volume of the 70S ribosome. Right: histograms of arccos⟨v_i, v_j⟩ between estimated nearest neighbors at SNR = 0.01, 0.007, 0.005, 0.003, comparing sPCA (the initial noisy input to our graph construction), the power spectrum, bispectrum, and optimal alignment affinities (ours), and VDM. The lines of the power spectrum and bispectrum affinities almost overlap in all these plots. We set k_max = 20, m_k = 20 for all k, and t = 1.

The first two columns R¹ and R² span the tangent plane of the sphere at v. Given two points i and j, there exists a rotation angle α_ij that optimally aligns the tangent basis [R¹_j, R²_j] to [R¹_i, R²_i] as in (2). Therefore, the manifold M is a G-manifold with G = SO(2). We build a clean neighborhood graph G = (V, E) by connecting nodes with ⟨v_i, v_j⟩ ≥ 0.97, and add noise following a random rewiring model [59]: with probability p, we keep an existing edge (i, j) ∈ E; with probability 1 − p, we remove it and link i to another vertex drawn uniformly at random from the remaining vertices not already connected to i. For the rewired edges, the alignments are uniformly distributed over [0, 2π] according to the Haar measure. In this way, the probability p controls the signal-to-noise ratio (SNR): p = 1 is the clean case, while p = 0 is fully random. For each node, we identify its 50 NNs based on the three proposed affinities and the affinity in VDM. In Fig. 2 we plot the histograms of arccos⟨v_i, v_j⟩ of the identified NNs at different SNRs. For p = 0.08 to p = 0.1 (over 90% of the edges corrupted), the bispectrum and optimal alignment affinities achieve similar results and outperform the power spectrum affinity and VDM. This indicates that our proposed affinities are able to recover the underlying clean graph even at extremely high noise levels.
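A compact sketch of this data generation (our own illustration with a smaller n; true alignments α_ij would come from aligning the tangent frames, and the rewiring step is slightly simplified in that the random endpoint is drawn uniformly from all other vertices):

```python
import numpy as np

def simulate_rewired_graph(n=2000, p=0.5, thresh=0.97, seed=0):
    rng = np.random.default_rng(seed)
    # Haar-uniform rotations via QR of a Gaussian matrix, sign-corrected to det = +1
    Rs = []
    for _ in range(n):
        Q, R = np.linalg.qr(rng.standard_normal((3, 3)))
        Q = Q * np.sign(np.diag(R))
        if np.linalg.det(Q) < 0:
            Q[:, 2] *= -1
        Rs.append(Q)
    V = np.array([Q[:, 2] for Q in Rs])          # points v_i = R^3 on S^2
    clean = [(i, j) for i in range(n) for j in range(i+1, n) if V[i] @ V[j] >= thresh]
    noisy = [(i, j) if rng.random() < p else (i, int(rng.integers(n)))  # keep or rewire
             for (i, j) in clean]
    return Rs, V, clean, noisy
```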

Nearest Viewing Angle Search for Cryo-EM Images: One important application of the NN search above is cryo-EM image analysis. Given a series of projection images of a macromolecule with unknown random orientations and extremely low SNR (see Fig. 3), we aim to identify images with similar projection directions and perform local rotational alignment; the image SNR can then be boosted by averaging the aligned images. Each projection image can thus be viewed as a point lying on the 2-sphere (i.e. B = S²), and the group transformation is the in-plane rotation of an image (i.e. G = SO(2)).

In our experiments, we simulate n = 10⁴ projection images from a 3-D electron density map of the 70S ribosome; the orientations of the projections are uniformly distributed over SO(3), and the images are contaminated by additive white Gaussian noise (see Fig. 3 for noisy samples). As preprocessing, we build the initial graph G by using fast steerable PCA (sPCA) [71] and rotationally invariant features [72] to initially identify the images of similar views and the corresponding in-plane rotational alignments. As in the example above, we compute the affinities for NN identification. In Fig. 3, we display the histograms of arccos⟨v_i, v_j⟩ of the identified NNs at different SNRs. The results show that all proposed affinities outperform VDM. The power spectrum and bispectrum affinities achieve similar results and outperform the optimal alignment affinity. This differs from the previous example with the random rewiring model on S², because the two examples have different noise models: the random rewiring model has independent noise on edges, whereas the cryo-EM example has independent noise on nodes and hence dependent noise on edges.

Spectral Clustering with SO(2) or SO(3) Transformations: We apply our framework to spectral clustering. In particular, we assume there exists a group transformation g_ij ∈ G in addition to the scalar weight w_ij between members (nodes) of a network. Formally, given n data points with K equal-sized clusters, for each point i we uniformly assign an in-plane rotation angle α_i ∈ [0, 2π), or a 3-D rotation ω_i ∈ SO(3). The optimal alignment is then α_ij = α_i − α_j, or ω_ij = ω_i ω_j^{-1}. We build the clean graph by fully connecting the nodes within each cluster. The noisy graph is then generated by the random rewiring model with rewiring probability p. We perform clustering by using our proposed affinities as the input to spectral clustering, and compare with traditional spectral clustering [45, 65], which only takes into account the scalar edge connections, and with VDM [57], which defines the affinity based on the transformation consistency at a single representation. In Tab. 1, we use the Rand index [48] to measure performance (larger is better). Our three affinities achieve similar accuracy and outperform both traditional spectral clustering (scalar) and VDM. The results reported in Tab. 1 are evaluated over 50 trials for SO(2) and 10 trials for SO(3), respectively.
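The final step can use any off-the-shelf spectral clustering routine with the proposed pairwise affinities as a precomputed input; a minimal sketch (an assumption about the pipeline, not the paper's MATLAB code):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_affinity(S, K, seed=0):
    """S: n x n nonnegative affinity (e.g. the power spectrum or bispectrum affinity)."""
    S = 0.5 * (S + S.T)                  # enforce symmetry
    np.fill_diagonal(S, 0.0)             # as in Fig. 4, ignore each node's self-affinity
    model = SpectralClustering(n_clusters=K, affinity='precomputed', random_state=seed)
    return model.fit_predict(S)
```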


Table 1: Rand index (larger is better) of spectral clustering with SO(2) or SO(3) group transformations, for K = 2 clusters (left) and K = 10 clusters (right). For K = 10 with SO(3), each cluster has 25 points; otherwise each cluster has 50 points. We set m_k = K, k_max = 10, and t = 1 for all cases.

G      Method               K = 2 clusters                                    K = 10 clusters
                            p = 0.16        p = 0.20        p = 0.25          p = 0.16        p = 0.20        p = 0.25
SO(2)  Scalar               0.569 ± 0.069   0.705 ± 0.092   0.837 ± 0.059     0.868 ± 0.010   0.948 ± 0.015   0.981 ± 0.013
       VDM                  0.526 ± 0.036   0.644 ± 0.076   0.857 ± 0.057     0.892 ± 0.010   0.963 ± 0.011   0.994 ± 0.008
       Power spec. (ours)   0.670 ± 0.065   0.899 ± 0.051   0.981 ± 0.021     0.975 ± 0.010   0.991 ± 0.011   0.998 ± 0.006
       Opt (ours)           0.687 ± 0.011   0.912 ± 0.009   0.986 ± 0.007     0.976 ± 0.012   0.994 ± 0.008   0.997 ± 0.005
       Bispec. (ours)       0.664 ± 0.073   0.901 ± 0.062   0.983 ± 0.019     0.967 ± 0.014   0.997 ± 0.003   1 ± 0
SO(3)  Scalar               0.572 ± 0.061   0.666 ± 0.095   0.862 ± 0.056     0.838 ± 0.003   0.838 ± 0.007   0.909 ± 0.019
       VDM                  0.600 ± 0.048   0.840 ± 0.056   0.974 ± 0.023     0.850 ± 0.011   0.919 ± 0.013   0.965 ± 0.014
       Power spec. (ours)   0.921 ± 0.038   0.986 ± 0.016   1 ± 0             0.874 ± 0.011   0.939 ± 0.011   0.981 ± 0.017
       Bispec. (ours)       0.911 ± 0.043   0.990 ± 0.010   1 ± 0             0.869 ± 0.012   0.943 ± 0.009   0.979 ± 0.011

Figure 4: Spectral clustering for K = 2 clusters with SO(3) group transformation; the underlying clean graph is corrupted according to the random rewiring model. (a) The affinity matrices produced by the different approaches. The clusters are of equal size and form two diagonal blocks in the clean affinity matrix (see the scalar column at p = 1); the affinity of each node with itself is not included, so the diagonal entries are 0. (b) The bispectrum affinity components |Tr[B_{k_1,k_2,t}(i, j)]| at different (k_1, k_2), for p = 0.16.

For a better understanding, we visualize the n × n affinity matrices produced by the different approaches in Fig. 4a for K = 2 and G = SO(3). We observe that at high noise levels, such as p = 0.16 or 0.2, the underlying 2-cluster structure is visually easier to identify with our proposed affinities. In particular, since the bispectrum affinity in (13) is a combination of the bispectrum coefficients B_f(k_1, k_2), Fig. 4b shows the components |Tr[B_{k_1,k_2,t}(i, j)]| at different (k_1, k_2). Visually, the 2-cluster structure appears in each (k_1, k_2) component, with some variation across components. Combining this information results in a more robust classifier.

6 Conclusion

In this paper, we propose a novel mathematical and computational framework for unsupervised co-learning on G-manifolds across multiple unitary irreps for robust nearest neighbor search and spectral clustering. The algorithm has two stages: in the first stage, the graph adjacency matrices are individually denoised through spectral filtering, which uses the local cycle consistency of the group transformations; the second stage checks the algebraic consistency across different irreps, and we propose three different ways to combine the information across all irreps. Using invariant moments bypasses the pairwise alignment and is computationally more efficient than the affinity based on the optimal alignment search. Experimental results show the efficacy of the framework compared to state-of-the-art methods, which either do not take the transformation group into account or use only a single representation.


Acknowledgement: This work is supported in part by National Science Foundation grants DMS-185479 and DMS-1854831.

References

[1] Andrey V. Alekseevsky and Dmitry V. Alekseevsky. Riemannian G-manifold with one-dimensional orbit space. Annals of Global Analysis and Geometry, 11(3):197–211, 1993.

[2] Derek Bean, Peter J. Bickel, Noureddine El Karoui, and Bin Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110(36):14563–14568, 2013.

[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, 2002.

[4] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 2003.

[5] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[6] Nicole Berline, Ezra Getzler, and Michèle Vergne. Heat Kernels and Dirac Operators (Grundlehren Text Editions). Springer, 1992 edition, December 2003.

[7] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

[8] Paul Breiding, Sara Kališnik, Bernd Sturmfels, and Madeleine Weinstein. Learning algebraic varieties from samples. Revista Matemática Complutense, 31(3):545–593, 2018.

[9] Theodor Bröcker and Tammo Tom Dieck. Representations of Compact Lie Groups, volume 98. Springer Science & Business Media, 2013.

[10] Alexander Bronstein, Michael Bronstein, and Ron Kimmel. Numerical Geometry of Non-Rigid Shapes. Springer Publishing Company, Incorporated, 1st edition, 2008.

[11] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[12] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Sparse and low-rank matrix decompositions. IFAC Proceedings Volumes, 42(10):1493–1498, 2009.

[13] Minmin Chen, Kilian Q. Weinberger, and John C. Blitzer. Co-training for domain adaptation. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 2456–2464, USA, 2011. Curran Associates Inc.

[14] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[15] Ronald R. Coifman, Stephane Lafon, Ann B. Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, May 2005.

[16] Ronald R. Coifman, Stephane Lafon, Ann B. Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: multiscale methods. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7432–7437, May 2005.

[17] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006. Special Issue: Diffusion Maps and Wavelets.


[18] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.

[19] Yifeng Fan and Zhizhen Zhao. Cryo-electron microscopy image analysis using multi-frequency vector diffusion maps. arXiv preprint arXiv:1904.07772, 2019.

[20] Yifeng Fan and Zhizhen Zhao. Multi-frequency vector diffusion maps. In ICML, 2019.

[21] Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-Taylor, and Sándor Szedmák. Two view learning: SVM-2K, theory and practice. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 355–362. MIT Press, 2006.

[22] Tingran Gao. The diffusion geometry of fibre bundles: Horizontal diffusion maps. arXiv preprint arXiv:1602.02330, 2016.

[23] Tingran Gao and Zhizhen Zhao. Multi-frequency phase synchronization. In ICML, 2019.

[24] Ronny Hadani and Amit Singer. Representation theoretic patterns in three dimensional cryo-electron microscopy I: The intrinsic reconstitution algorithm. Annals of Mathematics, 174(2):1219–1241, 2011.

[25] Ronny Hadani and Amit Singer. Representation theoretic patterns in three-dimensional cryo-electron microscopy II – the class averaging problem. Foundations of Computational Mathematics, 11(5):589–616, 2011.

[26] Brian Hall. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction, volume 222. Springer, 2015.

[27] Ramakrishna Kakarala and Dansheng Mao. A theory of phase-sensitive rotation invariance with spherical harmonic and moment-based representations. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 105–112. IEEE, 2010.

[28] David G. Kendall. A survey of the statistical theory of shape. Statistical Science, 4(2):87–99, 1989.

[29] Jon R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433–451, 1971.

[30] Shoshichi Kobayashi. Transformation Groups in Differential Geometry. Springer Science & Business Media, 2012.

[31] Imre Risi Kondor. Group Theoretical Methods in Machine Learning. PhD thesis, Columbia University, 2008.

[32] Risi Kondor. A novel set of rotationally and translationally invariant features for images based on the non-commutative bispectrum. arXiv preprint cs/0701127, 2007.

[33] Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, pages 2752–2760, 2018.

[34] Abhishek Kumar and Hal Daume III. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pages 393–400, USA, 2011. Omnipress.

[35] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1413–1421. Curran Associates, Inc., 2011.

[36] Boris Landa and Yoel Shkolnisky. The steerable graph Laplacian and its application to filtering image datasets. SIAM Journal on Imaging Sciences, 11(4):2254–2304, 2018.

[37] Yingming Li, Ming Yang, and Zhongfei Mark Zhang. A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2018.


[38] Chen-Yun Lin, Arin Minasian, Xin Jessica Qi, and Hau-Tieng Wu. Manifold learning via the principle bundle approach. Frontiers in Applied Mathematics and Statistics, 4:21, 2018.

[39] David K. Maslen and Daniel N. Rockmore. Generalized FFTs – a survey of some recent results. In Groups and Computation II, volume 28, pages 183–287. American Mathematical Society, 1997.

[40] Peter W. Michor. Topics in Differential Geometry, volume 93. American Mathematical Society, 2008.

[41] Hà Quang Minh, Loris Bazzani, and Vittorio Murino. A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning. Journal of Machine Learning Research, 17(25):1–72, 2016.

[42] David Mumford, John Fogarty, and Frances Kirwan. Geometric Invariant Theory, volume 34. Springer Science & Business Media, 1994.

[43] Ion Muslea, Steven Minton, and Craig A. Knoblock. Active learning with multiple views. Journal of Artificial Intelligence Research, 27(1):203–233, 2006.

[44] Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, and Ronald R. Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Advances in Neural Information Processing Systems, pages 955–962, 2006.

[45] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

[46] Greg Ongie, Rebecca Willett, Robert D. Nowak, and Laura Balzano. Algebraic variety models for high-rank matrix completion. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2691–2700, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[47] Pawel A. Penczek, Jun Zhu, and Joachim Frank. A common-lines based method for determining orientations for n > 3 particle projections simultaneously. Ultramicroscopy, 63(3-4):205–218, 1996.

[48] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.

[49] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for high-dimensional sparse additive models over kernel classes. Journal of Machine Learning Research, 13:281–319, 2012.

[50] Ken Richardson. The transverse geometry of G-manifolds and Riemannian foliations. Illinois Journal of Mathematics, 45(2):517–535, 2001.

[51] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100–1124, 2009.

[52] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[53] Alexander H. W. Schmitt. Geometric Invariant Theory and Decorated Principal Bundles, volume 11. European Mathematical Society, 2008.

[54] Jean-Pierre Serre. Linear Representations of Finite Groups, volume 42. Springer, 1977.

[55] Vikas Sindhwani and Partha Niyogi. A co-regularized approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.

[56] Vikas Sindhwani and David S. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th International Conference on Machine Learning, pages 976–983. ACM, 2008.


[57] Amit Singer and Hau-Tieng Wu. Vector diffusion maps and the connection Laplacian. Communications on Pure and Applied Mathematics, 65(8):1067–1144, 2012.

[58] Amit Singer and Hau-Tieng Wu. Spectral convergence of the connection Laplacian from random samples. Information and Inference: A Journal of the IMA, 6(1):58–123, 2016.

[59] Amit Singer, Zhizhen Zhao, Yoel Shkolnisky, and Ronny Hadani. Viewing angle classification of cryo-electron microscopy images using eigenvectors. SIAM Journal on Imaging Sciences, 4(2):723–759, 2011.

[60] Shiliang Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23(7):2031–2038, December 2013.

[61] Shiliang Sun and David R. Hardoon. Active learning with extremely sparse labeled examples. Neurocomputing, 73(16):2980–2988, 2010. 10th Brazilian Symposium on Neural Networks (SBRN2008).

[62] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[63] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.

[64] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[65] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[66] Elif Vural and Christine Guillemot. A study of the classification of low-dimensional data with supervised manifold learning. Journal of Machine Learning Research, 18:1–55, 2018.

[67] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[68] Eugene Paul Wigner. Gruppentheorie und ihre Anwendung auf die Quantenmechanik der Atomspektren. Monatshefte für Mathematik und Physik, 39(1):A51, 1932.

[69] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multi-view learning. arXiv preprint arXiv:1304.5634, 2013.

[70] Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.

[71] Zhizhen Zhao, Yoel Shkolnisky, and Amit Singer. Fast steerable principal component analysis. IEEE Transactions on Computational Imaging, 2(1):1–12, 2016.

[72] Zhizhen Zhao and Amit Singer. Rotationally invariant image representation for viewing direction classification in cryo-EM. Journal of Structural Biology, 186(1):153–166, 2014.
