
Diffusion Wavelets

Ronald R. Coifman, Mauro Maggioni

Program in Applied Mathematics, Department of Mathematics
Yale University, New Haven, CT 06510, U.S.A.

Abstract

We present a multiresolution construction for efficiently computing, compressing and applying large powers of operators that have high powers with low numerical rank. This allows the fast computation of functions of the operator, notably the associated Green's function, in compressed form, and their fast application. Classes of operators satisfying these conditions include diffusion-like operators, in any dimension, on manifolds, graphs, and in non-homogeneous media. In this case our construction can be viewed as a far-reaching generalization of Fast Multipole Methods, achieved through a different point of view, and of the non-standard wavelet representation of Calderón-Zygmund and pseudodifferential operators, achieved through a different multiresolution analysis adapted to the operator. We show how the dyadic powers of an operator can be used to induce a multiresolution analysis, as in classical Littlewood-Paley and wavelet theory, and we show how to construct, with fast and stable algorithms, scaling function and wavelet bases associated to this multiresolution analysis, and the corresponding downsampling operators, and use them to compress the corresponding powers of the operator. This allows us to extend multiscale signal processing to general spaces (such as manifolds and graphs) in a very natural way, with corresponding fast algorithms.

Key words: Wavelets, Wavelets on Manifolds, Wavelets on Graphs, Heat Diffusion, Laplace-Beltrami Operator, Diffusion Semigroups, Fast Multipole Method, Matrix Compression, Spectral Graph Theory.

Email addresses: [email protected] (Ronald R. Coifman), [email protected] (Mauro Maggioni).

URL: www.math.yale.edu/~mmm82 (Mauro Maggioni).

Preprint submitted to Elsevier Science, 17 August 2004

1 Introduction

We introduce a multiresolution geometric construction for the efficient computation of high powers of local operators (in time O(n log² n) for several classes of operators, where n is the cardinality of the space). The operators under consideration are positive contractions that have high powers with low numerical rank. This organization of the geometry of a matrix representing such an operator T enables fast computation of functions of the operator, notably the associated Green's function, in compressed form. Classes of operators satisfying these conditions include discretizations of differential operators, in any dimension, on manifolds, and in non-homogeneous media. Our construction can be viewed as an extension of Fast Multipole Methods [1], and of the non-standard wavelet form for Calderón-Zygmund integral operators and pseudo-differential operators of [2]. Unlike the integral equation approach, we start from the generator T of the semigroup associated to a differential operator, rather than from the Green operator. For most of the examples above the (dyadic) powers of this operator decrease in rank, thus suggesting the compression of the function (and geometric) space upon which each power acts. The scheme we propose consists in the following: apply T to a space of test functions at the finest scale, compress the range via a local orthogonalization procedure, represent T in the compressed range, compute T², compress and orthogonalize, and so on: at scale j we obtain a compressed representation of T^{2^{j+1}}, acting on the range of T^{2^{j+1}−1}, for which we have a (compressed) orthonormal basis; we then apply T^{2^{j+1}}, locally orthogonalize and compress the result, thus getting the next coarser subspace. The computation of the inverse "Laplacian" (I − T)^{−1} (of course, on the complement of the eigenspace corresponding to the eigenvalue 1) can be carried out (in compressed form) via the Schultz method [2]: we have

(I − T)^{−1} f = Σ_{k=0}^{+∞} T^k f,

which implies, letting S_K = Σ_{k=0}^{2^K − 1} T^k,

S_{K+1} f = S_K f + T^{2^K} S_K f = Π_{k=0}^{K} (I + T^{2^k}) f.  (1.1)

Since we can apply each T^{2^k} fast to any function f, and hence the product S_{K+1}, we can apply (I − T)^{−1} to any function f fast, in time O(n log² n). Moreover, since this construction adapts to the natural geometry induced by the operator, it is very promising in view of applications to the problem of homogenization of linear diffusion partial differential operators with non-constant coefficients.

The interplay between the geometry of sets, function spaces on sets, and operators on sets is of course classical in Harmonic Analysis. Our construction views the columns of a matrix representing T as data points in Euclidean space, which are then viewed as lying on a manifold, for which the first few eigenvectors of T provide coordinates (see [3,4]). The spectral theory of T provides a Fourier Analysis on this set, relating our approach to multiscale geometric analysis, and to multiscale Fourier and wavelet analysis. The action of a given diffusion semigroup on the space of functions on the set is analyzed in a multiresolution fashion, where (dyadic powers of) the diffusion operator correspond to dilations, and projections (in general non-orthogonal) correspond to downsampling. The localization of the scaling functions we construct then allows us to reinterpret these operations in function space in a geometric fashion. This mathematical construction has a numerical implementation which is fast and stable. We thus get a fast construction of multiresolution analyses on rather general spaces, with respect to the dilations induced by a diffusion semigroup, and fast algorithms for computing the corresponding transform. The framework we consider in this work includes at once large classes of manifolds, graphs, and spaces of homogeneous type, and can be extended even further.

Given a metric measure space and a symmetric diffusion semigroup on it, there is a natural multiresolution analysis associated to it, and an associated Littlewood-Paley theory for the decomposition of the natural function spaces on the metric space. These ideas are classical (see [5] for a specific instance in which the diffusion semigroup is center-stage; the literature on multiscale decompositions in general settings is vast, see e.g. [6–10] and references therein). Generalized Heisenberg principles exist [11] that guarantee that eigenspaces are well approximated by scaling function spaces at the appropriate scale. Our work shows that scaling functions and wavelets can be constructed, and that fast numerical algorithms implementing these ideas exist. In effect, this construction allows us to "lift" multiscale signal processing techniques to datasets, for compression and denoising of functions and operators on a dataset (and of the dataset itself), and for approximation and learning.

It also allows the introduction of dictionaries of diffusion wavelet packets [12], with the corresponding fast algorithms for best basis selection.

Material related to this paper, such as examples and Matlab scripts for their generation, will be made available on the web page at http://www.math.yale.edu/~mmm82/diffusionwavelets.html.

Note 1 Because of the scope of this article and space constraints, it is not possible to list all the references to all the works related to the motivations, techniques, and applications of the constructions and algorithms introduced in this paper. We provide references to the most relevant works, but many others had to be omitted.


2 The construction in the finite dimensional discrete setting

In this section we present a particular case of our construction in a finite-dimensional, purely discrete setting, for which only finite-dimensional linear algebra is needed. We refer the reader to the diagram in Figure 2 for a visual representation of the scheme presented.

[Figure 1: plot of the spectra σ(T), σ(T³), σ(T⁷), σ(T¹⁵) against the precision ε, with the corresponding approximation subspaces V1, V2, V3, V4, . . . indicated.]

Fig. 1. Spectra of powers of T and corresponding multiscale eigenspace decomposition.

We consider a finite graph X and a symmetric, positive definite and positive "diffusion" operator T on (functions on) X (for a precise definition and discussion of these hypotheses see Section 4). Without loss of generality, we can assume that ||T||₂ ≤ 1. For example:

(i) The graph could represent a metric space in which points are data (e.g. documents) and edges have weights (e.g. a function of the similarity between documents), and I − T could be a Laplacian on X that induces a natural diffusion process. T is then similar to a Markov matrix P representing the natural random walk on the graph [13].

(ii) X could represent the discretization of a domain and T = e^{−∆}, where ∆ is a partial differential operator, such as a Laplacian on a domain or manifold with smooth boundary.

Our main assumptions are that T is local, i.e. T(δ_k), where δ_k is the Dirac δ-function at k ∈ X, has small support, and that high powers of T have low numerical rank. Ideally there exists a γ < 1 such that for every j ≥ 0 we have rk_ε(T^{2^j}) < γ rk_ε(T^{2^{j−1}}), where rk_ε denotes the ε-numerical rank as in Definition 6.
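As a numerical stand-in (the text defines rk_ε via subspaces in Definition 6; here we simply count singular values above ε), the decay assumption can be checked empirically by repeated squaring:

import numpy as np

def numerical_rank(A, eps):
    # Count the singular values of A above eps.
    return int(np.sum(np.linalg.svd(A, compute_uv=False) > eps))

# Check rk_eps(T^{2^j}) < gamma * rk_eps(T^{2^{j-1}}) empirically:
# P = T.copy()
# for j in range(J):
#     P = P @ P
#     print(j, numerical_rank(P, 1e-8))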

We want to compute and describe efficiently the powers T^{2^j}, for j > 0, which describe the long term behavior of the diffusion. This will allow the computation of functions of the operator in compressed form (notably of the Green's function (I − T)^{−1}), as well as the fast computation of the diffusion from any initial conditions. This is of course of great interest in the solution of discretized partial differential equations and of Markov chains, but also in learning and related classification problems.

[Figure 2: commutative diagram of the scheme: at each scale j, T_j^{2^j} maps Φ_j to Φ̃_{j+1}, the orthonormalization G_j maps Φ̃_{j+1} onto Φ_{j+1}, and M_j = G_j ∘ T_j^{2^j} maps Φ_j to Φ_{j+1}.]

Fig. 2. Diagram for downsampling, orthogonalization and operator compression. (All triangles are commutative by construction.)

The reason why one expects to be able to compress high powers of T is that they are low rank by assumption, so that it should be possible to efficiently represent them on an appropriate basis, at the appropriate resolution. From the analyst's perspective, these high powers are smooth functions with small gradient (or even in a sense "band-limited" with small band), hence compressible.

We start by fixing a precision ε > 0, assume that T is self-adjoint and is represented on the basis Φ₀ = {δ_k}_{k∈X}, and consider the columns of T, which can be interpreted as the set of functions Φ̃₁ = {Tδ_k}_{k∈X} on X. We use a local multiscale Gram-Schmidt procedure, which is a linear transformation we represent by a matrix G₀, to carefully orthonormalize these columns to get a basis Φ₁ = {ϕ_{1,k}}_{k∈X₁} (X₁ is defined as this index set) for the range of T up to precision ε. This yields a subspace that we denote by V₁. Essentially Φ₁ is a basis for a subspace which is ε-close to the range of T, with basis elements that are well-localized. Moreover, the elements of Φ₁ are coarser than the elements of Φ₀, since they are the result of applying the "dilation" T once. Obviously |X₁| ≤ |X|, but the inequality may already be strict since part of the numerical range of T may be below the precision ε. Whether this is the case or not, we then have a map M₀ from X to X₁, which is the composition of T with the orthonormalization by G₀. We can also represent T in the basis Φ₁: we denote this matrix by T₁ and compute T₁² = M₀M₀^T. We proceed now by looking at the columns of T₁², which are Φ̃₂ = {T₁²δ_k}_{k∈X₁}, i.e., by unravelling the bases on which this is happening, {T²ϕ_{1,k}}_{k∈X₁} up to the precision ε. Again we can apply a local Gram-Schmidt procedure to orthonormalize this set: this yields a matrix G₁ and an orthonormal basis Φ₂ = {ϕ_{2,k}}_{k∈X₂} for the range of T₁² up to precision ε, and hence for the range of T₀³ up to precision 2ε. Moreover, depending on the decay of the spectrum of T, |X₂| is in general a fraction of |X₁|. The matrix M₁, which is the composition of G₁ with T₁², is then of size |X₂| × |X₁|, and T₂² = M₁M₁^T is a representation of T⁴ acting on Φ₂.


After j steps in this fashion, we will have a representation of T^{1+2+2²+···+2^j} = T^{2^{j+1}−1} onto a basis Φ_j = {ϕ_{j,k}}_{k∈X_j}. Depending on the decay of the spectrum of T, we expect |X_j| to be only a small fraction of |X|.

Definition 1 Let X be a set. A function d : X × X → [0,+∞) is called a quasi-metric if

(i) d(x, y) ≥ 0 for every x, y ∈ X, with equality if and only if x = y,
(ii) d(x, y) = d(y, x) for every x, y ∈ X,
(iii) there exists C > 0 such that d(x, y) ≤ C(d(x, z) + d(z, y)) for every x, y, z ∈ X (quasi-triangle inequality).

The pair (X, d) is called a quasi-metric space. (X, d) is a metric space if one can choose C = 1 in (iii).

Example 2 A weighted undirected connected graph (G, E, W), where W is the set of positive weights on the edges in E, is a metric space when the distance is defined by

d(x, y) = inf_{γ_{x,y}} Σ_{e∈γ_{x,y}} w_e,

where γ_{x,y} is a path connecting x and y. Often in applications a measure on (the vertices of) G is either uniform or specified by some connectivity properties of each vertex (e.g. the sum of the weights of the edges concurrent in each vertex).

Let (X, d, µ) be a quasi-metric space with Borel measure µ. For x ∈ X, δ > 0, let

B_δ(x) = {y ∈ X : d(x, y) < δ}

be the open ball of radius δ around x. For a subset S of X, we will use the notation

N_δ(S) = {x ∈ X : ∃ y ∈ S : d(x, y) < δ}

for the δ-neighborhood of S and, for δ₁ < δ₂, we let

N_{δ₁,δ₂}(S) = N_{δ₂}(S) \ N_{δ₁}(S).

Since we would like to handle finite, discrete and continuous situations in a unique framework, and since many finite-dimensional discrete problems arise from the discretization of continuous and infinite-dimensional problems, we introduce the following definitions.

Definition 3 (Coifman and Weiss, [?]) A quasi-metric measure space (X, d, µ) is said to be of homogeneous type [?,6] if µ is a non-negative Borel measure and there exists a constant C_X > 0 such that for every x ∈ X, δ > 0,

µ(B_{2δ}(x)) ≤ C_X µ(B_δ(x)).  (3.1)

We assume µ(B_δ(x)) < +∞ for every x ∈ X, δ > 0, and we will work on connected spaces X. In the continuous situation, we will assume that µ({x}) = 0 for every x ∈ X.


One can replace the quasi-metric d with an equivalent quasi-metric ρ so that the δ-balls in the ρ metric have measure approximately δ: it is enough to define ρ(x, y) as the measure of the smallest ball containing x and y. One can also assume some Hölder-smoothness for ρ, in the sense that

|ρ(x, y) − ρ(x′, y)| ≤ c ρ(x, x′)^β [ρ(x, y) + ρ(x′, y)]^{1−β}

for some β ∈ (0, 1), c > 0; see [7,8,15,16].

Example 4 Examples of spaces of homogeneous type include:

(i) Euclidean spaces of any dimension, with isotropic or anisotropic metrics induced by positive-definite bilinear forms and their powers (see e.g. [17]).

(ii) Compact Riemannian manifolds with respect to the geodesic metric, or also with respect to metrics induced by certain classes of vector fields [18].

(iii) Graphs with bounded degree (the degree of a vertex is defined as

d_x = Σ_{y∼x} w_{yx},

where y ∼ x means y is connected to x), in particular regular graphs.

Definition 5 A family of functions {ϕ_k}_k on (X, d) is called δ-local, for some δ > 0, if

sup_k diam(supp ϕ_k) < δ,

i.e. there exists a set {x_k}_k ⊆ X, called a supporting set for {ϕ_k}_k, such that for every k we have supp ϕ_k ⊆ B_δ(x_k).

Notation 1 If V is a closed linear subspace of L²(X, µ), we denote the orthogonal projection onto V by P_V.

The closure V̄ of any subspace V is taken in L²(X, µ), unless otherwise specified.

Definition 6 Two subspaces V and W are ε-close if there exists a linear isomorphism L with ||I − L||₂ < ε mapping V onto W.

An ε-numerical span of a set of vectors {ϕ_k}_k in L² is defined as a closed subspace which is ε-close to the closure of the span of {ϕ_k}_k.

An ε-numerical range of an operator T is a subspace which is ε-close to the range of T; its dimension is the ε-numerical rank rk_ε(T). The definition of the numerical kernel is analogous.

Notation 2 If L is a self-adjoint bounded operator on L²(X, µ), with spectrum σ_L and spectral decomposition

L = ∫_{σ_L} λ dE_λ,

we define

L_ε = ∫_{{λ∈σ_L : |λ|>ε}} λ dE_λ.

4 Multiresolution Analysis induced by symmetric diffusion semigroups

    4.1 Symmetric diffusion semigroups

    We start from the following definition from [5].

Definition 7 Let {T^t}_{t∈[0,+∞)} be a family of operators on (X, µ), each mapping L²(X, µ) into itself. Suppose this family is a semigroup, i.e. T⁰ = I and T^{t₁+t₂} = T^{t₁} T^{t₂} for any t₁, t₂ ∈ [0,+∞), and lim_{t→0⁺} T^t f = f in L²(X, µ) for any f ∈ L²(X, µ).

Such a semigroup is called a symmetric diffusion semigroup if it satisfies the following:

(i) ||T^t||_p ≤ 1 for every 1 ≤ p ≤ +∞ (contraction property).
(ii) Each T^t is self-adjoint (symmetry property).
(iii) T^t is positivity preserving: T^t f ≥ 0 for every f ≥ 0 in L²(X) (positivity property).
(iv) The semigroup has a generator −∆, so that

T^t = e^{−t∆}.  (4.1)

While some of these assumptions are not strictly necessary, their adoption simplifies this presentation without reducing the types of applications we are interested in at this time.

We will denote by σ_T the spectrum of T (so that {λ^t}_{λ∈σ_T} is the spectrum of T^t), and by {ξ_λ}_{λ∈σ_T} the corresponding basis (orthogonal since T^t is normal) of eigenvectors, normalized in L² (without loss of generality). Observe that here and in all that follows we will use the abuse of notation that implicitly accounts for possible multiplicities of the eigenvalues.

Remark 8 Observe that σ_T ⊆ [0, 1]: by the semigroup property and (ii), we have

T = T^{1/2} T^{1/2} = (T^{1/2})* T^{1/2},

so the spectrum is non-negative. The upper bound obviously follows from condition (i).


Remark 9 There are classes of diffusion operators that are not self-adjoint but very important in applications. The construction and the algorithm we propose do not depend on this hypothesis; however, the interpretation of many results does depend on the spectral decomposition. In the paper [19] we will address these and related issues in a broader framework.

    Example 10 Examples include

(a) The Poisson semigroup, for example on the circle or half-space, or anisotropic and/or higher dimensional anisotropic versions (e.g. [17]).

(b) The random walk diffusion induced by a symmetric Markov chain (on a graph, a manifold, etc.).

(c) The semigroup T^t = e^{tL} generated by a second-order differential operator on some interval (a, b) (a and/or b possibly infinite), of the form

L f = a₂(x) d²f/dx² + a₁(x) df/dx + a₀(x) f,

with a₂(x) > 0, a₀(x) ≤ 0, acting on a subspace of L²((a, b), q(x)dx), where q is an appropriate weight function, given by imposing appropriate boundary conditions, so that L is (unbounded) self-adjoint. Conditions (i) to (iii) are satisfied, and (iv) is satisfied if a₀ = 0.

This extends to Rⁿ by considering elliptic partial differential operators of the form

L f = (1/w(x)) Σ_{i,j=1}^{n} ∂_{x_i}(a_{ij}(x) ∂_{x_j} f) + c(x) f,

where we assume c(x) ≤ 0 and w(x) > 0, and we consider this operator as defined on a smooth manifold. L, applied to functions with appropriate boundary conditions, is formally self-adjoint and generates a semigroup satisfying (i) to (iii), and (iv) is satisfied if c = 0.

An important case is the Laplace-Beltrami operator on a compact smooth manifold (or even on a Lie group), and subordinated operators [5].

(f) If (X, d, µ) is derived from a graph (G, W) as described above in Example 2, one can define D_i = Σ_j W_{ij}; the matrix D^{−1}W is then a Markov matrix, which corresponds to the natural random walk on G. The operator L = D^{−1/2}(D − W)D^{−1/2} is the normalized Laplacian on graphs; it is a contraction on L²(G) and is self-adjoint. This discrete setting is extremely useful in applications and widely used in a number of fields such as data analysis (e.g. clustering, learning on manifolds, parametrization of data sets), computer vision (e.g. segmentation) and computer science (e.g. network design, distributed processing). We believe our construction will have applications in all these settings. As a starting point, see for example [13] for an overview, and [20] for an application to image segmentation. (A small numerical sketch of this construction follows this list.)

(g) One-parameter groups (continuous or discrete) of dilations in Rⁿ, or other motion groups (e.g. the Galilei group or the Heisenberg-type groups), act on square-integrable functions on these spaces (with the appropriate measures) as isometries or contractions. Continuous wavelets in some of these settings have been studied (see e.g. [21] and references therein), but an efficient discretization of such transforms seems still an open problem (see however [22]).

(h) The random walk diffusion on a hypergroup (see e.g. [23] and references therein).

Definition 11 A positive diffusion semigroup of operators acts η-regularly, for some η > 0, if for every x and every smooth test function ϕ that is δ-local around x, the function Tϕ is ηδ-local around x.

    4.2 Diffusion metrics and embeddings induced by a diffusion semigroup

The material in this section and its applications are discussed in greater detail in [24,3] and references therein.

    From now on, in all that follows, we shall assume, mainly for simplicity, thatT is compact.

Being positive definite, T^t induces the diffusion metric

d^{(t)}(x, y) = ( Σ_{λ∈σ_T} λ^t (ξ_λ(x) − ξ_λ(y))² )^{1/2}
             = ⟨δ_x − δ_y, T^t(δ_x − δ_y)⟩^{1/2}
             = ||T^{t/2} δ_x − T^{t/2} δ_y||₂.  (4.2)

Definition 12 The metric defined above is called the diffusion metric associated to T, at time t.

If the action of T^t on L²(X, µ) can be represented by a (symmetric) kernel K_t(x, y), then d^{(t)} can be written as

d^{(t)}(x, y) = √(K_t(x, x) + K_t(y, y) − 2K_t(x, y)),  (4.3)

since the spectral decomposition of K_t is

K_t(x, y) = Σ_{λ∈σ_T} λ^t ξ_λ(x) ξ_λ(y).  (4.4)
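On a finite space, (4.3) can be transcribed directly (K_t here stands for the matrix of the kernel of T^t; a sketch for illustration):

import numpy as np

def diffusion_distance(K_t, x, y):
    # d^(t)(x, y)^2 = K_t(x,x) + K_t(y,y) - 2 K_t(x,y), as in (4.3).
    return np.sqrt(K_t[x, x] + K_t[y, y] - 2.0 * K_t[x, y])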

For any subset σ′_T ⊆ σ_T we can consider the map of metric spaces

Φ^{(t)}_{σ′_T} : (X, d^{(t)}) → (R^{|σ′_T|}, d_{Euc}), x ↦ (λ^{t/2} ξ_λ(x))_{λ∈σ′_T},  (4.5)

which in particular cases is called an eigenmap [25,26], and is a form of local multidimensional scaling (see also [3,4,24]). By the definition of d^{(t)}, this map is an isometry when σ′_T = σ_T, and an approximation to an isometry when σ′_T ⊊ σ_T.

If σ′_T is the set of the n top eigenvalues, and if ∫_X K(x, y) dµ(y) = 1, then Φ_{σ′_T} is a minimum local distortion map from (X, d^{(t)}) to Rⁿ, in the sense that it minimizes

n − tr(P_{⟨{φ_λ}_{λ∈σ′_T}⟩} K^{(t)} P*_{⟨{φ_λ}_{λ∈σ′_T}⟩}) = (1/2) Σ_{λ∈σ′_T} ∫_X ∫_X (φ_λ(x) − φ_λ(y))² K^{(t)}(x, y) dµ(x) dµ(y)  (4.6)

among all possible maps x ↦ (φ₁(x), . . . , φ_n(x)) such that ⟨φ_i, φ_j⟩ = δ_{ij}. This is just a rephrasing, in our situation, of the well-known fact that the top k singular vectors span the best approximating k-dimensional subspace of the domain of a linear operator, in the L² sense. See e.g. [27] and references therein for a comparison between different dimensionality reduction techniques that can be cast in this framework.

These ideas are related to various techniques used for nonlinear dimension reduction: see for example http://www.cse.msu.edu/~lawhiu/manifold/ for a list of relevant links, and [28–31,4,3,32] and references therein.

Observe that for any fixed precision ε, one can find σ′_T so that the eigenfunction expansion (4.2) of d^{(t)} can be truncated on σ′_T with approximation error smaller than ε.

Finally, we observe that the semigroup not only generates a family of metrics on X, but a family of measures as well. We simply consider the σ-algebra A^{(t)} generated by the family of sets

S^{(t)} = {B^{(t)}_δ(x) := supp T^t(χ_{B_δ(x)}) : x ∈ X, δ > 0}

(where χ_A denotes the characteristic function of a set A) and define µ^{(t)}(B^{(t)}_δ(x)) = µ(B_δ(x)) on S^{(t)} and extend it to A^{(t)}.

Example 13 Suppose we have a symmetric Markov chain M on (X, µ), which we can assume irreducible, with transition matrix P. We assume P is positive definite (otherwise we could consider ½(I + P)). Let {λ_l}_l be the set of eigenvalues of P, ordered by increasing value, and {ξ_l}_l be the corresponding set of right eigenvectors of P, which form an orthonormal basis. Then by functional calculus we have

P^t_{i,j} = Σ_l λ_l^t ξ_l(i) ξ_l(j)

and for an initial distribution f on X, we define

P^t f = Σ_l λ_l^t ⟨f, ξ_l⟩ ξ_l.  (4.7)

Observe that if f is a probability distribution, so is P^t f for every t, since P is Markov. The diffusion map Φ_{σ′_T} embeds the Markov chain in Euclidean R^{|σ′_T|} in such a way that the diffusion metric induced by M on X becomes the Euclidean distance in R^{|σ′_T|}.
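A minimal sketch of this embedding, assuming (as above) a symmetric positive definite P; keeping the n_coords top eigenpairs corresponds to choosing σ′_T:

import numpy as np

def diffusion_map(P, t, n_coords):
    # Embed x -> (lambda_l^{t/2} xi_l(x))_l, as in (4.5), so that the
    # diffusion metric at time t becomes Euclidean distance.
    lam, xi = np.linalg.eigh(P)
    order = np.argsort(lam)[::-1]     # decreasing eigenvalues
    lam, xi = lam[order], xi[:, order]
    return xi[:, :n_coords] * lam[:n_coords] ** (t / 2.0)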

    4.3 Construction of scaling functions and wavelets

We can interpret T as a dilation operator acting on functions in L²(X, µ), and use it to define a multiresolution structure. As in [5] and in classical wavelet theory, we start by discretizing the semigroup at the increasing sequence of times

t_j = Σ_{l=0}^{j} 2^l = 2^{j+1} − 1.  (4.8)

Let {λ_i}_{i≥0} be the spectrum of T, ordered in decreasing order, and {ξ_i}_i the corresponding eigenvectors. We can define "low-pass" portions of the spectrum by letting

σ_{T,j} = {λ ∈ σ_T : λ^{t_j} ≥ ε}.  (4.9)

For a fixed ε ∈ (0, 1) (which we may think of as our precision), we define the (finite dimensional!) approximation spaces by

V_j = ⟨{ξ_λ : λ ∈ σ_T, λ^{t_j} ≥ ε}⟩ = ⟨{ξ_λ : λ ∈ σ_{T,j}}⟩.  (4.10)

The set of subspaces {V_j}_{j∈Z} is a multiresolution analysis in the sense that it satisfies the following properties:

(i) lim_{j→−∞} V_j = Ran(T), lim_{j→+∞} V_j = ⟨{ξ_i : λ_i = 1}⟩.
(ii) V_{j+1} ⊆ V_j for every j ∈ Z.
(iii) {ξ_λ : λ^{t_j} ≥ ε} is an orthonormal basis for V_j.

We can also define, for j ≥ 0, the subspaces W_j as the orthogonal complement of V_{j+1} in V_j, so that we have the familiar relation between approximation and detail subspaces as in the classical wavelet multiresolution constructions:

V_j = V_{j+1} ⊕⊥ W_j.  (4.11)


The direct orthogonal sum

L²(X) = ⊕⊥_{j≥0} W_j

is a wavelet decomposition of the space, induced by the diffusion semigroup, and related to the Littlewood-Paley decomposition studied in this setting by Stein [5].

While by definition (4.10) we already have an orthonormal basis of eigenfunctions of T for each subspace V_j (and for the subspaces W_j as well), these basis functions are in general highly non-localized, being generalized global Fourier modes of the operator. Our aim is to build localized bases for each of these subspaces, starting from a basis of the fine subspace V_0 and explicitly constructing a downsampling scheme that yields an orthonormal basis for V_j, j > 0. This is motivated by general Heisenberg principles (see e.g. [33] for a setting similar to ours) that guarantee that eigenfunctions have a smoothness or "frequency content" or "scale" determined by the corresponding eigenvalues, and can be reconstructed by maximally localized "bump functions", or atoms, at that scale. Such ideas have many applications in numerical analysis, especially to problems motivated by physics (multigrid techniques [34,35], matrix compression [36–39], etc.) and are in general behind many techniques for multiscale compression.

We avoid the computation of the eigenfunctions; nevertheless, the approximation spaces Ṽ_j that we build will be approximately ε-close to the true V_j's.

We propose two ways of constructing the scaling functions in the Multi-Resolution Analysis. The first uses the following:

Proposition 14 (Multiscale local orthonormalization) Let Φ̃ be a δ-local set of functions on X, spanning a subspace V_0, S a supporting set for Φ̃, and let ε > 0. Then there exists an orthonormal basis

Φ = {Φ_l}_{l=0,...,L} = {{ϕ_{l,k}}_{k∈K_l}}_{l=0,...,L}

of V_0 such that:

(a) for each l ∈ {0, . . . , L}, the family {ϕ_{l,k}}_{k∈K_l} is (2^{l+1}−1)δ-local and it has a supporting set S_l = {x_{l,k}}_{k∈K_l} ⊂ S that is a (2^{l+2} − 2)δ-lattice. In particular |K_l| ≈ 2^{−l} µ(X)/µ(B_δ);
(b) the linear map G from Φ̃ to Φ is local and sparse;
(c) there exists a monotonic function L such that for each l ≥ 0, {Φ_{l′}}_{l′=0,...,L(l)} spans a subspace which is ε-close to the subspace spanned by the top l singular vectors of Φ̃.


We postpone the proof of this Proposition, and comments on its numerical aspects and implementation, till Section 5.2. Here we would like to remark that, under rather general assumptions on Φ̃ and geometric assumptions on its supporting set S, one can prove that the constructed Φ have exponential decay.

For the second orthonormalization technique, which guarantees good bounds on the decay of the orthonormal functions we build, we will need the following definition from [6].

Definition 15 A matrix (B_{jk})_{(j,k)∈J×J} is called η-accretive if there exists an η > 0 such that for every ξ ∈ ℓ²(J) we have

Re Σ_j Σ_k B_{jk} ξ_j ξ̄_k ≥ η ||ξ||²_{ℓ²(J)}.

In [6, Proposition 3 in Section 11.4], Coifman and Meyer prove a result suggesting the following.

Proposition 16 Let (X, d, µ) be a space of homogeneous type, and let Φ̃ = {ϕ̃_j}_{j∈J} be a countable, δ-local Riesz basis of L²(X, µ), with supporting set S = {s_j}_{j∈J}. If the Gramian matrix G_{ij} = ⟨ϕ̃_i, ϕ̃_j⟩ is η-accretive and there exist C, α > 0 such that

|G_{i,j}| ≤ C e^{−α d(s_i, s_j)},

then there exist C′, β > 0 and an orthonormal basis {ϕ_j}_{j∈J} such that

ϕ_j = Σ_{k∈J} γ(j, k) ϕ̃_k

with

|γ(j, k)| ≤ C′ e^{−β d(s_j, s_k)}.

The proof proceeds exactly as in [6], with the triangle inequality on Zⁿ replaced by the quasi-triangle inequality for the metric d.

Definition 17 Let (X, ρ, µ) be a space of homogeneous type, and γ > 0. A subset {x_i}_i of X is a γ-lattice if ρ(x_i, x_j) > cγ for all i ≠ j, and for every y ∈ X there exists i(y) such that ρ(y, x_{i(y)}) < cγ. Here c depends only on the constant C_X of the space of homogeneous type.

    We first present our construction in general.

Theorem 18 (Construction of the Multi-Resolution Analysis) Let {T^t} be a symmetric diffusion semigroup acting regularly on X, and let Φ̃ = {ϕ̃_k}_{k∈K̃} be a countable orthonormal δ-local basis, with supporting set S.

Let Ṽ_0 = ⟨Φ̃⟩ and assume Ṽ_0 is ε-close to V_0, for some fixed ε > 0.


Then we can build a set of linear maps {G_j} and {M_j} as in Figure 2, and scaling functions

Φ = {{{ϕ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j}}_{j=1,...,J},

with the following properties:

(I) For every j ≥ 1, let

Φ_j = {{ϕ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j}

and ˜̃V_j = ⟨Φ_j⟩. Then Φ_j is an orthonormal basis and ˜̃V_j is 2(j+1)ε-close to V_j.

(II) For every j ≥ 1, the orthonormal basis Φ_j has the following layered structure:
(a) For every l ∈ {0, . . . , L_j}, |K_{j,l}| ≲ 2^{−j+1} µ(X)/µ(B^{(t_j)}_δ).
(b) For every l ∈ {0, . . . , L_j}, with respect to the metric d^{(t_j)}, the set {ϕ_{j,l,k}}_{k∈K_{j,l}} is (2^{l+1}−1)δ-local and admits a supporting set S_{j,l} that is a (2^{l+2}−2)δ-lattice. In particular, supp ϕ_{j,l,k₁} ∩ supp ϕ_{j,l,k₂} = ∅ for all k₁, k₂ ∈ K_{j,l}, k₁ ≠ k₂.

(III) The representation of T^{2^{j+1}} on Φ_j is faithful, in the sense that T^{2^{j+1}} P_{Φ_j} is 2(j+1)ε-close to T^{2^{j+1}}|_{Ran(T^{2^{j+1}−1})}.

(IV) The orthonormalization map G_j mapping Φ̃_{j+1} = T^{2^j} Φ_j onto Φ_{j+1}, and the transformation M_j = G_j T^{2^j} mapping Φ_j onto Φ_{j+1}, are local and sparse.

A localized orthonormal basis of wavelets spanning the subspace W_j, defined as in (4.11), can be constructed.

Alternatively, by using the orthogonalization of Proposition 16, a family of scaling functions Φ with exponential decay for the same subspaces ˜̃V_j can be constructed.

PROOF. We apply T to the given orthonormal basis Φ̃ (dilation step). This yields an ηδ-local family of functions {ϕ̃_{1,k}}_{k∈K̃_0} on X whose span Ṽ_1 is ε-distant from V_1, by definition of V_1, because Ṽ_0 was ε-close to V_0 and T_ε is a contraction.

We then apply Proposition 14 to {ϕ̃_{1,k}}_{k∈K̃_0} (downsampling step) to get the orthonormal basis

Φ_1 = {{ϕ_{1,l,k}}_{k∈K_{1,l}}}_{l=0,...,L_1}

for a subspace ˜̃V_1 which is ε-close to Ṽ_1, and hence 2ε-close to V_1, with the properties claimed in Theorem 18.

At this point we represent (T²)_ε on Φ_1. Observe that only the top L(ε²) layers of Φ_1 are needed to express (T²)_ε up to precision ε, because of the properties of the algorithm used to build Φ_1. Moreover, the subset {{ϕ_{1,l,k}}_{k∈K_{1,l}}}_{l=0,...,L(ε²)} of Φ_1 is (η^{L(ε²)+1} − 1)δ-local.

By induction, once we have built Φ_j with all the properties claimed, we apply (T^{2^{j+1}})_ε to Φ_j, getting a set of functions Φ̃_{j+1}. As before, observe that only the first L(ε^{2^{j+1}}) layers of Φ_j will be necessary to express T^{2^{j+1}} to precision ε, and this subset is local. Applying Proposition 14 yields an orthonormal basis Φ_{j+1} with the properties claimed. The map G_j is constructed in the proof of Proposition 14, and is a local orthogonalization map that is sparse, being a multiscale orthogonalization of local functions. The orthonormalization produces an error term of order ε in the distance between Ṽ_j and ˜̃V_j.

Observe that each orthonormalization could have been achieved by applying Proposition 16, since at each level we are applying a self-adjoint, strictly positive operator to an orthonormal basis: it is easy to see that the resulting system has an accretive Gramian matrix.

    4.4 Downsampling and compression of the powers of T

Let us summarize this construction in terms of linear operators (and corresponding matrix representations), to show how this leads not only to the fast construction of the multiscale analysis and the orthonormal bases of scaling functions, but also to a multi-level compression of the operator at different scales. Figure 2 is again a visual summary of what follows.

By induction, suppose we have constructed the orthonormal scaling functions

Φ_j = {{ϕ_{j,l,k}}_{k∈K_{j,l}}}_{l=0,...,L_j}

at scale j ≥ 1, written on the orthonormal basis

Φ_{j−1} = {{ϕ_{j−1,l,k}}_{k∈K_{j−1,l}}}_{l=0,...,L_{j−1}}

at scale j − 1, and compress T^{2^j} by representing it on these two bases in the domain and range respectively. Then we perform two operations: we compute Φ̃_{j+1} = {ϕ̃_{j,l,k}} = {T^{2^j} ϕ_{j,l,k}} for every k ∈ K_{j,l}, l ∈ {0, . . . , L_j} (this operation is fast because T^{2^j} is compressed and well-localized after compression), and then use a fast local modified Gram-Schmidt to find the numerical span of {ϕ̃_{j,l,k}}_{k∈K_{j,l}} and at the same time orthogonalize them. This is a linear map G_j that yields the orthonormal basis Φ_{j+1} = {ϕ_{j+1,l,k}}_{k∈K_{j+1,l}}, l ∈ {0, . . . , L_{j+1}}, naturally represented on the bases Φ_j (in the domain) and Φ_{j+1} (in the range), such that ⟨Φ_{j+1}⟩ is ε-close to ⟨Φ̃_{j+1}⟩. The map M_j is the composition G_j ◦ T^{2^j}, and is thus naturally represented on the bases Φ_j (in the domain) and Φ_{j+1} (in the range). Observe that for j = 0 we can assume the operator T is given represented on the basis {ϕ_k}_{k∈K̃}, and then we represent T^{2^j} on the basis Φ_j simply by computing

T_j^{2^j} = (Φ_j T_{j−1}^{2^{j−1}} Φ_j^T)² = M_{j−1} M_{j−1}^T.  (4.12)

Φ_j is of size |K_j|L_j × |K_j|L_j, so T_j^{2^j} is of the same size, and is a compressed version of the operator T^{2^j} acting on Ṽ_j.

We can interpret this as having "downsampled" the subspace V_j at the critical "rate" for representing, up to precision, the action of T^{2^j} on it. This is related to the Heisenberg principle and sampling theory for these operators [11], which implies that low-frequency eigenfunctions are not well localized and can be synthesized by coarse (depending on the scale) "bump" functions. One then expects the existence of quadrature formulas determined by very few samples: we consider this problem of great importance, and it is currently being investigated. Most of the functions {ϕ_{j,l,k}} are essentially as well localized as is compatible with their being in V_j. Moreover, because of this localization property, we can interpret this downsampling in function space geometrically. We have the identifications

X_j := {x_{l,k} : x_{l,k} is the center of ϕ_{j,l,k}} ↔ K_{j,l} ↔ {ϕ_{j,l,k}}_{k∈K_{j,l}}.  (4.13)

The natural metric on X_j is d^{(t_j)}, which is, by (4.2), the distance between the ϕ_{j,l,k} in L²(X, µ), and can be computed recursively in each X_j by combining (4.2) with (4.12). This allows us to interpret X_j as a space representing the metric d^{(t_j)} in compressed form.

In our construction we only compute ϕ_{j,l,k} expanded on the basis {ϕ_{j−1,l,k}}_{k∈K_{j−1,l}} (this is the role of M_j); in other words, up to the identifications above, we know ϕ_{j,l,k} on the downsampled space X_j only. However, we can extend these functions to X_{j−1}, and recursively all the way down to X_0 = X, just by using the multiscale relations:

ϕ_{j,l,k}(x) = M_{j−1} ϕ_{j−1,l,k}(x), x ∈ X_{j−1}
             = M_{j−1} M_{j−2} · . . . · M_0 ϕ_{0,l,k}(x) = (Π_{i=0}^{j−1} M_i) ϕ_{0,l,k}(x), x ∈ X_0.  (4.14)

This is of course completely analogous to the standard construction of scaling functions in the Euclidean setting [40,41,2,42]. This formula also immediately generalizes to arbitrary functions in V_j, extending them from X_j to the whole original space X (see for example Figure 6).

The matrix M_j is sparse, since T^{2^j} is expected to be local (in the compressed space X_j) and G_j is sparse (as we show in the Appendix), so that M_j has essentially |X_j| log |X_j| elements above precision (in fact, in many cases of practical interest, only |X_j|!), at least for j not too large. When j is large the operator is in general not local anymore, but the space X_j on which it is compressed is expected to be very small. For interesting classes of operators we actually expect to have only |X_j| entries in M_j which are bigger than ε, and thus the algorithms presented will have order O(n). This follows from the observation that the basis functions {{ϕ_{l,k}}_{k∈K_l}}_{l=0,...,L_1} by construction essentially span the subspace corresponding to the top singular vectors of Φ̃ = T^{2^j} Φ_{j−1}, which correspond to the (fast) decaying singular values of T^{2^j}.

Remark 19 We could have started from Ṽ_0 and defined the V_j as the result of j steps of our algorithm: in this way we could do without the spectral decomposition of the semigroup. This permits the application of our whole construction and algorithms to the non-self-adjoint case.

Remark 20 (Biorthogonal bases) While at this point we are mainly concerned with the construction of orthonormal bases for the approximation spaces V_j, well-conditioned bases would be just as good for most purposes, and would lead to the construction of stable biorthogonal scaling function bases. This could follow the ideas of "dual" operators in Coifman's formula on spaces of homogeneous type [43,6]; also, exactly as in the classical biorthogonal wavelet construction [44], we would have two ladders of approximation subspaces, with wavelet subspaces giving the oblique projection onto their corresponding duals. This will be reported in the separate work [19].

    4.5 Wavelets

We would like to construct bases {ψ_{j,l,k}}_{k,l} for the spaces W_j, j ≥ 1, such that V_{j+1} ⊕⊥ W_j = V_j. To achieve this, after having built {{ϕ_{j,l,k}}_{k∈K_{j,l}}}_l and {{ϕ_{j+1,l,k}}_{k∈K_{j+1,l}}}_l, we can apply our modified Gram-Schmidt procedure with geometric pivoting to the set of functions

{(P_j − P_{j+1}) ϕ_{j,l,k}}_{k∈K_{j,l}},

which will yield an orthonormal basis Ψ_j of wavelets for the orthogonal complement W_j of V_{j+1} in V_j. Moreover, one can easily prove upper bounds for the diameters of the supports of the wavelets so obtained, and for their decay.

    4.6 Vanishing moments for the scaling functions and wavelets

In Euclidean settings vanishing moments are usually defined via orthogonality relations to subspaces of polynomials up to a certain degree. In our setting the natural subspaces with respect to which to measure orthogonality are spanned by the eigenfunctions ξ_λ of T. So the number of vanishing moments of an orthonormal scaling function ϕ_{j,l,k} can be defined as the number of eigenfunctions, corresponding to eigenvalues in σ_{T,j} \ σ_{T,j+1} (as defined in (4.9)), to which T^{2^j} ϕ_{j,l,k} is orthogonal up to precision ε. Observe this is comparable to defining the number of vanishing moments based on ||T^{2^{j+1}} ϕ_{j,l,k}||₂, which evokes classical estimates in the multiscale theory of Calderón-Zygmund operators.
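As a sketch, with xi a matrix whose columns are the (orthonormal) eigenfunctions with eigenvalues in σ_{T,j} \ σ_{T,j+1}, and phi standing for T^{2^j} ϕ_{j,l,k} (both names are illustrative):

import numpy as np

def vanishing_moments(phi, xi, eps):
    # Number of eigenfunctions (columns of xi) to which phi is orthogonal
    # up to precision eps.
    return int(np.sum(np.abs(xi.T @ phi) < eps))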

In Section 5 we will see that there is an approximately monotonic relationship between the index l of the "layer" of scaling functions and the number of eigenfunctions to which the scaling functions are orthogonal, i.e. the number of moments of the scaling functions.

In particular, the mean of the scaling functions is typically roughly constant for the first few layers, and then quickly drops below precision as l grows: this would allow us to split each scaling function space into two subspaces, one with scaling functions of nonzero mean, and one of scaling functions with zero mean, which in fact could appropriately be called wavelets.

5 The orthogonalization step: computational and numerical considerations

In this section we discuss details of the implementation, and comment on the computational complexity of the algorithms and on their numerical stability.

Suppose we are given a δ-local family Φ̃ = {ϕ̃_k}_{k∈K̃} of positive functions on X, and let X_0 = {x̃_k}_{k∈K̃} be a supporting set for Φ̃. With obvious meaning, we will sometimes write ϕ̃_{x̃_k} for ϕ̃_k. Let V = ⟨{ϕ̃_k}_{k∈K̃}⟩. We want to build an orthonormal basis Φ = {ϕ_k}_{k∈K} whose span is ε-close to V. Out of the many possible solutions to this standard problem (see e.g. [45,46] and references therein as a starting point), we seek one for which the ϕ_k's have small support (ideally of the same order as the support of the ϕ̃_k's). Standard orthonormalization in general may completely destroy the size of the supports of (most of) the ϕ̃_k's.

We suggest two algorithms that try to remedy this problem: the first one uses geometric information on the set (hence something outside the scope of linear algebra) to derive a pivoting rule that guarantees that several orthogonal functions have small support; the second one uses only the matrix representing T in a local basis, and a nonlinear pivoting rule mixing L² and L^p norms, p < 2.

The computational cost of constructing the Multi-Resolution Analysis described in the Theorem can be as high as O(n³) in general, where n is the cardinality of X. However, there are two distinct types of properties of T that would allow one to dramatically improve the time necessary for the construction:

I. The decay of σ(T): the faster the spectrum decays, the smaller the numerical rank of the powers of T, the more these can be compressed, and hence the faster the construction of Φ_j for large j's.

II. If each basis Φ_j of scaling functions that we build is such that the elements with large support (large "layer index" l) are in a subspace spanned by eigenvectors of T corresponding to very small eigenvalues (depending on l), then these basis functions of large support will not be needed to compute the next (dyadic) power of T. This implies that the matrices representing all the basis transformations will be uniformly sparse.

A combination of these two hypotheses, which are in fact quite natural and verified by many operators arising in practice, allows one to perform the construction of the whole Multi-Resolution Analysis described in Theorem 18 in time O(n log² n) or even O(n).

    Item (I.) above can only be an assumption.

Item (II.), however, is a question about the algorithms from linear algebra used to construct the bases of scaling functions. We want to comment here on several possible choices of algorithms to be used.

(a) The first one is called modified Gram-Schmidt with pivoting; it is classical in numerical analysis (see for example [46]).

(b) The second one is also based on finite dimensional linear algebra, and is a refinement of the first. It constructs so-called Rank Revealing QR factorizations. Being purely based on linear algebra, it holds with minimum assumptions on T, and this comes at the expense of computational speed.

(c) The third one is based on analysis and geometry, and applies at least if T looks like a Laplacian, in a sense to be made precise below. In this case there are generalized Heisenberg principles, or sampling theorems, that give information on the critical sampling rate necessary to approximate eigenvectors of T, depending on the corresponding eigenvalues. One can then construct the scaling function bases using this information, and in this way guarantee that property (II) is satisfied.

    5.1 Modified Gram-Schmidt with pivoting

Given a set of functions Φ̃ on X, and ε > 0, the algorithm "Modified Gram-Schmidt with Pivoting" computes an orthonormal basis Φ for a subspace ε-close to ⟨Φ̃⟩, as follows.


1 Let ˜̃Φ = Φ̃, k = 0.
2 Let ϕ_k be an element of ˜̃Φ with largest L² norm. If ||ϕ_k||₂ < ε/√|Φ̃|, stop; otherwise add ϕ_k/||ϕ_k||₂ to Φ.
3 Orthogonalize all the remaining elements of ˜̃Φ to ϕ_k, obtaining a new set ˜̃Φ. Increment k by 1. Go to step 2.

When the algorithm stops, we have obtained an orthonormal basis Φ whose span is ε-close to ⟨Φ̃⟩. This follows from the fact that the singular values of the family Φ ∪ ˜̃Φ are the same as those of Φ̃; however, all the columns of ˜̃Φ have L²-norm smaller than ε/√|Φ̃|, so the singular values of ˜̃Φ cannot be larger than √(|Φ̃| − |Φ|) · ε/√|Φ̃| ≤ ε.
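A direct transcription of steps 1–3 above (dense and unoptimized; the columns of Phi are the input functions):

import numpy as np

def mgs_pivot(Phi, eps):
    Phi = Phi.astype(float).copy()
    n_funcs = Phi.shape[1]
    basis = []
    while True:
        norms = np.linalg.norm(Phi, axis=0)
        k = int(np.argmax(norms))               # pivot: largest L2 norm
        if norms[k] < eps / np.sqrt(n_funcs):   # stopping rule of step 2
            break
        q = Phi[:, k] / norms[k]
        basis.append(q)
        Phi -= np.outer(q, q @ Phi)             # orthogonalize the rest to q
    return np.column_stack(basis) if basis else np.zeros((Phi.shape[0], 0))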

This procedure can be shown to be numerically stable, at least when Φ̃ is not too ill-conditioned [46,?]. When the problem is very ill-conditioned, a loss of orthogonality in Φ may occur while running the algorithm. Reorthogonalizing the elements already selected in Φ is enough to achieve stability; see [45] for details and references.

The main drawback of this procedure is that the supports of the functions in Φ can be arbitrarily large even when the supports of the functions in Φ̃ are small and there exists another orthonormal basis for ⟨Φ̃⟩ made of well-localized functions. Since the sizes of the supports of the functions in the new basis are crucial for the speed of the algorithm, we suggest two modifications of modified Gram-Schmidt that seek to obtain orthonormal bases of functions with smaller support.

    5.2 Multiscale modified Gram-Schmidt with geometric pivoting

We suggest here to use a fast local Gram-Schmidt orthogonalization with "geometric pivoting", which provides some guarantees on the size of the support of the orthonormal functions it constructs, besides having complexity O(n log n), where n is the cardinality of X, when applied to a local family. Here we are assuming that the approximate (up to precision) search for the δ-neighbors of any point in X has complexity at most log n. To achieve this, X may need to be pre-processed, and state-of-the-art fast nearest neighbor algorithms achieve this in time O(n^{1+η} log n), with a constant exponential in the "dimension" of X, defined in a reasonable way. The literature on fast approximate nearest neighbor and range searches is vast; see for example [47] as a starting point. Here we limit ourselves to observing that since T is local and is given represented in a δ-local basis, we already have a very valuable amount of information about the nearest neighbors of each point, which can be exploited.


In the following subsection we will present an algorithm that does not use the geometry of the supports and does not need to know nearest neighbors.

The Gram-Schmidt procedure is completely local, in the sense that it is organized so that each of the constructed functions needs to be orthonormalized only to the nearby functions, across scales, at the appropriate scale.

    PROOF. [Construction of Proposition 14, greedy version]

We start by building K_0 so that {x_{0,k}}_{k∈K_0} is a 2δ-lattice, following [11] (see also [48]). We pick x_{0,0} ∈ S arbitrarily, then we find x_{0,1} ∈ S \ (S ∩ B_{2δ}(x_{0,0})) closest to x_{0,0} and such that ϕ̃_{x_{0,1}} is δ-local (when there is more than one, choose one randomly), and so on: when x_{0,0}, . . . , x_{0,k} ∈ S have been picked, pick x_{0,k+1} ∈ S \ (S ∩ N_{2δ}({x_{0,0}, . . . , x_{0,k}})) closest to N_{2δ}({x_{0,0}, . . . , x_{0,k}}), and such that ϕ̃_{x_{0,k+1}} is δ-local, till no such point can be found. The set {x_{0,k}}_{k∈K_0} is a 2δ-lattice by construction, and the functions ϕ_{0,k} := ϕ̃_{0,k}, k ∈ K_0, are orthogonal because they are δ-local and hence have disjoint compact supports. Of course we normalize them in L² before proceeding. Now we locally orthogonalize all the remaining ϕ̃_k to the selected {ϕ_{0,k}}_{k∈K_0} (this operation is local as well): the resulting set of "residual" functions ˜̃Φ_0 = {˜̃ϕ_k} is (2¹ + 1)δ-local. We can repeat the procedure above on ˜̃Φ_0 by building a (2² + 2)δ-lattice K_1. In the actual numerical implementation, we pick x_{1,k} such that the corresponding "residual function" ˜̃ϕ_{x_{1,k}} has the largest L²-norm among the remaining ones (as in pivoted Gram-Schmidt).

The thesis follows by induction: after having built

Φ_0 ∪ · · · ∪ Φ_{l₁} = {{ϕ_{l,k}}_{k∈K_l}}_{l=0,...,l₁},

we orthogonalize the remaining ϕ̃_k to ⟨Φ_0, . . . , Φ_{l₁}⟩, thus getting a set ˜̃Φ_{l₁}. Each function in ˜̃Φ_{l₁} is (2^{l₁+1} − 1)δ-local: we find a (2^{l₁+2} − 2)δ-lattice K_{l₁+1} and normalize the family {˜̃ϕ_k}_{k∈K_{l₁+1}} of functions with disjoint supports to get {ϕ_{l₁+1,k}}_{k∈K_{l₁+1}}. As before, in the implementation we choose the ˜̃ϕ_{l₁+1,k} that has the maximum norm among the "residual" functions ˜̃Φ_{l₁}.

In the non-greedy version of the algorithm, at each level, rather than choosing the closest scaling function in ˜̃Φ_l with support disjoint from the ones already chosen at that level, we choose the scaling function with largest L² norm among the ones in the corona N_{(2^{l+1}−2)δ, 2(2^{l+1}−2)δ}: any of these is guaranteed to have support disjoint from the previous ones (but not too far away from at least one of them), and choosing the one with largest L² norm is the natural choice of pivoting to maintain numerical stability. We stop the construction when no element ˜̃ϕ_k has L²-norm greater than ε.


With these modifications, this construction inherits all the stability properties of modified Gram-Schmidt with pivoting, at least in the case in which ||ϕ̃_x||₂ is between two close constants, since in this case we are simply rearranging the pivoting based on geometric information, resulting in better packings and localization of the orthogonalization computations, and achieving the promised computational cost.

Remark 21 When the initial system Φ̃ is very ill-conditioned and orthogonality of Φ is crucial, at each level l we need to re-orthogonalize the system built so far: see [49,45] for details and references (and observe that this only about doubles the overall cost).

At least when X is finite, the construction above is a particular, geometry-driven, QR decomposition of the matrix representing the change of basis from Φ̃ to Φ. More precisely, let (Φ̃)_{kj} = ϕ̃_k(x_j), with k ∈ K̃ and j ∈ X (with the identification between j and x_j), and similarly let (Φ)_{kj} = ϕ_k(x_j), with k running through the indices in K_0, . . . , K_L in this order. Assume that Φ is δ-local. Then we have, up to a permutation of the columns,

Φ = (ϕ_{0,0}, . . . , ϕ_{0,|K_0|}, ϕ_{1,0}, . . . , ϕ_{1,|K_1|}, . . . , ϕ_{L,0}, . . . , ϕ_{L,|K_L|})^T

  = ( D_{0,0}   0         . . .     0
      E_{1,0}   D_{1,1}   0         :
      :         .         .         0
      E_{L,0}   E_{L,1}   · · ·     D_{L,L} )  Φ̃   (5.1)

where D_{l,l} is a |K_l| × |K_l| diagonal matrix, and E_{l,i}, which orthogonalizes the residual functions ˜̃Φ_l to the subspace spanned by the first i levels, i.e. ⟨Φ_0, . . . , Φ_i⟩, is of size |K_l| × |K_i| and sparse. In fact, it is important to realize that by construction each row of E_{l,i} contains about

|B_{2^l δ}| / |B_{2^i δ}| ≈ C_X^{l−i}

non-zero entries. Hence in the whole matrix there are about

Σ_{l=0}^{log |X|} Σ_{i=0}^{l} |K_l| C_X^{l−i} ≈ Σ_{l=0}^{log |X|} |K_l| C_X^{l} (1 − C_X^{−(l+1)})/(1 − C_X^{−1}) ≈ Σ_{l=0}^{log |X|} (|X|/C_X^{l}) C_X^{l} ≈ |X| log |X|

non-zero entries. Of these, for large classes of useful operators, the number of entries above precision is actually of the order of |X|.

It is not by chance that the form of the matrix above is very similar to that of wavelet-compressed Calderón-Zygmund operators as in [2]. There the wavelets for the compression of the operator were fixed and "universal"; here they are constructed in arbitrary geometries and adapted to the operator.

Remark 22 The construction in this proposition applies in particular to the fast construction of local orthonormal (multi-)scaling functions in Rⁿ, which we think is of independent interest. As an example, this construction can be applied to spline functions of any order, yielding an orthonormal basis of splines different from the standard one due to Strömberg [50].

Remark 23 The cost of running the above algorithm is O(n log n) times the cost of finding (approximate) r-neighbors, in addition to the cost of any pre-processing that may be necessary to perform these searches fast. This follows from the fact that all the orthogonalization steps are local. Observe that, because of the geometric assumptions on Φ_0 and T, essentially all the information needed regarding nearest neighbors is encoded in the matrix representing T.

    5.3 Modified Gram-Schmidt with mixed L2-Lp pivoting

In this section we propose a second algorithm for computing an orthogonal basis spanning the same subspace as a given δ-local family Φ̃. The main motivation for the algorithm in the previous section was the observation that modified Gram-Schmidt with pivoting can in general generate basis functions with large supports, while it is crucial in our setting to find orthonormal bases with rather small support.

We would like to introduce a term in modified Gram-Schmidt that prefers functions with smaller support to functions with larger support. On the unit ball of L², the functions with concentrated support are those with small L⁰ or, more generally, L^p norm, for some p < 2.

    We can then modify the “modified Gram-Schmidt with pivoting” algorithmof section 5.1 as follows.


1 Let ˜̃Φ = Φ̃, k = 0, λ > 0, p ∈ [0, 2].
2 Let ϕ_{k′} be the element of ˜̃Φ with largest L² norm, say N_k = ||ϕ_{k′}||₂. Among all the elements of ˜̃Φ with L² norm larger than λN_k, pick the one with the smallest L^p norm: let it be ϕ_k. If ||ϕ_k||₂ < ε/√|Φ̃|, stop; otherwise add ϕ_k/||ϕ_k||₂ to Φ.
3 Orthogonalize all the remaining elements of ˜̃Φ to ϕ_k, obtaining a new set ˜̃Φ. Increment k by 1. Go to step 2.

Choosing the element by bounding its $L^2$ norm from below is important for numerical stability, but relaxing the condition of picking the element with largest $L^2$ norm allows us to pick an element with smaller $L^p$ norm, yielding potentially much better localized bases. The parameter $\lambda$ controls this slack, and in practice choosing it between $3/4$ and $1$ gave good results. It is easy to construct examples in which the standard modified Gram-Schmidt with $L^2$ pivoting (which corresponds to $\lambda = 1$) leads to bases with most elements having large support, while our modification with mixed norms yields much better localized bases.
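For concreteness, here is a minimal sketch of this pivoting rule in Python/NumPy; the names `Phi`, `lam`, `p`, and `eps` are ours, and we take $p = 1$ as a typical choice of the localization norm:

```python
import numpy as np

def mgs_mixed_pivot(Phi, lam=0.9, p=1.0, eps=1e-8):
    """Modified Gram-Schmidt on the columns of Phi with mixed L2-Lp pivoting:
    among the remaining columns whose L2 norm is at least lam times the
    current maximum, pick the one with the smallest Lp norm."""
    R = Phi.astype(float).copy()          # residual columns, updated in place
    n_cols = R.shape[1]
    Q, remaining = [], list(range(n_cols))
    thresh = eps / np.sqrt(n_cols)        # stopping threshold eps / sqrt(|Phi~|)
    while remaining:
        l2 = np.array([np.linalg.norm(R[:, j]) for j in remaining])
        if l2.max() < thresh:
            break                         # remaining columns are numerically null
        # candidates: columns within the slack lam of the largest L2 norm
        cand = [j for j, nrm in zip(remaining, l2) if nrm >= lam * l2.max()]
        lp = [np.sum(np.abs(R[:, j]) ** p) ** (1.0 / p) for j in cand]
        k = cand[int(np.argmin(lp))]      # best-localized admissible column
        q = R[:, k] / np.linalg.norm(R[:, k])
        Q.append(q)
        remaining.remove(k)
        for j in remaining:               # orthogonalize the rest against q
            R[:, j] -= np.dot(q, R[:, j]) * q
    return np.column_stack(Q) if Q else np.zeros((Phi.shape[0], 0))
```

Calling `mgs_mixed_pivot(Phi, lam=1.0)` reduces to standard modified Gram-Schmidt with $L^2$ pivoting, so the two variants are easy to compare on the same family.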

Remark 24 Observe that this algorithm does not require knowledge of the nearest neighbors. In the implementation, to achieve maximum speed, the matrix representing $T$ should be in sparse form, and queries of non-zero elements by rows and columns should be fast.

We have not investigated the stability of this algorithm theoretically. We have only observed that it worked very well in the examples tried, with stability comparable to the standard modified Gram-Schmidt with pivoting, and very often much better localization in the resulting basis functions.

    5.4 Rank Revealing QR factorizations

We refer the reader to [51], and references therein, for details regarding the numerical and algorithmic aspects of the rank-revealing factorizations discussed in this section.

Definition 25 A partial QR factorization of a matrix $M \in \operatorname{Mat}(n, n)$,
$$
M\Pi = QR = Q \begin{pmatrix} A_k & B_k \\ 0 & C_k \end{pmatrix}, \tag{5.2}
$$
where $Q$ is orthogonal, $A_k \in \operatorname{Mat}(k, k)$ is upper-triangular with nonnegative diagonal elements, $B_k \in \operatorname{Mat}(k, n-k)$, $C_k \in \operatorname{Mat}(n-k, n-k)$, and $\Pi$ is a permutation matrix, is a strong rank-revealing QR factorization if
$$
\sigma_{\min}(A_k) \ge \frac{\sigma_k(M)}{p_1(k, n)}
\quad\text{and}\quad
\sigma_j(C_k) \le \sigma_{k+j}(M)\, p_1(k, n) \tag{5.3}
$$
and
$$
\left| \left( A_k^{-1} B_k \right)_{ij} \right| \le p_2(k, n) \tag{5.4}
$$
where $p_1(k, n)$ and $p_2(k, n)$ are functions bounded by a low-degree polynomial in $k$ and $n$, and $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the largest and smallest singular values of a matrix $A$.

Modified Gram-Schmidt with pivoting, as described in section 5.1, actually yields a decomposition satisfying (5.3), but with $p_1$ depending exponentially on $n$ [51].

Let $\mathrm{SRRQR}(M, \epsilon)$ be an algorithm that computes a rank-revealing QR factorization of $M$ with $\sigma_{k+1} \le \epsilon$. Gu and Eisenstat present such an algorithm, satisfying (5.3) and (5.4) with
$$
p_1(k, n) = \sqrt{1 + nk(n-k)} \quad\text{and}\quad p_2(k, n) = \sqrt{n}\,.
$$
Moreover, this algorithm requires in general $O(n^3)$ operations, but faster algorithms exploiting the sparsity of the matrices involved can be devised.

We can then proceed in the construction of the Multi-Resolution Analysis described in Theorem 18 as follows. We replace each orthogonalization step $G_j$ by the following two steps. First we apply $\mathrm{SRRQR}(T_j^{2^j}, \epsilon)$ to get a rank-revealing QR factorization of $T^{2^j}$, on the basis $\Phi_j$, which we write as
$$
T_j^{2^j} \Pi_j = Q_j R_j = Q_j \begin{pmatrix} A_{j,k} & B_{j,k} \\ 0 & C_{j,k} \end{pmatrix}. \tag{5.5}
$$
The columns of $Q_j$ span $\tilde V_{j+1}$ up to error $\epsilon \cdot p(k, n)$, where
$$
k = \max\left\{ i : \lambda_i^{2^{j+1}-1} \ge \epsilon \right\} = \dim(\tilde V_{j+1})\,.
$$
In fact, the strong rank-revealing QR decomposition above shows that the first $k$ columns of $T^{2^j} \Pi_j$ are a well-conditioned basis that $\epsilon \cdot p(k, n)$-spans $\tilde V_{j+1}$. Now, this basis is a well-selected subset of the "bump functions" $T^{2^{j+1}-1} \tilde\Phi$, and can be orthogonalized with Proposition 14, with estimates on the supports exactly as claimed in Theorem 18.


5.5 Computation of the Wavelets

In the computation of wavelets as described in section 4.5, numerically one has to make sure that, at every orthogonalization step, the construction of the wavelets is carried out on vectors that are numerically orthogonal to the scaling functions at the previous scale. This can be attained, in a numerically stable way, by repeatedly orthogonalizing the wavelets to the scaling functions. Observe that this orthogonalization is again a local operation, and hence can be computed fast. It is also necessary to guarantee that the construction of wavelets stops exactly at the exhaustion of the wavelet subspace, without "spilling" into the scaling spaces.
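A minimal sketch of this repeated re-orthogonalization (the classical "twice is enough" heuristic), assuming the scaling functions are stored as orthonormal columns of `Phi` and the wavelet candidates as columns of `W`; the names are ours:

```python
import numpy as np

def reorthogonalize(W, Phi, passes=2):
    """Project the wavelet candidates W onto the orthogonal complement of the
    span of the (orthonormal) scaling functions Phi, repeating the projection
    to remove the components reintroduced by round-off."""
    W = W.astype(float).copy()
    for _ in range(passes):
        W -= Phi @ (Phi.T @ W)
    return W
```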

    6 Fast Algorithms

Assume that both (I) and (II) in section 5 are satisfied or, more generally, that each orthonormalization step and change of basis can be computed in time $O(n \log^2 n)$. Then there are fast algorithms for constructing the whole Multi-Resolution Analysis:

Proposition 26 The construction of all the scaling functions and linear transformations $\{G_j\}$ and $\{M_j\}$, for $j = 0, \dots, J$, as described in Theorem 18, can be done in time $O(n \log^2 n)$.

Corollary 27 (The Fast Scaling Function Transform) Let the scaling function transform of a function $f \in L^2(X, \mu)$, $|X| = n$, be the set of all coefficients
$$
\{\langle f, \Phi_j \rangle\}_{j=0,\dots,J}\,.
$$
All these coefficients can be computed in time $O(n \log^2 n)$.

Remark 28 For various interesting classes of operators arising in applications, the computational complexity of the transform is actually $O(n)$ instead of $O(n \log^2 n)$, since the number of entries in the $M_j$ above precision is of the order of $n$ instead of $n \log n$.

    The following is a simple consequence of Corollary 27:

Corollary 29 (The Fast Wavelet Transform) Let the wavelet transform of a function $f \in L^2(X, \mu)$ be the set of all coefficients
$$
\{\langle f, \Psi_j \rangle\}_{j=0,\dots,J}\,.
$$
All these coefficients can be computed in time $O(n \log^2 n)$.


Remark 30 Remark 28 applies here as well.

    6.1 Fast eigenfunction computation

Approximations to the eigenfunctions $\xi_\lambda$ corresponding to $\lambda > \lambda_0$ can be computed efficiently in $V_j$, where $j = \min\{j : \lambda_0^{t_j} > \epsilon\}$, and then the result can be extended to the whole space $X$ by using the extension formula (4.14).

These eigenfunctions can also be used to embed the coarsened space $X_j$ into $\mathbb{R}^n$, along the lines of section 4.2, or can be extended to $X$ and used to embed the whole space $X$.

See [52,35] for results on multigrid techniques applied to fast eigenfunction computations.

    6.2 Fast inversion of “Laplacian-like” operators

Given a fast method like ours to compute the powers of $T$, the "Laplacian" $I - T$ can be inverted (on the orthogonal complement of the eigenspace corresponding to the eigenvalue 1) via the Schultz method [2]: since
$$
(I - T)^{-1} f = \sum_{k=0}^{+\infty} T^k f
$$
and, if $S_K = \sum_{k=0}^{2^K - 1} T^k$, we have
$$
S_{K+1} = S_K + T^{2^K} S_K = \prod_{k=0}^{K} \left( I + T^{2^k} \right).
$$
Since we can apply each $T^{2^k}$ fast to any function $f$, and hence the product $S_{K+1} f$, we can apply $(I - T)^{-1}$ to any function $f$ fast, in time $O(K n \log^2 n)$. The value of $K$ depends on the gap between 1 and the first eigenvalue smaller than 1, which essentially regulates the speed of convergence of a distribution to its asymptotic distribution.

Observe that we never construct the full matrix representing $(I - T)^{-1}$ (which is in general full!), but only keep a compressed multiscale representation of it, which we can use to compute the action of the operator on any function.
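A sketch of this product expansion, with $T$ given as an explicit matrix for illustration; in the actual scheme each dyadic power $T^{2^k}$ would be applied in its compressed multiscale form rather than squared densely:

```python
import numpy as np

def apply_green(T, f, K):
    """Apply (I - T)^{-1} f, approximated by prod_{k=0}^{K} (I + T^{2^k}) f,
    using the Schultz-type recursion S_{K+1} = S_K + T^{2^K} S_K.
    Here T is a plain matrix; only matrix-vector products touch f."""
    g = np.asarray(f, dtype=float)
    A = np.asarray(T, dtype=float)   # holds the current dyadic power T^{2^k}
    for k in range(K + 1):
        g = g + A @ g                # multiply by the factor (I + T^{2^k})
        if k < K:
            A = A @ A                # square to get the next dyadic power
    return g
```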


6.3 Relationship with eigenmap embeddings

In Section 4.2 we discussed metrics induced by a symmetric diffusion semigroup $\{T^t\}$ acting on $L^2(X, \mu)$, and how eigenfunctions of the generator of the semigroup can be used to embed $X$ in Euclidean space. Similar embeddings can be obtained by using diffusion scaling functions and wavelets, since they span subspaces localized around spectral bands. So, for example, the top diffusion scaling functions in $V_j$ approximate well, by the arguments in this Section, the top eigenfunctions. See for example Figure 6.

7 Natural multiresolution on sampled Riemannian manifolds and on graphs

The natural construction of Brownian motion and of the associated Laplace-Beltrami operator for compact Riemannian manifolds can be approximated on a finite number of points which are realizations of a random variable taking values on the manifold according to a certain probability distribution, as in [4,3]. The construction starts by assuming that we have a data set obtained by drawing according to some unknown distribution $p$ whose support is a smooth manifold $\Gamma$. A kernel satisfying some natural mild conditions, for example the standard Gaussian
$$
K(x, y) = e^{-\left( \frac{\|x - y\|}{\delta} \right)^2}
$$

for some scale factor $\delta > 0$, is normalized twice as follows before computing its eigenfunctions. The main observation is that any integration of the kernel against a function on the empirical data set is of the form
$$
\sum_{i} K(x, y_i) f(y_i)
$$
and thus is a Riemann sum associated to the integral
$$
\int_\Gamma K(x, y) f(y)\, p(y)\, dy\,.
$$

Hence to capture only the geometric content of the manifold it is necessary to get rid of the measure $p$ on the data set. This measure can be estimated at scale $\delta$, for example by convolving with some smooth kernel (for example $K$ itself), thus obtaining an estimated probability density $p_\delta$. One then considers the kernel

$$
\tilde K(x, y) = \frac{K(x, y)}{p_\delta(y)}
$$
as the new kernel. This kernel is further normalized so that it becomes averaging on the data set, yielding

$$
K(x, y) = \frac{\tilde K(x, y)}{\sqrt{\int_\Gamma \tilde K(x, y)\, p(y)\, dy}}\,.
$$

It is shown in [3] that, with this normalization, in the limit $|\Gamma| \to +\infty$ and $\delta \to 0$, the kernel thus obtained is the one associated with the Laplace-Beltrami operator on $\Gamma$. Applying our construction to this natural kernel, we obtain the natural multiresolution analysis associated to the Laplace-Beltrami operator on a compact Riemannian manifold. This also leads to compressed representations of the heat flow and of functions of it.
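A sketch of this two-step normalization on a finite point cloud, in NumPy; we use the symmetric form of the final normalization, a common variant of the formula above, and the variable names are ours:

```python
import numpy as np

def beltrami_kernel(X, delta):
    """Two-step normalization of a Gaussian kernel on a point cloud X
    (rows = points): first divide out the estimated sampling density, then
    normalize so that the kernel becomes averaging on the data set."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / delta**2)
    p = K.sum(axis=1)                 # density estimate at scale delta
    Kt = K / p[None, :]               # remove the sampling density: K / p(y)
    v = Kt.sum(axis=1)
    return Kt / np.sqrt(v[:, None] * v[None, :])   # symmetric normalization
```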

On weighted graphs, the canonical random walk is associated with the matrix of transition probabilities
$$
P = D^{-1} W
$$
where $W$ is the symmetric matrix of weights and $D$ is the diagonal matrix defined by $D_{ii} = \sum_j W_{ij}$. $P$ is in general not symmetric, but it is conjugate, via $D^{\frac12}$, to the symmetric matrix $D^{-\frac12} W D^{-\frac12}$, which is a contraction on $L^2$.

Our construction yields a multiresolution on the graph, together with downsampled graphs, adapted to the random walk [13].
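In code, these two normalizations of a symmetric weight matrix are one line each; a sketch with our naming:

```python
import numpy as np

def graph_diffusion(W):
    """From a symmetric weight matrix W, form the random walk P = D^{-1} W
    and its symmetric conjugate A = D^{-1/2} W D^{-1/2}, which has the same
    spectrum as P and is a contraction on L^2."""
    d = W.sum(axis=1)
    P = W / d[:, None]                        # row-stochastic random walk
    A = W / np.sqrt(d[:, None] * d[None, :])  # symmetric conjugate of P
    return P, A
```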

    8 Examples and applications

    8.1 Orthogonal spline multiresolution

Spline multiresolution analyses are well studied in the wavelet literature (see for example [50,44,53–55] and references therein; for multiwavelets see for example [56,57] and references therein) and are used in several applications: image analysis, finite element methods for PDEs, and others.

The classical spline functions of order $L$ on the real line, with knots on the integer grid, are piecewise polynomials of degree $L$ on each interval of the form $[k, k+1]$, $k \in \mathbb{Z}$, and globally of class $C^{L-1}$. The basic spline function of order $L$ is defined by
$$
\hat\varphi(\xi) = \left( \frac{1 - e^{-i\xi}}{i\xi} \right)^{L+1}
$$
so that
$$
\varphi(x) = \left( \chi_{[0,1]} * \cdots * \chi_{[0,1]} \right)(x)\,,
$$
the $(L+1)$-fold convolution of the characteristic function of $[0,1]$ with itself.

The space of splines with knots on the integer grid is spanned by $\{\varphi(\cdot - k)\}_{k \in \mathbb{Z}}$, and in fact this is a Riesz basis. To obtain an orthonormal basis, these functions are usually orthonormalized via a translation-invariant Gram-Schmidt procedure [50,58], yielding the so-called Franklin scaling functions and the corresponding wavelets, which are supported on the whole real line but are exponentially decaying. To preserve the compactness of the support, [44] built biorthogonal splines.

Fig. 3. Multiresolution Analysis on the circle. We consider 256 points on the unit circle, start with $\varphi_{0,k} = \delta_k$ and with the standard diffusion. We plot several scaling functions in each approximation space $V_j$ (panels $V_1$ through $V_{15}$).

Fig. 4. Multiresolution Analysis on the circle: on the left we plot the compressed matrices representing powers of the diffusion operator; on the right we plot the entries of the same matrices which are above working precision.


We can apply our construction to this situation, by letting $X = \mathbb{R}$ (or $\mathbb{Z}$, or perturbed versions of $\mathbb{Z}$!) with the standard distance and measure, and the initial family of scaling functions being the family of splines of order $L$, $\{\varphi_L(\cdot - k)\}_{k \in \mathbb{Z}}$ (restricted to $\mathbb{Z}$ if $X = \mathbb{Z}$). Our Gram-Schmidt orthogonalization with geometric pivoting yields infinitely many different scaling functions, the scaling functions at scale $l$ having support of size comparable to $2^l \cdot \operatorname{diam} \operatorname{supp} \varphi_L$.


Fig. 5. Multiresolution Analysis on the circle. In the same setting as for Figure 3, we compute the multiscale transform of a periodic signal on the circle contaminated by two $\delta$-impulses (top), and of a windowed chirp (bottom). In the first column we plot the projections onto coarser and coarser scaling spaces; in the second column we plot the projections onto the corresponding wavelet subspaces. Computations here were done to 5 digits of precision.

    8.2 A simple homogenization problem

We consider the non-homogeneous heat equation on the circle
$$
\frac{\partial u}{\partial t} = \frac{\partial}{\partial x} \left( c(x) \frac{\partial u}{\partial x} \right) \tag{8.1}
$$

where $c(x)$ is quite non-uniform, and we want to represent the large-scale/large-time behavior of the solution by compressing powers of the operator representing the discretization of the spatial differential operator $\frac{\partial}{\partial x}\left( c(x) \frac{\partial}{\partial x} \right)$. The spatial operator is of course one of the simplest Sturm-Liouville operators in one dimension.


Fig. 6. Non-homogeneous medium with non-constant diffusion coefficient $c(x)$ (plotted top left): the scaling functions for $V_{15}$ (top right). In the middle row: an eigenfunction (the 35th) of the diffusion operator (left), and the same eigenfunction reconstructed by extending the corresponding eigenvector of the compressed $T_{10}$ (right): the $L^2$-error is of order $10^{-5}$. The entries above precision of the matrix $T^8$ compressed on $V_4$ (bottom left). Bottom right: $\{(\varphi_{18,1}(x_i), \varphi_{18,2}(x_i))\}_i$, an approximation to the eigenmap corresponding to the first two eigenfunctions, since $V_{18}$ has dimension 3.



We choose the function $0 < c < 1$ represented in Figure 6. We discretize the spatial differential operator, thus obtaining a matrix $W$. In order to have an operator $T$ with $L^2$-norm 1, we let $D_{ii} = \sum_j (2 - W_{ij})$ and $T = D^{-\frac12} W D^{-\frac12}$, which has the desired properties of contraction, self-adjointness and positivity.
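A sketch of this construction, taking the normalization formulas in the text literally; the nearest-neighbor stencil and the averaging of the conductivity onto edges are our (hypothetical) choices, not specified in the text:

```python
import numpy as np

def circle_diffusion(c):
    """Discretize d/dx (c(x) d/dx) on len(c) equispaced points of the circle:
    W couples periodic nearest neighbors with the conductivity of the
    connecting edge, and T = D^{-1/2} W D^{-1/2} with D_ii = sum_j (2 - W_ij),
    as in the text."""
    n = len(c)
    W = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n                          # periodic nearest neighbor
        W[i, j] = W[j, i] = 0.5 * (c[i] + c[j])  # edge conductivity (our choice)
    d = (2.0 - W).sum(axis=1)                    # D_ii = sum_j (2 - W_ij)
    return W / np.sqrt(d[:, None] * d[None, :])
```

For instance, `circle_diffusion(0.5 + 0.45 * np.cos(np.linspace(0, 2 * np.pi, 256, endpoint=False)))` produces a self-adjoint contraction on a 256-point circle with a conductivity profile strictly between 0 and 1.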

We discretize the interval at 256 equispaced points, as in the previous example; the discretization of the right-hand side of (8.1) is a matrix $T$ which, when properly normalized, can be interpreted as a non-translation-invariant random walk. Our construction yields a multiresolution associated to this operator that is highly nonuniform, with most scaling functions concentrated around the points where the conductivity is highest, for several scales. The dimension of $V_j$ drops quite fast, and $V_8$ is already reduced to two dimensions. In Figure 6 we plot, among other things, the embedding given by the two scaling functions in $V_8$: while it differs from an eigenmap because the coordinates are not scaled by the eigenvalues (which we could cheaply compute, since it is a $2 \times 2$ matrix), it clearly shows the non-uniformity of the heat distance (which is roughly the Euclidean distance in the range of this map): the points with very high conductivity are very tightly clustered together, while all the others are further apart and almost equispaced.

The compressed matrices representing the (dyadic) powers of this operator can be viewed as homogenized versions, at a certain time- and space-dependent scale, of the original operator. The scaling functions at that scale span the domain of the homogenized operator at that scale. Further developments in the applications to homogenization problems will be presented in a future work.

    8.3 Perturbed lattices

Our construction applies naturally to perturbed lattices in any dimension. Wavelets and multiresolution analysis on irregular lattices have been studied and applied by various researchers; see for example [59,60] and references therein. Our construction overcomes the usual difficulties related to local oversampling (too many lattice points accumulating around a single point) and nonuniformities, by automatically computing the local oversampling and locally downsampling and orthogonalizing as required by the problem.

As an example, we consider a set of 500 points randomly drawn from a uniform distribution on the square, and consider the diffusion operator normalized à la Beltrami as described in Section 7. We plot in Figure 7 some of the scaling functions we obtain.


Fig. 7. Example of scaling functions at a coarse level associated with a Beltrami diffusion on randomly distributed points on the unit square. For graphical reasons, we plot a smooth extension of these scaling functions on a uniform grid, obtained by cubic interpolation.

Fig. 8. Some diffusion scaling functions and wavelets at different scales on a dumbbell-shaped manifold sampled at 1400 points.

    8.4 Dumbbell manifold

We consider a dumbbell-shaped manifold, sampled at 1400 points, and the diffusion associated to the (discretized) Laplace-Beltrami operator discussed in Section 7. See Figure 8 for pictures of some scaling functions and wavelets.

    8.5 A noisy example

We consider a data set $X$ consisting of 1200 points in $\mathbb{R}^3$, the union of 400 realizations each of three independent Gaussian random variables $G_1, G_2, G_3$ with values in $\mathbb{R}^3$, with means $(1,0,0)$, $(0,1,0)$, $(0,0,1)$ respectively, and standard deviations all equal to $0.4$. The data set is depicted in Figure 9. We consider the graph $(G, E, W)$ built as follows: the vertex set $G$ is $X$, and two points $x, y \in G$ are connected by an edge of weight $W_{x,y} = e^{-\left(\frac{\|x-y\|}{0.3}\right)^2}$ whenever this weight is greater than $10^{-3}$.


Fig. 9. Top, left to right: the data set described in Example 8.5 with labels for each Gaussian random variable, the embedding given by the top 3 nontrivial eigenfunctions of the graph Laplacian, the values of the scaling function $\varphi_{6,1,1}$. Bottom, left to right: values of the scaling functions $\varphi_{6,1,2}$, $\varphi_{6,1,3}$, and of the wavelet $\psi_{8,5}$.

We let $T$ be the random walk naturally associated to this graph, defined by the row-stochastic matrix of transition probabilities
$$
P = D^{-1} W\,,
$$
where $D_{x,x} = \sum_{y \in G} W_{x,y}$ and $D_{x,y} = 0$ for $x \ne y$. We let the diffusion semigroup be generated by $T = P / \|P\|_{2,2}$, and construct the corresponding multiresolution analysis and wavelets.

We can see in Figure 9 that the eigenfunctions of this operator tell the three Gaussian clouds apart rather well, since the diffusion inside each cloud is rather fast while the diffusion across clouds is much slower. For the same reason, the diffusion scaling functions, as well as the wavelets, tend to be concentrated on each cloud. The top scaling functions in several scaling subspaces approximate very well the characteristic function of each cloud, and the wavelets analyze well oscillating functions concentrated on each cloud (local modes).

In Figure 10 we show how the diffusion scaling functions compress the data. Let $\chi_R, \chi_B, \chi_G$ be the characteristic functions of each Gaussian cloud, and $\chi = \chi_R + \chi_B + \chi_G$. In the figure we represent, for several $j$'s, $X_j = M_j \cdots M_1 M_0 X$, which is a compressed version of $X$ after diffusing until time $2^{j+1} - 1$. We also compute the multiscale expansion of $\chi$, and use colors to represent the projection onto the $j$-th approximation subspace, evaluated on $X_j$.


Fig. 10. Left to right, we plot $X_j = M_j \cdots M_1 M_0 X$, for $j = 1, 3, 5$ respectively. The color is proportional to the projection of $\chi$ onto $V_j$ evaluated at each point of $X_j$. The data set is being compressed, since we have $|X| = 1200$, $|X_1| = 1200$, $|X_3| = 186$, $|X_5| = 39$, and the images of the centers of the classes are preserved as sampling points at each scale.

    9 Extending the scaling functions outside the set

Suppose there exists a metric measure space $(\overline{X}, \bar d, \bar\mu)$ such that $X \subset \overline{X}$, and suppose that the restriction operator
$$
R_{\overline{X} X} : L^2(\overline{X}, \bar\mu) \to L^2(X, \mu)\,, \qquad f \mapsto f|_X
$$
is bounded. Suppose we have a sequence of subspaces $\{E_n\}_{n \ge 0}$, with $E_n \to L^2(\overline{X}, \bar\mu)$ as $n \to \infty$, and a singular value decomposition for $R_{\overline{X} X}|_{E_n}$ that we write as
$$
R_{\overline{X} X}|_{E_n} f = \sum_{i \ge 1} \alpha_{n,i}\, \langle f, \tilde\theta_{n,i} \rangle\, \theta_{n,i}
$$
where $\alpha_{n,1} \ge \alpha_{n,2} \ge \dots \ge 0$.

Given a scaling function $\varphi_{j,k}$, we can find the smallest $n$ such that
$$
\varphi_{j,k} \in P E_n \tag{9.1}
$$
where in practice this membership is to be intended in the sense of the numerical range of the projection involved. Then, writing
$$
\varphi_{j,k} = \sum_{i \ge 1} a_i\, \theta_{n,i}\,, \tag{9.2}
$$
we define the extension of $\varphi_{j,k}$ to be
$$
\bar\varphi_{j,k} = \sum_{i \ge 1} \frac{a_i}{\alpha_{n,i}}\, \tilde\theta_{n,i}\,. \tag{9.3}
$$
This makes numerical sense only when the sum is restricted to those indices $i$ for which $\alpha_{n,i}$ is well separated from 0, which is equivalent to the condition that (9.1) holds in the numerical sense. See also [24,61].


Observe that since the $n$ for which (9.1) holds depends on $\varphi_{j,k}$, each scaling function will be extended in a different way. If we think of the $E_n$ as spaces of functions on $\overline{X}$ of increasing complexity as $n$ grows, we try to extend each scaling function to a function on $\overline{X}$ with the minimum complexity required, the requirement being (9.1).
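As a finite-dimensional illustration, suppose $E_n$ is spanned by the columns of a matrix `Theta_bar` of functions sampled on a grid discretizing $\overline{X}$, and $X$ corresponds to a subset of the rows. The extension (9.2)-(9.3) then becomes an SVD-based least-squares extension; the names and the truncation threshold below are ours:

```python
import numpy as np

def extend(phi, Theta_bar, X_idx, alpha_min=1e-6):
    """Extend phi (its values on the sample, i.e. on rows X_idx of the grid)
    using the SVD of the restriction of a basis of E_n: columns of Theta_bar
    are functions on the large space.  Only singular directions with alpha
    well separated from 0 are used, as in (9.3)."""
    R = Theta_bar[X_idx, :]                  # restriction operator on E_n
    U, alpha, Vt = np.linalg.svd(R, full_matrices=False)
    keep = alpha > alpha_min
    a = U[:, keep].T @ phi                   # coefficients on theta_{n,i}
    # extension: sum_i (a_i / alpha_i) * theta-tilde_{n,i}, in grid coordinates
    return Theta_bar @ (Vt[keep, :].T @ (a / alpha[keep]))
```

Restricting the result back to the rows `X_idx` reproduces `phi` up to the truncation error, which is the numerical sense in which (9.1) is required to hold.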

Example 31 If $X$ is a subset of $\mathbb{R}^n$, then we can, for example [4,3], proceed in any of the following ways.

(i) Let
$$
BL_c = \left\{ f \in L^2(\mathbb{R}^n) : \operatorname{supp} \hat f \subseteq B_c(0) \right\} = \left\langle \left\{ e^{i \langle \xi, \cdot \rangle} : |\xi| \le c \right\} \right\rangle,
$$
the space of band-limited functions with band $c$, and then let $E_n = BL_{2^n}$. The $\{\tilde\theta_{n,i}\}_i$ in this case are the generalized prolate spheroidal functions of [62,63], called geometric harmonics in [4,3].

(ii) Let $E_n$ be one of the approximation spaces $V_n$ of some classical wavelet multiresolution analysis in $\mathbb{R}^n$, like the one associated to Meyer wavelets.

(iii) Let
$$
BG_c = \left\langle \left\{ e^{-\frac{\|x - y\|}{c}} \right\}_y \right\rangle,
$$
the bump algebra at scale $1/c$, and furthermore let $E_n = BG_{2^{-n}}$. While in this case it is not true that $E_n \subset E_{n+1}$, it is true that for every $\epsilon > 0$ there exists an integer $p = p(\epsilon) > 0$ such that $E_{pn}$ is $\epsilon$-dense in $E_n$, for every $n \ge 0$.

Once all the scaling functions have been extended, any function $f \in L^2(X)$ can be extended to a function $\bar f \in L^2(\overline{X})$ by expanding $f$ onto the scaling functions and then extending each term of the expansion. How far the support of $\bar f$ extends outside of $X \subset \overline{X}$ depends on properties of $f$, on how $X$ is embedded in $\overline{X}$, and on the function spaces $E_n$ we use to extend.

When the diffusion operator is induced on the data set by a kernel that is defined outside the data set, it is also possible to use the Nyström extension [64–66].

For a survey of extension techniques in the context of nonlinear dimension reduction, see [29] and references therein, as well as [4,3].

    Example 32 We illustrate the construction above by extending one of