a nonlinear approach to dimension reduction robert krauthgamer weizmann institute of science joint...
TRANSCRIPT
A Nonlinear Approach to Dimension Reduction
Robert Krauthgamer
Weizmann Institute of Science
Joint work with Lee-Ad Gottlieb
A Nonlinear Approach to Dimension Reduction 2
Data As High-Dimensional Vectors Data is often represented by vectors in Rd
Image color or intensity histogram Document word frequency
A typical goal – Nearest Neighbor Search: Preprocess data, so that given a query vector, quickly find closest
vector in data set. Common in various data analysis tasks – classification, learning,
clustering.
A Nonlinear Approach to Dimension Reduction 3
Curse of Dimensionality [Bellman’61] Cost of maintaining data is exponential in dimension This observation extends to many useful operations
Nearest Neighbor Search [Clarkson’94]
Dimension reduction = represent high-dimensional data in a low-dimensional space
Map given vectors into a low-dimensional space, while preserving most of the data
Goal: Trade-off accuracy for computational efficiency Common interpretation: preserve pairwise distances
A Nonlinear Approach to Dimension Reduction 4
The JL Lemma
Can be realized by a simple linear transformation A random k£d matrix works – entries from {-1,0,1} [Achlioptas’01] or
Gaussian [Gupta-Dasgupta’98,Indyk-Motwani’98]
Applications in a host of problems in computational geometry
Can we do better? A nearly-matching lower bound was given by [Alon’03] But it’s existential…
Theorem [Johnson-Lindenstrauss, 1984]:
For every n-point set X½Rm and 0<<1, there is a map :XRk, for k=O(-2log n), that preserves all distances within 1+:
||x-y||2 < ||(x)-(y)||2 < (1+) ||x-y||2, 8x,y2X.
A Nonlinear Approach to Dimension Reduction 5
Doubling Dimension Definition: Ball B(x,r) = all points within distance r from x.
The doubling constant (of a metric M) is the minimum value ¸ such that every ball can be covered by ¸ balls of half the radius First used by [Assouad’83], algorithmically by [Clarkson’97] The doubling dimension is dim(M)=log ¸(M) [Gupta-K.-Lee’03]
Applications: Approximate nearest neighbor search [Clarkson’97,
K.-Lee’04,…, Cole-Gottlieb’06, Indyk-Naor’06,…] Spanners [Talwar’04,…,Gottlieb-Roditty’08] Compact Routing [Chan-Gupta-Maggs-Zhou’05,…] Network Triangulation [Kleinberg-Slivkins-Wexler’04,…] Distance oracles [HarPeled-Mendel’06] Embeddings [Gupta-K.-Lee’03, K.-Lee-Mendel-Naor’04,
Abraham-Bartal-Neiman’08] Here ≤7.
A Nonlinear Approach to Dimension Reduction 6
A Stronger Version of JL?
The near-tight lower bound of [Alon’03] is existential Holds for X=uniform metric, where dim(X)=log n
Open: Extend to spaces with doubling dimension¿log n? Open: JL-like embedding into dimension k=O(dim(X))?
Even constant distortion would be interesting [Lang-Plaut’01,Gupta-K.-Lee’03]: Cannot be attained by linear transformations [Indyk-Naor’06]
Example [Kahane’81, Talagrand’92]:
xj = (1,1,…,1,0,…,0) 2 Rn (Wilson’s helix).
I.e. ||xi-xj|| = |i-j|1/2.
Theorem [Johnson-Lindenstrauss, 1984]:
Every n-point set X½l2 and 0<<1, has a linear embedding
:Xl2k, for k=O(-2log n), such that for all x,y2X,
||x-y||2 < ||(x)- (y)||2 < (1+) ||x-y||2.
distortion = 1+².
We present two partial resolutions, using Õ(dim2(X)) dimensions:
1. Distortion 1+ for a single scale, i.e. pairs where ||x-y||2[r,r].
2. Global embedding of the snowflake metric, ||x-y||½.
2’. Conjecture correct whenever ||x-y||2 is a Euclidean metric.
A Nonlinear Approach to Dimension Reduction 7
I. Embedding for a Single Scale Theorem 1. For every finite subset X½l2, and all 0<<1, r>0,
there is embedding f:Xl2k for k=Õ(log(1/)(dim X)2), satisfying
1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y2X
2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ (||x-y||) whenever ||x-y||2 [r, r]
3. Boundedness: ||f(x)|| ≤ r for all x2X
Compared to open question: Bi-Lipschitz only at one scale (weaker) But achieves distortion = absolute constant (stronger)
This talk: illustrate the proof for constant distortion The 1+ distortion is later attained for distances 2[’r, r] Overall approach: divide and conquer
A Nonlinear Approach to Dimension Reduction 8
Step 1: Net Extraction Extract from the point set X a r-net
Net properties: r-Covering r-Packing
Packing ) a ball of radius s contains O((s/r)dim(X)) net points
We shall build a good embedding for net points Why can we ignore non-net points?
Covering radius: r
Packing distance: r
A Nonlinear Approach to Dimension Reduction 9
Step 1: Net Extraction and Extension Recall: ||f||Lip = min {L¸0: ||f(x)-f(y)|| · L||x-y|| for all x,y}
Lipschitz Extension Theorem [Kirszbraun’34]: For every X½l2, every map f:Xl2
k can be extended to f’:l2l2k, such that
||f’||Lip · ||f||Lip.
Therefore, a good embedding just for the net points suffices Smaller net resolution less distortion for the non-net points
f
¸9r¸3r
f’ ¸r
A Nonlinear Approach to Dimension Reduction 10
Step 2: Padded Decomposition Partition the space probabilistically into clusters
Properties [Gupta-K.-Lee’03, Abraham-Bartal-Neiman’08]: Cluster diameter: bounded by dim(X)¢r.
Size: By the doubling property, bounded by (dim(X)/)dim(X) Padding: Each point is r-padded with probability > 9/10 Support: O(dim(X)) partitions
≤ dim(X)¢r
r-padded
not r-padded
A Nonlinear Approach to Dimension Reduction 11
Step 3: JL on Individual Clusters For each partition, consider each individual cluster
Reduce dimension using the JL lemma Constant distortion Target dimension (logarithmic in size): O(log((dim(X)/)dim(X)) = Õ(dim(X))
Then translate some point to the origin
JL
A Nonlinear Approach to Dimension Reduction 12
The story so far… To review
Step 1: Extract net points Step 2: Build family of partitions Step 3: For each partition, apply in each cluster JL and translate to origin
Embedding guarantees for
a single partition Intracluster distance: Constant distortion Intercluster distance:
Min distance: 0 Max distance: dim(X)¢r Not good enough Let’s backtrack…
A Nonlinear Approach to Dimension Reduction 13
The story so far… To review
Step 1: Extract net points Step 2: Build family of partitions Step 3: For each partition, apply in each cluster a Gaussian transform Step 4: For each partition, apply in each cluster JL and translate to origin
Embedding guarantees for a single partition Intracluster distance: Constant distortion Intercluster distance:
Min distance: 0 Max distance: dim(X)¢r
Not good enough when ||x-y||≈r Let’s backtrack…
A Nonlinear Approach to Dimension Reduction 14
Step 3: Gaussian Transform For each partition, apply within each cluster the Gaussian
transform (kernel) to distances [Schoenberg’38] G(t) = (1-e-t2)1/2
“Adaptation” to scale r:
Gr(t) = r(1-e-t2/r2)1/2
A map f:l2 l2 such that
||f(x)-f(y)|| = Gr(||x-y||) Threshold: Cluster diameter is at most r (Instead of dim(X)¢r) Distortion: Small distortion of distances in relevant range
Transform can increase dimension… but JL is the next step
A Nonlinear Approach to Dimension Reduction 15
Step 4: JL on Individual Clusters Steps 3 & 4:
New embedding guarantees Intracluster: Constant distortion Intercluster:
Min distance: 0 Max distance: r (instead of dim(X)¢r)
Caveat: Still not good enough when ||x-y||≈r Also smooth the map near cluster boundaries
JLGaussian
smaller diameter smaller dimension
A Nonlinear Approach to Dimension Reduction 16
Step 5: Glue Partitions We have an embedding for each partition
For padded points, the guarantees are perfect For non-padded points, the guarantees are weak
“Glue” together embeddings for different partitions Concatenate all dim(X) embeddings (and scale down)
Non-padded case occurs 1/10 of the time, so it gets “averaged away”
||F(x)-F(y)||2 = ||f1(x)-f1(y)||2 + … + ||fm(x)-fm(y)||2 ¼ m¢||x-y||2
Final dimension = Õ(dim2(X)): Number of partitions: O(dim(X)) Dimensions for partitioning: Õ(dim(X))
f1(x) = (1,7), f2(x) = (2,8), f3(x) = (4,9)
) F(x) = f1(x) f2(x) f3(x) = (1,7,2,8,4,9)
A Nonlinear Approach to Dimension Reduction 17
Kirszbraun’s theorem extends embedding to non-net points without increasing dimension
Step 6: Kirszbraun Extension Theorem
Embedding
Embedding + K.
A Nonlinear Approach to Dimension Reduction 18
Single Scale Embedding – Review Steps:
Net extraction Padded Decomposition Gaussian Transform JL Glue partitions Extension theorem
Theorem 1: Every finite X½l2 admits embedding f:Xl2
k for k=Õ((dim X)2), such that
1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y2X
2. Bi-Lipschitz at scale r: ||f(x)-f(y)|| ≥ (||x-y||) whenever ||x-y||2 [r, r]
3. Boundedness: ||f(x)|| ≤ r for all x2X
A Nonlinear Approach to Dimension Reduction 19
Single Scale Embedding – Strengthened Steps:
Net extraction nets Padded Decomposition Padding probability 1-. Gaussian Transform JL Already 1+ distortion Glue partitions Higher percentage of padded points Extension theorem
Theorem 1: Every finite X½l2 admits embedding f:Xl2
k for k=Õ((dim X)2), such that
1. Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| for all x,y2X
2. Gaussian at scale r: ||f(x)-f(y)||=(1±)Gr(||x-y||) whenever ||x-y||2 [r, r]
3. Boundedness: ||f(x)|| ≤ r for all x2X
A Nonlinear Approach to Dimension Reduction 20
II. Snowflake Embedding Theorem 2. For all 0<<1, every finite subset X½l2 admits an
embedding F:Xl2k, for k=Õ(-4(dim X)2), with distortion 1+ for
the snowflake metric ||x-y||½.
We’ll illustrate the construction for constant distortion. The snowflake technique is due to [Assouad’83] We give the first implementation achieving distortion 1+.
A Nonlinear Approach to Dimension Reduction 21
II. Snowflake Embedding Theorem 2. For every 0<<1 and finite subset X½l2 there is an
embedding F:Xl2k of the snowflake metric ||x-y||½ achieving
dimension k=Õ(-4(dim X)2) and distortion 1+, i.e.
Compared to open question: We embed the snowflake metric (weaker) But achieve distortion 1+ (stronger)
We generalize [Kahane’81, Talagrand’92] who study Euclidean embedding of Wilson’s helix (real line w/distances |x-y|1/2)
We’ll illustrate the construction for constant distortion The snowflake technique is due to [Assouad’83] We give first implementation that achieves distortion 1+.
A Nonlinear Approach to Dimension Reduction 22
Assouad’s Technique Basic idea.
Consider single scale embeddings for all r=2i Fix points x,y2X, and suppose ||x-y||≈s
r = 16s r = 8s r = 4s r = 2s r = s r = s/2 r = s/4 r = s/8 r = s/16
x y
Lipschitz: ||f(x)-f(y)|| ≤ ||x-y|| = s
Gaussian: ||f(x)-f(y)||=(1±)Gr(||x-y||)
Boundedness: ||f(x)|| ≤ r
A Nonlinear Approach to Dimension Reduction 23
Assouad’s Technique Now scale down each embedding by r½ (snowflake)
r = 16s s s½/4 r = 8s s s½/8½ r = 4s s s½/2 r = 2s s s½/2½ r = s s s½
r = s/2 s/2 s½/2½ r = s/4 s/4 s½/2 r = s/8 s/8 s½/8½ r = s/16 s/16 s½/4
||f(x)-f(y)||/r½ ||f(x)-f(y)||
Combine these embeddings by addition (and staggering)
A Nonlinear Approach to Dimension Reduction 24
Snowflake Embedding – Review Steps:
Compute single scale embeddings for all r=2i Scale-down embedding for r by r½
Combine embeddings by addition (with some staggering)
By taking more refined scales (powers of 1+ instead of 2) and further staggering, can achieve distortion 1+ for the snowflake
Theorem 2. For all 0<<1, every subset X½l2 embeds into l2k for
k=Õ(-4(dim X)2), with distortion 1+ to the snowflake
A Nonlinear Approach to Dimension Reduction 25
Conclusion Gave two (1+)-distortion low-dimension embeddings for
doubling l2-spaces Single scale Snowflake
This framework can be extended to l1 and l∞. Some obstacles: Dimension reduction: Can’t use JL Lipschitz extension: Can’t use Kirszbraun Threshold: Can’t use Gaussian transform
Many of the steps in the single-scale embedding are nonlinear, although most “localities” are mapped (near) linearly Explain empirical success (e.g. Locally Linear Embeddings)?
Applications? Clustering is one potential area …