Sketching, Sampling and other Sublinear Algorithms: Euclidean space: dimension reduction and NNS
Alex Andoni (MSR SVC)


TRANSCRIPT

Page 1

Sketching, Sampling and other Sublinear Algorithms:

Euclidean space: dimension reduction and NNS

Alex Andoni (MSR SVC)

Page 2

A Sketching Problem

Sketching: a map f from objects to short bit-strings, such that given f(x) and f(y) we should be able to deduce whether x and y are “similar”.

Why? To reduce the space and time needed to compute similarity.

(Figure: the sentences “To be or not to be” and “To sketch or not to sketch” are each mapped by f to a short sketch, 010110 and 010101, and we ask “similar?” from the sketches alone.)
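As a toy illustration of this picture, here is a minimal sketching example in Python: it encodes the two sentences as bag-of-words vectors over a tiny hand-picked vocabulary and sketches them with random-hyperplane (SimHash-style) bit signatures. The vocabulary, the 6-bit sketch length, and the choice of SimHash are illustrative assumptions, not something specified on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def bag_of_words(text, vocab):
    """Encode a string as a word-count vector over a fixed vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def simhash(x, hyperplanes):
    """Sketch a vector as a short bit-string: signs of random projections."""
    return (hyperplanes @ x >= 0).astype(int)

vocab = ["to", "be", "or", "not", "sketch"]
hyperplanes = rng.standard_normal((6, len(vocab)))   # 6-bit sketches

a = bag_of_words("To be or not to be", vocab)
b = bag_of_words("To sketch or not to sketch", vocab)

sa, sb = simhash(a, hyperplanes), simhash(b, hyperplanes)
print(sa, sb, "Hamming distance:", int(np.sum(sa != sb)))
```

Similar sentences get mostly agreeing bits, so the (short) sketches alone suffice to judge similarity.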

Page 3

Sketch from LSH

LSH often has the property that the collision probability Pr[g(p) = g(q)] is a function of the distance dist(p, q).

Sketching from LSH: estimate Pr[g(p) = g(q)] by the fraction of collisions between k independent hash values of p and of q; k controls the variance of the estimate.

(Plot: the collision probability Pr[g(p) = g(q)] starts at 1 and decreases as dist(p, q) grows.)

[Broder’97]: for the Jaccard coefficient
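A hedged sketch of the [Broder’97]-style idea for the Jaccard coefficient: take k independent min-hashes of each set (random permutations simulated here by salted hashes) and estimate similarity by the fraction of collisions; larger k means lower variance. The hash construction and the value k = 200 are illustrative choices.

```python
import random

def minhash_sketch(s, k, seed=0):
    """k independent min-hashes of a set: for each salted hash (a stand-in
    for a random permutation of the universe), keep the minimum hashed element."""
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(64) for _ in range(k)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_jaccard(sk_a, sk_b):
    """Fraction of collisions between the two sketches; for ideal random
    permutations its expectation is |A ∩ B| / |A ∪ B|, and k controls the variance."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

A = set("to be or not to be".split())
B = set("to sketch or not to sketch".split())

k = 200
est = estimate_jaccard(minhash_sketch(A, k), minhash_sketch(B, k))
true = len(A & B) / len(A | B)
print(f"estimated Jaccard ≈ {est:.2f}, exact = {true:.2f}")
```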

Page 4

General Theory: Embeddings

The above map is an embedding.

General motivation: given a distance (metric) d, solve a computational problem P under d.

Example metrics d:
Euclidean distance (ℓ2)
Hamming distance
Edit distance between two strings
Earth-Mover (transportation) Distance

Example problems P:
Compute the distance between two points
Diameter / closest pair of a set S
Clustering, MST, etc.
Nearest Neighbor Search

The map f reduces <P under a hard metric> to <P under a simpler metric>.

Page 5

Embeddings: Landscape

Definition: an embedding is a map f of a metric (X, d_X) into a host metric (Y, d_Y) such that for any x, y in X:

  d_X(x, y) ≤ d_Y(f(x), f(y)) ≤ c · d_X(x, y),

where c is the distortion (approximation) of the embedding f.

Embeddings come in all shapes and colors:
Source/host spaces
Distortion
Can be randomized: the guarantee holds with some probability
Time to compute the map

Types of embeddings:
From a norm into the same norm but of lower dimension (dimension reduction)
From non-norms (edit distance, Earth-Mover Distance) into a norm (ℓ1)
From a given finite metric (shortest path on a planar graph) into a norm (ℓ1)
Into not a metric but a computational procedure: sketches

Page 6

Dimension Reduction

Johnson-Lindenstrauss Lemma: there is a linear map f : R^d → R^k, with k = O(ε⁻²·log(1/δ)), that preserves the distance between two fixed vectors up to distortion 1 ± ε with probability 1 − δ.

Preserves distances among n points for k = O(ε⁻²·log n).

Motivation, e.g.: the diameter of a pointset of n points in d-dimensional Euclidean space.
Trivially: O(n²·d) time.
Using the lemma: project the points and compute distances in k dimensions, O((n² + n·d)·ε⁻²·log n) time for a 1 + ε approximation.

MANY applications: nearest neighbor search, streaming, pattern matching, approximation algorithms (clustering)…
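A rough numerical illustration of the diameter example (the point set, ε, and the constant in k = O(ε⁻²·log n) below are arbitrary choices, not tuned):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 500, 5000, 0.2
k = int(np.ceil(4 * np.log(n) / eps**2))      # k = O(eps^-2 log n) dimensions

X = rng.standard_normal((n, d))               # n points in R^d
G = rng.standard_normal((d, k)) / np.sqrt(k)  # JL map: scaled Gaussian matrix
Y = X @ G                                     # projected points, n x k

def diameter(P):
    """Max pairwise Euclidean distance, computed naively in O(n^2 * dim)."""
    sq = np.sum(P**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * P @ P.T
    return np.sqrt(np.maximum(d2, 0).max())

print("diameter in R^d:        ", diameter(X))
print("diameter after projection:", diameter(Y))   # close, roughly a 1±eps factor
```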

Page 7

Main Intuition

The map can simply be a projection onto a random subspace of dimension k.

Page 8

1D Embedding

How about one dimension (k = 1)?

Map f(x) = ⟨g, x⟩ = Σ_i g_i·x_i, where the g_i are iid normal (Gaussian) random variables.

Why Gaussian? Stability property: g_1·x_1 + g_2·x_2 + … + g_d·x_d is distributed as ‖x‖·g, where g is also Gaussian.

Equivalently: the vector (g_1, …, g_d) is centrally distributed, i.e., it has a random direction, and the projection of x onto a random direction depends only on the length of x.

(Figure: the standard Gaussian pdf (1/√(2π))·e^(−x²/2), with E[g] = 0 and E[g²] = 1.)

Page 9

1D Embedding (continued)

Map f(x) = ⟨g, x⟩.

Linear: for any x, y, f(x) − f(y) = f(x − y), so it suffices to analyze f on a single vector z = x − y.

Want: |f(z)| ≈ ‖z‖.

Claim: for any z, we have
Expectation: E[f(z)²] = ‖z‖².
Standard deviation of f(z)²: √2·‖z‖².

Proof: it is enough to prove the claim for ‖z‖ = 1, since f is linear.
Expectation: E[f(z)²] = E[(Σ_i g_i·z_i)²] = E[Σ_i g_i²·z_i²] + E[Σ_{i≠j} g_i·g_j·z_i·z_j] = Σ_i z_i² = 1.
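A quick Monte Carlo check of the claim, with an arbitrary test vector and an illustrative sample count:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 50
z = rng.standard_normal(d)          # arbitrary fixed vector
z /= np.linalg.norm(z)              # normalize so that ||z||^2 = 1

# f(z) = <g, z> for iid standard Gaussian g; sample it many times.
samples = rng.standard_normal((100_000, d)) @ z

print("E[f(z)^2]   ≈", np.mean(samples**2))   # should be close to ||z||^2 = 1
print("Var[f(z)^2] ≈", np.var(samples**2))    # ≈ 2*||z||^4 = 2 for Gaussian g
```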

Page 10

Full Dimension Reduction

Just repeat the 1D embedding k times!

f(x) = (1/√k)·G·x, where G is a k × d matrix of iid Gaussian random variables.

Want to prove: ‖f(x) − f(y)‖ = (1 ± ε)·‖x − y‖ with probability 1 − δ.

Since f is linear, it is OK to prove this for a fixed unit vector z = x − y.
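A minimal sketch of this map in code, assuming the 1/√k scaling above (the dimensions d, k are illustrative):

```python
import numpy as np

def jl_map(d, k, rng):
    """Return f(x) = (1/sqrt(k)) * G x, with G a k x d matrix of iid Gaussians."""
    G = rng.standard_normal((k, d))
    return lambda x: G @ x / np.sqrt(k)

rng = np.random.default_rng(3)
d, k = 2000, 200
f = jl_map(d, k, rng)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print("||x - y||       =", np.linalg.norm(x - y))
print("||f(x) - f(y)|| =", np.linalg.norm(f(x) - f(y)))   # equal up to 1±eps w.h.p.
```

Because f is linear, f(x) − f(y) = f(x − y), so checking one pair is the same as checking the single vector z = x − y, which is exactly what the concentration argument on the next slide does.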

Page 11

Concentration

‖f(z)‖² is distributed as (1/k)·(g_1² + g_2² + … + g_k²), where each g_i is distributed as a standard Gaussian.

The sum g_1² + … + g_k² is called the chi-squared distribution with k degrees of freedom.

Fact: chi-squared is very well concentrated: g_1² + … + g_k² is equal to k·(1 ± ε) with probability 1 − e^(−Ω(ε²·k)).

Akin to the central limit theorem.
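An empirical look at this concentration (the values of k, the deviation threshold, and the number of trials are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 10_000

for k in (10, 100, 1000):
    # Chi-squared with k degrees of freedom: sum of k squared standard Gaussians.
    chi2 = np.sum(rng.standard_normal((trials, k)) ** 2, axis=1)
    rel_dev = np.abs(chi2 / k - 1)          # relative deviation from the mean k
    print(f"k={k:5d}  fraction of trials deviating by more than 20%: "
          f"{np.mean(rel_dev > 0.2):.4f}")
```

As k grows, the fraction of large deviations drops off exponentially, matching the e^(−Ω(ε²·k)) bound.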

Page 12

Dimension Reduction: Wrap-up

‖f(x) − f(y)‖ = (1 ± ε)·‖x − y‖ with high probability.

Extra:
Linear: can update f(x) as x changes.
Can use ±1 entries instead of Gaussians [AMS’96, Ach’01, TZ’04…].
Fast JL: can compute f(x) faster than the naive O(d·k) time [AC’06, AL’07’09, DKS’10, KN’10’12…].
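For instance, a sketch of the ±1 variant: the same map as before, but with random sign (Rademacher) entries in place of Gaussians, in the spirit of [Ach’01]; the parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 2000, 200

# Random +/-1 ("Rademacher") entries instead of Gaussians, same 1/sqrt(k) scaling.
A = rng.choice([-1.0, 1.0], size=(k, d))
f = lambda x: A @ x / np.sqrt(k)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.linalg.norm(x - y), np.linalg.norm(f(x) - f(y)))   # still close w.h.p.
```

Sign matrices avoid generating Gaussians and are friendlier to streaming updates, which is part of the appeal of [AMS’96, Ach’01].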

Page 13

NNS for Euclidean Space

Can use dimensionality reduction to get LSH for ℓ2.

LSH function h(p): pick a random line ℓ, project the point p onto ℓ, and quantize:

  h(p) = ⌊(⟨a, p⟩ + b) / w⌋

where a is a random Gaussian vector, b is random in [0, w), and w is a parameter (e.g., 4).

[Datar-Immorlica-Indyk-Mirrokni’04]

(Figure: the point p projected onto the random line ℓ, which is divided into buckets of width w.)
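A sketch of one such hash function, following the description above: project onto a random Gaussian direction a, add a random offset b, and quantize with bucket width w. The specific dimensions and test points are illustrative.

```python
import numpy as np

def make_lsh(d, w, rng):
    """One LSH function for Euclidean space in the [DIIM'04] style:
    project onto a random Gaussian line, shift by a random offset, quantize."""
    a = rng.standard_normal(d)        # random Gaussian direction
    b = rng.uniform(0, w)             # random offset in [0, w)
    return lambda p: int(np.floor((a @ p + b) / w))

rng = np.random.default_rng(6)
d, w = 100, 4.0
h = make_lsh(d, w, rng)

p = rng.standard_normal(d)
q = p + 0.1 * rng.standard_normal(d)     # a nearby point
far = 10 * rng.standard_normal(d)        # a far-away point
print(h(p), h(q), h(far))                # nearby points collide much more often
```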

Page 14

Near-Optimal LSH [A-Indyk’06]

Regular grid → grid of balls: p can hit empty space, so take more such grids until p is in a ball.

Need (too) many grids of balls, so start by projecting into dimension t (p ∈ R^t).

Analysis gives the choice of the reduced dimension t: a tradeoff between the number of hash tables, which grows as a power of n, and the time to hash, t^O(t).

Total query time: d·n^(1/c² + o(1)).

(Figure: a 2D grid of balls covering the projected point p.)

Page 15

Open question:

A more practical variant of the above hashing? Design a space partitioning of R^t that is
efficient: point location in poly(t) time
qualitative: regions are “sphere-like”, i.e., short needles are cut much less often than long ones:

  Pr[needle of length 1 is not cut] ≥ Pr[needle of length c is not cut]^(1/c²)

Page 16

Time-Space Trade-offs

Space     Time      Comment                   Reference
medium    medium                              [IM’98], [DIIM’04, AI’06]
low       high                                [Ind’01, Pan’06]
high      low       one hash table lookup!    [KOR’98, IM’98, Pan’06]

Lower bounds:
n^o(1/ε²) space requires ω(1) memory lookups   [AIP’06]
n^(1+o(1/c²)) space requires ω(1) memory lookups   [PTW’08, PTW’10]

Page 17

NNS beyond LSH

Data-dependent partitions…

Practice: trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees… often with no guarantees.

Theory: can improve standard LSH by random data-dependent space partitions [A-Indyk-Nguyen-Razenshteyn’??]
A tree-based approach to the max-norm (ℓ∞).

Page 18

Finale

Dimension Reduction in Euclidean space: a random projection preserves distances, and only O(ε⁻²·log n) dimensions are needed to preserve distances among n points!

NNS for Euclidean space:
Random projections give LSH.
Even better with ball partitioning.
Or better with cool lattices?