Sketching, Sampling and other Sublinear Algorithms: Nearest Neighbor Search
Alex Andoni (MSR SVC)


DESCRIPTION

We will learn about modern algorithmic techniques for handling large datasets, often by using imprecise but concise representations of the data, such as a sketch or a sample. The lectures will cluster around three themes.

Nearest Neighbor Search (similarity search): the general problem is, given a set of objects (e.g., images), to construct a data structure so that later, given a query object, one can efficiently find the most similar object in the database.

Streaming framework: we are required to solve a certain problem on a large collection of items that we stream through once, i.e., the algorithm's memory footprint is much smaller than the dataset itself. For example, how can a router with 1 MB of memory estimate the number of distinct IPs it sees in a multi-gigabyte stream of real-time traffic?

Parallel framework: we look at problems where neither the data nor the output fits on one machine. For example, given a set of 2D points, how can we compute their minimum spanning tree over a cluster of machines?

The focus will be on techniques such as sketching, dimensionality reduction, sampling, hashing, and others.

TRANSCRIPT

Page 1

Sketching, Sampling and other Sublinear Algorithms:

Nearest Neighbor Search

Alex Andoni (MSR SVC)

Page 2

Nearest Neighbor Search (NNS)

Preprocess: a set D of n points

Query: given a query point q, report a point p in D with the smallest distance to q

Page 3

Motivation

Generic setup:
Points model objects (e.g., images)
Distance models a (dis)similarity measure

Application areas: machine learning (the k-NN rule), speech/image/video/music recognition, vector quantization, bioinformatics, etc.

Distance can be: Hamming, Euclidean, edit distance, Earth-mover distance, etc.

Primitive for other problems: finding the similar pairs in a set D, clustering, ...

Example (Hamming distance between two bit strings):
q = 000000011100010100000100010100011111
p = 000000001100000100000100110100111111

Page 4

Lecture Plan

1. Locality-Sensitive Hashing
2. LSH as a Sketch
3. Towards Embeddings

Page 5

2D case

Compute the Voronoi diagram
Given a query q, perform point location

Performance:
Space: O(n)
Query time: O(log n)

Page 6

High-dimensional case

All exact algorithms degrade rapidly with the dimension d

In practice:
When d is "low to medium", kd-trees work reasonably well
When d is "high", the state of the art is unsatisfactory

Algorithm                    Query time     Space
Full indexing                O(d log n)     n^O(d) (Voronoi diagram size)
No indexing (linear scan)    O(n d)         O(n d)
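To make the baseline concrete, here is a minimal sketch of the no-indexing row above: a brute-force linear scan. The function name and the NumPy usage are my own illustration, not from the lecture.

```python
import numpy as np

def linear_scan(points: np.ndarray, q: np.ndarray) -> int:
    """Brute-force NNS: O(n*d) query time, O(n*d) space (just the data)."""
    # Squared Euclidean distance from q to every dataset point.
    dists = np.sum((points - q) ** 2, axis=1)
    return int(np.argmin(dists))

# Example: n = 1000 points in d = 128 dimensions.
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 128))
q = rng.standard_normal(128)
print(linear_scan(P, q))
```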

Page 7

Approximate NNS

c-approximate r-near neighbor: given a new point q, report a point p' with dist(p', q) ≤ cr if there exists a point at distance at most r

Randomized: such a point is returned with 90% probability

[Figure: query q, a point p within distance r, and the relaxed radius cr]

Page 8

Heuristic for Exact NNS

r-near neighbor: given a new point q, report a set L containing all points p with dist(p, q) ≤ r (each reported with 90% probability)

The set may also contain some c-approximate near neighbors p' with dist(p', q) ≤ cr

Can filter out these bad answers by computing the exact distances

[Figure: query q with radii r and cr]

Page 9

Approximation Algorithms for NNS

A vast literature: milder dependence on dimension

[Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02],…[Aiger-Kaplan-Sharir’13],

little to no dependence on dimension

[Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ‘01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [A-Indyk’06],… [A-Indyk-Nguyen-Razenshteyn’??]

Page 10

Locality-Sensitive Hashing

Random hash function g on R^d such that for any points p, q:
Close: when dist(p, q) ≤ r, Pr[g(p) = g(q)] = P1 is "high" ("not-so-small")
Far: when dist(p, q) > cr, Pr[g(p) = g(q)] = P2 is "small"

[Plot: Pr[g(p) = g(q)] as a function of dist(p, q), dropping from P1 at distance r to P2 at distance cr]

Use several hash tables: L = n^ρ, where ρ = log(1/P1) / log(1/P2)

[Indyk-Motwani'98]

Page 11

Locality sensitive hash functions

Hash function g is usually a concatenation of "primitive" functions:
g(p) = <h_1(p), h_2(p), ..., h_k(p)>

Example: Hamming space {0,1}^d
h(p) = p_i, i.e., h chooses the bit at a random coordinate i
g chooses k bits at random
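As a quick illustration of the Hamming construction above, here is a minimal sketch: each primitive h reads one random bit, and g concatenates k of them. Function names like sample_g are my own, not from the lecture.

```python
import random

def sample_g(d: int, k: int) -> list[int]:
    """Draw g = <h_1, ..., h_k>: k coordinates chosen at random from {0, ..., d-1}."""
    return [random.randrange(d) for _ in range(k)]

def apply_g(g: list[int], p: str) -> str:
    """Hash a d-bit string p by concatenating the k sampled bits."""
    return "".join(p[i] for i in g)

g = sample_g(d=36, k=8)
p = "000000011100010100000100010100011111"
q = "000000001100000100000100110100111111"
# Two points at Hamming distance dist collide with probability (1 - dist/d)^k
# over the random draw of g.
print(apply_g(g, p), apply_g(g, q))
```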

Page 12

Formal description

Data structure is just L hash tables:
Each hash table uses a fresh random function g_i(p) = <h_{i,1}(p), ..., h_{i,k}(p)>
Hash all dataset points into the table

Query:
Check for collisions in each of the L hash tables, until we encounter a point within distance cr

Guarantees:
Space: O(n L), plus the space to store the points
Query time: O(L (k + d)) (in expectation)
50% probability of success
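Putting the pieces together, a minimal sketch of this data structure for Hamming space might look as follows. The class and parameter names are my own, and no tuning of k and L is attempted here.

```python
import random
from collections import defaultdict

def hamming(p: str, q: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    """L hash tables, each keyed by k random bit positions (a sketch, untuned)."""

    def __init__(self, d: int, k: int, L: int):
        # One fresh random g_i = <h_i1, ..., h_ik> per table.
        self.funcs = [[random.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, g, p: str) -> str:
        return "".join(p[i] for i in g)

    def insert(self, p: str) -> None:
        for g, table in zip(self.funcs, self.tables):
            table[self._key(g, p)].append(p)

    def query(self, q: str, cr: int):
        # Scan colliding buckets; stop at the first point within distance cr.
        for g, table in zip(self.funcs, self.tables):
            for p in table.get(self._key(g, q), []):
                if hamming(p, q) <= cr:
                    return p
        return None
```

Insert every dataset point once; with k and L set as in the analysis on the next page, a query touches the buckets of O(L) tables and, in expectation, only O(L) far points.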

Page 13

Analysis of LSH Scheme

How did we pick k and L? For a fixed k, we have:
Pr[collision of a close pair] = P1^k
Pr[collision of a far pair] = P2^k

Want to make P2^k ≤ 1/n: set k = log n / log(1/P2)
Then L = (1/P1)^k = n^ρ, where ρ = log(1/P1) / log(1/P2)
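To make the parameter setting concrete, here is the arithmetic for a hypothetical Hamming instance, where a single random bit gives P1 = 1 - r/d and P2 = 1 - cr/d. The specific numbers are my own illustration.

```python
import math

d, r, c, n = 128, 8, 2, 10**6
P1 = 1 - r / d        # primitive collision prob. for a close pair: 0.9375
P2 = 1 - c * r / d    # primitive collision prob. for a far pair: 0.875

k = math.ceil(math.log(n) / math.log(1 / P2))  # forces P2^k <= 1/n
rho = math.log(1 / P1) / math.log(1 / P2)
L = math.ceil(n ** rho)                        # number of hash tables

print(k, round(rho, 2), L)  # roughly 104, 0.48, 795; note rho is close to 1/c = 0.5
```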

Page 14

Analysis: Correctness

Let p* be an r-near neighbor of q
If p* does not exist, the algorithm can output anything

The algorithm fails when the near neighbor p* is in none of the searched buckets

Probability of failure:
Probability that q and p* do not collide in one hash table: at most 1 - P1^k
Probability they do not collide in any of the L hash tables: at most
(1 - P1^k)^L = (1 - 1/L)^L ≤ 1/e

Page 15

Analysis: Runtime

Runtime is dominated by:
Hash function evaluations: O(L k) time
Distance computations to the points in the retrieved buckets

Distance computations:
We care only about the far points, at distance > cr
In one hash table, the probability that a fixed far point collides with q is at most P2^k = 1/n
Expected number of far points in q's bucket: n · (1/n) = 1
Over the L hash tables, the expected number of far points is at most L

Total: O(L k + L d) in expectation

Page 16

LSH in the wild

If we want exact NNS, what is c?
Can choose any parameters k and L ("safety not guaranteed": no formal approximation guarantee)
Correct as long as the true near neighbor collides with the query in some table

Performance: a trade-off between the number of tables and the number of false positives
(larger k gives fewer false positives; smaller L gives fewer tables)
Will depend on the dataset's "quality"
Can tune k and L to optimize for a given dataset

Further advantages:
Point insertions/deletions are easy
Natural to distribute the computation/hash tables across a cluster

Page 17

LSH Zoo

Hamming distance [IM'98]:
h: pick a random coordinate (or several)

Manhattan distance: homework

Jaccard distance between sets: J(A, B) = 1 - |A ∩ B| / |A ∪ B|
h: pick a random permutation π on the universe and hash a set to its minimal element under π
(min-wise hashing [Bro'97])

Euclidean distance: next lecture

Example (min-wise hashing): the phrases "To be or not to be" and "To sketch or not to sketch" map to the word sets {be, not, or, to} and {not, or, to, sketch}. Under the permutation π = (be, to, sketch, or, not), the first set hashes to "be" and the second to "to"; two sets hash to the same value with probability exactly |A ∩ B| / |A ∪ B|.
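A minimal sketch of min-wise hashing for the example above; the tiny word universe and the brute-force probability estimate are just illustrations.

```python
import random

universe = ["be", "to", "sketch", "or", "not"]
A = {"be", "not", "or", "to"}        # "To be or not to be"
B = {"not", "or", "to", "sketch"}    # "To sketch or not to sketch"

def minhash(perm: dict, s: set) -> str:
    """Min-wise hash: the element of s that comes first under the permutation."""
    return min(s, key=lambda w: perm[w])

# The fixed permutation from the example: be, to, sketch, or, not.
perm = {w: i for i, w in enumerate(universe)}
print(minhash(perm, A), minhash(perm, B))  # be to

# Over a uniformly random permutation, Pr[the two hashes agree]
# equals the Jaccard similarity |A ∩ B| / |A ∪ B| = 3/5 here.
order = universe[:]
hits, trials = 0, 100_000
for _ in range(trials):
    random.shuffle(order)
    perm = {w: i for i, w in enumerate(order)}
    hits += minhash(perm, A) == minhash(perm, B)
print(hits / trials)  # about 0.6
```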