Embedding and Sketching: Lecture 1

Sketching (1), Alex Andoni (Columbia University), MADALGO Summer School on Streaming Algorithms 2015


Page 1 (source: madalgo.au.dk/fileadmin/madalgo/Summerschool2015/1_sketches.pdf)

Sketching (1)

Alex Andoni (Columbia University)

MADALGO Summer School on Streaming Algorithms 2015

Page 2

Stream of packets:
131.107.65.14, 131.107.65.14, 131.107.65.14, 18.0.1.12, 18.0.1.12, 80.97.56.20, 80.97.56.20

IP              Frequency
131.107.65.14   3
18.0.1.12       2
80.97.56.20     2
127.0.0.1       9
192.168.0.1     8
257.2.5.7       0
16.09.20.11     1

Challenge: log statistics of the data, using small space

Page 3

Streaming statistics

Let x_i = frequency of IP i

1st moment (sum): Σ x_i
  Trivial: keep a total counter

2nd moment (variance): Σ x_i^2 = ||x||_2^2
  Trivially: n counters → too much space
  Can't do better (exactly)
  Better with small approximation!
  Via dimension reduction in ℓ_2

Example:
IP              Frequency
131.107.65.14   3
18.0.1.12       2
80.97.56.20     2
Σ x_i = 7,  Σ x_i^2 = 17

Page 4

2nd frequency moment

Let x_i = frequency of IP i
2nd moment: Σ x_i^2 = ||x||_2^2

Dimension reduction: store a sketch of x
  S(x) = (G_1 x, G_2 x, ..., G_k x) = Gx
  where each G_i is an n-dimensional Gaussian vector (G is a k × n matrix)

Estimator:
  (1/k) ||Gx||^2 = (1/k) ((G_1 x)^2 + (G_2 x)^2 + ... + (G_k x)^2)

Updating the sketch: use linearity of the sketching function S
  G(x + e_i) = Gx + G·e_i

Page 5

Correctness

Theorem [Johnson-Lindenstrauss]:
  (1/k)||Gx||^2 = (1 ± ε) ||x||^2 with probability 1 − e^{−O(kε^2)}

Why Gaussian?
  Stability property: G_i x = Σ_j G_ij x_j is distributed as ||x|| · g,
  where g is also Gaussian
  Equivalently: G_i is spherically symmetric, i.e., has a random direction,
  and the projection on a random direction depends only on the length of x

  Two-dimensional check of the symmetry:
  P(a) · P(b) = (1/√(2π)) e^{−a^2/2} · (1/√(2π)) e^{−b^2/2} = (1/(2π)) e^{−(a^2+b^2)/2},
  which depends only on a^2 + b^2

  (pdf of g: (1/√(2π)) e^{−g^2/2}, with E[g] = 0, E[g^2] = 1)

Page 6

Proof [sketch]

Claim: for any x ∈ R^n, we have
  Expectation: E[(G_i · x)^2] = ||x||^2
  Standard deviation: σ[(G_i x)^2] = O(||x||^2)
Proof: E[(G_i · x)^2] = E[||x||^2 · g^2] = ||x||^2

Gx is distributed as (||x|| · g_1, ..., ||x|| · g_k),
where each g_i is distributed as a 1D Gaussian

Estimator: (1/k)||Gx||^2 = ||x||^2 · (1/k) Σ_i g_i^2

Σ_i g_i^2 has the chi-squared distribution with k degrees of freedom
Fact: chi-squared is very well concentrated:
  (1/k) Σ_i g_i^2 = 1 ± ε with probability 1 − e^{−Ω(ε^2 k)}
  Akin to the central limit theorem

Page 7

2nd frequency moment: overall

Correctness:
  (1/k)||Gx||^2 = (1 ± ε)||x||^2 with probability 1 − e^{−O(kε^2)}
  Enough to set k = O(1/ε^2) for constant probability of success

Space requirement:
  k = O(1/ε^2) counters of O(log n) bits
  What about G: store O(nk) reals?

Storing randomness [AMS'96]:
  OK if the g_i are "less random": choose each of them 4-wise independent
  Also OK if g_i is a random ±1
  Only O(k) counters of O(log n) bits

Page 8

More efficient sketches?

Smaller space:
  No: Ω(ε^{−2} log n) bits [JW'11] ← David's lecture
Faster update time:
  Yes: Jelani's lecture

Page 9

Streaming Scenario 2

Two streams, with frequency vectors x and y:
packets: 131.107.65.14, 18.0.1.12, 18.0.1.12, 80.97.56.20

IP              Frequency (x)
131.107.65.14   1
18.0.1.12       2

IP              Frequency (y)
131.107.65.14   1
18.0.1.12       1
80.97.56.20     1

Focus: difference in traffic
  1st moment: Σ |x_i − y_i| = ||x − y||_1
  2nd moment: Σ |x_i − y_i|^2 = ||x − y||_2^2
  Here: ||x − y||_1 = 2 and ||x − y||_2^2 = 2

Similar questions: average delay/variance in a network,
differential statistics between logs at different servers, etc.

Page 10

Definition: Sketching

S : objects → short bit-strings

Given S(x) and S(y), we should be able to estimate some function of x and y

Example: from two short sketches, S(x) = 010110 and S(y) = 010101,
estimate ||x − y||_2^2 for the frequency vectors

IP              Frequency (x)
131.107.65.14   1
18.0.1.12       2

IP              Frequency (y)
131.107.65.14   1
18.0.1.12       1
80.97.56.20     1

Page 11

Sketching for ℓ2

As before, dimension reduction:
  Pick G (using common randomness)
  S(x) = Gx

Estimator: ||S(x) − S(y)||_2^2 = ||Gx − Gy||_2^2 = ||G(x − y)||_2^2

Page 12

Sketching for Manhattan distance (ℓ1)

Dimension reduction?
  Essentially no [CS'02, BC'03, LN'04, JN'10, ...]:
  for n points and approximation D, the dimension must be between
  n^{Ω(1/D^2)} and O(n/D) [BC'03, NR'10, ANN'10, ...],
  even if the map depends on the dataset!
  In contrast, [JL] gives O(ε^{−2} log n) for ℓ2
  No distributional dimension reduction either

Weak dimension reduction comes to the rescue...

Page 13

Dimension reduction for ℓ1?

Can we do the "analog" of Euclidean projections?

For ℓ2 we used the Gaussian distribution, which has the stability property:
  g_1 z_1 + g_2 z_2 + ... + g_d z_d is distributed as g · ||z||_2

Is there something similar for the 1-norm?
Yes: the Cauchy distribution! It is 1-stable:
  c_1 z_1 + c_2 z_2 + ... + c_d z_d is distributed as c · ||z||_1

What's wrong then?
  Cauchy random variables are heavy-tailed:
  |c| does not even have finite expectation
  pdf(s) = 1/(π(s^2 + 1))

Page 14

Sketching for ℓ1 [Indyk'00]

Still, we can consider the same map as before:
  S(x) = (C_1 x, C_2 x, ..., C_k x) = Cx
  Consider S(x) − S(y) = Cx − Cy = C(x − y) = Cz, where z = x − y
  Each coordinate of Cz is distributed as ||z||_1 × Cauchy

Take the 1-norm ||Cz||_1?
  It does not have finite expectation, but...
  we can estimate ||z||_1 by the median of the absolute values of the coordinates of Cz!

Correctness claim: for each i,
  Pr[|C_i z| > ||z||_1 · (1 − ε)] > 1/2 + Ω(ε)
  Pr[|C_i z| < ||z||_1 · (1 + ε)] > 1/2 + Ω(ε)

Page 15

Estimator for ℓ1

Estimator: median(|C_1 z|, |C_2 z|, ..., |C_k z|)

Correctness claim: for each i,
  Pr[|C_i z| > ||z||_1 · (1 − ε)] > 1/2 + Ω(ε)
  Pr[|C_i z| < ||z||_1 · (1 + ε)] > 1/2 + Ω(ε)

Proof:
  |C_i z| is distributed as |(||z||_1) · c| = ||z||_1 · |c|, for c a standard Cauchy
  Easy to verify that
    Pr[|c| > 1 − ε] > 1/2 + Ω(ε)
    Pr[|c| < 1 + ε] > 1/2 + Ω(ε)
  Hence, if we take k = O(1/ε^2):
    median(|C_1 z|, ..., |C_k z|) ∈ (1 ± ε) ||z||_1 with probability at least 90%

Page 16

To finish the ℓ_p norms...

p-moment: Σ x_i^p = ||x||_p^p

p ≤ 2:
  works via p-stable distributions [Indyk'00]
p > 2:
  can do (and need) O(n^{1−2/p}) counters
  [AMS'96, SS'02, BYJKS'02, CKS'03, IW'05, BGKS'06, BO'10, AKO'11, G'11, BKSV'14]
  Will see a construction via Precision Sampling

Page 17

A task: estimate a sum

Given: n quantities a_1, a_2, ..., a_n in the range [0,1]
Goal: estimate S = a_1 + a_2 + ... + a_n "cheaply"

Standard sampling: pick a random set J = {j_1, ..., j_m} of size m
  Estimator: S̃ = (n/m) · (a_{j_1} + a_{j_2} + ... + a_{j_m})
  Chebyshev bound: with 90% success probability,
    (1/2)S − O(n/m) < S̃ < 2S + O(n/m)
  For constant additive error, need m = Ω(n)
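A quick simulation of the plain sampling estimator (sizes are illustrative); it is accurate here only because m is a sizable fraction of n, which reflects the Ω(n) cost the slide points out:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100_000
a = rng.uniform(0.0, 1.0, size=n)   # n quantities in [0, 1]
S = a.sum()

m = 5_000
J = rng.choice(n, size=m, replace=False)   # random sample of size m
S_hat = (n / m) * a[J].sum()               # (n/m) * (a_j1 + ... + a_jm)
```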

Page 18

Precision Sampling Framework

Alternative "access" to the a_i's:
  for each term a_i, we get a (rough) estimate ã_i,
  up to some precision u_i chosen in advance: |ã_i − a_i| < u_i

Challenge: achieve a good trade-off between
  the quality of the approximation to S, and
  using only weak precisions u_i (minimize the "cost" of estimating the ã_i)

Compute an estimate S̃ from ã_1, ã_2, ..., ã_n

Page 19

Formalization

Sum Estimator                          Adversary
                                       1. fix a_1, a_2, ..., a_n
1. fix precisions u_i
                                       2. fix ã_1, ..., ã_n s.t. |ã_i − a_i| < u_i
3. given ã_1, ..., ã_n, output S̃ s.t.
   |Σ_i a_i − γ·S̃| < 1 (for some small γ)

What is the cost?
  Here, average cost = (1/n) · Σ 1/u_i
  To achieve precision u_i, one uses 1/u_i "resources": e.g., if a_i is itself
  a sum a_i = Σ_j a_ij computed by subsampling, then one needs Θ(1/u_i) samples
  For example, we can choose all u_i = 1/n: average cost ≈ n

Page 20

Precision Sampling Lemma

Goal: estimate Σ a_i from {ã_i} satisfying |ã_i − a_i| < u_i.

Precision Sampling Lemma: can get, with 90% success,
O(1) additive error and 1.5 multiplicative error:
  S − O(1) < S̃ < 1.5·S + O(1)
with average cost equal to O(log n)

Example: distinguish Σ a_i = 3 vs Σ a_i = 0. Consider two extreme cases:
  if three a_i = 1: enough to have a crude approximation for all (u_i = 0.1)
  if all a_i = 3/n: only a few with good approximation u_i = 1/n, and the rest with u_i = 1

ε-version [A-Krauthgamer-Onak'11]:
  S − ε < S̃ < (1 + ε)·S + ε
  with average cost O(ε^{−3} log n)

Page 21

Precision Sampling Algorithm

Precision Sampling Lemma: can get, with 90% success,
O(1) additive error and 1.5 multiplicative error:
  S − O(1) < S̃ < 1.5·S + O(1)
with average cost equal to O(log n)

Algorithm:
  Choose each u_i ∈ [0,1] i.i.d. uniformly
  Estimator: S̃ = count of the i's s.t. ã_i/u_i > 6 (up to a normalization constant)

Proof of correctness:
  we use only the ã_i which are 1.5-approximations to a_i
  E[S̃] ≈ Σ Pr[a_i/u_i > 6] = Σ a_i/6 (since Pr[u_i < a_i/6] = a_i/6 for uniform u_i)
  average cost (1/n) Σ 1/u_i = O(log n) w.h.p.

ε-version:
  estimator is a function of [ã_i/u_i − 4/ε]_+ and the u_i's
  concrete distribution = minimum of O(ε^{−3}) uniform r.v.'s
  guarantee: S − ε < S̃ < (1 + ε)·S + O(1), with average cost O(ε^{−3} log n)

Page 22

โ„“๐‘ via precision sampling

Theorem: linear sketch for โ„“๐‘ with ๐‘‚(1) approximation,

and ๐‘‚(๐‘›1โˆ’2/๐‘ log ๐‘›) space (90% succ. prob.).

Sketch:

Pick random ๐‘Ÿ๐‘–{ยฑ1}, and ๐‘ข๐‘– as exponential r.v.

let ๐‘ฆ๐‘– = ๐‘ฅ๐‘– โ‹… ๐‘Ÿ๐‘–/๐‘ข๐‘–1/๐‘

throw into one hash table ๐ป,

๐‘˜ = ๐‘‚(๐‘›1โˆ’2/๐‘ log ๐‘›) cells

Estimator:

max๐‘

๐ป ๐‘ ๐‘

Linear: works for difference as well

Randomness: bounded independence suffices

๐‘ฅ1 ๐‘ฅ2 ๐‘ฅ3 ๐‘ฅ4 ๐‘ฅ5 ๐‘ฅ6

๐‘ฆ๐Ÿ

+ ๐‘ฆ๐Ÿ‘

๐‘ฆ๐Ÿ’ ๐‘ฆ๐Ÿ

+ ๐‘ฆ๐Ÿ“

+ ๐‘ฆ๐Ÿ”

๐‘ฅ =

๐ป =

๐‘ข โˆผ ๐‘’โˆ’๐‘ข
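A rough simulation of this sketch. The sizes, seed, and the single planted heavy coordinate are illustrative, and since the estimator is only an O(1) approximation with constant probability, the sanity bound below is deliberately loose:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 50_000, 3.0
x = rng.uniform(0.0, 1.0, size=n)
x[0] = 30.0                          # a planted heavy coordinate

r = rng.choice([-1.0, 1.0], size=n)  # random signs r_i
u = rng.exponential(1.0, size=n)     # u_i with pdf e^{-u}
y = x * r / u ** (1.0 / p)           # y_i = x_i * r_i / u_i^{1/p}

k = 4_000                            # a comfortable multiple of n^{1-2/p} log n
cell = rng.integers(0, k, size=n)    # hash each coordinate to a cell
H = np.zeros(k)
np.add.at(H, cell, y)                # each cell sums the y_i mapped to it

estimate = np.max(np.abs(H)) ** p    # max_c |H[c]|^p
true = np.sum(np.abs(x) ** p)        # ||x||_p^p
```

By max-stability, max_i |y_i|^p behaves like ||x||_p^p / u for a single exponential u, which is why the guarantee is a constant-factor one rather than (1 ± ε).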

Page 23

Correctness of ℓ_p estimation

Sketch: y_i = x_i · r_i/u_i^{1/p}, where r_i ∈ {±1} and u_i is an exponential r.v.;
throw into hash table H

Theorem: max_c |H[c]|^p is an O(1) approximation with 90%
probability, for k = O(n^{1−2/p} log^{O(1)} n) cells

Claim 1: max_i |y_i| is a constant approximation to ||x||_p
  max_i |y_i|^p = max_i |x_i|^p / u_i
  Fact [max-stability]: max_i λ_i/u_i is distributed as (Σ_i λ_i)/u, for u an exponential r.v.
  So max_i |y_i|^p is distributed as ||x||_p^p / u
  and u is Θ(1) with constant probability

Page 24

Correctness (cont.)

Claim 2: max_c |H[c]| = Θ(1) · ||x||_p

Consider the hash table H, and the cell c where y_{i*} falls,
for the i* which maximizes |y_{i*}|

How much "extra stuff" is there?
  δ^2 = (H[c] − y_{i*})^2 = (Σ_{j≠i*} y_j · [j → c])^2
  E[δ^2] = Σ_{j≠i*} y_j^2 · Pr[j → c] = Σ_{j≠i*} y_j^2 / k ≤ ||y||^2 / k
  We have: E_u[||y||^2] ≤ ||x||^2 · E[1/u^{2/p}] = O(log n) · ||x||^2
  and ||x||^2 ≤ n^{1−2/p} ||x||_p^2
  By Markov: δ^2 ≤ ||x||_p^2 · n^{1−2/p} · O(log n)/k with probability 0.9
  Then: H[c] = y_{i*} + δ = Θ(1) · ||x||_p

Need to argue about the other cells too → concentration