faculty of computer science, institute system architecture, database technology group

A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets

Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)


Slide 2(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary



Slide 3(VLDB 2006)

Random Sampling

• Database applications
  – huge data sets
  – complex algorithms (space & time)
• Requirements
  – performance, performance, performance
• Random sampling
  – approximate query answering
  – data mining
  – data stream processing
  – query optimization
  – data integration

Turnover in Europe (TPC-H)

Sample   Estimate    Error       Time
1%       8.46 Mil.   0.15 Mil.   4s
10%      8.51 Mil.   0.05 Mil.   52s
100%     8.54 Mil.               200s



Slide 4(VLDB 2006)

The Problem Space

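• Setting
  – arbitrary data sets
  – samples of the data
  – evolving data
• Scope of this talk
  – maintenance of random samples

Can we minimize or even avoid access to base data?

[Figure: data and sample; updates D are applied ("Apply") to the data and to the sample, and the sample is computed ("Compute") from the data]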



Slide 5(VLDB 2006)

Types of Data Sets

• Data sets
  – variation of data set size
  – influence on sampling

Stable: goal is a stable sample
Growing: goal is a controlled growing sample
Shrinking: uninteresting



Slide 6(VLDB 2006)

Uniform Sampling

• Uniform sampling
  – all samples of the same size are equally likely
  – many statistical procedures assume uniformity
  – flexibility
• Example
  – a data set (also called population): 1 2 3 4
  – possible samples of size 2: {1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
  – each sample has probability 1/6 (about 16.7%)
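The example above can be checked by enumeration. The following is a minimal sketch; the function name `uniform_sample_distribution` is introduced here for illustration and does not come from the talk.

```python
from fractions import Fraction
from itertools import combinations


def uniform_sample_distribution(population, size):
    """Map every possible sample of the given size to its probability."""
    samples = list(combinations(population, size))
    # Uniformity: all samples of the same size are equally likely.
    probability = Fraction(1, len(samples))
    return {sample: probability for sample in samples}


distribution = uniform_sample_distribution([1, 2, 3, 4], 2)
for sample, probability in distribution.items():
    print(sample, probability)
```

For the population 1 2 3 4 and size 2 this lists six samples, each with probability 1/6.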



Slide 7(VLDB 2006)

Reservoir Sampling

• Reservoir sampling
  – computes a uniform sample of M elements
  – building block for many sophisticated sampling schemes
  – single-scan algorithm
    • add the first M elements
    • afterwards, flip a coin
      a) ignore the element (reject)
      b) replace a random element in the sample (accept)
  – accept probability of the ith element

\[ P(t_i \text{ is accepted}) = \frac{M}{i} = \frac{\text{sample size}}{\text{population size}} \]
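The single-scan algorithm above can be sketched in a few lines. This is an illustrative version, not the authors' code; the function name `reservoir_sample` and the `rng` parameter are assumptions made here.

```python
import random


def reservoir_sample(stream, m, rng=random):
    """Return a uniform sample of up to m elements from a single scan."""
    sample = []
    for i, element in enumerate(stream, start=1):
        if i <= m:
            sample.append(element)              # add the first M elements
        elif rng.random() < m / i:              # accept with probability M / i
            sample[rng.randrange(m)] = element  # replace a random element
        # otherwise: reject (ignore the element)
    return sample


print(reservoir_sample(range(1, 101), 5, random.Random(0)))
```

The sample never holds more than M elements, which is what makes the scheme attractive when storage is bounded.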



Slide 8(VLDB 2006)

Reservoir Sampling (Example)

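• Example
  – sample size M = 2

[Figure: probability tree for the insertions +t1 to +t4.
 After +t1, +t2: sample {1,2} with probability 100%.
 After +t3: {1,2}, {3,2}, {1,3} with probability 1/3 (33%) each.
 After +t4: each sample is kept with probability 2/4 or one of its two elements is replaced by 4 with probability 1/4 each; the nine leaves have probability 16% (sample kept) or 8% (element replaced), so every sample of size 2 has probability 1/6.]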



Slide 9(VLDB 2006)

Problems with Reservoir Sampling

• Problems with reservoir sampling
  – lacks support for deletions (stable data sets)
  – cannot efficiently enlarge sample (growing data sets)



Slide 10(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary



Slide 11(VLDB 2006)

Naïve/Prior Approaches

Algorithm                   Technique                                                   Comments
(RS with deletions)         conduct deletions, continue with smaller sample             unstable
Naïve                       use insertions to immediately refill the sample             not uniform
Backing sample              let sample size decrease, but occasionally recompute        expensive, unstable
CAR(WOR)                    immediately sample from base data to refill the sample      stable but expensive
Bernoulli s. with purging   “coin flip” sampling with deletions, purge if too large     inexpensive but unstable
Passive sampling            developed for data streams (sliding windows only)           special case of our RP algorithm
Distinct-value sampling     tailored for multiset populations                           expensive, low space efficiency in our setting



Slide 12(VLDB 2006)

Random Pairing

• Random pairing
  – compensates deletions with arriving insertions
  – corrects inclusion probabilities
• General idea (insertion)
  – no uncompensated deletions → reservoir sampling
  – otherwise,
    • randomly select an uncompensated deletion (partner)
    • compensate it: Was it in the sample?
      – yes → add arriving element to sample
      – no → ignore arriving element



Slide 13(VLDB 2006)

Random Pairing

• Example

[Figure: probability tree of random pairing with sample size 2 for the operations +t1, +t2, +t3, -t2, -t3, +t4, +t5.
 After +t3: {1,2}, {3,2}, {1,3} with probability 1/3 each.
 After -t2, -t3: the samples shrink to {1}, {} and {1}; both deletions are uncompensated.
 +t4 and +t5 are each paired with one of the uncompensated deletions (probability 1/2 each for +t4, probability 1 for +t5).
 The six leaves {1,5}, {1,4}, {1,5}, {1,4}, {4,5}, {4,5} have probability 16% each, so every sample of size 2 of the data set 1 4 5 has probability 1/3.]



Slide 14(VLDB 2006)

Random Pairing

• Details of the algorithm
  – keeping history of deleted items is expensive, but:

\[ P(t_i \text{ is added}) = P(\text{random partner was in sample}) = \frac{\#\,\text{uncompensated deletions in sample}}{\#\,\text{uncompensated deletions}} = \frac{c_1}{d} \]

  – maintenance of two counters suffices
  – correctness proof is in the paper
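A minimal sketch of random pairing with two counters follows. It assumes distinct items and that only items present in the data set are deleted; the class name `RandomPairingSample`, the attribute names, and the counter `c2` (uncompensated deletions that were not in the sample, so that d = c1 + c2) are naming choices made here, not taken from the slides.

```python
import random


class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions (sketch)."""

    def __init__(self, m, rng=None):
        self.m = m                  # upper bound on the sample size
        self.rng = rng if rng is not None else random.Random()
        self.sample = []
        self.dataset_size = 0
        self.c1 = 0                 # uncompensated deletions that were in the sample
        self.c2 = 0                 # uncompensated deletions that were not

    def insert(self, item):
        self.dataset_size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir sampling step.
            if len(self.sample) < self.m:
                self.sample.append(item)
            elif self.rng.random() < self.m / self.dataset_size:
                self.sample[self.rng.randrange(self.m)] = item
        elif self.rng.random() < self.c1 / d:
            # The random partner was in the sample: take its place.
            self.c1 -= 1
            self.sample.append(item)
        else:
            # The random partner was not in the sample: ignore the item.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1


rp = RandomPairingSample(m=2, rng=random.Random(7))
for op, t in [('+', 1), ('+', 2), ('+', 3), ('-', 2), ('-', 3), ('+', 4), ('+', 5)]:
    if op == '+':
        rp.insert(t)
    else:
        rp.delete(t)
print(sorted(rp.sample))  # a size-2 subset of {1, 4, 5}
```

The list-based `sample` keeps the sketch short; a real implementation would want constant-time membership tests and removals.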



Slide 15(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary



Slide 16(VLDB 2006)

Growing Data Sets

• The problem
  – growing data set

[Figure: "Data set": growing data set; "Random pairing": stable sample, sampling fraction decreases]



Slide 17(VLDB 2006)

A Negative Result

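• Negative result
  – There is no resizing algorithm which can enlarge a bounded-size sample without ever accessing base data.
• Example
  – data set: 1 2 3 4
  – samples of size 2: {1,2} {1,3} {1,4} {2,3} {2,4} {3,4}, each with probability 1/6 (16%)
  – new data set: 1 2 3 4 5 6 ...
  – samples of size 3: {1,2,3} has probability 0%, {1,2,5} has probability >0%
  – the old sample holds only two of the elements 1 to 4, so a third old element can only come from base data
  – Not uniform!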



Slide 18(VLDB 2006)

Resizing

• Goal
  – efficiently increase sample size
  – stay within an upper bound at all times
• General idea
  1. convert sample to Bernoulli sample
  2. continue Bernoulli sampling until new sample size is reached
  3. convert back to reservoir sample
• Optimally balance cost
  – cost of base data accesses (in step 1)
  – time to reach new sample size (in step 2)



Slide 19(VLDB 2006)

Resizing

• Bernoulli sampling
  – uniform sampling scheme
  – each tuple is added to the sample with probability q
  – sample size follows binomial distribution → no effective upper bound
• Phase 1: Conversion to a Bernoulli sample
  – given q, randomly determine sample size
  – reuse reservoir sample to create Bernoulli sample
    • subsample
    • sample additional tuples (base data access)
  – choice of q
    • small → fewer base data accesses
    • large → more base data accesses
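Phase 1 can be sketched as follows, under simplifying assumptions: the base data fits in a list, the sample size is drawn by N coin flips, and the upper bound on the sample size is not enforced. The function name `to_bernoulli_sample` and its return value (new sample plus number of base data accesses) are choices made for this illustration; the exact procedure is in the paper.

```python
import random


def to_bernoulli_sample(sample, base_data, q, rng=random):
    """Turn a uniform fixed-size sample of base_data into a Bernoulli(q) sample.

    Returns the new sample and the number of tuples fetched from base data.
    """
    # The size of a Bernoulli(q) sample of N tuples is Binomial(N, q).
    target = sum(rng.random() < q for _ in range(len(base_data)))
    if target <= len(sample):
        # Subsample: a uniform subsample of a uniform sample stays uniform.
        return rng.sample(list(sample), target), 0
    # Otherwise sample additional tuples from the base data (expensive).
    in_sample = set(sample)
    rest = [t for t in base_data if t not in in_sample]
    extra = rng.sample(rest, target - len(sample))
    return list(sample) + extra, len(extra)


rng = random.Random(3)
base = list(range(1000))
reservoir = rng.sample(base, 100)
bernoulli, accesses = to_bernoulli_sample(reservoir, base, 0.05, rng)
print(len(bernoulli), accesses)
```

A small q makes the subsampling branch likely and avoids base data access; a large q pushes the drawn size above the current sample size and forces accesses.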



Slide 20(VLDB 2006)

Resizing

• Phase 2: Run Bernoulli sampling
  – accept new tuples with probability q
  – conduct deletions
  – stop as soon as new sample size is reached
• Phase 3: Revert to reservoir sampling
  – switchover is trivial
• Choosing q
  – determines cost of Phase 1 and Phase 2
  – goal: minimize total cost
    • base data access expensive → small q
    • base data access cheap → large q
  – details in paper
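Phase 2 is short enough to sketch directly. The encoding of operations as `('+', t)` and `('-', t)` pairs and the function name `run_bernoulli_phase` are assumptions made here for illustration.

```python
import random


def run_bernoulli_phase(sample, operations, q, new_size, rng=random):
    """Apply insertions and deletions until the sample holds new_size tuples.

    Returns the sample and the number of operations consumed.
    """
    sample = set(sample)
    consumed = 0
    for op, t in operations:
        if len(sample) >= new_size:
            break                      # Phase 3: switch back to reservoir sampling
        consumed += 1
        if op == '+':
            if rng.random() < q:       # accept new tuples with probability q
                sample.add(t)
        else:
            sample.discard(t)          # conduct deletions
    return sample, consumed


ops = [('+', t) for t in range(100, 200)]
grown, used = run_bernoulli_phase({1, 2, 3}, ops, 0.5, 6, random.Random(5))
print(len(grown), used)
```

The number of operations consumed stands in for the "time to reach new sample size" that is traded off against base data accesses when choosing q.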



Slide 21(VLDB 2006)

Resizing

• Example
  – resize by 30% if sampling fraction drops below 9%
  – dependent on costs of accessing base data

Low costs: immediate resizing
Moderate costs: combined solution
High costs: degenerates to Bernoulli sampling



Slide 22(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary



Slide 23(VLDB 2006)

Total Cost

• Total cost
  – stable dataset, 10M operations
  – sample size 100k, data access 10 times more expensive than sample access

[Figure: total cost; groups "Base data access" and "No base data access"]



Slide 24(VLDB 2006)

Sample size

• Sample size
  – stable dataset, size 1M
  – sample size 100k

[Figure: sample size; groups "Base data access" and "No base data access"]



Slide 25(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary



Slide 26(VLDB 2006)

Summary

• Reservoir Sampling
  – lacks support for deletions
  – complete recomputation to enlarge the sample
• Random Pairing
  – uses arriving insertions to compensate for deletions
• Resizing
  – base data access cannot be avoided
  – minimizes total cost
• Future work
  – better q for resizing
  – combine with existing techniques [4,8,17] to enhance flexibility, scalability



Slide 27(VLDB 2006)

Thank you!

Questions?



Slide 28(VLDB 2006)

Backup: Bounded-Size Sampling

• Why sampling?
  – performance, performance, performance
• How much to sample?
  – influencing factors
    1. storage consumption
    2. response time
    3. accuracy
  – choosing the sample size / sampling fraction
    1. largest sample that meets storage requirements
    2. largest sample that meets response time requirements
    3. smallest sample that meets accuracy requirements



Slide 29(VLDB 2006)

Backup: Bounded-Size Sampling

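• Example
  – random pairing vs. Bernoulli sampling
  – average estimation

[Figure: panels "Data set", "Sample size" (BS violates 1, 2) and "Standard error" (BS violates 3)]

Standard error:
\[ \sqrt{\frac{\mathrm{Var}}{n}\left(1 - \frac{n}{N}\right)} \]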



Slide 30(VLDB 2006)

Backup: Distinct-Value Sampling

• Distinct-value sampling (optimistic setting for DV)
  – DV-scheme knows avg. dataset size in advance
  – assume no storage for counters & hash functions

[Figure: panel "Sample size": RP has better memory utilization; panel "Execution time" (10ms to 1000s): RP is significantly faster]



Slide 31(VLDB 2006)

Backup: RS With Deletions

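• Reservoir sampling with deletions
  – conduct deletions, continue with smaller sample size

[Figure: probability tree for the operations +t1, +t2, +t3, -t2, -t3, +t4, +t5.
 After +t3: {1,2}, {3,2}, {1,3} with probability 1/3 each.
 After -t2, -t3: the samples shrink to {1}, {} and {1} and do not grow back.
 +t4 is accepted with probability 1/2, +t5 with probability 1/3.
 Leaf probabilities: 11% and 5.5% for the one-element samples {1}, {4}, {5}, and 33% for the empty sample.]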



Slide 32(VLDB 2006)

Backup: Backing Sample

• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions

[Figure: "Data set": stable; "Reservoir sampling": sample is empty eventually; "Backing sample": expensive, unstable]



Slide 33(VLDB 2006)

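Backup: An Incorrect Approach

• Idea
  – use arriving insertions to refill the sample

[Figure: probability tree for the operations +t1, +t2, +t3, -t2, -t3, +t4, +t5.
 After -t2, -t3 the samples are {1}, {} and {1}, each with probability 1/3.
 +t4 refills every sample with probability 1; +t5 refills the sample {4} with probability 1 and otherwise runs a reservoir step with outcomes of probability 1/3 each.
 Leaf probabilities: 11% for six leaves and 33% for the leaf {4,5} reached through the empty sample.]

Not uniform!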
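The non-uniformity can be verified exactly by propagating probabilities through the operations of the example. This is a sketch for illustration; the function name `naive_refill_distribution` and the encoding of operations as `('+', t)` and `('-', t)` pairs are assumptions made here.

```python
from collections import defaultdict
from fractions import Fraction


def naive_refill_distribution(operations, m):
    """Exact sample distribution when insertions refill the sample directly."""
    dist = {frozenset(): Fraction(1)}
    dataset_size = 0
    for op, t in operations:
        nxt = defaultdict(Fraction)
        if op == '-':
            dataset_size -= 1
            for sample, p in dist.items():
                nxt[sample - {t}] += p                 # conduct the deletion
        else:
            dataset_size += 1
            for sample, p in dist.items():
                if len(sample) < m:
                    nxt[sample | {t}] += p             # refill immediately
                else:
                    accept = Fraction(m, dataset_size)  # reservoir step
                    nxt[sample] += p * (1 - accept)
                    for victim in sample:
                        nxt[(sample - {victim}) | {t}] += p * accept / m
        dist = dict(nxt)
    return dist


ops = [('+', 1), ('+', 2), ('+', 3), ('-', 2), ('-', 3), ('+', 4), ('+', 5)]
for sample, p in sorted(naive_refill_distribution(ops, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(sample), p)
```

For the example this yields probability 2/9 for {1,4} and {1,5} but 5/9 for {4,5}, whereas a uniform scheme would give 1/3 to each.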



Slide 34(VLDB 2006)

Backup: Random Pairing

• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions

[Figure: "Data set": stable; "Reservoir sampling": sample gets empty eventually; "Random pairing": no base data access!]



Slide 35(VLDB 2006)

Backup: Average Sample Size

• Average sample size
  – stable dataset, 10M operations
  – sample size 100k



Slide 36(VLDB 2006)

Backup: Average Sample Size With Clustered Insertions/Deletions

• Average sample size with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k



Slide 37(VLDB 2006)

Backup: Cost

• Cost
  – stable dataset, 10M operations
  – sample size 100k



Slide 38(VLDB 2006)

Backup: Cost With Clustered Insertions/Deletions

• Cost with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k



Slide 39(VLDB 2006)

Backup: Resizing (Value of q)

• Resizing
  – enlarge sample from 100k to 200k
  – base data access 10ms, arrival rate 1ms
