Page 1: Scalable Approximate Query Processing

Scalable Approximate Query Processing

Florin Rusu

Page 2: Scalable Approximate Query Processing

2

Data Explosion

• Data storage advancements
  – Price / capacity ($70 / 1 TB)
• Human generated
  – Web 2.0 & social networking
• User data
  – Communication
• Network & web logs (eBay – 50 TB / day)
• Call Detail Records (CDRs)
• Scientific experiments
  – LHC (Large Hadron Collider)
  – SKA (Square Kilometer Array) – 1 EB (10^18 bytes) / day
  – Sensor networks

04/19/2010

Page 3: Scalable Approximate Query Processing

3

Large-Scale Data Analytics

• Traditional DB (OLTP)
  – Multi-user transaction processing
  – Optimized for specific workloads (views, indexes, …)
• Analytic processing (OLAP)
  – Data cubes
    • Aggregate at different hierarchical levels
    • Pre-defined aggregates, not flexible
  – Shared-nothing architectures (MPP)
    • Startups: Netezza, Greenplum, AsterData, Vertica, …
    • Parallel databases on clusters of computers
    • Storage layer (row store, column store, hybrid)
    • Compression

Page 4: Scalable Approximate Query Processing

4

Interactive Data Analysis & Exploration

• Ad-hoc queries
• Compute statistical aggregates over all data
• Example: web log analysis
  – Documents (URL, Content)
  – UserVisits (IP, URL, Date, Duration)
  – “How much time did users spend searching for cars during the period May – July 2009?”

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]

Page 5: Scalable Approximate Query Processing

5

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 6: Scalable Approximate Query Processing

6

Query Execution

Documents
URL  Content
A    car
B    car
C    car
D    phone
E    car
F    car
G    car
H    PC
I    car
J    car

UserVisits
IP  URL  Date      Duration
1   A    05-30-09  45
1   B    06-01-09  60
1   J    06-01-09  30
1   D    05-15-09  90
1   I    04-28-09  35
2   A    04-30-09  60
2   F    06-15-09  15
2   G    06-13-09  10
2   E    06-01-09  20
2   E    07-10-09  35
3   C    04-28-09  25
3   B    05-23-09  25
3   J    05-29-09  35
3   I    06-13-09  25
3   D    06-09-09  40
4   C    07-30-09  50
4   H    05-14-09  75
4   H    08-02-09  65
4   G    07-23-09  90
4   F    06-16-09  5

[Query plan: Σ over the join of σ(D) and σ(UV)]

• Selections push down
• Sort-Merge Join
• Aggregate

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]
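The plan on this slide (push selections down, sort-merge join, aggregate) can be sketched in a few lines of Python. This is a minimal illustration over the toy tables, not DBO code: the function names are invented for the example, and the sort happens in memory rather than through disk-based runs.

```python
from datetime import date

# Toy tables from the slide.
documents = [("A", "car"), ("B", "car"), ("C", "car"), ("D", "phone"), ("E", "car"),
             ("F", "car"), ("G", "car"), ("H", "PC"), ("I", "car"), ("J", "car")]
user_visits = [  # (IP, URL, Date, Duration)
    (1, "A", date(2009, 5, 30), 45), (1, "B", date(2009, 6, 1), 60),
    (1, "J", date(2009, 6, 1), 30), (1, "D", date(2009, 5, 15), 90),
    (1, "I", date(2009, 4, 28), 35), (2, "A", date(2009, 4, 30), 60),
    (2, "F", date(2009, 6, 15), 15), (2, "G", date(2009, 6, 13), 10),
    (2, "E", date(2009, 6, 1), 20), (2, "E", date(2009, 7, 10), 35),
    (3, "C", date(2009, 4, 28), 25), (3, "B", date(2009, 5, 23), 25),
    (3, "J", date(2009, 5, 29), 35), (3, "I", date(2009, 6, 13), 25),
    (3, "D", date(2009, 6, 9), 40), (4, "C", date(2009, 7, 30), 50),
    (4, "H", date(2009, 5, 14), 75), (4, "H", date(2009, 8, 2), 65),
    (4, "G", date(2009, 7, 23), 90), (4, "F", date(2009, 6, 16), 5),
]

def select_documents(docs, keyword):
    """Selection on Documents, projected to the join attribute."""
    return [url for url, content in docs if keyword in content]

def select_visits(visits, start, end):
    """Selection on UserVisits, projected to (URL, Duration)."""
    return [(url, dur) for _ip, url, day, dur in visits if start <= day <= end]

def sort_merge_join_sum(doc_urls, visits):
    """Sort both inputs on URL, merge them, and sum Duration of the matches.
    Assumes URLs are unique in Documents, as in the toy table."""
    left, right = sorted(doc_urls), sorted(visits)
    total, i = 0, 0
    for url, dur in right:
        while i < len(left) and left[i] < url:
            i += 1  # advance Documents until it catches up with this URL
        if i < len(left) and left[i] == url:
            total += dur  # aggregate as join results are produced
    return total

doc_urls = select_documents(documents, "car")
visits = select_visits(user_visits, date(2009, 5, 1), date(2009, 7, 31))
print(sort_merge_join_sum(doc_urls, visits))  # 445
```

Pushing the selections below the join leaves 8 Documents tuples and 16 UserVisits tuples to sort, instead of 10 and 20.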

Page 7: Scalable Approximate Query Processing

7

Selection

[Documents and UserVisits tables, query plan, and query as on the previous slide]

• Storage manager
• One thread for each table scan
• Project out unused columns

Page 8: Scalable Approximate Query Processing

8

Selection

• Tuples are pipelined into the join

Documents after selection (URL): A, B, C, E, F, G, I, J

UserVisits after selection and projection
URL  Duration
A    45
B    60
J    30
D    90
F    15
G    10
E    20
E    35
B    25
J    35
I    25
D    40
C    50
H    75
G    90
F    5

[Query plan and query as on the previous slides]

Page 9: Scalable Approximate Query Processing

9

Sort-Merge Join – Sort Phase

• Sort tuples on join attribute
• Write sorted runs to disk
• Buffer space: UV(8)

Documents, sorted (URL): A, B, C, E, F, G, I, J

UserVisits, first buffer     Run 1 (sorted)
URL  Duration                URL  Duration
A    45                      A    45
B    60                      B    60
J    30                      D    90
D    90                      E    20
F    15                      E    35
G    10                      F    15
E    20                      G    10
E    35                      J    30

UserVisits, second buffer    Run 2 (sorted)
URL  Duration                URL  Duration
B    25                      B    25
J    35                      C    50
I    25                      D    40
D    40                      F    5
C    50                      G    90
H    75                      H    75
G    90                      I    25
F    5                       J    35

[Query plan and query as on the previous slides]
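The sort phase above can be mimicked with a short Python sketch. It is only an illustration under simplifying assumptions: `make_sorted_runs` is a made-up helper, and the runs stay in memory where a real engine would write them to disk.

```python
# UserVisits tuples (URL, Duration) after selection, in arrival order (from the slide).
selected_visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
                   ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
                   ("C", 50), ("H", 75), ("G", 90), ("F", 5)]

BUFFER_TUPLES = 8  # "Buffer space: UV(8)" on the slide

def make_sorted_runs(tuples, buffer_size):
    """Fill the buffer, sort it on the join attribute (URL), emit it as one run."""
    runs = []
    for start in range(0, len(tuples), buffer_size):
        buffer = tuples[start:start + buffer_size]
        # A real engine would write each run to disk; here runs stay in memory.
        runs.append(sorted(buffer, key=lambda t: t[0]))
    return runs

run1, run2 = make_sorted_runs(selected_visits, BUFFER_TUPLES)
print(run1)
print(run2)
```

With 16 selected tuples and room for 8, the sketch produces exactly the two runs shown on the slide.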

Page 10: Scalable Approximate Query Processing

10

Sort-Merge Join – Merge Phase

[Query plan and query as on the previous slides]

Remaining in UserVisits run 1 (URL, Duration): D 90; E 20; E 35; F 15; G 10; J 30
Remaining in UserVisits run 2 (URL, Duration): C 50; D 40; F 5; G 90; H 75; I 25; J 35
Remaining in Documents run (URL): B, C, E, F, G, I, J
Merged UserVisits tuples (URL, Duration): B 25; B 60
Current match: Documents A with UserVisits A 45
Join output (Duration): 45
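The merge phase can be sketched with Python's `heapq.merge`, which lazily merges sorted inputs. This is an illustration only: `merge_join_sum` is a hypothetical helper, and the runs are in-memory lists instead of files read back from disk.

```python
import heapq

# Sorted runs produced by the sort phase (from the slides).
run1 = [("A", 45), ("B", 60), ("D", 90), ("E", 20), ("E", 35), ("F", 15), ("G", 10), ("J", 30)]
run2 = [("B", 25), ("C", 50), ("D", 40), ("F", 5), ("G", 90), ("H", 75), ("I", 25), ("J", 35)]
doc_run = ["A", "B", "C", "E", "F", "G", "I", "J"]  # selected Documents, sorted on URL

def merge_join_sum(doc_urls, visit_runs):
    """Merge the UserVisits runs, join with the sorted Documents run, sum Duration."""
    total, i = 0, 0
    partial_sums = []
    for url, dur in heapq.merge(*visit_runs):  # lazy k-way merge of sorted runs
        while i < len(doc_urls) and doc_urls[i] < url:
            i += 1  # skip Documents tuples that can no longer match
        if i < len(doc_urls) and doc_urls[i] == url:
            total += dur  # update the sum as join tuples are produced
            partial_sums.append(total)
    return total, partial_sums

total, partial_sums = merge_join_sum(doc_run, [run1, run2])
print(partial_sums[0], total)  # 45 445
```

The first partial sum (45, from URL A) and the final sum (445) agree with the Aggregation and Final Result slides.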

Page 11: Scalable Approximate Query Processing

11

Sort-Merge Join – Merge Phase

[Query plan and query as on the previous slides]

Remaining in UserVisits run 1 (URL, Duration): F 15; G 10; J 30
Remaining in UserVisits run 2 (URL, Duration): G 90; H 75; I 25; J 35
Remaining in Documents run (URL): F, G, I, J
Merged UserVisits tuples (URL, Duration): E 20; E 35; F 5
Current Documents tuple (URL): E
Other UserVisits tuples shown (URL, Duration): D 40; D 90

Page 12: Scalable Approximate Query Processing

12

Aggregation

• Update the sum as tuples are produced

Join output (Duration): 45
Running sum (Duration): 0, then 45

[Query plan and query as on the previous slides]

Page 13: Scalable Approximate Query Processing

13

Final Result

Join output (Duration): 45, 25, 60, 50, 20, 35, 15, 5, 10, 90, 25, 30, 35
SUM(Duration): 445

[Query plan and query as on the previous slides]

Page 14: Scalable Approximate Query Processing

14

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 15: Scalable Approximate Query Processing

15

What is the problem?

• TPC-H benchmark results (price / performance)
  – 10 TB scale
    • 928 hard disks (90 TB total storage capacity)
    • 16 × quad-core processors
    • 512 GB RAM
    • $1.5 million
  – Load time: 55 hours
  – Q1: linear scan over one table with aggregates on top
    • 1 query: 19 minutes
    • 9 queries: 3 hours (linear scaling)

Page 16: Scalable Approximate Query Processing

16

Approximate Query Processing

[Figure: query result versus time, comparing traditional query processing with a result estimate and its confidence bounds]

SELECT SUM f(r1 • r2 • … • rn)
FROM R1 as r1, R2 as r2, …, Rn as rn

Page 17: Scalable Approximate Query Processing

17

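DBO System Architecture [Rusu et al. 2008]

[Figure: DBO architecture. Components: DB Engine running the query plan (σ over UV and D, Σ on top) and producing the query result; Levelwise Step Controller; In-Memory Join (UV' ⋈ D'); Estimation Module producing the result and confidence bounds as the approximate answer. Data-flow steps are numbered 1 to 7.]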

Page 18: Scalable Approximate Query Processing

18

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 19: Scalable Approximate Query Processing

19

Sampling [Dobra, Jermaine, Rusu & Xu 2009]

[Documents and UserVisits tables, query plan, and query as on the "Query Execution" slide]

• Control, coordinate & schedule data flow between operators
• Embed randomness in each operator

Page 20: Scalable Approximate Query Processing

Sampling – Selection

• Data in random order
• Assign random timestamp to tuples
• Controller schedules data flow between operators

Documents (random order; X = rejected by the selection)
URL  Content  Timestamp
J    car      68
F    car      220
C    car      312
D    phone    X
A    car      389
B    car      447
G    car      515
H    PC       X
I    car      695
E    car      799

UserVisits (X = rejected by the selection)
IP  URL  Date      Duration  Timestamp
1   A    05-30-09  45        70
1   B    06-01-09  60        140
1   J    06-01-09  30        185
1   D    05-15-09  90        252
1   I    04-28-09  35        X
2   A    04-30-09  60        X
2   F    06-15-09  15        358
2   G    06-13-09  10        409
2   E    06-01-09  20        476
2   E    07-10-09  35        495
3   C    04-28-09  25        X
3   B    05-23-09  25        722
3   J    05-29-09  35        739
3   I    06-13-09  25        745
3   D    06-09-09  40        791
4   C    07-30-09  50        798
4   H    05-14-09  75        837
4   H    08-02-09  65        X
4   G    07-23-09  90        953
4   F    06-16-09  5         973

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]

Documents tuples after selection (URL, timestamp): J 68; F 220; C 312; A 389
  (then F 220; C 312; A 389; B 447 once J has moved to the In-Memory Join)
UserVisits tuples after selection (URL, Duration, timestamp): A 45 70; B 60 140; J 30 185; D 90 252
In-Memory Join – Documents (URL): J

Page 21: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

Documents tuples after selection (URL, timestamp): F 220; C 312; A 389; B 447
UserVisits tuples after selection (URL, Duration, timestamp): B 60 140; J 30 185; D 90 252; F 15 358
In-Memory Join – UserVisits (URL, Duration): A 45
In-Memory Join – Documents (URL): J

Page 22: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

Documents tuples after selection (URL, timestamp): F 220; C 312; A 389; B 447
UserVisits tuples after selection (URL, Duration, timestamp): D 90 252; F 15 358; G 10 409; E 20 476
In-Memory Join – UserVisits (URL, Duration): J 30
In-Memory Join – Documents (URL): J
Join match (URL, Duration): J 30

Page 23: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

Documents tuples after selection (URL, timestamp): G 515; I 695; E 799
UserVisits tuples after selection (URL, Duration, timestamp): B 25 722; J 35 739; I 25 745; D 40 791
In-Memory Join – UserVisits (URL, Duration): J 30; F 15
In-Memory Join – Documents (URL): J, F, C, A, B

50% input: 360; [-328, 1048], 95% probability

Page 24: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

Documents tuples after selection (URL, timestamp): E 799
UserVisits tuples after selection (URL, Duration, timestamp): I 25 745; D 40 791; C 50 798; H 75 837
In-Memory Join – UserVisits (URL, Duration): J 30; F 15; B 25; J 35
In-Memory Join – Documents (URL): J, F, C, A, B, G, I

Exceed In-Memory Join capacity (10 tuples)!
Eliminate tuples such that variance is minimized.

Page 25: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

Documents tuples after selection (URL, timestamp): E 799
UserVisits tuples after selection (URL, Duration, timestamp): I 25 745; D 40 791; C 50 798; H 75 837
In-Memory Join – UserVisits (URL, Duration): J 30; B 25; J 35
In-Memory Join – Documents (URL): J, A, B, G

74% input: 258; [-293, 808], 95% probability

Page 26: Scalable Approximate Query Processing

Sampling – Selection

[Documents and UserVisits tables with timestamps, query plan, query, and bullets as on the first "Sampling – Selection" slide]

In-Memory Join – UserVisits (URL, Duration): J 30; B 25; J 35; G 90
In-Memory Join – Documents (URL): J, A, B, G, E

All input: 448; [3, 892], 95% probability

Page 27: Scalable Approximate Query Processing

27

Sampling Estimation – Intermediate Levels

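• Query result estimator & variance estimator computed from result tuples found by In-Memory Join
• Confidence bounds derived with Central Limit Theorem
• Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join

[Equations garbled in extraction. Recoverable structure: an estimator X that sums f(t1 • t2 • … • tn) over t1 ∈ R1, t2 ∈ R2, …, tn ∈ Rn, restricted by an indicator I(·) on the tuple timestamps TS(t1), …, TS(tn); its expectation E[X], expressed through the sum of f over all tuple combinations; and the variance Var[X] = E[X^2] − E^2[X].]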
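To make the Central Limit Theorem step concrete, here is a minimal Python sketch of the simplest possible case: a uniform random sample without replacement from a single list of per-tuple contributions. It is not DBO's multi-relation estimator, whose formulas are not recoverable from this transcript. The function name `estimate_sum`, the normal quantile 1.96, and the fixed seed are all choices made for the illustration.

```python
import math
import random

def estimate_sum(population_size, sample, z=1.96):
    """Scale-up estimator for a SUM with CLT-based bounds (z=1.96 for ~95%).

    Assumes `sample` is a uniform random sample without replacement of
    per-tuple contributions to the sum.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled tuples to estimate variance")
    mean = sum(sample) / n
    sample_var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = population_size * mean
    # Variance of the scaled-up mean, with finite population correction.
    fpc = 1 - n / population_size
    half_width = z * population_size * math.sqrt(sample_var / n * fpc)
    return estimate, estimate - half_width, estimate + half_width

# Contribution of each of the 20 UserVisits tuples to SUM(Duration):
# Duration if the tuple passes both predicates and joins, otherwise 0.
contributions = [45, 60, 30, 0, 0, 0, 15, 10, 20, 35,
                 0, 25, 35, 25, 0, 50, 0, 0, 90, 5]

rng = random.Random(0)  # fixed seed, arbitrary choice for repeatability
half = rng.sample(contributions, 10)
print(estimate_sum(len(contributions), half))           # estimate and bounds from 50% of the input
print(estimate_sum(len(contributions), contributions))  # whole input: bounds collapse onto 445
```

With only a handful of tuples the interval is wide, which mirrors the wide bounds shown in the walkthrough; once the whole input has been seen, the finite population correction drives the width to zero.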

Page 28: Scalable Approximate Query Processing

28

Sampling – Join (Sort)

• Sort tuples on random function of join attribute

[Query plan and query as on the previous slides]

Documents (URL, random function value): J 888; F 67; C 489; A 227; B 987; G 51; I 342; E 739
  Run 1 (sorted): F 67; A 227; C 489; J 888
  Run 2 (sorted): G 51; I 342; E 739; B 987

UserVisits (URL, Duration, random function value):
  A 45 227; B 60 987; J 30 888; D 90 43; F 15 67; G 10 51; E 20 739; E 35 739;
  B 25 987; J 35 888; I 25 342; D 40 43; C 50 489; H 75 150; G 90 51; F 5 67
  Run 1 (sorted): D 90 43; G 10 51; F 15 67; A 45 227; E 20 739; E 35 739; J 30 888; B 60 987
  Run 2 (sorted): D 40 43; G 90 51; F 5 67; H 75 150; I 25 342; C 50 489; J 35 888; B 25 987

Page 29: Scalable Approximate Query Processing

29

Sampling – Join (Merge)

[Query plan and query as on the previous slides]

Running sum before the merge (Duration, random function value): 0 0

Remaining in UserVisits run 1: G 10 51; F 15 67; A 45 227; E 20 739; E 35 739; J 30 888; B 60 987
Remaining in UserVisits run 2: G 90 51; F 5 67; H 75 150; I 25 342; C 50 489; J 35 888; B 25 987
Documents run 1: F 67; A 227; C 489; J 888
Documents run 2: G 51; I 342; E 739; B 987

Merged UserVisits tuples: G 10 51; G 90 51
Merged Documents tuples: G 51; F 67
In-Memory Join: Documents G 51 with UserVisits G 10 51 and G 90 51
Join output (Duration, random function value): 10 51; 90 51
Running sum (Duration, random function value): 100 51

Page 30: Scalable Approximate Query Processing

30

Sampling – Join (Merge)

[Query plan and query as on the previous slides]

Remaining in UserVisits run 1: E 20 739; E 35 739; J 30 888; B 60 987
Remaining in UserVisits run 2: C 50 489; J 35 888; B 25 987
Documents run 1: C 489; J 888
Documents run 2: E 739; B 987

Merged UserVisits tuples: C 50 489; E 20 739; E 35 739
Merged Documents tuples: C 489; E 739
In-Memory Join: Documents C 489 with UserVisits C 50 489
Join output (Duration, random function value): 50 489
Running sum (Duration, random function value): 240 489

50% input: 468; [194, 741], 95% probability

Page 31: Scalable Approximate Query Processing

31

Sampling – Join (Merge)

[Query plan and query as on the previous slides]

Remaining in UserVisits run 1: B 60 987
Remaining in UserVisits run 2: B 25 987
Documents run 1: (empty)
Documents run 2: B 987

Merged UserVisits tuples: B 25 987; B 60 987
Merged Documents tuples: B 987
In-Memory Join: Documents B 987 with UserVisits B 25 987 and B 60 987
Join output (Duration, random function value): 25 987; 60 987
Running sum (Duration, random function value): 445 987

Page 32: Scalable Approximate Query Processing

32

Sampling Estimation – Upper Level

• Bernoulli sampling with probability given by domain fraction seen so far
• Consolidate tuples generated by same join key
• Solve optimization problem to minimize variance across levels
  – Keep confidence bounds stable
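The join walkthrough on the preceding slides (sort both inputs on a random function of the join attribute, then merge) can be replayed with a small Python sketch. The random function values are copied from the slides; `running_sums` is a hypothetical helper, and the sketch reproduces only the partial sums, not DBO's scaled estimates or confidence bounds.

```python
# Random function value of each URL, copied from the "Sampling – Join (Sort)" slide.
hash_value = {"J": 888, "F": 67, "C": 489, "A": 227, "B": 987,
              "G": 51, "I": 342, "E": 739, "D": 43, "H": 150}

doc_urls = ["A", "B", "C", "E", "F", "G", "I", "J"]  # selected Documents
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]   # selected UserVisits

def running_sums(doc_urls, visits, hash_value):
    """Merge both inputs in hash order; report the partial sum after each key.
    Assumes distinct URLs map to distinct values, as in the toy data."""
    docs = sorted(doc_urls, key=hash_value.get)
    vis = sorted(visits, key=lambda t: hash_value[t[0]])
    sums, total, j = [], 0, 0
    for url in docs:
        h = hash_value[url]
        while j < len(vis) and hash_value[vis[j][0]] < h:
            j += 1  # UserVisits tuple with no Documents partner
        while j < len(vis) and hash_value[vis[j][0]] == h:
            total += vis[j][1]  # consolidate all tuples of the same join key
            j += 1
        sums.append((h, total))
    return sums

for h, total in running_sums(doc_urls, visits, hash_value):
    print(h, total)
```

The partial sums 100 at value 51, 240 at 489, and 445 at 987 match the merge slides. Because the sort order is random with respect to URL, every prefix of the merge covers a random subset of join keys, which is what makes the Bernoulli-sampling view in the bullets above applicable.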

Page 33: Scalable Approximate Query Processing

33

Contributions

• Design & implement DBO, first online analytical processing engine
  – Provide estimates & confidence bounds throughout entire query execution
  – SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
• Design & analyze fastest convergent estimation method for online aggregation
  – Statistics & optimization techniques

Page 34: Scalable Approximate Query Processing

34

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 35: Scalable Approximate Query Processing

35

Sketches

[Documents and UserVisits tables, query plan, and query as on the "Query Execution" slide]

• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate

Page 36: Scalable Approximate Query Processing

36

Sketches

[Documents and UserVisits tables and query as on the "Query Execution" slide]

Sketch S1, 3 buckets

URL        A  B  C  D  E  F  G  H  I  J
S1 sign    +  -  -  -  -  +  +  +  -  -
S1 bucket  1  2  3  1  1  2  2  3  3  3

Update for Documents tuple A (sign +, bucket 1)

Bucket     1  2  3
S1 before  0  0  0
S1 after   1  0  0

Page 37: Scalable Approximate Query Processing

37

Sketches

[Documents and UserVisits tables and query as on the "Query Execution" slide]

URL        A  B  C  D  E  F  G  H  I  J
S1 sign    +  -  -  -  -  +  +  +  -  -
S1 bucket  1  2  3  1  1  2  2  3  3  3

Documents sketch so far

Bucket  1  2  3
S1      1  0  0

Update for UserVisits tuple (A, 45) (sign +, bucket 1)

Bucket     1   2  3
S1 before  0   0  0
S1 after   45  0  0

Page 38: Scalable Approximate Query Processing

38

Sketches

[Documents and UserVisits tables and query as on the "Query Execution" slide]

URL        A  B  C  D  E  F  G  H  I  J
S1 sign    +  -  -  -  -  +  +  +  -  -
S1 bucket  1  2  3  1  1  2  2  3  3  3

Bucket                 1     2   3
S1 (Documents)         0     1   -3
S1 (UserVisits)        -140  35  -65

Estimate from S1: 230

Page 39: Scalable Approximate Query Processing

39

Sketches

[Documents and UserVisits tables and query as on the "Query Execution" slide]

URL        A  B  C  D  E  F  G  H  I  J
S1 sign    +  -  -  -  -  +  +  +  -  -
S2 sign    +  -  +  -  +  -  +  -  +  -
S3 sign    -  -  -  +  +  -  +  +  -  +
S1 bucket  1  2  3  1  1  2  2  3  3  3
S2 bucket  3  3  2  1  2  1  2  1  3  2
S3 bucket  1  1  2  1  3  1  3  2  3  2

Documents sketches
Bucket  1   2  3
S1      0   1  -3
S2      -1  2  1
S3      -3  0  1

UserVisits sketches
Bucket  1     2    3
S1      -140  35   -65
S2      -225  140  -15
S3      -20   90   130

Estimates: S1 = 230, S2 = 490, S3 = 190

230; [-416, 876], 95% probability
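The sketch walkthrough can be replayed in Python. The sketch below is a minimal illustration that copies the sign and bucket assignments from the slides into lookup strings; in a real implementation they would come from hash functions and a pseudo-random generator, and the helper names `build_sketch` and `estimate` are invented for the example.

```python
from statistics import median

URLS = "ABCDEFGHIJ"
# Sign and bucket of every URL for sketches S1..S3, copied from the slide.
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]
BUCKETS = ["1231122333", "3321212132", "1121313232"]

def build_sketch(items, row, num_buckets=3):
    """Add sign * weight of each (url, weight) item to the url's bucket."""
    counters = [0] * num_buckets
    for url, weight in items:
        k = URLS.index(url)
        sign = 1 if SIGNS[row][k] == "+" else -1
        counters[int(BUCKETS[row][k]) - 1] += sign * weight
    return counters

def estimate(sketch_a, sketch_b):
    """Bucket-wise product of two sketches built with the same signs and buckets."""
    return sum(a * b for a, b in zip(sketch_a, sketch_b))

# Tuples that pass the selections; Documents have weight 1, UserVisits weight Duration.
docs = [(url, 1) for url in "ABCEFGIJ"]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]

estimates = [estimate(build_sketch(docs, r), build_sketch(visits, r)) for r in range(3)]
print(estimates, median(estimates))  # [230, 490, 190] 230
```

Each sketch needs a single pass over its relation and three counters, regardless of input size. The three estimates scatter around the exact answer 445, which is why several independent sketches are combined.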

Page 40: Scalable Approximate Query Processing

40

Sketches Estimation

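• Two random processes
  – Bucket selection: $h : D \to H$
  – Sign: $\xi : D \to \{-1, +1\}$
• Sketch update
  $\mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] = \mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] + \xi(t_i.\mathrm{join}) \cdot f_i(t_i), \quad \forall t_i \in R_i,\ i \in \{1, 2\}$
• Estimator
  $E\Big[\sum_{h \in H} \mathrm{Sk}(R_1)[h] \cdot \mathrm{Sk}(R_2)[h]\Big] = \sum_{t_1 \in R_1} \sum_{t_2 \in R_2} f(t_1 \bullet t_2)$
• Confidence bounds
  – Multiple independent sketches
  – Chebyshev & Chernoff inequalities (worst-case)
  – Median Central Limit Theorem, Student-t distribution (statistics)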

Page 41: Scalable Approximate Query Processing

41

Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]

• Detailed comparison of generating schemes
  – Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    • Degree of independence as function of seed size
    • Fast range-summable
  – Empirical evaluation
    • Generating time is a few processor cycles
• Identify EH3 as generator for sketches
  – Lowest possible degree of independence
  – 7.3 ns to generate single number

Page 42: Scalable Approximate Query Processing

42

Statistical Analysis [Rusu & Dobra 2007a, 2008]

• Detailed comparison of sketch estimators
  – Same accuracy (worst-case analysis)
  – Statistical analysis
    • Distribution (probability density function)
    • Higher frequency moments (kurtosis)
    • Confidence bounds
  – Empirical evaluation
    • Data skew, correlation, memory usage, update time
• Identify Fast-AGMS as most reliable scheme
  – Accurate over entire range of data
  – Small memory footprint, fast update time

Page 43: Scalable Approximate Query Processing

43

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 44: Scalable Approximate Query Processing

44

Sketches over Samples [Rusu & Dobra 2009]

• Data is random on disk
• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate
• Provide estimates at any point

[Query plan, query, and UserVisits table as on the "Query Execution" slide]

Documents as stored on disk (URL, Content): J car; F car; C car; D phone; A car; B car; G car; H PC; I car; E car

Page 45: Scalable Approximate Query Processing

45

Sketches over Samples

[Query, UserVisits table, and Documents order on disk as on the previous slide]

URL        A  B  C  D  E  F  G  H  I  J
S1 sign    +  -  -  -  -  +  +  +  -  -
S2 sign    +  -  +  -  +  -  +  -  +  -
S3 sign    -  -  -  +  +  -  +  +  -  +
S1 bucket  1  2  3  1  1  2  2  3  3  3
S2 bucket  3  3  2  1  2  1  2  1  3  2
S3 bucket  1  1  2  1  3  1  3  2  3  2

Documents sketches
Bucket  1   2  3
S1      1   1  -2
S2      -1  0  1
S3      -2  0  0

UserVisits sketches
Bucket  1     2    3
S1      -100  -35  -30
S2      -105  35   -15
S3      -30   30   65

Estimates: S1 = -300, S2 = 360, S3 = 240

50% input: 100; [-2382, 2582], 95% probability
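The numbers on this slide can be reproduced with a short Python sketch that builds the sketches over the first half of each relation only. Two points are assumptions rather than statements from the slides: that the sample is the first 50% of each table in its on-disk order, and that the raw sketch product is scaled by the inverse of the two sampling fractions. Both happen to match the slide's estimates. The helper `build_sketch` is invented for the example.

```python
URLS = "ABCDEFGHIJ"
# Sign and bucket of every URL for sketches S1..S3, copied from the slide.
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]
BUCKETS = ["1231122333", "3321212132", "1121313232"]

def build_sketch(items, row, num_buckets=3):
    """Add sign * weight of each (url, weight) item to the url's bucket."""
    counters = [0] * num_buckets
    for url, weight in items:
        k = URLS.index(url)
        sign = 1 if SIGNS[row][k] == "+" else -1
        counters[int(BUCKETS[row][k]) - 1] += sign * weight
    return counters

# First half of Documents on disk is J, F, C, D, A; D fails the 'car' predicate.
doc_sample = [(url, 1) for url in "JFCA"]
# First 10 UserVisits tuples; the two with April dates fail the date predicate.
visit_sample = [("A", 45), ("B", 60), ("J", 30), ("D", 90),
                ("F", 15), ("G", 10), ("E", 20), ("E", 35)]

# Assumption: scale by the inverse of both sampling fractions (0.5 each).
scale = 1 / (0.5 * 0.5)

scaled = []
for r in range(3):
    d, v = build_sketch(doc_sample, r), build_sketch(visit_sample, r)
    scaled.append(scale * sum(a * b for a, b in zip(d, v)))

print(scaled, sum(scaled) / len(scaled))  # [-300.0, 360.0, 240.0] 100.0
```

The average of the three scaled estimates is 100, the value reported for 50% of the input. The very wide interval on the slide reflects that two sources of randomness, sampling and sketching, now compound.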

Page 46: Scalable Approximate Query Processing

46

Sketches over Samples – Estimation

• Define estimator over two completely different random processes & analyze statistically
  – Sampling – random partition, tuple domain
  – Sketches – random projection, frequency domain
  – Consider correlation between multiple sketches that share same sample
  – Moment generating functions
• Generic analysis independent of sampling process
  – Bernoulli sampling
  – Sampling without replacement
  – Sampling with replacement

Page 47: Scalable Approximate Query Processing

47

Sketches over Samples – Analysis

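$E[X] = C \sum_{i \in D} E[f'_i] \, E[g'_i]$

$X = C \sum_{i \in D} \sum_{j \in D} f'_i \, g'_j \, \xi_i \, \xi_j$

$\mathrm{Var}[X] = C^2 \Big[ \sum_{i \in D} E[f'^2_i] \sum_{j \in D} E[g'^2_j] + 2 \sum_{i \in D} \sum_{j \in D} E[f'_i f'_j] \, E[g'_i g'_j] - 2 \sum_{i \in D} E[f'^2_i] \, E[g'^2_i] - \Big( \sum_{i \in D} E[f'_i] \, E[g'_i] \Big)^2 \Big]$

Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]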

Page 48: Scalable Approximate Query Processing

48

Conclusions

• Data explosion
  – Cheap, high-capacity storage
  – Current processing technology is too expensive for the performance it provides
• Framework for online analytical processing
  – DBO system architecture
    • Embed randomization into data processing
    • Provide estimates and bounds at any time
  – Approximation methods
    • Sampling – most flexible
    • Sketches – single pass
    • Sketches over samples – fastest

Page 49: Scalable Approximate Query Processing

49

Future Work

• Short term
  – Define & design query optimization for DBO
  – Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
  – Generalize sketches to multiple relations
  – Find optimal amount of data to sketch
  – Fully integrate sketches into DBO system
• Medium term
  – Develop data aggregation & approximation techniques for other types of architectures
    • Multicore processors, GPUs
    • Distributed processing (Map-Reduce, Hadoop, …)
• Long term
  – Design & build scalable analytic processing system
    • Aggregation & approximation

Page 50: Scalable Approximate Query Processing

50

Publications

• A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009.
• F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009.
• F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo).
• F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008.
• F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007.
• F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007.
• F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.