Post on 04-Jan-2016

Xiaodan Wang, Randal Burns
Department of Computer Science
Johns Hopkins University

Tanu Malik
Cyber Center
Purdue University

LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases

LifeRaft: Data-Driven, Batch Processing

BETTER LUCK NEXT TIME!

Problem

[Figure: queries Q1, Q2, Q3, Q4]

Goals

Eliminate redundant I/O to improve query throughput

Batch queries that exhibit data sharing
– Pre-process queries to identify data sharing
– Co-schedule queries that access the same data
– Access contentious data first to maximize sharing

Starvation resistance
– Avoid indefinite queuing times (response time)
– Enforce some constraints on completion order

Target Applications

Data-intensive scan queries
– Executed against a clustered index
– Clustered and federated databases (e.g. joins that correlate multiple nodes)

Peta-scale astronomy (Pan-STARRS)
– Data are partitioned spatially
– Many queries scan the full DB and last hours or days

Cross-match
– Probabilistic spatial join across multiple databases

Filter and Refine

Filter queries
– Pre-process queries to determine join buckets
– Buckets B1,…,Bn and queries Q1,…,Qm
– Workload Wij denotes the objects from Qi that overlap Bj

Refinement
– Read buckets one at a time
– Sort-merge join (sort by HTM ID)
– Query-specific predicates applied on output tuples
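The filter and refine steps above can be sketched as follows. This is a minimal illustration, not the LifeRaft implementation: the function names, the representation of buckets as contiguous HTM ID ranges, and the use of exact HTM ID equality as the join condition are all assumptions made for the example (the actual cross-match is a probabilistic spatial join). HTM IDs are assumed to be unique integers, and every ID is assumed to be at or above the first bucket's start.

```python
from bisect import bisect_right
from collections import defaultdict


def filter_queries(queries, bucket_starts):
    """Filter step: build workload[j][query] = sorted HTM IDs of that
    query's objects that fall into bucket Bj.

    queries: dict mapping a query name to its objects' HTM IDs
             (hypothetical input form).
    bucket_starts: sorted list holding the first HTM ID of each bucket
             (assumed range partitioning).
    """
    workload = defaultdict(dict)
    for qname, htm_ids in queries.items():
        for htm_id in htm_ids:
            # Index of the last bucket whose start is <= htm_id.
            j = bisect_right(bucket_starts, htm_id) - 1
            workload[j].setdefault(qname, []).append(htm_id)
    # Sort each per-bucket workload so the refine step can merge-join it.
    for per_query in workload.values():
        for ids in per_query.values():
            ids.sort()
    return workload


def merge_join(bucket_ids, query_ids):
    """Refine step: sort-merge join of two sorted lists of HTM IDs.
    Returns the IDs present in both (equality stands in for the real
    spatial predicate)."""
    out, a, b = [], 0, 0
    while a < len(bucket_ids) and b < len(query_ids):
        if bucket_ids[a] < query_ids[b]:
            a += 1
        elif bucket_ids[a] > query_ids[b]:
            b += 1
        else:
            out.append(bucket_ids[a])
            a += 1
            b += 1
    return out
```

A bucket is read once and then merge-joined against the workload of every query that overlaps it, which is where the I/O sharing comes from. Query-specific predicates would then be applied to the joined tuples.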

Workload Throughput Metric

Schedule buckets greedily in order of decreasing workload throughput

Exploits data regions that experience contention

May starve requests
– Favors buckets experiencing frequent reuse
– No guarantee a particular bucket or query receives service
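As a rough sketch of greedy selection by workload throughput: the slides do not give the formula, so the definition below (pending objects divided by the estimated cost of servicing the bucket, where a cached bucket skips the I/O cost) is an assumption, as are the function names and the default cost constants.

```python
def workload_throughput(pending_objects, cached,
                        io_cost=1.0, join_cost_per_object=0.01):
    """Objects joined per unit of time if this bucket is serviced next.
    Assumed cost model: one bucket read (skipped when the bucket is
    cached) plus a per-object join cost."""
    cost = (0.0 if cached else io_cost) + join_cost_per_object * pending_objects
    return pending_objects / cost if cost > 0 else float("inf")


def pick_bucket(pending, cache):
    """Greedy choice: the bucket with the highest workload throughput.

    pending: dict mapping bucket name to the number of queued objects
             (summed over all queries overlapping that bucket).
    cache:   set of bucket names currently held in memory.
    """
    return max(pending,
               key=lambda b: workload_throughput(pending[b], b in cache))
```

Because the choice depends only on the current queue contents, a bucket with a small workload can be passed over indefinitely while busier buckets keep arriving, which is the starvation risk noted above.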

Aged Workload Throughput Metric

Inspired by disk-drive head scheduling

Balance arrival order (low response time) with contention (high throughput)

Adaptive trade-offs based on workload saturation
– Maximize rate at which objects are joined during saturated workloads
– Enforce completion order (queuing times) to prevent indefinite starvation during low saturation
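One way to picture the trade-off is a linear blend of contention and age controlled by a bias between 0 and 1, matching the later slides where bias 0 is contention based and bias 1 is age based. The blend below is an illustrative assumption, not the formula from the talk; the normalization of both terms to [0, 1] and all names are choices made for this sketch.

```python
def aged_score(throughput, age, bias, max_throughput, max_age):
    """Blend normalized throughput and normalized age.
    bias = 0 ranks purely by contention, bias = 1 purely by age."""
    t = throughput / max_throughput if max_throughput > 0 else 0.0
    a = age / max_age if max_age > 0 else 0.0
    return (1.0 - bias) * t + bias * a


def pick_bucket_aged(candidates, bias):
    """candidates: dict mapping bucket name to a pair
    (workload throughput, age in seconds of its oldest queued request)."""
    max_t = max(t for t, _ in candidates.values())
    max_a = max(a for _, a in candidates.values())
    return max(
        candidates,
        key=lambda b: aged_score(*candidates[b], bias, max_t, max_a),
    )
```

With a lightly contended but long-waiting bucket and a heavily contended fresh one, bias 0 selects the contended bucket and bias 1 selects the old one; intermediate values shift the balance between the two.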

Scheduling Behavior

[Figure: buckets B1 to B8 and the regions covered by Qi, Qj, Qk]

Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6, Qj7, Qj8
Qj – Qj5, Qj6, Qj7, Qj8
Qk

Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1 sec
- Cache size of 2
- Join cost is negligible

Arrival order with no sharing

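[Figure: buckets B1 to B8 and the regions covered by Qi, Qj, Qk]

Timeline entries (sub-query on bucket):
Qi1 on B1, Qi2 on B2, Qi3 on B3
Qj1 on B1, Qj3 on B3, Qj4 on B4, Qj6 on B6, Qj7 on B7, Qj8 on B8
Qk1 on B1, Qk4 on B4, Qk8 on B8

Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End

Completion Times:
Qi – 3 sec
Qj – 8 sec
Qk – 13 sec
Avg – 8 sec
Tp – .2 qry/sec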
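The no-sharing baseline can be checked with a few lines of arithmetic. The sketch below assumes queries arrive 1 sec apart, each sub-query costs one 1-sec bucket read, queries run strictly in arrival order, and completion time is measured from a query's arrival. The bucket counts are assumptions: 3 for Qi and 6 for Qj follow the sub-division shown earlier, while 6 for Qk is chosen only because it reproduces the reported figures.

```python
def simulate_no_sharing(arrivals, bucket_counts, io_time=1):
    """Serve queries one after another in arrival order; every sub-query
    pays for its own bucket read (no sharing, no cache hits).
    Returns per-query completion times (measured from arrival) and the
    time at which the last query finishes."""
    clock = 0
    completion = {}
    for name in sorted(arrivals, key=arrivals.get):
        start = max(clock, arrivals[name])
        clock = start + bucket_counts[name] * io_time
        completion[name] = clock - arrivals[name]
    return completion, clock


arrivals = {"Qi": 0, "Qj": 1, "Qk": 2}        # inter-query time of 1 sec
bucket_counts = {"Qi": 3, "Qj": 6, "Qk": 6}   # Qk's count is an assumption

completion, makespan = simulate_no_sharing(arrivals, bucket_counts)
average = sum(completion.values()) / len(completion)
throughput = len(completion) / makespan

print(completion)   # {'Qi': 3, 'Qj': 8, 'Qk': 13}
print(average)      # 8.0
print(throughput)   # 0.2
```

Under these assumptions the 15 bucket reads take 15 sec in total, so three queries complete at a rate of .2 qry/sec.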

Age based scheduling (bias 1)

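[Figure: buckets B1 to B8 and the regions covered by Qi, Qj, Qk]

Timeline entries (sub-queries sharing one bucket read are listed together):
Qi1 on B1
Qi2 on B2
Qi5 on B5
Qi3, Qj3 on B3
Qj1, Qk1 on B1
Qj4, Qk4 on B4
Qj6, Qk6 on B6
Qj7, Qk7 on B7
Qj8, Qk8 on B8

Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End

Completion Times:
Qi – 3 sec
Qj – 7 sec
Qk – 7 sec
Avg – 5.6 sec
Tp – .33 qry/sec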

Contention based scheduling (bias 0)

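[Figure: buckets B1 to B8 and the regions covered by Qi, Qj, Qk]

Timeline entries (sub-queries sharing one bucket read are listed together):
Qi1 on B1
Qi2 on B2
Qi3, Qj3 on B3
Qk5 on B5
Qj1, Qk1 on B1
Qj4, Qk4 on B4
Qj6, Qk6 on B6
Qj7, Qk7 on B7
Qj8, Qk8 on B8

Events: Qi Arr, Qj Arr, Qk Arr, Qi End, Qj End, Qk End

Completion Times:
Qi – 7 sec
Qj – 5 sec
Qk – 6 sec
Avg – 6 sec (5.6)
Tp – .38 qry/sec (.33)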

Throughput Performance

Tuning the age bias

Throughput performance gap grows while response time gap is insensitive to saturation

Increasing age bias is more attractive at low saturation

Parameter tuning using trade-off curves

Discussion

Impact of caching strategies

Workload overflow
– Large intermediate join results
– Migrate pairs of workload and bucket

Beyond completion order
– Higher priority for interactive queries

Batch processing in a clustered environment

P. Agrawal, D. Kifer, and C. Olston. Scheduling Shared Scans of Large Data Files. In VLDB, 2008.

WHAT ABOUT US?

Filter and refine

Partition data into buckets

Average Response Time

Outline

Motivation
– Goals for data-driven, batch scheduling
– Target application (SkyQuery)

LifeRaft scheduler
– Filter and refine queries
– Throughput maximizing metric
– Starvation resistance
– Differences in outcomes

Workload adaptive parameter selection
