Xiaodan WangDepartment of Computer ScienceJohns Hopkins University
Processing Data Intensive Queries
in Scientific Database Federations
Processing Data Intensive Queries in Scientific Database Federations
Problem
Data avalanche in scientific databases– Exponential growth in data size (Pan-STARRS)– Accumulation of data at multiple data sources (clustered and
federated databases)
Exploring massive, widely distributed data– Joins to find correlations across multiple databases– Queries are data intensive: large transfers over the network,
and scan large portions of the data– Query throughput limits scale of exploration
To improve overall query throughput but potentially sacrifice performance of individual queries
Processing Data Intensive Queries in Scientific Database Federations
Target Application
SkyQuery Federation of Astronomy Databases– Dozens of multi-terabyte databases across three Continents– Queries that perform full db scans lasting hours or days– Intermediate join results that are hundreds of MBs– Scalability concerns both in data size and number of sites
Cross-match– Probabilistic spatial join across multiple databases– Join results are accumulated, shipped from site to site, and
delivered to scientists
Processing Data Intensive Queries in Scientific Database Federations
Cross-Match Workload
A forward looking analysis shows that network dominates 90% of performance
A quarter of the cross-match queries execute for minutes to several hours
Processing Data Intensive Queries in Scientific Database Federations
Incorporating Network Structure
Processing Data Intensive Queries in Scientific Database Federations
Network-Aware Join Processing
Capture heterogeneity in global-scale federations– Metric to exploit high throughput paths– Decentralized, local optimizations using aggregate stats– Routing at the application layer– Two-approximate, MST-based solution with extensions
that employ semi-joins and explore bushy plans– Clustering to explore trade-offs with computation cost
Over a ten-fold reduction in network utilization for large joins
Processing Data Intensive Queries in Scientific Database Federations
A Case for Batch Processing
Top ten buckets accessed by 61% of queries and reuse occur close temporally
2% of buckets capture more than half of the workload and should be cached
Processing Data Intensive Queries in Scientific Database Federations
LifeRaft: Data-Driven Batch Proc.
Eliminate redundant I/O to improve query throughput
Batch queries with that exhibit data sharing
– Pre-process queries to identify data sharing
– Co-schedule queries that access the same data
– Access contentious data first to maximize sharing
– Improves performance by two-fold
Processing Data Intensive Queries in Scientific Database Federations
Discussion
Cache replacement for LifeRaft– Benefits contentions data regions that experience reuse
(Cache hit for LifeRaft is 40% compared with 7% for arrival order processing)
– Evaluate strategies that exploit I/O behavior of batch workloads (segmented strategy)
Buffering and workload overflow– Large intermediate join results– Migrate pairs of workload and bucket
Better support for interactive queries– Short and selective queries that focus on small region– Indefinite queuing times in presence of batch workloads
Processing Data Intensive Queries in Scientific Database Federations
Discussion (cont.)
Batch processing in a distributed environment– Network-aware scheduling does not consider
computation cost– Batch processing for a single system environment
Federating LifeRaft– Coordinate exec. of query that join multiple DBs– Batch proc. requires databases to buffer results– Maximize overall batch size while alleviating
memory used for buffering and network cost
Processing Data Intensive Queries in Scientific Database Federations
Exploring Alt. Join Schedules
Processing Data Intensive Queries in Scientific Database Federations
Discussion (cont.)
Explore both join schedules and opportunities for batching simultaneously– Bushy and semi-join plans increase computation while
clustering decrease computation– Skew in join workload (ie. sites close to end user)– Quantify trade-offs with computation cost (ie. number of
buckets in batch processing)
Users submit cross-match queries in batches Applying LifeRaft to other data-intensive, temporal-
spatial data such as Turbulence database
Processing Data Intensive Queries in Scientific Database Federations
Supplementary Slides
Processing Data Intensive Queries in Scientific Database Federations
Cross-Match Queries
Join by increasing cardinality (count *)– Minimal I/O– Fewer bytes on the network
Query
Mediator
Probe Query
ResultResult
Result
Count: 30Count: 100Count: 800
Processing Data Intensive Queries in Scientific Database Federations
Spanning Tree Approximation (STA)
B
C
A
D
E
F
GH
Processing Data Intensive Queries in Scientific Database Federations
STA: Find MST
B
C
A
D
E
F
GH
Processing Data Intensive Queries in Scientific Database Federations
STA: Join Using Paths on the MST
B
C
A
D
E
F
GH
1
2
3
54
6
7
9
8 10
12
11
13
Processing Data Intensive Queries in Scientific Database Federations
Filter and refine
Partition data into buckets
Processing Data Intensive Queries in Scientific Database Federations
Scheduling Behavior
Qi – Qi1, Qi2, Qi3
B1 B2 B3 B4 B5 B6 B7 B8
Qi Qj Qk
Sub-divide queries by bucket:
Qj – Qj3, Qj4, Qj5, Qj6 , Qj7, Qj8
Assumptions:- Inter-query time of 1 sec- I/O for each bucket of 1 sec- Cache size of 2- Join cost is negligibleQj – Qj5, Qj6 , Qj7, Qj8
Qk
Processing Data Intensive Queries in Scientific Database Federations
Arrival order with no sharing
Qi1
B1
Qi Arr
Qi2
B2
Qi3
B3
Qj1
B1
Qj Arr Qk Arr
Qj3
B3
Qi End
Qj4
B4
Qj6
B6
Qj7
B7
Qj8
B8
Qj End
Qk1
B1
Qk4
B4
Qk8
B8
Qk End
Qi – 3 sec
Completion Times:
Qj – 8 sec Qk – 13 sec Avg – 8 sec
B1 B2 B3 B4 B5 B6 B7 B8
Qi Qj QkQk
…
Tp – .2 qry/sec
Processing Data Intensive Queries in Scientific Database Federations
Age based scheduling (bias 1)
Qi1
B1
Qi Arr
Qi2
B2
Qi5
B5
Qi3Qj3
B3
Qj Arr Qk Arr Qi EndQj End
Qk End
Qj1Qk1
B1
Qj4Qk4
B4
Qj6Qk6
B6
Qi – 3 sec
Completion Times:
Qj – 7 sec Qk – 7 sec Avg – 5.6 sec Tp – .33 qry/sec
B1 B2 B3 B4 B5 B6 B7 B8
Qi Qj QkQk
Qj8Qk8
B8
Qj7Qk7
B7
Processing Data Intensive Queries in Scientific Database Federations
Contention based scheduling (bias 0)
Qi1
B1
Qi Arr
Qi2
B2
Qi3Qj3
B3
Qj Arr Qk Arr Qi EndQj End
Qk5
B5
Qk End
Qj1Qk1Qj4Qk4
B1 B4
Qj6Qk6
B6
Qj7Qk7
B7
Qi – 7 sec
Completion Times:
Qj – 5 sec Qk – 6 sec Avg – 6 sec Tp – .38 qry/sec
B1 B2 B3 B4 B5 B6 B7 B8
Qi Qj QkQk
Qj8Qk8
B8
(5.6) (.33)
Processing Data Intensive Queries in Scientific Database Federations
Parameter tuning using trade-off curves
Processing Data Intensive Queries in Scientific Database Federations
Tuning theage bias
Throughput performance gap grows while response time gap is insensitive to saturation
Increasing age bias is more attractive at low saturation