Technische Universität München
Workload-Aware Data Partitioning in Community-Driven Data Grids
Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper
Department of Computer Science, Technische Universität München
Germany
2009-03-24 EDBT 2009 – Workload-Aware Data Partitioning 2
Should I Split or Replicate?
• Many challenges and opportunities in e-science for database research
– High-throughput data management
– Correlation of distributed data sources
• Community-driven data grids
– Dealing with data skew and query hot spots
– Workload-awareness by employing cost model during partitioning
Query Load Balancing via Partitioning
Query Load Balancing via Replication
The AstroGrid-D Project
• German Astronomy Community Grid http://www.gac-grid.org/
• Funded by the German Ministry of Education and Research
• Part of D-Grid
Upcoming Data-Intensive Applications
• Alex Szalay, Jim Gray (Nature, 2006): “Science in an exponential world”
• Data rates
– Terabytes a day/night
– Petabytes a year
• LHC
• LSST
• LOFAR
• Pan-STARRS
The Multiwavelength Milky Way
http://adc.gsfc.nasa.gov/mw/
Research Challenges
• Directly deal with Terabyte/Petabyte-scale data sets
• Integrate with existing community infrastructures
• High throughput for growing user communities
Current Sharing in Data Grids
• Data autonomy
• Policies allow partners to access data
• Each institution ensures
– Availability (replication)
– Scalability
• Various organizational structures [Venugopal et al. 2006]:
– Centralized
– Hierarchical
– Federated
– Hybrid
Community-Driven Data Grids (HiSbase)
“Distribute by Region – not by Archive!”
Mapping Data to Nodes
Workload-Aware Training Phase
• Incorporate query traces during training phase
• Base partitioning scheme on
– Data load
– Query load
• Challenges
– Balance query load without losing data load balancing
– Approximate real query hot spots from query sample
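
As a rough illustration of basing a partitioning scheme on both data load and query load, the following minimal sketch greedily splits the heaviest region of a one-dimensional key space until a target number of partitions is reached. Everything in it is an assumption made for illustration and is not taken from the slides: the function name `train_partitions`, the one-dimensional key space, the midpoint split, and the weight `objects * (1 + queries)`, which is chosen here only so that regions without queries still carry their data load.

```python
import heapq


def train_partitions(points, queries, k, lo=0.0, hi=1.0):
    """Greedily split the heaviest interval at its midpoint until k intervals exist.

    points  -- sample of 1-D data keys (hypothetical training data)
    queries -- sample of (q_lo, q_hi) query ranges (hypothetical query trace)
    """

    def weight(a, b):
        # Data load: sample points falling into the half-open interval [a, b).
        n_obj = sum(a <= p < b for p in points)
        # Query load: sample queries overlapping [a, b).
        n_qry = sum(q_lo < b and a <= q_hi for q_lo, q_hi in queries)
        # Assumed combination: keeps data load visible where no queries fall.
        return n_obj * (1 + n_qry)

    # Max-heap via negated weights; each entry is (-weight, start, end).
    heap = [(-weight(lo, hi), lo, hi)]
    while len(heap) < k:
        _, a, b = heapq.heappop(heap)
        mid = (a + b) / 2
        heapq.heappush(heap, (-weight(a, mid), a, mid))
        heapq.heappush(heap, (-weight(mid, b), mid, b))
    return sorted((a, b) for _, a, b in heap)


# A query hot spot near 0.1-0.2 attracts the finer partitions.
sample_points = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]
sample_queries = [(0.1, 0.2)] * 3
partitions = train_partitions(sample_points, sample_queries, k=4)
```

With this sample, the region around the hot spot ends up split more finely than the rest of the key space, while the sparsely queried upper half stays in one piece.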
Dealing with Query Hot Spots
• Query skew triggered by increased interest in particular subsets of the data
• Two well-known query load balancing techniques:
– Data partitioning
– Data replication
• Finding trade-offs between both
When to Split (Partition) or to Replicate
• Considers partition characteristics
– Amount of data (few/many data points)
– Number of queries (few/many queries)
– Extent of regions and queries (small/big queries)

| Data points | Few queries, small | Few queries, big | Many queries, small | Many queries, big |
|---|---|---|---|---|
| Few | ─ | ─ | SPLIT | REPLICATE |
| Many | SPLIT | SPLIT | SPLIT | REPLICATE |
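
The decision table can be read as a small rule set, sketched below. The function name `decide` and the boolean inputs are hypothetical; the slides do not say where the thresholds between few/many or small/big lie, so the sketch takes those classifications as given.

```python
def decide(many_points: bool, many_queries: bool, big_queries: bool) -> str:
    """Return the action the table assigns to a region.

    The three flags are assumed to be precomputed; how "many" and "big"
    are determined is not specified in the slides.
    """
    if many_queries and big_queries:
        # Many big queries: splitting would not separate them, so replicate.
        return "REPLICATE"
    if many_points or many_queries:
        # Either much data or many small queries: split the region.
        return "SPLIT"
    # Few data points and few queries: no action needed.
    return "NONE"


# Enumerate all eight table cells.
table = {
    (points, queries, big): decide(points, queries, big)
    for points in (False, True)
    for queries in (False, True)
    for big in (False, True)
}
```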
Region Weight Functions
• Data only (#objects in a region)
• Queries only (#queries in a region)
• Scaled queries
– Approximate “real” extent of hot spot
– Avoid overfitting to training query set
• Heat of a region (#objects * #queries)
• Extents of regions and queries
– Replicate when many big queries
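
A minimal sketch of the data, queries, scaled-queries, and heat weight functions for two-dimensional regions follows. The `Box` type, the function names, and the choice to scale a query's side lengths around its center are assumptions for illustration; the slides do not state whether a scaling factor applies to side length or to area, and the extent-based function is omitted because its definition is not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle, used for both regions and queries (hypothetical)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def intersects(self, other: "Box") -> bool:
        return (self.x_min <= other.x_max and other.x_min <= self.x_max
                and self.y_min <= other.y_max and other.y_min <= self.y_max)


def scale_query(q: Box, factor: float) -> Box:
    """Enlarge a query around its center; side lengths grow by `factor` (assumed)."""
    cx, cy = (q.x_min + q.x_max) / 2, (q.y_min + q.y_max) / 2
    half_w = (q.x_max - q.x_min) / 2 * factor
    half_h = (q.y_max - q.y_min) / 2 * factor
    return Box(cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def weight_data(region: Box, points, queries) -> int:
    """Data only: number of sample objects inside the region."""
    return sum(region.contains(x, y) for x, y in points)


def weight_queries(region: Box, points, queries) -> int:
    """Queries only: number of sample queries intersecting the region."""
    return sum(region.intersects(q) for q in queries)


def weight_heat(region: Box, points, queries) -> int:
    """Heat: #objects * #queries."""
    return weight_data(region, points, queries) * weight_queries(region, points, queries)


region = Box(0, 0, 10, 10)
points = [(1, 1), (5, 5), (20, 20)]
queries = [Box(4, 4, 6, 6), Box(12, 12, 13, 13)]
scaled = [scale_query(q, 6) for q in queries]
```

In this example the second query lies just outside the region, so it only contributes to the region's weight once the queries are scaled, which is the sense in which scaling approximates the wider extent of a hot spot.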
Evaluation
• Weight functions: data, heat, extent
• Data sets (observational, simulation)
• Workloads (SDSS query log, synthetic)
• Partitioning Scheme Properties
– Load distribution
– Communication overhead
• Throughput Measurements
– Distributed setup
– FreePastry simulator
Load Distribution
• Uniform data set from the Millennium simulation
• Workload with extreme hot spot
• In the following:
– 1024 partitions
– Heat of a region (#data * #queries)
– Normalized across all partitioning schemes
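
One plausible way to make heat values comparable across partitioning schemes is to divide every partition's heat by the largest heat observed in any scheme. The slides do not spell out the normalization, so the sketch below is an assumption, as are the function name `normalize_heat` and the input layout.

```python
def normalize_heat(schemes):
    """Normalize per-partition heat by the global maximum over all schemes.

    schemes -- dict mapping a scheme name to a list of
               (num_objects, num_queries) tuples, one per partition (hypothetical layout)
    """
    # Heat of a partition: #data * #queries.
    heat = {name: [n * q for n, q in parts] for name, parts in schemes.items()}
    peak = max((h for values in heat.values() for h in values), default=0)
    if peak == 0:
        # No load anywhere: report zeros rather than dividing by zero.
        return {name: [0.0] * len(values) for name, values in heat.items()}
    return {name: [h / peak for h in values] for name, values in heat.items()}


example = {
    "query-unaware": [(10, 8), (10, 0)],
    "heat-based": [(4, 5), (16, 2)],
}
normalized = normalize_heat(example)
```

Sharing one scale means a value of 1.0 marks the hottest partition of any scheme, so a scheme whose values all stay well below 1.0 has spread the hot spot more evenly.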
Query-unaware Training
Training with Scaled Queries (scaled 50x)
Training with Scaled Queries (scaled 400x)
Heat-based, Extent-based Training
Communication Overhead for Pobs
Throughput for Pobs
Load Balancing During Runtime
• Complement workload-aware partitioning with runtime load-balancing
• Short-term peaks
– Master-slave approach
– Load monitoring
• Long-term trends
– Based on load monitoring
– Histogram evolution
Related Work
• On-line load balancing
• Hundreds of thousands to millions of nodes
• Reacting fast
• Treating objects individually
[Figure label: HiSbase]
Should I Split or Replicate?
• Many challenges and opportunities in e-science for database research
– High-throughput data management
– Correlation of distributed data sources
• Community-driven data grids
– Dealing with data skew and query hot spots
– Workload-awareness by employing cost model during partitioning
Get in Touch
• Database systems group, TU München
– Web site: http://www-db.in.tum.de
– E-mail: [email protected]
• The HiSbase project
– http://www-db.in.tum.de/research/projects/hisbase/
Thank You for Your Attention