parallelizing the data cube phd oral defence todd eavis july 23, 2003

Parallelizing the Data Cube

PhD Oral DefenceTodd Eavis

July 23, 2003

● Motivation for Parallel, Relational OLAP● Core Algorithms and Methods● Primary Systems Contributions● Experimental Evaluation and Results● Conclusions and Future Work

Overview

– Motivation for Parallel, Relational OLAP

– Core Algorithms and Methods

– Primary Systems Contributions

– Experimental Evaluation and Results

– Conclusions and Future Work

Why study OLAP and the Data Cube?

● On-line Analytical Processing: the foundation for a range of essential business applications

– sales and marketing analysis, planning and budgeting

● 4 billion dollar industry by 2005

● Data Cube: a core OLAP construct, first proposed in 1996 by Gray et al [GBLP], that supports sophisticated multi-dimensional data analysis

● Relevance to the Research Community? Results of Citeseer queries:

– OLAP: 797 papers– Data Cube: 362 papers

● Our interest: Data Cube Generation and Querying

Scale of OLAP Data Warehouses

● Average size of production data warehouses currently 700 GB [survey.com/Olap Report]

● Expected to reach 4 TB by 2004● 1/3 currently <= 50 GB. In two years, this number

will drop to just 6% ● Biggest data warehouses growing by a factor of 20

[Winter Report]● Biggest expected to exceed 100 TB within 2 years● Our Interest: Exploiting Parallel

Algorithms

Fundamental Design Alternatives● MOLAP (Multi-dimensional OLAP)

– Materialize data cube as a multi-dimensional array

– In theory: implicit indexing. In practice: hybrid schemes for sparse and dense regions

– Best for dense, low-dimensional spaces● ROLAP (Relational OLAP)

– Store data as relational tables

– Requires an explicit multi-dimensional index

– Scales well to higher dimensions and higher cardinalities

● Our Interest: Highly Scalable ROLAP model

Computing the Full Cube in Parallel

● Small number of previous projects [GC, LHL, MM, NWY]

– Speedup quite limited ● Our approach: Parallel Pipesort [DEHR2, DER3]

– Model 2d views as a “task graph”

– Create “Scan Pipelines” [AADG] as Minimum Cost Spanning Tree using O(dn(m + nlogn)) bipartite matching (n = nodes, m = edges)

– Partition task graph into sub-trees with O(p3d + p2d) augmented k-min-max [BSP] and distribute sub-trees to p processors

– Use over-sampling – S * p sub-trees – to improve load balance. High-low pairing in S – 1 rounds provides approximation to NP-Complete problem

– Use performance optimized algorithms for sorting, scanning, and I/O to generation local views

Computing Partial Cubes in Parallel● Important in practice for environments with higher

dimensions and/or specific visualization needs

● Little previous work, only partial solutions [BR,GC,SAG]

● Our approach: Greedy algorithms for “Schedule Tree” construction [DER2, DER5]

– Solution consists of algorithms for generating efficieint “Essential Trees” (red) and algorithms for adding beneficial non-selected nodes (blue)

– Greedy method: record “ state” information in Plan Objects. Incrementally add nodes with maximum benefit

– Pre-sorting “candidate” views by estimated size can reduce run-time from O(n3) to O(n2)

– O(d*n) heuristic extensions for higher dimensional space. A “confidence factor” β limits risk

A Parallel Date Cube Query Engine● Views must be indexed prior to access

● Related work: sequential r-trees for data cube [RYR] and general purpose parallel r-trees [SL]

● Our Approach: Parallel RCUBE [DER1, DER6]

– Records ordered as per Hilbert Space Filling Curve

– P-processor round-robin record striping

– Construct Partial r-tree indexes on each node, “packing” page blocks in Hilbert order

● Parallel Query Engine

– Combines indexing and OLAP post-processing (query transformation, parallel Sample Sort, record permutation, etc.)

– Uses surrogate views to support Partial Cubes

– Supports “linear” dimension hierarchies

The Virtual Data Cube

● Motivation: Hide the complexity of Data Cube algorithms and implementation

● Requires no knowledge of:– Format or extent of indexing– Degree of materialization (full or

partial)– Representation of hierarchies– Physical order of view attributes– Degree of parallelism

Systems Overview

● Full, robust Data Cube “prototype” [DER4]

● Approximately 20,000 lines of code

– C/C++, LEDA, MPI, STL● Template-based graph algorithms

● Designed for, and evaluated on, contemporary parallel machines:

– “Shared nothing” Linux cluster (Dalhousie)

– “Shared disk” SunFire multi-processor (HPCVL)

● Supporting systems include:

– Flex/Bison based data generator

– Batch query generator

– View Subset generator

Key Performance Issues

● Dynamic selection of “best” sorting algorithm

– Radix sort versus quicksort● Minimization of data movement

– Use of horizontal and vertical indirection● New pipeline aggregation algorithm

– “Lazy” aggregation ● Streamlined I/O

– I/O manager

– Independent I/O and computation threads

Costing model● Sophisticated cost model, common to

both full and partial cube [DEHR1]

● Based upon view size estimator

– Probabilistic counting technique● Experimentally supported metrics for:

– Dynamic Sorting (linear time versus comparison based)

– In-memory scanning and data movement

– Machine specific “Read” and “Write” I/O

● Dynamically considers impact of computation versus I/O

A Better Search Strategy

● Standard r-tree search strategy employs Depth First Search

● Our approach: Linear Breadth First Search

– Map the search algorithm to the linearly ordered levels of the packed index

– Resolve query with a left-to-right, top-to-bottom walk of the tree

– Disk head never moves backwards

– Resolution consists of a sequence of clustered scans

– Degrades gracefully to a sequential scan of index + sequential scan of data

Experimental Evaluation● Default test environment includes:

– 16 to 24 processors

– 2 million records/80 MB

– 4 to 14 dimensions

– Random query batches

– 24-node Linux cluster, 16-node SunFire MP (disk array)

Full Cube Partial Cube Query Processing

● Parallel Speedup approaching linear for all components

● Efficiency between 80 and 95%

Full Cube Evaluation

● Shared Disk

● 80 – 90% efficiency

● Disk array is bottleneck

● Over sampling factor

● SF = 2 consistently best

● Optimized pipeline processing

● Order of magnitude improvement

Partial Cube Evaluation

● Using partial cube algorithms for full cube

● All within 6% of “best” benchmark

● Recursive algorithm with 0.1%

● Partial cube of 3 dimensions or less

● Reductions over “naïve” method of 65 – 70%

● Tree pruning with confidence factor (on 14 d)

● Can eliminate up to 60% of original tree

● Virtually no reduction in tree quality

Query Evaluation

● Ratio of blocks retrieved to required seeks

● Random query batches

● Up to 140:1 for large, sparse spaces

● Record retrieval imbalance (16 processors)

● Only 0.3% from optimal load balance

● Overhead of using surrogate views (1 to 16 processors)

● Run time on materialized views versus time when those views were unavailable

● ROLAP a viable alternative to MOLAP in parallel setting

● Partial cubes can be efficiently generated● ROLAP cubes can be efficiently indexed● Virtual cube abstraction can be efficiently

supported

Thesis Conclusions

● First parallel ROLAP system in the Data Cube literature

● A balanced approach to data cube research

– Algorithm design

– Systems engineering

– Extensive performance analysis● Evaluated on contemporary parallel machines

– Commodity-style shared nothing cluster

– Shared disk architectures● Integration of three “independent” data cube research projects

into a single cohesive OLAP framework – the Virtual Cube

Research Highlights

Future Work

● Automated partial cube specification– Extension of “virtual cube”

● Parallel Query optimization– In addition to range queries or linear hierarchies– High volume query environments

● OLAP visualization● New projects are building on the current base

– Generation of Iceberg Cubes– Mining of association rules

Thank You!

DEHR1 F. Dehne, T. Eavis, S. Hambrusch and A. Rau-Chaplin, Parallelizing The Data Cube, Parallel and Distributed Databases: An International Journal, 2001

DER1 F. Dehne, Todd Eavis, A. Rau-Chaplin, Distributed Multi-dimensional ROLAP Indexing for the Data Cube, CCGrid, 2003.

DER2 F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data Cubes for Parallel Data Warehousing Applications, Euro PVM-MPI, 2001.

DER3 F. Dehne, T. Eavis, and A. Rau-Chaplin, Coarse Grained Parallel On-Line Analytical Processing (OLAP) For Data Mining, ICCS, 2001.

DER4 F. Dehne, T. Eavis, and A. Rau-Chaplin, A Cluster Architecture for Parallel Data Warehousing, CCGrid, 2001.

DEHR2 F. Dehne, S. Hambrusch, T. Eavis, and A. Rau-Chaplin, Parallelizing The Data Cube, ICDT, 2001.

CHER Y. Chen, F.Dehne, Todd Eavis, A. Rau-Chaplin, Parallel ROLAP Data Cube Construction on Shared Nothing Multi-Processors, IPDPS, 2003.

DER5 F Dehne, T.Eavis, and A. Rau-Chaplin, Computing Partial Data Cubes, Submitted to HICCS, 2003.

DER6 F Dehne, T.Eavis, and A. Rau-Chaplin, RCUBE: Parallel Multi-Dimensional ROLAP Indexing, Submitted to Journal of to Data Mining and Knowledge Discovery., 2003

GBLP J. Gray and A. Bosworth and A. Layman and H. Pirahesh", Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", ICDE, 1996

GC S. Goil and A. Choudhary, A Parallel Scalable Infrastructure for OLAP and Data Mining, IDEAS,1999

LHL H. Lu, X. Huang and Z. Li,, Computing Data Cubes Using Massively Parallel Processors, PCW '97, 1997

MM S. Muto and M. Kitsuregawa, A dynamic Load Balancing Strategy for Parallel Datacube Computation, WDW O,1999

NWY R. Ng, A. Wagner and Y. Yin, Iceberg-cube Computation with PC Clusters, SIGMOD, 2001

AADG S. Agarwal and R. Agrawal and P. Deshpande and A. Gupta and J. Naughton and R. Ramakrishnan and S. Sarawagi, On the Computation of Multidimensional aggregates, VLDB, 1996

BSP R. Becker and S. Schach and Y. Perl, A shifting algorithm for min-max tree partitioning, Journal of the ACM, 1982BR K. Beyer and R. Ramakrishnan, Bottom-up computation of sparse and Iceberg CUBEs, SIGMOD,1999RYR N. Roussopoulos and Y. Kotidis and M. Roussopolis, Cubetree: Organization of the bulk incremental updates on the data cube, SIGMOD, 1997 SL B. Schnitzer and S. Leutenegger, Master-client r-trees: a new parallel architecture, SSDM, 1999

References: Our own Virtual Data Cube Research

References: The Data Cube Literature

parallelizing the data cube phd oral defence todd eavis july 23, 2003

Documents

papers data cube

data cube generation

study olap

core olap construct

future work slide

querying slide

results conclusions

future work overview