Transcript: pages.cs.wisc.edu/~gayatrik/idiff.pdf
iDiff : Informative Summarization of
Differences in Multidimensional Aggregates
Introduction
• Support for mining tools (clustering, classification, association rules, time-series analysis) is available in major OLAP products
• Problem : To investigate HOW and WHY analysts explore the data cube and automate the process
• User-driven exploration and manual discovery get tedious and error-prone as data dimensionality and size increase
• iDiff – an operator that retrieves summarized reasons for drops or increases observed at aggregate level
• Goals:
• Scalability of the algorithm as dimensionality and size increase
• Feasibility of getting interactive answers
Multidimensional OLAP
• Dimensions, Dimension Hierarchies and Measures
• Example (Dimension and hierarchy)
• Product Dimension (Product code -> type -> category)
• Stores Dimension (Store Name -> city -> state)
• Measures
• Aggregated to various levels using functions like SUM, AVG, MAX, MIN, COUNT, VARIANCE etc.
• Operations • Select, drill-down, roll-up
Example Query
Dimensions and Hierarchy
Dataset
Geography x Year
Query: What is the reason for the drop in "Revenue" for the geography "Rest of World" from 1990 to 1991? Revenue increases steadily in all other cases.
Measure : Revenue
Result: a 12% increase (discounting the rows below, which are largely responsible for the decrease)
Another query and result
Just a 30% increase (discounting the rows underneath, which are mainly responsible for the increase)
Simple Formulation for Summarizing
• Simple approach: DetailSort
• Example: ga = aggregated quantity (revenue for all products, all platforms, geography = 'Rest of World', year = 1990)
• gb = aggregated quantity (revenue for all products, all platforms, geography = 'Rest of World', year = 1991)
• Therefore, ga and gb differ in only one dimension (year)
• Ca and Cb – subcubes obtained by expanding the common aggregated dimensions of ga and gb
• List the top 10 rows of detailed data showing the largest changes, sorted by the magnitude of the change
• Drawback: the listed rows may not account for a large part of the difference – many other entries are left to analyze
• Solution: summarize the rows with similar changes
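The DetailSort baseline described above reduces to a join of corresponding cells followed by a sort on the magnitude of change. A minimal sketch, assuming an illustrative dict-based cube representation (cell key -> measure):

```python
def detail_sort(cube_a, cube_b, top_k=10):
    """DetailSort baseline: join corresponding cells of the subcubes Ca
    and Cb, then list the top_k rows with the largest changes in the
    detailed data, sorted by the magnitude of the change."""
    joined = []
    for key in cube_a.keys() | cube_b.keys():
        va = cube_a.get(key, 0)   # measure in Ca (0 if the cell is missing)
        vb = cube_b.get(key, 0)   # measure in Cb
        joined.append((key, va, vb, vb - va))
    joined.sort(key=lambda row: abs(row[3]), reverse=True)
    return joined[:top_k]
```

As the slides point out, the top rows alone may not explain much of the aggregate difference, which motivates the summarization model that follows.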
DetailSort Approach
Summarize rows with similar change
Explains a difference of only 900
RATIO and ERROR
• The rows show a similar (though not identical) increase
• RATIO is computed as a summary statistic for the parent and applies to all its children
• ERROR in this case is the square root of the sum of squared errors of each detailed row
Model - Major Factors in Summarization
• Grouping Criteria
• How to measure the similarity of the relationship between corresponding values in cubes Ca and Cb
• For OLAP datasets, ‘ratio’ is found to be a better choice than ‘difference’
• Therefore, each group of cells is associated with a ratio value that best characterizes the members of its group
Model - Major Factors in Summarization
• Error Function – measures the error due to summarization
• For same ratio, higher weightage to larger differences
• For the same absolute difference, higher weightage to larger ratios
• Both magnitude and ratio are important
• Err(va, vb, r) = (vb – r va) log(vb / (r va))
• Missing Values
• If either va or vb is missing, the value chosen should be relative to the other
• If different values are used, we get different ratios and cannot summarize
• If va is missing, replace va with vb/F, and vice versa
• Without looking at the data, it is hard to pick an absolute value for the constant fraction F
• Therefore, the substituted values are calculated from the group ratio r such that the induced ratio is not very large or very small
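The error function's two weighting properties can be checked directly. A small sketch (the missing-value substitution va = vb/F is left out, since the slides derive F from the group ratio r rather than fixing it):

```python
import math

def err(va, vb, r):
    """Summarization error of one cell pair under group ratio r:
    Err(va, vb, r) = (vb - r*va) * log(vb / (r*va)).
    Zero exactly when vb = r*va; positive otherwise, since both
    factors always share the same sign."""
    return (vb - r * va) * math.log(vb / (r * va))

# Both magnitude and ratio matter:
assert err(10, 20, 2.0) == 0.0                  # cell fits the group ratio exactly
assert err(100, 200, 1.0) > err(10, 20, 1.0)    # same ratio, larger values -> larger error
assert err(10, 20, 1.0) > err(100, 110, 1.0)    # same difference, larger ratio -> larger error
```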
Model - Major Factors in Summarization
• Error Function
• Incrementally computable: expanding Err(va, vb, r) = (vb – r va) log(vb / (r va)) and summing over all rows gives four data-only sums:
• S1 = Σ vb log(vb / va)
• S2 = Σ vb
• S3 = Σ va log(va / vb)
• S4 = Σ va
• Total error = S1 – S2 log r + S3 r + S4 r log r
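Because the four sums depend only on the data, the total error can be re-evaluated for any candidate ratio r without rescanning the tuples. A sketch:

```python
import math

def partial_sums(pairs):
    """One pass over (va, vb) cell pairs accumulating:
    S1 = sum vb*log(vb/va),  S2 = sum vb,
    S3 = sum va*log(va/vb),  S4 = sum va."""
    s1 = s2 = s3 = s4 = 0.0
    for va, vb in pairs:
        s1 += vb * math.log(vb / va)
        s2 += vb
        s3 += va * math.log(va / vb)
        s4 += va
    return s1, s2, s3, s4

def total_error(sums, r):
    """Total error for ratio r from the sums alone:
    S1 - S2*log(r) + S3*r + S4*r*log(r)."""
    s1, s2, s3, s4 = sums
    return s1 - s2 * math.log(r) + s3 * r + s4 * r * math.log(r)
```

For any r this agrees with summing Err(va, vb, r) row by row, which is what makes searching over many candidate ratios in the algorithm section cheap.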
Model - Major Factors in Summarization
• Group Format
• Which subset of cells should be considered for grouping?
• Dimension hierarchies are natural boundaries for grouping
• A group could overlap with a cell belonging to the detailed answer
• Therefore, tuples belonging to the detailed answer are removed from the group
• Objective Function – choice of two objective functions:
• Minimizing total error given fixed answer size N
• Minimizing answer size N given fixed error limit
Model - Major Factors in Summarization
• Other query forms:
• One-sided error functions
• If the top-level row shows an increase, we are interested in seeing only rows with a significant increase, or vice versa
• The only time a row with the opposite change would be reported is when an intermediate parent (other than the top level) is also present in the answer
• Modify the error function to have a value of 0 if the sign of the change is opposite to that of the topmost parent and no other intermediate parents of the row are in the answer
• Non-additive aggregate functions
• Example: percentage increase in sales – detailed values are not added to obtain the value at the aggregate level
• Example query: Why is the percentage increase in sales higher for product A than for product B this year?
• Assume the sales for product B have also increased by the same percentage as product A, and calculate the error caused by that assumption
Algorithm – Minimize total error given fixed answer size N
• Dynamic programming approach
• (Ca | Cb) – set of tuples from Ca and Cb where corresponding cells are joined to get the measures from both Ca and Cb
• Scan detailed tuples sequentially while maintaining the best solution for slots 0 to N
• Single dimension with no hierarchy
• D(T, n, r) – total error for T tuples, answer size n, and final ratio r of the topmost row
• Finding the best value of the ratio r:
• Initial global ratio rg = gb / ga
• Start with a fixed number R of candidate ratios around this value, spanning rg/2 to 2 rg
• Maintain a histogram of ratio values that is updated as tuples arrive
• Periodically, pick R–1 new ratio values by dropping extreme values and averaging over the middle values
• Select the R most distinct values from the R previous values and the R–1 new values to update the candidate set
• When the algorithm ends, pick the solution corresponding to the value of r with the smallest cost
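For the flat (no-hierarchy) case, one reading of D(T, n, r) is: under a fixed ratio r, reporting the n rows with the largest individual error is optimal, and the rest stay summarized under r; trying each candidate ratio then yields the answer. A sketch under that reading (the geometric spacing of candidates is an assumption; the slides only say the candidates span rg/2 to 2 rg):

```python
import math

def err(va, vb, r):
    # Per-row summarization error from the model section.
    return (vb - r * va) * math.log(vb / (r * va))

def candidate_ratios(rg, R=9):
    # R candidate ratios around rg, spanning rg/2 to 2*rg.
    # Geometric spacing is an assumed choice, not specified in the slides.
    return [rg / 2 * 4 ** (i / (R - 1)) for i in range(R)]

def best_flat_answer(pairs, n, ratios):
    # Flat single dimension: for each candidate ratio r, report the n
    # rows with the largest error; the remaining rows stay summarized
    # under r.  Keep the cheapest (residual error, ratio) pair.
    best = None
    for r in ratios:
        errors = sorted((err(va, vb, r) for va, vb in pairs), reverse=True)
        residual = sum(errors[n:])   # error left after reporting n rows
        if best is None or residual < best[0]:
            best = (residual, r)
    return best
```

If most rows change by roughly the same factor, a candidate ratio near that factor absorbs them and only the outliers occupy answer slots.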
Algorithm – Minimize total error given fixed answer size N
• Single dimension with a hierarchy
• L-level hierarchy on the dimension
• Decide whether or not to include the aggregated tuple at each level in the answer
• When an aggregated tuple agg(T') – the parent of all tuples T' – is included, the default ratio of all its children T' not in the answer is the ratio induced by agg(T') instead of the outer global ratio
• D(T, n, r) = min(cost excluding the aggregate tuple, cost including the aggregate tuple)
• At each level, the best solution is found for the group of tuples with the same parent at that level of the hierarchy; the final answer is stored in the topmost node after all tuples are scanned
• L nodes, one for each of the L levels of the hierarchy; each node stores partial solutions for all N slots
• Tuples are scanned in an order such that all tuples within the same hierarchy appear together
• Each tuple is passed to the detailed (leaf) node, which updates the solution using the dynamic programming approach above
• On getting a tuple not in the same hierarchy, the node finds the best solution using the above equation and passes it to the node above; the same update procedure is followed at higher levels
Algorithm – Minimize total error given fixed answer size N
• Multiple dimensions with one or more levels of hierarchy on each
• Pick an order that will minimize total error – levels with higher similarity amongst their members are aggregated earlier
• For each level of each dimension, calculate the total error if all tuples at that level were summarized
• If the tuples within a level are similar, the error of summarizing them will be small
• Sort the levels of the dimensions based on this measure
• The similarity between tuples can be computed offline using the data of the entire cube, with a fixed ordering of dimensions stored, or computed dynamically per query for the subset involved in the query
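The level-ordering heuristic above can be sketched as: summarize each level wholesale under the ratio of its aggregated totals, and aggregate the most self-similar (lowest-error) levels first. The names and per-level input format below are illustrative assumptions:

```python
import math

def err(va, vb, r):
    # Per-row summarization error from the model section.
    return (vb - r * va) * math.log(vb / (r * va))

def level_error(pairs):
    # Error if every tuple at this level is summarized under one ratio,
    # taken here as the ratio of the aggregated totals.
    r = sum(vb for _, vb in pairs) / sum(va for va, _ in pairs)
    return sum(err(va, vb, r) for va, vb in pairs)

def order_levels(levels):
    # levels: {level name -> list of (va, vb) pairs at that level}.
    # Levels whose members behave most alike summarize cheaply and
    # should be aggregated earliest.
    return sorted(levels, key=lambda name: level_error(levels[name]))
```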
Algorithm – Minimize answer size given fixed error limit
• Provide the smallest summary that meets an error limit emax
• Single dimension with no hierarchies
• Scan all tuples from T
• If the error Err(va, vb, r) > emax, include the tuple in the final answer
• The optimal answer corresponds to the ratio r with the minimum answer size
• Multiple dimensions are handled in the same way as before
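Under the same per-row error function, the fixed-error-limit variant above is a single filtering pass per candidate ratio. A minimal sketch:

```python
import math

def err(va, vb, r):
    # Per-row summarization error from the model section.
    return (vb - r * va) * math.log(vb / (r * va))

def smallest_answer(pairs, emax, ratios):
    # Minimize answer size given error limit emax: under a ratio r, any
    # row whose error exceeds emax must be reported explicitly; the
    # optimal r is the one forcing the fewest rows into the answer.
    best = None
    for r in ratios:
        reported = [(va, vb) for va, vb in pairs if err(va, vb, r) > emax]
        if best is None or len(reported) < len(best[1]):
            best = (r, reported)
    return best  # (chosen ratio, rows that must appear explicitly)
```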
• Level Pruning
• Collect statistics on the number of tuples at various levels of detail
• If the subset involved in processing is very large, limit the level of detail from which processing starts
• Technique:
• A probabilistic counting method estimates the size of different levels from the statistics of the number of tuples at various levels of detail
• The size of the relevant levels of aggregation is found by scaling the estimated total size by the selectivity of the subcube with respect to the actual cube (selectivity – the reciprocal of the size of the level)
• These estimates are used to pick the level at which the system expects to complete within a specified threshold
Some Implementation Details
• In response to a query, the DIFF operator dynamically generates the correct SQL query and submits it to the OLAP data source
• This involves selecting the specified values at the specified aggregation level and sorting the data
• The iDiff operator is implemented as a stored procedure
Performance
• Factors that affected processing time:
• Number of tuples in the query result (Ca | Cb) – tuples from Ca and Cb where the corresponding cells have been joined to get the measures from both Ca and Cb
• Total number of levels in (Ca | Cb)
• Answer size N, which determines the number of slots per node
Query 1: 8000 tuples; Query 2: 15000 tuples