data warehousing & data mining · 4. queries 4.1 query processing 4.2 queries in dw / olap 4.3...

Data Warehousing & Data Mining
Wolf-Tilo Balke, Silviu Homoceanu
Institut für Informationssysteme, Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de


Post on 17-Aug-2020



4. Queries

4.1 Query processing
4.2 Queries in DW / OLAP
4.3 Physical Modeling

Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2

4.1 Query processing

• Queries are posed to the DBMS and processed before the actual evaluation

[Figure: query processor architecture. Labels: Application Programs, Object Code, Embedded DML Precompiler, DML Compiler, DDL Interpreter, Query Evaluation Engine, Storage Manager, Data]

4.1 Query processing

• How queries are answered
  – Queries are usually stated in a high-level declarative language such as SQL
    • For a relational DB, the query can be mapped to relational algebra (RA)
  – For evaluation, it has to be translated into a low-level execution plan
    • Expressions that can be used at the physical level of the file system
      – E.g., for an RDB, physical relational algebra
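The gap between the declarative query and its execution plan can be made visible in a small experiment. The sketch below is only an illustration, assuming SQLite (through Python's `sqlite3` module) as the DBMS; the table `sales` and the index `idx_sales_region` are hypothetical names that do not come from the slides.

```python
import sqlite3

# Hypothetical table and index names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Europe", 10.0), ("Asia", 20.0), ("Europe", 5.0)],
)

# The declarative query only states WHAT is wanted ...
query = "SELECT amount FROM sales WHERE region = 'Europe'"

# ... while EXPLAIN QUERY PLAN shows HOW SQLite intends to evaluate it.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
plan_details = [row[-1] for row in plan]  # last column is the readable plan step
for detail in plan_details:
    print(detail)

result = sorted(r[0] for r in conn.execute(query))
```

The exact wording of the plan steps depends on the SQLite version, so the sketch prints them rather than relying on a fixed text.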

4.1 Query processing

[Figure: query processing pipeline. Query → Parser & Translator → Relational Algebra Expression → Query Optimizer (uses Statistics) → Execution Plan → Evaluation Engine (uses Access Paths and Data) → Query Result]

4.1 Query processing

• Parsing and translation
  – Queries need to be translated
    • A scanner tokenizes the query
      – DB language keywords, table names, attribute names, etc.
    • The parser checks syntax and verifies relations, attributes, data types, etc.
    • The query is translated into its internal form
      – Translated into relational algebra
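As a rough illustration of the scanner step, the following sketch splits a query string into typed tokens. It is a toy, not the scanner of any real DBMS: the keyword set is a tiny assumed subset, and the token classes (`KEYWORD`, `NAME`, `NUMBER`, `STRING`, `SYMBOL`) are names invented for this example.

```python
import re

# Tiny illustrative subset of SQL keywords (an assumption of this sketch).
KEYWORDS = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "AND", "OR"}

TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_][A-Za-z_0-9]*)
  | (?P<STRING>'[^']*')
  | (?P<SYMBOL><=|>=|<>|[=<>(),.*;])
  | (?P<SPACE>\s+)
""", re.VERBOSE)


def tokenize(sql):
    """Return a list of (kind, text) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if m is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at position {pos}")
        pos = m.end()
        kind, text = m.lastgroup, m.group()
        if kind == "SPACE":
            continue
        # Names that match a keyword are reclassified as keywords.
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("SELECT region FROM sales WHERE amount > 10")
```

A parser would then consume this token list, check it against the grammar, and look up `sales`, `region` and `amount` in the catalog.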

4.1 Query processing

• Optimization
  – Several relational algebra expressions might lead to the same result
    • But different expressions might also result in very different performance
  – Query optimization is the heart of every database kernel
    • Finding the optimal plan may cost too much, but bad plans must be avoided by all means
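The point that equivalent expressions can differ widely in cost can be sketched with two plans for the same query. The toy relations `orders` and `stores` and the helper `join` are assumptions made for this example; the comparison only counts intermediate tuples and is not a real cost model.

```python
# Hypothetical toy relations:
#   orders(order_id, store_id, amount) and stores(store_id, region)
orders = [(i, i % 100, 10.0) for i in range(10_000)]
stores = [(s, "Europe" if s < 5 else "Other") for s in range(100)]


def join(left, right, left_key, right_key):
    """Simple hash join; returns the concatenated matching tuples."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


# Plan A: join first, then select (selection on region applied last).
joined_a = join(orders, stores, 1, 0)
plan_a = [t for t in joined_a if t[4] == "Europe"]

# Plan B: select first, then join (selection pushed down to stores).
europe_stores = [s for s in stores if s[1] == "Europe"]
plan_b = join(orders, europe_stores, 1, 0)

intermediate_a = len(joined_a)       # tuples built before the selection
intermediate_b = len(europe_stores)  # tuples that survive the early selection
```

Both plans return the same 500 tuples, but plan A builds 10,000 joined tuples first, whereas plan B only carries 5 store tuples into the join.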

4.1 Query processing

• Evaluation
  – The query-execution engine takes a query-evaluation plan, executes it, and returns the answers to the query
  – For the result of each operator a temporary file has to be created
    • Temporary files can be input for other operators
    • Storing the temporary files on disk is expensive, but necessary if the DB buffer is small

4.2 DW Queries

• DW queries are big queries
  – They involve a large portion of the data
  – Read-only queries
    • No updates
• Redundancy is a necessity
  – Materialized views, special-purpose indexes, de-normalized schemas
• Data is refreshed periodically
  – E.g., daily or weekly
• Their purpose is to analyze data
  – OLAP (OnLine Analytical Processing)

4.2 DW Queries

• OLAP usage fields
  – Management information
    • Sales per product group / area / year
  – Government
    • Population census
  – Scientific databases
    • Geo-, Bio-Informatics
  – Etc.
• Goal: response time of seconds / a few minutes

4.2 Why use DW

• ODS can also run analytical queries… but they are not so good at it
• OLTP and OLAP are to each other as water and oil
  – Lock conflicts: OLAP blocks OLTP
    • E.g., an OLAP query can block the sales activity of all the stores trying to update the DB
  – Database design:
    • OLTP - normalized
    • OLAP - de-normalized

4.2 OLTP vs. OLAP

• Tuning, optimization
  – OLTP - inter-query parallelism, heuristic optimization
  – OLAP - intra-query parallelism, full-fledged optimization
• Freshness of data
  – OLTP - serializability
  – OLAP - reproducibility
• Precision
  – OLTP - ACID
  – OLAP - sampling, confidence intervals

4.2 Why use DW

• The solution is to run OLTP and OLAP separately
• DW is a special sandbox for OLAP
  – As input it uses OLTP systems
  – DW aggregates and replicates data
    • Special schema
  – New data is periodically uploaded to the warehouse
  – Old data is deleted from the warehouse
  – Archiving is done by the OLTP system for legal reasons

4.2 Typical analytical requests

• Comparisons
  – Show me the sales per region for this year and compare them to those of the previous year to identify discrepancies
• Multidimensional ratios
  – Show me the contribution to weekly profit made by all items sold in the northeast stores between the 1st of May and the 7th of May
• Ranking and statistical profile
  – Show me sales, profit and average call volume per day for my 10 most profitable salespeople
• Custom consolidation
  – Show me the income statement by quarter for the last four quarters for my northeast region operations

4.2 Typical queries

• OLAP queries (schematic; join predicates between Fact and the dimension tables are omitted)

    SELECT d1.x, d2.y, d3.z, sum(f.t1), avg(f.t2)
    FROM Fact f, Dim1 d1, Dim2 d2, Dim3 d3
    WHERE d1.field > a AND d1.field < b AND d2.field = c
    GROUP BY d1.x, d2.y, d3.z;

• The idea is to
  – Select by attributes of dimensions
    • E.g., region = 'Europe'
  – Group by attributes of dimensions
    • E.g., region, month, quarter
  – Aggregate on measures
    • E.g., sum(price * volume)
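The select / group / aggregate pattern can be tried on a small star schema. The following is a minimal sketch, assuming SQLite through Python's `sqlite3` module; the tables `store_dim`, `time_dim` and `sales` and all their rows are hypothetical sample data, not taken from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical star schema: two dimension tables and one fact table.
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE time_dim  (time_id  INTEGER PRIMARY KEY, month TEXT, quarter TEXT);
    CREATE TABLE sales (store_id INTEGER, time_id INTEGER, price REAL, volume INTEGER);

    INSERT INTO store_dim VALUES (1, 'Europe'), (2, 'Europe'), (3, 'Asia');
    INSERT INTO time_dim  VALUES (1, 'Jan', 'Q1'), (2, 'Feb', 'Q1'), (3, 'Apr', 'Q2');
    INSERT INTO sales VALUES
        (1, 1, 2.0, 10), (2, 1, 3.0, 5), (1, 2, 2.0, 1),
        (2, 3, 4.0, 2), (3, 1, 9.0, 9);
""")

rows = conn.execute("""
    SELECT s.region, t.quarter, SUM(f.price * f.volume) AS turnover
    FROM sales f
    JOIN store_dim s ON f.store_id = s.store_id
    JOIN time_dim  t ON f.time_id  = t.time_id
    WHERE s.region = 'Europe'        -- select by a dimension attribute
    GROUP BY s.region, t.quarter     -- group by dimension attributes
    ORDER BY t.quarter               -- aggregate on the measure price * volume
""").fetchall()

for row in rows:
    print(row)
```

With this sample data the query yields one row per quarter for Europe: a turnover of 37.0 for Q1 and 8.0 for Q2.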

4.2 Codd’s OLAP rules

• How do we differentiate between OLAP and non-OLAP products? - OLAP rules
• Published in a controversial white paper
  – “Providing OLAP to the User-Analysts: An IT Mandate” (Arbor Software, 1993)
• Dr. Codd was accused of allowing his name to be used without putting much work into it

[Figure labels: Arbor Software, Hyperion Solutions, Oracle]

4.2 Codd’s OLAP rules

• Rules organization
  – 12 rules + 6 (extension rules) added in 1995
  – 4 feature groups
    • Basic features
    • Special features
    • Reporting features
    • Dimension control

4.2 Basic features

• 1) Multidimensional conceptual view
  – Data is viewed in multidimensional form in a matrix. An enterprise becomes multidimensional
    • E.g., profits could be viewed by region, product, time period or scenario (actual, budget, forecasts, etc.)
  – Advantages
    • Multidimensional models enable more straightforward manipulation of data
      – E.g., slice, dice, etc.

4.2 Basic features

• 2) Intuitive data manipulation
  – Existence of a GUI with drag-and-drop and other graphical facilities
  – Intuition is a vague term…

4.2 Basic features

• 3) Accessibility: OLAP as a mediator
  – Middleware between heterogeneous data sources and the OLAP front end
• 4) Batch extraction vs. interpretative extraction
  – OLAP has to have its own staging database
  – It should also allow live access to external data
  – Similar to what HOLAP is today

4.2 Basic features

• 5) OLAP analysis models
  – Categorical: typical descriptive statistics
    • Comparison of historical values
  – Exegetical: what we have been doing with spreadsheets (slice, dice, drill down)
    • Discovering reasons for what we found through the categorical model
  – Contemplative: what-if analysis
    • E.g., what is the effect on the company of closing the Alaska store?

4.2 Basic features

• 5) OLAP analysis models
  – Formulaic: goal-seeking models
    • You know the outcome you want but you don’t know how to get there
    • The model keeps changing parameters and doing contemplations until it gets the desired result or proves it is impossible
      – E.g., How can I increase the sales of bikinis in the Alaska store? The outcome can be:
        » Many solutions…
        » No solutions: Bikini sales in Alaska are doomed to failure
        » Unacceptable solutions: Close down all but the Alaska store

4.2 Basic features

• 6) Client/server architecture
  – Allow users to share data easily
• 7) Transparency
  – The client should not have to be aware of how connections to the OLAP engine or other data sources are made
• 8) Multiuser support
  – OLAP is read-only, therefore there is no need for transaction control
  – New OLAP systems allow data to be queried while it is being streamed from external sources

4.2 Special features

• Special features
  – 9) Treatment of non-normalized data
    • Can also load data from non-RDBMS sources
  – 10) Store OLAP results
    • OLAP data is expensive
    • Reconstructing it over and over from the live data is not a good idea
      – OLAP DB is a snapshot of the state of the data sources

4.2 Special features

• 11) Extraction of missing values
  – 2 kinds of missing values
    • NULL as in SQL, meaning we don’t know the value of the attribute
    • Missing value, meaning that the attribute will never have a value for that entity
• 12) Treatment of missing values
  – All missing values are ignored by the OLAP analyzer, regardless of their source
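SQL aggregate functions already behave in the spirit of rule 12: they skip NULL values instead of treating them as zero. The sketch below illustrates this with SQLite through Python's `sqlite3` module; the table `sales` and its three rows are hypothetical sample data. It illustrates standard SQL aggregation, not a specific OLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10.0), ("B", None), ("C", 20.0)],  # store B has no known amount
)

# COUNT(*) counts rows; COUNT(amount), SUM and AVG ignore the NULL.
count_all, count_known, total, average = conn.execute(
    "SELECT COUNT(*), COUNT(amount), SUM(amount), AVG(amount) FROM sales"
).fetchone()
```

The average is 15.0 (30.0 over the two known values), not 10.0 as it would be if the missing amount were counted as zero.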

4.2 Reporting features

• 13) Flexible reporting
  – Vague
• 14) Uniform reporting performance
  – Codd required that reporting performance not be significantly degraded by increasing the number of dimensions or the database size
  – Sounds more like a goal than a rule
• 15) Automatic adjustment of physical level
  – OLAP systems adjust their physical storage automatically
  – Dynamically adjusted HOLAP

4.2 Dimension control

• Dimension control
  – 16) Generic dimensionality
    • Each dimension must be equivalent in both its structure and operational capabilities
    • Controversial rule
  – 17) Unlimited dimensions and aggregation levels
    • Unlimited… is physically impossible, so we should settle for a large number
      – E.g., it should support at least 15 to 20 dimensions
  – 18) Unrestricted cross-dimensional operations
    • Operation is not the same as calculation
      – E.g., “What is Friday divided by red?” makes no sense as a calculation, but an operation on mixed data is possible: “How many red shirts did we sell on Friday?”

4.2 Codd’s OLAP rules

• Codd’s OLAP rules turned out not to be a success
• Other attempts to define OLAP and offer OLAP guides were made by
  – The OLAP Council
  – Analytical Solutions Forum
  – OLAP Solutions
• Nigel Pendse’s FASMI test

4.2 FASMI

• The FASMI test:
  – Fast
    • The system is targeted to respond to users within ~5 seconds
      – Complex analysis should take no longer than 20 seconds
    • This can be achieved with exotic hardware and lots of pre-calculated scenarios
  – Analysis
    • The system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keep it easy enough for the target user
  – Shared
    • The system should implement security requirements
    • Not all OLAP products are read-only

4.2 FASMI

• The FASMI test:
  – Multidimensional
    • The most important factor
    • Should support multidimensional conceptual views
    • Full support for hierarchies
  – Information
    • All the data and derived information (meta-data) needed
    • The question is how much input data they can handle, not how many GB they use to store it

4.2 OLAP operations

• Typical OLAP operations
  – Roll-up
  – Drill-down
  – Slice and dice
  – Pivot (rotate)
• Other operations
  – Aggregate functions
  – Ranking and comparing
  – Drill-across
  – Drill-through

4.2 Roll-up

• Roll-up (drill-up)
  – Taking the current aggregation level of fact values and doing a further aggregation
  – Summarize data by
    • Climbing up the hierarchy
    • Or by dimensional reduction
    • Or a mix of these 2 techniques
  – Used for obtaining an increased generalization
    • E.g., from Time.Week to Time.Year

4.2 Roll-up

• Roll-up operations can be classified into
  – Dimensional roll-ups
    • Are done solely on the fact table by dropping one or more dimensions, where the dimensions retained are represented by their keys (basic attributes of the attribute hierarchy)
      – E.g., drop Client dimension

[Figure: schema with fact Sales (measure Turnover) and the dimensions Store (Store, City, District, Region, Country), Article (Article, Prod. Group, Prod. Family, Prod. Categ), Time (Day, Week, Month, Quarter, Year) and Client]

4.2 Roll-up

• Hierarchical roll-ups
  – Are done on the fact table and some dimension tables by climbing up the attribute hierarchies of the dimensions whose hierarchies are used, while keeping at least one attribute of each dimension
    • E.g., climb the Time hierarchy to Quarter and the Article hierarchy to Prod. Group

[Figure: schema with fact Sales (measure Turnover) and the dimensions Store (Store, City, District, Region, Country), Article (Article, Prod. Group, Prod. Family, Prod. Categ), Time (Day, Week, Month, Quarter, Year) and Client]
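The two kinds of roll-up can be contrasted in SQL. This is a minimal sketch, assuming SQLite through Python's `sqlite3` module and a heavily simplified version of the Sales schema; the table names `article_dim` and `day_dim`, the column names and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical version of the Sales schema.
    CREATE TABLE article_dim (article_id INTEGER PRIMARY KEY, prod_group TEXT);
    CREATE TABLE day_dim     (day_id INTEGER PRIMARY KEY, quarter TEXT);
    CREATE TABLE sales (article_id INTEGER, day_id INTEGER,
                        client_id INTEGER, turnover REAL);

    INSERT INTO article_dim VALUES (1, 'Video'), (2, 'Video'), (3, 'Audio');
    INSERT INTO day_dim     VALUES (1, 'Q1'), (2, 'Q1'), (3, 'Q2');
    INSERT INTO sales VALUES
        (1, 1, 100, 10.0), (1, 1, 101, 5.0), (2, 2, 100, 7.0),
        (3, 2, 102, 4.0), (3, 3, 100, 6.0);
""")

# Dimensional roll-up: drop the Client dimension. The retained dimensions
# are represented by their keys, so only the fact table is touched.
dimensional = conn.execute("""
    SELECT article_id, day_id, SUM(turnover)
    FROM sales
    GROUP BY article_id, day_id
    ORDER BY article_id, day_id
""").fetchall()

# Hierarchical roll-up: climb Day -> Quarter and Article -> Prod. Group,
# which needs the dimension tables; Client is kept at its key level.
hierarchical = conn.execute("""
    SELECT a.prod_group, d.quarter, f.client_id, SUM(f.turnover)
    FROM sales f
    JOIN article_dim a ON f.article_id = a.article_id
    JOIN day_dim     d ON f.day_id     = d.day_id
    GROUP BY a.prod_group, d.quarter, f.client_id
    ORDER BY a.prod_group, d.quarter, f.client_id
""").fetchall()
```

The first query never joins a dimension table, while the second has to, because the attributes Quarter and Prod. Group live in the dimension hierarchies.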

4.2 Roll-up

• Climbing above the top
  – In the extreme case, a hierarchical roll-up above the top level of an attribute hierarchy (attribute “ALL”) can be viewed as converting to a dimensional roll-up

[Figure: example Article hierarchy with the levels Article, Prod. Group, Prod. Family, Category and ALL. Members shown: ALL; Electronics, Clothes; Video, Audio; Video recorder, Camcorder, TV; TR-34, TS-56]

4.2 Drill-down

• Drill-down (roll-down)
  – Reverse of roll-up
  – Represents a de-aggregate operation
    • From a higher level of summary to a lower level of summary (detailed data)
  – Introducing new dimensions
  – Requires the existence of materialized finer-grained data
    • You can’t drill if you don’t have the data

4.2 Roll-up drill-down example

€ by bar/drinker
            Jim  Bob  Mary
Joe’s        45   33    30
Salitos      50   36    42
Roots        38   31    40

Roll-up by bar gives

€ by drinker
            Jim  Bob  Mary
            133  100   112

Drill-down by brand gives

€ by brand/drinker
            Jim  Bob  Mary
Wolters      48   40    40
Becks        45   31    37
Krombacher   40   29    35
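The numbers of this example can be checked with a few lines of code. The sketch below stores each table as a mapping from cell coordinates to euros; the helper `roll_up` is a hypothetical name introduced only for this illustration.

```python
from collections import defaultdict

# (bar, drinker) -> euros, copied from the "€ by bar/drinker" table.
by_bar_drinker = {
    ("Joe's", "Jim"): 45, ("Joe's", "Bob"): 33, ("Joe's", "Mary"): 30,
    ("Salitos", "Jim"): 50, ("Salitos", "Bob"): 36, ("Salitos", "Mary"): 42,
    ("Roots", "Jim"): 38, ("Roots", "Bob"): 31, ("Roots", "Mary"): 40,
}

# (brand, drinker) -> euros, copied from the "€ by brand/drinker" table.
by_brand_drinker = {
    ("Wolters", "Jim"): 48, ("Wolters", "Bob"): 40, ("Wolters", "Mary"): 40,
    ("Becks", "Jim"): 45, ("Becks", "Bob"): 31, ("Becks", "Mary"): 37,
    ("Krombacher", "Jim"): 40, ("Krombacher", "Bob"): 29, ("Krombacher", "Mary"): 35,
}


def roll_up(cells, keep):
    """Sum away every coordinate whose position is not listed in keep."""
    totals = defaultdict(int)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in keep)] += value
    return dict(totals)


# Rolling up by bar keeps only the drinker coordinate (position 1).
by_drinker = roll_up(by_bar_drinker, keep=(1,))
```

Rolling up either detailed table yields the same totals per drinker. The reverse direction cannot be computed from `by_drinker` alone: a drill-down needs the finer-grained table to be available.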

4.2 Slice

• Slice
  – Reducing the number of dimensions by taking a projection of the facts on a proper subset of the dimensions, for some selected values of the dimensions that are being dropped

π_{StoreId, TimeId, Amount}(σ_{ArticleId = LaptopId}(Sales))

4.2 Slice

  – Amounts to an equality select condition
  – WHERE clause in SQL
    • E.g., slice Laptops

[Figure: data cube with the axes Product (Laptops, CellP.), Geography and Time (13.11.2008, 18.12.2008); sample cell value 818]
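A slice therefore maps to a selection with an equality condition followed by a projection that drops the sliced dimension. The following sketch runs such a query with SQLite through Python's `sqlite3` module; the column names mirror the relational algebra expression for the slice, while the key values and all rows are hypothetical sample data.

```python
import sqlite3

LAPTOP_ID, CELLPHONE_ID = 1, 2  # hypothetical article keys

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id INTEGER, time_id INTEGER, "
    "article_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (10, 1, LAPTOP_ID, 818.0),
        (10, 2, CELLPHONE_ID, 120.0),
        (11, 1, LAPTOP_ID, 400.0),
    ],
)

# Slice "Laptops": equality condition on the Product dimension, which then
# disappears from the result (projection on the remaining dimensions).
laptop_slice = conn.execute(
    "SELECT store_id, time_id, amount FROM sales "
    "WHERE article_id = ? ORDER BY store_id, time_id",
    (LAPTOP_ID,),
).fetchall()
```

The result keeps only the Store and Time coordinates, so a three-dimensional cube has been reduced to a two-dimensional one.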

4.2 Dice

• Dice
  – Amounts to a range select condition on one dimension, or to an equality select condition on more than one dimension
    • E.g., range SELECT

π_{StoreId, TimeId, Amount}(σ_{ArticleId ∈ {Laptop, CellP}}(Sales))

[Figure: data cube with the axes Product (Laptops, CellP.), Geography and Time (13.11.2008, 18.12.2008); sample cell value 818]

4.2 Dice

    • E.g., equality SELECT on 2 dimensions, Product and Time

π_{StoreId, Amount}(σ_{ArticleId = Laptop ∧ MonthID = December}(Sales))

[Figure: data cube with the axes Product (Laptops, CellP.), Geography and Time (December, January); sample cell value 818]
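Both variants of dice can be written as plain SQL selections. The sketch below uses SQLite through Python's `sqlite3` module; for readability it stores article and month names directly in a hypothetical `sales` table instead of keys, and all rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store_id INTEGER, month TEXT, article TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (10, "December", "Laptop", 818.0),
        (10, "January", "Laptop", 300.0),
        (11, "December", "CellP", 120.0),
        (11, "December", "Tablet", 50.0),
    ],
)

# Dice, variant 1: a set (range) condition on one dimension, Product.
range_dice = conn.execute(
    "SELECT store_id, month, amount FROM sales "
    "WHERE article IN ('Laptop', 'CellP') ORDER BY store_id, month"
).fetchall()

# Dice, variant 2: equality conditions on two dimensions, Product and Time.
equality_dice = conn.execute(
    "SELECT store_id, amount FROM sales "
    "WHERE article = 'Laptop' AND month = 'December'"
).fetchall()
```

The first query returns a sub-cube that still spans several products, whereas the second fixes two dimensions at once and leaves only the Store coordinate.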

4.2 Pivot

• Pivot (rotate)
  – Refers to re-arranging data for viewing purposes
    • E.g., display cities down the page and products across the page
  – The simplest view of pivoting is that it selects two dimensions to aggregate the measure
    • The aggregated values are often displayed in a grid where each point in the (x, y) coordinate system corresponds to an aggregated value of the measure
    • The x and y coordinate values are the values of the selected two dimensions
  – The result of pivoting is also called cross-tabulation

4.2 Pivot

• Consider pivoting the following data

Location
CityId  City
1       Well
2       Nels
3       Auck

Time
TimId  Day
1      Mon
2      Tue
3      Wed
4      Thu
5      Fri
6      Sat
7      Sun
8      Mon

Sales
CityId  PerId  TimId  Amnt
1       1      1      230
1       1      2      300
1       1      8      310
1       2      7       50
2       3      1      550
2       3      5      100
3       4      6      880
3       5      1       60
3       5      2       60
3       5      4      140

4.2 Pivot

• Pivoting on City and Day

             Mon  Tue  Wed  Thu  Fri  Sat  Sun  SubTotal
Auckland      60   60    0  140    0  880    0      1140
Nelson       550    0    0    0  100    0    0       650
Wellington   540  300    0    0    0    0   50       890
SubTotal    1150  360    0  140  100  880   50      2680

          Auck  Nels  Well  SubTotal
Mon         60   550   540      1150
Tue         60     0   300       360
Wed          0     0     0         0
Thu        140     0     0       140
Fri          0   100     0       100
Sat        880     0     0       880
Sun          0     0    50        50
SubTotal  1140   650   890      2680
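The cross-tabulation can be recomputed from the three base tables. The sketch below copies the Location, Time and Sales rows of the example into Python structures and aggregates the amount per (city, day) cell; the function name `pivot` and the variable names are choices made for this illustration only.

```python
from collections import defaultdict

# Dimension tables of the example (city names written out in full).
location = {1: "Wellington", 2: "Nelson", 3: "Auckland"}
time = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu",
        5: "Fri", 6: "Sat", 7: "Sun", 8: "Mon"}

# Fact table rows: (CityId, PerId, TimId, Amnt).
sales = [
    (1, 1, 1, 230), (1, 1, 2, 300), (1, 1, 8, 310), (1, 2, 7, 50),
    (2, 3, 1, 550), (2, 3, 5, 100),
    (3, 4, 6, 880), (3, 5, 1, 60), (3, 5, 2, 60), (3, 5, 4, 140),
]

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def pivot(rows):
    """Aggregate the amount for every (city, day) combination."""
    grid = defaultdict(int)
    for city_id, _per_id, tim_id, amount in rows:
        grid[(location[city_id], time[tim_id])] += amount
    return grid


grid = pivot(sales)
for city in sorted(location.values()):
    cells = [grid.get((city, day), 0) for day in DAYS]
    print(city, cells, sum(cells))
```

Swapping the roles of `city` and `day` in the printing loop produces the rotated table; the aggregated cells in `grid` stay the same, which is why pivoting is only a change of presentation.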

4.2 Typical analytical requests

• Analytical requests are hard to express

    SELECT f.region, z.month, sum(a.price * a.volume)
    FROM "Order" a, Time z, PoS f
    WHERE a.pos = f.name AND a.date = z.date
    GROUP BY f.region, z.month

  – Most analysts and decision makers won’t enjoy it
  – But wait… there are solutions
    • OLAP clients allow operations to be performed through GUIs

4.2 OLAP data visualization

• What do these operations look like for the user?
  – E.g., Crystal Decisions OLAP software
    • 2 dimensions… is trivial
    • E.g., Products by Store

[Figure: 2D grid with the axes Product dimension and Store dimension]

4.2 OLAP data visualization

• 3 dimensions
  – We can visualize sold quantity on 3 dimensions as layers

[Figure: layered grids with the axes Store dimension and Product dimension]

4.2 OLAP data visualization

• More dimensions are difficult to represent
  – If we introduce the Time dimension, a data cell could be represented by its 4 dimensions as follows:
    • Abc from the Supplier dimension
    • Batteries from the Products dimension
    • Uptown from the Store dimension
    • And Monday from the Time dimension

4.2 OLAP data visualization

• OLAP products represent data with 3 or more dimensions by reducing it to a 2D layout
  – By picking values for the dimensions which cannot be displayed
    • E.g., display the number sold of Products by any of the Stores on Monday

Page 50: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

• Another way is to nest several dimensions on the same axis

4.2 OLAP data visualization


Page 51: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

• OLAP reporting has to be very flexible

– The IBM approach to a web-based OLAP report

4.2 OLAP data visualization


Page 52: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.2 OLAP data visualization

• Drill-down operation

– Can be performed easily by going down the hierarchy and choosing the granularity
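A drill-down along a Time hierarchy can be sketched as re-aggregating the same facts at a finer level. The hierarchy year > month > day, the fact rows, and the function name `aggregate` are assumptions chosen for this example.

```python
from collections import defaultdict

# Fact rows: (year, month, day, quantity). Values are invented.
facts = [
    (2020, 1, 5, 10),
    (2020, 1, 20, 4),
    (2020, 2, 3, 6),
    (2021, 1, 9, 8),
]

# Time hierarchy, coarsest level first.
LEVELS = ("year", "month", "day")


def aggregate(facts, level):
    """Sum the quantity at the requested granularity of the Time hierarchy."""
    depth = LEVELS.index(level) + 1  # number of hierarchy columns to keep
    totals = defaultdict(int)
    for *time_key, qty in facts:
        totals[tuple(time_key[:depth])] += qty
    return dict(totals)


by_year = aggregate(facts, "year")    # coarse view
by_month = aggregate(facts, "month")  # drill down one level
print(by_year)
print(by_month)
```

Rolling up is the same call with a coarser level, which is why choosing the granularity is all the user has to do.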

Page 53: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

• Trends Visualization

– With the help of charts

4.2 OLAP data visualization


Page 54: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 Physical models

• We have seen how it looks at the user level and on the conceptual side

• But… how do operations translate from the user level downwards?

– Well… it depends on the physical model used

• DOLAP (Desktop OLAP)

• MOLAP (Multidimensional OLAP)

• ROLAP (Relational OLAP)

• HOLAP (Hybrid OLAP)

[Figure: OLAP, ROLAP, HOLAP, MOLAP and DOLAP arranged along a Time axis]

Page 55: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 Physical models

• DOLAP

– Developed as an extension to production system reports

– The idea behind it:

• A small hypercube is downloaded from a central point (data mart or DW)

• Multidimensional analysis is performed while disconnected from the data source

• The computation occurs on the client

– Requires little investment

– Most products are limited to a single user

– They lack the ability to manage large data sets

Page 56: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 Physical models

• MOLAP

– Presentation layer provides the multidimensional view

– The OLAP server stores data in a multidimensional structure

• Computation occurs in this layer during the loading step (not at query time)

[Figure: MOLAP architecture with Client, Presentation, Server, MOLAP Interface, MDB and Data]

Page 57: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 MOLAP

• Advantages

– Excellent performance

• Fast data retrieval

• Optimal for slicing and dicing

• Complex calculations

• All calculations are pre-generated when the cube is created

Page 58: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 MOLAP

• All calculations are pre-generated when the cube is created

[Figure: lattice of cuboids over the dimensions time, item, location, supplier]

– 0-D (apex) cuboid: all

– 1-D cuboids: time; item; location; supplier

– 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier

– 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier

– 4-D (base) cuboid: time, item, location, supplier
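The lattice of cuboids is simply the set of all subsets of the dimensions, so it can be enumerated mechanically. The sketch below assumes one aggregation level per dimension (no hierarchies); the function name `cuboid_lattice` is made up for the example.

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")


def cuboid_lattice(dims):
    """Return {k: [cuboids that group by exactly k of the dimensions]}."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, labelled "all".
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

# With n dimensions there are 2**n cuboids in total.
total = sum(len(c) for c in lattice.values())
print(total)
```

With 4 dimensions this gives 1 + 4 + 6 + 4 + 1 = 16 cuboids, all of which a fully pre-computed cube has to materialize.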

Page 59: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 MOLAP

• Disadvantages

– Limited amount of data it can handle

• A cube can be derived from a large amount of data, but only summary-level information will be included in the cube

– Requires additional investment

• Cube technology is often proprietary

– Enormous amount of overhead

• An input file of 200 MB can expand to 5 GB with calculations

• Products:

– Cognos (IBM), Essbase (Oracle), Microsoft Analysis Services, Palo (open source)

Page 60: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 MOLAP

• Things to consider when choosing MOLAP

– MOLAP tools traditionally have difficulty querying models whose dimensions have very high cardinality (i.e., millions of members)

– Some MOLAP products have difficulty updating and querying models with more than 10 dimensions

• It depends on the complexity and cardinality of the dimensions in question, and on the number of facts or measures stored

– Other MOLAP products can handle hundreds of dimensions

Page 61: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 ROLAP

• ROLAP

– Presentation layer provides the multidimensional view

– The ROLAP server generates SQL queries from the OLAP requests to query the RDBMS

– Data is stored in RDBs

[Figure: ROLAP architecture with Client, Presentation, ROLAP Server, RDBMS and Data]
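How a ROLAP server might turn an OLAP request into SQL can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption for illustration: the single `sales` table, its column names, the `build_sql` helper, and the sample rows. A real ROLAP server would work against a star or snowflake schema with joins.

```python
import sqlite3

# Hypothetical fact table layout; names and rows are illustrative only.
ALLOWED_DIMENSIONS = {"product", "store", "day"}
MEASURE = "quantity"


def build_sql(group_by, slicer):
    """Translate an OLAP request into (sql, params).

    group_by: dimensions shown on the axes (must not be empty).
    slicer:   {dimension: member} values fixed for the hidden dimensions.
    """
    if not group_by:
        raise ValueError("at least one dimension must be displayed")
    # Only whitelisted column names are ever placed into the SQL text.
    for d in list(group_by) + list(slicer):
        if d not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {d}")
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, SUM({MEASURE}) AS total FROM sales"
    if slicer:
        sql += " WHERE " + " AND ".join(f"{d} = ?" for d in slicer)
    sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql, list(slicer.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, store TEXT, day TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Batteries", "Uptown", "Monday", 12),
        ("Batteries", "Downtown", "Monday", 7),
        ("Lamps", "Uptown", "Monday", 4),
        ("Lamps", "Uptown", "Tuesday", 9),
    ],
)

# Request: quantity per product, with the Time dimension fixed to Monday.
sql, params = build_sql(["product"], {"day": "Monday"})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Member values travel as bound parameters, while dimension names are checked against a whitelist before being placed into the statement.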

Page 62: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 ROLAP

• Special schema design: e.g., star, snowflake

• Special indexes: e.g., bitmap, R-Trees

• Advantages

– Proven technology (relational model, DBMS)

– Can handle large amounts of data (VLDBs)

• Disadvantages

– Limited SQL functionalities

• Products

– Microsoft Analysis Services, Siebel Analytics (now Oracle BI), MicroStrategy, Mondrian (open source)
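To give an idea of why bitmap indexes suit this kind of workload, here is a minimal sketch that uses Python integers as bitmaps. The column contents and the helper names `build_bitmap_index` and `matching_rows` are assumptions for the example; real systems additionally compress the bitmaps.

```python
# Attribute values of a hypothetical fact table, one entry per row.
store_column = ["Uptown", "Downtown", "Uptown", "Midtown", "Uptown"]
product_column = ["Batteries", "Batteries", "Lamps", "Lamps", "Batteries"]


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set iff row i has it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


store_idx = build_bitmap_index(store_column)
product_idx = build_bitmap_index(product_column)

# store = 'Uptown' AND product = 'Batteries' becomes a single bitwise AND.
hits = store_idx["Uptown"] & product_idx["Batteries"]
print(matching_rows(hits))
```

Conjunctions and disjunctions over several dimension attributes reduce to bitwise operations, which is what makes this index type attractive for selective multidimensional queries.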

Page 63: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 ROLAP vs. MOLAP

• Based on OLAP needs

OLAP needs                                    MOLAP   ROLAP
User Benefits   Multidimensional View           √       √
                Excellent Performance           √       -
                Analytical Flexibility          √       -
                Real-Time Data Access           -       √
                High Data Capacity              -       √
MIS Benefits    Easy Development                √       -
                Low Structure Maintenance       -       √
                Low Aggregate Maintenance       √       -

Page 64: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 HOLAP

• HOLAP

– Best of both worlds

• Storing detailed data in RDBs

• Storing aggregated data in MDBs

– Different partitioning approaches between MOLAP and ROLAP

• Vertical

• Horizontal

[Figure: HOLAP architecture with Presentation, HOLAP Server, RDBMS, MDDB and Data]

Page 65: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 HOLAP

• Vertical partitioning

– Aggregations are stored in MOLAP for fast query performance

– Detailed data is kept in ROLAP to reduce cube processing time (loading the data from the OLTP system)

• Horizontal partitioning

– A slice of the data, usually the more recent one (i.e., sliced by the Time dimension), is stored in MOLAP for fast query performance

– Older data is kept in ROLAP
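Horizontal partitioning implies that the HOLAP server has to decide, per query, which store holds the requested time range. A minimal sketch of such a routing decision follows; the cut-off date, the function name `route`, and the store labels are hypothetical and do not describe any particular product.

```python
from datetime import date

# Hypothetical cut-off: facts on or after this date live in the MOLAP store.
MOLAP_CUTOFF = date(2020, 1, 1)


def route(start: date, end: date):
    """Return the store(s) that must answer a query over [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    stores = []
    if start < MOLAP_CUTOFF:
        stores.append("ROLAP")  # older slice, kept in relational tables
    if end >= MOLAP_CUTOFF:
        stores.append("MOLAP")  # recent slice, kept in the multidimensional store
    return stores


print(route(date(2020, 3, 1), date(2020, 3, 31)))
print(route(date(2019, 1, 1), date(2019, 12, 31)))
print(route(date(2019, 12, 1), date(2020, 1, 31)))
```

A query that spans the cut-off has to be answered by both stores and the partial results combined afterwards.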

Page 66: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

• Other approaches

– Store some cubes in MOLAP and others in ROLAP, leveraging the fact that in a large cuboid, there will be dense and sparse sub-regions

4.3 HOLAP


Page 67: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 Conclusions

– ROLAP

• RDBMS - star/snowflake schema

– MOLAP

• MDBMS - Cube structures, array based storage

– ROLAP or MOLAP

• The data models used play a major role in performance differences

– MOLAP

• For summarized and relatively “small” volumes of data (50 GB)

– ROLAP

• For detailed and larger volumes of data (TB)

– HOLAP is emerging as the OLAP server of choice

Page 68: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

4.3 OLAP operations

• What do these operations look like?

– As queries, they can be expressed through query languages such as SQL or MDX

– The original SQL/92 was not fit for OLAP

• But SQL99 has extensions for OLAP functions

– GROUP BY, CUBE operators

• But since the subject is more extensive… we will discuss it in the next lecture
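As a small preview of what the CUBE operator computes, the following Python sketch emulates a `GROUP BY CUBE` over two dimensions by aggregating over every subset of them. It is an emulation for illustration, not SQL; the rows, the function name `cube`, and the use of `None` for rolled-up dimensions are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Fact rows as dicts; names and numbers are invented for illustration.
rows = [
    {"product": "Batteries", "store": "Uptown", "quantity": 12},
    {"product": "Batteries", "store": "Downtown", "quantity": 7},
    {"product": "Lamps", "store": "Uptown", "quantity": 4},
]


def cube(rows, dims, measure):
    """Emulate GROUP BY CUBE(dims): SUM(measure) for every subset of dims.

    Dimensions that are aggregated away are reported as None,
    much as SQL reports NULL in the super-aggregate rows.
    """
    result = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                key = tuple(row[d] if d in kept else None for d in dims)
                result[key] += row[measure]
    return dict(result)


totals = cube(rows, ("product", "store"), "quantity")
for key, total in totals.items():
    print(key, total)
```

The result contains the detailed groups, the per-product and per-store subtotals, and the grand total in one pass, which is what a single CUBE query returns.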

Page 69: Data Warehousing & Data Mining · 4. Queries 4.1 Query processing 4.2 Queries in DW / OLAP 4.3 Physical Modeling 4. Queries Data Warehousing & OLAP –Wolf-TiloBalke–InstitutfürInformationssysteme

Next lecture

• Queries

– OLAP query languages

– Logical modeling - implementation