Data Warehousing & OLAP

Post on 04-Jun-2020


Page 1: Data Warehousing & OLAP...• Data warehousing: the process of constructing and using data warehouses Jian Pei: Big Data Analytics -- Multidimensional Analysis 22 Subject-Oriented

Data Warehousing & OLAP

Motivation: Business Intelligence

Jian Pei: Big Data Analytics -- Multidimensional Analysis 2

Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (Product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:
•  Which categories of products are most popular for customers in Vancouver?
•  Find pairs (customer groups, most popular products)

J. Pei: Finding Outstanding Aspects and Contrast Subspaces 3

Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …

In what aspect is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?

Don’t You Ever Google Yourself?

•  Big data makes one know oneself better
•  57% of American adults search for themselves on the Internet
– Good news: those people are better paid than those who haven’t done so! (Investors.com)

•  Egocentric analysis becomes more and more important with big data

Egocentric Analysis

•  How am I different from (more often than not, better than) others?

•  In what aspects am I good?

http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg

Dimensions

•  “An aspect or feature of a situation, problem, or thing, a measurable extent of some kind” – Dictionary
•  Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
– Objects are compared in selected dimensions/attributes
•  More often than not, objects have more dimensions/attributes than one is interested in or can handle

Multi-dimensional Analysis

•  Find interesting patterns in multi-dimensional subspaces
–  “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)”
•  Different patterns may be manifested in different subspaces
– Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction, i.e., one set of features for all objects
– Different subspaces may manifest different patterns

OLAP

•  Conceptually, we may explore all possible subspaces for interesting patterns
•  What patterns are interesting?
•  How can we explore all possible subspaces systematically and efficiently?
•  Fundamental problems in analytics and data mining

OLAP

•  Aggregates and group-bys are frequently used in data analysis and summarization
    SELECT time, altitude, AVG(temp) FROM weather GROUP BY time, altitude;
–  In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys are used 20 times
•  Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently
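The group-by query above can be tried on a small in-memory table. This is a minimal sketch using Python's standard `sqlite3` module; the `weather` rows are made-up values for illustration only.

```python
import sqlite3

# Hypothetical weather readings: (time, altitude, temp). Values are made up.
rows = [
    ("09:00", 100, 10.0),
    ("09:00", 100, 12.0),
    ("09:00", 500, 6.0),
    ("12:00", 100, 18.0),
    ("12:00", 500, 11.0),
    ("12:00", 500, 13.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (time TEXT, altitude INTEGER, temp REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)

# One output row per distinct (time, altitude) pair, with the average temperature.
avg_temp = conn.execute(
    "SELECT time, altitude, AVG(temp) FROM weather "
    "GROUP BY time, altitude ORDER BY time, altitude"
).fetchall()

for row in avg_temp:
    print(row)
```

Six detail rows collapse into four groups, one per (time, altitude) combination that occurs in the data.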

OLAP Operations

•  Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
–  (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
•  Drill down (roll down): reverse of roll-up, from higher level summary to lower level summary or detailed data, or introducing new dimensions
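A roll-up of this kind can be sketched with two group-by queries over the same detail table. The table layout, the store-to-city mapping stored as a `city` column, and all values below are assumptions made for illustration.

```python
import sqlite3

# Hypothetical detail records: (day, store, city, product_type, sales).
detail = [
    ("2020-01-03", "S1", "Vancouver", "TV", 100),
    ("2020-01-03", "S1", "Vancouver", "PC", 50),
    ("2020-01-17", "S2", "Vancouver", "TV", 70),
    ("2020-01-20", "S3", "Burnaby", "PC", 30),
    ("2020-02-02", "S1", "Vancouver", "TV", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(day TEXT, store TEXT, city TEXT, product_type TEXT, sales INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", detail)

# Fine level: (Day, Store, Product type, SUM(sales)).
fine = conn.execute(
    "SELECT day, store, product_type, SUM(sales) FROM sales "
    "GROUP BY day, store, product_type ORDER BY 1, 2, 3"
).fetchall()

# Roll up: day -> month and store -> city (climbing hierarchies),
# and drop product_type entirely (dimension reduction, shown as '*').
rolled_up = conn.execute(
    "SELECT substr(day, 1, 7), city, '*', SUM(sales) FROM sales "
    "GROUP BY substr(day, 1, 7), city ORDER BY 1, 2"
).fetchall()

for row in rolled_up:
    print(row)
```

Drilling down is the reverse direction: going from `rolled_up` back to `fine` requires the detail data, since the summary alone no longer distinguishes days, stores, or product types.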

Roll Up

http://www.tutorialspoint.com/dwh/images/rollup.jpg

Drill Down

http://www.tutorialspoint.com/dwh/images/drill_down.jpg

Other Operations

•  Dice: pick specific values or ranges on some dimensions

•  Pivot: “rotate” a cube – changing the order of dimensions in visual analysis
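Dice and pivot can be illustrated on a cube stored as a dictionary of cells. This is only a sketch: the helper names `dice` and `pivot` are hypothetical, and the cell values are a subset of the car SALES table that appears later in these slides.

```python
# Cells keyed by (model, year, color); a subset of the SALES table.
cells = {
    ("Chevy", 1990, "red"): 5,
    ("Chevy", 1990, "blue"): 62,
    ("Chevy", 1991, "red"): 54,
    ("Chevy", 1991, "blue"): 49,
    ("Ford", 1990, "red"): 64,
    ("Ford", 1990, "blue"): 63,
    ("Ford", 1991, "red"): 52,
    ("Ford", 1991, "blue"): 55,
}


def dice(cube, models, years, colors):
    """Keep only cells whose coordinates fall in the chosen value sets."""
    return {
        key: value
        for key, value in cube.items()
        if key[0] in models and key[1] in years and key[2] in colors
    }


def pivot(cube, order):
    """Reorder the dimensions; `order` is a permutation of (0, 1, 2)."""
    return {tuple(key[i] for i in order): value for key, value in cube.items()}


diced = dice(cells, models={"Ford"}, years={1990, 1991}, colors={"red"})
pivoted = pivot(cells, order=(2, 0, 1))  # now keyed by (color, model, year)

print(diced)
print(pivoted[("red", "Chevy", 1990)])
```

Dice shrinks the cube to a sub-cube, while pivot keeps every cell and only changes the order in which the dimensions are presented.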

http://en.wikipedia.org/wiki/File:OLAP_pivoting.png

Dice

http://www.tutorialspoint.com/dwh/images/dice.jpg

Relational Representation

•  If there are n dimensions, there are 2^n possible aggregation columns

Roll up by model by year by color in a table

Difficulties

•  Many group bys are needed
– 6 dimensions → 2^6 = 64 group bys

•  In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!
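The count of group-bys is simply the number of subsets of the dimension set. A small sketch that enumerates them (the function name `grouping_sets` and the dimension names `d1` to `d6` are placeholders):

```python
from itertools import combinations


def grouping_sets(dimensions):
    """Return every subset of `dimensions`, i.e. every possible group-by."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]


three = grouping_sets(["Model", "Year", "Color"])
six = grouping_sets(["d1", "d2", "d3", "d4", "d5", "d6"])  # placeholder names

print(len(three), len(six))
for subset in three:
    print(subset)
```

The empty subset `()` corresponds to the grand total with no grouping at all. Running each subset as a separate GROUP BY query is what leads to the repeated scans mentioned above.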

Dummy Value “ALL”

CUBE

SALES
Model  Year  Color  Sales
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  blue      62
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  blue      49
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  blue      71
Ford   1990  red       64
Ford   1990  white     62
Ford   1990  blue      63
Ford   1991  red       52
Ford   1991  white      9
Ford   1991  blue      55
Ford   1992  red       27
Ford   1992  white     62
Ford   1992  blue      39

DATA CUBE (result of the CUBE operator on SALES)
Model  Year  Color  Sales
Chevy  1990  blue      62
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  ALL      154
Chevy  1991  blue      49
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  ALL      198
Chevy  1992  blue      71
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  ALL      156
Chevy  ALL   blue     182
Chevy  ALL   red       90
Chevy  ALL   white    236
Chevy  ALL   ALL      508
Ford   1990  blue      63
Ford   1990  red       64
Ford   1990  white     62
Ford   1990  ALL      189
Ford   1991  blue      55
Ford   1991  red       52
Ford   1991  white      9
Ford   1991  ALL      116
Ford   1992  blue      39
Ford   1992  red       27
Ford   1992  white     62
Ford   1992  ALL      128
Ford   ALL   blue     157
Ford   ALL   red      143
Ford   ALL   white    133
Ford   ALL   ALL      433
ALL    1990  blue     125
ALL    1990  red       69
ALL    1990  white    149
ALL    1990  ALL      343
ALL    1991  blue     104
ALL    1991  red      106
ALL    1991  white    104
ALL    1991  ALL      314
ALL    1992  blue     110
ALL    1992  red       58
ALL    1992  white    116
ALL    1992  ALL      284
ALL    ALL   blue     339
ALL    ALL   red      233
ALL    ALL   white    369
ALL    ALL   ALL      941

SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy') AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
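The effect of the CUBE operator on the SALES table can be reproduced in a few lines. This is an in-memory sketch of the semantics only, not how a SQL engine implements CUBE; the function name `data_cube` is hypothetical, and the input rows are the 18 SALES rows from the slide.

```python
from collections import defaultdict
from itertools import product

# The 18 base rows of the SALES table: (model, year, color, sales).
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]


def data_cube(rows):
    """Sum sales for every combination of a concrete value or 'ALL' per dimension."""
    cube = defaultdict(int)
    for model, year, color, sales in rows:
        # Each base row contributes to 2^3 = 8 cells of the cube.
        for m, y, c in product((model, "ALL"), (year, "ALL"), (color, "ALL")):
            cube[(m, y, c)] += sales
    return dict(cube)


cube = data_cube(SALES)
print(len(cube))                     # number of cells in the cube
print(cube[("ALL", "ALL", "ALL")])   # grand total
print(cube[("Chevy", "ALL", "ALL")])
```

With 2 models, 3 years, and 3 colors, plus ALL on each dimension, the cube has 3 × 4 × 4 = 48 cells. This sketch fills all of them in a single pass over the base rows, in contrast to running one scan per group-by.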

Semantics of ALL

•  ALL is a set
– Model.ALL = ALL(Model) = {Chevy, Ford}
– Year.ALL = ALL(Year) = {1990, 1991, 1992}
– Color.ALL = ALL(Color) = {red, white, blue}

OLTP Versus OLAP

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day to day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write, index/hash on prim. key   lots of scans
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

What Is a Data Warehouse?

•  “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
•  Data warehousing: the process of constructing and using data warehouses

Subject-Oriented

•  Organized around major subjects, such as customer, product, sales

•  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

•  Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Integrated

•  Integrating multiple, heterogeneous data sources
–  Relational databases, flat files, on-line transaction records
•  Data cleaning and data integration
–  Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    •  E.g., Hotel price: currency, tax, breakfast covered, etc.
–  When data is moved to the warehouse, it is converted
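The hotel-price example can be made concrete with a small conversion step. Everything here is an illustrative assumption: the `HotelPrice` record, the `to_warehouse_price` helper, the choice of CAD with tax included as the warehouse convention, and the exchange rate are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical conversion rates to a common currency (CAD); not real quotes.
RATE_TO_CAD = {"CAD": 1.0, "USD": 1.25}


@dataclass
class HotelPrice:
    source: str
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float  # e.g. 0.10 means 10%


def to_warehouse_price(p: HotelPrice) -> float:
    """Convert a source record to the assumed convention: CAD, tax included."""
    amount = p.amount if p.tax_included else p.amount * (1 + p.tax_rate)
    return round(amount * RATE_TO_CAD[p.currency], 2)


# Two sources quoting the same room under different conventions.
a = HotelPrice("site_a", 100.0, "USD", tax_included=False, tax_rate=0.10)
b = HotelPrice("site_b", 137.5, "CAD", tax_included=True, tax_rate=0.10)

print(to_warehouse_price(a), to_warehouse_price(b))
```

Once both records follow one convention, the two prices can be compared or aggregated directly, which is the point of converting data as it is moved into the warehouse.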

Time Variant

•  The time horizon for the data warehouse is significantly longer than that of operational systems
–  Operational databases: current value data
–  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
•  Every key structure in the data warehouse contains an element of time, explicitly or implicitly
–  But the key of operational data may or may not contain “time element”

Nonvolatile

•  A physically separate store of data transformed from the operational environment
•  Operational updates of data do not occur in the data warehouse environment
– Do not require transaction processing, recovery, and concurrency control mechanisms
– Require only two operations in data accessing
    •  Initial loading of data
    •  Access of data

Why Separate Data Warehouse?

•  High performance for both
– Operational DBMS: tuned for OLTP
– Warehouse: tuned for OLAP
•  Different functions and different data
– Historical data: data analysis often uses historical data that operational databases do not typically maintain
– Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources

To-Do List

•  Read Section 4.1

Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1) 27