chapter 4, data warehouse and olap operations · chapter 4, data warehouse and olap operations...

25
11/16/2017 1 Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor Department of Computer Science Baylor University CSI 4352, Introduction to Data Mining Chapter 4, Data Warehouse & OLAP Operations CSI 4352, Introduction to Data Mining Basic Concept of Data Warehouse Data Warehouse Modeling Data Warehouse Architecture Data Warehouse Implementation From Data Warehousing to Data Mining

Upload: others

Post on 16-Mar-2020

29 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

1

Chapter 4, Data Warehouse and OLAP Operations

Young-Rae Cho

Associate Professor

Department of Computer Science

Baylor University

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

Data Warehouse Modeling

Data Warehouse Architecture

Data Warehouse Implementation

From Data Warehousing to Data Mining

Page 2: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

2

What is Data Warehouse?

Data Warehouse ( defined in many different ways )

A decision support database that is maintained separately from the

organization’s operational database

The support of information processing by providing a solid platform of

consolidated, historical data for analysis

“A data warehouse is a (1) subject-oriented, (2) integrated,

(3) time-variant, and (4) nonvolatile collection of data in support of

management’s decision-making process.” — W. H. Inmon

Data Warehousing

The process of constructing and using data warehouses

Data Warehouse – Subject-Oriented

Organized around major subjects

e.g., customers, products, sales

Focusing on the modeling and analysis of data for decision makers,

not on daily operations or transaction processing

Provide a simple and concise view around particular subject issues

Excluding data that are not useful in the decision support process

Page 3: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

3

Data Warehouse – Integrated

Integrating multiple, heterogeneous data sources

Relational databases, flat files, on-line transaction records

Apply data cleaning and data integration techniques

Ensures consistency in naming conventions, encoding structures,

attribute measures, etc. among different data sources

Data Warehouse – Time Variant

The time horizon of data warehouses is significantly longer than that of

operational systems

Operational databases have current data values

Data warehouses provide information from a historical perspective

(e.g., 10-20 years)

Time is a key structure in data warehouses

Contain the attribute of time (explicitly or implicitly)

Page 4: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

4

Data Warehouse – Nonvolatile

A physically separate storage of data transformed from operational databases

Operational update of data does not occur

Not require transaction processing, recovery, and concurrency control

mechanisms

Require only two operations, initial loading of data and access of data

Data Integration Methods

Methods

(1) Process to provide uniform interface to multiple data sources

→ Tradition Database Integration

(2) Process to combine multiple data sources into coherent storage

→ Data warehousing

Traditional DB Integration

A query-driven approach

Wrappers / mediators on top of heterogeneous data sources

Data Warehousing

An update-driven approach

Combined the heterogeneous data sources in advance

Stored them in a warehouse for direct query and analysis

Page 5: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

5

OLTP vs. OLAP

OLTP (on-line transaction processing)

Major task of traditional relational DBMS

Day-to-day operations: e.g., purchasing, inventory, manufacturing,

banking, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making

Distinct Features (OLTP vs. OLAP)

User and system orientation (customers vs. market analysts)

Data contents (current, detailed vs. historical, consolidated)

Database design (ER + application vs. star + subject)

View (current, local vs. evolutionary, integrated)

OLTP vs. OLAP

Feature OLTP OLAP

Characteristic operational processing information processing

Orientation transaction analysis

Users clerk, DBA knowledge worker (CEO, analyst)

Function day-to-day operations decision support

DB Design application-oriented subject-oriented

Data current, up-to-date historical, integrated, summarized

Unit of work short, simple transaction complex query

Access read/write/update read-only (lots of scans)

Page 6: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

6

Why Data Warehouse?

Performance Issue

DBMS: tuned for OLTP

e.g., access methods, indexing, concurrency control, recovery

Data warehouse: tuned for OLAP

e.g., complex queries, multidimensional view, consolidation

Data Issue

Decision support requires historical data, consolidated and summarized

data, consistent data

Chapter 4, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

Data Warehouse Modeling

Data Warehouse Architecture

Data Warehouse Implementation

From Data Warehousing to Data Mining

Page 7: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

7

Data Format for Warehouse

Dimensions

Multi-dimensional data model

Data are stored in the form of a data cube

Data Cube

A view of multi-dimensions

Dimension tables, such as item (item_name, brand, type), or

time (day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of

the related dimension tables

Cuboid

Each combination of dimensional spaces in a data cube

0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid

The Lattice of Cuboids

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D (base) cuboid

time,item

time,item,location

time, item, location, supplier

Page 8: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

8

Conceptual Modeling

Key of Modeling Data Warehouses

Handling dimensions & measures

Examples

Star schema: A fact table in the middle connected to a set of dimension

tables

Snowflake schema: A refinement of star schema where some dimensional

hierarchy is normalized into a set of smaller dimension

tables, forming a shape similar to snowflake

Fact constellations: Multiple fact tables share dimension tables, viewed

as a collection of stars, therefore called galaxy schema

or fact constellation

Example of Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcitystatecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Page 9: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

9

Example of Snowflake Schema

supplier_keysupplier_type

supplier

city_keycitystatecountry

city

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

Example of Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcitystatecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Shipping Fact Table

Page 10: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

10

Cube Definition in DMQL

Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_dimension_list>)

Special Case (Shared Dimension Table)

define dimension <dimension_name> as <dimension_name_first>

in cube <cube_name_first>

Star Schema Definition in DMQL

Example

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars),

avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter,

year)

define dimension item as (item_key, item_name, brand, type,

supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, state, country)

Page 11: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

11

Snowflake Schema Definition in DMQL

Example

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars),

avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter,

year)

define dimension item as (item_key, item_name, brand, type,

supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key,

province_or_state, country))

Fact Constellation Schema Definition in DMQL

Example define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name,

location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

Page 12: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

12

Measures

Distributive

If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

e.g., count(), sum(), min(), max()

Algebraic

If it can be computed by an algebraic function with m arguments, each of which is obtained by applying a distributive aggregate function

e.g., avg(), standard_deviation()

Holistic

If there is no constant bound on the storage size needed to describe a subaggregate

e.g., median(), mode(), rank()

Concept Hierarchy

Schema Hierarchy

e.g., street < city < state < country

e.g., day < { month < quarter ; week } < year

Set-group hierarchy

e.g., { (0..100] ; (100..200] } < (0..200]

e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }

< allProducts

Year

Quarter| Week

Month

Day

(0..200]

(0..100] (100..200]

allProducts

lowPrice highPrice

(0..10] (10..100] (100..200]

Page 13: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

13

Three Components of Data Cube

Example

Measures: data values as a function of products, locations and time

Dimensions:

Hierarchies:

Company|

Category|

Productloca

tions

time

Country|

City|

Office

Year

Quarter| Week

Month

Day

product location time

Example of Cuboid Cells

all

product quarter country

product, quarter

product, country

quarter, country

product, quarter, country

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D (base) cuboid

product dimension

time dimension

location dimension

Page 14: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

14

Example of Data Cube

Total annual salesof TV in U.S.A.

Quarter

Cou

ntr

ysum

sumTV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

OLAP Operations

Roll-up (drill-up)

Summarizes (aggregates) data

by climbing up hierarchy or by dimension reduction

Drill-down (roll-down)

Reverse of roll-up

by stepping down to lower-level data or introducing new dimensions

Slice

Selecting data on one dimension

Dice

Selecting data on multi-dimensions

Pivot (rotate)

Reorienting the cube, or transforming 3-D data to a series of 2-D spaces

Page 15: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

15

Roll-UP & Drill-Down

coun

try

quarter

city

quarter

coun

try

month

loca

tio

n

time

Slice, Dice & Pivot

coun

try

quarter

quarter

coun

try

quarter

loca

tio

n

time

Q1 Q2 Q3 Q4

USA

canada

mexico

pclaptop

supercom

coun

try

USA

canada

Q1 Q2

pclaptop

Q1 Q2 Q3 Q4

USA

canada

mexico

Page 16: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

16

Starnet Query Model

Shipping Method

air

truckorder

Orders

contracts

Customer

Productcategory

item

sales-person

division

division

OrganizationPromotion

citycountry

region

Location

dailyquarterlyannuallyTime

Each circle is called a footprint

customerID

Chapter 4, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

Data Warehouse Modeling

Data Warehouse Architecture

Data Warehouse Implementation

From Data Warehousing to Data Mining

Page 17: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

17

Data Warehouse Design

Top-Down View

Allows the selection of the relevant information necessary for the data

warehouse

Data Source View

Exposes the information being captured, stored, and managed by

operational systems

Data Warehouse View

Consists of fact tables and dimension tables

Business Query View

Shows the perspectives of data in the warehouse to end-users

Data Warehouse Design Process

Categories by Process Direction

Top-down: Starts with overall design and planning (mature)

Bottom-up: Starts with experiments and prototypes (rapid)

Categories by Software Engineering View

Waterfall: structured, systematic analysis at each step before proceeding

to the next

Spiral: rapid generation of functional systems, short turn around time

Typical Data Warehouse Design Process

Choose business processes for modeling, e.g., orders, invoices, etc

Choose the grain (atomic level of data) of the business processes

Choose the dimensions that will apply to each fact table

Choose the measures that will populate each fact table

Page 18: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

18

Data Warehouse Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

Query

Analysis

Reports

Monitor &IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

Othersources

Data Storage

OLAP Server

Three Data Warehouse Models

Enterprise Warehouse

A global view with all the information about subjects spanning the entire

organization

Data Mart

A subset of corporate-wide data that is of value to a specific group of

users

Its scope is confined to specific, selected groups

Independent vs. dependent (directly from warehouse) data mart

Virtual Warehouse

A set of views over operational databases

Only some of possible summary views may be materialized

Page 19: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

19

Development of Data Warehouse

Define a high-level corporate data model

Model Refinement

Enterprise

Data

Warehouse

Multi-Tier

Data Warehouse

Distributed

Data Marts

Data Mart

Data Mart

Model Refinement

Utilities of Back-End Tools

Data Extraction

Get data from multiple, heterogeneous, and external sources

Data Cleaning

Detect errors in the data and rectify them when possible

Data Transformation

Convert data from the original format to the warehouse format

Loading

Sort, summarize, consolidate, compute views, check integrity, and

build indices and partitions

Refresh

Propagate the updates from data sources to the warehouse

Page 20: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

20

OLAP Server Architecture

Relational OLAP (ROLAP)

Uses relational or extended-relational DBMS to store and manage

warehouse data and OLAP middle ware

Includes optimization of DBMS back-end, implementation of aggregation

navigation logic, and additional tools and services

High scalability

Multidimensional OLAP (MOLAP)

Sparse array-based multidimensional storage engine

Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP)

Low-level: relational / high-level: array

High flexibility

Metadata Repository

Definition of Metadata

The data defining data warehouse objects

Examples

Description of the structure of the data warehouse, e.g., schema, view,

dimensions, hierarchies, data definitions, data mart locations and contents

Operational meta-data, e.g., history of migrated data, currency of data,

warehouse usage statistics, error reports

Algorithms used for summarization

Mapping from operational environment to data warehouse

Data related to system performance

Business data, e.g., business terms and definitions, ownership of data,

charging policies

Page 21: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

21

Chapter 4, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

Data Warehouse Modeling

Data Warehouse Architecture

Data Warehouse Implementation

From Data Warehousing to Data Mining

Data Cube Computation

View as a Lattice of Cuboids

How many cuboids in an n-dimensional cube?

How many cuboid cells in an n-dimensional cube

with Li levels?

Materialization of Data Cube

Full materialization (all cuboids), Partial materialization (some cuboids),

No materialization (only base cuboid)

Selection of cuboids to materialize

• Based on the size, sharing, access frequency, etc.

)11

(

n

i iL

n2

all

product time location

product, time

product, location

time, location

product, time, location

Page 22: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

22

Cube Operation

Cube Definition and Computation in DMQL

define cube sales [item, city, year]: sum (sales_in_dollars)

compute cube sales

Cube Definition and Computation in SQL

select item, city, year, SUM (amount)

from sales

cube by item, city, year

Internal Operations

group by (item, city, year)

group by (item, city), (item, year), (city, year)

group by (item), (city), (year)

group by ()

Iceberg Cube

Iceberg Cube Computation

Computing only the cuboid cells whose count or

other aggregates satisfying the condition like

HAVING COUNT(*) >= min_sup

Motivation

Only a small portion of cube cells may be “above the water’’

in a sparse cube

Only calculate “interesting” cells — data above certain threshold

Avoid explosive growth of the cube

Page 23: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

23

Chapter 4, Data Warehouse & OLAP Operations

CSI 4352, Introduction to Data Mining

Basic Concept of Data Warehouse

Data Warehouse Modeling

Data Warehouse Architecture

Data Warehouse Implementation

From Data Warehousing to Data Mining

Data Warehouse Usage

Information Processing

Supports querying, basic statistical analysis, and reporting using

crosstabs, tables, charts and graphs

Analytical Processing

Supports OLAP operations in multi-dimensional space

Data Mining

Supports pattern discovery from warehouse data, and presenting

the mining results using visualization tools

Page 24: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

24

From OLAP To OLAM

On-Line Analytical Mining (OLAM)

( OLAP + Data Mining ) in data warehouse

Why OLAM ?

High quality data

• Data warehouse contains integrated, consistent and cleaned data

Information processing infrastructure

• ODBC/OLE DB connections, web accessing, service facilities

OLAP-based exploratory data analysis

• Mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions

• Integration and swapping of multiple data mining functions

OLAM System Architecture

Data Warehouse

OLAMEngine

OLAPEngine

User GUI API

Data Cube API

Database API data cleaning data integration

Layer3

OLAP/OLAM

Layer2

MDDB

Layer1

Data Repository

Layer4

User Interfacemining query mining result

Meta Data

Databases

Page 25: Chapter 4, Data Warehouse and OLAP Operations · Chapter 4, Data Warehouse and OLAP Operations Young-Rae Cho Associate Professor ... 5 OLTP vs. OLAP ... Drill-down (roll-down) Reverse

11/16/2017

25

Questions?

References

Gray, J., et al., “Data Cube: A Relational Aggregation Operator

Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and

Knowledge Discovery, Vol. 1 (1997)

Lecture Slides are found on the Course Website,

www.ecs.baylor.edu/faculty/cho/4352