data warehousing

Data warehousing

Han, J. and M. Kamber. Data Mining: Concepts and Techniques. 2001. Morgan Kaufmann.

KDD process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection

Data Mining

Pattern Evaluation

Application

Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses and/or other information repositories.

End User

BusinessAnalyst

DataAnalyst

DBA

MakingDecisions

Data PresentationVisualization Techniques

Data MiningInformation Discovery

OLAP, Multi-dimensional Analysis

Statistical Analysis, Querying and ReportingData Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Data mining andBusiness Intelligence

Data Exploration

Graphical user interface

Pattern evaluation

Data mining engine

Database or data warehouse server

DatabaseData

warehouse

Knowledge base

Data cleaningData integration

Filtering

Architecture of atypical data mining system

…Credit_infoIncomeAgeAddressNameCust_ID

…Supplier PriceTypeCategoryBrandNameitem_ID

…CommissionSalaryGroupDepartmentNameEmp_ID

amount …Pay_methodTimeDateEmp_idCust_idTrans_ID

Item_ID QtyTrans_ID

Customer

Item

Employee

Purchases

Items_sold

A relational database fragment

Queries

List of items sold in previous quarterTotal sales last month, grouped by salespersonNumber of sales transactions in DecemberSalesperson with highest amount of sales

Data warehouse

Integrates data from various sourcesData organized on a historical perspectivePresents different levels of summarized data

Multi-dimensional structuredimension: attributecell: aggregate measures

Data source in Chicago

Data source in New York

Data source in Toronto

Data source in Vancouver

CleanTransformIntegrateLoad

Datawarehouse

Query and analysis tools

Client

Client

Typical data warehouse architecture

Addr

ess

(city

)

Tim

e (q

tr.)

Q1

Q2

Q3

Q4

T1 T2 T3 T4 T5 T6

New YorkToronto

Vancouver

Item-types

Chicago

T1 T2 T3 T4 T5 T6

Q1

Q2

Q3

Q4Ad

dres

s (c

ount

ry)

USACanada

Tim

e (m

onth

) Jan

Feb

March

T1 T2 T3 T4 T5 T6

New YorkToronto

Vancouver

Item-types

ChicagoAddr

ess

(city

)

Item-types

Multi-dimensional data

Drill down on data for Q1

Roll-upon Address

Data WarehouseA decision support database that is maintained separately from the organization’s operational database

Collection of data this is– subject-oriented– integrated– time-variant– nonvolatile

Data Warehouse — Subject-Oriented

• Organized around major subjects, such as customer, product, sales.

• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the

decision support process.

Data Warehouse — Integrated

• Constructed by integrating multiple, heterogeneous data sources– relational databases, flat files, on-line transaction records

• Data cleaning and data integration – Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different data sources

• e.g., Hotel price: currency, tax, breakfast covered, etc.

– Data is converted when moved to the warehouse.

Data Warehouse — Time Variant

• The time horizon for data warehouse is significantly longer than that of operational systems.– Operational database: current value data.

– Data warehouse: provides information from a historical

perspective (e.g., past 5-10 years)

• Every key structure in the data warehouse contains an element of time, explicitly or implicitly

Data Warehouse — Non-Volatile

• A physically separate store of data transformed from the operational environment.

• Operational update of data does not occur in the data warehouse environment.– Does not require transaction processing, recovery, and

concurrency control mechanisms

– Requires only two operations in data accessing:

• initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration– Build wrappers/integrators on top of heterogeneous databases – Query driven approach

• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and results are integrated into a global answer set

• Complex information filtering, compete for resources with local processing

• Data warehouse: update-driven, high performance– Information from heterogeneous sources is integrated in advance

and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)– Major task of traditional relational DBMS– Day-to-day operations: purchasing, inventory, banking, manufacturing,

payroll, registration, accounting, etc.

• OLAP (on-line analytical processing)– Major task of data warehouse system– Data analysis and decision making

• Distinct features (OLTP vs. OLAP):– Users and system orientation: transaction vs. decision support– Data contents: current, detailed vs. historical, consolidated– Database design: ER + application vs. star schema + subject– View: current, local vs. evolutionary, integrated– Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

OLTP OLAP users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write index on primary key

multiple large scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

Why Separate Data Warehouse?• Maintain high performance for both systems

– DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery

– Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

• Different data and function– missing data: decision support requires historical data which

operational DBs do not typically maintain

– data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

– data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model which views data in the form of a data cube

• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions– Dimension tables, such as item (item_name, brand, type), or

time(day, week, month, quarter, year)

– Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

• In data warehousing literaturen-D base cube is called a base cuboid.The lattice of cuboids forms a data cube.

Time

Item

Location

Supplier

SALES

Multi-dimensional “cube”Sales by Item, Time, Location, Supplier

Cube: A Lattice of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

time,location,supplier

current detailed data(up to 3 years)

monthly data (up to 15 years)

weekly data(up to 7 years)

retail monthly data(up to 15 years)

yearly data(keep all data)

special event effects (up to 30 years)

old detailed data (archived)

quarterly data(up to 20 years)

old weekly data(archived)

old monthly data(archived)

old quarterly data(archived)

old monthly data(archived)

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures– Star schema: A fact table in the middle connected to a set of

dimension tables

– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Star Schema Example

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_statecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_salesMeasures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Snowflake Schema example


time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures


item


branch

city_keycityprovince_or_streetcountry

city

location_keystreetcityprovince_or_statecountry

location

Fact Constellation example


time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales


item


branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Multiple fact tables, sharing dimensionsCollection of stars – fact constellation or

galaxy schema

Data warehouse vs. data marts• Data warehouse

– Enterprise-wide scope– Subjects that span the organization– Fact constellation used to model multiple, interrelated

subjects

• Data mart– Department-wide scope– Departmental subset of data warehouse– Star, snowflake schema

Star schema is more efficient and thereby popular

Computing Measures• Measure: numerical value at each point in the data cube

e.g. <time=“Q1”, location=“Chicago”, item=“xyz”>: avg-amount

• Need to be able to efficiently compute measures

Measure types• Distributive: E.g., count(), sum(), min(), max().

Result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.

• Algebraic: E.g., avg(), min_N(), standard_deviation().Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

• Holistic: E.g., median(), mode(), rank().

There is no constant bound on the storage size needed to describe a sub-aggregate. No constant function with M arguments (constant M) that characterizes the computation.Can be difficult to compute efficiently – approximate computation

Concept Hierarchy

all

Europe North_America

MexicoCanadaSpainGermany

Vancouver

...

......

... ...

all

region

country

TorontoFrankfurtcity

Example: Location dimension

Concept hierarchiesFull or partial ordering Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Set-grouping hierarchye.g. price

($0..$1000]

($0..$1000] ($0..$1000] ($0..$1000]

($0..$1000] ($0..$1000]

($0..$1000]

($0..$1000]

($0..$1000]

($0..$1000]

Multiple hierarchies for an attribute price: {inexpensive, moderately_priced, expensive}

OLAP examplesSales volume as a function of product, month, and region

Prod

uct

Region

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

A Sample Data CubeTotal annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntrysum

sumTV

VCRPC

1Qtr 2Qtr 3Qtr 4QtrU.S.A

Canada

Mexico

sum

Addr

ess

(city

)

Tim

e (q

tr.)

Q1

Q2

Q3

Q4

T1 T2 T3 T4 T5 T6

New YorkToronto

Vancouver

Item-types

Chicago

T1 T2 T3 T4 T5 T6

Q1

Q2

Q3

Q4Ad

dres

s (c

ount

ry)

USACanada

Tim

e (m

onth

) Jan

Feb

March

T1 T2 T3 T4 T5 T6

New YorkToronto

Vancouver

Item-types

ChicagoAddr

ess

(city

)

Item-types

Drill down, Roll up

Drill down on data for Q1

Roll-upon Address

Tim

e (q

tr.)

Q1

Q2

Q3

Q4

T1 T2 T3 T4 T5 T6

New YorkToronto

Vancouver

Item-types

ChicagoChicago

Toronto

Q1

Q2

T3 T8

Dice for(location in {Chicago, Toronto}and time in {Q1}And Item in {T3, T8}

Chicago

New York

Toronto

Vancouver

T1 T2 T3 T4 T5 T6

SliceFor Time in {Q1}

Slicing and Dicing

Slice: Selection on one dimension

Dice; Selection on two or more dimensions

Browsing a Data Cube

VisualizationOLAPInteractive

manipulation

Typical OLAP Operations• Roll up (drill-up): summarize data

– by climbing up hierarchy or by dimension reduction• Drill down (roll down): reverse of roll-up

– from higher level summary to lower level summary or detailed data, or introducing new dimensions

• Slice and dice:– project and select

• Pivot (rotate):– reorient the cube, visualization, 3D to series of 2D planes.

• Other operations– drill across: involving (across) more than one fact table– drill through: through the bottom level of the cube to its back-

end relational tables (using SQL)

A Star-Net Query ModelShipping Method

AIR-EXPRESS

TRUCKORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

OrganizationPromotion

CITY

COUNTRY

REGION

Location

DAILYQTRLYANNUALYTime

Data Warehouse Design: Four Views • Top-down view

• selection of the relevant information necessary for the data warehouse based on current and future needs

• Data source view• exposes the information being captured, stored, and

managed by operational systems (E/R models, CASE, etc)

• Data warehouse view• fact tables and dimension tables, pre-calculated totals,

counts, etc. Source information, date, time for historical context

• Business query view • perspectives of data in the warehouse from the view of end-

user

Data Warehouse Design Process• Top-down, bottom-up approaches or a combination

– Top-down: Starts with overall design and planning (mature)– Bottom-up: Starts with experiments and prototypes (rapid)

• From software engineering point of view– Waterfall: structured and systematic analysis at each step – Spiral: rapid generation of increasingly functional systems, short

turn around time, quick turn around

• Typical data warehouse design process– Choose a business process to model, e.g., orders, invoices, etc.– Choose the grain (atomic level of data) of the business process– Choose the dimensions that will apply to each fact table record– Choose the measure that will populate each fact table record

MultiMulti--Tiered DW ArchitectureTiered DW Architecture

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

OperationalDBs

othersources

Data Storage

OLAP Server

Three Data Warehouse Models

• Enterprise warehouse– collects all of the information about subjects spanning

the entire organization• Data Mart

– a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart

• Independent vs. dependent (directly from warehouse) data mart

• Virtual warehouse– A set of views over operational databases– Only some of the possible summary views may be

materialized

Data Marts• Data warehouse designed to meet the needs

of a specific group of users• Should (but may not) be designed with

corporate standards and accessibility in mind– incorporate standards for hardware,

software, networking, DBMS, naming conventions, etc.

– vendor’s attempt to bypass IT and sell directly to end-users?

Data Warehouse Development: A Recommended Approach

Define a high-level corporate data model

Data Mart

Data Mart

Distributed Data Marts

Multi-Tier Data Warehouse

Enterprise Data Warehouse

Model refinementModel refinement

OLAP Server Types• Relational OLAP (ROLAP)

– Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware to support missing pieces

– Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

– greater scalability

• Multidimensional OLAP (MOLAP)– Array-based multidimensional storage engine (sparse matrix

techniques)– fast indexing to pre-computed summarized data

• Hybrid OLAP (HOLAP)– User flexibility, e.g., low level: relational, high-level: array

• Specialized SQL servers– specialized support for SQL queries over star/snowflake schemas

Metadata RepositoryMeta data is the data defining warehouse objects. It has the following kinds – Description of the structure of the warehouse

• schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents

– Operational meta-data• data lineage (history of migrated data and transformation path), currency

of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

– The algorithms used for summarization– The mapping from operational environment to the data

warehouse– Data related to system performance

• warehouse schema, view and derived data definitions– Business data

• business terms and definitions, ownership of data, charging policies

Data Warehouse Back-End Tools, Utilities• Data extraction:

– get data from multiple, heterogeneous, and external sources

• Data cleaning:– detect errors in the data and rectify them when

possible• Data transformation:

– convert data from legacy or host format to warehouse format

• Load– sort, summarize, consolidate, compute views, check

integrity, and build indices and partitions• Refresh

– propagate the updates from the data sources to the warehouse

Advanced examples

Exploration of Data CubesHypothesis-driven: exploration by user, huge search space

Discovery-driven – pre-computed measures indicate exceptions, guide user in the

data analysis, at all levels of aggregation

– Exception: significantly different from the value anticipated, based on a statistical model

– Visual cues such as background color are used to reflect the degree of exception of each cell

– Computation of exception indicator can be included in cube constructionSelfExp: degree of surprise in cell, relative to values at same levels of aggregationInExp: degree of surprise somewhere beneath the cell, if we drill downPathExp: degree of surprise for each drill down path from cell

Advanced examplesExample: Discovery-driven exploration

Advanced examples

Complex Aggregation at Multiple Granularities

• Ex. Total sales in 2000 by Item, Region, Month, with subtotals

• Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales generated by all maximum-price-sales

• Ex. Among the max-price-sales, find the min and max shelf life. Find the fraction of the total sales due to cases that have min shelf life.

Advances examples

Sales#units,$value

Supplier

Product

Sales#units,$value Product

Sales%sales

Supplier

Product

Sales volume as a % of total units sold of Product

Ordering

Sales#units,$value

Supplier

Product

Group sales by contiguous 10-day intervals.

10 day Moving-avg of Sales, by Product

Order Products by Sales-$ and group into deciles of decreasing performance.

data warehousing

Documents

analysis of data

access of data

data warehousinghan

data warehouse environment

data warehouse nonvolatile

operational update of

chicagoclient data source

various sources data