database management systems (l5)

CMPUT 391 – Database Management Systems

Data Warehousing & OLAP

Textbook: 17.1 – 17.5 (first edition: 19.1 – 19.5)

Based on slides by Lewis, Bernstein and Kifer, and other sources

University of Alberta, Dr. Jörg Sander, 2006

Why Data Warehouses

• Businesses have a lot of data, operational data and facts, stored in heterogeneous and distributed databases:
  – in different databases
  – in different physical locations
  – in different formats
• Decision makers need fast access to this information in a summarized form, often with a focus on historical data

What Is a Data Warehouse?

A data warehouse:
• consolidates the information from different data sources, enabling OLAP (online analytical processing) to help decision support.
• is maintained separately from an operational database (which is used for OLTP – online transaction processing).

[Figure: building a corporate data warehouse from corporate data and data marts. Option 1: consolidate data marts. Option 2: build from scratch.]

OLTP Compared With OLAP

• Online Transaction Processing (OLTP)
  – Maintain a database that is an accurate model of some real-world enterprise
    • Short simple transactions
    • Relatively frequent updates
    • Transactions access only a small fraction of the database
• Online Analytical Processing (OLAP)
  – Use information in the database to guide strategic decisions
    • Complex aggregation queries
    • Infrequent updates
    • Transactions access a large fraction of the database

Why Do We Separate DWs From Operational DBs?

• Performance reasons:
  – OLAP necessitates special data organization that supports multidimensional views.
  – OLAP queries would degrade the performance of the operational DB.
  – OLAP is read only.
  – No concurrency control and recovery are needed.
• Decision support requires historical data.
• Decision support requires consolidated data.

Fact Tables

• Many OLAP applications are based on a fact table
• For example, a supermarket application might be based on a table

    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

• The table can be viewed as a multidimensional data cube
  – The first three columns are the dimensions, representing specific
    • supermarkets
    • products
    • time intervals
  – The fourth column, Sales_Amt, is a function of the other three, called a measure

Dimension Tables

• The dimensions of the fact table can be further described with dimension tables
• Fact table
  – Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• Dimension tables
  – Market (Market_Id, City, Province, Region)
  – Product (Product_Id, Name, Category, Price)
  – Time (Time_Id, Week, Month, Quarter)
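The fact and dimension tables above could be set up as follows. This is only a minimal sketch using Python's built-in sqlite3 module: the column types, the key constraints, and all sample rows are assumptions made for illustration, and only the table and column names come from the slides.

```python
import sqlite3

# In-memory database; every data value below is hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on
conn.executescript("""
CREATE TABLE Market  (Market_Id  TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Product (Product_Id TEXT PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
CREATE TABLE Time    (Time_Id    TEXT PRIMARY KEY, Week TEXT, Month TEXT, Quarter INTEGER);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE Sales (
    Market_Id  TEXT REFERENCES Market(Market_Id),
    Product_Id TEXT REFERENCES Product(Product_Id),
    Time_Id    TEXT REFERENCES Time(Time_Id),
    Sales_Amt  REAL,
    PRIMARY KEY (Market_Id, Product_Id, Time_Id)
);
""")
conn.executemany("INSERT INTO Market VALUES (?, ?, ?, ?)", [
    ("M1", "Edmonton", "Alberta", "West"),
    ("M2", "Calgary", "Alberta", "West"),
])
conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?)", [
    ("P1", "Milk", "Dairy", 2.5),
    ("P2", "Bread", "Bakery", 3.0),
])
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?)", [
    ("T1", "Wk-1", "Jan", 1),
    ("T2", "Wk-12", "Mar", 1),
])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", "T1", 100.0),
    ("M2", "P2", "T2", 20.0),
])

# Joining the fact table with its dimension tables describes each sale in more detail.
rows = conn.execute("""
    SELECT M.City, P.Name, T.Quarter, S.Sales_Amt
    FROM Sales S
    JOIN Market  M ON M.Market_Id  = S.Market_Id
    JOIN Product P ON P.Product_Id = S.Product_Id
    JOIN Time    T ON T.Time_Id    = S.Time_Id
    ORDER BY S.Market_Id
""").fetchall()
print(rows)
```

Declaring the three dimension keys as the primary key of Sales is a design assumption: it treats each (market, product, time) combination as one cell of the data cube.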

Star Schema

• The fact and dimension relations can be displayed in an E-R diagram whose shape suggests a star; this arrangement is called a star schema

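Table View of a Star Schema

[Figure: a Sales Fact Table surrounded by its dimension tables. (Source: JH)]

• Sales Fact Table: references the dimensions (Product, Customer, ...) and stores two different measures, unit_sales and dollar_sales
• Dimension tables:
  – time dimension (TimeId, Day, Month, Year)
  – Customer (CustId, CustName, CustCity, CustCountry)
  – Product (ProductNo, ProdName, ProdDesc, Category)
  – store dimension (StoreID, City, Province, Country, Region)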

Aggregation

• Many OLAP queries involve aggregation of the data in the fact table
• For example, to find the total sales (over time) of each product in each market, we might use

    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id

• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data

Aggregation over Time

• The output of the previous query

SUM(Sales_Amt)        Market_Id
                 M1      M2      M3      M4
Product_Id  P1   3003    1503    …
            P2   6003    2402    …
            P3   4503    3       …
            P4   7503    7000    …
            P5   …       …       …
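The aggregation over time and its two-dimensional layout can be tried on a small in-memory database. The sketch below uses Python's sqlite3 module. The sample rows are hypothetical and are not the figures shown on the slide, and the `cube` dictionary is just one assumed way to arrange the result as a products-by-markets table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER)"
)
# Hypothetical sample facts: several time periods per (market, product) pair.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", "T1", 100), ("M1", "P1", "T2", 50), ("M2", "P1", "T2", 30),
    ("M1", "P2", "T3", 70), ("M2", "P2", "T4", 20),
])

# Aggregate over the whole time dimension: one total per (market, product).
totals = conn.execute("""
    SELECT S.Market_Id, S.Product_Id, SUM(S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
""").fetchall()

# Arrange the flat result as a two-dimensional view: cube[product][market].
cube = {}
for market, product, total in totals:
    cube.setdefault(product, {})[market] = total

markets = sorted({market for market, _, _ in totals})
print("SUM(Sales_Amt)", *markets)
for product in sorted(cube):
    # A missing (product, market) cell is shown as 0 in this sketch.
    print(product, *(cube[product].get(m, 0) for m in markets))
```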

Concept-Hierarchies

Many dimensions form an aggregation hierarchy (total or partial orders)

Examples:

Markets(Market_Id → City → Province → Country → Region)

Time (year → quarter → … → day, with week as a further level)

Drilling Down and Rolling Up

• Executing a series of queries that moves down a hierarchy (e.g., from aggregation over regions to that over provinces) is called drilling down
  – Requires the use of the fact table or information more specific than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the hierarchy (e.g., from provinces to regions) is called rolling up
  – Note: In a rollup, coarser aggregations can be computed using prior queries for finer aggregations

Drilling Down

Drilling down on market: from Region to Province

Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Market (Market_Id, City, Province, Region)

1.  SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region

2.  SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province
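Drilling down amounts to re-running the same aggregation with a finer grouping attribute, each time against the fact table. The following sketch, using Python's sqlite3 module, shows this with a hypothetical helper `sales_by` and invented sample rows; only the table and column names are taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Sales  (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
-- Hypothetical sample data.
INSERT INTO Market VALUES
    ('M1', 'Edmonton', 'Alberta', 'West'),
    ('M2', 'Calgary',  'Alberta', 'West'),
    ('M3', 'Toronto',  'Ontario', 'East');
INSERT INTO Sales VALUES
    ('M1', 'P1', 'T1', 100), ('M2', 'P1', 'T2', 30), ('M3', 'P1', 'T3', 200),
    ('M1', 'P2', 'T3', 70),  ('M3', 'P2', 'T2', 10);
""")

# Levels of the Market hierarchy, coarsest first.
LEVELS = ("Region", "Province", "City")

def sales_by(level):
    """Total sales per product at one level of the Market hierarchy."""
    if level not in LEVELS:  # only known column names are spliced into the SQL text
        raise ValueError(f"unknown level: {level}")
    return conn.execute(f"""
        SELECT S.Product_Id, M.{level}, SUM(S.Sales_Amt)
        FROM Sales S, Market M
        WHERE M.Market_Id = S.Market_Id
        GROUP BY S.Product_Id, M.{level}
        ORDER BY S.Product_Id, M.{level}
    """).fetchall()

by_region = sales_by("Region")      # coarse view
by_province = sales_by("Province")  # one step down the hierarchy
print(by_region)
print(by_province)
```

Each call goes back to the Sales table, which reflects the point that a drill-down needs data more specific than the aggregation it starts from.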

Rolling Up

Rolling up on market, from Province to Region

If we have already created a table, Province_Sales, using

1.  SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt) AS Sales_Amt
    INTO Province_Sales
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province

then we can roll up from there to:

2.  SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)
    FROM Province_Sales T, (SELECT DISTINCT Province, Region FROM Market) M
    WHERE M.Province = T.Province
    GROUP BY T.Product_Id, M.Region

Note: the sum is named Sales_Amt in query 1 so that query 2 can refer to it, and Market is reduced to distinct (Province, Region) pairs so that a province with several markets is not counted more than once.
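A roll-up from a stored province-level aggregate can be checked against the same aggregation computed directly from the fact table. The sketch below uses Python's sqlite3 module with hypothetical sample rows. Two things are assumptions of the sketch rather than part of the slides: SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is used instead, and the province-to-region mapping is deduplicated before the join because Market holds one row per market.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Sales  (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
-- Hypothetical sample data; Alberta deliberately has two markets.
INSERT INTO Market VALUES
    ('M1', 'Edmonton', 'Alberta', 'West'),
    ('M2', 'Calgary',  'Alberta', 'West'),
    ('M3', 'Toronto',  'Ontario', 'East');
INSERT INTO Sales VALUES
    ('M1', 'P1', 'T1', 100), ('M2', 'P1', 'T2', 30), ('M3', 'P1', 'T3', 200),
    ('M1', 'P2', 'T3', 70),  ('M3', 'P2', 'T2', 10);

-- Finer aggregation, computed once from the fact table.
CREATE TABLE Province_Sales AS
    SELECT S.Product_Id, M.Province, SUM(S.Sales_Amt) AS Sales_Amt
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province;
""")

# Roll up: the coarser aggregation is computed from Province_Sales alone.
rolled_up = conn.execute("""
    SELECT T.Product_Id, PR.Region, SUM(T.Sales_Amt)
    FROM Province_Sales T, (SELECT DISTINCT Province, Region FROM Market) PR
    WHERE PR.Province = T.Province
    GROUP BY T.Product_Id, PR.Region
    ORDER BY T.Product_Id, PR.Region
""").fetchall()

# Reference result: the same aggregation straight from the fact table.
direct = conn.execute("""
    SELECT S.Product_Id, M.Region, SUM(S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region
    ORDER BY S.Product_Id, M.Region
""").fetchall()

# Without DISTINCT, each Alberta total is joined to two Market rows and counted twice.
naive = conn.execute("""
    SELECT T.Product_Id, M.Region, SUM(T.Sales_Amt)
    FROM Province_Sales T, Market M
    WHERE M.Province = T.Province
    GROUP BY T.Product_Id, M.Region
    ORDER BY T.Product_Id, M.Region
""").fetchall()

print(rolled_up == direct, naive)
```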

Pivoting

• When we view the data as a multi-dimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes
  – Pivoting on dimensions D1,…,Dk in a data cube D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY A1,…,Ak and aggregate over Ak+1,…,An, where Ai is an attribute of the dimension Di
  – Example: Pivoting on Product and Time corresponds to grouping on Product_Id and Quarter and aggregating Sales_Amt over Market_Id:

    SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
    GROUP BY S.Product_Id, T.Quarter
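Since a pivot is a GROUP BY on one attribute per chosen dimension, it can be written once and parameterized by the list of dimensions. The sketch below does this in Python with sqlite3. The helper `pivot`, the mapping `PIVOT_ATTRS` (which attribute represents each dimension), and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time  (Time_Id TEXT PRIMARY KEY, Week TEXT, Month TEXT, Quarter INTEGER);
CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
-- Hypothetical sample data.
INSERT INTO Time VALUES
    ('T1', 'Wk-1', 'Jan', 1), ('T2', 'Wk-12', 'Mar', 1), ('T3', 'Wk-40', 'Oct', 4);
INSERT INTO Sales VALUES
    ('M1', 'P1', 'T1', 100), ('M2', 'P1', 'T2', 30), ('M3', 'P1', 'T3', 200),
    ('M1', 'P2', 'T3', 70),  ('M3', 'P2', 'T2', 10);
""")

# Assumed choice of grouping attribute Ai for each dimension Di.
PIVOT_ATTRS = {"Market": "S.Market_Id", "Product": "S.Product_Id", "Time": "T.Quarter"}

def pivot(conn, dimensions):
    """Group on the chosen dimensions and aggregate Sales_Amt over all the others.

    Expects at least one dimension; an unknown name raises KeyError, so only
    the whitelisted column expressions ever reach the SQL text.
    """
    cols = ", ".join(PIVOT_ATTRS[d] for d in dimensions)
    sql = (
        f"SELECT {cols}, SUM(S.Sales_Amt) "
        "FROM Sales S, Time T WHERE T.Time_Id = S.Time_Id "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

by_product_quarter = pivot(conn, ["Product", "Time"])  # aggregates over Market
by_product = pivot(conn, ["Product"])                  # aggregates over Market and Time
print(by_product_quarter)
print(by_product)
```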

Dicing

• When we use GROUP BY to specify part of a hierarchy, we are performing a range selection called a dice
  – Dicing Sales in the time dimension: total sales for each product in each quarter.

    SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
    GROUP BY T.Quarter, S.Product_Id

Slicing

• When we use WHERE to specify a particular value for an axis (or several axes), we are performing a slice
  – Slicing the data cube in the Time dimension (choosing sales only in week 12), then pivoting to Product_Id (aggregating over Market_Id)

    SELECT S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'
    GROUP BY S.Product_Id

Slicing-and-Dicing

• Typically slicing and dicing involves several queries to find the “right slice.” For instance, change the slice and the axes:

• Slicing on the Time and Market dimensions, then pivoting to Product_Id and Week (in the time dimension)

    SELECT S.Product_Id, T.Week, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
      AND T.Quarter = 4
      AND S.Market_Id = 'M1'
    GROUP BY S.Product_Id, T.Week
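The two slices discussed above differ only in their WHERE conditions and grouping attributes, which is easy to see when both run on the same data. The sketch below uses Python's sqlite3 module with hypothetical sample rows; passing the slice values as query parameters is a choice made for this example, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time  (Time_Id TEXT PRIMARY KEY, Week TEXT, Month TEXT, Quarter INTEGER);
CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
-- Hypothetical sample data.
INSERT INTO Time VALUES
    ('T1', 'Wk-1',  'Jan', 1), ('T2', 'Wk-12', 'Mar', 1),
    ('T3', 'Wk-40', 'Oct', 4), ('T4', 'Wk-41', 'Oct', 4);
INSERT INTO Sales VALUES
    ('M1', 'P1', 'T1', 100), ('M1', 'P1', 'T2', 50), ('M2', 'P1', 'T2', 30),
    ('M3', 'P2', 'T2', 10),  ('M1', 'P2', 'T3', 70), ('M1', 'P1', 'T4', 40),
    ('M3', 'P1', 'T3', 200), ('M2', 'P2', 'T4', 20);
""")

# Slice on Time (a single week), then pivot to Product_Id.
week_slice = conn.execute("""
    SELECT S.Product_Id, SUM(S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id AND T.Week = ?
    GROUP BY S.Product_Id
    ORDER BY S.Product_Id
""", ("Wk-12",)).fetchall()

# Change the slice and the axes: fix a quarter and a market, group by product and week.
quarter_market_slice = conn.execute("""
    SELECT S.Product_Id, T.Week, SUM(S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id AND T.Quarter = ? AND S.Market_Id = ?
    GROUP BY S.Product_Id, T.Week
    ORDER BY S.Product_Id, T.Week
""", (4, "M1")).fetchall()

print(week_slice)
print(quarter_market_slice)
```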

The Extended Multi-dimensional Data Cube/Fact Table

[Figure: a data cube extended with Sum totals; the visible axes show years (1999, 2000, 2002, Sum) and categories (Drama, Comedy, …).]
