database management systems (l5)
CMPUT 391 Database Management Systems
Data Warehousing & OLAP
Textbook: 17.1 – 17.5 (first edition: 19.1 – 19.5)
Based on slides by Lewis, Bernstein and Kifer and other sources
University of Alberta, Dr. Jörg Sander, 2006. CMPUT 391 – Database Management Systems
Why Data Warehouses
• Businesses have a lot of data, operational data and facts, stored in heterogeneous and distributed databases:
  – in different databases
  – in different physical locations
  – in different formats
• Decision makers need fast access to this information in a summarized form, often with a focus on historical data
What Is a Data Warehouse?
A data warehouse
• consolidates the information from different data sources, enabling OLAP (online analytical processing) to support decision making.
• is maintained separately from an operational database (which is used for OLTP – online transaction processing).
[Figure: corporate data, four data marts, and the corporate data warehouse. Option 1: consolidate data marts. Option 2: build from scratch.]
OLTP Compared With OLAP
• Online Transaction Processing (OLTP)
  – Maintain a database that is an accurate model of some real-world enterprise
    • Short, simple transactions
    • Relatively frequent updates
    • Transactions access only a small fraction of the database
• Online Analytical Processing (OLAP)
  – Use information in the database to guide strategic decisions
    • Complex aggregation queries
    • Infrequent updates
    • Transactions access a large fraction of the database
Why Do We Separate DWs From Operational DBs?
• Performance reasons:
  – OLAP necessitates special data organization that supports multidimensional views.
  – OLAP queries would degrade the performance of the operational DB.
  – OLAP is read-only, so it needs no concurrency control and recovery.
• Decision support requires historical data.
• Decision support requires consolidated data.
Fact Tables
• Many OLAP applications are based on a fact table
• For example, a supermarket application might be based on a table
    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• The table can be viewed as a multidimensional data cube
  – The first three columns are the dimensions, representing specific
    • supermarkets
    • products
    • time intervals
  – The fourth column, Sales_Amt, is a function of the other three, called a measure
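To make the cube view concrete, here is a minimal Python sketch. The sample rows and the helper name `measure` are made up for illustration; only the column layout comes from the Sales table above.

```python
# Hypothetical sample facts: (Market_Id, Product_Id, Time_Id, Sales_Amt)
sales = [
    ("M1", "P1", "T1", 100),
    ("M1", "P1", "T2", 150),
    ("M1", "P2", "T1", 200),
    ("M2", "P1", "T1", 50),
]

# The three dimension columns identify a cell; the measure is the cell's value.
cube = {(m, p, t): amt for m, p, t, amt in sales}

# Each axis of the cube is the set of distinct values of one dimension.
markets = sorted({m for m, _, _, _ in sales})
products = sorted({p for _, p, _, _ in sales})
times = sorted({t for _, _, t, _ in sales})


def measure(market, product, time):
    """Return Sales_Amt for one cell, or None if no fact was recorded."""
    return cube.get((market, product, time))


print(measure("M1", "P1", "T2"))  # a recorded cell
print(measure("M2", "P2", "T1"))  # an empty cell
```

Most cells of a real cube are empty, which is one reason a fact table stores only the recorded combinations.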
Dimension Tables
• The dimensions of the fact table can be further described with dimension tables
• Fact table
  – Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
• Dimension tables
  – Market (Market_Id, City, Province, Region)
  – Product (Product_Id, Name, Category, Price)
  – Time (Time_Id, Week, Month, Quarter)
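A sketch of these tables in SQLite through Python's `sqlite3` module may help. The column types, key declarations, and the single sample row per table are assumptions; the slides give only table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market  (Market_Id TEXT PRIMARY KEY, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Product (Product_Id TEXT PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
CREATE TABLE Time    (Time_Id TEXT PRIMARY KEY, Week TEXT, Month TEXT, Quarter INTEGER);
-- The fact table references every dimension table (assumed keys, sample data).
CREATE TABLE Sales (
    Market_Id  TEXT REFERENCES Market(Market_Id),
    Product_Id TEXT REFERENCES Product(Product_Id),
    Time_Id    TEXT REFERENCES Time(Time_Id),
    Sales_Amt  REAL,
    PRIMARY KEY (Market_Id, Product_Id, Time_Id)
);
INSERT INTO Market  VALUES ('M1', 'Edmonton', 'Alberta', 'West');
INSERT INTO Product VALUES ('P1', 'Milk', 'Dairy', 2.5);
INSERT INTO Time    VALUES ('T1', 'Wk-12', 'Mar', 1);
INSERT INTO Sales   VALUES ('M1', 'P1', 'T1', 300.0);
""")

# Joining the fact row to each dimension table describes it in readable terms.
row = conn.execute("""
    SELECT M.City, P.Name, T.Quarter, S.Sales_Amt
    FROM Sales S
    JOIN Market  M ON M.Market_Id  = S.Market_Id
    JOIN Product P ON P.Product_Id = S.Product_Id
    JOIN Time    T ON T.Time_Id    = S.Time_Id
""").fetchone()
print(row)
```

The fact table holds only keys and measures; descriptive attributes such as City or Category live once in the dimension tables.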
Star Schema
• The fact and dimension relations can be displayed in an E-R diagram, which suggests a star and is called a star schema
Table View of a Star Schema
• Sales fact table: Time, Product, Store, Customer, unit_sales, dollar_sales
  – unit_sales and dollar_sales are two different measures
• Dimension tables
  – Time (TimeId, Day, Month, Year)
  – Cust (CustId, CustName, CustCity, CustCountry)
  – Product (ProductNo, ProdName, ProdDesc, Category)
  – Store (StoreID, City, Province, Country, Region)
(Source: JH)
Aggregation
• Many OLAP queries involve aggregation of the data in the fact table
• For example, to find the total sales (over time) of each product in each market, we might use
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data
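The query can be tried against a small in-memory SQLite table. This is only a sketch: the rows are invented, and the `ORDER BY` is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER)"
)
# Hypothetical facts: two time periods for (M1, P1), one for the others.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", "T1", 1000), ("M1", "P1", "T2", 2003),
    ("M2", "P1", "T1", 1503),
    ("M1", "P2", "T1", 6003),
])

rows = conn.execute("""
    SELECT S.Market_Id, S.Product_Id, SUM(S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
    ORDER BY S.Market_Id, S.Product_Id
""").fetchall()

# Rearranged as a two-dimensional view: (product, market) -> total over time.
table = {(p, m): total for m, p, total in rows}
print(table)
```

Time_Id no longer appears in the result, which is what "aggregating over the time dimension" means in practice.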
Aggregation over Time
• The output of the previous query: SUM(Sales_Amt) by Product_Id (rows) and Market_Id (columns)
Product_Id   M1     M2     M3     M4
P1           3003   1503   …
P2           6003   2402   …
P3           4503   3      …
P4           7503   7000   …
P5           …      …      …
Concept Hierarchies
Many dimensions form an aggregation hierarchy (total or partial orders)
Examples:
• Markets (Market_Id → City → Province → Country → Region)
• Time (year → quarter → week or month → day), where week and month are alternative branches between quarter and day
Drilling Down and Rolling Up
• Executing a series of queries that moves down a hierarchy (e.g., from aggregation over regions to that over provinces) is called drilling down
  – Requires the use of the fact table or information more specific than the requested aggregation (e.g., cities)
• Executing a series of queries that moves up the hierarchy (e.g., from provinces to regions) is called rolling up
  – Note: In a rollup, coarser aggregations can be computed using prior queries for finer aggregations
Drilling Down
Drilling down on market: from Region to Province
  Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
  Market (Market_Id, City, Province, Region)
1. SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.Region
2. SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt)
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.Province
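The two queries differ only in the Market column they group on, so a drill-down can be sketched as one parameterized query. The helper name `sales_by` and the sample rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id TEXT, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Sales  (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
INSERT INTO Market VALUES ('M1', 'Edmonton',  'Alberta', 'West'),
                          ('M2', 'Calgary',   'Alberta', 'West'),
                          ('M3', 'Vancouver', 'BC',      'West');
INSERT INTO Sales VALUES ('M1', 'P1', 'T1', 10), ('M2', 'P1', 'T1', 20),
                         ('M3', 'P1', 'T1', 40), ('M1', 'P2', 'T1', 5);
""")


def sales_by(level):
    """Total sales per product at one level of the Market hierarchy."""
    # The level name is spliced into the SQL text, so only known columns are allowed.
    if level not in ("Region", "Province", "City"):
        raise ValueError(f"unknown hierarchy level: {level}")
    return conn.execute(f"""
        SELECT S.Product_Id, M.{level}, SUM(S.Sales_Amt)
        FROM Sales S, Market M
        WHERE M.Market_Id = S.Market_Id
        GROUP BY S.Product_Id, M.{level}
        ORDER BY S.Product_Id, M.{level}
    """).fetchall()


by_region = sales_by("Region")      # coarse view (query 1)
by_province = sales_by("Province")  # one level down (query 2)
print(by_region)
print(by_province)
```

Both calls read the fact table again: the province totals cannot be derived from the region totals alone.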
Rolling Up
Rolling up on market, from Province to Region
If we have already created a table, Province_Sales, using
1. SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt) AS Sales_Amt
   INTO Province_Sales
   FROM Sales S, Market M
   WHERE M.Market_Id = S.Market_Id
   GROUP BY S.Product_Id, M.Province
then we can roll up from there to:
2. SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)
   FROM Province_Sales T, (SELECT DISTINCT Province, Region FROM Market) M
   WHERE M.Province = T.Province
   GROUP BY T.Product_Id, M.Region
Note: Market has one row per market, so joining Province_Sales directly to Market would count a province's total once for every market in that province. The DISTINCT subquery keeps one row per province.
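The roll-up can be checked against the direct aggregation on a small SQLite example. Two assumptions apply: SQLite has no `SELECT ... INTO`, so `CREATE TABLE ... AS` stands in for it, and the province-to-region mapping is read through a `DISTINCT` subquery so that a province with several markets is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id TEXT, City TEXT, Province TEXT, Region TEXT);
CREATE TABLE Sales  (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
INSERT INTO Market VALUES ('M1', 'Edmonton', 'Alberta', 'West'),
                          ('M2', 'Calgary',  'Alberta', 'West'),
                          ('M3', 'Toronto',  'Ontario', 'East');
INSERT INTO Sales VALUES ('M1', 'P1', 'T1', 10), ('M2', 'P1', 'T1', 20),
                         ('M3', 'P1', 'T1', 40);

-- Stand-in for SELECT ... INTO Province_Sales
CREATE TABLE Province_Sales AS
    SELECT S.Product_Id, M.Province, SUM(S.Sales_Amt) AS Sales_Amt
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Province;
""")

# Roll up from the saved province totals, without touching Sales.
rolled_up = conn.execute("""
    SELECT T.Product_Id, M.Region, SUM(T.Sales_Amt)
    FROM Province_Sales T,
         (SELECT DISTINCT Province, Region FROM Market) M
    WHERE M.Province = T.Province
    GROUP BY T.Product_Id, M.Region
    ORDER BY T.Product_Id, M.Region
""").fetchall()

# Reference answer computed from the fact table.
direct = conn.execute("""
    SELECT S.Product_Id, M.Region, SUM(S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY S.Product_Id, M.Region
    ORDER BY S.Product_Id, M.Region
""").fetchall()

print(rolled_up == direct, rolled_up)
```

Alberta has two markets in this data, which is the case where a plain join to Market would double its total.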
Pivoting
• When we view the data as a multi-dimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes
  – Pivoting on dimensions D1,…,Dk in a data cube D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY A1,…,Ak and aggregate over Ak+1,…,An, where Ai is an attribute of the dimension Di
  – Example: Pivoting on Product and Time corresponds to grouping on Product_Id and Quarter and aggregating Sales_Amt over Market_Id:
      SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
      FROM Sales S, Time T
      WHERE T.Time_Id = S.Time_Id
      GROUP BY S.Product_Id, T.Quarter   -- Pivot
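The general definition can be sketched in plain Python: group on the chosen attributes and sum the measure over everything else. The function name `pivot` and the pre-joined sample rows are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical facts, already joined with the Time dimension.
facts = [
    {"Market_Id": "M1", "Product_Id": "P1", "Quarter": 1, "Sales_Amt": 10},
    {"Market_Id": "M2", "Product_Id": "P1", "Quarter": 1, "Sales_Amt": 20},
    {"Market_Id": "M1", "Product_Id": "P1", "Quarter": 2, "Sales_Amt": 5},
    {"Market_Id": "M1", "Product_Id": "P2", "Quarter": 1, "Sales_Amt": 7},
]


def pivot(rows, group_attrs, measure="Sales_Amt"):
    """GROUP BY group_attrs and SUM the measure over all other dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in group_attrs)
        totals[key] += row[measure]
    return dict(totals)


# Pivot on Product and Time: Market_Id is aggregated away.
by_product_quarter = pivot(facts, ["Product_Id", "Quarter"])
# Pivot on Market alone: Product_Id and Quarter are aggregated away.
by_market = pivot(facts, ["Market_Id"])
print(by_product_quarter)
print(by_market)
```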
Dicing
• When we use GROUP BY to specify part of a hierarchy, we are performing a range selection called a dice
  – Dicing Sales in the time dimension: total sales for each product in each quarter.
      SELECT S.Product_Id, T.Quarter, SUM (Sales_Amt)
      FROM Sales S, Time T
      WHERE T.Time_Id = S.Time_Id
      GROUP BY T.Quarter, S.Product_Id   -- Dice
Slicing
• When we use WHERE to specify a particular value for an axis (or several axes), we are performing a slice
  – Slicing the data cube in the Time dimension (choosing sales only in week 12), then pivoting to Product_Id (aggregating over Market_Id)
      SELECT S.Product_Id, SUM (Sales_Amt)
      FROM Sales S, Time T
      WHERE T.Time_Id = S.Time_Id AND T.Week = 'Wk-12'   -- Slice
      GROUP BY S.Product_Id
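A runnable sketch of the slice, with the week passed as a query parameter. The helper name `slice_by_week` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time  (Time_Id TEXT, Week TEXT, Month TEXT, Quarter INTEGER);
CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
INSERT INTO Time VALUES ('T1', 'Wk-12', 'Mar', 1), ('T2', 'Wk-13', 'Mar', 1);
INSERT INTO Sales VALUES ('M1', 'P1', 'T1', 10), ('M2', 'P1', 'T1', 20),
                         ('M1', 'P1', 'T2', 99), ('M1', 'P2', 'T1', 5);
""")


def slice_by_week(week):
    """Fix one value on the Time axis, then total sales per product."""
    # WHERE selects the slice. GROUP BY pivots to Product_Id,
    # which aggregates over Market_Id.
    return conn.execute("""
        SELECT S.Product_Id, SUM(S.Sales_Amt)
        FROM Sales S, Time T
        WHERE T.Time_Id = S.Time_Id AND T.Week = ?
        GROUP BY S.Product_Id
        ORDER BY S.Product_Id
    """, (week,)).fetchall()


week_12 = slice_by_week("Wk-12")
week_13 = slice_by_week("Wk-13")
print(week_12)
print(week_13)
```

Changing the argument is the "find the right slice" step: the grouping stays the same while the selected plane of the cube moves.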
Slicing-and-Dicing
• Typically slicing and dicing involves several queries to find the “right slice.” For instance, change the slice and the axes:
• Slicing on Time and Market dimensions, then pivoting to Product_Id and Week (in the time dimension)
    SELECT S.Product_Id, T.Week, SUM (Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
      AND T.Quarter = 4 AND S.Market_Id = 'M1'   -- Slice
    GROUP BY S.Product_Id, T.Week                -- Pivot
The Extended Multi-dimensional Data Cube/Fact Table
[Figure: data cube with axes Year (1999, 2000, 2001, 2002, Sum), Category (Drama, Comedy, …, Sum), and City (Edmonton, Calgary, Lethbridge, Sum). One cell is labeled "All Years, Drama, Edmonton".]
Contains all possible aggregates in addition to the facts in the fact table
The CUBE Operator
• To construct the following table would take 4 queries (next slide)
SUM(Sales_Amt) by Product_Id (rows) and Market_Id (columns)
Product_Id   M1     M2     M3     Total
P1           3003   1503   …      …
P2           6003   2402   …      …
P3           4503   3      …      …
P4           7503   7000   …      …
Total        …      …      …      …
The Four Queries
• For the table entries, without the totals (aggregation on time)
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id, S.Product_Id
• For the row totals (aggregation on time and supermarkets)
    SELECT S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Product_Id
• For the column totals (aggregation on time and products)
    SELECT S.Market_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY S.Market_Id
• For the global total
    SELECT SUM (S.Sales_Amt)
    FROM Sales S
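As a sketch, the four queries can be run against SQLite and merged into one grid with a totals row and column. The sample rows and the `"Total"` label are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER)"
)
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?)", [
    ("M1", "P1", "T1", 10), ("M1", "P1", "T2", 20),
    ("M2", "P1", "T1", 30), ("M1", "P2", "T1", 40),
])

TOTAL = "Total"  # label for the margin row and column (our own convention)
grid = {}        # (product, market) -> amount

# 1. Table entries
for m, p, amt in conn.execute(
        "SELECT Market_Id, Product_Id, SUM(Sales_Amt) FROM Sales "
        "GROUP BY Market_Id, Product_Id"):
    grid[(p, m)] = amt
# 2. Row totals
for p, amt in conn.execute(
        "SELECT Product_Id, SUM(Sales_Amt) FROM Sales GROUP BY Product_Id"):
    grid[(p, TOTAL)] = amt
# 3. Column totals
for m, amt in conn.execute(
        "SELECT Market_Id, SUM(Sales_Amt) FROM Sales GROUP BY Market_Id"):
    grid[(TOTAL, m)] = amt
# 4. Global total
(grand_total,) = conn.execute("SELECT SUM(Sales_Amt) FROM Sales").fetchone()
grid[(TOTAL, TOTAL)] = grand_total

print(grid)
```

Each of the four statements scans Sales separately, which is the repeated work the CUBE operator is meant to avoid.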
Definition of the CUBE Operator
• Doing these four queries is wasteful
  – The first does much of the work of the other three: if we could save that result and aggregate over Market_Id and Product_Id, we could compute the other queries more efficiently
• The CUBE clause is part of SQL:1999
  – GROUP BY CUBE (v1, v2, …, vn)
  – Equivalent to a collection of GROUP BYs, one for each of the 2^n subsets of v1, v2, …, vn
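The "one GROUP BY per subset" reading can be sketched in plain Python. This emulates the result of CUBE on a tiny list of rows; it is not how a database engine implements it, and the function name `cube` and the sample data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical facts, already aggregated over time.
facts = [
    {"Market_Id": "M1", "Product_Id": "P1", "Sales_Amt": 30},
    {"Market_Id": "M2", "Product_Id": "P1", "Sales_Amt": 30},
    {"Market_Id": "M1", "Product_Id": "P2", "Sales_Amt": 40},
]


def cube(rows, attrs, measure="Sales_Amt"):
    """Emulate GROUP BY CUBE(attrs): one grouping per subset of attrs.

    Attributes that are aggregated away appear as None in the key,
    mirroring the NULLs that SQL's CUBE places in those columns.
    """
    result = defaultdict(int)
    for k in range(len(attrs) + 1):
        for subset in combinations(attrs, k):  # 2^n subsets in total
            for row in rows:
                key = tuple(row[a] if a in subset else None for a in attrs)
                result[key] += row[measure]
    return dict(result)


totals = cube(facts, ["Market_Id", "Product_Id"])
for key, amt in sorted(totals.items(), key=str):
    print(key, amt)
```

With two attributes there are four groupings: the cell entries, the per-market totals, the per-product totals, and the grand total.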
Example of CUBE Operator
• The following query returns all the information needed to obtain the previous products/markets table:
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY CUBE (S.Market_Id, S.Product_Id)
ROLLUP
• ROLLUP is similar to CUBE except that instead of aggregating over all subsets of the arguments, it creates subsets moving from right to left
• GROUP BY ROLLUP (A1, A2, …, An) is a series of these aggregations:
  – GROUP BY A1, …, An-1, An
  – GROUP BY A1, …, An-1
  – … … …
  – GROUP BY A1, A2
  – GROUP BY A1
  – No GROUP BY
• ROLLUP is also in SQL:1999
Example of ROLLUP Operator
    SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
    FROM Sales S
    GROUP BY ROLLUP (S.Market_Id, S.Product_Id)
– first aggregates with the finest granularity:
    GROUP BY S.Market_Id, S.Product_Id
– then with the next level of granularity:
    GROUP BY S.Market_Id
– then the grand total is computed with no GROUP BY clause
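ROLLUP's groupings are the prefixes of the argument list, which a short Python sketch can show. As before, this emulates the result on invented rows; the function name `rollup` is an assumption.

```python
from collections import defaultdict

# Hypothetical facts, already aggregated over time.
facts = [
    {"Market_Id": "M1", "Product_Id": "P1", "Sales_Amt": 30},
    {"Market_Id": "M2", "Product_Id": "P1", "Sales_Amt": 30},
    {"Market_Id": "M1", "Product_Id": "P2", "Sales_Amt": 40},
]


def rollup(rows, attrs, measure="Sales_Amt"):
    """Emulate GROUP BY ROLLUP(attrs): group on each prefix of attrs.

    Prefixes are taken longest first, dropping attributes from the right.
    Dropped attributes appear as None in the key.
    """
    result = defaultdict(int)
    for k in range(len(attrs), -1, -1):
        prefix = attrs[:k]
        for row in rows:
            key = tuple(row[a] if a in prefix else None for a in attrs)
            result[key] += row[measure]
    return dict(result)


totals = rollup(facts, ["Market_Id", "Product_Id"])
for key, amt in sorted(totals.items(), key=str):
    print(key, amt)
```

The result has per-market totals and a grand total but no per-product totals, since Product_Id alone is not a prefix of (Market_Id, Product_Id).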
ROLLUP vs. CUBE
• The same query with CUBE:
  – first aggregates with the finest granularity:
      GROUP BY S.Market_Id, S.Product_Id
  – then with the next level of granularity (both subsets):
      GROUP BY S.Market_Id
      GROUP BY S.Product_Id
  – then the grand total with no GROUP BY
Materialized Views
The CUBE operator is often used to pre-compute aggregations on all dimensions of a fact table and then save them as materialized views to speed up future queries
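The idea of answering later queries from a saved aggregate can be sketched in SQLite. SQLite has neither CUBE nor materialized views, so this example stores a single precomputed grouping in an ordinary table with a hypothetical name. A real materialized view would also be refreshed when the fact table changes, which this sketch does not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (Market_Id TEXT, Product_Id TEXT, Time_Id TEXT, Sales_Amt INTEGER);
INSERT INTO Sales VALUES ('M1', 'P1', 'T1', 10), ('M1', 'P1', 'T2', 20),
                         ('M2', 'P1', 'T1', 30), ('M1', 'P2', 'T1', 40);

-- Stand-in for a materialized view (hypothetical table name)
CREATE TABLE Sales_By_Market_Product AS
    SELECT Market_Id, Product_Id, SUM(Sales_Amt) AS Sales_Amt
    FROM Sales
    GROUP BY Market_Id, Product_Id;
""")

# A coarser aggregate answered from the small summary table.
per_product = conn.execute("""
    SELECT Product_Id, SUM(Sales_Amt)
    FROM Sales_By_Market_Product
    GROUP BY Product_Id
    ORDER BY Product_Id
""").fetchall()

# The same aggregate computed from the fact table, for comparison.
from_facts = conn.execute("""
    SELECT Product_Id, SUM(Sales_Amt)
    FROM Sales
    GROUP BY Product_Id
    ORDER BY Product_Id
""").fetchall()

print(per_product == from_facts, per_product)
```

This works because SUM can be computed from partial sums; the summary table has one row per (market, product) pair regardless of how many time periods the fact table covers.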
ROLAP and MOLAP
• Relational OLAP: ROLAP
  – OLAP data is stored in a relational database as previously described. A data cube is a way to think about a fact table.
• Multidimensional OLAP: MOLAP
  – Vendor provides an OLAP server that implements a fact table as a data cube using some multi-dimensional (non-relational) implementation.
  – Vendors provide proprietary, perhaps visual, languages that allow unsophisticated users to make queries that involve pivots, drilling down, or rolling up
Implementation Issues
• OLAP applications are characterized by a very large amount of data that is relatively static, with infrequent updates
  – Thus, various aggregations can be precomputed and stored in the database
  – Star joins, join indices, and bitmap indices can be used to improve efficiency
  – Since updates are infrequent, the inefficiencies associated with updates are minimized
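The slides name bitmap indices without describing them. As a rough sketch of the general idea (not any particular system's implementation), each distinct column value gets a bit vector with one bit per fact row, and a conjunction of conditions becomes a bitwise AND. The rows and the helper name `build_bitmap_index` are assumptions.

```python
# Hypothetical fact rows: (Market_Id, Product_Id, Sales_Amt).
# A row's position in the list is its bit position in every bitmap.
rows = [
    ("M1", "P1", 10),
    ("M2", "P1", 20),
    ("M1", "P2", 30),
    ("M1", "P1", 40),
]


def build_bitmap_index(rows, column):
    """Map each distinct value of a column to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


market_index = build_bitmap_index(rows, 0)
product_index = build_bitmap_index(rows, 1)

# Market_Id = 'M1' AND Product_Id = 'P1' becomes a single bitwise AND.
matches = market_index["M1"] & product_index["P1"]
selected = [i for i in range(len(rows)) if (matches >> i) & 1]
total = sum(rows[i][2] for i in selected)

print(bin(matches), selected, total)
```

Updating such an index means touching a bitmap for every changed row, which is tolerable here because, as the slide notes, updates are infrequent.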