2
Data Warehousing & OLAP
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization’s operational database
  Support information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses
3
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
4
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
  relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    E.g., hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted.
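The hotel price example can be sketched in a few lines. This is only an illustration of the conversion step: the exchange rates, the flat breakfast value, and the names `SourcePrice` and `to_warehouse_price` are all hypothetical, and the chosen warehouse convention (USD, tax included, breakfast excluded) is an assumption, not something the slides prescribe.

```python
from dataclasses import dataclass

# Hypothetical reference data; a real warehouse would load these from a lookup table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "CAD": 0.8}
BREAKFAST_USD = 10.0  # assumed flat value of an included breakfast


@dataclass
class SourcePrice:
    """A hotel price as one source system reports it."""
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float
    breakfast_included: bool


def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to the assumed warehouse convention: USD, tax included, no breakfast."""
    amount = p.amount * RATES_TO_USD[p.currency]
    if not p.tax_included:
        amount *= 1 + p.tax_rate
    if p.breakfast_included:
        amount -= BREAKFAST_USD
    return round(amount, 2)


# Two sources describing prices under different conventions.
eur_price = to_warehouse_price(SourcePrice(100.0, "EUR", False, 0.10, True))
usd_price = to_warehouse_price(SourcePrice(200.0, "USD", True, 0.10, False))
```

Once both prices follow the same convention they can be compared or aggregated directly.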
5
Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  Operational database: current value data.
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
  Contains an element of time, explicitly or implicitly
  But the key of operational data may or may not contain a “time element”.
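A minimal sketch of the difference in key structure, assuming a hypothetical customer record keyed by `customer_id`: the operational store keeps only the current value, while the warehouse key carries an explicit snapshot date. The names `operational`, `warehouse`, and `load_snapshot` are illustrative.

```python
from datetime import date

operational = {}  # customer_id -> current city only
warehouse = {}    # (customer_id, snapshot_date) -> city as of that date


def load_snapshot(as_of: date) -> None:
    """Copy the current operational values into the warehouse under a time-stamped key."""
    for customer_id, city in operational.items():
        warehouse[(customer_id, as_of)] = city


operational[42] = "Toronto"
load_snapshot(date(2023, 1, 1))
operational[42] = "Vancouver"  # the operational update overwrites the old value
load_snapshot(date(2024, 1, 1))
```

After the second load the operational store knows only "Vancouver", while the warehouse still holds both historical values.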
6
Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
  Does not require transaction processing, recovery, and concurrency control mechanisms
  Requires only two operations in data accessing: initial loading of data and access of data.
7
Three Data Warehouse Models
Enterprise warehouse
  Collects all of the information about subjects spanning the entire organization
Data mart
  A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
Virtual warehouse
  A set of views over operational databases
  Only some of the possible summary views may be materialized
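The virtual warehouse idea can be sketched with SQLite through Python's `sqlite3` module. The `orders` table and its rows are made up for illustration. SQLite has no materialized views, so the "materialized" summary is emulated here with `CREATE TABLE ... AS SELECT`, which is an assumption of this sketch rather than how a production system would do it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, dept TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'marketing', 100.0), (2, 'marketing', 50.0),
                          (3, 'finance', 70.0);

-- Virtual summary: a view evaluated against the operational table on every query.
CREATE VIEW sales_by_dept AS
    SELECT dept, SUM(amount) AS total FROM orders GROUP BY dept;

-- Emulated materialization: the summary is computed once and stored.
CREATE TABLE sales_by_dept_mat AS SELECT * FROM sales_by_dept;
""")

# A later operational insert is visible through the view but not in the stored copy.
conn.execute("INSERT INTO orders VALUES (4, 'finance', 30.0)")


def totals(relation: str) -> dict:
    """Read a summary; relation must be one of the two fixed names created above."""
    return dict(conn.execute(f"SELECT dept, total FROM {relation}").fetchall())


live = totals("sales_by_dept")
stored = totals("sales_by_dept_mat")
```

The view always reflects current operational data at query-time cost; the stored copy is cheap to read but must be refreshed.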
8
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse system
  Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries
9
OLTP vs. OLAP
                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad-hoc
access              read/write,                             lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response
10
Why Separate Data Warehouse?
High performance for both systems
  DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
Different functions and different data:
  missing data: Decision support requires historical data which operational DBs do not typically maintain
  data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
11
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
  Star schema: A fact table in the middle connected to a set of dimension tables
  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
12
Example of Star Schema
Sales Fact Table
  Keys: time_key, item_key, branch_key, location_key
  Measures: units_sold, dollars_sold
Dimension Tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
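A sketch of this star schema in SQLite, again via Python's `sqlite3`. Table and column names follow the slide; the rows are invented sample data, and the `branch` and `location` dimension tables are left out for brevity (their keys stay in the fact table as plain columns).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
                   month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
-- Fact table: one key per dimension plus the two measures.
CREATE TABLE sales (time_key INTEGER REFERENCES time(time_key),
                    item_key INTEGER REFERENCES item(item_key),
                    branch_key INTEGER, location_key INTEGER,
                    units_sold INTEGER, dollars_sold REAL);

INSERT INTO time VALUES (1, 15, 'Mon', 1, 'Q1', 2024), (2, 15, 'Mon', 4, 'Q2', 2024);
INSERT INTO item VALUES (1, 'TV-100', 'Acme', 'TV', 'wholesale'),
                        (2, 'PC-200', 'Acme', 'PC', 'wholesale');
INSERT INTO sales VALUES (1, 1, 1, 1, 2, 800.0), (1, 1, 1, 1, 1, 400.0),
                         (1, 2, 1, 1, 1, 1500.0), (2, 1, 1, 1, 3, 1200.0);
""")

# A typical star join: the fact table joined to two dimensions, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON s.time_key = t.time_key
    JOIN item AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
```

Every analytical query has the same shape: filter and group on dimension attributes, aggregate the measures in the fact table.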
13
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
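One simple way to picture the step from a fact table to a cube, assuming a tiny made-up fact table with three dimensions and one measure: each cube cell is addressed by a tuple of dimension values and holds the aggregated measure. The name `build_cube` and the rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, quarter, country, dollars_sold)
facts = [
    ("TV", "Q1", "USA", 100.0),
    ("TV", "Q1", "USA", 50.0),
    ("PC", "Q1", "Canada", 200.0),
    ("PC", "Q2", "USA", 300.0),
]


def build_cube(rows):
    """Sum the measure for each combination of dimension values (one cell per combination)."""
    cube = defaultdict(float)
    for item, quarter, country, dollars_sold in rows:
        cube[(item, quarter, country)] += dollars_sold
    return dict(cube)


sales = build_cube(facts)
```

A dictionary keyed by coordinates is a sparse representation: only combinations that actually occur take up space.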
14
Data Cubes
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
15
Typical OLAP Operations
Slice and dice: project and select
Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.
Roll up (drill-up): summarize data
  by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
  from higher level summary to lower level summary or detailed data, or introducing new dimensions
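Slice, dice, and pivot can be sketched on a small dictionary-based cube. The cube contents and the helper names `slice_cube`, `dice`, and `pivot` are assumptions made for this example; an OLAP server would run these operations over indexed or precomputed cuboids rather than a Python dict.

```python
DIMS = ("product", "quarter", "country")

# Hypothetical cube: (product, quarter, country) -> sales
cube = {
    ("TV", "Q1", "USA"): 10,
    ("TV", "Q2", "USA"): 20,
    ("PC", "Q1", "USA"): 30,
    ("PC", "Q1", "Canada"): 40,
    ("VCR", "Q3", "Mexico"): 50,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop it from the coordinates."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Dice: keep cells whose coordinates fall in the allowed set for each named dimension."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def pivot(cube, order):
    """Pivot: reorder the axes without changing any cell value."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


usa = slice_cube(cube, "country", "USA")
pc_q1 = dice(cube, product={"PC"}, quarter={"Q1"})
rotated = pivot(cube, ("country", "product", "quarter"))
```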
16
A Sample Data Cube
[Figure: 3-D sales cube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
17
Slice and Dice
[Figure: 3-D sales cube. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
Highlighted selections: Sales in USA; Sales of PC; Sales of 3Qtr; Sales of 3Qtr in USA; Sales of PC of 3Qtr in USA
18
Pivot (rotate)
[Figure: 3-D sales cube, reoriented. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
19
Roll up (drill-up) (dimension reduction example)
[Figure: 3-D sales cube with a sum face. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
Group by date, country: reduced product dimension
Group by country: reduced date dimension
Group by nothing: reduced all dimensions
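The three group-bys above can be reproduced with one small function that keeps only the named dimensions and sums over the rest. The base cells and the name `group_by` are illustrative assumptions.

```python
from collections import defaultdict

DIMS = ("date", "product", "country")

# Hypothetical base cuboid: (date, product, country) -> sales
base = {
    ("Q1", "TV", "USA"): 10,
    ("Q1", "PC", "USA"): 20,
    ("Q2", "TV", "USA"): 5,
    ("Q1", "TV", "Canada"): 7,
}


def group_by(cube, keep):
    """Roll up by dimension reduction: sum out every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)


by_date_country = group_by(base, ("date", "country"))  # product dimension reduced
by_country = group_by(base, ("country",))              # date dimension reduced as well
grand_total = group_by(base, ())                       # all dimensions reduced
```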
20
Roll up (drill-up)
[Figure: 3-D sales cube extended with sum cells along each axis. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
Labeled aggregates: Group by product, country; Group by date, country; Group by product; Group by date; Group by country; ALL
21
Drill down (roll-down)
[Figure: 3-D sales cube with sum cells. Axes: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), Country (U.S.A, Canada, Mexico).]
Labeled aggregates: ALL; Group by country; Group by date, country
22
Cuboids Corresponding to the Cube
0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        product, date; product, country; date, country
3-D (base) cuboid:  product, date, country
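The lattice is just the set of all subsets of the dimensions, grouped by size. A short sketch using `itertools.combinations` from the standard library; the function name `cuboid_lattice` is made up for this example.

```python
from itertools import combinations


def cuboid_lattice(dims):
    """Map k -> list of k-dimensional cuboids (each cuboid is a tuple of dimension names)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(("product", "date", "country"))
total_cuboids = sum(len(level) for level in lattice.values())
```

With n dimensions and no concept hierarchies there are 2^n cuboids, so 8 for the three dimensions here.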
23
A Concept Hierarchy: Spatial Dimension
Levels (top to bottom) with the example values shown in the figure:
  all:      all
  region:   Europe, ..., North_America
  country:  Germany, ..., Spain, Canada, ..., Mexico
  city:     Frankfurt, ..., Vancouver, ..., Toronto
  office:   L. Chan, ..., M. Wind
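Rolling up by climbing this hierarchy amounts to replacing each value with its parent and re-aggregating. In the sketch below the city, country, and region names come from the figure, while the sales numbers and the helper name `roll_up` are invented for illustration.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}
COUNTRY_TO_REGION = {
    "Canada": "North_America",
    "Mexico": "North_America",
    "Germany": "Europe",
    "Spain": "Europe",
}

# Hypothetical sales at the city level.
sales_by_city = {"Vancouver": 10, "Toronto": 20, "Frankfurt": 5}


def roll_up(values, parent_of):
    """Climb one level of the hierarchy: sum child values under their parent."""
    out = defaultdict(int)
    for child, value in values.items():
        out[parent_of[child]] += value
    return dict(out)


sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
```

Drill-down goes the other way, but it needs the finer-level data to be stored: the country totals alone cannot be split back into cities.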
25
Multidimensional Data
Sales volume as a function of product, month, and region
[Figure: cube with axes Product, Region, Month.]
Dimensions: Product, Location, Time
Hierarchical summarization paths
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month > Day; Week > Day
26
Measures: Three Categories
distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  E.g., count(), sum(), min(), max().
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  E.g., avg(), min_N(), standard_deviation().
holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  E.g., median(), mode(), rank().
27
Distributive
Sum of sales group by product, region, month
Sum of sales group by category, region, month
Others: count, min, max
Dimensions: Product, Location, Time
Hierarchical summarization paths
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month > Day; Week > Day
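A quick check of the distributive property on made-up numbers: applying the function to the per-partition results gives the same answer as applying it to all the data. The helper `combines_directly` is a hypothetical name used only here.

```python
# Hypothetical sales values split into three partitions (e.g. one per product).
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [x for part in partitions for x in part]


def combines_directly(fn):
    """True if fn over the per-partition results equals fn over the unpartitioned data."""
    return fn([fn(part) for part in partitions]) == fn(all_values)


sum_ok = combines_directly(sum)
min_ok = combines_directly(min)
max_ok = combines_directly(max)

# count is distributive too, but the partial counts are combined with sum().
total_count = sum(len(part) for part in partitions)
```

This is why the sum by category can be computed from the stored sums by product without rereading the base data.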
28
Algebraic
Average sales group by product, region, month
Average sales group by category, region, month
We will keep
  the sum of sales group by product, region, month
  the count group by product, region, month
Then calculate the average
Others: min_N(), standard_deviation().
Hierarchical summarization paths
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month > Day; Week > Day
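A small sketch of the sum-and-count approach, with invented numbers for two products in one category. The names `per_product` and `combine` are hypothetical.

```python
# Hypothetical stored sub-aggregates per product: (sum of sales, count of sales)
per_product = {"TV-A": (300.0, 3), "TV-B": (200.0, 1)}


def combine(parts):
    """Merge (sum, count) pairs; both components are distributive."""
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total, count


category_sum, category_count = combine(per_product.values())
category_avg = category_sum / category_count

# Averaging the per-product averages gives a different, wrong answer.
avg_of_avgs = sum(s / c for s, c in per_product.values()) / len(per_product)
```

Here the category average is 500 / 4 = 125, while the average of the two product averages (100 and 200) is 150, which is why the sum and count are kept instead of the average itself.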
29
Holistic
The rank of sales group by product, region, month
The rank of sales group by category, region, month
Hierarchical summarization paths
  Product:   Industry > Category > Product
  Location:  Region > Country > City > Office
  Time:      Year > Quarter > Month > Day; Week > Day
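The slide's example is rank; the sketch below uses median, another holistic measure, because the effect is easy to see on a few made-up values. No fixed-size summary of each partition is enough to recover the exact result.

```python
from statistics import median

# Hypothetical sales values for two products in the same category.
product_a = [1, 2, 100]
product_b = [3, 4]

# Exact category median needs all the underlying values.
true_median = median(product_a + product_b)

# Combining the per-product medians does not give the same answer.
median_of_medians = median([median(product_a), median(product_b)])
```

So a holistic measure at the category level generally has to be computed from the detailed data (or approximated), not from the product-level results.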
30
Summary
Data warehouse
  A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
  Star schema, snowflake schema, fact constellations
  A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
Efficient computation of data cubes
  Partial vs. full vs. no materialization
  Bitmap index