Advanced Database Topics

Copyright © Ellis Cohen 2002-2005

Data Warehousing

These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.

For more information on how you may use them, please see http://www.openlineconsult.com/db

Topics

Overview
Star Schema: Fact & Dimension Tables
The Star Schema & Denormalization
Viewing The Data Cube
Drill Down & Rollup
Cross Tabulations
Data Visualization
Trend & Rank Analysis
ETL: Extraction, Transformation & Loading
Materialized Views & Query Rewriting
Indexing for Data Warehouses

Operational vs Analytical DBs

Operational Database
Data needed and updated constantly to directly support business operations
Focus on OLTP (on-line transaction processing): transactional access & modification of a relatively small # of data points at a time

Analytical Database: Data Warehouse & Data Mart
Copious amounts of relatively static data, culled & integrated across the enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
Focus on OLAP (on-line analytical processing): querying large amounts of data, scheduled modifications

Operational vs Analytical DBs

               Operational                    Warehouse
Usage          Transactional (OLTP)           Analytical (OLAP)
Organized for  Modifications                  Queries
Modifications  Continual                      Periodic
Queries        Narrow-scope, Low-complexity   Broad-scope, High-complexity
Database       Relational                     Relational/Dimensional
Data           Normalized                     Denormalized, Aggregated & Derived

Central Data Warehouse

(from Oracle 9i Data Warehousing Guide)

Warehouse Questions

How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?

What are the top 25 selling products by category and region for this past quarter?

What percent of the market do we own for each product we make?

Which of our customers' zipcodes were responsible for the top 10% of total sales over the last year?

Star Schema: Fact & Dimension Tables

Star Schema

Data Warehouses are organized using Star Schema models

A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables

DailySales (Fact): storid, prodid, date, price, units
– storid, prodid: foreign keys to the dimension tables
– price, units: Measures (what each fact measures)
Stores (Dimension): storid, …
Products (Dimension): prodid, …

Subjects (Facts) & Dimensions

Instead of thinking about entities & relationships, design a data warehouse by thinking about

Subjects (represented by fact tables)
Sales, Distribution, Purchases

Dimensions (represented by dimension tables)
How to uniquely identify the facts about each subject
– Sales: Product, Stores, Dates (maybe also Employee, Customer: depends what you want to analyze)
– Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
– Purchases: Products, Vendors, Dates (maybe also Employees)

Fact & Dimension Tables

Fact Tables
Composite primary key
• identify dimensions
• uniquely identify each fact (or measurement)
Additional attributes: measures
• what is measured about each fact

Dimension Tables
Primary key
• Surrogate key uniquely identifies each dimension value
Additional attributes
• Properties of each dimension value
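The fact/dimension layout above can be sketched as table definitions. This is a minimal illustration using Python's built-in sqlite3 module (chosen only so the sketch runs anywhere; the slides themselves target Oracle). Table and column names follow the DailySales, Stores and Products tables in the slides; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT);
CREATE TABLE DailySales (
    storid INTEGER NOT NULL REFERENCES Stores(storid),
    prodid INTEGER NOT NULL REFERENCES Products(prodid),
    date   TEXT    NOT NULL,
    price  REAL,                       -- measure
    units  INTEGER,                    -- measure
    PRIMARY KEY (storid, prodid, date) -- composite key: one fact per combination
);
""")
# Hypothetical sample rows
conn.execute("INSERT INTO Stores VALUES (1, 'Pittsburgh')")
conn.execute("INSERT INTO Products VALUES (10, 'red')")
conn.execute("INSERT INTO DailySales VALUES (1, 10, '2002-07-01', 59.95, 3)")

# A second fact for the same store, product and date violates the composite key.
try:
    conn.execute("INSERT INTO DailySales VALUES (1, 10, '2002-07-01', 59.95, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A fact that references a store missing from the dimension table is rejected too.
try:
    conn.execute("INSERT INTO DailySales VALUES (99, 10, '2002-07-02', 59.95, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(duplicate_rejected, orphan_rejected)
```

The composite primary key is what makes each row "one fact", and the foreign keys are what tie each fact to its dimension values.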

Dimensions & Granularity

Dimensions have different levels of granularity

Stores → Districts → Regions
Products → ProductTypes → SubCategories → Categories
Products → ProductTypes → Manufacturers

Snowflake Schema (with Normalized Dimensions)

DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid
Districts: distid, distnam, distarea, regid
Regions: regid, regnam
Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam

Typical Warehouse Query

How many red Bally shoes did we sell in each region in 2002?

SELECT r.regnam as region, sum(f.units) as sumunits
FROM DailySales f
  NATURAL JOIN Stores
  NATURAL JOIN Districts
  NATURAL JOIN Regions r
  NATURAL JOIN Products p
  NATURAL JOIN ProductTypes
  NATURAL JOIN SubCategories s
  NATURAL JOIN Manufacturers m
WHERE to_char(f.date,'YYYY') = '2002'
  AND p.color = 'red'
  AND m.manfnam = 'Bally'
  AND s.subnam = 'Shoe'
GROUP BY r.regnam

Aggregate Functions

AVG: Average
COUNT: Count
MIN: Minimum Value
MAX: Maximum Value
STDDEV: Standard Deviation (and STDDEV_POP & STDDEV_SAMP)
SUM: Sum
VARIANCE: Variance (and VAR_POP & VAR_SAMP)

The Star Schema & Denormalization

Snowflake Schema is Normalized

Snowflake Schema has normalized dimension tables

• Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)

• Each sub-dimension table has attributes appropriate to the level of granularity

– Product: color, size

– ProductType: prodnam, prodescr

– etc.

Denormalization

Normalized:
Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam

Denormalized:
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Why is there redundancy here?

Star Schema is Denormalized

The Star Schema has denormalized dimension tables

• Each dimension is formed by joining together its sub-dimension tables into a single dimension table

• The dimension table has attributes at different levels of granularity

• The dimension tables contain lots of redundancy, but queries use far fewer joins

• Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately)
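As a rough sketch of the denormalization step described above, the snowflake sub-dimension tables can be joined once into a single wide dimension table. The example uses Python's sqlite3 module so it is runnable; the table name ProductsDim and the sample rows ('Footwear', 'Loafer', and so on) are assumptions for illustration, and only a subset of the slide's columns is carried along.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories    (catid INTEGER PRIMARY KEY, catnam TEXT);
CREATE TABLE SubCategories (subcatid INTEGER PRIMARY KEY, subnam TEXT, catid INTEGER);
CREATE TABLE Manufacturers (manfid INTEGER PRIMARY KEY, manfnam TEXT);
CREATE TABLE ProductTypes  (prodtyp INTEGER PRIMARY KEY, prodnam TEXT,
                            subcatid INTEGER, manfid INTEGER);
CREATE TABLE Products      (prodid INTEGER PRIMARY KEY, color TEXT, size TEXT,
                            prodtyp INTEGER);

-- Hypothetical sample data
INSERT INTO Categories    VALUES (1, 'Footwear');
INSERT INTO SubCategories VALUES (11, 'Shoe', 1);
INSERT INTO Manufacturers VALUES (7, 'Bally');
INSERT INTO ProductTypes  VALUES (100, 'Loafer', 11, 7);
INSERT INTO Products      VALUES (1000, 'red', '9', 100), (1001, 'black', '9', 100);

-- One wide star-schema dimension table built from the snowflake tables.
CREATE TABLE ProductsDim AS
SELECT p.prodid, p.color, p.size, p.prodtyp, t.prodnam,
       m.manfid, m.manfnam, s.subcatid, s.subnam, c.catid, c.catnam
FROM Products p
JOIN ProductTypes  t ON t.prodtyp  = p.prodtyp
JOIN Manufacturers m ON m.manfid   = t.manfid
JOIN SubCategories s ON s.subcatid = t.subcatid
JOIN Categories    c ON c.catid    = s.catid;
""")

rows = conn.execute(
    "SELECT prodid, color, manfnam, subnam, catnam FROM ProductsDim ORDER BY prodid"
).fetchall()
print(rows)
```

Both products repeat the same manufacturer, subcategory and category values: that repetition is the redundancy the slide mentions, traded for queries that need only one join per dimension.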

Star Schema (Fully Denormalized Dimensions)

DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Maybe catdescr is not included here if it is a GIF or a 4000 byte description

Why should date (in DailySales) be replaced by a dateid?

Schema Types

Snowflake Schema
Fact table with fully normalized dimension tables

Star Schema
Fact table with fully de-normalized dimension tables

Starflake Schema
Fact table with fully de-normalized dimension and (as needed) sub-dimension tables

Constellation Schema
Multiple fact tables with shared dimension tables

Query with Denormalized Schema

How many red Bally shoes did we sell in each region in 2002?

SELECT s.regnam as region, sum(f.units) as sumunits
FROM DailySales f
  NATURAL JOIN Stores s
  NATURAL JOIN Products p
WHERE to_char(f.date,'YYYY') = '2002'
  AND p.color = 'red'
  AND p.manfnam = 'Bally'
  AND p.subnam = 'Shoe'
GROUP BY s.regnam

Costly: to_char(f.date,'YYYY') = '2002'

Typical Date Dimension Attributes

It is common and almost always more efficient to treat Dates as a dimension with a number of attributes

Field        Example Value
Year         2005
Month        Feb
Quarter      1
DayOfMonth   12
DayOfYear    43
WeekOfYear   7
DayOfWeek    Sat

Requires Month + Year to identify a month within a year. Might want to add a single MonthYr field to represent the pair.

Note: Quarter is less granular than Month. Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields.
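The date attributes listed above can all be derived from a calendar date, which is how a Dates dimension table is typically populated. The following is a small sketch in Python; the function name `date_dimension_row` is hypothetical, and the week numbering (week 1 starts on Jan 1, weeks are consecutive 7-day blocks) is an assumption chosen because it reproduces the slide's WeekOfYear of 7 for Feb 12, 2005. ISO week numbering would give a different value.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")  # date.weekday() order

def date_dimension_row(d):
    """Return the date-dimension attributes for one calendar date."""
    day_of_year = d.timetuple().tm_yday
    month = MONTHS[d.month - 1]
    quarter = (d.month - 1) // 3 + 1
    return {
        "Year": d.year,
        "Month": month,
        "Quarter": quarter,
        "DayOfMonth": d.day,
        "DayOfYear": day_of_year,
        # Assumption: week 1 starts on Jan 1 (7-day blocks), not ISO weeks.
        "WeekOfYear": (day_of_year - 1) // 7 + 1,
        "DayOfWeek": DAYS[d.weekday()],
        # Combined fields that identify a month or quarter within a year.
        "MonthYr": f"{month}{d.year}",
        "QuarterYr": f"{d.year}Q{quarter}",
    }

row = date_dimension_row(date(2005, 2, 12))
print(row)
```

Precomputing these fields once per date is what lets warehouse queries filter on plain column comparisons instead of calling date functions on every fact row.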

Extended Date Dimension Hierarchy

Date (e.g. Feb 12, 2005)
DayOfWeek (e.g. Sat)
WeekYr (e.g. 2005Wk7)
MonthYr (e.g. Feb2005)
QuarterYr (e.g. 2005Q1)
Year (e.g. 2005)
Quarter (e.g. 1)
Month (e.g. Feb)
WeekOfYear (e.g. 7)
DayOfYear (e.g. 43)
DayOfMonth (e.g. 12)

Star Schema with Date Dimension

DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year

In general, represent dates by a Dates dimension table

Query using Dates Dimension

How many red Bally shoes did we sell in each region in 2002?

SELECT s.regnam as region, sum(f.units) as sumunits
FROM DailySales f
  NATURAL JOIN Stores s
  NATURAL JOIN Products p
  NATURAL JOIN Dates d
WHERE d.year = 2002
  AND p.color = 'red'
  AND p.manfnam = 'Bally'
  AND p.subnam = 'Shoe'
GROUP BY s.regnam

Needs an extra join, but the query is simpler, and it executes faster if Dates is indexed by year

More Complex Query

How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?

SELECT s.regnam as region, d.quarteryr, sum(f.units) as sumunits
FROM DailySales f
  NATURAL JOIN Stores s
  NATURAL JOIN Products p
  NATURAL JOIN Dates d
WHERE p.color = 'red'
  AND p.manfnam = 'Bally'
  AND p.subnam = 'Shoe'
GROUP BY s.regnam, d.quarteryr, d.quarter, d.year
HAVING d.quarter = 3
  AND d.year BETWEEN 1998 AND 2002

The M:N Mapping Problem

DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Products → ProductTypes → SubCategories → Categories
ProductTypes → Manufacturers

Suppose a product type may have multiple associated subcategories.

What do we do?

M:N Mappings

DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam
ProdCatMap: prodtyp, subcatid
SubCategories: subcatid, subnam, subdescr, catid, catnam, catdescr

A product type can have more than one subcategory

ProdCatMap.prodtyp can't be a foreign key constraint, since prodtyp is not unique in Products

OK to keep the M:N bridge table
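To make the bridge-table approach concrete, here is a minimal runnable sketch using Python's sqlite3 module. The table names follow the slide (Products, ProdCatMap, SubCategories, DailySales); the sample rows and product names are invented, and only the columns needed for the example are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products      (prodid INTEGER PRIMARY KEY, prodtyp INTEGER, prodnam TEXT);
CREATE TABLE SubCategories (subcatid INTEGER PRIMARY KEY, subnam TEXT);
CREATE TABLE ProdCatMap    (prodtyp INTEGER, subcatid INTEGER,
                            PRIMARY KEY (prodtyp, subcatid));
CREATE TABLE DailySales    (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);

-- Hypothetical sample data
INSERT INTO Products      VALUES (1, 100, 'Hiking boot'), (2, 200, 'Sandal');
INSERT INTO SubCategories VALUES (314, 'Shoe'), (209, 'Outdoor');
-- Product type 100 belongs to two subcategories; type 200 to one.
INSERT INTO ProdCatMap    VALUES (100, 314), (100, 209), (200, 314);
INSERT INTO DailySales    VALUES (1, 1, 1, 5), (1, 2, 1, 2);
""")

# Units per subcategory: a sale is counted once in each subcategory its type maps to.
per_subcat = conn.execute("""
    SELECT s.subnam, sum(f.units)
    FROM DailySales f
    JOIN Products p      ON p.prodid   = f.prodid
    JOIN ProdCatMap m    ON m.prodtyp  = p.prodtyp
    JOIN SubCategories s ON s.subcatid = m.subcatid
    GROUP BY s.subnam
    ORDER BY s.subnam
""").fetchall()

# Units per product type never touches the bridge, so nothing is double counted.
per_type = conn.execute("""
    SELECT p.prodtyp, sum(f.units)
    FROM DailySales f JOIN Products p ON p.prodid = f.prodid
    GROUP BY p.prodtyp
    ORDER BY p.prodtyp
""").fetchall()
print(per_subcat, per_type)
```

The bridge is joined only by queries that actually ask about subcategories, so queries at the product or product-type level stay simple and correct.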

Non-1NF Denormalization

DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, { subcatid }
SubCategories: subcatid, subnam, subdescr, catid, catnam, catdescr

Represent a list of subcategories by
• A (non-standard) list datatype
• Delimited string – e.g. |314|209|812|

Another reasonable approach (esp. if the DB supports lists)

Full Denormalization

DailySales (Fact): storid, prodid, dateid, price, units
Product (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

UGLY! BAD! Don't do this! Complicates joins

SELECT prodtyp, sum(units)
FROM DailySales NATURAL JOIN Product
GROUP BY prodtyp

is not correct because of duplication. Must write

WITH JustProduct AS (
  SELECT DISTINCT prodid, prodtyp FROM Product)
SELECT prodtyp, sum(units)
FROM DailySales NATURAL JOIN JustProduct
GROUP BY prodtyp
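The double counting described above is easy to reproduce. The sketch below runs both forms of the query from the slide against a tiny, invented data set using Python's sqlite3 module. The Product table here carries only the three columns that matter for the demonstration, and product 1 appears twice because its type maps to two subcategories.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fully denormalized: product 1 appears once per subcategory of its type.
CREATE TABLE Product    (prodid INTEGER, prodtyp INTEGER, subcatid INTEGER);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);
-- Hypothetical sample data
INSERT INTO Product    VALUES (1, 100, 314), (1, 100, 209), (2, 200, 314);
INSERT INTO DailySales VALUES (1, 1, 1, 5), (1, 2, 1, 2);
""")

# Naive query: each sale of product 1 joins to two Product rows.
naive = conn.execute("""
    SELECT prodtyp, sum(units)
    FROM DailySales NATURAL JOIN Product
    GROUP BY prodtyp
    ORDER BY prodtyp
""").fetchall()

# Corrected query: collapse the duplicated dimension rows first.
fixed = conn.execute("""
    WITH JustProduct AS (SELECT DISTINCT prodid, prodtyp FROM Product)
    SELECT prodtyp, sum(units)
    FROM DailySales NATURAL JOIN JustProduct
    GROUP BY prodtyp
    ORDER BY prodtyp
""").fetchall()
print(naive, fixed)
```

The naive form reports 10 units for product type 100 although only 5 were sold, which is why every query against such a table would need the DISTINCT workaround.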

Limits to Denormalization

DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, …
Products (Dimension): prodid, …
Dates (Dimension): dateid, …
StorePromos: storid, startdateid, enddateid, promonam, discount

Can't denormalize StorePromos, unless you replace Store & Date with a single StoreDate dimension with storid & dateid as its primary key: way too big

Viewing The Data Cube

Data Cube Representation

[Figure: Sales Cube with one axis each for the Products dimension, the Stores dimension (Pgh, NYC shown) and the Dates dimension. Labeled cells: Sales of Beanie Babies in Pittsburgh Store Today; Sales of Beanie Babies in Pittsburgh Store Yesterday. Labeled slice: All Sales (of all products over time) in NYC Store.]

Data Cube Characteristics

Each axis represents a dimension

– Elements along axis are at lowest granularity for that dimension

Measures are the data within the cells at intersections of the cube

– Information about the topic of the cube

– e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)

Data Cube Views

Slice
View data relative to a point in one or more dimensions
– View sales today (for each store & each product category)
– View Bally shoe sales at the NYC store (for each date)

Dice
View data relative to (sets of) ranges in one or more dimensions
– View sales for the last 4 days (for each store & each product category)
– View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters
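In relational terms, a slice fixes a single value along a dimension while a dice restricts dimensions to sets or ranges; both end up as WHERE conditions on dimension attributes. The sketch below illustrates this with Python's sqlite3 module on a small invented data set. The table names follow the slides, while the rows, cities and dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (storid INTEGER PRIMARY KEY, city TEXT, state TEXT);
CREATE TABLE Dates  (dateid INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);
-- Hypothetical sample data
INSERT INTO Stores VALUES (1, 'NYC', 'NY'), (2, 'Newark', 'NJ'), (3, 'Pittsburgh', 'PA');
INSERT INTO Dates  VALUES (1, '2005-02-09'), (2, '2005-02-10'),
                          (3, '2005-02-11'), (4, '2005-02-12');
INSERT INTO DailySales VALUES
    (1, 10, 1, 4), (1, 10, 4, 6), (2, 10, 3, 1), (2, 10, 4, 2), (3, 10, 4, 9);
""")

# Slice: fix a single point on the Dates axis, keep every store.
slice_rows = conn.execute("""
    SELECT s.city, sum(f.units)
    FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Dates d
    WHERE d.date = '2005-02-12'
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()

# Dice: restrict to a set of states and a range of dates.
dice_rows = conn.execute("""
    SELECT s.state, d.date, sum(f.units)
    FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Dates d
    WHERE s.state IN ('NY', 'NJ')
      AND d.date BETWEEN '2005-02-11' AND '2005-02-12'
    GROUP BY s.state, d.date
    ORDER BY s.state, d.date
""").fetchall()
print(slice_rows, dice_rows)
```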

MDDB: MultiDimensional DataBase

Knows about Fact & Dimension Tables

Uses direct (n dimensional) hypercube representation to provide fast access to fact elements in query

Supports sparse representations
– The Pittsburgh store doesn't sell lingerie
– The Cape Cod store is not open in the winter
– Baked Beanie Babies are only sold in the NE region

Uses specialized query language
e.g. MDX (used by Microsoft OLAP Server)
with basic data types: cube, slice, dice

Choosing a View

[Screenshot: choosing a view. Dimension attributes shown: Store (Country, State, City, StoreType); Products (MajorCategory, MajorSubCategory, MinorCategory, MinorSubCategory, Brand); dates (Year, QuarterYr, MonthYr); Customers (Country, State, City, EducLevel). Callouts: Detailed Dicing Dimension; Slicing Dimensions. Slice values shown: CA, 1997Q1, Drink.]

Slicing & Dicing

[Screenshot callouts: Detailed Dicing Dimension; Slicing Attributes; Base Measures; Derived Measures]

Examples use dynasight, www.arcplan.com

Slice, Dice & Chart

[Screenshot callouts: Different Dicing Dimension; Measures; Charted Measures; Slicing Attributes]

Drill Down & Roll Up

Slicing & Dicing

[Screenshot callouts: Detailed Dicing Dimension; Slicing Attributes; Measures; Drill Down]

Drill Down

[Screenshot callouts: Drill Down; Re-Slice]

Uniform Drill Down & Rollup

Uniform Drill Down
Uniformly drill down to a certain level

Uniform Rollup
Compute Aggregate Values at that level and all higher levels

Can be computed with a single SELECT statement using the ROLLUP grouping function

Non-uniform rollups (previous slide) require UNIONs

Ordinary Group Aggregation

SELECT c.country, c.state, c.city,
  sum(f.sale) as StoreSales,
  sum(f.cost) as StoreCost,
  sum(f.sale) - sum(f.cost) as StoreNet,
  100 * sum(f.cost) / sum(f.sale) as PctCost
FROM Facts f
  NATURAL JOIN Customers c
  NATURAL JOIN Stores s
  NATURAL JOIN Dates d
  NATURAL JOIN Products p
WHERE p.MajorCategory = 'Drink'
  AND d.QuarterYr = '1997Q1'
  AND s.country = 'USA'
  AND s.State = 'CA'
GROUP BY c.country, c.state, c.city

Note: Constrain Store, not Customer

Aggregate Query Results

Country  State  City       StoreSales  …
USA      CA     Altadena   96          …   Per-City Rollups in CA
USA      CA     Arcadia    64          …
USA      CA     …          …           …
USA      CA     Woodland   …           …
USA      OR     Beaverton  12          …   Per-City Rollups in OR
USA      OR     Corvalis   21          …
USA      OR     …          …           …
USA      OR     Woodburn   4           …
…        …      …          …           …
CANADA   BC     Victoria   …           …

That's fine, but it does NOT give us
• Aggregate store sales for CA, OR, BC, etc
• Aggregate store sales for USA, CANADA
• Aggregate store sales overall

Rollup Query Results

Country  State  City       StoreSales  …
NULL     NULL   NULL       6310        …   Rollup ALL
USA      NULL   NULL       4310        …   USA Rollup
USA      CA     NULL       3310        …   CA Rollup
USA      CA     Altadena   96          …   Per-City Rollups in CA
USA      CA     Arcadia    64          …
USA      CA     …          …           …
USA      CA     Woodland   …           …
USA      OR     NULL       1000        …   OR Rollup
USA      OR     Beaverton  12          …   Per-City Rollups in OR
USA      OR     Corvalis   21          …
USA      OR     …          …           …
USA      OR     Woodburn   4           …
…        …      …          …           …
CANADA   NULL   NULL       …           …   Canada Rollup
CANADA   BC     NULL       …           …   BC Rollup
CANADA   BC     Victoria   …           …
…

Rollup using Union

SELECT c.country, c.state, c.city, …
  GROUP BY c.country, c.state, c.city
UNION
SELECT c.country, c.state, NULL AS city, …
  GROUP BY c.country, c.state
UNION
SELECT c.country, NULL AS state, NULL AS city, …
  GROUP BY c.country
UNION
SELECT NULL AS country, NULL AS state, NULL AS city, …
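The UNION pattern above can be tried end to end on a small data set. This sketch uses Python's sqlite3 module, which (as far as this example assumes) has no GROUP BY ROLLUP, so the UNION form is the natural way to get rollup rows there. The CitySales table is a hypothetical pre-joined stand-in for the fact and dimension tables, and the figures are the few city values visible in the slides' result table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical pre-joined table standing in for Facts joined to its dimensions.
CREATE TABLE CitySales (country TEXT, state TEXT, city TEXT, sale INTEGER);
INSERT INTO CitySales VALUES
    ('USA', 'CA', 'Altadena', 96), ('USA', 'CA', 'Arcadia', 64),
    ('USA', 'OR', 'Beaverton', 12), ('USA', 'OR', 'Corvalis', 21),
    ('USA', 'OR', 'Woodburn', 4);
""")

# One branch per rollup level; NULL marks a column that has been rolled up.
rollup = conn.execute("""
    SELECT country, state, city, sum(sale) FROM CitySales
    GROUP BY country, state, city
    UNION
    SELECT country, state, NULL, sum(sale) FROM CitySales
    GROUP BY country, state
    UNION
    SELECT country, NULL, NULL, sum(sale) FROM CitySales
    GROUP BY country
    UNION
    SELECT NULL, NULL, NULL, sum(sale) FROM CitySales
""").fetchall()

for row in rollup:
    print(row)
```

Each extra level costs another pass over the data, which is the motivation for the single-statement ROLLUP form.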

GROUP BY ROLLUP

SELECT c.country, c.state, c.city,
  sum(f.sale) as StoreSales,
  sum(f.cost) as StoreCost,
  sum(f.sale) - sum(f.cost) as StoreNet,
  100 * sum(f.cost) / sum(f.sale) as PctCost
FROM Facts f
  NATURAL JOIN Customers c
  NATURAL JOIN Stores s
  NATURAL JOIN Dates d
  NATURAL JOIN Products p
WHERE p.MajorCategory = 'Drink'
  AND d.QuarterYr = '1997Q1'
  AND s.country = 'USA'
  AND s.State = 'CA'
GROUP BY ROLLUP( c.country, c.state, c.city )

Note: Constrain Store, not Customer

Cross Dimension Rollups

SELECT c.state, d.QuarterYr, sum(f.cost) as StoreCost
FROM Fact f
  NATURAL JOIN Customers c
  NATURAL JOIN Stores s
  NATURAL JOIN Dates d
  NATURAL JOIN Products p
WHERE p.MajorCategory = 'Drink'
  AND d.year = 1997
  AND s.country = 'USA'
  AND s.State = 'CA'
GROUP BY ROLLUP( c.state, d.QuarterYr )

Cross Dimension Rollup Results

State  Quarter  StoreSales
NULL   NULL     225,627      Rollup ALL
CA     NULL     63,530       CA Rollup
CA     1997Q1   14,431       Per-Qtr Rollups in CA
CA     1997Q2   15,332
CA     1997Q3   15,673
CA     1997Q4   18,094
OR     NULL     56,773       OR Rollup
OR     1997Q1   16,081       Per-Qtr Rollups in OR
OR     1997Q2   12,679
OR     1997Q3   14,274
OR     1997Q4   13,739
WA     NULL     105,324      WA Rollup
WA     1997Q1   25,240
…

Cross Tabulations

Cross Tab View of Rollup

         1997      1997Q1   1997Q2   1997Q3   1997Q4
(all)    225,627   ?
CA       63,530    14,431   15,332   15,673   18,094
OR       56,773    16,081   12,679   14,274   13,739
WA       105,324   25,240   24,953   25,958   29,173

Cross Tab with Sums

Sums

GROUP BY CUBE

SELECT c.state, d.QuarterYr, sum(f.cost) as StoreCost
FROM Fact f
  NATURAL JOIN Customers c
  NATURAL JOIN Stores s
  NATURAL JOIN Dates d
  NATURAL JOIN Products p
WHERE p.MajorCategory = 'Drink'
  AND d.Year = 1997
  AND s.country = 'USA'
  AND s.State = 'CA'
GROUP BY CUBE( c.state, d.QuarterYr )

Cube Results

State  Quarter  StoreSales
NULL   NULL     225,627      Rollup ALL
NULL   1997Q1   55,752       Qtr Rollups
NULL   1997Q2   52,964
NULL   1997Q3   55,905
NULL   1997Q4   61,006
CA     NULL     63,530       CA Rollup
CA     1997Q1   14,431       Per-Qtr Rollups in CA
CA     1997Q2   15,332
CA     1997Q3   15,673
CA     1997Q4   18,094
OR     NULL     56,773       OR Rollup
OR     1997Q1   16,081       Per-Qtr Rollups in OR
OR     1997Q2   12,679
OR     1997Q3   14,274
OR     1997Q4   13,739
WA     NULL     105,324      WA Rollup
WA     1997Q1   25,240
…
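One way to see what CUBE computes is to spell it out as one GROUP BY per subset of the grouping columns. The sketch below does that with Python's sqlite3 module and itertools, using the per-state, per-quarter figures from the slides. The table name QtrSales and the column name `amount` are hypothetical, and the UNION ALL emulation is an illustration of the semantics, not how a database with native CUBE support would execute it.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QtrSales (state TEXT, quarteryr TEXT, amount INTEGER)")
figures = {  # per-state, per-quarter values taken from the slides
    "CA": (14431, 15332, 15673, 18094),
    "OR": (16081, 12679, 14274, 13739),
    "WA": (25240, 24953, 25958, 29173),
}
conn.executemany(
    "INSERT INTO QtrSales VALUES (?, ?, ?)",
    [(state, f"1997Q{q}", amount)
     for state, amounts in figures.items()
     for q, amount in enumerate(amounts, start=1)],
)

dims = ["state", "quarteryr"]
branches = []
# CUBE = one GROUP BY per subset of the dimensions (2**n grouping sets).
for k in range(len(dims), -1, -1):
    for kept in combinations(dims, k):
        cols = ", ".join(d if d in kept else "NULL" for d in dims)
        group = f" GROUP BY {', '.join(kept)}" if kept else ""
        branches.append(f"SELECT {cols}, sum(amount) FROM QtrSales{group}")

cube = conn.execute(" UNION ALL ".join(branches)).fetchall()
totals = {(state, quarter): value for state, quarter, value in cube}
print(totals[(None, None)], totals[(None, "1997Q1")], totals[("WA", None)])
```

Compared with ROLLUP, the extra grouping set here is the per-quarter total across all states, which is exactly the row of sums the cross tab needs.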

Detailed Cross Tab

Data Visualization

Charting Visualizations

All: 1 dimension, 1 measure

Volume Visualization

Clustered data: 3 dimensions, 1 measure shown using color

Colored Sphere Visualization

Sparse data: 3 dimensions, 2 measures: pt size & color

White: colored measure unknown

Vector Glyph Visualization

2 dimensions, 4 measures: <x,y,z> & color

Dimensional Stacking

4 (2+2) dimensions, 1 binary measure
• could use color for continuous measure
• could chart: 3 (2+1) dimensions, 1 measure

Visualization Issues

Dimensions
– How many dimension attributes
– How dimension attributes are represented

Measures
– How many simultaneous measures
– How measures are represented: Spatial, Color (hue/brightness/…), Texture, Audio, other sensory

Transformations
– Measures & Dimension attributes
– 1 variable: sqr, sqrt, log, exp, 1/x
– N variables: linear combinations

Drill Up, Drill Down, Pivot

Interactivity & Immersiveness

Trend & Rank Analysis

Trend Example

In addition to calculating the total # of units sold by month, we want to smooth that over the preceding 3 months (the window)

Month  Year  Total  Smoothed Total
Jan    1994  200    200
Feb    1994  344    272
Mar    1994  401    315
Apr    1994  443    347
May    1994  360    387
Jun    1994  404    402
Jul    1994  389    399
Aug    1994  451    401
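The smoothed totals in the table are consistent with averaging each month's total together with up to three preceding months. A short Python sketch of that calculation follows; the function name `smoothed` is hypothetical, and the rounding is only there to return whole units.

```python
totals = [200, 344, 401, 443, 360, 404, 389, 451]  # Jan..Aug 1994, from the table

def smoothed(values, preceding=3):
    """Average each value with up to `preceding` values before it."""
    out = []
    for i in range(len(values)):
        # The window holds the current row plus the rows just before it.
        window = values[max(0, i - preceding): i + 1]
        out.append(round(sum(window) / len(window)))
    return out

print(smoothed(totals))
```

The first three rows average over fewer than four values because the window is cut off at the start of the data, which is why Jan shows 200 and Feb shows 272.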

Trend Analysis

When you build a result set, you may want to define a field that depends on a group of related rows in the same result set (the window)

This is particularly useful for analyzing trends

SELECT d.month, d.year,
  sum(f.units) as totunits,
  {moving average of totunits over 3 months preceding} as movavg
FROM DailySales f NATURAL JOIN Dates d
GROUP BY d.year, d.month

Trends in Oracle SQL

WITH MonthlyUnits AS (
  SELECT d.month, d.year, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Dates d
  GROUP BY d.year, d.month )
SELECT month, year, totunits,
  avg(totunits) OVER (
    ORDER BY year, month
    ROWS 3 PRECEDING ) AS movavg
FROM MonthlyUnits

Window: ROWS 3 PRECEDING (the current row and the 3 rows before it)

Trends in SQL 99

WITH MonthlyUnits AS (
  SELECT d.month, d.year, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Dates d
  GROUP BY d.year, d.month )
SELECT month, year, totunits,
  avg(totunits) OVER w AS movavg
FROM MonthlyUnits
WINDOW w AS (
  ORDER BY year, month
  ROWS BETWEEN 3 PRECEDING AND CURRENT ROW )

Trends using Subqueries

WITH MonthlyUnits AS (
  SELECT d.month, d.year,
    12*d.year + d.month as mknt,
    sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Dates d
  GROUP BY d.year, d.month )
SELECT m.month, m.year, m.totunits,
  (SELECT avg(mm.totunits)
   FROM MonthlyUnits mm
   WHERE mm.mknt BETWEEN m.mknt - 3 AND m.mknt) AS movavg
FROM MonthlyUnits m
ORDER BY m.year, m.month

Rank Example

Month  Year  Total  Rank
Jan    1994  200    8
Feb    1994  344    7
Mar    1994  401    4
Apr    1994  443    2
May    1994  360    6
Jun    1994  404    3
Jul    1994  389    5
Aug    1994  451    1

Window: all the months

In addition to calculating the total # of units sold by month, we want to rank it with respect to all the months
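As a rough illustration of what the Rank column means, the following Python sketch ranks each monthly total against all months, largest first. It assumes RANK-style tie handling (tied values share a rank and the next rank is skipped). The function name `rank_descending` is hypothetical.

```python
def rank_descending(totals):
    """Rank 1 = largest value; ties share the same (lowest) rank number."""
    ordered = sorted(totals, reverse=True)
    # index() finds the first position of a value, so ties get equal ranks
    return [ordered.index(t) + 1 for t in totals]

monthly_totals = [200, 344, 401, 443, 360, 404, 389, 451]  # Jan..Aug 1994
ranks = rank_descending(monthly_totals)
print(ranks)
```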

Ranking in Oracle SQL

WITH MonthlyUnits AS (
  SELECT d.month, d.year, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Dates d
  GROUP BY d.year, d.month )
SELECT month, year, totunits,
  rank() OVER (
    ORDER BY totunits DESC
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS ranktotal
FROM MonthlyUnits

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the ugly idiom for 'ALL ROWS'

Analytical Functions

Ranking Functions
  RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ROW_NUMBER
  Rank current row within the window

Inverse Percentile Functions
  PERCENTILE_CONT (continuous)
  PERCENTILE_DISC (discrete)

Histogram Support
  WIDTH_BUCKET

Lag/Lead Functions
  LAG, LEAD
  Return data from a row at a specified offset from the current row within the window

ETL: Extraction, Transformation & Loading

ETL: Extraction, Transformation & Loading

80% of total cost of building warehouse

Extraction → Transformation → Loading

Extraction

Sources
  Multiple DB's
  Flat Files
  External Data Sources
    • e.g. Census, Geographic, Weather, Financial, Unemployment Data
    • Standard DB/Spreadsheet format or semi-structured data from the web

Frequency
  Periodic (hourly, daily, weekly, …)
  Triggered
    • Single event
    • #, sequence, pattern of events

Mechanisms
  Snapshots / Materialized Views / Replication
  Database Triggers
  Process Logs
  Query Sources (full vs incremental)

Transformation

Cleaning
  Scrubbing
  Filtering
  Conformance

Integration
  Renaming
  Fusion & Merging
  Determine Surrogate Keys
  Timestamping
  Summarization

Schema Organization
  Dimension Tables
  Pre-Aggregation via Materialized Views
  Derivation

(Transformation) Cleaning

Scrubbing
  Use domain-specific knowledge
  e.g. SS#, phone-number, zipcode

Filtering
  Check for inconsistent data
  Use data validation rules

Conformance
  Map similarly typed data to standard representation
  Convert
    units (inch => cm, $ => euro)
    scale (mm => cm)
    formats (string => integer, string with/wo $)
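The conformance conversions listed above might look like the following in Python. This is only a sketch: the helper names (`inches_to_cm`, `mm_to_cm`, `parse_amount`) are made up for illustration, and currency conversion is left out because it needs an exchange rate that the slides do not specify.

```python
from decimal import Decimal

CM_PER_INCH = Decimal("2.54")

def inches_to_cm(inches):
    """Unit conversion: inch => cm."""
    return Decimal(str(inches)) * CM_PER_INCH

def mm_to_cm(mm):
    """Scale conversion: mm => cm."""
    return Decimal(str(mm)) / 10

def parse_amount(text):
    """Format conversion: accept '1250', '1,250' or '$1,250' and return an integer."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)

print(inches_to_cm(10), mm_to_cm(25), parse_amount("$1,250"))
```

Decimal is used here instead of float so that converted measurements do not pick up binary rounding noise before they are loaded.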

(Transformation) Integration

Renaming
  Resolve name conflicts

Fusion
  e.g. merge
    – properties in city db
    – properties in developer lists

Determine Surrogate Keys
  Do not use keys from operational data as primary key in warehouse data

Timestamping
  Add timestamps to fact data where missing to enable historical queries

Reorganization & Evolution
  Support Data Reorganization & Schema Evolution

Summarization
  Summarize original operational data and combine into less detailed tables

Integration (Data Reorganization)

What do we do when attributes change?

Suppose districts are reorganized and a store is now part of a different district

Consistently changing mapping of store to district
  – Allows new and old data to be compared reasonably by district
  – But causes incorrect comparisons by district among older data alone

Solutions
  1. Keep fields for both old and new mapping -- in fact, potentially a separate field for each reorganization
  2. Add effective date to store dimension. Have multiple rows for same store - each with different effective date
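Solution 2 can be sketched with Python's built-in `sqlite3` module. The column names `storkey` (a surrogate key) and `effdate`, the sample store and districts, and the helper `district_on` are all assumptions made for this illustration; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (storkey INTEGER PRIMARY KEY, storid INTEGER,
                     distid INTEGER, effdate TEXT);
INSERT INTO Stores VALUES (1, 17, 3, '2001-01-01');  -- store 17 starts in district 3
INSERT INTO Stores VALUES (2, 17, 8, '2004-07-01');  -- later reorganized into district 8
""")

def district_on(storid, date):
    """Return the district of the row in effect for the store on `date` (ISO text)."""
    row = conn.execute(
        "SELECT distid FROM Stores WHERE storid = ? AND effdate <= ? "
        "ORDER BY effdate DESC LIMIT 1", (storid, date)).fetchone()
    return row[0] if row else None

print(district_on(17, "2003-05-20"), district_on(17, "2005-01-01"))
```

Older facts keep resolving to the district that was in effect when they were recorded, while newer facts resolve to the new one.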

(Integration) Summarization

DailySales (Fact)
  storid, prodid, date, price, units

CustomerTransaction
  transid, custid, empid, posid, time

ItemPurchase
  transid, lineno, prodid, price, units

PointOfSaleTerminals
  posid, postyp, storid, loc

Might build different fact tables for different purposes:
  e.g. ones involving Customers
       ones involving Store Locations

Tradeoff
  Smaller Fact Tables vs. Missed Relationships
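One way the operational tables above could be summarized into DailySales is shown below, using Python's `sqlite3`. The tables are trimmed to the columns the summary needs, the sample rows are invented, and the result column `saleday` stands in for the fact table's `date` field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PointOfSaleTerminals (posid INTEGER, storid INTEGER);
CREATE TABLE CustomerTransaction (transid INTEGER, posid INTEGER, time TEXT);
CREATE TABLE ItemPurchase (transid INTEGER, lineno INTEGER, prodid INTEGER,
                           price REAL, units INTEGER);
INSERT INTO PointOfSaleTerminals VALUES (1, 10), (2, 10);
INSERT INTO CustomerTransaction VALUES
  (100, 1, '2002-03-01 09:15'), (101, 2, '2002-03-01 17:40'), (102, 1, '2002-03-02 11:05');
INSERT INTO ItemPurchase VALUES
  (100, 1, 7, 2.5, 3), (101, 1, 7, 2.5, 1), (102, 1, 7, 2.5, 4);
""")

# Collapse individual purchases to one row per store, product, day and price
daily_sales = conn.execute("""
    SELECT t.storid, i.prodid, date(c.time) AS saleday, i.price, sum(i.units) AS units
    FROM ItemPurchase i
    JOIN CustomerTransaction c ON c.transid = i.transid
    JOIN PointOfSaleTerminals t ON t.posid = c.posid
    GROUP BY t.storid, i.prodid, saleday, i.price
    ORDER BY saleday
""").fetchall()
print(daily_sales)
```

The summary no longer records which customer, employee, or terminal was involved, which is the "missed relationships" side of the tradeoff.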

Loading

Alternatives
  – Incremental vs Full Refresh: most data is incrementally added to the warehouse
  – Off-line vs on-line
  – Frequency
      • Nightly
      • Weekly
      • Monthly
  – All-at-once vs Staged

What indices to create or drop?
What statistics to collect (& use)?

Constellation Schema

Data warehouses often are designed as constellations
  • Multiple fact tables
  • Shared/related dimension tables

Examples
  – Sales: store, product, date
  – Distribution: distributor, store, product, carrier, period
  – Advertising: store, medium, product, period

Query across same or related dimensions
  – Compare advertising and sales by store within various periods

Data Marts

Store different fact tables (or different groups of fact tables) in separate data marts

Data Mart Architectures

Subset of Data Warehouse
Meets needs of subgroup of users

• Top-down:
  – Extracted from Data Warehouse
  – Problem: early availability

• Bottom-up:
  – Built directly from staging area
  – Can be combined to form warehouse
  – Problem: Conformance. ETL tool must provide metadata

• Hybrid:
  – Some data marts built directly from staging area
  – Others extracted from Data Warehouse

Metadata Management

Identify & define each attribute
  – Source(s)
  – Transformation(s) applied
  – How aggregated
  – Description of what it represents
  – Relationships to other attributes
  – History

Materialized Views & Query Rewriting

What is an Ordinary View

A view is not a table
A view does not hold data

A view is just a description used in expanding queries which refer to the view!

View Expansion

Suppose we define

CREATE VIEW HiEmps AS SELECT * FROM Emps WHERE sal > 1500

and then execute the query

SELECT ename, job FROM HiEmps

The database engine automatically expands this into

SELECT ename, job FROM Emps WHERE sal > 1500
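The expansion can be checked with a small Python sketch using the built-in `sqlite3` module. SQLite stands in here for whatever engine the slides have in mind, and the employee rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emps (ename TEXT, job TEXT, sal INTEGER);
INSERT INTO Emps VALUES ('SMITH', 'CLERK', 800), ('ALLEN', 'SALESMAN', 1600),
                        ('KING', 'PRESIDENT', 5000);
CREATE VIEW HiEmps AS SELECT * FROM Emps WHERE sal > 1500;
""")

# Querying the view and querying the hand-expanded form return the same rows
via_view = conn.execute("SELECT ename, job FROM HiEmps ORDER BY ename").fetchall()
expanded = conn.execute(
    "SELECT ename, job FROM Emps WHERE sal > 1500 ORDER BY ename").fetchall()

# The view holds no data: a row added to Emps later is visible through it at once
conn.execute("INSERT INTO Emps VALUES ('FORD', 'ANALYST', 3000)")
after_insert = conn.execute("SELECT ename FROM HiEmps ORDER BY ename").fetchall()
print(via_view, after_insert)
```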

Motivating Materialized Views

Suppose a view is
  • Used frequently in an application
  • Somewhat expensive to compute
  • Based on tables that change infrequently

It would be useful to
  • Store the contents of the view in a table
  • Use the table for queries
  • Arrange to update the table (automatically) when the base tables change [or, perhaps less frequently, if the view does not need to be perfectly up-to-date]

Example Star Schema

DailySales (Fact)
  storid, prodid, dateid, price, units

Stores (Dimension)
  storid, stornam, city, state, distid, distnam, distarea, regid, regnam

Products (Dimension)
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Dates (Dimension)
  dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year

Materialized Views

Materialized views actually hold data

CREATE MATERIALIZED VIEW ProdDistYrSum AS
  SELECT p.prodtyp, s.distid, d.year, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d
  GROUP BY p.prodtyp, s.distid, d.year

A materialized view is
  • Like a table, in that it actually stores the result of the query
  • Like a view, in that it is possible to arrange for it to be automatically updated when the underlying base data changes

Updating Materialized Views

During the loading phase new data is incrementally added to data warehouse tables

Materialized Views (which are defined as part of architecting the data warehouse) are either
  – Recalculated from scratch based on the new base table contents
  – Incrementally updated based on incremental changes to the base tables

How is ProdDistYrSum incrementally updated when a new day's worth of data is added?
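One possible answer, sketched in Python: because the view only stores sums, each new fact row can be added into the matching (prodtyp, distid, year) total without touching older data. The dictionary standing in for the materialized view, the function `apply_new_day`, and the sample rows are all illustrative assumptions. The new rows are assumed to be already joined to their dimension attributes.

```python
from collections import defaultdict

# Stand-in for the materialized view: (prodtyp, distid, year) -> totunits
prod_dist_yr_sum = defaultdict(int)

def apply_new_day(view, new_rows):
    """Fold one day's fact rows (prodtyp, distid, year, units) into the stored totals."""
    for prodtyp, distid, year, units in new_rows:
        view[(prodtyp, distid, year)] += units

# Two daily loads; neither requires rescanning previously loaded facts
apply_new_day(prod_dist_yr_sum, [("shirt", 3, 2002, 10), ("shoe", 3, 2002, 4)])
apply_new_day(prod_dist_yr_sum, [("shirt", 3, 2002, 6), ("shirt", 8, 2002, 1)])
print(dict(prod_dist_yr_sum))
```

This shortcut relies on sum being additive; an aggregate such as a median could not be maintained this simply.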

Using Materialized Views

Instead of writing,

SELECT p.prodtyp, s.distid, sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d
WHERE d.year = 2002
GROUP BY p.prodtyp, s.distid

just write

SELECT prodtyp, distid, totunits
FROM ProdDistYrSum
WHERE year = 2002

Because ProdDistYrSum is a materialized view, the database engine does NOT expand it, but just uses its materialized data

Aggregating Materialized Views

Instead of writing,

SELECT p.prodtyp, s.distid, sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
GROUP BY p.prodtyp, s.distid

just write

SELECT prodtyp, distid, sum(totunits) AS totunits_overalltime
FROM ProdDistYrSum
GROUP BY prodtyp, distid
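The equivalence of the two queries (a sum of per-year sums equals the overall sum) can be checked with a small `sqlite3` sketch. SQLite has no materialized views, so an ordinary table built with CREATE TABLE ... AS plays that role here. The flat `Sales` table, which stands in for the already-joined star schema, and its rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (prodtyp TEXT, distid INTEGER, year INTEGER, units INTEGER);
INSERT INTO Sales VALUES ('shirt', 3, 2001, 10), ('shirt', 3, 2002, 6),
                         ('shirt', 3, 2002, 5), ('shoe', 8, 2002, 4);
-- Stand-in for the materialized view
CREATE TABLE ProdDistYrSum AS
  SELECT prodtyp, distid, year, sum(units) AS totunits
  FROM Sales GROUP BY prodtyp, distid, year;
""")

from_base = conn.execute(
    "SELECT prodtyp, distid, sum(units) FROM Sales "
    "GROUP BY prodtyp, distid ORDER BY prodtyp, distid").fetchall()
from_view = conn.execute(
    "SELECT prodtyp, distid, sum(totunits) FROM ProdDistYrSum "
    "GROUP BY prodtyp, distid ORDER BY prodtyp, distid").fetchall()
print(from_base, from_view)
```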

Architecting Materialized Views

Which views should be materialized?

Many possible combinations:
  – district, region
  – district/week, district/month, …
  – region/week, region/month, …
  – district/category, district/manufacturer, …
  – category/week, category/month, …
  – category/district/week, …

Design balances
  – Cost of precalculating & storing view
  – Cost of calculating on the fly

A heuristic optimization problem
  – Uses statistics of queries
  – Uses size of each combination
  – e.g. Benefit Per Unit Space (BPUS)

Materialized View Evolution Problem

As the data warehouse evolves, the set of materialized views needs to change.

But, if the DW design already includes 1000 analysis queries, they would need to be rewritten to use the new set of materialized views.

This is expensive!

Query Rewriting

Systems (like Oracle) that support query rewriting

• Can automatically rewrite queries to use available materialized views (this can be complicated!)

• Allow a subset of materialized views to be marked for use in query rewriting

Query rewriting is the opposite of view expansion!

If the data warehouse does not support query rewriting, the ETL tool could do it instead!

Stores x SubCategories

CREATE MATERIALIZED VIEW StoreSubcatSum AS
  SELECT s.storid, p.subcatid, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
  GROUP BY s.storid, p.subcatid

[Diagram: dimension hierarchies Stores → Districts → Regions and Products → ProductTypes → SubCategories → Categories]

Districts x ProductTypes

CREATE MATERIALIZED VIEW DistProdtypSum AS
  SELECT p.prodtyp, s.distid, sum(f.units) as totunits
  FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
  GROUP BY p.prodtyp, s.distid

[Diagram: dimension hierarchies Stores → Districts → Regions and Products → ProductTypes → SubCategories → Categories]

Multiple Materialized View Alternatives

SELECT p.catid, s.regid, sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
GROUP BY p.catid, s.regid

[Diagram: dimension hierarchies Stores → Districts → Regions and Products → ProductTypes → SubCategories → Categories]

Should the optimizer rewrite this query in terms of StoreSubcatSum or DistProdtypSum?

In general, how good is the optimizer?
Can it discover unions, etc.?

Automatic Result Caching

Database can (potentially)
  – cache the results of any query automatically as a materialized view
  – use the query history to automatically define new materialized views

Then, based on their size & usage statistics, the DB can automatically determine
  – whether to discard any of these views after a while
  – whether to discard or update any of these views when their underlying base tables are updated

Materialized View References

DailySales (Fact)
  storid, prodid, dateid, price, units

Stores (Dimension)
  storid, stornam, city, state, distid, distnam, distarea, regid, regnam

Products (Dimension)
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

DistProdtypSum
  prodtyp, distid, totunits

The references from DistProdtypSum to Products (prodtyp) and Stores (distid) are Not Foreign Key Constraints

When dimensions are denormalized, primary keys of materialized views do not refer to unique dimension attributes

This requires using DISTINCT queries

Queries Requiring DISTINCT

SELECT p.prodtyp, s.distnam, sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
GROUP BY p.prodtyp, s.distid, s.distnam

rewritten as

SELECT DISTINCT prodtyp, distnam, totunits
FROM DistProdtypSum NATURAL JOIN Stores

DISTINCT is only needed because distid is not unique in Stores

Fix by adding a denormalized subdimension table for Districts
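The duplicate rows that make DISTINCT necessary can be seen in a small `sqlite3` sketch. The tables are cut down to the columns involved and the rows are invented: two stores share district 3, so joining the summary to Stores repeats that district's total once per store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (storid INTEGER, distid INTEGER, distnam TEXT);
INSERT INTO Stores VALUES (1, 3, 'North'), (2, 3, 'North'), (3, 8, 'South');
CREATE TABLE DistProdtypSum (prodtyp TEXT, distid INTEGER, totunits INTEGER);
INSERT INTO DistProdtypSum VALUES ('shirt', 3, 16), ('shirt', 8, 1);
""")

# NATURAL JOIN matches on the shared column distid
plain = conn.execute(
    "SELECT prodtyp, distnam, totunits "
    "FROM DistProdtypSum NATURAL JOIN Stores ORDER BY distnam").fetchall()
distinct = conn.execute(
    "SELECT DISTINCT prodtyp, distnam, totunits "
    "FROM DistProdtypSum NATURAL JOIN Stores ORDER BY distnam").fetchall()
print(plain)
print(distinct)
```

With a Districts table holding one row per distid, the join would return each total exactly once and DISTINCT could be dropped.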

Starflake Schema
(Fully Denormalized Dimensions and SubDimensions as needed)

DailySales (Fact)
  storid, prodid, dateid, price, units

Store (Dimension)
  storid, stornam, city, state, distid, distnam, distarea, regid, regnam

Products (Dimension)
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

DistProdtypSum
  prodtyp, distid, totunits

ProductTypes (SubDimension)
  prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Districts (SubDimension)
  distid, distnam, distarea, regid, regnam

Keep Denormalized

Indexing for Data Warehouses

Implementing Warehouse Queries

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p
WHERE p.catid = 5

DailySales (Fact), Index by prodid
  storid, prodid, dateid, price, units

Products (Dimension), Index by catid
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Assume that
  • Products is indexed by catid
  • DailySales is indexed by prodid

For a specific value of catid
  Get all rowids in Products with that catid
  Extract the prodid's
  Get all rowids in DailySales with those prodid's
  Extract the units from the rows & sum

Using Join Indexing

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p
WHERE p.catid = 5

DailySales (Fact), Index by Products.catid
  storid, prodid, dateid, price, units

Products (Dimension)
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Assume that
  • DailySales is indexed by Products.catid

For a specific value of catid
  Get all rowids in DailySales with that catid
  Extract the units from the rows & sum

A join index is an index on one table based (in part) on values of fields of other tables
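A toy Python model of a join index, under simplifying assumptions: rowids are list positions, Products is reduced to a prodid-to-catid mapping, and the sample data is invented. The index is built once by following the join, after which a query on catid never has to visit Products.

```python
from collections import defaultdict

products = {101: 5, 102: 2, 103: 5}                      # prodid -> catid
daily_sales = [(101, 3), (102, 7), (103, 4), (101, 1)]   # rowid = position; (prodid, units)

# Build the join index: Products.catid -> rowids in DailySales
join_index = defaultdict(list)
for rowid, (prodid, _units) in enumerate(daily_sales):
    join_index[products[prodid]].append(rowid)

def total_units(catid):
    """Sum units for a category using only the join index and the fact rows."""
    return sum(daily_sales[rowid][1] for rowid in join_index.get(catid, []))

print(total_units(5), total_units(2))
```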

Multi-Table Joins

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p NATURAL JOIN Dates d
WHERE p.catid = 5 AND d.year = 2002

DailySales (Fact), Index by Product.catid, Index by Dates.year
  storid, prodid, dateid, price, units

Assume that
  • DailySales is indexed by Product.catid
  • DailySales is indexed by Dates.year

For a specific value of catid
  Get all rowids in DailySales with that catid
For a specific year
  Get all rowids in DailySales with that year
Intersect the two lists of rowids (lots!)
Extract the units from the rows & sum

Multi-Table Join Indexes

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Product p NATURAL JOIN Dates d
WHERE p.catid = 5 AND d.year = 2002

DailySales (Fact), Index by (Product.catid, Dates.year)
  storid, prodid, dateid, price, units

Assume that
  • DailySales is indexed by ( Product.catid, Dates.year )

For a distinct pair of (catid, year)
  Get all rowids in DailySales with that catid & year
  Extract the units from the rows & sum

Same issue as for aggregates. Lots of possible different combinations of multi-table join indices: Which ones are worth building?

Bitmap Indexing

Product                     ProductBitmapIndex (on catid)
prodid   …  catid  …        1  2  3  4  5
603942   …  5      …        0  0  0  0  1
603947   …  2      …        0  1  0  0  0
603950   …  2      …        0  1  0  0  0
603951   …  2      …        0  1  0  0  0
603964   …  3      …        0  0  1  0  0
603968   …  5      …        0  0  0  0  1

Using Bitmap Indices

SELECT min(size), max(size)
FROM Product
WHERE catid = 5

Product (Dimension), Bitmap Index by category
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Assume that
  • Product has bitmap index on catid (faster than doing full scan or using B+ tree)

Implement Query By
  Scan all tuples in Product with the bit set for catid: 5
  Extract the size from each tuple and compute min(size) and max(size)

Bitmap Intersection

SELECT min(size), max(size)
FROM Product
WHERE catid = 5 AND color = 'fuchsia'

Product (Dimension), Bitmap Indices by category & by color
  prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr

Assume that
  • Product has bitmap indices on catid and on color

Implement Query By
  Construct the bit vector which has bits set for both catid: 5 and color: fuchsia (very fast!)
  Scan all tuples in Product with the bit set in the resulting bit vector
  Extract the size from each tuple and compute min(size) and max(size)
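Bitmap intersection can be modeled in a few lines of Python by using an integer as the bit vector, with bit i standing for row i. This is a sketch only: the colors, sizes, and the last prodid are invented sample data, and `build_bitmap_index` is a hypothetical helper, not how any particular DBMS stores its bitmaps.

```python
products = [  # (prodid, catid, color, size); row position = bit position
    (603942, 5, "fuchsia", 8),
    (603947, 2, "fuchsia", 10),
    (603950, 2, "teal", 6),
    (603968, 5, "fuchsia", 12),
    (603970, 5, "teal", 14),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index = {}
    for position, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << position)
    return index

by_catid = build_bitmap_index(products, 1)
by_color = build_bitmap_index(products, 2)

# One bitwise AND intersects the two sets of rows
matches = by_catid[5] & by_color["fuchsia"]
sizes = [row[3] for position, row in enumerate(products) if (matches >> position) & 1]
size_range = (min(sizes), max(sizes))
print(bin(matches), size_range)
```

Only the rows whose bit survives the AND are fetched, which is why combining bitmaps is so much cheaper than intersecting long rowid lists.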

Bitmap Join Indexing

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Product p
WHERE p.catid = 5

DailySales (Fact), Bitmap Index by Product.catid
  storid, prodid, dateid, price, units

Assume that
  • DailySales has bitmap index on Product.catid

For a specific value of catid
  Get all rowids in DailySales with that catid
  Extract the units from the rows & sum

Multi-Table Joins with Bitmap Indices

SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Product p NATURAL JOIN Store s
WHERE p.catid = 5 AND s.city = 'Boston'

DailySales (Fact), Bitmap Index by Product.catid, Bitmap Index by Store.city
  storid, prodid, dateid, price, units

Assume that
  • DailySales has bitmap index on Product.catid
  • DailySales has bitmap index on Store.city

Implement Query By
  Construct the bit vector which has bits set for both catid: 5 and city: Boston (very fast!)
  Get the rowids of all rows with bit set in the resulting bit vector
  Extract the units from the rows & sum

Indexing vs Materialization

Indexing
  Less space
  Usable with different kinds of aggregation and analysis operations
  More opportunities for combining

Materialized Views (esp. Aggregates)
  Avoid recomputation, esp. recalculation of aggregates