dimensional modeling


Upload: jayanthsk

Post on 28-Nov-2014


TRANSCRIPT

Page 1: Dimensional Modeling

1

Dimensional Data Modeling

Module 6

Page 2: Dimensional Modeling

2

Course Agenda

• Rationale for dimensional modeling
• Dimensional modeling basics
• Dimensional modeling details
• Fact table details
• Dimension table details
• Design process
• Aggregate schemas
• Multiple fact tables
• Architected data marts

Page 3: Dimensional Modeling

3

Rationale for Dimensional Modeling

Page 4: Dimensional Modeling

4

The Business Value Chain

A series of interrelated business processes which contribute to increased product value for the customer, and to profit for the enterprise (Porter, 1985)

[Figure: value chain processes: Operations, Sales and Marketing, Customer Services, Product Development]

Page 5: Dimensional Modeling

5

Drive to Compete

Businesses constantly strive to optimize each process in the value chain

Optimization requires measuring and analyzing the effectiveness of each process as well as the value chain as a whole


Page 6: Dimensional Modeling

6

The Role of Information Technology

Process optimization
• Supported by on-line transaction processing (OLTP) systems

Measuring and analyzing processes
• Supported by 'analytic' systems
• Data warehouse


Page 7: Dimensional Modeling

7

Example OLTP Systems

• Manufacturing and Process Control
• Sales Order Entry and Campaign Management
• Customer Support and Relationship Management
• Shipping and Inventory Management


Page 8: Dimensional Modeling

8

OLTP Systems & Business Events

Events are the heart of every business
• Book an order
• Print a pick list
• Record a cash withdrawal
• Post a payment

Event detail is collected by OLTP systems
• Atomic focus
• Transaction consistency

Page 9: Dimensional Modeling

9

OLTP System Reporting

OLTP systems answer event-oriented questions well
• Run invoices
• Print ledger
• Pull up customer detail

Operational reporting
• Focused on detail
• Predictable requirements and query patterns
• Does not reveal the overall performance of a process

Page 10: Dimensional Modeling

10

OLTP Design Characteristics

Focus of OLTP design
• Individual data elements
• Data relationships

Design goals
• Accurately model the business
• Remove redundancy

Page 11: Dimensional Modeling

11

OLTP Design Shortcomings

• Complex
• Unfamiliar to business people
• Incomplete history
• Slow query performance

Page 12: Dimensional Modeling

12

Emergence of Dimensional Model

Logical modeling technique
• For designing relational database structures

Addresses OLTP design shortcomings
• For use in analytic systems

First developed in the early 1980s
• Packaged goods industry

Popularized by Ralph Kimball, PhD
• 1996 book: 'The Data Warehouse Toolkit'

Page 14: Dimensional Modeling

14

Dimensional Modeling Basics

Page 15: Dimensional Modeling

15

Sample Value Chain Analysis

Process-oriented business questions
• "I need to see overall gross margin by category"
• "How do inventory levels compare with sales by product and warehouse?"
• "What are outstanding receivables by G/L account?"
• "What is the return rate for each supplier?"

Page 16: Dimensional Modeling

16

Measurement Focus

Process-oriented business measures
• gross margin
• inventory levels, sales
• receivables
• return rate

Page 17: Dimensional Modeling

17

Coffee Maker Fulfillment Report

Brand            Product                 Units Sold   Units Shipped   % Shipped
Captain Coffee   Standard Coffee Maker        5,000           3,800         76%
                 Thermal Coffee Maker         2,400           1,632         68%
                 Deluxe Coffee Maker          2,073           1,658         80%
                 All Products                 9,473           7,090         75%

Process Measurement

Facts (in the report: Units Sold, Units Shipped, % Shipped)

Measures
• Metrics or indicators by which people evaluate a business process
• Referred to as “Facts”

Examples
• Margin
• Inventory Amount
• Sales Dollars
• Receivable Dollars
• Return Rate

Page 18: Dimensional Modeling

18

Perspective Focus

Process-oriented business perspectives
• category
• product, warehouse
• G/L account
• supplier

Page 19: Dimensional Modeling

19

[Figure: Coffee Maker Fulfillment Report, repeated from page 17, with columns Brand, Product, Units Sold, Units Shipped, % Shipped]

Process Perspectives

Dimensions (in the report: Brand, Product)
• The parameters by which measures are viewed
• Used to break out, filter or roll up measures
• Often found after the word “by” in a business question
• Descriptive business terms

Examples
• Product
• Warehouse
• Customer
• Supplier

Page 20: Dimensional Modeling

20

Dimensional Model

Definition
• Logical data model used to represent the measures and dimensions that pertain to one or more business subject areas
• Dimensional Model = Star Schema

Serves as basis for the design of a relational database schema

Can easily translate into multi-dimensional database design if required

Overcomes OLTP design shortcomings

Page 21: Dimensional Modeling

21

Dimensional Model Advantages

Understandable

Systematically represents history

Reliable join paths

High performance query

Enterprise scalability

Page 22: Dimensional Modeling

22

Schema Simplicity

[Figure: star schema with a central Facts table surrounded by Store, Time, and Product dimensions]

Fewer tables
• Denormalized
• Consolidated

Dimensional
• Familiar to users
• Facts go in the fact tables
• Dimensions in dimension tables

Increases understandability

Page 23: Dimensional Modeling

23

Data Familiarity

[Figure: source field ord_date expanded into a Time Dimension with year, quarter, month, date, day of the week, holiday flag]

Adding business context
• Single source field
• Expanded into parts
• Decoded into business terms
• Add special indicators and flags
• e.g. time dimension

Increases understandability
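The expansion of a single source date into business-friendly attributes can be sketched as follows. This is a minimal illustration, not part of the original deck: the function name `expand_date` and the `HOLIDAYS` set are hypothetical, and a real warehouse would load holidays from a business calendar.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical holiday calendar (assumption for illustration only).
HOLIDAYS = {date(1997, 1, 1), date(1997, 7, 4)}

def expand_date(ord_date: date) -> dict:
    """Expand one source date field into descriptive time-dimension attributes."""
    return {
        "date": ord_date.isoformat(),
        "year": ord_date.year,
        "quarter": f"Q{(ord_date.month - 1) // 3 + 1}",
        "month": MONTHS[ord_date.month - 1],
        "day_of_week": DAYS[ord_date.weekday()],  # weekday(): Monday is 0
        "holiday_flag": "Y" if ord_date in HOLIDAYS else "N",
    }

row = expand_date(date(1997, 1, 15))
print(row)
```

Each returned dictionary corresponds to one row of the time dimension; the load process would assign it a synthetic time_key.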

Page 24: Dimensional Modeling

24

Representing History

[Figure: star schema with Facts joined to Store, Product, and a Time Dimension holding year, quarter, month, date, day of the week, holiday flag]

Time dimension
• Part of every star schema
• Marks the date when the facts (process measurements) occurred
• Allows the schema to easily add and query data over time
• Especially useful for performing comparison queries

Page 25: Dimensional Modeling

25

Fewer Join Paths

Star schema joins
• Defined during schema design, not at runtime
• Business people can easily understand these relationships
• One-to-many relations between dimensions and facts
• Referential integrity always enforced

Page 26: Dimensional Modeling

26

High Performance Design

Fewer joins mean less 'expensive' queries

Deterministic query patterns

Star schema query optimization supported by all major RDBMS vendors

Page 27: Dimensional Modeling

27

Subject Area Models

[Figure: subject area E/R models and subject area dimensional models for Manufacturing and Process Control; Sales Order Entry and Campaign Management; Customer Support and Relationship Management; Shipping and Inventory Management]

Page 28: Dimensional Modeling

28

Enterprise Models

• Enterprise scope E/R model
• Enterprise scope dimensional model

Page 29: Dimensional Modeling

29

Exercise 1

Scenario
• Industry: Automobile manufacturing
• Company: Millennium Motors
• Value chain focus: Sales

Sample business questions:
• What are the top 10 selling car models this month?
• How do this month's top 10 selling models compare to the top 10 over the last six months?
• Show me dealer sales by region by model by day
• What is the total number of cars sold by month by dealer by state?

List facts and dimensions

Page 30: Dimensional Modeling

30

Exercise 1 - worksheet

Page 31: Dimensional Modeling

31

Exercise 1 Solution

Facts
• Sales revenue
• Quantity sold

Dimensions
• Model name
• Month
• Dealer name
• Region
• State
• Date

Page 33: Dimensional Modeling

33

Dimensional Design Details

Page 34: Dimensional Modeling

34

Star Schema Dimension Tables

[Figure: star schema with three dimension tables highlighted]

Dimension tables
• Store dimension values
• Textual content
• Dimension tables usually referred to simply as 'dimensions'

Spend extra effort to add dimensional attributes

Page 35: Dimensional Modeling

35

Dimension Keys

[Figure: three dimension tables, each with its own key column]

Synthetic keys
• Each table assigned a unique primary key, specifically generated for the data warehouse
• Primary keys from source systems may be present in the dimension, but are not used as primary keys in the star schema
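A minimal sketch of synthetic key assignment, assuming a simple in-memory lookup. The class name `KeyGenerator`, the column name `source_dealer_id`, and the source identifiers are hypothetical; a production load would normally use a database sequence or identity column.

```python
from itertools import count

class KeyGenerator:
    """Assigns warehouse-generated (synthetic) keys to source-system keys."""

    def __init__(self):
        self._next = count(1)          # keys are generated for the warehouse only
        self._by_source_key = {}

    def key_for(self, source_key: str) -> int:
        """Return the existing synthetic key, or generate a new one."""
        if source_key not in self._by_source_key:
            self._by_source_key[source_key] = next(self._next)
        return self._by_source_key[source_key]

dealer_keys = KeyGenerator()
dealer_dim = []
for source_id, name in [("D-0042", "Honest Ted's"), ("D-0107", "Stoller Co.")]:
    dealer_dim.append({
        "dealer_key": dealer_keys.key_for(source_id),  # primary key in the star schema
        "source_dealer_id": source_id,                 # kept as an attribute only
        "dealer": name,
    })
print(dealer_dim)
```

The source-system identifier stays in the dimension as an ordinary attribute, which keeps the star schema independent of key changes in the OLTP system.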

Page 36: Dimensional Modeling

36

Dimension Columns

[Figure: three dimension tables, each with a key followed by several attribute columns]

Dimension attributes
• Specify the way in which measures are viewed: rolled up, broken out or summarized
• Often follow the word “by” as in “Show me Sales by Region and Quarter”
• Frequently referred to as 'Dimensions'

Page 37: Dimensional Modeling

37

Star Schema Fact Table

[Figure: fact table with columns fact1, fact2, fact3]

Process measures
• Start by assigning one fact table per business subject area
• Fact tables store the process measures (aka facts)
• Compared to dimension tables, fact tables usually have a very large number of rows

Page 38: Dimensional Modeling

38

Fact Table Primary Key

[Figure: fact table with three key columns followed by fact1, fact2, fact3]

Every fact table
• Multi-part primary key added
• Made up of foreign keys referencing dimensions

Page 39: Dimensional Modeling

39

Fact Table Sparsity

Sparsity
• Term used to describe the very common situation where a fact table does not contain a row for every combination of every dimension table row for a given time period
• Because fact tables contain a very small percentage of all possible combinations, they are said to be "sparsely populated" or "sparse"

Page 40: Dimensional Modeling

40

Fact Table Grain

Grain
• The level of detail represented by a row in the fact table
• Must be identified early
• Cause of greatest confusion during design process

Example
• Each row in the fact table represents the daily item sales total

Page 41: Dimensional Modeling

41

Sparsity Example

Assume
• 5,000 rows in 'dealer' dimension
• 50 rows in 'model' dimension

If all dealers sold all models every day:
• 5,000 * 50 = 250,000 sales every day
• 250,000 * 365 = 91,250,000 sales every year
• Assuming only one of each model is sold at every dealer!

Sparsity
• Means that only a small fraction of the total possible 250,000 will be sold on a given day
• Generally, only record sales, not zeroes, in the fact table
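The arithmetic in this example can be checked with a short sketch. The dimension sizes come from the slide; the figure of 12,000 actual dealer/model combinations per day is a hypothetical value chosen only to illustrate density.

```python
dealers = 5_000   # rows in the 'dealer' dimension (from the slide)
models = 50       # rows in the 'model' dimension (from the slide)

# Upper bound: one fact row per dealer/model combination per day.
possible_rows_per_day = dealers * models
possible_rows_per_year = possible_rows_per_day * 365

# Hypothetical: combinations that actually recorded a sale today.
actual_rows_today = 12_000
density = actual_rows_today / possible_rows_per_day

print(possible_rows_per_day)    # 250000
print(possible_rows_per_year)   # 91250000
print(f"{density:.1%}")         # 4.8%
```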

Page 42: Dimensional Modeling

42

Designing a Star Schema

Five initial design steps
• Based on Kimball's six steps
• Start designing in order
• Re-visit and adjust over project life

Page 43: Dimensional Modeling

43

Step One

Identify fact table

Start by naming the fact table with the name of the business subject area

Page 44: Dimensional Modeling

44

Step Two

Identify fact table grain

Describe what a row in the fact table represents - in business terms

Page 45: Dimensional Modeling

45

Step Three

Identify dimensions

Page 47: Dimensional Modeling

47

Step Five

Identify dimensional attributes

Page 48: Dimensional Modeling

48

Exercise 2

Scenario
• Industry: Automobile manufacturing
• Company: Millennium Motors
• Value chain focus: Sales

Sample business questions:
• What are the top 10 selling car models this month?
• How do this month's top 10 selling models compare to the top 10 over the last six months?
• Show me dealer sales by region by model by day
• What is the total number of cars sold by month by dealer by state?

Page 49: Dimensional Modeling

49

Exercise 2 - continued

Using these source data elements, design a star schema that answers the proposed business questions
• Sales revenue
• Quantity sold
• Model name
• Dealer name
• Dealer city
• Product line
• Region where sold
• State
• Vehicle category
• Month
• Date of sales

Page 50: Dimensional Modeling

50

Exercise 2 - sample data

Page 51: Dimensional Modeling

51

Exercise 2 - worksheet

Page 52: Dimensional Modeling

52

Exercise 2 - solution

Step 1 - Fact table name: 'Sales facts'

Step 2 - Fact table grain: Every row in the sales facts table is a summary of car model sales for that day at a single dealer

Step 3 - Dimensions: Time, Model, Dealer

Step 4 - Facts: Total revenue, Quantity sold

Step 5 - Dimensional attributes: See next page

Page 53: Dimensional Modeling

53

Exercise 2 - Dimensional Model

Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer
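The Exercise 2 model could be realized as relational tables roughly as follows. This is a sketch using Python's built-in sqlite3 module; the `_dim` table suffixes, the `sale_date` column name, the column types, and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_dim  (model_key INTEGER PRIMARY KEY, category TEXT, line TEXT, model TEXT);
CREATE TABLE dealer_dim (dealer_key INTEGER PRIMARY KEY, region TEXT, state TEXT, city TEXT, dealer TEXT);
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT, sale_date TEXT);
CREATE TABLE sales_facts (
    model_key  INTEGER NOT NULL REFERENCES model_dim(model_key),
    dealer_key INTEGER NOT NULL REFERENCES dealer_dim(dealer_key),
    time_key   INTEGER NOT NULL REFERENCES time_dim(time_key),
    revenue    REAL,
    quantity   INTEGER,
    PRIMARY KEY (model_key, dealer_key, time_key)  -- multi-part key made of foreign keys
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO model_dim VALUES (?, ?, ?, ?)",
                 [(1, "Sedan", "Family", "M100"), (2, "SUV", "Utility", "X200")])
conn.executemany("INSERT INTO dealer_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "Northeast", "Massachusetts", "Boston", "Honest Ted's"),
                  (3, "Southwest", "Arizona", "Tucson", "Wright Motors")])
conn.execute("INSERT INTO time_dim VALUES (1, 1997, 'Q1', 'January', '1997-01-15')")
conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 75840.27, 2), (2, 1, 1, 152260.37, 3), (1, 3, 1, 28360.15, 1)])

# "Show me dealer sales by region by model": group by dimension attributes.
rows = conn.execute("""
    SELECT d.region, m.model, SUM(f.quantity) AS quantity_sold
    FROM sales_facts AS f
    JOIN dealer_dim AS d ON d.dealer_key = f.dealer_key
    JOIN model_dim  AS m ON m.model_key  = f.model_key
    GROUP BY d.region, m.model
    ORDER BY d.region, m.model
""").fetchall()
print(rows)
```

Every business question in the exercise follows the same pattern: join the fact table to the dimensions named after "by", group on their attributes, and sum the facts.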

Page 56: Dimensional Modeling

56

Example Fact Table

Sales Facts: model_key, dealer_key, time_key, revenue, quantity

Page 57: Dimensional Modeling

57

Example Fact Table Records

Sales Facts

time_key   model_key   dealer_key   revenue      quantity
1          1           1            75840.27     2
1          2           1            152260.37    3
1          3           1            28360.15     1
1          4           1            132675.22    4
1          5           1            43789.45     1
1          1           2            35678.98     1
1          3           2            57864.78     2
1          5           2            92876.67     2

Primary key: time_key, model_key, dealer_key
Facts: revenue, quantity

Page 58: Dimensional Modeling

58

Facts

Fully additive
• Can be summed across any and all dimensions
• Stored in fact table
• Examples: revenue, quantity
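Full additivity can be illustrated with the sample fact records from page 57: summing quantity by dealer or by model gives different breakdowns of the same grand total. The helper name `total_quantity_by` is hypothetical.

```python
from collections import defaultdict

# (time_key, model_key, dealer_key, revenue, quantity) from the sample fact records
sales_facts = [
    (1, 1, 1, 75840.27, 2), (1, 2, 1, 152260.37, 3), (1, 3, 1, 28360.15, 1),
    (1, 4, 1, 132675.22, 4), (1, 5, 1, 43789.45, 1), (1, 1, 2, 35678.98, 1),
    (1, 3, 2, 57864.78, 2), (1, 5, 2, 92876.67, 2),
]

def total_quantity_by(position: int) -> dict:
    """Sum the quantity fact grouped by the key stored at the given tuple position."""
    totals = defaultdict(int)
    for row in sales_facts:
        totals[row[position]] += row[4]
    return dict(totals)

by_model = total_quantity_by(1)    # grouped by model_key
by_dealer = total_quantity_by(2)   # grouped by dealer_key
grand_total = sum(row[4] for row in sales_facts)

print(by_dealer)    # {1: 11, 2: 5}
print(by_model)     # {1: 3, 2: 3, 3: 3, 4: 4, 5: 3}
print(grand_total)  # 16
```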

Page 59: Dimensional Modeling

59

Example: Additive Facts

Sales Facts: model_key, dealer_key, time_key, revenue, quantity
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer

Page 60: Dimensional Modeling

60

Facts

Semi-additive
• Can be summed across most dimensions but not all
• Examples: Inventory quantities, account balances, or personnel counts
• Anything that measures a “level”
• Must be careful with ad-hoc reporting
• Often aggregated across the “forbidden dimension” by averaging
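A small sketch of why a "level" such as inventory is only semi-additive. All rows are hypothetical: summing across dealers for one date is meaningful, while summing across dates is not, so the time dimension is averaged instead.

```python
# Hypothetical daily inventory snapshots: (date, dealer, units_on_hand)
inventory = [
    ("1997-01-15", "Dealer A", 10), ("1997-01-15", "Dealer B", 4),
    ("1997-01-16", "Dealer A", 8),  ("1997-01-16", "Dealer B", 6),
]

# Summing across the dealer dimension for a single date is meaningful.
on_hand_jan15 = sum(units for day, _, units in inventory if day == "1997-01-15")

# Summing across time is not: the same cars are counted once per snapshot.
dealer_a_levels = [units for _, dealer, units in inventory if dealer == "Dealer A"]
misleading_sum = sum(dealer_a_levels)

# Across the "forbidden dimension" (time), average the level instead.
average_level = sum(dealer_a_levels) / len(dealer_a_levels)

print(on_hand_jan15)   # 14
print(misleading_sum)  # 18, which is not a real stock level
print(average_level)   # 9.0
```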

Page 61: Dimensional Modeling

61

Example: Semi-additive Facts

Sales Facts: model_key, dealer_key, time_key, inventory
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer

Page 62: Dimensional Modeling

62

Facts

Non-additive
• Cannot be summed across any dimension
• All ratios are non-additive
• Break down to fully additive components, store them in fact table
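The advice to store additive components rather than the ratio can be sketched with two hypothetical fact rows. Summing the per-row rates gives a meaningless number, while dividing the summed components gives the correct overall rate.

```python
# Hypothetical fact rows holding only the additive components.
rows = [
    {"revenue": 100.0, "margin_amt": 10.0},   # row-level margin rate 10%
    {"revenue": 300.0, "margin_amt": 90.0},   # row-level margin rate 30%
]

# Wrong: adding ratios (0.10 + 0.30) does not describe the combined rows.
sum_of_rates = sum(r["margin_amt"] / r["revenue"] for r in rows)

# Right: sum the additive components first, then compute the ratio.
total_revenue = sum(r["revenue"] for r in rows)
total_margin = sum(r["margin_amt"] for r in rows)
margin_rate = total_margin / total_revenue

print(round(sum_of_rates, 2))  # 0.4
print(margin_rate)             # 0.25
```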

Page 63: Dimensional Modeling

63

Example: Non-Additive Facts

Margin_rate is non-additive
Margin_rate = margin_amt / revenue

Sales Facts: model_key, dealer_key, time_key, revenue, margin_amt
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date
Dealer: dealer_key, region, state, city, dealer

Page 64: Dimensional Modeling

64

Unit Amounts

Unit price, unit cost, etc.
• Are numeric, but not measures
• Store the extended amounts which are additive
• Unit amounts may be useful as dimensions for “price point analysis”
• May store unit values to save space

Page 65: Dimensional Modeling

65

Factless Fact Table

A fact table with no measures in it
• Nothing to measure...
• ...except the convergence of dimensional attributes
• Sometimes store a “1” for convenience
• Examples: Attendance, Customer Assignments, Coverage
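A minimal sketch of querying a factless fact table, using the attendance example. The keys and rows are hypothetical; the point is that the only "measure" is the existence of a row, so analysis is done by counting.

```python
from collections import Counter

# Hypothetical attendance fact rows: (date_key, student_key, course_key).
# There is no numeric measure; each row records that the dimensions met.
attendance_facts = [
    (1, 101, 7),
    (1, 102, 7),
    (2, 101, 7),
    (2, 103, 8),
]

# Counting rows plays the role that summing a fact plays elsewhere.
attendance_by_course = Counter(course_key for _, _, course_key in attendance_facts)
attendance_by_date = Counter(date_key for date_key, _, _ in attendance_facts)

print(attendance_by_course)
print(attendance_by_date)
```

Storing a constant 1 in each row, as the slide suggests, lets reporting tools use SUM instead of COUNT with the same result.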

Page 68: Dimensional Modeling

68

Example Dimension Tables

Dealer: dealer_key, region, state, city, dealer
Model: model_key, brand, category, line, model
Time: time_key, year, quarter, month, date

Page 69: Dimensional Modeling

69

Example Dimension Table Records

Time Dimension

time_key   year   quarter   month     date
1          1997   Q1        January   1/15/97
2          1997   Q1        January   1/16/97
3          1997   Q1        January   1/17/97
150        1997   Q2        April     4/1/97
777        1998   Q4        October   10/13/98

Synthetic key: time_key
Attributes: year, quarter, month, date

Page 70: Dimensional Modeling

70

Example Dimension Table Records

Dealer Dimension

dealer_key   region      state           city        dealer
1            Northeast   Massachusetts   Boston      Honest Ted's
2            Northeast   Massachusetts   Boston      Stoller Co.
3            Southwest   Arizona         Tucson      Wright Motors
12           Southwest   California      San Diego   American
245          Central     Illinois        Chicago     Lugwig Motors

Synthetic key: dealer_key
Attributes: region, state, city, dealer

Page 71: Dimensional Modeling

71

Dimension Tables

Characteristics
• Hold the dimensional attributes
• Usually have a large number of attributes (“wide”)
• Add flags and indicators that make it easy to perform specific types of reports
• Have small number of rows in comparison to fact tables (most of the time)

Page 72: Dimensional Modeling

72

Don't Normalize Dimensions

• Saves very little space
• Impacts performance
• Can confuse matters when multiple hierarchies exist
• A star schema with normalized dimensions is called a "snowflake schema"
• Usually advocated by software vendors whose products require snowflake for performance

Page 73: Dimensional Modeling

73

Example Snowflake Schema

Sales Facts: model_key, dealer_key, time_key, revenue, quantity

Model: model_key, model, line_key
Line: line_key, line, category_key
Category: category_key, category, brand_key
Brand: brand_key, brand

Day: date_key, date, month_key
Month: month_key, month, quarter_key
Quarter: quarter_key, quarter, year_key
Year: year_key, year

Dealer: dealer_key, dealer, city_key
City: city_key, city, state_key
State: state_key, state, region_key
Region: region_key, region

Page 74: Dimensional Modeling

74

Slowly Changing Dimensions

• Dimension source data may change over time
• Relative to fact tables, dimension records change slowly
• Allows dimensions to have multiple 'profiles' over time to maintain history
• Each profile is a separate record in a dimension table

Page 75: Dimensional Modeling

75

Slowly Changing Dimension Example

Example: A woman gets married

Possible changes to customer dimension
• Last Name
• Marriage Status
• Address
• Household Income

Existing facts need to remain associated with her single profile

New facts need to be associated with her married profile

Page 76: Dimensional Modeling

76

Slowly Changing Dimension Types

Three types of slowly changing dimensions

Type 1
• Updates existing record with modifications
• Does not maintain history

Type 2
• Adds new record
• Does maintain history
• Maintains old record

Type 3
• Keeps old and new values in the existing row
• Requires a design change

Page 77: Dimensional Modeling

77

Designing Loads to Handle SCD

Design and implementation guidelines
• Gather SCD requirements when designing data mapping and loading
• SCD needs to be defined and implemented at the dimensional attribute level
• Each column in a dimension table needs to be identified as a Type 1 or a Type 2 SCD
• If one Type 1 column changes, then all Type 1 columns will be updated
• If one Type 2 column changes, then a new record will be inserted into the dimension table

Page 78: Dimensional Modeling

78

Designing Loads to Handle SCD

Design and implementation guidelines
• For large dimension tables, change data capture techniques may be used to minimize the data volume
• For smaller dimension tables, compare all OLTP records with dimension table records
• Balance data volume with change data capture logic complexities

Page 79: Dimensional Modeling

79

Designing Loads to Handle SCD

Type 1 example: a woman gets married

Customer Dimension Table

Column Name      SCD Type
Customer Key     N/A
Customer ID      1
Name             1
Marital Status   1
Home Income      1

Page 80: Dimensional Modeling

80

Type 1 Example

Before (as of 1/31/01)

OLTP
Customer OLTP
CustID   Name        Marital Status   Home Income
123      Sue Jones   S                $30K

Star Schema
Customer Dim
CustKey   CustID   Name        Marital Status   Home Income   Status
1         123      Sue Jones   S                $30K          0

Day Dim
DayKey   Business Date
1        1/31/01

Sales Facts
CustKey   DayKey   Sales
1         1        $40

After: Sue Gets Married 2/1/01

OLTP
Customer OLTP
CustID   Name        Marital Status   Home Income
123      Sue Smith   M                $60K

Star Schema
Customer Dim
CustKey   CustID   Name        Marital Status   Home Income   Status
1         123      Sue Smith   M                $60K          0

Day Dim
DayKey   Business Date
1        1/31/01
2        2/01/01

Sales Facts
CustKey   DayKey   Sales
1         1        $40
1         2        $50

Page 81: Dimensional Modeling

81

Type 1 Example

Observations
• Customer history is not maintained in the OLTP system
• Customer history is not maintained in the star schema
• Sue only has one customer 'profile' in customer dimension table
• Sue’s sales facts across all history are associated with her married profile
• The association between Sue’s earlier sales facts and her single profile has been lost
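A Type 1 load can be sketched as an in-place overwrite of the matching dimension row. The function name `apply_type1` and the lower-case column names are hypothetical; the values follow the Sue Jones example.

```python
customer_dim = [
    {"cust_key": 1, "cust_id": 123, "name": "Sue Jones",
     "marital_status": "S", "home_income": "$30K"},
]

def apply_type1(dim: list, source_row: dict) -> dict:
    """Overwrite the existing dimension row in place; no history is kept."""
    for row in dim:
        if row["cust_id"] == source_row["cust_id"]:
            row.update(source_row)   # cust_key is untouched, so facts stay attached
            return row
    raise KeyError(f"customer {source_row['cust_id']} not found in dimension")

# Changed record arriving from the OLTP system after the marriage.
apply_type1(customer_dim, {"cust_id": 123, "name": "Sue Smith",
                           "marital_status": "M", "home_income": "$60K"})
print(customer_dim)
```

Because the synthetic key does not change, every existing and future fact row for cust_key 1 now joins to the married profile.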

Page 82: Dimensional Modeling

82

Designing Loads to Handle SCD

Type 2 example: a woman gets married

Customer Dimension Table

Column Name      SCD Type
Customer Key     N/A
Customer ID      2
Name             2
Marital Status   2
Home Income      1

Page 83: Dimensional Modeling

83

Type 2 Example

Before (as of 1/31/01)

OLTP
Customer OLTP
CustID   Name        Marital Status   Home Income
123      Sue Jones   S                $30K

Star Schema
Customer Dim
CustKey   CustID   Name        Marital Status   Home Income   Status
1         123      Sue Jones   S                $30K          0

Day Dim
DayKey   Business Date
1        1/31/01

Sales Facts
CustKey   DayKey   Sales
1         1        $40

After: Sue Gets Married 2/1/01

OLTP
Customer OLTP
CustID   Name        Marital Status   Home Income
123      Sue Smith   M                $60K

Star Schema
Customer Dim
CustKey   CustID   Name        Marital Status   Home Income   Status
1         123      Sue Jones   S                $30K          1
2         123      Sue Smith   M                $60K          0

Day Dim
DayKey   Business Date
1        1/31/01
2        2/01/01

Sales Facts
CustKey   DayKey   Sales
1         1        $40
2         2        $50

Page 84: Dimensional Modeling

84

Type 2 Example

Type 2 Observations
• Customer history is not maintained in the OLTP system
• Customer history is maintained in the star schema
• Sue has two 'profiles' in the customer dimension
• Sue’s sales facts may be analyzed for when she was single, when she was married, and across all history by using the customer id field
• Home income was updated in the new profile record
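A Type 2 load can be sketched as "flag the old profile, insert a new one with a new synthetic key". The function name `apply_type2` is hypothetical, and reading Status 0 as "current" and 1 as "superseded" is an assumption inferred from the example tables, which do not define the column.

```python
customer_dim = [
    {"cust_key": 1, "cust_id": 123, "name": "Sue Jones",
     "marital_status": "S", "home_income": "$30K", "status": 0},
]
# (cust_key, day_key, sales) rows from the example
sales_facts = [(1, 1, 40)]

def apply_type2(dim: list, source_row: dict) -> dict:
    """Keep the old profile, flag it as superseded, and add a new profile row."""
    for row in dim:
        if row["cust_id"] == source_row["cust_id"] and row["status"] == 0:
            row["status"] = 1                      # assumed meaning: no longer current
    new_key = max((r["cust_key"] for r in dim), default=0) + 1
    new_row = {"cust_key": new_key, **source_row, "status": 0}
    dim.append(new_row)
    return new_row

married = apply_type2(customer_dim, {"cust_id": 123, "name": "Sue Smith",
                                     "marital_status": "M", "home_income": "$60K"})
sales_facts.append((married["cust_key"], 2, 50))   # new facts use the new key

# Analysis across all history goes through the customer id, not the key.
sue_keys = {r["cust_key"] for r in customer_dim if r["cust_id"] == 123}
total_for_sue = sum(sales for key, _, sales in sales_facts if key in sue_keys)
print(customer_dim)
print(total_for_sue)  # 90
```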

Page 85: Dimensional Modeling

85

Slowly Changing Dimension Advice

'When in doubt, design type 2'

Page 86: Dimensional Modeling

86

Rapidly Changing Dimension (RCD)

• Values change rapidly over time.
• There is no yardstick for telling whether a dimension is slowly changing or not; this is based on the judgment of the data modeler.
• An SCD may become an RCD over time, or vice versa.

Page 87: Dimensional Modeling

87

Large Dimensions

Dimensions containing several million records!!!

HOW TO SUPPORT?
• The database should support indexing technology that enables rapid browsing
• Find and suppress duplicate entries in the dimension (e.g. name and address matching)
• Never use Type 2 to solve changing dimensions (adding records)

Page 88: Dimensional Modeling

88

Rapidly Changing Monster Dimensions

Dimensions containing > 100 million records!!!

HOW TO SUPPORT?
• Break the monster dimension into separate dimension tables
• Constant information in original table
• New dimension table can have discrete values for each attribute
• Choose pre-defined set of values per attribute

Page 89: Dimensional Modeling

89

Indexing

Bitmap Indexes on the foreign key columns in the fact tables.

Bitmap Indexes on low cardinality columns in dimensional tables like Month, Product Category, Store category, etc…

B-Tree Indexes on Dimension key columns.

Page 90: Dimensional Modeling

90

Rapidly Changing Monster Dimensions

• Build the data in this dimension with all possible combinations of values for each attribute
• Identify each combination uniquely
• Every time an event occurs and is recorded in the fact table, attach it to the unique combination ID.
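Building the new dimension table from every combination of pre-defined attribute values can be sketched with `itertools.product`. The attribute names and value bands are hypothetical.

```python
from itertools import product

# Hypothetical pre-defined (banded) values for the rapidly changing attributes.
age_band = ["18-25", "26-40", "41+"]
income_band = ["<30K", "30-60K", ">60K"]
marital_status = ["S", "M"]

# One row per possible combination, each with a unique combination ID.
combinations = {
    combo_id: combo
    for combo_id, combo in enumerate(product(age_band, income_band, marital_status), start=1)
}
lookup = {combo: combo_id for combo_id, combo in combinations.items()}

# When an event is recorded, the fact row carries the matching combination ID.
fact_combo_id = lookup[("26-40", ">60K", "M")]
print(len(combinations))  # 18
print(fact_combo_id)
```

Because the table is populated once with all 3 * 3 * 2 = 18 combinations, a customer whose attributes change simply points to a different existing row and no new dimension rows are needed.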

Page 93: Dimensional Modeling

93

Degenerate Dimensions

• Dimensions with no other place to go
• Stored in the fact table
• Are not facts
• Common examples include invoice numbers or order numbers

Page 94: Dimensional Modeling

94

Example Degenerate Dimension

Page 95: Dimensional Modeling

95

Junk/Dirty Dimension

A convenient grouping of random flags and attributes.

After carving out all the dimensions, some flags or text attributes are left over in the fact table that do not belong to any of the dimension tables.

Page 96: Dimensional Modeling

96

Junk/Dirty Dimension

Alternatives to be avoided:
• Leaving the flags and attributes unchanged in the fact table record
• Making each flag and attribute into its own separate dimension
• Stripping out all of these flags and attributes from the design

Make a convenient grouping of the flags and attributes to get them out of a fact table into a useful dimensional framework.

Page 97: Dimensional Modeling

97

Drilling

[Figure: Quarterly Auto Sales Summary showing Units Sold and Revenue by Region (Northeast, Southeast, Central, Northwest, Southwest), drilled down to Region and State: Northeast (Maine, New York, Massachusetts) and Southeast (Florida, Georgia, Virginia)]

Drilling down
• Adding dimensional detail
• Further breaks out a measure in some way

Page 98: Dimensional Modeling

98

Drilling

[Figure: Quarterly Auto Sales Summary by Region and State, rolled up to Units Sold and Revenue by Region only]

Rolling up
• Removing dimensional detail
• Rolls up a measure

Page 99: Dimensional Modeling

99

Drilling

Drilling across
• A query that involves more than one fact table
• Not necessarily an action that changes how a user is looking at the data
• Best resolved by multiple SQL passes
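The multi-pass approach can be sketched as one query per fact table followed by a merge on the shared dimension attribute. The two dictionaries stand in for the result sets of two separate SQL passes; treating sales and shipments as separate fact tables is an assumption, and the values are taken from the Coffee Maker Fulfillment Report.

```python
# Pass 1: result of a query against a sales fact table, grouped by product.
sales_pass = {"Standard Coffee Maker": 5000, "Thermal Coffee Maker": 2400,
              "Deluxe Coffee Maker": 2073}

# Pass 2: result of a query against a shipments fact table, grouped by product.
shipments_pass = {"Standard Coffee Maker": 3800, "Thermal Coffee Maker": 1632,
                  "Deluxe Coffee Maker": 1658}

# Merge the passes on the common dimension attribute (product).
report = {}
for product_name in sorted(set(sales_pass) | set(shipments_pass)):
    sold = sales_pass.get(product_name, 0)
    shipped = shipments_pass.get(product_name, 0)
    report[product_name] = {
        "units_sold": sold,
        "units_shipped": shipped,
        "pct_shipped": round(100 * shipped / sold) if sold else None,
    }

for name, values in report.items():
    print(name, values)
```

Querying each fact table separately avoids joining two fact tables directly, which would multiply rows whenever the grains differ.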

Page 101: Dimensional Modeling

101

Dimensional Design Process

Project Context

Page 102: Dimensional Modeling

102

Data Mart Development

[Figure: project phases: Design Phase, Development Phase, Deployment Phase]

Dimensional modeling is a critical part of the data mart development effort

Page 103: Dimensional Modeling

103

Data Mart Development

Design phase
• Determine requirements and design schema

Development phase
• Iterative build and feedback

Deployment phase
• Automate load, document, train users

Page 104: Dimensional Modeling

104

Project Deliverables

Design
• Project definition document
• Project plan
• Schema design
• Mapping document
• Report design

Development
• Populated data mart
• Load routines (Sagent “Plans”)
• Query and reporting environment

Deployment
• Automation
• Documentation
• Training materials

Page 105: Dimensional Modeling

105

Project Approach

[Figure: project phases: Design Phase, Development Phase, Deployment Phase]

The dimensional model is developed during the design stage

Scope of the project has already been determined

Page 106: Dimensional Modeling

106

Design Stage Activities

[Figure: project phases: Design Phase, Development Phase, Deployment Phase]

• Gather requirements through requirements workshops
• Develop star schema
• Conduct design review

Page 107: Dimensional Modeling

107

Gather Requirements

Requirements definition
• User workshops
• Spreadsheets
• Sample reports

Source systems analysis
• DBA interviews
• Copybooks
• E/R diagrams

Page 108: Dimensional Modeling

108

Design Deliverables

Deliverables
• The star schema itself
• Load mapping document

How these primary components are delivered will depend on needs and format chosen
• Modeling tools
• Spreadsheets
• Text documents

Page 109: Dimensional Modeling

109

Notation Example

IDEF1X
• Dependent entities - fact tables
• Independent entities - dimension tables

[Figure: Sales Facts (time_key, model_key, dealer_key) related to Time (time_key), Model (model_key), and Dealer (dealer_key)]

Page 110: Dimensional Modeling

110

Notation Example

Martin IE
• Entities - fact or dimension tables
• Attributes not shown

[Figure: Sales Facts related to Time, Dealer, and Model]

Page 111: Dimensional Modeling

111

Notation Example

Kimball
• Simple structure
• Cardinality implied

[Figure: Sales Facts (time_key, model_key, dealer_key) related to Time (time_key), Model (model_key), and Dealer (dealer_key)]

Page 112: Dimensional Modeling

112

Design Naming Standards

Responsibility of data administration
• Extended to the data warehouse
• Important to start early in the project

Suggested conventions
• Fact tables
• Dimension tables
• Aggregate tables
• Keys

Page 113: Dimensional Modeling

113

Data Element Definitions

Clear descriptions Facts

Calculated formulae

Dimensional attributes

Multiple meanings/synonymous terms

Aliases

Page 114: Dimensional Modeling

114

Data Element Instances

Example of Data

As it will exist in the warehouse

After decoding

Adds to model understanding

Removes ambiguity/uncertainty

Page 115: Dimensional Modeling

115

Data Element Mapping

Where is the data coming from

Source system

Table

Column

Record

Field

Page 116: Dimensional Modeling

116

Data Transformation

Changing the data

Serves as spec for ETL process

Decodes

Type conversion

Conditional logic

Handling of NULLs

Page 118: Dimensional Modeling

118

Aggregate Schemas

Page 119: Dimensional Modeling

119

Aggregate Designs

Aggregates
• Pre-stored fact summaries
• Along one or more dimensions
• The most effective tool for improving performance

Examples
• Summary of sales by region, by product, by category
• Monthly sales
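Pre-storing a monthly summary of daily facts can be sketched as follows. The base rows are hypothetical, and deriving the month from the first seven characters of an ISO date stands in for a lookup through the time dimension.

```python
from collections import defaultdict

# Hypothetical base-level facts at daily grain: (sale_date, product, amount)
daily_sales = [
    ("1997-01-15", "A", 100.0),
    ("1997-01-16", "A", 50.0),
    ("1997-02-01", "A", 25.0),
    ("1997-01-15", "B", 10.0),
]

# Build the aggregate once at load time, summarizing along the time dimension.
monthly_sales_agg = defaultdict(float)
for sale_date, product_name, amount in daily_sales:
    month = sale_date[:7]                      # "YYYY-MM"
    monthly_sales_agg[(month, product_name)] += amount

monthly_sales_agg = dict(monthly_sales_agg)
print(monthly_sales_agg)
```

A monthly report can then read three pre-computed rows instead of scanning every daily row, which is the performance benefit the slide describes. Only additive facts such as amount can be summarized this way.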

Page 120: Dimensional Modeling

120

Aggregate Background

Aggregate rationale
• Improve end user query performance
• Reduce required CPU cycles
• Powerful cost saving tool

Restrictions
• Additive facts only
• Must use dimensional design

Page 121: Dimensional Modeling

121

Aggregate Guidelines

Don’t start with aggregates

Design and build based on usage

Sooner or later you'll need to build aggregates

Page 122: Dimensional Modeling

122

Aggregate Types

Separate Tables
• Separate fact table for every aggregate
• Separate dimension table for every aggregate dimension
• Same number of fact records as level field tables

Advantage
• Removes possibility of double counting
• Schema clarity

Caveat
• Requires software with aggregate navigation capability

Page 123: Dimensional Modeling

123

Separate Tables

One Way Aggregate

Mthly Sales Facts Agg: month_key, product_key, market_key, Quantity, Amount
Month: month_key, Year, Fiscal Period, Month

Sales Facts: time_key, product_key, market_key, Quantity, Amount
Time: time_key, Year, Fiscal Period, Month, Day, Day of Week
Product: product_key, Category, Brand, Product, Diet Indicator
Market: market_key, Region, District, State, City

Page 124: Dimensional Modeling

124

Separate Tables: Two-Way Aggregate

Mnthly Cat Sales Facts Agg: month_key, category_key, market_key, Quantity, Amount

Sales Facts: time_key, product_key, market_key, Quantity, Amount

Product: product_key, Category, Brand, Product, Diet Indicator

Category: category_key, Category

Month: month_key, Year, Fiscal Period, Month

Market: market_key, Region, District, State, City

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Page 125: Dimensional Modeling

125

Aggregate Pitfalls

Sparsity failure

Term used to describe the result of building too many aggregate fact tables that do not summarize enough rows.

When sparsity failure occurs, a relatively small star schema can grow to thousands of times its original disk size.

Sparsity failure = aggregate explosion

Page 126: Dimensional Modeling

126

Aggregate Design Guidelines

Rule of twenty

To avoid aggregate explosion

Make sure each aggregate record summarizes 20 or more lower-level records

Remember

Total number of possible fact tables in any given dimensional model = Cartesian product of all levels in all the dimensions
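To see how quickly the number of candidate fact tables grows, the Cartesian product can be enumerated directly. The level names below are assumptions chosen for illustration (each hierarchy ends in an "all" level); only the counting technique is the point.

```python
from itertools import product

# Hypothetical hierarchy levels per dimension, lowest grain first.
levels = {
    "time": ["date", "month", "quarter", "year", "all"],
    "product": ["product", "brand", "category", "all"],
    "market": ["city", "state", "district", "region", "all"],
}

# Every combination of one level per dimension is a possible fact table.
combinations = list(product(*levels.values()))

possible_fact_tables = len(combinations)         # 5 * 4 * 5
possible_aggregates = possible_fact_tables - 1   # exclude the base-grain table
```

With just three dimensions this already gives 100 possible tables, 99 of them aggregates, which is why aggregates should be chosen based on usage rather than built exhaustively.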

Page 127: Dimensional Modeling

127

Hierarchies & Aggregate Design

Hierarchy diagram

Helps visualize options for building aggregates

Adding cardinalities helps ensure the rule of 20 is followed

Not required to build initial star schema

Time hierarchy (count per year; total over 5 years)

Year (1): 5 years

Quarter (4): 20 quarters

Month (12): 60 months

Date (365): 1825 days
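The cardinalities on the Time hierarchy can be turned into a rough rule-of-twenty check. This sketch assumes the ratio of level cardinalities approximates how many lower-level records each aggregate record summarizes; in a sparse fact table the real ratio can be lower, so treat the result as an upper bound. The function names are hypothetical.

```python
RULE_OF_TWENTY = 20

# Cardinalities from the Time hierarchy over 5 years.
time_cardinality = {"date": 1825, "month": 60, "quarter": 20, "year": 5}


def compression(base_level: str, agg_level: str, cardinality=time_cardinality) -> float:
    """Approximate number of base-level members rolled into one aggregate member."""
    return cardinality[base_level] / cardinality[agg_level]


def passes_rule(base_level: str, agg_level: str, cardinality=time_cardinality,
                threshold: int = RULE_OF_TWENTY) -> bool:
    """True when the aggregate level summarizes at least `threshold` base members."""
    return compression(base_level, agg_level, cardinality) >= threshold
```

Rolling dates up to months gives roughly 30 dates per month, which passes; rolling months up to quarters gives only 3 months per quarter, which does not.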

Page 128: Dimensional Modeling

128

Aggregate Navigation

Description

Function provided by software layer: Aggregate Navigator

Directs user queries to the most favorable available aggregate

Transparent to the end user
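One way to picture what an aggregate navigator does: given the attributes a query needs, pick the smallest table that can still answer it. The sketch below is a simplified illustration, not the algorithm of any particular product; the catalog contents, attribute names, and row counts are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactTable:
    name: str
    attributes: frozenset  # dimensional attributes reachable from this table
    row_count: int         # hypothetical size used to rank candidates


CATALOG = [
    FactTable("sales_facts",
              frozenset({"day", "month", "year", "product", "brand", "category",
                         "city", "state", "region"}), 1_000_000),
    FactTable("mthly_sales_facts_agg",
              frozenset({"month", "year", "product", "brand", "category",
                         "city", "state", "region"}), 40_000),
    FactTable("mnthly_cat_sales_facts_agg",
              frozenset({"month", "year", "category",
                         "city", "state", "region"}), 2_000),
]


def choose_table(requested, catalog=CATALOG) -> FactTable:
    """Return the smallest table whose attributes cover the requested ones."""
    candidates = [t for t in catalog if set(requested) <= t.attributes]
    if not candidates:
        raise ValueError(f"no table can answer a query on {sorted(requested)}")
    return min(candidates, key=lambda t: t.row_count)
```

A query by month and category is routed to the two-way aggregate, a query by month and brand to the one-way aggregate, and anything needing day-level detail falls back to the base fact table. The user's query never names the aggregate, which is what keeps navigation transparent.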

Page 129: Dimensional Modeling

129

Aggregate Framework

[Figure: Business View and Designer View]

Page 130: Dimensional Modeling

130

Aggregate Architecture

[Figure: three configurations of Client PC, Application Server, and RDBMS, showing where SQL becomes aggregate-aware SQL]

Page 131: Dimensional Modeling

131

Aggregate Deployment

Incremental

Based on usage

Transparent to users

Typically warehouse DBA responsibility

Page 132: Dimensional Modeling

132

Aggregate Deployment

Build Subject Area 1 (no aggregates)

Build Subject Area 2 (no aggregates)

Build aggregates for Subject Area 1

Build Subject Area 3 (no aggregates)

Build aggregates for Subject Area 2

Build Subject Area 4 (no aggregates)

Build aggregates for Subject Area 3

Some re-work required

Page 133: Dimensional Modeling

133

Exercise 3

Scenario

Given the original star schema and the following hierarchy, design a two-way aggregate table structure that will drastically increase performance

Make your own assumptions about summary levels

Page 134: Dimensional Modeling

134

Exercise 3 - Dimensional Model

Model: model_key, category, line, model

Sales Facts: model_key, dealer_key, time_key, revenue, quantity

Time: time_key, year, quarter, month, date

Dealer: dealer_key, region, state, city, dealer

Page 135: Dimensional Modeling

135

Exercise 3 Scenario

Industry: Automobile manufacturing

Company: Millennium Motors

Value chain focus: Sales

Sample business questions:

What are the top 10 selling car models this month?

How do this month's top 10 selling models compare to the top 10 over the last six months?

Show me dealer sales by region by model by day

What is the total number of cars sold by month by dealer by state?

Page 136: Dimensional Modeling

136

Exercise 3

Millennium Motors' dimensions

Model: All > Category > Line > Model name

Time: All > Year > Quarter > Month > Date

Dealer: All > Region > State > City > Dealer name

Cardinalities shown on the diagram: 5, 50, 1000, 1000, 40, 10, 20, 20, 60, 1825, 5

Page 137: Dimensional Modeling

137

Exercise 3 Worksheet

Page 138: Dimensional Modeling

138

Exercise 3 Solution

Model: model_key, category, line, model

Sales Facts: model_key, dealer_key, time_key, revenue, quantity

Time: time_key, year, quarter, month, date

Dealer: dealer_key, region, state, city, dealer

Month: month_key, year, quarter, month

State: state_key, region, state

Agg Sales Facts: state_key, month_key, model_key, revenue, quantity

Page 140: Dimensional Modeling

140

Multiple Fact Tables

Page 141: Dimensional Modeling

141

Multiple Fact Tables

Different business processes usually require different fact tables

There are also several cases where a single business process will require multiple fact tables

Core and custom

Snapshot and transaction

Coverage

Aggregates

Page 142: Dimensional Modeling

142

Different Business Processes

Different business processes usually require different fact tables

In practice, it may be hard to identify what a “process” is

Sometimes you can spot different processes because measures are recorded

With different dimensions

At differing grains

Page 143: Dimensional Modeling

143

Different Dimensions or Grain

Shipment Facts: time_key, product_key, shipper_key, market_key, Quantity, Weight

Sales Facts: time_key, product_key, market_key, Quantity, Amount

Product: product_key, Category, Brand, Product, Diet Indicator

Shipper: shipper_key, name, type, mode, address

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Market: market_key, Region, District, State, City

Page 144: Dimensional Modeling

144

Different Dimensions or Grain

Don’t take shortcuts with grain

The 'not applicable' dimension value

Using a 'not applicable' row in a dimension confuses the grain and can introduce reporting difficulty

Page 145: Dimensional Modeling

145

Different Points in Time

Sometimes, it is not easy to identify the discrete business processes

All measures may have the same dimensionality or grain

Different measures are recorded at different times

Quantity sold is not recorded at the same time as quantity shipped

Page 146: Dimensional Modeling

146

Different Timing

Building a single fact table would require recording zero or null for measures that are not applicable at a point in time

Reports would contain a confusing combination of zeros, nulls, and absence of data

Page 147: Dimensional Modeling

147

Different Timing - One Fact Table

Sales and Shipment Facts: time_key, product_key, market_key, Quantity_sold, Amount_sold, Quantity_shipped, Amount_shipped

Diagram annotation: Initially will be null

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Market: market_key, Region, District, State, City

Product: product_key, Category, Brand, Product, Diet Indicator

Page 148: Dimensional Modeling

148

Different Timing - Two Fact Tables

Shipment Facts: time_key, product_key, market_key, Quantity, Amount

Sales Facts: time_key, product_key, market_key, Quantity, Amount

Product: product_key, Category, Brand, Product, Diet Indicator

Market: market_key, Region, District, State, City

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Page 149: Dimensional Modeling

149

Identifying Different Processes

Look at the measures in question

Sort them into fact tables based on

Dimensions

Grain

Differing timings of events measured

Page 150: Dimensional Modeling

150

One Process, Multiple Fact Tables

Core and custom

Coverage

Snapshot and transaction

Aggregates

Page 151: Dimensional Modeling

151

Core and Custom Schemas

There is a set of dimension attributes and measures shared in all cases

Depending on the value in a dimension, certain extra dimension attributes or measures are recorded

Heterogeneous products

Types of customers

Page 152: Dimensional Modeling

152

Core and Custom

Account Facts: time_key, product_key, branch_key, customer_key, Balance, Transaction_count

Checking Account Facts: time_key, checking_key, branch_key, customer_key, Balance, Transaction_count, ... custom checking facts

Checking Account: checking_key, ... custom checking attributes

Product: product_key, ...

Customer: customer_key, ...

Time: time_key, ...

Branch: branch_key, ...

Page 153: Dimensional Modeling

153

Core and Custom

Core fact table and dimensions

All attributes shared no matter what

Appropriate for analysis across entire subject area

Custom fact table and/or dimensions

Contain attributes specific to a particular dimension value (e.g. “Checking”)

Only appropriate when the business question is limited to that particular dimension value

Should repeat shared facts to minimize need to access two fact tables

Page 154: Dimensional Modeling

154

Coverage Schema

A star schema usually measures events that happen

Relationships between the dimensions involved are not captured if events do not happen

A coverage table fills the gap

What did not sell that was on promotion?

Who was assigned to that customer?

Usually “factless”
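The question "What did not sell that was on promotion?" amounts to taking the coverage rows and removing those that also appear in the sales facts. A minimal sketch with Python sets follows; the key values and the `(time_key, product_key)` tuple layout are assumptions for illustration.

```python
# Factless coverage table: which products were on promotion on which day.
# Each row is (time_key, product_key); there are no measures.
promotion_coverage = {(1, 101), (1, 102), (1, 103)}

# Sales facts record only events that happened: (time_key, product_key, quantity).
sales_facts = [(1, 101, 5), (1, 103, 2), (1, 200, 7)]

# Reduce the sales facts to the same key combination as the coverage table.
sold = {(time_key, product_key) for time_key, product_key, _ in sales_facts}

# Promoted but never sold: present in coverage, absent from sales.
promoted_but_unsold = sorted(promotion_coverage - sold)
```

Without the coverage table there would be no row anywhere recording that product 102 was on promotion, so the sales facts alone could not answer the question.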

Page 155: Dimensional Modeling

155

Measuring What Happened

Sales Facts: time_key, product_key, customer_key, rep_key, quantity, sales_dollars

Product: product_key, Category, Brand, Product, SKU

Customer: customer_key, Name, Company, Account, Phone_num

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Sales_rep: rep_key, rep_name, rep_phone, Region, District, State, City

Sales Facts does not reveal who is assigned to a customer if they do not sell

Page 156: Dimensional Modeling

156

Coverage Table

Customer_coverage_facts shows who is assigned to a customer at a point in time

Customer Coverage Facts: time_key, customer_key, rep_key

Customer: customer_key, Name, Company, Account, Phone_num

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Sales_rep: rep_key, rep_name, rep_phone, Region, District, State, City

Page 157: Dimensional Modeling

157

Snapshot and Transaction

Viewing a single process multiple ways

Transactions

The changes to what is being measured

Snapshot

The status at a point in time

Example

Changes to inventory

Current status of inventory
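The relationship between the two views can be sketched by deriving daily snapshot rows from a list of inventory transactions. This is an illustration under stated assumptions: transactions are `(day, product, location, signed_quantity)` tuples, inventory starts at zero, and the function name `build_snapshots` is hypothetical.

```python
from collections import defaultdict


def build_snapshots(transactions, days):
    """Accumulate signed transaction quantities into per-day on-hand rows."""
    by_day = defaultdict(list)
    for day, product, location, quantity in transactions:
        by_day[day].append(((product, location), quantity))

    on_hand = defaultdict(int)   # running balance per (product, location)
    snapshots = []
    for day in days:
        for key, quantity in by_day.get(day, []):
            on_hand[key] += quantity
        # A snapshot row is written for every item on record,
        # whether or not it had a transaction that day.
        for (product, location), quantity in sorted(on_hand.items()):
            snapshots.append((day, product, location, quantity))
    return snapshots


transactions = [
    (1, "P1", "W1", 100),   # receipt
    (2, "P1", "W1", -30),   # shipment
    (2, "P2", "W1", 50),    # receipt
    (3, "P1", "W1", -10),   # shipment
]
snapshots = build_snapshots(transactions, days=[1, 2, 3])
```

Note that product P2 has a snapshot row on day 3 even though nothing happened to it that day. The transaction view answers "how did inventory change?", while the snapshot view answers "how much is on hand?" without summing history at query time.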

Page 158: Dimensional Modeling

158

Snapshot

How much is on hand today?

How much was on hand yesterday?

Inventory Snapshot: time_key, product_key, location_key, quantity_on_hand

Product: product_key, Category, Brand, Product, SKU

Location: location_key, Warehouse, WH_code, City, State

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Page 159: Dimensional Modeling

159

Transaction

How did inventory change today?

How much product was returned due to failed inspection?

Inventory Transactions: time_key, product_key, location_key, transaction_type_key, transaction_amount

Product: product_key, Category, Brand, Product, SKU

Location: location_key, Warehouse, WH_code, City, State

Time: time_key, Year, Fiscal Period, Month, Day, Day of Week

Transaction_type: transaction_type_key, transaction_type_code, transaction_type, transaction_category

Page 160: Dimensional Modeling

160

Aggregate Tables

Aggregate table

A fact table that summarizes another fact table

Created for performance reasons

Covered in previous section

Page 161: Dimensional Modeling

161

Design Tools for Multiple Tables

Create a set of matrices

Facts vs dimension

Facts vs dimensional attributes

Mark where facts apply to dimensions

Mark where facts apply to dimensional attributes

When facts don't apply, assume separate fact table
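The matrix technique can be mimicked in code: record which dimensions each fact applies to, then group facts that share exactly the same set. Each group is a candidate fact table. The fact and dimension names below are assumptions for illustration, loosely modelled on the sales and shipment example.

```python
from collections import defaultdict

# Facts vs dimensions matrix: each fact maps to the dimensions it applies to.
matrix = {
    "quantity_sold": {"time", "product", "market"},
    "amount_sold": {"time", "product", "market"},
    "quantity_shipped": {"time", "product", "market", "shipper"},
    "weight_shipped": {"time", "product", "market", "shipper"},
}


def group_facts(fact_matrix):
    """Group facts by identical dimensionality; one group per candidate fact table."""
    groups = defaultdict(list)
    for fact, dimensions in fact_matrix.items():
        groups[frozenset(dimensions)].append(fact)
    # Sort keys and values so the result is stable and easy to read.
    return {tuple(sorted(dims)): sorted(facts) for dims, facts in groups.items()}


candidate_tables = group_facts(matrix)
```

Here the shipper dimension does not apply to the sold measures, so the grouping yields two candidate fact tables rather than one. Dimensionality alone does not settle the design: measures recorded at different times may still need separate tables.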

Page 162: Dimensional Modeling

162

Bus Matrix

A Planning Methodology for Large Data Warehouses with multiple data marts or dimensional models.

Enables technical planning as well as executive communication.

Exceptionally effective for distributed data warehouses without a center.

Is simply a vertical list of data marts and a horizontal list of dimensions.

Page 163: Dimensional Modeling

163

Columns: Attribute 1, Attribute 2, Attribute 3, Attribute 4, Attribute 5, Attribute 6, Attribute 7, Attribute 8

Fact 1 X X X X

Fact 2 X X X X

Fact 3 X X X X X

Fact 4 X X X X X

Fact Table 1

Fact Table 2

Example Matrix

Fact vs dimensional attribute matrix

Page 164: Dimensional Modeling

164

Exercise 4

Scenario

Industry: Automobile manufacturing

Company: Millennium Motors

Value chain focus: Sales

Sample business questions:

What are the top 10 selling car models this month?

How do this month's top 10 selling models compare to the top 10 over the last six months?

Show me dealer sales by region by model by day.

How many cars have been purchased over the last six months by customers with yearly household incomes greater than $200,000?

Page 165: Dimensional Modeling

165

Exercise 4 - continued

Using these source data elements, design a star schema that answers the proposed business questions

Daily sales revenue, Daily quantity sold, Model, Dealer, Dealer city, Product line, Region where sold, State, Vehicle category, Date of sales

Customer name, Customer zip code, Customer yearly income, P.O. Number, Purchase price, Discount amount, Brand of car

Page 166: Dimensional Modeling

166

Exercise 4 - Worksheet

Page 167: Dimensional Modeling

167

Exercise 4 Solution - Matrix

facts

daily_sales X X X X X X X X X

daily_quantity X X X X X X X X X

purchase_price X X X X X X X X X X X X X

discount_amount X X X X X X X X X X X X X

Columns: Customer name, Customer zip code, Model, Customer income, Dealer, P.O. Number, Dealer city, Product line, Brand of car, Region where sold, State, Vehicle category, Date of sales

Page 168: Dimensional Modeling

168

Exercise 4 - Star Schema

Daily Sales Facts: model_key, dealer_key, time_key, revenue, quantity

Customer Sales Facts: model_key, dealer_key, time_key, customer_key, po_number, purchase_price, discount_amt

Customer: customer_key, customer_name, customer_zip, yearly_income

Model: model_key, brand, category, line, model

Time: time_key, year, quarter, month, date

Dealer: dealer_key, region, state, city, dealer

Page 170: Dimensional Modeling

170

Architected Data Marts

Page 171: Dimensional Modeling

171

Data Mart

Meaning of the term 'data mart' has shifted over the last several years...

Page 172: Dimensional Modeling

172

Data Mart Architecture 1993

[Figure components: Operational Systems, E.T.L. Software, Data Warehouse, E.T.L. Software, Data Marts, Query & Reporting Software, Analysis Users]

Page 173: Dimensional Modeling

173

Data Mart Architecture 1997

[Figure components: Operational Systems, E.T.L. Software, Data Marts, Query & Reporting Software, Analysis Users]

Page 174: Dimensional Modeling

174

Architected Data Marts

[Figure components: Operational Systems, E.T.L. Software, Data Warehouse containing Data Marts, Query & Reporting Software, Analysis Users]

Page 175: Dimensional Modeling

175

Data Mart

Warehouse Subject Area

Incremental warehouse development

Centralized architecture

Not new

Well-suited to star schemas

Page 176: Dimensional Modeling

176

“Stovepipe” Data Marts

“Stovepipe” data marts

Inconsistent and overlapping data

Difficult and costly to maintain

Redundant data load

Can’t drill across

Integration requires starting over

[Figure: three separate stars, Store Sales Facts, Shipments Facts, and Warehouse Inventory Facts, each with its own copies of dimensions such as Product, Time (Day), Warehouse, and Month. Dimensions not conformed]

Page 177: Dimensional Modeling

177

Conformed Dimensions

Definition

Dimensions are conformed when they are the same

-or-

When one dimension is a strict rollup of another

Page 178: Dimensional Modeling

178

Conformed Dimensions

Same dimensions must:

1. ... have exactly the same set of primary keys, and

2. ... have the same number of records

Page 179: Dimensional Modeling

179

Conformed Dimensions

Rolled up dimension

When one dimension is a strict rollup of another

Which means

Two conformed dimensions can be combined into a single logical dimension by creating a union of the attributes
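Both forms of conformance can be expressed as simple checks. The sketch below is one possible reading of the definitions, not a standard algorithm: "same" is tested on primary keys and record counts, and "strict rollup" is taken to mean that the rollup dimension holds exactly one row for each distinct combination of the shared attributes found in the detail dimension. Row layouts and function names are assumptions.

```python
def is_same_dimension(keys_a, keys_b) -> bool:
    """Same dimension: identical primary-key sets and the same number of records."""
    keys_a, keys_b = list(keys_a), list(keys_b)
    return set(keys_a) == set(keys_b) and len(keys_a) == len(keys_b)


def project(rows, attrs):
    """Reduce each row (a dict) to a tuple of the named attributes."""
    return [tuple(row[a] for a in attrs) for row in rows]


def is_strict_rollup(detail_rows, rollup_rows, shared_attrs) -> bool:
    """Rollup has exactly one row per distinct shared-attribute combination in detail."""
    rollup_values = project(rollup_rows, shared_attrs)
    no_duplicates = len(rollup_values) == len(set(rollup_values))
    return no_duplicates and set(rollup_values) == set(project(detail_rows, shared_attrs))


# Hypothetical Time (day grain) and Month dimensions.
time_rows = [
    {"time_key": 1, "year": 2024, "month": 1, "day": 5},
    {"time_key": 2, "year": 2024, "month": 1, "day": 20},
    {"time_key": 3, "year": 2024, "month": 2, "day": 3},
]
month_rows = [
    {"month_key": 202401, "year": 2024, "month": 1},
    {"month_key": 202402, "year": 2024, "month": 2},
]

month_conforms = is_strict_rollup(time_rows, month_rows, ("year", "month"))
```

A Month dimension missing February, or carrying two rows for January, would fail the check and so could not be safely combined with Time when drilling across fact tables.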

Page 180: Dimensional Modeling

180

Conformed Dimensions

Description

Shared common dimensions

Integrates logical design

Ensures consistency between data marts

Allows incremental development

Independent of physical location

Some re-work may be required

Page 181: Dimensional Modeling

181

Conformed Dimensions

Advantages

Enables an incremental development approach

Easier and cheaper to maintain

Drastically reduces extraction and loading complexity

Answers business questions that cross data marts

Supports both centralized and distributed architectures

Page 182: Dimensional Modeling

182

Conformed Dimensions

Interlocking Star Schemas

[Figure: Sales Facts, Shipment Facts, and Inventory Facts sharing the Store, Product, Time, Warehouse, and Month dimensions]

Page 183: Dimensional Modeling

183

Kimball’s Data Warehouse Bus

Columns: Store, Product, Day, Warehouse, Month

Rows: Sales Facts, Shipment Facts, Inventory Facts

Page 184: Dimensional Modeling

184

When to Conform

Two approaches

Up-front

As-you-go

Both approaches work

Choose the approach that works for you

Page 185: Dimensional Modeling

185

Conform Up Front

Cross Enterprise Analysis

Create First-Cut Stars, All Subject Areas

Conform all Dimensions

Finalize Design & Build Subject Area 1

Finalize Design & Build Subject Area 2

Finalize Design & Build Subject Area 3

Page 186: Dimensional Modeling

186

Conform As-You-Go

Design & Build Subject Area 1

Design & Build Subject Area 2

Conform Dimensions

Design & Build Subject Area 3

Conform Dimensions

Design & Build Subject Area 4

Conform Dimensions

Some re-work required

Page 188: Dimensional Modeling

188

Course Review

Rationale for dimensional modeling

Dimensional modeling basics

Dimensional modeling details

Fact table details

Dimension table details

Design process

Aggregate schemas

Multiple fact tables

Architected data marts