lecture 1 - course introduction & dimensional modeling (1)

Page 1: Lecture 1 - Course Introduction & Dimensional Modeling (1)

Professor: John Shantz

Lecture 1

Carnegie Mellon University

Pittsburgh, PA

DATA WAREHOUSING

Page 2: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLASS AGENDA

• Introductions and background

• Syllabus & Course expectations

• Data Warehousing Basics

• Dimensional Modeling Introduction

• Retail Sales Case Study

• Dimensional Modeling Exercise

• Project / Assignments

Page 3: Lecture 1 - Course Introduction & Dimensional Modeling (1)

MY BACKGROUND

• President, Data Warehouse Consultants LLC – a Pittsburgh-focused database and data warehouse consulting company

• Started company in 2004 to focus solely on data warehousing consulting opportunities

• Former Deloitte Consulting Manager – one of the founding members of Deloitte’s DW public sector practice.

• 15+ years of experience in design, development and implementation of data warehouse projects

• Successful implementation of many different data warehouses in various businesses

• Master of Business Administration from Tepper School of Business and EE undergraduate degree from Penn State University

Page 5: Lecture 1 - Course Introduction & Dimensional Modeling (1)

SYLLABUS REVIEW

• Textbook

  • Ralph Kimball and Margy Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling (Third Edition). ISBN: 1-118-53080-2

Page 6: Lecture 1 - Course Introduction & Dimensional Modeling (1)

BLACKBOARD

• Blackboard will be used as the class website.

• Lecture slides, handouts and announcements will be posted via the website.

Page 7: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE GOALS

• Understand the basic components of a data warehouse

• Design a data warehouse based on user requirements

• Create a prototype data warehouse using established principles discussed in class

• This will essentially be a project course and most work will revolve around your group’s project.

Page 8: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE GRADING

• Grading Criteria

  • Quizzes (2) 30%

  • Project Requirements & Design 15%

  • Project Presentation 15%

  • Course Project 30%

  • Class Participation 10%

• Due to the project nature of this course and the required group work, Pass/Fail grading will not be permitted.

• This course is also not available for official Audit credit for the same reasons.

Page 9: Lecture 1 - Course Introduction & Dimensional Modeling (1)

QUIZZES

• Two scheduled in-class quizzes

• Focused on key principles of data warehousing discussed in class and in the handouts

• Scheduled for Week 3 and Week 6 (subject to change)

• If you miss a quiz you will receive a zero for that score unless you make alternate arrangements with me IN ADVANCE. No make-ups or alternate arrangements will be made after the quiz is given.

• Alternate arrangements are not guaranteed, however, and are made solely at my discretion based on the individual student’s circumstances.

Page 10: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE PROJECT

• Once the basics of data warehousing have been covered, the course project will become the focal point of the class

• Groups will consist of 3-4 members (depending on class size) and be assigned randomly by me

• The project is due the final week of class. I strongly encourage you to begin the project in week 3 once your groups have been assigned.

• Every group member should know the subject and goals of his/her project. Failure to be knowledgeable on your group’s activities will negatively affect your evaluation.

Page 11: Lecture 1 - Course Introduction & Dimensional Modeling (1)

REQUIREMENTS & DESIGN

• The first deliverable for your group project is the requirements and design document.

• This will describe your design and serves as your plan for how you intend to build your data warehouse.

• It will be due during Week 5 of class.

Page 12: Lecture 1 - Course Introduction & Dimensional Modeling (1)

GROUP PRESENTATION

• Each group will be required to present a short synopsis on their project during the final class. Presentations will be approximately 10 to 12 minutes in length.

• All group members are strongly encouraged to be present for your group’s presentation

• The final presentations will be during the last day of class.

Page 13: Lecture 1 - Course Introduction & Dimensional Modeling (1)

GROUP PRESENTATION (CONT.)

• During each group’s presentation, another group will be chosen to provide a critique when the presentation is complete.

• Specifically, we will be looking for opinions on items that were done well and aspects that could be improved.

• An effective critique will contribute to your own team’s score for that group.

Page 14: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE PROJECT

• The final project is due the last week of class

• Detailed instructions of what is required for the project will be posted later in the year

• Every group member should know the subject and goals of his/her project. Failure to be knowledgeable on your group’s activities will negatively affect your evaluation.

Page 15: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE PROJECT – SOFTWARE

It is important to note that dimensional data modeling and data warehousing are software independent. They can be done correctly in almost any type of database.

To maintain grading consistency, however, I require the project to be done in Microsoft SQL Server 2012. This software is available from the Heinz computing center.

Further instructions on how to submit your project using the above software will be posted with the course project assignment.

Page 16: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLASS EXPECTATIONS – SOFTWARE

• To complete the labs that I have planned, you must install Microsoft SQL Server 2012. Be sure to install SQL Server Analysis Services (SSAS) and SQL Server Integration Services (SSIS) when you install the software. This is also the software I recommend you use to complete your project. The Express editions available for download from Microsoft may not include these components, which can be a problem.

• The software and installation instructions for both software packages are available from Heinz Computing Services.

• WORK ON SOFTWARE INSTALLATIONS EARLY AS THEY DO NOT GO SMOOTHLY SOMETIMES.

Page 17: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE ENROLLMENT & WAITLIST

• The enrollment size of this course is limited due to the time required in-class to complete the presentations. It is simply not possible to extend the enrollment beyond the current size.

• If you are currently on the waitlist and are very interested in pursuing the course, I suggest coming for the first two weeks to see who adds/drops the course. Historically, 2-5 students get in because of late drops.

• If you do not get into the course, please see your department representatives for your options. I don’t have the ability to increase the enrollment because of the structure of the course.

• Right now the course is being offered again next mini, and there is another section this mini.

Page 19: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATA WAREHOUSE DEFINITIONS

• A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making [Inmon, 1992]

• Typically a database that is maintained separately from the organization’s operational databases.

Page 20: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATA WAREHOUSE VS. DATA MARTS

• Enterprise Warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.

  • Requires extensive business modeling

  • May take years to design and build

• Data Mart: a logical and physical subset of a data warehouse; in its most simplistic form, represents data from a single business process (e.g., retail sales, retail inventory, purchase orders) [Kimball, 2002]

  • Faster roll out, but complex integration in the long run

Page 21: Lecture 1 - Course Introduction & Dimensional Modeling (1)

REAL-WORLD DEFINITIONS

Whatever the business says it is…

• Decision Support System

• Data Analysis Environment

• Access database

• Analytic Workspace

The distinctions really break down:

• Data mart

• Data warehouse

• “Cube”

• Reporting system

Ask four experts and you’ll get five definitions.

Everyone needs a data warehouse!

Page 22: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATA WAREHOUSING MARKET

• Includes hardware, database software, and tools

• Multi-billion dollar market segment

• A maturing market - warehouses deployed in virtually every industry:

  • manufacturing (e.g., order shipment)

  • financial (e.g., claims analysis, fraud detection)

  • retail (e.g., user profiling, inventory management)

  • transportation (e.g., fleet management)

  • telecommunications (e.g., call analysis)

  • utilities (e.g., power usage analysis)

  • healthcare (e.g., cost and treatment outcomes)

Page 23: Lecture 1 - Course Introduction & Dimensional Modeling (1)

WHY USE A SEPARATE SYSTEM?

Performance

• Operational databases are designed and tuned for known transactions and workloads.

• Complex decision-support queries would degrade performance for operational transactions.

• Special data organization, access and implementation methods needed for multidimensional views and queries.

Page 24: Lecture 1 - Course Introduction & Dimensional Modeling (1)

WHY USE A SEPARATE SYSTEM?

Function

• Historical Data: Decision support requires historical data, which operational databases do not typically maintain.

• Data Consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.

• Data Quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
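
To make the data quality point concrete, here is a minimal Python sketch of reconciling two sources before loading. The record layouts, the `STATE_CODES` lookup, and the `standardize` helper are hypothetical illustrations, not part of any particular ETL tool.

```python
from datetime import datetime

# Hypothetical extracts: two source systems encode the same state and
# date attributes in different ways.
SOURCE_1 = [{"cust": "A17", "state": "PA", "joined": "2015-03-01"}]
SOURCE_2 = [{"cust": "B99", "state": "Pennsylvania", "joined": "03/15/2015"}]

# Assumed lookup table mapping every known spelling to one standard code.
STATE_CODES = {"pennsylvania": "PA", "pa": "PA", "ohio": "OH", "oh": "OH"}


def standardize(record, date_format):
    """Return a copy of the record with one state code and ISO dates."""
    return {
        "cust": record["cust"],
        "state": STATE_CODES[record["state"].strip().lower()],
        "joined": datetime.strptime(record["joined"], date_format).date().isoformat(),
    }


# Consolidate both sources into a single, consistently coded list.
consolidated = (
    [standardize(r, "%Y-%m-%d") for r in SOURCE_1]
    + [standardize(r, "%m/%d/%Y") for r in SOURCE_2]
)
print(consolidated)
```

In a real warehouse this reconciliation would normally live in the ETL layer, with unknown codes routed to an error table rather than raising an exception.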

Page 25: Lecture 1 - Course Introduction & Dimensional Modeling (1)

WHY USE A SEPARATE SYSTEM?

Goal

• Goal of transactional system is to capture data quickly. Users have to use the system to do their jobs.

• Goal of data warehouse is to help the organization run better. Most data warehouses are valuable, but not absolutely necessary (until someone uses one to eat your lunch).

Target Audience

• Helps to maintain “one version of the truth”.

Page 26: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATA WAREHOUSE COMPONENTS

• Data Warehouse Database Server

  • Almost always a relational DBMS

  • All major database companies now have offerings

• OLAP Servers

  • Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.

  • Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.

• Tools or Clients

  • Extraction, Transformation and Load tools

  • Query and reporting tools

  • Analysis tools

  • Data mining tools (e.g., trend analysis, prediction)

Page 27: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATA WAREHOUSING TOOLS

• Database Servers

  • Oracle

  • Microsoft SQL Server

  • Sybase

  • IBM DB2

  • Microsoft Access

• ETL Tools

  • Informatica PowerCenter

  • Ascential DataStage (IBM)

  • Hyperion Application Link

  • Oracle PL/SQL

  • Microsoft Data Transformation Services

  • Many Others

Page 28: Lecture 1 - Course Introduction & Dimensional Modeling (1)

SAMPLE DATA WAREHOUSE ARCHITECTURE

[Figure: sample data warehouse architecture. Stage labels: Extraction, Transformation & Load, User View. Component labels: Source System 1, Source System 2, Source System 1 Extracts, Source System 2 Extracts, Transformation Logic, Operational Data Store (ODS) or Staging Area, Data Warehouse, Analytical Data Filter, Data marts, OLAP Cubes, Standard Reports, Metadata.]

Page 29: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COMPARISON OF LEADING TOOL SUITES

Columns: Cognos Series 7 | IBM Cognos 8 BI | Business Objects Enterprise | MS SQL Server BI | Comments

SQL Server Analysis Services (SSAS) SQL Server Management Studio

Cognos PowerPlay Web Analysis Studio Web Intelligence ? Web-based multidimensional analysis

Live Office

Xcelsius

Crystal Reports

Web Intelligence

Cognos Web Portal Cognos Connection InfoView Performance Point Server Web Portal

GO! Dashboard Performance Manager

Report Studio Xcelsius

Dashboard Builder

DecisionStream Data Manager Data Integrator (BODI) SQL Server Integration Services (SSIS)

ETL and data integration tool

PowerPlay Transformer

Impromptu Administrator

Performance Manager

Dashboard Manager

NoticeCast Event Studio BusinessObjects Enterprise XI

? Business activity monitoring

Planning Planning

Controller Controller

Business Planning and Consolidation

? Planning application

Framework Manager Designer Business Intelligence Development Studio

Modeling application

Metrics Manager Metrics Studio Business Intelligence Development Studio

Scorecarding

Impromptu ReportNet (Report Studio + Query Studio)

SQL Server Reporting Services (SSRS)

Reporting

Visualizer Business Intelligence Development Studio

Visual dashboards

Cognos PowerPlay PowerPlay (BI Mobile Analysis)

Desktop Intelligence / OLAP Intelligence

Desktop query and analysis tool

PowerPlay Excel IBM Cognos 8 BI Analysis for Microsoft Excel

Microsoft Excel Spreadsheet integration

http://www.bi-dw.info/cognos-bo-sqlserver.htm

Page 31: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING

• Transactional systems

  • Designed to allow for quick transactional processing and efficient storing of data.

  • To accomplish this, designers typically use some type of normalization. Most strive for “Third Normal Form”.

• Analytical systems

  • Designed to extract and query data quickly

  • Access speed is the main concern

  • Hence, normalization, which is widely used for transactional databases, is generally not appropriate for data warehouse design

• Design should reflect multidimensional view

• This is called a dimensional model or “star schema”

Page 32: Lecture 1 - Course Introduction & Dimensional Modeling (1)

SAMPLE DATA WAREHOUSE ARCHITECTURE

[Figure: the sample data warehouse architecture diagram from Page 28, shown again.]

Page 33: Lecture 1 - Course Introduction & Dimensional Modeling (1)

THE PROBLEM

Transactional models, while efficient for transaction processing, are not good for analytics

How do we determine the average grade in biology for CMU in a given semester?

Page 34: Lecture 1 - Course Introduction & Dimensional Modeling (1)

THE SOLUTION

Organize the data so it can be pulled out more efficiently.

The number of students can be counted by a simple aggregate query based on the fact table.
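
As a minimal sketch of this idea, the following uses Python's built-in sqlite3 module with a hypothetical star schema. The names `grade_fact`, `course_dim`, and `semester_dim`, and all sample rows, are illustrative assumptions rather than the schema shown on the slide. It shows how the average-grade question from the previous slide reduces to one aggregate query over the fact table.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course_dim   (course_key INTEGER PRIMARY KEY, subject TEXT, title TEXT);
CREATE TABLE semester_dim (semester_key INTEGER PRIMARY KEY, semester_name TEXT);
CREATE TABLE grade_fact (
    student_id   INTEGER,
    course_key   INTEGER REFERENCES course_dim(course_key),
    semester_key INTEGER REFERENCES semester_dim(semester_key),
    grade_points REAL
);
INSERT INTO course_dim VALUES (1, 'Biology', 'Intro Biology'), (2, 'History', 'World History');
INSERT INTO semester_dim VALUES (1, 'Fall'), (2, 'Spring');
INSERT INTO grade_fact VALUES
    (101, 1, 1, 4.0), (102, 1, 1, 3.0), (103, 1, 1, 2.0),
    (101, 2, 1, 3.5), (104, 1, 2, 1.0);
""")


def average_grade(subject, semester):
    """Return (student_count, average_grade) for one subject in one semester."""
    return conn.execute(
        """
        SELECT COUNT(*), AVG(f.grade_points)
        FROM grade_fact f
        JOIN course_dim c   ON f.course_key = c.course_key
        JOIN semester_dim s ON f.semester_key = s.semester_key
        WHERE c.subject = ? AND s.semester_name = ?
        """,
        (subject, semester),
    ).fetchone()


student_count, avg_grade = average_grade("Biology", "Fall")
print(student_count, avg_grade)
```

Both the count and the average come from the same scan of the fact table; the dimensions only supply the filter values.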

Page 35: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING – COMPONENTS

Fact Table

• Primary table which stores the performance measurements of the business

• The term “fact” refers to a business measure

• Each row in a fact table corresponds to a specific measurement

• Each measurement is taken at the intersection of all the relevant dimensions (e.g., day, product, and store) – this list of dimensions defines the “grain” of the fact table

• All measurements in a fact table must be at the same grain

• Facts are either additive, semiadditive, or nonadditive – most are numeric

• Contains two or more foreign keys to dimension tables

• Expresses the many-to-many relationships between dimensions in dimensional models

Page 36: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING – COMPONENTS

Dimension Tables

• Contain the textual descriptors of the business

• Usually low in cardinality, but very wide (50-100 attributes not uncommon)

• Dimension attributes used as query constraints, groupings, and report labels

• The more descriptive the dimension attributes, the better

• Often contain hierarchical relationships (city=>state=>region)

Page 37: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING – COMPONENTS

• Fact Table + Dimension Tables = Dimensional Model (Star Schema)

• Benefits of dimensional model

  • Simplicity

  • Easy for business users to understand

  • Improved query performance

  • Extensibility

  • Easily accommodates change (but not that easily!)

Page 38: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING PROCESS

Consists of four main steps:

1. Select the business process to model

2. Declare the grain of the business process

3. Choose the dimensions that apply to each fact table row

4. Identify the facts

Dimensional modeling is part science, and part art…

Page 39: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLASS AGENDA

• Introductions and Background

• Syllabus & Course expectations

• Data Warehousing Basics

• Dimensional Modeling Introduction

• Retail Sales Case Study

• Project / Assignments

• Dimensional Modeling Exercise

Page 40: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

Large grocery chain:

• 100 grocery stores spread over 5-state area

• Each store has the following departments: grocery, frozen foods, dairy, meat, produce, bakery, floral, and health/beauty aids

• Each store has 60K individual products or stock keeping units (SKUs)

• Each individual product is assigned and labeled with an SKU, regardless of whether it is produced externally or internally

• When a purchase occurs, the bar code is scanned into the point of sale (POS) system

• Pricing and promotion decisions represent an especially interesting aspect of the business…

Page 41: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

• Profit comes from charging as much as possible for each product, lowering acquisition costs/overhead, and attracting as many customers as possible

• Promotions used to attract customers, and include temporary price reductions (TPRs), newspaper ads, in-store displays, and coupons

• Large increases in volume can be created by dramatic price reductions

• e.g., a 50-cent reduction in price of paper towels, especially when coupled with an ad and a display, can cause sale of paper towels to jump by a factor of 10

• However, such huge price reductions are not sustainable, since goods are likely being sold at a loss

• Thus, impact of promotions is an important part of the analysis of operations in the grocery store

Page 42: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

1. Select the business process to model:

• Requires an understanding of both business requirements and available data

• Management wants to better understand customer purchases as captured by the POS system

• The chosen business process we will model is POS retail sales

Page 43: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

2. Declare the grain of the business process:

• Specify exactly what an individual fact table row represents – the grain conveys the level of detail associated with fact table measurements

• It is highly recommended to choose the most granular or atomic information captured by the business process

• Why?

• The grain that we will use in our retail example, which is the most granular data available, is an individual line item on a POS transaction

Page 44: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

3. Choose the dimensions:

• Primary dimensions determined from grain:

  • Date

  • Product

  • Store

• We also want to be able to see the effects of promotions on each sale:

  • Add a Promotion dimension

Page 45: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

4. Identify the facts:

Facts collected by POS:

• Sales quantity

• Revenue or Sales dollar amount (sales quantity * unit price)

• Cost dollar amount

Page 46: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

• We are also interested in gross margin:

  • Gross margin = gross profit / sales dollar amount, where

  • Gross profit = sales dollar amount – cost dollar amount

• Should we choose to store gross profit or gross margin as a fact?

  • Gross profit is additive across all dimensions, whereas gross margin is nonadditive

  • A fact is additive if we can sum the fact across all dimensions and obtain a valid and correct number

  • A fact is nonadditive if the summation of the fact across any dimension results in a meaningless, nonsensical number

  • A fact is semiadditive if it is additive across some dimensions and nonadditive across other dimensions

Page 47: Lecture 1 - Course Introduction & Dimensional Modeling (1)

GROSS MARGIN EXAMPLE

• Product A

  • Price is $10, Cost is $5

  • Gross Profit is $5

  • Gross Margin is 50% (5 / 10)

• Product B

  • Price is $100, Cost is $90

  • Gross Profit is $10

  • Gross Margin is 10% (10 / 100)

• Assume we sell one of each product

  • Is the gross margin 60% (10% + 50%)?

  • What is the gross margin for both products?

Page 48: Lecture 1 - Course Introduction & Dimensional Modeling (1)

GROSS MARGIN EXAMPLE

• Assume we sell one of each product

  • Revenue is additive: $100 + $10 = $110

  • Gross Profit is additive: $10 + $5 = $15

  • Gross Margin is not additive: ($10 + $5) / ($100 + $10) = 13.6%

  • This value could not be calculated only from the gross margin on each individual product or transaction
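
The arithmetic above can be checked with a few lines of Python. This is only a sketch using the two products from the example; the variable names are illustrative.

```python
# (price, cost) for one unit sold of each product, taken from the example.
sales = {"A": (10.0, 5.0), "B": (100.0, 90.0)}

# Additive facts: summing across products gives a meaningful total.
revenue = sum(price for price, _cost in sales.values())
gross_profit = sum(price - cost for price, cost in sales.values())

# Nonadditive fact: a ratio computed per product.
per_product_margin = {
    name: (price - cost) / price for name, (price, cost) in sales.items()
}

# Summing the ratios gives 60%, which is meaningless.
wrong_margin = sum(per_product_margin.values())

# The correct combined margin is recomputed from the additive facts.
correct_margin = gross_profit / revenue

print(f"revenue={revenue}, gross_profit={gross_profit}")
print(f"sum of margins={wrong_margin:.1%}, actual margin={correct_margin:.1%}")
```

This is why the fact table stores sales dollars and gross profit, and leaves the margin to be calculated at query time from their sums.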

Page 49: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

Preliminary Star Schema

Page 50: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

Query Example – What products have the highest gross margin in January?

SELECT F.SKU,
       SUM(F.GrossProfit) AS GrossProfit,
       SUM(F.GrossProfit) / SUM(F.SalesDollars) AS GrossMargin
FROM RetailSalesTransactionFact F
INNER JOIN DateDim D
    ON F.TransactionCalendarDate = D.CalendarDate
WHERE D.MonthName = 'January'
GROUP BY F.SKU
ORDER BY GrossMargin DESC  -- highest gross margin first
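
To try the query without a SQL Server installation, the sketch below runs it against an in-memory SQLite database from Python. The table and column names come from the query; the column types, the sample rows, and the added ORDER BY clause (so the highest margins come first) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed minimal versions of the two tables the query touches.
CREATE TABLE DateDim (CalendarDate TEXT PRIMARY KEY, MonthName TEXT);
CREATE TABLE RetailSalesTransactionFact (
    TransactionCalendarDate TEXT REFERENCES DateDim(CalendarDate),
    SKU          TEXT,
    SalesDollars REAL,
    GrossProfit  REAL
);
INSERT INTO DateDim VALUES ('2015-01-05', 'January'), ('2015-02-05', 'February');
INSERT INTO RetailSalesTransactionFact VALUES
    ('2015-01-05', 'A',  10.0,  5.0),
    ('2015-01-05', 'A',  10.0,  5.0),
    ('2015-01-05', 'B', 100.0, 10.0),
    ('2015-02-05', 'B', 100.0, 50.0);
""")

# Margin is computed from the summed additive facts, never by summing margins.
rows = conn.execute("""
    SELECT F.SKU,
           SUM(F.GrossProfit) AS GrossProfit,
           SUM(F.GrossProfit) / SUM(F.SalesDollars) AS GrossMargin
    FROM RetailSalesTransactionFact F
    INNER JOIN DateDim D
        ON F.TransactionCalendarDate = D.CalendarDate
    WHERE D.MonthName = 'January'
    GROUP BY F.SKU
    ORDER BY GrossMargin DESC
""").fetchall()

for sku, profit, margin in rows:
    print(sku, profit, round(margin, 3))
```

The February row for SKU B is filtered out by the Date dimension, so only January sales contribute to each margin.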

Page 51: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLOSER LOOK AT THE DIMENSIONS

Date Dimension

Example attributes:

• Date

• Full Date Description

• Month Number

• Month Name

• Month Short Name

• Day Number in Month

• Day of Week

• Day Number in Year

• Year

• Fiscal Quarter

• Fiscal Year

• Holiday Indicator

• First Day of Quarter Indicator

• Selling Season…

• Etc.

All data warehouses have a Date/Time dimension

It is possible to pre-populate Date dimension

Relatively small dimension table, e.g., 10 years of days is only about 3650 rows

Multiple hierarchies exist within Date dimension
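
Because every row can be derived from the calendar, the Date dimension can be generated ahead of time. The sketch below is one way to do that in Python; the function name, the attribute names, and the 2010 to 2019 range are assumptions, and organization-specific attributes (fiscal periods, holidays, selling seasons) are left out because they need business rules.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one dictionary per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date": day.isoformat(),
            "month_number": day.month,
            "month_name": day.strftime("%B"),   # locale-dependent name
            "day_number_in_month": day.day,
            "day_of_week": day.strftime("%A"),  # locale-dependent name
            "day_number_in_year": day.timetuple().tm_yday,
            "year": day.year,
            "first_day_of_quarter": day.day == 1 and day.month in (1, 4, 7, 10),
        })
        day += timedelta(days=1)
    return rows


# Ten calendar years; the count includes two leap days (2012 and 2016).
date_dim = build_date_dimension(date(2010, 1, 1), date(2019, 12, 31))
print(len(date_dim))
```

Month, quarter, and year can each be rolled up from these rows, which is where the multiple hierarchies come from.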

Page 52: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DATE DIMENSION EXAMPLE

Page 53: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLOSER LOOK AT THE DIMENSIONS

Product Dimension

Example attributes:

• SKU Number (Natural Key)

• UPC

• Product Description

• Brand Description

• Category Description

• Department Description

• Package Type Description

• Package Size

• Fat Content

• Diet Type

• Weight

• Weight Units of Measure

• Storage Type…

Recall there are about 60K SKUs

Product dimension will contain about 150K rows when accounting for different merchandising schemes across stores and historical products

Product hierarchy: SKU=>Brand=>Category=>Department

Page 54: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLOSER LOOK AT THE DIMENSIONS

Store Dimension

Example attributes:

• Store Number

• Store Name

• Store Street Address

• Store City

• Store County

• Store State

• Store Zip Code

• Store Manager

• Store District

• Store Region

• Floor Plan Type

• Selling Square Footage

• First Open Date…

Represents primary geographic dimension

Store hierarchies include:

• Store=>State

• Store=>District=>Region

Any number of different geography or sales hierarchies can exist in this dimension
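
Because each hierarchy level is just another attribute on the store row, rolling sales up to any level is a matter of grouping by that attribute. The following Python sketch illustrates this; the store rows, sales figures, and `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical Store dimension rows keyed by store key.
store_dim = {
    1: {"store_name": "Store 1", "state": "PA",
        "district": "Pittsburgh East", "region": "Mid-Atlantic"},
    2: {"store_name": "Store 2", "state": "PA",
        "district": "Pittsburgh West", "region": "Mid-Atlantic"},
    3: {"store_name": "Store 3", "state": "OH",
        "district": "Cleveland", "region": "Midwest"},
}

# Hypothetical fact rows: (store key, sales dollars).
sales_fact = [(1, 100.0), (2, 250.0), (3, 75.0), (1, 50.0)]


def roll_up(attribute):
    """Total sales dollars grouped by one Store dimension attribute."""
    totals = defaultdict(float)
    for store_key, sales_dollars in sales_fact:
        totals[store_dim[store_key][attribute]] += sales_dollars
    return dict(totals)


print(roll_up("state"))     # Store=>State hierarchy
print(roll_up("district"))  # Store=>District=>Region hierarchy, district level
print(roll_up("region"))    # same hierarchy, region level
```

Adding another hierarchy means adding another attribute to the dimension; the fact rows do not change.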

Page 55: Lecture 1 - Course Introduction & Dimensional Modeling (1)

CLOSER LOOK AT THE DIMENSIONS

Promotion Dimension

Example attributes:

• Promotion Code

• Promotion Name

• Price Reduction Type

• Promotion Media Type

• Ad Type

• Display Type

• Coupon Type

• Ad Media Name

• Display Provider

• Promotion Cost

• Promotion Begin Date

• Promotion End Date…

A causal dimension – it describes factors believed to cause a change in product sales

Useful in determining whether a promotion is effective, e.g.:

– Whether products under promotion experienced a gain in sales during promotional period

– Whether cannibalization occurred

The different promotion types are highly correlated, e.g., TPR, ad, coupon, and display often occur together

Page 56: Lecture 1 - Course Introduction & Dimensional Modeling (1)

RETAIL CASE STUDY

Assume the chain now switches POS systems and must renumber its SKUs and store numbers.

Does anyone see a problem with this model?

Page 57: Lecture 1 - Course Introduction & Dimensional Modeling (1)

SURROGATE KEYS

• It is highly recommended to use surrogate keys for dimension table keys

• Surrogate keys are simply integers assigned sequentially to a particular dimension row

• Operational codes (e.g., SKU number) can still be retained for analysis purposes

• Benefits of surrogate keys:

  • Buffer the data warehouse from changes in operational codes

  • Can save space due to their small size compared to operational codes

  • Allow recording of conditions which do not have an operational code (e.g., “No Promotion”)

  • Allow handling of changes to dimension table attributes (to be discussed later)

• The main disadvantage of using surrogate keys is that it requires some effort to implement

Always, always, always use surrogate keys!
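
A minimal Python sketch of the idea follows. The `DimensionKeyMap` class, the SKU and promotion codes, and the fact rows are hypothetical; in practice the key assignment would normally happen in the ETL process or through an identity column in the database.

```python
from itertools import count


class DimensionKeyMap:
    """Assigns sequential integer surrogate keys to operational (natural) keys."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural = {}

    def key_for(self, natural_key):
        """Return the surrogate key, assigning the next integer if unseen."""
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next_key)
        return self._by_natural[natural_key]

    def renumber(self, old_natural, new_natural):
        """The operational code changed; the surrogate key stays the same."""
        self._by_natural[new_natural] = self._by_natural.pop(old_natural)


product_keys = DimensionKeyMap()
promotion_keys = DimensionKeyMap()

# A row for the "No Promotion" condition, which has no operational code.
no_promotion_key = promotion_keys.key_for(None)

# Fact rows store only surrogate keys.
fact_rows = [
    {"product_key": product_keys.key_for("SKU-1001"),
     "promotion_key": no_promotion_key, "sales_quantity": 2},
    {"product_key": product_keys.key_for("SKU-2002"),
     "promotion_key": promotion_keys.key_for("TPR-7"), "sales_quantity": 1},
]

# The new POS system renumbers the SKU; existing fact rows need no update.
product_keys.renumber("SKU-1001", "NEW-0001")
print(product_keys.key_for("NEW-0001") == fact_rows[0]["product_key"])
```

Only the dimension row's operational code changes after the renumbering, which is the buffering benefit listed above.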

Page 58: Lecture 1 - Course Introduction & Dimensional Modeling (1)

STAR SCHEMA WITH SURROGATE KEYS

Page 59: Lecture 1 - Course Introduction & Dimensional Modeling (1)

STAR SCHEMA SIZE ANALYSIS

Product Dimension

• 150,000 products x 1 KB per row = 150 MB

Date Dimension

• 3,650 dates (10 years) x 1 KB per row ≈ 3.5 MB

Store Dimension

• 100 stores x 2 KB per row = 0.2 MB

Promotion Dimension

• 5,000 promotions x 1 KB per row = 5 MB

Total Dimensions = 158.7 MB

Page 60: Lecture 1 - Course Introduction & Dimensional Modeling (1)

STAR SCHEMA SIZE ANALYSIS

Fact Table

• Assume 10,000 transactions per day per store

• 10,000 purchases x 3,650 days x 10 products per purchase x 1 promotion per purchase

• 365,000,000 records x 1 KB per record = 365 GB

Total Size = Fact + Dimensions

Total Size = 365,000 MB + 158.7 MB = 365.2 GB

Sizing rule – when calculating size, the size of the dimension tables can usually be ignored.
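
The sizing arithmetic can be reproduced in a short Python sketch. It assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB) and follows the fact table multiplication exactly as written on the slide, which has no per-store factor. With these units the date dimension comes to 3.65 MB, so the dimension total is 158.85 MB rather than the rounded 158.7 MB.

```python
KB_PER_MB = 1_000  # decimal units, assumed
MB_PER_GB = 1_000

# Dimension name -> (row count, KB per row), using the slide's figures.
dimension_rows = {
    "product": (150_000, 1),
    "date": (3_650, 1),
    "store": (100, 2),
    "promotion": (5_000, 1),
}
dimension_mb = sum(rows * kb / KB_PER_MB for rows, kb in dimension_rows.values())

# Purchases per day x days x products per purchase x promotions per purchase.
fact_rows = 10_000 * 3_650 * 10 * 1
fact_mb = fact_rows * 1 / KB_PER_MB  # 1 KB per fact record

total_gb = (fact_mb + dimension_mb) / MB_PER_GB
dimension_share = dimension_mb / (fact_mb + dimension_mb)

print(f"dimensions: {dimension_mb:.2f} MB")
print(f"fact table: {fact_mb / MB_PER_GB:.0f} GB ({fact_rows:,} rows)")
print(f"total: {total_gb:.1f} GB, dimensions are {dimension_share:.3%} of it")
```

The dimensions account for well under a tenth of a percent of the total, which is what the sizing rule relies on.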

Page 62: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE PROJECT

• Project groups will be assigned by me after the late drop period next week.

• Groups will consist of 4-5 members

Page 63: Lecture 1 - Course Introduction & Dimensional Modeling (1)

COURSE PROJECT

• It’s not too early to begin thinking about your topic for the course project.

• What you will need

  • A business objective (real or plausible)

  • A source of data

  • An interest in the topic (this could be important!)

Page 64: Lecture 1 - Course Introduction & Dimensional Modeling (1)

READING ASSIGNMENTS

• Kimball – Chapters 1, 2 and 3

• Chaudhuri and Dayal, “An Overview of Data Warehousing and OLAP Technology”, Sections 1-7 (available on Blackboard)

Page 66: Lecture 1 - Course Introduction & Dimensional Modeling (1)

DIMENSIONAL MODELING DESIGN EXERCISE