mis2502: data analytics - temple mis...mar 09, 2018  · dimensional data modeling ... could you...

Post on 13-Oct-2020


MIS2502: Data Analytics
Dimensional Data Modeling

JaeHwuen Jung
jaejung@temple.edu

http://community.mis.temple.edu/jaejung

Where we are…

Transactional Database: stores real-time transactional data

Analytical Data Store: stores historical transactional and summary data

Data entry → Data transformation → Data analysis

Now we’re here…

Comparing Transactional and Analytical Data Stores

Transactional Database

• Based on Relational paradigm

• Storage of real-time transactional data

• Optimized for storage efficiency and data integrity

• Supports day-to-day operations

Analytical Data Store

• Based on Dimensional paradigm

• Storage of historical transactional data

• Optimized for data retrieval and summarization

• Supports periodic and on-demand analysis

Some terminology

Data Warehouse

• Takes many forms

• Really is just a repository for historical data

Data Mart

• Subset of the Data Warehouse

• Designed for specific analysis

Data Cube

• Organization of data as a “multidimensional matrix”

• Implementation of a Data Mart

The Actual Process

[Diagram: Transactional Database 1, Transactional Database 2, and Other Sources pass through ETL steps into the Analytical Data Store, which contains the Data Warehouse, the Data Marts (Finance, Sales, Inventory), and the Data Cube.]

How they all relate

The data in the transactional database…

…is put into a data warehouse…

…which feeds the data mart…

…and is analyzed as a cube.

We’ll start here.
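
Before turning to the cube, the flow described above (transactional database, then data warehouse, then data mart) can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout, the field names (`qty`, `unit_price`, and so on), and every sample value are invented here and do not come from the course materials.

```python
from collections import defaultdict

# Hypothetical rows as they might sit in a transactional database:
# one row per individual sale (all values invented for illustration).
transactions = [
    {"date": "2013-01-05", "store": "Ardmore, PA", "product": "M&Ms", "qty": 2, "unit_price": 1.50},
    {"date": "2013-01-20", "store": "Ardmore, PA", "product": "M&Ms", "qty": 1, "unit_price": 1.50},
    {"date": "2013-02-02", "store": "Temple Main", "product": "Doritos", "qty": 3, "unit_price": 2.00},
]

def extract(rows):
    """Extract: read the raw rows from the source system."""
    return list(rows)

def transform(rows):
    """Transform: derive the month and the total price for each sale."""
    return [
        {
            "month": r["date"][:7],  # "2013-01-05" -> "2013-01"
            "store": r["store"],
            "product": r["product"],
            "quantity": r["qty"],
            "total_price": r["qty"] * r["unit_price"],
        }
        for r in rows
    ]

def load(rows, warehouse):
    """Load: append the cleaned rows to the historical store."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(transactions)), [])

# A "sales" data mart: monthly totals per product and store.
sales_mart = defaultdict(lambda: {"quantity": 0, "total_price": 0.0})
for r in warehouse:
    cell = sales_mart[(r["product"], r["store"], r["month"])]
    cell["quantity"] += r["quantity"]
    cell["total_price"] += r["total_price"]

print(sales_mart[("M&Ms", "Ardmore, PA", "2013-01")])
# {'quantity': 3, 'total_price': 4.5}
```

The warehouse keeps every cleaned row, while the mart holds only the summary needed for one kind of analysis.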

The Data Cube

• Core component of Analytical Data Store and Multidimensional Data Analysis

• Made up of “facts” and “dimensions”

[Figure: data cube with three dimensions. Product: M&Ms, Diet Coke, Doritos, Cheetos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2013, Feb. 2013, Mar. 2013. Each cell holds quantity & total price.]

Quantity sold and total price are measured facts. Why isn’t product price a measured fact?

The Data Cube

• Fact (what happened)

– Quantity

– Total price

– Number of promotions

– Etc.

• Dimension (attributes that describe what happened)

– When/where/what product was sold

[Figure: the same data cube, with Product, Store, and Time dimensions and quantity & total price in each cell.]
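
One way to picture facts and dimensions is a small Python sketch in which the cube is a dictionary: the key carries the dimensions and the value carries the measured facts. This is a simplification for illustration only; the dictionary layout and all of the numbers are assumptions, not data from the slides.

```python
# A data cube sketched as a dictionary: the key holds the dimensions
# (product, store, month) and the value holds the measured facts.
# All numbers are invented for illustration.
cube = {
    ("M&Ms", "Ardmore, PA", "Jan. 2013"): {"quantity": 120, "total_price": 180.0},
    ("M&Ms", "Ardmore, PA", "Feb. 2013"): {"quantity": 95, "total_price": 142.5},
    ("Cheetos", "Temple Main", "Jan. 2013"): {"quantity": 60, "total_price": 90.0},
}

# Dimensions answer "what, where, when"; facts answer "how much".
product, store, month = "M&Ms", "Ardmore, PA", "Jan. 2013"
facts = cube[(product, store, month)]
print(f"{product} at {store} in {month}: "
      f"{facts['quantity']} units, ${facts['total_price']:.2f}")
# M&Ms at Ardmore, PA in Jan. 2013: 120 units, $180.00
```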

The Data Cube

[Figure: data cube showing quantity only, with one cell highlighted. Product: M&Ms, Diet Coke, Doritos, Cheetos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2013, Feb. 2013, Mar. 2013.]

The highlighted element represents the quantity of M&Ms sold in Ardmore, PA in January 2013.

A single summary record representing a business event (monthly sales).

Slicing the Data

[Figure: the same data cube with a row of cells highlighted. Product: M&Ms, Diet Coke, Doritos, Cheetos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2013, Feb. 2013, Mar. 2013.]

The highlighted elements represent Cheetos sold on Temple’s Main campus from January to March 2013.

This is called “slicing the data.”
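
Slicing, in the sense used on this slide, can be sketched as filtering cube cells on the dimension values you care about. The `slice_cube` helper below is hypothetical, and the quantities are invented for illustration.

```python
# Hypothetical cube cells keyed by (product, store, month); values are quantities sold.
cube = {
    ("Cheetos", "Temple Main", "Jan. 2013"): 40,
    ("Cheetos", "Temple Main", "Feb. 2013"): 35,
    ("Cheetos", "Temple Main", "Mar. 2013"): 50,
    ("Cheetos", "Ardmore, PA", "Jan. 2013"): 20,
    ("M&Ms", "Temple Main", "Jan. 2013"): 70,
}

def slice_cube(cube, product=None, store=None, month=None):
    """Keep only the cells that match every dimension value that was given."""
    wanted = (product, store, month)
    return {
        key: value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }

cheetos_at_main = slice_cube(cube, product="Cheetos", store="Temple Main")
print(sorted(cheetos_at_main))
print(sum(cheetos_at_main.values()))  # 40 + 35 + 50 = 125
```

Leaving a dimension as `None` means "all values", so the same helper can also return, for example, every product and store for a single month.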

[Figure: the same data cube with two sets of cells highlighted, one in orange and one in purple. Product: M&Ms, Diet Coke, Doritos, Cheetos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2013, Feb. 2013, Mar. 2013.]

What do the orange highlighted elements represent?

What do the purple highlighted elements represent?

Slicing the Data

Could you have a data mart with five dimensions?

Then why does our example (and most others) only have three?

Modeling a data cube:

Designing the Star Schema

Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)

1. Choose the business process

2. Decide on the level of granularity

3. Identify the dimensions

4. Identify the fact

http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/four-4-step-design-process/

Choose the business process

• Business processes are the operational activities performed by your organization

• Determined by the questions you want to answer about your organization

Question | Business Process

Who is my best customer? | Sales

What are my highest selling products? | Sales

Which supplier is offering us the best deals? | Purchasing

Which types of promotions work the best? | Marketing

Note that a “business process” is not always about business.

Decide on the level of granularity

• Level of detail for each business process event

• Will determine the data in the dimensions

• Example: What are the highest selling products?

– Choices for time: yearly, quarterly, monthly, daily

– Choices for store: store, city, state

– Choices for product: product, category

How would you select the right granularity?
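
To make the granularity choice concrete, here is a minimal Python sketch that rolls daily records up to monthly grain. The record layout and the numbers are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical daily-grain records: (date, store, product, quantity).
daily = [
    ("2013-01-03", "Temple Main", "Cheetos", 4),
    ("2013-01-17", "Temple Main", "Cheetos", 6),
    ("2013-02-01", "Temple Main", "Cheetos", 5),
]

def roll_up_to_month(rows):
    """Aggregate daily rows to monthly grain by summing quantities."""
    monthly = defaultdict(int)
    for date, store, product, qty in rows:
        monthly[(date[:7], store, product)] += qty  # "2013-01-03" -> "2013-01"
    return dict(monthly)

monthly = roll_up_to_month(daily)
print(monthly)
```

Finer-grained data can always be rolled up like this, but the reverse does not work: once only the monthly totals are stored, the daily breakdown cannot be recovered from them.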

Identify the dimensions

• Description of the context of the business process

• The key elements of the process needed to answer the question (“fact”)

– who, what, where, when, why, and how

• Example: Sales transaction

– A “sale” is the fact

– Dimensions: product, store, and time

– Could this data mart tell you

• What is the best selling product?

• Who is the best customer?

Identify the fact

The fact table contains data called facts associated with the business process event

Keys

• Primary key for each event

• Foreign keys for the associated dimensions

• Example: Sales has Sales_ID as primary key, and Product_ID, Store_ID, and Time_ID as foreign keys

Facts: Measured, numeric data

• Facts: Quantifiable information for each business event – almost always numeric

• Describes a particular combination of dimensional data

• Example: Sales has quantity_sold and total_price.

Modeling a data cube: The Star Schema

Product (Dimension): Product_ID, Product_Name, Product_Price, Product_Weight

Store (Dimension): Store_ID, Store_Address, Store_City, Store_State, Store_Type

Time (Dimension): Time_ID, Day, Month, Year

Transactional databases aren’t built around dimensions

• They don’t map well to cubes

• They aren’t set up for summarization

So we build a star schema

• Built around “dimensions” and “facts”

• Simplified relational model

The star schema facilitates

• Aggregating individual transactions

• Creation of cubes

Sales (Fact): Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price

Fact Table

• Contain facts (numeric measurements) associated with a specific business process

• Contain foreign keys that refer to dimension tables

Sales (Fact): Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price

Dimension Tables

• Contain text and descriptive information

• Provide the “who, what, where, when, why, and how” context surrounding a business process event

Product (Dimension): Product_ID, Product_Name, Product_Price, Product_Weight

Store (Dimension): Store_ID, Store_Address, Store_City, Store_State, Store_Type

Time (Dimension): Time_ID, Day, Month, Year

Sales (Fact): Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price
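
The star schema above can be written out as table definitions. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names follow the slides, but the column types are assumptions, and `Quantity_Sold` and `Total_Price` are written with underscores so they work as plain column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context (what, where, when).
CREATE TABLE Product (
    Product_ID     INTEGER PRIMARY KEY,
    Product_Name   TEXT,
    Product_Price  REAL,
    Product_Weight REAL
);
CREATE TABLE Store (
    Store_ID      INTEGER PRIMARY KEY,
    Store_Address TEXT,
    Store_City    TEXT,
    Store_State   TEXT,
    Store_Type    TEXT
);
CREATE TABLE Time (
    Time_ID INTEGER PRIMARY KEY,
    Day     INTEGER,
    Month   INTEGER,
    Year    INTEGER
);
-- Fact table: one primary key, one foreign key per dimension, then the facts.
CREATE TABLE Sales (
    Sales_ID      INTEGER PRIMARY KEY,
    Product_ID    INTEGER REFERENCES Product(Product_ID),
    Store_ID      INTEGER REFERENCES Store(Store_ID),
    Time_ID       INTEGER REFERENCES Time(Time_ID),
    Quantity_Sold INTEGER,
    Total_Price   REAL
);
""")

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['Product', 'Sales', 'Store', 'Time']
```

The fact table sits in the middle and every dimension table hangs off it through a single foreign key, which is where the "star" shape comes from.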

From Star Schema to Data Cube

A Cube typically uses a Star Schema as its source and stores precomputed summarized (aggregated) data

Much more efficient, but can’t be changed (non-volatile)

[Figure: data cube with three dimensions. Product: M&Ms, Diet Coke, Doritos, Famous Amos. Store: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA. Time: Jan. 2012, Feb. 2012, Mar. 2012. Each cell holds quantity & total price.]
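
As a sketch of how a cube's cells could be precomputed from a star schema, the example below joins a cut-down version of the schema and aggregates the facts with `GROUP BY`, using Python's built-in `sqlite3` module. The tables are simplified to a few columns, and every row of data is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT);
CREATE TABLE Store   (Store_ID INTEGER PRIMARY KEY, Store_City TEXT);
CREATE TABLE Time    (Time_ID INTEGER PRIMARY KEY, Month INTEGER, Year INTEGER);
CREATE TABLE Sales   (Sales_ID INTEGER PRIMARY KEY, Product_ID INTEGER,
                      Store_ID INTEGER, Time_ID INTEGER,
                      Quantity_Sold INTEGER, Total_Price REAL);
INSERT INTO Product VALUES (1, 'M&Ms'), (2, 'Doritos');
INSERT INTO Store   VALUES (1, 'Ardmore'), (2, 'Cherry Hill');
INSERT INTO Time    VALUES (1, 1, 2012), (2, 2, 2012);
INSERT INTO Sales   VALUES
    (1, 1, 1, 1, 2, 3.0),
    (2, 1, 1, 1, 4, 6.0),
    (3, 2, 2, 2, 1, 2.5);
""")

# One output row per combination of dimension values = one cell of the cube.
rows = conn.execute("""
    SELECT p.Product_Name, s.Store_City, t.Year, t.Month,
           SUM(f.Quantity_Sold), SUM(f.Total_Price)
    FROM Sales AS f
    JOIN Product AS p ON p.Product_ID = f.Product_ID
    JOIN Store   AS s ON s.Store_ID   = f.Store_ID
    JOIN Time    AS t ON t.Time_ID    = f.Time_ID
    GROUP BY p.Product_Name, s.Store_City, t.Year, t.Month
""").fetchall()

# Store the precomputed summaries so later lookups need no further aggregation.
cube = {(name, city, year, month): (qty, total)
        for name, city, year, month, qty, total in rows}
print(cube[("M&Ms", "Ardmore", 2012, 1)])  # (6, 9.0)
```

Looking up a cell is then a single dictionary access, which is the speed benefit; the cost is that the stored summaries reflect only the grain and dimensions chosen in the query.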

Advantages of data cube

Speed

• Fast response to give you the information you have previously designed in the cube

Analysis

• The multi-dimensional data structure allows the data to be analyzed in the most logical way.

Data cube caveats

• The cube is “non-volatile,” so you’re locked in

– Measured facts

– Dimensions

– Granularity

• So choose wisely!

– For example: You can’t track daily sales if “date” is monthly

– So why not include every single sale and do no aggregation?

Pivot tables in Excel

• PivotTable is a data summarization tool in Excel

– the easiest way to learn multidimensional data and generate simple reports

• Data cubes can act as the data source for a PivotTable in Excel

Summary

• Data warehouse vs. data mart vs. data cube

• Star schema

• Kimball’s four step process for data mart design

• Pivot tables in Excel

ICA #9

• In ICA #9, we learned how to create a pivot table in Excel

– Identify which fields are assigned as VALUES and which ones are assigned as ROWS

– Identify the correct function for aggregation: e.g., SUM, COUNT, AVERAGE, MAX, MIN
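
The aggregation functions named above each reduce a group of values to one number. A rough Python equivalent, with invented order amounts, might look like this. It is not how Excel implements them, only an analogy.

```python
# Hypothetical order amounts that fall into one group of a pivot table.
order_amounts = [400, 1800, 1500, 600]

# Python counterparts of the pivot table aggregation functions.
# COUNT here simply counts every value in the group.
AGGREGATIONS = {
    "SUM": sum,
    "COUNT": len,
    "AVERAGE": lambda values: sum(values) / len(values),
    "MAX": max,
    "MIN": min,
}

summary = {name: func(order_amounts) for name, func in AGGREGATIONS.items()}
print(summary)
# {'SUM': 4300, 'COUNT': 4, 'AVERAGE': 1075.0, 'MAX': 1800, 'MIN': 400}
```

Which function is correct depends on the question: total revenue calls for SUM, number of orders for COUNT, and typical order size for AVERAGE.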

The star schema in ICA #9

• Measured Fact: Order amount

• Three dimensions: Salesperson, Country, and Time.

[Figure: data cube for ICA #9. Country: USA, UK. Salesperson: Bob Buchanan, Martha Suyama, Paul Peacock, Susan Leverling, …more salespeople. Time: 7/10/2008, 7/11/2008, 7/12/2008, 7/15/2008, 7/16/2008, 7/17/2008, …more time periods. Each cell holds order amount.]

Pivot Table and Data Cube

• The fields in the ROWS box correspond to dimensions in a data cube

• The fields in the VALUES box correspond to measured facts in a data cube

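Example 1: [Screenshot: pivot table with callouts labeled Dimension and Measured Fact]

Example 2: [Screenshot: pivot table with callouts labeled Dimensions and Measured Fact]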
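
The correspondence between pivot table boxes and cube concepts can be sketched in Python: the fields passed as `rows` play the role of dimensions, and the field passed as `value` is the measured fact. The `pivot` function is a hypothetical helper, and the records (including which salesperson belongs to which country) are invented for illustration.

```python
from collections import defaultdict

# Hypothetical order records in the spirit of ICA #9 (all values invented).
orders = [
    {"Salesperson": "Buchanan", "Country": "UK", "Order amount": 400},
    {"Salesperson": "Suyama", "Country": "UK", "Order amount": 250},
    {"Salesperson": "Peacock", "Country": "USA", "Order amount": 600},
    {"Salesperson": "Peacock", "Country": "USA", "Order amount": 150},
]

def pivot(records, rows, value, aggregate=sum):
    """Group records by the ROWS fields and aggregate the VALUES field."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in rows)
        groups[key].append(record[value])
    return {key: aggregate(values) for key, values in groups.items()}

# One dimension in ROWS, as in Example 1.
by_country = pivot(orders, rows=["Country"], value="Order amount")
# Two dimensions in ROWS, as in Example 2.
by_country_and_person = pivot(orders, rows=["Country", "Salesperson"], value="Order amount")

print(by_country)
print(by_country_and_person)
```

Passing `aggregate=len` or `aggregate=max` instead of the default `sum` corresponds to choosing COUNT or MAX for the VALUES field.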
