IMS 6217: Data Warehousing / Business Intelligence

Dr. Lawrence West, Management Dept., University of Central Florida [email protected]

Database Performance Part 1—Topics

• Doing vs. Deciding—OLTP vs. OLAP

• Data Warehouses

– Fact tables, Dimension tables, Granularity

– DW in an integrated Business Intelligence system

• Design Steps

• Designing Fact Tables

• Designing Dimension Tables

– The Time dimension

• Fact Table Exercises

• The AdventureWorks DW

"With uncertainty present…"

With the introduction of uncertainty—the fact of ignorance and necessity of acting upon opinion rather than knowledge—into this Eden-like situation, its character is completely changed. With uncertainty absent, man's energies are devoted altogether to doing things; it is doubtful whether intelligence itself would exist in such a situation; in a world so built that perfect knowledge was theoretically possible, it seems likely that all organic readjustments would become mechanical, all organisms automata. With uncertainty present, doing things, the actual execution of activity, becomes in a real sense a secondary part of life; the primary problem or function is deciding what to do and how to do it. The two most important characteristics of social organization brought about by the fact of uncertainty have already been noticed. In the first place, goods are produced for a market, on the basis of an entirely impersonal prediction of wants, not for the satisfaction of the wants of the producers themselves. The producer takes the responsibility of forecasting the consumers' wants. In the second place, the work of forecasting and at the same time a large part of the technological direction and control of production are still further concentrated upon a very narrow class of the producers, and we meet with a new economic functionary, the entrepreneur.

Frank H. Knight, University of Chicago, 1921

Doing vs. Deciding

• Organizations do many things

– List thirty transactions that your project organization executes or does

– Start with the Top-Ten list from Projects 2 & 3

• Managers decide things

– List thirty decisions that your project organization makes

– Identify where in the organizational hierarchy the decision lies

– What is the consequence/importance of the decision?

– What information influences each decision?

Doing vs. Deciding / OLTP vs OLAP

• Are systems designed to support the execution of events suitable for the making of decisions?

• Event/transaction support requires

– High throughput

– High reliability

– Accuracy

– DB structures tuned for storage & performance

• Online Transaction Processing (OLTP) systems support events

– Provide data or information to support transactions

– Record acts → New data

OLTP vs. OLAP—Let Me Count the Ways…

• Online Analytical Processing (OLAP) or Business Intelligence (BI) systems are oriented toward decision making and analysis

• What are the problems with using our OLTP databases to support managerial decision making?


The Data Warehouse

• The DW is a separate storage structure

• Designed to optimize query execution

– Not storage efficiency

– Not transaction throughput

• Expected to be loaded during down times

• Supports "readability"

• May sacrifice details for summaries

• Data and structures anticipate user needs

– Recurring decisions

– Flexible exploration

Steps and Components

• Source Systems—provide raw data to the DW

• Integration Services—Provide transformation and loading services from source data to DW

• Data Warehouse—Customized data store for Business Intelligence

• Analysis Services—Tools for data mining and reporting

• Reporting Services—Our old friend acting on an enhanced data store

Our Approach

• This Week

– Discuss DW storage strategies

– Discuss data to be stored

• Internal data from OLTP systems

• External data

– Design exercises

• Next Week

– DW loading strategies

– DW tools—Analysis Services

Storage Strategies

• The DW stores transformed data that

– May be accessed directly to support analysis

– Supports actions of the Analysis Services to provide enhanced and efficient analysis

• Multiple Strategies

• We will look at the widely used approach using

– Fact tables,

– Dimension tables,

– Arranged in a Star Schema or Snowflake Schema (or both)

Fact Tables Contain Facts (duhhhh) of Interest

• No PK designated for fact table

• Natural PK is TimeKeyOrdered, ProductKey, CustomerKey

– This defines the granularity of the data

• CategoryKey is functionally dependent (FD) on ProductKey

• SalesTerrKey and SalesRepKey are functionally dependent on CustomerKey

• UnitsSold, TotalDiscounts

– Summed from source data

– Additive

• SalesPrice is not additive

• ValueSold is derivable and additive

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
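A minimal sketch of how source rows might be summed to the fact table's grain of TimeKeyOrdered, ProductKey, CustomerKey. The order-line data and the `load_fact_rows` helper are hypothetical and serve only to illustrate the "summed from source data" bullet.

```python
from collections import defaultdict

# Hypothetical source order lines: (order date, product, customer, units, discount).
order_lines = [
    ("2015-03-02", 10, 501, 3, 1.50),
    ("2015-03-02", 10, 501, 2, 0.00),  # same date, product, customer: same fact row
    ("2015-03-02", 11, 501, 1, 0.25),
    ("2015-03-03", 10, 502, 4, 2.00),
]


def load_fact_rows(lines):
    """Sum source rows into one fact row per (TimeKeyOrdered, ProductKey, CustomerKey)."""
    totals = defaultdict(lambda: {"UnitsSold": 0, "TotalDiscounts": 0.0})
    for time_key, product_key, customer_key, units, discount in lines:
        row = totals[(time_key, product_key, customer_key)]
        row["UnitsSold"] += units            # additive measure
        row["TotalDiscounts"] += discount    # additive measure
    return dict(totals)


fact_rows = load_fact_rows(order_lines)
for key, measures in sorted(fact_rows.items()):
    print(key, measures)
```

Four source rows collapse to three fact rows because two of them share the same natural key.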

Star Schema & Dimension Tables

• Dimension Tables represent concepts (entities) used to group data in the fact tables

• Also contain descriptive attributes of the entity represented by the dimension table

• Simplest way for nontechnical users to picture the data

• Relate to FKs in the fact tables

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts

DimDate: TimeKey

DimCustomer: CustomerKey

DimProduct: ProductKey

DimCategory: CategoryKey

DimSalesRep: SalesRepKey

DimSalesTerr: SalesTerrKey
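A rough sketch of a star-schema query, using Python's built-in sqlite3 module. Only SALES and DimCategory are created, with a reduced set of columns; the CategoryName attribute and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, hypothetical versions of the tables in the diagram.
    CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE SALES (ProductKey INTEGER, CategoryKey INTEGER, UnitsSold INTEGER);
    INSERT INTO DimCategory VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO SALES VALUES (10, 1, 3), (11, 1, 2), (20, 2, 7);
""")

# In a star schema the fact table joins directly to each dimension it is grouped by.
sales_by_category = conn.execute("""
    SELECT c.CategoryName, SUM(s.UnitsSold)
    FROM SALES AS s
    JOIN DimCategory AS c ON c.CategoryKey = s.CategoryKey
    GROUP BY c.CategoryName
    ORDER BY c.CategoryName
""").fetchall()

print(sales_by_category)
```

One join per dimension is all a "sales by category" question needs, which is why the star layout is the easiest one for nontechnical users to picture.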

Snowflake Schema & Dimension Tables

• Fewer direct links from dimension tables to fact table

• Dimension tables relate to each other

• Natural hierarchical relationships in data are preserved

– Implications for drilldown reports

• Increases complexity of data retrieval for nontechnical users

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CustomerKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts

DimDate: TimeKey

DimCustomer: CustomerKey

DimProduct: ProductKey

DimCategory: CategoryKey

DimSalesRep: SalesRepKey

DimGeography: GeographyKey

DimSalesTerr: SalesTerrKey
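A sketch of the same "sales by category" question against a snowflake layout, again with sqlite3. It assumes the product dimension (called DimProduct here) carries a CategoryKey that points to DimCategory; the diagram's actual links between dimension tables are not recoverable from the text, so this relationship and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed hierarchy: SALES -> DimProduct -> DimCategory.
    CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, CategoryKey INTEGER);
    CREATE TABLE SALES (ProductKey INTEGER, UnitsSold INTEGER);
    INSERT INTO DimCategory VALUES (1, 'Bikes'), (2, 'Helmets');
    INSERT INTO DimProduct VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO SALES VALUES (10, 3), (11, 2), (20, 7);
""")

# The fact table no longer holds CategoryKey, so the query walks the hierarchy.
sales_by_category = conn.execute("""
    SELECT c.CategoryName, SUM(s.UnitsSold)
    FROM SALES AS s
    JOIN DimProduct AS p ON p.ProductKey = s.ProductKey
    JOIN DimCategory AS c ON c.CategoryKey = p.CategoryKey
    GROUP BY c.CategoryName
    ORDER BY c.CategoryName
""").fetchall()

print(sales_by_category)
```

The extra join is the added retrieval complexity the slide mentions; in exchange, the product-to-category hierarchy is stored once and is available for drilldown.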

Granularity

• The granularity of the fact tables is a critical design decision

• There are alternative levels of granularity

– Finer granularity → more detail, more records: use SalesDate instead of SalesMonth

– Coarser granularity → less detail, fewer records: use SalesMonth instead of SalesDate

• Finer granularity can be aggregated in the DW to find the coarser granularity values

• Coarse granularity cannot be decomposed

• Granularity decisions are made for each of the FKs from the dimension tables

SALES

SalesRecID, FDOWeek, ProductID, CategoryID, CustomerID, SalesTerrID, SalesRepID, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
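A small sketch of the asymmetry described above: daily values can be rolled up to months, but monthly totals cannot be split back into days. The daily figures and the `roll_up_to_month` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical fine-grained data: UnitsSold keyed by sales date (ISO format).
daily_units = {"2015-03-02": 5, "2015-03-03": 4, "2015-04-01": 6}


def roll_up_to_month(daily):
    """Aggregate date-level values to month level using the 'YYYY-MM' prefix."""
    monthly = defaultdict(int)
    for sales_date, units in daily.items():
        monthly[sales_date[:7]] += units
    return dict(monthly)


monthly_units = roll_up_to_month(daily_units)
print(monthly_units)
```

Nothing in `monthly_units` records how the March total was spread across days, so a warehouse that stores only the monthly grain cannot answer daily questions later.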

Design Steps

• It is impractical to design a one-source DW as the first deliverable

• Identify initial scope of DW

– Problem Statement

– Business Requirements

• Build DW Data Model

– Business Processes to address requirements

– Level of Detail

– Fact Tables (what we are measuring)

– Dimension Tables (how we look at the data)

Design Steps (cont.)

• Design Integration Services

• Design Analysis Services

• Design Reports

• Deploy and Manage DW

• Add additional business requirements

– Repeat process for new requirements

– Add additional dimensions to the DW

Fact Tables (Part 2)

• Identifying Fact Tables and their facts is an art

• No obvious mapping from OLTP tables to Fact or Dimension Tables

• The same DB table can contribute to multiple fact tables

• Requires analysis to discover central concepts that will become fact tables

– Decision maker interviews

– Reporting requirements

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts

Fact Tables (Part 2—cont.)

• Look for a logical concept or event that the measures of interest describe

– A sale (invoice)

– An order (purchase order)

– An enrollment (college DB)

• The concept/event should support the requirements

• The event is likely to be based on an OLTP table

– Not every OLTP table will become a fact table

• This concept/event will form the foundation for a fact table

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts

Fact Tables--Measures

• Measures are the facts to be recorded for each row in the fact table

• Measures are often additive

– UnitsSold, TotalDiscounts, ValueSold

• Some are not additive

– SalesPrice

• Sometimes nonadditive measures are transformed into additive measures

– ValueSold = (UnitsSold * SalesPrice) - TotalDiscounts

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts
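A brief sketch of additive versus nonadditive measures, applying the ValueSold formula from the slide to two hypothetical fact rows.

```python
# Two hypothetical fact rows holding the measures named on the slide.
rows = [
    {"UnitsSold": 3, "SalesPrice": 10.0, "TotalDiscounts": 2.0},
    {"UnitsSold": 1, "SalesPrice": 50.0, "TotalDiscounts": 0.0},
]

# Derive the additive measure from the nonadditive one, row by row.
for row in rows:
    row["ValueSold"] = row["UnitsSold"] * row["SalesPrice"] - row["TotalDiscounts"]

total_units = sum(row["UnitsSold"] for row in rows)    # meaningful total
total_value = sum(row["ValueSold"] for row in rows)    # meaningful total
summed_price = sum(row["SalesPrice"] for row in rows)  # a number, but not a meaningful one

# A price-like figure for the group has to be rebuilt from additive measures.
average_realized_price = total_value / total_units

print(total_units, total_value, summed_price, average_realized_price)
```

Summing SalesPrice yields 60.0, which describes nothing in the business, while ValueSold and UnitsSold can be totaled across any grouping and then divided to recover an average price.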

Fact Tables—Measures (cont.)

• Measures may come from several sources—often not just values from a single OLTP source table

• Other candidates in our example

– COGS

– CurrentInterestRate

– CompetitorPrice

– GrossMargin

– NetMargin

– ShippingCost

– ShippingWeight

SALES

TimeKeyOrdered, TimeKeyShipped, TimeKeyPmntRcvd, ProductKey, CategoryKey, CustomerKey, SalesTerrKey, SalesRepKey, UnitsSold, SalesPrice, ValueSold, TotalDiscounts

Fact Tables--Dimensions

• Dimensions are ways of looking at the data

– Users may indicate they look at {fact table subject} "by" {dimension name}

– Sales by week

– Sales by customer

– Sales by product category

• Dimensions lead us to Dimension Tables

– Descriptive attributes about the dimension

– A key that the fact table references as a foreign key

Dimension Tables

• Dimension tables are often based on an OLTP entity

• Denormalized to include descriptive attributes from other tables

– Product might include

• SupplierName

• CategoryName

• SubCategoryName

• SupplierCountry

• In a snowflake schema, related hierarchical information may be retained in separate hierarchical tables

Dimension Tables—Primary Keys

• Dimension tables should always be given an artificial identity PK—even if there is a suitable OLTP table PK

• If tables are ever loaded from multiple sources the natural PK may become invalid

– E.g., merging sales data from two business units with different databases

• Retain the business PK as an attribute in the dimension table

• Possibly include source system identifier for the row

DimCustomer

CustomerKey, CustomerID, SourceSystem, LastName, FirstName, ...
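A minimal sketch of why the artificial key matters when loading from multiple sources. The two business units, their customer rows, and the `build_dim_customer` helper are hypothetical; the column names follow the DimCustomer layout above.

```python
from itertools import count


def build_dim_customer(sources):
    """Assign a new surrogate CustomerKey to every row, keeping the business key."""
    next_key = count(1)
    dim_rows = []
    for source_system, customers in sources.items():
        for customer_id, last_name in customers:
            dim_rows.append({
                "CustomerKey": next(next_key),   # artificial identity PK
                "CustomerID": customer_id,       # business PK retained as an attribute
                "SourceSystem": source_system,   # identifies where the row came from
                "LastName": last_name,
            })
    return dim_rows


# Both units issued CustomerID 100, so the natural key collides after a merge.
sources = {
    "UnitA": [(100, "Garcia"), (101, "Chen")],
    "UnitB": [(100, "Okafor")],
}
dim_customer = build_dim_customer(sources)
for row in dim_customer:
    print(row)
```

CustomerID 100 appears twice, yet every row still has a distinct CustomerKey, and the pair (SourceSystem, CustomerID) traces each row back to its origin.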

Dimension Tables—Time

• Time is a hugely common "by" dimension

• Decide on time granularity

– Daily, Weekly, Hourly?

• You might consider two time dimensions

– Day for the coarsest categorization

– Hour for additional precision

DimDate

TimeKey, CalendarYear, CalendarQuarter, CalendarMonthName, CalendarNumberOfMonth, DayNumberOfMonth, DayNumberOfYear, DayNumberOfWeek, FiscalYear, FiscalQuarter, ManufacturingYear, ManufacturingQuarter, ManufacturingMonth, SeasonName, HolidayFlag, ThanksgivingWeekendFlag, PreChristmasFlag

DimTimeOfDay

HourNum, TimeOfDayName

Dimensions—Time (cont.)

• The time dimension table maps from the measured time attribute associated with the fact table record to various labels and aggregations associated with that value

• Facilitates summarizing by various aggregates with a single time dimension measure

• TimeKey PK is often a datetime data type held at date-level precision

DimDate

TimeKey, CalendarYear, CalendarQuarter, CalendarMonthName, CalendarNumberOfMonth, DayNumberOfMonth, DayNumberOfYear, DayNumberOfWeek, FiscalYear, FiscalQuarter, ManufacturingYear, ManufacturingQuarter, ManufacturingMonth, SeasonName, HolidayFlag, ThanksgivingWeekendFlag, PreChristmasFlag
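A sketch of how DimDate rows might be generated, one row per calendar day, covering only the calendar attributes. Fiscal, manufacturing, season, and holiday columns depend on organization-specific rules and are left out. The helper names, the ISO string used for TimeKey, and the Monday = 1 weekday numbering are assumptions.

```python
from datetime import date, timedelta

MONTH_NAMES = ("January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December")


def dim_date_row(day):
    """Map one date to the labels and aggregation levels used for grouping."""
    return {
        "TimeKey": day.isoformat(),                   # assumption: ISO date string as key
        "CalendarYear": day.year,
        "CalendarQuarter": (day.month - 1) // 3 + 1,
        "CalendarMonthName": MONTH_NAMES[day.month - 1],
        "CalendarNumberOfMonth": day.month,
        "DayNumberOfMonth": day.day,
        "DayNumberOfYear": day.timetuple().tm_yday,
        "DayNumberOfWeek": day.isoweekday(),          # assumption: Monday = 1
    }


def build_dim_date(start, end):
    """Build one row for every day from start to end, inclusive."""
    days = (end - start).days + 1
    return [dim_date_row(start + timedelta(days=n)) for n in range(days)]


dim_date = build_dim_date(date(2015, 1, 1), date(2015, 12, 31))
print(len(dim_date), dim_date[0])
```

A fact row carrying only a TimeKey can then be summarized by year, quarter, month, or weekday through a single join to this table.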

Fact Tables--Granularity

• In the olden days granularity decisions were made at the DW DB design stage

• Granularity decisions traded off

– Number of records and computational overhead associated with more detailed granularity

– Lack of precision with coarser granularity

• Modern computational power supports finer granularity

• Analysis Services provides support for fast computation over large data sets

• Just don't go overboard

Fact Table Exercise #1

• Are there any fact tables beyond the one illustrated on Slide 11 (Star Schema & Dimension Tables) for the NorthWind DB?

• Are there additional facts that you might add to this table?

• Are there additional dimension tables you might add?

Fact Table Exercise #2

• Expand entities around the core of our University ERD

– See next slide

• Consider two business goals

– Understand real credit hour revenue

– Understand classroom utilization

• Identify and design Fact and Dimension Tables

COURSE: DeptCode, CourseNo, Name, CreditHrs, LabHrs

SECTION: SectionID, DeptCode <AK>, CourseNo <AK>, SecNo <AK>, Term <AK>, Year <AK>, Room <FK1>, Days, Time, InstructorID <FK2>

ENROLLMENT: SectionID <FK1>, StudentID <FK2>, Grade <FK3>

STUDENT: StudentID, LastName, FirstName, ...

GRADE: Grade, GradePts, ...

Relationship lines in the diagram are each labeled "Has".

Fact Table Exercise #2 (Cont.)

Student: StudentID, Lname, ..., Resident Y/N

Enrollment: SectionID, StudentID, Grade

Payment: PaymentID, StudentID, PaymentDate, PaymentAmt

Section: SectionID, DeptCode, CourseNum, SecNum, Term, Year, Days, Time, InstructorID, Capacity, BldgID, RoomNum

Instructor: EmployeeID, LastName, FirstName, DeptID, CurrentSalary

SalaryHistory: EmployeeID, StartDate, EndDate, Salary, Status

Course: DeptCode, CourseNum, CrHr, Title, Description, LecHr, LabHr, DeptID

Department: DeprtID, Name, ChairID, DeptID, OfficeBldg, OfficeRm

Textbook: ISBN, Title, Year, ..., WholsalePrice, ListPrice

SectionBook: ISBN, SectionID, QtyOrdered, QtySold

Room: BldgID, RoomNum, Type, Capacity

Building: BldgID, Name, Floors, SqFt

FeeSchedule: AcadYear, InStateUGCrHr, OutStateUGCrHr, InStateGrCrHr, OutStateGrCrHr, InStateDCrHr, OutStateDCrHr, HealthFee, ActivityFee

BrightFutureStat: StudentID, AcadYear, PercentCovered

External Data

• What external data might you want to have in a sales-oriented DW?

Next Time

• Transformations to load the DW from the source OLTP (and other) data sources

– Automated support

– Do it yourself

• Analysis Services—putting our DW to work