data warehouse design & dimensional modeling

Data Warehouse Design & Dimensional Modeling Aaron Lowe Principal Consultant @Vendoran @SQLFriends

Upload: code-mastery

Post on 13-May-2015


Category:

Technology



DESCRIPTION

At Code Mastery Boston Aaron Lowe, Principal Consultant at Magenic, talks Data Warehouse Design & Dimensional Modeling

TRANSCRIPT

Page 1: Data Warehouse Design & Dimensional Modeling

Data Warehouse Design & Dimensional Modeling Aaron Lowe

Principal Consultant

@Vendoran

@SQLFriends

Page 2: Data Warehouse Design & Dimensional Modeling

Who am I » Aaron Lowe

» Husband

» Father of 5

» Principal Consultant at Magenic

» Working with SQL Server since 1998, version 6.5

» MCITP 2005 and 2008

» Co-organizer of SQLSaturday Chicago

» Masters in Information Systems Management

» www.aaronlowe.net / @Vendoran

» sqlfriends.org / @SQLFriends

Page 3: Data Warehouse Design & Dimensional Modeling

Data, Data everywhere, but not a drop of Information

http://www.flickr.com/photos/walkingsf/5993167874/

Page 4: Data Warehouse Design & Dimensional Modeling

The Data Person

http://www.flickr.com/photos/tantek/1360323838/

Page 5: Data Warehouse Design & Dimensional Modeling

How can we get more out of our data?

http://www.flickr.com/photos/danahlongley/4472897115/

Page 6: Data Warehouse Design & Dimensional Modeling

Leverage data to provide business insight

http://www.flickr.com/photos/juhansonin/4646203016/

Page 7: Data Warehouse Design & Dimensional Modeling

Create a new Data Model in a Data Warehouse

Page 8: Data Warehouse Design & Dimensional Modeling

Why a new Data Model?

Page 9: Data Warehouse Design & Dimensional Modeling

What do we need?

Page 10: Data Warehouse Design & Dimensional Modeling

Information – not just data » Collecting data

» Log Files » Clicks » How long? » How much?

» Prediction? » How Target figured out a teen was pregnant - http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

» The Numerati - http://www.amazon.com/The-Numerati-Stephen-Baker/dp/B003TO6G20/ - published 2009!

Page 11: Data Warehouse Design & Dimensional Modeling

Relate data from multiple systems » The purpose of a data warehouse is to house standardized, structured, consistent, integrated, correct, cleansed and timely data, extracted from various operational systems in an organization

» True picture of the business process

» Source Systems » Financial – AR/AP » Sales » CRM » HR » Application

Page 12: Data Warehouse Design & Dimensional Modeling

Fast » It’s my information and I want it Now!

» Empower Users

» Exploratory

» Reads

» Large datasets

Page 13: Data Warehouse Design & Dimensional Modeling

Why won’t existing models work?

Page 14: Data Warehouse Design & Dimensional Modeling

What are they designed for? » Operational

» Preservation of data integrity

» Speed of recording of business transactions

» Often Many tables

» To free the collection of relations from undesirable insertion, update and deletion dependencies;

» To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;

» To make the relational model more informative to users;

» To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

» —E.F. Codd, "Further Normalization of the Data Base Relational Model"

Page 15: Data Warehouse Design & Dimensional Modeling

Consistent » Partial data across systems

» Have the sale in the sales system » Represented in the inventory system » Don’t have the $ in the financial system yet

» Deleted on sources » Removed transactions » Archive » Legally destroyed records can remove work product

» Incomplete data on source » Changes over time

Page 16: Data Warehouse Design & Dimensional Modeling

Silo’d » How do we get the entire picture?

» Example:

» Cost of Sales? » Sales system – Sale Price

» Marketing System – $$ spent on Marketing

» Inventory System – $$ spent on inventory

» HR System – $$ spent on Employee

» IT Systems – $$ spent on Infrastructure

Page 17: Data Warehouse Design & Dimensional Modeling

What will work?

http://www.flickr.com/photos/d-y-f/2870942257/

Page 18: Data Warehouse Design & Dimensional Modeling

Designed for Users » De-normalized

» Fast Reads » Fast Reports » Limited JOINs

» Information » Scheduled » On Demand » Exploratory

» Information » Cross Functional » The more the better!

Page 19: Data Warehouse Design & Dimensional Modeling

Inter-related data » Specifications for my Current Data Warehouse

http://www.flickr.com/photos/ross_goodman/3276964270/

Page 20: Data Warehouse Design & Dimensional Modeling

Independent from Operational » Operational systems change

» Data will outlive Application

» Crashes

» Upgrades

» Breaking changes

» Single Source of truth

Page 21: Data Warehouse Design & Dimensional Modeling

Logical Data Model

http://www.flickr.com/photos/doctorlizardo/6812846803/

Page 22: Data Warehouse Design & Dimensional Modeling

Terminology

http://www.flickr.com/photos/doctorlizardo/6809564765/

Page 23: Data Warehouse Design & Dimensional Modeling

Metadata Management » Business metadata

» What’s out there?

» Identify/Define

» Overloaded terms

» What is a customer?

» Process metadata

» DW process operations

» Assess system status

» Investigate problems

» Technical metadata

» Tables

» Fields

» Datatypes

Page 24: Data Warehouse Design & Dimensional Modeling

Dimensions and Facts

Dimensions | Facts
Things/Objects | Measurements/Events
Nouns | Verbs
Wide but short | Skinny but long
Rows can exist independently | Rows cannot exist independently
Descriptive | Mostly Numeric and Additive

“By” words – FACT by Dimension

Quantity Ordered by Product by Customer by Date
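The "by" words map directly onto a GROUP BY over a star schema. The following is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server; the table names, column names, and sample rows (DimProduct, FactOrderLine, "Widget", "Acme", and so on) are illustrative assumptions and do not come from the talk.

```python
import sqlite3

# In-memory database; all table names and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct  (ProductKey  INTEGER PRIMARY KEY, ProductName  TEXT);
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE DimDate     (DateKey     INTEGER PRIMARY KEY, CalendarDate TEXT);

-- Fact rows are skinny: dimension keys plus an additive measure.
CREATE TABLE FactOrderLine (
    ProductKey      INTEGER REFERENCES DimProduct(ProductKey),
    CustomerKey     INTEGER REFERENCES DimCustomer(CustomerKey),
    DateKey         INTEGER REFERENCES DimDate(DateKey),
    QuantityOrdered INTEGER
);

INSERT INTO DimProduct  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO DimCustomer VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO DimDate     VALUES (20120101, '2012-01-01'), (20120102, '2012-01-02');
INSERT INTO FactOrderLine VALUES
    (1, 1, 20120101, 5),
    (1, 1, 20120101, 3),
    (2, 1, 20120102, 2),
    (1, 2, 20120102, 7);
""")


def quantity_by_product_customer_date(connection):
    """Return Quantity Ordered by Product by Customer by Date."""
    return connection.execute("""
        SELECT p.ProductName, c.CustomerName, d.CalendarDate,
               SUM(f.QuantityOrdered) AS QuantityOrdered
        FROM FactOrderLine AS f
        JOIN DimProduct  AS p ON p.ProductKey  = f.ProductKey
        JOIN DimCustomer AS c ON c.CustomerKey = f.CustomerKey
        JOIN DimDate     AS d ON d.DateKey     = f.DateKey
        GROUP BY p.ProductName, c.CustomerName, d.CalendarDate
        ORDER BY p.ProductName, c.CustomerName, d.CalendarDate
    """).fetchall()


rows = quantity_by_product_customer_date(conn)
for row in rows:
    print(row)
```

Each dimension contributes one JOIN and one GROUP BY column, and the fact table contributes the measure being summed.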

Page 25: Data Warehouse Design & Dimensional Modeling

Grain • Level of detail

• What is needed to meet business requirements?

• What is possible to collect?

• How do you describe it?

• One row per X where X is the business event

• One row per customer call

• One row per time sheet entry

• One row per employee status change

• One row per order line item

http://www.flickr.com/photos/frederikvanroest/3842334310/
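Once a grain such as "one row per order line item" is declared, it can be checked mechanically. The sketch below is one possible check in Python; the function name grain_violations and the columns order_id and line_number are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def grain_violations(rows, grain_columns):
    """Return the grain keys that appear on more than one row.

    An empty result means the data honors the declared grain.
    """
    counts = Counter(tuple(row[col] for col in grain_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical fact rows with a declared grain of one row per order line item.
order_lines = [
    {"order_id": 1001, "line_number": 1, "quantity": 5},
    {"order_id": 1001, "line_number": 2, "quantity": 1},
    {"order_id": 1002, "line_number": 1, "quantity": 3},
    {"order_id": 1002, "line_number": 1, "quantity": 3},  # duplicate line item
]

violations = grain_violations(order_lines, ["order_id", "line_number"])
print(violations)
```

Running the same check with a coarser key, for example only order_id, would report every multi-line order, which is a quick way to see whether the stated grain matches the data.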

Page 26: Data Warehouse Design & Dimensional Modeling

Methodology

http://www.flickr.com/photos/doctorlizardo/6812847973/

Page 27: Data Warehouse Design & Dimensional Modeling

Requirements – business focused » “Must embrace the goal of enhancing business value as the primary purpose.” – Kimball

» “If your job is BI and you speak mostly to technical people all day, you are doing it wrong. Focus on first word - BUSINESS.” – Whitney Weaver (former Magenicon)

» Never ask “What do you want in the data warehouse?” Only one right answer - “Everything.”

» Ask questions that help you learn what the end user does

Page 28: Data Warehouse Design & Dimensional Modeling

Kimball v. Inmon

Ralph Kimball | Bill Inmon
Kimballites | Inmonites
Bottom Up | Top Down
Dimensional | Normalized
Star Schema | 3rd Normal Form
Easier for the User | More Difficult for the Users
Few JOINs | Many JOINs
Dimension/Facts | Entities
Complicated ETL | Not as complicated ETL
Difficult to modify structure | Easier to adapt

Not mutually Exclusive

Page 29: Data Warehouse Design & Dimensional Modeling

Star vs. Snowflake

Star | Snowflake
ER resembles Star | ER resembles Snowflake
Easier for the User | More Difficult for the Users
Few JOINs | Many JOINs
Faster Aggregations | Slower Aggregations
- | Children with multiple parent tables
- | Normalized Dimensions

Snowflake is a variation on a Star, not an alternative

http://www.flickr.com/photos/wandrus/6283157711/

Page 30: Data Warehouse Design & Dimensional Modeling

History (ology?)

http://www.flickr.com/photos/doctorlizardo/6809564335/

Page 31: Data Warehouse Design & Dimensional Modeling

Dimension Types » 0 – Inserts only, no updates or deletes

» 1 – Inserted and updated to reflect current state

» 2 – Slowly Changing Dimension (SCD) – multiple records to indicate different points in time

» 3 – multiple columns to indicate different points in time

» 4 – current value table and a history table

» UNKNOWN values

Source Key | Value | StartDate | EndDate
14 | Blue | 2012-01-01 | 2012-03-01
14 | Green | 2012-03-02 |

Source Key | Value | OldValue | EffectiveDate
14 | Green | Blue | 2012-03-02
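The Type 2 pattern (multiple records for different points in time) can be sketched as "close the current row, then insert a new one". The Python below is an illustration under assumptions: the function name apply_type2_change and the dictionary layout are hypothetical, and an open EndDate is represented as None. The sample values reproduce the Blue to Green change for source key 14 from the slide.

```python
from datetime import date, timedelta


def apply_type2_change(rows, source_key, new_value, effective_date):
    """Apply a Type 2 change in place: end-date the current row, add a new row.

    The current row is the one whose end_date is None. If the value has not
    changed, nothing happens.
    """
    current = next(
        (r for r in rows if r["source_key"] == source_key and r["end_date"] is None),
        None,
    )
    if current is not None:
        if current["value"] == new_value:
            return rows
        # The old version stops being valid the day before the new one starts.
        current["end_date"] = effective_date - timedelta(days=1)
    rows.append(
        {
            "source_key": source_key,
            "value": new_value,
            "start_date": effective_date,
            "end_date": None,
        }
    )
    return rows


dim = [
    {"source_key": 14, "value": "Blue", "start_date": date(2012, 1, 1), "end_date": None}
]
apply_type2_change(dim, 14, "Green", date(2012, 3, 2))
for row in dim:
    print(row)
```

A Type 1 dimension would instead overwrite value on the existing row, and a Type 3 dimension would move the old value into a separate column such as OldValue.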

Page 32: Data Warehouse Design & Dimensional Modeling

Date and Time » Date

» Fundamental dimension across all organizations and industries » Allows for trending across dates or periods » 1 row for every date in the year = 365 or 366 rows/year » Use your words

» WeekDay » EndofMonth » Quarter » FiscalYear?

» Time » Not often needed, but becoming more popular » Allows for time based analysis for things like Status » 1 row for every time slice in a day – minutes? Seconds?
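A date dimension is usually generated rather than loaded from a source system. The following Python sketch builds one row per calendar date with a few descriptive attributes; the function name, the yyyymmdd integer key, and the attribute names are assumptions for illustration. FiscalYear is omitted because its definition is specific to each organization.

```python
import calendar
from datetime import date, timedelta

# Fixed English names so the output does not depend on the system locale.
WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def build_date_dimension(year):
    """Return one row per calendar date in the given year."""
    rows = []
    day = date(year, 1, 1)
    while day.year == year:
        last_day_of_month = calendar.monthrange(day.year, day.month)[1]
        rows.append(
            {
                "date_key": day.year * 10000 + day.month * 100 + day.day,
                "date": day,
                "weekday": WEEKDAY_NAMES[day.weekday()],
                "end_of_month": day.day == last_day_of_month,
                "quarter": (day.month - 1) // 3 + 1,
            }
        )
        day += timedelta(days=1)
    return rows


dates_2012 = build_date_dimension(2012)
print(len(dates_2012))  # 366 rows, because 2012 is a leap year
```

The descriptive columns are what let report users filter with words ("weekday", "end of month") instead of date arithmetic.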

Page 33: Data Warehouse Design & Dimensional Modeling

Surrogate Keys » New set of keys in the DW

» Protects against

» Source systems changes

» Single key for multiple source systems

» New rows that only exist in DW (UNKNOWN)

» Tracking over time (SCD)
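The points above can be sketched as a small key map that hands out warehouse-owned keys. This is an illustration only: the class SurrogateKeyMap, the reserved key -1 for the UNKNOWN member, and the source system names are all hypothetical choices, not something prescribed in the talk.

```python
class SurrogateKeyMap:
    """Maps (source_system, source_key) pairs to warehouse surrogate keys."""

    UNKNOWN_KEY = -1  # assumed reserved key for a row that exists only in the DW

    def __init__(self):
        self._keys = {}
        self._next_key = 1

    def get_or_create(self, source_system, source_key):
        """Return the surrogate key for a source row, assigning one if needed."""
        natural = (source_system, source_key)
        if natural not in self._keys:
            self._keys[natural] = self._next_key
            self._next_key += 1
        return self._keys[natural]

    def alias(self, surrogate_key, source_system, source_key):
        """Point another source system's key at an existing surrogate key."""
        self._keys[(source_system, source_key)] = surrogate_key

    def lookup(self, source_system, source_key):
        """Return the surrogate key, or UNKNOWN_KEY when there is no match."""
        return self._keys.get((source_system, source_key), self.UNKNOWN_KEY)


customers = SurrogateKeyMap()
crm_key = customers.get_or_create("CRM", "C-14")
sales_key = customers.get_or_create("Sales", 14)
# The same customer in a third system shares the CRM row's surrogate key.
customers.alias(crm_key, "Billing", "B-99")
print(crm_key, sales_key, customers.lookup("Billing", "B-99"), customers.lookup("HR", "E-1"))
```

Because facts store only the surrogate key, a source system can renumber or be replaced without rewriting the fact tables; only the map changes.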

Page 34: Data Warehouse Design & Dimensional Modeling

Physical Data Model

http://www.flickr.com/photos/flying_cloud/2667218708/

Page 35: Data Warehouse Design & Dimensional Modeling

Approach

http://www.flickr.com/photos/7506006@N07/7021456259/

Page 36: Data Warehouse Design & Dimensional Modeling

Null – yay or nay » Same discussion as OLTP with a twist

» Purpose of DW is for reporting » Building on top of it with:

» SSIS » SSAS

» Purpose of the Dimension UNKNOWN values

» Best practice is to avoid if you can, otherwise document » Some have separate values for UNKNOWN and NOT POPULATED » Default value instead
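One way to avoid NULL dimension keys is to resolve every incoming value to a real dimension row, with default members for the two missing-data cases. The sketch below assumes reserved keys of -1 for UNKNOWN and -2 for NOT POPULATED; those numbers, the function name, and the sample colors are hypothetical.

```python
UNKNOWN_KEY = -1        # source sent a value, but no dimension row matches it
NOT_POPULATED_KEY = -2  # source sent nothing at all


def resolve_dimension_key(source_value, key_by_value):
    """Return a dimension key for a source value, never None."""
    if source_value is None or source_value == "":
        return NOT_POPULATED_KEY
    return key_by_value.get(source_value, UNKNOWN_KEY)


color_keys = {"Blue": 1, "Green": 2}
resolved = [resolve_dimension_key(v, color_keys) for v in ["Green", "Purple", None, ""]]
print(resolved)
```

With this approach every fact row joins to some dimension row, so inner joins in reports do not silently drop facts.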

Page 37: Data Warehouse Design & Dimensional Modeling

Aggregates » Minimize the number of aggregates while maximizing effectiveness

» Store or

» Can aggregate Facts

» Roll-up Dimension hierarchies?

» Can still be relational to other tables when necessary
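A stored aggregate is essentially a pre-computed GROUP BY over a fact table. The following sketch uses Python's built-in sqlite3 module to roll a daily fact up to month level; FactSales, AggSalesByMonth, the yyyymmdd DateKey format, and the sample amounts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical daily fact table keyed by a yyyymmdd integer date key.
CREATE TABLE FactSales (DateKey INTEGER, ProductKey INTEGER, SalesAmount REAL);
INSERT INTO FactSales VALUES
    (20111230, 1, 10.0),
    (20111231, 1, 15.0),
    (20120101, 1, 20.0),
    (20120102, 2,  5.0);

-- Stored aggregate: one row per month per product.
-- Integer division by 100 turns yyyymmdd into yyyymm.
CREATE TABLE AggSalesByMonth AS
SELECT DateKey / 100    AS MonthKey,
       ProductKey,
       SUM(SalesAmount) AS SalesAmount,
       COUNT(*)         AS FactRowCount
FROM FactSales
GROUP BY DateKey / 100, ProductKey;
""")

monthly = conn.execute(
    "SELECT MonthKey, ProductKey, SalesAmount, FactRowCount "
    "FROM AggSalesByMonth ORDER BY MonthKey, ProductKey"
).fetchall()
for row in monthly:
    print(row)
```

The aggregate keeps ProductKey, so it can still be joined to the product dimension when a report needs descriptive attributes.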

Page 38: Data Warehouse Design & Dimensional Modeling

Hierarchies » Example: » Date - Roll up by Month, Quarter or Year

» Variable depth – Self-referencing » Variable depth with historical – changing surrogate keys – ouch

» Track business process separately

Key Day Month Quarter Year

364 30 12 4 2011

365 31 12 4 2011

366 1 1 1 2012

367 2 1 1 2012
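The date rows above are a fixed-depth hierarchy stored as columns. A variable-depth, self-referencing hierarchy instead stores a parent key on each row and has to be walked. The Python below is a minimal sketch of that walk; the function name path_to_root and the organization chart are hypothetical examples.

```python
def path_to_root(parent_of, key):
    """Return the chain of keys from the given member up to its root.

    parent_of maps each member key to its parent key, with None for a root.
    """
    path = [key]
    seen = {key}
    while parent_of.get(key) is not None:
        key = parent_of[key]
        if key in seen:
            raise ValueError(f"cycle detected at {key!r}")
        seen.add(key)
        path.append(key)
    return path


# Hypothetical self-referencing dimension: member -> parent member.
org_chart = {
    "CEO": None,
    "VP Sales": "CEO",
    "Regional Manager": "VP Sales",
    "Sales Rep": "Regional Manager",
    "CFO": "CEO",
}

print(path_to_root(org_chart, "Sales Rep"))
print(path_to_root(org_chart, "CFO"))
```

Different members sit at different depths, which is why these hierarchies are harder to flatten into Month/Quarter/Year style columns.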

Page 39: Data Warehouse Design & Dimensional Modeling

Size Matters

http://starwars.wikia.com/wiki/Rancor?image=Rancor-jpg

Page 40: Data Warehouse Design & Dimensional Modeling

Data amount and size » Data Types?

» BLOB data?

» Identity columns (do you need bigint?)

» Data Profiling

» Collect source system sizes for the data being brought over

» Add sizes of new row

» Don’t forget index size!!!
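The sizing questions above come down to simple arithmetic. The sketch below is a rough estimate only: it ignores page overhead, fill factor, and compression, and the row counts and byte widths in the example are invented. The 4-byte int and 8-byte bigint sizes and the int upper limit are the standard SQL Server values.

```python
INT_MAX = 2_147_483_647  # largest value a 4-byte SQL Server int can hold
INT_BYTES = 4
BIGINT_BYTES = 8


def estimate_table_bytes(row_count, row_bytes, index_bytes_per_row=0):
    """Rough size estimate: rows times (data bytes plus index bytes per row)."""
    return row_count * (row_bytes + index_bytes_per_row)


def needs_bigint(expected_max_key):
    """True when an identity column would outgrow a 4-byte int."""
    return expected_max_key > INT_MAX


def bigint_overhead_bytes(row_count):
    """Extra bytes spent by using bigint instead of int for one key column."""
    return row_count * (BIGINT_BYTES - INT_BYTES)


# Hypothetical fact table: 1 billion rows, 40 data bytes and 12 index bytes per row.
total = estimate_table_bytes(1_000_000_000, 40, 12)
print(total)                                  # 52000000000
print(needs_bigint(1_000_000_000))            # False
print(bigint_overhead_bytes(1_000_000_000))   # 4000000000
```

Even a 4-byte difference per row adds up to gigabytes at fact-table scale, and it repeats in every index that carries the key.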

Page 41: Data Warehouse Design & Dimensional Modeling

Partitioning » Usually lends naturally to partitioning large Fact tables by Date

» Larger Dimension tables can be partitioned as well

» Sometimes Old (SQL 2000) Partitioning is still better than SQL 2005+ partitioning

» Take ETL process into consideration
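Partitioning a fact table by date means every row is routed to a partition by comparing its date against an ordered list of boundary values. The Python sketch below models that routing with monthly boundaries, where a value equal to a boundary falls into the partition on the right (similar in spirit to a RANGE RIGHT partition function in SQL Server). The function name and boundary dates are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import date


def partition_number(boundaries, value):
    """Return the 1-based partition for a value.

    boundaries must be sorted. Values below the first boundary go to
    partition 1; a value equal to a boundary goes to the partition that
    starts at that boundary.
    """
    return bisect_right(boundaries, value) + 1


# Hypothetical monthly boundaries: 3 boundaries give 4 partitions.
monthly_boundaries = [date(2012, 1, 1), date(2012, 2, 1), date(2012, 3, 1)]

for d in [date(2011, 12, 31), date(2012, 1, 1), date(2012, 2, 15), date(2012, 3, 1)]:
    print(d, partition_number(monthly_boundaries, d))
```

Because ETL usually loads the newest dates, most writes land in the last partition, which is what makes date-based partition switching and archiving practical.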

Page 42: Data Warehouse Design & Dimensional Modeling

Archiving » Question: When is big too big?

» Answer: When performance impact outweighs need for data availability

» Many options:

» Backup to tape offline

» keep “Archived” DW available

» Records Retention – this could be your work product

Page 43: Data Warehouse Design & Dimensional Modeling

Performance

http://www.flickr.com/photos/elfidomx/6026943114/

Page 44: Data Warehouse Design & Dimensional Modeling

Hardware » Remember when the user said “It’s my data and I want it now”? » Buy

» Reference Architecture (Fast Track) » Appliances

» HP » Enterprise Data Warehouse » Business Decision » Business Data Warehouse » Enterprise Database Consolidation

» Dell » PDW

» Build » Reference Architecture (Fast Track) » SQLIO » Benchmark

Page 45: Data Warehouse Design & Dimensional Modeling

Throughput » Amounts of data

» Not all of it will be in memory » Between ETL and reports, SP Cache might not be efficient » Need to tune those disks

» Reference Architecture(Fast Track) » Accepts that Procedure cache will stink due to data sizes » Instead small amount of RAM » Requires bandwidth of 400 GB/s per LUN

» Materialize data that makes reporting faster!! » More Denormalization » More Aggregations

» ReadOnly while not processing ETLs? (switch)

Page 46: Data Warehouse Design & Dimensional Modeling

Parallelism » Multiple Data Files

» SQL writes proportional fill

» Multiple Filegroups » Partitioning scheme » Facts/Dimensions » Tables that are often joined » Big tables » NCIX vs. data

» Multiple LUNs » I am not a SAN admin nor play one on TV

» Normal SQL performance

Page 47: Data Warehouse Design & Dimensional Modeling

Questions and Discussion time!