Download - Data Warehouse : Modeling and Design
NLS/IITB/DWH 1
Data Warehouse : Modeling and Design
N. L. Sarda
NLS/IITB/DWH 2
Outline
• Introduction• Warehouse structure• A case study• Dimensional analysis
NLS/IITB/DWH 3
Introduction
• DW is a single, complete and consistent store of data from different sources to understand & analyze the business
• Contains history data• Warehouse to facilitate browsing, navigating,
aggregating and visualization of related data to understand performance, problems, customer preferences, trends, etc.
• Warehouse data organized by important business subjects (customer, product, etc…)
NLS/IITB/DWH 4
Warehouse Structure
• Organized to facilitate ease of access and aggregation
• warehouse structure decomposed into dimensions and facts– Dimensions like ‘independent variables’, represent
entities for analysis
– Fact represents business data; relates to a set of dimensions
– Eg : customer, time, type of account are dimensions, and balances are facts
NLS/IITB/DWH 5
Warehouse Structure...
• The complex network of business entities and their relationships as depicted in an operational DB (using, say, ER model) is difficult for navigation and analysis
• A ‘2-level’ structure defined by ‘star schema’ is performed where a fact is at the center and dimensions form ‘spokes’
• Data not stored in ‘normalized’ form
NLS/IITB/DWH 6
Star Schema
• Contains a fact table and for each dimension one dimension table
Time Prod
Cust
fact
date, custno, prodno, cityname, ...
City
NLS/IITB/DWH 7
Dimensions
• Stored as a database table• Contains many descriptive attributes for analysis• Small and slowly changing data• Data often group-able for analysis
– Customers by age, occupation, income level
– Time by weeks, months, years
– Branches as rural, suburban or by size
• Thus, dimension data viewable as a hierarchy
NLS/IITB/DWH 8
Facts
• Contain business activity data• May be at detailed level or status level; called
transaction-oriented or snap-shot oriented• Deciding on granularity : every sale or total sales
of a day ?• Often contain numeric attributes for aggregation
(additive, semi-additive,…)• Contain dimensional table keys also
NLS/IITB/DWH 9
Snowflake Schema
• Hierarchies not captured explicitly in a star schema
• Snowflake schema represents hierarchy directly• Saves on storage but requires more join
NLS/IITB/DWH 10
Snowflake Schema
• Represent dimensional hierarchy directly by normalizing tables.
T ime
prod
cust
city
fact
date, custno, prodno, cityname, ...
region
NLS/IITB/DWH 11
Conformed Dimensions and Facts
• Goal is to produce a master suite of conformed dimensions and to standardize facts
• conformed dimension means same thing with every fact table (eg., customer, time, geography)
• it may contain data brought together from many sources
• ensures same units and meaning, same time durations and geographies across marts
NLS/IITB/DWH 12
Financial Services : A Case Study
• A bank offers various products/services like saving/checking accounts, mortgage loans, personal loans, TD, credit cards, etc…
• Purpose : track various a/c, customer profiles, etc…, for marketing and offering new services
• Requirements:– Get end-of-month summary of a/c for last 5 years– Valid snapshot as of yesterday for current month (with
full details)– Ability to group a/c in various ways & compare
balances– demographic behavior
NLS/IITB/DWH 13
Case Study ...
• Each account type has some unique attributes (requiring customized dimension and facts for each)
• Old data (a/c & customers ) may be incomplete or even different
• The warehouse data may come from multiple sources :– Loan processing system(customer,loan,dues,payment)– Fixed deposit system(customer,TD,…)– Front-office system(customer, account, transaction,..)– Credit-card system customer, transactions, interest,..)
NLS/IITB/DWH 14
Case Study ...
• Must plan extraction, correlation, consistent representation,…
• Let us consider a possible warehouse design for the indicated requirements
• Core fact table : balance in each account, # of transactions, grain : month
• Dimensions : a/c, household, branch, product, status, time
• A/c and household separate : many accounts per family; household definitions change
NLS/IITB/DWH 15
Case Study ...
• Product dimension permits hierarchy and defining specific attributes; separate because it changes
• Status : active or not, closed, etc. with reasons• Account contains customer’s data; for historical
reasons, customer to accounts relationship not well maintained
NLS/IITB/DWH 16
account keyprimary_namesecondary_nameaccount_addressaccount_cityaccount_stateaccount_zipdate_openedprimary_ageprimary_sexprimary_marital
household keyhousehold_head_namehousehold_addresshousehold_cityhousehold_statehousehold_ziphousehold_incomehousehold_type
Household Facts
account_keyhousehold_keybranch_keyproduct_keystatus_keytime_keyprimary_balancetransaction_count
product keyproduct_descriptiontypecategory
time keymonthyearfiscal_quarter
status keystatus_descriptionstatus_reasonnew_account_flagclosed_account_flag
branch keybranch-namebranch_addressbranch_citybranch_statebranch_zipbranch_type
The household data warehouse
NLS/IITB/DWH 17
Case Study ...
• Balance is semi-additive : can not be added across time
• Products highly heterogeneous : different attributes characterize different accounts (balance, deposit options, interest rate, over draft limit,..)
• Can’t combine all in a dimension as many not applicable to all products
NLS/IITB/DWH 18
Case Study ...
• Solution: create many facts, customized for each product, and one core fact with a product dimension having common attributes; leads to 100% replication, but facilitates clarifications, browsing, etc. and avoids join of customized and core facts
• When many facts are to be stored together go for snapshots (eg. monthly)
NLS/IITB/DWH 19
Case Study ...
• Transaction-grained facts usually have a single fact (eg. amount) that is directly involved in the transaction; we need a transaction dimension to represent these amounts
• In transaction grained fact table, we do not need customized fact tables per product; instead we create customized dimension tables
NLS/IITB/DWH 20
BusinessRequirement
Definition
BusinessRequirement
Definition
TechnicalArchitecture
Design
TechnicalArchitecture
Design
ProductSelection &Installation
ProductSelection &Installation
DimensionalModeling
DimensionalModeling
Data StagingDesign &
Development
Data StagingDesign &
Development
End-UserApplication
Development
End-UserApplication
Development
Projectplanning
End-UserApplication
Specification
Project ManagementProject Management
Deploy-ment
Deploy-ment
Main-tenence &
Growth
Main-tenence &
Growth
PhysicalDesign
Data Warehouse Life Cycle
NLS/IITB/DWH 21
Life Cycle : summary
• Project planning• Business requirements definition• Data track
– Dimensional modeling
– Physical design
– Data staging design and development
• Technology track– Technical architectural design
– Product selection and and installation
NLS/IITB/DWH 22
Life Cycle...
• Application track– End user application specification
– End user application development
• Deployment• Maintenance and growth• Project management
NLS/IITB/DWH 23
Collecting Requirements...
• Interviews/write-ups• Requirements findings document
– Project overview
– review of business objectives
– analytic and information requirements
– preliminary source systems analysis
– Preliminary success criteria
• Prepare and publish the requirements
NLS/IITB/DWH 24
Collecting Data about Existing Systems
• Understanding the candidate data sources• Detailed criteria for selecting the data sources
– Data accessibility– Longevity of the feed– Data accuracy– Project scheduling
• Customer matching and house-holding• Browsing and data content• Mapping data from source to target
NLS/IITB/DWH 25
Designing the Data Warehouse / Data Marts
• Identifying marts and dimensions• identify marts based on facts likely to be used
together, as a mart is a kind of subject area or application (divide-and-conquer strategy)
• often based on a single business process or a single source
• 10 to 30 marts common for a large organization• build a matrix of marts versus dimensions
NLS/IITB/DWH 26
Designing a Fact
• Define fact grain based on the basic business facts stored in legacy systems
• Choose dimensions and match them with granularity of facts
• Combine as many facts as possible with the context of defined granularity
NLS/IITB/DWH 27
Detailed Design Tips
• Names for dimensions and attributes should be chosen carefully to refer to corresponding business entities
• An attribute (in a dimension) is not replicated, but a fact may be present in many fact tables
• If a dimension occurs multiple times (eg, time), it is playing multiple roles; name them uniquely
• Every fact should have a default aggregation rule so that it is not aggregated wrongly
NLS/IITB/DWH 28
Dimension Attributes
• The quality of the data warehouse is measured by the quality of the dimension attributes
• The user interface responses and final reports are restricted to the precise contents of the dimension table attributes
• Properties– Verbose, descriptive, complete
– Quality assured, indexed
– Equally available, documented
NLS/IITB/DWH 29
Time Dimension
• Every data warehouse fact table is a time series of some observations
• We always seems to have one or more time dimensions in our fact table designs
• Provides useful hierarchies : week, month, quarter, year, etc
• Represents calendar with many useful attributes like day of week, day of month, week#, day#, quarter, weekday-flag, last-day-of-month-flag, holiday flag, etc.
NLS/IITB/DWH 30
Slowly Changing Dimensions
• The product key or customer key does not change, but the description of the product or customer does
• The data warehouse has three options for above changes– Overwrite the dimension record with the new values,
thereby losing history
• It is used whenever the old value of the attribute has no significance
• The corrections of any error falls into this category
NLS/IITB/DWH 31
Slowly Changing Dimensions...
– Create a new additional dimension record using a new value of the surrogate key
• is primary technique for accurately tracking a change in an attribute within a dimension
• requires use of a surrogate key
• a slowly changing dimension is used when a true physical change to the dimension entity has taken place
– Create an “old” field in the dimension record to store the immediate previous attribute value
• It is used when a change is tentative
NLS/IITB/DWH 32
Time Stamping the Changes
• The design of slowly changing dimension may be established by adding begin and end time stamps and a transaction description in each instance of a dimension record
• This design allows very precise time slicing of the dimension by itself
NLS/IITB/DWH 33
Large Dimensions
• Data warehouses that store extremely granular data may require some extremely large dimensions
• To support large dimensions we must choose the indexing technologies and data design approaches that:– supports rapid browsing of the unconditional
dimension, especially for low cardinality attributes
– Supports efficient browsing of cross-constrained values in the dimension table
– Find and suppress duplicate entries in the dimension
NLS/IITB/DWH 34
Foreign Key, Primary Key, Surrogate Key
• All dimensional tables have single keys, which, by definition, are primary keys
• All data warehouse keys must be meaningless surrogate keys; you must not use the original production keys
• A four byte integer makes a good surrogate key• Surrogate date keys• Avoid smart keys• Avoid production keys
NLS/IITB/DWH 35
Heterogeneous Product Schemas
• Multiple fact tables are needed when a business has heterogeneous products
• The global view needs a single core fact table crossing all lines of business, whereas local view focuses on specific product
• There are many attributes and facts which apply only to a specific product; a single fact table is not feasible
• create customized fact and (product) dimension table for each product, and build a core fact table with attributes that make sense across all lines of business; this allows to create a single portfolio (of products) for each customer
NLS/IITB/DWH 36
Transaction Schema
• Every data mart needs two separate models– Transaction version
– Periodic snapshot version
• ‘rolling’ snapshot containing averages across time
• Snapshots allow us to quickly measure the status of the enterprise
• The Transaction schema– low level transactions in the organization makes for a
good dimensional frame work
– The fact record for an individual transaction frequently contains only a single value
NLS/IITB/DWH 37
Transaction Schema..
• The transaction-based WH commonly used in– Time of day analysis
– Queue analysis
– Fraud detection
– Basket analysis
– Current status
NLS/IITB/DWH 38
Factless Fact Tables
• useful to describe events and their coverage• an event fact table records occurrence of an
event; has only flag and dimension keys (eg, student attendance)
• coverage fact table is frequently needed when a primary fact table in a dimensional data warehouse is sparse; eg, primary fact table will not provide items which were on promotion but did not sale; the coverage table, containing only dimension keys, lists all items on sale
NLS/IITB/DWH 39
Facts of Different Granularity
• The dimensional model gains power as the individual fact records become more and more atomic
• At the lowest level of individual transactions, the design is most powerful because– More of the descriptive attributes have single values
– The design withstands surprise in the form of new facts, new dimensions, or new attributes within existing dimensions
– More expressiveness at the lowest levels of granularity
NLS/IITB/DWH 40
Metadata Catalog
• It is an integral part of the overall architecture• It contains information that describes the
warehouse and plays an active role in its creation, use, and maintenance
• Contains source system metadata (data and processes), data staging metadata (dimensions, transformations, aggregations), DBMS metadata (tables, indexes, stored procedures), and front-room metadata (users, applications)
NLS/IITB/DWH 41
Technical Architecture
• Metadata driven– Metadata provides flexibility by buffering the various
components of the system from each other
– The metadata catalog provides parameters and information that allow the application to perform their task
NLS/IITB/DWH 42
Conclusion
• Building a corporate-wide data warehouse is a challenging task
• A systematic methodology essential• Plan the architecture globally but build it
incrementally• Keep user requirements at the core of all
development activities