Previews of TDWI course books are provided as an opportunity to see the quality of our material and to help you select the courses that best fit your needs. The previews cannot be printed. TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended. This preview shows selected pages that are representative of the entire course book. The pages shown are not consecutive. The page numbers as they appear in the actual course material are shown at the bottom of each page. All table-of-contents pages are included to illustrate all of the topics covered by a course.
-
The Data Warehousing Institute
TDWI Data Modeling: Data Warehouse Design and Analysis Techniques
-
The Data Warehousing Institute takes pride in the educational soundness and technical accuracy of all of our courses. Please give us your comments – we’d like to hear from you. Address your feedback to:
email: [email protected]
Publication Date: May 2003
© Copyright 1999-2003 by The Data Warehousing Institute. All rights reserved. No part of this document may be reproduced in any form, or by any means, without written permission from The Data Warehousing Institute.
-
TABLE OF CONTENTS
Module One  Data Modeling Concepts .......... 1-1
Module Two  Requirements Analysis Models .......... 2-1
Module Three  Design and Specification Models .......... 3-1
Unit A  Design & Specification Modeling Concepts .......... 3A-1
Unit B  Designing Data Marts .......... 3B-1
Unit C  Designing Data Warehouses .......... 3C-1
Unit D  Designing Data Staging Areas .......... 3D-1
Module Four  Data Modeling and Design Summary .......... 4-1
APPENDICES
Appendix A  Glossary of Data Warehousing Terms .......... A-1
Appendix B  TDWICo Case Study .......... B-1
Appendix C  TDWICo Sample Models and Documentation .......... C-1
Appendix D  Article: Optimizing the Data Warehousing Environment for Change: The Persistent Staging Area .......... D-1
Appendix E  State Transition Analysis .......... E-1
Appendix F  Bibliography and References .......... F-1
WORKSHOP
Exercises  Exercise Activities .......... W-1
Solutions  Exercise Solutions .......... WS-1
-
Module 1 Data Modeling Concepts
Topic Page
Modeling Fundamentals 1-2
The Warehouse Data Modeler 1-10
Warehousing Data Stores 1-14
Modeling Techniques 1-24
-
Modeling Techniques: Subject Area Modeling
[Figure: subject area model showing the subjects Claim, Incident, Policy, Organization, and Customer and the associations among them]
-
DESCRIPTION
A subject data model depicts business data subjects and the major associations among them. Subject models are used at the conceptual level for all of the data stores in warehousing. They aid detailed analysis of the information needs and of what the warehousing environment must provide to meet those needs.
COMPONENTS
The main components of subject area models are:
• Subjects -- High-level views of topics of business interest that may be considered equivalent to both global data classes and classes of entities.
• Relationships -- Represent the most visible associations between the subjects.
THE MODELING PROCESS
This technique is quite similar to E-R modeling. Its purpose is to identify subjects that will remain stable even as the information needs change. Each subject represents one or more entities. As warehousing ERMs develop, the subjects provide a way to view subsets of complex models. In general the modeling activities are:
• Identify and name subjects
• Associate subjects / identify relationships
• Identify and name attributes
• Associate attributes with subjects
This is an iterative process, and the order of the steps may change from one iteration to the next. Each iteration facilitates discovery in the next.
EXAMPLE
The example on the facing page illustrates that:
• Policies are directly associated with claims.
• Customers have interest in both policies and claims.
• Customers are related to organizations.
• Organizations have interest in both policies and claims.
• Incidents are related to both policies and claims.
One might infer that when a customer has interest in a policy, they also have interest in its directly related claims. Be careful to verify inferences.
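The associations listed in the example can be written down as a small undirected graph, which makes it easy to separate what the model states from what a reader might infer. This is only a sketch: the names `SUBJECT_LINKS`, `related`, and `is_direct` are hypothetical, and the pairs are copied from the bullets above rather than from any TDWI tooling.

```python
# Hypothetical encoding of the example subject model as an undirected graph.
# Each pair is one association stated in the example; nothing is inferred.
SUBJECT_LINKS = {
    frozenset({"Policy", "Claim"}),
    frozenset({"Customer", "Policy"}),
    frozenset({"Customer", "Claim"}),
    frozenset({"Customer", "Organization"}),
    frozenset({"Organization", "Policy"}),
    frozenset({"Organization", "Claim"}),
    frozenset({"Incident", "Policy"}),
    frozenset({"Incident", "Claim"}),
}


def related(subject):
    """Return the subjects directly associated with `subject`, sorted by name."""
    return sorted(
        other
        for link in SUBJECT_LINKS
        if subject in link
        for other in link - {subject}
    )


def is_direct(a, b):
    """True only when the model explicitly records an association between a and b."""
    return frozenset({a, b}) in SUBJECT_LINKS


print(related("Customer"))
print(is_direct("Incident", "Customer"))
```

Here the Customer-to-Claim association is recorded because the example states it, not because it follows from Customer-to-Policy and Policy-to-Claim. An inferred link such as Incident-to-Customer stays absent until the business confirms it.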
-
Modeling Techniques: Fact/Qualifier Modeling
[Figure: fact/qualifier matrix; the cell markings are not recoverable]
• Facts: customer count, % of total market, customer-id, customer-name, household count, lost customer-id, lost-policy-id, lost-policy-value, claim count, claim settlement lag time, claim filing lag time, home incident count
• Qualifiers: region, zone, employee, customer, line of business, product, policy, cause of claim, year, month, demographics, policy features, coverage group, size of claim, customer value, customer share
-
DESCRIPTION
A fact/qualifier matrix represents two sets of data items: (1) facts that business people need to know, and (2) qualifiers used to manipulate and organize the facts for analysis. Associations in the matrix show which qualifiers apply to which facts. This model is used at the conceptual level to analyze the business questions that warehousing is intended to answer. Understanding the data implications of the information needs and their associated business questions is essential to capturing the right data and building the right warehouse/mart data structures.
COMPONENTS
The components of the matrix are:
• Facts – Discrete items of business information that (partially) satisfy the information needs of the business. These are typed as descriptive or metric.
• Qualifiers – Criteria by which the facts are accessed, sorted, grouped, aggregated, filtered, and presented to warehouse users.
• The fact/qualifier association – An entry at an intersecting cell indicating that the qualifier may be used to control how the fact is used in analysis. Association entries may record data about the association (e.g., a reference to the business questions from which the association is derived).
THE MODELING PROCESS
This matrix combines two lists derived from the information needs and their related business questions. The list of facts, sometimes called the “know list,” answers the question “What do you need to know?” The list of qualifiers, also called the “by list,” answers the question “What do you want to know it by?” Modeling is a simple process:
• Identify and name facts – label rows (from the know list).
• Identify and name qualifiers – label columns (from the by list).
• Associate facts with qualifiers – mark intersecting rows and columns where a fact is associated with a qualifier.
NOTE: The placement of facts as row labels and qualifiers as column labels is arbitrary. This model works equally well when row and column designations are reversed.
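The three modeling steps can be sketched in a few lines of code. The fact and qualifier names below come from the example matrix, but the marked cells are illustrative assumptions, since real associations are derived from business questions. The function names `build_matrix`, `qualifiers_for`, and `transpose` are hypothetical.

```python
# Step 1 and 2: the "know list" (facts) and the "by list" (qualifiers).
know_list = ["customer count", "claim count", "claim settlement lag time"]
by_list = ["region", "product", "month", "cause of claim"]

# Step 3: marked cells. These pairs are assumptions made for illustration only.
associations = {
    ("customer count", "region"),
    ("customer count", "month"),
    ("claim count", "region"),
    ("claim count", "product"),
    ("claim count", "cause of claim"),
    ("claim settlement lag time", "month"),
}


def build_matrix(facts, qualifiers, marked):
    """Rows are facts, columns are qualifiers, cells are True where marked."""
    return {f: {q: (f, q) in marked for q in qualifiers} for f in facts}


def qualifiers_for(matrix, fact):
    """Qualifiers that may be used to analyze the given fact, in column order."""
    return [q for q, on in matrix[fact].items() if on]


def transpose(matrix):
    """Swap rows and columns; the associations themselves do not change."""
    flipped = {}
    for fact, row in matrix.items():
        for qualifier, on in row.items():
            flipped.setdefault(qualifier, {})[fact] = on
    return flipped


matrix = build_matrix(know_list, by_list, associations)
print(qualifiers_for(matrix, "claim count"))
```

The `transpose` helper reflects the note above: choosing facts as rows is a presentation decision, and the same cells are marked either way.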
-
Modeling Techniques: State Transition Modeling
[Figure: state transition model for a claim]
• States: Received Claim, Unassigned Claim, Assigned Claim, Investigated Claim, Settled Claim, Paid Claim, Rejected Claim, END
• Actions: receive claim, assign adjuster, defer assignment, find cause to reject, complete investigation, determine settlement, make final payment
-
DESCRIPTION
State transition modeling is used to build entity life cycle models. This model is a tool to examine a single, state-dependent entity with respect to the states in which it may exist and the actions that cause it to change from one state to the next. A state transition model provides specific detail about a single entity.
State transition modeling is used at the context level to help identify information needs, and at the structural level to help determine the time dependencies of the targets. Build the model by identifying the initial state at which the entity becomes of interest to the business. Then follow the possible paths of successor states in an iterative fashion.
THE MODELING PROCESS
An entity life cycle model addresses one, and only one, entity. The modeling process begins with the identification and selection of an entity that is state-dependent and that needs further analysis. The following sequence of activities is used to model the entity’s life cycle:
• Select the state-dependent entity that is the focus of the model.
• Identify the states in which an entity occurrence may exist, and the actions that cause changes of state.
• Identify the business rules that describe pre-conditions and post-conditions for a change of state.
• Test completeness and correctness of the model.
READING THE MODEL
The example on the facing page illustrates these (and other) state-based business rules:
• A Received Claim is checked for completeness; if incomplete, it is Rejected and all processing on it stops.
• A claim is not Investigated until it is Assigned.
• A claim is not Paid until it has been Investigated and Settled.
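A life cycle model of this kind maps naturally onto a transition table. The sketch below uses the state and action names from the example, but the pairing of each action with its source and destination state is an assumption based on the business rules listed above. The names `TRANSITIONS`, `fire`, `run`, and `reachable` are hypothetical.

```python
# (current state, action) -> next state.
# The pairings are assumed from the stated business rules, not read from the diagram.
TRANSITIONS = {
    ("Received", "assign adjuster"): "Assigned",
    ("Received", "defer assignment"): "Unassigned",
    ("Received", "find cause to reject"): "Rejected",
    ("Unassigned", "assign adjuster"): "Assigned",
    ("Assigned", "complete investigation"): "Investigated",
    ("Investigated", "determine settlement"): "Settled",
    ("Settled", "make final payment"): "Paid",
}


def fire(state, action):
    """Apply one action; reject any action the model does not allow in this state."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"'{action}' is not allowed in state '{state}'") from None


def run(actions, state="Received"):
    """Apply a sequence of actions starting from the initial state."""
    for action in actions:
        state = fire(state, action)
    return state


def reachable(start="Received"):
    """All states that can be reached from `start`; a simple completeness check."""
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        for (src, _action), dst in TRANSITIONS.items():
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


print(run(["defer assignment", "assign adjuster", "complete investigation",
           "determine settlement", "make final payment"]))
```

Calling `fire("Received", "complete investigation")` raises `ValueError`, which is the rule "a claim is not Investigated until it is Assigned" expressed as a pre-condition. The `reachable` check is one modest way to test completeness: a state that never appears in its result cannot occur under the model as written.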
-
Modeling Techniques: Entity Relationship Modeling Process
[Figure: entity relationship model]
• Entities and attributes: PARTY (PTY-ID-NUMBER); PARTY ADDRESS (PTY-ID-NUMBER, ADDRESS-USAGE-CODE); POLICY (POLICY-NUMBER); CLAIM (CLAIM-NUMBER); CLAIM ACTION (CLAIM-NUMBER, ACTION-TYPE-CODE, ACTION-BEGIN-DATE); INCIDENT (INCIDENT-DATE, INCIDENT-LOCATION, INCIDENT-TYPE-CODE)
• Sub-types: PARTY is any of CLAIMANT, POLICYHOLDER, INTERESTED PARTY
• Relationships: party participates in; policy protects; claimant files; party uses; incident causes; action taken on; claim filed against
-
THE MODELING PROCESS
The diagram is the pictorial part of the ERM. It illustrates all of the model components and their associations. Every component has a unique name as a way to reference the component and link it to the descriptive part of the model. The following steps in the ERM process produce both diagrammatic and descriptive model components:
• Identify, name, and describe entities.
• Associate entities / identify, name, and describe relationships.
• Assign cardinality to the relationships.
• Identify, name, and describe attributes.
• Associate the attributes with entities.
The process is iterative and not necessarily sequential. The results from one iteration may help discovery of new items in the next.
MODELING HEURISTICS
To find entities, focus on nouns of business interest that are found in business processes, forms and other business documentation, and discussions with business people. To find relationships, focus on phrases that join or associate entities. Two questions help to discover attributes: “How do we uniquely identify an occurrence of the entity?” and “What facts do we need to know about this entity?”
READING THE MODEL
The essence of the model is expressed as two simple sentences for each relationship, including cardinality, along with the associated entities. For example, from the model on the facing page:
• One Claim Action is taken on one and only one Claim.
• One Claim has one or more Claim Action(s).
Minimum model validation requires that each of these statements be affirmed by the business as correct business rules.
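The two-sentence reading rule is mechanical enough to sketch in code: each direction of a relationship carries an entity, a verb phrase, a cardinality, and a target entity. The `Relationship` class and its field names are hypothetical; only the two sentences come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """One direction of a relationship, read from `subject` toward `obj`."""
    subject: str
    verb: str
    cardinality: str  # e.g. "one and only one", "one or more"
    obj: str

    def sentence(self):
        # Produces the validation sentence to put in front of the business.
        return f"One {self.subject} {self.verb} {self.cardinality} {self.obj}."


# Both directions of the Claim Action / Claim relationship from the example.
claim_action_to_claim = Relationship("Claim Action", "is taken on", "one and only one", "Claim")
claim_to_claim_action = Relationship("Claim", "has", "one or more", "Claim Action(s)")

for rel in (claim_action_to_claim, claim_to_claim_action):
    print(rel.sentence())
```

Generating the sentences from the model, rather than writing them by hand, keeps the statements reviewed by the business in step with the diagram as it changes.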
COMMON E-R LANGUAGE
Some of the common terms associated with E-R modeling are:
• Super-type and sub-type
• Inheritance
• Recursive relationship
• Many-to-many relationship
• Attributed relationship
• Conditional (or optional) relationship
• Identifying attribute
• Descriptive attribute
• Metric attribute
-
Modeling Techniques: Dimensional Modeling Process
[Figure: dimensional model for the Market Share meter]
• Measures: % of total market, Number of Potential Policies, Number of Active Policies
• Time dimension levels: Year, Quarter, Month
• Policy dimension levels: LOB (LOB-Description: Residential, Automobile, Life), Product Line (Product-Line-Description: Homeowners, Renter’s, Personal Auto, Personal Life), Product (Product-Description: Primary Residence, Rental Residence, Basic Auto, Whole Life, Term Life)
• Organization dimension levels: Region (Region-Description: Northwest, Southwest), District (District-Description: California, Colorado, Washington), Zone (Zone-Description: Adams, Denver, King, Spokane)
• Other attributes shown: Maximum-Coverage-Amount (Policy levels), Region-Manager, District-Manager, Zone-Manager
-
THE MODELING PROCESS
This is a form of E-R modeling where the actual structure of the model is pre-determined. Each model represents one and only one business meter. Combining business meters for optimization is performed at the structural and physical levels of modeling. The modeling activities are:
• Identify and name the meter
• Identify and name the measures (association with meter is implicit)
• Identify and name the dimensions and dimension levels
• Associate dimension levels within dimensions as hierarchies
• Associate dimensions with meter
• Identify dimension values
These activities are performed repeatedly in an iterative and non-linear process.
MODELING HEURISTICS
Metric facts (measures) in the fact/qualifier matrix are indicative of one or more dimensional models. Related measures with common qualifiers indicate a meter. The qualifiers help determine the dimensions and dimension levels. Align the dimensions and levels with the structure of the business.
COMMON DIMENSIONS
Some common kinds of dimensions in any business are time, geography, product, customer, and organization. Actual names of the dimensions should be unique and use the business language of the organization.
WHAT ABOUT STAR SCHEMA?
Star and snowflake schemas are physical implementations of a dimensional data model. These are not analysis models. They are not appropriate for the logical level. The DDM has been designed to serve these needs.
READING THE MODEL
Some examples of business metrics supported by the model on the facing page include:
• The percent of the active auto policies in Washington and Colorado
• The number of active whole life policies in the Northwest region
• The potential market for term life policies
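The structure of a dimensional model (one meter, its measures, and dimensions whose levels form hierarchies) can be sketched as plain data. The `Dimension` and `Meter` classes and the `rolls_up_to` helper are hypothetical, and the ordering of levels from highest to lowest is this sketch's reading of the example figure.

```python
from dataclasses import dataclass, field


@dataclass
class Dimension:
    name: str
    levels: list  # ordered from the highest level of the hierarchy to the lowest


@dataclass
class Meter:
    name: str
    measures: list
    dimensions: list = field(default_factory=list)

    def grain(self):
        """Lowest level of each dimension: the finest detail the meter supports."""
        return {d.name: d.levels[-1] for d in self.dimensions}


market_share = Meter(
    name="Market Share",
    measures=[
        "% of total market",
        "Number of Potential Policies",
        "Number of Active Policies",
    ],
    dimensions=[
        Dimension("Time", ["Year", "Quarter", "Month"]),
        Dimension("Policy", ["LOB", "Product Line", "Product"]),
        Dimension("Organization", ["Region", "District", "Zone"]),
    ],
)


def rolls_up_to(meter, dimension_name, level):
    """Levels above `level` in the named dimension's hierarchy, nearest first."""
    dim = next(d for d in meter.dimensions if d.name == dimension_name)
    idx = dim.levels.index(level)
    return dim.levels[:idx][::-1]


print(market_share.grain())
print(rolls_up_to(market_share, "Organization", "Zone"))
```

A question such as "the number of active whole life policies in the Northwest region" then reads as one measure (Number of Active Policies) constrained by a value at the Product level and a value at the Region level.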
-
Module 2 Requirements Analysis Models
Topic Page
Target Modeling Overview 2-2
Conceptual Modeling Overview 2-8
Business Questions 2-10
Subject Modeling 2-18
Fact/Qualifier Analysis 2-26
Target Configuration 2-56
-
Target Modeling Overview: Modeling Objectives
[Figure: deliverables flow across the modeling levels]
• Contextual (scope): business drivers, business goals, information needs. The question “source data or target data?” separates operational & external data (to source modeling) from warehousing data.
• Conceptual (analyze): warehousing subject model, business questions, fact/qualifier matrix, warehouse targets configuration. The question “what kinds of targets?” separates staging data, data warehouse, metric data marts, and non-metric data marts.
• Logical (design): staging logical model (ERM), warehouse logical model (ERM), data mart logical models (ERM and DDM)
• Structural (specify): staging structural model (ERM), warehouse structural model (ERM), data mart structural models (ERM and DDM)
• Physical (optimize): staging, warehouse, and data marts physical database design
• Functional (implement): staging, warehouse, and mart DBMS detailed specification (DDL) & implemented tables
-
TARGET MODELING OBJECTIVES
The target modeling process, as illustrated by the deliverables flow on the facing page, is designed to produce data models for each target data store that is part of the warehousing environment:
• Staging Area
• Data Warehouse
• Data Marts – both relational and dimensional
Data models are produced at each level of model abstraction leading to functional implementation:
• Conceptual models look at requirements – What needs to be built to respond to information needs and business questions?
• Logical models represent the design view of each target – What are the “parts” of the solution?
• Structural models represent the specification views of each target – What must each “part” do specific to data warehousing? How do the “parts” fit into the warehousing architecture?
• Physical models represent the specification optimized for the implementation environment – What are the platform-specific details?
-
Target Modeling Overview: Modeling Context
[Figure: source and target modeling deliverables by modeling level, linked by Synergy and Triage]
• Contextual Models: Business Goals & Drivers; Information Needs
• Conceptual Models: Source Composition, Source Subjects (source); Warehousing Subjects, Business Questions, Facts & Qualifiers, Target Configuration (target)
• Logical Models: Integrated Source Data Model (ERM) (source); Staging, Warehouse, & Mart ER Models, Data Mart DDMs (target)
• Structural Models: Source Data Structure Model (source); Staging Area Structure, Warehouse Structure, Relational Mart Structures, Dimensional Mart Structures (target)
• Physical Models: Source Data File Descriptions (source); Staging Physical Design, Warehouse Physical Design, Data Mart Physical Designs (relational & dimensional) (target)
• Implemented Data: Source Data Files (source); Implemented Warehousing Databases (target)
-
TARGET MODELING CONTEXT
The context of target data modeling is established by the business drivers, the business goals, and the information needs that provide the foundation of the warehousing program. These are exactly the same context-setting deliverables as for source data analysis. (This is good news, because source and target models built with different contexts would cause real problems.)
SOURCE DATA & TARGET MODELING
While source and target data analysis activities follow distinct and separate paths, they are related and are typically performed as parallel activities. Source modeling is related to target modeling in the following ways:
• Shared Context – At the contextual level, source and target modeling deliverables are identical.
• Synergetic Concept – At the conceptual level, expect to experience a high level of synergy between modeling activities. Understanding of source subjects helps to identify target subjects, and identification of warehousing subjects helps to understand source data. Knowledge of source data may also be useful to develop a robust set of business questions.
• Complementary Logic – At the logical level, modeling processes are complementary. Triage provides a modeling process association between source data analysis and modeling of staging data to ensure complete attribution and an adaptable data design.
-
Business Questions: Discovery Techniques
[Figure: the contextual, conceptual, and logical levels of the deliverables flow, with business questions annotated with the discovery techniques: stakeholder driven, goal oriented, business process oriented, business measures based, data source analysis, current reporting analysis, surrogate system analysis, subject analysis]
-
FINDING THE BUSINESS QUESTIONS
Using a single information need as the focal point, analysis and brainstorming based on any of the following methods may be effective to achieve a robust list of business questions. Repeat the process for each of the information needs of interest.
• Stakeholder Driven – Work from the list of stakeholders identified in the program charter. Have each stakeholder express their individual interest in the information need, and the specific business questions that they would like to have answered.
• Goal Oriented – Ask individual stakeholders (1) to examine the information need in the context of business goals, (2) to describe how they can personally contribute to meeting the goals, and (3) to discuss the kinds of information that would help them to do so.
• Process Oriented – Explore business processes that are related to or affected by the information need. Seek specific questions about business process components (customers, products, inputs, suppliers, events, activities, and actors).
• Measures Based – Examine the information need to identify a set of meaningful business measures. Express each of the measures as a set of business questions. Consider measures based on finance, people and organizations, processes, markets, and customers.
• Source Data Analysis – Examine data sources to identify questions that the sources are able to answer. Extend the brainstorming to discuss those questions not being answered. Pay particular attention to questions that demand historical data to be answered.
• Current Reports Analysis – As with data sources, examine existing reports to identify the questions that are and are not being answered. Again, consider the questions that need historical data.
• Surrogate System Analysis – Examine the systems, manual and otherwise, that stakeholders use to get information not readily available from core business systems. These include individually maintained spreadsheets and databases.
• Subject Analysis – When developing the warehouse subject model in parallel with identification of business questions, the subject model is a useful foundation to explore business questions. Seek information about each subject that is responsive to the information need, and express it as a set of business questions.
-
Target Configuration: Three Roles of Warehousing Data Stores
[Figure: Source Data flows through Data Staging Processes into Persistent Staging Data (Data Intake), through Warehouse Population Processes into the Data Warehouse (Data Distribution), and through Data Mart Population Processes into three Data Marts (Information Delivery). Integration and Access are marked alongside the tiers.]
-
ROLES OF WAREHOUSING DATA
Every data warehousing environment has three distinct roles that need to be filled by data stores:
• Data Intake – Receiving data into the warehousing environment from various data sources. The intake role includes all activities necessary to isolate data from the source environment.
• Data Distribution – Structuring and storing data to serve as a single source of information organized by subjects of business interest.
• Information Delivery – Structuring and storing data in forms that are well aligned with business information needs; facilitating fast, easy access to business information. The delivery role encompasses all necessary activities to provide “information friendly” data and ready access to that data by business people.
THREE TIERS OF WAREHOUSING DATA STORES
Each of the roles described above aligns well with a single tier, and the related data stores, in a three-tier approach to data warehousing:
• Data Staging provides the facility for data intake. A staging area is any data store that is primarily designed for the purpose of receiving data into a warehousing environment. A good data staging strategy includes a staging area that is persistent, atomic, subject-oriented, integrated, adaptable, and extensible. A persistent data staging area also serves as a historical record of the business and an essential data archiving component. A single data staging area is common in three-tier warehousing; however, multiple staging areas are possible.
• The Data Warehouse provides the means of data integration. A good data warehouse, as described by Bill Inmon, is subject-oriented, integrated, non-volatile, and time-variant. In a three-tier approach, the data warehouse is optimized for distribution – its primary role is to serve as an integrated source from which data marts are populated. A single data warehouse is typical of three-tier warehousing.
• Data Marts are designed to meet the needs for information delivery. A data mart is optimized for access, and is designed to facilitate end-user analysis of data. Each data mart supports a single analytic application used by a distinct set of workers. This implies many data marts, each designed to meet a specific set of information needs.
Note that both staging and warehouse data stores serve integration needs, and that both warehouse and mart data stores serve access needs.
-
Target Configuration: How Many Tiers?
[Figure: three alternative configurations]
• Two Tiers, Dependent Data Marts – Tier 1 (Data Intake and Distribution): Source Data passes through Data Intake & Integration Processes into Warehouse/Staging Data. Tier 2 (Information Delivery): Data Mart Population Processes load three Data Marts.
• Two Tiers without Data Marts – Tier 1 (Data Intake): Source Data passes through Data Intake Processes into Data Staging. Tier 2 (Data Distribution & Information Delivery): Data Warehouse Population Processes load the Data Warehouse.
• One Tier Data Warehouse – Tier 1 (Data Intake, Data Distribution, & Information Delivery): Source Data passes through Data Intake Processes into the Data Warehouse.
-
ONE AND TWO TIER WAREHOUSING
Three tiers of warehousing data stores, while sometimes desirable, are not essential to successful data warehousing. Many successful warehousing environments have been implemented with two physical tiers satisfying the three roles; this is a common approach.
In some environments, the volume of data and the complexity of processing are not sufficient to require three tiers. In these instances the cost of implementing, operating, and maintaining additional data stores is not justified by the few optimization gains that might be realized. In other environments, constraints (development time, processing time, computer resources, people and organizations, etc.) make a three-tier approach impractical.
When three tiers of data stores are desired but cannot be achieved, it is important to realize that all three roles – data intake, data distribution, and information delivery – must still be supported. Thus, a single type of data store assumes multiple roles. When one data store serves more than one role, the optimization issues become more complex.
The diagrams on the facing page illustrate some of the alternatives for two-tier and single-tier warehousing. The best configuration for any warehousing environment achieves a balance between the constraints of the environment and the relative importance of each of the three roles of data stores.
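The rule that every configuration must still cover all three roles can be checked mechanically. The sketch below records which roles each data store fills in the three-tier approach and in the alternatives described above; the names `ROLES`, `CONFIGURATIONS`, `uncovered_roles`, and `multi_role_stores` are hypothetical.

```python
ROLES = {"data intake", "data distribution", "information delivery"}

# Each configuration maps a data store to the roles it fills.
CONFIGURATIONS = {
    "three tiers": {
        "data staging": {"data intake"},
        "data warehouse": {"data distribution"},
        "data marts": {"information delivery"},
    },
    "two tiers, dependent data marts": {
        "warehouse/staging data": {"data intake", "data distribution"},
        "data marts": {"information delivery"},
    },
    "two tiers without data marts": {
        "data staging": {"data intake"},
        "data warehouse": {"data distribution", "information delivery"},
    },
    "one tier data warehouse": {
        "data warehouse": set(ROLES),
    },
}


def uncovered_roles(config):
    """Roles that no data store in the configuration fills."""
    covered = set().union(*config.values())
    return ROLES - covered


def multi_role_stores(config):
    """Stores filling more than one role, where optimization trade-offs arise."""
    return sorted(store for store, roles in config.items() if len(roles) > 1)


for name, config in CONFIGURATIONS.items():
    print(name, sorted(uncovered_roles(config)), multi_role_stores(config))
```

A configuration that leaves `uncovered_roles` non-empty is incomplete. One that lists a store in `multi_role_stores` is valid but flags where a single structure must be balanced between competing optimization goals.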
-
Module 3 Design and Specification Models
Unit A: Design and Specification Modeling Concepts
Unit B: Designing Data Marts
Unit C: Designing Data Warehouses
Unit D: Designing Data Staging Areas
-
Module 3 – Unit A Design and Specification Modeling Concepts
Topic Page
Normalization 3A-2
State Transition Modeling 3A-6
Triage 3A-10
Structural Modeling Issues 3A-14
Structural and Physical Optimization Issues 3A-42
Optimization Techniques 3A-46
-
Normalization: An Example
[Figure: the Automobile Policy Premium File at each normal form; key attributes are shown in upper case]
0NF
• AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date, premium-discounts (discount-code, discount-schedule-amt, premium-cost-before-discounts, premium-cost-after-discounts, discount-amount)
1NF
• AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date
• PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-schedule-amt, premium-cost-before-discounts, premium-cost-after-discounts, discount-amount
2NF
• AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, premium-cost-before-discounts, premium-cost-after-discounts, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date
• PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-amount
• DISCOUNT SCHEDULE: DISCOUNT-CODE, discount-schedule-amt
3NF
• AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost (derivable; removed), premium-cost-before-discounts, premium-cost-after-discounts, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date (derivable; removed)
• PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-amount
• DISCOUNT SCHEDULE: DISCOUNT-CODE, discount-schedule-amt
-
THE EXAMPLE
The example on the facing page illustrates normalization through first, second, and third normal forms using the Automobile Policy Premium File. The normalization steps produced these results:
• First normal form separated the premium-discounts group of attributes into a related entity because it is a repeating group.
• Second normal form removed premium-cost-before-discounts and premium-cost-after-discounts from the premium-discount entity and placed them in auto-policy-premium. Both attributes are facts about the policy and depend only on the policy-number key.
• Second normal form separated discount-schedule as an entity related to premium-discount because discount-schedule-amt depends only on discount-code, not on policy-number.
• Third normal form deleted total-policy-cost because it can be derived as the sum of premium-cost-after-discounts and deferred-payment-service-cost.
• Third normal form deleted last-payment-due-date because it can be derived from number-of-payments, payment-frequency, and first-payment-due-date.
This example contains some assumptions about the business rules governing this data. Normalization cannot be performed without a clear understanding of the business rules.
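The normalization steps described above can be sketched in Python. This is a minimal illustration, not part of the course material: the record layout, the sample values, and the function names normalize and total_policy_cost are all assumptions, and the payment-schedule attributes are left out for brevity.

```python
from decimal import Decimal

# Hypothetical unnormalized (0NF) record; every value is invented for illustration.
# Payment-schedule attributes are omitted to keep the sketch short.
policy_0nf = {
    "policy_number": "P-100",
    "deferred_payment_service_cost": Decimal("12.00"),
    "total_policy_cost": Decimal("462.00"),
    "premium_discounts": [  # repeating group
        {"discount_code": "MULTI",
         "discount_schedule_amt": Decimal("30.00"),
         "premium_cost_before_discounts": Decimal("500.00"),
         "premium_cost_after_discounts": Decimal("450.00"),
         "discount_amount": Decimal("30.00")},
        {"discount_code": "SAFE",
         "discount_schedule_amt": Decimal("20.00"),
         "premium_cost_before_discounts": Decimal("500.00"),
         "premium_cost_after_discounts": Decimal("450.00"),
         "discount_amount": Decimal("20.00")},
    ],
}


def normalize(record):
    """Split one 0NF record into the three entities of the normalized model."""
    group = record["premium_discounts"]  # assumed non-empty in this sketch
    policy_number = record["policy_number"]
    # 1NF: the repeating group becomes PREMIUM DISCOUNT rows keyed by
    # (policy-number, discount-code).
    premium_discount = [
        {"policy_number": policy_number,
         "discount_code": d["discount_code"],
         "discount_amount": d["discount_amount"]}
        for d in group
    ]
    # 2NF: discount-schedule-amt depends on discount-code alone.
    discount_schedule = {d["discount_code"]: d["discount_schedule_amt"] for d in group}
    # 2NF: the premium costs depend on policy-number alone, so they move to the policy.
    # 3NF: total-policy-cost is not stored because it can be derived.
    auto_policy_premium = {
        "policy_number": policy_number,
        "deferred_payment_service_cost": record["deferred_payment_service_cost"],
        "premium_cost_before_discounts": group[0]["premium_cost_before_discounts"],
        "premium_cost_after_discounts": group[0]["premium_cost_after_discounts"],
    }
    return auto_policy_premium, premium_discount, discount_schedule


def total_policy_cost(auto_policy_premium):
    """Derive the value that third normal form removed from storage."""
    return (auto_policy_premium["premium_cost_after_discounts"]
            + auto_policy_premium["deferred_payment_service_cost"])


policy, discounts, schedule = normalize(policy_0nf)
```

The derived total is computed on demand, so it cannot drift out of step with the two attributes it depends on.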
-
Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3A-18 The Data Warehousing Institute
Structural Modeling Issues Time Modeling Examples
Snapshots
NOV 2000 CUSTOMER: customer nbr, customer first name, customer last name, customer gender, load date time stamp
OCT 2000 CUSTOMER: customer nbr, customer first name, customer last name, customer gender, load date time stamp
or...
CUSTOMER: customer nbr, customer effective begin date, customer first name, customer last name, customer gender, load date time stamp
Audit Trail
CUSTOMER: customer nbr, customer effective begin date, customer first name, customer last name, customer gender, load date time stamp
States
POLICY: policy number, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, premium-amount, service-amount
one of: Pending Policy, Active Policy, Expired Policy, Suspended Policy, Terminated Policy
Date Stamps
CLAIM ACTION: claim number, effective-begin-date, claim-action-event-date, load-date-time-stamp
Versions
SIZE OF CUSTOMER BASE: customer-count, household-count
Product
LOB: lob-code, lob-name
PRODUCT LINE: line-code, line-description
PRODUCT: product-id, product-desc, product-name
Old Product
LOB: lob-code, lob-name
POLICY TYPE: type-code, policy-type-desc
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts
The Data Warehousing Institute 3A-19
Structural Modeling Issues Time Modeling Examples
SNAPSHOTS
The example illustrates a snapshot of customer. Two possible design techniques for recording data using the snapshot approach are shown. Customer-effective-begin-date is a row level metadata element used in one of the examples.
VERSIONS
The example illustrates a warehousing environment that retains two views of a product hierarchy: an old product view that organized products as policy-type within line-of-business, and a current product view with products grouped by product-line within line-of-business.
AUDIT TRAIL
This example illustrates an audit trail of changes to customer. The audit trail technique records the date/time the change was effective or known to the business.
STATES
In the example, a record of a policy is retained for each of several states throughout the policy life cycle. Possible states (thus, possible policy records) include pending, active, expired, suspended, and terminated. Two possible design techniques are shown.
DATE STAMPS
Notice the dates for claim action. The date stamp example illustrates three dates as follows:
• effective-begin-date – When did the action become effective in the business?
• claim-action-event-date – When was the action recorded in the claims processing system?
• load-date-time-stamp – When was the action recorded in the warehousing data store?
There are situations in which significant timing differences exist between business effective dates and source event dates. In these situations, both effective dates and event dates would be reflected in the model.
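One way to read an audit trail is to ask which row was in effect on a given date. The Python sketch below shows that lookup under stated assumptions: rows are tuples of customer nbr, customer effective begin date, and customer last name, the sample values are invented, and the function name as_of is hypothetical.

```python
from datetime import date

# Hypothetical audit-trail rows for one customer:
# (customer nbr, customer effective begin date, customer last name)
audit_trail = [
    (1001, date(2000, 1, 15), "Smith"),
    (1001, date(2000, 10, 3), "Jones"),
]


def as_of(rows, customer_nbr, when):
    """Return the row in effect for customer_nbr on the date `when`, or None."""
    candidates = [r for r in rows if r[0] == customer_nbr and r[1] <= when]
    # The row in effect is the one with the latest effective begin date
    # that is not after the requested date.
    return max(candidates, key=lambda r: r[1]) if candidates else None
```

For example, as_of(audit_trail, 1001, date(2000, 6, 1)) selects the "Smith" row, because the change to "Jones" did not become effective until October.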
ACQUISITION METHODS
The focus of this section is on the design techniques for handling time issues in warehousing data stores. The most common techniques are illustrated in more detail on subsequent pages. While these techniques are all possibilities for the modeler, no design can be created in a vacuum. The acquisition methods used have significant influence on the structural data model. For more details on acquisition techniques consider TDWI’s data acquisition course offerings.
-
Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3A-34 The Data Warehousing Institute
Structural Modeling Issues Location Modeling Examples
Partitioned to Distribute
CUSTOMER ANNUAL SUMMARY - SOUTHWEST: customer-id-number, customer-summary-year, customer-value-start-of-year, customer-value-end-of-year, total-policy-count-start-of-year, total-policy-count-end-of-year, auto-policy-count-start-of-year, auto-policy-count-end-of-year, residence-policy-ct-start-of-yr, residence-policy-ct-end-of-yr, total-yrs-as-a-customer
CUSTOMER ANNUAL SUMMARY - NORTHWEST: same attributes as CUSTOMER ANNUAL SUMMARY - SOUTHWEST
Organization & Location Attributes
AUTOMOBILE POLICY: policy number, row start DT stamp, underwriter-employee-id, customer-zip-code, deductible-amt, um-amt, liability-amt, collision-amt, comprehensive-amt, premium-rate-limited-flag, special-rate-limits-flag, load-date-time-stamp
Partitioned to Secure
CUSTOMER: customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, customer-value, customer-share, lost-customer-ind, load date time stamp
CUSTOMER SECURED: customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, pay-by-debit-account-number, pay-by-debit-bank-name, pay-by-debit-authorization-code, customer-credit-rating, customer-last-credit-check-date
Organization & Location Entities
Geographic Area
ZONE: zone-number, zone-name
REGION: rgn-code, rgn-name
DISTRICT: dist-number, dist-name
LOCATION: location-id, location-address
Organization
ZONE: zone-number, zone-name
REGION: rgn-code, rgn-name
DISTRICT: dist-number, dist-name
EMPLOYEE: employee-id, employee-name
Roles
EMPLOYEE: employee-id, employee-name
UNDERWRITER, INSIDE AGENT, ADJUSTER
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts
The Data Warehousing Institute 3A-35
Structural Modeling Issues Location Modeling Examples
PARTITIONING FOR DISTRIBUTION
This example illustrates a case where each region needs to see annual summary data for its own customers, but has no need to see data for other regions. Separate but similar data structures are modeled for each region.
PARTITIONED FOR SECURITY
This example shows a situation where some customer data – banking data and credit ratings – is security sensitive and may be viewed only by some warehouse users. Two distinct data structures are modeled, one for generally accessible data and one for sensitive data. Note that some attributes are included in both data structures.
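The split into a general structure and a secured structure can be sketched as follows. The attribute names follow the CUSTOMER and CUSTOMER SECURED entities in the example; the function name partition_for_security, the combined input row, and its values are assumptions made for illustration.

```python
# Attributes that appear in both structures, per the example.
SHARED = ("customer_id_number", "customer_record_begin_date",
          "customer_record_end_date", "customer_last_name",
          "customer_first_name", "customer_middle_name")
# Security-sensitive attributes that belong only in CUSTOMER SECURED.
SENSITIVE = ("pay_by_debit_account_number", "pay_by_debit_bank_name",
             "pay_by_debit_authorization_code", "customer_credit_rating",
             "customer_last_credit_check_date")


def partition_for_security(row):
    """Return (general_row, secured_row) built from one combined customer row."""
    general = {name: value for name, value in row.items() if name not in SENSITIVE}
    secured = {name: row[name] for name in SHARED + SENSITIVE if name in row}
    return general, secured


# Hypothetical combined row; all values are invented.
combined = {
    "customer_id_number": 1001,
    "customer_last_name": "Smith",
    "customer_first_name": "Pat",
    "gender": "F",
    "customer_credit_rating": "A",
    "pay_by_debit_bank_name": "First Example Bank",
}
general, secured = partition_for_security(combined)
```

Keeping the identifying attributes in both outputs lets the two structures be joined again by users who are authorized to see both.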
ORGANIZATION AND LOCATION ENTITIES
The example illustrates both organization and location data included in a warehousing data model as structures of related entities. Data of these types typically serve both as business data that provides context for other warehousing data and as metadata that helps to meet needs for security and distribution.
ORGANIZATION AND LOCATION ATTRIBUTES
This example illustrates attributes that may be used to identify locations and roles. In this case, underwriter-employee-id describes a role and identifies a person in that role. Customer-zip-code may be used to identify the location of a customer.
ROLES AS DATA
This example illustrates multiple roles in which a warehouse user may act – underwriter, inside agent, and adjuster. Data about people and organizations makes up an important part of warehouse data, describing the various roles and responsibilities that they have in the warehousing processes. Person and organization data may also be present in the warehouse as business data.
-
Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3A-40 The Data Warehousing Institute
Structural Modeling Issues Usage Modeling Examples
Secondary Keys & Access Paths
CUSTOMER: customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, load date time stamp, household-id
Derived Data for Access
CUSTOMER: customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, lost-customer-ind, load date time stamp
RESIDENTIAL POLICY: policy number, policy-record-begin-date, policy-begin-date, coverage-begin-date, coverage-end-date, liability-coverage-amt, zone-id, high-risk-property-code
Summary & Partitioning for Access
CUSTOMER ANNUAL SUMMARY - SOUTHWEST: customer-id-number, customer-summary-year, customer-value-start-of-year, customer-value-end-of-year, total-policy-count-start-of-year, total-policy-count-end-of-year, auto-policy-count-start-of-year, auto-policy-count-end-of-year, residence-policy-ct-start-of-yr, residence-policy-ct-end-of-yr, total-yrs-as-a-customer
CUSTOMER ANNUAL SUMMARY - NORTHWEST: same attributes as CUSTOMER ANNUAL SUMMARY - SOUTHWEST
Key Migration
POLICY: policy number, policy-record-begin-date, policy-record-end-date, line-of-business, policy-begin-date, coverage-begin-date, coverage-end-date, load date time stamp, customer-id-number
CUSTOMER: customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, load date time stamp, household-id
HOUSEHOLD: household id, party number, row start DTstamp, row end DTstamp, load date time stamp
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts
The Data Warehousing Institute 3A-41
Structural Modeling Issues Usage Modeling Examples
KEY MIGRATION
The example illustrates the common resolution of one-to-many relationships, placing the primary key of household into the customer entity, and the primary key of customer into the policy entity.
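Key migration can be pictured with a small Python sketch in which each child record carries the primary key of its parent. The class and field names mirror the example entities, but the reduced attribute lists, the sample rows, and the helper policies_for_household are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Household:
    household_id: int


@dataclass(frozen=True)
class Customer:
    customer_id_number: int
    household_id: int  # key migrated from HOUSEHOLD


@dataclass(frozen=True)
class Policy:
    policy_number: str
    customer_id_number: int  # key migrated from CUSTOMER


def policies_for_household(household_id, customers, policies):
    """Follow the migrated keys from a household down to its policies."""
    customer_ids = {c.customer_id_number for c in customers
                    if c.household_id == household_id}
    return [p for p in policies if p.customer_id_number in customer_ids]


# Hypothetical sample rows.
customers = [Customer(1, 10), Customer(2, 10), Customer(3, 20)]
policies = [Policy("A-1", 1), Policy("R-7", 2), Policy("A-9", 3)]
```

Because the key sits on the "many" side of each relationship, a household with any number of customers needs no change to the HOUSEHOLD structure.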
SECONDARY KEYS AND ACCESS PATHS
This example shows customer-last-name and customer-first-name implemented as a secondary key. This makes customer data searchable by name, and provides access to customer data when a customer’s name is known but the id number is not known.
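The idea of a secondary key on customer name can be sketched as an in-memory index. This is only an analogy for what a DBMS index would provide; the function name build_name_index, the case-insensitive matching, and the sample rows are assumptions.

```python
from collections import defaultdict


def build_name_index(customers):
    """Map (last name, first name) to the customer id numbers that share it."""
    index = defaultdict(list)
    for c in customers:
        key = (c["customer_last_name"].lower(), c["customer_first_name"].lower())
        # A secondary key need not be unique, so each entry holds a list.
        index[key].append(c["customer_id_number"])
    return index


# Hypothetical sample rows.
customers = [
    {"customer_id_number": 1, "customer_last_name": "Smith", "customer_first_name": "Pat"},
    {"customer_id_number": 2, "customer_last_name": "Jones", "customer_first_name": "Lee"},
    {"customer_id_number": 3, "customer_last_name": "Smith", "customer_first_name": "Pat"},
]
name_index = build_name_index(customers)
```

A lookup such as name_index[("smith", "pat")] then returns every matching id number, which is the access path the text describes for the case where the name is known but the id is not.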
DERIVED DATA FOR ACCESS
In this example, lost-customer-indicator and high-risk-property-code are each derived values based on business rules. Implementing each as a secondary key supports searching and retrieval of lost customers and high risk properties.
SUMMARY AND PARTITIONING FOR ACCESS
The example shows customer summary data for two regions. Developing and storing the summary makes it readily accessible as information without requiring the warehouse user to develop complex queries or derive summaries at the time of access. Separating the data by region makes it easy for each region to view its own data without needing to filter out data from other regions.
-
Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3A-42 The Data Warehousing Institute
Structural and Physical Optimization Issues Meeting Needs of People, Platforms and Processes
Data stores: STAGING DATA, DATA WAREHOUSE, DATA MARTS (RELATIONAL and DIMENSIONAL)
STRUCTURAL
time: data currency, time based summary, retention of history. What are the impacts of time on each kind of data store?
location: security, distribution. How do security & distribution needs affect each kind of data store?
usage: access, navigation. How do access & navigation needs affect each kind of data store?
PHYSICAL
implementation: toolset, performance, size, availability, backout & recovery, DBMS. How does each data store need to be optimized for implementation platforms?
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts
The Data Warehousing Institute 3A-43
Structural and Physical Optimization Issues Meeting Needs of People, Platforms and Processes
OPTIMIZATION CHALLENGES
Optimizing warehousing databases is a challenging task, seeking to balance the frequently conflicting needs of people, processes, and platforms. Balancing among requirements for performance, availability, size, recovery, and best use of the DBMS is difficult enough. For warehouses and data marts, access tools add to optimization challenges. Further, the balance must be achieved without severely affecting the structural adjustments that have been made to satisfy time, location, and usage requirements.
OPTIMIZATION FOR DATA STAGING
Remember that the purpose of staging data is to receive data into the warehousing environment. It is typically highly detailed data, of high volume, and the primary means to capture history. The most common optimization needs are to manage the database size and the processing performance of loads and extracts. Restart and recovery are significant staging data considerations. Staging data is kept at or near the third normal form. Derived, aggregate, and summary data structures are avoided, as they negatively affect both database size and process performance. There is no need to optimize staging data for user access, and short lapses of availability are generally not user visible.
OPTIMIZATION FOR DATA WAREHOUSE
The role of warehouse data is distribution – warehouses are usually optimized for distribution of data to data marts. Data may be at a higher grain than for staging data, and span of historical data may be smaller. Any size gains of higher granularity and reduced history may be offset by increased redundancy. Shared derivations, aggregates, and summaries are important parts of data integration and are appropriate warehouse data structures. These factors, combined with the potential for user access to the warehouse, demand a careful balance among warehouse optimization factors. Size, availability, extract and load performance, and query and access performance may all be important for warehouse data. Staging data is the foundation of warehouse recovery strategy.
OPTIMIZATION FOR DATA MARTS
Data in marts is intended for delivery of information. Marts are first optimized for access – biased toward the people factors. Data marts contain subsets of the data in the warehouse, often at higher levels of summary, so size becomes less significant. Performance of query and analysis is much more important than extract and load performance. Availability is essential – data marts exist for user access! Warehouse data is the foundation of data mart recovery.
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts
The Data Warehousing Institute 3B-1
Module 3 – Unit B Designing Data Marts
Topic Page
Modeling Overview 3B-2
Modeling Relational Data Marts 3B-6
Optimizing Relational Data Marts 3B-8
Implementing Relational Data Marts 3B-12
Modeling Dimensional Data Marts 3B-14
Optimizing Dimensional Data Marts 3B-24
Implementing Dimensional Data Marts 3B-30
-
Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3B-6 The Data Warehousing Institute
Modeling Relational Data Marts Logical Modeling – Entities, Relationships, and Attributes
Modeling steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes
Inputs:
data needs: Target Configuration, Business Questions, & F/Q Matrix
modeling context: Information Needs & Warehouse Model
Business questions:
• Show all customers with auto policies but not other LOBs.
• Who are profitable customers and who are costly customers?
• Which customers have more than 1 policy? Which in more than 1 LOB?
• What is the total customer count across LOBs? And total household count?
• When we lose a customer in one LOB do we lose all of their business?
POLICY: policy number, line-of-business, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, last-policy-update-date, premium-amount, service-amount, cost of claims, cost of services
CUSTOMER: customer-id-number, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status
HOUSEHOLD: household id, party number
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts
The Data Warehousing Institute 3B-7
Modeling Relational Data Marts Logical Modeling – Entities, Relationships, and Attributes
IDENTIFY, NAME, & DESCRIBE ENTITIES
When modeling non-metric data marts, an entity is a class of things that provide all or part of the answers to one or more business questions. These entities will be different from those of operational and staging data, and they may differ from the entities in the warehouse. A typical data mart model contains a subset of the warehouse entities, and includes new entities that are products of summarization, aggregation, or derivation.
IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS
Relationships are associations among entities that have relevance to the business questions. In a data mart, each relationship needs to have a role in responding to business questions, and must be implemented by or derivable from warehouse data. When new entities are created by aggregation, derivation, or summarization, any relationships in which they participate must also be derived.
IDENTIFY, NAME, & DESCRIBE ATTRIBUTES
Attributes are the properties of entities that represent business facts needed to answer business questions. Each attribute in the mart data model must have a role in responding to business questions, and must be implemented by or derivable from the warehouse data.
CONFORMED ENTITIES AND ATTRIBUTES
When modeling multiple data marts, some standards across marts with respect to entities and attributes may be helpful. Warehouse users may be confused when the same entity is named differently, or has a different identifier from one mart to the next. Similarly, an attribute with the same meaning but different names, or the same name but different meanings is confusing. Where practical, use of entities and attributes that conform to a standard enhances usability of the individual mart and the entire warehousing environment.
TRIAGE & NORMALIZATION
Triage is not normally applied when modeling data marts. Normalization is not necessary for data marts. They are typically de-normalized in the ways that do the best job of presenting data as information. This does not mean that sound design practices should be abandoned. De-normalizing should be done purposefully. It should not occur by accident or oversight.
-
Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3B-16 The Data Warehousing Institute
Modeling Dimensional Data Marts Logical Modeling - Dimensions
Business Question: What is the total customer count by product across all lines of business? What is the total household count?
Matrix rows: LOB, Product Line, Product
Matrix columns: customer count, customer id, customer name, claim count, household count
Model the Dimensions
Dimensions: Organization, Time, Product
Product: LOB, Product Line, Product
Organization: Region, District, Zone
Time: Year, Quarter, Month
SIZE OF CUSTOMER BASE: customer-count, household-count
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts
The Data Warehousing Institute 3B-17
Modeling Dimensional Data Marts Logical Modeling - Dimensions
IDENTIFY AND NAME DIMENSIONS
Dimensions are the perspectives by which facts may be accessed, selected, sequenced, grouped, filtered for analysis, and presented in a business context. A dimension is typically a multiple-level, hierarchical structure that is the basis of leveled summaries of data and drill-down types of analysis. The fact/qualifier matrix and the business questions help determine the dimensions of interest. Where the meter represents a grouping of facts from the matrix, a dimension represents a grouping of qualifiers. Dimensions may be named either generically (e.g., product, customer, etc.) or with business specific names (e.g., policy, policyholder). Note that fact/qualifier analysis and staging data design contribute significantly to identification and to the dimensional analysis and design activities described below.
DESCRIBE THE DIMENSION
As each dimension is identified it is useful to briefly describe some of its properties. At minimum, consider these questions: How volatile or stable is the dimension? Is it a conformed dimension?
IDENTIFY & NAME DIMENSION LEVELS
Determine the levels of the dimension and name each level with a term that is descriptive and business-oriented. Conformed dimensions (discussed later) already have identified levels.
DEVELOP HIERARCHIES
Structure the dimension as a hierarchy of parent/child relationships. Be careful not to overlook any levels of hierarchy that have business meaning. Realize that a single dimension type may have multiple hierarchies. Conformed dimensions already have a prescribed hierarchy.
IDENTIFY & NAME LEVEL IDENTIFIERS
For each dimension level, determine what attribute(s) is used as its identifier. Dimension levels are entities, and need unique identifiers just as all other entities do.
IDENTIFY & NAME CHARACTERISTICS
Determine what attributes of each dimension level are desirable. Most dimension levels have an attribute that is a description or name. Other characteristics of value may be identified by business people, examining information needs and business questions, and through triage.
IDENTIFY VALUES
Identify the allowable set of values for dimension level identifiers. This helps to fully understand the dimension and is needed when optimizing.
ASSOCIATE DIMENSIONS
Link the lowest level of each dimension to the meter. The dimension to meter association is always one-to-many.
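A dimension hierarchy supports leveled summaries by mapping each lowest-level member to its parents. The Python sketch below counts distinct customers at each level of a product hierarchy. It is an illustration under stated assumptions: the product ids, the holdings rows, and the function customer_count_by are hypothetical, and distinct counting is used because a customer who holds several products would otherwise be counted more than once at higher levels.

```python
from collections import defaultdict

# Hypothetical product hierarchy: product-id -> parent levels.
PRODUCT_HIERARCHY = {
    "PA1": {"product_line": "P", "lob": "A"},
    "FA1": {"product_line": "F", "lob": "A"},
    "LF1": {"product_line": "L", "lob": "L"},
}

# Hypothetical lowest-level rows: (product-id, customer id).
holdings = [("PA1", 1), ("PA1", 2), ("FA1", 2), ("LF1", 3), ("LF1", 1)]


def customer_count_by(level, rows, hierarchy=PRODUCT_HIERARCHY):
    """Count distinct customers per member of the requested hierarchy level.

    level is "product", "product_line", or "lob".
    """
    members = defaultdict(set)
    for product_id, customer_id in rows:
        key = product_id if level == "product" else hierarchy[product_id][level]
        members[key].add(customer_id)
    return {key: len(ids) for key, ids in members.items()}
```

In this sample, customer 2 holds both a personal auto and a fleet auto product, so the auto LOB has two distinct customers even though its products have three holdings between them.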
-
Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3B-22 The Data Warehousing Institute
Modeling Dimensional Data Marts Logical Model Example
Business Question: What is the total customer count by product across all lines of business? What is the total household count?
SIZE OF CUSTOMER BASE: customer-count, household-count
Product
LOB: lob-code, lob-name
PRODUCT LINE: line-code, line-description
PRODUCT: product-id, product-desc, product-name
Time
YEAR: year-number
QUARTER: quarter-number
MONTH: month-number
Geographic Area
ZONE: zone-number, zone-name
REGION: rgn-code, rgn-name
DISTRICT: dist-number, dist-name
Domains
LOB-CODE domain: A = auto insurance, R = residential insurance, L = life insurance
LINE-CODE domain: F = fleet auto, P = personal auto, L = life insurance
PRODUCT-ID domain: F = fleet auto, P = personal auto, L = life insurance
YEAR-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance
QUARTER-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance
MONTH-NUMBER domain: 01 = January, 02 = February, 03 = March, 04 = April
ZONE-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance
DIST-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance
RGN-CODE domain: NW = northwest, SW = southwest
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts
The Data Warehousing Institute 3B-23
Modeling Dimensional Data Marts Logical Model Example
AN EXAMPLE The diagram on the facing page illustrates a logical dimensional model for a data mart. Of note in this model:
• The meter is Size of Customer Base.
• The measures (facts) are customer-count and household-count.
• Measures are sensitive to three dimensions: product, organization, and time.
• Each dimension is a multi-level hierarchy.
• Each dimension level has a known identifier.
• Domain of values for each dimension level identifier is documented.
• Most dimension levels have either a name or a description as characteristics. One dimension level has multiple characteristics.
• Not all dimensions are explicitly referenced in the business question. Further investigation is necessary to determine that the business wants to track customer base across time and be able to compare customer base for different geographic areas.
• Although product line is not explicitly mentioned in the business question, product is a conformed dimension, and all levels of the hierarchy are included in the model.
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Warehouses
The Data Warehousing Institute 3C-1
Module 3 – Unit C Designing Data Warehouses
Topic Page
Modeling Overview 3C-2
Modeling Relational Data Marts 3C-8
Optimizing Relational Data Marts 3C-10
Implementing Relational Data Marts 3C-14
-
Designing Data Warehouses TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3C-2 The Data Warehousing Institute
Modeling Overview Purpose and Deliverables
Source Data
Data Staging Processes
Tier 1 Data Intake: Persistent Staging Data
Warehouse Population Processes
Tier 2 Data Distribution: Data Warehouse
Data Mart Population Process (three shown)
Tier 3 Information Delivery: Data Mart, Data Mart, Data Mart
Labels: Access, Integration
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Warehouses
The Data Warehousing Institute 3C-3
Modeling Overview Purpose and Deliverables
WAREHOUSE PROPERTIES
Remember that the primary role of a data warehouse is distribution of data to data marts. A warehouse is by definition integrated, subject-oriented, non-volatile, and time variant. Ideally, a warehouse contains data that is:
• Cleansed – Data quality and data cleansing rules have been applied.
• Base Data – The warehouse contains the lowest level of data granularity that is needed to answer any business question, which may not be atomic data. It may also contain summary data at a higher level of granularity than the base data.
• Standardized – Standard (conformed) data structures are identified and implemented. Common derivations are identified and applied. Common summaries are identified and implemented.
TYPE OF MODEL
Warehouse data is modeled using entity/relationship techniques. Desired characteristics of integration and subject-orientation are all readily supported by E-R modeling.
NORMALIZATION
At the logical level, a second-normal-form data model is typical for data warehouses. The third-normal-form would eliminate derived data and summary data that are desirable in the data warehouse. Structural and physical modeling may de-normalize to meet optimization needs. The resultant model may include aggregate structures that violate the second-normal form. First-normal-form violations are not typical in data warehouses, but may be introduced in the form of data arrays.
-
Designing Data Warehouses TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3C-6 The Data Warehousing Institute
Modeling Warehouse Data Logical Modeling – Entities, Relationships, and Attributes
Modeling steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes
Inputs:
data needs: Target Configuration, Business Questions & FQ Model
modeling context: Information Needs & Staging Model
LOCATION: location number, street-address, city, state, postal-code, phone-number, zone id
PARTY: pty-id-number, organization-name, person-last-name, person-first-name, person-middle-name, age-group, income-group, gender, marital-status
PARTY HOUSEHOLD: household id, party number, row start DTstamp
PARTY INTEREST: pty-id-number, policy-number, party-role
PARTY/LOCATION: policy number, location number, location usage code
POLICY/LOCATION: policy number, location number, location usage code
POLICY: policy number, policy-type-code, policy-begin-date, coverage-begin-date, coverage-end-date, ar-acct-number, party number
may be one of:
RESIDENTIAL POLICY: policy number, loss-payee-name, loss-payee-address, liability-coverage-amt
AUTOMOBILE POLICY: policy number, leinholder-name, leinholder-contact, premium-rate-limited-flag, wiley-special-rate-flag
VEHICLE: vin, vehicle-record-begin-date, policy number, make, model, antilock-brakes-indicator, airbags-code, load date time stamp, row start DTstamp
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Warehouses
The Data Warehousing Institute 3C-7
Modeling Warehouse Data Logical Modeling – Entities, Relationships, and Attributes
IDENTIFY, NAME, & DESCRIBE ENTITIES
An entity in a data warehouse model is a class of things about which business information is needed. These entities will be different from those of operational systems, and they may differ from those of the staging and mart data models. Warehouse entities seek to integrate, completing the integration not done in data staging, and may recognize collective entities such as party and household. Entity identification when modeling warehouse data uses three distinct streams of input:
• Information needs provide the context for modeling. If a warehouse entity is a class of things about which information is needed, then information needs are central to the modeling activity.
• The configuration of targets, combined with the business questions that each target is intended to answer, guides specific data requirements. The fact/qualifier model for those business questions provides specific data needs. The warehouse must contain the data necessary to populate its dependent marts, and the marts must contain the data needed to answer business questions.
• The staging data models (logical and structural) provide knowledge of data availability. The data that has been received into the warehousing environment, and that is available to populate the warehouse, is identified by these models.
IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS
Relationships are associations among entities that have meaning to the business. For warehouse data, each relationship needs to have a role in providing business information, and must be implemented by or derivable from staging data. Clearly, relationships of new entities (party, household, etc.) must be derived.
IDENTIFY, NAME, & DESCRIBE ATTRIBUTES
Attributes are the properties of entities that represent business facts needed to answer business questions. Each attribute in the warehouse data model must have a role in providing business information, and must be implemented by or derivable from the staging data. The attributes of newly identified entities obviously must be derived from staging data. Also note that metadata in the staging model may become business data in the warehouse. For example, row-start-DT-stamp in the staging model becomes policy-record-begin-date in the warehouse model.
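The point that staging metadata may become warehouse business data can be sketched as an attribute mapping. Only the pairing of row-start-DT-stamp with policy-record-begin-date comes from the text; the mapping table STAGING_TO_WAREHOUSE, the function to_warehouse_row, and the sample values are assumptions.

```python
# Staging attribute -> warehouse attribute. The first pair is from the text;
# the second is a hypothetical pass-through added so the row has a key.
STAGING_TO_WAREHOUSE = {
    "row_start_dt_stamp": "policy_record_begin_date",
    "policy_number": "policy_number",
}


def to_warehouse_row(staging_row, mapping=STAGING_TO_WAREHOUSE):
    """Build a warehouse row containing only the mapped attributes."""
    missing = [name for name in mapping if name not in staging_row]
    if missing:
        # Every warehouse attribute must be derivable from staging data.
        raise KeyError(f"staging row lacks attributes: {missing}")
    return {target: staging_row[source] for source, target in mapping.items()}
```

Raising an error on a missing source attribute reflects the rule that each warehouse attribute must be implemented by or derivable from the staging data.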
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Staging Areas
The Data Warehousing Institute 3D-1
Module 3 – Unit D Designing Data Staging Areas
Topic Page
Modeling Overview 3D-2
Modeling Relational Data Marts 3D-8
Optimizing Relational Data Marts 3D-10
Implementing Relational Data Marts 3D-14
-
Designing Data Staging Areas TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3D-2 The Data Warehousing Institute
Modeling Overview Purpose and Deliverables
Source Data
Data Staging Processes
Tier 1 Data Intake: Persistent Staging Data
Warehouse Population Processes
Tier 2 Data Distribution: Data Warehouse
Data Mart Population Process (three shown)
Tier 3 Information Delivery: Data Mart, Data Mart, Data Mart
Labels: Access, Integration
-
TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Staging Areas
The Data Warehousing Institute 3D-3
Modeling Overview Purpose and Deliverables
STAGING DATA PROPERTIES
Recall the definition of staging data. A staging area is any data store that is designed primarily to receive data into a warehousing environment. A good data staging strategy includes a staging area that is:
• Persistent - Staging data is retained as long as it may have historical value (typically for the life of the enterprise).
• Atomic - Staging data is captured at the finest grain available.
• Subject Oriented - Staging data is organized by business subjects.
• Adaptable - Staging data is designed to accommodate both known and unknown needs for information.
• Extensible - The scope of data expands as new data sources are introduced into the warehousing program.
Ideally, staging data begins to assume the properties that are desirable in warehousing data:
• Subject-Oriented - Organized around business subjects, independent of the applications from which it is extracted.
• Integrated - Combining data from multiple sources into a single business view.
• Time-variant - Providing multiple “point-in-time” views of the data.
A persistent staging area that is time variant provides a sound way to address needs for enterprise history. Staging data also plays an important role in archival strategy.
TYPE OF MODEL
Staging data is modeled using entity/relationship techniques. Desired characteristics of adaptability, extensibility, subject-orientation, and integration are all well supported by E-R modeling.
NORMALIZATION
At the logical level, a third-normal-form data model satisfies staging data model needs. Structural and physical modeling may de-normalize to meet optimization needs. The resultant model is typically near the third-normal-form, needing only limited adjustments to optimize.
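The persistent and time-variant properties listed above can be illustrated with a small sketch. It assumes an append-only staging structure in which each change adds a new row version stamped with a start date. The column names, sample values, and the `as_of` helper are hypothetical and are not taken from the course material.

```python
from datetime import date
from typing import Optional

# Assumed append-only staging rows: a change adds a new version rather than
# overwriting the old one, so history is retained (persistent, time-variant).
staging_policy = [
    {"policy_number": "P-1001", "premium_amount": 500, "row_start_dt": date(2002, 1, 1)},
    {"policy_number": "P-1001", "premium_amount": 550, "row_start_dt": date(2003, 1, 1)},
]


def as_of(rows: list, policy_number: str, when: date) -> Optional[dict]:
    """Return the row version in effect on the given date, or None if none existed yet."""
    candidates = [
        row
        for row in rows
        if row["policy_number"] == policy_number and row["row_start_dt"] <= when
    ]
    # The latest version that started on or before the requested date is in effect.
    return max(candidates, key=lambda row: row["row_start_dt"], default=None)


mid_2002 = as_of(staging_policy, "P-1001", date(2002, 6, 30))
mid_2003 = as_of(staging_policy, "P-1001", date(2003, 6, 30))
```

Because earlier versions are never overwritten, the same structure can answer both the 2002 and the 2003 point-in-time question.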
-
Designing Data Staging Areas TDWI Data Modeling: Data Warehouse Design & Analysis Techniques
3D-6 The Data Warehousing Institute
Modeling Staging Data Logical Modeling – Entities, Relationships, and Attributes
[Figure: logical staging data model and the logical modeling process]
Entities and attributes shown:
POLICY: policy number, policy-type-code, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, last-policy-update-date, premium-amount, service-amount, renewal-plan-code, auto-renew-bill-date, agent-renew-call-date, auto-term-notify-date, ar-acct-number
AUTOMOBILE POLICY: policy number, leinholder-name, leinholder-contact, deductible-amt, um-amt, liability-amt, collision-amt, comprehensive-amt, premium-rate-limited-flag, wiley-special-rate-limits-flag
VEHICLE: vin, policy number, make, model, year, type, usage, antilock-brakes-indicator, airbags-code
PROPERTY: policy number, [property-address, property-city, property-county, property-state], property-type-code, legal-description, family-count, residential-bldg-count, non-residential-bldg-count, emergency-svc-distance
RESIDENTIAL POLICY: policy number, property-coverage-amt, proprty-benefits-basis-code, quake-addl-coverage-amt, flood-addl-coverage-amt, wind-addl-coverage-amt, contents-coverage-amt, jewelry-addl-coverage-amt, furs-addl-coverage-amt, arts-addl-coverage-amt, equipmt-addl-coverage-amt, other-addl-coverage-amt, liability-coverage-amt, total-addl-coverages-amt
Relationship label: maybe one
Modeling inputs and what each provides:
Information Needs & Subject Model: modeling context
FQ Analysis & Business Questions: data needs
Source Data Models: data availability
State Transition Models: business events
Modeling steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes
-
IDENTIFY, NAME, & DESCRIBE ENTITIES
In E-R modeling, an entity is generally defined as a class of things about which data is needed. To model staging data, that definition may be extended: an entity in a staging data model is a class of things about which data is received into the warehousing environment. Staging entities may differ from the classifications by which operational systems manage the data (operational entities), and they may differ from the classifications by which information is delivered to the business (data warehouse entities).
Entity identification when modeling staging data uses four streams of input:
• The warehousing subject model (and the corresponding information needs) provides the context for modeling. Warehousing subjects are abstractions of entity groups. Each staging entity needs to belong to one of the subjects.
• Business questions (and supporting fact/qualifier analysis) provide the specifics of data requirements. The data needed is that which is necessary to answer the business questions.
• The source data models (logical and structural) provide knowledge of data availability. The data that may be received into the warehousing environment is identified by these models.
• State transition models, when available, provide understanding of the business events and their data impacts. When state models aren’t available, it is sometimes advisable to develop them as part of the staging data modeling effort.
IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS
Relationships are associations among entities that have meaning to the business. For staging data, each relationship needs to be visible in the source data: either implemented by one or more data sources or derivable from those sources. To increase adaptability of staging data, include all possible relationships for which source data is available.
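One way to picture an entity and an optional relationship is the sketch below, which borrows the POLICY and AUTOMOBILE POLICY entities and a few of their attributes from the logical model figure. The class layout, the reading of "maybe one" as an optional reference, the `attach_automobile_policy` helper, and all sample values are assumptions for illustration, not the course's notation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AutomobilePolicy:
    # A subset of the attributes shown for AUTOMOBILE POLICY.
    policy_number: str
    deductible_amt: float
    collision_amt: float


@dataclass
class Policy:
    # A subset of the attributes shown for POLICY.
    policy_number: str
    policy_type_code: str
    policy_begin_date: date
    # Assumed reading of "maybe one": a policy has zero or one automobile policy.
    automobile_policy: Optional[AutomobilePolicy] = None


def attach_automobile_policy(policy: Policy, auto: AutomobilePolicy) -> Policy:
    """Relate the two entities through the shared policy number visible in the source data."""
    if auto.policy_number != policy.policy_number:
        raise ValueError("policy numbers do not match")
    policy.automobile_policy = auto
    return policy


# Assumed sample values.
policy = Policy("P-1001", "AUTO", date(2003, 5, 1))
attach_automobile_policy(policy, AutomobilePolicy("P-1001", 250.0, 50000.0))
```

The relationship is only accepted when the shared policy number supports it, which mirrors the requirement that a staging relationship be implemented by or derivable from the sources.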
IDENTIFY, NAME, & DESCRIBE ATTRIBUTES
The minimum requirement of staging data is to include every attribute that is needed to answer the business questions. Be certain that this minimum set of attributes is identified and modeled. The ideal for data staging is to implement all useful attributes from a data source at the time that source is first used to provide warehousing data. Full identification of staging data attributes is accomplished through triage.
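The text does not spell out how triage is carried out. As one possible reading, the sketch below sorts a source's attributes into the minimum set (needed to answer business questions), additional useful attributes, needed attributes the source cannot supply, and attributes left out. The function name, the category names, and the sample attribute lists (including `filler-1`) are hypothetical.

```python
def triage_attributes(source_attributes, needed_for_questions, judged_useful):
    """Classify attributes by comparing what a source offers with what is needed or useful."""
    source = set(source_attributes)
    needed = set(needed_for_questions)
    useful = set(judged_useful)
    return {
        # Minimum requirement: needed by business questions and available in the source.
        "minimum": sorted(source & needed),
        # Ideal: other useful attributes staged when the source is first used.
        "additional": sorted((source & useful) - needed),
        # Needed, but this source cannot provide them.
        "unsourced": sorted(needed - source),
        # Available, but neither needed nor judged useful.
        "excluded": sorted(source - needed - useful),
    }


# Assumed example loosely based on the VEHICLE attributes in the figure.
result = triage_attributes(
    source_attributes=["vin", "make", "model", "airbags-code", "filler-1"],
    needed_for_questions=["vin", "make", "usage"],
    judged_useful=["model", "airbags-code"],
)
```

An attribute that lands in "unsourced" signals that the minimum requirement is not yet met and that another data source has to be found.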
-
Module 4 Data Modeling and Design Summary
Topic Page
Modeling Overview 2.4-2
Modeling Relational Data Marts 2.4-4
Deliverables Summary 2.4-8
-
Deliverables Summary Modeling Deliverables Checklist
Deliverable | Without this the…
Business Drivers & Goals | Business reasons for undertaking a data warehousing program are not clearly articulated. Limits business value. Creates risk that business people will not accept the data warehouse.
Information Needs | Information structure of the warehouse is not linked to clearly expressed needs of the business, and information needs are intangibles. Risk is “information silos” that fail to achieve subject orientation and integration, and inability to prioritize needs, plan increments, and select the right data sources.
Business Questions | Information needs aren’t made concrete and tangible. The transition from needs analysis to design of data structures is difficult, and the quality of the results uncertain.
Source Composition Model | Data sources are not inventoried and grouped by subject. It is unclear how sources relate to the subjects that provide the basis of subject-orientation. Risks incompleteness and inaccuracy in data sourcing strategy.
Subject Area Model | The warehousing program is absent any high level groupings of data as subjects of business interest. High level of risk that the warehouse data structures will not be adequately subject-oriented.
Fact/Qualifier Matrix | Business questions are not formally analyzed to understand their data components and data usage. Increases risk of jumping to solutions without understanding the business problem. Jeopardizes completeness, correctness, and adaptability of warehousing solutions.
Warehouse Targets Configuration | Parts of a total warehousing solution are developed without a clear picture of the overall structure. Increases risk of “misfit” warehousing components, continuous rework, unstable warehousing environment, and user dissatisfaction.
Structure of Data Store (matrix) | Contents and structure of some data sources are not understood. These sources cannot be integrated into the source data model. Reduces opportunity to choose the best source. Increases risk that source data is misunderstood and translated to misinformation in the warehouse.
Source Logical Model | Source data contents and relationships are not understood at a business level. A single business fact cannot be traced to multiple, redundant sourcing options. Risks incompleteness of data acquisition and data cleansing solutions. Inhibits ability to analyze impacts of and respond to changing source data structures.
State Transition Diagram | Entities are not understood in context of life cycles and business processes. Risks incompleteness in warehousing target designs. Increases probability of “out-of-sync” data in the warehouse.
Staging Logical Model (ERM) | Underlying structures for data staging are casually and informally designed. Risks rework and instability in the data staging environment, leading to greater instability and potential inaccuracies in the overall warehousing environment.
Warehouse Logical Model (ERM) | Underlying structures for the data warehouse are casually and informally designed. Risks rework and instability in the data warehouse, which is disruptive to dependent data marts, and frustrating for users who directly access the warehouse.
Data Mart Logical Model (ERM) | Relational data structures and data mart subject orientation are not fully understood. Relational data mart designs occur at a physical level without benefit of logical and structural analysis and design. Consistency of relationships across multiple marts is at risk, and individual data mart designs may be physically inadequate.
Data Mart Logical Model (DDM) | Dimension hierarchies and dimensional relationships are not fully understood. Dimensional data mart designs occur at a physical level without benefit of logical and structural analysis and design. Conformity of dimensions across multiple marts is at risk, and individual data mart designs may be physically inadequate.
Staging Structural/Physical Model (ERM) | Time requirements for staging data are not formally analyzed, and the comprehensive staging design is not precisely specified as a cohesive set of physical tables. Risks sub-optimal data staging implementation, impacting overall performance and adaptability of the warehousing application.
Warehouse Structural/Physical Model (ERM) | Time, location, and usage requirements for the warehouse are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Strategy for feeding dependent data marts is jeopardized, and user satisfaction with the warehouse is put at risk.
Data Mart Structural/Physical Model (ERM) | Time, location, and usage requirements for the mart are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Users may be surprised (sometimes unpleasantly) at the physical implementation.
Data Mart Structural/Physical Model (DDM) | Time, location, and usage requirements for the mart are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Users may be surprised (sometimes unpleasantly) at the physical implementation.