Previews of TDWI course books are provided as an opportunity to see the quality of our material and help you to select the courses that best fit your needs. The previews can not be printed. TDWI strives to provide course books that are content-rich and that serve as useful reference documents after a class has ended. This preview shows selected pages that are representative of the entire course book. The pages shown are not consecutive. The page numbers as they appear in the actual course material are shown at the bottom of each page. All table-of-contents pages are included to illustrate all of the topics covered by a course.

  • The Data Warehousing Institute

    TDWI Data Modeling: Data Warehouse Design and Analysis Techniques

    The Data Warehousing Institute takes pride in the educational soundness and technical accuracy of all of our courses. Please give us your comments – we’d like to hear from you. Address your feedback to:

    email: [email protected]

    Publication Date: May 2003

    © Copyright 1999-2003 by The Data Warehousing Institute. All rights reserved. No part of this document may be reproduced in any form, or by any means, without written permission from The Data Warehousing Institute.

  • TABLE OF CONTENTS

    Module One: Data Modeling Concepts

    Module Two: Requirements Analysis Models

    Module Three: Design and Specification Models

    Unit A: Design & Specification Modeling Concepts

    Unit B: Designing Data Marts

    Unit C: Designing Data Warehouses

    Unit D: Designing Data Staging Areas

    Module Four: Data Modeling and Design Summary

    APPENDICES

    Appendix A: Glossary of Data Warehousing Terms

    Appendix B: TDWICo Case Study

    Appendix C: TDWICo Sample Models and Documentation

    Appendix D: Article: Optimizing the Data Warehousing Environment for Change: The Persistent Staging Area

    Appendix E: State Transition Analysis

    Appendix F: Bibliography and References

    WORKSHOP

    Exercises: Exercise Activities

    Solutions: Exercise Solutions

  • Module 1: Data Modeling Concepts

    Topics:

    • Modeling Fundamentals
    • The Warehouse Data Modeler
    • Warehousing Data Stores
    • Modeling Techniques

  • Modeling Techniques: Subject Area Modeling

    [Figure: subject area model showing the subjects Claim, Incident, Policy, Organization, and Customer and the relationships among them.]

    DESCRIPTION

    A subject data model depicts business data subjects and the major associations among them. Subject models are used at the conceptual level for all the different data stores in warehousing. They aid in the detailed analysis of the information needs and what will be needed in the warehousing environment to meet those needs.

    COMPONENTS

    The main components of subject area models are:

    • Subjects -- High level views of topics of business interest that may be considered equivalent to both global data classes and classes of entities.

    • Relationships -- Represent the most visible associations between the subjects.

    THE MODELING PROCESS

    This technique is quite similar to E-R modeling. Its purpose is to identify subjects that will remain stable, even as the information needs change. The subjects represent one or more entities. As warehousing ERMs develop, the subjects provide a way to view subsets of complex models. In general the modeling activities are:

    • Identify and name subjects
    • Associate subjects / identify relationships
    • Identify and name attributes
    • Associate attributes with subjects

    This is an iterative process and the order of the steps may change from one iteration to the next. Each iteration facilitates discovery in the next.

    EXAMPLE

    The example on the facing page illustrates that:

    • Policies are directly associated with claims.
    • Customers have interest in both policies and claims.
    • Customers are related to organizations.
    • Organizations have interest in both policies and claims.
    • Incidents are related to both policies and claims.

    One might infer that when a customer has interest in a policy, they also have interest in its directly related claims. Be careful to verify inferences.
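
    A subject area model can be recorded as a set of subjects plus the associations among them. The following is a minimal Python sketch that encodes only the associations listed in the example; the names `SUBJECTS`, `RELATIONSHIPS`, `related_subjects`, and `is_directly_related` are illustrative assumptions and are not part of the course material.

```python
# Subjects and associations as stated in the example text.
SUBJECTS = {"Claim", "Incident", "Policy", "Organization", "Customer"}

# Each relationship is an unordered pair of subjects.
RELATIONSHIPS = {
    frozenset({"Policy", "Claim"}),
    frozenset({"Customer", "Policy"}),
    frozenset({"Customer", "Claim"}),
    frozenset({"Customer", "Organization"}),
    frozenset({"Organization", "Policy"}),
    frozenset({"Organization", "Claim"}),
    frozenset({"Incident", "Policy"}),
    frozenset({"Incident", "Claim"}),
}


def related_subjects(subject):
    """Return the subjects directly associated with the given subject."""
    if subject not in SUBJECTS:
        raise KeyError(f"unknown subject: {subject}")
    return sorted(
        other
        for pair in RELATIONSHIPS
        if subject in pair
        for other in pair
        if other != subject
    )


def is_directly_related(a, b):
    """True only when the model records an explicit association."""
    return frozenset({a, b}) in RELATIONSHIPS
```

    Because only explicit associations are stored, an inferred link (for example, Incident to Customer) shows up as absent until someone verifies it with the business and adds it to the model.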

  • Modeling Techniques: Fact/Qualifier Modeling

    [Figure: fact/qualifier matrix. Facts: customer count, % of total market, customer-id, customer-name, household count, lost customer-id, lost-policy-id, lost-policy-value, claim count, claim settlement lag time, claim filing lag time, home incident count. Qualifiers: region, zone, employee, customer, line of business, product, policy, cause of claim, year, month, demographics, policy features, coverage group, size of claim, customer value, customer share.]

    DESCRIPTION

    A fact/qualifier matrix represents two sets of data items: (1) facts that business people need to know, and (2) qualifiers used to manipulate and organize the facts for analysis. Associations in the matrix illustrate which qualifiers are applicable to which facts. This model is used at the conceptual level to analyze the business questions that warehousing is intended to answer. Understanding the data implications of the information needs and their associated business questions is essential to capture the right data and build the right warehouse/mart data structures.

    COMPONENTS

    The components of the matrix are:

    • Facts – Discrete items of business information that (partially) satisfy the information needs of the business. These are typed as descriptive or metric.
    • Qualifiers – Criteria by which the facts are accessed, sorted, grouped, aggregated, filtered and presented to warehouse users.
    • The fact/qualifier association – An entry at an intersecting cell indicating that the qualifier may be used to control how the fact is used in analysis. Association entries may record data about the association (e.g., a reference to the business questions from which the association is derived).

    THE MODELING PROCESS

    This matrix combines two lists derived from the information needs and their related business questions. The list of facts, sometimes called the “know list,” answers the question “What do you need to know?” The list of qualifiers, also called the “by list,” answers the question “What do you want to know it by?” Modeling is a simple process:

    • Identify and name facts (from the know list) – label rows.
    • Identify and name qualifiers (from the by list) – label columns.
    • Associate facts with qualifiers – mark intersecting rows and columns where a fact is associated with a qualifier.

    NOTE: The placement of facts as row labels and qualifiers as column labels is arbitrary. This model works equally well when row and column designations are reversed.
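
    The three modeling steps can be sketched in a few lines of Python. This is an illustration only: the fact and qualifier names come from the facing-page matrix, but the specific associations marked below are made up for the example, and `build_matrix` and `render` are hypothetical helper names.

```python
def build_matrix(facts, qualifiers, associations):
    """Return {fact: {qualifier: bool}} marking which qualifiers apply."""
    unknown = [
        (f, q) for f, q in associations if f not in facts or q not in qualifiers
    ]
    if unknown:
        raise ValueError(f"associations reference unlisted items: {unknown}")
    marked = set(associations)
    return {f: {q: (f, q) in marked for q in qualifiers} for f in facts}


def render(matrix, qualifiers):
    """Format the matrix as plain text, facts as rows, qualifiers as columns."""
    width = max(len(fact) for fact in matrix)
    lines = [" " * width + " | " + " | ".join(qualifiers)]
    for fact, row in matrix.items():
        cells = [("X" if row[q] else "").center(len(q)) for q in qualifiers]
        lines.append(fact.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


# Step 1: the "know list" (facts). Step 2: the "by list" (qualifiers).
facts = ["customer count", "claim count", "claim settlement lag time"]
qualifiers = ["region", "month", "cause of claim"]

# Step 3: associations (assumed here purely for illustration).
associations = [
    ("customer count", "region"),
    ("customer count", "month"),
    ("claim count", "region"),
    ("claim count", "month"),
    ("claim count", "cause of claim"),
    ("claim settlement lag time", "cause of claim"),
]

matrix = build_matrix(facts, qualifiers, associations)
print(render(matrix, qualifiers))
```

    Rejecting associations that name an unlisted fact or qualifier keeps the matrix consistent with the two source lists, which is the point of deriving both from the business questions.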

  • Modeling Techniques: State Transition Modeling

    [Figure: state transition diagram for a claim. States: Received Claim, Unassigned Claim, Assigned Claim, Investigated Claim, Settled Claim, Paid Claim, Rejected Claim, END. Actions: receive claim, assign adjuster, defer assignment, find cause to reject, complete investigation, determine settlement, make final payment.]

    DESCRIPTION

    State transition modeling is used to build entity life cycle models. This model is a tool to examine a single, state-dependent entity with respect to the states in which it may exist, and the actions that cause it to change from one state to the next. A state transition model provides specific detail about a single entity. State transition modeling is used at the context level to help identify information needs, and at the structural level to help determine the time dependencies of the targets. Build the model by identifying the initial state at which the entity becomes of interest to the business. Then follow the possible paths of successor states in an iterative fashion.

    THE MODELING PROCESS

    An entity life cycle model addresses one, and only one, entity. The modeling process begins with identification and selection of an entity that is state-dependent and that needs further analysis. The following sequence of activities is used to model the entity’s life cycle:

    • Select the state-dependent entity that is the focus of the model.
    • Identify the states in which an entity occurrence may exist, and the actions that cause changes of state.
    • Identify the business rules that describe pre-conditions and post-conditions for a change of state.
    • Test completeness and correctness of the model.

    READING THE MODEL

    The example on the facing page illustrates these (and other) state-based business rules:

    • A Received Claim is checked for completeness; if incomplete, it is Rejected and all processing on it stops.
    • A claim is not Investigated until it is Assigned.
    • A claim is not Paid until it has been Investigated and Settled.
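
    A state transition model of this kind maps naturally onto a transition table. The Python sketch below is hedged: the state and action names come from the facing-page diagram, but the exact wiring between them is an assumption reconstructed from the labels and the rules above, and `TRANSITIONS`, `apply`, `run`, and `unreachable_states` are hypothetical names.

```python
# Transition table: (current state, action) -> next state.
# ASSUMPTION: the wiring is inferred from the diagram labels and the rules
# in the text; "START" is an artificial state that precedes "receive claim".
TRANSITIONS = {
    ("START", "receive claim"): "Received",
    ("Received", "assign adjuster"): "Assigned",
    ("Received", "defer assignment"): "Unassigned",
    ("Received", "find cause to reject"): "Rejected",
    ("Unassigned", "assign adjuster"): "Assigned",
    ("Assigned", "complete investigation"): "Investigated",
    ("Investigated", "determine settlement"): "Settled",
    ("Settled", "make final payment"): "Paid",
}


def apply(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} is not allowed in state {state!r}"
        ) from None


def run(actions, state="START"):
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = apply(state, action)
    return state


def unreachable_states():
    """Completeness check: states that can never be reached from START."""
    reached, frontier = {"START"}, ["START"]
    while frontier:
        current = frontier.pop()
        for (source, _action), target in TRANSITIONS.items():
            if source == current and target not in reached:
                reached.add(target)
                frontier.append(target)
    all_states = {s for s, _ in TRANSITIONS} | set(TRANSITIONS.values())
    return all_states - reached
```

    The table enforces the pre-conditions directly: a claim cannot be investigated before it is assigned because no such (state, action) pair exists, and the reachability check is one simple way to test completeness of the model.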

  • Modeling Techniques: Entity Relationship Modeling Process

    [Figure: entity relationship diagram. Entities and identifying attributes: PARTY (PTY-ID-NUMBER) with the entities CLAIMANT, POLICYHOLDER, and INTERESTED PARTY; PARTY ADDRESS (PTY-ID-NUMBER, ADDRESS-USAGE-CODE); CLAIM (CLAIM-NUMBER); CLAIM ACTION (CLAIM-NUMBER, ACTION-TYPE-CODE, ACTION-BEGIN-DATE); INCIDENT (INCIDENT-DATE, INCIDENT-LOCATION, INCIDENT-TYPE-CODE); POLICY (POLICY-NUMBER). Relationship labels: party participates in, policy protects, claimant files, is any, party uses, incident causes, action taken on, claim filed against.]

    THE MODELING PROCESS

    The diagram is the pictorial part of the ERM. It illustrates all of the model components and their associations. Every component has a unique name as a way to reference the component and link it to the descriptive part of the model. The following steps in the ERM process produce both diagrammatic and descriptive model components:

    • Identify, name, and describe entities.
    • Associate entities / identify, name and describe relationships.
    • Assign cardinality to the relationships.
    • Identify, name, and describe attributes.
    • Associate the attributes with entities.

    The process is iterative and not necessarily sequential. The results from one iteration may help discover new items in the next.

    MODELING HEURISTICS

    To find entities, focus on nouns of business interest that are found in business processes, forms and other business documentation, and discussions with business people. To find relationships, focus on phrases that join or associate entities. Two questions help to discover attributes: “How do we uniquely identify an occurrence of the entity?” and “What facts do we need to know about this entity?”

    READING THE MODEL

    The essence of the model is expressed as two simple sentences for each relationship, including cardinality, along with the associated entities. For example, from the model on the facing page:

    • One Claim Action is taken on one and only one Claim.
    • One Claim has one or more Claim Action(s).

    Minimum model validation requires that each of these statements be affirmed by the business as correct business rules.
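
    The two-sentence reading of a relationship can be generated mechanically from its entities, verb phrases, and cardinalities. The following Python sketch assumes a simple (minimum, maximum) encoding of cardinality; the `Relationship` class and its field names are illustrative and do not come from the course material.

```python
from dataclasses import dataclass

# (minimum, maximum) -> wording; a maximum of None means "many".
CARDINALITY_PHRASES = {
    (1, 1): "one and only one",
    (0, 1): "zero or one",
    (1, None): "one or more",
    (0, None): "zero or more",
}


def _phrase(cardinality, entity):
    """Render e.g. 'one or more Claim Action(s)'."""
    suffix = "(s)" if cardinality[1] is None else ""
    return f"{CARDINALITY_PHRASES[cardinality]} {entity}{suffix}"


@dataclass(frozen=True)
class Relationship:
    left: str
    right: str
    left_to_right: str       # verb phrase read from left entity to right
    right_to_left: str       # verb phrase read from right entity to left
    right_per_left: tuple    # how many right occurrences per one left
    left_per_right: tuple    # how many left occurrences per one right

    def sentences(self):
        """Return the two business-rule sentences for this relationship."""
        return [
            f"One {self.left} {self.left_to_right} "
            f"{_phrase(self.right_per_left, self.right)}.",
            f"One {self.right} {self.right_to_left} "
            f"{_phrase(self.left_per_right, self.left)}.",
        ]


claim_action_to_claim = Relationship(
    left="Claim Action",
    right="Claim",
    left_to_right="is taken on",
    right_to_left="has",
    right_per_left=(1, 1),
    left_per_right=(1, None),
)

for sentence in claim_action_to_claim.sentences():
    print(sentence)
```

    Generating the sentences for every relationship gives a plain-language checklist that business people can affirm or correct one rule at a time.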

    COMMON E-R LANGUAGE

    Some of the common terms associated with E-R modeling are:

    • Super-type and sub-type
    • Inheritance
    • Recursive relationship
    • Many-to-many relationship
    • Attributed relationship
    • Conditional (or optional) relationship
    • Identifying attribute
    • Descriptive attribute
    • Metric attribute

  • Modeling Techniques: Dimensional Modeling Process

    [Figure: dimensional data model for the meter Market Share. Measures: % of total market, Number of Potential Policies, Number of Active Policies. Dimensions and levels: Time (Year, Quarter, Month); Policy (LOB, Product Line, Product); Organization (Region, District, Zone). Attributes and sample values: LOB-Description (Residential, Automobile, Life); Product-Line-Description (Homeowners, Renter’s, Personal Auto, Personal Life); Product-Description (Primary Residence, Rental Residence, Basic Auto, Whole Life, Term Life); Maximum-Coverage-Amount; Region-Description (Northwest, Southwest); Region-Manager; District-Description (California, Colorado, Washington); District-Manager; Zone-Description (Adams, Denver, King, Spokane); Zone-Manager.]

    THE MODELING PROCESS

    This is a form of E-R modeling where the actual structure of the model is pre-determined. Each model represents one and only one business meter. Combining business meters for optimization is performed at the structural and physical levels of modeling. The modeling activities are:

    • Identify and name the meter
    • Identify and name the measures (association with meter is implicit)
    • Identify and name the dimensions and dimension levels
    • Associate dimension levels within dimensions as hierarchies
    • Associate dimensions with meter
    • Identify dimension values

    These activities are performed repeatedly in an iterative and non-linear process.

    MODELING HEURISTICS

    Metric facts (measures) in the fact/qualifier matrix are indicative of one or more dimensional models. Related measures with common qualifiers indicate a meter. The qualifiers help determine the dimensions and dimension levels. Align the dimensions and levels with the structure of the business.

    COMMON DIMENSIONS

    Some common kinds of dimensions in any business are time, geography, product, customer, and organization. Actual names of the dimensions should be unique and use the business language of the organization.

    WHAT ABOUT STAR SCHEMA?

    Star and snowflake schemas are physical implementations of a dimensional data model (DDM). They are not analysis models, and they are not appropriate for the logical level. The DDM has been designed to serve these needs.

    READING THE MODEL

    Some examples of business metrics supported by the model on the facing page include:

    • The percent of the active auto policies in Washington and Colorado
    • The number of active whole life policies in the Northwest region
    • The potential market for term life policies
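
    Metrics like these are obtained by rolling measures up the dimension hierarchies. The Python sketch below illustrates the idea under loud assumptions: the parent-child links between zones, districts, regions, products, and product lines are guesses (the diagram lists values, not memberships), every count is invented, and treating "% of total market" as active divided by potential policies is an assumption rather than the course's definition.

```python
# ASSUMED hierarchy memberships, for illustration only.
ZONE_TO_DISTRICT = {
    "Adams": "Colorado",
    "Denver": "Colorado",
    "King": "Washington",
    "Spokane": "Washington",
}
DISTRICT_TO_REGION = {
    "Washington": "Northwest",
    "Colorado": "Southwest",
    "California": "Southwest",
}
PRODUCT_TO_LINE = {
    "Basic Auto": "Personal Auto",
    "Whole Life": "Personal Life",
    "Term Life": "Personal Life",
}

# Invented fact rows at the lowest levels (zone, product).
FACTS = [
    {"zone": "King", "product": "Basic Auto", "active": 120, "potential": 400},
    {"zone": "Spokane", "product": "Basic Auto", "active": 80, "potential": 200},
    {"zone": "Denver", "product": "Basic Auto", "active": 50, "potential": 400},
    {"zone": "King", "product": "Whole Life", "active": 30, "potential": 100},
    {"zone": "Denver", "product": "Term Life", "active": 10, "potential": 90},
]

# How to derive each dimension level from a fact row.
LEVELS = {
    "zone": lambda row: row["zone"],
    "district": lambda row: ZONE_TO_DISTRICT[row["zone"]],
    "region": lambda row: DISTRICT_TO_REGION[ZONE_TO_DISTRICT[row["zone"]]],
    "product": lambda row: row["product"],
    "product_line": lambda row: PRODUCT_TO_LINE[row["product"]],
}


def roll_up(levels, rows=FACTS):
    """Sum the additive measures by the requested dimension levels."""
    totals = {}
    for row in rows:
        key = tuple(LEVELS[level](row) for level in levels)
        cell = totals.setdefault(key, {"active": 0, "potential": 0})
        cell["active"] += row["active"]
        cell["potential"] += row["potential"]
    # Recompute the ratio at each level instead of averaging lower levels.
    for cell in totals.values():
        cell["pct_of_market"] = 100.0 * cell["active"] / cell["potential"]
    return totals


by_district = roll_up(["district", "product_line"])
by_region = roll_up(["region", "product"])
```

    The percentage is derived after aggregation because ratios are not additive; summing or averaging the zone-level percentages would give a different and misleading figure.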

  • Module 2: Requirements Analysis Models

    Topics:

    • Target Modeling Overview
    • Conceptual Modeling Overview
    • Business Questions
    • Subject Modeling
    • Fact/Qualifier Analysis
    • Target Configuration

  • Target Modeling Overview: Modeling Objectives

    [Figure: target modeling deliverables flow across the levels Contextual (scope), Conceptual (analyze), Logical (design), Structural (specify), Physical (optimize), and Functional (implement). Contextual: business drivers, business goals, and information needs; a "source data or target data?" decision sends operational & external data to source modeling and warehousing data onward. Conceptual: business questions, warehousing subject model, fact/qualifier matrix, and warehouse targets configuration; a "what kinds of targets?" decision distinguishes staging data, data warehouse, non-metric data marts, and metric data marts. Logical: staging logical model (ERM), warehouse logical model (ERM), data mart logical models (ERM and DDM). Structural: staging structural model (ERM), warehouse structural model (ERM), data mart structural models (ERM and DDM). Physical: staging, warehouse, and data mart physical database designs. Functional: staging, warehouse, and mart DBMS detailed specification (DDL) and implemented tables.]

    TARGET MODELING OBJECTIVES

    The target modeling process, as illustrated by the deliverables flow on the facing page, is designed to produce data models for each target data store that is part of the warehousing environment:

    • Staging Area
    • Data Warehouse
    • Data Marts – both relational and dimensional

    Data models are produced at each level of model abstraction leading to functional implementation:

    • Conceptual models look at requirements – What needs to be built to respond to information needs and business questions?
    • Logical models represent the design view of each target – What are the “parts” of the solution?
    • Structural models represent the specification views of each target – What must each “part” do specific to data warehousing? How do the “parts” fit into the warehousing architecture?
    • Physical models represent the specification optimized for the implementation environment – What are the platform specific details?

  • Target Modeling Overview: Modeling Context

    [Figure: source and target modeling deliverables side by side at each level.
    Contextual Models (shared): Business Goals & Drivers; Information Needs.
    Conceptual Models: source: Source Composition, Source Subjects; target: Warehousing Subjects, Business Questions, Facts & Qualifiers, Target Configuration; linked by synergy.
    Logical Models: source: Integrated Source Data Model (ERM); target: Staging, Warehouse, & Mart ER Models, Data Mart DDMs; linked by triage.
    Structural Models: source: Source Data Structure Model; target: Staging Area Structure, Warehouse Structure, Relational Mart Structures, Dimensional Mart Structures.
    Physical Models: source: Source Data File Descriptions; target: Staging Physical Design, Warehouse Physical Design, Data Mart Physical Designs (relational & dimensional).
    Implemented Data: source: Source Data Files; target: Implemented Warehousing Databases.]

    TARGET MODELING CONTEXT

    The context of target data modeling is established by the business drivers, the business goals, and the information needs that provide the foundation of the warehousing program. These are exactly the same context setting deliverables as for source data analysis. (This is good news, because source and target modeled with different contexts would cause real problems.)

    SOURCE DATA & TARGET MODELING

    While source and target data analysis activities follow distinct and separate paths, they are related, and are typically performed as parallel activities. Source modeling is related to target modeling in the following ways:

    • Shared Context – At the contextual level, source and target modeling deliverables are identical.
    • Synergetic Concept – At the conceptual level, expect to experience a high level of synergy between modeling activities. Understanding of source subjects helps to identify target subjects; and identification of warehousing subjects helps to understand source data. Knowledge of source data may also be useful to develop a robust set of business questions.
    • Complementary Logic – At the logical level, modeling processes are complementary. Triage provides a modeling process association between source data analysis and modeling of staging data to ensure complete attribution and an adaptable data design.

  • Business Questions: Discovery Techniques

    [Figure: the contextual, conceptual, and logical levels of the target modeling deliverables flow, with business questions highlighted and annotated with the discovery techniques: stakeholder driven, goal oriented, business process oriented, business measures based, data source analysis, current reporting analysis, surrogate system analysis, subject analysis.]

    FINDING THE BUSINESS QUESTIONS

    Using a single information need as the focal point, analysis and brainstorming based on any of the following methods may be effective to achieve a robust list of business questions. Repeat the process for each of the information needs of interest.

    • Stakeholder Driven – Work from the list of stakeholders identified in the program charter. Have each stakeholder express their individual interest in the information need, and the specific business questions that they would like to have answered.

    • Goal Oriented – Ask individual stakeholders (1) to examine the information need in context of business goals, (2) to describe how they can personally contribute to meeting the goals, and (3) to discuss the kinds of information that would help them to do so.

    • Process Oriented – Explore business processes that are related to or affected by the information need. Seek specific questions about business process components (customers, products, inputs, suppliers, events, activities, and actors).

    • Measures Based – Examine the information need to identify a set of meaningful business measures. Express each of the measures as a set of business questions. Consider measures based on finance, people and organizations, processes, markets, and customers.

    • Source Data Analysis – Examine data sources to identify questions that the sources are able to answer. Extend the brainstorming to discuss those questions not being answered. Pay particular attention to questions that demand historical data to be answered.

    • Current Reports Analysis – As with data sources, examine existing reports to identify the questions that are and are not being answered. Again, consider the questions that need historical data.

    • Surrogate System Analysis – Examine the systems, manual and otherwise, that stakeholders use to get information not readily available from core business systems. These include individually maintained spreadsheets and databases.

    • Subject Analysis – When developing the warehouse subject model in parallel with identification of business questions, the subject model is a useful foundation to explore business questions. Seek information about each subject that is responsive to the information need, and express as a set of business questions.

  • Target Configuration: Three Roles of Warehousing Data Stores

    [Figure: three-tier data flow aligned with the roles Data Intake, Data Distribution, and Information Delivery. Source Data feeds Data Staging Processes into Persistent Staging Data; Warehouse Population Processes load the Data Warehouse; three Data Mart Population Processes load three Data Marts. Integration and Access are marked alongside the data stores.]

    ROLES OF WAREHOUSING DATA

    Every data warehousing environment has three distinct roles that need to be filled by data stores:

    • Data Intake – Receiving data into the warehousing environment from various data sources. The intake role includes all activities necessary to isolate data from the source environment.

    • Data Distribution – Structuring and storing data to serve as a single source of information organized by subjects of business interest.

    • Information Delivery – Structuring and storing data in forms that are well aligned with business information needs; facilitating fast, easy access to business information. The delivery role encompasses all necessary activities to provide “information friendly” data and ready access to that data by business people.

    THREE TIERS OF WAREHOUSING DATA STORES

    Each of the roles described above aligns well with a single tier, and the related data stores, in a three-tier approach to data warehousing:

    • Data Staging provides the facility for data intake. A staging area is any data store that is primarily designed for the purpose of receiving data into a warehousing environment. A good data staging strategy includes a staging area that is persistent, atomic, subject oriented, integrated, adaptable, and extensible. A persistent data staging area also serves as a historical record of the business and an essential data archiving component. A single data staging area is common in three-tier warehousing; however, multiple staging areas are possible.

    • The Data Warehouse provides the means of data integration. A good data warehouse, as described by Bill Inmon, is subject-oriented, integrated, non-volatile, and time-variant. In a three-tier approach, the data warehouse is optimized for distribution – its primary role is to serve as an integrated source from which data marts are populated. A single data warehouse is typical of three-tier warehousing.

    • Data Marts are designed to meet the needs for information delivery. A data mart is optimized for access, and is designed to facilitate end-user analysis of data. Each data mart supports a single analytic application used by a distinct set of workers. This implies many data marts, each designed to meet a specific set of information needs.

    Note that both staging and warehouse data stores serve integration needs; and that both warehouse and mart data stores serve access needs.

  • Target Configuration: How Many Tiers?

    [Figure: three alternative configurations.
    Two Tiers, Dependent Data Marts: Tier 1 (Data Intake and Distribution): Source Data feeds Data Intake & Integration Processes into Warehouse/Staging Data; Tier 2 (Information Delivery): three Data Mart Population Processes load three Data Marts.
    Two Tiers without Data Marts: Tier 1 (Data Intake): Source Data feeds Data Intake Processes into Data Staging; Tier 2 (Data Distribution & Information Delivery): Data Warehouse Population Processes load the Data Warehouse.
    One Tier Data Warehouse: Tier 1 (Data Intake, Data Distribution, & Information Delivery): Source Data feeds Data Intake Processes into the Data Warehouse.]

    ONE AND TWO TIER WAREHOUSING

    Three tiers of warehousing data stores, while sometimes desirable, are not essential to successful data warehousing. Many successful warehousing environments have been implemented with two physical tiers satisfying the three roles. Two tiers of physical data stores implementing three distinct warehousing roles is a common approach.

    In some environments, volume of data and complexity of processing are not sufficient to require three tiers. They simply aren’t needed. In these instances the cost of implementing, operating, and maintaining additional data stores is not justified by the few optimization gains that might be realized. In other environments, constraints (development time, processing time, computer resources, people and organizations, etc.) make a three-tier approach impractical.

    When three tiers of data stores are desired, but can’t be achieved, it is important to realize that all three roles – data intake, data distribution, and information delivery – must still be supported. Thus, a single type of data store assumes multiple roles. When one data store serves more than one role, the optimization issues become more complex.

    The diagrams on the facing page illustrate some of the alternatives for two-tier and single-tier warehousing. The best configuration for any warehousing environment achieves a balance between the constraints of the environment and the relative importance of each of the three roles of data stores.
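
    The requirement that every configuration still supports all three roles can be checked mechanically. The Python sketch below encodes the configurations from the facing-page diagrams as a mapping from data store to roles; the names `ROLES`, `CONFIGURATIONS`, `missing_roles`, and `multi_role_stores` are illustrative assumptions.

```python
ROLES = {"data intake", "data distribution", "information delivery"}

# Data store -> roles it fills, per configuration (read from the diagrams).
CONFIGURATIONS = {
    "three tiers": {
        "persistent staging data": {"data intake"},
        "data warehouse": {"data distribution"},
        "data marts": {"information delivery"},
    },
    "two tiers, dependent data marts": {
        "warehouse/staging data": {"data intake", "data distribution"},
        "data marts": {"information delivery"},
    },
    "two tiers, without data marts": {
        "data staging": {"data intake"},
        "data warehouse": {"data distribution", "information delivery"},
    },
    "one tier": {
        "data warehouse": {
            "data intake",
            "data distribution",
            "information delivery",
        },
    },
}


def missing_roles(config):
    """Return the roles that no data store in the configuration fills."""
    covered = set().union(*config.values())
    return ROLES - covered


def multi_role_stores(config):
    """Return the data stores that fill more than one role."""
    return sorted(store for store, roles in config.items() if len(roles) > 1)


for name, config in CONFIGURATIONS.items():
    print(name, "| missing:", sorted(missing_roles(config)),
          "| multi-role:", multi_role_stores(config))
```

    The stores reported as multi-role are the ones where optimization trade-offs become more complex, since a single physical design has to serve competing purposes.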

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Models

    The Data Warehousing Institute 3-1

    Module 3 Design and Specification Models

    Unit A: Design and Specification Modeling Concepts

    Unit B: Designing Data Marts

    Unit C: Designing Data Warehouses

    Unit D: Designing Data Staging Areas

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-1

    Module 3 – Unit A Design and Specification Modeling Concepts

    Topic Page

    Normalization 3A-2

    State Transition Modeling 3A-6

    Triage 3A-10

    Structural Modeling Issues 3A-14

    Structural and Physical Optimization Issues 3A-42

    Optimization Techniques 3A-46

  • Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3A-4 The Data Warehousing Institute

    Normalization An Example

    [Figure: the Automobile Policy Premium File at each normal form; key attributes are shown in capitals]

    0NF

    AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date, premium-discounts (repeating group: discount-code, discount-schedule-amt, premium-cost-before-discounts, premium-cost-after-discounts, discount-amount)

    1NF

    AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date

    PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-schedule-amt, premium-cost-before-discounts, premium-cost-after-discounts, discount-amount

    2NF

    AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost, premium-cost-before-discounts, premium-cost-after-discounts, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date

    PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-amount

    DISCOUNT SCHEDULE: DISCOUNT-CODE, discount-schedule-amt

    3NF

    AUTO POLICY PREMIUM: POLICY-NUMBER, deferred-payment-service-cost, total-policy-cost (deleted at 3NF), premium-cost-before-discounts, premium-cost-after-discounts, number-of-payments, payment-frequency, first-payment-due-date, last-payment-due-date (deleted at 3NF)

    PREMIUM DISCOUNT: POLICY-NUMBER, DISCOUNT-CODE, discount-amount

    DISCOUNT SCHEDULE: DISCOUNT-CODE, discount-schedule-amt

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-5

    Normalization An Example

    THE EXAMPLE The example on the facing page illustrates normalization of the Automobile Policy Premium File through first, second, and third normal forms. The normalization steps resulted in:

    • First normal form separated the premium-discounts group of attributes as a related entity because it is a repeating group.

    • Second normal form removed premium-cost-before-discounts and premium-cost-after-discounts from the premium-discount entity and placed them into auto-policy-premium. Both attributes are facts about the policy, and dependent only upon the policy-number key.

    • Second normal form separated discount-schedule as an entity related to premium-discount because discount-schedule-amt is not dependent on policy-number.

    • Third normal form deleted total-policy-cost because it can be derived as the sum of premium-cost-after-discounts and deferred-payment-service-cost.

    • Third normal form deleted last-payment-due-date because it can be derived from number-of-payments, payment-frequency, and first-payment-due-date.

    This example clearly contains some assumptions about the business rules governing this data. Normalization cannot be performed without a clear understanding of the business rules.
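
    The two third-normal-form deletions rely on the values being derivable on demand. The following is a minimal Python sketch of those derivations, not part of the course material. The field names mirror the attributes in the example; the mapping of payment-frequency values to a number of months (MONTHS_BETWEEN_PAYMENTS) and the end-of-month clamping rule are assumed business rules chosen only for illustration.

```python
import calendar
from dataclasses import dataclass
from datetime import date

# Assumed business rule: months between payments for each frequency value.
MONTHS_BETWEEN_PAYMENTS = {"monthly": 1, "quarterly": 3, "semiannual": 6, "annual": 12}


@dataclass(frozen=True)
class AutoPolicyPremium:
    """3NF auto-policy-premium: derived attributes are not stored."""
    policy_number: str
    deferred_payment_service_cost: float
    premium_cost_before_discounts: float
    premium_cost_after_discounts: float
    number_of_payments: int
    payment_frequency: str
    first_payment_due_date: date


def add_months(start, months):
    """Shift a date by whole months, clamping to the last day of short months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def total_policy_cost(premium):
    """Derive total-policy-cost as stated in the text."""
    return premium.premium_cost_after_discounts + premium.deferred_payment_service_cost


def last_payment_due_date(premium):
    """Derive last-payment-due-date from count, frequency, and first due date."""
    step = MONTHS_BETWEEN_PAYMENTS[premium.payment_frequency]
    return add_months(premium.first_payment_due_date, step * (premium.number_of_payments - 1))
```

    If the real business rules differ (for example, payments due on a fixed day of the month regardless of the first due date), the derivation changes, which is why the text stresses understanding the rules before normalizing.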

  • Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3A-18 The Data Warehousing Institute

    Structural Modeling Issues Time Modeling Examples

    [Figure: time modeling examples]

    Snapshots: OCT 2000 CUSTOMER and NOV 2000 CUSTOMER, each with customer nbr, customer first name, customer last name, customer gender, load date time stamp; or a single CUSTOMER with customer nbr, customer effective begin date, customer first name, customer last name, customer gender, load date time stamp

    Audit Trail: CUSTOMER (customer nbr, customer effective begin date, customer first name, customer last name, customer gender, load date time stamp)

    States: POLICY (policy number, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, premium-amount, service-amount), one of Pending Policy, Active Policy, Expired Policy, Suspended Policy, Terminated Policy

    Date Stamps: CLAIM ACTION (claim number, effective-begin-date, claim-action-event-date, load-date-time-stamp)

    Versions: SIZE OF CUSTOMER BASE (customer-count, household-count) with two product views. Product: PRODUCT (product-id, product-desc, product-name), PRODUCT LINE (line-code, line-description), LOB (lob-code, lob-name). Old Product: LOB (lob-code, lob-name), POLICY TYPE (type-code, policy-type-desc)

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-19

    Structural Modeling Issues Time Modeling Examples

    SNAPSHOTS The example illustrates a snapshot of customer. Two possible design techniques for recording data using the snapshot approach are shown. Customer-effective-begin-date is a row level metadata element used in one of the examples.

    VERSIONS The example illustrates a warehousing environment that retains two views of a product hierarchy: an old product view that organized products as policy-type within line-of-business, and a current product view with products grouped by product-line within line-of-business.

    AUDIT TRAIL This example illustrates an audit trail of changes to customer. The audit trail technique records the date/time the change was effective or known to the business.

    STATES In the example, a record of a policy is retained for each of several states throughout the policy life cycle. Possible states (thus, possible policy records) include pending, active, expired, suspended, and terminated. Two possible design techniques are shown.

    DATE STAMPS Notice the dates for claim action. The date stamp example illustrates three dates as follows:

    • effective-begin-date – When did the action become effective in the business?

    • claim-action-event-date – When was the action recorded in the claims processing system?

    • load-date-time-stamp – When was the action recorded in the warehousing data store?

    There are situations in which significant timing differences exist between business effective dates and source event dates. In these situations, both effective dates and event dates would be reflected in the model.
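
    The three claim action dates answer different questions, and the gaps between them show whether both effective and event dates are worth modeling. The following is a minimal Python sketch, not part of the course material: the ClaimAction fields mirror the attributes in the example, while the method names, the needs_both_dates helper, and the one-day threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ClaimAction:
    claim_number: str
    effective_begin_date: date       # when the action became effective in the business
    claim_action_event_date: date    # when the claims processing system recorded it
    load_date_time_stamp: datetime   # when the warehousing data store recorded it

    def recording_lag_days(self):
        """Days between business effectiveness and recording in the source system."""
        return (self.claim_action_event_date - self.effective_begin_date).days

    def load_lag_days(self):
        """Days between the source event and the warehouse load."""
        return (self.load_date_time_stamp.date() - self.claim_action_event_date).days


def needs_both_dates(actions, threshold_days=1):
    """True when any action shows a timing difference larger than the threshold."""
    return any(abs(a.recording_lag_days()) > threshold_days for a in actions)
```

    What counts as a significant timing difference is a business decision; the threshold here is only a placeholder for that judgment.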

    ACQUISITION METHODS

    The focus of this section is on the design techniques for handling time issues in warehousing data stores. The most common techniques are illustrated in more detail on subsequent pages. While these techniques are all possibilities for the modeler, no design can be created in a vacuum. The acquisition methods used have significant influence on the structural data model. For more details on acquisition techniques consider TDWI’s data acquisition course offerings.

  • Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3A-34 The Data Warehousing Institute

    Structural Modeling Issues Location Modeling Examples

    [Figure: location modeling examples]

    Partitioned to Distribute: CUSTOMER ANNUAL SUMMARY - SOUTHWEST and CUSTOMER ANNUAL SUMMARY - NORTHWEST, each with customer-id-number, customer-summary-year, customer-value-start-of-year, customer-value-end-of-year, total-policy-count-start-of-year, total-policy-count-end-of-year, auto-policy-count-start-of-year, auto-policy-count-end-of-year, residence-policy-ct-start-of-yr, residence-policy-ct-end-of-yr, total-yrs-as-a-customer

    Organization & Location Attributes: AUTOMOBILE POLICY (policy number, row start DT stamp, underwriter-employee-id, customer-zip-code, deductible-amt, um-amt, liability-amt, collision-amt, comprehensive-amt, premium-rate-limited-flag, special-rate-limits-flag, load-date-time-stamp)

    Partitioned to Secure: CUSTOMER (customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, customer-value, customer-share, lost-customer-ind, load date time stamp) and CUSTOMER SECURED (customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, pay-by-debit-account-number, pay-by-debit-bank-name, pay-by-debit-authorization-code, customer-credit-rating, customer-last-credit-check-date)

    Organization & Location Entities:

    Geographic Area: ZONE (zone-number, zone-name), REGION (rgn-code, rgn-name), DISTRICT (dist-number, dist-name), LOCATION (location-id, location-address)

    Organization: ZONE (zone-number, zone-name), REGION (rgn-code, rgn-name), DISTRICT (dist-number, dist-name), EMPLOYEE (employee-id, employee-name)

    Roles: EMPLOYEE (employee-id, employee-name) as UNDERWRITER, INSIDE AGENT, ADJUSTER

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-35

    Structural Modeling Issues Location Modeling Examples

    PARTITIONING FOR DISTRIBUTION

    This example illustrates a case where each region needs to see annual summary data for its customers, but has no need to see data for other regions. Separate but similar data structures are modeled for each region.

    PARTITIONED FOR SECURITY

    This example shows a situation where some customer data – banking data and credit ratings – are security sensitive and not able to be viewed by all warehouse users. Two distinct data structures are modeled, one for generally accessible data and one for sensitive data. Note that some attributes are included in both data structures.

    ORGANIZATION AND LOCATION ENTITIES

    The example illustrates both organization and location data included in a warehousing data model as structures of related entities. Data of these types typically serve both as business data that provides context for other warehousing data and as metadata that helps to meet needs for security and distribution.

    ORGANIZATION AND LOCATION ATTRIBUTES

    This example illustrates attributes that may be used to identify locations and roles. In this case, underwriter-employee-id describes a role and identifies a person in that role. Customer-zip-code may be used to identify the location of a customer.

    ROLES AS DATA This example illustrates multiple roles in which a warehouse user may act – underwriter, inside agent, and adjuster. Data about people and organizations makes up an important part of warehouse data, describing the various roles and responsibilities that they have in the warehousing processes. Person and organization data may also be present in the warehouse as business data.
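
    Partitioning for security, as described above, can be pictured as splitting one customer record into a generally accessible structure and a secured structure that share some attributes. The following is a minimal Python sketch, not part of the course material: the attribute names come from the CUSTOMER and CUSTOMER SECURED examples, while partition_for_security and the two tuples are hypothetical names.

```python
# Attributes carried in both structures (from the example entities).
SHARED_ATTRIBUTES = (
    "customer-id-number",
    "customer-record-begin-date",
    "customer-record-end-date",
    "customer-last-name",
    "customer-first-name",
    "customer-middle-name",
)

# Security-sensitive banking and credit attributes (from CUSTOMER SECURED).
SENSITIVE_ATTRIBUTES = (
    "pay-by-debit-account-number",
    "pay-by-debit-bank-name",
    "pay-by-debit-authorization-code",
    "customer-credit-rating",
    "customer-last-credit-check-date",
)


def partition_for_security(record):
    """Split one customer record into (general, secured) dictionaries."""
    general = {k: v for k, v in record.items() if k not in SENSITIVE_ATTRIBUTES}
    secured = {
        k: v
        for k, v in record.items()
        if k in SHARED_ATTRIBUTES or k in SENSITIVE_ATTRIBUTES
    }
    return general, secured
```

    The shared attributes let the two structures be matched when a user is authorized to see both, which is the reason some attributes appear in each.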

  • Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3A-40 The Data Warehousing Institute

    Structural Modeling Issues Usage Modeling Examples

    [Figure: usage modeling examples]

    Secondary Keys & Access Paths: CUSTOMER (customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, load date time stamp, household-id)

    Derived Data for Access: CUSTOMER (customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status, lost-customer-ind, load date time stamp) and RESIDENTIAL POLICY (policy number, policy-record-begin-date, policy-begin-date, coverage-begin-date, coverage-end-date, liability-coverage-amt, zone-id, high-risk-property-code)

    Summary & Partitioning for Access: CUSTOMER ANNUAL SUMMARY - SOUTHWEST and CUSTOMER ANNUAL SUMMARY - NORTHWEST, each with customer-id-number, customer-summary-year, customer-value-start-of-year, customer-value-end-of-year, total-policy-count-start-of-year, total-policy-count-end-of-year, auto-policy-count-start-of-year, auto-policy-count-end-of-year, residence-policy-ct-start-of-yr, residence-policy-ct-end-of-yr, total-yrs-as-a-customer

    Key Migration: POLICY (policy number, policy-record-begin-date, policy-record-end-date, line-of-business, policy-begin-date, coverage-begin-date, coverage-end-date, load date time stamp, customer-id-number), CUSTOMER (customer-id-number, customer-record-begin-date, customer-record-end-date, customer-last-name, customer-first-name, customer-middle-name, load date time stamp, household-id), HOUSEHOLD (household id, party number, row start DTstamp, row end DTstamp, load date time stamp)

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-41

    Structural Modeling Issues Usage Modeling Examples

    KEY MIGRATION The example illustrates the common resolution of one-to-many relationships, placing the primary key of household into the customer entity, and the primary key of customer into the policy entity.

    SECONDARY KEYS AND ACCESS PATHS

    This example shows customer-last-name and customer-first-name implemented as a secondary key. This makes customer data searchable by name, and provides access to customer data when a customer’s name is known but the id number is not known.

    DERIVED DATA FOR ACCESS

    In this example, lost-customer-indicator and high-risk-property-code are each derived values based on business rules. Implementing each as a secondary key supports searching and retrieval of lost customers and high risk properties.

    SUMMARY AND PARTITIONING FOR ACCESS

    The example shows customer summary data for two regions. Developing and storing the summary makes it readily accessible as information without requiring the warehouse user to develop complex queries or derive summaries at the time of access. Separating the data by region makes it easy for each region to view its own data without needing to filter out data from other regions.
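
    The secondary key on customer name described above is, in effect, an alternate lookup path to the same rows. The following is a minimal Python sketch, not part of the course material: it models the secondary key as an in-memory index, the function names build_name_index and find_by_name are hypothetical, and case-insensitive matching is an assumption. In a DBMS the same idea would normally be implemented as a non-unique index.

```python
from collections import defaultdict


def build_name_index(customers):
    """Map (last name, first name) to the customer-id-numbers that carry it."""
    index = defaultdict(list)
    for customer in customers:
        key = (
            customer["customer-last-name"].lower(),
            customer["customer-first-name"].lower(),
        )
        index[key].append(customer["customer-id-number"])
    return index


def find_by_name(index, last_name, first_name):
    """Return the ids of customers with this name; the key is not unique."""
    return index.get((last_name.lower(), first_name.lower()), [])
```

    Unlike the primary key, a name can match several customers, so the lookup returns a list rather than a single row.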

  • Design and Specification Modeling Concepts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3A-42 The Data Warehousing Institute

    Structural and Physical Optimization Issues Meeting Needs of People, Platforms and Processes

    [Figure: optimization issues to consider for each kind of data store: STAGING DATA, DATA WAREHOUSE, and DATA MARTS (RELATIONAL and DIMENSIONAL)]

    STRUCTURAL, time (data currency, time based summary, retention of history): what are the impacts of time on each kind of data store?

    STRUCTURAL, location (security, distribution): how do security & distribution needs affect each kind of data store?

    STRUCTURAL, usage (access, navigation): how do access & navigation needs affect each kind of data store?

    PHYSICAL, implementation (toolset, performance, size, availability, backout & recovery, DBMS): how does each data store need to be optimized for implementation platforms?

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Design and Specification Modeling Concepts

    The Data Warehousing Institute 3A-43

    Structural and Physical Optimization Issues Meeting Needs of People, Platforms and Processes

    OPTIMIZATION CHALLENGES

    Optimizing warehousing databases is a challenging task, seeking to balance the frequently conflicting needs of people, processes, and platforms. Balancing among requirements for performance, availability, size, recovery, and best use of the DBMS is difficult enough. For warehouses and data marts, access tools add to optimization challenges. Further, the balance must be achieved without severely affecting the structural adjustments that have been made to satisfy time, location, and usage requirements.

    OPTIMIZATION FOR DATA STAGING

    Remember that the purpose of staging data is to receive data into the warehousing environment. It is typically highly detailed data, of high volume, and the primary means to capture history. The most common optimization needs are to manage the database size and the processing performance of loads and extracts. Restart and recovery are significant staging data considerations. Staging data is kept at or near the third normal form. Derived, aggregate, and summary data structures are avoided, as they negatively affect both database size and process performance. There is no need to optimize staging data for user access, and short lapses of availability are generally not user visible.

    OPTIMIZATION FOR DATA WAREHOUSE

    The role of warehouse data is distribution – warehouses are usually optimized for distribution of data to data marts. Data may be at a higher grain than for staging data, and span of historical data may be smaller. Any size gains of higher granularity and reduced history may be offset by increased redundancy. Shared derivations, aggregates, and summaries are important parts of data integration and are appropriate warehouse data structures. These factors, combined with the potential for user access to the warehouse, demand a careful balance among warehouse optimization factors. Size, availability, extract and load performance, and query and access performance may all be important for warehouse data. Staging data is the foundation of warehouse recovery strategy.

    OPTIMIZATION FOR DATA MARTS

    Data in marts is intended for delivery of information. Marts are first optimized for access – biased toward the people factors. Data marts contain subsets of the data in the warehouse, often at higher levels of summary, so size becomes less significant. Performance of query and analysis is much more important than extract and load performance. Availability is essential – data marts exist for user access! Warehouse data is the foundation of data mart recovery.

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts

    The Data Warehousing Institute 3B-1

    Module 3 – Unit B Designing Data Marts

    Topic Page

    Modeling Overview 3B-2

    Modeling Relational Data Marts 3B-6

    Optimizing Relational Data Marts 3B-8

    Implementing Relational Data Marts 3B-12

    Modeling Dimensional Data Marts 3B-14

    Optimizing Dimensional Data Marts 3B-24

    Implementing Dimensional Data Marts 3B-30

  • Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3B-6 The Data Warehousing Institute

    Modeling Relational Data Marts Logical Modeling – Entities, Relationships, and Attributes

    [Figure: logical modeling of a relational data mart]

    Steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes

    Inputs: data needs and modeling context, drawn from Target Configuration, Business Questions, & F/Q Matrix and from Information Needs & Warehouse Model

    Business questions:

    • Show all customers with auto policies but not other LOBs.

    • Who are profitable customers and who are costly customers?

    • Which customers have more than 1 policy? Which in more than 1 LOB?

    • What is the total customer count across LOBs? And total household count?

    • When we lose a customer in one LOB do we lose all of their business?

    POLICY: policy number, line-of-business, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, last-policy-update-date, premium-amount, service-amount, cost of claims, cost of services

    CUSTOMER: customer-id-number, customer-last-name, customer-first-name, customer-middle-name, age-group, income-group, gender, marital-status

    HOUSEHOLD: household id, party number

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts

    The Data Warehousing Institute 3B-7

    Modeling Relational Data Marts Logical Modeling – Entities, Relationships, and Attributes

    IDENTIFY, NAME, & DESCRIBE ENTITIES

    When modeling non-metric data marts, an entity is a class of things that provide all or part of the answers to one or more business questions. These entities will be different from those of operational and staging data, and they may differ from the entities in the warehouse. A typical data mart model contains a subset of the warehouse entities, and includes new entities that are products of summarization, aggregation, or derivation.

    IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS

    Relationships are associations among entities that have relevance to the business questions. In a data mart, each relationship needs to have a role in responding to business questions, and must be implemented by or derivable from warehouse data. When new entities are created by aggregation, derivation, or summarization, any relationships in which they participate must also be derived.

    IDENTIFY, NAME, & DESCRIBE ATTRIBUTES

    Attributes are the properties of entities that represent business facts needed to answer business questions. Each attribute in the mart data model must have a role in responding to business questions, and must be implemented by or derivable from the warehouse data.

    CONFORMED ENTITIES AND ATTRIBUTES

    When modeling multiple data marts, some standards across marts with respect to entities and attributes may be helpful. Warehouse users may be confused when the same entity is named differently, or has a different identifier from one mart to the next. Similarly, an attribute with the same meaning but different names, or the same name but different meanings is confusing. Where practical, use of entities and attributes that conform to a standard enhances usability of the individual mart and the entire warehousing environment.

    TRIAGE & NORMALIZATION

    Triage is not normally applied when modeling data marts. Normalization is not necessary for data marts. They are typically de-normalized in the ways that do the best job of presenting data as information. This does not mean that sound design practices should be abandoned. De-normalizing should be done purposefully. It should not occur by accident or oversight.

  • Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3B-16 The Data Warehousing Institute

    Modeling Dimensional Data Marts Logical Modeling - Dimensions

    Business Question: What is the total customer count by product across all lines of business? What is the total household count?

    [Figure: fact/qualifier matrix excerpt with qualifiers LOB, Product Line, and Product against facts customer count, customer id, customer name, claim count, and household count]

    Model the Dimensions

    Product: LOB, Product Line, Product

    Organization: Region, District, Zone

    Time: Year, Quarter, Month

    Meter: SIZE OF CUSTOMER BASE (customer-count, household-count)

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts

    The Data Warehousing Institute 3B-17

    Modeling Dimensional Data Marts Logical Modeling - Dimensions

    IDENTIFY AND NAME DIMENSIONS

    Dimensions are the perspectives by which facts may be accessed, selected, sequenced, grouped, filtered for analysis, and presented in a business context. A dimension is typically a multiple-level, hierarchical structure that is the basis of leveled summaries of data and drill-down types of analysis. The fact/qualifier matrix and the business questions help determine the dimensions of interest. Where the meter represents a grouping of facts from the matrix, a dimension represents a grouping of qualifiers. Dimensions may be named either generically (e.g., product, customer, etc.) or with business specific names (e.g., policy, policyholder). Note that fact/qualifier analysis and staging data design contribute significantly to identification and to the dimensional analysis and design activities described below.

    DESCRIBE THE DIMENSION

    As each dimension is identified it is useful to briefly describe some of its properties. At minimum, consider these questions: How volatile or stable is the dimension? Is it a conformed dimension?

    IDENTIFY & NAME DIMENSION LEVELS

    Determine the levels of the dimension and name each level with a term that is descriptive and business-oriented. Conformed dimensions (discussed later) already have identified levels.

    DEVELOP HIERARCHIES

    Structure the dimension as a hierarchy of parent/child relationships. Be careful not to overlook any levels of hierarchy that have business meaning. Realize that a single dimension type may have multiple hierarchies. Conformed dimensions already have a prescribed hierarchy.

    IDENTIFY & NAME LEVEL IDENTIFIERS

    For each dimension level, determine what attribute(s) is used as its identifier. Dimension levels are entities, and need unique identifiers just as all other entities do.

    IDENTIFY & NAME CHARACTERISTICS

    Determine what attributes of each dimension level are desirable. Most dimension levels have an attribute that is a description or name. Other characteristics of value may be identified by business people, examining information needs and business questions, and through triage.

    IDENTIFY VALUES Identify the allowable set of values for dimension level identifiers. This helps to fully understand the dimension and is needed when optimizing.

    ASSOCIATE DIMENSIONS

    Link the lowest level of each dimension to the meter. The dimension to meter association is always one-to-many.
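
    The steps above treat a dimension as a chain of levels, each with its own identifier and a parent level, with only the lowest level linked to the meter. The following is a minimal Python sketch, not part of the course material: the level names and identifiers follow the product dimension used in the examples, while DimensionLevel, drill_path, and lowest_level are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimensionLevel:
    name: str                                   # business-oriented level name
    identifier: str                             # attribute that identifies the level
    parent: Optional["DimensionLevel"] = None   # next level up in the hierarchy


def drill_path(level):
    """Return level names from the top of the hierarchy down to this level."""
    path = []
    while level is not None:
        path.append(level.name)
        level = level.parent
    return list(reversed(path))


def lowest_level(levels):
    """Return the level that is not the parent of any other level."""
    parents = {lvl.parent for lvl in levels if lvl.parent is not None}
    candidates = [lvl for lvl in levels if lvl not in parents]
    if len(candidates) != 1:
        raise ValueError("a single hierarchy must have exactly one lowest level")
    return candidates[0]


# Product dimension from the examples: LOB > PRODUCT LINE > PRODUCT.
LOB = DimensionLevel("LOB", "lob-code")
PRODUCT_LINE = DimensionLevel("PRODUCT LINE", "line-code", parent=LOB)
PRODUCT = DimensionLevel("PRODUCT", "product-id", parent=PRODUCT_LINE)
```

    Here lowest_level identifies PRODUCT as the level to associate with the meter. A dimension with multiple hierarchies would need one such chain per hierarchy.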

  • Designing Data Marts TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3B-22 The Data Warehousing Institute

    Modeling Dimensional Data Marts Logical Model Example

    Business Question: What is the total customer count by product across all lines of business? What is the total household count?

    [Figure: logical dimensional model]

    Meter: SIZE OF CUSTOMER BASE (customer-count, household-count)

    Product: LOB (lob-code, lob-name), PRODUCT LINE (line-code, line-description), PRODUCT (product-id, product-desc, product-name)

    Time: YEAR (year-number), QUARTER (quarter-number), MONTH (month-number)

    Geographic Area: ZONE (zone-number, zone-name), REGION (rgn-code, rgn-name), DISTRICT (dist-number, dist-name)

    LOB-CODE domain: A = auto insurance, R = residential insurance, L = life insurance

    LINE-CODE domain: F = fleet auto, P = personal auto, L = life insurance

    PRODUCT-ID domain: F = fleet auto, P = personal auto, L = life insurance

    YEAR-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance

    QUARTER-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance

    MONTH-NUMBER domain: 01 = January, 02 = February, 03 = March, 04 = April

    ZONE-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance

    DIST-NUMBER domain: F = fleet auto, P = personal auto, L = life insurance

    RGN-CODE domain: NW = northwest, SW = southwest

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Marts

    The Data Warehousing Institute 3B-23

    Modeling Dimensional Data Marts Logical Model Example

    AN EXAMPLE The diagram on the facing page illustrates a logical dimensional model for a data mart. Of note in this model:

    • The meter is Size of Customer Base.

    • The measures (facts) are customer-count and household-count.

    • Measures are sensitive to three dimensions: product, geographic area, and time.

    • Each dimension is a multi-level hierarchy.

    • Each dimension level has a known identifier.

    • Domain of values for each dimension level identifier is documented.

    • Most dimension levels have either a name or a description as characteristics. One dimension level has multiple characteristics.

    • Not all dimensions are explicitly referenced in the business question. Further investigation is necessary to determine that the business wants to track customer base across time and be able to compare customer base for different geographic areas.

    • Although product line is not explicitly mentioned in the business question, product is a conformed dimension, and all levels of the hierarchy are included in the model.

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Warehouses

    The Data Warehousing Institute 3C-1

    Module 3 – Unit C Designing Data Warehouses

    Topic Page

    Modeling Overview 3C-2

    Modeling Relational Data Marts 3C-8

    Optimizing Relational Data Marts 3C-10

    Implementing Relational Data Marts 3C-14

  • Designing Data Warehouses TDWI Data Modeling: Data Warehouse Design & Analysis Techniques

    3C-2 The Data Warehousing Institute

    Modeling Overview Purpose and Deliverables

    [Figure: three-tier warehousing, spanning integration through access. Tier 1 Data Intake: Source Data, Data Staging Processes, Persistent Staging Data. Tier 2 Data Distribution: Warehouse Population Processes, Data Warehouse. Tier 3 Information Delivery: three Data Mart Population Processes, each feeding a Data Mart.]

  • TDWI Data Modeling: Data Warehouse Design & Analysis Techniques Designing Data Warehouses

    The Data Warehousing Institute 3C-3

    Modeling Overview Purpose and Deliverables

    WAREHOUSE PROPERTIES

    Remember that the primary role of a data warehouse is distribution of data to data marts. A warehouse is by definition integrated, subject-oriented, non-volatile, and time variant. Ideally, a warehouse contains data that is:

    • Cleansed – Data quality and data cleansing rules have been applied.

    • Base Data – The warehouse contains the lowest level of data granularity that is needed to answer any business question, which may not be atomic data. It may also contain summary data at a higher level of granularity than the base data.

    • Standardized – Standard (conformed) data structures are identified and implemented. Common derivations are identified and applied. Common summaries are identified and implemented.

    TYPE OF MODEL Warehouse data is modeled using entity/relationship techniques. Desired characteristics of integration and subject-orientation are both readily supported by E-R modeling.

    NORMALIZATION At the logical level, a second-normal-form data model is typical for data warehouses. The third-normal-form would eliminate derived data and summary data that are desirable in the data warehouse. Structural and physical modeling may de-normalize to meet optimization needs. The resultant model may include aggregate structures that violate the second-normal form. First-normal-form violations are not typical in data warehouses, but may be introduced in the form of data arrays.

  • Designing Data Warehouses (page 3C-6)

    Modeling Warehouse Data: Logical Modeling – Entities, Relationships, and Attributes

    [Figure: warehouse logical data model (E-R diagram) alongside the modeling steps. Entities and attributes as recovered from the flattened diagram:]

    LOCATION: location number, street-address, city, state, postal-code, phone-number, zone id

    PARTY: pty-id-number, organization-name, person-last-name, person-first-name, person-middle-name, age-group, income-group, gender, marital-status

    PARTY HOUSEHOLD: household id, party number, row start DTstamp

    POLICY/LOCATION: policy number, location number, location usage code

    POLICY: policy number, policy-type-code, policy-begin-date, coverage-begin-date, coverage-end-date, ar-acct-number, party number

    RESIDENTIAL POLICY: policy number, loss-payee-name, loss-payee-address, liability-coverage-amt

    AUTOMOBILE POLICY: policy number, leinholder-name, leinholder-contact, premium-rate-limited-flag, wiley-special-rate-flag

    VEHICLE: vin, vehicle-record-begin-date, policy number, make, model, antilock-brakes-indicator, airbags-code, load date time stamp, row start DTstamp

    PARTY INTEREST: pty-id-number, policy-number, party-role

    PARTY/LOCATION: policy number, location number, location usage code

    Relationship label shown: "maybe one".

    Modeling steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes. Inputs: data needs (Target Configuration, Business Questions & FQ Model); modeling context (Information Needs & Staging Model).

  • Designing Data Warehouses (page 3C-7)

    Modeling Warehouse Data: Logical Modeling – Entities, Relationships, and Attributes

    IDENTIFY, NAME, & DESCRIBE ENTITIES

    An entity in a data warehouse model is a class of things about which business information is needed. These entities will be different from those of operational systems, and they may differ from those of the staging and mart data models. Warehouse entities seek to integrate data, completing any integration not done in data staging, and may recognize collective entities such as party and household. Entity identification when modeling warehouse data uses three distinct streams of input:

    • Information needs provide the context for modeling. If a warehouse entity is a class of things about which information is needed, then information needs are central to the modeling activity.

    • The configuration of targets, combined with the business questions that each target is intended to answer, guides specific data requirements. The fact/qualifier model for those business questions provides specific data needs. The warehouse must contain the data necessary to populate its dependent marts, and the marts must contain the data needed to answer business questions.

    • The staging data models (logical and structural) provide knowledge of data availability. The data that has been received into the warehousing environment, and that is available to populate the warehouse, is identified by these models.

    IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS

    Relationships are associations among entities that have meaning to the business. For warehouse data, each relationship needs to have a role in providing business information, and must be implemented by or derivable from staging data. Clearly, relationships of new entities (party, household, etc.) must be derived.

    IDENTIFY, NAME, & DESCRIBE ATTRIBUTES

    Attributes are the properties of entities that represent business facts needed to answer business questions. Each attribute in the warehouse data model must have a role in providing business information, and must be implemented by or derivable from the staging data. The attributes of newly identified entities obviously must be derived from staging data. Also note that metadata in the staging model may become business data in the warehouse. For example, row-start-DT-stamp in the staging model becomes policy-record-begin-date in the warehouse model.
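    The last point can be sketched as a simple column mapping applied while populating the warehouse. Only the row-start-DT-stamp to policy-record-begin-date pair comes from the text; the function name, the sample row, and the use of Python dictionaries are assumptions made for illustration.

```python
# Staging column -> warehouse column. Only this one pair is taken from the
# text; any further pairs would be project-specific assumptions.
STAGING_TO_WAREHOUSE = {
    "row-start-DT-stamp": "policy-record-begin-date",
}

def to_warehouse_row(staging_row: dict) -> dict:
    """Rename mapped staging columns and pass all other columns through."""
    return {
        STAGING_TO_WAREHOUSE.get(column, column): value
        for column, value in staging_row.items()
    }

# Hypothetical staging row for a policy.
staging_policy = {
    "policy number": "P-1001",
    "policy-type-code": "AUTO",
    "row-start-DT-stamp": "2003-05-01T00:00:00",
}

warehouse_policy = to_warehouse_row(staging_policy)
```

    After the mapping, the time stamp that was load metadata in staging is carried as an ordinary business attribute of the policy record.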

  • Designing Data Staging Areas (page 3D-1)

    Module 3 – Unit D: Designing Data Staging Areas

    Topic Page

    Modeling Overview 3D-2

    Modeling Relational Data Marts 3D-8

    Optimizing Relational Data Marts 3D-10

    Implementing Relational Data Marts 3D-14

  • Designing Data Staging Areas (page 3D-2)

    Modeling Overview: Purpose and Deliverables

    [Figure: warehousing architecture diagram. Labels: Data Staging Processes; Warehouse Population Processes; Data Mart Population Process (three); Data Mart (three); Data Warehouse; Persistent Staging Data; Source Data; Tier 1 Data Intake; Tier 2 Data Distribution; Tier 3 Information Delivery; Access; Integration.]

  • Designing Data Staging Areas (page 3D-3)

    Modeling Overview: Purpose and Deliverables

    STAGING DATA PROPERTIES

    Recall the definition of staging data. A staging area is any data store that is designed primarily to receive data into a warehousing environment. A good data staging strategy includes a staging area that is:

    • Persistent - Staging data is retained as long as it may have historical value (typically for the life of the enterprise).

    • Atomic - Staging data is captured at the finest grain available.

    • Subject Oriented - Staging data is organized by business subjects.

    • Adaptable - Staging data is designed to accommodate both known and unknown needs for information.

    • Extensible - The scope of data expands as new data sources are introduced into the warehousing program.

    Ideally, staging data begins to assume the properties that are desirable in warehousing data:

    • Subject-Oriented – Organized around business subjects, independent of the applications from which it is extracted.

    • Integrated – Combining data from multiple sources into a single business view.

    • Time-variant – Providing multiple “point-in-time” views of the data.

    A persistent staging area that is time-variant provides a sound way to address needs for enterprise history. Staging data also plays an important role in archival strategy.
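    The persistent and time-variant properties can be sketched as an append-only table in which every row version carries a start time stamp. This is a minimal illustration, not a staging design: the in-memory list, the functions stage and as_of, and the sample values are assumptions; only the attribute names echo those used in the course diagrams.

```python
from datetime import datetime

# Persistent, time-variant staging: rows are only appended, never updated,
# and each version carries the time stamp at which it became effective.
staging_policy = []  # list of dicts acting as a hypothetical staging table

def stage(policy_number, premium_amount, row_start):
    """Append a new version of a policy row."""
    staging_policy.append({
        "policy number": policy_number,
        "premium-amount": premium_amount,
        "row start DTstamp": row_start,
    })

def as_of(policy_number, point_in_time):
    """Return the version of a policy that was current at point_in_time."""
    versions = [
        row for row in staging_policy
        if row["policy number"] == policy_number
        and row["row start DTstamp"] <= point_in_time
    ]
    # The latest version that started on or before the requested time wins;
    # None means the policy was not yet known at that time.
    return max(versions, key=lambda row: row["row start DTstamp"], default=None)

stage("P-1001", 500, datetime(2002, 1, 1))
stage("P-1001", 550, datetime(2003, 1, 1))

mid_2002 = as_of("P-1001", datetime(2002, 6, 30))
mid_2003 = as_of("P-1001", datetime(2003, 6, 30))
before_first = as_of("P-1001", datetime(2001, 6, 30))
```

    Because earlier versions are never overwritten, the same table answers both the current view and any historical point-in-time view.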

    TYPE OF MODEL

    Staging data is modeled using entity/relationship techniques. Desired characteristics of adaptability, extensibility, subject-orientation, and integration are all well supported by E-R modeling.

    NORMALIZATION

    At the logical level, a third-normal-form data model satisfies staging data model needs. Structural and physical modeling may de-normalize to meet optimization needs. The resultant model is typically near third normal form, needing only limited adjustments to optimize.

  • Designing Data Staging Areas (page 3D-6)

    Modeling Staging Data: Logical Modeling – Entities, Relationships, and Attributes

    [Figure: staging logical data model (E-R diagram) alongside the modeling steps. Entities and attributes as recovered from the flattened diagram; the grouping of attributes under AUTOMOBILE POLICY follows the order of the extracted text.]

    POLICY: policy number, policy-type-code, policy-begin-date, coverage-begin-date, coverage-end-date, policy-term, last-policy-update-date, premium-amount, service-amount, renewal-plan-code, auto-renew-bill-date, agent-renew-call-date, auto-term-notify-date, ar-acct-number

    AUTOMOBILE POLICY: policy number, leinholder-name, leinholder-contact, deductible-amt, um-amt, liability-amt, collision-amt, comprehensive-amt, premium-rate-limited-flag, wiley-special-rate-limits-flag

    VEHICLE: vin, policy number, make, model, year, type, usage, antilock-brakes-indicator, airbags-code

    PROPERTY: policy number, [property-address, property-city, property-county, property-state], property-type-code, legal-description, family-count, residential-bldg-count, non-residential-bldg-count, emergency-svc-distance

    RESIDENTIAL POLICY: policy number, property-coverage-amt, proprty-benefits-basis-code, quake-addl-coverage-amt, flood-addl-coverage-amt, wind-addl-coverage-amt, contents-coverage-amt, jewelry-addl-coverage-amt, furs-addl-coverage-amt, arts-addl-coverage-amt, equipmt-addl-coverage-amt, other-addl-coverage-amt, liability-coverage-amt, total-addl-coverages-amt

    Relationship label shown: "maybe one".

    Modeling steps: identify, name & describe Entities; identify, name & describe Relationships; identify, name & describe Attributes. Inputs: data availability (Source Data Models); modeling context (Information Needs & Subject Model); data needs (FQ Analysis & Business Questions); business events (State Transition Models).

  • Designing Data Staging Areas (page 3D-7)

    Modeling Staging Data: Logical Modeling – Entities, Relationships, and Attributes

    IDENTIFY, NAME, & DESCRIBE ENTITIES

    In E-R modeling, an entity is generally defined as a class of things about which data is needed. To model staging data, that definition may be extended: an entity in a staging data model is a class of things about which data is received into the warehousing environment. Staging entities may be different from operational entities, the classifications by which operational systems manage the data. They may also be different from data warehouse entities, the classifications by which information is delivered to the business. Entity identification when modeling staging data uses four distinct streams of input:

    • The warehousing subject model (and the corresponding information needs) provides the context for modeling. Warehousing subjects are abstractions of entity groups. Each staging entity needs to belong to one of the subjects.

    • Business questions (and supporting fact/qualifier analysis) provide the specifics of data requirements. The data needed is that which is necessary to answer the business questions.

    • The source data models (logical and structural) provide knowledge of data availability. The data that may be received into the warehousing environment is identified by these models.

    • State transition models, when available, provide understanding of the business events and their data impacts. When state models aren’t available, it is sometimes advisable to develop them as part of the staging data modeling effort.

    IDENTIFY, NAME, & DESCRIBE RELATIONSHIPS

    Relationships are associations among entities that have meaning to the business. For staging data each relationship needs to be visible in the source data – either implemented by one or more data sources or derivable from those sources. To increase adaptability of staging data, include all possible relationships for which source data is available.

    IDENTIFY, NAME, & DESCRIBE ATTRIBUTES

    The minimum requirement of staging data is to include every attribute that is needed to answer the business questions. Be certain that this minimum set of attributes is identified and modeled. The ideal for data staging is to implement all useful attributes from a data source at the time that source is first used to provide warehousing data. Full identification of staging data attributes is accomplished through triage.
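    The minimum requirement amounts to a set comparison between the attributes the business questions need and the attributes the staging model contains. The sketch below assumes both lists are already available as plain sets; the business questions, the agent-id attribute, and the function missing_attributes are hypothetical and only illustrate the check.

```python
# Attributes each business question needs (hypothetical examples).
needed_by_question = {
    "premium by policy type": {"policy-type-code", "premium-amount"},
    "vehicles with airbags per policy": {"policy number", "airbags-code"},
    "policies by agent": {"policy number", "agent-id"},
}

# Attributes currently present in the staging model (hypothetical subset).
staging_attributes = {
    "policy number", "policy-type-code", "premium-amount", "airbags-code",
}

def missing_attributes(needed: dict, modeled: set) -> dict:
    """Map each business question to the attributes the model lacks."""
    return {
        question: sorted(attrs - modeled)
        for question, attrs in needed.items()
        if attrs - modeled
    }

gaps = missing_attributes(needed_by_question, staging_attributes)
```

    An empty result means the minimum set is covered; any entry names a business question that the staging model cannot yet support.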

  • Data Modeling and Design Summary (page 4-1)

    Module 4: Data Modeling and Design Summary

    Topic Page

    Modeling Overview 2.4-2

    Modeling Relational Data Marts 2.4-4

    Deliverables Summary 2.4-8

  • Data Modeling and Design Summary (page 4-8)

    Deliverables Summary: Modeling Deliverables Checklist

    Deliverable – Without this the…

    • Business Drivers & Goals – Business reasons for undertaking a data warehousing program are not clearly articulated. Limits business value. Creates risk that business people will not accept the data warehouse.

    • Information Needs – Information structure of the warehouse is not linked to clearly expressed needs of the business, and information needs are intangibles. Risk is “information silos” that fail to achieve subject orientation and integration, and inability to prioritize needs, plan increments, and select the right data sources.

    • Business Questions – Information needs aren’t made concrete and tangible. The transition from needs analysis to design of data structures is difficult, and the quality of the results uncertain.

    • Source Composition Model – Data sources are not inventoried and grouped by subject. It is unclear how sources relate to the subjects that provide the basis of subject-orientation. Risks incompleteness and inaccuracy in data sourcing strategy.

    • Subject Area Model – The warehousing program is absent any high-level groupings of data as subjects of business interest. High level of risk that the warehouse data structures will not be adequately subject-oriented.

    • Fact/Qualifier Matrix – Business questions are not formally analyzed to understand their data components and data usage. Increases risk of jumping to solutions without understanding the business problem. Jeopardizes completeness, correctness, and adaptability of warehousing solutions.

    • Warehouse Targets Configuration – Parts of a total warehousing solution are developed without a clear picture of the overall structure. Increases risk of “misfit” warehousing components, continuous rework, unstable warehousing environment, and user dissatisfaction.

    • Structure of Data Store (matrix) – Contents and structure of some data sources are not understood. These sources cannot be integrated into the source data model. Reduces opportunity to choose the best source. Increases risk that source data is misunderstood and translated to misinformation in the warehouse.

    • Source Logical Model – Source data contents and relationships are not understood at a business level. A single business fact cannot be traced to multiple, redundant sourcing options. Risks incompleteness of data acquisition and data cleansing solutions. Inhibits ability to analyze impacts of and respond to changing source data structures.

    • State Transition Diagram – Entities are not understood in context of life cycles and business processes. Risks incompleteness in warehousing target designs. Increases probability of “out-of-synch” data in the warehouse.

    • Staging Logical Model (ERM) – Underlying structures for data staging are casually and informally designed. Risks rework and instability in the data staging environment, leading to greater instability and potential inaccuracies in the overall warehousing environment.

  • Data Modeling and Design Summary (page 4-9)

    Deliverables Summary: Modeling Deliverables Checklist

    Deliverable – Without this the…

    • Warehouse Logical Model (ERM) – Underlying structures for the data warehouse are casually and informally designed. Risks rework and instability in the data warehouse, which is disruptive to dependent data marts, and frustrating for users who directly access the warehouse.

    • Data Mart Logical Model (ERM) – Relational data structures and data mart subject orientation are not fully understood. Relational data mart designs occur at a physical level without benefit of logical and structural analysis and design. Consistency of relationships across multiple marts is at risk, and individual data mart designs may be physically inadequate.

    • Data Mart Logical Model (DDM) – Dimension hierarchies and dimensional relationships are not fully understood. Dimensional data mart designs occur at a physical level without benefit of logical and structural analysis and design. Conformity of dimensions across multiple marts is at risk, and individual data mart designs may be physically inadequate.

    • Staging Structural/Physical Model (ERM) – Time requirements for staging data are not formally analyzed, and the comprehensive staging design is not precisely specified as a cohesive set of physical tables. Risks sub-optimal data staging implementation, impacting overall performance and adaptability of the warehousing application.

    • Warehouse Structural/Physical Model (ERM) – Time, location, and usage requirements for the warehouse are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Strategy for feeding dependent data marts is jeopardized, and user satisfaction with the warehouse is put at risk.

    • Data Mart Structural/Physical Model (ERM) – Time, location, and usage requirements for the mart are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Users may be surprised (sometimes unpleasantly) at the physical implementation.

    • Data Mart Structural/Physical Model (DDM) – Time, location, and usage requirements for the mart are not formally analyzed and expressed as part of a comprehensive design, and that design is not precisely specified as a set of physical tables. Users may be surprised (sometimes unpleasantly) at the physical implementation.
