Data Vault for EDW

By Raphael Klebanov, WhereScape, Inc.

Post on 04-Jun-2018

Agenda

Review of various Data Warehouse models in conjunction with their place in the modern data warehousing methods.

The Data Vault, as a preferred “flavor” of the Enterprise Data Warehouse for different businesses.

Overview of the Data Vault concepts and objects.

Real-world example where the Data Vault was chosen to replace a more “traditional” architecture for EDW

Principal Data Flow (simplified)

Different “Flavors” of an EDW

Dimensional Model (DM): Ralph Kimball, 1996. The Data Warehouse Toolkit.

Third Normal Form (3NF): Bill Inmon, 1981. Effective Data Base Design.

Data Vault (DV): Dan Linstedt, 2000. Data Vault Series.

Dimensional Model (DM)

Collection of all data marts within the enterprise. Information is always stored in the DM.

In a DM, transaction data is separated into either

"facts": numeric transaction data, or

"dimensions": reference information that gives meaning to the facts.
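The fact/dimension split can be sketched with Python's standard-library sqlite3 module. This is only an illustration: the table names, column names and sample rows (dim_product, fact_sales and so on) are assumptions, not part of the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: reference information that gives meaning to the facts.
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL,
        category     TEXT NOT NULL
    );
    -- Fact: numeric transaction data, keyed by its dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        sale_date   TEXT NOT NULL,
        quantity    INTEGER NOT NULL,
        amount      REAL NOT NULL
    );
""")

# Hypothetical sample data.
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        (1, "2010-01-05", 2, 20.0),
        (2, "2010-01-05", 1, 15.0),
        (3, "2010-01-06", 4, 40.0),
        (1, "2010-01-07", 1, 10.0),
    ],
)


def sales_by_category(connection):
    """Aggregate the numeric facts, grouped by a dimension attribute."""
    rows = connection.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_key = f.product_key
        GROUP BY d.category
        ORDER BY d.category
    """).fetchall()
    return dict(rows)
```

A single join from the fact table to a dimension is enough to answer a business question, which is the ease of direct querying that the DM approach is known for.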

Key Aspects of a DM Approach

Clarity of design to both developers and business.

Optimal design for most BI tools.

Ease of querying the database directly.

Main Drawbacks of the DM

Complicated to maintain the integrity of facts and dimensions.

Expensive to modify the DW structure when business rules or requirements change.

Data scrubbing for conformed dimensions is challenging.

Difficult to load type 2 dimensions in real-time.

Conclusion on DM

These characteristics mean that the DM is mainly used for data marts and access/presentation layers. A DW can be built with this approach when data volumes are small and business structures are stable.

Third Normal Form (3NF)

Central DW referred to as the Corporate Information Factory (CIF).

An enterprise has one centralized EDW, and data marts obtain their information from the EDW. In the EDW, information is usually stored in 3NF (Codd's third normal form).

In 3NF, the data in the DW is stored by database normalization rules. Tables are grouped together in subject areas that reflect general data categories.

Key Aspects of a 3NF Approach

It is more straightforward to add information into the database containing full historical data from the operational systems.

The data structures are more resilient to change since data should only appear in one table (i.e., the data is normalized).

Due to the optimized, normalized structure, near-real-time/real-time (NRT/RT) and VLDB loading are supported in most cases.

The Main Drawbacks of the 3NF

The disadvantages of this approach stem from the number of tables involved.

It is difficult for users to join data from different sources into meaningful information.

Consequently, it is difficult to access the information without an exact understanding of the data sources and of the data structure of the data warehouse.

Inflexibility (brittleness) of the 3NF Data Model.

Conclusion on 3NF Model

These characteristics lead to the conclusion that the 3NF model is mainly used for Operational Data Stores rather than for the EDW.

Data Vault (DV)

Designed to avoid or minimize the issues and disadvantages of both the DM and 3NF approaches.

DV Modeling is a method of designing an EDW to provide historical storage of data coming in from many operational systems with complete tracing of the origin of all the data coming into the database.

This method has proved to be highly adaptable to change in the business environment.

The Data Vault is organized around Business Keys.

Key Aspects of a DV Approach

Less complicated EDW loads resulting in greater stability and performance.

Improved flexibility allowing EDW to more easily adapt to changes in the business.

More suitability for incremental implementation (Agile DW) ensuring quicker delivery of business value.

Due to the highly granular nature of the DV model, it sustains Very Large Database (VLDB) capability, so no redesign is needed as the EDW matures.

Main Drawbacks of the DV

A large number of joins, which makes maintenance of the database more demanding.

The modeling rules must be followed strictly, because a small deviation from the DV Business Rules might cause serious damage to the whole structure.

Like 3NF, impractical for direct querying.

Conclusion on DV Model

[Diagram labels: EDW; 3NF tendency; pre-aggregates; analytic data feed back to source systems]

A Bit of Chemistry

“Atom” = Clear Definitions of the Data -- Usually 3NF (Clear Definition, Removed Ambiguity)

“Water Molecule” = “2-1/2” normalized DV: Hubs/Links/Sats (Efficient Loading, All the Data all the Time)

“Sugar Molecule” = Tables/Views with Pre-aggregated Data (Common Building Blocks for BI)

“Sugar Cube” = Rapid BI Product -- Usually DM (Business Context, Agile Re-assembly)

More on DV…Core Concepts

A HUB table contains a list of uniquely identified business keys that have a very low tendency to change.

A SATELLITE holds any data with a tendency to change over time, any descriptive data about a business key (HUB key).

A LINK is either a transaction, a hierarchy, or an association/relationship between the business keys (HUB keys).
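The three core structures can be sketched as tables using Python's standard-library sqlite3 module. This is a minimal sketch under stated assumptions: every table and column name here (hub_customer, customer_bk, record_source and so on) is hypothetical, and the helper get_or_create_hub_key is illustrative rather than part of any Data Vault tool.

```python
import sqlite3

DDL = """
-- HUB: one row per unique business key.
CREATE TABLE hub_customer (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_bk   TEXT NOT NULL UNIQUE,  -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_key   INTEGER PRIMARY KEY,
    product_bk    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- LINK: relationship between two or more hub keys.
CREATE TABLE link_customer_product (
    link_key      INTEGER PRIMARY KEY,
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    product_key   INTEGER NOT NULL REFERENCES hub_product(product_key),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_key, product_key)
);
-- SATELLITE: descriptive data that changes over time, one parent only.
CREATE TABLE sat_customer (
    customer_key  INTEGER NOT NULL REFERENCES hub_customer(customer_key),
    load_date     TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_key, load_date)
);
"""


def build_vault():
    """Create an in-memory database holding the sketch tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def get_or_create_hub_key(conn, business_key, load_date, source):
    """Return the surrogate key for a business key, inserting it once."""
    row = conn.execute(
        "SELECT customer_key FROM hub_customer WHERE customer_bk = ?",
        (business_key,),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO hub_customer (customer_bk, load_date, record_source) "
        "VALUES (?, ?, ?)",
        (business_key, load_date, source),
    )
    return cur.lastrowid
```

Loading the same business key twice returns the same surrogate key and leaves a single hub row, which reflects the idea that hub keys have a very low tendency to change.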

Hubs

Identifiable business element.

Very low chance of changing (generally, not editable in source systems).

Same semantic meaning and granularity across the enterprise.

Hubs Examples

Key: Nissan-ABC/123-456

Line of Business: NAICS 2007 45A

Organization: Empire State College

Model Number: 33777185JN

Hubs Quiz

A HUB represents an Event or Transaction (True or False)

HUB may contain record source as part of business key (True or False)

HUB always has an end-date (True or False)

HUB business key can be comprised of multiple columns (True or False)

HUB can be dependent on another HUB (True or False)

Links

Intersection of two or more Business Keys (Hubs)

A Unit of Work (e.g. Product by Supplier Link, Customer by Category Link)

Identifiable business element relationships

Business event

Transaction between business keys (Hubs)

Hierarchy

Same As (data cleansing)

Includes Hub keys as foreign keys

Links Examples

Invoice Header (Buyer, Seller, Invoice Date, Receive Date)

Orders (Employee, Shipper, Customer, Order Date)

Links Quiz

A transaction is always represented by a Link (True or False)

A Link can contain business keys (True or False)

A unit of work is always represented by a Link (True or False)

A link must contain a unit of work (True or False)

Satellite

Time-dimensional table describing a Hub or Link

Has one migrated foreign key (either from Hub or Link)

At least one satellite row for each Hub Key

Primary Key is the Hub Surrogate Key (Hub_key) and Load Date

Satellite Notes

Non-identifying business elements

Descriptive of Business Key from Hub or Link

Dependent on either Hub or Link as Parent

Never dependent on more than one parent table

Never parent table to any other table (no snowflaking!)

Generally, has beginning and ending dates
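The satellite rules above (one parent, primary key of hub key plus load date, history kept over time) can be sketched as a simple delta load in Python with sqlite3. This is an assumption-laden illustration: the table sat_customer, its city column and the function names are hypothetical, load dates are ISO strings, and loads are assumed to arrive in chronological order.

```python
import sqlite3


def make_satellite(conn):
    """Create a sketch satellite keyed by hub key and load date."""
    conn.execute("""
        CREATE TABLE sat_customer (
            hub_key   INTEGER NOT NULL,
            load_date TEXT NOT NULL,
            city      TEXT,
            PRIMARY KEY (hub_key, load_date)
        )
    """)


def load_satellite_row(conn, hub_key, load_date, city):
    """Insert a new row only when the descriptive data has changed.

    Returns True when a row was inserted, False when the incoming
    value matches the latest stored row (no delta).
    """
    latest = conn.execute(
        "SELECT city FROM sat_customer WHERE hub_key = ? "
        "ORDER BY load_date DESC LIMIT 1",
        (hub_key,),
    ).fetchone()
    if latest is not None and latest[0] == city:
        return False
    conn.execute(
        "INSERT INTO sat_customer VALUES (?, ?, ?)",
        (hub_key, load_date, city),
    )
    return True
```

Existing rows are never updated here; each change adds a row, so the satellite keeps the full history of its parent key.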

Satellites Quiz

A Satellite can be dependent on one or more parent tables (True or False)

A Satellite’s Primary Key is which of the following:

A) Hub’s PK

B) Sat Load Date

C) Sequence Number

D) Sub-totals

Satellite can export its Key (True or False)

Satellite can be snow flaked (True or False)

Satellite is not impacted by Delta Processing (True or False)

NON-Core DV Structures

A PIT (Point-in-Time) is a specialized SATELLITE derivative that is used to get the latest row “AS OF” a specific date WITHOUT the use of nested sub-queries in the main satellite query.

A MEASURE SATELLITE is a specific SATELLITE dedicated to hold particular descriptive data on which calculations or aggregations can be performed for analytical purpose.

A REFERENCE is a specific hybrid (flat table instead of Hub/Sat) in which “decoding” info is truly static, usually de-normalized, with no history.

A BRIDGE is similar to a PIT and is designed for performance, but it is created from many Hubs and Links and allows computing by columns.
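The “AS OF” idea behind a PIT structure can be sketched in plain Python. This is a hedged illustration, not a prescribed PIT layout: the function build_pit and its input shapes are assumptions, and load dates are assumed to be ISO strings held in ascending order per hub key.

```python
from bisect import bisect_right


def build_pit(sat_load_dates, snapshot_dates):
    """Precompute, per hub key and snapshot date, the satellite row in effect.

    sat_load_dates: dict mapping hub_key -> sorted list of ISO load dates.
    snapshot_dates: iterable of ISO dates to take "as of" snapshots for.
    Returns a dict mapping (hub_key, snapshot_date) -> load_date, or None
    when no satellite row existed yet on that snapshot date.
    """
    pit = {}
    for hub_key, loads in sat_load_dates.items():
        for snap in snapshot_dates:
            # Index of the first load date strictly after the snapshot.
            idx = bisect_right(loads, snap)
            pit[(hub_key, snap)] = loads[idx - 1] if idx else None
    return pit
```

Once such a lookup exists, a query for a given snapshot date can match the satellite on an exact (hub key, load date) pair instead of searching for the latest row inside a nested sub-query.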

Few Lines…

In 2008 W.H. (Bill) Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0).

“The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.” (http://www.tdan.com/view-articles/5054/).

The number of Data Vault users surpassed 500 in 2010 and is growing rapidly (http://danlinstedt.com/about/dv-customers/).

The Story

About the IPC Environment

DS Feeds: Daily, weekly, monthly and ad-hoc from RDBMSs and flat files, some UD

EDW Platform: SQL Server 2005+. Projected size of the EDW for 2010 is 4…5TB, growing 10-15% annually

Data Warehouse Builder: WhereScape RED 6

BI: Balanced Insight Consensus/MicroStrategy 9

Phase I of the Data Vault EDW is complete (approx. 500 objects), along with the Data Mart and BI reports (6 weeks). The subsequent phases are being developed now.

Also, the re-platforming of the Data Vault to Teradata 13 is underway now.

Conclusion

Every Data Warehousing “Flavor” is applicable depending on phase and purpose of the DW:

Third Normal Form – “Normalization Rules”

Data Vault Structure – “Golden Copy”

Tables/Views with pre-aggregated data – “Reusable Components”

Dimensional Model – “Interpretation of Data by Users”

Specifically, the Data Vault Model is currently an optimal approach for building an Enterprise Data Warehouse.

Questions?

Raphael Klebanov, MCS, PSM, CDVDM, TCP

Lead DW/BI Analyst

[email protected] raphael_ws

303.968.0703 learndatavault.com

‘New Business Supermodel’ by Dan Linstedt

For More Info Contact…