Data Vault and DW2.0

The Application of Data Vault to DW2.0 © Dan Linstedt, 2011-2012 all rights reserved

Upload: empowered-holdings-llc

Post on 19-Jan-2015


DESCRIPTION

This is a presentation I gave in 2006 for Bill Inmon. The presentation covers Data Vault and how it integrates with Bill Inmon's DW2.0 vision. It focuses on the business intelligence side of the house. If you want to use these slides, please put (C) Dan Linstedt, all rights reserved, http://LearnDataVault.com

TRANSCRIPT

Page 1: Data Vault and DW2.0

The Application of Data Vault to DW2.0

© Dan Linstedt, 2011-2012 all rights reserved

Page 2: Data Vault and DW2.0


A bit about me…

• Author, Inventor, Speaker – and part time photographer…

• 25+ years in the IT industry

• Worked in DoD, US Gov’t, Fortune 50, and so on…

• Find out more about the Data Vault:
  o http://www.youtube.com/LearnDataVault
  o http://LearnDataVault.com

• Full profile on http://www.LinkedIn.com/dlinstedt

Page 3: Data Vault and DW2.0

04/10/2023 Do Not Duplicate Without Written Permission


Agenda

• Defining The Needs for the Data Vault
  o DW2.0 Architecture
  o DW2.0 Drivers for Data Modeling
  o Divergence of Data Models over Time

• Data Vault in DW2.0
  o Defining the Data Vault
  o What does one look like?
  o Modeling in DW2.0
  o Applying Data Vault to Global DW2.0
  o Applying Data Vault to Time-Value DW2.0
  o Compliance in DW2.0
  o Applying Data Vault to System of Record

• The Paradox of DW2.0
  o Volume, Latency, Complexity, Normalization and Transformation ability

Page 4: Data Vault and DW2.0


DW2.0 Architecture

[Diagram: DW2.0 architecture. Sectors: Interactive, Integrated, Near-Line, Archival, with METADATA. Other labels: Tactical, Historical, Strategic, Extended, Enterprise Data Warehouse, Active Data Mining, Transformation, Active, Cleansing, Cube Processing, Temporal Indexing, Semantic Management, Enterprise Service Bus.]

ESB Connectivity:
• EAI
• EII
• ETL / ELT
• Web Services

ESB Management:
• Text
• Email
• Spread Sheets
• Transaction
• Structured Information

Unstructured Data:
• Email
• Plain Text
• Word Docs
• Images

Data Models must be consistently applied throughout all layers.

Page 5: Data Vault and DW2.0


DW2.0 Drivers for Data Modeling

• Data Models are one of the main integration points between Technical and Business drivers.

• Business Keys drive understandability and granularity

• Normalization drives flexibility and frequency of load

• Raw data sets in the EDW/ADW drive compliance and volume

[Diagram: the Data Model at the center of two groups of drivers. Technical Drivers: Volume, Frequency, Granularity. Business Drivers: Flexibility, Compliance, Understandability.]

Page 6: Data Vault and DW2.0


Divergence of Data Models over Time

• Data models (both logical and physical) have diverged from business drivers and direction over time.

• The Data Models have driven towards physical improvements instead of towards business improvements.

• The Data Vault Architecture drives data modeling back to the business side of the house.

[Chart labels: Time, Business Goals, Standard Data Modeling, Data Vault Modeling, Business Process Modeling.]

Page 7: Data Vault and DW2.0


Agenda

• Defining The Needs for the Data Vault
  o DW2.0 Architecture
  o DW2.0 Drivers for Data Modeling
  o Divergence of Data Models over Time

• Data Vault in DW2.0
  o Defining the Data Vault
  o What does one look like?
  o Modeling in DW2.0
  o Applying Data Vault to Global DW2.0
  o Applying Data Vault to Time-Value DW2.0
  o Compliance in DW2.0
  o Applying Data Vault to System of Record

• The Paradox of DW2.0
  o Volume, Latency, Complexity, Normalization and Transformation ability

Image is from What The Bleep Do We Know?

Page 8: Data Vault and DW2.0


Defining the Data Vault

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.

Source: "Defining the Data Vault", TDAN.com article

Page 9: Data Vault and DW2.0


What Does One Look Like?

[Diagram: three Hubs, each with three Satellites (Sat) and an F(x) marker:
• Customer: Customer Information
• Account: Account Information
• InvoiceID: Invoice / Billing Information
A Link, with its own F(x) marker and a Satellite, sits in a shaded area between the Hubs.]

The impact of linking disparate systems together is inside the shaded area.

The Link's Satellite records a history of the interaction.

Elements:
• Hub
• Link
• Satellite

Page 10: Data Vault and DW2.0


Modeling in DW2.0

• Bill Says:
  o DW2.0 must be brought down to a very finite level of detail.
  o The starting point for DW2.0 is the modeling process.
  o The data model applies to the integrated sector, the near line sector, and the archival sector.
  o The way that data warehouses are built is in an incremental manner.

• The Data Vault specializes in:
  o Providing finite grain at the lowest level possible,
  o Mapping business process models to data models,
  o Existing in all sectors simultaneously without changes,
  o Flexibility and managing change so that impacts are not a mile wide and 10 miles deep.

Page 11: Data Vault and DW2.0


Elements in a Data Vault

• Hub
  o Unique list of business keys, tracked by the first time the warehouse saw them appear.

• Link
  o Relationships between business keys, also representing a grain shift or a hierarchical roll-up.

• Satellite
  o Data over time, granular and descriptive about the business key. Also set up according to type of information and rate of change.
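The three element types can be sketched as plain data structures. This is a minimal illustration in Python, not the author's implementation: the class names, field names, and `load` methods are assumptions chosen to mirror the descriptions above, and the sample keys are borrowed from the transaction example later in the deck.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Hub:
    """Unique list of business keys, stamped with the date each was first seen."""
    name: str
    first_seen: dict = field(default_factory=dict)  # business key -> load date

    def load(self, business_key: str, load_date: date) -> bool:
        # Insert only: a key that already exists keeps its original date.
        if business_key in self.first_seen:
            return False
        self.first_seen[business_key] = load_date
        return True


@dataclass
class Link:
    """Relationships between business keys held in two or more hubs."""
    name: str
    hubs: tuple
    rows: set = field(default_factory=set)  # each row is a tuple of business keys

    def load(self, *business_keys: str) -> bool:
        if len(business_keys) != len(self.hubs):
            raise ValueError("one business key per hub is required")
        if business_keys in self.rows:
            return False
        self.rows.add(business_keys)
        return True


@dataclass
class Satellite:
    """Descriptive data over time for one parent hub or link."""
    name: str
    rows: list = field(default_factory=list)

    def load(self, parent_key, load_date: date, attributes: dict) -> None:
        # Every change is a new row; existing rows are never updated.
        self.rows.append((parent_key, load_date, dict(attributes)))


customer = Hub("Customer")
account = Hub("Account")
customer_account = Link("CustomerAccount", hubs=(customer, account))
customer_name = Satellite("CustomerName")

customer.load("ABC12356", date(2000, 10, 12))
customer.load("ABC12356", date(2000, 10, 14))  # ignored: key already known
account.load("1UX2589a", date(2000, 10, 12))
customer_account.load("ABC12356", "1UX2589a")
customer_name.load("ABC12356", date(2000, 10, 12), {"name": "Acme Incorporated"})
```

Keeping the hub to nothing but keys and a first-seen date is what lets descriptive data arrive later, and change as often as it likes, without touching the hub or link rows.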

Page 12: Data Vault and DW2.0


Applying the Data Vault to Global DW2.0

[Diagram: three regional models, each drawn as Hubs with Satellites (Sat) and Links:
• Base EDW created in Corporate Financials in USA
• Manufacturing EDW in China
• Planning in Brazil]

Page 13: Data Vault and DW2.0


Applying the Data Vault to Time-Value DW2.0

Satellite entities in the Data Vault house data over time. They are split by type of information and rate of change. This is an example set of data for a customer name satellite.

Satellite Data Over Time (Row 1 to Row 4, listed here in load-date order)

Cust_Key   Load_Date    Name                Description   Record Src
1          10-12-2000   Acme Incorporated   Super Ducts   Finance
1          10-14-2000   Acme Corp, Inc      Super Ducts   Finance
1          10-31-2000   Acme Incorporated   Super Ducts   Finance
1          12-2-2000    Acme Inc            Super Ducts   Contracts
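Because a satellite keeps every version of a row, a "what was the name on this date" question becomes a simple lookup. The sketch below is illustrative only: the function `name_as_of` and the tuple layout are assumptions, and the slide's dates are read as month-day-year.

```python
from datetime import date

# (cust_key, load_date, name, description, record_source), values from the slide.
SAT_CUSTOMER_NAME = [
    (1, date(2000, 10, 12), "Acme Incorporated", "Super Ducts", "Finance"),
    (1, date(2000, 10, 14), "Acme Corp, Inc", "Super Ducts", "Finance"),
    (1, date(2000, 10, 31), "Acme Incorporated", "Super Ducts", "Finance"),
    (1, date(2000, 12, 2), "Acme Inc", "Super Ducts", "Contracts"),
]


def name_as_of(rows, cust_key, as_of):
    """Return the name in effect for cust_key on the as_of date, or None."""
    candidates = [r for r in rows if r[0] == cust_key and r[1] <= as_of]
    if not candidates:
        return None
    latest = max(candidates, key=lambda r: r[1])  # most recent load wins
    return latest[2]


print(name_as_of(SAT_CUSTOMER_NAME, 1, date(2000, 10, 20)))  # Acme Corp, Inc
print(name_as_of(SAT_CUSTOMER_NAME, 1, date(2000, 12, 31)))  # Acme Inc
print(name_as_of(SAT_CUSTOMER_NAME, 1, date(2000, 1, 1)))    # None
```

Note that the name reverts to "Acme Incorporated" on 10-31-2000; an insert-only satellite records that as one more row rather than overwriting the 10-14-2000 version.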

Page 14: Data Vault and DW2.0


Batch and Real-Time Data Arrival

Real-time transaction arriving:
• Transaction ID: 1128589388
• Date Stamp: 10-12-2000 16:43
• Customer: ABC12356
• Account #: 1UX2589a
• Amount: $10.00
• Type: DBT

[Diagram: the transaction loads into Link Transaction and Sat Transaction, which sit between Hub Customer (with Sat Customer) and Hub Acct (with Sat Acct). Customer Info and Acct Data arrive by Batch Load in a 3, 6 or 12 Hr Load Window.]

All Inserts, All the time
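One way to read "all inserts, all the time" is that a real-time transaction can land immediately, because the hubs need only business keys and the descriptive data can follow in a later batch. The sketch below is a hypothetical illustration of that idea; the table variables, function names, and the batch attribute values are assumptions, not part of the presentation.

```python
from datetime import datetime
from decimal import Decimal

hub_customer = {}      # business key -> first load timestamp
hub_account = {}       # business key -> first load timestamp
link_transaction = []  # (transaction_id, customer_key, account_key)
sat_transaction = []   # (transaction_id, load_ts, amount, type)
sat_customer = []      # (customer_key, load_ts, attributes)


def load_transaction(txn, load_ts):
    """Insert a real-time transaction without waiting for batch-loaded detail."""
    # Hubs need only the business key, so they can be inserted right away.
    hub_customer.setdefault(txn["customer"], load_ts)
    hub_account.setdefault(txn["account"], load_ts)
    link_transaction.append((txn["transaction_id"], txn["customer"], txn["account"]))
    sat_transaction.append((txn["transaction_id"], load_ts, txn["amount"], txn["type"]))


def load_customer_batch(records, load_ts):
    """Insert descriptive customer data from a later batch window."""
    for key, attributes in records.items():
        hub_customer.setdefault(key, load_ts)  # keeps the earlier timestamp
        sat_customer.append((key, load_ts, attributes))


# The transaction from the slide arrives first.
load_transaction(
    {
        "transaction_id": "1128589388",
        "customer": "ABC12356",
        "account": "1UX2589a",
        "amount": Decimal("10.00"),
        "type": "DBT",
    },
    datetime(2000, 10, 12, 16, 43),
)

# Customer detail arrives hours later; the attribute value is made up.
load_customer_batch(
    {"ABC12356": {"name": "Example Customer"}},
    datetime(2000, 10, 12, 22, 0),
)
```

Nothing in either function updates an existing row, so the real-time and batch paths never have to wait on each other.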

Page 15: Data Vault and DW2.0


Star Schema Real-Time Data Issues

Real-time transaction arriving:
• Transaction ID: 1128589388
• Date Stamp: 10-12-2000 16:43
• Customer: ABC12356
• Account #: 1UX2589a
• Amount: $10.00
• Type: DBT

[Diagram: the transaction loads into Fact Transaction, which sits between Dimension Customer and Dimension Account. Customer Info and Acct Data arrive by Batch Load in a 3, 6 or 12 Hr Load Window.]

Updates are REQUIRED!

Cleansing and quality must occur before the data can reach the target tables, and both introduce unwanted latency!

Page 16: Data Vault and DW2.0


Compliance in DW2.0

• Raw Detail = auditable

• Loads in Real-Time or in Batch

• Integrated by Business Key

• Flexible, allows business changes (with little to no impact)

• No delay in loading data

• Data type conformity

• Semantic Integration

[Diagram: Direction of Information Flow. Source Systems feed, through Raw Integration, the EDW / ADW (Data Vault). Business Rules then feed the Data Marts (Data Delivery), shown as an Error Mart and True Marts, which serve the User or Auditor. Other labels: Changes to Source Information, Master Data (Operational), Continuous Data Quality Improvement.]

Page 17: Data Vault and DW2.0


Applying the Data Vault to System Of Record

• SOR 1
  o Data Capture, Data Produced by system algorithms

• SOR 2
  o Raw Detailed Integrated Data over time, Integrated by Horizontal (functional) Business Key. Auditable.

• SOR 3
  o Current view of the business, merged, quality cleansed, single copy, single source, feeds operational systems.

[Diagram:
• SOR Definition 1: Source Systems
• SOR Definition 2: Normalized EDW
• SOR Definition 3: Master Data or Conformed Dimensions]

Page 18: Data Vault and DW2.0


DW2.0 Paradoxes

• DW2.0 incorporates:
  o Unstructured, Semi-Structured, Real-Time, and Batch Data
  o Global views
  o All of which drive volumes of data.

• Volume causes latency in transformation.

• Volume is directly proportional to transformation complexity.

• Real-Time data arrival is inversely proportional to complexity and volume.

• Time for “quality, cleansing, and transformation” on the way in to the EDW diminishes as near-real-time is approached, or as massive volumes of batch data must fit within a shrinking batch window.

• Transformation can destroy the auditability and compliance of the EDW / ADW.

Page 19: Data Vault and DW2.0


DW2.0 Paradoxes - Imagery

[Diagram: nodes are DW2.0; Volume; Real-Time Transactions; Unstructured Data; Low-Level Grain; Low Latency; Merging, Quality, Cleansing; Data Model Denormalization; Data Model Normalization & Raw Details; Auditability & Compliance. Arrow labels: Drives, Increases, Requires, Pushes, Fights, Inhibits, Provides.]

Page 20: Data Vault and DW2.0


DW2.0 Paradox Hypothesis

• As we reach near-real-time, the ability to transform data and “wait” for parent dependencies decreases and data decay rates increase, which can cause data death if the data is not processed in time.

• Normalization of the data model increases flexibility and scalability.

• The closer we get to near-real-time, the more normalized the data model in the EDW/ADW must become.

• In order to process high volumes of batch data extremely fast, the “business transformations” must be removed from the load stream of the EDW.

Page 21: Data Vault and DW2.0


Data Vault Volumetrics (10% null data)

Table                                 Size
Cust Addr                             41.15 MB
Cust Company                          22.36 MB
Cust Detail                           10.00 MB
Cust Hub                              8.20 MB
Cust Name                             28.00 MB
Initial Total Size                    109 MB (200k rows)
Monthly Growth Rate (new customers)   15% / month = 16.45 MB

Upon initial investigation, growth from new customers over 12 months is 197.4 MB per year.

Now let’s factor in the deltas.

Page 22: Data Vault and DW2.0


Data Vault Growth

Volumetrics (10% null data), Delta Growth Only

Table                 Initial Size (Data & Indexes)   Avg Growth Per Week   Avg Growth Per Month   Avg Growth Per Year
Cust Addr             41.15 MB                        5% = 2.0 MB           0%                     104 MB
Cust Company          22.36 MB                        0%                    0%                     10% = 2.23 MB
Cust Detail           10.00 MB                        10% = 1.0 MB          varies                 12 MB
Cust Hub              8.20 MB                         0%                    0%                     0%
Cust Name             28.00 MB                        0%                    0%                     5% = 1.4 MB
Initial Total Size    109 MB (200k rows)              1.0 MB                -                      119.63 MB
Growth Rate           15% / month (16.45 MB)          -                     -                      197.40 MB
TOTAL GROWTH / YEAR                                   -                     -                      317.03 MB

Original Dimension: 497.16 MB per year

New Data Vault: 317.03 MB per year
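The totals on the two volumetrics slides can be rechecked with a few lines of arithmetic. The sketch below only re-adds the slide's own figures; the variable names are assumptions. One detail it surfaces: the five table sizes sum to 109.71 MB, and 15% of that is about 16.46 MB, which appears to be where the slide's 16.45 MB monthly figure comes from (15% of the rounded 109 MB would be 16.35 MB).

```python
from decimal import Decimal

# Initial sizes in MB, from the slide.
initial_mb = {
    "Cust Addr": Decimal("41.15"),
    "Cust Company": Decimal("22.36"),
    "Cust Detail": Decimal("10.00"),
    "Cust Hub": Decimal("8.20"),
    "Cust Name": Decimal("28.00"),
}

# Average delta growth per year in MB, from the slide.
yearly_delta_mb = {
    "Cust Addr": Decimal("104"),
    "Cust Company": Decimal("2.23"),
    "Cust Detail": Decimal("12"),
    "Cust Hub": Decimal("0"),
    "Cust Name": Decimal("1.4"),
}

total_initial = sum(initial_mb.values())           # unrounded initial size
monthly_new_computed = Decimal("0.15") * total_initial

monthly_new_slide = Decimal("16.45")               # figure quoted on the slide
yearly_new = monthly_new_slide * 12                # growth from new customers
yearly_delta = sum(yearly_delta_mb.values())       # growth from changed rows
total_growth = yearly_delta + yearly_new

dimension_growth = Decimal("497.16")               # slide's star-schema figure
difference = dimension_growth - total_growth

print(total_initial)         # 109.71
print(monthly_new_computed)  # 16.4565
print(yearly_new)            # 197.40
print(yearly_delta)          # 119.63
print(total_growth)          # 317.03
print(difference)            # 180.13
```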

Page 23: Data Vault and DW2.0


Data Vault VS Dimension Growth

[Chart: Dimension and Data Vault size from Initial Size through Year 5; vertical axis labeled Gigabytes, 0 to 3000.]

             Initial Size   Year 1   Year 2    Year 3    Year 4    Year 5
Dimension    114            611.16   1108.32   1605.48   2102.64   2599.8
Data Vault   109            426.03   742.06    1059.09   1376.12   1693.15

How does the extensive growth rate affect queries?
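The five-year figures in the table above look like a straight-line projection: initial size plus a fixed yearly growth. The sketch below reproduces that projection under this assumption; the function name `project` is hypothetical. It matches the Dimension row exactly, while the slide's Data Vault values from Year 2 onward are each 1.00 lower than the straight-line result (742.06 on the slide versus 743.06 here).

```python
from decimal import Decimal


def project(initial_mb, yearly_growth_mb, years=5):
    """Return sizes for Initial Size, Year 1, ..., Year N, assuming linear growth."""
    return [initial_mb + yearly_growth_mb * year for year in range(years + 1)]


dimension = project(Decimal("114"), Decimal("497.16"))
data_vault = project(Decimal("109"), Decimal("317.03"))

for label, series in (("Dimension", dimension), ("Data Vault", data_vault)):
    print(label, [str(value) for value in series])

# Gap between the two models at Year 5 under this assumption.
print(dimension[5] - data_vault[5])
```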

Page 24: Data Vault and DW2.0


Summarization

Business:
• Lack of a single view of a customer, product, service, etc...
• Lack of visibility into ALL information across the enterprise.
• Competition does it better, faster, cheaper.
• Unable to identify and forecast business trends and their impacts.
• WHERE’S THE KNOWLEDGE? OR IS IT JUST ALL DATA?

Technical:
• Near-Real-Time (Active)
• Huge Data Volumes
• Massive Data Dis-Integration
• Spread-Marts
• Convergence of Operational and Strategic Questions
• Duplication of data in the ODS, Warehouse, and Data Marts!
• Dimension-itis!!
• ODS Ulcer!
• Fact Table Granularity
• JUNK tables, Helper Tables

Page 25: Data Vault and DW2.0


Where To Learn More

• The Technical Modeling Book: http://LearnDataVault.com

• The Discussion Forums & events: http://LinkedIn.com – Data Vault Discussions

• Contact me:
  o http://DanLinstedt.com - web site
  o [email protected] - email

• World wide User Group (Free): http://dvusergroup.com