Agile Data Engineering - Intro to Data Vault Modeling (2016)

KENT GRAZIANO AGILE DATA ENGINEERING: INTRODUCTION TO DATA VAULT DATA MODELING @KentGraziano kentgraziano.com

Upload: kent-graziano

Post on 14-Jan-2017

TRANSCRIPT

Page 1: Agile Data Engineering - Intro to Data Vault Modeling (2016)

KENT GRAZIANO

AGILE DATA ENGINEERING: INTRODUCTION TO DATA VAULT DATA MODELING

@KentGraziano kentgraziano.com

Page 2

Agenda

Bio

What do we mean by Agile?

What is a Data Vault?

Where does it fit in a DW/BI architecture?

How to design a Data Vault model

Being “agile” with Data Vault

What’s new in DV 2.0

Page 3

My Bio

› Senior Technical Evangelist, Snowflake Computing
› Oracle ACE Director (BI/DW)
› Certified Data Vault Master and DV 2.0 Practitioner
› Data Modeling, Data Architecture and Data Warehouse Specialist
  • 30+ years in IT
  • 25+ years of Oracle-related work
  • 20+ years of data warehousing experience
› Member – DAMA Houston
› Former Member: Boulder BI Brain Trust (http://www.boulderbibraintrust.org/)
› Author & Co-Author of a bunch of books
  • The Business of Data Vault Modeling
  • The Data Model Resource Book (1st Edition)
› Blogger: The Data Warrior
› Past-President of Oracle Development Tools User Group and Rocky Mountain Oracle User Group

Page 4

Manifesto for Agile Software Development

“We are uncovering better ways of developing software by doing it and helping others do it.

Through this work we have come to value:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.”

http://agilemanifesto.org/

Page 5

Applying the Agile Manifesto to DW

(C) Kent Graziano

User Stories instead of requirements documents

Time-boxed iterations
› Iteration has a standard length
› Choose one or more user stories to fit in that iteration

Rework is part of the game
› There are no “missed requirements”... only those that haven’t been delivered or discovered yet.

Page 6

Data Vault Definition

TDAN.com Article

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

Architected specifically to meet the needs of today’s enterprise data warehouses

DAN LINSTEDT: Defining the Data Vault

Page 7

What is Data Vault Trying to Solve?

(C) Kent Graziano

What are our other Enterprise Data Warehouse options?
› Third-Normal Form (3NF): Complex primary keys (PKs) with cascading snapshot dates
› Star Schema (Dimensional): Difficult to reengineer fact tables for granularity changes

Difficult to get it right the first time

Not adaptable to rapid business change

NOT AGILE!

Page 8

Data Vault Time Line

© LearnDataVault.com

Timeline axis: 1960, 1970, 1980, 1990, 2000

E.F. Codd invented relational modeling
Chris Date and Hugh Darwen Maintained and Refined Modeling
1976: Dr Peter Chen Created E-R Diagramming
Early 70’s: Bill Inmon Began Discussing Data Warehousing
Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University
Mid 70’s: AC Nielsen Popularized Dimension & Fact Terms
Mid – Late 80’s: Dr Kimball Popularizes Star Schema
Mid 80’s: Bill Inmon Popularizes Data Warehousing
Late 80’s: Barry Devlin and Dr Kimball Release “Business Data Warehouse”
1990: Dan Linstedt Begins R&D on Data Vault Modeling
2000: Dan Linstedt releases first 5 articles on Data Vault Modeling

Page 9

Data Vault Evolution

(C) Kent Graziano

The work on the Data Vault approach began in the early 1990s and was completed around 1999.

Throughout 1999, 2000, and 2001, the Data Vault design was tested, refined, and deployed into specific customer sites.

In 2002, the industry thought leaders were asked to review the architecture.

This is when I attended my first DV seminar in Denver and met Dan!

In 2003, Dan began teaching the modeling techniques to the mass public.

In 2014, Dan introduced DV 2.0!

Page 10

Where does a Data Vault Fit?

© LearnDataVault.com

Diagram: STAGING, EDW (DATA VAULT), and three DATA MARTS (STAR SCHEMAS)

Page 11

Where does Data Vault fit?

©Oracle Corp

Data Vault goes here

Page 12

Data Vault: 3 Simple Structures

© LearnDataVault.com

EDW (DATA VAULT)

HUB

LINK

SATELLITE

Page 13

Data Vault Core Architecture

© LearnDataVault.com

HUBS

Unique List of Business Keys

LINKS

Unique List of Relationships across keys

SATELLITES

Descriptive Data

› Satellites have one and only one parent table
› Satellites cannot be “Parents” to other tables
› Hubs cannot be child tables

Page 14

Common Attributes

© LearnDataVault.com

Required – all structures

› Primary key – PK

› Load date time stamp – DTS

› Record source – REC_SRC

Required – Satellites only

› Load end date time stamp – LEDTS

› Optional in DV 2.0

Optional – Hubs & Links only

› Last seen dates – LSDTs

› MD5KEY (REQUIRED IN DV 2.0)

Optional – Satellites only

› Load sequence ID – LDSEQ_ID

› Update user – UPDT_USER

› Update DTS – UPDT_DTS

› MD5DIFF

Page 15

1. Hub = Business Keys

(C) Kent Graziano

Hubs = Unique Lists of Business Keys

Business Keys are used to TRACK and IDENTIFY key information

New: DV 2.0 uses MD5 Hash of the BK for the PK
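
A minimal Python sketch of the hash-key idea, for illustration only. It assumes the business key is trimmed and upper-cased before hashing, which is a common convention but not something this slide specifies; the function name `hub_hash_key` and the sample key are hypothetical.

```python
import hashlib


def hub_hash_key(business_key: str) -> str:
    """Return the hex MD5 digest of a normalized business key.

    The normalization (trim + upper-case) is an assumption made for this
    sketch so that ' cust-001 ' and 'CUST-001' map to the same hub row.
    """
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


key = hub_hash_key(" cust-001 ")
print(len(key))  # 32 (an MD5 digest is 32 hex characters)
```

Because the key is computed from the business key alone, any loader can derive it independently without looking up a sequence value first.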

Page 16

2: Links = Associations

(C) Kent Graziano

Links = Transactions and Associations

They are used to hook together multiple sets of information

In DV 2.0 the BK attributes may migrate to the Links for faster query
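
A hedged Python sketch of what a DV 2.0 style link row might look like when the business keys travel with it. The column names, the `|` delimiter, and the key normalization are all assumptions for illustration; the slide only states that BK attributes may migrate to the Link.

```python
import hashlib

DELIMITER = "|"  # assumed separator so ("AB", "C") and ("A", "BC") hash differently


def link_hash_key(*business_keys: str) -> str:
    """Hash the ordered, normalized business keys of the hubs a link connects."""
    normalized = [bk.strip().upper() for bk in business_keys]
    return hashlib.md5(DELIMITER.join(normalized).encode("utf-8")).hexdigest()


def build_link_row(customer_num: str, order_num: str, rec_src: str) -> dict:
    """Build a hypothetical customer-order link row that also carries the BKs."""
    return {
        "link_hk": link_hash_key(customer_num, order_num),
        "customer_hk": link_hash_key(customer_num),
        "order_hk": link_hash_key(order_num),
        "customer_num": customer_num,  # BK copied onto the link for faster query
        "order_num": order_num,        # BK copied onto the link for faster query
        "rec_src": rec_src,
    }


row = build_link_row("C1", "O9", "CRM")
print(row["customer_num"], row["order_num"])  # C1 O9
```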

Page 17

Modeling Links - 1:1 or 1:M?

(C) Kent Graziano

Today: Relationship is a 1:1 so why model a Link?

Tomorrow: The business rule can change to a 1:M. You discover new data later.

With a Link in the Data Vault: No need to change the EDW structure. Existing data is fine. New data is added.
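
The point about cardinality can be sketched with Python's built-in `sqlite3` module. The table and column names here (`link_customer_account`, `customer_key`, `account_key`) are hypothetical and the keys are plain strings for brevity; the sketch only shows that a link table absorbs a 1:1 to 1:M rule change as new rows, with no structural change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE link_customer_account ("
    " customer_key TEXT NOT NULL,"
    " account_key TEXT NOT NULL,"
    " load_dts TEXT NOT NULL,"
    " rec_src TEXT NOT NULL,"
    " PRIMARY KEY (customer_key, account_key))"
)

# Today: each customer has exactly one account (1:1).
conn.execute(
    "INSERT INTO link_customer_account VALUES ('C1', 'A1', '2016-01-01', 'CRM')"
)

# Tomorrow: the rule changes to 1:M. No ALTER TABLE is needed, just a new row.
conn.execute(
    "INSERT INTO link_customer_account VALUES ('C1', 'A2', '2016-06-01', 'CRM')"
)

accounts = [
    row[0]
    for row in conn.execute(
        "SELECT account_key FROM link_customer_account"
        " WHERE customer_key = 'C1' ORDER BY account_key"
    )
]
print(accounts)  # ['A1', 'A2']
```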

Page 18

3. Satellites = Descriptors

(C) Kent Graziano

Satellites provide context for the Hubs and the Links

Tracks changes over time - Like SCD 2

In DV 2.0 use HASH_DIFF to detect changes
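
A rough Python sketch of hash-diff change detection, under stated assumptions: the descriptive attributes are concatenated in sorted column order with a `|` delimiter and hashed with MD5. The helper names `hash_diff` and `needs_new_satellite_row` are made up for this example.

```python
import hashlib
from typing import Optional


def hash_diff(attributes: dict) -> str:
    """MD5 over the descriptive attributes in a fixed (sorted) column order."""
    parts = [str(attributes[name]).strip() for name in sorted(attributes)]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()


def needs_new_satellite_row(current_hash_diff: Optional[str], incoming: dict) -> bool:
    """True when there is no current row or the incoming attributes differ from it."""
    return current_hash_diff is None or current_hash_diff != hash_diff(incoming)


stored = hash_diff({"name": "Acme", "city": "Houston"})
unchanged = needs_new_satellite_row(stored, {"name": "Acme", "city": "Houston"})
changed = needs_new_satellite_row(stored, {"name": "Acme", "city": "Denver"})
print(unchanged, changed)  # False True
```

Comparing one hash instead of every column keeps the load logic the same for every satellite, whatever its width.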

Page 19

Data Vault Model Flexibility (Agility)

(C) Kent Graziano

Goes beyond standard 3NF

Hyper normalized
› Hubs and Links only hold keys and meta data
› Satellites split by rate of change and/or source

Enables Agile data modeling
› Easy to add to model without having to change existing structures and load routines
  • Relationships (links) can be dropped and created on-demand.
› No more reloading history because of a missed requirement

Based on natural business keys
› Not system surrogate keys

Allows for integrating data across functions and source systems more easily
› All data relationships are key driven

Page 20

Data Vault Extensibility

(C) LearnDataVault.com

Adding new components to the EDW has NEAR ZERO impact to:

› Existing Loading Processes
› Existing Data Model
› Existing Reporting & BI Functions
› Existing Source Systems
› Existing Star Schemas and Data Marts

Page 21

Data Vault Productivity

(C) Kent Graziano

› Standardized modeling rules
  • Highly repeatable and learnable modeling technique
  • Can standardize load routines
    o Delta Driven process
    o Re-startable, consistent loading patterns.
  • Can standardize extract routines
    o Rapid build of new or revised Data Marts
  • Can be automated
  • Can use a BI-meta layer to virtualize the reporting structures
    o Example: OBIEE Business Model and Mapping tool
    o Example: BOBJ Universe Business Layer
  • Can put views on the DV structures as well
    o Simulate ODS/3NF or Star Schemas
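
One way to picture a delta-driven, re-startable load routine is the sketch below, which uses Python's built-in `sqlite3` module. The table `hub_customer`, the function `load_hub`, and the use of SQLite's `INSERT OR IGNORE` are assumptions chosen for illustration, not a pattern prescribed by the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    " cust_num TEXT PRIMARY KEY,"
    " load_dts TEXT NOT NULL,"
    " rec_src TEXT NOT NULL)"
)


def load_hub(conn, staged_keys, load_dts, rec_src):
    """Insert only business keys not already in the hub; return how many were added."""
    before = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?)",
        [(key, load_dts, rec_src) for key in staged_keys],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM hub_customer").fetchone()[0]
    return after - before


first = load_hub(conn, ["C1", "C2"], "2016-01-01", "CRM")
rerun = load_hub(conn, ["C1", "C2"], "2016-01-01", "CRM")  # restart: nothing new
delta = load_hub(conn, ["C2", "C3"], "2016-01-02", "CRM")  # only C3 is new
print(first, rerun, delta)  # 2 0 1
```

Because a rerun inserts nothing, the same routine can be restarted after a failure without cleanup, and the same shape applies to every hub.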

Page 22

Data Vault Adaptability

(C) Kent Graziano

› The Data Vault holds granular historical relationships.
  • Holds all history for all time, allowing any source system feeds to be reconstructed on-demand
    o Easy generation of Audit Trails for data lineage and compliance.
    o Data Mining can discover new relationships between elements
    o Patterns of change emerge from the historical pictures and linkages.

› The Data Vault can be accessed by power-users

Page 23

Other Benefits of a Data Vault

(C) Kent Graziano

› Modeling it as a DV forces integration of the Business Keys upfront
  • Good for organizational alignment

› An integrated data set with raw data extends its value beyond BI:
  • Source for data quality projects
  • Source for master data
  • Source for data mining
  • Source for Data as a Service (DaaS) in an SOA (Service Oriented Architecture).

Page 24

Other Benefits of a Data Vault

(C) Kent Graziano

› Upfront Hub integration simplifies the data integration routines required to load data marts.
  • Helps divide the work a bit.

› It is much easier to implement security on these granular pieces.

› Granular, re-startable processes enable pin-point failure correction.

› It is designed and optimized for real-time loading in its core architecture (without any tweaks or mods).

Page 25

How to be Agile using DV

(C) Kent Graziano

Model iteratively
› Use Data Vault data modeling technique
› Create basic components, then add over time

Virtualize the Access Layer
› Don’t waste time building facts and dimensions up front
  • ETL and testing take too long
› “Project” objects using pattern-based DV model with database views (or BI meta layer)

Users see real reports with real data
› Can always build out for performance in another iteration
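
As a rough illustration of "projecting" a dimension with a database view, here is a small sketch using Python's built-in `sqlite3` module. The table, column, and view names (`hub_customer`, `sat_customer_name`, `dim_customer`) and the convention that a NULL `load_end_dts` marks the current satellite row are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    cust_key TEXT PRIMARY KEY, cust_num TEXT, load_dts TEXT, rec_src TEXT);
CREATE TABLE sat_customer_name (
    cust_key TEXT, load_dts TEXT, load_end_dts TEXT, cust_name TEXT, rec_src TEXT,
    PRIMARY KEY (cust_key, load_dts));

INSERT INTO hub_customer VALUES ('k1', 'C1', '2016-01-01', 'CRM');
INSERT INTO sat_customer_name VALUES ('k1', '2016-01-01', '2016-03-01', 'Acme', 'CRM');
INSERT INTO sat_customer_name VALUES ('k1', '2016-03-01', NULL, 'Acme Corp', 'CRM');

-- Virtual dimension: the current satellite row (open-ended load_end_dts) per hub key.
CREATE VIEW dim_customer AS
SELECT h.cust_key, h.cust_num, s.cust_name
FROM hub_customer h
JOIN sat_customer_name s ON s.cust_key = h.cust_key
WHERE s.load_end_dts IS NULL;
""")

rows = conn.execute("SELECT cust_num, cust_name FROM dim_customer").fetchall()
print(rows)  # [('C1', 'Acme Corp')]
```

No data is copied to build the dimension, so it can be shown to users immediately and replaced by a physical table in a later iteration if performance requires it.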

Page 26

WHAT IS THE WORLD'S SMALLEST DATA VAULT?

Page 27

World's Smallest Data Vault

© LearnDataVault.com

Hub Customer
  Hub_Cust_Seq_ID
  Hub_Cust_Num
  Hub_Cust_Load_DTS
  Hub_Cust_Rec_Src

Satellite Customer Name
  Hub_Cust_Seq_ID
  Sat_Cust_Load_DTS
  Sat_Cust_Load_End_DTS
  Sat_Cust_Name
  Sat_Cust_Rec_Src

› The Data Vault doesn’t have to be “BIG”.

› A Data Vault can be built incrementally.

› Reverse engineering one component of the existing models is not uncommon.

› Building one part of the Data Vault, then changing the marts to feed from that vault is a best practice.

› The smallest Enterprise Data Warehouse consists of two tables:
  • One Hub
  • One Satellite
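
The two tables on this slide could be created roughly as follows, using Python's built-in `sqlite3` module. The column names come from the slide; the table names, data types, the UNIQUE constraint on the business key, the composite satellite primary key, and the use of NULL for an open end date are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Hub_Customer (
    Hub_Cust_Seq_ID   INTEGER PRIMARY KEY,
    Hub_Cust_Num      TEXT NOT NULL UNIQUE,   -- business key
    Hub_Cust_Load_DTS TEXT NOT NULL,
    Hub_Cust_Rec_Src  TEXT NOT NULL
);
CREATE TABLE Sat_Customer_Name (
    Hub_Cust_Seq_ID       INTEGER NOT NULL REFERENCES Hub_Customer (Hub_Cust_Seq_ID),
    Sat_Cust_Load_DTS     TEXT NOT NULL,
    Sat_Cust_Load_End_DTS TEXT,               -- NULL marks the current row
    Sat_Cust_Name         TEXT,
    Sat_Cust_Rec_Src      TEXT NOT NULL,
    PRIMARY KEY (Hub_Cust_Seq_ID, Sat_Cust_Load_DTS)
);
""")

conn.execute("INSERT INTO Hub_Customer VALUES (1, 'C1', '2016-01-01', 'CRM')")
conn.execute(
    "INSERT INTO Sat_Customer_Name VALUES (1, '2016-01-01', NULL, 'Acme', 'CRM')"
)

table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(table_names)  # ['Hub_Customer', 'Sat_Customer_Name']
```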

Page 28

Notably…

› In 2008 Bill Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)

› The number of Data Vault users in the US surpassed 500 in 2010 and grows rapidly (http://danlinstedt.com/about/dv-customers/)

Page 29

Organizations using Data Vault

› WebMD Health Services

› Anthem Blue-Cross Blue Shield

› MD Anderson Cancer Center

› Denver Public Schools

› Independent Purchasing Cooperative (IPC, Miami)
  • Owner of Subway

› Kaplan

› US Defense Department

› Colorado Springs Utilities

› State Court of Wyoming

› Federal Express

› US Dept. Of Agriculture

Page 30

What’s New in DV2.0?

© LearnDataVault.com

Modeling Structure Includes…

› NoSQL, and Non-Relational DB systems, Hybrid Systems

› Minor Structure Changes to support NoSQL

New ETL Implementation Standards

› For true real-time support

› For NoSQL support

New Architecture Standards

› To include support for NoSQL data management systems

New Methodology Components

› Including CMMI, Six Sigma, and TQM

› Including Project Planning, Tracking, and Oversight

› Agile Delivery Mechanisms

› Standards, and templates for Projects

Page 31

What’s New in DV2.0?

This model is fully compliant with Hadoop, needs NO changes to work properly

Note: Business Keys replicated to the Link structure for “join” capabilities on the way out to Data Marts.

© LearnDataVault.com

Page 32

Summary

Data Vault provides a data modeling technique that allows:

Model Agility
› Enabling rapid changes and additions

Productivity
› Enabling low complexity systems with high value output at a rapid pace

So? Agile Data Warehousing?
› Easy projections of dimensional models

Page 33

Shameless Plug:

› Available on Amazon:

http://www.amazon.com/Better-Data-Modeling-Introduction-Engineering-ebook/dp/B018BREV1C/

Page 34

Super Charge Your Data Warehouse

› Available on Amazon.com

› Soft Cover or Kindle Format

› Now also available in PDF at LearnDataVault.com

› Hint: Kent is the Technical Editor

Page 35

New DV 2.0 Book

› Available on Amazon:

http://www.amazon.com/Building-Scalable-Data-Warehouse-Vault/dp/0128025107/

Page 36

Register at wwdvc.com

Page 37

Data Vault References

www.youtube.com/LearnDataVault

www.facebook.com/learndatavault

www.learndatavault.com

www.danlinstedt.com

Page 38

QUESTIONS?

Page 39

Contact Information

KENT GRAZIANO
Snowflake Computing
www.snowflake.net

@KentGraziano

http://kentgraziano.com