Introduction to Data Vault Ilja Dmitrijev (www.in-volv.com) http://www.linkedin.com/in/iljadmitrijev Wildcard conference, Riga 2013

Upload: ilja-dmitrijevs

Post on 26-Jun-2015

TRANSCRIPT

Page 1: Introduction to data vault   ilja dmitrijev

Introduction to Data Vault Ilja Dmitrijev (www.in-volv.com)

http://www.linkedin.com/in/iljadmitrijev

Wildcard conference, Riga 2013

Page 2: Introduction to data vault   ilja dmitrijev

What do we expect from a Data Warehouse/BI?

Time-Variant (historized)

Non-volatile (no updates)

Integrated, enterprise-wide

Subject-oriented

ETL & Query performance

Easy adaptation to changes

Auditable

Page 3: Introduction to data vault   ilja dmitrijev

What do we have in DWH design now?

Star schemas

OLAP

Big Data*

*Big Data = Columnar distributed data stores

Strong in subject-oriented querying of the data

Auditability?

Historization?

Integration?

Enterprise-wide?

Easy to adapt?

Data Vault: presented by Dan Linstedt in 2000

Page 4: Introduction to data vault   ilja dmitrijev

Place of Data Vault in DWH architecture

Heavy tasks of integration, historization and cleaning are performed in the Data Vault

Data Marts (Star Schemas, OLAP, Big Data) are lightweight, presentation only, and can be rebuilt/reloaded in hours
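
A minimal sketch of the "presentation only, rebuild at any time" idea, using Python's built-in sqlite3 module. The table and column names (hub_customer, sat_customer, dim_customer) are hypothetical, and the hub is simplified to two columns for brevity. The dimension is dropped and recreated from the vault, so it holds no data of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified vault tables (illustrative names, not from the slides).
    CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_no TEXT UNIQUE);
    CREATE TABLE sat_customer (customer_sk INTEGER, load_dts TEXT, city TEXT,
                               PRIMARY KEY (customer_sk, load_dts));
    INSERT INTO hub_customer VALUES (1, 'C-1001'), (2, 'C-1002');
    INSERT INTO sat_customer VALUES
        (1, '2013-01-01', 'Riga'),
        (1, '2013-03-01', 'Jelgava'),
        (2, '2013-02-01', 'Riga');
""")

def rebuild_dim_customer(conn):
    """Drop and recreate the dimension from the latest satellite row per hub key."""
    conn.executescript("""
        DROP TABLE IF EXISTS dim_customer;
        CREATE TABLE dim_customer AS
        SELECT h.customer_no, s.city
        FROM hub_customer h
        JOIN sat_customer s ON s.customer_sk = h.customer_sk
        WHERE s.load_dts = (SELECT MAX(s2.load_dts) FROM sat_customer s2
                            WHERE s2.customer_sk = h.customer_sk);
    """)

rebuild_dim_customer(conn)
rebuild_dim_customer(conn)  # safe to repeat: the mart is fully derived
dim_rows = conn.execute(
    "SELECT customer_no, city FROM dim_customer ORDER BY customer_no"
).fetchall()
```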

Page 5: Introduction to data vault   ilja dmitrijev

The main idea of Data Vault

Break things out into component parts for flexibility and to facilitate the capture of things that are either interpreted in different ways or changing independently of each other. Decomposition.

These parts however need to be integrated to define the core business concept (the Entity, the Dimension, etc.). So they must be kept together. Unified.

Hub - The Natural Business Key

Link - The Natural Business Relationships

Satellite - All Context, Descriptive Data and History

Page 6: Introduction to data vault   ilja dmitrijev

Hub

A Hub Construct in Data Vault:
– contains the Business Key
– contains only the Business Key
– contains No Context

A Hub Table contains only:
– Business Key
– Surrogate Key (Data Warehouse)
– Load Date / Time Stamp
– Record Source

A Hub identifies entities that are important to the business

Business key = the value by which an entity is referenced by business representatives (invoice number, account number, client number etc.)
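
The four hub columns listed above can be sketched as a table using Python's built-in sqlite3 module. The names hub_customer, customer_no and load_hub are hypothetical, chosen only for illustration. The loader inserts a business key once and never updates it, so a key arriving again from a second source leaves the hub unchanged.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_sk   INTEGER PRIMARY KEY,   -- surrogate key (data warehouse)
        customer_no   TEXT NOT NULL UNIQUE,  -- business key
        load_dts      TEXT NOT NULL,         -- load date / time stamp
        record_source TEXT NOT NULL          -- record source
    )
""")

def load_hub(conn, customer_no, record_source):
    """Insert the business key if it is new; return its surrogate key."""
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer (customer_no, load_dts, record_source) "
        "VALUES (?, ?, ?)",
        (customer_no, datetime.now(timezone.utc).isoformat(), record_source),
    )
    row = conn.execute(
        "SELECT customer_sk FROM hub_customer WHERE customer_no = ?",
        (customer_no,),
    ).fetchone()
    return row[0]

sk_first = load_hub(conn, "C-1001", "crm")
sk_again = load_hub(conn, "C-1001", "billing")  # same key from another source
hub_rows = conn.execute(
    "SELECT customer_no, record_source FROM hub_customer"
).fetchall()
```

In this sketch the record source of the first arrival is kept, which is one possible convention and not something the slides prescribe.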

Page 7: Introduction to data vault   ilja dmitrijev

Link

A Link Construct in Data Vault:
– contains a Relationship
– contains only a Relationship
– contains No Context
– is always 1:1 with the Relationship

A Link Table contains only:
– Foreign keys for the Relationship (together they make up the unique key of the link table)
– Surrogate Key (Data Warehouse)
– Load Date / Time Stamp
– Record Source

By default all relations are considered M:M, which is far more natural than classical RDBMS foreign keys
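
A link table along these lines might look as follows, again as a sqlite3 sketch with hypothetical names (link_customer_account and the two hubs it connects). The combination of hub keys is unique, and nothing restricts a customer to one account or an account to one customer, which is what makes the relationship M:M by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_no TEXT UNIQUE,
                               load_dts TEXT, record_source TEXT);
    CREATE TABLE hub_account  (account_sk INTEGER PRIMARY KEY, account_no TEXT UNIQUE,
                               load_dts TEXT, record_source TEXT);
    CREATE TABLE link_customer_account (
        link_sk       INTEGER PRIMARY KEY,                              -- surrogate key
        customer_sk   INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
        account_sk    INTEGER NOT NULL REFERENCES hub_account (account_sk),
        load_dts      TEXT NOT NULL,                                    -- load time stamp
        record_source TEXT NOT NULL,                                    -- record source
        UNIQUE (customer_sk, account_sk)                                -- the relationship
    );
    INSERT INTO hub_customer VALUES (1, 'C-1001', '2013-01-01', 'crm'),
                                    (2, 'C-1002', '2013-01-01', 'crm');
    INSERT INTO hub_account  VALUES (10, 'A-10', '2013-01-01', 'billing'),
                                    (11, 'A-11', '2013-01-01', 'billing');
""")

# Customer 1 holds two accounts; account 10 is shared by two customers.
# The last pair repeats the first and is ignored by the unique key.
pairs = [(1, 10), (1, 11), (2, 10), (1, 10)]
for customer_sk, account_sk in pairs:
    conn.execute(
        "INSERT OR IGNORE INTO link_customer_account "
        "(customer_sk, account_sk, load_dts, record_source) VALUES (?, ?, ?, ?)",
        (customer_sk, account_sk, "2013-01-02", "billing"),
    )

link_count = conn.execute("SELECT COUNT(*) FROM link_customer_account").fetchone()[0]
```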

Page 8: Introduction to data vault   ilja dmitrijev

Satellite

A Satellite Construct in Data Vault:
– contains Context only
– has no FKs (no relationships)
– is attached to a hub or a link
– is designed by: Rate of Change, Type of Data, System…

A Satellite Table contains only:
– hub/link surrogate id
– Load Date / Time Stamp
– Context Data (attributes)
– Record Source

Only one instance of a satellite is valid at any time
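
One way to read "only one instance is valid at any time" is that the satellite is insert-only and the row with the latest load time stamp per parent key is the valid one. That reading is an assumption of this sketch, as are the names sat_customer_contact, current_row and load_satellite. A new row is written only when the context data actually changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sat_customer_contact (
        customer_sk   INTEGER NOT NULL,  -- hub surrogate id
        load_dts      TEXT NOT NULL,     -- load date / time stamp
        email         TEXT,              -- context data
        city          TEXT,              -- context data
        record_source TEXT NOT NULL,     -- record source
        PRIMARY KEY (customer_sk, load_dts)
    )
""")

def current_row(conn, customer_sk):
    """Return the context of the latest (currently valid) row, or None."""
    return conn.execute(
        "SELECT email, city FROM sat_customer_contact "
        "WHERE customer_sk = ? ORDER BY load_dts DESC LIMIT 1",
        (customer_sk,),
    ).fetchone()

def load_satellite(conn, customer_sk, load_dts, email, city, record_source):
    """Insert a new version only if the context changed; never update."""
    if current_row(conn, customer_sk) == (email, city):
        return False
    conn.execute(
        "INSERT INTO sat_customer_contact VALUES (?, ?, ?, ?, ?)",
        (customer_sk, load_dts, email, city, record_source),
    )
    return True

first = load_satellite(conn, 1, "2013-01-01T00:00:00", "a@example.com", "Riga", "crm")
same = load_satellite(conn, 1, "2013-02-01T00:00:00", "a@example.com", "Riga", "crm")
moved = load_satellite(conn, 1, "2013-03-01T00:00:00", "a@example.com", "Jelgava", "crm")

history_count = conn.execute("SELECT COUNT(*) FROM sat_customer_contact").fetchone()[0]
valid_now = current_row(conn, 1)
```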

Page 9: Introduction to data vault   ilja dmitrijev

Decomposition example

Handle “data explosion” issue

Vertical partitioning

Isolation of structural changes

Zero updates policy

Supports real time data

Page 10: Introduction to data vault   ilja dmitrijev

Data Vault structure example

Page 11: Introduction to data vault   ilja dmitrijev

How does DV contribute to incremental build and agility?

You may start to model even if the full scope is unknown

Simple design rules for hubs, links and satellites reduce the design error rate

As the scope of the DWH is expanded, the Data Vault can adapt to these changes without impacting the existing model.

This is what allows the DWH to be built incrementally and to adapt to change without the need for re-engineering.
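
The "no impact on the existing model" claim can be illustrated with a small sqlite3 sketch: the scope grows by adding a new hub, link and satellite, and the definitions of the tables that already existed are compared before and after. All table names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (customer_sk INTEGER PRIMARY KEY, customer_no TEXT UNIQUE,
                               load_dts TEXT, record_source TEXT);
    CREATE TABLE sat_customer_profile (customer_sk INTEGER, load_dts TEXT, name TEXT,
                                       record_source TEXT,
                                       PRIMARY KEY (customer_sk, load_dts));
""")

def table_definitions(conn):
    """Map each table name to the CREATE statement SQLite stored for it."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return dict(rows)

before = table_definitions(conn)

# Scope extension: products, customer-product relationships, loyalty data.
# Only CREATE statements are needed; no existing table is altered.
conn.executescript("""
    CREATE TABLE hub_product (product_sk INTEGER PRIMARY KEY, product_no TEXT UNIQUE,
                              load_dts TEXT, record_source TEXT);
    CREATE TABLE link_customer_product (link_sk INTEGER PRIMARY KEY,
                                        customer_sk INTEGER, product_sk INTEGER,
                                        load_dts TEXT, record_source TEXT,
                                        UNIQUE (customer_sk, product_sk));
    CREATE TABLE sat_customer_loyalty (customer_sk INTEGER, load_dts TEXT, tier TEXT,
                                       record_source TEXT,
                                       PRIMARY KEY (customer_sk, load_dts));
""")

after = table_definitions(conn)
unchanged = all(after[name] == sql for name, sql in before.items())
added = sorted(set(after) - set(before))
```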

Page 12: Introduction to data vault   ilja dmitrijev

Structure Extension Examples

Page 13: Introduction to data vault   ilja dmitrijev

A few words about satellite design

There are no strict rules

Practitioners usually split satellites:

– By data source – simplifies traceability

– By context (e.g. identification, contact info, profile) – isolates structural changes

– By rate of change – deals with data explosions

Or combine approaches

Extreme case: one satellite per attribute – helps to deal with unpredictably changing source structure
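
The effect of splitting by rate of change can be shown with a small count, sketched here in plain Python with invented sample data. A single wide satellite stores every attribute again whenever any one of them changes, while a split keeps the stable attributes apart from the volatile one. The function load_changes and the attribute names are hypothetical.

```python
def load_changes(snapshots, attributes):
    """Keep one row per change of the selected attributes (insert-only history)."""
    rows = []
    for snapshot in snapshots:
        version = {name: snapshot[name] for name in attributes}
        if not rows or rows[-1] != version:
            rows.append(version)
    return rows

# Five daily snapshots of one customer: name and city are stable,
# balance changes on most days.
snapshots = [
    {"name": "Anna", "city": "Riga", "balance": 100},
    {"name": "Anna", "city": "Riga", "balance": 120},
    {"name": "Anna", "city": "Riga", "balance": 90},
    {"name": "Anna", "city": "Riga", "balance": 90},
    {"name": "Anna", "city": "Riga", "balance": 150},
]

wide = load_changes(snapshots, ["name", "city", "balance"])
profile = load_changes(snapshots, ["name", "city"])   # slow-changing satellite
balance = load_changes(snapshots, ["balance"])        # fast-changing satellite

# Number of attribute values physically stored under each design.
wide_values = sum(len(row) for row in wide)
split_values = sum(len(row) for row in profile) + sum(len(row) for row in balance)
```

With this sample data the wide satellite stores 12 attribute values and the split design stores 6; the gap grows with the number of stable attributes.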

Page 14: Introduction to data vault   ilja dmitrijev

Cleaning, deduplicating, integrating data in Data Vault

Data Vault follows the principle that all data should be traceable back to the source and all data transformations made are auditable

Data Vault is used not only for capturing and historizing data, but also for transformations: deduplication, deriving, cleaning etc.

In the Data Vault world you will encounter:

Raw vault – original, Data Vaulted data without data creation (missing values are not replaced by defaults)

Rule vault – additional satellites for cleaned, derived, deduplicated data
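
A minimal sketch of the raw/rule separation in plain Python. The raw satellite rows are kept exactly as delivered, including the missing value, and a cleaning rule writes its output to an additional satellite whose record source names the rule, so every derived value stays traceable. The rule name, the "unknown" default and the derived_from field are assumptions made for illustration, not part of the slides.

```python
# Raw vault satellite: values as delivered by the source, nothing invented.
raw_sat_customer_contact = [
    {"customer_sk": 1, "email": "  Anna@Example.COM ", "record_source": "crm"},
    {"customer_sk": 2, "email": None, "record_source": "crm"},
]

RULE_NAME = "rule:normalize_email_v1"  # hypothetical rule identifier

def normalize_email(raw_row):
    """Derive a cleaned row; the raw row itself is never modified."""
    email = raw_row["email"]
    cleaned = email.strip().lower() if email is not None else "unknown"
    return {
        "customer_sk": raw_row["customer_sk"],
        "email": cleaned,
        "record_source": RULE_NAME,                  # which rule produced the value
        "derived_from": raw_row["record_source"],    # where the input came from
    }

# Rule vault satellite: an additional satellite holding the cleaned data.
rule_sat_customer_contact = [normalize_email(row) for row in raw_sat_customer_contact]
```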

Page 15: Introduction to data vault   ilja dmitrijev

Technical implementation considerations

Fast, massively parallel load into Data Vault

Fast retrieval of data from Data Vault – data changing with high frequency is isolated

Indexing and (horizontal) partitioning of data in Data Vault is not so crucial

Remember that Data Vault shall not be accessed by end users via ad-hoc analysis tools and reports!

Page 16: Introduction to data vault   ilja dmitrijev

Summary of Data Vault advantages

Incremental build, easy to adapt to changes

Out-of-the-box data historization and integration framework

Simple design rules, business-centric modelling

Supports graphs, unstructured and real-time data

Page 18: Introduction to data vault   ilja dmitrijev

Credits

Slides with formal definitions of Data Vault and its elements were kindly provided by Hans Hultgren (www.GeneseeAcademy.com)