
Introduction to Data Vault Ilja Dmitrijev (www.in-volv.com)
http://www.linkedin.com/in/iljadmitrijev
Wildcard conference, Riga 2013

What do we expect from a Data Warehouse/BI?
Time-variant (historized)
Non-volatile (no updates)
Integrated, enterprise-wide
Subject-oriented
ETL & query performance
Easy to adapt to changes
Auditable

What do we have in DWH design now?
Star schemas
OLAP
Big Data*
*Big Data = columnar distributed data stores
Strong in subject-oriented querying of the data
Auditability?
Historization?
Integration?
Enterprise-wide?
Easy to adapt?
Data Vault: presented by Dan Linstedt in 2000

Place of Data Vault in DWH architecture
Heavy tasks of integration, historization and cleaning are performed in the Data Vault
Data Marts (Star Schemas, OLAP, Big Data) are lightweight, presentation only, and can be rebuilt/reloaded in hours

The main idea of Data Vault
Break things out into component parts for flexibility and to facilitate the
capture of things that are either interpreted in different ways or
changing independently of each other. Decomposition.
These parts however need to be integrated to define the core business
concept (the Entity, the Dimension, etc.). So they must be kept together.
Unified.
Hub - The Natural Business Key
Link - The Natural Business Relationships
Satellite - All Context, Descriptive Data and History

Hub
A Hub Construct in Data Vault:
contains the Business Key, and only the Business Key
contains no Context
A Hub Table contains only:
Business Key
Surrogate Key (Data Warehouse)
Load Date / Time Stamp
Record Source
A Hub identifies entities that are important to the business
Business key = the value by which business representatives refer to an entity (invoice number, account number, client number, etc.)
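The hub table layout listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table name `hub_customer`, the column names and the helper functions are hypothetical and do not come from the slides.

```python
import sqlite3
from datetime import datetime, timezone


def create_hub(conn):
    """Create a hub: business key plus technical columns, no context."""
    conn.execute(
        """
        CREATE TABLE hub_customer (
            customer_sk   INTEGER PRIMARY KEY,   -- surrogate key (DWH)
            customer_bk   TEXT NOT NULL UNIQUE,  -- business key
            load_dts      TEXT NOT NULL,         -- load date / time stamp
            record_source TEXT NOT NULL          -- record source
        )
        """
    )


def load_hub(conn, business_keys, record_source):
    """Insert only business keys not seen before; return how many were added."""
    load_dts = datetime.now(timezone.utc).isoformat()
    count_sql = "SELECT COUNT(*) FROM hub_customer"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO hub_customer (customer_bk, load_dts, record_source) "
        "VALUES (?, ?, ?)",
        [(bk, load_dts, record_source) for bk in business_keys],
    )
    return conn.execute(count_sql).fetchone()[0] - before


conn = sqlite3.connect(":memory:")
create_hub(conn)
load_hub(conn, ["C-1001", "C-1002"], "crm")      # both keys are new
load_hub(conn, ["C-1002", "C-1003"], "billing")  # only C-1003 is new
```

Existing hub rows are never updated: a business key that arrives again from another source is simply ignored, so the first record source stays on record.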

Link
A Link Construct in Data Vault:
contains a Relationship, and only a Relationship
contains no Context
is always 1:1 with the Relationship
A Link Table contains only:
Foreign keys for the Relationship (together they form the unique key of the link table)
Surrogate Key (Data Warehouse)
Load Date / Time Stamp
Record Source
By default, all relationships are treated as M:M, which is far more natural than classical RDBMS foreign keys
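A minimal sketch of a link table, again with sqlite3. The hubs, the link name `link_customer_account` and the helper `add_link` are hypothetical names chosen for illustration. The point is that the pair of hub keys forms the unique key, so the same table holds 1:1, 1:M and M:M relationships without structural change.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY, customer_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE hub_account (
    account_sk INTEGER PRIMARY KEY, account_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE link_customer_account (
    customer_account_sk INTEGER PRIMARY KEY,  -- surrogate key (DWH)
    customer_sk INTEGER NOT NULL REFERENCES hub_customer (customer_sk),
    account_sk  INTEGER NOT NULL REFERENCES hub_account (account_sk),
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_sk, account_sk)          -- the FKs form the unique key
);
INSERT INTO hub_customer VALUES (1, 'C-1', '2013-01-01', 'crm');
INSERT INTO hub_customer VALUES (2, 'C-2', '2013-01-01', 'crm');
INSERT INTO hub_account  VALUES (1, 'A-1', '2013-01-01', 'core');
INSERT INTO hub_account  VALUES (2, 'A-2', '2013-01-01', 'core');
"""


def new_vault():
    """Return an in-memory database with two small hubs and an empty link."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_link(conn, customer_bk, account_bk, load_dts, record_source):
    """Record a relationship once; repeated deliveries are ignored."""
    conn.execute(
        """
        INSERT OR IGNORE INTO link_customer_account
            (customer_sk, account_sk, load_dts, record_source)
        SELECT c.customer_sk, a.account_sk, ?, ?
        FROM hub_customer AS c CROSS JOIN hub_account AS a
        WHERE c.customer_bk = ? AND a.account_bk = ?
        """,
        (load_dts, record_source, customer_bk, account_bk),
    )


conn = new_vault()
add_link(conn, "C-1", "A-1", "2013-01-02", "core")
add_link(conn, "C-1", "A-2", "2013-01-02", "core")  # one customer, two accounts
add_link(conn, "C-2", "A-1", "2013-01-02", "core")  # one account, two customers
add_link(conn, "C-1", "A-1", "2013-01-03", "core")  # already known, ignored
```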

Satellite
A Satellite Construct in Data Vault:
contains Context only
has no FKs (no relationships)
is attached to a hub or a link
is designed by: Rate of Change, Type of Data, System…
A Satellite Table contains only:
Hub/link surrogate id
Load Date / Time Stamp
Context Data (attributes)
Record Source
Only one instance (version) of a satellite record is valid at any time
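A hedged sketch of an insert-only satellite load with sqlite3. The table `sat_customer_contact`, its attributes and the helper names are hypothetical; the hub table is omitted for brevity and the surrogate id is passed in directly. A new row is written only when the context changes, and the latest row per key is treated as the valid one.

```python
import sqlite3


def create_satellite(conn):
    """Satellite: parent surrogate id, load time stamp, context, record source."""
    conn.execute(
        """
        CREATE TABLE sat_customer_contact (
            customer_sk   INTEGER NOT NULL,  -- hub surrogate id
            load_dts      TEXT NOT NULL,     -- load date / time stamp
            email         TEXT,              -- context data
            phone         TEXT,              -- context data
            record_source TEXT NOT NULL,
            PRIMARY KEY (customer_sk, load_dts)
        )
        """
    )


def current_row(conn, customer_sk):
    """Return the currently valid (latest) context for one hub key, or None."""
    return conn.execute(
        "SELECT email, phone FROM sat_customer_contact "
        "WHERE customer_sk = ? ORDER BY load_dts DESC LIMIT 1",
        (customer_sk,),
    ).fetchone()


def load_satellite(conn, customer_sk, email, phone, load_dts, record_source):
    """Insert a new version only if the context changed; never update."""
    if current_row(conn, customer_sk) == (email, phone):
        return False
    conn.execute(
        "INSERT INTO sat_customer_contact VALUES (?, ?, ?, ?, ?)",
        (customer_sk, load_dts, email, phone, record_source),
    )
    return True


conn = sqlite3.connect(":memory:")
create_satellite(conn)
load_satellite(conn, 1, "ann@example.com", "111", "2013-01-01T00:00:00", "crm")
load_satellite(conn, 1, "ann@example.com", "111", "2013-01-02T00:00:00", "crm")  # unchanged
load_satellite(conn, 1, "ann@example.com", "222", "2013-01-03T00:00:00", "crm")  # new version
```

The sketch assumes ISO-formatted time stamps, so that text ordering matches time ordering.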

Decomposition example
Handles the “data explosion” issue
Vertical partitioning
Isolation of structural changes
Zero updates policy
Supports real-time data

Data Vault structure example

How does DV contribute to incremental build and agility?
You can start modeling even if the full scope is unknown
Simple design rules for hubs, links and satellites reduce the design error rate
As the scope of the DWH expands, the Data Vault can adapt to these changes without impacting the existing model.
This is what allows the DWH to be built incrementally and to adapt to change without the need for re-engineering.
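The claim that extension does not touch the existing model can be illustrated with a small sqlite3 sketch. All table names here are hypothetical. The scope grows by adding a hub, a link and a satellite, and the definitions of the original tables are compared before and after.

```python
import sqlite3

# Initial scope: one hub and one satellite (hypothetical names).
INITIAL_MODEL = """
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY, customer_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE sat_customer_profile (
    customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL, name TEXT,
    record_source TEXT NOT NULL, PRIMARY KEY (customer_sk, load_dts));
"""

# Later scope: products, purchases and loyalty data arrive as new tables only.
EXTENSION = """
CREATE TABLE hub_product (
    product_sk INTEGER PRIMARY KEY, product_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE link_customer_product (
    customer_product_sk INTEGER PRIMARY KEY,
    customer_sk INTEGER NOT NULL, product_sk INTEGER NOT NULL,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL,
    UNIQUE (customer_sk, product_sk));
CREATE TABLE sat_customer_loyalty (
    customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL, tier TEXT,
    record_source TEXT NOT NULL, PRIMARY KEY (customer_sk, load_dts));
"""


def table_definitions(conn):
    """Map each table name to the CREATE statement SQLite stored for it."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(INITIAL_MODEL)
before = table_definitions(conn)
conn.executescript(EXTENSION)
after = table_definitions(conn)

# Every original table keeps exactly the definition it had before.
unchanged = all(after[name] == sql for name, sql in before.items())
```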

Structure Extension Examples

A few words about satellite design
There are no strict rules
Practitioners usually split satellites:
– By data source: simplifies traceability
– By context (e.g. identification, contact info, profile): isolates structural changes
– By rate of change: deals with data explosion
Or combine approaches
Extreme case: one satellite per attribute, which helps to deal with an unpredictably changing source structure
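Why splitting by rate of change limits data explosion can be shown with a small worked example in plain Python. The snapshots and attribute names are made up for illustration. Each change in any attribute of a satellite creates a new row that repeats all of its attributes, so a fast-changing attribute forces copies of the slow ones unless it lives in its own satellite.

```python
def count_versions(snapshots, attributes):
    """Count the rows an insert-on-change satellite with these attributes stores."""
    versions = 0
    previous = None
    for snapshot in snapshots:
        current = tuple(snapshot[name] for name in attributes)
        if current != previous:
            versions += 1
            previous = current
    return versions


# Five daily source snapshots for one customer: only the balance changes.
snapshots = [
    {"name": "Ann", "city": "Riga", "balance": 100},
    {"name": "Ann", "city": "Riga", "balance": 110},
    {"name": "Ann", "city": "Riga", "balance": 120},
    {"name": "Ann", "city": "Riga", "balance": 130},
    {"name": "Ann", "city": "Riga", "balance": 140},
]

# One wide satellite: every balance change copies name and city as well.
wide_rows = count_versions(snapshots, ["name", "city", "balance"])
wide_values = wide_rows * 3

# Split by rate of change: a slow satellite and a fast satellite.
slow_rows = count_versions(snapshots, ["name", "city"])
fast_rows = count_versions(snapshots, ["balance"])
split_values = slow_rows * 2 + fast_rows * 1

print(wide_values, split_values)  # prints "15 7"
```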

Cleaning, deduplicating, integrating data in Data Vault
Data Vault follows the principle that all data should be traceable back to the source and that all data transformations are auditable
Data Vault is used not only for capturing and historizing data, but also for transformations: deduplication, derivation, cleaning, etc.
In the Data Vault world you will encounter:
Raw vault: original, Data Vaulted data without data creation (missing values are not replaced by defaults)
Rule vault: additional satellites for cleaned, derived, deduplicated data
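The raw vault / rule vault split can be sketched in plain Python. Everything named here is an assumption for illustration: the `SatelliteRow` type, the cleaning rule, the default value and the way the rule is recorded in `record_source`. The raw row is kept exactly as delivered, and the cleaned row is written separately and names the rule that produced it, so the transformation stays auditable.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SatelliteRow:
    hub_sk: int
    load_dts: str
    email: Optional[str]
    record_source: str


DEFAULT_EMAIL = "unknown"  # hypothetical default, applied in the rule vault only


def apply_email_rule(raw: SatelliteRow) -> SatelliteRow:
    """Build the rule-vault row; the raw row itself is never modified."""
    if raw.email is None:
        cleaned = DEFAULT_EMAIL
    else:
        cleaned = raw.email.strip().lower()
    # The record source names the rule and the origin it was applied to.
    source = "rule:clean_email(" + raw.record_source + ")"
    return replace(raw, email=cleaned, record_source=source)


raw_rows = [
    SatelliteRow(1, "2013-01-01T00:00:00", "  Ann@Example.COM ", "crm"),
    SatelliteRow(2, "2013-01-01T00:00:00", None, "crm"),  # missing stays missing
]
rule_rows = [apply_email_rule(row) for row in raw_rows]
```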

Technical implementation considerations
Fast, massively parallel load into Data Vault
Fast retrieval of data from Data Vault: data changing with high frequency is isolated
Indexing and (horizontal) partitioning of data in Data Vault are not so crucial
Remember that the Data Vault shall not be accessed by end users via ad-hoc analysis tools and reports!
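Since end users query data marts rather than the vault, a presentation table has to be derived from hubs and satellites. A minimal sqlite3 sketch, with hypothetical table names and sample rows, rebuilds a customer dimension from a hub and the latest version of each satellite record.

```python
import sqlite3

VAULT = """
CREATE TABLE hub_customer (
    customer_sk INTEGER PRIMARY KEY, customer_bk TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL, record_source TEXT NOT NULL);
CREATE TABLE sat_customer_profile (
    customer_sk INTEGER NOT NULL, load_dts TEXT NOT NULL,
    name TEXT, city TEXT, record_source TEXT NOT NULL,
    PRIMARY KEY (customer_sk, load_dts));
INSERT INTO hub_customer VALUES (1, 'C-1001', '2013-01-01', 'crm');
INSERT INTO hub_customer VALUES (2, 'C-1002', '2013-01-01', 'crm');
INSERT INTO sat_customer_profile VALUES (1, '2013-01-01', 'Ann', 'Riga', 'crm');
INSERT INTO sat_customer_profile VALUES (1, '2013-06-01', 'Ann', 'Tallinn', 'crm');
INSERT INTO sat_customer_profile VALUES (2, '2013-01-01', 'Bob', 'Vilnius', 'crm');
"""


def rebuild_dim_customer(conn):
    """Drop and rebuild the presentation table from the vault (full reload)."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS dim_customer;
        CREATE TABLE dim_customer AS
        SELECT h.customer_bk AS customer_number, s.name, s.city
        FROM hub_customer AS h
        JOIN sat_customer_profile AS s ON s.customer_sk = h.customer_sk
        WHERE s.load_dts = (
            SELECT MAX(load_dts) FROM sat_customer_profile
            WHERE customer_sk = h.customer_sk
        );
        """
    )


conn = sqlite3.connect(":memory:")
conn.executescript(VAULT)
rebuild_dim_customer(conn)
rebuild_dim_customer(conn)  # safe to repeat: the mart holds no history of its own
rows = conn.execute(
    "SELECT customer_number, name, city FROM dim_customer ORDER BY customer_number"
).fetchall()
```

Because the mart can always be regenerated from the vault, it can stay lightweight and be dropped and reloaded whenever reporting needs change.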

Summary of Data Vault advantages
Incremental build, easy to adapt to changes
Out-of-the-box data historization, integration framework
Simple design rules, business-centric modelling
Supports graphs, unstructured and real-time data

Some useful resources
Data Vault Discussions
DataVaultAcademy
www.GeneseeAcademy.com
http://danlinstedt.com
http://www.anchormodeling.com; 6NF
Other methodologies applying a similar modeling approach
Quipu http://www.datawarehousemanagement.org
Convenient data modeling, ETL, SQL tools

Credits
Slides with the formal definitions of Data Vault and its elements were kindly provided by Hans Hultgren (www.GeneseeAcademy.com)