Data Vault & Ladeperformance - DOAG Deutsche Oracle ...
-
© CGI Group Inc. CONFIDENTIAL
Data Vault & Ladeperformance (Load Performance)
-
About me
Markus Kollas
BI consultant since 1998
At CGI since 01/2008
Executive Consultant
BI (Framework) Trainer
-
Data Vault & Ladeperformance
1. What is a Data Vault?
2. Data Vault Modelling Basics
3. Hubs, Satellites, Links and a construct
4. Performance Tuning your Data Vault
5. Loading data into your Data Vault
6. Retrieving data from your Data Vault
-
1.1 What is a Data Vault?
Author: Dan Linstedt
Real name: Common Foundation of Data Warehouse modelling
“The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.”
“It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s Enterprise Data Warehouse.”
-
1.3 Where does Data Vault fit in?
[Figure: data flow between Transaction systems, the Data Warehouse and Analytical systems]
-
1.8 Data modelling techniques applied!
[Figure: Normalization modelling (3NF), Data Vault modelling (Hub, Link, Sat) and Dimensional modelling shown for Transaction, Data Warehouse and Analytical systems; example entities: Station, Journey Ticket, Ticket Type, Client, Line, Zone, Route, Calendar]
-
2.1 What are the Data Vault primary components?
The Data Vault consists of three primary components:
• Hubs are core business keys
• Links form all associations between the Hubs
• Satellites provide all detail information for Hubs and Links
The Hubs and Links together form the skeletal structure of the model, while the Satellites add all the descriptive details.
-
2.3 Separation of data types in a DV structure
Rule: Each component contains either Business Keys (Hub), Associations (Link), or Details (Satellite).
[Figure: example Data Vault model built from Hubs (H), Links (L) and Satellites (S); labels: Product, Name, Address, Vendor, Customer, Delivery, Order, Orderline, Producttype]
-
2.4 In comparison: A dimensional data model
[Figure: star schema with Fact_Sale surrounded by Dim_Region, Dim_Time, Dim_Product and Dim_Customer]
Fact tables contain all three types of data; Dimension tables contain Business Keys and Details.
-
2.5 In comparison: A normalized data model
[Figure: normalized model with the entities Customer, Order, Order Line, Store, Region, Vendor and Product]
Entities (tables) typically contain all three types of data.
-
2.6 The advantage of data type separation
• Each data component can be managed without impact on other components.
• Changes in the data constraints (relationships) of the source data often do not impact the Data Vault.
• All components are decoupled, which makes the Data Vault model easy to extend.
• The load procedures of the components are uniform.
-
3.4 Hub characteristics
• Primary Key: The PK is a unique hash key.
• Business Key: A Hub’s business key is the actual Hub value to store; it carries a unique index.
• Load DTS: A Hub’s Load Date Time Stamp represents the first time the EDW saw the data.
• Record Source: A Hub’s Record Source represents the source (system) of the Business Key value.
Hub columns: Business Key Hash (PK), Business_Key, Load DTS, Record_Source
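The Hub columns above can be illustrated with a small sketch. It assumes MD5 over a trimmed, upper-cased business key; the slides do not name a hash function or a normalization rule, and the names `hub_hash_key` and `make_hub_row` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def hub_hash_key(business_key: str) -> str:
    """Return a deterministic hash key for a business key.

    MD5 and the trim/upper-case normalization are assumptions of this sketch.
    """
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def make_hub_row(business_key: str, record_source: str, load_dts: datetime) -> dict:
    """Build one Hub row with the four columns listed on the slide."""
    return {
        "Business_Key_Hash": hub_hash_key(business_key),  # primary key
        "Business_Key": business_key,                     # unique index
        "Load_DTS": load_dts,                             # first time the EDW saw the key
        "Record_Source": record_source,                   # source system of the key
    }


# Hypothetical example values
row = make_hub_row("C-1001", "CRM", datetime(2012, 6, 1, tzinfo=timezone.utc))
```

Because the key depends only on the business key itself, the same key arriving from two source systems maps to the same Hub row.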
-
3.10 Link characteristics
• Primary Key: The PK is a unique hash key.
• Foreign Keys: A Link has two or more foreign keys (the Business Key Hashes of the corresponding Hubs), implementing an n:n relation between two or more Hubs. Together they form a composite unique index.
• Load DTS: A Link’s load date timestamp represents the first time the EDW saw the data.
• Record Source: A Link’s Record Source represents the source (system) of the Link associative value.
Link columns: Link_Key Hash (PK), Business Key Hashes, Load DTS, Record_Source
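A Link row can be sketched in the same way. This example assumes that the Link key is an MD5 hash over the delimited business keys of the participating Hubs; the slides only state that the PK is a unique hash key, so the hash input, the delimiter and the names `hash_key` and `make_link_row` are assumptions.

```python
import hashlib


def hash_key(*parts: str) -> str:
    """Hash one or more business keys (MD5 and the '||' delimiter are assumptions)."""
    joined = "||".join(part.strip().upper() for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def make_link_row(hub_business_keys: list, record_source: str, load_dts: str) -> dict:
    """Build one Link row relating two or more Hubs."""
    if len(hub_business_keys) < 2:
        raise ValueError("a Link relates two or more Hubs")
    return {
        "Link_Key_Hash": hash_key(*hub_business_keys),                # primary key
        "Hub_Key_Hashes": [hash_key(k) for k in hub_business_keys],  # foreign keys
        "Load_DTS": load_dts,
        "Record_Source": record_source,
    }


# Hypothetical example: a Customer key related to a Product key
link_row = make_link_row(["C-1001", "P-77"], "ORDERS", "2012-06-01")
```

The foreign keys are recomputed from the business keys, so building the Link row needs no lookup in the Hub tables.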
-
3.16 Satellite characteristics
• Business Key Hash: A foreign key to the unique key of the Hub or Link.
• Load DTS: A Satellite’s load date timestamp represents the first time the EDW saw the data (it is part of the Primary Key).
• Together, the Business Key Hash and the Load DTS form the Primary Key of the Satellite.
• Detail(s): A Satellite can have one or more detail values for one record.
• Detail Hash Diff: The Detail Hash Diff helps to compare a new record to the older ones. The column is optional.
• End DTS: A Satellite’s end date timestamp represents the time the EDW saw the new data that replaces the old record. The column is optional.
• Record Source: A Satellite’s Record Source represents the source (system) of the detail value.
• Note: To avoid outer joins, keep at least one Satellite row for every row in the Hub or Link.
Satellite columns: Business Key Hash (PK), Load DTS (PK), Details 1-n, Detail Hash Diff (optional), End_DTS (optional), Record_Source
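How the Detail Hash Diff and End DTS columns might work together can be sketched as follows. The sketch assumes an in-memory list as the Satellite, MD5 over the sorted detail columns, and ISO date strings as timestamps; none of this is prescribed by the slides, and `hash_diff` and `load_satellite` are hypothetical names.

```python
import hashlib


def hash_diff(details: dict) -> str:
    """Hash all detail values in a fixed column order (MD5 is an assumption)."""
    payload = "||".join(f"{column}={details[column]}" for column in sorted(details))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list, key_hash: str, details: dict,
                   load_dts: str, record_source: str) -> bool:
    """Insert a new Satellite row only if the details changed; return True on insert."""
    new_diff = hash_diff(details)
    open_rows = [row for row in satellite
                 if row["Business_Key_Hash"] == key_hash and row["End_DTS"] is None]
    if open_rows and open_rows[0]["Detail_Hash_Diff"] == new_diff:
        return False  # unchanged: one comparison instead of one per detail column
    for row in open_rows:
        row["End_DTS"] = load_dts  # close the version that is being replaced
    satellite.append({
        "Business_Key_Hash": key_hash,
        "Load_DTS": load_dts,
        "Details": dict(details),
        "Detail_Hash_Diff": new_diff,
        "End_DTS": None,
        "Record_Source": record_source,
    })
    return True


# Hypothetical example: three deliveries for the same Hub key
sat_customer = []
load_satellite(sat_customer, "h1", {"Name": "Ada", "City": "Bonn"}, "2012-01-01", "CRM")
load_satellite(sat_customer, "h1", {"Name": "Ada", "City": "Bonn"}, "2012-02-01", "CRM")
load_satellite(sat_customer, "h1", {"Name": "Ada", "City": "Berlin"}, "2012-03-01", "CRM")
```

The second delivery is skipped because its hash diff equals that of the open row; the third closes the first version and adds a new one.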
-
1.4 Data Vault physical structure
[Figure: two Hubs, each with a Satellite, connected by a Link]
Hub: Business Key Hash, Business_Key, Load_DTS, Record_Source
Satellite: Business Key Hash, Load_DTS, Detail(s), End_DTS, Record_Source
Link: Link Key Hash, Business Key Hashes, Load_DTS, Record_Source
-
3.1 Performance tuning your Data Vault
• After all the functional modelling is done, the performance of the Data Vault can be tuned.
• Tuning the performance does not change the functionality.
• Performance is only tuned when necessary.
• There are two options for tuning:
1. Performance tuning that can be modelled into the Data Vault by the Data Vault modeler.
2. Performance tuning that is done by the DBA of the database (such as table spacing, indexing, partitioning, etc.).
-
4.1 Data Vault – Loading sequence?
[Figure: loading sequence across Transaction systems, Staging, the EDW (DV) and the Data Mart (Dim), in three phases: Staging Loads, Data Vault Loads, Dimensional Loads]
-
4.2 Data Vault – Loading Source and Stage
[Figure: the three load phases (Staging Loads, Data Vault 2.0 Loads, Dimensional Loads), showing Sources and Stage]
Parallel loading of the Sources, followed by Staging (staging can be virtual or non-existent…).
-
4.3 Data Vault 1.0 – Loading Data Vault
[Figure: the three load phases (Staging Loads, Data Vault Loads, Dimensional Loads), showing Sources, Stage and Hubs; marked DV 1.0]
First up is parallel loading of the Hubs.
-
4.4 Data Vault 1.0 – Loading Data Vault
[Figure: the three load phases (Staging Loads, Data Vault Loads, Dimensional Loads), showing Sources, Stage, Hubs, Hub-Sat and Links; marked DV 1.0]
Then parallel loading of the Satellites belonging to the Hubs, and parallel loading of the Links between the Hubs.
-
4.5 Data Vault 2.0 – Loading Data Vault
[Figure: the three load phases (Staging Loads, Data Vault 2.0 Loads, Dimensional Loads), showing Sources, Stage, Hubs, Sats and Links; marked DV 2.0]
Parallel loading of all DV structures, as far as hardware restrictions allow.
-
4.6 Data Vault – Loading Dimensions and Facts
[Figure: the three load phases (Staging Loads, Data Vault 2.0 Loads, Dimensional Loads), showing Sources, Stage, Hubs, Sats, Links, Dims and Facts]
Finally, loading the Dimensions, followed by the Facts of the dimensional model.
-
4.7 Data Vault – Loading
• Starting the loading of the Hubs, Links and Satellites may still be a major synchronization point.
• All loading is done simultaneously, thanks to the use of Hash Keys.
• Sets of loading jobs “wait” for the previous set to complete.
• Loads are started as soon as data is ready.
• No other “waiting” time is required.
• Load dependencies are greatly reduced.
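Why Hash Keys allow simultaneous loading can be sketched as follows: every loader derives its keys from the staged business keys alone, so no loader has to wait for a key lookup in another table. The staging rows, the loader names and the use of MD5 and a thread pool are assumptions of this sketch, not part of the slides.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*parts: str) -> str:
    """Deterministic key from one or more business keys (MD5 is an assumption)."""
    joined = "||".join(part.strip().upper() for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical staging rows
STAGE = [
    {"customer": "C-1", "product": "P-1", "customer_name": "Ada"},
    {"customer": "C-2", "product": "P-1", "customer_name": "Bob"},
]


def load_hub_customer(stage):
    return {hash_key(r["customer"]): r["customer"] for r in stage}


def load_hub_product(stage):
    return {hash_key(r["product"]): r["product"] for r in stage}


def load_link_customer_product(stage):
    # Hub hash keys are recomputed from the staged business keys: no Hub lookup.
    return {hash_key(r["customer"], r["product"]):
            (hash_key(r["customer"]), hash_key(r["product"])) for r in stage}


def load_sat_customer(stage):
    return {hash_key(r["customer"]): {"name": r["customer_name"]} for r in stage}


def load_data_vault(stage):
    """Submit all Hub, Link and Satellite loaders at once and collect the results."""
    loaders = {
        "hub_customer": load_hub_customer,
        "hub_product": load_hub_product,
        "link_customer_product": load_link_customer_product,
        "sat_customer": load_sat_customer,
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(loader, stage) for name, loader in loaders.items()}
        return {name: future.result() for name, future in futures.items()}


vault = load_data_vault(STAGE)
```

The only remaining synchronization point in this sketch is the start: all four loaders wait for the staging set to be complete, then run independently.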
-
4.2 EDW: Data Vault requires an Architectural shift
[Figure: Transaction source, Staging, EDW (DV) and Data Mart (Dim), annotated with “Only ‘hard’ rules.” and “Complex business rules coming out of the EDW, ‘the lens’ filter”]
Complex business rules are only applied downstream, allowing traceability, auditability and uniform/homogeneous loading.
-
4.3 EDW: The Business Data Vault
[Figure: Transaction source, Staging, EDW (DV) and Data Mart (Dim), with the EDW divided into Raw DV and Business DV]
• Raw DV: The Raw Data Vault is the vault as described up to this point. It supports “one version of the facts”.
• Business DV: The Business Data Vault holds transformed and calculated values. It supports “business transformations”.
-
4.4 Business Data Vault Definition
• The Business Data Vault stores data processed by (soft) business rules.
• Data in the Business Data Vault is always derived from the Raw Data Vault (also called “Operational Vault”).
• Preferred design choice: separate models (Raw/Business Vault).
• Practical choices: Business Hubs, Links and Satellites are added to the Raw Data Vault model as needed.
-
4.5 Business Data Vault Example: SAT_INV_CUR
HUB_INVOICE: Invoice_Hash, Invoice_Number, Load_DTS, Record_Source
SAT_INV_AMT: Invoice_Hash, Load_DTS, Amount_Billed, Amount_Payed, End_DTS, Record_Source
SAT_INV_DT: Invoice_Hash, Load_DTS, Billed_Date, Paid_Date, End_DTS, Record_Source
SAT_INV_CUR: Invoice_Hash, Load_DTS, Currency, Exchange_Rate, Amount_Payed, End_DTS, Record_Source
Amount_Payed in SAT_INV_CUR is a derived calculation based on Amount_Payed from SAT_INV_AMT and Exchange_Rate.
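The derived calculation for SAT_INV_CUR can be sketched as follows. The slide only says that the value is based on Amount_Payed from SAT_INV_AMT and Exchange_Rate; multiplying the two and rounding to two decimals is an assumption of this sketch, as are the function name `derive_sat_inv_cur` and the Record_Source value.

```python
from decimal import Decimal


def derive_sat_inv_cur(sat_inv_amt_row: dict, currency: str,
                       exchange_rate: Decimal, load_dts: str) -> dict:
    """Derive one Business Satellite row from a raw SAT_INV_AMT row."""
    # Assumed business rule: converted amount = raw amount * exchange rate.
    converted = (sat_inv_amt_row["Amount_Payed"] * exchange_rate).quantize(Decimal("0.01"))
    return {
        "Invoice_Hash": sat_inv_amt_row["Invoice_Hash"],
        "Load_DTS": load_dts,
        "Currency": currency,
        "Exchange_Rate": exchange_rate,
        "Amount_Payed": converted,
        "End_DTS": None,
        "Record_Source": "BUSINESS_RULE",  # hypothetical marker for derived data
    }


# Hypothetical raw row and exchange rate
raw_row = {"Invoice_Hash": "inv-hash-1", "Amount_Payed": Decimal("100.00")}
cur_row = derive_sat_inv_cur(raw_row, "USD", Decimal("1.25"), "2012-06-01")
```

The raw row in SAT_INV_AMT stays untouched, so the derived value can be recomputed whenever the business rule changes.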
-
4.6 Business Data Vault Performance
Optimal choices for performance or real-time loading:
• Integrated Raw Data Vault and Business Data Vault.
• Business Hubs, Links or Satellites added to Raw Hubs, Links or Satellites.
• Example: The Customer Hub has two address Satellite tables, one for each of two separate source systems. After the raw data is loaded, business rules calculate the active address and store the result in a Business Satellite attached to the Customer Hub.
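The active-address example can be sketched as follows. The slides do not state the business rule, so this sketch assumes that the most recently loaded open row wins; the function name `active_address`, the column names and the sample data are hypothetical.

```python
from datetime import date


def active_address(customer_hash: str, address_satellites: list, load_dts: date):
    """Return a Business Satellite row holding the active address, or None."""
    open_rows = [row
                 for satellite in address_satellites
                 for row in satellite
                 if row["Business_Key_Hash"] == customer_hash and row["End_DTS"] is None]
    if not open_rows:
        return None
    # Assumed rule: the most recently loaded open address is the active one.
    winner = max(open_rows, key=lambda row: row["Load_DTS"])
    return {
        "Business_Key_Hash": customer_hash,
        "Load_DTS": load_dts,
        "Active_Address": winner["Address"],
        "Record_Source": "BUSINESS_RULE:" + winner["Record_Source"],
    }


# Hypothetical raw address Satellites, one per source system
sat_address_crm = [{"Business_Key_Hash": "h1", "Load_DTS": date(2012, 1, 10),
                    "Address": "Main St 1", "End_DTS": None, "Record_Source": "CRM"}]
sat_address_erp = [{"Business_Key_Hash": "h1", "Load_DTS": date(2012, 3, 5),
                    "Address": "Park Ave 9", "End_DTS": None, "Record_Source": "ERP"}]

business_row = active_address("h1", [sat_address_crm, sat_address_erp], date(2012, 3, 6))
```

Both raw Satellites remain as loaded; only the Business Satellite row carries the result of the rule.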
-
4.7 Retrieving data is to “know your data”
Example:
• The relation between Customer and Product has always been 1:n.
• Then, on 01-01-2010, the transaction system changes and the relation between Customer and Product becomes n:1.
• The Link can handle this change, so this is no problem.
• How can the reporting environment know about this change? It is invisible in the Data Vault model, which has not been changed…
Conclusion: Hence the necessity of Meta Data!
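One way such Meta Data could be recorded is sketched below: a small table that stores the cardinality of a Link together with the date from which it applies. The table layout, the Link name `LINK_CUSTOMER_PRODUCT` and the function `cardinality_on` are assumptions made for illustration.

```python
from datetime import date

# Hypothetical metadata table: one row per cardinality period of a Link
LINK_METADATA = [
    {"link": "LINK_CUSTOMER_PRODUCT", "valid_from": date.min, "cardinality": "1:n"},
    {"link": "LINK_CUSTOMER_PRODUCT", "valid_from": date(2010, 1, 1), "cardinality": "n:1"},
]


def cardinality_on(link: str, day: date) -> str:
    """Return the cardinality that applied to a Link on a given day."""
    candidates = [entry for entry in LINK_METADATA
                  if entry["link"] == link and entry["valid_from"] <= day]
    if not candidates:
        raise LookupError(f"no metadata for {link} on {day}")
    # The entry with the latest valid_from on or before the day applies.
    return max(candidates, key=lambda entry: entry["valid_from"])["cardinality"]


before = cardinality_on("LINK_CUSTOMER_PRODUCT", date(2009, 12, 31))
after = cardinality_on("LINK_CUSTOMER_PRODUCT", date(2010, 1, 1))
```

A reporting layer could consult such a table to decide how to aggregate rows loaded before and after the change, while the Link structure itself stays the same.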
-
CDR in the EDW model – Storing time variant data
• Tracking complete history on detailed level.
• 100% versioning and audit trail.
• Implicit implementation of MDM.
• Parallel processing of satellite loading, optimizing performance.
• True “Single Source of Facts”.
[Figure: Data Vault model for call detail records (CDR). Hubs: Phone Number, Facilities, Customer, Exchange. Links: Call Detail, Contract. Satellites: Call Detail, Exchange, Facilities, Phone Number, Contract, Customer. Relationship labels: received, making, charged, defining, writing, used, owns, part of]
-
CDR in the Subject Areas or Data Marts – Identifying dimensions
• Automated identification of candidate dimensions.
• A dimension originates from a hub.
• Combine with related links and satellites based on information requirements.
[Figure: the CDR Data Vault model (Hubs Phone Number, Facilities, Customer, Exchange; Links Call Detail, Contract; their Satellites) next to a Subject Area containing Dimension Customer, Dimension Contracts, Dimension Exchanges and Dimension Facilities]
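How a dimension might originate from a hub can be sketched as follows: each Hub row becomes one dimension row, enriched with the details of its currently open Satellite row. The sketch assumes that the optional End_DTS column is present and that details are stored as a dictionary; `build_dimension` and the sample rows are hypothetical.

```python
def build_dimension(hub_rows: list, satellite_rows: list) -> list:
    """Build dimension rows from a Hub and the open rows of one Satellite."""
    # Assumes at most one open row (End_DTS is None) per Hub key.
    current = {row["Business_Key_Hash"]: row
               for row in satellite_rows if row["End_DTS"] is None}
    dimension = []
    for hub in hub_rows:
        dim_row = {"Business_Key": hub["Business_Key"]}
        satellite = current.get(hub["Business_Key_Hash"])
        if satellite is not None:
            dim_row.update(satellite["Details"])  # descriptive attributes
        dimension.append(dim_row)
    return dimension


# Hypothetical Hub Customer and Satellite Customer rows
hub_customer = [
    {"Business_Key_Hash": "h1", "Business_Key": "C-1"},
    {"Business_Key_Hash": "h2", "Business_Key": "C-2"},
]
sat_customer = [
    {"Business_Key_Hash": "h1", "Details": {"Name": "Ada"}, "End_DTS": "2012-03-01"},
    {"Business_Key_Hash": "h1", "Details": {"Name": "Ada L."}, "End_DTS": None},
    {"Business_Key_Hash": "h2", "Details": {"Name": "Bob"}, "End_DTS": None},
]

dim_customer = build_dimension(hub_customer, sat_customer)
```

Related Links and further Satellites could be joined in the same way, depending on the information requirements.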
-
CDR in Subject Areas or Marts – Identifying facts
• Automated identification of candidate facts.
• A fact originates from a link with related hubs.
• Combine with related satellites based on information requirements.
• Optimized for analytical query performance.
[Figure: the CDR Data Vault model (Hubs Phone Number, Facilities, Customer, Exchange; Links Call Detail, Contract; their Satellites) next to a Subject Area containing Dimension Customer, Dimension Facilities, Dimension Exchanges, Dimension Contracts and Fact CDR]
Data Vault – Deutsche Bank June 2012
-
Thank you
de.cgi.com/BI