A company of Daimler AG
DWH REFACTORING WITH DATA VAULT
ANDREAS BUCKENHOFER
DOAG K&A 2017, NUREMBERG
ABOUT ME
https://de.linkedin.com/in/buckenhofer
https://twitter.com/ABuckenhofer
https://www.doag.org/de/themen/datenbank/in-memory/
http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/
https://www.xing.com/profile/Andreas_Buckenhofer2
Andreas Buckenhofer
Senior DB Professional
Since 2009 at Daimler TSS
Department: Big Data
Business Unit: Analytics
As a 100% Daimler subsidiary, we give
100 percent, always and never less.
We love IT and pull out all the stops to
aid Daimler's development with our
expertise on its journey into the future.
Our objective: We make Daimler the
most innovative and digital mobility
company.
NOT JUST AVERAGE: OUTSTANDING.
Daimler TSS
INTERNAL IT PARTNER FOR DAIMLER
+ Holistic solutions according to the Daimler guidelines
+ IT strategy
+ Security
+ Architecture
+ Developing and securing know-how
+ TSS is a partner who can be trusted with sensitive data
As a subsidiary: maximum added value for Daimler
+ Market closeness
+ Independence
+ Flexibility (short decision making process,
ability to react quickly)
Daimler TSS
LOCATIONS
Daimler TSS China
Hub Beijing
10 employees
Daimler TSS Malaysia
Hub Kuala Lumpur
42 employees
Daimler TSS India
Hub Bangalore
22 employees
Daimler TSS Germany
7 locations
1000 employees*
Ulm (Headquarters)
Stuttgart
Berlin
Karlsruhe
* as of August 2017
High expected data growth
Inflexibility of grown systems
Increasing requirements due to legal changes or financial formalities
Make the DWH ready for the future
Modern architecture: replace the Kimball architecture and its limitations
REQUIREMENTS AND CHALLENGES
• Kimball-style DWH with 2 tiers
• Staging layer
• Core/mart layer
• Up to 10 years of data
• 7 TB of data (compressed)
• All data in Mart layer
• Oracle 11gR2 on SLES and Informatica PowerCenter 9.6.1
• Diverging requirements from 30+ national headquarters lead to long and complex mappings
• Insufficient performance and slow development process with changing requirements
• Low and differing data quality
STARTING BASIS – EXISTING DWH
Data Vault is just one part of the solution to cope with new challenges
Experience with Data Vault @Daimler since 2004/2005
Data Vault was convincing and its feasibility had been demonstrated
STARTING BASIS – DATA VAULT
DATA VAULT - ARCHITECTURE, METHODOLOGY, MODEL
Architecture
• Multi-Tier
• Scalable
• Supports NoSQL
Methodology
• Repeatable
• Measurable
• Agile
Model
• Flexible
• Hash based
• Hub & Spoke
Implementation: Automation,
Pattern based, High speed
HUB TABLES: TYPICAL CHARACTERISTICS
Business Keys should be natural keys used by the business (e.g. Vehicle Identifier, Serial number)
Business Keys should stand alone and have meaning to the business
Business Keys should never change, have the same semantic meaning and the same granularity
Focusing on Business Keys (instead of on source system surrogate keys) ensures that the result serves the needs of the business
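The slides call the model "hash based", and a later sample schema shows H_VEHICLE_KEY as BINARY(20). A minimal sketch of how a Hub key could be derived from a Business Key follows. The choice of SHA-1 (which happens to yield 20 bytes), the trim/upper-case normalization, the function name `hub_hash_key`, and the sample identifier are all assumptions made for illustration; the talk does not state them.

```python
import hashlib


def hub_hash_key(business_key: str) -> bytes:
    """Return a 20-byte SHA-1 digest of the normalized business key."""
    # Assumed normalization: trim and upper-case, so that the same key
    # delivered by different sources produces the same hash.
    normalized = business_key.strip().upper()
    return hashlib.sha1(normalized.encode("utf-8")).digest()


vin = "WDB1234567A123456"  # made-up 17-character identifier
key = hub_hash_key(vin)
print(len(key), key.hex())
```

Because the key depends only on the Business Key, Hubs, Links, and Sats can be loaded without looking up a sequence-generated surrogate first.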
LINK: unique relationships between Business Keys (HUBs)
LINK TABLES: TYPICAL CHARACTERISTICS
A LINK models a relationship between 2 or more HUBs
The relationship is always n:m
The composed key must be unique. One of the foreign keys is the driving key
Link-to-Link relationships are allowed but should be avoided in a physical implementation because of the load dependencies they create
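The uniqueness of the composed key can be illustrated with a short sketch: the Link key is computed from the Business Keys of all connected Hubs, and repeated deliveries of the same combination collapse into one Link row. SHA-1, the `|` delimiter, and the names `link_hash_key` and `dedupe_links` are assumptions for illustration only.

```python
import hashlib

# Assumed separator. It keeps ("AB", "C") distinct from ("A", "BC"),
# provided the delimiter itself never occurs inside a business key.
DELIMITER = "|"


def link_hash_key(*business_keys: str) -> bytes:
    """Hash the normalized business keys of all Hubs taking part in the Link."""
    normalized = [k.strip().upper() for k in business_keys]
    return hashlib.sha1(DELIMITER.join(normalized).encode("utf-8")).digest()


def dedupe_links(rows):
    """Keep one entry per unique combination of business keys."""
    links = {}
    for keys in rows:
        links.setdefault(link_hash_key(*keys), keys)
    return links


# (customer number, vehicle identifier, contract number), all made up
rows = [
    ("C-100", "VIN-1", "K-9"),
    ("C-100", "VIN-1", "K-9"),  # same relationship delivered twice
    ("C-100", "VIN-2", "K-9"),
]
links = dedupe_links(rows)
print(len(links))
```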
• Relationships / Associations
• Foreign Keys in OLTP systems
• Hierarchies and Redefinitions
• Hierarchical relationships are modeled by one LINK with two connections to HUBs: HAL
(parent-child LINK) and SAL (same-as LINK)
• Transactions and events are often modeled as a LINK (could also be a HUB);
this is a matter of dispute
• E.g. sales order or sensor data
• Intensive discussions about modeling as Hub or Link at conferences and on social media
(the modeling solution depends on requirements, context, etc.)
CANDIDATES FOR LINKS
SAT: descriptive, detailed, current and historized data
SAT TABLES: TYPICAL CHARACTERISTICS
Contains all non-key attributes
Is connected to exactly one Hub or Link
HUB or LINK tables can (should) have several SAT tables, e.g. by source system
Can contain any number of columns, in the extreme case only one
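Since a Sat holds both current and historized data, a typical load inserts a new row only when the descriptive attributes have changed. The sketch below shows one way to do that with a hash over the attributes. The `hash_diff` column, the SQLite in-memory database, and the sample values are assumptions for illustration; the talk itself uses Oracle and Informatica and does not describe this mechanism.

```python
import hashlib
import sqlite3


def hash_diff(*attributes) -> str:
    """Hash all descriptive attributes so that changes can be detected cheaply."""
    joined = "|".join("" if a is None else str(a) for a in attributes)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE s_vehicle_source1 (
        h_vehicle_key TEXT NOT NULL,
        load_date     TEXT NOT NULL,
        hash_diff     TEXT NOT NULL,  -- hypothetical helper column
        model         TEXT,
        type          TEXT,
        PRIMARY KEY (h_vehicle_key, load_date)
    )""")


def load_satellite(conn, h_vehicle_key, load_date, model, type_):
    """Insert a new Sat row only if the attributes differ from the latest row."""
    new_diff = hash_diff(model, type_)
    latest = conn.execute(
        "SELECT hash_diff FROM s_vehicle_source1 "
        "WHERE h_vehicle_key = ? ORDER BY load_date DESC LIMIT 1",
        (h_vehicle_key,),
    ).fetchone()
    if latest is not None and latest[0] == new_diff:
        return False  # unchanged: history stays as it is
    conn.execute(
        "INSERT INTO s_vehicle_source1 VALUES (?, ?, ?, ?, ?)",
        (h_vehicle_key, load_date, new_diff, model, type_),
    )
    return True


load_satellite(conn, "k1", "2017-01-01", "A 180", "Hatchback")  # first version
load_satellite(conn, "k1", "2017-01-02", "A 180", "Hatchback")  # no change
load_satellite(conn, "k1", "2017-01-03", "A 200", "Hatchback")  # new version
count = conn.execute("SELECT COUNT(*) FROM s_vehicle_source1").fetchone()[0]
print(count)
```

Existing rows are never updated, which keeps the Sat auditable and the load pattern identical for every Sat.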
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2
• Raw Vault
• Sat s_code containing raw data
• Business Vault
• Sat bs_interior containing standardized
interior data after applying business rules
on column s_code.code
• Standard_code is computed/derived
• Interior is computed/derived
• Complex business rules for different
car models
• Business rules change regularly
RAW DATA VAULT AND BUSINESS DATA VAULT
SAMPLE PATTERN FOR SATELLITE DATA
H_VEHICLE
    H_VEHICLE_KEY   BINARY(20)     <pk>
    VIN             VARCHAR2(17)
S_CODE
    H_VEHICLE_KEY   BINARY(20)     <pk,fk>
    LOAD_DATE       DATE           <pk>
    CODE            VARCHAR2(5)
BS_INTERIOR
    H_VEHICLE_KEY   BINARY(20)     <pk,fk>
    LOAD_DATE       DATE           <pk>
    INTERIOR        VARCHAR2(50)
    STANDARD_CODE   VARCHAR2(5)
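The relationship between the Raw Vault Sat S_CODE and the Business Vault Sat BS_INTERIOR can be sketched as a derivation that reads raw rows and writes computed rows, leaving the raw data untouched. The rule table `INTERIOR_RULES`, its codes, and the function name are hypothetical; the actual business rules per car model are not given in the slides.

```python
# Hypothetical soft rules: raw code -> (STANDARD_CODE, INTERIOR)
INTERIOR_RULES = {
    "L01": ("LEA", "Leather"),
    "L1": ("LEA", "Leather"),   # older spelling of the same code (made up)
    "F07": ("FAB", "Fabric"),
}


def derive_bs_interior(s_code_rows):
    """Build BS_INTERIOR rows from S_CODE rows without modifying the raw rows.

    s_code_rows: iterable of (h_vehicle_key, load_date, code)
    returns: list of (h_vehicle_key, load_date, interior, standard_code)
    """
    derived = []
    for h_vehicle_key, load_date, code in s_code_rows:
        rule = INTERIOR_RULES.get(code)
        if rule is None:
            continue  # under these rules the code says nothing about the interior
        standard_code, interior = rule
        derived.append((h_vehicle_key, load_date, interior, standard_code))
    return derived


raw = [
    ("k1", "2017-01-01", "L1"),
    ("k1", "2017-02-01", "XYZ"),
    ("k2", "2017-01-15", "F07"),
]
print(derive_bs_interior(raw))
```

When the business rules change, only the rule table and BS_INTERIOR need to be rebuilt; S_CODE remains the unchanged record of what the source delivered.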
• Separation of integration and business rules / transformations
• Core Warehouse Layer is modeled with Data Vault and integrates data by BK
(business key) “only”
• Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart
Layer and not earlier
• Alternatively from Raw Data Vault to additional layer called Business Data Vault
• Hard Rules don’t change data
• Data is fully auditable
• Real-time capable architecture
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
• In classical DWHs, the Core Warehouse Layer is regarded as the “single
version of the truth”
• Integrates + cleanses data from different sources and eliminates contradiction
• Produces consistent results/reports across Data Marts
• But: cleansing is (still) subjective, enterprises change regularly, and the paradigm does not scale as
more and more systems exist
• Data in Raw Data Vault Layer is regarded as “Single version of the facts”
• 100% of the data is loaded 100% of the time
• Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)
DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)
• Experience from a DWH with mobility data moving to Data Vault while
migrating to new hardware
• Similar requirements (huge data growth, JSON data sources)
• Similar starting position: Kimball style DWH on SQL Server (+ migration to new HW)
• Agreed approach:
• Small steps
• Move one fact table after another to the new hardware, including the new Data Vault tables and
staging tables
• Theory vs practice
• No chance to avoid the first fact table with a huge number of columns
• → Virtual Big Bang
BIG BANG OR SMALL STEPS?
Lessons Learned from a project moving to Data Vault while migrating to new
hardware:
• Laborious to maintain two systems in parallel
• The main challenge is the constant stream of new requirements and bug fixes
• The change to Data Vault was and still is considered the right step towards the
future
→ Small steps that fit into a 3-4 week sprint
→ No standstill: can be combined with new requirements
BIG BANG OR SMALL STEPS?
Key questions
• What about existing data in dimensions?
• What about existing data in facts?
• What about new data?
General approach
• Create Hub / Link / Sat tables in Data Vault layer
• Copy data from tables in Mart layer into tables in Data Vault layer
• Integrate new data into Data Vault layer first
KEY QUESTIONS AND GENERAL APPROACH
F_CONTRACT
    D_FUNDING_UNIT_ID    <fk1>
    D_VEHICLE_ID         <fk2>
    D_DATE_ID_ORDER      <fk3>
    D_STATUS_ID          <fk4>
    D_DATE_ID_DELIVERY   <fk5>
    D_CUSTOMER_ID        <fk6>
    AMOUNT_EXPECTED
    AMOUNT_ACTUAL
    CONTRACTNO
F_ACCOUNTPLAN
    D_VEHICLE_ID         <fk1>
    D_CUSTOMER_ID        <fk2>
    D_DATE_ID            <fk3>
    AMOUNT
    RISK
D_STATUS
    D_STATUS_ID          <pk>
D_CUSTOMER
    D_CUSTOMER_ID        <pk>
    CUSTOMERNO
    NAME
    VALIDFROM
    VALIDTO
D_VEHICLE
    D_VEHICLE_ID         <pk>
    VIN
    MODEL
    TYPE
    VALIDFROM
    VALIDTO
D_DATE
    D_DATE_ID            <pk>
D_FUNDING_UNIT
    D_FUNDING_UNIT_ID    <pk>
    FUNDING_UNIT
    VALIDFROM
    VALIDTO
DIMENSION TABLES
H_VEHICLE
    H_VEHICLE_KEY           <pk>
    FIN
S_VEHICLE_DIM
    H_VEHICLE_KEY           <pk,fk>
    LOAD_DATE               <pk>
    MODEL
    TYPE
    VALIDFROM
    VALIDTO
S_VEHICLE_SOURCE1
    H_VEHICLE_KEY           <pk,fk>
    LOAD_DATE               <pk>
    MODEL
    TYPE
H_CUSTOMER
    H_CUSTOMER_KEY          VARCHAR2(20)   <pk>
S_CUSTOMER_DIM
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    NAME
    VALIDFROM
    VALIDTO
S_CUSTOMER_SOURCE1
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    NAME
S_CUSTOMER_SOURCE2
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    ADDITIONAL_ATTRIBUTES
• Move data back into Hub and Sat tables
• Create Hub and Sat tables
• 1 Hub
• 1 Sat for dimension (“cleansed data”; Business Vault)
• 1-n Sats for source tables (“original data”; Raw Vault)
• Copy data from dimension into Hub and Sat tables (one-time action)
• Sat table is “closed” afterwards: no more data changes
• Change Informatica mappings to store data in Hub and Sat first and then move data into
dimensions
• Delete old data in dimension (older than 2 or 5 years)
DIMENSION TABLES
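The one-time copy from a dimension into Hub and Sat tables can be sketched as follows, using an in-memory SQLite database in place of Oracle. The sample rows, the SHA-1 hash key, the hub column name `vin` (the slide labels it FIN), and the use of VALIDFROM as LOAD_DATE are assumptions for illustration, not the project's actual mapping.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    """Assumed hash key: SHA-1 over the trimmed, upper-cased business key."""
    return hashlib.sha1(business_key.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("hash_key", 1, hash_key)  # make the function callable from SQL
conn.executescript("""
    CREATE TABLE d_vehicle (d_vehicle_id INTEGER PRIMARY KEY, vin TEXT, model TEXT,
                            type TEXT, validfrom TEXT, validto TEXT);
    CREATE TABLE h_vehicle (h_vehicle_key TEXT PRIMARY KEY, vin TEXT NOT NULL);
    CREATE TABLE s_vehicle_dim (h_vehicle_key TEXT, load_date TEXT, model TEXT,
                                type TEXT, validfrom TEXT, validto TEXT,
                                PRIMARY KEY (h_vehicle_key, load_date));
    INSERT INTO d_vehicle VALUES
        (1, 'VIN-1', 'A 180', 'Hatchback', '2015-01-01', '2016-06-30'),
        (2, 'VIN-1', 'A 200', 'Hatchback', '2016-07-01', '9999-12-31'),
        (3, 'VIN-2', 'C 220', 'Saloon',    '2016-01-01', '9999-12-31');
""")

# One Hub row per business key, regardless of how many versions the dimension holds.
conn.execute("INSERT INTO h_vehicle SELECT DISTINCT hash_key(vin), vin FROM d_vehicle")

# One Sat row per dimension version; VALIDFROM doubles as LOAD_DATE (assumption).
conn.execute("""
    INSERT INTO s_vehicle_dim
    SELECT hash_key(vin), validfrom, model, type, validfrom, validto FROM d_vehicle
""")

hubs = conn.execute("SELECT COUNT(*) FROM h_vehicle").fetchone()[0]
sats = conn.execute("SELECT COUNT(*) FROM s_vehicle_dim").fetchone()[0]
print(hubs, sats)
```

The surrogate D_VEHICLE_ID is deliberately not carried over: the Hub is identified by the business key taken from the dimension's alternate key.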
FACT TABLES
H_VEHICLE
    H_VEHICLE_KEY           <pk>
    FIN
S_VEHICLE_DIM
    H_VEHICLE_KEY           <pk,fk>
    LOAD_DATE               <pk>
    MODEL
    TYPE
    VALIDFROM
    VALIDTO
S_VEHICLE_SOURCE1
    H_VEHICLE_KEY           <pk,fk>
    LOAD_DATE               <pk>
    MODEL
    TYPE
H_CUSTOMER
    H_CUSTOMER_KEY          VARCHAR2(20)   <pk>
S_CUSTOMER_DIM
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    NAME
    VALIDFROM
    VALIDTO
S_CUSTOMER_SOURCE1
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    NAME
S_CUSTOMER_SOURCE2
    H_CUSTOMER_KEY          <pk,fk>
    LOAD_DATE               <pk>
    ADDITIONAL_ATTRIBUTES
H_CONTRACT
    H_CONTRACT_KEY          <pk>
    CONTRACTNO
L_CONTRACT
    L_CONTRACT_KEY          <pk>
    H_CUSTOMER_KEY          <fk2>
    H_VEHICLE_KEY           <fk3>
    H_CONTRACT_KEY          <fk1>
S_CONTRACT_FACT
    H_CONTRACT_KEY          <pk,fk>
    LOAD_DATE               <pk>
    AMOUNT_ACTUAL
    CONTRACTDATE
S_CONTRACT_SOURCE1
    H_CONTRACT_KEY          <pk,fk>
    LOAD_DATE               <pk>
    PRICE
• Move data back into Hub and Sat tables
• Create Link and Sat tables
• 1 Link
• 1 Hub (optional if degenerate dimension, see next slide)
• 1 Sat for measurements from fact (“cleansed data”; Business Vault)
• 1-n Sats for source tables (“original data”; Raw Vault)
• Copy data from fact into Link and Sat tables (one-time action)
• Sat table is “closed” afterwards: no more data changes
• Change Informatica mappings to store data in Hub, Link and Sat first and then move data into
facts
• Delete old data in facts (older than 2 or 5 years)
FACT TABLES
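The one-time copy of a fact table can be sketched in the same spirit: the surrogate IDs in F_CONTRACT are resolved to business keys through the dimensions, and those keys feed H_CONTRACT, L_CONTRACT, and S_CONTRACT_FACT. SQLite, the reduced column set, the sample rows, the fixed load date, and the hashing convention in `hk` are assumptions for illustration.

```python
import hashlib
import sqlite3


def hk(*business_keys: str) -> str:
    """SHA-1 over the normalized, '|'-joined business keys (assumed convention)."""
    joined = "|".join(k.strip().upper() for k in business_keys)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE d_customer (d_customer_id INTEGER PRIMARY KEY, customerno TEXT);
    CREATE TABLE d_vehicle  (d_vehicle_id  INTEGER PRIMARY KEY, vin TEXT);
    CREATE TABLE f_contract (d_customer_id INTEGER, d_vehicle_id INTEGER,
                             contractno TEXT, amount_actual REAL);
    CREATE TABLE h_contract (h_contract_key TEXT PRIMARY KEY, contractno TEXT);
    CREATE TABLE l_contract (l_contract_key TEXT PRIMARY KEY, h_customer_key TEXT,
                             h_vehicle_key TEXT, h_contract_key TEXT);
    CREATE TABLE s_contract_fact (h_contract_key TEXT, load_date TEXT,
                                  amount_actual REAL,
                                  PRIMARY KEY (h_contract_key, load_date));
    INSERT INTO d_customer VALUES (1, 'C-100'), (2, 'C-200');
    INSERT INTO d_vehicle  VALUES (1, 'VIN-1'), (2, 'VIN-2');
    INSERT INTO f_contract VALUES (1, 1, 'K-9', 25000.0), (2, 2, 'K-10', 31000.0);
""")

# Resolve the fact's surrogate keys back to business keys via the dimensions.
fact_rows = conn.execute("""
    SELECT c.customerno, v.vin, f.contractno, f.amount_actual
    FROM f_contract f
    JOIN d_customer c ON c.d_customer_id = f.d_customer_id
    JOIN d_vehicle  v ON v.d_vehicle_id  = f.d_vehicle_id
""").fetchall()

LOAD_DATE = "2017-11-21"  # arbitrary date for the one-time copy
for customerno, vin, contractno, amount in fact_rows:
    conn.execute("INSERT OR IGNORE INTO h_contract VALUES (?, ?)",
                 (hk(contractno), contractno))
    conn.execute("INSERT OR IGNORE INTO l_contract VALUES (?, ?, ?, ?)",
                 (hk(customerno, vin, contractno), hk(customerno), hk(vin), hk(contractno)))
    conn.execute("INSERT OR IGNORE INTO s_contract_fact VALUES (?, ?, ?)",
                 (hk(contractno), LOAD_DATE, amount))

counts = [conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in ("h_contract", "l_contract", "s_contract_fact")]
print(counts)
```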
• Transactions and events are often modeled as link (could also be a Hub):
matter of dispute
• In this case f_contract was modeled as Hub + Sats with a connection to the Link
• Alternative is to model contract fact as Link + Sat only
• First option is more flexible and easier to automate
• Second option may be the only way if there is no “business key” contractno or orderno
LINKS – MATTER OF DISPUTE
• Dimension and fact tables still have the same structure
• Additional Core Warehouse Layer with Hub, Link, and Sat
• Not all dimensions and facts are refactored
• Many reference tables are still just dimensions (e.g. d_status and similar)
• Informatica mappings now store data in Core Warehouse Layer first, then
move data into Mart
• What is the benefit?
• Slower
• Informatica mappings got even more complex
• Prerequisite for next step
INTERMEDIATE RESULT
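Moving data from the Core Warehouse Layer into the Mart includes rebuilding dimension versions from insert-only Sat rows. A minimal sketch of that step follows, written in plain Python rather than as an Informatica mapping. The open-end date `9999-12-31`, the half-open validity convention (a version ends where the next one begins), and the function name are assumptions.

```python
from itertools import groupby
from operator import itemgetter

OPEN_END = "9999-12-31"  # assumed marker for the current version


def sat_to_dimension(sat_rows):
    """Turn insert-only Sat rows into dimension versions with validity ranges.

    sat_rows: iterable of (h_key, load_date, model); dates are ISO strings.
    returns: list of (h_key, model, validfrom, validto)
    """
    versions = []
    ordered = sorted(sat_rows, key=itemgetter(0, 1))
    for h_key, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        # Pair each row with its successor; the last row has none.
        for current, following in zip(rows, rows[1:] + [None]):
            valid_to = following[1] if following else OPEN_END
            versions.append((h_key, current[2], current[1], valid_to))
    return versions


sat = [
    ("k1", "2017-03-01", "A 200"),
    ("k1", "2017-01-01", "A 180"),
    ("k2", "2017-02-01", "C 220"),
]
for version in sat_to_dimension(sat):
    print(version)
```

Because VALIDTO is derived at read time, the Sat itself never needs an update when a newer version arrives.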
• Integration of new sources takes a long time
• Mappings are too complex covering data cleansing from 30+ national
headquarters
• New requirements to deliver some data faster
DATA FLOW – STARTING POSITION
[Figure: data flow in the starting position. Elements shown: Source 1, Source 2, Source 3; one "Integration & Business rules" step; Dimension 1, Dimension 2, Fact; Interface. Labels: daily; Real-time; New SLA to deliver data within 4h.]
DATA FLOW – 1ST STEP
[Figure: data flow before and after the first step. Before: the starting position (Source 1, Source 2, Source 3; "Integration & Business rules"; Dimension 1, Dimension 2, Fact; Interface). After: the same sources, dimensions, fact and interface, now with a Core Layer and a step labeled only "Business rules". Labels in both diagrams: daily; Real-time; New SLA to deliver data within 4h.]
DATA FLOW – 2ND STEP
[Figure: data flow before and after the second step. Before: the result of the first step (Source 1, Source 2, Source 3; Core Layer; "Business rules"; Dimension 1, Dimension 2, Fact; Interface). After: the same elements plus Dimension 3 and Fact2, labeled "More frequently refreshed". Labels in both diagrams: daily; Real-time; New SLA to deliver data within 4h.]
• (Still) a mix of “old” Kimball-style DWH and Data Vault
• New data sources have to use Data Vault + DW Automation
• Mappings that are too complex and have many choices are candidates for refactoring
• Slow fact tables that contain a lot of old data
• Requirements like an SLA with delivery of data within 4h
• Small steps, can be done in 3-4 week sprints
• Budgeting is easier because refactoring can be combined with new requirements
• Or just refactoring releases
• Business Keys were identified / determined by “alternate” keys in dimension
tables
SUMMARY
Productivity
• Smaller and maintainable mappings
• Standardization/Repeatability of development process
• DW Automation for new sources
• Workload distribution (in the old approach, 1-2 developers had all the know-how and were continuously overloaded)
Performance
• Improve load performance
• Query performance on Data Vault is a challenge
Flexibility
• Integration of new sources
• Changing requirements: mappings for Soft Rules got easier and more maintainable
STARTING BASIS – EXPECTED CHANGES COMING FROM DATA VAULT
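Standardization and DW Automation are possible because every Hub and every Sat follows the same pattern, so table definitions can be generated from a small amount of metadata. The sketch below generates Oracle-style DDL text for one Hub and one Sat. The function names, the `RAW(20)` key type, and the attribute data types are assumptions; the slides do not name the automation tool or its templates.

```python
def hub_ddl(name: str, business_key: str, bk_type: str) -> str:
    """Generate the CREATE TABLE statement for a Hub from its metadata."""
    return (
        f"CREATE TABLE H_{name} (\n"
        f"    H_{name}_KEY RAW(20) NOT NULL,\n"
        f"    {business_key} {bk_type} NOT NULL,\n"
        f"    CONSTRAINT PK_H_{name} PRIMARY KEY (H_{name}_KEY)\n"
        f");"
    )


def sat_ddl(hub: str, suffix: str, columns: dict) -> str:
    """Generate the CREATE TABLE statement for a Sat attached to the given Hub."""
    sat = f"S_{hub}_{suffix}"
    attrs = "".join(f"    {col} {typ},\n" for col, typ in columns.items())
    return (
        f"CREATE TABLE {sat} (\n"
        f"    H_{hub}_KEY RAW(20) NOT NULL,\n"
        f"    LOAD_DATE DATE NOT NULL,\n"
        f"{attrs}"
        f"    CONSTRAINT PK_{sat} PRIMARY KEY (H_{hub}_KEY, LOAD_DATE)\n"
        f");"
    )


# Hypothetical metadata for one new source delivering vehicle attributes.
print(hub_ddl("VEHICLE", "VIN", "VARCHAR2(17)"))
print(sat_ddl("VEHICLE", "SOURCE1", {"MODEL": "VARCHAR2(50)", "TYPE": "VARCHAR2(50)"}))
```

Adding another source then means adding a metadata entry rather than hand-writing another table and mapping.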
“Data modeling is the process of learning about the data, and regardless of technology,
this process must be performed for a successful application.”
• Learn about the data and promote collective data understanding
• Derive security classification and measures
• Design for performance
• Accelerate development
• Improve Software quality
• Reduce maintenance costs
• Generate code
• NoSQL Schema-on-read: understand model versions after years
WHY DATA MODELING?
Source quote: Steve Hoberman: Data Modeling for MongoDB, Technics Publications 2014
• Data Vault is just one part of the DWH modernization
• Data modeling, but also
• Data architecture [separate integration and business rules/cleansing]
• DWH was migrated to external data center
• Internal standards and processes are good for OLTP but not suitable for a DWH, e.g. nologging
operations are forbidden
• IMDB (In-Memory DB option) planned to optimize data access
• Hadoop may become part of the solution
• Data archival or offload processing
• Data Vault modeling applicable: Satellite data can be stored in Hadoop (e.g. new data
sources, JSON files)
NEXT STEPS TO MAKE DWH READY FOR FUTURE REQUIREMENTS
Daimler TSS GmbH
Wilhelm-Runge-Strasse 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99
[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS
Domicile and Court of Registry: Ulm / HRB-No.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle
2-TIER DWH VS 3-TIER DWH WITH CORE LAYER
[Figure: 2-tier DWH compared with 3-tier DWH; the additional tier is labeled "Core Warehouse Layer".]