
A company of Daimler AG

DWH REFACTORING WITH DATA VAULT

ANDREAS BUCKENHOFER

DOAG K&A 2017, NUREMBERG

ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer

Senior DB Professional


Since 2009 at Daimler TSS

Department: Big Data

Business Unit: Analytics

As a 100% Daimler subsidiary, we give 100 percent, always and never less.

We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making process, ability to react quickly)


Daimler TSS

LOCATIONS


Daimler TSS China

Hub Beijing

10 employees

Daimler TSS Malaysia

Hub Kuala Lumpur

42 employees

Daimler TSS India

Hub Bangalore

22 employees

Daimler TSS Germany

7 locations

1000 employees*

Ulm (Headquarters)

Stuttgart

Berlin

Karlsruhe

* as of August 2017


AGENDA

1. Motivation

2. Overview Data Vault

3. Refactoring

4. Summary

High expected data growth

Inflexibility of grown systems

Increasing requirements due to legal changes or financial formalities

Make the DWH ready for the future

Modern architecture: replace the Kimball architecture and overcome its limitations

REQUIREMENTS AND CHALLENGES


• Kimball-style DWH with 2 tiers

• Staging layer

• Core/mart layer

• Up to 10 years of data

• 7 TB of data (compressed)

• All data in Mart layer

• Oracle 11gR2 on SLES and Informatica PowerCenter 9.6.1

• Diverging requirements from 30+ national headquarters lead to long and complex mappings

• Insufficient performance and slow development process with changing requirements

• Low and differing data quality

STARTING BASIS – EXISTING DWH


ARCHITECTURE STYLE – KIMBALL


Data Vault is just one part of the solution to cope with new challenges

Experience with Data Vault @Daimler since 2004/2005

Stakeholders were already convinced of Data Vault, and its feasibility had been established

STARTING BASIS – DATA VAULT



DATA VAULT - ARCHITECTURE, METHODOLOGY, MODEL


Architecture

• Multi-Tier

• Scalable

• Supports NoSQL

Methodology

• Repeatable

• Measurable

• Agile

Model

• Flexible

• Hash based

• Hub & Spoke

Implementation: Automation, Pattern based, High speed

HUB: unique identification by natural keys (Business Keys)

STRUCTURE HUB TABLES


HUB TABLES: TYPICAL CHARACTERISTICS


Business Keys should be natural keys used by the business (e.g. Vehicle Identifier, Serial number)

Business Keys should stand alone and have meaning to the business

Business Keys should never change, have the same semantic meaning and the same granularity

Focusing on Business Keys (instead of on source system surrogate keys) ensures that the result serves the needs of the business
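A minimal sketch of how a "hash based" Hub could be loaded from business keys. The choice of SHA-1 (whose 20-byte digest would fit a BINARY(20) key column), the trim/uppercase normalization, and the names `hub_hash_key`, `load_hub` and `record_source` are assumptions for illustration, not the project's actual implementation.

```python
import hashlib


def hub_hash_key(business_key: str) -> bytes:
    """Return a 20-byte SHA-1 digest of the normalized business key."""
    normalized = business_key.strip().upper()  # assumed normalization rule
    return hashlib.sha1(normalized.encode("utf-8")).digest()


def load_hub(hub: dict, business_keys, record_source: str) -> int:
    """Insert business keys that are not yet in the hub; return the number of new rows."""
    inserted = 0
    for bk in business_keys:
        key = hub_hash_key(bk)
        if key not in hub:  # a Hub holds each business key exactly once
            hub[key] = {"vin": bk.strip().upper(), "record_source": record_source}
            inserted += 1
    return inserted


h_vehicle = {}
staged_vins = ["wdb1234567890abcd", "WDB1234567890ABCD ", "WDB0000000000XYZ1"]
new_rows = load_hub(h_vehicle, staged_vins, "SOURCE1")
print(new_rows, "new hub rows")
```

Because the key is derived only from the business key, the same vehicle delivered by different sources or in different spellings maps to the same Hub row.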


LINK: unique relationships between Business Keys (HUBs)

STRUCTURE LINK TABLES


LINK TABLES: TYPICAL CHARACTERISTICS


A LINK models a relationship between 2 or more HUBs

The relationship is always n:m

The composed key must be unique. One of the foreign keys is the driving key

Link-to-Link relationships are allowed but should be avoided in a physical implementation because of the load dependency
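The uniqueness of the composed key can be illustrated with a small sketch. The delimiter, the SHA-1 hash and the names `link_hash_key`, `load_link` and `l_customer_vehicle` are assumptions; for brevity the Link rows store the business keys themselves, where a real Link table would store the Hub keys.

```python
import hashlib

DELIMITER = "||"  # assumed separator so that ("AB", "C") and ("A", "BC") hash differently


def link_hash_key(*business_keys: str) -> bytes:
    """Hash the ordered combination of the participating business keys."""
    normalized = [bk.strip().upper() for bk in business_keys]
    return hashlib.sha1(DELIMITER.join(normalized).encode("utf-8")).digest()


def load_link(link: dict, relationships) -> int:
    """Insert each distinct key combination once; return the number of new rows."""
    inserted = 0
    for customer_no, vin in relationships:
        key = link_hash_key(customer_no, vin)
        if key not in link:
            link[key] = (customer_no, vin)
            inserted += 1
    return inserted


l_customer_vehicle = {}
staged = [("C-1", "VIN-A"), ("C-1", "VIN-B"), ("C-2", "VIN-A"), ("C-1", "VIN-A")]
inserted = load_link(l_customer_vehicle, staged)
print(inserted, "new link rows")
```

The sample data is deliberately n:m: one customer relates to two vehicles and one vehicle to two customers, while the repeated combination is stored only once.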

• Relationships / Associations

• Foreign Keys in OLTP systems

• Hierarchies and Redefinitions

• Hierarchical relationships are modeled by one link and two connections to HUBs: HAL (parent-child LINK) and SAL (same-as LINK)

• Transactions and events are often modeled as a Link (could also be a Hub): matter of dispute

• E.g. sales order or sensor data

• Intensive discussions about modeling as Hub or Link at conferences and on social media (the modeling solution depends on requirements, context, etc.)

CANDIDATES FOR LINKS


SAT: descriptive, detailed, current and historized data

STRUCTURE SAT TABLES


SAT TABLES: TYPICAL CHARACTERISTICS


Contains all non-key attributes

Is connected to exactly one Hub or Link

HUB or LINK tables can (should) have several SAT tables, e.g. by source system

In the extreme case, a SAT can contain just one column (or any number of columns)
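How a Sat keeps current and historized data side by side can be sketched with an insert-only load that adds a row only when the descriptive attributes changed. This delta-detection pattern and the names `load_satellite` and `s_vehicle_source1` are assumptions for illustration; the slides do not prescribe a specific load routine.

```python
from datetime import date


def load_satellite(sat: list, hub_key: str, load_date: date, attributes: dict) -> bool:
    """Append a new satellite row only if the attributes differ from the latest row."""
    rows = [r for r in sat if r["hub_key"] == hub_key]
    latest = max(rows, key=lambda r: r["load_date"], default=None)
    if latest is not None and latest["attributes"] == attributes:
        return False  # unchanged: keep history as it is
    sat.append({"hub_key": hub_key, "load_date": load_date, "attributes": dict(attributes)})
    return True


s_vehicle_source1 = []
results = [
    load_satellite(s_vehicle_source1, "V1", date(2017, 1, 1), {"model": "A", "type": "Sedan"}),
    load_satellite(s_vehicle_source1, "V1", date(2017, 1, 2), {"model": "A", "type": "Sedan"}),
    load_satellite(s_vehicle_source1, "V1", date(2017, 1, 3), {"model": "A", "type": "Estate"}),
]
print(results)
```

Existing rows are never updated; the row with the highest load date is the current one, and all earlier rows form the history.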

ARCHITECTURE STYLE – DATA VAULT


DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)


Michael Olschimke, Dan Linstedt: Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann, 2015, Chapter 2.2

• Raw Vault

• Sat s_code containing raw data

• Business Vault

• Sat bs_interior containing standardized interior data after applying business rules on column s_code.code

• Standard_code is computed/derived

• Interior is computed/derived

• Complex business rules for different car models

• Business rules change regularly

RAW DATA VAULT AND BUSINESS DATA VAULT: SAMPLE PATTERN FOR SATELLITE DATA


H_VEHICLE
  H_VEHICLE_KEY   BINARY(20)     <pk>
  VIN             VARCHAR2(17)

S_CODE
  H_VEHICLE_KEY   BINARY(20)     <pk,fk>
  LOAD_DATE       DATE           <pk>
  CODE            VARCHAR2(5)

BS_INTERIOR
  H_VEHICLE_KEY   BINARY(20)     <pk,fk>
  LOAD_DATE       DATE           <pk>
  INTERIOR        VARCHAR2(50)
  STANDARD_CODE   VARCHAR2(5)
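The s_code / bs_interior pattern can be sketched as a derivation step that reads Raw Vault rows and writes Business Vault rows without touching the raw data. The rule table `INTERIOR_RULES`, its codes and descriptions, and the function `derive_bs_interior` are hypothetical; the real business rules are described only as complex, model-specific and regularly changing.

```python
from datetime import date

# Hypothetical soft rules: raw equipment code -> (standard_code, interior).
INTERIOR_RULES = {
    "L01": ("STD01", "Leather, black"),
    "LE1": ("STD01", "Leather, black"),
    "F02": ("STD02", "Fabric, grey"),
}


def derive_bs_interior(s_code_rows):
    """Build Business Vault rows from Raw Vault rows; the raw rows stay unchanged."""
    bs_interior = []
    for row in s_code_rows:
        rule = INTERIOR_RULES.get(row["code"].strip().upper())
        if rule is None:
            continue  # no interior rule applies; the raw row still exists in s_code
        standard_code, interior = rule
        bs_interior.append({
            "h_vehicle_key": row["h_vehicle_key"],
            "load_date": row["load_date"],
            "interior": interior,
            "standard_code": standard_code,
        })
    return bs_interior


s_code = [
    {"h_vehicle_key": "K1", "load_date": date(2017, 1, 1), "code": "L01"},
    {"h_vehicle_key": "K2", "load_date": date(2017, 1, 1), "code": "le1"},
    {"h_vehicle_key": "K3", "load_date": date(2017, 1, 1), "code": "X99"},
]
bs_interior = derive_bs_interior(s_code)
print(len(bs_interior), "derived rows from", len(s_code), "raw rows")
```

When a rule changes, only `bs_interior` has to be recomputed; `s_code` keeps the delivered values and remains auditable.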

• Separation of integration and business rules / transformations

• Core Warehouse Layer is modeled with Data Vault and integrates data by BK (business key) “only”

• Business rules (Soft Rules) are applied from Raw Data Vault Layer to Mart Layer and not earlier

• Alternatively from Raw Data Vault to additional layer called Business Data Vault

• Hard Rules don’t change data

• Data is fully auditable

• Real-time capable architecture

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)


• In classical DWHs, the Core Warehouse Layer is regarded as the “single version of the truth”

• Integrates + cleanses data from different sources and eliminates contradiction

• Produces consistent results/reports across Data Marts

• But: cleansing is (still) subjective, enterprises change regularly, and the paradigm does not scale as more and more systems exist

• Data in Raw Data Vault Layer is regarded as “Single version of the facts”

• 100% of data is loaded 100% of time

• Data is not cleansed and bad data is not removed in the Core Layer (Raw Vault)

DATA VAULT 2.0 ARCHITECTURE (DAN LINSTEDT)


ARCHITECTURE STYLES – KIMBALL AND DATA VAULT


• Experience from a DWH with mobility data that moved to Data Vault while migrating to new hardware

• Similar requirements (huge data growth, JSON data sources)

• Similar starting position: Kimball-style DWH on SQL Server (+ migration to new HW)

• Agreed approach:

• Small steps

• Move fact tables one after another to the new hardware, including new Data Vault tables and staging tables

• Theory vs practice

• No chance to avoid a first fact table with a huge number of columns

• → Virtual Big Bang

BIG BANG OR SMALL STEPS?


Lessons learned from a project moving to Data Vault while migrating to new hardware:

• Laborious to maintain two systems in parallel

• The main challenge is the constant stream of new requirements and bug fixes

• The change to Data Vault was, and still is, considered the right step towards the future

→ Small steps that fit into a 3-4 week sprint

→ No standstill: can be combined with new requirements

BIG BANG OR SMALL STEPS?


Key questions

• What about existing data in dimensions?

• What about existing data in facts?

• What about new data?

General approach

• Create Hub / Link / Sat tables in Data Vault layer

• Copy data from tables in Mart layer into tables in Data Vault layer

• Integrate new data into Data Vault layer first

KEY QUESTIONS AND GENERAL APPROACH


F_CONTRACT
  D_FUNDING_UNIT_ID    <fk1>
  D_VEHICLE_ID         <fk2>
  D_DATE_ID_ORDER      <fk3>
  D_STATUS_ID          <fk4>
  D_DATE_ID_DELIVERY   <fk5>
  D_CUSTOMER_ID        <fk6>
  AMOUNT_EXPECTED
  AMOUNT_ACTUAL
  CONTRACTNO

D_STATUS
  D_STATUS_ID          <pk>

D_CUSTOMER
  D_CUSTOMER_ID        <pk>
  CUSTOMERNO
  NAME
  VALIDFROM
  VALIDTO

F_ACCOUNTPLAN
  D_VEHICLE_ID         <fk1>
  D_CUSTOMER_ID        <fk2>
  D_DATE_ID            <fk3>
  AMOUNT
  RISK

D_VEHICLE
  D_VEHICLE_ID         <pk>
  VIN
  MODEL
  TYPE
  VALIDFROM
  VALIDTO

D_DATE
  D_DATE_ID            <pk>

D_FUNDING_UNIT
  D_FUNDING_UNIT_ID    <pk>
  FUNDING_UNIT
  VALIDFROM
  VALIDTO

DIMENSION TABLES


H_VEHICLE
  H_VEHICLE_KEY        <pk>
  FIN

S_VEHICLE_DIM
  H_VEHICLE_KEY        <pk,fk>
  LOAD_DATE            <pk>
  MODEL
  TYPE
  VALIDFROM
  VALIDTO

H_CUSTOMER
  H_CUSTOMER_KEY       VARCHAR2(20)   <pk>

S_CUSTOMER_DIM
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  NAME
  VALIDFROM
  VALIDTO

S_CUSTOMER_SOURCE1
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  NAME

S_VEHICLE_SOURCE1
  H_VEHICLE_KEY        <pk,fk>
  LOAD_DATE            <pk>
  MODEL
  TYPE

S_CUSTOMER_SOURCE2
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  ADDITIONAL_ATTRIBUTES

• Move data back into Hub and Sat tables

• Create Hub and Sat tables

• 1 Hub

• 1 Sat for dimension (“cleansed data”; Business Vault)

• 1-n Sats for source tables (“original data”; Raw Vault)

• Copy data from dimension into Hub and Sat tables (one-time action)

• Sat table is “closed” afterwards: no further data changes

• Change Informatica mappings to store data in Hub and Sat first and then move data into dimensions

• Delete old data in dimension (> 2 or 5 years)
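The one-time copy of a dimension into a Hub and a "dimension" Sat can be sketched with SQLite standing in for the Oracle warehouse. The sample rows, the SQL function `sha1_hex`, the hex-encoded hash key and the use of VALIDFROM as LOAD_DATE are assumptions made for this example only.

```python
import hashlib
import sqlite3


def sha1_hex(value: str) -> str:
    """Hash key derived from the normalized business key (assumed convention)."""
    return hashlib.sha1(value.strip().upper().encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("sha1_hex", 1, sha1_hex)
conn.executescript("""
CREATE TABLE d_vehicle (d_vehicle_id INTEGER PRIMARY KEY, vin TEXT, model TEXT,
                        type TEXT, validfrom TEXT, validto TEXT);
CREATE TABLE h_vehicle (h_vehicle_key TEXT PRIMARY KEY, vin TEXT);
CREATE TABLE s_vehicle_dim (h_vehicle_key TEXT, load_date TEXT, model TEXT, type TEXT,
                            validfrom TEXT, validto TEXT,
                            PRIMARY KEY (h_vehicle_key, load_date));
INSERT INTO d_vehicle VALUES
    (1, 'VIN-A', 'A 180', 'Sedan',  '2015-01-01', '2016-06-30'),
    (2, 'VIN-A', 'A 200', 'Sedan',  '2016-07-01', '9999-12-31'),
    (3, 'VIN-B', 'C 220', 'Estate', '2016-01-01', '9999-12-31');
""")

# One Hub row per business key, regardless of how many dimension versions exist.
conn.execute("INSERT INTO h_vehicle SELECT DISTINCT sha1_hex(vin), vin FROM d_vehicle")

# One Sat row per dimension version; VALIDFROM doubles as LOAD_DATE (assumption).
conn.execute("""
INSERT INTO s_vehicle_dim
SELECT sha1_hex(vin), validfrom, model, type, validfrom, validto FROM d_vehicle
""")

hub_count = conn.execute("SELECT COUNT(*) FROM h_vehicle").fetchone()[0]
sat_count = conn.execute("SELECT COUNT(*) FROM s_vehicle_dim").fetchone()[0]
print(hub_count, "hub rows,", sat_count, "satellite rows")
```

After this copy the Sat would receive no further changes; new deliveries go into the per-source Sats instead.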

DIMENSION TABLES


FACT TABLES


F_CONTRACT
  D_FUNDING_UNIT_ID    <fk1>
  D_VEHICLE_ID         <fk2>
  D_DATE_ID_ORDER      <fk3>
  D_STATUS_ID          <fk4>
  D_DATE_ID_DELIVERY   <fk5>
  D_CUSTOMER_ID        <fk6>
  AMOUNT_EXPECTED
  AMOUNT_ACTUAL
  CONTRACTNO

D_STATUS
  D_STATUS_ID          <pk>

D_CUSTOMER
  D_CUSTOMER_ID        <pk>
  CUSTOMERNO
  NAME
  VALIDFROM
  VALIDTO

F_ACCOUNTPLAN
  D_VEHICLE_ID         <fk1>
  D_CUSTOMER_ID        <fk2>
  D_DATE_ID            <fk3>
  AMOUNT
  RISK

D_VEHICLE
  D_VEHICLE_ID         <pk>
  VIN
  MODEL
  TYPE
  VALIDFROM
  VALIDTO

D_DATE
  D_DATE_ID            <pk>

D_FUNDING_UNIT
  D_FUNDING_UNIT_ID    <pk>
  FUNDING_UNIT
  VALIDFROM
  VALIDTO

H_VEHICLE
  H_VEHICLE_KEY        <pk>
  FIN

S_VEHICLE_DIM
  H_VEHICLE_KEY        <pk,fk>
  LOAD_DATE            <pk>
  MODEL
  TYPE
  VALIDFROM
  VALIDTO

H_CUSTOMER
  H_CUSTOMER_KEY       VARCHAR2(20)   <pk>

L_CONTRACT
  L_CONTRACT_KEY       <pk>
  H_CUSTOMER_KEY       <fk2>
  H_VEHICLE_KEY        <fk3>
  H_CONTRACT_KEY       <fk1>

S_CUSTOMER_DIM
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  NAME
  VALIDFROM
  VALIDTO

S_CUSTOMER_SOURCE1
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  NAME

S_CONTRACT_FACT
  H_CONTRACT_KEY       <pk,fk>
  LOAD_DATE            <pk>
  AMOUNT_ACTUAL
  CONTRACTDATE

S_VEHICLE_SOURCE1
  H_VEHICLE_KEY        <pk,fk>
  LOAD_DATE            <pk>
  MODEL
  TYPE

S_CUSTOMER_SOURCE2
  H_CUSTOMER_KEY       <pk,fk>
  LOAD_DATE            <pk>
  ADDITIONAL_ATTRIBUTES

S_CONTRACT_SOURCE1
  H_CONTRACT_KEY       <pk,fk>
  LOAD_DATE            <pk>
  PRICE

H_CONTRACT
  H_CONTRACT_KEY       <pk>
  CONTRACTNO

• Move data back into Hub and Sat tables

• Create Link and Sat tables

• 1 Link

• 1 Hub (optional if degenerated dimension, see next slide)

• 1 Sat for measurements from fact (“cleansed data”; Business Vault)

• 1-n Sats for source tables (“original data”; Raw Vault)

• Copy data from fact into Link and Sat tables (one-time action)

• Sat table is “closed” afterwards: no further data changes

• Change Informatica mappings to store data in Hub, Link and Sat first and then move data into facts

• Delete old data in facts (> 2 or 5 years)
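The one-time copy of a fact table can be sketched in the same way, again with SQLite as a stand-in. The fact rows carry dimension surrogate IDs, so the copy joins back to the dimensions to recover the business keys. The SQL function `hash_key`, the `||` delimiter, the sample data and `MIGRATION_DATE` are hypothetical, and the tables are reduced to the columns needed here.

```python
import hashlib
import sqlite3


def hash_key(*parts):
    """SHA-1 over the delimited, normalized business key parts (assumed convention)."""
    joined = "||".join(str(p).strip().upper() for p in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.create_function("hash_key", -1, hash_key)  # -1: any number of arguments
conn.executescript("""
CREATE TABLE d_customer (d_customer_id INTEGER PRIMARY KEY, customerno TEXT);
CREATE TABLE d_vehicle  (d_vehicle_id  INTEGER PRIMARY KEY, vin TEXT);
CREATE TABLE f_contract (d_customer_id INTEGER, d_vehicle_id INTEGER,
                         contractno TEXT, amount_actual REAL);
CREATE TABLE h_contract (h_contract_key TEXT PRIMARY KEY, contractno TEXT);
CREATE TABLE l_contract (l_contract_key TEXT PRIMARY KEY, h_customer_key TEXT,
                         h_vehicle_key TEXT, h_contract_key TEXT);
CREATE TABLE s_contract_fact (h_contract_key TEXT, load_date TEXT, amount_actual REAL,
                              PRIMARY KEY (h_contract_key, load_date));
INSERT INTO d_customer VALUES (1, 'C-1'), (2, 'C-2');
INSERT INTO d_vehicle  VALUES (10, 'VIN-A'), (11, 'VIN-B');
INSERT INTO f_contract VALUES (1, 10, 'CT-100', 250.0), (1, 11, 'CT-101', 300.0),
                              (2, 10, 'CT-102', 275.0);
""")

MIGRATION_DATE = "2017-01-01"  # hypothetical load date of the one-time copy

conn.execute(
    "INSERT INTO h_contract SELECT DISTINCT hash_key(contractno), contractno FROM f_contract"
)
conn.execute("""
INSERT INTO l_contract
SELECT DISTINCT hash_key(f.contractno, c.customerno, v.vin),
       hash_key(c.customerno), hash_key(v.vin), hash_key(f.contractno)
FROM f_contract f
JOIN d_customer c ON c.d_customer_id = f.d_customer_id
JOIN d_vehicle  v ON v.d_vehicle_id  = f.d_vehicle_id
""")
conn.execute(
    "INSERT INTO s_contract_fact SELECT hash_key(contractno), ?, amount_actual FROM f_contract",
    (MIGRATION_DATE,),
)

counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("h_contract", "l_contract", "s_contract_fact")
}
print(counts)
```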

FACT TABLES


• Transactions and events are often modeled as a Link (could also be a Hub): matter of dispute

• In this case f_contract was modeled as Hub + Sats with a connection to a Link

• Alternative is to model contract fact as Link + Sat only

• First option is more flexible and easier to automate

• Second option may be the only way if there is no “business key” contractno or orderno

LINKS – MATTER OF DISPUTE


• Dimension and fact tables still have the same structure

• Additional Core Warehouse Layer with Hub, Link, and Sat

• Not all dimensions and facts are refactored

• Many reference tables are still just dimensions (e.g. d_status and similar)

• Informatica mappings now store data in the Core Warehouse Layer first, then move data into the Mart

• What is the benefit?

• Slower

• Informatica mappings got even more complex

• Prerequisite for next step
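Once data lands in the Core Warehouse Layer first, a dimension can be derived from a Hub and its Sat. The sketch below selects the latest Sat row per Hub key, using SQLite as a stand-in; the sample rows and the query `CURRENT_VEHICLE_SQL` are assumptions, and the actual Informatica mappings are not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE h_vehicle (h_vehicle_key TEXT PRIMARY KEY, vin TEXT);
CREATE TABLE s_vehicle_source1 (
    h_vehicle_key TEXT, load_date TEXT, model TEXT, type TEXT,
    PRIMARY KEY (h_vehicle_key, load_date));
INSERT INTO h_vehicle VALUES ('K1', 'VIN-A'), ('K2', 'VIN-B');
INSERT INTO s_vehicle_source1 VALUES
    ('K1', '2017-01-01', 'A 180', 'Sedan'),
    ('K1', '2017-03-01', 'A 200', 'Sedan'),
    ('K2', '2017-02-01', 'C 220', 'Estate');
""")

# Current view of the vehicle dimension: latest satellite row per hub key.
CURRENT_VEHICLE_SQL = """
SELECT h.vin, s.model, s.type
FROM h_vehicle h
JOIN s_vehicle_source1 s ON s.h_vehicle_key = h.h_vehicle_key
WHERE s.load_date = (SELECT MAX(s2.load_date)
                     FROM s_vehicle_source1 s2
                     WHERE s2.h_vehicle_key = s.h_vehicle_key)
ORDER BY h.vin
"""
dimension_rows = conn.execute(CURRENT_VEHICLE_SQL).fetchall()
for row in dimension_rows:
    print(row)
```

The extra join and subquery hint at why the intermediate result is slower than loading the Mart directly.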

INTERMEDIATE RESULT


• Integration of new sources takes a long time

• Mappings are too complex, covering data cleansing for 30+ national headquarters

• New requirements to deliver some data faster

DATA FLOW – STARTING POSITION


[Figure: data flow. Source 1, Source 2 and Source 3 feed “Integration & Business rules”, which loads Dimension 1, Dimension 2 and the Fact table; an Interface is served daily. Callout: Real-time, new SLA to deliver data within 4h.]

DATA FLOW – 1ST STEP


[Figure, before: Source 1, Source 2 and Source 3 feed “Integration & Business rules”, which loads Dimension 1, Dimension 2 and the Fact table; an Interface is served daily. Callout: Real-time, new SLA to deliver data within 4h.]

[Figure, after: the same data flow with a Core Layer added; the processing step is now labeled “Business rules” only.]

DATA FLOW – 2ND STEP


[Figure, before: data flow from Source 1, Source 2 and Source 3 with a Core Layer and a “Business rules” step loading Dimension 1, Dimension 2 and the Fact table; an Interface is served daily. Callout: Real-time, new SLA to deliver data within 4h.]

[Figure, after: the same data flow with an additional Dimension 3 and Fact2 that are refreshed more frequently.]

ARCHITECTURE STYLE


• (Still) a mix of the “old” Kimball-style DWH and Data Vault

• New data sources have to use Data Vault + DW Automation

• Mappings that are too complex and have many choices are candidates for refactoring

• Slow fact tables that contain a lot of old data

• Requirements like SLA with delivery of data within 4h

• Small steps, can be done in 3-4 week sprints

• Budgeting easier as combination with new requirements possible

• Or just refactoring releases

• Business Keys were identified / determined by “alternate” keys in dimension tables

SUMMARY


Productivity

• Smaller and maintainable mappings

• Standardization/repeatability of the development process

• DW Automation for new sources

• Workload distribution (in the old approach, 1-2 developers had all the know-how and were continuously overloaded)

Performance

• Improve load performance

• Query performance on Data Vault is a challenge

Flexibility

• Integration of new sources

• Changing requirements: mappings for Soft Rules become easier and more maintainable

STARTING BASIS – EXPECTED CHANGES COMING FROM DATA VAULT


“Data modeling is the process of learning about the data, and regardless of technology, this process must be performed for a successful application.”

• Learn about the data and promote collective data understanding

• Derive security classification and measures

• Design for performance

• Accelerate development

• Improve Software quality

• Reduce maintenance costs

• Generate code

• NoSQL Schema-on-read: understand model versions after years

WHY DATA MODELING?


Source of quote: Steve Hoberman: Data Modeling for MongoDB, Technics Publications, 2014

• Data Vault is just one part of the DWH modernization

• Data modeling, but also

• Data architecture [separate integration and business rules/cleansing]

• DWH was migrated to external data center

• Internal standards and processes are good for OLTP but not suitable for DWH, e.g. nologging operations are forbidden

• IMDB (In-Memory DB option) planned to optimize data access

• Hadoop may become part of the solution

• Data archival or offload processing

• Data Vault modeling applicable: Satellite data can be stored in Hadoop (e.g. new data sources, JSON files)

NEXT STEPS TO MAKE DWH READY FOR FUTURE REQUIREMENTS


Daimler TSS GmbH, Wilhelm-Runge-Strasse 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99

Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS

Domicile and Court of Registry: Ulm / HRB-No.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle


2-TIER DWH VS 3-TIER DWH WITH CORE LAYER


[Figure label: Core Warehouse Layer]

DATA VAULT 2.0 ARCHITECTURE – TODAY’S WORLD (DAN LINSTEDT)
