presentation pdi data_vault_framework_meetup2012


Upload: pentaho-community

Post on 13-Jul-2015

Page 1: Presentation pdi data_vault_framework_meetup2012

Introduction

[email protected]

Page 2: Presentation pdi data_vault_framework_meetup2012

Data Vault Definition

Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.

Page 3: Presentation pdi data_vault_framework_meetup2012

Data Vault Building Blocks

Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012

different sources/rate of change

Page 4: Presentation pdi data_vault_framework_meetup2012

Data Vault Fundamentals: Hub

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Page 5: Presentation pdi data_vault_framework_meetup2012

Data Vault Fundamentals: Link

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Page 6: Presentation pdi data_vault_framework_meetup2012

Data Vault Fundamentals: Satellite

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Page 7: Presentation pdi data_vault_framework_meetup2012

Data Vault Fundamentals: Model

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Page 8: Presentation pdi data_vault_framework_meetup2012

Data Vault ETL

Many objects to load, standardized procedures

This screams for a generic solution!

I don't want to:

throw ETL tool away and code it all myself

manage too many ETL objects

connect similar columns in mappings by hand

I do want to:

generate ETL (Kettle) objects? No.

Take it one step further: there is only one parameterised hub load object, so there is no need to know the XML structure of PDI objects.
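A minimal sketch of what "one parameterised hub load object" can look like from the outside: a single job file, started once per hub with different named parameters. The job file name `load_hub.kjb` and the parameter names `HUB_NAME` and `SOURCE_TABLE` are hypothetical and not taken from the framework; `-file` and `-param:NAME=VALUE` are Kitchen command-line options.

```python
from typing import Dict, List


def kitchen_command(job_file: str, params: Dict[str, str]) -> List[str]:
    """Build an argument list that runs one generic job with named parameters."""
    cmd = ["kitchen.sh", f"-file={job_file}"]
    # Sort the parameters so the generated command is reproducible.
    for name, value in sorted(params.items()):
        cmd.append(f"-param:{name}={value}")
    return cmd


# Hypothetical metadata rows: one per hub, all served by the same job file.
hubs = [
    {"HUB_NAME": "hub_customer", "SOURCE_TABLE": "stg_customer"},
    {"HUB_NAME": "hub_product", "SOURCE_TABLE": "stg_product"},
]
commands = [kitchen_command("load_hub.kjb", hub) for hub in hubs]
for cmd in commands:
    print(" ".join(cmd))
```

Adding a hub then means adding a metadata row, not a new ETL object.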

Page 9: Presentation pdi data_vault_framework_meetup2012

Tools

Version Control

Database

Virtualization

Data Integration

Operating System

'Productivity'

SQL Development

Page 10: Presentation pdi data_vault_framework_meetup2012

Place of framework in architecture

[Architecture diagram. Column headings: Sources, ETL Process, Data Warehouse, EUL. Labels in the diagram: ERP, DBMS, CSV Files, ETL, Staging Area, MySQL, Files, ETL: Kettle Data Vault Framework, Central DWH & Data Marts, MySQL Data Vault, ETL.]

Page 11: Presentation pdi data_vault_framework_meetup2012

What has to be taken care of?

Data Vault designed and implemented in database

Staging tables and loading procedures in place (can also be generic; we use the PDI Metadata Injection step for loading files)

Mapping from source to Data Vault specified (now in an Excel sheet)


Page 12: Presentation pdi data_vault_framework_meetup2012

Framework components

PDI repository (file based), jobs and transformations

Configuration files: kettle.properties, shared.xml, repositories.xml

Excel sheet that contains the specifications

MySQL database for metadata

Virtual machine with Ubuntu 12.04 Server

Page 13: Presentation pdi data_vault_framework_meetup2012

Design decisions

Updateable views with generic column names

(MySQL more lenient than PostgreSQL)

Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)

'inject' the metadata using Kettle parameters

Generate and use an error table for each Data Vault table
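The string comparison for satellite attributes can be sketched as below. This is an illustration under assumptions, not the framework's code: the function names are made up, and rendering NULL as an empty string is one possible choice.

```python
from typing import Any, Mapping, Optional, Sequence


def concat_attributes(row: Mapping[str, Any], columns: Sequence[str]) -> str:
    """Join the attribute values in a fixed column order, with | as delimiter."""
    # Assumption: NULL (None) is rendered as an empty string.
    return "|".join("" if row[c] is None else str(row[c]) for c in columns)


def satellite_changed(
    staged: Mapping[str, Any],
    current: Optional[Mapping[str, Any]],
    columns: Sequence[str],
) -> bool:
    """True when a new satellite row is needed for this key."""
    if current is None:  # no satellite row yet for this key
        return True
    return concat_attributes(staged, columns) != concat_attributes(current, columns)


columns = ["name", "city", "age"]
current = {"name": "Ann", "city": None, "age": 30}
staged = {"name": "Ann", "city": "Utrecht", "age": 30}
print(concat_attributes(current, columns))          # Ann||30
print(satellite_changed(staged, current, columns))  # True
```

One comparison per row replaces a comparison per column, which is what keeps the load generic. A value that itself contains a pipe can make two different rows concatenate to the same string, so the delimiter has to be one that does not occur in the data.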

Page 14: Presentation pdi data_vault_framework_meetup2012

Metadata tables

All have history tables

Page 15: Presentation pdi data_vault_framework_meetup2012

Metadata in Excel

Data Vault

connections

source systems

source tables

Page 16: Presentation pdi data_vault_framework_meetup2012

Metadata in Excel (hub + sat)

x 200 (max)

Page 17: Presentation pdi data_vault_framework_meetup2012

Metadata in Excel (link)

link attributes

x 10

Page 18: Presentation pdi data_vault_framework_meetup2012

Metadata in Excel (link satellite)

x 10

x 5

x 200 (max)

Page 19: Presentation pdi data_vault_framework_meetup2012

Last seen date

applicable for hubs and links

existing hubs and links: update 'last_seen_dts'!
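A small sketch of the last-seen logic for a hub, using an in-memory SQLite database as a stand-in for the MySQL Data Vault. The table and column names other than `last_seen_dts` are hypothetical.

```python
import sqlite3
from typing import Iterable


def load_hub(conn: sqlite3.Connection, business_keys: Iterable[str], load_dts: str) -> None:
    """Insert unseen business keys; refresh last_seen_dts for existing ones."""
    for bk in business_keys:
        updated = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_bk = ?",
            (load_dts, bk),
        ).rowcount
        if updated == 0:  # key not in the hub yet
            conn.execute(
                "INSERT INTO hub_customer (customer_bk, load_dts, last_seen_dts) "
                "VALUES (?, ?, ?)",
                (bk, load_dts, load_dts),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_id INTEGER PRIMARY KEY, customer_bk TEXT UNIQUE, "
    "load_dts TEXT, last_seen_dts TEXT)"
)
load_hub(conn, ["C1", "C2"], "2012-01-01")
load_hub(conn, ["C2", "C3"], "2012-01-02")
rows = conn.execute(
    "SELECT customer_bk, load_dts, last_seen_dts FROM hub_customer ORDER BY customer_bk"
).fetchall()
for row in rows:
    print(row)
```

After the second load, C1 keeps its old last-seen date, which shows that the source no longer delivers it.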

Page 20: Presentation pdi data_vault_framework_meetup2012

Link validity satellite

Link has a 'business key': not all hub IDs
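One possible reading of this slide, sketched under assumptions: a subset of the hub IDs in the link acts as the link's 'business key', and the validity satellite records which link row is currently valid for that key. The record layout and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def record_observation(
    validity: List[Dict], link_id: int, link_key: Tuple, load_dts: str
) -> None:
    """Open a validity record for link_id and close any other open record
    that shares the same link 'business key'."""
    for rec in validity:
        if rec["link_key"] == link_key and rec["valid_to"] is None:
            if rec["link_id"] == link_id:
                return  # same relationship as before, nothing to do
            rec["valid_to"] = load_dts  # end-date the superseded link row
    validity.append(
        {"link_id": link_id, "link_key": link_key,
         "valid_from": load_dts, "valid_to": None}
    )


# Example: an employee-department link whose 'business key' is the employee only.
validity: List[Dict] = []
record_observation(validity, link_id=1, link_key=("E1",), load_dts="2012-01-01")
record_observation(validity, link_id=2, link_key=("E1",), load_dts="2012-02-01")
for rec in validity:
    print(rec)
```

Here the employee moving to another department creates link row 2 and closes the validity of link row 1.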

Page 21: Presentation pdi data_vault_framework_meetup2012

Loading the metadata

Page 22: Presentation pdi data_vault_framework_meetup2012

'design errors'

Checks to avoid debugging (compare design metadata with the Data Vault DB information_schema):

hubs, links, satellites that don't exist in the DV

key columns that do not exist in the DV

missing connection data (source db)

missing attribute columns
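The table and column checks above can be sketched as a comparison of two sets of names. The sketch assumes the catalog has already been read as (table name, column name) pairs, for example from `information_schema.columns`; the function and the example names are illustrative.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_design_errors(
    design: Dict[str, List[str]], catalog: Iterable[Tuple[str, str]]
) -> List[str]:
    """Report designed tables and columns that are missing from the database."""
    existing: Dict[str, Set[str]] = {}
    for table, column in catalog:
        existing.setdefault(table, set()).add(column)

    errors = []
    for table, columns in sorted(design.items()):
        if table not in existing:
            errors.append(f"table {table} does not exist in the DV")
            continue
        for column in columns:
            if column not in existing[table]:
                errors.append(f"column {table}.{column} does not exist in the DV")
    return errors


design = {
    "hub_customer": ["customer_id", "customer_bk"],
    "sat_customer": ["name"],
    "lnk_order": ["order_id"],
}
catalog = [
    ("hub_customer", "customer_id"),
    ("hub_customer", "customer_bk"),
    ("sat_customer", "city"),
]
errors = find_design_errors(design, catalog)
for error in errors:
    print(error)
```

Running such a check before the load turns a failing transformation into a readable list of design mistakes.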

Page 23: Presentation pdi data_vault_framework_meetup2012

A complete run

Page 24: Presentation pdi data_vault_framework_meetup2012

Metadata needed for a hub

name

key column

business key column

source table

source table business key column (can be an expression, e.g. a concatenation for a composite key)
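To illustrate how these five items are enough to drive a generic hub load, here is a sketch that turns them into one INSERT statement. It is not the framework's Kettle transformation: SQLite stands in for MySQL, the `load_dts` column and all table names are assumptions, and the composite key uses SQLite's `||` operator where MySQL would typically use `CONCAT`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class HubSpec:
    name: str                 # hub table
    key_column: str           # surrogate key, generated by the database here
    business_key_column: str
    source_table: str
    source_business_key: str  # column or expression in the source table


def hub_load_sql(spec: HubSpec) -> str:
    """Insert business keys that are in the source but not yet in the hub."""
    return (
        f"INSERT INTO {spec.name} ({spec.business_key_column}, load_dts) "
        f"SELECT src_bk, ? FROM ("
        f"SELECT DISTINCT {spec.source_business_key} AS src_bk FROM {spec.source_table}"
        f") AS src WHERE NOT EXISTS ("
        f"SELECT 1 FROM {spec.name} h WHERE h.{spec.business_key_column} = src.src_bk)"
    )


spec = HubSpec("hub_customer", "customer_id", "customer_bk",
               "stg_customer", "country || '-' || customer_no")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_customer (country TEXT, customer_no TEXT)")
conn.execute(
    "CREATE TABLE hub_customer ("
    "customer_id INTEGER PRIMARY KEY, customer_bk TEXT, load_dts TEXT)"
)
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?)", [("NL", "1"), ("NL", "1"), ("BE", "7")]
)
conn.execute(hub_load_sql(spec), ("2012-01-01",))
conn.execute(hub_load_sql(spec), ("2012-01-02",))  # second run adds nothing
keys = [r[0] for r in conn.execute(
    "SELECT customer_bk FROM hub_customer ORDER BY customer_bk")]
print(keys)
```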

Page 25: Presentation pdi data_vault_framework_meetup2012

Job for hub

Page 26: Presentation pdi data_vault_framework_meetup2012

Transformation for hub

Page 27: Presentation pdi data_vault_framework_meetup2012

Metadata needed for a link

name

key column

for each hub (maximum 10, can be a ref-table)

hub name

column name for the hub key in the link (roles!)

column in the source table → business key of hub

link 'attributes' (part of key, no hub, maximum 5)

link validity satellite needed?

last seen date needed?

source table
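The link metadata can be pictured as a small record with the limits from this slide built in. The class and field names below are hypothetical; only the maxima (10 hubs, 5 link attributes) and the idea of roles come from the slide.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10
MAX_LINK_ATTRIBUTES = 5


@dataclass(frozen=True)
class HubRef:
    hub_name: str
    link_column: str    # column for the hub key in the link (the role)
    source_column: str  # source column holding the hub's business key


@dataclass
class LinkSpec:
    name: str
    key_column: str
    source_table: str
    hubs: List[HubRef]
    attributes: List[str] = field(default_factory=list)
    validity_satellite: bool = False
    last_seen_date: bool = False

    def problems(self) -> List[str]:
        """Return design problems; an empty list means the spec is usable."""
        found = []
        if not self.hubs:
            found.append("link references no hubs")
        if len(self.hubs) > MAX_HUBS:
            found.append(f"more than {MAX_HUBS} hubs")
        if len(self.attributes) > MAX_LINK_ATTRIBUTES:
            found.append(f"more than {MAX_LINK_ATTRIBUTES} link attributes")
        columns = [ref.link_column for ref in self.hubs]
        if len(columns) != len(set(columns)):
            found.append("hub key columns in the link must be unique (one per role)")
        return found


# The same hub used twice, in two roles.
reports_to = LinkSpec(
    name="lnk_reports_to",
    key_column="reports_to_id",
    source_table="stg_employee",
    hubs=[
        HubRef("hub_employee", "employee_id", "emp_no"),
        HubRef("hub_employee", "manager_id", "manager_emp_no"),
    ],
)
print(reports_to.problems())
```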

Page 28: Presentation pdi data_vault_framework_meetup2012

Job for link

Page 29: Presentation pdi data_vault_framework_meetup2012

Transformation for link

Run table needed for validity satellite?

Lookup hubs

Remove columns not in link

Last seen?

Page 30: Presentation pdi data_vault_framework_meetup2012

Metadata needed for a hub satellite

name

key column

hub name

column in the source table → business key of hub

for each attribute (maximum 200)

source column → target column

source table
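The per-attribute mapping can be sketched as a dictionary from source column to target column, applied to each staged row. The names are illustrative; the limit of 200 attributes is the one from this slide.

```python
from typing import Any, Dict, Mapping

MAX_ATTRIBUTES = 200


def to_satellite_row(
    source_row: Mapping[str, Any], column_map: Mapping[str, str]
) -> Dict[str, Any]:
    """Rename the mapped source columns to their satellite column names."""
    if len(column_map) > MAX_ATTRIBUTES:
        raise ValueError(f"a satellite supports at most {MAX_ATTRIBUTES} attributes")
    # Source columns that are not in the mapping are left out of the satellite row.
    return {target: source_row[source] for source, target in column_map.items()}


column_map = {"cust_name": "name", "cust_city": "city"}
source_row = {"cust_no": "1", "cust_name": "Ann", "cust_city": "Utrecht"}
sat_row = to_satellite_row(source_row, column_map)
print(sat_row)  # {'name': 'Ann', 'city': 'Utrecht'}
```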

Page 31: Presentation pdi data_vault_framework_meetup2012

Job for hub satellite

Page 32: Presentation pdi data_vault_framework_meetup2012

Transformation for hub satellite

Page 33: Presentation pdi data_vault_framework_meetup2012

Metadata needed for a link satellite

name

key column

link name

for each hub of the link:

column in the source table → business key of hub

for each key attribute: source column

for each attribute: source column → target column

source table

Page 34: Presentation pdi data_vault_framework_meetup2012

Job for link satellite

Page 35: Presentation pdi data_vault_framework_meetup2012

Transformation for link satellite

Page 36: Presentation pdi data_vault_framework_meetup2012

Executing in a loop ..

Page 37: Presentation pdi data_vault_framework_meetup2012

.. and parallel
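Looping over the metadata and running loads in parallel can be sketched with a thread pool. This is only an outline: `load_object` is a stand-in for starting the parameterised job, the object names are made up, and the hubs-before-links ordering is an assumption based on the link transformation looking up hub keys.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def load_object(name: str) -> str:
    """Stand-in for starting one parameterised load; returns a status line."""
    return f"{name}: loaded"


def run_stage(names: List[str], workers: int = 4) -> List[str]:
    """Load all objects of one stage in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the order of the input names.
        return list(pool.map(load_object, names))


hubs = ["hub_customer", "hub_product"]
links = ["lnk_order"]
# Hubs first, then links, so that hub keys exist when the links look them up.
results = run_stage(hubs) + run_stage(links)
for line in results:
    print(line)
```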

Page 38: Presentation pdi data_vault_framework_meetup2012

Logging

Configuring log tables for concurrent access

PDI logging

Custom logging

Page 39: Presentation pdi data_vault_framework_meetup2012

Version Control: PDI objects

Page 40: Presentation pdi data_vault_framework_meetup2012

Version Control: database objects

Page 41: Presentation pdi data_vault_framework_meetup2012

Some points of interest

Easy to make mistakes in the design sheet

Generic → a bit harder to maintain and debug

Application/tool to maintain metadata?

Data Vault generators (e.g. Quipu)?

Spinoff using Informatica and Oracle: Sander Robijns

Thanks to: Jos van Dongen, Kasper de Graaf

Page 42: Presentation pdi data_vault_framework_meetup2012

Sourceforge!