Data Vault & Ladeperformance - DOAG Deutsche Oracle ...

© CGI Group Inc. CONFIDENTIAL Data Vault & Ladeperformance

Author: others

Post on 03-Mar-2020


  • © CGI Group Inc. CONFIDENTIAL

    Data Vault & Ladeperformance (Data Vault & Load Performance)

  • About me

    Markus Kollas

    BI consultant since 1998

    At CGI since 01/2008

    Executive Consultant

    BI (Framework) Trainer

  • Data Vault & Ladeperformance

    1. What is a Data Vault?

    2. Data Vault Modelling Basics

    3. Hubs, Satellites, Links and a construct

    4. Performance Tuning your Data Vault

    5. Loading data into your Data Vault

    6. Retrieving data from your Data Vault

  • 1.1 What is a Data Vault?

    Author: Dan Linstedt

    Real name: Common Foundation of Data Warehouse modelling

    “The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business”.

    “It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s Enterprise Data Warehouse”.

  • 1.3 Where does Data Vault fit in?

    [Figure: data flow diagram with the labels Transaction, Data Warehouse and Analytical]

  • 1.8 Data modelling techniques applied!

    [Figure: three modelling techniques side by side: Normalization modelling (3NF), Data Vault modelling (Hubs, Links, Sats) and Dimensional modelling, placed along the labels Transaction, Data Warehouse and Analytical. Example entities shown: Station, Journey, Ticket, Ticket Type, Client, Line, Zone, Route and Calendar.]

  • 2.1 What are the Data Vault primary components?

    The Data Vault consists of three primary components:

    • Hubs hold the core business keys.

    • Links form all associations between the Hubs.

    • Satellites provide all detail information for Hubs and Links.

    The Hubs and Links together form the skeletal structure of the model, while the Satellites add all the descriptive details.

  • 2.3 Separation of data types in a DV structure

    Rule: Each component contains either Business Keys (Hub), Associations (Link), or Details (Satellite).

    [Figure: Data Vault diagram of Hubs (H), Links (L) and Satellites (S) with the labels Product, Name, Address, Vendor, Customer, Delivery, Order, Orderline and Producttype]

  • 2.4 In comparison: A dimensional data model

    [Figure: dimensional model with the tables Fact_Sale, Dim_Region, Dim_Time, Dim_Product and Dim_Customer]

    Fact tables contain all three types of data; Dimension tables contain Business Keys and Details…

  • 2.5 In comparison: A normalized data model

    [Figure: normalized model with the entities Customer, Order, Order Line, Product, Vendor, Store and Region]

    Entities (tables) typically contain all three types of data…

  • 2.6 The advantage of data type separation

    • Each data component can be managed without impact on other components.

    • Changes in the data constraints (relationships) of the source data often do not impact the Data Vault.

    • All components are decoupled, which makes the Data Vault model easily extendable.

    • The load procedures of the components are uniform.

  • 3.4 Hub characteristics

    • Primary Key: the PK is a unique hash key.

    • Business Key: a Hub’s business key is the actual Hub value to store, and it has a unique index.

    • Load DTS: a Hub’s Load Date Time Stamp represents the first time the EDW saw the data.

    • Record Source: a Hub’s Record Source represents the source (system) of the business key value.

    Hub columns: Business Key Hash (PK), Business_Key, Load DTS, Record_Source
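
    The Hub columns listed above can be illustrated with a minimal Python sketch. The slides do not name a hash function or a normalization rule, so MD5 and the trim/upper-case step are assumptions, as are the function names `hub_hash_key` and `hub_row`.

```python
import hashlib
from datetime import datetime, timezone


def hub_hash_key(business_key: str) -> str:
    """Hash a business key. MD5 and trim/upper-case are assumptions."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hub_row(business_key: str, record_source: str) -> dict:
    """Build one Hub row with the four columns from the slide."""
    return {
        "business_key_hash": hub_hash_key(business_key),  # primary key
        "business_key": business_key,                      # unique index
        "load_dts": datetime.now(timezone.utc).isoformat(),
        "record_source": record_source,
    }


row = hub_row("C-1001", "CRM")
```

    Because the key is derived only from the business key, any load process can compute it without looking it up in the Hub first.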

  • 3.10 Link characteristics

    • Primary Key: the PK is a unique hash key.

    • Foreign Keys: a Link has two or more foreign keys (the Business Key Hashes of the corresponding Hubs), implementing an n:n relation between two or more Hubs. Together they form a composite unique index.

    • Load DTS: a Link’s load date timestamp represents the first time the EDW saw the data.

    • Record Source: a Link’s Record Source represents the source (system) of the Link associative value.

    Link columns: Link_Key Hash (PK), Business Key Hashes, Load DTS, Record_Source
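
    A Link row could be built along the following lines. This is only a sketch: the slides do not say how the Link key hash is derived, so hashing the delimited business keys with MD5 is an assumption, and `hash_key`, `link_row` and the Customer/Product pairing are hypothetical names.

```python
import hashlib

DELIMITER = "||"  # assumed separator between business keys


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalized business keys (MD5 is an assumption)."""
    normalized = DELIMITER.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def link_row(customer_key: str, product_key: str,
             load_dts: str, record_source: str) -> dict:
    """Build one Link row relating a Customer Hub and a Product Hub."""
    return {
        "link_key_hash": hash_key(customer_key, product_key),  # primary key
        "customer_hash": hash_key(customer_key),  # foreign key to the Hub
        "product_hash": hash_key(product_key),    # foreign key to the Hub
        "load_dts": load_dts,
        "record_source": record_source,
    }


row = link_row("C-1001", "P-9", "2020-03-03T00:00:00", "CRM")
```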

  • 3.16 Satellite characteristics

    • Business Key Hash: a foreign key to the unique key of the Hub or Link.

    • Load DTS: a Satellite’s load date timestamp represents the first time the EDW saw the data. Together with the Business Key Hash it forms the Primary Key of the Satellite.

    • Detail(s): a Satellite can have one or more detail values for one record.

    • Detail Hash Diff: helps to compare a new record with the older ones (optional column).

    • End DTS: a Satellite’s end date timestamp represents the time the EDW saw the new data that replaces the old record (optional column).

    • Record Source: a Satellite’s Record Source represents the source (system) of the detail value.

    • Note: to avoid outer joins, keep at least one Satellite row for every row in the Hub or Link.

    Satellite columns: Business Key Hash (PK), Load DTS (PK), Details 1-n, Detail Hash Diff (optional), End_DTS (optional), Record_Source
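
    How the Detail Hash Diff and End DTS might work together during a Satellite load is sketched below. The hashing scheme (MD5 over sorted name=value pairs) and the helper names `hash_diff` and `apply_delta` are assumptions for illustration, not taken from the slides.

```python
import hashlib


def hash_diff(details: dict) -> str:
    """Hash all detail values in a fixed column order (MD5 is an assumption)."""
    payload = "||".join(f"{name}={details[name]}" for name in sorted(details))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def apply_delta(satellite_rows: list, key_hash: str, details: dict,
                load_dts: str, record_source: str) -> bool:
    """Insert a new Satellite version only if the details changed.

    Returns True when a row was inserted, False when nothing changed.
    """
    new_diff = hash_diff(details)
    current = next((r for r in satellite_rows
                    if r["business_key_hash"] == key_hash
                    and r["end_dts"] is None), None)
    if current is not None:
        if current["hash_diff"] == new_diff:
            return False              # same details, nothing to load
        current["end_dts"] = load_dts  # close the replaced version
    satellite_rows.append({
        "business_key_hash": key_hash,
        "load_dts": load_dts,
        "details": dict(details),
        "hash_diff": new_diff,
        "end_dts": None,
        "record_source": record_source,
    })
    return True
```

    Comparing one hash value instead of every detail column keeps the change check uniform across all Satellites.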

  • 1.4 Data Vault physical structure

    [Figure: two Hubs, each with a Satellite, connected by a Link]

    Hub: Business Key Hash, Business_Key, Load_DTS, Record_Source

    Satellite: Business Key Hash, Load_DTS, Detail(s), End_DTS, Record_Source

    Link: Link Key Hash, Business Key Hashes, Load_DTS, Record_Source
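
    The physical structure can be tried out with Python's built-in `sqlite3` module. The table and column names below (`hub_customer`, `hub_product`, `link_customer_product`, `sat_customer`, `name`, `city`) are hypothetical examples, and the data types are simplified; a production DDL for Oracle would look different.

```python
import sqlite3

# Hypothetical example tables; names and types are assumptions.
DDL = """
CREATE TABLE hub_customer (
    customer_hash TEXT PRIMARY KEY,
    customer_key  TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_product (
    product_hash  TEXT PRIMARY KEY,
    product_key   TEXT NOT NULL UNIQUE,
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_customer_product (
    link_hash     TEXT PRIMARY KEY,
    customer_hash TEXT NOT NULL REFERENCES hub_customer (customer_hash),
    product_hash  TEXT NOT NULL REFERENCES hub_product (product_hash),
    load_dts      TEXT NOT NULL,
    record_source TEXT NOT NULL,
    UNIQUE (customer_hash, product_hash)
);
CREATE TABLE sat_customer (
    customer_hash TEXT NOT NULL REFERENCES hub_customer (customer_hash),
    load_dts      TEXT NOT NULL,
    name          TEXT,
    city          TEXT,
    end_dts       TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hash, load_dts)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```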

  • 3.1 Performance tuning your Data Vault

    • After all the functional modelling is done, the performance of the Data Vault can be tuned.

    • Tuning the performance does not change the functionality.

    • Performance is only tuned when necessary.

    • There are two options for tuning:

    1. Performance tuning that can be modelled into the Data Vault by the Data Vault modeler.

    2. Performance tuning done by the DBA of the database (such as tablespaces, indexing, partitioning, etc.).

  • 4.1 Data Vault – Loading sequence?

    [Figure: loading sequence Transaction → Staging → EDW (DV) → Data Mart (Dim), carried out by Staging Loads, Data Vault Loads and Dimensional Loads]

  • 4.2 Data Vault – Loading Source and Stage

    [Figure: load phases Staging Loads, Data Vault 2.0 Loads and Dimensional Loads; stages shown: Sources, Stage]

    Parallel loading of the Sources, followed by Staging (staging can be virtual or non-existent…)

  • 4.3 Data Vault 1.0 – Loading Data Vault

    [Figure: load phases Staging Loads, Data Vault Loads and Dimensional Loads; stages shown: Sources, Stage, Hubs (DV 1.0)]

    First up is parallel loading of the Hubs.

  • 4.4 Data Vault 1.0 – Loading Data Vault

    [Figure: load phases Staging Loads, Data Vault Loads and Dimensional Loads; stages shown: Sources, Stage, Hubs, Hub-Sat, Links (DV 1.0)]

    Then parallel loading of the Satellites belonging to the Hubs, and parallel loading of the Links between the Hubs.

  • 4.5 Data Vault 2.0 – Loading Data Vault

    [Figure: load phases Staging Loads, Data Vault 2.0 Loads and Dimensional Loads; stages shown: Sources, Stage, Hubs, Sats, Links (DV 2.0)]

    Parallel loading of all DV structures, as far as hardware restrictions allow.

  • 4.6 Data Vault – Loading Dimensions and Facts

    [Figure: load phases Staging Loads, Data Vault 2.0 Loads and Dimensional Loads; stages shown: Sources, Stage, Hubs, Sats, Links, Dims, Facts]

    Finally loading the Dimensions, followed by the Facts of the dimensional model.

  • 4.7 Data Vault – Loading

    • Starting the loading of the Hubs, Links and Satellites may still be major synchronization points.

    • All loading is done simultaneously, thanks to the use of Hash Keys.

    • Sets of loading jobs “wait” for the previous set to complete.

    • Loads are started as soon as data is ready.

    • No other “waiting” time is required.

    • Load dependencies are greatly reduced.
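
    Why Hash Keys allow simultaneous loading can be sketched as follows: every loader derives its keys directly from the business keys in staging, so no loader has to look up a key generated by another load. The staging rows, loader names and the use of MD5 are assumptions for illustration; threads stand in for parallel database jobs.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*keys: str) -> str:
    """Hash normalized business keys (MD5 and '||' are assumptions)."""
    joined = "||".join(k.strip().upper() for k in keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical staging rows.
STAGE = [
    {"customer": "C-1", "product": "P-9", "city": "Bonn"},
    {"customer": "C-2", "product": "P-9", "city": "Kiel"},
]


def load_hub_customer(stage):
    return {hash_key(r["customer"]): r["customer"] for r in stage}


def load_link_customer_product(stage):
    return {hash_key(r["customer"], r["product"]):
            (hash_key(r["customer"]), hash_key(r["product"]))
            for r in stage}


def load_sat_customer(stage):
    return {hash_key(r["customer"]): {"city": r["city"]} for r in stage}


def load_all(stage):
    """Run all loaders at once; none depends on another's output."""
    loaders = {
        "hub_customer": load_hub_customer,
        "link_customer_product": load_link_customer_product,
        "sat_customer": load_sat_customer,
    }
    with ThreadPoolExecutor(max_workers=len(loaders)) as pool:
        futures = {name: pool.submit(fn, stage) for name, fn in loaders.items()}
        return {name: future.result() for name, future in futures.items()}


vault = load_all(STAGE)
```

    With sequence-generated surrogate keys, the Link and Satellite loaders would first have to wait for the Hub load in order to look up the keys.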

  • 4.2 EDW: Data Vault requires an Architectural shift

    [Figure: Transaction source → Staging → EDW (DV) → Data Mart (Dim), annotated with “Only ‘hard’ rules.” and “Complex business rules coming out of the EDW, ‘the lens’ filter”]

    Complex business rules are only transformed downstream, allowing traceability, auditability and uniform/homogeneous loading.

  • 4.3 EDW: The Business Data Vault

    [Figure: Transaction source → Staging → EDW (DV) → Data Mart (Dim), with the EDW split into a Raw DV and a Business DV]

    The Raw Data Vault is the vault as described up to this point. It supports “one version of the facts”.

    The Business Data Vault holds transformed and calculated values. It supports “business transformations”.

  • 4.4 Business Data Vault Definition

    • The Business Data Vault stores data processed by (soft) business rules.

    • Data in the Business Data Vault is always derived from the Raw Data Vault (also called “Operational Vault”).

    • Preferred design choice: separate models (Raw/Business Vault).

    • Practical choices: Business Hubs, Links and Satellites are added to the Raw Data Vault model as needed.


  • 4.5 Business Data Vault Example: SAT_INV_CUR

    HUB_INVOICE: Invoice_Hash, Invoice_Number, Load_DTS, Record_Source

    SAT_INV_AMT: Invoice_Hash, Load_DTS, Amount_Billed, Amount_Payed, End_DTS, Record_Source

    SAT_INV_DT: Invoice_Hash, Load_DTS, Billed_Date, Paid_Date, End_DTS, Record_Source

    SAT_INV_CUR: Invoice_Hash, Load_DTS, Currency, Exchange_Rate, Amount_Payed, End_DTS, Record_Source

    In SAT_INV_CUR, Amount_Payed is a derived calculation based on Amount_Payed from SAT_INV_AMT and Exchange_Rate.
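
    The derived calculation could look like the sketch below. The slide only says that the value is based on Amount_Payed and Exchange_Rate, so multiplying the two, rounding to two decimals and the record source label `BR_CURRENCY_CONVERSION` are assumptions, as is the function name `derive_sat_inv_cur`.

```python
from decimal import Decimal, ROUND_HALF_UP


def derive_sat_inv_cur(sat_inv_amt_row: dict, currency: str,
                       exchange_rate: Decimal, load_dts: str) -> dict:
    """Derive one SAT_INV_CUR row from a SAT_INV_AMT row.

    Assumption: converted amount = Amount_Payed * Exchange_Rate.
    """
    converted = (sat_inv_amt_row["Amount_Payed"] * exchange_rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {
        "Invoice_Hash": sat_inv_amt_row["Invoice_Hash"],
        "Load_DTS": load_dts,
        "Currency": currency,
        "Exchange_Rate": exchange_rate,
        "Amount_Payed": converted,
        "End_DTS": None,
        "Record_Source": "BR_CURRENCY_CONVERSION",  # hypothetical rule name
    }


raw = {"Invoice_Hash": "abc", "Amount_Payed": Decimal("100.00")}
derived = derive_sat_inv_cur(raw, "USD", Decimal("1.25"), "2020-03-03")
```

    The raw row stays untouched; the business rule only adds a new row to the Business Satellite, which keeps the calculation traceable.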

  • 4.6 Business Data Vault Performance

    Optimal choices for performance or real-time loading:

    • Integrated Raw Data Vault and Business Data Vault.

    • Business Hubs, Links or Satellites added to Raw Hubs, Links or Satellites.

    • Example: a Customer Hub has two address Satellite tables, one for each of two separate source systems. After loading the raw data, business rules are used to calculate the active address and store this result in a Business Satellite attached to the Customer Hub.
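
    A business rule for the active address might be sketched as follows. The slides do not define the rule, so "the most recently loaded open address wins" is purely an assumption, and the names `active_address`, `sat_addr_crm` and `sat_addr_erp` are hypothetical.

```python
def active_address(customer_hash, sat_addr_crm, sat_addr_erp, load_dts):
    """Pick the active address from two raw Satellites.

    Assumed rule: among the open rows (end_dts is None) of both
    sources, the row with the latest load_dts wins.
    """
    candidates = [row
                  for satellite in (sat_addr_crm, sat_addr_erp)
                  for row in satellite
                  if row["customer_hash"] == customer_hash
                  and row["end_dts"] is None]
    if not candidates:
        return None
    winner = max(candidates, key=lambda row: row["load_dts"])
    # Row for the Business Satellite attached to the Customer Hub.
    return {
        "customer_hash": customer_hash,
        "load_dts": load_dts,
        "address": winner["address"],
        "end_dts": None,
        "record_source": "BR_ACTIVE_ADDRESS",  # hypothetical rule name
    }


sat_addr_crm = [{"customer_hash": "abc", "load_dts": "2020-01-01",
                 "end_dts": None, "address": "Old Street 1"}]
sat_addr_erp = [{"customer_hash": "abc", "load_dts": "2020-02-01",
                 "end_dts": None, "address": "New Road 2"}]
result = active_address("abc", sat_addr_crm, sat_addr_erp, "2020-02-02")
```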

  • 4.7 Retrieving data is to “know your data”

    Example:

    • The relation between Customer and Product has always been 1:n.

    • Then, on 01-01-2010, the transaction system changes and the relation between Customer and Product becomes n:1.

    • The Link can handle this change, therefore no problem.

    • How can the reporting environment know about this change? It is invisible in the Data Vault model, which has not been changed…

    Conclusion: hence the necessity of Meta Data!
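
    The point of the example can be made concrete with a small sketch: the same Link structure stores both periods, and the cardinality only shows up when the rows themselves are inspected. The function `observed_cardinality` and the sample rows are hypothetical.

```python
from collections import defaultdict


def observed_cardinality(link_rows):
    """Report the Customer:Product cardinality seen in (customer, product) rows."""
    products_per_customer = defaultdict(set)
    customers_per_product = defaultdict(set)
    for customer, product in link_rows:
        products_per_customer[customer].add(product)
        customers_per_product[product].add(customer)
    many_products = any(len(p) > 1 for p in products_per_customer.values())
    many_customers = any(len(c) > 1 for c in customers_per_product.values())
    if many_products and many_customers:
        return "n:m"
    if many_products:
        return "1:n"
    if many_customers:
        return "n:1"
    return "1:1"


# Hypothetical Link rows loaded before and after the source system change.
before_change = [("C1", "P1"), ("C1", "P2"), ("C2", "P3")]
after_change = [("C1", "P1"), ("C2", "P1"), ("C3", "P2")]
```

    Both row sets fit the same Link table without any model change, which is why a reporting environment needs metadata (or a check like this one) to learn about the change.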

  • CDR in the EDW model – Storing time variant data

    • Tracking complete history on a detailed level.

    • 100% versioning and audit trail.

    • Implicit implementation of MDM.

    • Parallel processing of satellite loading, optimizing performance.

    • True “Single Source of Facts”.

    [Figure: Data Vault model for call detail records (CDR). Hubs: Phone Number, Customer, Exchange, Facilities. Links: Call Detail, Contract. Satellites: Call Detail, Phone Number, Customer, Exchange, Facilities, Contract. Relationship labels: received, making, charged, defining, writing, used, owns, part of.]

  • CDR in the Subject Areas or Data Marts – Identifying dimensions

    • Automated identification of candidate dimensions.

    • A dimension originates from a hub.

    • Combine with related links and satellites based on information requirements.

    [Figure: the CDR Data Vault model (Hubs Phone Number, Customer, Exchange, Facilities; Links Call Detail, Contract; and their Satellites) next to a Subject Area containing Dimension Customer, Dimension Contracts, Dimension Exchanges and Dimension Facilities.]
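
    Deriving a dimension from a Hub and its Satellite could be sketched like this. The column names (`customer_key`, `name`, `city`) and the function `build_customer_dimension` are hypothetical; taking only the open Satellite row (no End DTS) is an assumption about the information requirement.

```python
def build_customer_dimension(hub_customer, sat_customer):
    """Join each Hub row with its current Satellite row."""
    current = {row["customer_hash"]: row
               for row in sat_customer if row["end_dts"] is None}
    dimension = []
    for hub in hub_customer:
        details = current.get(hub["customer_hash"], {})
        dimension.append({
            "customer_key": hub["customer_key"],  # business key from the Hub
            "name": details.get("name"),
            "city": details.get("city"),
        })
    return dimension


hub_customer = [{"customer_hash": "h1", "customer_key": "C-1"},
                {"customer_hash": "h2", "customer_key": "C-2"}]
sat_customer = [
    {"customer_hash": "h1", "name": "Ann", "city": "Bonn", "end_dts": "2020-02-01"},
    {"customer_hash": "h1", "name": "Ann", "city": "Kiel", "end_dts": None},
    {"customer_hash": "h2", "name": "Bob", "city": "Ulm", "end_dts": None},
]
dim_customer = build_customer_dimension(hub_customer, sat_customer)
```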

  • CDR in Subject Areas or Marts – Identifying facts

    • Automated identification of candidate facts.

    • A fact originates from a link with related hubs.

    • Combine with related satellites based on information requirements.

    • Optimized for analytical query performance.

    [Figure: the CDR Data Vault model (Hubs Phone Number, Customer, Exchange, Facilities; Links Call Detail, Contract; and their Satellites) next to a Subject Area containing Fact CDR together with Dimension Customer, Dimension Facilities, Dimension Exchanges and Dimension Contracts.]

    Data Vault – Deutsche Bank June 2012

  • Thank you

    de.cgi.com/BI