03-1 DWh Data Warehouse - Time Dimension

Upload: ksrsarma

Post on 05-Apr-2018


  • 7/31/2019 03-1 DWh Data Warehouse - Time Dimension


    R. Marti

    3-1 Data Warehouse The Time Dimension

    Data Warehousing

    Spring Semester 2011


    3-1 DWh 2011: Data Warehouse, R. Marti

    The Data Warehouse in the DWh Reference Architecture

    [Figure: DWh reference architecture with source databases, master data, the data warehouse, data marts, and front ends (dashboards, reports, interactive analysis).]

    Focus
    - Architectural options and variations in data warehouse projects
    - Design of the single integrated data warehouse, in particular
      - how to model temporal aspects
      - how to ensure common dimensions (=> Master Data Management)


    Recap: Time in Classical Data Mart Designs (1)


    Recap: Time in Classical Data Mart Designs (2)

    Rows in fact tables are associated with a specific time by the foreign key reference to the time dimension, indicating as of when they are valid.

    However, rows in dimension tables are not associated with a time!
    - New rows (rows with an unknown source system identifier) are simply added.
    - Usually, no rows are deleted from a dimension table, even if rows with known source system identifiers are missing in a batch upload:
      . existing (old) facts still refer to objects corresponding to these missing rows
      . if sources do not send explicit information on deletions, it is unclear whether the corresponding objects have effectively become invalid or not
        (Note: Sending this information might mean re-designing the source system!)
    - Changes in values of dimension rows with known source system identifiers are
      . either simply overwritten,
      . or a new row with a new surrogate (but the old source system id) is added
        (see topic "slowly changing dimensions")


    Temporal Database Systems + Languages

    For some types of analysis, dimensions should also be historized, especially for comparisons of measures across different time periods.

    Example:

    How did buying habits of customers change over the last few years, grouped by where they live?

    History of addresses of customers should also be kept!

    Since 1980, a lot of research has been conducted in temporal data models, temporal query languages, and temporal database systems.

    Generic support for temporal data is beginning to emerge in products: Teradata Database 13.10, IBM DB2 V10, Oracle


    Notions of Time

    Valid Time is the time during which a fact in the real world was, is, or will be true or, more precisely, was / is believed to be true or believed to become true. Note: This time is determined by the user.

    Sometimes also called effective time, as-of time, or business time.

    Transaction Time is the time during which a fact in the real world was or is (rightly or wrongly) stored in the database. Note: This time is determined by the system (unless the user decides to delay entering the data, of course ...).

    Sometimes also called system time.

    Example of an announcement made (and stored in a DB there and then) on October 1, 2010 (= transaction time):

    "David Cole will be Chief Risk Officer as of March 1, 2011" (= valid time).


    Associating Time with Data

    [Figure: temporal relation with axes time, tuples, and attributes; one slice is labeled "snapshot at time t".]

    Assumption: For each relation, a clock with a given temporal granularity is specified, e.g., a day, a second, or a millisecond.

    Conceptually, the extension of a temporal relation $R$ can then be viewed as a sequence of snapshot relations

    $R_t = \tau_t(R)$

    for every time point $t$ of this clock.

    $\tau_t$ is called the snapshot operator (sometimes also the timeslice operator).


    Benefits and Pitfalls of Sequence of Snapshots Model

    Good for theoretical considerations, in particular
    - determining equivalence of different temporal representations
    - gauging the expressive power of temporal query languages

    May be impractical as an implementation model, given that it may require lots of space, especially when
    - the granularity of time is fine (minutes, seconds, milliseconds, ...)
    - represented facts do not change often, i.e. stay the same over a longer interval (usually because they describe states rather than events)


    From Sequence of Snapshots Model to Time Intervals

    Remedy: Don't store data again that did not change since the previous clock tick.

    Collect identical snapshots of suitable smaller parts of a relation (e.g., tuples or attribute values) and associate them with time intervals rather than time points.

    Alternatives:
    (1) associate temporal intervals with every tuple
    (2) associate temporal intervals with every attribute value
        (but the 2nd approach requires complex attributes, violating 1NF)
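The step from a sequence of snapshots to tuple-level time intervals (alternative 1) can be sketched in Python. This is only an illustration, not part of the original slides: the function name `snapshots_to_intervals`, the dictionary layout, and the sample data are assumptions, and the sketch presumes one snapshot per consecutive day.

```python
from datetime import date, timedelta

def snapshots_to_intervals(snapshots):
    """Collapse {day: set_of_tuples} into (tuple, v_from, v_to) rows with closed intervals."""
    rows = []
    open_since = {}          # tuple -> first day of its current run
    previous_day = None
    for day in sorted(snapshots):
        current = snapshots[day]
        # Close the runs of tuples that are no longer present.
        for tup in list(open_since):
            if tup not in current:
                rows.append((tup, open_since.pop(tup), previous_day))
        # Open a run for every tuple that was not present on the previous day.
        for tup in current:
            open_since.setdefault(tup, day)
        previous_day = day
    # Runs still open at the last snapshot end on that day.
    for tup, start in open_since.items():
        rows.append((tup, start, previous_day))
    return sorted(rows, key=lambda r: (r[1], r[0]))

d = date(2010, 1, 1)
snaps = {
    d: {(676, "Baar")},
    d + timedelta(days=1): {(676, "Baar")},
    d + timedelta(days=2): {(676, "Bern")},
}
intervals = snapshots_to_intervals(snaps)
```

Three daily snapshots shrink to two interval-stamped rows here; the saving grows with finer clocks and longer unchanged periods.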


    Valid Time Relations capturing State

    Conceptually, every tuple which captures a state is timestamped with a time interval [t_from, t_to] indicating the validity of the (non-temporal) data represented in the tuple.

    Remarks:

    - Transformation into 1NF by replacing V_INTERVAL by V_FROM (valid from) and V_TO (valid to).
    - The symbol ? means "unknown", "until now", or "until further notice". In standard SQL, it is usually represented by null or by the date 9999-12-31, neither of which is entirely satisfactory ...


    Typical Queries (1): Snapshot of Valid Time Relation

    Snapshots of the previous valid time relation:

    Remarks:

    - We assume that ID is the primary key at every point in time (in every snapshot).
    - Producing a snapshot from a valid time relation is a simple selection in relational algebra:

    select ID, NAME, FNAME, ADDR, SAL
    from EMP
    where :t in V_INTERVAL   (or: where :t between V_FROM and V_TO)
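The snapshot query can be tried out with SQLite from Python. The sketch below is an illustration under assumptions: the surname, first name, and exact interval bounds are invented sample data loosely based on employee 676, dates are stored as ISO text (which compares correctly as strings), and `snapshot` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP (
  ID integer, NAME text, FNAME text, ADDR text, SAL integer,
  V_FROM text not null,
  V_TO   text not null default '9999-12-31'   -- stands for "until further notice"
);
insert into EMP values
  (676, 'Meier', 'Sue', 'Baar', 7000, '2006-08-01', '2008-02-29'),
  (676, 'Meier', 'Sue', 'Bern', 7000, '2008-03-01', '2009-12-31'),
  (676, 'Meier', 'Sue', 'Bern', 7500, '2010-01-01', '9999-12-31');
""")

def snapshot(t):
    """Return the non-temporal rows that were valid on day t (ISO date string)."""
    return conn.execute(
        "select ID, NAME, FNAME, ADDR, SAL from EMP where :t between V_FROM and V_TO",
        {"t": t},
    ).fetchall()

mid_2009 = snapshot("2009-06-15")
before_hiring = snapshot("2005-01-01")
```

Because the intervals of one ID do not overlap, each snapshot contains at most one row per ID, which matches the primary key assumption.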


    Valid Time Relations capturing Recurring States

    A specific state of affairs can recur several times (=> several time periods)

    transformation to 1NF

    The first two tuples are called value equivalent since they have the same values in all attributes except the temporal attributes V_FROM and V_TO.


    Options in the Representation of Time

    Canonical representation using maximal time intervals (as on previous slide):

    One (of many) possible alternative representations using two (non-maximal)

    contiguous intervals (assuming a temporal granularity of a day):


    Issues with non-canonical Representations

    Non-canonical representations may lead to incorrect answers:

    Example Query: Who left the company before 2008-01-01 and when?

    select ID, NAME, FNAME, V_TO

    from EMP

    where V_TO < date '2008-01-01'

    (Incorrect) Result:
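The problem can be reproduced with a small SQLite session in Python. This is a sketch with invented data (employees `Adams` and `Baker` are hypothetical, and FNAME is omitted for brevity); it only illustrates why non-maximal intervals mislead the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP (ID integer, NAME text, V_FROM text, V_TO text);
-- Employee 1 has been employed without interruption since 2007-01-01,
-- but the period is stored as two contiguous, non-maximal intervals.
insert into EMP values (1, 'Adams', '2007-01-01', '2007-06-30');
insert into EMP values (1, 'Adams', '2007-07-01', '9999-12-31');
-- Employee 2 really left on 2007-09-30.
insert into EMP values (2, 'Baker', '2006-01-01', '2007-09-30');
""")

left_before_2008 = conn.execute(
    "select ID, NAME, V_TO from EMP where V_TO < '2008-01-01' order by ID"
).fetchall()
```

The query reports employee 1 as having left on 2007-06-30, although that date is merely the end of the first of two adjacent intervals.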


    Avoiding non-canonical Representations: By Design

    Ensure that intervals remain maximal when inserting or updating:

    Let
    - R be a valid time relation in canonical form (i.e., with maximal time intervals)
    - n be a new valid time tuple to be inserted into the relation R
    - x1, ..., xk (k >= 0) be all existing valid time tuples in relation R which are value equivalent to n (cf. slide 12)

    Then, for all i, 1 <= i <= k, the following must hold (in pseudo-SQL notation):

    not exists (
      select *
      from R xi
      where xi = n                                            -- value equivalence
        and (   n.V_FROM - 1 between xi.V_FROM and xi.V_TO
             or n.V_TO + 1 between xi.V_FROM and xi.V_TO )    -- intervals do not touch or overlap
    )

    (This could be specified as a declarative check constraint if the implementation supported it.)
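The insert rule can be sketched as a Python predicate. The names `touches_or_overlaps` and `keeps_intervals_maximal` and the row layout `(values, v_from, v_to)` are assumptions for illustration. Note one deliberate deviation: the sketch uses a symmetric adjacency test, which also rejects a new interval that completely encloses an existing one; whether the slide intends to cover that case is an assumption.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)   # assumed temporal granularity: one day

def touches_or_overlaps(a_from, a_to, b_from, b_to):
    """True if the closed intervals [a_from, a_to] and [b_from, b_to] overlap or are adjacent.

    Only start dates are shifted, so an end date of 9999-12-31 cannot overflow.
    """
    return a_from - ONE_DAY <= b_to and b_from - ONE_DAY <= a_to

def keeps_intervals_maximal(relation, new_row):
    """True if inserting new_row leaves all intervals of value-equivalent rows maximal."""
    values, v_from, v_to = new_row
    return not any(
        x_values == values and touches_or_overlaps(v_from, v_to, x_from, x_to)
        for x_values, x_from, x_to in relation
    )

emp = [((676, "Baar"), date(2006, 8, 1), date(2008, 2, 29))]
adjacent = ((676, "Baar"), date(2008, 3, 1), date(9999, 12, 31))
separate = ((676, "Baar"), date(2009, 1, 1), date(9999, 12, 31))
```

An insert that fails this test would instead have to extend the existing interval.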


    Typical Queries (2): Temporal Projection

    Unfortunately, (intermediate) query results may be non-canonical, even if

    applied to a canonical representation:

    Example: Where did employees live and when (irrespective of salary)?

    select ID, NAME, FNAME, ADDR, V_FROM, V_TO
    from EMP

    Result:


    Avoiding non-canonical Representations: By Coalescing

    Non-canonical representations can be transformed into the canonical representation by an operation called temporal coalescing, which maximizes the length of all intervals by coalescing adjacent and overlapping intervals of value-equivalent tuples.

    Coalesced form:


    Temporal Coalescing in (Pseudo-) SQL

    with recursive Rclos as (
      -- initial ("anchor") query
      select R.values, R.V_FROM, R.V_TO
      from R
      union
      -- recursive query: executed until no new data generated
      select R.values, R.V_FROM, Rclos.V_TO
      from R, Rclos
      where Rclos.values = R.values
        and Rclos.V_FROM >= R.V_FROM
        and Rclos.V_FROM-1 Rclos.V_TO )
    )

    A more efficient implementation uses window functions (see [Zhou et al 2006]).
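For comparison, temporal coalescing can also be sketched procedurally in Python with a sort-and-merge pass. This is a different technique from the recursive SQL on the slide and from the window-function approach it cites; the function name `coalesce`, the row layout, and the day granularity are assumptions.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)   # assumed temporal granularity: one day

def coalesce(rows):
    """Merge adjacent or overlapping closed intervals of value-equivalent rows.

    rows: iterable of (values, v_from, v_to); values must be hashable and orderable.
    """
    result = []
    for values, v_from, v_to in sorted(rows):
        if result:
            last_values, last_from, last_to = result[-1]
            # Same values and no gap between the two intervals: extend the last row.
            if last_values == values and v_from - ONE_DAY <= last_to:
                result[-1] = (last_values, last_from, max(last_to, v_to))
                continue
        result.append((values, v_from, v_to))
    return result

# Address history of employee 676 after projecting away the salary.
projected = [
    ((676, "Bern"), date(2010, 1, 1), date(9999, 12, 31)),
    ((676, "Baar"), date(2006, 8, 1), date(2008, 2, 29)),
    ((676, "Bern"), date(2008, 3, 1), date(2009, 12, 31)),
]
canonical = coalesce(projected)
```

Sorting first places value-equivalent rows next to each other in start order, so a single pass suffices.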


    Typical Queries (3): Temporal Join

    Sometimes, the history of information stored in two relations is of interest:

    Example: Who worked on which projects and when?

    Result:


    Temporal Join in SQL (without temporal coalescing!)

    Construct time intervals of result by intersecting time intervals of operands
    (and keeping rows with non-empty intervals):

    select * from (
      select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME,
             case when e.V_FROM > w.V_FROM
                  then e.V_FROM
                  else w.V_FROM
             end as V_FROM,
             case when e.V_TO < w.V_TO
                  then e.V_TO
                  else w.V_TO
             end as V_TO
      from WORKS_ON w, EMP e
      where e.ID = w.EMP_ID)
    where V_FROM
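A runnable variant of this interval-intersection join, using SQLite from Python, might look as follows. It is a sketch under assumptions: the final filter is taken to be `V_FROM <= V_TO` (the slide text breaks off at that point), FNAME is omitted, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP (ID integer, NAME text, V_FROM text, V_TO text);
create table WORKS_ON (EMP_ID integer, PROJ_ID integer, V_FROM text, V_TO text);
insert into EMP values (676, 'Meier', '2006-08-01', '9999-12-31');
insert into WORKS_ON values (676, 1, '2006-01-01', '2006-12-31');  -- starts before the EMP row
insert into WORKS_ON values (676, 2, '2007-01-01', '2007-06-30');  -- inside the EMP row
insert into WORKS_ON values (676, 3, '2005-01-01', '2005-12-31');  -- no common time at all
""")

joined = conn.execute("""
select * from (
  select w.PROJ_ID, w.EMP_ID, e.NAME,
         case when e.V_FROM > w.V_FROM then e.V_FROM else w.V_FROM end as V_FROM,
         case when e.V_TO   < w.V_TO   then e.V_TO   else w.V_TO   end as V_TO
  from WORKS_ON w, EMP e
  where e.ID = w.EMP_ID)
where V_FROM <= V_TO      -- assumed filter: keep only non-empty intersections
order by PROJ_ID
""").fetchall()
```

Project 1 is clipped to the start of the employee row, project 2 is unchanged, and project 3 drops out because its intersection is empty.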


    Proposals for Temporal Support in SQL

    There are proposals to hide this (and more, see following slides) temporal complexity in SQL, e.g., the SQL/Temporal part of a future SQL3 standard.

    A temporal join (including temporal coalescing) would look as follows:

    validtime
    select w.PROJ_ID, w.EMP_ID, e.NAME, e.FNAME
    from WORKS_ON w, EMP e
    where e.ID = w.EMP_ID

    See e.g. [Snodgrass 1999]:
    Richard T. Snodgrass: Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 1999.
    Note: This publication is out of print, but available electronically as pdf at http://www.cs.arizona.edu/people/rts/publications.html

    DB2 10 for z/OS and Teradata Database V13.10 support most of the SQL/Temporal proposal.


    Transaction Time Relations

    Note that transaction time should be automatically determined by the system at insert/update/delete time (or, more precisely, commit time), not by the user; granularity is typically as fine as possible.

    Transaction time can be represented exactly like valid time, by associating a time interval with tuples.

    Example: Transaction time history of employee 676 (also see slide 10)
    1. 2006-07-01: insert 676 lives in Baar and earns 7000.
    2. 2008-04-01: update 676 lives in Bern.
    3. 2009-11-01: update 676 earns 7500.


    Using DBMS Logging to capture Transaction Time

    Since transaction time can be automatically determined by the system, the DBMS logging facilities can be used.

    This is/was done e.g. in Postgres/PostgreSQL/Illustra (and in Oracle).

    Example: Transaction time history of employee 676 (see slide 15)
    1. 2006-07-01: insert 676 lives in Baar and earns 7000.
    2. 2008-04-01: update 676 lives in Bern.
    3. 2009-11-01: update 676 earns 7500.

    - Normal (snapshot) table containing current contents.
    - Undo log table containing changes to produce previous contents of associated snapshot table (before images).


    Implementing Logging Using Triggers

    create or replace trigger TR_AU_EMP
    after update
    on EMP
    for each row
    declare
      l_log EMP_UNDO_LOG%rowtype;
    begin
      l_log.X_TIME       := current_timestamp;
      l_log.UNDO_OP_CODE := 'update';
      l_log.ID           := :old.ID;
      l_log.NAME         := :old.NAME;
      l_log.FNAME        := :old.FNAME;
      l_log.ADDR         := :old.ADDR;
      l_log.SAL          := :old.SAL;
      insert into EMP_UNDO_LOG values l_log;
    end TR_AU_EMP;
    /

    Notes:
    - written in Oracle PL/SQL
    - similar triggers required for inserts and deletes
    - should probably check that ID has not changed and raise an application error if this were the case
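The same logging idea can be tried without an Oracle installation. The sketch below ports the update trigger to SQLite and drives it from Python; it is an illustration only, with a reduced column set (NAME and FNAME left out) and SQLite's `strftime` standing in for `current_timestamp`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP (ID integer primary key, ADDR text, SAL integer);
create table EMP_UNDO_LOG (
  X_TIME text, UNDO_OP_CODE text, ID integer, ADDR text, SAL integer
);
create trigger TR_AU_EMP after update on EMP
for each row
begin
  -- store the before image together with the system time of the change
  insert into EMP_UNDO_LOG
  values (strftime('%Y-%m-%d %H:%M:%f', 'now'), 'update', old.ID, old.ADDR, old.SAL);
end;
""")

conn.execute("insert into EMP values (676, 'Baar', 7000)")
conn.execute("update EMP set ADDR = 'Bern' where ID = 676")
conn.execute("update EMP set SAL = 7500 where ID = 676")

current = conn.execute("select ID, ADDR, SAL from EMP").fetchall()
log = conn.execute(
    "select UNDO_OP_CODE, ID, ADDR, SAL from EMP_UNDO_LOG order by rowid"
).fetchall()
```

The snapshot table holds only the current row, while the log holds one before image per update, from which earlier contents can be reconstructed.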


    Bitemporal Relations

    Valid time and transaction time can be combined to allow for a complete history of what information was/is believed to be true and when this was stored in the database.

    Example: Complete (bitemporal) history of employee 676

    1. 2006-07-01: insert 676 lives in Baar and earns 7000 as of 2006-08-01.


    Bitemporal Relations (2)

    Example (continued): Complete (bitemporal) history of employee 676

    2. 2008-04-01: update 676 lives in Bern as of 2008-03-01.


    Bitemporal Relations (3)

    Example (continued): Complete (bitemporal) history of employee 676

    3. 2009-11-01: update 676 earns 7500 as of 2010-01-01.


    Bitemporal Relations (4)

    Example (continued): Complete (bitemporal) history of employee 676

    4. 2009-11-11: update (correction): 676 earns 7700 as of 2010-01-01.
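The four steps can be turned into a small bitemporal table and queried with SQLite from Python. The table contents below are a reconstruction under assumptions (closed intervals, day granularity for both time dimensions, 9999-12-31 as open end); the table name `EMP_BITEMP` and the helper `believed` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table EMP_BITEMP (
  ID integer, ADDR text, SAL integer,
  V_FROM text, V_TO text,     -- valid time
  T_FROM text, T_TO text      -- transaction time
);
-- step 1 (stored 2006-07-01), superseded when step 2 is stored
insert into EMP_BITEMP values (676, 'Baar', 7000, '2006-08-01', '9999-12-31', '2006-07-01', '2008-03-31');
-- step 2 (stored 2008-04-01): Baar period closed, Bern valid from 2008-03-01
insert into EMP_BITEMP values (676, 'Baar', 7000, '2006-08-01', '2008-02-29', '2008-04-01', '9999-12-31');
insert into EMP_BITEMP values (676, 'Bern', 7000, '2008-03-01', '9999-12-31', '2008-04-01', '2009-10-31');
-- step 3 (stored 2009-11-01): raise to 7500 valid from 2010-01-01
insert into EMP_BITEMP values (676, 'Bern', 7000, '2008-03-01', '2009-12-31', '2009-11-01', '9999-12-31');
insert into EMP_BITEMP values (676, 'Bern', 7500, '2010-01-01', '9999-12-31', '2009-11-01', '2009-11-10');
-- step 4 (stored 2009-11-11): correction to 7700
insert into EMP_BITEMP values (676, 'Bern', 7700, '2010-01-01', '9999-12-31', '2009-11-11', '9999-12-31');
""")

def believed(valid_day, known_day):
    """What did the database, as of known_day, say about the state on valid_day?"""
    return conn.execute(
        """select ADDR, SAL from EMP_BITEMP
           where :v between V_FROM and V_TO and :k between T_FROM and T_TO""",
        {"v": valid_day, "k": known_day},
    ).fetchall()

before_correction = believed("2010-02-01", "2009-11-05")
after_correction = believed("2010-02-01", "2009-12-01")
```

No row is ever deleted: a correction closes the transaction time interval of the old belief and adds a new row, so both the erroneous and the corrected salary remain queryable.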


    Design of Temporal Databases

    Basic idea

    - Do non-temporal database design
    - Annotate which tables / attributes need to be historized (especially valid time) and how (state-based vs. event-based)
    - Generate temporal data structures ... but how?

    Questions:
    - Entity integrity (implemented by primary keys) => temporal entity integrity
    - Referential integrity (implemented by foreign keys) => temporal referential integrity

    Arbiter: sequence of snapshots model


    Temporal Entity Integrity (1)

    Temporal entity integrity = for every snapshot, entity integrity should hold.

    Pro memoria:
    - primary keys should consist of a minimal number of attributes which uniquely identify each tuple
    - these attributes should ideally not change over time

    Options for the primary key of a valid time relation (e.g. for table EMP)
    (1) ID, V_FROM
    (2) ID, V_TO
    (3) ID, V_FROM, V_TO (non-minimal primary key!)
    (4) ID, SEQ_NO (where SEQ_NO is a sequence number or counter)

    Since all attributes except ID (and SEQ_NO) can change over the lifetime of the identified tuple
    - alternative (4) is probably the best,
    - followed by alternative (1), as V_FROM only changes in case of an error
      (and should not be referenced by foreign keys, as we'll see)


    Temporal Entity Integrity (2)

    In addition, it might be desirable to enforce other constraints, including
    - Time intervals must not be empty
    - Time intervals should be maximal (unless e.g. queries like "what was the case before or after a specific point in time" are not of importance)

    create table EMP (

    ID integer not null,

    SEQ_NO integer not null,

    NAME varchar(20) not null,

    ...

    V_FROM date not null,

    V_TO date default date '9999-12-31',

    primary key (ID, SEQ_NO),

    check ( V_FROM
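A table definition along these lines can be tested with SQLite from Python. The sketch assumes that the check constraint compares the interval bounds as `V_FROM <= V_TO` (the slide text breaks off at that point), stores dates as ISO text, and uses an invented surname; `try_insert` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
create table EMP (
  ID     integer not null,
  SEQ_NO integer not null,
  NAME   varchar(20) not null,
  V_FROM text not null,
  V_TO   text not null default '9999-12-31',
  primary key (ID, SEQ_NO),
  check (V_FROM <= V_TO)      -- assumed completion: the interval must not be empty
)
""")

def try_insert(row):
    """Insert a full row; return False if a constraint rejects it."""
    try:
        conn.execute("insert into EMP values (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

first_ok = try_insert((676, 1, "Meier", "2006-08-01", "2008-02-29"))
empty_interval_ok = try_insert((676, 2, "Meier", "2008-03-01", "2008-02-01"))
duplicate_key_ok = try_insert((676, 1, "Meier", "2009-01-01", "2009-12-31"))
```

The maximality of intervals is not covered here; it involves other rows of the table and therefore cannot be expressed as a row-level check.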


    Referential Integrity between Snapshot Relations

    The foreign key (FK) attribute value(s) in the referencing relation must exist as primary key (PK) values in the referenced relation:

    Example: $\text{Works\_On}[\text{Emp\_Id}] \subseteq \text{Emp}[\text{Id}]$

    Note: In relational theory, this is sometimes also called an inclusion dependency.


    Temporal Referential Integrity (1)

    Temporal referential integrity = for every snapshot, referential integrity must hold.

    Problem:
    - primary keys now have a temporal part (on top of the non-temporal part)
    - valid time periods in the foreign key (referencing) relation are not necessarily the same as those of the primary key (referenced) relation

    At every point in time when the FK value was valid, the referenced PK value must be valid:

    $\forall t \, ( \tau_t(\text{Works\_On}[\text{Emp\_Id}]) \subseteq \tau_t(\text{Emp}[\text{Id}]) )$
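Checking this condition for one foreign key row amounts to testing whether its interval is covered by the union of the intervals of the referenced key. A Python sketch, assuming closed intervals with day granularity; the function name `covered` and the sample intervals are illustrative, not taken from the slides:

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)   # assumed temporal granularity: one day

def covered(fk_from, fk_to, pk_intervals):
    """True if the closed interval [fk_from, fk_to] lies inside the union of pk_intervals."""
    needed_from = fk_from                      # first day that is not yet known to be covered
    for pk_from, pk_to in sorted(pk_intervals):
        if pk_from > needed_from:
            return False                       # gap before the next referenced interval
        if pk_to >= fk_to:
            return True                        # the remainder is fully covered
        if pk_to >= needed_from:
            needed_from = pk_to + ONE_DAY      # continue right after this interval
    return False

# Valid time intervals of the EMP rows for employee 676 (three contiguous versions).
emp_676 = [
    (date(2006, 8, 1), date(2008, 2, 29)),
    (date(2008, 3, 1), date(2009, 12, 31)),
    (date(2010, 1, 1), date(9999, 12, 31)),
]
spans_versions = covered(date(2007, 1, 1), date(2010, 6, 30), emp_676)
starts_too_early = covered(date(2006, 1, 1), date(2006, 12, 31), emp_676)
```

The first assignment spans three employee versions and is still valid, because the versions leave no gap; the second starts before the employee exists.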


    Temporal Referential Integrity (2)

    $\forall t \, ( \tau_t(\text{Works\_On}[\text{Emp\_Id}]) \subseteq \tau_t(\text{Emp}[\text{Id}]) )$ holds for employee 676 because projection followed by temporal coalescing would result in:

    Of course, performing temporal coalescing for
    - adding tuples to and/or extending time intervals of the referencing relation
    - deleting tuples from and/or shrinking time intervals in the referenced relation
    would be an expensive proposition.

    Recommendation: Track complete lifetimes of objects in a separate relation


    Temporal Referential Integrity (3)

    - Split valid time relation on referenced (PK) side into an object relation and a property relation.
    - Add a referential integrity constraint from property relation to object relation.
    - Re-route non-temporal referential integrity constraints from other relations to the object relation.


    Temporal Referential Integrity (4)

    In referencing relations, it might be desirable to enforce referential integrity
    - non-temporal part: as usual
    - temporal part: time interval contained in time interval of referenced object

    create table WORKS_ON (
      EMP_ID  integer not null,
      PROJ_ID integer not null,
      SEQ_NO  integer not null,
      V_FROM  date not null,
      V_TO    date default date '9999-12-31',
      primary key (EMP_ID, PROJ_ID, SEQ_NO),
      check ( V_FROM


    Temporal Normalization (1): Time-invariant Attributes

    Assume that attribute FName cannot change over the lifetime of an Emp (except to correct mistakes).

    In other words, the functional dependency (FD) $\text{Id} \rightarrow \text{FName}$ holds

    => relation Emp_Prop below is not in 2NF (attribute depends on part of PK)
    => relation Emp_Prop exhibits update anomalies when having to fix a mistake in Sue's first name (e.g. change to Susan)


    Temporal Normalization (2): Time-invariant Attributes

    Recommendation:

    Consider moving time-invariant attributes (e.g. FName) from the property relation (e.g. Emp_Prop) to the object relation (e.g. Emp_Obj).

    - In Emp_Obj, the FD $\text{Id} \rightarrow \text{FName}$ still holds (and is enforced by the PK), so the relation does not exhibit update anomalies.
    - In Emp_Prop, all attributes are now fully dependent on the PK, but there is still an issue ...


    Temporal Normalization (3): Asynchronous Changes

    Example: After having inserted the salary raise for employee 676 as of the beginning of 2010, we learn that she actually moved to Aarau as of Dec 1, 2009.

    => update anomaly: several tuples need to be changed (in addition to an insert)!

    Recommendation:

    Attributes whose values change independently of other attributes should be put into different relations (somewhat like achieving 4NF in the face of multi-valued dependencies).


    Temporal Normalization (4): Asynchronous Changes

    Example: Since address and salary of an employee may change independently (and asynchronously), these attributes should be put into different relations.

    => no update anomaly: one tuple needs to be changed (in addition to an insert)!

    Employee salaries remain untouched:


    Summary of Design Recommendations

    - For kernel entity types (with objects whose existence is independent of other entities), consider the introduction of an object relation to capture the lifetime of these objects. Main benefits:
      - referential integrity checking over time
      - home for time-invariant attributes

    - For relations representing object properties (or relationships between objects) and their history, consider choosing a temporal primary key consisting of the non-temporal primary key attributes plus a (meaningless) sequence number.

    - For relations representing object properties (or relationships between objects), consider decomposing them into groups of attributes which
      - are either time-invariant
        => this attribute group is moved to the object relation
      - or change independently of one another (i.e., potentially at different times)
        => each such attribute group is moved into a separate relation keeping track of the history of the values

    Remember: Following them is no free lunch!


    Return to (Valid) Time in Warehousing

    [Figure: star schema with fact table POLICY_PTF (PREMIUM_AMT, LOSS_AMT, EXPENSE_AMT, PROFIT_AMT) and dimensions TIME, PRODUCT (PROD_ID), CLIENT (CL_ID, CL_NAME, CL_RATING), and PROF_CENTER (PC_ID, PC_NAME, DIV_ID, DIV_NAME).]

    Motivating Example

    Compare profits over the years
    - grouped by business divisions
    - grouped by client ratings

    What happens if, over time,
    - business divisions change (e.g. profit centers are shifted)?
    - ratings of clients change?
    - two clients merge (e.g., primary insurers in the reinsurance business)?


    [Figure: profit [CHF] over time (2009, 2010) for the dimensional values X, Y, and Z (e.g., names of business divisions), with changes of +24%, -40%, and +80%.]

    First impressions


    [Figure: profit [CHF] over time (2009, 2010) after a profit center shift, broken down into X (X1, X2, X3), Y (Y1, Y2), S, and Z (Z1, Z2), with changes of +24%, -40%, +80%, +0%, and +11%.]

    First impressions can be deceiving


    Terminology and Concepts: Dimensional Hierarchies

    Dimensions often have a hierarchical structure, e.g., in the previous example:

    - Product: hierarchical LineOfBusiness
    - ProfitCenter: embedded in hierarchical org structure ProfitCenter -> Division -> Group
    - Client: hierarchical groupings possible, e.g., grouping by country -> continent, ...

    [Figure: LineOfBusiness hierarchy. All Lines splits into P&C Lines (Property, Casualty, Special Lines) and L&H Lines (Life, Health).]


    Coping with Business Change

    [Figure: timeline with capture times tCapture[1] and tCapture[2], changes to the referenced dimensional structures, and the report time tReport.]

    - tCapture[i]: successful completion of a business transaction; captured measures refer to the dimensional structures valid at this time
    - tReport: report production; which dimensional structure should the reported measures refer to?
      - the original structures valid at the respective capture times (tCapture[i])?
      - the structures valid at report time (tReport)?
      - other times?

    => need history + valid times
    => need succession mapping


    Running Example

    [Figure: running example with the measure table Population (CountryId, Year), the dimension table Country (CountryId, CountryName), which is subject to changes, and Year.]


    Changes to Dimensional Structures

    Type  Change      Description
    1     add         New value added
    2     rename      Old value (name) will be replaced by new value
    3     invalidate  A value will no longer be available for new contracts
    4     merge       n old values will be merged into one value
    5     split       Old value will be divided into n values
    6     move        One value changes position in hierarchy

    [Image column: small before/after diagrams for each change type.]

    Key Questions: Succession Mapping, Taxonomic Relationship


    Examples of Changes to Dimensional Structures

    [Figure: examples of add, rename, invalidate, merge, and split; adapted from "Temporal Data Warehousing: Business Cases and Solutions", J. Eder et al.]


    Issues: History, Validity and Succession of Values

    Dimensional values to be tracked over time must have

    1. a unique, invariant, not-to-be-reused identifier for the concept that the value represents
       e.g. an identifier for the country first named Zaire and later Congo
    2. a validity period indicating the overall lifetime of the concept which the value represents
       e.g. the lifetime of the country first named Zaire and later Congo
    3. validity periods indicating the lifetime of the values used to represent the concept
       e.g. the lifetimes of the names Zaire and Congo
    4. invalid dimensional values must have another dimensional value as successor
       e.g., East Germany is succeeded by Germany


    Unique Identifier (issue 1)

    DB2 Colloquium, 2006-10-25


    Succession of Dimensional Values (issue 4)


    Succession of Dimensional Values (issue 4)

    Step 3: Reassemble parts


    Succession of Dimensional Values (issue 4)

    SQL Statement to do all 3 steps

    SELECT COALESCE(s.CurrId, p.CountryId) AS CountryId
         , p.Year
         , SUM(p.Population) AS Population
    FROM CountryPopulation p
    LEFT OUTER JOIN CountrySuccession s ON s.Id = p.CountryId
    GROUP BY COALESCE(s.CurrId, p.CountryId), p.Year
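The succession mapping can be exercised with SQLite from Python. The sketch below is an illustration with invented identifiers and rounded, made-up population figures (in millions); it groups by the mapped identifier so that the figures of a predecessor are added to those of its ultimate successor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table CountryPopulation (CountryId text, Year integer, Population integer);
create table CountrySuccession (Id text, SuccId text, CurrId text);
-- 'DD' (East Germany) is succeeded by 'DE'; 'CH' has no successor entry.
insert into CountryPopulation values ('DD', 1989, 16);
insert into CountryPopulation values ('DE', 1989, 62);
insert into CountryPopulation values ('DE', 1991, 80);
insert into CountryPopulation values ('CH', 1989, 7);
insert into CountrySuccession values ('DD', 'DE', 'DE');
""")

rows = conn.execute("""
select coalesce(s.CurrId, p.CountryId) as CountryId,
       p.Year,
       sum(p.Population) as Population
from CountryPopulation p
left outer join CountrySuccession s on s.Id = p.CountryId
group by coalesce(s.CurrId, p.CountryId), p.Year
order by 1, 2
""").fetchall()
```

The 1989 figures of 'DD' and 'DE' end up in a single row for 'DE', while countries without a succession entry pass through unchanged thanks to the outer join.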


    Side Issue: Difficulties with the Split Operation

    Example

    - measures population and GNP (gross national product) have been collected for Czechoslovakia up to 1992
    - as of 1993, the same measures are collected for the Czech Republic and Slovakia

    Possible solutions
    - from 1993 on, keep Czechoslovakia and compute its population and GNP figures by summing the figures of the Czech Republic and Slovakia
    - up to 1992, compute approximate percentages of the population and GNP figures of Czechoslovakia for the Czech Republic and Slovakia
      note: in general, the percentages of the various measures are not identical
    - leave countries as is and perform no mapping in either direction
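The first two options can be sketched in Python. Everything here is illustrative: the function names `forward_sum` and `backcast_split`, the identifiers, the shares, and the figures are made up, and in practice each measure would need its own estimated shares.

```python
def forward_sum(successor_histories):
    """Keep the predecessor alive by summing the successors' values per year."""
    totals = {}
    for history in successor_histories.values():
        for year, value in history.items():
            totals[year] = totals.get(year, 0) + value
    return totals

def backcast_split(history, shares):
    """Distribute each year's value of the predecessor over its successors.

    history: {year: value} for the predecessor
    shares:  {successor: fraction}; estimated fractions that must sum to 1
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {
        successor: {year: value * share for year, value in history.items()}
        for successor, share in shares.items()
    }

# Made-up numbers, chosen only to keep the arithmetic easy to follow.
czechoslovakia_from_1993 = forward_sum({"CZ": {1993: 10}, "SK": {1993: 5}})
successors_up_to_1992 = backcast_split({1992: 16.0}, {"CZ": 0.75, "SK": 0.25})
```

Summing forward is exact, whereas the backward split only yields approximations whose quality depends on the chosen shares.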


    Handling Splits (Sketch) (issue 4)

    Step 1: Aggregate over Taxonomy

    Step 2: Extrapolate


    Lifecycle of Concepts (issue 2)

    [Figure: state diagram with the states Active, Inactive, and Superseded; transitions are labeled "Start of validity", "Introduction as Inactive", "Move", and "define successor".]

    - Active: can be used to book new business and appear on reports
    - Inactive: can appear on reports but cannot be used to book new business
    - Superseded: cannot appear on reports nor be used to book new business


    Validity (Lifetime) of Concepts (issue 2)


    Validity (Lifetime) of Names of Concepts (issue 3)


    Modified Star Schema Design

    Principle

    Add valid times in dimensions in the Data Warehouse using

    - an object table (Country)

    - a single property table (here: CountryNames)

    both with an associated valid time interval.

    Let foreign keys in fact tables refer to the unchanging ID in object tables.

    Generate standard Data Marts from this data model as needed, most often a history of measures according to the current dimensional structure.

    Population (fact table)
    - CountryId
    - Year

    Country (object table)
    - CountryId
    - VTimBeg
    - VTimEnd

    Year

    CountrySuccession
    - Id      -- original identifier
    - SuccId  -- direct successor
    - CurrId  -- ultimate successor

    CountryNames (property table)
    - CountryId
    - VTimBeg
    - VTimEnd
    - CountryName
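A minimal sketch of this design, using Python's built-in sqlite3 module, might look as follows. Table and column names follow the slide. Everything else is an assumption made for illustration: the measure column Persons, valid times stored as years in a half-open interval [VTimBeg, VTimEnd) with 9999 meaning "until further notice", the invented countries A-land, B-land and AB-land, and all figures. The query generates the most common data mart view, a history of the measure according to the current dimensional structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Country (
    CountryId INTEGER PRIMARY KEY,
    VTimBeg   INTEGER NOT NULL,
    VTimEnd   INTEGER NOT NULL
);
CREATE TABLE CountryNames (
    CountryId   INTEGER NOT NULL REFERENCES Country(CountryId),
    VTimBeg     INTEGER NOT NULL,
    VTimEnd     INTEGER NOT NULL,
    CountryName TEXT NOT NULL,
    PRIMARY KEY (CountryId, VTimBeg)
);
CREATE TABLE CountrySuccession (
    Id     INTEGER PRIMARY KEY REFERENCES Country(CountryId),
    SuccId INTEGER REFERENCES Country(CountryId),
    CurrId INTEGER NOT NULL REFERENCES Country(CountryId)
);
CREATE TABLE Population (
    CountryId INTEGER NOT NULL REFERENCES Country(CountryId),
    Year      INTEGER NOT NULL,
    Persons   INTEGER NOT NULL,  -- assumed name of the measure column
    PRIMARY KEY (CountryId, Year)
);
-- Invented example: countries 1 and 2 merge into country 3 in 1990.
INSERT INTO Country VALUES (1, 1950, 1990), (2, 1950, 1990), (3, 1990, 9999);
INSERT INTO CountryNames VALUES
    (1, 1950, 1990, 'A-land'), (2, 1950, 1990, 'B-land'), (3, 1990, 9999, 'AB-land');
INSERT INTO CountrySuccession VALUES (1, 3, 3), (2, 3, 3);
INSERT INTO Population VALUES (1, 1989, 10), (2, 1989, 5), (3, 1990, 15), (3, 1991, 16);
""")

CURRENT_STRUCTURE_SQL = """
SELECT COALESCE(s.CurrId, p.CountryId) AS CurrCountryId,
       n.CountryName,
       p.Year,
       SUM(p.Persons) AS Persons
FROM Population p
LEFT JOIN CountrySuccession s ON s.Id = p.CountryId
JOIN CountryNames n
  ON n.CountryId = COALESCE(s.CurrId, p.CountryId)
 AND n.VTimBeg <= :as_of AND :as_of < n.VTimEnd
GROUP BY CurrCountryId, n.CountryName, p.Year
ORDER BY p.Year
"""


def data_mart_rows(connection, as_of):
    """History of the measure, restated in the dimensional structure valid in year as_of."""
    return connection.execute(CURRENT_STRUCTURE_SQL, {"as_of": as_of}).fetchall()


rows = data_mart_rows(conn, 2011)
```

Because the facts refer to the unchanging CountryId, the same tables could also feed a data mart that reports each fact under the structure valid at the time of the fact. Note that CurrId presupposes a unique ultimate successor, which holds for merges and renamings but not for splits.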


    Coping with a Distributed Environment (Teaser)

    [Architecture diagram; recoverable labels:]

    Master Data Stores (e.g., MDM, CRM, ForEx, Geo DB): identifiers, dimensional attributes

    Transactional Data Stores (e.g., Claims and Underwriting Systems): additional identifiers, measures tied to ref data

    Integration Data Stores: History Stores (DWh), Exchange Stores (ODS)

    Analytical Data Stores

    Flow of Master Data (e.g. Dimension Attributes + Values)

    Flow of Transactional Data

    Note: Of course, in a global enterprise, all of this happens in a distributed environment


    Kimball's Types of Slowly Changing Dimensions

    Ralph Kimball proposed 3 (well, actually only 2) "poor man's" solutions to the historization of dimensions ("slowly changing dimensions", SCD) in the context of the Star Schema.

    SCD Type 1: no history of the dimensional attribute is needed
    => simply overwrite the value
    e.g. the correction of mistakes in names, birthdays etc.

    SCD Type 2: versions of some dimensional attributes are needed
    => store new records in the dimension table, with a new DWh identifier (ID), the existing stable source system ID, and the new (changed) values
    e.g. a change in the rating of a client, or the new business division a profit center belongs to

    SCD Type 3: current and original (or previous) versions are needed
    => introduce a current and original attribute in the dimension table
    e.g. the current rating and the original rating of each client
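The three types can be contrasted on a toy client dimension held in plain Python structures. This is a sketch under assumptions, not Kimball's reference implementation: the row layout, the key names (dwh_id, source_id), the counter used for DWh identifiers and the sample client are all hypothetical.

```python
from itertools import count

_next_dwh_id = count(1)  # assumption: DWh identifiers come from a simple counter


def scd1_overwrite(dimension, source_id, attribute, value):
    """Type 1: overwrite the value in every row of the object; no history is kept."""
    for row in dimension:
        if row["source_id"] == source_id:
            row[attribute] = value


def scd2_add_version(dimension, source_id, changes):
    """Type 2: append a new row with a new DWh ID and the same source system ID."""
    versions = [row for row in dimension if row["source_id"] == source_id]
    new_row = {**versions[-1], **changes, "dwh_id": next(_next_dwh_id)}
    dimension.append(new_row)
    return new_row["dwh_id"]


def scd3_set_current(row, attribute, value):
    """Type 3: change only the current attribute; the original one stays untouched."""
    row[f"current_{attribute}"] = value


# Type 1: correct a misspelled name in place.
clients = [{"dwh_id": next(_next_dwh_id), "source_id": "C-17", "name": "Acme Ldt", "rating": "A"}]
scd1_overwrite(clients, "C-17", "name", "Acme Ltd")

# Type 2: a rating change creates a second version of the same client.
new_id = scd2_add_version(clients, "C-17", {"rating": "B"})

# Type 3: one row carries both the original and the current rating.
client_type3 = {"source_id": "C-17", "original_rating": "A", "current_rating": "A"}
scd3_set_current(client_type3, "rating", "B")
```

After the Type 2 step, new facts would refer to the new DWh ID while old facts keep pointing at the old one. Nothing in the dimension itself records when the change took effect.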


    Slowly Changing Dimensions Type 1

    Pros

    Simple to understand for business users and simple to implement (especially when using MOLAP tools)

    Requires the least space and has the best response time

    Cons

    Simplicity for business users is deceiving

    A change in a dimensional attribute effectively changes the context for all facts captured prior to the change
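The last point is easy to demonstrate. In the following sketch (profit center, divisions and revenue figures are invented for illustration), the same facts are reported twice, once before and once after a Type 1 overwrite of the division a profit center belongs to.

```python
from collections import defaultdict


def revenue_by_division(facts, profit_centers):
    """Total revenue per (year, division), using the division currently stored."""
    totals = defaultdict(int)
    for fact in facts:
        division = profit_centers[fact["profit_center"]]["division"]
        totals[(fact["year"], division)] += fact["revenue"]
    return dict(totals)


# Invented dimension and facts.
profit_centers = {"PC1": {"division": "Retail"}}
facts = [
    {"profit_center": "PC1", "year": 2009, "revenue": 100},
    {"profit_center": "PC1", "year": 2010, "revenue": 120},
]

report_before = revenue_by_division(facts, profit_centers)

# SCD Type 1: the reassignment simply overwrites the old division.
profit_centers["PC1"]["division"] = "Wholesale"

report_after = revenue_by_division(facts, profit_centers)
```

No fact row was touched, yet the revenue of 2009 and 2010 now appears under the new division, and the earlier report can no longer be reproduced from the warehouse.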


    Slowly Changing Dimensions Type 2

    Pros

    Reasonably understandable and simple to implement (regardless of MOLAP / ROLAP tool)

    Captures parts of the history

    Cons

    The time of a change in a dimension is not captured

    Requires more space since a single dimensional object is possibly represented in several rows (but this is usually not an issue)

    Can be confusing since changed dimensional data objects appear more than once, with identical source system IDs, but at least one changed attribute value

    Checking when it is ok to refer to which DWh IDs is not possible


    Literature

    General Temporal Database Concepts

    [Snodgrass 1999] Richard T. Snodgrass: Developing Time-Oriented Database Applications. Morgan Kaufmann, 1999. (see http://www.cs.arizona.edu/people/rts/publications.html)

    [Zhou et al 2006] Xin Zhou, Fusheng Wang, Carlo Zaniolo: Efficient Temporal Coalescing Query Support in Relational Database Systems. Proc. 17th International Conference on Database and Expert Systems Applications - DEXA '06, 2006.

    [Johnston & Weis 2010] Tom Johnston, Randall Weis: Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. Morgan Kaufmann, 2010.

    Data Warehouse Design

    [Kimball & Ross 2002] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition. John Wiley, 2002.

    [Imhoff et al 2003] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger: Mastering Data Warehouse Design: Relational and Dimensional Techniques. John Wiley, 2003.

    [Golfarelli & Rizzi 2009] Matteo Golfarelli, Stefano Rizzi: Data Warehouse Design: Modern Principles and Methodologies. McGraw Hill, 2009.

    [Adamson 2010] Christopher Adamson: Star Schema: The Complete Reference. McGraw Hill, 2010.