Data Vault: Data Warehouse Design Goes Agile



DESCRIPTION

Data Warehouse (especially EDW) design needs to get Agile. This whitepaper introduces Data Vault to newcomers, and describes how it adds agility to DW best practices.

TRANSCRIPT

Data Vault: Data Warehouse Design Goes Agile

DecisionLab | http://www.decisionlab.net | [email protected] | direct 760.525.3268
http://blog.decisionlab.net | Carlsbad, California, USA


Whitepaper

Data Vault: Data Warehouse Design Goes Agile

by Daniel Upton
Business Intelligence / Analytics Developer
Certified Scrum Master

DecisionLab.Net | business intelligence is business performance

[email protected] | http://www.linkedin.com/in/DanielUpton


Open Question: When we begin considering a new Data Warehouse initiative, how clear is the scope?

If we intend to design Data Marts, and we have no specified need for a data warehouse either to become a system of record or to support Master Data Management (MDM), then we may choose Dr. Ralph Kimball’s Data Warehouse Bus architecture, designing a library of conformed (standardized, re-usable) dimension and fact tables for deployment into a series of purpose-built data marts. Under these requirements, we may have no specific need for an Inmon-style third-normal form (3nf) Enterprise Data Warehouse (EDW) in general, or for a Data Vault in particular. In other cases, however, data warehouse data outlives its corresponding source data inside a soon-to-retire application database, and so, like it or not, the data warehouse may, as Bill Inmon reminds us, assume a system-of-record role for its data. Because the Kimball Bus architecture’s tables are often not related via key fields, and in fact may not be populated at all until deployment from the Bus into a specific-needs Data Mart, Kimball adherents rarely assert a system-of-record role for their solutions.

But suppose we do determine that our required solution either needs to assume a system-of-record role, or perhaps must support Master Data Management. We may then elect to design a fully functional EDW, rather than Kimball’s DW Bus, so that the EDW itself, and not just its dependent data marts, is a working, populated database. Knowing that the creation of a classic EDW, with its requirement for an up-front, enterprise-wide design, is a challenge given today’s expectations for rapid delivery, some may be curious whether new design methodologies offer ways to accelerate EDW design. Data Vault, a data warehouse modeling method with a substantial following in Denmark and a growing base in the U.S., offers specific and important benefits.

To set expectations early about Data Vault, readers must understand that, somewhat unlike a traditional EDW, and utterly unlike a star schema, a Data Vault (not to be confused with a Business Data Vault, which is not addressed in this article) cannot serve as an efficient presentation layer appropriate for direct queries. Rather, it is more like a historic enterprise data staging repository that, with additional downstream ETL, will support not only star-schema reporting and data mining, but also master data management, data quality, and other enterprise data initiatives.


Data Vault Benefits:

Benefit #1: Allows loading of a history-tracking DW with little or none of the typical extraction, transformation, and loading (ETL) logic that, once finally worked out, would otherwise embed subjective interpretations of the data while purportedly enhancing it and preparing it for reporting or analytics.

o In my view, this is almost enough of a benefit all by itself. As such, in my introduction that follows, I will focus on proving this point.

o Agile Win: Confidently loading a DW without already knowing the fine details of business rules, requirements, and the resulting transformations means that loading of historical and incremental data can be accomplished before the first target database design (3nf EDW or Data Mart) is complete.

Benefit #2: Insofar as Data Vault prescribes a very generic ‘de-constructing’ of OLTP tables, these de-constructing transformations can be automated, and so can the associated early-stage ETL into the Data Vault (see the sketch after this list). Since, as you’ll soon see, Data Vault causes a substantial increase in the number of tables, this automation potential is a substantial benefit.

o Agile Win: Automated initial design and loading, anyone?

Benefit #3: Due to Data Vault’s generic design logic, its use of surrogate keys (more on this soon), and its prescription to avoid subjective-interpretive transformations, it is reasonable to quickly load a Data Vault with just the needed subset of tables.

o Agile Win: More frequent releases. Quickly design for, and load, only the data needed for the next release. Use the same generic design to load other tables when those User Stories from the Product Backlog get placed into a Sprint.
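To make Benefit #2’s automation claim concrete, here is a minimal sketch, not any official Data Vault tool: given hypothetical metadata for one source table (entity name, business key, non-key attributes), it emits Hub and Satellite DDL using the _h / _h_s naming convention from this paper’s diagrams. Column names such as Client_BK and load_dts are illustrative assumptions.

```python
# A minimal sketch of Data Vault design automation; metadata shape and
# column names are hypothetical, not part of any Data Vault standard.

def hub_and_satellite_ddl(entity, business_key, attributes):
    """Generate CREATE TABLE statements for one source table's Hub and Satellite."""
    hub = (
        f"CREATE TABLE {entity}_h (\n"
        f"    {entity}_sk    INTEGER PRIMARY KEY,  -- new surrogate key\n"
        f"    {business_key} TEXT NOT NULL,        -- business key, demoted from PK\n"
        f"    load_dts       TEXT NOT NULL\n"
        f");"
    )
    attr_cols = "".join(f"    {a} TEXT,\n" for a in attributes)
    sat = (
        f"CREATE TABLE {entity}_h_s (\n"
        f"    {entity}_sk INTEGER NOT NULL REFERENCES {entity}_h,\n"
        f"    load_dts    TEXT NOT NULL,  -- one row per detected change\n"
        f"{attr_cols}"
        f"    PRIMARY KEY ({entity}_sk, load_dts)\n"
        f");"
    )
    return hub + "\n" + sat

# The Client table from Diagram A, with illustrative column names:
print(hub_and_satellite_ddl("Client", "Client_BK", ["Name", "Email"]))
```

Run over every source table, a generator like this produces the entire initial Data Vault design, which is exactly why the table explosion described later is less costly than it first appears.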

In the remainder of this article, I will provide a high-level introduction to Data Vault, with primary emphasis on how it achieves Benefit #1.


High-Level Introduction to Data Vault Methodology:

We begin with a simple OLTP database design for clients purchasing products from a company’s stores. For simplicity, I include only a minimum of fields. In the diagrams, ‘BK’ means business key, ‘FK’ means foreign key. Refer to Diagram A below.

As is common, this simple OLTP schema does not use surrogate keys. If a client gets a new email address, or a product gets a new name, or a city’s re-mapping of boundary lines suddenly places an existing store in a new city, new values overwrite the old values, which are then lost. Of course, in order to preserve history, history-tracking surrogate keys are commonly used by practitioners of both Bill Inmon’s classic third-normal form (3nf) EDW design and Dr. Ralph Kimball’s Star Schema method, but both methods prescribe surrogate keys within the context of data transformations that also include subjective interpretation (herein simply ‘subjective transformation’) intended to cleanse or purportedly enhance the data for integration, reporting, or analytics. Data Vault purists claim that any such subjective transformation of line-of-business data introduces inappropriate distortion, thereby disqualifying the Data Warehouse as a system of record.

Data Vault, importantly, provides a unique way to track historical changes in source data while eliminating most, or all, subjective transformations such as field renaming, selective data-quality filters, establishment of hierarchies, calculated fields, and target values. Analytics-driven, subjective transformations can still be applied, but they are applied downstream of the Data Vault EDW, as subsequent transformations for loads into data marts designed to analyze specific processes. Back upstream, the Data Vault accomplishes historic change-tracking using a generic table-deconstructing approach that I will now describe. Before beginning, I recommend against too quickly comparing this method with others, like star-schema design, that serve different needs.
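The overwrite problem is easy to demonstrate. Here is a tiny, runnable illustration (in-memory SQLite, with illustrative table and column names, not the actual Diagram A schema): without surrogate keys, an update simply destroys the old value.

```python
# A minimal sketch of history loss in a plain OLTP table; names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Client (Client_BK TEXT PRIMARY KEY, Email TEXT)")
con.execute("INSERT INTO Client VALUES ('C-1001', 'old@address')")
# The client changes their email; the OLTP application just overwrites it.
con.execute("UPDATE Client SET Email = 'new@address' WHERE Client_BK = 'C-1001'")
print(con.execute("SELECT * FROM Client").fetchall())
# [('C-1001', 'new@address')] -- the old address is gone; no history anywhere
```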


Diagram A: Simple OLTP schema (data source for a Data Vault)


Fundamentally, Data Vault prescribes three types of tables: Hubs, Satellites, and Links. The diagram’s Client table serves as a good example. Hubs work according to the following simplified description:

Hub Tables:

Define the granularity of an entity (e.g., product), and thus the granularity of non-key attributes (e.g., product description) within the entity.

Contain a new surrogate primary key (PK), as well as the source table’s business key, which is demoted from its PK role.

Satellite Tables:

Contain all non-key fields (attributes), plus a set of date-stamp fields.

Contain, as a Foreign Key (FK), the Hub’s PK, plus load date-time stamps.

Have a defining, dependent entity relationship to one, and only one, parent table. Whether that parent table is a Hub or a Link, the Satellite holds the non-key fields from the parent table.

Although on initial loads only one Satellite row will exist for each corresponding Hub row, whenever a non-key attribute changes (e.g., a client’s email address changes) upstream in the OLTP schema (often accomplished there with a simple over-write), a new row is added only to the Satellite, and not to the Hub, which is why many Satellite rows relate to one Hub row. In this fashion, historic changes within source tables are gracefully tracked in the EDW; a minimal sketch follows.
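The sketch below makes the Satellite behavior concrete, assuming SQLite and illustrative names echoing Diagram B; it is not prescribed Data Vault loading code. The same upstream overwrite that destroyed history in the earlier example now becomes a new Satellite row, while the Hub row and all earlier Satellite rows are untouched.

```python
# A minimal, runnable sketch of Satellite change-tracking; sample data is
# illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Client_h   (Client_sk INTEGER PRIMARY KEY, Client_BK TEXT, load_dts TEXT);
CREATE TABLE Client_h_s (Client_sk INTEGER, load_dts TEXT, Email TEXT,
                         PRIMARY KEY (Client_sk, load_dts));
INSERT INTO Client_h   VALUES (1, 'C-1001', '2014-01-01');
INSERT INTO Client_h_s VALUES (1, '2014-01-01', 'old@address');
""")

def load_satellite(client_sk, email, load_dts):
    """Add a Satellite row only when the incoming attribute differs from the
    most recent row for this Hub key: an overwrite upstream becomes history here."""
    row = con.execute(
        "SELECT Email FROM Client_h_s WHERE Client_sk = ? "
        "ORDER BY load_dts DESC LIMIT 1", (client_sk,)).fetchone()
    if row is None or row[0] != email:
        con.execute("INSERT INTO Client_h_s VALUES (?, ?, ?)",
                    (client_sk, load_dts, email))

load_satellite(1, 'new@address', '2014-06-01')  # changed: second row added
load_satellite(1, 'new@address', '2014-06-02')  # unchanged: nothing added
print(con.execute("SELECT * FROM Client_h_s ORDER BY load_dts").fetchall())
# Two rows: full history preserved, Hub untouched.
```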

Notice in Diagram B that, among other tables, the Client_h_s Satellite table is dependent on the Client_h Hub table, but that, at this stage in our design, the Client_h Hub is not yet related to the Order_h Hub. When we add Links, those relationships will appear. But first, have a look at the tables, the new locations of existing fields, and the various added date-time stamps.


Diagram B: Hubs and Satellites in a partially-designed Data Vault schema


Link Tables:

Refer to Diagram C.

Relate exactly two Hub tables together.

Contain, now as non-key values, the primary keys of the two Hubs, plus their own surrogate PK.

As with an ordinary association table, a Link is a child to two other tables and, as such, is able to gracefully handle relative changes in cardinality between the two tables and, where necessary, can directly resolve many-to-many relationships that might otherwise cause a show-stopper error in the data-loading process.

Unlike an ordinary association table, the Link table, with its own surrogate PK, is able to track historic changes in the relationship itself between the two Hubs, and thus between their two directly-related OLTP source tables. Specifically, all loaded data that conformed with the initial cardinality between tables shares the same Link table surrogate key, but if an unexpected future source data change causes a cardinality reversal (so that the one becomes the many, and vice versa), a new row, with a new surrogate key, is generated to capture the new relationship while the original surrogate key preserves the historical one. Slick!

In a more sophisticated Data Vault schema than this one, we might go further by adding load_date and load_date_end date-stamp fields to Link tables, too. As an (admittedly strange) example, the Order_Store_l Link table might conceivably get date-time stamp fields so that, in coordination with its surrogate PK, an Order (perhaps for a long-running service) that, after the Order Date, gets re-credited to a different store can be efficiently tracked over time (a small sketch follows).
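Here is a sketch of that Link-with-date-stamps idea, again in SQLite with illustrative names: re-crediting the order to a different store is captured as a new Link row with its own surrogate key, so both the original and the new relationship survive.

```python
# A minimal sketch of relationship-change tracking in a Link table.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Order_Store_l (
    Order_Store_sk INTEGER PRIMARY KEY,  -- the Link's own surrogate key
    Order_sk       INTEGER NOT NULL,     -- Hub PKs, now held as non-key values
    Store_sk       INTEGER NOT NULL,
    load_dts       TEXT NOT NULL
);
-- Order 7 originally credited to store 3 ...
INSERT INTO Order_Store_l (Order_sk, Store_sk, load_dts) VALUES (7, 3, '2014-01-01');
-- ... and later re-credited to store 5: a new row, not an overwrite.
INSERT INTO Order_Store_l (Order_sk, Store_sk, load_dts) VALUES (7, 5, '2014-07-01');
""")
print(con.execute("SELECT * FROM Order_Store_l ORDER BY load_dts").fetchall())
# [(1, 7, 3, '2014-01-01'), (2, 7, 5, '2014-07-01')] -- history preserved
```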


Diagram C: Completed Data Vault Schema (Link tables added)


Now we’ve added Link tables. After scanning Diagram C, go back and compare it with Diagram A and note the movement of the various non-key attributes. Undoubtedly, you will also notice, and may be concerned, that the source schema’s five tables just morphed into the Data Vault’s twelve. Importantly, note that Diagram A’s Details table was transformed not into a Hub-and-Satellite combination, but rather into a Link table. When you consider that an order detail record (a line item) is really just the association between an Order and a Product (albeit an association with plenty of vital associated data), it makes sense that the Details_l Link table was created. This Link table, whose sole purpose is to relate the Orders_h and Products_h tables, of course also needs a Details_l_s Satellite table to hold its vital non-key attributes, Quantity and Unit Price.

The Data Vault method does allow for some interpretation here. You might now be thinking, “Aha! So we haven’t eliminated all subjective interpretation!” Perhaps not, but what I’ve described here is a pretty small, generic interpretation. Either way, in this situation it would not be patently wrong to design a Details_h Hub table (plus, of course, a Details_h_s Satellite) rather than the Details_l Link. Indeed, if we used very simple Data Vault design-automation logic, which simply de-constructs all tables into Hub-and-Satellite pairs, this is what we would get. However, keep in mind that if we did that, we would then have to create not one but two Link tables, specifically an Order_Order_Details_l Link table and a Product_Order_Details_l Link table, to connect our tables, and these tables would contain no attributes of apparent value. Therefore, we choose the design that leaves us with a simpler, more efficient Data Vault. By the way, this logic can easily be automated, as the sketch below suggests, though a detailed treatment is beyond the scope of this article.
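A minimal sketch of that design-automation rule, assuming we have per-table metadata listing key columns and foreign keys (a hypothetical shape, not a prescribed Data Vault interface): a table whose key is nothing more than two foreign keys is a pure association and becomes a Link; every other table becomes a Hub-and-Satellite pair.

```python
# A minimal sketch of the Link-vs-Hub classification rule; metadata is hypothetical.

def classify(key_columns, foreign_keys):
    """Return 'link' or 'hub' for one source table, by key structure alone."""
    if len(foreign_keys) == 2 and set(key_columns) <= set(foreign_keys):
        return "link"  # pure association: two FKs make up the key
    return "hub"       # everything else gets a Hub and a Satellite

# The Details table from Diagram A: keyed by its Order and Product FKs -> a Link,
# with Quantity and Unit Price landing in its Details_l_s Satellite.
print(classify({"Order_FK", "Product_FK"}, {"Order_FK", "Product_FK"}))  # link
print(classify({"Client_BK"}, set()))                                    # hub
```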


Conclusion:

Our discussion of Data Vault opened with the idea that an EDW should load and store historical data without applying any transformations that contain subjective interpretation of data or business rules, because those interpretations, even if appropriate for specific reporting or analytics, modify line-of-business data and therefore introduce distortions into operational data. Those interpretive transformations should occur downstream, during ETL into presentation-layer tables.

Although Data Vault does, in fact, apply a specific set of generic ‘de-construction’ transformations, these transformations contain little or no subjective interpretation of business rules. They do, however, allow it to (1) apply an appropriate level of referential integrity to source data even where the source system may lack it now or in the future; (2) gracefully capture historical data changes, within and between tables, without endangering the success of the data load; and (3) support loading of data from a subset of source tables initially, and then load, or not load, other related source tables much later without compromising the EDW’s referential integrity.

Lastly, and very importantly, (4) Data Vault design, and the associated Data Vault loading ETL, which is largely generic from one data set to another, can be automated, and thus radically accelerated in development. Although the logic of this automation flows from the simplicity of Data Vault design, a detailed automation discussion is beyond the scope of this article.

In closing, if we can automatically design and load a Data Warehouse (albeit not its presentation layer), we free up brain cells for the higher-order logic of designing the presentation layer and the intensive, custom ETL that loads it. As described here, all of this can be accomplished simultaneously.

Daniel Upton | [email protected]

DecisionLab.Net | business intelligence is business performance

DecisionLab.Net Range of Services Offered:

Data Warehousing / Business Intelligence Technical Implementation

Estimation / Business Requirements / Feasibility Analysis

Data Warehouse/Mart Logical Design and Development

Multi-Dimensional Cubes w/ SQL Server Analysis Services (SSAS): fewer, faster cubes; more granular, more comprehensive, and more integrated for extreme query-ability

Custom Multi-Dimensional Expressions (MDX)

Dashboard Development: SharePoint, Excel, PerformancePoint, Tableau

Report Development: MS Reporting Services (SSRS)

Daniel Upton | DecisionLab | http://www.decisionlab.net | [email protected] | Direct 760.525.3268
http://blog.decisionlab.net | Carlsbad, California, USA