![Page 1: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/1.jpg)
Next Generation DWH Modeling
An overview of DWH modeling methods www.grundsatzlich-it.nl Ronald Kunenborg |
![Page 2: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/2.jpg)
Topics
• Where do we stand today
• Data storage and modeling through the ages
• Current data warehouse modeling methods
• Where next?
• Food for thought
![Page 3: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/3.jpg)
Where do we stand today
Fun facts from a 2012 TDWI-survey by Kalido:
64% need more than 1 month to integrate new data or integrate changes
24% spend more than $1,000,000 annually
31% employ more than 20 people to maintain the BI environment
85% agree IT can't keep up with business
Only 56% agree that IT and business are aligned
We need to do better…
![Page 4: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/4.jpg)
Where do we stand today
What do we need in order to be successful?
Have a business-oriented view on the data
Have good performance for our queries
Integrate different sources with ease
Be able to implement changes to reports quickly
Be able to handle ‘business changing its mind’
Handle large data volumes at low cost
“Deliver cheap reports in business terms, fast!”
Our data models and practices need to support this. How are we doing?
![Page 5: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/5.jpg)
Where do we stand today
2
3
5
0
1
2
3
4
5
6
7
8
9
10
Anchor modeling Data Vault Kimball
Modeling methods used
![Page 6: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/6.jpg)
Where do we stand today
2 2
1
0
1
2
3
4
5
6
7
8
9
10
SQL Server Oracle MySQL
Databases used
![Page 7: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/7.jpg)
Where do we stand today
2
1
2
1
0
1
2
3
4
5
SSIS Pentaho BODS BODS + Custom code (PL/SQL)
ETL-tools used
![Page 8: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/8.jpg)
Where do we stand today
2
0
1 1
0
1
0 0 0
1
2
3
4
5
SSIS Pentaho BODS BODS/Custom code (PL/SQL)
ETL: tool vs method (generate script or run from tool)
Generate
Run from tool
![Page 9: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/9.jpg)
Where do we stand today
5
0 0
1
2
3
4
5
6
7
8
9
10
Handcrafted Metadata
Kimball Kimball
Building datamarts: manual or automated
![Page 10: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/10.jpg)
Where do we stand today
4 2 0,1 8 0,1 0 8
16 24 32 40 48 56 64 72 80 88 96
104 112 120 128 136 144 152 160 168 176 184 192 200
ETL: hours to add entity / rebuild ETL
Apply business rules differently
Change ETL
Add new entity
![Page 11: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/11.jpg)
Where do we stand today
0
1
2
3
4
5
6
7
8
9
Add new entity
![Page 12: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/12.jpg)
Where do we stand today
So where does this leave us?
We can automate the simple data logistics
We can automate the historical storage area and save time on tedious ETL
Building data marts is still a manual process
(Complex) business rules are still hard
We satisfy some requirements. Can we do better?
![Page 13: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/13.jpg)
Where do we stand today
If we want to understand how we got here – we need to understand our past…
Why did we start with data modeling in the first place?
Pro’s & cons of older methods
Are data models really important?
![Page 14: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/14.jpg)
Data storage through the ages
Sumeria, 3400 BC. The dawn of civilization, humanity grows and starts to live in cities
Records needs to be kept about food stores. It’s no longer possible to just remember everything…
This is a “big data” issue
A list of temple gifts (Cuneiform)
![Page 15: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/15.jpg)
Data storage through the ages
Early capitalism sees a growing number of transactions for bankers: “big data” issues appear
In 1494, bookkeeping changes:
Information about subject of transaction is separated from transaction details: normalization appears
Now we can change account information (balance) without changing transaction details
![Page 16: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/16.jpg)
Data storage through the ages
Punched cards appear around 1725
First used for information storage in 1832 as information grows
Expanded for the 1890 US census
Enabled aggregations on a very large set of detail data for first time in history
Limited by machine speed
Early form of dimensional modeling
![Page 17: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/17.jpg)
Data storage through the ages
After 1952, magnetic tape (electronic) starts to slowly replace punch cards (mechanical)
Fast, huge storage (100 GB in 1974)
Still only sequential
Punch cards are completely gone by end of the 1970’s
IBM 726 vacuum-column dual tapedrives
![Page 18: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/18.jpg)
Data storage through the ages - conclusion
Hard disks are introduced around 1960 and change the entire field
Enabled RANDOM data access and Online Transaction Processing (OLTP)
Requirements differ from batch processing
Databases & data modeling appear IBM 350 RAMAC (3,75 MB)
![Page 19: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/19.jpg)
Data modeling through the ages
With random access, data is no longer “flat records”
Data modelers organize the data on permanent storage to achieve the desired purposes
Data models present the data and its relationships in a graphical format for better understanding
What are our *logical* modeling “Lego”-blocks?
KEY RELATIONSHIP ATTRIBUTE CONCEPT DATABASE “TABLE”
![Page 20: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/20.jpg)
Data modeling through the ages
The hierarchical model (1960):
“Give me all the equipment this person has” – done in a single query – no set? no result!
Person (parent)
Set of equipment records (children)
Monitors
Laptops
Monitors
![Page 21: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/21.jpg)
Data modeling through the ages
Late 70’s: Relational databases appear (Oracle)
Data is stored as tables and their relationships
Queries are declarative: say WHAT, not HOW
Separating code and database makes building and maintaining software (much) cheaper
The relational model replaces older database models by the end of the 80’s
Now we can do more queries with less people
![Page 22: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/22.jpg)
Data modeling through the ages
Standard relational modeling focuses on speed and consistency of data through normalization
This works well for transactional systems
Easy to retrieve and update single items by key
Queries over large datasets are slow
Equipment
Person
Monitors
Laptops
![Page 23: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/23.jpg)
Data modeling through the ages
Decision support systems appear during the 80’s. The term “Data Warehouse” is coined by Bill Inmon.
At first built using normalized data models
But:
Unpredictable queries on large datasets are slow
Changes to source systems are hard to process and cascade into multiple DWH changes
Integrating different source systems is quite difficult
![Page 24: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/24.jpg)
Data modeling through the ages
In the 90’s we see the Dimensional model
Concepts are stored in dimension-tables and relationships in fact-tables
Easy to understand, build and query
Less joins means faster queries
PERSON EQUIPMENT OWNERSHIP
![Page 25: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/25.jpg)
Data modeling through the ages
Over the years the model degrades
performance, understanding and maintainability suffer – often, a rebuild follows
PERSON EQUIPMENT
OWNERSHIP
![Page 26: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/26.jpg)
Data modeling through the ages
In the early 00’s Dan Linstedt publishes Data Vault:
Business keys in Hubs
business relationships in Links, (historical) attributes in Satellites
Rules to maintain integrity of the data model
PERSON EQUIPMENT Ownership
![Page 27: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/27.jpg)
Data modeling through the ages
Data Vault advantages:
Resilient to change
Better load performance
Easy to automate (except business rules)
Data Vault disadvantages:
Complex queries
Lower query performance
Misinterpretation of rules may be an issue
![Page 28: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/28.jpg)
Data modeling through the ages In the early 00’s Anchor Modeling shows up too
Concepts are defined around Anchors, relationships between Anchors in Ties
Attributes are stored in Attributes
Keys are just attributes
PERSON Ownership EQUIPMENT
![Page 29: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/29.jpg)
Data modeling through the ages
Anchor Model advantages:
Minimal impact of changes in source systems
A Kimball model is a set of views on this model
Easy to automate (except business rules)
Anchor Model disadvantages:
Query performance relies on the number of attributes in the query and on the specific abilities of the underlying RDBMS
No rules on storing interpreted data
![Page 30: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/30.jpg)
The different models
Kimball
Data Vault
Anchor model
![Page 31: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/31.jpg)
The different models
The (very rough) differences:
Kimball focuses on business concepts and ease of use
Data Vault focuses on resilient long term storage, auditing and business administration
Anchor Modeling focuses on business concepts and resilient long term storage
3NF (“Inmon”) focuses on historical replication of the source
![Page 32: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/32.jpg)
Data modeling through the ages
So let’s get back to our requirements:
Have a business-oriented view on the data
Have good performance for our queries
Integrate different sources with ease
Be able to handle ‘business changing its mind’
Implement changes to reports quickly
Handle large data volumes at low cost
How can we change reports quickly?
What if we have terabytes of data? Or petabytes?
![Page 33: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/33.jpg)
Where next?
Can we achieve our objectives using cheap(er) “NoSQL” data stores?
Best known alternative: Hadoop
Hadoop is a replicated data store for unstructured data, with many servers holding parts of the data
Structure is imposed by the (programmed) query
SQL front-end (Hive) and distributed database (Cassandra) are available on top of the framework
Basically, Hadoop is a huge number of “very fast clay tablets”
![Page 34: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/34.jpg)
How does it stack up?
Our requirement was cheap storage or fast changes:
For terabyte-sized volumes of data, Hadoop is cheaper in hardware and software cost than a commercial RDBMS or DWH appliance
Two thirds of the cost of Hadoop lies in programming the queries and maintaining the system – Hadoop may have a very expensive TCO
Facebook: “Hadoop is superior at exploring vast amounts of data with complete flexibility, but Relational is superior at traditional business questions” (TDWI, May 2013)
![Page 35: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/35.jpg)
Where next?
Or: use a DWH appliance and trade hardware cost for lower labor and storage costs
http://www.theregister.co.uk/2011/06/23/ibm_netezza_high_capacity/
Competition:
Source: microsoft.com
![Page 36: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/36.jpg)
Food for thought
Data modeling is very relevant, but:
Method trumps model
Automation may make the choice of data model moot
Automation of business rules beats automating simple ETL
Our job is not just to model, but to make the trade-offs between different data models
![Page 37: Next Generation DWH Modeling - Grundsätzlich It · Where do we stand today Fun facts from a 2012 TDWI-survey by Kalido: 64% need more than 1 month to integrate new data or integrate](https://reader034.vdocument.in/reader034/viewer/2022042020/5e778971bbf27b7ff311a76a/html5/thumbnails/37.jpg)
An overview of modeling methods
About the author:
Ronald Kunenborg obtained a Masters degree in Computer Science in 1994, before working as Scientific Programmer and Business Engineer at Mercedes-Benz for a decade. During that time he was active in all the fields in computing, including building the Logistics and Marketing data warehouse and all the reports for Mercedes-Benz. The last five years he specialized in data warehouse architecture, data modeling and business intelligence. Apart from that he also strives to share his knowledge through conferences, articles, and the Wikipedia pages he maintains.