dimensional modeling schema
TRANSCRIPT
-
7/24/2019 Dimensional Modeling Schema
1/23
Dimensional Modeling Schema
Written by DWBIConcepts TeamLast Updated: 19 September 2014
Now that we know thebasic approach to do dimensional modelingfrom our earlier article,
let us spend some time to understand various possible schema in dimensional modeling.
Requirement of different design schema
In Dimensional modeling, we can create different schema to suit our requirements. We need
various schema to accomplish several things like accommodating hierarchies of a dimension
or maintaining change histories of information etc. In this article we will discuss about 3
different schema, namely - Star, Snowflake and Conformed and we will also discuss how
hierarchical information are modeled in these schemata. We will reserve the discussion on
maintaining change histories for our next article.
Storing hierarchical information in dimension tables
From our previous article, we already know what is a dimension. Simply put, a dimension is
something that qualifies a measure (number). For example, if I say, "McDonalds sell 5000" -
that won't make any sense. But if I say, "McDonalds sell 5000 burgers per month" - then
that would make perfect sense. Here, "burger" and "month" are the members of dimensions
and they are qualifying the number 5000 in this sentence.
It is important to notice that "burger" and "month" are not dimension themselves - they are
just the members of the dimensions "food" and "time" respectively. "Burger" is just one of
many different "food" that McDonalds sell and "month" is just one of different units by which
time is measured. Typically a dimension will have several members and those members will
be stored in separate rows in the dimension table. So the "food" dimension table of
McDonalds will have one row for burger, one row for fries, one row for "drinks" etc.
Similarly, "time" dimension may contain 12 different months as the members of that
dimension.
Often we may find that there are hierarchical relations among the members of a dimension.
That is certain members of the dimension can be grouped under one group whereas other
members can be grouped into a separate group. Consider this - french fries and twister fries
both are "fries" and hence can be grouped under the same group "fries". Similarly chicken
burger and fish burger both can be grouped as "burger".
http://dwbi.org/data-modelling/dw-design/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dw-design/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dw-design/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dw-design/1-dimensional-modeling-guide.html -
7/24/2019 Dimensional Modeling Schema
2/23
French Fries Twister Fries
This type of hierarchical relations can be stored in the model by following two different
approaches. We can either store them in the same "food" dimension table (star schema
approach) or we can create a separate dimension table in addition to "food" dimension -
just to store the type of the foods (snowflake schema approach).
STAR SCHEMA DESIGN
Star schema is the most simple kind of schema where one fact table is present in the center
of the schema surrounded by multiple dimension tables.
In a star schema all the dimension tables are connected only with the fact table and no
dimension table is connected with any other dimension table.
-
7/24/2019 Dimensional Modeling Schema
3/23
Benefit of Star Schema Design
Star schema is probably most popular schema in dimensional modeling because of its
simplicity and flexibility. In a Star schema design, any information can be obtained just by
traversing a single join, which means this type of schema will be ideal for information
retrieval (faster query processing). Here, note that all the hierarchies (or levels) of the
members of a dimension are stored in the single dimension table - that means, lets say if
you wish to group (veggie burger and chicken burger) in "burger" category and (french fries
and twister fries) in "fries" category, you have to store that category information in the
same dimension table.
Star schema provides a de-normalized design
Storing Hierarchy in star schemaAs depicted above, we will store hierarchical information in a flattened pattern in the single
dimension table in star schema. So our food dimension table will look like this: Food
KEY NAME TYPE
1 Chicken Burger Burger
2 Veggie Burger Burger
3 French Fries Fries
4 Twister Fries Fries
SNOW-FLAKE SCHEMA DESIGN
Snow flake schema is just like star schema but the difference is, here one or moredimension tables are connected with other dimension table as well as with the central fact
table. See the example of snowflake schema below.
Here we are storing the information in two dimension tables instead of one. We are storing
the food type in one dimension ("type" table as shown below) and food in other dimension.
This is a snowflake design. Type
-
7/24/2019 Dimensional Modeling Schema
4/23
KEY TYPE_NAME
1 Burger
2 Fries
Food
KEY TYPE_KEY NAME
1 1 Chicken Burger
2 1 Veggie Burger
3 2 French Fries
4 2 Twister Fries
If you are familiar with the concept of data normalization, you can understand that snow
flaking actually increase the level of normalization in the data. This has obvious
disadvantage in terms of information retrieval since we need to read more tables (and
traverse more SQL joins) in order to get the same information. Example, if you wish to find
out all the food, food type sold from store 1, the SQL queries from star and snowflake
schemata will be like below:
-
7/24/2019 Dimensional Modeling Schema
5/23
SQL Query For Star Schema
SELECT DISTINCT f.name, f.type
FROM food f, sales_fact t
WHERE f.key = t.food_key
AND t.store_key = 1
SQL Query For SnowFlake Schema
SELECT DISTINCT f.name, tp.type_name
FROM food f, type tp, sales_fact t
WHERE f.key = t.food_key
AND f.type_key = tp.key
AND t.store_key = 1
As you can see in this example, compared to star schema, snowflake schema requires one
more join (to connect one more table) to retrieve the same information. This is why
snowflake schema is not good performance wise.
-
7/24/2019 Dimensional Modeling Schema
6/23
Then why do we use snowflake schema? Let me give a quick and short answer to that. I
won't explain it in detail right now but I will leave it to you for your comprehension. The
reason we do it is, suppose we have another fact table with granularity store, food type and
day. This fact will use the key of "type" dimension table instead of "food" dimension table.
Unless you have this dimension table in your schema, you won't get the "type" key. This is
the reason we need to snowflake the "food" dimension to "type" dimension.
In our next article we will talk aboutpreserving history in dimension tables (slowly or
rapidly changing dimensions etc.).
History Preserving in Dimensional Modeling
Written by DWBIConcepts Team
Last Updated: 15 September 2014
In our earlier article we have seenhow to design a simple dimensional data modelfor a
point-of-sale system (as an example we took the case of McDonald's fast-food shop). In this
article we will begin with the same model and we will see how we may enhance the model
to store historical changes in the attributes of dimension table.
Nothing Lasts Forever
One of the important objectives while doing data modeling is, to develop a model which can
capture the states of the system with respect to time. You know, nothing lasts forever!
Product prices change over time, people change their addresses, marital status, employers
and even their names. If you are doing data modeling for a data warehouse where we
are particularly interested about historical analysis - it is crucial that we develop some
method of capturing these changes in our data model. As an example, let's say we store the
price of products in the "Food" dimension table that we created earlier and we want to be
able to capture the historical changes in "Food" price. In this article we will see what change
we need to do in our data model to be able to do this.
Note: The simple "Food" dimension we created earlier did not have any "Price" information.
But to illustrate the point of this article, we will add a "price" column to our "Food"
dimension table. So henceforth our "Food" dimension table will look like this: Food
KEY NAME TYPE_KEY PRICE
1 Chicken Burger 1 3.70
http://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dimensional-model/1-dimensional-modeling-guide.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.html -
7/24/2019 Dimensional Modeling Schema
7/23
KEY NAME TYPE_KEY PRICE
2 VeggieBurger 1 3.20
3 French Fries 2 2.00
4 Twister Fries 2 2.20
In case if you have not read my previous article and wondering what "TYPE_KEY" means,
this is a foreign key coming from one other table that contains the type of the food e.g.,
Burger, Fries etc. Also notice, above table only tells us the price of the food as of current
point in time. It does not tell us what the price was, let's say 6 months ago. If the price of
Veggie Burger changes from $3.20 to $3.25 tomorrow, the new price will be updated in the
table and then we will have no way to know what was the earlier price. So our objective isto change the above table structure in such a way so that we can store all the historical and
future prices of the foods.
Types of Changing Dimensions
There are a few different ways to store the historical changes of values in data model. And
any particular way you want to adopt will depend on the typeof changing dimension. For
example, some dimensions can change quite rapidly, some dimensions do not change at all
but most dimensions change very slowly. That is why we can differentiate dimensions in
these 3 types depicted below.
Unchanging Dimension
There are some dimensions that do not change at all. For example, let's say you have
created a dimension table called "Gender". Below are the structure and data of this
dimension table: Gender
ID VALUE
1 Male
2 Female
The "Value" column in the above dimension is the attribute of this table that won't normally
change. This is an unchanging dimension - "male" will be always called "male" and "female"
-
7/24/2019 Dimensional Modeling Schema
8/23
will be always called "female". Off course, for some crazy reason, one may wish to change
the texts "Male"/"Female" to something else e.g. "man"/"woman". But that's really not a
change that we should be concerned about as such changes do not alter the "meaning" of
the attribute (the words man/male still mean the same thing). So if some changes need to
be done, we can simply update the "Value" column in dimension table. For all practical
intent and purpose, this dimension remains as an "Unchanging dimension".
Slowly Changing Dimension
Here comes the most popular dimension - "slowly changing dimension". These are the
dimensions where one or more attributes can change slowly with respect to time. Look at
the "food" dimension from our earlier example. "Price" is one such attribute which is
variable in this dimension. But "price" of french fries or burgers do not change very often,
may be they change once in a season. This is an example of slowly changing dimension.
Let me give you one more example. Let's say you have created a dimension table onemployees. And in the "employee" dimension you have a column called "Marital_Status".
This can definitely change (from unmarried to married for example) with respect to time.
But again, like the previous example, this is a slowly changing attribute. Doesn't change so
often.
Later in the article, we will see how to make necessary changes in our dimension table
design to store history for such slowly changing dimensions.
Rapidly Changing Dimensions
If you design a dimension table that has a rapidly changing attribute, then your dimension
table will become rapidly changing dimension.
As for example, let's say you have a "Subscriber" dimension where you store the details of
all the subscribers to a particular pre-paid mobile service plan. You have a "status" column
in the "Subscriber" dimension table which can have several different values based on the
current account balance of the subscriber. For example, if your balance is less than $0.1,
the status becomes "No Outgoing call". If your balance is less than $5, the status becomes
"Restricted Call Service". If your balance is less than $10, the status becomes "No Long
Distance Call" and if the balance is greater than $10 then status becomes "Full Service",
etc. Every month, the status of any subscriber keeps on changing multiple times based onhis or her account balance thereby making the "Subscribers" dimension one rapidly
changing dimension.
One must remember the way we design a rapidly changing dimension is often quite
different from the way we design a slowly changing dimension. In the next article however,
we will only look intodesigning of slowly changing dimension.
http://dwbi.org/data-modelling/dimensional-model/19-modeling-for-various-slowly-changing-dimension.htmlhttp://dwbi.org/data-modelling/dimensional-model/19-modeling-for-various-slowly-changing-dimension.htmlhttp://dwbi.org/data-modelling/dimensional-model/19-modeling-for-various-slowly-changing-dimension.htmlhttp://dwbi.org/data-modelling/dimensional-model/19-modeling-for-various-slowly-changing-dimension.html -
7/24/2019 Dimensional Modeling Schema
9/23
Dimensional Modeling Approach for Various SlowlyChanging Dimensions
Written by DWBIConcepts TeamLast Updated: 15 September 2014
In our earlier article we have discussed theneed of storing historical information in
dimensional tables.We have also learnt about various types of changing dimensions. In this
article we will pick "slowly changing dimension" only and learn in detail about various types
of slowly changing dimensions and how to design them.
Slowly changing dimensions, referred as SCD henceforth, can be modeled basically in 3
different ways based on whether we want to store full histories, partial histories or no
history. These different types are called Type 2, Type 3 and Type 1 respectively. Next we
will learn them in detail.
Also note, there are slight variations to the basic 3 SCD types that I show here. These
variations (sometimes labelled as type 4, 5, 6, 7 etc.) are mostly in terms of
implementation and use-cases. Don't worry about them now.
SCD Type 1
As mentioned above, we design a dimension as SCD type 1 when we do not want to store
the history. That is, whenever some values are modified in the attributes, we just want to
update the old values with the new values and we do notcare about storing the previous
history.
We do not store any history in SCD Type 1
Please mind, this is not same as "Unchanged Dimension" discussed in the previous article.
In case of an unchanged dimension, we assume that the values of the attributes of that
dimension will not change at all. On the other hand, here in case of a SCD Type 1
dimension, we assume that the values of the attributes will change slowly, however, we are
not interested to store those changes. We are only interested to store the current or latest
value. So every time it changes we will update the old value with new ones.
Handling SCD Type 1 Dimension in ETL Process
Technically, from ETL design perspective (Now, if you don't know what is ETL, you don't
have to bother about this paragraph - you can go to the next section) SCD Type 1
dimensions are loaded using "Merge" operation which is also known as "UPSERT" as an
abbreviation of "Update else Insert".
http://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.htmlhttp://dwbi.org/data-modelling/dimensional-model/18-history-preserving-in-dimensional-modeling.html -
7/24/2019 Dimensional Modeling Schema
10/23
SCD Type 1 dimensions are loaded by Merge operations
In "UPSERT" method, each row coming from the source is compared will all the records
present in the target dimension table based on the natural key and checked if the sourcerecord already exists in the target or not. If the row exists in the target, the target row is
updated with new values coming from source system. However if the row is not present in
the target system, the source row is inserted in the target table.
In pure ANSI SQL syntax, there is a particular statement that help you achieve the UPSERT
operation. It's called "MERGE" statement
MERGE INTO Target_Dimension_Table tgt
USING source_table src
ON
tgt.natural_key = src.natural_key
WHEN MATCHED THEN
UPDATE
SET tgt.column1 = src.value1,
tgt.column2 = src.value2, ...
WHEN NOT MATCHED THEN
INSERT ( tgt.column1 , tgt.column2 ...)
VALUES ( src.value1 , src.value2 ...)
As obvious from this example, you have to store the natural key of the data in the target
dimension table in order to perform this comparison. Later, I will write a separate article on
ETL architecture design, where I will talk about this in more detail. But from a modeling
perspective, please note that as a data modeler you should add one extra column in your
target dimension table as a place holder to store the natural key of the data.
SCD Type 2
Arguably, this is the most popular type of slowly changing dimensions. So we will try to
learn this as clearly as possible.
-
7/24/2019 Dimensional Modeling Schema
11/23
Let me come one step backward here and remind you again about what is our objective
here. As you can recall, in the previous articles we have learnt how the values of the
attributes (or columns) in the dimension table change with time. We are trying to store the
histories of such changes for the purpose of analysis.
In Type 1, we were not storing any history. However, now we are going to learn how may
we design a dimension table so that we can store the full history and always extract the
history of changes as and when we require that. We will take our "Food" dimension table as
an example here, where "Price" is a variable factor. Food
KEY NAME TYPE_KEY PRICE
1 Chicken Burger 1 3.70
2 VeggieBurger 1 3.20
3 French Fries 2 2.00
4 Twister Fries 2 2.20
Design of SCD Type 2 Dimension
In order to design the above table as SCD Type 2, we will have to add 3 more columns inthis table, "Date From", "Date To" and "Latest Flag". These columns are called type 2
metadata columns. See below: Food
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO LATEST_F
1 Chicken Burger 1 3.70 01-Jan-11 31-Dec-99 Y
2 VeggieBurger 1 3.20 01-Jan-11 31-Dec-99 Y
3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y
4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y
Notice here, how the values of these 3 new columns are populated. In the very beginning,
when any new record is loaded in the table, we automatically default the values of "date
-
7/24/2019 Dimensional Modeling Schema
12/23
from" to the date of the day of the loading, "Date To" to some far future date (e.g., 31st
December 2099) and "Latest Flag" to "Y".
What is the meaning of these 3 metadata columns?
These 3 columns basically tell us whether a particular record in the table is latest or not and
what is the time period during which the record was latest (Also known as active period).
For example, data in the above table basically says that all the 4 records are latest (active)
and they are active from the day of loading (in this case 1st January 2011) until an
indefinite future date (31st December 2099).
But how does these columns help us store the change history?
Lets assume, today is 15 March 2011, and McDonald has decided to increase the price of"Veggie Burger" from $3.20 to $3.25. If this happens we will not straight away update the
price from $3.20 to $3.25. Instead to store this new information (and also the old
information), we will insert a new record in the "Food" dimension table which will look like
below: Food
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO LATEST_F
1 Chicken Burger 1 3.70 01-Jan-11 31-Dec-99 Y
2 VeggieBurger 1 3.20 01-Jan-11 14-Mar-11 N
3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y
4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y
5 VeggieBurger 1 3.25 15-Mar-11 31-Dec-99 Y
Observe the change in the records with Key 2 and 5. Record 2, which was the originalrecord for the veggie burger, has now got updated as its latest flag has become 'N' and
"Date To" column value has changed to "14-Mar-2011". This means, Record 2 is no longer
latest or active (Latest Flag = "N") and it was active earlier during the period 1st Jan
2011(Date From) to 14 Mar 2011(Date To).
So, if Record 2 is not active, what is the latest record for "Veggie Burger" now? Record 5!
Its latest flag is set to "Y" and it says that that the record is active since 15 March 2011.
-
7/24/2019 Dimensional Modeling Schema
13/23
This record will remain active many years in the far-off future (until 31 Dec 2099) or at
least unless a new record is inserted again with latest flag Y and this record is updated
again with Latest Flag N. So next time again, let's say on 20 Dec 2011, McDonalds again
decide to change the price of Veggie Burger back to $3.20 and increase the price of the
chicken burger from $3.70 to $3.90, we will see 2 more new records in the table as
below: Food
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO LATEST_F
1 Chicken Burger 1 3.70 01-Jan-11 19-Dec-11 N
2 VeggieBurger 1 3.20 01-Jan-11 14-Mar-11 N
3 French Fries 2 2.00 01-Jan-11 31-Dec-99 Y
4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 Y
5 VeggieBurger 1 3.25 15-Mar-11 19-Dec-11 N
6 Chicken Burger 1 3.80 20-Dec-11 31-Dec-99 Y
7 VeggieBurger 1 3.20 20-Dec-11 31-Dec-99 Y
As you can see from the design above, it is now possible to go back to any date in the
history and figure out what was the value of the "Price" attribute of "Food" dimension at
that point in time.
Surrogate key for SCD Type 2 dimension
Note from the above example that, each time we generate a new row in the dimension
table, we also assign a new key to the record. This is the key that flows down to the fact
table in a typical Star schema design. The value of this key, that is the numbers like 1, 2, 3,
. , 7 etc. are not coming from the source systems. Instead those numbers are just like
sequential running numbers which are generated automatically at the time of inserting
these records. These numbers are unique, so as to uniquely identify each record in the
table, and are called "Surrogate Key" of the table.
As obvious, multiple surrogate keys may be related to the same item, however, each key
will relate to one particular state of that item in time. In the above example, keys 2, 5 and 7
are all linked to "Veggie Burger" but they represent the state of the record in 3 different
-
7/24/2019 Dimensional Modeling Schema
14/23
time spans. It's worth noting that there would be only one record with latest flag = "Y"
among multiple records of the same item.
Alternate Design of SCD Type 2: Addition of Version
Number
A slight variation of design of SCD Type 2 dimension is possible where we can store the
version numbers of the records. The initial record will be called version 1 and as and when
new records are generated, we will increment the version number by 1. In this design
pattern, the records with highest version will always be the latest record. If we utilize this
design in our earlier example, the dimension table will look like this:Food
KEY NAME TYPE_KEY PRICE DATE_FROM DATE_TO VERS
1 Chicken Burger 1 3.70 01-Jan-11 19-Dec-11 1
2 VeggieBurger 1 3.20 01-Jan-11 14-Mar-11 1
3 French Fries 2 2.00 01-Jan-11 31-Dec-99 1
4 Twister Fries 2 2.20 01-Jan-11 31-Dec-99 1
5 VeggieBurger 1 3.25 15-Mar-11 19-Dec-11 2
6 Chicken Burger 1 3.80 20-Dec-11 31-Dec-99 2
7 VeggieBurger 1 3.20 20-Dec-11 31-Dec-99 3
Off course, we can also keep the "Latest Flag" column in the above table if we wish.
Handling SCD Type 2 Dimension in ETL ProcessAgain, if you do not know what is ETL - you can safely skip this section. But if you have
some ETL background then I suppose you have already pin-pointed the fact that, unlike
SCD Type 1, Type 2 requires you to insert new records in the table as and when any
attribute changes. This is obviously different from SCD Type 1. Because in case of SCD Type
1, we were only updating the record. But here, we will need to update old record (e.g.
-
7/24/2019 Dimensional Modeling Schema
15/23
changing the latest flag from "Y" to "N", updating the "Date To") as well as we will need to
insert a new record.
Like before, we can use the "natural key" to first compare if the source record is existing in
the target or not. If not, we will simply insert the record in the target with new surrogate
key. But if it already exists in the target, we will have to check if any value of the attributes
has changed between source and target - if not, we can ignore the source record. But if yes,
we will have to update the existing record as "N" and insert a new record with new
surrogate key. As I mentioned before, I will write a separate article on the ETL handling
later.
Performance Considerations of SCD Type 2 Dimension
SCD type 2, by design, tend to increase the volume of the dimension tables considerably.
Think of this: Let's say you have an "employee" dimension table which you have designed
as SCD Type 2. The employee dimensions has 20 different attributes and there are 10attributes in this table which change at least once in a year on average (e.g. employee
grade, manager's name, department, salary, band, designation etc.). This means if you
have 1,000 employees in your company, at the end of just one year, you are going to get
10,000 records in this dimension table (i.e. assuming on an average 10 attributes change
per year - resulting into 10 different rows in the dimension table).
As you can see, this is not a very good thing performance wise as this can considerably slow
down loading of your fact table as you will require to "look up" this dimension table during
your fact loading. One may argue that, even if we have 10,000 records, we will actually
have only 1,000 records with Latest_Flag = 'Y' and since we will only lookup records with
Latest_Flag = 'Y', the performance will not detoriate. This is not entirely true. While utilizingthe Latest_Flag = 'Y' filter may decrease the size of the lookup cache, but database will
generally need to do a full table scan (FTS) to identify latest records. Moreover, in many
cases ETL developer will not be able to make use of Latest_Flag = 'Y' column if the
transactional records do not always belong to the latest time (e.g. late arriving fact records
or loading fact table at later point in time - month end load / week end load etc.). In those
cases, putting latest_flag = 'Y' filter will be functionally incorrect as you should determine
the correct return key on the basis of "Date To", "Date From" columns. (If you do not
understand what I am talking about in this para, just ignore me for now. I am going to
explain these things later in some other article)
SCD Type 3
As I mentioned before, type 3 design is used to store partial history. Although theoretically
it is possible to use the type 3 design to store full history, that would be not possible
practically. So, what is type 3 design? In Type 2 design above, we have seen that whenever
-
7/24/2019 Dimensional Modeling Schema
16/23
the values of the attributes change, we insert new rows to the table. In case of type 3,
however, we add new column to the table to store the history.
So let's say, we have a table where we have 2 column initially - "Key" and "attribute".
KEY ATTRIBUTE
1 A
2 B
3 C
If the record 1 changes its attribute from A to D, we will add one extra column to the table
to store this change.
KEY ATTRIBUTE ATTRIBUTE_OLD
1 D A
2 B
3 C
If the record again change attribute values, we will again have to add columns to store the
history of the changes
KEY ATTRIBUTE ATTRIBUTE_OLD ATTRIBUTE_OLD_1
1 E D A
2 B
3 C
Isn't then SCD Type 3 very cumbersome?
-
7/24/2019 Dimensional Modeling Schema
17/23
As you can see, storing the history in terms of changing the structure of the table in this
way is quite cumbersome and after the attributes are changed a few times the table will
become unnecessarily big and fat and difficult to manage. But that does not mean SCD Type
3 design methodology is completely unusable. In fact, it is quite usable in a particular
circumstance - where we just need to store the partial history information.
Let's think about a special circumstance where we only need to know the "current value"
and "previous value" of an attribute. That is, even though the value of that attribute may
change numerous times, at any time we are only concerned about its current and previous
values. In such circumstances, we can design the table as type 3 and keep only 2 columns -
"current value" and "previous value" like below.
KEY CURRENT_VALUE PREVIOUS_VALUE
1 D A
2 B
3 C
I can't find a very good example of this scenario right away, however, I can give you one
example from one of my previous projects in telecom domain, wherein a certain calculated
field in the report used to depend on the latest and previous values of the customer status.
That calculated attribute was called "Churn Indicator" (churn in telecom business generally
means leaving a telephone connection) and the rule to populate the churn indicator was (in
a very very simplified way) like below:
Churn Indicator
= "Voluntary Churn"
(if customer's current status = 'Inactive' and previous status = 'Active')
= "Involuntary Churn",
(if customer's current status = 'Inactive' and previous status = 'Suspended')
As you can guess, in order to find out the correct value of churn indicator, you do not need
to know complete history of changes of customer's status. All you need to know is the
current and previous status. In this kind of partial history scenario, SCD Type 3 design is
very useful.
Note here, compared to SCD Type 2, type 3 does not increase the number of records in the
table thereby easing out performance concerns.
-
7/24/2019 Dimensional Modeling Schema
18/23
Now that we have already learnt about slowly changing dimensions, next we will
discusshow to design "Rapidly Changing Dimension" or RCD
What are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather thanchanging on regular schedule, time-base. In Data Warehouse there is a needtotrackchangesin dimension attributes in order to reporthistorical data.In other words,implementing one of the SCD types should enable users assigning proper dimension'sattribute value for given date. Example of such dimensions could be: customer, geography,employee.
There are many approaches how to deal with SCD. The most popular are:
Type 0- The passive method Type 1- Overwriting the old value
Type 2- Creating a new additional record
Type 3- Adding a new column
Type 4- Using historical table
Type 6- Combine approaches of types 1,2,3 (1+2+3=6)Type 0- The passive method. In this method no special action is performed upondimensional changes. Somedimension datacan remain the same as it was first timeinserted, others may be overwritten.
Type 1- Overwriting the old value. In this method no history of dimension changes is kept
in the database. The old dimension value is simply overwritten be the new one. This typeis easy to maintain and is often use for data which changes are caused by processingcorrections(e.g. removal special characters, correcting spelling errors).
Before the change:
Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate
After the change:
Customer_ID Customer_Name Customer_Type1 Cust_1 Retail
Type 2- Creating a new additional record. In this methodology all history of dimensionchanges is kept in the database. You capture attribute change by adding a new row with anew surrogate key to the dimension table. Both the prior and new rows contain asattributes the natural key(or other durable identifier). Also 'effective date' and 'current
http://dwbi.org/data-modelling/dimensional-model/20-implementing-rapidly-changing-dimension.htmlhttp://dwbi.org/data-modelling/dimensional-model/20-implementing-rapidly-changing-dimension.htmlhttp://dwbi.org/data-modelling/dimensional-model/20-implementing-rapidly-changing-dimension.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://dwbi.org/data-modelling/dimensional-model/20-implementing-rapidly-changing-dimension.html -
7/24/2019 Dimensional Modeling Schema
19/23
indicator' columns are used in this method. There could be only one record with currentindicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_datefor current record usually is set to value 9999-12-31. Introducing changes to thedimensional model in type 2 could be very expensive database operation so it is notrecommended to use it in dimensions where a new attribute could be added in the future.
Before the change:
Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 31-12-9999 Y
After the change:
Customer_ID Customer_Name Customer_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate 22-07-2010 17-05-2012 N
2 Cust_1 Retail 18-05-2012 31-12-9999 Y
Type 3- Adding a new column. In this type usually only the current and previous value ofdimension is kept in the database. The new value is loaded into 'current/new' column andthe old one into 'old/previous' column. Generally speaking the history is limited to thenumber of column created for storing historical data. This is the least commonly neededtechinque.
Before the change:
Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Corporate Corporate
After the change:
Customer_ID Customer_Name Current_Type Previous_Type
1 Cust_1 Retail Corporate
Type 4- Using historical table. In this method a separate historical table is used to trackall dimension's attribute historical changes for each of the dimension. The 'main' dimensiontable keeps only the current data e.g. customer and customer_history tables.
Current table:
Customer_ID Customer_Name Customer_Type
1 Cust_1 Corporate
Historical table:
http://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.html -
7/24/2019 Dimensional Modeling Schema
20/23
Customer_ID Customer_Name Customer_Type Start_Date End_Date
1 Cust_1 Retail 01-01-2010 21-07-2010
1 Cust_1 Oher 22-07-2010 17-05-2012
1 Cust_1 Corporate 18-05-2012 31-12-9999
Type 6- Combine approaches of types 1,2,3 (1+2+3=6). In this type we have indimension table such additional columns as:
current_type - for keeping current value of the attribute. Allhistory recordsfor givenitem of attribute have the same current value.
historical_type - for keeping historical value of the attribute. All history records for givenitem of attribute could have different values.
start_date - for keepingstart dateof 'effective date' of attribute's history.
end_date - for keepingend dateof 'effective date' of attribute's history.
current_flag - for keeping information about the most recent record.
In this method to capture attribute change we add anew recordas in type 2. Thecurrent_type information is overwritten with the new one as in type 1. We store the historyin a historical_column as in type 3.
Customer_ID Customer_Name Current_Type Historical_Type Start_Date End_Date Current_Flag
1 Cust_1 Corporate Retail 01-01-2010 21-07-2010 N
2 Cust_1 Corporate Other 22-07-2010 17-05-2012 N
3 Cust_1 Corporate Corporate 18-05-2012 31-12-9999 Y
(C) 2008-2009 www.datawarehouse4u.info
All Rights ReservedJunk Dimension
In data warehouse design, frequently we run into a situation where there areyes/no indicator fields in the source system. Through business analysis, weknow it is necessary to keep such information in the fact table. However, ifkeep all those indicator fields in the fact table, not only do we need to buildmany small dimension tables, but the amount of information stored in the facttable also increases tremendously, leading to possible performance andmanagement issues.
Junk dimension is the way to solve this problem. In a junk dimension, wecombine these indicator fields into a single dimension. This way, we'll onlyneed to build a single dimension table, and the number of fields in the facttable, as well as the size of the fact table, can be decreased. The content inthe junk dimension table is the combination of all possible values of theindividual indicator fields.
http://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://stat.4u.pl/?maciejam1http://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.htmlhttp://datawarehouse4u.info/SCD-Slowly-Changing-Dimensions.html -
7/24/2019 Dimensional Modeling Schema
21/23
Let's look at an example. Assuming that we have the following fact table:
In this example, TXN_CODE, COUPON_IND, and PREPAY_IND are allindicator fields. In this existing format, each one of them is a dimension. Usingthe junk dimension principle, we can combine them into a single junkdimension, resulting in the following fact table:
Note that now the number of dimensions in the fact table went from 7 to 5.
The content of the junk dimension table would look like the following:
-
7/24/2019 Dimensional Modeling Schema
22/23
In this case, we have 3 possible values for the TXN_CODE field, 2 possiblevalues for the COUPON_IND field, and 2 possible values for thePREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junkdimension table.
By using a junk dimension to replace the 3 indicator fields, we havedecreased the number of dimensions by 2 and also decreased the number offields in the fact table by 2. This will result in a data warehousing environmentthat offer better performance as well as being easier to manage.
Conformed Dimension
A conformed dimension is a dimension that has exactly the same meaningand content when being referred from different fact tables. A conformeddimension can refer to multiple tables in multiple data marts within the sameorganization. For two dimension tables to be considered as conformed, theymust either be identical or one must be a subset of another. There cannot beany other type of difference between the two tables. For example, twodimension tables that are exactly the same except for the primary key are notconsidered conformed dimensions.
Why is conformed dimension important? This goes back to thedefinition ofdata warehousebeing "integrated." Integrated means that even if a particularentity had different meanings and different attributes in the source systems,there must be a single version of this entity once the data flows into the datawarehouse.
http://www.1keydata.com/datawarehousing/data-warehouse-definition.htmlhttp://www.1keydata.com/datawarehousing/data-warehouse-definition.htmlhttp://www.1keydata.com/datawarehousing/data-warehouse-definition.htmlhttp://www.1keydata.com/datawarehousing/data-warehouse-definition.htmlhttp://www.1keydata.com/datawarehousing/data-warehouse-definition.htmlhttp://www.1keydata.com/datawarehousing/data-warehouse-definition.html -
7/24/2019 Dimensional Modeling Schema
23/23
The time dimension is a common conformed dimension in an organization.Usually the only rule to consider with the time dimension is whether there is afiscal year in addition to the calendar year and the definition of a week.Fortunately, both are relatively easy to resolve. In the case of fiscal vs.calendar year, one may go with either fiscal or calendar, or an alternative is to
have two separate conformed dimensions, one for fiscal year and one forcalendar year. The definition of a week is also something that can be differentin large organizations: Finance may use Saturday to Friday, while marketingmay use Sunday to Saturday. In this case, we should decide on a definitionand move on. The nice thing about the time dimension is once these rules areset, the values in the dimension table will never change. For example,October 16th will never become the 15th day in October.
Not all conformed dimensions are as easy to produce as the time dimension.An example is the customer dimension. In any organization with some history,there is a high likelihood that different customer databases exist in differentparts of the organization. To achieve a conformed customer dimension meansthose data must be compared against each other, rules must be set, and datamust be cleansed. In addition, when we are doing incremental data loads intothe data warehouse, we'll need to apply the same rules to the new values tomake sure we are only adding truly new customers to the customerdimension.
Building a conformed dimension also part of the process inmaster datamanagement,or MDM. In MDM, one must not only make sure the master datadimensions are conformed, but that conformity needs to be brought back tothe source systems.
http://www.1keydata.com/datawarehousing/master-data-management.htmlhttp://www.1keydata.com/datawarehousing/master-data-management.htmlhttp://www.1keydata.com/datawarehousing/master-data-management.htmlhttp://www.1keydata.com/datawarehousing/master-data-management.htmlhttp://www.1keydata.com/datawarehousing/master-data-management.htmlhttp://www.1keydata.com/datawarehousing/master-data-management.html