1 dwh concepts


Upload: vijay-kumar

Post on 04-Jun-2018


Page 1: 1 DWH Concepts

8/13/2019 1 DWH Concepts.

http://slidepdf.com/reader/full/1-dwh-concepts 1/27

Top 50 Data Warehousing Interview Questions with Answers

What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of reporting, analysis and data mining or knowledge discovery.

Other than that, a data warehouse can also be used for data integration, master data management etc.

According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant.

Explanatory Note

Note here, non-volatile means that data once loaded in the warehouse will not get deleted later. Time-variant means the data will change with respect to time.

The above definition of data warehousing is typically considered the "classical" definition. However, if you are interested, you may want to read the article - What is a data warehouse - A 101 guide to modern data warehousing - which opens up a broader definition of data warehousing.

What are the benefits of a data warehouse?

A data warehouse helps to integrate data (see Data integration) and store it historically so that we can analyze different aspects of the business, including performance analysis, trends, prediction etc., over a given time frame and use the results of our analysis to improve the efficiency of business processes.

Why is a data warehouse used?

For a long time in the past, and even today, data warehouses have been built to facilitate reporting on different key business processes of an organization, known as KPI. Data warehouses also help to integrate data from different sources and show single-point-of-truth values about the business measures.

A data warehouse can further be used for data mining, which helps with trend prediction, forecasts, pattern recognition etc. Check this article to know more about data mining.

What is the difference between OLTP and OLAP?

OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.

Explanatory Note:

In a departmental shop, when we pay at the check-out counter, the sales person at the counter keys all the data into a "Point-Of-Sales" machine. That data is transaction data and the related system is an OLTP system.

On the other hand, the manager of the store might want to view a report on out-of-stock materials, so that he can place purchase orders for them. Such a report will come out of the OLAP system.

What is a data mart?

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.

What is ER model?

The ER model, or entity-relationship model, is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.

What is dimensional modeling?

A dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from dimension tables that qualify the data. The goal of the dimensional model is not to achieve a high degree of normalization but to facilitate easy and faster data retrieval.

Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is often used in many enterprise-level data warehouses.

If you want to read a quick and simple guide on dimensional modeling, please check our Guide to dimensional modeling.

What is a dimension?

A dimension is something that qualifies a quantity (measure).


For example, consider this: if I just say "20kg", it does not mean anything. But if I say "20kg of Rice (Product) is sold to Ramesh (customer) on 5th April (date)", then that gives a meaningful sense. Product, customer and date are dimensions that qualify the measure - 20kg.

Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?

A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?

Non-additive Measures

Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, for example a 5% profit margin, a revenue to asset ratio etc. Non-numerical data can also be a non-additive measure when that data is stored in fact tables, e.g. some kind of varchar flags in the fact table.

Semi Additive Measures

Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a sum() function on balance does not give a useful result, but max() or min() balance might be useful. Or consider a price rate or currency rate: sum is meaningless on a rate; however, the average function might be useful.

Additive Measures

Additive measures can be used with any aggregation function like Sum(), Avg() etc. An example is Sales Quantity.
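The three kinds of measure can be contrasted with a small sketch. All rows, account names and figures below are made-up illustration data, not values from the text.

```python
from statistics import mean

# Hypothetical fact rows: (account, day, deposit_amount, closing_balance)
rows = [
    ("A1", 1, 100.0, 100.0),
    ("A1", 2, 50.0, 150.0),
    ("A1", 3, 0.0, 150.0),
]

# Additive measure: deposits can be summed across any dimension.
total_deposits = sum(r[2] for r in rows)

# Semi-additive measure: summing daily balances over time gives a number
# that is not a real balance, but max/min over time is still useful.
meaningless_sum = sum(r[3] for r in rows)
max_balance = max(r[3] for r in rows)

# Non-additive measure: a ratio has to be recomputed from its components
# instead of being summed or averaged row by row.
sales = [(200.0, 20.0), (800.0, 40.0)]          # (revenue, profit) per shop
margins = [profit / revenue for revenue, profit in sales]
avg_of_margins = mean(margins)                  # misleading: ignores shop size
true_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)
```

Here the average of the per-shop margins is about 0.075, while the margin computed from total profit over total revenue is 0.06, which is why ratios are treated as non-additive.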

At this point, I will request you to pause and make some time to read this article on "Classifying data for successful modeling". This article helps you to understand the differences between dimensional data, factual data etc. from a fundamental perspective.

What is Star-schema?

This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the keys (primary keys) from all the dimension tables flow into the fact table (as foreign keys) where measures are stored. This entity-relationship diagram looks like a star, hence the name.


Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table.
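A minimal sketch of such a star schema, using Python's built-in sqlite3 module. The table names, column names and sample rows (including the date) are assumptions chosen for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, day  TEXT);
-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key    INTEGER REFERENCES dim_product(product_key),
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    time_key       INTEGER REFERENCES dim_time(time_key),
    sales_quantity INTEGER
);
INSERT INTO dim_product  VALUES (1, 'Rice'), (2, 'Brown Bread');
INSERT INTO dim_customer VALUES (1, 'Ramesh');
INSERT INTO dim_time     VALUES (1, '2019-04-05');
INSERT INTO fact_sales   VALUES (1, 1, 1, 20), (2, 1, 1, 3);
""")

# A typical query joins the fact table to a dimension and aggregates the measure.
quantity_by_product = con.execute("""
    SELECT p.name, SUM(f.sales_quantity)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

Every dimension is one join away from the fact table, which is what gives the diagram its star shape.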

If you are not very familiar with Star Schema design or its use, we strongly recommend you read our excellent article on this subject - different schema in dimensional modeling.

What is snow-flake schema?

This is another logical arrangement of tables in dimensional modeling, where a centralized fact table references a number of other dimension tables; however, those dimension tables are further normalized into multiple related tables.

Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity will be the measure here, and keys from the customer, product and time dimension tables will flow into the fact table. Additionally, all the products can be further grouped under different product families stored in a different table, so that the primary key of the product family table also goes into the product table as a foreign key. Such a construct is called a snow-flake schema, as the product table is further snow-flaked into product family.


Note

Snow-flake increases the degree of normalization in the design.

What are the different types of dimension?

In a data warehouse model, a dimension can be of the following types:

1. Conformed Dimension
2. Junk Dimension
3. Degenerated Dimension
4. Role Playing Dimension

Based on how frequently the data inside a dimension changes, we can further classify dimensions as:

1. Unchanging or static dimension (UCD)
2. Slowly changing dimension (SCD)
3. Rapidly changing dimension (RCD)

You may also read Modeling for various slowly changing dimension and Implementing Rapidly changing dimension to know more about SCD, RCD dimensions etc.

What is a "Conformed Dimension"?

A conformed dimension is a dimension that is shared across multiple subject areas. Consider the 'Customer' dimension. Both the marketing and sales departments may use the same customer dimension table in their reports. Similarly, a 'Time' or 'Date' dimension will be shared by different subject areas. These dimensions are conformed dimensions.

Theoretically, two dimensions which are either identical or strict mathematical subsets of one another are said to be conformed.

What is a degenerated dimension?

A degenerated dimension is a dimension that is derived from the fact table and does not have its own dimension table.

A dimension key such as transaction number, receipt number, invoice number etc. does not have any more associated attributes and hence cannot be designed as a dimension table.

What is junk dimension?

A junk dimension is a grouping of typically low-cardinality attributes (flags, indicators etc.) so that those can be removed from other tables and can be junked into an abstract dimension table.

These junk dimension attributes might not be related. The only purpose of this table is to store all the combinations of the dimensional attributes which you could not otherwise fit into the different dimension tables. Junk dimensions are often used to implement Rapidly Changing Dimensions in a data warehouse.

What is a role-playing dimension?

Dimensions are often reused for multiple applications within the same database with different contextual meanings. For instance, a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery", or "Date of Hire". This is often referred to as a 'role-playing dimension'.

What is SCD?

SCD stands for slowly changing dimension, i.e. a dimension where data is slowly changing. These can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type 6, although Type 1, 2 and 3 are most common. Read this article to gather in-depth knowledge on various SCD tables.

What is rapidly changing dimension?

This is a dimension where data changes rapidly. Read this article to know how to implement RCD.


Describe different types of slowly changing Dimension (SCD)

Type 0:

A Type 0 dimension is where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the values of the attributes change, history is not kept and the table continues to hold the previous (original) data.

Type 1:

A type 1 dimension is where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with recent data whenever there is a change, and because of this update, we lose the previous values.

Type 2:

A type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Consider a customer C1 who is under group G1 first and is later changed to group G2. Then there will be two separate records in the dimension table, like below:

Key  Customer  Group  Start Date    End Date
1    C1        G1     1st Jan 2000  31st Dec 2005
2    C1        G2     1st Jan 2006  NULL

Note that separate surrogate keys are generated for the two records. The NULL end date in the second row denotes that the record is the current record. Also note that, instead of start and end dates, one could keep a version number column (1, 2 ... etc.) to denote different versions of the record.

Type 3:

A type 3 dimension stores the history in a separate column instead of separate rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension grows horizontally. See the example below:

Key  Customer  Previous Group  Current Group
1    C1        G1              G2

This is only good when you need not store many consecutive histories and when the date of change is not required to be stored.
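The limitation can be seen in a short sketch. The function name, the dictionary layout and the third group "G3" are hypothetical and only serve the illustration.

```python
def apply_type3_change(row, new_group):
    """Return a copy of the row with the old group moved to previous_group."""
    if row["current_group"] == new_group:
        return dict(row)  # nothing changed
    return {**row,
            "previous_group": row["current_group"],
            "current_group": new_group}

customer = {"key": 1, "customer": "C1",
            "previous_group": None, "current_group": "G1"}

customer = apply_type3_change(customer, "G2")        # matches the table above
after_second_change = apply_type3_change(customer, "G3")
```

After the second change the row holds G2 and G3 only; G1 is gone, and no change dates were ever recorded.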


Type 6:

A type 6 dimension is a hybrid of type 1, 2 and 3 (1+2+3) which acts very similar to type 2, except that you add one extra column to denote which record is the current record.

Key  Customer  Group  Start Date    End Date       Current Flag
1    C1        G1     1st Jan 2000  31st Dec 2005  N
2    C1        G2     1st Jan 2006  NULL           Y
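The row-versioning behaviour shared by Type 2 and the flagged variant above can be sketched as follows. This is a simplified illustration, not a full implementation: the function name and dictionary layout are assumptions, and closing the old row on the day before the change is a convention taken from the example dates.

```python
from datetime import date, timedelta

def apply_type2_change(rows, customer, new_group, change_date):
    """Close the current row for `customer` and append a new version."""
    for row in rows:
        if row["customer"] == customer and row["current_flag"] == "Y":
            if row["group"] == new_group:
                return rows  # no change, nothing to version
            row["end_date"] = change_date - timedelta(days=1)
            row["current_flag"] = "N"
    next_key = max((r["key"] for r in rows), default=0) + 1  # new surrogate key
    rows.append({"key": next_key, "customer": customer, "group": new_group,
                 "start_date": change_date, "end_date": None,
                 "current_flag": "Y"})
    return rows

dim = [{"key": 1, "customer": "C1", "group": "G1",
        "start_date": date(2000, 1, 1), "end_date": None,
        "current_flag": "Y"}]
apply_type2_change(dim, "C1", "G2", date(2006, 1, 1))
```

After the call, `dim` holds the same two rows as the table: key 1 closed on 31 Dec 2005 with flag N, and key 2 open from 1 Jan 2006 with flag Y.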

What is a mini dimension?

Mini dimensions can be used to handle the rapidly changing dimension scenario. If a dimension has a huge number of rapidly changing attributes, it is better to separate those attributes into a different table called a mini dimension. This is done because, if the main dimension table is designed as SCD type 2, the table will soon grow too large and create performance issues. It is better to segregate the rapidly changing members into a different table, thereby keeping the main dimension table small and performant.

What is a fact-less-fact?

A fact table that does not contain any measure is called a fact-less fact. This table will only contain keys from different dimension tables. This is often used to resolve a many-to-many cardinality issue.

Explanatory Note:

Consider a school, where a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less-fact table joining teacher and student keys. Such a fact table will then be able to answer queries like:

1. Who are the students taught by a specific teacher?
2. Which teacher teaches the maximum number of students?
3. Which student has the highest number of teachers?
etc.
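As a rough sketch, the three queries above can be answered from nothing but key pairs. The teacher and student keys are invented sample data.

```python
from collections import Counter

# Hypothetical fact-less fact rows: (teacher_key, student_key), no measures.
teaches = [("T1", "S1"), ("T1", "S2"), ("T2", "S1"), ("T1", "S3")]

# 1. Students taught by a specific teacher.
students_of_t1 = sorted(s for t, s in teaches if t == "T1")

# 2. Teacher with the maximum number of students.
busiest_teacher = Counter(t for t, _ in teaches).most_common(1)[0][0]

# 3. Student with the highest number of teachers.
most_taught_student = Counter(s for _, s in teaches).most_common(1)[0][0]
```

Each answer is obtained by counting rows, since the presence of a row is itself the fact.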

What is a coverage fact?

A fact-less-fact table can only answer 'optimistic' queries (positive queries) but cannot answer a negative query. Again consider the illustration in the above example. A fact-less fact containing the keys of tutors and students cannot answer queries like the ones below:

1. Which teacher did not teach any student?
2. Which student was not taught by any teacher?

Why not? Because the fact-less fact table only stores the positive scenarios (like a student being taught by a tutor). If there is a student who is not being taught by any teacher, then that student's key does not appear in this table, thereby reducing the coverage of the table.

A coverage fact table attempts to answer this - often by adding an extra flag column. Flag = 0 indicates a negative condition and flag = 1 indicates a positive condition. To understand this better, let's consider a class where there are 100 students and 5 teachers. The coverage fact table will ideally store 100 × 5 = 500 records (all combinations), and if a certain teacher is not teaching a certain student, the corresponding flag for that record will be 0.
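A small sketch of the 100 × 5 example. Which teacher teaches which student is invented here purely so that the negative queries have something to find.

```python
from itertools import product

teachers = [f"T{i}" for i in range(1, 6)]      # 5 teachers
students = [f"S{i}" for i in range(1, 101)]    # 100 students

# Hypothetical positive scenarios: only T1 and T2 teach, S100 has no teacher.
taught = ({("T1", s) for s in students[:60]}
          | {("T2", s) for s in students[40:99]})

# Coverage fact: one row per (teacher, student) combination plus a 0/1 flag.
coverage = [(t, s, 1 if (t, s) in taught else 0)
            for t, s in product(teachers, students)]

# Negative queries become simple filters on the flag.
idle_teachers = [t for t in teachers
                 if not any(flag for tt, _, flag in coverage if tt == t)]
untaught_students = [s for s in students
                     if not any(flag for _, ss, flag in coverage if ss == s)]
```

The table always has 500 rows regardless of how many pairs are positive, which is the price paid for being able to answer the negative questions.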

What are incident and snapshot facts?

A fact table stores some kind of measurements. Usually these measurements are stored (or captured) against a specific time, and they vary with respect to time.

Now it might so happen that the business is not able to capture all of its measures for every point in time. Those unavailable measurements can then be kept empty (Null) or can be filled up with the last available measurements. The first case is an example of an incident fact and the second one is an example of a snapshot fact.

What is aggregation and what is the benefit of aggregation?

A data warehouse usually captures data with the same degree of detail as is available in the source. The "degree of detail" is termed granularity. But not all reporting requirements from that data warehouse need the same degree of detail.

To understand this, let's consider an example from the retail business. A certain retail chain has 500 shops across Europe. All the shops record detail-level transactions regarding the products they sell, and those data are captured in a data warehouse.

Each shop manager can access the data warehouse and see which products are sold by whom and in what quantity on any given date. Thus the data warehouse helps the shop managers with detail-level data that can be used for inventory management, trend prediction etc.

Now think about the CEO of that retail chain. He does not really care about which particular sales girl in London sold the highest number of chopsticks or which shop is the best seller of 'brown breads'. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-to-year sales growth in eastern Europe. Such data is aggregated in nature, because sales of goods in East Europe is derived by summing up the individual sales data from each shop in East Europe.

Therefore, to support different levels of data warehouse users, data aggregation is needed.
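The idea can be sketched by rolling the same detail rows up to two different grains. The shop names, regions and quantities are hypothetical.

```python
from collections import defaultdict

# Hypothetical detail-level rows: (region, shop, product, quantity_sold)
detail = [
    ("East Europe", "Shop A", "Brown Bread", 120),
    ("East Europe", "Shop B", "Brown Bread", 80),
    ("East Europe", "Shop A", "Chopsticks", 10),
    ("West Europe", "Shop C", "Brown Bread", 200),
]

def aggregate(rows, key_index):
    """Sum quantity_sold grouped by the column at key_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)

by_shop = aggregate(detail, 1)      # the grain a shop manager might use
by_region = aggregate(detail, 0)    # the coarser grain a CEO might use
```

The regional figure for East Europe (210) is simply the sum of its shops (130 + 80), which mirrors how the aggregated view is derived from the detail.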


What is slicing-dicing?

Slicing means showing a slice of the data, given a certain dimension (e.g. Product), a value (e.g. Brown Bread) and measures (e.g. sales).

Dicing means viewing the slice with respect to different dimensions and at different levels of aggregation.

Slicing and dicing operations are part of pivoting.

What is drill-through?

Drill through is the process of going from summary data to the detail-level data.

Consider the above example on retail shops. If the CEO finds out that sales in East Europe have declined this year compared to last year, he might then want to know the root cause of the decrease. For this, he may start drilling through his report to a more detailed level and eventually find out that, even though individual shop sales have actually increased, the overall sales figure has decreased because a certain shop in Turkey has stopped operating. The detail-level data, which the CEO was not much interested in earlier, has this time helped him to pinpoint the root cause of the declined sales. And the method he has followed to obtain the details from the aggregated data is called drill through.

History Preserving in Dimensional Modeling

Last Updated on Saturday, 21 July 2012

Written by Akash Mitra

In our earlier article we have seen how to design a simple dimensional data model for a point-of-sale system (as an example we took the case of McDonald's fast-food shop). In this article we will begin with the same model and see how we may enhance it to store historical changes in the attributes of a dimension table.

Nothing Lasts Forever

One of the important objectives while doing data modeling is to develop a model which can capture the states of the system with respect to time. You know, nothing lasts forever! Product prices change over time; people change their addresses, marital status, employers and even their names. If you are doing data modeling for a data warehouse, where we are particularly interested in historical analysis, it is crucial that we develop some method of capturing these changes in our data model. As an example, let's say we store the price of products in the "Food" dimension table that we created earlier and we want to be able to capture the historical changes in "Food" price. In this article we will see what changes we need to make in our data model to be able to do this.

Note: The simple "Food" dimension we created earlier did not have any "Price" information. But to illustrate the point of this article, we will add a "price" column to our "Food" dimension table. So henceforth our "Food" dimension table will look like this:

KEY  NAME            TYPE_KEY  PRICE
1    Chicken Burger  1         3.70
2    Veggie Burger   1         3.20
3    French Fries    2         2.00
4    Twister Fries   2         2.20

In case you have not read my previous article and are wondering what "TYPE_KEY" means, this is a foreign key coming from another table that contains the type of the food, e.g. Burger, Fries etc. Also notice that the above table only tells us the price of the food as of the current point in time. It does not tell us what the price was, let's say, 6 months ago. If the price of Veggie Burger changes from $3.20 to $3.25 tomorrow, the new price will be updated in the table and we will then have no way to know what the earlier price was. So our objective is to change the above table structure in such a way that we can store all the historical and future prices of the foods.

Types of Changing Dimensions

There are a few different ways to store the historical changes of values in a data model, and the particular way you adopt will depend on the type of changing dimension. For example, some dimensions can change quite rapidly, some dimensions do not change at all, but most dimensions change very slowly. That is why we can differentiate dimensions into the 3 types described below.

Unchanging Dimension

There are some dimensions that do not change at all. For example, let's say you have created a dimension table called "Gender". Below are the structure and data of this dimension table:

ID  VALUE
1   Male
2   Female

The "Value" column in the above dimension is the attribute of this table that won't normally change. This is an unchanging dimension - "male" will always be called "male" and "female" will always be called "female". Of course, for some reason, one may wish to change the texts "Male"/"Female" to something else, e.g. "man" / "woman". But that's really not a change that we should be concerned about, as such changes do not alter the "meaning" of the attribute (the words man / male still mean the same thing). So if some changes need to be made, we can simply update the "Value" column in the dimension table. For all practical intents and purposes, this dimension remains an "Unchanging dimension".


Slowly Changing Dimension

Here comes the most popular dimension - "slowly changing dimension". These are the dimensions where one or more attributes can change slowly with respect to time. Look at the "food" dimension from our earlier example. "Price" is one such attribute which is variable in this dimension. But the "price" of french fries or burgers does not change very often; maybe it changes once in a season. This is an example of a slowly changing dimension.

Let me give you one more example. Let's say you have created a dimension table on employees, and in the "employee" dimension you have a column called "Marital_Status". This can definitely change (from unmarried to married, for example) with respect to time. But again, like the previous example, this is a slowly changing attribute. It doesn't change so often.

Later in the article, we will see how to make the necessary changes in our dimension table design to store history for such slowly changing dimensions.

Rapidly Changing Dimensions

If you design a dimension table that has a rapidly changing attribute, then your dimension table will become a rapidly changing dimension.

For example, let's say you have a "Subscriber" dimension where you store the details of all the subscribers to a particular pre-paid mobile service plan. You have a "status" column in the "Subscriber" dimension table which can have several different values based on the current account balance of the subscriber. For example, if your balance is less than $0.1, the status becomes "No Outgoing call". If your balance is less than $5, the status becomes "Restricted Call Service". If your balance is less than $10, the status becomes "No Long Distance Call", and if the balance is greater than $10 then the status becomes "Full Service", etc. Every month, the status of any subscriber keeps changing multiple times based on his or her account balance, thereby making the "Subscribers" dimension a rapidly changing dimension.
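The balance rules above can be written down as a small function to show how quickly the attribute flips. The sequence of balances is made up, and treating a balance of exactly $10 as "Full Service" is an assumption, since the text leaves that boundary open.

```python
def subscriber_status(balance):
    """Map an account balance (in dollars) to the status described above."""
    if balance < 0.1:
        return "No Outgoing call"
    if balance < 5:
        return "Restricted Call Service"
    if balance < 10:
        return "No Long Distance Call"
    return "Full Service"  # assumption: exactly $10 also counts as full service

# Hypothetical balances of one subscriber observed within a single month.
balances_in_one_month = [25.0, 8.5, 3.0, 0.05, 12.0]
statuses = [subscriber_status(b) for b in balances_in_one_month]

# Number of times the status attribute changed during the month.
status_changes = sum(1 for a, b in zip(statuses, statuses[1:]) if a != b)
```

With these sample balances the status changes four times in one month; under SCD Type 2 each change would add a new row for the subscriber.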

One must remember that the way we design a rapidly changing dimension is often quite different from the way we design a slowly changing dimension. In the next article, however, we will only look into the design of slowly changing dimensions.

Implementing Rapidly changing dimension

Last Updated on Monday, 24 September 2012 02:39

Written by Akash Mitra

This article attempts to provide some methodologies for handling rapidly changing dimensions in a data warehouse.


In the past we have learnt how to design various slowly changing dimensions. But the problem with a type 2 slowly changing dimension is that every change in the dimensional attributes increases the number of rows in the table. If a lot of changes happen in the attributes of the dimension table (that is to say, the dimension is rapidly changing), the table quickly becomes bulky, causing considerable performance issues. Hence the typical solution of SCD Type 2 dimensions may not be a very good fit for rapidly changing scenarios.

There are other methods to handle rapidly changing dimensions, and one of those methods will be discussed in this article. Bear in mind, this is not the only method to handle rapidly changing scenarios. Neither is it the best one for every kind of scenario. A data modeler is encouraged to be innovative and come up with other novel approaches.

Junk Dimension

The method that we are going to consider here assumes that not all the attributes of a dimension table are rapidly changing in nature. There might be a few attributes which change quite often and some other attributes which seldom change. If we can separate the fast changing attributes from the slowly changing ones and move them into some other table, while maintaining the slowly changing attributes in the same table, we can get rid of the issue of bulking up the dimension table.

So let's take one example to see how it works. Let's say the CUSTOMER dimension has the following columns:

• CUSTOMER_KEY
• CUSTOMER_NAME
• CUSTOMER_GENDER
• CUSTOMER_MARITAL_STATUS
• CUSTOMER_TIER
• CUSTOMER_STATUS

While attributes like name, gender, marital status etc. do not change at all or rarely change, let's assume customer tier and status change every month based on the customer's buying pattern. If we decide to keep status and tier in the same SCD Type 2 Customer dimension table, we could risk filling up the table too much too soon. Instead, we can pull out those two attributes into yet another table, which some people refer to as a JUNK DIMENSION. Here is what our junk dimension will look like. In this case, it will have 3 columns as shown below.

• SEGMENTATION_KEY
• TIER
• STATUS


The column SEGMENTATION_KEY is a surrogate key. This acts as the primary key of the table. Also, since we have removed status and tier from our main dimension table, the dimension table now looks like this:

• CUSTOMER_KEY
• CUSTOMER_NAME
• CUSTOMER_GENDER
• CUSTOMER_MARITAL_STATUS

Next, we must create a linkage between the above customer dimension and our newly created JUNK dimension. Note here, we cannot simply pull the primary key of the JUNK dimension (which we are calling SEGMENTATION_KEY) into the customer dimension as a foreign key. If we did so, then any change in the JUNK dimension would require us to create a new record in the Customer dimension to refer to the changed key. This would in effect again increase the data volume of the dimension table. We solve this problem by creating one more mini table in between the original customer dimension and the junk dimension. This mini dimension table acts as a bridge between them. We also put "start date" and "end date" columns in this mini table so that we can track the history. Here is what our new mini table looks like:

• CUSTOMER_KEY
• SEGMENTATION_KEY
• START_DATE
• END_DATE

This table does not require any surrogate key. However, one may include a "CURRENT FLAG" column in the table if required. Now the whole model looks like this:

Maintaining the Junk Dimension

If the number of attributes and the number of possible distinct values per attribute (cardinality) are not very large in the Junk dimension, we can actually pre-populate the junk dimension once and for all. In our earlier example, let's say the possible values of status are only "Active" and "Inactive" and the possible values of Tier are only "Platinum", "Gold" and "Silver". That means there can be only 3 × 2 = 6 distinct combinations of records in this table. We can pre-populate the table with these records, with segmentation key = 1 to 6, and assign one key to each customer based on the customer's status and tier values.
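Pre-populating the table amounts to taking the cross product of the attribute values. A minimal sketch follows; the order in which keys 1 to 6 are assigned to the combinations is an arbitrary choice made here, and the lookup dictionary is a hypothetical helper.

```python
from itertools import product

tiers = ["Platinum", "Gold", "Silver"]
statuses = ["Active", "Inactive"]

# One row per (tier, status) combination, with surrogate keys starting at 1.
junk_dimension = [
    {"SEGMENTATION_KEY": key, "TIER": tier, "STATUS": status}
    for key, (tier, status) in enumerate(product(tiers, statuses), start=1)
]

# Lookup used to assign a key to a customer from its tier and status values.
segmentation_key_for = {
    (row["TIER"], row["STATUS"]): row["SEGMENTATION_KEY"]
    for row in junk_dimension
}
```

Because every combination already exists, a change of tier or status never adds a row to the junk dimension; only the key assigned to the customer changes.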

How does this Junk dimension help?

Since the connection between the segmentation key and customer key is actually maintained in the mini dimension table, frequent changes in tier and status do not change the number of records in the dimension table. Whenever a customer's status or tier attribute changes, a new row is added in the mini dimension (with START_DATE = date of change of status) signifying the current relation between the customer and the segmentation.

It's also worth mentioning that in this schema we can manage the original customer dimension table with SCD Type 1 or Type 2 methods, but we will have to take extra care to update the mini dimension as and when there is a change in the key in the original dimension table.
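The effect on row counts can be sketched as below. The function name, the sample segmentation keys and dates are hypothetical, and closing the previous row on the day before the change is an assumed convention rather than something the text prescribes.

```python
from datetime import date, timedelta

def change_segmentation(bridge, customer_key, new_segmentation_key, change_date):
    """Close the customer's open bridge row (if any) and open a new one."""
    for row in bridge:
        if row["CUSTOMER_KEY"] == customer_key and row["END_DATE"] is None:
            if row["SEGMENTATION_KEY"] == new_segmentation_key:
                return  # tier and status unchanged, nothing to record
            row["END_DATE"] = change_date - timedelta(days=1)
    bridge.append({"CUSTOMER_KEY": customer_key,
                   "SEGMENTATION_KEY": new_segmentation_key,
                   "START_DATE": change_date,
                   "END_DATE": None})

customer_dimension = [{"CUSTOMER_KEY": 1, "CUSTOMER_NAME": "C1"}]
bridge = []  # the mini dimension table between customer and junk dimension

change_segmentation(bridge, 1, 3, date(2012, 1, 1))  # initial segmentation
change_segmentation(bridge, 1, 4, date(2012, 2, 1))  # status changed a month later
```

Two changes produce two rows in the narrow bridge table, while the customer dimension still holds a single row for the customer.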

Dimensional Modeling Approach for Various Slowly Changing Dimensions

In our earlier article we discussed the need to store historical information in dimensional tables. We also learnt about various types of changing dimensions. In this article we will pick the "slowly changing dimension" only and learn in detail about the various types of slowly changing dimensions and how to design them.

Slowly changing dimensions, referred to as SCD henceforth, can be modeled in basically 3 different ways based on whether we want to store full history, partial history or no history. These different types are called Type 2, Type 3 and Type 1 respectively. Next we will learn them in detail.

Also note, there are slight variations on the basic 3 SCD types that I show here. These variations (sometimes labelled as Type 4, 5, 6, 7 etc.) are mostly in terms of implementation and use cases. Don't worry about them now.

SCD Type 1

As mentioned above, we design a dimension as SCD Type 1 when we do not want to store the history. That is, whenever some values in the attributes are modified, we just want to update the old values with the new values, and we do not care about storing the previous history.

We do not store any history in SCD Type 1

Please note, this is not the same as the "Unchanged Dimension" discussed in the previous article. In the case of an unchanged dimension, we assume that the values of the attributes of that dimension will not change at all. Here, in the case of an SCD Type 1 dimension, we assume that the values of the attributes will change slowly; however, we are not interested in storing those changes. We are only interested in storing the current or latest value. So every time a value changes, we will update the old value with the new one.

Handling SCD Type 1 Dimension in ETL Process

Technically, from an ETL design perspective (now, if you don't know what ETL is, you don't have to bother about this paragraph; you can go to the next section), SCD Type 1 dimensions are loaded using the "Merge" operation, which is also known as "UPSERT", an abbreviation of "Update else Insert".

SCD Type 1 dimensions are loaded by Merge operations

In the "UPSERT" method, each row coming from the source is compared with all the records present in the target dimension table based on the natural key, to check whether the source record already exists in the target or not. If the row exists in the target, the target row is updated with the new values coming from the source system. However, if the row is not present in the target system, the source row is inserted into the target table.

In pure ANSI SQL syntax, there is a particular statement that helps you achieve the UPSERT operation. It's called the "MERGE" statement:

MERGE INTO Target_Dimension_Table tgt
USING source_table src
ON (tgt.natural_key = src.natural_key)
-- the row already exists in the target: overwrite the old values
WHEN MATCHED THEN
    UPDATE
    SET tgt.column1 = src.value1,
        tgt.column2 = src.value2, ...
-- the row is new: insert it
WHEN NOT MATCHED THEN
    INSERT (column1, column2 ...)
    VALUES (src.value1, src.value2 ...);

As is obvious from this example, you have to store the natural key of the data in the target dimension table in order to perform this comparison. Later, I will write a separate article on ETL architecture design, where I will talk about this in more detail. But from a modeling perspective, please note that as a data modeler you should add one extra column in your target dimension table as a placeholder to store the natural key of the data.
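The same "update else insert" logic can be sketched outside the database in plain Python, which makes the role of the natural key easy to see. This is a minimal sketch under assumptions: the dimension is held as a list of dictionaries, and the natural key codes "VB" and "CB" are invented for the example.

```python
def upsert_type1(dimension, source_rows, natural_key="natural_key"):
    """Update matching rows in place and insert the rest; no history is kept."""
    by_natural_key = {row[natural_key]: row for row in dimension}
    next_key = max((row["KEY"] for row in dimension), default=0) + 1
    for src in source_rows:
        target = by_natural_key.get(src[natural_key])
        if target is not None:
            target.update(src)  # matched: overwrite the old values
        else:
            new_row = {"KEY": next_key, **src}  # not matched: insert
            dimension.append(new_row)
            by_natural_key[src[natural_key]] = new_row
            next_key += 1


food = [{"KEY": 1, "natural_key": "VB", "NAME": "Veggie Burger", "PRICE": 3.20}]
upsert_type1(food, [
    {"natural_key": "VB", "NAME": "Veggie Burger", "PRICE": 3.25},
    {"natural_key": "CB", "NAME": "Chicken Burger", "PRICE": 3.70},
])
```

After the call, the Veggie Burger row carries only the new price, which is exactly the Type 1 behaviour: the old value is gone.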


SCD Type 2

Arguably, this is the most popular type of slowly changing dimension, so we will try to learn it as clearly as possible.

Let me take one step back here and remind you of our objective. As you may recall, in the previous articles we learnt how the values of the attributes (or columns) in a dimension table change with time. We are trying to store the history of such changes for the purpose of analysis.

In Type 1, we were not storing any history. Now we are going to learn how we may design a dimension table so that we can store the full history and always extract the history of changes as and when we require it. We will take our "Food" dimension table as an example here, where "Price" is a variable factor.

KEY  NAME            TYPE_KEY  PRICE
1    Chicken Burger  1         3.70
2    Veggie Burger   1         3.20
3    French Fries    2         2.00
4    Twister Fries   2         2.20

Design of SCD Type 2 Dimension

In order to design the above table as SCD Type 2, we will have to add 3 more columns to this table: "Date From", "Date To" and "Latest Flag". These columns are called Type 2 metadata columns. See below:

KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
2    Veggie Burger   1         3.20   01-Jan-11  31-Dec-99  Y
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y

Notice here how the values of these 3 new columns are populated. In the very beginning, when any new record is loaded in the table, we automatically default the value of "Date From" to the date of loading, "Date To" to some far future date (e.g., 31st December 2099) and "Latest Flag" to "Y".

What is the meaning of these 3 metadata columns?

These 3 columns basically tell us whether a particular record in the table is the latest or not, and what is the time period during which the record was the latest (also known as the active period). For example, the data in the above table says that all 4 records are latest (active) and that they are active from the day of loading (in this case 1st January 2011) until an indefinite future date (31st December 2099).

But how do these columns help us store the change history?


Let's assume today is 15 March 2011, and McDonald's has decided to increase the price of "Veggie Burger" from $3.20 to $3.25. If this happens, we will not straight away update the price from $3.20 to $3.25. Instead, to store this new information (and also the old information), we will insert a new record in the "Food" dimension table, which will then look like below:

KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
5    Veggie Burger   1         3.25   15-Mar-11  31-Dec-99  Y

Observe the change in the records with Key 2 and 5. Record 2, which was the original record for the Veggie Burger, has now been updated: its latest flag has become 'N' and its "Date To" column value has changed to "14-Mar-2011". This means Record 2 is no longer latest or active (Latest Flag = "N") and it was active earlier, during the period 1st Jan 2011 (Date From) to 14 Mar 2011 (Date To).

So, if Record 2 is not active, what is the latest record for "Veggie Burger" now? Record 5! Its latest flag is set to "Y" and it says that the record has been active since 15 March 2011.

This record will remain active for many years into the far-off future (until 31 Dec 2099), or at least until a new record is inserted with Latest Flag "Y" and this record is updated with Latest Flag "N". So next time, let's say on 20 Dec 2011, McDonald's decides to change the price of Veggie Burger back to $3.20 and increase the price of the Chicken Burger from $3.70 to $3.90. We will then see 2 more new records in the table, as below:

KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Latest_FLG
1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  N
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  N
6    Chicken Burger  1         3.90   20-Dec-11  31-Dec-99  Y
7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  Y

As you can see from the design above, it is now possible to go back to any date in the history and figure out what the value of the "Price" attribute of the "Food" dimension was at that point in time.
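To make the "go back to any date" idea concrete, here is a small Python sketch of a point-in-time lookup over the burger rows of the table. The tuple layout and the function name price_as_of are assumptions for illustration; in a real warehouse this would be a date-range filter in SQL.

```python
from datetime import date

# (KEY, NAME, PRICE, DATE_FROM, DATE_TO), copied from the burger rows of the table
FOOD = [
    (1, "Chicken Burger", 3.70, date(2011, 1, 1), date(2011, 12, 19)),
    (2, "Veggie Burger", 3.20, date(2011, 1, 1), date(2011, 3, 14)),
    (5, "Veggie Burger", 3.25, date(2011, 3, 15), date(2011, 12, 19)),
    (6, "Chicken Burger", 3.90, date(2011, 12, 20), date(2099, 12, 31)),
    (7, "Veggie Burger", 3.20, date(2011, 12, 20), date(2099, 12, 31)),
]


def price_as_of(name, as_of):
    """Return the price that was active for `name` on `as_of`, or None."""
    for _key, row_name, price, date_from, date_to in FOOD:
        if row_name == name and date_from <= as_of <= date_to:
            return price
    return None


mid_year_price = price_as_of("Veggie Burger", date(2011, 6, 1))
```

Because the active periods of one item never overlap, at most one row can match any given date.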

Surrogate Key for SCD Type 2 dimension

Note from the above example that each time we generate a new row in the dimension table, we also assign a new key to the record. This is the key that flows down to the fact table in a typical Star schema design. The values of this key, that is the numbers like 1, 2, 3, ..., 6, 7 etc., do not come from the source systems. Instead, those numbers are just sequential running numbers which are generated automatically at the time of inserting these records. These numbers are unique, so as to uniquely identify each record in the table, and are called the "Surrogate Key" of the table.

Obviously, multiple surrogate keys may be related to the same item; however, each key will relate to one particular state of that item in time. In the above example, keys 2, 5 and 7 are all linked to "Veggie Burger" but they represent the state of the record in 3 different time spans. It's worth noting that there would be only one record with latest flag = "Y" among multiple records of the same item.

Alternate Design of SCD Type 2: Addition of Version Number

A slight variation of the SCD Type 2 design is possible where we store the version numbers of the records. The initial record will be called version 1, and as and when new records are generated, we will increment the version number by 1. In this design pattern, the record with the highest version will always be the latest record. If we utilize this design in our earlier example, the dimension table will look like this:

KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    Version
1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  1
2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  1
3    French Fries    2         2.00   01-Jan-11  31-Dec-99  1
4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  1
5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  2
6    Chicken Burger  1         3.90   20-Dec-11  31-Dec-99  2
7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  3

Of course, we can also keep the "Latest Flag" column in the above table if we wish.

Handling SCD Type 2 Dimension in ETL Process

Again, if you do not know what ETL is, you can safely skip this section. But if you have some ETL background, then I suppose you have already pinpointed the fact that, unlike SCD Type 1, Type 2 requires you to insert new records in the table as and when any attribute changes. In the case of SCD Type 1, we were only updating the record. Here, we will need to update the old record (e.g. changing the latest flag from "Y" to "N" and updating the "Date To") as well as insert a new record.

Like before, we can use the "natural key" to first check whether the source record exists in the target or not. If not, we will simply insert the record in the target with a new surrogate key. But if it already exists in the target, we will have to check whether any value of the attributes has changed between source and target. If not, we can ignore the source record. But if yes, we will have to update the existing record as "N" and insert a new record with a new surrogate key. As I mentioned before, I will write a separate article on the ETL handling later.
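The compare, expire and insert steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not an ETL tool's actual behaviour: NAME stands in for the natural key, PRICE is the only tracked attribute, the old row is closed one day before the load date, and apply_type2 is a made-up helper name.

```python
from datetime import date, timedelta

FAR_FUTURE = date(2099, 12, 31)  # default "Date To" for the latest record


def apply_type2(dimension, src, load_date, tracked=("PRICE",)):
    """Apply one source row to a Type 2 dimension held as a list of dicts."""
    next_key = max((r["KEY"] for r in dimension), default=0) + 1
    current = next(
        (r for r in dimension
         if r["NAME"] == src["NAME"] and r["LATEST_FLAG"] == "Y"),
        None,
    )
    if current is not None:
        if all(current[col] == src[col] for col in tracked):
            return  # nothing changed: ignore the source record
        # Expire the existing record before inserting the new one.
        current["DATE_TO"] = load_date - timedelta(days=1)
        current["LATEST_FLAG"] = "N"
    dimension.append({"KEY": next_key, **src, "DATE_FROM": load_date,
                      "DATE_TO": FAR_FUTURE, "LATEST_FLAG": "Y"})


food = [{"KEY": 1, "NAME": "Veggie Burger", "PRICE": 3.20,
         "DATE_FROM": date(2011, 1, 1), "DATE_TO": FAR_FUTURE, "LATEST_FLAG": "Y"}]
apply_type2(food, {"NAME": "Veggie Burger", "PRICE": 3.25}, date(2011, 3, 15))
apply_type2(food, {"NAME": "Veggie Burger", "PRICE": 3.25}, date(2011, 4, 1))
```

The second call finds no changed attribute, so it leaves the table as it is.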


Performance Considerations of SCD Type 2 Dimension

SCD Type 2, by design, tends to increase the volume of the dimension tables considerably. Think of this: let's say you have an "employee" dimension table which you have designed as SCD Type 2. The employee dimension has 20 different attributes, and there are 10 attributes in this table which change at least once a year on average (e.g. employee grade, manager's name, department, salary, band, designation etc.). This means if you have 1,000 employees in your company, at the end of just one year you are going to get 10,000 records in this dimension table (i.e. assuming on average 10 attributes change per year, resulting in 10 different rows in the dimension table for each employee).

As you can see, this is not a very good thing performance-wise, as it can considerably slow down the loading of your fact table, because you will need to "look up" this dimension table during your fact loading. One may argue that, even if we have 10,000 records, we will actually have only 1,000 records with Latest_Flag = 'Y', and since we will only look up records with Latest_Flag = 'Y', the performance will not deteriorate. This is not entirely true. While utilizing the Latest_Flag = 'Y' filter may decrease the size of the lookup cache, the database will generally need to do a full table scan (FTS) to identify the latest records. Moreover, in many cases the ETL developer will not be able to make use of the Latest_Flag = 'Y' filter if the transactional records do not always belong to the latest time (e.g. late arriving fact records, or loading the fact table at a later point in time: month end load / week end load etc.). In those cases, putting the Latest_Flag = 'Y' filter will be functionally incorrect, as you should determine the correct return key on the basis of the "Date To" and "Date From" columns. (If you do not understand what I am talking about in this paragraph, just ignore me for now. I am going to explain these things later in some other article.)
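The late arriving fact problem is easier to see with a concrete case. The sketch below uses the three Veggie Burger rows from the earlier example and a hypothetical sale dated 1 June 2011 that is loaded after the December price change; the function names are assumptions for illustration.

```python
from datetime import date

VEGGIE_BURGER_ROWS = [
    {"KEY": 2, "DATE_FROM": date(2011, 1, 1), "DATE_TO": date(2011, 3, 14), "LATEST_FLAG": "N"},
    {"KEY": 5, "DATE_FROM": date(2011, 3, 15), "DATE_TO": date(2011, 12, 19), "LATEST_FLAG": "N"},
    {"KEY": 7, "DATE_FROM": date(2011, 12, 20), "DATE_TO": date(2099, 12, 31), "LATEST_FLAG": "Y"},
]


def key_by_latest_flag(rows):
    """Lookup that only considers the latest record."""
    return next(r["KEY"] for r in rows if r["LATEST_FLAG"] == "Y")


def key_by_date_range(rows, transaction_date):
    """Lookup that picks the record active on the transaction date."""
    return next(r["KEY"] for r in rows
                if r["DATE_FROM"] <= transaction_date <= r["DATE_TO"])


late_sale_date = date(2011, 6, 1)  # hypothetical sale, loaded after 20 Dec 2011
flag_key = key_by_latest_flag(VEGGIE_BURGER_ROWS)
range_key = key_by_date_range(VEGGIE_BURGER_ROWS, late_sale_date)
```

The flag-based lookup returns key 7 (price 3.20), while the sale actually happened when key 5 (price 3.25) was active, which is what the date-range lookup returns.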

SCD Type 3

As I mentioned before, the Type 3 design is used to store partial history. Although theoretically it is possible to use the Type 3 design to store full history, that would not be practical. So, what is the Type 3 design? In the Type 2 design above, we have seen that whenever the values of the attributes change, we insert new rows into the table. In the case of Type 3, however, we add a new column to the table to store the history.

So let's say we have a table where we have 2 columns initially: "Key" and "Attribute".

KEY  ATTRIBUTE
1    A
2    B
3    C

If record 1 changes its attribute from A to D, we will add one extra column to the table to store this change.

KEY  ATTRIBUTE  ATTRIBUTE_OLD
1    D          A
2    B
3    C

If the record changes its attribute value again, we will again have to add a column to store the history of the changes.

KEY  ATTRIBUTE  ATTRIBUTE_OLD  ATTRIBUTE_OLD_1
1    E          D              A
2    B
3    C

Isn't SCD Type 3 very cumbersome then?

As you can see, storing the history by changing the structure of the table in this way is quite cumbersome, and after the attributes have changed a few times the table will become unnecessarily wide and difficult to manage. But that does not mean the SCD Type 3 design methodology is completely unusable. In fact, it is quite usable in one particular circumstance: where we just need to store partial history information.

Let's think about a special circumstance where we only need to know the "current value" and "previous value" of an attribute. That is, even though the value of that attribute may change numerous times, at any time we are only concerned with its current and previous values. In such circumstances, we can design the table as Type 3 and keep only 2 columns, "current value" and "previous value", like below.

KEY  Current_Value  Previous_Value
1    D              A
2    B
3    C
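Maintaining such a two-column table amounts to shifting the current value into the previous one on every change. A minimal Python sketch, assuming a row is a dictionary with the column names of the table above and update_type3 is a helper invented for the example:

```python
def update_type3(row, new_value):
    """Shift the current value into Previous_Value, then store the new value."""
    if new_value != row["Current_Value"]:
        row["Previous_Value"] = row["Current_Value"]
        row["Current_Value"] = new_value
    return row


row = {"KEY": 1, "Current_Value": "A", "Previous_Value": None}
update_type3(row, "D")  # row now matches record 1 in the table above
update_type3(row, "E")  # "A" is dropped; only "E" and "D" remain
```

The table never gains rows or columns, at the price of forgetting everything older than the previous value.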

I can't find a very good example of this scenario right away; however, I can give you one example from one of my previous projects in the telecom domain, where a certain calculated field in a report depended on the latest and previous values of the customer status. That calculated attribute was called "Churn Indicator" (churn in the telecom business generally means leaving a telephone connection) and the rule to populate the churn indicator was (in a very, very simplified way) like below:

Churn Indicator =
    "Voluntary Churn"
        (if customer's current status = 'Inactive' and previous status = 'Active')
    "Involuntary Churn"
        (if customer's current status = 'Inactive' and previous status = 'Suspended')

As you can guess, in order to find the correct value of the churn indicator, you do not need to know the complete history of changes of the customer's status. All you need to know is the current and previous status. In this kind of partial history scenario, the SCD Type 3 design is very useful.
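Expressed as code, the simplified rule needs nothing beyond the two Type 3 columns. In this sketch the function name is made up, and returning None for every other combination of statuses is an assumption, since the rule as quoted only covers the two churn cases.

```python
def churn_indicator(current_status, previous_status):
    """Derive the churn indicator from the current and previous customer status."""
    if current_status == "Inactive" and previous_status == "Active":
        return "Voluntary Churn"
    if current_status == "Inactive" and previous_status == "Suspended":
        return "Involuntary Churn"
    return None  # assumption: no indicator for any other combination


indicator = churn_indicator("Inactive", "Suspended")
```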

Note here, compared to SCD Type 2, Type 3 does not increase the number of records in the table, thereby easing performance concerns.

Now that we have learnt about slowly changing dimensions, next we will discuss how to design a "Rapidly Changing Dimension" or RCD.

What is a data warehouse - A 101 guide to modern data warehousing

Last Updated on Thursday, 14 March 2013

This article discusses data warehousing from a holistic standpoint and quickly touches upon all the relevant concepts that one needs to know. Start here if you do not know where to start from.

Since this site is dedicated to data warehousing professionals, it was natural for us to assume that visitors to this site are already quite familiar with the basic concepts of data warehousing and business intelligence. Creating basic tutorials on these subjects was perhaps too obvious for us to do, and that explains why there are not many articles on the basic concepts and definitions of data warehousing on this site. But something began to strike us.

We began to realize that there is still a lot of craving among our readers to understand the basics. The Wikipedia definition of data warehousing is not enough for them, partly because of its inadequate elaboration and partly because of the heterogeneous backgrounds and experiences of our users, who are still struggling to grasp the fundamentals. So we decided to produce a set of comprehensive, ground-up articles on the basics to enable our readers to venture deep into the subject of data warehousing. So let's begin with what data warehousing is, what business intelligence is, and why we should care.

What is data warehousing?

Data warehousing is the science of storing data for the purpose of meaningful future analysis.

Yes, it is a science (not much art involved!) and it deals with the mechanism of electronically storing and retrieving data so that some analysis can be performed on that data to corroborate / support a business decision or to predict a business outcome.

What is Business Intelligence?

Business Intelligence, on the other hand, is simply the art and science of presenting historical data in a meaningful way (often by using different data visualization techniques).


Raw data stored in databases turns into valuable information through the implementation of Business Intelligence processes.

Why data warehouse?

DW technologies provide historical, current and predictive views of business operations by analyzing present and historical business data. Data analysis is often done using visualization techniques that turn complex data into images that tell a compelling story. Through this process of analysis, raw data helps management take the right decisions.

As an example of how modern visualization techniques are helping to unlock the complex and hidden information stored deep inside the data, visit http://www.gapminder.org

GapMinder is a small tool conceptualized by Prof. Hans Rosling that analyzes complex socio-economic data to reveal the world's most important trends.

To further demonstrate the need for data warehousing, consider this.

Let's imagine a company called "FairShop" that has 1000 retail outlets across the USA. The company has built one data warehouse to store the data collected from all the shop outlets so that they can analyze the data to gather business intelligence.

The company collects raw sales data from all of its outlet shops (through a process called ETL) and then loads it into a place called a data warehouse or data mart (at this point don't bother too much about the exact meaning and differences of data mart and data warehouse; we will get to that in detail later).

Once the data is there in the data warehouse (or data mart), business intelligence techniques are applied to that data for analysis and reporting. Since the company now has the sales and purchase information from all its shops in a centralized place, it can easily use this data to answer some rudimentary questions about its business, e.g. which shop makes the highest sales, which product is most popular across the shops, what is the stock balance etc.

It is very common to talk about data warehousing and business intelligence together, since business intelligence in this context refers to analyzing the data in the data warehouse. As Wikipedia puts it:

Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics.


Building the definition of data warehouse

There are a couple of points to notice in the typical retail scenario shown above. These points form the base of our discussion on the definition.

1. The objective of building a data warehouse is to store data that is required for analysis and reporting

2. For a data warehouse to function, it should be supplemented with a process that can collect and load data into it (e.g. ETL)

3. In a data warehouse, data actually flows from the source to the target, so the contents of the data warehouse would be different between 2 given points in time

Now if I tell you that the definition of a data warehouse can be constructed from the above 3 points, that shouldn't surprise you. But what will surprise you is that a lot of these points are not really considered in the classic definition of data warehouse.

So let's discuss the classic definitions of data warehouse first.

Classic Definition of Data Warehouse - A peek in the history

The history of the data warehouse dates back to 1960. Without going into detail, here we will quickly touch upon a few noteworthy events in the history of data warehousing. In 1970, ACNielsen, a global marketing research company, published sales data pertaining to the retail industry in the form of a dimensional data mart.

Before this, the concept of a data warehouse for analysis was only a subject of academic pursuit.

Around the same time, the concept of the decision support system was gradually developing, and people started to realize that data stored in operational systems (e.g. data stored in the individual stores of a retail chain) is not easy to analyze at the time of decision making. So in 1983, Teradata introduced a database management system specifically designed for decision support.

In this decade and the next, several people experimented with several designs of data warehouse, and some of them were quite successful. In the year 1992, one of them, Bill Inmon, published a book - Building the Data Warehouse - which among other things gave us a widely accepted definition of what a data warehouse is. We will soon jump into that definition. But before that, let me mention one more author - Ralph Kimball - who 4 years later, in 1996, wrote another book - The Data Warehouse Toolkit - showing us yet another approach to defining and building a data warehouse. Since then, both the Inmon and Kimball approaches have been widely accepted and implemented throughout the globe.

So how did Bill Inmon define a data warehouse? Here it is:


A data warehouse is a subject oriented, non-volatile, integrated, time variant collection of data in support of management's decisions.

Now let's understand this definition.

Explanation on the classic definition of data warehouse

Subject Oriented

This means a data warehouse has a defined scope and it only stores data under that scope. So for example, if the sales team of your company is creating a data warehouse, the data warehouse by definition is required to contain data related to sales (and not the data related to production management, for example).

Non-volatile

This means that data once stored in the data warehouse is not removed or deleted from it and always stays there no matter what.

Integrated

This means that the data stored in a data warehouse makes sense. Facts and figures are related to each other, can be integrated, and project a single point of truth.

Time variant

This means that data is not constant: as more and more new data gets loaded into the warehouse, the data warehouse also grows in size.

Identifying a data warehouse based on its definition

These 4 simple terms from Inmon define a data warehouse succinctly. Let's now check how the definition helps us distinguish a data warehouse from other types of data stores.

Is a book written on how to create 7 different peas puddings a data warehouse? It is subject oriented (deals with peas pudding), non-volatile (deals with a fixed 7 methods that are there to stay) and integrated (makes sense). But it's not time variant. The answer is: it's not a data warehouse.

So is the folder on my desktop named "Account Statements" a data warehouse? It is subject oriented (deals with financial accounting), non-volatile (until I manually delete it) and time variant (every month new account statements pour in), but it's not integrated (one file in the folder, containing the account statement from bank XYZ for the month of May, has no way to link to the other file in the folder containing the account statement of the bank ABC for the month of June). So - not a data warehouse.


So is the sales information collected in one store of a big retail chain a data warehouse? It's subject oriented, time variant and integrated (after all, there is a relational database behind it). But is it non-volatile? Mostly not. And even if it is, there is a fifth factor. Is it being used for the purpose of management decision making? Surely not. After all, who will take an enterprise-wide management decision based on the data collected from a single store?

A broader definition of data warehouse

The classic definition that we discussed above does not focus much on the purpose of the data warehouse. The purpose is something which distinguishes a data warehouse from a data mart, if you will, and helps us understand the need for the data warehouse. The purpose of a data warehouse, as we discussed before, is to render timely data-driven insight that was otherwise inconceivable directly from the raw data. A data warehouse which stores data, and is time variant, subject oriented and integrated, yet does not serve this purpose, is no better than just a data dump.

An alternative (and more contemporary) definition of data warehouse will be:

A data warehouse is an electronically stored collection of integrated data that can be used for the purpose of intelligent analysis.

Dropping time variance from the above definition broadens the coverage of the definition, and omitting the non-volatility condition makes the definition more realistic rather than idealistic. There are many data sets that are not time variant (historical and scientific data) but can be stored in a data warehouse for analysis. Similarly, modern data warehouses are purged regularly when the data loses its purpose. Adding a sense of purpose to the definition enables us to create a more reliable and goal-oriented data warehouse.

Schematic View of a data warehouse

The diagram above shows a typical schematic structure of a data warehouse. As one can see here, most data warehouses collect data from multiple sources to form one integrated warehouse. Before loading to the warehouse, this data often needs special treatment, which is done in the ETL layer (ETL - Extraction, Transformation, Loading). The ETL layer is mostly responsible for 2 types of treatment on the data:

• Data Integration - so that some links can be established between data coming from separate systems, and

• Qualitative Treatment - so that the validity and quality of the data can be checked (and if required corrected) before loading to the data warehouse.