

Designing the Star Schema Database

By Craig Utley

Introduction

Creating a Star Schema Database is one of the most important, and sometimes the final, steps in creating a data warehouse. Given how important this process is to our data warehouse, it is important to understand how we move from a standard, on-line transaction processing (OLTP) system to a final star schema (which here, we will call an OLAP system).

This paper attempts to address some of the issues that have no doubt kept you awake at night. As you stared at the ceiling, wondering how to build a data warehouse, questions began swirling in your mind:

What is a Data Warehouse? What is a Data Mart?

What is a Star Schema Database?

Why do I want/need a Star Schema Database?

The Star Schema looks very denormalized. Won't I get in trouble for that?

What do all these terms mean?

Should I repaint the ceiling?

These are certainly burning questions. This paper will attempt to answer these questions, and show you how to build a star schema database to support decision support within your organization.

Terminology

Usually, you are bored with terminology at the end of a chapter, or you find it buried in an appendix at the back of the book. Here, however, I have the thrill of presenting some terms up front. The intent is not to bore you earlier than usual, but to present a baseline off of which we can operate. The problem in data warehousing is that the terms are often used loosely by different parties. The Data Warehousing Institute (http://www.dw-institute.com) has attempted to standardize some terms and concepts. I will present my best understanding of the terms I will use throughout this lecture. Please note, however, that I do not speak for the Data Warehousing Institute.

OLTP


OLTP stands for Online Transaction Processing. This is a standard, normalized database structure. OLTP is designed for transactions, which means that inserts, updates, and deletes must be fast. Imagine a call center that takes orders. Call takers are continually taking calls and entering orders that may contain numerous items. Each order and each item must be inserted into a database. Since the performance of the database is critical, we want to maximize the speed of inserts (and updates and deletes). To maximize performance, we typically try to hold as few records in the database as possible.

OLAP and Star Schema

OLAP stands for Online Analytical Processing. OLAP is a term that means many things to many people. Here, we will use the terms OLAP and Star Schema pretty much interchangeably. We will assume that a star schema database is an OLAP system. This is not the same thing that Microsoft calls OLAP; they extend OLAP to mean the cube structures built using their product, OLAP Services. Here, we will assume that any system of read-only, historical, aggregated data is an OLAP system.

In addition, we will assume an OLAP/Star Schema can be the same thing as a data warehouse. It can be, although often data warehouses have cube structures built on top of them to speed queries.

Data Warehouse and Data Mart

Before you begin grumbling that I have taken two very different things and lumped them together, let me explain that Data Warehouses and Data Marts are conceptually different in scope. However, they are built using the exact same methods and procedures, so I will define them together here, and then discuss the differences.

A data warehouse (or mart) is a way of storing data for later retrieval. This retrieval is almost always used to support decision-making in the organization. That is why many data warehouses are considered to be DSS (Decision-Support Systems). You will hear some people argue that not all data warehouses are DSS, and that's fine. Some data warehouses are merely archive copies of data. Still, the full benefit of taking the time to create a star schema, and then possibly cube structures, is to speed the retrieval of data. In other words, it supports queries. These queries are often across time. And why would anyone look at data across time? Perhaps they are looking for trends. And if they are looking for trends, you can bet they are making decisions, such as how much raw material to order. Guess what: that's decision support!

Enough of the soap box. Both a data warehouse and a data mart are storage mechanisms for read-only, historical, aggregated data. By read-only, we mean that the person looking at the data won't be changing it. If a user wants to look at the sales yesterday for a certain product, they should not have the ability to change that number. Of course, if we know that number is wrong, we need to correct it, but more on that later.


The "historical" part may just be a few minutes old, but usually it is at least a day old. A data warehouse usually holds data that goes back a certain period in time, such as five years. In contrast, standard OLTP systems usually only hold data as long as it is "current" or active. An order table, for example, may move orders to an archive table once they have been completed, shipped, and received by the customer.

When we say that data warehouses and data marts hold aggregated data, we need to stress that there are many levels of aggregation in a typical data warehouse. In this section, on the star schema, we will just assume the "base" level of aggregation: all the data in our data warehouse is aggregated to a certain point in time.

Let's look at an example: we sell 2 products, dog food and cat food. Each day, we record sales of each product. At the end of a couple of days, we might have data that looks like this:

[Table: individual sales transactions over two days, with the quantities of Dog Food and Cat Food sold on each order]

Table 1

Now, as you can see, there are several transactions. This is the data we would find in a standard OLTP system. However, our data warehouse would usually not record this level of detail. Instead, we summarize, or aggregate, the data to daily totals. Our records in the data warehouse might look something like this:

[Table: one row per day, with the total quantities of Dog Food and Cat Food sold that day]

Table 2

You can see that we have reduced the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.
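The step from transaction records to daily totals can be sketched with a short script. This is only an illustration: it uses Python's built-in sqlite3 module, and the table name SalesTransaction, its columns, the dates, and the quantities are all made up for the example rather than taken from Table 1.

```python
import sqlite3

# Hypothetical transaction-level rows: (date, order number, product, quantity).
# These values are invented for illustration only.
transactions = [
    ("2024-01-01", 1, "Dog Food", 5),
    ("2024-01-01", 1, "Cat Food", 2),
    ("2024-01-01", 2, "Dog Food", 3),
    ("2024-01-02", 1, "Dog Food", 4),
    ("2024-01-02", 1, "Cat Food", 1),
    ("2024-01-02", 2, "Cat Food", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SalesTransaction "
    "(SaleDate TEXT, OrderNumber INTEGER, Product TEXT, Quantity INTEGER)"
)
conn.executemany("INSERT INTO SalesTransaction VALUES (?, ?, ?, ?)", transactions)

# Aggregate the individual transactions to one row per day and product.
daily_totals = conn.execute(
    """
    SELECT SaleDate, Product, SUM(Quantity) AS TotalQuantity
    FROM SalesTransaction
    GROUP BY SaleDate, Product
    ORDER BY SaleDate, Product
    """
).fetchall()

for row in daily_totals:
    print(row)
```

Six transaction rows collapse into four daily rows here; with real order volumes the reduction is far larger.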


We can certainly get from the OLTP system to what we see in the OLAP system just by running a query. However, there are many reasons not to do this, as we will see later.

Aggregations

There is no magic to the term "aggregations." It simply means a summarized, additive value. The level of aggregation in our star schema is open for debate. We will talk about this later. Just realize that almost every star schema is aggregated to some base level, called the grain.

OLTP Systems

OLTP, or Online Transaction Processing, systems are standard, normalized databases. OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions. Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records.

OLTP systems achieve greater speed of transactions through a couple of means: they minimize repeated data, and they limit the number of indexes. First, let's examine the minimization of repeated data.

If we take the concept of an order, we usually think of an order header and then a series of detail records. The header contains information such as an order number, a bill-to address, a ship-to address, a PO number, and other fields. An order detail record is usually a product number, a product description, the quantity ordered, the unit price, the total price, and other fields. Here is what an order might look like:

Figure 1

Now, the data behind this looks very different. If we had a flat structure, we would see the detail records looking like this:


[Table: a single flat order record with the columns Order Number, Order Date, Customer ID, Customer Name, Customer Address, Customer City, Customer State, Customer Zip, Contact Name, Contact Number, Product ID, Product Name, Product Description, Category, SubCategory, and Product Price; the sample row is for customer ACME Products of Louisville, KY (contact Jane Doe) ordering a Widget]


Figure 2

Notice that we do not have the extended cost for each record in the OrderDetail table. This is because we store as little data as possible to speed inserts, updates, and deletes. Therefore, any number that can be calculated is calculated and not stored.

We also minimize the number of indexes in an OLTP system. Indexes are important, of course, but they slow down inserts, updates, and deletes. Therefore, we use just enough indexes to get by. Over-indexing can significantly decrease performance.

Normalization

Database normalization is basically the process of removing repeated information. As we saw above, we do not want to repeat the order header information in each order detail record. There are a number of rules in database normalization, but we will not go through the entire process.

First and foremost, we want to remove repeated records in a table. For example, we don't want an order table that looks like this:


Figure 3

In this example, we will have to have some limit of order detail records in the Order table. If we add 20 repeated sets of fields for detail records, we won't be able to handle that order for 21 products. In addition, if an order just has one product ordered, we still have all those fields wasting space.

So, the first thing we want to do is break those repeated fields into a separate table, and end up with this:

Figure 4

Now, our order can have any number of detail records.
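A minimal sketch of this header/detail split follows, again using Python's built-in sqlite3 module. The column lists are assumptions chosen for the example (the figures are not reproduced here), and the product names and prices are invented. Note that the extended cost is calculated in the query rather than stored in OrderDetail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified OLTP tables; column names are illustrative.
    CREATE TABLE OrderHeader (
        OrderID    INTEGER PRIMARY KEY,
        OrderDate  TEXT,
        CustomerID INTEGER
    );
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,
        ProductName TEXT,
        UnitPrice   REAL
    );
    -- One row per product on an order, so an order can hold any number of lines.
    CREATE TABLE OrderDetail (
        OrderID   INTEGER REFERENCES OrderHeader(OrderID),
        ProductID INTEGER REFERENCES Product(ProductID),
        Quantity  INTEGER,
        PRIMARY KEY (OrderID, ProductID)
    );
    """
)

conn.execute("INSERT INTO OrderHeader VALUES (1, '2024-01-01', 42)")
conn.executemany(
    "INSERT INTO Product VALUES (?, ?, ?)",
    [(1, "Widget", 2.50), (2, "Gadget", 10.00), (3, "Sprocket", 1.25)],
)
# Adding a line to the order needs only the two keys and a quantity.
conn.executemany(
    "INSERT INTO OrderDetail VALUES (?, ?, ?)",
    [(1, 1, 4), (1, 2, 1), (1, 3, 8)],
)

# The extended cost is derived at query time, not stored.
order_lines = conn.execute(
    """
    SELECT p.ProductName, d.Quantity, d.Quantity * p.UnitPrice AS ExtendedCost
    FROM OrderDetail AS d
    JOIN Product AS p ON p.ProductID = d.ProductID
    WHERE d.OrderID = 1
    ORDER BY p.ProductID
    """
).fetchall()

for line in order_lines:
    print(line)
```

The price of this compact storage is the join: every report that needs product names or extended costs has to touch more than one table.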

OLTP Advantages

As stated before, OLTP allows us to minimize data entry. For each detail record, we only have to enter the primary key value from the OrderHeader table, and the primary key of the Product table, and then add the order quantity. This greatly reduces the amount of data entry we have to perform to add a product to an order.

Not only does this approach reduce the data entry required, it greatly reduces the size of an OrderDetail record. Compare the size of the records in Table 3 to those in Table 4.


You can see that the OrderDetail records take up much less space when we have a normalized table structure. This means that the table is smaller, which helps speed inserts, updates, and deletes.

In addition to keeping the table smaller, most of the fields that link to other tables are numeric.

Fewer indexes per table are great for speeding up inserts, updates, and deletes. In general terms, the fewer indexes we have, the faster inserts, updates, and deletes will be. However, again in general terms, the fewer indexes we have, the slower select queries will run. For the purposes of data retrieval, we want a number of indexes available to help speed that retrieval. Since one of our design goals to speed transactions is to minimize the number of indexes, we are limiting ourselves when it comes to doing data retrieval. That is why we look at creating two separate database structures: an OLTP system for transactions, and an OLAP system for data retrieval.

Last but not least, the data in an OLTP system is not user friendly. Most IT professionals would rather not have to create custom reports all day long. Instead, we like to give our customers some query tools and have them create reports without involving us. Most customers, however, don't know how to make sense of the relational nature of the database. Joins are something mysterious, and complex table structures (such as associative tables on a bill-of-material system) are hard for the average customer to use.

The structures seem obvious to us, and we sometimes wonder why our customers can't get the hang of it. Remember, however, that our customers know how to do a FIFO-to-LIFO revaluation and other such tasks that we don't want to deal with; therefore, understanding relational concepts just isn't something our customers should have to worry about.

If our customers want to spend the majority of their time performing analysis by looking at the data, we need to support their desire for fast, easy queries. On the other hand, we need to meet the speed requirements of our transaction-processing activities. If these two requirements seem to be in conflict, they are, at least partially. Many companies have solved this by having a second copy of the data in a structure reserved for analysis. This copy is more heavily indexed, and it allows customers to perform large queries against the data without impacting the inserts, updates, and deletes on the main data. This copy of the data is often not just more heavily indexed, but also denormalized to make it easier for customers to understand.

Reasons to Denormalize

Whenever I ask someone why you would ever want to denormalize, the first (and often only) answer is: speed. We've already discussed some disadvantages to the OLTP structure; it is built for data inserts, updates, and deletes, but not data retrieval. Therefore, we can often squeeze some speed out of it by denormalizing some of the tables and having queries go against fewer tables. These queries are faster because they perform fewer joins to retrieve the same recordset.

Joins are slow, as we have already mentioned. Joins are also confusing to many end users. By denormalizing, we can present the user with a view of the data that is far easier for them to understand. Which view of the data is easier for a typical end-user to understand:

Figure 5


Figure 6

The second view is much easier for the end user to understand. We had to use joins to create this view, but if we put all of this in one table, the user would be able to perform this query without using joins. We could create a view that looks like this, but we are still using joins in the background and therefore not achieving the best performance on the query.

How We View Information

All of this leads us to the real question: how do we view the data we have stored in our database? This is not the question of how we view it with queries, but how do we logically view it? For example, are these intelligent questions to ask:

How many bottles of Aniseed Syrup did we sell last week?

Are overall sales of Condiments up or down this year compared to previous years?

On a quarterly and then monthly basis, are Dairy Product sales cyclical?

In what regions are sales down this year compared to the same period last year?

What products in those regions account for the greatest percentage of the decrease?

All of these questions would be considered reasonable, perhaps even common. They all have a few things in common. First, there is a time element to each one. Second, they all are looking for aggregated data; they are asking for sums or counts, not individual transactions. Finally, they are looking at data in terms of "by" conditions.

When I talk about "by" conditions, I am referring to looking at data by certain conditions. For example, if we take the question "On a quarterly and then monthly basis, are Dairy Product sales cyclical" we can break this down into this: "We want to see total sales by category (just Dairy Products in this case), by quarter or by month."


Here we are looking at an aggregated value, the sum of sales, by specific criteria. We could add further "by" conditions by saying we wanted to see those sales by brand and then the individual products.

Figuring out the aggregated values we want to see, like the sum of sales dollars or the count of users buying a product, and then figuring out these "by" conditions is what drives the design of our star schema.

Making the Database Match our Expectations

If we want to view our data as aggregated numbers broken down along a series of "by" criteria, why don't we just store data in this format?

That's exactly what we do with the star schema. It is important to realize that OLTP is not meant to be the basis of a decision support system. The "T" in OLTP stands for transactions, and a transaction is all about taking orders and depleting inventory, and not about performing complex analysis to spot trends. Therefore, rather than tie up our OLTP system by performing huge, expensive queries, we build a database structure that maps to the way we see the world.

We see the world much like a cube. We won't talk about cube structures for data storage just yet. Instead, we will talk about building a database structure to support our queries, and we will speed it up further by creating cube structures later.

Facts and Dimensions

When we talk about the way we want to look at data, we usually want to see some sort of aggregated data. These data are called measures. These measures are numeric values that are measurable and additive. For example, our sales dollars are a perfect measure. Every order that comes in generates a certain sales volume measured in some currency. If we sell twenty products in one day, each for five dollars, we generate 100 dollars in total sales. Therefore, sales dollars is one measure we may want to track. We may also want to know how many customers we had that day. Did we have five customers buying an average of four products each, or did we have just one customer buying twenty products? Sales dollars and customer counts are two measures we will want to track.

Just tracking measures isn't enough, however. We need to look at our measures using those "by" conditions. These "by" conditions are called dimensions. When we say we want to know our sales dollars, we almost always mean by day, or by quarter, or by year. There is almost always a time dimension on anything we ask for. We may also want to know sales by category or by product. These "by" conditions will map into dimensions: there is almost always a time dimension, and product and geographic dimensions are very common as well.

Therefore, in designing a star schema, our first order of business is usually to determine what we want to see (our measures) and how we want to see it (our dimensions).


Mapping Dimensions into Tables

Dimension tables answer the "why" portion of our question: how do we want to slice the data? For example, we almost always want to view data by time. We often don't care what the grand total for all data happens to be. If our data happen to start on June 14 of some year, do we really care how much our sales have been since that date, or do we really care how one year compares to other years? Comparing one year to a previous year is a form of trend analysis and one of the most common things we do with data in a star schema.

We may also have a location dimension. This allows us to compare the sales in one region to those in another. We may see that sales are weaker in one region than any other region. This may indicate the presence of a new competitor in that area, or a lack of advertising, or some other factor that bears investigation.

When we start building dimension tables, there are a few rules to keep in mind. First, all dimension tables should have a single-field primary key. This key is often just an identity column, consisting of an automatically incrementing number. The value of the primary key is meaningless; our information is stored in the other fields. These other fields contain the full descriptions of what we are after. For example, if we have a Product dimension (which is common) we have fields in it that contain the description, the category name, the sub-category name, etc. These fields do not contain codes that link us to other tables. Because the fields are the full descriptions, the dimension tables are often fat; they contain many large fields.

Dimension tables are often short, however. We may have many products, but even so, the dimension table cannot compare in size to a normal fact table. For example, even if we have 30,000 products in our product table, we may track sales for these products each day for several years. Assuming we actually only sell 3,000 products in any given day, if we track these sales each day for ten years, we end up with this equation: 3,000 products sold x 365 days/year x 10 years equals almost 11,000,000 records! Therefore, in relative terms, a dimension table with 30,000 records will be short compared to the fact table.

Given that a dimension table is fat, it may be tempting to normalize the dimension table. Resist the urge to do so; we will see why in a little while when we talk about the snowflake schema.

Dimensional Hierarchies

We have been building hierarchical structures in OLTP systems for years. However, hierarchical structures in an OLAP system are different because the hierarchy for the dimension is actually all stored in the dimension table.

The product dimension, for example, contains individual products. Products are normally grouped into categories, and these categories may well contain sub-categories. For instance, a product with a product number of X12JC may actually be a refrigerator. Therefore, it falls into the category of major appliance, and the sub-category of refrigerator. We may have more levels of sub-categories, where we would further classify this product. The key here is that all of this information is stored in the dimension table.

Our dimension table might look something like this:

Figure 7

Notice that both Category and Subcategory are stored in the table and not linked in through joined tables that store the hierarchy information. This hierarchy allows us to perform "drill-down" functions on the data. We can perform a query that performs sums by category. We can then drill down into that category by calculating sums for the subcategories for that category. We can then calculate the sums for the individual products in a particular subcategory.

The actual sums we are calculating are based on numbers stored in the fact table. We will examine the fact table in more detail later.

Consolidated Dimensional Hierarchies (Star Schemas)

The above example (Figure 7) shows a hierarchy in a dimension table. This is how the dimension tables are built in a star schema; the hierarchies are contained in the individual dimension tables. No additional tables are needed to hold hierarchical information.

Storing the hierarchy in a dimension table allows for the easiest browsing of our dimensional data. In the above example, we could easily choose a category and then list all of that category's subcategories. We would drill down into the data by choosing an individual subcategory from within the same table. There is no need to join to an external table for any of the hierarchical information.

In this overly-simplified example, we have two dimension tables joined to the fact table. We will examine the fact table later. For now, we will assume the fact table has only one number: SalesDollars.


Figure 8

In order to see the total sales for a particular month for a particular category, our SQL statement would look something like this:

SELECT Sum(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension INNER JOIN (ProductDimension INNER JOIN SalesFact
ON ProductDimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE ProductDimension.Category='Brass Goods' AND TimeDimension.Month=3
AND TimeDimension.Year=1999

To drill down to a subcategory, we would merely change the statement to look like this:

SELECT Sum(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension INNER JOIN (ProductDimension INNER JOIN SalesFact
ON ProductDimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE ProductDimension.SubCategory='Widgets' AND TimeDimension.Month=3
AND TimeDimension.Year=1999
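To try these two statements against real tables, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow the statements above; the column lists beyond those, the sample rows, and the helper function total_sales are assumptions made up for the illustration. The joins are written in the plain JOIN ... ON form, which returns the same result as the nested form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TimeDimension (
        TimeID INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER, Day INTEGER
    );
    -- The whole hierarchy (Category, SubCategory) lives in the dimension table.
    CREATE TABLE ProductDimension (
        ProductID INTEGER PRIMARY KEY, ProductName TEXT,
        Category TEXT, SubCategory TEXT
    );
    CREATE TABLE SalesFact (
        TimeID INTEGER, ProductID INTEGER, SalesDollars REAL,
        PRIMARY KEY (TimeID, ProductID)
    );
    """
)

# Invented sample data.
conn.executemany(
    "INSERT INTO TimeDimension VALUES (?, ?, ?, ?)",
    [(1, 1999, 3, 1), (2, 1999, 3, 2), (3, 1999, 4, 1)],
)
conn.executemany(
    "INSERT INTO ProductDimension VALUES (?, ?, ?, ?)",
    [
        (1, "Brass Widget", "Brass Goods", "Widgets"),
        (2, "Brass Hinge", "Brass Goods", "Hinges"),
        (3, "Steel Bolt", "Steel Goods", "Bolts"),
    ],
)
conn.executemany(
    "INSERT INTO SalesFact VALUES (?, ?, ?)",
    [(1, 1, 100.0), (1, 2, 50.0), (2, 1, 25.0), (2, 3, 70.0), (3, 1, 999.0)],
)


def total_sales(level, value, year, month):
    """Sum SalesDollars for one hierarchy level (hypothetical helper)."""
    if level not in ("Category", "SubCategory"):
        raise ValueError("unknown hierarchy level: " + level)
    sql = f"""
        SELECT SUM(f.SalesDollars) AS SumOfSalesDollars
        FROM SalesFact AS f
        JOIN ProductDimension AS p ON p.ProductID = f.ProductID
        JOIN TimeDimension AS t ON t.TimeID = f.TimeID
        WHERE p.{level} = ? AND t.Year = ? AND t.Month = ?
    """
    return conn.execute(sql, (value, year, month)).fetchone()[0]


category_total = total_sales("Category", "Brass Goods", 1999, 3)
subcategory_total = total_sales("SubCategory", "Widgets", 1999, 3)
print(category_total, subcategory_total)
```

Drilling down changes only the column named in the WHERE clause; the joins stay the same because the hierarchy is held in a single dimension table.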

Snowflake Schemas

Sometimes, the dimension tables have the hierarchies broken out into separate tables. This is a more normalized structure, but leads to more difficult queries and slower response times.

Figure 9 represents the beginning of the snowflake process. The category hierarchy is being broken out of the ProductDimension table. You can see that this structure increases the number of joins and can slow queries. Since the purpose of our OLAP system is to speed queries, snowflaking is usually not something we want to do. Some people try to normalize the dimension tables to save space. However, in the overall scheme of the data warehouse, the dimension tables usually only hold about 1% of the records. Therefore, any space savings from normalizing, or snowflaking, are negligible.

Figure 9
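As a rough sketch of what snowflaking costs, the following uses Python's built-in sqlite3 module with the category broken out into its own table. The CategoryDimension table, its columns, and the sample rows are hypothetical names and values chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Snowflaked: the category moves out of ProductDimension (names are illustrative).
    CREATE TABLE CategoryDimension (CategoryID INTEGER PRIMARY KEY, Category TEXT);
    CREATE TABLE ProductDimension (
        ProductID INTEGER PRIMARY KEY, ProductName TEXT,
        CategoryID INTEGER REFERENCES CategoryDimension(CategoryID)
    );
    CREATE TABLE SalesFact (TimeID INTEGER, ProductID INTEGER, SalesDollars REAL);

    INSERT INTO CategoryDimension VALUES (1, 'Brass Goods'), (2, 'Steel Goods');
    INSERT INTO ProductDimension VALUES (1, 'Brass Widget', 1), (2, 'Steel Bolt', 2);
    INSERT INTO SalesFact VALUES (1, 1, 100.0), (2, 1, 25.0), (1, 2, 70.0);
    """
)

# Filtering by category now needs one more join than in the star schema.
snowflake_total = conn.execute(
    """
    SELECT SUM(f.SalesDollars)
    FROM SalesFact AS f
    JOIN ProductDimension AS p ON p.ProductID = f.ProductID
    JOIN CategoryDimension AS c ON c.CategoryID = p.CategoryID
    WHERE c.Category = ?
    """,
    ("Brass Goods",),
).fetchone()[0]
print(snowflake_total)
```

Each further level broken out of the dimension (subcategory, brand, and so on) would add another join of this kind to every query that filters or groups on it.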

Building the Fact Table

The Fact Table holds our measures, or facts. The measures are numeric and additive across some or all of the dimensions. For example, sales are numeric and we can look at total sales for a product, or category, and we can look at total sales by any time period. The sales figures are valid no matter how we slice the data.

While the dimension tables are short and fat, the fact tables are generally long and skinny. They are long because they can hold the number of records represented by the product of the counts in all the dimension tables.

For example, take the following simplified star schema:

Figure 10

In this schema, we have product, time and store dimensions. If we assume we have ten years of daily data, 200 stores, and we sell 500 products, we have a potential of 365,000,000 records (3650 days x 200 stores x 500 products). As you can see, this makes the fact table long.

The fact table is skinny because of the fields it holds. The primary key is made up of foreign keys that have migrated from the dimension tables. These fields are just some sort of numeric value. In addition, our measures are also numeric. Therefore, the size of each record is generally much smaller than those in our dimension tables. However, we have many, many more records in our fact table.

Fact Granularity

One of the most important decisions in building a star schema is the granularity of the fact table. The granularity, or frequency, of the data is usually determined by the time dimension. For example, you may want to only store weekly or monthly totals. The lower the granularity, the more records you will have in the fact table. The granularity also determines how far you can drill down without returning to the base, transaction-level data.

Many OLAP systems have a daily grain to them. The lower the grain, the more records that we have in the fact table. However, we must also make sure that the grain is low enough to support our decision support needs.

One of the major benefits of the star schema is that the low-level transactions are summarized to the fact table grain. This greatly speeds the queries we perform as part of our decision support. This aggregation is the heart of our OLAP system.

Fact Table Size

We have already seen how 500 products sold in 200 stores and tracked for 10 years could produce 365,000,000 records in a fact table with a daily grain. This, however, is the maximum size for the table. Most of the time, we do not have this many records in the table. One of the things we do not want to do is store zero values. So, if a product did not sell at a particular store for a particular day, we would not store a zero value. We only store the records that have a value. Therefore, our fact table is often sparsely populated.

Even though the fact table is sparsely populated, it still holds the vast majority of the records in our database and is responsible for almost all of our disk space used. The lower our granularity, the larger the fact table. You can see from the previous example that moving from a daily to weekly grain would reduce our potential number of records to only slightly more than 52,000,000 records.

The data types for the fields in the fact table do help keep it as small as possible. In most fact tables, all of the fields are numeric, which can require less storage space than the long descriptions we find in the dimension tables.

Finally, be aware that each added dimension can greatly increase the size of our fact table. If we added one dimension to the previous example that included 20 possible values, our potential number of records would reach 7.3 billion.
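The sizing arithmetic in this section is easy to check with a few lines of Python. The figures are the ones used in the example (500 products, 200 stores, ten years of daily data, and an added dimension with 20 values); the variable names are just for illustration.

```python
# Upper bound on fact table rows = product of the dimension sizes.
days = 3650          # ten years of daily data
stores = 200
products = 500

daily_grain_rows = days * stores * products
print(f"daily grain:  {daily_grain_rows:,}")      # 365,000,000

# Moving to a weekly grain divides the time dimension by seven.
weekly_grain_rows = daily_grain_rows // 7
print(f"weekly grain: {weekly_grain_rows:,}")     # 52,142,857

# Adding a dimension with 20 possible values multiplies the bound by 20.
extra_dimension_values = 20
with_extra_dimension = daily_grain_rows * extra_dimension_values
print(f"plus one dimension: {with_extra_dimension:,}")  # 7,300,000,000
```

These are maximums. Because zero-value rows are not stored, the actual row count is normally well below them.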

Changing Attributes

One of the greatest challenges in a star schema is the problem of changing attributes. As an example, we will use the simplified star schema in Figure 10. In the StoreDimension table, we have each store being in a particular region, territory, and zone. Some companies realign their sales regions, territories, and zones occasionally to reflect changing business conditions. However, if we simply go in and update the table, and then try to look at historical sales for a region, the numbers will not be accurate. By simply updating the region for a store, our total sales for that region will not be historically accurate.

In some cases, we do not care. In fact, we want to see what the sales would have been had this store been in that other region in prior years. More often, however, we do not want to change the historical data. In this case, we may need to create a new record for the store. This new record contains the new region, but leaves the old store record, and therefore the old regional sales data, intact. This approach, however, prevents us from comparing this store's current sales to its historical sales unless we keep track of its previous StoreID. This can require an extra field called PreviousStoreID or something similar.

There are no right and wrong answers. Each case will require a different solution to handle changing attributes.
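One way the "new record plus PreviousStoreID" approach might look is sketched below with Python's built-in sqlite3 module. Only StoreDimension, StoreID, and PreviousStoreID come from the discussion above; the remaining columns, the sample rows, and the recursive query that follows the chain of previous IDs are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE StoreDimension (
        StoreID INTEGER PRIMARY KEY, StoreName TEXT, Region TEXT,
        PreviousStoreID INTEGER        -- NULL for a store's first record
    );
    CREATE TABLE SalesFact (TimeID INTEGER, StoreID INTEGER, SalesDollars REAL);

    -- Original record: the store sits in the East region.
    INSERT INTO StoreDimension VALUES (1, 'Main Street', 'East', NULL);
    INSERT INTO SalesFact VALUES (1, 1, 100.0), (2, 1, 150.0);

    -- Realignment: add a new record instead of updating the old one.
    INSERT INTO StoreDimension VALUES (2, 'Main Street', 'West', 1);
    INSERT INTO SalesFact VALUES (3, 2, 200.0);
    """
)

# Regional history stays intact: old sales still roll up to East.
sales_by_region = conn.execute(
    """
    SELECT s.Region, SUM(f.SalesDollars)
    FROM SalesFact AS f
    JOIN StoreDimension AS s ON s.StoreID = f.StoreID
    GROUP BY s.Region
    ORDER BY s.Region
    """
).fetchall()

# To see the store's full history, follow PreviousStoreID back through its records.
store_history_total = conn.execute(
    """
    WITH RECURSIVE chain(StoreID) AS (
        SELECT ?
        UNION ALL
        SELECT s.PreviousStoreID
        FROM StoreDimension AS s
        JOIN chain AS c ON s.StoreID = c.StoreID
        WHERE s.PreviousStoreID IS NOT NULL
    )
    SELECT SUM(SalesDollars) FROM SalesFact
    WHERE StoreID IN (SELECT StoreID FROM chain)
    """,
    (2,),
).fetchone()[0]

print(sales_by_region, store_history_total)
```

The trade-off is visible in the second query: keeping history accurate by region makes store-level comparisons over time more work.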

Aggregations

Finally, we need to discuss how to handle aggregations. The data in the fact table is already aggregated to the fact table's grain. However, we often want to aggregate to a higher level. For example, we may want to sum sales to a monthly or quarterly number. In addition, we may be looking for totals just for a product or a category.

These numbers must be calculated on the fly using a standard S
