dimensional modeling

BI Dimensional Modeling

Table of Contents

1 DIMENSION MODELING
2 DIMENSIONAL MODELING SCHEMAS
2.1 STAR SCHEMA
2.2 SNOWFLAKE SCHEMA
3 DIMENSIONAL MODELING CONSTRUCTS
3.1 DIMENSIONS
3.1.1 Dimension Keys
3.1.2 Slowly Changing Dimensions (SCDs)
3.1.3 Rapidly Changing Dimensions (RCDs)
3.1.4 Degenerate Dimensions
3.1.5 Demographic Mini Dimension (or Sub Dimensions)
3.1.6 Junk/Dirty Dimensions
3.1.7 Conformed Dimensions
3.2 FACTS
3.2.1 Granularity
3.2.2 Type of Fact table grains
3.2.3 Type of Measures
3.2.4 Factless Fact Table
3.2.5 Aggregation
3.3 DIMENSIONAL MODELING TERMINOLOGY
3.3.1 Hierarchies
3.3.2 Browsing
3.3.3 Drilling
4 THE DIMENSIONAL MODELING DESIGN PROCESS
4.1 FOUR PHASE OF KIMBALL’S APPROACH FOR DESIGNING DIMENSIONAL DATABASE
4.2 THE DATA WAREHOUSE BUS ARCHITECTURE
5 ADVANCED DESIGN
5.1 INDEXING
5.2 EXTENDED DIMENSION TABLE DESIGN
5.2.1 Many-to-Many Dimensions
5.2.2 Many-to-One Dimensions
5.2.3 Role-Playing Dimensions
5.2.4 Organization and Parts Hierarchies
5.2.5 Unpredictably Deep hierarchies
5.2.6 Time stamping the changes in a Large Dimension
5.2.7 Building an Audit Dimension
5.2.8 Too Few Dimensions and Too Many Dimensions
5.3 EXTENDED FACT TABLE DESIGN
5.3.1 Facts of different Granularity and Allocation
5.3.2 Time of day
5.3.3 Multiple Units of Measure
5.3.4 Multinational Currency Tracking
5.3.5 Value Band Reporting
5.4 ADVANCED ROLAP QUERYING AND REPORTING
5.4.1 Drill-Across Queries with multiple Technologies
5.4.2 Self Referencing Queries, Behavior Tracking and Sequential Subsetting
5.4.3 Market Basket Analysis

1 Dimension Modeling

Dimensional Modeling is a database modeling technique widely used in data warehousing and OLAP. It differs from the entity-relationship (ER) technique used for normal transactional (OLTP) applications. Used primarily for designing data marts and data warehouses, Dimensional Modeling seeks to present the data in a standard, intuitive framework that allows for high-performance access.

It is inherently dimensional and adheres to a discipline that uses the relational model with some important restrictions. Simply put, it is a way to store multidimensional data in a two-dimensional Relational Database Management System. When the database can be visualized as a “cube” of three (or more) dimensions, people can imagine slicing and dicing that cube along each of its axes or “dimensions” to look at the cells or “measures” inside the cube.

Figure 1. Multi Dimensional Model

Every dimensional model is composed of one dominant central table with a multipart key, called the fact table, and a set of smaller tables called the dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table. This characteristic “star-like” structure is often called a star join schema.

2 Dimensional Modeling Schemas

As mentioned earlier, Dimensional Models involve a special schema called the star schema, with an important extension called the snowflake schema.

2.1 Star schema

A Star Schema is a means of organizing data for aggregation along a set of known dimensions. There is one large dominant table in the center of the schema, surrounded by smaller attendant tables. This center table is called the fact table, and it contains the numeric data. The smaller tables are dimension tables, and they contain the descriptive data. The figure below shows the star schema architecture. The name comes from the shape that the schema, and a query joining across it, takes on: the fact table is the body of the star and the dimension tables are the points of the star.

Main constructs of the star schema are:

- Fact Tables: Fact tables contain the detail information you want to look at, such as numbers of online items. Access control to sensitive information is maintained in fact tables. These tables can be very large, with as many as several million rows of data.

- Dimension Tables: Dimension tables are designed especially for selection and grouping. There is no access control on these tables and all users can view this information. These tables are much smaller than the fact tables and may contain 10,000 rows of data.

Physically, the database consists of a single fact table and a single table for each dimension. Each tuple in the fact table consists of pointers (foreign keys) to each of the dimension tables, along with the numeric measures. Each dimension table consists of columns that correspond to attributes of the dimension.
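The physical layout described above can be sketched with Python's standard-library sqlite3 module. This is only an illustration: the table names (sales_fact, product_dim, store_dim), the column names, and the sample rows are assumptions made for the example and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, state TEXT);
-- Each fact row holds one foreign key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    product_key   INTEGER REFERENCES product_dim(product_key),
    store_key     INTEGER REFERENCES store_dim(store_key),
    quantity_sold INTEGER,
    dollars_sold  REAL
);
INSERT INTO product_dim VALUES (1, 'Cola', 'Beverages'), (2, 'Chips', 'Snacks');
INSERT INTO store_dim VALUES (1, 'Downtown', 'OH'), (2, 'Airport', 'NY');
INSERT INTO sales_fact VALUES (1, 1, 10, 15.0), (2, 1, 4, 8.0), (1, 2, 6, 9.0);
""")

# Star join: group on dimension attributes and sum the facts.
rows = conn.execute("""
    SELECT p.category, s.state, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim   s ON s.store_key   = f.store_key
    GROUP BY p.category, s.state
    ORDER BY p.category, s.state
""").fetchall()
print(rows)
```

Every query against this layout follows the same pattern: one join per dimension used, with all filtering and grouping done on dimension attributes.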

Salient features of the star schema:

- Easy to understand
- Better performance: minimizes the number of joins
- Supports multi-dimensional analysis
- Allows relatively easy maintenance
- Recommended for most Decision Support System user tools
- Extensible design supports changing business requirements (debatable)

Figure 2. Star schema

2.2 Snowflake schema

A Snowflake Schema extends the star schema by attaching additional dimension tables to the dimensions of a star schema in a relational environment. It is similar to the star schema, but it normalizes dimension tables to save data storage space. It can be used to represent hierarchies of information.

Snowflaking allows for easier update and load of data because redundancy of data is avoided to some extent, but browsing capabilities are greatly compromised.

Snowflaking often becomes necessary when you need data for which there is a one-to-many relationship with a dimension table. Consolidating this data into the dimension table would necessarily lead to redundancy: the dimension's attributes repeat on every row, which violates normalization, and any join from the fact table to the dimension multiplies the fact rows much as a Cartesian product would. This sort of redundancy can cause misleading results in queries, since the count of rows is artificially large. A simple example of such a situation might be a "customer" dimension for which there is a need to store multiple contacts. If the contact information is brought into the customer table, there would be one row for each contact (i.e., one for each customer/contact combination). In this situation, it is better just to create a "contact" snowflake table with a foreign key (FK) to the customer. In general, it is better to avoid snowflaking if possible, but sometimes the consequences of avoiding it are much worse.
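The row inflation in the customer/contact example can be reproduced in a few lines. The sketch below uses sqlite3 with hypothetical tables (customer_dim, contact, sales_fact) and a single sale; it is meant only to show why the contacts are kept in their own snowflake table instead of being folded into the customer dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
-- Snowflake table: many contacts per customer, linked by a foreign key.
CREATE TABLE contact (
    contact_key  INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES customer_dim(customer_key),
    contact_name TEXT
);
CREATE TABLE sales_fact (customer_key INTEGER, dollars_sold REAL);
INSERT INTO customer_dim VALUES (1, 'Acme');
INSERT INTO contact VALUES (1, 1, 'Ann'), (2, 1, 'Bob');
INSERT INTO sales_fact VALUES (1, 100.0);
""")

# Joining the fact table to the customer dimension alone keeps one row per sale.
correct = conn.execute("""
    SELECT SUM(f.dollars_sold) FROM sales_fact f
    JOIN customer_dim c ON c.customer_key = f.customer_key
""").fetchone()[0]

# Bringing the contacts into the same join repeats each sale once per contact,
# which is the inflation a one-row-per-contact customer table would cause.
inflated = conn.execute("""
    SELECT SUM(f.dollars_sold) FROM sales_fact f
    JOIN customer_dim c ON c.customer_key = f.customer_key
    JOIN contact k ON k.customer_key = c.customer_key
""").fetchone()[0]
print(correct, inflated)
```

With the contacts in a separate table, ordinary sales queries never touch them, and the double counting appears only if a query explicitly joins to the contact table.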

Figure 3. Snowflake Schema

Note: Tables that are snowflaked off a dimension are also called outrigger tables. In the above figure we have City “Outrigger”, State “Outrigger” and Region “Outrigger” tables.

3 Dimensional Modeling constructs

3.1 Dimensions

The dimension tables are where the attributes of the dimensions of the business are stored. The best attributes are textual and discrete and are used to constrain the fact table. Each of these textual descriptions helps to describe a member of the respective dimension.

- They are the entry points into the fact tables. The dimensions determine the grain of the fact table and vice versa.
- A single primary key identifies each dimension record.
- They serve as the primary source of query constraints, groupings, and report labels/row headers.
- They are relatively shallow in terms of rows but are wide, with many large columns.
- They are not usually time dependent.
- They contain hierarchical relationships.
- Robust dimension attributes deliver analytic slicing and dicing capabilities.
- Dimension tables are de-normalized.

Examples of Dimensions: Employee, Time, Product, Customer, etc.

3.1.1 Dimension Keys

Dimensional Modeling proposes that the dimension keys should be surrogate keys. A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is one of the basic elements of data warehouse design.

Every join between dimension tables and fact tables in a data warehouse environment should be based on surrogate keys, not natural keys. It is up to the ETL logic to systematically look up and replace every incoming natural key with a data warehouse surrogate key each time either a dimension record or a fact record is brought into the data warehouse environment.
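The lookup-and-replace step that the ETL logic performs might look roughly like the following. The SurrogateKeyMap class, the field names, and the sample records are hypothetical; a real ETL tool would typically keep this mapping in a database table, not in memory.

```python
from itertools import count


class SurrogateKeyMap:
    """Maps natural (production) keys to warehouse surrogate keys."""

    def __init__(self):
        self._next_key = count(1)  # surrogate keys are simple integers from 1
        self._by_natural = {}

    def lookup_or_assign(self, natural_key):
        # Reuse the existing surrogate key, or assign the next integer.
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next_key)
        return self._by_natural[natural_key]


customer_keys = SurrogateKeyMap()

# Incoming fact records carry natural keys from the source system.
incoming = [
    {"customer_id": "C-1001", "amount": 25.0},
    {"customer_id": "C-2002", "amount": 40.0},
    {"customer_id": "C-1001", "amount": 10.0},
]

# Replace every natural key with its surrogate key before loading.
fact_rows = [
    {"customer_key": customer_keys.lookup_or_assign(r["customer_id"]),
     "amount": r["amount"]}
    for r in incoming
]
print(fact_rows)
```

Because the fact rows store only the integer key, a later change to the format of the production key affects the mapping alone, not the fact table.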

The surrogate key should be a simple integer. A four-byte integer has 2 to the power 32 distinct values; stored as a signed integer, it provides more than 2 billion positive values, starting with 1. That should be enough to cater to even the largest dimensions.

Use surrogate keys for the Time dimension as well, for the following reasons:

- A SQL-based date key is typically 8 bytes, so 4 bytes are wasted compared with a four-byte integer key.
- Bypassing joins leads to embedding knowledge of the calendar in the application, rather than reading it from the time dimension.
- It is not possible to encode a date stamp as “I do not know”, “It has not happened yet”, etc., which may be required sometimes in a data warehouse environment.
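One way to picture a time dimension keyed by surrogate integers is shown below. The reserved key values (1 and 2), the attribute names, and the one-week date range are assumptions chosen for the illustration.

```python
from datetime import date, timedelta

# Reserved surrogate keys for dates that cannot be expressed as a real date.
UNKNOWN_DATE_KEY = 1
NOT_YET_HAPPENED_KEY = 2

date_dim = {
    UNKNOWN_DATE_KEY: {"full_date": None, "description": "I do not know"},
    NOT_YET_HAPPENED_KEY: {"full_date": None, "description": "It has not happened yet"},
}

# Real calendar days follow, one row per day. Calendar knowledge such as the
# weekend flag is stored in the dimension, not computed in the application.
start = date(2024, 1, 1)
for offset in range(7):
    day = start + timedelta(days=offset)
    date_dim[3 + offset] = {
        "full_date": day,
        "description": day.isoformat(),
        "is_weekend": day.weekday() >= 5,
    }

# An order that has not shipped yet still gets a valid key in the fact row.
order_fact = {"order_date_key": 3, "ship_date_key": NOT_YET_HAPPENED_KEY}
print(date_dim[order_fact["ship_date_key"]]["description"])
```

A native SQL date column could represent the missing ship date only as NULL, which would drop the row from any join to the time dimension.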

Avoid smart keys

Keys where you can tell something about the record just by looking at the key are called smart keys.

Avoid production keys

Production keys should be avoided because:

Production may reuse keys that it has purged but that you are still maintaining.

Production may make a mistake and reuse a key even when it isn’t supposed to. This happens frequently in the world of UPCs in the retail world, despite everyone’s best intentions.

Production may recompact its key space because it has a need to garbage-collect the production system. One of my customers was recently handed a data warehouse load tape with all the production customer keys reassigned!

Production may legitimately overwrite some part of a product description or a customer description with new values but not change the product key or the customer key to a new value. You are left holding the bag and wondering what to do about the revised attribute values. This is the Slowly Changing Dimension crisis, which is described in section 3.1.2.

Production may generalize its key format to handle some new situation in the transaction system. Now the production keys that used to be integers become alphanumeric. Or perhaps the 12-byte keys you are used to have become 20-byte keys.

Your company has just made an acquisition, and you need to merge more than a million new customers into the master customer list. You will now need to extract from two production systems, but the newly acquired production system has nasty customer keys that don’t look remotely like the others.

3.1.2 Slowly Changing Dimensions (SCDs)

In the real world, dimensions and their descriptions, though relatively constant, evolve over time: employees come and go, they are promoted, salaries change, etc. The term slowly changing dimensions refers to this variation in dimensional attributes over time. The word slowly in this context might seem incorrect but in general, when compared to a measure in a fact table, changes to dimensional data occur slowly.

We need a strategy for dealing with attributes that change over time. When we encounter a slowly changing dimension, we must make one of the following three fundamental choices. Each choice results in a different degree of tracking changes over time:

Type One (Overwriting History): A Type One change overwrites an existing dimensional attribute with new information. For example, when a customer's name changes, the new name overwrites the old name, and the value for the old version is lost. A Type One change updates only the attribute, doesn't insert new records, and affects no keys.

Type Two (Preserving History): A Type Two change creates an additional dimension record at the time of the change with the new attribute values, thereby segmenting history very accurately between the old description and the new description. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes partition history across time more accurately than the other types. However, because Type Two changes add records, they can significantly increase the database's size.

Type Three (Preserving a Version of History): A Type Three change creates new “current” fields within the original dimension record to record the new attribute values, while keeping the original attribute values as well. History can then be described both forward and backward from the change, in terms of either the original attribute values or the current attribute values. You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you're constrained by the maximum number of attributes allowed on a table.

Because most business requirements include tracking changes over time, data warehouse architects commonly implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an alternative, you can implement a mix of Type One and Type Two changes at an attribute level by implementing Type Two changes only for attributes whose historical values are important when you're slicing and dicing. For example, users might not need to know an individual's previous name if a name change occurs, so a Type One change would suffice. Users might want the system to show only the person's current name. However, if the company reassigns sales territories, users might need to track who sold what, at what time, and in what territory, necessitating a Type Two change.
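The attribute-level mix of Type One and Type Two handling can be sketched as follows. The validity columns (valid_from, valid_to, is_current) are a common convention assumed here for illustration; the text above does not prescribe them, and the employee data is invented.

```python
from datetime import date

# Each row: surrogate key, natural key, attributes, validity range, current flag.
employee_dim = [
    {"employee_key": 1, "employee_id": "E42", "name": "Ann Smith",
     "territory": "East", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]


def apply_type_one(dim, employee_id, new_name):
    """Type One: overwrite the attribute on every row for this employee."""
    for row in dim:
        if row["employee_id"] == employee_id:
            row["name"] = new_name


def apply_type_two(dim, employee_id, new_territory, change_date):
    """Type Two: expire the current row and add a new row with a new key."""
    current = next(r for r in dim
                   if r["employee_id"] == employee_id and r["is_current"])
    current["valid_to"] = change_date
    current["is_current"] = False
    new_row = dict(current,
                   employee_key=max(r["employee_key"] for r in dim) + 1,
                   territory=new_territory, valid_from=change_date,
                   valid_to=None, is_current=True)
    dim.append(new_row)


apply_type_one(employee_dim, "E42", "Ann Jones")               # old name is lost
apply_type_two(employee_dim, "E42", "West", date(2023, 6, 1))  # history is kept
for row in employee_dim:
    print(row["employee_key"], row["name"], row["territory"], row["is_current"])
```

Fact rows loaded before the territory change keep pointing at employee_key 1 and later rows point at employee_key 2, which is how the Type Two approach partitions history.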

Although most data warehouses include Type Two changes, you need to seriously examine the business need to record historical data. Implementing Type Two changes might be necessary, but those changes will increase the database size, degrade performance, and lengthen the development time. You need to carefully evaluate using a Type Two implementation, a Type One implementation, or a hybrid implementation.

3.1.3 Rapidly Changing Dimensions (RCDs)

If the dimension values change rapidly over time, they are called rapidly changing dimensions. Note that there is no yardstick for telling whether a dimension is slowly or rapidly changing; this is based on the judgment of the data modeler. Also, an SCD may become an RCD over time or vice versa. For RCDs, the design followed depends on the size of the dimension:

Small dimensions: The same techniques as for slowly changing dimensions may be applied.

Large dimensions: For large dimensions, the choice of indexing techniques and data design approaches is important. We cannot create additional records as we do to handle the slowly changing dimension problem, because the size becomes prohibitive. We also need to find and suppress duplicate entries in the dimension.

Example approach for a rapidly changing, very large dimension:

- Break off some of the attributes into their own separate dimension(s), a demographic dimension(s).
- Force the attributes selected for the demographic dimension to have a relatively small number of discrete values.
- Build up the demographic dimension with all possible combinations of the discrete attribute values.
- Construct a surrogate demographic key for this dimension.

Note: The demographic attributes are among the most heavily used attributes. Their values are often compared in order to identify interesting subsets.
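The steps for the demographic dimension can be sketched as follows. The chosen attributes, the band boundaries, and the helper age_band are hypothetical; the point is only that banding keeps the number of combinations small enough to pre-build.

```python
from itertools import product

# Band the attributes so each has a small number of discrete values.
age_bands = ["<25", "25-44", "45-64", "65+"]
income_bands = ["low", "medium", "high"]
genders = ["F", "M"]

# Pre-build every combination and give each one a surrogate key.
demographic_dim = {
    key: {"age_band": a, "income_band": i, "gender": g}
    for key, (a, i, g) in enumerate(product(age_bands, income_bands, genders), start=1)
}
lookup = {tuple(row.values()): key for key, row in demographic_dim.items()}


def age_band(age):
    """Map a raw age to its band (boundaries are illustrative)."""
    if age < 25:
        return "<25"
    if age < 45:
        return "25-44"
    if age < 65:
        return "45-64"
    return "65+"


# When a customer's profile changes, only the demographic key changes;
# no new row is added to the large customer dimension.
demographic_key = lookup[(age_band(37), "high", "F")]
print(len(demographic_dim), demographic_key)
```

Four age bands, three income bands and two genders give only 24 rows, no matter how many millions of customers reference them.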

3.1.4 Degenerate Dimensions

Degenerate dimensions usually occur in line item-oriented fact table designs. A degenerate dimension is represented by a dimension key attribute with no corresponding dimension table.

Many dimensional designs revolve around some kind of control document like an order, an invoice, a bill of lading, or a ticket. Usually these control documents are a kind of container with one or more line items inside. A very natural grain for a fact table in these cases is the individual line item; in other words, a fact table record is a line item. Given this perspective, we can quickly visualize the necessary dimensions for describing each line item, e.g. Product, Customer, Time etc. Generally, the attributes of the order automatically go over to these chosen dimensions.

But what do we do with the order number itself? At the end of the design, the order number is sitting by itself, without any attributes. We call this a degenerate dimension. The degenerate dimension key should be the actual production order number and should sit in the fact table without a join to anything. There is no point in making a dimension table because the dimension table would not contain anything.

Note: If one or more attributes are legitimately left over after all the other dimensions have been created, and they seem to belong to this control document entity, we should simply create a normal dimension record with a normal join. You don’t have a degenerate dimension any more.
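A line-item fact table with a degenerate order number could look like this in sqlite3. The table names, column names, and order numbers are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
-- order_number is a degenerate dimension: it has no dimension table of its own.
CREATE TABLE order_line_fact (
    order_number TEXT,
    product_key  INTEGER REFERENCES product_dim(product_key),
    quantity     INTEGER,
    amount       REAL
);
INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO order_line_fact VALUES
    ('SO-1001', 1, 2, 20.0), ('SO-1001', 2, 1, 15.0), ('SO-1002', 1, 5, 50.0);
""")

# The order number still groups line items back into their control document.
orders = conn.execute("""
    SELECT order_number, COUNT(*) AS line_items, SUM(amount) AS order_total
    FROM order_line_fact
    GROUP BY order_number
    ORDER BY order_number
""").fetchall()
print(orders)
```

The order number joins to nothing, yet it remains useful for grouping line items and for tracing a fact row back to the source system.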

3.1.5 Demographic Mini Dimension (or Sub Dimensions)

There are some situations where we need to build “subdimensions”. Subdimensions have some special requirements that make them different from simple snowflaked attributes. For example, consider a subdimension that is a set of demographic attributes that have been measured for the county that the customer is in. All the customers in the county will share this identical set of attributes, and thus these attributes are at a different level of granularity from the rest of the customer dimension. For many reasons, it makes sense to isolate this demographic data in a snowflaked subdimension.

First, the demographic data is available at a significantly different grain than the primary dimensional data and is administered and loaded at different times than the rest of the data in the customer dimension. Second, we really do save significant space in this case if the underlying customer dimension is large. And third, we may often browse among the attributes in the demographic table, which strengthens the argument that these attributes should live in their own separate table.

3.1.6 Junk/Dirty Dimensions

A junk dimension is a convenient grouping of random flags and attributes to get them out of a fact table and into a useful dimensional framework.

Sometimes, after carving out all the dimensions, there are still some flags or text attributes left over in the fact table that do not belong to any of the dimension tables.

When a number of miscellaneous flags and text attributes exist, the following design alternatives should be avoided:

- Leaving the flags and attributes unchanged in the fact table record
- Making each flag and attribute into its own separate dimension
- Stripping out all of these flags and attributes from the design

A better alternative is to create a junk dimension, which groups these flags and attributes to get them out of the fact table and into a useful dimensional framework.
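As a rough sketch, a junk dimension can be populated from the flag combinations that actually occur in the source data. The flag names, values, and order records below are hypothetical.

```python
# Source transactions with leftover flags (hypothetical names and values).
source_rows = [
    {"order_id": 1, "amount": 10.0, "payment_type": "cash", "gift_wrap": "N", "channel": "store"},
    {"order_id": 2, "amount": 25.0, "payment_type": "credit", "gift_wrap": "Y", "channel": "web"},
    {"order_id": 3, "amount": 5.0, "payment_type": "cash", "gift_wrap": "N", "channel": "store"},
]
FLAG_COLUMNS = ("payment_type", "gift_wrap", "channel")

junk_keys = {}  # flag combination -> surrogate junk_key
fact_rows = []
for row in source_rows:
    combination = tuple(row[c] for c in FLAG_COLUMNS)
    # Assign the next integer key the first time a combination is seen.
    junk_key = junk_keys.setdefault(combination, len(junk_keys) + 1)
    fact_rows.append({"order_id": row["order_id"], "amount": row["amount"],
                      "junk_key": junk_key})

# The junk dimension has one row per distinct combination of flags.
junk_dim = [dict(zip(FLAG_COLUMNS, combo), junk_key=key)
            for combo, key in junk_keys.items()]
print(junk_dim)
print(fact_rows)
```

The fact row then carries a single integer key instead of three separate flag columns, and the flags stay available for constraining and grouping.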

3.1.7 Conformed Dimensions

Conformed dimensions can be used to analyze facts from two or more data marts. Suppose you have a “shipping” data mart (telling you what you’ve shipped to whom and when) and a “sales” data mart (telling you who has purchased what and when). Both marts require a “customer” dimension and a “time” dimension. If they’re the same dimension, then you have conformed dimensions, allowing you to extract and manipulate facts relating to a particular customer from both marts, answering questions such as whether late shipments have affected sales to that customer.

Suppose now that you add a “marketing” data mart to help you analyze product promotions. Again, with conformed customer and time dimensions, you’re able to analyze the effects of a particular product promotion on sales. (Analyzing facts from more than one fact table in this way is termed “drilling across.” My previous article, “Thinking dimensionally aids business intelligence design and use,” explains the function of facts and dimensions.)

The same conformed dimensions—in this case, time and customer dimensions—have meaning in the context of three independently developed data marts. These dimensions become enterprise property and can be used later in other marts as you evolve the enterprise data warehouse.

In order to use multiple data sources together, the data warehouse team has no choice but to conform the dimensions. The reason is that data arriving from multiple sources may have incompatible granularity. Conforming the dimensions means forcing the two data sources to share identical dimensions.

Conformed dimensions have consistent definitions regardless of where they are used. This allows a single query to be run across multiple tables, Data Marts and Data Warehouses.
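A drill-across over the sales and shipping marts might be sketched as below. The in-memory lists stand in for the two fact tables, and the customer names and measures are invented for the illustration.

```python
# Conformed customer dimension shared by both marts (surrogate key -> name).
customer_dim = {1: "Acme", 2: "Globex"}

# Fact rows from two independently built marts: (customer_key, measure).
sales_fact = [(1, 500.0), (2, 300.0), (1, 200.0)]
shipping_fact = [(1, 1), (1, 0), (2, 1)]  # 1 = late shipment, 0 = on time


def summarize(fact_rows):
    """Aggregate one fact table to the customer grain."""
    totals = {}
    for customer_key, value in fact_rows:
        totals[customer_key] = totals.get(customer_key, 0) + value
    return totals


# Drill across: query each mart separately, then merge on the conformed key.
sales_by_customer = summarize(sales_fact)
late_by_customer = summarize(shipping_fact)
report = {
    customer_dim[key]: {"sales": sales_by_customer.get(key, 0),
                        "late_shipments": late_by_customer.get(key, 0)}
    for key in customer_dim
}
print(report)
```

The merge works only because customer key 1 means the same customer in both marts, which is what conforming the dimension guarantees.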

3.2 Facts

As described earlier, the fact table is the table at the center of a star schema and holds the primary data. It contains the actual numerical measurements that the business is interested in. A fact table typically has two types of columns: those that contain measures and those that are foreign keys to dimension tables. Some key features of a fact table are:

- Multipart key, i.e. a composite key with one foreign key for each dimension.
- Time is always a part of the key.
- Usually numeric. Keys are surrogate integers and the measures are numeric.
- Typically additive.

3.2.1 Granularity

By granularity we mean the level of detail of the data in the fact table. The lowest level of granularity is referred to as atomic data. The granularity is determined by the grain, which is the meaning of a single record in the fact table.

The granularity or frequency of the data is usually determined by the time dimension. For example, you may want to store only weekly or monthly totals. The granularity also determines how far you can drill down without returning to the base, transaction-level data. Many OLAP systems have a daily grain. The lower the grain, the more records we have in the fact table; however, we must also make sure that the grain is low enough to support our decision support needs. One of the major benefits of the star schema is that the low-level transactions are summarized to the fact table grain. This greatly speeds the queries we perform as part of our decision support.

3.2.2 Type of Fact table grains

The three most common fact table grains are:

- Individual transactions, e.g. sales transactions, ATM transactions, insurance claim transactions, etc.
- High-level snapshots, e.g. monthly account snapshots, daily sales totals, etc.
- Line-item control documents, e.g. invoices, orders, etc.

3.2.3 Type of Measures

The three types of measures that go into a fact table are:

(Perfectly) Additive: A fact is additive if it makes sense to add it across all the dimensions, e.g. discrete numerical measures of activity such as quantity sold and dollars sold.

Semi-additive: A fact is semi-additive if it makes sense to add it along only some of the dimensions, e.g. numerical measures of intensity such as account balance and inventory level.

Non-additive: Facts that cannot be added at all, e.g. a measurement of room temperature.

All measures that record static levels, such as account balance and inventory level, are non-additive across time. However, these measures may be usefully aggregated across time by averaging over the number of time periods.
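The behavior of a semi-additive measure can be checked with a small calculation. The account balances below are invented for the illustration.

```python
# Daily account balance snapshots: (date, account, balance).
snapshots = [
    ("2024-01-01", "A", 100.0), ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0), ("2024-01-02", "B", 50.0),
]

# Summing across accounts for a single day is meaningful.
total_on_day_one = sum(b for d, _, b in snapshots if d == "2024-01-01")

# Summing account A across days is not; average over the periods instead.
balances_a = [b for _, acct, b in snapshots if acct == "A"]
average_balance_a = sum(balances_a) / len(balances_a)
print(total_on_day_one, average_balance_a)
```

Adding the two daily balances of account A would give 220.0, a figure with no business meaning, while the average of 110.0 is a useful summary across time.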

3.2.4 Factless Fact Table

In some rare cases, a fact table may contain no measures. It may consist of nothing but keys. These are called factless fact tables.

The first type of factless fact table is a table that records an event. Many event-tracking tables in dimensional data warehouses turn out to be factless. As an example, consider tracking student attendance at a college. Imagine that you have a modern student tracking system that detects each student attendance event each day. The dimensions would include:

- Date: one record in this dimension for each day on the calendar
- Student: one record in this dimension for each student
- Course: one record in this dimension for each course taught each semester
- Teacher: one record in this dimension for each teacher
- Facility: one record in this dimension for each room, laboratory, or athletic field

The grain of the fact table is the individual student attendance event. When the student walks through the door into the lecture, a record is generated. The fact table record, consisting of just the five keys, is a good representation of the student attendance event. The only problem is that there is no obvious fact to record each time a student attends a lecture or suits up for physical education. Tangible facts such as the grade for the course don't belong in this fact table. This fact table represents the student attendance process, not the semester grading process or even the midterm exam process. Actually, this fact table consisting only of keys is a perfectly good fact table and probably ought to be left as is. A lot of interesting questions can be asked of this dimensional schema, including:

- Which classes were the most heavily attended?
- Which classes were the most consistently attended?
- Which teachers taught the most students?
- Which teachers taught classes in facilities belonging to other departments?
- Which facilities were the most lightly used?
- What was the average total walking distance of a student in a given day?
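Questions of this kind are answered by counting rows, since there is no measure to sum. In the sketch below the fact table is a plain list of key tuples, and all key values are made up.

```python
from collections import Counter

# Each tuple is one attendance event: (date_key, student_key, course_key,
# teacher_key, facility_key). There is no measure column.
attendance_fact = [
    (1, 101, 10, 5, 20), (1, 102, 10, 5, 20), (1, 101, 11, 6, 21),
    (2, 101, 10, 5, 20), (2, 103, 11, 6, 21),
]

# "Which classes were the most heavily attended?" becomes a count of rows.
attendance_by_course = Counter(course for _, _, course, _, _ in attendance_fact)

# "Which teachers taught the most students?" counts distinct students.
students_by_teacher = {}
for _, student, _, teacher, _ in attendance_fact:
    students_by_teacher.setdefault(teacher, set()).add(student)
student_counts = {t: len(s) for t, s in students_by_teacher.items()}

print(attendance_by_course.most_common(1), student_counts)
```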

A second kind of factless fact table is called a coverage table. Coverage tables are frequently needed when a primary fact table in a dimensional data warehouse is sparse. The figure below shows a simple sales fact table that records the sales of products in stores on particular days under each promotion condition. The sales fact table does answer many interesting questions but cannot answer questions about things that didn't happen. For instance, it cannot answer the question, "Which products were on promotion that didn't sell?" because it contains only the records of products that did sell.

In this case the coverage table is used. A record is placed in the coverage table for each product in each store that is on promotion in each time period. You need the full generality of a fact table to record which products are on promotion. In general, which products are on promotion varies by all of the dimensions of product, store, promotion, and time. This complex many-to-many relationship must be expressed as a fact table.

An alternative is to fill out the original fact table with records representing zero sales for all possible products. This is logically valid, but it would expand the fact table enormously, whereas the coverage factless fact table can be made much smaller. The coverage table needs to contain only the items on promotion; the items not on promotion that also did not sell can be left out. Also, the coverage table will be populated less frequently than the fact table.

Answering the question, "Which products were on promotion that did not sell?" requires a two-step application. First, consult the coverage table for the list of products on promotion on that day in that store. Second, consult the sales table for the list of products that did sell. The desired answer is the set difference between these two lists of products.
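The two-step application can be sketched with Python sets. The tuple layouts and key values are assumptions made for the example.

```python
# Coverage table rows: (date_key, store_key, product_key, promotion_key).
promotion_coverage = [(1, 1, 100, 9), (1, 1, 101, 9), (1, 1, 102, 9)]
# Sales fact rows: (date_key, store_key, product_key, promotion_key, quantity).
sales_fact = [(1, 1, 100, 9, 12), (1, 1, 102, 9, 3), (1, 1, 200, 0, 7)]


def promoted_but_unsold(date_key, store_key):
    # Step 1: products on promotion on that day in that store.
    promoted = {p for d, s, p, _ in promotion_coverage
                if (d, s) == (date_key, store_key)}
    # Step 2: products that did sell on that day in that store.
    sold = {p for d, s, p, _, _ in sales_fact
            if (d, s) == (date_key, store_key)}
    # The answer is the set difference between the two lists.
    return promoted - sold


print(promoted_but_unsold(1, 1))
```

Product 200 sold without being on promotion, so it appears in neither the coverage list nor the result.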

Coverage tables are also useful for recording the assignment of sales teams to customers in businesses in which the sales teams make occasional very large sales. In such a business, the sales fact table is too sparse to provide a good place to record which sales teams were associated with which customers. The sales team coverage table provides a complete map of the assignment of sales teams to customers, even if some of the combinations never result in a sale.

3.2.5 Aggregation

Aggregates are statistical summaries of a fact table. You can have multiple levels of aggregates based on the same fact table. The summarization is done over levels of different dimensions; for example, monthly sales is a summarization at the month level of the time dimension. This is typically done to improve query performance. Aggregated fact tables are also called summary tables.
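As a rough illustration, a monthly summary table could be built from a daily sales fact table as shown below. The sketch uses Python's sqlite3 module; the table and column names (date_dim, sales_fact, monthly_sales_agg) and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical time dimension with a month-level attribute.
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, month TEXT);
    -- Hypothetical base fact table at daily grain.
    CREATE TABLE sales_fact (date_key INT, product_key INT, dollars_sold REAL);

    INSERT INTO date_dim VALUES
        (20240115, '2024-01'), (20240120, '2024-01'), (20240203, '2024-02');
    INSERT INTO sales_fact VALUES
        (20240115, 100, 10.0), (20240120, 100, 15.0), (20240203, 100, 7.5);

    -- Summary table: the same fact rolled up to the month level of the time dimension.
    CREATE TABLE monthly_sales_agg AS
        SELECT d.month AS month,
               f.product_key AS product_key,
               SUM(f.dollars_sold) AS dollars_sold
        FROM sales_fact f
        JOIN date_dim d ON d.date_key = f.date_key
        GROUP BY d.month, f.product_key;
""")

monthly = conn.execute(
    "SELECT month, product_key, dollars_sold FROM monthly_sales_agg ORDER BY month"
).fetchall()
print(monthly)  # [('2024-01', 100, 25.0), ('2024-02', 100, 7.5)]
```

Queries that only need monthly totals can then read the much smaller summary table instead of scanning the base fact table.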

3.3 Dimensional Modeling Terminology

3.3.1 Hierarchies

A hierarchy is a set of attributes with a defined path for elemental browsing and drilling. Typically (but not always), hierarchies exist within specific dimensions.

Alternate hierarchy

Alternate hierarchies are very powerful. The preferred method for handling alternate hierarchies is to build the alternate hierarchy structure in columns to the right of the primary hierarchy. Not all leaf-level members (that is, primary keys of the dimension table) belong to an alternate hierarchy. For members that do not belong, the alternate hierarchy columns should be left NULL. A sample of a dimension with a primary hierarchy and an alternate hierarchy follows:
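A hypothetical product dimension laid out this way might look like the sketch below. The column names (brand and category for the primary hierarchy, gift_set for the alternate hierarchy) and the rows are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        brand        TEXT,  -- primary hierarchy: product -> brand -> category
        category     TEXT,
        gift_set     TEXT   -- alternate hierarchy column; NULL for non-members
    );
    INSERT INTO product_dim VALUES
        (1, 'Cocoa 250g', 'Acme',   'Beverages', 'Winter Box'),
        (2, 'Tea 100g',   'Acme',   'Beverages', 'Winter Box'),
        (3, 'Rice 1kg',   'Basics', 'Staples',   NULL);
""")

# Rolling up along the alternate hierarchy only sees its members.
alternate_rollup = conn.execute(
    "SELECT gift_set, COUNT(*) FROM product_dim "
    "WHERE gift_set IS NOT NULL GROUP BY gift_set"
).fetchall()
print(alternate_rollup)  # [('Winter Box', 2)]
```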

3.3.2 Browsing

Browsing is the act of navigating around a single dimension, either to gain an intuitive understanding of how the various attributes correlate with each other or to build a constraint on the dimension as a whole. Browsing often involves constraining one or more dimensional attributes and looking at the distinct values of another attribute in the presence of these constraints. For example, to see which of the stores in Kent County, Ohio have the new upgraded floor plan, we need to go through the store name, county, state, and floor plan fields in the Store dimension.

Browsing may or may not be across defined hierarchies. To support browsing, the dimension tables should remain as flat tables and not be normalized. Normalized dimension tables destroy the ability to browse.
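A browse of the Store dimension along these lines might be sketched as follows. The store_dim table, its columns, and the sample stores are hypothetical; the point is only that every query touches a single flat dimension table and no fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flat (denormalized) Store dimension.
    CREATE TABLE store_dim (
        store_key INTEGER PRIMARY KEY,
        store_name TEXT, county TEXT, state TEXT, floor_plan TEXT
    );
    INSERT INTO store_dim VALUES
        (1, 'Main St', 'Kent', 'OH', 'Upgraded'),
        (2, 'Mall',    'Kent', 'OH', 'Classic'),
        (3, 'Harbor',  'Lake', 'OH', 'Upgraded');
""")

# Browse: constrain county and state, then look at the distinct floor plans.
floor_plans = [row[0] for row in conn.execute(
    "SELECT DISTINCT floor_plan FROM store_dim "
    "WHERE county = ? AND state = ? ORDER BY floor_plan", ("Kent", "OH"))]
print(floor_plans)  # ['Classic', 'Upgraded']

# Tighten the constraint to list the stores with the upgraded plan.
upgraded_stores = [row[0] for row in conn.execute(
    "SELECT store_name FROM store_dim "
    "WHERE county = ? AND state = ? AND floor_plan = ?", ("Kent", "OH", "Upgraded"))]
print(upgraded_stores)  # ['Main St']
```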

3.3.3 Drilling

Drill down and drill up are common terms used when describing OLAP applications. When the user navigates from summary-level data to detail-level data, it is called drill down; the reverse is called drill up (or roll up). This is typically done across a hierarchy.

In dimensional modeling terms, this means adding or removing grouping columns from a query. Imagine creating a grouping column in a report by opportunistically dragging a dimension attribute from any of the dimension tables down into the report, thereby making it a grouping column (see figure below). All dimension attributes can become grouping columns (though they are typically part of the same hierarchy).
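The idea that drilling is just a change in the set of grouping columns can be sketched with a small helper. The sales_by function, the product_dim and sales_fact tables, and the data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT, brand TEXT);
    CREATE TABLE sales_fact (product_key INT, units_sold INT);
    INSERT INTO product_dim VALUES
        (1, 'Beverages', 'Acme'), (2, 'Beverages', 'Zest'), (3, 'Staples', 'Basics');
    INSERT INTO sales_fact VALUES (1, 5), (2, 3), (3, 4), (1, 2);
""")

# Only known dimension attributes may be used as grouping columns.
ALLOWED_GROUPING_COLUMNS = ("category", "brand")

def sales_by(grouping_columns):
    """Total units sold, grouped by the given dimension attributes."""
    for column in grouping_columns:
        if column not in ALLOWED_GROUPING_COLUMNS:
            raise ValueError(f"unknown grouping column: {column}")
    cols = ", ".join(f"p.{column}" for column in grouping_columns)
    sql = (
        f"SELECT {cols}, SUM(f.units_sold) "
        "FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

summary = sales_by(["category"])           # summary level
detail = sales_by(["category", "brand"])   # drill down: add a grouping column
print(summary)  # [('Beverages', 10), ('Staples', 4)]
print(detail)   # [('Beverages', 'Acme', 7), ('Beverages', 'Zest', 3), ('Staples', 'Basics', 4)]
```

Going from detail back to summary (drill up) is the same call with the brand column removed.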

Drilling Across

Drilling across is the process of linking two or more fact tables at the same granularity, or, in other words, tables with the same set of grouping columns and dimensional constraints. Drilling across is a valuable technique whenever a business has several fundamental business processes that can be arranged in a value chain. Each business process gets its own separate fact table. For example, almost all manufacturers have an obvious value chain representing the demand side of their businesses, consisting of finished goods inventory, orders, shipments, customer inventory, and customer sales. The figure shows how these fact tables are arranged in a sequence. The product and time dimensions thread through all of these fact tables. Some dimensions, such as customer ship to, thread through some, but not all, of the fact tables. For instance, customer ship to does not apply to finished goods inventory.

A drill across report can be created by using grouping columns that apply to all the fact tables used in the report. Thus in the example, attributes may be freely chosen from the product and time dimension tables because they make sense for every fact table. Attributes from customer ship to can only be used as grouping columns if we avoid touching the finished goods inventory fact table. When multiple fact tables are tied to a dimension table, the fact tables should all link to that dimension table. When we use precisely the same dimension table with each of the fact tables, we say that the dimension is "conformed" to each fact table. Dimensions that are not conformed (such as those that differ in grain or detail) across fact tables will defeat the drill across application.
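A drill across report over two fact tables that share a conformed product dimension might be sketched like this. Each fact table is queried separately with the same grouping column, and the answer sets are then combined outside the database. The table names (orders_fact, shipments_fact, product_dim) and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One conformed dimension shared by both fact tables.
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE orders_fact (product_key INT, units_ordered INT);
    CREATE TABLE shipments_fact (product_key INT, units_shipped INT);
    INSERT INTO product_dim VALUES (1, 'Acme'), (2, 'Zest'), (3, 'Basics');
    INSERT INTO orders_fact VALUES (1, 10), (2, 4);
    INSERT INTO shipments_fact VALUES (1, 8), (3, 6);
""")

# Separate query per fact table, each grouped by the same conformed attribute.
ordered = dict(conn.execute(
    "SELECT p.brand, SUM(f.units_ordered) FROM orders_fact f "
    "JOIN product_dim p ON p.product_key = f.product_key GROUP BY p.brand"))
shipped = dict(conn.execute(
    "SELECT p.brand, SUM(f.units_shipped) FROM shipments_fact f "
    "JOIN product_dim p ON p.product_key = f.product_key GROUP BY p.brand"))

# Combine the answer sets on the grouping column (an outer join in effect):
# a brand missing from one fact table shows None for that measure.
report = [
    (brand, ordered.get(brand), shipped.get(brand))
    for brand in sorted(ordered.keys() | shipped.keys())
]
print(report)  # [('Acme', 10, 8), ('Basics', None, 6), ('Zest', 4, None)]
```

The merge only lines up because both queries group by the same attribute of the same dimension table; with non-conformed dimensions the brand values could not be matched reliably.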

Drilling Around

The final variant of drilling is drilling around a value circle. This is similar to the linear value chain in the previous example, but occurs in a data warehouse where the related fact tables that share common dimensions are not arranged in a linear order. The best example is from health care, where as many as 10 separate entities are processing patient encounters and are sharing this information with one another. The figure shows a typical health care value circle with 10 separate entities surrounding the patient. Although this is not a value chain like manufacturing, the data warehouse issues of combining facts from separate fact tables across a single line of a report are very much the same as in the previous discussion. When the common dimensions are conformed and the requested grouping columns are drawn from dimensions that tie to all the fact tables in a given report, you can generate powerful drill around reports by performing separate queries on each fact table and outer joining the answer sets in the client tool.

Once you have set up multiple fact tables for either drilling across or drilling around, you can certainly drill up and down at the same time. In this case, you take the whole value chain, or value circle, and simultaneously ask all the fact tables for more granular data (drill down) or less granular data (drill up).

4 The Dimensional Modeling Design Process

4.1 Four Phases of Kimball's Approach for Designing a Dimensional Database

Ralph Kimball proposes a four-step process for designing any dimensional model. The four steps are:

1. Choose a business process to model. A business process is a major operational process in your organization that is supported by some kind of legacy system (or systems) from which data can be collected for the purpose of the data warehouse. Examples of business processes are orders, invoices, shipments, inventory, account administration, sales, and the general ledger.

2. Choose the grain of the business process. The grain is the fundamental atomic level of data to be represented in the fact table for this process. Typical grains are individual transactions, individual daily snapshots, or individual monthly snapshots. It is impossible to proceed to step 3 without defining the grain.

3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, product, customer, promotion, warehouse, transaction type, and status. With the choice of each dimension, describe all the discrete, text-like dimension attributes (fields) that fill out each dimension table.

4. Choose the measured facts that will populate each fact table record. Typical measured facts are numeric additive quantities like quantity sold and dollars sold.
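One way to picture the outcome of the four steps is as a small design record from which a fact table definition can be derived. The sketch below is illustrative only: the FactTableDesign class, its ddl method, and the sample retail sales design are assumptions for this example and are not part of Kimball's method.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class FactTableDesign:
    business_process: str   # step 1: the process being modeled
    grain: str              # step 2: what one fact row represents
    dimensions: list        # step 3: dimensions applying to each fact row
    facts: list             # step 4: numeric measured facts

    def ddl(self):
        """Build a CREATE TABLE statement: one key per dimension, one column per fact."""
        columns = [f"{name}_key INTEGER NOT NULL" for name in self.dimensions]
        columns += [f"{name} NUMERIC" for name in self.facts]
        return f"CREATE TABLE {self.business_process}_fact ({', '.join(columns)})"

# Hypothetical design for a retail sales process.
design = FactTableDesign(
    business_process="sales",
    grain="one row per product per store per day per promotion",
    dimensions=["date", "product", "store", "promotion"],
    facts=["quantity_sold", "dollars_sold"],
)

conn = sqlite3.connect(":memory:")
conn.execute(design.ddl())
columns = [row[1] for row in conn.execute("PRAGMA table_info(sales_fact)")]
print(columns)
# ['date_key', 'product_key', 'store_key', 'promotion_key', 'quantity_sold', 'dollars_sold']
```

The grain is carried as documentation here; in practice it determines which dimension keys together identify a fact row.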

4.2 The Data Warehouse Bus Architecture

The bus architecture, or matrix, is a planning methodology for the largest data warehouses, those with multiple data marts or dimensional models. It is a tool for technical planning as well as executive communication (with the users or business owners).

This matrix approach has been exceptionally effective for distributed data warehouses without a center. The matrix is simply a vertical list of data marts and a horizontal list of dimensions. The Figure below is an example matrix for the enterprise data warehouse of a large telecommunications company.

The Matrix Plan for the enterprise data warehouse of a large telecommunications company.

First-level data marts are directly derived from production applications. Second-level data marts are developed later and represent combinations of first-level data marts.

You start the matrix by listing all the first-level data marts. A first-level data mart is a collection of related fact tables and dimension tables that is typically:

- Derived from a single data source
- Supported and implemented by a single department
- Based on the most atomic data possible to collect from the source
- Conformed to the “data warehouse bus.”

A second-level data mart is a combination of two or more first-level marts. In most cases, a second-level mart is more than a simple union of data sets from the first-level marts. For example, a second-level profitability mart may result from a complex allocation process that associates costs from several first-level cost-oriented data marts onto products and customers contained in a first-level revenue mart.

The matrix planning technique helps you build an enterprise data warehouse, especially when the warehouse is a distributed combination of far-flung data marts. The matrix becomes a resource that is part technical tool, part project management tool, and part communication vehicle to senior management.
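A bus matrix can be represented very simply as a mapping from data marts to the dimensions they use. The sketch below uses made-up telecommunications marts and dimensions (they are not taken from the original figure) and prints the matrix with data marts as rows and dimensions as columns, then lists the dimensions shared by more than one mart, which are the ones that need to be conformed.

```python
# Hypothetical first-level data marts and the dimensions each one uses.
bus_matrix = {
    "customer_billing": {"date", "customer", "service", "rate_plan"},
    "call_detail": {"date", "customer", "service", "switch"},
    "network_outages": {"date", "switch"},
}

# Horizontal list of dimensions: every dimension used by any mart.
dimensions = sorted(set().union(*bus_matrix.values()))

# Print the matrix: one row per data mart, an X where the mart uses the dimension.
print("data mart".ljust(18) + "".join(d.ljust(11) for d in dimensions))
for mart, used in bus_matrix.items():
    print(mart.ljust(18) + "".join(("X" if d in used else "").ljust(11) for d in dimensions))

# Dimensions used by more than one mart must be conformed across those marts.
shared_by = {
    d: sorted(mart for mart, used in bus_matrix.items() if d in used)
    for d in dimensions
}
conformed_candidates = [d for d in dimensions if len(shared_by[d]) > 1]
print(conformed_candidates)  # ['customer', 'date', 'service', 'switch']
```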

5 Advanced Design

5.1 Indexing

a. Dimension table indexing

- Dimensions should be heavily indexed
- The attributes should generally have B-tree indexes
- Where the cardinality is low, use bitmap indexes

b. Fact table indexing

- Fact tables should be indexed carefully
- Build a single clustered index consisting of the dimension table foreign keys
- A B-tree index may be needed if the clustered dimension keys don’t ensure uniqueness
- Use caution when building more fact indexes
- Determine the primary key sort order
  - The leading column should be the time dimension key
  - Consider other leading index terms to best clump data in blocks and provide the fastest subsetting
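A few of these guidelines can be sketched with SQLite through Python's sqlite3 module. This is only an approximation: SQLite does not provide bitmap indexes, so they are not shown, and an ordinary composite index stands in for the clustered index. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT, state TEXT);
    CREATE TABLE sales_fact (date_key INT, product_key INT, store_key INT, units_sold INT);

    -- Dimension attributes that are often constrained get their own B-tree index.
    CREATE INDEX ix_store_dim_state ON store_dim (state);

    -- One composite index over the dimension foreign keys,
    -- with the time dimension key as the leading column.
    CREATE INDEX ix_sales_fact_keys ON sales_fact (date_key, product_key, store_key);
""")

# Ask SQLite how it would run a query that subsets the fact table by date.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(units_sold) FROM sales_fact WHERE date_key = 20240101"
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan_rows)  # last column holds the plan detail
print(plan_text)
```

Because the date key leads the index, a constraint on the date alone can use it; a constraint on store_key alone would generally not benefit from this index.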
