best dwh basics

7/27/2019 Best Dwh Basics


Dimension: A category of related information. For example, the Time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

or

An attribute represents a single type of information in a dimension. For example, Year is an attribute in the Time dimension.

Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year-->Quarter-->Month-->Day.

A dimensional data model contains two types of tables. They are:

1) Fact table: The fact table in a dimensional data model contains the measures of interest, that is, the measurements, metrics or facts of a business process. Take the sales amount of a business as an example. The amount can be a monthly sales number or a sales number for a day. This measure is stored in the fact table at the appropriate granularity.

For sales measures, a simple fact table might contain three columns: a date column, a store column and a sales amount column. Besides the measurements, the table also contains foreign keys to the dimension tables.

or

It contains numeric values and also a composite key (i.e. a collection of foreign keys).

E.g. sales and profit.

2) Dimension table: The dimension table in a dimensional model represents the context of the measurements. The context of a measurement can also be understood as its characteristics, such as the who, what, where, when and how of the measurement (subject).

For example, in the business process Sales, the characteristics of the 'monthly sales number' measurement would be Location (Where), Time (When) and Product Sold (What). A dimension table contains a number of dimension attributes or columns. In the Location dimension the attributes can be Location Code, State, Country and Zip code. Further, dimension attributes contain one or more hierarchical relationships.

or

It contains descriptive (typically character) values.

E.g. Customer_name, Customer_city.

What is dimensional modeling: A data model that keeps each dimension in its own table and the facts in a separate table (with the necessary relationships to all dimensions) is called a Dimensional Model. It is a de-normalized model because it is used for report generation. Data is loaded only through a scheduled and structured process (ETL), which in turn fetches data from relational / transactional data sources.
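The fact and dimension tables described above can be sketched with an in-memory SQLite database. This is only an illustrative sketch: the table names (fact_sales, dim_date, dim_store), the columns and the sample rows are assumptions made for the example, not part of the original notes.

```python
import sqlite3

# Hypothetical star-style model: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date  (date_key  INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, state TEXT);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    sales_amount REAL,                      -- the measure
    PRIMARY KEY (date_key, store_key)       -- composite key built from foreign keys
);
INSERT INTO dim_date  VALUES (1, '2019-07-26', '2019-07', 2019), (2, '2019-07-27', '2019-07', 2019);
INSERT INTO dim_store VALUES (10, 'Store A', 'Illinois'), (20, 'Store B', 'California');
INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 20, 250.0), (2, 10, 50.0);
""")

# A report joins the fact table to the dimension that gives the measure its context.
sales_by_state = conn.execute("""
    SELECT s.state, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.state
    ORDER BY s.state
""").fetchall()
print(sales_by_state)
```

The fact table holds only keys and the numeric measure, while the descriptive text (store name, state) lives in the dimension table.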

Ex: Here's a different way to look at dimensional modeling:

There are three basic styles of data models:

1) Conceptual data model: The conceptual data model is sometimes called the domain model. It is typically used for exploring domain concepts in an enterprise with the stakeholders of the project.

2) Logical data model: The logical model is used for exploring the domain concepts as well as their relationships. This model depicts the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities.

3) Physical data model: The physical data model is used in the design of the database's internal schema. As such, it depicts the tables, the data columns of those tables, and the relationships between the tables. This model represents the data design taking into account the facilities and constraints of a given database management system. The physical data model is often derived from the logical data model, although it can also be reverse-engineered from an existing database implementation.

Data/dimension modeling tools:

1.) Oracle Designer

2.) ERwin (Entity Relationship for Windows)

3.) Informatica (Cubes/Dimensions)

4.) Embarcadero

5.) PowerDesigner (Sybase)



Factless Fact Table: A fact table that contains only keys (i.e. foreign keys) but no measures (numerics) is known as a factless fact table.

or

A fact table having no facts is known as a factless fact table.

NOTE: Generally we use a factless fact table when we want to record that events happened, at the information level only, without including them in calculations. It simply holds information about events that happen over a period.
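A factless fact table can be sketched as follows. The attendance scenario, the table name fact_attendance and the key values are assumptions chosen for illustration. The table has no measure column, so the only thing a query can do with it is count the rows (events).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical factless fact table: only foreign keys, no numeric measure.
CREATE TABLE fact_attendance (
    date_key    INTEGER,
    student_key INTEGER,
    class_key   INTEGER,
    PRIMARY KEY (date_key, student_key, class_key)
);
INSERT INTO fact_attendance VALUES (1, 101, 7), (1, 102, 7), (2, 101, 7);
""")

# Each row records that an event happened; reporting is done by counting rows.
attendance_per_day = conn.execute(
    "SELECT date_key, COUNT(*) FROM fact_attendance GROUP BY date_key ORDER BY date_key"
).fetchall()
print(attendance_per_day)
```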

APPROACHES:

At the time of software integration the bottom-up approach is good, but at implementation time the top-down approach is good.

1) Top down: First we build the data warehouse, then we build the data marts. This needs more cross-functional skills, takes more time, and is also more costly.

ODS-->ETL-->Data warehouse-->Data mart-->OLAP

2) Bottom up: First we build the data marts, then the data warehouse. The data mart that is built first serves as a proof of concept for the others. It takes less time and costs less than the top-down approach.

ODS-->ETL-->Data mart-->Data warehouse-->OLAP

How do we maintain the primary key in a fact table?

In data warehousing we use surrogate keys, so that the value of a natural (business) key can change without breaking the relationships between tables. Suppose you have two tables, emp and dept, where deptno is the primary key of the dept table and is also used in the emp table as a foreign key. In this case we cannot freely modify the primary key value, because it is referenced as a foreign key in the emp table. That is why we need an extra column that has no business meaning. We add an extra ID column as a surrogate key. It carries no meaning of its own, but it is used to perform the joins between the two tables.
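The surrogate key idea can be sketched as below. The column names dept_sk and emp_sk and the sample values are hypothetical. The point is that the join runs on the meaningless surrogate key, so the natural key deptno can change without touching the emp rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (
    dept_sk INTEGER PRIMARY KEY,   -- surrogate key, no business meaning
    deptno  TEXT UNIQUE,           -- natural key from the source system
    dname   TEXT
);
CREATE TABLE emp (
    emp_sk  INTEGER PRIMARY KEY,   -- surrogate key
    ename   TEXT,
    dept_sk INTEGER REFERENCES dept(dept_sk)
);
INSERT INTO dept VALUES (1, 'D10', 'Sales');
INSERT INTO emp  VALUES (1, 'Asha', 1), (2, 'Ravi', 1);
""")

# The business renumbers the department: only the natural key changes.
conn.execute("UPDATE dept SET deptno = 'D99' WHERE dept_sk = 1")

# The join still works because it uses the surrogate key.
emp_with_dept = conn.execute("""
    SELECT e.ename, d.deptno
    FROM emp e JOIN dept d ON d.dept_sk = e.dept_sk
    ORDER BY e.ename
""").fetchall()
print(emp_with_dept)
```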

What is the difference between an aggregate table and a fact table?

A fact table contains millions of records, and retrieving records from it takes time. An aggregate table, by contrast, contains a limited amount of summarized data from the required tables, so retrieving data from it takes less time.

What is the difference between an aggregate table and a materialized view?

Aggregate tables are pre-computed totals in the form of a hierarchical multidimensional structure, whereas a materialized view is a database object that caches a query result in a concrete table and refreshes it from the original database tables from time to time. Aggregate tables are used to speed up query computation, whereas materialized views speed up data retrieval.

What is an aggregate table and an aggregate fact table?

An aggregate table contains summarized data. A materialized view can serve as an aggregate table.

For example, in sales we have only date-level transactions. If we want to create a report such as sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use the @aggregate function.
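Building such aggregate tables from a date-level fact table can be sketched like this. The month_agg and year_agg names follow the text above, while fact_sales, its columns and the sample rows are assumptions. SQLite's strftime is used here only as one convenient way to derive the month and year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (sale_date TEXT, product TEXT, amount REAL);
INSERT INTO fact_sales VALUES
    ('2018-12-30', 'pen',  5.0),
    ('2019-01-02', 'pen', 10.0),
    ('2019-01-15', 'ink', 20.0),
    ('2019-02-01', 'pen', 30.0);

-- Pre-computed summaries at coarser grains than the fact table.
CREATE TABLE month_agg AS
    SELECT strftime('%Y-%m', sale_date) AS sale_month, product, SUM(amount) AS amount
    FROM fact_sales GROUP BY sale_month, product;
CREATE TABLE year_agg AS
    SELECT strftime('%Y', sale_date) AS sale_year, product, SUM(amount) AS amount
    FROM fact_sales GROUP BY sale_year, product;
""")

# The yearly report reads the small aggregate table instead of scanning the fact table.
sales_by_product_per_year = conn.execute(
    "SELECT sale_year, product, amount FROM year_agg ORDER BY sale_year, product"
).fetchall()
print(sales_by_product_per_year)
```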

Schemas: The schema is chosen depending on the requirement. In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Whether one uses a star or a snowflake largely depends on personal preference and business needs.

Some points on star schema:

1) A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

2) In a star schema, the fact table is in normalized format and the dimension tables are in de-normalized format.

3) If performance is the priority then go for star schema, since here the dimension tables are de-normalized.

4) We use star schema when the query involves few joins and for better performance. Here the data is de-normalized.

Some points on snowflake schema:

1) In a snowflake schema, both the dimension and fact tables are in normalized format. It is also known as an extended star schema.

2) A snowflake schema requires more dimension tables and more foreign keys. This reduces query performance, but it normalizes the records.

3) If storage space is the priority then go for snowflake schema, since here the dimension tables are normalized.

4) For complex joins we go for snowflake. Performance is a little slower due to the number of joins. Here the data is normalized.
http://www.1keydata.com/datawarehousing/star-schema.html
http://www.1keydata.com/datawarehousing/snowflake-schema.html



Difference between Snowflake and Star Schema:

1) Star schema means a centralized fact table surrounded by different dimensions.

2) A star schema contains highly de-normalized data.

3) A star schema cannot have parent tables.

4) Why go for star schema:

a) it contains fewer joins

b) it is a simpler database

c) it supports drill-up options

1) Snowflake means that, in the same star schema, dimensions are split into further dimensions.

2) A snowflake schema contains partially normalized data.

3) A snowflake schema contains parent tables.

4) Why go for snowflake schema:

Sometimes we need to provide separate dimensions derived from existing dimensions; in that case we go for snowflake.

Disadvantage of snowflake:

Query performance is lower because there are more joins.

Star Schema Definition: The star schema is the simplest data warehouse schema.

It is called a star schema because the diagram resembles a star with points radiating

from a center.



Advantages:

Simplest DW schema

Easy to understand

Easy to navigate between the tables due to the smaller number of joins.

Most suitable for Query processing

Disadvantages:

Occupies more space

Highly De-normalized

Snowflake schema Definition: A snowflake schema is a data warehouse schema which consists of a single fact table and multiple dimension tables. These dimension tables are normalized. It is a variant of the star schema where each dimension can have its own dimensions.
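The difference between the two layouts can be sketched side by side. All table and column names here (dim_product_star, dim_product_snow, dim_category, fact_sales) are hypothetical. The same report needs one join against the star-style dimension and two joins against the snowflaked one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one de-normalized product dimension (category repeated per product).
CREATE TABLE dim_product_star (product_key INTEGER PRIMARY KEY, product TEXT, category TEXT);

-- Snowflake: the category attribute is split out into its own table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_snow (
    product_key  INTEGER PRIMARY KEY,
    product      TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);

CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

INSERT INTO dim_product_star VALUES (1, 'pen', 'stationery'), (2, 'ink', 'stationery'), (3, 'mug', 'kitchen');
INSERT INTO dim_category     VALUES (1, 'stationery'), (2, 'kitchen');
INSERT INTO dim_product_snow VALUES (1, 'pen', 1), (2, 'ink', 1), (3, 'mug', 2);
INSERT INTO fact_sales       VALUES (1, 10.0), (2, 20.0), (3, 5.0);
""")

# Star: a single join reaches the category.
star_rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star p ON p.product_key = f.product_key
    GROUP BY p.category ORDER BY p.category
""").fetchall()

# Snowflake: an extra join is needed to reach the same attribute.
snow_rows = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_key = f.product_key
    JOIN dim_category c     ON c.category_key = p.category_key
    GROUP BY c.category ORDER BY c.category
""").fetchall()
print(star_rows, snow_rows)
```

Both queries return the same totals; the snowflake version stores each category name once at the cost of the additional join.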



Advantages:

These tables are easier to maintain

Saves the storage space.

Disadvantages:

Due to the large number of joins, it is complex to navigate.

Types of schemas:

1) Star Schema: A schema in which a central fact table connects to a number of individual dimension tables is called a star schema.

It contains fewer joins, so performance is better.

A star schema contains de-normalized data.

2) Snowflake Schema: A schema in which one dimension table is split into more than one dimension table is known as a snowflake schema.

It contains normalized data.

There are more joins in a snowflake schema, so performance degrades.

3) Galaxy Schema: A galaxy schema is also known as a fact constellation schema. It contains a number of fact tables and dimension tables.

4) Star flake schema: A hybrid structure that contains a mixture of (de-normalized) star and (normalized) snowflake schemas.

NOTE: In real-world projects, when we want to use an existing data warehouse as the source, we mainly go for the snowflake schema.

Types of Facts:

1) Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

2) Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

Eg: Bank balances. A bank account balance is semi-additive, since the current balance of an account cannot be summed across time periods; but if you want to see the current balance of the bank, you can sum the current balances of all accounts.

3) Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Eg: Ratios, Averages & Variance
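The semi-additive bank balance case can be worked through with a few made-up rows. The account names, dates and amounts below are hypothetical; the sketch only shows which sums are meaningful.

```python
# Hypothetical daily balance snapshots: (day, account, balance).
balances = [
    ("2019-07-26", "A", 100.0), ("2019-07-26", "B", 50.0),
    ("2019-07-27", "A", 120.0), ("2019-07-27", "B", 50.0),
]

# Summing across accounts for one day is meaningful: the bank's total on that day.
bank_total_27 = sum(bal for day, _, bal in balances if day == "2019-07-27")

# Summing one account across days is NOT meaningful (the money is counted twice).
misleading_sum_a = sum(bal for _, acct, bal in balances if acct == "A")

# Along the time dimension, take the latest balance (or an average) instead.
latest_a = max((day, bal) for day, acct, bal in balances if acct == "A")[1]

print(bank_total_27, misleading_sum_a, latest_a)
```

Here the sum over accounts gives 170.0, while the sum over days for account A gives 220.0, a figure that never existed in the account; the latest balance, 120.0, is the usable value.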

Types of Fact Tables:

1) Cumulative: This type of fact table describes what has happened over a period of time.

For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

2) Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Types of dimension tables:

There are many types of dimension tables. The commonly used ones are:

1) Conformed dimension

2) Junk dimension

3) Degenerate dimension

4) Slowly changing dimension

5) Rapidly changing dimension

The others are:

6) Virtual dimension

7) Regular dimension

8) Causal dimension

9) Shared dimension

10) Monster dimension

11) Inferred dimension

12) Role-playing dimension

13) Shrunken dimension

14) Outriggers

15) Static dimension

Slowly Changing Dimension: Attributes of a dimension that change only rarely over time.

Ex: Customer Name, Sex

or

A slowly changing dimension (SCD) is a dimension whose values change with respect to time or period.

Ex: The employee with employee id e23321 is presently in Hyderabad. After a month he is relocated to Bangalore. Then we can say the address dimension is an SCD with respect to time.

Rapidly Changing Dimension: Attributes of a dimension that change frequently.

or

A rapidly changing dimension is one where the dimension values change quickly.

Ex: ATM transactions (banks). The data changes continuously and concurrently every second, so it is very difficult to capture these dimensions.

Conformed Dimension: A dimension table used by two or more fact tables.

Ex: Date dimension

or

A conformed dimension is a dimension which is connected to or shared by more than one fact table.

Eg: For a business which handles both sales and orders of products, the product dimension becomes a conformed dimension for both the sales fact and the order fact.
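A shared product dimension can be sketched as follows. The names dim_product, fact_sales and fact_orders and the quantities are assumptions for illustration. Each fact table is aggregated on its own and the two results are then lined up through the common dimension key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT);
CREATE TABLE fact_sales  (product_key INTEGER, sold_qty INTEGER);
CREATE TABLE fact_orders (product_key INTEGER, ordered_qty INTEGER);
INSERT INTO dim_product VALUES (1, 'pen'), (2, 'ink');
INSERT INTO fact_sales  VALUES (1, 3), (1, 2), (2, 1);
INSERT INTO fact_orders VALUES (1, 10), (2, 4), (2, 6);
""")

# Aggregate each fact table separately, then join both results on the shared dimension.
sales_vs_orders = conn.execute("""
    SELECT p.product, s.sold, o.ordered
    FROM dim_product p
    JOIN (SELECT product_key, SUM(sold_qty) AS sold
          FROM fact_sales GROUP BY product_key) s ON s.product_key = p.product_key
    JOIN (SELECT product_key, SUM(ordered_qty) AS ordered
          FROM fact_orders GROUP BY product_key) o ON o.product_key = p.product_key
    ORDER BY p.product
""").fetchall()
print(sales_vs_orders)
```

Aggregating before joining avoids multiplying rows when both fact tables have several rows per product.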



Degenerate Dimension: A dimension whose value is stored in the fact table instead of in a dimension table.

or

Data items that are not facts and that do not fit into the existing dimensions are termed degenerate dimensions. Degenerate dimensions are used when fact tables represent transactional data. They can be used as the primary key of the fact table, but they cannot act as foreign keys.

For example, in a sales fact table, Invoice Number is a degenerate dimension. Since Invoice Number is not tied to an order header table, there is no need for it to join to a dimension table; hence it is referred to as a degenerate dimension.

Junk Dimension: A table that combines different and unrelated attributes in order to reduce the number of primary key / foreign key relationships.

Ex: student attendance tracking

or

Data that is not directly useful for report generation is placed in a separate table, and that table is known as a junk dimension. Generally it is used to provide extra information.

Ex: Any yes/no style status is an example for a junk dimension.
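One common way to build a junk dimension is to enumerate every combination of a few low-cardinality flags and give each combination a single key. The flag names and values below are hypothetical; the sketch just shows three unrelated attributes collapsing into one foreign key on the fact row.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise each need their own key.
flag_values = {
    "is_gift": ("N", "Y"),
    "is_online": ("N", "Y"),
    "payment_type": ("cash", "card"),
}

# One junk-dimension row (and one key) per combination of flag values.
junk_dim = {
    combo: key
    for key, combo in enumerate(product(*flag_values.values()), start=1)
}

def junk_key(is_gift, is_online, payment_type):
    """Look up the single junk-dimension key to store on a fact row."""
    return junk_dim[(is_gift, is_online, payment_type)]

key_first = junk_key("N", "N", "cash")
key_last = junk_key("Y", "Y", "card")
print(len(junk_dim), key_first, key_last)
```

With 2 x 2 x 2 values the junk dimension has 8 rows, and the fact table carries one key column instead of three.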

Differences between OLTP and OLAP are:

OLTP: Online Transactional Processing, which deals with transactions.

For e.g. withdrawals at ATM machines. It involves many transactions. The databases have to be updated frequently, after the successful completion of each transaction.

1) Customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.

2) Manages current data, very detail-oriented.

3) Adopts an entity relationship (ER) model and an application-oriented database design.

4) Focuses on the current data within an enterprise or department.

5) It uses E-R modeling, and there are more concurrent users.

6) It contains normalized tables, so there is no redundancy.

7) More tables and joins, and fewer indexes.

8) It stores daily transactional data.

9) It stores a comparatively small amount of data.

10) It contains mainly current data.

11) INSERT, UPDATE and DELETE can be applied on OLTP.

12) Performance will be high.

13) Users of OLTP: clerks, DBAs.

14) OLTP is a transactional process.

15) Number of users of OLTP: around 1000.

OLAP: Online Analytical Processing, which deals with analysis of data. It has to deal with historical data too (for analysis purposes). It is not updated frequently. If required, bulk updates are allowed.

1) Market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).

2) Manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision making process.

3) Adopts a star, snowflake or fact constellation model and a subject-oriented database design.

4) Spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores.

5) It uses dimensional modeling.

6) It contains de-normalized tables, so there will be redundancy.

7) Fewer tables and joins, and more indexes.

8) It stores data consolidated from operational systems.

9) It contains historical and present data.

10) Mainly SELECT statements are applied on OLAP.

11) It stores a very large amount of data.

12) Performance will be low compared with OLTP.

13) OLAP is an analytical process.

14) Users of OLAP: knowledge workers

1) Managers

2) Analysts

15) Number of users of OLAP: around 100.

Types of OLAP: OLAP (Online Analytical Processing) is a set of specifications which allows client applications to retrieve data from the data warehouse for analytical processing. There are four types of OLAP:

1.) DOLAP (Desktop OLAP): OLAP which communicates with desktop databases to retrieve the data is called DOLAP.

Ex: Cognos, Business Objects tools.

2.) ROLAP (Relational OLAP): OLAP which communicates with relational databases to retrieve the data is called ROLAP.

Ex: Cognos ReportNet, Business Objects, MicroStrategy, Hyperion

3.) MOLAP (Multidimensional OLAP): OLAP which communicates with multidimensional databases to retrieve the data is called MOLAP.

Ex: Cognos, Hyperion

4.) HOLAP (Hybrid OLAP): OLAP which uses the combined features of ROLAP and MOLAP is called HOLAP.

Ex: Cognos

OLAP Query:

Roll-up: displays data at an increasing level of aggregation.

Drill-down: displays more detail by moving down a dimension hierarchy.

Pivot: produces a cross tabulation.

Slice and dice: makes a selection (for example, a range) on one or more dimensions.
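The four operations above can be illustrated on a tiny in-memory data set. This is a sketch in plain Python rather than an OLAP engine, and the rows, the helper name aggregate and the column order (year, quarter, region, amount) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, amount).
rows = [
    (2018, "Q4", "East", 10.0),
    (2019, "Q1", "East", 20.0),
    (2019, "Q1", "West", 5.0),
    (2019, "Q2", "West", 15.0),
]

def aggregate(rows, key):
    """Sum the amount for each distinct value of key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-down: the detailed level of the Time hierarchy (year, quarter).
by_quarter = aggregate(rows, lambda r: (r[0], r[1]))

# Roll-up: aggregate one level higher (quarter --> year).
by_year = aggregate(rows, lambda r: r[0])

# Slice: fix a single value on one dimension.
east_only = [r for r in rows if r[2] == "East"]

# Dice: select on two dimensions at once.
west_2019 = [r for r in rows if r[0] == 2019 and r[2] == "West"]

# Pivot: cross tabulation with years as rows and regions as columns.
pivot = defaultdict(dict)
for (year, region), total in aggregate(rows, lambda r: (r[0], r[2])).items():
    pivot[year][region] = total

print(by_year, dict(pivot))
```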

Snapshot: A snapshot is a copy of data. When we create a snapshot, it copies the exact data related to a particular report. We use snapshots whenever we want to compare reports (e.g. we want to compare this month's report with the previous month's).

Differences between a Data Warehouse and a Data Mart:

Category              Data Warehouse     Data Mart
Scope                 Corporate          Line of Business (LOB)
Subject               Multiple           Single subject
Data Sources          Many               Few
Size (typical)        100 GB-TB+         < 100 GB
Implementation Time   Months to years    Months

Slowly changing dimension: If the data in a dimension table changes very rarely, it is called a slowly changing dimension.

Ex: Changing the name and address of a person, which happens rarely.

The price of a product, the address of a person and the name of a city are a few examples of SCD.

This change can be implemented in three ways:

Type I: Replace the old record with a new record containing the updated data. By doing so we lose the history.

Type II: Create an additional dimension table record with the new value. In this way we keep the history. We can determine which record is current by adding a current record flag or a time stamp to the dimension row.

Type III: In this type of implementation we create a new field in the dimension table which stores the old value of the attribute. When an attribute of the dimension changes, we push the updated value to the current field and the old value to the old field.

In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

Customer Key   Name        State
1001           Christina   California

Advantages:

This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:

All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

Usage:

About 50% of the time.

When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
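A Type 1 change amounts to a plain in-place update. The sketch below uses the Christina row from the example; the table name dim_customer and the helper scd1_update are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO dim_customer VALUES (1001, 'Christina', 'Illinois')")

def scd1_update(conn, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    conn.execute(
        "UPDATE dim_customer SET state = ? WHERE customer_key = ?",
        (new_state, customer_key),
    )

scd1_update(conn, 1001, "California")
type1_rows = conn.execute("SELECT * FROM dim_customer").fetchall()
print(type1_rows)
```

After the update there is still exactly one row, and nothing in the table records that the state used to be Illinois.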

In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

After Christina moved from Illinois to California, we add the new information as a new row into the table:

Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California

Advantages:

This allows us to accurately keep all historical information.

Disadvantages:

This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.

This necessarily complicates the ETL process.

Usage:

About 50% of the time.

When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
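A Type 2 change can be sketched as "expire the current row, insert a new version". The keys 1001 and 1005 come from the example above. The natural key column customer_id, the is_current flag, the effective_date column, the initial date and the helper scd2_update are assumptions added so that the versions of one customer can be tied together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id    TEXT,                 -- hypothetical natural key
    name           TEXT,
    state          TEXT,
    effective_date TEXT,
    is_current     INTEGER               -- 1 = current version, 0 = history
)""")
conn.execute(
    "INSERT INTO dim_customer VALUES (1001, 'C-1', 'Christina', 'Illinois', '2000-01-01', 1)"
)

def scd2_update(conn, new_key, customer_id, new_state, effective_date):
    """Type 2: keep the old row as history and add a new current row."""
    name = conn.execute(
        "SELECT name FROM dim_customer WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE dim_customer SET is_current = 0 WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    )
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, ?, ?, 1)",
        (new_key, customer_id, name, new_state, effective_date),
    )

scd2_update(conn, 1005, "C-1", "California", "2003-01-15")
type2_rows = conn.execute(
    "SELECT customer_key, state, is_current FROM dim_customer ORDER BY customer_key"
).fetchall()
print(type2_rows)
```

Fact rows loaded before the move keep pointing at key 1001 and fact rows loaded afterwards use 1005, which is how the history stays queryable.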

In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

In our example, we originally have the following table:

Customer Key   Name        State
1001           Christina   Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

Customer Key

Name

Original State

Current State

Effective Date

After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003

Advantages:

This does not increase the size of the table, since new information is updated.

This allows us to keep some part of history.

Disadvantages:

Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

Usage:

Type 3 is rarely used in actual practice.

When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
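The Type 3 behaviour, including the loss of the California value on a second move, can be sketched as below. The column layout follows the example table; the table name dim_customer, the helper scd3_update and the ISO date format are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,
    name           TEXT,
    original_state TEXT,
    current_state  TEXT,
    effective_date TEXT
)""")
conn.execute(
    "INSERT INTO dim_customer VALUES (1001, 'Christina', 'Illinois', 'Illinois', NULL)"
)

def scd3_update(conn, customer_key, new_state, effective_date):
    """Type 3: overwrite only the current-value column; the original column is untouched."""
    conn.execute(
        "UPDATE dim_customer SET current_state = ?, effective_date = ? WHERE customer_key = ?",
        (new_state, effective_date, customer_key),
    )

scd3_update(conn, 1001, "California", "2003-01-15")
after_first_move = conn.execute("SELECT * FROM dim_customer").fetchall()

# A second change overwrites California; only the original and latest values survive.
scd3_update(conn, 1001, "Texas", "2003-12-15")
after_second_move = conn.execute("SELECT * FROM dim_customer").fetchall()
print(after_first_move, after_second_move)
```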
