dwh concepts

Data Warehousing Concepts 06/01/2010 BFS 4/YORKSHIRE BUILDING SOCIETY MANOJ I BHADIYADRA [email protected]

Agenda

• What is Data Warehouse?

• What is Data Model?

• Data Warehouse Architecture

• ETL

• What is Dimension and Fact?

• What is Star Schema?

• What is Snow Flake Schema?

• What is Galaxy Schema?

• Advantages of using Star, Snow flake and Galaxy Schemas

• What is Primary Key and Surrogate Key?

• What are Rollup and Drill-Down operations

• Design Tips

What is Data Warehouse?

• A single, complete, and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

• A collection of data designed to support the management decision-making process. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time.

• A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

What is Data Model?

The logical data structure developed during the logical database design process is a data model or entity model. It is also a description of the structural properties that define all entities represented in a database and all the relationships that exist among them.

OR

A structured way of viewing a set of data: the design of the tables and their corresponding relationships in a relational database.

Data Warehouse Architecture

(Diagram: Operational Databases and External Data Sources are loaded through ETL into the EDW; the EDW feeds the Data Marts dm1 to dm6, which feed Reports.)

ETL

• Short for Extract, Transform, Load, three database functions that are combined into one tool to pull data out of one schema and place it into another database/schema.

• Extract -- the process of reading data from a database.

• Transform -- the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database/schema.

• Load -- the process of writing the data into the target database/schema.
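The three steps above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a description of any particular ETL tool: the tables orders_raw and orders_clean, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

def extract(source):
    # Extract: read the raw rows from the (hypothetical) source table.
    return source.execute(
        "SELECT order_id, customer, amount_text FROM orders_raw"
    ).fetchall()

def transform(rows):
    # Transform: tidy the customer name and convert the amount from text to a number.
    return [
        (order_id, customer.strip().title(), float(amount_text))
        for order_id, customer, amount_text in rows
    ]

def load(target, rows):
    # Load: write the transformed rows into the (hypothetical) target table.
    target.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    target.commit()

# Source database with inconsistent formatting (made-up sample data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount_text TEXT)")
source.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [("o100", " joe ", "12"), ("o102", "JOE", "11")],
)

# Target database/schema.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_clean (order_id TEXT, customer TEXT, amount REAL)")

load(target, transform(extract(source)))
loaded = target.execute(
    "SELECT order_id, customer, amount FROM orders_clean ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each step in its own function lets the transform logic be tested without touching either database.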

What is Dimension and Fact?

• Dimension:

A user typically needs to evaluate or analyze some aspect of the organization’s business. The requirements that have been collected must represent the two key elements of this analysis: what is being analyzed, and the evaluation criteria for what is being analyzed. The evaluation criteria are referred to as measures (numeric attributes of a fact), and what is being analyzed is referred to as dimensions (descriptive attributes of a fact).

• Fact:

The fact table contains IDs that reference the dimension tables, and measures that capture the change or performance of the dimension members.

What is Star Schema?

Fact table:

sale (orderId, date, custId, prodId, storeId, qty, amt)

Dimension tables:

customer (custId, name, address, city)

product (prodId, name, price)

store (storeId, city)

Star Schema With Data

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50
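As a sketch of how a star schema is queried, the snippet below loads the sample rows above into an in-memory SQLite database and joins the sale fact table to two of its dimensions. The choice of SQLite, the column types, and the particular query (total amount by customer city and product name) are assumptions made for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with IDs pointing at the dimension tables.
CREATE TABLE sale (
    orderId TEXT, date TEXT,
    custId INTEGER REFERENCES customer(custId),
    prodId TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty INTEGER, amt INTEGER
);
""")
db.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
db.executemany("INSERT INTO product VALUES (?, ?, ?)", [("p1", "bolt", 10), ("p2", "nut", 5)])
db.executemany("INSERT INTO store VALUES (?, ?)", [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
db.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("105", "3/8/97", 111, "p1", "c3", 5, 50),
])

# Measures (amt) come from the fact table; descriptive attributes come from the dimensions.
amount_by_city_and_product = db.execute("""
    SELECT c.city, p.name, SUM(s.amt)
    FROM sale s
    JOIN customer c ON s.custId = c.custId
    JOIN product  p ON s.prodId = p.prodId
    GROUP BY c.city, p.name
    ORDER BY c.city, p.name
""").fetchall()
print(amount_by_city_and_product)
```

Every dimension is exactly one join away from the fact table, which is what gives the star schema its shape.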

Sample Star Schema Structure

Dimension Hierarchies

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Snowflake schema (diagram): store references sType and city; city references region.
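The hierarchy above can be walked with a chain of joins. The sketch below, which assumes SQLite and text column types purely for illustration, loads the store, city, and region rows from the tables above and looks up the region name for each store.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT REFERENCES region(regId));
CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT REFERENCES city(cityId),
                     tId TEXT, mgr TEXT);
""")
db.executemany("INSERT INTO region VALUES (?, ?)",
               [("north", "cold region"), ("south", "warm region")])
db.executemany("INSERT INTO city VALUES (?, ?, ?)",
               [("sfo", "1M", "north"), ("la", "5M", "south")])
db.executemany("INSERT INTO store VALUES (?, ?, ?, ?)",
               [("s5", "sfo", "t1", "joe"), ("s7", "sfo", "t2", "fred"), ("s9", "la", "t1", "nancy")])

# In a snowflake schema the region is two joins away from the store:
# store -> city -> region.
store_regions = db.execute("""
    SELECT s.storeId, r.name
    FROM store s
    JOIN city   c ON s.cityId = c.cityId
    JOIN region r ON c.regId  = r.regId
    ORDER BY s.storeId
""").fetchall()
print(store_regions)
```

The region name is stored once per region rather than once per store, which is the reduced redundancy of the snowflake design; the price is the extra join.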

Data Dimension

time
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000

Hierarchy levels (diagram): all, years, quarters, months, weeks, days.
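Rows like those in the time table above are usually generated rather than typed in. The following is a minimal sketch under stated assumptions: day is taken to be the day of the year, and week is counted in simple seven-day blocks from 1 January (not ISO week numbers). The function name build_time_dimension is hypothetical.

```python
from datetime import date, timedelta

def build_time_dimension(start, num_days):
    """Return (day, week, month, quarter, year) rows for consecutive calendar days."""
    rows = []
    for offset in range(num_days):
        d = start + timedelta(days=offset)
        day = d.timetuple().tm_yday      # day of the year, starting at 1
        week = (day - 1) // 7 + 1        # assumption: 7-day blocks from 1 January
        quarter = (d.month - 1) // 3 + 1
        rows.append((day, week, d.month, quarter, d.year))
    return rows

time_rows = build_time_dimension(date(2000, 1, 1), 8)
for row in time_rows:
    print(row)
```

With a start date of 1 January 2000 and eight days, this reproduces the eight sample rows of the time table.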

What is Galaxy Schema?

The Galaxy Schema, or "Multiple Fact Table Schema", is composed of multiple fact tables that partially share the same dimension tables.

In a galaxy schema, two or more related fact tables are surrounded by common dimensions.

Advantages

• The benefit of a star schema is that it is simpler than snowflake and galaxy schemas, making it easier for the ETL processes to load the data into the Dimensional Data Store (DDS).

• The benefit of a snowflake schema is less redundancy, so less disk space is required.

• The benefit of a galaxy schema is the ability to model business events more accurately by using several fact tables.

Galaxy Schema

Fact tables:

Sales Fact (Sales_Id, Prod_Id, Cust_Id, Sale_Price, Quantity, Time_Id, Area_Id)

Purchase Fact (Purchase_Id, Supplier_Id, Prod_Id, Purchase_Price, Quantity, Time_Id)

Dimension tables:

Customer (Cust_ID, Cust_Name, Cust_State)

Area (Area_ID, Area_Name)

Time (Time_Id, Day, Week, Month, Year)

Product (Product_Id, Name, Type_Name, Prod_Brand, Size, Colour_Name)

Supplier (Supplier_Id, Supplier_Name, Supplier_Category)
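To illustrate how two fact tables are combined through a shared dimension, the sketch below builds a cut-down version of the galaxy schema above in SQLite. Only the Product dimension and a few columns of each fact table are kept, and all rows are made-up sample data. Each fact table is aggregated on its own before joining; joining the two fact tables row by row would multiply rows and inflate the totals.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product       (Product_Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE sales_fact    (Sales_Id INTEGER PRIMARY KEY, Prod_Id INTEGER,
                            Sale_Price INTEGER, Quantity INTEGER);
CREATE TABLE purchase_fact (Purchase_Id INTEGER PRIMARY KEY, Prod_Id INTEGER,
                            Purchase_Price INTEGER, Quantity INTEGER);
""")
# Hypothetical sample rows.
db.executemany("INSERT INTO product VALUES (?, ?)", [(1, "bolt"), (2, "nut")])
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
               [(1, 1, 10, 3), (2, 1, 10, 2), (3, 2, 5, 4)])
db.executemany("INSERT INTO purchase_fact VALUES (?, ?, ?, ?)",
               [(1, 1, 6, 5), (2, 2, 3, 4)])

# Aggregate each fact table per product, then meet at the shared Product dimension.
revenue_and_cost = db.execute("""
    SELECT p.Name, s.revenue, b.cost
    FROM product p
    JOIN (SELECT Prod_Id, SUM(Sale_Price * Quantity) AS revenue
          FROM sales_fact GROUP BY Prod_Id) s ON s.Prod_Id = p.Product_Id
    JOIN (SELECT Prod_Id, SUM(Purchase_Price * Quantity) AS cost
          FROM purchase_fact GROUP BY Prod_Id) b ON b.Prod_Id = p.Product_Id
    ORDER BY p.Name
""").fetchall()
print(revenue_and_cost)
```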

PRIMARY KEY

• Definition: The primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS.

Primary keys may consist of a single attribute or multiple attributes in combination.

SURROGATE KEY

• A unique primary key generated by the RDBMS that is not derived from any data in the database and whose only significance is to act as the primary key. A surrogate key is frequently a sequential number.
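A minimal sketch of a surrogate key, assuming SQLite: the database generates customer_key as a sequential number, while the identifier carried over from the source system is kept as an ordinary unique column. The table name customer_dim and its columns are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, generated by the DBMS
        source_cust_id TEXT UNIQUE,                      -- natural key from the source system
        name TEXT
    )
""")

# The surrogate key is never supplied by the loader; the database assigns it.
for source_cust_id, name in [("53", "joe"), ("81", "fred")]:
    db.execute(
        "INSERT INTO customer_dim (source_cust_id, name) VALUES (?, ?)",
        (source_cust_id, name),
    )

dim_rows = db.execute(
    "SELECT customer_key, source_cust_id, name FROM customer_dim ORDER BY customer_key"
).fetchall()
print(dim_rows)
```

Fact rows would then reference customer_key, so a change to the source system's identifiers does not ripple through the fact table.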

ROLAP VS MOLAP

ROLAP: Relational On-Line Analytical Processing

MOLAP: Multi-Dimensional On-Line Analytical Processing

Roll Up AND Drill-Down

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Rolled up over storeId:

sale
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

• Add up amounts by day, product

• In SQL:

SELECT prodId, date, SUM(amt) FROM sale
GROUP BY date, prodId

• Roll-up: summarizes data by climbing up a hierarchy or by dimension reduction.

• Drill-down: the reverse of roll-up, moving from a higher-level summary to a lower-level summary or to detailed data, or introducing new dimensions.
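The roll-up above can be checked by running it. The sketch below assumes SQLite and loads the six sale rows from this section; the query also selects prodId so that each summary row is labelled with its product. A second query rolls up one step further by dropping prodId as well, and drilling down is simply going back to the finer grouping.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
db.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
])

# Roll-up by dimension reduction: drop storeId, keep product and day.
by_day_and_product = db.execute("""
    SELECT prodId, date, SUM(amt)
    FROM sale
    GROUP BY date, prodId
    ORDER BY date, prodId
""").fetchall()

# Roll up again: drop prodId too, leaving one total per day.
by_day = db.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

print(by_day_and_product)
print(by_day)
```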

Design Tips

What data is needed?

Where does it come from?

How to clean data?

How to represent in warehouse (schema)?

What to summarize?

What to materialize?

What to index?

THANK YOU Manoj I Bhadiyadra

[email protected]