Transcript
Page 1: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

®

IBM Software Group

© 2007 IBM Corporation

Page 3: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

IBM Software Group | WebSphere software

04/08/23 TCS Confidential 3

Page 4: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Course Roadmap

• Why we use Data warehousing

• Difference between Operational System and Data Warehouse

• Introduction to Data warehousing

• Data Warehousing Approaches

• Data Warehouse Technical Architecture

• Data Modelling concepts

• Operational Data Store

• Schema Design of Data warehouse

• Data Acquisition

• ETL Products

• Project Life Cycle

Page 5: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Why Do We Need Data Warehousing?

Better business intelligence for end-users

Reduction in time to locate, access, and analyze information

Consolidation of disparate information sources

Storage of large volumes of historical detail data from mission-critical applications

Strategic advantage over competitors

Faster time-to-market for products and services

Replacement of older, less-responsive decision support systems

Reduction in demand on IS to generate reports

Page 6: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is an Operational System?

Operational systems are just what their name implies: they are the systems that help us run the day-to-day enterprise operations.

These are the backbone systems of any enterprise, such as order entry and inventory.

The classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.

Page 7: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Characteristics of Operational Systems

• Continuous availability

• Predefined access paths

• Transaction integrity

• Volume of transaction - High

• Data volume per query - Low

• Used by operational staff

• Supports day-to-day control operations

• Large number of users

Page 8: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

OLTP Vs Data Warehouse

Operational System | Data Warehouse
Transaction processing | Query processing
Predictable CPU usage | Random CPU usage
Time sensitive | History oriented
Operator view | Managerial view
Normalized, efficient design for TP | Denormalized design for query processing

Page 9: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

OLTP Vs Data Warehouse

Operational System | Data Warehouse
Designed for atomicity, consistency, isolation, and durability | Designed for a quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile data | Non-volatile data

Page 10: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance sensitive | Less sensitive to performance
Not flexible | Flexible
Efficiency | Effectiveness

Page 11: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is a Data Warehouse?

A data warehouse is a collection of data that is:

Subject-Oriented

Integrated

Time-Variant

Non-volatile

W. H. Inmon - regarded as the father of data warehousing

Page 13: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Subject Oriented Analysis

Transactional Storage (process oriented): Entry - Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address

Data Warehouse Storage (subject oriented): Sales, Customers, Products

Page 14: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Integration of Data

Transactional Storage -> Integration -> Data Warehouse Storage

Encoding: Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y -> M, F

Unit of Attributes: Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf -> pipeline cm

Physical Attributes: Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float -> balance dec(13,2)

Naming Conventions: Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance -> balance

Data Consistency: Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute) -> date (Julian)
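As a rough illustration of the integration step on this slide, the sketch below maps each application's encoding and unit onto the warehouse convention. The application labels A, B, and C come from the slide. The meaning of each code (for example, that 1 and X stand for male), the function name integrate_record, and the field names are assumptions made only for this example. The mcf unit is left out because it measures volume rather than length.

```python
# Hypothetical mapping from each source application's gender code to the
# warehouse encoding (M, F). Which code means what is an assumption.
GENDER_CODES = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

# Conversion factors to centimetres, the warehouse unit for pipeline length.
TO_CM = {"cm": 1.0, "inches": 2.54}


def integrate_record(app, gender_code, pipeline, unit):
    """Return one source record in the warehouse's encoding and unit."""
    return {
        "gender": GENDER_CODES[app][gender_code],
        "pipeline_cm": pipeline * TO_CM[unit],
    }


print(integrate_record("B", "1", 10, "inches"))
```

Keeping the mappings in lookup tables means a new source application can be added without touching the conversion logic.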

Page 15: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Volatility of Data

Transactional Storage (volatile): record-by-record data manipulation - Insert, Access, Change, Delete

Data Warehouse Storage (non-volatile): mass load / access of data - Load, Access

Page 16: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Time Variant Data Analysis

Transactional Storage: Current Data

Data Warehouse Storage: Historical Data

[Chart: Sales (Region, Year - Year 97 - 1st Qtr); sales in lakhs (axis 0 to 20) for January, February, March; regions East, West, North]

Page 17: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Warehouse - Differences from Operational Systems

Operational systems database (constant change): Insert, Update, Delete

• Updated constantly
• Data changes according to need, not a fixed schedule

Data warehouse (consistent points in time): Initial Load, Incremental Load, Incremental Load

• Load / Update
• Added to regularly, but loaded data is rarely directly changed
• Does NOT mean the data warehouse is never updated or never changes!

Page 18: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Difference Between OLTP and OLAP

Page 19: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

DW Implementation Approaches

Top-down

Bottom-up

Combination of both

Choices depend on: current infrastructure, resources, architecture, ROI, implementation speed

Page 20: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

EDW - “Top Down” Approach

Heterogeneous Source Systems: Source 1, Source 2, Source 3

Staging: common staging interface layer

Enterprise Data Warehouse

Data mart bus architecture layer

Incremental architected data marts: DM 1, DM 2, DM 3

Page 21: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

EDW - “Bottom Up” Approach

Heterogeneous Source Systems: Source 1, Source 2, Source 3

Staging: common staging interface layer

Data mart bus architecture layer

Incremental architected data marts: DM 1, DM 2, DM 3

Enterprise Data Warehouse

Page 22: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Ralph Kimball’s Approach

Independent Data Marts: Ralph Kimball’s Ideology

Source System -> Extract -> Data Staging Area -> Load -> Presentation Area -> Access -> Data Access Tools

Data Staging Area

• Services: transform from source to target; maintain conformed dimensions; no user query support
• Data Store: flat files or relational tables
• Design Goals: staging throughput; integrity / consistency

Presentation Area

• Data Mart #1: dimensional; atomic AND summary data; business process centric
• Design Goals: ease of use; query performance
• Data Mart #2, Data Mart #...
• Data Mart Bus: conformed facts and dimensions

Data Access Tools

• Ad hoc query tools
• Report writers
• Analytic applications
• Modeling: forecasting, scoring, data mining

Page 23: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Bottom Up Approach

Flow: Staging Data Store -> Data Marts -> Data Warehouse

Staging Data Store

• E/R design or flat file
• Retain history needed for regular processing
• No end-user access

Data Marts

• Dimensional
• Transaction & summary data
• Each data mart covers a single subject area (i.e. fact table)
• Multiple marts may exist in a single database instance

• Integrated data
• Timely user access
• Conformed dimensions
• Single process to build dimension

Page 25: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Bill Inmon’s Approach

Dependent Data Marts: Bill Inmon’s Ideology

Source System -> Extract -> Data Staging Area (ETL) -> Load -> Presentation Area -> Access -> Data Access Tools

Presentation Area

• “Enterprise Data Warehouse” (DWH): normalized tables; atomic data; user query support to atomic data
• Data Mart #1: dimensional; summary data; departmental centric
• Data Mart #2, Data Mart #...

Page 26: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Top Down Approach

Flow: Flat File -> Staging Data Store -> Data Warehouse -> Data Marts

Staging Data Store

• Raw input data

Data Warehouse

• E/R model
• Subject areas
• Transaction-level detail
• Historical persistency as justified; archive for retrieval if needed

Data Marts

• Most are dimensional
• Data mart design by business function
• Summary-level data

• Integrated data
• Timely user access
• Single process to build dimension

Page 27: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

DW Implementation Approaches

Top Down

• More planning and design initially
• Involves people from different work-groups and departments
• Data marts may be built later from the global DW
• Overall data model to be decided up-front

Bottom Up

• Can plan initially without waiting for global infrastructure
• Built incrementally
• Can be built before or in parallel with the global DW
• Less complexity in design

Page 28: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

DW Implementation Approaches

Top Down

• Consistent data definition and enforcement of business rules across the enterprise
• High cost, lengthy process, time consuming
• Works well when there is a centralized IS department responsible for all H/W and resources

Bottom Up

• Data redundancy and inconsistency between data marts may occur
• Integration requires great planning
• Less cost of H/W and other resources
• Faster pay-back

Page 29: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

DW Architectures

Page 30: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

[Diagram: data warehouse architecture, from data sources to users]

Data Sources

• Transaction Data: Prod, Mkt, HR, Fin, Acctg (IBM, IMS, VSAM, Oracle, Sybase)
• Other Internal Data: ERP, Clickstream, Web Data (SAP, Informix)
• External Data: Demographic (Harte-Hanks)

ETL Software: Extract, Clean/Scrub, Transform, Load (Ascential, Sagent, SAS, Firstlogic, DATASTAGE)

Data Stores: Staging Area, Operational Data Store, Data Warehouse, Meta Data, Data Marts for Finance, Marketing, Sales (Teradata, IBM, Essbase, Microsoft)

Data Analysis Tools and Applications: SQL, Queries, Reporting, DSS/EIS, Data Mining (Cognos, SAS, Micro Strategy, Siebel, BusinessObjects, Web Browser)

Users: Analysts, Managers, Executives, Operational Personnel, Customers/Suppliers

Page 31: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Benefits of DWH

• To formulate effective business, marketing, and sales strategies.
• To precisely target promotional activity.
• To discover and penetrate new markets.
• To successfully compete in the marketplace from a position of informed strength.
• To build predictive rather than retrospective models.

Page 32: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Modeling

Page 33: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Modeling

WHAT IS A DATA MODEL?

A data model is an abstraction of some aspect of the real world (system).

WHY A DATA MODEL?

• Helps to visualize the business
• A model is a means of communication.
• Models help elicit and document requirements.
• Models reduce the cost of change.
• The model is the essence of the DW architecture, based on which the DW will be implemented

Page 34: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

STEPS in DATA MODELING

Problem & scope definition

Requirement Gathering

Analysis

Logical Database Design

Deciding Database

Physical Database design

Schema Generation

Page 35: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Levels of Modeling

Conceptual modeling

• Describes data requirements from a business point of view, without technical details

Logical modeling

• Refines conceptual models
• Data-structure oriented, platform independent

Physical modeling

• Detailed specification of what is physically implemented using specific technology

Page 36: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Modeling Techniques

Entity-Relationship Modeling

• Traditional modeling technique
• Technique of choice for OLTP
• Suited for corporate data warehouse

Dimensional Modeling

• Analyzing business measures in the specific business context
• Helps visualize very abstract business questions
• End users can easily understand and navigate the data structure

Page 37: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Entity-Relationship Modeling - Basic Concepts

Relationship

• Relationship between entities: structural interaction and association
• Described by a verb
• Cardinality: 1-1, 1-M, M-M
• Example: Books belong to Printed Media

Page 38: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Entity-Relationship Modeling - Basic Concepts

Attributes

• Characteristics and properties of entities
• Example: Book Id, Description, and Book Category are attributes of the entity “Book”
• Attribute names should be unique and self-explanatory
• Primary keys, foreign keys, and constraints are defined on attributes

Page 39: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Review of Logical Modeling Terms & Symbols

Entity: Entities define specific groups of information

Example: Sales Organization (Sales Org ID, Distribution Channel)

Page 40: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Review of Logical Modeling Terms & Symbols

Identifier: One or more attributes uniquely identify an instance of an entity

Example: Sales Organization (Sales Org ID, Distribution Channel)

Page 41: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Review of Logical Modeling Terms & Symbols

Relationship: The logical model identifies relationships between entities

Example: Sales Detail (Sales Record ID) is related to Sales Rep (Sales Rep ID)

Page 43: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Logical Data Model

Entities and their identifiers:

• Sales Detail (Sales Record ID)
• Customer (Customer ID)
• Product (Product SKU)
• Suppliers (Supplier ID)
• Manufacturing Group (Manufacturing Org ID)
• Factory (Factory ID)
• Sales Organization (Sales Org ID, Distribution Channel)
• Sales Rep (Sales Rep ID)
• Product Sales Plan (Plan ID)

Other entities shown: Retail, Wholesale, Market, Industry

Page 44: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Examples: ER Model

Page 45: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Limitations of E-R Modeling

Poor Performance

Tend to be very complex and difficult to navigate.

Page 47: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Dimensional Modeling

Page 48: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Dimensional Modeling

Dimensional modeling uses three basic concepts : measures, facts, dimensions.

Is powerful in representing the requirements of the business user in the context of database tables.

Focuses on numeric data, such as values, counts, weights, balances, and occurrences.

Page 49: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is a Fact?

A fact is a collection of related data items, consisting of measures and context data.

Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.

Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.

Granularity

The level of detail of data contained in the data warehouse

e.g. Daily item totals by product, by store

Page 50: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Types of Facts

Additive

• Able to add the facts along all the dimensions
• Discrete numerical measures, e.g. retail sales in $

Semi-Additive

• Snapshot, taken at a point in time
• Measures of intensity
• Not additive along the time dimension, e.g. account balance, inventory balance
• Added and divided by the number of time periods to get a time-average

Non-Additive

• Numeric measures that cannot be added across any dimensions
• Intensity measure averaged across all dimensions, e.g. room temperature
• Textual facts - AVOID THEM
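The difference between additive and semi-additive facts can be sketched with a few made-up rows. The table layout, the account names, and every number below are assumptions for illustration only. Deposits are summed along every dimension, while balances are summed across accounts but averaged across days.

```python
from collections import defaultdict

# (day, account, deposits, end_of_day_balance); all values are invented.
rows = [
    ("2000-01-01", "A", 100.0, 100.0),
    ("2000-01-02", "A", 50.0, 150.0),
    ("2000-01-01", "B", 200.0, 200.0),
    ("2000-01-02", "B", 0.0, 200.0),
]

# Additive fact: deposits can be summed along every dimension.
total_deposits = sum(deposits for _day, _acct, deposits, _bal in rows)

# Semi-additive fact: balances may be summed across accounts for one day ...
balance_by_day = defaultdict(float)
for day, _acct, _deposits, balance in rows:
    balance_by_day[day] += balance

# ... but along the time dimension they are added and divided by the
# number of periods to obtain a time-average.
average_daily_balance = sum(balance_by_day.values()) / len(balance_by_day)

print(total_deposits, dict(balance_by_day), average_daily_balance)
```

Summing the balances over both days would count the same money twice, which is why the time dimension is treated differently.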

Page 51: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Dimensions

A dimension is a collection of members or units of the same type of views.

Dimensions determine the contextual background for the facts.

Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how

Page 52: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Dimensional Hierarchy

Geography Dimension

World Level: World
Continent Level: America, Europe, Asia
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Tampa, Miami, Orlando, Naples

Each dimension member (business entity) is linked to the level above through a parent relation.

Attributes: Population, Tourist’s Place

Page 53: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Dimension Types

Conformed Dimension

Junk Dimension

Fast Changing Dimension

Role Playing Dimension

‘Garbage’ Dimension

Slowly Changing Dimension

Degenerate Dimension

Page 54: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is a Slowly Changing Dimension?

Although dimension tables are typically static lists, most dimension tables do change over time.

Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.

Page 55: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

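Slowly Changing Dimension - Classification

Slowly changing dimensions are classified into three different types:

TYPE I - the old value is overwritten; no history is kept

TYPE II - a new row is added for each change; full history is kept

TYPE III - a column holds the previous value; limited history is kept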

Page 56: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type I

Source

Emp id | Name | Email
1001 | Shane | [email protected]

Target

Emp id | Name | Email
1001 | Shane | [email protected]

When the email changes in the source, the new value overwrites the old value in the same target row.
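A minimal sketch of a Type I update follows, assuming the target dimension is held as a list of dictionaries keyed by emp_id. The function name apply_type1_change and the example.com addresses are invented for the example, since the addresses on the slide are redacted.

```python
def apply_type1_change(target, source_row):
    """Overwrite the matching target row in place; insert it if the key is new."""
    for row in target:
        if row["emp_id"] == source_row["emp_id"]:
            row.update(source_row)  # old values are lost: no history is kept
            return target
    target.append(dict(source_row))
    return target


employees = [{"emp_id": 1001, "name": "Shane", "email": "old@example.com"}]
apply_type1_change(
    employees, {"emp_id": 1001, "name": "Shane", "email": "new@example.com"}
)
print(employees)
```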

Page 57: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type II

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_VERSION_NUMBER
1000 | 10 | Shane | [email protected] | 0

Page 59: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Versioning

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_VERSION_NUMBER
1000 | 10 | Shane | [email protected] |
1001 | 10 | Shane | [email protected] |

Page 60: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Versioning

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_VERSION_NUMBER
1001 | 10 | Shane | [email protected] |
1003 | 10 | Shane | [email protected] |
1000 | 10 | Shane | [email protected] |
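The versioning variant of Type II can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual logic: the field names pk and version stand in for PM_PRIMARYKEY and PM_VERSION_NUMBER, the key sequence simply starts at 1000 and increases by one, and the email addresses are invented.

```python
def apply_versioned_change(target, source_row):
    """Append a new version row when a tracked attribute has changed."""
    history = [r for r in target if r["emp_id"] == source_row["emp_id"]]
    latest = max(history, key=lambda r: r["version"], default=None)
    if latest is not None and (latest["name"], latest["email"]) == (
        source_row["name"],
        source_row["email"],
    ):
        return target  # no change, so no new version
    target.append({
        # Assumed surrogate key scheme: start at 1000, then add one.
        "pk": max((r["pk"] for r in target), default=999) + 1,
        "emp_id": source_row["emp_id"],
        "name": source_row["name"],
        "email": source_row["email"],
        "version": 0 if latest is None else latest["version"] + 1,
    })
    return target


versioned = []
for email in ("first@example.com", "second@example.com"):
    apply_versioned_change(
        versioned, {"emp_id": 10, "name": "Shane", "email": email}
    )
print(versioned)
```

Older rows are never modified, so the highest version number for an employee always identifies the current row.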

Page 61: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type II - Flag

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_CURRENT_FLAG
1000 | 10 | Shane | [email protected] | Y

Page 62: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Flag Current

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_CURRENT_FLAG
1000 | 10 | Shane | [email protected] |
1001 | 10 | Shane | [email protected] |

Page 63: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Flag Current

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_CURRENT_FLAG
1001 | 10 | Shane | [email protected] |
1003 | 10 | Shane | [email protected] |
1000 | 10 | Shane | [email protected] |
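A hedged sketch of the current-flag variant: when a change arrives, the row that is currently flagged is retired and a new flagged row is appended. The field name current_flag stands in for PM_CURRENT_FLAG, the value "N" for retired rows is an assumption (the slide only shows "Y"), and the key scheme and addresses are invented.

```python
def apply_flagged_change(target, source_row):
    """Retire the current row (flag N) and append the new one (flag Y)."""
    current = next(
        (r for r in target
         if r["emp_id"] == source_row["emp_id"] and r["current_flag"] == "Y"),
        None,
    )
    if current is not None:
        if (current["name"], current["email"]) == (
            source_row["name"],
            source_row["email"],
        ):
            return target  # nothing changed
        current["current_flag"] = "N"
    target.append({
        "pk": max((r["pk"] for r in target), default=999) + 1,
        **source_row,
        "current_flag": "Y",
    })
    return target


flagged = []
for email in ("first@example.com", "second@example.com", "third@example.com"):
    apply_flagged_change(flagged, {"emp_id": 10, "name": "Shane", "email": email})
print(flagged)
```

With this layout, a query for the current state only needs to filter on the flag.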

Page 65: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type II

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_BEGIN_DATE | PM_END_DATE
1000 | 10 | Shane | [email protected] | 01/01/00 |

Page 66: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Effective Date

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_BEGIN_DATE | PM_END_DATE
1000 | 10 | Shane | [email protected] | 01/01/00 | 03/01/00
1001 | 10 | Shane | [email protected] | 03/01/00 |

Page 67: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions - Effective Date

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_BEGIN_DATE | PM_END_DATE
1001 | 10 | Shane | [email protected] | 03/01/00 | 05/02/00
1003 | 10 | Shane | [email protected] | 05/02/00 |
1000 | 10 | Shane | [email protected] | 01/01/00 | 03/01/00
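The effective-date variant can be sketched in the same style. The fields begin_date and end_date stand in for PM_BEGIN_DATE and PM_END_DATE, an open row is marked with end_date None, and the dates are read as month/day/year; all of these, along with the helper row_as_of and the addresses, are assumptions for illustration.

```python
from datetime import date


def apply_dated_change(target, source_row, change_date):
    """Close the open row on change_date and open a new row from that date."""
    current = next(
        (r for r in target
         if r["emp_id"] == source_row["emp_id"] and r["end_date"] is None),
        None,
    )
    if current is not None:
        if (current["name"], current["email"]) == (
            source_row["name"],
            source_row["email"],
        ):
            return target  # nothing changed
        current["end_date"] = change_date
    target.append({
        "pk": max((r["pk"] for r in target), default=999) + 1,
        **source_row,
        "begin_date": change_date,
        "end_date": None,
    })
    return target


def row_as_of(target, emp_id, when):
    """Return the row that was effective on the given date, if any."""
    for r in target:
        if (r["emp_id"] == emp_id and r["begin_date"] <= when
                and (r["end_date"] is None or when < r["end_date"])):
            return r
    return None


dated = []
apply_dated_change(
    dated, {"emp_id": 10, "name": "Shane", "email": "first@example.com"},
    date(2000, 1, 1),
)
apply_dated_change(
    dated, {"emp_id": 10, "name": "Shane", "email": "second@example.com"},
    date(2000, 3, 1),
)
print(row_as_of(dated, 10, date(2000, 2, 1)))
```

Unlike the flag variant, the date range supports point-in-time lookups such as row_as_of.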

Page 69: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type III

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_Prev_ColumnName | PM_EFFECT_DATE
1 | 10 | Shane | [email protected] | | 01/01/00

Page 70: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type III

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_Prev_ColumnName | PM_EFFECT_DATE
1 | 10 | Shane | [email protected] | [email protected] | 01/02/00

Page 71: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Slowly Changing Dimensions Type III

Source

Emp id | Name | Email
10 | Shane | [email protected]

Target

PM_PRIMARYKEY | Emp id | Name | Email | PM_Prev_ColumnName | PM_EFFECT_DATE
1 | 10 | Shane | [email protected] | [email protected] | 01/03/00
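A Type III change keeps a single row and moves the old value into a "previous" column. In this sketch the field prev_email plays the role of the PM_Prev_ColumnName column and effect_date the role of PM_EFFECT_DATE; the function name and the addresses are invented for the example.

```python
def apply_type3_change(row, new_email, effect_date):
    """Shift the current email into prev_email and store the new one."""
    if new_email != row["email"]:
        row["prev_email"] = row["email"]  # only one prior value survives
        row["email"] = new_email
        row["effect_date"] = effect_date
    return row


employee = {
    "pk": 1,
    "emp_id": 10,
    "name": "Shane",
    "email": "first@example.com",
    "prev_email": None,
    "effect_date": "01/01/00",
}
apply_type3_change(employee, "second@example.com", "01/02/00")
apply_type3_change(employee, "third@example.com", "01/03/00")
print(employee)
```

After the second change the first address is gone, which is the limited history that distinguishes Type III from Type II.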

Page 72: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Degenerate Dimension

Dimension keys in a fact table without corresponding dimension tables are called degenerate dimensions.

Purpose of Degenerate Dimensions

1. Generally used when each record in the fact table represents a transaction line item

2. Useful for grouping transaction line items belonging to a single transaction
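The grouping use mentioned in point 2 can be sketched with a small in-memory fact table. The column name order_number and all values are hypothetical; the point is only that the key lives in the fact rows and needs no dimension table of its own.

```python
from collections import defaultdict

# Each fact row is one transaction line item. "order_number" is the
# degenerate dimension: it is stored in the fact table only.
line_items = [
    {"order_number": "SO-1", "product_key": 11, "amount": 20.0},
    {"order_number": "SO-1", "product_key": 12, "amount": 5.0},
    {"order_number": "SO-2", "product_key": 11, "amount": 40.0},
]

order_totals = defaultdict(float)
items_per_order = defaultdict(int)
for item in line_items:
    order_totals[item["order_number"]] += item["amount"]
    items_per_order[item["order_number"]] += 1

print(dict(order_totals), dict(items_per_order))
```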

Page 73: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Fast Changing Dimension

A fast changing dimension is a dimension whose attribute or attributes for a record (row) change rapidly over time.

1. Example: Age of associates, Income, Daily balance etc.

2. Technique to handle fast changing dimension: Create band tables
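One way to read "band tables" is that the rapidly changing value is replaced by the band it falls into, so the dimension row changes only when a band boundary is crossed. The sketch below assumes that reading; the band boundaries, labels, and the function name age_band are invented for the example.

```python
import bisect

# Hypothetical band table: (lower bound, label), sorted by lower bound.
AGE_BANDS = [(0, "0-17"), (18, "18-29"), (30, "30-44"), (45, "45-64"), (65, "65+")]
_LOWER_BOUNDS = [lower for lower, _label in AGE_BANDS]


def age_band(age):
    """Return the label of the band that contains the given age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right finds the first lower bound greater than age;
    # the band just before it is the one that contains age.
    return AGE_BANDS[bisect.bisect_right(_LOWER_BOUNDS, age) - 1][1]


print(age_band(17), age_band(18), age_band(70))
```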

Page 74: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Role Playing Dimension

A single dimension that a fact table refers to in several different roles is called a role-playing dimension. This can be achieved by creating views on the dimension table.
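The view-based technique can be tried out with Python's built-in sqlite3 module. Everything in this sketch is hypothetical: the tables date_dim and order_fact, the two roles (order date and ship date), and the sample rows are assumptions chosen to show one physical dimension joined twice under different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE order_fact (order_date_key INTEGER, ship_date_key INTEGER, amount REAL);

-- One physical dimension, exposed under two roles through views.
CREATE VIEW order_date_dim AS
    SELECT date_key AS order_date_key, calendar_date AS order_date FROM date_dim;
CREATE VIEW ship_date_dim AS
    SELECT date_key AS ship_date_key, calendar_date AS ship_date FROM date_dim;

INSERT INTO date_dim VALUES (1, '2000-01-01');
INSERT INTO date_dim VALUES (2, '2000-01-03');
INSERT INTO order_fact VALUES (1, 2, 99.5);
""")

rows = conn.execute("""
SELECT o.order_date, s.ship_date, f.amount
FROM order_fact AS f
JOIN order_date_dim AS o ON o.order_date_key = f.order_date_key
JOIN ship_date_dim AS s ON s.ship_date_key = f.ship_date_key
""").fetchall()
print(rows)
```

The views carry role-specific column names, so queries stay readable even though only one date table is maintained.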

Page 75: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Conformed Dimension

A conformed dimension means the same thing to each fact table to which it can be joined.

Typically, dimension tables that are referenced or are likely to be referenced by multiple fact tables (multiple dimensional models) are called conformed dimensions.

Page 76: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Conformed Dimension Option #1

Identical dimensions with the same keys, labels, definitions, and values

Sales Schema

• SALES Facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc

Inventory Schema

• INVENTORY Facts: DATE KEY, PRODUCT KEY, STORE KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc

Page 77: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Conformed Dimension Option #2

Subset of base dimension with common labels, definitions, and values

Sales Schema

• SALES $ facts: DATE KEY, PRODUCT KEY, STORE KEY, PROMO KEY
• Product dimension: PRODUCT KEY, Product Desc, Brand Desc, Category Desc
• Date dimension: DATE KEY, Day-of-week, Week Desc, Month Desc

Forecast Schema

• SALES $ facts: MONTH KEY, BRAND KEY
• Brand dimension: BRAND KEY, Brand Desc, Category Desc
• Month dimension: MONTH KEY, Month Desc

Example rows

PROD KEY | Prod Desc | Brand Desc | Category Desc
12345 | Cherriors 10 | Cherriors | Cereal

BRAND KEY | Brand Desc | Category Desc
12345 | Cherriors | Cereal

Page 78: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

‘Garbage’ Dimension

A garbage dimension is a dimension that consists of low-cardinality columns such as codes, indicators, and status flags.

Approaches to handling such attributes:

• Put the new attributes into existing dimension tables.
• Put the new attributes into the fact table.
• Create new separate dimension tables.
• Create a separate ‘Garbage Dimension’ table.

Page 79: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Junk Dimensions

Whether to use a junk dimension

• 5 indicators, each with 3 values -> 243 (3^5) rows
• 5 indicators, each with 100 values -> 10 billion (100^5) rows

When to insert rows in the dimension
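The size of a fully populated junk dimension is the product of the indicator cardinalities, which a short sketch makes concrete. The indicator names (flag_1 to flag_5), their values, and both function names are invented for the example.

```python
from itertools import product
from math import prod


def junk_dimension_size(cardinalities):
    """Number of rows if every combination of indicator values is stored."""
    return prod(cardinalities)


def junk_dimension_rows(indicators):
    """Build every combination of indicator values as one dimension row."""
    names = list(indicators)
    return [
        dict(zip(names, combo))
        for combo in product(*(indicators[name] for name in names))
    ]


flags = {f"flag_{i}": ("Y", "N", "U") for i in range(1, 6)}
small_dimension = junk_dimension_rows(flags)

print(len(small_dimension))
print(junk_dimension_size([100] * 5))
```

The small case is cheap to pre-populate in full. For the large case it is usually more practical to insert a row only when a combination first appears in the incoming data, a choice this sketch does not implement.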

Page 80: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Factless Fact Tables

The two types of factless fact tables are:

Coverage tables

Event tracking tables

Page 81: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Factless Fact Tables - Coverage Tables

Coverage tables are required when a primary fact table is sparse

Example: Tracking products in a store that did not sell
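The "did not sell" question can be sketched as a set difference between a coverage table and the keys found in the sparse sales fact table. The store and product identifiers below are made up, and a real coverage table would normally also carry a date key.

```python
# Factless coverage table: one row per (store, product) that was on offer.
coverage = {("S1", "P1"), ("S1", "P2"), ("S1", "P3"), ("S2", "P1")}

# Keys that actually appear in the (sparse) sales fact table.
sold = {("S1", "P1"), ("S2", "P1")}

# Products that were on offer in a store but did not sell there.
did_not_sell = sorted(coverage - sold)
print(did_not_sell)
```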

Page 83: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Factless Fact Tables - Event Tracking

These tables are used for tracking an event.

Example: Tracking student attendance
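An event-tracking table holds only dimension keys, so analysis comes down to counting rows. The sketch below assumes a hypothetical attendance table with day, student, and course keys; all identifiers are invented.

```python
from collections import Counter

# Factless fact table: each row records that a student attended a course
# on a given day. There is no numeric measure, only keys.
attendance = [
    ("2000-01-03", "stu-1", "MATH101"),
    ("2000-01-03", "stu-2", "MATH101"),
    ("2000-01-04", "stu-1", "MATH101"),
    ("2000-01-04", "stu-1", "PHYS101"),
]

# Counting rows answers questions such as "how often did each student attend?"
attendance_per_student = Counter(student for _day, student, _course in attendance)
attendance_per_course = Counter(course for _day, _student, course in attendance)
print(attendance_per_student, attendance_per_course)
```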

Page 84: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Fact Constellation

Fact constellation: multiple fact tables share dimension tables. It is viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.

Page 85: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is a Data Mart?

A data mart is a decentralized subset of data, found either in a data warehouse or as a standalone subset, designed to support the unique business unit requirements of a specific decision-support system.

Data marts have specific business-related purposes, such as measuring the impact of marketing promotions or measuring and forecasting sales performance.

[Diagram: Enterprise Data Warehouse and two Data Marts]

Page 86: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Marts - Main Features

• Low cost
• Controlled locally rather than centrally, conferring power on the user group
• Contain less information than the warehouse
• Rapid response
• More easily understood and navigated than an enterprise data warehouse
• Within the range of divisional or departmental budgets

Page 88: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Advantages of Data Mart over Data Warehouse

• Typically single subject area and fewer dimensions
• Limited feeds
• Very quick time to market (30-120 days to pilot)
• Quick impact on bottom-line problems
• Focused user needs
• Limited scope
• Optimum model for DW construction
• Demonstrates ROI
• Allows prototyping

Page 89: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Disadvantages of Data Mart

• Does not provide an integrated view of business information
• Uncontrolled proliferation of data marts results in redundancy
• A large number of data marts is complex to maintain
• Scalability issues for a large number of users and increased data volume

Page 90: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Marts - Types

• Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.

• Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.

• Independent data marts are marts that are fed directly by external sources and do not use the DW.

Page 91: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

The Operational Data Store

Page 92: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Page 93: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Why Do We Need an Operational Data Store?

Need

To obtain a “system of record” that contains the best data that exists in a legacy environment as a source of information

"Best" here implies data that is:

Complete

Up to date

Accurate

In conformance with the organization’s information model

Page 94: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Operational Data Store - Insulated from OLTP

ODS data resolves data integration issues

Data physically separated from production environment to insulate it from the processing demands of reporting and analysis

Access to current data facilitated.

Diagram labels: OLTP Server, ODS, Tactical Analysis

Page 95: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Operational Data Store - Data

Detailed data

Records of business events (e.g. order capture)

Data from heterogeneous sources

Does not store summary data

Contains current data

Page 96: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Page 97: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

ODS - Benefits

Integrates the data

Synchronizes the structural differences in data

High transaction performance

Serves the operational and DSS environment

Transaction level reporting on current data

Diagram labels: Flat files, Relational Database, Excel files, Operational Data Store; sample records: 60,5.2,"JOHN" and 72,6.2,"DAVID"

Page 98: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Operational Data Store - Update schedule

ODS Data

Update schedule - daily or shorter intervals

Detail of data is mostly between 30 and 90 days

Addresses operational needs

Data Warehouse Data

Update schedule - weekly or longer intervals

Potentially infinite history

Addresses strategic needs

Page 99: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Page 100: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse

Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy

Data stability | Dynamic | Somewhat dynamic | Static

Data update | Field by field | Field by field | Controlled batch

Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical

Database size | Moderate | Moderate | Large to very large

Database structure stability | Stable | Somewhat stable | Dynamic

Page 101: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Star Schema Design

Single fact table surrounded by denormalized dimension tables

The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)

Fact table contains transaction type information.

Many star schemas in a data mart

Easily understood by end users, more disk storage required
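
As a rough illustration of the star layout described on this slide, the following Python sketch uses the standard sqlite3 module. All table and column names (dim_product, dim_store, fact_sales) and the sample rows are hypothetical and do not come from the course material. The fact table's primary key is the composite of the two dimension foreign keys, and each dimension keeps its descriptive attributes in a single denormalized row.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by two denormalized dimension tables."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key   INTEGER PRIMARY KEY,
            product_name  TEXT,
            category_name TEXT   -- denormalized: category stored in the same row
        );
        CREATE TABLE dim_store (
            store_key   INTEGER PRIMARY KEY,
            store_name  TEXT,
            region_name TEXT     -- denormalized: region stored in the same row
        );
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            store_key   INTEGER REFERENCES dim_store(store_key),
            quantity    INTEGER,
            amount      REAL,
            PRIMARY KEY (product_key, store_key)  -- composite of the foreign keys
        );
    """)


def sales_by_category(conn):
    """A typical star query: one join from the fact table to one dimension."""
    return conn.execute("""
        SELECT p.category_name, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.category_name
        ORDER BY p.category_name
    """).fetchall()


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(1, "Pen", "Stationery"), (2, "Desk", "Furniture")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(10, "Main St", "East")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 10, 5, 10.0), (2, 10, 1, 150.0)])
star_totals = sales_by_category(conn)
```

Because the category name lives directly in dim_product, the report needs only one join, which is the trade-off the slide points to: simpler queries at the cost of repeated attribute values.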

Page 102: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

EXAMPLE OF STAR SCHEMA

Page 103: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Snowflake Schema

Single fact table surrounded by normalized dimension tables

Normalizes dimension tables to save data storage space.

Used when dimensions become very large

Less intuitive, slower performance due to joins

May want to use both approaches, especially if supporting multiple end-user tools.
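
To make the contrast with a star layout concrete, here is a minimal sqlite3 sketch of a snowflaked product dimension. The table names (dim_category, dim_product, fact_sales) and the data are hypothetical. The category attribute is moved into its own table, so each category name is stored once, but the same report now needs an extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized dimension: category is split out of the product dimension.
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key  INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        amount      REAL
    );
    INSERT INTO dim_category VALUES (1, 'Stationery');
    INSERT INTO dim_product VALUES (1, 'Pen', 1), (2, 'Pencil', 1);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 4.0);
""")

# Two joins are needed to reach the category name from the fact table.
snowflake_totals = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchall()
```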

Page 104: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Example of Snowflake Schema

Page 105: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Snowflake - Disadvantages

Normalization of dimensions makes the schema difficult for users to understand

Decreases query performance because it involves more joins

Dimension tables are normally smaller than fact tables, so the space saved may not be large enough to warrant snowflaking

Page 106: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Acquisition

Data Extraction

Data Transformation

Data Loading
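
The three acquisition steps listed above can be sketched as three small Python functions. This is only an illustration under stated assumptions: the CSV layout, the orders table, and the cleansing rule (trim and upper-case the customer name) are all hypothetical, and a real ETL tool would do each step at far larger scale.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a flat-file extract.
RAW_CSV = "order_id,customer,amount\n1, john ,10.50\n2,DAVID,7.25\n"


def extract(text):
    """Extraction: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: fix data types and standardize the customer name."""
    return [
        (int(r["order_id"]), r["customer"].strip().upper(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Loading: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```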

Page 107: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Representative DW Tools

Tool Category | Products

ETL Tools | ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder

OLAP Server | Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB

OLAP Tools | Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube

Data Warehouse | Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick

Data Mining & Analysis | SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools

Page 108: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

ETL PRODUCTS

CODE BASED ETL TOOLS

GUI BASED ETL TOOLS

Page 109: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

CODE BASED ETL TOOLS

SAS ACCESS

SAS BASE

TERADATA ETL TOOLS

1. BTEQ

2. TPUMP

3. FAST LOAD

4. MULTI LOAD

Page 110: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

GUI BASED ETL TOOLS

Informatica

DT/Studio

DataStage

Business Objects Data Integrator (BODI)

Ab Initio

Data Junction

Oracle Warehouse Builder

Microsoft SQL Server Integration Services

IBM DB2 Warehouse Center

Page 111: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Extraction Types

Page 112: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Extraction Types

Extraction

Full Extract

Periodic/Incremental Extract

Page 113: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Full Extract

Diagram labels: Source System, Full Extract, Data Mart, New data

Page 114: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Incremental Extract

Diagram labels: Source System, Incremental Extract, Incremental Data, Data Mart, Existing data

Page 115: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Incremental Extract

Diagram labels: Source System, Incremental Extract, Incremental Data, New data, Changed data, Data Mart, Existing data

Page 116: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Incremental Extract

Diagram labels: Source System, Incremental Extract, Incremental Data, New data, Changed data, Data Mart

Existing data updated using changed data

Incremental addition to data mart
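
The difference between a full and an incremental extract can be sketched in a few lines of Python. The slides do not say how new and changed rows are detected, so this sketch assumes each source row carries a hypothetical `updated_at` value and that the time of the previous extract is remembered. The row layout and the dictionary standing in for the data mart are also assumptions made for illustration.

```python
def full_extract(source_rows):
    """Full extract: take every row from the source system."""
    return list(source_rows)


def incremental_extract(source_rows, last_extract_time):
    """Incremental extract: take only rows added or changed since the last run."""
    return [r for r in source_rows if r["updated_at"] > last_extract_time]


def apply_to_mart(mart, extracted_rows):
    """New rows are added; changed rows overwrite the existing row with the same id."""
    for row in extracted_rows:
        mart[row["id"]] = row
    return mart


source = [
    {"id": 1, "name": "JOHN", "updated_at": 1},   # unchanged since last run
    {"id": 2, "name": "DAVID", "updated_at": 5},  # changed since last run
    {"id": 3, "name": "MARY", "updated_at": 6},   # new since last run
]
# State of the data mart after the previous extract (taken at time 2).
mart = {
    1: {"id": 1, "name": "JOHN", "updated_at": 1},
    2: {"id": 2, "name": "DAVE", "updated_at": 2},
}

delta = incremental_extract(source, last_extract_time=2)
apply_to_mart(mart, delta)
```

Only rows 2 and 3 travel from source to mart in this run, whereas `full_extract(source)` would move all three.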

Page 117: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

DATA WAREHOUSE LOADING

Page 118: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Page 119: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Types of Data warehouse Loading

Target update types

Insert

Update

Page 120: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Types of Data Warehouse Updates

Insert

Full Replace

Selective Replace

Update plus Retain History

Update

Point in Time Snapshots

New Data Changed Data

Data Warehouse

Source data Data Staging

Page 121: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

New Data and Point-In-Time Data Insert

Source data

New data

OR

Point-in-Time Snapshot (e.g. monthly)

New Data Added to Existing Data

Page 122: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Changed Data Insert

Source data

Changed Data Added to Existing Data

Changed data
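
A changed data insert keeps the old row and adds the new version next to it, in contrast to an update that overwrites the row. The sketch below shows both behaviours on a plain Python list. The column names (`customer_id`, `city`, `load_date`, `is_current`) and the use of a current-row flag are assumptions chosen for illustration, not something the slides prescribe.

```python
def update_in_place(table, changed_row):
    """Update: overwrite the existing row, so the previous value is lost."""
    for row in table:
        if row["customer_id"] == changed_row["customer_id"]:
            row.update(changed_row)
            return table
    table.append(dict(changed_row))
    return table


def changed_data_insert(table, changed_row, load_date):
    """Changed data insert: add the new version and keep the old one as history."""
    for row in table:
        if row["customer_id"] == changed_row["customer_id"] and row["is_current"]:
            row["is_current"] = False  # the older version stays, marked as history
    table.append({**changed_row, "load_date": load_date, "is_current": True})
    return table


current_only = [{"customer_id": 1, "city": "Pune"}]
update_in_place(current_only, {"customer_id": 1, "city": "Mumbai"})

history = [
    {"customer_id": 1, "city": "Pune", "load_date": "2023-01-31", "is_current": True}
]
changed_data_insert(history, {"customer_id": 1, "city": "Mumbai"}, "2023-02-28")
```

After both calls, `current_only` holds a single row with the new city, while `history` holds two rows for the same customer, which is what makes point-in-time questions answerable later.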

Page 123: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Data Warehouse Life Cycle

Diagram labels: Business Requirement, Map Data Sources, Reverse Engg., Map Req. to OLTP, OLTP System, Logical Modeling, Refine Model, External Data Storage, ETL, Data Warehouse, Enterprise Data Warehouse, Info Access (Reporting tools, Web Browsers, OLAP, Mining)

Page 124: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Project Life Cycle

Software Requirement Specification

High Level Design (HLD)

Low Level Design (LLD)

Development

Unit Testing

System Integration Testing

Peer Review

User Acceptance Testing

Production

Maintenance

Page 125: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Meta Data in a Data Warehouse

Page 126: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

What is Metadata?

• Data about data and the processes

• Metadata is stored in a data dictionary and repository.

• Insulates the data warehouse from changes in the schema of operational systems.

• It serves to identify the contents and location of data in the data warehouse

Page 127: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Why Do You Need Meta Data?

Share resources

Users

Tools

Document system

Without meta data

Not sustainable

Not able to fully utilize resources

Page 128: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

The Role of Meta Data in the Data Warehouse

Meta Data enables data to become information, because with it you

Know what data you have and

You can trust it!

Page 129: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Meta Data Answers….

How have business definitions and terms changed over time?

How do product lines vary across organizations?

What business assumptions have been made?

How do I find the data I need?

What is the original source of the data?

How was this summarization created?

What queries are available to access the data?

Page 130: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Meta Data Process

Integrated with entire process and data flow

Populated from beginning to end

Begin population at design phase of project

Dedicated resources throughout

Build

Maintain

Diagram stages:

• Design, Mapping

• Extract, Scrub, Transform

• Load, Index, Aggregation

• Replication, Data Set Distribution

• Access & Analysis, Resource Scheduling & Distribution

Diagram labels: Meta Data, System Monitoring

Page 131: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Types of ETL Meta Data

ETL Meta data

Technical Meta data

Operational Meta data

Page 132: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

Classification of ETL Meta Data

Data Warehouse Meta data

This Meta data stores descriptive information about the physical implementation details of the data warehouse.

Source Meta data

This Meta data stores information about the source data and the mapping of source data to data warehouse data.

Page 133: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

ETL Meta Data

Transformations & Integrations

This Meta data records comprehensive information about transformation and loading.

Processing Information

This Meta data stores information about the activities involved in the processing of data, such as scheduling and archiving.

End User Information

This Meta data records information about user profiles and security.
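
One way to picture these categories is as records in a small repository. The Python sketch below is purely illustrative: the class names (ColumnMapping, JobRun, MetadataRepository), their fields, and the sample values are all hypothetical and are not part of DataStage or any other tool. It stores source-to-target mappings with their transformation rules alongside processing information about job runs, and uses the mappings to trace where a target column came from.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMapping:
    """Source and transformation meta data for one target column."""
    source_table: str
    source_column: str
    target_table: str
    target_column: str
    transformation: str  # free-text description of the rule applied


@dataclass
class JobRun:
    """Processing information about one execution of a load job."""
    job_name: str
    started_at: str
    rows_read: int
    rows_loaded: int


@dataclass
class MetadataRepository:
    mappings: list = field(default_factory=list)
    runs: list = field(default_factory=list)

    def lineage(self, target_table, target_column):
        """Return the source columns and rules that feed one target column."""
        return [
            (m.source_table, m.source_column, m.transformation)
            for m in self.mappings
            if m.target_table == target_table and m.target_column == target_column
        ]


repo = MetadataRepository()
repo.mappings.append(
    ColumnMapping("orders_src", "cust_nm", "dim_customer", "customer_name", "TRIM + UPPER")
)
repo.runs.append(JobRun("load_dim_customer", "2023-04-08T02:00:00", 100, 98))
source_of_name = repo.lineage("dim_customer", "customer_name")
```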

Page 134: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING

ETL - Planning for the Movement

The following steps may be helpful when planning the data movement:

Develop an ETL plan

Specifications

Implementation

Page 135: DATASTAGE AND QUALITY STAGE 9.1 ONLINE TRAINING
