data warehouse components

Page 1: Data Warehouse Components

2/21/2013


Dr. Yashvardhan Sharma

Assistant Professor, CS & IS Dept.

BITS-Pilani

SS G515 - Data Warehousing: Introduction

Topics Already Covered

Data Warehouse Architecture and its components

Extraction, Transformation and Loading (ETL)

Data Marts

Approaches to Designing Data Warehouses: Inmon’s and Kimball’s

A General Architecture for Data Warehousing

The major components of data warehouse architecture are:

Source systems are where the data comes from.

Extraction, transformation, and load (ETL) move data between different data stores.

The central repository is the main store for the data warehouse.

The metadata repository describes what is available and where.

Data marts provide fast, specialised access for end users and applications.

Operational feedback integrates decision support back into the operational systems.

End users are the reason for developing the warehouse in the first place.

MOLAP: Multi-Dimensional On-Line Analytical Processing

ROLAP: Relational On-Line Analytical Processing

Loading the Data Warehouse

Source Systems (OLTP), then Data Staging Area, then Data Warehouse

Data is periodically extracted from the source systems

Data is cleansed and transformed in the data staging area

Users query the data warehouse
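The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` table and the cleansing rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from an OLTP source system.
SOURCE_ROWS = [
    {"order_id": 1, "state": " Rajasthan ", "amount": "1200.50"},
    {"order_id": 2, "state": "goa", "amount": "300"},
    {"order_id": 2, "state": "goa", "amount": "300"},   # duplicate extract
    {"order_id": 3, "state": "Delhi", "amount": None},  # missing measure
]


def extract(rows):
    """Periodic extraction: copy source rows into the staging area."""
    return [dict(row) for row in rows]


def transform(staged):
    """Cleanse and transform: drop bad rows, deduplicate, normalise formats."""
    seen, clean = set(), []
    for row in staged:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append(
            (row["order_id"], row["state"].strip().title(), float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Load the cleansed rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, state TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_ROWS)))

# Users query the warehouse, not the source systems.
totals = warehouse.execute(
    "SELECT state, SUM(amount) FROM sales GROUP BY state ORDER BY state"
).fetchall()
print(totals)
```

Keeping the three functions separate mirrors the separation between source systems, staging area and warehouse: only `load` touches the warehouse.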

Data Warehousing Architecture

[Figure: Data warehousing architecture. Labels: External Sources; Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; Serve; OLAP servers; Analysis; Query/Reporting; Data Mining.]


Data Warehouse Components

Source Data Component

Production Data

Internal Data

Archived Data

External Data

Data Staging Component

Data Extraction

Data Transformation

Data Loading

Data Storage Component


Many data warehouses also employ multidimensional database management systems. Data extracted from the data warehouse storage is aggregated in many ways, and the summary data is kept in multidimensional databases (MDDBs). Such multidimensional database systems are usually proprietary products.
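To make "aggregated in many ways" concrete, the sketch below pre-computes a summary for every combination of dimensions, which is the basic idea behind a data cube. It is not how any particular MDDB product stores data; the dimension names and detail rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "state", "year")

# Hypothetical detailed rows held in the warehouse storage.
DETAIL = [
    {"product": "P1", "state": "Goa", "year": 2009, "units": 10},
    {"product": "P1", "state": "Delhi", "year": 2009, "units": 5},
    {"product": "P2", "state": "Goa", "year": 2010, "units": 7},
]


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            summary = defaultdict(int)
            for row in rows:
                summary[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(summary)
    return cube


cube = build_cube(DETAIL, DIMENSIONS, "units")
print(cube[()])                 # grand total
print(cube[("product",)])       # units per product
print(cube[("state", "year")])  # units per state and year
```

With three dimensions there are eight summaries. The number doubles with each added dimension, which is one reason real systems choose carefully which aggregates to materialise.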


Information Delivery Component


Metadata Component


Metadata in a data warehouse is similar to a data dictionary, but it is much more than a data dictionary.

Types of Metadata

Operational Metadata

Extraction and Transformation Metadata

End-User Metadata

More Details in Chapter 9.

Why Metadata: Special Significance

First, it acts as the glue that connects all parts of the data warehouse.

Next, it provides information about the contents and structures to the developers.

Finally, it opens the door to the end users and makes the contents recognizable in their own terms.

The architecture

[Figure: Typical architecture of a data warehouse. Labels: Operational data source 1, Operational data source 2, Operational data source n; Operational data store (ODS); Load Manager; Warehouse Manager; Query Manager; DBMS holding Meta-data, Detailed data, Lightly summarized data, Highly summarized data, Archive/backup data; End-user access tools: reporting, query, application development, and EIS (executive information system) tools, OLAP (online analytical processing) tools, Data mining.]

The main components

Operational data sources: data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization’s suppliers or customers.

Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.

Load manager: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.

Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.


Query manager: also called the backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.

Detailed, lightly and highly summarized data, and archive/backup data

Meta-data

End-user access tools can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
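One way to picture the query manager "directing queries to the appropriate tables" is a router that picks the smallest table able to answer a query. The sketch below is an assumption-laden illustration: the table names, column sets and row counts are hypothetical, and a real query manager would read them from the meta-data.

```python
# Hypothetical catalogue: table name -> (columns available, approximate rows).
TABLES = {
    "sales_detail": ({"product", "state", "day", "month", "year"}, 10_000_000),
    "sales_by_month": ({"product", "state", "month", "year"}, 200_000),
    "sales_by_year": ({"product", "year"}, 500),
}


def route_query(group_by):
    """Return the smallest table that has every column the query groups by."""
    needed = set(group_by)
    candidates = [
        (rows, name)
        for name, (columns, rows) in TABLES.items()
        if needed <= columns
    ]
    if not candidates:
        raise ValueError(f"no table can answer a query on {sorted(needed)}")
    return min(candidates)[1]


print(route_query(["product", "year"]))  # highly summarized data is enough
print(route_query(["state", "month"]))   # needs lightly summarized data
print(route_query(["day"]))              # only the detailed data has days
```

Routing to summary tables where possible is what makes the lightly and highly summarized data worth maintaining.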

Data flows

Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.

Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.

Downflow: the processes associated with archiving and backing up of data in the warehouse.

Outflow: the processes associated with making the data available to the end users.

Meta-flow: the processes associated with the management of the meta-data.

[Figure: Information flows of a data warehouse. The operational data sources, operational data store (ODS), load manager, warehouse manager, query manager, DBMS (meta-data, detailed data, lightly and highly summarized data, archive/backup data) and end-user access tools are annotated with the five flows: Inflow, Upflow, Downflow, Outflow and Meta-flow.]

Tools and Technologies

The critical steps in the construction of a data warehouse:

a. Extraction

b. Cleansing

c. Transformation

After these critical steps, loading the results into the target system can be carried out either by separate products or by a single integrated solution. Tool categories:

code generators

database data replication tools

dynamic transformation engines

Populating & Refreshing the Warehouse

Data Extraction

Data Cleaning

Data Transformation: convert from legacy/host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, build indexes, partition

Refresh: bring new data from source systems
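The refresh step can be sketched as an incremental load driven by a high-water mark. This is one common approach, not necessarily the one the slides have in mind. The `source_orders` and `dw_orders` tables and the `updated_at` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_orders
        (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE dw_orders
        (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO source_orders VALUES
        (1, 100.0, '2013-02-19'), (2, 250.0, '2013-02-20');
    """
)


def refresh(conn):
    """Copy rows changed since the last refresh; return how many were copied."""
    # High-water mark: the newest timestamp already present in the warehouse.
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM dw_orders"
    ).fetchone()
    new_rows = conn.execute(
        "SELECT order_id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    conn.executemany("INSERT OR REPLACE INTO dw_orders VALUES (?, ?, ?)", new_rows)
    conn.commit()
    return len(new_rows)


first = refresh(conn)   # initial load copies both existing rows
conn.execute("INSERT INTO source_orders VALUES (3, 75.0, '2013-02-21')")
second = refresh(conn)  # later refresh copies only the new row
print(first, second)
```

A strict `>` comparison can miss rows that share the watermark's timestamp, so a production refresh would need a finer-grained change marker.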

ETL Process : Issues & Challenges

Consumes 70-80% of project time

Heterogeneous Source Systems

Little or no control over source systems

Source systems scattered

Source systems operating in different time zones

Different currencies

Different measurement units

Data not captured by OLTP systems

Ensuring data quality
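Three of the issues listed (time zones, currencies and measurement units) come down to conforming each source record to one standard before loading. The sketch below shows the idea. The exchange rates are made-up constants and the record layout is hypothetical; fixed UTC offsets are used instead of a time zone database to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

# Illustrative lookup tables; real values would come from reference data.
RATE_TO_USD = {"USD": 1.0, "INR": 0.02}  # assumed fixed rates
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
UTC_OFFSETS = {"IST": timedelta(hours=5, minutes=30), "EST": timedelta(hours=-5)}


def conform(record):
    """Convert one source record to UTC, USD and kilograms."""
    local = datetime.strptime(record["sold_at"], "%Y-%m-%d %H:%M")
    local = local.replace(tzinfo=timezone(UTC_OFFSETS[record["tz"]]))
    return {
        "sold_at_utc": local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M"),
        "amount_usd": round(record["amount"] * RATE_TO_USD[record["currency"]], 2),
        "weight_kg": round(record["weight"] * TO_KILOGRAMS[record["unit"]], 3),
    }


india = {"sold_at": "2013-02-21 09:30", "tz": "IST",
         "amount": 5000, "currency": "INR", "weight": 500, "unit": "g"}
usa = {"sold_at": "2013-02-20 23:00", "tz": "EST",
       "amount": 100, "currency": "USD", "weight": 2, "unit": "lb"}
print(conform(india))
print(conform(usa))
```

The two sample sales carry different local dates but are the same instant in UTC, which is exactly the kind of discrepancy that distorts daily totals if left unconformed.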


Data Staging Area

A storage area where extracted data is

Cleaned

Transformed

Deduplicated

Initial storage for data

Need not be based on Relational model

Spread over a number of machines

Mainly sorting and Sequential processing

COBOL or C code running against flat files

Does not provide data access to users

Analogy – kitchen of a restaurant

Presentation Servers

A target physical machine on which DW data is organized for

Direct querying by end users using OLAP

Report writers

Data Visualization tools

Data mining tools

Data stored in Dimensional framework

Analogy – Sitting area of a restaurant

Data Cleaning

Why? A data warehouse contains data that is analyzed for business decisions.

More data and multiple sources can mean more errors in the data, and such errors are harder to trace.

Errors result in incorrect analysis.

Detecting data anomalies and rectifying them early has huge payoffs.

Long-term solution: change business practices and data entry tools.

Repository for meta-data

Soundex Algorithms

Misspelled terms

For example, names

Phonetic algorithms – can find similar sounding names

Based on the six phonetic classifications of human speech sounds
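A minimal sketch of the widely used American Soundex variant follows; it is assumed, not confirmed, that this is the variant the slides refer to. The six digit groups in `CODES` correspond to the six phonetic classes. Vowels are not coded, and H and W are skipped without separating equal codes.

```python
# Digit groups 1..6: one per phonetic class of consonants.
CODES = {}
for digit, letters in enumerate(
    ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1
):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue               # H and W do not separate equal codes
        digit = CODES.get(ch, "")  # vowels map to "" and reset `previous`
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]      # pad with zeros, keep four characters


for name in ("Robert", "Rupert", "Smith", "Smyth"):
    print(name, soundex(name))
```

Names that sound alike receive the same code, so a cleaning step can group candidate duplicates by `soundex(name)` and hand each group to a stricter comparison.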

Data Warehouse Design

OLTP Systems are Data Capture Systems

“DATA IN” systems

DW are “DATA OUT” systems


Analyzing the DATA

Active Analysis – User Queries

User-guided data analysis

Show me how X varies with Y

OLAP

Automated Analysis – Data Mining

What’s in there?

Set the computer FREE on your data

Supervised Learning (classification)

Unsupervised Learning (clustering)


OLAP Queries

How much of product P1 was sold in 2009, state-wise?

What were the top 5 selling products in 2010?

What were the total sales in Q1 of FY 2008-09?

What were the color-wise sales figures for cars from 2008 to 2010?

What were the model-wise sales of cars for the month of January from 2006 to 2010?
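The first two questions translate directly into grouped SQL queries. The sketch below runs them against a tiny in-memory SQLite table; the `sales` schema and the figures in it are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("P1", "Goa", 2009, 10),
        ("P1", "Delhi", 2009, 4),
        ("P1", "Goa", 2009, 6),
        ("P1", "Goa", 2010, 3),
        ("P2", "Goa", 2010, 9),
        ("P3", "Delhi", 2010, 2),
    ],
)

# How much of product P1 was sold in 2009, state-wise?
p1_by_state = conn.execute(
    "SELECT state, SUM(units) FROM sales "
    "WHERE product = 'P1' AND year = 2009 GROUP BY state ORDER BY state"
).fetchall()

# Top 5 selling products in 2010.
top_2010 = conn.execute(
    "SELECT product, SUM(units) AS total FROM sales "
    "WHERE year = 2010 GROUP BY product ORDER BY total DESC LIMIT 5"
).fetchall()

print(p1_by_state)
print(top_2010)
```

Both queries follow the same pattern: filter on some dimensions, group by others, aggregate a measure.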

Data Mining Investigations

Which types of customers are most likely to spend the most with us in the coming year?

What additional products are most likely to be sold to customers who buy sportswear?

In which area should we open a new store in the next year?

What are the characteristics of customers most likely to default on their loans before the year is out?
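The sportswear question is a market-basket problem. As a very small sketch, not a full association-rule algorithm, the code below ranks other items by how often they appear in baskets that contain the anchor item. The baskets are hypothetical.

```python
from collections import Counter

# Hypothetical purchase baskets, one set of product categories per customer.
BASKETS = [
    {"sportswear", "running shoes", "water bottle"},
    {"sportswear", "running shoes"},
    {"sportswear", "socks"},
    {"formal wear", "socks"},
]


def likely_additions(baskets, anchor):
    """Rank other items by confidence: P(item in basket | anchor in basket)."""
    with_anchor = [b for b in baskets if anchor in b]
    counts = Counter(item for b in with_anchor for item in b if item != anchor)
    return [(item, n / len(with_anchor)) for item, n in counts.most_common()]


ranking = likely_additions(BASKETS, "sportswear")
print(ranking[0])
```

Unlike the OLAP queries, the analyst does not name the product to look for. The ranking is discovered from the data.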

Continuum of Analysis

OLTP: primitive and canned analysis

OLAP: complex ad hoc analysis

Data mining: automated analysis

Techniques range from SQL at the OLTP end to specialized algorithms at the data mining end.

Data Marts

What is a data mart?

Advantages and disadvantages of data marts

Issues with the development and management of data marts

Data Marts

A data mart is a subset of a data warehouse that supports the requirements of a particular department or business process.

A data mart is a subset of the corporate-wide data warehouse that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

Characteristics include:

Does not always contain detailed data, unlike data warehouses

More easily understood and navigated

Can be dependent or independent

Data Mart: A scaled-down version of the data warehouse

A data mart is a small warehouse designed for the department level.

It is often a way to gain entry and provide an opportunity to learn.

Major problem: if they differ from department to department, they can be difficult to integrate enterprise-wide.


Reasons for Creating Data Marts

Proof of concept for the DW

Can be developed quickly and are less resource-intensive than a DW

To give users access to the data they need to analyze most often

To improve query response time by reducing the volume of data to be accessed
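The last two reasons can be illustrated with a dependent data mart built as a filtered, summarized copy of a warehouse table. The sketch assumes hypothetical `dw_sales` and `mart_marketing` tables in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dw_sales (department TEXT, product TEXT, month TEXT, amount REAL);
    INSERT INTO dw_sales VALUES
        ('marketing', 'P1', '2013-01', 120.0),
        ('marketing', 'P1', '2013-01', 80.0),
        ('marketing', 'P2', '2013-02', 50.0),
        ('finance',   'P9', '2013-01', 999.0);
    -- Dependent data mart: sourced from the warehouse, one department, summarized.
    CREATE TABLE mart_marketing AS
        SELECT product, month, SUM(amount) AS amount
        FROM dw_sales
        WHERE department = 'marketing'
        GROUP BY product, month;
    """
)

warehouse_rows = conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
mart_rows = conn.execute(
    "SELECT product, month, amount FROM mart_marketing ORDER BY product, month"
).fetchall()
print(warehouse_rows, mart_rows)
```

The mart holds fewer rows than the warehouse and only the columns the department needs, which is where the faster query response comes from.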


Kimball vs Inmon

Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.


Kimball vs Inmon

Bill Inmon: Endorses a Top-Down design

Independent data marts cannot comprise an effective EDW.

Organizations must focus on building EDW

Ralph Kimball: Endorses a Bottom-Up design

The EDW effectively grows up around the several independent data marts, such as those for sales, inventory, or marketing.

Kimball vs Inmon: War of Words

"...The data warehouse is nothing more than the union of all the data marts...," Ralph Kimball, December 29, 1997.

"You can catch all the minnows in the ocean and stack them together and they still do not make a whale," Bill Inmon, January 8, 1998.

Kimball vs. Inmon

There is no right or wrong between these two ideas, as they represent different data warehousing philosophies.

In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.

Data Warehousing Process

Enterprise-wide warehouse, top down: the Inmon methodology

Data mart, bottom up: the Kimball methodology

When properly executed, both result in an enterprise-wide data warehouse.


Data warehouse versus data mart.


Building a Data Warehouse


Questions to be asked:

Top-down or bottom-up approach?

Enterprise-wide or departmental?

Which first—data warehouse or data mart?

Build pilot or go with a full-fledged implementation?

Dependent or independent data marts?