data warehousing. introduction modern organizations have huge amounts of data but are starving for...

20
DATA WAREHOUSING

Upload: alicia-allen

Post on 28-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

DATA WAREHOUSING

Page 2: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Introduction Modern organizations have huge amounts of data

but are starving for information – facing information gap!

Reasons for information gap: Fragmented information systems, uncoordinated

(sometimes inconsistent) databases /islands of information due to time and resources constraints

Most systems support operational /transaction processing rather than informational processing.

Bridging the information gap are data warehouses!

Page 3: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Introduction (Contd.)

A data warehouse bridges this information gap as it consolidates and integrates information from

internal and external sources and arranges it in a meaningful format for making accurate

business decisions

The two noticeable pioneers in the DWH field are Ralph Kimball and Bill Inmon.

The term Data Warehouse was coined by Bill Inmon in 1990

Page 4: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

What is a Data Warehouse? Inmon defines a data warehouse as follows:

"a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".

Description of the above terms according to Inmon Subject Oriented:

Data that gives information about a particular subject instead of about a company's ongoing operations.

Integrated: Data that is gathered into the data warehouse from a variety of

sources and merged into a coherent whole.

Page 5: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Time-variant: All data in the data warehouse is identified with a

particular time dimension. The time factor can be used to study trends and changes.

Non-volatile Data is stable in a data warehouse. More data is added

but data is never removed. This enables management to gain a consistent picture of the business.

Page 6: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

What is a Data Warehouse?

This definition by Inmon remains reasonably accurate almost ten years later. However, a single-subject data warehouse is typically referred to as

a data mart, while data warehouses are generally enterprise in scope.

data warehouses can be volatile. Due to the large amount of storage required for a data warehouse,

(multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse.

For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added.

Page 7: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

What is a Data Warehouse?

Ralph Kimball provided a much simpler definition of a data warehouse. a data warehouse is "a copy of transaction data

specifically structured for query and analysis".

This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.

Page 8: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

The Need for Data warehousing

Two factors drive the need for data warehousing

Need of company-wide view of data

Need to separate operational* and informational* systems

* See slide notes

Page 9: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Data warehousing?

It is the process of creating, populating, and then querying a data warehouse

Components of data warehousing are Source Systems Identification Data warehouse Design and Creation Data Acquisition Changed data capture Data Cleansing Data Aggregation Business Intelligence

Multidimensional Analysis Tools Query Tools Data Mining Tools Data Visualization Tools

Metadata Management

Page 10: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing Source System Identification

Appropriate data must be located to build a data warehouse. This data comes from

Current OLTP systems (providing day-to-day information) `Legacy systems (providing historical data for prior periods)

Data Warehouse Design and Creation The design is often an iterative process Requires effort to understand database schema and a great deal

of user interaction

Page 11: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing Data Acquisition

Involves moving company data from source systems into warehouse

Time consuming activity Performed using ETL (Extract/Transform/ Load) Tools Data acquisition is an ongoing, scheduled process.

Warehouse is refreshed monthly Changed data capture

Periodic update of warehouse from transactional systems is complicated as it is difficult to identify which records in source have changed since last update.

Some technologies used in this area are Replication servers, Publish/Subscribe, Triggers and Stored procedures, and Database Log analysis.

Page 12: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing Data Cleansing

Typically performed in conjunction with data acquisition (part of “T” in “ETL”)

A complicated process that validates and if required corrects data before its loaded into warehouse

Example The entries for "Customer Name" may appear differently in

various source systems for the same customer. one entered as "IBM", one as "I.B.M.", and one as

"International Business Machines". A decision must be made as to which is correct, and then the

data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data

quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files"

or "sequential files").

Page 13: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing

Data Aggregation If required, it is performed during “T” phase of “ETL” Data warehouse can be designed to store data at detail level (each

individual transaction), at some aggregate level (summary data) or a combination of both.

The advantage of summarized data is that typical queries against the warehouse run faster.

The disadvantage is that information, which may be needed to answer a query, is lost during aggregation

Business Intelligence (BI) Contains technologies such as Decision Support Systems (DSS),

Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more.

Page 14: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing

BI can be broken down into four broad fields Multidimensional Analysis Tools

Allow user to look at data from various different angles Often use a multidimensional database called cube

Query Tools Allow user to issue SQL queries against warehouse and get results

Data Mining Tools Automatically search for patterns in data Driven by complex statistical formulas

Data Visualization Tools Graphically represent data including 3D data pictures

Page 15: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Components of Data warehousing

Metadata management Throughout the entire process of identifying,

acquiring, and querying the data, metadata management takes place

The datatype (e.g., string or integer) of the column is metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data

Metadata is stored in metadata repository Metadata is useful in almost all components of data

warehousing discussed earlier

Page 16: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Data Mining

Definition: Knowledge discovery using a sophisticated blend of techniques

from traditional statistics, artificial intelligence and computer graphics

Goals To explain observed events or conditions, such as why sales of

a product have increased in a particular area To confirm hypothesis, such as, whether two-income families are

more likely to buy family medical cover than single-income families

To analyze data for new or unexpected relationships, such as what spending patterns are likely to accompany credit card fraud.

Page 17: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Data Mining Techniques Case-based reasoning

Derives rules from real world case examples Rule discovery

Searches for patterns and correlations in large data sets Signal Processing

Identifies clusters of observations with similar characteristics

Neural nets Develops predictive models based on principles

modeled after the human brain Fractals

Compresses large databases without losing information

Page 18: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Some Data Mining Applications

Analysis of business trends Identifying markets with above average or below average growth

Target marketing Identifying customers for promotional activity

Usage Analysis Identifying usage patterns of products and services

Product Affinity Identifying products that are purchased concurrently or

characteristics of shoppers For further application types - refer to book page 437

table 11.5

Page 19: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Reading Assignment

Read these concepts from Book The ETL process Definitions

@ctive data warehouse Data mart

Independent data mart dependent data mart EDW

Star Schema Snowflake Schema OLAP MOLAP ROLAP

Page 20: DATA WAREHOUSING. Introduction Modern organizations have huge amounts of data but are starving for information – facing information gap! Reasons for information

Resources

http://www.intranetjournal.com/features/datawarehousing.html

Modern Database Management, 6/e