DATA WAREHOUSING
Introduction Modern organizations have huge amounts of data
but are starving for information – facing information gap!
Reasons for information gap: Fragmented information systems, uncoordinated
(sometimes inconsistent) databases /islands of information due to time and resources constraints
Most systems support operational /transaction processing rather than informational processing.
Bridging the information gap are data warehouses!
Introduction (Contd.)
A data warehouse bridges this information gap as it consolidates and integrates information from
internal and external sources and arranges it in a meaningful format for making accurate
business decisions
The two noticeable pioneers in the DWH field are Ralph Kimball and Bill Inmon.
The term Data Warehouse was coined by Bill Inmon in 1990
What is a Data Warehouse? Inmon defines a data warehouse as follows:
"a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
Description of the above terms according to Inmon Subject Oriented:
Data that gives information about a particular subject instead of about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of
sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a
particular time dimension. The time factor can be used to study trends and changes.
Non-volatile Data is stable in a data warehouse. More data is added
but data is never removed. This enables management to gain a consistent picture of the business.
What is a Data Warehouse?
This definition by Inmon remains reasonably accurate almost ten years later. However, a single-subject data warehouse is typically referred to as
a data mart, while data warehouses are generally enterprise in scope.
data warehouses can be volatile. Due to the large amount of storage required for a data warehouse,
(multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse.
For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added.
What is a Data Warehouse?
Ralph Kimball provided a much simpler definition of a data warehouse. a data warehouse is "a copy of transaction data
specifically structured for query and analysis".
This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.
The Need for Data warehousing
Two factors drive the need for data warehousing
Need of company-wide view of data
Need to separate operational* and informational* systems
* See slide notes
Data warehousing?
It is the process of creating, populating, and then querying a data warehouse
Components of data warehousing are Source Systems Identification Data warehouse Design and Creation Data Acquisition Changed data capture Data Cleansing Data Aggregation Business Intelligence
Multidimensional Analysis Tools Query Tools Data Mining Tools Data Visualization Tools
Metadata Management
Components of Data warehousing Source System Identification
Appropriate data must be located to build a data warehouse. This data comes from
Current OLTP systems (providing day-to-day information) `Legacy systems (providing historical data for prior periods)
Data Warehouse Design and Creation The design is often an iterative process Requires effort to understand database schema and a great deal
of user interaction
Components of Data warehousing Data Acquisition
Involves moving company data from source systems into warehouse
Time consuming activity Performed using ETL (Extract/Transform/ Load) Tools Data acquisition is an ongoing, scheduled process.
Warehouse is refreshed monthly Changed data capture
Periodic update of warehouse from transactional systems is complicated as it is difficult to identify which records in source have changed since last update.
Some technologies used in this area are Replication servers, Publish/Subscribe, Triggers and Stored procedures, and Database Log analysis.
Components of Data warehousing Data Cleansing
Typically performed in conjunction with data acquisition (part of “T” in “ETL”)
A complicated process that validates and if required corrects data before its loaded into warehouse
Example The entries for "Customer Name" may appear differently in
various source systems for the same customer. one entered as "IBM", one as "I.B.M.", and one as
"International Business Machines". A decision must be made as to which is correct, and then the
data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data
quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files"
or "sequential files").
Components of Data warehousing
Data Aggregation If required, it is performed during “T” phase of “ETL” Data warehouse can be designed to store data at detail level (each
individual transaction), at some aggregate level (summary data) or a combination of both.
The advantage of summarized data is that typical queries against the warehouse run faster.
The disadvantage is that information, which may be needed to answer a query, is lost during aggregation
Business Intelligence (BI) Contains technologies such as Decision Support Systems (DSS),
Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more.
Components of Data warehousing
BI can be broken down into four broad fields Multidimensional Analysis Tools
Allow user to look at data from various different angles Often use a multidimensional database called cube
Query Tools Allow user to issue SQL queries against warehouse and get results
Data Mining Tools Automatically search for patterns in data Driven by complex statistical formulas
Data Visualization Tools Graphically represent data including 3D data pictures
Components of Data warehousing
Metadata management Throughout the entire process of identifying,
acquiring, and querying the data, metadata management takes place
The datatype (e.g., string or integer) of the column is metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data
Metadata is stored in metadata repository Metadata is useful in almost all components of data
warehousing discussed earlier
Data Mining
Definition: Knowledge discovery using a sophisticated blend of techniques
from traditional statistics, artificial intelligence and computer graphics
Goals To explain observed events or conditions, such as why sales of
a product have increased in a particular area To confirm hypothesis, such as, whether two-income families are
more likely to buy family medical cover than single-income families
To analyze data for new or unexpected relationships, such as what spending patterns are likely to accompany credit card fraud.
Data Mining Techniques Case-based reasoning
Derives rules from real world case examples Rule discovery
Searches for patterns and correlations in large data sets Signal Processing
Identifies clusters of observations with similar characteristics
Neural nets Develops predictive models based on principles
modeled after the human brain Fractals
Compresses large databases without losing information
Some Data Mining Applications
Analysis of business trends Identifying markets with above average or below average growth
Target marketing Identifying customers for promotional activity
Usage Analysis Identifying usage patterns of products and services
Product Affinity Identifying products that are purchased concurrently or
characteristics of shoppers For further application types - refer to book page 437
table 11.5
Reading Assignment
Read these concepts from Book The ETL process Definitions
@ctive data warehouse Data mart
Independent data mart dependent data mart EDW
Star Schema Snowflake Schema OLAP MOLAP ROLAP
Resources
http://www.intranetjournal.com/features/datawarehousing.html
Modern Database Management, 6/e