data warehouse components
2/21/2013
Dr. Yashvardhan Sharma
Assistant Professor, CS & IS Dept.
BITS-Pilani
SS G515 - Data Warehousing: Introduction
Topics Already Covered
Data Warehouse Architecture and its components
Extraction, Transformation and Loading (ETL)
Data Marts
Approaches to Designing Data Warehouses
Inmon’s
Kimball’s
A General Architecture for Data Warehousing
The major components of data warehouse architecture are:
Source systems are where the data comes from.
Extraction, transformation, and load (ETL) move data between different data stores.
The central repository is the main store for the data warehouse.
The metadata repository describes what is available and where.
Data marts provide fast, specialised access for end users and applications.
Operational feedback integrates decision support back into the operational systems.
End-users are the reason for developing the warehouse in the first place.
MOLAP: Multi-Dimensional On-Line Analytical Processing
ROLAP: Relational On-Line Analytical Processing
Loading the Data Warehouse
Source Systems (OLTP) → Data Staging Area → Data Warehouse
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
Data Warehousing Architecture
[Figure: Data Warehousing Architecture. Labels: External Sources; Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; Serve; OLAP servers; Analysis; Query/Reporting; Data Mining]
Data Warehouse COMPONENTS
Source Data Component
Production Data
Internal Data
Archived Data
External Data
Data Staging Component
Data Extraction
Data Transformation
Data Loading
Data Storage Component
Many of the data warehouses also employ multidimensional
database management systems. Data extracted from the data
warehouse storage is aggregated in many ways and the
summary data is kept in the multidimensional databases
(MDDBs). Such multidimensional database systems are
usually proprietary products.
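As a rough illustration of data being "aggregated in many ways", the sketch below precomputes totals for every combination of dimensions from a handful of detail rows. The sample rows, the dimension names, and the `build_summaries` helper are hypothetical, and this is not a description of how any particular MDDB product stores its data.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical detail rows as they might be extracted from warehouse storage.
DETAIL_ROWS = [
    {"product": "P1", "state": "Goa", "year": 2009, "sales": 100},
    {"product": "P1", "state": "Goa", "year": 2010, "sales": 150},
    {"product": "P1", "state": "Delhi", "year": 2009, "sales": 200},
    {"product": "P2", "state": "Delhi", "year": 2009, "sales": 50},
]
DIMENSIONS = ("product", "state", "year")


def build_summaries(rows, dimensions, measure):
    """Return {dimension subset: {dimension values: total}} for every subset."""
    summaries = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            summaries[dims] = dict(totals)
    return summaries


summaries = build_summaries(DETAIL_ROWS, DIMENSIONS, "sales")
print(summaries[("product",)])       # totals per product
print(summaries[("state", "year")])  # totals per state and year
print(summaries[()])                 # grand total
```

With three dimensions this yields eight summaries. The number doubles with each added dimension, which is one reason summary data is usually kept in a dedicated store.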
Information Delivery Component
Metadata Component
Metadata in a data warehouse is similar to a data dictionary,
but it covers much more than a data dictionary does.
Types of Metadata
Operational Metadata
Extraction and Transformation Metadata
End-User Metadata
More Details in Chapter 9.
Why Meta Data: Special Significance
First, it acts as the glue that connects all parts of the data
warehouse.
Next, it provides information about the contents and
structures to the developers.
Finally, it opens the door to the end-users and makes the
contents recognizable in their own terms.
[Figure: Typical architecture of a data warehouse. Labels: operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS; meta-data; highly summarized data; lightly summarized data; detailed data; archive/backup data; end-user access tools: reporting, query, application development, and EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools]
The main components
Operational data sources: data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
Load manager: also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
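A few of the warehouse manager operations listed above (creating an index, creating a view, generating an aggregation) can be sketched with Python's built-in `sqlite3` module. The table, column, and index names are hypothetical, and SQLite stands in for whatever DBMS a real warehouse would use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_detail (
        sale_date TEXT, product TEXT, state TEXT, amount REAL
    );
    INSERT INTO sales_detail VALUES
        ('2009-01-15', 'P1', 'Goa', 100.0),
        ('2009-02-10', 'P1', 'Goa', 50.0),
        ('2009-02-11', 'P2', 'Delhi', 75.0);

    -- Index to speed up lookups by product.
    CREATE INDEX idx_sales_product ON sales_detail (product);

    -- View exposing the detail data with a derived month column.
    CREATE VIEW sales_by_month AS
        SELECT substr(sale_date, 1, 7) AS month, product, state, amount
        FROM sales_detail;

    -- Aggregation: a lightly summarized table derived from the detail data.
    CREATE TABLE sales_monthly_summary AS
        SELECT month, product, SUM(amount) AS total_amount
        FROM sales_by_month
        GROUP BY month, product;
    """
)

summary_rows = conn.execute(
    "SELECT month, product, total_amount FROM sales_monthly_summary "
    "ORDER BY month, product"
).fetchall()
print(summary_rows)
```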
Query manager: also called the back-end component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
Detailed, lightly summarized, and highly summarized data; archive/backup data
Meta-data
End-user access tools: these can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
Data flows
Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
Downflow: the processes associated with archiving and backing up the data in the warehouse.
Outflow: the processes associated with making the data available to the end-users.
Meta-flow: the processes associated with the management of the meta-data.
[Figure: Information flows of a data warehouse. Labels: inflow, upflow, downflow, outflow, meta-flow; operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS; meta-data; highly summarized data; lightly summarized data; detailed data; archive/backup data; end-user access tools: reporting, query, application development, and EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools]
Tools and Technologies
The critical steps in the construction of a data warehouse:
a. Extraction
b. Cleansing
c. Transformation
After these critical steps, loading the results into the target system can be carried out either by separate products or by a single product. Tool categories:
Code generators
Database data replication tools
Dynamic transformation engines
Populating & Refreshing the Warehouse
Data Extraction
Data Cleaning
Data Transformation
Convert from legacy/host format to warehouse format
Load
Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Refresh
Bring new data from source systems
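The populate-and-refresh steps above can be sketched end to end as a small pipeline. This is a minimal illustration only: the legacy export format, the `orders` table, and the helper names are assumptions, with CSV text standing in for a source system and SQLite standing in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical legacy exports: dates as DD/MM/YYYY, amounts as text.
LEGACY_EXPORT = """order_id,order_date,amount
1,15/01/2009,100.50
2,16/01/2009, 20.00
3,,5.00
"""
NEW_EXPORT = """order_id,order_date,amount
2,16/01/2009,20.00
4,17/01/2009,7.25
"""


def extract(text):
    """Extraction: read raw records from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records):
    """Cleaning: drop records that lack an order date."""
    return [r for r in records if r["order_date"].strip()]


def transform(records):
    """Transformation: convert the legacy format to the warehouse format."""
    rows = []
    for r in records:
        day, month, year = r["order_date"].strip().split("/")
        rows.append((int(r["order_id"]), f"{year}-{month}-{day}", float(r["amount"])))
    return rows


def load(conn, rows):
    """Load: insert rows, skip order_ids already present, and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR IGNORE INTO orders VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_orders_date ON orders (order_date)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract(LEGACY_EXPORT))))  # initial population
load(conn, transform(clean(extract(NEW_EXPORT))))     # refresh with new data
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

The refresh reuses the same four functions. Only order 4 is added on the second run because order 2 is already in the table.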
ETL Process : Issues & Challenges
Consumes 70-80% of project time
Heterogeneous Source Systems
Little or no control over source systems
Source systems scattered
Source systems operating in different time zones
Different currencies
Different measurement units
Data not captured by OLTP systems
Ensuring data quality
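Several of these issues (time zones, currencies, measurement units) come down to converting each source record to one agreed convention before loading. The sketch below assumes USD, kilograms, and UTC as the targets. The conversion rates, site offsets, and record layout are made up for illustration, and fixed UTC offsets ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reference data; real rates and offsets would be maintained elsewhere.
TO_USD = {"USD": 1.0, "INR": 0.02, "EUR": 1.25}
TO_KG = {"kg": 1.0, "lb": 0.45359237, "g": 0.001}
UTC_OFFSET_HOURS = {"Mumbai": 5.5, "London": 0.0, "New York": -5.0}


def normalize(record):
    """Convert one source record to USD, kilograms, and UTC."""
    offset = timezone(timedelta(hours=UTC_OFFSET_HOURS[record["site"]]))
    local_time = datetime.strptime(record["sold_at"], "%Y-%m-%d %H:%M").replace(tzinfo=offset)
    return {
        "site": record["site"],
        "sold_at_utc": local_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M"),
        "amount_usd": round(record["amount"] * TO_USD[record["currency"]], 2),
        "weight_kg": round(record["weight"] * TO_KG[record["unit"]], 3),
    }


mumbai = normalize({"site": "Mumbai", "sold_at": "2013-02-21 09:00",
                    "amount": 5000, "currency": "INR", "weight": 500, "unit": "g"})
new_york = normalize({"site": "New York", "sold_at": "2013-02-20 22:30",
                      "amount": 80, "currency": "USD", "weight": 10, "unit": "lb"})
print(mumbai)
print(new_york)
```

The two sample sales happen at the same instant even though their local timestamps fall on different calendar days. Queries that compare sites would miss this if the data were loaded without normalization.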
Data Staging Area
A storage area where extracted data is
Cleaned
Transformed
Deduplicated
Initial storage for data
Need not be based on Relational model
Spread over a number of machines
Mainly sorting and Sequential processing
COBOL or C code running against flat files
Does not provide data access to users
Analogy – kitchen of a restaurant
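The "sorting and sequential processing" style of staging work can be illustrated with deduplication over a flat file: sort the records by key, then make a single pass that keeps one record per key. The sample extract and the keep-the-latest rule are assumptions chosen for this sketch.

```python
import csv
import io
from itertools import groupby

# Hypothetical flat-file extract: the same customers arrive from two source systems.
EXTRACT = """customer_id,name,updated
102,Asha Rao,2013-01-05
101,R. Mehta,2012-11-30
102,Asha Rao,2013-02-01
101,Ravi Mehta,2013-01-15
"""


def deduplicate(text):
    """Sort by key and date, then keep the latest record per key in one pass."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: (r["customer_id"], r["updated"]))
    latest = []
    for _, group in groupby(rows, key=lambda r: r["customer_id"]):
        # Records within a group are in ascending date order after the sort.
        latest.append(list(group)[-1])
    return latest


staged = deduplicate(EXTRACT)
for row in staged:
    print(row["customer_id"], row["name"], row["updated"])
```

The sequential pass depends on the sort, since `groupby` only groups adjacent records. That is the same reason flat-file staging jobs sort before they merge or deduplicate.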
Presentation Servers
A target physical machine on which DW data is organized for
Direct querying by end users using OLAP
Report writers
Data Visualization tools
Data mining tools
Data stored in Dimensional framework
Analogy – Sitting area of a restaurant
Data Cleaning
Why? A data warehouse contains data that is analyzed for business decisions
More data and multiple sources can mean more errors in the data, and such errors are harder to trace
Errors result in incorrect analysis
Detecting data anomalies and rectifying them early has huge payoffs
Long-term solution: change business practices and data entry tools
Repository for meta-data
Soundex Algorithms
Misspelled terms
For example NAMES
Phonetic algorithms – can find similar sounding names
Based on the six phonetic classifications of human speech sounds
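A minimal sketch of the common American Soundex variant shows how similar-sounding names collapse to the same code. The six digit groups in `CODES` correspond to the six phonetic classifications. Other variants handle some cases differently, so treat this as illustrative rather than as the exact algorithm the slides have in mind.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the 4-character Soundex code for a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeat after a vowel is kept
    return (result + "000")[:4]


for name in ("Robert", "Rupert", "Smith", "Smyth", "Tymczak", "Pfister"):
    print(name, soundex(name))
```

Matching on the code instead of the raw spelling lets a cleaning step flag "Smith" and "Smyth" as candidate duplicates for review.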
Data Warehouse Design
OLTP Systems are Data Capture Systems
“DATA IN” systems
DW are “DATA OUT” systems
OLTP DW
Analyzing the DATA
Active Analysis – User Queries
User-guided data analysis
Show me how X varies with Y
OLAP
Automated Analysis – Data Mining
What’s in there?
Set the computer FREE on your data
Supervised Learning (classification)
Unsupervised Learning (clustering)
OLAP Queries
How much of product P1 was sold in 2009, state-wise?
Top 5 selling products in 2010
Total sales in Q1 of FY 2008-09?
Color-wise sales figures for cars from 2008 to 2010
Model-wise sales of cars for the month of January from 2006 to 2010
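Two of these questions can be written as SQL aggregations over a sales table. The sketch below uses Python's built-in `sqlite3` module with a hypothetical `sales` table and made-up rows. A real OLAP server would typically answer such queries from dimensional or pre-aggregated structures instead of a single flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product TEXT, state TEXT, year INTEGER, units INTEGER);
    INSERT INTO sales VALUES
        ('P1', 'Goa', 2009, 10), ('P1', 'Goa', 2009, 5),
        ('P1', 'Delhi', 2009, 7), ('P1', 'Delhi', 2010, 20),
        ('P2', 'Goa', 2010, 30), ('P3', 'Goa', 2010, 12);
    """
)

# How much of product P1 was sold in 2009, state-wise?
p1_by_state = conn.execute(
    "SELECT state, SUM(units) FROM sales "
    "WHERE product = 'P1' AND year = 2009 "
    "GROUP BY state ORDER BY state"
).fetchall()

# Top 5 selling products in 2010
top_2010 = conn.execute(
    "SELECT product, SUM(units) AS total FROM sales "
    "WHERE year = 2010 GROUP BY product "
    "ORDER BY total DESC LIMIT 5"
).fetchall()

print(p1_by_state)
print(top_2010)
```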
Data Mining Investigations
Which types of customers are most likely to spend the most with us in the coming year?
What additional products are most likely to be sold to customers who buy sportswear?
In which area should we open a new store in the next year?
What are the characteristics of customers most likely to default on their loans before the year is out?
Continuum of Analysis
OLTP: primitive and canned analysis
OLAP: complex ad-hoc analysis
Data mining: automated analysis
Tools along the continuum range from SQL to specialized algorithms
Data Marts
What is a data mart?
Advantages and disadvantages of data marts
Issues with the development and management of data marts
Data Marts
A subset of a data warehouse that supports the requirements of a particular department or business process
A data mart is a subset of the corporate-wide data warehouse that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.
Characteristics include:
Does not always contain detailed data, unlike a data warehouse
More easily understood and navigated
Can be dependent or independent
Data Marts
Data Mart: A scaled-down version of the data warehouse
A data mart is a small warehouse designed for the
department level.
It is often a way to gain entry and provide an opportunity to
learn
Major problem: if they differ from department to department, they can be difficult to integrate enterprise-wide
Reasons for Creating Data Marts
Proof of Concept for the DW
Can be developed quickly and is less resource-intensive than a DW
To give users access to data they need to analyze most often
To improve query response time due to reduction in the volume
of data to be accessed
Kimball vs Inmon
Bill Inmon's paradigm: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Ralph Kimball's paradigm: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
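To make "dimensional model" concrete, here is a minimal star-schema sketch: one fact table of measures surrounded by denormalized dimension tables. The table and column names are hypothetical, and SQLite is used only because it ships with Python. In a third-normal-form design, an attribute such as `category` would normally be split into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive context, deliberately denormalized.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

    -- Fact table: numeric measures keyed by the dimensions.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        amount REAL
    );

    INSERT INTO dim_product VALUES (1, 'P1', 'Sportswear'), (2, 'P2', 'Footwear');
    INSERT INTO dim_date VALUES (20090115, 2009, 'Q1'), (20090710, 2009, 'Q3');
    INSERT INTO fact_sales VALUES
        (1, 20090115, 100.0), (2, 20090115, 40.0), (1, 20090710, 60.0);
    """
)

# A typical dimensional query: join the fact table to its dimensions and aggregate.
by_category_quarter = conn.execute(
    """
    SELECT p.category, d.quarter, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
    """
).fetchall()
print(by_category_quarter)
```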
Kimball vs Inmon
Bill Inmon: Endorses a Top-Down design
Independent data marts cannot comprise an effective EDW.
Organizations must focus on building EDW
Ralph Kimball: Endorses a Bottom-Up design
EDW effectively grows up around the several independent data marts, such as those for sales, inventory, or marketing
Kimball vs Inmon: War of Words
"...The data warehouse is nothing more than the union of
all the data marts...,"
Ralph Kimball, December 29, 1997.
"You can catch all the minnows in the ocean and stack
them together and they still do not make a whale,"
Bill Inmon, January 8, 1998.
Kimball vs. Inmon
There is no right or wrong between these two ideas, as
they represent different data warehousing philosophies.
In reality, the data warehouses in most enterprises are
closer to Ralph Kimball's idea. This is because most data
warehouses started out as a departmental effort, and
hence they originated as a data mart. Only when more
data marts are built later do they evolve into a data
warehouse.
Data Warehousing Process
Enterprise-wide warehouse, top down, the Inmon
methodology
Data mart, bottom up, the Kimball methodology
When properly executed, both result in an enterprise-wide
data warehouse
Data warehouse versus data mart.
Building a Data Warehouse
Questions to be asked:
Top-down or bottom-up approach?
Enterprise-wide or departmental?
Which first—data warehouse or data mart?
Build pilot or go with a full-fledged implementation?
Dependent or independent data marts?