comp 578 data warehouse and data warehousing: an introduction keith c.c. chan department of...

33
COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

Post on 19-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

COMP 578Data Warehouse and Data Warehousing:

An Introduction

Keith C.C. Chan

Department of Computing

The Hong Kong Polytechnic University

Page 2: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

2

What is A Data Warehouse?

DataWarehouse

HRO

Health

Student

ITs

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.— W. H. Inmon

Page 3: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

3

Data Warehousing and Industry

• One of the hottest topic in IS.– Over 90% of larger companies either have a DW or are starting one.

• Warehousing is big business– Old statistics from Megroup.

• $3.5 billion in early 1997• $8 billion in 1998 [Metagroup]• over $200 billion over next 5 years.

– Latest by IDC on DW tools.• $5 billion in 1999.• $16 billion in 2004.

– Latest by IDC on CRM applications• $61 billion in 2001• $148 billion in 2005

Page 4: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

4

Data Warehousing and Industry (2)

• A 1996 study of 62 data warehousing projects showed an average return on investment of 321%, with an average payback period of 2.73 years.

• In 2003, some people are skeptical.• WalMart has largest warehouse

– 900-CPU, 2,700 disk, 23 TB Teradata system– ~7TB in warehouse– 40-50GB per day

Page 5: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

Why Data Warehouse?Why The Hype?

Page 6: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

6

Information vs. Data

• Information is pivotal in today’s business environment. Success is dependent on its early and decisive use. A lack of information is a sure sign for failure. The rapidly changing environment in which business operates demands ever more immediate access to data. (Devlin, 1997)

• Many corporations are actively looking for new technologies that will assist them in becoming more profitable and competitive. Gaining competitive advantage requires that companies accelerate their decision making process so that they can respond quickly to change. One key to this accelerated decision making is having the right information, at the right time, easily accessible (Poe, 1996).

Page 7: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

7

The Information Gap• The information gap is a result of:

– Fragmented way in which ISs and supporting DBs have been developed.• One-thing-at-a-time due to constraints on time and resources.• DBs on a variety of hardware and software platforms.• Difficult to locate and use accurate information.

– Most systems developed to support operational processing.• Operational processing (a.k.a. TP) captures, stores and

manipulates data to support daily operations.• Little thought given to the information or analytical tools

needed for decision making.

Page 8: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

8

Bridging The Information Gap

• Data warehouses (DW) consolidate and integrate information from many different sources and arrange it in a meaningful format for making accurate business decisions (Martin, 1997a).

• They support complex business decisions through analysis of trends, target marketing, competitive analysis, and so on.

• Data warehousing has evolved to meet these needs without disturbing existing operational processing.

Page 9: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

9

What Are The Issues?

• How DW relates to existing operational systems.

• Data architecture appropriate for most DW environments.

• Extracting data from existing operational systems and loading them into a DW.

• Interact with DW using OLAP, data mining and data visualization.

Page 10: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

Data Warehouse & Data Warehousing as Solution

Page 11: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

11

The Need for Data Warehouses

• Two major factors drive the need for data warehousing in most organizations today:– Business requires an integrated company-wide

view of high-quality information.– The IS department must separate informational

from operational systems in order to dramatically improve performance in managing company data.

Page 12: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

12

Need for a Company Wide View

• Data in operational systems typically fragmented and of poor quality.

• Generally distributed on a variety of incompatible HW and SW platforms:– Unix running oracle DBMS– IBM MVS running the DB2 DBMS

• Often necessary to provide a single, corporate view of that information for decision making.

Page 13: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

13

Deriving a Single Corporate View

• Develop a profile for each student from:– STUDENT_DATA, STUDENT-EMPLOYEE,

STUDENT_HEALTH

• Some issues to resolve:– Inconsistent key structures: HKID and student name– Synonyms: Student_No and Student_ID.– Free-form vs. structured fields: Last name, first name.– Inconsistent data values: different phone numbers.– Missing data: how will the value for insurance be located?

Page 14: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

14

Need to Separate Operational and Informational Systems

• Operational system used to run a business in real time based on current data.

– E.g. sales order processing, reservation systems, patient registration, – Process large volumes of relatively simple read/write transactions,

while providing fast response.

• Information systems designed to support decision making based on historical data.

– Designed for complex and read-only queries or data mining application.

– Sales trend analysis, customer segmentation, and human resource planning.

Page 15: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

15

Need to Separate Operational and Informational Systems (2)

• It is essential to separate informational processing from operational processing by creating a data warehouse.– A DW centralizes data (at least logically) that are scattered

throughout disparate operational systems and makes them readily available for decision support.

– A properly designed DW adds value to data by improving their quality and consistency.

– A separate data warehouse eliminates much of the contention for resources that results when informational applications are cofounded with operational processing.

Page 16: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

16

Data Warehouse vs. Operational DB Systems

• OLTP (on-line transaction processing)– Major task of traditional relational DBMS– Day-to-day operations: purchasing, inventory,

banking, manufacturing, payroll, registration, accounting, etc.

• OLAP (on-line analytical processing)– Major task of data warehouse system– Data analysis and decision making

Page 17: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

17

Data Warehouse vs. Operational DB Systems

• Distinct features (OLTP vs. OLAP):– User and system orientation: customer vs. market– Data contents: current, detailed vs. historical,

consolidated– Database design: ER + application vs. star +

subject– View: current, local vs. evolutionary, integrated– Access patterns: update vs. read-only but complex

queries

Page 18: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

18

Why Separate Data Warehouse?• High performance for both systems:

– DBMS — tuned for OLTP: access methods, indexing, concurrency control, recovery

– Warehouse — tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

• Different functions and different data:– missing data: Decision support requires historical data which

operational DBs do not typically maintain– data consolidation: DS requires consolidation (aggregation,

summarization) of data from heterogeneous sources– data quality: different sources typically use inconsistent data

representations, codes and formats which have to be reconciled.

Page 19: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

19

Advantages of Warehousing Approach

• High query performance– But not necessarily most current information

• Doesn’t interfere with local processing at sources– Complex queries at warehouse– OLTP at information sources

• Information copied at warehouse– Can modify, annotate, summarize, restructure, etc.– Can store historical information– Security, no auditing

• Has caught on in industry

Page 20: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

The Terms &The Definitions

Page 21: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

21

The Data Warehouse

• Strategic response to customer requirement for providing and processing information:– at various levels of abstraction– using history for trend analysis– with high performance

• What it provides - A protected business decision support environment - A repository of consolidated corporate data - A staging area for revitalizing operational systems

Page 22: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

22

Multi-Dimensional Database

What is A Data Warehouse

O L A P

Multi-relational tools

M e t a D a t aM e t a D a t a

D S SD S S E I SE I S

Data ScrubbersData Scrubbers

Star SchemaDim

ensional Data M

odeling

Dimensional D

ata Modelin

g

Middleware

Data M

artData Warehouse ManagerData Warehouse Manager

Data Propagation

Data Mining

Data Mining

Data Rotation

ESS

Page 23: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

23

What is a Data Warehouse?A Practitioners Viewpoint

“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”

-- Barry Devlin, IBM Consultant

Page 24: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

24

What is a Data Warehouse?An Alternative Viewpoint

“A DW is a – subject-oriented,– integrated,– time-varying,– non-volatile

collection of data that is used primarily in organizational decision making.”

-- W.H. Inmon, Building the Data Warehouse, 1992

Page 25: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

25

The Data Warehouse

• Key characteristics– Subject-oriented– Integrated– Time-variant– Nonvolatile

Page 26: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

26

Subject Oriented

Operational Applications/Databases

Data Warehouse Subjects

Order BillingAccounts ReceivableAccounts PayableLoansSavingsLife Insurance Claims ProcessingAuto Insurance

CustomerClaimsSalesProduct

• Data is stored by business subject rather than by application

Page 27: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

27

Integrated

Data WarehouseDatabase

Subject = Customer

SavingsDatabase

SavingsApplication

Current Accounts

Application

Current AccountsDatabase

PersonalLoans

Application

Personal LoansDatabase

Operational Environment Decision Support Environment

No

Application

Flavor

• Data is stored once in a single integrated location

Customer data stored in several Databases

Page 28: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

28

Time-variant

- Data is tagged with some element of time - creation date, as of date/to , etc.

- Data is available for long periods of time. For example, five or more years

Data Warehouse Data

Time Data {

Key, Version and Date timestamp

• Data is stored as a series of snapshots or views which records data content and context across time.

Page 29: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

29

Non-volatile

Data WarehouseBusiness Users & Applications

Internal Source Systems

CreateUpdateDeleteTransactions

READ-ONLY

• Existing data in the warehouse is not overwritten or updated.

DataWarehouse

External Source Systems

Page 30: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

30

How the Data Warehouse evolved

OperationalReporting

Data Extraction/Replication

Data WarehousesData Marts

OLAP ServersData Mining

Page 31: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

31

Line of Business Data Marts extend the concept

• Extends the concept of Data Warehousing into the various lines-of-business in support of specific needs for business intelligence

Data WarehouseOther

Business Source Systems

ExternalData

Line of BusinessSystems

Data Staging/Replication Layer

Data Marts

Extraction

Transformation

Cleansing

Page 32: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

32

Data Mining further extends the concepts of tactical access to data in support of

specific business objectives

• Specialized applications which run on OLAP servers for drill-down processing

• Can include access by neural nets, gophers and agents

Data WarehouseOther

Business Source Systems

ExternalData

Line of BusinessSystems

Data Staging/Replication Layer

Extraction

Transformation

Cleansing

Page 33: COMP 578 Data Warehouse and Data Warehousing: An Introduction Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University

33

Data Warehousing

• Definition 1:– The process of constructing and using data

warehouses

• Definition 2:– The process whereby organizations extract

meaning from their informational assets through the use of data warehouses.