data warehousing - georgia state university · data warehousing 2 organization of concepts purpose...

26
1 Data Warehousing 2 Organization of Concepts Purpose Definition Transaction Processing vs. Data Warehousing Data Requirements Data Structures Example Basic Features Data Mapping Example Architecture of an IBM Data Warehouse Architecture Preparing Data for DW Reasons for Failure Data Warehouse (Data Integration) Purpose MDDB vs. Relational Database Rotation Ranging Roll-Up Drill-Down Computations Basic Features Benefits PowerPlay Tools Multidimension Databases (Enabling Technology) Purpose Applications Availability Bias Sequencial Pattern Classifying Clustering Hybrid Association Functions Emergent Applications Data Mining (Relationship Discovery) Large-Scale Data Management for Organizational Decision-Making

Upload: nguyennhan

Post on 19-Apr-2018

225 views

Category:

Documents


6 download

TRANSCRIPT

1

1

Data Warehousing

2

Organization of Concepts

Purpose

Definition

Transaction Processing vs.Data Warehousing

Data Requirements

Data StructuresExample

Basic Features

Data MappingExample Architecture of an IBM

Data Warehouse

Architecture

Preparing Data for DW

Reasons for Failure

Data Warehouse(Data Integration)

Purpose

MDDB vs. Relational DatabaseRotationRangingRoll-UpDrill-DownComputations

Basic Features

Benefits

PowerPlay

Tools

Multidimension Databases(Enabling Technology)

Purpose

Applications

Availability BiasSequencial PatternClassifyingClusteringHybrid

Association

Functions

Emergent Applications

Data Mining(Relationship Discovery)

Large-Scale Data Management forOrganizational Decision-Making

2

3

Learning Objectives

59. Explain what is a Data-Warehouse (DW) and the rationale for building them.

60. Describe four levels of data granularity handled by a DW.

61. Identify data requirements for DW and Database Technology suitable for such requirements

62. Explain what is metadata and its importance for DW.63. Describe the process of getting data into DW.64. Identify common reasons for DW failure.

Appendix 1– Architecture of a Data Warehouse: An Example from IBM– Data Mapping: An Example from IBM

4

Definition: Data-Warehouse (L.O.59)• A data warehouse is a central source of data,

stocked with data extracted from different operational systems and standardized.– New hybrid information systems which bridge two

broad classes of systems, namely: storage and retrieval systems, and decision support systems

– The objective is not just the storage and retrieval of data objects, but also the programmed and ad-hoc transformation of data to information for its defined user base.

3

5

Illustration (L.O.59)

1. Some organizations, such as Grand Metropolitan PLC, consider their data warehouse essentially as front-ends to existing production systems.

2. Other organizations such as Owens-Corning Fiberglass Corp. view their data warehouse as a collection of subject- or application-oriented databases that are kept separate from the operational applications.

6

Rationale for DW (Why?)Reason #1: Data Resource Bottleneck (L.O.59)

• Knowledge workers need to gather and analyze business data to generate business information

• “Data in jail” syndrome• Data quality problems, including:

– missing data– unreliable data– invalid data– inappropriate granularity– problematic semantics

4

7

• Most organizational transaction systems designed for capture and storage, but not amenable for analysis

• Offers a good conceptual fit with the way end-users visualize business data– Most business people already think about their

businesses in multidimensional terms• Managers tend to ask questions about product sales in different

markets over specific time periods

Rationale for DW (Why?)Reason #2: Conceptual Fit for End Users Use (L.O.59)

8

HP

IBMDEC

Compaq

Enterprise Network Computing and Client/Server Technology areEnterprise Network Computing and Client/Server Technology arechanging the way organizations look at all of their information changing the way organizations look at all of their information systemssystems

Reason #3: The New Technological Environment (L.O.59)

5

9

Reason #3 continues: Implications of Newer Technological Capabilities (L.O.59)

• Emergence of Tools– Emergence of large-scale data management tools – Emergence of several end-user oriented data

exploration tools

• Support for increased flexibility and integration within environments– Transaction environments

– Decision support environment

• Support for increased flexibility and integration between environments – Improved interface between transactions and decision

support.

10

• Performance (Canned queries, MD Analysis, Ad hoc, Min. Impact on Operational System)

• Flexibility (MD Flex, Ad hoc, Change data structure)

• Scalability (No. of Users, Volume of Data)

• Ease of Use (Location, Formulation, Navigation, Manipulation)

• Data Quality (Consistent, Correct, Timely, Integrated)

• Performance (Canned queries, MD Analysis, Ad hoc, Min. Impact on Operational System)

• Flexibility (MD Flex, Ad hoc, Change data structure)

• Scalability (No. of Users, Volume of Data)

• Ease of Use (Location, Formulation, Navigation, Manipulation)

• Data Quality (Consistent, Correct, Timely, Integrated)

Reason #4: Overall Goals of DataWarehousing(L.O.59)

6

11

What types of data DW deals with?Four degrees of Data Granularity (L.O.60)

• Feed level:1. Operational data from TPS - fundamental to DW data

• Ex. sales of XYZ at the end of today is 2345 units

• Storage level:2 Atomic data - DW data - individual data item and mostly the

lowest level of data in DW. Ex. sales of XYZ at the end of Sept. is 2045 units

• … at the end of August 1743• … at the end of July 1345, etc…

3. Summary data - DW data - these summaries are calculated ahead of time and stored for recall. Ex. At the end of third quarter sales of XYZ for Southeastern region was 23,456; at the end of second quarter sales of XYZ…

4. Answers to specific question - processed and stored for individual user’s pc or workstation.

• How does the growth of XYZ sales in SE compares with NW?

12

DW data requirements (L.O.61)

• DW data requirements are different from traditional transaction processing (TP) data. For example:

Characteristics TP data DW data

Volatility Dynamic Static Currency Current Historical Time dimension Implicitly “now” Explicit, visible Granularity Primitive, detailed Detailed and derived

summaries Updates Continuous,

random Periodic, scheduled

Tasks Repetitive Unpredictable Flexibility Low High Performance High performance

mandatory Lower performance may beacceptable

7

13

Database Technology suitable for the above DW data requirements (L.O.61)

• Hierarchical and Network structures based technologies are not suitable

• Simple Relational with some ‘fixing’ can do the job

• A new structure - which is quite unsuitable for TP data - has emerged for DW use: “Multidimensional DB”– Time, itself, is a dimension in MDDB

14

Database Technologies for DW continues...

• All DWs have two types of data– Dimension data (also called Perspectives)

• Tens to a few mil. Rows• Textually described• Frequently modified

– Fact data (also called Measurements)• Millions or billions of rows• Numeric• Don’t change

8

15

Pet store example

• Multi-dimensional Database Structure Cats Dogs Fish Gerbils Adams 3 12 5 4 Baker 2 4 22 5 Carstens 6 2 3 16

16

Salesperson Pet Type No. SoldAdams Cats 3Adams Dogs 12Adams Fish 5Adams Gerbils 4Baker Cats 2Baker Dogs 4Baker Fish 22Baker Gerbils 5Carstens Cats 6Carstens Dogs 2Carstens Fish 3Carstens Gerbils 16

Relational Equivalence (L.O 61 finishes here)

9

17

More on MDDB later in the lecture

18

Implications due to Metadata and Semantics (L.O.62)

– Human• “Natural” assumptions are not always correct• What to do with same item - shoe size - length vs.

width• Is ‘pay-rate’ same as ‘salary’ in another database• Validation rules for common human items such as

U.S. zip code– Computer-based metadata for people to use

• create metadata warehouse– data descriptions, data structure– A business thesaurus; etc.

– Computer-based metadata for computer to use

10

19

Getting Data into the Data Warehouse (L.O.62)(Self study items)

• Extraction -- extract data• Cleansing -- to achieve consistent format,

valid values• Loading -- input data from multiple sources;

decide which set of data “rules”• Transformation

– Change data type of field from integer to character– Summarization -- aggregation

20

Example of Summarization (L.O.62)

Cats Dogs Fish Gerbils Adams 3 12 5 4 Baker 2 4 22 5 Carstens 6 2 3 16 Total 11 18 30 25

Four cells added for summaries by pet. Summary data

11

21

Why do Data Warehousing Projects Fail? (L.O.64)

(Reference: Anatomy of a Failure http://www.cio.com/archive/enterprise/111597_data.html)

22

Summary/ Review Questions for Data Warehouse

• What is a data warehouse (DW)? Why build it? What are its goals?

• DW handles data of different degrees of ‘granularity.’ Do you know what are the levels of data handled by DW?

• What is the relationship between transaction processing systems (and, its data) and a DW? How do their data requirements compare?

• Quite a few DW projects fail or finish without achieving its complete set of original objectives. What are the factors that explain such failures? What can be done to improve the chances of success of DW project?

12

23

Multi-Dimensional Databases

•This presentation uses some materials from: “An Introduction to Multidimensional Database Technology,” by Kenan Technologies.

24

Learning Objectives

• L.O.65 Explain what are Multidimensional Databases and why are they needed?

• L.O.66 Contrast MDD and Relational Databases• L.O.67 Provide examples of situations when is MDD

(In)appropriate?• L.O.68 Identify benefits of MDD• Appendix 2

– Examples of MDD Features such as Rotation, Ranging, Roll-Up and Drill-Down, Computations.

– Example of inner workings of a commercially available OLAP (On-Line Analytical Processing) tool.

13

25

Organization of Concepts

Purpose

Definition

Transaction Processing vs.Data Warehousing

Data Requirements

Data StructuresExample

Basic Features

Data MappingExample Architecture of an IBM

Data Warehouse

Architecture

Preparing Data for DW

Reasons for Failure

Data Warehouse(Data Integration)

Purpose

MDDB vs. Relational DatabaseRotationRangingRoll-UpDrill-DownComputations

Basic Features

Benefits

PowerPlay

Tools

Multidimension Databases(Enabling Technology)

Purpose

Applications

Availability BiasSequencial PatternClassifyingClusteringHybrid

Association

Functions

Emergent Applications

Data Mining(Relationship Discovery)

Large-Scale Data Management forOrganizational Decision-Making

26

What is a Multi-Dimensional Database? (L.O.65)

A multidimensional database (MDD) is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored, viewed and analyzed from different perspectives. These perspectives are called dimensions.

14

27

Why Multi-Dimensional Databases? (L.O.65)

• No single "best" data structure for all applications within an enterprise

• Inherent ability to integrate and analyze large volumes of enterprise data

• Offers a good conceptual fit with the way end-users visualize business data– Most business people already think about their

businesses in multidimensional terms– Managers tend to ask questions about product sales in

different markets over specific time periods

28

Contrasting Relational and Multi-Dimensional Models:Example 1 (L.O. 66)

SALES VOLUMES FOR GLEASON DEALERSHIP

MODEL COLOR SALES VOLUMEMINI VAN BLUE 6MINI VAN RED 5MINI VAN WHITE 4SPORTS COUPE BLUE 3SPORTS COUPE RED 5SPORTS COUPE WHITE 5SEDAN BLUE 4SEDAN RED 3SEDAN WHITE 2

15

29

COLOR

MODEL

Mini Van

Sedan

Coupe

Red WhiteBlue

6 5 4

3 5 5

4 3 2

Sales Volumes

30

Contrasting Relational Model and MDD-Example 2 (L.O.66)

S A L E S V O L U M E S F O R A L L D E A L E R S H IP S

M O D E L C O L O R D E A L E R SH IP V O L U M EM IN I V A N B L U E C L Y D E 6M IN I V A N B L U E G L E A SO N 6M IN I V A N B L U E C A R R 2M IN I V A N R E D C L Y D E 3M IN I V A N R E D G L E A SO N 5M IN I V A N R E D C A R R 5M IN I V A N W H IT E C L Y D E 2M IN I V A N W H IT E G L E A SO N 4M IN I V A N W H IT E C A R R 3SPO R T S C O U PE B L U E C L Y D E 2S P O R T S C O U P E B L U E G L E A S O N 3S P O R T S C O U P E B L U E C A R R 2SPO R T S C O U PE R E D C L Y D E 7S P O R T S C O U P E R E D G L E A S O N 5S P O R T S C O U P E R E D C A R R 2SPO R T S C O U PE W H IT E C L Y D E 4S P O R T S C O U P E W H IT E G L E A S O N 5S P O R T S C O U P E W H IT E C A R R 1SE D A N B L U E C L Y D E 6SE D A N B L U E G L E A SO N 4S E D A N B L U E C A R R 2SE D A N R E D C L Y D E 1S E D A N R E D G L E A S O N 3SE D A N R E D C A R R 4SE D A N W H IT E C L Y D E 2S E D A N W H IT E G L E A S O N 2SE D A N W H IT E C A R R 3

16

31

Multidimensional Representation

Sales Volumes

DEALERSHIP

Mini Van

Coupe

Sedan

Blue Red White

MODEL

ClydeGleason

Carr

COLOR

32

Viewing Data - An Example

DEALERSHIP

Sales Volumes

MODEL

COLOR

•Assume that each dimension has 10 positions, as shown in the cube above •How many records would be there in a relational table? •Implications for viewing data from an end-user standpoint?

17

33

Adding Dimensions- An Example

MODEL

Mini Van

Coupe

Sedan

Blue Red White

ClydeGleason

Carr

COLOR

Sales Volumes

Coupe

Sedan

Blue Red WhiteClyde

GleasonCarr

COLOR

DEALERSHIP

Mini Van

Coupe

Sedan

Blue Red White

ClydeGleason

Carr

COLOR

JANUARY FEBRUARY MARCH

Mini Van

34

Differences between MDD and Relational Databases (L.O.66)

• Multidimensional array structure represents a higher level of organization than the relational table

• Perspectives are embedded directly into the structure in the multidimensional model

• All possible combinations of perspectives containing a specific attribute (the color BLUE, for example) line up along the dimension position for that attribute.

• Perspective values are placed in fields in the relational model - tells us very little about the overall field contents.

18

35

Benefits of MDD (L.O. 68)

• Cognitive Advantages for the User• Ease of Data Presentation and Navigation

– Obtaining the same views in a relational world requires the end user to either write complex SQL queries or use an SQL generator against the relational database to convert the table outputs into a more intuitive format.

• Ease of Maintenance– Because data is stored in the same way as it is viewed (i.e. according

to its fundamental attributes), no additional overhead is required to translate user queries into requests for data

• Performance– Multidimensional databases achieve performance levels that are

difficult to match in a relational environment.

36

When is MDD (In)appropriate? (L.O.67)

• When interrelationships more meaningful than individual data elements themselves.

• The greater the number of inherent interrelationships between the elements of a dataset, the more likely it is that a study of those interrelationships will yield business information of value to the company.

• Highly interrelated dataset types be placed in a multidimensional data structure for greatest ease of access and analysis

19

37

When is MDD Inappropriate? (L.O.67)

PERSONNEL

LAST NAME EMPLOYEE# EMPLOYEE AGESMITH 01 21REGAN 12 19FOX 31 63WELD 14 31KELLY 54 27LINK 03 56KRANZ 41 45LUCUS 33 41WEISS 23 19

First, consider situation 1

Then try to set up an MDD structure for situation 1, with LAST NAMEand Employee# as dimensions, and AGE as the measurement.

38

When is MDD (In)appropriate? (L.O.67)

C O L O R

MODEL

M iin i V a n

S e d a n

C o u p e

R e d W h iteB lu e

6 5 4

3 5 5

4 3 2

S a le s V o lu m e s

E M P L O Y E E #

LAST

NAME

K r a n z

W e is s

L u c a s

4 1 3 33 1

4 5

1 9

E m p lo y e e A g e

4 1

3 1

5 6

6 3

2 1

1 9

S m ith

R e g a n

F o x

W e ld

K e l ly

L in k

0 1 1 4 5 4 0 3 1 22 3

2 7

Note the sparseness between the two MDD representations

MDD Structures for the Situations

20

39

Summary and Review Questions

• Why is Multi-dimension Data Base used?• How does a Multi-Dimension Data Base

(MDDB) differ from other database techniques such as relational database?

• When is an MDDB appropriate to be used?

• (DW+MDDB+Data Mining) What roles a DW, an MDDB, and Data-mining play for large scale data management? Can you comment on their interdependence?

40

Appendix 1– Architecture of a Data Warehouse: An Example from IBM– Data Mapping: An Example from IBM

21

41

CommonBusiness Data

Warehouse

AutomatedChart Facility

CommonBusiness

InformationWarehouse

Automated Extract Applications

Common Access Interface

Query Support Information Delivery

ExternalDatabases

Development Manufacturing Finance Service Personnel AutomatedCharts

Management Knowledge Workers Executives

From Loeb, K., Rai, A., Ramaprasad, A., and Sharma, S., “Design, Development, and Implementation of a Global Information Warehouse: A Case Study at IBM,” Information Systems Journal, 1998.

Information Deployment Architecture at IBM’s Network Systems Division

42

Data Mapping Strategy at IBM

• Business segment is defined as any functional area of IBM such as marketing, finance, manufacturing, etc.

• Business processes is any process used by any business segment.

• A business segment owner may own multiple business processes such as software and hardware development.

22

43

Data

DecisionSupport Tools

Information

Business Processes Business Segments

From Loeb, K., Rai, A., Ramaprasad, A., and Sharma, S., “Design, Development, and Implementation of a Global Information Warehouse: A Case Study at IBM,” Information Systems Journal, 1998.

Data Mapping Strategy at IBM (O.5)

44

Appendix 2

• Examples of MDD features• An OLAP tool

23

45

MDD Features - Rotation

Sales Volumes

COLOR

MODEL

Mini Van

Sedan

Coupe

Red WhiteBlue

6 5 4

3 5 5

4 3 2

MODEL

COLOR

SedanCoupe

Red

White

Blue 6 3 4

5 5 3

4 5 2( ROTATE 90 o )

View #1 View #2

Mini Van

•Also referred to as “data slicing.”•Each rotation yields a different slice or two dimensional tableof data.

46

MDD Features - Rotation

COLORCOLORMODEL

MODELDEALERSHIPDEALERSHIP

MODEL

M ini Van

Coupe

Sedan

Blue Red White

ClydeGleason

Carr

COLOR

Mini Van

Blue

Red

WhiteClyde

GleasonCarr

MODEL

Mini VanCoupe

Sedan

Blue

Red

White

Carr

COLOR

COLOR

DEALERSHIP

View #1 View #2 View #3

DEALERSHIP

Mini VanCoupe

SedanBlueRedWhite

Clyde

Gleason

Carr

M ini Van Coupe Sedan

BlueRed

WhiteClyde

Gleason

Carr M in i Van

Coupe

SedanBlue

RedWhite

Clyde Gleason Carr

View #4 View #5 View #6

DEALERSHIP

CoupeSedan

( ROTATE 90o ) ( ROTATE 90o ) ( ROTATE 90o )

COLOR MODEL

MODEL

DEALERSHIP( ROTATE 90

o ) ( ROTATE 90

o )

Gleason Clyde

Sales Volumes

24

47

MDD Features - Ranging

S a le s V o lu m e s

D E A L E R S H IP

M in i V a n

C o u p e

M e ta l B lu e

MODEL

C ly d eC a rr

C O L O R

N o rm a l B lu e

M in i V a n

C o u p e

N o rm a l B lu e

M e ta l B lu e

C ly d eC a rr

• The end user selects the desired positions along each dimension.• Also referred to as "data dicing." • The data is scoped down to a subset grouping

48

MDD Features - Roll-Ups & Drill Downs

Gary

Gleason Carr Levi Lucas Bolton

Midwest

St. LouisChicago

Clyde

REGION

DISTRICT

DEALERSHIP

ORGANIZATION DIMENSION

• The figure presents a definition of a hierarchy within the organization dimension.• Aggregations perceived as being part of the same dimension.•Moving up and moving down levels in a hierarchy is referred to as “roll-up” and “drill-down.”

25

49

Dealership

District

Region

Nation

Salesperson

Sales Team

Manager

Senior Manager

Vice PresidentYear

Quarter

Month

Week

Day

Product Family

Product Line

Product

Rolling-up and Drilling-down Through Multiple Dimensions. In the previous grapic, we saw that users can roll-up and drill-down through a single dimension, ORGANIZATION. Well designed multidimensionaldatabases also allow users to roll-up and drill down through multiple dimensions concurrently. Thus, in thisexample, an end-user could hold the positions MANAGER, DISTRICT and PRODUCT constant, while drilling-down or rolling-up through sales figures over the TIME dimension.

Personnel

Organization Products

Time

50

MDD Features:Drill-Down Through a Dimension

GaryGleason Carr Levi Lucas Bolton MidwestSt. Louis ChicagoClyde

REGION DISTRICT

DEALERSHIP

MODEL

COLORSales Volumes

26

51

MDD Features:Multidimensional Computations

• Well equipped to handle demanding mathematical functions.

• Can treat arrays like cells in spreadsheets. For example, in a budget analysis situation, one can divide the ACTUAL array by the BUDGET array to compute the VARIANCE array.

• Applications based on multidimensional database technology typically have one dimension defined as a "business measurements" dimension.

• Integrates computational tools very tightly with the database structure.

52

The Time Dimension

• TIME as a predefined hierarchy for rolling-up and drilling-down across days, weeks, months, years and special periods, such as fiscal years.– Eliminates the effort required to build sophisticated

hierarchies every time a database is set up.

– Extra performance advantages