Data Warehousing
2
Organization of Concepts
Purpose
Definition
Transaction Processing vs.Data Warehousing
Data Requirements
Data StructuresExample
Basic Features
Data MappingExample Architecture of an IBM
Data Warehouse
Architecture
Preparing Data for DW
Reasons for Failure
Data Warehouse(Data Integration)
Purpose
MDDB vs. Relational DatabaseRotationRangingRoll-UpDrill-DownComputations
Basic Features
Benefits
PowerPlay
Tools
Multidimension Databases(Enabling Technology)
Purpose
Applications
Availability BiasSequencial PatternClassifyingClusteringHybrid
Association
Functions
Emergent Applications
Data Mining(Relationship Discovery)
Large-Scale Data Management forOrganizational Decision-Making
2
3
Learning Objectives
59. Explain what a Data Warehouse (DW) is and the rationale for building one.
60. Describe four levels of data granularity handled by a DW.
61. Identify data requirements for a DW and the database technology suitable for such requirements.
62. Explain what metadata is and its importance for a DW.
63. Describe the process of getting data into a DW.
64. Identify common reasons for DW failure.
Appendix 1
– Architecture of a Data Warehouse: An Example from IBM
– Data Mapping: An Example from IBM
Definition: Data Warehouse (L.O.59)
• A data warehouse is a central source of data, stocked with data extracted from different operational systems and standardized.
– Data warehouses are hybrid information systems that bridge two broad classes of systems: storage and retrieval systems, and decision support systems.
– The objective is not just the storage and retrieval of data objects, but also the programmed and ad hoc transformation of data into information for a defined user base.
Illustration (L.O.59)
1. Some organizations, such as Grand Metropolitan PLC, consider their data warehouses essentially as front ends to existing production systems.
2. Other organizations, such as Owens-Corning Fiberglass Corp., view their data warehouses as collections of subject- or application-oriented databases that are kept separate from the operational applications.
Rationale for DW (Why?)
Reason #1: Data Resource Bottleneck (L.O.59)
• Knowledge workers need to gather and analyze business data to generate business information
• “Data in jail” syndrome
• Data quality problems, including:
– missing data
– unreliable data
– invalid data
– inappropriate granularity
– problematic semantics
Rationale for DW (Why?)
Reason #2: Conceptual Fit for End-User Use (L.O.59)
• Most organizational transaction systems are designed for capture and storage, and are not amenable to analysis
• A DW offers a good conceptual fit with the way end users visualize business data
– Most business people already think about their businesses in multidimensional terms
• Managers tend to ask questions about product sales in different markets over specific time periods
Reason #3: The New Technological Environment (L.O.59)
[Figure labels: HP, IBM, DEC, Compaq]
Enterprise Network Computing and Client/Server Technology are changing the way organizations look at all of their information systems
Reason #3 continues: Implications of Newer Technological Capabilities (L.O.59)
• Emergence of Tools
– Emergence of large-scale data management tools
– Emergence of several end-user oriented data exploration tools
• Support for increased flexibility and integration within environments
– Transaction environments
– Decision support environment
• Support for increased flexibility and integration between environments
– Improved interface between transactions and decision support.
Reason #4: Overall Goals of Data Warehousing (L.O.59)
• Performance (Canned queries, MD Analysis, Ad hoc, Min. Impact on Operational System)
• Flexibility (MD Flex, Ad hoc, Change data structure)
• Scalability (No. of Users, Volume of Data)
• Ease of Use (Location, Formulation, Navigation, Manipulation)
• Data Quality (Consistent, Correct, Timely, Integrated)
What types of data does a DW deal with?
Four Degrees of Data Granularity (L.O.60)
• Feed level:
1. Operational data from TPS - fundamental to DW data
• Ex. sales of XYZ at the end of today are 2345 units
• Storage level:
2. Atomic data - DW data - individual data items, mostly the lowest level of data in the DW. Ex. sales of XYZ at the end of Sept. are 2045 units
• … at the end of August 1743
• … at the end of July 1345, etc.
3. Summary data - DW data - these summaries are calculated ahead of time and stored for recall. Ex. At the end of the third quarter, sales of XYZ for the Southeastern region were 23,456; at the end of the second quarter, sales of XYZ…
4. Answers to specific questions - processed and stored on an individual user’s PC or workstation.
• How does the growth of XYZ sales in SE compare with NW?
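The step from atomic data to summary data to an answer can be sketched as follows. This is only an illustration: the monthly figures, the region codes, and the function names are assumptions invented for the example and are not taken from the slides.

```python
from collections import defaultdict

# HYPOTHETICAL atomic records: (month, region, units of product XYZ sold).
atomic = [
    ("2024-04", "SE", 900), ("2024-05", "SE", 950), ("2024-06", "SE", 1000),
    ("2024-07", "SE", 1100), ("2024-08", "SE", 1200), ("2024-09", "SE", 1300),
    ("2024-04", "NW", 800), ("2024-05", "NW", 800), ("2024-06", "NW", 800),
    ("2024-07", "NW", 800), ("2024-08", "NW", 850), ("2024-09", "NW", 870),
]

def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"

def summarize(records):
    """Summary data: totals per (quarter, region), computed ahead of time."""
    summary = defaultdict(int)
    for month, region, units in records:
        summary[(quarter_of(month), region)] += units
    return dict(summary)

def growth(summary, region, q_from, q_to):
    """Answer to a specific question: relative growth between two quarters."""
    before = summary[(q_from, region)]
    after = summary[(q_to, region)]
    return (after - before) / before

summary = summarize(atomic)
se_growth = growth(summary, "SE", "2024-Q2", "2024-Q3")
nw_growth = growth(summary, "NW", "2024-Q2", "2024-Q3")
print(f"SE growth: {se_growth:.1%}, NW growth: {nw_growth:.1%}")
```

The atomic list corresponds to level 2, the stored `summary` dictionary to level 3, and the two growth figures to level 4.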
DW data requirements (L.O.61)
• DW data requirements are different from those of traditional transaction processing (TP) data. For example:

Characteristics   TP data                      DW data
Volatility        Dynamic                      Static
Currency          Current                      Historical
Time dimension    Implicitly “now”             Explicit, visible
Granularity       Primitive, detailed          Detailed and derived summaries
Updates           Continuous, random           Periodic, scheduled
Tasks             Repetitive                   Unpredictable
Flexibility       Low                          High
Performance       High performance mandatory   Lower performance may be acceptable
Database Technology suitable for the above DW data requirements (L.O.61)
• Technologies based on hierarchical and network structures are not suitable
• Simple relational technology with some ‘fixing’ can do the job
• A new structure, which is quite unsuitable for TP data, has emerged for DW use: the “Multidimensional DB”
– Time, itself, is a dimension in MDDB
Database Technologies for DW continues...
• All DWs have two types of data
– Dimension data (also called Perspectives)
• Tens to a few million rows
• Textually described
• Frequently modified
– Fact data (also called Measurements)
• Millions or billions of rows
• Numeric
• Don’t change
Pet store example
• Multi-dimensional Database Structure

           Cats  Dogs  Fish  Gerbils
Adams         3    12     5        4
Baker         2     4    22        5
Carstens      6     2     3       16

Relational Equivalence (L.O.61 finishes here)

Salesperson  Pet Type  No. Sold
Adams        Cats             3
Adams        Dogs            12
Adams        Fish             5
Adams        Gerbils          4
Baker        Cats             2
Baker        Dogs             4
Baker        Fish            22
Baker        Gerbils          5
Carstens     Cats             6
Carstens     Dogs             2
Carstens     Fish             3
Carstens     Gerbils         16
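The equivalence between the two pet store layouts can be sketched in a few lines. The data comes from the tables above; the function names `to_array` and `to_rows` are made up for this illustration.

```python
# Relational form: one row per (salesperson, pet type) pair.
rows = [
    ("Adams", "Cats", 3), ("Adams", "Dogs", 12), ("Adams", "Fish", 5), ("Adams", "Gerbils", 4),
    ("Baker", "Cats", 2), ("Baker", "Dogs", 4), ("Baker", "Fish", 22), ("Baker", "Gerbils", 5),
    ("Carstens", "Cats", 6), ("Carstens", "Dogs", 2), ("Carstens", "Fish", 3), ("Carstens", "Gerbils", 16),
]

def to_array(rows):
    """Pivot relational rows into two dimension lists plus a 2-D array of cells."""
    salespeople = sorted({r[0] for r in rows})
    pets = sorted({r[1] for r in rows})
    cells = [[0] * len(pets) for _ in salespeople]
    for person, pet, sold in rows:
        cells[salespeople.index(person)][pets.index(pet)] = sold
    return salespeople, pets, cells

def to_rows(salespeople, pets, cells):
    """Unpivot the 2-D array back into relational rows."""
    return [(p, t, cells[i][j])
            for i, p in enumerate(salespeople)
            for j, t in enumerate(pets)]

salespeople, pets, cells = to_array(rows)
for person, line in zip(salespeople, cells):
    print(person, line)
```

In the array form the dimension values (salespeople, pet types) are stored once as positions; in the relational form they are repeated in every row.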
More on MDDB later in the lecture
Implications due to Metadata and Semantics (L.O.62)
– Human
• “Natural” assumptions are not always correct
• What to do when the same item (e.g. shoe size) is recorded differently: length vs. width
• Is ‘pay-rate’ the same as ‘salary’ in another database?
• Validation rules for common human items such as U.S. zip code
– Computer-based metadata for people to use
• create metadata warehouse
– data descriptions, data structure
– A business thesaurus; etc.
– Computer-based metadata for computer to use
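As a minimal sketch of metadata that a computer can use, the snippet below combines a small business thesaurus with a validation rule. The field names in the thesaurus are hypothetical; the ZIP rule assumes the usual five-digit form with an optional four-digit extension.

```python
import re

# HYPOTHETICAL business thesaurus: source-system field names mapped to one
# warehouse term. 'pay_rate' and 'salary' are deliberately kept apart,
# because they are not the same measure.
THESAURUS = {
    "pay_rate": "hourly_pay_rate",
    "payrate": "hourly_pay_rate",
    "salary": "annual_salary",
}

# Validation rule for U.S. ZIP codes: five digits, optionally followed by
# a hyphen and four more digits (ZIP+4).
ZIP_RULE = re.compile(r"\d{5}(-\d{4})?")

def canonical_name(field):
    """Look up the warehouse term for a source field name."""
    key = field.lower()
    return THESAURUS.get(key, key)

def is_valid_zip(value):
    """Apply the ZIP validation rule to a string value."""
    return ZIP_RULE.fullmatch(value) is not None

print(canonical_name("PayRate"), is_valid_zip("30303-1234"))
```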
Getting Data into the Data Warehouse (L.O.63)
(Self study items)
• Extraction -- extract data
• Cleansing -- to achieve consistent format, valid values
• Loading -- input data from multiple sources; decide which set of data “rules”
• Transformation
– Change data type of field from integer to character
– Summarization -- aggregation
Example of Summarization (L.O.63)

           Cats  Dogs  Fish  Gerbils
Adams         3    12     5        4
Baker         2     4    22        5
Carstens      6     2     3       16
Total        11    18    30       25

The four cells in the Total row are added as summaries by pet (summary data).
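The Total row can be reproduced by summing each pet column of the pet store table. The variable names are chosen for this sketch only.

```python
pets = ["Cats", "Dogs", "Fish", "Gerbils"]
sales = {
    "Adams":    [3, 12, 5, 4],
    "Baker":    [2, 4, 22, 5],
    "Carstens": [6, 2, 3, 16],
}

# zip(*rows) walks the table column by column, so each sum is a per-pet total.
totals = [sum(column) for column in zip(*sales.values())]
summary = dict(zip(pets, totals))
print(summary)
```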
Why do Data Warehousing Projects Fail? (L.O.64)
(Reference: Anatomy of a Failure http://www.cio.com/archive/enterprise/111597_data.html)
Summary/ Review Questions for Data Warehouse
• What is a data warehouse (DW)? Why build it? What are its goals?
• A DW handles data of different degrees of ‘granularity.’ What are the levels of data handled by a DW?
• What is the relationship between transaction processing systems (and their data) and a DW? How do their data requirements compare?
• Quite a few DW projects fail or finish without achieving their complete set of original objectives. What factors explain such failures? What can be done to improve the chances of success of a DW project?
Multi-Dimensional Databases
•This presentation uses some materials from: “An Introduction to Multidimensional Database Technology,” by Kenan Technologies.
Learning Objectives
• L.O.65 Explain what Multidimensional Databases are and why they are needed
• L.O.66 Contrast MDD and Relational Databases
• L.O.67 Provide examples of situations when MDD is (in)appropriate
• L.O.68 Identify benefits of MDD
• Appendix 2
– Examples of MDD Features such as Rotation, Ranging, Roll-Up and Drill-Down, Computations.
– Example of inner workings of a commercially available OLAP (On-Line Analytical Processing) tool.
What is a Multi-Dimensional Database? (L.O.65)
A multidimensional database (MDD) is a computer software system designed to allow for the efficient and convenient storage and retrieval of large volumes of data that is (1) intimately related and (2) stored, viewed and analyzed from different perspectives. These perspectives are called dimensions.
Why Multi-Dimensional Databases? (L.O.65)
• No single "best" data structure for all applications within an enterprise
• Inherent ability to integrate and analyze large volumes of enterprise data
• Offers a good conceptual fit with the way end users visualize business data
– Most business people already think about their businesses in multidimensional terms
– Managers tend to ask questions about product sales in different markets over specific time periods
Contrasting Relational and Multi-Dimensional Models:Example 1 (L.O. 66)
SALES VOLUMES FOR GLEASON DEALERSHIP

MODEL         COLOR  SALES VOLUME
MINI VAN      BLUE              6
MINI VAN      RED               5
MINI VAN      WHITE             4
SPORTS COUPE  BLUE              3
SPORTS COUPE  RED               5
SPORTS COUPE  WHITE             5
SEDAN         BLUE              4
SEDAN         RED               3
SEDAN         WHITE             2

Sales Volumes as a two-dimensional array (MODEL by COLOR)

           Blue  Red  White
Mini Van      6    5      4
Coupe         3    5      5
Sedan         4    3      2
Contrasting Relational Model and MDD-Example 2 (L.O.66)
SALES VOLUMES FOR ALL DEALERSHIPS

MODEL         COLOR  DEALERSHIP  VOLUME
MINI VAN      BLUE   CLYDE            6
MINI VAN      BLUE   GLEASON          6
MINI VAN      BLUE   CARR             2
MINI VAN      RED    CLYDE            3
MINI VAN      RED    GLEASON          5
MINI VAN      RED    CARR             5
MINI VAN      WHITE  CLYDE            2
MINI VAN      WHITE  GLEASON          4
MINI VAN      WHITE  CARR             3
SPORTS COUPE  BLUE   CLYDE            2
SPORTS COUPE  BLUE   GLEASON          3
SPORTS COUPE  BLUE   CARR             2
SPORTS COUPE  RED    CLYDE            7
SPORTS COUPE  RED    GLEASON          5
SPORTS COUPE  RED    CARR             2
SPORTS COUPE  WHITE  CLYDE            4
SPORTS COUPE  WHITE  GLEASON          5
SPORTS COUPE  WHITE  CARR             1
SEDAN         BLUE   CLYDE            6
SEDAN         BLUE   GLEASON          4
SEDAN         BLUE   CARR             2
SEDAN         RED    CLYDE            1
SEDAN         RED    GLEASON          3
SEDAN         RED    CARR             4
SEDAN         WHITE  CLYDE            2
SEDAN         WHITE  GLEASON          2
SEDAN         WHITE  CARR             3
Multidimensional Representation
[Figure: Sales Volumes cube with three dimensions: MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Clyde, Gleason, Carr)]
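One way to picture the cube is as a three-level nested array indexed by model, color, and dealership. The sketch below builds it from the 27 volumes of Example 2; the helper names `cell` and `slice_by_dealer` are invented for the illustration and do not describe any particular MDD product.

```python
MODELS = ["MINI VAN", "SPORTS COUPE", "SEDAN"]
COLORS = ["BLUE", "RED", "WHITE"]
DEALERS = ["CLYDE", "GLEASON", "CARR"]

# Volumes in the row order of the relational table:
# model varies slowest, then color, then dealership.
VOLUMES = [6, 6, 2, 3, 5, 5, 2, 4, 3,
           2, 3, 2, 7, 5, 2, 4, 5, 1,
           6, 4, 2, 1, 3, 4, 2, 2, 3]

# cube[m][c][d] holds the volume for MODELS[m], COLORS[c], DEALERS[d].
cube = [[[VOLUMES[(m * 3 + c) * 3 + d] for d in range(3)]
         for c in range(3)]
        for m in range(3)]

def cell(model, color, dealer):
    """Direct lookup by position along each dimension."""
    return cube[MODELS.index(model)][COLORS.index(color)][DEALERS.index(dealer)]

def slice_by_dealer(dealer):
    """Fix one dealership and return the MODEL by COLOR table for it."""
    d = DEALERS.index(dealer)
    return [[cube[m][c][d] for c in range(3)] for m in range(3)]

print(cell("SPORTS COUPE", "RED", "CLYDE"))
print(slice_by_dealer("GLEASON"))
```

Fixing the dealership at Gleason gives back the two-dimensional table of Example 1.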
Viewing Data - An Example
[Figure: Sales Volumes cube with dimensions MODEL, COLOR, and DEALERSHIP]
• Assume that each dimension has 10 positions, as shown in the cube above
• How many records would there be in a relational table?
• Implications for viewing data from an end-user standpoint?
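A quick count for the first question, assuming the cube is fully populated (every combination of positions has a sales volume); a sparser cube would need fewer relational rows.

```python
from itertools import product

POSITIONS = 10  # positions per dimension, as assumed on the slide

# One relational record per (model, color, dealership) combination.
records = list(product(range(POSITIONS), repeat=3))
n_records = len(records)
print(n_records)
```

Under that assumption the relational table holds 10 x 10 x 10 records, which an end user would have to scan or query, whereas the cube presents them as ten 10-by-10 slices along any chosen dimension.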
Adding Dimensions- An Example
[Figure: three Sales Volumes cubes, each with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Clyde, Gleason, Carr), one cube each for JANUARY, FEBRUARY, and MARCH]
Differences between MDD and Relational Databases (L.O.66)
• Multidimensional array structure represents a higher level of organization than the relational table
• Perspectives are embedded directly into the structure in the multidimensional model
• All possible combinations of perspectives containing a specific attribute (the color BLUE, for example) line up along the dimension position for that attribute.
• In the relational model, perspective values are placed in fields, which tells us very little about the overall field contents.
Benefits of MDD (L.O. 68)
• Cognitive Advantages for the User
• Ease of Data Presentation and Navigation
– Obtaining the same views in a relational world requires the end user to either write complex SQL queries or use an SQL generator against the relational database to convert the table outputs into a more intuitive format.
• Ease of Maintenance
– Because data is stored in the same way as it is viewed (i.e. according to its fundamental attributes), no additional overhead is required to translate user queries into requests for data
• Performance
– Multidimensional databases achieve performance levels that are difficult to match in a relational environment.
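To give a feel for the SQL an end user would need in order to get a MODEL by COLOR view out of a relational table, here is a sketch using Python's built-in `sqlite3` module and the Gleason figures from Example 1. The table and column names are assumptions; the conditional-aggregation query is one common way to write a cross-tabulation, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, volume INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("MINI VAN", "BLUE", 6), ("MINI VAN", "RED", 5), ("MINI VAN", "WHITE", 4),
     ("SPORTS COUPE", "BLUE", 3), ("SPORTS COUPE", "RED", 5), ("SPORTS COUPE", "WHITE", 5),
     ("SEDAN", "BLUE", 4), ("SEDAN", "RED", 3), ("SEDAN", "WHITE", 2)],
)

# One CASE expression per color: every new color means editing the query.
PIVOT_SQL = """
SELECT model,
       SUM(CASE WHEN color = 'BLUE'  THEN volume ELSE 0 END) AS blue,
       SUM(CASE WHEN color = 'RED'   THEN volume ELSE 0 END) AS red,
       SUM(CASE WHEN color = 'WHITE' THEN volume ELSE 0 END) AS white
FROM sales
GROUP BY model
ORDER BY model
"""

crosstab = conn.execute(PIVOT_SQL).fetchall()
for row in crosstab:
    print(row)
```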
When is MDD (In)appropriate? (L.O.67)
• When interrelationships are more meaningful than the individual data elements themselves.
• The greater the number of inherent interrelationships between the elements of a dataset, the more likely it is that a study of those interrelationships will yield business information of value to the company.
• Highly interrelated dataset types should be placed in a multidimensional data structure for greatest ease of access and analysis
When is MDD Inappropriate? (L.O.67)
PERSONNEL

LAST NAME  EMPLOYEE#  EMPLOYEE AGE
SMITH      01         21
REGAN      12         19
FOX        31         63
WELD       14         31
KELLY      54         27
LINK       03         56
KRANZ      41         45
LUCAS      33         41
WEISS      23         19

First, consider situation 1
Then try to set up an MDD structure for situation 1, with LAST NAME and EMPLOYEE# as dimensions, and AGE as the measurement.
When is MDD (In)appropriate? (L.O.67)
MDD Structures for the Situations
[Figure: two MDD representations side by side. Left: Sales Volumes, MODEL by COLOR, with all 9 cells filled. Right: Employee Age, LAST NAME by EMPLOYEE#, a 9 by 9 array in which only 9 of the 81 cells are filled.]
Note the difference in sparseness between the two MDD representations
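The difference in sparseness can be quantified as the fraction of array cells that actually hold a value. This is a sketch; the `density` helper is introduced here for illustration, and the data is taken from the two tables in this lecture.

```python
def density(rows):
    """Fraction of cells filled when the first two columns are the dimensions."""
    dim1 = {r[0] for r in rows}
    dim2 = {r[1] for r in rows}
    filled = {(r[0], r[1]) for r in rows}
    return len(filled) / (len(dim1) * len(dim2))

sales = [
    ("MINI VAN", "BLUE", 6), ("MINI VAN", "RED", 5), ("MINI VAN", "WHITE", 4),
    ("SPORTS COUPE", "BLUE", 3), ("SPORTS COUPE", "RED", 5), ("SPORTS COUPE", "WHITE", 5),
    ("SEDAN", "BLUE", 4), ("SEDAN", "RED", 3), ("SEDAN", "WHITE", 2),
]

personnel = [
    ("SMITH", "01", 21), ("REGAN", "12", 19), ("FOX", "31", 63),
    ("WELD", "14", 31), ("KELLY", "54", 27), ("LINK", "03", 56),
    ("KRANZ", "41", 45), ("LUCAS", "33", 41), ("WEISS", "23", 19),
]

print(f"sales density:     {density(sales):.2f}")
print(f"personnel density: {density(personnel):.2f}")
```

Every MODEL and COLOR combination has a value, so the sales array is fully dense. Each employee number belongs to exactly one last name, so the personnel array is almost entirely empty, which is why an MDD structure is a poor fit there.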
Summary and Review Questions
• Why are multidimensional databases used?
• How does a multidimensional database (MDDB) differ from other database techniques such as relational databases?
• When is an MDDB appropriate?
• (DW+MDDB+Data Mining) What roles do a DW, an MDDB, and data mining play in large-scale data management? Can you comment on their interdependence?
Appendix 1
– Architecture of a Data Warehouse: An Example from IBM
– Data Mapping: An Example from IBM
Information Deployment Architecture at IBM’s Network Systems Division
[Figure labels: Development, Manufacturing, Finance, Service, Personnel, Automated Charts; External Databases; Automated Extract Applications; Common Business Data Warehouse; Automated Chart Facility; Common Business Information Warehouse; Common Access Interface; Query Support; Information Delivery; Management, Knowledge Workers, Executives]
From Loeb, K., Rai, A., Ramaprasad, A., and Sharma, S., “Design, Development, and Implementation of a Global Information Warehouse: A Case Study at IBM,” Information Systems Journal, 1998.
Data Mapping Strategy at IBM
• A business segment is defined as any functional area of IBM such as marketing, finance, manufacturing, etc.
• A business process is any process used by any business segment.
• A business segment owner may own multiple business processes such as software and hardware development.
Data Mapping Strategy at IBM (O.5)
[Figure labels: Data, Decision Support Tools, Information, Business Processes, Business Segments]
From Loeb, K., Rai, A., Ramaprasad, A., and Sharma, S., “Design, Development, and Implementation of a Global Information Warehouse: A Case Study at IBM,” Information Systems Journal, 1998.
Appendix 2
• Examples of MDD features
• An OLAP tool
MDD Features - Rotation
Sales Volumes

View #1 (MODEL by COLOR)
           Blue  Red  White
Mini Van      6    5      4
Coupe         3    5      5
Sedan         4    3      2

( ROTATE 90° )

View #2 (COLOR by MODEL)
           Mini Van  Coupe  Sedan
Blue              6      3      4
Red               5      5      3
White             4      5      2

• Also referred to as “data slicing.”
• Each rotation yields a different slice or two-dimensional table of data.
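For a two-dimensional view, rotation amounts to swapping the row and column dimensions, which is a transpose. A minimal sketch using the Sales Volumes figures (the function name `rotate` is chosen for this example):

```python
# View #1: rows are MODEL (Mini Van, Coupe, Sedan), columns are COLOR (Blue, Red, White).
view1 = [
    [6, 5, 4],
    [3, 5, 5],
    [4, 3, 2],
]

def rotate(table):
    """Swap the row and column dimensions of a 2-D view."""
    return [list(column) for column in zip(*table)]

# View #2: rows are COLOR, columns are MODEL.
view2 = rotate(view1)
for row in view2:
    print(row)
```

No data is rearranged in storage; only the assignment of dimensions to rows and columns changes, and rotating twice returns the original view.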
MDD Features - Rotation
[Figure: six views (View #1 to View #6) of the Sales Volumes cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR (Blue, Red, White), and DEALERSHIP (Clyde, Gleason, Carr). Each view is reached from the previous one by a 90° rotation that changes which dimensions face the viewer.]
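The six views correspond to the six ways of ordering three dimensions. The sketch below simply enumerates them; which ordering matches which numbered view in the figure is not recoverable from the slide, so the pairing with row, column, and depth axes is an assumption.

```python
from itertools import permutations

DIMENSIONS = ("MODEL", "COLOR", "DEALERSHIP")

# Each ordering assigns the dimensions to (row axis, column axis, depth axis).
views = list(permutations(DIMENSIONS))
for number, (rows, cols, depth) in enumerate(views, start=1):
    print(f"ordering {number}: rows={rows}, columns={cols}, depth={depth}")
```

With n dimensions there are n! such orderings, so the number of available views grows quickly as dimensions are added.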
MDD Features - Ranging
[Figure: the Sales Volumes cube ranged down to MODEL (Mini Van, Coupe), COLOR (Normal Blue, Metal Blue), and DEALERSHIP (Clyde, Carr)]
• The end user selects the desired positions along each dimension.
• Also referred to as "data dicing."
• The data is scoped down to a subset grouping
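Ranging can be sketched as a filter that keeps only the selected positions along each dimension. The cube below reuses the Blue and Red volumes from Example 2 rather than the Normal Blue and Metal Blue positions of the figure, for which the slides give no numbers; the `dice` function is a name invented for this sketch.

```python
# (model, color, dealership) -> sales volume, taken from Example 2.
cube = {
    ("Mini Van", "Blue", "Clyde"): 6, ("Mini Van", "Blue", "Gleason"): 6, ("Mini Van", "Blue", "Carr"): 2,
    ("Mini Van", "Red", "Clyde"): 3, ("Mini Van", "Red", "Gleason"): 5, ("Mini Van", "Red", "Carr"): 5,
    ("Coupe", "Blue", "Clyde"): 2, ("Coupe", "Blue", "Gleason"): 3, ("Coupe", "Blue", "Carr"): 2,
    ("Coupe", "Red", "Clyde"): 7, ("Coupe", "Red", "Gleason"): 5, ("Coupe", "Red", "Carr"): 2,
    ("Sedan", "Blue", "Clyde"): 6, ("Sedan", "Blue", "Gleason"): 4, ("Sedan", "Blue", "Carr"): 2,
    ("Sedan", "Red", "Clyde"): 1, ("Sedan", "Red", "Gleason"): 3, ("Sedan", "Red", "Carr"): 4,
}

def dice(cube, models, colors, dealers):
    """Keep only the cells whose position is selected on every dimension."""
    return {key: value for key, value in cube.items()
            if key[0] in models and key[1] in colors and key[2] in dealers}

subset = dice(cube,
              models={"Mini Van", "Coupe"},
              colors={"Blue"},
              dealers={"Clyde", "Carr"})
print(subset)
```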
MDD Features - Roll-Ups & Drill Downs
[Figure: hierarchy within the ORGANIZATION DIMENSION. REGION: Midwest. DISTRICT: Chicago, St. Louis, Gary. DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton.]
• The figure presents a definition of a hierarchy within the organization dimension.
• Aggregations are perceived as being part of the same dimension.
• Moving up and moving down levels in a hierarchy is referred to as “roll-up” and “drill-down.”
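Roll-up and drill-down along such a hierarchy can be sketched with two parent lookups. Important assumptions: the slide does not say which dealership belongs to which district, so `DISTRICT_OF` is invented; the volumes for Clyde, Gleason, and Carr are their totals from Example 2, while those for Levi, Lucas, and Bolton are made up.

```python
# ASSUMED assignment of dealerships to districts (not given in the slides).
DISTRICT_OF = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis",
               "Lucas": "Gary", "Bolton": "Gary"}
REGION_OF = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Clyde, Gleason, Carr: totals from Example 2. Levi, Lucas, Bolton: hypothetical.
sales_by_dealership = {"Clyde": 33, "Gleason": 37, "Carr": 24,
                       "Levi": 10, "Lucas": 20, "Bolton": 15}

def roll_up(values, parent_of):
    """Aggregate each child's value into its parent one level up."""
    totals = {}
    for child, value in values.items():
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return totals

def drill_down(values, parent_of, parent):
    """Return the child-level values that sit under one parent."""
    return {child: value for child, value in values.items()
            if parent_of[child] == parent}

by_district = roll_up(sales_by_dealership, DISTRICT_OF)
by_region = roll_up(by_district, REGION_OF)
print(by_district, by_region)
print(drill_down(sales_by_dealership, DISTRICT_OF, "Chicago"))
```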
Rolling-up and Drilling-down Through Multiple Dimensions
[Figure: hierarchies within four dimensions.
Organization: Dealership, District, Region, Nation
Personnel: Salesperson, Sales Team, Manager, Senior Manager, Vice President
Time: Day, Week, Month, Quarter, Year
Products: Product, Product Line, Product Family]
In the previous graphic, we saw that users can roll up and drill down through a single dimension, ORGANIZATION. Well-designed multidimensional databases also allow users to roll up and drill down through multiple dimensions concurrently. Thus, in this example, an end user could hold the positions MANAGER, DISTRICT and PRODUCT constant, while drilling down or rolling up through sales figures over the TIME dimension.
MDD Features:Drill-Down Through a Dimension
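[Figure: drilling down through the ORGANIZATION dimension (REGION: Midwest; DISTRICT: Chicago, St. Louis, Gary; DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton) of a Sales Volumes cube whose other dimensions are MODEL and COLOR]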
MDD Features:Multidimensional Computations
• Well equipped to handle demanding mathematical functions.
• Can treat arrays like cells in spreadsheets. For example, in a budget analysis situation, one can divide the ACTUAL array by the BUDGET array to compute the VARIANCE array.
• Applications based on multidimensional database technology typically have one dimension defined as a "business measurements" dimension.
• Integrates computational tools very tightly with the database structure.
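The budget example, dividing the ACTUAL array by the BUDGET array cell by cell, can be sketched with plain nested lists. The numbers and the two-by-two shape are hypothetical; the slide names only the three arrays.

```python
# HYPOTHETICAL figures: rows could be products, columns could be months.
actual = [[110, 95],
          [200, 180]]
budget = [[100, 100],
          [160, 200]]

def divide(a, b):
    """Element-wise division of two arrays with the same shape."""
    return [[x / y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

# Each cell is actual as a fraction of budget; values above 1 mean over budget.
variance = divide(actual, budget)
for row in variance:
    print([round(v, 2) for v in row])
```

The whole array is treated like a single spreadsheet cell: one expression yields every VARIANCE value without looping over individual records in the query.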
The Time Dimension
• TIME as a predefined hierarchy for rolling-up and drilling-down across days, weeks, months, years and special periods, such as fiscal years.
– Eliminates the effort required to build sophisticated hierarchies every time a database is set up.
– Extra performance advantages
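A predefined time hierarchy can be sketched as a function that places any day at every level at once. The labels and the fiscal-year convention (a fiscal year starting in July and named after the calendar year in which it ends) are assumptions made for this example.

```python
from datetime import date

def time_positions(day, fiscal_year_start_month=7):
    """Return the position of a day at each level of an assumed TIME hierarchy."""
    iso_year, iso_week, _ = day.isocalendar()
    # ASSUMPTION: the fiscal year is named after the calendar year in which it ends.
    fiscal_year = day.year + 1 if day.month >= fiscal_year_start_month else day.year
    return {
        "day": day.isoformat(),
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": f"{day.year}-{day.month:02d}",
        "quarter": f"{day.year}-Q{(day.month - 1) // 3 + 1}",
        "year": str(day.year),
        "fiscal_year": f"FY{fiscal_year}",
    }

positions = time_positions(date(2024, 9, 30))
print(positions)
```

Because the hierarchy is computed from the date itself, it does not have to be rebuilt for each new database, which is the saving the slide refers to.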