
The Data Warehouse (DW) and Business Intelligence (BI) 9.1

COT5230 Data Mining

Week 9

The Data Warehouse (DW) and Business Intelligence (BI)

MONASH: AUSTRALIA’S INTERNATIONAL UNIVERSITY


Lecture Outline

Overview of Data Warehousing

Data Warehouse Architecture

Overview of Business Intelligence (BI)

OLAP


What is a DW?

A data store to support data analysis or decision support

– Decision support: a methodology to extract information from data

– Decision support system: an arrangement of computerized tools to assist in managerial decision making

Answers questions by combining historical operational data with a business data model that reflects business activity

Data may come from both operational and external sources

– external data - e.g. industry average salaries


Data Warehouse Definitions - 1

The information in a DW is subject-oriented, non-volatile, and of an historic nature, and so DWs tend to contain extremely large datasets

The purpose of the DW is to provide the tools and facilities to manage and deliver complete, timely, accurate, and understandable business information to authorized individuals for effective business decision making

DW implementation needs a company-wide effort that requires user involvement and commitment at all levels

A successful DW implementation tracks return on investment


Data Warehouse Definitions - 2

A DW is a concept, not a product

– It is the compiling, assembling, and consolidating of application data common to user communities at a single logical point

Typical use includes ad hoc queries, “what if”, data matching, trend analysis and other sophisticated information functions

Warehouse data is typically extracted from OLTP systems

A DW can be described as a read-only database that provides users with access to consolidated, historic, or static data extracted from operational databases, usually augmented with external data


Operational Data vs. the DW - 1

Integration

– Data found within the DW is ALWAYS integrated, e.g. encoding, measurements of attributes, etc. are standardized

Normalized vs. denormalized

– Operational data is normalized

– DW data is denormalized

Timespan

– Operational data is current

– DW data is historical

Granularity

– Operational data is at transaction level

– DW data is at an aggregation level
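
The integration and granularity points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems, the gender encodings and the record layout are hypothetical and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical encodings used by different operational systems for one attribute.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def standardize(record):
    """Return a copy of the record with gender mapped to one warehouse-wide code."""
    cleaned = dict(record)
    cleaned["gender"] = GENDER_CODES[str(record["gender"]).strip().lower()]
    return cleaned

def aggregate_by_customer(transactions):
    """Roll transaction-level amounts up to one total per customer."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["cust"]] += t["amount"]
    return dict(totals)

# Two assumed operational sources that encode the same attribute differently.
source_a = [{"cust": "C100", "gender": "male", "amount": 510.0}]
source_b = [{"cust": "C100", "gender": 1, "amount": 490.0},
            {"cust": "C200", "gender": "F", "amount": 75.0}]

integrated = [standardize(r) for r in source_a + source_b]
totals = aggregate_by_customer(integrated)
```

After integration every record carries the same code for the same meaning, and the aggregate holds one row per customer instead of one row per transaction.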


Operational Data vs. the DW - 2

Dimensionality

– data is clustered according to functional requirements, e.g. all orders to be delivered to a particular suburb

– data analyst requires access to all dimensions

Use

– DW is read only


MIS, or Before the DW

MIS: Management Information System

required detailed knowledge of the operational systems

no Business Information Directory

data quality is ad hoc

limited data integration from source systems

integration and querying performed by MIS specialists using 3+GL tools such as SAS

or at best performing queries using SQL against images of unintegrated operational databases


Inmon’s 12 Rules - 1

DW and operational environments are separated

Integrated DW data

DW contains historical data

DW is snapshot data captured at a particular point in time

DW data is subject-oriented


Inmon’s 12 Rules - 2

No online update

DW SDLC is data-driven

DW contains several levels of data - raw to summarized

Data sources are traced

Meta-data is a critical component

DW contains a charge back mechanism


DW Architecture

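[Figure: DW architecture. Data moves from the authoritative sources through the Extract / Enhance / Transform layer into the data warehouse, then into datamarts and on to the user. The figure also shows a Load step and a Business Information Directory. Labels recovered from the figure, grouped by stage:]

Authoritative Source: source systems; external systems

Extract / Enhance / Transform Layer: copy management, extract, transform; process once; business rules; consistency and controls; value add

Data Warehouse: enterprise single-image data view; separates data from application; fully modelled and documented

DataMarts: build data for appropriate datamart; parallel process; denormalize for specific use; customise; meets specific OLAP requirements

Delivery to user: industry-standard tools; tailored applications where appropriate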


Source Systems/Authoritative Source

must first identify authoritative source data

Authoritative Source

– atomic data from the creating/owning source system

data propagation must be subject to a delivery contract

data propagation is asynchronous

– no reverse propagation

– no periodic synchronization

delivery must have minimal impact on operational systems


Extract/Enhance/Transform Layer

must create integrated and standardized data

deduping process happens here

denormalize into a format for direct loading into the DW

cleanse

– must remove semantic and syntactic inconsistencies

– return invalid data to the source system for repair

requires a data quality process

simple business transformations

addition of surrogate keys and time variance
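
A minimal sketch of this layer, assuming a hypothetical customer extract: the field names (`cust_no`, `state`, `cust_sk`, `load_date`) and the list of valid states are illustrative assumptions, not part of the lecture. It dedupes on the natural key, sets invalid rows aside for return to the source system, and adds a surrogate key plus a load date to give the data its time variance.

```python
from datetime import date
from itertools import count

# Assumed reference data used to detect invalid rows.
VALID_STATES = {"VIC", "TAS", "NSW"}

def dedupe(rows, key):
    """Keep the first row seen for each natural-key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def split_valid(rows):
    """Separate loadable rows from rows to return to the source system."""
    valid = [r for r in rows if r["state"] in VALID_STATES]
    rejected = [r for r in rows if r["state"] not in VALID_STATES]
    return valid, rejected

def add_keys(rows, load_date):
    """Attach a surrogate key and a load date to each row."""
    next_key = count(1)
    return [dict(r, cust_sk=next(next_key), load_date=load_date) for r in rows]

extracted = [
    {"cust_no": "C100", "state": "VIC"},
    {"cust_no": "C100", "state": "VIC"},   # duplicate of the first row
    {"cust_no": "C200", "state": "??"},    # invalid state code
    {"cust_no": "C300", "state": "TAS"},
]

valid, rejected = split_valid(dedupe(extracted, "cust_no"))
loaded = add_keys(valid, date(1999, 12, 1))
```

The surrogate key here is a simple counter; a production load would normally draw it from a persistent sequence so that keys stay stable across runs.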


Handling Inserts/Deltas - 1

Scenarios

– additions to a (1) New or (2) Existing partition

– partitions are (1) Atomic or (2) Aggregates

New partition - atomic or aggregate

– work off-line

– do summation outside of the database using efficient tools, e.g. Syncsort or C

– then load with SQL*Loader


Handling Inserts/Deltas - 2

Updates to an existing partition

– Atomic Partition

  » Unload, Sort, Reload or

  » Insert directly into DB - concurrency issues

– Aggregate Partition

  [Figure: rows R1 (X, 1) and R2 (X, 2) are stored in the database as the aggregate (X, 3); a new row R3 (X, 1) then arrives]

  » Update directly to DW

  » Unload and update out of the database

  » Keep source data and re-sort and re-sum
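
The aggregate-partition case can be sketched with the R1/R2/R3 example from the slide: the stored aggregate for X is 3, and a new row adds 1. The function names and the tuple layout below are assumptions made for illustration; the code contrasts updating the stored aggregate directly with keeping the source rows and re-summing.

```python
def update_in_place(aggregates, delta_rows):
    """Add each delta row to the stored aggregate (update directly)."""
    updated = dict(aggregates)
    for key, value in delta_rows:
        updated[key] = updated.get(key, 0) + value
    return updated

def resum_from_source(source_rows, delta_rows):
    """Keep the source rows, append the deltas, then sort and re-sum."""
    totals = {}
    for key, value in sorted(source_rows + delta_rows):
        totals[key] = totals.get(key, 0) + value
    return totals

source = [("X", 1), ("X", 2)]   # R1 and R2
stored = {"X": 3}               # aggregate held in the database
delta = [("X", 1)]              # R3, the newly arrived row

direct = update_in_place(stored, delta)
rebuilt = resum_from_source(source, delta)
```

Both routes give the same total. The trade-off is that the direct update touches the live database, while the re-sum needs the source rows to be retained.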


The Data Warehouse

contains atomic data

Star Schema structure

– contains

  » Facts

  » Dimensions

  » Attributes - Surrogate keys

  » Attribute Hierarchies

Key Issues

– size

– data retention period - YTD

– backup and recovery

– security


Star Schemas

a data modeling technique used to map decision support data into a relational database

this structure is based on the premise that highly normalized data structures do not serve advanced data analysis requirements well

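[Figure: star schema. The SALES fact table sits at the centre and is linked to four dimension tables by their keys:]

DimA Customer (Cust#)
DimB Product (Prod#)
DimC Salesrep (SalesrepID)
DimD Location (Loc#)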
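
As a sketch of how such a star might be laid out in a relational database, the following uses Python's built-in `sqlite3` module. Only the SALES fact table and the four dimensions come from the slide; the column names, the `amount` measure and the sample rows are assumptions.

```python
import sqlite3

# Table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_no TEXT PRIMARY KEY, cust_name TEXT);
CREATE TABLE product  (prod_no TEXT PRIMARY KEY, prod_name TEXT);
CREATE TABLE salesrep (salesrep_id TEXT PRIMARY KEY, rep_name TEXT);
CREATE TABLE location (loc_no TEXT PRIMARY KEY, state TEXT);

-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales (
    cust_no     TEXT REFERENCES customer(cust_no),
    prod_no     TEXT REFERENCES product(prod_no),
    salesrep_id TEXT REFERENCES salesrep(salesrep_id),
    loc_no      TEXT REFERENCES location(loc_no),
    amount      REAL
);

INSERT INTO customer VALUES ('C100', 'Example customer');
INSERT INTO product  VALUES ('P100', 'Nuts'), ('P200', 'Bolts');
INSERT INTO salesrep VALUES ('S1', 'Rep one'), ('S2', 'Rep two');
INSERT INTO location VALUES ('L1', 'VIC'), ('L2', 'TAS');
INSERT INTO sales VALUES
    ('C100', 'P100', 'S1', 'L1', 510),
    ('C100', 'P100', 'S2', 'L1', 490),
    ('C100', 'P200', 'S1', 'L2', 2000);
""")

# A typical analytical query joins the fact table out to the dimensions it needs.
sales_by_state_product = conn.execute("""
    SELECT l.state, p.prod_name, SUM(s.amount)
    FROM sales s
    JOIN location l ON l.loc_no = s.loc_no
    JOIN product  p ON p.prod_no = s.prod_no
    GROUP BY l.state, p.prod_name
    ORDER BY l.state, p.prod_name
""").fetchall()
```

Each dimension is one join away from the fact table, which is what keeps these queries simple compared with a fully normalized design.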


Snowflake Schemas

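[Figure: snowflake schema. The SALES fact table is linked to DimA Customer, DimB Product (Prod#), DimC Salesrep (SalesrepID) and DimD Location. The Customer dimension is further broken out into Customer Category, Customer Address and Customer State.]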


Fact Tables

Facts measure something of interest to an enterprise

– atomic level or transactional data

– summarization will reduce volume but may lose information

CUST#  PROD#  TOTAL
C100   P100   $1000
C100   P200   $2000

CUST#  PROD#  SALESREP  DATE  COST
C100   P100   S1        1/12  $510
C100   P100   S2        2/12  $490
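
The trade-off between the two tables can be reproduced with a short sketch. The `summarize` helper and the dictionary layout are assumptions for illustration; the two atomic rows are the P100 rows shown on the slide.

```python
def summarize(facts, keys, measure):
    """Sum the measure over every distinct combination of the key columns."""
    totals = {}
    for fact in facts:
        group = tuple(fact[k] for k in keys)
        totals[group] = totals.get(group, 0) + fact[measure]
    return totals

# Atomic facts: one row per sale.
atomic = [
    {"cust": "C100", "prod": "P100", "salesrep": "S1", "date": "1/12", "cost": 510},
    {"cust": "C100", "prod": "P100", "salesrep": "S2", "date": "2/12", "cost": 490},
]

# Summarized fact: one row per customer and product.
by_cust_prod = summarize(atomic, ("cust", "prod"), "cost")

# This question can only be answered from the atomic rows, because the
# summary above no longer records which sales rep made each sale.
by_salesrep = summarize(atomic, ("salesrep",), "cost")
```

The summary needs one row where the atomic table needs two, but the split between S1 and S2 cannot be recovered from it.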


Dimensions

drill down to atomic data from dimensions or reference tables

A Query

– List sales of Product P100 for each State for each Month of 1999?

Product      Location     Time
P#=P100      State=Each   Year=1999
PName Nuts   Region       Month=Each
PCat
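
One way to express this query is sketched below with Python's `sqlite3` module. The table names, column names and sample rows are hypothetical; only the constraints (P# = P100, Year = 1999, one result per State and Month) come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (p_no TEXT PRIMARY KEY, p_name TEXT, p_cat TEXT);
CREATE TABLE location (loc_no TEXT PRIMARY KEY, state TEXT, region TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE sales    (p_no TEXT, loc_no TEXT, time_id INTEGER, amount REAL);

INSERT INTO product  VALUES ('P100', 'Nuts', 'Hardware'), ('P200', 'Bolts', 'Hardware');
INSERT INTO location VALUES ('L1', 'VIC', 'South'), ('L2', 'TAS', 'South');
INSERT INTO time_dim VALUES (1, 1999, 1), (2, 1999, 2), (3, 1998, 12);
INSERT INTO sales VALUES
    ('P100', 'L1', 1, 100),
    ('P100', 'L1', 1, 50),
    ('P100', 'L2', 2, 70),
    ('P100', 'L1', 3, 999),   -- 1998 sale, excluded by the Year constraint
    ('P200', 'L1', 1, 30);    -- other product, excluded by the P# constraint
""")

# Fixed values become WHERE constraints; "Each" becomes a GROUP BY column.
query = """
    SELECT l.state, t.month, SUM(s.amount) AS total_sales
    FROM sales s
    JOIN product  p ON p.p_no = s.p_no
    JOIN location l ON l.loc_no = s.loc_no
    JOIN time_dim t ON t.time_id = s.time_id
    WHERE p.p_no = ? AND t.year = ?
    GROUP BY l.state, t.month
    ORDER BY l.state, t.month
"""
result = conn.execute(query, ("P100", 1999)).fetchall()
```

Replacing `l.state` with `l.region` in the SELECT and GROUP BY clauses would answer the same question one level up the Location dimension.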


Attributes & Attribute Hierarchies

each dimension table contains attributes

surrogate keys are commonly added to improve performance of joins between Fact tables and their associated Dimensions

attributes are used to search, filter or classify facts

Attribute Hierarchies: classification attributes, e.g. SALES_REGION: VIC, TAS


Datamarts/Customization/Cubes

customization - select only the attributes and rows of interest for export to a datamart or data cube

apply coding techniques to the attributes of interest that suit the search algorithm to be used

each cell of a cube is a view consisting of an aggregation of interest

– e.g. TOTAL_SALES

used as a performance-improving technique to

– pre-aggregate group-by cells

– remove data not required for the problem at hand from the search algorithm
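
Pre-aggregating group-by cells can be sketched as follows. This is a naive illustration in the spirit of the data cube operator of Gray et al. (listed in the references), not an efficient implementation; the function name, the `"ALL"` marker and the sample rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def build_cube(rows, dims, measure):
    """Pre-aggregate the measure for every subset of the dimensions.

    A dimension that is not part of a given group-by is reported as 'ALL'.
    """
    cube = defaultdict(float)
    for size in range(len(dims) + 1):
        for kept in combinations(dims, size):
            for row in rows:
                cell = tuple(row[d] if d in kept else "ALL" for d in dims)
                cube[cell] += row[measure]
    return dict(cube)

sales = [
    {"product": "P100", "state": "VIC", "total_sales": 510.0},
    {"product": "P100", "state": "TAS", "total_sales": 490.0},
    {"product": "P200", "state": "VIC", "total_sales": 2000.0},
]

cube = build_cube(sales, ("product", "state"), "total_sales")
```

Once the cube is built, a question such as "total sales of P100 across all states" becomes a single lookup, `cube[("P100", "ALL")]`, instead of a scan of the underlying rows.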


Business Intelligence & The DW

most enterprises have a data repository to allow data analysis to occur

databases provide enabling techniques

– efficient data storage and access

– query optimization

about 80% of the effort in knowledge discovery in databases (KDD) goes into preparing the data - this is the role of the data warehouse

the evolution of the desktop, database, networks and AI/search has made it possible to perform KDD in commercial databases


The BI Process - 1

Understand and define the process

Perform data collection and extraction

Perform Data Cleaning and Exploration

Data Engineering

– select attributes of interest

– select records of interest

– map attributes to suit DM algorithms


The BI Process - 2

Algorithm Engineering

– which algorithm to use

– ability to deal with

  » quality of input

  » quality of output

  » performance

Run the data mining algorithm

Preliminary evaluation of the results

Refine the data and the problem

Use the results to implement a business strategy


A BI Model

[Figure: a BI model. Labels recovered from the figure: Analysis, Discovery, Pattern Recognition, Prediction/Verification, Model, Answer, Variables, Learning, Adaptive Modelling.]

Return on Investment = Profit from targeted customers buying Product X / Cost of producing the model and predicting the answer
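
As a small worked instance of this ratio, with made-up figures (the profit and cost values below are purely illustrative and do not come from the lecture):

```python
def return_on_investment(profit, cost):
    """Profit from targeted customers divided by the cost of modelling and prediction."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return profit / cost

# Hypothetical campaign: 50,000 profit against 20,000 spent on the model.
roi = return_on_investment(profit=50_000.0, cost=20_000.0)
```

A ratio above 1 means the targeted campaign earned more than the model cost to build and run.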


DM Techniques

Verification Driven Data Mining Techniques

– Naive evaluation - exhaustive search

– Random walk

– ad hoc query

– OLAP

– Hypothesis testing - statistics

Discovery Driven Data Mining Techniques

– Statistical Modeling (e.g. linear regression)

– Visualization

– Rule-based and inductive learning

– Neural networks

– Genetic algorithms (an optimization technique)


OLAP: On-Line Analytical Processing

an environment for the analysis of multi-dimensional data

– dice

– rotate

– drill-down

– rollup

OLAP provides advanced database support, including attribute selection, attribute encoding, row sampling and data cleansing, and allows the use of multiple different search engines

– easy to use user-interface

– open system architecture using local processing power
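
The four operations listed above can be sketched on a tiny in-memory fact set. Everything here is an assumption made for illustration: the dimension names, the region labels and the helper functions are hypothetical, and a real OLAP server would work against pre-built cubes rather than Python lists.

```python
from collections import defaultdict

def aggregate(facts, dims):
    """Sum total_sales grouped by the named dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        totals[tuple(fact[d] for d in dims)] += fact["total_sales"]
    return dict(totals)

def dice(facts, **allowed):
    """Keep only facts whose value on each named dimension is in the allowed set."""
    return [f for f in facts if all(f[d] in values for d, values in allowed.items())]

def rotate(crosstab):
    """Swap the two axes of a two-dimensional result."""
    return {(b, a): v for (a, b), v in crosstab.items()}

facts = [
    {"region": "SOUTH", "state": "VIC", "month": 1, "total_sales": 100.0},
    {"region": "SOUTH", "state": "VIC", "month": 2, "total_sales": 50.0},
    {"region": "SOUTH", "state": "TAS", "month": 1, "total_sales": 70.0},
    {"region": "NORTH", "state": "QLD", "month": 1, "total_sales": 30.0},
]

by_region = aggregate(facts, ("region",))                 # rollup to region level
by_state = aggregate(facts, ("region", "state"))          # drill-down to state level
january_south = dice(facts, month={1}, state={"VIC", "TAS"})  # dice: a sub-cube
by_month_state = rotate(aggregate(facts, ("state", "month")))  # rotate the axes
```

Rollup and drill-down are the same aggregation run at a coarser or finer level of the region/state hierarchy, which is why a single `aggregate` helper covers both.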


References

Rob, P. & Coronel, C. Database Systems: Design, Implementation, and Management, 3rd Ed., Nelson, 1997

Inmon, W. H. - numerous. See http://www.cait.wustl.edu/cait/papers/prism/vol1_no1/ for example

Kimball, R. - numerous

Golfarelli, M., Maio, D., and Rizzi, S. Conceptual Design of Data Warehouses from E/R Schemes, in Proceedings of the 31st Hawaii International Conference on System Sciences, 1998

Lee A.J. and Rundensteiner, E. A Data Warehouse Evolution: Consistent Metadata Management.

Gray, J. et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals, Data Mining and Knowledge Discovery 1, pp. 29-53, 1997

Maier, D. et al. Selected Research Issues in Decision Support Databases, Journal of Intelligent Information Systems, 11 (2), pp. 169-191, 1998