data warehouse - sharif university of...

Data Warehouse

Ali Kamandi

Sharif University of Technology

Spring 2007

kamandi@ce.sharif.edu

Part 1:

Data Warehouse Concepts

Data Management

• A critical success factor: IT applications cannot be done without using data.

• The difficulties of managing data:
  • The amount of data increases exponentially with time
  • Data are scattered throughout the organization
  • An ever-increasing amount of external data needs to be considered
  • Data security, quality, and integrity are critical

Data Life Cycle

Data Sources

• Internal Data Sources: data about people, products, services, and processes.

• Personal Data: IS users or other corporate employees may document their own expertise by creating personal data.

• External Data Sources: data from commercial databases to sensors and satellites.

Data, Data everywhere yet ...

• I can’t find the data I need
  • data is scattered over the network
  • many versions, subtle differences

• I can’t get the data I need
  • need an expert to get the data

• I can’t understand the data I found
  • available data poorly documented

• I can’t use the data I found
  • results are unexpected
  • data needs to be transformed from one form to another

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.

[Barry Devlin]

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?

Why Data Warehousing?

Decision Support

• Used to manage and control business

• Data is historical or point-in-time

• Optimized for inquiry rather than update

• Use of the system is loosely defined and can be ad-hoc

• Used by managers and end-users to understand the business and make judgements

Evolution of Decision Support

• 60’s: Batch reports
  • hard to find and analyze information
  • inflexible and expensive, reprogram every request

• 70’s: Terminal-based DSS and EIS

• 80’s: Desktop data access and analysis tools
  • query tools, spreadsheets, GUIs
  • easy to use, but access only operational db

• 90’s: Data warehousing with integrated OLAP engines and tools

Definition of data warehousing

According to W. H. Inmon:

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.

Characteristics of a Data Warehouse

Subject-oriented. Data are organized by subject and contain information relevant for decision support only.

Consistency. Data in different operational databases may be encoded differently. In the data warehouse, though, they will be coded in a consistent manner.

Time-variant. The data are kept for many years so that they can be used for trends, forecasting, and comparisons over time.

Non-volatile. Data are not updated once entered into the warehouse.

Multidimensional. Typically the data warehouse uses a multidimensional structure.

Web-based. Today’s data warehouses are designed to provide an efficient computing environment for web-based applications.

• often a copy of operational data

• with value-added data (e.g., summaries, history)

Why a Warehouse?

• Two approaches:
  • Query-Driven (Lazy)
  • Warehouse (Eager)

Query-Driven Approach

[Figure: query-driven architecture. Clients query a Mediator, which reaches each Source through its own Wrapper.]

Warehouse Architecture

[Figure: warehouse architecture with the components Client, Query & Analysis, Warehouse, Metadata, Integration, and Source.]

Advantages of Warehousing

• High query performance

• Queries not visible outside warehouse

• Local processing at sources unaffected

• Can operate when sources unavailable

• Can query data not stored in a DBMS

• Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

Advantages of Query-Driven

• No need to copy data
  • less storage
  • no need to purchase data

• More up-to-date data

• Only query interface needed at sources

OLTP vs. OLAP

• OLTP: On-Line Transaction Processing
  • Describes processing at operational sites

• OLAP: On-Line Analytical Processing
  • Describes processing at warehouse

OLTP vs. OLAP

OLTP                                 | OLAP
Mostly updates                       | Mostly reads
Many small transactions              | Queries long, complex
MB-TB of data                        | GB-TB of data
Raw data                             | Summarized, consolidated data
Clerical users                       | Decision-makers, analysts as users
Up-to-date data                      |
Consistency, recoverability critical |

OLTP vs. OLAP

                   | OLTP                                                     | OLAP
users              | clerk, IT professional                                   | knowledge worker
function           | day-to-day operations                                    | decision support
DB design          | application-oriented                                     | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                               | ad-hoc
access             | read/write, index/hash on prim. key                      | lots of scans
unit of work       | short, simple transaction                                | complex query
# records accessed | tens                                                     | millions
# users            | thousands                                                | hundreds
DB size            | 100MB-GB                                                 | 100GB-TB
metric             | transaction throughput                                   | query throughput, response

Data Marts

Data Mart: A small data warehouse designed for a strategic business unit (SBU) or a department.

The advantages of data marts include: low cost (prices under $100,000 versus $1 million or more for data warehouses); significantly shorter lead time for implementation (often less than 90 days); local rather than central control (conferring power on the using group); more rapid response; and being more easily understood and navigated than an enterprise-wide data warehouse.

Part 2:

Data Warehouse Design

Building a Data Warehouse

Relational and Multidimensional Database

• Relational databases store data in two-dimensional tables. Multidimensional databases typically store data in arrays, which consist of at least three business dimensions.

Relational data model

• based on a single structure of data values in a two-dimensional table

CUSTOMER
Cus_id | Cus_name | …
001    | Robert   | …
002    | Lyn      | …
…      | …        | …

ORDER
Ord_no | Cus_id | … | Ord_date
01     | 002    | … | 02 Dec 02
02     | Lyn    | … | 03 Dec 02
…      | …      | … |

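A Sample Data Cube

[Figure: a sales data cube with a Date axis (1Qtr, 2Qtr, 3Qtr, 4Qtr) and a country axis (U.S.A., Canada, Mexico); one highlighted cell is labeled "Total annual sales of TV in U.S.A."]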

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all

1-D cuboids: product; date; country

2-D cuboids: (product, date); (product, country); (date, country)

3-D (base) cuboid: (product, date, country)

• Cuboids show the data at different degrees of summarization.

• Given a set of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization, or group by. The lattice of cuboids is then referred to as a data cube.
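
The lattice described above can be enumerated mechanically: each cuboid is one subset of the dimensions, i.e. one possible GROUP BY. The following is a minimal Python sketch under that reading; the function name `cuboids` is chosen for illustration and does not come from the slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Group every subset of the dimensions by its size (0-D apex up to the base cuboid)."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # Each k-element subset is one k-D cuboid, i.e. one GROUP BY over k dimensions.
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboids(["product", "date", "country"])
for k, level in lattice.items():
    print(f"{k}-D cuboids: {level}")
```

With n dimensions (and no hierarchies within a dimension) the lattice contains 2^n cuboids, which is 8 for the three dimensions used here.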

Multidimensional Data Model

• Composed of one fact table and a set of dimension tables.

• Dimension table: each dimension table has a simple (non-composite) primary key that corresponds exactly to one of the components of the composite key in the fact table.

• A multidimensional data model is typically organized around a central theme, like sales, for instance.

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

• Star schema: A fact table in the middle connected to a set of dimension tables

• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Example of Star Schema

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day_of_the_week, quarter
  item: item_key, item_name, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, province_or_state, country
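
To make the star layout concrete, here is a small sketch using Python's built-in `sqlite3` module. It builds only two of the dimension tables (item and branch) plus a reduced sales fact table, then answers a question by joining the fact table to a dimension. The sample rows and values are invented for illustration and are not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item   (item_key INTEGER PRIMARY KEY, item_name TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
-- Fact table: one foreign key per dimension, followed by the measures.
CREATE TABLE sales (
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
-- Hypothetical sample rows.
INSERT INTO item   VALUES (1, 'TV', 'wholesale'), (2, 'Radio', 'retail');
INSERT INTO branch VALUES (1, 'North', 'mall'), (2, 'South', 'street');
INSERT INTO sales  VALUES (1, 1, 2, 800.0), (1, 2, 1, 400.0), (2, 1, 5, 250.0);
""")

# Total units and revenue per item: join the fact table to the item dimension.
rows = conn.execute("""
    SELECT i.item_name, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON s.item_key = i.item_key
    GROUP BY i.item_name
    ORDER BY i.item_name
""").fetchall()
print(rows)
```

Every analytical query follows the same shape: filter or group on dimension attributes, aggregate the measures in the fact table.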

Example of Snowflake Schema

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time: time_key, day_of_the_week, quarter
  item: item_key, item_name, supplier_key
    supplier: supplier_key, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, city_key
    city: city_key, province_or_state, country

Example of Fact Constellation

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  dimension keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time: time_key, day_of_the_week, quarter
  item: item_key, item_name, supplier_type
  branch: branch_key, branch_name, branch_type
  location: location_key, street, province_or_state, country
  shipper: shipper_key, shipper_name, location_key, shipper_type

Typical OLAP Operations

• Roll up (drill-up): summarize data
  • by climbing up hierarchy or by dimension reduction

• Drill down (roll down): reverse of roll-up
  • from higher level summary to lower level summary or detailed data, or introducing new dimensions

• Slice and dice:
  • project and select

• Pivot (rotate):
  • reorient the cube, visualization, 3D to series of 2D planes.

• Other operations

Operations

• Rollup: summarize data
  • e.g., given sales data, summarize sales for last year by product category and region

• Drill down: get more details
  • e.g., given summarized sales as above, find breakup of sales by city within each region, or within the Andhra region

More Cube Operations

• Slice and dice: select and project
  • e.g.: Sales of soft-drinks in Andhra over the last quarter

• Pivot: change the view of data

Fact table view:

sale
prodId | storeId | amt
p1     | c1      | 12
p2     | c1      | 11
p1     | c3      | 50
p2     | c2      | 8

Multi-dimensional cube (dimensions = 2):

   | c1 | c2 | c3
p1 | 12 |    | 50
p2 | 11 | 8  |

3-D Cube

Fact table view:

sale
prodId | storeId | date | amt
p1     | c1      | 1    | 12
p2     | c1      | 1    | 11
p1     | c3      | 1    | 50
p2     | c2      | 1    | 8
p1     | c1      | 2    | 44
p1     | c2      | 2    | 4

Multi-dimensional cube (dimensions = 3):

day 1 | c1 | c2 | c3
p1    | 12 |    | 50
p2    | 11 | 8  |

day 2 | c1 | c2 | c3
p1    | 44 | 4  |
p2    |    |    |

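Another Example

• Add up amounts by day, product

• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

sale (detailed)
prodId | storeId | date | amt
p1     | c1      | 1    | 12
p2     | c1      | 1    | 11
p1     | c3      | 1    | 50
p2     | c2      | 1    | 8
p1     | c1      | 2    | 44
p1     | c2      | 2    | 4

sale (summarized)
prodId | date | amt
p1     | 1    | 62
p2     | 1    | 19
p1     | 2    | 48

Rollup goes from the detailed table to the summarized one; drill-down goes back.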
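
The roll-up on this slide can be checked with a few lines of Python and the built-in `sqlite3` module. This is a sketch: it loads the six rows of the fact table shown above and runs the GROUP BY, selecting `prodId` as well as `date` so that the result has the three columns of the summarized table. The `ORDER BY` is added only to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12),
        ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50),
        ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44),
        ("p1", "c2", 2, 4),
    ],
)

# Roll up over storeId: group by the remaining dimensions and sum the measure.
rollup = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
print(rollup)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```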

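Pivoting

Fact table view:

sale
prodId | storeId | date | amt
p1     | c1      | 1    | 12
p2     | c1      | 1    | 11
p1     | c3      | 1    | 50
p2     | c2      | 1    | 8
p1     | c1      | 2    | 44
p1     | c2      | 2    | 4

Multi-dimensional cube:

day 1 | c1 | c2 | c3
p1    | 12 |    | 50
p2    | 11 | 8  |

day 2 | c1 | c2 | c3
p1    | 44 | 4  |
p2    |    |    |

Pivoted view (amounts summed over date):

   | c1 | c2 | c3
p1 | 56 | 4  | 50
p2 | 11 | 8  |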
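
As a rough illustration of how the final 2-D view is obtained, the sketch below collapses the date dimension of the fact table and lays the totals out by product and store. It uses plain Python dictionaries; the function name `pivot` is an illustrative choice, not something defined in the slides.

```python
from collections import defaultdict

# Rows of the fact table: (prodId, storeId, date, amt)
sales = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

def pivot(rows):
    """Sum amt over date and index the totals as table[prodId][storeId]."""
    cube = defaultdict(lambda: defaultdict(int))
    for prod, store, _date, amt in rows:
        cube[prod][store] += amt
    return {prod: dict(stores) for prod, stores in cube.items()}

table = pivot(sales)
for prod in sorted(table):
    print(prod, [table[prod].get(store, "") for store in ("c1", "c2", "c3")])
```

Cells with no sales (for example p2 at c3) are simply absent from the dictionary, which corresponds to the empty cells in the cube.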

Design a Warehouse?

• Design data warehouse

• Design data marts

• Design representation (Star schema, …)

• gathering data

• Which data is needed?

• Where does it come from?

• cleansing, integrating, ...

• querying, reporting, analysis

• data mining

• monitoring, administering warehouse

Data Gathering

• Periodic snapshots

• Database triggers

• Log shipping

• Data shipping (replication service)

• …

Integration

• Data Cleaning

• Data Loading

• Derived Data

[Figure: the warehouse architecture diagram again, with the components Client, Query & Analysis, Warehouse, Metadata, and Integration.]

Data Cleaning

• Migration (e.g., yen → dollars)

• Fusion (e.g., mail list, customer merging)

[Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe).]
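
Both cleaning steps can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the exchange rate, the field names (`balance_yen`, `plan`), and the choice to match customers on a normalized name, which is far too naive for real data where names collide or are misspelled.

```python
YEN_PER_DOLLAR = 150.0  # hypothetical fixed rate, for illustration only

billing_db = [{"name": "Joe", "balance_yen": 15000}]   # hypothetical billing record
service_db = [{"name": "joe ", "plan": "premium"}]     # hypothetical service record

def migrate(record):
    """Migration: convert the yen amount to dollars and drop the old field."""
    out = {k: v for k, v in record.items() if k != "balance_yen"}
    out["balance_usd"] = record["balance_yen"] / YEN_PER_DOLLAR
    return out

def fuse(*sources):
    """Fusion: merge records that refer to the same customer (matched on normalized name)."""
    merged = {}
    for source in sources:
        for row in source:
            key = row["name"].strip().lower()
            entry = merged.setdefault(key, {})
            entry.update({k: v for k, v in row.items() if k != "name"})
            entry["name"] = key.capitalize()
    return list(merged.values())

customers = fuse([migrate(r) for r in billing_db], service_db)
print(customers)
```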

Loading Data

• Incremental vs. refresh

• Off-line vs. on-line

• Frequency of loading
  • At night, 1x a week/month, continuously

• Parallel/Partitioned load

Derived Data

• Derived Warehouse Data

• When to update derived data?

Part 3:

Data Mining

Data Mining Concepts

Data mining: The process of searching for valuable business information in a large database, data warehouse, or data mart.

Data Mining Application

Retailing and sales

Banking

Manufacturing and production

Insurance

Police work

Health care

Marketing

Text Mining

The application of data mining to non-structured or less-structured text files.

Web Mining

The application of data mining techniques to discover actionable and meaningful patterns from web resources.

Web mining is used in the following areas: information filtering, surveillance, mining of web-access logs for analyzing usage.

Some basic operations

• Predictive:
  • Regression
  • Classification

• Descriptive:
  • Clustering / similarity matching
  • Association rules and variants

Classification

• Given old data about customers and payments, predict new applicant’s loan eligibility.

[Figure: data on previous customers (Salary, Profession, Location, Customer type) is fed to a classifier, which produces decision rules such as "Salary > 5 L" and "Prof. = Exec"; the rules are then applied to the new applicant’s data.]

Classification methods

Goal: Predict class Ci = f(x1, x2, ..., xn)

• Regression: (linear or any other polynomial)
  • a*x1 + b*x2 + c = Ci

• Nearest neighbor

• Decision tree classifier

• Neural networks

Decision trees

• Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: a decision tree with internal tests "Salary < 1 M", "Prof = teacher", and "Age < 30", and leaves labeled Good or Bad.]

Neural network

• Set of nodes connected by directed weighted edges

[Figure: a basic NN unit, next to a more typical NN with hidden nodes and output nodes.]

Bayesian learning

• Assume a probability model on generation of data.

• Apply Bayes theorem to find most likely class as:

$$\text{predicted class} = \arg\max_{c}\, p(c \mid d) = \arg\max_{c}\, p(d \mid c)\, p(c)$$
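
A tiny numeric sketch of this rule follows. The two classes, their priors p(c), and the likelihoods p(d | c) of one observed record d are made-up numbers chosen only to show the arithmetic. The evidence p(d) is the same for every class, so it is needed only when the actual posterior probabilities are wanted.

```python
# Hypothetical values for a single observed record d.
priors = {"good": 0.7, "bad": 0.3}        # p(c)
likelihoods = {"good": 0.2, "bad": 0.5}   # p(d | c)

# Unnormalized posterior score for each class: p(d | c) * p(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# The predicted class is the one with the largest score.
predicted = max(scores, key=scores.get)

# Dividing by p(d) = sum of all scores turns the scores into posteriors p(c | d).
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

print(predicted, posteriors)
```

Here the less common class still wins (0.5 * 0.3 = 0.15 against 0.2 * 0.7 = 0.14), because the record is much more likely under it.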

Clustering

• Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.

• Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.

What is association rule mining?

Market Analyst: "Hmmm, which items are frequently purchased together by my customers?"

[Figure: a market analyst looking at the shopping baskets of Customer 1, Customer 2, Customer 3, ..., Customer n, which contain items such as milk, bread, cereal, butter, sugar, and eggs.]

What is association rule mining? (cont.)

         | milk | bread | sugar | butter | cereal | eggs
Basket 1 | 1    | 1     | 0     | 0      | 1      | 0
Basket 2 | 1    | 1     | 1     | 0      | 0      | 1
Basket 3 | 1    | 1     | 0     | 1      | 0      | 0
Basket 4 | 0    | 0     | 1     | 0      | 0      | 1
count    | 3    | 3     | 2     | 1      | 1      | 2

Support (milk) = 3

Support (bread) = 3

Support (sugar) = 2

……

Support (milk U bread) = 3

Support (milk U sugar) = 1

……

Support (milk U bread U sugar) = 1

……

Support (milk U bread U sugar U butter U cereal U eggs) = 0

Confidence (A → B) = Support (A U B) / Support (A)

Since Confidence (milk → bread) = Support (milk U bread) / Support (milk) = 3/3 = 100%, the rule milk → bread holds.

In general: if Confidence (A → B) >= min_conf, then A → B.
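
The support and confidence figures above can be recomputed directly from the four baskets. The sketch below stores each basket as a Python set; the helper names `support` and `confidence` are chosen to mirror the slide, and the threshold `min_conf = 0.8` is an arbitrary value picked for the example.

```python
baskets = {
    "Basket 1": {"milk", "bread", "cereal"},
    "Basket 2": {"milk", "bread", "sugar", "eggs"},
    "Basket 3": {"milk", "bread", "butter"},
    "Basket 4": {"sugar", "eggs"},
}

def support(items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for contents in baskets.values() if set(items) <= contents)

def confidence(a, b):
    """Support(A U B) / Support(A); defined as 0.0 when A never occurs."""
    support_a = support(a)
    return support(set(a) | set(b)) / support_a if support_a else 0.0

min_conf = 0.8  # hypothetical threshold
rule_holds = confidence({"milk"}, {"bread"}) >= min_conf

print(support({"milk", "bread"}), confidence({"milk"}, {"bread"}), rule_holds)
```

Note that confidence is not symmetric in general: bread → sugar has confidence 1/3 here, while sugar → eggs has confidence 2/2.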

How can DM improve your business?

Strategy 1: Placing milk and bread within close proximity may further encourage the sale of these items together within single visits to the store.

Strategy 2: Placing milk and bread at opposite ends of the store may entice customers who purchase such items to pick up other items along the way.

Strategy 3: Put these two items into a package at a reduced price.

Stages in the evolution of knowledge discovery

Evolutionary stage: Data collection (1960s)
  Business question: What was my total revenue in the last 5 years?
  Enabling technologies: Computers, tapes, disks
  Characteristics: Retrospective, static data delivery

Evolutionary stage: Data access (1980s)
  Business question: What were unit sales in New England last March?
  Enabling technologies: Relational databases (RDBMS), structured query language (SQL)
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary stage: Data warehousing and decision support (early 1990s)
  Business question: What were the sales in region A by product, by salesperson?
  Enabling technologies: OLAP, multidimensional databases, data warehouses
  Characteristics: Retrospective, proactive data delivery at multiple levels

Evolutionary stage: Intelligent data mining (late 1990s)
  Business question: What’s likely to happen to the Boston unit’s sales next month? Why?
  Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
  Characteristics: Prospective, proactive information delivery

Evolutionary stage: Advanced intelligent systems; complete integration (2000-2004)
  Business question: What is the best plan to follow? How did we perform compared to metrics?
  Enabling technologies: Neural computing, advanced AI models, complex optimization, web services
  Characteristics: Proactive, integrative; multiple business partners
