data warehouse-final

52
1 Data Warehousing

Upload: priyanka-manchanda

Post on 19-Jan-2017

209 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Warehouse-Final

1

Data Warehousing

Page 2: Data Warehouse-Final

2

OverviewPart 1: Data Warehouses Part 2: OLAPPart 3: Data MiningPart 4: Query Processing and

Optimization

Page 3: Data Warehouse-Final

3

Part 1: Data Warehouses

Page 4: Data Warehouse-Final

4

Data, Data everywhereyet ...

I can’t find the data I need data is scattered over the network many versions, subtle differences

I can’t get the data I need need an expert to get the data

I can’t understand the data I found available data poorly

documented I can’t use the data I found results are unexpected data needs to be transformed

from one form to other

Page 5: Data Warehouse-Final

5

What is a Data Warehouse? A single, complete and

consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.

Data Characteristics: Subject Oriented Integrated Non-Volatile Time-Referenced

Page 6: Data Warehouse-Final

6

Which are our lowest/highest margin

customers ?Who are my customers

and what products are they buying?

Which customers are most likely to go to the competition ?

What impact will new products/services

have on revenue and margins?

What product prom--otions have the biggest

impact on revenue?

What is the most effective distribution

channel?

Why Data Warehousing?

Page 7: Data Warehouse-Final

WHY BI AND WHAT IS IT

Information is the life-blood of modern organization

BI transform data stored in core business systems into knowledge,which in turn enables you to,

•Know your customers and let your customer know you

•Streamline your business processes by aligning technology with business goals

•Know how your business runs as a whole by getting 360 deg view of your organization

Page 8: Data Warehouse-Final

BI Architecture

Page 9: Data Warehouse-Final

9

Decision SupportUsed to manage and control businessData is historical or point-in-timeOptimized for inquiry rather than updateUse of the system is loosely defined and

can be ad-hocUsed by managers and end-users to

understand the business and make judgements

Page 10: Data Warehouse-Final

10

What are the users saying...Data should be integrated

across the enterpriseSummary data had a real

value to the organizationHistorical data held the

key to understanding data over time

What-if capabilities are required

Page 11: Data Warehouse-Final

11

Data Warehousing -- It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible

A decision support database maintained separately from the organization’s operational database

Page 12: Data Warehouse-Final

12

Traditional RDBMS used for OLTP Database Systems have been used

traditionally for OLTP clerical data processing tasks detailed, up to date data structured repetitive tasks read/update a few records isolation, recovery and integrity are critical

Will call these operational systems

Page 13: Data Warehouse-Final

13

OLTP vs Data Warehouse

OLTP Application Oriented Used to run business Clerical User Detailed data Current up to date Isolated Data Repetitive access by

small transactions Read/Update access

Warehouse (DSS) Subject Oriented Used to analyze business Manager/Analyst Summarized and refined Snapshot data Integrated Data Ad-hoc access using large

queries Mostly read access (batch

update)

Page 14: Data Warehouse-Final

14

Data Warehouse Architecture

RelationalDatabases

LegacyData

Purchased Data

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

Page 15: Data Warehouse-Final

15

From the Data Warehouse to Data Marts

DepartmentallyStructured

IndividuallyStructured

Data WarehouseOrganizationallyStructured

Less

More

HistoryNormalizedDetailed

Data

Information

Page 16: Data Warehouse-Final

16

Users have different views of Data

Organizationallystructured

OLAP

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Farmers: Harvest informationfrom known access paths

Tourists: Browse information harvestedby farmers

Page 17: Data Warehouse-Final

17

Wal*Mart Case StudyFounded by Sam WaltonOne the largest Super Market Chains

in the US

Wal*Mart: 2000+ Retail Stores SAM's Clubs 100+Wholesalers Stores

Page 18: Data Warehouse-Final

18

Old Retail ParadigmWal*Mart

Inventory Management Merchandise Accounts

Payable Purchasing Supplier Promotions:

National, Region, Store Level

Suppliers Accept Orders Promote Products Provide special

Incentives Monitor and Track The

Incentives Bill and Collect

Receivables Estimate Retailer

Demands

Page 19: Data Warehouse-Final

19

New (Just-In-Time) Retail Paradigm No more deals Shelf-Pass Through (POS Application)

One Unit Price Suppliers paid once a week on ACTUAL items sold

Wal*Mart Manager Daily Inventory Restock Suppliers (sometimes SameDay) ship to Wal*Mart

Warehouse-Pass Through Stock some Large Items

Delivery may come from supplier Distribution Center

Supplier’s merchandise unloaded directly onto Wal*Mart Trucks

Page 20: Data Warehouse-Final

20

Information as a Strategic WeaponDaily Summary of all Sales InformationRegional Analysis of all Stores in a logical

areaSpecific Product SalesSpecific Supplies SalesTrend Analysis, etc.Wal*Mart uses information when negotiating

with Suppliers Advertisers etc.

Page 21: Data Warehouse-Final

21

Schema Design

Database organization must look like business must be recognizable by business user approachable by business user Must be simple

Schema Types Star Schema Fact Constellation Schema Snowflake schema

Page 22: Data Warehouse-Final

22

Star SchemaA single fact table and for each

dimension one dimension table

T ime

prod

cust

city

fact

date, custno, prodno, cityname, sales

Page 23: Data Warehouse-Final

23

Dimension Tables

Dimension tables Define business in terms already familiar to

users Wide rows with lots of descriptive text Small tables (about a million rows) Joined to fact table by a foreign key heavily indexed typical dimensions

time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Page 24: Data Warehouse-Final

24

Fact Table

Central table Typical example: individual sales

records mostly raw numeric items narrow rows, a few columns at most large number of rows (millions to a

billion) Access via dimensions

Page 25: Data Warehouse-Final

25

Snowflake schemaRepresent dimensional hierarchy directly

by normalizing tables. Easy to maintain and saves storage

T ime

prod

cust

city

fact

date, custno, prodno, cityname, ...

region

Page 26: Data Warehouse-Final

26

Fact Constellation

Fact Constellation Multiple fact tables that share many

dimension tables Booking and Checkout may share many

dimension tables in the hotel industryHotels

Travel Agents

Promotion

Room Type

Customer

Booking

Checkout

Page 27: Data Warehouse-Final

27

Data Granularity in WarehouseSummarized data stored

reduce storage costs reduce cpu usage increases performance since smaller

number of records to be processed design around traditional high level

reporting needs tradeoff with volume of data to be stored

and detailed usage of data

Page 28: Data Warehouse-Final

28

Granularity in WarehouseSolution is to have dual level of

granularity Store summary data on disks

95% of DSS processing done against this data

Store detail on tapes5% of DSS processing against this data

Page 29: Data Warehouse-Final

29

Levels of Granularity

Operational

60 days ofactivity

account activity date amount teller location account bal

accountmonth # trans withdrawals deposits average bal

amountactivity date amount account bal

monthly accountregister -- up to 10 years

Not all fieldsneed be archived

Banking Example

Page 30: Data Warehouse-Final

30

Data Integration Across Sources

Trust Credit cardSavings Loans

Same data different name

Different data Same name

Data found here nowhere else

Different keyssame data

Page 31: Data Warehouse-Final

31

Data Transformation

Data transformation is the foundation for achieving single version of the truth

Major concern for ITData warehouse can fail if appropriate

data transformation strategy is not developed

Sequential Legacy Relational ExternalOperational/Source Data

Data Transformation

Accessing Capturing Extracting Householding FilteringReconciling Conditioning Loading Validating Scoring

Page 32: Data Warehouse-Final

32

Data Transformation Exampleen

cod i

ngun

itfie

ld

appl A - balanceappl B - balappl C - currbalappl D - balcurr

appl A - pipeline - cmappl B - pipeline - inappl C - pipeline - feetappl D - pipeline - yds

appl A - m,fappl B - 1,0appl C - x,yappl D - male, female

Data Warehouse

Page 33: Data Warehouse-Final

33

Data Integrity Problems Same person, different spellings

Agarwal, Agrawal, Aggarwal etc... Multiple ways to denote company name

Persistent Systems, PSPL, Persistent Pvt. LTD. Use of different names

mumbai, bombay Different account numbers generated by different

applications for the same customer Required fields left blank Invalid product codes collected at point of sale

manual entry leads to mistakes “in case of a problem use 9999999”

Page 34: Data Warehouse-Final

34

When to Refresh?periodically (e.g., every night, every week)

or after significant eventson every update: not warranted unless

warehouse data require current data (up to the minute stock quotes)

refresh policy set by administrator based on user needs and traffic

possibly different policies for different sources

Page 35: Data Warehouse-Final

35

Refresh techniquesIncremental techniques

detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)snapshots (Oracle)transaction shipping (Sybase)

compute changes to derived and summary tables

maintain transactional correctness for incremental load

Page 36: Data Warehouse-Final

36

How To Detect Changes

Create a snapshot log table to record ids of updated rows of source data and timestamp

Detect changes by: Defining after row triggers to update

snapshot log when source table changes Using regular transaction log to detect

changes to source data

Page 37: Data Warehouse-Final

37

Querying Data WarehousesSQL ExtensionsMultidimensional modeling of data

OLAP More on OLAP later …

Page 38: Data Warehouse-Final

38

Reporting ToolsAndyne Computing -- GQL Brio -- BrioQuery Business Objects -- Business Objects Cognos -- Impromptu Information Builders Inc. -- Focus for Windows Oracle -- Discoverer2000 Platinum Technology -- SQL*Assist, ProReports PowerSoft -- InfoMaker SAS Institute -- SAS/Assist Software AG -- Esperant Sterling Software -- VISION:Data

Page 39: Data Warehouse-Final

39

Deploying Data WarehousesWhat business information

keeps you in business today? What business information can put you out of business tomorrow?

What business information should be a mouse click away?

What business conditions are the driving the need for business information?

Page 40: Data Warehouse-Final

40

Cultural ConsiderationsNot just a technology projectNew way of using information

to support daily activities and decision making

Care must be taken to prepare organization for change

Must have organizational backing and support

Page 41: Data Warehouse-Final

41

User TrainingUsers must have a higher level of IT

proficiency than for operational systems

Training to help users analyze data in the warehouse effectively

Page 42: Data Warehouse-Final

42

Warehouse ProductsComputer Associates -- CA-Ingres Hewlett-Packard -- Allbase/SQL Informix -- Informix, Informix XPSMicrosoft -- SQL Server Oracle -- Oracle7, Oracle Parallel ServerRed Brick -- Red Brick Warehouse SAS Institute -- SAS Software AG -- ADABAS Sybase -- SQL Server, IQ, MPP

Page 43: Data Warehouse-Final

43

Strengths of OLAPIt is a powerful visualization

toolIt provides fast, interactive

response timesIt is good for analyzing time

seriesIt can be useful to find

some clusters and outlinersMany vendors offer OLAP

tools

Page 44: Data Warehouse-Final

44

Part 3: Data Mining

Page 45: Data Warehouse-Final

45

Why Data Mining Credit ratings/targeted marketing:

Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Identify likely responders to sales promotions Fraud detection

Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management: Which of my customers are likely to be the most loyal, and

which are most likely to leave for a competitor? :Data Mining helps extract such information

Page 46: Data Warehouse-Final

46

Data mining

Process of semi-automatically analyzing large databases to find interesting and useful patterns

Overlaps with machine learning, statistics, artificial intelligence and databases but more scalable in number of features and

instances more automated to handle heterogeneous

data

Page 47: Data Warehouse-Final

47

Application Areas

Industry ApplicationFinance Credit Card AnalysisInsurance Claims, Fraud Analysis

Telecommunication Call record analysisTransport Logistics managementConsumer goods promotion analysisData Service providersValue added dataUtilities Power usage analysis

Page 48: Data Warehouse-Final

48

Data Mining in UseThe US Government uses Data Mining to track

fraudA Supermarket becomes an information brokerBasketball teams use it to track game strategyCross SellingTarget MarketingHolding on to Good CustomersWeeding out Bad Customers

Page 49: Data Warehouse-Final

49

Why Now?Data is being producedData is being warehousedThe computing power is availableThe computing power is affordableThe competitive pressures are strongCommercial products are available

Page 50: Data Warehouse-Final

50

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Page 51: Data Warehouse-Final

51

Mining marketAround 20 to 30 mining tool vendorsMajor players:

Clementine, IBM’s Intelligent Miner, SGI’s MineSet, SAS’s Enterprise Miner.

All pretty much the same set of toolsMany embedded products: fraud detection,

electronic commerce applications

Page 52: Data Warehouse-Final

52

OLAP Mining integrationOLAP (On Line Analytical Processing)

Fast interactive exploration of multidim. aggregates.

Heavy reliance on manual operations for analysis:

Tedious and error-prone on large multidimensional data

Ideal platform for vertical integration of mining but needs to be interactive instead of batch.