01 dwh introduction

R. Marti Introduction Data Warehousing Spring Semester 2011



R. Marti

Introduction

Data Warehousing

Spring Semester 2011


Introduction – Data Warehousing 2011, R. Marti

Historical Context

•  Age of Transactions (ca. 1970 - )
 –  Goal: reliability - make sure no data is lost
 –  1960s: IMS (hierarchical data model)
 –  1980s: Oracle (relational data model), then DB2, Sybase
•  Age of Business Intelligence (ca. 1995 - )
 –  Goal: analyze the data => make business decisions
 –  Aggregate data for the boss. Tolerate imprecision (e.g., slightly out-of-date information)!
 –  SAP BW / Business Objects, IBM Cognos, … (ROLAP = relational model), Oracle Hyperion Essbase (MOLAP = multi-dimensional model)
•  Age of “Data for the Masses”
 –  Goal: everybody has access to everything
 –  Google (text), Cloud (XML, JSON: Services)

adapted from D. Kossmann & D. Binnig 2009


Theory: One integrated enterprise-wide Database

[Figure: Applications A, B, and C all access one Shared Database. Entities shown in its schema: Organizational Unit, Management Unit, Legal Entity, Employee, Partner, Counterparty, Intermediary, Client, Bank, Risk, Loss Event, Claim, Location, Contract, Financial Transaction, Product, Asset, Market.]

Single common enterprise data model


Theory: Enterprise Data Model (very high-level example)

[Figure: entity boxes connected by relationship lines: Organizational Unit, Management Unit, Legal Entity, Employee, Partner, Counterparty, Intermediary, Client, Bank, Insurable, Loss Event, Claim, Location, Contract, Financial Transaction, Product, Asset, Market. The numerals 1 to 4 appear as markers in the diagram.]

Note: On this level, all relationships (the connecting lines) tend to be many-to-many …


Practice: Drawbacks of an integrated approach

•  Business / management is usually not willing to fund this approach:
 –  Development of the detailed Enterprise Data Model (EDM) is expensive (and hence also time-consuming).
 –  Mapping “legacy” data stores to the EDM is even more expensive.
 –  Finally, data migration is typically orders of magnitude more expensive.
•  Sections of the EDM developed at the beginning of the project may become obsolete before the entire EDM is “finished”, i.e., reflects the currently accepted view of the business.
•  Central “control” of an EDM represents a bottleneck – never mind the hurt feelings of “subject matter experts” and “data owners”.
•  Interdependencies between existing (“legacy”) applications are often unknown (and undocumented to boot).


Practice: Data Silos

[Figure: Applications A, B, and C drawn as separate silos. Numbered markers: 1 = access via API call, 2 = direct database access, 3 = overlaps / redundant data.]

Silos … and (more or less visible) interdependencies


Application Landscape (simplified example)

[Figure: matrix showing which core business processes in which org units are supported by which applications. Applications: SICSnt, IUF, CMS, BoF, FSA. Axis "Core Business Processes": Marketing & Acquisition, Underwriting, Contract Administration, Claims Management, Technical Accounting, Financial Management. Axis "Products": Property & Casualty, Life & Health.]

Applications typically only support a subset of org units, locations, products, and/or business processes
⇒ interdependencies and redundancies in data and processes
⇒ need for integration and consolidation of data


Information Integration Along Business Processes

•  Data collected in one business process must be read (and updated) in other business processes in order to avoid double entry
•  Data collected in different business processes must be combined to get a complete picture across the value chain for decision support & reporting

[Figure: the application landscape matrix (applications SICSnt, IUF, CMS, BoF, FSA; axis "Core Business Processes": Marketing & Acquisition, Underwriting, Contract Administration, Claims Management, Technical Accounting, Financial Management; axis "Products": Property & Casualty, Life & Health), now with a P&C Data Hub added.]


Integration Across Organizations, Locations, Products

•  Data collected in different org units / locations must be combined to get a picture across a larger org unit for decision support & reporting

[Figure: the application landscape matrix (applications SICSnt, IUF, CMS, BoF, FSA; axis "Core Business Processes": Marketing & Acquisition, Underwriting, Contract Administration, Claims Management, Technical Accounting, Financial Management; axis "Products": Property & Casualty, Life & Health), now with an Exposure Data Mart added.]


Other Issues / Drivers for Information Integration

•  Different technologies, e.g.
 –  file-based applications, including the ubiquitous Excel spreadsheets
 –  “legacy” mainframe applications using e.g. IMS and CICS
 –  2-tier client/server applications using relational DBMSs
 –  multi-tier applications using e.g. CORBA, EJB, Web application servers, etc.
•  Different data encodings, e.g.
 –  “home-grown” codes vs ISO encodings, e.g. for countries and currencies
 –  synonyms (mobile vs cellular phone), homonyms; implicit contexts (premium)
 –  colors specified as enumerations vs RGB (= Red Green Blue) values
 –  different units, e.g. km/h vs mph, 1000 CHF vs 1 USD, etc.
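The encoding differences listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the home-grown country codes, the lookup table, and the function names are hypothetical and do not come from the lecture.

```python
# Hypothetical home-grown codes mapped to ISO 3166-1 alpha-2 country codes.
HOME_GROWN_TO_ISO_COUNTRY = {"SWI": "CH", "GER": "DE", "USA": "US"}

MPH_TO_KMH = 1.609344  # one international mile is exactly 1.609344 km


def to_iso_country(code):
    """Translate a home-grown country code; fail loudly on unknown codes."""
    try:
        return HOME_GROWN_TO_ISO_COUNTRY[code.strip().upper()]
    except KeyError:
        raise ValueError(f"no ISO mapping for country code {code!r}") from None


def to_kmh(value, unit):
    """Convert a speed given in 'km/h' or 'mph' to km/h."""
    if unit == "km/h":
        return value
    if unit == "mph":
        return value * MPH_TO_KMH
    raise ValueError(f"unknown speed unit {unit!r}")


def to_single_units(amount, scale, currency):
    """Rescale an amount reported in multiples (e.g. scale=1000 for
    'thousands of CHF') to single currency units. No currency conversion
    is attempted; the currency is only carried along."""
    return amount * scale, currency
```

Failing on unknown codes, rather than passing them through, is a design choice of this sketch: silent pass-through would let unmapped home-grown codes leak into the integrated data.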


Approaches to Information Integration 

1.  Federation
    Everybody talks directly to everyone else.
2.  Warehouse
    Sources are translated from their local schema to a global schema and copied to a central DB.
3.  Mediator
    Virtual warehouse – turns a user query into a sequence of source queries and assembles the results of these queries into an “aggregate” result.

adapted from J.D. Ullman 2007


Integration Approach 1: Federation

[Figure: six wrappers connected point-to-point.]

•  Issue: n applications / data stores => up to n² connections

adapted from J.D. Ullman 2007
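The growth in connections can be made concrete with a small Python sketch. It assumes directed point-to-point links, which gives exactly n(n-1) and is therefore bounded by the n² quoted on the slide; the comparison with a single central store is added here for illustration, and the function names are hypothetical.

```python
def federation_links(n):
    """Directed point-to-point links when each of n systems talks to every other one."""
    return n * (n - 1)


def central_store_links(n):
    """Links when each of n systems only feeds one central store."""
    return n


for n in (3, 6, 20):
    print(n, federation_links(n), central_store_links(n))
```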


Integration Approach 2: Warehouse 

[Figure: Source 1 and Source 2 each feed the Warehouse through a Wrapper (more detailed diagram to follow).]

•  Issue: usually only one-directional data flows are supported (but this is good enough for reporting / decision support)

adapted from J.D. Ullman 2007
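A minimal Python sketch of the warehouse approach, under stated assumptions: the two local schemas, the global schema (client, country, premium) and all function names are hypothetical. Each wrapper translates rows from its source's local schema into the global schema, and the load step copies the translated rows into a central store.

```python
def wrap_source_1(rows):
    """Hypothetical local schema: cust, ctry, prem_chf (premium in CHF)."""
    return [
        {"client": r["cust"], "country": r["ctry"], "premium": r["prem_chf"]}
        for r in rows
    ]


def wrap_source_2(rows):
    """Hypothetical local schema: client_name, country_code, premium_kchf
    (premium in thousands of CHF, rescaled to CHF here)."""
    return [
        {
            "client": r["client_name"],
            "country": r["country_code"],
            "premium": r["premium_kchf"] * 1000,
        }
        for r in rows
    ]


def load_warehouse(warehouse, extracts):
    """Copy translated rows into the central store.

    Data flows in one direction only: the sources are read, never written."""
    for wrapper, rows in extracts:
        warehouse.extend(wrapper(rows))
    return warehouse
```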


Integration Approach 3: Mediator 

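[Figure: a user query goes to the Mediator, which sends queries through two Wrappers to Source 1 and Source 2. Results flow back through the Wrappers to the Mediator, which returns the combined result (not necessarily completely materialized) to the user.]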

adapted from J.D. Ullman 2007
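The mediator idea can be sketched in Python as follows. This is an illustration only: the in-memory sources, their local schemas, and the function names are hypothetical. Nothing is copied ahead of time; the sources are queried when the user query arrives, and the generators mean the combined result is never fully materialized.

```python
from itertools import chain

# Hypothetical sources with different local schemas.
SOURCE_1 = [{"ctry": "CH", "prem": 100.0}, {"ctry": "DE", "prem": 50.0}]
SOURCE_2 = [{"country_code": "CH", "premium_k": 0.25}]  # premium in thousands


def wrapper_1(country):
    """Run the source query against SOURCE_1 and translate rows lazily."""
    for r in SOURCE_1:
        if r["ctry"] == country:
            yield {"country": r["ctry"], "premium": r["prem"]}


def wrapper_2(country):
    """Run the source query against SOURCE_2 and translate rows lazily."""
    for r in SOURCE_2:
        if r["country_code"] == country:
            yield {"country": r["country_code"], "premium": r["premium_k"] * 1000}


def total_premium(country, wrappers):
    """Turn one user query into one source query per wrapper and
    assemble the partial results into a single aggregate."""
    rows = chain.from_iterable(w(country) for w in wrappers)
    return sum(row["premium"] for row in rows)
```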


Characterization of Data Warehouses

“Definition” due to Bill Inmon:

A Data Warehouse (DWh) is an integrated, subject-oriented, non-volatile database.

•  Integration
   The DWh contains consolidated data from several applications, or more precisely from their databases.
•  Subject-orientation
   The data in a DWh is grouped around subjects, and its structure is designed to make querying the data simple, especially for business analysts.
•  Non-volatility
   When new current data becomes available, the “old” data is not overwritten. Instead, the DWh keeps a history of data in order to support trend analysis etc.
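Non-volatility can be illustrated with a minimal Python sketch of an append-only history. The representation (a list of (load date, key, value) tuples) and the function names are assumptions made for this example, not a prescribed design.

```python
from datetime import date


def load_snapshot(history, load_date, snapshot):
    """Append the current values; earlier entries are never overwritten."""
    for key, value in snapshot.items():
        history.append((load_date, key, value))


def as_of(history, key, when):
    """Return the value of `key` as it was known on date `when`, or None."""
    candidates = [(d, v) for d, k, v in history if k == key and d <= when]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


history = []
load_snapshot(history, date(2011, 1, 1), {"client_42_city": "Zurich"})
load_snapshot(history, date(2011, 2, 1), {"client_42_city": "Basel"})
```

Because both loads are kept, a query as of mid-January still sees the old value, which is what makes trend analysis and reproducible reports possible.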


Data Warehouse Reference Architecture

[Figure: Data Warehousing reference architecture. Components shown: three Source Databases, each with its own Landing Area; one Staging Area; the Data Warehouse; two Data Marts; Metadata and Master Data; and the consumers Dashboards, Reports, Interactive Analysis, and Data Mining.]


Terminology (Caveat: Not everyone agrees on this 100%)

•  Source DB – also: Operational Application, OLTP Application
   DB of an application which supports one or more types of business transactions.
•  Landing Area (LA)
   DB that is able to store a single data extract of a subset of one Source DB. Its schema basically corresponds 1:1 with the schema of that subset of the Source DB.
•  Staging Area (SA)
   DB that is able to store matching data extracts from various Landing Areas in an integrated format, waiting for the upload to the DWh once data from all Landing Areas is available. Its schema basically corresponds 1:1 with the DWh schema.
•  Data Warehouse (DWh)
   DB containing the history of all complete Staging Areas. Its integrated schema is usually more or less in Third Normal Form (3NF), see following slide.
   Note: 3NF slightly contradicts the criterion “subject-orientation” mentioned above.
•  Data Mart (DM) – also: OLAP Application
   DB – on disk or in main memory – containing data describing the (present and past) performance of one or more types of business transactions, taken from the DWh. The schema of a Data Mart often has the form of one or more (denormalized) “stars”.
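The hand-over from Landing Areas to the Staging Area and on to the DWh can be sketched in Python. This is a simplified illustration under assumptions: landing areas are modelled as a dict from a hypothetical source name to its extract (None while the extract has not arrived), and the integration step merely tags each row with its source.

```python
def stage(landing_areas):
    """Build the staging area, or return None while any extract is missing.

    The staging area only becomes complete once data from all
    landing areas is available."""
    if any(extract is None for extract in landing_areas.values()):
        return None
    return [
        dict(row, source=name)
        for name, extract in landing_areas.items()
        for row in extract
    ]


def upload(dwh, staging, load_id):
    """Append a complete staging area to the DWh; earlier loads are kept."""
    dwh.extend(dict(row, load_id=load_id) for row in staging)
```

Tagging every row with a `load_id` (a name chosen for this sketch) is one simple way to keep the history of all complete staging areas distinguishable inside the DWh.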


Normalized Schema of Operational DB (Example)

[Figure: normalized schema diagram, not recoverable from the transcript.]

Typical Query: Exposure
- by Product
- by Client


Subject-orientation: Star Schema used in Data Marts

[Figure: star schema diagram with the subject marked, not recoverable from the transcript.]

Typical Query: Exposure
- by Product
- by Client
- by Time
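Since the schema diagram is not available here, the following Python sketch shows what a star schema for the typical query could look like, using the standard-library sqlite3 module. All table names, column names, and sample rows are hypothetical and chosen only to match the query "Exposure by Product, by Client, by Time".

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_client  (client_id  INTEGER PRIMARY KEY, client_name  TEXT);
CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER);
-- Fact table in the centre of the star: one foreign key per dimension.
CREATE TABLE fact_exposure (
    product_id INTEGER REFERENCES dim_product(product_id),
    client_id  INTEGER REFERENCES dim_client(client_id),
    time_id    INTEGER REFERENCES dim_time(time_id),
    exposure   REAL
);
INSERT INTO dim_product VALUES (1, 'Property'), (2, 'Life');
INSERT INTO dim_client  VALUES (1, 'Client A'), (2, 'Client B');
INSERT INTO dim_time    VALUES (1, 2010), (2, 2011);
INSERT INTO fact_exposure VALUES
    (1, 1, 1, 100.0), (1, 1, 2, 150.0), (2, 2, 2, 70.0), (1, 2, 2, 30.0);
""")

# Exposure by product, by client, by time: one join per dimension.
rows = con.execute("""
    SELECT p.product_name, c.client_name, t.year, SUM(f.exposure)
    FROM fact_exposure f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_client  c ON c.client_id  = f.client_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.product_name, c.client_name, t.year
    ORDER BY p.product_name, c.client_name, t.year
""").fetchall()
```

Every analysis follows the same pattern: join the fact table to the dimensions of interest and group by their attributes, which is what makes the structure easy to query.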


OLTP Applications (= OnLine Transaction Processing) 

•  “Getting the data in”: capturing data describing business transactions
•  Many short and “small” transactions: point queries, single-row updates and/or inserts
•  Avoid (uncontrolled) redundancies => normalized schemas
•  Access to up-to-date, consistent DB
•  Examples: flight reservation systems, procurement, order management
•  Goal: 6000 Transactions Per Second (TPS) [Oracle 1995]

adapted from D. Kossmann & D. Binnig 2009
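A short, “small” OLTP transaction might look like the following Python sketch, which uses the standard-library sqlite3 module. The seat table and the `reserve` function are hypothetical; the point is that each transaction touches a single row identified by its key.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE seat ("
    " flight TEXT, seat_no TEXT, passenger TEXT,"
    " PRIMARY KEY (flight, seat_no))"
)
con.executemany(
    "INSERT INTO seat VALUES (?, ?, NULL)",
    [("LX100", "1A"), ("LX100", "1B")],
)
con.commit()


def reserve(con, flight, seat_no, passenger):
    """Reserve one seat in one short transaction.

    Returns True if the seat was free and is now taken, False otherwise."""
    with con:  # commits on success, rolls back if an exception is raised
        cur = con.execute(
            "UPDATE seat SET passenger = ?"
            " WHERE flight = ? AND seat_no = ? AND passenger IS NULL",
            (passenger, flight, seat_no),
        )
        return cur.rowcount == 1
```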


OLAP Applications (= OnLine Analytical Processing)

•  “Getting the data out”: analyzing the data describing business transactions
•  Queries with large result sets (all the data, joins)
•  No immediate updates and/or inserts, but instead large periodic (daily, weekly) batch inserts
•  (Controlled) redundancy is a necessity for performance reasons: denormalized schemas, materialized views; indexes
•  Examples:
   - Management Information Systems (MIS), Decision Support Systems (DSS)
   - Statistical databases
   - Scientific databases, bioinformatics
•  Goal: Response time of seconds / a few minutes

adapted from D. Kossmann & D. Binnig 2009
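Periodic batch loads and controlled redundancy can be sketched in Python with the standard-library sqlite3 module. SQLite has no materialized views, so this example emulates one with a summary table that is rebuilt after each load; the table names and the `nightly_load` function are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_claim (load_day TEXT, product TEXT, amount REAL);
-- Controlled redundancy: a precomputed aggregate of fact_claim.
CREATE TABLE agg_claim_by_product (product TEXT PRIMARY KEY, total REAL);
""")


def nightly_load(con, load_day, batch):
    """Insert one batch of (product, amount) rows and refresh the summary table."""
    with con:  # the batch and the refresh commit together or not at all
        con.executemany(
            "INSERT INTO fact_claim VALUES (?, ?, ?)",
            [(load_day, product, amount) for product, amount in batch],
        )
        con.execute("DELETE FROM agg_claim_by_product")
        con.execute(
            "INSERT INTO agg_claim_by_product"
            " SELECT product, SUM(amount) FROM fact_claim GROUP BY product"
        )
```

Reports then read the small summary table instead of scanning all facts. Rebuilding it from scratch keeps the sketch simple; an incremental refresh would be the obvious refinement for large tables.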


OLTP vs OLAP: Water and Oil

•  Lock Conflicts
   - long-running OLAP reads may block OLTP writes
•  Freshness of data
   - OLTP: up-to-date data => serializability
   - OLAP: reproducibility of analyses => historization
•  Precision
   - OLTP: (usually!) exact
   - OLAP: sampling, statistical summaries, results with confidence intervals

adapted from D. Kossmann & D. Binnig 2009
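The precision point for OLAP can be illustrated with a small Python sketch that estimates a mean from a random sample and reports a confidence interval. It is an illustration only: it assumes a simple random sample and a normal approximation, and it applies no finite-population correction. The function name and its parameters are hypothetical.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, lower, upper), where lower and upper bound an
    approximate 95% confidence interval (normal approximation)."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, mean - half_width, mean + half_width
```

An OLTP system would compute the exact value. Here the analyst trades exactness for a cheaper query and gets an explicit statement of the uncertainty in return.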


Business Performance Management: Closing the Loop

[Figure: a loop between the Operational Business Process (Do, Measure) and the Business Steering Process (Interpret, Decide). Labels on the loop: Information Supply, Information Integration (integrate, aggregate, historize), and Information Demand (plan, break-down).]


BPM: The Role of Measures (example)

[Figure: the BPM loop between the Operational Business Process (OLTP) and the Business Steering Process, annotated with measures. Labels: Information Supply, Information Integration, Information Demand, OLAP. Measures shown: Basic Measures (e.g., Premium, Loss, Cost); Basic + Derived Measures (e.g., Combined Ratio, VaR); Assumptions (e.g., future cedant performance, interest rates); Target Measures (e.g., Combined Ratio, …); “Breakdown” of Target Measures.]
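The step from basic to derived measures can be sketched in Python. The slide does not define the combined ratio; the sketch assumes the common insurance definition (loss + cost) / premium, and the two units with their figures are invented for illustration.

```python
def combined_ratio(premium, loss, cost):
    """Derived measure, assuming the definition (loss + cost) / premium."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return (loss + cost) / premium


# Hypothetical basic measures per org unit.
units = [
    {"premium": 100.0, "loss": 60.0, "cost": 30.0},
    {"premium": 300.0, "loss": 240.0, "cost": 90.0},
]

# Aggregate the basic measures first, then derive the ratio from the totals.
totals = {k: sum(u[k] for u in units) for k in ("premium", "loss", "cost")}
overall = combined_ratio(**totals)

# Averaging the per-unit ratios instead ignores the different premium volumes.
naive_average = sum(combined_ratio(**u) for u in units) / len(units)
```

With these figures the ratio derived from the totals differs from the plain average of the two unit ratios, which is why derived measures are recomputed from aggregated basic measures rather than aggregated themselves.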