01 dwh introduction
8/3/2019 01 DWh Introduction
http://slidepdf.com/reader/full/01-dwh-introduction 1/24
R. Marti
Introduction
Data Warehousing
Spring Semester 2011
Introduction – Data Warehousing 2011, R. Marti
Historical Context
• Age of Transactions (ca 1970 - )
– Goal: reliability - make sure no data is lost
– 1960s: IMS (hierarchical data model)
– 1980s: Oracle (relational data model), then DB2, Sybase
• Age of Business Intelligence (ca 1995 - )
– Goal: analyze the data => make business decisions
– Aggregate data for boss.
Tolerate imprecision (e.g., slightly out-of-date information)!
– SAP BW / Business Objects, IBM Cognos, … (ROLAP = Relational model),
Oracle Hyperion Essbase (MOLAP = Multi-dimensional model)
• Age of „Data for the Masses“
– Goal: everybody has access to everything
– Google (text), Cloud (XML, JSON: Services)
adapted from D. Kossmann & D. Binnig 2009
Theory: One integrated enterprise-wide Database
[Figure: Applications A, B and C all access one Shared Database that is built on a single common enterprise data model. Entities shown: Organizational Unit, Partner, Risk, Loss Event, Claim, Location, Contract, Financial Transaction, Product, Counterparty, Intermediary, Client, Bank, Management Unit, Legal Entity, Employee, Asset, Market.]
Theory: Enterprise Data Model (very high-level example)
[Figure: enterprise data model diagram. Entities shown: Organizational Unit, Partner, Insurable, Loss Event, Claim, Location, Contract, Financial Transaction, Product, Counterparty, Intermediary, Client, Bank, Management Unit, Legal Entity, Employee, Asset, Market. The diagram also carries the markers 1 to 4.]
Note: On this level, all relationships (the connecting lines)
tend to be many-to-many …
Practice: Drawbacks of an integrated approach
• Business / management is usually not willing to fund this approach:
• Development of the detailed Enterprise Data Model (EDM) is expensive
(and hence also time-consuming).
• Mapping “legacy” data stores to the EDM is even more expensive.
• Finally, data migration is typically orders of magnitude more expensive.
• Sections of the EDM developed at the beginning of the project may
become obsolete before the entire EDM is “finished”, i.e., reflecting
the currently accepted view of the business.
• Central “control” of an EDM represents a bottleneck – never mind the hurt feelings of “subject matter experts” and “data owners”.
• Interdependencies between existing (“legacy”) applications are often
unknown (and undocumented to boot).
Practice: Data Silos
[Figure: Applications A, B and C as silos … and the (more or less visible) interdependencies between them:]
1 Access via API call
2 Direct database access
3 Overlaps / redundant data
Application Landscape (simplified example)
[Figure: matrix showing which Core Business Processes in which Org Units are supported by which applications. Core Business Processes: Marketing & Acquisition, Contract Administration, Claims Management, Technical Accounting, Financial Management, Underwriting. Products: Property & Casualty, Life & Health. Applications placed in the matrix: SICSnt, IUF, CMS, BoF, FSA.]
Applications typically only support a subset of org units, locations, products, and/or
business processes
⇒ interdependencies and redundancies in data and processes
⇒ need for integration and consolidation of data
Information Integration Along Business Processes
[Figure: application landscape matrix (Core Business Processes × Products: Property & Casualty, Life & Health) with the applications SICSnt, IUF, BoF, CMS and FSA, and with a P & C Data Hub added.]
• Data collected in one business process must be read (and updated) in other
business processes in order to avoid double entry
• Data collected in different business processes must be combined to get a complete
picture across the value chain for decision support & reporting
Integration Across Organizations, Locations, Products
[Figure: application landscape matrix (Core Business Processes × Products: Property & Casualty, Life & Health) with the applications IUF, CMS, SICSnt, FSA and BoF, and with an Exposure Data Mart added.]
• Data collected in different org units / locations must be combined to get a picture
across a larger org unit for decision support & reporting
Other Issues / Drivers for Information Integration
• Different technologies, e.g.
– file-based applications including the ubiquitous Excel spreadsheets
– “legacy” mainframe applications using e.g. IMS and CICS
– 2-tier client/server applications using relational DBMSs
– multi-tier applications using e.g. CORBA, EJB, Web application servers etc.
• Different data encodings, e.g.
– “home-grown” codes vs ISO encodings, e.g. for countries and currencies
– synonyms (mobile vs cellular phone), homonyms; implicit contexts (premium)
– colors specified as enumerations vs RGB (= Red Green Blue) values
– different units, e.g. km/h vs mph, 1000 CHF vs 1 USD, etc.
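The encoding differences listed above are usually resolved by small, explicit translation functions applied when data is integrated. The following is only a sketch under stated assumptions: the home-grown country codes in the mapping table and all function names are hypothetical and not taken from the slides. Only the mph to km/h factor is a fixed fact.

```python
# Hypothetical home-grown country codes mapped to ISO 3166-1 alpha-2 codes.
HOME_GROWN_TO_ISO_COUNTRY = {"SUI": "CH", "GER": "DE", "USA": "US"}

MPH_TO_KMH = 1.609344  # exact, by definition of the international mile


def to_iso_country(code: str) -> str:
    """Translate a source-specific country code.

    Codes that are not in the mapping are returned unchanged (after trimming
    and upper-casing), on the assumption that they are already ISO codes.
    """
    code = code.strip().upper()
    return HOME_GROWN_TO_ISO_COUNTRY.get(code, code)


def to_kmh(value: float, unit: str) -> float:
    """Convert a speed to km/h; only 'km/h' and 'mph' are supported."""
    if unit == "km/h":
        return value
    if unit == "mph":
        return value * MPH_TO_KMH
    raise ValueError(f"unsupported speed unit: {unit!r}")


def to_currency_units(amount: float, scale: int) -> float:
    """Undo a reporting scale, e.g. an amount stored in thousands (scale=1000)."""
    return amount * scale


country = to_iso_country(" sui ")
speed = to_kmh(100.0, "mph")
amount = to_currency_units(12.5, 1000)
```

Currency conversion (CHF vs USD) is deliberately left out, because it needs an exchange rate and a valuation date that the slide does not specify.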
Approaches to Information Integration
1. Federation
Everybody talks directly to everyone else.
2. Warehouse
Sources are translated from their local schema to a global
schema and copied to a central DB.
3. Mediator
Virtual warehouse – turns a user query into a sequence of
source queries and assembles the results of these queries into
an “aggregate” result.
adapted from J.D. Ullman 2007
Integration Approach 1: Federation
[Figure: applications / data stores connected directly to each other, each connection going through a Wrapper (six Wrappers shown)]
• Issue:
n applications / data stores => up to n² connections
adapted from J.D. Ullman 2007
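The quadratic growth can be made concrete with a few lines of arithmetic. This is a sketch, not part of the original slides: it assumes one wrapper per pair of data stores (or one per direction), and compares that with one wrapper per source as in the warehouse and mediator approaches. The function names are made up for illustration.

```python
def pairwise_links(n: int) -> int:
    """Connections if every pair of data stores shares one wrapper: n(n-1)/2."""
    return n * (n - 1) // 2


def directed_links(n: int) -> int:
    """Connections if each direction needs its own wrapper: n(n-1)."""
    return n * (n - 1)


def central_links(n: int) -> int:
    """Connections if every source talks only to one central component."""
    return n


for n in (4, 10, 50):
    print(n, pairwise_links(n), directed_links(n), central_links(n))
```

Both federation counts stay below n² but grow quadratically, while the central approaches grow linearly with the number of sources.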
Integration Approach 2: Warehouse
[Figure: Source 1 and Source 2 each feed the Warehouse through a Wrapper (more detailed diagram to follow)]
• Issue:
usually only one-directional data flows supported
(but this is good enough for reporting / decision support)
adapted from J.D. Ullman 2007
Integration Approach 3: Mediator
[Figure: a User query is sent to the Mediator, which sends Queries through Wrappers to Source 1 and Source 2; the Results flow back through the Wrappers to the Mediator, which returns a Result to the user (not necessarily completely materialized)]
adapted from J.D. Ullman 2007
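A minimal sketch of the mediator idea: nothing is copied in advance; each user query is forwarded to per-source wrappers at query time, and the wrapper results are combined. The two in-memory sources, their field names and the function names are hypothetical stand-ins for real source systems.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Wrapper = Callable[[str], Iterable[Row]]

# Hypothetical source data with different local schemas.
SOURCE_1 = [{"cust": "ACME", "prem": 120.0}, {"cust": "Globex", "prem": 80.0}]
SOURCE_2 = [{"client_name": "ACME", "premium_amount": 50.0}]


def wrapper_1(client: str) -> List[Row]:
    """Query source 1 and translate its rows to the global schema."""
    return [{"client": r["cust"], "premium": r["prem"]}
            for r in SOURCE_1 if r["cust"] == client]


def wrapper_2(client: str) -> List[Row]:
    """Query source 2 and translate its rows to the global schema."""
    return [{"client": r["client_name"], "premium": r["premium_amount"]}
            for r in SOURCE_2 if r["client_name"] == client]


def mediate(client: str, wrappers: Iterable[Wrapper]) -> Row:
    """Turn one user query into one query per source and combine the results."""
    rows = [row for wrapper in wrappers for row in wrapper(client)]
    return {
        "client": client,
        "total_premium": sum(r["premium"] for r in rows),
        "rows": len(rows),
    }


result = mediate("ACME", [wrapper_1, wrapper_2])
```

In contrast to the warehouse approach, every call to `mediate` hits the sources again, so the answer is always current but its cost depends on the slowest source.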
Characterization of Data Warehouses
“Definition” due to Bill Inmon:
A Data Warehouse (DWh) is an integrated, subject-oriented, non-volatile database.
• Integration
The DWh contains consolidated data from several applications or, more precisely,
from their databases.
• Subject-orientation
The data in a DWh is grouped around subjects, and its structure is designed to
make querying the data simple, especially for business analysts.
• Non-volatility
When new current data becomes available, the “old” data is not overwritten.
Instead, the DWh keeps a history of data in order to support trend analysis etc.
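Non-volatility can be illustrated with an append-only table: each load adds a new row stamped with its load date instead of updating the existing row. This is a sketch using Python's built-in `sqlite3` module; the table name, columns and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contract_history (
        contract_id INTEGER,
        premium     REAL,
        loaded_on   TEXT,               -- ISO date of the load
        PRIMARY KEY (contract_id, loaded_on)
    )
""")


def load(contract_id: int, premium: float, loaded_on: str) -> None:
    """Append the current state; earlier rows are never updated or deleted."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO contract_history VALUES (?, ?, ?)",
            (contract_id, premium, loaded_on),
        )


load(1, 100.0, "2011-01-31")
load(1, 110.0, "2011-02-28")
load(1, 125.0, "2011-03-31")

# The full history is available for trend analysis ...
history = conn.execute(
    "SELECT loaded_on, premium FROM contract_history "
    "WHERE contract_id = ? ORDER BY loaded_on",
    (1,),
).fetchall()

# ... and the current state is simply the most recent row.
latest = history[-1]
```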
Data Warehouse Reference Architecture
[Figure: Data Warehousing reference architecture. Three Source Databases, each with its own Landing Area; a Staging Area; the Data Warehouse; Data Marts; and, on the consumer side, Dashboards, Reports, Interactive Analysis and Data Mining. Metadata and Master Data are shown as further components.]
Terminology (Caveat: Not everyone agrees on this 100%)
• Source DB – also: Operational Application, OLTP Application
DB of an application which supports one or more types of business transactions.
• Landing Area (LA)
DB that is able to store a single data extract of a subset of one Source DB.
Its schema basically corresponds 1:1 with the schema of the subset of the Source DB.
• Staging Area (SA)
DB that is able to store matching data extracts from various Landing Areas in an
integrated format, waiting for the upload to the DWh once data from all Landing Areas
are available. Its schema basically corresponds 1:1 with the DWh schema.
• Data Warehouse (DWh)
DB containing the history of all complete Staging Areas. Its integrated schema is
usually more or less in Third Normal Form (3NF), see following slide.
Note: 3NF slightly contradicts the criterion “subject-orientation” mentioned above.
• Data Mart (DM) – also: OLAP Application
DB – on disk or in main memory – containing data describing the (present and past)
performance of one or more types of business transactions, taken from the DWh.
The schema of a Data Mart often has the form of one or more (denormalized) “stars”.
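The flow between these components can be sketched in a few functions: extracts sit in Landing Areas in their source schema, the Staging Area is only built once all expected Landing Areas have delivered, and the upload to the DWh appends a dated copy. All source names, field names and the mapping rules are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional

Row = Dict[str, object]

# Hypothetical set of sources that must all deliver before staging is complete.
EXPECTED_SOURCES = {"app_a", "app_b"}


def integrate(source: str, row: Row) -> Row:
    """Map a row from a Landing Area (source schema) to the integrated schema."""
    if source == "app_a":
        return {"client": row["cust"], "premium": row["prem"], "source": source}
    if source == "app_b":
        return {"client": row["client_name"], "premium": row["premium"], "source": source}
    raise KeyError(f"no mapping for source {source!r}")


def build_staging(landing: Dict[str, List[Row]]) -> Optional[List[Row]]:
    """Return the integrated Staging Area, or None while extracts are missing."""
    if EXPECTED_SOURCES - set(landing):
        return None
    return [integrate(s, r) for s in sorted(landing) for r in landing[s]]


def upload(dwh: List[Row], staging: List[Row], as_of: date) -> None:
    """Append a complete Staging Area to the DWh history, stamped with its date."""
    dwh.extend({**row, "as_of": as_of.isoformat()} for row in staging)


landing_areas = {
    "app_a": [{"cust": "ACME", "prem": 120.0}],
    "app_b": [{"client_name": "ACME", "premium": 50.0}],
}
dwh: List[Row] = []
staging = build_staging(landing_areas)
if staging is not None:
    upload(dwh, staging, date(2011, 3, 31))
```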
Normalized Schema of Operational DB (Example)
Typical Query: Exposure
- by Product
- by Client
Subject-orientation: Star Schema used in Data Marts
Typical Query: Exposure
- by Product
- by Client
- by Time
[Figure: star schema diagram, with the label “Subject”]
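A star schema for this typical query could look roughly as follows: one fact table holding the exposure measure, surrounded by one dimension table each for product, client and time. This is a sketch with Python's built-in `sqlite3` module; the table names, column names and data are hypothetical and not the schema shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_client  (client_id  INTEGER PRIMARY KEY, client_name  TEXT);
    CREATE TABLE dim_time    (time_id    INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- Fact table: one row per product, client and period, with the measure.
    CREATE TABLE fact_exposure (
        product_id INTEGER REFERENCES dim_product,
        client_id  INTEGER REFERENCES dim_client,
        time_id    INTEGER REFERENCES dim_time,
        exposure   REAL
    );

    INSERT INTO dim_product VALUES (1, 'Property'), (2, 'Casualty');
    INSERT INTO dim_client  VALUES (1, 'ACME'), (2, 'Globex');
    INSERT INTO dim_time    VALUES (1, 2010, 4), (2, 2011, 1);
    INSERT INTO fact_exposure VALUES
        (1, 1, 1, 100.0), (1, 1, 2, 150.0), (2, 1, 2, 40.0), (1, 2, 2, 70.0);
""")

# Exposure by Product, by Client, by Time: join the fact table to each
# dimension and group by the dimension attributes of interest.
rows = conn.execute("""
    SELECT p.product_name, c.client_name, t.year, SUM(f.exposure)
    FROM fact_exposure AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_client  AS c ON c.client_id  = f.client_id
    JOIN dim_time    AS t ON t.time_id    = f.time_id
    GROUP BY p.product_name, c.client_name, t.year
    ORDER BY p.product_name, c.client_name, t.year
""").fetchall()
```

Every analysis has the same shape (fact table joined to a few dimensions, then grouped), which is what makes the structure easy to query for business analysts.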
OLTP Applications (= OnLine Transaction Processing)
• “Getting the data in”:
capturing data describing business transactions
• Many short and “small” transactions:
point queries, single-row updates and/or inserts
• Avoid (uncontrolled) redundancies => normalized schemas
• Access to up-to-date, consistent DB
• Examples: Flight reservation systems, Procurement, Order Management
• Goal: 6000 Transactions Per Second (TPS) [Oracle 1995]
adapted from D. Kossmann & D. Binnig 2009
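A typical OLTP transaction touches one row via its key and commits immediately. The sketch below uses Python's built-in `sqlite3` module and a hypothetical seat table for a flight reservation system; the flight number, seat numbers and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE seat (
        flight    TEXT,
        seat_no   TEXT,
        passenger TEXT,                 -- NULL while the seat is free
        PRIMARY KEY (flight, seat_no)
    )
""")
conn.executemany(
    "INSERT INTO seat VALUES (?, ?, NULL)",
    [("LX100", "12A"), ("LX100", "12B")],
)
conn.commit()


def reserve(flight: str, seat_no: str, passenger: str) -> bool:
    """Reserve one seat in a short transaction; False if it is already taken."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE seat SET passenger = ? "
            "WHERE flight = ? AND seat_no = ? AND passenger IS NULL",
            (passenger, flight, seat_no),
        )
        return cur.rowcount == 1  # single-row update addressed by primary key


first = reserve("LX100", "12A", "Alice")
second = reserve("LX100", "12A", "Bob")   # same seat, must not succeed

# Point query: exactly one row, addressed by its key.
holder = conn.execute(
    "SELECT passenger FROM seat WHERE flight = ? AND seat_no = ?",
    ("LX100", "12A"),
).fetchone()[0]
```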
OLAP Applications (= OnLine Analytical Processing)
• “Getting the data out”:
analyzing the data describing business transactions
• Queries with large result sets (all the data, joins)
• No immediate updates and/or inserts,
but instead large periodic (daily, weekly) batch inserts
• (Controlled) redundancy a necessity for performance reasons:
denormalized schemas, materialized views; indexes
• Examples:
- Management Information Systems (MIS), Decision Support Systems (DSS)
- Statistical databases
- Scientific databases, bioinformatics
• Goal: Response Time of seconds / a few minutes
adapted from D. Kossmann & D. Binnig 2009
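The combination of periodic batch inserts and controlled redundancy can be sketched as follows. SQLite has no materialized views, so a precomputed summary table that is rebuilt after each batch stands in for one here; this substitution, the table names and the data are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day TEXT, product TEXT, amount REAL);
    -- Controlled redundancy: totals that could always be recomputed from sales.
    CREATE TABLE sales_by_product (product TEXT PRIMARY KEY, total REAL);
""")


def nightly_batch(rows):
    """Load one batch and refresh the precomputed summary in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        conn.execute("DELETE FROM sales_by_product")
        conn.execute(
            "INSERT INTO sales_by_product "
            "SELECT product, SUM(amount) FROM sales GROUP BY product"
        )


nightly_batch([("2011-03-01", "Property", 100.0), ("2011-03-01", "Casualty", 40.0)])
nightly_batch([("2011-03-02", "Property", 50.0)])

# Analytical queries read the small summary table instead of scanning sales.
totals = dict(conn.execute("SELECT product, total FROM sales_by_product"))
```

Because the summary is refreshed only with each batch, queries between two loads see slightly out-of-date but consistent totals.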
OLTP vs OLAP: Water and Oil
• Lock Conflicts
- long-running OLAP reads may block OLTP writes
• Freshness of data
- OLTP: up-to-date data => serializability
- OLAP: reproducibility of analyses => historization
• Precision
- OLTP: (usually!) exact
- OLAP: sampling, statistical summaries, results with confidence intervals
adapted from D. Kossmann & D. Binnig 2009
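The precision trade-off on the OLAP side can be illustrated by estimating an average from a random sample instead of scanning all rows. This is a sketch using only the Python standard library; the synthetic data, the sample size and the normal-approximation 95% interval (factor 1.96) are assumptions chosen for illustration.

```python
import math
import random
import statistics


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample, with an approximate 95% interval."""
    rng = random.Random(seed)                    # fixed seed: reproducible analysis
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    half_width = 1.96 * std_err                  # normal approximation
    return mean, mean - half_width, mean + half_width


# Synthetic "fact table" column with 100,000 values between 0 and 999.
population = [float(i % 1000) for i in range(100_000)]

exact = statistics.fmean(population)             # full scan: exact answer
mean, low, high = estimate_mean(population, 1_000)   # reads only 1% of the rows
```

The estimate comes with an interval rather than a single exact number, which is acceptable for decision support but would not be acceptable for, say, an account balance in an OLTP system.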
Business Performance Management: Closing the Loop
[Figure: loop between the Operational Business Process and the Business Steering Process with the steps Do, Measure, Interpret, Decide. Further labels: Information Supply, Information Integration, Information Demand; integrate, aggregate, historize; plan, break-down.]
BPM: The Role of Measures (example)
[Figure: loop between the Operational Business Process and the Business Steering Process (Information Supply, Information Integration, Information Demand), annotated with OLTP and OLAP and with the following measures:]
– Basic Measures, e.g., Premium, Loss, Cost
– Basic + Derived Measures, e.g. Comb Ratio, VaR
– Assumptions, e.g. future cedant performance, interest rates
– Target Measures, e.g. CombRatio, …
– “Breakdown” of Target Measures
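How a derived measure follows from basic measures can be shown with the combined ratio. The slide does not give a formula, so the sketch below assumes the common definition (losses plus costs, divided by premium); the function name and the figures are hypothetical.

```python
def combined_ratio(premium: float, loss: float, cost: float) -> float:
    """Derived measure, assuming the common definition (loss + cost) / premium."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return (loss + cost) / premium


# Basic measures as they might be aggregated from operational data.
actual = combined_ratio(premium=200.0, loss=120.0, cost=60.0)

# A target measure set by the steering process, to compare against.
target = 0.95
on_target = actual <= target
```

A value below 1 means that premiums exceed losses and costs under this definition.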