TRANSCRIPT
The Data Warehouse
This is where the data lives
2
Agenda
What is a Warehouse?
Data Warehouse Architecture
Data Storage
Data Transformation
DBMS in Data Warehousing
Building a Successful Data Warehouse
WHAT IS A
DATA WAREHOUSE?
4
Peter Drucker believes ...
The two most important people in the 21st Century will be the CFO (managing the Cash Flow) and the CIO (managing the Information Flow)
5
Data, data everywhere, yet ...
I can't find the data I need: data is scattered over the network; many versions, subtle differences
I can't get the data I need: need an expert to get the data
I can't understand the data I found: available data poorly documented
I can't use the data I found: results are unexpected; data needs to be transformed from one form to another
6
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Why Data Warehousing?
7
What are the users saying...
Data should be integrated across the enterprise
Summary data had a real value to the organization
Historical data held the key to understanding data over time
What-if capabilities are required
8
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Data
Information
9
What is a Data Warehouse?
A single, complete and consistent store of data, obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
10
Data Warehousing -- It is a process
A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible
A decision support database maintained separately from the organization's operational database
11
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse, 1996
12
Traditional RDBMS used for OLTP
Database systems have been used traditionally for OLTP: clerical data processing tasks; detailed, up-to-date data; structured repetitive tasks; read/update a few records; isolation, recovery and integrity are critical
Will call these operational systems
13
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers
14
Examples of Operational Data
| Data | Industry | Usage | Technology | Volumes |
| --- | --- | --- | --- | --- |
| Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium |
| Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large |
| Point-of-Sale data | Retail | Generate bills, manage stock | Client/server, relational databases | Very large |
| Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very large |
| Production Record | Manufacturing | Control production | New applications, relational databases, AS/400 | Medium |
15
Application-Orientation vs. Subject-Orientation
Application-Orientation: the operational database is organized around applications -- loans, credit card, trust, savings
Subject-Orientation: the data warehouse is organized around subjects -- customer, vendor, product, activity
16
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
e.g., average amount spent on phone calls between 9AM-5PM in California during the month of December
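The example query above can be sketched in SQL. A minimal, self-contained sketch using Python's sqlite3 with an invented `calls` fact table (all names and data are illustrative assumptions, not from the slides):

```python
import sqlite3

# Hypothetical calls fact table; schema and data are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE calls (state TEXT, month TEXT, hour INTEGER, amount REAL)")
con.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", [
    ("CA", "Dec", 10, 5.0),
    ("CA", "Dec", 14, 7.0),
    ("CA", "Dec", 20, 9.0),   # outside 9AM-5PM, excluded
    ("NY", "Dec", 11, 4.0),   # wrong state, excluded
])
# Average amount spent on calls between 9AM and 5PM in California in December
avg = con.execute("""
    SELECT AVG(amount) FROM calls
    WHERE state = 'CA' AND month = 'Dec' AND hour BETWEEN 9 AND 17
""").fetchone()[0]
print(avg)  # 6.0
```

Such a query slices the fact data along several dimensions (geography, time of day, month) at once, which is exactly the multidimensional access pattern the slide describes.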
17
OLTP vs. Data Warehouse
Complex Data Warehouse queries would degrade performance of operational DBMS
Data Warehouse requires historical data; not typically maintained by operational databases
Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBMS, external sources, legacy systems
Different sources typically use different representations, code and format which have to be reconciled
18
OLTP vs Data Warehouse
| OLTP | Data Warehouse (DSS) |
| --- | --- |
| Application oriented | Subject oriented |
| Used to run the business | Used to analyze the business |
| Detailed data | Summarized and refined |
| Current, up to date | Snapshot data |
| Isolated data | Integrated data |
| Repetitive access | Ad-hoc access |
| Clerical user | Knowledge user (manager) |
19
OLTP vs Data Warehouse
| OLTP | Data Warehouse |
| --- | --- |
| Performance sensitive | Performance relaxed |
| Few records accessed at a time (tens) | Large volumes accessed at a time (millions) |
| Read/update access | Mostly read (batch update) |
| No data redundancy | Redundancy present |
| Database size 100 MB - 100 GB | Database size 100 GB - a few terabytes |
20
OLTP vs Data Warehouse
| OLTP | Data Warehouse |
| --- | --- |
| Transaction throughput is the performance metric | Query throughput is the performance metric |
| Thousands of users | Hundreds of users |
| Managed in entirety | Managed by subsets |
21
Criteria for Selecting a Warehouse
load/index time
query response time
database size requirements/limitations
quality
ratio of raw data size to full database size (including indices, temp space, etc.)
parallel capabilities
price
company DBMS standardization policy
DATA WAREHOUSE
ARCHITECTURE
23
Data Warehouse Architecture - Process View
[Diagram: relational databases, legacy data, and purchased data feed extraction/cleansing and an optimized loader into the data warehouse engine; analyze/query tools read from it; a metadata repository describes the flow.]
24
Data Warehouse Architecture - User View
[Diagram: getting data in (IT users) -- operational data stores, data transformation, enterprise warehouse management, replication & propagation; the heart of the data warehouse -- data marts / departmental warehouses; getting information out (business users) -- knowledge discovery/data mining and information access tools.]
25
Heart of the Data Warehouse
Heart of the data warehouse is the data itself!
Single version of the truth
Corporate memory
Data is organized in a way that represents the business -- subject orientation
26
Data Warehouse Structure
Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables
27
Data Warehouse Structure
base customer (1985-87): custid, from date, to date, name, phone, dob
base customer (1988-90): custid, from date, to date, name, credit rating, employer
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
customer activity detail (1990-91): custid, activity date, amount, line item no, order no
Time is part of the key of each table
28
Data Warehouse Structure
[Diagram: Base Customer (1985-87), Base Customer (1988-90), Cust Activity (1986-89), Cust Activity Detail (1987-89), Cust Activity Detail (1990-91) stored across different media.]
Subject data may contain different data on different media
Can also use optical disks
29
Schema Design
Database organization must look like the business: recognizable and approachable by the business user, and simple
Schema types: Star Schema, Fact Constellation Schema, Snowflake Schema
30
Dimension Tables
Dimension tables:
Define the business in terms already familiar to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to the fact table by a foreign key
Heavily indexed
Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.
31
Fact Table
Central table
Mostly raw numeric items
Narrow rows, a few columns at most
Large number of rows (millions to a billion)
Access via dimensions
32
Star Schema
A single fact table and for each dimension one dimension table
Does not capture hierarchies directly
[Diagram: star schema -- time, product, customer, and city dimension tables joined to a central fact table keyed by date, custno, prodno, cityname, ...]
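A minimal sketch of a star schema and a typical query against it, using Python's sqlite3; the table and column names here are invented for illustration:

```python
import sqlite3

# Star-schema sketch: one fact table joined to dimension tables by
# foreign keys. Table and column names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim (date_key INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE prod_dim (prod_key INTEGER PRIMARY KEY, prod_name TEXT);
    CREATE TABLE fact (date_key INTEGER, prod_key INTEGER, amount REAL);
    INSERT INTO time_dim VALUES (1, 'Jan'), (2, 'Feb');
    INSERT INTO prod_dim VALUES (10, 'widget'), (11, 'gadget');
    INSERT INTO fact VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 75.0);
""")
# Typical star query: constrain the dimensions, aggregate the fact
rows = con.execute("""
    SELECT p.prod_name, SUM(f.amount)
    FROM fact f
    JOIN time_dim t ON f.date_key = t.date_key
    JOIN prod_dim p ON f.prod_key = p.prod_key
    WHERE t.month = 'Jan'
    GROUP BY p.prod_name
    ORDER BY p.prod_name
""").fetchall()
print(rows)  # [('gadget', 50.0), ('widget', 100.0)]
```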
33
Snowflake schema
Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage
[Diagram: snowflake schema -- the same star with the city dimension normalized into a separate region table.]
34
Fact Constellation
Fact Constellation: multiple fact tables that share many dimension tables
Booking and Checkout may share many dimension tables in the hotel industry
[Diagram: Booking and Checkout fact tables sharing the Hotels, Travel Agents, Promotion, Room Type, and Customer dimensions.]
DATA STORAGE
IN DATA WAREHOUSE
36
Data Granularity in Warehouse
Storing summarized data:
reduces storage costs
reduces CPU usage
increases performance, since a smaller number of records has to be processed
design around traditional high-level reporting needs
tradeoff with the volume of data to be stored and detailed usage of the data
37
Granularity in Warehouse
Cannot answer some questions with summarized data
Did Ashish call Vivek last month? Not possible to answer if only the total duration of calls by Ashish over a month is maintained and individual call details are not.
Detailed data is too voluminous
38
Granularity in Warehouse
The tradeoff is to have dual levels of granularity
Store summary data on disks: 95% of DSS processing is done against this data
Store detail on tapes: 5% of DSS processing is against this data
39
Estimates of Data Volume
To determine the appropriate level (dual or single) of granularity, we need to estimate the disk space requirements
For each known table:
Get upper and lower bounds on the size of a row
Estimate the max and min number of rows in the table for the 1-year horizon and the 5-year horizon
Calculate space for indexes for the max and min number of rows
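The estimation procedure above can be sketched as a small calculation. The row sizes, row counts, and 30% index overhead below are illustrative assumptions, not figures from the slides:

```python
# Rough sizing sketch following the procedure above: bound row sizes,
# bound row counts over a horizon, add index overhead.
def estimate_space(row_min, row_max, rows_min, rows_max, index_factor=0.3):
    """Return (lower, upper) byte estimates including index space."""
    low = row_min * rows_min * (1 + index_factor)
    high = row_max * rows_max * (1 + index_factor)
    return low, high

# Hypothetical table: 100-200 byte rows, 1M-10M rows on the horizon
low, high = estimate_space(row_min=100, row_max=200,
                           rows_min=1_000_000, rows_max=10_000_000)
print(low, high)
```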
40
Dual level of granularity
| # Rows (1-year horizon) | Design implication |
| --- | --- |
| 10,000 | Any design will do |
| 100,000 | Careful design |
| 1,000,000 | Dual levels of granularity |
| 10,000,000 | Dual levels of granularity and careful design |

| # Rows (5-year horizon) | Design implication |
| --- | --- |
| 100,000 | Any design will do |
| 1,000,000 | Careful design |
| 10,000,000 | Dual levels of granularity |
| 20,000,000 | Dual levels of granularity and careful design |
41
Dual Level of Granularity
On the five-year horizon, the thresholds shift by an order of magnitude. Reasons:
More expertise will be available in managing the data warehouse volumes of data
Hardware costs will have dropped
More powerful software tools will be available
End users will be more sophisticated
The actual record size is not that important, since the size of the indexes determines the above
42
What should be granularity level?
The starting point for deciding the level of granularity is based on the previous estimates and some educated guesses
The initial guess is refined through an iterative analysis
[Diagram: the data warehouse developer iterates through design and populate steps; DSS analysts produce reports and analysis that feed back into the design.]
43
How to control granularity?
Summarize data from the source as it goes into the target
Average data as it goes into the target
Push highest/lowest set values into the target
Push only data that is needed at the target
Push only a subset of rows, based on some conditions
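The controls listed above can be sketched over a toy list of call records (field names and data are assumptions for illustration):

```python
# Toy call records; in practice these would stream from the source system.
calls = [
    {"cust": "A", "minutes": 10},
    {"cust": "A", "minutes": 30},
    {"cust": "B", "minutes": 5},
]

# 1. Summarize data as it goes into the target
total = {}
for c in calls:
    total[c["cust"]] = total.get(c["cust"], 0) + c["minutes"]

# 2. Average data as it goes into the target
counts = {}
for c in calls:
    counts[c["cust"]] = counts.get(c["cust"], 0) + 1
average = {k: total[k] / counts[k] for k in total}

# 3. Push only the highest value per customer into the target
highest = {}
for c in calls:
    highest[c["cust"]] = max(highest.get(c["cust"], 0), c["minutes"])

# 4. Push only a subset of rows based on a condition
long_calls = [c for c in calls if c["minutes"] >= 10]

print(total, average, highest, len(long_calls))
```

Each strategy reduces the volume stored in the target at the cost of some detail, which is the granularity tradeoff discussed above.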
44
Levels of Granularity
Banking example:
Operational (60 days of activity): account, activity date, amount, teller, location, account balance
Lightly summarized: account, month, # trans, withdrawals, deposits, average balance
Archival (monthly account register -- up to 10 years): activity date, amount, account balance; not all fields need be archived
45
Partitioning
Breaking data into several physical units that can be handled separately
Not a question of whether to do it in data warehouses but how to do it
Granularity and partitioning are key to effective implementation of a warehouse
46
Why Partitioning?
Flexibility in managing data
Smaller physical units allow:
easy restructuring
free indexing
sequential scans if needed
easy reorganization
easy recovery
easy monitoring
47
Criterion for Partitioning
Typically partitioned by date line of business geography organizational unit any combination of above
48
Partitioning Example
An insurance company may partition its data as follows: 1995 medical claims 1995 life claims 1996 medical claims 1996 life claims
49
Where to Partition?
Application level or DBMS level?
It makes sense to partition at the application level:
Allows a different definition for each year -- important since the warehouse spans many years and the definition changes as the business evolves
Allows data to be moved between processing complexes easily
50
Denormalization
Normalization in a data warehouse may lead to lots of small tables
Can lead to excessive I/O’s since many tables have to be accessed
Denormalization is the answer especially since updates are rare
51
Denormalization
Create Arrays
Selective Redundancy
Derived Data
52
Creating Arrays
Many times, each occurrence of a sequence of data is in a different physical location
It is beneficial to collect all occurrences together and store them as an array in a single row
Makes sense only if there is a stable number of occurrences which are accessed together
In a data warehouse, such situations arise naturally due to the time-based orientation: can create an array by month
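The "array by month" idea can be sketched as follows; the account and balance fields are invented for illustration:

```python
# Twelve monthly balance rows per account, collapsed into one row
# holding a 12-element array. Field names are assumptions.
monthly_rows = [
    {"acct": 1, "month": m, "balance": 100 + m} for m in range(1, 13)
]

collapsed = {}
for r in monthly_rows:
    collapsed.setdefault(r["acct"], [None] * 12)
    collapsed[r["acct"]][r["month"] - 1] = r["balance"]

print(collapsed[1])  # twelve balances in month order, one row per account
```

One row read now retrieves all twelve occurrences that would otherwise need twelve physical accesses, which is the point of the technique.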
53
Selective Redundancy
Description of an item can be stored redundantly with order table -- most often item description is also accessed with order table
Updates have to be handled carefully
54
Vertical Partitioning
[Diagram: a wide account table (acctno, balance, address, date opened, ...) split vertically into a frequently accessed table (acctno, balance) and a rarely accessed table (acctno, address, date opened, ...). The smaller table means less I/O.]
55
Derived Data
Introduction of derived (calculated data) may often help
Have seen this in the context of dual levels of granularity
Can keep auxiliary views and indexes to speed up query processing
DATA TRANSFORMATION
57
Loading the Warehouse
Load is a crucial component for the success of a warehouse project
Issues:
Sources of data for the warehouse
Data quality at the sources
Data transformation
How to propagate updates (on the sources) to the warehouse
Terabytes of data to be loaded
58
Source Data
Typically host-based, legacy applications: customized applications, COBOL, 3GL, 4GL
Point of contact devices: POS, ATM, call switches
External sources: vendors, delivery partners, agents
[Diagram: operational/source data -- sequential, legacy, relational, external.]
59
Data Quality - The Reality
It is tempting to think that all there is to creating a data warehouse is extracting operational data and entering it into the warehouse
Nothing could be farther from the truth
Warehouse data comes from disparate, questionable sources
60
Data Quality - The Reality
Legacy systems no longer documented
Outside sources with questionable quality procedures
Production systems with no built-in integrity checks and no integration
Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
"... And get it done quickly, we do not have time to worry about corporate standards..."
61
Data Transformation
Data transformation is the foundation for achieving single version of the truth
Major concern for IT
The data warehouse can fail if an appropriate data transformation strategy is not developed
[Diagram: operational/source data (sequential, legacy, relational, external) passes through data transformation: accessing, capturing, extracting, householding, filtering, reconciling, conditioning, loading, validating, scoring.]
62
Data Integration Across Sources
[Diagram: Trust, Credit Card, Savings, and Loans systems feeding the warehouse, illustrating: same data, different name; different data, same name; data found in one source and nowhere else; different keys for the same data.]
63
Data Transformation Example
Field: appl A -- balance; appl B -- bal; appl C -- currbal; appl D -- balcurr
Unit: appl A -- pipeline -- cm; appl B -- pipeline -- in; appl C -- pipeline -- feet; appl D -- pipeline -- yds
Encoding: appl A -- m, f; appl B -- 1, 0; appl C -- x, y; appl D -- male, female
[Diagram: all four applications feed the Data Warehouse with one reconciled representation.]
64
Data Integrity Problems
Same person, different spellings: Rajiv, Rajeev, Raju...
Multiple ways to denote a company name: Novatek Systems, NISL, Novatek Pvt. Ltd.
Use of different names: Mumbai/Bombay, Bill/William
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale: manual entry leads to mistakes ("in case of a problem use 9999999")
65
Data Transformation Terms
Extracting
Conditioning
Scrubbing
Merging
Householding
Enrichment
Scoring
Loading
Validating
Delta Updating
66
Data Transformation Terms
Extracting: capture of data from the operational source in "as is" status. Sources are generally legacy mainframes in VSAM, IMS, IDMS, DB2; more data today is in relational databases on Unix
Conditioning: the conversion of data types from the source to the target data store (warehouse) -- always a relational database
67
Data Transformation Terms
Scrubbing: ensuring all data meets the input validation rules which should have been in place when the data was captured by the operational system, e.g., null values for data declared not null, numerics in non-numeric fields, proper zip codes, etc.
Merging: bringing together data from operational sources; choosing information from each functional system to populate the single occurrence of the data item in the warehouse
68
Data Transformation Terms
Householding: identifying all members of a household (living at the same address)
Ensures only one mailing is sent to a household
Can result in substantial savings: 1 million catalogues at Rs. 50 each cost Rs. 50 million; a 2% saving would save Rs. 1 million
69
Data Transformation Terms
Enrichment: bringing data from external sources to augment/enrich operational data (e.g., foreign currency fluctuations over a period). Data sources include newspaper groups, consulting firms, etc.
Scoring: computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or that a customer is likely to buy a new product
70
Data Transformation Terms
Loading: placing data into the warehouse -- accomplished using a load utility provided by database vendors
Validating: the process of ensuring that the captured data is accurate and the transformation process is correct
71
Data Transformation Terms
Delta Updating: propagation of changes to the source since the last extraction; loads smaller subsets into the data warehouse
Metadata: the data dictionary for the warehouse
72
Loads
After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse
Issues:
huge volumes of data to be loaded
small time window available when the warehouse can be taken offline (usually nights)
when to build indexes and summary tables
allow system administrators to monitor, cancel, resume, and change load rates
recover gracefully -- restart after failure from where you were, without loss of data integrity
73
Load Techniques
Use SQL to append or insert new data: the record-at-a-time interface will lead to random disk I/Os
Use batch load utility
74
Batch Load Utility
Sort input records on the clustering key: sequential I/O is significantly faster than random I/O
Single-pass load: perform all transformations, scrub, clean, validate, aggregate etc., and build indexes and create summary tables at the same time
Sequential loads may take long (loading a TB warehouse may take ~100 days)
Exploit I/O parallelism to load data at acceptable rates
Leverage knowledge of data warehouse schemas
75
Load Taxonomy
Incremental versus full loads
Online versus offline loads
76
Incremental Load
A full load is too disruptive and not required if the updates since the last load can be identified easily
Can use an incremental load to reduce the data actually loaded: insert only the updated tuples
77
Online Load
Online full loads can load a new table while queries on the old table continue, if there is enough disk space
Online incremental loads conflict with queries:
break into shorter transactions (every 1000 records or every so many seconds)
coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices
78
Refresh
Propagate updates on source data to the warehouse
Issues: when to refresh how to refresh -- refresh techniques
79
When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
80
Refresh Techniques
Full extract from base tables: read the entire source table -- too expensive, but maybe the only choice for legacy systems
81
Refresh techniques
Incremental techniques:
detect changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator), snapshots (Oracle), transaction shipping (Sybase)
compute changes to derived and summary tables
maintain transactional correctness for incremental load
82
How To Detect Changes
Create a snapshot log table to record the ids of updated rows of source data and a timestamp
Detect changes by:
defining AFTER ROW triggers to update the snapshot log when the source table changes
using the regular transaction log to detect changes to the source data
DBMS
FOR DATA WAREHOUSING
84
Relational DBMS
Features that support DSS:
specialized indexing techniques
specialized join and scan methods
data partitioning and use of parallelism
complex query processing
intelligent aggregate processing
extensions to SQL and their processing
85
Indexing Techniques
Bitmap index: a collection of bitmaps -- one for each distinct value of the column
Each bitmap has N bits, where N is the number of rows in the table
The bit corresponding to value v for row r is set if and only if r has the value v for the indexed attribute
86
Bitmap Index
Example query: select * from customer where gender = 'F' and vote = 'Y'
[Figure: a customer table with a gender column (M, F, F, F, F, M) and a vote column (Y, Y, Y, N, N, N); one bitmap per distinct value, and the query is answered by ANDing the gender = 'F' and vote = 'Y' bitmaps.]
87
Bitmap Indexing
Bit arithmetic (AND/OR) is fast
Space occupied depends on the cardinality of the domains: not good if the indexed attribute has too many distinct values
Can be compressed effectively (e.g., using run-length encoding)
Products that support bitmaps: Model 204, TargetIndex (Red Brick), IQ (Sybase), Oracle 7.3
88
Join Indexes
Pre-computed joins
A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
e.g., a join index on the city dimension of the calls fact table correlates, for each city, the calls (in the calls table) that originated from that city
89
Join Indexes
Join indexes can also span multiple dimension tables, e.g., a join index on the city and time dimensions of the calls fact table
90
Star Join Processing
Use join indexes to join dimension and fact table
[Diagram: the Calls fact table joined successively with the Time, Location, and Plan dimensions: Calls, C+T, C+T+L, C+T+L+P.]
91
Optimized Star Join Processing
[Diagram: apply selections to the Time, Location, and Plan dimensions first, form their virtual cross product, then join it with the Calls fact table.]
92
Bitmapped Join Processing
[Diagram: selections on Time, Location, and Plan each yield a bitmap over Calls (e.g., 101, 001, 110); the bitmaps are ANDed to find the qualifying fact rows.]
93
Intelligent Scan
Piggyback multiple scans of a relation (Red Brick); piggybacking is also done if a second scan starts a little while after the first scan
94
Parallel Query Processing
Three forms of parallelism: independent, pipelined, partitioned (and "partition and replicate")
Deterrents to parallelism: startup and communication costs
95
Parallel Query Processing
Partitioned data: parallel scans yield I/O parallelism
Parallel algorithms for relational operators: joins, aggregates, sort
Parallel utilities: load, archive, update, parse, checkpoint, recovery
Parallel query optimization
96
Pre-computed Aggregates
Keep aggregated data for efficiency (pre-computed queries)
Questions:
Which aggregates to compute?
How to update aggregates?
How to use pre-computed aggregates in queries?
97
Pre-computed Aggregates
The aggregated table can be maintained by the warehouse server, the middle tier, or client applications
Pre-computed aggregates are a special case of materialized views -- the same questions and issues remain
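Incremental maintenance of a pre-computed aggregate can be sketched as follows (the structure and names are illustrative, not any product's API):

```python
# Keep a pre-computed aggregate in step with the fact data: on each
# insert, update the summary incrementally instead of rescanning.
fact = []
monthly_total = {}  # the materialized aggregate

def insert_sale(month, amount):
    fact.append((month, amount))
    monthly_total[month] = monthly_total.get(month, 0) + amount

insert_sale("Jan", 100)
insert_sale("Jan", 50)
insert_sale("Feb", 70)

# Queries read the small aggregate, not the large fact table
print(monthly_total["Jan"])  # 150
```

This works because SUM is distributive; aggregates like MEDIAN cannot be maintained this cheaply, which is part of the "which aggregates to compute" question above.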
98
SQL Extensions
Extended family of aggregate functions:
rank (top 10 customers)
percentile (top 30% of customers)
median, mode
Object-relational systems allow the addition of new aggregate functions
99
SQL Extensions
Reporting features: running totals, cumulative totals
Cube operator: group by on all subsets of a set of attributes (month, city); redundant scanning and sorting of the data can be avoided
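Running totals as described above can be expressed with a window function; this sketch uses Python's sqlite3 and assumes SQLite 3.25 or later (where window functions were added):

```python
import sqlite3

# Running-total sketch using a SQL window function over a toy sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 5.0)])
rows = con.execute("""
    SELECT day, SUM(amount) OVER (ORDER BY day) AS running_total
    FROM sales ORDER BY day
""").fetchall()
print(rows)  # [(1, 10.0), (2, 30.0), (3, 35.0)]
```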
100
Technological Requirements
Managing large amounts of data
Managing multiple media -- storage hierarchy: cache (L1 and L2), main memory, disks, optical disks, tapes, fiche
101
Technological Requirements
Ability to index data at will: temporary indices, sparse indices
Ability to monitor data freely and easily:
to determine whether reorganization is required
to determine if an index is poorly structured
to determine the statistical composition of the data
Need to interface with many technologies, for both receiving and passing data
102
Technological Requirements
Programmer/designer control of data
Parallel storage/management of data
Good metadata management
Load the warehouse efficiently
Use indexes efficiently
Compaction of data
103
Technological Requirements
Compound keys
Variable-length data
Lock management: need to be able to turn the lock manager on and off
Index-only processing
104
Warehouse Server Products
Oracle 8
Informix: Online Dynamic Server for SMP, Extended Parallel Server for MPP, Universal Server for object-relational applications
Sybase: Adaptive Server 11.5, Sybase MPP, Sybase IQ
105
Warehouse Server Products
MS SQL Server, OLAP Server
Red Brick Warehouse
Tandem NonStop
IBM: DB2 MVS, Universal Server, DB2/400
Teradata
106
Server Scalability
Scalability is the #1 IT requirement for data warehousing
Hardware platform options: SMP, clusters (shared disk), MPP -- loosely coupled (shared nothing), hybrid
107
SMP Characteristics
SMP -- Symmetric multi processing -- shared everything
Multiple CPUs share same memory
Workload is balanced across CPUs by OS
Scalability is limited to bandwidth of internal bus and OS architecture
Not tolerant to failure in processing node
Architecture is mostly invisible to applications
108
SMP Benefits
Lower entry point -- can start with SMP
Mature technology
109
MPP Characteristics
Each node owns a portion of the database
Nodes are connected via an interconnection network
Each node can be a single CPU or an SMP
Load balancing is done by the application
High scalability due to local processing isolation
110
MPP benefits
High availability
High scalability
111
Sizing your system
Estimate the total volume of data and total disk throughput required
Determine the number of controllers and disks required
Determine CPU and memory based on the workload
112
Other Warehouse Related Products
Connectivity to sources: Apertus, Information Builders EDA/SQL, Platinum Infohub, SAS Connect, IBM Data Joiner, Oracle Open Connect, Informix Express Gateway
113
Other Warehouse Related Products
Data extract, clean, transform, refresh: CA-Ingres Replicator, Carleton Passport, Prism Warehouse Manager, SAS Access, Sybase Replication Server, Platinum InfoRefiner, InfoPump
114
Other warehouse related products
Multidimensional database engines: Arbor Essbase, Oracle IRI Express, SAS System
ROLAP servers: HP Intelligent Warehouse, Informix MetaCube, MicroStrategy DSS Server
115
Other Warehouse Related Products
Query/reporting environments: Brio/Query, Cognos Impromptu, Informix ViewPoint, CA Visual Express, Business Objects, Platinum Forest and Trees
116
Other Warehouse Related Products
Multidimensional analysis: Andyne Pablo, Arbor Essbase Analysis Server, Cognos PowerPlay, Holistic Systems (HOLOS), MicroStrategy DSS, SAS OLAP++
BUILDING A SUCCESSFUL
DATA WAREHOUSE
118
From The Standish Group
TSG: How many Data Warehouses do you have?
Data Warehouser: We have had eight.
TSG: To what do you attribute so many warehouses?
Data Warehouser: Seven mistakes ...
119
For a Successful Warehouse
From day one establish that warehousing is a joint user/builder project
Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
Train the users one step at a time
Consider doing a high-level corporate data model in no more than three weeks
From Larry Greenfield, http://pwp.starnetinc.com/larryg/index.html
120
For a Successful Warehouse
Look closely at the data extracting, cleaning, and loading tools
Implement a user accessible automated directory to information stored in the warehouse
Determine a plan to test the integrity of the data in the warehouse
From the start get warehouse users in the habit of 'testing' complex queries
121
For a Successful Warehouse
Coordinate system roll-out with network administration personnel
When in a bind, ask others who have done the same thing for advice
Be on the lookout for small, but strategic, projects
Market and sell your data warehousing systems
122
Data Warehouse Pitfalls
You are going to spend much time extracting, cleaning, and loading data
Despite best efforts at project management, data warehousing project scope will increase
You are going to find problems with systems feeding the data warehouse
You will find the need to store data not being captured by any existing system
You will need to validate data not being validated by transaction processing systems
123
Data Warehouse Pitfalls
Some transaction processing systems feeding the warehousing system will not contain detail
Many warehouse end users will be trained and never or seldom apply their training
After end users receive query and report tools, requests for IS written reports may increase
Your warehouse users will develop conflicting business rules
Large scale data warehousing can become an exercise in data homogenizing
124
Data Warehouse Pitfalls
'Overhead' can eat up great amounts of disk space
The time it takes to load the warehouse will expand to the amount of time in the available window... and then some
Assigning security cannot be done with a transaction-processing-system mindset
You are building a HIGH maintenance system
You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues, and of an understanding of what adds value to the customer
THANK YOU