CIS 465 - Data Warehousing
Version 7.0 - 11/28/2000
Can your database answer questions like these?
• What is the cost of staff to break into a new line of business?
• What are the travel routes of my competition’s inventory?
• At what velocity is my competitor moving toward a common goal?
• How will a transaction on a certain date be affected by currency exchange rates?
• Is a foreign labor source likely to produce a higher quality product?
• Which 20% of the problem creates 80% of the problems?
Can your database answer questions like these?
• By product and location, how can we regain a lost customer base?
• Which skill and staff levels are most likely to accept the voluntary layoff package?
Data is Difficult to Manage
• Amount of data increases exponentially; past data must be kept for long times; new data are added rapidly.
• Data are scattered throughout organizations; collected by many individuals; using different methods and devices.
• Only small portions of an organization’s data are relevant for specific decisions.
• An ever-increasing amount of external data needs to be considered in making organizational decisions.
Data is Difficult to Manage - contd.
• Raw data may be stored in different computing systems, formats, and human and computer languages.
• Legal requirements relating to data differ among countries and change frequently.
• Selecting data management tools can be difficult because of the large number of tools available.
• Data security, quality, and integrity are critical, yet easily jeopardized.
Data and Knowledge Management
• Businesses do not run on data. They run on information and their knowledge of how to put that information to use successfully.
Some Information Concepts
• Data: Unorganized facts and figures. (raw material)
• Information: Data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in current or prospective actions or decisions.
• Information:
– adds to a representation
– corrects or confirms previous information
– has “surprise” value in that it tells us something we did not know, or could not predict.
– What is a “finished product” to one may be “raw materials” to someone else.
Definitions: Information vs. Knowledge
• Knowledge: a combination of instincts, ideas, rules, and procedures that guide actions and decisions.
• Knowledge consists of data items that are processed to convey understanding, experience, accumulated learning, and expertise as they apply to a current problem or situation.
• Helping to provide the best available knowledge to decision makers is another role of information systems.
Relationship Between Data, Information, and Knowledge
• The difference between data and information is easy to remember.
• It is often cited as the reason why systems that collect large amounts of data fail to meet management’s information needs.
• There are many methods of converting data into information for decision making.
• Managers take action based on information about a current situation plus their accumulated knowledge. Actions taken feed the process of accumulating more knowledge (experience).
• Example: How do medical students become competent physicians?
Attributes of Quality Information (reference: Alter, Chapter 4)
• Timeliness
• Completeness
• Conciseness
• Relevance
• Accuracy
• Precision
• Appropriateness of Form
Special Characteristics of Information
• Usefulness - depends on combination of quality, accessibility, and presentation.
• One person’s information may be another person’s noise.
• Soft data may be as important as hard data.
• Ownership of information may be hard to maintain.
• More information is not always better (information overload).
• Politics can often hide or distort information.
Sources of Data
• Internal Data
– Data about people, products, services, processes.
– Often stored in corporate databases (e.g. Sales or HR).
– Some data may be disparate in different regions, but accessible by networks.
• Personal Data
– Individuals document expertise by creating personal data: subjective estimates of sales, opinions on competitors’ plans, interpretations of news or current events, etc. Some is kept in heads, or mental models.
– Can be stored on PCs, or available on the Web.
• External Data
– Many sources.
Some Sources of Business External Data
• Federal Publications
– Survey of Current Business
– Monthly Labor Review
– Federal Reserve Bulletin
– Employment and Earnings
– Commerce Business Daily
– Census Bureau
• Other
– International Monetary Fund
– Moody’s
– Standard & Poor’s
– Advertising Age
– Dialog and Lexis/Nexis
– ABI/Inform
– Annual Editor & Publisher Market Guide
– Thomas Register On-line
• Indexes
– Encyclopedia of Business Information Sources
What is a data warehouse?
• A data warehouse is a pool of data organized in a format that enables users to interpret data and convert it into useful information to gain knowledge from this interpretation.
• It is a single place that contains complete and consistent data from multiple sources.
• Data warehousing is the act of a business person extracting business value from the data stored in the data warehouse.
Collecting Raw Data
• Is not Easy
– collect in the field
– elicit from people
– collect manually, electronically, or by sensors.
• Data collection technology has not kept pace with advances of data storage technology.
• Data collection from external sources is not easy either.
• Bottom Line: Garbage IN, Garbage OUT - GIGO.
• Data Quality is an Important Issue.
Data Quality Issues
• Intrinsic Data Quality:
– accuracy, objectivity, believability, reputation.
• Accessibility Data Quality:
– accessibility and access security
• Contextual Data Quality:
– relevancy, value-added, timeliness, completeness, amount of data.
• Representation Data Quality:
– interpretability, ease of understanding, concise representation, consistent representation
Why Data Warehousing?
• Managers do not make decisions that are “good” or “bad”, they make decisions on the basis of good or bad information.
• Management information =
– a. the right information
– b. in the right form
– c. at the right time.
• Most transaction-based information systems have difficulty delivering this information.
Why Data Warehousing - 2
• Not the right information:
– data not easily accessible
– meaning is subtly (or significantly) different from the question context.
– Information is presented with too much or too little detail, covers the wrong time spans, or is in the wrong intervals.
Why Data Warehousing - 3
• Not the right time:
– Getting this information may require the efforts of highly skilled professionals who are not generally available at the whim of business managers.
– Data comes from a variety of different systems which are resident on a variety of different technology platforms.
Why Data Warehousing - 4
• Not the right format:
– If data is extracted, merged, and converted into meaningful information, often it is not in a usable format.
– Users will want it loadable into a particular PC tool or spreadsheet with which they are familiar.
– Printouts weighing 10 pounds are not in the right format.
– A diskette with a COBOL file description is not in the right format.
The Dilemma for Corporate IT
• How to control scarce IT resources consumed by insatiable user demand for ad-hoc reports.
• Each ad-hoc report generated by IT and analyzed by the user generates three more reports to further illuminate the insights gleaned in the first.
• Often the extract programs have few reusable components.
• The user is on a “voyage of discovery in a sea of data”.
The Response of Corporate IT
• New methodologies: Align the IT systems with the business goals and requirements.
• These techniques concentrate on business process requirements, not decision support requirements.
• Transaction systems must be rigorously specified in advance. They are an intersection between the organization and the customer.
• These systems should not be a “voyage of discovery” for either.
• We need a new type of system, other than the typical transaction processing system, to handle these requests for analytical management information.
Transaction Systems vs. Analytical Support Systems
• Transaction Systems:
– Insert an order for 300 baseballs.
– Update this passenger’s airline reservation.
– Close out accounts payable records for this vendor.
– What is the current checking account balance for this customer?
• Analytical Support Systems:
– Did the sales promotion last quarter do better than the same promotion last year?
– Is the five-day moving average for this security leading or trailing actual prices?
– Which product line sells best in middle-America and how does this correlate to demographic data?
Analytical Processing
• Analytical Processing today includes what in the past have been called:
– DSS (Decision Support Systems)
– EIS (Executive Information Systems)
– ESS (Executive Support Systems)
• It is an evolution of “End-User Computing”
• Placing strategic data access in the hands of decision makers aids productivity and enables them to be better decision makers.
Key Difference: OLTP vs. OLAP
• OLTP (On-line Transaction Processing): Processing specific functions well defined in advance (e.g. enter an order, debit an account, register for a course, transfer money from one account to another)
• OLAP (On-line Analytical Processing): providing flexibility for undetermined analysis, i.e. not specified in advance, ad-hoc.
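The contrast can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration: the orders table, its columns, and the function names enter_order and units_by are hypothetical and do not come from the course material. The first function is a predefined record-level operation (OLTP style); the second answers a grouping question that is chosen at query time (OLAP style).

```python
import sqlite3

# Hypothetical single-table schema, used only to contrast the two workloads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " product TEXT, region TEXT, quarter TEXT, units INTEGER)"
)

GROUPABLE = {"product", "region", "quarter"}


def enter_order(product, region, quarter, units):
    """OLTP-style operation: one predefined, record-level insert."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO orders (product, region, quarter, units)"
            " VALUES (?, ?, ?, ?)",
            (product, region, quarter, units),
        )
    return cur.lastrowid


def units_by(column):
    """OLAP-style question: total units grouped by a column chosen at query time."""
    if column not in GROUPABLE:  # whitelist, since the name is spliced into SQL
        raise ValueError(f"cannot group by {column!r}")
    rows = conn.execute(
        f"SELECT {column}, SUM(units) FROM orders GROUP BY {column}"
    )
    return dict(rows.fetchall())


enter_order("baseball", "East", "Q1", 300)
enter_order("baseball", "West", "Q1", 120)
enter_order("glove", "East", "Q2", 40)
print(units_by("region"))   # totals per region
print(units_by("product"))  # same data, different question
```

The insert touches one row and is fully specified in advance; the aggregate reads many rows and its shape depends on what the analyst asks.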
Data for Decision Support
• The data must be integrated - requires data from many separate internal corporate databases.
• The data must be enriched - through integration with other external data.
• The data must be available - and not constrained by machine resources.
Sources of Data
• Internal Data:
– Financial Systems
– Logistics Systems
– Sales Systems
– Production Systems
– Personnel Systems
– Billing Systems
– Information Systems
• External Data Needs:
– to recognize opportunities
– to detect threats
– to identify synergies
Sources of Data - 2
• External Data Categories
– Competitor Data
– Economic Data
– Industry Data
– Credit Data
– Commodity Data
– Econometric Data
– Psychometric Data
– Meteorological Data
– Demographic Data
– Sales & Marketing Data
Operational Control vs. Operational Strategy
• Data is a source not just of operational control, but of operational strategy.
• Operational strategy is an attempt to describe the need, in a competitive and turbulent market, to continually innovate and re-align strategy with time scales too short to be comprehended by strategic planning in the conventional corporate sense.
Comparison of Control and Strategy Data:
• Operational Data:
– short-lived, rapidly changing
– requires record-level access
– repetitive standard transactions and access patterns
– updated in real-time
– event-driven; process generates data
• Strategic Data:
– long-living, static
– data aggregated into sets (which is why warehouse data is friendly to RDBMS).
– ad-hoc queries with some periodic reporting
– updated periodically with mass loads
– data-driven; data governs process.
Information Requirements by Management Level (Source: Gorry and Scott Morton)
Characteristics of Information | Operational Control | Management Control | Strategic Planning
Source | Largely internal | | External
Scope | Well defined, narrow | | Very wide
Level of Aggregation | Detailed | | Aggregate
Time Horizon | Historical | | Future
Currency | Highly current | | Quite old
Required Accuracy | High | | Low
Frequency of Use | Very frequent | | Infrequent
Dimensional Modeling - I
• Dimensional Modeling gives us a way to visualize data.
• The CEO’s perspective:
– “We sell products in various markets, and we measure our performance over time.”
• From the data warehouse designer’s perspective, we hear three dimensions:
– We sell Products
– in various Markets
– and measure performance over time.
Dimensional Modeling - II
• Management may be interested in examining sales figures in a certain city by product, by time period, by salesperson, and by store.
• Three dimensions are easily represented in a cube.
• The more dimensions involved, the more difficult it is to represent in a single table or graph, or n-dimensional cube.
• The ability to add and modify the dimensions used in a table or graph is often known as “slicing and dicing” the data.
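As a rough illustration of slicing and dicing, the sketch below models a three-dimensional cube as a Python dictionary keyed by (product, market, month). The sales figures, the dimension names, and the helper functions slice_cube and roll_up are all hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict

DIMENSIONS = ("product", "market", "month")

# Hypothetical sales figures keyed by (product, market, month).
cube = {
    ("TV", "East", "Jan"): 100,
    ("TV", "West", "Jan"): 80,
    ("VCR", "East", "Jan"): 40,
    ("TV", "East", "Feb"): 120,
    ("VCR", "West", "Feb"): 60,
}


def slice_cube(cube, **fixed):
    """Keep only the cells whose coordinates match the fixed dimension values."""
    wanted = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        key: value
        for key, value in cube.items()
        if all(key[pos] == val for pos, val in wanted.items())
    }


def roll_up(cube, *keep):
    """Sum the measure over every dimension that is not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[pos] for pos in positions)] += value
    return dict(totals)


east_only = slice_cube(cube, market="East")   # "slice": fix one dimension
print(roll_up(east_only, "month"))            # "dice": regroup what remains
print(roll_up(cube, "product"))               # another view of the same data
```

Adding a fourth dimension such as salesperson would only lengthen the key tuple, which is one reason such data is easier to query than to draw.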
Dimensional Modeling - III
• 3-D + Spreadsheets
• Data can be organized the way managers like to see them, rather than the way that the system analysts do
• Different presentations of the same data can be arranged easily and quickly
Multidimensionality
• Examples of dimensions
– products, salespeople, market segments, business units, geographical locations, distribution channels, country, industry, and various measures of time.
• Examples of Facts or Measures:
– money, sales volume, head count, inventory, profit, actual vs. forecast.
• Examples of Time – A special Dimension:
– daily, weekly, monthly, quarterly, yearly
[Figure: Example of Different Dimensions of a Multidimensional Database - I. Labels shown: Sales, Margin; East, West; San Francisco, Los Angeles, Denver; Camera, TV, VCR, Audio; Actual, Budget; February, March.]
[Figure: Example of Different Dimensions of a Multidimensional Database - II. Labels shown: TV, VCR; Jan, Feb, Mar, Qtr 1; Sales, COGS, Margin, Total Expenses, Profit; Actual, Budget; East, West.]
[Figure: Example of Different Dimensions of a Multidimensional Database - III. Labels shown: TV, VCR; January, February, March, Qtr 1, April; Sales, Margin; Actual, Budget; East, West, South, Total.]
[Figure: Example of Different Dimensions of a Multidimensional Database - IV. Labels shown: East, West; January, February, March, Qtr 1, April; TV, VCR; Sales, Margin; Actual, Budget, Forecast, Variance.]
Typical Dimensional Model - I
• Also called the star join schema since the diagram looks like a star.
• One large central table with smaller attendant tables arranged in a radial pattern around the central table.
• The central table is the fact table and the other tables are dimension tables.
• The next slide models a simple business that sells products in a number of markets and measures performance over time.
Typical Dimensional Model - II
Sales Fact: time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost
Time Dimension: time_key, day-of-week, month, quarter, year, holiday_flag
Product Dimension: product_key, description, brand, category
Market (Store) Dimension: store_key, store_name, address, floor_plan_type
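One possible way to declare this star schema is sketched below with Python's standard sqlite3 module. The table names (time_dim, product_dim, store_dim, sales_fact), the column types, and the composite primary key are assumptions made for the sketch; the slide gives only the field names. The day-of-week field is written day_of_week so that it is a valid SQL identifier.

```python
import sqlite3

# Table names and column types are assumptions; field names follow the slide.
SCHEMA = """
CREATE TABLE time_dim (
    time_key     INTEGER PRIMARY KEY,
    day_of_week  TEXT,
    month        INTEGER,
    quarter      INTEGER,
    year         INTEGER,
    holiday_flag INTEGER
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    description  TEXT,
    brand        TEXT,
    category     TEXT
);
CREATE TABLE store_dim (
    store_key       INTEGER PRIMARY KEY,
    store_name      TEXT,
    address         TEXT,
    floor_plan_type TEXT
);
-- The fact table sits at the center and points at every dimension.
CREATE TABLE sales_fact (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key  INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key    INTEGER NOT NULL REFERENCES store_dim(store_key),
    dollars_sold REAL,
    units_sold   INTEGER,
    dollars_cost REAL,
    PRIMARY KEY (time_key, product_key, store_key)
);
"""


def build_star_schema(conn):
    """Create the three dimension tables and the central fact table."""
    conn.executescript(SCHEMA)
    return conn


conn = build_star_schema(sqlite3.connect(":memory:"))
```

The composite key on sales_fact reflects the idea that each fact row is identified by one member from each dimension.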
Typical Dimensional Model - III
• If the Sales Fact Table contains only daily item totals of all the products sold, we say this is the grain of the fact table.
• Each record represents the total sales of a particular product in a market on a particular day.
• Any combination of product, market, or day generates a different record in the fact table.
• In a large business, there would be a large number of records in the fact table. The fact table for a typical grocery store retailer with 500 stores each carrying 50,000 products on the shelves and measuring daily movement over two years could easily approach one billion rows.
• This is not a problem for industrial strength data warehousing servers.
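The row-count estimate can be checked with simple arithmetic. The sketch below computes the theoretical maximum for the grocery example and then the row count under an assumed fill rate; the 6% figure (about 3,000 of 50,000 products moving per store per day) is an illustrative assumption, not a number from the slide.

```python
STORES = 500
PRODUCTS_PER_STORE = 50_000
DAYS = 2 * 365  # two years of daily grain


def max_fact_rows(stores, products, days):
    """Upper bound: every product sells in every store on every day."""
    return stores * products * days


def rows_at_fill_rate(stores, products, days, fill_rate):
    """Rows actually stored if only a fraction of products move each day."""
    return round(max_fact_rows(stores, products, days) * fill_rate)


upper_bound = max_fact_rows(STORES, PRODUCTS_PER_STORE, DAYS)
# Assumed fill rate of 6%: a product with no sales on a day creates no row.
estimate = rows_at_fill_rate(STORES, PRODUCTS_PER_STORE, DAYS, 0.06)
print(f"{upper_bound:,} rows at most, about {estimate:,} at a 6% fill rate")
```

Under that assumption the table holds roughly 1.1 billion rows, which is consistent with the "approach one billion rows" estimate even though the theoretical maximum is far larger.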
Dimension Tables
• Dimension tables are where textual descriptions of the business are stored.
• Each textual description helps to describe a member of the dimension.
• Example: each member in the product dimension is a specific product. The product dimension database has many attributes to describe the product. A key role of the dimension table attribute is to serve as the source of constraints in a query.
Fact Table
• Fact Table is where numerical measurements of the business are stored.
• Each measurement is taken at the intersection of all the dimensions.
• The “best” facts are numeric, continuously valued and additive.
• Every query made against the fact table may use hundreds of thousands of individual records to construct an answer set.
Facts vs. Attributes
• Sometimes it may be unclear whether a numeric data field extracted from a production data source is a fact or an attribute.
• If the numeric data field is a measurement that varies continuously every time we sample it, it is generally a fact; otherwise, if it is a discretely valued description of something that is more or less constant, it is a dimension attribute.
• Example: A standard cost for a product seems like an attribute, but may be changed so often that it is more like a measured fact.
Example
Brand | Dollar Sales | Unit Sales
Axon | 780 | 263
Framis | 1044 | 509
Widget | 213 | 444
Zapper | 95 | 39
Example Query
• Find all product brands that were sold in the first quarter of 1995 and present the total dollar sales as well as the number of units.
• Brand is a collection of individual products.
• To construct:
– A. Drag attribute brand from product dimension. Place as Row Header.
– B. Drag Dollar Sales and Units Sold from the Fact Table, and place to the right of the Brand row header.
– C. Specify row constraint “1st Q 1995” on the quarter attribute in the Time Dimension Table.
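The three drag-and-drop steps correspond roughly to a join, a constraint, and a grouped sum. The sketch below expresses them as SQL run through Python's sqlite3 module. The table names, the trimmed column lists, and every data value are hypothetical; only the brand names Axon and Framis echo the example slide, and the totals here are not the slide's figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down, hypothetical tables with made-up rows.
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter INTEGER, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT);
CREATE TABLE sales_fact (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                         dollars_sold REAL, units_sold INTEGER);
INSERT INTO time_dim VALUES (1, 1, 1995), (2, 2, 1995);
INSERT INTO product_dim VALUES
    (10, 'Axon small', 'Axon'), (11, 'Axon large', 'Axon'), (20, 'Framis small', 'Framis');
INSERT INTO sales_fact VALUES
    (1, 10, 1, 30.0, 10),
    (1, 11, 1, 50.0, 8),
    (1, 20, 1, 20.0, 5),
    (2, 20, 1, 99.0, 9);
""")

BRAND_QUERY = """
SELECT p.brand,                 -- step A: brand as the row header
       SUM(f.dollars_sold),     -- step B: facts to the right of the header
       SUM(f.units_sold)
FROM sales_fact AS f
JOIN product_dim AS p ON p.product_key = f.product_key
JOIN time_dim AS t ON t.time_key = f.time_key
WHERE t.quarter = ? AND t.year = ?   -- step C: constraint on the time dimension
GROUP BY p.brand
ORDER BY p.brand
"""


def brand_totals(conn, quarter, year):
    """Total dollars and units per brand for one quarter."""
    return conn.execute(BRAND_QUERY, (quarter, year)).fetchall()


for brand, dollars, units in brand_totals(conn, 1, 1995):
    print(brand, dollars, units)
```

Because brand is a collection of individual products, the GROUP BY is what folds many product-level fact rows into one answer row per brand.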
Transaction Processing
• The Relational Model was full of promises for equal access to data.
• In the early 1980’s the relational model was a dream. Typical transaction rates were one per second.
• Today the SABRE system typically processes 4,000 transactions per second, with peak bursts of 13,000 per second.
• OLTP - (On-line Transaction Processing) The point is to get data “in” to the database.
Segregating Operational and Warehouse Data
• In the past, data administrators were constantly told to build data sharing, normalization, and non-redundant corporate databases.
• Early attempts at data warehousing tried to share the data with transaction-based systems. This resulted in LONG response times for complex queries.
• The idea today is to keep the two separate.
• Separate databases, and perhaps separate DBMS products and processor platforms are used.
• Controlled and practical redundancy is better than out-of-control theoretical purity.
Fundamental Obstacles With Traditional Systems
• Systems Integration - Disintegration grew slowly from islands of automation.
– ownership, planning, economic, organizational development issues all contribute.
• Hardware Architecture
• Inconsistent Data
• Data Pollution:
– Bad Application Design (semantic and syntactical differences).
– Ownership
– Data Entry Conventions
The Data Warehouse
• Active, tactical, and current events flow from the operational systems to the data warehouse to become static, strategic, and historical data.
• The data warehouse becomes a “middle ground” where a large number of disparate and incompatible “legacy systems” are tied to an equally diverse collection of end-user workstations.
• Legacy systems usually comprise a hodge-podge of assorted hardware, software, and operational systems accumulated over many decades, and are, by nature, incompatible with one another and unique to each organization.
Practical Facts About the Warehouse
• The chances are remote that any single vendor will be able to develop a product that can interface with all “legacy systems” painlessly and “seamlessly” and at the same time, combine data from modern platforms and data external to the organization.
• Instead warehouse product vendors develop specialized capabilities to work with various environments.
Components of a Data Warehouse - 1
• Acquisition - The first component handles acquisition of data from legacy systems and outside sources.
• Data is identified, copied, formatted, and prepared for loading into a warehouse.
• Vendors provide tools for extraction and preparation.
Components of a Data Warehouse - 2
• Storage Area - The second component is the storage area managed by relational databases, multi-dimensional databases, specialized hardware - symmetric multiprocessor (SMP) or massively parallel processors (MPP) machines - or by software.
• The storage component holds the data so that many different data mining, executive information, and decision support systems can make use of it effectively.
Components of a Data Warehouse - 3
• Access - The third component of the warehouse is the access area.
• Different end-user PCs and workstations draw data from the warehouse with the help of multi-dimensional analysis tools, neural networks, data discovery tools, or analysis tools.
• These “smart” data-mining tools are the driving force behind the data warehouse concept.
• What good is it to store all the information without some way to understand it in new and different ways?
Data Warehouse Access Tools
• Intelligent Agents and Agencies - tools work and think for user.
• Query Facilities and Managed Query environments.
• Statistical Analysis - One of the biggest surprises in the data warehousing marketplace is the resurgence of interest in traditional statistical analysis, and the concomitant return to popularity of products like SAS and SPSS.
Data Warehouse Access Tools - 2
• Data Discovery
– A large class of tools formerly classified as decision support, artificial intelligence and expert systems. They now make use of neural networks, fuzzy logic, decision trees, and other tools from advanced mathematics to allow a user to “sift” through massive amounts of raw data to “discover” new, interesting, insightful, and in many cases useful things about the organization, its operations, and its markets.
– There are many different data discovery tools/products on the market.
Data Warehouse Access Tools - 3
• OLAP - On-line Analytical Processing:
– often uses multi-dimensional spreadsheet tools allowing users to look at information from many different angles.
– Users are able to “slice and dice” reports and to look at the same kinds of information at different levels at the same time.
– A typical OLAP application might allow a product manager to view sales figures for a given product at the national level, see them broken down by division, drill down to see territories within a division, check sales numbers for each store within a territory, and then compare them against sales of stores from another territory.
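The drill-down path described above (national, division, territory, store) can be sketched as totals computed at different depths of a hierarchy. The data, the hierarchy labels, and the function name drill are hypothetical and stand in for whatever an OLAP tool would do internally.

```python
from collections import defaultdict

LEVELS = ("division", "territory", "store")

# Hypothetical unit sales of one product, keyed by (division, territory, store).
sales = {
    ("North", "T1", "Store A"): 10,
    ("North", "T1", "Store B"): 15,
    ("North", "T2", "Store C"): 7,
    ("South", "T3", "Store D"): 20,
}


def drill(sales, depth, within=()):
    """Total sales at the given hierarchy depth, restricted to one branch.

    depth=0 is the national total, depth=1 is per division, and so on.
    within names the branch to stay inside, e.g. ("North",).
    """
    branch = tuple(within)
    totals = defaultdict(int)
    for path, units in sales.items():
        if path[:len(branch)] == branch:
            totals[path[:depth]] += units
    return dict(totals)


print(drill(sales, 0))                         # national level
print(drill(sales, 1))                         # broken down by division
print(drill(sales, 2, within=("North",)))      # territories within a division
print(drill(sales, 3, within=("North", "T1")))  # stores within a territory
```

Comparing stores across territories is then just two calls to drill with different within branches.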
OLAP - continued
• DSS and EIS computing done by end-users in online systems. (Contrast with OLTP)
• OLAP Activities
– Generating queries
– Requesting ad hoc reports
– Conducting statistical analyses
– Building multimedia applications
• OLAP uses the data warehouse and a set of tools, usually with multidimensional capabilities
– Query tools
– Spreadsheets
– Data mining tools
– Data visualization tools
[Figure: Data Warehousing and OLAP. Labels shown: Data Sources (Internal Data Sources, External Data Sources); Data Acquisition, Extraction, Delivery, Transformation; Data Warehouse; Online Analytical Processing (Querying, Report Generation, Spreadsheet, Forecasting, Analysis, Modeling, Multimedia, EIS, Others); Data Presentation and Visualization; Business Communication.]
Data Warehouse Access Tools - 4
• Data Visualization
– These tools turn ugly, boring numbers into exciting visual presentations.
– These tools bring graphical representation to new heights. Example: Geographical information systems turn data about stores, individuals, or anything else into compelling, easy to understand, dynamic maps.
– Geographic information systems have the ability to display spatial occurrences and the relationship between and among geographically specific variables.
Data Visualization Technologies
• Digital images
• Geographic information systems
• Graphical user interfaces
• Multidimensions
• Tables and graphs
• Virtual reality
• Presentations
• Animation
Data Mining
Provides for:
• Knowledge discovery in databases
• Knowledge extraction
• Data archeology
• Data exploration
• Data pattern processing
• Data dredging
• Information harvesting
Major Data Mining Characteristics and Objectives
• Data are often buried deep
• Client/server architecture
• Sophisticated new tools--including advanced visualization tools--help to remove the information “ore”
• Massaging and synchronizing data
• Usefulness of “soft” data
• End-user miner is empowered by “data drills” and other power query tools with little or no programming skills
• Often involves finding unexpected results
• Tools are easily combined with spreadsheets etc.
• Parallel processing for data mining
Data Mining Application Areas
• Marketing
• Banking
• Retailing and sales
• Manufacturing and production
• Brokerage and securities trading
• Insurance
• Computer hardware and software
• Government and defense
• Airlines
• Health care
• Broadcasting
• Law Enforcement
Customer Relationship Management
• The availability of data mining tools has given rise to new terminology for the entire data warehousing field:
–Customer Relationship Management
DSS In Action 4.11: Data Visualization
To prevent systems from automatically identifying meaningless patterns in data, CFOs want to make sure that the processing power of a computer is always tempered with that of the insight of a human being. One way to do that is through data visualization, which uses color, form, motion, and depth to present masses of data in a comprehensible way. Andrew W. Lo, Director of the Laboratory for Financial Engineering at Massachusetts Institute of Technology’s Sloan School of Management, developed a program in which a CFO can use a mouse to “fly” over a 3-D landscape representing the risk, return, and liquidity of a company’s assets. With practice, the CFO can begin to zero in on the choicest spot on the 3-D landscape--the one where the trade-off among risk, return, and liquidity is most beneficial. Says Lo: “The video-game generation just loves these 3-D tools.”
So far, very few CFOs are cruising in 3-D cyberspace. Most still spend the bulk of their time on routine matters such as generating reports for the Securities & Exchange Commission. But that’s bound to change. Says Glassco Park President Robert J. Park: “What we have in financial risk management today is like what we had in computer typesetting in 1981, before desktop publishing.”
(Source: Condensed from: P. Coy, “Higher Math and Savvy Software are Crucial,” Business Week, October 28, 1996.)
Intelligent Data Mining
• Use intelligent search to discover information within data warehouses that queries and reports cannot effectively reveal
• Find patterns in the data and infer rules from them
• Use patterns and rules to guide decision-making and forecasting
• Five common types of information that can be yielded by data mining:
– 1) association,
– 2) sequences,
– 3) classifications,
– 4) clusters, and
– 5) forecasting
Main Tools Used in Intelligent Data Mining
• Case-based Reasoning
• Neural Computing
• Intelligent Agents
• Other Tools
– decision trees
– rule induction
– data visualization
Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson. Copyright 1998, Prentice Hall, Upper Saddle River, NJ
Benefits Derived from Data Warehousing - II
• Increases knowledge worker productivity
• Supports all decision makers’ data requirements
• Provides ready access to critical data
• Insulates operational databases from ad hoc processing
• Provides high-level summary information
• Provides drill down capabilities
Yields
– Improved business knowledge
– Competitive advantage
– Enhanced customer service and satisfaction
– Easier decision making
– Streamlined business processes
Developing the Data Warehouse - I
• The most expensive warehousing ventures involve major new hardware acquisitions and significant investments in training, analysis, and systems development costs.
• Typical startup projects allocate 60% of the budget to hardware and software for creating a powerful storage component, and 30% to data mining and acquisition tools.
• Budgeting for Systems Analysis and Development puts 50% of the budget into acquisition capabilities, 30% into funding user solutions, and 20% into creating databases in the storage component.
Developing the Data Warehouse - II
• Clarify what you want to do with the Warehouse - How Will It be Used.
• Scrutinize the offerings of vendors and systems integrators. Make sure you understand which functions they provide, and which you must build.
• Most successful projects start as small, tightly defined tactical systems to solve pressing business needs, and develop into larger systems over time.
Developing the Data Warehouse - III
• One strategy for developing a data warehouse is to start with the development of data marts - small single subject-oriented data warehouses.
• The corporate/organizational data warehouse then becomes a collection of data marts.
• Care must be taken in this approach
DW Summary: Key Concepts
• The DW is a “collection of integrated, subject-oriented databases designed to support the decision support function where each unit of data is non-volatile and relevant to some moment in time” (W.H. Inmon, 1992).
• Implicit Assumptions:
– physically separate from operational systems
– hold aggregated data and transactional (atomic) data for management separate from those used for OLTP.
DW Summary: Characteristics
• subject-oriented
• integrated
• non-volatile (i.e. not updated)
• time variant (kept for long periods, for forecasting and trend analysis)
• summarized
• large volume
• not normalized
• metadata
• data sources
DW Characteristics: Subject Orientation
• The data warehouse is oriented toward the major subjects of the organization as opposed to the functional orientation of legacy applications.
• APPLICATION ORIENTATION: Sales and Marketing, Materials Planning, Asset Tracking, Finance, Human Resources, Inventory
• SUBJECT ORIENTATION: Products, Vendors, Markets, Employees, Customers, Sales History
DW Characteristics: Integrated
• Data contained within the boundaries are integrated, i.e. consistency in naming conventions, measurement attributes, accuracy, and common aggregation.
• Consider that data on gender, dates, current balances could be brought from several different applications all named differently and all measured and stored differently.
• The process of ‘loading’ the data warehouse ‘scrubs’ data to eliminate inconsistencies.
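A minimal sketch of such scrubbing, assuming three hypothetical source applications that encode gender and dates differently, might look like the following. The code mappings, the accepted date formats, and the function names scrub_gender and scrub_date are illustrative assumptions, not part of any particular warehouse product.

```python
from datetime import datetime

# Hypothetical encodings used by different source applications.
GENDER_CODES = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

# Hypothetical date layouts found in the sources, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def scrub_gender(raw):
    """Map any source encoding to the warehouse convention M, F, or U (unknown)."""
    return GENDER_CODES.get(str(raw).strip().lower(), "U")


def scrub_date(raw):
    """Convert any recognized source date layout to ISO form YYYY-MM-DD."""
    text = str(raw).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this layout, try the next one
    raise ValueError(f"unrecognized date: {raw!r}")


source_rows = [("Male", "11/28/2000"), ("f", "2000-11-28"), (1, "20001128")]
warehouse_rows = [(scrub_gender(g), scrub_date(d)) for g, d in source_rows]
print(warehouse_rows)
```

Rejecting unrecognized dates rather than guessing is a design choice here: a load that fails loudly is easier to repair than inconsistent data already in the warehouse.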
DW Characteristics: Time Variant
• Operational databases normally contain up-to-the-minute accuracy.
• Data fields in the data warehouse are taken at specific points in time; time is normally a primary key for extracting.
• DW fields are not necessarily current, but probably time series.
• Think of warehouse data as a sequence of photographs or snapshots in time.
• Time horizon is long, perhaps 5 to 10 years, in order to analyze data over long time periods.
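The snapshot idea can be sketched as an append-only series in which time is part of the key. The class name SnapshotSeries and its methods are hypothetical; the sketch only shows that new snapshots are added, old ones are never changed, and a query asks for the value as of some date.

```python
from bisect import bisect_right
from datetime import date


class SnapshotSeries:
    """Append-only series of (snapshot_date, value) pairs for one data field."""

    def __init__(self):
        self._dates = []
        self._values = []

    def load(self, snapshot_date, value):
        """Add a new snapshot; existing snapshots are never modified."""
        if self._dates and snapshot_date <= self._dates[-1]:
            raise ValueError("snapshots must be loaded in increasing date order")
        self._dates.append(snapshot_date)
        self._values.append(value)

    def as_of(self, when):
        """Return the value from the latest snapshot taken on or before when."""
        index = bisect_right(self._dates, when)
        if index == 0:
            raise LookupError("no snapshot on or before that date")
        return self._values[index - 1]


balance = SnapshotSeries()
balance.load(date(2000, 1, 31), 100)   # month-end snapshots (made-up values)
balance.load(date(2000, 2, 29), 140)
print(balance.as_of(date(2000, 2, 15)))  # value from the January snapshot
```

An operational system would overwrite the balance in place; the warehouse keeps each snapshot so that trends over the long time horizon remain available.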
DW Characteristics: Nonvolatility
• Inserts, deletes, and updates/changes are characteristic of operational/transaction databases. Warehouse data is normally stored in read-only format and not changed.
• The purpose of the data warehouse is to extract data for reporting. The data is ‘cleaned’ and ‘scrubbed’ when it is loaded from operational stores.
• Implementation-wise, issues of transaction and data recovery, rollback, and detection and remedy of deadlock are unnecessary.
DW Characteristics: Metadata
• Metadata - ‘data about the data’.
• The data warehouse architecture is built on the concept of data definitions or metadata and it pervades every activity of the data warehouse.
• Some metadata management issues:
– standard definitions (technical and business descriptions) of data stored in the warehouse.
– Metadata captured and created in the extraction and refinement loading of data.
– Metadata on granularity, partitions, subject areas, aggregation and summarization.
– Metadata describing rules for timing and scheduling of the refresh, update, and replication cycle.
[Figure: Data Dependencies Model of a Business. Entities shown: Sales Rep, Sales District, Sales Region, Sales Division, Product Group, Product Line, Product, Contact, Ship To, Shipper, Ship Type, District Credit, Contract, Customer, Contract Type, Sales Order, Customer Location, Contact Location, Order Item.]