CIS 465 - Data Warehousing 1 Data Warehousing Version 7.0 - 11/28/2000

Upload: marian-wilcox

Post on 02-Jan-2016


CIS 465 - Data Warehousing 1

Data Warehousing

Version 7.0 - 11/28/2000

CIS 465 - Data Warehousing 2

Can your database answer questions like these?

• What is the cost of staff to break into a new line of business?

• What are the travel routes of my competition’s inventory?

• At what velocity is my competitor moving toward a common goal?

• How will a transaction on a certain date be affected by currency exchange rates?

• Is a foreign labor source likely to produce a higher quality product?

• Which 20% of the problem creates 80% of the problems?

CIS 465 - Data Warehousing 3

Can your database answer questions like these?

• By product and location, how can we regain a lost customer base?

• Which skill and staff levels are most likely to accept the voluntary layoff package?

CIS 465 - Data Warehousing 4

Data is Difficult to Manage

• Amount of data increases exponentially; past data must be kept for long times; new data are added rapidly.

• Data are scattered throughout organizations; collected by many individuals; using different methods and devices.

• Only small portions of an organization’s data are relevant for specific decisions.

• An ever-increasing amount of external data needs to be considered in making organizational decisions.

CIS 465 - Data Warehousing 5

Data is Difficult to Manage - contd.

• Raw data may be stored in different computing systems, formats, and human and computer languages.

• Legal requirements relating to data differ among countries and change frequently.

• Selecting data management tools can be difficult because of the large number of tools available.

• Data security, quality, and integrity are critical, yet easily jeopardized.

CIS 465 - Data Warehousing 6

Data and Knowledge Management

• Businesses do not run on data. They run on information and their knowledge of how to put that information to use successfully.

CIS 465 - Data Warehousing 7

Some Information Concepts


• Data: Unorganized facts and figures. (raw material)

• Information: Data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in current or prospective actions or decisions.

• Information:
– adds to a representation
– corrects or confirms previous information
– has “surprise” value in that it tells us something we did not know, or could not predict.
– What is a “finished product” to one may be “raw materials” to someone else.

CIS 465 - Data Warehousing 8

Definitions: Information vs. Knowledge

• Knowledge: a combination of instincts, ideas, rules, and procedures that guide actions and decisions.

• Knowledge consists of data items that are processed to convey understanding, experience, accumulated learning, and expertise as they apply to a current problem or situation.

• Helping to provide the best available knowledge to decision-making is another role of information systems

CIS 465 - Data Warehousing 9

Relationship Between Data, Information, and Knowledge

• The difference between data and information is easy to remember.

• It is often cited as the reason why systems that collect large amounts of information fail to meet management’s information needs.

• There are many methods of converting data into information for decision making.

• Managers take action based on information about a current situation plus their accumulated knowledge. Actions taken feed the process of accumulating more knowledge (experience).

• Example: How do medical students become competent physicians?

CIS 465 - Data Warehousing 10

Relationship Between Data, Information, and Knowledge

CIS 465 - Data Warehousing 11

Attributes of Quality Information (reference Alter - Chapter 4)

• Timeliness
• Completeness
• Conciseness
• Relevance
• Accuracy
• Precision
• Appropriateness of Form

CIS 465 - Data Warehousing 12

Special Characteristics of Information

• Usefulness - depends on combination of quality, accessibility, and presentation.

• One person’s information may be another person’s noise.

• Soft data may be as important as hard data.

• Ownership of information may be hard to maintain.

• More information is not always better (information overload).

• Politics can often hide or distort information.

CIS 465 - Data Warehousing 13

Sources of Data

• Internal Data
– Data about people, products, services, processes.
– Often stored in corporate databases (e.g. Sales or HR).
– Some data may be disparate in different regions, but accessible by networks.

• Personal Data
– Individuals document expertise by creating personal data: subjective estimates of sales, opinions on competitors’ plans, interpretation of news or current events, etc. Some is kept in heads, or mental models.
– Can be stored on PCs, or available on the Web.

• External Data
– Many sources.

CIS 465 - Data Warehousing 14

Some Sources of Business External Data

• Federal Publications
– Survey of Current Business
– Monthly Labor Review
– Federal Reserve Bulletin
– Employment and Earnings
– Commerce Business Daily
– Census Bureau

• Other
– International Monetary Fund
– Moody’s
– Standard & Poor’s
– Advertising Age
– Dialog and Lexis/Nexis
– ABI/Inform
– Annual Editor & Publisher Market Guide
– Thomas Register On-line

• Indexes
– Encyclopedia of Business Information Sources

CIS 465 - Data Warehousing 15

What is a data warehouse?

• A data warehouse is a pool of data organized in a format that enables users to interpret data and convert it into useful information to gain knowledge from this interpretation.

• It is a single place that contains complete and consistent data from multiple sources.

• Data warehousing is the act of a business person extracting business value from the data stored in the data warehouse.

CIS 465 - Data Warehousing 16

Collecting Raw Data

• Is not Easy
– collect in the field
– elicit from people
– collect manually, electronically, or by sensors.

• Data collection technology has not kept pace with advances of data storage technology.

• Data collection from external sources is not easy either.

• Bottom Line: Garbage IN, Garbage OUT - GIGO.

• Data Quality is an Important Issue.

CIS 465 - Data Warehousing 17

Data Quality Issues

• Intrinsic Data Quality:
– accuracy, objectivity, believability, reputation.

• Accessibility Data Quality:
– accessibility and access security

• Contextual Data Quality:
– relevancy, value-added, timeliness, completeness, amount of data.

• Representation Data Quality:
– interpretability, ease of understanding, concise representation, consistent representation

CIS 465 - Data Warehousing 18

Why Data Warehousing?

• Managers do not make decisions that are “good” or “bad”, they make decisions on the basis of good or bad information.

• Management information =
– a. the right information
– b. in the right form
– c. at the right time.

• Most transaction-based information systems have difficulty delivering this information.

CIS 465 - Data Warehousing 19

Why Data Warehousing - 2

• Not the right information:
– data not easily accessible
– meaning is subtly (or significantly) different from the question context.
– Information is presented with too much or too little detail, covers the wrong time spans, or is in the wrong intervals.

CIS 465 - Data Warehousing 20

Why Data Warehousing - 3

• Not the right time:
– Getting this information may require the efforts of highly skilled professionals who are not generally available at the whim of business managers.
– Data comes from a variety of different systems which are resident on a variety of different technology platforms.

CIS 465 - Data Warehousing 21

Why Data Warehousing - 4

• Not the right format:
– If data is extracted, merged, and converted into meaningful information, often it is not in a usable format.
– Users will want it loadable into a particular PC tool or spreadsheet with which they are familiar.
– Printouts weighing 10 pounds are not in the right format.
– A diskette with a COBOL file description is not in the right format.

CIS 465 - Data Warehousing 22

Figure 10.02

CIS 465 - Data Warehousing 23

The Dilemma for Corporate IT

• How to control scarce IT resources consumed by insatiable user demand for ad-hoc reports.

• Each ad-hoc report generated by IT and analyzed by the user generates three more reports to further illuminate the insights gleaned in the first.

• Often the extract programs have few reusable components.

• The user is on a “voyage of discovery in a sea of data”.

CIS 465 - Data Warehousing 24

The Response of Corporate IT

• New methodologies: Align the IT systems with the business goals and requirements.

• These techniques concentrate on business process requirements, not decision support requirements.

• Transaction systems must be rigorously specified in advance. They are an intersection between the organization and the customer.

• These systems should not be a “voyage of discovery” for either.

• We need a new type of system, other than the typical transaction processing system, to handle these requests for analytical management information.

CIS 465 - Data Warehousing 25

Transaction Systems vs. Analytical Support Systems

• Transaction Systems:
– Insert an order for 300 baseballs.
– Update this passenger’s airline reservation.
– Close out accounts payable records for this vendor.
– What is the current checking account balance for this customer?

• Analytical Support Systems:
– Did the sales promotion last quarter do better than the same promotion last year?
– Is the five-day moving average for this security leading or trailing actual prices?
– Which product line sells best in middle-America and how does this correlate to demographic data?

CIS 465 - Data Warehousing 26

Analytical Processing

• Analytical Processing today includes what in the past have been called:
– DSS (Decision Support Systems)
– EIS (Executive Information Systems)
– ESS (Executive Support Systems)

• It is an evolution of “End-User Computing”

• Placing strategic data access in the hands of decision makers aids productivity and enables them to be better decision makers.

CIS 465 - Data Warehousing 27

Key Difference: OLTP vs. OLAP

• OLTP (On-line Transaction Processing): Processing specific functions well defined in advance (e.g. enter an order, debit an account, register for a course, transfer money from one account to another)

• OLAP (On-line Analytical Processing): providing flexibility for undetermined analysis, i.e. not specified in advance, ad-hoc.

CIS 465 - Data Warehousing 28

Data for Decision Support

• The data must be integrated - requires data from many separate internal corporate databases.

• The data must be enriched - through integration with other external data.

• The data must be available - and not constrained by machine resources.

CIS 465 - Data Warehousing 29

Sources of Data

• Internal Data:
– Financial Systems
– Logistics Systems
– Sales Systems
– Production Systems
– Personnel Systems
– Billing Systems
– Information Systems

• External Data Needs:
– to recognize opportunities
– to detect threats
– to identify synergies

CIS 465 - Data Warehousing 30

Sources of Data - 2

• External Data Categories
– Competitor Data
– Economic Data
– Industry Data
– Credit Data
– Commodity Data
– Econometric Data
– Psychometric Data
– Meteorological Data
– Demographic Data
– Sales & Marketing Data

CIS 465 - Data Warehousing 31

Operational Control vs. Operational Strategy

• Data is a source not just of operational control, but of operational strategy.

• Operational strategy is an attempt to describe the need, in a competitive and turbulent market, to continually innovate and re-align strategy with time scales too short to be comprehended by strategic planning in the conventional corporate sense.

CIS 465 - Data Warehousing 32

Comparison of Control and Strategy Data:

• Operational Data:
– short-lived, rapidly changing
– requires record-level access
– repetitive standard transactions and access patterns
– updated in real-time
– event-driven; process generates data

• Strategic Data:
– long-living, static
– data aggregated into sets (which is why warehouse data is friendly to RDBMS).
– ad-hoc queries with some periodic reporting
– updated periodically with mass loads
– data-driven; data governs process.

CIS 465 - Data Warehousing 33

Information Requirements by Management Level (Source: Gorry and Scott Morton)

Characteristics of Information   Operational Control   Management Control   Strategic Planning
Source                           Largely internal                           External
Scope                            Well defined, narrow                       Very wide
Level of Aggregation             Detailed                                   Aggregate
Time Horizon                     Historical                                 Future
Currency                         Highly current                             Quite old
Required Accuracy                High                                       Low
Frequency of Use                 Very frequent                              Infrequent

CIS 465 - Data Warehousing 34

Dimensional Modeling - I

• Dimensional Modeling gives us a way to visualize data.

• The CEO’s perspective:
– “We sell products in various markets, and we measure our performance over time.”

• From the data warehouse designer’s perspective, we hear three dimensions:
– We sell Products
– in various Markets
– and measure performance over Time.

CIS 465 - Data Warehousing 35

Dimensional Modeling - II

• Management may be interested in examining sales figures in a certain city by product, by time period, by salesperson, and by store.

• Three dimensions are easily represented in a cube.

• The more dimensions involved, the more difficult it is to represent in a single table or graph, or n-dimensional cube.

• The ability to add and modify the dimensions used in a table or graph is often known as “slicing and dicing” the data.

CIS 465 - Data Warehousing 36

Dimensional Modeling - III

• 3-D + Spreadsheets

• Data can be organized the way managers like to see them, rather than the way that the system analysts do

• Different presentations of the same data can be arranged easily and quickly

CIS 465 - Data Warehousing 37

Dimensional Model of the Business

(Figure: three axes labeled Product, Market, and Time)

A Multidimensional database

CIS 465 - Data Warehousing 39

Multidimensionality

• Examples of dimensions:
– products, salespeople, market segments, business units, geographical locations, distribution channels, country, industry, and various measures of time.

• Examples of Facts or Measures:
– money, sales volume, head count, inventory, profit, actual vs. forecast.

• Examples of Time – A special Dimension:
– daily, weekly, monthly, quarterly, yearly

CIS 465 - Data Warehousing 40

Multidimensionality View - I

CIS 465 - Data Warehousing 41

Multidimensionality View - II

CIS 465 - Data Warehousing 42

Multidimensionality View - III

CIS 465 - Data Warehousing 43

(Figure: Sales and Margin for Camera, TV, VCR, and Audio; regions East and West; cities San Francisco, Los Angeles, and Denver; Actual and Budget columns for February and March)

Example of Different Dimensions of a Multidimensional Database - I

CIS 465 - Data Warehousing 44

(Figure: Sales, COGS, Margin, Total Expenses, and Profit for TV and VCR; rows Jan, Feb, Mar, and Qtr 1; Actual and Budget columns for EAST and WEST)

Example of Different Dimensions of a Multidimensional Database - II

CIS 465 - Data Warehousing 45

(Figure: Sales and Margin, Actual and Budget, for TV and VCR; rows January, February, March, Qtr 1, and April; regions East, West, South, and TOTAL)

Example of Different Dimensions of a Multidimensional Database - III

CIS 465 - Data Warehousing 46

(Figure: Sales and Margin for TV and VCR in EAST and WEST; rows January, February, March, Qtr 1, and April; Actual, Budget, Forecast, and Variance)

Example of Different Dimensions of a Multidimensional Database - IV

CIS 465 - Data Warehousing 47

Typical Dimensional Model - I

• Also called the star join schema since the diagram looks like a star.

• One large central table with smaller attendant tables arranged in a radial pattern around the central table.

• The central table is the fact table and the other tables are dimension tables.

• The next slide models a simple business that sells products in a number of markets and measures performance over time.

CIS 465 - Data Warehousing 48

Typical Dimensional Model - II

Sales Fact: Time_key, product_key, store_key, dollars_sold, units_sold, dollars_cost

Time Dimension: Time_key, day-of-week, month, quarter, year, holiday_flag

Product Dimension: Product_key, description, brand, category

Market (Store) Dimension: Store_key, store_name, address, floor_plan_type
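
The star join schema on this slide can be sketched as table definitions. The following is a minimal illustration using Python's built-in sqlite3 module, not part of the original slides. The table names (time_dim, product_dim, store_dim, sales_fact) and the column types are assumptions; day-of-week is written day_of_week so it is a valid SQL identifier.

```python
import sqlite3

# In-memory database; column names follow the slide, types are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key      INTEGER PRIMARY KEY,
    day_of_week   TEXT,
    month         TEXT,
    quarter       TEXT,
    year          INTEGER,
    holiday_flag  INTEGER
);
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    description  TEXT,
    brand        TEXT,
    category     TEXT
);
CREATE TABLE store_dim (
    store_key        INTEGER PRIMARY KEY,
    store_name       TEXT,
    address          TEXT,
    floor_plan_type  TEXT
);
-- The fact table carries one key per dimension plus the numeric measures.
CREATE TABLE sales_fact (
    time_key      INTEGER REFERENCES time_dim(time_key),
    product_key   INTEGER REFERENCES product_dim(product_key),
    store_key     INTEGER REFERENCES store_dim(store_key),
    dollars_sold  REAL,
    units_sold    INTEGER,
    dollars_cost  REAL,
    PRIMARY KEY (time_key, product_key, store_key)
);
""")

# List the tables that were created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The composite primary key on the fact table reflects the idea that each measurement sits at the intersection of all the dimensions.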

CIS 465 - Data Warehousing 49

Typical Dimensional Model - III

• If the Sales Fact Table contains only daily item totals of all the products sold, we say this is the grain of the fact table.

• Each record represents the total sales of a particular product in a market on a particular day.

• Each distinct combination of product, market, and day generates a different record in the fact table.

• In a large business, there would be a large number of records in the fact table. The fact table for a typical grocery store retailer with 500 stores each carrying 50,000 products on the shelves and measuring daily movement over two years could easily approach one billion rows.

• This is not a problem for industrial strength data warehousing servers.
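
The row estimate above can be checked with simple arithmetic. This is a rough sketch, not a calculation from the slides: the figure of 2,500 products sold per store per day (5% of the 50,000 carried) is an assumption chosen only to show how a total near one billion rows can arise.

```python
stores = 500
products_per_store = 50_000
days = 2 * 365  # two years of daily measurements

# Upper bound: every product sells in every store on every day.
max_rows = stores * products_per_store * days

# Assumption (not from the slides): only about 2,500 of the 50,000
# products actually sell in a given store on a given day.
assumed_products_sold_per_day = 2_500
estimated_rows = stores * assumed_products_sold_per_day * days

print(f"upper bound: {max_rows:,} rows")
print(f"estimate:    {estimated_rows:,} rows")
```

Because a record exists only for products that actually sold, the real table is far smaller than the upper bound of the full product-by-store-by-day grid.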

CIS 465 - Data Warehousing 50

Dimension Tables

• Dimension tables are where textual descriptions of the business are stored.

• Each textual description helps to describe a member of the dimension.

• Example: each member in the product dimension is a specific product. The product dimension database has many attributes to describe the product. A key role of the dimension table attribute is to serve as the source of constraints in a query.

CIS 465 - Data Warehousing 51

Fact Table

• Fact Table is where numerical measurements of the business are stored.

• Each measurement is taken at the intersection of all the dimensions.

• The “best” facts are numeric, continuously valued and additive.

• Every query made against the fact table may use hundreds of thousands of individual records to construct an answer set.

CIS 465 - Data Warehousing 52

Facts vs. Attributes

• Sometimes it may be unclear whether a numeric data field extracted from a production data source is a fact or an attribute.

• If the numeric data field is a measurement that varies continuously every time we sample it, it is generally a fact; otherwise, if it is a discretely valued description of something that is more or less constant, it is a dimension attribute.

• Example: A standard cost for a product seems like an attribute, but may be changed so often that it is more like a measured fact.

CIS 465 - Data Warehousing 53

Example

Brand     Dollar Sales   Unit Sales
Axon               780          263
Framis            1044          509
Widget             213          444
Zapper              95           39

CIS 465 - Data Warehousing 54

Example Query

• Find all product brands that were sold in the first quarter of 1995 and present the total dollar sales as well as the number of units.

• Brand is a collection of individual products.

• To construct:
– A. Drag attribute brand from product dimension. Place as Row Header.
– B. Drag Dollar Sales and Units Sold from the Fact Table, and place to the right of the Brand row header.
– C. Specify row constraint “1st Q 1995” on the quarter attribute in the Time Dimension Table.
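
Steps A to C correspond roughly to a join, a constraint, and a grouped sum. The sketch below expresses that with Python's sqlite3 module. It is an illustration only: the cut-down tables, the sample rows, and the 'Q1'/1995 encoding of the quarter are all hypothetical and are not the data behind the example table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, brand TEXT);
CREATE TABLE sales_fact (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                         dollars_sold REAL, units_sold INTEGER);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, "Q1", 1995), (2, "Q1", 1995), (3, "Q2", 1995)])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(10, "Axon small", "Axon"), (11, "Axon large", "Axon"),
                  (20, "Framis standard", "Framis")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 100, 30.0, 10),
    (2, 11, 100, 50.0, 12),
    (2, 20, 101, 40.0, 20),
    (3, 20, 101, 99.0, 33),  # second quarter: excluded by the constraint
])

query = """
SELECT p.brand,                              -- step A: brand as row header
       SUM(f.dollars_sold) AS dollar_sales,  -- step B: facts to the right
       SUM(f.units_sold)   AS unit_sales
FROM sales_fact f
JOIN product_dim p ON p.product_key = f.product_key
JOIN time_dim t    ON t.time_key = f.time_key
WHERE t.quarter = 'Q1' AND t.year = 1995     -- step C: constraint on time
GROUP BY p.brand
ORDER BY p.brand
"""
rows = conn.execute(query).fetchall()
for brand, dollars, units in rows:
    print(brand, dollars, units)
```

The constraint is applied to a dimension attribute while the sums run over fact columns, which is the division of labor the slides describe for dimension and fact tables.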

CIS 465 - Data Warehousing 55

How a Data Warehouse is Built

CIS 465 - Data Warehousing 56

Transaction Processing

• The Relational Model was full of promises for equal access to data.

• In the early 1980’s the relational model was a dream. Typical transaction rates were one per second.

• Today the SABRE system typically processes 4,000 transactions per second, with peak bursts of 13,000 per second.

• OLTP - (On-line Transaction Processing) The point is to get data “in” to the database.

CIS 465 - Data Warehousing 57

Segregating Operational and Warehouse Data

• In the past, data administrators were constantly told to build data sharing, normalization, and non-redundant corporate databases.

• Early attempts at data warehousing tried to share the data with transaction-based systems. This resulted in LONG response times for complex queries.

• The idea today is to keep the two separate.

• Separate databases, and perhaps separate DBMS products and processor platforms, are used.

• Controlled and practical redundancy is better than out-of-control theoretical purity.

CIS 465 - Data Warehousing 58

Fundamental Obstacles With Traditional Systems

• Systems Integration - Disintegration grew slowly from islands of automation.
– ownership, planning, economic, organizational development issues all contribute.

• Hardware Architecture

• Inconsistent Data

• Data Pollution:
– Bad Application Design (semantic and syntactical differences).
– Ownership
– Data Entry Conventions

CIS 465 - Data Warehousing 59

The Data Warehouse

• Active, tactical, and current events flow from the operational systems to the data warehouse to become static, strategic, and historical data.

• The data warehouse becomes a “middle ground” where a large number of disparate and incompatible “legacy systems” are tied to an equally diverse collection of end-user workstations.

• Legacy systems usually comprise a hodge-podge of assorted hardware, software, and operational systems accumulated over many decades, and are, by nature, incompatible with one another and unique to each organization.

CIS 465 - Data Warehousing 60

Practical Facts About the Warehouse

• The chances are remote that any single vendor will be able to develop a product that can interface with all “legacy systems” painlessly and “seamlessly” and at the same time, combine data from modern platforms and data external to the organization.

• Instead warehouse product vendors develop specialized capabilities to work with various environments.

CIS 465 - Data Warehousing 61

Components of a Data Warehouse - 1

• Acquisition - The first component handles acquisition of data from legacy systems and outside sources.

• Data is identified, copied, formatted, and prepared for loading into a warehouse.

• Vendors provide tools for extraction and preparation.

CIS 465 - Data Warehousing 62

Components of a Data Warehouse - 2

• Storage Area - The second component is the storage area managed by relational databases, multi-dimensional databases, specialized hardware - symmetric multiprocessor (SMP) or massively parallel processors (MPP) machines - or by software.

• The storage component holds the data so that many different data mining, executive information, and decision support systems can make use of it effectively.

CIS 465 - Data Warehousing 63

Components of a Data Warehouse - 3

• Access - The third component of the warehouse is the access area.

• Different end-user PCs and workstations draw data from the warehouse with the help of multi-dimensional analysis tools, neural networks, data discovery tools, or analysis tools.

• These “smart” data-mining tools are the driving force behind the data warehouse concept.

• What good is it to store all the information without some way to understand it in new and different ways?

CIS 465 - Data Warehousing 64

Components of a Data Warehouse - 4

CIS 465 - Data Warehousing 65

Data Warehouse Access Tools

• Intelligent Agents and Agencies - tools work and think for user.

• Query Facilities and Managed Query environments.

• Statistical Analysis - One of the biggest surprises in the data warehousing marketplace is the resurgence of interest in traditional statistical analysis, and the concomitant resurrection of the popularity of products like SAS and SPSS.

CIS 465 - Data Warehousing 66

Data Warehouse Access Tools - 2

• Data Discovery:
– A large class of tools formerly classified as decision support, artificial intelligence and expert systems. They now make use of neural networks, fuzzy logic, decision trees, and other tools from advanced mathematics to allow a user to “sift” through massive amounts of raw data to “discover” new, interesting, insightful, and in many cases useful things about the organization, its operations, and its markets.
– There are many different data discovery tools/products on the market.

CIS 465 - Data Warehousing 67

Data Warehouse Access Tools - 3

• OLAP - On-line Analytical Processing:
– often uses multi-dimensional spreadsheet tools allowing users to look at information from many different angles.
– Users are able to “slice and dice” reports and to look at the same kinds of information at different levels at the same time.
– Typical OLAP application might allow a product manager to view sales figures for a given product at the national level, see them broken down by division, drill down to see territories within a division, check sales numbers for each store within a territory, and then compare them against sales of stores from another territory.
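
The drill-down path described above (national, division, territory, store) can be sketched as repeated roll-ups of the same store-level records at different depths. This is a minimal illustration in plain Python rather than any particular OLAP product; the sales figures, the place names, and the roll_up helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical store-level sales for one product:
# (division, territory, store, sales)
sales = [
    ("East", "Boston", "Store 1", 120),
    ("East", "Boston", "Store 2", 80),
    ("East", "New York", "Store 3", 200),
    ("West", "Denver", "Store 4", 150),
]

def roll_up(rows, depth, path=()):
    """Sum sales grouped by the first `depth` levels, keeping only rows under `path`."""
    totals = defaultdict(int)
    for *keys, amount in rows:
        if tuple(keys[:len(path)]) == tuple(path):
            totals[tuple(keys[:depth])] += amount
    return dict(totals)

national = roll_up(sales, 0)                                # one grand total
by_division = roll_up(sales, 1)                             # broken down by division
east_territories = roll_up(sales, 2, path=("East",))        # drill into East
boston_stores = roll_up(sales, 3, path=("East", "Boston"))  # drill into Boston

print(national)
print(by_division)
print(east_territories)
print(boston_stores)
```

Each call answers the same question at a finer level of detail, which is what lets a user compare stores in one territory against stores in another without a new report being written.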

CIS 465 - Data Warehousing 68

OLAP - continued

• DSS and EIS computing done by end-users in online systems. (Contrast with OLTP)

• OLAP Activities
– Generating queries
– Requesting ad hoc reports
– Conducting statistical analyses
– Building multimedia applications

• OLAP uses the data warehouse and a set of tools, usually with multidimensional capabilities
– Query tools
– Spreadsheets
– Data mining tools
– Data visualization tools

CIS 465 - Data Warehousing 69

(Figure: Internal and External Data Sources; Data Acquisition, Extraction, Delivery, Transformation; Data Warehouse; Online Analytical Processing: Querying, Report Generation, Spreadsheet Forecasting/Analysis/Modeling, Multimedia, EIS and Others; Business Communication; Data Presentation and Visualization)

Data Warehousing and OLAP

CIS 465 - Data Warehousing 70

Data Warehouse Access Tools - 4

• Data Visualization
– These tools turn ugly, boring numbers into exciting visual presentations.
– These tools bring graphical representation to new heights. Example: Geographical information systems turn data about stores, individuals, or anything else into compelling, easy to understand, dynamic maps.
– Geographic information systems have the ability to display spatial occurrences and the relationship between and among geographically specific variables.

CIS 465 - Data Warehousing 71

Data Visualization Technologies

• Digital images
• Geographic information systems
• Graphical user interfaces
• Multidimensions
• Tables and graphs
• Virtual reality
• Presentations
• Animation

CIS 465 - Data Warehousing 72

Data Mining

Provides for:
• Knowledge discovery in databases
• Knowledge extraction
• Data archeology
• Data exploration
• Data pattern processing
• Data dredging
• Information harvesting

CIS 465 - Data Warehousing 73

Major Data Mining Characteristics and Objectives

• Data are often buried deep
• Client/server architecture
• Sophisticated new tools--including advanced visualization tools--help to remove the information “ore”
• Massaging and synchronizing data
• Usefulness of “soft” data
• End-user miner is empowered by “data drills” and other power query tools with little or no programming skills
• Often involves finding unexpected results
• Tools are easily combined with spreadsheets etc.
• Parallel processing for data mining

CIS 465 - Data Warehousing 74

Data Mining Application Areas

• Marketing
• Banking
• Retailing and sales
• Manufacturing and production
• Brokerage and securities trading
• Insurance
• Computer hardware and software
• Government and defense
• Airlines
• Health care
• Broadcasting
• Law Enforcement

CIS 465 - Data Warehousing 75

Customer Relationship Management

• The availability of data mining tools has given rise to new terminology for the entire data warehousing field:

–Customer Relationship Management

DSS In Action 4.11: Data Visualization

To prevent systems from automatically identifying meaningless patterns in data, CFOs want to make sure that the processing power of a computer is always tempered with that of the insight of a human being. One way to do that is through data visualization, which uses color, form, motion, and depth to present masses of data in a comprehensible way. Andrew W. Lo, Director of the Laboratory for Financial Engineering at Massachusetts Institute of Technology’s Sloan School of Management, developed a program in which a CFO can use a mouse to “fly” over a 3-D landscape representing the risk, return, and liquidity of a company’s assets. With practice, the CFO can begin to zero in on the choicest spot on the 3-D landscape--the one where the trade-off among risk, return, and liquidity is most beneficial. Says Lo: “The video-game generation just loves these 3-D tools.”

So far, very few CFOs are cruising in 3-D cyberspace. Most still spend the bulk of their time on routine matters such as generating reports for the Securities & Exchange Commission. But that’s bound to change. Says Glassco Park President Robert J. Park: “What we have in financial risk management today is like what we had in computer typesetting in 1981, before desktop publishing.”

(Source: Condensed from: P. Coy, “Higher Math and Savvy Software are Crucial,” Business Week, October 28, 1996.)

CIS 465 - Data Warehousing 77

Intelligent Data Mining

• Use intelligent search to discover information within data warehouses that queries and reports cannot effectively reveal

• Find patterns in the data and infer rules from them

• Use patterns and rules to guide decision-making and forecasting

• Five common types of information that can be yielded by data mining:
– 1) association,
– 2) sequences,
– 3) classifications,
– 4) clusters, and
– 5) forecasting

CIS 465 - Data Warehousing 78

Main Tools Used in Intelligent Data Mining

• Case-based Reasoning

• Neural Computing

• Intelligent Agents

• Other Tools
– decision trees
– rule induction
– data visualization

Decision Support Systems and Intelligent Systems, Efraim Turban and Jay E. Aronson. Copyright 1998, Prentice Hall, Upper Saddle River, NJ

CIS 465 - Data Warehousing 79

Benefits Derived from Data Warehousing - I

CIS 465 - Data Warehousing 80

Benefits Derived from Data Warehousing - II

• Increase in knowledge worker productivity
• Supports all decision makers’ data requirements
• Provides ready access to critical data
• Insulates operation databases from ad hoc processing
• Provides high-level summary information
• Provides drill down capabilities

Yields
– Improved business knowledge
– Competitive advantage
– Enhanced customer service and satisfaction
– Easier decision making
– Streamlined business processes

CIS 465 - Data Warehousing 81

Developing the Data Warehouse - I

• The most expensive warehousing ventures involve major new hardware acquisitions and significant investments in training, analysis, and systems development costs.

• Typical startup projects allocate 60% of the budget to hardware and software for creating a powerful storage component, and 30% to data mining and acquisition tools.

• Budgeting for systems analysis and development puts 50% of the budget on acquisition capabilities, 30% on funding user solutions, and 20% on creating databases in the storage component.

CIS 465 - Data Warehousing 82

Developing the Data Warehouse - II

• Clarify what you want to do with the Warehouse - How Will It be Used.

• Scrutinize the offerings of vendors and systems integrators. Make sure you understand which functions they provide, and which you must build.

• Most successful projects start as small, tightly defined tactical systems to solve pressing business needs, and develop into larger systems over time.

CIS 465 - Data Warehousing 83

Developing the Data Warehouse - III

• One strategy for developing a data warehouse is to start with the development of data marts - small single subject-oriented data warehouses.

• The corporate/organizational data warehouse then becomes a collection of data marts.

• Care must be taken in this approach

CIS 465 - Data Warehousing 84

DW Summary: Key Concepts

• The DW is a “collection of integrated, subject-oriented databases designed to support the decision support function where each unit of data is non-volatile and relevant to some moment in time” (W.H. Inmon, 1992).

• Implicit Assumptions:
– physically separate from operational systems
– holds aggregated data and transactional (atomic) data for management, separate from those used for OLTP.

CIS 465 - Data Warehousing 85

DW Summary: Characteristics

• Subject-orientation
• integrated
• non-volatile (i.e. not updated)
• time variant (kept for long periods, for forecasting and trend analysis)
• summarized
• large volume
• not normalized
• metadata
• data sources

CIS 465 - Data Warehousing 86

DW Characteristics: Subject Orientation

• The data warehouse is oriented toward the major subjects of the organization as opposed to the functional orientation of legacy applications.

• APPLICATION ORIENTATION: Sales and Marketing, Materials Planning, Asset Tracking, Finance, Human Resources, Inventory

• SUBJECT ORIENTATION: Products, Vendors, Markets, employees, Customers, Sales History

CIS 465 - Data Warehousing 87

DW Characteristics: Integrated

• Data contained within the boundaries are integrated, i.e. consistency in naming conventions, measurement attributes, accuracy, and common aggregation.

• Consider that data on gender, dates, current balances could be brought from several different applications all named differently and all measured and stored differently.

• The process of ‘loading’ the data warehouse ‘scrubs’ data to eliminate inconsistencies.
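
The scrubbing step can be illustrated with the gender and date fields mentioned above. The sketch below is a hypothetical example, not a description of any vendor's loading tool: the source encodings (including treating 1/0 as male/female), the field names, and the target conventions ("M"/"F"/"U" and ISO dates) are all assumptions.

```python
from datetime import datetime

# Assumed source encodings mapped to one warehouse convention.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}
# Assumed date layouts used by the different source applications.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

def scrub_gender(value):
    """Return 'M', 'F', or 'U' (unknown) for any of the assumed encodings."""
    return GENDER_CODES.get(str(value).strip().lower(), "U")

def scrub_date(value):
    """Return the date as an ISO string, trying each assumed layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")

# Two hypothetical source records that name and store the same facts differently.
source_a = {"sex": "male", "opened": "11/28/2000"}
source_b = {"gender": 1, "open_dt": "20001128"}

loaded = [
    {"gender": scrub_gender(source_a["sex"]), "open_date": scrub_date(source_a["opened"])},
    {"gender": scrub_gender(source_b["gender"]), "open_date": scrub_date(source_b["open_dt"])},
]
print(loaded)
```

After loading, both records use the same field names and encodings, so a query against the warehouse does not need to know which application each one came from.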

CIS 465 - Data Warehousing 88

DW Characteristics: Time Variant

• Operational databases normally contain up-to-the-minute accuracy.

• Data fields in the data warehouse are taken at specific points in time; time is normally a primary key for extracting.

• DW fields are not necessarily current, but probably time series.

• Think of warehouse data as a sequence of photographs or snapshots in time.

• Time horizon is long, perhaps 5 to 10 years, in order to analyze data over long time periods.

CIS 465 - Data Warehousing 89

DW Characteristics: Nonvolatility

• Inserts, deletes, and updates/changes are characteristic of operational/transaction databases. In the data warehouse, data is normally stored in read-only format and not changed.

• The purpose of the data warehouse is to extract data for reporting. The data is ‘cleaned’ and ‘scrubbed’ when it is loaded from operational stores.

• Implementation-wise, transaction and data recovery, rollback, and deadlock detection and remedy are unnecessary.

CIS 465 - Data Warehousing 90

DW Characteristics: Metadata

• Metadata - ‘data about the data’.

• The data warehouse architecture is built on the concept of data definitions or metadata and it pervades every activity of the data warehouse.

• Some metadata management issues:
– standard definitions (technical and business descriptions) of data stored in the warehouse.
– Metadata captured and created in the extraction and refinement loading of data.
– Metadata on granularity, partitions, subject areas, aggregation and summarization.
– Metadata describing rules for timing and scheduling of the refresh, update, and replication cycle.

CIS 465 - Data Warehousing 91

The end

CIS 465 - Data Warehousing 92

CIS 465 - Data Warehousing 93

(Figure: entities Sales Rep, Sales District, Sales Region, Sales Division, Product Group, Product Line, Product, Contact, Ship To, Shipper, Ship Type, District Credit, Contract, Customer, Contract Type, Sales Order, Customer Location, Contact Location, and Order Item)

Data Dependencies Model of a Business

CIS 465 - Data Warehousing 94

Figure 10.04

CIS 465 - Data Warehousing 95

Benefits of the Data Warehouse Structure

• Data integrity
• Consistency across time lines
• High efficiency
• Low operating costs
• Can store data at different levels of summarization
• Can give customers quick turnaround