© 2003, Prentice-Hall Chapter 1 - 1 Chapter 1: Introduction to Data Mining, Warehousing, and Visualization Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas

Upload: randall-sanders

Post on 23-Dec-2015


© 2003, Prentice-Hall Chapter 1 - 1

Chapter 1: Introduction to Data Mining, Warehousing, and Visualization

Modern Data Warehousing, Mining, and Visualization: Core Concepts

by George M. Marakas


1-1: The Modern Data Warehouse

A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.

Note that the data warehouse contains a copy of the transactions. The copied records are not updated or changed later by the transaction system.

Also note that this data is specially structured, and may have been transformed when it was placed in the warehouse.
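
As a minimal sketch of the idea above, the following Python copies transaction records into a separate, restructured form. The record layouts (`Transaction`, `WarehouseRow`), the field names and the sample values are all hypothetical and chosen only for illustration; a real warehouse load would be far more involved.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transaction:
    """Hypothetical record as the transaction system might store it."""
    order_id: int
    customer: str
    amount_cents: int
    order_date: str  # ISO text such as "2003-01-15"


@dataclass(frozen=True)
class WarehouseRow:
    """Hypothetical warehouse row: same facts, restructured for querying."""
    order_id: int
    customer: str
    amount: float
    year: int
    month: int


def load_into_warehouse(transactions):
    """Copy each transaction, transforming it as it enters the warehouse."""
    rows = []
    for t in transactions:
        d = date.fromisoformat(t.order_date)
        rows.append(
            WarehouseRow(
                order_id=t.order_id,
                customer=t.customer.strip().upper(),  # cleanse the name
                amount=t.amount_cents / 100,          # convert units
                year=d.year,                          # split date for reporting
                month=d.month,
            )
        )
    # A tuple of frozen rows: the copy is not changed after loading.
    return tuple(rows)


source = [
    Transaction(1, " acme ", 12550, "2003-01-15"),
    Transaction(2, "Globex", 9900, "2003-02-03"),
]
warehouse = load_into_warehouse(source)
```

The source list is left untouched; the warehouse holds its own transformed copy.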


1-2: Data Warehouse Roles and Structures

The DW has the following primary functions:

It is a direct reflection of the business rules of the enterprise.

It is the collection point for strategic information.

It is the historical store of strategic information.

It is the source of information later delivered to data marts.

It is the source of stable data regardless of how the business processes may change.


Position of the Data Warehouse Within the Organization


Data Marts

A data mart is a smaller, more focused data warehouse. It reflects the business rules of a specific business unit.

The data mart does not need to cleanse its data because that was done when it went into the warehouse.

It is a set of tables for direct access by users.

These tables are designed for aggregation.

It typically is not a source for traditional statistical analysis.


Position of the Data Mart Within the Organization

[Figure: Data Delivery feeds three Data Marts, each of which provides Decision Support Information.]


1-3: What Can a Data Warehouse Do?

Some of the benefits of a DW are:

Immediate information delivery

Data integration from across and even outside the organization

Future vision from historical trends

Tools for looking at data in new ways

Freedom from IS department resource limitations (you don’t need programmers to use a data warehouse)


Examples of Common DW Applications

Sales Analysis

Determine real-time product sales to make vital pricing and distribution decisions.

Analyze historical product sales to determine success or failure attributes.

Evaluate successful products and determine key success factors.

Use corporate data to understand the margin as well as the revenue implications of a decision.

Rapidly identify preferred customer segments based on revenue and margin.

Quickly isolate past preferred customers who no longer buy.

Identify daily what product is in the manufacturing and distribution pipeline.

Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are behind.

Financial Analysis

Compare actual to budgets on an annual, monthly and month-to-date basis.

Review past cash flow trends and forecast future needs.

Identify and analyze key expense generators.

Instantly generate a current set of key financial ratios and indicators.

Receive near-real-time, interactive financial statements.

Human Resource Analysis

Evaluate trends in benefit program use.

Identify the wage and benefits costs to determine company-wide variation.

Review compliance levels for EEOC and other regulated activities.

Other Areas

Warehouses have also been applied to areas such as logistics, inventory, purchasing, detailed transaction analysis and load balancing.


What Does All This Mean?

On a daily basis, organizations turn to their data warehouses to answer a limitless variety of questions.

Nothing is free, however, and these benefits do come with a cost.

The value of a data warehouse is a result of the new and changed business processes it enables.

There are limitations, though. A DW cannot correct problems with the data, although it may help to clearly identify them.


Comparison of Typical DW Costs and Benefits

Costs

Hardware, software, development personnel and consultant costs.

Operational costs like ongoing systems maintenance.

Benefits

Added revenue

Will the new (business objective) process generate new customers (what is the estimated value?)

Will the new (business objective) process increase the buying propensity of existing customers (by how much?)

Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?

Reduced costs

What costs of current systems will be eliminated?

Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?


1-4: The Cost of Warehousing Data

Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.

The initial costs can be further divided into hardware costs and software costs.

Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse).


Expenditures Associated with Building a DW

Capital, recurring costs: hardware maintenance; software maintenance; terminal analysis; middleware.

Capital, one-time costs (hardware and software): disk; DBMS; CPU; terminal analysis; network; middleware; log utility; processing; metadata infrastructure.

Operational, recurring costs: ongoing refreshment; integration/transformation; data model maintenance; record identification maintenance; metadata infrastructure maintenance; archival of data; data aging within the DW.

Operational, one-time costs: integration/transformation processing specification; metadata infrastructure population; system of record definition; data dictionary language definition; network transfer definition; CASE/repository interface; initial data warehouse population; data model definition; database design definition.


Costs Are Highly Variable

A company that spends less money on its data warehouse is often happier with it.

The main justification for the development expense is that a DW reduces the cost of accessing the information owned by the organization.

Since information has to be retrieved just once (when it is placed in the warehouse), DW users see a lower cost on each report generated.


Typical Multidatabase Report and Screen Generation

[Figure: Source System A, Source System B, Source System C and Source System D.]

Data download and transformation contribute to retrieval costs for every report or screen generated.


Typical DW Report and Screen Generation

[Figure: Source System A, Source System B, Source System C and Source System D, together with the Organizational Data Warehouse.]

Data upload and transformation costs occur just once. Retrieval costs are lower.
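
The cost argument can be made concrete with a small arithmetic sketch. The two functions and the cost figures below are hypothetical and are not taken from the slides; the sketch also assumes, for simplicity, the same retrieval cost per report in both cases and ignores the cost of building the warehouse itself.

```python
def total_cost_direct(reports: int, prep_cost: float, retrieval_cost: float) -> float:
    """Multidatabase approach: every report repeats download and transformation."""
    return reports * (prep_cost + retrieval_cost)


def total_cost_warehouse(reports: int, prep_cost: float, retrieval_cost: float) -> float:
    """Warehouse approach: upload and transformation are paid only once."""
    return prep_cost + reports * retrieval_cost


# Hypothetical cost units, chosen only for illustration.
PREP, RETRIEVE = 40.0, 5.0

for n in (1, 10, 100):
    direct = total_cost_direct(n, PREP, RETRIEVE)
    dw = total_cost_warehouse(n, PREP, RETRIEVE)
    print(f"{n:>3} reports: direct {direct / n:.2f} per report, "
          f"warehouse {dw / n:.2f} per report")
```

Under these assumptions the two approaches cost the same for a single report, and the per-report cost of the warehouse approach falls as more reports share the one-time preparation.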


Farmers and Explorers

Every corporation has two types of DW users.

Farmers know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.

Explorers are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless nuggets.

Cost justification for the DW is usually done on the basis of the results obtained by farmers since explorers are unpredictable.


1-5: Data Marts and the Data Warehouse

[Figure: Legacy Systems, four Operational Data Stores, the Organizational Data Warehouse, and the Finance, Accounting, Marketing and Sales Data Marts.]

Legacy systems feed data to the warehouse.

The warehouse feeds specialized information to departments.


The Data Mart is More Specialized

[Figure: the Organizational Data Warehouse with the Finance, Accounting, Marketing and Sales Data Marts.]

Data Marts: departmentalized; summarized, aggregated data; star join design; limited historical data; limited data volume; requirements-driven data; focused on departmental needs; multi-dimensional DBMS technologies.

Organizational Data Warehouse: corporate; highly granular data; normalized design; robust historical data; large data volume; data-model-driven data; versatile; general-purpose DBMS technologies.

The data mart serves the needs of one business unit, not the organization.
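
To illustrate what a star join design with summarized, aggregated data can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_month`), the columns and the sample rows are hypothetical; the slides do not prescribe any particular schema.

```python
import sqlite3

# In-memory database standing in for a small sales data mart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_month   (month_id   INTEGER PRIMARY KEY, label    TEXT);
-- Central fact table: one row per sale, keyed to the dimension tables.
CREATE TABLE fact_sales  (product_id INTEGER, month_id INTEGER, revenue REAL);

INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Music');
INSERT INTO dim_month   VALUES (1, '2003-01'), (2, '2003-02');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 1, 50.0),
                               (1, 2, 150.0), (2, 1, 80.0);
""")

# Star join: the fact table is joined to each dimension, then aggregated.
summary = conn.execute("""
SELECT p.category, m.label, SUM(f.revenue) AS revenue
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_month   AS m ON m.month_id   = f.month_id
GROUP BY p.category, m.label
ORDER BY p.category, m.label
""").fetchall()

for category, label, revenue in summary:
    print(category, label, revenue)
```

The fact table sits at the center and each dimension hangs off it by a single key, which is what makes aggregation by category or month a simple grouped join.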


1-6: Foundations of Data Mining

Data mining is the process of using raw data to infer important business relationships.

Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.

It is a collection of powerful techniques intended for analyzing large datasets.

There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.


1-7: The Roots of Data Mining

The approach has roots in practice dating back over 30 years.

In the 1960s, data mining was called statistical analysis, and the pioneers were the developers of statistical software packages such as SAS and SPSS.

By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.


A General Approach

Although all data mining endeavors are unique, they possess a common set of process steps:

1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools

2. Exploration – looking at summary data, sampling and applying intuition

3. Analysis – each discovered pattern is analyzed for significance and trends


A General Approach (continued)

4. Interpretation – Once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.

5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.


1-8: The Approach to Data Exploration and Data Mining

[Figure: three plots of A against B, titled A Perfect Correlation, A Strong Correlation and A Weak Correlation.]

The basis for all data mining activities is correlation.


The Spectrum of Correlation

In general, a correlation coefficient is a number whose magnitude lies between 0 and 1 and shows the strength of a relationship.

Some types of correlation are signed (±), ranging from -1 to +1, to also show the direction of the relationship.

Even a weak correlation can be interesting, however, if it shows a trend over time.

1 = Perfect Correlation; 0.5 = Moderate Correlation; 0 = No Correlation
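
One widely used signed coefficient is Pearson's r. The slides do not name a specific coefficient, so the following is only an illustrative sketch; the function name `pearson` and the sample data are assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's r: a signed coefficient between -1 and +1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


a = [1, 2, 3, 4]
rising = [2, 4, 6, 8]    # moves exactly with a
falling = [8, 6, 4, 2]   # moves exactly against a

r_up = pearson(a, rising)
r_down = pearson(a, falling)
# The sign gives the direction; the absolute value gives the strength.
print(round(r_up, 6), round(r_down, 6), round(abs(r_down), 6))
```

Taking the absolute value of a signed coefficient maps it onto the 0 to 1 strength scale shown above.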


Methods to Determine Correlation

[Figure: element A compared with one or more elements B for each of the methods listed below.]

Data element vs. data element

Data element vs. unit of time

Data element vs. data element groups

Data element vs. geography

Data element vs. external trends

Data element vs. demographics

The method used depends on the type of elements being correlated.
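
As one example of how the element type changes the method: when a numeric data element is compared with a categorical one such as geography, a measure like the correlation ratio (eta squared) can be used in place of a coefficient meant for two numeric series. This choice of measure is an assumption made for illustration, not something the slides specify, and the region names and sales figures are hypothetical.

```python
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta squared: share of the variation in values explained by category.

    Returns a number between 0 (category tells us nothing) and 1
    (category determines the value completely).
    """
    if len(categories) != len(values) or not values:
        raise ValueError("need two equally long, non-empty sequences")
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0  # no variation at all, so nothing to explain
    # Variation of the group means around the grand mean, weighted by size.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total


regions = ["North", "North", "South", "South"]
strong = correlation_ratio(regions, [10, 10, 20, 20])  # sales depend on region
none = correlation_ratio(regions, [10, 20, 10, 20])    # sales ignore region
print(strong, none)
```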


The Data Warehouse and Data Mining

Data mining does not require the use of a warehouse, but a warehouse may be the best foundation for mining.

If multiple analyses are run in sequence, the data need to be held constant (as in a DW). In an operational database, data change often.

Also important is that the data in the DW are integrated and stable.
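
The need to hold data constant can be sketched as follows. The `operational` list stands in for a live transaction table and `snapshot` for its warehouse copy; both names and all values are hypothetical.

```python
import copy

# Hypothetical live operational data.
operational = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 50.0},
]

# Warehouse-style copy, taken once and then left alone.
snapshot = copy.deepcopy(operational)


def total(rows):
    return sum(r["amount"] for r in rows)


def average(rows):
    return total(rows) / len(rows)


# First analysis runs against the snapshot.
first_total = total(snapshot)

# Meanwhile the operational system keeps changing.
operational.append({"order_id": 3, "amount": 30.0})
operational[0]["amount"] = 120.0

# Second analysis still sees the same data as the first one.
second_average = average(snapshot)

print(first_total, second_average, total(operational))
```

Because both analyses read the snapshot, their results agree with each other even though the operational total has moved in the meantime.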


Volumes of Data – The Biggest Challenge

The largest challenge a data miner may face is the sheer volume of data in the warehouse.

It is quite important, then, that summary data also be available to get the analysis started.

A major problem is that this sheer volume may mask the important relationships the analyst is interested in.

The ability to overcome the volume and visualize the data becomes quite important.
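
A minimal sketch of getting an analysis started on a large table: build summary data per group and draw a small random sample for closer inspection. The helper names (`summarize`, `draw_sample`), the field names and the generated rows are hypothetical stand-ins for real warehouse data.

```python
import random
from collections import defaultdict


def summarize(rows, key, value):
    """Count and mean of `value` for each distinct `key`."""
    acc = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for row in rows:
        entry = acc[row[key]]
        entry["count"] += 1
        entry["sum"] += row[value]
    return {
        k: {"count": e["count"], "mean": e["sum"] / e["count"]}
        for k, e in acc.items()
    }


def draw_sample(rows, k, seed=0):
    """Random sample without replacement; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Generated stand-in for a large fact table.
rows = [
    {"region": "East" if i % 2 == 0 else "West", "amount": float(i)}
    for i in range(1000)
]

summary = summarize(rows, "region", "amount")
subset = draw_sample(rows, 50)

print(summary)
print(len(subset))
```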


1-9: Foundations of Data Visualization

One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.

Modern visualization techniques grew from the twin technologies of computer graphics and high-performance computing in the 1970s and 1980s.

One computer scientist who saw this trend arising, as early as the 1950s, was Douglas Engelbart.


Dr. John Snow used a map to show the source of cholera was a water pump, thus proving the disease was waterborne.

[Map label: Broad Street Pump]


Opportunity and Timing

Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.

In the 1970s, flight simulators became much more realistic when graphics replaced film.

In the same decade, special effects computers became entrenched in the entertainment industry.

In the 1980s, visualization grew more dynamic with applications like the animation of Los Angeles smog patterns.


One of today’s more useful types of visualization is in simulators (both in games and in practice).

This is the only way most of us will ever fly a Boeing 747.


It is now both cheaper and safer to train commercial pilots on simulators.

With good software, pilots can be placed in situations they may not ever see – until too late – in the cockpit.


A Sequence of Frames Animating LA Smog

Day 1 Swirling Winds – Light Smog Particles

Day 2 Offshore Winds – Moderate Smog Particles

Day 3 Head-on View of Smog Particles and Streamlines


Number Crunching With a Difference

In the 1990s, rapid advances in chip technology, in both the CPU and the graphics processor, put data visualization everywhere.

Imagine trying to understand DNA sequences from just the numbers!

On the next slide, a Mapuccino display helps us see where the results from a text search come from.
