chapter one(2004)

8/2/2019 Chapter One(2004)


Department of Information Technology

Course Title: Data Warehousing and Data Mining

Course No: IT4204

Credit Hr: 3

By: TWW


Chapter One: Introduction

In this chapter we briefly cover the following topics:

Motivation: Why data mining?

What is data mining & data warehousing?

Architecture of a typical DM system

Why DM: Application of data mining

Data Mining: On what kind of data?

Data mining functionalities

Are all the patterns interesting?

Classification of data mining systems

Major issues in data mining


Motivation: Necessity is the Mother of Invention

Our capacity to generate and collect data has increased rapidly over the last several decades.

A huge amount of data is now available at our fingertips.

It is predicted that more data will be produced in the next year than has been generated during the entire existence of humankind!

A study by IBM predicted that by 2005 the amount of information on the planet would increase from 3.2 million exabytes to 43 million exabytes (1 exabyte = 10^18 bytes, roughly 2^60 bytes).


Motivation:


Contributing factors include:

Widespread use of bar codes for most commercial products

Computerization of many business, scientific, and governmental transactions

Advances in data collection tools (audio, video, satellite remote sensing, scanning, image capturing tools)

Use of the WWW as a global information system (the Internet in general)

Development of comprehensive application software

New computing and storage technologies


Motivation:


All of this has made it easier to create, collect, and store all types of data.

As a result, it creates a problem known as data explosion.

Data explosion is the problem of having a huge amount of data in an enterprise, generated by automated data collection tools and mature database technology and stored in databases, data warehouses, and other information repositories, all of which has to be processed in order to make decisions.

As the size of the data grows, analyzing it becomes very difficult.


Motivation:


Data can be managed and stored in

structured relational databases;

semi-structured file systems, such as e-mail;

unstructured fixed content, like documents and graphic files.

Companies rely on this enterprise data to improve decision making and to gain a competitive advantage.

Data has indeed become a highly valued business asset.

The huge amount of data exceeds the human ability to comprehend it and to make the best decisions without tools.

The generation and storage of large volumes of data have reached a critical mass, and appropriate tools for comprehending the data have become vital.


Motivation:


We are drowning in data, but starving for knowledge!

The Solution: Data warehousing and data mining

Data mining can be viewed as a result of the natural evolution of information technology.

This becomes clearer if we look at the evolution of database technology since the 1960s.


Motivation:

Necessity is the Mother of Invention

1960s:

Known as the era of primitive file processing

Activities included

Data collection,

database creation (No DBMS),

Information management system (IMS), mainly using COBOL

1970s:

Relational data model, relational DBMS implementation

Data modeling tools like ER diagram

Indexing and data organization techniques such as B+ tree, hashing, etc

Query language such as SQL

User interfaces, forms and reports

Query processing and optimization techniques

Transaction management: recovery, concurrency control, etc

Online Transaction processing (OLTP)


Motivation:


1980s:

Period of advanced DB systems

Advanced data models (extended-relational, object-oriented, object-relational, deductive, etc.)

Application-oriented DBMS (spatial, temporal, multimedia, active, scientific, engineering, knowledge base, etc.)

1990s-2000s:

Data mining and data warehousing, knowledge discovery, OLAP, and Web-based databases


What is Data Mining & Data warehousing?

Data mining is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases (data warehouses).

The term "data mining" is a misnomer, as it does not directly describe what it does.

For example, mining gold from rock is called gold mining, not rock mining.

Similarly, oil mining is mining oil from the ground.

Data mining would be better described as "knowledge mining from data" rather than data mining.

Anyway, we will use the term with this understanding.



Alternative names

Knowledge discovery (mining) in databases (KDD),

knowledge extraction,

data/pattern analysis,

data archeology,

data dredging,

information harvesting,

business intelligence, etc.

Note that query processing systems, expert systems (knowledge-based systems), and information retrieval systems are not data mining systems.



Data can now be stored in many different types of databases.

A special database architecture that has recently emerged is the data warehouse.

A data warehouse is a repository of data from multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate the management decision-making process.

Data warehouse technology includes

Data cleansing (removing noise and inconsistent data)

Data integration (combining multiple data sources into one data warehouse)

On-Line Analytical Processing (OLAP)



OLAP is an analysis technique with functionalities such as

Summarization

Consolidation

Aggregation, as well as

The ability to view information from different angles
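The aggregation and "different angles" ideas can be illustrated with a small roll-up over a fact table. This is only a sketch under stated assumptions: the `SALES` records, the dimension names, and the `roll_up` helper are all hypothetical and stand in for what a real OLAP server would do over a data warehouse.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, quarter, product, amount)
SALES = [
    ("North", "Q1", "PC", 1200),
    ("North", "Q1", "Printer", 300),
    ("North", "Q2", "PC", 900),
    ("South", "Q1", "PC", 700),
    ("South", "Q2", "Printer", 400),
]

# Position of each dimension inside a fact tuple (illustrative names).
DIMENSIONS = {"region": 0, "quarter": 1, "product": 2}


def roll_up(facts, *dims):
    """Sum the sales amount, grouped by the chosen dimensions."""
    idx = [DIMENSIONS[d] for d in dims]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[i] for i in idx)
        totals[key] += fact[3]
    return dict(totals)


by_region = roll_up(SALES, "region")                     # one angle
by_region_quarter = roll_up(SALES, "region", "quarter")  # a finer angle
grand_total = roll_up(SALES)                             # full consolidation
```

Calling `roll_up` with fewer dimensions gives a more summarized view, and calling it with more dimensions drills down into the same data.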

Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as

Data classification

Clustering, and

Characterization of data changes over time

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a "data rich but information poor" situation.



Data mining (Knowledge Discovery in Databases) consists of an iterative sequence of the following seven steps:

1. Learning the application domain

Learn the relevant prior knowledge to be used in the DM process

Learn the goals of the DM application

2. Creating a target data set (data selection)

Data cleaning and preprocessing (may take 60% of the effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

3. Choosing the functions of data mining

Summarization, classification, regression, association, clustering



4. Choosing the mining algorithm(s)

Data mining: search for patterns of interest

5. Identifying the relevant knowledge by measuring the interestingness of the mining results

Pattern evaluation

6. Presenting the knowledge to the user

Knowledge presentation: visualization, transformation, removing redundant patterns, etc.

7. Using the discovered knowledge
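The data-handling steps (2 to 5) can be sketched as a chain of small functions. This is a toy illustration, not a prescribed implementation: the `RAW` records, the function names, and the "count frequent combinations" mining step are assumptions chosen to keep the example short.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, item); None marks a missing value.
RAW = [
    (1, 25, "PC"), (2, None, "PC"), (3, 31, "Printer"),
    (3, 31, "Printer"), (4, 27, "PC"), (5, 45, "Scanner"),
]


def clean(records):
    """Step 2, cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in records:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out


def transform(records):
    """Step 2, reduction and transformation: keep (age band, item) only."""
    return [(f"{age // 10 * 10}s", item) for _, age, item in records]


def mine(records):
    """Steps 3-4: a deliberately simple mining function that counts patterns."""
    return Counter(records)


def evaluate(patterns, min_count=2):
    """Step 5: keep only patterns that reach an interestingness threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}


patterns = evaluate(mine(transform(clean(RAW))))
```

In practice the process is iterative: if `patterns` turns out to be empty or uninformative, one would go back and adjust the selection, the transformation, or the threshold.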

April 7, 2012

Data mining Models

The above seven steps are usually standardized as process models.

The most common standard in data mining is the CRoss-Industry Standard Process for Data Mining (CRISP-DM).

CRISP-DM has the following steps:

Business/research understanding (learning the application domain)

Data understanding (data selection for the problem)

Data preparation, which involves collecting, cleaning, consolidating and amalgamating records, summarizing fields, checking for data integrity, detecting irregularities and illegal attributes, filling in missing values, and trimming outliers

Modeling, which involves selecting data mining tools, transforming the data if the tools require it, generating samples for training and testing the model, and finally using the tools to build and select a model

Evaluating the model

Deploying the model


Data Mining: A KDD Process

Data mining: the core of knowledge discovery in databases

[Figure: the KDD process. Stages shown: Databases; Data Cleaning; Data Integration; Data Warehouse; Selection & Transformation; Task-relevant Data; Data Mining; Pattern Evaluation.]


Data Mining and Business Intelligence

[Figure: layered view of business intelligence. The potential to support business decisions increases from the bottom layer to the top layer. Layers, bottom to top:

Data Sources (paper, files, information providers, database systems, OLTP)

Data Warehouses / Data Marts

Data Exploration; OLAP; Statistical Analysis, Querying and Reporting

Data Mining (information discovery)

Data Presentation (visualization techniques)

Making Decisions

Roles shown alongside the layers: DBA, Data Analyst, Business Analyst, End User.]


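Architecture of a Typical Data Mining System

[Figure: layered architecture. Components shown, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Databases and Data Warehouse, reached through data cleaning, data integration, and filtering. A Knowledge base is shown alongside these components.]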


Why Data Mining?

Potential Applications

Data mining has many and varied fields of application, some of which are listed below.

1 Retail/Marketing

2 Banking

3 Insurance and Health Care

4 Transportation

5 Medicine


Potential Applications: Retail/Marketing Analysis

Market analysis is a tool companies use in order to better understand the

environment in which they operate.

It is one of the main steps in the development of a marketing plan which

involves critically reviewing and organizing collected data so that it can be

used in making strategic marketing decisions

Retailers can use information collected through affinity programs (e.g., shoppers' club cards, frequent flyer points, contests) to assess the effectiveness of product selection and placement decisions and coupon offers, and to find out which products are often purchased together.



Companies such as banking service providers and music clubs can use data mining to perform churn analysis, assessing which customers are likely to remain as subscribers and which ones are likely to switch to a competitor.

The source of data can be credit card transactions, loyalty cards, discount coupons, customer complaint calls, and (public) lifestyle studies.



The potential applications include

Target marketing analysis, which involves finding clusters of "model" customers who share the same characteristics (interests, income level, spending habits, etc.) in order to address a specific market

Predicting the response to mailing, calling, and advertising campaigns

Determining customer purchasing patterns over time: finding patterns in what customers buy, based on an attribute or a combination of attributes such as marital status, age, salary, citizenship, type of credit card, family condition, etc.

Cross-market analysis: associations/correlations between product sales, and prediction based on the association information




Customer profiling

what types of customers buy what products (clustering or classification)

Find associations among customer demographic characteristics (age, name,

address, profession, sex, etc)

Identify buying patterns from customers

Identifying customer requirements

identifying the best products for different customers

use prediction to find what factors will attract new customers

Provides summary information

various multidimensional summary reports for decision making

statistical summary information (data central tendency and variation)




Market basket analysis:

identifying groups of items that customers usually buy together

Market segmentation:

gathering demographic, geographic, behavioral, and psychographic information about customers and clustering them for proper handling

Risk analysis and management:

forecasting, customer retention, improved underwriting, quality control, competitive analysis


Potential Applications: Fraud detection and management

Detecting outliers and managing them before they damage the organization's environment

Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

Use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances



Examples

Auto insurance: detect groups of people who stage accidents to collect on insurance

Money laundering: detect suspicious money transactions in a banking network

Medical insurance: detect professional patients, rings of doctors, and rings of references

Detecting inappropriate medical treatment

Detecting telephone fraud: detect users of a telephone line that was either hijacked or stolen from a customer



Examples

Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.

A telecom company can identify discrete groups of callers with frequent intra-group calls, especially on mobile phones, and so break up a possible multimillion-dollar fraud.

Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.


Potential Applications: Banking

Detect patterns of fraudulent credit card use

Identify "loyal" customers

Predict customers likely to change their credit card affiliation

Determine credit card spending by customer groups

Find hidden correlations between different financial indicators

Identify stock trading rules from historical market data


Potential Applications: Insurance and Health Care

Claims analysis, i.e., which medical procedures are claimed together

Predict which customers will buy new policies

Identify behavior patterns of risky customers

Identify fraudulent behavior


Potential Applications: Transportation

Determine the distribution schedules among outlets

Analyze loading patterns


Potential Applications: Others

Text mining (news group, email, documents) and Web analysis.

Intelligent query answering

Sports

Astronomy


Potential Applications: Corporate Analysis and Risk Management

Finance planning and asset evaluation:

cash flow analysis and prediction

contingent claim analysis to evaluate assets

cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:

summarize and compare the resources and spending

Competition:

monitor competitors and market directions

group customers into classes and set a class-based pricing procedure

set pricing strategy in a highly competitive market


Data Mining: On What Kind of Data?

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW


Data Mining Functionalities

Data mining can be performed on various types of data stores and databases.

Data mining functionalities are used to specify the kinds of patterns to be found in a data mining task.

The kinds of patterns that can be mined from a given data set are usually not known to the user in advance.

Techniques should therefore be implemented to extract various patterns from the available data so that users can choose what they need.

There are different kinds of data mining functionalities (tasks) that can be used to extract various types of patterns from data.



Generally, data mining tasks can be broadly classified as

Descriptive (unsupervised)

Predictive (supervised)

Descriptive data mining tasks characterize the general properties of the data in a database.

This kind of data mining is usually unsupervised.

Predictive data mining tasks perform inference on the current data in order to make predictions about future data.

This is usually a supervised technique.



The supervised predictive data mining functionalities include

Classification

Regression

Time series

Estimation

The unsupervised descriptive data mining functionalities include

Concept/class description: characterization (summarization) and discrimination

Association analysis

Cluster analysis

Outlier analysis

Evolution analysis


Data Mining Functionalities: Classification

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of an object whose class is unknown.

The derived model is based on a training data set and can be represented in various forms, such as classification IF-THEN rules, decision trees, mathematical formulae, or neural networks.

Prediction is the process of predicting some missing or unavailable data values rather than class labels.

Finding models (functions) that describe and distinguish classes or concepts for future prediction
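As a minimal sketch of deriving an IF-THEN rule from a training set, the example below learns a single income threshold. The training data, the attribute, and the helper names are hypothetical; real classifiers such as decision trees generalize this idea to many attributes and nested tests.

```python
# Hypothetical training set: (income in thousands, class label "buys PC?")
TRAIN = [(18, "no"), (22, "no"), (25, "no"), (31, "yes"), (40, "yes"), (52, "yes")]


def learn_threshold_rule(examples):
    """Pick the income threshold that misclassifies the fewest training examples.

    The learned model is the rule: IF income >= threshold THEN "yes" ELSE "no".
    """
    best_threshold, best_errors = None, len(examples) + 1
    for threshold, _ in examples:
        errors = sum(
            (income >= threshold) != (label == "yes") for income, label in examples
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(income, threshold):
    """Apply the learned IF-THEN rule to an object whose class is unknown."""
    return "yes" if income >= threshold else "no"


threshold = learn_threshold_rule(TRAIN)
```

With this training set the rule separates the two classes without error, so `classify(45, threshold)` falls on the "yes" side and `classify(20, threshold)` on the "no" side.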


Data Mining Functionalities: Regression

Regression is the process of predicting some missing or unavailable numeric data values rather than class labels.

It finds models (functions) that describe how a continuous value depends on other attributes, for use in future prediction.
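A simple instance of regression is fitting a straight line by ordinary least squares and using it to predict an unseen value. The data below (years of experience versus salary in thousands) is made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience and salary (in thousands).
YEARS = [1, 2, 3, 4, 5]
SALARY = [30.0, 35.0, 40.0, 45.0, 50.0]

slope, intercept = fit_line(YEARS, SALARY)
predicted = slope * 6 + intercept  # predict the missing value for year 6
```

Unlike classification, the output here is a number on a continuous scale rather than a class label.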



Data Mining Functionalities: Concept/class description: Characterization and discrimination

Given a class (or classes) together with the data that belongs to it, describe the class by observing its members.

Hence one can describe individual classes or concepts in summarized, concise, and yet precise terms; this is called class/concept description.

These descriptions can be derived via data characterization, data discrimination, or both.

Data characterization refers to summarizing the data of the class under consideration (the target class) in general terms.

For example, one may characterize the item class as a class in which 90% of the objects are computers and their peripherals.

Data discrimination is a description obtained by comparing the target class with one or more other classes (the contrasting classes).

For example, one may discriminate the item class from other classes, such as the customer and order classes, by saying that the item class attributes get modified more frequently than the others.


Data Mining Functionalities: Association Analysis

Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

Association rules are of the form X ⇒ Y [support = s%, confidence = c%], where X and Y are conjunctions of attribute-value conditions (predicates), interpreted as "if X holds, then Y is likely to hold, with support s% and confidence c%".

For example:

age(X, 20..29) ∧ income(X, 20..29K) ⇒ buys(X, PC) [support = 2%, confidence = 60%]

This is interpreted as: anyone whose age ranges from 20 to 29 and whose income ranges from 20K to 29K is likely to buy a PC, with support 2% and confidence 60%. (Inside the predicates, X is a variable standing for a customer.)

Support is the probability that all the predicates in X and Y are fulfilled together, i.e., P(X ∪ Y).

Confidence is the probability that the predicates in Y are fulfilled given that the predicates in X are fulfilled, i.e., P(Y | X).



Data Mining Functionalities: Association Analysis

Example 2:

contains(T, computer) ⇒ contains(T, software) [1%, 75%]

This is interpreted as: if a transaction T contains a computer, it is also likely to contain software, with support 1% and confidence 75%.

In the above two examples, age, income, buys, and contains are called attributes or predicates.

A predicate that appears after the implication sign plays the role of the value being predicted.

An association rule can be multi-dimensional (more than one distinct predicate across X and Y) or single-dimensional (one and the same predicate in both X and Y).

For example, the association rule in example 1 is multi-dimensional, whereas the rule in example 2 is single-dimensional.
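Support and confidence for a single-dimensional rule such as contains(T, computer) ⇒ contains(T, software) can be computed directly from transaction counts. The five transactions below are invented for illustration, so the resulting figures differ from the 1% and 75% quoted in the example; the function names are likewise assumptions.

```python
# Hypothetical transactions; each one is the set of items it contains.
TRANSACTIONS = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]


def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(transactions, antecedent | consequent) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent, i.e. an estimate of P(Y | X)."""
    both = count(transactions, antecedent | consequent)
    return both / count(transactions, antecedent)


s = support(TRANSACTIONS, {"computer"}, {"software"})     # 3 of 5 transactions
c = confidence(TRANSACTIONS, {"computer"}, {"software"})  # 3 of the 4 with a computer
```

Association mining algorithms search for all rules whose support and confidence exceed user-chosen minimum thresholds; this sketch only evaluates one given rule.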




Cluster analysis

In cluster analysis, class labels are unknown, and a set of data is given to be grouped.

Cluster analysis groups data to form new classes, e.g., clustering houses to find distribution patterns.

Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
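One widely used way to apply this principle is k-means, which the slides do not name; it is used here only as an illustrative technique. The sketch clusters hypothetical one-dimensional house prices (in thousands) starting from two assumed initial centroids.

```python
def k_means_1d(points, centroids, iterations=10):
    """Basic k-means on one-dimensional data with given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical house prices (in thousands); no class labels are given.
PRICES = [100, 110, 120, 300, 310, 320]

centroids, clusters = k_means_1d(PRICES, centroids=[100, 320])
```

Points inside each resulting cluster are close to one another (high intra-class similarity), while the two clusters are far apart (low inter-class similarity).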



Outlier analysis

A database may contain data objects that do not comply with the general behavior or model of the data.

These data objects are outliers.

Outliers are usually treated as noise or exceptions in many data mining applications.

However, in some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.

The analysis of outlier data is referred to as outlier mining.
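A very simple statistical approach to outlier mining flags values that lie far from the mean in units of standard deviation (a z-score test). This is one possible technique among many, chosen for brevity; the transaction amounts and the threshold of 2.0 are assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one value is far from the rest.
AMOUNTS = [20, 22, 19, 21, 23, 20, 18, 500]

outliers = find_outliers(AMOUNTS)
```

In a fraud detection setting the flagged value would be the interesting one to investigate, whereas in other applications it might simply be discarded as noise.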



Trend and evolution analysis

Describes and models regularities or trends for objects whose behavior changes over time.

Though this may include characterization, discrimination, association, or clustering of time-related data, distinct features of such an analysis include time series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

It is also referred to as regression analysis, sequential pattern mining, periodicity analysis, or similarity-based analysis.


Are All the Discovered Patterns Interesting?

A data mining system/query may generate thousands of patterns, and not all of them are interesting.

Questions

1. What makes a pattern interesting?

2. Can a data mining system generate all of the interesting patterns?

3. Can a data mining system generate only interesting patterns?

Answers to all three questions are given below.


Are All the Discovered Patterns Interesting?

Question 1: What makes a pattern interesting?

A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

An interesting pattern represents knowledge.

Interestingness measures

Two types (objective vs. subjective)

Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

Subjective: based on the user's beliefs about the data, e.g., unexpectedness (contradicting a user's belief), novelty, actionability, etc.


Can We Find All Interesting Patterns?

Question 2: Can a data mining system generate all of the interesting patterns?

This is referred to as the completeness of the data mining algorithm.

No single data mining system is complete, but users can set constraints on the type of pattern they are looking for, so that the data mining function generates all the patterns satisfying the specified constraints.

For example, association algorithms do not find classification patterns or other kinds of patterns.


Can We Find Only Interesting Patterns?

Question 3: Can a data mining system generate only interesting patterns?

This is an optimization problem in data mining systems.

It remains a challenging issue.

Approaches in this research topic:

First generate all the patterns and then filter out the uninteresting ones.

Generate only the interesting patterns (mining query optimization).


Data Mining: Confluence of Multiple Disciplines

[Figure: data mining shown at the center, surrounded by the disciplines it draws on: database technology, statistics, machine learning, visualization, information science, and other disciplines.]


Data Mining: Classification Schemes

General functionality

Descriptive data mining

Predictive data mining

Different views, different classifications

Kinds of databases to be mined

Kinds of knowledge to be discovered

Kinds of techniques utilized

Kinds of applications adapted


A Multi-Dimensional View of Data Mining Classification

Databases to be mined

Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Database-oriented, data warehouse (OLAP) oriented, machine learning, statistics, visualization, neural network, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


Major Issues in Data Mining

The major issues in data mining include the following aspects:

Mining methodology and user interaction

Performance and scalability

Issues related to the diversity of data types

Issues related to applications and social impacts

These are briefly discussed in the following slides.


Major Issues: Mining methodology and user interaction

Reflects the kind of knowledge mined, the ability to mine knowledge at different granularities, the use of domain knowledge, and knowledge visualization

Mining different kinds of knowledge in databases for different users: requires different data mining functionalities and algorithms

Interactive mining of knowledge at multiple levels of abstraction: makes it possible to verify whether the discovered knowledge is relevant for the intended purpose

Incorporation of background knowledge: background knowledge is vital for refining data mining performance and evaluating the relevance of results

Data mining query languages and ad hoc data mining: such a query language makes it possible to retrieve knowledge from a data warehouse, just as SQL retrieves data from a database in a DBMS

Presentation and visualization of data mining results: results should be presented visually or in a high-level natural language

Handling noise, incomplete data, and exceptions

Pattern evaluation: the interestingness problem


Major Issues: Performance and scalability

This includes

Efficiency and scalability of data mining algorithms

Parallel, distributed, and incremental mining methods for huge amounts of data


Major Issues: Issues related to the diversity of data types

This involves

Handling relational and complex types of data

Relational data needs efficient and effective data mining systems, as it is widely used

Complex data refers to data such as hypertext, multimedia, spatial, temporal, and transactional data, which require further attention before they can be mined

Mining information from heterogeneous databases and global information systems (WWW)

Using these as a source of data is a major issue in data mining


Major Issues: Issues related to applications and social impacts

Data mining is also concerned with issues related to its application and the impact it may have on society.

These include

Identifying the applications of discovered knowledge, which can be

Domain-specific data mining tools

Intelligent query answering

Process control and decision making

How to integrate the discovered knowledge with existing knowledge: a knowledge fusion problem

Protection of data security, integrity, and privacy, which may have social impact


Summary

Data mining: discovering interesting patterns from large amounts of data

A natural evolution of database technology, in great demand, with wide applications

A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

Mining can be performed on a variety of information repositories

Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Classification of data mining systems

Major issues in data mining
