issues in data mining infrastructure

72
Page 1/71 Issues in Data Mining Issues in Data Mining Infrastructure Infrastructure Authors: Nemanja Jovanovic, [email protected] Valentina Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected] http://galeb.etf.bg.ac.yu/~vm

Upload: christian-zurlo

Post on 30-Dec-2015

29 views

Category:

Documents


4 download

DESCRIPTION

Issues in Data Mining Infrastructure. Authors:Nemanja Jovanovic, [email protected] Valentina Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected] http://galeb.etf.bg.ac.yu/~vm. Data Mining in the Nutshell. Uncovering the hidden knowledge. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1/71

Issues in Data Mining Issues in Data Mining InfrastructureInfrastructure

Issues in Data Mining Issues in Data Mining InfrastructureInfrastructure

Authors: Nemanja Jovanovic, [email protected] Milenkovic, [email protected] Prof. Dr. Veljko Milutinovic, [email protected]

http://galeb.etf.bg.ac.yu/~vm

Page 2/71

Data Mining in the NutshellData Mining in the NutshellData Mining in the NutshellData Mining in the Nutshell

Uncovering the hidden knowledge

Huge n-p complete search space

Multidimensional interface

Page 3/71

A Problem …A Problem …A Problem …A Problem …

You are a marketing manager for a cellular phone company

Problem: Churn is too high

Bringing back a customer after quitting is both difficult and expensive

Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)

You pay a sales commission of 250$ per contract

Customers receive free phone (cost 125$) with contract

Turnover (after contract expires) is 40%

Page 4/71

… … A SolutionA Solution… … A SolutionA Solution

Three months before a contract expires, predict which customers will leave

If you want to keep a customer that is predicted to churn, offer them a new phone

The ones that are not predicted to churn need no attention

If you don’t want to keep the customer, do nothing

How can you predict future behavior?

Tarot Cards?

Magic Ball?

Data Mining?

Page 5/71

Still Skeptical?Still Skeptical?Still Skeptical?Still Skeptical?

Page 6/71

The DefinitionThe DefinitionThe DefinitionThe Definition

Automated

The automated extraction of predictive information from (large) databases

Extraction

Predictive

Databases

Page 7/71

History of Data MiningHistory of Data MiningHistory of Data MiningHistory of Data Mining

Page 8/71

Repetition in Solar ActivityRepetition in Solar ActivityRepetition in Solar ActivityRepetition in Solar Activity

1613 – Galileo Galilei

1859 – Heinrich Schwabe

Page 9/71

The Return of theThe Return of theHalley CometHalley Comet

The Return of theThe Return of theHalley CometHalley Comet

1910 1986 2061 ???

1531

1607

1682

239 BC

Edmund Halley (1656 - 1742)

Page 10/71

Data Mining is NotData Mining is NotData Mining is NotData Mining is Not

Data warehousing

Ad-hoc query/reporting

Online Analytical Processing (OLAP)

Data visualization

Page 11/71

Data Mining isData Mining isData Mining isData Mining is

Automated extraction of predictive informationfrom various data sources

Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

Page 12/71

Data Mining canData Mining canData Mining canData Mining can

Answer question that were too time consuming to resolve in the past

Predict future trends and behaviors, allowing us to make proactive, knowledge driven decision

Page 13/71

Focus of this PresentationFocus of this PresentationFocus of this PresentationFocus of this Presentation

Data Mining problem types

Data Mining models and algorithms

Efficient Data Mining

Available software

Page 14/71

Data MiningData Mining Problem Types Problem Types

Data MiningData Mining Problem Types Problem Types

Page 15/71

Data Mining Problem TypesData Mining Problem TypesData Mining Problem TypesData Mining Problem Types

6 types

Often a combination solves the problem

Page 16/71

Data Description and Data Description and SummarizationSummarization

Data Description and Data Description and SummarizationSummarization

Aims at concise description of data characteristics

Lower end of scale of problem types

Provides the user an overview of the data structure

Typically a sub goal

Page 17/71

SegmentationSegmentationSegmentationSegmentation

Separates the data into interesting and meaningful subgroups or classes

Manual or (semi)automatic

A problem for itself or just a step in solving a problem

Page 18/71

ClassificationClassificationClassificationClassification

Assumption: existence of objects with characteristics that belong to different classes

Building classification models which assign correct labels in advance

Exists in wide range of various application

Segmentation can provide labels or restrict data sets

Page 19/71

Concept DescriptionConcept DescriptionConcept DescriptionConcept Description

Understandable description of concepts or classes

Close connection to both segmentation and classification

Similarity and differences to classification

Page 20/71

Prediction (Regression)Prediction (Regression)Prediction (Regression)Prediction (Regression)

Similar to classification - difference:discrete becomes continuous

Finds the numerical value of the target attribute for unseen objects

Page 21/71

Dependency AnalysisDependency AnalysisDependency AnalysisDependency Analysis

Finding the model that describes significant dependences between data items or events

Prediction of value of a data item

Special case: associations

Page 22/71

Data Mining ModelsData Mining ModelsData Mining ModelsData Mining Models

Page 23/71

Neural NetworksNeural NetworksNeural NetworksNeural Networks

Characterizes processed data with single numeric value

Efficient modeling of large and complex problems

Based on biological structures Neurons

Network consists of neurons grouped into layers

Page 24/71

Neuron FunctionalityNeuron FunctionalityNeuron FunctionalityNeuron Functionality

I1

I2

I3

In

Output

W1

W2

W3

Wn

f

Output = f (W1*I1, W2*I1, …, Wn*In)Output = f (W1*I1, W2*I1, …, Wn*In)

Page 25/71

Training Neural NetworksTraining Neural NetworksTraining Neural NetworksTraining Neural Networks

Page 26/71

Neural Networks - ConclusionNeural Networks - ConclusionNeural Networks - ConclusionNeural Networks - Conclusion

Once trained, Neural Networks can efficiently estimate value of output variable for given input

Neurons and network topology are essentials

Usually used for prediction or regression problem types

Difficult to understand

Data pre-processing often required

Page 27/71

Decision TreesDecision TreesDecision TreesDecision Trees

A way of representing a series of rules that lead to a class or value

Iterative splitting of data into discrete groups maximizing distance between them at each split

CHAID, CHART, Quest, C5.0

Classification trees and regression trees

Unlimited growth and stopping rules

Univariate splits and multivariate splits

Page 28/71

Decision TreesDecision TreesDecision TreesDecision Trees

Balance>10 Balance<=10

Age<=32 Age>32

Married=NO Married=YES

Page 29/71

Decision TreesDecision TreesDecision TreesDecision Trees

Page 30/71

Rule InductionRule InductionRule InductionRule Induction

Method of deriving a set of rules to classify cases

Creates independent rules that are unlikely to form a tree

Rules may not cover all possible situations

Rules may sometimes conflict in a prediction

Page 31/71

Rule InductionRule InductionRule InductionRule Induction

If balance>100.000 then confidence=HIGH & weight=1.7

If balance>25.000 andstatus=married

then confidence=HIGH & weight=2.3

If balance<40.000 then confidence=LOW & weight=1.9

Page 32/71

K-nearest Neighbor and K-nearest Neighbor and Memory-Based Reasoning (MBR)Memory-Based Reasoning (MBR)

K-nearest Neighbor and K-nearest Neighbor and Memory-Based Reasoning (MBR)Memory-Based Reasoning (MBR)

Usage of knowledge of previously solved similar problems in solving the new problem

Assigning the class to the group where most of the k-”neighbors” belong

First step – finding the suitable measure for distance between attributes in the data

How far is black from green?

+ Easy handling of non-standard data types

- Huge models

Page 33/71

K-nearest Neighbor and K-nearest Neighbor and Memory-Based Reasoning (MBR)Memory-Based Reasoning (MBR)

K-nearest Neighbor and K-nearest Neighbor and Memory-Based Reasoning (MBR)Memory-Based Reasoning (MBR)

Page 34/71

Data Mining Models Data Mining Models and Algorithmsand Algorithms

Data Mining Models Data Mining Models and Algorithmsand Algorithms

Logistic regression

Discriminant analysis

Generalized Adaptive Models (GAM)

Genetic algorithms

Etc…

Many other available models and algorithms

Many application specific variations of known models

Final implementation usually involves several techniques

Selection of solution that match best results

Page 35/71

Efficient Data MiningEfficient Data Mining Efficient Data MiningEfficient Data Mining

Page 36/71

Don’t Mess With It!

YES NO

YES

You Shouldn’t Have!

NO

Will it ExplodeIn Your Hands?

NO

Look The Other Way

Anyone ElseKnows? You’re in TROUBLE!

YESYES

NO

Hide ItCan You Blame Someone Else?

NO

NO PROBLEM!

YES

Is It Working?

Did You Mess With It?

Page 37/71

DM Process ModelDM Process ModelDM Process ModelDM Process Model

CRISP–DM – tends to become a standard

5A – used by SPSS Clementine(Assess, Access, Analyze, Act and Automate)

SEMMA – used by SAS Enterprise Miner(Sample, Explore, Modify, Model and Assess)

Page 38/71

CRISP - DMCRISP - DMCRISP - DMCRISP - DM

CRoss-Industry Standard for DM

Conceived in 1996 by three companies:

Page 39/71

CRISP – DM methodologyCRISP – DM methodologyCRISP – DM methodologyCRISP – DM methodology

Four level breakdown of the CRISP-DM methodology:

Phases

Generic Tasks

Process Instances

Specialized Tasks

Page 40/71

Mapping generic modelsMapping generic modelsto specialized modelsto specialized models

Mapping generic modelsMapping generic modelsto specialized modelsto specialized models

Analyze the specific context

Remove any details not applicable to the context

Add any details specific to the context

Specialize generic context according toconcrete characteristic of the context

Possibly rename generic contents to provide more explicit meanings

Page 41/71

Generalized and Specialized Generalized and Specialized CookingCooking

Generalized and Specialized Generalized and Specialized CookingCooking

Preparing food on your own Find out what you want to eat

Find the recipe for that meal

Gather the ingredients

Prepare the meal

Enjoy your food

Clean up everything (or leave it for later)

Raw stake with vegetables?

Check the Cookbook or call mom

Defrost the meat (if you had it in the fridge)

Buy missing ingredients or borrow the from the neighbors

Cook the vegetables and fry the meat

Enjoy your food or even more

You were cooking so convince someone else to do the dishes

Page 42/71

CRISP – DM modelCRISP – DM modelCRISP – DM modelCRISP – DM model

Business understanding

Data understanding

Data preparation

Modeling

Evaluation

Deployment

Business understanding

Data understanding

Datapreparation

ModelingEvaluation

Deployment

Page 43/71

Customizing a Web PageCustomizing a Web PageCustomizing a Web PageCustomizing a Web Page

User-friendly design

Prediction of the users interests

Reduction of server workload

Reduction of Web traffic

Page 44/71

Customizing a Web PageCustomizing a Web PageCustomizing a Web PageCustomizing a Web Page

Page 45/71

Business UnderstandingBusiness UnderstandingBusiness UnderstandingBusiness Understanding

Determine business objectives

Assess situation

Determine data mining goals

Produce project plan

Page 46/71

Business Understanding - OutputsBusiness Understanding - OutputsBusiness Understanding - OutputsBusiness Understanding - Outputs

Background

Business objectives and success criteria

Inventory of resources

Requirements, assumptions, and constrains

Risks and contingencies

Terminology

Costs and benefits

Data mining goals and success criteria

Project plan

Initial assessment of tools and techniques

Page 47/71

Customizing a Web Page – Customizing a Web Page – Business Understanding ExampleBusiness Understanding Example

Customizing a Web Page – Customizing a Web Page – Business Understanding ExampleBusiness Understanding Example

Business objectives

Assess situation

Data mining goals

Project plan

Make the users surfing more comfortable

Decrease of overhead for users

Reduction of workload and Web traffic

Make the users surfing more comfortable

Decrease of overhead for users

Reduction of workload and Web traffic

Find the patterns in the user behavior

Page 48/71

Data UnderstandingData UnderstandingData UnderstandingData Understanding

Collect initial data

Describe data

Explore data

Verify data quality

Page 49/71

Data Understanding - OutputsData Understanding - OutputsData Understanding - OutputsData Understanding - Outputs

Data collection report

Data description report

Data exploration report

Data quality report

Background of data

List of data sources

For each data source, method of acquisition

Problems encountered in data acquisition

Detailed description of each data source

List of tables or other database objects

Description of each field including units, codes, etc.

Expected regularities or patterns and methods of detection

Regularities or patterns found (expected and unexpected)

Any other surprises

Conclusions for data transformation, data cleaning and any other pre-processing

Conclusions related to data mining goals or business objectives

Approach taken to assess data quality

Results of data quality assessment

Page 50/71

Customizing a Web Page – Customizing a Web Page – Data Understanding ExampleData Understanding ExampleCustomizing a Web Page – Customizing a Web Page –

Data Understanding ExampleData Understanding Example

Collecting the data

Data description

Results of data exploring

Verification of the quality of the data

Update the server to monitor user behavior

Record the users activities into a storage

Analyze recorded data

Decide which data is usable for mining

Page 51/71

Data PreparationData PreparationData PreparationData Preparation

Select data

Clean data

Construct data

Integrate data

Format data

Page 52/71

Data Preparation - OutputsData Preparation - OutputsData Preparation - OutputsData Preparation - Outputs

Dataset description report

Background including broad goals andplan for pre-processing

Description of pre-processing

Detailed description of resultant datasets

Rational for inclusion/exclusion of attributes

Discoveries made during pre-processing and implications for further work

Dataset

Page 53/71

Customizing a Web Page – Customizing a Web Page – Data Preparation ExampleData Preparation ExampleCustomizing a Web Page – Customizing a Web Page – Data Preparation ExampleData Preparation Example

Decide from what period will the users monitored actions be considered

Make assumptions about unnecessary monitored data and discard them

Classify user actions into categories, group interesting links, etc…

If more information about user is available from other sources, use them

Transform data into suitable forms so several modeling techniques could be applied

Page 54/71

ModelingModelingModelingModeling

Select modeling technique

Generate test design

Build model

Assess model

Page 55/71

Modeling - OutputsModeling - OutputsModeling - OutputsModeling - Outputs

Assessment of DM results with respect to business success criteria

Test design

Model description

Model assessment

Broad description of the type of model and the training data to be used

Explanation of how the model will be tested or assessed

Description of any data required for testing

Description of any planned examination of modelsby domain or data experts

Type of model and relation to data mining goals

Parameter settings used to produce model

Detailed description of the model and any special features

Conclusions regarding patterns in the data

Overview of assessment process including deviations from the test plan

Detailed assessment of the model

Comments on models by domain or data experts

Insights into why a certain modeling technique and certain parameter setting lead to good/bad results

Page 56/71

Customizing a Web Page – Customizing a Web Page – Modeling ExampleModeling Example

Customizing a Web Page – Customizing a Web Page – Modeling ExampleModeling Example

The problem is prediction of behavior

Regression could be a good solution due to distinct nature of the data

Create the software according to the project plan

Observe the behavior of the software

Tune the model after each evaluation phase if needed

Page 57/71

EvaluationEvaluationEvaluationEvaluation

Evaluate results

Review process

Determine next steps

results = models + findings

Page 58/71

Evaluation - OutputsEvaluation - OutputsEvaluation - OutputsEvaluation - Outputs

Assessment of DM results with respect to business success criteria

Review of process

List of possible actions

Review of Business Objectives andBusiness Success Criteria

Comparison between success criterion and DM results

Conclusion about achievability of success criterion and suitability of data mining process

Review of “Project Success”

Are there new business objectives?

Page 59/71

Customizing a Web Page – Customizing a Web Page – Evaluation ExampleEvaluation Example

Customizing a Web Page – Customizing a Web Page – Evaluation ExampleEvaluation Example

Observe the model behavior at work

Collect response from Beta testers

Check user satisfaction

Check server and network engagement

Classify results

Determine which parameter of the model should be changed

Present new ideas and modifications

Step back into previous phases as needed

Page 60/71

DeploymentDeploymentDeploymentDeployment

Plan deployment

Plan monitoring and maintenance

Produce final report

Review project

Page 61/71

Deployment - OutputsDeployment - OutputsDeployment - OutputsDeployment - Outputs

Monitoring and maintenance plan

Final report Overview of deployment results and indicationwhich of results may require updating

Description of how updating will be triggered

Description of how updating will be performed

Summary of Business Understanding(background, objectives and success criteria)

Summary of data mining process

Summary of data mining results

Summary of results evaluation

Summary of deployment and maintenance plan

Cost/benefit analysis

Conclusions for the business

Conclusions for future data mining

Page 62/71

Customizing a Web Page – Customizing a Web Page – Deployment ExampleDeployment Example

Customizing a Web Page – Customizing a Web Page – Deployment ExampleDeployment Example

Make the feature available to all users

Make plan for maintenance and user feedback

Analyze costs and benefits

Summarize the whole documentation

Summarize network and server additional activity

Collect the new ideas

Award according to results

Leave space for upgrade

Page 63/71

At Last…At Last…At Last…At Last…

Page 64/71

Available SoftwareAvailable SoftwareAvailable SoftwareAvailable Software

Page 65/71

Available SoftwareAvailable SoftwareAvailable SoftwareAvailable Software

Discussion of data mining vendors and software is not included into this slide set

Page 66/71

ConclusionsConclusionsConclusionsConclusions

Page 67/71

WWWWWW.NBA..NBA.COMCOMWWWWWW.NBA..NBA.COMCOM

Page 68/71

Se7enSe7enSe7enSe7en

Page 69/71

CD – ROM CD – ROM CD – ROM CD – ROM

Page 70/71

CreditsCreditsCreditsCredits

Anne Stern, SPSS, Inc.

Djuro Gluvajic, ITE, Denmark

Obrad Milivojevic, PC PRO, Yugoslavia

Page 71/71

ReferencesReferencesReferencesReferences

Bruha, I., ‘Data Mining, KDD and Knowledge Integration: Methodology and A case Study”, SSGRR 2000

Fayyad, U., Shapiro, P., Smyth, P., Uthurusamy, R., “Advances in Knowledge Discovery and Data Mining”, MIT Press, 1996

Glumour, C., Maddigan, D., Pregibon, D., Smyth, P., “Statistical Themes nad Lessons for Data Mining”, Data Mining And Knowledge Discovery 1, 11-28, 1997

Hecht-Nilsen, R., “Neurocomputing”, Addison-Wesley, 1990

Pyle, D., “Data Preparation for Data Mining”, Morgan Kaufman, 1999

galeb.etf.bg.ac.yu/~vm

www.thearling.com

www.crisp-dm.com

www.twocrows.com

www.sas.com/products/miner

www.spss.com/clementine

Page 72/71

The ENDThe ENDThe ENDThe END