
Upload: muhammad-sadiq

Post on 07-Aug-2015

Page 1: SPSS

© 2004 IBM Corporation © 2012 IBM Corporation

Introduction to IBM SPSS Modeler and Data Mining

LESSON 1: INTRODUCTION TO DATA MINING

IBM Global Business Solutions

PAKISTAN - ISLAMABAD

Page 2: SPSS

© 2012 IBM Corporation

IBM Global Business Services

Business Analytics & Optimization

Objectives

• To introduce the concept of Data Mining
• To introduce the CRISP-DM process model as a general framework for carrying out Data Mining projects
• To describe successful data mining projects and reasons projects fail
• To describe the skills needed for data mining
• To sketch the plan of this course

Page 3: SPSS

Introduction to Data Mining

What does Data Mining mean to you?

Page 4: SPSS

Introduction to Data Mining

Data Mining is a general term which includes a number of techniques to extract useful information from (large) data files, without necessarily having preconceived notions about what will be discovered.

Page 5: SPSS

Introduction to Data Mining

The useful information often consists of patterns and relationships in the data that were previously unknown or even unsuspected.

Page 6: SPSS

Introduction to Data Mining

Data mining is an interactive and iterative process. Business expertise must be used jointly with advanced technologies to identify underlying relationships and features in the data. A seemingly useless pattern in data discovered by data-mining technology can often be transformed into a valuable piece of actionable information using business experience and expertise.

Page 7: SPSS

Introduction to Data Mining

Existing data is used to “train” a model and then to “test” it, to determine whether the model should be deemed acceptable and likely to generalize to the population of interest.

Page 8: SPSS

Introduction to Data Mining

Data mining has been used in applications such as:

• Developing models to detect fraudulent phone or credit-card activity
• Predicting good and poor sales prospects
• Predicting the next page browsed on a website
• Identifying customers who are likely to cancel their policies, subscriptions, or accounts

Page 9: SPSS

Introduction to Data Mining

• Classifying customers into groups with distinct usage or need patterns
• Predicting who is likely to not renew a contract for mobile phone service
• Finding rules that identify products that, when purchased, predict additional purchases
• Identifying factors that lead to defects in a manufacturing process
• Predicting whether a heart attack is likely to recur among those with cardiac disease

Page 10: SPSS

Key Questions for a Data-Mining Project

• Are data available?
• Do data cover the relevant factors?
• Are the data too noisy?
• Are there enough data?
• Is expertise on the data available?

Page 11: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

Data mining is much more effective if done in a planned, systematic way!

Page 12: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

To guide your planning, answer the following questions:

• What substantive problem do you want to solve?
• What data sources are available, and what parts of the data are relevant to the current problem?
• What kind of preprocessing and data cleaning do you need to do before you start mining the data?

Page 13: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

• What data mining technique(s) will you use?
• How will you evaluate the results of the data mining analysis?
• How will you get the most out of the information you obtained from data mining?

Page 14: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

The data mining process model recommended for use with PASW Modeler (the former name of IBM SPSS Modeler) is the Cross-Industry Standard Process for Data Mining (CRISP-DM).

It can be applied to a wide variety of industries and business problems.

Page 15: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

Six phases:

• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment

Page 16: SPSS

CRISP-DM: Business understanding

• Identify business objectives and success criteria
• Perform a situational assessment (resources, constraints, assumptions, risks, costs, and benefits)
• Determine the goals of the data-mining project and success criteria
• Produce a project plan

Page 17: SPSS

CRISP-DM: Data understanding

• Collecting initial data
• Describing data
• Exploring data
• Verifying data quality
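As a rough illustration of the describing and quality-verification steps, the sketch below summarizes each field of a small data set in plain Python. The records, field names, and the `describe` helper are hypothetical and invented for this example; they are not part of the course material or of any Modeler API.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "region": "north"},
    {"age": 45, "income": None, "region": "south"},
    {"age": 29, "income": 48000, "region": "north"},
    {"age": None, "income": 61000, "region": "east"},
]

def describe(rows, field):
    """Summarize one field: counts of present and missing values, plus
    min/mean/max for numeric fields or the distinct values otherwise."""
    values = [row[field] for row in rows]
    present = [v for v in values if v is not None]
    summary = {"present": len(present), "missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), mean=mean(present), max=max(present))
    else:
        summary["distinct"] = sorted(set(present))
    return summary

# One summary per field, keyed by field name.
report = {field: describe(records, field) for field in records[0]}
```

A report like this gives a first answer to questions such as whether a field has too many missing values to be usable.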

Page 18: SPSS

CRISP-DM: Data preparation

• Selecting data
• Cleaning data
• Constructing data
• Integrating data
• Formatting data

Page 19: SPSS

Activities: Data Understanding and Data Preparation

• Extracting data from a data warehouse or data mart
• Linking tables together within a database or in PASW Modeler
• Combining data files from different systems
• Reconciling inconsistent field values
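Linking tables and reconciling inconsistent field values can be sketched as follows. This is a minimal plain-Python illustration, not how a database or Modeler performs the join; the two extracts, the `customer_id` key, and the gender coding scheme are all assumptions made up for the example, and the join assumes the key is unique in the right-hand table.

```python
# Hypothetical extracts from two systems that share a customer_id key.
customers = [
    {"customer_id": 1, "gender": "M"},
    {"customer_id": 2, "gender": "female"},
    {"customer_id": 3, "gender": "F"},
]
contracts = [
    {"customer_id": 1, "plan": "prepaid"},
    {"customer_id": 3, "plan": "postpaid"},
]

# Map the inconsistent codes used by the two systems onto one scheme.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def reconcile_gender(value):
    """Return a canonical gender code, or None if the value is unrecognized."""
    return GENDER_CODES.get(str(value).strip().lower())

def link(left, right, key):
    """Left-join two lists of records on `key`.

    Rows without a match keep only their own fields."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]

linked = link(customers, contracts, "customer_id")
for row in linked:
    row["gender"] = reconcile_gender(row["gender"])
```

Because `link` builds new records, the original extracts stay untouched, which makes it easy to compare the data before and after cleaning.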

Page 20: SPSS

Activities: Data Understanding and Data Preparation

• Identifying missing, incorrect, or extreme data values
• Data selection
• Restructuring data into a form the analysis requires
• Transforming relevant fields (taking differences, ratios, etc.)
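Two of these activities, finding missing or extreme values and deriving a ratio field, can be illustrated with a short plain-Python sketch. The usage records are hypothetical, and the rule for "extreme" (more than `k` median absolute deviations from the median, with `k=5`) is just one common heuristic chosen for the example, not a rule prescribed by the course.

```python
from statistics import median

# Hypothetical monthly usage records; None marks a missing value.
rows = [
    {"minutes": 120, "calls": 30},
    {"minutes": 95, "calls": 25},
    {"minutes": None, "calls": 28},
    {"minutes": 110, "calls": 27},
    {"minutes": 5000, "calls": 26},  # suspiciously large
    {"minutes": 105, "calls": 0},
]

def find_extremes(values, k=5.0):
    """Return present values farther than k median absolute deviations
    from the median (an assumed rule of thumb for 'extreme')."""
    present = [v for v in values if v is not None]
    center = median(present)
    mad = median([abs(v - center) for v in present])
    return [v for v in present if abs(v - center) > k * mad]

def minutes_per_call(row):
    """Derived ratio field; None when an input is missing or calls is zero."""
    if row["minutes"] is None or not row["calls"]:
        return None
    return row["minutes"] / row["calls"]

missing = [i for i, row in enumerate(rows) if row["minutes"] is None]
extremes = find_extremes([row["minutes"] for row in rows])
ratios = [minutes_per_call(row) for row in rows]
```

Whether a flagged value is an error or a genuine heavy user is a question for someone with expertise on the data; the code only points at candidates.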

Page 21: SPSS

CRISP-DM: Modeling

• Sophisticated analysis methods/models are used to extract information from the data

Steps:

• Selecting modeling techniques
• Generating test designs
• Building and assessing models

Page 22: SPSS

CRISP-DM: Modeling

Developing a model is an iterative process:

• Several models are tried
• The best model is picked

Page 23: SPSS

CRISP-DM: Modeling

A feature of data-mining is the use of multiple models to make predictions, building on the strengths of each technique.
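One simple way to combine several models is majority voting, sketched below in plain Python. The three churn "models" are hypothetical hand-written rules standing in for models built with different techniques, and voting is only one of several possible combination schemes; the slides do not say which scheme the course uses.

```python
from collections import Counter

# Hypothetical rules standing in for three separately built models.
def usage_rule(customer):
    return "churn" if customer["minutes"] < 50 else "stay"

def tenure_rule(customer):
    return "churn" if customer["tenure_months"] < 6 else "stay"

def complaint_rule(customer):
    return "churn" if customer["complaints"] >= 2 else "stay"

def majority_vote(models, customer):
    """Predict the label chosen by most models."""
    votes = Counter(model(customer) for model in models)
    return votes.most_common(1)[0][0]

models = [usage_rule, tenure_rule, complaint_rule]
at_risk = {"minutes": 30, "tenure_months": 24, "complaints": 3}
new_but_active = {"minutes": 200, "tenure_months": 3, "complaints": 0}

prediction_at_risk = majority_vote(models, at_risk)
prediction_new = majority_vote(models, new_but_active)
```

With an odd number of models and two labels there are no ties, so the combined prediction is always well defined.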

Page 24: SPSS

CRISP-DM: Evaluation

Evaluate how the data mining results can help you to achieve your business objectives.

Page 25: SPSS

CRISP-DM: Evaluation

At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before writing final reports and deploying the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives.

Page 26: SPSS

CRISP-DM: Evaluation

A key aim is to determine if there is some important business issue that has not been sufficiently considered.

At the end of this phase, a decision will be made on the use of the data-mining results.

Page 27: SPSS

CRISP-DM: Deployment

Plan for monitoring the model’s predictions and success.

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data-mining process.

Page 29: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

Data preparation usually precedes modeling. However, decisions made and information gathered during the modeling phase can often lead you to rethink parts of the data preparation phase, which can then present new modeling issues, and so on. The two phases feed back on each other until both phases have been resolved adequately.

Page 30: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

The evaluation phase can lead you to re-evaluate your original business understanding, and you may decide that you've been trying to answer the wrong question. At this point, you can revise your business understanding and proceed through the rest of the process again with a better target in mind.

Page 31: SPSS

A Strategy for Data Mining: the CRISP-DM Process Methodology

The second key point is the iterative nature of data mining. You will rarely, if ever, simply plan a data mining project, execute it and then pack up your data and go home. Using data mining to address your customers' demands is an ongoing endeavor. The knowledge gained from one cycle of data mining will almost invariably lead to new questions, new issues, and new opportunities to identify and meet your customers' needs. Those new questions, issues, and opportunities can usually be addressed by mining your data once again. This process of mining and identifying new opportunities should become part of the way that you think about your business and a cornerstone of your overall business strategy.

Page 33: SPSS

Model Validation

• Model validation is usually done by fitting the model to one portion of the data (called the Training data), then applying its predictions to the other portion of the data (called the Test or Validation data) and evaluating the results there.
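The training/test procedure can be sketched in plain Python as follows. Everything here is an illustrative assumption: the synthetic churn data, the 70/30 split, and the deliberately simple one-field threshold "model" stand in for whatever data and modeling technique a real project would use.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (training, test) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """'Train' a one-field model: the midpoint between the mean usage of
    churners and the mean usage of non-churners."""
    churn = [r["minutes"] for r in train if r["churned"]]
    stay = [r["minutes"] for r in train if not r["churned"]]
    return (sum(churn) / len(churn) + sum(stay) / len(stay)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'minutes below threshold' matches the churn flag."""
    hits = sum((r["minutes"] < threshold) == r["churned"] for r in rows)
    return hits / len(rows)

# Hypothetical data in which low-usage customers churn.
data = [{"minutes": m, "churned": m < 100} for m in range(10, 200, 10)]

train, test = train_test_split(data)
threshold = fit_threshold(train)         # fitted on the training data only
test_accuracy = accuracy(threshold, test)  # evaluated on unseen test data
```

The key design point is that the test rows play no part in fitting, so `test_accuracy` estimates how the model would perform on new customers rather than on the data it has already seen.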

Page 34: SPSS

Measures of Project Success

• Results are assessed with respect to business success
• Another criterion can be the cost of failure

Page 35: SPSS

Measures of Project Success

• The initial assessment will be directly tied to the modeling effort. That is, you will be concerned with predictive accuracy (who is a churner) or with finding interesting relationships between products or people (in association or cluster analysis).
• But in the long run, the success of a data-mining effort will be measured by concrete factors such as cost savings, return on investment, profitability, and so forth.
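To make the difference between predictive accuracy and business value concrete, the sketch below turns a churn model's hits and misses into a net campaign value and a return on investment. The accounting is a deliberate simplification, and all counts and monetary amounts are hypothetical; a real project would take them from its own business case.

```python
def campaign_value(tp, fp, fn, value_saved, offer_cost, loss_per_missed):
    """Net value of a retention campaign driven by a churn model.

    tp: churners correctly targeted (assumed to be retained by the offer)
    fp: loyal customers who received the offer needlessly
    fn: churners the model missed (the cost of failure)
    """
    benefit = tp * value_saved
    cost = (tp + fp) * offer_cost + fn * loss_per_missed
    return benefit - cost

def return_on_investment(net_value, investment):
    """ROI expressed as a fraction of the amount invested."""
    return net_value / investment

# Hypothetical campaign: 200 offers sent at 10 each, 80 churners retained.
net = campaign_value(tp=80, fp=120, fn=20,
                     value_saved=100, offer_cost=10, loss_per_missed=100)
roi = return_on_investment(net, investment=(80 + 120) * 10)
```

Two models with the same accuracy can produce very different values here, because missed churners and wasted offers rarely cost the same.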

Page 36: SPSS

Causes of Failure in Data Mining

• Bad data
• Organizational resistance
• Results that cannot be deployed
• Problems of cause and effect

Page 37: SPSS

Skills Needed for Data Mining

• Understanding of the Business
• Database Knowledge
• Data Mining Methods
• Deployment

Page 38: SPSS

Plan of the Course

• The course is structured roughly along the phases of the CRISP-DM process model. Because we don’t have a specific data-mining project to complete (although we will focus, for the most part, on one data file), we won’t discuss the Business Understanding phase any further, but we will cover the other stages from Data Understanding to Deployment.