© 2004 IBM Corporation© 2012 IBM Corporation
Introduction to IBM SPSS Modeler and Data Mining
LESSON 1: INTRODUCTION TO DATA MINING
IBM Global Business Solutions
PAKISTAN - ISLAMABAD
© 2012 IBM Corporation
IBM Global Business Services
Business Analytics & Optimization
• To introduce the concept of Data Mining
• To introduce the CRISP-DM process model as a general framework for carrying out Data Mining projects
• To describe successful data mining projects and reasons projects fail
• To describe the skills needed for data mining
• To sketch the plan of this course
Objectives
What does Data Mining mean to you?
Introduction to Data Mining
Data Mining is a general term which includes a number of techniques to extract useful
information from (large) data files, without necessarily having preconceived notions about
what will be discovered.
Introduction to Data Mining
The useful information often consists of patterns and
relationships in the data that were previously unknown or
even unsuspected.
Introduction to Data Mining
Data mining is an interactive and iterative process. Business expertise must be used jointly with advanced technologies to identify underlying relationships and features in the data. A seemingly useless pattern in data discovered by data-mining technology can often be transformed into a valuable piece of actionable information using business experience and expertise.
Introduction to Data Mining
Existing data is used to “train” a model and then to “test” it, to determine whether it should
be deemed acceptable and likely to generalize to the population of interest.
Introduction to Data Mining
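The train/test idea can be sketched in a few lines of Python. This is only an illustration, not how the course software partitions data: the function name `train_test_split`, the 30% test fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-in for 100 customer records.
records = list(range(100))
train, test = train_test_split(records)
print(len(train), len(test))
```

The model is built only on `train`; `test` is held back so that the model is judged on records it has not seen.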
Data mining has been used in applications such as:
• Developing models to detect fraudulent phone or credit-card activity
• Predicting good and poor sales prospects
• Predicting the next page browsed on a website
• Identifying customers who are likely to cancel their policies, subscriptions, or accounts
Introduction to Data Mining
• Classifying customers into groups with distinct usage or need patterns
• Predicting who is likely not to renew a contract for mobile phone service
• Finding rules that identify products that, when purchased, predict additional purchases
• Identifying factors that lead to defects in a manufacturing process
• Predicting whether a heart attack is likely to recur among those with cardiac disease
Introduction to Data Mining
• Are data available?
• Do data cover the relevant factors?
• Are the data too noisy?
• Are there enough data?
• Is expertise on the data available?
Key Questions for a Data-Mining Project
Data mining is much more effective if done in a planned, systematic way!
A Strategy for Data Mining: the CRISP-DM Process Methodology
To guide your planning, answer the following questions:
• What substantive problem do you want to solve?
• What data sources are available, and what parts of the data are relevant to the current problem?
• What kind of preprocessing and data cleaning do you need to do before you start mining the data?
A Strategy for Data Mining: the CRISP-DM Process Methodology
• What data mining technique(s) will you use?
• How will you evaluate the results of the data mining analysis?
• How will you get the most out of the information you obtained from data mining?
A Strategy for Data Mining: the CRISP-DM Process Methodology
The data mining process model recommended for use with PASW Modeler is the Cross-Industry Standard Process for Data Mining (CRISP-DM).
It can be applied to a wide variety of industries and business problems.
A Strategy for Data Mining: the CRISP-DM Process Methodology
Six phases:
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
A Strategy for Data Mining: the CRISP-DM Process Methodology
• Identify business objectives and success criteria
• Perform a situational assessment (resources, constraints, assumptions, risks, costs, and benefits)
• Determine the goals of the data-mining project and success criteria
• Produce a project plan
CRISP-DM: Business understanding
• Collecting initial data
• Describing data
• Exploring data
• Verifying data quality
CRISP-DM: Data understanding
• Selecting data
• Cleaning data
• Constructing data
• Integrating data
• Formatting data
CRISP-DM: Data preparation
• Extracting data from a data warehouse or data mart
• Linking tables together within a database or in PASW Modeler
• Combining data files from different systems
• Reconciling inconsistent field values
Activities: the Data Understanding and Data Preparation Phases
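As a rough illustration of linking tables and reconciling inconsistent field values, the sketch below joins two small in-memory tables on a shared key. The table contents, the field names (`customer_id`, `region`, `balance`), and the `REGION_CODES` mapping are all hypothetical; in practice this step would normally be done in the database or in the modeling tool.

```python
# Hypothetical tables from two different systems.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "N"},
    {"customer_id": 3, "region": "South"},
]
accounts = [
    {"customer_id": 1, "balance": 250.0},
    {"customer_id": 3, "balance": 90.5},
]

# Assumed mapping of inconsistent spellings to one canonical value.
REGION_CODES = {"n": "North", "north": "North", "s": "South", "south": "South"}

def reconcile_region(value):
    """Map a region spelling to its canonical form; unknown values pass through."""
    return REGION_CODES.get(value.strip().lower(), value)

def link_tables(left, right, key):
    """Left join: keep every left record and add the matching right fields."""
    right_by_key = {row[key]: row for row in right}
    right_fields = {f for row in right for f in row} - {key}
    linked = []
    for row in left:
        merged = dict(row)
        merged.update({f: None for f in right_fields})  # None marks "no match"
        merged.update(right_by_key.get(row[key], {}))
        linked.append(merged)
    return linked

linked = link_tables(customers, accounts, "customer_id")
for row in linked:
    row["region"] = reconcile_region(row["region"])
    print(row)
```

Customer 2 has no account record, so its `balance` stays `None`. That gap is exactly the kind of missing value the data understanding phase should surface.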
• Identifying missing, incorrect, or extreme data values
• Data selection
• Restructuring data into a form the analysis requires
• Transforming relevant fields (taking differences, ratios, etc.)
Activities: the Data Understanding and Data Preparation Phases
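Two of these activities, flagging missing or extreme values and deriving difference and ratio fields, can be sketched as follows. The records, the field names, and the upper limit of 1000 are made up for the example; a real project would choose limits from knowledge of the data.

```python
# Hypothetical revenue records; None marks a missing value.
records = [
    {"id": 1, "revenue_now": 120.0, "revenue_prev": 100.0},
    {"id": 2, "revenue_now": None, "revenue_prev": 80.0},
    {"id": 3, "revenue_now": 95.0, "revenue_prev": 0.0},
    {"id": 4, "revenue_now": 9000.0, "revenue_prev": 110.0},
]

def flag_values(rows, field, upper_limit):
    """Return (ids with a missing value, ids with a value above upper_limit)."""
    missing = [r["id"] for r in rows if r[field] is None]
    extreme = [r["id"] for r in rows
               if r[field] is not None and r[field] > upper_limit]
    return missing, extreme

def add_derived_fields(rows):
    """Add difference and ratio fields, leaving None where they are undefined."""
    derived = []
    for r in rows:
        now, prev = r["revenue_now"], r["revenue_prev"]
        new_row = dict(r)
        both_known = now is not None and prev is not None
        new_row["revenue_diff"] = now - prev if both_known else None
        # The ratio is undefined when the previous value is missing or zero.
        new_row["revenue_ratio"] = now / prev if both_known and prev != 0 else None
        derived.append(new_row)
    return derived

missing, extreme = flag_values(records, "revenue_now", upper_limit=1000.0)
derived = add_derived_fields(records)
print(missing, extreme)
```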
• Sophisticated analysis methods/models are used to extract information from the data
Steps
• Selecting modeling techniques
• Generating test designs
• Building and assessing models
CRISP-DM: Modeling
Developing a model is an iterative process:
• Several models are tried
• The best model is picked
CRISP-DM: Modeling
A feature of data-mining is the use of multiple models to make predictions, building on the strengths of each technique.
CRISP-DM: Modeling
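One simple way to combine several models is a majority vote, sketched below. The three hand-written rules are hypothetical stand-ins for trained models, and majority voting is just one possible combination scheme; it is not presented here as the method the course software uses.

```python
from collections import Counter

# Hypothetical "models": each maps a customer record to a predicted label.
def rule_low_usage(customer):
    return "churn" if customer["monthly_minutes"] < 100 else "stay"

def rule_complaints(customer):
    return "churn" if customer["complaints"] >= 2 else "stay"

def rule_short_tenure(customer):
    return "churn" if customer["tenure_months"] < 6 else "stay"

MODELS = [rule_low_usage, rule_complaints, rule_short_tenure]

def majority_vote(models, record):
    """Return the label predicted by most models (odd model count avoids ties)."""
    votes = Counter(model(record) for model in models)
    return votes.most_common(1)[0][0]

new_customer = {"monthly_minutes": 50, "complaints": 0, "tenure_months": 3}
loyal_customer = {"monthly_minutes": 300, "complaints": 3, "tenure_months": 24}
print(majority_vote(MODELS, new_customer))
print(majority_vote(MODELS, loyal_customer))
```

Each rule looks at a different aspect of the customer, so the combined vote can be right even when one rule is wrong.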
Evaluate how the data mining results can help you to achieve your business objectives
CRISP-DM: Evaluation
At this stage in the project, you have built a model (or models) that appears to have high
quality, from a data analysis perspective. Before writing final reports and deploying the
model, it is important to more thoroughly evaluate the model, and review the steps
executed to construct the model, to be certain it properly achieves the business objectives.
CRISP-DM: Evaluation
A key aim is to determine if there is some important business issue that has not been
sufficiently considered.
At the end of this phase, a decision will be made on the use of the data-mining results.
CRISP-DM: Evaluation
Plan for monitoring the model’s predictions and success.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as
implementing a repeatable data-mining process.
CRISP-DM: Deployment
Data preparation usually precedes modeling. However, decisions made and information
gathered during the modeling phase can often lead you to rethink parts of the data
preparation phase, which can then present new modeling issues, and so on. The two phases feed back on each other until both
phases have been resolved adequately.
A Strategy for Data Mining: the CRISP-DM Process Methodology
The evaluation phase can lead you to re-evaluate your original business understanding, and you may decide that you've been trying to answer the wrong question. At this point, you can revise your business understanding and
proceed through the rest of the process again with a better target in mind.
A Strategy for Data Mining: the CRISP-DM Process Methodology
The second key point is the iterative nature of data mining. You will rarely, if ever, simply plan a data mining project,
execute it and then pack up your data and go home. Using data mining to address your customers' demands is an
ongoing endeavor. The knowledge gained from one cycle of data mining will almost invariably lead to new questions, new issues, and new opportunities to identify and meet
your customers' needs. Those new questions, issues, and opportunities can usually be addressed by mining your data
once again. This process of mining and identifying new opportunities should become part of the way that you
think about your business and a cornerstone of your overall business strategy.
A Strategy for Data Mining: the CRISP-DM Process Methodology
• Model validation is usually done by fitting the model to one portion of the data (called the Training data) and then applying the model to, and evaluating its predictions on, the other portion of the data (called the Test or Validation data).
Model Validation
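The fit-on-training, evaluate-on-test procedure can be sketched with a deliberately tiny model: a single threshold on one numeric field. The data values and the threshold model are assumptions made for the illustration; any real modeling technique would take its place.

```python
def predict(x, threshold):
    """Predict class 1 when the field value reaches the threshold, else 0."""
    return 1 if x >= threshold else 0

def accuracy(rows, threshold):
    """Share of (x, label) rows that the threshold model classifies correctly."""
    correct = sum(1 for x, y in rows if predict(x, threshold) == y)
    return correct / len(rows)

def fit_threshold(train_rows):
    """Choose the observed value that gives the best accuracy on the training rows."""
    candidates = sorted({x for x, _ in train_rows})
    return max(candidates, key=lambda t: accuracy(train_rows, t))

# Hypothetical (field value, label) pairs, already partitioned.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1)]
test = [(2, 0), (5, 1), (9, 1), (3, 1)]

threshold = fit_threshold(train)          # fitted on the Training data only
train_accuracy = accuracy(train, threshold)
test_accuracy = accuracy(test, threshold) # evaluated on the held-back Test data
print(threshold, train_accuracy, test_accuracy)
```

On this toy data the test accuracy comes out lower than the training accuracy. That is the usual pattern, and it is the reason the test figure is the one to report.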
• Result assessed with respect to business success
• Another criterion can be the cost of failure
Measures of Project Success
• The initial assessment will be directly tied to the modeling effort. That is, you will be concerned with predictive accuracy (who is a churner) or with finding interesting relationships between products or people (in association or cluster analysis).
• But in the long run, the success of a data-mining effort will be measured by concrete factors such as cost savings, return on investment, or profitability.
Measures of Project Success
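To make the step from predictive accuracy to business measures concrete, the sketch below turns churn predictions into a return-on-investment figure. Everything in it is an assumption for illustration: the function name `campaign_summary`, the figures of 100 per retained customer and 10 per contact, and the optimistic premise that every contacted churner is retained.

```python
def campaign_summary(outcomes, value_per_saved_customer, cost_per_contact):
    """Summarize a retention campaign that contacts every predicted churner.

    outcomes: list of (predicted_churn, actual_churn) booleans.
    Assumes, optimistically, that each contacted churner is retained.
    """
    contacted = sum(1 for predicted, _ in outcomes if predicted)
    saved = sum(1 for predicted, actual in outcomes if predicted and actual)
    missed = sum(1 for predicted, actual in outcomes if actual and not predicted)
    cost = contacted * cost_per_contact
    gain = saved * value_per_saved_customer
    return {
        "cost": cost,
        "gain": gain,
        # ROI = (gain - cost) / cost; undefined when nothing was spent.
        "roi": (gain - cost) / cost if cost else None,
        # Value lost to churners the model failed to flag (cost of failure).
        "missed_value": missed * value_per_saved_customer,
    }

# Hypothetical results for six customers.
outcomes = [
    (True, True), (True, True), (True, False),
    (True, False), (False, True), (False, False),
]
summary = campaign_summary(outcomes, value_per_saved_customer=100.0,
                           cost_per_contact=10.0)
print(summary)
```

Two models with the same accuracy can produce very different figures here, because contacting a non-churner and missing a churner do not cost the same.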
• Bad data
• Organizational resistance
• Results that cannot be deployed
• Problems of cause and effect
Causes of Failure in Data Mining
• Understanding of the Business
• Database Knowledge
• Data Mining Methods
• Deployment
Skills Needed for Data Mining
• The course is structured roughly along the phases of the CRISP-DM process model.
Because we don’t have a specific data-mining project to complete (although we will focus, for the most part, on one data file), we won’t
discuss the Business Understanding phase any further, but we will cover the
other stages from Data Understanding to Deployment.
Plan of the Course