Chapter 2: Data Mining Processes and Knowledge Discovery

Identify actionable results

Contents

• Describes the Cross-Industry Standard Process for Data Mining (CRISP-DM), a set of phases that can be used in data mining studies
• Discusses each phase in detail
• Gives an example illustration
• Discusses a knowledge discovery process

CRISP-DM

• Cross-Industry Standard Process for Data Mining
• One of the first comprehensive attempts at a standard process model for data mining
• Independent of industry sector and technology

CRISP-DM Phases

1. Business (or problem) understanding
2. Data understanding
   • A systematic process to try to make sense of the massive amounts of data generated from daily operations
3. Data preparation
   • Transform and create the data set for modeling
4. Modeling
5. Evaluation
   • Check that the models are good; evaluate to ensure nothing is missing
6. Deployment

Business Understanding

• Solve a specific problem
  • Determine business objectives, assess the current situation, establish data mining goals, and develop a project plan
• A clear definition helps
  • Measurable success criteria
• Convert business objectives to a set of data mining goals
  • What to achieve in technical terms, such as:
    • What types of customers are interested in each of our products?
    • What are typical profiles of customers …

Data Understanding

• Initial data collection, data description, data exploration, and verification of data quality
• Three issues considered in data selection:
1. Set up a concise and clear description of the problem. For example, a retail DM project may seek to identify spending behaviors of female shoppers who purchase seasonal clothes.
2. Identify the relevant data for the problem description, such as demographic, credit card transactional, and financial data …
3. Select the variables that are relevant and important for the project.

Data Understanding (cont.)

Data types:
• Demographic data (income, education, age, …)
• Socio-graphic data (hobby, club membership, …)
• Transactional data (sales records, credit card spending, …)
• Quantitative data: measurable using numerical values
• Qualitative data: also known as categorical data; contains both nominal and ordinal data (see also page 22)

Related data can come from many sources:
• Internal: ERP (or MIS), data warehouse
• External: government data, commercial data
• Created: research

Data Preparation

Once the available data sources are identified, the data need to be selected, cleaned, built into the desired form, and formatted.
• Clean data: fix formats, fill gaps, filter outliers and redundancies (see page 22)
• Unified numerical scales
  • Nominal data: code (such as gender data, male and female)
  • Ordinal data: nominal code or scale (excellent, fair, poor)
  • Cardinal data (categorical, A, B, C levels)

Types of Data

Type         Features    Synonyms
Numerical    Continuous  Range
             Integer     Range
Binary       Yes/No      Flag
Categorical  Finite      Set
Date/Time                Range
String                   Typeless
Text                     String

• Range: numeric values (integer, real, or date/time)
• Set: data with multiple distinct values (numeric, string, or date/time)
• Typeless: for other types of data

Data Preparation (cont.)

• Several statistical methods and visualization tools can be used to preprocess the selected data.
  • Statistics such as max, min, mean, and mode can be used to aggregate or smooth the data.
  • Scatter plots and box plots can be used to filter outliers.
  • More advanced techniques, such as regression analysis, cluster analysis, decision trees, or hierarchical analysis, may be applied in data preprocessing.
• In some cases, data preprocessing could take over 50% of the time of the entire data mining process.
  • Shortening data processing time can reduce much of the total computation time in data mining.
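As a rough illustration of filtering outliers the way a box plot flags them, the sketch below keeps only values inside the usual whisker fences, Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. The function name, the 1.5 multiplier, and the sales figures are assumptions made for this example; they do not come from the slides.

```python
from statistics import quantiles

def filter_outliers_iqr(values, k=1.5):
    """Keep values inside the box-plot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical daily sales figures with one obvious outlier (95).
sales = [12, 14, 15, 15, 16, 17, 18, 19, 95]
cleaned = filter_outliers_iqr(sales)
print(cleaned)
```

Whether a flagged value is an error or a genuine extreme case is a judgment that still has to be made with knowledge of the data.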

Data Preparation: Data Transformation

Data transformation uses simple mathematical formulations or learning curves to convert different measurements of the selected, cleaned data into a unified numerical scale for data analysis. Data transformation can be used to:
1. Transform from numerical to numerical scales, to shrink or enlarge the given data. For example, (x - min)/(max - min) shrinks the data into the interval [0, 1].
2. Recode categorical data to numerical scales. Categorical data can be ordinal (less, moderate, strong) or nominal (red, yellow, blue, ...). For example, 1 = yes, 0 = no.

See page 24 for more details.
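The two transformations can be sketched in a few lines. This is a minimal example under stated assumptions: the function name, the income figures, and the answer list are hypothetical, and the choice to map a constant column to 0.0 is one option among several.

```python
def min_max_scale(values):
    """Rescale numbers into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero, so map to 0.0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Hypothetical numeric data: annual incomes.
incomes = [20000, 35000, 50000, 80000]
scaled = min_max_scale(incomes)

# Hypothetical categorical data recoded as 1 = yes, 0 = no.
answers = ["yes", "no", "yes"]
coded = [1 if a == "yes" else 0 for a in answers]

print(scaled, coded)
```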

Modeling

Data modeling is where the data mining software is used to generate results for various situations. Data visualization and cluster analysis are useful for initial analysis.

Depending on the data type:
1. If the task is to group data, discriminant analysis is applied.
2. If the purpose is estimation, regression is appropriate when the data are continuous (and logistic regression when they are not).
3. Neural networks could be applied for both tasks.

Data treatment:
• Training set for development of the model
• Test set for testing the model that is built
• Maybe others for refining the model
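The split into a training set and a test set might look like the sketch below. The 70/30 proportion, the fixed seed, and the case labels are assumptions chosen for illustration; the slides do not prescribe them here.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical case identifiers.
cases = [f"case-{i}" for i in range(1, 11)]
train, test = train_test_split(cases)
print(len(train), len(test))
```

A third slice held back for refining the model could be cut from the training portion in the same way.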

Data mining techniques

Techniques:
• Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns. See also page 25 for an example.
• Classification: the methods are intended for learning different functions that map each item of the selected data into one of a predefined set of classes. Two key research problems related to classification results are the evaluation of misclassification and prediction power (C4.5).
  • Mathematical modeling is often used to construct classification methods; examples are binary decision trees (CART), neural networks (nonlinear), linear programming (boundary), and statistics.
  • See also pages 25 and 26 for more explanation.

Data mining techniques (cont.)

• Clustering: takes ungrouped data and uses automatic techniques to put the data into groups.
  • Clustering is unsupervised and does not require a learning set (Chapter 5).
• Prediction: related to regression techniques; discovers the relationship between the dependent and independent variables.
• Sequential patterns: seeks to find similar patterns in data transactions over a business period.
  • The mathematical models behind sequential patterns are logic rules, fuzzy logic, and so on.
• Similar time sequences: applied to discover sequences similar to a known sequence over both past and current business periods.

Evaluation

• Does the model meet business objectives?
• Are any important business objectives not addressed?
• Does the model make sense?
• Is the model actionable?

[Figure: PDCA cycle and CRISP-DM]

Deployment

• DM can be used to verify previously held hypotheses or for knowledge discovery.
• DM models can be applied to business purposes, including prediction or identification of key situations.
• Ongoing monitoring and maintenance
  • Evaluate performance against success criteria
  • Market reaction and competitor changes (remodel or fine-tune)

Example

• Training set for computer purchase
  • 16 records
  • 5 attributes
• Goal
  • Find a classifier for consumer behavior

Database (1st half)

Case  Age    Income  Student  Credit     Gender  Buy?
A1    31-40  High    No       Fair       Male    Yes
A2    >40    Medium  No       Fair       Female  Yes
A3    >40    Low     Yes      Fair       Female  Yes
A4    31-40  Low     Yes      Excellent  Female  Yes
A5    ≤30    Low     Yes      Fair       Female  Yes
A6    >40    Medium  Yes      Fair       Male    Yes
A7    ≤30    Medium  Yes      Excellent  Male    Yes
A8    31-40  Medium  No       Excellent  Male    Yes

Database (2nd half)

Case  Age    Income   Student  Credit     Gender  Buy?
A9    31-40  High     Yes      Fair       Male    Yes
A10   ≤30    High     No       Fair       Male    No
A11   ≤30    High     No       Excellent  Female  No
A12   >40    Low      Yes      Excellent  Female  No
A13   ≤30    Medium   No       Fair       Male    No
A14   >40    Medium   No       Excellent  Female  No
A15   ≤30    Unknown  No       Fair       Male    Yes
A16   >40    Medium   No       N/A        Female  No

Data Selection

• Gender has a weak relationship with purchase
  • Based on correlation
  • Drop gender
• Selected attribute set: {Age, Income, Student, Credit}

Data Preprocessing

Income unknown in Case 15

Credit not available in Case 16

Drop these noisy cases

Data Transformation

Assign numerical values to each attribute:
• Age: ≤30 = 3, 31-40 = 2, >40 = 1
• Income: High = 3, Medium = 2, Low = 1
• Student: Yes = 2, No = 1
• Credit: Excellent = 2, Fair = 1
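The coding scheme above can be written as lookup tables. In this sketch the numeric codes are taken from the slide and the two sample cases (A1 and A5) from the database slides; the dictionary and function names are assumptions made for the example.

```python
# Codes as listed on the slide.
AGE = {"≤30": 3, "31-40": 2, ">40": 1}
INCOME = {"High": 3, "Medium": 2, "Low": 1}
STUDENT = {"Yes": 2, "No": 1}
CREDIT = {"Excellent": 2, "Fair": 1}

def encode(case):
    """Convert (age, income, student, credit) labels to their numeric codes."""
    age, income, student, credit = case
    return (AGE[age], INCOME[income], STUDENT[student], CREDIT[credit])

# Cases A1 and A5 from the database, with gender already dropped.
a1 = ("31-40", "High", "No", "Fair")
a5 = ("≤30", "Low", "Yes", "Fair")
print(encode(a1), encode(a5))
```

A label with no code, such as an unknown income, raises a KeyError here, which is one reason incomplete cases are dropped before this step.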

Data Mining

• Categorize output
  • Buys = C1
  • Doesn’t buy = C2
• Conduct analysis
  • Model says A8, A10 don’t buy; rest do
  • Of the actual yes, 7 correct and 1 not
  • Of the actual no, 2 correct
• Confusion matrix

Data Interpretation and Test Data Set

Test on independent data

Case  Actual  Model  Note
B1    Yes     Yes    (1)
B2    Yes     Yes    (2)
B3    Yes     Yes    (3)
B4    Yes     Yes    (4)
B5    Yes     Yes    (5)
B6    Yes     Yes    (6)
B7    Yes     Yes    (7)
B8    No      No     (do not)
B9    No      Yes
B10   No      No     (do not)

Confusion Matrix

            Model Buy  Model Not  Totals
Actual Buy      7          0         7
Actual Not      1          2         3
Totals          8          2        10
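The counts in the confusion matrix can be tallied directly from the test results for cases B1 to B10. The sketch below is one way to do it; the variable names are assumptions, while the actual and model values are those listed for the test data set.

```python
from collections import Counter

# Actual and model outcomes for test cases B1..B10, in order.
actual = ["Yes"] * 7 + ["No", "No", "No"]
model = ["Yes"] * 7 + ["No", "Yes", "No"]

# Count each (actual, model) pair, then arrange the counts as rows by actual class.
counts = Counter(zip(actual, model))
matrix = {a: {m: counts[(a, m)] for m in ("Yes", "No")} for a in ("Yes", "No")}

for a, row in matrix.items():
    print("Actual", a, row, "total", sum(row.values()))
```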

Measures

• Correct classification rate
  • 9/10 = 0.90
• Cost function
  • Cost of error:
    • model says buy, actual no: $20
    • model says no, actual buy: $200
  • 1 x $20 + 0 x $200 = $20
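Both measures follow mechanically from the confusion matrix. In this sketch the counts and the two error costs are the ones given in the example; the dictionary layout and names are assumptions.

```python
# Confusion matrix from the example: rows are actual classes, columns are model classes.
matrix = {
    "buy": {"buy": 7, "not": 0},
    "not": {"buy": 1, "not": 2},
}
# Cost of each error type, keyed by (actual, model).
error_cost = {("not", "buy"): 20, ("buy", "not"): 200}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[c][c] for c in matrix)
rate = correct / total

cost = sum(
    matrix[actual][model] * error_cost[(actual, model)]
    for (actual, model) in error_cost
)
print(rate, cost)
```

Because the two error types are priced so differently, a model with a lower correct classification rate could still have a lower total cost.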

Goals

• Avoid broad concepts:
  • Gain insight; discover meaningful patterns; learn interesting things
  • Can’t measure attainment
• Narrow and specify:
  • Identify customers likely to renew; reduce churn
  • Rank order by propensity (favor) to …

Goals

• Description: what is
  • understand
  • explain
  • discover knowledge
• Prescription: what should be done
  • classify
  • predict

Goal

• Method A: four rules, explains 70%
• Method B: fifty rules, explains 72%
• BEST?
  • Gain understanding: Method A better (minimum description length, MDL)
  • Reduce cost of mailing: Method B better

Measurement

• Accuracy
  • How well does the model describe observed data?
  • Confidence levels: proportion of the time between lower and upper limits
• Comprehensibility
  • Whole or parts?

Measuring Predictive

• Classification and prediction:
  • error rate = incorrect/total
  • requires that the evaluation set be representative
• Estimators
  • predicted - actual (MAD, MSE, MAPE)
  • variance = sum((predicted - actual)^2) / n
  • standard deviation = square root of variance
  • distance: how far off
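The three error measures named above can be computed as follows, using their common textbook definitions: MAD is the mean absolute deviation, MSE the mean squared error, and MAPE the mean absolute percentage error. The function names and the sample figures are assumptions for illustration only.

```python
import math

def mad(actual, predicted):
    """Mean absolute deviation between predictions and actual values."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs(p - a) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual and predicted values.
actual = [100, 200, 400]
predicted = [110, 190, 400]

print(mad(actual, predicted))             # average size of the miss
print(mse(actual, predicted))             # penalizes large misses more heavily
print(math.sqrt(mse(actual, predicted)))  # back in the original units
print(mape(actual, predicted))            # miss as a percentage of actual
```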

Statistics

• Population: entire group studied
• Sample: subset from population
• Bias: difference between sample average and population average
• Mean, median, mode
• Distribution
• Significance
• Correlation, regression (Hamming distance)

Classification Models

• LIFT = probability in class by sample divided by probability in class by population
  • If population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5
• The best lift is not necessarily the best model: a sufficient sample size is needed as confidence increases.
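The lift calculation is a single ratio. The sketch below reproduces the 30% versus 20% example from counts; the function name and the specific counts (30 of 100 and 200 of 1,000) are assumptions chosen to give those rates.

```python
def lift(hits_in_sample, sample_size, hits_in_population, population_size):
    """Ratio of the class rate in the selected sample to the rate in the population."""
    sample_rate = hits_in_sample / sample_size
    population_rate = hits_in_population / population_size
    return sample_rate / population_rate

# 30% of the selected sample is in the class, against 20% of the population.
value = lift(30, 100, 200, 1000)
print(round(value, 2))
```

A lift computed on a very small sample can look impressive by chance, which is the caution behind checking sample size.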

Lift Chart

[Figure: lift chart. Horizontal axis: % mailed (0 to 100); vertical axis: % responded (0 to 100); series: mailed, responded.]

Measuring Impact

• Ideal: $ (NPV) because of expenditure
• Mass mailing may be better
• Depends on:
  • fixed cost
  • cost per recipient
  • cost per respondent
  • value of positive response
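To see how the four factors interact, the sketch below compares a mass mailing with a targeted one. The profit formula is a simple assumption (value of responses minus fixed, per-recipient, and per-respondent costs), and every number is hypothetical; the slides give no figures.

```python
def campaign_profit(recipients, respondents, fixed_cost,
                    cost_per_recipient, cost_per_respondent, value_per_response):
    """Value of positive responses minus all campaign costs (assumed formula)."""
    revenue = respondents * value_per_response
    cost = (fixed_cost
            + recipients * cost_per_recipient
            + respondents * cost_per_respondent)
    return revenue - cost

# Hypothetical mass mailing: everyone is mailed, 2% respond.
mass = campaign_profit(100_000, 2_000, 5_000, 1, 5, 100)

# Hypothetical targeted mailing: higher response rate (5%), but the fixed
# cost is higher because it includes building the model.
targeted = campaign_profit(20_000, 1_000, 15_000, 1, 5, 100)

print(mass, targeted)
```

With these made-up figures the mass mailing earns more, even though the targeted list has the better response rate, which is the situation the slide warns about.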

Bottom Line

Return on investment

Example Application

• Telephone industry
• Problem: unpaid bills
• Data mining used to develop models to predict nonpayment as early as possible
• See page 27

Knowledge Discovery Process

1. Data Selection
   • Learning the application domain
   • Creating target data set
2. Data Preprocessing
   • Data cleaning and preprocessing
3. Data Transformation
   • Data reduction and projection
4. Data Mining
   • Choosing function
   • Choosing algorithms
   • Data mining
5. Data Interpretation
   • Interpretation
   • Using discovered knowledge

1: Business Understanding

• Predict which customers would be insolvent
  • In time for the firm to take preventive measures (and avert losing good customers)
• Hypothesis:
  • Insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period

2: Data Understanding

• Static customer information available in files
  • Bills, payments, usage
• Used data warehouse to gather and organize data
  • Coded to protect customer privacy

Creating Target Data Set

• Customer files
  • Customer information
  • Disconnects
  • Reconnections
• Time-dependent data
  • Bills
  • Payments
  • Usage
• 100,000 customers over a 17-month period
• Stratified (hierarchical) sampling to assure all groups are appropriately represented

3: Data Preparation

Filtered out incomplete data

Deleted inexpensive calls (reduced data volume about 50%)

Low number of fraudulent cases

Cross-checked with phone disconnects

Lagged data made synchronization necessary

Data Reduction & Projection

Information grouped by account

Customer data aggregated by 2-week periods

Discriminant analysis on 23 categories

Calculated average owed by category (significant)

Identified extra charges (significant)

Investigated payment by installments (not significant)

Choosing Data Mining Function

• Classes:
  • Most possibly solvent (99.3%)
  • Most possibly insolvent (0.7%)
• Costs of error widely different
• New data set created through stratified sampling
  • Retained all insolvent
  • Altered distribution to 90% solvent
  • Used 2,066 cases total
• Critical period identified
  • Last 15 two-week periods before service interruption
• Variables defined by counting measures in two-week periods
  • 46 variables as candidate discriminant factors
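The rebalancing step (keep every insolvent case, then draw enough solvent cases to make them 90% of the new set) can be sketched as below. This is an illustration under stated assumptions, not the study's actual procedure: the function name, the seed, and the toy population of 9,930 solvent and 70 insolvent records are hypothetical.

```python
import random

def rebalance(solvent, insolvent, solvent_share=0.9, seed=0):
    """Keep all insolvent cases; sample solvent cases to reach the target share."""
    wanted = round(len(insolvent) * solvent_share / (1 - solvent_share))
    wanted = min(wanted, len(solvent))  # cannot draw more than exist
    rng = random.Random(seed)           # seeded so the draw is repeatable
    return rng.sample(solvent, wanted) + list(insolvent)

# Hypothetical population with the 99.3% / 0.7% class split.
solvent = [("solvent", i) for i in range(9_930)]
insolvent = [("insolvent", i) for i in range(70)]

sample = rebalance(solvent, insolvent)
n_insolvent = sum(1 for label, _ in sample if label == "insolvent")
print(len(sample), n_insolvent)
```

Raising the share of the rare class this way gives the classifiers enough insolvent examples to learn from, at the price of a training distribution that no longer matches the population.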

4: Modeling

• Discriminant analysis
  • Linear model
  • SPSS: stepwise forward selection
• Decision trees
  • Rule-based classifier, C5, C4.5
• Neural networks
  • Nonlinear model

Data Mining

• Training set about two-thirds
  • Rest for testing
• Discriminant analysis
  • Used 17 variables
  • Equal costs: 0.875 correct
  • Unequal costs: 0.930 correct
• Rule-based: 0.952 correct
• Neural network: 0.929 correct

5: Evaluation

• First objective: maximize accuracy of predicting insolvent customers
  • Decision tree classifier best
• Second objective: minimize error rate for solvent customers
  • Neural network model close to decision tree
• Used all 3 on a case-by-case basis

Coincidence Matrix: Combined Models

                  Model insolvent  Model solvent  Unclass  Totals
Actual insolvent        19              17           28       64
Actual solvent           1             626           27      654
Totals                  20             643           55      718
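The margins and the overall correct rate can be recomputed from the six cell counts as a consistency check. In this sketch the cell values are those in the matrix; the dictionary layout and names are assumptions. Summing the unclassified column gives 28 + 27 = 55.

```python
# Cell counts: rows are actual classes, columns are the combined model's output.
matrix = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclassified": 28},
    "solvent": {"insolvent": 1, "solvent": 626, "unclassified": 27},
}

row_totals = {actual: sum(row.values()) for actual, row in matrix.items()}
col_totals = {
    col: sum(row[col] for row in matrix.values())
    for col in ("insolvent", "solvent", "unclassified")
}
total = sum(row_totals.values())

# A case counts as correct only when the model's class matches the actual class.
correct = sum(matrix[c][c] for c in matrix)
rate = correct / total

print(row_totals, col_totals, total, round(rate, 3))
```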

6: Implementation

• Every customer examined using all 3 algorithms
  • If all 3 agreed, used that classification
  • If disagreement, categorized as unclassified
• Correct on test data: 0.898
  • Only 1 actually solvent customer would have been disconnected
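The agreement rule for combining the three classifiers is simple to state in code. This is a minimal sketch; the function name and the label strings are assumptions, and the three predictions stand in for the outputs of the discriminant, decision tree, and neural network models.

```python
def combine(predictions):
    """Return the shared label if all models agree, otherwise 'unclassified'."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return "unclassified"

# Hypothetical outputs of the three models for two customers.
print(combine(("insolvent", "insolvent", "insolvent")))
print(combine(("solvent", "solvent", "insolvent")))
```

Requiring unanimity trades coverage for caution: fewer customers receive a label, but a solvent customer is rarely flagged as insolvent.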
