dm_lab1


Data mining exercise with SPSS Clementine

Lab 1

Winnie Lam
Email: cswinnie@comp.polyu.edu.hk
Website: http://www.comp.polyu.edu.hk/~cswinnie/

The Hong Kong Polytechnic University
Department of Computing
Last update: 09/03/2006


OVERVIEW

Clementine is a data mining tool that combines advanced modeling technology with ease of use. It helps you discover and predict interesting and valuable relationships within your data.

You can use Clementine for decision-support activities such as:

• Creating customer profiles and determining customer lifetime value.
• Detecting and predicting fraud in your organization.
• Determining and predicting valuable sequences in Web-site data.
• Predicting future trends in sales and growth.
• Profiling for direct mailing response and credit risk.
• Performing churn prediction, classification and segmentation.


KDD Process

(Diagram) Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Transformation] → Transformed Data → [Data Mining] → Patterns → [Evaluation] → Knowledge


Simplified KDD process:

• Data Understanding - define target & discover useful data
• Data Preparation - obtain clean & useful data
• Modeling (Data Mining) - discover patterns


(Screenshot of the Clementine window: Stream Canvas, Node Palettes, Object Manager, Project window)

Learning the Nodes

Sources Record Ops Field Ops Graphs Modeling Output


NODES

Source nodes

• Database - import data using ODBC
• Variable File - free-field ASCII data
• Fixed File - fixed-field ASCII data
• SPSS File - import SPSS files
• SAS File - import files in SAS format
• User Input - replace existing source nodes


Record Operations Nodes - make changes to the data set at the record level:

• Select
• Sample
• Balance
• Aggregate
• Sort
• Merge
• Append
• Distinct


Field Operations Nodes - for data transformation and preparation:

• Type
• Filter
• Derive
• Filler
• Reclassify
• Binning
• Partition
• Set to Flag
• History
• Field Reorder


Graph Nodes - explore & check the distribution and relationships:

• Plot
• Multiplot
• Distribution
• Histogram
• Collection
• Web
• Evaluation


Modeling Nodes - heart of the DM process (machine learning). Each method has certain strengths and is best suited for particular types of problems:

• Neural Net
• C5.0
• Classification and Regression (C&R) Trees
• QUEST
• CHAID
• Kohonen
• K-Means
• TwoStep Cluster
• Apriori
• Generalized Rule Induction (GRI)
• CARMA
• Sequence Detection
• PCA/Factor Analysis
• Linear Regression
• Logistic Regression


Output Nodes - obtain information about your data and models; export data in various formats:

• Table
• Matrix
• Analysis
• Data Audit
• Statistics
• Quality
• Report
• Set Globals
• Publisher
• Database Output
• Flat File
• SPSS Export
• SAS Export
• Excel
• SPSS Procedure


Association Tools

• Apriori discovers association rules in the data. For large problems, Apriori is generally faster to train than GRI. It has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions.

• GRI, Generalized Rule Induction, extracts a set of rules from the data (similar to Apriori). GRI can handle numeric as well as symbolic input fields.

• CARMA uses an association rules discovery algorithm to discover association rules in the data. The CARMA node does not require In fields or Out fields; it is equivalent to building an Apriori model with all fields set to Both.

• Sequence discovers patterns in sequential or time-oriented data. A sequence is a list of item sets that tend to occur in a predictable order.
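The support and confidence measures that these association tools report can be sketched in plain Python. This is a toy illustration, not Clementine's algorithm; the basket data and thresholds are invented:

```python
from itertools import combinations

def find_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Enumerate one-antecedent rules A -> B whose support and
    confidence clear the given thresholds."""
    n = len(transactions)
    items = {item for t in transactions for item in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for x, y in combinations(sorted(items), 2):
        pair_support = support({x, y})
        if pair_support < min_support:
            continue
        for a, b in ((x, y), (y, x)):
            confidence = pair_support / support({a})
            if confidence >= min_confidence:
                rules.append((a, b, pair_support, confidence))
    return rules

# Toy baskets echoing the product codes used later in the lab
baskets = [{"P17", "P39", "P27"}, {"P17", "P39"}, {"P17", "P27"}, {"P39", "P27"}]
for a, b, s, c in find_rules(baskets):
    print(f"IF {a} THEN {b}  (support={s:.2f}, confidence={c:.2f})")
```

Real Apriori prunes candidate item sets by support before counting larger sets; this brute-force sketch only shows the measures being computed.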


Classification Tools – Decision tree

• C5.0. This method splits the sample based on the field that provides the maximum information gain at each level to produce either a decision tree or a ruleset. The target field must be categorical. Multiple splits into more than two subgroups are allowed.

• C&RT. The Classification and Regression Trees method is based on minimization of impurity measures. A node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups).

• CHAID. Chi-squared Automatic Interaction Detector uses chi-squared statistics to identify optimal splits. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level.

• QUEST. The Quick, Unbiased, Efficient Statistical Tree method is quick to compute and avoids other methods’ biases in favor of predictors with many categories. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary.
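The information-gain criterion that C5.0 applies at each split can be illustrated with a short Python sketch. The `Discount`/`bought` records below are invented toy data, not the lab's data set:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, split_field, target_field):
    """Entropy reduction obtained by splitting rows on split_field."""
    before = entropy([r[target_field] for r in rows])
    n = len(rows)
    after = 0.0
    for value in {r[split_field] for r in rows}:
        subset = [r[target_field] for r in rows if r[split_field] == value]
        after += (len(subset) / n) * entropy(subset)
    return before - after

# Invented toy records: does offering a discount predict a purchase?
data = [
    {"Discount": "Y", "bought": "yes"},
    {"Discount": "Y", "bought": "yes"},
    {"Discount": "N", "bought": "no"},
    {"Discount": "N", "bought": "yes"},
]
print(round(information_gain(data, "Discount", "bought"), 3))  # → 0.311
```

C5.0 would pick, at each tree level, the field whose split maximizes this gain (with further refinements such as gain ratio).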


Clustering Tools

• K-means. An approach to clustering that defines k clusters and iteratively assigns records to clusters based on distances from the mean of each cluster until a stable solution is found.

• TwoStep. A clustering method that involves preclustering the records into a large number of subclusters and then applying a hierarchical clustering technique to those subclusters to define the final clusters.

• Kohonen Networks. A type of neural network used for clustering. Also known as a self-organizing map (SOM).
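The K-means loop described above (assign records to the nearest cluster mean, then recompute the means until stable) can be sketched as follows. The six 2-D points are invented; Clementine's node adds refinements such as encoding of symbolic fields:

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to the nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Empty clusters keep their previous centroid
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Toy data: six points, three visible groups; k = 3 as in the lab exercise
pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0), (9.2, 0.8)]
centroids, clusters = k_means(pts, k=3)
print(centroids)
```

Note that the result depends on the random initial centroids; a stable solution is a local, not necessarily global, optimum.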


Classification

(Diagram: a new sample is assigned to one of Class 1, Class 2 or Class 3.) The classes are predefined!


Clustering

(Diagram: the records fall into three groups - Class 1, Class 2, Class 3, marked CROSS, TRIANGLE and STAR.) No class is defined previously!

Practical Session


Data Understanding

Data Description:

• Total no. of records: ? (find out for yourself)

Data file: http://www.comp.polyu.edu.hk/~cswinnie/lab/2005-6_sem2_lab1/MyData_lab1.csv

Attributes:

  TID      - Transaction ID
  dt       - Date
  Discount - Discount offered? Y/N
  Group    - Product Group
  ref_no   - Internal Ref no.
  prod_cd  - Product Code


Step 1: Import Data to Clementine


Add Node: Var. File (in Sources Palette)

Double click the node, then click Browse to locate the data file.


Step 2: Analyze the data
Add Node: Table (in Output Palette)

Right click and choose Execute.


Step 2: Analyze the data
Add Node: Data Audit (in Output Palette)

Execute


Step 2: Analyze the data
Add Node: Quality (in Output Palette)

Execute

Data Preparation


Edit Node: Var. File (in Stream)

Goal: Define data type and value

1. Double click the Var. File node.
2. Re-define the Type of Group and ref_no to "Set", then press "Read Values" again.


Edit Node: Var. File (in Stream)

Goal: Define blanks



Add Node: Filler (in Field Ops Palette)

Goal: Replace all blanks with a specified value


Result


Add Node: Type (in Field Ops Palette)

Goal: Remove records with blanks


In the Type node dialog, choose the value "-1" and delete it.

Q: How many records are left?


Add Node: Reclassify (in Field Ops Palette)

Goal: Replace invalid values


In the Reclassify node, modify the values to a common set of new values (Y/N).

Data Transformation


Derive New Fields

Useful Node: Derive (in Field Ops Palette)

Weekday : datetime_weekday(dt)
Hour    : datetime_hour(dt)

Goal: Add new attributes “Weekday” and “Hour”

For weekday, 0 represents Sunday, 1 represents Monday, etc.
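Outside Clementine, the same derivation can be sketched in Python. Note the convention shift: the `datetime_weekday` CLEM function counts from Sunday (per the slide), while Python's `weekday()` counts from Monday; the function names below are made up for illustration:

```python
from datetime import datetime

def derive_weekday(dt):
    """Mirror the slide's convention (0 = Sunday, 1 = Monday, ...);
    Python's weekday() uses 0 = Monday, so shift by one."""
    return (dt.weekday() + 1) % 7

def derive_hour(dt):
    return dt.hour

t = datetime(2006, 3, 9, 14, 30)  # Thursday, 9 March 2006, 14:30
print(derive_weekday(t), derive_hour(t))  # → 4 14
```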

Q: How many fields in your data?


Discretization

Goal: Divide the Hour field into 4 intervals

Useful Node: Binning (in Field Ops Palette)
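Fixed-width binning of this kind can be sketched in a few lines of Python (an illustration of the idea, not the Binning node's exact options; `bin_hour` is a made-up name):

```python
def bin_hour(hour, n_bins=4):
    """Fixed-width binning of an Hour value (0-23) into n_bins equal
    intervals; with n_bins=4 the intervals are 0-5, 6-11, 12-17, 18-23."""
    width = 24 / n_bins
    return min(int(hour // width), n_bins - 1)

print([bin_hour(h) for h in (0, 5, 6, 13, 18, 23)])  # → [0, 0, 1, 2, 3, 3]
```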


Preprocessed Data

Data Mining


Add Nodes: Type (in Field Ops Palette)


Goal: Update the type and value of data

Association


Add Nodes: SetToFlag (in Field Ops Palette)


Goal: Convert the transactional format to tabular format

Select all values
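The transactional-to-tabular pivot that Set to Flag performs can be sketched in Python: each transaction becomes one row, with a true/false flag per product. The record data and the `set_to_flag` helper name are invented for illustration:

```python
def set_to_flag(transactions):
    """Pivot (TID, product) pairs into one row per transaction, with a
    T/F flag column for every product seen in the data."""
    products = sorted({prod for _, prod in transactions})
    rows = {}
    for tid, prod in transactions:
        rows.setdefault(tid, {p: "F" for p in products})[prod] = "T"
    return products, rows

# Invented transactional records: (transaction ID, product code)
records = [(1, "P17"), (1, "P39"), (2, "P27"), (2, "P17")]
cols, table = set_to_flag(records)
print(cols)      # → ['P17', 'P27', 'P39']
print(table[1])  # → {'P17': 'T', 'P27': 'F', 'P39': 'T'}
```

This tabular form is what the Apriori node in the next step consumes, one flag field per product.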


Add Nodes: Apriori (in Modeling Palette)


Goal: Perform association with Apriori


Goal: View the mining result

Association Rules:

For the 1st rule: IF P17 and P39 THEN P27

Right Click and choose Browse

Classification


Choose the Inputs and Target

Add Nodes: C5.0 (in Modeling Palette)


Goal: View the mining result

Classification Rules:

Right Click and choose Browse


Goal: Find out the classification accuracy

Drag the classification result to the stream

Add Nodes: Classification result (in Model) and Analysis (in Output Palette)

Right Click and Execute
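The headline figure reported by the Analysis node is the share of records the model classifies correctly, which can be sketched as (the example labels are invented):

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Invented example: 3 of 4 predictions match the actual class
print(accuracy(["Y", "N", "Y", "Y"], ["Y", "N", "N", "Y"]))  # → 0.75
```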

Clustering


Choose the Inputs

Add Nodes: K-means (in Modeling Palette)

Set k= 3


Goal: View the mining result

Clustering result:

Right Click and choose Browse


Q&A Session


SUMMARY

Today, you've learnt:

• the KDD process
• the differences between nodes
• how to build streams in Clementine
• how to do data preparation with Clementine
• Association modeling
• Classification modeling
• Clustering modeling
