Data mining exercise with SPSS Clementine Lab 1 Winnie Lam Email: [email protected] Website: http://www.comp.polyu.edu.hk/~cswinnie/ The Hong Kong Polytechnic University Department of Computing Last update: 09/03/2006


Page 1: DM_Lab1

Data mining exercise with SPSS Clementine

Lab 1

Winnie Lam
Email: [email protected]
Website: http://www.comp.polyu.edu.hk/~cswinnie/
The Hong Kong Polytechnic University
Department of Computing
Last update: 09/03/2006

Page 2: DM_Lab1

2

OVERVIEW

Clementine is a data mining tool that combines advanced modeling technology with ease of use; it helps you discover and predict interesting and valuable relationships within your data.

You can use Clementine for decision-support activities such as:

• Creating customer profiles and determining customer lifetime value.
• Detecting and predicting fraud in your organization.
• Determining and predicting valuable sequences in Web-site data.
• Predicting future trends in sales and growth.
• Profiling for direct mailing response and credit risk.
• Performing churn prediction, classification, and segmentation.

Page 3: DM_Lab1

3

KDD Process

[Figure: the KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Evaluation → Knowledge.]

Page 4: DM_Lab1

4

Simplified KDD process

• Data Understanding: define target & discover useful data
• Data Preparation: obtain clean & useful data
• Modeling (Data Mining): discover patterns

Page 5: DM_Lab1

5

[Screenshot: the Clementine window, showing the Stream Canvas, Node Palettes, Object Manager, and Project window.]

Page 6: DM_Lab1

Learning the Nodes

Palettes: Sources | Record Ops | Field Ops | Graphs | Modeling | Output

Page 7: DM_Lab1

7

NODES

Source nodes

• Database: import data using ODBC
• Variable File: free-field ASCII data
• Fixed File: fixed-field ASCII data
• SPSS File: import SPSS files
• SAS File: import files in SAS format
• User Input: replace existing source nodes


Page 8: DM_Lab1

8

NODES

Record Operations Nodes
- make changes to the data set at the record level

• Select
• Sample
• Balance
• Aggregate
• Sort
• Merge
• Append
• Distinct

Page 9: DM_Lab1

9

NODES

Field Operations Nodes
- for data transformation and preparation

• Type
• Filter
• Derive
• Filler
• Reclassify
• Binning
• Partition
• Set to Flag
• History
• Field Reorder

Page 10: DM_Lab1

10

NODES

Graph Nodes
- explore & check the distribution and relationships

• Plot
• Multiplot
• Distribution
• Histogram
• Collection
• Web
• Evaluation

Page 11: DM_Lab1

11

Modeling Nodes
- the heart of the DM process (machine learning)
- each method has certain strengths and is best suited for particular types of problems

• Neural Net
• C5.0
• Classification and Regression (C&R) Trees
• QUEST
• CHAID
• Kohonen
• K-Means
• TwoStep Cluster
• Apriori
• Generalized Rule Induction (GRI)
• CARMA
• Sequence Detection
• PCA/Factor Analysis
• Linear Regression
• Logistic Regression

Page 12: DM_Lab1

12

NODES

Output Nodes
- obtain information about your data and models
- export data in various formats

• Table
• Matrix
• Analysis
• Data Audit
• Statistics
• Quality
• Report
• Set Globals
• Publisher
• Database Output
• Flat File
• SPSS Export
• SAS Export
• Excel
• SPSS Procedure

Page 13: DM_Lab1

13

Association Tools

• Apriori discovers association rules in the data. For large problems, Apriori is generally faster to train than GRI. It has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions.

• GRI (Generalized Rule Induction) extracts a set of rules from the data, similar to Apriori. GRI can handle numeric as well as symbolic input fields.

• CARMA uses an association rules discovery algorithm to discover association rules in the data. The CARMA node does not require In fields or Out fields; it is equivalent to building an Apriori model with all fields set to Both.

• Sequence discovers patterns in sequential or time-oriented data. A sequence is a list of item sets that tend to occur in a predictable order.
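The support/confidence arithmetic behind these tools can be sketched in a few lines of plain Python. The baskets and product codes below are invented for illustration, and the brute-force enumeration merely stands in for Apriori's pruned candidate search:

```python
from itertools import combinations

# Invented toy baskets; real input would come from a Clementine stream.
transactions = [
    {"P17", "P39", "P27"},
    {"P17", "P39", "P27"},
    {"P17", "P39"},
    {"P27", "P44"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Brute-force stand-in for Apriori: keep itemsets meeting minimum support.
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c): support(set(c))
    for size in (1, 2, 3)
    for c in combinations(items, size)
    if support(set(c)) >= 0.5
}

# Confidence of the rule IF P17 and P39 THEN P27.
conf = support({"P17", "P39", "P27"}) / support({"P17", "P39"})
print(round(conf, 2))  # 0.67
```

Here the rule holds in 2 of the 3 baskets that contain both P17 and P39, hence a confidence of about 67%.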

Page 14: DM_Lab1

14

Classification Tools – Decision tree

• C5.0. This method splits the sample based on the field that provides the maximum information gain at each level to produce either a decision tree or a ruleset. The target field must be categorical. Multiple splits into more than two subgroups are allowed.

• C&RT. The Classification and Regression Trees method is based on minimization of impurity measures. A node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups).

• CHAID. Chi-squared Automatic Interaction Detector uses chi-squared statistics to identify optimal splits. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level.

• QUEST. The Quick, Unbiased, Efficient Statistical Tree method is quick to compute and avoids other methods’ biases in favor of predictors with many categories. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary.
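C5.0's split criterion, information gain, is easy to compute by hand. A minimal sketch with an invented Y/N target field (not from the lab data):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, split_groups):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = len(labels)
    child = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - child

parent = ["Y", "Y", "N", "N"]  # hypothetical target values
# A split that separates the classes perfectly gains a full bit...
print(information_gain(parent, [["Y", "Y"], ["N", "N"]]))  # 1.0
# ...while a split that separates nothing gains nothing.
print(information_gain(parent, [["Y", "N"], ["Y", "N"]]))  # 0.0
```

At each level, C5.0 picks the field whose split maximizes this quantity.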

Page 15: DM_Lab1

15

Clustering Tools

• K-means. An approach to clustering that defines k clusters and iteratively assigns records to clusters based on distances from the mean of each cluster until a stable solution is found.

• TwoStep. A clustering method that involves preclustering the records into a large number of subclusters and then applying a hierarchical clustering technique to those subclusters to define the final clusters.

• Kohonen Networks. A type of neural network used for clustering. Also known as a self-organizing map (SOM).
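The assign-then-update loop of K-means is short enough to write out in plain Python. The 1-D points, k = 2, and the naive initialisation below are all illustrative choices:

```python
# Plain k-means on 1-D points, k = 2; data and k are illustrative.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [points[0], points[-1]]  # naive initialisation

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 10.5]
```

The loop stabilises once no point changes cluster; Clementine's K-Means node works on the same principle, with a distance measure over all input fields.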

Page 16: DM_Lab1

16

Classification

[Figure: a new sample is assigned to one of three predefined classes (Class 1, Class 2, Class 3). With classification, the classes are predefined!]

Page 17: DM_Lab1

17

Clustering

[Figure: records fall into three discovered groups of crosses, triangles, and stars (Class 1, Class 2, Class 3). No class is defined previously!]

Page 18: DM_Lab1

Practical Session

Page 19: DM_Lab1

19

Data Understanding

Data Description:

• Total no. of records: ? (find out by yourself)

Data file: http://www.comp.polyu.edu.hk/~cswinnie/lab/2005-6_sem2_lab1/MyData_lab1.csv

Attributes:

• TID: Transaction ID
• dt: Date
• Discount: Discount offered? (Y/N)
• Group: Product Group
• ref_no: Internal Ref no.
• prod_cd: Product Code
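If you want to inspect the same file outside Clementine, Python's csv module will do. The two sample rows below are invented stand-ins for MyData_lab1.csv; only the field names come from the slide:

```python
import csv
import io

# Invented two-row stand-in for MyData_lab1.csv; real values will differ.
sample = io.StringIO(
    "TID,dt,Discount,Group,ref_no,prod_cd\n"
    "1,2006-03-01 09:15,Y,G1,R100,P17\n"
    "2,2006-03-01 14:40,N,G2,R101,P27\n"
)

reader = csv.DictReader(sample)
records = list(reader)
print(len(records))           # total no. of records read
print(records[0]["prod_cd"])  # P17
```

Replacing the StringIO with `open("MyData_lab1.csv")` would answer the record-count question directly, though the lab intends you to find it with the Table node.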

Page 20: DM_Lab1

20

Step 1: Import Data to Clementine

Data Understanding

Add Node: Var. File (in Sources Palette)

Double-click the node and Browse to the data file.

Page 21: DM_Lab1

21

Data Understanding

Step 2: Analyze the data
Add Node: Table (in Output Palette)

Right-click the node and choose Execute.

Page 22: DM_Lab1

22

Data Understanding

Step 2: Analyze the data
Add Node: Data Audit (in Output Palette)

Execute

Page 23: DM_Lab1

23

Data Understanding

Step 2: Analyze the data
Add Node: Quality (in Output Palette)

Execute

Page 24: DM_Lab1

Data Preparation

Page 25: DM_Lab1

25

Data Preparation

Edit Node: Var. File (in Stream)

Goal: Define data type and value

1. Double-click the node to open it.
2. Re-define the Type of Group and ref_no to "Set", then press "Read Values" again.

Page 26: DM_Lab1

26

Data Preparation

Edit Node: Var. File (in Stream)

Goal: Define blanks


Page 27: DM_Lab1

27

Add Node: Filler (in Field Ops Palette)

Goal: Replace all blanks with a specified value

Data Preparation

Result
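In spirit, the Filler node's blank replacement amounts to the following sketch (the field name and the "-1" placeholder mirror this lab's setup; the rows are invented):

```python
# Sketch of the Filler node: replace blank (empty or whitespace-only)
# values with a placeholder such as "-1".
rows = [
    {"Discount": "Y"},
    {"Discount": ""},    # blank value
    {"Discount": "  "},  # whitespace only
]

for row in rows:
    if not row["Discount"].strip():
        row["Discount"] = "-1"

print([r["Discount"] for r in rows])  # ['Y', '-1', '-1']
```

Flagging blanks with a sentinel like "-1" makes them easy to select and delete in the next step.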

Page 28: DM_Lab1

28

Add Node: Type (in Field Ops Palette)

Goal: Remove records with blanks

Data Preparation

[Screenshot: steps 1-5 in the Type node dialog. In step 4, choose "-1" and delete it.]

Q: How many records are left?

Page 29: DM_Lab1

29

Add Node: Reclassify (in Field Ops Palette)

Goal: Replace invalid values

Data Preparation

[Screenshot: steps 1-6 in the Reclassify node. In step 6, modify the values to a common set of new values (Y/N).]
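What Reclassify does is a value-to-value mapping. A sketch in plain Python, where the variant spellings of Y/N are hypothetical examples of invalid values:

```python
# Sketch of the Reclassify node: map inconsistent codings of the
# Discount field onto a common set of new values (Y/N).
mapping = {"Y": "Y", "yes": "Y", "T": "Y", "N": "N", "no": "N", "F": "N"}

values = ["Y", "yes", "no", "T", "N"]     # hypothetical raw values
cleaned = [mapping.get(v, v) for v in values]  # unknown values pass through
print(cleaned)  # ['Y', 'Y', 'N', 'Y', 'N']
```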

Page 30: DM_Lab1

Data Transformation

Page 31: DM_Lab1

31

Derive New Fields

Useful Node: Derive (in Field Ops Palette)

Weekday: datetime_weekday(dt)
Hour: datetime_hour(dt)

Goal: Add new attributes “Weekday” and “Hour”

For weekday, 0 represents Sunday, 1 represents Monday, etc.

Q: How many fields in your data?
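The same two derived fields can be computed with Python's datetime module. Note the convention gap: the slide says Clementine's datetime_weekday uses 0 = Sunday, while Python's weekday() uses 0 = Monday, so a shift is needed. The date below is just an example:

```python
from datetime import datetime

dt = datetime(2006, 3, 9, 14, 30)  # an example timestamp (a Thursday)

# Clementine's datetime_weekday: 0 = Sunday, 1 = Monday, ...
# Python's weekday(): 0 = Monday, so shift by one to match.
weekday = (dt.weekday() + 1) % 7
hour = dt.hour

print(weekday, hour)  # 4 14  (Thursday = 4 when Sunday = 0)
```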

Page 32: DM_Lab1

32

DiscretizationGoal: Divide the Hour field into 4 intervals

Useful Node: Binning (in Field Ops Palette)
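One simple fixed-width scheme for this step: with 24 hours and 4 intervals, each bin spans 6 hours. A sketch (the Binning node offers other methods too, e.g. tiles and ranks):

```python
# Fixed-width binning: divide the Hour field (0-23) into 4 equal
# intervals of 6 hours each, numbered 1-4.
def hour_bin(hour):
    """Return a bin index 1-4 for an hour in 0..23."""
    return hour // 6 + 1

hours = [0, 5, 6, 13, 18, 23]
print([hour_bin(h) for h in hours])  # [1, 1, 2, 3, 4, 4]
```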

Page 33: DM_Lab1

33

Preprocessed Data

Page 34: DM_Lab1

Data Mining

Page 35: DM_Lab1

35

Data Mining

Add Nodes: Type (in Field Ops Palette)


Goal: Update the type and value of data

Page 36: DM_Lab1

Association

Page 37: DM_Lab1

37

Association

Add Nodes: SetToFlag (in Field Ops Palette)

Goal: Convert the transactional format to tabular format

[Screenshot: steps 1-4 in the Set to Flag node; select all values.]
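The transformation itself is a one-hot pivot: one row per (transaction, product) pair becomes one row per transaction with a true/false flag per product. A sketch with invented IDs and product codes:

```python
# Sketch of the Set to Flag node: pivot transactional rows into a
# tabular layout with one boolean flag column per product.
rows = [("T1", "P17"), ("T1", "P39"), ("T2", "P27")]  # invented data
products = sorted({p for _, p in rows})

# Group the products bought in each transaction.
baskets = {}
for tid, prod in rows:
    baskets.setdefault(tid, set()).add(prod)

# One flag per product for each transaction.
table = {tid: {p: p in bought for p in products}
         for tid, bought in baskets.items()}
print(table["T1"])  # {'P17': True, 'P27': False, 'P39': True}
```

This tabular form is what the Apriori node consumes in the next step.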

Page 38: DM_Lab1

38

Association

Add Nodes: Apriori (in Modeling Palette)

Goal: Perform association with Apriori

[Screenshot: steps 1-3 in the Apriori node.]

Page 39: DM_Lab1

39

AssociationGoal: View the mining result

Association Rules:

For the 1st rule: IF P17 and P39 THEN P27

Right Click and choose Browse

Page 40: DM_Lab1

Classification

Page 41: DM_Lab1

41

Classification

Choose the Inputs and Target

Add Nodes: C5.0 (in Modeling Palette)

Page 42: DM_Lab1

42

ClassificationGoal: View the mining result

Classification Rules:

Right Click and choose Browse

Page 43: DM_Lab1

43

ClassificationGoal: Find out the classification accuracy

Drag the classification result to the stream

Add Nodes: Classification result (in Model) and Analysis (in Output Palette)

Right Click and Execute
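The headline number the Analysis node reports is classification accuracy, which is simply the fraction of records whose predicted class matches the actual one. In miniature, with invented labels:

```python
# What the Analysis node reports, in miniature: compare predicted
# class labels against the actual ones and compute accuracy.
actual    = ["Y", "Y", "N", "N", "Y"]  # invented target values
predicted = ["Y", "N", "N", "N", "Y"]  # invented model output

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{accuracy:.0%}")  # 80%
```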

Page 44: DM_Lab1

Clustering

Page 45: DM_Lab1

45

Clustering

Choose the Inputs

Add Nodes: K-means (in Modeling Palette)

Set k = 3

Page 46: DM_Lab1

46

ClusteringGoal: View the mining result

Clustering result:

Right Click and choose Browse

Page 47: DM_Lab1

47

Q&A Session

Page 48: DM_Lab1

48

SUMMARY

Today, you've learnt:
• the KDD process
• the differences between nodes
• how to build streams in Clementine
• how to do data preparation with Clementine
• Association modeling
• Classification modeling
• Clustering modeling