Data mining exercise with SPSS Clementine Lab 1 Winnie Lam Email: [email protected] Website: http://www.comp.polyu.edu.hk/~cswinnie/ The Hong Kong Polytechnic University Department of Computing Last update: 09/03/2006


Page 1: DM_Lab1

Data mining exercise with SPSS Clementine

Lab 1

Winnie Lam
Email: [email protected]
Website: http://www.comp.polyu.edu.hk/~cswinnie/
The Hong Kong Polytechnic University
Department of Computing
Last update: 09/03/2006

Page 2: DM_Lab1

2

OVERVIEW

Clementine is a data mining tool that combines advanced modeling technology with ease of use; it helps you discover and predict interesting and valuable relationships within your data.

You can use Clementine for decision-support activities such as:

• Creating customer profiles and determining customer lifetime value.
• Detecting and predicting fraud in your organization.
• Determining and predicting valuable sequences in Web-site data.
• Predicting future trends in sales and growth.
• Profiling for direct mailing response and credit risk.
• Performing churn prediction, classification, and segmentation.

Page 3: DM_Lab1

3

KDD Process

[Figure: the KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining → Patterns → Evaluation → Knowledge.]

Page 4: DM_Lab1

4

Simplified KDD process

• Data Understanding: define target & discover useful data
• Data Preparation: obtain clean & useful data
• Modeling (Data Mining): discover patterns

Page 5: DM_Lab1

5

[Screenshot: the Clementine window, showing the Stream Canvas, Node Palettes, Object Manager, and Project window.]

Page 6: DM_Lab1

Learning the Nodes

Palettes: Sources | Record Ops | Field Ops | Graphs | Modeling | Output

Page 7: DM_Lab1

7

NODES

Source nodes

• Database: import data using ODBC
• Variable File: free-field ASCII data
• Fixed File: fixed-field ASCII data
• SPSS File: import SPSS files
• SAS File: import files in SAS format
• User Input: replace existing source nodes


Page 8: DM_Lab1

8

NODES

Record Operations Nodes
- make changes to the data set at the record level

• Select
• Sample
• Balance
• Aggregate
• Sort
• Merge
• Append
• Distinct

Page 9: DM_Lab1

9

NODES

Field Operations Nodes
- for data transformation and preparation

• Type
• Filter
• Derive
• Filler
• Reclassify
• Binning
• Partition
• Set to Flag
• History
• Field Reorder

Page 10: DM_Lab1

10

NODES

Graph Nodes
- explore & check the distribution and relationships

• Plot
• Multiplot
• Distribution
• Histogram
• Collection
• Web
• Evaluation

Page 11: DM_Lab1

11

Modeling Nodes
- the heart of the DM process (machine learning)
- each method has certain strengths and is best suited for particular types of problems

• Neural Net
• C5.0
• Classification and Regression (C&R) Trees
• QUEST
• CHAID
• Kohonen
• K-Means
• TwoStep Cluster
• Apriori
• Generalized Rule Induction (GRI)
• CARMA
• Sequence Detection
• PCA/Factor Analysis
• Linear Regression
• Logistic Regression

Page 12: DM_Lab1

12

NODES

Output Nodes
- obtain information about your data and models
- export data in various formats

• Table
• Matrix
• Analysis
• Data Audit
• Statistics
• Quality
• Report
• Set Globals
• Publisher
• Database Output
• Flat File
• SPSS Export
• SAS Export
• Excel
• SPSS Procedure

Page 13: DM_Lab1

13

Association Tools

• Apriori discovers association rules in the data. For large problems, Apriori is generally faster to train than GRI. It has no arbitrary limit on the number of rules that can be retained and can handle rules with up to 32 preconditions.

• GRI (Generalized Rule Induction) extracts a set of rules from the data, similar to Apriori. GRI can handle numeric as well as symbolic input fields.

• CARMA uses an association rules discovery algorithm to discover association rules in the data. The CARMA node does not require In fields or Out fields; it is equivalent to building an Apriori model with all fields set to Both.

• Sequence discovers patterns in sequential or time-oriented data. A sequence is a list of item sets that tend to occur in a predictable order.
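The support/confidence arithmetic behind these tools can be sketched in a few lines of plain Python. The baskets and product codes below are invented for illustration, and the brute-force enumeration merely stands in for Apriori's pruned candidate search:

```python
from itertools import combinations

# Invented toy baskets; real input would come from a Clementine stream.
transactions = [
    {"P17", "P39", "P27"},
    {"P17", "P39", "P27"},
    {"P17", "P39"},
    {"P27", "P44"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Brute-force stand-in for Apriori: keep itemsets meeting minimum support.
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c): support(set(c))
    for size in (1, 2, 3)
    for c in combinations(items, size)
    if support(set(c)) >= 0.5
}

# Confidence of the rule IF P17 and P39 THEN P27.
conf = support({"P17", "P39", "P27"}) / support({"P17", "P39"})
print(round(conf, 2))  # 0.67
```

Here the rule holds in 2 of the 3 baskets that contain both P17 and P39, hence a confidence of about 67%.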

Page 14: DM_Lab1

14

Classification Tools – Decision tree

• C5.0. This method splits the sample based on the field that provides the maximum information gain at each level to produce either a decision tree or a ruleset. The target field must be categorical. Multiple splits into more than two subgroups are allowed.

• C&RT. The Classification and Regression Trees method is based on minimization of impurity measures. A node is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and predictor fields can be range or categorical; all splits are binary (only two subgroups).

• CHAID. Chi-squared Automatic Interaction Detector uses chi-squared statistics to identify optimal splits. Target and predictor fields can be range or categorical; nodes can be split into two or more subgroups at each level.

• QUEST. The Quick, Unbiased, Efficient Statistical Tree method is quick to compute and avoids other methods’ biases in favor of predictors with many categories. Predictor fields can be numeric ranges, but the target field must be categorical. All splits are binary.
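C5.0's split criterion, information gain, is easy to compute by hand. A minimal sketch with an invented Y/N target field (not from the lab data):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, split_groups):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = len(labels)
    child = sum(len(g) / n * entropy(g) for g in split_groups)
    return entropy(labels) - child

parent = ["Y", "Y", "N", "N"]  # hypothetical target values
# A split that separates the classes perfectly gains a full bit...
print(information_gain(parent, [["Y", "Y"], ["N", "N"]]))  # 1.0
# ...while a split that separates nothing gains nothing.
print(information_gain(parent, [["Y", "N"], ["Y", "N"]]))  # 0.0
```

At each level, C5.0 picks the field whose split maximizes this quantity.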

Page 15: DM_Lab1

15

Clustering Tools

• K-means. An approach to clustering that defines k clusters and iteratively assigns records to clusters based on distances from the mean of each cluster until a stable solution is found.

• TwoStep. A clustering method that involves preclustering the records into a large number of subclusters and then applying a hierarchical clustering technique to those subclusters to define the final clusters.

• Kohonen Networks. A type of neural network used for clustering. Also known as a self-organizing map (SOM).
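The assign-then-update loop of K-means is short enough to write out in plain Python. The 1-D points, k = 2, and the naive initialisation below are all illustrative choices:

```python
# Plain k-means on 1-D points, k = 2; data and k are illustrative.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids = [points[0], points[-1]]  # naive initialisation

for _ in range(10):
    # Assignment step: each point joins its nearest centroid.
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 10.5]
```

The loop stabilises once no point changes cluster; Clementine's K-Means node works on the same principle, with a distance measure over all input fields.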

Page 16: DM_Lab1

16

Classification

[Figure: a new sample is assigned to one of three predefined classes (Class 1, Class 2, Class 3). With classification, the classes are predefined!]

Page 17: DM_Lab1

17

Clustering

[Figure: records fall into three discovered groups of crosses, triangles, and stars (Class 1, Class 2, Class 3). No class is defined previously!]

Page 18: DM_Lab1

Practical Session

Page 19: DM_Lab1

19

Data Understanding

Data Description:

• Total no. of records: ? (find out by yourself)

Data file: http://www.comp.polyu.edu.hk/~cswinnie/lab/2005-6_sem2_lab1/MyData_lab1.csv

Attributes:

• TID: Transaction ID
• dt: Date
• Discount: Discount offered? (Y/N)
• Group: Product Group
• ref_no: Internal Ref no.
• prod_cd: Product Code
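If you want to inspect the same file outside Clementine, Python's csv module will do. The two sample rows below are invented stand-ins for MyData_lab1.csv; only the field names come from the slide:

```python
import csv
import io

# Invented two-row stand-in for MyData_lab1.csv; real values will differ.
sample = io.StringIO(
    "TID,dt,Discount,Group,ref_no,prod_cd\n"
    "1,2006-03-01 09:15,Y,G1,R100,P17\n"
    "2,2006-03-01 14:40,N,G2,R101,P27\n"
)

reader = csv.DictReader(sample)
records = list(reader)
print(len(records))           # total no. of records read
print(records[0]["prod_cd"])  # P17
```

Replacing the StringIO with `open("MyData_lab1.csv")` would answer the record-count question directly, though the lab intends you to find it with the Table node.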

Page 20: DM_Lab1

20

Step 1: Import Data to Clementine

Data Understanding

Add Node: Var. File (in Sources Palette)

Double-click the node and Browse to the data file.

Page 21: DM_Lab1

21

Data Understanding

Step 2: Analyze the data
Add Node: Table (in Output Palette)

Right-click the node and choose Execute.

Page 22: DM_Lab1

22

Data Understanding

Step 2: Analyze the data
Add Node: Data Audit (in Output Palette)

Execute

Page 23: DM_Lab1

23

Data Understanding

Step 2: Analyze the data
Add Node: Quality (in Output Palette)

Execute

Page 24: DM_Lab1

Data Preparation

Page 25: DM_Lab1

25

Data Preparation

Edit Node: Var. File (in Stream)

Goal: Define data type and value

1. Double-click the node to open it.
2. Re-define the Type of Group and ref_no to "Set", then press "Read Values" again.

Page 26: DM_Lab1

26

Data Preparation

Edit Node: Var. File (in Stream)

Goal: Define blanks


Page 27: DM_Lab1

27

Add Node: Filler (in Field Ops Palette)

Goal: Replace all blanks with a specified value

Data Preparation

Result
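In spirit, the Filler node's blank replacement amounts to the following sketch (the field name and the "-1" placeholder mirror this lab's setup; the rows are invented):

```python
# Sketch of the Filler node: replace blank (empty or whitespace-only)
# values with a placeholder such as "-1".
rows = [
    {"Discount": "Y"},
    {"Discount": ""},    # blank value
    {"Discount": "  "},  # whitespace only
]

for row in rows:
    if not row["Discount"].strip():
        row["Discount"] = "-1"

print([r["Discount"] for r in rows])  # ['Y', '-1', '-1']
```

Flagging blanks with a sentinel like "-1" makes them easy to select and delete in the next step.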

Page 28: DM_Lab1

28

Add Node: Type (in Field Ops Palette)

Goal: Remove records with blanks

Data Preparation

[Screenshot: steps 1-5 in the Type node dialog. In step 4, choose "-1" and delete it.]

Q: How many records are left?

Page 29: DM_Lab1

29

Add Node: Reclassify (in Field Ops Palette)

Goal: Replace invalid values

Data Preparation

[Screenshot: steps 1-6 in the Reclassify node. In step 6, modify the values to a common set of new values (Y/N).]
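What Reclassify does is a value-to-value mapping. A sketch in plain Python, where the variant spellings of Y/N are hypothetical examples of invalid values:

```python
# Sketch of the Reclassify node: map inconsistent codings of the
# Discount field onto a common set of new values (Y/N).
mapping = {"Y": "Y", "yes": "Y", "T": "Y", "N": "N", "no": "N", "F": "N"}

values = ["Y", "yes", "no", "T", "N"]     # hypothetical raw values
cleaned = [mapping.get(v, v) for v in values]  # unknown values pass through
print(cleaned)  # ['Y', 'Y', 'N', 'Y', 'N']
```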

Page 30: DM_Lab1

Data Transformation

Page 31: DM_Lab1

31

Derive New Fields

Useful Node: Derive (in Field Ops Palette)

Weekday: datetime_weekday(dt)
Hour: datetime_hour(dt)

Goal: Add new attributes “Weekday” and “Hour”

For weekday, 0 represents Sunday, 1 represents Monday, etc.

Q: How many fields in your data?
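The same two derived fields can be computed with Python's datetime module. Note the convention gap: the slide says Clementine's datetime_weekday uses 0 = Sunday, while Python's weekday() uses 0 = Monday, so a shift is needed. The date below is just an example:

```python
from datetime import datetime

dt = datetime(2006, 3, 9, 14, 30)  # an example timestamp (a Thursday)

# Clementine's datetime_weekday: 0 = Sunday, 1 = Monday, ...
# Python's weekday(): 0 = Monday, so shift by one to match.
weekday = (dt.weekday() + 1) % 7
hour = dt.hour

print(weekday, hour)  # 4 14  (Thursday = 4 when Sunday = 0)
```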

Page 32: DM_Lab1

32

DiscretizationGoal: Divide the Hour field into 4 intervals

Useful Node: Binning (in Field Ops Palette)
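One simple fixed-width scheme for this step: with 24 hours and 4 intervals, each bin spans 6 hours. A sketch (the Binning node offers other methods too, e.g. tiles and ranks):

```python
# Fixed-width binning: divide the Hour field (0-23) into 4 equal
# intervals of 6 hours each, numbered 1-4.
def hour_bin(hour):
    """Return a bin index 1-4 for an hour in 0..23."""
    return hour // 6 + 1

hours = [0, 5, 6, 13, 18, 23]
print([hour_bin(h) for h in hours])  # [1, 1, 2, 3, 4, 4]
```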

Page 33: DM_Lab1

33

Preprocessed Data

Page 34: DM_Lab1

Data Mining

Page 35: DM_Lab1

35

Data Mining

Add Nodes: Type (in Field Ops Palette)


Goal: Update the type and value of data

Page 36: DM_Lab1

Association

Page 37: DM_Lab1

37

Association

Add Nodes: SetToFlag (in Field Ops Palette)

Goal: Convert the transactional format to tabular format

[Screenshot: steps 1-4 in the Set to Flag node; select all values.]
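The transformation itself is a one-hot pivot: one row per (transaction, product) pair becomes one row per transaction with a true/false flag per product. A sketch with invented IDs and product codes:

```python
# Sketch of the Set to Flag node: pivot transactional rows into a
# tabular layout with one boolean flag column per product.
rows = [("T1", "P17"), ("T1", "P39"), ("T2", "P27")]  # invented data
products = sorted({p for _, p in rows})

# Group the products bought in each transaction.
baskets = {}
for tid, prod in rows:
    baskets.setdefault(tid, set()).add(prod)

# One flag per product for each transaction.
table = {tid: {p: p in bought for p in products}
         for tid, bought in baskets.items()}
print(table["T1"])  # {'P17': True, 'P27': False, 'P39': True}
```

This tabular form is what the Apriori node consumes in the next step.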

Page 38: DM_Lab1

38

Association

Add Nodes: Apriori (in Modeling Palette)

Goal: Perform association with Apriori

[Screenshot: steps 1-3 in the Apriori node.]

Page 39: DM_Lab1

39

AssociationGoal: View the mining result

Association Rules:

For the 1st rule: IF P17 and P39 THEN P27

Right Click and choose Browse

Page 40: DM_Lab1

Classification

Page 41: DM_Lab1

41

Classification

Choose the Inputs and Target

Add Nodes: C5.0 (in Modeling Palette)

Page 42: DM_Lab1

42

ClassificationGoal: View the mining result

Classification Rules:

Right Click and choose Browse

Page 43: DM_Lab1

43

ClassificationGoal: Find out the classification accuracy

Drag the classification result to the stream

Add Nodes: Classification result (in Model) and Analysis (in Output Palette)

Right Click and Execute
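The headline number the Analysis node reports is classification accuracy, which is simply the fraction of records whose predicted class matches the actual one. In miniature, with invented labels:

```python
# What the Analysis node reports, in miniature: compare predicted
# class labels against the actual ones and compute accuracy.
actual    = ["Y", "Y", "N", "N", "Y"]  # invented target values
predicted = ["Y", "N", "N", "N", "Y"]  # invented model output

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{accuracy:.0%}")  # 80%
```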

Page 44: DM_Lab1

Clustering

Page 45: DM_Lab1

45

Clustering

Choose the Inputs

Add Nodes: K-means (in Modeling Palette)

Set k = 3

Page 46: DM_Lab1

46

ClusteringGoal: View the mining result

Clustering result:

Right Click and choose Browse

Page 47: DM_Lab1

47

Q&A Session

Page 48: DM_Lab1

48

SUMMARY

Today, you've learnt:
• the KDD process
• the differences between nodes
• how to build streams in Clementine
• how to do data preparation with Clementine
• Association modeling
• Classification modeling
• Clustering modeling