weka project - classification & association rule generation

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

Data Mining using Weka: A Paper on Data Mining Techniques using Weka Software

MBA 2010-2012

IT FOR BUSINESS INTELLIGENCE – TERM PAPER

INSTRUCTOR – PROF. PRITHWIS MUKERJEE

SUBMITTED BY SATHISHWARAN.R

10BM60079 MBA 2010-2012


Table of Contents

1. INTRODUCTION

2. CLASSIFICATION

   2.1 DATA

   2.2 SCREENS

   2.3 OUTPUT

   2.4 INTERPRETATION

3. ASSOCIATION RULES

   3.1 DATA

   3.2 SCREENS

   3.3 OUTPUT

   3.4 INTERPRETATION

4. REFERENCES


1. INTRODUCTION

Widespread use of computers has made life easier for business executives. However, it has also
led to a proliferation of data, which makes it difficult to extract meaning from it, and the sheer
amount of data generated today makes decision making harder. Data mining is one approach
that identifies patterns in data and supports decisions by analysing this ocean of data. Weka
(Waikato Environment for Knowledge Analysis) is free software developed at the University of
Waikato in New Zealand and is available under the GNU General Public License. The software
can be used for research, education and applications. It has a graphical user interface and a
comprehensive set of tools for analysing data. In this paper I apply two data mining techniques,
classification and association rule generation, using the Weka software.

2. CLASSIFICATION

2.1 Data

The raw data used for this analysis was obtained from the website http://tunedit.org/ and was
originally gathered from census data. There are 14 original attributes (features), including age,
work class, education, marital status, occupation and native country. The data contains
continuous, binary and categorical features. I have used the data for a two-class classification
problem: the task is to discover high-revenue people from the census data, and to use
cross-validation to check how accurately the instances are classified.

Link: http://tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff

2.2 Screens

Step 1: Launch Weka


Step 2: Click Explorer

Step 3: Click Open file


Step 4: The data is loaded into Weka

Step 5: In the Classify tab, choose the DecisionTable classifier and select Cross-validation as the test option. Click Start
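To make the cross-validation test option concrete, the sketch below shows in plain Python how a stratified 10-fold split partitions the 4147 instances. It is an illustration only and is not Weka's own code: the function name stratified_folds and the seed are assumptions, and the class counts (3118 and 1029) are taken from the rows of the confusion matrix in section 2.3.

```python
import random

def stratified_folds(labels, k=10, seed=1):
    """Split instance indices into k folds with a similar class mix in each."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of each class round-robin across the folds.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Class counts taken from the rows of the confusion matrix (assumed labels -1 and 1).
labels = [-1] * 3118 + [1] * 1029
folds = stratified_folds(labels)

for number, test_fold in enumerate(folds, start=1):
    # Each fold is held out once for testing; the rest is used for training.
    train_size = len(labels) - len(test_fold)
    positives = sum(1 for i in test_fold if labels[i] == 1)
    print(f"fold {number}: train={train_size} test={len(test_fold)} class 1 in test={positives}")
```

Every instance is used for testing exactly once, so the accuracy figures reported for cross-validation are pooled over all ten held-out folds rather than measured on the training data.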


2.3 Output

Cross-validation

=== Run information ===

Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"

Relation: ADA_Prior

Instances: 4147

Attributes: 15

age

workclass

fnlwgt

education

educationNum

maritalStatus

occupation

relationship

race

sex

capitalGain

capitalLoss

hoursPerWeek

nativeCountry

label

Test mode:10-fold cross-validation

=== Classifier model (full training set) ===

Decision Table:

Number of training instances: 4147

Number of Rules: 130

Non matches covered by Majority class.

Best first.

Start set: no attributes

Search direction: forward

Stale search after 5 node expansions

Total number of subsets evaluated: 96

Merit of best subset found: 83.82

Evaluation (for feature selection): CV (leave one out)

Feature set: 5, 8,11,12,15

Time taken to build model: 0.98 seconds

=== Stratified cross-validation ===


=== Summary ===

Correctly Classified Instances        3461               83.4579 %
Incorrectly Classified Instances       686               16.5421 %
Kappa statistic                          0.5073
Mean absolute error                      0.2353
Root mean squared error                  0.339
Relative absolute error                 63.0518 %
Root relative squared error             78.4907 %
Total Number of Instances             4147

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.939     0.483       0.855    0.939       0.895      0.873   -1
                 0.517     0.061       0.738    0.517       0.608      0.873   1
Weighted Avg.    0.835     0.378       0.826    0.835       0.824      0.873

=== Confusion Matrix ===

    a    b   <-- classified as
 2929  189 |    a = -1
  497  532 |    b = 1

2.4 Interpretation

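Under 10-fold cross-validation, 83.46 % of the instances (3461 of 4147) were classified
correctly and 16.54 % (686) were classified incorrectly.

The kappa statistic is 0.5073. It is not an accuracy figure: it measures the agreement between
predicted and actual classes beyond what chance alone would produce, and a value near 0.5
indicates moderate agreement.

The mean absolute error is 0.2353 and the root mean squared error is 0.339.

The confusion matrix shows that the classifier is much better at class -1 (2929 of 3118 correct,
TP rate 0.939) than at class 1 (532 of 1029 correct, TP rate 0.517), so nearly half of the class 1
instances are missed.

The selected feature set (5, 8, 11, 12, 15) corresponds to educationNum, relationship,
capitalGain and capitalLoss, with attribute 15 being the class label.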
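The summary figures in section 2.3 can be rechecked from the confusion matrix alone. The following is a minimal sketch in plain Python, not part of Weka; the function names are illustrative assumptions, and the only input is the four counts from the confusion matrix.

```python
# Rows are actual classes, columns are predicted classes, as in Weka's output.
confusion = [
    [2929, 189],   # actual class -1
    [497, 532],    # actual class 1
]

def accuracy(m):
    """Share of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in m)
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / total

def kappa(m):
    """Cohen's kappa: agreement beyond what the class totals give by chance."""
    total = sum(sum(row) for row in m)
    row_totals = [sum(row) for row in m]
    col_totals = [sum(row[j] for row in m) for j in range(len(m))]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    observed = accuracy(m)
    return (observed - expected) / (1 - expected)

def precision_recall(m, k):
    """Precision and recall (TP rate) for the class at row/column k."""
    true_positives = m[k][k]
    predicted_as_k = sum(row[k] for row in m)
    actually_k = sum(m[k])
    return true_positives / predicted_as_k, true_positives / actually_k

print(f"accuracy = {accuracy(confusion):.4f}")
print(f"kappa    = {kappa(confusion):.4f}")
for k, name in enumerate(["-1", "1"]):
    precision, recall = precision_recall(confusion, k)
    print(f"class {name}: precision = {precision:.3f}, recall = {recall:.3f}")
```

The recomputed values agree with the Weka report to the precision shown there, which confirms that the kappa statistic is derived from the confusion matrix and is a different quantity from the percentage of correctly classified instances.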

3. ASSOCIATION RULES

3.1 Data

The data set includes the votes of each U.S. House of Representatives Congressman on the 16
key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different
types of votes: voted for, paired for, and announced for (these three simplified to yea), voted
against, paired against, and announced against (these three simplified to nay), voted present,
voted present to avoid conflict of interest, and did not vote or otherwise make a position known
(these three simplified to an unknown disposition).

Number of Instances: 435 (267 democrats, 168 republicans)

Number of Attributes: 16 + class name = 17 (all Boolean valued)


Attribute Information:

Class Name: 2 (democrat, republican)

handicapped-infants: 2 (y,n)

water-project-cost-sharing: 2 (y,n)

adoption-of-the-budget-resolution: 2 (y,n)

physician-fee-freeze: 2 (y,n)

el-salvador-aid: 2 (y,n)

religious-groups-in-schools: 2 (y,n)

anti-satellite-test-ban: 2 (y,n)

aid-to-nicaraguan-contras: 2 (y,n)

mx-missile: 2 (y,n)

immigration: 2 (y,n)

synfuels-corporation-cutback: 2 (y,n)

education-spending: 2 (y,n)

superfund-right-to-sue: 2 (y,n)

crime: 2 (y,n)

duty-free-exports: 2 (y,n)

export-administration-act-south-africa: 2 (y,n)

Link: http://tunedit.org/repo/UCI/vote.arff

3.2 Screens

Step 1: Launch Weka


Step 2: Click Explorer

Step 3: Click Open file… and choose the vote.arff file


Step 4: Click Associate and choose Apriori

Step 5: Click Start

3.3 Output

=== Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1

Relation: vote

Instances: 435

Attributes: 17

handicapped-infants


water-project-cost-sharing

adoption-of-the-budget-resolution

physician-fee-freeze

el-salvador-aid

religious-groups-in-schools

anti-satellite-test-ban

aid-to-nicaraguan-contras

mx-missile

immigration

synfuels-corporation-cutback

education-spending

superfund-right-to-sue

crime

duty-free-exports

export-administration-act-south-africa

Class

=== Associator model (full training set) ===

Apriori

=======

Minimum support: 0.45 (196 instances)

Minimum metric <confidence>: 0.9

Number of cycles performed: 11

Generated sets of large itemsets:

Size of set of large itemsets L(1): 20




Best rules found:

 1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219    conf:(1)

 2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198    conf:(1)

 3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210    conf:(1)

 4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201    conf:(1)

 5. physician-fee-freeze=n 247 ==> Class=democrat 245    conf:(0.99)

 6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197    conf:(0.99)

 7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204    conf:(0.98)

 8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198    conf:(0.98)

 9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197    conf:(0.97)

10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210    conf:(0.96)

3.4 Interpretation

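Apriori produced the 10 requested rules (-N 10), each with a confidence of at least 0.9 (-C 0.9).
Starting from an upper bound of 1.0 (-U) and lowering the minimum support in steps of 0.05 (-D),
Weka reached a minimum support of 0.45 (196 of 435 instances) after 11 cycles.

In each rule, the number before ==> is the count of instances matching the left-hand side and the
number after it is the count that also match the right-hand side. Rule 1, for example, says that all
219 members who voted yes on the budget resolution and no on the physician fee freeze are
Democrats (confidence 1). Rule 5 says that 245 of the 247 members who voted no on the
physician fee freeze are Democrats (confidence 0.99).

Seven of the ten rules predict Class=democrat, and physician-fee-freeze=n appears in most of
them, so this vote is the strongest single indicator of party in the rules found.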
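The support and confidence figures in the Apriori output can be reproduced with simple arithmetic. The sketch below is plain Python, not Weka code. The counts are copied from the ten rules in section 3.3, the variable names are illustrative, and it assumes that the minimum support starts at the upper bound (-U 1.0) and drops by the delta (-D 0.05) in each of the 11 cycles reported.

```python
import math

n_instances = 435
upper_bound, delta, cycles = 1.0, 0.05, 11   # -U, -D and "Number of cycles performed"
min_confidence = 0.9                          # -C

# Assumed relation: support is lowered by delta once per cycle.
min_support = round(upper_bound - cycles * delta, 2)
min_count = math.ceil(min_support * n_instances)
print(f"minimum support = {min_support} ({min_count} instances)")

# (rule number, instances matching the left-hand side, instances matching both sides)
rules = [
    (1, 219, 219), (2, 198, 198), (3, 211, 210), (4, 202, 201), (5, 247, 245),
    (6, 200, 197), (7, 208, 204), (8, 203, 198), (9, 204, 197), (10, 218, 210),
]

confidences = []
for number, antecedent, both in rules:
    # Confidence = share of left-hand-side matches that also match the right-hand side.
    confidence = both / antecedent
    confidences.append(confidence)
    print(f"rule {number:2d}: support = {both / n_instances:.3f}, confidence = {confidence:.4f}")
```

Every rule covers at least 196 instances and has a confidence of at least 0.9, which matches the thresholds printed in the output. Rules 3 and 4 are shown as conf:(1) by Weka although one instance in each is an exception, because the displayed value is rounded.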

4. REFERENCES

Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques.

http://www.cs.waikato.ac.nz/ml/weka/

http://www.tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff

http://tunedit.org/repo/UCI/vote.arff
