Weka Project - Classification & Association Rule Generation
VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR
Data Mining using Weka
A Paper on Data Mining Techniques using Weka Software
MBA 2010-2012
IT FOR BUSINESS INTELLIGENCE – TERM PAPER
INSTRUCTOR – PROF. PRITHWIS MUKERJEE
SUBMITTED BY SATHISHWARAN.R
10BM60079 MBA 2010-2012
Data Mining using WEKA
Table of Contents
1. INTRODUCTION
2. CLASSIFICATION
2.1 DATA
2.2 SCREENS
2.3 OUTPUT
2.4 INTERPRETATION
3. ASSOCIATION RULES
3.1 DATA
3.2 SCREENS
3.3 OUTPUT
3.4 INTERPRETATION
4. REFERENCES
1. INTRODUCTION
The widespread use of computers has made life easier for business executives. However, it has
also led to a proliferation of data, and the sheer amount generated in the world today makes it
difficult to extract meaning from it and to make decisions. Data mining is one approach that
identifies patterns in data and supports decision making by analysing this huge ocean of data.
Weka (Waikato Environment for Knowledge Analysis) is free software developed at the University
of Waikato in New Zealand and is available under the GNU General Public License. The software
can be used for research, education and applications. It has a graphical user interface and a
comprehensive set of tools for analysing data. In this paper I have worked on data mining
techniques using the Weka software.
2. CLASSIFICATION
2.1 Data
The raw data used for this analysis was obtained from the website http://tunedit.org/ and was
originally gathered from census data. The 14 original attributes (features) include age, work
class, education, marital status, occupation, native country, etc.; the 15th attribute in the
file is the class label. The data contains continuous, binary and categorical features. I have
used it for a two-class classification problem. The task is to discover high-revenue people from
the census data and to check, using cross-validation, how accurately the instances are classified.
Link: http://tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff
2.2 Screens
Step 1: Launch Weka
Step 2: Click Explorer
Step 3: Click Open file
Step 4: The data is loaded into Weka
Step 5: On the Classify tab, choose the DecisionTable classifier (under rules), select Cross-validation as the test option and click Start
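The cross-validation selected in the last step can be illustrated with a small sketch. It is only an illustration of stratified 10-fold splitting, not Weka's own code: the function name `stratified_folds`, the seed and the round-robin assignment are assumptions made for this example, and Weka may shuffle and assign instances differently. The class counts (3118 instances of class -1 and 1029 of class 1) are the row totals of the confusion matrix in section 2.3.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each instance index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Class counts taken from the confusion matrix rows: 3118 of class -1, 1029 of class 1.
labels = [-1] * 3118 + [1] * 1029
folds = stratified_folds(labels, k=10)

for test_fold in folds:
    # Each fold is used once as the test set; the other nine folds form the training set.
    train_size = len(labels) - len(test_fold)
    positives = sum(1 for i in test_fold if labels[i] == 1)
    print(len(test_fold), train_size, positives)
```

Every instance is tested exactly once, and each test fold holds roughly the same share of class 1 instances as the full data set, which is what "stratified" means in the output heading.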
2.3 Output
Cross-validation
=== Run information ===
Scheme: weka.classifiers.rules.DecisionTable -X 1 -S "weka.attributeSelection.BestFirst -D 1 -N 5"
Relation: ADA_Prior
Instances: 4147
Attributes: 15
age
workclass
fnlwgt
education
educationNum
maritalStatus
occupation
relationship
race
sex
capitalGain
capitalLoss
hoursPerWeek
nativeCountry
label
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Decision Table:
Number of training instances: 4147
Number of Rules: 130
Non matches covered by Majority class.
Best first.
Start set: no attributes
Search direction: forward
Stale search after 5 node expansions
Total number of subsets evaluated: 96
Merit of best subset found: 83.82
Evaluation (for feature selection): CV (leave one out)
Feature set: 5, 8,11,12,15
Time taken to build model: 0.98 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances        3461      83.4579 %
Incorrectly Classified Instances       686      16.5421 %
Kappa statistic                          0.5073
Mean absolute error                      0.2353
Root mean squared error                  0.339
Relative absolute error                 63.0518 %
Root relative squared error             78.4907 %
Total Number of Instances             4147
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.939    0.483    0.855      0.939   0.895      0.873     -1
               0.517    0.061    0.738      0.517   0.608      0.873     1
Weighted Avg.  0.835    0.378    0.826      0.835   0.824      0.873
=== Confusion Matrix ===
    a    b   <-- classified as
 2929  189 |    a = -1
  497  532 |    b = 1
2.4 Interpretation
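83.46 % of the instances (3461 of 4147) are classified correctly and 16.54 % (686) are
classified incorrectly under 10-fold cross-validation.
The kappa statistic is 0.5073. It measures the agreement between predicted and actual classes
beyond what chance alone would give and is commonly read as moderate agreement; it is not an
accuracy percentage.
The mean absolute error is 0.2353 and the root mean squared error is 0.339.
The confusion matrix shows that class -1 is recognised far better (TP rate 0.939) than class 1
(TP rate 0.517): 497 of the 1029 class 1 instances are misclassified as -1.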
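The summary figures can be checked directly from the confusion matrix. The sketch below is a plain recomputation written for this paper, not part of Weka; the variable names are chosen for illustration, and the kappa formula used is the standard Cohen's kappa (observed agreement against chance agreement).

```python
# Confusion matrix from section 2.3: rows are actual classes, columns are predicted classes.
matrix = [[2929, 189],   # actual -1
          [497, 532]]    # actual  1

total = sum(sum(row) for row in matrix)
correct = matrix[0][0] + matrix[1][1]
accuracy = correct / total

# Cohen's kappa compares observed agreement with the agreement expected by chance.
row_totals = [sum(row) for row in matrix]
col_totals = [matrix[0][j] + matrix[1][j] for j in range(2)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

# Recall (TP rate) and precision for class 1, the minority class.
recall_pos = matrix[1][1] / row_totals[1]
precision_pos = matrix[1][1] / col_totals[1]

print(f"accuracy  = {accuracy:.4%}")
print(f"kappa     = {kappa:.4f}")
print(f"recall(1) = {recall_pos:.3f}, precision(1) = {precision_pos:.3f}")
```

The results should agree with the accuracy, kappa, TP rate and precision that Weka reports for class 1, which confirms that kappa is derived from the matrix and is a different quantity from the percentage of correct classifications.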
3. ASSOCIATION RULES
3.1 Data
The data set includes the votes of each U.S. House of Representatives Congressman on the 16
key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different
types of votes: voted for, paired for, and announced for (these three simplified to yea), voted
against, paired against, and announced against (these three simplified to nay), voted present,
voted present to avoid conflict of interest, and did not vote or otherwise make a position known
(these three simplified to an unknown disposition).
Number of Instances: 435 (267 democrats, 168 republicans)
Number of Attributes: 16 + class name = 17 (all Boolean valued)
Attribute Information:
Class Name: 2 (democrat, republican)
handicapped-infants: 2 (y,n)
water-project-cost-sharing: 2 (y,n)
adoption-of-the-budget-resolution: 2 (y,n)
physician-fee-freeze: 2 (y,n)
el-salvador-aid: 2 (y,n)
religious-groups-in-schools: 2 (y,n)
anti-satellite-test-ban: 2 (y,n)
aid-to-nicaraguan-contras: 2 (y,n)
mx-missile: 2 (y,n)
immigration: 2 (y,n)
synfuels-corporation-cutback: 2 (y,n)
education-spending: 2 (y,n)
superfund-right-to-sue: 2 (y,n)
crime: 2 (y,n)
duty-free-exports: 2 (y,n)
export-administration-act-south-africa: 2 (y,n)
Link: http://tunedit.org/repo/UCI/vote.arff
3.2 Screens
Step 1: Launch Weka
Step 2: Click Explorer
Step 3: Click Open file… and choose respective file
Step 4: Click Associate and choose Apriori
Step 5: Click Start
3.3 Output
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: vote
Instances: 435
Attributes: 17
handicapped-infants
water-project-cost-sharing
adoption-of-the-budget-resolution
physician-fee-freeze
el-salvador-aid
religious-groups-in-schools
anti-satellite-test-ban
aid-to-nicaraguan-contras
mx-missile
immigration
synfuels-corporation-cutback
education-spending
superfund-right-to-sue
crime
duty-free-exports
export-administration-act-south-africa
Class
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 11
Generated sets of large itemsets:
Size of set of large itemsets L(1): 20
Size of set of large itemsets L(2): 17
Size of set of large itemsets L(3): 6
Size of set of large itemsets L(4): 1
Best rules found:
1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1)
2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198 conf:(1)
3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210 conf:(1)
4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201 conf:(1)
5. physician-fee-freeze=n 247 ==> Class=democrat 245 conf:(0.99)
6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197 conf:(0.99)
7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204 conf:(0.98)
8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198 conf:(0.98)
9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197 conf:(0.97)
10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210 conf:(0.96)
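In each rule, the number before ==> is the count of instances that match the left-hand side, and the number after the right-hand side is the count that match the whole rule. The confidence is the ratio of the two. The sketch below recomputes it for a few of the rules; the helper name `confidence` and the dictionary layout are illustrative choices, and the counts are copied from the list above.

```python
TOTAL_INSTANCES = 435  # number of instances reported in the run information

# rule number -> (instances matching the left-hand side, instances matching the whole rule)
rules = {
    1: (219, 219),
    3: (211, 210),
    5: (247, 245),
    7: (208, 204),
    10: (218, 210),
}

def confidence(antecedent_count, rule_count):
    """Share of instances matching the left-hand side that also match the right-hand side."""
    return rule_count / antecedent_count

for number, (antecedent_count, rule_count) in rules.items():
    conf = confidence(antecedent_count, rule_count)
    support = rule_count / TOTAL_INSTANCES  # share of all instances covered by the rule
    print(f"rule {number}: confidence = {conf:.4f}, support = {support:.4f}")
```

Rule 3 is a useful case: 210 of 211 is slightly below 1, so the conf:(1) shown for it is a value rounded to two decimals, whereas rule 1 holds for every matching instance.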
3.4 Interpretation
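The output shows the association rules generated by the Apriori algorithm. With a minimum support
of 0.45 (196 instances) and a minimum confidence of 0.9, Weka reports its 10 best rules. Six of
them have Class=democrat as the consequent, so they describe voting patterns typical of democrats.
Rule 1, for example, says that all 219 congressmen who voted yes on the adoption of the budget
resolution and no on the physician fee freeze are democrats. Rule 5 shows that a no vote on the
physician fee freeze alone identifies a democrat in 245 of 247 cases (confidence 0.99).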
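The minimum support of 0.45 is not entered by hand. The scheme line in section 3.3 sets an upper bound of 1.0 (-U), a step of 0.05 (-D) and a lower bound of 0.1 (-M), and the output reports 11 cycles. The sketch below is a reading of those numbers, assuming that support is lowered by one step per cycle until enough rules are found; the function name `support_schedule` is made up for this example and the code is not taken from Weka.

```python
import math

def support_schedule(upper=1.0, delta=0.05, lower=0.1, cycles=11):
    """List the minimum-support thresholds tried when support is lowered step by step."""
    thresholds = []
    for step in range(1, cycles + 1):
        value = round(upper - step * delta, 10)  # rounding avoids floating-point drift
        if value < lower:
            break  # never go below the lower bound
        thresholds.append(value)
    return thresholds

schedule = support_schedule()
final_support = schedule[-1]

# Convert the relative support into a whole number of instances (rounded up here).
min_instances = math.ceil(final_support * 435)

print(schedule)
print(final_support, min_instances)
```

Under this assumption the eleventh threshold is 0.45, which corresponds to 196 of the 435 instances, in line with the "Minimum support: 0.45 (196 instances)" line of the output.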
4. REFERENCES:
Book: Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques
http://www.cs.waikato.ac.nz/ml/weka/
http://www.tunedit.org/repo/Data/Agnostic-vs-Prior/Training/ada_prior_train.arff
http://tunedit.org/repo/UCI/vote.arff