10bm60027: weka data mining techniques

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

Weka: Data Mining

IT for Business Intelligence Course

Term paper on using some of the data mining techniques with Weka tool

Submitted by: Gaurav Arora 10BM60027

MBA 2010-2012

Table of Contents

About Weka
Features
Data Used
  For Regression
  For Clustering
Regression Analysis
Cluster Analysis

About Weka

Weka (Waikato Environment for Knowledge Analysis) is a tool developed at the University of Waikato in New Zealand, originally for the purpose of identifying information from raw data gathered from agricultural domains. Weka supports many data mining tasks, such as data preprocessing, classification, regression, clustering, visualization and feature selection. The premise of the application is to derive useful information, in the form of trends and patterns, from raw data. It is an open source application that is freely available under the GNU General Public License. It was originally written in C, was later completely rewritten in Java, and is now compatible with almost every computing platform. It has a user-friendly graphical interface that allows for quick setup and operation.

Attribute-Relation File Format (ARFF) is the plain-text file format used by Weka to store a data set.
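As an illustration of the format, an ARFF file is plain text: a header names the relation and declares each attribute, and a @data section then holds one comma-separated row per instance. The minimal Python sketch below builds such a file as a string. The relation name, the two attributes chosen and the two data rows are hypothetical and serve only to show the layout.

```python
# Minimal sketch: build the text of a small ARFF file.
# The relation name, attributes and rows are hypothetical examples.

def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    rows: list of tuples, one value per attribute.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, list):
            # Nominal attribute: allowed values are listed in braces.
            kind = "{" + ",".join(kind) + "}"
        lines.append("@attribute " + name + " " + kind)
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "survey_example",
    [("internet_hrs", "numeric"), ("internet_home", ["Yes", "No"])],
    [(2.5, "Yes"), (1.0, "No")],
)
print(arff_text)
```

A file written this way can be opened from the Preprocess tab in the same way as any other ARFF file.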

Features

There are four options available on the initial screen:

Simple CLI: gives users who do not want the graphical interface the ability to execute commands from a terminal window
Explorer: the graphical interface used to conduct experimentation on raw data
Experimenter: allows users to conduct different experimental variations on data sets and perform statistical manipulation
Knowledge Flow: same functionality as Explorer but with a drag-and-drop interface. The advantage of this option is that it supports incremental learning from previous results

Main tabs provided in Explorer are:

Preprocess: used to choose the data file to be used by the application
Classify: used to train and test different learning schemes on the preprocessed data file under experimentation
Cluster: used to apply different tools that identify clusters within the data file
Associate: used to apply different rules to the data file that identify associations within the data
Select attributes: used to apply different rules to reveal changes based on the inclusion or exclusion of selected attributes from the experiment
Visualize: used to see what the various manipulations produced on the data set in 2D format, as scatter plot and bar graph output

Data Used

For Regression:

Data from an open source, the World Bank website (http://data.worldbank.org/country/india), is used to perform the regression analysis.

This data contains various variables linked to the GDP of India. A total of 21 records of data from Year 1999 to 2009 are used. All data are in current U.S. dollars.

Exports_of_goods_and_services: Exports of goods and services comprise all transactions between residents of a country and the rest of the world involving general merchandise, goods sent for processing and repairs, nonmonetary gold, and services.
Imports_of_goods_and_services: Imports of goods, services and income is the sum of goods (merchandise) imports, imports of (nonfactor) services and income (factor) payments.
Agriculture_value_added: Agriculture includes forestry, hunting, and fishing, as well as cultivation of crops and livestock production. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
Manufacturing_value_added: Manufacturing refers to industries belonging to ISIC divisions 15-37. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
Industry_value_added: Industry includes manufacturing (ISIC divisions 15-37). It comprises value added in mining, manufacturing (also reported as a separate subgroup), construction, electricity, water, and gas. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
Services_value_added: Services includes value added in wholesale and retail trade (including hotels and restaurants), transport, and government, financial, professional, and personal services such as education, health care, and real estate services. Also included are imputed bank service charges, import duties, and any statistical discrepancies noted by national compilers as well as discrepancies arising from rescaling. Value added is the net output of a sector after adding up all outputs and subtracting intermediate inputs.
GDP: GDP at purchaser's prices is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products.
Gross_savings: Gross savings are calculated as gross national income less total consumption, plus net transfers.

For Clustering:

A second data set was obtained from a survey of mobile subscribers to understand their SMS/GPRS pack usage and spending pattern (done for Aircel as part of a Summer Project). A total of 221 records are used.

The variables and the corresponding questions asked in the survey:

internet_hrs: Daily hours spent on Internet through your Mobile?

internet_home: Do you have Internet connection at home?

income: Your monthly Income level?

occupation: Your Occupation?

age: Your Age?

travel_hrs: Daily time spent travelling/on the move?

sms_pack: Which SMS Pack you use/prefer to use?

sms_bill: How much you spend for SMS Packs monthly?

gprs_pack: Which GPRS Pack you use/prefer to use?

gprs_bill: How much you spend for GPRS Packs monthly?

month_bill: How much is your monthly mobile bill around?

Regression Analysis

The regression model is used to predict the value of an unknown dependent variable (GDP), given the values of the independent variables. We take a number of independent variables together and find their relation to our dependent variable (GDP).

Weka Run information

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: GDP_regression-weka.filters.unsupervised.attribute.Remove-R1,11-weka.filters.unsupervised.attribute.Remove-R1-2
Instances: 21
Attributes: 8
Test mode: split 66.0% train, remainder test
Classifier model (full training set)

This way, 14 entries are used to create the model and the remaining 7 to test its validity.
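The split sizes follow from simple arithmetic: 66% of 21 instances is 13.86, which gives 14 training instances and leaves 7 for testing. A small sketch of that calculation, assuming the training count is rounded to the nearest whole instance:

```python
# Sketch: sizes of a 66% percentage split on 21 instances,
# assuming the training count is rounded to the nearest integer.
total_instances = 21
train_percent = 66.0

train_size = round(total_instances * train_percent / 100)
test_size = total_instances - train_size
print(train_size, test_size)
```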

Linear Regression Model developed

GDP =
     0.4179 * Exports_of_goods_and_services +
    -0.3048 * Imports_of_goods_and_services +
     1.3542 * Agriculture_value_added +
     0.6773 * Industry_value_added +
     0.9744 * Services_value_added +
     0.464  * Gross_savings +
    -1672945075143.6054
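To show how the fitted equation above is applied, the following sketch evaluates it for one set of inputs. The coefficients and the intercept are copied from the Weka output. The function name and the input values are hypothetical and chosen only for illustration; they are not taken from the World Bank data.

```python
# Sketch: evaluate the fitted linear model for one set of inputs.
# Coefficients and intercept are copied from the Weka output above;
# the input values used at the bottom are hypothetical.

COEFFICIENTS = {
    "Exports_of_goods_and_services": 0.4179,
    "Imports_of_goods_and_services": -0.3048,
    "Agriculture_value_added": 1.3542,
    "Industry_value_added": 0.6773,
    "Services_value_added": 0.9744,
    "Gross_savings": 0.464,
}
INTERCEPT = -1672945075143.6054


def predict_gdp(inputs):
    """Return intercept + sum(coefficient * value) over the six predictors."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * inputs[name] for name in COEFFICIENTS
    )


# Hypothetical inputs: every predictor set to 1e12 (illustrative only).
example_inputs = {name: 1.0e12 for name in COEFFICIENTS}
print(predict_gdp(example_inputs))
```

Because the model is linear, raising one input by one unit while holding the others fixed changes the prediction by exactly that input's coefficient.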

From the above model we can draw the following conclusions:

GDP is related to six of the seven independent variables taken for study; Manufacturing_value_added does not appear in the fitted equation
Services, Agriculture and Industry value added all contribute positively to GDP
An increase in Exports or Gross Savings also raises the GDP value
Imports, on the other hand, are inversely related and decrease the GDP value
Agriculture has the largest coefficient, followed by Services and then Industry

Predictions on test split

inst#   actual            predicted             error
1       32422100000000    32059372648333.712    -362727351666.289
2       5696240000000     5753533243202.538     57293243202.537
3       27546200000000    27612724209197.072    66524209197.07
4       36924900000000    36784939417643.64     -139960582356.359
5       49864300000000    49276857202715.168    -587442797284.828
6       17512000000000    17546685222165.234    34685222165.234
7       55826200000000    57273793936650.44     1447593936650.438

Evaluation on test split

Correlation coefficient         0.9994
Mean absolute error             385175334646.108
Root mean squared error         609530075729.5292
Relative absolute error         2.5776 %
Root relative squared error     3.4314 %
Total Number of Instances       7
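The mean absolute error and root mean squared error reported here can be reproduced from the error column of the predictions on the test split. The sketch below assumes the usual definitions: the mean of the absolute errors, and the square root of the mean of the squared errors.

```python
import math

# Errors (predicted - actual) copied from the predictions on the test split.
errors = [
    -362727351666.289,
    57293243202.537,
    66524209197.07,
    -139960582356.359,
    -587442797284.828,
    34685222165.234,
    1447593936650.438,
]

# Mean absolute error: average magnitude of the errors.
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)

# Root mean squared error: penalizes large errors (such as instance 7) more.
root_mean_squared_error = math.sqrt(sum(e * e for e in errors) / len(errors))

print(mean_absolute_error)
print(root_mean_squared_error)
```

The root mean squared error is noticeably larger than the mean absolute error because the single large error on instance 7 dominates the squared terms.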

From the results we can say:

As the correlation coefficient is high (0.9994), the model is accurate and gives a good estimate of India’s GDP
The small root relative squared error (3.4314 %) also signifies the model's accuracy

Cluster Analysis

Cluster analysis allows us to form groups within the data, which can be useful for many marketing applications such as segmentation and new product launches. Thus, from this survey data we can find SMS/GPRS usage patterns and design our future products according to market needs.

Weka Run information

Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: Aircel_RAW-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R7-weka.filters.unsupervised.attribute.Remove-R3
Instances: 221
Attributes: 11
Test mode: evaluate on training data

Clustering model (full training set)

kMeans clustering to find total clusters possible

Number of iterations: 4
Within cluster sum of squared errors: 571.5049648606962

Cluster centroids

Attribute       Full Data (221)    Cluster#0 (84)     Cluster#1 (72)      Cluster#2 (65)
internet_hrs    2.2527             1.764              2.4954              2.6154
internet_home   Yes                Yes                Yes                 No
income          Nil                Nil                20K-40K             Nil
occupation      Student (UG/PG)    Student (UG/PG)    Service/Employee    Student (UG/PG)
age             24.1357            22.7143            27.6111             22.1231
travel_hrs      2.6371             2.7488             2.6389              2.4909
sms_pack        Monthly            Monthly            None                Monthly
sms_bill        30-50              30-50              Nil                 30-50
gprs_pack       Monthly            None               Monthly             Monthly
gprs_bill       >80                Nil                >80                 >80
month_bill      379.8643           310.7143           469.4444            370

Model and evaluation on training set

Cluster    Instances
0          84 (38%)
1          72 (33%)
2          65 (29%)
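The k-means run above partitions the instances by repeatedly assigning each one to its nearest centroid and then recomputing each centroid from its members, until no assignment changes. The sketch below shows that loop for numeric attributes only, on hypothetical two-dimensional points. It is not Weka's implementation, which also handles nominal attributes such as sms_pack; the function names, the sample points and the starting centroids are assumptions made for illustration.

```python
# Sketch of the k-means loop (assign points, update centroids) on
# numeric data. Points and starting centroids are hypothetical.

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iterations=500):
    """Return (centroids, assignments, within-cluster sum of squared errors)."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    sse = sum(squared_distance(p, centroids[a])
              for p, a in zip(points, assignments))
    return centroids, assignments, sse


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignments, sse = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments, sse)
```

The returned sum of squared errors plays the same role as the within cluster sum of squared errors in the Weka output: lower values mean the instances sit closer to their centroids.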

Clusters Explanation

Cluster0:

This segment comprises students with an average age of around 23 years
They don’t have any income
Have an Internet connection at home
Avid users of SMS who spend Rs.30-50 per month on SMS packs
Spend quite some time travelling each day
Don’t prefer to use GPRS packs for Internet surfing and have lower-than-average mobile Internet use of about 1.8 hrs per day
Monthly bill of Rs.300+

Cluster1:

This segment comprises working professionals with an average age of around 28 years
They have a monthly income in the range of 20K-40K, with Internet at home
Prefer to use the Internet over GPRS and take a monthly unlimited pack, paying more than Rs.80
Don’t use any particular SMS pack and have overall low usage of this VAS
Spend quite some time travelling each day
Above-average mobile Internet use of about 2.5 hrs per day
Monthly bill of Rs.470 is considerably higher than the overall average of Rs.380

Cluster2:

This segment comprises students with an average age of around 22 years
They don’t have any income
No Internet connection at home
Use both GPRS and SMS packs and spend Rs.130+ on these
Monthly bill of Rs.370 is around the same as the overall average of Rs.380

Other points to be noted:

We can see that Cluster 0, comprising students, prefers to communicate with friends through messages and does not use an Internet pack because of the Internet facility at home
For Cluster 1, the young working people, a GPRS pack is a must as they need to keep updated on emails and social networking. They don’t want to spend on an SMS pack separately
We could come up with a GPRS and SMS combo pack which would cater to both these clusters
The pricing of this combo should be under the total Rs.130 mark to show the end customer the utility of having both facilities monthly with no usage limit
Separate limited SMS/Internet hour cards could also be introduced at a price in the range of Rs.60-Rs.80, making them attractive to both segments and useful for Cluster 2 as well
Even some free Internet hours with SMS packs, and vice versa with GPRS packs, can increase trial of these services
