data mining on weka

IT & BUSINESS INTELLIGENCE DATA MINING ON WEKA SATYAM KHATRI (10BM60081) MBA, VGSOM IIT KHARAGPUR

Upload: satyamkhatri

Post on 10-May-2015

Page 1: DATA MINING on WEKA

IT & BUSINESS INTELLIGENCE

DATA MINING

ON

WEKA

SATYAM KHATRI

(10BM60081)

MBA, VGSOM

IIT KHARAGPUR

WEKA

WEKA is an open-source collection of data mining and machine learning algorithms. It was created by researchers at the University of Waikato in New Zealand and is written in Java. WEKA is used for data pre-processing, classification, clustering and association rule extraction.

Its main features are as follows:

49 data preprocessing tools

76 classification/regression algorithms

8 clustering algorithms

15 attribute/subset evaluators + 10 search algorithms for feature selection.

3 algorithms for finding association rules

3 graphical user interfaces

“The Explorer” (exploratory data analysis)

“The Experimenter” (experimental environment)

“The Knowledge Flow” (new process model inspired interface)

WEKA FUNCTIONS AND TOOLS

Preprocessing Filters

Attribute selection

Classification/Regression

Clustering

Association discovery

Visualization

DOWNLOAD INSTRUCTIONS

Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/

Choose a self-extracting executable (including Java VM)

If you are interested in modifying or extending WEKA, there is a developer version that includes the source code.

WEKA DATA FORMATS

Data can be imported from a file in various formats such as ARFF, CSV and C4.5. Data can also be read from a URL or from an SQL database (using JDBC).
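To make the ARFF format concrete, here is a minimal Python sketch that converts CSV text into ARFF text. It is only an illustration: the function name `csv_to_arff`, the relation name and the sample rows are hypothetical, and the rule used to decide between numeric and nominal attributes (a column is numeric if every value parses as a number) is a simplifying assumption, not WEKA's own importer.

```python
import csv
import io


def csv_to_arff(csv_text, relation="investors"):
    """Convert CSV text (header row first) into ARFF text.

    Assumption: a column is numeric if every value parses as a float;
    otherwise it is declared nominal with its distinct values as labels.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            labels = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{labels}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, not taken from the case data.
sample = """age,marital_status,income
44,single,30000
49,married,39000
"""
arff = csv_to_arff(sample)
print(arff)
```

The resulting text has a header section (`@relation`, one `@attribute` line per column) followed by the `@data` section with one instance per line.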

CLUSTERING

A cluster, by definition, is a group of similar objects. There could be clusters of people, brands or other objects. If clusters are formed of customers similar to one another, then cluster analysis can help marketers identify segments (clusters). If clusters of brands are formed, this can be used to gain insights into brands that are perceived as similar to each other on a set of attributes. Cluster analysis is hence used for customer segmentation. It is best performed when the variables are interval or ratio-scaled.

There are two major classes of cluster analysis techniques:

hierarchical

non-hierarchical

HIERARCHICAL CLUSTERING

Some measure of distance is used to compute the distances between all pairs of objects to be clustered. One popular distance measure is the Euclidean distance; another is the squared Euclidean distance. We begin with all objects in separate clusters. Say we have ten objects in separate clusters. The two closest objects are joined to form a cluster, while the remaining 8 objects remain separate. This is stage 1 of hierarchical clustering.
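Stage 1 described above can be sketched in a few lines of Python. This is an illustrative sketch only: the ten 2-D points are hypothetical, and the distance between two clusters is taken as the distance between their closest members (single linkage), which is an assumption since the text does not name a linkage rule.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_pair(clusters):
    """Return indices (i, j) of the two closest clusters (single linkage, assumed)."""
    def linkage(c1, c2):
        return min(euclidean(p, q) for p in c1 for q in c2)

    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
    )


def merge_once(clusters):
    """Perform one agglomeration stage: join the two closest clusters."""
    i, j = closest_pair(clusters)
    merged = clusters[i] + clusters[j]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Ten hypothetical objects, each starting in its own cluster.
points = [(0, 0), (0, 1), (5, 5), (9, 9), (2, 7),
          (7, 2), (4, 0), (0, 4), (8, 5), (5, 8)]
clusters = [[p] for p in points]
stage1 = merge_once(clusters)
print(len(stage1), stage1[-1])
```

Calling `merge_once` repeatedly on its own output continues the hierarchy until a single cluster remains.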

NON HIERARCHICAL CLUSTERING

The most common non-hierarchical methods are the k-means clustering methods. With these, we need to specify the number of clusters we want the objects to be clustered into. This can be done if we have a hypothesis that the objects will group into a certain number of clusters. Alternatively, we can first do a hierarchical clustering on the data, find the approximate number of clusters, and then perform a k-means clustering.

IMPLEMENTATION METHODS

k-means

EM

Cobweb

X-means

Farthest First

CLUSTERING ON WEKA

PROBLEM CASE

An Asset Management Company (AMC) wants to launch a new mutual fund scheme. The AMC wants to segment the target market so that it can raise funds more easily by using different marketing strategies for different segments.

The AMC segments the target market on the basis of the following parameters:

1. Investor’s Age

2. Marital status

3. Investor’s Monthly income

4. Region of Residence

5. Investment in Derivatives

6. Investment in Equities

7. Investment in Fixed deposits

8. Investment in Gold

9. Existing number of Mutual fund schemes

10. Existing loans

Data is collected from the public based on the above parameters, and clustering is performed on it.

WEKA Explorer interface

Processing on parameter “Investment in Gold”

Processing on parameter “Existing number of Mutual fund schemes”

Processing on parameter “Existing loans”

Processing on parameter “Investor’s Age”

Processing on parameter “Investment in Fixed deposits”

Processing on parameter “Marital status”

Processing on parameter “Region of Residence”

Processing on parameter “Investor’s Monthly income”

Processing on parameter “Investment in Derivatives”

Visualization of the entire dataset

To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This brings up a drop-down list of the available clustering algorithms. In this case we select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window with the algorithm's options. The k-means clustering here divides the data into 4 cluster groups.

WEKA's SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters. In the pop-up window we enter 4 as the number of clusters (instead of the default value of 2) and leave the value of "seed" as is. The seed value is used to generate a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, k-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results.
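The effect of the seed can be illustrated with a small stand-alone k-means in Python. This is a sketch under stated assumptions, not WEKA's implementation: the function `k_means`, the toy points and the choice of seeds 1 to 5 are all hypothetical, and the seed is used here to pick the starting centroids from the data. Runs are compared by the within-cluster sum of squared errors (SSE), where lower means tighter clusters.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed, max_iter=100):
    """Basic k-means; returns (centroids, assignments, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed decides the starting centroids
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    sse = sum(euclidean(p, centroids[a]) ** 2 for p, a in zip(points, assignments))
    return centroids, assignments, sse


# Hypothetical 2-D toy data, not the investor data set.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9),
          (1, 8), (2, 9), (8, 1), (9, 2), (5, 5), (5, 6)]
results = {seed: k_means(points, k=4, seed=seed) for seed in range(1, 6)}
best_seed = min(results, key=lambda s: results[s][2])
for seed, (_, _, sse) in results.items():
    print(seed, round(sse, 2))
print("lowest SSE with seed", best_seed)
```

Different seeds may converge to different clusterings, which is why it helps to run several and keep the one with the lowest SSE.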

Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the

"Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the

result set in the "Result list" panel and view the results of clustering in a separate window.

CLUSTERING RESULTS

Clusters can be visualized as shown below

CLUSTER 1

It consists of people with an average age of 44 years, mostly male, who live in towns, have an average monthly income of 30000, are mostly single, invest in equities, fixed deposits and gold, do not invest in derivatives, and have existing loans.

CLUSTER 2

It consists of people with an average age of 49 years, mostly male, who live in towns, have an average monthly income of 39000, are mostly married, invest in equities, fixed deposits and gold, do not invest in derivatives, and have existing loans.

CLUSTER 3

It consists of people with an average age of 39 years, mostly male, who live in cities, have an average monthly income of 24000, are mostly married, invest in gold and derivatives, do not invest in equities or fixed deposits, and have existing loans.

CLUSTER 4

It consists of people with an average age of 40 years, mostly female, who live in cities, have an average monthly income of 25000, are mostly married, invest in equities and fixed deposits, do not invest in derivatives or gold, and have existing loans.

CLASSIFICATION VIA DECISION TREES IN WEKA

PROBLEM CASE

A market research firm wants to model people's investment decisions in various types of securities on the basis of the following parameters: Investor’s Age, Marital status, Investor’s Monthly income, Region of Residence, Investment in Derivatives, Investment in Equities, Investment in Fixed deposits, Investment in Gold, Investment in Mutual funds and Existing loans. Based on this model, an entity's investment decision in a particular type of security can be predicted if the other parameters about that entity are known.

Data is collected from the public on the above parameters, and classification is performed on it.

Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. Note that J48 (WEKA's implementation of the C4.5 algorithm) does not require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 evolved. Now we can specify the various parameters by clicking in the text box to the right of the "Choose" button. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform reduced-error pruning.
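The reason a C4.5-style learner can work without discretization is that it searches for a binary split point on each numeric attribute. The Python sketch below illustrates that idea under simplifying assumptions: it scores candidate thresholds by plain information gain and places each threshold at the midpoint between adjacent distinct values. The function names and the toy ages and labels are hypothetical, and this is not the J48 code itself.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no boundary between equal values
        threshold = (low + high) / 2  # midpoint: a simplifying assumption
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: investor age and whether they invest in equities.
ages = [25, 30, 35, 40, 45, 50]
invests = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(ages, invests)
print(threshold, gain)
```

On this toy data the split falls between ages 35 and 40, separating the two classes completely.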

Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is completed. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.
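The idea behind 10-fold cross-validation can be sketched as follows: the instances are divided into ten folds, and each fold is used once as the test set while the other nine form the training set. This Python sketch only produces the index splits; the function `k_fold_indices`, the seed and the instance count of 25 are hypothetical, and unlike a stratified scheme it does not balance the class distribution across folds.

```python
import random


def k_fold_indices(n, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of k folds over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n=25, k=10))
for train, test in splits:
    print(len(train), len(test))
```

A model would be trained on each `train` list and scored on the matching `test` list, and the ten scores averaged to estimate accuracy.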

We can also use our model to classify the new instances. In the main panel, under "Test options" click the

"Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows

you to open the file containing test instances.

This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in order to predict the value of an attribute. Note that the summary of the results in the right panel does not show any statistics, since the actual values for the new instances are not known.

WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.

Note that by resizing the window and selecting various menu items from inside the tree view (using the

right mouse button), we can adjust the tree view to make it more readable.