Appendix: The WEKA Data Mining Software

http://www.cs.waikato.ac.nz/ml/weka/

WEKA: Introduction

WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato, New Zealand.
History: first version (version 2.1), 1996; version 2.3, 1998; version 3.0, 1999; version 3.4, 2003; version 3.6, 2008.
WEKA provides a collection of data mining and machine learning algorithms and preprocessing tools.
  It includes algorithms for regression, classification, clustering, association rule mining and attribute selection.
  It also has data visualization facilities.
WEKA is an environment for comparing learning algorithms.
Researchers can implement new data mining algorithms and add them to WEKA.
WEKA is among the best-known open-source data mining software packages.

WEKA: Introduction

WEKA is written in Java.
  WEKA 3.4 consists of 271,477 lines of code.
  WEKA 3.6 consists of 509,903 lines of code.
It runs on Windows, Linux and Macintosh.
Users can access its components through Java programming or through a command-line interface.
It provides three main graphical user interfaces: Explorer, Experimenter and Knowledge Flow.
The easiest way to use WEKA is through Explorer, the main graphical user interface.
Data can be loaded from various sources, including files, URLs and databases.
  Database access is provided through Java Database Connectivity (JDBC).

WEKA data format

WEKA stores data in flat files (ARFF format).
It is easy to convert an Excel file to ARFF format.
An ARFF file consists of a list of instances.
An ARFF file is plain text, so it can be created with Notepad, or with Word if the file is saved as plain text.
  The name of the dataset is declared with @relation.
  Attribute information is declared with @attribute.
  The data follows the @data line.
Besides ARFF, WEKA accepts CSV, LibSVM and C4.5 formats.
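The conversion from a spreadsheet can be sketched as follows, assuming the Excel sheet has first been exported as CSV with a header row. The function name `csv_to_arff` and the type-guessing rule (a column is `real` if every value parses as a number, otherwise nominal) are illustrative assumptions, not WEKA's own converter.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text.

    Assumption: attribute names and values contain no spaces or commas,
    so no quoting is needed. Nominal labels are taken from the values
    that actually occur in the data, in first-seen order.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation]
    for col, name in enumerate(header):
        column = [row[col] for row in data]
        if all(_is_number(value) for value in column):
            lines.append("@attribute " + name + " real")
        else:
            labels = list(dict.fromkeys(column))  # distinct values, order kept
            lines.append("@attribute " + name + " {" + ", ".join(labels) + "}")
    lines.append("@data")
    lines.extend(", ".join(row) for row in data)
    return "\n".join(lines) + "\n"


SAMPLE_CSV = """outlook,temperature,windy,play
sunny,85,FALSE,no
overcast,83,FALSE,yes
"""

if __name__ == "__main__":
    print(csv_to_arff(SAMPLE_CSV, "weather"))
```

Because the labels come only from the data, a nominal value that never occurs in the CSV (for example `rainy` here) is missing from the generated declaration and would have to be added by hand.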

WEKA ARFF format

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
…
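To make the three sections of the file concrete, here is a minimal sketch of a reader for this kind of ARFF file. It is not WEKA's loader: the function name `parse_arff` is hypothetical, and it handles only what the weather example uses (nominal and numeric attributes, dense rows, no quoting, no missing values).

```python
def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, rows).

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of nominal labels.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # '%' starts an ARFF comment
            continue
        lower = line.lower()
        if in_data:
            values = [value.strip() for value in line.split(",")]
            if len(values) != len(attributes):
                raise ValueError("wrong number of values: " + line)
            row = []
            for (name, kind), value in zip(attributes, values):
                if kind == "numeric":
                    row.append(float(value))
                elif value in kind:
                    row.append(value)
                else:
                    raise ValueError(value + " is not a label of " + name)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                labels = [label.strip() for label in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            elif spec.lower() in ("real", "numeric", "integer"):
                attributes.append((name, "numeric"))
            else:
                raise ValueError("unsupported attribute type: " + spec)
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


WEATHER_ARFF = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
rainy, 68, 80, FALSE, yes
"""

if __name__ == "__main__":
    relation, attributes, rows = parse_arff(WEATHER_ARFF)
    print(relation, len(attributes), "attributes,", len(rows), "instances")
```

The sample text contains only the five instances shown on the slide; the elided rows are left out rather than guessed.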

Explorer GUI

Consists of 6 panels, one for each data mining task: Preprocess, Classify, Cluster, Associate, Select Attributes, Visualize.
Preprocess: use WEKA’s data preprocessing tools (called “filters”) to transform the dataset in several ways.
  WEKA contains filters for discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
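As an illustration of what two of these filters compute, the sketch below applies min-max normalization and equal-width discretization to the five humidity values from the weather example. The function names and the choice of two bins are assumptions made for this sketch; WEKA's own filters offer more options and are not reproduced here.

```python
def normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


def discretize_equal_width(values, bins):
    """Replace each value by the index of its equal-width interval (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum would land in bin index `bins`, so clamp it into the last bin.
    return [min(int((value - lo) / width), bins - 1) for value in values]


HUMIDITY = [85, 90, 86, 96, 80]  # humidity column of the five weather instances

if __name__ == "__main__":
    print(normalize(HUMIDITY))                  # [0.3125, 0.625, 0.375, 1.0, 0.0]
    print(discretize_equal_width(HUMIDITY, 2))  # [0, 1, 0, 1, 0]
```

Discretization of this kind is what makes a numeric attribute usable by methods that accept only discrete data.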

Explorer (cont.)

Classify:
  Regression techniques (predictors of “continuous classes”):
    Linear regression
    Logistic regression
    Neural network
    Support vector machine
  Classification algorithms:
    Decision trees: ID3, C4.5 (called J48)
    Naïve Bayes, Bayes network
    k-nearest-neighbors
    Rule learners: Ripper, Prism
    Lazy rule learners
    Meta learners (bagging, boosting)

Clustering

Clustering algorithms:
  K-Means, X-Means, FarthestFirst
  Likelihood-based clustering: EM (Expectation-Maximization)
  Cobweb (incremental clustering algorithm)
Clusters can be visualized and compared to “true” clusters (if given).

Attribute Selection: provides access to various methods for measuring the utility of attributes and identifying the most important attributes in a dataset.
  Filter method: the attribute set is filtered to produce the most promising subset before learning begins.
  A wide range of filtering criteria is available, including correlation-based feature selection, the chi-square statistic, gain ratio, information gain and a support-vector-machine-based criterion.
  A variety of search methods: forward and backward selection, best-first search, genetic search and random search.
  PCA (principal component analysis) to reduce the dimensionality of a problem.
  Discretizing numeric attributes.
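One of the filtering criteria, information gain, can be sketched as follows. The sketch ranks two nominal attributes of the five weather instances by how much knowing the attribute reduces the entropy of the class `play`. Function names are hypothetical and this is not WEKA's evaluator code.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted class entropy within each attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Nominal columns of the five weather instances shown earlier.
OUTLOOK = ["sunny", "sunny", "overcast", "rainy", "rainy"]
WINDY = ["FALSE", "TRUE", "FALSE", "FALSE", "FALSE"]
PLAY = ["no", "no", "yes", "yes", "yes"]

GAINS = {
    "outlook": information_gain(OUTLOOK, PLAY),
    "windy": information_gain(WINDY, PLAY),
}
RANKED = sorted(GAINS, key=GAINS.get, reverse=True)  # most informative first

if __name__ == "__main__":
    for name in RANKED:
        print(name, round(GAINS[name], 3))
```

On these five instances `outlook` separates the classes completely, so its gain equals the full class entropy and it ranks above `windy`. With only five rows this says nothing about the complete dataset.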

Explorer (cont.)

Association rule mining
  Apriori algorithm
  Works only with discrete data
Visualization
  Scatter plots, ROC curves, trees, graphs
  WEKA can visualize single attributes (1-d) and pairs of attributes (2-d).
  Color-coded class values.
  “Zoom-in” function.

[Figure: Explorer GUI (Classify)]

WEKA Experimenter

This interface is designed to facilitate experimental comparisons of the performance of algorithms based on many different evaluation criteria.
Experiments can involve many algorithms that are run on multiple datasets.
  They can also iterate over different parameter settings.
Experiments can also be distributed across different computer nodes in a network.
Once an experiment has been set up, it can be saved in either XML or binary form so that it can be revisited.

Knowledge Flow Interface

The Explorer is designed for batch-based data processing: training data is loaded into memory and then processed.
However, WEKA also implements some incremental algorithms.
The Knowledge Flow interface can handle incremental updates. It can load and preprocess individual instances before feeding them into incremental learning algorithms.
Knowledge Flow also provides nodes for visualization and evaluation.
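The difference between batch and incremental processing can be illustrated with a learner whose model is only a set of counts, so it can be updated one instance at a time without keeping the training data in memory. The sketch below is a naive Bayes classifier over nominal attributes with add-one smoothing. The class name `IncrementalNaiveBayes` and its methods are hypothetical and do not correspond to WEKA's API.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)  # (attribute index, value, label) -> count
        self.seen_values = defaultdict(set)   # attribute index -> values seen so far

    def update(self, attributes, label):
        """Fold a single instance into the counts; the instance is not stored."""
        self.class_counts[label] += 1
        for index, value in enumerate(attributes):
            self.value_counts[(index, value, label)] += 1
            self.seen_values[index].add(value)

    def predict(self, attributes):
        """Return the label with the highest smoothed posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total  # class prior
            for index, value in enumerate(attributes):
                distinct = len(self.seen_values[index] | {value})
                hits = self.value_counts.get((index, value, label), 0)
                score *= (hits + 1) / (count + distinct)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# (outlook, windy) -> play, taken from the five weather instances shown earlier.
STREAM = [
    (("sunny", "FALSE"), "no"),
    (("sunny", "TRUE"), "no"),
    (("overcast", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
]

if __name__ == "__main__":
    model = IncrementalNaiveBayes()
    for attributes, label in STREAM:
        model.update(attributes, label)  # one instance at a time
    print(model.predict(("sunny", "TRUE")))
    print(model.predict(("rainy", "FALSE")))
```

The model can be queried at any point in the stream, which is the property an incremental pipeline relies on.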

Conclusions

Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
A set of extensions to WEKA has been developed, covering diverse areas such as text mining, visualization and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
WEKA 3.6 can read and write data in the format used by the well-known LibSVM and SVM-Light support vector machine implementations.
WEKA has 2 limitations:
  Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
  The Java implementation is somewhat slower than an equivalent in C/C++.
