
  • 8/13/2019 Appendix Weka

    Appendix: The WEKA Data Mining Software

    http://www.cs.waikato.ac.nz/ml/weka/


    WEKA: Introduction

    WEKA (Waikato Environment for Knowledge Analysis) was developed at the University of Waikato, New Zealand.

    History: 1st version (version 2.1, 1996); version 2.3, 1998; version 3.0, 1999; version 3.4, 2003; version 3.6, 2008.

    WEKA provides a collection of data mining and machine learning algorithms and preprocessing tools. It includes algorithms for regression, classification, clustering, association rule mining and attribute selection.

    It also has data visualization facilities.

    WEKA is an environment for comparing learning algorithms. With WEKA, researchers can implement new data mining algorithms and add them to WEKA.

    WEKA is the best-known open-source data mining software.


    WEKA: Introduction

    WEKA is written in Java. WEKA 3.4 consists of 271,477 lines of code; WEKA 3.6 consists of 509,903 lines of code.

    It runs on Windows, Linux and Macintosh.

    Users can access its components through Java programming or through a command-line interface. It provides three main graphical user interfaces: Explorer, Experimenter and Knowledge Flow.

    The easiest way to use WEKA is through the Explorer, the main graphical user interface.

    Data can be loaded from various sources, including files, URLs and databases. Database access is provided through Java Database Connectivity (JDBC).


    WEKA data format

    WEKA stores data in flat files (ARFF format).

    It is easy to transform an Excel file to ARFF format.

    An ARFF file consists of a list of instances.

    We can create an ARFF file using any text editor, such as Notepad or Word.

    The name of the dataset is declared with @relation.

    Attribute information is declared with @attribute.

    The data follows @data.

    Besides the ARFF format, WEKA also accepts the CSV, LibSVM and C4.5 formats.


    WEKA ARFF format

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny, 85, 85, FALSE, no
    sunny, 80, 90, TRUE, no
    overcast, 83, 86, FALSE, yes
    rainy, 70, 96, FALSE, yes
    rainy, 68, 80, FALSE, yes
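    An ARFF file like the one above is simple enough to parse by hand. The following is a hedged sketch in Python, not WEKA's own loader (WEKA reads ARFF through its Java converter classes); it only handles the plain declarations shown in this snippet.

```python
# Minimal sketch of an ARFF reader for the weather snippet above.
# Illustrative only; WEKA itself parses ARFF in Java.

def parse_arff(text):
    """Split an ARFF document into (relation, attribute names, data rows)."""
    relation, attributes, data, in_data = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):      # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif lower.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])   # attribute name
        elif lower.startswith('@data'):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return relation, attributes, data

arff = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
overcast, 83, 86, FALSE, yes
"""
relation, attrs, rows = parse_arff(arff)
print(relation, attrs, len(rows))
```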


    Explorer GUI

    Consists of six panels, one for each data mining task:
    Preprocess
    Classify
    Cluster
    Associate
    Select Attributes
    Visualize

    Preprocess: uses WEKA's data preprocessing tools (called filters) to transform the dataset in several ways.

    WEKA contains filters for discretization, normalization, resampling, attribute selection, and transforming and combining attributes.
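    Two of the filters named above, normalization and discretization, can be sketched in a few lines. These functions are illustrative stand-ins, not WEKA API (WEKA's versions live in the weka.filters.unsupervised.attribute package).

```python
# Hedged sketches of two preprocessing filters: min-max normalization
# and equal-width discretization. Function names are illustrative.

def normalize(values):
    """Scale numeric values into [0, 1] (the Normalize filter idea)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins=3):
    """Equal-width binning (the Discretize filter idea)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

temperature = [85, 80, 83, 70, 68]   # from the weather data above
print(normalize(temperature))
print(discretize(temperature))      # bin index per value
```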


    Explorer (cont.)

    Classify:
    Regression techniques (predictors of continuous classes):
    Linear regression
    Logistic regression
    Neural network
    Support vector machine

    Classification algorithms:
    Decision trees: ID3, C4.5 (called J48 in WEKA)
    Naive Bayes, Bayesian networks
    k-nearest neighbors
    Rule learners: RIPPER, Prism
    Lazy rule learners
    Meta learners (bagging, boosting)
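    One of the simplest classifiers listed above, k-nearest neighbors, can be sketched directly. This is a plain-Python illustration with k = 1 and Euclidean distance on the numeric weather attributes, not WEKA's implementation (WEKA's is the IBk class in the lazy package).

```python
# Hedged sketch of k-nearest-neighbors (k = 1) on made-up weather data:
# (temperature, humidity) -> play. Illustrative, not WEKA code.

import math

train = [
    ((85, 85), "no"),
    ((80, 90), "no"),
    ((83, 86), "yes"),
    ((70, 96), "yes"),
    ((68, 80), "yes"),
]

def predict(x, k=1):
    """Return the majority class among the k nearest training points."""
    ranked = sorted(train, key=lambda p: math.dist(x, p[0]))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

print(predict((69, 82)))   # nearest neighbor is (68, 80) -> "yes"
```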


    Clustering and Attribute Selection

    Clustering algorithms:
    k-Means, X-Means, FarthestFirst
    Likelihood-based clustering: EM (Expectation-Maximization)
    Cobweb (an incremental clustering algorithm)
    Clusters can be visualized and compared to the true clusters (if given).

    Attribute selection: provides access to various methods for measuring the utility of attributes and identifying the most important attributes in a dataset.

    Filter method: the attribute set is filtered to produce the most promising subset before learning begins.

    A wide range of filtering criteria, including correlation-based feature selection, the chi-square statistic, gain ratio, information gain and a support-vector-machine-based criterion.

    A variety of search methods: forward and backward selection, best-first search, genetic search and random search.

    PCA (principal component analysis) to reduce the dimensionality of a problem.

    Discretizing numeric attributes.
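    The k-means algorithm listed above can be sketched compactly. This is Lloyd's algorithm on made-up 2-D points, not WEKA's SimpleKMeans; the data and initial centers are chosen only for illustration.

```python
# Hedged sketch of k-means clustering (Lloyd's algorithm).
# Points and initial centers are invented for the example.

import math

def kmeans(points, centers, iters=10):
    """Assign points to the nearest center, recompute means, repeat."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(clusters)
        ]
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4.5), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, centers=[(1, 1), (5, 7)])
print(centers)
print([len(g) for g in clusters])   # cluster sizes
```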


    Explorer (cont.)

    Association rule mining:
    Apriori algorithm
    Works only with discrete data.

    Visualization:
    Scatter plots, ROC curves, trees, graphs
    WEKA can visualize single attributes (1-D) and pairs of attributes (2-D).
    Color-coded class values.
    Zoom-in function.
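    The support/confidence computation at the heart of the Apriori algorithm mentioned above can be sketched as follows. This is a toy illustration restricted to one-item rules on invented transactions, not WEKA's Apriori class.

```python
# Hedged sketch of the support/confidence test behind Apriori,
# limited to rules with one item on each side. Transactions invented.

from itertools import combinations

transactions = [
    {"sunny", "no"},
    {"sunny", "no"},
    {"overcast", "yes"},
    {"rainy", "yes"},
    {"rainy", "yes"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(min_support=0.4, min_confidence=0.9):
    """Enumerate 1 -> 1 rules meeting the support/confidence thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            supp = support({lhs, rhs})
            if supp >= min_support and supp / support({lhs}) >= min_confidence:
                found.append((lhs, rhs, supp))
    return found

print(rules())   # e.g. sunny -> no with support 0.4, confidence 1.0
```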


    Explorer GUI (Classify) [screenshot]


    WEKA Experimenter

    This interface is designed to facilitate experimental comparisons of the performance of algorithms based on many different evaluation criteria.

    Experiments can involve many algorithms that are run on multiple datasets.

    It can also iterate over different parameter settings.

    Experiments can also be distributed across different computer nodes in a network.

    Once an experiment has been set up, it can be saved in either XML or binary form, so that it can be revisited.
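    The kind of loop the Experimenter automates can be sketched in miniature: run several algorithms over repeated train/test splits of a dataset and tabulate their accuracies. The two "algorithms" here (a majority-class predictor and 1-NN) and the toy data are stand-ins for illustration only.

```python
# Hedged sketch of an Experimenter-style comparison: two classifiers,
# repeated random train/test splits, mean accuracy per algorithm.

import math
import random

data = [
    ((85, 85), "no"), ((80, 90), "no"), ((83, 86), "yes"),
    ((70, 96), "yes"), ((68, 80), "yes"), ((65, 70), "yes"),
    ((72, 95), "no"), ((69, 70), "yes"), ((75, 80), "yes"),
    ((75, 70), "yes"),
]

def majority(train, x):
    """ZeroR-style baseline: predict the most common class, ignoring x."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def nn1(train, x):
    """1-nearest-neighbor prediction."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

def experiment(algorithms, runs=10, test_size=3):
    rng = random.Random(1)              # fixed seed -> repeatable experiment
    scores = {name: [] for name in algorithms}
    for _ in range(runs):
        shuffled = rng.sample(data, len(data))
        test, train = shuffled[:test_size], shuffled[test_size:]
        for name, algo in algorithms.items():
            correct = sum(algo(train, x) == y for x, y in test)
            scores[name].append(correct / test_size)
    return {name: sum(s) / runs for name, s in scores.items()}

print(experiment({"ZeroR": majority, "IB1": nn1}))
```

    The real Experimenter adds significance testing across datasets and can farm the runs out to multiple machines, as noted above.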


    Knowledge Flow Interface

    The Explorer is designed for batch-based data processing: training data is loaded into memory and then processed.

    However, WEKA has implemented some incremental algorithms.

    The Knowledge Flow interface can handle incremental updates. It can load and preprocess individual instances before feeding them into incremental learning algorithms.

    Knowledge Flow also provides nodes for visualization and evaluation.
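    The incremental-update idea can be illustrated with a classifier whose model is refined one instance at a time, so the full dataset never has to sit in memory. The NaiveCounts class below is a toy stand-in, not WEKA API (WEKA marks its incremental learners with an updateable-classifier interface in Java).

```python
# Toy sketch of incremental learning: the model is updated per instance,
# as in the streaming setting the Knowledge Flow interface supports.

from collections import Counter

class NaiveCounts:
    """Predicts the most frequent class seen so far; updates per instance."""

    def __init__(self):
        self.counts = Counter()

    def update(self, label):
        """Fold in one instance at a time; no dataset kept in memory."""
        self.counts[label] += 1

    def predict(self):
        return self.counts.most_common(1)[0][0]

model = NaiveCounts()
for label in ["yes", "no", "yes", "yes"]:   # a stream of class labels
    model.update(label)
print(model.predict())                      # prints "yes"
```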


    Conclusions

    Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.

    WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.

    WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.

    WEKA 3.6 can read and write data in the format used by the well-known LibSVM and SVM-Light support vector machine implementations.

    WEKA has two limitations:
    Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
    The Java implementation is somewhat slower than an equivalent in C/C++.


    References

    I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, San Francisco, 2000.

    M. Hall, E. Frank, et al., "The WEKA Data Mining Software: An Update," SIGKDD Explorations, Vol. 11, No. 1, 2009.

    R.R. Bouckaert et al., WEKA Manual for Version 3.6.0, 2008.

    E. Frank et al., "WEKA: A Machine Learning Workbench for Data Mining," 2003.