DMDW Lab Record



    INTRODUCTION TO WEKA:

    WHAT IS WEKA?

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.

    Main features:

    Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods

    Graphical user interfaces (incl. data visualization)

    Environment for comparing learning algorithms
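    Since the algorithms can be called from your own Java code, a minimal sketch of that route is shown below. This snippet is not part of the original record; it assumes weka.jar is on the classpath and a weather.arff file in the working directory.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WekaQuickStart {
        public static void main(String[] args) throws Exception {
            // Load an ARFF file (weather.arff is an assumed local file).
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            // Tell Weka which attribute is the class (here, the last one).
            data.setClassIndex(data.numAttributes() - 1);
            // Build a J48 decision tree with default settings and print it.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);
        }
    }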

    THE GUI CHOOSER:

    The GUI Chooser is used to start the different interfaces of the Weka environment. These interfaces can be considered different programs, and they vary in form, function and purpose. Depending on the specific need, whether it be simple data exploration, detailed experimentation or tackling very large problems, one of these interfaces of Weka will be the more appropriate choice.

    This project will mainly focus on the Explorer, Experimenter and Knowledge Flow interfaces of the Weka Environment.

    The CLI is a text-based interface to the Weka Environment. It is the most memory-efficient interface available in Weka.
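    As a small illustration (assuming weka.jar is on the classpath; the file name is an assumption), a classifier can be run from the CLI with a single command, where -t names the training file:

    java -cp weka.jar weka.classifiers.trees.J48 -t weather.arff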

    WEKA DATA MINER

    Weka is a comprehensive set of advanced data mining and analysis tools. The strength of

    Weka lies in the area of classification where it covers many of the most current machine learning

    (ML) approaches. The version of Weka used in this project is version 3-4-4.

    At its simplest, it provides a quick and easy way to explore and analyze data. Weka is also

    suitable for dealing with large data, where the resources of many computers and/or multi-processor

    computers can be used in parallel. We will be examining different aspects of the software with a focus on its decision tree classification features.


    DATA HANDLING:

    Weka currently supports three external file formats, namely CSV, binary, and C4.5. Weka also

    allows for data to be pulled directly from database servers as well as web servers. Its native data

    format is known as the ARFF format.

    ATTRIBUTE RELATION FILE FORMAT (ARFF):

    An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of

    instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at

    the Department of Computer Science of The University of Waikato for use with the Weka machine

    learning software.

    OVERVIEW

    ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

    The ARFF Header Section

    The ARFF Header section of the file contains the relation declaration and attribute

    declarations.

    The @relation Declaration

    The relation name is defined as the first line in the ARFF file. The format is:

    @relation <relation-name>

    where <relation-name> is a string. The string must be quoted if the name includes spaces.

    The @attribute Declarations

    Attribute declarations take the form of an ordered sequence of @attribute statements. Each

    attribute in the data set has its own @attribute statement which uniquely defines the name of that

    attribute and its data type. The order in which the attributes are declared indicates the column position in the

    data section of the file. For example, if an attribute is the third one declared then Weka expects that all

    of that attribute's values will be found in the third comma-delimited column.

    The format for the @attribute statement is:

    @attribute <attribute-name> <datatype>

    where the <attribute-name> must start with an alphabetic character. If spaces are to be

    included in the name then the entire name must be quoted.

    The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:


    numeric
    <nominal-specification>
    string
    date [<date-format>]

    where <nominal-specification> and <date-format> are defined below. The keywords numeric, string and date are case insensitive.

    Numeric attributes

    Numeric attributes can be real or integer numbers.

    Nominal attributes

    Nominal values are defined by providing a <nominal-specification> listing the possible values:

    {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

    For example, the class value of the Iris dataset can be defined as follows:

    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

    Values that contain spaces must be quoted.

    String attributes

    String attributes allow us to create attributes containing arbitrary textual values. This is

    very useful in text-mining applications, as we can create datasets with string attributes, then write

    Weka Filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as

    follows:

    @ATTRIBUTE LCC string

    Date attributes

    Date attribute declarations take the form:

    @attribute <name> date [<date-format>]

    where <name> is the name for the attribute and <date-format> is an optional string specifying how

    date values should be parsed and printed (this is the same format used by SimpleDateFormat). The

    default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss".

    Dates must be specified in the data section as the corresponding string representations of the date/time

    (see example below).


    ARFF Data Section

    The ARFF Data section of the file contains the data declaration line and the actual instance lines.

    The @data Declaration

    The @data declaration is a single line denoting the start of the data segment in the file. The format is:

    @data

    The instance data

    Each instance is represented on a single line, with carriage returns denoting the end of the instance.

    Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the attribute).

    Missing values are represented by a single question mark, as in:

    @data

    4.4,?,1.5,?,Iris-setosa

    Values of string and nominal attributes are case sensitive, and any that contain spaces must be

    quoted, as follows:

    @relation LCCvsLCSH

    @attribute LCC string

    @attribute LCSH string

    @data

    AG5, 'Encyclopedias and dictionaries.;Twentieth century.'

    AS262, 'Science -- Soviet Union -- History.'

    AE5, 'Encyclopedias and dictionaries.'

    AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'

    AS281, 'Astronomy, Assyro-Babylonian.; Moon -- Tables.'

    Dates must be specified in the data section using the string representation specified in the attributedeclaration. For example:

    @RELATION Timestamps

    @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"

    @DATA
    "2001-04-03 12:12:12"

    "2001-05-03 12:59:55"

    NOTE: All header commands start with @ and all comment lines start with %. Comment and

    blank lines are ignored.

    COMMA SEPARATED VALUE (CSV):

    Ex:

    Sno, Sname, Branch, Year      (attributes)

    1, abc, MCA, First
    2, def, MCA, First            (data)


    Sparse ARFF Structure:

    @relation <relation-name>
    @attribute <attribute declarations>
    @data
    {<index1> <value1>, <index2> <value2>, ...}

    Sparse Weather ARFF Dataset:

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    {0 sunny,1 85,2 85,3 FALSE,4 no}
    {0 sunny,1 80,2 90,3 TRUE,4 no}
    {0 overcast,1 83,2 86,3 FALSE,4 yes}
    {0 rainy,1 70,2 96,3 FALSE,4 yes}
    {0 rainy,1 68,2 80,3 FALSE,4 yes}
    {0 rainy,1 65,2 70,3 TRUE,4 no}
    {0 overcast,1 64,2 65,3 TRUE,4 yes}
    {0 sunny,1 72,2 95,3 FALSE,4 no}
    {0 sunny,1 69,2 70,3 FALSE,4 yes}
    {0 rainy,1 75,2 80,3 FALSE,4 yes}
    {0 sunny,1 75,2 70,3 TRUE,4 yes}
    {0 overcast,1 72,2 90,3 TRUE,4 yes}
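    The same conversion to sparse format can be sketched in Java with the NonSparseToSparse instance filter. This is an illustrative snippet, not from the record; it assumes weka.jar on the classpath and a local weather.arff.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.NonSparseToSparse;

    public class SparseDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            NonSparseToSparse sparse = new NonSparseToSparse();
            sparse.setInputFormat(data);              // learn the dataset structure first
            Instances sparseData = Filter.useFilter(data, sparse);
            System.out.println(sparseData);           // prints instances in {index value, ...} form
        }
    }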

    Data Retrieval and Preparation

    Getting the data:

    There are three ways of loading data into the Explorer: from a local file, from a

    database connection, and from a web server. We will be loading the data from a

    locally stored file.

    Weka supports four different file formats, namely CSV, C4.5, flat binary files, and the native

    ARFF format. To demonstrate the functionality of the Explorer environment we will load a CSV

    file, and then in the following section we will preprocess the data to prepare it for analysis. To open a

    local data file, click on the Open File button, and in the window that follows select the desired data

    file.

    Preprocess the Data:

    First Method:


    Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file

    (.csv or .arff). In this case we will open the above data file.


    Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in

    the figure. You can click on the "Use Converter" button, and click OK in the next dialog box that appears.

    Alternatively, you can click on the Choose button to see the list of available converters; choose the

    converter you want and click on the OK button.

    Once the data is loaded, WEKA will recognize the attributes and, during the scan of the data, will

    compute some basic statistics on each attribute. The left panel in the above figure shows the list of

    recognized attributes, while the top panels indicate the names of the base relation (or table) and the

    current working relation (which are the same initially).
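    For reference, the same CSV-to-ARFF conversion can also be done outside the GUI. The following sketch (assuming weka.jar on the classpath and a local bank.csv; file names are assumptions) uses Weka's CSVLoader and ArffSaver:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            // Read a CSV file; the first row is treated as the attribute names.
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("bank.csv"));
            Instances data = loader.getDataSet();
            // Write the converted relation out in the native ARFF format.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("bank.arff"));
            saver.writeBatch();
        }
    }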


    Clicking on any attribute in the left panel will show the basic statistics on that attribute. For

    categorical attributes, the frequency for each attribute value is shown, while for continuous attributes

    we can obtain min, max, mean, standard deviation, etc.

    Note that the visualization in the right bottom panel is a form of cross-tabulation across two

    attributes. For example, in above Figure, the default visualization panel cross-tabulates "married" with

    the "pep" attribute (by default the second attribute is the last column of the data file). You can select

    another attribute using the drop down list.

    Second Method:

    In this method you can load data from a web server. In the Preprocess tab, click on the Open URL

    button; a pop-up window appears as shown below.

    Give the name of the web server followed by the file name, then click on the OK button.

    Third Method:



    In this method you can load data from a database. In the Preprocess tab, click on the OpenDB button;

    a window appears as shown.

    Filtering Algorithms:

    Filters transform the input dataset in some way. When a filter is selected using the Choose

    button, its name appears in the line beside that button. Click that line to get a generic object editor to

    specify its properties. What appears in the line is the command-line version of the filter, and the

    parameters are specified with minus signs. This is a good way of learning how to use the Weka

    commands directly. There are two kinds of filters: unsupervised and supervised filters.

    Filters are often applied to a training dataset and then also applied to the test file. If the filter

    is supervised (for example, if it uses class values to derive good intervals for discretization),

    applying it to the test data will bias the results. It is the discretization intervals derived from the

    training data that must be applied to the test data. When using supervised filters you must be careful

    to ensure that the results are evaluated fairly, an issue that does not arise with unsupervised filters.

    We treat Weka's unsupervised and supervised filtering methods separately. Within each type

    there is a further distinction between attribute filters, which work on the attributes in the datasets,

    and instance filters, which work on the instances. The unsupervised attribute filters are listed below, after a short example.
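    As a rough sketch of how a filter is applied programmatically (illustrative, not from the record; the file name is an assumption and the options shown are Discretize's defaults), the generic pattern is setInputFormat followed by Filter.useFilter:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class FilterDemo {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            // The same command-line options that appear beside the Choose button.
            Discretize discretize = new Discretize();
            discretize.setOptions(Utils.splitOptions("-B 10 -M -1.0 -R first-last"));
            discretize.setInputFormat(data);          // must be called before useFilter
            Instances discretized = Filter.useFilter(data, discretize);
            System.out.println(discretized);
        }
    }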

    Unsupervised Attribute Filters:

    Sno  Name of Function       Description

    1    Add                    Add a new attribute, whose values are all marked as missing.
    2    AddCluster             Add a new nominal attribute representing the cluster assigned to each instance by a given clustering algorithm.
    3    AddExpression          Create a new attribute by applying a specified mathematical function to existing attributes.
    4    AddNoise               Change a percentage of a given nominal attribute's values.
    5    ClusterMembership      Use a clusterer to generate cluster membership values, which then form the new attributes.
    6    Copy                   Copy a range of attributes in the dataset.
    7    Discretize             Convert numeric attributes to nominal: specify which attributes, the number of bins, whether to optimize the number of bins, and whether to output binary attributes. Use equal-width (default) or equal-frequency binning.
    8    FirstOrder             Apply a first-order differencing operator to a range of numeric attributes.
    9    MakeIndicator          Replace a nominal attribute with a Boolean attribute. Assign value 1 to instances with a particular range of attribute values; otherwise, assign 0. By default, the Boolean attribute is coded as numeric.
    10   MergeTwoValues         Merge two values of a given attribute: specify the index of the two values to be merged.
    11   NominalToBinary        Change a nominal attribute to several binary ones, one for each value.
    12   Normalize              Scale all numeric values in the dataset to lie within the interval [0,1].
    13   NumericToBinary        Convert all numeric attributes into binary ones: nonzero values become 1.
    14   NumericTransform       Transform a numeric attribute using any Java function.
    15   RemoveType             Remove attributes of a given type (nominal, numeric, string, or date).
    16   RemoveUseless          Remove constant attributes, along with nominal attributes that vary too much.
    17   ReplaceMissingValues   Replace all missing values for nominal and numeric attributes with the modes and means of the training data.
    18   Standardize            Standardize all numeric attributes to have zero mean and unit variance.
    19   StringToNominal        Convert a string attribute to nominal.
    20   SwapValues             Swap two values of an attribute.

    Unsupervised Instance Filters:


    Sno  Name of Function      Description

    1    NonSparseToSparse     Convert all incoming instances to sparse format.
    2    Normalize             Treat numeric attributes as a vector and normalize it to a given length.
    3    Randomize             Randomize the order of instances in a dataset.
    4    RemoveFolds           Output a specified cross-validation fold for the dataset.
    5    RemoveMisclassified   Remove instances incorrectly classified according to a specified classifier (useful for removing outliers).
    6    RemovePercentage      Remove a given percentage of a dataset.
    7    RemoveRange           Remove a given range of instances from a dataset.
    8    RemoveWithValues      Filter out instances with certain attribute values.
    9    Resample              Produce a random subsample of a dataset, sampling with replacement.
    10   SparseToNonSparse     Convert all incoming sparse instances into nonsparse format.


    SAMPLE DATASETS:

    a) Weather Dataset:

    Description for weather dataset (ARFF)

    Title: weather dataset
    Source of information:
    No. of attributes: 5 (1 string, 2 numeric, 2 nominal)
    No. of instances: 50

    Attribute Description for Weather Dataset

    Attribute 1: outlook (string)
    Attribute 2: temp (numeric)
    Attribute 3: humd (numeric)
    Attribute 4: windy (nominal); labels: yes, no
    Attribute 5: play (nominal); labels: play, noplay

    Example:

    Sample Weather.arff Dataset

    @relation weather

    @attribute outlook {sunny,overcast,rainy}

    @attribute temperature numeric

    @attribute humidity numeric

    @attribute windy {TRUE,FALSE}

    @attribute play {yes,no}

    @data
    sunny,85,85,FALSE,no

    sunny,80,90,TRUE,no

    overcast,83,86,FALSE,yes

    rainy,70,96,FALSE,yes

    rainy,68,80,FALSE,yes

    rainy,65,70,TRUE,no

    overcast,64,65,TRUE,yes

    Sample Weather.csv Dataset

    Outlook,temp,humd,windy,play
    Rainy,30,40,yes,play
    Rainy,50,20,no,play
    Sunny,60,50,yes,noplay
    Sunny,65,70,no,noplay
    Overcast,40,40,yes,play


    Bank Dataset:

    Description of bank dataset

    Title: bank dataset (ARFF)
    Source of information:
    No. of attributes: 12 (4 numeric, 8 nominal)
    No. of instances: 100

    Attribute Description for Bank Dataset

    Attribute 1: id (numeric)
    Attribute 2: age (numeric)
    Attribute 3: sex (nominal); labels: male, female
    Attribute 4: region (nominal); labels: inner_city, rural, suburban, town
    Attribute 5: income (numeric)
    Attribute 6: married (nominal); labels: yes, no
    Attribute 7: children (numeric)
    Attribute 8: car (nominal); labels: yes, no
    Attribute 9: save_acct (nominal); labels: yes, no
    Attribute 10: current_acct (nominal); labels: yes, no
    Attribute 11: mortgage (nominal); labels: yes, no
    Attribute 12: pep (nominal); labels: yes, no

    Example:

    Sample Bank.arff Dataset

    @relation 'personal equity plan'
    @attribute id numeric
    @attribute age numeric
    @attribute sex {male,female}
    @attribute region {inner_city,rural,suburban,town}
    @attribute income numeric
    @attribute married {yes,no}
    @attribute children numeric
    @attribute car {yes,no}
    @attribute save_acct {yes,no}
    @attribute current_acct {yes,no}
    @attribute mortgage {yes,no}
    @attribute pep {yes,no}
    @data
    1,20,male,inner_city,10000,no,0,yes,yes,no,yes,no
    2,45,male,rural,50000,yes,3,yes,no,no,yes,no
    3,35,female,suburban,25000,yes,2,yes,no,no,yes,no
    4,27,male,town,30000,no,0,yes,yes,no,yes,no
    5,25,female,inner_city,20000,yes,2,yes,no,no,yes,no
    6,30,male,town,15000,no,0,yes,yes,no,yes,no


    Sample Bank.csv dataset

    id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
    1,20,male,inner_city,10000,no,0,yes,yes,no,yes,no
    2,45,male,rural,50000,yes,3,yes,no,no,yes,no
    3,35,female,suburban,25000,yes,2,yes,no,no,yes,no
    4,27,male,town,30000,no,0,yes,yes,no,yes,no
    5,25,female,inner_city,20000,yes,2,yes,no,no,yes,no
    6,30,male,town,15000,no,0,yes,yes,no,yes,no

    German Credit Dataset:

    Description of the German credit dataset.

    Title: German Credit data
    Source Information:

    Number of Instances: 1000

    Number of Attributes german: 20 (7 numerical, 13 categorical)
    Number of Attributes german.numer: 24 (24 numerical)

    Attribute Description for German Credit Dataset

    Attribute 1: (qualitative)
    Status of existing checking account
    A11 : ... < 0 DM
    A12 : 0 <= ... < 200 DM
    A13 : ... >= 200 DM / salary assignments for at least 1 year
    A14 : no checking account

    Attribute 2: (numerical)
    Duration in month

    Attribute 3: (qualitative)
    Credit history
    A30 : no credits taken/ all credits paid back duly
    A31 : all credits at this bank paid back duly
    A32 : existing credits paid back duly till now
    A33 : delay in paying off in the past
    A34 : critical account/ other credits existing (not at this bank)

    Attribute 4: (qualitative)
    Purpose
    A40 : car (new)
    A41 : car (used)
    A42 : furniture/equipment
    A43 : radio/television
    A44 : domestic appliances
    A45 : repairs
    A46 : education
    A47 : (vacation - does not exist?)


    A48 : retraining
    A49 : business
    A410 : others

    Attribute 5: (numerical)
    Credit amount

    Attribute 6: (qualitative)
    Savings account/bonds
    A61 : ... < 100 DM
    A62 : 100 <= ... < 500 DM
    A63 : 500 <= ... < 1000 DM
    A64 : ... >= 1000 DM
    A65 : unknown/ no savings account


    Attribute 14: (qualitative)
    Other installment plans
    A141 : bank
    A142 : stores
    A143 : none

    Attribute 15: (qualitative)
    Housing
    A151 : rent
    A152 : own
    A153 : for free

    Attribute 16: (numerical)
    Number of existing credits at this bank

    Attribute 17: (qualitative)
    Job
    A171 : unemployed/ unskilled - non-resident
    A172 : unskilled - resident
    A173 : skilled employee / official
    A174 : management/ self-employed/ highly qualified employee/ officer

    Attribute 18: (numerical)
    Number of people being liable to provide maintenance for

    Attribute 19: (qualitative)
    Telephone
    A191 : none
    A192 : yes, registered under the customer's name

    Attribute 20: (qualitative)
    Foreign worker
    A201 : yes
    A202 : no

    Relabeled values in attribute checking_status
    From: A11 To: '<0'
    From: A12 To: '0<=X<200'
    From: A13 To: '>=200'
    From: A14 To: 'no checking'


    Relabeled values in attribute purpose
    From: A40 To: 'new car'
    From: A41 To: 'used car'
    From: A42 To: furniture/equipment
    From: A43 To: radio/tv
    From: A44 To: 'domestic appliance'
    From: A45 To: repairs
    From: A46 To: education
    From: A47 To: vacation
    From: A48 To: retraining
    From: A49 To: business
    From: A410 To: other

    Relabeled values in attribute savings_status
    From: A61 To: '<100'
    From: A62 To: '100<=X<500'
    From: A63 To: '500<=X<1000'
    From: A64 To: '>=1000'
    From: A65 To: 'no known savings'


    Relabeled values in attribute housing
    From: A151 To: rent
    From: A152 To: own
    From: A153 To: 'for free'

    Relabeled values in attribute job
    From: A171 To: 'unemp/unskilled non res'
    From: A172 To: 'unskilled resident'
    From: A173 To: skilled
    From: A174 To: 'high qualif/self emp/mgmt'

    Relabeled values in attribute own_telephone
    From: A191 To: none
    From: A192 To: yes

    Relabeled values in attribute foreign_worker
    From: A201 To: yes
    From: A202 To: no

    Relabeled values in attribute class
    From: 1 To: good
    From: 2 To: bad

    @relation german_credit
    @attribute checking_status {'<0', '0<=X<200', '>=200', 'no checking'}



    @attribute ' Strawberry Soda_binarized' {0,1}
    @attribute ' Vanilla Ice Cream_binarized' {0,1}
    @attribute ' Potato Chips_binarized' {0,1}
    @attribute ' Strawberry Yogurt_binarized' {0,1}
    @attribute ' Diet Soda_binarized' {0,1}
    @attribute ' D Cell Batteries_binarized' {0,1}
    @attribute ' Paper Towels_binarized' {0,1}
    @attribute ' Mint Chocolate Bar_binarized' {0,1}
    @attribute ' Salsa Dip_binarized' {0,1}
    @attribute ' Buttered Popcorn_binarized' {0,1}
    @attribute ' Cheese Crackers_binarized' {0,1}
    @attribute ' Chocolate Bar_binarized' {0,1}
    @attribute ' Rice Soup_binarized' {0,1}
    @attribute ' Mouthwash_binarized' {0,1}
    @attribute ' Sugar_binarized' {0,1}
    @attribute ' Cheese Flavored Chips_binarized' {0,1}
    @attribute ' Sweat Potatoes_binarized' {0,1}
    @attribute ' Deodorant_binarized' {0,1}
    @attribute ' Waffles_binarized' {0,1}
    @attribute ' Decaf Coffee_binarized' {0,1}
    @attribute ' Smoked Turkey Sliced_binarized' {0,1}
    @attribute ' Screw Driver_binarized' {0,1}
    @attribute ' Sesame Oil_binarized' {0,1}
    @attribute ' Red Wine_binarized' {0,1}
    @attribute ' 60 Watt Lightbulb_binarized' {0,1}
    @attribute ' Cream Soda_binarized' {0,1}
    @attribute ' Apple Fruit Roll_binarized' {0,1}
    @attribute ' Noodle Soup_binarized' {0,1}
    @attribute ' Ice Cream Sandwich_binarized' {0,1}
    @attribute ' Soda Crackers_binarized' {0,1}
    @attribute ' Lettuce_binarized' {0,1}
    @attribute ' AA Cell Batteries_binarized' {0,1}
    @attribute ' Honey Roasted Peanuts_binarized' {0,1}
    @attribute ' Frozen Cheese Pizza_binarized' {0,1}
    @attribute ' Tomato Soup_binarized' {0,1}
    @attribute ' Manicotti_binarized' {0,1}
    @attribute ' Toilet Bowl Cleaner_binarized' {0,1}
    @attribute ' Liquid Laundry Detergent_binarized' {0,1}
    @attribute ' Instant Rice_binarized' {0,1}
    @attribute ' Green Pepper_binarized' {0,1}
    @attribute ' Frozen Broccoli_binarized' {0,1}
    @attribute ' Chardonnay Wine_binarized' {0,1}
    @attribute ' Brown Sugar Grits_binarized' {0,1}
    @attribute ' Canned Peas_binarized' {0,1}
    @attribute ' Skin Moisturizer_binarized' {0,1}
    @attribute ' Avocado Dip_binarized' {0,1}
    @attribute ' Blueberry Muffins_binarized' {0,1}
    @attribute ' Apple Cinnamon Waffles_binarized' {0,1}
    @attribute ' Chablis Wine_binarized' {0,1}
    @attribute ' Cantaloupe_binarized' {0,1}
    @attribute ' Shrimp Cocktail Sauce_binarized' {0,1}
    @attribute ' 100 Watt Lightbulb_binarized' {0,1}
    @attribute ' Whole Green Beans_binarized' {0,1}
    @attribute ' Turkey TV Dinner_binarized' {0,1}
    @attribute ' Wash Towels_binarized' {0,1}


    @attribute ' Dog Food_binarized' {0,1}
    @attribute ' Cat Food_binarized' {0,1}
    @attribute ' Frozen Sausage Pizza_binarized' {0,1}
    @attribute ' Frosted Donuts_binarized' {0,1}
    @attribute ' Shrimp_binarized' {0,1}
    @attribute ' Summer Sausage_binarized' {0,1}
    @attribute ' Plums_binarized' {0,1}
    @attribute ' Mild Cheddar Cheese_binarized' {0,1}
    @attribute ' Cream of Wheat_binarized' {0,1}
    @attribute ' Fresh Lima Beans_binarized' {0,1}
    @attribute ' Flavored Fruit Bars_binarized' {0,1}
    @attribute ' Mushrooms_binarized' {0,1}
    @attribute ' Flour_binarized' {0,1}
    @attribute ' Plain Rye Bread_binarized' {0,1}
    @attribute ' Jelly Filled Donuts_binarized' {0,1}
    @attribute ' Apple Sauce_binarized' {0,1}
    @attribute ' Hot Chicken Wings_binarized' {0,1}
    @attribute ' Orange Juice_binarized' {0,1}
    @attribute ' Strawberry Jam_binarized' {0,1}
    @attribute ' Chocolate Chip Cookies_binarized' {0,1}
    @attribute ' Vegetable Soup_binarized' {0,1}
    @attribute ' Oats and Nuts Cereal_binarized' {0,1}
    @attribute ' Fruit Roll_binarized' {0,1}
    @attribute ' Corn Oil_binarized' {0,1}
    @attribute ' Corn Flake Cereal_binarized' {0,1}
    @attribute ' 75 Watt Lightbulb_binarized' {0,1}
    @attribute ' Mushroom Pizza - Frozen_binarized' {0,1}
    @attribute ' Sour Cream_binarized' {0,1}
    @attribute ' Deli Salad_binarized' {0,1}
    @attribute ' Deli Turkey_binarized' {0,1}
    @attribute ' Glass Cleaner_binarized' {0,1}
    @attribute ' Brown Sugar_binarized' {0,1}
    @attribute ' English Muffins_binarized' {0,1}
    @attribute ' Apple Soda_binarized' {0,1}
    @attribute ' Strawberry Preserves_binarized' {0,1}
    @attribute ' Pepperoni Pizza - Frozen_binarized' {0,1}
    @attribute ' Plain Oatmeal_binarized' {0,1}
    @attribute ' Beef Soup_binarized' {0,1}
    @attribute ' Trash Bags_binarized' {0,1}
    @attribute ' Corn Chips_binarized' {0,1}
    @attribute ' Tangerines_binarized' {0,1}
    @attribute ' Hot Dogs_binarized' {0,1}
    @attribute ' Can Opener_binarized' {0,1}
    @attribute ' Dried Apples_binarized' {0,1}
    @attribute ' Grape Juice_binarized' {0,1}
    @attribute ' Carrots_binarized' {0,1}
    @attribute ' Frozen Shrimp_binarized' {0,1}
    @attribute ' Grape Fruit Roll_binarized' {0,1}
    @attribute ' Merlot Wine_binarized' {0,1}
    @attribute ' Raisins_binarized' {0,1}
    @attribute ' Cranberry Juice_binarized' {0,1}
    @attribute ' Shampoo_binarized' {0,1}
    @attribute ' Pancake Mix_binarized' {0,1}
    @attribute ' Paper Plates_binarized' {0,1}
    @attribute ' Bologna_binarized' {0,1}


    @attribute ' 2pct. Milk_binarized' {0,1}
    @attribute ' Daily Newspaper_binarized' {0,1}
    @attribute ' Popcorn Salt_binarized' {0,1}

    @data0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


    Exp: 1 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    Add:

    a) SCHEMA : weka.filters.unsupervised.attribute.Add -N unnamed -C 3


    Discretize:

    b) SCHEMA : weka.filters.unsupervised.attribute.Discretize -B 11 -M -1.0 -R first-last


    NominalToBinary:

    c) SCHEMA : weka.filters.unsupervised.attribute.NominalToBinary -R first-last


    Normalize:

    d) SCHEMA : weka.filters.unsupervised.attribute.Normalize -S 1.0 -T 0.0


    NumericToBinary:

    e) SCHEMA : weka.filters.unsupervised.attribute.NumericToBinary


    Swap Values:

    f) SCHEMA : weka.filters.unsupervised.attribute.SwapValues -C last -F first -S last


    String to Nominal:

    g) SCHEMA : weka.filters.unsupervised.attribute.StringToNominal


    Exp: 2 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    Implement weka.classifiers.trees.J48

    Weather Dataset.arff

    @relation weather

    @attribute outlook {sunny, overcast, rainy}

    @attribute temperature real

    @attribute humidity real

    @attribute windy {TRUE, FALSE}

    @attribute play {yes, no}

    @data

    sunny,85,85,FALSE,no

    sunny,80,90,TRUE,no

    overcast,83,86,FALSE,yes

    rainy,70,96,FALSE,yes

    rainy,68,80,FALSE,yes

    rainy,65,70,TRUE,no

    overcast,64,65,TRUE,yes

    sunny,72,95,FALSE,no

    sunny,69,70,FALSE,yes

    rainy,75,80,FALSE,yes

    sunny,75,70,TRUE,yes

    overcast,72,90,TRUE,yes

    overcast,81,75,FALSE,yes

    rainy,71,91,TRUE,no

    Use Training Set Testing Options:



    === Run information ===

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

    Relation: weather

    Instances: 14

    Attributes: 5

    outlook

    temperature

    humidity

    windy

    play

    Test mode: evaluate on training data

    === Classifier model (full training set) ===

    J48 pruned tree

    ------------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)

    outlook = overcast: yes (4.0)

    outlook = rainy

    | windy = TRUE: no (2.0)

    | windy = FALSE: yes (3.0)

    Number of Leaves : 5

    Size of the tree : 8

    Time taken to build model: 0.02 seconds

    === Evaluation on training set ===

    === Summary ===

    Correctly Classified Instances 14 100 %

    Incorrectly Classified Instances 0 0 %

    Kappa statistic 1

    Mean absolute error 0

    Root mean squared error 0

    Relative absolute error 0 %

    Root relative squared error 0 %

    Coverage of cases (0.95 level) 100 %

    Mean rel. region size (0.95 level) 50 %

    Total Number of Instances 14

    === Detailed Accuracy By Class ===

    TP Rate FP Rate Precision Recall F-Measure ROC Area Class

    1 0 1 1 1 1 yes

    1 0 1 1 1 1 no

    === Confusion Matrix ===

    a b <-- classified as


    9 0 | a = yes

    0 5 | b = no
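    The same training-set evaluation can be reproduced in Java; a hedged sketch (assuming weka.jar on the classpath and a local weather.arff) is:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.Utils;

    public class TrainingSetEval {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();
            tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));  // options from the run information
            tree.buildClassifier(data);
            // Evaluate on the same data the model was trained on.
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(tree, data);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());            // confusion matrix
        }
    }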

    Visualize Tree:

    Use Cross Validation Testing Option:

    Classifier Output:

    === Run information ===

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


    Relation: weather

    Instances: 14

    Attributes: 5

    outlook

    temperature

    humidity

    windy

    play

    Test mode: 10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree

    ------------------

    outlook = sunny

    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)

    outlook = overcast: yes (4.0)
    outlook = rainy

    | windy = TRUE: no (2.0)

    | windy = FALSE: yes (3.0)

    Number of Leaves : 5

    Size of the tree : 8

    Time taken to build model: 0 seconds

    === Stratified cross-validation ===

    === Summary ===

    Correctly Classified Instances 9 64.2857 %

    Incorrectly Classified Instances 5 35.7143 %

    Kappa statistic 0.186

    Mean absolute error 0.2857

    Root mean squared error 0.4818

    Relative absolute error 60 %

    Root relative squared error 97.6586 %

    Coverage of cases (0.95 level) 92.8571 %

    Mean rel. region size (0.95 level) 64.2857 %

    Total Number of Instances 14

    === Detailed Accuracy By Class ===

    TP Rate FP Rate Precision Recall F-Measure ROC Area Class

    0.778 0.6 0.7 0.778 0.737 0.789 yes

    0.4 0.222 0.5 0.4 0.444 0.789 no

    === Confusion Matrix ===

    a b <-- classified as

    7 2 | a = yes

    3 2 | b = no
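    A sketch of the equivalent 10-fold cross-validation in Java (illustrative, not from the record; the random seed is an assumption, so the exact numbers may differ from the run above):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidationEval {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            Evaluation eval = new Evaluation(data);
            // 10-fold stratified cross-validation, as in the run above.
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());
        }
    }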


    Use Supplied Test Set Testing Options:

    We will now use our model to classify the new instances. A portion of the new instances ARFF

    file is depicted in the figure. Note that the attribute section is identical to the training data (the bank data we

    used for building our model). However, in the data section, the value of the "pep" attribute is "?" (or

    unknown).

    In the main panel, under "Test options", click the "Supplied test set" radio button, and then click the

    "Set..." button. This will pop up a window which allows you to open the file containing test instances,

    as in the figures.


    In this case, we open the file "bank-new.arff" and upon returning to the main window,

    we click the "Start" button. This once again generates the model from our training data, but this time

    it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the

    value of the "pep" attribute. The result is depicted in Figure 28. Note that the summary of the results in

    the right panel does not show any statistics. This is because in our test instances the value of the class

    attribute ("pep") was left as "?", thus WEKA has no actual values to which it can compare the

    predicted values of new instances.

    Of course, in this example we are interested in knowing how our model managed to

    classify the new instances. To do so we need to create a file containing all the new instances along

    with their predicted class value resulting from the application of the model. Doing this is much

    simpler using the command line version of WEKA classifier application. However, it is possible to do

    so in the GUI version using an "indirect" approach, as follows.

    First, right-click the most recent result set in the left "Result list" panel. In the resulting

    pop-up window select the menu item "Visualize classifier errors". This brings up a separate window

    containing a two-dimensional graph.


    We would like to "save" the classification results from which the graph is generated. In

    the new window, we click on the "Save" button and save the result as the file: "bank-predicted.arff".

    This file contains a copy of the new instances along with an additional column for the predicted value

    of "pep".


    Note that two attributes have been added to the original new instances data:

    "Instance_number" and "predictedpep". These correspond to new columns in the data portion. The

    "predictedpep" value for each new instance appears immediately before the "?", which stands for the

    unknown actual "pep" class value. For example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our

    model, while the predicted class value for instance 4 is "NO".
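    For completeness, a small Java sketch of the same prediction step (file names as above are assumptions; the class attribute is assumed to be the last one):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class PredictUnlabeled {
        public static void main(String[] args) throws Exception {
            Instances train = new Instances(new BufferedReader(new FileReader("bank.arff")));
            Instances unlabeled = new Instances(new BufferedReader(new FileReader("bank-new.arff")));
            train.setClassIndex(train.numAttributes() - 1);
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            J48 tree = new J48();
            tree.buildClassifier(train);
            // classifyInstance returns the index of the predicted class value.
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double pred = tree.classifyInstance(unlabeled.instance(i));
                System.out.println(i + " -> " + unlabeled.classAttribute().value((int) pred));
            }
        }
    }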

    Use Percentage Split Testing Options:

    Classifiers Output:

    === Run information ===

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

    Relation: weather

    Instances: 14

    Attributes: 5

    outlook

    temperature

    humidity

    windy

    play

    Test mode: split 66.0% train, remainder test

    === Classifier model (full training set) ===

    J48 pruned tree

    ------------------

    outlook = sunny

    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)

    outlook = overcast: yes (4.0)

    outlook = rainy

    | windy = TRUE: no (2.0)

    | windy = FALSE: yes (3.0)


    Number of Leaves : 5

    Size of the tree : 8

    Time taken to build model: 0 seconds

    === Evaluation on test split ===

    === Summary ===

    Correctly Classified Instances 2 40 %

    Incorrectly Classified Instances 3 60 %

    Kappa statistic -0.3636

    Mean absolute error 0.6

    Root mean squared error 0.7746

    Relative absolute error 126.9231 %

    Root relative squared error 157.6801 %

    Coverage of cases (0.95 level) 40 %

    Mean rel. region size (0.95 level) 50 %

    Total Number of Instances 5

    === Detailed Accuracy By Class ===

    TP Rate FP Rate Precision Recall F-Measure ROC Area Class

    0.667 1 0.5 0.667 0.571 0.333 yes

    0 0.333 0 0 0 0.333 no

    === Confusion Matrix ===

    a b <-- classified as

    2 1 | a = yes

    2 0 | b = no
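    A rough Java equivalent of the percentage-split evaluation (illustrative; the Explorer may order the split differently, so the numbers can differ):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class PercentageSplitEval {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));                       // shuffle before splitting
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }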


    Exp: 3 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    Analyzing the Output:

    Run Information:

    The first line of the run information section contains information about the learning scheme

    chosen and its parameters. The parameters chosen (both default and modified) are shown in short

    form. In our example, the learning scheme was weka.classifiers.trees.J48 or the J48 algorithm.

    The second line shows information about the relation. Relations in Weka are like data

    files. The name of the relation contains the name of the data file used to build it and the names

    of any filters that have been applied to it.

    The next part shows the number of instances in the relation, followed by the number of attributes. This is followed by the list of attributes.

    The last part shows the type of testing that was employed; in our example it was 10-fold cross-

    validation.

    Classifier Model (full Training set):

    J48 pruned tree

    ------------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)

    Number of Leaves: 5

    Size of the tree: 8

    It displays information about the model generated using the full training set. It says

    "full training set" because we used cross-validation, and what is being displayed here is the final

    model, which is built using all of the dataset. When using tree models, a text

    display of the generated tree is shown. This is followed by information about the number of

    leaves and the overall tree size (above).

    Confusion Matrix:

    A confusion matrix is an easy way of describing the results of the experiment. The best

    way to describe it is by example


    === Confusion Matrix ===

    a b <-- classified as

    9 0 | a = yes

    0 5 | b = no

    Each row corresponds to the actual class and each column to the predicted class, so the off-diagonal entries count the misclassified instances.


    How much does the scheme improve on simply predicting the average? The relative squared error is:

    relative squared error = ((p1 - a1)^2 + ... + (pn - an)^2) / ((a1 - ā)^2 + ... + (an - ā)^2)

    The relative absolute error is:

    relative absolute error = (|p1 - a1| + ... + |pn - an|) / (|a1 - ā| + ... + |an - ā|)

    where p1, ..., pn are the predicted values, a1, ..., an are the actual values, and ā is the mean of the actual values over the training data.

                                  A       B       C       D

    Root mean-squared error       67.8    91.7    63.3    57.4
    Mean absolute error           41.3    38.5    33.4    29.2
    Root rel squared error        42.2%   57.2%   39.4%   35.8%
    Relative absolute error       43.1%   40.1%   34.8%   30.4%
    Correlation coefficient       0.88    0.88    0.89    0.91
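    As a worked illustration of these formulas (the numbers below are made up, not taken from the table above), the measures can be computed directly:

    public class RegressionErrors {
        public static void main(String[] args) {
            double[] p = {12.0, 10.5, 9.0};   // predicted values (made-up numbers)
            double[] a = {11.0, 10.0, 10.0};  // actual values (made-up numbers)
            double mean = 0;
            for (double v : a) mean += v;
            mean /= a.length;                 // ā, the mean of the actual values
            double sae = 0, sse = 0, saeBase = 0, sseBase = 0;
            for (int i = 0; i < a.length; i++) {
                sae += Math.abs(p[i] - a[i]);
                sse += (p[i] - a[i]) * (p[i] - a[i]);
                saeBase += Math.abs(a[i] - mean);
                sseBase += (a[i] - mean) * (a[i] - mean);
            }
            System.out.println("Mean absolute error:      " + sae / a.length);
            System.out.println("Root mean-squared error:  " + Math.sqrt(sse / a.length));
            System.out.println("Relative absolute error:  " + 100 * sae / saeBase + " %");
            System.out.println("Root rel. squared error:  " + 100 * Math.sqrt(sse / sseBase) + " %");
        }
    }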


    Exp: 4 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    K- Means Clustering in Weka:

    This example illustrates the use of k-means clustering with WEKA. The sample data set used

    for this example is based on the "bank data" available in ARFF format (bank-data.arff). As an

    illustration of performing clustering in WEKA, we will use its implementation of the K-means

    algorithm to cluster the customers in this bank data set, and to characterize the resulting customer

    segments. First, the data file is loaded into the Weka Explorer.

    Bank Data Set:

    @relation bank
    @attribute Instance_number numeric
    @attribute age numeric
    @attribute sex {"FEMALE","MALE"}
    @attribute region {"INNER_CITY","TOWN","RURAL","SUBURBAN"}
    @attribute income numeric
    @attribute married {"NO","YES"}
    @attribute children {0,1,2,3}
    @attribute car {"NO","YES"}
    @attribute save_act {"NO","YES"}
    @attribute current_act {"NO","YES"}
    @attribute mortgage {"NO","YES"}
    @attribute pep {"YES","NO"}

    @data
    0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
    1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
    2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
    3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
    4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO
    5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES
    6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES
    7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO
    8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO
    9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO
    10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO
    11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO
    12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES
    13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES
    14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO
    15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO
    16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO
    17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO
    18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES
    19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO
    20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES


    21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO
    22,54,MALE,INNER_CITY,38446.6,YES,0,NO,YES,YES,NO,NO
    23,27,FEMALE,TOWN,15538.8,NO,0,YES,YES,YES,YES,NO
    24,22,MALE,INNER_CITY,12640.3,NO,2,YES,YES,YES,NO,NO
    25,56,MALE,INNER_CITY,41034,YES,0,YES,YES,YES,YES,NO
    26,45,MALE,INNER_CITY,20809.7,YES,0,NO,YES,YES,YES,NO
    27,39,FEMALE,TOWN,20114,YES,1,NO,NO,YES,NO,YES
    28,39,FEMALE,INNER_CITY,29359.1,NO,3,YES,NO,YES,YES,NO
    29,61,MALE,RURAL,24270.1,YES,1,NO,NO,YES,NO,YES
    30,61,FEMALE,RURAL,22942.9,YES,2,NO,YES,YES,NO,NO


    In the pop-up window we enter 5 as the number of clusters (instead of the default value of 2)

    and we leave the value of "seed" as is. The seed value is used in generating a random number which

    is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-

    means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try

    different values and evaluate the results.
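    A minimal Java sketch of the same clustering run (assuming weka.jar on the classpath and a local bank.arff; attribute filtering such as removing the id column is omitted here):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            // No class index is set: clustering treats every attribute as input.
            Instances data = new Instances(new BufferedReader(new FileReader("bank.arff")));
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(5);   // the value entered in the pop-up window
            km.setSeed(10);         // the seed value left at its default above
            km.buildClusterer(data);
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }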

    Once the options have been specified, we can run the clustering algorithm. Here we make sure

    that in the "Cluster Mode" panel, the "Use Percentage Split" option is selected, and we click "Start".

    We can right click the result set in the "Result list" panel and view the results of clustering in a

    separate window.


    Cluster Output:

    === Run information ===

    Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10

    Relation: bank
    Instances: 600
    Attributes: 12
      Instance_number
      age
      sex
      region
      income
      married
      children
      car
      save_act
      current_act
      mortgage
      pep

    Test mode: split 66% train, remainder test

    === Clustering model (full training set) ===

    kMeans
    ======

    Number of iterations: 14
    Within cluster sum of squared errors: 1719.2889887418955
    Missing values globally replaced with mean/mode

    Cluster centroids:

                       Cluster#
    Attribute          Full Data   0            1            2            3            4
                       (600)       (66)         (112)        (120)        (137)        (165)
    =====================================================================================
    Instance_number    299.5       306.6364     265.9732     302.2333     320.292      300.1515
    age                42.395      40.0606      32.7589      51.475       44.3504      41.6424
    sex                FEMALE      FEMALE       FEMALE       FEMALE       FEMALE       MALE
    region             INNER_CITY  RURAL        INNER_CITY   INNER_CITY   TOWN         INNER_CITY
    income             27524.0312  26206.1992   18260.9218   34922.1563   27626.442    28873.3638
    married            YES         NO           YES          YES          YES          YES
    children           0           3            2            1            0            0
    car                NO          NO           NO           NO           NO           YES
    save_act           YES         YES          YES          YES          YES          YES
    current_act        YES         YES          YES          YES          YES          YES
    mortgage           NO          NO           NO           NO           NO           YES
    pep                NO          NO           NO           YES          NO           YES

    Time taken to build model (full training data) : 0.14 seconds

    === Model and evaluation on test split ===

    kMeans
    ======

    Number of iterations: 10
    Within cluster sum of squared errors: 1115.231316606429


    Missing values globally replaced with mean/mode

    Cluster centroids:

                       Cluster#
    Attribute          Full Data   0            1            2            3            4
                       (396)       (131)        (63)         (80)         (41)         (81)
    =====================================================================================
    Instance_number    299.6364    277.1374     347.1905     362.0625     231.5122     271.8642
    age                43.1061     40.4733      49.1111      45.1         51.2439      36.6049
    sex                MALE        FEMALE       MALE         MALE         FEMALE       MALE
    region             INNER_CITY  INNER_CITY   INNER_CITY   TOWN         RURAL        INNER_CITY
    income             27825.983   25733.2533   32891.0238   30817.2439   32090.4995   22158.1307
    married            YES         YES          YES          NO           NO           YES
    children           0           0            1            0            0            0
    car                NO          YES          YES          NO           YES          NO
    save_act           YES         YES          YES          YES          YES          NO
    current_act        YES         YES          NO           YES          YES          YES
    mortgage           NO          NO           NO           NO           NO           YES
    pep                NO          NO           YES          YES          NO           YES

    Time taken to build model (percentage split) : 0.04 seconds

    Clustered Instances

    0   73 ( 36%)
    1   37 ( 18%)
    2   30 ( 15%)
    3   28 ( 14%)
    4   36 ( 18%)

    The result window shows the centroid of each cluster as well as statistics on the number and

    percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for each

    cluster (so, each dimension value in the centroid represents the mean value for that dimension in the

    cluster). Thus, centroids can be used to characterize the clusters. For example, the centroid for cluster

    1 in the full training set shows that this is a segment of cases representing younger (approx. 33) females living

    in the inner city with an average income of approx. $18,260, who are married with two children, etc.

    Furthermore, this group has on average said NO to the PEP product.

    Another way of understanding the characteristics of each cluster is through visualization. We

    can do this by right-clicking the result set on the left "Result list" panel and selecting "Visualize

    cluster assignments".


    You can choose the cluster number and any of the other attributes for each of the three

    different dimensions available (x-axis, y-axis, and color). Different combinations of choices will

    result in a visual rendering of different relationships within each cluster. In the above example, we

    have chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis,

    and the "sex" attribute as the color dimension. This will result in a visualization of the distribution of

    males and females in each cluster. For instance, you can note which clusters are dominated by

    males and which by females. In this case, by changing the color dimension

    to other attributes, we can see their distribution within each of the clusters.

    Finally, we may be interested in saving the resulting data set which included each instance

    along with its assigned cluster. To do so, we click the "Save" button in the visualization window and

    save the result as the file "bank-kmeans.arff".

    @relation bank_clustered

    @attribute Instance_number numeric
    @attribute age numeric
    @attribute sex {FEMALE,MALE}
    @attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
    @attribute income numeric
    @attribute married {NO,YES}
    @attribute children {0,1,2,3}
    @attribute car {NO,YES}
    @attribute save_act {NO,YES}
    @attribute current_act {NO,YES}
    @attribute mortgage {NO,YES}
    @attribute pep {YES,NO}
    @attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}

    @data
    0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES,cluster1
    1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO,cluster3
    2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO,cluster2
    3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO,cluster5


    "4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO,cluster5""5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES,cluster5""6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES,cluster0""7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO,cluster2""8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO,cluster5""9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO,cluster2""10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO,cluster5""11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO,cluster4""12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES,cluster1""13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES,cluster1""14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO,cluster5""15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO,cluster2""16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO,cluster5""17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO,cluster5""18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES,cluster4""19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO,cluster2""20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES,cluster2"

    "21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO,cluster5"

    Note that in addition to the "Instance_number" attribute, WEKA has also added a "Cluster" attribute to

    the original data set. In the data portion, each instance now has its assigned cluster as the last attribute

    value. By doing some simple manipulation to this data set, we can easily convert it to a more usable

    form for additional analysis or processing.
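    One programmatic route to the same end (an illustrative sketch, not the record's own method) is the AddCluster filter, which trains a clusterer and appends each instance's cluster as a new attribute:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddCluster;

    public class SaveClusterAssignments {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("bank.arff")));
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(5);
            // AddCluster trains the supplied clusterer and appends a "cluster" attribute.
            AddCluster addCluster = new AddCluster();
            addCluster.setClusterer(km);
            addCluster.setInputFormat(data);
            Instances clustered = Filter.useFilter(data, addCluster);
            ArffSaver saver = new ArffSaver();
            saver.setInstances(clustered);
            saver.setFile(new File("bank-kmeans.arff"));
            saver.writeBatch();
        }
    }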


    Exp: 5 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    Apriori Algorithm using Weka

    Market Basket Dataset:

    @relation marketbasket-weka.filters.unsupervised.attribute.NumericToBinary-weka.filters.unsupervised.attribute.NumericToBinary

    @attribute ' Hair Conditioner_binarized' {0,1}
    @attribute ' Lemons_binarized' {0,1}
    @attribute ' Standard coffee_binarized' {0,1}
    @attribute ' Frozen Chicken Wings_binarized' {0,1}
    @attribute ' 98pct. Fat Free Hamburger_binarized' {0,1}
    @attribute ' Sugar Cookies_binarized' {0,1}
    @attribute ' Onions_binarized' {0,1}
    @attribute ' Deli Ham_binarized' {0,1}
    @attribute ' Dishwasher Detergent_binarized' {0,1}
    @attribute ' Beets_binarized' {0,1}
    @attribute ' 40 Watt Lightbulb_binarized' {0,1}
    @attribute ' Ice Cream_binarized' {0,1}
    @attribute ' Cottage Cheese_binarized' {0,1}
    @attribute ' Plain English Muffins_binarized' {0,1}
    @attribute ' Strawberry Soda_binarized' {0,1}
    @attribute ' Vanilla Ice Cream_binarized' {0,1}
    @attribute ' Potato Chips_binarized' {0,1}
    @attribute ' Strawberry Yogurt_binarized' {0,1}
    @attribute ' Diet Soda_binarized' {0,1}
    @attribute ' D Cell Batteries_binarized' {0,1}
    @attribute ' Paper Towels_binarized' {0,1}
    @attribute ' Mint Chocolate Bar_binarized' {0,1}
    @attribute ' Salsa Dip_binarized' {0,1}
    @attribute ' Buttered Popcorn_binarized' {0,1}
    @attribute ' Cheese Crackers_binarized' {0,1}
    @attribute ' Chocolate Bar_binarized' {0,1}
    @attribute ' Rice Soup_binarized' {0,1}
    @attribute ' Mouthwash_binarized' {0,1}
    @attribute ' Sugar_binarized' {0,1}
    @attribute ' Cheese Flavored Chips_binarized' {0,1}
    @attribute ' Sweat Potatoes_binarized' {0,1}
    @attribute ' Deodorant_binarized' {0,1}
    @attribute ' Waffles_binarized' {0,1}
    @attribute ' Decaf Coffee_binarized' {0,1}
    @attribute ' Smoked Turkey Sliced_binarized' {0,1}
    @attribute ' Screw Driver_binarized' {0,1}
    @attribute ' Sesame Oil_binarized' {0,1}
    @attribute ' Red Wine_binarized' {0,1}
    @attribute ' 60 Watt Lightbulb_binarized' {0,1}
    @attribute ' Cream Soda_binarized' {0,1}


    @attribute ' Apple Fruit Roll_binarized' {0,1}
    @attribute ' Noodle Soup_binarized' {0,1}
    @attribute ' Ice Cream Sandwich_binarized' {0,1}
    @attribute ' Soda Crackers_binarized' {0,1}
    @attribute ' Lettuce_binarized' {0,1}
    @attribute ' AA Cell Batteries_binarized' {0,1}
    @attribute ' Honey Roasted Peanuts_binarized' {0,1}
    @attribute ' Frozen Cheese Pizza_binarized' {0,1}
    @attribute ' Tomato Soup_binarized' {0,1}
    @attribute ' Manicotti_binarized' {0,1}
    @attribute ' Toilet Bowl Cleaner_binarized' {0,1}
    @attribute ' Liquid Laundry Detergent_binarized' {0,1}
    @attribute ' Instant Rice_binarized' {0,1}
    @attribute ' Green Pepper_binarized' {0,1}
    @attribute ' Frozen Broccoli_binarized' {0,1}
    @attribute ' Chardonnay Wine_binarized' {0,1}
    @attribute ' Brown Sugar Grits_binarized' {0,1}
    @attribute ' Canned Peas_binarized' {0,1}
    @attribute ' Skin Moisturizer_binarized' {0,1}
    @attribute ' Avocado Dip_binarized' {0,1}
    @attribute ' Blueberry Muffins_binarized' {0,1}
    @attribute ' Apple Cinnamon Waffles_binarized' {0,1}
    @attribute ' Chablis Wine_binarized' {0,1}
    @attribute ' Cantaloupe_binarized' {0,1}
    @attribute ' Shrimp Cocktail Sauce_binarized' {0,1}
    @attribute ' 100 Watt Lightbulb_binarized' {0,1}
    @attribute ' Whole Green Beans_binarized' {0,1}
    @attribute ' Turkey TV Dinner_binarized' {0,1}
    @attribute ' Wash Towels_binarized' {0,1}
    @attribute ' Dog Food_binarized' {0,1}
    @attribute ' Cat Food_binarized' {0,1}
    @attribute ' Frozen Sausage Pizza_binarized' {0,1}
    @attribute ' Frosted Donuts_binarized' {0,1}
    @attribute ' Shrimp_binarized' {0,1}
    @attribute ' Summer Sausage_binarized' {0,1}
    @attribute ' Plums_binarized' {0,1}
    @attribute ' Mild Cheddar Cheese_binarized' {0,1}
    @attribute ' Cream of Wheat_binarized' {0,1}
    @attribute ' Fresh Lima Beans_binarized' {0,1}
    @attribute ' Flavored Fruit Bars_binarized' {0,1}
    @attribute ' Mushrooms_binarized' {0,1}
    @attribute ' Flour_binarized' {0,1}
    @attribute ' Plain Rye Bread_binarized' {0,1}
    @attribute ' Jelly Filled Donuts_binarized' {0,1}
    @attribute ' Apple Sauce_binarized' {0,1}
    @attribute ' Hot Chicken Wings_binarized' {0,1}
    @attribute ' Orange Juice_binarized' {0,1}
    @attribute ' Strawberry Jam_binarized' {0,1}
    @attribute ' Chocolate Chip Cookies_binarized' {0,1}
    @attribute ' Vegetable Soup_binarized' {0,1}
    @attribute ' Oats and Nuts Cereal_binarized' {0,1}
    @attribute ' Fruit Roll_binarized' {0,1}
    @attribute ' Corn Oil_binarized' {0,1}
    @attribute ' Corn Flake Cereal_binarized' {0,1}
    @attribute ' 75 Watt Lightbulb_binarized' {0,1}


    @attribute ' Mushroom Pizza - Frozen_binarized' {0,1}

@data
0,0,0, ... ,0,1,0, ... ,0,0
(each transaction is one comma-separated row of 0/1 values, one value per binarized item attribute, with 1 marking an item purchased in that transaction)
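A quick way to verify the exported file is to load it with a few lines of Java against the Weka API and check the counts. This is a minimal sketch; the file name supermarket-binarized.arff is an assumption for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadBinarizedData {
    public static void main(String[] args) throws Exception {
        // File name is assumed; point it at the binarized ARFF shown above.
        Instances data = new Instances(
                new BufferedReader(new FileReader("supermarket-binarized.arff")));
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        // Each attribute should be the nominal {0,1} type declared in the header.
        System.out.println("First attribute: " + data.attribute(0));
    }
}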

    Clicking on the "Associate" tab will bring up the interface for the association rule

    algorithms. The Apriori algorithm which we will use is the deafult algorithm selected. However, in

    order to change the parameters for this run (e.g., support, confidence, etc.) we click on the text box

    immediately to the right of the "Choose" button. Note that this box, at any given time, shows the

    specific command line arguments that are to be used for the algorithm. The dialog box for changingthe parameters is depicted in Figure a2. Here, you can specify various parameters associated with

    Apriori. Click on the "More" button to see the synopsis for the different parameters.


WEKA allows the resulting rules to be sorted according to different metrics such as confidence, leverage, and lift. In this example, we have selected lift as the criterion and entered 1.5 as its minimum value. Lift (or improvement) is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities of L and R, i.e.,

lift = Pr(L,R) / (Pr(L) · Pr(R))

If this value is 1, then L and R are independent. The higher the value, the more likely it is that L and R appearing together in a transaction is not just a random occurrence but the result of some relationship between them.

Here we also change the default number of rules (10) to 100; this indicates that the program will report no more than the top 100 rules (in this case sorted according to their lift values). The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts with the upper bound on support and incrementally decreases it (by delta increments, which by default are 0.05, or 5%). The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached. The significance testing option is applicable only in the case of confidence and is by default not used (-1.0).

Once the parameters have been set, the command line text box shows the new command line. We now click on Start to run the program.

    Associator Output:

    === Run information ===

Scheme:     weka.associations.Apriori -N 5 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:   weather.symbolic
Instances:  1000
Attributes: 100
            Milk
            Tea powder
            Egg
            Butter
            Jam
            Paste
            bread

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.25 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 15

    Generated sets of large itemsets:


    Size of set of large itemsets L(1): 12

    Size of set of large itemsets L(2): 26

    Size of set of large itemsets L(3): 4

    Best rules found:

1. milk=egg 4 ==> tea powder=yes 4    lift:(1.56) lev:(0.1) [1] conv:(1.43)
2. milk=butter 4 ==> jam=normal 4    lift:(2) lev:(0.14) [2] conv:(2)
3. bread=jam=milk 4 ==> egg=yes 4    lift:(1.56) lev:(0.1) [1] conv:(1.43)
4. milk=tea powder=bread 3 ==> egg=high 3    lift:(2) lev:(0.11) [1] conv:(1.5)
5. egg=+paste=milk 3 ==> bread=no 3    lift:(2.8) lev:(0.14) [1] conv:(1.93)

The panel on the left ("Result list") now shows an item indicating the algorithm that was run and the time of the run. You can perform multiple runs in the same session, each time with different parameters, and each run will appear as an item in the Result list panel. Clicking on one of the results in this list brings up the details of the run, including the discovered rules, in the right panel. In addition, right-clicking on the result set allows us to save the result buffer into a separate file. In this case, we save the output in the file bank-data-ar1.txt.

Note that the rules were discovered based on the specified threshold values for support and lift. For each rule, the frequency counts for the LHS and RHS are given, as well as the values for confidence, lift, leverage, and conviction. Leverage and lift measure similar things, except that leverage measures the difference between the probability of co-occurrence of L and R (see the example above) and the product of the individual probabilities of L and R, i.e.,

leverage = Pr(L,R) - Pr(L) · Pr(R)

In other words, leverage measures the proportion of additional cases covered by both L and R above those expected if L and R were independent of each other. Thus, for leverage, values above 0 are desirable, whereas for lift, we want to see values greater than 1. Finally, conviction is similar to lift, but it measures the effect of the right-hand side not being true, and it inverts the ratio. Conviction is measured as:

conviction = Pr(L) · Pr(not R) / Pr(L, not R)

Thus conviction, in contrast to lift, is not symmetric (and also has no upper bound).
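To make these measures concrete, the following minimal sketch computes confidence, lift, leverage, and conviction from raw transaction counts. The counts are made-up illustrative numbers, not taken from the run above:

public class RuleMetrics {
    public static void main(String[] args) {
        double n = 1000.0;      // total transactions (illustrative)
        double countL = 200.0;  // transactions containing the LHS L
        double countR = 300.0;  // transactions containing the RHS R
        double countLR = 120.0; // transactions containing both L and R

        double prL = countL / n, prR = countR / n, prLR = countLR / n;

        double confidence = prLR / prL;              // Pr(R|L)
        double lift       = prLR / (prL * prR);      // 1 => L, R independent
        double leverage   = prLR - prL * prR;        // 0 => L, R independent
        // Pr(L, not R) = Pr(L) - Pr(L,R)
        double conviction = prL * (1 - prR) / (prL - prLR);

        System.out.printf("conf=%.2f lift=%.2f lev=%.2f conv=%.2f%n",
                confidence, lift, leverage, conviction);
    }
}

With these counts the rule scores confidence 0.60, lift 2.0, leverage 0.06, and conviction 1.75; that is, L and R co-occur twice as often as independence would predict.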

In most cases, it is sufficient to focus on a combination of support, confidence, and either lift or leverage to quantitatively measure the "quality" of a rule. However, the real value of a rule, in terms of usefulness and actionability, is subjective and depends heavily on the particular domain and business objectives.


    Exp: 6 Date: _ _/_ _/_ _

    Name of the Experiment: ..

    ..

    Using The Experimenter

The Experimenter interface to Weka is specialized for conducting experiments in which the user is interested in comparing several learning schemes on one or more datasets. As its output, the Experimenter produces data that can be used to compare the learning schemes visually and numerically, as well as for conducting significance testing. To demonstrate how to use the Experimenter, we will conduct an experiment comparing two tree-learning algorithms on the birth dataset.

There are 3 main areas in the Experimenter interface, accessed via tabs at the top left of the window: the Setup area, where the experiment parameters are set; the Run area, where the experiment is started and its progress monitored; and lastly the Analyze area, where the results of the experiment are studied.

    Setting up the Experiment

The Setup window has several areas that must each be configured for the experiment to be properly defined. Starting from the top, these are the Experiment Configuration Mode, Results Destination, Experiment Type, Iteration Control, Datasets, Algorithms, and lastly the Notes area.

[Screenshot: the Experimenter Setup tab, annotated: click the New button to create a new experiment; click the Browse button to choose the result ARFF (or CSV) file; choose a testing option; click the Add new button to add a dataset used for comparing algorithms; click the Add new button to add the classification algorithms to compare.]


    Experiment Configuration Mode:

We will be using the simple experimental interface mode, as we do not require the extra features the advanced mode offers. We start by creating a new experiment and then defining its parameters. A new experiment is created by clicking the New button at the top of the window, which creates a blank experiment. After we have finished setting up the experiment, we save it using the Save button. Experiment settings are saved in either the EXP or the more familiar XML format. These files can be opened later to recall all of the experiment configuration settings.

    Choose Destination:

The results of the experiment will be stored in a data file. This area allows one to specify the name and format of that file. This is not necessary if one does not intend to use the data outside of the Experimenter and the data does not need to be examined at a later date. Results can be stored in the ARFF or CSV format, and they can also be sent to an external database.

    Set Experiment Type:

There are 3 types of experiments available in the simple interface. They vary in how the data is split into training and testing sets for the experiment. The options are cross-validation, random split, and split with order preserved (i.e. the data is split without shuffling, so the training set consists of instance #1 followed by instance #2 and so on). We will use cross-validation in our example (a sketch of the underlying evaluation follows below).
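For orientation, what the Experimenter automates for each dataset/algorithm pair is essentially a repeated cross-validation. The minimal sketch below runs one stratified 10-fold cross-validation with the core Weka API; the dataset file name is an assumption:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        // File name is assumed; the Experimenter expects a cleaned ARFF dataset.
        Instances data = new Instances(
                new BufferedReader(new FileReader("birth-weight.arff")));
        // Like the Experimenter, take the last attribute as the class.
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10 folds, fixed random seed so the run is repeatable.
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("J48 percent correct: %.2f%n", eval.pctCorrect());
    }
}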

    Iteration Control:

For the randomized experiment types, the user has the option of randomizing the data again and repeating the experiment. The Number of Repetitions value controls how many times this will take place.

    Add data set(s):

In this section, the user adds the datasets that will be used in the experiment. Only ARFF files can be used here and, as mentioned before, the Experimenter expects a fully prepared and cleaned dataset. There is no option for choosing the classification variable here; the last attribute is always taken to be the class attribute. In our example, the birth-weight data set is the only one we will use.

    Add Algorithms:

In this section, the user adds the classification algorithms to be employed in the experiment. The procedure for selecting an algorithm and choosing its options is exactly the same as in the Explorer; the difference here is that more than one algorithm can be specified.

Algorithms are added by clicking on the Add button in the Algorithm section of the window, which pops up a window the user uses to select the algorithm. This window also displays the available options for the selected algorithm.

The first time the window displays, the ZeroR rule algorithm will be selected, as shown in the picture above. The user can select a different algorithm by clicking on the Choose button. Clicking on the More button will display help about the selected algorithm and a description of its available options. For our example, we will add the J48 algorithm with the option for binary splits turned on, and the REPTree algorithm (a programmatic sketch of this configuration appears below). Individual algorithms can be edited or deleted by selecting the algorithm in the list of algorithms and then clicking on the Edit or Delete button. Finally, any extra notes or comments about the experiment setup can be added by clicking on the Notes button at the bottom of the window and entering the information in the window provided.
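The same two learners can also be configured in code. This is a minimal sketch of the configuration chosen above, using only public Weka API calls:

import weka.classifiers.trees.J48;
import weka.classifiers.trees.REPTree;

public class ExperimentAlgorithms {
    public static void main(String[] args) throws Exception {
        // J48 with binary splits turned on, as selected in the Experimenter.
        J48 j48 = new J48();
        j48.setBinarySplits(true);

        // REPTree with its default options.
        REPTree repTree = new REPTree();

        // Print the command-line form, like the one shown in the result key.
        System.out.println("trees.J48 "
                + weka.core.Utils.joinOptions(j48.getOptions()));
        System.out.println("trees.REPTree "
                + weka.core.Utils.joinOptions(repTree.getOptions()));
    }
}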

Saving the Experiment Setup:


At this point, we have entered all the necessary options to start our experiment. We now save the experiment setup so that we do not have to re-enter all this information again. This is done by clicking on the Save button at the top of the window. These settings can be loaded at another time if one wishes to redo or modify the experiment.

    Running the Experiment

The next step is to run the experiment, and this is done by clicking on the Run tab at the top of the window. There is not much involved in this step; all that is needed is to click on the Start button. The progress of the experiment is displayed in the Status area at the bottom of the window, and any errors are reported in the Log area. Once the experiment has been run, the next step is to analyze the results.


    Analyzing the output:

[Screenshot: the Experimenter Analyze tab, annotated: click the Experiment button to load the results of the experiment just run; choose the Comparison field used to compare the algorithms.]


[Screenshot: further Analyze tab options, annotated: sorting the datasets in ascending order; choosing any one Test base from the given list.]


    Test Output:

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   1
Resultsets: 3
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       4/6/12 9:55 PM

Dataset                  (1) trees.J4 | (2) trees  (3) trees
------------------------------------------------------------
weather.symbolic  (525)        50.03 |     60.03      69.02
------------------------------------------------------------
                             (v/ /*) |   (0/1/0)    (0/1/0)

Key:
(1) trees.J48 '-C 0.25 -M 2' -217733168393644444
(2) trees.REPTree '-M 2 -V 0.0010 -N 3 -S 1 -L -1 -I 0.0' -9216785998198681299
(3) trees.RandomTree '-K 0 -M 1.0 -S 1' 8934314652175299374

In the (v/ /*) row, v marks a scheme that is significantly better than the test base, * marks one that is significantly worse, and a blank means no statistically significant difference at the chosen confidence level; here (0/1/0) reads as 0 significant wins, 1 tie, and 0 significant losses over the single dataset.