Weka Tutorial 2

Upload: jat02013

Post on 02-Apr-2018


  • 7/27/2019 Weka Tutorial 2

    1/36

    WEKA

Classification Using Decision Trees. Based on Dr. Polczynski's lecture.

    Tuesday, October 12, 2010

  • Slide 2/36

    [Slide graphic] An unknown iris (?) enters a Decision Tree, which classifies it as iris setosa, iris versicolor, or iris virginica.

    Goal: Create a J4.8 decision tree to classify unknown iris samples.

  • Slide 3/36

    Decision Tree Concept

    [Slide graphic] Measured: sepal length, sepal width, petal length, petal width.

    Is this setosa? Yes: Iris setosa. No: Is this versicolor?

    Is this versicolor? Yes: Iris versicolor. No: Iris virginica.

    How do we make these decisions?

  • Slide 4/36

    Look back on Tutorial 1, which you did two weeks ago. It shows the difficulty of making the right decisions. Note how the histograms for the input attributes overlap. This means, for example, that the sepal length of some setosa samples equals that of some versicolor samples and some virginica samples.

    This makes it unclear which species an unknown iris belongs to based on sepal length alone.

    We need to make a decision tree that makes its prediction based on all four flower dimensions.

  • Slide 5/36

    What species is an iris with sepal length of 4.9 cm?

    Here, we see this situation as it occurs in the original dataset.

    In this example, we see three samples, one from each species, all with a sepal length of 4.9 centimeters.

    So, which species does an unknown sample with sepal length of 4.9 cm belong to?

  • Slide 6/36

    [Slide graphic] Measured: sepal length, sepal width, petal length, petal width.

    Is this setosa? Probably: Iris setosa. Probably not: Is this versicolor?

    Is this versicolor? Probably: Iris versicolor. Probably not: Iris virginica.

    What are these probabilities?

    One result of these overlaps is that we typically cannot be absolutely certain of our classification of an unknown sample; we can only determine the likelihood that an unknown sample belongs to a particular class.

    We will return to this issue after building our tree.

  • Slide 7/36

    Weka's J4.8 Decision Tree Learner

    Based on the C4.5 decision tree learner: Quinlan, J.R., C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann, 1993.

  • Slide 8/36

    Start the Weka Explorer and go to the Preprocess tab. Note that the C4.5 and J4.8 decision tree learners can accept numerical attributes as well as nominal attributes, so we can use our original FishersIrisDataset.arff.
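    If you have not looked inside an .arff file before, here is a minimal sketch of the format; the attribute names below are illustrative and may not match the exact names used in FishersIrisDataset.arff:

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth  numeric
@attribute petallength numeric
@attribute petalwidth  numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
```

    The numeric attribute declarations are what let J4.8 split directly on measurements such as petal width.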

  • Slide 9/36

    Move from the Preprocess tab to the Classify tab. To select the Weka J4.8 algorithm, click Choose.

    After you click J48 to highlight this selection, click Close to return to the Classify tab.

  • Slide 10/36

    Left-click on J48 to open the Weka Generic Object Editor.

    We will accept all of the defaults here, except that we will change saveInstanceData to true. This will allow us to find out how each iris sample is classified after we build the tree. Note that you can find information on the items on this screen by clicking More. For now, just click OK.

  • Slide 11/36

    Click More Options. We'll accept these defaults for now, except that we will select Output predictions. This will allow us to track predictions on individual samples in the test set.

    Note that we have not selected Preserve order for % Split; I'll explain shortly. Click OK.

  • Slide 12/36

    Click the Percentage split radio button in the Test options box. If you mouse over the Percentage split label, the following help note pops up: Train on a percentage of the data and test on the remainder.

    Here, Weka will randomly select 66% of the original dataset to build, or train, the decision tree. Then, Weka will test the tree using the remaining samples.

    You may ask: Aren't we wasting a third of our data? Wouldn't we have a better tree if we used all 150 samples to build it?

    It turns out that it is possible to train a decision tree too well, which is called over-fitting. We will return to this topic after constructing our tree.
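    As a rough sketch of what a percentage split does (an illustration in Python, not Weka's actual implementation; Weka's Test options use a default random seed of 1):

```python
import random

def percentage_split(instances, train_fraction=0.66, seed=1):
    # Shuffle a copy so both subsets get a mixture of classes,
    # then cut the shuffled list at the requested fraction.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 150 instances split 66/34: 99 for training, 51 for testing.
train, test = percentage_split(range(150))
```

    With 150 instances this yields the 99-train / 51-test split that the tutorial reports a few slides from here.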

  • Slide 13/36

    You may have noted two slides ago that I did not select Preserve order for % Split.

    This may seem confusing, because we just selected a Percentage split of 66%.

    Recall that in our original Fisher's iris dataset, all of the setosa samples were at the top of the dataset, the versicolor were in the middle, and the virginica were all at the end.

    When Weka splits the original dataset into training and test datasets, we need a mixture of species in each.

    If we selected Preserve order for % Split, the training dataset would contain only two of the three species, and the test set would consist almost entirely of the one species never seen during training. The result would be a poor-quality decision tree.
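    A quick sketch of why preserving order goes wrong on a sorted dataset (hypothetical labels, ordered the way Fisher's file is):

```python
# 50 setosa, then 50 versicolor, then 50 virginica, as in the original file.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

# An order-preserving 66% split: first 99 rows train, last 51 rows test.
train, test = labels[:99], labels[99:]

print(sorted(set(train)))  # ['setosa', 'versicolor'] - no virginica to learn from
print(sorted(set(test)))   # ['versicolor', 'virginica'] - mostly a class never trained on
```

    The tree would never see a virginica during training, so it could never predict that class.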

  • Slide 14/36

    Click Start on the Classify tab. The Classifier output box shows the results of classification.

    Before examining the rest of the Classifier output box, let's see the decision tree.

  • Slide 15/36

    Scrolling to the top of the Classifier output, we also verify that we're working with the correct dataset.

  • Slide 16/36

    To see the tree, right-click the highlighted Result list entry for the tree we just built, and then click Visualize tree.

  • Slide 17/36

    It says that if petal width is less than or equal to 0.6 cm, the iris is setosa.

    If petal width is > 0.6 cm, and also > 1.7 cm, then the iris is virginica. Of course, if petal width is > 1.7 cm, it will also be > 0.6 cm; that is just how the branches of this particular tree are nested.

    If petal width is > 0.6 cm but <= 1.7 cm, and petal length is <= 4.9 cm, then the iris is versicolor.

  • Slide 18/36

  • Slide 19/36

    [Decision tree graphic] If petal width is > 0.6 cm but <= 1.7 cm, petal length is > 4.9 cm, and petal width is > 1.5 cm, then the iris is versicolor; if instead petal width is <= 1.5 cm, the iris is virginica.

    We see that the tree has 5 leaves, and that the total size of the tree is 9 elements.

    According to our decision tree, we don't need to measure sepal length or sepal width to classify unknown irises. All we need is

    petal length and petal width.
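    Written out as plain if/else rules, the tree's logic looks like this (a sketch built from the thresholds quoted on the preceding slides; Weka generates the tree itself, not this code):

```python
def classify_iris(petal_length, petal_width):
    # Thresholds from the J4.8 tree; measurements in cm.
    if petal_width <= 0.6:
        return "Iris-setosa"
    if petal_width > 1.7:
        return "Iris-virginica"
    # 0.6 < petal width <= 1.7: look at petal length next.
    if petal_length <= 4.9:
        return "Iris-versicolor"
    # Long petals but moderate width: one last check on width.
    if petal_width <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"

print(classify_iris(1.4, 0.2))  # Iris-setosa
print(classify_iris(6.0, 2.3))  # Iris-virginica
```

    Five return statements, five leaves, matching the tree summary in the Classifier output.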

  • Slide 20/36

    Scroll around the Classifier output box until you find the Evaluation on test split section. Recall that we split our original dataset of 150 samples, using 66% for training the tree and the remaining 34% for testing it.

    Weka actually used 51 samples for testing the tree, and our tree classified 49 of the 51 test samples correctly.

  • Slide 21/36

    The Classifier output box provides details on how the tree did on the 51 test samples. The Confusion Matrix tells us that there were 15 setosas in the test set, all of which were properly classified as setosa.

    There were 19 versicolor in the test set, and they were all classified correctly, too.

    There were 17 virginica, but 2 were incorrectly classified as versicolor by our decision tree.
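    The slide's accuracy figure can be recomputed directly from the confusion matrix (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix as reported for the 51-sample test split.
confusion = {
    "setosa":     {"setosa": 15, "versicolor": 0,  "virginica": 0},
    "versicolor": {"setosa": 0,  "versicolor": 19, "virginica": 0},
    "virginica":  {"setosa": 0,  "versicolor": 2,  "virginica": 15},
}

# Total test samples, and those on the matrix diagonal (correct predictions).
total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)
accuracy = correct / total

print(total, correct)            # 51 49
print(round(accuracy * 100, 2))  # 96.08
```

    This reproduces the 49-of-51 result, about 96% accuracy on the test split.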

  • Slide 22/36

    Scroll around the Classifier output box until you find the Predictions on test split section. Here we see the results for the 51 samples used to test our tree. Look for the plus sign in the error column.

    We see that sample #16 is one of the two virginica that were classified by the tree as versicolor, as noted in the Confusion Matrix.

    If you scroll down to instance #39, you will see the second mis-classified virginica. Under probability distribution, you will see three columns, one for each iris species.

    These show the probability that a particular sample belongs to each class. The asterisk marks the species with the highest probability, which was therefore selected as the predicted class.
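    Those per-class probabilities come from the composition of the leaf an instance lands in. A minimal sketch using relative frequencies (the counts below are hypothetical, not taken from the tutorial's output):

```python
def leaf_distribution(counts):
    # Turn a leaf's training-class counts into a probability distribution.
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical leaf: 47 versicolor and 1 stray virginica reached it.
dist = leaf_distribution({"setosa": 0, "versicolor": 47, "virginica": 1})

# The class with the highest probability gets the asterisk and the prediction.
predicted = max(dist, key=dist.get)
print(predicted)  # versicolor
```

    Any test instance routed to such a leaf is predicted versicolor, but with less than full confidence.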

  • Slide 23/36

    We can find more interesting information on how our tree did on the test samples.

    On the Classify tab, right-click the Result list entry and choose Visualize classifier errors.

  • Slide 24/36

    Recalling that our tree used only petal length and petal width to classify samples, select Petal Length and Petal Width as the axes for the Weka Classifier Visualize window.

    Note that the X's represent properly classified test samples, and squares show incorrectly classified samples.

    We can also see the two virginica samples in the test set of 51 that are incorrectly classified by the tree as versicolor. Here, it is clear why the tree classified these samples incorrectly: they fall into the versicolor group when petal length and petal width are used to predict sample classes.

  • Slide 25/36

    Left-click a data point on the plot to call up information for that point. Clicking the right-most box on the plot, we see data for instance 39 in the test dataset.

    As you can see, this is one of the virginicas that the tree incorrectly predicted to be versicolor.

  • Slide 26/36

    Returning to our decision tree, note the number 50.0 in this setosa leaf. This says that 50 of the original 150 samples reached this leaf in the tree. Ignore the decimal place in this number; it is just an artifact of how Weka works. There are 50 setosa in the original dataset, so this leaf was completely successful.

    Next, we see that 46 samples reached this virginica leaf. 45 were, in fact, virginica, but 1 of the samples was not.

    All three of the samples in the small virginica leaf are correctly placed. Added to the 45 above, this gives us 48 out of 50 virginica.

    Here are the leaves that samples #16 and #39 ended up in. These are the two virginica in the test dataset that were mis-classified as versicolor.

  • Slide 27/36

    Let's see if we can find out what this mis-classified sample is. Left-click on this leaf of the tree.

  • Slide 28/36

    This reveals a plot of all the instances in the virginica leaf we clicked. Again, choose petal length and petal width as the axes.

    Note that the green X's are virginica, and the red X's are versicolor. Now we can see the one sample that was mis-classified as a virginica and is actually a versicolor.

    You may need to slide the Jitter bar over to uncover the mis-classified sample. Left-click on it to find out more.

  • Slide 29/36

    How good did our tree do relative to other classifiers? Why did our tree get this one wrong?

    Here is a summary that I put together from the output of our decision tree for the combined training data and test data.

    This is similar to the Confusion Matrix, except that it shows all of the samples in the original dataset, not just the test samples.

    Why doesn't Weka provide such a summary? The answer is that it doesn't really matter how well a decision tree classifies samples in the training dataset.

    What really matters is how well it does on the test set. We'll explain why shortly.

    Two questions arise at this point: Why does our tree mis-classify some samples? How good is our tree compared to other data mining algorithms?

  • Slide 30/36

    Why did our tree classify some samples incorrectly?

    Attribute measurement errors, e.g., inaccurate measurements on petals and sepals.

    Sample class identification errors, e.g., some setosas labeled as versicolor.

    Outlier samples, e.g., some drought-stunted flower samples.

    Atypical sample set, e.g., training samples of a sun-loving flower collected from deep shade.

    Inappropriate classification algorithm, e.g., the J48 decision tree learner doesn't work for irises!

  • Slide 31/36

  • Slide 32/36

    [Slide graphic] Training samples are used to build the decision tree; test samples and unknown samples are run through the finished tree. Two scores result: % correct on the training samples, and % correct on the test samples.

  • Slide 33/36

    A wide variety of data mining tools is available.

    It is becoming easier and easier to find models to describe data.

    It is becoming easier and easier to overfit models!

    How well a model describes training data is becoming less important.
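    An extreme toy illustration of the overfitting point above (pure Python, hypothetical data): a model that simply memorizes its training samples describes them perfectly yet generalizes to nothing.

```python
# Training data memorized verbatim: (petal length, petal width) -> class.
train = {
    (1.4, 0.2): "setosa",
    (4.5, 1.4): "versicolor",
    (6.0, 2.3): "virginica",
}

def memorizer(petals):
    # 100% accurate on the training samples, useless on anything unseen.
    return train.get(petals, "unknown")

print(memorizer((1.4, 0.2)))  # setosa: seen during training
print(memorizer((1.5, 0.3)))  # unknown: the model learned nothing general
```

    A held-out test set is what exposes this failure, which is why test-set accuracy is the number that matters.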

  • Slide 34/36

    This tutorial was prepared with the assistance of: Prof. Andrzej Kochanski and Prof. Marcin Perzyk, Warsaw University of Technology.

  • Slide 35/36

    It turns out that botanist Edgar Anderson, who compiled the data for Fisher's iris dataset, had an assistant named Igor*. Igor also compiled an iris dataset, Igor's iris dataset, linked below.

    For this assignment, you will:

    1. Download the IgorsIrisDataset.xls Excel file.
    2. Convert it to .csv format.
    3. Open the file in the Weka Explorer Preprocessor.
    4. Remove the Instance column.
    5. Complete the spreadsheet shown here for Igor's iris dataset.

    Cut and paste the spreadsheet into your homework assignment.


    http://www.rorylewis.com/docs/03_uccs/2010_06_Fall_CS5450/05_CS54501_extra_2010.htm
  • Slide 36/36