Weka Tutorial 2

Upload: jat02013

Post on 02-Apr-2018


  • 7/27/2019 Weka Tutorial 2

    1/36

    WEKA

Classification Using Decision Trees. Based on Dr. Polczynski's lecture.

    Tuesday, October 12, 2010

  • Slide 2/36

    [Slide graphic] An unknown iris (?) enters a Decision Tree, which classifies it as iris setosa, iris versicolor, or iris virginica.

    Goal: Create a J4.8 decision tree to classify unknown iris samples.

  • Slide 3/36

    Decision Tree Concept

    [Slide graphic] Measured: sepal length, sepal width, petal length, petal width.

    Is this setosa? Yes: Iris setosa. No: Is this versicolor?

    Is this versicolor? Yes: Iris versicolor. No: Iris virginica.

    How do we make these decisions?

  • Slide 4/36

    Look back on Tutorial 1, which you did two weeks ago. It shows the difficulty of making the right decisions. Note how the histograms for the input attributes overlap. This means, for example, that the sepal length of some setosa samples equals that of some versicolor samples and some virginica samples.

    This makes it unclear which species an unknown iris belongs to based on sepal length alone.

    We need to make a decision tree that makes its prediction based on all four flower dimensions.

  • Slide 5/36

    What species is an iris with sepal length of 4.9 cm?

    Here, we see this situation as it occurs in the original dataset.

    In this example, we see three samples, one from each species, all with a sepal length of 4.9 centimeters.

    So, which species does an unknown sample with sepal length of 4.9 cm belong to?

  • Slide 6/36

    [Slide graphic] Measured: sepal length, sepal width, petal length, petal width.

    Is this setosa? Probably: Iris setosa. Probably not: Is this versicolor?

    Is this versicolor? Probably: Iris versicolor. Probably not: Iris virginica.

    What are these probabilities?

    One result of these overlaps is that we typically cannot be absolutely certain of our classification of an unknown sample; we can only determine the likelihood that an unknown sample belongs to a particular class.

    We will return to this issue after building our tree.

  • Slide 7/36

    Weka's J4.8 Decision Tree Learner

    Based on the C4.5 decision tree learner: Quinlan, J.R., C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann, 1993.

  • Slide 8/36

    Start the Weka Explorer and go to the Preprocess tab. Note that the C4.5 and J4.8 decision tree learners can accept numerical attributes as well as nominal attributes, so we can use our original FishersIrisDataset.arff.
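    If you have not looked inside an .arff file before, here is a minimal sketch of the format; the attribute names below are illustrative and may not match the exact names used in FishersIrisDataset.arff:

```
@relation iris

@attribute sepallength numeric
@attribute sepalwidth  numeric
@attribute petallength numeric
@attribute petalwidth  numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
```

    The numeric attribute declarations are what let J4.8 split directly on measurements such as petal width.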

  • Slide 9/36

    Move from the Preprocess tab to the Classify tab. To select the Weka J4.8 algorithm, click Choose.

    After you click J48 to highlight this selection, click Close to return to the Classify tab.

  • Slide 10/36

    Left-click on J48 to open the Weka Generic Object Editor.

    We will accept all of the defaults here, except that we will change saveInstanceData to true. This will allow us to find out how each iris sample is classified after we build the tree. Note that you can find information on the items on this screen by clicking More. For now, just click OK.

  • Slide 11/36

    Click More Options. We'll accept these defaults for now, except that we will select Output predictions. This will allow us to track predictions on individual samples in the test set.

    Note that we have not selected Preserve order for % Split; I'll explain shortly. Click OK.

  • Slide 12/36

    Click the Percentage split radio button in the Test options box. If you mouse over the Percentage split label, the following help note pops up: Train on a percentage of the data and test on the remainder.

    Here, Weka will randomly select 66% of the original dataset to build, or train, the decision tree. Then, Weka will test the tree using the remaining samples.

    You may ask: Aren't we wasting a third of our data? Wouldn't we have a better tree if we used all 150 samples to build it?

    It turns out that it is possible to train a decision tree too well, which is called over-fitting. We will return to this topic after constructing our tree.
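    As a rough sketch of what a percentage split does (an illustration in Python, not Weka's actual implementation; Weka's Test options use a default random seed of 1):

```python
import random

def percentage_split(instances, train_fraction=0.66, seed=1):
    # Shuffle a copy so both subsets get a mixture of classes,
    # then cut the shuffled list at the requested fraction.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 150 instances split 66/34: 99 for training, 51 for testing.
train, test = percentage_split(range(150))
```

    With 150 instances this yields the 99-train / 51-test split that the tutorial reports a few slides from here.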

  • Slide 13/36

    You may have noted two slides ago that I did not select Preserve order for % Split.

    This may seem confusing, because we just selected a Percentage split of 66%.

    Recall that in our original Fisher's iris dataset, all of the setosa samples were at the top of the dataset, the versicolor were in the middle, and the virginica were all at the end.

    When Weka splits the original dataset into training and test datasets, we need a mixture of species in each.

    If we selected Preserve order for % Split, the training dataset would contain only two of the three species, and the test set would consist almost entirely of the one species never seen during training. The result would be a poor-quality decision tree.
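    A quick sketch of why preserving order goes wrong on a sorted dataset (hypothetical labels, ordered the way Fisher's file is):

```python
# 50 setosa, then 50 versicolor, then 50 virginica, as in the original file.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

# An order-preserving 66% split: first 99 rows train, last 51 rows test.
train, test = labels[:99], labels[99:]

print(sorted(set(train)))  # ['setosa', 'versicolor'] - no virginica to learn from
print(sorted(set(test)))   # ['versicolor', 'virginica'] - mostly a class never trained on
```

    The tree would never see a virginica during training, so it could never predict that class.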

  • Slide 14/36

    Click Start on the Classify tab. The Classifier output box shows the results of classification.

    Before examining the rest of the Classifier output box, let's see the decision tree.

  • Slide 15/36

    Scrolling to the top of the Classifier output, we also verify that we're working with the correct dataset.

  • Slide 16/36

    To see the tree, right-click the highlighted Result list entry for the tree we just built, and then click Visualize tree.

  • Slide 17/36

    It says that if petal width is less than or equal to 0.6 cm, the iris is setosa.

    If petal width is > 0.6 cm, and also > 1.7 cm, then the iris is virginica. Of course, if petal width is > 1.7 cm, it will also be > 0.6 cm; that is just how the branches of this particular tree are nested.

    If petal width is > 0.6 cm but <= 1.7 cm, and petal length is <= 4.9 cm, then the iris is versicolor.

  • Slide 18/36

  • Slide 19/36

    [Decision tree graphic] If petal width is > 0.6 cm but <= 1.7 cm, petal length is > 4.9 cm, and petal width is > 1.5 cm, then the iris is versicolor; if instead petal width is <= 1.5 cm, the iris is virginica.

    We see that the tree has 5 leaves, and that the total size of the tree is 9 elements.

    According to our decision tree, we don't need to measure sepal length or sepal width to classify unknown irises. All we need is

    petal length and petal width.
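    Written out as plain if/else rules, the tree's logic looks like this (a sketch built from the thresholds quoted on the preceding slides; Weka generates the tree itself, not this code):

```python
def classify_iris(petal_length, petal_width):
    # Thresholds from the J4.8 tree; measurements in cm.
    if petal_width <= 0.6:
        return "Iris-setosa"
    if petal_width > 1.7:
        return "Iris-virginica"
    # 0.6 < petal width <= 1.7: look at petal length next.
    if petal_length <= 4.9:
        return "Iris-versicolor"
    # Long petals but moderate width: one last check on width.
    if petal_width <= 1.5:
        return "Iris-virginica"
    return "Iris-versicolor"

print(classify_iris(1.4, 0.2))  # Iris-setosa
print(classify_iris(6.0, 2.3))  # Iris-virginica
```

    Five return statements, five leaves, matching the tree summary in the Classifier output.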

  • Slide 20/36

    Scroll around the Classifier output box until you find the Evaluation on test split section. Recall that we split our original dataset of 150 samples, using 66% for training the tree and the remaining 34% for testing it.

    Weka actually used 51 samples for testing the tree, and our tree classified 49 of the 51 test samples correctly.

  • Slide 21/36

    The Classifier output box provides details on how the tree did on the 51 test samples. The Confusion Matrix tells us that there were 15 setosas in the test set, all of which were properly classified as setosa.

    There were 19 versicolor in the test set, and they were all classified correctly, too.

    There were 17 virginica, but 2 were incorrectly classified as versicolor by our decision tree.
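    The slide's accuracy figure can be recomputed directly from the confusion matrix (rows are actual classes, columns are predicted classes):

```python
# Confusion matrix as reported for the 51-sample test split.
confusion = {
    "setosa":     {"setosa": 15, "versicolor": 0,  "virginica": 0},
    "versicolor": {"setosa": 0,  "versicolor": 19, "virginica": 0},
    "virginica":  {"setosa": 0,  "versicolor": 2,  "virginica": 15},
}

# Total test samples, and those on the matrix diagonal (correct predictions).
total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)
accuracy = correct / total

print(total, correct)            # 51 49
print(round(accuracy * 100, 2))  # 96.08
```

    This reproduces the 49-of-51 result, about 96% accuracy on the test split.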

  • Slide 22/36

    Scroll around the Classifier output box until you find the Predictions on test split section. Here we see the results for the 51 samples used to test our tree. Look for the plus sign in the error column.

    We see that sample #16 is one of the two virginica that were classified by the tree as versicolor, as noted in the Confusion Matrix.

    If you scroll down to instance #39, you will see the second mis-classified virginica. Under probability distribution, you will see three columns, one for each iris species.

    These show the probability that a particular sample belongs to each class. The asterisk marks the species with the highest probability, which was therefore selected as the predicted class.
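    Those per-class probabilities come from the composition of the leaf an instance lands in. A minimal sketch using relative frequencies (the counts below are hypothetical, not taken from the tutorial's output):

```python
def leaf_distribution(counts):
    # Turn a leaf's training-class counts into a probability distribution.
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical leaf: 47 versicolor and 1 stray virginica reached it.
dist = leaf_distribution({"setosa": 0, "versicolor": 47, "virginica": 1})

# The class with the highest probability gets the asterisk and the prediction.
predicted = max(dist, key=dist.get)
print(predicted)  # versicolor
```

    Any test instance routed to such a leaf is predicted versicolor, but with less than full confidence.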

  • Slide 23/36

    We can find more interesting information on how our tree did on the test samples.

    On the Classify tab, right-click the Result list entry and choose Visualize classifier errors.

  • Slide 24/36

    Recalling that our tree used only petal length and petal width to classify samples, select Petal Length and Petal Width as the axes for the Weka Classifier Visualize window.

    Note that the X's represent properly classified test samples, and squares show incorrectly classified samples.

    We can also see the two virginica samples in the test set of 51 that are incorrectly classified by the tree as versicolor. Here, it is clear why the tree classified these samples incorrectly: they fall into the versicolor group when petal length and petal width are used to predict sample classes.

  • Slide 25/36

    Left-click a data point on the plot to call up information for that point. Clicking the right-most box on the plot, we see data for instance 39 in the test dataset.

    As you can see, this is one of the virginicas that the tree incorrectly predicted to be versicolor.

  • Slide 26/36

    Returning to our decision tree, note the number 50.0 in this setosa leaf. This says that 50 of the original 150 samples reached this leaf in the tree. Ignore the decimal place in this number; it is just an artifact of how Weka works. There are 50 setosa in the original dataset, so this leaf was completely successful.

    Next, we see that 46 samples reached this virginica leaf. 45 were, in fact, virginica, but 1 of the samples was not.

    All three of the samples in the small virginica leaf are correctly placed. Added to the 45 above, this gives us 48 out of 50 virginica.

    Here are the leaves that samples #16 and #39 ended up in. These are the two virginica in the test dataset that were mis-classified as versicolor.

  • Slide 27/36

    Let's see if we can find out what this mis-classified sample is. Left-click on this leaf of the tree.

  • Slide 28/36

    This reveals a plot of all the instances in the virginica leaf we clicked. Again, choose petal length and petal width as the axes.

    Note that the green X's are virginica, and the red X's are versicolor. Now we can see the one sample that was mis-classified as a virginica and is actually a versicolor.

    You may need to slide the Jitter bar over to uncover the mis-classified sample. Left-click on it to find out more.

  • Slide 29/36

    How good did our tree do relative to other classifiers? Why did our tree get this one wrong?

    Here is a summary that I put together from the output of our decision tree for the combined training data and test data.

    This is similar to the Confusion Matrix, except that it shows all of the samples in the original dataset, not just the test samples.

    Why doesn't Weka provide such a summary? The answer is that it doesn't really matter how well a decision tree classifies samples in the training dataset.

    What really matters is how well it does on the test set. We'll explain why shortly.

    Two questions arise at this point: Why does our tree mis-classify some samples? How good is our tree compared to other data mining algorithms?

  • Slide 30/36

    Why did our tree classify some samples incorrectly?

    Attribute measurement errors, e.g., inaccurate measurements on petals and sepals.

    Sample class identification errors, e.g., some setosas labeled as versicolor.

    Outlier samples, e.g., some drought-stunted flower samples.

    Atypical sample set, e.g., training samples of a sun-loving flower collected from deep shade.

    Inappropriate classification algorithm, e.g., the J48 decision tree learner doesn't work for irises!

  • Slide 31/36

  • Slide 32/36

    [Slide graphic] Training samples are used to build the decision tree; test samples and unknown samples are run through the finished tree. Two scores result: % correct on the training samples, and % correct on the test samples.

  • Slide 33/36

    A wide variety of data mining tools is available.

    It is becoming easier and easier to find models to describe data.

    It is becoming easier and easier to overfit models!

    How well a model describes training data is becoming less important.
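    An extreme toy illustration of the overfitting point above (pure Python, hypothetical data): a model that simply memorizes its training samples describes them perfectly yet generalizes to nothing.

```python
# Training data memorized verbatim: (petal length, petal width) -> class.
train = {
    (1.4, 0.2): "setosa",
    (4.5, 1.4): "versicolor",
    (6.0, 2.3): "virginica",
}

def memorizer(petals):
    # 100% accurate on the training samples, useless on anything unseen.
    return train.get(petals, "unknown")

print(memorizer((1.4, 0.2)))  # setosa: seen during training
print(memorizer((1.5, 0.3)))  # unknown: the model learned nothing general
```

    A held-out test set is what exposes this failure, which is why test-set accuracy is the number that matters.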

  • Slide 34/36

    This tutorial was prepared with the assistance of: Prof. Andrzej Kochanski and Prof. Marcin Perzyk, Warsaw University of Technology.

  • Slide 35/36

    It turns out that botanist Edgar Anderson, who compiled the data for Fisher's iris dataset, had an assistant named Igor*. Igor also compiled an iris dataset, Igor's iris dataset, linked below.

    For this assignment, you will:

    1. Download the IgorsIrisDataset.xls Excel file.
    2. Convert it to .csv format.
    3. Open the file in the Weka Explorer Preprocessor.
    4. Remove the Instance column.
    5. Complete the spreadsheet shown here for Igor's iris dataset.

    Cut and paste the spreadsheet into your homework assignment.


    http://www.rorylewis.com/docs/03_uccs/2010_06_Fall_CS5450/05_CS54501_extra_2010.htm
  • Slide 36/36