CSC 196k Semester Project: Instance-Based Learning
Weka Assignment 2
Glynis Hawley
Agenda
• Background: Instance-Based Learning
• Project
  – Requirements
  – Data
  – Progress
• Conclusions
• References
Background: Instance-Based Learning
• Learning/classification based on information stored in a “set” of examples
  – No rules or decision trees
• “New” instance classified based on its similarity to one (or more) stored example(s)
  – e.g., nearest neighbor
IBL Algorithm Research by David W. Aha
• Two papers helpful in understanding this assignment
  – “Instance-Based Learning Algorithms”
    • David W. Aha, Dennis Kibler, Marc K. Albert
    • 1991
  – “Tolerating Noisy, Irrelevant and Novel Attributes in Instance-Based Learning Algorithms”
    • David W. Aha
    • 1992
• Three algorithms: IB1, IB2, IB3
IB1: Instance-Based Learner version 1
• Similar to the nearest-neighbor algorithm
• Differences:
  – Normalizes all attributes to the range [0,1]
  – Handles missing attributes
• Training: stores all instances from the training set
• Classification: searches all stored instances for the nearest neighbor
• High computational and storage expense
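As a rough illustration of the two steps above, here is a sketch in Python rather than Weka’s Java; the function names (`normalize`, `ib1_classify`) and the list-of-pairs data layout are my own, not Weka’s API:

```python
import math

def normalize(instances):
    """Scale each numeric attribute into [0, 1], as IB1 does before comparing."""
    n_attrs = len(instances[0][0])
    lo = [min(x[a] for x, _ in instances) for a in range(n_attrs)]
    hi = [max(x[a] for x, _ in instances) for a in range(n_attrs)]

    def scale(x):
        # Attributes with a constant value collapse to 0.0.
        return [(x[a] - lo[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0
                for a in range(n_attrs)]

    return [(scale(x), y) for x, y in instances], scale

def ib1_classify(stored, scale, x):
    """Return the class of the nearest stored instance (Euclidean distance)."""
    xs = scale(x)
    _, label = min(stored, key=lambda inst: math.dist(inst[0], xs))
    return label
```

Because every training instance is stored and every classification scans them all, both storage and classification time grow linearly with the training set, which is the expense noted above.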
IB2: Instance-Based Learner version 2
• Attempts to reduce storage requirements and computational complexity
• Saves only misclassified instances
• Algorithm:

    stored instances = {}
    for each instance in the training set:
        tentatively classify the instance based on the nearest stored instance
        if classification != true class:
            add the instance to the stored set

• Tends to accumulate noisy instances
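The pseudocode above can be sketched directly; this is a minimal Python illustration (the name `ib2_train` and the assumption that attributes are already normalized to [0,1] are mine, not from Weka):

```python
import math

def ib2_train(training_set):
    """IB2: keep only the instances that the current stored set misclassifies.

    training_set: list of (attribute_vector, label) pairs, assumed already
    normalized to [0, 1].
    """
    stored = []
    for x, y in training_set:
        if stored:
            _, guess = min(stored, key=lambda inst: math.dist(inst[0], x))
        else:
            guess = None  # nothing stored yet, so the first instance is kept
        if guess != y:
            stored.append((x, y))
    return stored
```

Note why noisy instances accumulate: a mislabeled instance is, by definition, likely to be misclassified by its neighbors, so IB2 stores it and it then misleads later classifications.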
IB3: Instance-Based Learner version 3
• Tracks the performance of each exemplar
  – Uses only those that are “good enough”
    • Performance exceeds some upper threshold
  – Discards those that are “not good enough”
    • Performance falls below some lower threshold
  – Exemplars “in between” are kept while their records develop, but are not used for classification
• Performance statistics are updated whenever an exemplar is the nearest neighbor to a “new” instance
• Performance and storage better than IB1 and IB2
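The three-way decision above can be sketched as follows. The function name and the fixed accuracy thresholds are my simplification: Aha’s actual IB3 compares confidence intervals on each exemplar’s classification accuracy against the observed frequency of its class.

```python
def ib3_status(correct, tried, upper=0.75, lower=0.50):
    """Sort an exemplar's performance record into the three IB3 categories.

    Simplified sketch: fixed accuracy thresholds stand in for Aha's
    confidence-interval test.
    """
    accuracy = correct / tried if tried else 0.0
    if tried and accuracy >= upper:
        return 'acceptable'    # used for classification
    if tried and accuracy < lower:
        return 'discard'       # removed from the stored set
    return 'undetermined'      # kept, but not yet used to classify
```

Discarding exemplars with poor records is what lets IB3 shed the noisy instances that IB2 accumulates.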
Aha’s Results
[Figure: Aha’s degradation of accuracy with noise. % accuracy (roughly 50–100) vs. % noise (0–50) for IB1, IB2, and IB3. Results are averaged over 50 trials. [1:274], [2:57]]
The Weka IBL Project
• Implement IB2 and IB3
• Compare their performance with that of IB1 and C4.5 (the Weka version is called J48)
• Data
  – Iris data: for initial testing of IB2
  – LED data
  – Glass data
LED Dataset
• Synthetic dataset created with led-creator.c [3]
• 8 attributes
  – 7 segments of the display: 0 or 1
  – Class: digits 0 through 9
• Input
  – Number of instances to be created
  – Seed
  – % noise per attribute
    • 10% noise means each bit has a 10% chance of being flipped
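The per-attribute noise model described above can be illustrated in a few lines; this is an illustration of the idea only, not the actual code in led-creator.c, and `add_attribute_noise` is an invented name:

```python
import random

def add_attribute_noise(segments, noise_pct, rng=random):
    """Flip each display-segment bit independently with probability
    noise_pct / 100, mirroring the per-attribute noise description."""
    return [1 - bit if rng.random() < noise_pct / 100.0 else bit
            for bit in segments]
```

With 10% noise, each of the 7 segment bits of a digit is flipped independently with probability 0.1, so even a correct class label can sit on a corrupted display pattern.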
Glass Identification Dataset
• 214 instances
  – 163 window glass (building windows and vehicle windows)
    • 87 float processed
      – 70 building windows
      – 17 vehicle windows
    • 76 non-float processed
      – 76 building windows
      – 0 vehicle windows
  – 51 non-window glass
    • 13 containers
    • 9 tableware
    • 29 headlamps
Progress Report - Accomplished
• Implemented IB2
  – Modification of IB1 class methods
    • buildClassifier( )
    • updateClassifier( )
  – Preliminary testing with iris data
• Compared accuracy of IB1, IB2, and C4.5 on LED data
  – 10 sets of 700 instances each, with 10% noise
    • Training set = first 200 instances of each set
    • Testing set = last 500 instances of each set
[Figure: Accuracy comparison of IB1, IB2, and C4.5 for 10 separate trials; % accuracy vs. trial number. (Note: the connecting line is provided as a visual aid only; it has no significance.)]
  • C4.5 (J48): avg = 67.3, stdev = 2.6
  • IB1: avg = 61.0, stdev = 6.3
  • IB2: avg = 56.0, stdev = 9.9
Compare David Aha’s results [2:52] (over 50 trials): IB1: 70.5 ± 0.4%, IB2: 62.5 ± 0.6%
Hawley: % Accuracy vs. % Noise
[Figure: % accuracy (0–100) vs. % noise (0–50) for IB1, IB2, and C4.5 (J48).]
Progress Report - To Do
• Implement IB3
  – More involved than IB2
  – Even more difficult when you don’t know Java
• Test accuracy of IB3 on LED data to compare with that of IB1, IB2, and C4.5
• Test accuracy of IB1, IB2, IB3, and C4.5 on the glass data
Conclusions
• Thus far, comparisons of IB1 and IB2 are similar to David Aha’s results.
• Weka assignments (except perhaps #1)
  – Are somewhat vague.
  – Require some research to determine what the actual project requirements should be.
  – Are valuable in building an understanding of the algorithms and their design.
References
[1] Aha, David W. 1992. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies 36(2):267-287.
[2] Aha, David W., Dennis Kibler and Marc Albert. 1991. Instance-based learning algorithms. Machine Learning 6:37-66.
[3] http://ftp.ics.uci.edu/pub/machine-learning-databases/led-display-creator/led-creator.c