CSC 196k Semester Project: Instance-Based Learning
Weka Assignment 2
Glynis Hawley
Agenda
• Background: Instance-Based Learning
• Project
  – Requirements
  – Data
  – Progress
• Conclusions
• References
Background: Instance-Based Learning
• Learning/classification based on information stored in a “set” of examples
  – No rules or decision trees
• “New” instance classified based on its similarity to one (or more) stored example(s)
  – e.g., nearest neighbor
IBL Algorithm Research by David W. Aha
• Two papers helpful in understanding this assignment
  – “Instance-Based Learning Algorithms”
    • David W. Aha, Dennis Kibler, Marc K. Albert
    • 1991
  – “Tolerating Noisy, Irrelevant and Novel Attributes in Instance-Based Learning Algorithms”
    • David W. Aha
    • 1992
• Three algorithms: IB1, IB2, IB3
IB1: Instance-Based Learner version 1
• Similar to the nearest-neighbor algorithm
• Differences:
  – Normalizes all attributes to the range [0,1]
  – Handles missing attributes
• Training: stores all instances from the training set
• Classification: searches all stored instances for the nearest neighbor
• High computational and storage expense
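As a rough illustration of the two steps above, here is a sketch in Python rather than Weka’s Java; the function names (`normalize`, `ib1_classify`) and the list-of-pairs data layout are my own, not Weka’s API:

```python
import math

def normalize(instances):
    """Scale each numeric attribute into [0, 1], as IB1 does before comparing."""
    n_attrs = len(instances[0][0])
    lo = [min(x[a] for x, _ in instances) for a in range(n_attrs)]
    hi = [max(x[a] for x, _ in instances) for a in range(n_attrs)]

    def scale(x):
        # Attributes with a constant value collapse to 0.0.
        return [(x[a] - lo[a]) / (hi[a] - lo[a]) if hi[a] > lo[a] else 0.0
                for a in range(n_attrs)]

    return [(scale(x), y) for x, y in instances], scale

def ib1_classify(stored, scale, x):
    """Return the class of the nearest stored instance (Euclidean distance)."""
    xs = scale(x)
    _, label = min(stored, key=lambda inst: math.dist(inst[0], xs))
    return label
```

Because every training instance is stored and every classification scans them all, both storage and classification time grow linearly with the training set, which is the expense noted above.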
IB2: Instance-Based Learner version 2
• Attempts to reduce storage requirements and computational complexity
• Saves only misclassified instances
• Algorithm:

    stored instances = {}
    for each instance in the training set:
        tentatively classify the instance based on the nearest stored instance
        if classification != true class:
            add the instance to the stored set

• Tends to accumulate noisy instances
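The pseudocode above can be sketched directly; this is a minimal Python illustration (the name `ib2_train` and the assumption that attributes are already normalized to [0,1] are mine, not from Weka):

```python
import math

def ib2_train(training_set):
    """IB2: keep only the instances that the current stored set misclassifies.

    training_set: list of (attribute_vector, label) pairs, assumed already
    normalized to [0, 1].
    """
    stored = []
    for x, y in training_set:
        if stored:
            _, guess = min(stored, key=lambda inst: math.dist(inst[0], x))
        else:
            guess = None  # nothing stored yet, so the first instance is kept
        if guess != y:
            stored.append((x, y))
    return stored
```

Note why noisy instances accumulate: a mislabeled instance is, by definition, likely to be misclassified by its neighbors, so IB2 stores it and it then misleads later classifications.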
IB3: Instance-Based Learner version 3
• Tracks the performance of each exemplar
  – Uses only those that are “good enough”
    • Performance exceeds some upper threshold
  – Discards those that are “not good enough”
    • Performance falls below some lower threshold
  – Exemplars “in between” are kept while their records develop, but are not used for classification
• Performance statistics are updated whenever an exemplar is the nearest neighbor to a “new” instance
• Performance and storage better than IB1 and IB2
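The three-way decision above can be sketched as follows. The function name and the fixed accuracy thresholds are my simplification: Aha’s actual IB3 compares confidence intervals on each exemplar’s classification accuracy against the observed frequency of its class.

```python
def ib3_status(correct, tried, upper=0.75, lower=0.50):
    """Sort an exemplar's performance record into the three IB3 categories.

    Simplified sketch: fixed accuracy thresholds stand in for Aha's
    confidence-interval test.
    """
    accuracy = correct / tried if tried else 0.0
    if tried and accuracy >= upper:
        return 'acceptable'    # used for classification
    if tried and accuracy < lower:
        return 'discard'       # removed from the stored set
    return 'undetermined'      # kept, but not yet used to classify
```

Discarding exemplars with poor records is what lets IB3 shed the noisy instances that IB2 accumulates.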
Aha’s Results
[Figure: Aha’s degradation of accuracy with noise. % accuracy (roughly 50–100) vs. % noise (0–50) for IB1, IB2, and IB3. Results are averaged over 50 trials. [1:274], [2:57]]
The Weka IBL Project
• Implement IB2 and IB3
• Compare their performance with that of IB1 and C4.5 (the Weka version is called J48)
• Data
  – Iris data: for initial testing of IB2
  – LED data
  – Glass data
LED Dataset
• Synthetic dataset created with led-creator.c [3]
• 8 attributes
  – 7 segments of the display: 0 or 1
  – Class: digits 0 through 9
• Input
  – Number of instances to be created
  – Seed
  – % noise per attribute
    • 10% noise means each bit has a 10% chance of being flipped
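The per-attribute noise model described above can be illustrated in a few lines; this is an illustration of the idea only, not the actual code in led-creator.c, and `add_attribute_noise` is an invented name:

```python
import random

def add_attribute_noise(segments, noise_pct, rng=random):
    """Flip each display-segment bit independently with probability
    noise_pct / 100, mirroring the per-attribute noise description."""
    return [1 - bit if rng.random() < noise_pct / 100.0 else bit
            for bit in segments]
```

With 10% noise, each of the 7 segment bits of a digit is flipped independently with probability 0.1, so even a correct class label can sit on a corrupted display pattern.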
Glass Identification Dataset
• 214 instances
  – 163 window glass (building windows and vehicle windows)
    • 87 float processed
      – 70 building windows
      – 17 vehicle windows
    • 76 non-float processed
      – 76 building windows
      – 0 vehicle windows
  – 51 non-window glass
    • 13 containers
    • 9 tableware
    • 29 headlamps
Progress Report - Accomplished
• Implemented IB2
  – Modification of IB1 class methods
    • buildClassifier( )
    • updateClassifier( )
  – Preliminary testing with iris data
• Compared accuracy of IB1, IB2, and C4.5 on LED data
  – 10 sets of 700 instances each, with 10% noise
    • Training set = first 200 instances of each set
    • Testing set = last 500 instances of each set
[Figure: Accuracy comparison of IB1, IB2, and C4.5 for 10 separate trials; % accuracy vs. trial number. (Note: the connecting line is provided as a visual aid only; it has no significance.)]
  • C4.5 (J48): avg = 67.3, stdev = 2.6
  • IB1: avg = 61.0, stdev = 6.3
  • IB2: avg = 56.0, stdev = 9.9
Compare David Aha’s results [2:52] (over 50 trials): IB1: 70.5 ± 0.4%, IB2: 62.5 ± 0.6%
Hawley: % Accuracy vs. % Noise
[Figure: % accuracy (0–100) vs. % noise (0–50) for IB1, IB2, and C4.5 (J48).]
Progress Report - To Do
• Implement IB3
  – More involved than IB2
  – Even more difficult when you don’t know Java
• Test accuracy of IB3 on LED data to compare with that of IB1, IB2, and C4.5
• Test accuracy of IB1, IB2, IB3, and C4.5 on the glass data
Conclusions
• Thus far, comparisons of IB1 and IB2 are similar to David Aha’s results.
• Weka assignments (except perhaps #1)
  – Are somewhat vague.
  – Require some research to determine what the actual project requirements should be.
  – Are valuable in building an understanding of the algorithms and their design.
References
[1] Aha, David W. 1992. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies 36(2):267-287.
[2] Aha, David W., Dennis Kibler and Marc Albert. 1991. Instance-based learning algorithms. Machine Learning 6:37-66.
[3] http://ftp.ics.uci.edu/pub/machine-learning-databases/led-display-creator/led-creator.c