Feature Subset Selection using Minimum Cost Spanning Trees
Mike Farah - 18548059
Supervisor: Dr. Sid Ray
Outline
• Introduction
  Pattern Recognition
  Feature subset selection
• Current methods
• Proposed method
• IFS
• Results
• Conclusion
Introduction: Pattern Recognition
• The classification of objects into groups by learning from a small sample of objects
• Example: apples and strawberries
  Classes: apples and strawberries
  Features: colour, size, weight, texture
• Applications: character recognition, voice recognition, oil mining, weather prediction, …
Introduction: Pattern Recognition
• Pattern representation
  Measuring and recording features: size, colour, weight, texture, …
• Feature set reduction
  Reducing the number of features used, either by selecting a subset or by transformations
• Classification
  The resulting features are used for classification of unknown objects
Introduction: Feature subset selection
• Can be split into two processes:
  Feature subset searching
    Not usually feasible to exhaustively try all feature subset combinations (see the search sketch below)
  Criterion function
    The main issue of feature subset selection (Jain et al. 2000)
    The focus of our research
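To make the two processes concrete, here is a minimal sketch (not from the original slides) of a wrapper-style exhaustive search; `criterion` stands in for any criterion function J, and the function name is hypothetical:

```python
from itertools import combinations

import numpy as np

def exhaustive_search(X, y, criterion, subset_size):
    """Score every feature subset of the given size with a criterion
    function J and keep the best. Exhaustive search is only feasible
    for small feature counts: the subset count grows combinatorially."""
    n_features = X.shape[1]
    best_subset, best_score = None, -np.inf
    for subset in combinations(range(n_features), subset_size):
        score = criterion(X[:, list(subset)], y)  # evaluate J on this subset
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```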
Current methods
• Euclidean distance: statistical properties of the classes are not considered

  $J(x) = \min_{i \neq j}\{(\mu_i - \mu_j)'(\mu_i - \mu_j)\}$

• Mahalanobis distance: variances and covariances of the classes are taken into account

  $J(x) = \min_{i \neq j}\{(\mu_i - \mu_j)'\Sigma_{ij}^{-1}(\mu_i - \mu_j)\}$
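As an illustration, a minimal NumPy sketch of both criteria follows. It is not from the slides, and it assumes $\Sigma_{ij}$ is the pooled covariance of classes i and j (the slides do not define it) and that each class has at least two samples:

```python
import numpy as np

def euclidean_criterion(X, y):
    """J(x): minimum squared Euclidean distance between class means.
    The spread of the classes is not considered."""
    means = [X[y == c].mean(axis=0) for c in np.unique(y)]
    return min(float((mi - mj) @ (mi - mj))
               for i, mi in enumerate(means) for mj in means[i + 1:])

def mahalanobis_criterion(X, y):
    """J(x): minimum Mahalanobis distance between class means, with the
    pooled covariance of each class pair (one common convention)."""
    classes = np.unique(y)
    best = np.inf
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            Xi, Xj = X[y == classes[a]], X[y == classes[b]]
            d = Xi.mean(axis=0) - Xj.mean(axis=0)
            cov = ((len(Xi) - 1) * np.cov(Xi, rowvar=False)
                   + (len(Xj) - 1) * np.cov(Xj, rowvar=False)) / (len(Xi) + len(Xj) - 2)
            cov = np.atleast_2d(cov)  # handle the single-feature case
            best = min(best, float(d @ np.linalg.solve(cov, np.atleast_1d(d))))
    return best
```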
Limitations of Current Methods
Friedman and Rafsky’s two sample test
• Minimum spanning tree approach for determining whether two sets of data originate from the same source
• An MST is built across the pooled data from the two sources, and edges which connect samples from different data sets are removed
• If many edges are removed, the two sets of data are likely to originate from the same source, since their samples are interleaved
Friedman and Rafsky’s two sample test
• The method can be used as a criterion function:
  An MST is built across the sample points
  Edges which connect samples of different classes are removed
  A good subset is one that provides discriminatory information about the classes; therefore, the fewer edges removed, the better (see the sketch below)
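A minimal SciPy sketch of this criterion (our own illustration; the slides name no implementation). It builds the MST over all samples, counts the edges that would be removed, and returns the fraction of surviving within-class edges, so higher is better; `y` is assumed to be a NumPy array of class labels:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

def friedman_rafsky_criterion(X, y):
    """Fraction of MST edges that connect samples of the same class.
    Cross-class edges are the ones Friedman and Rafsky remove, so a
    higher fraction of surviving edges means better separated classes."""
    dists = distance_matrix(X, X)               # pairwise Euclidean distances
    mst = minimum_spanning_tree(dists).tocoo()  # the N-1 MST edges
    same = np.sum(y[mst.row] == y[mst.col])     # within-class edges kept
    return same / (len(X) - 1)
```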
Limitations of Friedman and Rafsky’s technique
Our Proposed Method
• Use both the number of edges and the edge lengths to determine the suitability of a subset
• A good subset will have a large number of short edges connecting samples of the same class, and a small number of long edges connecting samples of different classes
Our Proposed Method
• We experimented with the average edge length and with a weighted average; the weighted average was expected to perform better. The weighted form is given below, followed by a code sketch.
$$J(x) = 0.5\left(1 - \frac{betweenNum}{betweenNum + withinNum} + \frac{\mathrm{Avg}(betweenEdges)}{\mathrm{Avg}(betweenEdges) + \mathrm{Avg}(withinEdges)}\right)$$
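A direct transcription of the weighted form above into a SciPy sketch (our own hypothetical code; it assumes the MST contains at least one within-class and one between-class edge, otherwise the averages are undefined):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix

def mst_edge_criterion(X, y):
    """Combine the proportion of within-class MST edges with the relative
    length of between-class edges; the 0.5 factor scales J into [0, 1]."""
    mst = minimum_spanning_tree(distance_matrix(X, X)).tocoo()
    cross = y[mst.row] != y[mst.col]
    between, within = mst.data[cross], mst.data[~cross]  # edge lengths
    return 0.5 * (1 - len(between) / (len(between) + len(within))
                  + between.mean() / (between.mean() + within.mean()))
```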
IFS - Interactive Feature Selector
• Developed to allow users to experiment with various feature selection methods
• Automates the execution of experiments
• Allows visualisation of data sets and results
• Extensible: developers can easily add criterion functions, feature selectors, and classifiers to the system
![Page 15: Feature Subset Selection using Minimum Cost Spanning Trees Mike Farah - 18548059 Supervisor: Dr. Sid Ray](https://reader036.vdocument.in/reader036/viewer/2022062516/56649d635503460f94a45701/html5/thumbnails/15.jpg)
IFS - Screenshot
Experimental Framework
| Data set       | No. samples | No. features | No. classes |
|----------------|-------------|--------------|-------------|
| Iris           | 150         | 4            | 3           |
| Crab           | 200         | 7            | 2           |
| Forensic Glass | 214         | 9            | 7           |
| Diabetes       | 332         | 8            | 2           |
| Character      | 757         | 20           | 8           |
| Synthetic      | 750         | 7            | 5           |
Experimental Framework
• Spearman’s rank correlation
  A good criterion function will correlate well with the classifier: subsets it ranks highly should achieve high accuracy (see the example below)
• Subset chosen
  The final subsets selected by the criterion functions are compared to the optimal subset chosen by the classifier
• Time
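For example, the correlation step might look like this with `scipy.stats.spearmanr`; the scores below are purely illustrative, not results from the study:

```python
from scipy.stats import spearmanr

# Illustrative numbers only: one criterion score and one classifier
# accuracy per candidate subset, over the same four subsets.
criterion_scores = [0.91, 0.84, 0.78, 0.66]
classifier_accuracy = [0.95, 0.88, 0.86, 0.71]

rho, p_value = spearmanr(criterion_scores, classifier_accuracy)
print(f"Spearman rank correlation: {rho:.2f}")  # near 1.0: criterion ranks like the classifier
```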
Forensic glass data set results
Synthetic data set
Algorithm completion times
Algorithm complexities
• K-NN: $O(N^2(F + \log N + K))$
• MST criterion functions: $O(N^2 F)$
• Mahalanobis distance: $O(C^2 F^2(N + F))$
• Euclidean distance: $O(C^2 F)$

(N = number of samples, F = number of features, C = number of classes, K = number of neighbours)
Conclusion
• MST-based approaches generally achieved higher accuracy and rank correlation, in particular with the K-NN classifier
• The criterion function based on Friedman and Rafsky’s two sample test performed the best
Conclusion
• MST approaches are closely related to the K-NN classifier
• The Mahalanobis criterion function is suited to data sets with Gaussian distributions and strong feature interdependence
• Future work:
  Construct a classifier based on K-NN which gives closer neighbours higher priority (a possible sketch follows)
  Improve IFS
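One plausible reading of that future-work item, as a minimal distance-weighted K-NN sketch (our own illustration, not the slides' design):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=5):
    """K-NN variant in which each of the k nearest neighbours votes with
    weight 1/distance, so closer neighbours get higher priority."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-12)  # guard against zero distance
    labels = y_train[nearest]
    classes = np.unique(labels)
    votes = [weights[labels == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]
```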