hais09-beyondhomemadeartificialdatasets
Beyond Homemade Artificial Data Sets
Núria Macià
Albert Orriols-Puig
Ester Bernadó-Mansilla
Grup de Recerca en Sistemes Intel·ligents
La Salle – Universitat Ramon Llull
C/ Quatre Camins, 2. 08022, Barcelona (Spain)
{nmacia,aorriols,esterb}@salle.url.edu
Motivation
Maturity of machine learning
- Several highly competitive learners
- Research continues on developing new methods
  - To improve existing methods at some task
  - To build more robust, general methods
Slide 2 | Grup de Recerca en Sistemes Intel·ligents | Beyond Homemade Artificial Data Sets
Motivation
Multiple Learner Comparisons
- Selection of data sets: Data set 1, Data set 2, Data set 3, …, Data set n
- Selection of state-of-the-art methods: Learner 1, Learner 2, …, Learner m
- Creation of tables of results
- Application of statistical tests
             Learner 1      Learner 2     Learner 3      …  Learner m
Data set 1   63.33 ± 13.29  68.83 ± 8.87  64.40 ± 14.65  …  55.00 ± 13.61
Data set 2   69.30 ± 6.83   84.03 ± 7.30  81.16 ± 5.54   …  69.75 ± 8.19
…
Data set n   33.08 ± 14.09  0.00 ± 0.00   32.40 ± 9.44   …  47.59 ± 11.22
Rank         3.46           3.14          3.08           …  2.80
Pos          4              3             2              …  1
Conclusions depend on the intrinsic difficulties of the selected data sets
Motivation
This led to previous work on
- Metrics to estimate the difficulty of classification problems (Ho & Basu, 2002)
- Linking problem complexity to learner performance (Bernadó-Mansilla et al., 2005, 2006)
Drawback: Limited number of real-world problems
Motivation
The purpose of the present work is to build an automatic method to generate boundedly-difficult problems to
- Better understand which complexities affect different learners
- Empower multiple learner comparisons
Outline
1. ADS generation
2. EMO-made artificial data sets
3. Preliminary experimental results
4. Conclusions & further work
ADS Generation
Why should we generate ADS?
- Real-world problem constraints
  - Limit in the number of real-world problems that we can obtain
  - Diversity of complexities among data sets not ensured
- Previous works on data complexity analysis observed:
  - Gaps in some regions of the complexity space
  - Risk of analyzing over collections of similar data sets
Direction: Work under a controlled scenario with boundedly-difficult problems
ADS Generation
What type of problems would we like to generate?
- Concepts follow a certain data distribution or physical process
- Problems whose complexity can move through several dimensions
How can we generate ADS?
1. Specify the desired difficulty by fixing a value for each of the m metrics
   E.g.: N1 = 0.5, F1 = 0.2
2. Run a physical process or sample a data distribution that generates n unlabeled examples {e1, e2, …, en}
3. Solve the following optimization problem:
   Find a class labeling that generates a data set that has the difficulty fixed in step 1
Goal: To solve a multi-objective optimization problem
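One possible reading of step 3 as a multi-objective problem gives each metric its own objective: the absolute gap between the metric's value under a candidate labeling and the target fixed in step 1. The slides do not state the exact objective formulation, so this is an assumption. The function names are illustrative, and `fraction_of_ones` is a hypothetical stand-in, not one of the complexity metrics.

```python
import random

def sample_unlabeled(n, n_attributes, seed=0):
    """Step 2: draw n unlabeled examples uniformly from the unit hypercube."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_attributes)] for _ in range(n)]

def objectives(labels, examples, metrics, targets):
    """Step 3 (assumed form): one objective per metric, the gap to its target.

    Every component is to be minimized; a labeling that hits all targets
    scores 0.0 on every objective.
    """
    return tuple(abs(metric(examples, labels) - target)
                 for metric, target in zip(metrics, targets))

# Hypothetical stand-in metric, used only to keep the sketch short.
def fraction_of_ones(examples, labels):
    return sum(labels) / len(labels)

examples = sample_unlabeled(n=6, n_attributes=2)
labels = [0, 1, 1, 0, 1, 1]          # one candidate class labeling
gaps = objectives(labels, examples, [fraction_of_ones], [0.5])
```

Under this formulation, real complexity metrics would replace the stand-in in the `metrics` list, with one target per metric.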
Outline
1. ADS generation
2. EMO-made artificial data sets
3. Preliminary experimental results
4. Conclusions & further work
EMO Solution
EMO solution based on NSGA-II
- Meta-information
  - Data dimensionality specified by the user: number of attributes and instances
  - Continuous- or nominal-valued attributes
  - Unlabeled data following any distribution
  - The user specifies the desired problem difficulty
- Problem objectives
  - Each objective corresponds to a selected complexity metric
EMO Solution
Knowledge representation
Individual = class labeling of the instances of the data set
Position:  1      2      3      4      …  n
Class?     {0,1}  {0,1}  {0,1}  {0,1}  …  {0,1}
n = number of instances in the data set
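The representation above can be sketched as a plain list of bits, one per instance. The helper names `random_individual` and `to_data_set` are illustrative and do not come from the slides.

```python
import random

def random_individual(n_instances, rng):
    """One individual: a class label in {0, 1} for each of the n instances."""
    return [rng.randint(0, 1) for _ in range(n_instances)]

def to_data_set(examples, individual):
    """Pair each unlabeled example with the label stored at its position."""
    return [(example, label) for example, label in zip(examples, individual)]

rng = random.Random(42)
examples = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]   # unlabeled instances
individual = random_individual(len(examples), rng)
data_set = to_data_set(examples, individual)
```

Because the attribute values never change, the search space under this encoding is the set of 2^n labelings of the fixed examples.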
EMO Solution
Process organization
- NSGA-II algorithm
  - Fast non-dominated sorting
  - Crowding distance
- Genetic operators
  - S-wise tournament selection
  - Two-point crossover
  - Bit-wise mutation
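The two NSGA-II building blocks named above can be sketched in a few lines. This is a minimal illustration that assumes every objective is minimized. It is not the authors' implementation, and the sample objective vectors are made up for the example.

```python
def dominates(a, b):
    """a dominates b: no worse on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """Split solution indices into fronts; front 0 is the non-dominated set."""
    n = len(objs)
    dominated_set = [[] for _ in range(n)]   # solutions that i dominates
    counts = [0] * n                         # how many solutions dominate i
    for i in range(n):
        for j in range(n):
            if dominates(objs[i], objs[j]):
                dominated_set[i].append(j)
            elif dominates(objs[j], objs[i]):
                counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        next_front = []
        for i in fronts[-1]:
            for j in dominated_set[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
    return fronts[:-1]                       # drop the trailing empty front

def crowding_distance(objs, front):
    """Larger distance means a less crowded solution; extremes get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (hi - lo)
    return dist

# Made-up objective vectors for five candidate labelings.
objs = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1), (0.6, 0.6), (0.9, 0.9)]
fronts = fast_non_dominated_sort(objs)
dist = crowding_distance(objs, fronts[0])
```

Here the first three vectors are mutually non-dominated and form the first front, while the remaining two fall into later fronts.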
Outline
1. ADS generation
2. EMO-made artificial data sets
3. Preliminary results
4. Conclusions & further work
Methodology
Departing from two types of data sets:
- Data set with randomly distributed examples
- Real-world data set without labels
EMO configuration
- 400 individuals
- 50 generations
- Probability of crossover = 0.85
- Probability of mutation = 1/n
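A minimal sketch of how this configuration could drive the genetic operators on bit-string individuals. It assumes n is the number of instances, which is also the length of an individual, so the 1/n mutation rate flips one bit per individual on average. The tournament size `s`, the `key` argument, and all function names are assumptions, since the slides give only the four values listed above.

```python
import random

POPULATION_SIZE = 400
GENERATIONS = 50
P_CROSSOVER = 0.85

def tournament_select(population, key, s, rng):
    """s-wise tournament: sample s individuals and keep the best (lowest key).

    In NSGA-II the key would typically combine front rank and crowding
    distance; here it is any caller-supplied function (an assumption).
    """
    return min(rng.sample(population, s), key=key)

def two_point_crossover(a, b, rng):
    """Swap the segment between two distinct cut points of the parents."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def bitwise_mutation(individual, rng):
    """Flip each label independently with probability 1/n."""
    p = 1.0 / len(individual)
    return [1 - g if rng.random() < p else g for g in individual]

def make_offspring(p1, p2, rng):
    """Cross over with probability P_CROSSOVER, then mutate both children."""
    if rng.random() < P_CROSSOVER:
        c1, c2 = two_point_crossover(p1, p2, rng)
    else:
        c1, c2 = p1[:], p2[:]
    return bitwise_mutation(c1, rng), bitwise_mutation(c2, rng)

rng = random.Random(1)
zeros, ones = [0] * 10, [1] * 10
x1, x2 = two_point_crossover(zeros, ones, rng)
child1, child2 = make_offspring(zeros, ones, rng)
```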
Objectives to Optimize
- Fisher discriminant ratio
  - Measures to what extent individual features contribute to class discrimination
- Fraction of points on the class boundary
  - Build a minimum spanning tree (MST) connecting all the points regardless of class
  - Count the number of edges joining opposite classes
  - Estimates the length of the class boundary
- Ratio of average intra/inter class distance
  - Compute the Euclidean distance from each point to the nearest neighbor of the same class and to the nearest neighbor of another class
  - Return the ratio
  - Compares the within-class spread to the size of the gap between classes
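The three metrics above (F1, N1 and N2 in Ho & Basu's naming) can be sketched for two classes labeled 0 and 1. The sketch makes several assumptions: population variances in F1, a Prim-style MST over Euclidean distances, and N1 computed as the share of points touching a cross-class MST edge, which follows the metric's name. Each class must contain at least two points.

```python
import math

def fisher_ratio(X, y):
    """F1: largest per-feature value of (mean0 - mean1)^2 / (var0 + var1)."""
    best = 0.0
    for f in range(len(X[0])):
        a = [x[f] for x, c in zip(X, y) if c == 0]
        b = [x[f] for x, c in zip(X, y) if c == 1]
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / len(a)
        vb = sum((v - mb) ** 2 for v in b) / len(b)
        gap = (ma - mb) ** 2
        if va + vb > 0:
            ratio = gap / (va + vb)
        else:
            ratio = math.inf if gap > 0 else 0.0
        best = max(best, ratio)
    return best

def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph; returns n - 1 edges."""
    n = len(X)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v], parent[v] = d, u
    return edges

def boundary_fraction(X, y):
    """N1: share of points touching an MST edge that joins opposite classes."""
    on_boundary = set()
    for u, v in mst_edges(X):
        if y[u] != y[v]:
            on_boundary.update((u, v))
    return len(on_boundary) / len(X)

def intra_inter_ratio(X, y):
    """N2: summed nearest same-class distance over summed nearest other-class distance."""
    intra = inter = 0.0
    for i, xi in enumerate(X):
        intra += min(math.dist(xi, xj) for j, xj in enumerate(X)
                     if j != i and y[j] == y[i])
        inter += min(math.dist(xi, xj) for j, xj in enumerate(X) if y[j] != y[i])
    return intra / inter

X = [[0, 0], [1, 0], [4, 0], [5, 0]]   # four points on a line
y = [0, 0, 1, 1]                       # two well-separated classes
f1, n1, n2 = fisher_ratio(X, y), boundary_fraction(X, y), intra_inter_ratio(X, y)
```

For this toy labeling the classes are well separated, so F1 is large and N2 is small. Relabeling the same points as `[0, 1, 0, 1]` would push N1 and N2 up, which is the kind of movement through the complexity space that the labeling search exploits.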
Results
Figure panels: Random distribution | Iris problem
- Problems with bounded complexity generated
- Obtaining data sets in regions of the complexity space that were not covered by the analysis over real-world problems
Outline
1. ADS generation
2. EMO-made artificial data sets
3. Preliminary experimental results
4. Conclusions & further work
Conclusions
- Highlighted the need for the systematic creation of ADS
- Design of an EMO-based system that
  - Departs from a set of unlabeled examples
  - Assigns a label to each example
  - Optimizes several complexity metrics
  to automatically generate ADS of bounded difficulty
- Experiments show that:
  - A Pareto set of data sets with different complexities can be obtained
  - Blind regions of the complexity space can be covered
Further Work
- The limitation not mentioned so far
  - Free class labeling may result in data sets whose structure may not be feasible in nature
- Future work to tackle this point
  - Introduce constraints to the problem
    - Some instances cannot change their class
  - Enable instance selection