hais09-beyondhomemadeartificialdatasets

Beyond Homemade Artificial Data Sets
Núria Macià, Albert Orriols-Puig, Ester Bernadó-Mansilla
Grup de Recerca en Sistemes Intel·ligents, La Salle – Universitat Ramon Llull
C/ Quatre Camins, 2. 08022, Barcelona (Spain)
{nmacia,aorriols,esterb}@salle.url.edu

Upload: albert-orriols-puig

Post on 24-Jan-2015


TRANSCRIPT

Page 1: HAIS09-BeyondHomemadeArtificialDatasets

Beyond Homemade Artificial Data Sets

Núria Macià, Albert Orriols-Puig, Ester Bernadó-Mansilla

Grup de Recerca en Sistemes Intel·ligents
La Salle – Universitat Ramon Llull
C/ Quatre Camins, 2. 08022, Barcelona (Spain)
{nmacia,aorriols,esterb}@salle.url.edu

Page 2: HAIS09-BeyondHomemadeArtificialDatasets

Motivation

Maturity of machine learning
- Several highly competitive learners

Research continues on developing new methods
- To improve existing methods at some task
- To build more robust, general methods

Page 3: HAIS09-BeyondHomemadeArtificialDatasets

Motivation

Multiple Learner Comparisons
- Selection of data sets: Data set 1, Data set 2, Data set 3, …, Data set n
- Selection of state-of-the-art methods: Learner 1, Learner 2, …, Learner m
- Creation of tables of results
- Application of statistical tests

Conclusions depend on the intrinsic difficulties of the selected data sets

              Learner 1       Learner 2      Learner 3       …   Learner m
Data set 1    63.33 ± 13.29   68.83 ± 8.87   64.40 ± 14.65   …   55.00 ± 13.61
Data set 2    69.30 ± 6.83    84.03 ± 7.30   81.16 ± 5.54    …   69.75 ± 8.19
…
Data set m    33.08 ± 14.09   0.00 ± 0.00    32.40 ± 9.44    …   47.59 ± 11.22
Rank          3.46            3.14           3.08            …   2.80
Pos           4               3              2               …   1
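The Rank row of a results table like this one is typically an average of per-data-set ranks, which is also the usual input to rank-based statistical tests such as the Friedman test. The slides do not say which test was applied, so the following is only a minimal sketch of the ranking step. The function name `average_ranks` is introduced here for illustration, and only the three rows visible on the slide are used, so the output is not expected to reproduce the slide's Rank row.

```python
def average_ranks(scores):
    """Rank learners within each data set (1 = highest score), then average.

    scores: list of rows, one row per data set, one column per learner.
    Tied scores share the mean of the positions they occupy.
    """
    n_learners = len(scores[0])
    totals = [0.0] * n_learners
    for row in scores:
        # Learner indices ordered from best (highest) to worst score.
        order = sorted(range(n_learners), key=lambda j: -row[j])
        i = 0
        while i < n_learners:
            k = i
            while k + 1 < n_learners and row[order[k + 1]] == row[order[i]]:
                k += 1
            shared = (i + k) / 2 + 1  # mean of positions i+1 .. k+1
            for j in order[i:k + 1]:
                totals[j] += shared
            i = k + 1
    return [t / len(scores) for t in totals]


# Only the rows visible on the slide; the elided rows are not reconstructed.
visible_rows = [
    [63.33, 68.83, 64.40, 55.00],
    [69.30, 84.03, 81.16, 69.75],
    [33.08, 0.00, 32.40, 47.59],
]
ranks = average_ranks(visible_rows)
```

A lower average rank means a better learner, which matches the slide's convention of giving position 1 to the lowest Rank value.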

Page 4: HAIS09-BeyondHomemadeArtificialDatasets

Motivation

This led to previous work on
- Metrics to estimate the difficulty of classification problems (Ho & Basu, 2002)
- Linking problem complexity to learner performance (Bernadó-Mansilla et al., 2005, 2006)

Drawback: limited number of real-world problems

Page 5: HAIS09-BeyondHomemadeArtificialDatasets

Motivation

The purpose of the present work is to build an automatic method to generate boundedly-difficult problems to
- Better understand which complexities affect different learners
- Empower multiple learner comparisons

Page 6: HAIS09-BeyondHomemadeArtificialDatasets

Outline

1. ADS generation

2. EMO-made artificial data sets

3. Preliminary experimental results

4. Conclusions & further work

Page 7: HAIS09-BeyondHomemadeArtificialDatasets

ADS Generation

Why should we generate artificial data sets (ADS)?

Real-world problem constraints
- Limit in the number of real-world problems that we can obtain
- Diversity of complexities among data sets not ensured

Previous works on data complexity analysis observed:
- Gaps in some regions of the complexity space
- Risk of analyzing over collections of similar data sets

Direction: work under a controlled scenario with boundedly-difficult problems

Page 8: HAIS09-BeyondHomemadeArtificialDatasets

ADS Generation

What type of problems would we like to generate?
- Concepts follow a certain data distribution or physical process
- Problems whose complexity can move through several dimensions

How can we generate ADS?
1. Specify the desired difficulty by fixing a value for each of the m metrics
   E.g.: N1 = 0.5, F1 = 0.2
2. Run a physical process or sample a data distribution that generates n unlabeled examples {e1, e2, …, en}
3. Solve the following optimization problem:
   Find a class labeling that generates a data set that has the difficulty fixed in step 1

Goal: to solve a multi-objective optimization problem
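One way to write step 3 formally is sketched below. It assumes that each objective is the absolute gap between a complexity metric and its target value; the slides do not state the exact form of the objectives. The symbols $\mathbf{y}$ (the class labeling), $c_k$ (the k-th complexity metric), and $t_k$ (its target from step 1) are notation introduced here, not taken from the slides.

```latex
\min_{\mathbf{y} \in \{0,1\}^{n}}
\Bigl( \lvert c_1(E, \mathbf{y}) - t_1 \rvert,\; \dots,\;
       \lvert c_m(E, \mathbf{y}) - t_m \rvert \Bigr),
\qquad E = \{e_1, e_2, \dots, e_n\}
```

With the example from step 1, this would give $m = 2$, $c_1 = \mathrm{N1}$ with $t_1 = 0.5$, and $c_2 = \mathrm{F1}$ with $t_2 = 0.2$.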

Page 9: HAIS09-BeyondHomemadeArtificialDatasets

Outline

1. ADS generation

2. EMO-made artificial data sets

3. Preliminary experimental results

4. Conclusions & further work

Page 10: HAIS09-BeyondHomemadeArtificialDatasets

EMO Solution

EMO (evolutionary multi-objective optimization) solution based on NSGA-II

Meta-information
- Data dimensionality specified by the user: number of attributes and instances
- Continuous- or nominal-valued attributes
- Unlabeled data following any distribution
- The user specifies the desired problem difficulty

Problem objectives
- Each objective corresponds to a selected complexity metric

Page 11: HAIS09-BeyondHomemadeArtificialDatasets

EMO Solution

Knowledge representation

Individual = class labeling of the instances of the data set

Instance   1       2       3       4       …   n
Class?     {0,1}   {0,1}   {0,1}   {0,1}   …   {0,1}

n = number of instances in the data set
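The representation above can be sketched as a plain list of 0/1 genes, one per instance. This is a minimal illustration: the function names are hypothetical, and the uniform random initialization is an assumption, since the slides do not say how the initial population is created.

```python
import random


def random_labeling(n_instances, rng):
    """One individual: a class label in {0, 1} for each instance."""
    return [rng.randint(0, 1) for _ in range(n_instances)]


def initial_population(pop_size, n_instances, seed=0):
    """Assumed initialization: labels drawn uniformly at random."""
    rng = random.Random(seed)
    return [random_labeling(n_instances, rng) for _ in range(pop_size)]


def apply_labeling(examples, labeling):
    """Pair each unlabeled example with its gene to get a labeled data set."""
    return list(zip(examples, labeling))


population = initial_population(pop_size=4, n_instances=6)
```

Because an individual only stores labels, the attribute values of the examples stay fixed during the search; only the class assignment evolves.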

Page 12: HAIS09-BeyondHomemadeArtificialDatasets

EMO Solution

Process organization

NSGA-II algorithm
- Fast non-dominated sorting
- Crowding distance

Genetic operators
- S-wise tournament selection
- Two-point crossover
- Bit-wise mutation
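The two NSGA-II components named above can be sketched as follows. This is a simplified textbook-style version, not the authors' implementation, and it assumes that every objective is minimized. `objs` is a hypothetical list with one tuple of objective values per individual.

```python
import math


def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def fast_non_dominated_sort(objs):
    """Split individual indices into fronts; front 0 is the non-dominated set."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # indices that individual i dominates
    counts = [0] * n                    # how many individuals dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                dominated[i].append(j)
            elif dominates(objs[j], objs[i]):
                counts[i] += 1
    fronts = [[i for i in range(n) if counts[i] == 0]]
    while fronts[-1]:
        next_front = []
        for i in fronts[-1]:
            for j in dominated[i]:
                counts[j] -= 1
                if counts[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
    return fronts[:-1]  # drop the trailing empty front


def crowding_distance(objs, front):
    """Larger distance = less crowded; boundary solutions get infinity."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[ordered[0]][k], objs[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = math.inf
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][k] - objs[prev][k]) / (hi - lo)
    return dist
```

For example, with `objs = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]` the first three individuals form the first front, because none of them is dominated by another individual.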

Page 13: HAIS09-BeyondHomemadeArtificialDatasets

Outline

1. ADS generation

2. EMO-made artificial data sets

3. Preliminary results

4. Conclusions & further work

Page 14: HAIS09-BeyondHomemadeArtificialDatasets

Methodology

Starting from two types of data sets:
- Data set with randomly distributed examples
- Real-world data set without labels

EMO configuration
- 400 individuals
- 50 generations
- Probability of crossover = 0.85
- Probability of mutation = 1/n
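The configuration above, together with the genetic operators listed earlier, might be wired up as in the sketch below. The class and function names are hypothetical, and the tournament size is a placeholder because the slides do not give a value for s. The `fitness` argument is assumed to hold one sortable key per individual where lower is better, for example a tuple of front index and negated crowding distance.

```python
import random
from dataclasses import dataclass


@dataclass
class EMOConfig:
    pop_size: int = 400          # from the slide
    generations: int = 50        # from the slide
    p_crossover: float = 0.85    # from the slide
    tournament_size: int = 2     # placeholder: s is not given on the slides

    def p_mutation(self, n_instances):
        """Mutation probability 1/n, with n the number of instances."""
        return 1.0 / n_instances


def tournament(population, fitness, s, rng):
    """Draw s distinct individuals and return the one with the lowest key."""
    contenders = rng.sample(range(len(population)), s)
    return population[min(contenders, key=lambda i: fitness[i])]


def two_point_crossover(a, b, p_crossover, rng):
    """Swap the segment between two cut points with probability p_crossover."""
    if rng.random() >= p_crossover:
        return a[:], b[:]
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def bit_flip_mutation(labels, p_mutation, rng):
    """Flip each class label independently with probability p_mutation."""
    return [1 - g if rng.random() < p_mutation else g for g in labels]
```

With a mutation probability of 1/n, each offspring has on average one flipped label, so for a 150-instance data set such as Iris the per-gene probability would be 1/150.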

Page 15: HAIS09-BeyondHomemadeArtificialDatasets

Objectives to Optimize

Fisher discriminant ratio
- Measures to what extent individual features contribute to class discrimination

Fraction of points on the class boundary
- Build the minimum spanning tree (MST) connecting all the points regardless of class
- Count the number of edges joining opposite classes
- Estimates the length of the class boundary

Ratio of average intra/inter class distance
- Compute the Euclidean distance from each point to
  - the nearest neighbor of the same class
  - the nearest neighbor of another class
- Return the ratio
- Compares the within-class spread to the size of the gap between classes
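The three metrics can be sketched in plain Python for a two-class problem. This follows the usual definitions from Ho & Basu (2002) as far as I can reconstruct them and is not the authors' code: the function names are hypothetical, the boundary metric here counts the points touched by a cross-class MST edge (the slide speaks of counting edges), and both classes are assumed to contain at least two points.

```python
import math
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Largest per-feature value of (mu0 - mu1)^2 / (var0 + var1)."""
    best = 0.0
    for j in range(len(X[0])):
        a = [x[j] for x, c in zip(X, y) if c == 0]
        b = [x[j] for x, c in zip(X, y) if c == 1]
        num = (mean(a) - mean(b)) ** 2
        denom = pvariance(a) + pvariance(b)
        if denom == 0:
            ratio = math.inf if num > 0 else 0.0
        else:
            ratio = num / denom
        best = max(best, ratio)
    return best


def mst_edges(X):
    """Prim's algorithm on the complete Euclidean graph, O(n^2)."""
    n = len(X)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v], parent[v] = d, u
    return edges


def boundary_fraction(X, y):
    """Fraction of points incident to an MST edge joining opposite classes."""
    on_boundary = set()
    for u, v in mst_edges(X):
        if y[u] != y[v]:
            on_boundary.update((u, v))
    return len(on_boundary) / len(X)


def intra_inter_ratio(X, y):
    """Sum of same-class NN distances over sum of other-class NN distances."""
    intra = inter = 0.0
    for i, xi in enumerate(X):
        intra += min(math.dist(xi, xj) for j, xj in enumerate(X)
                     if j != i and y[j] == y[i])
        inter += min(math.dist(xi, xj) for j, xj in enumerate(X)
                     if y[j] != y[i])
    return intra / inter


# Four points on a line; two labelings of the same unlabeled examples.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
y_easy = [0, 0, 1, 1]
y_hard = [0, 1, 0, 1]
```

On this toy data, relabeling the same four points from `y_easy` to `y_hard` raises the boundary fraction from 0.5 to 1.0, which illustrates how the labeling alone changes the measured difficulty.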

Page 16: HAIS09-BeyondHomemadeArtificialDatasets

Results

[Figure panels: Random distribution; Iris problem]

- Problems with bounded complexity generated
- Obtaining data sets in regions of the complexity space that were not covered with the analysis over real-world problems

Page 17: HAIS09-BeyondHomemadeArtificialDatasets

Outline

1. ADS generation

2. EMO-made artificial data sets

3. Preliminary experimental results

4. Conclusions & further work

Page 18: HAIS09-BeyondHomemadeArtificialDatasets

Conclusions

- Highlighted the need for the systematic creation of ADS
- Design of an EMO-based system that automatically generates ADS of bounded difficulty by
  - Starting from a set of unlabeled examples
  - Assigning a label to each example
  - Optimizing several complexity metrics

Experiments show that:
- A Pareto set of data sets with different complexities can be obtained
- Blind regions of the complexity space can be covered

Page 19: HAIS09-BeyondHomemadeArtificialDatasets

Further Work

A limitation not mentioned so far
- Free class labeling may result in data sets whose structure may not be feasible in nature

Future work to tackle this point
- Introduce constraints to the problem
  - Some instances cannot change their class
- Enable instance selection
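The constraint that some instances cannot change their class could, for instance, be enforced inside the mutation operator. The sketch below is purely illustrative of that idea, since the slides describe it only as future work. The `frozen` mask and the function name are hypothetical.

```python
import random


def constrained_bit_flip(labels, frozen, p_mutation, rng):
    """Flip each label with probability p_mutation unless its position is frozen."""
    return [
        g if is_frozen or rng.random() >= p_mutation else 1 - g
        for g, is_frozen in zip(labels, frozen)
    ]


rng = random.Random(0)
# Positions 0 and 2 keep their class; the others flip because p_mutation is 1.0.
mutated = constrained_bit_flip([1, 1, 0, 0], [True, False, True, False], 1.0, rng)
```

A complete version would also have to keep frozen positions consistent under crossover, for example by initializing every individual with the same labels at those positions.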
