Genetic Programming for Classification with Unbalanced Data (Part 1)
DESCRIPTION
PPT presentation transcript
Slide 1
Genetic Programming for Classification with Unbalanced Data
Presented by Noorulain Amina Asif
A research paper by: Urvesh Bhowan, Mengjie Zhang, Mark Johnston (Evolutionary Computation Research Group, Victoria University of Wellington, New Zealand)
Pattern Recognition Lab, Department of Computer Science & Information Sciences, Pakistan Institute of Engineering & Applied Sciences
OUTLINE
Abstract of the paper
Introduction to the basic concepts: classification, unbalanced data, performance bias
GP framework for classification: program representation and classification strategy, evolutionary parameters, standard fitness function for classification
Improving GP with new and improved fitness functions: four variations of the fitness function
Abstract of the paper
This paper compares two Genetic Programming (GP) approaches for classification with unbalanced data.
The first focuses on adapting the fitness function to evolve classifiers with good classification ability across both minority and majority classes.
The second uses a multi-objective approach to simultaneously evolve a Pareto front (or set) of classifiers along the minority and majority class tradeoff surface.

Introduction
Classification: a way of predicting class membership for a set of examples using properties of the examples.
Unbalanced dataset: a data set having an uneven distribution of class examples.
Minority class: a small number of examples in the data set.
Majority class: makes up a large part of the data set.
Introduction
Performance bias: poor accuracy on the minority class but high accuracy on the majority class.
Solution? Misclassification costs for minority class examples.
GP Approaches
Two GP approaches are discussed:
Adaptation of the fitness function
Multi-Objective Genetic Programming (MOGP)

GP Framework for Classification
Program representation:
Terminals (example features and constants)
Functions (+, -, x, % (protected division), and conditional if)
Classification strategy: translates the output of a genetic program (a floating-point number) into two class labels using the division between positive and non-positive.
Minority class: positive or 0
Majority class: negative

GP Framework for Classification
Evolutionary parameters:
Initial population: ramped half-and-half
Crossover: 60%
Mutation: 30%
Elitism: 10%
Training and test data: half of each data set was randomly chosen as the training set and the other half as the test set, both preserving the original class imbalance ratio.

Standard Fitness Function
The standard fitness is the overall classification accuracy:
foverall = (TP + TN) / (TP + TN + FP + FN)
where the entries come from the confusion matrix:

                        Predicted Positive    Predicted Non-Positive
Actual Positive         TP                    FN
Actual Non-Positive     FP                    TN

Adapting the Standard Fitness Function
foverall can be unsuitable: it favors solutions with a performance bias.
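To illustrate the problem (this is not the authors' code, and the toy confusion-matrix counts below are invented), the classification strategy and the standard fitness can be sketched in Python:

```python
def classify(output):
    """Classification strategy: zero or positive program output -> minority
    class, negative output -> majority class."""
    return "minority" if output >= 0 else "majority"

def f_overall(tp, fn, fp, tn):
    """Standard fitness: overall accuracy over all examples."""
    return (tp + tn) / (tp + fn + fp + tn)

# Toy unbalanced data: 10 minority vs 90 majority examples.
# A degenerate program that labels everything "majority" yields
# TP=0, FN=10, FP=0, TN=90.
biased_fitness = f_overall(tp=0, fn=10, fp=0, tn=90)
print(biased_fitness)  # 0.9
```

With 90% of the examples in the majority class, the all-majority program scores 0.9 under foverall while misclassifying every minority example, which is exactly the performance bias the authors set out to fix.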
Fitness functions should be modified:
To consider the accuracy of each class as equally important
To improve the minority class accuracy

New Fitness Function 1
Correctly classifying TPs may be more important than correctly classifying TNs.
Designed to investigate two aspects:
How to effectively balance the TP and TN rates?
Is the overall classification ability better?

New Fitness Function 1
The fitness combines the weighted per-class accuracies:
fweighted = W x TP/(TP + FN) + (1 - W) x TN/(TN + FP)
Weight W is given to the TP rate: the proportion of correctly classified minority class examples.
Weight (1 - W) is given to the TN rate: the proportion of correctly classified majority class examples.
When W = 0.5, the two class accuracies are given equal importance.
When W > 0.5, minority class accuracy is given more importance by the factor W.

New Fitness Function 2
Based on the correlation ratio: a measure of the relationship between linear statistical dispersions within sets of observations.
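Before developing the correlation ratio, new fitness function 1 can be sketched as a weighted sum of the per-class accuracies (the toy counts below are invented for illustration):

```python
def f_weighted(tp, fn, fp, tn, w=0.5):
    """New fitness function 1: W * minority accuracy + (1 - W) * majority accuracy."""
    tp_rate = tp / (tp + fn)  # proportion of correctly classified minority examples
    tn_rate = tn / (tn + fp)  # proportion of correctly classified majority examples
    return w * tp_rate + (1 - w) * tn_rate

# A program that labels everything "majority": TP=0, FN=10, FP=0, TN=90.
print(f_weighted(0, 10, 0, 90, w=0.5))  # 0.5: no longer rewarded for ignoring the minority class
print(f_weighted(0, 10, 0, 90, w=0.7))  # 0.3: penalised harder when W favours the minority class
```

Compare this with the 0.9 the same degenerate program earns under plain overall accuracy: the weighted form removes the incentive to ignore the minority class.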
The correlation ratio is adapted to measure how well two sets of GP class observations (program outputs) are separated: the higher the correlation ratio, the better the separation.
Goal: explore the effectiveness of a separability-based evaluation metric.

New Fitness Function 2
The correlation ratio:
r = sqrt( sum_c Nc (mc - m)^2 / sum_c sum_i (Pci - m)^2 )
where:
M = number of classes (c = 1 ... M)
Nc = number of examples in class c
m = overall mean of the program outputs
mc = mean of the outputs for class c
Pci = output of the classifier P for the ith example of class c
The numerator is the sum of the (squared) distances between the class means and the overall mean: the larger the distance, the larger the ratio.
The denominator is the sum of the (squared) distances between the observations and the population mean.

New Fitness Function 2
r returns values between 0 and 1:
Close to 1 => better separation
Close to 0 => poor separation
The separation should also agree with the classification strategy:
Minority class observations should be positive numbers
Majority class observations should be negative numbers
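The correlation ratio above can be transcribed directly into code; the two toy output sets below are invented to show the well-separated and overlapping cases:

```python
def correlation_ratio(class_outputs):
    """r = sqrt( sum_c Nc*(mean_c - mean)^2 / sum_{c,i} (P_ci - mean)^2 )."""
    all_outputs = [p for outputs in class_outputs for p in outputs]
    mean = sum(all_outputs) / len(all_outputs)
    # Numerator: squared distances between class means and the overall mean.
    between = sum(len(o) * (sum(o) / len(o) - mean) ** 2 for o in class_outputs)
    # Denominator: squared distances between observations and the overall mean.
    total = sum((p - mean) ** 2 for p in all_outputs)
    return (between / total) ** 0.5 if total else 0.0

well_separated = correlation_ratio([[1.0, 1.2, 0.9], [-1.1, -0.8, -1.0]])
overlapping = correlation_ratio([[0.1, -0.1, 0.2], [0.0, 0.1, -0.2]])
print(well_separated, overlapping)  # near 1 vs much smaller
```

Note that r alone is sign-blind: swapping the two classes' outputs gives the same r, which is why the ordering preference (minority positive, majority negative) has to be enforced separately.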
New Fitness Function 2
The fitness function:
fcorr = r + I
where I is an indicator function that returns 1 if the means of the majority and minority class outputs are negative and positive respectively, and 0 otherwise; it incorporates the ordering preference.
Fitness values lie between 0 and 2:
Values close to 2 => optimal fitness
Values close to 0 => poor fitness

Improved Fitness Function 1
Relatively recent improvements.
Equally weighted accuracy plus a new objective: the level of error for each class, estimated using the largest and smallest incorrect observations (program outputs) for that class.
Error values are scaled between 0 and 1:
1 => highest level of error
0 => no error
Class observations may be positive or negative, so absolute values are taken.

Improved Fitness Function 1
The fitness function:
f = TP/(TP + FN) + TN/(TN + FP) + (1 - ErrMin) + (1 - ErrMaj)
Equal weights for the accuracies of both classes; the smaller the level of error, the higher the fitness.
Fitness values lie between 0 and 4:
Values close to 4 => optimal fitness
Values close to 0 => poor fitness
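A sketch of improved fitness function 1. The paper estimates each class's level of error from the largest and smallest incorrect outputs; the exact scaling is not given in this transcript, so the squashing used below (x / (1 + x) on the mean of the two extreme magnitudes) is a placeholder assumption of mine, not the paper's formula:

```python
def error_level(incorrect_outputs):
    """ASSUMED scaling of the largest/smallest incorrect output magnitudes into [0, 1)."""
    if not incorrect_outputs:
        return 0.0  # no incorrect observations -> no error
    mags = [abs(p) for p in incorrect_outputs]  # outputs may be +/-, take absolute values
    spread = (max(mags) + min(mags)) / 2
    return spread / (1 + spread)  # placeholder squash: large wrong outputs -> near 1

def f_improved1(tp, fn, fp, tn, wrong_min, wrong_maj):
    """Equally weighted class accuracies plus (1 - level of error) per class; range [0, 4]."""
    return (tp / (tp + fn) + tn / (tn + fp)
            + (1 - error_level(wrong_min)) + (1 - error_level(wrong_maj)))

# A perfect classifier: both accuracies 1, no incorrect outputs -> fitness 4.
print(f_improved1(10, 0, 0, 90, wrong_min=[], wrong_maj=[]))  # 4.0
```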
Improved Fitness Function 2
Uses the Wilcoxon-Mann-Whitney (WMW) statistic, a well-known approximation of the AUC that avoids computing the ROC curve itself.
Uses the separability-based metric directly in the program fitness.
Expensive to compute.

Improved Fitness Function 2
The fitness function:
fwmw = sum_i sum_j I(Pi, Pj) / (Nmin x Nmaj)
where Pi ranges over the minority class outputs, Pj over the majority class outputs, and the indicator I rates each pair on two metrics:
Classification accuracy
Separability
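A sketch of the WMW-based fitness. The transcript says each pair is rated on classification accuracy and separability; my reading of that (an assumption, not confirmed by the transcript) is that a pair scores 1 only when the minority output is positive, hence correct under the classification strategy, and also larger than the majority output:

```python
def f_wmw(minority_outputs, majority_outputs):
    """Wilcoxon-Mann-Whitney style AUC approximation over all minority/majority pairs."""
    wins = sum(1
               for x in minority_outputs
               for y in majority_outputs
               if x > 0 and x > y)  # correct (accuracy) AND separated (separability)
    return wins / (len(minority_outputs) * len(majority_outputs))

print(f_wmw([1.0, 0.5], [-0.5, -1.0]))  # 1.0: every pair correctly ordered
print(f_wmw([-1.0], [0.5]))             # 0.0
```

The double loop over Nmin x Nmaj pairs makes clear why the transcript flags this function as expensive to compute.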
Pair-wise comparisons: the larger the value, the more separable the classes.

Summary
Having an unbalanced data set may cause a performance bias towards the majority class.
In GP, class imbalance problems can be treated in two ways:
Adapting the fitness function by introducing new metrics (discussed today):
Weights
Correlation ratio (separability)
Levels of error
MOGP (to be discussed in the next presentation)

THANK YOU