Genetic Programming for Classification with Unbalanced Data (Part 1)
DESCRIPTION
PPT presentation transcript
Slide 1
Genetic Programming for Classification with Unbalanced Data
Presented by Noorulain Amina Asif
A research paper by: Urvesh Bhowan, Mengjie Zhang, Mark Johnston (Evolutionary Computation Research Group, Victoria University of Wellington, New Zealand)
Pattern Recognition Lab, Department of Computer Science & Information Sciences, Pakistan Institute of Engineering & Applied Sciences
OUTLINE
Abstract of the paper
Introduction to the basic concepts: classification, unbalanced data, performance bias
GP framework for classification: program representation and classification strategy, evolutionary parameters, standard fitness function for classification
Improving GP with new and improved fitness functions: four variations of the fitness function
Abstract of the paper
This paper compares two Genetic Programming (GP) approaches for classification with unbalanced data.
The first focuses on adapting the fitness function to evolve classifiers with good classification ability across both minority and majority classes.
The second uses a multi-objective approach to simultaneously evolve a Pareto front (or set) of classifiers along the minority and majority class tradeoff surface.

Introduction
Classification: a way of predicting class membership for a set of examples using properties of the examples.
Unbalanced dataset: a data set having an uneven distribution of class examples.
Minority class: a small number of examples in the data set.
Majority class: makes up a large part of the data set.
Introduction
Performance bias: poor accuracy on the minority class but high accuracy on the majority class.
Solution? Misclassification costs for minority class examples.
GP Approaches
Two GP approaches are discussed:
Adaptation of the fitness function
Multi-Objective Genetic Programming (MOGP)

GP Framework for Classification
Program representation:
Terminals (example features and constants)
Functions (+, -, x, % (protected division), and conditional if)
Classification strategy: translates the output of a genetic program (a floating-point number) into two class labels using the division between positive and non-positive.
Minority class: positive or 0
Majority class: negative

GP Framework for Classification
Evolutionary parameters:
Initial population: ramped half-and-half
Crossover: 60%
Mutation: 30%
Elitism: 10%
Training and test data: half of each data set was randomly chosen as the training set and the other half as the test set, both preserving the original class imbalance ratio.

Standard Fitness Function
The standard fitness is the overall classification accuracy:
foverall = (TP + TN) / (TP + TN + FP + FN)
where the entries come from the confusion matrix:

                        Predicted Positive    Predicted Non-Positive
Actual Positive         TP                    FN
Actual Non-Positive     FP                    TN

Adapting the Standard Fitness Function
foverall can be unsuitable: it favors solutions with a performance bias.
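To illustrate the problem (this is not the authors' code, and the toy confusion-matrix counts below are invented), the classification strategy and the standard fitness can be sketched in Python:

```python
def classify(output):
    """Classification strategy: zero or positive program output -> minority
    class, negative output -> majority class."""
    return "minority" if output >= 0 else "majority"

def f_overall(tp, fn, fp, tn):
    """Standard fitness: overall accuracy over all examples."""
    return (tp + tn) / (tp + fn + fp + tn)

# Toy unbalanced data: 10 minority vs 90 majority examples.
# A degenerate program that labels everything "majority" yields
# TP=0, FN=10, FP=0, TN=90.
biased_fitness = f_overall(tp=0, fn=10, fp=0, tn=90)
print(biased_fitness)  # 0.9
```

With 90% of the examples in the majority class, the all-majority program scores 0.9 under foverall while misclassifying every minority example, which is exactly the performance bias the authors set out to fix.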
Fitness functions should be modified:
To consider the accuracy of each class as equally important
To improve the minority class accuracy

New Fitness Function 1
Correctly classifying TPs may be more important than correctly classifying TNs.
Designed to investigate two aspects:
How to effectively balance the TP and TN rates?
Is the overall classification ability better?

New Fitness Function 1
The fitness combines the weighted per-class accuracies:
fweighted = W x TP/(TP + FN) + (1 - W) x TN/(TN + FP)
Weight W is given to the TP rate: the proportion of correctly classified minority class examples.
Weight (1 - W) is given to the TN rate: the proportion of correctly classified majority class examples.
When W = 0.5, the two class accuracies are given equal importance.
When W > 0.5, minority class accuracy is given more importance by the factor W.

New Fitness Function 2
Based on the correlation ratio: a measure of the relationship between linear statistical dispersions within sets of observations.
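Before developing the correlation ratio, new fitness function 1 can be sketched as a weighted sum of the per-class accuracies (the toy counts below are invented for illustration):

```python
def f_weighted(tp, fn, fp, tn, w=0.5):
    """New fitness function 1: W * minority accuracy + (1 - W) * majority accuracy."""
    tp_rate = tp / (tp + fn)  # proportion of correctly classified minority examples
    tn_rate = tn / (tn + fp)  # proportion of correctly classified majority examples
    return w * tp_rate + (1 - w) * tn_rate

# A program that labels everything "majority": TP=0, FN=10, FP=0, TN=90.
print(f_weighted(0, 10, 0, 90, w=0.5))  # 0.5: no longer rewarded for ignoring the minority class
print(f_weighted(0, 10, 0, 90, w=0.7))  # 0.3: penalised harder when W favours the minority class
```

Compare this with the 0.9 the same degenerate program earns under plain overall accuracy: the weighted form removes the incentive to ignore the minority class.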
The correlation ratio is adapted to measure how well two sets of GP class observations (program outputs) are separated: the higher the correlation ratio, the better the separation.
Goal: explore the effectiveness of a separability-based evaluation metric.

New Fitness Function 2
The correlation ratio:
r = sqrt( sum_c Nc (mc - m)^2 / sum_c sum_i (Pci - m)^2 )
where:
M = number of classes (c = 1 ... M)
Nc = number of examples in class c
m = overall mean of the program outputs
mc = mean of the outputs for class c
Pci = output of the classifier P for the ith example of class c
The numerator is the sum of the (squared) distances between the class means and the overall mean: the larger the distance, the larger the ratio.
The denominator is the sum of the (squared) distances between the observations and the population mean.

New Fitness Function 2
r returns values between 0 and 1:
Close to 1 => better separation
Close to 0 => poor separation
The separation should also agree with the classification strategy:
Minority class observations should be positive numbers
Majority class observations should be negative numbers
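The correlation ratio above can be transcribed directly into code; the two toy output sets below are invented to show the well-separated and overlapping cases:

```python
def correlation_ratio(class_outputs):
    """r = sqrt( sum_c Nc*(mean_c - mean)^2 / sum_{c,i} (P_ci - mean)^2 )."""
    all_outputs = [p for outputs in class_outputs for p in outputs]
    mean = sum(all_outputs) / len(all_outputs)
    # Numerator: squared distances between class means and the overall mean.
    between = sum(len(o) * (sum(o) / len(o) - mean) ** 2 for o in class_outputs)
    # Denominator: squared distances between observations and the overall mean.
    total = sum((p - mean) ** 2 for p in all_outputs)
    return (between / total) ** 0.5 if total else 0.0

well_separated = correlation_ratio([[1.0, 1.2, 0.9], [-1.1, -0.8, -1.0]])
overlapping = correlation_ratio([[0.1, -0.1, 0.2], [0.0, 0.1, -0.2]])
print(well_separated, overlapping)  # near 1 vs much smaller
```

Note that r alone is sign-blind: swapping the two classes' outputs gives the same r, which is why the ordering preference (minority positive, majority negative) has to be enforced separately.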
New Fitness Function 2
The fitness function:
fcorr = r + I
where I is an indicator function that returns 1 if the means of the majority and minority class outputs are negative and positive respectively, and 0 otherwise; it incorporates the ordering preference.
Fitness values lie between 0 and 2:
Values close to 2 => optimal fitness
Values close to 0 => poor fitness

Improved Fitness Function 1
Relatively recent improvements.
Equally weighted accuracy plus a new objective: the level of error for each class, estimated using the largest and smallest incorrect observations (program outputs) for that class.
Error values are scaled between 0 and 1:
1 => highest level of error
0 => no error
Class observations may be positive or negative, so absolute values are taken.

Improved Fitness Function 1
The fitness function:
f = TP/(TP + FN) + TN/(TN + FP) + (1 - ErrMin) + (1 - ErrMaj)
Equal weights for the accuracies of both classes; the smaller the level of error, the higher the fitness.
Fitness values lie between 0 and 4:
Values close to 4 => optimal fitness
Values close to 0 => poor fitness
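A sketch of improved fitness function 1. The paper estimates each class's level of error from the largest and smallest incorrect outputs; the exact scaling is not given in this transcript, so the squashing used below (x / (1 + x) on the mean of the two extreme magnitudes) is a placeholder assumption of mine, not the paper's formula:

```python
def error_level(incorrect_outputs):
    """ASSUMED scaling of the largest/smallest incorrect output magnitudes into [0, 1)."""
    if not incorrect_outputs:
        return 0.0  # no incorrect observations -> no error
    mags = [abs(p) for p in incorrect_outputs]  # outputs may be +/-, take absolute values
    spread = (max(mags) + min(mags)) / 2
    return spread / (1 + spread)  # placeholder squash: large wrong outputs -> near 1

def f_improved1(tp, fn, fp, tn, wrong_min, wrong_maj):
    """Equally weighted class accuracies plus (1 - level of error) per class; range [0, 4]."""
    return (tp / (tp + fn) + tn / (tn + fp)
            + (1 - error_level(wrong_min)) + (1 - error_level(wrong_maj)))

# A perfect classifier: both accuracies 1, no incorrect outputs -> fitness 4.
print(f_improved1(10, 0, 0, 90, wrong_min=[], wrong_maj=[]))  # 4.0
```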
Improved Fitness Function 2
Uses the Wilcoxon-Mann-Whitney (WMW) statistic, a well-known approximation of the AUC that avoids computing the ROC curve itself.
Uses the separability-based metric directly in the program fitness.
Expensive to compute.

Improved Fitness Function 2
The fitness function:
fwmw = sum_i sum_j I(Pi, Pj) / (Nmin x Nmaj)
where Pi ranges over the minority class outputs, Pj over the majority class outputs, and the indicator I rates each pair on two metrics:
Classification accuracy
Separability
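A sketch of the WMW-based fitness. The transcript says each pair is rated on classification accuracy and separability; my reading of that (an assumption, not confirmed by the transcript) is that a pair scores 1 only when the minority output is positive, hence correct under the classification strategy, and also larger than the majority output:

```python
def f_wmw(minority_outputs, majority_outputs):
    """Wilcoxon-Mann-Whitney style AUC approximation over all minority/majority pairs."""
    wins = sum(1
               for x in minority_outputs
               for y in majority_outputs
               if x > 0 and x > y)  # correct (accuracy) AND separated (separability)
    return wins / (len(minority_outputs) * len(majority_outputs))

print(f_wmw([1.0, 0.5], [-0.5, -1.0]))  # 1.0: every pair correctly ordered
print(f_wmw([-1.0], [0.5]))             # 0.0
```

The double loop over Nmin x Nmaj pairs makes clear why the transcript flags this function as expensive to compute.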
Pair-wise comparisons: the larger the value, the more separable the classes.

Summary
Having an unbalanced data set may cause a performance bias towards the majority class.
In GP, class imbalance problems can be treated in two ways:
Adapting the fitness function by introducing new metrics (discussed today):
Weights
Correlation ratio (separability)
Levels of error
MOGP (to be discussed in the next presentation)

THANK YOU