N. Gagunashvili (UNAK & MPIK)
Methods of multivariate analysis for imbalance data problem
Under- and Oversampling Techniques
Nikolai Gagunashvili (UNAK and MPIK)
Four approaches can be used to solve the imbalanced data problem:
• Choice of an appropriate classifier
• Cost-sensitive approach
• Sampling-based approach
• Bagging
The main idea of the sampling-based approach is to modify the distribution of events so that the rare class is well represented in the training sample.
There are three variants:
• Undersampling
• Oversampling
• Hybrid oversampling and undersampling
In the case of undersampling, we can take a random sample of the majority class (BG).
Potential problem: some useful BG instances may not be chosen for training, and the classifier will not be optimal.
Instead, the majority class can be reduced in a way that does not lose classification performance.
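As a minimal sketch of plain random undersampling, the Python snippet below keeps every signal event and draws a random subset of the background. The function name `random_undersample`, the `ratio` parameter and the toy data are illustrative assumptions, not part of the original analysis.

```python
import random

def random_undersample(signal, background, ratio=1.0, seed=0):
    """Keep all signal events and a random subset of background events.

    `ratio` is the desired number of background events per signal event
    (a hypothetical parameter, not taken from the slides).
    """
    rng = random.Random(seed)
    n_keep = min(len(background), int(round(ratio * len(signal))))
    # Sampling without replacement: each background event is kept at most once.
    kept_background = rng.sample(background, n_keep)
    return signal, kept_background

# Toy data: 5 signal and 100 background events with one feature each.
signal = [[float(i)] for i in range(5)]
background = [[float(i)] for i in range(100)]
sig, bg = random_undersample(signal, background, ratio=2.0)
print(len(sig), len(bg))
```

Because the subset is chosen blindly, informative background events near the signal region can be discarded, which is the weakness noted above.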
| Class | Number of instances in training sample | Number of instances in test sample |
|---|---|---|
| D0 | 1851 | 1837 |
| Background | 496651 | 6704513 |
For illustration, Monte Carlo data for the D0 analysis are used.
Data are taken in the mass window 1844.5 MeV – 1884.5 MeV.
Algorithm for reducing the number of background instances without losing performance
An instance t is removed if all k of its neighbors are of the same class. The instance is only removed, however, if its neighbors are at least 60% sure of their classification. For our example we take k = 20, so at least 12 neighbors must agree on the class.
After reduction, the training sample contains 19712 instances (more than 25 times smaller)!
BG = 17861, D0 = 1851
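The reduction rule can be sketched in Python as follows. This is one possible reading, not the author's implementation: it assumes Euclidean distance, a single pass over the original sample, and that a background instance is dropped when at least 60% of its k nearest neighbors are also background. The names `reduce_background`, `threshold` and `target` are hypothetical.

```python
import math

def reduce_background(points, labels, k=20, threshold=0.6, target="BG"):
    """Return the indices kept after removing redundant `target` instances.

    Assumption: a `target` instance is redundant when at least `threshold`
    of its k nearest neighbours (in the original sample) share its class.
    """
    keep = []
    for i, (p, label) in enumerate(zip(points, labels)):
        if label != target:
            keep.append(i)  # instances of other classes are never removed
            continue
        # Brute-force neighbour search: adequate for a toy sample only.
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours = [j for _, j in ranked[:k]]
        same = sum(labels[j] == label for j in neighbours)
        if not neighbours or same < threshold * len(neighbours):
            keep.append(i)  # mixed neighbourhood: the instance carries information
    return keep

# Toy data: background on a line, signal at its right-hand end.
points = [[float(x)] for x in range(10)] + [[9.5], [10.0], [10.5]]
labels = ["BG"] * 10 + ["D0"] * 3
kept = reduce_background(points, labels, k=3)
print(kept)
```

In this toy sample only the background point adjacent to the signal region survives, together with all signal points. For a sample of the size used here, the brute-force search would need to be replaced by a spatial index.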
Training sample: BG = 17861, D0 = 1851
Oversampling is the replication of minority-class events.
Potential problem: this method can overfit noisy data, because noisy instances are replicated as well.
To avoid overfitting, randomized oversampling (SMOTE and Borderline-SMOTE) combined with cleaning of noisy data is proposed.
Hui Han, Wen-Yuan Wang, Bing-Huan Mao, Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, ICIC 2005, Part I, LNCS 3644, 878-887, 2005.
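The randomization in SMOTE comes from interpolation: each synthetic event is placed at a random position on the segment between a minority event and one of its nearest minority neighbors, rather than being an exact copy. A minimal sketch, assuming Euclidean distance and brute-force neighbor search; the function name and parameters are illustrative:

```python
import math
import random

def smote(minority, n_new_per_point=1, k=5, seed=0):
    """Create synthetic minority points by interpolating towards neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, p in enumerate(minority):
        # k nearest minority-class neighbours of p (brute force, toy scale).
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(minority) if j != i)
        neighbours = [minority[j] for _, j in ranked[:k]]
        for _ in range(n_new_per_point):
            q = rng.choice(neighbours)
            gap = rng.random()  # uniform in [0, 1)
            # New point lies on the segment from p towards q.
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic

# Toy data: four minority events at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new_per_point=3, k=2)
print(len(new_points))
```

Every synthetic point stays inside the region spanned by existing minority events, so the method fills in the minority region instead of duplicating individual (possibly noisy) events.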
Borderline-SMOTE algorithm
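The algorithm figure from this slide is not preserved in the transcript. As a hedged sketch of the selection step described by Han et al., each minority event can be categorized by the number of majority events among its m nearest neighbors in the full training sample: all majority means noise, at least half means borderline ("danger"), otherwise safe. Only the borderline events are then used as seeds for SMOTE-style interpolation. The function name and the value of m below are assumptions.

```python
import math

def borderline_categories(points, labels, minority="D0", m=5):
    """Label each minority instance as 'noise', 'danger' or 'safe'."""
    categories = {}
    for i, (p, label) in enumerate(zip(points, labels)):
        if label != minority:
            continue
        # m nearest neighbours in the whole sample (both classes).
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours = [j for _, j in ranked[:m]]
        n_majority = sum(labels[j] != minority for j in neighbours)
        if n_majority == len(neighbours):
            categories[i] = "noise"   # surrounded by the majority class
        elif 2 * n_majority >= len(neighbours):
            categories[i] = "danger"  # near the class border: oversample these
        else:
            categories[i] = "safe"
    return categories

# Toy data: background on the left, signal on the right, one stray signal event.
points = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [1.5]]
labels = ["BG"] * 4 + ["D0"] * 5
categories = borderline_categories(points, labels, m=3)
print(categories)
```

Here the signal event next to the background is marked as borderline, the stray event inside the background is marked as noise, and the remaining signal events are safe.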
Training sample: BG = 17861, D0 = 1851+3*555=3516
A cleaning procedure can improve the performance of the algorithms.
One such procedure is removing instances that participate in Tomek links.
A Tomek link is defined as a pair of instances x and y from different classes such that there exists no instance z with d(x, z) < d(x, y) or d(y, z) < d(x, y), where d is the distance between a pair of instances.
Instances in Tomek links are either noisy or lie near the decision border.
I. Tomek, Two Modifications of CNN, IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976), 769-772.
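The definition amounts to saying that x and y are each other's nearest neighbors and carry different class labels. A minimal sketch, assuming Euclidean distance and brute-force search; ties are broken by index order, and the function name is illustrative:

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        # Index of the closest other point (first one wins on ties).
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    # A link is a mutual nearest-neighbour pair with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Toy data: one background and one signal event sit close together.
points = [[0.0], [1.0], [2.0], [2.4], [4.0], [5.0]]
labels = ["BG", "BG", "BG", "D0", "D0", "D0"]
links = tomek_links(points, labels)

# Cleaning step: drop both members of every link.
drop = {i for pair in links for i in pair}
cleaned = [i for i in range(len(points)) if i not in drop]
print(links, cleaned)
```

Removing both members of a link takes the same number of events from each class, which matches the equal reduction of 456 events per class quoted for the training sample.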
Training sample: BG = 17861 - 456 = 17405, D0 = 3516 - 456 = 3060
Sizes of training samples

| Class | Training sample | After editing | After SMOTE | After Tomek link editing |
|---|---|---|---|---|
| D0 | 1851 | 1851 | 3516 | 3060 |
| BG | 496651 | 17861 | 17861 | 17405 |
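The bookkeeping in the table above can be checked with a few lines of Python. The stage names are labels chosen here for illustration; the numbers are copied from the slides. Reading 1851 + 3*555 as 555 borderline seeds with three synthetic events each is an inference from the arithmetic, not something the transcript states.

```python
# Sample sizes per processing stage, copied from the table.
sizes = {
    "initial": {"D0": 1851, "BG": 496651},
    "edited":  {"D0": 1851, "BG": 17861},
    "smote":   {"D0": 3516, "BG": 17861},
    "tomek":   {"D0": 3060, "BG": 17405},
}

# Background-to-signal ratio at each stage.
ratios = {stage: n["BG"] / n["D0"] for stage, n in sizes.items()}
for stage, ratio in ratios.items():
    print(f"{stage:8s} BG/D0 = {ratio:.2f}")

# Consistency checks against the figures quoted on the individual slides.
total_after_editing = sizes["edited"]["D0"] + sizes["edited"]["BG"]
d0_after_smote = 1851 + 3 * 555
bg_after_tomek = sizes["smote"]["BG"] - 456
d0_after_tomek = sizes["smote"]["D0"] - 456
```

The class imbalance drops from roughly 268:1 in the initial sample to under 6:1 after the full chain.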
Excluded attributes after wrapper:
Conclusions
Undersampling methods that filter out redundant majority-class events substantially improve classifier performance.
Oversampling with randomization (Borderline-SMOTE algorithm), combined with removing events that participate in Tomek links, improves classifier performance.