arxiv:2010.04308v2 [cs.cv] 14 nov 2020 · 2020. 11. 17. · arxiv:2010.04308v2 [cs.cv] 14 nov 2020....

LEAVE UNSET:1–15, 2020 Machine Learning for Health (ML4H) 2020

Addressing the Real-world Class Imbalance Problem inDermatology

Wei-Hung Weng∗ [email protected]

Jonathan Deaton [email protected] Health

Vivek Natarajan [email protected] Health

Gamaleldin F. Elsayed [email protected] Research

Yuan Liu [email protected] Health

Abstract

Class imbalance is a common problemin medical diagnosis, causing a stan-dard classifier to be biased towards thecommon classes and perform poorly onthe rare classes. This is especially truefor dermatology, a specialty with thou-sands of skin conditions but many ofwhich have low prevalence in the realworld. Motivated by recent advances,we explore few-shot learning methodsas well as conventional class imbalancetechniques for the skin condition recog-nition problem and propose an evalua-tion setup to fairly assess the real-worldutility of such approaches. We find theperformance of few-show learning meth-ods does not reach that of conventionalclass imbalance techniques, but combin-ing the two approaches using a novelensemble improves model performance,especially for rare classes. We concludethat ensembling can be useful to ad-dress the class imbalance problem, yetprogress can further be accelerated byreal-world evaluation setups for bench-marking new methods.

∗ Work done at Google.

Keywords: Dermatology, Class imbal-ance, Few-shot learning, Classification.

1. Introduction

Skin disease is the fourth leading cause ofnon-fatal medical conditions burden world-wide (Seth et al., 2017). Due to the globalshortage of dermatologists, access to derma-tology care is limited, leading to rising costs,poor patient outcomes, and health inequali-ties. Recent research endeavors have demon-strated that deep learning systems built todetect skin conditions from either dermato-scopic or digital camera images can achieveexpert level performance in diagnosing cer-tain skin conditions (Esteva et al., 2017;Liu et al., 2020). Despite the encouragingprogress, such systems can only identify afew common skin conditions well, leaving avast number of skin conditions still unad-dressed in the real-world (Figure 1 (a)).

As many skin conditions occur infre-quently in real-world, datasets collected froma natural patient population are highly im-balanced, with few examples for those skinconditions. This makes it challenging to

© 2020 W.-H. Weng, J. Deaton, V. Natarajan, G.F. Elsayed & Y. Liu.

arX

iv:2

010.

0430

8v2

[cs

.CV

] 1

4 N

ov 2

020

Addressing the Real-world Class Imbalance Problem in Dermatology

Figure 1: Class imbalance problem in der-matology at a glance (a) and pro-posed modeling framework (b).

build models that can reliably detect rareskin conditions. Such limitations not onlymay diminish the clinical utility of these sys-tems due to the low skin condition coverage,but also may cause harm to patients as theyare often biased towards common diseases.

A wide variety of techniques have beenproposed over the years to address theclass imbalance problem in machine learning.These include relatively classic approacheslike modified resampling strategies (Chawlaet al., 2002), loss reweighting (Eban et al.,2017), focal loss and bias initialization (Linet al., 2017), to more complex approachessuch as using generative models to augmentthe rare classes with synthetic images (Ghor-bani et al., 2020).

A related counterpart to the class imbal-ance problem is few-shot learning (FSL), alearning scenario inspired by the human abil-ity to learn from very few examples (Lakeet al., 2011). It can further be general-ized to the meta-learning framework thataims to gain learning experiences from solv-ing many meta tasks in order to achieve

a better performance for the tasks (in thiscase, rare classes) with limited training ex-amples (Vinyals et al., 2016).

The classic evaluation framework for FSLfollows an “N -way-k-shot” setting: given ktraining examples for N classes as a ran-dom subset of all classes, test if the FSLtechniques can correctly distinguish amongthe N classes. It is often not understoodwhether FSL will translate well beyond thiscontrived setting to the real-world problem,when the task requires recognizing the cor-rect class among all possible classes (all-wayclassification) in the wild (Chen et al., 2019;Triantafillou et al., 2020). Furthermore, it ischallenging to compare the classification per-formance between FSL and the conventionalsupervised training without a unified evalu-ation framework.

In this paper, we investigate various FSLmethods to tackle the class imbalance prob-lem in skin condition classification (Figure 1(b)). We modify the typical FSL evalua-tion setup to adapt to the real-world setting,and design studies to compare FSL methodsto conventional supervised learning (CSL)with various classic techniques to address theclass imbalance problem. We further explorethe potential use of the combination of FSLand CSL with various class imbalance tech-niques, motivated by the hypothesis that acombination may provide independent pre-dictions, capturing different aspects of thedata. Specifically, our contributions are:

• We propose a real-world evaluationframework to compare the FSL methodsagainst conventional methods (i.e., CSLwith class imbalance techniques) for theclass imbalance problem.

• We find that FSL methods don’t out-perform CSL with class imbalance tech-niques, yet an ensemble of these twotypes can outperform either alone, es-pecially for the rare classes.

2


2. Related Work

2.1. Class Imbalance

Imbalanced class distribution is common inreal-world datasets. For mild class imbal-ance, machine learning algorithms, such assupport vector machines (Cortes and Vapnik,1995), random forests (Breiman, 2001), andmodern deep learning techniques can usuallyhandle such cases well. However, special careis required to manage moderate or extremeimbalance situations (Figure 2), when minor-ity classes constitutes typically less than 20%or 1% in the training data respectively.

Conventional Techniques Researchershave developed various resampling and cost-sensitive learning strategies to improve mod-els under such skewed class distribution set-tings. Resampling aims to balance the classdistribution in the training data by eitherdownsampling common classes or upsam-pling rare classes. Downsampling may in-crease variance; thus, upsampling is morecommonly used (Chawla et al., 2002), yetupsampling procedures for extremely imbal-anced data may be computationally expen-sive (Weng et al., 2019).

In contrast, cost-sensitive learning penal-izes algorithms by increasing the cost of clas-sification mistakes on the rare classes. Itcan be implemented in various ways, suchas reweighting the losses of specific classes,introducing bias as prior into the classifica-tion loss at each class (Lin et al., 2017), orless used approaches such as global objec-tives (Eban et al., 2017). To prevent the easyexamples in the common classes from dom-inating the gradients during classifier train-ing, the focal loss function is developed (Linet al., 2017), to not only out-weight the rareclasses, but also emphasize the hard exam-ples during training.

Few-shot Learning (FSL) FSL can alsobe utilized to improve the classification per-

formance for the classes with very few train-ing examples. The FSL algorithms can becategorized into several types: metric-based,optimization-based, and transfer learning-based approaches.

Metric-based FSL learn a representationby learning to compare training examples.Koch et al. (2015) developed the Siameseneural network to compare two examples atthe same time with two identical twin net-works sharing the same weights. Matchingnetwork utilizes the cosine similarity as thedistance metric and adopts the long-shortterm memory (LSTM) network for generat-ing embeddings (Vinyals et al., 2016). Proto-typical network, on the other hand, uses theEuclidean distance and convolutional net-works to determine the embedding of theclass prototypes (Snell et al., 2017). Rela-tion networks further employ the concatena-tion strategy and relation module to scoreexample similarity (Sung et al., 2018).

Optimization-based FSL aims to learna set of parameters that allows a meta-learner to quickly adapt to new tasks (Finnet al., 2017). Ravi and Larochelle (2017)designed an LSTM-based meta-learner toreplace the stochastic gradient decent al-gorithm in order to learn a better opti-mizer. One can also learn a better modelinitialization for faster task adaptation withless gradient updates (Model-Agnostic Meta-Learning, MAML) (Finn et al., 2017).Others have integrated both metric andoptimization-based FSL to further improvethe performance (Triantafillou et al., 2020).

Finally, transfer learning tackles the few-shot classification fine-tuning a model pre-trained on a much larger dataset (Chenet al., 2019; Tian et al., 2020). Chen et al.(2019) adopts the training-then-finetuningprocess for few-shot classification withoutmeta-learning (Baseline++). Recently, Tianet al. (2020) demonstrated that transferlearning under the meta-learning framework

3


Figure 2: Categories of methods for tackling the class imbalance problem.

can yield strong performance for the few-shotclassification problem.

Despite the progress in FSL, they are typ-ically evaluated in a contrived setting andlittle is known about how they work in real-world all-way classification problems. In thisstudy, we therefore propose an evaluationmethod that allows us to use FSL for theall-way classification problem, and comparewith the CSL-based methods.

2.2. Machine Learning and FSL inDermatology

Artificial intelligence in dermatology is arapidly growing topic of interest in recentyears (Cruz-Roa et al., 2013; Esteva et al.,2017; Liu et al., 2020; Yang et al., 2018;Prabhu et al., 2019; Li et al., 2020; Maha-jan et al., 2020; Guo et al., 2020; Le et al.,2020). For example, Esteva et al. (2017) ap-plied deep learning to clinical skin photos fortwo binary skin cancer classification tasks.Liu et al. (2020) developed a CSL-based sys-tem that identifies 26 common skin condi-tions with performance superior to generalpractitioners and on par with dermatologists.

However, it is challenging to extend suchsystems to support the rare skin conditionsdue to the limited training examples. Previ-ous efforts have explored various approaches,such as adopting domain knowledge to learna better representation (Yang et al., 2018),developing or modifying the meta-learningbased methods (Prabhu et al., 2019; Liet al., 2020; Mahajan et al., 2020), and mak-

ing comparison between different approachestackling the extreme class imbalance prob-lem in dermatology (Guo et al., 2020; Leet al., 2020). Yet there is no study investigat-ing both the FSL and CSL-based class imbal-ance techniques and comparing them underthe real-world all-way classification setting.

Our work focuses on the all-way classifica-tion under a extreme long-tailed data distri-bution that often occurs in real-world med-ical imaging tasks. Under an unified eval-uation framework, we make comparison be-tween FSL and CSL-based class imbalancetechniques. We further propose ensemblingFSL with conventional methods and demon-strate gains from combining both strategies.

3. Methods

In this work, we investigate the model perfor-mance on skin condition classification prob-lem across CSL baseline, CSL with class im-balance techniques, FSL techniques, and dif-ferent ensembles between these approaches.We use an evaluation setup that allows us tocompare across all methods for the real-worldclassification problem, which entails properlyaddressing all skin conditions in our datasets.In this section, we explain how we modify theevaluation setup to adapt FSL to this real-world scenario, and introduce the learningalgorithms and metrics used for evaluation.

3.1. FSL Task and Evaluation Setup

Task Formulation and Standard Eval-uation FSL follows the meta-learning set-

4


ting, which consists of training, validation,and testing phases: training for learning aclassifier (meta-learner), validation for hy-perparameter tuning, and testing for evalu-ating the learned classifier. Data used in eachof the three phases needs to be split into sup-port and query set. For the standard FSLevaluation, the classes used in these threephases should be all disjoint (Figure 3 (a)).

Given a dataset with M classes, the train-ing can be batch or episodic. For batch train-ing, we train a model using all examples froma randomN classes (N < M) in the train set;for episodic training, we train a classifier un-der N -way-k-shot learning scenario: in eachtraining episode, we choose random k sam-ples from the random N classes (N < M) inthe train set to form a support set to learna model, and then evaluate on a query setwhich includes other samples in the same Nclasses in the training set. The validationand test sets are used to evaluate the modelduring and after training, respectively.

Real-world Evaluation for FSL Thestandard N -way-k-shot framework doesn’tfit the real-world classification problem,where we need to discriminate all classes si-multaneously. Therefore, we introduce thefollowing changes to the classic FSL evalua-tion setup (Figure 3 (b)):

(1) The dataset is split into development(train ∪ validation) and test sets in advance,where both sets may contain samples fromall possible classes.

(2) Different from the classical setup wherethe classes in all three splits are disjoint,in the real-world setup, only training andvalidation have disjoint classes, and boththe support set and the query set for thoseclasses come from the development set. Dur-ing testing, the support set that is used totrain the final model comes from the devel-opment set, whereas the query set is exactly

the same as the test set and includes disjointsamples from all classes.

3.2. Modeling

The following FSL algorithms are used inthe study (Figure 2): k-nearest neigh-bors (k-NN) and Baseline++ (Chen et al.,2019), matching networks (Vinyals et al.,2016), prototypical networks (Snell et al.,2017), relation networks (Sung et al., 2018),MAML (Finn et al., 2017) and Proto-MAML (Triantafillou et al., 2020). We alsoimplemented the following CSL-based classimbalance techniques: upsampling with uni-form sampling based on the ground truthclass during training, bias initialization (BI)with the log of frequency in the trainingset (Lin et al., 2017), inverse frequencyweighting (IFW), focal loss (FL) (Lin et al.,2017), and the combination of BI/IFW andFL/IFW (Appendix A for details).

3.3. Metrics

We report the balanced accuracy (a.k.a. nor-malized sensitivity, or macro recall) sepa-rately for the common, rare, and all classesfor the real-world evaluation. The balancedaccuracy is used to account for the classimbalance issue to avoid overweighting thecommon classes. We also reported top-1 all-way accuracy in the Appendix. We furtherreport the 95% binomial confidence interval,derived from the mean µ and standard devi-ation σ of accuracies of E episodes of FSL orE runs of the CSL-based model, which canbe expressed as µ± 1.96 σ√

E.

4. Experiments

4.1. Data

We use the clinical skin images dataset col-lected by a teledermatology service serving17 different clinical sites (Liu et al., 2020).We divide the data to different subsets asillustrated in Figure 3 according to the tem-poral order (75% in the development set and

5


Figure 3: Differences between the standard (a) and the real-world (b) FSL evaluation set-tings. We report metrics over test query sets illustrated with bold box.

25% in the test set). Specifically, the test setis chosen to the more recent patient visits, tomimic the real world setting as using earlierdata for model training and deploying themodel to serve future patients. The develop-ment set is further partitioned into train andvalidation sets based on per skin conditionstratified sampling to ensure enough samplesfor training and validation. The statistics ofthe data splits are shown in Table A1.

There are a total of 419 skin conditions inthe dataset. We define the common classesbased on the selection criteria used in (Liuet al., 2020) (more than 100 cases in the de-velopment set), and the rare classes as thosewith (1) more than 20 cases in the develop-ment set, and (2) more than 5 cases in thetest set. The criteria was established in orderto ensure sufficient examples for training andevaluation. For other extremely rare classes(i.e., classes with


Figure 4: Standard FSL evaluation (5-way-5-shot accuracy with 95% confi-dence interval) on the telederma-tology dataset. We investigate thechange of N and k values duringtraining. x-axis is the FSL settingduring training.

atively optimal option to achieve the bestperformance under the standard FSL eval-uation (Figure 4; N -way-5-shot bars whereN = 5, 10, 14). However, increasing k is nothelpful, possibly because of overfitting (Fig-ure 4; 14-way-k-shot bars where k = 5, 9).This finding is consistent with the previ-ous literature (Snell et al., 2017). Amongall methods, Proto-MAML and prototypi-cal networks are comparable and consistentlyoutperforms the others for the teledermatol-ogy dataset. This dataset has shown similarproperties as other FSL datasets under thestandard FSL evaluation.

5.2. Real-world Evaluation

FSL-based Approaches In the real-world setting, we report all-way (69-way)performance (Figure 5, Table A4) on the testset (i.e. query set of the testing phase; seeFigure 3 (b)).

The balanced accuracy is chosen as themetric to avoid overweighting the commonclasses. We conduct the N -way-k-shot ex-periments again for finding the optimal train-ing setting under the real-world setup. Dif-ferent from the findings in the standard FSLevaluation, we find that the optimal N and k

values are not quite consistent across meth-ods. The best performant method in thereal-world FSL evaluation is also matchingnetworks. In Figure 5 (a, b) and Table A4,we also find that the FSL methods consis-tently perform better on the rare classes thanthe common classes.

In brief, we find that the conclusion fromthe standard FSL evaluation that higher N -way training yields better meta-learner is in-consistent with the real-world evaluation re-sults. Thus, more in-depth exploration is re-quired in real world to identify the optimalN -way setting. We also confirm our hypoth-esis that the FSL models can be beneficial forrare class prediction but may not be for com-mon classes. For the common class predic-tion, relying on the CSL methods may be thebetter option in this case. We later use thetop performant models based on validation,which is the matching network model trainedin 14-way-9-shot setting, for the model en-semble experiments below.

CSL-based Class Imbalance Tech-niques Next, we compare the real-worldall-way classification performance betweenFSL methods and CSL-based class imbal-ance techniques under the single model (non-ensemble) setting. In Figure 6 and Ta-ble A51, we find that the best FSL model(matching network) is only slightly betterthan the CSL baseline for the rare classprediction, possibly because of less trainingdata. Yet the CSL-based class imbalancetechniques yield better balance between com-mon and rare classes, as demonstrated withall classes balanced accuracy. Moreover, theCSL-based model integrating focal loss andinverse frequency weighting (FL/IFW) yieldseven better performance on the rare classclassification. Such results suggests that FSL

1. We further included results for prior correction,another class imbalance approach, in AppendixTable A5, and the method description in Ap-pendix A.1.

7


Figure 5: Real-world FSL evaluation. We report the balanced accuracy for (a) commonclass (b) rare class (c) all class prediction using different N -way-k-shot settings.

is not beneficial when used independently inthis real-world setting. One possible expla-nation could be that given FSL models arenot using all the data from common classesduring training, it hinders the learning ofgood low-level visual features, whereas CSL-based models are able to take advantage ofit.

Figure 6: Comparison between FSL andCSL-based class imbalance tech-niques on the all-way skin condi-tion classification.

Model Ensemble To better utilize theFSL and CSL-based class imbalance tech-niques, we experiment on ensembling thetrained models. We approximate the jointensemble output probability by comput-ing the geometric mean of the probabilitiesacross M selected models:

Pc =

(M∏m=1

P(m)c

) 1m

where P(m)c is the probability the mth model

in the ensemble gives to class c. We com-pute the corresponding ensemble logits by

first normalizing each model logits (f(m)c ) us-

ing LogSumExp(·) (note this normalizationdoes not alter the probabilities) before ag-gregating the logits from different models:

f̄c =1

M

M∑m=1

(f (m)c − log

(C∑i=1

exp(f(m)i

)))

where f̄c is the normalized logit for the en-semble. We apply Softmax(·) to the nor-malized logits to compute the final ensembleprobabilities.

To utilize the CSL-based methods in com-mon class prediction, and FSL in rare classprediction, we further use the predictionfrom CSL-based methods if the ensembleprediction falls into the group of commonclasses, and use the prediction from FSL if itis from the group of rare classes while ensem-bling FSL with CSL-based models. Based onour hypothesis and the results on the vali-dation set, we use the best performant FSLmodel for rare classes, which is matching net-work, and the best CSL-based class imbal-ance model for common classes, which is up-sampling method, as basic components forensemble. In Figure 7 (a) and Table A6,we find that FSL-only ensembles do not per-form well, which is consistent with the ob-servations from the previous single modelsetting. In contrast, ensembling FSL withCSL-based methods leads to a slight de-crease in the common classes yet some im-provement over rare classes and all classes interms of balanced accuracy (Figure 7); en-sembling the matching network model withthe CSL model with upsampling technique

8


Figure 7: Model ensemble. (a) 2-model ensemble to show the improvement after using FSL.(b) 4-model ensemble to demonstrate the improvement after ensembling modelswith different mechanisms.

(Figure 7 (a) orange) tends to strike a betterbalance between common and rare classes.In Figure 7 (b), we find that making the en-semble more heterogeneous by using modelswith different mechanisms yields even bettertrade-off between common and rare classesalong with a more notable performance in-crease, especially for rare classes. These find-ings confirm our hypothesis that the ensem-ble of FSL and CSL can benefit such real-world imbalanced skin condition classifica-tion problem.

5.3. Qualitative Analysis

In Figure 8, we present eight examples fromrare classes alongside with predictions of dif-ferent models. We show that the FSL andCSL-based class imbalance techniques pre-dict the rare classes more accurately whilethe CSL baseline tends to be biased towardthe common classes such as acne, eczemaand psoriasis. For example, eczema is a verycommon skin condition with various presen-tations, but it can be visually similar to drugrash or other skin lesions. While the base-line tends to over-predict common classes,the FSL can correctly distinguish the drugrash from eczema or inflicted skin lesions.

6. Conclusion

In this work, we propose the real-world eval-uation framework to fairly assess and com-

pare the performance between FSL and CSL-based methods on the all-way skin conditionclassification problem, where extreme classimbalance exists. We find that FSL meth-ods do not outperform class imbalance tech-niques, yet when ensembled with CSL-basedclass imbalance techniques, they lead to im-proved model performance especially for therare classes. We also observe that ensemblingthe models with different strengths and moreheterogeneity, such as upsampling and FSL,yields promising results.

To further improve the FSL methods, re-searchers have, in recent times, proposedstronger representation learning strategiesvia feature reuse (Raghu et al., 2020), orself-supervised learning (Tian et al., 2020).We believe that evaluating the FSL methodson real-world benchmarks and use cases asthe one outlined in this work can greatly ac-celerate progress in the development of FSLmethods with real-world utility.

For future deployment and real-world im-pact, we also need to account for poten-tial model failures. For example, false pos-itive cases may lead to wastage of medicalresources while false negatives may lead todelayed treatment, especially for malignantcases. The ensemble of both CSL and FSLapproaches may mitigate these issues, yet toassess the generalizability of the results oneneeds to evaluate our proposed methods on

9


Figure 8: Case study of rare classes. FSL and class imbalance techniques are better atidentifying rare classes compared to the baseline model.

other datasets and domains, which is an in-teresting future direction.

Acknowledgments

The authors thank Dr. Jaehoon Lee for de-tailed comments and reviews, Dr. Greg Cor-rado and all members of the Google HealthDermatology team for their support.

References

Leo Breiman. Random forests. Machinelearning, 45(1):5–32, 2001.

Nitesh V Chawla, Kevin W Bowyer,Lawrence O Hall, and W PhilipKegelmeyer. Smote: synthetic mi-nority over-sampling technique. Journalof artificial intelligence research, 16:321–357, 2002.

Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira,Yu-Chiang Frank Wang, and Jia-BinHuang. A closer look at few-shot classi-fication. ICLR, 2019.

Corinna Cortes and Vladimir Vapnik.Support-vector networks. Machine learn-

10


ing, 20(3):273–297, 1995.

Angel Alfonso Cruz-Roa, John Edi-son Arevalo Ovalle, Anant Madabhushi,and Fabio Augusto González Osorio.A deep learning architecture for imagerepresentation, visual interpretability andautomated basal-cell carcinoma cancerdetection. MICCAI, 2013.

Elad Eban, Mariano Schain, Alan Mackey,Ariel Gordon, Ryan Rifkin, and GalElidan. Scalable learning of non-decomposable objectives. Artificial Intelli-gence and Statistics, pages 832–840, 2017.

Andre Esteva, Brett Kuprel, Roberto ANovoa, Justin Ko, Susan M Swetter,Helen M Blau, and Sebastian Thrun.Dermatologist-level classification of skincancer with deep neural networks. Nature,542(7639):115–118, 2017.

Chelsea Finn, Pieter Abbeel, and SergeyLevine. Model-agnostic meta-learning forfast adaptation of deep networks. ICML,2017.

Amirata Ghorbani, Vivek Natarajan, DavidCoz, and Yuan Liu. DermGAN: Syn-thetic generation of clinical skin imageswith pathology. Proceedings of the Ma-chine Learning for Health NeurIPS Work-shop, pages 155–170, 2020.

Yunhui Guo, Noel C Codella, Leonid Karlin-sky, James V Codella, John R Smith, KateSaenko, Tajana Rosing, and Rogerio Feris.A broader study of cross-domain few-shotlearning. ECCV, 2020.

Gregory Koch, Richard Zemel, and RuslanSalakhutdinov. Siamese neural networksfor one-shot image recognition. ICML deeplearning workshop, 2, 2015.

Brenden Lake, Ruslan Salakhutdinov, JasonGross, and Joshua Tenenbaum. One shot

learning of simple visual concepts. Pro-ceedings of the annual meeting of the cog-nitive science society, 33(33), 2011.

Patrice Latinne, Marco Saerens, and Chris-tine Decaestecker. Adjusting the outputsof a classifier to new a priori probabilitiesmay significantly improve classification ac-curacy: evidence from a multi-class prob-lem in remote sensing. ICML, 1:298–305,2001.

Duyen NT Le, Hieu X Le, Lua T Ngo, andHoan T Ngo. Transfer learning with class-weighted and focal loss function for au-tomatic skin cancer classification. arXivpreprint arXiv:2009.05977, 2020.

Xiaomeng Li, Lequan Yu, Chi-Wing Fu, andPheng-Ann Heng. Difficulty-aware meta-learning for rare disease diagnosis. MIC-CAI, 2020.

Tsung-Yi Lin, Priya Goyal, Ross Girshick,Kaiming He, and Piotr Dollár. Focal lossfor dense object detection. ICCV, pages2980–2988, 2017.

Yuan Liu, Ayush Jain, Clara Eng, David HWay, Kang Lee, Peggy Bui, KimberlyKanada, Guilherme de Oliveira Marinho,Jessica Gallegos, Sara Gabriele, et al. Adeep learning system for differential diag-nosis of skin diseases. Nature Medicine,pages 1–9, 2020.

Kushagra Mahajan, Monika Sharma, andLovekesh Vig. Meta-DermDiagnosis: Few-shot skin disease identification using meta-learning. Proceedings of the IEEE/CVFConference on Computer Vision and Pat-tern Recognition Workshops, pages 730–731, 2020.

Viraj Prabhu, Anitha Kannan, MuraliRavuri, Manish Chaplain, David Sontag,

11


and Xavier Amatriain. Few-shot learn-ing for dermatological disease diagnosis.MLHC, 2019.

Aniruddh Raghu, Maithra Raghu, SamyBengio, and Oriol Vinyals. Rapid learningor feature reuse? towards understandingthe effectiveness of MAML. ICLR, 2020.

Sachin Ravi and Hugo Larochelle. Opti-mization as a model for few-shot learning.ICLR, 2017.

Divya Seth, Khatiya Cheldize, DanielleBrown, and Esther E Freeman. Globalburden of skin disease: inequities and in-novations. Current dermatology reports, 6(3):204–210, 2017.

Jake Snell, Kevin Swersky, and RichardZemel. Prototypical networks for few-shot learning. Advances in neural informa-tion processing systems, pages 4077–4087,2017.

Flood Sung, Yongxin Yang, Li Zhang, TaoXiang, Philip HS Torr, and Timothy MHospedales. Learning to compare: Rela-tion network for few-shot learning. CVPR,pages 1199–1208, 2018.

Yonglong Tian, Yue Wang, Dilip Krishnan,Joshua B Tenenbaum, and Phillip Isola.Rethinking few-shot image classification:a good embedding is all you need? ECCV,2020.

Eleni Triantafillou, Tyler Zhu, VincentDumoulin, Pascal Lamblin, Utku Evci,Kelvin Xu, Ross Goroshin, Carles Gelada,Kevin Swersky, Pierre-Antoine Manzagol,et al. Meta-Dataset: A dataset of datasetsfor learning to learn from few examples.ICLR, 2020.

Oriol Vinyals, Charles Blundell, TimothyLillicrap, Daan Wierstra, et al. Matchingnetworks for one shot learning. Advances

in neural information processing systems,pages 3630–3638, 2016.

Wei-Hung Weng, Yuannan Cai, AngelaLin, Fraser Tan, and Po-Hsuan CameronChen. Multimodal multitask representa-tion learning for pathology biobank meta-data prediction. Machine Learning forHealth NeurIPS Workshop, 2019.

Jufeng Yang, Xiaoxiao Sun, Jie Liang, andPaul L Rosin. Clinical skin lesion diagnosisusing representations inspired by derma-tologist criteria. CVPR, pages 1258–1266,2018.

12


Appendix A. Overview of the ClassImbalance Methods

A.1. CSL-based class imbalancetechniques

The following list describes the details ofclass imbalance techniques adopted for theCSL.

• Upsampling: on top of the CSL baselinewith the cross entropy loss, the uniformsampling is performed across all classesduring training.

• Bias initialization (BI) (Lin et al., 2017):the final layer of the network, i.e., classi-fication heads, is initialized with bias ofthe log values of the amount of trainingexamples in the class.

• Inverse frequency weighting (IFW): thecross entropy loss is further weighted bythe inverse frequency of the amount oftraining examples in the class.

• Focal loss (FL) (Lin et al., 2017): thecross entropy loss is extended to theα-balanced cross entropy loss that ac-counts for the class imbalance by mul-tiplying a weighting factor α ∈ [0, 1]:LαCE = −

∑i αi log(pi), where ai = a

for correct prediction, and ai = 1 − aotherwise. It also attempts to down-weigh easy samples and focus on hardsamples by introducing a modulatingfactor (1 − pi)γ with a focusing pa-rameter γ ≥ 0. The objective canbe expressed as Lfocal = −

∑i αi(1 −

pi)γ log(pi) , where M is a multiplier.

• Prior correction: predictions from aCSL baseline model are further dividedby maximum-likelihood estimates forclass priors based on the training set.This method can be seen as a specific usecase of the post-hoc correction methodintroduced by (Latinne et al., 2001) to

account for label shift and calibrate themodel predictions to a domain with uni-form class distribution.

We further combine BI/IFW, as well asFL/IFW, for the combined class imbalancetechniques.

A.2. FSL algorithms

The methods for FSL can be categorized intobatch training and episodic training meth-ods. For batch learning, we explore the fol-lowing:

• k-nearest neighbor (k-NN): the k-NNmethod classify the given test examplebased on the cosine distance between thetest example and the class centroids inthe vector space.

• Baseline++ (Chen et al., 2019): Base-line++ is a finetune model that uses thesupport set in testing episodes to trainthe final layer on top of the embeddings.

For the episodic training, we select the fol-lowing meta-learners from the metric-basedand optimization-based algorithms:

• Metric-based algorithms

– Matching networks (Vinyals et al.,2016): matching networks use theaverage of class example embed-dings, which is computed by bi-lateral long short term memory(LSTM), as class prototypes, andclassify query examples based onthe cosine distance-weighted linearcombination of the support labels.

– Prototypical networks (Snell et al.,2017): prototypical networks usethe average of embeddings learnedfrom CNN as class prototypes, andclassify query examples by comput-ing the Euclidean distance betweenthe embeddings of prototypes andqueries.

13


– Relation networks (Sung et al.,2018): relation networks also aver-age the embeddings to create pro-totypes, yet concatenate the pro-totype of each class with query ex-amples, and use the relation mod-ule, which is parameterized by ad-ditional layers, to predict the classprobability given the query.

• Optimization-based algorithms

– Model Agnostic Meta-Learning(MAML) (Finn et al., 2017):MAML is an optimization-basedmethod that attempt to learn aset of optimal initialization pa-rameters to fast learn on differentdownstream tasks with smallnumber of gradient steps. Weuse the first-order approximationMAML in our study.

– Proto-MAML (Triantafillou et al.,2020): Proto-MAML combines theideas of prototypical networks andMAML by reintepreting the proto-typical network as a linear layer ontop of the learned embedding (Snellet al., 2017). Then the lin-ear layer with prototpical network-equivalent weights and bias can beused as the initialization of the lastlayer of MAML during training.

14


Split Train Validation Test

Patient number 9249 302 2755Image number 11403 526 3136

Table A1: Statistics of the skin dataset in the study.

#Class Included classes

Common 27

Acne, Actinic Keratosis, Allergic Contact Dermatitis, Alopecia Areata,Androgenetic Alopecia, Basal Cell Carcinoma, Cyst, Eczema, Folliculitis,Hidradenitis, Lentigo, Melanocytic Nevus, Melanoma,Other, Post Inflammatory Hyperpigmentation, Psoriasis,Squamous cell carcinoma/squamous cell carcinoma in situ (SCC/SCCIS),seborrheic keratosis/irritated seborrheic keratosis (SK/ISK),Scar Condition, Seborrheic Dermatitis, Skin Tag, Stasis Dermatitis,Tinea, Tinea Versicolor, Urticaria, Verruca vulgaris, Vitiligo

Rare 42

Abscess, Acanthosis nigricans, Acne keloidalis, Amyloidosis of skin,Central centrifugal cicatricial alopecia, Condyloma acuminatum,Confluent and reticulate papillomatosis, Cutaneous lupus,Dermatofibroma, Dissecting cellulitis of scalp, Drug Rash,Erythema nodosum, Folliculitis decalvans, Granuloma annulare,Hemangioma, Herpes Simplex, Herpes Zoster,Idiopathic guttate hypomelanosis, Inflicted skin lesions, Insect Bite,Intertrigo, Irritant Contact Dermatitis, Keratosis pilaris,Lichen Simplex Chronicus (LSC), Lichen planus/lichenoid eruption,Lichen sclerosus, Lipoma, Melasma, Milia, Molluscum Contagiosum,Onychomycosis, Paronychia, Perioral Dermatitis, Photodermatitis,Pigmented purpuric eruption, Pityriasis rosea, Prurigo nodularis,Pyogenic granuloma, Rosacea, Scabies, Telogen effluvium, Xerosis

Table A2: Detailed categorization for common and rare skin conditions in the study.

15


Train 5-way-5-shot 10-way-5-shot 14-way-5-shot 14-way-9-shotMethod Eval 5-way-5-shot 5-way-5-shot 5-way-5-shot 5-way-5-shot

k-NN 0.544±0.010 0.55±0.010 0.548±0.010 0.545±0.010Baseline++ 0.475±0.010 0.469±0.010 0.593±0.010 0.488±0.010

Matching networks 0.597±0.010 0.592±0.010 0.566±0.010 0.581±0.010Prototypical networks 0.587±0.010 0.613±0.010 0.610±0.010 0.605±0.010

Relation networks 0.585±0.010 0.586±0.010 0.578±0.010 0.559±0.010MAML 0.585±0.010 0.594±0.010 0.587±0.010 0.580±0.010

Proto-MAML 0.599±0.010 0.628±0.010 0.626±0.010 0.595±0.010

Table A3: Standard FSL evaluation for the skin dataset with the change of N and k value(5-way-5-shot accuracy, with 95% confidence interval).“14-way-9-shot” indicatesthat we use 14-way-9-shot for the training in meta-learning, but use 5-way-5-shot for all evaluations. The boldface values indicate the best setting for eachFSL algorithm. We found that higher way is in general helpful to improve theperformance, yet the 10-way-5-shot training is relatively optimal for standardFSL using the teledermatology dataset to prevent overfit/underfit while per-forming meta-learning. Proto-MAML and prototypical networks outperformsother methods.

Method 5-way-5-shot 10-way-5-shot 14-way-5-shot 14-way-9-shot

k-NN 0.137 (0.094/0.168) 0.134 (0.098/0.159) 0.117 (0.101/0.125) 0.136 (0.093/0.161)Baseline++ 0.048 (0.044/0.052) 0.050 (0.044/0.055) 0.042 (0.041/0.044) 0.049 (0.039/0.056)

Matching networks 0.235 (0.097/0.325) 0.228 (0.135/0.302) 0.229 (0.118/0.304) 0.264 (0.130/0.350)Prototypical networks 0.211 (0.159/0.244) 0.219 (0.157/0.262) 0.219 (0.159/0.261) 0.213 (0.168/0.245)

Relation networks 0.038 (0.026/0.046) 0.056 (0.036/0.071) 0.090 (0.047/0.119) 0.072 (0.045/0.091)MAML 0.078 (0.042/0.101) 0.181 (0.056/0.250) 0.157 (0.065/0.218) 0.210 (0.080/0.296)

Proto-MAML 0.123 (0.089/0.146) 0.184 (0.107/0.237) 0.213 (0.122/0.274) 0.240 (0.134/0.308)

Table A4: Real-world FSL evaluation. We report the performance of models trained by dif-ferent N -way-k-shot setting. The value outside the parentheses is the balancedaccuracy of ALL classes, and the values inside the parenthesis are the balancedaccuracy of common/rare classes, respectively. The boldface values indicate thebest algorithm for each training setup. Different from the standard FSL evalua-tion, higher N and k yield better meta-learner in the real-world FSL evaluation.The performance of classifying rare classes is consistently better than classifyingthe common classes. Matching networks and Proto-MAML using 14-way-9-shottraining outperforms the other methods under the real-world evaluation. We usethem for the method comparison and model ensemble.

16


All-way accuracyBalancedaccuracy

Balanced acc.(common class)

Balanced. acc.(rare class)

FSL 0.147 0.264 0.130 0.359CSL baseline 0.648 0.411 0.525 0.347Upsampling 0.610 0.504 0.575 0.458

BI 0.661 0.420 0.537 0.345IFW 0.621 0.494 0.523 0.476

BI/IFW 0.620 0.506 0.515 0.500FL 0.657 0.429 0.530 0.364

FL/IFW 0.395 0.552 0.455 0.616Prior Correction 0.450 0.580 0.518 0.620

Table A5: Comparison between FSL, CSL baseline and other class imbalance techniquesin the all-way classification problem under single model (non-ensemble) setting.The boldface values indicate the best method for each evaluation metric. Notethat all-way accuracy is a biased metric towards common classes, as the test setis dominated with common classes. Also note that we include here an additionalclass-imbalance method ”Prior Correction” (for details please refer to AppendixA.1). Prior correction achieves the greatest balanced accuracy, an observationtheoretically supported by it’s interpretation as a label-shift correction and thefact that balanced accuracy is equivalent to overall accuracy in a domain withuniform class distribution. Single FSL model (matching networks) is not bene-ficial while using it independently for all-way classification. Instead, single CSLmodel with upsampling strategy outperforms others in the common class classi-fication, and single CSL model with focal loss/inverse frequency weighting yieldsthe best performance for the rare class classification.

17


#Model CombinationAll-wayaccuracy

Balancedaccuracy

Balanced acc.(common class)

Balanced acc.(rare class)

2 2 FSL 0.233 0.340 0.215 0.4212 CSL Baseline 0.663 0.431 0.523 0.371

FSL+CSL Baseline 0.633 0.435 0.506 0.3892 Upsampling 0.617 0.509 0.58 0.464

FSL+Upsampling 0.582 0.524 0.557 0.503

4 4 FSL 0.24 0.363 0.216 0.4594 Upsampling 0.623 0.529 0.591 0.489

FSL+Upsampling+BI/IFW+FL/IFW 0.582 0.558 0.541 0.569

Table A6: Comparison between ensemble FSL, CSL baseline and CSL-based class imbalancetechniques in the all-way classification problem under ensemble setting. Notethat all-way accuracy is a biased metric towards common classes, as the testset is dominated with common classes. In all-way classification, the balancedaccuracy on all classes improves after ensembling with the FSL model (2 CSLbaseline versus FSL+CSL baseline, and 2 upsampling versus FSL+upsampling).Similarly, combining models with different learning mechanisms helps improvethe balanced accuracy across all classes (4 FSL versus 4 upsampling versus 4different models).

18

1 Introduction2 Related Work2.1 Class Imbalance2.2 Machine Learning and FSL in Dermatology

3 Methods3.1 FSL Task and Evaluation Setup3.2 Modeling3.3 Metrics

4 Experiments4.1 Data4.2 Training Frameworks

5 Results5.1 Standard FSL Evaluation5.2 Real-world Evaluation5.3 Qualitative Analysis

6 ConclusionA Overview of the Class Imbalance MethodsA.1 CSL-based class imbalance techniquesA.2 FSL algorithms

arxiv:2010.04308v2 [cs.cv] 14 nov 2020 · 2020. 11. 17. · arxiv:2010.04308v2 [cs.cv] 14 nov 2020....

Documents