Few-shot Learning with Global RelatednessDecoupled-Distillation
Yuan Zhou, Yanrong Guo, Shijie Hao, Richang Hong, Zhengjun Zha, Meng Wang
[email protected]
[Figure 1 here. Panel (a): the conventional metric learning based method, in which the meta-learner f(·|θ_1) is trained on random episodic labels over query and support images, with a label predictor (e.g. cosine similarity) and a cross-entropy loss, updating θ_1. Panel (b): our Global Relatedness Decoupled-Distillation (GRDD), in which Stage 1 learns global category knowledge with the global-learner f(·|θ_2) over global features, and Stage 2 performs Relatedness Decoupled-Distillation (RDD), distilling the groups of decoupled global relatedness to the meta-learner's episodic features.]
Figure 1. A brief illustration of our Global Relatedness Decoupled-Distillation (GRDD) method (b) for training the meta-learner f(·|θ_1), compared with the conventional metric learning based method (a). Of note, during the relatedness distillation, the well-trained global-learner f(·|θ_2) is frozen.
Abstract
Despite the success that metric learning based approaches have achieved in few-shot learning, recent works reveal the ineffectiveness of their episodic training mode. In this paper, we point out two potential reasons for this problem: 1) the random episodic labels can only provide limited supervision information, while the relatedness information between the query and support samples is not fully exploited; 2) the meta-learner is usually constrained by the limited contextual information of the local episode. To overcome these problems, we propose a new Global Relatedness Decoupled-Distillation (GRDD) method using the global category knowledge and the Relatedness Decoupled-Distillation (RDD) strategy. Our GRDD learns new visual concepts quickly by imitating the habit of humans, i.e. learning from the deep knowledge distilled from the teacher. More specifically, we first train a global learner on the entire base subset using category labels as supervision to leverage the global context information
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Woodstock '18, June 03-05, 2018, Woodstock, NY
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456
of the categories. Then, the well-trained global learner is used to simulate the query-support relatedness in global dependencies. Finally, the distilled global query-support relatedness is explicitly used to train the meta-learner using the RDD strategy, with the goal of making the meta-learner more discriminative. The RDD strategy aims to decouple the dense query-support relatedness into the groups of sparse decoupled relatedness. Moreover, only the relatedness of a single support sample with other query samples is considered in each group. By distilling the sparse decoupled relatedness group by group, sharper relatedness can be effectively distilled to the meta-learner, thereby facilitating the learning of a discriminative meta-learner. We conduct extensive experiments on the miniImagenet and CIFAR-FS datasets, which show the state-of-the-art performance of our GRDD method.
CCS Concepts: • Computing methodologies → Learning latent representations.
Keywords: Few-shot learning, Global relatedness, Relatedness decoupled-distillation, Metric learning
ACM Reference Format:
Yuan Zhou, Yanrong Guo, Shijie Hao, Richang Hong, Zhengjun Zha, Meng Wang. 2018. Few-shot Learning with Global Relatedness Decoupled-Distillation. In Woodstock '18: ACM Symposium on Neural Gaze Detection, June 03-05, 2018, Woodstock, NY. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/1122445.1122456
arXiv:2107.05583v2 [cs.CV] 22 Sep 2021
1 Introduction
In recent years, deep learning has achieved impressive success in computer vision tasks such as image classification [14, 38], object detection [33, 34] and semantic segmentation [4, 9]. However, it is well known that deep-learning models tend to overfit given scarce training samples and perform far from satisfactorily. In contrast, humans are still able to learn new visual concepts quickly in data-scarce circumstances. This motivates the emergence of research on the few-shot learning (FSL) problem [7], i.e., having the machine learning system quickly learn new visual concepts from only one or a few labelled training examples. Intuitively, a straightforward solution to the overfitting
problem is augmenting the target training dataset, e.g., by data synthesis [13, 37] or large-scale weakly labelled or unlabelled datasets [6, 47]. However, the main problem of the data augmentation based approach is that the augmentation policies need to be tailored for different datasets due to domain gaps [44]. As one of the most widely used methods, the metric learning based method has achieved promising performance in FSL while maintaining high flexibility. In general, it aims to train a meta-learner for learning transferable feature embeddings from the known categories (i.e., the categories of the auxiliary base subset D_base with adequate training data). To bridge the gaps between the training and testing phases, episodic meta-training is designed [43]. Therefore, for the target FSL task whose sample categories are unobserved, the meta-learner first encodes the query and support samples into the embedding domain. Then, the query samples are matched with the support sample categories with the highest similarity [43] or the lowest distance [39]. Despite the success achieved by metric learning based
methods, recent works [5, 42] show that their episodic training mode is ineffective or even unnecessary. In this paper, we first point out two potential reasons for this phenomenon: 1) the random episodic labels can only offer limited supervision information; 2) the meta-learner is generally constrained by the limited intra- and inter-categorical context dependencies of the local episode. These issues limit the model's capability in producing high-quality transferable feature embeddings, and thus suppress model performance (the analysis is provided in Table 2 and Figure 3 of Section 4.3). To overcome these problems, we propose a new metric learning based method, named Global Relatedness Decoupled-Distillation (GRDD), which mimics the human habit of learning new concepts quickly, i.e., learning from deep knowledge distilled by the teacher. The differences between our GRDD and the previous typical metric learning based methods are shown in Figure 1. In the previous metric learning based methods (e.g. [39, 41, 43]), the meta-learner learns from the randomly constructed episodic labels whose supervision information is limited. In contrast, our GRDD
utilizes the global relatedness between the query and support samples to train the meta-learner, which is more informative and thus makes the learned transferable embeddings more discriminative. As can be seen in Figure 2, GRDD is designed in a two-stage training manner, as dual learners are used. In the first training stage, we train the global-learner f(·|θ_2) on the entire base subset using category labels as supervision to fully exploit the global context dependencies of the categories. Then, in the second stage, the well-trained global-learner is used as a teacher to guide the episodic meta-training of the meta-learner f(·|θ_1). Specifically, we first use the global-learner to simulate the global query-support relatedness for each episode, via leveraging the learned global category knowledge. Then, the global relatedness information is explicitly distilled to the meta-learner, which allows the meta-learner to know the samples' relatedness in the global context. To facilitate this process, we propose the Relatedness Decoupled-Distillation (RDD) strategy. It decouples the dense query-support relatedness into the groups of sparse decoupled relatedness. In particular, each group of decoupled relatedness only considers the relatedness of a single support sample with other query samples. On one hand, the sparser the relatedness is, the easier it can be distilled. On the other hand, decoupled relatedness is sharper in knowledge distillation, which is crucial in learning a discriminative meta-learner. To validate our method, extensive experiments are conducted on two public FSL datasets, i.e., miniImagenet [43] and CIFAR-FS [2], which firmly validate the effectiveness of our method.
All in all, the contributions of this paper can be summarized as follows:
• We point out the weaknesses of the current episodic training mode used in the metric learning based FSL methods, and propose a new Global Relatedness Decoupled-Distillation (GRDD) method to overcome these problems.
• Instead of the random episodic labels, we propose to explicitly use the distilled global query-support relatedness to train the meta-learner, which makes the learned transferable feature embeddings more discriminative.
• We introduce the Relatedness Decoupled-Distillation (RDD) strategy to facilitate the relatedness distillation. It decouples the entire query-support relatedness into the groups of sparse decoupled relatedness to make the relatedness information sharper and easier to distill.
• On the miniImagenet and CIFAR-FS datasets, our proposed GRDD presents state-of-the-art performance compared to other counterparts.
2 Related work
In this section, we briefly review the related FSL methods and introduce the differences between our proposed method and the most relevant approaches.
Metric learning based method. The metric learning based methods work in a learning-to-learn paradigm. They aim to train a meta-learner for learning high-quality transferable feature embeddings that can be well generalized to solve the target FSL tasks whose sample categories are unseen. Among the metric learning based methods, MatchNet [43] is a representative work. It develops the episodic meta-training to bridge the gaps between the training and testing phases of FSL, using the random episodic labels as training supervision. Snell et al. [39] further develop MatchNet by introducing prototype representations, so that query samples are categorized according to their Euclidean distances to the prototypes. Li et al. [20] propose to retrieve the global class representations by using the local features to categorize the query samples. Moreover, in [19], they introduce the adaptive margin loss to improve the feature representation of the samples by further considering the semantic relation of the categories in GloVe [26]. Unlike the above methods that use random episodic labels as supervision [19, 20, 39, 43], our GRDD explicitly uses the global query-support relatedness to train the meta-learner, with the goal of making the meta-learner more discriminative. Moreover, compared to [19], the distilled sample-wise relatedness is more fine-grained than the category-level relation in GloVe [26], and thus more implicit information can be exploited, leading to more accurate classification, as shown in Table 2. Additionally, different from the works [30, 42] that simply resort to the pretraining strategy, our GRDD aims to enhance the performance of the episodic training mode.
External memory based method. The external memory based methods are inspired by the recent success of the Neural Turing Machine [11]. As a representative work, MAML [8] proposes to design the memory module in a key-to-value paradigm. It first records the useful information of the support set into memory and then reads out the stored information to categorize the query samples. Ramalho et al. [31] boost the memory module by only memorizing the most unexpected information, thus suppressing memory redundancy. Kaiser et al. [16] design a long-term memory module suitable for solving lifelong learning problems. It should be noted that memory-augmented models generally need to be fine-tuned on the support set of the target tasks in order to obtain sufficient useful information about the new categories. In contrast, our GRDD can directly categorize the query samples without the need for fine-tuning.
Hallucination-based method. The hallucination-based FSL methods can be divided into two sub-directions, i.e. hallucination of new data [13, 45, 50] and hallucination of classifier weights [10, 27, 29]. Hariharan et al. [13] propose a non-parametric data hallucination approach that hallucinates new support features for the novel unseen categories using inter-category commonality. Wang et al. [45] propose a hallucinator that synthesizes new images with different object poses or backgrounds by introducing random noise into the original image, while Zhang et al. [50] propose to hallucinate new data using the guidance of salient objects. In contrast to hallucinating new data, [10, 27, 29] propose to hallucinate the classifier weights for the novel categories according to the feature activations of the support samples.
Transductive vs. inductive method. In traditional inductive FSL, each query sample is categorized independently. Transductive FSL, on the other hand, aims to categorize all query samples at once, or to consider the generated episodic tasks as a whole, thus leveraging information from both the support and the query sets. For example, Boudiaf et al. [3] propose to maximize the mutual information between the embedding features and the label predictions. Ziko et al. [51] propose to impose an additional constraint on category inference, i.e., nearby samples should have the same consistent label assignments. In contrast, Liu et al. [21] propose to propagate labels from labelled instances to unlabelled instances using the manifold structure of the data. As mentioned in [51], transductive methods are usually more accurate than inductive ones. Nevertheless, they face an unavoidable drawback, namely that the transductive model has to be retrained from scratch when new query samples or new episodic tasks appear. As for our GRDD, it is inductive and thus can categorize new query samples or address new tasks directly once the training phase is complete. Last but not least, our GRDD can be easily integrated into a transductive approach, such as "GRDD-TIM" in Table 1.
3 Method
In this section, we first present the preliminaries and then describe the proposed method in detail.
3.1 Preliminary
An FSL task consists of two subsets of data, commonly referred to as the support set S = {(X_i^s, y_i^s)}_{i=0}^{N_s-1} and the query set Q = {X_j^q}_{j=0}^{N_q-1}. For the default "C-way K-shot" setting, the N_s labelled support samples are prepared by randomly sampling K labelled samples from each of the C categories (i.e., N_s = C × K), while the N_q unlabelled query samples are also randomly drawn from these C categories. Note that the instances from the support and query sets are disjoint, i.e., S ∩ Q = ∅. The ultimate goal of FSL is to categorize the query samples by exploiting the prior knowledge contained in the support set, as in Equation 1:

y_j = argmax_{ŷ_j ∈ {1,...,C}} p(ŷ_j | X_j^q, S).   (1)
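To make the episodic setup concrete, here is a minimal Python sketch that samples one C-way K-shot episode. The `sample_episode` helper and its dataset format (a dict mapping each category to its list of samples) are our own illustrative assumptions, not an interface from the paper.

```python
import random

def sample_episode(dataset, C=5, K=1, Q=15, seed=None):
    """Sample one C-way K-shot episode from a dict {category: [samples]}.

    Returns a labelled support set of N_s = C * K samples and a query set of
    C * Q samples drawn from the same C categories, disjoint from the support.
    """
    rng = random.Random(seed)
    categories = rng.sample(sorted(dataset), C)
    support, query = [], []
    for episodic_label, cat in enumerate(categories):
        picks = rng.sample(dataset[cat], K + Q)  # sampling without replacement
        support += [(x, episodic_label) for x in picks[:K]]
        query += [(x, episodic_label) for x in picks[K:]]
    return support, query
```

During standard episodic meta-training, the `episodic_label` values above are exactly the randomly constructed episodic labels that, as argued in this paper, carry only limited supervision information.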
[Figure 2 here. Panel (a), training on the base subset D_base: Stage 1 (Global Category Knowledge Learning) feeds the mini-batch images and their rotated copies through the global-learner f(·|θ_2), predicts the category labels and rotation labels with fully connected heads under CE losses, and updates θ_2; Stage 2 (Relatedness Decoupled-Distillation) feeds the episode data (query X_1^q to X_{N_q}^q and support X_1^s to X_{N_s}^s) through both f(·|θ_2) and the meta-learner f(·|θ_1), decouples the query-support relatedness of each into per-support groups, and matches the groups one by one with KL losses that are summed. Panel (b), testing on the target task: a logistic regression classifier over the features of f(·|θ_1) categorizes the query samples.]
Figure 2. The overview of our Global Relatedness Decoupled-Distillation (GRDD) method. In particular, "CE" stands for the cross-entropy loss and "KL" for the KL divergence, while f(·|θ_1) and f(·|θ_2) represent the meta-learner and the global-learner, respectively. Moreover, "FC" indicates a fully connected layer, and "SUM" denotes the summation operation.
In Equation 1, p(ŷ_j | X_j^q, S) gives the probability that the query sample X_j^q is classified as the label ŷ_j conditioned on the support set S. To handle the FSL task, the metric learning based approach resorts to an auxiliary base subset D_base. Note that the label spaces of the base subset D_base and the target FSL task are disjoint. As with current metric learning based methods, episodic meta-training is commonly used to bridge the gaps between the training and testing phases of FSL, using the randomly constructed episodic labels as training supervision, as in [39, 41, 43]. Obviously, the episodic training mode has advantages in training a meta-learner with high generalization. Nevertheless, recent works (e.g., [42, 46]) reveal its ineffectiveness in training the FSL model. Therefore, this paper highlights two potential problems of the current episodic meta-training and proposes a new metric learning based method to alleviate them.
3.2 Global Relatedness Decoupled-Distillation
We propose a new metric learning based method, called Global Relatedness Decoupled-Distillation (GRDD), which aims to imitate the human habit of learning novel concepts, i.e., learning from deep knowledge distilled by the teacher. In our GRDD method, two different learners are used, called the global-learner f(·|θ_2) and the meta-learner f(·|θ_1). Accordingly, as shown in Figure 2, GRDD is designed in a two-stage training manner. In the first stage, the global-learner f(·|θ_2) is trained on the entire base subset D_base, using the category labels as training supervision. In this way, category knowledge can be exploited in the global contextual dependencies.
In the second training stage, we then use the global query-support relatedness distilled from the global-learner f(·|θ_2) to train the meta-learner f(·|θ_1) based on the episodic training mode. To facilitate the relatedness learning, we propose the Relatedness Decoupled-Distillation (RDD) strategy in our GRDD, which decouples the dense query-support relatedness into the groups of sparse decoupled relatedness, making the relatedness sharper and easier to distill. Sections 3.2.1 and 3.2.2 present these two training stages in detail.
3.2.1 Global Category Knowledge Learning. To fully exploit the global context dependencies of the categories, we train the global-learner f(·|θ_2) on the entire base subset D_base in the first training stage, using the category labels as supervision. In this process, we use the well-known mini-batch training strategy for fast model convergence. We also employ the data augmentation strategy of [30], which augments the input mini-batch images {X_i}_{i=0}^{N_b-1} by rotating them by 90°, 180° and 270°, respectively, and obtains {X_i^r}_{i=0,r=0}^{N_b-1,270}. Accordingly, the one-hot rotation labels {o_i^r}_{i=0,r=0}^{N_b-1,270} are constructed. Note that in the following sections r = 0, 90, 180 or 270 is used when there is no special statement. As shown in Figure 2, we first send the augmented mini-batch data to the global-learner f(·|θ_2) and use it to extract their high-level feature representations {v_i^r}_{i=0,r=0}^{N_b-1,270}, as in Equation 2:

v_i^r = f(X_i^r | θ_2)   (2)

where v_i^r denotes the features of X_i^r, while θ_2 gives the learnable parameters of the global-learner. Then, a fully connected
layer FC(·|θ_3) is applied to the features v_i^r to predict their categories, as in Equation 3:

c_i^r = FC(v_i^r | θ_3).   (3)

In Equation 3, c_i^r denotes the one-hot category prediction for v_i^r, while θ_3 indicates the learnable parameters of the fully connected layer. After that, another fully connected layer FC(·|θ_4) is applied to v_i^r, aiming to infer the label of the rotation angle, as described in Equation 4:

z_i^r = FC(v_i^r | θ_4).   (4)
Finally, the ground-truth category labels {y_i}_{i=0}^{N_b-1} and the rotation labels {o_i^r}_{i=0,r=0}^{N_b-1,270} are used to jointly optimize the entire network, as shown below:

θ'_2 = θ_2 - lr_1 · ∂(L_c + L_r)/∂θ_2   (5)

θ'_3 = θ_3 - lr_1 · ∂(L_c + L_r)/∂θ_3   (6)

θ'_4 = θ_4 - lr_1 · ∂(L_c + L_r)/∂θ_4   (7)

where

L_c = -(1/(N_b · 4)) Σ_{i=0}^{N_b-1} Σ_{r=0}^{270} Σ_{k=0}^{C_base-1} one_hot(y_i)_k · log(c_i^r)_k   (8)

and

L_r = -(1/(N_b · 4)) Σ_{i=0}^{N_b-1} Σ_{r=0}^{270} Σ_{k=0}^{3} one_hot(o_i^r)_k · log(z_i^r)_k.   (9)

Note that the learning rate lr_1 is initialized as 5e-2 in Equations 5, 6 and 7, and decays in the "poly" manner. one_hot(·) indicates the one-hot encoding operation.
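As a sanity check of the two cross-entropy terms in Equations 8 and 9, the following NumPy sketch evaluates the joint stage-one loss L_c + L_r from the logits of the two heads. The function name `stage1_loss` and the use of raw logits (softmaxed internally, rather than the paper's one-hot predictions) are our own assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stage1_loss(cat_logits, rot_logits, y, o):
    """Joint loss L_c + L_r of Eqs. (8)-(9): cross-entropy on the category
    head and on the rotation head, averaged over all N_b * 4 rotated images.

    cat_logits: (N, C_base), rot_logits: (N, 4), y and o: integer labels (N,),
    where N already counts the four rotated copies of each image.
    """
    n = np.arange(len(y))
    c = softmax(cat_logits)         # category predictions c_i^r
    z = softmax(rot_logits)         # rotation predictions z_i^r
    L_c = -np.log(c[n, y]).mean()   # Eq. (8)
    L_r = -np.log(z[n, o]).mean()   # Eq. (9)
    return L_c + L_r
```

With uniform (all-zero) logits, each term reduces to the log of the class count, which gives a quick correctness check of the averaging.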
The advantages of learning global category knowledge are twofold. On one hand, the above training strategy is more global than the episodic training mode, and thus category knowledge can be learned within more global contextual dependencies. On the other hand, the global knowledge learned by the global-learner f(·|θ_2) is more informative than the random episodic labels, and can be used to better guide the training of the meta-learner f(·|θ_1). In this way, the previously mentioned weaknesses of episodic meta-training can be relieved. In Section 3.2.2, we elaborate the details of distilling the learned global knowledge to train the meta-learner f(·|θ_1).
3.2.2 Relatedness Decoupled-Distillation. Considering that the random episodic labels can only provide limited supervision information, we propose to use the global-learner f(·|θ_2) to simulate the relatedness between the query and support samples within the global context dependencies of the categories, which is then used to explicitly train the meta-learner f(·|θ_1). To facilitate the learning of relatedness, our GRDD method introduces the Relatedness Decoupled-Distillation (RDD) strategy.
More specifically, for each episode {X_i^s, X_j^q}_{i=0,j=0}^{N_s-1,N_q-1}, we first use the global-learner f(·|θ_2) to extract the high-level features {v_i^s, v_j^q}_{i=0,j=0}^{N_s-1,N_q-1}. Since the features are extracted based on the learned global category knowledge, they are referred to as global features in this paper. Based on these global features, we then extract the global relatedness information between the query and support samples (i.e., F^g ∈ R^{N_s×N_q}) using Equation 10:

F^g_{ij} = (v_i^s · v_j^q) / (||v_i^s|| · ||v_j^q||)   (10)

where F^g_{ij} denotes the (i, j) element of F^g, which is the relatedness between the i-th support sample and the j-th query sample. Then, the relatedness F^g is decoupled into the groups of sparse decoupled relatedness [R^g_0, ..., R^g_{N_s-1}] for knowledge distillation, as in Equation 11:

R^g_i = ∥_{j=0}^{N_q-1} exp(F^g_{ij}/T) / Σ_{j'=0}^{N_q-1} exp(F^g_{ij'}/T)   (11)

where ∥ denotes the concatenation operation, while T is the temperature hyperparameter used to smooth the relatedness values for knowledge distillation.
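The cosine relatedness of Equation 10 and the per-support decoupling of Equation 11 can be sketched in a few lines of NumPy. The helper name `decoupled_relatedness` is our own, and here the decoupled groups are simply stacked as the rows of a matrix R rather than concatenated.

```python
import numpy as np

def decoupled_relatedness(v_s, v_q, T=4.0):
    """Compute the relatedness F (N_s x N_q) of Eq. (10) and the decoupled
    groups of Eq. (11): one temperature-softened softmax over the query
    dimension per support sample.

    v_s: (N_s, d) support features, v_q: (N_q, d) query features.
    """
    s = v_s / np.linalg.norm(v_s, axis=1, keepdims=True)
    q = v_q / np.linalg.norm(v_q, axis=1, keepdims=True)
    F = s @ q.T                               # F_ij = cosine(v_i^s, v_j^q)
    e = np.exp(F / T)
    R = e / e.sum(axis=1, keepdims=True)      # row i is the group R_i
    return F, R
```

Because each group normalises over the queries of a single support sample, each row of R is a probability distribution, which is what makes the group-by-group KL distillation of the next step well defined.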
After that, we distill the decoupled relatedness [R^g_0, ..., R^g_{N_s-1}] to the meta-learner f(·|θ_1) group by group. For better explanation, we use F^m to represent the query-support relatedness computed based on the features extracted from the meta-learner, while the decoupled relatedness computed based on F^m is denoted as [R^m_0, ..., R^m_{N_s-1}]. We first use the KL divergence KL(·, ·) to measure the deviation within each group of [(R^g_0, R^m_0), ..., (R^g_{N_s-1}, R^m_{N_s-1})], and then the KL deviations are summed up as in Equation 12:

L_kd = Σ_{i=0}^{N_s-1} KL(R^g_i, R^m_i).   (12)

In addition, a regularization term L_ct is used to regularize the relatedness distillation, which constrains samples of the same category to have higher relatedness:

L_ct = -(1/N_q) Σ_{j=0}^{N_q-1} Σ_{k=0}^{C-1} one_hot(y_j^q)_k · log(P_j)_k   (13)

where

P_j = Σ_{i=0}^{N_s-1} (F^m)_{ij} · one_hot(y_i^s).   (14)

Finally, the meta-learner parameters are updated via the joint usage of L_kd and L_ct, as shown in Equation 15:

θ'_1 = θ_1 - lr_2 · ∂(L_kd + γ · L_ct)/∂θ_1   (15)
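Under the notation above, the distillation objective can be sketched as follows in NumPy. The function name `rdd_loss` is ours, and one detail is an assumption: since Equation 13 takes the log of P_j as class probabilities, we normalise the aggregated relatedness of Equation 14 with a softmax before applying the log.

```python
import numpy as np

def rdd_loss(R_g, R_m, F_m, y_s, y_q, C, gamma=0.2, eps=1e-12):
    """L_kd + gamma * L_ct of Eqs. (12)-(15).

    R_g, R_m: (N_s, N_q) decoupled relatedness groups (rows) from the frozen
    global-learner and from the meta-learner; F_m: (N_s, N_q) meta-learner
    relatedness; y_s, y_q: integer support/query labels in [0, C).
    """
    # Eq. (12): KL(R_g_i || R_m_i), summed group by group.
    L_kd = np.sum(R_g * (np.log(R_g + eps) - np.log(R_m + eps)))
    # Eq. (14): aggregate each query's relatedness per support category.
    P = F_m.T @ np.eye(C)[y_s]                            # (N_q, C)
    P = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)  # assumed normalisation
    # Eq. (13): cross-entropy regulariser on the query labels.
    L_ct = -np.mean(np.log(P[np.arange(len(y_q)), y_q] + eps))
    return L_kd + gamma * L_ct
```

When the meta-learner's groups exactly match the teacher's (R_m = R_g), the KL term vanishes and only the regulariser remains, which matches the intuition that L_kd drives the student toward the teacher's relatedness.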
Table 1. The accuracy comparison between our proposed GRDD and the related state-of-the-art approaches on the miniImagenet and CIFAR-FS datasets, with 95% confidence intervals. It is noteworthy that the methods marked with "‡" are based on transductive learning, while the remaining methods are inductive. Moreover, "Arch." denotes the network architecture, while "n/a" indicates results unavailable in the original papers.
Method | Reference | Arch. | miniImagenet 5-way 1-shot | miniImagenet 5-way 5-shot | CIFAR-FS 5-way 1-shot | CIFAR-FS 5-way 5-shot
MatchNet [43] | NeurIPS'16 | ConvNet-4 | 43.7 ± 0.8 | 55.3 ± 0.7 | n/a | n/a
MAML [8] | ICML'17 | ConvNet-4 | 48.7 ± 1.8 | 63.1 ± 0.9 | 58.9 ± 1.9 | 71.5 ± 1.0
ProtoNet [39] | NeurIPS'17 | ConvNet-4 | 49.4 ± 0.8 | 68.2 ± 0.7 | 55.5 ± 0.7 | 72.0 ± 0.6
DFS [10] | ICCV'18 | ConvNet-4 | 56.2 ± 0.9 | 73.0 ± 0.6 | n/a | n/a
RelationNet [41] | CVPR'18 | ConvNet-4 | 50.4 ± 0.8 | 65.3 ± 0.7 | 55.0 ± 1.0 | 69.3 ± 0.8
IMP [1] | ICML'19 | ConvNet-4 | 43.6 ± 0.8 | 55.3 ± 0.7 | n/a | n/a
TAML [15] | CVPR'19 | ConvNet-4 | 51.8 ± 1.9 | 66.0 ± 0.9 | n/a | n/a
SAML [12] | ICCV'19 | ConvNet-4 | 52.2 ± n/a | 66.5 ± n/a | n/a | n/a
GCR [20] | ICCV'19 | ConvNet-4 | 53.2 ± 0.8 | 72.3 ± 0.6 | n/a | n/a
KTN [25] | ICCV'19 | ConvNet-4 | 54.6 ± 0.8 | 71.2 ± 0.7 | n/a | n/a
PARN [48] | ICCV'19 | ConvNet-4 | 55.2 ± 0.8 | 71.6 ± 0.7 | n/a | n/a
R2D2 [2] | ICLR'19 | ConvNet-4 | 51.2 ± 0.6 | 68.8 ± 0.1 | 65.3 ± 0.2 | 79.4 ± 0.1
DC [49] | ICLR'21 | ConvNet-4 | 54.6 ± 0.6 | n/a | n/a | n/a
Our GRDD | - | ConvNet-4 | 58.9 ± 0.8 | 77.1 ± 0.6 | 69.3 ± 0.9 | 84.7 ± 0.6
Our GRDD-ConvNet4 | - | ConvNet-4 | 58.0 ± 0.8 | 76.6 ± 0.6 | 67.3 ± 0.9 | 83.5 ± 0.6
SNAIL [22] | ICLR'18 | ResNet-12 | 55.7 ± 1.0 | 68.9 ± 0.9 | n/a | n/a
AdaResNet [23] | ICML'18 | ResNet-12 | 56.9 ± 0.6 | 71.9 ± 0.6 | n/a | n/a
TADAM [24] | NeurIPS'18 | ResNet-12 | 58.5 ± 0.3 | 76.7 ± 0.3 | n/a | n/a
Shot-Free [32] | ICCV'19 | ResNet-12 | 59.0 ± n/a | 77.6 ± n/a | 69.2 ± n/a | 84.7 ± n/a
TEWAM [28] | ICCV'19 | ResNet-12 | 60.1 ± n/a | 75.9 ± n/a | 70.4 ± n/a | 81.3 ± n/a
MTL [40] | CVPR'19 | ResNet-12 | 61.2 ± 1.8 | 75.5 ± 0.8 | n/a | n/a
VFSL [36] | CVPR'19 | ResNet-12 | 61.2 ± 0.3 | 77.7 ± 0.2 | n/a | n/a
MetaOptNet [18] | CVPR'19 | ResNet-12 | 62.6 ± 0.6 | 78.6 ± 0.5 | 72.6 ± 0.7 | 84.3 ± 0.5
TRAML [19] | CVPR'20 | ResNet-12 | 67.1 ± 0.5 | 79.5 ± 0.6 | n/a | n/a
DSN-MR | CVPR'20 | ResNet-12 | 67.4 ± 0.8 | 82.9 ± 0.6 | 75.6 ± 0.9 | 86.2 ± 0.6
CBM [46] | MM'20 | ResNet-12 | 64.8 ± 0.5 | 80.5 ± 0.3 | n/a | n/a
RFS [42] | arXiv'20 | ResNet-12 | 64.8 ± 0.6 | 82.1 ± 0.4 | 73.9 ± 0.8 | 86.9 ± 0.5
SKD [30] | arXiv'20 | ResNet-12 | 67.0 ± 0.9 | 83.5 ± 0.5 | 76.9 ± 0.9 | 88.9 ± 0.6
Our GRDD | - | ResNet-12 | 67.5 ± 0.8 | 84.3 ± 0.5 | 77.5 ± 0.9 | 89.1 ± 0.6
TPN [21] ‡ | ICLR'19 | ConvNet-4 | 55.5 ± 0.9 | 69.9 ± 0.7 | n/a | n/a
Feat [? ] ‡ | CVPR'20 | ConvNet-4 | 57.0 ± 0.2 | 72.9 ± 0.2 | n/a | n/a
MRN [? ] ‡ | MM'20 | ConvNet-4 | 57.8 ± 0.7 | 71.1 ± 0.5 | n/a | n/a
Our GRDD-TIM ‡ | - | ConvNet-4 | 65.7 ± 0.3 | 80.1 ± 0.2 | 79.9 ± 0.2 | 87.9 ± 0.2
Our GRDD-TIM-ConvNet4 ‡ | - | ConvNet-4 | | | |
LaplacianShot [51] ‡ | ICML'20 | ResNet-18 | 72.1 ± 0.2 | 82.3 ± 0.1 | n/a | n/a
TIM [3] ‡ | NeurIPS'20 | ResNet-18 | 73.9 ± 0.2 | 85.0 ± 0.1 | n/a | n/a
BD-CSPN [? ] ‡ | ECCV'20 | WRN-28-10 | 70.3 ± 0.9 | 81.9 ± 0.6 | n/a | n/a
IFSL-SIB [? ] ‡ | NeurIPS'20 | WRN-28-10 | 73.5 ± n/a | 83.2 ± n/a | n/a | n/a
Our GRDD-TIM ‡ | - | ResNet-12 | 75.8 ± 0.2 | 87.3 ± 0.1 | 85.4 ± 0.2 | 91.1 ± 0.2
where the hyperparameter γ controls the balance between L_kd and L_ct. Note that during this process, the parameters of the well-trained global-learner f(·|θ_2) are frozen. By using the proposed RDD strategy to train the meta-learner f(·|θ_1), our method achieves competitive experimental performance compared to other counterparts, which is shown in the next section.
4 Experiment
With the aim of validating our proposed method, we conduct extensive experiments on two public FSL datasets, i.e., miniImagenet [43] and CIFAR-FS [2]. In this section, we first introduce these datasets and the implementation details of the experiments. Then, we compare our GRDD in detail with the related state-of-the-art approaches.
4.1 Dataset and Implementation Details
Dataset. miniImagenet [43] and CIFAR-FS [2] are the most commonly used FSL datasets. In particular, CIFAR-FS is derived from the CIFAR-100 [17] dataset, while miniImagenet is derived from the larger ILSVRC-12 [35] dataset. Remarkably, these two datasets both contain 60000 images with 100 different semantic categories, but the image resolutions of the datasets are different. Specifically, CIFAR-FS consists of
Table 2. The demonstration of the weaknesses of the current episodic meta-training and the strength of our relatedness distillation method in FSL. "CL" or "EL" denotes the experiments that use category labels or episodic labels as training supervision. "GR" represents the usage of our global relatedness. "Arch." denotes the network architecture.
Supervision | Arch. of f(·|θ_2) | Arch. of f(·|θ_1) | Acc (%)
CL | n/a | ConvNet-4 | 64.6 ± 0.9
CL+EL | n/a | ConvNet-4 | 66.7 ± 0.9
CL+GR | ConvNet-4 | ConvNet-4 | 67.3 ± 0.9
CL+GR | ResNet-12 | ConvNet-4 | 69.3 ± 0.9
CL | n/a | ResNet-12 | 74.9 ± 0.9
CL+EL | n/a | ResNet-12 | 72.9 ± 0.9
CL+GR | ConvNet-4 | ResNet-12 | 75.4 ± 0.9
CL+GR | ResNet-12 | ResNet-12 | 77.5 ± 0.9
32 × 32 images, while the images from miniImagenet have a resolution of 84 × 84. Following previous works [30, 46, ? ], for these two datasets, the 100 categories are divided into 64, 16 and 20 for training, validation and testing, respectively.
Implementation Details. All our experiments are built on PyTorch¹. Following [2, 39, 46], we respectively use ConvNet-4 [43] and ResNet-12 [14] to implement the meta-learner. Note that the global-learner is implemented by ResNet-12 if there is no special declaration. For all experiments, we choose Stochastic Gradient Descent (SGD) as the optimizer, of which the weight decay is empirically set to 5e-4. Under the two-stage training manner, different training strategies are applied to the different stages. In particular, for the first training stage, we adopt the "poly" learning rate, i.e., lr_1 = lr_init × (1 - iter/iter_total)^power, where lr_init is set to 1e-1 and power is set to 0.9. We also use the well-known mini-batch training strategy for fast model convergence. Note that for all datasets, the batch size is set to 64 and the number of epochs is set to 90. However, in the second training stage, we set a smaller initial learning rate lr_2 and fewer epochs, which are 1e-3 and 15, respectively. Moreover, the learning rate decays by a factor of 0.1 for the last 5 epochs. For the hyperparameters, γ and T are respectively set to 0.2 and 4, which are validated in Section 4.3.
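The "poly" schedule above is simple enough to state in one line; this small sketch (with our own function name `poly_lr`) shows it with the stage-one setting lr_init = 0.1 and power = 0.9.

```python
def poly_lr(lr_init, it, total_iters, power=0.9):
    """'Poly' decay: lr = lr_init * (1 - it / total_iters) ** power."""
    return lr_init * (1.0 - it / total_iters) ** power
```

The schedule starts at lr_init, decreases monotonically, and reaches exactly zero at the final iteration; power < 1 keeps the rate relatively high for most of training before a steep final drop.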
4.2 Comparison with the state-of-the-art methodsIn this section, we compare our GRDD with related state-of-the-art approaches summarized in Table 1. Note that fora fair comparison, the methods based on different networkstructures are compared accordingly.
ConvNet-4. In this part, we compare our GRDD with the methods implemented using ConvNet-4. As shown in Table 1, our GRDD largely outperforms the compared methods on the miniImagenet and CIFAR-FS datasets. For example, on the miniImagenet dataset, GRDD is more accurate than DC
1https://pytorch.org/
[Figure 3 here: t-SNE embedding plots for panels (a) CL, (b) CL+EL, and (c) our GRDD.]
Figure 3. The t-SNE visualization for the embeddings of the "CL", "CL+EL" and our GRDD methods, based on ResNet-12.
[49] and R2D2 [2] by about 4% and 8%, respectively. On the CIFAR-FS dataset, the accuracy of our GRDD is still higher than that of R2D2: about 4% higher on the 5-way 1-shot task and 5% higher on the 5-way 5-shot task.
ResNet-12. In general, higher accuracy can be achieved with larger models. Accordingly, when implemented with ResNet-12, GRDD consistently outperforms its ConvNet-4 version. As shown in Table 1, our GRDD is clearly more accurate than most methods, such as SNAIL [22] and VFSL [36]. Moreover, our GRDD also compares favorably with the recent works RFS [42], SKD [30] and CBM [46]. For example, on the miniImagenet dataset, the accuracy of our method is 0.5% and 0.8% higher than that of SKD on the 1-shot and 5-shot tasks, respectively. On the CIFAR-FS dataset, the accuracy of our GRDD is about 4% and 2% higher than that of RFS on the two tasks, respectively.
Transductive learning. Although our GRDD is proposed as an inductive approach, it can easily be integrated into transductive learning approaches. For example, we integrate our GRDD with TIM [3], denoted βGRDD-TIMβ in Table 1. On the one hand, βGRDD-TIMβ achieves a significant performance gain over the baseline TIM [3]. On the other hand, it also achieves state-of-the-art performance among the transductive counterparts, even though our GRDD is implemented with the relatively small ResNet-12 backbone. The above experiments strongly confirm the effectiveness and flexibility of our GRDD. The ablation study in Section 4.3 further analyzes our proposed method.
4.3 Ablation study
In this section, we conduct an ablation study for our work. We first analyze the weaknesses of the current episodic training mode. Then, we investigate the impact of each component of our GRDD. Finally, the settings of two vital hyperparameters (i.e., Ξ³ and T) are validated. For brevity, βCLβ denotes the methods pretrained on the Category Labels, while βCL+ELβ indicates the methods that further fine-tune the pretrained model using the Episodic Labels. Moreover,
Woodstock β18, June 03β05, 2018, Woodstock, NY Zhou et al.
[Figure 4 bar charts: ACC (%) of GRDD, βw/o RDDβ, βw/o L_gtβ, βCLβ and βCL+ELβ on CIFAR-FS and miniImagenet, for the ConvNet-4 and ResNet-12 backbones.]
Figure 4. The influence of each component of GRDD. βw/o RDDβ indicates that GRDD is used without the Relatedness Decoupled-Distillation (RDD) strategy, with the relatedness instead distilled as a whole. βw/o L_gtβ indicates that GRDD is implemented without the regularization term L_gt.
[Figure 5 panels: (a) w/o decoupling, the N_S Γ N_Q relatedness matrix as a whole; (b) w/ decoupling, N_Q groups of decoupled relatedness slots of size N_S; shown for temperatures T = 1 and T = 4.]
Figure 5. The comparison between relatedness with (b) and without (a) relatedness decoupling in knowledge distillation, under different values of the temperature T. Note that without decoupling, the relatedness is distilled as a whole matrix, while with decoupling, the relatedness is distilled per decoupled relatedness slot.
[Figure 6 plots on CIFAR-FS: 5-way 1-shot and 5-way 5-shot ACC (%) as a function of the hyperparameter Ξ³ (ranging from 0 to 1) and the temperature T (ranging from 0.5 to 8).]
Figure 6. Ablation study for the hyperparameters Ξ³ and T.
βCL+GRβ denotes our GRDD that uses the Global Relatedness extracted from the category labels to train the meta-learner.
Improvement over episodic training. Recent works [42, 46] find that the episodic training mode in FSL is ineffective and unnecessary. Here, we give two potential reasons for this phenomenon, which are experimentally analyzed in this part. As shown in Table 2, βCL+ELβ does not always yield a performance gain over the baseline model βCLβ. For example, on ResNet-12, accuracy actually decreases by about 2% when episodic labels are further used. This is because the episodic labels can only provide limited supervision and thus cannot effectively boost the quality of the feature embeddings. Instead, the learned global category knowledge may be destroyed by the local episodic meta-training, whose context is very limited. In contrast, when the more informative global relatedness is used in meta-training, βCL+GRβ achieves a significant improvement in all experiments. In addition, the more accurate the relatedness information (i.e., when it is extracted by a larger model), the higher the achievable accuracy. This confirms the effectiveness of our GRDD and shows that the limited information of the episodic labels is the bottleneck of the episodic training mode. This conclusion is also consistent with the visual analysis in Figure 3: the episodic labels make the embedding space more compact, but the boundaries between categories become blurred because of the limited supervision. Our relatedness information, in contrast, makes the embedding space more compact while keeping the category boundaries clearer and more discriminative.
Influence of each component in GRDD. As shown in Figure 4, we first compare our GRDD with two degenerate versions, βw/o L_gtβ and βw/o RDDβ. The results in Figure 4 indicate that: 1) using the RDD strategy is better than distilling the relatedness information as a whole matrix; 2) combining RDD with the regularization term L_gt is better than using RDD alone. Moreover, our GRDD also shows consistently better performance than βCLβ and βCL+ELβ. Therefore, the effectiveness of the two key components of our GRDD is verified. It is also worth noting that Figure 5 visualizes the relatedness with and without decoupling in the knowledge distillation: the decoupled relatedness is more discriminative than the relatedness considered as a whole matrix.
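To make the decoupling concrete, the following is a minimal NumPy sketch of the idea. It rests on our own illustrative assumptions, one slot per query column of the N_S Γ N_Q relatedness matrix, temperature-scaled softmax per slot, and a KL-divergence distillation loss, and is not the paper's exact formulation.

```python
import numpy as np

# Sketch of Relatedness Decoupled-Distillation (RDD): instead of
# normalizing and distilling the dense N_S x N_Q query-support
# relatedness matrix as one whole, each column (one query's relatedness
# to all supports) is treated as a decoupled slot and distilled
# separately at temperature T. Slot granularity and the KL loss are
# illustrative assumptions.

def _softmax(x: np.ndarray, T: float) -> np.ndarray:
    z = np.exp((x - x.max()) / T)  # subtract max for numerical stability
    return z / z.sum()

def rdd_loss(teacher_rel: np.ndarray, student_rel: np.ndarray,
             T: float = 4.0) -> float:
    """KL(teacher || student), averaged over decoupled per-query slots."""
    loss = 0.0
    n_q = teacher_rel.shape[1]
    for q in range(n_q):                   # one slot per query column
        p = _softmax(teacher_rel[:, q], T)  # frozen teacher distribution
        s = _softmax(student_rel[:, q], T)  # student distribution
        loss += float(np.sum(p * (np.log(p) - np.log(s))))
    return loss / n_q
```

With a higher temperature such as T = 4, each slot distribution is softened, which exposes the relative relatedness among all supports rather than only the single strongest match.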
Hyperparameter settings. Furthermore, the experiments in Figure 6 validate the settings of the two key hyperparameters of our GRDD, i.e., Ξ³ and T. The results in Figure 6 show that the setting Ξ³ = 0.2 and T = 4 yields the best performance.
5 Conclusion
In this paper, we show that the bottleneck of the episodic training mode lies in the limited supervision information of the episodic labels and the scarce category context. To alleviate these problems, we propose a new Global Relatedness Decoupled-Distillation (GRDD) method that explicitly uses the more informative global query-support relatedness to train the meta-learner, making it more discriminative. Moreover, the Relatedness Decoupled-Distillation (RDD) strategy is introduced to facilitate this procedure. RDD decouples the dense relatedness into groups of sparse decoupled
relatedness, making the relatedness sharper and easier to distill. Extensive experiments on the miniImagenet and CIFAR-FS datasets validate the effectiveness of our method. In the future, we plan to apply our method to other FSL settings, such as open-set FSL and domain-shift FSL.
References
[1] Kelsey Allen, Evan Shelhamer, Hanul Shin, and Joshua Tenenbaum. 2019. Infinite Mixture Prototypes for Few-shot Learning. In Proceedings of the International Conference on Machine Learning. 232β241.
[2] Luca Bertinetto, Joao F Henriques, Philip Torr, and Andrea Vedaldi. 2018. Meta-learning with differentiable closed-form solvers. In Proceedings of the International Conference on Learning Representations.
[3] Malik Boudiaf, Imtiaz Ziko, JΓ©rΓ΄me Rony, Jose Dolz, Pablo Piantanida, and Ismail Ben Ayed. 2020. Information Maximization for Few-Shot Learning. Advances in Neural Information Processing Systems 33 (2020).
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 834β848.
[5] Guneet Singh Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. 2019. A Baseline for Few-Shot Image Classification. In International Conference on Learning Representations.
[6] Matthijs Douze, Arthur Szlam, Bharath Hariharan, and HervΓ© JΓ©gou. 2018. Low-shot learning with large-scale diffusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3349β3358.
[7] Li Fe-Fei et al. 2003. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 1134β1141.
[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. 1126β1135.
[9] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3146β3154.
[10] Spyros Gidaris and Nikos Komodakis. 2018. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4367β4375.
[11] Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401 (2014).
[12] Fusheng Hao, Fengxiang He, Jun Cheng, Lei Wang, Jianzhong Cao, and Dacheng Tao. 2019. Collect and select: Semantic alignment metric learning for few-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 8460β8469.
[13] Bharath Hariharan and Ross Girshick. 2017. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision. 3018β3027.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770β778.
[15] Muhammad Abdullah Jamal and Guo-Jun Qi. 2019. Task agnostic meta-learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11719β11727.
[16] Εukasz Kaiser, Ofir Nachum, Aurko Roy, and Samy Bengio. 2017. Learning to remember rare events. arXiv preprint arXiv:1703.03129 (2017).
[17] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
[18] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. 2019. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10657β10665.
[19] Aoxue Li, Weiran Huang, Xu Lan, Jiashi Feng, Zhenguo Li, and Liwei Wang. 2020. Boosting few-shot learning with adaptive margin loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12576β12584.
[20] Aoxue Li, Tiange Luo, Tao Xiang, Weiran Huang, and Liwei Wang. 2019. Few-shot learning with global class representations. In Proceedings of the IEEE International Conference on Computer Vision. 9715β9724.
[21] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to propagate labels: Transductive propagation network for few-shot learning. In 7th International Conference on Learning Representations, ICLR 2019.
[22] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. A Simple Neural Attentive Meta-Learner. In Proceedings of the International Conference on Learning Representations.
[23] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. 2018. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning. PMLR, 3664β3673.
[24] Boris Oreshkin, Pau RodrΓguez LΓ³pez, and Alexandre Lacoste. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems. 721β731.
[25] Zhimao Peng, Zechao Li, Junge Zhang, Yan Li, Guo-Jun Qi, and Jinhui Tang. 2019. Few-shot image recognition with knowledge transfer. In Proceedings of the IEEE International Conference on Computer Vision. 441β449.
[26] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532β1543.
[27] Hang Qi, Matthew Brown, and David G Lowe. 2018. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5822β5830.
[28] Limeng Qiao, Yemin Shi, Jia Li, Yaowei Wang, Tiejun Huang, and Yonghong Tian. 2019. Transductive episodic-wise adaptive metric for few-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 3603β3612.
[29] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. 2018. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7229β7238.
[30] Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. 2020. Self-supervised Knowledge Distillation for Few-shot Learning. arXiv preprint arXiv:2006.09785 (2020).
[31] Tiago Ramalho and Marta Garnelo. 2018. Adaptive Posterior Learning: few-shot learning with a surprise-based memory module. In International Conference on Learning Representations.
[32] Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. 2019. Few-shot learning with embedded class models and shot-free meta training. In Proceedings of the IEEE International Conference on Computer Vision. 331β339.
[33] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779β788.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 6 (2016), 1137β1149.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211β252.
[36] Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. 2019. Generalized zero- and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8247β8255.
[37] Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Abhishek Kumar, Rogerio Feris, Raja Giryes, and Alex M Bronstein. 2018. Ξ-encoder: an effective sample synthesis method for few-shot object recognition. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2850β2860.
[38] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[39] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30 (2017), 4077β4087.
[40] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. 2019. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 403β412.
[41] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1199β1208.
[42] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. 2020. Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need? arXiv preprint arXiv:2003.11539 (2020).
[43] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. Advances in Neural Information Processing Systems 29 (2016), 3630β3638.
[44] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR) 53, 3 (2020), 1β34.
[45] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. 2018. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7278β7286.
[46] Zeyuan Wang, Yifan Zhao, Jia Li, and Yonghong Tian. 2020. Cooperative Bi-path Metric for Few-shot Learning. In Proceedings of the 28th ACM International Conference on Multimedia. 1524β1532.
[47] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. 2018. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5177β5186.
[48] Ziyang Wu, Yuwei Li, Lihua Guo, and Kui Jia. 2019. Parn: Position-aware relation networks for few-shot learning. In Proceedings of the IEEE International Conference on Computer Vision. 6659β6667.
[49] Shuo Yang, Lu Liu, and Min Xu. 2021. Free Lunch for Few-shot Learning: Distribution Calibration. arXiv preprint arXiv:2101.06395 (2021).
[50] Hongguang Zhang, Jing Zhang, and Piotr Koniusz. 2019. Few-shot learning via saliency-guided hallucination of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2770β2779.
[51] Imtiaz Ziko, Jose Dolz, Eric Granger, and Ismail Ben Ayed. 2020. Laplacian regularized few-shot learning. In International Conference on Machine Learning. PMLR, 11660β11670.