
ISI-10 Project Final Report: Meta Networks

Xue BAI 1 2 Yaxian CHEN 1 2 Xiaomeng WANG 1 2

Abstract

In this final report, we first introduce the paper Meta Networks (the studied problem, the main ideas, the experiments conducted and the results obtained) as well as some state-of-the-art meta learning models, to present our comprehension of this subject. We then describe our implementation, building on our mid-term report. We reproduced some of the experiments and analyzed our results by comparing them with those in the paper. We also designed and conducted some new experiments to verify the properties of MetaNet. Finally, we discuss the problems that we encountered during the project and what could be done with more time.

1. Presentation of the Article

1.1. Introduction

Our project is based on the paper Meta Networks (Munkhdalai & Yu, 2017), which introduces a novel meta learning method to learn tasks from meta-level knowledge with small datasets and demonstrates the properties of this model via a series of one-shot supervised classification experiments and generalization tests.

Standard deep neural networks require large amounts of labeled data for training and show catastrophic forgetting when trained on various tasks in a sequential manner. In order to solve these problems, several works on meta learning, the ability of learning to learn, have been carried out. There are two levels of learning in a meta learning model: the meta level, which gathers generic knowledge across different tasks and transfers meta information to the base level; and the base level, which provides generalization within each task using the transferred meta knowledge. In this article, the authors present their

1 IODAA, Agro ParisTech, Paris, France. 2 ISI, Paris-Dauphine, Paris, France. Correspondence to: Xue BAI <[email protected]>.


meta learning model MetaNet, in which neural networks are able to learn a new task from a single example on the fly.

From the architecture given in the article (Figure 1), we can see that MetaNet has three principal components: two learning modules (a base learner and a meta learner) and an external memory. When an input task arrives, the base learner first analyzes it and delivers higher-order meta information to the meta learner as feedback on its own current status. The meta learner then uses this meta information to rapidly parameterize both itself and the base learner in a separate space. Thanks to this rapid parameterization, the model can recognize changes in the input task. Moreover, the external memory stores parameters computed previously for each example. Since the meta learner has access to this external memory, it can read from it when new concepts appear, which allows MetaNet to learn and generalize rapidly.

In this model, two types of loss gradients are used as meta information: the gradient of an embedding loss, which indicates whether the data are correctly represented, and the gradient of a task loss, which reflects the classification results.
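To make this concrete, here is a minimal PyTorch sketch of how such gradients could be extracted (the authors' actual implementation uses Chainer; the layer sizes, the extra embedding head and the loss choices below are our own illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the embedding function and base learner (sizes are arbitrary).
embed    = torch.nn.Linear(64, 32)   # representation / embedding function
emb_head = torch.nn.Linear(32, 5)    # throwaway head used only for the embedding loss
base     = torch.nn.Linear(32, 5)    # base learner (5-way classifier)

x, y = torch.randn(1, 64), torch.tensor([2])
z = embed(x)

# Embedding loss: cross-entropy in the one-shot case described above
# (a contrastive loss would be used when more examples per class are available).
loss_emb = F.cross_entropy(emb_head(z), y)
grad_emb = torch.autograd.grad(loss_emb, embed.parameters())

# Task loss on the base learner's prediction.
loss_task = F.cross_entropy(base(z.detach()), y)
grad_task = torch.autograd.grad(loss_task, base.parameters())

# The (detached) gradients are the meta information passed to the meta learner.
meta_info = [g.detach() for g in (*grad_emb, *grad_task)]
```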

Figure 1. Overall architecture of Meta Networks

1.2. Model of Meta Network

In this article, the idea is to exploit higher-order meta information for the rapid generalization of neural networks. With


this meta information, the authors construct a flexible model that is capable of learning a sequence of tasks with different input and output distributions.

MetaNet consists of three modules: a meta learner, a base learner and an external memory. The meta learner and the base learner are the two main learning modules. The meta learner generates fast weights for never-before-seen tasks via the external memory module, while the base learner learns representations that are useful for the task objective. The network also uses a novel layer augmentation method to combine the standard slow weights with specific fast weights.

Training MetaNet involves three main procedures: acquisition of meta information, generation of fast weights and optimization of slow weights. The input is a sequence of tasks, where each task consists of a support set and a training set. The meta learner and the base learner are trained at the same time.

MetaNet is tested with one-shot supervised learning: it classifies samples from other tasks with never-before-seen classes, given a support set that contains only one example per class.

This article then explains the meta learner, the base learner and layer augmentation in detail. The meta learner is used to generate meta parameters and has three functions: a dynamic representation learning function u and two fast weight generation functions m and d.

Function u learns to represent the task through task-level fast weights. It is a neural net parameterized by slow weights Q and task-level fast weights Q*. Its inputs are an example x'_i from the support set {x'_i, y'_i}, i = 1..N, together with these weights, and its output is the representation of that example. From this representation and the label y'_i, it computes the loss Loss_emb and uses its gradients as meta information; a cross-entropy loss is used in the one-shot setting and a contrastive loss otherwise. In this way, the task-level fast weights Q* are generated on a per-task basis. Function m, a neural network, learns a mapping from the loss gradients of the base learner and produces the example-level fast weights W*_i. The results are stored in a memory M, which is indexed with task-dependent embeddings. Function d denotes a neural net with an LSTM (Long Short-Term Memory). It observes the loss gradients of each example sampled from the support set and then generates the task-specific meta parameter.
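A minimal PyTorch sketch of what the fast weight generation functions m and d could look like, assuming the meta information has already been flattened into fixed-size gradient vectors; the shapes and architectures below are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: each meta-information item is a flattened loss gradient.
grad_dim, fast_dim, n_support = 128, 64, 5

# m: maps one example's loss gradient to that example's fast weights W*_i.
m = nn.Sequential(nn.Linear(grad_dim, 64), nn.ReLU(), nn.Linear(64, fast_dim))

# d: an LSTM that reads the whole support set's gradients and emits the
# task-level fast weights Q* from its last hidden state.
d_lstm = nn.LSTM(input_size=grad_dim, hidden_size=fast_dim, batch_first=True)

grads = torch.randn(n_support, grad_dim)     # one gradient vector per support example
fast_W = m(grads)                            # W*_i, shape (n_support, fast_dim)
_, (h_n, _) = d_lstm(grads.unsqueeze(0))     # sequence of gradients as one batch
fast_Q = h_n[-1].squeeze(0)                  # Q*, shape (fast_dim,)
```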

Once the parameters are generated and stored in the memory M with index R, the meta learner parameterizes the base learner with the fast weights with the help of soft attention. The soft attention applies a softmax over the similarity between the memory index and the input embedding to obtain an attention distribution for each input.
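A minimal NumPy sketch of this soft-attention read, assuming cosine similarity between the new input's embedding and the stored memory index (the similarity measure and sizes here are illustrative assumptions):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical sizes: N support examples stored in memory, embedding size d.
N, d, fast_dim = 5, 32, 64
R = np.random.randn(N, d)          # memory index: one key per support example
M = np.random.randn(N, fast_dim)   # memory values: stored fast weights W*_i

def read_fast_weights(r_new):
    """Soft-attention read: similarity between the new input's embedding r_new
    and the memory index R, normalized with a softmax, then a weighted sum
    over the stored fast weights."""
    sims = R @ r_new / (np.linalg.norm(R, axis=1) * np.linalg.norm(r_new) + 1e-8)
    attn = softmax(sims)
    return attn @ M                # fast weights used to parameterize the base learner

w_star = read_fast_weights(np.random.randn(d))
```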

The base learner provides the meta learner with feedback about the new input task in the form of meta information, namely the loss gradients. Besides, the base learner takes the task-specific representations as input, which effectively reduces the number of meta parameters and leverages shared representations. During training, it minimizes the task loss; the parameters consist of two sets of slow weights and two sets of meta (fast) weights.

Layer augmentation is an important and novel part of this network. Both the meta learner and the base learner have slow-weight layers and fast-weight layers. In the base learner, these layers enable rapid generalization to a specific example and are essential for the convergence of MetaNet models. The two kinds of weights can be seen as feature detectors operating in two different domains that are mapped into the same one. The idea is similar to an MLP: the input is transformed by the two sets of weights, a ReLU nonlinearity yields two separate activation vectors, and these vectors are summed before being passed on; the final layer aggregates the result through a softmax, producing normalized outputs.
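A minimal NumPy sketch of one layer-augmented layer under this description (the layer sizes are arbitrary and the softmax is shown only on the final output):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def augmented_layer(x, W_slow, W_fast):
    """Layer augmentation sketch: the input is transformed by the slow and the
    fast weights separately, each result passes through a ReLU, and the two
    activation vectors are summed before being fed to the next layer."""
    return relu(W_slow @ x) + relu(W_fast @ x)

# Toy usage: one augmented hidden layer followed by a softmax output layer.
x = np.random.randn(16)
h = augmented_layer(x, np.random.randn(8, 16), np.random.randn(8, 16))
logits = np.random.randn(5, 8) @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # normalized class probabilities
```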

1.3. Claims

In this article, the authors show through their experiments that MetaNet improves the state-of-the-art accuracy of one-shot supervised learning and possesses the abilities of generalization and continual learning.


Figure 2. A layer augmented MLP


1.3.1. ONE-SHOT LEARNING TEST

For the first part, they carried out one-shot learning tests on three datasets: Omniglot, ImageNet and MNIST. Omniglot is a dataset consisting of 1623 classes from 50 different alphabets, with only 20 images per class. ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds to thousands of images (currently over five hundred images per node on average). The MNIST database is a large database of handwritten digits containing 60,000 training images and 10,000 testing images. Four groups of benchmark experiments are reported: Omniglot previous split, MNIST as out-of-domain data, Mini-ImageNet, and Omniglot standard split.

Omniglot previous split: Splitting Omniglot into 1200 classes for training and 423 for testing, the research team performed 5, 10, 15 and 20-way one-shot classification and studied three variations of MetaNet: MetaNet-, MetaNet and MetaNet+. MetaNet- is a variant without the task-level fast weights Q* in the embedding function, and MetaNet+ is a variant with additional task-level weights for the base learner on top of W*. The performance of MetaNet is better than that reported in previous papers in both the 5-way and 20-way tests, but comparing the results among MetaNet's variations shows that changing the fast weights in the model does not improve classification performance and instead decreases accuracy.

MNIST as out-of-domain data: Based on the Omniglot previous split, but testing the variations in an out-of-domain setting with MNIST, all variations show lower accuracy and the performance gap between the variations increases.

Mini-ImageNet: With training, validation and testing sets of 64, 16 and 20 ImageNet classes (images resized to 28x28 pixels) and 15 sampled examples per class for evaluation, MetaNet improves on the previously reported accuracy and obtains the best result.

Omniglot standard split: Different from the Omniglot previous split, the authors used the standard split of Omniglot (30 training alphabets and 20 evaluation alphabets) and formed 400 trials from the evaluation set to test the model. In this case, only MetaNet was trained and tested. It showed a high accuracy, slightly better than human performance but slightly lower than in the first experiment, which suggests that a setup with fewer training classes and more test classes is more difficult for MetaNet.

1.3.2. GENERALIZATION TEST

Based on the results of the one-shot learning tests, the authors then conducted a set of experiments to test the generalization of MetaNet from three aspects, and they demonstrated several interesting properties relating to generalization and continual learning.

(a) N-way training and K-way testing

Purpose: To test whether a MetaNet model trained on an N-way one-shot task could generalize to another K-way task (where N ≠ K) without actually training on the second task.

Design: For this experiment, the authors first inserted a new softmax layer into the base learner during evaluation and then augmented it with the example-specific fast weights generated by the meta learner, in order to see whether the meta learner is generic enough to parameterize the new softmax layer on the fly. The MetaNet models were then trained on one of the 5, 10, 15 and 20-way one-shot tasks and evaluated on the others.

Results: As shown in the table, when the models are trained on easier tasks than the test ones (i.e. N < K), we observe decreasing performances. Conversely, the models trained on harder tasks (i.e. N > K) achieved increasing performances when tested on the easier tasks, even higher than the performance obtained on tasks of the same difficulty level (i.e. N = K). For example, the model trained on 20-way classification improved the 5-way one-shot baseline by 0.6%, a ceiling performance in this setting. Furthermore, the authors argue that the test performance obtained in the equal-difficulty case (N = K) can be used as a lower or an upper performance bound, depending on the scenario under which the model will be deployed in the future. For example, for a MetaNet model that will


be deployed under the N > K scenario, we can obtain the performance lower bound by testing it on the N = K scenario.

Figure 3. Accuracy of MetaNet trained on N-way and tested on K-way one-shot tasks.

Conclusion: MetaNet has good flexibility in one-shot learning tasks with different numbers of classes.

(b) Rapid parameterization of fixed weight base learner

Purpose: To test whether a meta learner trained for rapid parameterization of a base learner could parameterize another base learner during evaluation.

Design: In this experiment, the authors replaced the entire base learner with a new CNN during evaluation. The slow weights of this network remained fixed, while the fast weights were generated by the meta learner (trained to parameterize the old base learner) and used to augment the fixed slow weights.

Results: As shown in the figure, the test performances of the base learner (target CNN), a small CNN (with 32 filters) and a big CNN (with 128 filters) are compared. The performance difference between these models is large in earlier training iterations, but the test accuracies of the base learners converge as the meta learner sees more one-shot learning trials (after around 17,000 trials).

Figure 4. Comparison of the test performances of the base learners on Omniglot 5-way classification.

Conclusion: MetaNet effectively learns to parameterize a neural net with fixed weights.

(c) Meta-level continual learning

Purpose: To test whether the meta space of MetaNet, here the space of loss gradients, is problem-independent.

Design: The authors formulated two problems in a sequential manner. They first trained and tested the model on the Omniglot set and then switched to training on the MNIST data. After training on a certain number of MNIST one-shot tasks, they re-evaluated the model on the same Omniglot test set and compared performances. More specifically, they allocated separate parameters for the weights W and Q when switching the problem, so only the meta weights were updated. They conducted multiple runs, increasing the number of MNIST training trials by multiples of 400 (i.e. 400, 800, 1200) in each run, giving MetaNet more time to adapt its meta weights to the second problem and thus more opportunity to forget the knowledge about Omniglot.

Results: The authors plotted the difference between the two Omniglot test accuracies obtained before and after training on the MNIST task, as shown in the figure. The performance change (y-axis) after training on the MNIST tasks ranges from -1.7% to 1.24%, depending on the training time (x-axis). The positive performance improvements indicate that training on the second problem automatically improves performance on the earlier task, exhibiting the reverse transfer property. However, reverse transfer happens only up to a certain point in MNIST training (2400 trials). After that, the meta weights start to forget the Omniglot information and the Omniglot test accuracy drops.

Figure 5. The difference between the two Omniglot test accuracies obtained before and after training on the MNIST task.

Conclusion: MetaNet supports meta-level continual learning, within a certain number of training trials on the inserted problem.


2. State of the Art

Before implementing the experiments of this paper, we surveyed the state of the art on meta learning in order to learn more about the general research status and to better understand the innovation of this paper.

Meta learning has become an important research branch, after reinforcement learning, in the field of machine learning. The task of rapid generalization to new concepts with small training data presents a significant challenge to neural network models. The key to human beings' quick learning ability is that humans are capable of learning to learn, so researchers try to develop new neural network models which can fully utilize past knowledge and experience to guide the learning of new tasks.

To do this, researchers have come up with diverse solutions to improve the meta learning ability of neural networks. Here we present some of the ideas from the papers that we have read.

2.1. Strategy based on external memory

Basic idea: Since we must learn from past experience, can we achieve this by adding memory to the neural network?

Santoro et al. proposed a memory-augmented neural network model, as shown in Figure 6. A successful strategy would involve the use of an external memory to store bound sample representation and class label information, which can then be retrieved at a later point for successful classification when a sample from an already-seen class is presented. Specifically, sample data x_t from a particular time step should be bound to the appropriate class label y_t, which is presented in the subsequent time step. Later, when a sample from this same class is seen, the network should retrieve this bound information from the external memory to make a prediction. Backpropagated error signals from this prediction step then shape the weight updates from the earlier steps in order to promote this binding strategy (Santoro et al., 2016).
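A toy sketch of this bind-and-retrieve idea (not the full MANN controller with its learned read and write heads): representations are stored together with their labels and later retrieved by similarity.

```python
import numpy as np

# Bind (representation, label) pairs into memory and retrieve the label of the
# most similar stored representation when a new sample arrives.
memory_keys, memory_labels = [], []

def bind(representation, label):
    memory_keys.append(representation)
    memory_labels.append(label)

def retrieve(representation):
    keys = np.stack(memory_keys)
    sims = keys @ representation / (
        np.linalg.norm(keys, axis=1) * np.linalg.norm(representation) + 1e-8)
    return memory_labels[int(np.argmax(sims))]

bind(np.random.randn(16), label=3)     # x_t seen at time t, y_t revealed at t+1
print(retrieve(np.random.randn(16)))   # prediction for a later sample
```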

Figure 6. Network strategy (Santoro et al., 2016)

The MetaNet paper (Munkhdalai & Yu, 2017) also builds on this external memory mechanism and on the LSTM model.

2.2. Strategy based on gradient descent

Basic idea: Since the purpose of meta learning is to achieve rapid learning, and the key to rapid learning is obtaining accurate gradients for the neural network quickly, is it possible to use previous tasks to learn how to predict the gradient (or the update) more accurately when facing new tasks?

Andrychowicz et al. showed how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. They proposed to replace hand-designed update rules with a learned update rule, called the optimizer g, specified by its own set of parameters φ. This results in updates to the optimizee f of the form θ_{t+1} = θ_t + g_t(∇f(θ_t), φ). They trained this general optimizer network to predict updates on simple quadratic regression problems, so that it can then speed up subsequent training.
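A toy PyTorch sketch of this learned-update rule, using a small LSTM cell as the optimizer g and a quadratic as the optimizee f; the coordinate-wise, one-unit LSTM here is only an illustrative stand-in, not the exact architecture used in that work.

```python
import torch
import torch.nn as nn

# Optimizer network g with its own parameters phi (here: a tiny LSTM cell
# applied coordinate-wise to the gradient of the optimizee).
g = nn.LSTMCell(input_size=1, hidden_size=1)

theta = torch.randn(4, requires_grad=True)   # parameters of the optimizee

def f(p):                                    # toy optimizee: a quadratic
    return (p ** 2).sum()

grad = torch.autograd.grad(f(theta), theta)[0]
h, _ = g(grad.detach().unsqueeze(1))         # g_t(grad f(theta_t); phi)
update = h.squeeze(1)
theta = theta + update                       # theta_{t+1} = theta_t + g_t(...)
```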

2.3. Strategy based on attention mechanism

Basic idea: Humans can easily find the most important parts that distinguish a picture. Can we then use previous tasks to train an attention model, so that it directly focuses on the most important parts when facing new tasks?

Vinyals et al. constructed an attention mechanism in which the final output is obtained as an attention-weighted superposition of the support labels: ŷ = Σ_{i=1}^{k} a(x̂, x_i) y_i, where x_i and y_i are the samples and labels from the support set S and a is the attention mechanism which fully specifies the classifier (Vinyals et al., 2016).

This paper (Munkhdalai & Yu, 2017) also uses a similar attention mechanism.

3. Implementation

3.1. Comprehension of the source code

3.1.1. ANALYSES OF MODEL STRUCTURES

The authors constructed three MetaNet variants: MetaNetFull, MetaNetPartial and MetaNetParticialU. In MetaNetFull, the whole model is fast-parameterized and layer-augmented. The code defines a model class which contains a function that sets the index of the fast weights in the memory M and a function implementing the meta learner and the base learner; a CNN is defined inside this function. When training the model, it computes the prediction and the loss of the embedded inputs, then injects the fast weights and updates the synaptic connections. Finally, it optimizes the model and uses it to predict newly arriving inputs.


The MetaNetPartial and MetaNetParticialU models are similar to the full MetaNet, but in these two models only the last few layers are fast-parameterized and layer-augmented. In MetaNetParticialU, the output of the embedding function is also given to the classifier, forcing the model to operate on a dynamic task space.

Besides, the authors provide functions (in Generators.py) to generate examples from the three datasets (Omniglot, ImageNet and MNIST) separately and to augment the data (images). Only the MNIST dataset can be used directly after download; the Omniglot and ImageNet data need to be pre-processed before the experiments:

1) For Omniglot, besides the standard split, the authors applied the previously used split and augmented the data by rotations of 90, 180 and 270 degrees (Santoro et al., 2016); a minimal sketch of this rotation augmentation follows the list.

2) For ImageNet, the authors resized the images for computational efficiency.
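A minimal sketch of the rotation augmentation mentioned in 1); the real generator also relabels the rotated copies as new classes and handles whole alphabets, whereas here we only show the rotation step itself:

```python
import numpy as np

def augment_with_rotations(image):
    """Return the original Omniglot character image together with its
    90, 180 and 270 degree rotations, each treated as a new class."""
    return [np.rot90(image, k) for k in range(4)]   # 0, 90, 180, 270 degrees

variants = augment_with_rotations(np.random.rand(28, 28))
```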

3.1.2. ANALYSES OF EXPERIMENTS

In order to understand clearly how the experiments were carried out, we also analyzed the source code and tried to understand its structure for each experiment before our own implementation.

One-shot learning test: At the beginning of the experiment, the training, testing (and, for Mini-ImageNet, validation) sets are generated from the corresponding pre-processed dataset. After each epoch, we obtain the prediction accuracy. For the experiments on the Omniglot dataset, the MetaNetFull model is used for training and testing. The training on ImageNet is similar to that on Omniglot but uses another meta learning model, MetaNetPartial. To obtain the MetaNet variants, we can delete or add the layers used for gradient computation in MetaNetFull. As for the out-of-domain test, all we need to do is change the generator of the test dataset.

Generalization test: This series of tests is based on the one-shot learning test. We can modify the parameters (the number of training classes and the number of test classes) in the given sources to achieve the N-way training and K-way testing transformation. The test of rapid parameterization of a fixed-weight base learner is more complicated, because we need to modify the structure of MetaNetFull to plug in a new base learner (CNN) rather than simply change parameters. Finally, for continual learning, we need to add a part that generates an MNIST training set via the script Generators.py, empty the gradient and loss lists after each epoch in the script train-omniglot.py, add the MNIST training after the normal test on the Omniglot dataset, and re-test on Omniglot. At the end of the experiments, we can compare the accuracy rates before and after the MNIST training; an outline of this loop is sketched below.
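A hypothetical outline of this continual-learning loop; the three callables stand in for the generation, training and evaluation routines of Generators.py and train-omniglot.py and are not the authors' actual function names:

```python
def continual_learning_run(train_omniglot_epoch, train_mnist_trials,
                           evaluate_omniglot, mnist_runs=(400, 800, 1200),
                           n_omniglot_epochs=60):
    """Train on Omniglot, then adapt the meta weights on MNIST for an
    increasing number of trials, re-testing on the same Omniglot test set
    after each run to measure forgetting or reverse transfer."""
    for _ in range(n_omniglot_epochs):
        train_omniglot_epoch()
    acc_before = evaluate_omniglot()

    results = []
    for n_trials in mnist_runs:
        train_mnist_trials(n_trials)
        acc_after = evaluate_omniglot()
        results.append((n_trials, acc_after - acc_before))
    return results
```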

We implemented our experiments based on the authors' ideas. We first replicated the one-shot learning tests and then endeavored to reproduce the generalization tests. However, limited by the computer configuration and the available computation power, we could only reproduce a part of the experiments.

3.2. Reproduction of One-shot Learning Test

Before the experiments, we downloaded the Omniglot dataset (Lake et al., 2015) and split it into 1200 and 423 classes following the instructions of a cited article (Santoro et al., 2016), and then we applied the image rotations. All of our one-shot learning tests are therefore based on this previous split.

3.2.1. OMNIGLOT PREVIOUS SPLIT

This experiment is aimed at testing the one-shot learning performance of the standard MetaNet model (MetaNetFull). We set a support set with one example per class for both training and testing. Then we sampled 5 classes as the training set and 5 classes as the test set via the script Generators.py, with 10 examples for each class. This is our 5-way one-shot classification setup; a minimal episode-sampling sketch follows.
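A minimal sketch of such episode sampling; the dictionary layout (class name mapped to a list of images) is an assumption made for illustration and does not match the authors' Generators.py interface:

```python
import random

def sample_episode(dataset, n_way=5, n_support=1, n_query=10):
    """Sample an N-way episode: n_support support examples and n_query
    remaining examples per class, relabeled 0..n_way-1."""
    classes = random.sample(list(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = random.sample(dataset[cls], n_support + n_query)
        support += [(img, label) for img in images[:n_support]]
        query   += [(img, label) for img in images[n_support:]]
    return support, query
```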

In the original experiments, the authors used a GPU to run the learning tasks. Without such equipment, we could only change some parameters to run the experiments in CPU mode. As one epoch took more than ten minutes, we set only 60 epochs, but this already shows the tendency towards convergence (see the following section).

We also tried to perform 10, 15 and 20-way one-shot classification, but as soon as the number of classes exceeded 5, our machine ran out of memory, so we were not able to finish this series of experiments.

3.2.2. MNIST AS OUT-OF-DOMAIN DATA

After the classic one-shot learning test, we reproduced the test treating MNIST as out-of-domain data, under the hypothesis that the MetaNet model is dynamic enough to perform well on this task.

Based on the script train-omniglot.py, we replaced the Omniglot testing dataset with MNIST images generated by MnistGenerator in Generators.py. The number of classes in the testing set was 10, and each class had 10 examples, while the training setup remained the same as for the previous split. In summary, this was a 5-way training, 10-way testing experiment. We trained our model for 30 epochs.

3.2.3. ONE-SHOT TRAIN TO K-SHOT TEST

Inspired by the authors' statement that "supplying more support examples during test increases the test performance", we decided to change the number of support examples (nb_samples_per_class in train-omniglot.py) to verify


whether this claim is correct.

The training set always used one support example, and the experiments were always based on the standard MetaNet model (MetaNetFull). We used testing sets with 3 and 5 support examples respectively, which gave a one-shot training, three-shot testing experiment and a one-shot training, five-shot testing experiment. For both tests, we trained our models for 30 epochs.

3.3. Reproduction of Generalization Test

We reproduced the N-way training and K-way testing experiment to test the generalization of MetaNet.

We modified the script train-omniglot.py by resetting the following parameters:

• n_epoch (number of epochs)

• n_outputs_train (number of training classes in an episode)

• n_outputs_test (number of test classes in an episode)

Due to the limited computation power of the computer we used, every epoch took around 150 s, so we reduced the number of epochs to 60 to limit the runtime. Due to the limited GPU memory of our computer, we could not run the process once the number of training classes was greater than 5 or the number of test classes was greater than 10. We therefore tested two sets of parameters, 5-way training and 5-way testing (i.e. N=5, K=5) and 5-way training and 10-way testing (i.e. N=5, K=10), and then analyzed the training task accuracy and the test task accuracy.

4. Results

We carried out one-shot classification experiments on two datasets: Omniglot and MNIST. We performed the experiments mentioned above: Omniglot was used for the previous-split and generalization tests, while MNIST was used as out-of-domain data.

4.1. One-shot Learning Test

As the authors indicated, we split the Omniglot classes into 1200 classes for training and 423 for testing. We performed 5-way one-shot classification and set the number of epochs to 60 due to the limited computation power.

The performance increased rapidly during the first 10 epochs for both training and testing. The accuracy reached 90% after 17 epochs, after which the training accuracy became stable; the average training accuracy is 92.8%. As for the test set, the average accuracy after 17 epochs is 93.1%, slightly higher than on the training set, but the test accuracy is more volatile.

For 5-way one-shot classification, the test performance is not stable within 60 epochs, and the average accuracy is 93.1%. The accuracy likely still has room to increase; compared with the 98.95% reported in the article, our result could be improved by increasing the number of epochs.

4.1.1. ONE-SHOT TRAIN TO K-SHOT TEST

As the model is very flexible, we set different numbers of support examples for training and testing: we trained the model on a one-shot task and then tested it on K-shot (3-shot and 5-shot) tasks with 30 epochs.

In Figure 7 we plot the accuracy of one-shot learning with 1-shot, 3-shot and 5-shot support at test time. All three training curves converged after 15 epochs and reached a high accuracy. The training results are similar between 1-shot and 5-shot, and the 3-shot training accuracy is slightly higher than the other two, at 92.5%. On the test set, the accuracies fluctuate more than during training. Similarly, the 3-shot test accuracy is 95.4%, which is higher than the 1-shot and 5-shot test accuracies of 92.4% and 94.6% respectively.

The authors explain that supplying more support examples during test increases the test performance. Our experiment suggests that the influence of the number of support examples on test performance is not linear: there appears to be a ceiling on the useful number of support examples. With 3 support examples, the test reaches the best performance among the 1-shot, 3-shot and 5-shot settings.

Figure 7. Accuracy of one-shot learning with 1-shot, 3-shot and 5-shot support at test time

4.1.2. MNIST AS OUT-OF-DOMAIN DATA

The authors treated MNIST images as out-of-domain data. Specifically, a model is trained on the Omniglot training set and evaluated on the MNIST test set in a 10-way one-shot learning setup.


Accuracy   1-shot test   3-shot test   5-shot test
train      90.7%         92.5%         90.3%
test       92.3%         95.4%         94.6%

Table 1. Accuracy of one-shot learning with 1-shot, 3-shot and 5-shot support at test time


As shown in Figure 8, the training accuracy converges but the test accuracy does not within 30 epochs. Moreover, the training accuracy is 90.3% while the test accuracy is only 49.0%.

The authors concluded that MetaNet successfully performs reverse transfer. In contrast, we find that after training on the MNIST task there is no performance improvement on Omniglot, and performance even degrades. We suspect that this is because the original experiment used 10-way learning, whereas ours was 5-way learning with 10-way testing: we tried 10-way training, but the memory ran out during training, so we switched to 5-way learning. In addition, the number of epochs is small, which prevents better accuracy. We therefore could not reach the same conclusion as the authors.

Figure 8. Accuracy when using MNIST as out-of-domain data

4.2. Generalization Test

For the generalization test, we conducted experiments to check whether a MetaNet model trained on an N-way one-shot task could generalize to another K-way task (where N ≠ K) without actually training on the second task, that is, N-way training and K-way testing. In this experiment, we tried 5-way training with 5-way testing and 5-way training with 10-way testing.

In Figure 9, we see that the two training curves nearly coincide.

Accuracy   5-way train, 5-way test   5-way train, 10-way test
train      92.8%                     93.0%
test       93.2%                     41.2%

Table 2. Accuracy of the generalization test from an N-way task to another K-way task

The training accuracy for 5-way training, 5-way testing is 92.8% and for 5-way training, 10-way testing it is 93.0%. We can therefore conclude that the number of test classes makes no difference during training. Moreover, the test accuracy for 5-way training, 5-way testing is nearly equal to the training accuracy, whereas there is a big gap between test and training for 5-way training, 10-way testing, and that test accuracy does not converge. Hence, when the number of classes in the test set increases, the performance declines. If N = K, classification poses no particular difficulty; when N is smaller than K, the model does not work well.

The authors observed the same behavior. They mentioned that the models trained on harder tasks (i.e. N larger than K) achieved increasing performances. These results are in accordance with common sense: the fewer the training classes, the harder it is to classify among more classes at test time.

Figure 9. The accuracy of N-way training K-way testing

5. Discussion and future work

5.1. Difficulties

We did our best to understand and implement this article. However, we encountered some difficulties even though the authors provided the source code. In general, the difficulties fall into two categories: internal difficulties and external difficulties.


For the internal difficulties:
Firstly, there are some problems with the code, parts of which we could not understand. In particular, we initially wanted to train on Mini-ImageNet and had already preprocessed that dataset, but we encountered the error "backward indexes out of bounds". We examined the code carefully but could not solve this problem in the end.

Secondly, the algorithm is not memory-efficient. We tried the 10-way training, 5-way testing experiment, but it resulted in an out-of-memory error: training the model requires more memory as the number of classes increases, and the authors even ran 20-way training. We concluded that we could not achieve this without reducing the memory footprint or using a more powerful computer; the structure of the network could be optimized in this respect.

Thirdly, for the continual learning part of the generalization test, we do not know how to shuffle 50 times to build 500 classes, since MNIST only has 10 classes. For the rapid parameterization of the fixed-weight base learner, we do not know how to swap in a new base learner.

For the external difficulties:
Firstly, as this article processes a large amount of data, the experiments are much easier to run with a GPU. With limited resources, we first tried a virtual machine, but failed because the source code requires a CUDA environment, which can only be installed on a machine with a graphics card. We spent much time on this and, in the end, bought a virtual machine with a graphics card to conduct the experiments.

Then, one should pay much attention to the versions of Python and of the packages that the authors used. A typical installation provides Python 2 and a recent release of Chainer (a Python package, at version 3), whereas the authors used Chainer 1.19.0, and there are many differences between these versions. If one does not want to change the code, using the older version is simpler, but it in turn requires a correspondingly older version of NumPy because of API changes. We therefore also spent a lot of time handling this problem.

Finally, the limited computation power leads to long running times. Training MetaNet requires a large number of epochs, and since each epoch takes around 10 minutes on a CPU-only machine and about 3 minutes with a GPU, running many epochs takes a long time; in our experiments we therefore used a small number of epochs. Besides, the size of the GPU memory affects which experiments can be run, and out-of-memory problems happened often. It is better to prepare a powerful machine.

5.2. Future work

In spite of the difficulties, we would like to explore this article in more depth. With more time and a more advanced

machine, the experiments of training on ImageNet and MNIST would be interesting to run. The parameters are important in this method; by modifying them, we could uncover more interesting properties of MetaNet.

Besides, we should try to find a mechanism to save memory. We propose to save the models on the hard disk and to delete unused objects from Random Access Memory while training, reading them back only when needed; in this way we trade an affordable amount of time for more space (a minimal sketch follows). Furthermore, this paper involves a meta learner and a base learner, fast weights and slow weights; maintaining higher-order information could make the model more robust and expressive.
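A minimal sketch of this checkpoint-to-disk idea using plain pickle; in the authors' code base, the Chainer serializers would be the natural choice, and this snippet only illustrates the time-for-space trade-off we have in mind:

```python
import gc
import pickle

def checkpoint(model, path):
    """Persist a model to disk so it can be freed from RAM."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def restore(path):
    """Reload a previously checkpointed model when it is needed again."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage during training:
# checkpoint(model, "metanet_epoch10.pkl"); del model; gc.collect()
# model = restore("metanet_epoch10.pkl")
```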

References

Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.

Munkhdalai, T. and Yu, H. Meta networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 2017.

Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842-1850, 2016.

Vinyals, Oriol, Blundell, Charles, Lillicrap, Tim, Wierstra, Daan, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630-3638, 2016.