

arXiv:1905.10837v3 [cs.LG] 6 Sep 2019

Sequential mastery of multiple tasks: Networks naturally learn to learn and forget to forget

Guy Davidson, NYU Center for Data Science

[email protected]

Michael C. Mozer, Google Research

University of Colorado at Boulder, [email protected]

Abstract

We explore the behavior of a standard convolutional neural net in a continual-learning setting that introduces classification tasks sequentially and requires the net to master new tasks while preserving mastery of previously learned tasks. This setting corresponds to that which human learners face as they acquire domain expertise serially, for example, as an individual studies a textbook. Through simulations involving sequences of ten related tasks, we find reason for optimism that nets will scale well as they advance from having a single skill to becoming multi-skill domain experts. We observe two key phenomena. First, forward facilitation, the accelerated learning of task n + 1 having learned n previous tasks, grows with n. Second, backward interference, the forgetting of the n previous tasks when learning task n + 1, diminishes with n. Amplifying forward facilitation is the goal of research on metalearning, and attenuating backward interference is the goal of research on catastrophic forgetting. We find that both of these goals are attained simply through broader exposure to a domain.

In a standard supervised setting, neural networks are trained to perform a single task, such as classification, defined in terms of a discriminative distribution p(y | x, D) for labels y conditioned on input x and data set D. Although such models are useful in engineering applications, they do not reflect the breadth required for general intelligence, which includes the ability to select among many tasks. Multitask learning (Caruana 1997) is concerned with training models to perform any one of n tasks, typically via a multi-headed neural network, where head i represents the distribution p(yi | x, D1, ..., Dn). Related tasks serve as regularizers on one another (Caruana 1993; Ruder 2017).

Continual or lifelong learning (Thrun 1996; Parisi et al. 2019) addresses a naturalistic variant in which tasks are tackled sequentially and mastery of previously learned tasks must be maintained while each new task is mastered. Lifelong learning requires consideration of two issues: catastrophic forgetting (McCloskey and Cohen 1989) and metalearning (Schmidhuber 1987; Bengio, Bengio, and Cloutier 1991; Thrun 1996). Catastrophic forgetting is characterized by a dramatic drop in task 1 performance following training on task 2, i.e., the accuracy of the model p(y1 | x, D1 → D2) is significantly lower than the accuracy of the model p(y1 | x, D1), where the arrow denotes training sequence. Metalearning aims to facilitate mastery of task n from having previously learned tasks 1, 2, ..., n − 1. Success in metalearning is measured by a reduction in training-trials-to-criterion or an increase in model accuracy given finite training for the n'th task, p(yn | x, D1 → ... → Dn), relative to the first task, p(y1 | x, D1).

Researchers have proposed a variety of creative approaches, including specialized mechanisms, learning procedures, and architectures, either for mitigating forgetting or for enhancing transfer. We summarize these approaches in the next (related work) section. Although the literatures on catastrophic forgetting and metalearning have been considered separately for the most part, we note that they have a complementary relationship. Whereas catastrophic forgetting reflects backward interference of a new task on previously learned tasks, metalearning reflects forward facilitation of previously learned tasks on a new task (Lopez-Paz and Ranzato 2017). Whereas catastrophic forgetting research has focused on the first task learned, metalearning research has focused on the last task learned. We thus view these two topics as endpoints of a continuum.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To unify the topics, we examine the continuum from the first task to the n'th. We train models on a sequence of related tasks and investigate the consequences of introducing each new task i. We measure how many training trials are required to learn the i'th task while maintaining performance on tasks 1 ... i − 1 through continued practice. Simultaneously, we measure how performance drops on tasks 1 ... i − 1 after introducing task i and how many trials are required to retrain tasks 1 ... i − 1. We believe that examining scaling behavior, performance as a function of i, is critical to assessing the efficacy of sequential multitask learning. Scaling behavior has been mostly overlooked in recent deep-learning research, which is odd considering its central role in computational complexity theory, and therefore, in assessing whether existing algorithms offer any hope for extending to human-scale intelligence.
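The two quantities measured above can be expressed concretely. The following is a minimal sketch (not the paper's code) in which `acc[(task, episode)]` and `trials_to_criterion[task]` are hypothetical records of hold-out accuracy and training cost:

```python
def backward_interference(acc, task, episode):
    """Drop in a previously learned task's accuracy after later episodes
    (positive value = forgetting)."""
    return acc[(task, task)] - acc[(task, episode)]

def forward_facilitation(trials_to_criterion, task):
    """Reduction in trials-to-criterion for a later task relative to task 1,
    the metalearning success measure described above."""
    return trials_to_criterion[1] - trials_to_criterion[task]

# Toy example with invented numbers:
acc = {(1, 1): 0.95, (1, 2): 0.80}
trials = {1: 64000, 3: 16000}
bi = backward_interference(acc, task=1, episode=2)  # ~0.15 of accuracy lost
ff = forward_facilitation(trials, task=3)           # 48000 fewer trials
```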

Surprisingly, we are aware of only one article (Schwarz et al. 2018) that jointly considers forgetting and metalearning through their scaling properties. However, Schwarz et al.'s research, like that in the catastrophic-forgetting and metalearning literatures, suggests that specialized mechanisms are required for neural networks to operate in a lifelong learning setting. The punch line of our article is that a standard neural network architecture trained sequentially to acquire and maintain mastery of multiple tasks exhibits faster acquisition of new knowledge and less disruption of previously acquired knowledge as domain expertise expands. We also argue that the net's learning and forgetting characteristics have an intriguing correspondence to the human and animal behavioral literature.

Related research

To overcome catastrophic forgetting, standard techniques such as dropout have been suggested (Goodfellow et al. 2015), but most proposals augment models with specialized mechanisms (Parisi et al. 2019, for a review). Kirkpatrick et al. (2017) introduce elastic weight consolidation, which adds a penalty to the model loss that encourages stability of weights that most contribute to performance on previously trained tasks. Lopez-Paz and Ranzato (2017) describe Gradient Episodic Memory, which retains examples of previous tasks and minimizes the aforementioned negative backward transfer. Kemker and Kanan (2018) devise FearNet, a neurally inspired model with a dual-memory design, using consolidation mechanisms modeled after mammalian sleep consolidation. Zenke, Poole, and Ganguli (2017) are similarly biologically inspired, motivating intelligent synapses that track their relevance to particular tasks. Kamra, Gupta, and Liu (2017) offer another dual-memory model, augmented with a generative replay model able to recreate past experiences and improve performance on previously learned tasks.

To facilitate metalearning, mechanisms have been offered to encourage inter-task transfer, such as MAML (Finn, Abbeel, and Levine 2017) and SNAIL (Mishra et al. 2018). Other approaches employ recurrence to modify the learning procedure itself (Andrychowicz et al. 2016; Wang et al. 2017). Schwarz et al. (2018) construct a dual-component model consisting of a knowledge store of previously learned tasks and an active component that is used to efficiently learn the current task. A consolidation procedure then transfers knowledge from short- to long-term stores.

Despite the creativity of this assortment of methods, our concern centers on the fact that researchers assume the inadequacy of standard methods, and no attempt has been made to understand the properties of a standard architecture as it is trained sequentially on a series of tasks, and to characterize the extent of forgetting and transfer as more tasks are learned.

Figure 1: Example training images

Methodology

The tasks we train are defined over images consisting of multiple synthetic shapes having different colors and textures (Figure 1). The tasks involve yes/no responses to questions about whether an image contains certain objects or properties, such as "is there a red object?" or "is there a spherical object?" We generate a series consisting of 10 episodes; in each episode, a new task is introduced (more details to follow on the tasks). A model is trained de novo on episode 1, and then continues training for the remaining episodes. In episode i, training involves a mix of examples drawn from tasks 1 to i until an accuracy criterion of 95% is attained on a hold-out set for all tasks. To balance training on the newest task (task i in episode i) and retraining on previous tasks, we adapt the methodology of Nguyen et al. (2018): half the training set consists of examples from the newest task, and the other half consists of an equal number of examples from each of the previous tasks 1 through i − 1. (In episode 1, only the single task is trained.) The same set of training images is repeated each epoch of training, but they are randomly reassigned to different tasks from epoch to epoch. In each epoch, we roughly balance the number of yes and no target responses for each task. We turn now to details of the images, tasks, and architecture.

Image generation. We leverage the CLEVR (Johnson et al. 2017) codebase to generate 160 × 120 pixel color images, each with 4 or 5 objects that vary along three dimensions: shape, color, and texture. We introduce additional features on each dimension to ensure 10 feature values per dimension. (See supplementary material for details.) We synthesized 45,000 images for a training set, roughly balancing the count of each feature across images. An additional 5,000 images were generated for a hold-out set. Each image could be used for any task. Each epoch of training involved one pass through all images, with a random assignment of images to tasks each epoch to satisfy the constraint on the distribution of tasks.

Tasks. For each replication of our simulation, we select one of the three dimensions and randomize the order of the ten within-dimension tasks. To reduce sensitivity of the results to order, we performed replications using a Latin square design (Bailey 2008, ch. 9), guaranteeing that within a block of ten replications, each task will appear in each ordinal position exactly once. We constructed six such Latin square blocks for each of the three dimensions, resulting in 180 total simulation replications. Because we observed no meaningful differences across task dimensions (see supplementary material), the results we report below collapse across dimension.

Figure 2: Model architecture. Input consists of image and task representation. Dashed line from task representation to Conv-1 indicates optional task-modulated visual processing, described under "Task-modulated visual processing."

Architecture. Our experiments use a basic vision architecture with four convolutional layers followed by four fully connected layers (Figure 2). The convolutional layers, with 16, 32, 48, and 64 filters successively, each have 3x3 kernels with stride 1 and padding 1, followed by ReLU nonlinearities, batch normalization, and 2x2 max pooling. The fully connected layers have 512 units each, also with ReLU nonlinearities. All models were implemented in PyTorch (Paszke et al. 2017) and trained with ADAM (Kingma and Ba 2015) using a learning rate of 0.0005 and weight decay of 0.0001. Note that our model is generic and is not specialized for metalearning or for preventing catastrophic forgetting. Instead of having one output head for each task, we specify the task as a component of the input. Similar to Sort-of-CLEVR (Santoro et al. 2017), we code the task as a one-hot input vector. We concatenate the task representation to the output of the last convolutional layer before passing it to the first fully connected layer.
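The episode-i training mixture adapted from Nguyen et al. (2018) can be sketched as follows. This is an illustrative stand-in, not the paper's pipeline; `pools` and `build_training_batch` are hypothetical names:

```python
import random

def build_training_batch(pools, episode, batch_size=1500, seed=0):
    """Compose a batch for episode i: half from the newest task, the
    other half split evenly across previously learned tasks.
    `pools[t]` is a list of examples for task t (tasks are 1-indexed)."""
    rng = random.Random(seed)
    if episode == 1:
        return [(1, x) for x in rng.choices(pools[1], k=batch_size)]
    batch = [(episode, x) for x in rng.choices(pools[episode], k=batch_size // 2)]
    per_old = (batch_size // 2) // (episode - 1)
    for t in range(1, episode):
        batch += [(t, x) for x in rng.choices(pools[t], k=per_old)]
    return batch

pools = {t: list(range(100)) for t in range(1, 5)}
batch = build_training_batch(pools, episode=4, batch_size=1500)
counts = {t: sum(1 for task, _ in batch if task == t) for t in range(1, 5)}
# In episode 4, the newest task gets 750 examples; tasks 1-3 get 250 each.
```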

Results

Metalearning

Figure 3a depicts hold-out accuracy for a newly introduced task as a function of the number of training trials. Curve colors indicate the task's ordinal position in the series of episodes, with cyan being the first and magenta being the tenth. Not surprisingly, task accuracy improves monotonically over training trials. But notably, metalearning is evidenced because the accuracy of task i + 1 is strictly higher than the accuracy of task i for i > 2. To analyze our simulations more systematically, we remind the reader that the simulation sequence presents fifty-five opportunities to assess learning: the task introduced in episode 1 (i.e., ordinal position 1) is trained ten times, the task introduced in episode 2 is trained nine times, and so forth, until the task introduced in episode 10, which is trained only once. Figure 3c indicates, with one line per task, the training required in a given episode to reach a hold-out accuracy of 95%, the dashed line in Figure 3a. Training required per episode is plotted as a function of the number of times the task is retrained. The downward-shifting intercept of the curves for later tasks in the sequence indicates significantly easier learning and relearning. Figure 3e shows an alternative view of difficulty-of-training by plotting accuracy after a fixed amount of (re)training. The conditions that require the least number of trials to criterion (Figure 3c) also achieve the highest accuracy after a small amount of training (Figure 3e).

Figure 3: (a) Hold-out set accuracy as a function of training trials (log scale) for a newly introduced task. Colored lines indicate task ordinal position (cyan = introduced in episode 1; magenta = introduced in episode 10). In all panels, the shaded region represents ±1 standard error of the mean. (b) Hold-out accuracy of the task introduced in episode 1 by number of times it is retrained (black = 1 time, copper = 10 times). (c) Number of trials required to reach the accuracy criterion (log scale) as a function of the number of times a given task is trained (also log scale). As in (a), the colors indicate task ordinal position (the episode in which a task is introduced). (d) Similar to (c) but plotting only the new task introduced in a given episode. (e) Hold-out accuracy attained after a fixed amount of training (22.5k trials) of a given task, graphed as a function of the number of times a given task is trained. As in (a), the colors indicate the episode in which a task is introduced. (f) Similar to (e) but plotting only the new task introduced in a given episode.

Catastrophic forgetting

Figure 3b shows the accuracy of the task introduced in the first episode (y1) as it is retrained each episode.¹ The fact that performance in a new episode drops below criterion (the dashed line) indicates backward interference. However, there is a relearning savings: the amount of interference diminishes monotonically with the number of times trained. Notably, catastrophic forgetting of task 1 is essentially eliminated by the last few episodes. Figure 3c shows very similar relearning savings for tasks 2-10 as for task 1. The roughly log-log linear curves offer evidence of a power-law decrease in the retraining effort required to reach criterion.

Figure 3 also reveals that the first two episodes are anomalous. Strong backward interference on task 1 is exhibited when task 2 is introduced (the crossover of the cyan curve in Figure 3c), a phenomenon that does not occur for subsequent tasks. Similarly, strong forward interference of task 1 on task 2 is evident (slower learning for task 2 than for task 1 in Figure 3d), but tasks 3-10 are increasingly facilitated by previous learning. These findings suggest that to understand the properties of neural nets, we must look beyond training on just two tasks, which is often the focus of research in transfer learning and catastrophic forgetting.
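A log-log linear relearning curve implies power-law savings, trials ≈ c · k^(−β) for the k'th time a task is trained. As a hedged illustration (the data below are invented, not the paper's), the exponent can be recovered with an ordinary least-squares fit in log space:

```python
import math

def fit_power_law(times_trained, trials_to_criterion):
    """OLS fit of log(trials) = log(c) - beta * log(k); returns (c, beta)."""
    xs = [math.log(k) for k in times_trained]
    ys = [math.log(t) for t in trials_to_criterion]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope

# Synthetic relearning data following trials = 64000 * k**-1.5 exactly,
# so the fit should recover c = 64000 and beta = 1.5:
ks = [1, 2, 3, 4, 5]
trials = [64000 * k ** -1.5 for k in ks]
c, beta = fit_power_law(ks, trials)
```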

Resilience to forgetting

The fact that old tasks need to be retrained each episode suggests that training on a new task induces forgetting of the old. However, because we trained simultaneously on the old and new tasks, we have no opportunity to examine forgetting explicitly. We can, however, clone the weights at any point in the simulation and examine a different training trajectory moving forward. We took the network weights at the start of each episode i, at which point the network is at criterion on tasks 1 through i − 1. Then, instead of retraining on all i tasks, we train only on task i. We probe the network regularly to evaluate performance on old tasks.

Figure 4: Exploration of forgetting. (a) Residual accuracy of task 1 as the task introduced in episodes 2-10 is trained (corresponding to the 1-9 times that task 1 had previously been trained, with black and copper for 1 and 9, respectively). (b) The inferred exponential decay rate as a function of the number of times a task is trained.

¹The misalignment of the first point is due to the fact that the accuracy is assessed at the end of a training epoch, and each successive episode has fewer trials of task y1 per epoch.

Figure 4a depicts the time course of forgetting of the task introduced in episode 1 on each subsequent episode. The black curve corresponds to episode 2 (task 1 has been trained only once previously) and the copper curve corresponds to episode 10 (task 1 has been trained 9 times previously). Task 1 becomes more robust to backward interference from the new task in later episodes. In episode i, task 1 has been (re)trained i − 1 times previously, yielding a sort of spaced practice that appears to cause the memory to be more robust. This result is suggestive of the finding in human memory that interleaved, temporally distributed practice yields more robust and durable memory (Kang et al. 2014; Cepeda et al. 2008).

Figure 4a depicts only some of the forty-five opportunities we have to assess forgetting: we have one after the model learns a single task, two after the model learns two, up to nine after the model learns the ninth task (for which we examine forgetting by training on the tenth and final task in the order). To conduct a more systematic analysis, we fit the forgetting curves for each task i in each episode e > i. The forgetting curve characterizes accuracy a after t training batches of 1500 trials. Accuracy must be adjusted for guessing: because our tasks have a baseline correct-guessing rate of 0.5, we define a = 0.5 + 0.5m to be the observed accuracy when memory strength m lies between 0 (no task memory) and 1 (complete and accurate task memory). We explore two characterizations of memory strength. The first is exponential decay, m = α exp(−βt), where α is the initial strength, β is a decay rate, and t is the number of intervening training batches. The second is power-law decay, m = α(1 + γt)^(−β), where γ serves as a timescale variable. This power-law decay curve is common in the psychological literature on forgetting (Wixted and Carpenter 2007) and has the virtue over m = αt^(−β) that it can characterize strength at t = 0.
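The two candidate forgetting models follow directly from the definitions above. A minimal sketch; the parameter values are arbitrary placeholders, not fitted values from the paper:

```python
import math

def observed_accuracy(m):
    """Adjust memory strength m in [0, 1] for the 0.5 guessing baseline."""
    return 0.5 + 0.5 * m

def exponential_strength(t, alpha, beta):
    """m = alpha * exp(-beta * t), with t in intervening training batches."""
    return alpha * math.exp(-beta * t)

def power_law_strength(t, alpha, beta, gamma):
    """m = alpha * (1 + gamma * t)**(-beta); well-defined at t = 0."""
    return alpha * (1.0 + gamma * t) ** (-beta)

# Both forms start at strength alpha, hence accuracy 0.5 + 0.5*alpha:
a_exp = observed_accuracy(exponential_strength(0, alpha=0.9, beta=0.1))  # ~0.95
a_pow = observed_accuracy(power_law_strength(0, alpha=0.9, beta=0.5, gamma=0.2))  # ~0.95
```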

We fit the exponential and power-law functions separately to the data from each of the 45 model training points across 67 replications of our experiment. Following Clauset, Shalizi, and Newman (2009), we fit each form to the first half of the data and assess it on the second half. Although the power-law function obtains a substantially lower MSE on the training data (power-law: 0.0045, exponential: 0.0198), the exponential function fit the held-out data better (power-law: 0.0232, exponential: 0.0192), and the exponential function offered a better fit on 24 of the 45 model training points. We therefore adopt the exponential-decay function and characterize decay by the rate parameter β.
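The split-half procedure amounts to fitting each form on the first half of a forgetting curve and comparing held-out MSE on the second half. A minimal sketch with invented data and a crude grid-search fit (the paper's actual fitting method is not specified here):

```python
import math

def fit_exponential(ts, ms):
    """Grid-search (alpha, beta) minimizing squared error of m = alpha*exp(-beta*t)."""
    best = None
    for a in [i / 20 for i in range(1, 21)]:
        for b in [i / 100 for i in range(101)]:
            err = sum((m - a * math.exp(-b * t)) ** 2 for t, m in zip(ts, ms))
            if best is None or err < best[0]:
                best = (err, a, b)
    return best[1], best[2]

def holdout_mse(ts, ms, a, b):
    return sum((m - a * math.exp(-b * t)) ** 2 for t, m in zip(ts, ms)) / len(ms)

# Invented forgetting curve: strength decays exponentially with rate 0.05.
ts = list(range(10))
ms = [0.9 * math.exp(-0.05 * t) for t in ts]
a, b = fit_exponential(ts[:5], ms[:5])   # fit on the first half
mse = holdout_mse(ts[5:], ms[5:], a, b)  # assess on the second half
```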

Figure 4b presents the inferred decay rate β for each of the forty-five model training points, presented in the style of Figures 3c,e. The basic pattern is clear: additional practice yields a more durable memory trace, regardless of a task's ordinal position. Further, with the exception of tasks 1 and 2, the forgetting rate of task i on episode i + 1 decreases with i. One is tempted to interpret this effect in terms of studies of human long-term memory, where serial position effects are a robust phenomenon: items learned early and late are preserved in memory better than items learned in between (Glenberg et al. 1980). Psychological studies train people only once, so there are no behavioral data concerning how serial position interacts with the number of times trained, as we have in the simulation. There are a number of respects in which our simulation methodology does not align with experimental methodology in psychological studies, such as the fact that we assess forgetting shortly after exposure, not at the end of a sequence of tasks. Nonetheless, the correspondence between our simulation and human memory is intriguing.

Heterogeneous task sequences

We noted two benefits of training on task sequences: reduced backward interference and increased forward facilitation. We next try to better understand the source of these benefits. In particular, we ask how the benefits relate to similarity among tasks. Previously, we sampled tasks homogeneously: all ten tasks in a sequence were drawn from a single dimension (color, shape, or texture). We now explore the consequence of sampling tasks heterogeneously: the ten tasks in a sequence draw from all three dimensions. Each replication utilizes a single permutation of the three dimensions and samples the ten tasks cycling between the dimensions (four from the first, three from each of the other two). We employed a similar Latin square design to balance between the permutations, such that each block of six replications includes each permutation once.

Figure 5: Heterogeneous task sequences. (a) Number of trials required to reach the accuracy criterion versus number of times a task is trained (cf. Figure 3c). The first two tasks are labeled by the numbers 1 and 2. (b) Increase in number of trials required to reach the accuracy criterion for homogeneous sequences compared to heterogeneous as a baseline. Positive values indicate points learned faster in the heterogeneous condition, negative values in the baseline condition.

Figure 5a presents the results of 114 replications of the heterogeneous sequences, nineteen using each of the six task permutations. To facilitate the comparison to the homogeneous sequence results (Figure 3c), we plot in Figure 5b the increase in number of trials to criterion with homogeneous sequences compared to heterogeneous as a baseline. With several exception points, the differences are not notable, suggesting that inter-task effects with heterogeneous sequences are similar to those with homogeneous sequences. Thus, inter-task effects appear to be primarily due to learning to process visual images in general, rather than the specific task-relevant dimensions. The two outlier points in Figure 5b concern the first two episodes: with heterogeneous training, the interference between tasks 1 and 2 nearly vanishes, perhaps because the resources and representations required to perform the two tasks overlap less. One might have predicted just the opposite result, but apparently, extracting information relevant for one dimension does not preclude constructing representations suitable for other dimensions. In fact, the result appears consistent with a finding from human memory: that reducing (semantic) similarity of items reduces interference among them (Baddeley and Dale 1966).

Task-modulated visual processingThe architecture that we have experimented with thusfar treats the convolutional layers as visual feature ex-tractors, trained end-to-end on task sequences, butthe convolutional layers have no explicit informationabout task; task input is provided only to the fi-nal layers of the net. In contrast, processing in hu-man visual cortex can be task modulated (Fias et al.2002). Perhaps modifying the architecture to providetask information to convolutional layers would reduceinter-task interference. Along the lines of Mozer andFan (2008), we investigated a modified model usingtask-modulated visual processing, adopting a simpler ap-proach than most existing architectures for conditionalnormalization or gated processing (Perez et al. 2018;Chen et al. 2018). We consider task modulation via atask-specific learned bias for each channel in a convolu-tional layer. As before, task is coded as a one-hot vector.We incorporate connections from the task representationto a convolutional layer (Figure 2), with one bias pa-rameter for the Cartesian product of tasks and channels.This bias parameter is added to the output of each filterin a channel before applying the layer nonlinearity.We investigated task modulation at each of the four

convolutional layers in our model. Because the results oftask modulation at the different layers are quite similar(see supplementary material), we report the results of

Page 6: (Schmidhuber 1987; Bengio, arXiv:1905.10837v3 [cs.LG] 6

1 2 3 4 5 6 7 8 910Log(Number of times trained)

4k

8k

16k

32k

64k

128k

256k

512kLo

g(tri

als t

o cr

iterio

n)

1 2 3 4 5 6 7 8 9 10Number of times trained

-50k

0

50k

100k

150k

200k

250k

300k

tria

ls to

crit

erio

n

1

2

3

4

5

6

7

8

9

10

Task

ord

inal

pos

ition

(a) (b)1

2

Figure 6: Effect of modulating first convolutional layerwith information about the current task. (a) Number oftrials required to reach the accuracy criterion versus numberof times a task is trained (cf. Figure 3c). (b) Increase innumber of trials required to reach accuracy criterion fornon-task-modulated versus task modulated architectures.

modulating processing at the first convolutional layer. Figure 6 depicts the results of three Latin square replications, yielding thirty simulations for each dimension, or ninety in total. Introducing task-based modulation allows the model to avoid the catastrophic forgetting previously observed when learning the second task impairs the first and, to a lesser extent, improves performance in the third episode as well. As the model learns additional tasks, and continues retraining on the same tasks, the benefits of task modulation diminish rapidly (Figure 6b), suggesting the primary benefit is in aiding early learning. We hypothesize that modulating visual processing with the task representation allows the model to learn flexible visual representations that produce less interference.
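The per-channel task bias described above is simple to express in code. Below is a minimal NumPy sketch of the mechanism, assuming a bias table with one entry per task-channel pair; the function and variable names are ours, for illustration only, and the bias table is random here merely to exercise the shapes (in a trained model it would be a learned parameter).

```python
import numpy as np

def task_modulated_conv_output(conv_out, task_id, task_bias):
    """Add a task-specific learned bias to each channel of a convolutional
    layer's pre-activation output, then apply the layer nonlinearity.

    conv_out:  (channels, height, width) pre-activation feature maps
    task_id:   integer index of the current task (from the one-hot task code)
    task_bias: (num_tasks, channels) table, one parameter per task x channel
    """
    bias = task_bias[task_id]                   # (channels,)
    modulated = conv_out + bias[:, None, None]  # broadcast over height, width
    return np.maximum(modulated, 0.0)           # ReLU nonlinearity

# Toy usage: 3 tasks, 4 channels, 5x5 feature maps.
rng = np.random.default_rng(0)
conv_out = rng.standard_normal((4, 5, 5))
task_bias = 0.1 * rng.standard_normal((3, 4))
out = task_modulated_conv_output(conv_out, task_id=1, task_bias=task_bias)
```

Because the bias is shared across all spatial positions within a channel, the modulation adds only num_tasks × channels parameters per layer.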

Comparison to MAML

The results we have presented thus far serve as a baseline against which one can compare any method specialized to reduce forgetting or boost transfer. We conducted comparisons to several such methods, and in this section we report on experiments with model-agnostic metalearning, or MAML (Finn, Abbeel, and Levine 2017). MAML is designed to perform metalearning on a sequence of tasks in order to learn the next task in the sequence more efficiently. However, it is not designed for our continual-learning paradigm, which requires preservation of mastery for previous tasks. We explored two variants of MAML adapted to our paradigm. We report here on the more successful of the two (see supplementary material for details).

Our paradigm is based on a series of 10 episodes where tasks accumulate across episodes. MAML is also trained over a series of episodes, but we make a correspondence between one episode of MAML—the outer loop of the algorithm—and what we will refer to as a micro-episode of our paradigm, which corresponds to a single batch in our original training procedure. Each micro-episode starts with network weights w, and we draw a half-batch of 750 examples (compared to 1500 in the original setting), of which 50% are from the newest task and the remainder are split evenly across the previous tasks. (For task 1, all examples are from task 1.) From w, we compute a gradient step based on the examples for

each task, and apply this step separately to w, yielding i copies of the weights in episode i, {w1, ..., wi}, each specialized for its corresponding task. We then draw a second half-batch of 750 examples and perform a metatraining step, as described in MAML. Metatraining involves computing the gradient with respect to w for the new examples of each task k based on the weights wk. Following MAML, we then update w, and proceed to the next micro-episode until our training criterion is attained. Having halved the batch size, we doubled the learning rate from 0.0005 in the original setting to 0.001 for both of MAML's learning rates. Model details are otherwise identical to the base model.
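As a concrete illustration of how a micro-episode's half-batch is divided across tasks, here is a small Python sketch following the proportions stated above (half from the newest task, the remainder split evenly across the previous tasks); the function name is ours, not from the paper.

```python
def micro_episode_batch_sizes(episode, half_batch=750):
    """Number of examples drawn from each task in one micro-episode.

    Half the examples come from the newest task; the remainder is split
    evenly across the episode-1 previous tasks. In episode 1, all
    examples come from task 1.
    """
    if episode == 1:
        return {1: half_batch}
    sizes = {episode: half_batch // 2}
    share = (half_batch - sizes[episode]) // (episode - 1)
    for task in range(1, episode):
        sizes[task] = share
    return sizes

# In episode 4, for example: 375 examples of task 4, 125 each of tasks 1-3.
```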

Over 90 replications (30 per dimension), we find that the performance of our MAML variant is qualitatively similar to that of our base model (compare Figure 7a and Figure 3c). However, quantitatively, the MAML-based method requires more trials to reach criterion in expectation: Figure 7b shows the relative number of trials to criterion, where negative values indicate that MAML is worse than our base model. Apparently the benefit of metatraining does not outweigh the cost of splitting the data and devoting half of it to metatraining.

Discussion

We explored the behavior of a standard convolutional neural net for classification in a continual-learning setting that introduces tasks sequentially and requires the net to master new tasks while preserving mastery of previously learned tasks. This setting corresponds to that which human learners naturally face as they become domain experts. For example, consider students reading a calculus text chapter by chapter. Early on, engaging with a chapter and its associated exercises results in forgetting of previously mastered material. However, as more knowledge is acquired, students begin to scaffold and link knowledge and eventually are able to integrate the new material with the old. As the final chapters are studied, students have built a strong conceptual framework which facilitates the integration of new material with little disruption of the old. These hypothetical students behave much like the net we studied in this


Figure 7: MAML. (a) Number of trials (inner and outer loops combined) required to reach the accuracy criterion versus number of times a task is trained (cf. Figure 3c). (b) Increase in number of trials required to reach the accuracy criterion for training without MAML compared to utilizing MAML. Negative values indicate slower learning with MAML than with the base model.


article.

We summarize our novel findings and, where appropriate, link these findings more concretely to the literature on human learning.

1. Metalearning (forward facilitation) is observed once the net has acquired sufficient domain expertise. In our paradigm, ‘sufficient expertise’ means having mastered two tasks previously. Metalearning is demonstrated when training efficiency—the number of trials needed to reach criterion—improves with each successive task (Figures 3d,f). Metalearning occurs naturally in the model and does not require specialized mechanisms. Indeed, incorporating a specialized mechanism, MAML, fails to enhance metalearning in our continual-learning paradigm.

2. Catastrophic forgetting (backward interference) is reduced as the net acquires increasing domain expertise (i.e., as more related tasks are learned). In Figure 3c, compare tasks introduced early (cyan) and late (magenta) in the sequence, matched for number of times they have been trained (position on the abscissa). Retraining efficiency improves for tasks introduced later in the task sequence, indicating a mitigation of forgetting. Note that the number of trials to relearn a skill is less than the number of trials required to initially learn a skill (the exception being task 1 in episode 2). This relearning savings effect has long been identified as a characteristic of human memory (Ebbinghaus 1908), as, of course, has the ubiquity of forgetting, whether due to the passage of time (Lindsey et al. 2014) or to backward interference from new knowledge (Osgood 1948).

3. The potential for catastrophic forgetting (backward interference) is also reduced each time a task is relearned, as indicated by the monotonically decreasing curves in Figure 3c and by the change in forgetting rates in Figure 4. A task that is practiced over multiple episodes receives distributed practice that is interleaved with other tasks. The durability of memory with distributed, interleaved practice is one of the most well studied phenomena in cognitive psychology (Kang et al. 2014; Cepeda et al. 2008; Taylor and Rohrer 2010; Birnbaum et al. 2013).

4. Training efficiency improves according to a power function of the number of tasks learned, controlling for experience on a task (indicated by the linear curve in Figure 3d, plotted in log-log coordinates), and also according to a power function of the amount of training a given task has received, controlling for the number of tasks learned (indicated by the linear curves in Figure 3c). Power-law learning is a robust characteristic of human skill acquisition, observed on a range of behavioral measures (Newell and Rosenbloom 1980; Donner and Hardy 2015).

5. Forward facilitation and reduction in backward interference are observed only after two or more tasks have been learned. This pattern can be seen in the non-monotonicities in the curves of Figures 3d,f and in the crossover of curves in Figures 3c,e. Catastrophic forgetting is evidenced primarily for task 1 when task 2 is learned—the canonical case studied in the literature. However, the net becomes more robust as it acquires domain expertise, and eventually the relearning effort becomes negligible (e.g., copper curves in Figure 3b). The anomalous behavior of task 2 is noteworthy, yielding a transition behavior perhaps analogous to the “zero-one-infinity” principle (MacLennan 1999).

6. Catastrophic forgetting in the second episode can be mitigated in two different ways: first, by choosing tasks that rely on different dimensions (Figure 5); and second, by introducing task-based modulation of visual processing (Figure 6). We conjecture that both of these manipulations can be characterized in terms of reducing the similarity of the tasks. In human learning, reducing (semantic) similarity reduces interference (Baddeley and Dale 1966).
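The power-function claim in point 4 implies that trials to criterion fall on a straight line when both axes are logarithmic, so the exponent can be read off as the slope of a log-log fit. A small synthetic check of this reasoning (the constants here are hypothetical, not fitted to our data):

```python
import numpy as np

# If trials-to-criterion follows t = a * n**(-b) for n tasks learned,
# then log(t) = log(a) - b*log(n): a straight line in log-log coordinates.
n = np.arange(1, 11)         # number of tasks learned so far
a, b = 200_000.0, 1.5        # hypothetical power-law constants
t = a * n ** (-b)            # synthetic trials-to-criterion values

# A least-squares fit in log-log coordinates recovers the exponent:
# slope ~ -b and exp(intercept) ~ a.
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
```

The same log-log regression applied to the empirical curves is what distinguishes a power law from, say, exponential decay, which would instead be linear in semi-log coordinates.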

We are able to identify these intriguing phenomena because our simulations examined scaling behavior and not just effects of one task on a second—the typical case for studying catastrophic forgetting—or the effects of many tasks on a subsequent task—the typical case for metalearning and few-shot learning. Studying the continuum from the first task to the n-th is quite revealing.

We find that learning efficiency improves as more

tasks are learned. Although MAML produces no benefit over the standard architecture that served as our baseline, we have yet to explore other methods that are explicitly designed to facilitate transfer and suppress interference (Mishra et al. 2018; Kirkpatrick et al. 2017; Lopez-Paz and Ranzato 2017). The results presented in this article serve as a baseline to assess the benefits of specialized methods. A holy grail of sorts would be to identify methods that achieve backward facilitation, where training on later tasks improves performance on earlier tasks, and compositional generalization (Fodor and Pylyshyn 1988; Fodor and Lepore 2002; Lake and Baroni 2018; Loula, Baroni, and Lake 2018), where learning the interrelationship among earlier tasks allows new tasks to be performed on the first trial. Humans demonstrate the former under rare conditions (Ausubel, Robbins, and Blake 1957; Jacoby, Wahlheim, and Kelley 2015); the latter is common in human behavior, as when individuals are able to perform a task immediately from instruction.

An exciting direction for future research concerns optimizing curricula for continual learning. Our initial approach was inspired by best practices of the science-of-learning literature (Weinstein, Madan, and Sumeracki 2018). Our hope is that investigations of networks may in turn provide helpful guidance for improving curricula for human learners. Toward this goal, it is encouraging that we observed more than superficial similarities between human and network continual learning.


References

[Andrychowicz et al. 2016] Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; Shillingford, B.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. Advances in Neural Information Processing Systems 29, 3981–3989.

[Ausubel, Robbins, and Blake 1957] Ausubel, D. P.; Robbins, L. C.; and Blake, Elias, J. 1957. Retroactive inhibition and facilitation in the learning of school materials. Journal of Educational Psychology 48(6):334–343.

[Baddeley and Dale 1966] Baddeley, A. D., and Dale, C. A. 1966. The effect of semantic similarity on retroactive interference in long- and short-term memory. Journal of Verbal Learning and Verbal Behavior 5:417–420.

[Bailey 2008] Bailey, R. 2008. Design of Comparative Experiments. Cambridge University Press, 1st edition.

[Bengio, Bengio, and Cloutier 1991] Bengio, Y.; Bengio, S.; and Cloutier, J. 1991. Learning a synaptic learning rule. In Seattle International Joint Conference on Neural Networks, volume ii, 969. IEEE.

[Birnbaum et al. 2013] Birnbaum, M. S.; Kornell, N.; Bjork, E. L.; and Bjork, R. A. 2013. Why interleaving enhances inductive learning: The roles of discrimination and retrieval. Memory & Cognition 41(3):392–402.

[Caruana 1993] Caruana, R. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 41–48. Morgan Kaufmann.

[Caruana 1997] Caruana, R. 1997. Multitask learning. Machine Learning 28:41–75.

[Cepeda et al. 2008] Cepeda, N. J.; Vul, E.; Rohrer, D.; Wixted, J. T.; and Pashler, H. 2008. Spacing effects in learning: A temporal ridgeline of optimal retention. Psychological Science 19:1095–1102.

[Chen et al. 2018] Chen, Z.; Li, Y.; Bengio, S.; and Si, S. 2018. GaterNet: Dynamic filter selection in convolutional neural network via a dedicated global gating network. Technical report, arXiv.

[Clauset, Shalizi, and Newman 2009] Clauset, A.; Shalizi, C.; and Newman, M. 2009. Power-law distributions in empirical data. SIAM Review 51(4):661–703.

[Donner and Hardy 2015] Donner, Y., and Hardy, J. L. 2015. Piecewise power laws in individual learning curves. Psychonomic Bulletin & Review 22(5):1308–1319.

[Ebbinghaus 1908] Ebbinghaus, H. 1908. Psychology: An Elementary Textbook. Arno Press.

[Elman 1993] Elman, J. L. 1993. Learning and development in neural networks: the importance of starting small. Cognition 48(1):71–99.

[Fias et al. 2002] Fias, W.; Dupont, P.; Reynvoet, B.; and Orban, G. A. 2002. The quantitative nature of a visual task differentiates between ventral and dorsal stream. Journal of Cognitive Neuroscience 14:646–658.

[Finn, Abbeel, and Levine 2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70.

[Fodor and Lepore 2002] Fodor, J. A., and Lepore, E. 2002. The Compositionality Papers. Oxford University Press UK.

[Fodor and Pylyshyn 1988] Fodor, J. A., and Pylyshyn, Z. W. 1988. Connectionism and cognitive architecture: a critical analysis. Cognition 28(1-2):3–71.

[Glenberg et al. 1980] Glenberg, A. M.; Bradley, M. M.; Stevenson, J. A.; Kraus, T. A.; Tkachuk, M. J.; Gretz, A. L.; and Turpin, B. M. 1980. A two-process account of long-term serial position effects. Journal of Experimental Psychology: Human Learning and Memory 6(4):355–369.

[Goodfellow et al. 2015] Goodfellow, I. J.; Mirza, M.; Xiao, D.; Courville, A.; and Bengio, Y. 2015. An empirical investigation of catastrophic forgetting in gradient-based neural networks. Technical report, arXiv.

[Jacoby, Wahlheim, and Kelley 2015] Jacoby, L. L.; Wahlheim, C. N.; and Kelley, C. M. 2015. Memory consequences of looking back to notice change: Retroactive and proactive facilitation. Journal of Experimental Psychology: Learning, Memory, and Cognition 41(5):1282–1297.

[Johnson et al. 2017] Johnson, J.; Fei-Fei, L.; Hariharan, B.; Zitnick, C. L.; Van Der Maaten, L.; and Girshick, R. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Kamra, Gupta, and Liu 2017] Kamra, N.; Gupta, U.; and Liu, Y. 2017. Deep generative dual memory network for continual learning. Technical report, arXiv.

[Kang et al. 2014] Kang, S. H. K.; Lindsey, R. V.; Mozer, M. C.; and Pashler, H. 2014. Retrieval practice over the long term: Expanding or equal-interval spacing? Psychonomic Bulletin & Review 21:1544–1550.

[Kemker and Kanan 2018] Kemker, R., and Kanan, C. 2018. FearNet: Brain-inspired model for incremental learning. In ICLR.

[Kingma and Ba 2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.

[Kirkpatrick et al. 2017] Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; Hassabis, D.; Clopath, C.; Kumaran, D.; and Hadsell, R. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13):3521–3526.

[Lake and Baroni 2018] Lake, B., and Baroni, M. 2018. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 2873–2882. Stockholm, Sweden: PMLR.

[Lindsey et al. 2014] Lindsey, R. V.; Shroyer, J. D.; Pashler, H.; and Mozer, M. C. 2014. Improving students' long-term knowledge retention through personalized review. Psychological Science 25(3):639–647.

[Lopez-Paz and Ranzato 2017] Lopez-Paz, D., and Ranzato, M. A. 2017. Gradient episodic memory for continual learning. In NIPS.

[Loula, Baroni, and Lake 2018] Loula, J.; Baroni, M.; and Lake, B. 2018. Rearranging the familiar: Testing compositional generalization in recurrent networks. Technical report, arXiv.

[MacLennan 1999] MacLennan, B. J. 1999. Principles of Programming Languages (3rd ed.): Design, Evaluation, and Implementation. New York, NY: Oxford University Press.

[McCloskey and Cohen 1989] McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24:109–165.

[Mishra et al. 2018] Mishra, N.; Rohaninejad, M.; Chen, X.; and Abbeel, P. 2018. A simple neural attentive meta-learner. In International Conference on Learning Representations.

[Mozer and Fan 2008] Mozer, M. C., and Fan, A. 2008. Top-down modulation of neural responses in visual perception: a computational exploration. Natural Computing 7(1):45–55.

[Newell and Rosenbloom 1980] Newell, A., and Rosenbloom, P. S. 1980. Mechanisms of skill acquisition and the law of practice. In Anderson, J. R., ed., Cognitive Skills and Their Acquisition. Hillsdale, NJ: Erlbaum.

[Nguyen et al. 2018] Nguyen, C. V.; Li, Y.; Bui, T. D.; and Turner, R. E. 2018. Variational continual learning. In ICLR.

[Osgood 1948] Osgood, C. E. 1948. An investigation into the causes of retroactive interference. Journal of Experimental Psychology 38(2):132–154.

[Parisi et al. 2019] Parisi, G. I.; Kemker, R.; Part, J. L.; Kanan, C.; and Wermter, S. 2019. Continual lifelong learning with neural networks: A review. Neural Networks 113:54–71.

[Paszke et al. 2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS.

[Perez et al. 2018] Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; and Courville, A. 2018. FiLM: Visual reasoning with a general conditioning layer. In AAAI.

[Ruder 2017] Ruder, S. 2017. An overview of multi-task learning in deep neural networks. Technical report, arXiv.

[Santoro et al. 2017] Santoro, A.; Raposo, D.; Barrett, D. G. T.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 4967–4976. Curran Associates, Inc.

[Schmidhuber 1987] Schmidhuber, J. 1987. Evolutionary Principles in Self-Referential Learning. Ph.D. dissertation, TU Munich.

[Schwarz et al. 2018] Schwarz, J.; Luketina, J.; Czarnecki, W. M.; Grabska-Barwinska, A.; Teh, Y. W.; Pascanu, R.; and Hadsell, R. 2018. Progress & Compress: A scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4528–4537.

[Taylor and Rohrer 2010] Taylor, K., and Rohrer, D. 2010. The effects of interleaved practice. Applied Cognitive Psychology 24(6):837–848.

[Thrun 1996] Thrun, S. 1996. Is learning the n-th thing any easier than learning the first? In Touretzky, D. S.; Mozer, M. C.; and Hasselmo, M. E., eds., Advances in Neural Information Processing Systems 8, 640–646. MIT Press.

[Wang et al. 2017] Wang, J. X.; Kurth-Nelson, Z.; Tirumala, D.; Soyer, H.; Leibo, J. Z.; Munos, R.; Blundell, C.; Kumaran, D.; and Botvinick, M. 2017. Learning to reinforcement learn. In CogSci.

[Weinstein, Madan, and Sumeracki 2018] Weinstein, Y.; Madan, C. R.; and Sumeracki, M. A. 2018. Teaching the science of learning. Cognitive Research: Principles and Implications 3(1):2.

[Wixted and Carpenter 2007] Wixted, J. T., and Carpenter, S. K. 2007. The Wickelgren power law and the Ebbinghaus savings function. Psychological Science 18:133–134.

[Zenke, Poole, and Ganguli 2017] Zenke, F.; Poole, B.; and Ganguli, S. 2017. Continual learning through synaptic intelligence. In ICML.


Supplementary materials

Feature values

The ten colors used in our experiments are: gray, red, blue, green, brown, purple, magenta, yellow, orange, pink. The ten shapes are: cube, sphere, cylinder, pyramid, cone, torus, rectangular box, ellipsoid, octahedron, dodecahedron. And the ten textures are: metal, rubber, chainmail, marble, maze, metal weave, polka dots, rug, bathroom tiles, wooden planks. See additional example images below:

Figure 8: Additional example training images


Results by dimension

To justify our collapsing of the results across dimensions, we provide the results broken down for each individual dimension below. Figure 9 depicts the trials required to reach the accuracy criterion, with Figure 9g,h reproducing Figure 3c,d and the rest of the subfigures offering the results for replications within each dimension. While colors are easier to learn than shapes or textures, simulations in all three dimensions show the same qualitative features. Similarly, Figure 10 depicts the accuracy after a fixed small amount of training, with Figure 10g,h reproducing Figure 3e,f. These results provide further evidence for the ease of learning color compared to the other two dimensions, but the qualitative similarity remains striking.


Figure 9: (a, c, e, g) Number of trials required to reach the accuracy criterion (log scale) as a function of the number of times a given task is trained (also log scale). The colored lines indicate task ordinal position (cyan = introduced in episode 1; magenta = introduced in episode 10). (b, d, f, h) Number of trials required to reach the accuracy criterion (log scale) as a function of the episode number. The colored lines indicate the number of times a task was retrained (black = 1 time, copper = 10 times). In all panels, the shaded region represents ±1 standard error of the mean.


Figure 10: (a, c, e, g) Accuracy after a fixed amount of training (22,500 trials) as a function of the number of times a given task is trained (log scale). The colored lines indicate task ordinal position (cyan = introduced in episode 1; magenta = introduced in episode 10). (b, d, f, h) Accuracy after the same fixed amount of training as a function of the episode number. The colored lines indicate the number of times a task was retrained (black = 1 time, copper = 10 times). In all panels, the shaded region represents ±1 standard error of the mean.


Task-modulated processing at different levels

All figures reported below combine replications in all three dimensions: for each modulation level we performed thirty simulations in each dimension, yielding ninety simulations in total per modulation level. In Figure 11, we provide the results plotted in Figure 5a-b for task modulation at each convolutional layer (separately). In Figure 12, we provide equivalent plots to Figure 2e-f for the task-modulated models. In Figure 13, we provide equivalent plots to Figure 5c-d for the task-modulated models. The only anomaly we observe is in Figure 13 for task modulation at the second convolutional layer, where the eighth and ninth tasks appear easier to learn for the first time without task modulation. Save for this anomaly, we observed remarkably consistent results between the different modulation levels, and hence we reported a single one rather than expanding on all four.


Figure 11: Top panels: Number of trials required to reach the accuracy criterion (log scale) as a function of the number of times a given task is trained (also log scale). The colors indicate task ordinal position (the episode in which a task is introduced; cyan = introduced in episode 1; magenta = introduced in episode 10). Bottom panels: Similar to the top panels, but graphed as a function of episode number, with the line colors indicating the number of times a task is retrained (black = 1 time, copper = 10 times).


Figure 12: Top panels: Hold-out accuracy attained after a fixed amount of training (22.5k trials) of a given task, graphed as a function of the number of times a given task is trained. As in Figure 11, the colors indicate task ordinal position (the episode in which a task is introduced; cyan = introduced in episode 1; magenta = introduced in episode 10). Bottom panels: Similar to the top panels, but graphed as a function of episode number, with the line colors indicating, as in Figure 11, the number of times a task is retrained (black = 1 time, copper = 10 times).


Figure 13: Top panels: Increase in number of trials required to reach the accuracy criterion for non-task-modulated versus task-modulated architectures, as a function of the number of times a given task is trained (also log scale). The colors indicate task ordinal position (the episode in which a task is introduced; cyan = introduced in episode 1; magenta = introduced in episode 10). Bottom panels: Similar to the top panels, but graphed as a function of episode number, with the line colors indicating the number of times a task is retrained (black = 1 time, copper = 10 times).


MAML comparison supplement

We compared our baseline model to two versions of MAML. Both utilized the training procedure we describe under the ‘Comparison to MAML’ section. The first, reported in the middle column below, only utilized this procedure in training, and was tested without the meta-testing step. In other words, this model was tested exactly as our baseline model was tested, to see if MAML manages to learn representations that allow it to answer questions on unseen images without further adaptation. The second version, which we ended up reporting, also utilizes the micro-episode procedure at test time, making train and test identical. The results below demonstrate similar qualitative behavior between our baseline and both versions. However, as the second version, using the meta-testing procedure, fares better, we opt to report it in the submission.


Figure 14: MAML comparison. The left column plots results from our baseline condition. The middle column offers results from a version of MAML that did not follow the micro-episode procedure at test time, that is, did not meta-test. The right column, corresponding to the model reported in the submission, follows the micro-episode procedure we describe at both train and test.


Comparison to simultaneous learning

We performed a systematic comparison between our sequential method of training and the standard supervised-learning approach of training on all tasks simultaneously. We know that sequential training is beneficial to humans: every course covers one topic at a time, rather than presenting the entire textbook and mixing all topics from day one. There is also ample evidence for the value of curricular approaches in machine learning, going as far back as Elman (1993). However, curricula in machine learning usually attempt to scaffold tasks from smaller to larger, or from easier to harder, following some difficulty gradient. Our results in Figure 15 suggest, surprisingly, that a randomly chosen sequential curriculum (that is, a random task-introduction ordering) can significantly speed up learning in some cases. Interestingly, this effect varies by dimension. While in the shape condition simultaneous learning is competitive with sequential training, in both the texture and color conditions sequential training proceeds much faster. In those cases, the number of training trials required to learn each task when trained sequentially (the cyan-to-magenta curves) is far less than the number of trials required to learn each task when trained simultaneously (the red curve). That is, task n+1 is learned far faster following tasks 1–n than simultaneously with tasks 1–10. The long plateau in the color and texture cases suggests some form of initial representation learning that is made more efficient by learning sequentially rather than simultaneously.

[Figure 15 panels (color, shape, texture): each plots hold-out set accuracy, 0.6–1.0, against Log(number of training trials), 64k–2048k.]

Figure 15: Simultaneous vs. sequential training. The cyan (first) to magenta (last) colored lines plot the accuracy after some number of training trials for each task the model learned. The average accuracy over all ten tasks, when learned simultaneously, is plotted in red. To make the comparisons valid, the simultaneous-training curve is plotted against the number of training trials for each task, rather than combined across all tasks.