
Self-Net: Lifelong Learning via Continual Self-Modeling

Blake Camp¹*, Jaya Krishna Mandivarapu¹*, and Rolando Estrada¹ [0000-0003-1607-2618]

Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
{bcamp2,jmandivarapu1}@student.gsu.edu, [email protected]

Abstract. Learning a set of tasks over time, also known as continual learning (CL), is one of the most challenging problems in artificial intelligence. While recent approaches achieve some degree of CL in deep neural networks, they either (1) grow the network parameters linearly with the number of tasks, (2) require storing training data from previous tasks, or (3) restrict the network's ability to learn new tasks. To address these issues, we propose a novel framework, Self-Net, that uses an autoencoder to learn a set of low-dimensional representations of the weights learned for different tasks. We demonstrate that these low-dimensional vectors can then be used to generate high-fidelity recollections of the original weights. Self-Net can incorporate new tasks over time with little retraining and with minimal loss in performance for older tasks. Our system does not require storing prior training data, and its parameters grow only logarithmically with the number of tasks. We show that our technique outperforms current state-of-the-art approaches on numerous datasets, including continual versions of MNIST, CIFAR10, CIFAR100, and Atari, and we demonstrate that our method can achieve over 10X storage compression in a continual fashion. To the best of our knowledge, we are the first to use autoencoders to sequentially encode sets of network weights to enable continual learning.

Keywords: Continual learning · Deep learning · Autoencoders.

1 Introduction

Lifelong or continual learning (CL) is one of the most challenging problems in machine learning, and it remains a significant hurdle in the quest for artificial general intelligence (AGI) [2,5]. In this paradigm, a single system must learn to solve new tasks without forgetting previously learned information. Different tasks might require different data (e.g., images vs. text) or they might process the same data in different ways (e.g., classifying an object in an image vs. segmenting it). Crucially, in CL there is no point at which a system stops learning; it must always be able to update its representation of its problem domain(s).

* Both authors contributed equally.

arXiv:1805.10354v3 [cs.LG] 12 Jul 2019


Fig. 1. Framework overview: Our proposed system has a set of reusable task-specific networks (TN), a Buffer for storing the latest m tasks, and a lifelong autoencoder (AE) for long-term storage. Given new tasks {t_{k+1}, ..., t_{k+m}}, where k is the number of tasks previously encountered, we first train m task-networks to learn optimal parameters {θ_{k+1}, ..., θ_{k+m}} for these tasks. These networks are temporarily stored in the Buffer. When the Buffer fills up, we incorporate the new networks into our long-term representation by retraining the AE on both its approximations of previously learned networks and the new batch of networks. When an old network is needed (e.g., when a task is revisited), we reconstruct its weights and load them onto the corresponding TN (solid arrow). Even when the latent representation eᵢ is asymptotically smaller than θᵢ, the reconstructed network closely approximates the performance of the original.

CL is particularly challenging for deep neural networks because they are trained end-to-end. In standard deep learning, we tune all of the network's parameters based on training data, usually via backpropagation [20]. While this paradigm has proven highly successful for individual tasks, it is not suitable for continual learning because it overwrites existing weights (a phenomenon evocatively dubbed catastrophic forgetting [19]). For example, if we first train a network on task A and then on task B, the latter training will modify the weights learned for A, thus likely reducing the network's performance on this task.

There are several approaches that can achieve some degree of continual learning in deep networks. However, existing methods suffer from at least one of three limitations: they either (1) restrict the network's ability to learn new tasks by penalizing changes to existing weights [7,27,22,14]; (2) expand the model size linearly as the number of tasks grows [21,10] (or dynamically define task-specific sub-networks [26,11], which is asymptotically equivalent); or (3) retrain on old tasks. In the latter case, we either (a) store some of the old training data directly [12,17,14], thus increasing storage requirements linearly (and at a faster rate than growing the network, since data tends to be higher dimensional), or (b) use compressed data [16,6,23,25], which complicates training.

In this paper, we propose a novel approach, Self-Net, that overcomes the aforementioned limitations by decoupling how it learns a new task from how it stores it. Figure 1 provides an overview of our proposed framework.


Our system grows only logarithmically with the number of tasks, while retaining excellent performance across all learned tasks. Our approach is loosely inspired by the role that the hippocampus is purported to play in memory consolidation [24]. As noted in [15], during learning the brain forms an initial neural representation in cortical regions; the hippocampus then consolidates this representation into a form that is optimized for storage and retrieval. These complementary biological mechanisms enable continual learning by efficiently consolidating knowledge and prior experiences. In this spirit, we propose a system that consists of three components: (1) a set of reusable task-networks (TNs), (2) a Buffer in which we store the latest m learned weights exactly, and (3) a lifelong autoencoder (AE) with which we can encode an arbitrary number of older tasks. The AE learns a low-dimensional representation for each of the high-dimensional parameter vectors that define the weights of the TNs. Thus, our system self-models its own behavior, allowing it to approximate previously learned parameters instead of storing them directly. In short, when our system learns a new task, it first trains an appropriate TN using standard deep learning and then stores a copy of the weights in the Buffer. When the Buffer fills up, the AE learns a set of compact, latent vectors for the weights in the Buffer. The Self-Net then discards the original weights, freeing up the Buffer to store new tasks. If our system needs to solve a previously learned task, it generates an approximation of the original weights by feeding the corresponding latent vector through the AE and then loading the reconstructed weights onto a TN.

Our approach leverages the flexibility of conventional neural networks while avoiding their inability to remember old tasks. More specifically, a TN is free to modify its parameters as needed to learn a new task, since previously learned weights are encoded by the AE. Our AE does not simply memorize old weights; our experiments show that an AE can encode a very large number of networks while retaining excellent performance on all tasks (Section 4.3). Our framework can even incorporate fine-tuning by initializing a TN with the weights from a previous, related task. Below, we first review existing CL methods for deep networks and then detail our approach in Section 3.

2 Prior work

Several methods have recently emerged for continual learning in deep networks, although, as noted above, existing approaches either (1) restrict new learning, (2) grow the number of parameters linearly, or (3) require old training data. Notable examples of the first type include Elastic Weight Consolidation (EWC) [7], Synaptic Intelligence [27], Variational Continual Learning [14] (which also reuses old data), and Progress & Compress [22]. These approaches reuse the same network for each new task, but they apply a regularization method to restrict changes in weights over time. Hence, they typically use constant space¹.

¹ Although, as noted in [4], standard EWC stores an O(n) set of Fisher weights for each task, so it actually grows linearly. The modified version proposed in [4] does use constant space.


EWC, in particular, uses the diagonal of the Fisher information matrix between the weights learned for the new task vs. the old tasks. Like our proposed approach, Progress & Compress also uses both a task-network and a long-term storage network; however, it uses EWC to update the weights of the latter, so it performs very similarly to this first method.

The second category includes Progressive Networks [21], Dynamically Expandable Networks [26], and Context-Dependent Gating [11]. These methods achieve excellent performance, but they grow the network linearly with the number of tasks, which is asymptotically the same as using independent networks. Thus, they cannot scale to large numbers of tasks. Their advantage lies in facilitating transfer learning, i.e., using previous learning to speed up new learning.

Finally, some methods store a fraction of the old training data and use it to retrain the network on previously learned tasks. Key approaches include Experience Replay [12], iCaRL [17], Variational Continual Learning [14], and Learning without Forgetting [10]. Unfortunately, this paradigm combines the drawbacks of the previous two. First, most of these methods use a single network, so they cannot continually learn a large number of tasks well. Second, their storage requirements grow linearly in the number of tasks because they have to store old training data. Moreover, data usually takes up orders of magnitude more space than the network itself because a trained network is effectively a compressed representation of the training set [1]. A few methods reduce this storage requirement by storing a compressed representation of the data. Methods of this type include Lifelong Generative Modeling [16], FearNet [6], and Deep Generative Replay [23]. Our proposed approach uses a similar idea but stores the networks themselves, rather than the data. Our scheme has two advantages over compressing the data. First, networks are much smaller, so we can encode them more quickly, using less space. Second, by reconstructing the networks directly, we do not need to retrain task-networks on data from previous tasks.

3 Methodology

Figure 1 provides a high-level overview of our proposed approach. Our Self-Net system uses a set of reusable task-networks (TNs), a Buffer for storing newly learned tasks, and a lifelong autoencoder (AE) for storing older tasks. In addition, we store an O(log n) latent vector for each task. Each TN is just a standard neural network, which can learn regression, classification, or reinforcement-learning tasks (or some combination of the three). For ease of discussion, we will focus on the case where there is a single TN and the Buffer can hold only one network; the extension to multiple networks and larger Buffers is trivial. The AE is made up of an encoder that compresses an input vector into a lower-dimensional, latent vector e and a decoder that maps e back to the higher-dimensional space. Our system can produce high-fidelity recollections of the learned weights, despite this intermediate compression. In our experiments, we used a contractive autoencoder (CAE) [18] due to its ability to quickly incorporate new values into its latent space.
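To make this concrete, below is a minimal PyTorch sketch of such a CAE, sized for the flattened MNIST task-networks of Section 4.2 (layers of 21,840, 2000, and 20 units); the sigmoid activations, the penalty weight, and the choice to penalize only the first encoder layer are our illustrative assumptions, not a specification of the exact model we trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContractiveAE(nn.Module):
    """Sketch of a contractive autoencoder [18] for flattened network
    weights. Layer sizes mirror the MNIST setup (21840 -> 2000 -> 20);
    activation choices are assumptions."""
    def __init__(self, n_inputs=21840, n_hidden=2000, n_latent=20):
        super().__init__()
        self.enc1 = nn.Linear(n_inputs, n_hidden)
        self.enc2 = nn.Linear(n_hidden, n_latent)
        self.dec1 = nn.Linear(n_latent, n_hidden)
        self.dec2 = nn.Linear(n_hidden, n_inputs)

    def encode(self, x):
        return torch.sigmoid(self.enc2(torch.sigmoid(self.enc1(x))))

    def decode(self, e):
        return self.dec2(torch.sigmoid(self.dec1(e)))  # linear output: weights are unbounded

    def forward(self, x):
        e = self.encode(x)
        return self.decode(e), e

def cae_loss(ae, x, lam=1e-4):
    """Reconstruction MSE plus the contractive penalty: the squared
    Frobenius norm of the first encoder layer's Jacobian, which for a
    sigmoid layer reduces to sum_i (h_i(1-h_i))^2 * sum_j W_ij^2."""
    recon, _ = ae(x)
    h = torch.sigmoid(ae.enc1(x))               # (batch, n_hidden)
    dh2 = (h * (1.0 - h)) ** 2                  # squared sigmoid derivative
    w2 = (ae.enc1.weight ** 2).sum(dim=1)       # (n_hidden,)
    penalty = (dh2 * w2).sum(dim=1).mean()
    return F.mse_loss(recon, x) + lam * penalty
```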


In CL, we must learn k different tasks sequentially. To learn these tasks independently, one would need to train and save k networks, with O(n) parameters each, for a total of O(kn) space. In contrast, we propose using our AE to encode each of these k networks as an O(log n) latent vector. Thus, our method uses only O(n + k log n) space, where the O(n) term accounts for the TNs and the fixed-size Buffer. Despite this compression, our experiments show that we can obtain a high-quality approximation of previously learned weights, even when the number of tasks exceeds the number of parameters in the AE (Sec. 4.3). Below, we first describe how to encode a single task-network before discussing how to encode multiple tasks in a continual fashion.

3.1 Single-network encoding

Let t be a task (e.g., recognizing faces) and let θ be the O(n)-dimensional vector of parameters of a network trained to solve t. That is, using a task-network with parameters θ, we can achieve performance p on t (e.g., a classification accuracy of 95%). Now, let θ̄ be the approximate reconstruction of θ by our autoencoder, and let p̄ be the performance that we obtain by using these reconstructed weights for task t. Our goal is to minimize any performance loss w.r.t. the original weights. If the performance of the reconstructed weights is acceptable, then we can simply store the O(log n) latent vector e, instead of the O(n) original vector θ.

If we had access to the test data for t, we could assess this difference in performance directly and train our AE until we achieve an acceptable margin ε:

p − p̄ ≤ ε.  (1)

For example, for a classification task we could stop training our AE if the drop in accuracy is less than 1%.

In a continual learning setting, though, the above scheme requires storing validation data for each old task. Instead, we measure a distance between the original and reconstructed weights and stop training when we achieve a suitably close approximation. Empirically, we determined that the cosine similarity,

cos(θ, θ̄) = (θ · θ̄) / (‖θ‖ ‖θ̄‖) = (∑ᵢ₌₁ⁿ θᵢθ̄ᵢ) / (√(∑ᵢ₌₁ⁿ θᵢ²) √(∑ᵢ₌₁ⁿ θ̄ᵢ²)),  (2)

is an excellent proxy for a network's performance. Unlike the mean-squared error, this distance metric is scale-invariant, so it is equally suitable for weights of different scales. As detailed in Section 4, a cosine similarity of 0.997 or higher yielded excellent performance for a wide variety of tasks and architectures.
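As a concrete illustration, the sketch below (the helper names are ours) computes this stopping criterion for a pair of weight vectors:

```python
import torch
import torch.nn.functional as F

def flatten_params(net: torch.nn.Module) -> torch.Tensor:
    """Concatenate all of a network's parameters into one flat vector θ."""
    return torch.cat([p.detach().reshape(-1) for p in net.parameters()])

def weight_cosine(theta: torch.Tensor, theta_bar: torch.Tensor) -> float:
    """Cosine similarity between original and reconstructed weights (Eq. 2)."""
    return F.cosine_similarity(theta.unsqueeze(0), theta_bar.unsqueeze(0)).item()

# Retraining of the AE stops once the reconstruction is close enough, e.g.:
# while weight_cosine(theta, theta_bar) < 0.997: keep training the AE.
```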

In addition, one can improve the efficacy with which the AE learns a new task by encouraging the parameters of all task-networks to remain in the same general neighborhood. This can be accomplished by fine-tuning all networks from a common source and penalizing large deviations from this initial configuration with a regularization term.


Formally, let θ∗ be the source parameters, ideally optimized for some highly related task. Without loss of generality, we can define the loss function of task-network θᵢ for task tᵢ as:

TaskNetLossᵢ = TaskLoss + λ MSE(θ∗, θᵢ),  (3)

where λ is the regularization coefficient determining the importance of remaining close to the source parameters vs. optimizing for the current task. By encouraging the parameters of all task-networks to remain close to one another, we make it easier for the AE to learn a low-dimensional representation of the original space.
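A minimal sketch of Eq. (3) in PyTorch follows; the function name is ours, and the default λ = 0.001 is the value we use in Section 4.3:

```python
import torch
import torch.nn.functional as F

def task_net_loss(task_loss: torch.Tensor,
                  net: torch.nn.Module,
                  theta_star: torch.Tensor,
                  lam: float = 0.001) -> torch.Tensor:
    """Eq. (3): the task's own loss plus an MSE pull toward the shared
    source parameters θ*. The parameters are *not* detached here, so the
    penalty contributes gradients during backpropagation."""
    theta_i = torch.cat([p.reshape(-1) for p in net.parameters()])
    return task_loss + lam * F.mse_loss(theta_i, theta_star)
```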

3.2 Continual encoding

We will now detail how to use our Self-Net to encode a sequence of trained networks in a continual fashion. Let m be the size of the Buffer, and let k be the number of tasks that have been previously encountered. As noted above, we train each of these m task-networks using conventional backpropagation, one per task. Now, assume that our AE has already learned to encode the first k task-networks. We will show how to encode the most recent batch of m task-networks, corresponding to tasks {t_{k+1}, ..., t_{k+m}}, into compressed representations {e_{k+1}, ..., e_{k+m}} while still remembering all previously trained networks.

Let E be the set of latent vectors for the first k networks. In order to integrate m new networks {θ_{k+1}, ..., θ_{k+m}} into the latent space, we first recollect all previously trained networks by feeding each e ∈ E as input to the decoder of the AE. We thus generate a set R of recollections, or approximations, of the original networks (see Fig. 1). We then append each network θᵢ in the Buffer to R and retrain the AE on all k + m networks until it can reconstruct them, i.e., until the average of their respective cosine similarities is above the predefined threshold. Algorithm 1 summarizes our continual learning strategy.

As we show in our experiments, our compressed network representations still achieve excellent performance compared to the original parameters. Since each θ ∈ R is simply a vector of network parameters, it can easily be loaded back onto a task-network with the correct architecture. This allows us to discard the original networks and store k networks using only O(k log n) space. In addition, our framework can efficiently encode many different types and sizes of networks in a continual fashion. In particular, we can encode a network of arbitrary size q using a constant-size AE (that takes inputs of size n) by splitting the input network into r subvectors², such that n = q/r, as illustrated in the sketch below. As we verify in Section 4, we can effectively reconstruct a large network from its subvectors and still achieve a suitable performance threshold.
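The following sketch illustrates the split-and-pad bookkeeping; the function names are ours, and the example sizes mirror the CIFAR-10 setup of Section 4.2 (a roughly 60K-parameter network and a 20,442-input AE):

```python
import torch

def split_network(theta: torch.Tensor, n: int):
    """Split a flat parameter vector of arbitrary length q into fixed-size
    subvectors of length n, zero-padding the tail (cf. footnote 2).
    Returns the subvectors and the original length."""
    q = theta.numel()
    r = -(-q // n)                          # ceil(q / n) subvectors
    padded = torch.zeros(r * n, dtype=theta.dtype)
    padded[:q] = theta
    return padded.view(r, n), q

def merge_network(subvectors: torch.Tensor, q: int) -> torch.Tensor:
    """Invert split_network: concatenate the subvectors and drop padding."""
    return subvectors.reshape(-1)[:q]

# e.g., a ~60K-parameter CIFAR network split for a 20,442-input AE:
theta = torch.randn(60_000)
subs, q = split_network(theta, 20_442)      # 3 subvectors
assert torch.equal(merge_network(subs, q), theta)
```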

As Fig. 2 illustrates, we empirically found a strong correlation between a reconstructed network's performance and its cosine similarity w.r.t. the original network. Intuitively, this implies that vectors of network parameters that have a cosine similarity approaching 1 will exhibit near-identical performance on the underlying task.

² We pad with zeros whenever q and n are not multiples of each other.


Algorithm 1 Lifelong Learning via Continual Self-Modeling

1: Let T be the set of all tasks encountered during the lifetime of the system
2: Let m be the size of the Buffer
3: E = []
4: Initialize AE
5: Set cosine_threshold
6: for idx, curr_task in enumerate(T) do
7:     if Buffer is not full then
8:         Initialize TN
9:         Train the TN for curr_task until optimized
10:        Buffer.append(TN)
11:    if Buffer is full then
12:        R = []
13:        for encoded_network in E do
14:            r = AE.Decoder(encoded_network)
15:            R.append(r)
16:        for network in Buffer do
17:            flat_network = extract and flatten parameters from network
18:            R.append(flat_network)
19:        average_cosine_similarity = 0.0
20:        E = []
21:        while average_cosine_similarity < cosine_threshold do
22:            for r_idx, r in enumerate(R) do
23:                Calculate the AE reconstruction loss on r
24:                Backpropagate the AE w.r.t. r
25:                Update average_cosine_similarity using cos(r, AE(r))
26:                E[r_idx] = AE.Encoder(r)
27:        Empty the Buffer
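For concreteness, here is a hedged PyTorch sketch of the consolidation step (lines 11-27 of Algorithm 1). It reuses flatten_params and weight_cosine from the earlier sketch and, for brevity, substitutes a plain MSE reconstruction loss for the full CAE loss:

```python
import torch
import torch.nn.functional as F

def consolidate(ae, buffer_nets, E, threshold=0.997, max_epochs=10_000):
    """Recollect old networks from their latent codes (E is a list of 1D
    latent tensors), append the Buffer's new networks, and retrain the AE
    until the mean cosine similarity clears the threshold. `ae` exposes
    encode/decode/forward as in the CAE sketch above."""
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    with torch.no_grad():
        R = [ae.decode(e) for e in E]                    # recollections
    R += [flatten_params(net) for net in buffer_nets]    # new exact weights

    for _ in range(max_epochs):
        sims = []
        for r in R:
            recon, _ = ae(r.unsqueeze(0))
            loss = F.mse_loss(recon, r.unsqueeze(0))
            opt.zero_grad()
            loss.backward()
            opt.step()
            sims.append(weight_cosine(r, recon.squeeze(0).detach()))
        if sum(sims) / len(sims) >= threshold:
            break
    with torch.no_grad():
        return [ae.encode(r.unsqueeze(0)).squeeze(0) for r in R]  # new E
```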

Thus, the cosine similarity can be used as a terminating condition during retraining of the AE. That is, there exists a cosine-similarity threshold above which the performance of the reconstructed network can be expected to be sufficiently similar to that of the original. In practice, we found a threshold of 0.997 to be sufficient for most experiments. Below, we offer empirical results that demonstrate the efficacy and flexibility of our approach.

4 Experimental Results

In order to evaluate the continual-learning performance of Self-Net, we carried out a range of experiments on a variety of datasets, in both supervised and reinforcement-learning (RL) settings. We first performed a robustness analysis to establish the degree to which an approximation of a network can deviate from the original and still retain comparable performance on the underlying task (Section 4.1). Then, we evaluated the performance of our approach on the following continual-learning datasets: Permuted MNIST [7], Split MNIST [14], Split CIFAR-10 [27], Split CIFAR-100 [27], and successive Atari games [12] (we describe each dataset below).


Fig. 2. Robustness analysis of network performance as a function of cosine similarity: Each dot represents the accuracy of a reconstructed network, and the dotted lines are the baseline performances of the original networks. The values for three datasets (Permuted MNIST in pink, MNIST in cyan, and CIFAR-10 in blue) show that cosine similarity values above 0.997 guarantee nearly optimal performance.

As our experiments show, Self-Net can encode each of these different types of networks in sequential fashion, effectively achieving continual learning and outperforming several competing techniques. Finally, we also analyzed our system's performance under three additional scenarios: (1) very large numbers of tasks, (2) different sizes of AEs, and (3) different task-network architectures. We detail each experiment below.

4.1 Robustness analysis

Our approach relies upon approximations of previously learned networks, and we assume no access to validation data for previously learned tasks. Thus, we require a method for estimating the performance of a reconstructed network that does not rely upon explicit testing on a validation set.

Figure 2 shows the relationship between performance and deviations from the original parameters, as measured by cosine similarity, for three datasets. There is a clear correlation between the amount of parameter dissimilarity and the probability of a decrease in performance. That is, given an approximate network that deviates from the original by some amount, the potential still exists that such a network will retain comparable performance. However, as the degree of deviation increases, the probability that the performance remains high falls steadily. Thus, in order to assume, with reasonable confidence, that the performance of a reconstructed network will be sufficiently high, the AE must minimize the degree of deviation as much as possible.

Empirically, we established a cosine similarity threshold above which the probability of high task-performance stabilizes, as seen in Figure 2.


This threshold can be used as a terminating condition during retraining of the AE, and it allows the performance of a reconstructed network to be approximated without access to any validation data. In our experiments, a common threshold yields good performance across a variety of different types and sizes of networks.

4.2 Experiments on CL datasets

Permuted MNIST: As an initial evaluation of Self-Net's CL performance, we trained convolutional feed-forward neural networks with 21,840 parameters on successive tasks, each defined by a distinct permutation of the MNIST dataset [9], for 10-digit classification. We used networks with 2 convolutional layers (kernels of size 5x5 and stride 1x1), 1 hidden layer (320x50), and 1 output layer (50x10). Our CAE had three fully connected layers with 21,840, 2000, and 20 units, respectively. Thus, our latent vectors were of size 20. For this experiment, we used a Buffer of size 1. Each task-network was encoded by our lifelong AE in sequential fashion, and the accuracies of all reconstructed networks were examined at the end of each learning stage (i.e., after learning a new task). Figure 3 (top) shows the mean performance after each stage. Our technique almost perfectly matched the performances achieved by independently trained networks, and it dramatically outperformed other state-of-the-art approaches, including EWC [7], Online EWC (the correction to EWC proposed in [4]), and Progress & Compress [22]. As baselines, we also show the results for SGD (no regularization), L2-based regularization, in which we compare new weights to all the previous weights, and Online L2, which only measures deviations from the weights learned in the previous iteration. Not only does our technique allow for superior knowledge retention, but it does not inhibit the knowledge acquisition necessary for new tasks. The result is minimal degradation in performance as the number of tasks grows.
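For reference, the sketch below reproduces this task-network in PyTorch; the 10 and 20 convolutional channels are our inference, chosen because they match the stated 21,840-parameter count exactly:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskNet(nn.Module):
    """MNIST task-network: two 5x5 conv layers, a 320->50 hidden layer,
    and a 50->10 output layer. Channel counts (10, 20) are assumed."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 260 params
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 5,020 params
        self.fc1 = nn.Linear(320, 50)                  # 16,050 params
        self.fc2 = nn.Linear(50, n_classes)            # 510 params

    def forward(self, x):                              # x: (batch, 1, 28, 28)
        x = F.relu(F.max_pool2d(self.conv1(x), 2))     # -> (batch, 10, 12, 12)
        x = F.relu(F.max_pool2d(self.conv2(x), 2))     # -> (batch, 20, 4, 4)
        x = x.view(x.size(0), -1)                      # -> (batch, 320)
        return self.fc2(F.relu(self.fc1(x)))

net = TaskNet()
assert sum(p.numel() for p in net.parameters()) == 21840
```

Flattening this network with flatten_params from the earlier sketch yields exactly the 21,840-dimensional vectors that the CAE consumes.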

Split MNIST: We performed a similar continual-learning experiment but with different binary classification objectives on subsets of the MNIST dataset (Split MNIST) [14]. Our task-networks, CAE, and Buffer size were the same as for Permuted MNIST (except that the outputs of the task-networks were binary, instead of 10 classes). Tasks were defined by tuples comprised of the positive and negative digit class(es), e.g., ([pos={1}, neg={6,7,8,9}], [pos={6}, neg={1,2,3,4}], etc.). Here, the training and test sets consisted of approximately 40% positive examples and 60% negative examples. In this domain, too, our technique dramatically outperformed competing approaches, as seen in Figure 3 (middle).

Split CIFAR-10: We then verified that our proposed approach can reconstruct larger, more sophisticated networks. Similar to the Split MNIST experiments above, we divided the CIFAR-10 dataset [8] into multiple training and test sets, yielding 10 binary classification tasks (one per class). We then trained a task-specific network on each class. Here, we used TNs with an architecture consisting of 2 convolutional layers, followed by 3 fully connected hidden layers and a final layer with 2 output units. In all, these task-networks consisted of more than 60K parameters. Again, for this experiment we used a Buffer of size 1. Our CAE had three fully connected layers with 20442, 1000, and 50 units, respectively. As noted below, we split the 60K-parameter networks into three subvectors to encode them with our autoencoder.


Fig. 3. CL performance comparisons with average test set accuracy on all observed tasks at each stage for (top) Permuted MNIST, (middle) Split MNIST, and (bottom) Split CIFAR-10.


The individual task-networks achieved accuracies ranging from 78% to 84%, with a mean accuracy of approximately 81%. Importantly, we encoded these larger networks using almost the same CAE architecture as the one used in the MNIST experiments. This was achieved by splitting the 60K parameter vectors into three subvectors. As noted in Section 3, by splitting a larger input vector into smaller subvectors, we can encode networks of arbitrary sizes. As seen in Figure 3 (bottom), the accuracies of the reconstructed CIFAR networks also nearly matched the performances of their original counterparts, while outperforming all other techniques.

Fig. 4. CL performance comparisons with average test set accuracy on all observed tasks at each stage for CIFAR-100.

Split CIFAR-100: We applied the same learning approach to the CIFAR-100 dataset [8]. We split the dataset into 10 distinct batches comprised of 10 classes of images each. This resulted in 10 separate datasets, each designed for a 10-class classification task. We used the same task-network architecture and Buffer size as in our CIFAR-10 experiments, modified slightly to accommodate a 10-class classification objective. The trained networks achieved accuracies ranging from 46% to 59%. We then encoded these networks using the same CAE architecture described in the previous experiments, again accounting for the input-size discrepancy by splitting the task-networks into smaller subvectors. As seen in Figure 4, our technique almost perfectly matched the performances achieved by independently trained networks.

Incremental Atari: To evaluate the CL performance of Self-Net in the challenging context of reinforcement learning, we used the code available at [3] to implement a modified Asynchronous Advantage Actor-Critic (A3C) framework, originally introduced in [13], to learn successive Atari games while retaining good performance across all games. A3C simultaneously learns a policy and a value function for estimating expected future rewards. Specifically, the model we used was comprised of 4 convolutional layers (kernels of size 3x3 and strides of size 2x2), a GRU layer (800x256), and two output layers: an Actor (256xNum_Actions) and a Critic (256x1), resulting in a complex model architecture with over 800K parameters.


Fig. 5. CL on five Atari games with Self-Net: To evaluate the reconstruction score at each stage, we ran the reconstructed networks for 80 full game episodes. The cumulative mean score is nearly identical to that of the original TN at each stage.

Critically, this entire model can be flattened and encoded by the single AE in our Self-Net framework, which has three fully connected layers with 76863, 2000, and 200 units, respectively. For these experiments, we also used a Buffer of size 1.

Similar to previous experiments, we trained our system on successive tasks, specifically the following Atari games: Boxing, Star Gunner, Kangaroo, Pong, and Space Invaders. Figure 5 shows the near-perfect retention of performance on each of the 5 games over the lifetime of the system. This was accomplished by training on each game only once, never revisiting the game for training purposes. The dashed, vertical lines demarcate the different stages of continual learning. That is, each stage indicates that a new network was trained for a new game, over 40M frames. Afterwards, the mean (dashed, horizontal black lines) and standard deviation (solid, horizontal black lines) of the network's performance were computed by allowing it to play the game, unrestricted, for 80 episodes. After each stage, the performances of all reconstructed networks were examined by re-playing each game with the appropriate reconstructed network. As Figure 5 shows, the cumulative means and SDs of the reconstructed networks closely mimic those achieved by their original counterparts.

4.3 Performance and storage scalability

In CL, there is a trade-off between storage and performance. Using different networks for k tasks yields optimal performance but uses O(kn) space, while regularized methods such as Online EWC only require O(n) space but suffer a steep drop in performance as the number of tasks grows.


Our experiments on CL datasets show that our approach achieves much better performance retention than existing approaches by using slightly more space, O(n + k log n). More precisely, we can quantify performance with respect to implicit storage compression. For example, by the tenth task, Online EWC [4] has essentially performed 10X compression because it uses 1/10th of the overall storage required by ten different networks; however, its performance by this point is very poor. In contrast, our system achieves 10X compression when the size of the stored latent vectors grows to 10n. In the following experiments, we verified that our method retains excellent performance even when reaching 10X compression, thus confirming that our AE is not simply memorizing previously learned weights.

Fig. 6. 10X compression for Split-MNIST: Orange lines denote the average accuracy achieved by individual networks, one per task. Green lines denote the average accuracy when training the AE to encode all networks as a single batch. Blue lines indicate the average accuracy obtained by Self-Net at each CL stage. Top: 50 tasks with latent vectors of size 5 and a Buffer of size 5. Middle: 100 tasks with latent vectors of size 10 and a Buffer of size 10. The x-axis (top and middle) denotes the compression factor achieved at each learning stage. Bottom: the number of training epochs required by the 5-dimensional AE to incorporate new networks decreases rapidly over time.


The top two plots of Fig. 6 show the mean performance for 50 and 100 Split-MNIST tasks, with latent vectors of size 5 and 10, respectively. As before, the AE had 21432 input parameters. For comparison, we also plotted the original networks' performance and the performance of the reconstructions when the AE learns all the tasks in a single batch. The line with dots represents the CL system, where each dot indicates the point at which the AE had to encode a new set of m networks because the Buffer had filled up. For these experiments, we used Buffer sizes m of 5 and 10, respectively; these values were chosen so that each new batch of networks yielded an integer compression ratio, e.g., after encoding 15 networks with a latent vector of size 5, the Self-Net achieved 3X compression. Here, we fine-tuned all networks from the mean of the initial set of m networks and penalized deviations from this source vector (using λ = 0.001), as described in Section 3. This regularization allowed the AE to incorporate subsequent networks with very little additional training, as seen in stages 4-10 (bottom image of Fig. 6).

For 10X compression, the Self-Net with latent vectors of size 5 retained ~95.7% average performance across 50 Split-MNIST tasks, while the Self-Net with 10-dimensional latent vectors retained ~95.2% across 100 tasks. This represents a relative change of only ~3.3% compared to the original performance of ~99%. In contrast, existing methods dropped to ~50% performance for 10X compression on this dataset (Fig. 3).

4.4 Splitting networks and using multiple architectures

Splitting larger networks into smaller subvectors allows us to use a smaller AE. As an additional analysis, we verified that the smaller AE can be trained in substantially less time than a larger one. Figure 7 (left) shows the training rate of an AE with 20,000 input units (blue line), trained to reconstruct 3 subvectors of length 20,000, compared to that of a larger AE with 61,000 input units (yellow line), trained on a 60K CIFAR-10 network. Clearly, splitting a large network across a smaller AE enables us to encode it more quickly. Finally, we also validated that the same AE can be used to encode trained networks of different sizes and architectures. Figure 7 (right) shows that the same AE can simultaneously reconstruct 5 MNIST networks and 1 CIFAR network such that all approach their original baseline accuracies.

5 Conclusions and future work

In this paper, we introduced a scalable approach for multi-context continual learning that decouples learning a set of parameters from storing them for future use. Our proposed framework makes use of state-of-the-art autoencoders to facilitate lifelong learning via continual self-modeling. Our empirical results confirm that our method can efficiently acquire and retain knowledge in a continual fashion, even for very large numbers of tasks. In future work, we plan to improve the efficiency with which the autoencoder can continually model vast numbers of task networks.


Fig. 7. Additional analyses: Left: AE training efficiency is improved when large networks are split into smaller subvectors. Right: a single AE can encode networks of different architectures and sizes.

Furthermore, we will explore how to use the latent space to extrapolate to new tasks based on existing learned tasks, with little or no training data. We also intend to compress the latent space even further (e.g., using only log(k) latent vectors for k tasks). Promising approaches include clustering the latent vectors into sets of closely related tasks and using sparse latent representations. Finally, we will also investigate how to infer the current task automatically, without a task label.

References

1. Doersch, C.: Tutorial on Variational Autoencoders. arXiv e-prints (Jun 2016)
2. Goodfellow, I.J., Mirza, M., Xiao, D., Courville, A., Bengio, Y.: An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211 (2013)
3. Greydanus, S.: baby-a3c. https://github.com/greydanus/baby-a3c (2017)
4. Huszar, F.: Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences 115(11), E2496–E2497 (2018). https://doi.org/10.1073/pnas.1717042115
5. Kemker, R., Abitino, A., McClure, M., Kanan, C.: Measuring catastrophic forgetting in neural networks. CoRR abs/1708.02072 (2018)
6. Kemker, R., Kanan, C.: FearNet: Brain-inspired model for incremental learning. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=SJ1Xmf-Rb
7. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., Hadsell, R.: Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114(13), 3521–3526 (2017). https://doi.org/10.1073/pnas.1611835114
8. Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
9. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (Nov 1998). https://doi.org/10.1109/5.726791
10. Li, Z., Hoiem, D.: Learning without forgetting. In: ECCV (2016)
11. Masse, N.Y., Grant, G.D., Freedman, D.J.: Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. CoRR abs/1802.01569 (2018), http://arxiv.org/abs/1802.01569
12. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv e-prints (Dec 2013)
13. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783 (2016), http://arxiv.org/abs/1602.01783
14. Nguyen, C.V., Li, Y., Bui, T.D., Turner, R.E.: Variational continual learning. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=BkQqq0gRb
15. Preston, A., Eichenbaum, H.: Interplay of hippocampus and prefrontal cortex in memory. Current Biology 23(17), R764–R773 (2013). https://doi.org/10.1016/j.cub.2013.05.041
16. Ramapuram, J., Gregorova, M., Kalousis, A.: Lifelong Generative Modeling. arXiv e-prints (May 2017)
17. Rebuffi, S., Kolesnikov, A., Lampert, C.H.: iCaRL: Incremental classifier and representation learning. CoRR abs/1611.07725 (2016), http://arxiv.org/abs/1611.07725
18. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: Explicit invariance during feature extraction. In: Proceedings of the 28th International Conference on Machine Learning. pp. 833–840. ICML'11, Omnipress, USA (2011), http://dl.acm.org/citation.cfm?id=3104482.3104587
19. Robins, A.: Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7(2), 123–146 (1995)
20. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (Oct 1986), http://dx.doi.org/10.1038/323533a0
21. Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R.: Progressive Neural Networks. arXiv e-prints (Jun 2016)
22. Schwarz, J., Luketina, J., Czarnecki, W.M., Grabska-Barwinska, A., Whye Teh, Y., Pascanu, R., Hadsell, R.: Progress & Compress: A scalable framework for continual learning. arXiv e-prints (May 2018)
23. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. CoRR abs/1705.08690 (2017), http://arxiv.org/abs/1705.08690
24. Teyler, T.J., DiScenna, P.: The hippocampal memory indexing theory. Behavioral Neuroscience 100(2), 147–154 (1986). https://doi.org/10.1037/0735-7044.100.2.147
25. Triki, A.R., Aljundi, R., Blaschko, M.B., Tuytelaars, T.: Encoder based lifelong learning. CoRR abs/1704.01920 (2017), http://arxiv.org/abs/1704.01920
26. Yoon, J., Yang, E., Lee, J., Hwang, S.J.: Lifelong learning with dynamically expandable networks. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=Sk7KsfW0-
27. Zenke, F., Poole, B., Ganguli, S.: Improved multitask learning through synaptic intelligence. CoRR abs/1703.04200 (2017), http://arxiv.org/abs/1703.04200