On Catastrophic Interference in Atari 2600 Games
William Fedus * 1 2 Dibya Ghosh * 1 John D. Martin 1 3 Marc G. Bellemare 1 4 Yoshua Bengio 2 5
Hugo Larochelle 1 4
Abstract

Model-free deep reinforcement learning algorithms are troubled with poor sample efficiency – learning reliable policies generally requires a vast amount of interaction with the environment. One hypothesis is that catastrophic interference between various segments within the environment is an issue. In this paper, we perform a large-scale empirical study on the presence of catastrophic interference in the Arcade Learning Environment and find that learning particular game segments frequently degrades performance on previously learned segments. In what we term the Memento observation, we show that an identically parameterized agent, spawned from a state where the original agent plateaued, reliably makes further progress. This phenomenon is general – we find consistent performance boosts across architectures, learning algorithms and environments. Our results indicate that eliminating catastrophic interference can contribute towards improved performance and data efficiency of deep reinforcement learning algorithms.
1. Introduction

Despite many notable successes in deep reinforcement learning (RL), most model-free algorithms require huge numbers of samples to learn effective and useful policies (Mnih et al., 2015; Silver et al., 2016). This inefficiency is a central challenge in deep RL that limits their deployment. Our most successful algorithms rely on environment simulators that permit an unrestricted number of interactions (Bellemare et al., 2013; Silver et al., 2016), and many algorithms are specifically designed to churn through as much data as possible, achieved through parallel actors and hardware
*Equal contribution 1Google Brain 2MILA, Université de Montréal 3Stevens Institute of Technology 4CIFAR Fellow 5CIFAR Director. Correspondence to: William Fedus <[email protected]>.
Code available to reproduce experiments at https://github.com/google-research/google-research/tree/master/memento
accelerators (Nair et al., 2015; Espeholt et al., 2018). The sample efficiency of an algorithm is quantified by the number of interactions required with the environment in order to achieve a certain level of performance. In this paper, we empirically examine the hypothesis that poor sample efficiency is in part due to catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999).
Catastrophic forgetting is a well-known phenomenon across machine learning (Kirkpatrick et al., 2017; Li & Hoiem, 2017; Lopez-Paz & Ranzato, 2017). It is well-established that as deep neural networks learn various tasks, the associated weight changes rapidly impair progress on previous tasks – even if the tasks share no logical connection. However, invoking it as a source of poor sample efficiency is a potentially counter-intuitive explanation because catastrophic forgetting is typically described as emerging across different tasks. Here we show that it also arises within a single environment as we probe across the Arcade Learning Environment (ALE) (Bellemare et al., 2013).
We present the Memento1 observation as the first support of this hypothesis. Here an agent is trained and the state(s) associated with its maximum performance is saved. We then initialize an identical agent with the weights of the original and launch it from this state – in doing so, we find the algorithm reliably makes further progress, even if the original had plateaued. This dynamic consistently holds across architectures and algorithms including DQN (Mnih et al., 2013; 2015), Rainbow (Hessel et al., 2018) and Rainbow with intrinsic motivation (Bellemare et al., 2014; 2016). We refute that this performance boost is simply the result of additional training or of higher model capacity – in both cases, the original agent fails to achieve either the same level of performance or match the equivalent sample efficiency. Crucially, non-interference between the two learners is identified as the salient difference.
We next establish the concept of a "task" within a game by using the game score as a task contextualization (Jain et al., 2019). Game score, though not a perfect demarcation of task boundaries, is an easily extracted proxy with reasonable properties. Changes in game score directly impact the temporal-difference (TD) error through the reward (Sutton & Barto, 1998) and often represent milestones such as acquiring or manipulating an object, successfully navigating a space, or defeating an enemy. We show that training on certain contexts causes prediction errors in other contexts to change unpredictably. In certain environments, learning about one context generalizes to other segments, but in others, the prediction errors worsen. These experiments reveal that individual games give rise to the learning interference issues more commonly associated with continual and multitask learning (Parisi et al., 2019).

1Reference to the eponymous psychological thriller whose protagonist suffers from amnesia and must deduce a reasonable plan with no memory of how he arrived or even of his initial goal.

arXiv:2002.12499v1 [cs.LG] 28 Feb 2020
Our paper links catastrophic forgetting to issues of sample efficiency, performance plateaus, and exploration. To do so, we present evidence through the Memento observation and show that this generalizes across architectures, algorithms and games. We refute that this is attributable to longer training duration or additional model capacity. We then probe deeper into these interference dynamics by using the game score as a context. This reveals that learning on particular contexts of the game often catastrophically interferes with prediction errors in others. Cross-context analysis further corroborates our results: in games with a limited improvement of the Memento agent, which eliminates interference by construction, we also demonstrate lower cross-context catastrophic interference.
2. Background

Reinforcement learning conventionally models the environment as a Markov Decision Process (MDP). We reiterate the details presented by Sutton & Barto (1998) here. The MDP is defined as the tuple M = ⟨S, A, P, R, γ⟩, where S is the state space of the environment and A is the action space, which may be discrete or continuous. The learner is not explicitly given the environment transition probability P(s_{t+1}|s_t, a_t) for going from s_t ∈ S to s_{t+1} ∈ S given a_t ∈ A, but samples from this distribution are observed. The environment emits a bounded reward R : S × A → [r_min, r_max] on each transition, and γ ∈ (0, 1) is the discount factor, which defines a time-preference for short-term versus long-term rewards and creates the basis for a discounted utility model. Generally, the transition function P satisfies a Markov property: the transition to state s_{t+1} is conditionally independent of all previous state-action sequences given knowledge of state s_t, i.e., P(s_{t+1}|s_0, a_0, ..., s_t, a_t) = P(s_{t+1}|s_t, a_t).
A policy π : S → A is a mapping from states to actions. The state-action value function Qπ(s, a) is the expected discounted reward after taking action a in state s and then following policy π until termination:

Qπ(s, a) = E_{π,P} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a ]    (1)

where s_{t+1} ∼ P(·|s_t, a_t) and a_t ∼ π(·|s_t).
Original Q-learning algorithms (Watkins & Dayan, 1992) were proposed in tabular settings. In tabular RL, where each state and action is treated as an independent position, the algorithm generally requires examining each state-action pair in order to build accurate value functions and effective policies. Function approximation was introduced to parameterize value functions so that estimates could generalize, requiring fewer interactions with the environment (Sutton & Barto, 1998).
Non-linear, deep function approximation was shown to produce superhuman performance in the Arcade Learning Environment with Deep Q-Networks (DQN) (Mnih et al., 2013; 2015), ushering in the field of deep reinforcement learning. This work combined Q-learning with neural network function approximation to yield a scalable reinforcement learning algorithm. Experience replay (Lin, 1992) is a technique where the agent's experience is stored in a sliding-window memory buffer. The agent then samples experience either uniformly or with a prioritized scheme (Schaul et al., 2015) for training the learner.
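A sliding-window replay buffer with uniform sampling can be sketched as follows (the capacity and tuple layout are illustrative choices, not the paper's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding-window experience replay (Lin, 1992): the oldest
    transitions are evicted once the buffer reaches capacity."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; a prioritized scheme (Schaul et al., 2015)
        # would instead weight transitions by their TD error.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```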
3. The Memento Observation

Multi-task research in the ALE often builds upon the assumption that one task is one game, and that multi-task learning corresponds to learning multiple games (Kirkpatrick et al., 2017; Fernando et al., 2017) or different game modes (Farebrother et al., 2018). However, we now question this assumption and probe the hypothesis that interference occurs within a single game. We begin with MONTEZUMA'S REVENGE, a difficult exploration game (Bellemare et al., 2013; 2016; Ali Taïga et al., 2019) that exhibits a composite objective structure (the learning objective can be broken into separate components) because the agent accrues rewards by navigating a series of rooms with enemies and obstacles. The presence of interference in deep RL algorithms was conjectured for games with composite structure, though not examined, by Schaul et al. (2019).
Limitations of existing algorithms. Ali Taïga et al. (2019) recently revisited the difficult exploration games of the ALE with the Rainbow agent (Hessel et al., 2018) and with recent bonus-based methods. Progress through these environments can be incentivized by providing the agent with exploration bonuses derived from notions of novelty (Bellemare et al., 2016), curiosity (Pathak et al., 2017) or distillation (Burda et al., 2018). Count-based exploration algorithms (Bellemare et al., 2016) maintain visit counts and provide reward
bonuses for reaching unfrequented states. These methods demonstrate impressive performance beyond standard RL algorithms on hard-exploration games in the ALE. Ali Taïga et al. (2019) find that environments that previously taxed the DQN agent's (Mnih et al., 2015) exploration capabilities were better handled when employing recent RL advances and the Adam optimizer (Kingma & Ba, 2014). However, consistent barriers are still observed in certain environments like MONTEZUMA'S REVENGE. For the Rainbow agent with an intrinsic motivation reward computed via a CTS model (Bellemare et al., 2014), a notable plateau is observed at a score of 6600 (Figure 2(a)).
(a) Initial position with score of 6600. (b) Intermediate position with score of 8000. (c) Maximum recorded score of 14500.

Figure 1. Frame from the start position in MONTEZUMA'S REVENGE for the Memento observation with a game score of 6600 (a). A Rainbow agent with a CTS model does not reliably proceed past this point in the environment. However, a new agent with identical architecture which starts from this position reliably makes further progress. Repeated resets yielded maximum scores of 14500 (right).
Breaking the barrier. However, if a fresh agent with the same initial weights is launched from the state where the last agent left off, it makes reliable further progress – repeated resets recorded maximum scores of up to 14,500 in MONTEZUMA'S REVENGE. The second agent begins each of its episodes from the final position of the last. As the weights of the second agent are independent of the original, learning progress and weight updates do not interfere with the original agent. Furthermore, in MONTEZUMA'S REVENGE we find that instead starting the second agent randomly initialized (so that the weights encode no past context) performs equivalently. We refer to this as the Memento observation and demonstrate in Appendix A.2 that it holds on other hard-exploration Atari games including GRAVITAR, VENTURE, and PRIVATE EYE.

This observation might be hypothesized to simply be the result of longer training or of additional network capacity, but we refute both of these in Appendix A.1. There, we find that neither longer training nor additional model capacity is sufficient to improve past the previous score plateau of 6600. Instead, we posit that the success of the Memento agent is consistent with catastrophic forgetting interfering with the
(a) Rainbow CTS agent in MONTEZUMA'S REVENGE. Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score of 6600.

(b) Memento observation in MONTEZUMA'S REVENGE. Both a fresh agent initialized with random weights (blue) and one initialized with weights from the original model (orange) reliably make further progress.

Figure 2. The Memento observation in MONTEZUMA'S REVENGE. A fresh, randomly initialized Rainbow CTS agent (blue) makes reliable progress from the previous barrier position at a score of 6600. Additionally, we observe that launching a new agent with the original parameters (orange) yields no weight-transfer benefit over a randomly initialized agent.
ability of the original agent to make progress (McCloskey & Cohen, 1989; French, 1999). The original agent's replay buffer does contain samples exceeding the game score plateau (6600+), which suggests that the exploration policy is not the limiting factor. Rather, the agent is unable to integrate this new information and learn a value function in the new region without degrading performance in the first region, causing these high-reward samples to be lost from the experience replay buffer. Catastrophic interference may therefore be linked with exploration challenges in deep RL: integrating the knowledge via TD-backups of a newly received reward in an environment with a sparse reward signal may interfere with the past policy. In this section we have established evidence for a Rainbow agent with intrinsic rewards on hard-exploration games of the ALE using a manually-selected reset position. In Section 4 we seek to understand if this phenomenon applies more broadly.
Experience replay and catastrophic interference. Replay buffers with prioritized experience replay (Schaul et al., 2015), which samples states with higher temporal-difference (TD) error more frequently, can exacerbate catastrophic interference issues. Figure 3 records the training dynamics of the Rainbow agent in MONTEZUMA'S REVENGE. Each plot is a separate time-series recording which context(s) the agent is learning from as it trains in the environment. The agent clearly iterates through stages of the environment indexed by the score, starting by learning exclusively from context 0, then context 1, and so on. The left column shows the early learning dynamics and the right column shows the resulting oscillations after longer training. This is antithetical to approaches that address catastrophic forgetting by shuffling across tasks and contexts (Ratcliff, 1990; Robins, 1995). These dynamics are reminiscent of continual learning2, where updates from different sections may interfere with each other (Schaul et al., 2019). Prioritized experience replay, a useful algorithm for sifting through past experience and an important element of the Rainbow agent, naturally gives rise to continual-learning dynamics. However, removing prioritized experience replay and instead sampling uniformly at random (Mnih et al., 2013; 2015) can still manifest distinct data-distribution shifts in the replay buffer. As the agent learns different stages of the game, frontier regions may produce more experience as the agent lingers there during exploration.

2In contrast to the standard setting, these score contexts cannot be learned independently since the agent must pass through earlier contexts to arrive at later ones.

Figure 3. We plot how often the first five game contexts of MONTEZUMA'S REVENGE are sampled from the replay buffer throughout training. Early in training (left column) the agent almost exclusively trains on the first stage (because no later stages have been discovered). In intermediate stages, the agent is trained on all the contexts at the same time (which can lead to interference), and in late stages, on only the last stage (which can lead to catastrophic forgetting). After discovering all contexts (right column), the agent oscillates between sampling contexts.
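The context bookkeeping behind Figure 3 can be sketched by tagging each transition with the game score at which it occurred; the threshold list below is a hypothetical stand-in for the score levels that open each context:

```python
def context_of(score, boundaries):
    """Map a game score (return-to-date) to a context index: context c
    is entered once the score reaches boundaries[c - 1]."""
    c = 0
    for b in boundaries:
        if score >= b:
            c += 1
    return c

def tag_transitions(scores, boundaries):
    """Label each replay-buffer transition with its score context so the
    per-context sampling frequency can be tracked over training."""
    return [context_of(z, boundaries) for z in scores]
```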
4. Generalizing Across Agents and Games

To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent. We train an agent and then sample a single trajectory from it, (s_0, a_0, r_0, ..., s_T, a_T, r_T). We compute the return-to-date (game score) for each state, z_t = Σ_{k<t} r_k, and choose the earliest timestep t_max at which the game score is maximized: t_max = min{t : z_t = z_max}. The state corresponding to this timestep, s_max = s_{t_max}, becomes the starting position for the Memento agent. In Appendix C we present alternative experiments launching the Memento agent from a diverse set of states S_max consistent with the maximum score and find the observation also holds. Figure 4 is a schematic of this approach.
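The selection rule above can be sketched directly from a trajectory's reward sequence (a minimal illustration; the function name is ours):

```python
def memento_start_timestep(rewards):
    """Return t_max = min{t : z_t = z_max}, where z_t = sum_{k < t} r_k
    is the game score entering timestep t, for t = 0..T."""
    scores, z = [0.0], 0.0          # z_0 = 0: no reward received yet
    for r in rewards:
        z += r
        scores.append(z)            # z_{t+1} after observing r_t
    # list.index returns the first (earliest) position of the maximum
    return scores.index(max(scores))
```

For the reward sequence (1, 0, 1, −1) in Figure 4, the scores are z = (0, 1, 1, 2, 1), so t_max = 3, the earliest timestep with the maximal score of 2.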
Figure 4. Process for finding plateau points based on return contextualization. We scan the replay buffer B for the trajectory with the maximum game score. We extract this trajectory (states denoted as white circles and actions denoted as black dots) and compute the cumulative undiscounted return to date, labeled z. For the largest z in the trajectory (z_max), we take the corresponding state s_max, and this state becomes the launch point for the Memento agent.
This automated reset heuristic allows us to more generally evaluate the Memento observation across other choices of agents and games. We analyze the observation across the ALE suite on two different agents: a Rainbow agent (without the intrinsic motivation reward model) and a DQN agent. The Rainbow agent we use (Castro et al., 2018) employs a prioritized replay buffer (Schaul et al., 2015), n-step returns (Sutton & Barto, 1998) and distributional learning (Bellemare et al., 2017). Figure 5 records the percentage improvement of the new agent over the original agent across the entire ALE suite, finding a significant +25.0% median improvement.
Figure 5. Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent over the Rainbow baseline. Across the entire ALE suite, the Rainbow Memento agent improves performance in 75.0% of the games, with a median improvement of +25.0%.

Next, we consider ablating the three included algorithmic improvements of Rainbow and examine the base DQN agent. Similar to the Rainbow agent, we still find that this phenomenon generalizes. The DQN Memento agent achieves a +17% median improvement over the baseline agent (see Figure 16(b) for game-by-game detail), supporting that these dynamics hold across different types of value-based agents.
Furthermore, we find that the Memento observation applies across a wide suite of games, including ASTERIX and BREAKOUT, games that do not share the same compositional narrative structure as the original observation on MONTEZUMA'S REVENGE and other hard-exploration games. This supports that the issues of interference are present across agents and across different games. In cases where the Memento observation fails to improve, it is typically due to a bad or inescapable initial state being selected by our heuristic algorithm. We did not attempt to further improve this algorithm because our simple criterion already produced sufficient evidence of the interference issues (though further performance could likely be achieved through more sophisticated state selection).
As in Section 3, we refute that training duration and capacity are sufficient to explain the performance boost of the generalized Memento phenomenon. Figure 13(a) nullifies the possibility that training duration is the primary cause of the Memento observation: we find that training the original agent for 400M frames (double the conventional duration) results in only a +3.3% median improvement for the Rainbow agent (versus +25.0% in the Memento observation). Next, in Figure 13(b), we find that increased model capacity is also not the primary cause by training a double-capacity Rainbow agent and finding an improvement of only +7.4% (versus +25.0% for the Memento agent).
Figure 6 shows how the performance gap widens over training duration. The Memento agent is far more sample efficient than training the baseline Rainbow agent longer: it achieves the same median performance increase in only 5M frames compared to 200M frames for the longer-training Rainbow agent.
Figure 6. Performance of the Memento agent (blue) versus a standard Rainbow agent trained for 400M frames (red) and a double-capacity Rainbow agent trained for 400M frames (green). We find that the Memento agent is considerably more sample efficient. It reaches the peak performance that both longer-training Rainbow agents achieved over 200M frames in under 25M frames.
5. Catastrophic Interference

The Memento observation showed improvement in hard-exploration games with an intrinsically motivated Rainbow agent (Section 3) and, more generally, across various learning algorithms on a diverse set of games in the ALE suite (Section 4). As the Memento agent is constructed to separate learning updates between the new and original agents, it provides support for the hypothesis that the sample efficiency and performance of the Memento agent are realized because it reduces interference between sections of the game. We now investigate this hypothesis at a finer level of detail: specifically, we measure the TD errors at different contexts as the agent learns about particular sections of the game.
As introduced in Mnih et al. (2015), our agents use a target network to produce the Bellman targets for the online network. During training, the target network parameters θ− are held semi-fixed, updating only after every T gradient updates, and the online network always minimizes the TD errors with respect to the semi-fixed target network (in our implementation, T = 2000 online network updates). The non-distributional TD loss for a sampled experience tuple (s, a, r, s′) ∼ B is

L(θ) = E [ ℓ( Q(s, a; θ) − (r + γ max_{a′} Q(s′, a′; θ−)) ) ]

where ℓ refers to the Huber loss. We refer readers to Bellemare et al. (2017) for the distributional equivalent. In the interim period while the target network is held fixed, the optimization of the online network parameters via ∇θL(θ) reduces to a strictly supervised learning setting, eliminating several complications present in RL. The primary complication avoided is that for any given experience (s, a, r, s′), the Bellman targets do not change as the online network evolves. In the general RL setting with bootstrapping through TD-learning, the targets necessarily change; otherwise the online network would simply regress to the Q-values established with the initial target parameters θ−. We propose analyzing this simplified regime in order to carefully examine the presence of catastrophic interference.
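The supervised regression described here can be sketched as follows, with the networks stood in by plain callables (the function names, batch layout and Huber delta are illustrative, not the paper's implementation):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss l: quadratic near zero, linear in the tails."""
    x = np.abs(x)
    return np.where(x <= delta, 0.5 * x ** 2, delta * (x - 0.5 * delta))

def td_loss(q_online, q_target, batch, gamma=0.99):
    """Mean Huber TD loss against a frozen target network. q_online and
    q_target map a state to a vector of Q-values; batch holds
    (s, a, r, s_next, done) tuples."""
    losses = []
    for s, a, r, s_next, done in batch:
        # Bellman target is fixed while the target parameters are frozen
        target = r + (0.0 if done else gamma * np.max(q_target(s_next)))
        losses.append(huber(q_online(s)[a] - target))
    return float(np.mean(losses))
```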
Low Catastrophic Interference. We begin with an analysis of PONG in Figure 7. While keeping the target network parameters fixed, we measure the TD errors across all contexts after training for T steps on samples from one of the contexts in the agent's replay buffer. The left column shows the absolute change in TD errors and the right column shows the relative change. Each row corresponds to training on samples from a particular context, in this case contexts 0, 5, 10, and 15. For PONG, a game that does not materially benefit from the Memento observation in either Rainbow or DQN, we find that learning about one context or game score generalizes very well to others (with an occasional aberration).
Figure 8(a) provides a compressed view of the relative changes in TD errors, where the (i, j)-th entry corresponds to the relative change in TD error for context j when the model is trained on context i. Red locations in the diagram correspond to interference and negative generalization, and blue regions correspond to positive generalization between contexts. For these two games, the diagrams are overwhelmingly blue, indicating positive generalization between most regions.
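The matrix in Figure 8 can be sketched as follows; train_on and td_error are hypothetical stand-ins for the training and evaluation procedures described above:

```python
import numpy as np

def interference_matrix(model, contexts, train_on, td_error):
    """Entry (i, j) is the relative change in TD error on context j after
    training a copy of `model` on samples from context i. Negative values
    (blue in Figure 8) indicate positive generalization; positive values
    (red) indicate interference."""
    base = np.array([td_error(model, c) for c in contexts])
    m = np.zeros((len(contexts), len(contexts)))
    for i, ci in enumerate(contexts):
        trained = train_on(model, ci)  # T gradient steps on context i
        after = np.array([td_error(trained, c) for c in contexts])
        m[i] = (after - base) / base
    return m
```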
We consider another game that fared poorly under the Memento observation, QBERT, for both the Rainbow and DQN agents (Figure 16(b)). Since the Memento observation was not beneficial, a testable hypothesis is that we should observe low degrees of interference. Corroborating this, in Figure 8(b) we find that training on contexts typically generalizes. However, we find a consistent negative impact on the 0th context, suggesting that the beginning of the game (context 0) materially differs from later sections. This is evidenced by the red column showing degradation of performance in this section as others are learned.

Figure 7. We track the TD errors over the first twenty contexts in PONG for a Rainbow agent when we train on samples from a particular context (contexts 0, 5, 10, and 15). The left column shows the absolute change in TD errors and the right column shows the relative change in TD errors.
Figure 8. For PONG and QBERT, learning in one context usually generalizes to other contexts (improved relative TD errors in blue). We find no strong evidence of catastrophic interference in PONG. QBERT also generalizes across local contexts; however, a red column for the 0th context indicates some interference effects.

High Catastrophic Interference. We have now seen that learning about one portion of PONG or QBERT typically improves the prediction error in other contexts; that is, learning generalizes. However, we now show that this is not generally the case. Figure 9 revisits our running example of MONTEZUMA'S REVENGE and continues the same experiment. We select an agent that has reached the first five game-score contexts, corresponding to milestones including collecting the first key and exiting the initial room. In contrast to PONG, we find that learning about certain contexts in MONTEZUMA'S REVENGE strictly interferes with the prediction errors in other contexts.
Figure 10(a) shows the corresponding TD-error reduction matrix; as expected, training on samples from a particular context reduces the TD errors of that context, as observed in the blue diagonal. However, the red off-diagonals indicate that training on samples from one context negatively interferes with predictions in all the other contexts. Issues of catastrophic forgetting also extend to games where they might be unexpected, like BREAKOUT. Figure 10(b) shows that a Rainbow agent trained here interferes unpredictably with other contexts. Interestingly, we also note an asymmetry: training on later contexts produces worse backward transfer (bottom-left), while training on earlier contexts does not degrade forward transfer as severely (top-right) (Lopez-Paz & Ranzato, 2017).
Figure 9. We track the TD errors over the first five contexts in MONTEZUMA'S REVENGE for a Rainbow agent while training on a particular context (panels: train on contexts 0 through 4). The left column shows the absolute change in TD errors and the right column shows the relative change. As expected, training on a context reduces the TD error for that context; however, learning one context comes at the expense of worse prediction errors on others.

6. Related Work

Catastrophic Interference: To control how memories are retained or lost, many study the synaptic dynamics between different environment contexts. These methods either
try to regularize or expand the parameter space. Regular-ization has shown to discourage weight updates from over-writing old memories. Elastic Weight Consolidation (EWC)(Kirkpatrick et al., 2017) employs a quadratic penalty tokeep parameters used in successive environments close to-gether, and thus prevents unrelated parameters from beingrepurposed. Li & Hoiem (2017) impose a knowledge dis-tillation penalty (Hinton et al., 2015) to encourage certainnetwork predictions to remain similar as the parametersevolve. The importance of some memories can be encodedin additional weights, then used to scale penalties incurredwhen they change (Zenke et al., 2017). In general, regu-larization is effective when a low loss region exists at the
On Catastrophic Interference in Atari 2600 Games
1 2 3 4 5
Context Evaluated On
5
4
3
2
1C
onte
xt
Tra
ined O
n
Relative TD-Error
-100%
0
> 100%
(a) TD errors across game score contexts in MON-TEZUMA’S REVENGE
0 5 10 15 20 25 30 35
Context Evaluated On
40
35
30
25
20
15
10
5
Conte
xt
Tra
ined O
n
Relative TD-Error
-100%
0
> 100%
(b) TD errors across game score contexts in BREAK-OUT.
Figure 10. For both MONTEZUMA’S REVENGE and BREAKOUT
we find that learning almost always only improves the TD errorson that context (blue diagonals) at the expense of other contexts(red off-diagonals). These analyses suggest a high degree of catas-trophic interference.
intersection of each context's parameter space. When these regions are disjoint, however, it can be more effective to freeze weights (Sharif Razavian et al., 2014) or to replicate the entire network and augment it with new features for new settings (Rusu et al., 2016; Yoon et al., 2018; Draelos et al., 2017). Our work studies model expansion in the context of Atari, with the hypothesis that little overlap exists between the low loss regions of different game contexts. However, some games, such as Pong, exhibit considerable overlap.
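As a concrete illustration of the quadratic-penalty family of methods, an EWC-style regularizer can be sketched in a few lines. This is a simplified numpy sketch with a diagonal Fisher estimate; the names and structure are illustrative, not the implementation of Kirkpatrick et al. (2017).

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params, old_params, fisher: dicts mapping parameter name -> array,
    holding the current parameters, the parameters after the previous
    context, and a diagonal Fisher-information estimate of importance.
    Parameters deemed important (large F_i) are pulled strongly toward
    their old values; unimportant ones remain free to change.
    """
    return 0.5 * lam * sum(
        float(np.sum(fisher[k] * (params[k] - old_params[k]) ** 2))
        for k in params
    )
```

The penalty is added to the task loss, so gradient descent trades off new-context progress against drift on weights the old context relied on.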
Knowledge of the environment context is often a prerequisite to regularization or model expansion. Given a fixed strategy to mitigate interference, some consider the problem of finding the best contextualization. Rao et al. (2019) proposed a variational method to learn context representations which apply to a set of shared parameters. Similar to the Forget-Me-Not Process (Milan et al., 2016), the model is dynamically expanded as new data is experienced which cannot be explained by the model. Aljundi et al. (2019) propose a similar approach using regularization. Our work posits a context model that marks the boundaries between environment settings with the game score.
Continual Learning: Multi-task and continual learning in reinforcement learning is a highly active area of study (Ring, 1994; Silver et al., 2013; Oh et al., 2017), with some works proposing modular architectures; however, multi-task learning in the context of a single environment is a newer area of research. Schaul et al. (2019) describe the problem of ray interference in multi-component objectives for bandits, and show that in this setting learning exhibits winner-take-all dynamics with performance plateaus and learning constrained to only one component. In the context of diverse initial state distributions, Ghosh et al. (2017) find that policy-gradient methods myopically consider only a subset of the initial-state distribution to improve on when using a single policy. The Memento observation makes close connections to Go-Explore (Ecoffet et al., 2019), which demonstrates the efficacy of resetting to the boundary of the agent's knowledge and thereafter employing a basic exploration strategy to make progress in difficult exploration games. Finally, recent work in unsupervised continual learning considers a case similar to our setting, where the learning process is effectively multi-task but there are no explicit task labels (Rao et al., 2019).
7. Discussion

This research provides evidence of a difficult continual learning problem arising within a single game. The challenge posed by catastrophic interference in the reinforcement learning setting is greater than in standard continual learning for several reasons. First, in this setting we are provided no explicit task labels (Rao et al., 2019). Our analyses used the game score as a proxy label to reveal continual learning dynamics, but alternative task designations may further clarify the nature of these issues. Next, adding a further wrinkle to the problem, the "tasks" exist as part of a single MDP and therefore generally must transmit information to other tasks. TD errors in one portion of the environment directly affect TD errors elsewhere via Bellman backups (Bellman, 1957; Sutton & Barto, 1998). Finally, although experience replay (Mnih et al., 2013; 2015) resembles shuffling task examples, we find that, unlike in the standard setting, it is not sufficient to combat catastrophic forgetting. We posit that this is likely due to complicated feedback effects between data generation and learning.
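The coupling through Bellman backups is visible in the one-step TD error itself: because the update target bootstraps from the value of the next state, a value learned (or corrupted) in one region shifts the errors of every state that backs up from it. A minimal sketch:

```python
def td_error(reward, gamma, v_s, v_s_next):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    V(s) is updated toward r + gamma * V(s'), so changing V(s') in one
    part of the environment directly changes the target, and hence the
    TD error, of every state s that transitions into s'.
    """
    return reward + gamma * v_s_next - v_s
```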
8. Conclusion

We strengthen connections between catastrophic forgetting and central issues in reinforcement learning, including poor sample efficiency, performance plateaus (Schaul et al., 2019) and exploration challenges (Ali Taïga et al., 2019). Schaul et al. (2019) hypothesized that interference patterns might be observed in a deep RL setting for games like MONTEZUMA'S REVENGE. Our empirical studies not only confirm this
hypothesis, but also show that the phenomenon is more prevalent than previously conjectured. Both the Memento observation and peering into inter-context interference illuminate the nature and severity of interference in deep reinforcement learning. Our findings also suggest that the prior belief of what constitutes a "task" may be misleading and must therefore be carefully examined. We hope this work provides a clear characterization of the problem and shows that it has far-reaching implications for many fundamental problems in reinforcement learning.
Acknowledgements

This work stemmed from surprising initial results, and our understanding was honed through many insightful conversations at Google and Mila. In particular, the authors would like to thank Marlos Machado, Rishabh Agarwal, Adrien Ali Taïga, Margaret Li, Ryan Sepassi, George Tucker and Mike Mozer for helpful discussions and contributions. We would also like to thank the reviewers at the Biological and Artificial Reinforcement Learning workshop for constructive reviews on an earlier manuscript.
References

Ali Taïga, A., Fedus, W., Machado, M. C., Courville, A., and Bellemare, M. G. Benchmarking bonus-based exploration methods on the arcade learning environment. arXiv preprint arXiv:1908.02388, 2019.

Aljundi, R., Kelchtermans, K., and Tuytelaars, T. Task-free continual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Bellemare, M., Veness, J., and Talvitie, E. Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466, 2014.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

Draelos, T. J., Miner, N. E., Lamb, C. C., Cox, J. A., Vineyard, C. M., Carlson, K. D., Severa, W. M., James, C. D., and Aimone, J. B. Neurogenesis deep learning: Extending deep networks to accommodate new classes. 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533, 2017.

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123, 2018.

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jain, V., Fedus, W., Larochelle, H., Precup, D., and Bellemare, M. Algorithmic improvements for deep reinforcement learning applied to interactive fiction. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2019.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Milan, K., Veness, J., Kirkpatrick, J., Bowling, M., Koop, A., and Hassabis, D. The forget-me-not process. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3702–3710. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6055-the-forget-me-not-process.pdf.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

Oh, J., Singh, S., Lee, H., and Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2661–2670. JMLR.org, 2017.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.

Rao, D., Visin, F., Rusu, A., Pascanu, R., Teh, Y. W., and Hadsell, R. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pp. 7645–7655, 2019.

Ratcliff, R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.

Ring, M. B. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, 1994.

Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Schaul, T., Borsa, D., Modayil, J., and Pascanu, R. Ray interference: a source of plateaus in deep reinforcement learning. arXiv preprint arXiv:1904.11455, 2019.

Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813, 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D. L., Yang, Q., and Li, L. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series, 2013.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk7KsfW0-.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. JMLR.org, 2017.
A. Rainbow Agent with CTS Additional Figures

A.1. Additional Training and Model Capacity
We consider training the Rainbow CTS model longer and with higher capacity.
(a) Result of training the Rainbow CTS for a longer duration. The black vertical line denotes the typical training duration of 200M frames. We find for both the original agent (orange) as well as a sweep over other update-horizons (blue, green) that no agent reliably exceeds the baseline.

(b) Result of training a Rainbow CTS with increased capacity equal to two separate Rainbow CTS agents, accomplished by increasing the filter widths. We find for both the original agent (orange) as well as a sweep over other update-horizons (blue, green) that no agent reliably exceeds the baseline.

Figure 11. In MONTEZUMA'S REVENGE, neither additional training (Figure 11(a)) nor additional model capacity (Figure 11(b)) leads to improved performance over the base agent. These results hold for various n-step returns or update horizons of 1, 3 (Rainbow default), and 10. The dotted line is the maximum achieved baseline for the original agent, and all experiments are run with 5 seeds.
A.2. Rainbow with CTS Hard Exploration Games
We demonstrate the Memento observation for three difficult exploration games using the Rainbow + CTS agent.
(a) The Memento observation in GRAVITAR. (b) The Memento observation in VENTURE. (c) The Memento observation in PRIVATEEYE.

Figure 12. Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score. We find that the Memento observation holds across hard exploration games. PRIVATEEYE results in a slight decrease in performance for the new agent (orange) due to the count-down timer; however, we note that it is more stable compared to the baseline data (blue) and preserves performance.
B. Rainbow Agent Additional Figures

B.1. Additional Training and Model Capacity

We first present results of training the Rainbow agent for a longer duration and with additional capacity. In both settings, we find these variants fail to achieve the performance of the Memento observation.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow Extended Time, median improvement 3.32%.]
(a) A Rainbow agent trained for twice the duration, or 400M frames.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow Double Capacity, median improvement 0.06%.]
(b) A Rainbow agent trained with double network capacity.
Figure 13. A Rainbow agent trained for twice the duration or with twice the network capacity similarly fails to improve materially over the baseline. Training for 400M frames yields a median improvement of +3% and doubling the network capacity yields no median improvement. In contrast, the Memento agent improves over baseline by +25.0%.
On Catastrophic Interference in Atari 2600 Games
C. Memento State Strategy

We consider the performance when the Memento agent instead starts from a set of states S rather than a single state. This is a more difficult problem demanding more generalization capability from the Memento agent, and thus the reported median performance is lower. We find, however, that the Memento observation also holds in this more challenging setting. Furthermore, we find a higher fraction of games improve under multiple Memento start states, likely due to avoiding the primary issue of starting exclusively from an inescapable position.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (single original initialization), median improvement 25.08%.]
(a) Single Memento state.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (original initialization, multiple start states), median improvement 10.80%.]
(b) Multiple Memento states.
Figure 14. Multiple Memento states. Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent from a set of start states over the Rainbow baseline. The Rainbow Memento agent launched from a single state results in a 25.0% median improvement (left), which is greater than the 11.9% median improvement of launching from multiple states (right).
D. Weight Transfer

We examine the importance of the original agent's parameters for transfer to later sections of the game in the Memento observation. Our original observation relied on the original agent's weights. To examine the weight-transfer benefits of these parameters, we consider instead using random initialization for the Memento agent. If the parameters learned in the early section of the game by the original agent generalize, we should observe faster progress and potentially better asymptotic performance. We measure this for both Rainbow and DQN.
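The two initialization schemes compared here can be sketched as follows; this is an illustrative helper operating on a dict of weight arrays, not the actual training code, and the initialization scale is an assumption.

```python
import numpy as np

def init_memento_params(original_params, transfer=True, seed=0):
    """Initialize Memento-agent parameters from the original agent.

    transfer=True copies the original agent's weights (weight transfer);
    transfer=False draws a fresh random initialization of the same
    shapes, as in the random-initialization baseline.
    """
    if transfer:
        return {k: v.copy() for k, v in original_params.items()}
    rng = np.random.default_rng(seed)
    return {k: rng.normal(0.0, 0.05, size=v.shape)
            for k, v in original_params.items()}
```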
[Plot: median improvement over Rainbow vs. iterations (0–200), comparing random initialization against original-agent initialization.]
(a) Rainbow Memento agent.
[Plot: median improvement over DQN vs. iterations (0–200), comparing random initialization against original-agent initialization.]
(b) DQN Memento agent.
Figure 15. We examine the impact of the initial parameters for the Memento agent. The median improvement over the corresponding original agent is recorded for both random weight initialization (blue) and parameters from the original agent (red).
Interestingly, we find opposite conclusions for the Rainbow and DQN agents. The Memento Rainbow agent benefits from the original agent's weights whilst the Memento DQN agent is better off randomly initialized. We would also highlight that the gap between the two variants is not large. This indicates that the parameters learned in the earlier stages of the game may be of limited benefit.
E. Memento in Rainbow versus DQN
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (single original initialization), median improvement 25.08%.]
(a) Rainbow Memento agent.
[Bar chart: per-game percent improvement (log scale) over the DQN baseline at iteration 198, games ordered by improvement; DQN MEMENTO (single), median improvement 17.00%.]
(b) DQN Memento agent.
Figure 16. Each bar corresponds to a different game and the height is the percentage increase of the Memento agent over its baseline. Across the entire ALE suite, the Rainbow Memento agent (left) improves performance on 75.0% of the games with a median improvement of +25.0%. The DQN Memento agent (right) improves performance on 73.8% of the games with a median improvement of 17.0%.
F. Training Curves
[Per-game training curves: training episode returns vs. iteration for AirRaid, Alien, Amidar, Assault, Asterix, Asteroids, Atlantis, BankHeist, BattleZone, BeamRider, Berzerk, Bowling, Boxing, Breakout, Carnival, Centipede, ChopperCommand, CrazyClimber, DemonAttack, DoubleDunk, ElevatorAction, Enduro, FishingDerby, Freeway, Frostbite, Gopher, Gravitar, Hero, IceHockey, Jamesbond, JourneyEscape, Kangaroo, Krull, KungFuMaster, MontezumaRevenge, MsPacman, NameThisGame, Phoenix, Pitfall, Pong, Pooyan, PrivateEye, Qbert, Riverraid, RoadRunner, Robotank, Seaquest, Skiing, Solaris, SpaceInvaders, StarGunner, Tennis, TimePilot, Tutankham, UpNDown, Venture, VideoPinball, WizardOfWor, YarsRevenge, and Zaxxon; each panel compares RAINBOW against Rainbow Restart Single.]
Figure 17. Rainbow MEMENTO training curves. Blue is the baseline agent and orange is the MEMENTO agent launched from the best position of the previous agent (five seeds each).