On Catastrophic Interference in Atari 2600 Games
William Fedus * 1 2 Dibya Ghosh * 1 John D. Martin 1 3 Marc G. Bellemare 1 4 Yoshua Bengio 2 5
Hugo Larochelle 1 4
Abstract

Model-free deep reinforcement learning algorithms are troubled with poor sample efficiency – learning reliable policies generally requires a vast amount of interaction with the environment. One hypothesis is that catastrophic interference between various segments within the environment is an issue. In this paper, we perform a large-scale empirical study on the presence of catastrophic interference in the Arcade Learning Environment and find that learning particular game segments frequently degrades performance on previously learned segments. In what we term the Memento observation, we show that an identically parameterized agent, spawned from a state where the original agent plateaued, reliably makes further progress. This phenomenon is general – we find consistent performance boosts across architectures, learning algorithms and environments. Our results indicate that eliminating catastrophic interference can contribute towards improved performance and data efficiency of deep reinforcement learning algorithms.
1. Introduction

Despite many notable successes in deep reinforcement learning (RL), most model-free algorithms require huge numbers of samples to learn effective and useful policies (Mnih et al., 2015; Silver et al., 2016). This inefficiency is a central challenge in deep RL that limits their deployment. Our most successful algorithms rely on environment simulators that permit an unrestricted number of interactions (Bellemare et al., 2013; Silver et al., 2016), and many algorithms are specifically designed to churn through as much data as possible, achieved through parallel actors and hardware
*Equal contribution 1Google Brain 2MILA, Université de Montréal 3Stevens Institute of Technology 4CIFAR Fellow 5CIFAR Director. Correspondence to: William Fedus <[email protected]>.
Code available to reproduce experiments at https://github.com/google-research/google-research/tree/master/memento
accelerators (Nair et al., 2015; Espeholt et al., 2018). The sample efficiency of an algorithm is quantified by the number of interactions required with the environment in order to achieve a certain level of performance. In this paper, we empirically examine the hypothesis that poor sample efficiency is in part due to catastrophic forgetting (McCloskey & Cohen, 1989; French, 1999).
Catastrophic forgetting is a well-known phenomenon across machine learning (Kirkpatrick et al., 2017; Li & Hoiem, 2017; Lopez-Paz & Ranzato, 2017). It is well-established that as deep neural networks learn various tasks, the associated weight changes rapidly impair progress on previous tasks – even if the tasks share no logical connection. However, invoking it as a source of poor sample efficiency is a potentially counter-intuitive explanation because catastrophic forgetting is typically described as emerging across different tasks. Here we show that it also arises within a single environment as we probe across the Arcade Learning Environment (ALE) (Bellemare et al., 2013).
We present the Memento1 observation as the first support of this hypothesis. Here an agent is trained and the state(s) associated with its maximum performance is saved. We then initialize an identical agent with the weights of the original and launch it from this state – in doing so, we find the algorithm reliably makes further progress, even if the original had plateaued. This dynamic consistently holds across architectures and algorithms including DQN (Mnih et al., 2013; 2015), Rainbow (Hessel et al., 2018) and Rainbow with intrinsic motivation (Bellemare et al., 2014; 2016). We refute that this performance boost is simply the result of additional training or of higher model capacity – in both cases, the original agent fails to achieve either the same level of performance or match the equivalent sample efficiency. Crucially, non-interference between the two learners is identified as the salient difference.
We next establish the concept of a "task" within a game by using the game score as a task contextualization (Jain et al., 2019). Game score, though not a perfect demarcation of task boundaries, is an easily extracted proxy with reasonable properties. Changes in game score directly impact the temporal-difference (TD) error through the reward (Sutton & Barto, 1998) and often represent milestones such as acquiring or manipulating an object, successfully navigating a space, or defeating an enemy. We show that training on certain contexts causes prediction errors in other contexts to change unpredictably. In certain environments, learning about one context generalizes to other segments, but in others, the prediction errors worsen. These experiments reveal that individual games give rise to the learning interference issues more commonly associated with continual and multitask learning (Parisi et al., 2019).

1Reference to the eponymous psychological thriller whose protagonist suffers from amnesia and must deduce a reasonable plan with no memory of how he arrived or even of his initial goal.

arXiv:2002.12499v1 [cs.LG] 28 Feb 2020
Our paper links catastrophic forgetting to issues of sample efficiency, performance plateaus, and exploration. To do so, we present evidence through the Memento observation and show that this generalizes across architectures, algorithms and games. We refute that this is attributable to longer training duration or additional model capacity. We then probe deeper into these interference dynamics by using the game score as a context. This reveals that learning on particular contexts of the game often catastrophically interferes with prediction errors in others. Cross-context analysis further corroborates our results: in games with a limited improvement of the Memento agent, which eliminates interference by construction, we also demonstrate lower cross-context catastrophic interference.
2. Background

Reinforcement learning conventionally models the environment as a Markov Decision Process (MDP). We reiterate the details presented by Sutton & Barto (1998) here. The MDP is defined as the tuple M = ⟨S, A, P, R, γ⟩, where S is the state space of the environment and A is the action space, which may be discrete or continuous. The learner is not explicitly given the environment transition probability P(s_{t+1}|s_t, a_t) for going from s_t ∈ S to s_{t+1} ∈ S given a_t ∈ A, but samples from this distribution are observed. The environment emits a bounded reward R : S × A → [r_min, r_max] on each transition, and γ ∈ (0, 1) is the discount factor, which defines a time-preference for short-term versus long-term rewards and creates the basis for a discounted utility model. Generally, the transition function P satisfies a Markov property: the transition to state s_{t+1} is conditionally independent of all previous state-action sequences given knowledge of state s_t, i.e., P(s_{t+1}|s_0, a_0, ..., s_t, a_t) = P(s_{t+1}|s_t, a_t).
A policy π : S → A is a mapping from states to actions. The state-action value function Qπ(s, a) is the expected discounted reward after taking action a in state s and then following policy π until termination:

Qπ(s, a) = E_{π,P} [ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | s_0 = s, a_0 = a ]    (1)

where s_{t+1} ∼ P(·|s_t, a_t) and a_t ∼ π(·|s_t).
Original Q-learning algorithms (Watkins & Dayan, 1992) were proposed in tabular settings. In tabular RL, where each state and action is treated as an independent position, the algorithm generally requires examining each state-action pair in order to build accurate value functions and effective policies. Function approximation was introduced to parameterize value functions so that estimates could generalize, requiring fewer interactions with the environment (Sutton & Barto, 1998).
Non-linear, deep function approximation was shown to produce superhuman performance in the Arcade Learning Environment with Deep Q-Networks (DQN) (Mnih et al., 2013; 2015), ushering in the field of deep reinforcement learning. This work combined Q-learning with neural network function approximation to yield a scalable reinforcement learning algorithm. Experience replay (Lin, 1992) is a technique where the agent's experience is stored in a sliding-window memory buffer. The agent then samples experience either uniformly or with a prioritized scheme (Schaul et al., 2015) for training the learner.
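A sliding-window replay buffer with uniform sampling can be sketched as follows (the capacity and tuple layout are illustrative choices, not the paper's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding-window experience replay (Lin, 1992): the oldest
    transitions are evicted once the buffer reaches capacity."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling; a prioritized scheme (Schaul et al., 2015)
        # would instead weight transitions by their TD error.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```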
3. The Memento Observation

Multi-task research in the ALE often builds upon the assumption that one task is one game, and that multi-task learning corresponds to learning multiple games (Kirkpatrick et al., 2017; Fernando et al., 2017) or different game modes (Farebrother et al., 2018). However, we now question this assumption and probe the hypothesis that interference occurs within a single game. We begin with MONTEZUMA'S REVENGE, a difficult exploration game (Bellemare et al., 2013; 2016; Ali Taïga et al., 2019) that exhibits a composite objective structure (the learning objective can be broken into separate components) because the agent accrues rewards by navigating a series of rooms with enemies and obstacles. The presence of interference in deep RL algorithms was conjectured for games with composite structure, though not examined, by Schaul et al. (2019).
Limitations of existing algorithms. Ali Taïga et al. (2019) recently revisited the difficult exploration games of the ALE with the Rainbow agent (Hessel et al., 2018) and with recent bonus-based methods. Progress through these environments can be incentivized by providing the agent with exploration bonuses derived from notions of novelty (Bellemare et al., 2016), curiosity (Pathak et al., 2017) or distillation (Burda et al., 2018). Count-based exploration algorithms (Bellemare et al., 2016) maintain visit counts and provide reward
bonuses for reaching unfrequented states. These methods demonstrate impressive performance beyond standard RL algorithms on hard-exploration games in the ALE. Ali Taïga et al. (2019) find that environments that previously taxed the DQN agent's (Mnih et al., 2015) exploration capabilities were better handled when employing recent RL advances and the Adam optimizer (Kingma & Ba, 2014). However, consistent barriers are still observed in certain environments like MONTEZUMA'S REVENGE. For the Rainbow agent with an intrinsic motivation reward computed via a CTS model (Bellemare et al., 2014), a notable plateau is observed at a score of 6600 (Figure 2(a)).
(a) Initial position with score of 6600. (b) Intermediate position with score of 8000. (c) Maximum recorded score of 14500.

Figure 1. Frame from the start position in MONTEZUMA'S REVENGE for the Memento observation with a game score of 6600 (a). A Rainbow agent with a CTS model does not reliably proceed past this point in the environment. However, a new agent with identical architecture which starts from this position reliably makes further progress. Repeated resets yielded maximum scores of 14500 (right).
Breaking the barrier. However, if a fresh agent with the same initial weights is launched from the state where the last agent left off, it makes reliable further progress – repeated resets recorded maximum scores of up to 14,500 in MONTEZUMA'S REVENGE. The second agent begins each of its episodes from the final position of the last. As the weights of the second agent are independent of the original, learning progress and weight updates do not interfere with the original agent. Furthermore, in MONTEZUMA'S REVENGE we find that instead starting the second agent randomly initialized (so that the weights encode no past context) performs equivalently. We refer to this as the Memento observation and demonstrate in Appendix A.2 that it holds on other hard-exploration Atari games including GRAVITAR, VENTURE, and PRIVATE EYE.

This observation might be hypothesized to simply be the result of longer training or of additional network capacity, but we refute both of these in Appendix A.1. There, we find that neither longer training nor additional model capacity is sufficient to improve past the previous score plateau of 6600. Instead, we posit that the success of the Memento agent is consistent with catastrophic forgetting interfering with the
(a) Rainbow CTS agent in MONTEZUMA'S REVENGE. Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score of 6600.

(b) Memento observation in MONTEZUMA'S REVENGE. Both a fresh agent initialized with random weights (blue) and one initialized with weights from the original model (orange) reliably make further progress.

Figure 2. The Memento observation in MONTEZUMA'S REVENGE. A fresh, randomly initialized Rainbow CTS agent (blue) makes reliable progress from the previous barrier position at a score of 6600. Additionally, we observe that launching a new agent with the original parameters (orange) yields no weight-transfer benefit over a randomly initialized agent.
ability of the original agent to make progress (McCloskey & Cohen, 1989; French, 1999). The original agent's replay buffer does contain samples exceeding the game score plateau (6600+), which suggests that the exploration policy is not the limiting factor. Rather, the agent is unable to integrate this new information and learn a value function in the new region without degrading performance in the first region, causing these high-reward samples to be lost from the experience replay buffer. Catastrophic interference may therefore be linked with exploration challenges in deep RL: integrating the knowledge via TD-backups of a newly received reward in an environment with a sparse reward signal may interfere with the past policy. In this section we have established evidence for a Rainbow agent with intrinsic rewards on hard-exploration games of the ALE using a manually-selected reset position. In Section 4 we seek to understand if this phenomenon applies more broadly.
Experience replay and catastrophic interference. Replay buffers with prioritized experience replay (Schaul et al., 2015), which samples states with higher temporal-difference (TD) error more frequently, can exacerbate catastrophic interference issues. Figure 3 records the training dynamics of the Rainbow agent in MONTEZUMA'S REVENGE. Each plot is a separate time-series recording which context(s) the agent is learning from as it trains in the environment. The agent clearly iterates through stages of the environment indexed by the score, starting by learning exclusively from context 0, then context 1, and so on. The left column shows the early learning dynamics and the right column shows the resulting oscillations after longer training. This is antithetical to approaches that address catastrophic forgetting by shuffling across tasks and contexts (Ratcliff, 1990; Robins, 1995). These dynamics are reminiscent of continual learning2, where updates from different sections may interfere with each other (Schaul et al., 2019). Prioritized experience replay, a useful algorithm for sifting through past experience and an important element of the Rainbow agent, naturally gives rise to continual-learning dynamics. However, removing prioritized experience replay and instead sampling uniformly at random (Mnih et al., 2013; 2015) can still manifest distinct data-distribution shifts in the replay buffer. As the agent learns different stages of the game, frontier regions may produce more experience as the agent lingers there during exploration.

2In contrast to the standard setting, these score contexts cannot be learned independently since the agent must pass through earlier contexts to arrive at later ones.

Figure 3. We plot how often the first five game contexts of MONTEZUMA'S REVENGE are sampled from the replay buffer throughout training. Early in training (left column) the agent almost exclusively trains on the first stage (because no later stages have been discovered). In intermediate stages, the agent is trained on all the contexts at the same time (which can lead to interference), and in late stages, on only the last stage (which can lead to catastrophic forgetting). After discovering all contexts (right column), the agent oscillates between sampling contexts.
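The context bookkeeping behind Figure 3 can be sketched by tagging each transition with the game score at which it occurred; the threshold list below is a hypothetical stand-in for the score levels that open each context:

```python
def context_of(score, boundaries):
    """Map a game score (return-to-date) to a context index: context c
    is entered once the score reaches boundaries[c - 1]."""
    c = 0
    for b in boundaries:
        if score >= b:
            c += 1
    return c

def tag_transitions(scores, boundaries):
    """Label each replay-buffer transition with its score context so the
    per-context sampling frequency can be tracked over training."""
    return [context_of(z, boundaries) for z in scores]
```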
4. Generalizing Across Agents and Games

To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent. We train an agent and then sample a single trajectory from it, (s_0, a_0, r_0, ..., s_T, a_T, r_T). We compute the return-to-date (game score) for each state, z_t = Σ_{k<t} r_k, and choose the earliest timestep t_max at which the game score is maximized: t_max = min{t : z_t = z_max}. The state corresponding to this timestep, s_max = s_{t_max}, becomes the starting position for the Memento agent. In Appendix C we present alternative experiments launching the Memento agent from a diverse set of states S_max consistent with the maximum score and find the observation also holds. Figure 4 is a schematic of this approach.
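The selection rule above can be sketched directly from a trajectory's reward sequence (a minimal illustration; the function name is ours):

```python
def memento_start_timestep(rewards):
    """Return t_max = min{t : z_t = z_max}, where z_t = sum_{k < t} r_k
    is the game score entering timestep t, for t = 0..T."""
    scores, z = [0.0], 0.0          # z_0 = 0: no reward received yet
    for r in rewards:
        z += r
        scores.append(z)            # z_{t+1} after observing r_t
    # list.index returns the first (earliest) position of the maximum
    return scores.index(max(scores))
```

For the reward sequence (1, 0, 1, −1) in Figure 4, the scores are z = (0, 1, 1, 2, 1), so t_max = 3, the earliest timestep with the maximal score of 2.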
Figure 4. Process for finding plateau points based on return contextualization. We scan the replay buffer B for the trajectory with the maximum game score. We extract this trajectory (states denoted as white circles and actions denoted as black dots) and compute the cumulative undiscounted return to date, labeled z. For the largest z in the trajectory (z_max), we take the corresponding state s_max, and this state becomes the launch point for the Memento agent.
This automated reset heuristic allows us to more generally evaluate the Memento observation across other choices of agents and games. We analyze the observation across the ALE suite on two different agents: a Rainbow agent (without the intrinsic motivation reward model) and a DQN agent. The Rainbow agent we use (Castro et al., 2018) employs a prioritized replay buffer (Schaul et al., 2015), n-step returns (Sutton & Barto, 1998) and distributional learning (Bellemare et al., 2017). Figure 5 records the percentage improvement of the new agent over the original agent across the entire ALE suite, finding a significant +25.0% median improvement.
Figure 5. Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent over the Rainbow baseline. Across the entire ALE suite, the Rainbow Memento agent improves performance in 75.0% of the games, with a median improvement of +25.0%.

Next, we consider ablating the three included algorithmic improvements of Rainbow and examine the base DQN agent. Similar to the Rainbow agent, we still find that this phenomenon generalizes. The DQN Memento agent achieves a +17% median improvement over the baseline agent (see Figure 16(b) for game-by-game detail), supporting that these dynamics hold across different types of value-based agents.
Furthermore, we find that the Memento observation applies across a wide suite of games, including ASTERIX and BREAKOUT, games that do not share the same compositional narrative structure as the original observation on MONTEZUMA'S REVENGE and other hard-exploration games. This supports that the issues of interference are present across agents and across different games. In cases where the Memento observation fails to improve, it is typically due to a bad or inescapable initial state being selected by our heuristic algorithm. We did not attempt to further improve this algorithm because our simple criterion already produced sufficient evidence of the interference issues (though further performance could likely be achieved through more sophisticated state selection).
As in Section 3, we refute that training duration and capacity are sufficient to explain the performance boost of the generalized Memento phenomenon. Figure 13(a) nullifies the possibility that training duration is the primary cause of the Memento observation: we find that training the original agent for 400M frames (double the conventional duration) results in only a +3.3% median improvement for the Rainbow agent (versus +25.0% in the Memento observation). Next, in Figure 13(b), we find that increased model capacity is also not the primary cause by training a double-capacity Rainbow agent and finding an improvement of only +7.4% (versus +25.0% for the Memento agent).
Figure 6 shows how the performance gap widens over training duration. The Memento agent is far more sample efficient than training the baseline Rainbow agent longer: it achieves the same median performance increase in only 5M frames compared to 200M frames for the longer-training Rainbow agent.
Figure 6. Performance of the Memento agent (blue) versus a standard Rainbow agent trained for 400M frames (red) and a double-capacity Rainbow agent trained for 400M frames (green). We find that the Memento agent is considerably more sample efficient. It reaches the peak performance that both longer-training Rainbow agents achieved over 200M frames in under 25M frames.
5. Catastrophic Interference

The Memento observation showed improvement in hard-exploration games with an intrinsically motivated Rainbow agent (Section 3) and, more generally, across various learning algorithms on a diverse set of games in the ALE suite (Section 4). As the Memento agent is constructed to separate learning updates between the new and original agents, it provides support for the hypothesis that the sample efficiency and performance of the Memento agent are realized because it reduces interference between sections of the game. We now investigate this hypothesis at a finer level of detail: specifically, we measure the TD errors at different contexts as the agent learns about particular sections of the game.
As introduced in Mnih et al. (2015), our agents use a target network to produce the Bellman targets for the online network. During training, the target network parameters θ− are held semi-fixed, updating only after every T gradient updates, and the online network always minimizes the TD errors with respect to the semi-fixed target network (in our implementation, T = 2000 online network updates). The non-distributional TD loss for a sampled experience tuple (s, a, r, s′) ∼ B is

L(θ) = E [ ℓ( Q(s, a; θ) − (r + γ max_{a′} Q(s′, a′; θ−)) ) ]

where ℓ refers to the Huber loss. We refer readers to Bellemare et al. (2017) for the distributional equivalent. In the interim period while the target network is held fixed, the optimization of the online network parameters via ∇θL(θ) reduces to a strictly supervised learning setting, eliminating several complications present in RL. The primary complication avoided is that for any given experience (s, a, r, s′), the Bellman targets do not change as the online network evolves. In the general RL setting with bootstrapping through TD-learning, the targets necessarily change; otherwise the online network would simply regress to the Q-values established with the initial target parameters θ−. We propose analyzing this simplified regime in order to carefully examine the presence of catastrophic interference.
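The supervised regression described here can be sketched as follows, with the networks stood in by plain callables (the function names, batch layout and Huber delta are illustrative, not the paper's implementation):

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss l: quadratic near zero, linear in the tails."""
    x = np.abs(x)
    return np.where(x <= delta, 0.5 * x ** 2, delta * (x - 0.5 * delta))

def td_loss(q_online, q_target, batch, gamma=0.99):
    """Mean Huber TD loss against a frozen target network. q_online and
    q_target map a state to a vector of Q-values; batch holds
    (s, a, r, s_next, done) tuples."""
    losses = []
    for s, a, r, s_next, done in batch:
        # Bellman target is fixed while the target parameters are frozen
        target = r + (0.0 if done else gamma * np.max(q_target(s_next)))
        losses.append(huber(q_online(s)[a] - target))
    return float(np.mean(losses))
```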
Low Catastrophic Interference. We begin with an analysis of PONG in Figure 7. While keeping the target network parameters fixed, we measure the TD errors across all contexts after training for T steps on samples from one of the contexts in the agent's replay buffer. The left column shows the absolute change in TD errors and the right column shows the relative change. Each row corresponds to training on samples from a particular context, in this case contexts 0, 5, 10, and 15. For PONG, a game that does not materially benefit from the Memento observation in either Rainbow or DQN, we find that learning about one context or game score generalizes very well to others (with an occasional aberration).
Figure 8(a) provides a compressed view of the relative changes in TD errors, where the (i, j)-th entry corresponds to the relative change in TD error for context j when the model is trained on context i. Red locations in the diagram correspond to interference and negative generalization, and blue regions correspond to positive generalization between contexts. For these two games, the diagrams are overwhelmingly blue, indicating positive generalization between most regions.
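The matrix in Figure 8 can be sketched as follows; train_on and td_error are hypothetical stand-ins for the training and evaluation procedures described above:

```python
import numpy as np

def interference_matrix(model, contexts, train_on, td_error):
    """Entry (i, j) is the relative change in TD error on context j after
    training a copy of `model` on samples from context i. Negative values
    (blue in Figure 8) indicate positive generalization; positive values
    (red) indicate interference."""
    base = np.array([td_error(model, c) for c in contexts])
    m = np.zeros((len(contexts), len(contexts)))
    for i, ci in enumerate(contexts):
        trained = train_on(model, ci)  # T gradient steps on context i
        after = np.array([td_error(trained, c) for c in contexts])
        m[i] = (after - base) / base
    return m
```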
We consider another game that fared poorly under the Memento observation, QBERT, for both the Rainbow and DQN agents (Figure 16(b)). Since the Memento observation was not beneficial, a testable hypothesis is that we should observe low degrees of interference. Corroborating this, in Figure 8(b) we find that training on contexts typically generalizes. However, we find a consistent negative impact on the 0th context, suggesting that the beginning of the game (context 0) materially differs from later sections. This is evidenced by the red column showing degradation of performance in this section as others are learned.

Figure 7. We track the TD errors over the first twenty contexts in PONG for a Rainbow agent when we train on samples from a particular context (contexts 0, 5, 10, and 15). The left column shows the absolute change in TD errors and the right column shows the relative change in TD errors.
Figure 8. For PONG and QBERT, learning in one context usually generalizes to other contexts (improved relative TD errors in blue). We find no strong evidence of catastrophic interference in PONG. QBERT also generalizes across local contexts; however, a red column for the 0th context indicates some interference effects.

High Catastrophic Interference. We have now seen that learning about one portion of PONG or QBERT typically improves the prediction error in other contexts; that is, learning generalizes. However, we now show that this is not generally the case. Figure 9 revisits our running example of MONTEZUMA'S REVENGE and continues the same experiment. We select an agent that has reached the first five game-score contexts, corresponding to milestones including collecting the first key and exiting the initial room. In contrast to PONG, we find that learning about certain contexts in MONTEZUMA'S REVENGE strictly interferes with the prediction errors in other contexts.
Figure 10(a) shows the corresponding TD-error reduction matrix; as expected, training on samples from a particular context reduces the TD errors of that context, as observed in the blue diagonal. However, the red off-diagonals indicate that training on samples from one context negatively interferes with predictions in all the other contexts. Issues of catastrophic forgetting also extend to games where they might be unexpected, like BREAKOUT. Figure 10(b) shows that a Rainbow agent trained here interferes unpredictably with other contexts. Interestingly, we also note an asymmetry: training on later contexts produces worse backward transfer (bottom-left), while training on earlier contexts does not degrade forward transfer as severely (top-right) (Lopez-Paz & Ranzato, 2017).
Figure 9. We track the TD errors over the first five contexts in MONTEZUMA'S REVENGE for a Rainbow agent while training on a particular context (panels: train on contexts 0 through 4). The left column shows the absolute change in TD errors and the right column shows the relative change. As expected, training on a context reduces the TD error for that context; however, learning one context comes at the expense of worse prediction errors on others.

6. Related Work

Catastrophic Interference: To control how memories are retained or lost, many study the synaptic dynamics between different environment contexts. These methods either
try to regularize or expand the parameter space. Regular-ization has shown to discourage weight updates from over-writing old memories. Elastic Weight Consolidation (EWC)(Kirkpatrick et al., 2017) employs a quadratic penalty tokeep parameters used in successive environments close to-gether, and thus prevents unrelated parameters from beingrepurposed. Li & Hoiem (2017) impose a knowledge dis-tillation penalty (Hinton et al., 2015) to encourage certainnetwork predictions to remain similar as the parametersevolve. The importance of some memories can be encodedin additional weights, then used to scale penalties incurredwhen they change (Zenke et al., 2017). In general, regu-larization is effective when a low loss region exists at the
On Catastrophic Interference in Atari 2600 Games
1 2 3 4 5
Context Evaluated On
5
4
3
2
1C
onte
xt
Tra
ined O
n
Relative TD-Error
-100%
0
> 100%
(a) TD errors across game score contexts in MON-TEZUMA’S REVENGE
0 5 10 15 20 25 30 35
Context Evaluated On
40
35
30
25
20
15
10
5
Conte
xt
Tra
ined O
n
Relative TD-Error
-100%
0
> 100%
(b) TD errors across game score contexts in BREAK-OUT.
Figure 10. For both MONTEZUMA’S REVENGE and BREAKOUT
we find that learning almost always only improves the TD errorson that context (blue diagonals) at the expense of other contexts(red off-diagonals). These analyses suggest a high degree of catas-trophic interference.
intersection of each context's parameter space. When these regions are disjoint, however, it can be more effective to freeze weights (Sharif Razavian et al., 2014) or to replicate the entire network and augment it with new features for new settings (Rusu et al., 2016; Yoon et al., 2018; Draelos et al., 2017). Our work studies model expansion in the context of Atari, with the hypothesis that little overlap exists between the low loss regions of different game contexts. However, some games, such as Pong, exhibit considerable overlap.
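As a concrete illustration of the quadratic-penalty family of methods, an EWC-style regularizer can be sketched in a few lines. This is a simplified numpy sketch with a diagonal Fisher estimate; the names and structure are illustrative, not the implementation of Kirkpatrick et al. (2017).

```python
import numpy as np

def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params, old_params, fisher: dicts mapping parameter name -> array,
    holding the current parameters, the parameters after the previous
    context, and a diagonal Fisher-information estimate of importance.
    Parameters deemed important (large F_i) are pulled strongly toward
    their old values; unimportant ones remain free to change.
    """
    return 0.5 * lam * sum(
        float(np.sum(fisher[k] * (params[k] - old_params[k]) ** 2))
        for k in params
    )
```

The penalty is added to the task loss, so gradient descent trades off new-context progress against drift on weights the old context relied on.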
Knowledge of the environment context is often a prerequisite to regularization or model expansion. Given a fixed strategy to mitigate interference, some consider the problem of finding the best contextualization. Rao et al. (2019) proposed a variational method to learn context representations which apply to a set of shared parameters. Similar to the Forget-Me-Not Process (Milan et al., 2016), the model is dynamically expanded as new data is experienced which cannot be explained by the model. Aljundi et al. (2019) propose a similar approach using regularization. Our work posits a context model that marks the boundaries between environment settings with the game score.
Continual Learning: Multi-task and continual learning in reinforcement learning is a highly active area of study (Ring, 1994; Silver et al., 2013; Oh et al., 2017), with some works proposing modular architectures; however, multi-task learning in the context of a single environment is a newer area of research. Schaul et al. (2019) describe the problem of ray interference in multi-component objectives for bandits, and show that in this setting learning exhibits winner-take-all dynamics with performance plateaus and learning constrained to only one component. In the context of diverse initial state distributions, Ghosh et al. (2017) find that policy-gradient methods myopically consider only a subset of the initial-state distribution to improve on when using a single policy. The Memento observation makes close connections to Go-Explore (Ecoffet et al., 2019), which demonstrates the efficacy of resetting to the boundary of the agent's knowledge and thereafter employing a basic exploration strategy to make progress in difficult exploration games. Finally, recent work in unsupervised continual learning considers a case similar to our setting, where the learning process is effectively multi-task but there are no explicit task labels (Rao et al., 2019).
7. Discussion

This research provides evidence of a difficult continual learning problem arising within a single game. The challenge posed by catastrophic interference in the reinforcement learning setting is greater than in standard continual learning for several reasons. First, in this setting we are provided no explicit task labels (Rao et al., 2019). Our analyses used the game score as a proxy label to reveal continual learning dynamics, but alternative task designations may further clarify the nature of these issues. Next, adding a further wrinkle to the problem, the "tasks" exist as part of a single MDP and therefore generally must transmit information to other tasks. TD errors in one portion of the environment directly affect TD errors elsewhere via Bellman backups (Bellman, 1957; Sutton & Barto, 1998). Finally, although experience replay (Mnih et al., 2013; 2015) resembles shuffling task examples, we find that, unlike in the standard setting, it is not sufficient to combat catastrophic forgetting. We posit that this is likely due to complicated feedback effects between data generation and learning.
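The coupling through Bellman backups is visible in the one-step TD error itself: because the update target bootstraps from the value of the next state, a value learned (or corrupted) in one region shifts the errors of every state that backs up from it. A minimal sketch:

```python
def td_error(reward, gamma, v_s, v_s_next):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    V(s) is updated toward r + gamma * V(s'), so changing V(s') in one
    part of the environment directly changes the target, and hence the
    TD error, of every state s that transitions into s'.
    """
    return reward + gamma * v_s_next - v_s
```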
8. Conclusion

We strengthen connections between catastrophic forgetting and central issues in reinforcement learning, including poor sample efficiency, performance plateaus (Schaul et al., 2019) and exploration challenges (Ali Taïga et al., 2019). Schaul et al. (2019) hypothesized that interference patterns might be observed in a deep RL setting for games like MONTEZUMA'S REVENGE. Our empirical studies not only confirm this
hypothesis, but also show that the phenomenon is more prevalent than previously conjectured. Both the Memento observation and peering into inter-context interference illuminate the nature and severity of interference in deep reinforcement learning. Our findings also suggest that the prior belief of what constitutes a "task" may be misleading and must therefore be carefully examined. We hope this work provides a clear characterization of the problem and shows that it has far-reaching implications for many fundamental problems in reinforcement learning.
Acknowledgements

This work stemmed from surprising initial results, and our understanding was honed through many insightful conversations at Google and Mila. In particular, the authors would like to thank Marlos Machado, Rishabh Agarwal, Adrien Ali Taïga, Margaret Li, Ryan Sepassi, George Tucker and Mike Mozer for helpful discussions and contributions. We would also like to thank the reviewers at the Biological and Artificial Reinforcement Learning workshop for constructive reviews on an earlier manuscript.
References

Ali Taïga, A., Fedus, W., Machado, M. C., Courville, A., and Bellemare, M. G. Benchmarking bonus-based exploration methods on the arcade learning environment. arXiv preprint arXiv:1908.02388, 2019.

Aljundi, R., Kelchtermans, K., and Tuytelaars, T. Task-free continual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

Bellemare, M., Veness, J., and Talvitie, E. Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466, 2014.

Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.

Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.

Burda, Y., Edwards, H., Storkey, A., and Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

Draelos, T. J., Miner, N. E., Lamb, C. C., Cox, J. A., Vineyard, C. M., Carlson, K. D., Severa, W. M., James, C. D., and Aimone, J. B. Neurogenesis deep learning: Extending deep networks to accommodate new classes. 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533, 2017.

Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and Clune, J. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

Farebrother, J., Machado, M. C., and Bowling, M. Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123, 2018.

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A. A., Pritzel, A., and Wierstra, D. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.

Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., and Levine, S. Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874, 2017.

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Jain, V., Fedus, W., Larochelle, H., Precup, D., and Bellemare, M. Algorithmic improvements for deep reinforcement learning applied to interactive fiction. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), 2019.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.

Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476, 2017.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pp. 109–165. Elsevier, 1989.

Milan, K., Veness, J., Kirkpatrick, J., Bowling, M., Koop, A., and Hassabis, D. The forget-me-not process. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3702–3710. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6055-the-forget-me-not-process.pdf.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

Oh, J., Singh, S., Lee, H., and Kohli, P. Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2661–2670. JMLR.org, 2017.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17, 2017.

Rao, D., Visin, F., Rusu, A., Pascanu, R., Teh, Y. W., and Hadsell, R. Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pp. 7645–7655, 2019.

Ratcliff, R. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological Review, 97(2):285, 1990.

Ring, M. B. Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin, Austin, Texas 78712, 1994.

Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Schaul, T., Borsa, D., Modayil, J., and Pascanu, R. Ray interference: a source of plateaus in deep reinforcement learning. arXiv preprint arXiv:1904.11455, 2019.

Sharif Razavian, A., Azizpour, H., Sullivan, J., and Carlsson, S. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813, 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D. L., Yang, Q., and Li, L. Lifelong machine learning systems: Beyond learning algorithms. In 2013 AAAI Spring Symposium Series, 2013.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Yoon, J., Yang, E., Lee, J., and Hwang, S. J. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Sk7KsfW0-.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. JMLR.org, 2017.
A. Rainbow Agent with CTS Additional Figures

A.1. Additional Training and Model Capacity
We consider training the Rainbow CTS model longer and with higher capacity.
(a) Result of training the Rainbow CTS for a longer duration. The black vertical line denotes the typical training duration of 200M frames. We find for both the original agent (orange) as well as a sweep over other update-horizons (blue, green) that no agent reliably exceeds the baseline.

(b) Result of training a Rainbow CTS with increased capacity equal to two separate Rainbow CTS agents, accomplished by increasing the filter widths. We find for both the original agent (orange) as well as a sweep over other update-horizons (blue, green) that no agent reliably exceeds the baseline.

Figure 11. In MONTEZUMA'S REVENGE, neither additional training (Figure 11(a)) nor additional model capacity (Figure 11(b)) leads to improved performance over the base agent. These results hold for various n-step returns or update horizons of 1, 3 (Rainbow default), and 10. The dotted line is the maximum achieved baseline for the original agent, and all experiments are run with 5 seeds.
A.2. Rainbow with CTS Hard Exploration Games
We demonstrate the Memento observation for three difficult exploration games using the Rainbow + CTS agent.
(a) The Memento observation in GRAVITAR. (b) The Memento observation in VENTURE. (c) The Memento observation in PRIVATEEYE.

Figure 12. Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score. We find that the Memento observation holds across hard exploration games. PRIVATEEYE results in a slight decrease in performance for the new agent (orange) due to the count-down timer; however, we note that it is more stable compared to the baseline data (blue) and preserves performance.
B. Rainbow Agent Additional Figures

B.1. Additional Training and Model Capacity

We first present results of training the Rainbow agent for a longer duration and with additional capacity. In both settings, we find these variants fail to achieve the performance of the Memento observation.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow Extended Time, median improvement 3.32%.]
(a) A Rainbow agent trained for twice the duration, or 400M frames.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow Double Capacity, median improvement 0.06%.]
(b) A Rainbow agent trained with double network capacity.
Figure 13. A Rainbow agent trained for twice the duration or with twice the network capacity similarly fails to improve materially over the baseline. Training for 400M frames yields a median improvement of +3% and doubling the network capacity yields no median improvement. In contrast, the Memento agent improves over baseline by +25.0%.
On Catastrophic Interference in Atari 2600 Games
C. Memento State Strategy

We consider the performance when the Memento agent instead starts from a set of states S rather than a single state. This is a more difficult problem demanding more generalization capability from the Memento agent, and thus the reported median performance is lower. We find, however, that the Memento observation also holds in this more challenging setting. Furthermore, we find a higher fraction of games improve under multiple Memento start states, likely due to avoiding the primary issue of starting exclusively from an inescapable position.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (single original initialization), median improvement 25.08%.]
(a) Single Memento state.
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (original initialization, multiple start states), median improvement 10.80%.]
(b) Multiple Memento states.
Figure 14. Multiple Memento states. Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent from a set of start states over the Rainbow baseline. The Rainbow Memento agent launched from a single state results in a 25.0% median improvement (left), which is greater than the 11.9% median improvement of launching from multiple states (right).
D. Weight Transfer

We examine the importance of the original agent's parameters for transfer to later sections of the game in the Memento observation. Our original observation relied on the original agent's weights. To examine the weight-transfer benefits of these parameters, we consider instead using random initialization for the Memento agent. If the parameters learned in the early section of the game by the original agent generalize, we should observe faster progress and potentially better asymptotic performance. We measure this for both Rainbow and DQN.
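The two initialization schemes compared here can be sketched as follows; this is an illustrative helper operating on a dict of weight arrays, not the actual training code, and the initialization scale is an assumption.

```python
import numpy as np

def init_memento_params(original_params, transfer=True, seed=0):
    """Initialize Memento-agent parameters from the original agent.

    transfer=True copies the original agent's weights (weight transfer);
    transfer=False draws a fresh random initialization of the same
    shapes, as in the random-initialization baseline.
    """
    if transfer:
        return {k: v.copy() for k, v in original_params.items()}
    rng = np.random.default_rng(seed)
    return {k: rng.normal(0.0, 0.05, size=v.shape)
            for k, v in original_params.items()}
```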
[Plot: median improvement over Rainbow vs. iterations (0–200), comparing random initialization against original-agent initialization.]
(a) Rainbow Memento agent.
[Plot: median improvement over DQN vs. iterations (0–200), comparing random initialization against original-agent initialization.]
(b) DQN Memento agent.
Figure 15. We examine the impact of the initial parameters for the Memento agent. The median improvement over the corresponding original agent is recorded for both random weight initialization (blue) and parameters from the original agent (red).
Interestingly, we find opposite conclusions for the Rainbow and DQN agents. The Memento Rainbow agent benefits from the original agent's weights whilst the Memento DQN agent is better off randomly initialized. We would also highlight that the gap between the two variants is not large. This indicates that the parameters learned in the earlier stages of the game may be of limited benefit.
E. Memento in Rainbow versus DQN
[Bar chart: per-game percent improvement (log scale) over the Rainbow baseline at iteration 198, games ordered by improvement; Rainbow MEMENTO (single original initialization), median improvement 25.08%.]
(a) Rainbow Memento agent.
[Bar chart: per-game percent improvement (log scale) over the DQN baseline at iteration 198, games ordered by improvement; DQN MEMENTO (single), median improvement 17.00%.]
(b) DQN Memento agent.
Figure 16. Each bar corresponds to a different game and the height is the percentage increase of the Memento agent over its baseline. Across the entire ALE suite, the Rainbow Memento agent (left) improves performance on 75.0% of the games with a median improvement of +25.0%. The DQN Memento agent (right) improves performance on 73.8% of the games with a median improvement of 17.0%.
F. Training Curves
[Per-game training curves: training episode returns vs. iteration for AirRaid, Alien, Amidar, Assault, Asterix, Asteroids, Atlantis, BankHeist, BattleZone, BeamRider, Berzerk, Bowling, Boxing, Breakout, Carnival, Centipede, ChopperCommand, CrazyClimber, DemonAttack, DoubleDunk, ElevatorAction, Enduro, FishingDerby, Freeway, Frostbite, Gopher, Gravitar, Hero, IceHockey, Jamesbond, JourneyEscape, Kangaroo, Krull, KungFuMaster, MontezumaRevenge, MsPacman, NameThisGame, Phoenix, Pitfall, Pong, Pooyan, PrivateEye, Qbert, Riverraid, RoadRunner, Robotank, Seaquest, Skiing, Solaris, SpaceInvaders, StarGunner, Tennis, TimePilot, Tutankham, UpNDown, Venture, VideoPinball, WizardOfWor, YarsRevenge, and Zaxxon; each panel compares RAINBOW against Rainbow Restart Single.]
Figure 17. Rainbow MEMENTO training curves. Blue is the baseline agent and orange is the MEMENTO agent launched from the best position of the previous agent (five seeds each).