
POLITECNICO DI MILANO
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE

Memory-Augmented Neural Networks:
Enhancing Biological Plausibility and Task-Learnability

Author: Marco MARTINOLLI
Number 836632

Supervisors: Prof. Wulfram GERSTNER
Prof. Riccardo SACCO
Dr. Aditya GILRA

A thesis submitted in fulfillment of the requirements for the Master degree of Doctor in Mathematical Engineering

in the

School of Industrial and Information Engineering
Computational Science and Engineering

Academic Year 2016-2017


Declaration of Authorship

I, Marco MARTINOLLI, declare that this thesis, titled "Memory-Augmented Neural Networks: Enhancing Biological Plausibility and Task-Learnability", and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


Politecnico di Milano
École Polytechnique Fédérale de Lausanne

Abstract

Mathematical Engineering
Computational Science and Engineering

Doctor in Mathematical Engineering

Memory-Augmented Neural Networks: Enhancing Biological Plausibility and Task-Learnability

by Marco MARTINOLLI

Born from the union of Artificial Neural Networks with memory, Memory-Augmented Neural Networks (MANNs) represent a new frontier in artificial intelligence, thanks to the combination of the increasing learning capacity of neural networks with the possibility to store and retrieve relevant information from memory. Being a trending topic both in the machine learning community and in cognitive neuroscience, MANNs have been developed independently in opposite directions, with various achievements in terms of learning performance and biological plausibility.

In the present work, we have explored the capacities of MANNs by comparing the performance of specific models on a sequence of cognitive tasks with increasing demands on memory dynamics, so as to identify advantages and limitations and to propose new solutions. Specifically, the study involves two neurally faithful models with internal memory (AuGMEnT and HER) and the current top-performing MANN with external memory, the DNC.

The comparative results showed that the AuGMEnT model suffers from memory interference that hampers the correct learning of tasks affected by the temporal credit assignment problem, such as the 12AX task; on the other hand, the DNC network confirmed its excellent performance, but it is penalized by its weak biological foundations. We then proposed three variants of AuGMEnT to overcome its learning limitations: leaky AuGMEnT, deep AuGMEnT and hierarchical AuGMEnT. Moreover, the addressing scheme of the DNC model has been simplified to a purely content-based approach (R-DNC) and coupled with an LRUA access module to restore one-shot learning on classification tasks, such as the Omniglot task.


Politecnico di Milano
École Polytechnique Fédérale de Lausanne

Abstract

Mathematical Engineering
Computational Science and Engineering

Doctor in Mathematical Engineering

Memory-Augmented Neural Networks: Enhancing Biological Plausibility and Task-Learnability

by Marco MARTINOLLI

Born from the union of neural networks with a memory system, Memory-Augmented Neural Networks (MANNs) represent a new frontier of artificial intelligence, thanks to the combination of the growing learning capacity of artificial neural networks and the possibility to store and retrieve important information from memory. Being an extremely topical subject both in machine learning and in cognitive neuroscience, MANNs have been developed in opposite directions, reaching different milestones in terms of learning performance and biological plausibility.

This thesis explores the capacities of MANNs by comparing the performance of different models on a series of cognitive tasks with increasing memory requirements, in order to identify advantages and weaknesses and to propose alternatives to the original models. Specifically, the study focuses on two internal-memory models with strong biological foundations (AuGMEnT and HER) and on what is currently considered the best external-memory neural network, the DNC.

In particular, the comparative results reveal a difficulty of the AuGMEnT model in storing sequences of stimuli in a stable manner, which prevents the correct solution of tasks affected by the so-called temporal credit assignment problem, such as the 12AX task; by contrast, the DNC model confirms its excellent learning performance in all tested cases, but is penalized by low credibility from a neurobiological point of view. Therefore, three variants of the AuGMEnT model have been proposed to overcome its learning weaknesses: leaky AuGMEnT, deep AuGMEnT and hierarchical AuGMEnT. Moreover, the read/write scheme of the DNC has been reduced to a purely content-based approach (R-DNC), which is more biologically plausible, and subsequently integrated with the LRUA module to recover one-shot learning capability on classification tasks such as Omniglot.


Acknowledgements

This thesis is the fruit of a process of growth and of many efforts; it is the finishing line of my university life and education; it has been a project that stimulated me and taught me a lot; it could even be the first step towards my future career. All this has been a journey that has seen the contribution of many people.

Naturally, a special acknowledgement is due to Professor Gerstner, who opened the doors of his laboratory to me and involved me in multiple activities, making me feel a member of the lab staff. Thanks to my supervisor, Aditya, who advised and guided me throughout my master project. Thanks to Vineet for having been a precious collaborator who has always trusted me. Finally, thanks to Professor Sacco for having always shown enthusiasm for my project and for conveying to me both professionalism and a remarkable humaneness.

Thanks to my great Lab for making me feel welcome in the group from the very first moment. For inviting me to so many events, scientific talks, coffee breaks and parties that it was impossible to attend them all. Thank you for having shared with me your knowledge and your kindness. Thank you for all the support and the ideas you gave me during my period in the lab.

I thank my mythical Losanna group, which shared the international experience with me. The period at EPFL has definitely been a tough challenge that we passed together. Thank you for having faced the difficulties with me with positivity, and for the trust that you have always placed in me, even if you would never admit it. It has been a real honour to be the 'captain' of this disobedient and dysfunctional group.

Thanks to the United, in name and in essence, a group of formidable people who enriched my life in Milan with joy and conviviality. I owe a lot to them because they made me discover a deep and honest friendship, one that exists and resists beyond the borders of time and space. Seeing you again is always a big pleasure. Your Tini loves you all.

Trieste is always in my heart, even if I started to really appreciate it only when I left home. I owe so much to Trieste, because everyone there has contributed in a little part to the man I am now. Among all, I cannot fail to mention my eternal desk buddy, Marco, with whom I have an affinity that overcomes all our small differences, and Chiara, for being a certainty in my life, who keeps on advising me even from far away.

And eventually the biggest acknowledgement goes to my parents, my siblings and my dearest relatives, who have seen me grow up until I surpassed them all in height. Thanks for the support that you have always given me, for having made me feel protected, and for letting me go away from home to make my own experiences and my own mistakes. It is a journey that is still far from its end, but I am sure that beside you the future will always be full of colours and great laughs.


Acknowledgements

This thesis is the fruit of a process of growth and of many efforts; it is a milestone that marks the end of my university life and of my education; it is a project that stimulated me and taught me a lot; it is what could be the preamble to my future work. It is the end of a journey that has seen the contribution of many people dear to me.

Naturally, a special thanks goes to Professor Gerstner, who opened the doors of his laboratory to me and involved me in several formative and stimulating activities. Thanks to my supervisor, Aditya, who advised and followed me during this thesis period. Thanks to Vineet for being a precious collaborator who always trusted my judgement. Thanks, finally, to Professor Sacco for always being enthusiastic about my work and for always conveying professionalism and great humanity.

Thanks to the LCN Lab for welcoming me into your company from the very first moment. For inviting me to events, conferences, coffee breaks and parties, in numbers so great that it was unfortunately impossible to keep up with them all. Thanks for all the support and the ideas you gave me during my thesis period.

I thank the legendary Losanna Group, with whom I shared my international experience. The period at EPFL was a hard test that we passed together. Thank you for facing the difficulties with good humour and for the trust you have always placed in me, even if you would never admit it. It has been a pleasure and an honour to be the 'captain' of this disobedient and highly dysfunctional group.

Thanks to the United, in name and in fact: a group of formidable individuals who brought so much light-heartedness and conviviality to my time in Milan. A group to which I owe a lot, because it let me know a sincere and deep friendship, one that exists and resists without hesitation regardless of time and distance. Seeing you again is always an enormous pleasure. Your Tini loves you all.

Trieste is always in my heart, even if I only started to truly appreciate it when I left home. I owe a lot to my Triestine friends, because each of them has contributed in their own small way to shaping the person I am today. Among them all, I cannot fail to mention my eternal desk mate, Marco, with whom the harmony is so strong that it overcomes our apparent differences, and of course Chiara, for always being a certain presence in my life, who keeps on advising me even from far away.

And finally the biggest thanks goes to my parents, my siblings and my dearest relatives, who watched me grow until I surpassed them all in height. Thank you for the support you have always given me, for making me feel protected, and for letting me go out into the world to make my own experiences and my own mistakes. It is a journey still far from its end, but thank you for always being with me, enriching every moment together with so many laughs and emotions.


Contents

Declaration of Authorship iii

Abstract v

Abstract vii

Acknowledgements ix

1 Introduction  1
  1.1 Biological Introduction to Neural Networks  2
  1.2 Artificial Neural Networks  4
  1.3 Benefits and Limitations of Memory Augmentation  6
    1.3.1 The Working Memory in the Brain  6
    1.3.2 MANNs: Internal vs. External Architectures  7
    1.3.3 The Credit Assignment Problem  9
    1.3.4 A New Approach: Meta-Learning  10
  1.4 Overview of MANNs  11

2 MANN Models  15
  2.1 Attention-Gated MEmory Tagging - AuGMEnT  16
    2.1.1 Introduction to the RL Scheme in AuGMEnT  16
    2.1.2 The Feedforward Step  17
    2.1.3 The Attentional Feedback Step  19
    2.1.4 The Reward Prediction Error as Neuromodulator  20
    2.1.5 Learning Algorithm: a RPE-minimizer approach  21
    2.1.6 Analogies with Backpropagation  23
  2.2 Hierarchical Error Representation - HER  25
    2.2.1 HER Model: a Hierarchical Predictive Coding Framework  25
    2.2.2 The HER Level  25
      The Memory Gating  26
      The Prediction Error  27
    2.2.3 The Predictive Coding Dynamics in HER  30
    2.2.4 Learning Rules  32
      Learn Memory Gating  32
      Learn to Predict  33
  2.3 Differentiable Neural Computer - DNC  33
    2.3.1 The Neural Controller  34
    2.3.2 The Interface Vector  36
    2.3.3 Addressing the External Memory  37
      Content Associations  37
      Dynamic Memory Allocation  38
      Temporal Linkage  39
      Writing and Reading to the External Memory  41

3 Learning Across Tasks  43
  3.1 The Saccade-Antisaccade Task  44
    3.1.1 The Structure of the SAS Task  45
    3.1.2 Performance on the SAS Task  46
      Focus on the HER dynamics in the SAS task  49
  3.2 The 12-AX Task  50
    3.2.1 Description of the 12AX Task  51
    3.2.2 Simulation Results  52
      Discussion of HER mechanisms for the 12AX task  54
    3.2.3 Analysis of AuGMEnT performance in the 12AX task  56
  3.3 Image Classification: the Omniglot Task  58
    3.3.1 Introduction to the Omniglot Task  58
    3.3.2 Learning the Omniglot Task  59
    3.3.3 One-shot Learning of the DNC Model  61

4 Bio-plausible Development of MANN Models  65
  4.1 Increase Learning Power in AuGMEnT  65
    4.1.1 Leaky-AuGMEnT  66
    4.1.2 Deep AuGMEnT  68
      Depth Augmentation: Derivation and Consequences  68
      Biologically Plausible Alternatives to BP  71
    4.1.3 Hierarchical AuGMEnT  76
      Hierarchical Memory Organization  76
      Adding a Gating System to AuGMEnT  77
      Modifications in the Feedback Propagation  79
      Failure of Hierarchical AuGMEnT  80
  4.2 Bio-Plausible Modifications of the DNC Model  81
    4.2.1 Simplification of the DNC Addressing Scheme  81
    4.2.2 Potentiate One-shot Learning with the LRUA Scheme  82

5 Conclusions  87
  5.1 Achievements of the Project  87
  5.2 Possible Developments for Future Work  89

A Model Parameterization  91
  A.1 Table of Parameters: SAS Task  92
  A.2 Table of Parameters: 12AX Task  93
  A.3 Table of Parameters: Omniglot Task  94
  A.4 Table of Parameters: Variants of the AuGMEnT Model  95

B Programming Details  97

Bibliography  99


List of Figures

1.1 Biological and Artificial Neural Network  3
1.2 Memory-Augmented Neural Networks: Internal vs. External Memory  7

2.1 AuGMEnT model: feedforward and feedback steps  16
2.2 Structure of HER Level  26
2.3 The Multi-level HER Network  31
2.4 Scheme of DNC model  35

3.1 Trials in Saccade-Antisaccade Task  46
3.2 Comparison of learning performances on the saccade task  48
3.3 Detail of the HER matrices after SAS training  49
3.4 Simplified example of 12-AX trial  51
3.5 Simulation Results on the 12AX Task  53
3.6 Visualization of HER matrices after training on 12AX task  55
3.7 Detail of the superior prediction matrix in HER  56
3.8 AuGMEnT Performance on different variants of the 12AX Task  57
3.9 Dataset Construction in Omniglot Task  59
3.10 One-shot learning in the Omniglot task  60
3.11 Performance of DNC model on the Omniglot task  62

4.1 Deep and Hierarchical AuGMEnT  67
4.2 Performance of leaky AuGMEnT on the 12AX Task  68
4.3 Deep AuGMEnT Performance on 12AX Task  71
4.4 Biologically Plausible Variants of BackPropagation  73
4.5 Feedback Alignment in RBP  74
4.6 Learning of deep AuGMEnT trained with Random Backpropagation Methods  75
4.7 One-shot learning of R-DNC in the Omniglot task  83


List of Tables

3.1 Table of the SAS task  47
3.2 The 12AX task: table of key information  52
3.3 Summarizing the parameters for training of Omniglot task  58
3.4 Test statistics on the Omniglot Task  63

4.1 Test statistics of R-DNC model in the Omniglot Task  85

A.1 Parameterization of MANN models for SAS task  92
A.2 Model parametrization for 12AX Task  93
A.3 Summarizing the parameters for training of Omniglot task  94
A.4 Parameter settings for model parametrization of variants of AuGMEnT for the 12AX task  95


List of Abbreviations

AuGMEnT  Attention-Gated MEmory Tagging
AGREL  Attention-Gated REinforcement Learning
ANN  Artificial Neural Networks
BG  Basal Ganglia
BP  BackPropagation
CPT  Continuous Performance Task
dlPFC  dorsolateral PreFrontal Cortex
DNC  Differentiable Neural Computer
D-NTM  Dynamical Neural Turing Machine
FA  Feedback Alignment
FB  FeedBack
FF  FeedForward
FNN  Feedforward Neural Networks
HER  Hierarchical Error Representation
LIP  Lateral Intra-Parietal cortex
LRUA  Least Recently Used Access
LSTM  Long Short Term Memory
MANN  Memory-Augmented Neural Networks
mPFC  medial PreFrontal Cortex
MRBP  Mixed Random BackPropagation
NTM  Neural Turing Machine
PBWM  Prefrontal Cortex - Basal Ganglia - Working Memory
PFC  PreFrontal Cortex
PRO  Predicted Response Outcome
RBP  Random BackPropagation
RL  Reinforcement Learning
RL-NTM  Reinforcement Learning - Neural Turing Machine
RNN  Recurrent Neural Networks
RPE  Reward Prediction Error
SARSA  State-Action-Reward-State-Action
SAS  Saccade-AntiSaccade Task
SL  Supervised Learning
SRBP  Skipped Random BackPropagation
SPAUN  Semantic Pointer Architecture Unified Network
WM  Working Memory


Chapter 1

Introduction

Although the human brain is the central source of our thoughts and actions, there is still a great mystery around the internal processes that underlie many brain activities such as cognition, learning and emotions. Even though neuroscience has made significant progress in the last century, there is still a joint effort of neurobiologists, neurologists and psychologists to formulate new theories and models that explain how our nervous system works. Their final goal is to gain a better knowledge of brain processes and functions and, possibly, to understand the executive dysfunctions in neurological disorders such as attention-deficit/hyperactivity disorder or Parkinson's disease.

At the same time, mathematicians and computational scientists have taken inspiration from the brain's architecture to develop models which use simplified neural dynamics and learn from experience, as people normally do in their everyday life. Such models are called Artificial Neural Networks (ANNs): computational representations of assemblies of neurons that cooperate to predict events, retrieve memories or learn new behaviors. Born in 1943, neural networks developed slowly during the 20th century because of the lack of the computational power needed to achieve satisfactory performance on real problems. In the last decades, the availability of more powerful computational resources has given a boost to research on ANNs, and the models are currently involved in many fields of modern technology, like computer vision, speech recognition and social networks. Nowadays, taking advantage of advanced processors and supercomputers, the size of a neural network can reach millions of units and connections, and the development of ANNs is very promising for progress in artificial intelligence and robotics.

In the last years, the machine learning community has enriched ANNs by adding an external addressable memory, capable of storing and retrieving important information (Graves, Wayne, and Danihelka, 2014; Graves et al., 2016). The resulting models, called Memory-Augmented Neural Networks


(MANNs), are able to solve more complex tasks which may require long-term storage of relevant data or deep memory associations. As we will see in the next sections, memory dynamics had actually already been introduced in ANNs before (Hochreiter and Schmidhuber, 1997; O'Reilly and Frank, 2006), but memory always took the form of internal representations. In this second case the addressing mechanisms were quite poor and limited, but on the upside the structure of the network was generally founded on biological considerations. In any case, for the sake of simplicity, in the present work we will use the expression 'Memory-Augmented Neural Networks' to refer to networks with both internal and external memory.

The first part of this introduction is meant to provide the reader with a quick insight into biological and artificial neural networks; afterwards, the focus will move to MANNs, the real object of study of this thesis, presenting how memory can be included in neural network models and what the current achievements of the machine learning community are around this thriving research topic.
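To give a concrete flavour of what 'storing and retrieving relevant information' from an external addressable memory means, here is a minimal sketch of content-based reading, the basic idea behind the addressing mechanisms discussed later in this thesis. This is a generic illustration, not the DNC's full scheme; the function name, parameter values and example data are ours:

```python
import numpy as np

def content_read(memory, key, beta=5.0):
    """Content-based read from an external memory matrix (one stored
    pattern per row): score every row by cosine similarity to the query
    key, convert the scores into attention weights with a sharpened
    softmax, and return the weight-averaged row."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)
    weights = np.exp(beta * sims)
    weights /= weights.sum()        # soft read weights over memory rows
    return weights @ memory         # blended retrieved pattern

# Three stored patterns; a noisy query close to the first one.
memory = np.eye(3)
retrieved = content_read(memory, np.array([0.9, 0.1, 0.0]))
print(np.argmax(retrieved))  # → 0: mostly the first pattern comes back
```

Because every step (similarity, softmax, weighted sum) is differentiable, such a memory can be trained end-to-end by gradient descent together with the network that queries it.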

1.1 Biological Introduction to Neural Networks

Brain activity is the result of the propagation of electric signals along a series of interconnected cells that compose the nervous system, called neurons. Despite having various morphologies and functions, the structure of a neuron can generally be divided into three parts (as sketched in Figure 1.1 a): a) the soma, the cell body containing the nucleus, which can both receive and emit electric potentials; b) the dendrites, cytoplasmic extensions covered by synapses that receive signals from other neurons; and c) the axon, a cable of variable length covered by the so-called myelin sheaths, which transmits the output potential originated in the soma to the postsynaptic neurons. As a result, neural dynamics consist of a delicate balance between multiple incoming signals and the conditioned emission of action potentials called spikes.

More precisely, input current leads to an increase of the potential in the soma, which is followed by a decay in time. Since the external inputs arrive randomly from multiple pre-synaptic neurons, the activity of the neuron depends on their cumulative effect and the resulting dynamics are very noisy. However, when the total potential in the soma reaches a certain threshold (e.g. around


30-40 mV above the resting condition), the neuron emits a spike to the postsynaptic cells and its potential drops sharply, resetting to the initial resting potential of around −70 mV.

After an action potential is emitted at the soma, it travels along the axon and the axon terminals to the synapses, where neurotransmitters (such as glutamate or GABA) are released from the pre-synaptic terminal and captured by the receptors on the surface of the post-synaptic terminal. As a consequence, the effect of the input current depends on the size and on the activity of the specific synapse connecting the two neurons; for this reason we speak of synaptic conductance, a time-dependent quantity related to the availability of neurotransmitters and to receptor dynamics. In addition to these temporary fluctuations, synaptic strengths can also change in a process known in neurobiology as synaptic plasticity. In fact, external events, generally coming from new experiences or education, can trigger complex calcium-based chemical reactions in the brain that modify the size of the synapse and the number of receptors, so as to better modulate the input signal and adapt to the desired output.

Here we have just introduced the basic notions of electrophysiology in biological neural networks, presenting the main concepts of the 'leaky integrate-and-fire' and Hodgkin-Huxley models (Gerstner et al., 2014; Hodgkin and Huxley, 1952); for a more detailed analysis we refer the reader to the cited references.
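The 'leaky integrate-and-fire' picture described above — a somatic potential that integrates input current, decays toward rest, and resets after crossing threshold — can be sketched in a few lines of Python. This is a minimal illustration with made-up parameter values, not a model used later in this thesis:

```python
import numpy as np

def simulate_lif(input_current, dt=1.0, tau=20.0,
                 v_rest=-70.0, v_thresh=-35.0):
    """Leaky integrate-and-fire sketch: the somatic potential integrates
    the input current, 'leaks' back toward the resting potential, and
    emits a spike followed by a sharp reset once it crosses threshold.
    Voltages in mV, time in ms; parameter values are illustrative."""
    v = v_rest
    trace, spike_times = [], []
    for step, i_ext in enumerate(input_current):
        # leaky integration: pull toward rest plus external drive
        v += (dt / tau) * (-(v - v_rest) + i_ext)
        if v >= v_thresh:                  # threshold ~35 mV above rest
            spike_times.append(step * dt)  # record the spike time
            v = v_rest                     # reset to rest (-70 mV)
        trace.append(v)
    return np.array(trace), spike_times

# A constant drive strong enough to make the neuron fire repeatedly.
trace, spike_times = simulate_lif(np.full(200, 40.0))
print(f"{len(spike_times)} spikes in 200 ms")
```

With a noisy input current instead of a constant one, the irregular firing described in the text emerges from the same two lines of update logic.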

Figure 1.1. a) Structure of a (motor) neuron cell with soma, dendrites and axon (figure taken from (Structure of a typical nerve cell)). b) Example of the architecture of a deep ANN with four layers connected by weight matrices W0, W1 and W2.


1.2 Artificial Neural Networks

Artificial Neural Networks are a mathematical simplification of the biological assemblies of neurons described above. The networks are organized in a multi-layer structure where each layer corresponds to a set of neuronal units that share similar properties or functions. Analogously to their biological counterpart, artificial neurons collect inputs from multiple units and activate with a non-linear function of the input, called the activation function. As shown in the graph in Figure 1.1 b, external stimuli are encoded in an initial input layer, just as the possible responses are encoded in the final output layer; the intermediate layers are the hidden layers, and the number of hidden layers defines the depth of the network. If there are no hidden units, the input is directly mapped to the output and the network is called shallow; if there are one or more hidden layers, the network is considered deep and learning can reach higher orders of abstraction of the input. Hence the process is called deep learning, and this is how the brain is thought to process information. The different layers of the network are linked by weighted connections, collected in weight matrices (see W0, W1 and W2 in the above figure), that are used to propagate and modulate signals. The weight of a connection corresponds to its synaptic strength and can be either positive or negative, indicating respectively an excitatory or an inhibitory synapse. If the synaptic connections link only neurons belonging to subsequent layers and the signal flows in one direction from the input to the output, the architecture is an instance of a Feedforward Neural Network (FNN). In the case of more complex dynamics, where there are self-interactions, connections between neurons in the same layer or, more generally, closed cycles in the graph, the network is called a Recurrent Neural Network (RNN).

Typically, RNNs are more difficult to implement, but they can solve more complex tasks than FNNs, involving for instance language processing or pattern recognition. Similarly to the strength variations induced by synaptic plasticity in biological neural networks, the weights of the connections are not fixed but can change according to predefined dynamics of the network in a learning process. In the artificial model, the weights are updated to optimize some loss function L that measures the model error in a direct comparison between the output of the network and the given target. In particular, a weight
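As an illustration of how the signal is propagated through such weight matrices, the following sketch computes the forward pass of a small deep FNN with a sigmoid activation function; the layer sizes and random weights are assumptions made up for the example:

```python
import numpy as np

def sigmoid(x):
    # non-linear activation function, applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# hypothetical layer sizes: input layer, two hidden layers, output layer
sizes = [4, 5, 3, 2]
# weight matrices W0, W1, W2 linking consecutive layers
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    # propagate and modulate the signal from input to output layer
    a = x
    for Wi in W:
        a = sigmoid(Wi @ a)
    return a

y = forward(np.ones(4))
```

With no hidden layers (`sizes = [4, 2]`) the same loop reduces to the shallow case, where the input is mapped directly to the output.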


$w_{i,j} \in \mathbb{R}$ is modified according to the general gradient descent rule:

$$w^{t+1}_{i,j} = w^{t}_{i,j} - \beta \, \frac{\partial \mathcal{L}}{\partial w^{t}_{i,j}} \qquad (1.1)$$

where $t$ and $t+1$ are two consecutive time levels of the iterative descent and $\beta$ is the learning rate parameter. The value of $\beta$ typically lies in the $[0, 1]$ interval, but it has to be adequately tuned to achieve at the same time stability ($\beta$ not too high) and efficiency ($\beta$ not too small). In the case of deep networks, the error gradients have to be backpropagated from the output towards the input layer. As we will discuss in detail in the next chapters, a typical way in machine learning to propagate the error backwards is BackPropagation (BP), where the backward modulator signal is computed using the chain rule in a feedback process, assuming the same weights as in the feedforward step, i.e. $W^{BP}_i = W_i^T$.

In machine learning, the learning process in which the weights of the ANN are updated in a trial-and-error approach is called the training phase. During this process both the inputs and the related targets are provided to the system, which learns how to solve a task by iteratively adjusting the weight connections in order to return a response closer to the given target. Generally, the network can be trained with Supervised Learning (SL) or Reinforcement Learning (RL): in SL the loss function L to optimize depends on an error e that measures the discrepancy between the network output vector y and the provided target vector t; on the contrary, RL is based on a global reward signal that simply indicates whether or not the response coincides with the target. Since the error vector e in SL gives feedback to the model for each possible response (even the wrong ones), SL is faster and more efficient than RL, where the reward basically corresponds to a boolean measure that returns information only for the selected response. Nevertheless, RL is more biologically realistic, because the reward system derives from the feedback role of neuromodulator signals (like dopamine) released after an action is taken in a real environment.

Training is typically followed by a test phase, which allows one to mathematically analyze the performance of the trained ANN. In this second step, only the stimuli are given to the network and the learning dynamics are frozen, i.e. the weights are kept fixed. In general, it is possible to have an intermediate phase, called the validation phase, that is normally used to correctly parametrize the network. Finally, training should be tailored in such a way as to avoid two critical cases: overfitting and underfitting. The former occurs when the


model does not adapt well to new, previously unseen inputs (in this case the error in the test phase is typically higher than in the training phase); the latter, on the contrary, is characterized by such poor precision in the output that the model is not even capable of correctly explaining the training data.

1.3 Benefits and Limitations of Memory Augmentation

1.3.1 The Working Memory in the Brain

Storage and memory retrieval are complex functions of the brain that in many real cases are fundamental for learning difficult processes and new behaviors. In cognitive neuroscience, the memory system that employs mechanisms like content associations or short-term manipulation of information is named Working Memory (WM). Although memory dynamics involve several brain regions, like the hippocampus and the amygdala, WM is specifically associated with brain activity in the prefrontal cortex (PFC) and the basal ganglia (BG). In fact, the BG are thought to have a gating effect on the storage of information in the PFC through an inhibitory modulated system, such that only task-relevant data are stored without interference (Frank, Loughry, and O'Reilly, 2001; Frank, 2005). Furthermore, the PFC is highly involved in cognitive functions and in goal-oriented behaviors, having the ability to anticipate the values of actions in order to form expectations and detect surprising outcomes. In particular, more recent studies (Alexander and Brown, 2011) showed that the medial prefrontal cortex (mPFC) can learn outcome predictions in a changing environment by signaling the discrepancy between actual and expected outcomes. Moreover, the activity in the dorsolateral prefrontal cortex (dlPFC) is more connected to the registration of external stimuli in the brain and to their contextualization in a learnt environment (Alexander and Brown, 2015).

In conclusion, WM activity is the product of the joint action of sparse functional areas of the brain, which operate at different levels of abstraction to solve cognitive tasks involving short-term storage and the recall of associations.


1.3.2 MANNs: Internal vs. External Architectures

Analogously to the memory of a computer, standard neural networks can interact with a memory to improve their performance on difficult tasks, permitting the model to store data and detect cue associations. Memory-Augmented Neural Networks (MANNs) are the result of the coupling between a neural controller and a memory, which potentially increases the range of tasks that can be solved by simple ANNs, i.e. their learning power.

In general, memory can be embedded in a neural network model in two ways (see Figure 1.2): it can be either internal or external. In the former configuration (a), the memory consists of a layer of neurons enriched with specific dynamics that permit the maintenance of internal representations. For instance, this is the case of a well-known model, the Long-Short Term Memory (LSTM), which adopts a memory gating mechanism: each neural unit is equipped with gating variables used to decide when to store the new stimulus or erase the current memory content. In the second case (b), MANNs with external memory employ an addressable memory matrix, where relevant information can be written to and extracted from specific memory locations using write and read heads. The memory is defined 'external' because it is not strictly embedded in the processing pathway of the signal, having no direct connections with the input or

Figure 1.2. The structural difference between a MANN with internal memory (a) and one with external memory (b) lies in the location of the memory (yellow). In the left scheme, the internal memory is completely embedded in the processing pathway of the network; on the right, the external memory is addressed by the controller using write and read heads, respectively to write and to retrieve relevant memories. The neural controller (orange) consists of a hidden layer (either feedforward or recurrent) or of a combination of stacked layers.


the output of the network. The most famous examples of MANNs with external memory are the Neural Turing Machine (NTM) (Graves, Wayne, and Danihelka, 2014) and the Differentiable Neural Computer (DNC) (Graves et al., 2016). Although memory augmentation usually refers only to a neural network model that interfaces with an external memory, in the present work we call MANNs the neural networks with either type of memory structure.

Typically, MANNs with external memory are more effective at solving complex problems, because the addressing mechanism can be designed ad hoc to improve the performance of the model on difficult tasks. For instance, as we will see in Chapter 2, the DNC model combines a standard content-based mechanism, which measures the similarity between a target query and the memory locations, with more artificial temporal- and location-based criteria; these memory dynamics make the DNC outperform internal-memory models (like LSTM) and previous external-memory MANNs (like NTM) on tasks that demand long storage time scales or efficient transitions across memory. For instance, the DNC model can solve several types of tasks, including answering simple questions, finding shortest paths in graphs and other inferential problems. On the other side, the more complicated the addressing mechanism of the model, the more it is penalized in terms of biological plausibility. In fact, the complex combination of addressing mechanisms in the communication with the external memory does not reflect the biological dynamics, which are thought to be mainly content-based; internal-memory solutions, instead, are in general more biologically realizable, because the simplicity of the gating system does not imply strong assumptions on the memory mechanisms. Furthermore, a neural setting with addressable external memory normally implies a bigger computational cost, with longer training times, and learning may be limited by the memory capacity.
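As a rough illustration of the content-based criterion mentioned above, the following sketch computes read weights over a memory matrix from cosine similarities and returns the weighted read vector. This is a simplification, not the actual DNC implementation; the memory contents, key and sharpening strength are made up:

```python
import numpy as np

def content_read(memory, key, strength=5.0):
    """Content-based addressing sketch: softmax over cosine similarities.

    memory: (N, W) matrix of N memory locations of width W
    key:    (W,) query vector emitted by the controller
    """
    # cosine similarity between the key and every memory row
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sim = memory @ key / norms
    # sharpened softmax turns similarities into normalized read weights
    w = np.exp(strength * sim)
    w /= w.sum()
    # the read vector is the weight-averaged memory content
    return w @ memory, w

# made-up memory with three locations; the key matches location 0 best
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r, w = content_read(M, np.array([1.0, 0.0]))
```

The location- and temporal-based criteria of NTM and DNC would then be combined with these content weights, which is precisely the machinery that has no clear biological counterpart.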

At this point, it should be clear to the reader that MANNs combine issues from machine learning, cognitive neuroscience and neurobiology. As a result, the field of MANNs is explored by different scientific groups with different methods and goals: on the one hand, the machine learning community aims to develop a powerful network that performs well on a wide variety of tasks, regardless of the biological foundations of the model; on the other hand, the final objective of neuroscientists is to reproduce in a computational model the known neural dynamics in order to solve simple memory-based tasks in a controlled environment. As a consequence, the two approaches


explore the field in opposite directions and naturally achieve different results in terms of biological plausibility and learning performance, which are generally almost disjoint properties of a MANN. In the present project, we tried to move in a direction that maximizes both.

1.3.3 The Credit Assignment Problem

The common advantage of MANN models is the ability to store useful information in memory and retrieve data to modulate the output signal. But understanding which information is relevant for solving the task can be very complicated in many cases. In fact, one of the main problems in cognitive tasks with a memory demand is the so-called credit assignment problem. This issue arises in situations where the action to take depends on previous actions or stimuli that were presented several timesteps before. In such cases the training somehow has to take into account what happens during the history of the trial and learn which stimuli affect the performance and have to be stored in memory.

Normally, the credit assignment problem is subdivided into two components:

• the temporal credit assignment problem: how to understand the influence of previous responses or storing activities on the final performance, if the benefits of such actions are available only later in time (for instance at the end of a long trial)?

• the structural credit assignment problem: how to accord credit to the units that actually contribute the most to the final response, especially in the case of deep networks, where not all the nodes are directly connected with the output layer?

Taken alone, the latter issue is a standard problem in neural networks and is easily solvable in most situations. But when the temporal credit assignment problem interferes with the structural one, the solution can be very hard, because the weights have to be updated according to the whole history of the trial. The credit assignment problem is a typical difficulty in Reinforcement Learning problems, where the reward might be given only at the end of a long sequence of events. Examples of such critical cases will be presented in Chapter 3, introducing the Saccade-Antisaccade task (3.1) and the 12-AX task (3.2). In many cases, the problem is tackled by introducing eligibility traces, which


keep track of the synapses between neurons that co-activated for the selection of a certain action. The first effect of eligibility traces is that, when the neuromodulator is released, only the tracked synaptic weights are updated, which takes care of the structural problem. Furthermore, the eligibility traces are subject to an exponential decay in time, such that their tracking effect is maintained only for a short time. However, experimental findings showed that such decay dynamics normally have a decay constant of $\tau \simeq 10^{-1}$–$10^{0}$ s (Gerstner, 2017), which could be too short to solve the temporal credit assignment problem in the case of long trials, so other solutions have to be investigated.
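A minimal sketch of how an exponentially decaying eligibility trace gates a reward-modulated weight update follows; the decay constant, timestep, learning rate and event timings are illustrative assumptions:

```python
import math

# illustrative constants
tau = 0.5       # decay time constant in seconds (within the 0.1-1 s range cited)
dt = 0.1        # simulation timestep in seconds
beta = 0.05     # learning rate

w = 0.0         # synaptic weight
trace = 0.0     # eligibility trace of this synapse

for step in range(20):
    # the synapse is tagged when pre- and post-neuron co-activate
    # (assumed to happen only at the first step in this toy trial)
    co_active = 1.0 if step == 0 else 0.0
    trace += co_active
    # exponential decay of the trace with constant tau
    trace *= math.exp(-dt / tau)
    # reward-modulated update: only tagged synapses change when reward arrives
    reward = 1.0 if step == 10 else 0.0
    w += beta * reward * trace
```

Because the trace has decayed by a factor exp(−1 s / 0.5 s) ≈ 0.14 when the reward arrives one second later, the update is already strongly attenuated; with trial lengths of many seconds it would vanish, which is the limitation discussed above.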

1.3.4 A New Approach: Meta-Learning

Meta-learning is an approach to task solving that has received a lot of attention in recent years from the machine learning community and that can be addressed with MANN architectures. Literally meaning "learning to learn", meta-learning is a learning strategy that operates both within and across tasks: the network does not only learn how to solve a specific task (standard within-task learning), but also captures information on the way the task itself is structured and can exploit analogies across tasks to improve its performance. Thanks to their storage capability, MANN models can learn abstract representations of raw inputs and bind them with task-relevant associations in memory. For instance, in the context of image classification, the representation of the input image can be stored together with its label in a certain memory location, as done in (Santoro et al., 2016); as a result, at the next presentation of a sample from the same class, the correct label just has to be properly retrieved from the memory. The described procedure enables one-shot learning of never-before-seen cues and, more importantly, is not strictly tied to image classification, but can be applied to any classification task where the network has to select a label given any data structure. Therefore, MANNs offer a promising approach to meta-learning, provided that the memory dynamics are stable and selectively accessible. For this reason, MANNs with an internal-memory configuration are not adequate for meta-learning, while the models with external memory satisfy the requirements. Although meta-learning is not a central point of the present work, an example of one-shot learning with the procedure described
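The store-then-retrieve procedure described above can be sketched as a nearest-neighbor lookup over stored (representation, label) pairs; this is a deliberate simplification of the mechanism in (Santoro et al., 2016), and the class names and feature vectors are made up:

```python
import numpy as np

class OneShotMemory:
    """Toy associative memory: store representations with labels, retrieve by similarity."""

    def __init__(self):
        self.keys = []    # stored input representations
        self.labels = []  # labels bound to those representations

    def store(self, representation, label):
        # one-shot binding of a representation with its label
        self.keys.append(np.asarray(representation, dtype=float))
        self.labels.append(label)

    def retrieve(self, representation):
        # return the label bound to the most similar stored representation
        q = np.asarray(representation, dtype=float)
        sims = [k @ q / (np.linalg.norm(k) * np.linalg.norm(q)) for k in self.keys]
        return self.labels[int(np.argmax(sims))]

mem = OneShotMemory()
mem.store([1.0, 0.1], "class_A")   # single presentation of a new class
mem.store([0.0, 1.0], "class_B")
pred = mem.retrieve([0.9, 0.2])    # a new sample resembling class_A
```

After a single stored example per class, a new sample is classified by retrieval alone, with no further weight updates, which is the essence of one-shot learning with an external memory.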


above is shown through the Omniglot task in Chapter 4.

1.4 Overview of MANNs

Before moving into the details of the implemented models and the results of our work, it is important to understand the state of the art of MANNs. Here we present a brief overview of the achievements reached so far in this dynamic topic, introducing the reader to the main models and results.

One of the first neural networks enriched with memory activity was the Long-Short Term Memory (LSTM). Introduced by Hochreiter and Schmidhuber (Hochreiter and Schmidhuber, 1997), the LSTM is a recurrent neural network trained with backpropagation, where each node is equipped with gate variables that regulate the information entering and leaving each cell. More precisely, the input gate quantifies the amount of the cell input to be stored in the cell, and the output gate defines the amount of information that is emitted from the node. In this way, each single cell can learn to store and maintain internal representations that can be retrieved later in time. A few years later, Gers improved the model (Gers and Schmidhuber, 2001) by introducing the forget gate in order to optimize memory storing and avoid memory interference: it must be noticed that the forget gate regulates the amount of pre-activation cell state to erase, while the output gate just indicates the amount of output signal that is passed to the rest of the network. Later, a variant of the LSTM was also proposed, called the Gated Recurrent Unit or GRU (Cho et al., 2014), which has just two gate variables (update and reset gates) but performs as well as the standard LSTM.

The LSTM network performs well on cognitive tasks affected by credit assignment problems, and the model is still widely used in many applications or as a subcomponent of other networks. Nevertheless, it is hard to believe that the mechanisms described in the LSTM are actually implemented in the brain. For this reason, O'Reilly & Frank reinterpreted the gating mechanism of the LSTM network under biologically-based considerations (O'Reilly and Frank, 2006). The product of the study is the Prefrontal cortex and Basal ganglia model of Working Memory (PBWM), which is largely inspired by the dopaminergic interactions between the basal ganglia (BG) and the prefrontal cortex (PFC). In fact, the computational model simulates a complex system of parallel signaling that
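The gating idea can be sketched for a single scalar cell; this is a simplification of the full LSTM equations, with fixed gate pre-activations standing in for the learned, input-dependent ones:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(c, x, i_gate, f_gate, o_gate):
    """One step of a scalar LSTM-style cell.

    c: current cell state, x: candidate input,
    i_gate/f_gate/o_gate: pre-activations of input, forget and output gates
    (in the real model these are computed from the input and the hidden state).
    """
    i = sigmoid(i_gate)   # how much of the new input to store
    f = sigmoid(f_gate)   # how much of the old state to keep (Gers' forget gate)
    o = sigmoid(o_gate)   # how much of the state to emit to the rest of the network
    c_new = f * c + i * math.tanh(x)
    h = o * math.tanh(c_new)
    return c_new, h

# open input gate, closed forget gate: the cell overwrites its state with the stimulus
c, h = lstm_cell_step(c=0.8, x=2.0, i_gate=10.0, f_gate=-10.0, o_gate=10.0)
```

With the forget gate saturated open instead (`f_gate=10.0`, `i_gate=-10.0`), the same step would leave the state essentially untouched, which is how a cell maintains a representation across many timesteps.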


involves many cortical and subcortical areas, including for instance the activity of the thalamus, the amygdala and the dorsal striatum. Without going into the neurobiological details, the PBWM model presents an actor-critic architecture, where the actor corresponds to the BG system that updates the PFC representations via dynamic gating, while the dopaminergic signal works as a critic that modulates the BG gating activity. The model demonstrated significant learning abilities across a range of tasks that require WM activity and, more importantly, for the first time a biologically-based MANN coming from an RL context had performance comparable with biologically implausible models trained by error backpropagation.

In the last years, several neuroscientists have applied the same biologically-oriented approach of O'Reilly & Frank's work to develop MANN models based on literature or experimental findings, but which, differently from the complex PBWM model, maintain a simple network architecture. In this context, in Chapters 2.1 and 2.2 we will discuss in detail the works by Roelfsema (Roelfsema and Ooyen, 2005; Rombouts, Bohte, and Roelfsema, 2015) and by Alexander & Brown (Alexander and Brown, 2015; Alexander and Brown, 2016); here we just provide the reader with their main achievements. The former study constructed a reinforcement learning model that combines synaptic plasticity and attention feedback in a coherent framework. The main advantage of the resulting model, called the Attention-Gated MEmory Tagging model or AuGMEnT, is to have interpreted the backpropagation process in a biologically plausible way, as a joint action of neuromodulation and attentional tagging. The latter work by Alexander and Brown takes more inspiration from neuroanatomy than from neurophysiology. In fact, they proposed the Hierarchical Error Representation model (HER), which reproduces the theoretical organization of the frontal lobes of the brain in a hierarchical predictive coding framework. Each level of the hierarchical structure is equipped with an internal memory with an independent gating system that is trained by backpropagation. As we will see, the learning ability of HER is higher than that of AuGMEnT, because its hierarchical architecture is able to store multiple items without interference and its memory capacity is (theoretically) flexible.


At the same time, the need for stable storage and easily accessible information caused a departure from the standard configuration of internal-memory networks, leading to the development of MANNs with external memory. The first important example of this second approach is the Neural Turing Machine (NTM) (Graves, Wayne, and Danihelka, 2014) from Google DeepMind, an entirely differentiable model capable of storing and retrieving task-relevant information with an efficient addressing scheme. The interaction with the memory is mediated via write and read heads that update and retrieve data from a memory matrix following two criteria with complementary facilities: content-based associations, which concern the similarity with memory locations, and a location-based strategy, which permits jumps and iterations across memory. The proposed model achieved promising performance on a large number of sequence-based tasks, and many following studies suggested variations of the base model to improve its performance (Gulcehre et al., 2016) or biological plausibility (Zaremba and Sutskever, 2015). More recently, Google DeepMind upgraded its study on MANNs with external memory by proposing the Differentiable Neural Computer or DNC (Graves et al., 2016). While it has a similar structure to the NTM, the biggest novelty of the DNC consists of an addressing mechanism that includes an additional temporal-based criterion. In fact, the temporal linkage mechanism allows the network to record the transitions of the write heads through the memory locations, and consequently the read heads are able to retrieve sequences of instructions from the memory in order. The DNC model, which is described in Chapter 2.3, outperforms previous networks like LSTM and NTM on a large variety of tasks and has been tested on complex problems, including block puzzles and graph-related questions on family trees or on the London transport network.

An extremely innovative alternative approach is the one proposed in Eliasmith's study (Eliasmith et al., 2012), which tries to reproduce brain dynamics in a large-scale model that captures many aspects of the known neuroanatomy and neurophysiology, including the functionality associated with working memory. The proposed model presents a rich architecture that is globally composed of 2.5 million spiking neurons, divided into components that communicate with each other using neural representations called 'semantic pointers'. For this reason the model is called the Semantic Pointer Architecture Unified Network or SPAUN. This is not the only large-scale neural model developed in these years: the Blue Brain


Project (Markram et al., 2015) and the Cognitive Computational Project (Ananthanarayanan and Modha, 2007) also work on large-scale simulations of the brain, with more neuroanatomical and electrophysiological detail than Eliasmith's model. However, SPAUN is currently the only large-scale model that links the computational reconstruction of brain dynamics with real behavioral functions: in fact, the model has been successfully tested on a variety of tasks similar to the ones used to test the MANNs described above, like copying, image recognition or question answering.

The research field of MANNs has expanded impressively in the last years, extending the learning capability of memory-free neural networks in sensory processing and sequence learning. Furthermore, it is likely that in the near future memory-augmented neural networks will be at the core of new technologies in artificial intelligence and robotics.

The remainder of the thesis is structured as follows: in Chapter 2 we introduce the MANN models, describing mathematically their dynamics, their learning rules and the different ways they interface with memory; Chapter 3 presents the achievements of those models on the same tasks, to compare their performance and highlight their differences in terms of learning power; in Chapter 4 we propose developments of pre-existing models to improve their biological plausibility and task-learnability; eventually, the conclusions of the work are reported in Chapter 5, together with some further ideas for future developments.


Chapter 2

MANN Models

Besides having internal or external memory, Memory-Augmented Neural Network models can differ in many other features, ranging from the biological connection with the neural system to the complexity of the memory dynamics. In the present chapter we are going to present three MANN models with different achievements in terms of biological plausibility and learning performance:

A. the Attention-Gated MEmory Tagging model (AuGMEnT), which has strong biological foundations, especially in the link between its learning rule and standard synaptic plasticity mechanisms; on the other side, its internal memory is subject to interference and thus can be used only to solve tasks with a sufficiently small memory demand.

B. the Hierarchical Error Representation (HER) network, whose internal-memory architecture is inspired by the known anatomy of the prefrontal cortex. The memory dynamics are based on a gating mechanism at each level of the hierarchical structure, which makes HER a suitable model to handle, without interference, tasks with a hierarchy of dependencies between cues. However, HER is not adequate for more difficult tasks that require a higher level of abstraction or the management of complex data structures.

C. the Differentiable Neural Computer (DNC), currently considered the top-performing MANN model, capable of solving tasks thanks to efficient addressing of an external memory. The model supports long-time storage, complex data structures and variable manipulation, but it is penalized by a poor biological background. The DNC is the only MANN model with external memory presented in detail in this thesis, but in the next chapters the reader will be introduced to other solutions for external memory addressing.


2.1 Attention-Gated MEmory Tagging - AuGMEnT

2.1.1 Introduction to the RL Scheme in AuGMEnT

The first memory-augmented neural network we present in this work is the Attention-Gated MEmory Tagging model or AuGMEnT (Rombouts, Bohte, and Roelfsema, 2015). This model has been chosen because it provides a biological background to a reinforcement learning mechanism. The new reinforcement learning scheme is the key point of AuGMEnT's innovation, since it combines two synaptic plasticity mechanisms: attentional feedback and neuromodulation. As we will see in the next paragraphs, the attention system is based on a feedback signal which permits learning across time, while the neuromodulator controls the sign of the synaptic potentiation in accordance with the Reward-Prediction Error (RPE). For each stimulus, learning in AuGMEnT is divided into two phases (Figure 2.1): a) a feedforward step, where the input stimulus is processed by the network to define the output of the model; b) a feedback step, which follows the selection of the response and updates the synaptic tags and the weights of the network in an attentional RL framework.

Figure 2.1. The structure of the AuGMEnT network presents three different types of layers: sensory units, association units and activity units. Each iteration of the learning process consists of a forward step (left) and a backward step (right). In the forward step, the information from the current stimulus (left branch) and the memory (right branch) is used to compute the Q-values in the activity layer (bottom). Once the response is selected and the reward is given, the backward step keeps track of the connections which contribute to the selection of the response, in order to place the synaptic tags (green) and traces (purple).


2.1.2 The Feedforward Step

The AuGMEnT network is composed of three fully-connected layers of units, as shown in the left panel of Figure 2.1. Basically, the process can be divided into three main steps: a) the input is first encoded in the sensory layer using a standard one-hot representation; then b) it is processed in the association layer to abstract the cue or to find temporal correlations; c) finally, the signal is propagated into the third layer, where each unit corresponds to an action, and the action that maximizes the expected reward is selected. Actually, there are two pathways which end in the final activity layer: the regular branch and the memory branch. They cover different functions in the sensory processing, and the information from both branches is then used for the response selection.

The regular branch is a standard feedforward network with one hidden layer. The instantaneous stimulus $\mathbf{s}$ lives in the $S$-dimensional space $\mathcal{I}_S$, where $S$ is the number of possible stimuli in the task and $\mathcal{I}_S$ is the space of all the vectors that can be written in the one-hot representation:

$$\mathcal{I}_S = \left\{ \mathbf{v} \in \{0,1\}^S \;\middle|\; \exists\, i : v_i = 1 \text{ and } v_j = 0 \;\forall j \neq i \right\} \tag{2.1}$$

The current stimulus is connected to the hidden units $\mathbf{y}^R$ (called regular units) via a set of modifiable synaptic weights collected in the weight matrix $\mathbf{V}^R$:

$$\mathbf{inp}^R_t = \mathbf{s}_t \mathbf{V}^R_t \qquad \mathbf{y}^R_t = \sigma\!\left(\mathbf{inp}^R_t\right) \tag{2.2}$$

where $\sigma$ is the sigmoidal function $\sigma(x) = \dfrac{1}{1+\exp(-x)}$.
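To make the forward step concrete, here is a minimal NumPy sketch of the regular branch of equation (2.2); the layer sizes and the weight initialization are arbitrary choices for the example, not taken from the original paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regular_branch(s, V_R):
    """Forward step of the regular branch, eq. (2.2): y^R = sigma(s V^R)."""
    inp_R = s @ V_R            # total input to the regular units
    return sigmoid(inp_R)

# one-hot stimulus among S = 4 possible cues (sizes chosen for the example)
S, n_hidden = 4, 3
s = np.zeros(S); s[1] = 1.0
V_R = np.random.default_rng(0).normal(scale=0.1, size=(S, n_hidden))
y_R = regular_branch(s, V_R)   # activity of the regular units
```

Since $\mathbf{s}$ is one-hot, the matrix product simply selects the row of $\mathbf{V}^R$ associated with the presented cue.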

On the other side, the dynamics that regulate the memory branch are slightly different: in this case, the transitions across stimuli are considered instead of the stimuli themselves. The sensory layer of the memory branch consists of a set of transient units that encode the activation of a stimulus (ON units) as well as its deactivation, i.e. the stimulus being turned off (OFF units). As a consequence, the transient stimulus $\mathbf{s}^{trans} \in \{0,1\}^{2S}$ is simply the concatenation of the ON stimulus $\mathbf{s}^+ \in \{0,1\}^S$ and the OFF stimulus $\mathbf{s}^- \in \{0,1\}^S$, defined respectively as the positive and the negative part of the difference between two consecutive instantaneous stimuli:

$$\mathbf{s}^+ = [\mathbf{s}_t - \mathbf{s}_{t-1}]_+ \qquad \mathbf{s}^- = -[\mathbf{s}_t - \mathbf{s}_{t-1}]_- \qquad \mathbf{s}^{trans} = [\mathbf{s}^+, \mathbf{s}^-] \tag{2.3}$$
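The ON/OFF encoding of equation (2.3) can be sketched as follows (the function name is ours; NumPy is assumed):

```python
import numpy as np

def transient_stimulus(s_t, s_prev):
    """ON/OFF transient encoding of eq. (2.3): s_trans = [s_plus, s_minus]."""
    diff = s_t - s_prev
    s_plus = np.clip(diff, 0, None)     # [s_t - s_{t-1}]_+ : stimulus just turned on
    s_minus = np.clip(-diff, 0, None)   # -[s_t - s_{t-1}]_- : stimulus just turned off
    return np.concatenate([s_plus, s_minus])

# cue at index 0 is replaced by the cue at index 1
s_prev = np.array([1.0, 0.0, 0.0])
s_t = np.array([0.0, 1.0, 0.0])
s_trans = transient_stimulus(s_t, s_prev)   # ON unit 1 and OFF unit 0 are active
```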


This transient representation can be helpful for tasks where multiple stimuli are presented to the agent for different lengths of time, or when the simultaneous activation of one stimulus and deactivation of the previous one is crucial for solving the task.

Moreover, the association layer of the memory branch possesses memory units that have to maintain task-relevant information through time. In order to do that, the stimuli given throughout each trial are cumulated in the memory layer, as follows:

$$\mathbf{inp}^M_t = \mathbf{inp}^M_{t-1} + \mathbf{s}^{trans}_t \mathbf{V}^M_t \tag{2.4}$$

The hidden states of the memory layer are then simply obtained by applying the sigmoidal function to the total input:

$$\mathbf{y}^M_t = \sigma\!\left(\mathbf{inp}^M_t\right) \tag{2.5}$$

It must be noticed that, for the correct working of the AuGMEnT model, the states of the memory units have to be reset to 0 at the end of each trial.
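A possible sketch of the cumulative memory units of equations (2.4)-(2.5), including the trial-end reset; the class name and the sizes are illustrative choices of ours:

```python
import numpy as np

class MemoryBranch:
    """Cumulative memory units of eqs. (2.4)-(2.5); weights are kept fixed here."""
    def __init__(self, n_trans, n_mem, seed=0):
        rng = np.random.default_rng(seed)
        self.V_M = rng.normal(scale=0.1, size=(n_trans, n_mem))
        self.inp_M = np.zeros(n_mem)

    def step(self, s_trans):
        self.inp_M = self.inp_M + s_trans @ self.V_M  # eq. (2.4): cumulate inputs
        return 1.0 / (1.0 + np.exp(-self.inp_M))      # eq. (2.5): y^M = sigma(inp^M)

    def reset(self):
        self.inp_M[:] = 0.0                           # reset at the end of each trial
```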

Both branches converge into the activity layer, where each possible response of the model is associated with an output unit that represents a reward-based measure called Q-value. Q-values are formally defined as the expected future discounted reward conditioned on stimulus $s_t$ and action $a_t$, that is:

$$Q(s_t, a_t) = \mathbb{E}\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau+1} \;\middle|\; s = s_t,\, a = a_t \right] \tag{2.6}$$

where $\gamma \in [0,1]$ is a discount factor. Numerically, the vector $\mathbf{q}$ that approximates the Q-values is obtained by linearly combining the hidden states from the regular and the memory branches:

$$\mathbf{q} = \mathbf{y}^R \mathbf{W}^R + \mathbf{y}^M \mathbf{W}^M \tag{2.7}$$

where $\mathbf{W}^R$ and $\mathbf{W}^M$ are the matrices of synaptic weights which project into the activity layer. Finally, the Q-values of the different actions participate in a winner-takes-all competition to select the response of the model. The selection policy $\pi$ is a combination of a greedy choice $\pi_g$ and a stochastic criterion $\pi_s$: most of the time the greedy solution $\pi_g$ is adopted, simply by taking the action $a$ related to the maximum Q-value $q_a$; otherwise, the stochastic selection $\pi_s$


explores all the possible actions according to the probabilities obtained by applying the softmax transformation to vector $\mathbf{q}$:

$$p_a = \frac{\exp(q_a)}{\sum_k \exp(q_k)} \tag{2.8}$$

So, the complete formulation of the so-called $\varepsilon$-greedy policy is:

$$\pi(f) = \begin{cases} \pi_g, & \text{if } f \leq 1-\varepsilon \\ \pi_s, & \text{if } 1-\varepsilon < f \leq 1 \end{cases} \tag{2.9}$$

where $f$ is drawn uniformly in $[0,1]$ and $\varepsilon \in [0,1]$ is typically a small parameter (equal to 0.025 or 0.05) which governs the frequency of response exploration introduced in the model.
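Equations (2.7)-(2.9) together define the response selection; a minimal sketch, with illustrative weights and the small $\varepsilon$ value mentioned above:

```python
import numpy as np

def select_action(y_R, y_M, W_R, W_M, eps=0.025, rng=None):
    """Q-values (eq. 2.7) followed by the epsilon-greedy policy (eqs. 2.8-2.9)."""
    if rng is None:
        rng = np.random.default_rng()
    q = y_R @ W_R + y_M @ W_M          # linear combination of both branches
    if rng.random() <= 1.0 - eps:      # greedy branch pi_g
        return int(np.argmax(q)), q
    p = np.exp(q - q.max())            # softmax probabilities of eq. (2.8)
    p /= p.sum()
    return int(rng.choice(len(q), p=p)), q   # exploratory branch pi_s
```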

2.1.3 The Attentional Feedback Step

After each response selection, all the active synapses that cooperate in the definition of the selected Q-value $q_a$ are tagged in an attentional feedback step. Tagging acts as a filter on the weights that have to be updated with respect to the reward related to the taken action. Since synaptic tags decay continuously in time, the model keeps memory of previous responses in the synaptic connections across multiple timesteps, and the tag persistence can be used to detect temporal correlations between subsequent stimuli.

Starting from the final activity layer, the first taggings involve the connections between the intermediate hidden units and the action units. Considering a generic unit $j$ from the associative layer and a unit $k$ from the response layer, the tagging is a simple form of Hebbian learning, where the tag strengths depend both on the pre-synaptic hidden state $y_j$ and the post-synaptic action $z_k$:

$$\Delta Tag_{jk} = -\alpha\, Tag_{jk} + y_j z_k \tag{2.10}$$

where $\Delta$ indicates the variation of the synaptic tags across subsequent time levels ($\Delta Tag = Tag_{t+1} - Tag_t$), $\alpha \in [0,1]$ is a decay parameter and $z_k$ is a binary variable equal to 1 if action $k$ has been selected, 0 otherwise. In this case, we consider a one-hot vector $\mathbf{z} \in \mathcal{I}_A$ instead of the response vector $\mathbf{q}$ defined above, because we are just interested in the selected response and not in its actual Q-value.
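Equation (2.10) in NumPy form (a sketch; $\alpha = 0.2$ is a placeholder value, not one used in the experiments):

```python
import numpy as np

def update_output_tags(tags, y, z, alpha=0.2):
    """Hebbian tag update of eq. (2.10): Tag_jk += -alpha*Tag_jk + y_j z_k."""
    return tags - alpha * tags + np.outer(y, z)
```

The outer product only strengthens the column of the selected action, since $\mathbf{z}$ is one-hot; all other tags simply decay.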


Proceeding backwards towards the sensory layer, tagging becomes a little more complex because the error information has to be backpropagated from the activity layer through the hidden layer. This is done using feedback connections $\mathbf{W}'$, which follow the same update rule (2.10) as their feedforward counterpart $\mathbf{W} = [\mathbf{W}^R, \mathbf{W}^M]$, so that the units that contributed the most to the definition of the selected action also receive strong feedback. In this way, even if the initializations of $\mathbf{W}$ and $\mathbf{W}'$ are different, their strengths become similar during learning, as confirmed by neurophysiological findings (Mao et al., 2011). Moreover, as in the feedforward phase, the synaptic tagging in the feedback step is also diversified across the two branches. Considering a unit $i$ from the instantaneous sensory layer and a unit $j$ from the associative regular layer, the tags are updated according to the following rule:

$$\Delta Tag^R_{ij} = -\alpha\, Tag^R_{ij} + s_i\, \sigma'\!\left(inp^R_j\right) W'_{aj} \tag{2.11}$$

where $W'_{aj}$ is a component of the row of matrix $\mathbf{W}'$ related to the sole selected action $a$, and $\sigma'(inp^R_j) = y^R_j (1 - y^R_j)$ is the derivative of the sigmoidal activation function, which indicates the influence of the total input $inp^R_j$ on the activity of the post-synaptic cell $j$.

In the case of the memory branch, the update rule is slightly modified to take into account the accumulation of information in the memory cells of (2.5):

$$\Delta Tag^M_{ij} = -\alpha\, Tag^M_{ij} + sTrace_{ij}\, \sigma'\!\left(inp^M_j\right) W'_{aj} \tag{2.12}$$

with the synaptic trace updated as $\Delta sTrace_{ij} = s_i \;\forall j$ at each cue presentation and reset at the end of each trial. The derivation of equation (2.12) is reported in paragraph 2.1.5; here we anticipate that the equation is valid provided that the learning dynamics of matrix $\mathbf{V}^M$ are sufficiently slow, i.e. the learning rate is small.
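Equation (2.12) can be sketched analogously; since $sTrace_{ij}$ is identical across $j$, we store the trace as a per-input vector (a simplification of ours):

```python
import numpy as np

def update_memory_tags(tags, s_trace, y_M, w_fb_a, alpha=0.2):
    """Memory-branch tag update of eq. (2.12).
    s_trace: per-input cumulated trace (identical across j, stored as a vector);
    w_fb_a:  feedback weights W'_aj of the selected action a."""
    d_sigma = y_M * (1.0 - y_M)                    # derivative of the sigmoid
    return tags - alpha * tags + np.outer(s_trace, d_sigma * w_fb_a)
```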

2.1.4 The Reward Prediction Error as Neuromodulator

Equations (2.10), (2.11) and (2.12) describe the attentional feedback mechanism, where synaptic tags are used to link the contribution of each synapse with the action selected in the feedforward step. In addition to this synaptic mechanism, a neuromodulatory signal (biologically equivalent to the dopamine signal DA) is released homogeneously in the network after a


certain action is chosen, in order to connect the goodness of that action with the derived reward (present or future). In particular, the neuromodulator $\delta$ is defined analogously to the RPE, computed as:

$$\delta_t = \left(r_t + \gamma\, q_{a'}(t)\right) - q_a(t-1) \tag{2.13}$$

where $a$ is the response selected at time $t-1$ (associated with reward $r_t$) and $a'$ is the one chosen at time $t$. The formula derives from a standard reinforcement learning rule called SARSA, which compares the estimated expected future rewards at two consecutive time levels. In this way, $\delta_t$ is positive when the reward (together with the future expected reward) is actually bigger than what was predicted at time $t-1$, and is negative otherwise. As a result, the sign of the neuromodulatory signal is crucial to indicate in which direction the weight connections have to be corrected to obtain more reliable reward predictions.

Actually, for reasons that will become clearer in the next paragraph, the learning scheme adopted in AuGMEnT is SARSA($\lambda$), a more efficient variant of SARSA that includes the learning effect in time of eligibility traces with decay constant $\lambda$.

2.1.5 Learning Algorithm: an RPE-minimizer approach

In conclusion, the final update rules for the weights are simply the combination of the effects of attentional feedback and neuromodulation, modulated by a learning parameter $\beta$:

$$\Delta V_{ij} = \beta\, \delta_t\, Tag_{ij} \qquad \Delta W_{jk} = \beta\, \delta_t\, Tag_{jk} \tag{2.14}$$
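A minimal sketch of equations (2.13)-(2.14); the parameter values are placeholders, not those used in the thesis experiments:

```python
def rpe(r_t, q_next, q_prev, gamma=0.9):
    """Reward-prediction error of eq. (2.13): delta = (r + gamma*q_a'(t)) - q_a(t-1)."""
    return (r_t + gamma * q_next) - q_prev

def update_weights(W, tags, delta, beta=0.15):
    """Weight update of eq. (2.14): neuromodulator delta times synaptic tags."""
    return W + beta * delta * tags
```

For example, a reward of 1.0 with `q_next = 0.5` and `q_prev = 1.2` gives a positive surprise `delta = 0.25`, which potentiates exactly the tagged synapses.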

One of the key points of the AuGMEnT model is that the weight update rules in (2.14) are defined in such a way that the network dynamics minimize the RPE, in order to have reliable predictions of the Q-values. Indeed, it can be shown that equations (2.10), (2.11) and (2.12), which regulate the formation of the synaptic tags and traces, are chosen to move in the direction opposite to the gradient $\frac{\partial E}{\partial w}$ with respect to a generic weight $w$, where $E$ is the squared prediction error:

$$E(q_a(t-1)) = \frac{1}{2}\left( [r_t + \gamma\, q_{a'}(t)] - q_a(t-1) \right)^2 \tag{2.15}$$

For the sake of completeness, and since this procedure will also be recalled in chapter 4, here we report the derivation of equations (2.12)-(2.14) (that is the


most complex case), taken from (Rombouts, Bohte, and Roelfsema, 2015), to prove formally that the synaptic weights $V^M_{ij}$ are modified following the gradient descent on $E_{q_a}$ (abbreviation for $E(q_a(t-1))$).

Proof. We want to show that

$$\Delta V^M_{ij} = \beta\, \delta_t\, Tag^M_{ij} \propto -\frac{\partial E_{q_a}}{\partial V^M_{ij}} \tag{2.16}$$

For simplicity, here we prove (2.16) neglecting the decay of $Tag_{ij}$ (i.e. $\alpha = 1$), so that $Tag^M_{ij} = sTrace_{ij}\, \sigma'\!\left(inp^M_j\right) W'_{aj}$. However, the result also holds for $\alpha \neq 1$.

Since by construction $\delta_t = -\frac{\partial E_{q_a}}{\partial q_a(t-1)}$, the right-hand side of equation (2.16) can be written as:

$$-\frac{\partial E_{q_a}}{\partial V^M_{ij}} = -\frac{\partial E_{q_a}}{\partial q_a(t-1)} \frac{\partial q_a(t-1)}{\partial V^M_{ij}} = \delta_t\, \frac{\partial q_a(t-1)}{\partial V^M_{ij}}$$

It remains to show that $\frac{\partial q_a(t-1)}{\partial V^M_{ij}} = Tag^M_{ij}$. In order to do this, we apply the chain rule as follows:

$$\frac{\partial q_a(t-1)}{\partial V^M_{ij}} = \frac{\partial q_a(t-1)}{\partial y^M_j} \frac{\partial y^M_j}{\partial inp^M_j} \frac{\partial inp^M_j}{\partial V^M_{ij}}$$

Now we focus on each derivative separately. By the definition of the Q-value in (2.7), the first derivative is $\partial q_a / \partial y^M_j = W_{ja}$. However, in the feedback step the value $W_{ja}$ is approximated by $W'_{aj}$ (which indicates by construction the amount of feedback that unit $j$ receives from the winning action $a$) because, as discussed before, they become similar during learning. So we have:

$$\frac{\partial q_a(t-1)}{\partial y^M_j} = W'_{aj} \tag{2.17}$$

From equation (2.5) we immediately derive the expression for the second derivative:

$$\frac{\partial y^M_j}{\partial inp^M_j} = \sigma'(inp^M_j) = y^M_j (1 - y^M_j) \tag{2.18}$$

Finally, using equation (2.4) we can write:

$$inp^M_j(t) = \sum_i V^M_{ij}(t)\, s^{trans}_i(t) + \sum_{\tau=t_0}^{t-1} \sum_i V^M_{ij}(\tau)\, s^{trans}_i(\tau) \overset{(*)}{\sim} \sum_i V^M_{ij}(t) \sum_{\tau=t_0}^{t} s^{trans}_i(\tau)$$

where $t_0$ indicates the starting time of the trial and the approximation $(*)$ follows from $V^M_{ij}(\tau) = V^M_{ij}(t)\ \forall\, t_0 \leq \tau < t$, which is valid only in the case of slow learning dynamics, i.e. a small learning rate. As a consequence, we have:

$$\frac{\partial inp^M_j(t-1)}{\partial V^M_{ij}(t-1)} \sim \sum_{\tau=t_0}^{t-1} s^{trans}_i(\tau) = sTrace_{ij}(t-1) \tag{2.19}$$

Binding together the different components (2.17)-(2.18)-(2.19), we eventually obtain the desired result:

$$\Delta V^M_{ij} \propto \delta_t\, sTrace_{ij}\, y^M_j (1 - y^M_j)\, W'_{aj} = \delta_t\, Tag^M_{ij} \tag{2.20}$$

Analogously, we can show that the update dynamics of matrices $\mathbf{V}^R$, $\mathbf{W}^R$ and $\mathbf{W}^M$ also favor the minimization of the RPE-based functional $E_{q_a}$. For a detailed explanation of the proofs in the other cases, we refer the reader to the original paper (Rombouts, Bohte, and Roelfsema, 2015). It is important to notice that assumption $(*)$ leads to the need of an upper bound on the learning rate, in order to avoid instability issues in the simulation.

As anticipated in the proof, the synaptic tag decay controlled by parameter $\alpha$ does not invalidate the relation above. On the contrary, it can be shown that the continuous decay speeds up the gradient descent. This is due to the fact that decaying tags work as the eligibility trace $\lambda$ in SARSA($\lambda$). In fact, taking a tag decay parameter such that $\alpha = 1 - \lambda\gamma$, the tags follow the exponential decay $Tag(t+1) = \lambda\gamma\, Tag(t)$, as described in the learning dynamics of SARSA($\lambda$).

2.1.6 Analogies with Backpropagation

The learning mechanism of AuGMEnT described above is particularly fascinating because the update rules in equations (2.14) are the explicit expressions of biological phenomena linked to attention and synaptic plasticity, without the need of strong assumptions. More importantly, as can be seen in the proof above, during the feedback step the downstream error is propagated upstream using the chain rule, in a fashion similar to backpropagation (BP), the most popular learning algorithm for differentiable ANNs, which is


often criticized for the lack of biological credibility.

In fact, although backpropagation is very efficient, there are many biologically grounded issues that question the possibility that BP is really implemented by neural circuits in our brain. The main critiques of the BP algorithm involve the symmetry constraint on the connectivity patterns and the violation of the locality condition. The former problem is linked to the unbiological requirement in BP of having equal forward and feedback connections ($\mathbf{W}' = \mathbf{W}$). This assumption corresponds to asking the neurons to transfer the error signal back along the axons with the same synaptic strength as in the forward step (even if the nature of the information to pass is intrinsically different), or to have a parallel error-propagating network with the same weights as the feedforward network. The locality problem, also known as the weight transport problem, involves the contradiction of the biologically grounded condition that synaptic learning depends only on locally available information coming from either the pre-synaptic or the post-synaptic neuron. In BP this requirement is violated because each hidden unit has to have precise knowledge of the activity of all downstream synapses, for instance the activation functions of the downstream layers.

The novelty of AuGMEnT is to create a learning environment which embeds the backpropagation-like approach in a biological context, where the typical biological issues associated with standard BP are avoided by construction and the resulting learning rules are reinterpreted as the product of the neuromodulator and the synaptic tags. In fact, on one side, the feedback matrix $\mathbf{W}'$ is not exactly the transpose of its forward counterpart $\mathbf{W}$, but a matrix that is initialized randomly and updated with the same learning rules as $\mathbf{W}$. This means that the two matrices evolve in such a way that they become similar during training, in accordance with neurophysiological findings. On the other side, the locality problem is prevented simply by construction of the AuGMEnT network: having a network with a three-layer depth and a linear activation function in the output layer, upstream synapses (in $\mathbf{V}^R$ or $\mathbf{V}^M$) just need to recover the information from the input and from the error propagated to the post-synaptic unit.

However, the AuGMEnT variant of BP still presents a constraint between forward and feedback connectivities, because it requires the same tagging operation on both channels; this is nonetheless less strict than the requirement of having exactly the same synaptic strengths. Another important difference with respect


to standard BP involves the nature of the error signal: in standard BP the error vector is obtained as the difference between target and output predictions, while in AuGMEnT the RPE is just a scalar piece of information associated with the selected output unit. As a result, the error signal in AuGMEnT is less rich in information and learning is generally slower than in BP.

2.2 Hierarchical Error Representation - HER

2.2.1 HER Model: a Hierarchical Predictive Coding Framework

Developed in recent years by W. Alexander and J. Brown (Alexander and Brown, 2015), the Hierarchical Error Representation (HER) model is mainly inspired by the known interactions in the frontal cortex that are supposed to regulate working memory and cognitive control. In fact, the computational model tries to reproduce in silico the known neuroanatomical connections in the PFC. In particular, the mPFC and the dlPFC are regions of the PFC thought to operate in a complementary way: first, the dlPFC encodes the input according to the task environment and stores it in the WM; afterwards, the mPFC employs the WM content to form an expectation and derive a prediction error as its deviation from the real outcome.

Since the mPFC appears to work at different levels of abstraction of the error, the structure of the HER model mimics the organization of the PFC as a hierarchy of identical cooperating levels. This setting recalls the dynamics of a hierarchical predictive coding framework, where the different levels exchange information using bottom-up and top-down pathways. In fact, a signal is passed to the higher levels of the hierarchy to be somehow explained, and the acquired information is pushed back to the inferior levels in a modulation process. In a later work (Alexander and Brown, 2016), the authors state that the approach is so promising that in the future the hierarchical predictive coding framework could be further extended to include more sophisticated processes in the frontal lobes.

2.2.2 The HER level

The main motif of the HER model is the HER level, greatly inspired by thePRO model previously proposed by Alexander and Brown (Alexander and


Brown, 2011). The HER level, as schematized in Figure 2.2, operates on an input stimulus s to produce a prediction p that can be compared with a given output o to define the prediction error e.

Figure 2.2. Simplified scheme of a HER level. At the end points of the line there are the external stimulus s (blue) and the related output o (purple). The memory gating mechanism is used to decide whether to store the current stimulus into the working memory (yellow) or maintain the previous information; then, the memory representation r from the memory is used to formulate a prediction vector in the prediction block (green), which is compared with the given output to compute a prediction error (red). Moreover, the prediction is also employed to select a response (orange) that in turn defines the activity filter for the error computation (dashed line).

The Memory Gating

At each iteration of the learning process an external stimulus $\mathbf{s} \in \mathcal{I}_S$ is presented to the model, where $\mathcal{I}_S$ is the space of the vectors in one-hot representation as described in equation (2.1). The stimulus corresponds to an external condition presented to the agent in a predefined setting, but the new information has to be incorporated in the brain to be processed for learning. However, storing the stimulus in memory is not an automatic step: it depends on the utility of that information for the task, or on the attention of the agent itself. For these reasons, the working memory is equipped with a gating mechanism which decides whether to store the current stimulus or maintain the internal representation $\mathbf{r} \in \mathcal{I}_S$ already in the WM. The gating system is mathematically defined as follows:

$$\mathbf{v} = \mathbf{s}\, \mathbf{X} \tag{2.21}$$

where $\mathbf{X} \in \mathbb{R}^{S \times S}$ is the weight matrix used to define a real vector $\mathbf{v} \in \mathbb{R}^S$. Since $\mathbf{s}$ is encoded in the one-hot representation, the operation above simply


corresponds to the extraction of the row of matrix $\mathbf{X}$ associated with the current stimulus type. The resulting vector $\mathbf{v}$ contains the storing values associated with stimulus $\mathbf{s}$: each component $v_i$ measures the advantage of storing information $i$ in the working memory when $\mathbf{s}$ is the presented stimulus. At each time level, the storing value associated with $\mathbf{s}$, $v_s$, is compared to the storing value of the previous memory $\mathbf{r}$, $v_r$; finally, the probability of storing $\mathbf{s}$ is computed through a biased softmax operation:

$$p(\mathbf{s}) = \frac{\exp(v_s) + b}{\exp(v_s) + \exp(v_r) + b} \tag{2.22}$$

If the storing value $v_s$ is much higher than $v_r$, the new stimulus $\mathbf{s}$ will likely be registered, becoming the internal representation $\mathbf{r}$ (mathematically the two vectors coincide); otherwise the old information is maintained in memory because it is preferred to the new input, and the latter is discarded. The bias term $b$ in equation (2.22) is an artificial term used to pilot the storing tendency of a HER level: when the bias is high, the memory gate is more likely to be open to new stimuli, regardless of the previous memory representation. In most cases the gating is unbiased ($b = 0$ for each level), but in some tasks it is preferable to use the bias term to adjust the memory dynamics specifically for each HER level (as done in the 12-AX task in 3.2).
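Equation (2.22) is easy to sketch directly (pure Python; the function name is ours):

```python
import math

def store_probability(v_s, v_r, bias=0.0):
    """Biased softmax of eq. (2.22): probability of gating stimulus s into the WM."""
    return (math.exp(v_s) + bias) / (math.exp(v_s) + math.exp(v_r) + bias)
```

With equal storing values and no bias the gate is indifferent (probability 0.5), while a large bias pushes the probability towards 1 regardless of the stored memory.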

The Prediction Error

According to the basic rules of supervised learning, in addition to the stimulus $\mathbf{s}$ the model receives the target output $\mathbf{o}$ related to the current stimulus. In the HER model, the output is structured in the form of response-feedback conjunctions: this means that, considering the standard case where feedback is binary (correct/wrong), each possible response R is split into the cases R-correct and R-wrong. This results in an output vector of dimension $P = 2R$.

Example In the case of a binary task ($R = 2$) the possible responses are just left (L) or right (R). Using the response-feedback conjunction explained above, the desired output vector $\mathbf{o}$ has dimension $P = 4$ and is composed as follows:

$$\mathbf{o} = [\text{L-correct}, \text{L-wrong}, \text{R-correct}, \text{R-wrong}] \tag{2.23}$$


where each component is a binary 0-1 variable. When we adopt this notation, we notice that the output is not in the one-hot format as normally expected: there are as many positive values as the number of possible responses. For instance, when the target response is (L), both the components L-correct and R-wrong are turned on and the output vector is $\mathbf{o}_L = [1, 0, 0, 1]$; in the opposite case the output related to response (R) is $\mathbf{o}_R = [0, 1, 1, 0]$.
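The response-feedback encoding can be sketched as follows (the helper name and the 0-based response index are our conventions):

```python
def target_output(response, n_responses=2):
    """Response-feedback conjunction: the target response contributes R-correct,
    every other response contributes R-wrong (output dimension P = 2R)."""
    o = [0] * (2 * n_responses)
    for r in range(n_responses):
        if r == response:
            o[2 * r] = 1        # R-correct component
        else:
            o[2 * r + 1] = 1    # R-wrong component
    return o
```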

As will become clearer later, this trick is used to obtain some learning at each iteration, regardless of the selected response.

Given the internal representation $\mathbf{r}$ of the input stimulus, the model finally has to predict the target output $\mathbf{o}$, and the performance of the model can be considered good if the gap between the prediction and the output vectors is small enough. As a consequence, the prediction vector $\mathbf{p}$ belongs to the same $P$-dimensional space as the output, adopting the response-feedback template, but, unlike before, its components are real. The prediction vector is computed linearly via a weight matrix $\mathbf{W} \in \mathbb{R}^{S \times P}$:

$$\mathbf{p} = \mathbf{r}\, \mathbf{W} \tag{2.24}$$

The so-defined prediction vector then has a twofold function: on the one hand, it is used to formulate a response to the external stimulus; on the other hand, it is compared to the output vector to compute the prediction error. Starting from the prediction vector, a measure $u_R$ for a generic response R can be obtained simply by aggregating the information from the cases correct and wrong:

$$u_R = p_{R\text{-correct}} - p_{R\text{-wrong}} \tag{2.25}$$

Iterating the previous operation over all possible responses in the task, we can define the response vector $\mathbf{u} \in \mathbb{R}^R$, where each component is $u_i = p_{2i-1} - p_{2i}$ $\forall i = 1, 2, \ldots, R$. Finally, the response of the system is selected according to the probabilities defined using the softmax:

$$P(u_i) = \frac{\exp(\gamma u_i)}{\sum_{j=1}^{R} \exp(\gamma u_j)} \tag{2.26}$$

where $\gamma$ is the temperature parameter that models the attention of the agent: high values of $\gamma$ emphasize the difference between the responses, favoring the dominant option; low values encourage the exploration of all possible responses. Numerically, the softmax operation can lead to overflow


problems or instabilities, which can easily be avoided by computing the softmax on the auxiliary variable $\mathbf{u}^* = \mathbf{u} - \max_i u_i$.
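Equations (2.25)-(2.26), together with the stabilizing shift $\mathbf{u}^* = \mathbf{u} - \max_i u_i$, can be sketched as follows (assuming the components of $\mathbf{p}$ are ordered as in equation (2.23)):

```python
import numpy as np

def response_probs(p, gamma=2.0):
    """Response vector of eq. (2.25) and stable softmax of eq. (2.26):
    u_i = p_{2i-1} - p_{2i} (1-based), shifted by max(u) before exponentiation."""
    u = p[0::2] - p[1::2]              # aggregate the correct/wrong predictions
    e = np.exp(gamma * (u - u.max()))  # u* = u - max_i u_i avoids overflow
    return e / e.sum()
```

Subtracting the maximum leaves the probabilities unchanged, since the common factor cancels between numerator and denominator.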

A key variable in the HER model is the prediction error, which computes the difference between prediction and output as follows:

$$\mathbf{e} = \mathbf{a} \ast (\mathbf{o} - \mathbf{p}) \tag{2.27}$$

where $\ast$ indicates the element-wise multiplication and the vector $\mathbf{a}$ is a binary activity filter that selects only the entries of vectors $\mathbf{o}$ and $\mathbf{p}$ associated with the selected response. This is done to have learning only on the actions actually experienced.

Example Let us consider again the binary task with possible responses L/R, and assume that the target response to a certain stimulus is L. As seen before, this translates into an output vector $\mathbf{o} = [1, 0, 0, 1]$. Assume that the prediction vector obtained by (2.24) is $\mathbf{p} = [0.8, 0.1, 0.2, 0.6]$. According to (2.25), the response vector $\mathbf{u}$ has components:

$$u_L = 0.8 - 0.1 = 0.7 \qquad u_R = 0.2 - 0.6 = -0.4$$

with related probabilities $P(u_L) \sim 0.9$ and $P(u_R) \sim 0.1$ (considering $\gamma = 2$ in (2.26)). In this case, the model is highly inclined to select the correct response L, leading to an activity filter $\mathbf{a} = [1, 1, 0, 0]$; otherwise, there is still a small chance of wrongly selecting R and having $\mathbf{a} = [0, 0, 1, 1]$. So, considering the more frequent situation where the model responds L, the computation of the prediction error (2.27) would be:

$$\mathbf{e}_L = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} \ast \left( \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} - \begin{bmatrix} 0.8 \\ 0.1 \\ 0.2 \\ 0.6 \end{bmatrix} \right) = \begin{bmatrix} 0.2 \\ -0.1 \\ 0 \\ 0 \end{bmatrix} \tag{2.28}$$

while in case the selection leads to the incorrect response R, with simple analogous computations we get $\mathbf{e}_R = [0, 0, -0.2, 0.4]^T$. We can easily notice that in the case of an incorrect response, the error signal is larger than in the case of correct behavior. As we will see, this observation is a crucial point for the overall learning of the HER model.


Taken alone, a single level has limited learning power and can be used just to solve trivial tasks. However, the power of the HER model resides in its hierarchical architecture, simply obtained by stacking the HER level multiple times. As we will see in the next paragraphs, each level of the model handles a different order of complexity of the task and communicates with the others through the prediction error signal.

2.2.3 The Predictive Coding Dynamics in HER

As anticipated, the error signal is the main communication channel between the different layers in the complete HER hierarchical structure (Figure 2.3). The general idea behind the multi-level model follows the dynamics of a class of models called predictive coding, and the procedure is based on bottom-up (green arrows) and top-down (red arrows) pathways.

Let us describe some steps of the inter-level dynamics to understand the mechanism. After the presentation of stimulus $\mathbf{s}$ to all $L$ levels of the HER architecture, the base level produces a base prediction $\mathbf{p}_0$ from the stored object $\mathbf{r}_0$, together with the associated prediction error $\mathbf{e}_0$. The latter is passed up as a proxy outcome to the superior level, which provides an error prediction $\mathbf{p}_1$ given the internal representation $\mathbf{r}_1$ stored at the first level. Again this emits a prediction error $\mathbf{e}_1$ that is used at level 2 as a target to produce another error prediction $\mathbf{p}_2$, and so on.

When the bottom-up signaling is over, the produced sequence of error predictions $\{\mathbf{p}_l\}_{l=1}^{L}$ is used to modulate the predictions at the inferior levels. Proceeding backwards, prediction $\mathbf{p}_L$ modulates $\mathbf{p}_{L-1}$ into a modulated prediction $\mathbf{m}_{L-1}$ using the further information stored in $\mathbf{r}_{L-1}$; this top-down modulation is iterated through the layers down to the base level 0, where the aggregated information from the upper levels is used to create the base modulated prediction $\mathbf{m}_0$. This final vector is eventually employed to provide the model response to the current stimulus $\mathbf{s}$, using the response mechanism described before through equations (2.25) and (2.26).

From the algorithmic point of view, the bottom-up error signaling is implemented by taking the outer product between the stored object $\mathbf{r}_l \in \mathcal{I}_S$ and the error $\mathbf{e}_l \in \mathbb{R}^{P_l}$, and using it as the output at level $l+1$:

$$\mathbf{O}_{l+1} = \mathbf{r}_l^T \mathbf{e}_l \tag{2.29}$$


Figure 2.3. Flow chart of the HER model. The hierarchical structure is obtained by iterating the HER level in Figure 2.2 multiple times. Each level has an independent working memory with its own gating system. The inter-level communication is based on an upward error signal (green arrows) and on a downward prediction modulation (red arrows). Taking advantage of the higher-order knowledge from the superior levels, the base model is responsible for the final formulation of the response (orange block). The predicted response is finally used to define the activity filter a in equation (2.27) (dashed line).

where the output $\mathbf{O}_{l+1}$ is a sparse matrix in $\mathbb{R}^{S \times P_l}$ that has to be reshaped into a row vector $\mathbf{o}_{l+1} \in \mathbb{R}^{S \cdot P_l}$. The sparsity of the matrix is due to the one-hot representation of $\mathbf{r}_l$. The reader should notice that the error signal also includes the term $\mathbf{r}_l$, in order to couple the measure of the prediction error with the WM content it is related to.

Regarding the top-down pathway, the modulated prediction $\mathbf{m}_l$ is simply computed by combining the prediction vectors $\mathbf{p}_l \in \mathbb{R}^{P_l}$ and $\mathbf{p}_{l+1} \in \mathbb{R}^{P_{l+1}}$ in the following way:

$$\mathbf{m}_l = \mathbf{p}_l + \mathbf{r}_l\, \mathbf{P}_{l+1} = \mathbf{r}_l\, (\mathbf{W}_l + \mathbf{P}_{l+1}) \tag{2.30}$$

where $\mathbf{P}_{l+1}$ is the matrix in $\mathbb{R}^{S \times P_l}$ obtained by reshaping vector $\mathbf{p}_{l+1}$, and the second equivalence comes from equation (2.24). The reshape operation



P_{l+1} = P_l · S is always possible by construction of the output in equation (2.29).
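The two inter-level signals of equations (2.29) and (2.30) can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the function names are hypothetical, and we assume r_l is a one-hot vector of length S.

```python
import numpy as np

def bottom_up_signal(r_l, e_l):
    # Eq. (2.29): outer product couples the WM content r_l (one-hot, length S)
    # with the prediction error e_l (length P_l); flattened into a row vector,
    # it becomes the training signal o_{l+1} at level l + 1.
    O = np.outer(r_l, e_l)   # S x P_l matrix, sparse because r_l is one-hot
    return O.flatten()       # row vector of length S * P_l

def top_down_modulation(W_l, P_next, r_l):
    # Eq. (2.30): m_l = (W_l + P_{l+1}) r_l, where P_next is the prediction
    # vector of level l + 1 reshaped into a P_l x S matrix.
    return (W_l + P_next) @ r_l
```

Because r_l is one-hot, the outer product simply places the error vector e_l in the row of O corresponding to the stored stimulus, which is exactly the coupling between error and WM content described above.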

2.2.4 Learning Rules

The final part in understanding how the HER model works is the learning dynamics. In fact, each HER level has to learn two important tasks: 1) which stimulus is best to store in the working memory for the final performance and 2) how to compute the predictions related to a specific internal representation.

Learn Memory Gating

The first learning task is associated with the internal memory gating sys-tem presented in equation (2.21). The ability of the system to discriminatewhether the current stimulus is task-relevant or not is fundamental in mostcognitive tasks and, in the case of the HER model, each level has to learn itsown gating dynamics. The gating skill is achieved by updating the memoryweight matrix Xl using the following learning rule:

X_l^{t+1} = X_l^t + α_m d_t^T (e_l^mod W_l^mod ∗ r_l)    (2.31)

where e_l^mod and W_l^mod are respectively the modulated versions of the prediction error (2.27) and of the weight matrix in (2.24):

e_l^mod = a_l ∗ (o_l − m_l)        W_l^mod = W_l + P_{l+1}    (2.32)

The learning rule above corresponds to standard backpropagation of the modulated prediction error with learning rate α_m, coupled with the eligibility trace d_l ∈ [0, 1]^S to keep track of the previous stimuli. In fact, the eligibility trace vector takes value d_i = 1 when stimulus i is currently presented and decays at each iteration by a decay constant λ. In this way, the model is capable of maintaining memory of information given in the past and of finding useful temporal correlations between cues.
Actually, backpropagation-like learning can be replaced by an RL scheme, as indicated in the Supplementary Material of (Alexander and Brown, 2016), but the performance is generally penalized. In this case, the error δ is a scalar measure of the discrepancy between the reward and the



probability of selecting the correct response, which is fed directly to the WM without need of the forward connectivity matrix W as in backpropagation.

Learn to Predict

The second learning goal is to update the weight matrix W_l to compute correct predictions from the objects stored in the working memory. This time the learning rule is a simple delta rule with learning rate α_p:

W_l^{t+1} = W_l^t + α_p r_l^T e_l^mod    (2.33)

During training, the update process occurs simultaneously on all the levels, but the learning time differs across the layers. In fact, the first weight matrices to converge are the memory matrix X_0 and the prediction matrix W_0 of the base level, while learning at superior levels becomes consistent only once the error signal has converged. Only after convergence of the dynamics in the first level does the second one receive coherent error information that can be used to train X_1 and W_1 to obtain useful error predictions to employ in the top-down modulation. The learning dynamics then continue analogously going upwards in the multi-level architecture.
As a consequence, the reader can intuitively understand that the learning time might increase significantly with the number of levels, because each level presents two weight matrices to train. However, at the same time, the learning power of the HER model is also linked to the number of levels, because the memory capability of the network is equal to the number of layers in the architecture. On the biological side, a reasonable number of HER levels goes from 3 to 5 distinct layers (according to (Alexander and Brown, 2015)), leading to a biological constraint on the height of the model.
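The two learning rules (2.31)–(2.33) of a single HER level can be combined in one numpy sketch. This is an illustrative fragment under our own naming conventions, not the reference implementation; the shapes assumed are X_l ∈ R^{SxS}, W_l ∈ R^{P_l x S}, and P_next is the reshaped top-down prediction (zeros at the top level).

```python
import numpy as np

def her_level_update(X, W, P_next, r, o, a, d, alpha_m=0.1, alpha_p=0.1):
    """One HER learning step at a single level (eqs. 2.31-2.33).
    X: gating weights (S x S), W: prediction weights (P x S),
    P_next: reshaped top-down prediction (P x S), r: one-hot WM content (S,),
    o: outcome/target (P,), a: response filter (P,), d: eligibility trace (S,)."""
    W_mod = W + P_next                  # modulated weight matrix, eq. (2.32)
    m = W_mod @ r                       # modulated prediction, eq. (2.30)
    e_mod = a * (o - m)                 # modulated prediction error, eq. (2.32)
    # Gating update, eq. (2.31): outer product of the eligibility trace with
    # the error back-projected through the modulated weights, masked by r.
    X = X + alpha_m * np.outer(d, (e_mod @ W_mod) * r)
    # Prediction update, eq. (2.33): delta rule on the stored item.
    W = W + alpha_p * np.outer(e_mod, r)
    return X, W
```

Note that when the outcome o matches the modulated prediction m, the modulated error vanishes and both matrices stay unchanged, as expected from a delta rule.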

2.3 Differentiable Neural Computer - DNC

Unlike the previous models, the Differentiable Neural Computer (DNC), recently proposed by Google DeepMind (Graves et al., 2016), is an example of Memory-Augmented Neural Network with external memory. In fact, the name itself is due to the coupling of a neural controller, the processor, with a memory matrix, the RAM; the term 'differentiable' is then added to stress that all the operations are learnt with gradient descent thanks to the differentiability of the network.



In particular, the DNC is currently considered to be the top-performing MANN, having surpassed previous networks like LSTM or NTM thanks to an efficient and innovative memory addressing mechanism. In fact, the writing and reading operations are assigned to write and read heads that move around memory locations according to three different principles: content lookup, location usage and temporal linkage. The first two addressing mechanisms are not new, because they were already employed in the NTM following the dynamics explained in (Graves, Wayne, and Danihelka, 2014); the real innovation consists in a temporal-based addressing technique that records the transitions between successive memory locations and enables backward and forward retrieval of input sequences. In addition, in the DNC multiple readings and writings are permitted during each time level, depending on the number of read and write heads employed in the process.
In the reference paper, the authors tested the model on complex tasks, like the bAbI dataset, graph problems or puzzles, that prove the great learning power of the network. In the present work, the complexity of the tested tasks is limited to simpler cases that are more comparable with standard experimental memory-based tasks from cognitive neuroscience.

The main ingredients of the DNC model are the neural network and the external memory. The former is a general neural network, while the memory is a matrix with N locations (rows) of size M (columns). The complete graph of the DNC architecture is sketched in Figure 2.4. As we will see in the next paragraph, the two main components of the model communicate thanks to W write heads and R read heads that follow the 'instructions' collected in the interface vector ξ.

2.3.1 The Neural Controller

The structure itself of the neural controller is not crucial, but generally it consists of a deep LSTM architecture; this type of controller is generally more efficient than others, because LSTM cells are able to maintain in their states some internal representations of previous inputs using input, output and forget gates (Hochreiter and Schmidhuber, 1997). However, in some cases − usually when the complexity of the task is very low − the network may rely uniquely on the internal memory capacity of the LSTM controller without taking advantage of the external memory, leading to suboptimal solutions. This is a slight drawback of the LSTM option, but it is still generally preferred to



Figure 2.4. Graph of the DNC network, composed mainly of a neural controller and an external addressable memory. Their interaction is regulated by a sequence of parameters collected in the interface vector, which define the gates, keys, strengths and addressing modes used by the read/write heads (red dashed arrows). The resulting read vectors are used both to compute the network response and as further input at the next timestep.

others. In the following, for the sake of simplicity, we consider a controller with a single LSTM layer, but the equations and the dynamics can be easily generalized to deeper networks.

At each time level, the neural controller receives an external input s_t ∈ R^S together with the R read vectors ρ_{t−1} = [r_{t−1}^1; r_{t−1}^2; . . . ; r_{t−1}^R] ∈ R^{RxM} retrieved from memory at the previous time level. The two inputs are concatenated in a unique vector χ_t = [s_t, r_{t−1}^1, r_{t−1}^2, . . . , r_{t−1}^R] ∈ R^{S+RM} and passed to the controller to determine the hidden state vector h_t ∈ R^H, where H is the number of LSTM units.
The outputs of the controller are the output vector v_t and the interface vector ξ_t. The latter will be described in the detail of all its components in the next paragraph. The output vector is simply computed linearly from the hidden state via the output weights W_t^out ∈ R^{YxH}, where Y is the output dimension:

v_t = W_t^out h_t    (2.34)

Actually, the final output of the network is obtained by combining the output vector v_t just defined with the information retrieved from memory. In fact,



the interface vector is used to extract from memory the new set of read vectors ρ_t = [r_t^1; r_t^2; . . . ; r_t^R], which is passed both to the input χ_{t+1} for the next time level and to the final output y_t in the following way:

y_t = v_t + W_t^r [r_t^1, r_t^2, . . . , r_t^R]    (2.35)

with Wrt ∈ RRMxY .

2.3.2 The Interface Vector

The key operations in the DNC are assigned to the interface vector, which defines the memory interactions with the network and the addressing instructions to the W write and R read heads in terms of keys, gates and modes. In the following, we subdivide the long interface vector ξ_t ∈ R^{MR+3WM+2WR+3R+3W} into all its components, in order to introduce the reader to the large set of parameters that will be employed in the memory dynamics:

ξ_t = { K_t^r, K_t^w, β_t^r, β_t^w, E_t, N_t, f_t, g_t^w, g_t^a, Π_t }    (2.36)

where each subcomponent corresponds to:

• read keys: K_t^r = [k_t^{r,1}, k_t^{r,2}, . . . , k_t^{r,R}] ∈ R^{RM}

• write keys: K_t^w = [k_t^{w,1}, k_t^{w,2}, . . . , k_t^{w,W}] ∈ R^{WM}

• read strengths: β_t^r = ψ(β̂_t^r) = ψ([β̂_t^{r,1}, β̂_t^{r,2}, . . . , β̂_t^{r,R}]) ∈ (1, ∞)^R

• write strengths: β_t^w = ψ(β̂_t^w) = ψ([β̂_t^{w,1}, β̂_t^{w,2}, . . . , β̂_t^{w,W}]) ∈ (1, ∞)^W

• erase vectors: E_t = σ(Ê_t) = σ([ê_t^1, ê_t^2, . . . , ê_t^W]) ∈ [0, 1]^{WM}

• write vectors: N_t = [ν_t^1, ν_t^2, . . . , ν_t^W] ∈ R^{WM}

• free gates: f_t = σ(f̂_t) = σ([f̂_t^1, f̂_t^2, . . . , f̂_t^R]) ∈ [0, 1]^R

• write gates: g_t^w = σ(ĝ_t^w) = σ([ĝ_t^{w,1}, ĝ_t^{w,2}, . . . , ĝ_t^{w,W}]) ∈ [0, 1]^W

• allocation gates: g_t^a = σ(ĝ_t^a) = σ([ĝ_t^{a,1}, ĝ_t^{a,2}, . . . , ĝ_t^{a,W}]) ∈ [0, 1]^W

• read modes: Π_t = κ(Π̂_t) = κ([π̂_t^1, π̂_t^2, . . . , π̂_t^R]) ∈ [0, 1]^{(1+2W)R}



with ψ(·), σ(·) and κ(·) the usual activation functions:

ψ(x) = oneplus(x) = 1 + log(1 + e^x)

σ(x) = sigmoid(x) = 1 / (1 + e^{−x})

κ(x)_i = softmax(x)_i = e^{x_i} / Σ_j e^{x_j}    (2.37)

The meaning and the role of each variable will become clearer in the next paragraph. However, we anticipate that the read mode π_t^i is a real vector of size 1 + 2W that encodes the instructions for the temporal addressing. In fact, its components compete to decide whether the reading should focus on locations with content similar to the current input, or rather retrieve input sequences in the backward or forward order of storage for each independent write head.
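The three activation functions in (2.37) are straightforward to write down; a minimal numpy sketch (with the softmax shifted by its maximum for numerical stability, a standard trick not spelled out in the text) is:

```python
import numpy as np

def oneplus(x):
    # psi: maps any real value into (1, inf), used for the strengths
    return 1.0 + np.log1p(np.exp(x))

def sigmoid(x):
    # sigma: maps any real value into (0, 1), used for gates and erase vectors
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # kappa: normalizes a vector into a distribution, used for the read modes
    z = np.exp(x - np.max(x))  # shift for numerical stability
    return z / z.sum()
```

These squashing functions guarantee that the raw controller outputs land in the domains listed above (strengths above 1, gates in [0, 1], read modes summing to 1) regardless of the unbounded controller activations.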

2.3.3 Addressing the External Memory

In general, the main goals of the memory dynamics are to write the relevant information to store in the memory matrix M ∈ R^{NxM} and to retrieve the useful data to solve the task at each time level. In the DNC, the R read and the W write heads are responsible for this type of interaction with the external memory, and they address it according to write and read weight vectors {w_{t,i}^w}_{i=1}^W and {w_{t,j}^r}_{j=1}^R computed following the three criteria described here: content lookup, location usage and temporal retrieval. Normally, there are some constraints on the number of read/write heads to be used, like using just one write head W = 1 (Graves et al., 2016) or having the same number of read and write heads (Santoro et al., 2016); here, we prefer to report the equations for the most general case R, W ≥ 1.

Content Associations

The content-based addressing is the most natural mechanism, because it employs the associations between keys and memory locations according to a similarity measure. Typically, the similarity between any two vectors a and b is computed with the cosine similarity:

c(a, b) = (a · b) / (||a|| ||b||)    (2.38)



In order to define a probability distribution among the N memory locations, the content lookup is based on the normalized weighted measure C(M_t, k_t, β_t), defined componentwise as:

C(M_t, k_t, β_t)[i] = e^{β_t c(M_t^i, k_t)} / Σ_j e^{β_t c(M_t^j, k_t)}    (2.39)

where the key-strength pair (k_t, β_t) can refer either to a read pair (k_t^r, β_t^r) or to a write pair (k_t^w, β_t^w), and the vector M_t^i indicates the i-th memory location. In this way, the quantity C(M_t, k_t, β_t) is used to weight the memory locations to read from and write to.

This type of addressing is the most biologically plausible, because it is entirely based on the recall of associations, which is how human memory is thought to work. Compared to the other addressing mechanisms, content lookup is also the least artificial, meaning that the usage and temporal criteria have been added to improve the performance and not because they are particularly linked to experimental evidence on the working memory.
Furthermore, content-based addressing presents important analogies with the well-known Hopfield model (Hopfield, 1982), which is an abstract network specific to memory retrieval. Briefly, in the Hopfield experiments a cue with partial information is provided to the network and the most similar pattern from the memory is recalled to recover the missing information. The recall dynamics in the Hopfield model are based on a similarity measure called pattern overlap, which is a discrete analogue of the cosine similarity defined in (2.38). A fundamental difference between the two memory retrieval dynamics is that DNC content addressing defines a weighting across all N memory locations and extracts a linear combination of all the stored information, while the Hopfield model extracts just a single pattern from a given set of memories. The soft addressing to memory in the DNC model is done so that the network remains entirely differentiable and more easily trainable by backpropagation. This point is crucial in the discussion of the biological plausibility of MANNs with external memory and will be tackled in the next chapters.

Dynamic Memory Allocation

The location-based addressing implements a memory allocation scheme that properly frees and allocates new information based on a usage vector u_t ∈ [0, 1]^N and the input free gates {f_t^j}_{j=1}^R from the interface vector (one for each read head). The general rule of the usage dynamics is that the usage of



a certain memory location increases at every write and can decrease at each read, according to the amount of unused memory declared in the free gates. As a result, the update rule of the usage vector has the following form:

u_t = (u_{t−1} + (1 − u_{t−1}) ∗ Π_{i=1}^W w_{t−1}^{w,i}) ∗ ψ_t    (2.40)

where ψ_t ∈ [0, 1]^N is the memory retention vector, which indicates how much of each memory location read from during the previous time level will not be freed; that is

ψ_t = Π_{j=1}^R (1 − f_t^j ∗ w_{t−1}^{r,j})    (2.41)

Afterwards, the allocation criterion is to write the new data in the least used memory locations. In order to achieve this behaviour, the usage vector is temporarily sorted in ascending order of usage, leading to the construction of the auxiliary vector ũ_t, and the sorted allocation weighting ã_t ∈ [0, 1]^N is defined as follows:

ã_{t,i} = (1 − ũ_{t,i}) Π_{j=1}^{i−1} ũ_{t,j}    (2.42)

As a result, to obtain the final allocation weight vector a_t it is sufficient to undo the sorting and restore the original indexing of the memory locations.
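The sort, the cumulative product of equation (2.42), and the unsorting step can be sketched compactly in numpy. The function name is ours; the inputs are assumed to be the usage values after the retention update.

```python
import numpy as np

def allocation_weighting(u):
    """Eq. (2.42) plus the unsorting step: sort usage ascending, compute
    the allocation weights in sorted order, then restore the original
    indexing. Least-used locations receive the largest weights."""
    order = np.argsort(u)                  # indices in ascending usage
    u_sorted = u[order]
    # running product of the usages of all strictly less-used locations
    cumprod = np.concatenate(([1.0], np.cumprod(u_sorted)[:-1]))
    a_sorted = (1.0 - u_sorted) * cumprod  # eq. (2.42)
    a = np.empty_like(a_sorted)
    a[order] = a_sorted                    # undo the sort
    return a
```

For example, with u = [0.9, 0.1, 0.5] the second location (usage 0.1) is the least used and receives by far the largest allocation weight.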

It must be noticed that, in case of multiple writings per time level (W > 1), the computation of the allocation weightings has to take into account the variation of usage while iterating through the write heads. This means that, instead of a single allocation vector a_t as defined in (2.42), we have a sequence {a_t^k}_{k=1}^W, one for each write head, that is:

ã_{t,i}^k = (1 − ũ_{t,i}^k) Π_{j=1}^{i−1} ũ_{t,j}^k
u_t^{k+1} = u_t^k + g_t^{w,k} (1 − u_t^k) ∗ a_t^k
k = 1, 2, . . . , W    (2.43)

where g_t^{w,k} is the write gate associated with the k-th write head and the ˜ indicates again the sorting operator in ascending usage order.

Temporal Linkage

The last attention mechanism for the memory addressing is the temporal linkage, which enables sequential retrieval of input sequences experienced in the past.



This capability of the network can be a great contribution in famous cognitive tasks, like the copy or the repeat-copy tasks − described in (Graves, Wayne, and Danihelka, 2014) −, where the model has to record sequences of instructions that later have to be retrieved in order.
In this context, the crucial operator is the temporal link matrix L ∈ [0, 1]^{NxN} that records the transitions of the write heads through memory locations. More precisely, the entry L_t[i, j] represents the degree of temporal proximity between the writing in location i and the previous writing in location j: if L_t[i, j] is close to 1, it is likely that the write head passed directly from memory location j to location i. In this way, the link matrix creates a temporal graph between the memory locations. In the case of W multiple write heads, for each memory location j we have W writing transitions to as many different memory locations; as a result, we actually consider a sequence of link matrices {L_t^k}_{k=1}^W in order to record the movements of all the write heads.
The construction of the link matrix is based on a precedence vector p_t^k ∈ [0, 1]^N, where each component p_t^k[i] indicates the degree of likelihood that location i is the last one written to (with respect to write head k). This vector is computed by the following recurrence relation:

p_0^k = 0
p_t^k = (1 − Σ_{n=1}^N w_t^{w,k}[n]) p_{t−1}^k + w_t^{w,k}
∀t ≥ 1, ∀k = 1, 2, . . . , W    (2.44)

Afterwards, we define the temporal link matrix L_t^k recursively: after each location modification, the related entries in the link matrix are updated to remove the old links and add the new ones:

L_0^k[i, j] = 0
L_t^k[i, i] = 0
L_t^k[i, j] = (1 − w_t^{w,k}[i] − w_t^{w,k}[j]) L_{t−1}^k[i, j] + w_t^{w,k}[i] p_{t−1}^k[j]
∀t ≥ 1    (2.45)

for i, j = 1, 2, . . . , N and k = 1, 2, . . . , W. Here, self-links have been excluded, because it is not completely clear how to handle the transitions from a location to itself.
In practice, the link matrix is eventually used to retrieve input sequences in a certain order, which can be either backwards in time, by looking at the memory location written just before the target one, or, otherwise, in the forward



order. In order to move in time, it is sufficient to apply the link operator L to a weighting vector a ∈ R^N: in this way, the vector La smoothly shifts the focus to the memory locations written immediately after those emphasized by the entries of a. Conversely, the retrieval of the memory locations that were written to just before is simply obtained by applying the transposed matrix L^T, thus L^T a.
In the DNC model, this temporal linkage trick is used to determine the backward and forward weighting vectors b_t^{j,k} ∈ [0, 1]^N and f_t^{j,k} ∈ [0, 1]^N, referred to the previous reading by the j-th read head:

f_t^{j,k} = L_t^k w_{t−1}^{r,j}        b_t^{j,k} = (L_t^k)^T w_{t−1}^{r,j}    (2.46)

for j = 1, 2, . . . , R and k = 1, 2, . . . ,W .
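The precedence and link updates (2.44)–(2.45) and the shifts of equation (2.46) can be sketched for a single write head in numpy; the function names are ours and the sketch omits the multi-head index k for brevity.

```python
import numpy as np

def update_precedence(p, w):
    # Eq. (2.44): p_t = (1 - sum(w_t)) p_{t-1} + w_t
    return (1.0 - w.sum()) * p + w

def update_link(L, w, p_prev):
    # Eq. (2.45): decay old links touching i or j, add the new transition
    # "written at j just before i"; self-links are zeroed out.
    L = (1.0 - w[:, None] - w[None, :]) * L + np.outer(w, p_prev)
    np.fill_diagonal(L, 0.0)
    return L

def forward_backward(L, w_read_prev):
    # Eq. (2.46): shift a previous read weighting one step forward or
    # backward along the recorded write order.
    return L @ w_read_prev, L.T @ w_read_prev
```

Writing to location 0 and then to location 1 sets L[1, 0] close to 1, so a read weighting focused on location 0 is shifted by L onto location 1 (forward), and by L^T back from 1 onto 0 (backward), exactly the retrieval behaviour described above.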

Actually, the temporal linkage is quite an expensive operation, requiring O(N^2) resources both in computational and in memory cost for exact computation; naturally, the efficiency decreases significantly when the memory size increases. However, there exists a smart variant of the algorithm that exploits the sparsity of the link matrix L and filters it, keeping just the K highest values in L. Proceeding in this way, the computational cost is sharply reduced to O(N log N) and the memory cost to O(N), without significant loss of precision. The details of this variant are not reported here, because they would be out of the scope of the present work, but they can be found in the supplementary information of (Graves et al., 2016).

Writing and Reading to the External Memory

At this point, it remains to define the write and read weight vectors according to the addressing criteria just described and to use them to interface with the external memory.
The write weights w_t^{w,i} related to the i-th write head are just computed as a convex combination of the allocation weight vector a_t^i and the write content weighting C(M_t, k_t^{w,i}, β_t^{w,i}):

w_t^{w,i} = g_t^{w,i} [g_t^{a,i} a_t^i + (1 − g_t^{a,i}) C(M_t, k_t^{w,i}, β_t^{w,i})]    (2.47)

where g_t^{a,i} and g_t^{w,i} are the allocation and write gates from the interface vector. Finally, the writing operation is the combination of the erase of the addressed



locations and the addition of the write vectors, as mathematically detailed in:

M_t = M_{t−1} ∗ Π_{i=1}^W (1 − w_t^{w,i} (e_t^i)^T) + Σ_{i=1}^W w_t^{w,i} (ν_t^i)^T    (2.48)

where 1 is an NxM matrix filled with ones, and the erase and write vectors e_t^i and ν_t^i are emitted by the controller in the interface vector.

The reading operation from the memory combines the content-based approach with the temporal linkage. Specifically, the read weight vector associated with the i-th read head is a linear combination of the content weighting C(M_t, k_t^{r,i}, β_t^{r,i}), the backward weightings {b_t^{i,k}}_{k=1}^W and the forward weightings {f_t^{i,k}}_{k=1}^W:

w_t^{r,i} = Σ_{k=1}^W π_t^{i,k} b_t^{i,k} + π_t^{i,W+1} C(M_t, k_t^{r,i}, β_t^{r,i}) + Σ_{k=1}^W π_t^{i,W+1+k} f_t^{i,k}    (2.49)

where the weights coincide with the components of the input read mode vector π_t^i ∈ [0, 1]^{2W+1}, which determines which addressing scheme dominates in the retrieval.
The final retrieved vectors are eventually obtained by multiplying the transposed memory matrix with the read weight vectors:

r_t^i = M_t^T w_t^{r,i}    ∀i = 1, 2, . . . , R    (2.50)

Subsequently, the set of read vectors ρ_t = [r_t^1; r_t^2; . . . ; r_t^R] is appended to the input for the next time level, to provide some direct access to the memory content.
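The erase-then-add write of equation (2.48) and the read of equation (2.50) reduce, for a single head, to a pair of outer products and a matrix-vector product. A minimal numpy sketch with our own function names:

```python
import numpy as np

def write_memory(M, w_write, erase, nu):
    # Eq. (2.48) for a single write head: erase the addressed rows via the
    # outer product with the erase vector, then add the write vector.
    return M * (1.0 - np.outer(w_write, erase)) + np.outer(w_write, nu)

def read_memory(M, w_read):
    # Eq. (2.50): the read vector is a weighted combination of memory rows.
    return M.T @ w_read
```

With a one-hot write weighting and an all-ones erase vector, the addressed row is fully overwritten by the write vector, and reading back with the same weighting recovers it exactly; softer weightings blend old and new content instead.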

The models described in this chapter have been presented in increasingorder of learning power and decreasing order of biological plausibility. Inthe next chapter, we analyze the performance of such networks on simplecognitive tasks to detect their limitations and have some insight into howthey work on real applications.



Chapter 3

Learning Across Tasks

Learning ability and biological plausibility are the two main features that we are interested in when discussing the value of MANN models. However, in the present chapter we limit our analysis just to the learning capability of the networks, regardless of the biological foundations of the models.
In machine learning, the learning ability of an ANN is usually discussed in the test phase, to see how the trained network performs on a new dataset that possibly includes never-before-seen input. In this study, we are more interested in the network performance during training. In particular, we look at three main indicators: learning stability, learning time and learning flexibility.
Learning stability is generally measured by looking at the percentage of learning successes of the network over multiple simulations. In cognitive neuroscience, learning is considered to be successful if the model satisfies a predefined criterion for convergence during the training phase. In fact, a model may not reach convergence every time. This is usually due to the randomness in the construction of the training dataset or in some internal decisional processes of the network. Furthermore, the performance of a network can be very sensitive to the parametrization, reducing the reliability of the model when it is applied to different tasks.
The learning time measures the efficiency of the model on a specific task. It corresponds to the number of trials or cues that are necessary to reach learning convergence during training.
By learning flexibility we mean the variety of tasks that the network can solve correctly. In fact, tasks can differ in many aspects, like the data structure to process, the temporal length of the trials or the required memory capacity, i.e. the number of cues to store in order to successfully give the response.

In the present chapter, we compare the performance of the MANN models described so far on specific tasks that require memory, in order to have some insight into how they work on real cases and to observe their advantages



and drawbacks. In addition to AuGMEnT, HER and DNC, the comparison also involves an LSTM network, in order to have a standard and popular term of comparison for the simulations. The adopted architecture for the LSTM model consists just of a network with a single hidden LSTM layer.
Here, we present three different tasks, proposed in increasing order of complexity:

A. The Saccade-Antisaccade Task (SAS), a cognitive task based on eye movements after the presentation of multiple symbols on a screen. It is a simple cue-dependent task with a low memory demand, requiring the storage of just two stimuli, the fixation and the location marks, to be correctly solved.

B. The 12AX Task, a hierarchical task in which the agent has to identify different target patterns in a sequence of input symbols projected on a screen, either A-X or B-Y depending on a prior cue, 1 or 2. Even if the number of inputs to store is still small, the temporal credit assignment problem is harder than in SAS, because the length of the 12AX trials is variable.

C. The Omniglot Task, an image classification task where the model has to correctly identify the character associated with an input image. The characters come from 50 alphabets and the dataset contains only 20 samples per character.
This is the most complex task analyzed, because of the data structure of the input and the high number of possible responses. Furthermore, having few training examples for each symbol, the Omniglot task requires fast learning and is here used to discuss one-shot learning, i.e. the capacity of the trained network to correctly classify a character after just one (or a few) presentations of a sample image.

Part of the interest of the present study on MANN models is also to see how the learning of the models relates to animal or human learning. Since all the analyzed tasks have also been tested on human or animal subjects, we will compare the performance in our simulations with the available data from these experiments.

3.1 The Saccade-Antisaccade Task

In the present study, we consider the Saccade-AntiSaccade Task (SAS), pre-sented in the AuGMEnT paper (Rombouts, Bohte, and Roelfsema, 2015), as the



simplest problem for the MANN networks. SAS is inspired by cognitive experiments performed on monkeys to study the memory representations of visual stimuli in the Lateral Intra-Parietal cortex (LIP). The structure of each trial covers different phases, in which multiple cues are presented on a screen, and at the end of each episode the agent has to respond accordingly in order to gain a reward. Actually, the monkeys also received an intermediate smaller reward when they learnt to fixate the task-relevant marks at the center of the screen. This is an example of a shaping strategy, a well-known trick in RL to facilitate and speed up the learning process. The details of the trial procedure and the experimental results are discussed in (Gottlieb and Goldberg, 1999).

3.1.1 The Structure of the SAS Task

The general idea of the experiment is quite straightforward: the monkey has to look either to the 'Left' (L) or to the 'Right' (R) in agreement with a sequence of marks that appear on a screen in each episode. The response, corresponding to the direction of the eye movement (called a saccade), depends on the combination of the location and fixation marks. The location cue is a circle displayed either on the left side (L) or on the right side (R) of the screen, while the fixation mark is a square presented at the center that indicates whether the final move has to be concordant with the location cue (Prosaccade - P) or in the opposite direction (Antisaccade - A).

As can be seen in Figure 3.1, each trial is structured in five phases: a) start, where the screen is initially empty; b) fix, when the fixation mark appears; c) cue, where the location cue is added on the screen; d) delay, in which the location circle disappears for two timesteps; e) go, when the fixation mark vanishes as well and the test-taker has to give the final response to get the reward. Since the action is given at the end of the trial, when the screen is completely empty, the task can be solved only if the model stores and maintains both stimuli in memory in spite of the delay phase. In addition, the shaping strategy mentioned above is applied in the fix phase of the experiment, by giving an intermediate reward if the agent gazes at the fixation mark for two consecutive timesteps, to ensure that the agent observes the screen during the whole trial and that the go response is not random but consequent to the cues. The most important details about the trial structure



Figure 3.1. Structure of the Saccade-AntiSaccade Task. Development of the trials in all the possible modalities: P-L and A-R have final response L (green arrow), while trials P-R and A-L lead to action R (red arrow). Figure taken from the publication

(Rombouts, Bohte, and Roelfsema, 2015).

are summarized in Table 3.1.

3.1.2 Performance on the SAS Task

The solution of the SAS task requires that both the fixation and the location marks are stored in memory without interference. The memory demand is quite low, and the fixation mark is given at multiple timesteps, so the network has different options for the gating. Furthermore, the variability in the training dataset is very low, because there are just four possible cue combinations and the length of each trial is fixed. For all these reasons, the SAS task is considered to be very simple for a memory-augmented network.
In the following, we compare the performance of the MANN models on the saccade task in terms of learning success and learning time. According to the reference paper, training on the SAS task is considered to be successful if the accuracy for each trial type is higher than 90% in the last 50 trials. In AuGMEnT, the reward for the final response is equal to 1.5 units in case of a correct response, 0 otherwise; the intermediate reward for the shaping strategy is instead equal to 0.2 units. The other models do not have an RL scheme but are trained by backpropagation; thus, the shaping strategy is not really reproducible, and so we assumed that the trial was failed also in case the intermediate fixation goal was not achieved. In Appendix A.1 we report

Page 69: Memory-Augmented Neural Networks: Enhancing Biological ... · Memory-Augmented Neural Networks: Enhancing Biological Plausibility and Task-Learnability by Marco MARTINOLLI Result

3.1. The Saccade-Antisaccade Task 47

Table 3.1. Table of the SAS task

Task feature Details

Task Structure 5 phases: start, fix, cue, delay, goInputs Fixation mark: Pro-saccade (P) or Anti-saccade (A)

Location mark: Left (L) or Right (R)Outputs Eye movement: Left (L), Front (F) or Right (R)Trial Types 1. P+L=L 2. P+R=R

3. A+L=R 4. A+R=LTraining dataset Maximum number of trials is 25000.

Each trial type has equal probability.Rewards Correct saccade at go (1.5 units)

Fixation of the screen in fix (0.2 units)

in a table the settings of the main model parameters for the SAS task.
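The trial structure of Table 3.1 can be sketched as a small generator. The following Python snippet is our own illustration (the function name and the screen encoding are ours, not taken from the thesis code), assuming one screen observation per timestep and the shaping reward granted after two consecutive fixation steps:

```python
import random

# Hypothetical sketch of a SAS trial generator following Table 3.1:
# 5 phases (start, fix, cue, delay, go), 4 equiprobable trial types,
# reward 1.5 for a correct saccade at go, 0.2 for fixating in fix.

def make_sas_trial(fixation=None, location=None):
    fixation = fixation or random.choice(["P", "A"])  # pro- or anti-saccade mark
    location = location or random.choice(["L", "R"])  # location cue
    # Correct final action: same side for P, opposite side for A
    answer = location if fixation == "P" else ("R" if location == "L" else "L")
    # One screen observation per timestep; "empty" means a blank screen
    screens = ["empty", fixation, fixation + location,
               fixation, fixation, "empty"]          # start, fix, cue, delay x2, go
    actions = ["none", "F", "F", "F", "F", answer]   # F = fixate the mark
    rewards = [0.0, 0.0, 0.2, 0.0, 0.0, 1.5]         # shaping reward after 2 fix steps
    return screens, actions, rewards
```

Calling `make_sas_trial("A", "L")`, for instance, yields final action R, as in trial type 3 of Table 3.1.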

In Figure 3.2, we show the mean results of the simulations of all the studied MANN models. From the upper barplot, we see that the training of all the models is always successful, satisfying the convergence criterion in each of the 100 simulations. The accuracy of the models is confirmed on the test dataset, composed of 1000 episodes, where the percentage of correct responses is 100% for each model. Given the simplicity of the task, this result is not a big surprise, but it is still a proof of the capacity of the networks to learn how to store task-relevant information in a stable and reliable way.

The difference among the models arises when we look at the learning time. We observe that AuGMEnT is the model that takes the longest to reach training convergence, with a mean time of 2063 trials (s.d. = 837.7). This is probably due to the fact that AuGMEnT is the only model whose memory has no gating mechanism; as a consequence, it accumulates all the stimuli received during a trial and needs more training time to learn which inputs have to be emphasized and which have to be inhibited. The results are still positive, especially when compared with the reference paper by Roelfsema, where convergence of AuGMEnT is achieved 99.45% of the time, with a training time twice as long as in our simulations (around 4100 trials).

On the other hand, the HER model is the fastest to reach convergence in the SAS task, with a mean training time of 339.7 trials (s.d. = 36.9). As opposed to AuGMEnT, the HER memory is equipped with an efficient gating system that allows the two cues to be stored without interference. The better performance of HER with respect to LSTM (mean = 1065, s.d. = 204.7) and DNC (mean = 535.5, s.d. = 65.8) is due to the possibility of customizing the temporal dynamics of HER to the specific task: the hierarchical structure of the model has been reduced to two levels (as many as the number of cues to store) and specific temporal dynamics have been assigned to each HER level. In this way, the network is tailored to the temporal dynamics of the trial, which makes learning faster. The details of the HER dynamics in the SAS task are discussed in the dedicated paragraph below.

The complexity of the DNC model makes it slightly slower than HER, because it has more parameters to train and convergence naturally takes more training iterations. However, observing the trend of the mean prediction error during training (averaged over 10 simulated networks), we can see that the DNC is more stable than HER after convergence: the number of errors committed by DNC remains null after convergence, while HER presents some sparse errors in the first thousands of trials. This is typical of MANNs equipped with an external memory like DNC: once the model learns how to correctly address memory, the stability of the later performance is practically guaranteed.

Figure 3.2. Comparison of the performances on the saccade task. At left, the training efficiency of the models is compared by looking at the mean prediction error during training, computed as the mean number of errors in 50 trials. At right, the model performance is specifically evaluated in terms of learning success (above), i.e. the percent of network realizations that converge according to the SAS criterion, and learning time (below), measured as the number of trials before convergence. Statistics have been averaged over 100 simulations.
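The SAS convergence criterion used above (each trial type above 90% accuracy over its last 50 trials) can be written as a small helper. This is a hedged sketch of one reasonable reading of the criterion, with names of our own choosing:

```python
from collections import defaultdict, deque

# Sketch of the SAS convergence check: training is declared successful once
# every trial type exceeds the accuracy threshold over a sliding window of
# its last `window` trials. Interpretation (per-type windows) is ours.

def make_convergence_checker(window=50, threshold=0.9):
    history = defaultdict(lambda: deque(maxlen=window))
    trial_types = ("P-L", "P-R", "A-L", "A-R")

    def update(trial_type, correct):
        """Record one trial outcome; return True once the criterion is met."""
        history[trial_type].append(bool(correct))
        return all(
            len(history[t]) == window and sum(history[t]) / window > threshold
            for t in trial_types
        )

    return update
```

In training code, `update` would be called once per trial with the trial type and whether the response was correct.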

Moreover, independently of the model, our simulations outperform the experimental results obtained with monkeys. In fact, in spite of the beneficial effect that the shaping strategy has on animal learning, monkeys manage to solve the task only after a long training that lasts for months, with 1000 trials per day: about two orders of magnitude more than the learning time of the analyzed models.

Figure 3.3. Detail of the HER matrices after SAS training. The first row contains the memory matrices for the base (left) and superior (right) level; the second row shows the prediction matrices for the computation of the prediction error. Dark blue entries correspond to high values and vice versa.

Focus on the HER dynamics in the SAS task

Since the HER model emerged as the top-performing model on the SAS task, here we discuss in detail the dynamics in the two levels of the hierarchy to justify the positive results. Thus, we report in Figure 3.3 the memory and the prediction matrices for the two levels after the training phase.

In order to understand the gating dynamics, the reader has to refer to the memory matrix and operate as follows: first, the row associated with the current stimulus has to be selected from both matrices to get the related storing vectors; second, the component on the diagonal (always associated with the presented input) must be compared with the component associated with the current memory content. For instance, if the WM at levels 0 and 1 contains the information R and the current stimulus is A, we consider the second row of both matrices and compare the second entry with the fourth one, choosing the one with the larger value (darker squares correspond to higher storing values); in this specific case, we see that the WM at level 0 updates the memory content with the fixation mark A, while level 1 maintains the previous location information R. Proceeding in this way at each step of a SAS trial, we realize that the WM at the base level tends to store the fixation mark, while the WM at the superior level is responsible for storing the location cue. These dynamics are not random, but depend on the different time scales associated with each level: in particular, the eligibility trace, which takes care of the temporal credit assignment problem, is much higher at the superior level (λ1 = 0.9) than at the inferior one (λ0 = 0.1). In this way, we encourage the base level to store the current stimulus at each timestep and the superior level to find cue correlations along the history of the trial.

As a result of correct gating, the modulated predictions (in the format response-feedback) behave in the right way. For fixation (F) no location cue is needed and the fixation mark is stored in both levels; consequently, the modulation from the superior level emphasizes the entries related to responses FC, LW and RW. Instead, in the go phase the superior level contains the location information and emphasizes the entries according to the fixation mark contained in the base level: if the correct response is left (L), the cells associated with LC, FW and RW are increased; otherwise, the modulation operates in the opposite way to raise the values of RC, LW and FW.
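The row-selection rule described above can be condensed into a few lines. The sketch below uses our own 0-based indexing (P, A, L, R) = (0, 1, 2, 3), and the matrix values are purely illustrative, not the trained weights of Figure 3.3:

```python
import numpy as np

# Minimal sketch of the HER gating rule: for the row of the WM matrix W
# selected by the current stimulus s, compare the storing value for s itself
# (the diagonal entry) with the one for the current memory content m, and
# keep whichever is larger.

def her_gate(W, s, m):
    """W: (n_stimuli, n_stimuli) memory matrix; s, m: stimulus indices."""
    row = W[s]
    return s if row[s] >= row[m] else m  # index of the new memory content

# Illustrative base-level matrix with ordering (P, A, L, R) = (0, 1, 2, 3):
# high diagonal values on the fixation rows make the base level prefer
# storing the fixation mark.
W0 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.2, 0.2, 0.1, 0.0],
               [0.2, 0.2, 0.0, 0.1]])

print(her_gate(W0, s=1, m=3))  # stimulus A while memory holds R -> stores A (1)
```

With these illustrative weights, presenting A while R is in memory overwrites the base level with A, exactly as in the example of the text.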

3.2 The 12-AX Task

Taken from the HER paper (Alexander and Brown, 2015), the 12-AX task is frequently used to test working memory and to diagnose behavioral and cognitive deficits related to memory dysfunctions. Basically, the problem consists in identifying some target sequences within a stream of symbols that appear on a screen. Although the structure of the trials is quite simple, the difficulty of the 12-AX task lies in the number of stimuli to store in memory and in the long and variable-time cue correlations. In fact, 12-AX is an example of a hierarchical task, where stimuli are organized in a hierarchy of dependencies and the correct solution is possible only if the task-relevant cues are stored from the beginning of the trial. Furthermore, unlike the SAS task, the 12AX task is a Continuous Performance Task (CPT), meaning that the agent has to provide a response to each single cue presentation. There exist several variants of the 12-AX task, which differ in small details about the construction of the trials. As done in the reference paper of the HER model, we refer to the version presented in (O’Reilly and Frank, 2006), which includes some distractor cues and describes in detail the criteria for dataset construction and convergence.

3.2.1 Description of the 12AX Task

Figure 3.4. Example of trials in the 12-AX Task. Task symbols are presented on a screen in a sequence composed of outer loops, which start with a digit (1 and 2), and a random number of inner loops. Each cue presentation is associated with a Target/Non-Target response (R/L). Here distractors are not included. Figure taken from (O’Reilly and Frank, 2006).

The general procedure of the task is schematized in Figure 3.4 and the main information for the construction of the 12AX dataset is collected in Table 3.2. The set of possible stimuli consists of 8 symbols: two digit cues (1 and 2), two context cues (A and B), two target cues (X and Y) and finally two distractors (C and Z). Each trial (or outer loop) starts with a digit cue and is followed by a number of inner loops drawn randomly between 1 and 4; inner loops are composed of context-target patterns, like A-X, B-X or B-Y. The distractors are non-task-relevant cues, introduced in the O’Reilly version, that can invalidate the sequence by creating wrong inner loops like A-Z or C-X. The cues are presented one by one on a screen and the task-taker has two possible responses for each of them: Target (R) and Non-Target (L). There are only two valid Target cases: in trials that start with digit 1, the Target is associated with the target cue X if preceded by context A (1-...-A-X); otherwise, in case of initial digit 2, the Target occurs if the target cue Y comes after context B (2-...-B-Y). The dots are inserted to stress that the target inner loop can occur even a long time after the digit cue, as happens in the example sequence 1-A-Z-B-Y-C-X-A-X (whose sequence of responses is L-L-L-L-L-L-L-L-R). The variability in the temporal length of each trial is the main issue in the solution of the 12AX task, because of the temporal credit assignment problem: in case of a target pattern at the end of the outer loop, the model has to maintain the digit cue in memory for a long time in order to respond correctly.

The types of the inner loops are determined randomly, with a probability of 50% to have a potential target (A-X or B-Y). Combined with the probability to have either 1 or 2 as starting digit of the trial, the overall probability of a target sequence is 25%. Since the Target response R has to be associated only with an X or Y stimulus that appears in the correct sequence, the number of Non-Targets L is generally much bigger; on average, every sequence of 10 Non-Targets contains a single Target. The low frequency of Target responses is another source of learning difficulty for the model, because the learning dynamics might become stuck in a local minimum of the loss function, corresponding to a situation where the model always selects the safer response, Non-Target.

Table 3.2. The 12AX task: table of key information

Task feature       Details
Input              8 possible stimuli: 1, 2, A, B, C, X, Y, Z.
Output             Non-Target (L) or Target (R).
Target sequences   1-...-A-X or 2-...-B-Y.
                   Probability of a target sequence is 25%.
Training dataset   Sequence of outer loops starting with 1 or 2.
                   Maximum number is 100000.
Inner loops        Each outer loop contains a random number of inner loops between 1 and 4.
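The construction rules above can be sketched as a generator for a single outer loop. This is our own illustration (names are ours; distractors are omitted for brevity, so every inner loop is a context-target pair):

```python
import random

# Hypothetical generator for one 12AX outer loop, without distractor cues:
# a digit (1 or 2), then 1-4 inner loops, each a potential target (A-X or
# B-Y) with probability 50%. Labels: R for Target, L for Non-Target.

def make_outer_loop(rng=random):
    digit = rng.choice("12")
    cues, labels = [digit], ["L"]
    for _ in range(rng.randint(1, 4)):
        if rng.random() < 0.5:
            context, target = rng.choice([("A", "X"), ("B", "Y")])  # potential target
        else:
            context, target = rng.choice([("A", "Y"), ("B", "X")])  # never a target
        cues += [context, target]
        is_target = (digit == "1" and context == "A" and target == "X") or \
                    (digit == "2" and context == "B" and target == "Y")
        labels += ["L", "R" if is_target else "L"]
    return cues, labels
```

With the 50% chance of a potential target pair and the 50% chance of the matching digit, the overall target probability is the 25% stated above.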

3.2.2 Simulation Results

As we have seen, the 12AX task is more complex than the SAS task for two reasons: a) the number of cues to store in memory to solve the task is bigger; b) the temporal credit assignment problem affects the performance on 12AX more, because of the structure of the trials. As a consequence, we expect the performance of the MANN models to be generally worse than what was observed in the SAS task.

Nevertheless, it is not possible to compare learning results directly across tasks, because the criteria for training convergence are task-specific and are taken from separate reference papers. In fact, according to (Alexander and Brown, 2015), learning convergence in the 12AX task is achieved after 1000 consecutive correct responses. It must be noticed that, 12AX being a CPT task, the correct responses taken into account for convergence also include the Non-Targets, which are generally 10 times more frequent than the Targets.

The models have been tested again on the 12AX, by running 100 simulations on a training dataset composed of a maximum of 10000 outer loops. Figure 3.5 shows the resulting average performance of each model; the parameterization of each model is available in Appendix A.2.

Figure 3.5. Learning performance of the MANN models on the 12AX task. The training phase is analyzed by observing the mean prediction error over 500 trials (left) and with the barplots (right) that summarize the main learning statistics. The AuGMEnT model (green) is not shown in the barplots because it never converges in the 12AX simulations.

The most evident conclusion is that the AuGMEnT model fails to learn the 12AX task in all simulations. In fact, the mean training error (green line) does not decrease quickly enough to reach the convergence criterion, but seems to stabilize around a mean error of 0.3, meaning on average 1 wrong response every 150 cues. The reason for this failure is investigated further in paragraph 3.2.3, but it is clear that the lack of a gating system is a great disadvantage in tasks that involve the storage of multiple inputs.

Regarding the other models, HER and DNC again show similar behavior, but, unlike in the SAS task, the DNC performance is slightly better both in terms of stability and learning efficiency. In fact, the percentage of converged simulations for the DNC model is 100%, while for HER the success percentage is equal to 96%; moreover, the DNC network converges after a mean of 730.6 trials (s.d. = 138) against the 746.4 trials (s.d. = 508.6) needed by HER (consistent with the ∼750 trials indicated in the Alexander & Brown paper). The positive results are confirmed in the test phase, where the HER percentage of correct trials (99.96%) is actually a little higher than that of DNC (99.65%).

Finally, the LSTM presents a good and stable performance during training, achieving convergence in all network realizations with a learning time of 1565.6 trials (s.d. = 371); on the other hand, the statistics get worse during the test phase, where the percentage of solved trials falls to 90.1%.
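The 12AX criterion from (Alexander and Brown, 2015), 1000 consecutive correct responses, reduces to a streak counter. A minimal sketch, with a class name of our own:

```python
# Sketch of the 12AX convergence criterion: training converges after a fixed
# number of consecutive correct responses, Non-Targets included.

class StreakCriterion:
    def __init__(self, required=1000):
        self.required = required
        self.streak = 0

    def update(self, correct):
        """Record one response; return True once the streak is long enough."""
        self.streak = self.streak + 1 if correct else 0
        return self.streak >= self.required
```

Note how a single error anywhere resets the counter, which is what makes the criterion demanding despite the high frequency of easy Non-Target responses.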

In conclusion, we compare the results of our simulations with the available experimental results on human agents. In (Krueger, 2011), human subjects managed to solve the 12AX task in just ∼50 trials, hence with a learning time one order of magnitude smaller than that of our MANN models. However, this superior performance is owed to the symbolic identity of the stimuli used in the task: since the stimuli are numbers and letters, the human mind can easily simplify the structure of the task by proceeding by analogy, understanding for instance that 1 and 2 must have similar functions, as must the groups of letters close in alphabetical order (A-B-C and X-Y-Z). This is a great advantage with respect to the models, which encode the stimuli in a one-hot representation, thus treating all stimuli as orthogonal.

In the same paper (Krueger, 2011), another version of the 12AX task is tested, called adjusted 12AX, in which all symbols are independent and unrelated, so that the comparison with the computational results becomes more meaningful. In the adjusted 12AX, the task is considered solved if the accuracy of the responses is higher than 90% in the last 50 trials with potential targets (X or Y). Under this new convergence criterion, human agents solved the adjusted 12AX in approximately 300 trials, while our simulations with the HER model managed to converge on average in just 220 trials. Moreover, the percentage of success in humans was around 25%, while our simulations met the convergence criterion every time.
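The orthogonality of the one-hot encoding mentioned above is easy to verify concretely; the snippet below is a trivial illustration of our own:

```python
import numpy as np

# The models encode the 8 cues (1, 2, A, B, C, X, Y, Z) as one-hot vectors,
# so every pair of distinct stimuli is orthogonal: no similarity between,
# e.g., A and B can be exploited, unlike for human subjects.

symbols = ["1", "2", "A", "B", "C", "X", "Y", "Z"]
one_hot = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

print(one_hot["A"] @ one_hot["B"])  # 0.0: distinct cues carry no shared structure
```

Any analogy between symbols therefore has to be discovered from the task statistics alone.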

Discussion of HER mechanisms for 12AX task

As previously done for the SAS task, we can analyze the HER dynamics by observing the memory and prediction weight matrices in Figure 3.6. In this case, the HER model has three levels, as many as the number of symbols to store in memory. Using the same rules described in paragraph 3.1.2, we can study the memory matrices to understand the gating organization between the HER levels: the first matrix (upper left), which regulates the WM operations at the base level, is mainly diagonal, indicating that the memory is constantly updated with the current stimulus regardless of what it is (this is needed since 12AX is a CPT task); the intermediate level (upper center) shows a similar behavior for most of the stimuli but, when X or Y are presented to the network, the model prefers to store the context cues A and B respectively; on the contrary, the top level (upper right) has a very defined behavior and is responsible for storing the initial digit cues 1 and 2. As a result, the gating system keeps in memory all the information needed to solve the task at any single cue presentation. Actually, the weight matrices may converge to forms other than those presented in the figure, but here we illustrate the case that is simplest for the reader to understand. Anyway, the other configurations are always slight variations of the organization discussed here.

Figure 3.6. Visualization of HER matrices after training on the 12AX task. The first row presents the WM weight matrices for levels 0, 1 and 2. Analogously, the prediction matrices for the top-down modulation are shown in the second row. Dark blue entries correspond to high values and vice versa.

The prediction modulation is not as easy to discuss using the prediction matrices, because the number of possibilities in the conjunction output-response increases significantly in the transition to the superior level, making the x axis hardly readable. For this reason, we show in Figure 3.7 a zoom of the prediction matrix on the blocks that are most relevant for the overall dynamics.

The prediction matrix at the base level (lower left) already shows the base response rates for each current stimulus: all the entries related to the Non-Target response are particularly emphasized, while the Target response has positive probability only when X or Y are stored at the base level. At level 1 (lower center), the modulation can lead to two effects: a) if the stored stimulus is the same as the memory content of the base level (blocks distributed along the diagonal), the probability to select Non-Target is increased even more; b) otherwise, A or B have been stored in the WM while X or Y are contained at the base level (as shown in the memory matrix for level 1) and the Target response is emphasized. The modulation from the last matrix gives the final contribution to the solution of the 12AX, because it contains the digit cues. It becomes relevant only when the inferior levels have stored the potential target patterns A-X or B-Y, and it determines whether the sequence is indeed a Target or not.

Figure 3.7. Focus on the prediction matrix from the top level of the HER architecture. On the y axis, 1 and 2 are the cues of interest because they are stored at level 2; the lower x axis shows the cues A and B stored at level 1, while the top specifies the sub-block associated with the situation in which X or Y is stored at the base level.

3.2.3 Analysis of AuGMEnT performance in the 12AX task

As seen in the previous section, the AuGMEnT model fails to learn the 12AX task. This is the first important novelty of the present study, because it highlights a significant limitation of the AuGMEnT network. In fact, the model is not capable of solving the temporal credit assignment problem, and tuning the decay constant of the synaptic tags is not enough to fix the performance. In the present section, we investigate the possible reasons behind the weakness of the AuGMEnT memory. In order to accomplish this, we test AuGMEnT on simpler versions of the 12AX task, gradually increasing the complexity until we discover the critical issue in the learning process. Here we report the results for two alternative versions of the task: the AX task and the 12AX-S task. The former is the simplest case, in which we reduce the stimuli to the 4 symbols A, B, X, Y and the only target pattern is A-X; each trial is composed of a single pair of cues and a pattern is a Target with probability 50%. The latter, the 12AX-S, is a simplified version of the original 12AX task in which each outer loop consists of just a cue triplet of type digit-context-target; in this way the temporal scales are fixed and reduced, and the temporal credit assignment problem is limited.

Figure 3.8. AuGMEnT performance on simplified versions of the 12AX task that reduce the temporal window of each trial: the AX and the 12AX-S tasks. Learning success and learning time are shown in the barplots at left, while at right the trend of the mean error during training is shown. Statistics on the 12AX task are not shown because it never converges.

Figure 3.8 shows the comparison of the performance of the AuGMEnT model on the AX, 12AX-S and 12AX tasks. This time the simulations are run for a maximum of 1 million trials, in order to check whether learning is successful when we extend the length of training. We can notice that the AuGMEnT model correctly solves AX (mean = 3356.9, s.d. = 1491.7) and 12AX-S (mean = 5584.4, s.d. = 1198.5), but it still does not converge on the 12AX task. This means that AuGMEnT manages to store three cues in memory despite the memory interference, but it does not support storage over long and variable times, and as a consequence the network cannot detect long-time correlations between stimuli.


3.3 Image classification: the Omniglot Task

3.3.1 Introduction to the Omniglot Task

A more advanced case is the Omniglot task (Lake, Salakhutdinov, and Tenenbaum, 2015a), an image classification task where each image is a symbol from a collection of 50 different alphabets and the model has to identify the symbol correctly. In total there are 1623 characters with 20 samples each, coming from the handwriting of different people. Compared to the well-known MNIST dataset (Lake, Salakhutdinov, and Tenenbaum, 2015b), where the task consists in the classification of handwritten digits between 0 and 9, the number of classes is very high while the amount of samples per class is really small. As a consequence, the structure of the Omniglot dataset represents a great challenge for efficient training.

Table 3.3. Summary of the training parameters for the Omniglot task

Parameter   Description                                    Value
N_ep        Total number of episodes for training          100000
C           Number of characters per episode               5
L           Length of episode                              50
S           Size of the input image                        20x20
θ           Angle interval for data augmentation           [−π/6, +π/6]
Φ           Translation interval for data augmentation     [−10, +10]

Inspired by the procedure described in (Santoro et al., 2016), these two problems have been addressed using data augmentation and an efficient curriculum learning. The former is a common trick to reduce the risk of overfitting and increase the number of samples for each character, which consists in slightly changing the original images to obtain different versions of them. More precisely, the images are randomly shifted along the x and y axes in a range of (−10, 10) pixels and rotated by an angle sampled randomly between −π/6 and π/6 (Figure 3.9, left panel). The difficulty deriving from the high number of classes has been tackled using a curriculum learning scheme which divides the training into episodes containing just 5 classes with 10 (augmented) samples each (Figure 3.9, right panel). In this way, the model has a more manageable number of possible responses, and memory can more easily be used to store internal representations of the characters previously presented to the network. As we will see in the next paragraph, learning the Omniglot task requires a very long training; as a consequence, in order to decrease the computational cost and speed up the simulations, the images have been downscaled to 20x20 pixels.
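The augmentation pipeline can be sketched with plain NumPy. This is our own minimal nearest-neighbor implementation (the thesis presumably relies on a library routine, and a real pipeline would pad instead of wrapping the shifted pixels): rotation in [−π/6, +π/6], integer shift in [−10, 10] pixels, and downscaling to 20x20.

```python
import numpy as np

def rotate_nn(image, angle):
    """Rotate around the image center with nearest-neighbor sampling."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.indices((h, w))
    c, s = np.cos(angle), np.sin(angle)
    # Map each output pixel back to its source location, clipped to bounds
    src_y = np.clip(np.round(cy + (ys - cy) * c - (xs - cx) * s).astype(int), 0, h - 1)
    src_x = np.clip(np.round(cx + (ys - cy) * s + (xs - cx) * c).astype(int), 0, w - 1)
    return image[src_y, src_x]

def augment(image, rng=np.random):
    angle = rng.uniform(-np.pi / 6, np.pi / 6)        # random rotation angle
    dy, dx = rng.randint(-10, 11, size=2)             # random pixel shift
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    return rotate_nn(shifted, angle)

def downscale(image, size=20):
    """Nearest-neighbor downscaling to size x size pixels."""
    h, w = image.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    return image[np.ix_(ys, xs)]
```

Applied to a 105x105 Omniglot image, `downscale(augment(img))` yields one augmented 20x20 sample.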

Figure 3.9. Examples of data augmentation (left panel) and training episode (rightpanel) for the Omniglot Task.

3.3.2 Learning the Omniglot Task

Learning in the Omniglot task is focused more on the memory dynamics than on decoding the input. In fact, since training is divided into independent episodes where images are randomly changed, it is not helpful to look for a mapping between pixels and class labels. On the contrary, the most efficient way to solve the classification task is to learn how to interface with memory, in order to rapidly store relevant information associated with the current image and use it at the next presentation of a sample from the same class. Thanks to the fast memory dynamics, learning in the Omniglot task is an example of one-shot learning, where training is so fast that only one occurrence of a sample is necessary to learn how to respond to it correctly. In addition to being very fascinating as a learning scheme in itself, one-shot learning is particularly interesting for the present study because it is biologically plausible: in everyday life we often need just one experience to learn something new, so there must be a process in the brain that rapidly learns new behaviors with high accuracy.

One-shot learning on the Omniglot task is made possible by adopting a simple strategy: it is enough to provide at each timestep t an input that includes both the sample image x_t and the label associated with the previous sample, y_{t-1}, as shown in Figure 3.10. It is then sufficient to store the internal representation of the image (blue arrow) together with the label given at the next step (red arrow) in the same memory location. In this way, when a sample character from the same class is presented a second time, the model just has to read from the memory location (light blue arrow) that retains all the information necessary to correctly classify the character.

Figure 3.10. Dynamics underlying one-shot learning in the Omniglot task. At each timestep T, the network receives an image of a character and the label associated with the previous image, given at T − 1. One-shot learning is made possible by storing the complementary information in the same memory location, in such a way that it can be addressed by the read head at the second occurrence of the character.
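The label-offset scheme of Figure 3.10 amounts to shifting the label sequence by one step. A sketch of the episode construction, with names of our own (not taken from the thesis code):

```python
import numpy as np

# Builds the (x_t, y_{t-1}) inputs for one episode: at time t the network
# sees image x_t concatenated with the one-hot label of the previous step
# (a zero label at t = 0), while the targets remain the true labels y_t.

def offset_labels(images, labels):
    """images: (T, H, W) array; labels: (T, C) one-hot array."""
    prev = np.zeros_like(labels)
    prev[1:] = labels[:-1]                 # label of the previous timestep
    flat = images.reshape(len(images), -1)
    inputs = np.concatenate([flat, prev], axis=1)
    return inputs, labels                  # network inputs and targets
```

The network can thus only earn the reward for a class at its second or later presentation, which is exactly what forces it to use memory.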

The learning strategy described here opens the door to meta-learning, a type of learning that operates both within and across tasks. The model does not learn the specific task; rather, it learns the strategy to solve the classification problem. This approach is very powerful because it is not constrained to a single task: the same strategy can be applied to any classification problem, independently of the data structure or the classes.

Unfortunately, MANN models with internal memory, like AuGMEnT or HER, are not adequate for learning of this type, because the memory dynamics of such models do not satisfy the requirements of independent storage and selective access to stored data. Furthermore, these models are not capable of supporting an image as an input data structure, because they generally process only stimuli in one-hot (or comparably simple) representations. On the contrary, the MANN networks with external memory are more suitable for the Omniglot task, as they can store the internal representations of the images in separate, independent memory locations that are easily addressable by the read heads at the next presentation of the same symbol. Consequently, in the following paragraph we report the simulation results obtained with the DNC model.

3.3.3 One-shot learning of the DNC model

Training on Omniglot is a very long process, because the model has to learn that memory addressing is more relevant than the usual weight updates in the neural controller. The training dataset is composed of 100000 batches of 16 episodes with 50 images each, as illustrated in (Santoro et al., 2016). In the same paper, the authors propose a variation of the NTM network, called One-shot NTM, that enables one-shot learning in classification tasks like Omniglot. The difference with respect to the classical NTM dynamics consists in a new memory access module, named LRUA (Least Recently Used Access). Basically, the approach in LRUA is to write new information either to the last memory locations addressed, in order to update them with new important information, or to the most rarely used ones, to optimize memory space. The details of this approach will be discussed in the next chapter; here we use One-shot NTM just as a term of comparison for the performance of the DNC model during the test phase. The parameterization of the two models (reported in Appendix A.3) is identical in both the structural and the optimizer settings, in such a way that the model comparison is clearly focused on their different dynamics.
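The core of the LRUA idea described above can be sketched as an interpolation between two write targets. This is a simplified sketch in our own notation of the write weighting from (Santoro et al., 2016), with a scalar gate and a hard least-used choice for illustration:

```python
import numpy as np

# LRUA write weighting, simplified: writes interpolate between the previous
# read weights (update recently used content) and a one-hot vector over the
# least recently used slot (write to rarely used memory), via a learnable
# gate alpha passed through a sigmoid.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrua_write_weights(prev_read_w, usage, alpha):
    w_lu = np.zeros_like(usage)
    w_lu[np.argmin(usage)] = 1.0          # least recently used memory slot
    g = sigmoid(alpha)
    return g * prev_read_w + (1.0 - g) * w_lu
```

A strongly positive alpha updates the location just read; a strongly negative one redirects the write to the least-used slot, which is what keeps the memory from overwriting still-needed character representations.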

One-shot learning can be analyzed by observing how classification accuracy improves after each presentation of a character during an episode. For instance, imagine that the letter a is presented to the network in different forms 10 times in an episode, and each time it is either recognized as a or not; the same holds for the other 4 characters (for simplicity, b, c, d and e), and again we can keep track of the success (or failure) of the classification of each character across the 10 repetitions. We can then average the classification accuracy over the 5 symbols and over 500 batches to get a sequence of accuracies within an episode. In one-shot learning, we expect the model to show high accuracy already at the second presentation of the same class, increasing progressively during training.
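The averaging procedure just described can be sketched in a few lines of NumPy; the function and argument names are illustrative, not the thesis code:

```python
import numpy as np

def accuracy_per_instance(outcomes):
    """Average classification accuracy at each presentation index.
    `outcomes[c][k]` is 1 if the k-th presentation of class c in an
    episode was classified correctly, 0 otherwise. Averaging over the
    class axis (and then over batches of episodes) yields the
    per-instance accuracy curves analysed in the text."""
    return np.asarray(outcomes, dtype=float).mean(axis=0)
```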


Chapter 3. Learning Across Tasks

In Figure 3.11 we show the learning performance of the DNC model during training, both with and without data augmentation.

Figure 3.11. Performance of the DNC model on the Omniglot task during training. Classification accuracy is measured at each of the 10 presentations of a character and averaged across the classes in an episode; the lines follow the evolution of this accuracy during the training phase.

We can notice that in both cases the model learns the task successfully, and that the line corresponding to the accuracy at the 2nd occurrence (orange) indeed reaches a high percentage of classification success. Moreover, the increase is not gradual: it is a rapid escalation after an initial stable phase in which the model assigns labels almost at random. However, as expected, learning appears more successful when we do not preprocess the images using data augmentation (left panel), because fitting un-rotated and un-shifted images is easier than in the augmented case (right panel). Indeed, when data augmentation is skipped, classification accuracy reaches 96% at the 2nd instance and more than 98% at the 10th, while with data augmentation the percentages fall to 89% for the 2nd instance and 96% for the last one.

Nonetheless, data augmentation is employed to reduce the risk of overfitting, which can lead to poor performance on new images not experienced during training. Therefore, the performance of the DNC model has been tested with and without data augmentation also on never-before-seen images, to check whether the high levels of accuracy are maintained on characters that cannot have been learnt during training. For the test phase, we have analyzed the mean classification accuracy over 100 batches of 16 episodes containing characters sampled from a collection of alphabets disjoint from the one used for the training phase. The test statistics from the simulations are reported in Table 3.4 together with the accuracy percentages of other models available


Table 3.4. Test statistics on the Omniglot task: classification accuracy (%) by instance

Model               | 1st  | 2nd  | 3rd  | 4th  | 5th  | 10th
--------------------|------|------|------|------|------|-----
Human               | 34.5 | 57.3 | 70.1 | 71.8 | 81.4 | 92.4
Feedforward         | 24.4 | 19.6 | 21.1 | 19.9 | 22.8 | 19.5
LSTM                | 24.4 | 49.5 | 55.3 | 61.0 | 63.6 | 62.5
One-shot NTM        | 36.4 | 82.8 | 91.0 | 92.6 | 94.9 | 98.1
DNC (−augmentation) | 37.4 | 61.6 | 68.2 | 72.2 | 76.2 | 81.0
DNC (+augmentation) | 29.4 | 81.4 | 86.6 | 89.0 | 90.0 | 93.0

from (Santoro et al., 2016). The first thing we can notice is that data augmentation is fundamental for good performance in the test phase: even though image preprocessing makes training more difficult, the model thereby learns to adapt to different versions of the characters, and consequently also to new characters. As a result, without image augmentation the classification accuracy of DNC is 61.6% at the second instance and 81.0% at the last one, while with data augmentation the percentages of correct responses are competitive with the performance of One-shot NTM. Still, DNC (with data augmentation) shows slightly lower classification accuracy than One-shot NTM at each instance. This is probably because One-shot NTM has a simple access module (LRUA) that is designed specifically for classification tasks constructed like Omniglot; DNC, on the contrary, is equipped with a more complex addressing scheme that employs multiple approaches in parallel and can adapt to several types of task, but not as well as methods implemented ad hoc for a specific task.

In particular, we can also compare the DNC performance (with data augmentation) with human learning. A group of human participants took part in the experiment: sample characters were shown on a screen, divided into episodes as in the computational training; participants had to select a label and then received visual feedback together with the correct response (regardless of whether their action was correct or not). We can observe that the accuracy of human performance is generally lower than that of DNC, especially at the second presentation of a class, where DNC (with data augmentation) reaches a classification accuracy of 81.4%, against 57.3% for humans.


Chapter 4

Bio-plausible Development of MANN Models

In the previous chapters we have acquired some confidence with the performance of the MANN models, discovering their limits and their benefits. In particular, we reached two important conclusions: 1) on the one hand, the AuGMEnT model has strong biological foundations, but it suffers from low solving ability, which prevents it from solving hierarchical tasks with a temporal credit assignment problem like the 12AX task; 2) on the other hand, the DNC network is a top performer in the cutting-edge field of MANN models: it performed very well on the tested tasks in terms of learning time and stability and can solve a wide variety of tasks; however, it lacks biological plausibility, mainly due to its artificial memory addressing. We now continue the exploration of MANN models by developing AuGMEnT and DNC so as to overcome their respective drawbacks. Specifically, we will move in two different directions:

• from high biological plausibility to increased task-learnability: starting from the biologically based AuGMEnT model, we propose possible modifications at both the structural and the learning level to extend its learnability to new tasks without loss of biological plausibility;

• from high learning power to increased bio-plausibility: the addressing mechanism of DNC is reduced to pure content-based associations and coupled with the LRUA module to test one-shot learning ability.

4.1 Increase Learning Power in AuGMEnT

The main goal of this section is to develop the AuGMEnT model so as to obtain a biologically plausible MANN model that, unlike the original, is able to correctly solve tasks with a higher memory demand. In the present


work we will extend the AuGMEnT model to learn the 12AX task.

The first step in the development of AuGMEnT is to slightly change the memory dynamics to tackle the temporal learning problems observed in Section 3.2.3. The modification consists simply in adding a leaky dynamic to the cumulative process in memory. In this way memory storage is greatly simplified, because useless information can be released and memory interference is significantly reduced. The resulting model is named leaky AuGMEnT and, as we will see, this simple change is surprisingly enough to obtain stable learning of the 12AX task. However, learning time is still high compared with the efficient performance of HER or DNC seen in Section 3.2.

As a consequence, in this section we propose two further approaches to address this issue and improve learning efficiency (see Figure 4.1):

A. Increase the depth of the network to augment the level of abstraction it can reach to solve tasks. The modification is not only structural, but also involves learning problems linked to the backpropagation of the error in a deeper network. Depth augmentation can be applied independently to the controller branch or to the memory branch (or both).

B. Inspired by the structure of the HER model, develop a hierarchical memory by stacking multiple memory layers and enforcing different time dynamics on them. In addition, we include gating variables specific to each level to reduce memory interference.

The parameter settings for each variant of the AuGMEnT model can be found in Appendix A.4.

4.1.1 Leaky-AuGMEnT

As anticipated in the introduction, leaky-AuGMEnT is a slight modification of the original model that aims to improve the memory dynamics by providing an easy mechanism to free the memory from useless content. The simplest way to add a forget effect is to employ a leak rate constant ϕ ∈ [0, 1] that multiplies the previous memory content at each time step. As a result, equation (2.4) changes to:

    inp^M_t = ϕ inp^M_{t−1} + s^{trans}_t V^M_t        (4.1)


Figure 4.1. Two variants of the AuGMEnT model: A) deep AuGMEnT, obtained by increasing the depth at the association level, and B) hierarchical AuGMEnT, where multiple memory layers are stacked to reproduce a hierarchical memory.

Consequently, for consistency the synaptic trace decays with the same rate ϕ, so we have that:

    sTrace_{i,j} = Σ_{τ=1}^{t} ϕ^{τ−1} s_i        ∀ j = 1, 2, . . . , M        (4.2)
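The leaky dynamics of equations (4.1)–(4.2) can be sketched in a few lines of NumPy; variable names and shapes are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def leaky_memory_step(inp_m, s_trans, V_m, phi=0.7):
    """One step of the leaky accumulation of eq. (4.1): decay the
    previous memory input by the leak rate phi, then add the drive
    from the transient units through the weight matrix V^M."""
    return phi * inp_m + s_trans @ V_m

def leaky_trace_step(s_trace, s_trans, phi=0.7):
    """Recursive form of the leaky synaptic trace: the trace decays
    with the same rate phi (cf. eq. (4.2)) and accumulates the new
    transient input."""
    return phi * s_trace + s_trans
```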

In this way, the model is able to forget part of the memory content when it is no longer relevant for the task, as the forget gate does in the LSTM or GRU models. As opposed to the forget gate, the memory leak rate is not trained but is a fixed constant; in addition, the leak coefficient is not specific to a cell in the layer but applies to the whole content of the memory layer. The results of this straightforward modification are surprisingly positive: in Figure 4.2 (left) we show that leaky-AuGMEnT (leak coefficient ϕ = 0.7) is more successful in learning the 12AX, leading to 100% converged simulations. Nevertheless, in this new setting learning is slow, with a mean of 24150.1 trials (s.d. = 8391.2) to convergence, especially compared with HER performance, which is two orders of magnitude faster. The reason for this behavior is that, thanks to the leaky memory, the model can forget useless information like stimuli stored during previous inner loops; but at the same time, it has to learn for a long time how to emphasize the initial digit cue so that a trace of that information is maintained in memory in spite of the leaky dynamics. This is visible in the converged weight matrix V^M (Figure 4.2, right), where we can see that the digit cue (1±, 2±) is separately stored in memory cells M4 and M6 with sufficiently high weight values.


Figure 4.2. Improvement in the performance of the AuGMEnT model on the 12AX task with leaky memory with respect to the original conservative memory: the percentage of success (left) increases up to 100%, reaching convergence in around 25000 trials. On the right, we show the detail of the converged memory weight matrix that connects the transient units (on and off) to the memory cells of the leaky AuGMEnT network.

4.1.2 Deep AuGMEnT

Depth Augmentation: derivation and consequences

In machine learning it is well known that depth augmentation generally improves the performance of a neural network, but at the cost of higher complexity in the learning rules and in the computational cost of the simulation. For this reason, very deep networks have to be used with caution, and often other alternatives to improve performance are preferred. Here, we increase the depth of the AuGMEnT network by adding one hidden layer on the controller branch, on the memory branch, or on both, and we derive the corresponding rules to backpropagate the RPE. The resulting model is named deep AuGMEnT. The purpose of this modification is to see whether the deeper network is able to catch and maintain in memory associations for sufficient time to solve the temporal credit assignment problem and consequently the 12AX.

The modifications to the feedforward step are quite straightforward. Naturally, the dynamics of the network change depending on whether depth augmentation is applied to a single branch or to both; here we write the equations for the latter, more general case. The additional hidden layers contain sigmoidal units that follow dynamics analogous to the regular layer in AuGMEnT (equation (2.2)):

    h^R_t = σ(y^R_t W^R_t)        h^M_t = σ(y^M_t W^M_t)        (4.3)


where σ is the sigmoidal function. The computation of the Q-values changes according to the new connections into the output, collected in the weight matrices H^R and H^M:

    q = h^R H^R + h^M H^M        (4.4)
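Equations (4.3)–(4.4) can be sketched as a small NumPy forward pass; all names and shapes below are assumptions for illustration, not the thesis code:

```python
import numpy as np

def sigma(x):
    """Sigmoidal activation."""
    return 1.0 / (1.0 + np.exp(-x))

def deep_forward(y_r, y_m, W_r, W_m, H_r, H_m):
    """Illustrative forward pass of deep AuGMEnT with a hidden layer
    on both branches: sigmoidal hidden units h^R, h^M on top of the
    regular (y^R) and memory (y^M) association activities (eq. (4.3)),
    then Q-values through H^R and H^M (eq. (4.4))."""
    h_r = sigma(y_r @ W_r)          # regular-branch hidden layer
    h_m = sigma(y_m @ W_m)          # memory-branch hidden layer
    q = h_r @ H_r + h_m @ H_m       # Q-values over the actions
    return h_r, h_m, q
```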

The attentional feedback mechanism of AuGMEnT changes more significantly after depth augmentation. In fact, the error signal has to be backpropagated through more layers, in such a way that upstream neurons receive precise information on the activity of all downstream synapses. The most evident consequence is that the locality condition for synaptic updates is violated in deep AuGMEnT; this issue will be addressed in the next paragraph. In the following, we illustrate how the tagging rules change in deep AuGMEnT, using the same backpropagation-like procedure of the original model. We apply the chain rule to backpropagate the error from the final activity layer (k units), through the hidden layer (h units), up to the upward connections between the association layer (j units) and the sensory layer (i units). In order to backpropagate the feedback, we need the feedback matrices (H^R)′ and (H^M)′ from both branches, which connect the activity layer with the hidden layers; analogously to the relationship between W and W′, the feedback matrices H′ are initialized randomly but are forced to evolve as their forward counterparts H, so that they become similar during learning and the feedback weights can be used in the tagging rules instead of the forward ones. The final set of tagging equations is:

    ΔTag_{hk} = −α Tag_{hk} + h_h z_k
    ΔTag_{jh} = −α Tag_{jh} + y_j Φ_{ah}
    ΔTag^R_{ij} = −α Tag^R_{ij} + s_i y^R_j (1 − y^R_j) Σ_h W′_{hj} Φ^R_{ah}
    ΔTag^M_{ij} = −α Tag^M_{ij} + sTrace_{ij} y^M_j (1 − y^M_j) Σ_h W′_{hj} Φ^M_{ah}        (4.5)

where Φ_{ah} = h_h (1 − h_h) H′_{ah} is the modulator of the error signal up to hidden unit h. The upward tagging rules for the i–j connections are separated for the memory and regular branches because of the different dynamics in the association layer: as seen in standard AuGMEnT, the synaptic trace sTrace keeps track of the history of stimuli in the memory branch. Analogously to the original AuGMEnT model, the synaptic tags are then combined with the RPE neuromodulator δ_t to update the weights of the network, i.e. w_{t+1} = w_t + β δ_t Tag_t.
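The last tagging rule of (4.5), together with the RPE-gated weight update, can be sketched as follows; shapes and names are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def memory_tag_step(tag, s_trace, y_m, W_fb, phi_m, alpha):
    """Last rule of eq. (4.5): Tag^M_ij decays by alpha and accumulates
    sTrace_ij * y^M_j (1 - y^M_j) * sum_h W'_hj Phi^M_ah.
    Assumed shapes: tag (I, J), s_trace (I,), y_m (J,),
    W_fb (H, J) holding W'_hj, phi_m (H,) holding Phi^M_ah."""
    feedback = phi_m @ W_fb                       # sum_h W'_hj Phi^M_ah -> (J,)
    return tag + (-alpha * tag + np.outer(s_trace, y_m * (1.0 - y_m) * feedback))

def weight_step(w, tag, delta, beta):
    """RPE-gated plasticity: w_{t+1} = w_t + beta * delta_t * Tag_t."""
    return w + beta * delta * tag
```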

Analogously to what was done in Chapter 2.1.5, we limit ourselves to deriving the most


complex case of the synaptic tagging, i.e. the last update rule in (4.5), involving the tags Tag^M_{i,j} that connect the transient units with the memory layer at the top of the memory branch. Again, we show the derivation for the case without synaptic decay (α = 1), but the conclusion still holds for the more general case with α ≠ 1.

Proof. We want to prove that the tag equation for the connections that go into the memory layer is:

    Tag_{ij} = sTrace_{ij} y_j (1 − y_j) Σ_h W′_{hj} h_h (1 − h_h) H′_{ah}        (4.6)

where we omitted the superscript (·)^M associated with the memory branch to simplify the notation.

As seen in the proof in Chapter 2.1.5, this corresponds to asking that ∂q_a/∂V_{ij} equal the right-hand side of equation (4.6). Thus, we proceed by propagating the feedback error from the selected action a back to synapse i–j using the chain rule, that is:

    ∂q_a/∂V_{ij} = (∂inp_j/∂V_{ij}) (∂y_j/∂inp_j) Σ_h [ (∂inp_h/∂y_j) (∂h_h/∂inp_h) (∂q_a/∂h_h) ]        (4.7)

where we sum over all the units in the hidden layer that contribute to the feedback signal down to unit j. At this point, we employ:

• the feedback connectivity matrices H′ and W′ to transport the error through the layers, so that ∂inp_h/∂y_j = W′_{hj} and ∂q_a/∂h_h = H′_{ah};

• the derivative of the sigmoid function to take into account the effect of the activation function, so that ∂y_j/∂inp_j = y_j (1 − y_j) and ∂h_h/∂inp_h = h_h (1 − h_h);

• the approximation of slow learning time scales during a trial, such that V_{ij}(τ) = V_{ij}(t) ∀ t_0 ≤ τ < t, to obtain ∂inp_j/∂V_{ij} = sTrace_{i,j}, with sTrace_{i,j} defined as in (4.2).

In conclusion, it is sufficient to put all the components together to obtain the desired formula.

As anticipated, depth augmentation can be applied independently to one or both branches of the AuGMEnT network. We have therefore analyzed the performance of deep AuGMEnT in all three possible configurations: a) structure


Figure 4.3. Results of the simulations of deep AuGMEnT on the 12AX task. The usual statistics (learning success, left panel; learning time, right panel) are shown for leaky-AuGMEnT and three configurations of deep AuGMEnT: 1) deep controller branch (DC), 2) deep memory branch (DM) and 3) deep branches on both sides (DMC).

with a deep controller branch (DC), b) a structure with a deep memory branch (DM) and c) a structure with augmented depth on both branches (DMC). In Figure 4.3 we show the mean results over 100 simulations of each variant of deep AuGMEnT on the 12AX, compared to the performance of leaky-AuGMEnT. The performance of each deep structure is similar and there are no evident differences in learning, especially looking at the width of the error bars. However, we can consider the DMC configuration of deep AuGMEnT to be slightly more efficient than the others: it reaches convergence 99% of the time with a mean learning time of 20756.6 trials (s.d. = 6561.6), while the DC (mean = 25225.2, s.d. = 8234.0) and DM structures (mean = 22417.9, s.d. = 7985.6) converge in every simulation but require more training trials. Compared to leaky-AuGMEnT, however, the improvements of deep AuGMEnT appear to be modest, showing that the difficulties due to memory interference cannot be overcome with depth augmentation alone. Moreover, the benefit of depth augmentation comes at a cost on the biological side: the violation of the locality condition. In the next paragraph, we adopt different backpropagation algorithms that are considered to solve the typical weight-transport problem of standard backpropagation.

Biologically Plausible Alternatives to BP

As explained in Chapter 2.1.6, the propagation algorithm of the RPE in AuGMEnT is almost analogous to standard backpropagation (Figure 4.4, upper left),


with two main exceptions: the definition of the training error as a scalar associated with the selected response, and the random initialization of the feedback matrix B. However, the feedback matrix is forced to evolve in the same way as the forward matrix W (i.e. they have the same tagging rule); as a result, the effect of the initialization becomes less and less important and the two matrices become similar during training. Consequently, the backpropagation nature of the AuGMEnT algorithm does not affect the biological plausibility of the model, because the constraint of symmetry between the feedforward and feedback connectivities is (partially) resolved, and the locality condition holds as long as the AuGMEnT network is less than three layers deep. When we apply depth augmentation to the AuGMEnT network, the bio-plausibility of the attentional feedback step is weakened, because the locality condition of the update rule is no longer respected. In fact, applying the chain rule as in equation (4.7), we reach a point where the information needed to correctly backpropagate the error towards the input is no longer available locally, but involves downstream dynamics that occur after the postsynaptic unit (it is sufficient to look at the derivative of the activation function of the hidden units h_h in (4.6)).

As a consequence, in order to preserve the biological plausibility of AuGMEnT, we turn to a class of techniques that are biologically plausible alternatives to standard backpropagation, the so-called Random Backpropagation (RBP) methods (Figure 4.4, upper right). In (Lillicrap et al., 2016), RBP is proposed as a valid alternative to backpropagation that generally maintains high efficiency and solves the biological issues deriving from the symmetry constraint and the weight-transport problem. RBP is based on a simple modification of BP: the feedback weights are no longer equal (or similar) to the transpose of the feedforward connections; instead, the feedback matrix B is simply a random fixed matrix. This differs from AuGMEnT because the feedback matrix is not trained at all, while the forward one keeps updating as usual. The symmetry imposed on the connectivity patterns is thus evidently overcome; in addition, the locality condition is respected because, as explained in (Lillicrap et al., 2016), unlike the BP algorithm, the error signal in RBP does not have to take into account the derivatives of the activation functions of the downstream layers. In general, the correct functioning of random backpropagation is guaranteed by a process called Feedback Alignment (FA), through which the learning dynamics adjust to behave as in standard backpropagation. In fact, even though


Figure 4.4. Different versions of the backpropagation technique (BP) that are more biologically plausible, respecting the locality condition and removing the constraint on the symmetry of the weights in the feedforward and feedback steps. Random backpropagation (RBP) employs random fixed weight matrices to backpropagate the error signal δ_RBP down the network. Skipped random backpropagation (SRBP) follows the same approach but, in deep networks with more than one hidden layer, propagates the output error δ_SRBP directly to all hidden layers without intermediate modulations. Finally, the feedback δ_MRBP in mixed random backpropagation (MRBP) is a convex combination of RBP and SRBP.

the feedback connections are kept fixed, the forward connections update in such a way as to adapt to the feedback signal. During the FA process, the angle between the modulation signal from BP (δ_BP) and from RBP (δ_RBP) is on average always smaller than π/2, meaning that both signals push the learning dynamics in the same direction during training. In Figure 4.5 we show the result of FA after applying the RBP algorithm on deep AuGMEnT with deep memory. We can see that the angle between the feedforward and feedback weight matrices is indeed always smaller than π/2 but, as shown in the original paper (Lillicrap et al., 2016), it never reaches zero. Unfortunately, feedback alignment is a slow process in deep AuGMEnT that requires a small learning rate. In fact, unlike standard backpropagation, the error signal at each time step is partial, because it is restricted to


the selected action, and consequently the adaptation process does not occur in parallel over all the components of the forward matrices. Thus, we encourage response exploration by increasing the exploration rate parameter ε from 0.025 to 0.05, and we reduce the learning rate β to 0.08. In this way, alignment works better because there is coherent adaptation of the weights over the whole matrix. Unfortunately, as we will see in the simulation results, this slow learning penalizes the performance of the model, which then needs more training time to converge.

Figure 4.5. The angle between the forward and feedback weight matrices during training, highlighting the effect of feedback alignment in random backpropagation for deep AuGMEnT with augmented depth on the memory branch. The angle is computed from the scalar product between the matrices reshaped to vectors.

Here we also present two variants of RBP: skipped random backpropagation (SRBP) and mixed random backpropagation (MRBP). SRBP is proposed in (Baldi, Sadowski, and Lu, 2016) as a more efficient alternative to RBP that avoids backpropagating the error signal through the layers. As can be seen in Figure 4.4 (lower left), the strategy of SRBP consists in connecting the output layer to each hidden layer with random fixed matrices, so that the output error signal is transmitted directly to all the layers without intermediate modifications. In this way, the mathematical analysis of the model is facilitated and the computational cost of the simulation drops, because in most cases (as in classification tasks) the number of units in the output layer is much smaller than in the hidden layers. Inspired by an indication in the same paper, MRBP is another alternative random backpropagation algorithm that combines the effects of RBP and SRBP. It consists in a modulation signal that is a convex combination of the


error signals from RBP and SRBP, i.e. δ_MRBP = a δ_RBP + (1 − a) δ_SRBP (Figure 4.4, lower right). Even though the computational cost is eventually higher, because the error has to be backpropagated in both ways, the advantage of such a combination rests on a biological consideration: the multiple connections between layers recall the long-range and short-range interactions in a population of neurons. In spite of the lack of clear indications on how to weight the two contributions, in the simulations the weight parameter a is fixed to 0.8, to stress that short-range interactions are considered to have a stronger effect than long-range ones.
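The MRBP combination is a one-liner; a minimal sketch with the weighting used here (names are illustrative):

```python
import numpy as np

def mrbp_error(delta_rbp, delta_srbp, a=0.8):
    """Mixed random backpropagation: the modulation signal is a convex
    combination of the layer-by-layer RBP signal and the direct SRBP
    signal, delta_MRBP = a * delta_RBP + (1 - a) * delta_SRBP, with
    a = 0.8 as in the simulations (short-range interactions weighted
    more than long-range ones)."""
    return a * np.asarray(delta_rbp) + (1.0 - a) * np.asarray(delta_srbp)
```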

In Figure 4.6 we show the mean results of deep AuGMEnT with the different variants of BP applied to the 12AX task. We focus on the DMC structure of the model because it emerged as the most promising configuration in the previous analysis. We can see that learning time greatly increases when we use random backpropagation methods, which need about five times the number of trials that deep AuGMEnT requires to converge. This confirms that the feedback alignment process is slow and takes some time to stabilize correctly. In particular, the learning time of RBP (mean = 95412.4, s.d. = 59181.4) and MRBP (mean = 100052.5, s.d. = 38485.9) appears to be higher than that of SRBP (mean = 75905.0, s.d. = 34054.9).

Figure 4.6. Results of the simulations of deep AuGMEnT (DMC configuration) on the 12AX task, trained with three different variants of backpropagation: Random Backpropagation (RBP), Skipped Random Backpropagation (SRBP) and Mixed Random Backpropagation (MRBP). Learning performance is compared with leaky and standard deep AuGMEnT.

In conclusion, deep AuGMEnT improved on the performance of leaky-AuGMEnT, but not significantly enough to compete with the other MANN models in this


study; on the contrary, when we try to adapt it to alternative backpropagation techniques to enhance biological plausibility, learning in deep AuGMEnT is penalized by the slow feedback alignment process.

4.1.3 Hierarchical AuGMEnT

The comparative analysis in Chapter 3 highlighted that the most important disadvantage of the memory dynamics of AuGMEnT, with respect to the other MANN models, is the lack of a gating system. A gating mechanism could help AuGMEnT reduce memory interference and speed up learning. However, gating is difficult to include in the model in a biological and differentiable way: it is sufficient to think of the gating system in HER, based on comparisons of stored values, to understand that most of the time gating results from non-differentiable and artificial decision making.

In the present section we introduce another possible development of the AuGMEnT model, which aims to reduce memory interference by combining two different approaches: hierarchical organization and memory gating. The former, evidently inspired by the multi-level structure of HER, consists in dividing the memory layer into multiple independent levels of the same size, characterized by different temporal dynamics, so that each level can work on a different time scale of the task. On the other side, the gating mechanism takes inspiration from the implementation in DNC, which we know to be differentiable: in this case, gating is to be intended in the multiplicative sense, as variables in [0, 1] that multiply the input (or the output) to modulate the amount of information to store. The combined use of these two modifications leads to the proposed model, named hierarchical AuGMEnT.

Hierarchical Memory Organization

The hierarchical organization in AuGMEnT can be easily obtained by stacking the memory layer multiple times. This corresponds to having a matrix y^M ∈ [0, 1]^{L×M}, where L is the number of memory levels. It is then sufficient to add the level dimension also to the weight matrices that go into and out of the memory layer, ending up with three-dimensional matrices V^M ∈ R^{L×2S×M} and W^M ∈ R^{L×M×A}. The different levels thus contribute in the same way to the output and receive


the same input.As done in HER, the levels differ in the temporal dynamics that are assignedby the user to capture the different temporal correlations between stimuli. Toachieve this in AuGMEnT, we tune the synaptic decay parameter αl for eachlevel in such a way that the temporal effect of stimuli is absorbed in differ-ent ways by the various memory levels. High values of αl (i.e. low valuesof λl) correspond to fast decay dynamics where persistance of tags is limitedin time and only recent information are available to the synapses, and viceversa. In this way, there is a separation of time scales that makes that theinputs are processed differently at each level.
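The stacked memory layer described above can be sketched as follows. This is a minimal NumPy illustration with assumed toy sizes; all variable names, initializations and decay values are illustrative, not the trained model:

```python
import numpy as np

# Hypothetical sketch of the hierarchical memory layer:
# L levels of M memory units each, S sensory units (2S transient inputs),
# A action units. All sizes and weights are toy assumptions.
L, S, M, A = 3, 4, 10, 3
rng = np.random.default_rng(0)

V_M = rng.normal(scale=0.1, size=(L, 2 * S, M))  # input -> memory, per level
W_M = rng.normal(scale=0.1, size=(L, M, A))      # memory -> output, per level

# Per-level decay: a high alpha_l means a fast time scale (assumed values).
alphas = np.array([0.9, 0.5, 0.1])

y_M = np.zeros((L, M))        # memory activities, one row per level
s_trans = rng.random(2 * S)   # transient sensory input, shared by all levels

for l in range(L):
    # each level integrates the same input on its own time scale
    y_M[l] = (1.0 - alphas[l]) * y_M[l] + s_trans @ V_M[l]

# all levels contribute in the same way to the action values
q = np.einsum('lm,lma->a', y_M, W_M)
```

The einsum sums the per-level contributions into a single vector of A action values, reflecting the statement that every level feeds the output in the same way.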

Adding a Gating System to AuGMEnT

In DNC the gating variables are collected in the interface vector of equation (2.36), which is emitted by the neural controller; they are then used to decide whether to write the new data or to forget the previous content, in a multiplicative approach similar to the one applied in LSTM or GRU units. It is important to specify that gating operations do not affect the biological foundations of the AuGMEnT model as long as the gates are local variables, specific to a neuron or a group of neurons. Gating can in fact be biologically justified through chemical reactions involving biomolecules such as catalysts or neuromodulators.

In the hierarchical AuGMEnT model, the values of the gates g ∈ [0, 1]^L are computed in the regular branch, where each component is associated with a memory level. The gating mechanism can be implemented in different ways; in the following we present the three gating options that we have explored in our work:

A. Sigmoidal input gate. This option is the simplest one and relies on the leak constant of leaky-AuGMEnT to free the memory from useless information. The input gates g are computed from the sensory units of the controller branch via the weight matrix V^G ∈ R^{S×L}:

g = σ(s V^G)    (4.8)

using the sigmoid as activation function so that the gate variables live in [0, 1]. The gates are then applied to the new input of the memory layer; thus we get:

inp^M_l(t) = φ inp^M_l(t−1) + g_l(t) s^trans(t) V^M_l(t)    (4.9)

B. Sigmoidal input and forget gates. The input and forget gates are computed in the same way as in equation (4.8), but, unlike option A, in this case the leaky dynamics are regulated by trainable variables (the forget gates) that depend on the current stimulus and are specific to each level. The resulting memory dynamics are:

inp^M_l(t) = f_l(t) inp^M_l(t−1) + g_l(t) s^trans(t) V^M_l(t)    (4.10)

C. Softmax input gate. The softmax operation returns a weighting vector whose components, one per level, sum to one. The computation of the gates is enriched with a temperature parameter ξ > 1 that emphasizes the dominant entry in order to encourage level selectiveness:

g_l = exp(ξ s V^G_l) / Σ_{l'=1}^{L} exp(ξ s V^G_{l'})    (4.11)

Gating is then applied to memory level l in a way that recalls the dynamics of GRU units:

inp^M_l(t) = (1 − g_l(t)) inp^M_l(t−1) + g_l(t) s^trans(t) V^M_l(t)    (4.12)

In this way, the gate variable g_l works at the same time as an input gate and as a forget gate, and allows the memory to maintain a balance between entering information and stored content: if g_l is close to 1, the previous content of level l is released to make room for the new information and avoid interference; conversely, if g_l is close to 0, the current stimulus is rejected and the old memory is maintained over time.

In this case each level learns to store a single input in its memory and to minimize the risk of memory interference, as done in HER. However, unlike HER, which combines the information coming from the contents of each memory level in a modulation process, AuGMEnT does not have a communication system between levels. As a result, the network has to be adapted so that it can aggregate the stored information in an intermediate level. This can be achieved either by adding a further hidden layer, similarly to what is done in deep AuGMEnT, or by connecting the hierarchical memory to the regular layer of the controller branch, as done in many MANN models with external memory, like One-shot NTM.
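The three gating options can be compared side by side in code. This is a hedged NumPy sketch with toy sizes and random weights; the forget-gate matrix V_F and all initializations are assumptions introduced for the illustration, not part of the original model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
L, S, M = 3, 4, 10
s = rng.random(S)                     # sensory activity of the regular branch
s_trans = rng.random(2 * S)           # transient input to the memory layer
V_G = rng.normal(size=(S, L))         # gate weights (hypothetical init)
V_M = rng.normal(size=(L, 2 * S, M))  # per-level input-to-memory weights
inp_prev = np.zeros((L, M))           # previous memory input, per level
phi = 0.7                             # leak constant of leaky-AuGMEnT (assumed)

# Option A: sigmoidal input gate, eqs. (4.8)-(4.9)
g = sigmoid(s @ V_G)                                   # g in [0, 1]^L
inp_A = phi * inp_prev + g[:, None] * (s_trans @ V_M)

# Option B: separate trainable forget gate f, eq. (4.10)
V_F = rng.normal(size=(S, L))                          # assumed forget weights
f = sigmoid(s @ V_F)
inp_B = f[:, None] * inp_prev + g[:, None] * (s_trans @ V_M)

# Option C: softmax input gate with temperature xi, eqs. (4.11)-(4.12)
xi = 5.0
z = xi * (s @ V_G)
g_soft = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # sums to 1 over levels
inp_C = (1 - g_soft)[:, None] * inp_prev + g_soft[:, None] * (s_trans @ V_M)
```

Note that `s_trans @ V_M` broadcasts over the level dimension, giving one (M,)-vector of candidate inputs per level, and that option C ties the input and forget roles to the single variable g_soft, as in a GRU.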

Modifications in the Feedback Propagation

The propagation algorithm during the feedback step in hierarchical AuGMEnT has to take into account the effect of the gating system. In fact, since the input of the memory layer depends on both the gate and the sensory layers, the pathway for the error propagation has a bifurcation. The hierarchical architecture, instead, does not affect the tagging rules in a significant way, because each memory level is independent and the error signal is propagated through disjoint synaptic connectivities.

Here we omit the complete derivation of the tagging rules and limit ourselves to making explicit the derivative of the input into memory level l, inp^M_l(t), with respect to the weights V^M_{l,i,j} and V^G_{i,l}. In particular, we refer to equation (4.11), corresponding to gating option C, which is the most complete case.
For the derivative with respect to the memory weight V^M_{l,i,j} we adopt the usual slow-learning approximation of AuGMEnT used in Chapters 2.1.5 and 4.1.2:

∂inp^M_{l,j}(t) / ∂V^M_{l,i,j}(t) = Σ_{τ=1}^{t} ∏_{τ'=τ+1}^{t} [1 − g_l(τ')] g_l(τ) s^trans_i(τ)    (4.13)

where the right-hand side of the equation is saved in the synaptic trace variable sTrace. In fact, equation (4.13) can be written in the recursive form sTrace(t) = (1 − g(t)) sTrace(t−1) + g(t) s^trans(t), which recalls the structure of the memory dynamics in (4.12).
The derivative with respect to the gate weight V^G_{i,l} is simply computed using the chain rule:

∂inp^M_l(t) / ∂V^G_{i,l}(t) = [∂inp^G_l(t) / ∂V^G_{i,l}(t)] [∂g_l(t) / ∂inp^G_l(t)] Σ_j [∂inp^M_{l,j}(t) / ∂g_l(t)]
= s_i g_l(t) (1 − g_l(t)) Σ_j [ −inp^M_{l,j}(t−1) + Σ_{i'} s^trans_{i'} V^M_{l,i',j} ]    (4.14)

where the last factor is a sum because it has to take into account the gate effect over all M memory cells of the same level l.
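The equivalence between the recursive sTrace update and the unrolled sum of equation (4.13) can be checked numerically. This is a purely illustrative scalar check for a single synapse, with assumed toy gate and input values:

```python
import numpy as np

# Check that the recursion
#   sTrace(t) = (1 - g(t)) sTrace(t-1) + g(t) s_trans(t)
# unrolls to sum_{tau=1}^{T} prod_{tau'=tau+1}^{T} (1 - g(tau')) g(tau) s(tau).
rng = np.random.default_rng(2)
T = 6
g = rng.random(T + 1)   # gate values g(1..T) for one level (index 0 unused)
s = rng.random(T + 1)   # transient inputs s_trans_i(1..T)

# recursive form
trace = 0.0
for t in range(1, T + 1):
    trace = (1 - g[t]) * trace + g[t] * s[t]

# unrolled form (empty product for tau = T evaluates to 1)
unrolled = sum(np.prod(1 - g[tau + 1:T + 1]) * g[tau] * s[tau]
               for tau in range(1, T + 1))

assert np.isclose(trace, unrolled)
```

The check makes explicit why the trace can be maintained locally at the synapse: only the previous trace value, the current gate and the current input are needed at each step.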

Failure of Hierarchical AuGMEnT

Although hierarchical AuGMEnT is still a promising solution that combines several ideas from different models, the simulations do not converge on the 12AX. The results of these simulations are preliminary and have to be further investigated, but the combination of the hierarchical architecture with the gating system appears not to work properly in any of the tested cases: most of the time the model does not reach convergence in 12AX simulations lasting 100000 trials.

The main problem in hierarchical AuGMEnT is the lack of a communication signal between the different memory levels. Without knowing what the activity of the other levels is and what is stored in their memory, there is no cooperative organization to solve the task. On the contrary, each level seems to work independently on a different time scale, thus provoking interference in the signal arriving at the output.

The gating system also does not seem to help the organization of the hierarchical memory in any tested case (either A, B or C, or combinations of these). A possible reason is that in gated models like LSTM or GRU the gate layer is normally recurrently connected and trained with backpropagation through time, taking previous gate states into account; in this way, the gates can better learn how to distribute successive stimuli across the memory levels and encourage a multi-level organization of the memory. Moreover, in LSTM the gating mechanism operates independently on each unit with individual gate variables, while in hierarchical AuGMEnT the gates are associated with whole levels.

Finally, learning of the gating mechanism in the 12AX task is also very slow, because the RL scheme of AuGMEnT, combined with the depth of the network required to read out from the hierarchical memory and the fact that most of the responses in 12AX are Non-Targets, leads to a poor and slow feedback signal during training.

We have also studied the performance of models with either the hierarchical memory or the gating mechanism alone, to test their effects independently.
In the first case, we adopted a three-level memory architecture with 10 memory units per level and different temporal dynamics, but without any gating operation; in a quick test consisting of ten simulations, the model converged every time, but with a mean learning time of 35385.7 trials (s.d. = 16884.4), which, compared to the 24150.1 trials of simple leaky-AuGMEnT, indicates that the model does not take advantage of the higher number of units or of the temporal setting of the hierarchical memory.
Analogously, we tested a gated AuGMEnT model without hierarchical division of the memory into multiple levels, but where each gate is associated with a single unit of the memory layer. This time the model converged nine times out of ten, after a mean of 56385.7 trials (s.d. = 25038.3), which indicates that gating is indeed a slow learning process that does not allow fast convergence. In fact, we observed that the gating effect increased the variability of the behavior of the memory without letting the model satisfy the convergence condition of 1000 consecutive correct predictions.

In conclusion, hierarchical AuGMEnT has great potential, being the result of a first attempt to add a gating system to the AuGMEnT framework. However, the current simulation results do not show improvements with respect to leaky-AuGMEnT, so in the future the model will be developed further as discussed in the final conclusions of the thesis in Chapter 5.

4.2 Bio-Plausible Modifications of DNC Model

In our simulations, the DNC network proved to be efficient on different types of tasks, with very good learning times and success rates. However, we have not yet addressed the weak point of DNC, namely its biologically implausible memory addressing. Here we focus on this limitation and propose a simple approach to give DNC more biological foundations.

4.2.1 Simplification of DNC Addressing Scheme

As discussed in Chapter 2, the addressing strategy in DNC is a combination of content-based, location-based and temporal criteria; in particular, the memory dynamics that involve storing according to location usage or reading via temporal linkage do not derive from experimental findings or from an attempt to construct a neurally faithful memory model, but are artificial techniques designed to improve the overall performance.
In the present work, we try to get rid of the artificial addressing and keep only content lookup to interface with the external memory. In this way, the operations for both reading and writing are decided by a purely content-based approach, which recalls the dynamics of the biologically plausible Hopfield model. The final result of this simplification is the Reduced Differentiable Neural Computer (R-DNC)¹.
As a result of this simplification, in R-DNC the weight vectors for writing and reading, originally defined in (2.47) and (2.49), are directly computed from the similarity measure C of the write/read keys with the memory locations; that is:

w^{w,i}_t = g^{w,i}_t C(M_t, k^{w,i}_t, β^{w,i}_t)    ∀ i = 1, 2, …, W
w^{r,j}_t = C(M_t, k^{r,j}_t, β^{r,j}_t)    ∀ j = 1, 2, …, R    (4.15)

where g^{w,i}_t is the write gate associated with write head i.
It is evident that reducing the DNC addressing scheme to purely content-based criteria leads to a backlash in model performance: the model no longer has the possibility to store sequences of inputs in a specific order and to jump across memory locations. In the next paragraph we test the performance of R-DNC to see, on real cases, what the cost of this simplification is.
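A minimal sketch of the content weighting C used in equation (4.15), assuming the usual DNC form of a softmax over cosine similarities sharpened by the key strength β (the function name and all toy sizes are illustrative):

```python
import numpy as np

def content_weighting(M, k, beta):
    """Softmax over cosine similarity between key k and every memory row,
    sharpened by the strength beta (assumed DNC-style form)."""
    sim = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    e = np.exp(beta * (sim - sim.max()))   # stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(3)
N, D = 8, 5                                # N memory locations of width D
memory = rng.normal(size=(N, D))
key = memory[2] + 0.01 * rng.normal(size=D)  # a key close to row 2
w = content_weighting(memory, key, beta=20.0)
```

With a sufficiently high β the weighting concentrates almost all its mass on the best-matching location, which is the associative, Hopfield-like retrieval the text refers to.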

4.2.2 Potentiating One-shot Learning with the LRUA Scheme

We take as reference the most complex task studied in this thesis, the Omniglot task. Here the temporal addressing in DNC was important to link the information from the image representation with the label given at the next timestep in a temporally offset manner. Moreover, without any location-oriented addressing criterion, the network may store relevant information related to different classes in the same memory location, creating interference. As a consequence, we expect a worse performance from R-DNC than the one observed for DNC in Chapter 3.3.3.
As shown in Figure 4.7, the simulation confirms that the R-DNC model performs worse than DNC, with a classification accuracy that is much lower at each instance. In particular, we are interested in the behavior of the model at the 2nd presentation of an image of the same class within an episode (orange line), in order to observe the effect of the addressing reduction on one-shot learning ability. Even though the accuracy at the 2nd instance increases sharply during training, the final accuracy is not high enough to consider learning in R-DNC one-shot: the classification accuracy at the end of training is just 75%, lower than that of one-shot reference models like One-shot NTM or DNC, where the percentage of correct responses at the second instance is slightly above 90%.

¹ The simplification of the DNC model into R-DNC was carried out by V. Jain, an internship student whom we supervised during his work in the LCN lab at EPFL.

Figure 4.7. Learning curves of R-DNC on the Omniglot task (center), compared with the original performance of DNC (left) and of the network implemented with the LRUA memory-access module (right). The classification performance is observed throughout training and after each image presentation, computing the mean accuracy over 500 batches.

In order to recover one-shot learning, we adopt in the R-DNC framework the same addressing module as One-shot NTM, i.e. the Least Recently Used Access (LRUA) module described in (Santoro et al., 2016). LRUA is an addressing scheme built specifically for classification tasks whose inputs are given in a temporally offset fashion, as in Omniglot. The approach in LRUA is to select the writing location according to two criteria: the new information is stored either in the last-used or in the least-used locations. The first option corresponds to updating a content that was read from memory at the previous step, giving the opportunity to bind in memory the complementary information given at two consecutive timesteps and thus enabling one-shot learning. The second criterion addresses the least used memory location, to optimize memory space in case the new information shows no similarity with previous memories. Although memory usage is taken into account in the module, LRUA is still a purely content-based addressing approach, because the first term coincides with the similarity measure from the previous timestep, while the second is just used as an efficient alternative to avoid memory interference.


The described addressing mechanism is just a convex combination of the two possibilities:

w^w_t = σ w^r_{t−1} + (1 − σ) w^lu_t    (4.16)

where σ is a trained sigmoidal coefficient that weights the two options, and w^lu_t is the least-used weighting defined over the N memory locations, with value 1 at the location index with lowest usage and 0 everywhere else. In case of multiple read and write heads, LRUA adds the constraint that the numbers of heads be equal, W = R. In this case, each write weight is associated with a read weight and a different least-used location.
It is important to specify that, even if the addressing rule is mostly the same, R-DNC with the LRUA scheme is still different from the One-shot NTM mentioned in the previous chapter. In fact, R-DNC maintains most of the features and the general structure of DNC discussed in Chapter 2.3, like the erase/write operations, its own definitions of the interface and usage vectors, and the concatenation of the current input with the previous read vectors. On the other hand, One-shot NTM has the advantage of a much simpler structure, where there is no gating for the external memory and the erase operation is performed by completely deleting the least-used location(s) before writing.
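Equation (4.16) can be illustrated with a toy sketch. Here σ is fixed by hand, whereas in the model it is a trained sigmoidal output; the usage values and all other numbers are assumptions for the illustration:

```python
import numpy as np

N = 8
w_read_prev = np.zeros(N)
w_read_prev[3] = 1.0                           # last read was at location 3

usage = np.array([5, 2, 7, 9, 1, 4, 6, 8], float)   # assumed usage counters
w_lu = (usage == usage.min()).astype(float)    # one-hot at least-used slot (4)

sigma = 0.8                                    # stands in for the trained gate
w_write = sigma * w_read_prev + (1 - sigma) * w_lu

# mass 0.8 goes to the just-read location (binding image and label),
# mass 0.2 goes to the least-used location (storing genuinely new content)
```

The convex combination keeps the write weighting normalized whenever its two operands are, so a single scalar suffices to trade off the two write strategies.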

In Figure 4.7, we show the performance of the R-DNC model coupled with the LRUA access module, compared with DNC (left panel) and standard R-DNC (central panel). We can clearly see that our model outperforms R-DNC and, more importantly, recovers the original accuracy of DNC despite maintaining an addressing scheme that is purely content-based. In particular, the additional LRUA module restores one-shot learning, reaching an accuracy of 89% at the second instance.
The improvement of R-DNC with the LRUA module is confirmed by the test results collected in Table 4.1. Here, we can observe that the classification accuracy of R-DNC is indeed much higher at each instance when the LRUA scheme is applied. In particular, the test accuracy at the second instance is equal to 84.8%, even higher than in the original DNC (81.4%) or One-shot NTM (82.8%).

Table 4.1. Test statistics of the R-DNC model on the Omniglot task (classification accuracy in % at the 1st to 10th instance).

Model           1st    2nd    3rd    4th    5th    10th
ONE-SHOT NTM    36.4   82.8   91.0   92.6   94.9   98.1
DNC             29.4   81.4   86.6   89.0   90.0   93.0
R-DNC           26.5   71.7   76.1   79.6   80.4   83.4
R-DNC + LRUA    33.8   84.8   89.0   91.3   92.5   93.8

Furthermore, preliminary observations on the Omniglot task with episodes of 15 classes each show an even wider difference in performance between R-DNC and R-DNC with the LRUA module. In this more complex case, the number of characters to recognize is higher and the memory contribution is much more important, because the controller struggles to find internal representations that are sufficiently distinct from one another. Partial results show that halfway through training (i.e. after 46000 episodes), the R-DNC model presents low classification accuracies (25.2% at the second instance, 34.0% at the tenth), while R-DNC with LRUA already shows very good performance, with 88.0% correct classifications at the second instance and 96.9% at the last one.

In conclusion, it is important to remember that the one-shot learning ability provided by the LRUA module is not specific to the Omniglot task: it can be applied to any classification task that provides the input and the related label at two consecutive timesteps. This is indeed the advantage of meta-learning: the model can acquire knowledge that spans tasks sharing a similar structure (across-tasks learning) and employ it to improve performance on tasks from that group (inside-task learning). In our case, the MANN model with the LRUA module is able to acquire meta-knowledge about how to store label information in memory for classification tasks after slow training; the trained model can then use the same memory-based approach to classify inputs with different objects or formats, and learning is faster because the model only has to train the controller to process the input of the specific task.


Chapter 5

Conclusions

5.1 Achievements of the Project

In the present work we have explored the cutting-edge field of Memory-Augmented Neural Networks, which in recent years has been developed with different approaches and aims. First of all, we gave an overview of the state of the art, summarizing its achievements both in terms of learning capability and of biological plausibility. In our work we were interested in finding a model that has good learning performance on different cognitive tasks in the standard machine-learning sense, while at the same time being neurally faithful, i.e. respecting the biological constraints of a neural architecture. Combining these qualities is very challenging, because networks with high learning performance usually rely on unbiological techniques to interact with memory, while neurally faithful models from cognitive neuroscience have normally been applied only to simple tasks that are experimentally performed by humans or animals.

We selected three models from the most recent and relevant studies on MANNs: Attention-Gated MEmory Tagging (AuGMEnT), Hierarchical Error Representation (HER) and the Differentiable Neural Computer (DNC). We benchmarked their performance on two simple cognitive tasks, the saccade and 12AX tasks, to discover their advantages and limitations in dynamics like memory storage and retrieval. We compared their performance in terms of percentage of learning success, learning time and variety of solvable tasks. In particular, HER and DNC proved to have good learning performance on these tasks, with slight differences in the statistics. More importantly, we highlighted a memory limitation of AuGMEnT: the ungated accumulation of data in memory leads to memory interference that prevents the network from ever solving tasks with a temporal credit assignment problem, like the 12AX task. We also showed that DNC was the only model that supports an image classification task and that can perform in the limit of one-shot learning.

Afterwards, we developed the studied networks in order to reduce their drawbacks.
First, we discussed three variants of the AuGMEnT model that aim to overcome the memory interference limitation and solve the 12AX: leaky-AuGMEnT, deep AuGMEnT and hierarchical AuGMEnT. The first model consists in an additional leaky effect in the memory dynamics, which allows the network to forget useless information that interferes with correct learning. Surprisingly, this modification is enough to handle the temporal credit assignment problem and stabilize learning of the 12AX task. Deep AuGMEnT is the result of depth augmentation in AuGMEnT, which is generally a good option to improve ANN performance and which indeed decreased the learning time of leaky-AuGMEnT. However, in deep AuGMEnT we had to deal with the biological problems of backpropagation in deeper networks: when the number of hidden layers increases, the locality condition for synaptic weight updates no longer holds, because the information required to backpropagate the error is not entirely available locally at the synapse. To overcome this problem, in deep AuGMEnT we applied the Random Backpropagation methods RBP, SRBP and MRBP, which avoid typical issues of BP, like the symmetry and locality conditions, by employing fixed random matrices for the feedback connections, but require longer training due to the feedback alignment process. The last variant, hierarchical AuGMEnT, takes inspiration from the hierarchical structure of HER and the gating system of DNC, so that the network has multiple independent levels of memory layers with their own temporal and gating dynamics. Nonetheless, the model did not show the desired behavior, because of the lack of cooperation between the memory levels and the slow learning of the gating variables.
Furthermore, we operated in the opposite direction, trying to increase the biological plausibility of DNC, which proved to have excellent learning performance across all tested tasks.
We first simplified the artificial addressing scheme in DNC by removing the unbiological location-based and temporal-linkage approaches. We thus developed the Reduced Differentiable Neural Computer (R-DNC), which was obviously penalized by its purely content-based addressing module. Secondly, we recovered part of the lost learning power of DNC by coupling R-DNC with the LRUA access module, which enables one-shot learning in classification tasks like Omniglot.


5.2 Possible Developments for Future Work

The present work represents just the first steps of a longer and broader project on MANNs, with the final goal of extending the learning power and biological plausibility of neural networks. In the future we will work on further possibilities to develop the models and improve their performance or their biological foundations.

First of all, we will focus on hierarchical AuGMEnT, in order to solve the current learning problems and enhance its learnability with an optimized gating mechanism. In particular, in the current version of the model only writing and forgetting are effectively controlled by gates, but we could also add read gates that handle reading in a way similar to the LSTM output gate, and then try to make them work in the biological setting of AuGMEnT. However, before enriching the gating mechanism, we first want to discover and deeply understand the weak point of the current model. To achieve this, we will start with an artificial system (without the learning rules of AuGMEnT) that is supposed to work, and then proceed by gradually putting back the biologically based mechanisms of AuGMEnT. For instance, we can start by considering an LSTM controller in the regular branch and use standard backpropagation to train the network; then, if it works as expected, we can replace the backpropagation algorithm with the synaptic tagging system of AuGMEnT on single components of the network and observe the changes in model performance at each step.

Regarding the DNC network, its biological foundations can be enhanced even beyond what was achieved with R-DNC. In (Zaremba and Sutskever, 2015), the authors describe a way to train memory-augmented networks with external memory, like NTM or DNC, with a Reinforcement Learning algorithm called REINFORCE. In this way, the resulting model, called RL-NTM, is able to interact with discrete external interfaces and overcome the training problems due to non-differentiability. The possibility to deal with discrete interfaces, like databases or search engines, is very important, because many real tasks require taking discrete actions in an external environment in a multi-step interaction, like videogames or the stock market.
Moreover, (Gulcehre et al., 2016) proposes a variant of the NTM that also employs the REINFORCE algorithm, training a network that can work with both continuous and discrete memory mechanisms: the Dynamic Neural Turing Machine (D-NTM). In this case, the memory can be addressed either in the soft, differentiable way, as done in DNC, having a weight vector over all N locations for reading and writing, or in a hard, non-differentiable manner, i.e. selecting a single memory location to write to or read from. The latter option is considered to be more biologically plausible and recalls the retrieval-by-association dynamics described in the Hopfield model.
In addition, one could try to implement the reinforcement learning approach of AuGMEnT in the R-DNC framework, adapting the mechanisms of synaptic tagging and neuromodulation updates to a network with external memory.

A further direction of the project could be to add a psychological perspective to the research. It is known that emotional states like fear, anger or surprise can affect reasoning, attention, decision making and interpersonal interaction (Faraji, Preuschoff, and Gerstner, 2016; Dolan, 2002). This is particularly true for memory, which can easily be conditioned by emotional stimuli that modify the behavior of the subject more strongly than neutral stimuli. The effect of emotions has already been tested in real tasks, like visual search or spatial orienting tasks, and it would be interesting to add to the MANN models electrophysiological data or neurobiological interactions with the amygdala, in order to see the consequences in the simulations.


Appendix A

Model Parameterization

In this appendix we report the tables that collect the parameter settings for all the MANN models used in the simulations of Chapter 3.
In most cases the model parametrization was done by trial-and-error exploration, observing the resulting performance; in the most critical cases, Bayesian optimization was employed for the key parameters. Where available, we took the parameter values directly from the reference papers of each model.


A.1 Table of Parameters: SAS Task

Model Parameter Description Value

AuGMEnTR : Number of regular units 3M : Number of memory units 4β : Learning parameter 0.15λ : Tag persistance 0.2γ : Discount factor 0.9α : Tag decay rate 1− γλε : Exploration rate 0.025

HERL : Number of levels 2β : WM gain factor [8, 8]αr : Learning rate for pred. matrix [0.15, 0.1]αm : Learning rate for WM matrix [1, 1]λ : Eligibility trace rate [0.1, 0.9]b : Bias factor [0, 0]γ : Response selection temperature 5

LSTMH : Number of hidden units 10α : Learning rate 0.02

DNCH : Number of hidden units 64N : Number of memory locations 16M : Size of memory locations 16W : Number of write heads 1R : Number of read heads 4α : Learning rate 0.05

Table A.1. Parametrization of the MANN models for the SAS task.
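Note that the tag decay rate α is not tuned independently: as the table indicates, it is derived from the discount factor γ and the tag persistence λ as α = 1 − γλ. A quick check for the SAS setting above (γ = 0.9, λ = 0.2):

```python
def tag_decay_rate(gamma, lam):
    """AuGMEnT tag decay rate, defined as alpha = 1 - gamma * lambda."""
    return 1.0 - gamma * lam

# SAS task setting from Table A.1: gamma = 0.9, lambda = 0.2  ->  alpha = 0.82
alpha_sas = tag_decay_rate(0.9, 0.2)
```

The same relation yields the α values implied by the other AuGMEnT tables in this appendix.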



A.2 Table of Parameters: 12AX Task

Model     Parameter   Description                        Value

AuGMEnT   R           Number of regular units            10
          M           Number of memory units             10
          β           Learning parameter                 0.1
          λ           Tag persistence                    0.5
          γ           Discount factor                    0.9
          α           Tag decay rate                     1 − γλ
          ε           Exploration rate                   0.025

HER       L           Number of levels                   3
          β           WM gain factor                     [15, 15, 15]
          αr          Learning rate for pred. matrix     [0.075, 0.075, 0.075]
          αm          Learning rate for WM matrix        [1, 1, 1]
          λ           Eligibility trace rate             [0.1, 0.5, 0.99]
          b           Bias factor                        [10, 0.1, 0.01]
          γ           Response selection temperature     5

LSTM      H           Number of hidden units             10
          α           Learning rate                      0.01

DNC       H           Number of hidden units             64
          N           Number of memory locations         16
          M           Size of memory locations           16
          W           Number of write heads              1
          R           Number of read heads               4
          α           Learning rate                      0.02

Table A.2. Model parametrization for the 12AX task.



A.3 Table of Parameters: Omniglot Task

The model parameters specified here are the same for both the DNC and the One-shot NTM networks, so that the performance of the two models could be compared without interference from different parameter settings.

Parameter   Description                  Value

H           Number of hidden units       200
N           Number of memory locations   128
M           Size of memory locations     40
W           Number of write heads        4
R           Number of read heads         4
α           Learning rate                0.001

Table A.3. Parameter settings for training on the Omniglot task.



A.4 Table of Parameters: Variants of the AuGMEnT Model

AuGMEnT Variant   Parameter   Description                               Value

Leaky             R           Number of regular units                   10
                  M           Number of memory units                    10
                  β           Learning parameter                        0.15
                  λ           Tag persistence                           0.15
                  γ           Discount factor                           0.9
                  α           Tag decay rate                            1 − γλ
                  ε           Exploration rate                          0.025
                  φ           Memory leak coefficient                   0.7

Deep              R           Number of regular units                   10
                  M           Number of memory units                    10
                  Hr          Number of hidden units (regular branch)   10 (if used)
                  Hm          Number of hidden units (memory branch)    10 (if used)
                  β           Learning parameter                        0.1 (RBP: 0.08)
                  λ           Tag persistence                           0.15 (RBP: 0.5)
                  γ           Discount factor                           0.9
                  α           Tag decay rate                            1 − γλ
                  ε           Exploration rate                          0.025 (RBP: 0.08)
                  φ           Memory leak coefficient                   0.7

Hierarchical      L           Number of memory levels                   3
                  R           Number of regular units                   10
                  M           Number of memory units (per level)        10
                  β           Learning parameter                        [0.08, 0.05, 0.05]
                  λ           Tag persistence                           [0.15, 0.5, 0.5]
                  γ           Discount factor                           0.9
                  α           Tag decay rate                            1 − γλ
                  ε           Exploration rate                          0.025
                  φ           Memory leak coefficient                   0.7
                  ξ           Temperature parameter for gate softmax    3

Table A.4. Parameter settings for the variants of the AuGMEnT model on the 12AX task.



Appendix B

Programming Details

Most of the code, including the scripts for data analysis and visualization, is written in Python 2 and Python 3. For the AuGMEnT and HER models, only standard packages have been used, e.g. numpy, scipy and matplotlib.

The implementation of the LSTM model has been done in Keras (Chollet et al., 2015), a Python package that offers easy access to standard neural networks through its API and that supports both the theano and tensorflow backends.

The DNC model has been taken directly from the code published by Google DeepMind (https://github.com/deepmind/dnc) and then further developed and adapted to our aims. The libraries needed to run the code are tensorflow and sonnet. The simplification of the model into the R-DNC model was done by Vineet Jain, an internship student whom I supervised during my Master project together with my supervisor, Dr. Aditya Gilra.

In some cases the model parametrization has been tuned using spearmint (https://github.com/HIPS/Spearmint), a software package with MongoDB support that performs Bayesian optimization by running simulations with parameters that are iteratively adjusted within a predefined interval.
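A Spearmint experiment is driven by a config.json file that declares the script to optimize and the search intervals of the parameters. The fragment below is an illustrative sketch, not the actual experiment file used in this project: the experiment name, main-file name and parameter ranges are assumptions chosen for demonstration.

```json
{
    "language"        : "PYTHON",
    "main-file"       : "run_augment.py",
    "experiment-name" : "augment-12ax-tuning",
    "likelihood"      : "GAUSSIAN",
    "variables" : {
        "beta" : {
            "type" : "FLOAT",
            "size" : 1,
            "min"  : 0.01,
            "max"  : 0.5
        },
        "lambda" : {
            "type" : "FLOAT",
            "size" : 1,
            "min"  : 0.1,
            "max"  : 0.99
        }
    }
}
```

Spearmint then repeatedly invokes a `main(job_id, params)` function defined in the main file, which in our setting runs one training simulation and returns the objective value to minimize.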



Bibliography

Alexander, William H and Joshua W Brown (2011). “Medial prefrontal cortex as an action-outcome predictor”. In: Nature Neuroscience 14.10, pp. 1338–1344.

Alexander, William H. and Joshua W. Brown (2015). “Hierarchical Error Representation: A Computational Model of Anterior Cingulate and Dorsolateral Prefrontal Cortex”. In: Neural Computation 27.11, pp. 2354–2410. DOI: 10.1162/NECO_a_00779.

Alexander, William H and Joshua W Brown (2016). “Frontal cortex function derives from hierarchical predictive coding”. In: bioRxiv, p. 076505.

Ananthanarayanan, Rajagopal and Dharmendra S Modha (2007). “Anatomy of a cortical simulator”. In: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. ACM, p. 3.

Baldi, Pierre, Peter Sadowski, and Zhiqin Lu (2016). “Learning in the machine: Random backpropagation and the learning channel”. In: arXiv preprint arXiv:1612.02734.

Bioninja. Structure of a typical nerve cell. URL: ib.bioninja.com.au/standard-level/topic-6-human-physiology/65-neurons-and-synapses/neurons.html.

Cho, Kyunghyun et al. (2014). “Learning phrase representations using RNN encoder-decoder for statistical machine translation”. In: arXiv preprint arXiv:1406.1078.

Chollet, François et al. (2015). Keras. https://github.com/fchollet/keras.

Dolan, Raymond J (2002). “Emotion, cognition, and behavior”. In: Science 298.5596, pp. 1191–1194.

Eliasmith, Chris et al. (2012). “A large-scale model of the functioning brain”. In: Science 338.6111, pp. 1202–1205.

Faraji, Mohammad Javad, Kerstin Preuschoff, and Wulfram Gerstner (2016). “Balancing New Against Old Information: The Role of Surprise”. In: arXiv preprint arXiv:1606.05642.



Frank, Michael J (2005). “Dynamic dopamine modulation in the basal ganglia: a neurocomputational account of cognitive deficits in medicated and nonmedicated Parkinsonism”. In: Journal of Cognitive Neuroscience 17.1, pp. 51–72.

Frank, Michael J, Bryan Loughry, and Randall C O’Reilly (2001). “Interactions between frontal cortex and basal ganglia in working memory: a computational model”. In: Cognitive, Affective, & Behavioral Neuroscience 1.2, pp. 137–160.

Gers, Felix A and Jürgen Schmidhuber (2001). “Long Short-Term Memory learns context free and context sensitive languages”. In: Artificial Neural Nets and Genetic Algorithms. Springer, pp. 134–137.

Gerstner, Wulfram (2017). “Eligibility traces in experiments - an update”. In:

Gerstner, Wulfram et al. (2014). Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press.

Gottlieb, Jacqueline and Michael E Goldberg (1999). “Activity of neurons in the lateral intraparietal area of the monkey during an antisaccade task”. In: Nature Neuroscience 2.10, pp. 906–912.

Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural Turing machines”. In: arXiv preprint arXiv:1410.5401.

Graves, Alex et al. (2016). “Hybrid computing using a neural network with dynamic external memory”. In: Nature 538.7626, pp. 471–476.

Gulcehre, Caglar et al. (2016). “Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes”. In: arXiv preprint arXiv:1607.00036.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural Computation 9.8, pp. 1735–1780.

Hodgkin, Alan L and Andrew F Huxley (1952). “A quantitative description of membrane current and its application to conduction and excitation in nerve”. In: The Journal of Physiology 117.4, pp. 500–544.

Hopfield, John J (1982). “Neural networks and physical systems with emergent collective computational abilities”. In: Proceedings of the National Academy of Sciences 79.8, pp. 2554–2558.

Krueger, Kai A. (2011). “Sequential learning in the form of shaping as a source of cognitive flexibility”. In:

Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum (2015a). “Human-level concept learning through probabilistic program induction”. In: Science 350.6266, pp. 1332–1338. DOI: 10.1126/science.aab3050.

Lake, Brenden M, Ruslan Salakhutdinov, and Joshua B Tenenbaum (2015b). “Human-level concept learning through probabilistic program induction”. In: Science 350.6266, pp. 1332–1338.

Lillicrap, Timothy P et al. (2016). “Random synaptic feedback weights support error backpropagation for deep learning”. In: Nature Communications 7.

Mao, Tianyi et al. (2011). “Long-range neuronal circuits underlying the interaction between sensory and motor cortex”. In: Neuron 72.1, pp. 111–123.

Markram, Henry et al. (2015). “Reconstruction and simulation of neocortical microcircuitry”. In: Cell 163.2, pp. 456–492.

O’Reilly, Randall C and Michael J Frank (2006). “Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia”. In: Neural Computation 18.2, pp. 283–328.

Roelfsema, Pieter R and Arjen van Ooyen (2005). “Attention-gated reinforcement learning of internal representations for classification”. In: Neural Computation 17.10, pp. 2176–2214.

Rombouts, Jaldert O., Sander M. Bohte, and Pieter R. Roelfsema (2015). “How Attention Can Create Synaptic Tags for the Learning of Working Memories in Sequential Tasks”. In: PLOS Computational Biology 11.3, pp. 1–34. DOI: 10.1371/journal.pcbi.1004060.

Santoro, Adam et al. (2016). “One-shot learning with memory-augmented neural networks”. In: arXiv preprint arXiv:1605.06065.

Zaremba, Wojciech and Ilya Sutskever (2015). “Reinforcement Learning Neural Turing Machines - Revised”. In: arXiv preprint arXiv:1505.00521.