neural architectures with memory - university of illinoisslazebni.cs.illinois.edu › spring17 ›...

NeuralArchitectureswithMemory

NitishGupta,ShreyaRajpal25th April,2017

1

StoryComprehension

2

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to his office. Joe left the milk. Joe went to the bathroom.

Q1 : Where is Joe?

Q2 : Where is the milk now?

Q3 : Where was Joe before the office?

Questions from Joe’s angry mother:

DialogueSystem

3

Hello! What can I do for you today?

I’d like to reserve a table for 6.

Sure! When would you like that reservation?

At 7 PM, please.

Okay. What cuisine would you like?

Actually make that 7:30 PM

Updated! What cuisine?

Is there anything better than a medium rare steak?

Nothing at all! Blackdog has a 4.7 on Yelp.

Sounds perfect! Also, add one more person.

Reservation done for 7, 7:30pm at Blackdog. Enjoy!

Mac

hine

Human

MLmodelsneedmemory!

DeeperAItasksrequireexplicitmemoryandmulti-hopreasoningoverit

•RNNshaveshortmemory•Cannotincreasememorywithoutincreasingnumberofparameters•Needforcompartmentalizedmemory•Read/Writeshouldbeasynchronous

4

MemoryNetworks(MemNN)

5

• ClassofModelswithmemory𝑚 - Arrayofobjects𝑚"

FourComponents:I- InputFeatureMap:InputmanipulationG- Generalization:MemoryManipulationO- OutputFeatureMap:OutputrepresentationgeneratorR- Response:ResponseGenerator

MemoryNetworks,Westonet.al.,ICLR2015

mi

Eachmemoryhereisadensevector

MemNN

6

1. InputFeatureMap• Imagineinputasasequenceofsentences𝑥"

2. UpdateMemories


xi Input Feature Map I(xi)

I(xi)

Generalization

MemNN

7

3. OutputRepresentation• Sayif𝑞 isaquestion,computeoutputrepresentation

4. GenerateAnswerResponse


Output Feature Map o

Response

o r

I(q)

SimpleMemNNforText

8

1. InputFeatureMap - Bag-of-Words representation


xi

Sentence

Bag-of-Words

𝐼(𝑥")

SimpleMemNNforText

9

2. Generalization:Storeinputinnewmemory


mi = I(xi)

Memoriestillnow(i=4) 𝐼(𝑥()

Memoriesafter5inputs

m m

SimpleMemNNforText

10

3. Output: Using𝑘 = 2memoryhopswithquery𝑥

4. Response - SingleWordAnswer


Scoreallmemoriesagainstinput1st Maxscoring

memoryindex

Scoreallwordsagainstqueryand2supportingmemories

Maxscoringword

i = 1, . . . , N

o1 = O1(x,m) = argmax sO(I(x),mi)

i = 1, . . . , N

o2 = O2(x,m) = argmax sO([I(x),mo1 ],mi)

r = argmax

w2W

s

R

([I(x),mo1 ,mo2 ], w)

Scoreallmemoriesagainstinput&𝑜-

2nd Maxscoringmemoryindex

ScoringFunction

• ScoringFunctionisanembeddingmodel

11

s(x, y) = x

TU

TUy

• Whatis𝑈𝑦 ?

Word Embedding Matrix

Sum of Word Embeddings

=

Uy = Uy

ScoringFunctionisjustdot-productbetweensumofwordembeddings!!!


12


Joe went to the kitchen.Fred went to the kitchen.Joe picked up the milk.Joe travelled to his office.Joe left the milk. Joe went to the bathroom.

InputSentences Memories

Where is the milk now?

Question 1st supportingmemory


Question 2nd supportingmemory


Question

Office

Response

TrainingObjective

13

X

f 6=mo1

max(0, � � sO(x,mo1) + sO(x, f)) +

Scorefortrue1st memory

Scoreforanegativememory


TrainingObjective

14

X

f 6=mo1

max(0, � � sO(x,mo1) + sO(x, f)) +

X

f 0 6=mo2

max(0, � � sO([x,mo1 ],mo2) + sO([x,mo1 ], f0)) +

Scorefortrue2nd memory

Scoreforanegativememory


TrainingObjective

15

X

f 6=mo1

max(0, � � sO(x,mo1) + sO(x, f)) +

X

f 0 6=mo2

max(0, � � sO([x,mo1 ],mo2) + sO([x,mo1 ], f0)) +

Scorefortrueresponse

Scoreforanegativeresponse

X

r 6=r

max(0, � � s

R

([x,mo1 ,mo2 ], r) + s

R

([x,mo1 ,mo2 ], r))


Experiment

16

• Large– ScaleQA• 14MStatements– (subject,relation,object)•MemoryHops;𝑘 = 1• Onlyre-rankedcandidatesfromothersystem

Method F1

Faderet.al.2013 0.54

Bordeset.al. 2014b 0.73

MemoryNetworks (Thiswork) 0.72

WhydoesMemoryNetworkperformexactlyaspreviousmodel?


Storedasmemories

Outputishighestscoringmemory

Experiment

17

• Large– ScaleQA• 14MStatements– (subject,relation,object)•MemoryHops;𝑘 = 1• Onlyre-rankedcandidatesfromothersystem

Method F1

Faderet.al.2013 0.54

Bordeset.al. 2014b 0.73

MemoryNetworks (Thiswork) 0.72

WhydoesMemoryNetworksnotperformaswell?

UsefulExperiment

18

• SimulatedWorldQA• 4characters,3objects,5rooms• 7kstatements,3kquestionsfortrainingandsamefortesting• Difficulty1(5)– Entityinquestionismentionedinlast1(5)sentences• For𝑘 = 2,annotationhasintermediatebestmemoriesaswell


Limitations

19

• SimpleBOWrepresentation

• SimulatedQuestionAnsweringdatasetistootrivial

• Strongsupervisioni.e.forintermediatememoriesisneeded


End-to-EndMemoryNetworks(MemN2N)

20

•Whatiftheannotationis:• Inputsentences𝑥-, 𝑥2,… , 𝑥4• Query𝑞• Answer𝑎

•Modelperformsby:• Generatingmemoriesfrominputs• Transformingqueryintosuitablerepresentation• Processqueryandmemoriesjointlyusingmultiplehopstoproducetheanswer• Backpropagatethroughthewholeprocedure

End-To-EndMemoryNetworks,Sukhbaataret.al.,NIPS2015

Joe went to the kitchen. Fred went to the kitchen. Joe picked up the milk. Joe travelled to his office. Joe left the milk. Joe went to the bathroom.


Office

MemN2N

21

1. Convertinputtomemories𝑥" → 𝑚"


mi = Axi BOWinput

Word-EmbeddingMatrix

Sumofword-embeddings

2. Transformquery𝑞 intosamerepresentationspace

u = Bq

3. OutputVectors𝑥" → 𝑐"ci = Cxi

MemN2N

22

3. Scoringmemoriesagainstquery


Memories

Query(transformed)

Scoreforinput/memory

4. Generateoutput

pi = Softmax(uTmi)

o =X

i

pici

Weightedaverageofallinputs(transformed)

MemN2N

23

5. GeneratingResponse


a = Softmax(W (u+ o))

TrainingObjective– MaximumLikelihood/CrossEntropy

Distributionoverresponsewords

Averaged-outputQuery

ˆ

⇥ = argmax

NX

s=1

logP (as)

24End-To-EndMemoryNetworks,Sukhbaataret.al.,NIPS2015

Generatememories

TransformQuery

Generateoutputs

Scorememories

Makeaveragedoutput

Response


Multi-hopMemN2N

u

k+1 = u

k + o

k

Hop1

Hop2

Hop3

DifferentMemoriesandOutputsforeachHop

Experiments


• SimulatedWorldQA

• 20TasksfrombAbIdataset- 1Kand10Kinstancespertask

• Vocabulary=177wordsonly!!!!!

• 60epochs

• LearningRateannealing

• LinearStartwithdifferentlearningrate

• “Modeldivergedveryoften,hencetrainedmultiplemodels”


MemNN MemN2N

Error%(1k) 6.7 12.4

Error%(10k) 3.2 7.5

MovieTriviaTime!

28

• WhichwasStanleyKubricks’sfirstmovie?

• Whendid2001:ASpaceOdysseyrelease?

• AfterTheShining,whichmoviediditsdirectordirect?

FearandDesire

1968

FullMetalJacket

(2001:a_space_odyssey,directed_by,stanley_kubrick)(fear_and_dark, directed_by,stanley_kubrick)

…(fear_and_dark, released_in,1953)

(full_metal_jacket, released_in, 1987)

…(2001:a_space_odyssey, released_in,1968)

…(the_shining, directed_by,stanley_kubrick)

…(AI:artificial_intelligence,written_by,stanley_kubrick)

Subject Relation Object

KnowledgeBase

KnowledgeBase?

29

Incomplete! TooChallenging!

CombineusingMemoryNetworks?

TextualKnowledge?(2001:a_space_odyssey,directed_by,stanley_kubrick)

(fear_and_dark, directed_by,stanley_kubrick)

…(fear_and_dark, released_in,1953)

(full_metal_jacket, released_in, 1987)

…(2001:a_space_odyssey, released_in,1968)

…(the_shining, directed_by,stanley_kubrick)

…(AI:artificial_intelligence,written_by,stanley_kubrick)

Key-ValueMemNNs forReadingDocuments

30

• StructuredMemoriesasKey-ValuePairs• RegularMemNNs havesinglevectorforeachmemory• Keymorerelatedtoquestionandvaluestoanswer

Key-ValueMemoryNetworksforDirectlyReadingDocuments,Milleret.al.,EMNLP2016

Memories = (k1, v1), (k2, v2), . . . , (kN , vN )

KeysandValuescanbeWords,Sentences,Vectorsetc.

(𝑘:Kubrick’sfirstmoviewas,𝑣: FearandDark)

KV-MemNN

31Key-ValueMemoryNetworksforDirectlyReadingDocuments,Milleret.al.,EMNLP2016

1. RetrieverelevantmemoriesusingHashingTechniques

q

(k1, v1)

(k2, v2)

(kN , vN ). . .

(kh1 , vh1)

(kh2 , vh2)

(khN , vhN ). . .

Useinvertedindex,localitysensitivehashing, somethingsensible

AllMemoriesRetrievedRelevantMemories

KV-MemNN


2. ScoreMemory-Keys

phi = Softmax(A�Q(q) ·A�K(khi))

KeyBOWSumof

Embeddings

Dot-ProdDistributionoverMemory-Keys

3. GenerateOutput

o =X

i

phiA�V (vhi)WeightedaverageofMemory-values

pi = Softmax(uTmi)

o =X

i

pici

KV-MemNN- MultipleHops


Inthe𝑗<= hop:

Queryrepresentation:

KeyAddressing

qj = Rj(A�Q(qj�1) + o)

GenerateResponse

phi = Softmax(A�Q(qj) ·A�K(khi))

FinalHop

a = Softmax(A�Q(qH+1) ·B�Y (yi))

KV-MemNN– Whattostoreinmemories?


1. KBBased:

Key:(subject,relation);Value:ObjectK:(2001:a_space_odyssey,directed_by);V:stanley_kubrick

2. DocumentBased

Foreachentityindocument,extract5-wordwindowaroundit

Key:window;Value:Entity

K:screenplaywrittenbyand;V:Hampton

KV-MemNN– Experiments


• WikiMovies Benchmark• Total100KQA-pairs• 10%fortesting

Method KB Doc

E2EMemoryNetwork 78.5 69.9

Key-ValueMemoryNetwork 93.9 76.2

KV-MemNN


RetrieverelevantMemories

ScorerelevantMemory-Keys

GenerateOutputusingAveragedMemory-Values

GenerateResponse

KV-MemNN


⌘

CNN:ComputerVision ::________:NLP


RNN

DynamicMemoryNetworks– TheBeast

39AskMeAnything:DynamicMemoryNetworksforNaturalLanguageProcessing,Kumaret.al.ICML2016

UseRNNs,specificallyGRUsforeverymodule

DMN


ct = GRU(wit, c

i�1t )FinalGRUOutput

for𝑡<= sentence

DMN


q = GRU(qiw, qi�1)

DMN


ei = hiTC

𝑖 = 1

hit = gitGRU(ct, h

it�1) + (1� git)h

it�1𝐻𝑜𝑝 = 𝑖

DMN


ei = hiTC

hit = gitGRU(ct, h

it�1) + (1� git)h

it�1

𝑖 = 2

𝐻𝑜𝑝 = 𝑖

DMN


mi = GRU(ei,mi�1)

DMN


yt = Softmax(W (a)↵t)

↵t = GRU([yt�1, q],↵t�1)

↵0 = mTm

DMN


HowmanyGRUswereusedwith2hops?

DMN– QualitativeResults


48

AlgorithmLearning

NeuralTuringMachine

CopyTask:ImplementtheAlgorithm

Givenalistofnumbersatinput,reproducethelistatoutput

NeuralTuringMachineLearns:1. Whattowritetomemory2. Whentowritetomemory3. Whentostopwriting4. Whichmemorycelltoreadfrom5. Howtoconvertresultofreadintofinaloutput

49

NeuralTuringMachines

50

Controller

ExternalInput ExternalOutput

ReadHeads

Memory

NeuralTuringMachines,Graveset.al.,arXiv:1410.5401

‘Blurry’

WriteHeads


‘Blurry’MemoryAddressing(attimeinstant‘t’)

51

N

M

Mt

wt


SoftAttention(Lectures2,3,20,24)

wt(0)=0.1 wt(1)=0.2 wt(2)=0.5 wt(3)=0.1 wt(4)=0.1


Moreformally,

BlurryReadOperation

Given:Mt(memorymatrix)ofsizeNxMwt (weightvector)oflengthNt(timeindex)

52NeuralTuringMachines,Graveset.al.,arXiv:1410.5401

NeuralTuringMachines:BlurryWrites

BlurryWriteOperationDecomposedintoblurryerase +blurryadd

Given:Mt(memorymatrix)ofsizeNxMwt (weightvector)oflengthNt(timeindex)et(erasevector)oflengthMat(addvector)oflengthM

53

EraseComponent AddComponentNeuralTuringMachines,Graveset.al.,arXiv:1410.5401

NeuralTuringMachines:Erase

54

5 7 9 2 12

11 6 3 1 2

3 7 3 10 6

4 2 5 9 9

3 5 12 8 4

M0

w1(0)=0.1 w1(1)=0.2 w1(2)=0.5 w1(3)=0.1 w1(4)=0.1

1.0

0.7

0.2

0.5

0.0

e1


1xN Mx1

NeuralTuringMachines:Erase

55

4.5 5.6 4.5 1.8 10.8

10.23 5.16 1.95 0.93 1.86

2.94 6.72 2.7 9.8 5.88

3.8 1.8 3.75 8.55 8.55

3 5 12 8 4

w1(0)=0.1 w1(1)=0.2 w1(2)=0.5 w1(3)=0.1 w1(4)=0.1


NeuralTuringMachines:Addition

56

4.5 5.6 4.5 1.8 10.8

10.23 5.16 1.95 0.93 1.86

2.94 6.72 2.7 9.8 5.88

3.8 1.8 3.75 8.55 8.55

3 5 12 8 4

w1(0)=0.1 w1(1)=0.2 w1(2)=0.5 w1(3)=0.1 w1(4)=0.1

3

4

-2

0

2

a1


NeuralTuringMachines:BlurryWrites

57

4.8 6.2 6 2.1 11.1

10.63 5.96 3.95 1.33 2.26

2.74 6.32 1.7 9.6 5.68

3.8 1.8 3.75 8.55 8.55

3.2 5.4 13 8.2 4.2

M1


NeuralTuringMachines:Demo

58

Demonstration:TrainingonCopyTask

FigurefromSnipsAI'sMediumPost


NeuralTuringMachines:AttentionModel


Generatingwt

ContentBased

Example:QATask

- ScoresentencesbysimilaritywithQuestion

- Weightsassoftmax ofsimilarityscores

LocationBased

Example:CopyTask

- Movetoaddress(i+1)afterwritingtoindex(i)

- Weights≈Transitionprobabilities

NeuralTuringMachine:AttentionModel

60

CA

Stepsforgeneratingwt1. ContentAddressing2. Peaking3. Interpolation4. ConvolutionalShift(Location

Addressing)5. Sharpening

I

CS

S

Prev.State

ControllerOutputs



Prev.State

ControllerOutputs

61

:Vector(lengthM)producedbyController



62

CA

Step1:ContentAddressing(CA)


Prev.State

ControllerOutputs


63

CA

Step2:PeakingPrev.State

ControllerOutputs



64

CA

Step3:Interpolation(I)

I

Prev.State

ControllerOutputs



65

CA

Step4:ConvolutionalShift(CS)• Controlleroutputs,anormalized

distributionoverallNpossibleshifts• Rotation-shiftedweightscomputedas:

I

CS

Prev.State

ControllerOutputs



66

CA

Step5:Sharpening(S)• Usestosharpenas:

I

CS

S

Prev.State

ControllerOutputs


NeuralTuringMachine:ControllerDesign

67

• Feed-forward:faster,moretransparency&interpretabilityaboutfunctionlearnt

• LSTM:moreexpressivepower,doesn’tlimitthenumberofcomputationspertimestep

Bothareend-to-enddifferentiable!1. Reading/Writing->ConvexSums2. wt generation->Smooth3. ControllerNetworks


NeuralTuringMachine:NetworkOverview

68

UnrolledFeed-forwardController


FigurefromSnipsAI'sMediumPost

NeuralTuringMachinesvs.MemNNs

MemNNs•Memoryisstatic,withfocusonretrieving(reading)informationfrommemory

NTMs•Memoryiscontinuouslywrittentoandreadfrom,withnetworklearningwhentoperformmemoryreadandwrite

69

NeuralTuringMachines:Experiments

TaskNetworkSize NumberofParameters

NTMw/LSTM* LSTM NTMw/LSTM LSTM

Copy 3x100 3x256 67K 1.3M

RepeatCopy 3x100 3x512 66K 5.3M

Associative 3x100 3x256 70K 1.3M

N-grams 3x100 3x128 61K 330K

PrioritySort 2x100 3x128 269K 385K


NeuralTuringMachines:‘Copy’LearningCurve

Trainedon8-bitsequences,1<=sequencelength<=20


NeuralTuringMachines:‘Copy’Performance

72

LSTM

NTM


NeuralTuringMachinestriggeredanoutbreakofMemory

Architectures!

73

StackAugmentedRecurrentNetworks

Learnalgorithmsbasedonstackimplementations(e.g.learningfixedsequencegenerators)

Usesastackdatastructuretostorememory(asopposedtoamemorymatrix)

74InferringAlgorithmicPatternswithStack-AugmentedRecurrentNets,Joulin et.al.,arXiv:1503.01007

DynamicNeuralTuringMachines

Experimentedwithaddressingschemes• DynamicAddresses:Addressesofmemorylocationslearntintraining– allowsnon-linearlocation-basedaddressing

• Leastrecentlyusedweighting:Preferleastrecentlyusedmemorylocations+interpolatewithcontent-basedaddressing

• DiscreteAddressing:Samplethememorylocationfromthecontent-baseddistributiontoobtainaone-hotaddress

• Multi-stepAddressing:Allowsmultiplehopsovermemory

Results:bAbI QATask

75DynamicNeuralTuringMachinewithSoftandHardAddressingSchemes,Gulchere et.al.,arXiv:1607.00036

Location NTM Content NTM Soft DNTM DiscreteDNTM

1-step 31.4% 33.6% 29.5% 27.9%

3-step 32.8% 32.7% 24.2% 21.7%

StackAugmentedRecurrentNetworks

•Blurry‘push’and‘pop’onstack.E.g.:

• Someresults:

76InferringAlgorithmicPatternswithStack-AugmentedRecurrentNets,Joulin et.al.,arXiv:1503.01007

DifferentiableNeuralComputers

Advancedaddressingmechanisms:

• ContentBasedAddressing

• TemporalAddressing• Maintainsnotionofsequenceinaddressing• TemporalLinkMatrixL(sizeNxN),L[i,j]=degreetowhichlocation Iwaswrittentoafterlocation j.

• UsageBasedAddressing

77Hybridcomputingusinganeuralnetworkwithdynamicexternalmemory,Graveset.al.,Naturevol.538

DNC:UsageBasedAddressing

•Writingincreasesusageofcell,readingdecreasesusageofcell

• Leastusedlocationhashighestusage-basedweighting

• Interpolateb/wusage&contentbasedweightsforfinalwriteweights


DNC:Example


DNC:ImprovementsoverNTMs

NTM

• Largecontiguousblocksofmemoryneeded

• Nowaytofreeupmemorycellsafterwriting

DNC

•Memorylocationsnon-contiguous,usage-based

• Regularde-allotmentbasedonusage-tracking


DNC:Experiments

GraphTasks

GraphRepresentation:(source,edge,destination)tuples

Typesoftasks:

- Traversal:Performwalkongraphgivensource,listofedges

- ShortestPath:Givensource,destination

- Inference:Givensource,relationoveredges;finddestination


DNC:Experiments

GraphTasks

Trainingover3phases:

• Graphdescriptionphase:(source,edge,destination)tuplesfedintothegraph

• Queryphase:Shortestpath(source,____,destination),Inference(source,hybridrelation,___),Traversal(source,relation,relation…,___)

• Answerphase:Targetresponsesprovidedatoutput

Trainedonrandomgraphsofmaximumsize1000


DNC:Experiments

GraphTasks:LondonUnderground


DNC:Experiments



InputPhase

DNC:Experiments



TraversalTask

QueryPhase AnswerPhase

DNC:Experiments



ShortestPathTask

QueryPhase AnswerPhase

DNC:Experiments


GraphTasks:Freya’sFamilyTree

Conclusion

•MachineLearningmodelsrequirememoryandmulti-hopreasoningtoperformAItasksbetter

•MemoryNetworksforTextareaninterestingdirectionbutverysimple

• Genericarchitectureswithmemory,suchasNeuralTuringMachine,limitedapplications shown

• FuturedirectionsshouldbefocusingonapplyinggenericneuralmodelswithmemorytomoreAITasks.


ReadingList

• KarolKurach,MarcinAndrychowicz &IlyaSutskever NeuralRandom-AccessMachines,ICLR,2016

• EmilioParisotto &Ruslan Salakhutdinov NeuralMap:StructuredMemoryforDeepReinforcementLearning,ArXiv,2017

• Pritzel et.al. NeuralEpisodicControl,ArXiv,2017• OriolVinyals,Meire Fortunato, Navdeep Jaitly PointerNetworks,ArXiv,2017

• JackWRaeetal.,ScalingMemory-AugmentedNeuralNetworkswithSparseReadsandWrites,ArXiv 2016

• AntoineBordes, Y-LanBoureau, JasonWeston,LearningEnd-to-EndGoal-OrientedDialog,ICLR2017

• JunhyukOh, ValliappaChockalingam, SatinderSingh, HonglakLee,ControlofMemory,ActivePerception,andActioninMinecraft,ICML2016

• WojciechZaremba, IlyaSutskever,ReinforcementLearningNeuralTuringMachines,ArXiv 2016

89