
  • NPFL114, Lecture 12

    Transformer, External Memory Networks
    Milan Straka

    May 20, 2019

    Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


  • Exams

    Five questions, written preparation, then we go through it together (or you can leave and let me grade it by myself).

    Each question is 20 points, and up to 40 points (surplus above 80 points; there is no distinction between regular and competition points) are transferred from the practicals, and up to 10 points for GitHub pull requests.

    To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.

    The SIS should give you an exact time of the exam (including a gap between students) so that you do not come all at once.


  • What Next

    In the winter semester:

    NPFL117 – Deep Learning Seminar [0/2 Ex]
    Reading group of deep learning papers (in all areas). Every participant presents a paper about deep learning, learning how to read a paper, present it in an understandable way, and get deep learning knowledge from other presentations.

    NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]
    In a sense a continuation of Deep Learning, but instead of supervised learning, reinforcement learning is the main method. Similar format to the Deep Learning course.

    NPFL129 – Machine Learning 101
    A course intended as a prequel to Deep Learning – an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVM, maximum entropy classifiers, gradient boosting, …), with practicals in Python.


  • Neural Architecture Search (NASNet) – 2017

    Using REINFORCE with baseline, we can design neural network architectures.

    We fix the overall architecture and design only Normal and Reduction cells.

    Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Every block is designed by an RNN controller generating individual operations.

    Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Attention is All You Need

    For some sequence processing tasks, sequential processing of the elements might be too restricting. Instead, we may want to combine sequence elements independently of their distance.


  • Attention is All You Need

    Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Attention is All You Need

    The attention module for queries $Q$, keys $K$ and values $V$ is defined as

    $$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$

    The queries, keys and values are computed from the current word representations $V$ using a linear transformation as

    $$Q = V \cdot W^Q, \qquad K = V \cdot W^K, \qquad V = V \cdot W^V.$$
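    To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (the function and variable names are illustrative, not taken from the lecture code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: [n_queries, d_k], K: [n_keys, d_k], V: [n_keys, d_v]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # [n_queries, n_keys]
    return softmax(scores, axis=-1) @ V       # weighted sum of the values

# Self-attention: queries, keys and values all come from the same word
# representations X via learned projections W_Q, W_K, W_V.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 words of dimension 16
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
print(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V).shape)   # (5, 16)
```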


  • Attention is All You Need

    Multihead attention is used in practice. Instead of using one huge attention, we split queries, keys and values into several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.

    Figure 2 (Scaled Dot-Product Attention, Multi-Head Attention) of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
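    As a sketch of the splitting described above (assuming d_model is divisible by the number of heads; the projection matrices are illustrative):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    # Project, split into heads, attend in each head separately, concatenate.
    # X: [n_words, d_model]; W_Q, W_K, W_V, W_O: [d_model, d_model].
    n, d_model = X.shape
    d_head = d_model // n_heads
    def project(W):                            # -> [n_heads, n, d_head]
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(W_Q), project(W_K), project(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # [n_heads, n, n]
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                     # [n_heads, n, d_head]
    return heads.transpose(1, 0, 2).reshape(n, d_model) @ W_O

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=4).shape)    # (5, 16)
```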


  • Attention is All You Need

    Positional Embeddings
    We need to encode positional information (which was implicit in RNNs).

    Learned embeddings for every position.

    Sinusoids of different frequencies:

    $$\operatorname{PE}_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right)$$
    $$\operatorname{PE}_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$$

    This choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\operatorname{PE}_{pos+k}$ is a linear function of $\operatorname{PE}_{pos}$.
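    A small sketch computing the sinusoidal embeddings above (assuming an even dimension d):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d))
    pos = np.arange(max_len)[:, None]          # [max_len, 1]
    i = np.arange(d // 2)[None, :]             # [1, d/2]
    angles = pos / 10000 ** (2 * i / d)        # [max_len, d/2]
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_embeddings(20, 512).shape)   # (20, 512), as visualized below
```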


  • Attention is All You Need

    Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.

    http://jalammar.github.io/illustrated-transformer/


  • Attention is All You Need

    Regularization
    The network is regularized by:

    dropout of input embeddings,
    dropout of each sub-layer, just before it is added to the residual connection (and then normalized),
    label smoothing.

    The default dropout rate and also the label smoothing weight is 0.1.

    Parallel Execution
    Training can be performed in parallel because of the masked attention – the softmax weights of the self-attention are zeroed out so that words later in the sequence cannot be attended to.

    However, inference is still sequential (and no substantial improvements have been achieved on parallel inference similar to WaveNet).
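    A minimal sketch of how the masking can be realized – positions above the diagonal of the score matrix are set to −∞ before the softmax, so their weights become exactly zero (names are illustrative):

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: [n, n] self-attention scores; position i may attend only to j <= i.
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(4, 4))
print(np.round(causal_attention_weights(scores), 2))    # lower-triangular rows
```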


  • Why Attention

    Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Transformers Results

    Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Transformers Results

    |      | N | d_model | d_ff | h | d_k | d_v | P_drop | ε_ls | train steps | PPL (dev) | BLEU (dev) | params ×10⁶ |
    |------|---|---------|------|---|-----|-----|--------|------|-------------|-----------|------------|-------------|
    | base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
    | (A)  |   |     |      | 1 | 512 | 512 | | | | 5.29 | 24.9 | |
    |      |   |     |      | 4 | 128 | 128 | | | | 5.00 | 25.5 | |
    |      |   |     |      | 16 | 32 | 32 | | | | 4.91 | 25.8 | |
    |      |   |     |      | 32 | 16 | 16 | | | | 5.01 | 25.4 | |
    | (B)  |   |     |      |   | 16 |    | | | | 5.16 | 25.1 | 58 |
    |      |   |     |      |   | 32 |    | | | | 5.01 | 25.4 | 60 |
    | (C)  | 2 |     |      |   |    |    | | | | 6.11 | 23.7 | 36 |
    |      | 4 |     |      |   |    |    | | | | 5.19 | 25.3 | 50 |
    |      | 8 |     |      |   |    |    | | | | 4.88 | 25.5 | 80 |
    |      |   | 256 |      |   | 32 | 32 | | | | 5.75 | 24.5 | 28 |
    |      |   | 1024 |     |   | 128 | 128 | | | | 4.66 | 26.0 | 168 |
    |      |   |     | 1024 |   |    |    | | | | 5.12 | 25.4 | 53 |
    |      |   |     | 4096 |   |    |    | | | | 4.75 | 26.2 | 90 |
    | (D)  |   |     |      |   |    |    | 0.0 | | | 5.77 | 24.6 | |
    |      |   |     |      |   |    |    | 0.2 | | | 4.95 | 25.5 | |
    |      |   |     |      |   |    |    | | 0.0 | | 4.67 | 25.3 | |
    |      |   |     |      |   |    |    | | 0.2 | | 5.47 | 25.7 | |
    | (E)  | positional embedding instead of sinusoids | | | | | | | | | 4.92 | 25.7 | |
    | big  | 6 | 1024 | 4096 | 16 | | | 0.3 | | 300K | 4.33 | 26.4 | 213 |

    (Empty cells denote values identical to those of the base model.)

    Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Neural Turing Machines

    So far, all input information was stored either directly in network weights, or in a state of a recurrent network.

    However, mammal brains seem to operate with a working memory – a capacity for short-term storage of information and its rule-based manipulation.

    We can therefore try to introduce an external memory to a neural network. The memory $M$ will be a matrix, where rows correspond to memory cells.


  • Neural Turing Machines

    The network will control the memory using a controller which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.

    Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machine

    Reading
    To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over memory locations, and the returned read vector $r_t$ is then

    $$r_t = \sum_i w_t(i) \cdot M_t(i).$$

    Writing
    Writing is performed in two steps – an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over memory locations, and also an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

    $$M_t(i) = M_{t-1}(i)\big[1 - w_t(i) e_t\big] + w_t(i) a_t.$$
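    A minimal NumPy sketch of the differentiable read and erase-then-add write (shapes and names are illustrative):

```python
import numpy as np

def ntm_read(M, w):
    # M: [cells, cell_size], w: [cells] read distribution -> read vector r_t
    return w @ M

def ntm_write(M, w, e, a):
    # M_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t) + w_t(i) * a_t
    return M * (1 - np.outer(w, e)) + np.outer(w, a)

rng = np.random.default_rng(3)
M = rng.normal(size=(8, 4))                       # 8 memory cells of size 4
w = np.full(8, 1 / 8)                             # addressing distribution
e, a = rng.uniform(size=4), rng.normal(size=4)    # erase in [0, 1], add vector
print(ntm_read(M, w).shape, ntm_write(M, w, e, a).shape)    # (4,) (8, 4)
```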


  • Neural Turing Machine

    The addressing mechanism is designed to allow both

    content addressing, and
    location addressing.

    Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machine

    Content Addressing
    Content addressing starts by the controller emitting the key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a $\operatorname{softmax}$ with temperature $\beta_t$:

    $$w_t^c(i) = \frac{\exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\big)}.$$

    The $\operatorname{distance}$ measure is usually the cosine similarity

    $$\operatorname{distance}(a, b) = \frac{a \cdot b}{\lVert a \rVert \cdot \lVert b \rVert}.$$
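    A small sketch of the content addressing above, using cosine similarity and a softmax with temperature β (variable names are illustrative):

```python
import numpy as np

def content_addressing(M, key, beta):
    # M: [cells, cell_size], key: [cell_size], beta: scalar temperature.
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sim
    scores -= scores.max()                        # numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(4)
M = rng.normal(size=(8, 4))
key = M[2] + 0.1 * rng.normal(size=4)             # a noisy copy of cell 2
print(np.argmax(content_addressing(M, key, beta=5.0)))   # most weight near cell 2
```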


  • Neural Turing Machine

    Location-Based Addressing
    To allow iterative access to memory, the controller might decide to reuse the memory location from the previous time step. Specifically, the controller emits an interpolation gate $g_t$ and defines

    $$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}.$$

    Then, the current weighting may be shifted, i.e., the controller might decide to “rotate” the weights by a small integer. For a given range (the simplest case are only shifts $\{-1, 0, 1\}$), the network emits a $\operatorname{softmax}$ distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

    $$\tilde{w}_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

    Finally, not to lose precision over time, the controller emits a sharpening factor $\gamma_t$ and the final memory location weights are

    $$w_t(i) = \tilde{w}_t(i)^{\gamma_t} \Big/ \sum_j \tilde{w}_t(j)^{\gamma_t}.$$
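    A small sketch of the three steps above (interpolation, circular-convolution shift over {−1, 0, +1}, sharpening); the values are illustrative:

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    # g: interpolation gate, s: distribution over shifts (-1, 0, +1), gamma >= 1.
    w_g = g * w_c + (1 - g) * w_prev
    n = len(w_g)
    shifted = np.zeros(n)
    for i in range(n):
        for s_k, shift in zip(s, (-1, 0, 1)):      # circular convolution
            shifted[i] += w_g[(i - shift) % n] * s_k
    sharpened = shifted ** gamma                    # sharpen, then renormalize
    return sharpened / sharpened.sum()

w_c = np.array([0.05, 0.80, 0.05, 0.05, 0.05])
w_prev = np.full(5, 0.2)
s = np.array([0.0, 0.0, 1.0])                       # rotate the weighting by +1
print(np.round(location_addressing(w_c, w_prev, g=0.9, s=s, gamma=2.0), 3))
```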


  • Neural Turing Machine

    Overall Execution
    Even if not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input is $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step, the outputs of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r\big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of

    $$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\; k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\; \ldots,\; k_t^R, \beta_t^R, g_t^R, s_t^R, \gamma_t^R,\; k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w,\; e_t, a_t.$$
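    As an illustration only (this exact layout is not prescribed by the papers), the concatenated vector ξ_t could be split into per-head parameters like this, assuming a shift range of three and keys of size cell_size:

```python
import numpy as np

def split_controller_output(xi, n_read_heads, cell_size):
    # Each head gets k (cell_size), beta, g, s (3 shifts), gamma; the write head
    # additionally needs the erase and add vectors e, a at the end of xi.
    per_head = cell_size + 1 + 1 + 3 + 1
    heads, offset = [], 0
    for _ in range(n_read_heads + 1):               # R read heads + 1 write head
        c = xi[offset:offset + per_head]
        heads.append({"k": c[:cell_size], "beta": c[cell_size], "g": c[cell_size + 1],
                      "s": c[cell_size + 2:cell_size + 5], "gamma": c[cell_size + 5]})
        offset += per_head
    e, a = xi[offset:offset + cell_size], xi[offset + cell_size:]
    return heads, e, a

xi = np.arange(3 * 10 + 2 * 4, dtype=float)         # 2 read heads, cell_size 4
heads, e, a = split_controller_output(xi, n_read_heads=2, cell_size=4)
print(len(heads), e.shape, a.shape)                  # 3 (4,) (4,)
```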


  • Neural Turing Machines

    Copy Task
    Repeat the same sequence as given on input. Trained with sequences of length up to 20.

    Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.

    Figure 4 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences. Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.

    Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 6 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Associative Recall
    In associative recall, a sequence is given on input, consisting of subsequences of length 3. Then a randomly chosen subsequence is presented on input and the goal is to produce the following subsequence.


  • Neural Turing Machines

    Figure 11 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Differentiable Neural Computer

    NTM was later extended to a Differentiable Neural Computer.

    [Figure 1 schematic: a Controller, b Read and write heads, c Memory, d Memory usage and temporal links. The write head emits a write key, write vector and erase vector; each read head emits a read key and a read mode; the read vectors are fed back to the controller.]

    Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Differentiable Neural Computer

    The DNC contains multiple read heads and one write head.

    The controller is a deep LSTM network, with input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step. The outputs of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r\big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of parameters for the read and write heads (keys, gates, sharpening parameters, …).

    In DNC, usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which memory location was written to previously ($b_t$) and subsequently ($f_t$).

    Then, the write weighting is defined as a weighted combination of the allocation weighting and the write content weighting, and the read weighting is computed as a weighted combination of the read content weighting, the previous write weighting, and the subsequent write weighting.
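    A rough sketch of how the final weightings combine the distributions described above; the allocation/write gates and the read-mode distribution π are treated here as given inputs, and the computation of the individual weightings themselves is omitted:

```python
import numpy as np

def dnc_write_weighting(alloc_w, content_w, g_alloc, g_write):
    # Interpolate between allocation and write-content weighting, scaled by a write gate.
    return g_write * (g_alloc * alloc_w + (1 - g_alloc) * content_w)

def dnc_read_weighting(backward_w, content_w, forward_w, pi):
    # pi: distribution over the three read modes (backward, content, forward).
    return pi[0] * backward_w + pi[1] * content_w + pi[2] * forward_w

n = 6
alloc_w = np.eye(n)[0]                      # e.g. prefer the least-used cell
content_w = np.full(n, 1 / n)
print(dnc_write_weighting(alloc_w, content_w, g_alloc=0.7, g_write=1.0))
print(dnc_read_weighting(np.eye(n)[1], content_w, np.eye(n)[2], pi=[0.2, 0.5, 0.3]))
```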


  • Differentiable Neural Computer

    [Figure 2 examples: a Random graph, b London Underground (84 edges in total), c Family tree (54 edges in total). A traversal question such as (BondSt, _, Central), (_, _, Circle), …, (_, _, Jubilee) is answered by the sequence of traversed edges, e.g. (BondSt, NottingHillGate, Central), (NottingHillGate, GloucesterRd, Circle), …; the shortest-path question (Moorgate, PiccadillyCircus, _) is answered by the path Moorgate → Bank → Holborn → LeicesterSq → PiccadillyCircus; the inference question (Freya, _, MaternalGreatUncle) is answered by (Freya, Fergus, MaternalGreatUncle).]

    Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Differentiable Neural Computer

    [Figure 3 visualization of the London Underground task: a Read and write weightings, b Read mode (backward / content / forward) for the write head and two read heads across the graph definition, query and answer phases, c London Underground map, d Read key, e Location content (decoded memory locations holding From/To/Line triples), over time steps 0–30.]

    Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Memory-augmented Neural Networks

    Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.


  • Memory-augmented NNs

    Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

    Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.


  • Memory-augmented NNs

    Figure 2 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
