
  • NPFL114, Lecture 12

    Transformer, External Memory Networks
    Milan Straka

    May 20, 2019

    Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


  • Exams

    Five questions, written preparation, then we go through it together (or you can leave and let me grade it by myself).

    Each question is 20 points, and up to 40 points (surplus above 80 points; there is no distinction between regular and competition points) are transferred from the practicals, and up to 10 points for GitHub pull requests.

    To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.

    The SIS should give you an exact time of the exam (including a gap between students) so that you do not come all at once.


  • What Next

    In the winter semester:

    NPFL117 – Deep Learning Seminar [0/2 Ex]
    Reading group of deep learning papers (in all areas). Every participant presents a paper about deep learning, learning how to read a paper, present it in an understandable way, and get deep learning knowledge from other presentations.

    NPFL122 – Deep Reinforcement Learning [2/2 C+Ex]
    In a sense a continuation of Deep Learning, but instead of supervised learning, reinforcement learning is the main method. Similar format to the Deep Learning course.

    NPFL129 – Machine Learning 101
    A course intended as a prequel to Deep Learning – an introduction to machine learning (regression, classification, structured prediction, clustering, hyperparameter optimization; decision trees, SVM, maximum entropy classifiers, gradient boosting, …), with practicals in Python.


  • Neural Architecture Search (NASNet) – 2017

    Using REINFORCE with baseline, we can design neural network architectures.

    We fix the overall architecture and design only Normal and Reduction cells.

    Figure 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Every block is designed by an RNN controller generating individual operations.

    Figure 3 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Figure 4 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Table 2 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Neural Architecture Search (NASNet) – 2017

    Figure 5 of paper "Learning Transferable Architectures for Scalable Image Recognition", https://arxiv.org/abs/1707.07012.


  • Attention is All You Need

    For some sequence processing tasks, sequential processing of the elements might be too restricting. Instead, we may want to combine sequence elements independently of their distance.


  • Attention is All You Need

    Figure 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Attention is All You Need

    The attention module for queries $Q$, keys $K$ and values $V$ is defined as

    $$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$

    The queries, keys and values are computed from the current word representations $V$ using a linear transformation as

    $$Q = V \cdot W^Q, \qquad K = V \cdot W^K, \qquad V = V \cdot W^V.$$
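    To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention (the function and variable names are illustrative, not taken from the lecture code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: [n_queries, d_k], K: [n_keys, d_k], V: [n_keys, d_v]
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # [n_queries, n_keys]
    return softmax(scores, axis=-1) @ V       # weighted sum of the values

# Self-attention: queries, keys and values all come from the same word
# representations X via learned projections W_Q, W_K, W_V.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 words of dimension 16
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
print(scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V).shape)   # (5, 16)
```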


  • Attention is All You Need

    Multihead attention is used in practice. Instead of using one huge attention, we split queries, keys and values into several groups (similar to how ResNeXt works), compute the attention in each of the groups separately, and then concatenate the results.

    Figure 2 (Scaled Dot-Product Attention, Multi-Head Attention) of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.
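    As a sketch of the splitting described above (assuming d_model is divisible by the number of heads; the projection matrices are illustrative):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    # Project, split into heads, attend in each head separately, concatenate.
    # X: [n_words, d_model]; W_Q, W_K, W_V, W_O: [d_model, d_model].
    n, d_model = X.shape
    d_head = d_model // n_heads
    def project(W):                            # -> [n_heads, n, d_head]
        return (X @ W).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    Q, K, V = project(W_Q), project(W_K), project(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # [n_heads, n, n]
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                     # [n_heads, n, d_head]
    return heads.transpose(1, 0, 2).reshape(n, d_model) @ W_O

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V, W_O = (rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=4).shape)    # (5, 16)
```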


  • Attention is All You Need

    Positional Embeddings
    We need to encode positional information (which was implicit in RNNs).

    Learned embeddings for every position.

    Sinusoids of different frequencies:

    $$\operatorname{PE}_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d}\right)$$
    $$\operatorname{PE}_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d}\right)$$

    This choice of functions should allow the model to attend to relative positions, since for any fixed $k$, $\operatorname{PE}_{pos+k}$ is a linear function of $\operatorname{PE}_{pos}$.
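    A small sketch computing the sinusoidal embeddings above (assuming an even dimension d):

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len, d):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d))
    pos = np.arange(max_len)[:, None]          # [max_len, 1]
    i = np.arange(d // 2)[None, :]             # [1, d/2]
    angles = pos / 10000 ** (2 * i / d)        # [max_len, d/2]
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

print(sinusoidal_positional_embeddings(20, 512).shape)   # (20, 512), as visualized below
```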


  • Attention is All You Need

    Positional embeddings for 20 words of dimension 512, lighter colors representing values closer to 1 and darker colors representing values closer to -1.

    http://jalammar.github.io/illustrated-transformer/


  • Attention is All You Need

    Regularization
    The network is regularized by:

    dropout of input embeddings,
    dropout of each sub-layer, just before it is added to the residual connection (and then normalized),
    label smoothing.

    The default dropout rate and also the label smoothing weight is 0.1.

    Parallel Execution
    Training can be performed in parallel because of the masked attention – the softmax weights of the self-attention are zeroed out so that words later in the sequence cannot be attended to.

    However, inference is still sequential (and no substantial improvements have been achieved on parallel inference similar to WaveNet).
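    A minimal sketch of how the masking can be realized – positions above the diagonal of the score matrix are set to −∞ before the softmax, so their weights become exactly zero (names are illustrative):

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: [n, n] self-attention scores; position i may attend only to j <= i.
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(2).normal(size=(4, 4))
print(np.round(causal_attention_weights(scores), 2))    # lower-triangular rows
```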


  • Why Attention

    Table 1 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Transformers Results

    Table 2 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Transformers Results

    |      | N | d_model | d_ff | h | d_k | d_v | P_drop | ε_ls | train steps | PPL (dev) | BLEU (dev) | params ×10⁶ |
    |------|---|---------|------|---|-----|-----|--------|------|-------------|-----------|------------|-------------|
    | base | 6 | 512 | 2048 | 8 | 64 | 64 | 0.1 | 0.1 | 100K | 4.92 | 25.8 | 65 |
    | (A)  |   |     |      | 1 | 512 | 512 | | | | 5.29 | 24.9 | |
    |      |   |     |      | 4 | 128 | 128 | | | | 5.00 | 25.5 | |
    |      |   |     |      | 16 | 32 | 32 | | | | 4.91 | 25.8 | |
    |      |   |     |      | 32 | 16 | 16 | | | | 5.01 | 25.4 | |
    | (B)  |   |     |      |   | 16 |    | | | | 5.16 | 25.1 | 58 |
    |      |   |     |      |   | 32 |    | | | | 5.01 | 25.4 | 60 |
    | (C)  | 2 |     |      |   |    |    | | | | 6.11 | 23.7 | 36 |
    |      | 4 |     |      |   |    |    | | | | 5.19 | 25.3 | 50 |
    |      | 8 |     |      |   |    |    | | | | 4.88 | 25.5 | 80 |
    |      |   | 256 |      |   | 32 | 32 | | | | 5.75 | 24.5 | 28 |
    |      |   | 1024 |     |   | 128 | 128 | | | | 4.66 | 26.0 | 168 |
    |      |   |     | 1024 |   |    |    | | | | 5.12 | 25.4 | 53 |
    |      |   |     | 4096 |   |    |    | | | | 4.75 | 26.2 | 90 |
    | (D)  |   |     |      |   |    |    | 0.0 | | | 5.77 | 24.6 | |
    |      |   |     |      |   |    |    | 0.2 | | | 4.95 | 25.5 | |
    |      |   |     |      |   |    |    | | 0.0 | | 4.67 | 25.3 | |
    |      |   |     |      |   |    |    | | 0.2 | | 5.47 | 25.7 | |
    | (E)  | positional embedding instead of sinusoids | | | | | | | | | 4.92 | 25.7 | |
    | big  | 6 | 1024 | 4096 | 16 | | | 0.3 | | 300K | 4.33 | 26.4 | 213 |

    (Empty cells denote values identical to those of the base model.)

    Table 4 of paper "Attention Is All You Need", https://arxiv.org/abs/1706.03762.


  • Neural Turing Machines

    So far, all input information was stored either directly in network weights, or in a state of a recurrent network.

    However, mammal brains seem to operate with a working memory – a capacity for short-term storage of information and its rule-based manipulation.

    We can therefore try to introduce an external memory to a neural network. The memory $M$ will be a matrix, where rows correspond to memory cells.


  • Neural Turing Machines

    The network will control the memory using a controller which reads from the memory and writes to it. Although the original paper also considered a feed-forward (non-recurrent) controller, usually the controller is a recurrent LSTM network.

    Figure 1 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machine

    Reading
    To read the memory in a differentiable way, the controller at time $t$ emits a read distribution $w_t$ over memory locations, and the returned read vector $r_t$ is then

    $$r_t = \sum_i w_t(i) \cdot M_t(i).$$

    Writing
    Writing is performed in two steps – an erase followed by an add. The controller at time $t$ emits a write distribution $w_t$ over memory locations, and also an erase vector $e_t$ and an add vector $a_t$. The memory is then updated as

    $$M_t(i) = M_{t-1}(i)\big[1 - w_t(i) e_t\big] + w_t(i) a_t.$$
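    A minimal NumPy sketch of the differentiable read and erase-then-add write (shapes and names are illustrative):

```python
import numpy as np

def ntm_read(M, w):
    # M: [cells, cell_size], w: [cells] read distribution -> read vector r_t
    return w @ M

def ntm_write(M, w, e, a):
    # M_t(i) = M_{t-1}(i) * (1 - w_t(i) * e_t) + w_t(i) * a_t
    return M * (1 - np.outer(w, e)) + np.outer(w, a)

rng = np.random.default_rng(3)
M = rng.normal(size=(8, 4))                       # 8 memory cells of size 4
w = np.full(8, 1 / 8)                             # addressing distribution
e, a = rng.uniform(size=4), rng.normal(size=4)    # erase in [0, 1], add vector
print(ntm_read(M, w).shape, ntm_write(M, w, e, a).shape)    # (4,) (8, 4)
```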


  • Neural Turing Machine

    The addressing mechanism is designed to allow both

    content addressing, and
    location addressing.

    Figure 2 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machine

    Content Addressing
    Content addressing starts by the controller emitting the key vector $k_t$, which is compared to all memory locations $M_t(i)$, generating a distribution using a $\operatorname{softmax}$ with temperature $\beta_t$:

    $$w_t^c(i) = \frac{\exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t \cdot \operatorname{distance}(k_t, M_t(j))\big)}.$$

    The $\operatorname{distance}$ measure is usually the cosine similarity

    $$\operatorname{distance}(a, b) = \frac{a \cdot b}{\lVert a \rVert \cdot \lVert b \rVert}.$$
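    A small sketch of the content addressing above, using cosine similarity and a softmax with temperature β (variable names are illustrative):

```python
import numpy as np

def content_addressing(M, key, beta):
    # M: [cells, cell_size], key: [cell_size], beta: scalar temperature.
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    scores = beta * sim
    scores -= scores.max()                        # numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(4)
M = rng.normal(size=(8, 4))
key = M[2] + 0.1 * rng.normal(size=4)             # a noisy copy of cell 2
print(np.argmax(content_addressing(M, key, beta=5.0)))   # most weight near cell 2
```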


  • Neural Turing Machine

    Location-Based Addressing
    To allow iterative access to memory, the controller might decide to reuse the memory location from the previous time step. Specifically, the controller emits an interpolation gate $g_t$ and defines

    $$w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}.$$

    Then, the current weighting may be shifted, i.e., the controller might decide to “rotate” the weights by a small integer. For a given range (the simplest case are only shifts $\{-1, 0, 1\}$), the network emits a $\operatorname{softmax}$ distribution $s_t$ over the shifts, and the weights are then defined using a circular convolution

    $$\tilde{w}_t(i) = \sum_j w_t^g(j)\, s_t(i - j).$$

    Finally, not to lose precision over time, the controller emits a sharpening factor $\gamma_t$ and the final memory location weights are

    $$w_t(i) = \tilde{w}_t(i)^{\gamma_t} \Big/ \sum_j \tilde{w}_t(j)^{\gamma_t}.$$
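    A small sketch of the three steps above (interpolation, circular-convolution shift over {−1, 0, +1}, sharpening); the values are illustrative:

```python
import numpy as np

def location_addressing(w_c, w_prev, g, s, gamma):
    # g: interpolation gate, s: distribution over shifts (-1, 0, +1), gamma >= 1.
    w_g = g * w_c + (1 - g) * w_prev
    n = len(w_g)
    shifted = np.zeros(n)
    for i in range(n):
        for s_k, shift in zip(s, (-1, 0, 1)):      # circular convolution
            shifted[i] += w_g[(i - shift) % n] * s_k
    sharpened = shifted ** gamma                    # sharpen, then renormalize
    return sharpened / sharpened.sum()

w_c = np.array([0.05, 0.80, 0.05, 0.05, 0.05])
w_prev = np.full(5, 0.2)
s = np.array([0.0, 0.0, 1.0])                       # rotate the weighting by +1
print(np.round(location_addressing(w_c, w_prev, g=0.9, s=s, gamma=2.0), 3))
```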


  • Neural Turing Machine

    Overall Execution
    Even if not specified in the original paper, following the DNC paper, the controller can be implemented as a (potentially deep) LSTM. Assuming $R$ read heads and one write head, the input is $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step, the outputs of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r\big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of

    $$k_t^1, \beta_t^1, g_t^1, s_t^1, \gamma_t^1,\; k_t^2, \beta_t^2, g_t^2, s_t^2, \gamma_t^2,\; \ldots,\; k_t^R, \beta_t^R, g_t^R, s_t^R, \gamma_t^R,\; k_t^w, \beta_t^w, g_t^w, s_t^w, \gamma_t^w,\; e_t, a_t.$$
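    As an illustration only (this exact layout is not prescribed by the papers), the concatenated vector ξ_t could be split into per-head parameters like this, assuming a shift range of three and keys of size cell_size:

```python
import numpy as np

def split_controller_output(xi, n_read_heads, cell_size):
    # Each head gets k (cell_size), beta, g, s (3 shifts), gamma; the write head
    # additionally needs the erase and add vectors e, a at the end of xi.
    per_head = cell_size + 1 + 1 + 3 + 1
    heads, offset = [], 0
    for _ in range(n_read_heads + 1):               # R read heads + 1 write head
        c = xi[offset:offset + per_head]
        heads.append({"k": c[:cell_size], "beta": c[cell_size], "g": c[cell_size + 1],
                      "s": c[cell_size + 2:cell_size + 5], "gamma": c[cell_size + 5]})
        offset += per_head
    e, a = xi[offset:offset + cell_size], xi[offset + cell_size:]
    return heads, e, a

xi = np.arange(3 * 10 + 2 * 4, dtype=float)         # 2 read heads, cell_size 4
heads, e, a = split_controller_output(xi, n_read_heads=2, cell_size=4)
print(len(heads), e.shape, a.shape)                  # 3 (4,) (4,)
```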


  • Neural Turing Machines

    Copy Task
    Repeat the same sequence as given on input. Trained with sequences of length up to 20.

    Figure 3 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.

    Figure 4 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences. Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.

    Figure 5 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 6 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Associative Recall
    In associative recall, a sequence is given on input, consisting of subsequences of length 3. Then a randomly chosen subsequence is presented on input and the goal is to produce the following subsequence.


  • Neural Turing Machines

    Figure 11 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Neural Turing Machines

    Figure 12 of paper "Neural Turing Machines", https://arxiv.org/abs/1410.5401.


  • Differentiable Neural Computer

    NTM was later extended to a Differentiable Neural Computer.

    [Figure 1 schematic: a Controller, b Read and write heads, c Memory, d Memory usage and temporal links. The write head emits a write key, write vector and erase vector; each read head emits a read key and a read mode; the read vectors are fed back to the controller.]

    Figure 1 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Differentiable Neural Computer

    The DNC contains multiple read heads and one write head.

    The controller is a deep LSTM network, with input at time $t$ being the current input $x_t$ and the $R$ read vectors $r_{t-1}^1, \ldots, r_{t-1}^R$ from the previous time step. The outputs of the controller are vectors $(\nu_t, \xi_t)$, and the final output is $y_t = \nu_t + W_r\big[r_t^1, \ldots, r_t^R\big]$. The $\xi_t$ is a concatenation of parameters for the read and write heads (keys, gates, sharpening parameters, …).

    In DNC, usage of every memory location is tracked, which allows us to define an allocation weighting. Furthermore, for every memory location we track which memory location was written to previously ($b_t$) and subsequently ($f_t$).

    Then, the write weighting is defined as a weighted combination of the allocation weighting and the write content weighting, and the read weighting is computed as a weighted combination of the read content weighting, the previous write weighting, and the subsequent write weighting.
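    A rough sketch of how the final weightings combine the distributions described above; the allocation/write gates and the read-mode distribution π are treated here as given inputs, and the computation of the individual weightings themselves is omitted:

```python
import numpy as np

def dnc_write_weighting(alloc_w, content_w, g_alloc, g_write):
    # Interpolate between allocation and write-content weighting, scaled by a write gate.
    return g_write * (g_alloc * alloc_w + (1 - g_alloc) * content_w)

def dnc_read_weighting(backward_w, content_w, forward_w, pi):
    # pi: distribution over the three read modes (backward, content, forward).
    return pi[0] * backward_w + pi[1] * content_w + pi[2] * forward_w

n = 6
alloc_w = np.eye(n)[0]                      # e.g. prefer the least-used cell
content_w = np.full(n, 1 / n)
print(dnc_write_weighting(alloc_w, content_w, g_alloc=0.7, g_write=1.0))
print(dnc_read_weighting(np.eye(n)[1], content_w, np.eye(n)[2], pi=[0.2, 0.5, 0.3]))
```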


  • Differentiable Neural Computer

    [Figure 2 examples: a Random graph, b London Underground (84 edges in total), c Family tree (54 edges in total). A traversal question such as (BondSt, _, Central), (_, _, Circle), …, (_, _, Jubilee) is answered by the sequence of traversed edges, e.g. (BondSt, NottingHillGate, Central), (NottingHillGate, GloucesterRd, Circle), …; the shortest-path question (Moorgate, PiccadillyCircus, _) is answered by the path Moorgate → Bank → Holborn → LeicesterSq → PiccadillyCircus; the inference question (Freya, _, MaternalGreatUncle) is answered by (Freya, Fergus, MaternalGreatUncle).]

    Figure 2 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Differentiable Neural Computer

    [Figure 3 visualization of the London Underground task: a Read and write weightings, b Read mode (backward / content / forward) for the write head and two read heads across the graph definition, query and answer phases, c London Underground map, d Read key, e Location content (decoded memory locations holding From/To/Line triples), over time steps 0–30.]

    Figure 3 of paper "Hybrid computing using a neural network with dynamic external memory", https://www.nature.com/articles/nature20101.


  • Memory-augmented Neural Networks

    Figure 1 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.


  • Memory-augmented NNs

    Page 3 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.

    Page 4 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.


  • Memory-augmented NNs

    Figure 2 of paper "One-shot learning with Memory-Augmented Neural Networks", https://arxiv.org/abs/1605.06065.
