
Deep Learning for Natural Language Processing

The Transformer model

Richard Johansson

[email protected]


drawbacks of recurrent models

▶ even with GRUs and LSTMs, it is difficult for RNNs to preserve information over long distances

▶ we introduced attention as a way to deal with this problem

▶ can we skip the RNN and just use attention?


attention models: recap

▶ first, compute an “energy” $e_i$ for each state $h_i$

▶ for the attention weights, we apply the softmax:

$$\alpha_i = \frac{\exp e_i}{\sum_{j=1}^{n} \exp e_j}$$

▶ finally, the “summary” is computed as a weighted sum (sketched in code below):

$$s = \sum_{i=1}^{n} \alpha_i h_i$$
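a minimal NumPy sketch of this recap, assuming the states $h_i$ are stacked as rows of a matrix H and the energies $e_i$ are already given as a vector e (the names attention_summary, H and e are only for illustration):

```python
import numpy as np

def attention_summary(H, e):
    """H: (n, d) matrix of states h_1..h_n; e: (n,) vector of energies e_i."""
    e = e - e.max()                      # for numerical stability; does not change the softmax
    alpha = np.exp(e) / np.exp(e).sum()  # attention weights alpha_i
    return alpha @ H                     # weighted sum s = sum_i alpha_i h_i

# toy example: three 4-dimensional states
H = np.random.randn(3, 4)
e = np.array([0.1, 2.0, -1.0])
s = attention_summary(H, e)              # shape (4,)
```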


the Transformer

▶ the Transformer (Vaswani et al., 2017) is an architecture that uses attention for information flow: “Attention is all you need”

▶ it was originally designed for machine translation and has two parts:

  ▶ an encoder that “summarizes” an input sentence

  ▶ a decoder (a conditional LM) that generates an output, based on the input

▶ let’s consider the encoder

illustration of a Transformer block

[figure: illustration of a Transformer block; images not included in the transcript]

multi-head attention

▶ in each layer, the Transformer applies several attention models (“heads”) in parallel

▶ intuitively, the heads are “looking” for different types of information

▶ each attention head computes a scaled dot product attention (see the code sketch below):

$$e_{ij} = \frac{1}{\sqrt{d}}\, q_i \cdot k_j \qquad\qquad \alpha = \mathrm{softmax}(e)$$

where $q_i$ and $k_j$ are linear transformations of the input at positions $i$ and $j$
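a minimal PyTorch sketch of a single scaled dot-product attention head applied to an input sequence X of shape (n, d_model); the class name AttentionHead and the sizes d_model and d_head are illustrative assumptions rather than the exact formulation of Vaswani et al. (2017):

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    def __init__(self, d_model, d_head):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head)   # q_i = W_q x_i
        self.W_k = nn.Linear(d_model, d_head)   # k_j = W_k x_j
        self.W_v = nn.Linear(d_model, d_head)   # values that the weights are applied to
        self.scale = d_head ** 0.5

    def forward(self, X):
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        E = Q @ K.transpose(-2, -1) / self.scale   # energies e_ij = q_i · k_j / sqrt(d)
        A = torch.softmax(E, dim=-1)               # attention weights, one row per position i
        return A @ V                               # weighted sums, shape (n, d_head)

X = torch.randn(5, 16)                           # toy input: 5 positions, d_model = 16
out = AttentionHead(d_model=16, d_head=8)(X)     # shape (5, 8)
```

in the full model, several such heads run in parallel and their outputs are concatenated and projected back to d_model dimensions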


a layer in the Transformer encoder

▶ after each application of multi-head attention, a 2-layer feedforward model (with ReLU activation) is applied

▶ residual connections (“shortcuts”) and layer normalization (Ba et al., 2016) are added for robustness and to facilitate training

▶ the Transformer encoder consists of a stack of this type of block (sketched below)
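a sketch of one such encoder block, assuming PyTorch’s built-in nn.MultiheadAttention for the multi-head self-attention and an assumed feedforward size d_ff; normalization is placed after each residual connection, as in the original Transformer:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(                 # 2-layer feedforward model with ReLU
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)       # layer normalization (Ba et al., 2016)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, X):
        attn_out, _ = self.attn(X, X, X)         # multi-head self-attention
        X = self.norm1(X + attn_out)             # residual connection + layer norm
        return self.norm2(X + self.ff(X))        # feedforward, residual, layer norm

# the encoder is a stack of this type of block:
encoder = nn.Sequential(*[TransformerEncoderBlock(64, 4, 256) for _ in range(6)])
X = torch.randn(2, 10, 64)                       # a batch of 2 sentences, 10 positions each
out = encoder(X)                                 # same shape as the input
```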


what do the attention heads look at?

▶ see (Vig, 2019)


pros and cons

+ short path length for information flow

– quadratic complexity (every position attends to every other position, so the attention weights form an n × n matrix)


the road ahead

▶ the full Transformer is an effective model for machine translation

▶ we’ll return to it when we discuss encoder–decoder architectures

▶ for now, let’s use it simply as a pre-trained representation


reading

The Illustrated Transformer


references

J. Ba, J. Kiros, and G. Hinton. 2016. Layer normalization. arXiv:1607.06450.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In NIPS 30.

J. Vig. 2019. Visualizing attention in transformer-based language representation models. arXiv:1904.02679.