Sequence to sequence (encoder-decoder) learning
TRANSCRIPT
Seq2seq...and beyond
Hello! I am Roberto Silveira
EE engineer, ML enthusiast
@rsilveira79
Sequence is a matter of time
RNN is what you need!
Basic Recurrent cells (RNN)
Source: http://colah.github.io/
Issues
× Difficulty dealing with long-term dependencies
× Difficult to train - vanishing gradient problem
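For reference, a minimal NumPy sketch of the vanilla RNN recurrence behind these issues (dimensions and weight names are illustrative assumptions, not from the slides): the same recurrent weight matrix is applied at every time step, which is why gradients flowing back through many steps tend to vanish or explode.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)."""
    h = np.zeros(W_hh.shape[0])            # initial hidden state
    states = []
    for x_t in x_seq:                      # one step per input vector
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return states

# Toy example: 5 steps of 3-d inputs, 4-d hidden state.
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal(3) for _ in range(5)]
states = rnn_forward(x_seq,
                     W_xh=0.1 * rng.standard_normal((4, 3)),
                     W_hh=0.1 * rng.standard_normal((4, 4)),
                     b_h=np.zeros(4))
print(states[-1])
```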
Long term issues
Source: http://colah.github.io/, CS224d notes
Sentence 1"Jane walked into the room. John walked in too. Jane said hi to ___"
Sentence 2"Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___"
LSTM in 2 min...
Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data
Source: http://colah.github.io/
Key pieces of the LSTM cell: the cell state, plus the forget, input, and output gates.
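As a concrete reading of those four pieces, here is a single-step LSTM sketch in NumPy (weight layout, names, and dimensions are illustrative assumptions; the gate equations follow the standard formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the forget (f),
    input (i), output (o) gates and the candidate cell values (g)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate values
    c_t = f * c_prev + i * g          # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)            # new hidden state (output)
    return h_t, c_t

# Toy dimensions: 3-d input, 4-d hidden/cell state.
rng = np.random.default_rng(1)
W = {k: 0.1 * rng.standard_normal((4, 3)) for k in "fiog"}
U = {k: 0.1 * rng.standard_normal((4, 4)) for k in "fiog"}
b = {k: np.zeros(4) for k in "fiog"}
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, U, b)
print(h, c)
```

The cell state is updated largely additively (f * c_prev + i * g), which is what helps gradients survive over long sequences.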
Gated recurrent unit (GRU) in 2 min ...
Review
× Fewer parameters
× Trains faster
× Better solution with less data
Source: http://www.wildml.com/, arXiv:1412.3555
Key pieces of the GRU cell: the reset gate and the update gate.
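And the matching single-step GRU sketch (same illustrative NumPy conventions as the LSTM sketch above; note there is no separate cell state, and the sign convention of the update gate varies between papers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step with reset (r) and update (z) gates and candidate state (g)."""
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])        # reset gate
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])        # update gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ (r * h_prev) + b["g"])  # candidate state
    return (1.0 - z) * h_prev + z * g   # interpolate previous and candidate state
```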
Seq2seq learning
Or encoder-decoder architectures
Basic idea: "variable"-size input (encoder) → fixed-size vector representation → "variable"-size output (decoder)
Example: ["Machine", "Learning", "is", "fun"] → encoded sequence (a fixed-size vector, e.g. [0.636, 0.122, 0.981]) → ["Aprendizado", "de", "Máquina", "é", "divertido"]
The first RNN (encoder) is a stateful model that reads the input one word at a time; its memory of previous words influences the next result. The second RNN (decoder) is likewise a stateful model that emits the output one word at a time from the encoded sequence, again with memory of previous words influencing the next result.
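A minimal sketch of this encoder-decoder wiring using tf.keras (vocabulary sizes, dimensions, and layer names are illustrative assumptions; data preparation and the step-by-step inference loop are omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, emb_dim, hidden = 8000, 8000, 128, 256  # toy sizes

# Encoder: embed source tokens, keep only the final LSTM states.
enc_inputs = layers.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_inputs)
_, state_h, state_c = layers.LSTM(hidden, return_state=True)(enc_emb)

# Decoder: embed target tokens (shifted right), start from the encoder states.
dec_inputs = layers.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_inputs)
dec_out, _, _ = layers.LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
logits = layers.Dense(tgt_vocab)(dec_out)   # one prediction per target position

model = Model([enc_inputs, dec_inputs], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

During training the decoder receives the ground-truth previous target word (teacher forcing); at inference time it is unrolled one step at a time, feeding its own prediction back as the next input.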
Sequence to Sequence Learning with Neural Networks (2014)
"Machine","Learning",
"is","fun"
"Aprendizado","de",
"Máquina","é",
"divertido"
0.6360.1220.981
1000d word embeddings
4 layers1000
cells/layer
Encoded Sequence
LSTM(Encoder)
LSTM(Decoder)
Source: arXiv 1409.3215v3
TRAINING → SGD w/o momentum, fixed learning rate of 0.7, 7.5 epochs, batches of 128 sentences, 10 days of training (WMT 14 dataset English to French)
4 layers1000
cells/layer
Recurrent encoder-decoders
Figure: the encoder reads the source sequence "Les chiens aiment les os <EOS>"; the decoder then generates the target sequence "Dogs love bones <EOS>", with each emitted word fed back in as the next decoder input.
Source: arXiv 1409.3215v3
Recurrent encoder-decoders - issues
● Difficult to cope with long sentences (longer than those seen in the training corpus)
● A decoder with an attention mechanism → relieves the encoder from squashing everything into a fixed-length vector
Source: arXiv 1409.3215v3
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)
Source: arXiv 1409.0473v7
Decoder: a separate context vector is computed for each target word from the weights assigned to each annotation h_j.
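Reconstructing the formulas these figure labels refer to (standard notation from arXiv 1409.0473, not text taken from the slide): the context vector c_i for target word i is a weighted sum of the encoder annotations h_j, with weights alpha_ij given by a softmax over alignment scores e_ij computed from the previous decoder state s_{i-1}:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
e_{ij} = a(s_{i-1}, h_j)
```

where a is a small feed-forward alignment network trained jointly with the rest of the model.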
The resulting alignment between source and target words can be non-monotonic.
Attention models for NLP
Source: arXiv 1409.0473v7
Figure (decoding step by step): at each output step the decoder forms a weighted sum (+) over the annotations of the source words "Les chiens aiment les os <EOS>" and emits the next target word, producing "Dogs", "love", "bones", and finally <EOS> one word at a time while re-attending to the source at every step.
Challenges in using the model
● Cannot handle truly variable-size input
● Hard to deal with both short and long sentences
● Need to capture contextual and semantic meaning
Techniques: PADDING, BUCKETING, WORD EMBEDDINGS
Source: http://suriyadeepan.github.io/
Padding
Source: http://suriyadeepan.github.io/
EOS : end of sentence
PAD : filler
GO : start decoding
UNK : unknown; word not in vocabulary
Q : "What time is it?"  A : "It is seven thirty."
Q : [ PAD, PAD, PAD, PAD, PAD, "?", "it", "is", "time", "What" ]
A : [ GO, "It", "is", "seven", "thirty", ".", EOS, PAD, PAD, PAD ]
Note that the question is reversed and left-padded, while the answer is right-padded to the same fixed length.
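A small plain-Python sketch of this preprocessing (the special tokens mirror the slide; reversing the encoder input follows the example above, and pad_pair is a made-up helper name):

```python
PAD, GO, EOS, UNK = "PAD", "GO", "EOS", "UNK"

def pad_pair(question_tokens, answer_tokens, enc_len=10, dec_len=10):
    """Reverse and left-pad the encoder input; add GO/EOS and right-pad the decoder side."""
    q = list(reversed(question_tokens))
    q = [PAD] * (enc_len - len(q)) + q          # left-pad to enc_len
    a = [GO] + answer_tokens + [EOS]
    a = a + [PAD] * (dec_len - len(a))          # right-pad to dec_len
    return q, a

q, a = pad_pair(["What", "time", "is", "it", "?"],
                ["It", "is", "seven", "thirty", "."])
print(q)   # ['PAD', 'PAD', 'PAD', 'PAD', 'PAD', '?', 'it', 'is', 'time', 'What']
print(a)   # ['GO', 'It', 'is', 'seven', 'thirty', '.', 'EOS', 'PAD', 'PAD', 'PAD']
```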
Source: https://www.tensorflow.org/
Bucketing
Efficiently handle sentences of different lengths
Example: the largest sentence in the corpus has 100 tokens
What about short sentences like "How are you?" → lots of PAD tokens
Bucket list: [(5, 10), (10, 15), (20, 25), (40, 50)] (default in TensorFlow's translate.py)
Q : [ PAD, PAD, ".", "go", "I" ]
A : [ GO, "Je", "vais", ".", EOS, PAD, PAD, PAD, PAD, PAD ]
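A sketch of the bucketing step, reusing the pad_pair helper from the padding sketch above (the bucket list is the one quoted from translate.py; bucket_and_pad is an illustrative name, not a TensorFlow API):

```python
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def bucket_and_pad(question_tokens, answer_tokens, buckets=BUCKETS):
    """Pick the smallest bucket the pair fits into, then pad to that bucket's sizes."""
    for enc_len, dec_len in buckets:
        # +2 on the decoder side leaves room for the GO and EOS tokens.
        if len(question_tokens) <= enc_len and len(answer_tokens) + 2 <= dec_len:
            return (enc_len, dec_len), pad_pair(question_tokens, answer_tokens,
                                                enc_len=enc_len, dec_len=dec_len)
    raise ValueError("pair is longer than the largest bucket")

bucket, (q, a) = bucket_and_pad(["I", "go", "."], ["Je", "vais", "."])
print(bucket)   # (5, 10): only a handful of PADs instead of padding to 100 tokens
```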
Word embeddings (remember the previous presentation ;-)
Distributed representations → syntactic and semantic regularities are captured
"Take" = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
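Under the hood an embedding is just a row lookup in a learned matrix; a toy NumPy sketch (the vocabulary and random values are made up for illustration; in a real model the matrix is learned, e.g. by an Embedding layer):

```python
import numpy as np

vocab = {"machine": 0, "learning": 1, "take": 2}                # toy vocabulary
E = np.random.default_rng(2).standard_normal((len(vocab), 8))   # learned in practice

def embed(word):
    return E[vocab[word]]    # one dense vector per word

print(embed("take"))         # an 8-dimensional distributed representation
```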
Word embeddings (remember the previous presentation ;-)
Linguistic regularities (recap)
Phrase representations (Paper - Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation)
Source: arXiv 1406.1078v3
Phrases are mapped to a 1000-dimensional vector representation.
Applications
Neural conversational model - chatbots
Source: arXiv 1506.05869v3
Google Smart reply
Source: arXiv 1606.04870v1
Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set of 238 million messages
System components: the Seq2Seq response model, a feedforward triggering model, and semi-supervised semantic clustering.
Image captioning (Paper - Show and Tell: A Neural Image Caption Generator)
Source: arXiv 1411.4555v2
In the figure, the encoder maps the image to a fixed-size representation and the decoder generates the caption one word at a time.
What's next?
And so?
Multi-task sequence to sequence (Paper - MULTI-TASK SEQUENCE TO SEQUENCE LEARNING)
Source: arXiv 1511.06114v4
One-to-Many (common encoder)
Many-to-One (common decoder)
Many-to-Many
Neural programmer (Paper - NEURAL PROGRAMMER: INDUCING LATENT PROGRAMS WITH GRADIENT DESCENT)
Source: arXiv 1511.04834v3
Unsupervised pre-training for seq2seq - 2017 (Paper - UNSUPERVISED PRETRAINING FOR SEQUENCE TO SEQUENCE LEARNING)
Source: arXiv 1611.02683v1
In the figure, both the encoder and the decoder are initialized from pre-trained weights.
A quick example in TensorFlow