
Sequence to Sequence Models

Jon Dehdari

January 25, 2016

1 / 12

Good Morning!

2 / 12

Sentence Vectors

• We’ve seen that words can be represented as vectors. Can sentences be represented as vectors?

• Sure, why not? How? From the hidden state at the end of a sentence: h_i = φ_enc(h_{i−1}, s_i) (φ_enc = LSTM or GRU; sketched below)

• Are they any good? For Elman networks (SRNs), not so much. For LSTMs or GRUs, yes, they’re pretty good

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2

3 / 12
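As a rough illustration of the recurrence on this slide, here is a minimal sketch of the encoder loop in Python/NumPy. It assumes made-up word vectors and random toy parameters, and it substitutes a plain tanh (Elman-style) cell for the LSTM/GRU that the slide names as φ_enc.

# Minimal sketch of the encoder recurrence h_i = phi_enc(h_{i-1}, s_i).
# Assumption: a plain tanh cell stands in for the LSTM/GRU named on the slide,
# and the word vectors s_i and all parameters are random toy values.
import numpy as np

rng = np.random.default_rng(0)
d_word, d_hidden = 4, 6                                # toy dimensions

W = rng.normal(scale=0.1, size=(d_hidden, d_word))     # input -> hidden
U = rng.normal(scale=0.1, size=(d_hidden, d_hidden))   # hidden -> hidden

def phi_enc(h_prev, s_i):
    """One encoder step: combine the previous state with the current word vector."""
    return np.tanh(W @ s_i + U @ h_prev)

sentence = rng.normal(size=(5, d_word))                # word vectors s_1 .. s_5
h = np.zeros(d_hidden)
for s_i in sentence:
    h = phi_enc(h, s_i)

sentence_vector = h                                    # hidden state at the end of the sentence
print(sentence_vector)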


Sentence Vector Examples

[Figure: two 2-D scatter plots of sentence vectors. Left: “I gave her a card in the garden”, “In the garden, I gave her a card”, “She was given a card by me in the garden”, “She gave me a card in the garden”, “In the garden, she gave me a card”, “I was given a card by her in the garden”. Right: “John respects Mary”, “Mary respects John”, “John admires Mary”, “Mary admires John”, “Mary is in love with John”, “John is in love with Mary”.]

Sentence vectors were projected to two dimensions using PCA

Images courtesy of Sutskever et al. (2014)

4 / 12
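For reference, a minimal sketch of the 2-D PCA projection used for plots like these, assuming a small matrix of made-up sentence vectors in place of real encoder states:

# Project toy "sentence vectors" to two dimensions with PCA (via SVD).
# Assumption: the vectors are random stand-ins for encoder final states.
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(6, 8))          # 6 sentence vectors, 8 dimensions each

centered = vectors - vectors.mean(axis=0)  # PCA works on mean-centered data
_, _, components = np.linalg.svd(centered, full_matrices=False)
projected = centered @ components[:2].T    # keep the first two principal directions

print(projected)                           # one 2-D point per sentence, ready to plot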

Generating Sentences from Vectors

• We can also try to go the other direction, generating sentences from vectors

• How? Use an RNN to decode, rather than encode, a sentence: z_i = φ_dec(z_{i−1}, u_{i−1}, h_T) (sketched below)

• h_T ensures global sentence coherence (& adequacy in MT); u_{i−1} ensures local fluency

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2

5 / 12
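A minimal sketch of the decoder recurrence z_i = φ_dec(z_{i−1}, u_{i−1}, h_T), again with random toy parameters, a plain tanh cell in place of an LSTM/GRU, and greedy argmax decoding over a made-up target vocabulary:

# Minimal sketch of the decoder recurrence z_i = phi_dec(z_{i-1}, u_{i-1}, h_T).
# Assumptions: random toy parameters, a tanh cell instead of an LSTM/GRU,
# and greedy decoding over a 10-word toy vocabulary.
import numpy as np

rng = np.random.default_rng(2)
d_hidden, d_emb, vocab_size = 6, 4, 10

W_z = rng.normal(scale=0.1, size=(d_hidden, d_hidden))      # weights for z_{i-1}
W_u = rng.normal(scale=0.1, size=(d_hidden, d_emb))         # weights for u_{i-1} (previous output word)
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))      # weights for the sentence vector h_T
W_out = rng.normal(scale=0.1, size=(vocab_size, d_hidden))  # hidden -> vocabulary scores
E = rng.normal(scale=0.1, size=(vocab_size, d_emb))         # target word embeddings

def phi_dec(z_prev, u_prev, h_T):
    """One decoder step: previous state, previous output word, and the sentence vector."""
    return np.tanh(W_z @ z_prev + W_u @ u_prev + W_h @ h_T)

h_T = rng.normal(size=d_hidden)          # pretend this came from the encoder
z = np.zeros(d_hidden)
u = np.zeros(d_emb)                      # embedding of a <bos> token
for _ in range(5):                       # generate 5 toy target words
    z = phi_dec(z, u, h_T)
    word = int(np.argmax(W_out @ z))     # greedy choice of the next word
    u = E[word]                          # feed the chosen word's embedding back in
    print(word, end=" ")
print()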

Using Neural Encoders & Decoders to Translate

• We can combine the neural encoder and decoder of the previous slides to form an encoder-decoder model

• This can be used for machine translation, and other tasks that map sequences to sequences

• Monolingual word projections (vectors/embeddings) are trained to maximize the likelihood of the next word

• Source-side word projections (s_i) in an encoder-decoder setting are trained to maximize target-side likelihood (see the sketch after this slide)

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2

6 / 12
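A minimal sketch of how the two pieces fit together and of the target-side likelihood being maximized. It reuses the toy tanh recurrences from the earlier sketches; the parameters and the source/target pair are made up.

# Encode a source sentence, then score a reference target sentence.
# Training would maximize this (log-)likelihood over the whole corpus,
# updating the decoder, the encoder, and the source-side projections s_i.
import numpy as np

rng = np.random.default_rng(3)
d_word, d_hidden, vocab = 4, 6, 10

W_e = rng.normal(scale=0.1, size=(d_hidden, d_word))    # encoder: input -> hidden
U_e = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # encoder: hidden -> hidden
W_z = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # decoder: z_{i-1}
W_u = rng.normal(scale=0.1, size=(d_hidden, vocab))     # decoder: one-hot previous target word
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # decoder: sentence vector h_T
W_o = rng.normal(scale=0.1, size=(vocab, d_hidden))     # decoder: hidden -> vocabulary scores

def encode(source):
    h = np.zeros(d_hidden)
    for s_i in source:                      # h_i = phi_enc(h_{i-1}, s_i)
        h = np.tanh(W_e @ s_i + U_e @ h)
    return h                                # h_T, the sentence vector

def target_log_likelihood(h_T, target_ids):
    z, u = np.zeros(d_hidden), np.zeros(vocab)
    log_lik = 0.0
    for t in target_ids:                    # z_i = phi_dec(z_{i-1}, u_{i-1}, h_T)
        z = np.tanh(W_z @ z + W_u @ u + W_h @ h_T)
        p = np.exp(W_o @ z); p /= p.sum()   # softmax over the target vocabulary
        log_lik += np.log(p[t])             # log-probability of the reference word
        u = np.eye(vocab)[t]                # feed the reference word back in
    return log_lik

source = rng.normal(size=(5, d_word))       # 5 made-up source word vectors
target = [3, 7, 1]                          # made-up reference target word ids
print(target_log_likelihood(encode(source), target))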


Bidirectional RNNs

• The basic encoder-decoder architecture doesn’t handle long sentences very well

• Everything must fit into a fixed-size vector,

• and RNNs remember recent items better

• We can combine left-to-right and right-to-left RNNs to overcome these issues (sketched below)

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3

7 / 12
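A minimal sketch of a bidirectional encoder, using the same toy tanh cells as before: one RNN reads the sentence left to right, another reads it right to left, and their states are concatenated per position, so the decoder gets one annotation per source word rather than a single fixed-size vector.

# Bidirectional toy encoder: concatenate forward and backward RNN states.
# Assumptions: random parameters and plain tanh cells, as in the earlier sketches.
import numpy as np

rng = np.random.default_rng(4)
d_word, d_hidden = 4, 6

def make_rnn():
    W = rng.normal(scale=0.1, size=(d_hidden, d_word))
    U = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
    def run(words):
        h, states = np.zeros(d_hidden), []
        for s_i in words:
            h = np.tanh(W @ s_i + U @ h)
            states.append(h)
        return states
    return run

forward, backward = make_rnn(), make_rnn()
sentence = rng.normal(size=(5, d_word))        # 5 made-up word vectors

fwd = forward(sentence)                        # states reading left to right
bwd = backward(sentence[::-1])[::-1]           # states reading right to left, realigned

annotations = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(annotations), annotations[0].shape)  # 5 annotations of size 2 * d_hidden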


What if . . .

• Even bidirectional encoder-decoders have a hard time with long sentences

• We need a way to keep track of what’s already been translated and what to translate next

• For neural nets, the solution is often more neural nets . . .

8 / 12


Achtung, Baby!

• Attention-based decoding adds another network (a) that takes as input the encoder’s hidden state (h) and the decoder’s hidden state (z), and outputs a probability for each source word at each time step (when and where to pay attention): e_{i,j} = a(z_{i−1}, h_j) = v_a^⊤ tanh(W_a z_{i−1} + U_a h_j) (sketched below)

• The attention weights can also function as soft word alignments. They’re trained with target-side maximum likelihood

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3

9 / 12
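A minimal sketch of the attention scores e_{i,j} and the context vector they produce for one decoder step, assuming random toy parameters and made-up encoder annotations h_j:

# Attention scores e_{i,j} = v_a^T tanh(W_a z_{i-1} + U_a h_j), softmaxed into
# weights, then used to form a context vector for the current decoder step.
# Assumptions: random toy parameters and made-up encoder annotations.
import numpy as np

rng = np.random.default_rng(5)
d_dec, d_enc, d_att, src_len = 6, 12, 8, 5

W_a = rng.normal(scale=0.1, size=(d_att, d_dec))   # transforms the decoder state
U_a = rng.normal(scale=0.1, size=(d_att, d_enc))   # transforms each encoder annotation
v_a = rng.normal(scale=0.1, size=d_att)

z_prev = rng.normal(size=d_dec)                    # decoder state z_{i-1}
H = rng.normal(size=(src_len, d_enc))              # encoder annotations h_1 .. h_T

e = np.array([v_a @ np.tanh(W_a @ z_prev + U_a @ h_j) for h_j in H])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # softmax -> attention weights
context = alpha @ H                                # weighted sum of annotations

print(alpha)          # one weight per source word (the "soft alignment")
print(context.shape)  # fed into the next decoder step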


Image Caption Generation

• You can use attention-based decoding to give textual descriptions of images

Image courtesy of http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3

10 / 12

Image Caption Generation Examples

Images courtesy of http://arxiv.org/abs/1502.03044

11 / 12

Image Caption Generation, Step by Step

Images courtesy of http://arxiv.org/abs/1502.03044

12 / 12
