
Interleaving Language and RL: Language Generation

Florian Strub
Reinforcement Learning Summer School

7 July 2019

Overview

● Kind introduction to NLP
● Policy gradient for translation
● Goal-oriented dialogue systems
  ○ Dialogue setting
  ○ GuessWhat?!
  ○ Self-play for language generation
● Other linguistically grounded tasks:
  ○ Language as goal representation: instruction following
  ○ Language as state representation: text games
  ○ Language as policy compositionality: emergence of language


Kind Introduction to NLP: What is NLP?

Natural Language Processing (NLP) aims to extract representations of textual information so as to read and make sense of human language in a useful manner.


Kind Introduction to NLP: Language is Hard :)

Georgetown–IBM experiment, 1954: translating sixty Russian sentences into English.

ALPAC report, 1966: "there is no immediate or predictable prospect of useful machine translation"

Kind Introduction to NLP: Small Historical Note

Rationalism: language possesses an underlying structure that must be discovered, conjecturing that human language is innate and can be modelled like physical laws.
Chomsky, Noam. Syntactic Structures. 1957. (transformational generative grammars)

Empiricism: language is a cognitive process that can be learned through experimentation, advocating the exploration of learning mechanisms rather than linguistic models.
J. R. Firth. A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis. ("You shall know a word by the company it keeps")

"The validity of statistical (information theoretic) approach to MT has indeed been recognized … as early as 1949. And was universally recognized as mistaken [sic] by 1950. … The crude force of computers is not science." (review of Brown et al. 1990)

Kind Introduction to NLP: Language Modelling

Goal of the statistical approach: how likely is a sentence?

- "The cherry on the cake"
- "The cake on the cherry"

but the cake is a lie...

Kind Introduction to NLP: Language Modelling

A sentence is represented as a sequence of words, and we compute the probability of the word sequence with the chain rule:

$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$$

http://videolectures.net/deeplearning2016_cho_language_understanding/

Kind Introduction to NLP: Language Modelling

How do we estimate the conditional probabilities?

N-gram modelling: we can simply count words!

$$p(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\operatorname{count}(w_{t-n+1}, \dots, w_t)}{\operatorname{count}(w_{t-n+1}, \dots, w_{t-1})}$$

Do not trust the cake!

Several problems (memory footprint, no generalization, etc.)
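To make the counting idea concrete, here is a minimal bigram sketch (not from the slides; the toy corpus and function names are made up for illustration):

```python
from collections import Counter

# Toy corpus; any tokenized text would do.
corpus = "the cherry on the cake . the cake is a lie .".split()

# Count bigrams and their left contexts.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def cond_prob(word, prev):
    """Maximum-likelihood bigram estimate of p(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0

print(cond_prob("cake", "the"))     # 2/3: "the" is followed by "cake" twice out of three
print(cond_prob("cherry", "cake"))  # 0.0: unseen bigram, the generalization problem
```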


Kind Introduction to NLP: Neural Language Modelling

How do we estimate the conditional probabilities? Learn a representation of the word history: use neural networks!

(figure: an RNN reads "The cherry on the" and predicts the shifted sequence "cherry on the cake")

Kind Introduction to NLP: Neural Language Modelling

- Indexed vocabulary (hash table): V = { "Is": 1, …, "cherry": 23, … }
- Word embedding!
- RNN: basic RNN, LSTM, GRU, etc.
- Classifier: MLP + softmax over the vocabulary
- Sampling procedure: greedy, stochastic, beam search, etc.
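As an illustration of this pipeline, a minimal PyTorch sketch (my own, not the talk's model; all sizes and names are arbitrary):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Indexed tokens -> word embedding -> LSTM -> MLP/softmax classifier."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len) word indices
        h, _ = self.rnn(self.embed(tokens))  # h: (batch, seq_len, hidden_dim)
        return self.classifier(h)            # next-word logits at every position

# Greedy sampling: repeatedly feed back the argmax of the softmax.
model = RNNLanguageModel(vocab_size=100)
tokens = torch.tensor([[1]])                 # start from some <start> index
for _ in range(5):
    next_token = model(tokens)[:, -1].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
```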


Kind Introduction to NLP: Training Procedure

Given a corpus of sentences, the goal is to maximize the joint probability; this is the key NLP equation:

$$\max_{\theta} \sum_{\boldsymbol{w} \in \text{corpus}} \sum_{t=1}^{T} \log p_{\theta}(w_t \mid w_1, \dots, w_{t-1})$$

"Cake and grief counseling will be available at the conclusion of the test."

Kind Introduction to NLP: Training Procedure

Training time (teacher forcing): the model is fed the ground-truth prefix, e.g. "The cherry on the", and trained to predict the shifted ground truth "cherry on the cake", even when its own prediction would have been "cherry over the table".

Test time (exposure bias): the model is fed its own predictions, so one early mistake compounds: "The cherry over the" leads to "cherry over the table".
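Continuing the hypothetical RNNLanguageModel sketched above, teacher forcing boils down to shifting the ground-truth sentence by one position:

```python
import torch
import torch.nn.functional as F

# Toy indices for "<start> the cherry on the cake"; at training time the input
# is the ground-truth prefix and the target is the same sentence shifted by one.
sentence = torch.tensor([[1, 5, 23, 9, 5, 4]])
inputs, targets = sentence[:, :-1], sentence[:, 1:]

model = RNNLanguageModel(vocab_size=100)    # class from the previous sketch
logits = model(inputs)                      # (batch, seq_len, vocab)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()                             # maximize log-likelihood, word by word
# At test time the model is instead fed its own sampled words: exposure bias.
```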



Kind Introduction to NLP: It Is Only the Beginning!

We can generate language :)

But wait… it is pretty useless! (so is the cake, what a lame running joke...)

Can we use it as a translation system? Well...

The cherry on the cake → La cerise sur le gâteau
He looked for the cake → Il est allé chercher … le gâteau

Bad word alignment: the sentences do not line up word for word.

Kind Introduction to NLP: Seq2Seq Models

Idea: decompose encoding and decoding!

(figure: an encoder RNN reads the input "le gâteau est un mensonge" into a state vector; a decoder RNN turns that state into the output "the cake is a lie")


Kind Introduction to NLP: Seq2Seq Models

Sequence-to-sequence: the decoder is fed "<start> the cake … lie" and trained to output "the cake … lie <stop>"; the encoder's outputs are ignored, only its final state is used.

Training is similar to language modelling (teacher forcing).

Kind Introduction to NLP: Seq2Seq Models

The encoder does not have to read text: plug a ResNet image encoder into the same decoder and you get image captioning.

Kind Introduction to NLP: Seq2Seq Models

Seq2Seq is an encoder/decoder architecture:
1) Encode the input tokens into a language representation
2) Decode the vector representation into generated tokens

The model is trained with cross-entropy (teacher forcing).

"You cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector." (Ray Mooney)
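A minimal sketch of such an encoder/decoder, under the same caveats as the previous snippets (arbitrary sizes and names; GRU chosen for brevity):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder RNN compresses the source into a state; decoder RNN unfolds it."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_prefix):
        _, state = self.encoder(self.src_embed(src))             # 1) encode to one vector
        h, _ = self.decoder(self.tgt_embed(tgt_prefix), state)   # 2) decode from it
        return self.classifier(h)                                # target-vocabulary logits

model = Seq2Seq(src_vocab=80_000, tgt_vocab=80_000)
src = torch.randint(0, 80_000, (1, 5))          # e.g. "le gâteau est un mensonge"
tgt_prefix = torch.randint(0, 80_000, (1, 5))   # e.g. "<start> the cake is a"
print(model(src, tgt_prefix).shape)             # torch.Size([1, 5, 80000])
```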


Kind Introduction to NLP: Translation

WMT dataset:

- 12M French/English sentence pairs
- 80K-word vocabulary
- Assessed with the BLEU score

BLEU is the geometric average of the overlapping n-gram precisions, for n = 1 to 4, computed against a set of target sentences.

Kind Introduction to NLP: Translation

Input: le gâteau est un mensonge
Predicted: the lie is the cake
Target: the cake is a lie

When several targets exist, an n-gram is counted at most as many times as it appears in any single target (clipped counts).

n-gram precisions for the example:
n=1: 4/5    n=2: 1/4    n=3: 0/3    n=4: 0/2
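A sketch of the clipped n-gram counts behind BLEU (only the precision part; the real metric also takes the geometric mean and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(predicted, targets, n):
    """Clipped n-gram precision: each predicted n-gram counts at most as many
    times as it appears in any single target sentence."""
    pred = Counter(zip(*[predicted[i:] for i in range(n)]))
    clip = Counter()
    for target in targets:
        for ng, c in Counter(zip(*[target[i:] for i in range(n)])).items():
            clip[ng] = max(clip[ng], c)
    matched = sum(min(c, clip[ng]) for ng, c in pred.items())
    return matched, sum(pred.values())

predicted = "the lie is the cake".split()
targets = ["the cake is a lie".split()]
for n in range(1, 5):
    print(n, ngram_precision(predicted, targets, n))  # (4,5) (1,4) (0,3) (0,2)
```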


Kind Introduction to NLP: BLEU

BLEU: orders of magnitude

SotA 2014 (Durrani 2014): 37.0
Sequence-to-sequence (Cho 2014): 34.5
Sequence-to-sequence (Wu 2016): 38.95
Transformer (Vaswani 2017): 41.8

BLEU is a lie!? Estimated human BLEU (Papineni 2002): 34.7

RL is the carrot on the cake


Policy Gradient for Language Generation: Supervised Limitations

Supervised learning is good… but:

● We enforce a mismatch between training and testing: teacher forcing vs exposure bias
● We optimize cross-entropy… but we care about BLEU!
● We optimize for sentence generation… but cross-entropy works at the word level (which can be criticized)

(figure: a seq2seq model decoding "the cake … lie <stop>" from the encoded "Le gâteau est un mensonge")

Policy Gradient for Language Generation: Language as an MDP

Idea: turn translation into an MDP, with BLEU as the reward.

• State: $s_t = (\boldsymbol{x}, y_1, \dots, y_t)$
• $\boldsymbol{x} = (x_1, \dots, x_T)$, where $x_t \in V_{in}$, is the input sentence
• $y_1, \dots, y_{t'}$, where $y_t \in V_{out}$, are the generated tokens
• Action: $a_{t+1} \sim V_{out}$, the next token
• Transition: $s_{t+1} = s_t \cup \{a_{t+1}\}$
• Reward: $r_{t+1} = \text{BLEU}$ if $a_{t+1} = \langle\text{stop}\rangle$, $0$ otherwise

(figure: the decoder generates $y_1 \dots y_{T'-1}$ <stop> from the encoded $\boldsymbol{x}$; reward: BLEU)
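A toy rendering of this MDP (my own sketch; the unigram-overlap reward is a stand-in for the real BLEU computation):

```python
def pseudo_bleu(prediction, target):
    """Stand-in reward: unigram overlap instead of the real BLEU."""
    return len(set(prediction) & set(target)) / max(len(prediction), 1)

class TranslationMDP:
    """State: source sentence x plus the generated tokens; action: next token."""
    def __init__(self, x, target):
        self.x, self.target = x, target
        self.y = []                                     # generated tokens so far

    def step(self, action):
        self.y.append(action)                           # s_{t+1} = s_t U {a_{t+1}}
        done = action == "<stop>"
        reward = pseudo_bleu(self.y[:-1], self.target) if done else 0.0
        return (self.x, tuple(self.y)), reward, done

env = TranslationMDP(x=("le", "gâteau"), target=("the", "cake"))
for token in ("the", "cake", "<stop>"):
    state, reward, done = env.step(token)
print(reward)  # 1.0: the reward only arrives at <stop>
```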

Policy Gradient for Language Generation: Policy Gradient

The goal is to optimize the score function

$$J(\boldsymbol{\theta}) = \int d^{\pi_{\boldsymbol{\theta}}}(x)\, V^{\pi_{\boldsymbol{\theta}}}(x)\, dx,$$

the expected BLEU under the translation policy, over all the potential language pairs.

The score gradient is estimated by

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi_{\boldsymbol{\theta}_h}} - b\big)$$

Policy gradient (Sutton 1999) improves the policy by following the score gradient:

$$\boldsymbol{\theta}_{h+1} = \boldsymbol{\theta}_h + \alpha_h \nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h}$$

where $\alpha$ is the learning rate, $h$ the training step, $b$ a baseline, $Q$ the state-action value function, $d$ the state probability distribution, and $V$ the value function.

Policy Gradient for Language Generation: Policy Gradient

As $V_{out}$ may be very big, RL is not straightforward! For example, WMT has 80k words; Atari has 18 actions!

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi_{\boldsymbol{\theta}_h}} - b\big)$$

• $Q^{\pi_{\boldsymbol{\theta}_h}}$ is hard to parametrize: overestimation, memory footprint (Bahdanau 2016)
• $\sum_{y=1}^{|V_{out}|}$ can be intractable; potential subsampling, etc. (Liu 2018)
• It is impossible to start from a random policy; a warm start is required (Ranzato 2015)
• The baseline $b$ should be parametrized

Policy Gradient for Language Generation: Policy Gradient

Monte-Carlo variant (REINFORCE-like):

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y_{t'} \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi} - b\big), \quad \text{where } Q^{\pi} = \sum_{\tau \ge t'} \gamma^{\tau} r_{\tau}$$

Intuitively, the full trajectory is equally rewarded: it is either all good or all bad!

$\gamma$ is often set to 1: there is no need to search for the shortest word trajectory.
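In code, the REINFORCE update for one sampled translation is nearly a one-liner; a minimal PyTorch sketch (gradient ascent on J is gradient descent on this loss):

```python
import torch

def reinforce_loss(log_probs, reward, baseline):
    """log_probs: (T',) log-probabilities of the sampled words;
    reward: BLEU of the finished sentence; baseline: variance reduction.
    With gamma = 1 and reward only at <stop>, every word gets the same return."""
    return -(reward - baseline) * log_probs.sum()

probs = torch.tensor([0.4, 0.7, 0.9], requires_grad=True)   # pi(y_t' | x, y_<t')
loss = reinforce_loss(torch.log(probs), reward=0.35, baseline=0.30)
loss.backward()   # pushes all sampled words up, since reward > baseline
```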

Policy Gradient for Language Generation: SL vs RL

(figure: two seq2seq decoders; the RL one feeds back its own sampled words $y_1, y_2, \dots$ and receives BLEU at <stop>, while the SL one is fed the ground-truth words $y_1^*, y_2^*, \dots$)

Reinforcement learning: $J(\boldsymbol{\theta}) = \int d^{\pi_{\boldsymbol{\theta}}} V^{\pi_{\boldsymbol{\theta}}} dx$
• High variance, requires a warm start
• Optimizes the true score
• Signal over trajectories

Supervised learning: $\sum_{t'=1}^{T'} \log p_{\theta}(y_{t'} \mid x, y_1, \dots, y_{t'-1})$
• Low variance, sample efficient
• Optimizes a surrogate
• Signal word after word

Policy Gradient for Language Generation: Results, Finally!

Does it work?

(figure: BLEU scores for translation / image captioning and ROUGE score for summarization with policy gradient (Ranzato 2015))

Policy Gradient for Language Generation: Damn!

Does it really work? Well…

(figure: the RED curve is a lie!)

"Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And." (Paulus 2017)

Reward hacking…

Policy Gradient for Language Generation: Tricks, the Heart of Deep RL ;)

A few tricks to alleviate RL issues:

• Parametrize and train your baseline correctly
• Increase the batch size!
• Perform an extensive sweep over the RL hyperparameters
• Add a softmax temperature, and check that your SL baseline is not overconfident (see the sketch below)
• Slowly transition from SL to RL
• Check the qualitative results!

Recommended slides: http://www.phontron.com/slides/neubig19structured.pdf
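For the softmax-temperature trick, a minimal sketch (names are illustrative, not from the talk):

```python
import torch

def sample_with_temperature(logits, tau=1.0):
    """tau > 1 flattens the distribution (more exploration, counteracts an
    overconfident supervised policy); tau < 1 sharpens it."""
    return torch.distributions.Categorical(logits=logits / tau).sample()

logits = torch.tensor([4.0, 1.0, 0.5])
print(sample_with_temperature(logits, tau=2.0))  # flatter, more exploratory draw
```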


Dialogue System with RL


Dialogue Systems: Classic Pipeline

"Open the pod bay doors, HAL"
"I'm sorry, Dave. I'm afraid I can't do that."

A POMDP setting:
• Natural Language Understanding → dialogue utterance $u_t$: {request: "open door", from: "Dave"}
• Dialogue State Tracker → new state $s_t$: {"Dave status": threat}
• Policy Learning → action $a_t$: {inform: "Request denied", action: none}
• Natural Language Generator

Later on… Dave disconnects HAL…

Dialogue Systems

The POMDP setting:

• Observation: {request: "open door", from: "Dave"}
• State: {"Dave status": threat}
• Policy: the dialogue manager
• Action: {inform: "Request denied", action: none}
• Reward: later on… Dave disconnects HAL…

Given a good (hand-crafted) NLU / NLG… good news… it works! (Young 2013) (DSTC)


Dialogue Systems: Classic Pipeline

"Open the pod bay doors, HAL"
"I'm sorry, Dave. I'm afraid I can't do that."

End-to-end model: replace the whole pipeline with a single model.

Dialogue Systems: Classic Pipeline

Idea: turn dialogue into a (natural) translation system, mapping the user's utterance to the system's reply with an encoder/decoder RNN! (Vinyals and Le 2015) (Wen et al. 2015)

Dialogue Systems: Taxonomy

Chatbot:
• Open discussion!
• Numerous datasets (Lowe 2015)
• Numerous models (Gao 2019)
• No reward signal…

Goal-oriented dialogue:
• Dialogue to solve a task: book a plane ticket, find a restaurant, etc.
• No large-scale goal-oriented dataset with natural language (~10k dialogues)
• Clear reward signal!

Wait!! What is that?

Visually grounded, goal-oriented, natural dialogues

GuessWhat?!: Symbol Grounding Problem

The Symbol Grounding Problem! (Harnad 1990)

GuessWhat?!: Symbol Grounding Problem

How do we ground symbols?

ケーキ: 小麦粉、脂肪、卵、砂糖、およびその他の成分の混合物から作られた、焼きたての、時にはアイスまたは装飾された柔らかい甘い食べ物。

(Japanese for: cake: "a soft, sweet, baked food, sometimes iced or decorated, made from a mixture of flour, fat, eggs, sugar, and other ingredients". Unless you already read Japanese, the definition alone does not ground the symbol.)

GuessWhat?!

Visually grounded dialogues with self-play.

Game features:
● Dialogue
● Visually grounded
● Collaborative
● Goal-oriented, with a clear reward

The GuessWhat?! game consists in locating a hidden object in a natural scene by asking a sequence of questions.

Come play… it will be fun.

GuessWhat?!: Let's Play

(image: a natural scene containing several vases)

Questioner: Is it a vase?
Oracle: Yes
Questioner: Is it in the front row?
Oracle: No
Questioner: Does it have some red on it?
Oracle: No
Questioner: Is it the second vase from the right?
Oracle: Yes
Questioner: I found it!
Oracle: Correct!

GuessWhat?!

Explore the dataset: https://guesswhat.ai/explore

GuessWhat?!: Dataset Statistics – Language Metrics

(figures: on average, 5+ questions per dialogue and ~5 words per question)

GuessWhat?!: Dataset Statistics – Success Ratio

The more objects there are, the lower the success ratio; the bigger the object, the higher the success ratio.

GuessWhat?!: Optimal Policy

GuessWhat?! Dataset: Potential Language Policies

- Word taxonomy:
  ◦ Is it a vehicle? A car? A motorbike?
- Spatial reasoning:
  ◦ Is it on the left? In the background?
  ◦ Is it on the right of the blue car? Is it between the two zebras?
  ◦ Is it on the table?
- Counting:
  ◦ Is it the second man from the left?
- Others:
  ◦ Can it fly? Do you eat with it?
  ◦ Is it big? Is it square?

GuessWhat?!: Cloud of Words

(figure: word cloud of the questions asked in the dataset)

GuessWhat?!: Game Loop

Questioner evaluation procedure: repeat until <stop_dialogue>: the Questioner asks a question and the Oracle returns a yes/no answer. The Guesser then reads the whole dialogue and tries to find the object.
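A sketch of this loop (the agents' interfaces are hypothetical, not the paper's API):

```python
def play_game(image, target_object, questioner, oracle, guesser, max_questions=8):
    """One GuessWhat?! episode: questions until <stop_dialogue>, then one guess."""
    dialogue = []
    for _ in range(max_questions):
        question = questioner.ask(image, dialogue)
        if question == "<stop_dialogue>":
            break
        answer = oracle.answer(image, target_object, question)  # "yes" / "no"
        dialogue.append((question, answer))
    guess = guesser.guess(image, dialogue)
    return guess == target_object        # success: the future reward signal
```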

GuessWhat?!: Game Notation

GuessWhat?!: Models

Oracle: 79.5% accuracy

GuessWhat?!: Models

Guesser: 63.8% accuracy

GuessWhat?!: Models

Question generator (QGen), trained by minimizing cross-entropy: a policy warm-up.

GuessWhat?!: Quantitative Results (QGen)

Accuracy: the higher, the better!

- New objects: images from the training set, with a new randomly picked object
- New images: images from the test set (never seen at training time)

GuessWhat?!: Qualitative Results

● Lack of generalization
● Poor grounding
● Language-imitation pitfall

GuessWhat?!: Limitations of Supervised Learning

Observations:

● The space of dialogue actions/states is too large to generalize
● Supervised learning misses the planning aspect
● Supervised learning does not care about solving the task: wrong metric!
● (Side issue) Grounding seems imperfect…

GuessWhat?!: What If…

Is it the best question? What happens if the answer is no? Can we phrase it differently?

GuessWhat?!: Game Loop (with RL)

Same loop: repeat until <stop_dialogue>: the Questioner asks a question, the Oracle answers yes/no, and the Guesser then tries to find the object. But now finding the object yields a reward, which is used to update the agent.

GuessWhat?!: RL Notation

GuessWhat?!: Policy Gradient!

The policy gradient is applied to the Questioner: it sums over each question and, within a question, over each word, with the policy conditioned on the image.

GuessWhat?!: RL Algorithm

Initialization (supervised warm-up), then repeat: generate a game, generate a dialogue, find the object, and update the model.
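Putting the pieces together, a self-play training loop might look like this (hypothetical interfaces; the update follows the REINFORCE gradient shown earlier):

```python
def train_qgen(qgen, oracle, guesser, games, epochs=10):
    """Self-play policy-gradient training of a supervisedly warm-started QGen."""
    baseline = 0.0                                    # running-mean reward baseline
    for _ in range(epochs):
        for image, target_object in games:            # generate a game
            dialogue, log_probs = qgen.sample_dialogue(image, oracle)  # generate a dialogue
            reward = float(guesser.guess(image, dialogue) == target_object)  # find the object
            qgen.reinforce_update(log_probs, reward - baseline)        # update the model
            baseline = 0.99 * baseline + 0.01 * reward
```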

GuessWhat?!: Quantitative Results (QGen, after RL training)

Accuracy: the higher, the better!

- New objects: images from the training set, with a new randomly picked object
- New images: images from the test set (never seen at training time)


GuessWhat?!: What Happened?

Good:
• Optimizes the metric
• The language strategy is consistent

Bad:
• Optimizes the metric
• The language strategy is poor and limited
• The language quality is bad

GuessWhat?!: Vocabulary Evolution

Human: ~3000 words → beam search: ~500 words → RL: >100 words

Vocabulary drop:
- The model quickly reduces its action space
- Supervised models are overly confident (p > 50%)

Are current RL algorithms really OK for NLG? Better warm-up?

GuessWhat?!: Language Drift

Language drift… How do we enforce language quality while optimizing for the goal? Reward shaping? Hierarchical RL? (Lewis 2017)


Questions? The cherry on the cake is a lie!
