Interleaving Language and RL: Language Generation
Florian Strub
Reinforcement Learning Summer School
7 July 2019
Interleaving Language and RL — Florian Strub
Overview
● Kind introduction to NLP (~15min)
● Policy gradient for Translation (~15min)
● Goal-oriented dialogue systems (~20min)
○ Dialogue setting
○ GuessWhat?!
○ Self-play for language generation
● Other linguistic grounded tasks (~talk to me!)
○ Language as goal representation: Instruction Following
○ Language as state representation: Text Games
○ Language as policy compositionality: Emergence of Language
Kind Introduction to NLP
What is NLP?
Natural Language Processing (NLP) aims to extract representations of textual information to read and make sense of human languages in a valuable manner.
Kind Introduction to NLP
Language is Hard :)
Georgetown–IBM experiment, 1954: translated sixty Russian sentences into English.
ALPAC report, 1966: "there is no immediate or predictable prospect of useful machine translation"
Kind Introduction to NLP
Small Historical Note
Rationalism: language possesses an underlying structure that must be discovered; it conjectures that human language is innate and can be modelled like physical laws.
Chomsky, Noam. Syntactic Structures. 1957. (Transformational generative grammars)
Empiricism: language is a cognitive process that can be learned through experimentation; it advocates exploring learning mechanisms rather than linguistic models.
J. R. Firth. A synopsis of linguistic theory 1930–55. Studies in Linguistic Analysis: "You shall know a word by the company it keeps"
"The validity of statistical (information theoretic) approach to MT has indeed been recognized … as early as 1949. And was universally recognized as mistaken [sic] by 1950. … The crude force of computers is not science." (Review of Brown et al. (1990))
Kind Introduction to NLP
Language Modelling
Goal of the statistical approach: how likely is a sentence?
- "The cherry on the cake"
- "The cake on the cherry"
but the cake is a lie...
Kind Introduction to NLP
Language Modelling
A sentence is represented as a sequence of words:
http://videolectures.net/deeplearning2016_cho_language_understanding/
We compute the probability of the word sequence:
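The factorization behind this is the chain rule over words; in the slide's notation it can be written as:

```latex
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
```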
Kind Introduction to NLP
Language Modelling
How to estimate the conditional probabilities?
N-gram Modelling:
We can simply count words!
Do not trust the cake!
Several problems remain (memory footprint, poor generalization, etc.)
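As a concrete illustration of count-based estimation, here is a minimal bigram model in plain Python (the two-sentence corpus is made up for the example):

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Count-based bigram model: p(cur | prev) = count(prev, cur) / count(prev)."""
    pair_counts = defaultdict(int)
    prev_counts = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1

    def prob(prev, cur):
        return pair_counts[(prev, cur)] / prev_counts[prev] if prev_counts[prev] else 0.0

    return prob

p = train_bigram_lm(["the cherry on the cake", "the cake is a lie"])
print(p("the", "cake"))    # "the" occurs 3 times as a context, followed by "cake" twice -> 2/3
print(p("cake", "cherry")) # never observed -> 0.0
```

The same counting scheme extends to higher-order n-grams, which is exactly where the memory-footprint and generalization problems come from: most n-grams are simply never observed.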
Kind Introduction to NLP
Neural Language Modelling
How to estimate the conditional probabilities? Learn a representation of the word history: use neural networks!
Input: "The cherry on the"; predicted next words: "cherry on the cake"
Kind Introduction to NLP
Neural Language Modelling
- Indexed vocabulary (hash table): V = { "Is": 1, …, "cherry": 23, … }, mapped to word embeddings!
- RNN: basic RNN, LSTM, GRU, etc.
- Classifier: MLP + softmax
- Sampling procedure: greedy, stochastic, beam search, etc.
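The sampling procedures listed above can be sketched in a few lines. This toy snippet (vocabulary and logits are made up) shows greedy decoding and temperature sampling over a softmax output:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn classifier logits into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def greedy(vocab, logits):
    """Greedy decoding: always pick the most likely word."""
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

def sample(vocab, logits, temperature=1.0, rng=random):
    """Stochastic decoding: draw a word from the (tempered) softmax distribution."""
    probs = softmax(logits, temperature)
    r, acc = rng.random(), 0.0
    for word, p in zip(vocab, probs):
        acc += p
        if r < acc:
            return word
    return vocab[-1]

vocab = ["cherry", "cake", "lie"]
logits = [0.2, 2.0, 0.5]
print(greedy(vocab, logits))  # "cake"
print(sample(vocab, logits, temperature=0.7))
```

Beam search instead keeps the k best partial sentences at each step; it is omitted here for brevity.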
Kind Introduction to NLP
Training Procedure
Given a corpus, the goal is to maximize the joint probability of its sentences: the key NLP equation.
Cake and grief counseling will be available at the conclusion of the test.
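The "key NLP equation" alluded to here is maximum likelihood over the corpus, which in the notation above reads:

```latex
\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_{\theta}\!\left(w_t^{(n)} \mid w_1^{(n)}, \dots, w_{t-1}^{(n)}\right)
```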
Kind Introduction to NLP
Training Procedure
Training time (teacher forcing): inputs are the ground truth, "The cherry on the", even when the model predicts "cherry over the table".
Test time (exposure bias): inputs are the model's own predictions, "The cherry over the", so it outputs "cherry over the table" and errors compound.
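A tiny runnable illustration of the mismatch; the next-word table is a made-up stand-in for a trained model, with one deliberate error after "cake":

```python
def decode(model, start, steps, ground_truth=None):
    """With ground_truth: teacher forcing (inputs are the true words).
    Without it: free-running decoding (inputs are the model's own predictions)."""
    inp, out = start, []
    for t in range(steps):
        pred = model(inp)
        out.append(pred)
        inp = ground_truth[t] if ground_truth is not None else pred
    return out

# Hypothetical "trained" next-word table with one learned error: after "cake"
# the model says "was" instead of "is".
table = {"<s>": "the", "the": "cake", "cake": "was",
         "is": "a", "a": "lie", "was": "not", "not": "funny"}
model = lambda word: table.get(word, "</s>")

truth = ["the", "cake", "is", "a", "lie"]
print(decode(model, "<s>", 5, ground_truth=truth))  # ['the', 'cake', 'was', 'a', 'lie']
print(decode(model, "<s>", 5))                      # ['the', 'cake', 'was', 'not', 'funny']
```

Under teacher forcing the single error stays isolated; in free-running mode the wrong word is fed back as input and the rest of the sentence derails, which is exactly the exposure bias described above.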
Kind Introduction to NLP
It is only the beginning!
We can generate language :)
But wait… it is pretty useless! (so is the cake, what a lame running joke...)
Can we use it as a translation system? Well...
"The cherry on the cake" / "La cerise sur le gateau" ("the cherry on the cake"): fine.
"He looked for the cake" / "Il est allé chercher .. le gateau" ("he went to fetch .. the cake"): bad word alignment!
Kind Introduction to NLP
Seq2Seq Models
Idea: decompose encoding and decoding!
- Encoder: an RNN reads the input sentence ("le gateau est un mensonge") and compresses it into a state vector (its outputs are ignored).
- Decoder: another RNN generates the output from that state: inputs "<start> the cake … lie", outputs "the cake … lie <stop>".
Similar training as language modelling (teacher forcing): this is the sequence-to-sequence setup.
Swap the RNN encoder for an image encoder (e.g. a ResNet) and the same decoder performs image captioning.
Kind Introduction to NLP
Seq2Seq Models
Seq2Seq is an Encoder/Decoder architecture:
1) Encode the language representation
2) Decode the vector representation into generated tokens
The model is trained with cross-entropy (teacher forcing).
"You cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector."
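In equations, the encoder/decoder decomposition amounts to conditioning the decoder's language model on a single encoder state:

```latex
p_{\theta}(y \mid x) = \prod_{t=1}^{T'} p_{\theta}\!\left(y_t \mid y_1, \dots, y_{t-1}, c\right),
\qquad c = \mathrm{enc}_{\theta}(x)
```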
Kind Introduction to NLP
Translation
WMT dataset:
- 12M French/English sentence pairs
- Vocabulary: 80K words
- Assessed with the BLEU score
BLEU is the geometric average of overlapping n-grams, from n=1 to 4, against a set of target sentences.
Kind Introduction to NLP
Translation
Input: "le gateau est un mensonge"; Predicted: "the lie is the cake"; Target: "the cake is a lie"
Overlapping n-gram counts: n=1: 4/5, n=2: 1/4, n=3: 0/3, n=4: 0/2.
When several targets exist, an n-gram can be counted as many times as it appears in any one of the targets.
BLEU is the geometric average of overlapping n-grams, from n=1 to 4, against a set of target sentences.
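The per-n counts above can be reproduced with a few lines of Python implementing BLEU's clipped n-gram counting (single reference, no brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision used by BLEU: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matched, sum(cand_ngrams.values())

for n in range(1, 5):
    print(n, ngram_precision("the lie is the cake", "the cake is a lie", n))
# 1 (4, 5)   2 (1, 4)   3 (0, 3)   4 (0, 2)
```

BLEU is then the geometric mean of the four ratios (times a brevity penalty), which is why a single zero, here at n=3, is usually handled with smoothing in practice.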
Kind Introduction to NLP
BLEU
BLEU: order of magnitude
- SOTA 2014 (Durrani 2014): 37.0
- Sequence-to-sequence (K. Cho 2014): 34.5
- Sequence-to-sequence (Wu 2016): 38.95
- Transformer (Vaswani 2017): 41.8
- Estimated human BLEU (Papineni 2002): 34.7
BLEU is a lie!?
RL is the carrot on the cake
Policy Gradient for Language Generation
Supervised Limitations
Supervised learning is good… but:
● We enforce a mismatch between training and testing: teacher forcing vs exposure bias
● We optimize cross-entropy… but we care about BLEU!
● We optimize for sentence generation… but CE is at the word level (which can be criticized)
[Diagram: encoder over "Le gateau est un mensonge", decoder generating "the cake … lie <stop>"]
Policy Gradient for Language Generation
Language as an MDP
Idea: turn language translation into an MDP, with BLEU as the reward.
• State: s_t = (x, y_1, ⋯, y_t)
• Input sentence: x = (x_1, ⋯, x_T), where x_t ∈ V_in
• Generated tokens: y_1, ⋯, y_t', where y_t ∈ V_out
• Action: a_{t+1} ∈ V_out (the next token)
• Transition: s_{t+1} = s_t ∪ {a_{t+1}}
• Reward: r_{t+1} = BLEU if a_{t+1} = <stop>, 0 otherwise
[Diagram: encoder over x, decoder generating y_1 ⋯ y_{T'-1} <stop>; reward: BLEU]
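A minimal sketch of this MDP in Python; the word table and the BLEU stand-in are made up for illustration, and a real system would call a proper BLEU implementation:

```python
def step(state, action, score_fn):
    """One MDP transition: append the chosen token; reward arrives only at <stop>."""
    src, generated = state
    generated = generated + [action]
    if action == "<stop>":
        return (src, generated), score_fn(src, generated[:-1]), True  # terminal
    return (src, generated), 0.0, False                               # r_t = 0 otherwise

def toy_bleu(src, hypothesis):
    """Stand-in reward: fraction of hypothesis words found in a toy reference."""
    lexicon = {"le": "the", "gateau": "cake"}  # hypothetical word table
    reference = [lexicon[w] for w in src]
    return sum(1 for w in hypothesis if w in reference) / max(len(hypothesis), 1)

state, done = (["le", "gateau"], []), False
for action in ["the", "cake", "<stop>"]:
    state, reward, done = step(state, action, toy_bleu)
print(reward, done)  # 1.0 True: the reward arrives only on the terminal action
```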
Policy Gradient for Language Generation
Policy Gradient
The goal is to optimize the score function
$J_\theta = \int d^{\pi_\theta} V^{\pi_\theta} \, dx$,
the expected BLEU under the translation policy, over all potential language pairs.
The score gradient is estimated by
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, \big(Q^{\pi_{\theta_h}} - b\big)$
Policy gradient (Sutton 1999) improves the policy by following the score gradient:
$\theta_{h+1} = \theta_h + \alpha_h \nabla J_{\theta=\theta_h}$
where $\alpha$ is the learning rate, $h$ the training step, $b$ a baseline, $Q$ the state-action value function, $d$ the state probability distribution, and $V$ the value function.
Policy Gradient for Language Generation
Policy Gradient
As $V_{out}$ may be very big, RL is not straightforward! For example, WMT has 80k words; Atari has 18 actions!
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, \big(Q^{\pi_{\theta_h}} - b\big)$
• $Q^{\pi_{\theta_h}}$ is hard to parametrize: overestimation, memory footprint (Bahdanau 2016)
• The sum over $V_{out}$ can be intractable: potential subsampling, etc. (Liu 2018)
• Impossible to start from a random policy: requires a warm start (Ranzato 2015)
• The baseline $b$ should be parametrized
Policy Gradient for Language Generation
Policy Gradient
Monte-Carlo variant (REINFORCE-like):
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, (Q^{\pi} - b)$, where $Q^{\pi} = \sum_{\tau \ge t'} \gamma^{\tau} r_\tau$
Intuitively, the full trajectory is equally rewarded: it is either all good or all bad!
$\gamma$ is often set to 1: no need to search for the shortest word trajectory.
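The Monte-Carlo update can be exercised end-to-end on a toy problem: a two-step "sentence" over a three-word vocabulary, a single sequence-level reward, and a running-mean baseline. Everything here (vocabulary, target, learning rates) is made up for illustration:

```python
import math
import random

random.seed(0)
vocab = ["the", "cake", "lie"]
target = ["the", "cake"]  # made-up target "sentence"

# One logit vector per time step: a stand-in for the RNN policy's outputs.
logits = [[0.0] * len(vocab) for _ in range(len(target))]
baseline, alpha = 0.0, 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

for _ in range(300):
    # Sample a full trajectory; one sequence-level reward (a BLEU stand-in).
    actions = [sample(softmax(l)) for l in logits]
    reward = 1.0 if [vocab[a] for a in actions] == target else 0.0
    advantage = reward - baseline            # (Q^pi - b), with gamma = 1
    baseline += 0.05 * (reward - baseline)   # running-mean baseline
    for t, a in enumerate(actions):
        probs = softmax(logits[t])
        for k in range(len(vocab)):
            # d/dlogit_k of log pi(a) = 1[k == a] - pi_k  (softmax log-gradient)
            logits[t][k] += alpha * ((1.0 if k == a else 0.0) - probs[k]) * advantage

print([vocab[max(range(len(vocab)), key=l.__getitem__)] for l in logits])
```

Because the whole trajectory shares one return, every word of a successful sample is pushed up equally; the greedy decode typically converges to the target within a few hundred updates.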
Policy Gradient for Language Generation
SL vs RL
Supervised learning: maximize $\sum_{t'=1}^{T'} \log p_\theta(y_{t'} \mid x, y_1, \dots, y_{t'-1})$, feeding the ground-truth tokens $y^*$.
- Low variance, sample efficient
- Optimizes a surrogate
- Signal word after word
Reinforcement learning: maximize $J_\theta = \int d^{\pi_\theta} V^{\pi_\theta} \, dx$, feeding the sampled tokens $y$ and rewarding with BLEU.
- High variance, requires a warm start
- Optimizes the true score
- Signal over whole trajectories
Policy Gradient for Language Generation
Results, finally!
Does it work? Policy gradient improves BLEU for translation and image captioning, and ROUGE for summarization (Ranzato 2015).
Policy Gradient for Language Generation
Damn!
Does it really work? Well… RED is a lie!
Reward hacking… an RL-trained summary (Paulus 2017):
"Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And."
Policy Gradient for Language Generation
Tricks, the heart of Deep RL ;)
A few tricks to alleviate RL issues:
• Parametrize and train your baseline correctly
• Increase the batch size!
• Perform an extensive RL hyperparameter sweep
• Add a softmax temperature, and check your SL baseline (often overconfident)
• Slowly transition from SL to RL
• Check qualitative results!
Recommended slides: http://www.phontron.com/slides/neubig19structured.pdf
Dialogue System with RL
Dialogue Systems
Classic Pipeline
"Open the pod bay doors, HAL" / "I'm sorry, Dave. I'm afraid I can't do that."
• Natural Language Understanding → dialogue utterance u_t: {request: "open door", from: "Dave"}
• Dialogue State Tracker → new state s_t: {"Dave status": threat}
• Policy Learning → action a_t: {inform: "Request Denied", action: "None"}
• Natural Language Generator → the system's reply
POMDP setting. Reward: later on… Dave disconnects HAL…
Dialogue Systems
POMDP setting:
• Observation: {request: "open door", from: "Dave"}
• State: {"Dave status": threat}
• Policy: the dialogue manager
• Action: {inform: "Request Denied", action: "None"}
• Reward: later on… Dave disconnects HAL…
Given a good (hand-crafted) NLU / NLG… good news: it works! (Young 2013) (DSTC)
Dialogue Systems
End-2-End Model
"Open the pod bay doors, HAL" / "I'm sorry, Dave. I'm afraid I can't do that."
Idea: replace the whole pipeline with a single model and turn dialogue into a (natural) translation system: Encoder (RNN) → state → Decoder (RNN). (Vinyals and V. Le 2015) (Wen et al. 2015)
Dialogue Systems
Taxonomy
Chatbot:
- Open discussion!
- Numerous datasets (Lowe 2015) and models (Gao 2019)
- No reward signal…
Goal-oriented dialogue:
- Dialogue to solve a task: book a plane ticket, find a restaurant, etc.
- No large-scale goal-oriented dataset with natural language (~10k dialogues)
- Clear reward signal!
Wait!! What is that? Visually grounded goal-oriented natural dialogues.
GuessWhat?!
Symbol Grounding Problem (Harnad 1990)
How to ground a symbol?
ケーキ ("cake")
小麦粉、脂肪、卵、砂糖、およびその他の成分の混合物から作られた、焼きたての、時にはアイスまたは装飾された柔らかい甘い食べ物。("A soft, sweet food made from a mixture of flour, fat, eggs, sugar, and other ingredients, baked and sometimes iced or decorated.")
Defining a symbol with other ungrounded symbols does not ground it.
GuessWhat?!
Visually grounded dialogues with self-play.
The game consists in locating a hidden object in a natural scene by asking a sequence of questions.
Game features:
● Dialogue
● Visually grounded
● Collaborative
● Goal-oriented with a clear reward
Come play… it will be fun.
GuessWhat?!
Let's Play (Questioner / Oracle)
- Is it a vase? Yes
- Is it in the front row? No
- Does it have some red on it? No
- Is it the second vase from the right? Yes
- I found it! Correct!
Explore the dataset: https://guesswhat.ai/explore
GuessWhat?!
Dataset Statistics
- Language metrics: on average, 5+ questions per dialogue and ~5 words per question
- Success ratio: the more objects there are, the lower the success ratio; the bigger the object, the higher the success ratio
GuessWhat?!
Optimal Policy
GuessWhat?! Dataset
Potential Language Policy
- Word taxonomy:
◦ Is it a vehicle? A car? A motorbike?
- Spatial reasoning:
◦ Is it on the left? In the background?
◦ Is it on the right of the blue car? Is it between the two zebras?
◦ Is it on the table?
- Counting:
◦ Is it the second man from the left?
- Others:
◦ Can it fly? Do you eat with it?
◦ Is it big? Is it square?
GuessWhat?!
Word Cloud
GuessWhat?!
Game Loop
Questioner and Oracle repeat until <stop_dialogue>: the questioner asks a question, the oracle gives a yes/no answer. The dialogue is then passed to the Guesser: did it find the object?
GuessWhat?!
Game Notation
GuessWhat?!
Models
- Oracle: 79.5% accuracy
- Guesser: 63.8% accuracy
GuessWhat?!
Models
- QGen: trained by minimizing cross-entropy (policy warm-up)
GuessWhat?!
Quantitative Results (QGen)
Accuracy: the higher, the better!
- New objects: images from the training set + pick a random object
- New images: images from the testing set (never seen at training time)
GuessWhat?!
Qualitative Results
- Lack of generalization
- Poor grounding
- Language imitation pitfall
GuessWhat?!
Limitations of Supervised Learning
Observations:
● The dialogue action/state space is too large to generalize
● Supervised learning misses the planning aspect
● Supervised learning does not care about solving the task: wrong metric!
● (Side issue) Grounding seems imperfect…
GuessWhat?!
What if…
- Is it the best question?
- What happens if the answer is no?
- Can we phrase it differently?
GuessWhat?!
Game Loop with RL
Questioner and Oracle repeat until <stop_dialogue> (question / yes-no answer). The dialogue goes to the Guesser: did it find the object? The outcome is the reward, used to update the agent.
GuessWhat?!
RL Notation
GuessWhat?!
Policy Gradient!
Policy gradient is applied for each question and for each word, conditioned on the image.
GuessWhat?!
RL Algorithm
1. Initialization (supervised policy warm-up)
2. Generate a game
3. Generate the dialogue
4. Find the object
5. Update the model
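The loop above can be sketched with stub agents; all three policies here are hypothetical hand-written stand-ins for the trained networks, and the point is only the structure of self-play:

```python
import random

def oracle(target, question):
    """Oracle stub: truthful yes/no answer about the hidden object."""
    prop, value = question
    return "yes" if target[prop] == value else "no"

def play_game(objects, target, rng, max_questions=5):
    """One self-play episode: questions shrink the candidate set, then guess."""
    candidates = list(objects)
    dialogue = []
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        # Questioner stub: ask about a property/value of some remaining candidate.
        prop, value = rng.choice([(p, o[p]) for o in candidates for p in o])
        answer = oracle(target, (prop, value))
        dialogue.append((prop, value, answer))
        wanted = (answer == "yes")
        candidates = [o for o in candidates if (o[prop] == value) == wanted]
    guess = rng.choice(candidates)  # Guesser stub: pick among consistent objects
    return dialogue, 1.0 if guess is target else 0.0  # reward for the RL update

objects = [{"kind": "vase", "row": "front"}, {"kind": "vase", "row": "back"},
           {"kind": "cake", "row": "front"}]
rng = random.Random(0)
wins = sum(play_game(objects, rng.choice(objects), rng)[1] for _ in range(100))
print(wins / 100)  # the success rate that policy gradient would maximize
```

Because the oracle is truthful and filtering keeps only consistent objects, the target always remains among the candidates; in the real game, it is the questioner's and guesser's networks that are updated from this terminal reward.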
GuessWhat?!
Quantitative Results (QGen with RL)
Accuracy: the higher, the better!
- New objects: images from the training set + pick a random object
- New images: images from the testing set (never seen at training time)
GuessWhat?!
What happened?
Good:
- Optimizes the metric
- The language strategy is consistent
Bad:
- Optimizes the metric
- The language strategy is poor/limited
- The language quality is bad
GuessWhat?!
Vocabulary Evolution
- Human: ~3000 words
- Beam search: ~500 words
- RL: ~100 words
Vocabulary drop:
- The model quickly reduces its action space
- Supervised models are overly confident (p > 50%)
Are current RL algorithms really OK for NLG? Better warm-up?
GuessWhat?!
Language Drift
Language drift… (Lewis 2017)
How to enforce language quality while optimizing for the goal? Reward shaping? HRL?
Overview
● Kind introduction to NLP (~15min)
● Policy gradient for Translation (~15min)
● Goal-oriented dialogue systems (~20min)
○ Dialogue setting
○ GuessWhat?!
○ Self-play for language generation
● Other linguistic grounded tasks (~talk to me!)
○ Language as goal representation: Instruction Following
○ Language as state representation: Text Games
○ Language as policy compositionality: Emergence of Language
Questions? The cherry on the cake is a lie!