
Interleaving Language and RL: Language Generation

Florian Strub
Reinforcement Learning Summer School

7 July 2019

Overview

● Kind introduction to NLP
● Policy gradient for translation
● Goal-oriented dialogue systems
  ○ Dialogue setting
  ○ GuessWhat?!
  ○ Self-play for language generation
● Other linguistically grounded tasks:
  ○ Language as goal representation: instruction following
  ○ Language as state representation: text games
  ○ Language as policy compositionality: emergence of language


Kind Introduction to NLP: What is NLP?

Natural Language Processing (NLP) aims to extract representations of textual information so as to read and make sense of human language in a useful manner.


Kind Introduction to NLP: Language is Hard :)

Georgetown–IBM experiment, 1954: translating sixty Russian sentences into English.

ALPAC report, 1966: "there is no immediate or predictable prospect of useful machine translation"

Kind Introduction to NLP: Small Historical Note

Rationalism: language possesses an underlying structure that must be discovered, conjecturing that human language is innate and can be modelled like physical laws.
Chomsky, Noam. Syntactic Structures. 1957. (transformational generative grammars)

Empiricism: language is a cognitive process that can be learned through experimentation, advocating the exploration of learning mechanisms rather than linguistic models.
J. R. Firth. A synopsis of linguistic theory 1930-55. Studies in Linguistic Analysis. ("You shall know a word by the company it keeps")

"The validity of statistical (information theoretic) approach to MT has indeed been recognized … as early as 1949. And was universally recognized as mistaken [sic] by 1950. … The crude force of computers is not science." (review of Brown et al. 1990)

Kind Introduction to NLP: Language Modelling

Goal of the statistical approach: how likely is a sentence?

- "The cherry on the cake"
- "The cake on the cherry"

but the cake is a lie...

Kind Introduction to NLP: Language Modelling

A sentence is represented as a sequence of words, and we compute the probability of the word sequence with the chain rule:

$$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})$$

http://videolectures.net/deeplearning2016_cho_language_understanding/

Kind Introduction to NLP: Language Modelling

How do we estimate the conditional probabilities?

N-gram modelling: we can simply count words!

$$p(w_t \mid w_{t-n+1}, \dots, w_{t-1}) = \frac{\operatorname{count}(w_{t-n+1}, \dots, w_t)}{\operatorname{count}(w_{t-n+1}, \dots, w_{t-1})}$$

Do not trust the cake!

Several problems (memory footprint, no generalization, etc.)
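To make the counting idea concrete, here is a minimal bigram sketch (not from the slides; the toy corpus and function names are made up for illustration):

```python
from collections import Counter

# Toy corpus; any tokenized text would do.
corpus = "the cherry on the cake . the cake is a lie .".split()

# Count bigrams and their left contexts.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def cond_prob(word, prev):
    """Maximum-likelihood bigram estimate of p(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0

print(cond_prob("cake", "the"))     # 2/3: "the" is followed by "cake" twice out of three
print(cond_prob("cherry", "cake"))  # 0.0: unseen bigram, the generalization problem
```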


Kind Introduction to NLP: Neural Language Modelling

How do we estimate the conditional probabilities? Learn a representation of the word history: use neural networks!

(figure: an RNN reads "The cherry on the" and predicts the shifted sequence "cherry on the cake")

Kind Introduction to NLP: Neural Language Modelling

- Indexed vocabulary (hash table): V = { "Is": 1, …, "cherry": 23, … }
- Word embedding!
- RNN: basic RNN, LSTM, GRU, etc.
- Classifier: MLP + softmax over the vocabulary
- Sampling procedure: greedy, stochastic, beam search, etc.
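As an illustration of this pipeline, a minimal PyTorch sketch (my own, not the talk's model; all sizes and names are arbitrary):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Indexed tokens -> word embedding -> LSTM -> MLP/softmax classifier."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):               # tokens: (batch, seq_len) word indices
        h, _ = self.rnn(self.embed(tokens))  # h: (batch, seq_len, hidden_dim)
        return self.classifier(h)            # next-word logits at every position

# Greedy sampling: repeatedly feed back the argmax of the softmax.
model = RNNLanguageModel(vocab_size=100)
tokens = torch.tensor([[1]])                 # start from some <start> index
for _ in range(5):
    next_token = model(tokens)[:, -1].argmax(dim=-1, keepdim=True)
    tokens = torch.cat([tokens, next_token], dim=1)
```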


Kind Introduction to NLP: Training Procedure

Given a corpus of sentences, the goal is to maximize the joint probability; this is the key NLP equation:

$$\max_{\theta} \sum_{\boldsymbol{w} \in \text{corpus}} \sum_{t=1}^{T} \log p_{\theta}(w_t \mid w_1, \dots, w_{t-1})$$

"Cake and grief counseling will be available at the conclusion of the test."

Kind Introduction to NLP: Training Procedure

Training time (teacher forcing): the model is fed the ground-truth prefix, e.g. "The cherry on the", and trained to predict the shifted ground truth "cherry on the cake", even when its own prediction would have been "cherry over the table".

Test time (exposure bias): the model is fed its own predictions, so one early mistake compounds: "The cherry over the" leads to "cherry over the table".
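Continuing the hypothetical RNNLanguageModel sketched above, teacher forcing boils down to shifting the ground-truth sentence by one position:

```python
import torch
import torch.nn.functional as F

# Toy indices for "<start> the cherry on the cake"; at training time the input
# is the ground-truth prefix and the target is the same sentence shifted by one.
sentence = torch.tensor([[1, 5, 23, 9, 5, 4]])
inputs, targets = sentence[:, :-1], sentence[:, 1:]

model = RNNLanguageModel(vocab_size=100)    # class from the previous sketch
logits = model(inputs)                      # (batch, seq_len, vocab)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()                             # maximize log-likelihood, word by word
# At test time the model is instead fed its own sampled words: exposure bias.
```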



Kind Introduction to NLP: It Is Only the Beginning!

We can generate language :)

But wait… it is pretty useless! (so is the cake, what a lame running joke...)

Can we use it as a translation system? Well...

The cherry on the cake → La cerise sur le gâteau
He looked for the cake → Il est allé chercher … le gâteau

Bad word alignment: the sentences do not line up word for word.

Kind Introduction to NLP: Seq2Seq Models

Idea: decompose encoding and decoding!

(figure: an encoder RNN reads the input "le gâteau est un mensonge" into a state vector; a decoder RNN turns that state into the output "the cake is a lie")


Kind Introduction to NLP: Seq2Seq Models

Sequence-to-sequence: the decoder is fed "<start> the cake … lie" and trained to output "the cake … lie <stop>"; the encoder's outputs are ignored, only its final state is used.

Training is similar to language modelling (teacher forcing).

Kind Introduction to NLP: Seq2Seq Models

The encoder does not have to read text: plug a ResNet image encoder into the same decoder and you get image captioning.

Kind Introduction to NLP: Seq2Seq Models

Seq2Seq is an encoder/decoder architecture:
1) Encode the input tokens into a language representation
2) Decode the vector representation into generated tokens

The model is trained with cross-entropy (teacher forcing).

"You cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector." (Ray Mooney)
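A minimal sketch of such an encoder/decoder, under the same caveats as the previous snippets (arbitrary sizes and names; GRU chosen for brevity):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder RNN compresses the source into a state; decoder RNN unfolds it."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt_prefix):
        _, state = self.encoder(self.src_embed(src))             # 1) encode to one vector
        h, _ = self.decoder(self.tgt_embed(tgt_prefix), state)   # 2) decode from it
        return self.classifier(h)                                # target-vocabulary logits

model = Seq2Seq(src_vocab=80_000, tgt_vocab=80_000)
src = torch.randint(0, 80_000, (1, 5))          # e.g. "le gâteau est un mensonge"
tgt_prefix = torch.randint(0, 80_000, (1, 5))   # e.g. "<start> the cake is a"
print(model(src, tgt_prefix).shape)             # torch.Size([1, 5, 80000])
```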


Kind Introduction to NLP: Translation

WMT dataset:

- 12M French/English sentence pairs
- 80K-word vocabulary
- Assessed with the BLEU score

BLEU is the geometric average of the overlapping n-gram precisions, for n = 1 to 4, computed against a set of target sentences.

Kind Introduction to NLP: Translation

Input: le gâteau est un mensonge
Predicted: the lie is the cake
Target: the cake is a lie

When several targets exist, an n-gram is counted at most as many times as it appears in any single target (clipped counts).

n-gram precisions for the example:
n=1: 4/5    n=2: 1/4    n=3: 0/3    n=4: 0/2
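A sketch of the clipped n-gram counts behind BLEU (only the precision part; the real metric also takes the geometric mean and applies a brevity penalty):

```python
from collections import Counter

def ngram_precision(predicted, targets, n):
    """Clipped n-gram precision: each predicted n-gram counts at most as many
    times as it appears in any single target sentence."""
    pred = Counter(zip(*[predicted[i:] for i in range(n)]))
    clip = Counter()
    for target in targets:
        for ng, c in Counter(zip(*[target[i:] for i in range(n)])).items():
            clip[ng] = max(clip[ng], c)
    matched = sum(min(c, clip[ng]) for ng, c in pred.items())
    return matched, sum(pred.values())

predicted = "the lie is the cake".split()
targets = ["the cake is a lie".split()]
for n in range(1, 5):
    print(n, ngram_precision(predicted, targets, n))  # (4,5) (1,4) (0,3) (0,2)
```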


Kind Introduction to NLP: BLEU

BLEU: orders of magnitude

SotA 2014 (Durrani 2014): 37.0
Sequence-to-sequence (Cho 2014): 34.5
Sequence-to-sequence (Wu 2016): 38.95
Transformer (Vaswani 2017): 41.8

BLEU is a lie!? Estimated human BLEU (Papineni 2002): 34.7

RL is the carrot on the cake


Policy Gradient for Language Generation: Supervised Limitations

Supervised learning is good… but:

● We enforce a mismatch between training and testing: teacher forcing vs exposure bias
● We optimize cross-entropy… but we care about BLEU!
● We optimize for sentence generation… but cross-entropy works at the word level (which can be criticized)

(figure: a seq2seq model decoding "the cake … lie <stop>" from the encoded "Le gâteau est un mensonge")

Policy Gradient for Language Generation: Language as an MDP

Idea: turn translation into an MDP, with BLEU as the reward.

• State: $s_t = (\boldsymbol{x}, y_1, \dots, y_t)$
• $\boldsymbol{x} = (x_1, \dots, x_T)$, where $x_t \in V_{in}$, is the input sentence
• $y_1, \dots, y_{t'}$, where $y_t \in V_{out}$, are the generated tokens
• Action: $a_{t+1} \sim V_{out}$, the next token
• Transition: $s_{t+1} = s_t \cup \{a_{t+1}\}$
• Reward: $r_{t+1} = \text{BLEU}$ if $a_{t+1} = \langle\text{stop}\rangle$, $0$ otherwise

(figure: the decoder generates $y_1 \dots y_{T'-1}$ <stop> from the encoded $\boldsymbol{x}$; reward: BLEU)
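A toy rendering of this MDP (my own sketch; the unigram-overlap reward is a stand-in for the real BLEU computation):

```python
def pseudo_bleu(prediction, target):
    """Stand-in reward: unigram overlap instead of the real BLEU."""
    return len(set(prediction) & set(target)) / max(len(prediction), 1)

class TranslationMDP:
    """State: source sentence x plus the generated tokens; action: next token."""
    def __init__(self, x, target):
        self.x, self.target = x, target
        self.y = []                                     # generated tokens so far

    def step(self, action):
        self.y.append(action)                           # s_{t+1} = s_t U {a_{t+1}}
        done = action == "<stop>"
        reward = pseudo_bleu(self.y[:-1], self.target) if done else 0.0
        return (self.x, tuple(self.y)), reward, done

env = TranslationMDP(x=("le", "gâteau"), target=("the", "cake"))
for token in ("the", "cake", "<stop>"):
    state, reward, done = env.step(token)
print(reward)  # 1.0: the reward only arrives at <stop>
```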

Policy Gradient for Language Generation: Policy Gradient

The goal is to optimize the score function

$$J(\boldsymbol{\theta}) = \int d^{\pi_{\boldsymbol{\theta}}}(x)\, V^{\pi_{\boldsymbol{\theta}}}(x)\, dx,$$

the expected BLEU under the translation policy, over all the potential language pairs.

The score gradient is estimated by

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi_{\boldsymbol{\theta}_h}} - b\big)$$

Policy gradient (Sutton 1999) improves the policy by following the score gradient:

$$\boldsymbol{\theta}_{h+1} = \boldsymbol{\theta}_h + \alpha_h \nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h}$$

where $\alpha$ is the learning rate, $h$ the training step, $b$ a baseline, $Q$ the state-action value function, $d$ the state probability distribution, and $V$ the value function.

Policy Gradient for Language Generation: Policy Gradient

As $V_{out}$ may be very big, RL is not straightforward! For example, WMT has 80k words; Atari has 18 actions!

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi_{\boldsymbol{\theta}_h}} - b\big)$$

• $Q^{\pi_{\boldsymbol{\theta}_h}}$ is hard to parametrize: overestimation, memory footprint (Bahdanau 2016)
• $\sum_{y=1}^{|V_{out}|}$ can be intractable; potential subsampling, etc. (Liu 2018)
• It is impossible to start from a random policy; a warm start is required (Ranzato 2015)
• The baseline $b$ should be parametrized

Policy Gradient for Language Generation: Policy Gradient

Monte-Carlo variant (REINFORCE-like):

$$\nabla J_{\boldsymbol{\theta}=\boldsymbol{\theta}_h} = \sum_{t'=1}^{T'} \nabla_{\boldsymbol{\theta}_h} \log \pi_{\boldsymbol{\theta}_h}(y_{t'} \mid \boldsymbol{x}, y_1, \dots, y_{t'-1})\,\big(Q^{\pi} - b\big), \quad \text{where } Q^{\pi} = \sum_{\tau \ge t'} \gamma^{\tau} r_{\tau}$$

Intuitively, the full trajectory is equally rewarded: it is either all good or all bad!

$\gamma$ is often set to 1: there is no need to search for the shortest word trajectory.
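In code, the REINFORCE update for one sampled translation is nearly a one-liner; a minimal PyTorch sketch (gradient ascent on J is gradient descent on this loss):

```python
import torch

def reinforce_loss(log_probs, reward, baseline):
    """log_probs: (T',) log-probabilities of the sampled words;
    reward: BLEU of the finished sentence; baseline: variance reduction.
    With gamma = 1 and reward only at <stop>, every word gets the same return."""
    return -(reward - baseline) * log_probs.sum()

probs = torch.tensor([0.4, 0.7, 0.9], requires_grad=True)   # pi(y_t' | x, y_<t')
loss = reinforce_loss(torch.log(probs), reward=0.35, baseline=0.30)
loss.backward()   # pushes all sampled words up, since reward > baseline
```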

Policy Gradient for Language Generation: SL vs RL

(figure: two seq2seq decoders; the RL one feeds back its own sampled words $y_1, y_2, \dots$ and receives BLEU at <stop>, while the SL one is fed the ground-truth words $y_1^*, y_2^*, \dots$)

Reinforcement learning: $J(\boldsymbol{\theta}) = \int d^{\pi_{\boldsymbol{\theta}}} V^{\pi_{\boldsymbol{\theta}}} dx$
• High variance, requires a warm start
• Optimizes the true score
• Signal over trajectories

Supervised learning: $\sum_{t'=1}^{T'} \log p_{\theta}(y_{t'} \mid x, y_1, \dots, y_{t'-1})$
• Low variance, sample efficient
• Optimizes a surrogate
• Signal word after word

Policy Gradient for Language Generation: Results, Finally!

Does it work?

(figure: BLEU scores for translation / image captioning and ROUGE score for summarization with policy gradient (Ranzato 2015))

Policy Gradient for Language Generation: Damn!

Does it really work? Well…

(figure: the RED curve is a lie!)

"Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And." (Paulus 2017)

Reward hacking…

Policy Gradient for Language Generation: Tricks, the Heart of Deep RL ;)

A few tricks to alleviate RL issues:

• Parametrize and train your baseline correctly
• Increase the batch size!
• Perform an extensive sweep over the RL hyperparameters
• Add a softmax temperature, and check that your SL baseline is not overconfident (see the sketch below)
• Slowly transition from SL to RL
• Check the qualitative results!

Recommended slides: http://www.phontron.com/slides/neubig19structured.pdf
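For the softmax-temperature trick, a minimal sketch (names are illustrative, not from the talk):

```python
import torch

def sample_with_temperature(logits, tau=1.0):
    """tau > 1 flattens the distribution (more exploration, counteracts an
    overconfident supervised policy); tau < 1 sharpens it."""
    return torch.distributions.Categorical(logits=logits / tau).sample()

logits = torch.tensor([4.0, 1.0, 0.5])
print(sample_with_temperature(logits, tau=2.0))  # flatter, more exploratory draw
```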


Dialogue System with RL


Dialogue Systems: Classic Pipeline

"Open the pod bay doors, HAL"
"I'm sorry, Dave. I'm afraid I can't do that."

A POMDP setting:
• Natural Language Understanding → dialogue utterance $u_t$: {request: "open door", from: "Dave"}
• Dialogue State Tracker → new state $s_t$: {"Dave status": threat}
• Policy Learning → action $a_t$: {inform: "Request denied", action: none}
• Natural Language Generator

Later on… Dave disconnects HAL…

Dialogue Systems

The POMDP setting:

• Observation: {request: "open door", from: "Dave"}
• State: {"Dave status": threat}
• Policy: the dialogue manager
• Action: {inform: "Request denied", action: none}
• Reward: later on… Dave disconnects HAL…

Given a good (hand-crafted) NLU / NLG… good news… it works! (Young 2013) (DSTC)


Dialogue Systems: Classic Pipeline

"Open the pod bay doors, HAL"
"I'm sorry, Dave. I'm afraid I can't do that."

End-to-end model: replace the whole pipeline with a single model.

Dialogue Systems: Classic Pipeline

Idea: turn dialogue into a (natural) translation system, mapping the user's utterance to the system's reply with an encoder/decoder RNN! (Vinyals and Le 2015) (Wen et al. 2015)

Dialogue Systems: Taxonomy

Chatbot:
• Open discussion!
• Numerous datasets (Lowe 2015)
• Numerous models (Gao 2019)
• No reward signal…

Goal-oriented dialogue:
• Dialogue to solve a task: book a plane ticket, find a restaurant, etc.
• No large-scale goal-oriented dataset with natural language (~10k dialogues)
• Clear reward signal!

Wait!! What is that?

Visually grounded, goal-oriented, natural dialogues

GuessWhat?!: Symbol Grounding Problem

The Symbol Grounding Problem! (Harnad 1990)

GuessWhat?!: Symbol Grounding Problem

How do we ground symbols?

ケーキ: 小麦粉、脂肪、卵、砂糖、およびその他の成分の混合物から作られた、焼きたての、時にはアイスまたは装飾された柔らかい甘い食べ物。

(Japanese for: cake: "a soft, sweet, baked food, sometimes iced or decorated, made from a mixture of flour, fat, eggs, sugar, and other ingredients". Unless you already read Japanese, the definition alone does not ground the symbol.)

GuessWhat?!

Visually grounded dialogues with self-play.

Game features:
● Dialogue
● Visually grounded
● Collaborative
● Goal-oriented, with a clear reward

The GuessWhat?! game consists in locating a hidden object in a natural scene by asking a sequence of questions.

Come play… it will be fun.

GuessWhat?!: Let's Play

(image: a natural scene containing several vases)

Questioner: Is it a vase?
Oracle: Yes
Questioner: Is it in the front row?
Oracle: No
Questioner: Does it have some red on it?
Oracle: No
Questioner: Is it the second vase from the right?
Oracle: Yes
Questioner: I found it!
Oracle: Correct!

GuessWhat?!

Explore the dataset: https://guesswhat.ai/explore

GuessWhat?!: Dataset Statistics – Language Metrics

(figures: on average, 5+ questions per dialogue and ~5 words per question)

GuessWhat?!: Dataset Statistics – Success Ratio

The more objects there are, the lower the success ratio; the bigger the object, the higher the success ratio.

GuessWhat?!: Optimal Policy

GuessWhat?! Dataset: Potential Language Policies

- Word taxonomy:
  ◦ Is it a vehicle? A car? A motorbike?
- Spatial reasoning:
  ◦ Is it on the left? In the background?
  ◦ Is it on the right of the blue car? Is it between the two zebras?
  ◦ Is it on the table?
- Counting:
  ◦ Is it the second man from the left?
- Others:
  ◦ Can it fly? Do you eat with it?
  ◦ Is it big? Is it square?

GuessWhat?!: Cloud of Words

(figure: word cloud of the questions asked in the dataset)

GuessWhat?!: Game Loop

Questioner evaluation procedure: repeat until <stop_dialogue>: the Questioner asks a question and the Oracle returns a yes/no answer. The Guesser then reads the whole dialogue and tries to find the object.
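A sketch of this loop (the agents' interfaces are hypothetical, not the paper's API):

```python
def play_game(image, target_object, questioner, oracle, guesser, max_questions=8):
    """One GuessWhat?! episode: questions until <stop_dialogue>, then one guess."""
    dialogue = []
    for _ in range(max_questions):
        question = questioner.ask(image, dialogue)
        if question == "<stop_dialogue>":
            break
        answer = oracle.answer(image, target_object, question)  # "yes" / "no"
        dialogue.append((question, answer))
    guess = guesser.guess(image, dialogue)
    return guess == target_object        # success: the future reward signal
```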

GuessWhat?!: Game Notation

GuessWhat?!: Models

Oracle: 79.5% accuracy

GuessWhat?!: Models

Guesser: 63.8% accuracy

GuessWhat?!: Models

Question generator (QGen), trained by minimizing cross-entropy: a policy warm-up.

GuessWhat?!: Quantitative Results (QGen)

Accuracy: the higher, the better!

- New objects: images from the training set, with a new randomly picked object
- New images: images from the test set (never seen at training time)

GuessWhat?!: Qualitative Results

● Lack of generalization
● Poor grounding
● Language-imitation pitfall

GuessWhat?!: Limitations of Supervised Learning

Observations:

● The space of dialogue actions/states is too large to generalize
● Supervised learning misses the planning aspect
● Supervised learning does not care about solving the task: wrong metric!
● (Side issue) Grounding seems imperfect…

GuessWhat?!: What If…

Is it the best question? What happens if the answer is no? Can we phrase it differently?

GuessWhat?!: Game Loop (with RL)

Same loop: repeat until <stop_dialogue>: the Questioner asks a question, the Oracle answers yes/no, and the Guesser then tries to find the object. But now finding the object yields a reward, which is used to update the agent.

GuessWhat?!: RL Notation

GuessWhat?!: Policy Gradient!

The policy gradient is applied to the Questioner: it sums over each question and, within a question, over each word, with the policy conditioned on the image.

GuessWhat?!: RL Algorithm

Initialization (supervised warm-up), then repeat: generate a game, generate a dialogue, find the object, and update the model.
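Putting the pieces together, a self-play training loop might look like this (hypothetical interfaces; the update follows the REINFORCE gradient shown earlier):

```python
def train_qgen(qgen, oracle, guesser, games, epochs=10):
    """Self-play policy-gradient training of a supervisedly warm-started QGen."""
    baseline = 0.0                                    # running-mean reward baseline
    for _ in range(epochs):
        for image, target_object in games:            # generate a game
            dialogue, log_probs = qgen.sample_dialogue(image, oracle)  # generate a dialogue
            reward = float(guesser.guess(image, dialogue) == target_object)  # find the object
            qgen.reinforce_update(log_probs, reward - baseline)        # update the model
            baseline = 0.99 * baseline + 0.01 * reward
```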

GuessWhat?!: Quantitative Results (QGen, after RL training)

Accuracy: the higher, the better!

- New objects: images from the training set, with a new randomly picked object
- New images: images from the test set (never seen at training time)


GuessWhat?!: What Happened?

Good:
• Optimizes the metric
• The language strategy is consistent

Bad:
• Optimizes the metric
• The language strategy is poor and limited
• The language quality is bad

GuessWhat?!: Vocabulary Evolution

Human: ~3000 words → beam search: ~500 words → RL: >100 words

Vocabulary drop:
- The model quickly reduces its action space
- Supervised models are overly confident (p > 50%)

Are current RL algorithms really OK for NLG? Better warm-up?

GuessWhat?!: Language Drift

Language drift… How do we enforce language quality while optimizing for the goal? Reward shaping? Hierarchical RL? (Lewis 2017)


Questions? The cherry on the cake is a lie!
