Interleaving Language and RL: Language Generation
Florian Strub
Reinforcement Learning Summer School
7 July 2019
Interleaving Language and RL — Florian Strub
Overview
● Kind introduction to NLP (~15min)
● Policy gradient for Translation (~15min)
● Goal-oriented dialogue systems (~20min)
○ Dialogue setting
○ GuessWhat?!
○ Self-play for language generation
● Other linguistic grounded tasks (~talk to me!)
○ Language as goal representation: Instruction Following
○ Language as state representation: Text Games
○ Language as policy compositionality: Emergence of Language
Kind Introduction to NLP
What is NLP?
Natural Language Processing (NLP) aims to extract representations of textual information to read and make sense of human languages in a valuable manner.
Kind Introduction to NLP
Language is Hard :)
Georgetown–IBM experiment, 1954: translated sixty Russian sentences into English.
ALPAC report, 1966: "there is no immediate or predictable prospect of useful machine translation"
Kind Introduction to NLP
Small Historical Note
Rationalism: language possesses an underlying structure that must be discovered; it conjectures that human language is innate and can be modelled like physical laws.
Chomsky, Noam. Syntactic Structures. 1957. (Transformational generative grammars)
Empiricism: language is a cognitive process that can be learned through experimentation; it advocates exploring learning mechanisms rather than linguistic models.
J. R. Firth. A synopsis of linguistic theory 1930–55. Studies in Linguistic Analysis: "You shall know a word by the company it keeps"
"The validity of statistical (information theoretic) approach to MT has indeed been recognized … as early as 1949. And was universally recognized as mistaken [sic] by 1950. … The crude force of computers is not science." (Review of Brown et al. (1990))
Kind Introduction to NLP
Language Modelling
Goal of the statistical approach: how likely is a sentence?
- "The cherry on the cake"
- "The cake on the cherry"
but the cake is a lie...
Kind Introduction to NLP
Language Modelling
A sentence is represented as a sequence of words:
http://videolectures.net/deeplearning2016_cho_language_understanding/
We compute the probability of the word sequence:
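The factorization behind this is the chain rule over words; in the slide's notation it can be written as:

```latex
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
```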
Kind Introduction to NLP
Language Modelling
How to estimate the conditional probabilities?
N-gram Modelling:
We can simply count words!
Do not trust the cake!
Several problems remain (memory footprint, poor generalization, etc.)
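As a concrete illustration of count-based estimation, here is a minimal bigram model in plain Python (the two-sentence corpus is made up for the example):

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Count-based bigram model: p(cur | prev) = count(prev, cur) / count(prev)."""
    pair_counts = defaultdict(int)
    prev_counts = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1

    def prob(prev, cur):
        return pair_counts[(prev, cur)] / prev_counts[prev] if prev_counts[prev] else 0.0

    return prob

p = train_bigram_lm(["the cherry on the cake", "the cake is a lie"])
print(p("the", "cake"))    # "the" occurs 3 times as a context, followed by "cake" twice -> 2/3
print(p("cake", "cherry")) # never observed -> 0.0
```

The same counting scheme extends to higher-order n-grams, which is exactly where the memory-footprint and generalization problems come from: most n-grams are simply never observed.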
Kind Introduction to NLP
Neural Language Modelling
How to estimate the conditional probabilities? Learn a representation of the word history: use neural networks!
Input: "The cherry on the"; predicted next words: "cherry on the cake"
Kind Introduction to NLP
Neural Language Modelling
- Indexed vocabulary (hash table): V = { "Is": 1, …, "cherry": 23, … }, mapped to word embeddings!
- RNN: basic RNN, LSTM, GRU, etc.
- Classifier: MLP + softmax
- Sampling procedure: greedy, stochastic, beam search, etc.
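The sampling procedures listed above can be sketched in a few lines. This toy snippet (vocabulary and logits are made up) shows greedy decoding and temperature sampling over a softmax output:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn classifier logits into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def greedy(vocab, logits):
    """Greedy decoding: always pick the most likely word."""
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

def sample(vocab, logits, temperature=1.0, rng=random):
    """Stochastic decoding: draw a word from the (tempered) softmax distribution."""
    probs = softmax(logits, temperature)
    r, acc = rng.random(), 0.0
    for word, p in zip(vocab, probs):
        acc += p
        if r < acc:
            return word
    return vocab[-1]

vocab = ["cherry", "cake", "lie"]
logits = [0.2, 2.0, 0.5]
print(greedy(vocab, logits))  # "cake"
print(sample(vocab, logits, temperature=0.7))
```

Beam search instead keeps the k best partial sentences at each step; it is omitted here for brevity.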
Kind Introduction to NLP
Training Procedure
Given a corpus, the goal is to maximize the joint probability of its sentences: the key NLP equation.
Cake and grief counseling will be available at the conclusion of the test.
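The "key NLP equation" alluded to here is maximum likelihood over the corpus, which in the notation above reads:

```latex
\max_{\theta} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_{\theta}\!\left(w_t^{(n)} \mid w_1^{(n)}, \dots, w_{t-1}^{(n)}\right)
```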
Kind Introduction to NLP
Training Procedure
Training time (teacher forcing): inputs are the ground truth, "The cherry on the", even when the model predicts "cherry over the table".
Test time (exposure bias): inputs are the model's own predictions, "The cherry over the", so it outputs "cherry over the table" and errors compound.
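A tiny runnable illustration of the mismatch; the next-word table is a made-up stand-in for a trained model, with one deliberate error after "cake":

```python
def decode(model, start, steps, ground_truth=None):
    """With ground_truth: teacher forcing (inputs are the true words).
    Without it: free-running decoding (inputs are the model's own predictions)."""
    inp, out = start, []
    for t in range(steps):
        pred = model(inp)
        out.append(pred)
        inp = ground_truth[t] if ground_truth is not None else pred
    return out

# Hypothetical "trained" next-word table with one learned error: after "cake"
# the model says "was" instead of "is".
table = {"<s>": "the", "the": "cake", "cake": "was",
         "is": "a", "a": "lie", "was": "not", "not": "funny"}
model = lambda word: table.get(word, "</s>")

truth = ["the", "cake", "is", "a", "lie"]
print(decode(model, "<s>", 5, ground_truth=truth))  # ['the', 'cake', 'was', 'a', 'lie']
print(decode(model, "<s>", 5))                      # ['the', 'cake', 'was', 'not', 'funny']
```

Under teacher forcing the single error stays isolated; in free-running mode the wrong word is fed back as input and the rest of the sentence derails, which is exactly the exposure bias described above.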
Kind Introduction to NLP
It is only the beginning!
We can generate language :)
But wait… it is pretty useless! (so is the cake, what a lame running joke...)
Can we use it as a translation system? Well...
"The cherry on the cake" / "La cerise sur le gateau" ("the cherry on the cake"): fine.
"He looked for the cake" / "Il est allé chercher .. le gateau" ("he went to fetch .. the cake"): bad word alignment!
Kind Introduction to NLP
Seq2Seq Models
Idea: decompose encoding and decoding!
- Encoder: an RNN reads the input sentence ("le gateau est un mensonge") and compresses it into a state vector (its outputs are ignored).
- Decoder: another RNN generates the output from that state: inputs "<start> the cake … lie", outputs "the cake … lie <stop>".
Similar training as language modelling (teacher forcing): this is the sequence-to-sequence setup.
Swap the RNN encoder for an image encoder (e.g. a ResNet) and the same decoder performs image captioning.
Kind Introduction to NLP
Seq2Seq Models
Seq2Seq is an Encoder/Decoder architecture:
1) Encode the language representation
2) Decode the vector representation into generated tokens
The model is trained with cross-entropy (teacher forcing).
"You cannot cram the meaning of a whole %&!$# sentence into a single $&!#* vector."
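In equations, the encoder/decoder decomposition amounts to conditioning the decoder's language model on a single encoder state:

```latex
p_{\theta}(y \mid x) = \prod_{t=1}^{T'} p_{\theta}\!\left(y_t \mid y_1, \dots, y_{t-1}, c\right),
\qquad c = \mathrm{enc}_{\theta}(x)
```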
Kind Introduction to NLP
Translation
WMT dataset:
- 12M French/English sentence pairs
- Vocabulary: 80K words
- Assessed with the BLEU score
BLEU is the geometric average of overlapping n-grams, from n=1 to 4, against a set of target sentences.
Kind Introduction to NLP
Translation
Input: "le gateau est un mensonge"; Predicted: "the lie is the cake"; Target: "the cake is a lie"
Overlapping n-gram counts: n=1: 4/5, n=2: 1/4, n=3: 0/3, n=4: 0/2.
When several targets exist, an n-gram can be counted as many times as it appears in any one of the targets.
BLEU is the geometric average of overlapping n-grams, from n=1 to 4, against a set of target sentences.
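The per-n counts above can be reproduced with a few lines of Python implementing BLEU's clipped n-gram counting (single reference, no brevity penalty):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision used by BLEU: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matched, sum(cand_ngrams.values())

for n in range(1, 5):
    print(n, ngram_precision("the lie is the cake", "the cake is a lie", n))
# 1 (4, 5)   2 (1, 4)   3 (0, 3)   4 (0, 2)
```

BLEU is then the geometric mean of the four ratios (times a brevity penalty), which is why a single zero, here at n=3, is usually handled with smoothing in practice.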
Kind Introduction to NLP
BLEU
BLEU: order of magnitude
- SOTA 2014 (Durrani 2014): 37.0
- Sequence-to-sequence (K. Cho 2014): 34.5
- Sequence-to-sequence (Wu 2016): 38.95
- Transformer (Vaswani 2017): 41.8
- Estimated human BLEU (Papineni 2002): 34.7
BLEU is a lie!?
RL is the carrot on the cake
Policy Gradient for Language Generation
Supervised Limitations
Supervised learning is good… but:
● We enforce a mismatch between training and testing: teacher forcing vs exposure bias
● We optimize cross-entropy… but we care about BLEU!
● We optimize for sentence generation… but CE is at the word level (which can be criticized)
[Diagram: encoder over "Le gateau est un mensonge", decoder generating "the cake … lie <stop>"]
Policy Gradient for Language Generation
Language as an MDP
Idea: turn language translation into an MDP, with BLEU as the reward.
• State: s_t = (x, y_1, ⋯, y_t)
• Input sentence: x = (x_1, ⋯, x_T), where x_t ∈ V_in
• Generated tokens: y_1, ⋯, y_t', where y_t ∈ V_out
• Action: a_{t+1} ∈ V_out (the next token)
• Transition: s_{t+1} = s_t ∪ {a_{t+1}}
• Reward: r_{t+1} = BLEU if a_{t+1} = <stop>, 0 otherwise
[Diagram: encoder over x, decoder generating y_1 ⋯ y_{T'-1} <stop>; reward: BLEU]
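A minimal sketch of this MDP in Python; the word table and the BLEU stand-in are made up for illustration, and a real system would call a proper BLEU implementation:

```python
def step(state, action, score_fn):
    """One MDP transition: append the chosen token; reward arrives only at <stop>."""
    src, generated = state
    generated = generated + [action]
    if action == "<stop>":
        return (src, generated), score_fn(src, generated[:-1]), True  # terminal
    return (src, generated), 0.0, False                               # r_t = 0 otherwise

def toy_bleu(src, hypothesis):
    """Stand-in reward: fraction of hypothesis words found in a toy reference."""
    lexicon = {"le": "the", "gateau": "cake"}  # hypothetical word table
    reference = [lexicon[w] for w in src]
    return sum(1 for w in hypothesis if w in reference) / max(len(hypothesis), 1)

state, done = (["le", "gateau"], []), False
for action in ["the", "cake", "<stop>"]:
    state, reward, done = step(state, action, toy_bleu)
print(reward, done)  # 1.0 True: the reward arrives only on the terminal action
```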
Policy Gradient for Language Generation
Policy Gradient
The goal is to optimize the score function
$J_\theta = \int d^{\pi_\theta} V^{\pi_\theta} \, dx$,
the expected BLEU under the translation policy, over all potential language pairs.
The score gradient is estimated by
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, \big(Q^{\pi_{\theta_h}} - b\big)$
Policy gradient (Sutton 1999) improves the policy by following the score gradient:
$\theta_{h+1} = \theta_h + \alpha_h \nabla J_{\theta=\theta_h}$
where $\alpha$ is the learning rate, $h$ the training step, $b$ a baseline, $Q$ the state-action value function, $d$ the state probability distribution, and $V$ the value function.
Policy Gradient for Language Generation
Policy Gradient
As $V_{out}$ may be very big, RL is not straightforward! For example, WMT has 80k words; Atari has 18 actions!
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \sum_{y=1}^{|V_{out}|} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, \big(Q^{\pi_{\theta_h}} - b\big)$
• $Q^{\pi_{\theta_h}}$ is hard to parametrize: overestimation, memory footprint (Bahdanau 2016)
• The sum over $V_{out}$ can be intractable: potential subsampling, etc. (Liu 2018)
• Impossible to start from a random policy: requires a warm start (Ranzato 2015)
• The baseline $b$ should be parametrized
Policy Gradient for Language Generation
Policy Gradient
Monte-Carlo variant (REINFORCE-like):
$\nabla J_{\theta=\theta_h} = \sum_{t'=1}^{T'} \nabla_{\theta_h} \log \pi_{\theta_h}(y_{t'} \mid x, y_1, \dots, y_{t'-1}) \, (Q^{\pi} - b)$, where $Q^{\pi} = \sum_{\tau \ge t'} \gamma^{\tau} r_\tau$
Intuitively, the full trajectory is equally rewarded: it is either all good or all bad!
$\gamma$ is often set to 1: no need to search for the shortest word trajectory.
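The Monte-Carlo update can be exercised end-to-end on a toy problem: a two-step "sentence" over a three-word vocabulary, a single sequence-level reward, and a running-mean baseline. Everything here (vocabulary, target, learning rates) is made up for illustration:

```python
import math
import random

random.seed(0)
vocab = ["the", "cake", "lie"]
target = ["the", "cake"]  # made-up target "sentence"

# One logit vector per time step: a stand-in for the RNN policy's outputs.
logits = [[0.0] * len(vocab) for _ in range(len(target))]
baseline, alpha = 0.0, 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

for _ in range(300):
    # Sample a full trajectory; one sequence-level reward (a BLEU stand-in).
    actions = [sample(softmax(l)) for l in logits]
    reward = 1.0 if [vocab[a] for a in actions] == target else 0.0
    advantage = reward - baseline            # (Q^pi - b), with gamma = 1
    baseline += 0.05 * (reward - baseline)   # running-mean baseline
    for t, a in enumerate(actions):
        probs = softmax(logits[t])
        for k in range(len(vocab)):
            # d/dlogit_k of log pi(a) = 1[k == a] - pi_k  (softmax log-gradient)
            logits[t][k] += alpha * ((1.0 if k == a else 0.0) - probs[k]) * advantage

print([vocab[max(range(len(vocab)), key=l.__getitem__)] for l in logits])
```

Because the whole trajectory shares one return, every word of a successful sample is pushed up equally; the greedy decode typically converges to the target within a few hundred updates.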
Policy Gradient for Language Generation
SL vs RL
Supervised learning: maximize $\sum_{t'=1}^{T'} \log p_\theta(y_{t'} \mid x, y_1, \dots, y_{t'-1})$, feeding the ground-truth tokens $y^*$.
- Low variance, sample efficient
- Optimizes a surrogate
- Signal word after word
Reinforcement learning: maximize $J_\theta = \int d^{\pi_\theta} V^{\pi_\theta} \, dx$, feeding the sampled tokens $y$ and rewarding with BLEU.
- High variance, requires a warm start
- Optimizes the true score
- Signal over whole trajectories
Policy Gradient for Language Generation
Results, finally!
Does it work? Policy gradient improves BLEU for translation and image captioning, and ROUGE for summarization (Ranzato 2015).
Policy Gradient for Language Generation
Damn!
Does it really work? Well… RED is a lie!
Reward hacking… an RL-trained summary (Paulus 2017):
"Button was denied his 100th race for McLaren after an ERS prevented him from making it to the start-line. It capped a miserable weekend for the Briton. Button has out-qualified. Finished ahead of Nico Rosberg at Bahrain. Lewis Hamilton has. In 11 races. . The race. To lead 2,000 laps. . In. . . And."
Policy Gradient for Language Generation
Tricks, the heart of Deep RL ;)
A few tricks to alleviate RL issues:
• Parametrize and train your baseline correctly
• Increase the batch size!
• Perform an extensive RL hyperparameter sweep
• Add a softmax temperature, and check your SL baseline (often overconfident)
• Slowly transition from SL to RL
• Check qualitative results!
Recommended slides: http://www.phontron.com/slides/neubig19structured.pdf
Dialogue System with RL
Dialogue Systems
Classic Pipeline
"Open the pod bay doors, HAL" / "I'm sorry, Dave. I'm afraid I can't do that."
• Natural Language Understanding → dialogue utterance u_t: {request: "open door", from: "Dave"}
• Dialogue State Tracker → new state s_t: {"Dave status": threat}
• Policy Learning → action a_t: {inform: "Request Denied", action: "None"}
• Natural Language Generator → the system's reply
POMDP setting. Reward: later on… Dave disconnects HAL…
Dialogue Systems
POMDP setting:
• Observation: {request: "open door", from: "Dave"}
• State: {"Dave status": threat}
• Policy: the dialogue manager
• Action: {inform: "Request Denied", action: "None"}
• Reward: later on… Dave disconnects HAL…
Given a good (hand-crafted) NLU / NLG… good news: it works! (Young 2013) (DSTC)
Dialogue Systems
End-2-End Model
"Open the pod bay doors, HAL" / "I'm sorry, Dave. I'm afraid I can't do that."
Idea: replace the whole pipeline with a single model and turn dialogue into a (natural) translation system: Encoder (RNN) → state → Decoder (RNN). (Vinyals and V. Le 2015) (Wen et al. 2015)
Dialogue Systems
Taxonomy
Chatbot:
- Open discussion!
- Numerous datasets (Lowe 2015) and models (Gao 2019)
- No reward signal…
Goal-oriented dialogue:
- Dialogue to solve a task: book a plane ticket, find a restaurant, etc.
- No large-scale goal-oriented dataset with natural language (~10k dialogues)
- Clear reward signal!
Wait!! What is that? Visually grounded goal-oriented natural dialogues.
GuessWhat?!
Symbol Grounding Problem (Harnad 1990)
How to ground a symbol?
ケーキ ("cake")
小麦粉、脂肪、卵、砂糖、およびその他の成分の混合物から作られた、焼きたての、時にはアイスまたは装飾された柔らかい甘い食べ物。("A soft, sweet food made from a mixture of flour, fat, eggs, sugar, and other ingredients, baked and sometimes iced or decorated.")
Defining a symbol with other ungrounded symbols does not ground it.
GuessWhat?!
Visually grounded dialogues with self-play.
The game consists in locating a hidden object in a natural scene by asking a sequence of questions.
Game features:
● Dialogue
● Visually grounded
● Collaborative
● Goal-oriented with a clear reward
Come play… it will be fun.
GuessWhat?!
Let's Play (Questioner / Oracle)
- Is it a vase? Yes
- Is it in the front row? No
- Does it have some red on it? No
- Is it the second vase from the right? Yes
- I found it! Correct!
Explore the dataset: https://guesswhat.ai/explore
GuessWhat?!
Dataset Statistics
- Language metrics: on average, 5+ questions per dialogue and ~5 words per question
- Success ratio: the more objects there are, the lower the success ratio; the bigger the object, the higher the success ratio
GuessWhat?!
Optimal Policy
GuessWhat?! Dataset
Potential Language Policy
- Word taxonomy:
◦ Is it a vehicle? A car? A motorbike?
- Spatial reasoning:
◦ Is it on the left? In the background?
◦ Is it on the right of the blue car? Is it between the two zebras?
◦ Is it on the table?
- Counting:
◦ Is it the second man from the left?
- Others:
◦ Can it fly? Do you eat with it?
◦ Is it big? Is it square?
GuessWhat?!
Word Cloud
GuessWhat?!
Game Loop
Questioner and Oracle repeat until <stop_dialogue>: the questioner asks a question, the oracle gives a yes/no answer. The dialogue is then passed to the Guesser: did it find the object?
GuessWhat?!
Game Notation
GuessWhat?!
Models
- Oracle: 79.5% accuracy
- Guesser: 63.8% accuracy
GuessWhat?!
Models
- QGen: trained by minimizing cross-entropy (policy warm-up)
GuessWhat?!
Quantitative Results (QGen)
Accuracy: the higher, the better!
- New objects: images from the training set + pick a random object
- New images: images from the testing set (never seen at training time)
GuessWhat?!
Qualitative Results
- Lack of generalization
- Poor grounding
- Language imitation pitfall
GuessWhat?!
Limitations of Supervised Learning
Observations:
● The dialogue action/state space is too large to generalize
● Supervised learning misses the planning aspect
● Supervised learning does not care about solving the task: wrong metric!
● (Side issue) Grounding seems imperfect…
GuessWhat?!
What if…
- Is it the best question?
- What happens if the answer is no?
- Can we phrase it differently?
GuessWhat?!
Game Loop with RL
Questioner and Oracle repeat until <stop_dialogue> (question / yes-no answer). The dialogue goes to the Guesser: did it find the object? The outcome is the reward, used to update the agent.
GuessWhat?!
RL Notation
GuessWhat?!
Policy Gradient!
Policy gradient is applied for each question and for each word, conditioned on the image.
GuessWhat?!
RL Algorithm
1. Initialization (supervised policy warm-up)
2. Generate a game
3. Generate the dialogue
4. Find the object
5. Update the model
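The loop above can be sketched with stub agents; all three policies here are hypothetical hand-written stand-ins for the trained networks, and the point is only the structure of self-play:

```python
import random

def oracle(target, question):
    """Oracle stub: truthful yes/no answer about the hidden object."""
    prop, value = question
    return "yes" if target[prop] == value else "no"

def play_game(objects, target, rng, max_questions=5):
    """One self-play episode: questions shrink the candidate set, then guess."""
    candidates = list(objects)
    dialogue = []
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        # Questioner stub: ask about a property/value of some remaining candidate.
        prop, value = rng.choice([(p, o[p]) for o in candidates for p in o])
        answer = oracle(target, (prop, value))
        dialogue.append((prop, value, answer))
        wanted = (answer == "yes")
        candidates = [o for o in candidates if (o[prop] == value) == wanted]
    guess = rng.choice(candidates)  # Guesser stub: pick among consistent objects
    return dialogue, 1.0 if guess is target else 0.0  # reward for the RL update

objects = [{"kind": "vase", "row": "front"}, {"kind": "vase", "row": "back"},
           {"kind": "cake", "row": "front"}]
rng = random.Random(0)
wins = sum(play_game(objects, rng.choice(objects), rng)[1] for _ in range(100))
print(wins / 100)  # the success rate that policy gradient would maximize
```

Because the oracle is truthful and filtering keeps only consistent objects, the target always remains among the candidates; in the real game, it is the questioner's and guesser's networks that are updated from this terminal reward.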
GuessWhat?!
Quantitative Results (QGen with RL)
Accuracy: the higher, the better!
- New objects: images from the training set + pick a random object
- New images: images from the testing set (never seen at training time)
GuessWhat?!
What happened?
Good:
- Optimizes the metric
- The language strategy is consistent
Bad:
- Optimizes the metric
- The language strategy is poor/limited
- The language quality is bad
GuessWhat?!
Vocabulary Evolution
- Human: ~3000 words
- Beam search: ~500 words
- RL: ~100 words
Vocabulary drop:
- The model quickly reduces its action space
- Supervised models are overly confident (p > 50%)
Are current RL algorithms really OK for NLG? Better warm-up?
GuessWhat?!
Language Drift
Language drift… (Lewis 2017)
How to enforce language quality while optimizing for the goal? Reward shaping? HRL?
Overview
● Kind introduction to NLP (~15min)
● Policy gradient for Translation (~15min)
● Goal-oriented dialogue systems (~20min)
○ Dialogue setting
○ GuessWhat?!
○ Self-play for language generation
● Other linguistic grounded tasks (~talk to me!)
○ Language as goal representation: Instruction Following
○ Language as state representation: Text Games
○ Language as policy compositionality: Emergence of Language
Questions? The cherry on the cake is a lie!