
Page 1: Recent Developments in Statistical Dialogue Systems › Seminars › SteveYoungSlides.pdf · Dialogue Systems Machine Intelligence Laboratory Information Engineering Division Cambridge

Recent Developments in Statistical Dialogue Systems

Machine Intelligence Laboratory Information Engineering Division

Cambridge University Engineering Department Cambridge, UK

Steve Young

Page 2:

Recent Developments in Statistical Dialogue Systems, Braunschweig, Sept 2012 © Steve Young

Contents

•  Review of Basic Ideas and Current Limitations

•  Semantic Decoding

•  Fast Learning

•  Parameter Optimisation and Structure Learning

Page 3:

Spoken Dialog Systems (SDS)

[Figure: the conventional SDS pipeline. User speech (waveforms) is passed to a Recognizer, which produces words; a Semantic Decoder converts the words into dialog acts; Dialog Control consults a Database and chooses a response dialog act; a Message Generator and Synthesizer convert it back into speech for the user.]

Example turn:
User: "I want to find a restaurant" → inform(venuetype=restaurant)
System: request(food) → "What kind of food would you like?"

Page 4:

A Statistical Spoken Dialogue System

[Figure: the statistical SDS. The Semantic Decoder (trained by supervised learning) converts ASR evidence such as "I'd like italian {0.4} / I want an Italian {0.2} / I'd like indian {0.2} / in the east {0.1}" into an N-best list of user dialogue acts:

inform(food=italian) {0.6}
inform(food=indian) {0.2}
inform(area=east) {0.1}
null() {0.1}

A Bayesian Belief Network, built from an Ontology, maintains the belief state via belief propagation. A Stochastic Policy, trained by reinforcement learning from success/fail rewards supplied by a Reward Function, makes the decision, e.g. confirm(food=italian), request(area), and the Response Generator renders it: "You're looking for an Italian restaurant, whereabouts?" The complete system is a Partially Observable Markov Decision Process (POMDP).]

Page 5:

The POMDP SDS Framework

The observation is the semantic decoder's distribution over user acts:

$$o_t = p(u_t \mid x_t) = \sum_{w_t} p(u_t \mid w_t)\, p(w_t \mid x_t)$$

where $x_t$ is the input speech, $w_t$ the word string produced by the speech recogniser, and $u_t$ the user act produced by the semantic decoder.

The belief state is updated each turn by

$$b_t(s_t) = \eta\, p(o_t \mid s_t, a_{t-1}) \sum_{s_{t-1}} P(s_t \mid s_{t-1}, a_{t-1})\, b_{t-1}(s_{t-1})$$

where $s_t$ is the dialogue state, $a_{t-1}$ the previous system action, and $\eta$ a normalising constant.

The system action is sampled from a softmax policy with policy parameters $\theta$ and belief-state features $\phi_a(b_t)$:

$$a_t \sim \pi(a_t \mid b_t) = \frac{e^{\theta\cdot\phi_{a_t}(b_t)}}{\sum_a e^{\theta\cdot\phi_a(b_t)}}$$

The expected reward

$$R = E\left[\sum_{s_t} r(s_t, a_t)\, b_t(s_t)\right]$$

is maximised with respect to $\theta$.
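The belief update above can be sketched for a single discrete slot. This is a minimal illustration with toy numbers (the states, transition matrix and observation likelihoods are invented for the example, not taken from the Cambridge system):

```python
import numpy as np

# Toy goal states for a single "food" slot
states = ["french", "chinese", "none"]

def belief_update(b_prev, obs_likelihood, transition):
    """One turn of POMDP belief monitoring:
    b_t(s) = η · p(o_t|s, a_{t-1}) · Σ_{s'} P(s|s', a_{t-1}) · b_{t-1}(s')."""
    predicted = transition.T @ b_prev       # predict: sum over previous states
    unnorm = obs_likelihood * predicted     # correct: weight by observation model
    return unnorm / unnorm.sum()            # η is the normalising constant

b = np.full(3, 1 / 3)                       # uniform prior over goals
T = np.full((3, 3), 0.05)                   # P(s|s'): goals mostly persist...
np.fill_diagonal(T, 0.9)                    # ...with small switching probability
obs = np.array([0.7, 0.2, 0.1])             # p(o_t|s) after an utterance decoded
                                            # mostly as inform(food=french)

b = belief_update(b, obs, T)                # belief now concentrates on "french"
```

With a uniform prior and a symmetric transition matrix the updated belief simply follows the observation likelihood; over further turns the transition model smooths the estimate.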

Page 6:

Dialogue State

[Figure: dynamic Bayesian network for the dialogue state, unrolled to the next time slice t+1. For each slot (e.g. type, food) there are four nodes: the observation o at time t, the user's goal g (modelling user behaviour), the user act u (whose noisy observation accounts for recognition/understanding errors), and the history h (providing memory).]

Tourist Information Domain:
•  type = bar, restaurant
•  food = French, Chinese, none

J. Williams (2007). "Partially Observable Markov Decision Processes for Spoken Dialog Systems." Computer Speech and Language 21(2).

NB: all distributions are conditioned on the previous system action.

Page 7:

Belief Monitoring (Tracking)

[Figure: belief histograms over two turns for the type slot (values B = bar, R = restaurant) and the food slot (values F = French, C = Chinese, - = none), each slot with observation o, goal g and user-act u nodes at t=1 and t=2.]

Turn 1 user input: inform(type=bar, food=french) {0.6}, inform(type=restaurant, food=french) {0.3}
System: confirm(type=restaurant, food=french)
Turn 2 user input: affirm() {0.9}

Page 8:

Choosing the next action – the Policy

[Figure: the full belief state (e.g. goal distributions g_type over B, R and g_food over F, C, -) is quantized into sparse binary feature vectors $\phi_a(b)$, one block per summary action (inform, select, confirm, etc.). Incoming evidence such as inform(type=bar) {0.4} updates the belief; the policy vector $\theta$ scores each summary action.]

The policy is a softmax over all possible summary actions:

$$\pi(a \mid b, \theta) = \frac{e^{\theta\cdot\phi_a(b)}}{\sum_{a'} e^{\theta\cdot\phi_{a'}(b)}}$$

A summary action is sampled from $\pi$, e.g. a = select, and then mapped back to a full system action such as select(type=bar, type=restaurant).
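A hedged sketch of the summary-space policy above. The quantization and feature layout here are invented stand-ins for the hand-crafted mapping the slide describes (the real system uses one feature block $\phi_a$ per action; a shared feature vector with a per-action weight row is an equivalent block-structured layout):

```python
import numpy as np

rng = np.random.default_rng(0)
summary_actions = ["inform", "select", "confirm"]

def summary_features(belief, n_bins=4):
    """Quantize the top goal probability into a one-hot bin --
    a crude stand-in for the hand-crafted summary-space mapping."""
    top = max(belief.values())
    phi = np.zeros(n_bins)
    phi[min(int(top * n_bins), n_bins - 1)] = 1.0
    return phi

def sample_action(belief, theta):
    """Softmax policy π(a|b,θ) ∝ exp(θ_a · φ(b)), sampled rather than maximised."""
    phi = summary_features(belief)
    logits = theta @ phi
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return summary_actions[rng.choice(len(summary_actions), p=probs)], probs

belief = {"bar": 0.55, "restaurant": 0.40, "none": 0.05}
theta = rng.normal(size=(3, 4))             # one weight row per summary action
action, probs = sample_action(belief, theta)
```

Sampling from the stochastic policy, rather than taking the argmax, is what allows the gradient-based policy optimisation described later in the talk.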

Page 9:

Let's Go 2010 Control Test Results

[Figure: success/fail outcomes plotted against word error rate (WER) for the baseline system, together with a fitted curve of predicted success rate against WER. Baseline system: average success = 64.8%, average WER = 42.4%.]

[Figure: predicted success rate against WER for all qualifying systems:
•  Cambridge: 89% success, 33% WER
•  Another system: 75% success, 34% WER
•  Baseline: 65% success, 42% WER]

B. Thomson et al. "Bayesian Update of State for the Let's Go Spoken Dialogue Challenge." IEEE SLT 2010.

Page 10:

Demo of Cambridge Restaurant Information

Page 11:

Issues with the 2010 System Design

•  Poor coverage of the N-best list of semantic hypotheses

•  Hand-crafting of the summary belief space

•  Slow policy optimisation and reliance on user simulation

•  Dependence on hand-crafted dialogue model parameters

•  Dependence on a static ontology/database

Page 12:

N-best Semantic Decoding

Conventional N-best decoding: the ASR produces N-best word strings; the semantic decoder is applied N times to give N-best dialogue acts; duplicates are then merged to give M-best dialogue acts, with M << N.

Confusion network decoding: the ASR produces a word confusion network; feature extraction yields n-gram features (plus optional context); the semantic decoder is applied once and directly outputs M-best dialogue acts, with M << N.
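The "merge duplicates" step of conventional N-best decoding can be sketched as follows (the acts and confidence scores are illustrative):

```python
from collections import defaultdict

def merge_duplicates(nbest_acts):
    """Collapse an N-best list of (dialogue_act, confidence) pairs:
    identical acts decoded from different word strings have their
    confidence mass summed, giving an M-best list with M << N."""
    merged = defaultdict(float)
    for act, p in nbest_acts:
        merged[act] += p
    return sorted(merged.items(), key=lambda kv: -kv[1])

nbest = [("inform(food=italian)", 0.4),    # from "I'd like italian"
         ("inform(food=italian)", 0.2),    # from "I want an Italian"
         ("inform(food=indian)", 0.2),     # from "I'd like indian"
         ("inform(area=east)", 0.1)]       # from "in the east"

mbest = merge_duplicates(nbest)            # top act: inform(food=italian), mass 0.6
```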

Page 13:

Confusion Network Decoder (Mairesse/Henderson)

[Figure: an input word confusion network, e.g. slot 1: I / I'd / Is / In; slot 2: want / were / water / well; slot 3: italian / indian / in / --. The input features are expected n-gram counts:]

$$x_i = E\left\{C(\text{n-gram}_i)\right\}, \quad n = 1, 2, 3$$

A multi-way SVM predicts the act type (e.g. inform), and a bank of binary SVMs scores each slot-value pair, e.g. SVM_food=italian → 0.91, SVM_food=indian → 0.42, …, SVM_area=east → 0.01. Context is supplied as a binary vector with $b_j = 1$ if item$_j$ appeared in the last system act. The output is an M-best list of semantic trees, e.g. inform(food=italian, venue=restaurant, area=centre).
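The expected n-gram count features $x_i = E\{C(\text{n-gram}_i)\}$ can be sketched over a toy confusion network. This assumes independence between slots, a common simplification; in a real system the posteriors come from the ASR lattice:

```python
from itertools import product

def expected_ngram_counts(confnet, n):
    """Expected n-gram counts over a word confusion network, assuming
    independent slots: E[C(g)] = Σ_positions Π_slots p(word)."""
    counts = {}
    for start in range(len(confnet) - n + 1):
        for combo in product(*confnet[start:start + n]):
            gram = tuple(w for w, _ in combo)
            p = 1.0
            for _, wp in combo:
                p *= wp
            counts[gram] = counts.get(gram, 0.0) + p
    return counts

# Toy confusion network: one (word, posterior) list per slot
confnet = [[("I", 0.6), ("I'd", 0.4)],
           [("want", 0.7), ("were", 0.3)],
           [("italian", 0.5), ("indian", 0.5)]]

bigrams = expected_ngram_counts(confnet, 2)   # e.g. E[C(("I", "want"))] = 0.42
```

Summing expected counts over all alignments is what lets a single decoder pass exploit the whole hypothesis space instead of N separate word strings.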

Page 14:

Confusion Network Decoder Evaluation

[Figure: predicted F-score against WER and predicted ICE against WER, comparing the confusion network decoder with context (C.Net. + Context) against the Phoenix decoder. Comparison of item retrieval on a corpus of 4.8k utterances: N-best hand-crafted Phoenix decoder vs confusion network decoder (trained on 10k utterances).]

Live Dialogue System:

                  N-best Phoenix   Confusion Net
F-score           0.80             0.82
ICE               2.02             1.26
Average Reward    10.6             11.15

"Discriminative Spoken Language Understanding using Word Confusion Networks", Henderson et al., IEEE SLT 2012.

Page 15:

Policy Optimization

Policy parameters are chosen to maximize the expected reward

$$J(\theta) = E\left[\frac{1}{T}\sum_t r(s_t, a_t) \,\middle|\, \pi_\theta\right]$$

Natural gradient ascent works well:

$$\tilde{\nabla} J(\theta) = F_\theta^{-1}\, \nabla J(\theta)$$

where $F_\theta$ is the Fisher Information Matrix. The gradient is estimated by sampling dialogues, and in practice the Fisher Information Matrix does not need to be explicitly computed. This is the Natural Actor-Critic algorithm.

J. Peters and S. Schaal (2008). "Natural Actor-Critic." Neurocomputing 71(7-9).

However, it is A) slow (~100k dialogues) and B) requires summary-space approximations.

Page 16:

Q-functions and the SARSA algorithm

Traditional reinforcement learning is commonly based on finding the optimal Q-function:

$$Q^*(b,a) = \max_\pi E_\pi\left[\sum_{\tau=t+1}^{T} r(b_\tau, a_\tau)\right]$$

The optimal deterministic policy is then simply

$$\pi^*(b) = \arg\max_a Q^*(b,a)$$

$Q^*$ can be found sequentially using the SARSA algorithm:

b = b0; choose action a ε-greedily from π(b)
For each dialogue turn
    Take action a, observe reward r and next state b'
    Choose action a' ε-greedily from π(b')
    Q(b,a) ← Q(b,a) + λ[r + Q(b',a') − Q(b,a)]
    b = b'; a = a'
end

Eventually, Q → Q*.
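The SARSA loop above, made runnable on a toy episodic task standing in for a dialogue. The two-step "dialogue", its per-turn penalty of -0.2 and success reward of +1.0 are all invented for illustration; λ plays the role of the learning rate and, as on the slide, returns are undiscounted:

```python
import random

random.seed(0)

def step(s, a):
    """Toy dialogue: states 0 -> 1 -> 2 (success). Action 1 advances,
    action 0 stalls; every turn costs 0.2, reaching state 2 earns 1.0."""
    s_next = s + 1 if a == 1 else s
    r = -0.2 + (1.0 if s_next == 2 else 0.0)
    return s_next, r

def egreedy(Q, s, eps):
    if random.random() < eps:
        return random.randrange(2)
    return max((0, 1), key=lambda a: Q[(s, a)])

def sarsa(episodes=2000, lam=0.1, eps=0.1):
    Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
    for _ in range(episodes):
        s = 0
        a = egreedy(Q, s, eps)
        for _ in range(5):                        # bounded dialogue length
            s2, r = step(s, a)
            if s2 == 2:                           # terminal: Q(b',a') = 0
                Q[(s, a)] += lam * (r - Q[(s, a)])
                break
            a2 = egreedy(Q, s2, eps)
            Q[(s, a)] += lam * (r + Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
    return Q

Q = sarsa()
```

After training, the greedy policy argmax_a Q(s,a) chooses the advancing action in both non-terminal states, matching π*(b) = argmax_a Q*(b,a).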

Page 17:

Gaussian Process based Learning – Milica Gasic

For POMDPs, the belief space is continuous and direct representations of Q are intractable. However, Q can be approximated as a zero mean Gaussian process by designing a kernel to represent the correlations between points in belief x action space. Thus:

$$Q(b,a) \sim \mathcal{GP}\left(0,\; k\big((b,a),(b,a)\big)\right)$$

Given a sequence of state-action pairs $\mathbf{B}_t = \left[(b_0,a_0), (b_1,a_1), \ldots, (b_t,a_t)\right]^{\top}$ and rewards $\mathbf{r}_t = \left[r_0, r_1, \ldots, r_t\right]^{\top}$, there is a closed-form solution for the posterior:

$$Q(b,a) \mid \mathbf{B}_t, \mathbf{r}_t \sim \mathcal{N}\left(\overline{Q}(b,a),\; \mathrm{cov}\big((b,a),(b,a)\big)\right)$$

This suggests a SARSA-like sequential optimisation:

b = b0; choose action a ε-greedily from Q(b,a)
For each dialogue turn
    Take action a, observe reward r and next state b'
    Choose action a' ε-greedily from Q(b',a')
    Update the posterior covariance estimate
    b = b'; a = a'
end
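The closed-form posterior can be illustrated with plain Gaussian process regression over (belief, action) feature vectors. Note this sketch regresses observed returns directly under an assumed squared-exponential kernel; the full GP-SARSA machinery additionally handles the temporal-difference structure of the returns:

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    """Squared-exponential kernel k((b,a),(b',a')) over feature vectors."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xstar, noise=0.1):
    """Closed-form Gaussian process posterior mean and variance."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xstar)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xstar, Xstar) - v.T @ v)
    return mean, var

X = np.array([[0.0, 0.0], [1.0, 0.0]])   # visited (belief, action) points
y = np.array([1.0, 0.0])                 # observed returns
Xstar = np.array([[0.0, 0.0], [5.0, 5.0]])
mean, var = gp_posterior(X, y, Xstar)
# near the data the mean tracks the observed return and the variance is small;
# far from the data the mean reverts to 0 and the variance to the prior
```

The kernel is what lets a single observation update the estimate over the whole belief-action space, which is the source of the data efficiency claimed on the next slide.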

Page 18:

Benefits of GP-SARSA

•  sequential estimation of the distribution of Q (not just Q itself)

•  each new data point can impact the whole distribution via the covariance function → very efficient use of training data

•  much faster learning than gradient methods such as Natural Actor-Critic (NAC)

[Figure: learning curves for GP vs NAC on the TownInfo System.]

ε-greedy exploration:

$$a = \begin{cases} \arg\max_a Q(b,a) & \text{with prob } 1-\varepsilon \\ \text{random action} & \text{with prob } \varepsilon \end{cases}$$

Page 19:

Benefits of GP-SARSA

•  the variance of Q is known at each stage → more intelligent exploration:

Variance exploration:

$$a = \begin{cases} \arg\max_a Q(b,a) & \text{with prob } 1-\varepsilon \\ \arg\max_a \mathrm{cov}\big((b,a),(b,a)\big) & \text{with prob } \varepsilon \end{cases}$$

Stochastic policy:

$$Q_i(b,a_i) \sim \mathcal{N}\left(\overline{Q}(b,a_i),\; \mathrm{cov}\big((b,a_i),(b,a_i)\big)\right), \qquad a = \arg\max_{a_i} Q_i(b,a_i)$$

[Figure: average reward against number of training dialogues (500-3000) on the TownInfo System, for variance exploration and the stochastic policy. Both are well trained within 3k dialogues.]

And the summary space mapping is no longer needed.

"On-line policy optimisation of SDS via live interaction with human subjects", Gasic et al., ASRU 2011.
"Gaussian processes for policy optimisation of large scale POMDP-based SDS", Gasic et al., SLT 2012.
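The stochastic policy above is Thompson-style sampling: draw a Q value for each action from its posterior and act greedily on the draws. A sketch with invented posterior moments (in the real system these come from the GP):

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_policy(q_mean, q_var):
    """Sample Q_i ~ N(Q̄(b,a_i), cov((b,a_i),(b,a_i))) and take the argmax.
    Actions whose Q estimate is still uncertain get explored more often."""
    return int(np.argmax(rng.normal(q_mean, np.sqrt(q_var))))

q_mean = np.array([1.0, 0.9, 0.2])    # posterior means for three actions
q_var = np.array([0.01, 0.5, 0.01])   # action 1 has barely been tried

picks = [stochastic_policy(q_mean, q_var) for _ in range(10000)]
# action 0 is chosen most, the uncertain action 1 is tried often,
# and the confidently-bad action 2 almost never
```

Unlike ε-greedy, the exploration rate falls automatically as the posterior variance shrinks, which is why no separate exploration schedule is needed.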

Page 20:

Parameter Estimation – Blaise Thomson

[Figure: one time slice of the dialogue-state network, with nodes for the observation o_t, goal g_t, user act u_t and history h_t, the previous system action a_{t-1}, and the next-slice goal g_{t+1}. Its conditional distributions are:

•  transition model p(g_{t+1} | g_t, a_t) — user behaviour
•  user act model p(u_t | g_t, a_{t-1})
•  observation model p(o_t | u_t) — recognition errors
•  history model p(h_t | u_t, h_{t-1}, a_{t-1}) — memory]

These parameters are typically hand-crafted. How can we learn them from data?

Page 21:

Factor Graphs and Expectation Propagation

[Figure: the dialogue-state network redrawn as a factor graph, with the system act a_{t-1} and the model parameters {θ_i} included as nodes; for example, factor f_1(S_1) where S_1 = (a_{t-1}, g_t, u_t).]

The posterior factorises as

$$b(s_t \mid o_t, s_{t-1}, a_{t-1}) \propto \prod_{k=1}^{K} f_k(S_k)$$

•  exact computation is intractable
•  it can be approximated using belief propagation
•  we use Expectation Propagation (EP)
•  using EP, factors can be discrete & continuous
•  hence, parameters can be added and updated simultaneously
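The factorised posterior $b \propto \prod_k f_k(S_k)$ can be illustrated on a tiny discrete factor graph where exact enumeration is still feasible (EP is only needed once this product becomes intractable). The factor tables below are invented for the example:

```python
from itertools import product

# Two binary variables: goal g and user act u, with three factors.
f_goal = {0: 0.5, 1: 0.5}                        # f1(g): prior over the goal
f_act = {(0, 0): 0.8, (0, 1): 0.2,               # f2(g, u): user act model
         (1, 0): 0.3, (1, 1): 0.7}
f_obs = {0: 0.1, 1: 0.9}                         # f3(u): observation likelihood

def posterior():
    """b(g, u) ∝ f1(g) · f2(g, u) · f3(u), normalised by enumeration."""
    unnorm = {(g, u): f_goal[g] * f_act[(g, u)] * f_obs[u]
              for g, u in product((0, 1), (0, 1))}
    Z = sum(unnorm.values())
    return {gu: p / Z for gu, p in unnorm.items()}

b = posterior()
goal_marginal = b[(1, 0)] + b[(1, 1)]   # belief that the goal is g = 1
```

Enumeration costs grow exponentially in the number of variables, which is exactly why the real system replaces it with message passing (EP) over the factor graph.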

Page 22:

Effect of Parameter Learning on TownInfo System

Page 23:

Structure Learning

The ability to learn parameters can be extended to learn structure. Let G = {g_k} be a set of Bayesian Networks (or Factor Graphs). Given corpus data D:

1. Compute the evidence p(D | g_k) for all g_k in G
2. Prune G
3. Augment G by perturbing each g_k
4. Stop? If not, go to 1; otherwise select the g_k with the highest evidence

The evidence can be computed efficiently using expectation propagation.

Potential to learn:
a)  additional values for an existing variable
b)  new hidden variables
c)  new links between variables
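The prune/augment evidence loop can be sketched as a greedy search. Everything below is a hypothetical toy, not the actual algorithm's configuration: the edge set, pool sizes, and the synthetic evidence score standing in for the EP-computed p(D|g_k):

```python
import random

random.seed(0)

TRUE_EDGES = frozenset({("g", "u"), ("u", "o")})   # assumed best structure
ALL_EDGES = [("g", "u"), ("u", "o"), ("g", "o"), ("g", "h")]

def evidence(graph):
    """Synthetic stand-in for p(D|g_k): rewards edges the data supports."""
    return sum(1 if e in TRUE_EDGES else -1 for e in graph)

def perturb(graph):
    """Augment step: toggle one random edge of the candidate network."""
    return graph ^ frozenset({random.choice(ALL_EDGES)})

def structure_search(iters=50, keep=4):
    G = [frozenset() for _ in range(2 * keep)]      # start from empty graphs
    for _ in range(iters):
        G.sort(key=evidence, reverse=True)
        G = G[:keep]                                # prune low-evidence networks
        G += [perturb(g) for g in G]                # augment by perturbation
    return max(G, key=evidence)                     # select highest evidence

best = structure_search()
```

Because the highest-evidence candidates are always retained before perturbation, the search is monotone in the best score found so far and, on this tiny space, recovers TRUE_EDGES.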

Page 24:

Conclusions

•  Statistical Dialogue Systems based on POMDPs are viable, offer increased robustness to noise and require no hand-crafting

•  Good progress is being made on increasing accuracy and speeding up learning

•  Learning directly on human users rather than depending on user simulators is now possible

•  Current systems are built from static ontologies for closed domains

•  Next steps will include building more flexible systems capable of dynamically adapting to new information content.

Acknowledgement - this work is the result of a team effort: Catherine Breslin, Milica Gasic, Matt Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis and former members of the CUED Dialogue Systems Group.