
Page 1: How much data is enough? – Generating reliable policies w/MDP's

Joel Tetreault, University of Pittsburgh LRDC, July 14, 2006

Page 2: Problem

Problems with designing spoken dialogue systems:
- How to handle noisy data or miscommunications?
- Hand-tailoring policies for complex dialogues?
- What features to use?

Previous work used machine learning to improve the dialogue manager of spoken dialogue systems [Singh et al., '02; Walker, '00; Henderson et al., '05]

However, there has been very little empirical work [Paek et al., '05; Frampton, '05] comparing the utility of adding specialized features to construct a better dialogue state

Page 3: Goal

How does one choose which features best contribute to a better model of dialogue state?

Goal: show the comparative utility of adding four different features to a dialogue state

Four features: concept repetition, frustration, student performance, student moves

All are important to tutoring systems, but also to dialogue systems in general

Page 4: Previous Work

In complex domains, annotation and testing are time-consuming, so it is important to choose the best features beforehand

Developed a methodology for using Reinforcement Learning to determine whether adding complex features to a dialogue state will beneficially alter policies [Tetreault & Litman, EACL '06]

Extensions:
- A methodology to determine which features are the best
- Also show that our results generalize over different action choices (feedback vs. questions)

Page 5: Outline

- Markov Decision Processes (MDP)
- MDP Instantiation
- Experimental Method
- Results
  - Policies
  - Feature Comparison

Page 6: Markov Decision Processes

What is the best action an agent should take at any state to maximize reward at the end?

MDP Input:
- States
- Actions
- Reward Function

Page 7: MDP Output

Policy: the optimal action for the system to take in each state

Calculated using policy iteration, which depends on:
- propagating the final reward back to each state
- the probabilities of getting from one state to the next given a certain action

Additional output: V-value, the worth of each state
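To make this computation concrete, here is a minimal policy-iteration sketch over a hypothetical two-state, two-action MDP; all state names, transition probabilities, and rewards are illustrative, not numbers from the talk.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number is illustrative.
states, actions = ["C", "I"], ["SAQ", "CAQ"]
# P[s][a] = list of (next_state, probability) pairs
P = {0: {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.5), (1, 0.5)]},
     1: {0: [(0, 0.4), (1, 0.6)], 1: [(0, 0.6), (1, 0.4)]}}
R = np.array([10.0, -10.0])  # reward attached to each state
gamma = 0.9                  # discount factor

policy = np.zeros(2, dtype=int)
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R, which propagates
    # reward back through the transition probabilities.
    P_pi = np.zeros((2, 2))
    for s in range(2):
        for s2, p in P[s][policy[s]]:
            P_pi[s, s2] = p
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # Policy improvement: in each state, pick the action with the best
    # one-step lookahead value.
    Q = np.array([[R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                   for a in range(2)] for s in range(2)])
    new_policy = Q.argmax(axis=1)
    if (new_policy == policy).all():
        break
    policy = new_policy

print({states[s]: actions[policy[s]] for s in range(2)})  # the policy
print(V)  # V-values: the worth of each state
```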

Page 8: MDP's in Spoken Dialogue

(Diagram: training data feeds the MDP, which outputs a policy for the dialogue system; the dialogue system then interacts with a user simulator or a human user. The MDP works offline; the interactions work online.)

Page 9: ITSPOKE Corpus

100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al., '04]

- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 60 turns per dialogue on average
- Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned

Page 10: Corpus Annotations

Manual annotations:
- Tutor Moves (similar to Dialog Acts) [Forbes-Riley et al., '05]
- Student Frustration and Certainty [Litman et al., '04] [Liscombe et al., '05]

Automated annotations:
- Correctness (based on the student's response to the last question)
- Concept Repetition (whether a concept is repeated)
- %Correctness (past performance)

Page 11: MDP State Features

Feature             Values
Correctness         Correct (C), Incorrect (I)
Certainty           Certain (cer), Neutral (neu), Uncertain (unc)
Concept Repetition  New Concept (0), Repeated (R)
Frustration         Frustrated (F), Neutral (N)
% Correctness       50-100% (H)igh, 0-49% (L)ow

Page 12: MDP Action Choices

- SAQ (Short Answer Question): "What is the direction of that force relative to your fist?"
- CAQ (Complex Answer Question): "What is the definition of Newton's Second Law?"
- Mix: "If it doesn't hit the center of the pool what do you know about the magnitude of its displacement from the center of the pool when it lands? Can it be zero? Can it be nonzero?"
- NoQ: "So you can compare it to my response…"

Page 13: MDP Reward Function

Reward Function: use normalized learning gain to do a median split on the corpus:

NLG = (posttest − pretest) / (1 − pretest)

10 students are "high learners" and the other 10 are "low learners"

High-learner dialogues had a final state with a reward of +100; low-learner dialogues had one of −100
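As a worked example of this reward assignment, here is a small sketch with hypothetical pretest/posttest scores on a 0-1 scale; the NLG formula and the ±100 median-split reward follow the slide.

```python
# Hypothetical pretest/posttest scores; real scores come from the corpus.
scores = {"s1": (0.4, 0.7), "s2": (0.5, 0.6), "s3": (0.2, 0.9), "s4": (0.6, 0.7)}

# Normalized learning gain: NLG = (posttest - pretest) / (1 - pretest)
nlg = {s: (post - pre) / (1 - pre) for s, (pre, post) in scores.items()}

median = sorted(nlg.values())[len(nlg) // 2]  # median split of the corpus
reward = {s: 100 if g >= median else -100 for s, g in nlg.items()}

print(nlg)
print(reward)  # final-state reward for each student's dialogues
```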

Page 14: Methodology

Construct MDP's to test the inclusion of new state features to a baseline:
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy, given 3 metrics:
  - # of Policy Differences (Diff's)
  - % Policy Change (%PC)
  - Expected Cumulative Reward (ECR)

For each MDP: verify that policies are reliable (V-value convergence)

Page 15: Hypothetical Policy Change Example

#  B1 State  B1 Policy  +Certainty State  +Cert 1 Policy  +Cert 2 Policy
1  [C]       CAQ        [C,Cer]           CAQ             Mix
                        [C,Neu]           CAQ             CAQ
                        [C,Unc]           CAQ             Mix
2  [I]       SAQ        [I,Cer]           SAQ             Mix
                        [I,Neu]           SAQ             CAQ
                        [I,Unc]           SAQ             Mix

+Cert 1: 0 Diffs from B1; +Cert 2: 5 Diffs

Page 16: Tests

(Roadmap diagram: Baseline 1 = {Correctness}; Baseline 2 = B1 + Certainty; then +Concept, +Frustration, and +%Correct are each added to Baseline 2.)

Page 17: Baseline

Actions: {SAQ, CAQ, Mix, NoQ}
Baseline State: {Correctness}

(Baseline network diagram: states [C] and [I], connected to each other and to FINAL by the actions SAQ|CAQ|Mix|NoQ.)

Page 18: Baseline 1 Policies

Trend: if you only have student correctness as a model of student state, give a hint or some other non-question act when the student is correct; otherwise give a Mix of complex and short answer questions

#  State  State Size  Policy
1  [C]    1308        NoQ
2  [I]    872         Mix

Page 19: But are our policies reliable?

The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work

Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus

Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data, and rerun the MDP on each subset), as sketched below
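A sketch of that subset loop. The random stand-in corpus and the single-action transition model are assumptions for illustration; real transitions would come from the annotated ITSPOKE dialogues.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical stand-in corpus: 20 students x 5 dialogues over states {0, 1}
# with one action; each dialogue is a list of (state, next_state) pairs.
corpus = [[[(int(rng.integers(2)), int(rng.integers(2))) for _ in range(10)]
           for _ in range(5)] for _ in range(20)]

def v_values(transitions, R=np.array([10.0, -10.0]), gamma=0.9):
    """Estimate the transition matrix from counts, then solve for V."""
    counts = np.ones((2, 2))  # add-one smoothing so sparse rows stay valid
    for s, s2 in transitions:
        counts[s, s2] += 1
    P = counts / counts.sum(axis=1, keepdims=True)
    return np.linalg.solve(np.eye(2) - gamma * P, R)

subset, v_history = [], []
for student in corpus:                   # add one student (5 dialogues) at a time
    for dialogue in student:
        subset.extend(dialogue)
    v_history.append(v_values(subset))   # rerun the MDP on the growing subset

# Convergence check: successive V-value changes should shrink toward zero
# as students are added (the y-axis movement in the convergence plots).
print(np.diff(np.array(v_history), axis=0).round(3))
```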

Page 20: Baseline Convergence Plot

Page 21: Methodology: Adding More Features

- Create a more complicated baseline by adding the certainty feature (new baseline = B2)
- Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
- Check V-value and policy convergence
- Analyze policy changes
- Use the feature comparison metrics to determine the relative utility of the four features

Page 22: Tests

(Same roadmap diagram as Page 16: Baseline 1 = {Correctness}; Baseline 2 = B1 + Certainty; B2 + Concept, B2 + Frustration, B2 + %Correct.)

Page 23: Certainty

Previous work [Bhatt et al., '04] has shown the importance of certainty in ITS

A student who is certain and correct may require a harder question, since he or she is doing well; but a student who is correct yet showing some doubt may be becoming confused, so give an easier question

Page 24: B2: Baseline + Certainty Policies

#  B1 State  B1 Policy  +Certainty State  +Certainty Policy
1  [C]       NoQ        [C,Cer]           Mix
                        [C,Neu]           SAQ
                        [C,Unc]           Mix
2  [I]       Mix        [I,Cer]           Mix
                        [I,Neu]           NoQ
                        [I,Unc]           Mix

Trend: if neutral, give SAQ or NoQ, else give Mix

Page 25: Baseline 2 Convergence Plots

Page 26: Baseline 2 Diff Plots

Diff: for each subset corpus, compare its policy with the policy generated from the full corpus

Page 27: Tests

(Same roadmap diagram as Page 16.)

Page 28: Feature Comparison (3 metrics)

# Diff's:
- Number of new states whose policies differ from the original
- Insensitive to how frequently a state occurs

% Policy Change (%P.C.):
- Takes into account the frequency of each state-action sequence

%PC = (# of occurrences of states whose policy differs) / (total # of state occurrences)

Page 29: Feature Comparison

Expected Cumulative Reward (E.C.R.):
- One issue with %P.C. is that frequently occurring states have low V-values and thus may bias the score
- Instead, use the expected value of being at the start of the dialogue to compare features
- ECR = average V-value of all start states
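A small sketch computing all three metrics under these definitions. The policies mirror the hypothetical example on Page 15; the occurrence counts and start-state V-values are assumptions for illustration.

```python
# Hypothetical baseline and +feature policies over the expanded state space.
baseline = {("C", "Cer"): "CAQ", ("C", "Neu"): "CAQ", ("C", "Unc"): "CAQ",
            ("I", "Cer"): "SAQ", ("I", "Neu"): "SAQ", ("I", "Unc"): "SAQ"}
new = {("C", "Cer"): "Mix", ("C", "Neu"): "CAQ", ("C", "Unc"): "Mix",
       ("I", "Cer"): "Mix", ("I", "Neu"): "CAQ", ("I", "Unc"): "Mix"}
occurrences = {("C", "Cer"): 300, ("C", "Neu"): 500, ("C", "Unc"): 200,
               ("I", "Cer"): 100, ("I", "Neu"): 400, ("I", "Unc"): 300}

# #Diff's: states whose policy changed (frequency-insensitive).
diff_states = [s for s in new if new[s] != baseline[s]]
n_diffs = len(diff_states)

# %PC: occurrences of diff states over total state occurrences.
pc = 100 * sum(occurrences[s] for s in diff_states) / sum(occurrences.values())

# ECR: average V-value of the start states (hypothetical V-values here;
# in practice they come from solving the +feature MDP).
start_v = {("C", "Cer"): 45.0, ("I", "Neu"): 30.0}
ecr = sum(start_v.values()) / len(start_v)

print(n_diffs, round(pc, 1), ecr)  # e.g. 5 diffs, 72.2 %PC
```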

Page 30: Feature Comparison Results

State Feature        #Diff's  %P.C.  E.C.R.
Student Move         10       82.2%  43.21
Concept Repetition   10       80.2%  39.52
Frustration          8        66.4%  31.30
Percent Correctness  4        44.3%  28.47

The trend of Student Move > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics

Baseline: also tested the effects of a binary random feature
- With enough data, a random feature should not alter policies
- Average diff of 5.1

Page 31: How reliable are policies?

(Diff plots for the Frustration and Concept features.)

Possibly the data size is small, and with increased data we may see more fluctuations

Page 32: Confidence Bounds

Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value

As data increases, the confidence bound of the V-value should shrink, reflecting a better model of the world

Additionally, the policies should converge as well

Page 33: Confidence Bounds

CB's can also be used to distinguish how much better an additional state feature is over a baseline state space

That is, the new feature is reliably better if the lower bound of the new state space's ECR is greater than the upper bound of the baseline state space's ECR
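As a minimal check, reusing the bounds reported later in this talk (Pages 43 and 44):

```python
# Bounds from Pages 43-44: Baseline 1 upper bound vs. Baseline 2 lower bound.
b1_upper, b2_lower = 23.65, 39.62
if b2_lower > b1_upper:
    print("Baseline 2 is reliably better than Baseline 1")
```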

Page 34: Crossover Example

(Plot: ECR vs. amount of data, with confidence bounds for the baseline and for a more complicated model; as data grows, the more complicated model's lower bound crosses above the baseline's upper bound.)

Page 35: Confidence Bounds: App #2

Automatic model switching: if you know that a model at its worst (i.e., its lower bound) is better than another model's upper bound, then you can automatically switch to the more complicated model

Good for online RL applications

Page 36: Confidence Bound Methodology

For each data slice, calculate upper and lower bounds on the V-value:
- Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution (we do this because our data only approximates what data is like in the real world, but may be close)
- This yields 1000 new transition matrices that are all very similar to the original
- Run the MDP on all 1000 transition matrices to get a range of ECR's
- Rows without a lot of data are very volatile, so expect a large range of ECR's; but as data increases, the transition matrices should stabilize, so that most of the new matrices produce policies and values similar to the original
- Take the upper and lower bounds at the 2.5th and 97.5th percentiles
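A sketch of this bootstrap, with toy transition counts standing in for a real data slice; numpy's Dirichlet sampler plays the role of the sampling step, and the single-action, two-state setup is an assumption for brevity.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy transition counts for a 2-state, single-action slice (illustrative).
counts = np.array([[40.0, 10.0],
                   [ 5.0, 15.0]])
R, gamma = np.array([10.0, -10.0]), 0.9

def ecr(P):
    """Solve (I - gamma * P) V = R and report the start state's V-value."""
    V = np.linalg.solve(np.eye(2) - gamma * P, R)
    return V[0]

# Sample 1000 transition matrices: each row is drawn from a Dirichlet
# distribution parameterized by that row's observed counts, giving
# matrices that are all very similar to the original.
ecrs = [ecr(np.vstack([rng.dirichlet(row) for row in counts]))
        for _ in range(1000)]

# Rows with little data make the sampled matrices volatile, widening the
# ECR range; with more data the range tightens. Cut the bounds at the
# 2.5th and 97.5th percentiles.
lower, upper = np.percentile(ecrs, [2.5, 97.5])
print(round(lower, 2), round(upper, 2))
```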

Page 37: Experiment

The original action/state setup did not show anything promising:
- State/action space too large for the data?
- Not the best MDP instantiation?

Looked at a variety of MDP configurations:
- Refined reward metric
- Added discourse segmentation

Page 38: +essay Instantiation with '03+'05 data

Page 39: +essay Baseline1

Page 40: +essay Baseline2

Page 41: +essay B2+SMove

Page 42: Feature Comparison Results

State Feature        #Diff's  %P.C.   E.C.R.
Student Move         5        43.4%   49.17
Concept Repetition   3        25.5%   42.56
Frustration          1        0.03%   32.99
Percent Correctness  3        11.19%  28.50

Reduced state size: Certainty = {Cert+Neutral, Uncert}
Trend: SMove and Concept Repetition are the best features
B2 ECR = 31.92

Page 43: Baseline 1

Upper = 23.65, Lower = 0.24

Page 44: Baseline 2

Upper = 57.16, Lower = 39.62

Page 45: B2 + Concept Repetition

Upper = 64.30, Lower = 49.16

Page 46: B2 + Percent Correctness

Upper = 48.42, Lower = 32.86

Page 47: B2 + Student Move

Upper = 61.36, Lower = 39.94

Page 48: Discussion

Baseline 2 has the crossover effect and policy stability

More complex features (B2 + X) have the crossover effect, but we are not sure their policies are stable (some stabilize at 17 students)

This indicates that 100 dialogues may not be enough even for this simple MDP (though it may be enough to feel confident about Baseline 2)