1
Shaping in Speech Graffiti: results from the initial user study
Stefanie Tomko
Dialogs on Dialogs meeting
10 February 2006
2
Big picture (i.e. thesis statement)
A system of shaping and adaptivity can be used to induce more efficient user interactions with spoken dialog systems.
This strategy can increase efficiency by increasing the amount of user input that is actually understood by the system, leading to increased task completion rates and higher user satisfaction.
This strategy can also reduce upfront training time, thus accelerating the process of reaching optimally efficient interaction.
3
This study
[Flowchart. Decision nodes (each with yes/no branches): "Speech Graffiti? (target)", "shapeable? (expanded)". Other nodes: User input, result, shaping prompt, {confsig}]
4
My approach, graphically
[Flowchart. Decision nodes (each with yes/no branches): "Speech Graffiti? (target)", "shapeable? (expanded)". Other nodes: User input, result, shaping prompt, intelligent shaping help]
5
Speech Graffiti
Standardized framework of syntax, keywords, and principles
Domain-specific vocabulary

Sample interaction:
User: Theater is Showcase North Theater
System: Showcase Cinemas Pittsburgh North
User: Genre is drama
System: Drama
User: What movies are playing?
System: {confsig} [an error beep, since previous utterance is not in grammar]
User: WHERE WAS I?
System: Theater is Showcase Cinemas Pittsburgh North, genre is drama
User: OPTIONS
System: You can specify or ask about title, show time, rating, {ellsig} [a 3-beep list continuation signal]
User: What is title?
System: 2 matches: Dark Water, War of the Worlds
User: START OVER
System: Starting over
User: Theater is Northway Mall Cinemas Eight
System: Northway Mall Cinemas 8
User: What is address?
System: 1 match: 8000 McKnight Road in Pittsburgh
6
Expanded grammar
Exploit the fact that knowledge of speaking to a limited-language system restricts input
Create a grammar that will accept more natural language input cf. SG
This grammar is opaque for users
Why have two grammars?
  Lower perplexity LMs lower error rates
  Some applications may be SG-only
Restriction: linear mapping from EXP input to TGT equivalent
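One way to read the "linear mapping" restriction is that each expanded-grammar (EXP) phrase rewrites directly into a target-grammar (TGT) phrase, with no reordering or inference. The sketch below illustrates that reading only. The rule patterns, the slot names they assign, and the helper names `normalize` and `exp_to_tgt` are all hypothetical and are not taken from the actual Speech Graffiti grammars.

```python
import re

# Hypothetical rewrite rules: (expanded-grammar pattern, target-grammar template).
# Neither the patterns nor the slot assignments come from the real system.
EXP_TO_TGT_RULES = [
    (re.compile(r"^what movies are playing$"), "what is title"),
    (re.compile(r"^(?:i want|i'd like) (?:a|an) (?P<value>\w+)$"), "genre is {value}"),
    (re.compile(r"^(?:at|in) (?:the )?(?P<value>.+)$"), "theater is {value}"),
]


def normalize(utterance):
    """Lowercase and drop trailing punctuation so the rules can match."""
    return utterance.lower().strip().rstrip("?.!")


def exp_to_tgt(utterance):
    """Return the target-grammar equivalent of an expanded-grammar
    utterance, or None when no rule applies."""
    text = normalize(utterance)
    for pattern, template in EXP_TO_TGT_RULES:
        match = pattern.match(text)
        if match:
            # Each rule fills at most one captured value into its template.
            return template.format(**match.groupdict())
    return None


print(exp_to_tgt("What movies are playing?"))
print(exp_to_tgt("I want a drama"))
```

Because every rule is a direct pattern-to-template rewrite, the TGT form that a shaping prompt would need is available as soon as an EXP utterance is recognized.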
7
Shaping strategy
Handle user input accepted by expanded grammar but not target
Balance current task success with future interaction efficiency
Baseline strategy (this study):
  Confirm expanded grammar input with full, explicit slot+value confirmation
  Give result if appropriate for query
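The baseline strategy can be sketched as a three-way dispatch on which grammar accepts the input. This is a minimal illustration under stated assumptions: the two regular expressions are toy stand-ins for the target and expanded grammars, the value-only echo for target input is inferred from the sample Speech Graffiti dialog, and query handling ("give result if appropriate") is left out. The function name `respond` is hypothetical.

```python
import re

# Toy stand-ins for the two grammars; the real grammars are not shown in the slides.
TGT_SPEC = re.compile(r"^(?P<slot>theater|genre|title) is (?P<value>.+)$")
EXP_SPEC = re.compile(r"^i want to see a (?P<value>\w+)$")  # hypothetical NL form


def respond(utterance):
    """Return (kind, prompt) for one user turn under the baseline strategy."""
    text = utterance.lower().strip()

    match = TGT_SPEC.match(text)
    if match:
        # Accepted by the target grammar: echo the value only.
        return "target", match.group("value")

    match = EXP_SPEC.match(text)
    if match:
        # Accepted only by the expanded grammar: confirm with the full
        # slot+value form so the user hears the target phrasing.
        return "shaping", "genre is " + match.group("value")

    # Outside both grammars: error signal.
    return "error", "{confsig}"


for turn in ["Genre is drama", "I want to see a comedy", "What movies are playing?"]:
    print(turn, "->", respond(turn))
```

The shaping branch is the only place where the prompt differs from plain Speech Graffiti behavior, which is what makes its effect (or lack of one) measurable against the non-shaping condition.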
8
Study participants
“Normal” adults, i.e. not CMU students
15 males, 14 females, aged 23-54
Native speakers of American English
Little/no computer programming experience
New to Speech Graffiti
9
Study design
Between-subjects
3 conditions
  non-shaping+tutorial (BT)
  shaping+tutorial (ST)
  shaping+no_tutorial (SN)
Tutorial
  9-slide .ppt presentation
  5 minutes
10
Study tasks
15 tasks
4 difficulty levels
  # of slots to be specified/queried
40 minutes or when all tasks completed
  Only one user did not get to attempt all 15 tasks in 40 minutes
Afterwards: SASSI questionnaire
11
Results
In short, the baseline shaping strategy didn’t have an effect
Efficiency
[Figure: three bar charts comparing non-shaping and shaping groups: turns to completion; time to completion, in seconds; completed tasks]
Mean results from shaping subjects are only slightly better – non-significant
12
User satisfaction
Again, no significant differences
No differences on individual SASSI factors
No efficiency/satisfaction differences between tutorial/non-tutorial, either
[Figure: bar chart of user satisfaction (mean of means), axis 1 to 7, non-shaping vs. shaping]
13
Grammaticality
How often did users speak within the Target SG grammar?
[Figure: bar chart of Target SG grammar use in Q1 and Q4, axis 0 to 80, non-shaping vs. shaping]
From Q1 to Q4, both groups showed significant increases in TGT grammaticality
14
Error rates - WER
For non-shaping: 39.9%
  30.3% for grammatical utts
  38.3% utt-level concept error
For shaping: a bit harder to figure, because of 2-pass ASR
  Each shaping input generated a TGT hyp & an EXP hyp
  Selection based on AM/LM score and a few simple heuristics
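For reference, word error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer hypothesis, divided by the number of reference words. The sketch below shows that standard definition. It is not the scoring tool used in the study, and the function name and example strings are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against six reference words.
print(word_error_rate("theater is northway mall cinemas eight",
                      "theater is northway mall cinema"))
```

Since insertions count against the hypothesis, WER can exceed 100% when the hypothesis is much longer than the reference.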
15
Error rates – WER
Shaping:
  For selected hypothesis: 37.3%
  All TGT: 40.9%
  All EXP: 64.2%
25.6% utt-level concept error
16
So – what happened?
Shaping users had success with NL-ish input, and shaping prompts were not strong enough to change behavior.
17
Biggest problem
Using NL or slot-only query formats
My theory: the <slot> is <value> specification format is very structured.
  what is <slot> sounds structured to me, but to users it sounds like <just ask a question!>
In new versions, query format will be list <slot>
Users don’t seem to have too much trouble adapting to a structure, but the structure needs to be clear.
Will also shape more explicitly by confirming with “I think you meant, ‘list movies’”
  Also for more explicit shaping of specifications
18
Other problems
Not using start over to clear context
Confusion about semantics of location
Long utterances
Using next instead of more
Pacing
These will be addressed via targeted help messages
19
Current hang-up
Can we improve WER?
  LM improvements?
  COTS recognizer?
Dragon:
  Using
  Results
  Issues
20
A little bit about trying DNS
Dragon NaturallySpeaking 8
  Distribution from Jahanzeb
Set up for dictation, i.e. mic input
  So, no telephone models
To compare with Sphinx
  Test set of utterances from this study
  Rerecorded with head mic (so, read) at 16kHz
  Downsampled to 8kHz for Sphinx
21
More Dragon stuff
Two groups
TGT
  Sphinx mean 56.4%
    Worse than 8k telephone model (?)
  Dragon mean 35.9%
  Mean diff: Dragon 18.8pts less (ns)
EXP
  Sphinx mean 68.5%
  Dragon mean 45.4%
  Mean diff: Dragon 22.3pts less (s)
22
More Dragon stuff
But: Dragon rates are not that different from original Sphinx WER rates
  Sphinx WER in this test might be fishy
Setup seems tricky: can I still do 2-pass decoding?
  Would need to change to mic setup
  Black-box LM stuff
Mysterious adaptation? Not good for user studies!
So, sticking with Sphinx.