Dialogue Context-Based Speech
Recognition using User Simulation
Ioannis Konstas
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
August 2008
©Copyright 2008
by
Ioannis Konstas
Declaration
I hereby declare that this thesis is of my own composition, and that it contains no ma-
terial previously submitted for the award of any other degree. The work reported in
this thesis has been executed by myself, except where due acknowledgement is made
in the text.
Ioannis Konstas
Abstract
Speech recognisers do not usually perform well in spoken dialogue systems, due to their lack of linguistic knowledge and their consequent inability to exploit the context of a dialogue the way humans do. This study, following several previous efforts, builds a post-processing system that acts as an intermediate filter between the speech recogniser and the dialogue system, in order to improve the accuracy of the former. To achieve this, it trains a Memory-Based Classifier using features extracted from recognition hypotheses, acoustic information, n-best list distributional properties and, most importantly, a User Simulation model trained on dialogue data, which simulates the way people predict the next dialogue move from the discourse history. The system was trained on dialogue logs extracted using the TownInfo dialogue system and consists of a two-tier architecture: a classifier that ascribes a confidence label to each hypothesis of the speech recogniser, and a re-ranker that extracts the hypothesis with the highest confidence label from the n-best list. Overall the system exhibited a relative reduction in Word Error Rate (WER) of 5.13% and a relative increase in Dialogue Move Accuracy (DMA) of 4.22% compared to always selecting the topmost hypothesis (the baseline), capturing 44.06% of the possible WER improvement on this data and 61.55% of the possible DMA improvement. This validates the main hypothesis of the thesis, namely that a User Simulation can effectively boost the speech recogniser's performance. Future work involves using a more elaborate semantic parser for the labelling and evaluation of each hypothesis, and the integration of the system into a real dialogue system such as TownInfo.
Acknowledgements
I wish to warmly thank my supervisor Oliver Lemon for his constant guidance and support, and for the time he spent on the completion of the project. I also wish to thank Kalliroi Georgila, Xingun Liu, Helen Hastie and Sofia Morfopoulou for their co-operation and helpful advice.
Contents
Declaration ...................................................................................................................ii
Abstract........................................................................................................................iii
Acknowledgements......................................................................................................iv
Contents........................................................................................................................v
Chapter 1 - Introduction................................................................................................1
1.1 Overview............................................................................................................3
1.2 Related Work......................................................................................................3
1.2.1 Topmost hypothesis classification..............................................................3
1.2.2 Re-ranking of n-best lists............................................................................4
1.3 User Simulation..................................................................................................7
1.4 TiMBL – Memory Based Learning....................................................................8
1.5 Evaluation Metrics.............................................................................................9
1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)...................9
1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence
Accuracy)...............................................................................................................10
1.6 The TownInfo Dialogue System.......................................................................11
Chapter 2 - Methodology............................................................................................13
2.1 The Edinburgh TownInfo Corpus.....................................................................14
2.2 Automatic Labelling.........................................................................................18
2.3 Features............................................................................................................19
2.4 System Architecture.........................................................................................22
2.5 Experiments.....................................................................................................26
2.6 Baseline and Oracle.........................................................................................28
Chapter 3 - Results .....................................................................................................29
3.1 First Layer – Classifier Experiments................................................................30
3.2 Second Layer – Re-ranker Experiments..........................................................32
3.3 Significance tests: McNemar's test & Wilcoxon test.......................................33
Chapter 4 - Experiment with high-level Features.......................................................35
Chapter 5 - Discussion and Conclusions....................................................................37
5.1 Automatic Labelling........................................................................................38
5.2 Features...........................................................................................................39
5.3 Classifier..........................................................................................................42
5.4 Results..............................................................................................................42
5.5 Future work.....................................................................................................44
5.6 Conclusions.....................................................................................................44
References...................................................................................................................46
CHAPTER 1
Introduction
Speech recognisers, though essential modules in spoken dialogue systems, do not on their own provide adequate performance for a system to be robust and intuitive to use. The most common reasons for erroneous recognition of the user's utterances are the ASR module's lack of linguistic knowledge and its inability to perform well in noisy environments. By comparison, there is evidence that the human speech recognition subsystem is usually able to predict upcoming words when placed in a sufficiently constraining linguistic context (so-called 'high-Cloze' contexts), as in domain-specific dialogues, even in situations where the level of surrounding distracting noise is high (Pickering et al., 2007).
This very interesting behaviour of the human brain suggests that we can, in a way, simulate its ability to correctly disambiguate possible misrecognitions of what the user of a dialogue system intended to say. The most obvious way would be to feed such linguistic context, namely the history of the dialogue so far between the user and the system, into a post-processing system, in an effort to boost the accuracy of the speech recogniser module.
Let us consider the following psycholinguistic theory of Pickering et al. (2007), who argue that people go a step further and use their language production subsystem to make predictions while comprehending their co-speaker in a dialogue: “if B overtly imitates A, then A's comprehension of B's utterance is facilitated by A's memory for A's previous utterance.” With this in mind, let us make the following analogy:
Let A be the speaker-system in a dialogue session and B the user of this system. The user can be said to imitate the system both in choice of words and in the semantics of the messages to be conveyed, since he or she is asked to fulfil a certain scenario in a rather limited domain of interest, for example booking a train ticket or finding a hotel/bar/restaurant. The system (A) can then understand better what the user (B) said, because it “remembers”, i.e. it has stored, its own previous actions and turns.
Given this interpretation, it would clearly be useful to model the dialogues between the user and the system, so that the system “remembers” what it has said and done before in order to better understand what the user is really saying. This idea can be approached computationally by what is called User Simulation, the main theme around which this study revolves.
To combine this theory with the speech recognition module of a dialogue system, recall that the latter produces several hypotheses for a given user utterance as its output, namely an n-best list. The reasoning above leads to the conclusion that the topmost hypothesis in this list might not be the correct one, either in terms of word recognition or of semantic interpretation; instead, the correct hypothesis may lie somewhere lower in the n-best list.
Note that in dialogue systems we are usually interested in the semantic representation
of an utterance, since it is usually sufficient for the system merely to understand what
the user wants to say, instead of the exact way he or she said it. Word alignment of
course may account for the level of confidence that the semantic interpretation is
truly the one meant to be conveyed by the user.
This study attempts to build a post-processing system that will take as its input the
speech recogniser's n-best lists of the user's utterances in dialogues and re-rank them
in an effort to extract the correct ones in terms both of semantic representation and
word alignment. In order to achieve this, it trains a Memory Based Classifier using
features extracted from recognition hypotheses, acoustic information, n-best list dis-
tributional properties and a User Simulation model trained on dialogue data.
1.1 Overview
The chapters of this thesis are organised as follows:
Chapter 1: Introduction to the problem of context-sensitive speech recognition and
previous work on this area.
Chapter 2: Detailed description of the methodology adopted to implement and train
the system discussed in the study.
Chapter 3: Results of the experiments conducted in order to train the system.
Chapter 4: Additional experiment with minimal number of high-level features.
Chapter 5: Discussion on the methodology and the results, future work and conclu-
sion.
1.2 Related Work
The notion of incorporating explicit knowledge to evaluate and refine ASR hypotheses, in the context of enhancing the dialogue strategy of a system, is not new. Several studies have attempted to boost the performance of the speech recogniser, following either of two approaches to essentially the same problem: making decisions on the topmost hypothesis of the ASR's output, or classifying and then re-ranking the n-best lists. All the experiments in these studies were conducted on similar input data, i.e. transcripts and wave files of user utterances and logs of dialogues. However, the systems they were extracted from differed in their target domains, and the size of the corpora ranged from a few hundred utterances (Gabsdil and Lemon, 2004) to several thousand (Litman et al., 2000).
1.2.1 Topmost hypothesis classification
Litman et al. (2000) use prosodic cues extracted directly from speech waveforms, rather than confidence scores from the acoustic model incorporated in the speech recogniser, to predict misrecognised user utterances in their corpus. Their experiments show that utterances containing word errors have certain prosodic features; moreover, even simple acoustic features such as the energy of the captured waveforms and their duration provide good separation between correctly recognised and misrecognised utterances. They therefore maintain that these features can yield higher accuracy than standard confidence scores. To distinguish between correct and incorrect recognitions they train a classifier using RIPPER (Cohen, 1996), a rule-learning model that builds a set of binary rules based on the entropy of each feature on a given training set. Their corpus consists of 544 dialogues between humans and three different dialogue systems: voice dialling and messaging (Kamm et al., 1998), accessing email (Walker et al., 1998), and accessing online train schedules (Litman and Pan, 2000). Their best configuration scores 77.4% accuracy, a 48.8% relative increase over their baseline.
Walker et al. (2000) use a combination of features from the speech recogniser, natural language understanding, and dialogue history to assign one of three classes to the topmost hypothesis: correct, partially correct, or misrecognised. Like Litman et al. (2000), they use RIPPER as their classifier. They train their system on 11,787 spoken utterances collected from AT&T's How May I Help You corpus (Gorin et al., 1997; Boyce and Gorin, 1996), consisting of telephone dialogues concerning subscription-related scenarios. Their system achieves 86% accuracy, an improvement of 23% over the baseline.
1.2.2 Re-ranking of n-best lists
On the other hand, Chotimongkol and Rudnicky (2001), Gabsdil and Lemon (2004),
Jonson (2006) and Andersson (2006) move a step further than simply classifying the
topmost hypothesis and perform re-ranking of the n-best lists using prosodic and
speech recognition features as well as dialogue context and task-related attributes.
Chotimongkol and Rudnicky (2001) train a linear regression model on acoustic, syntactic and semantic features in order to reorder the n-best hypotheses for a single utterance. Each hypothesis in the list is ascribed a correctness score, namely its relative word error rate within the list, and the hypothesis with the lowest score is chosen instead of the top-1 result. The corpus is extracted from the Communicator system (Rudnicky et al., 2000), concerning travel planning, and consists of 35,766 utterances for which the 25-best lists are taken into consideration. The re-ranker achieves 11.97% WER, a 3.96% relative reduction compared to the baseline.
Gabsdil and Lemon (2004) similarly perform reordering of n-best lists by combining acoustic and pragmatic features. Their study shows that dialogue features, such as the previous system question and whether a hypothesis is a correct answer to that question, contributed more than the other, more common attributes. Each hypothesis in the n-best list is automatically labelled as either in-grammar, out-of-grammar (oog) (WER ≤ 50), out-of-grammar (oog) (WER > 50) or crosstalk. This labelling is based on a combination of the semantic parse of each hypothesis and its alignment with the true transcript. Their approach has two steps: first they use TiMBL (Daelemans et al., 2007), a memory-based classifier, to predict the correct label of each hypothesis in the n-best list, and then they perform a simple re-ranking by choosing the hypothesis, if it exists in the list, that has the most significant label according to the order: in-grammar < oog (WER ≤ 50) < oog (WER > 50) < crosstalk. The corpus was extracted with the WITAS dialogue system (Lemon, 2004) and consisted of interactions with a simulated aerial vehicle: a total of 30 dialogues with 303 utterances, of which the 10-best lists were taken into consideration. Their system performed 25% better (relative) than the baseline, with a weighted f-score of 86.38%.
Jonson (2006) classifies recognition hypotheses with quite similar labels, denoting acceptance, clarification, confirmation and rejection. These labels are automatically derived in a manner equivalent to the Gabsdil and Lemon (2004) study and correspond to varying levels of confidence, being essentially potential directives to the dialogue manager. Apart from the common features, Jonson includes close-context features, e.g. previous dialogue moves and slot fulfilment, as well as the dialogue history. She also includes attributes that describe the whole n-best list, e.g. the standard deviation of confidence scores. Jonson (2006) also uses TiMBL, classifying each hypothesis of the n-best list into one of the five labels incorporated, and uses the same re-ranking algorithm as Gabsdil and Lemon (2004) to choose the top-1 hypothesis. Her system was trained on the GoDiS corpus, comprising dialogues with a virtual jukebox: 486 user utterances, of which the 40-best lists were taken into account. Her optimal set-up scored 83% DMA and 58% SA (see section 1.5.2 for an explanation of these measures), a relative increase of 56.60% in DMA and 20.83% in SA compared to the baseline.
Andersson (2006) uses similar acoustic, list and dialogue features but adheres to a simpler binary annotation, characterising whether each hypothesis of the ASR n-best list is close enough ('B') or not ('N') to the original transcript. For the classification he trains maximum entropy models and performs a simple re-ranking by choosing the first hypothesis, if any, that belongs to the 'B' category. His corpus is taken from the Edinburgh TownInfo system, containing dialogues for booking hotels/bars/restaurants (see section 2.1), and consists of 191 dialogues, or 2904 utterances, taking into consideration on average the 7-best lists. He scores an absolute error reduction of 4.1%, which corresponds to a relative improvement of 44.5% over the baseline.
Gruenstein (2008) follows a somewhat different approach to the re-ranking problem, considering the prediction of the system's response rather than the user's utterance, in the context of a multi-modal dialogue system. Along with the common recognition and distributional features of the hypotheses in the n-best lists, he takes into account features that describe the response of the system to the n-best list produced by the speech recogniser. Similarly to Andersson (2006), he labels each hypothesis as 'acceptable' or 'unacceptable' depending on its semantic match with the true transcript. He then trains an SVM to predict either of the two classes and fits a linear regression model to the classification output of the SVM in order to produce a confidence score between -1 and +1, with -1 being totally 'unacceptable' and +1 totally 'acceptable'. Re-ranking is then performed by setting a threshold in the interval [-1,+1] and choosing the hypothesis that exceeds this threshold. Unlike the previous studies, this method outputs a numerical confidence score rather than a discrete label. His system is trained on 1912 utterances taken from the City Browser multi-modal system (Gruenstein et al., 2006; Gruenstein and Seneff, 2007). It scored an absolute F1-measure of 72%, a 16% improvement over the baseline.
1.3 User Simulation
What makes this study different from the previous work in the area of post-pro-
cessing of the ASR hypotheses is the incorporation of a User Simulation output as an
additional feature in my own system. Undeniably, the history of the discourse
between a user and a dialogue system plays an important role in what the user might be expected to say next. As a result, most of the studies mentioned in the previous section make various efforts to capture it by including relevant features directly in their classifiers. Although this may favour simplicity and runtime performance, these systems still fail, to some extent, to model user behaviour over the course of the dialogue in a systematic way.
A User Simulation model fills this gap: trained on small corpora of dialogue data, it simulates real user behaviour. In my system I have used a User Simulation model created by Georgila et al. (2006), based on n-grams of dialogue moves. Essentially, it treats a dialogue as a sequence of lists of consecutive user and system turns in a certain high-level semantic representation, i.e. {<Speech Act>, <Task>} pairs (see section 2.1 for a complete explanation of this semantic representation). It takes as input the n - 1 most recent lists of {<Speech Act>, <Task>} pairs in the dialogue history, and uses the statistics of n-grams in the training set to decide on the next user action. If no n-grams match the current history, the model backs off to n-grams of lower order.
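The back-off scheme can be sketched as follows. Note that this is a toy illustration, not the actual Georgila et al. (2006) model: the training dialogues, speech-act and task values are invented stand-ins, and each turn is simplified to a single {<Speech Act>, <Task>} pair.

```python
from collections import Counter

# Hypothetical toy training data: each dialogue is a sequence of
# (speech_act, task) pairs; the names and values are illustrative only.
dialogues = [
    [("greet", "none"), ("request", "hotel"), ("provide", "price"), ("confirm", "hotel")],
    [("greet", "none"), ("request", "bar"), ("provide", "location"), ("confirm", "bar")],
    [("greet", "none"), ("request", "hotel"), ("provide", "location"), ("confirm", "hotel")],
]

def train_ngrams(dialogues, n):
    """Collect counts of n-grams of (speech_act, task) pairs for every order up to n."""
    counts = {order: Counter() for order in range(1, n + 1)}
    for dlg in dialogues:
        for order in range(1, n + 1):
            for i in range(len(dlg) - order + 1):
                counts[order][tuple(dlg[i:i + order])] += 1
    return counts

def predict_next(history, counts, n):
    """Predict the most likely next pair given the n-1 most recent moves,
    backing off to shorter contexts when the longer context is unseen."""
    for order in range(n, 0, -1):
        context = tuple(history[-(order - 1):]) if order > 1 else ()
        candidates = Counter()
        for gram, c in counts[order].items():
            if gram[:-1] == context:
                candidates[gram[-1]] += c
        if candidates:
            return candidates.most_common(1)[0][0]
    return None

counts = train_ngrams(dialogues, 3)
print(predict_next([("greet", "none")], counts, 2))  # most frequent continuation -> ('request', 'hotel')
```

With an unseen history the prediction falls back to the unigram distribution, which is exactly the locality drawback discussed below: the model can only exploit as much history as its order allows.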
The benefit from using n-gram models in order to simulate user actions is that they
are fully probabilistic and are fast to train even on large corpora. A main drawback
which is common in the case of n-grams is that they are considered to be quite local
in their predictions. In other words, given a history of n – 1 dialogue turns, the pre-
diction of the n-th user turn may be too dependent on the previous ones and thus
might not make much sense in the more global context of the dialogue.
The main hypothesis of this study is that by using the User Simulation model to pre-
dict the next dialogue move of the user utterance as a feature in my system, I shall ef-
fectively increase the performance of the speech recogniser module.
1.4 TiMBL – Memory Based Learning
In this study I chose TiMBL 6.1 (Daelemans et al., 2007) as the main model for the classification of the hypotheses in the n-best list. TiMBL is well-established, efficient, open-source C++ software, already used with good results by Gabsdil and Lemon (2004) and Jonson (2006).
Memory-Based Learning (MBL) is an elegantly simple and robust machine learning method which has been applied to a multitude of tasks in Natural Language Processing (NLP). MBL descends directly from the plain k-Nearest Neighbour (k-NN) method of classification, which is still considered a quick yet powerful pattern classification algorithm.
Though plain k-NN performs well in many applications, it is notoriously inefficient at runtime, since each test vector must be compared to all the training data. Since classification speed is a critical issue in any realistic application of MBL, non-trivial data structures and speed-up optimisations have been employed in TiMBL; typically, training data are compressed into a decision-tree structure.
In general, MBL is founded on the hypothesis that “performance in cognitive tasks is
based on reasoning on the basis of similarity of new situations to stored representa-
tions of earlier experiences, rather than on the application of mental rules abstracted
from earlier experiences” (Daelemans et al., 2007). Like every common machine learning method, TiMBL is divided into two parts: the learning component, which is memory-based, and the performance component, which is similarity-based.
The learning component of MBL is memory-based and it merely involves adding
training instances to memory. This process is sometimes referred to as “lazy” since
storing into memory takes place without some form of abstraction or intermediate
representation. An instance consists of a fixed-length vector of n feature-value pairs,
plus the class that this particular vector belongs to.
In the performance component of an MBL system, the training vectors are used in or-
der to perform classification of a previously unseen test datum. The similarity
between the new instance X and all training vectors Y in memory is computed using
some distance metric ∆(X, Y). The extrapolation is done by assigning the most fre-
quent class within the found subset of most similar examples, the k-nearest neigh-
bours, as the class of the new test datum. If there exists a tie among classes, a certain
tie breaking algorithm is applied.
In order to compute the similarity between a test datum and each training vector I
chose to use IB1 with information-theoretic feature weighting (among other MBL
implementations found in the TiMBL library). The IB1 algorithm calculates this sim-
ilarity in terms of weighted overlap: the total difference between two patterns is the
sum of the relevance weights of those features which are not equal. The class for the
test datum is decided on the basis of the least distant item(s) in memory.
To compute relevance, Gain Ratio is used, which is essentially the Information Gain of each feature of the training vectors divided by the entropy of the feature values. Gain Ratio is a normalised version of the Information Gain measure, according to which each feature is considered independently of the rest and measures how much information it contributes to our knowledge of the correct class label.
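The interplay between gain-ratio weighting and the IB1 weighted-overlap distance can be sketched as follows. This is a minimal illustration of the technique, not TiMBL itself: the training vectors, feature values and class names are invented stand-ins for the hypothesis features discussed later.

```python
import math
from collections import Counter

# Toy training vectors: (features, class). The values are illustrative only.
train = [
    (("short", "high"), "opt"),
    (("short", "low"), "neg"),
    (("long", "high"), "opt"),
    (("long", "low"), "neg"),
    (("short", "high"), "opt"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(train, i):
    """Information gain of feature i, normalised by the entropy of its values."""
    labels = [cls for _, cls in train]
    values = [feats[i] for feats, _ in train]
    remainder = 0.0
    for v in set(values):
        subset = [cls for feats, cls in train if feats[i] == v]
        remainder += len(subset) / len(train) * entropy(subset)
    split_info = entropy(values)
    ig = entropy(labels) - remainder
    return ig / split_info if split_info > 0 else 0.0

def ib1_distance(x, y, weights):
    """Weighted overlap: sum the relevance weights of the features that differ."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

def classify(x, train, weights):
    """1-NN: the class of the least distant item in memory (first one on ties)."""
    return min(train, key=lambda ty: ib1_distance(x, ty[0], weights))[1]

weights = [gain_ratio(train, i) for i in range(2)]
print(classify(("long", "high"), train, weights))  # exact match in memory -> 'opt'
```

In this toy data the second feature perfectly separates the classes, so its gain-ratio weight dominates the distance computation, just as informative features dominate in IB1.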
1.5 Evaluation Metrics
In this section I shall introduce the six metrics that were used in the evaluation of the
two core components of the system, namely the classifier and re-ranker.
1.5.1 Classifier Metrics (Precision, Recall, F-Measure and Accuracy)
Precision of a given class X is the ratio of the vectors that were classified correctly to
class X (True Positive) to the total number of vectors that were classified as class X
either correctly or not (True Positive + False Positive).
Precision = True Positive / (True Positive + False Positive)    (1.1)
Recall of a given class X is the ratio of the vectors that were classified correctly to
this class (True Positive) to the total number of vectors that actually belong to class
X, in other words to the number of vectors that were correctly classified as class X
plus the number of vectors that were incorrectly not classified as class X (True Posit-
ive + False Negative).
Recall = True Positive / (True Positive + False Negative)    (1.2)
F-measure is a combination of precision and recall. The general formula of this met-
ric is (b is a non-negative real valued constant):
F-measure = ((b² + 1) · Precision · Recall) / (b² · Precision + Recall)    (1.3)
In this study we use the formula of F1 (b = 1), which gives equal gravity to precision
and recall and is also called the weighted harmonic mean of precision and recall:
F1 = (2 · Precision · Recall) / (Precision + Recall)    (1.4)
Accuracy of the classifier in total is the ratio of the vectors that were correctly classi-
fied in their classes to the total number of vectors that exist in the test set.
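The four classifier metrics above can be computed per class directly from gold and predicted labels. A minimal sketch, with hypothetical label sequences for illustration:

```python
# Hypothetical gold and predicted labels for a handful of test vectors.
gold = ["opt", "neg", "opt", "pos", "neg", "opt"]
pred = ["opt", "opt", "opt", "pos", "neg", "neg"]

def precision(gold, pred, cls):
    """Equation (1.1): TP / (TP + FP) for the given class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if p == cls and g != cls)
    return tp / (tp + fp) if tp + fp else 0.0

def recall(gold, pred, cls):
    """Equation (1.2): TP / (TP + FN) for the given class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    return tp / (tp + fn) if tp + fn else 0.0

def f1(gold, pred, cls):
    """Equation (1.4): the weighted harmonic mean of precision and recall."""
    p, r = precision(gold, pred, cls), recall(gold, pred, cls)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(gold, pred):
    """Fraction of test vectors classified with their correct class."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

For instance, with the labels above, 'opt' is predicted three times but correct only twice (precision 2/3), and one true 'opt' is missed (recall 2/3).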
1.5.2 Re-ranker Metrics (Word Error Rate, Dialogue Move Accuracy, Sentence Accuracy)
Word Error Rate (WER) is the ratio of the number of deletions, insertions and substi-
tutions in the transcription of a hypothesis as compared to the true transcript to the
total number of words in the true transcript. In my system I compute it by measuring
the Levenshtein distance between the hypothesis and the true transcript.
WER = (Deletions + Insertions + Substitutions) / Length of Transcript    (1.5)
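The Levenshtein-based computation of equation (1.5) can be sketched as follows; the example sentences are invented for illustration:

```python
def wer(hypothesis, transcript):
    """Word error rate: Levenshtein distance over words, divided by the
    number of words in the true transcript."""
    hyp, ref = hypothesis.split(), transcript.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(hyp)][len(ref)] / len(ref)

print(wer("book a table please", "book a hotel please"))  # one substitution over four words -> 0.25
```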
Dialogue Move Accuracy (DMA) is a variant of the Concept Error Rate (CER) as
defined by Boros et al. (1996), which takes into account the semantic aspects of the
difference between the classified utterance and the true transcription. CER is similar in a sense to WER, since it takes into account deletions, insertions and substitutions, but at the semantic rather than the word level of the utterance. In our case DMA
is stricter than CER, in the sense that it does not allow for partial matches in the se-
mantic representation. In other words, if the classified utterance corresponds to the
same semantic representation as the transcribed then we have 100% DMA, otherwise
0%.
Sentence Accuracy (SA) measures the alignment of a single hypothesis in the n-best list with the true transcription. Similarly to DMA, it requires perfect alignment between the hypothesis and the transcription: if they match perfectly we have 100% SA, otherwise 0%.
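Since both DMA and SA disallow partial matches, they reduce to exact-match checks at the utterance level. A minimal sketch, where the move representation is an illustrative placeholder:

```python
def sentence_accuracy(hypothesis, transcript):
    """SA: 100% only if the hypothesis aligns perfectly with the transcript."""
    return 1.0 if hypothesis.split() == transcript.split() else 0.0

def dialogue_move_accuracy(hyp_moves, ref_moves):
    """DMA: 100% only if the full semantic representations are identical
    (no partial credit, unlike CER)."""
    return 1.0 if hyp_moves == ref_moves else 0.0

# Different wordings can still carry the same dialogue move, so DMA can
# be 100% where SA is 0%:
print(sentence_accuracy("book a cheap hotel", "book an inexpensive hotel"))   # 0.0
print(dialogue_move_accuracy([("request", "hotel")], [("request", "hotel")])) # 1.0
```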
1.6 The TownInfo Dialogue System
The training datasets used in this study were collected from user (both native and
non-native) interactions with the TownInfo dialogue system (Lemon, Georgila and
Henderson, 2006) developed within the TALK project (http://www.talk-project.org).
The TALK TownInfo system is an experimental domain-specific system with which users interact via natural speech in order to book a room in a hotel, book a table in a restaurant or find a bar. Each user was given a specific scenario to fulfil, involving subtasks of preferred choice regarding price range, location and type of facility (Lemon, Georgila, Henderson and Stuttle, 2006). The dialogue system is implemented in the Open Agent Architecture (OAA) (Cheyer and Martin, 2001), with the main components being a dialogue manager, a dialogue policy reinforcement learner, a speech recogniser and a speech synthesiser. The inputs to my system are given by the dialogue manager and the speech recogniser.
The dialogue manager, DIPPER (Bos et al., 2003), implements an Information State Update (ISU) approach to dialogue management; it was specifically developed to handle spoken input/output and integrates several communicating software agents. These include agents that monitor dialogue progress and help other agents decide what action should be taken based on the previous and current state of the dialogue. The output of DIPPER that is of particular interest for us is the logging of the flow of the dialogues, with information such as user utterances, system output (both the transcript and the semantic representation), the current user task, etc. The goal of the system is to guide the dialogue manager's clarification and confirmation strategies, which are then given to the speech synthesiser to realise (Lemon, Georgila, Henderson and Stuttle, 2006).
The speech recogniser was built with the ATK tool-kit (Young, 2004). The recogniser
models natural speech using Hidden Markov Models and utilizes n-grams as its lan-
guage model integrating in this way domain-specific data with wide coverage data
instead of a domain dependent recognition grammar network. What is more, it oper-
ates in an n-best mode, which means that it produces the top-n hypotheses that were
recognised against the recorded speech of the user, ordered by the model's overall
confidence score.
CHAPTER 2
Methodology
The outcome of this study is standalone software, written in Java and C/C++, that performs re-ranking of the n-best lists produced by the speech recogniser in the context of a dialogue system. The input to the system is the n-best list corresponding to a user utterance, along with the confidence scores per utterance and per word of each utterance, and the dialogue log containing the turns of both the system and the user. The output of the system is a single hypothesis chosen from the n-best list, along with a label that corresponds to a degree of “certainty” as to the correctness of the picked hypothesis.
To extract the correct hypothesis, the system includes a Memory-Based Classifier (TiMBL; Daelemans et al., 2007) that has been trained on the Edinburgh TownInfo corpus, consisting of 126 dialogues containing 1554 user utterances. For each utterance I used the 60-best lists produced by the recogniser, resulting in a total of 93240 hypotheses. Each hypothesis was automatically labelled with one of the labels 'opt', 'pos', 'neg' and 'ign', denoting decreasing confidence as to the match with the true transcript and semantic representation of the user's utterance, with 'opt' representing the most confident estimate.
The final testing of the classifier was performed on a separate part of the Edinburgh TownInfo corpus, consisting of 58 dialogues containing 510 utterances. For each utterance I used the full 60-best lists, resulting in 30600 hypotheses. The testing of the re-ranker was performed with 10-fold cross-validation on the whole training set. The reason for choosing cross-validation is that it lends greater significance to the results and demonstrates the overall robustness of the system in its ability to boost the recogniser's performance.
The classifier was trained with four different sets of an increasing number of features: the first set included only list features, such as the standard deviation of confidence scores; the second included the first plus current hypothesis features, such as hypothesis length, confidence score, etc.; the third included the previous two plus acoustic features; and the fourth all of the previous plus the User Simulation score.
2.1 The Edinburgh TownInfo Corpus
The input to the system is extracted from the Edinburgh TownInfo corpus, which consists of a total of 184 dialogues, an example of which is given in Figure 2.1. The corpus had already been collected and transcribed by K. Georgila, since it forms part of the training set used by the User Simulation model that is also incorporated in my system. In more detail, the corpus contains a collection of interactions with the TownInfo dialogue system by native and non-native users (see section 1.6).
Figure 2.1: A dialogue log from the Edinburgh TownInfo corpus (Andersson 2006)
System: How can I help you?
User: I need to find a hotel room
System: Did you say hotel?
User: Yes
System: Do you want something central?
User: Yes please
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: [empty]
System: Do you need a single or double room?
User: Single
System: Would you like something cheap, expensive or reasonably priced?
User: I don't mind
System: Now I am going to search for a single central room. Please wait a moment while I search for suitable hotel options.
(system presents options)
Are you interested in any hotel option number or would you like me to keep going?
User: Number three
(system repeats option number three)
System: Are you happy with that choice?
User: Yes
System: Would you like to search some more?
User: No
System: Goodbye...and thanks for using the Edinburgh Cambridge dialogue system.
Figure 2.2: Part of an n-best list for the transcription 'Something cheap'. The second column denotes the acoustic score of the speech recogniser

Something cheap
<s> SOMETHING CHEAP </s> -15268.4
<s> SOMETHING A CHEAP </s> -15283.8
<s> I THINK CHEAP </s> -15294.8
<s> UH SOMETHING CHEAP </s> -15287.4
<s> SOMETHING CHEAPER </s> -15287.3
<s> SOMETHING CHEAP I </s> -15276.6
<s> I DON'T CHEAP </s> -15307.4
<s> I SOMETHING CHEAP </s> -15310.5
<s> I WANT A CHEAP </s> -15383.5
<s> A SOMETHING CHEAP </s> -15287.4
<s> SOMETHING CHEAP A </s> -15259.5
<s> SOMETHING CHEAP UH </s> -15259.5
<s> I HAVE A CHEAP </s> -15396.8
<s> I WANT CHEAP </s> -15327.8
<s> THE SOMETHING CHEAP </s> -15311.4
<s> UH THANK CHEAP </s> -15270.5
<s> AH SOMETHING CHEAP </s> -15291.3
<s> ER SOMETHING CHEAP </s> -15300
<s> SOMETHING CHEAP AT </s> -15261.9
<s> I THINK A CHEAP </s> -15336.4

Each utterance is stored in various formats, depending on the context on which we are focusing. At the highest level we have a collection of dialogue logs, which are structured in accordance with the Information State Update (ISU) paradigm, as shown in Figure 2.3.

Apart from the transcript of the user's or system's utterance (shown in bold), the logs also contain a semantic representation, for the limited knowledge domain of hotels, bars and restaurants, that denotes the current Dialogue Move. More specifically, each utterance is annotated in the following format: <Speech Act>, <Task>, <Slot Value> (shown in red in the example of Figure 2.3, with the equivalent values filled in). The Speech Act field is a high-level representation of the type of sentence uttered by the user/system and takes values such as provide_info and yes_answer, which mean that the user/system conveys some domain-specific information or answers affirmatively, respectively.
Figure 2.3: Excerpt from a dialogue log containing the most useful fields, showing the Information State fields for the user utterance 'chinese'

TypeOfPolicy: 1
STATE 7
DIALOGUE LEVEL
Turn: user
TurnNumber: 3
Speaker: user
DialogueActType: user
ConvDomain: about_task
SpeechAct: [provide_info]
AsrInput: chinese
TransInput:
Output:
TASK LEVEL
Task: [food_type]
FilledSlot: [food_type]
FilledSlotValue: [chinese]
LOW LEVEL
AudioFileName: kirsten-003--2006-11-06_12-30-13.wav
ConfidenceScore: 0.44
HISTORY LEVEL
PreviouslyFilledSlots: [null],[top_level_trip],[null],[food_type]
PreviouslyFilledSlotsValues: [null],[restaurant],[],[chinese]
PreviouslyGroundedSlots: [null],[null],[top_level_trip],[]
SpeechActsHist: opening_closing,request_info,[provide_info,provide_info],explicit_confirm,[yes_answer,yes_answer],request_info,[provide_info]
TasksHist: meta_greeting_goodbye,top_level_trip,[top_level_trip,food_type],top_level_trip,[top_level_trip,food_type],food_type,[food_type]
FilledSlotsHist: [top_level_trip,food_type],[],[food_type]
FilledSlotsValuesHist: [restaurant,chinese],[],[chinese]

The Task field is a lower-level representation of the contents of the user's/system's utterance and takes values such as top_level_trip and food_type, which indicate that the user/system has made a general statement about a hotel, bar or restaurant, or a statement about the type of food, respectively. Finally, the Slot Value field is the lowest-level representation of the message conveyed and usually corresponds to specific information, such as chinese if the Task field has the value food_type, cheap if the Task field is filled with hotel_price, etc.
For the purposes of the experiments I have used a cut-down version of the ISU logs, containing just the semantic parses of the dialogue moves of the system's and user's turns and the names of the wave files that correspond to the user's utterances.
For each utterance there is a series of files containing the 60-best lists produced by the speech recogniser: the transcription hypotheses at the sentence level, along with the acoustic model score (Figure 2.2), and the equivalent transcriptions at the word level, with information such as the duration of each recognised frame and the confidence scores of the acoustic and language models for each word (Figure 2.4). Finally, there are the wave files of each utterance, which were used to compute various acoustic features.
Figure 2.4: Speech recogniser's output at a word level for the transcript 'Something cheap'. The columns correspond to: start of frame, end of frame, label, language modelling, acoustic, total and confidence score.
Something cheap
0 6000000 <s> 15.000000 -3306.653320 -3291.653320 0.903141
6000000 10900000 SOMETHING -69.225189 -3832.953613 -3902.178711 0.774873
10900000 14000000 CHEAP -16.965895 -2162.810547 -2179.776367 0.950973
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400
///
0 6000000 <s> 15.000000 -3306.653320 12041.324219 0.903141
6000000 10300000 SOMETHING -69.225189 -3324.001465 -3393.226562 0.785827
10300000 11000000 A -42.978447 -608.698303 -651.676758 0.514222
11000000 14000000 CHEAP -17.142681 -2078.526367 -2095.668945 0.954854
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400
///
0 6700000 <s> 15.000000 -3828.631348 11577.962891 0.890681
6700000 8800000 I -20.461586 -1653.962280 -1674.423828 0.720694
8800000 11200000 THINK -33.112690 -1921.326782 -1954.439453 0.784222
11200000 14000000 CHEAP -73.240974 -1924.966553 -1998.207520 0.957299
14000000 25900000 </s> 6.578006 -5965.947266 -5959.369141 0.935400
2.2 Automatic Labelling
In order to perform the re-ranking of the n-best lists we have to rely on some measure of the correctness of each hypothesis. In other words, we need to distinguish between hypotheses that are close enough to the true transcript and those that are not. Instead of adopting the industry-standard measure of closeness for speech recognisers, namely WER, I adhered to a less strict hybrid method that combines primarily the DMA and then the WER of each hypothesis. What is more, in order to induce a kind of discrete confidence score that can guide, or at least assist, the dialogue manager in choosing a particular strategy move, I have devised four labels in decreasing order of confidence: 'opt', 'pos', 'neg', 'ign'. These are generated automatically using two different modules: a keyword parser that computes the {<Speech Act><Task>} pair, as described in the previous section, and a Levenshtein distance calculator, for the computation of the DMA and the WER of each hypothesis respectively.
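The WER side of this comparison can be sketched with a standard word-level Levenshtein distance (an illustrative reconstruction, not the actual calculator used in the system; the class and method names are mine):

```java
// Minimal sketch of word-level Levenshtein distance and WER.
// Class and method names are illustrative, not the system's actual code.
public class WerSketch {

    // Classic dynamic-programming edit distance over word sequences.
    static int levenshtein(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) d[i][0] = i;
        for (int j = 0; j <= hyp.length; j++) d[0][j] = j;
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int sub = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                          Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[ref.length][hyp.length];
    }

    // WER as a percentage: edit distance normalised by reference length.
    static double wer(String reference, String hypothesis) {
        String[] ref = reference.trim().toUpperCase().split("\\s+");
        String[] hyp = hypothesis.trim().toUpperCase().split("\\s+");
        return 100.0 * levenshtein(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        // One inserted word against a two-word reference -> 50% WER,
        // i.e. exactly on the 'pos'/'neg' boundary used below.
        System.out.println(wer("something cheap", "something a cheap")); // prints 50.0
    }
}
```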
The reason for opting for a more abstract level, namely the semantics of the hypotheses, rather than delving into the lower level of individual word recognition, is that in dialogue systems it is usually sufficient to rely on the message conveyed by the user rather than on the exact words he or she used.
Similarly to Gabsdil and Lemon (2004) and Jonson (2006), I ascribed to each hypothesis one of the 'opt', 'pos', 'neg' and 'ign' labels, according to the following schema:
• opt: The hypothesis is perfectly aligned and semantically identical to the transcription
• pos: The hypothesis is not entirely aligned (WER ≤ 50) but is semantically identical to the transcription
• neg: The hypothesis is semantically identical to the transcription but does not align well (WER > 50), or is semantically different compared to the transcription
• ign: The hypothesis was not addressed to the system (crosstalk), e.g. the user laughed, coughed, etc.
The 50% value for the WER as the threshold for the distinction between the 'pos' and 'neg' categories is adopted from Gabsdil (2003) and is based on the fact that WER is correlated with concept accuracy (Boros et al. 2003). In other words, if a hypothesis is erroneous as far as its transcript is concerned, then it is highly likely that it does not convey the correct message from a semantic point of view either.
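Putting the threshold and the semantic test together, the labelling decision can be sketched as follows (illustrative names; "same Dialogue Move" stands for equality of the {<Speech Act><Task>} pairs computed by the keyword parser):

```java
// Sketch of the automatic labelling schema; names are illustrative.
public class LabelSketch {

    // crosstalk: the utterance was not addressed to the system;
    // sameDialogueMove: the hypothesis' {<Speech Act><Task>} pair equals
    // that of the true transcript; wer: word error rate in percent.
    static String label(boolean crosstalk, boolean sameDialogueMove, double wer) {
        if (crosstalk) return "ign";
        if (sameDialogueMove && wer == 0.0) return "opt";  // perfect alignment
        if (sameDialogueMove && wer <= 50.0) return "pos"; // imperfect but close
        return "neg"; // semantically different, or same move with WER > 50
    }
}
```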
It can be seen that I always label hypotheses that are conceptually equivalent to a particular transcription as potential candidates for dialogue strategy moves, and total misrecognitions as rejections. In Figure 2.5 we can see some examples of the four labels. Notice that in the case of silence, we assign an 'opt' to the empty hypothesis.
Transcript: I'd like to find a bar please              Transcript: silence
I WOULD LIKE TO FIND A BAR PLEASE   pos                -    opt
I LIKE TO FIND A FOUR PLEASE        neg                MM   ign
I'D LIKE TO FIND A BAR PLEASE       opt                HM   ign
WOULD LIKE TO FIND THE OR PLEASE    ign                UM   ign
2.3 Features
All the features used by the system are extracted from the dialogue logs, the per-utterance and per-word n-best lists, and the audio files. The majority of the features were chosen on the basis of their success in previous systems, as described in the literature. The novel feature, of course, is the User Simulation score, which may render redundant most of the equivalent dialogue features found in other studies.
In order to measure the usefulness of each candidate feature and thus choose the most important ones, I used the common metrics of Information Gain and Gain Ratio (see section 1.4 for a very brief explanation) on the whole training set, i.e. 93240 hypotheses.
Figure 2.5: Examples of the four labels: opt, pos, neg and ign

In total I extracted 13 attributes that can be grouped into 4 main categories: those that concern the current hypothesis to be classified, those that concern low-level statistics of the audio files, those that concern the whole n-best list, and finally the user simulation feature:
1. Current Hypothesis Features (6): acoustic score, overall model confidence
score, minimum word confidence score, grammar parsability, hypothesis
length and hypothesis duration.
2. Acoustic Features (3): minimum, maximum and RMS amplitude
3. List Features (3): n-best rank, deviation of confidence scores in the list,
match with most frequent Dialogue Move
4. User Simulation (1): User Simulation confidence score
The current hypothesis features were extracted from the n-best list files that contain the hypotheses' transcriptions along with the overall acoustic score per utterance (Figure 2.2), and from the equivalent files that contain the transcription of each word along with the start of frame, end of frame and confidence score (Figure 2.4):
Acoustic score is the negative log likelihood ascribed by the speech recogniser to the whole hypothesis, being the sum of the individual word acoustic scores. Intuitively this is considered helpful, since it reflects the confidence of the acoustic model alone for each word, and it has also been adopted in previous studies. Incorrect alignments tend to fit the model less well and thus have a low log likelihood.
Overall model confidence score is the average of the individual word confidence scores. In the absence of the real overall model confidence scores in the given corpus files, I resorted to the average of the word confidence scores as the next best approximation of the model's overall confidence, taking into account both the language and the acoustic model.
Minimum word confidence score is also computed from the individual word transcriptions and corresponds to the confidence score of the word of which the speech recogniser is least certain. It is expected to help our classifier identify poorly recognised hypotheses, since a high overall confidence score can sometimes prove misleading.
Grammar Parsability is the negative log likelihood of the transcript of the current hypothesis as produced by the Stanford Parser, a wide-coverage Probabilistic Context-Free Grammar (PCFG) parser (Klein et al. 2003, http://nlp.stanford.edu/software/lex-parser.shtml). This feature seems helpful, since we expect that a highly ungrammatical hypothesis is unlikely to match the true transcription semantically.
Hypothesis duration is the length of the hypothesis in milliseconds, as extracted from the n-best list files with per-word transcriptions, which include the start and end times of each recognised frame. The reason for the inclusion of this feature is that it can help distinguish between short utterances such as yes/no answers, medium-sized utterances of normal answers, and long utterances caused by crosstalk.
Hypothesis length is the number of words in a hypothesis and is considered to help in a similar way to the above feature.
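Two of the features above, the overall (average) and the minimum word confidence, can be read straight off the word-level rows shown in Figure 2.4. The sketch below is an illustrative reconstruction (class and method names are mine), assuming the seven-column row format described in the caption of that figure:

```java
// Sketch: computing the overall (average) and minimum word confidence
// from word-level recogniser output rows of the form
//   <start> <end> <label> <lm score> <acoustic score> <total> <confidence>
// (cf. Figure 2.4). Class and method names are illustrative.
public class ConfidenceSketch {

    // Returns {averageConfidence, minimumConfidence} over the word rows,
    // skipping the <s> and </s> sentence markers.
    static double[] wordConfidences(String[] rows) {
        double sum = 0.0, min = 1.0;  // confidences are assumed <= 1.0
        int n = 0;
        for (String row : rows) {
            String[] cols = row.trim().split("\\s+");
            String label = cols[2];
            if (label.equals("<s>") || label.equals("</s>")) continue;
            double conf = Double.parseDouble(cols[6]);  // last column
            sum += conf;
            min = Math.min(min, conf);
            n++;
        }
        return new double[] { n == 0 ? 0.0 : sum / n, min };
    }
}
```

Run on the first hypothesis of Figure 2.4, this yields an average of (0.774873 + 0.950973) / 2 and a minimum of 0.774873 (the word SOMETHING).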
The acoustic features were extracted directly from the wave files using SoX, an industry-standard open-source audio editing and conversion utility for *NIX environments:

Minimum, maximum and RMS amplitude are straightforward features, common to all the previous studies mentioned in section 1.2.
The list features were calculated based on the n-best list files with transcriptions per
utterance and per word and take into account the whole list:
N-best rank is the position of the hypothesis in the list and could be useful in
the sense that 'opt' and 'pos' are usually found in the upper part of the list
rather than the bottom.
Deviation of confidence scores in the list is the deviation of the overall model confidence score of the hypothesis from the mean confidence score in the list. This feature is extracted in the hope that it will indicate potential clusters of confidence scores in particular positions in the list, i.e. group together hypotheses that deviate from the mean in a similar fashion and are thus likely to be classified with the same label.
Match with most frequent Dialogue Move is the only boolean feature and indicates whether the Dialogue Move of the current hypothesis, i.e. the {<Speech Act><Task>} pair, coincides with the most frequent one in the list. The trend in n-best lists is for the majority of hypotheses to belong to one or two labels, with only one hypothesis belonging to 'opt' and/or a few to 'pos'. The idea behind this feature is therefore to pick out such potential outliers, which are the desired goal of the re-ranker.
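The two list features just described can be sketched as follows (a minimal illustration with names of my own choosing; a Dialogue Move is represented here simply as a string):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of two list features (illustrative names): the deviation of a
// hypothesis' confidence score from the list mean, and the boolean match
// of its Dialogue Move against the most frequent one in the n-best list.
public class ListFeatureSketch {

    // Signed deviation of one hypothesis' confidence from the list mean.
    static double deviationFromMean(double[] listScores, int index) {
        double mean = 0.0;
        for (double s : listScores) mean += s;
        mean /= listScores.length;
        return listScores[index] - mean;
    }

    // Whether the Dialogue Move at 'index' (e.g. "provide_info food_type")
    // is the most frequent move in the list.
    static boolean matchesMostFrequentMove(String[] moves, int index) {
        Map<String, Integer> counts = new HashMap<>();
        String best = moves[0];
        for (String m : moves) {
            int c = counts.merge(m, 1, Integer::sum);
            if (c > counts.get(best)) best = m;
        }
        return moves[index].equals(best);
    }
}
```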
Finally, the user simulation score is given as output by the User Simulation model created by K. Georgila and adapted for the purposes of this study (see the next section for more details). The model operates on 5-grams. Its input comes from two different sources: the history of the dialogue, namely the 4 previous Dialogue Moves, taken from the dialogue logs, and the current hypothesis' semantic parse, which is generated on the fly by the same keyword parser used in the automatic labelling.
User Simulation score is the probability that the current hypothesis' Dialogue
Move has really been said by the user given the 4 previous Dialogue Moves.
The advantage of this feature has been discussed in section 1.3.
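A toy sketch of how such a 5-gram score might be looked up is given below. The class, the key encoding and the floor value are all illustrative assumptions of mine; the real model uses smoothed (absolute-discounted) probabilities produced by the CMU-Cambridge toolkit rather than a flat table:

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of the User Simulation score: the probability of a candidate
// Dialogue Move given the 4 previous moves, looked up in a 5-gram table.
// The table is a stand-in for the real n-gram model files; the floor
// value crudely stands in for the real model's smoothing and backoff.
public class UserSimSketch {

    private final Map<String, Double> fiveGrams = new HashMap<>();

    void addFiveGram(String[] history4, String move, double prob) {
        fiveGrams.put(String.join(" | ", history4) + " -> " + move, prob);
    }

    // Probability of 'move' given the 4-move history.
    double score(String[] history4, String move) {
        String key = String.join(" | ", history4) + " -> " + move;
        return fiveGrams.getOrDefault(key, 1e-6);  // illustrative floor
    }
}
```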
2.4 System Architecture
The system developed in the context of this study is implemented mainly in JAVA, with the exception of the parts that interact with the User Simulation model of K. Georgila and with the TiMBL classifier (Daelemans et al., 2007), which were written in C/C++ using the Java Native Interface (JNI). In Figure 2.6 we can see an overview of the system's architecture. Currently the system works in off-line mode, i.e. it reads its input from the flat files that comprise the Edinburgh TownInfo corpus and performs re-ranking of each n-best list, i.e. it outputs the hypothesis whose label carries the highest degree of confidence, along with that label. For evaluation purposes it currently computes as its output the DMA, SA and WER on the training set with 10-fold cross-validation.
However, an OAA wrapper has also been included in order to enable the system to work in a real-time environment, where the input would be given directly by the speech recogniser and the dialogue logger, and the output would be passed as input to the dialogue manager.
Figure 2.6: The system's architecture
A brief description of the individual components follows:
• The keyword parser was originally written in C by K. Georgila and was adapted to Java by Neil Mayo; it is the latter's version that I included in my system. The keyword parser reads a vocabulary file which contains a simple mapping from the various domain-specific words of interest that appear in the transcripts to an intermediate reduced vocabulary that is then used by a pattern matcher. It then reads a file that contains all the patterns of the reduced
vocabulary and maps them to {<Speech Act><Task>} pairs. Note that the pattern files included with the original version of the parser by N. Mayo mapped the vocabulary to a different semantic representation. However, these were not helpful, since I wanted to keep the same formalism as that adopted in the already existing ISU logs.
• The User Simulation is written in C by K. Georgila and is ported to my system via JNI. Originally, K. Georgila had written the User Simulation as an OAA agent, but the first experiments conducted using this version proved rather inefficient in terms of runtime. The reason was that OAA itself induced unwanted overhead, possibly due to the large size of the messages transferred between my system and the agent. As a result I wrote a JNI wrapper around the original C code that interfaces its three main functions: loading the model from n-grams stored in flat files into memory, simulating the user action given the n-1 history, and killing the model.
It should be noted that the User Simulation was originally trained on both the Cambridge and the Edinburgh TownInfo corpora, a total of 461 dialogues with 4505 utterances. These were stored, as mentioned above, as n-grams in flat files produced by the CMU-Cambridge Statistical Language Model Toolkit v2, using absolute discounting for smoothing the n-gram probabilities. Since I also use the Edinburgh TownInfo corpus for training and testing TiMBL (Daelemans et al., 2007), I had to reduce the training dataset given to the User Simulation, to avoid having it trained on my system's test data as well. As a result, I subtracted the separate part of the Edinburgh TownInfo corpus, consisting of 58 dialogues containing 510 utterances, that was used to test the TiMBL classifier, and re-calculated the n-grams.
• The feature extractor is the core module of my system, written entirely in JAVA, and is responsible for reading the Edinburgh TownInfo corpus from the various flat files that make it up and extracting the features that were described in detail in the previous section. The output of this module is the training and testing datasets in ARFF format, since it was considered convenient to visualise them and measure the Information Gain with WEKA (http://www.cs.waikato.ac.nz/ml/weka/). This format can also be read by TiMBL (Daelemans et al., 2007).
• TiMBL (Daelemans et al., 2007) is written purely in C++ and usually runs in standalone mode. However, it provides a rather convenient API that enables other software to integrate it quite easily into their workflow. Since my system is written in JAVA, I wrote a JNI wrapper for it as well, porting the main API calls, namely: loading the model from a flat file in a tree-based format, training the model, testing a flat file against a trained model, predicting the class of a single vector given a trained model, and killing the model.
The input to TiMBL is a set of feature vectors comprising a combination of real-valued numbers, integers and a single boolean attribute. The classifier internally converts the numeric attributes to discrete ones, using a default of 20 classes. The output is a set of labels, one attributed to each input vector. Note that TiMBL completely ignores the fact that the input vectors actually correspond to hypotheses in an n-best list; in other words, each vector is fully independent of the others. It is the responsibility of the Feature Extractor and the Re-ranker to keep track of the position of each vector in a dialogue and an n-best list, and of its mapping to a single hypothesis.
TiMBL was trained using different parameter combinations, mainly varying the number of k-nearest neighbours (1 to 5) and the distance metric (Weighted Overlap and Modified Value Difference Metric). Quite surprisingly, there was no significant gain from using parameter combinations other than the default, namely Weighted Overlap with k = 1 neighbours.
• The Re-ranker is written in JAVA and takes as input the labels that have been assigned to each hypothesis of the n-best list under investigation, returning a hypothesis, along with the corresponding label, according to the following algorithm, in the hope that the label will assist the dialogue manager's strategies (adapted from Gabsdil and Lemon, 2004):
1. Scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'opt'.
2. If 1. fails, scan the list of classified n-best recognition hypotheses top-down. Return the first result that is classified as 'pos'.
3. If 2. fails, count the number of neg's and ign's in the classified recognition hypotheses. If the number of neg's is larger than or equal to the number of ign's, return the first 'neg'.
4. Else return the first 'ign' utterance.
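The four rules above can be implemented in a few lines. The sketch below is a direct transcription of the algorithm (the class and method names are mine; the method returns the index of the chosen hypothesis within the n-best order):

```java
import java.util.List;

// Sketch of the re-ranking step, implementing the four rules above.
// 'labels' holds the classifier's label for each hypothesis, in n-best
// order; the method returns the index of the chosen hypothesis.
public class RerankerSketch {

    static int pick(List<String> labels) {
        // Rules 1 and 2: first 'opt', else first 'pos', scanning top-down.
        for (String wanted : new String[] { "opt", "pos" }) {
            for (int i = 0; i < labels.size(); i++) {
                if (labels.get(i).equals(wanted)) return i;
            }
        }
        // Rule 3: compare the counts of 'neg' and 'ign' labels.
        int negs = 0, igns = 0;
        for (String l : labels) {
            if (l.equals("neg")) negs++;
            else if (l.equals("ign")) igns++;
        }
        // Rule 3 or 4: first 'neg' if negs >= igns, else first 'ign'.
        String fallback = negs >= igns ? "neg" : "ign";
        for (int i = 0; i < labels.size(); i++) {
            if (labels.get(i).equals(fallback)) return i;
        }
        return 0;  // degenerate list: fall back to the topmost hypothesis
    }
}
```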
2.5 Experiments
In this study the experiments were conducted in two layers: the first layer concerns only the classifier, i.e. the ability of the system to correctly classify each hypothesis into one of the four labels 'opt', 'pos', 'neg' and 'ign', and the second layer concerns the re-ranker, i.e. the ability of the system to boost the speech recogniser's accuracy.
For the first layer, I trained the TiMBL classifier using the Weighted Overlap metric and k = 1 nearest neighbours (as discussed in the previous section) on 75% of the Edinburgh TownInfo corpus, consisting of 126 dialogues containing 1554 user utterances. To each utterance correspond the 60-best lists produced by the recogniser, resulting in a total of 93240 hypotheses.
Using this corpus, I performed a series of experiments using different sets of features, in order both to determine and to illustrate the increasing performance of the classifier. These sets were determined not only by the literature but also by the Information Gain measures calculated on the training set using WEKA, as shown in Figure 2.7.
Figure 2.7: Information Gain for all 13 attributes (measured using WEKA)

InfoGain  Attribute
-------------------------------------
 1.0324   userSimulationScore
 0.9038   rmsAmp
 0.8280   minAmp
 0.8087   maxAmp
 0.4861   parsability
 0.3975   acousScore
 0.3773   hypothesisDuration
 0.2545   hypothesisLength
 0.1627   avgConfScore
 0.1085   minWordConfidence
 0.0511   nBestRank
 0.0447   standardDeviation
 0.0408   matchesFrequentDM

Quite surprisingly, we can notice that the ranking given by the Information Gain measure coincides perfectly with the logical grouping of the attributes that was initially performed (see section 2.3). As a result, I chose to keep this very grouping as the final 4 feature sets on which the experiments on the classifier were performed, in the following order:

1. List Features
2. List Features + Current Hypothesis Features
3. List Features + Current Hypothesis Features + Acoustic Features
4. List Features + Current Hypothesis Features + Acoustic Features + User Simulation

Note that the User Simulation score appears to be a very strong feature, ranking first in Information Gain and thus supporting our main hypothesis.

The testing of the classifier using each of the above feature sets was performed on the remaining 25% of the Edinburgh TownInfo corpus, comprising 58 dialogues consisting of 510 utterances; taking the full 60-best lists results in a total of 30600 vectors. In each experiment I measured Precision, Recall and F1-measure per class, as well as the total Accuracy of the classifier.
For the second layer, I used an instance of the TiMBL classifier trained on the 4th feature set (List Features + Current Hypothesis Features + Acoustic Features + User Simulation) and performed re-ranking, using the algorithm illustrated in the previous section, on the same training set used in the first layer, with 10-fold cross-validation.
2.6 Baseline and Oracle
For the first layer I chose as a baseline the scenario in which the most frequent label, 'neg', is chosen in every case in the four-way classification.
For the second layer I chose as a baseline the normal speech recogniser's behaviour, i.e. outputting the topmost hypothesis. As an oracle for the system I defined the choice of either the first 'opt' in the n-best list to be classified or, if this does not exist, the first 'pos' in the list. In this way it is guaranteed that we always get as output a hypothesis that perfectly matches the true transcript as far as its Dialogue Move is concerned, provided such a match exists somewhere in the list.
CHAPTER 3
Results
As explained in chapter 2, I performed two series of experiments in two layers: the first corresponds to the training of the classifier alone and the second to the system as a whole, measuring the re-ranker's output. A brief summary of the method follows:
• First Layer – Classifier Experiments
• Baseline
• List Features (LF)
• List Features + Current Hypothesis Features (LF + CHF)
• List Features + Current Hypothesis Features + Acoustic Features (LF
+ CHF + AF)
• List Features + Current Hypothesis Features + Acoustic Features +
User Simulation (LF + CHF + AF + US)
• Second Layer – Re-ranker Experiments
• Baseline
• 10-fold cross-validation
• Oracle
All results reported in this chapter are drawn from the TiMBL classifier trained with the Weighted Overlap metric and k = 1 nearest neighbours. Both layers are trained on the same Edinburgh TownInfo corpus of 126 dialogues containing 1554 user utterances, or a total of 93240 hypotheses. The first layer was tested on a separate Edinburgh TownInfo corpus of 58 dialogues containing 510 user utterances, or a total of 30600 hypotheses, while the second was tested on the whole training set with 10-fold cross-validation.
3.1 First Layer – Classifier Experiments
In this series of experiments I measure precision, recall and F1-measure for each of the four labels, as well as the overall F1-measure and accuracy of the classifier. In order to give a better view of the classifier's performance, I have also included the confusion matrix for the final experiment with all 13 attributes, which scores better than the rest.
Tables 3.1-3.4 show the per-class measures for each attribute set, while Table 3.5 shows a collective view of the results for the four sets of attributes and the baseline, namely the majority class label 'neg'. Table 3.6 shows the confusion matrix for the final experiment.
Feature set (opt)   Precision   Recall    F1-Measure
LF                  42.50%      58.41%    49.20%
LF+CHF              62.35%      65.71%    63.99%
LF+CHF+AF           55.59%      61.59%    58.43%
LF+CHF+AF+US        70.51%      73.66%    72.05%

Table 3.1: Precision, Recall and F1-Measure for the 'opt' category
Feature set (pos)   Precision   Recall    F1-Measure
LF                  25.18%      1.72%     3.22%
LF+CHF              51.22%      57.37%    54.11%
LF+CHF+AF           51.52%      54.60%    53.01%
LF+CHF+AF+US        64.79%      61.80%    63.26%

Table 3.2: Precision, Recall and F1-Measure for the 'pos' category
Feature set (neg)   Precision   Recall    F1-Measure
LF                  54.20%      96.36%    69.38%
LF+CHF              70.70%      74.95%    72.77%
LF+CHF+AF           69.50%      73.37%    71.38%
LF+CHF+AF+US        85.61%      87.03%    86.32%

Table 3.3: Precision, Recall and F1-Measure for the 'neg' category
Feature set (ign)   Precision   Recall    F1-Measure
LF                  19.64%      1.31%     2.46%
LF+CHF              63.52%      48.72%    55.15%
LF+CHF+AF           59.30%      48.90%    53.60%
LF+CHF+AF+US        99.89%      99.93%    99.91%

Table 3.4: Precision, Recall and F1-Measure for the 'ign' category
Feature set     F1-Measure   Accuracy
Baseline        -            51.08%
LF              37.31%       53.07%
LF+CHF          64.06%       64.77%
LF+CHF+AF       62.63%       63.35%
LF+CHF+AF+US    86.03%       84.90%

Table 3.5: F1-Measure and Accuracy for the four attribute sets.
In Tables 3.1 - 3.5 we generally notice an increase in precision, recall and F1-measure as we progressively add more attributes to the system, with the exception of the addition of the Acoustic Features, which seems to impair the classifier's performance. We also note that, in the case of the 4th attribute set, the classifier distinguishes the 'neg' and 'ign' categories very well, with 86.32% and 99.91% F1-measure respectively.
Most importantly, we obtain a remarkable boost in F1-measure and accuracy with the addition of the User Simulation score: a 37.36% relative increase in F1-measure and a 34.02% relative increase in accuracy compared to the 3rd experiment, which contains all attributes except the User Simulation score, and a 66.20% relative increase in accuracy compared to the Baseline.
In Table 3.4 we note a considerably low recall for the 'ign' category in the case of the LF experiment, suggesting that the list features alone do not add much value to the classifier, partially validating the Information Gain measure (Figure 2.7).
        opt     pos     neg     ign
opt     232     37      46      0
pos     47      4405    2682    8
neg     45      2045    13498   0
ign     5       0       0       7550

Table 3.6: Confusion matrix for the LF + CHF + AF + US set (rows: true labels, columns: predicted labels)
Taking a closer look at the 4th experiment with all 13 features, we notice in table 3.6 that most errors occur between the 'pos' and 'neg' categories. In fact, the False Positive Rate (FPR) is 18.17% for the 'neg' category and 8.9% for 'pos', both considerably larger than for the other categories.
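The per-class rates quoted above can be reproduced directly from Table 3.6. A minimal sketch (assuming, as the quoted rates suggest, that rows are true labels and columns predicted labels):

```python
# Confusion matrix from Table 3.6 (rows: true label, columns: predicted).
labels = ["opt", "pos", "neg", "ign"]
cm = [
    [232,  37,    46,    0],     # true 'opt'
    [47,   4405,  2682,  8],     # true 'pos'
    [45,   2045,  13498, 0],     # true 'neg'
    [5,    0,     0,     7550],  # true 'ign'
]
total = sum(sum(row) for row in cm)

def fpr(cm, i, total):
    """False positives for class i over all instances whose true label is not i."""
    false_pos = sum(row[i] for row in cm) - cm[i][i]
    negatives = total - sum(cm[i])
    return false_pos / negatives

for i, lab in enumerate(labels):
    print(f"FPR({lab}) = {fpr(cm, i, total):.2%}")
# FPR(neg) comes out at ~18.17% and FPR(pos) at ~8.9%, as quoted.
```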
3.2 Second Layer – Re-ranker Experiments
In this series of experiments I measure WER, DMA and SA for the system as a whole. In order to make sure that the improvement noted was really attributable to the classifier, I computed p-values for each of these measures using the Wilcoxon signed rank test for WER and McNemar's chi-square test for the DMA and SA measures.
WER DMA SA
Baseline 47.72% 75.05% 40.48%
Classifier 45.27% ** 78.22% * 42.26%
Oracle 42.16% *** 80.20% *** 45.27% ***
Table 3.7: WER, DMA and SA measures for the Baseline, Classifier and Oracle (*** indicates p < 0.001, ** p < 0.01, * p < 0.05)
In table 3.7 we note that the classifier scores 45.27% WER, a notable relative reduction of 5.13% compared to the baseline, and 78.22% DMA, a relative improvement of 4.22%. The classifier scored 42.26% on SA, but this was not statistically significant compared to the baseline (0.05 < p < 0.10). Compared with the Oracle, the classifier achieves 44.06% of the possible WER improvement on this data, 61.55% for the DMA measure and 37.16% for the SA measure.
Finally, we also notice that the Oracle scores 80.20% DMA, which means that 19.80% of the n-best lists did not include any hypothesis that matched the true transcript semantically.
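The relative figures quoted above follow mechanically from the raw scores in Table 3.7; a small sketch of the arithmetic:

```python
# Raw scores from Table 3.7 (percentages).
baseline = {"WER": 47.72, "DMA": 75.05, "SA": 40.48}
system   = {"WER": 45.27, "DMA": 78.22, "SA": 42.26}
oracle   = {"WER": 42.16, "DMA": 80.20, "SA": 45.27}

# Relative WER reduction over the baseline (lower WER is better).
rel_wer = (baseline["WER"] - system["WER"]) / baseline["WER"]

def captured(metric, lower_is_better=False):
    """Fraction of the possible baseline-to-oracle improvement achieved."""
    if lower_is_better:
        return (baseline[metric] - system[metric]) / (baseline[metric] - oracle[metric])
    return (system[metric] - baseline[metric]) / (oracle[metric] - baseline[metric])

print(f"relative WER reduction:   {rel_wer:.2%}")                 # ~5.13%
print(f"WER improvement captured: {captured('WER', True):.2%}")   # ~44.06%
print(f"DMA improvement captured: {captured('DMA'):.2%}")         # ~61.55%
print(f"SA improvement captured:  {captured('SA'):.2%}")          # ~37.16%
```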
3.3. Significance tests: McNemar's test & Wilcoxon test
McNemar's test (Tan et al., 2001) is a statistical procedure that can validate the significance of differences between two classifiers on boolean data. Let fA be the baseline and fB be our system. For each utterance we record a pair of binary outcomes: whether the true transcript matches the topmost hypothesis (for the baseline) or the output of the re-ranker (for our system), semantically for the DMA measure and on the word level for the SA measure. Recording the outcomes for fA and fB simultaneously, we construct the following contingency table:
Correct by fA Incorrect by fA
Correct by fB n00 n01
Incorrect by fB n10 n11
McNemar's test is based on the idea that there is little information in the cases where both the baseline and the classifier are correct or where both are incorrect; it is based entirely on the values of n01 and n10. Under the null hypothesis (H0) the two algorithms should have the same error rate, i.e. n01 = n10. It is essentially a χ² test and uses the following (continuity-corrected) statistic:

χ² = (|n01 − n10| − 1)² / (n01 + n10)

If H0 is correct, the probability that this statistic is bigger than χ² = 3.84 is less than 0.05, so we may reject H0 in favour of the hypothesis that the two algorithms have different performance.
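The statistic above can be computed in a few lines; the discordant counts here are illustrative, not taken from the experiments:

```python
# Continuity-corrected McNemar statistic: only the discordant counts
# n01 (fB correct, fA incorrect) and n10 (fA correct, fB incorrect) matter.
def mcnemar_statistic(n01, n10):
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical discordant counts:
chi2 = mcnemar_statistic(60, 30)
reject_h0 = chi2 > 3.84   # significant at the 0.05 level
```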
The Wilcoxon signed rank test is a statistical test for real-valued paired data that do not follow a normal distribution, as is the case with the WER distribution shown in Figure 3.1 below. I used MATLAB's version of the test, signrank.
Figure 3.1: WER distribution for the re-ranker output (x-axis: WER, y-axis: number of utterances)
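A pure-Python sketch of the signed-rank test (normal approximation, no handling of tied ranks), standing in for MATLAB's signrank:

```python
import math

def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test on paired differences."""
    d = [x for x in diffs if x != 0.0]           # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    for r, i in enumerate(order, start=1):       # ranks 1..n by |difference|
        ranks[i] = r
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    t = min(w_plus, w_minus)
    mean = n * (n + 1) / 4                       # mean of T under H0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd
    p = 1 + math.erf(z / math.sqrt(2))           # = 2 * Phi(z), since z <= 0
    return t, p

# Mostly positive paired differences: a significant shift.
t, p = wilcoxon_signed_rank([1.2, -0.5, 2.3, 1.8, -0.7, 3.1, 0.9, 2.6])
```

For small samples the exact null distribution (which signrank uses) would be preferable; this is only a sketch of the normal-approximation variant.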
CHAPTER 4
Experiment with high-level Features
The four sets of attributes already described in the previous chapters were chosen based both on previous studies and on intuition, and were also partially justified by the ranking produced by running the Information Gain measure on the Edinburgh TownInfo corpus training set.
Apart from this more traditional approach to feature selection, I also trained a Memory Based Classifier on only two higher-level features: the User Simulation score and the Grammar Parsability (US + GP). The idea behind this choice is to find a combination of features that ignores low-level characteristics of the user's utterances as well as features that rely heavily on the speech recogniser and are thus by default not considered very trustworthy.
Quite surprisingly, the results of an experiment with just the User Simulation score and the Grammar Parsability are very promising and comparable to those of the 4th experiment with all 13 attributes. Table 3.9 shows the precision, recall and F1-measure per label, and table 3.10 illustrates the classifier's performance in comparison with the 4th experiment.
Label Precision Recall F1-Measure
opt 73.99% 64.13% 68.70%
pos 76.29% 46.21% 57.56%
neg 81.87% 94.42% 87.70%
ign 99.99% 99.95% 99.97%
Table 3.9: Precision, Recall and F1-measure for the high-level features experiment
Table 3.9 shows a somewhat considerable decrease in recall and a corresponding increase in precision for the 'pos' and 'opt' categories compared to the LF + CHF + AF + US attribute set, which accounts for the lower F1-measures. However, all in all the US + GP set manages to classify 207 more vectors correctly and, quite interestingly, commits far fewer ties1 and manages to resolve a larger proportion of them than the full 13-attribute set.
Feature set F1-Measure Accuracy Ties / Resolved / %
LF+CHF+AF+US 86.03% 84.90% 4993 / 863 / 57.13%
US+GP 85.68% 85.58% 115 / 75 / 65.22%
Table 3.10: F1-measure, Accuracy, and number of ties that were correctly resolved by TiMBL for the LF+CHF+AF+US and US+GP feature sets
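The tie resolution reported in Table 3.10 can be illustrated with a small distance-weighted voting sketch; this is a hypothetical k-NN tie-break, not necessarily TiMBL's exact scheme:

```python
from collections import defaultdict

def classify(neighbours):
    """neighbours: (label, distance) pairs for the k nearest vectors."""
    votes = defaultdict(int)
    weights = defaultdict(float)
    for label, dist in neighbours:
        votes[label] += 1
        weights[label] += 1.0 / (dist + 1e-9)    # inverse-distance weight
    best = max(votes.values())
    tied = [lab for lab, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda lab: weights[lab])  # weighted tie-break

# Two 'pos' and two 'neg' votes: the closer neighbours win the tie.
label = classify([("pos", 0.2), ("pos", 0.9), ("neg", 0.5), ("neg", 0.8)])
```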
Next, I performed an experiment on the re-ranker using the aforementioned classifier, and it did not improve much on the Baseline for the DMA and SA measures (it scored 74.85% DMA, 0.2% lower than the Baseline, and 40.82% SA, 0.34% higher than the Baseline, both results statistically insignificant). For WER it scored 46.39%, a relative reduction of 2.78% compared to the Baseline, achieving 23.92% of the possible WER improvement on this dataset.
Following the success of the previous experiment on the classifier alone, I took this to its extreme and trained TiMBL with just the User Simulation score feature. Not surprisingly, the classifier scored 80.60% overall F1-measure and 81.64% accuracy but was unable to classify any of the 'opt' hypotheses correctly. As a result, it was not considered worthwhile to check the performance of the re-ranker with this rather minimal classifier.
1 In the case of k-NN algorithms we may encounter situations where a particular vector is equidistant from two or more neighbours that belong to different classes. In such cases a tie-resolving scheme, such as weighted voting, is adopted.
CHAPTER 5
Discussion and Conclusions
In this chapter we discuss the methodology applied as a whole and the results drawn from the experiments on the Edinburgh TownInfo corpus, and present some overall conclusions. The results, especially for the second layer of experiments, are limited by the following major factors:
1. The speech recogniser's performance. The oracle score for the DMA measure shows that approximately 19.80% of the n-best lists do not contain a hypothesis that matches the true transcript semantically. This partly resulted in a highly imbalanced dataset (Figure 5.1) and impaired the classifier's separability. According to Andersson (2006) there are two causes for this problem:
• Mis-timed recognition – where the microphone was not activated in time before the user started speaking and/or was deactivated before the user had finished speaking.
• Bad recognition hypotheses – where the user said something clearly but the system failed to recognise it. This can be ascribed to decoding parameters and the language model's failure to cover domain-specific vocabulary.
2. The problem we are trying to solve is somewhat trivial from a semantic point of view. In order to compute the labels and measure the DMA of each hypothesis I have used a keyword parser (see section 2.4), which "translates" each sentence into a {<Speech Act><Task>} pair. While this high level of representation seems sufficient for the User Simulation model, it also lets even highly ungrammatical hypotheses align semantically with the true transcript. Although this can be justified by the fact that in dialogue systems we are interested in the messages conveyed rather than the exact way the user uttered them, we have in this way artificially increased the baseline's DMA; in other words, erroneous topmost hypotheses still align semantically with the true transcript.
Figure 5.1: Label histogram for the Edinburgh TownInfo training set (x-axis: label, y-axis: number of hypotheses)
5.1. Automatic Labelling
The labelling used in this study is closely related to that devised by Gabsdil and Lemon (2004) and Jonson (2006). The main idea is to map each hypothesis to a class that categorises it primarily from a semantic point of view and secondarily by taking the WER into account. This is done under the notion that dialogue systems are sensitive to the meaning behind the user's utterance rather than its grammaticality.
However, the method adopted in order to ascribe a label to each hypothesis, as described in section 2.2, aggravated the problem of the imbalanced dataset mentioned in the introduction of this chapter (Figure 5.1). The 'neg' category includes both semantically aligned hypotheses with high WER (>50) and hypotheses which do not align but are addressed to the system and are thus distinguished from 'ign'. This
of course made 'neg' the majority class, with a grave numerical difference from the rest.
A way to alleviate this problem would be to split the 'neg' category into two: 'pessimistic' ('pess'), which would include semantically aligned hypotheses with high WER (>50), and 'neg', which would cover semantically mismatched hypotheses. A few preliminary experiments in this direction were conducted, but they did not achieve much accuracy and were left for future work.
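The proposed five-way split could be sketched as follows. The exact 'opt'/'pos' criteria follow section 2.2 and are assumed here (zero WER for 'opt', WER up to 50 for 'pos'), so treat this as an illustration rather than the implemented scheme:

```python
# Hypothetical five-way labelling with the 'neg' category split into
# 'pess' (semantically aligned, WER > 50) and 'neg' (semantically
# mismatched but addressed to the system). Thresholds are assumptions,
# not the thesis' exact implementation.
def label(addressed_to_system, semantic_match, wer):
    if not addressed_to_system:
        return "ign"
    if semantic_match:
        if wer == 0:
            return "opt"
        if wer <= 50:
            return "pos"
        return "pess"    # formerly lumped into 'neg'
    return "neg"
```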
5.2. Features
The features used in this study are divided into four groups, which also reflect the series of experiments that were conducted: list features, current hypothesis features, acoustic features and the User Simulation score. All of these features, with the exception of the User Simulation score, have been used in previous similar studies with successful results, a strong justification for including them in my system as well. However, as illustrated in chapter 4, not all of them contributed to the classifier's performance.
The list features alone (1st experiment) did not make the classifier significantly better than the Baseline. It seems that, at least in the Edinburgh TownInfo corpus, we cannot rely on labels clustering at specific positions within the list (something that the n-best rank feature or the standard deviation of confidence scores, for example, could capture).
This phenomenon is particularly evident in the case of the 'ign' label, as shown in table 3.4. This seems quite reasonable, since when an utterance is actually crosstalk, most of the time all the hypotheses in the n-best list are labelled 'ign', rendering list-wise features such as n-best rank and standard deviation of confidence score useless.
The current hypothesis features (2nd experiment) contributed significantly to the classifier's performance, which was quite expected, since they include attributes such as the speech recogniser's confidence and acoustic score, which by default are the foremost suspects for the mediocre performance of the 1-best top hypothesis baseline system. The inclusion of the grammar parsability and the minimum word confidence score seems to separate the hypotheses well, especially between the 'opt'–'pos' and 'neg'–'ign' categories. In this way they validate the assumption that fairly ungrammatical and/or marginally acceptable utterance recognitions (which might have a high confidence score on average even though some of their words are not recognised with much confidence by the acoustic and/or language model, e.g. utterances with wrong syntax) do not carry the correct semantic information compared to the true transcript.
On the other hand, the acoustic features (3rd experiment), though they seemed promising from the Information Gain ranking (Figure 2.7) and the literature, actually impaired the accuracy of the classifier (Table 3.5). This may be because the minimum, maximum and RMS amplitude values correspond to a single wave file and are thus the same for all the hypotheses in an n-best list. From a dataset point of view, we are essentially multiplying the probability mass of a certain value for each of these attributes without the values being unique. As a result, we artificially boost their importance, which in turn misleads the Information Gain and Gain Ratio measures used internally by TiMBL. In other words, these attributes score high in both measures because the same values occur many times, not because they are unique and therefore essentially useful.
The addition of the User Simulation score (4th experiment) gives a remarkable boost to the classifier's performance, which validates the main hypothesis of this study as far as the classification of each hypothesis to a certain label is concerned. What is most striking is the fact that the User Simulation score helps the classifier distinguish very clearly the 'ign' and then the 'neg' category, i.e. the categories corresponding to hypotheses that mostly differ semantically from the true transcript or do not address the system.
Especially in the case of the 'ign' category, when the user does not address the system, the User Simulation almost always models the move very accurately. In other words, given a history of 4 dialogue moves (the User Simulation uses 5-grams) with the current move being semantically empty, {<[null]>,<[null]>}, it assigns it the highest probability it can give (as shown in Figure 5.2). This makes sense, since if the user does not currently address the system, the dialogue that has preceded is rather fixed and can thus be modelled easily. An equivalent justification exists when the user says something that does not align semantically with the true transcript and/or is erroneous, and has thus caused the system to respond in a fixed way in the past. Bear in mind that we are considering a dialogue system whose responses and vocabulary (and the user's as well) are rather limited.
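The behaviour described here can be illustrated with a toy maximum-likelihood 5-gram model over dialogue moves (a simplification of the actual User Simulation model, with hypothetical move names):

```python
from collections import Counter, defaultdict

class MoveNgram:
    """Toy MLE n-gram model over dialogue moves (no smoothing)."""
    def __init__(self, n=5):
        self.n = n
        self.ctx = defaultdict(Counter)

    def train(self, dialogues):
        for moves in dialogues:
            padded = ["<s>"] * (self.n - 1) + moves
            for i in range(self.n - 1, len(padded)):
                history = tuple(padded[i - self.n + 1:i])
                self.ctx[history][padded[i]] += 1

    def score(self, history, move):
        counts = self.ctx[tuple(history)]
        total = sum(counts.values())
        return counts[move] / total if total else 0.0

# If crosstalk ('null' moves) always follows the same fixed history in
# the training dialogues, the model assigns it probability 1.0.
model = MoveNgram()
model.train([["greet", "request_info", "confirm", "null", "null"]] * 3)
p = model.score(["greet", "request_info", "confirm", "null"], "null")
```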
Figure 5.2: Histogram of User Simulation score for the Ed TownInfo training set
On the other hand, in the case of the 'opt' and 'pos' categories the User Simulation is less certain (Figure 5.2), for exactly the opposite reason as in the case of the 'ign' and 'neg' categories. With correctly recognised hypotheses, the dialogue between the system and the user may progress rather quickly, in the sense that the system does not need to explicitly or implicitly confirm the user's utterances. This means that the course of the dialogue can vary widely and is thus more difficult to model ({<[provide_info]>,<[hotel_price]>} can occur in many different contexts compared to {<[null]>,<[null]>}). This is partially validated by the additional experiment in which I trained TiMBL with just the User Simulation score as a feature and noticed that it was not able to classify any of the 'opt' hypotheses correctly.
5.3 Classifier
The TiMBL classifier seems rather well-suited for modelling dialogue context-based speech recognition using the User Simulation as an extra feature. Though every effort was made to keep the method for feature selection and model optimisation as consistent as possible (Information Gain ranking), I believe the classifier would benefit from a more systematic, exhaustive search through all the possible combinations of features and/or parameter settings, such as the "leave-one-out" method adopted by Gabsdil and Lemon (2004), which increased their classifier's accuracy by 9%.
The main drawback of our trained classifier was the high false positive rate for the 'pos' category. As is evident in table 3.6, the 'pos' category is easily confused with the 'neg' category. A possible cause is the fact that the hypotheses belonging to the 'neg' category far outnumber those belonging to 'pos', as described in the introduction of this chapter. Another explanation is that the 'neg' category includes semantically aligned hypotheses (with high WER), as does the 'pos' category, so most of the features used cannot distinguish the two classes well in this case. For example, a hypothesis' duration is the same whether or not the recogniser captures something semantically aligned with the true transcription.
5.4 Results
In the first layer of experiments I managed to train a considerably efficient classifier using all 13 attributes, scoring 86.03% F1-measure and 84.90% accuracy. The User Simulation score seems to be the key attribute that accounts for most of the classifier's ability to separate the four classes. In favour of this hypothesis is the extra experiment performed towards the end of this study, in which I trained TiMBL with just the User Simulation score and the Grammar Parsability and scored 85.68% F1-measure and 85.58% accuracy.
This latter experiment seems very promising in the sense that we can get acceptable results with just two attributes, resulting in a very robust and efficient system. What is more, the higher-level nature of these attributes poses an interesting question as to the approach that should be followed in the post-processing of speech recognition output. We should bear in mind, though, that neither feature can be extracted directly from the n-best lists or the wave files of the user's utterances; they instead involve applying models to the dialogue and to the syntax of each hypothesis. This means that we are essentially introducing a time overhead to the system, which is crucial in the case of a real-time dialogue system. Dealing with 60-best lists incurs a fairly acceptable overhead for the User Simulation model, which does not have to account for too many different states, as is the case in domain-specific dialogue systems. However, this is not always the case, especially with a wide-coverage grammar parser, which sometimes has to parse long sentences and may slow down the overall response of the system. Using a more domain-specific and efficient parser than the one used in this thesis should effectively alleviate this problem.
In the second layer of experiments the performance of the re-ranker is equally encouraging. The system achieved a relative reduction in WER of 5.13%, a relative increase in DMA of 4.22%, and a relative increase in SA of 4.40%, with only the latter not being statistically significant (0.05 < p < 0.10) compared to the Baseline.
In the case of dialogue systems we are primarily interested in a gain in the DMA measure, which would essentially mean that our re-ranker is helping the system better "understand" what the user really said, and it seems that my system can improve the performance of the speech recogniser. Even though the increase is somewhat small compared to previous studies, it still shows that my system is robust, gaining 61.55% of the possible DMA improvement, and the result is statistically significant. The same applies to the relative improvement in WER over the Baseline, which amounts to 44.06% of the possible boost in the overall performance of the speech recogniser.
A possible reason for not gaining a very large increase in the WER and DMA results, and for the statistically insignificant improvement in the SA measure, was the limited size of the test data. What is more, as described in the introduction of this chapter, the problem we are trying to solve is rather trivial from a semantic point of view, resulting both in the Baseline having an already high DMA and SA and in a very tight margin between the Baseline and the Oracle, leaving only a small improvement to be achieved.
5.5. Future work
Some ideas for future work have already been mentioned in previous sections of this chapter: the adoption of a fifth label ('pess') that would split the 'neg' category and therefore bring balance to the dataset; a more systematic search over the combinations of TiMBL's features and parameters using the "leave-one-out" approach of Gabsdil and Lemon (2004); and the use of a more domain-specific and more efficient grammar parser.
Another useful improvement would be to use a more elaborate semantic parser than the keyword parser I used in my system, one which would take into account not only the presence or absence of certain keywords in each hypothesis but also some semantic function among the uttered words. In this way we would end up with a more difficult problem for the re-ranker to solve, which would essentially reduce the accuracy of the baseline, i.e. merely choosing the topmost hypothesis.
Finally, the current system is already implemented in a way that adheres to the OAA practices and is thus very easy to integrate with a real dialogue system such as DIPPER and the TownInfo dialogue system. In this way, we would be able to evaluate the system on truly unseen data and test it against the baseline system, which relies on the topmost 1-best hypothesis of the speech recogniser alone.
5.6. Conclusions
The system developed was tested in two layers, namely experiments that involved
the classifier alone and experiments that concerned the re-ranker. For the first layer I
conducted four experiments by training the classifier with an increasing number of
features:
• List Features (LF)
• List Features + Current Hypothesis Features (LF + CHF)
• List Features + Current Hypothesis Features + Acoustic Features (LF
+ CHF + AF)
• List Features + Current Hypothesis Features + Acoustic Features +
User Simulation (LF + CHF + AF + US)
Out of the four experiments the 4th gave the best results, with 86.03% F1-measure and 84.90% accuracy, yielding a 66.20% relative increase in accuracy compared to the Baseline.
I also conducted an additional experiment in which the classifier was trained on a limited set of high-level features, namely the User Simulation score and the Grammar Parsability feature, and scored 85.68% F1-measure and 85.58% accuracy, a 67.54% relative increase in accuracy compared to the Baseline.
For the second layer of experiments the re-ranker scored a relative reduction in WER of 5.13%, a relative increase in DMA of 4.22%, and a relative increase in SA of 4.40%, with only the latter not being statistically significant (0.05 < p < 0.10) compared to the Baseline. Comparing the re-ranker's performance with the Oracle, it achieved 44.06% of the possible WER improvement on this data, 61.55% for the DMA measure and 37.16% for the SA measure.
This study has shown that a system that re-ranks the n-best lists output by the speech recogniser module of a dialogue system can improve the recogniser's performance. It has also validated the main hypothesis that this boost in performance can be achieved to a considerable extent using a User Simulation model of the dialogues between the system and the user.
References
Andersson, S. (2006), “Context Dependent Speech Recognition”, MSc Dissertation,
University of Edinburgh, 2006.
Boros, M., Eckert, W., Gallwitz, F., Gorz, G., Hanrieder, G. and Niemann, H. (1996),
“Towards Understanding Spontaneous Speech: Word Accuracy vs. Concept Accur-
acy”, in Proceedings of International Symposium on Spoken Dialogue, ICSLP-96,
Philadelphia, USA, pp. 1005–1008.
Bos, J., Klein, E., Lemon, O. and Oka, T. (2003), “Dipper: Description and formal-
isation of an information-state update dialogue system architecture”, in 4th SIGdial
Workshop on Discourse and Dialogue, Sapporo, Japan, pp. 115–124.
Boyce, S., and Gorin, A. L. (1996), “User interface issues for natural spoken dia-
logue systems”, in Proceedings of International Symposium on Spoken Dialogue, pp.
65–68.
Cheyer, A. and Martin, D. (2001), “The open agent architecture”, Journal of
Autonomous Agents and Multi-Agent Systems 4(1), 143–148.
Chotimongkol, A. and Rudnicky, A. (2001), “N-best speech hypotheses reordering
using linear regression”, in Proceedings of EuroSpeech, pp. 1829–1832.
Cohen, W. (1996), “Learning trees and rules with set-valued features”, in Proceed-
ings of the Association for the Advancement of Artificial Intelligence, AAAI-96.
Daelemans, W., Zavrel, J., van der Sloot, K. and van den Bosch, A. (2007), “TiMBL:
Tilburg Memory Based Learner”, version 6.1 Reference Guide, ILK Technical Report
07-07.
Gabsdil, M. (2003), “Classifying Recognition Results for Spoken Dialogue
Systems”, in Proceedings of the Student Research Workshop at ACL–03.
Gabsdil, M. and Lemon, O. (2004), “Combining acoustic and pragmatic features to
predict recognition performance in spoken dialogue systems”, in Proceedings of
ACL, Barcelona, Spain, pp. 343–350.
Georgila, K., Henderson, J. and Lemon, O. (2006), “User Simulation for Spoken Dia-
logue Systems: Learning and Evaluation”, in Proceedings of the 9th International
Conference on Spoken Language Processing (INTERSPEECH–ICSLP-06), Pitts-
burgh, USA.
Gorin, A., L., Riccardi G. and Wright, J., H. (1997), “How may I help you?”, Journ-
al of Speech Communication, 23(1/2), pp. 113–127.
Gruenstein, A. (2008), “Response-Based Confidence Annotation for Spoken Dia-
logue Systems”, in Proceedings of the 9th SIGdial Workshop on Discourse and Dia-
logue, Columbus, Ohio, USA, pp. 11–20.
Gruenstein, A. and Seneff, S. (2007), “Releasing a multimodal dialogue system into
the wild: User support mechanisms”, in Proceedings of the 8th SIGdial Workshop on
Discourse and Dialogue, pp 111–119.
Gruenstein, A., Seneff S. and Wang C., (2006), “Scalable and portable web-based
multimodal dialogue interaction with geographical databases”, in Proceedings of
INTERSPEECH-06.
Jonson, R. (2006), “Dialogue context-based re-ranking of ASR hypotheses”, in
Spoken Language Technology Workshop, IEEE, Palm Beach, Aruba, pp. 174–177.
Lemon, O. (2004), “Context-sensitive speech recognition in ISU dialogue systems:
results for the grammar switching approach”, in Proceedings of the 8th Workshop on
the Semantics and Pragmatics of Dialogue, CATALOG-04.
Lemon, O., Georgila, K. and Henderson, J. (2006), “Evaluating effectiveness and
portability of reinforcement learned dialogue strategies with real users: The TALK
TownInfo evaluation”, in Spoken Language Technology Workshop, IEEE, pp. 178–
181.
Lemon, O., Georgila, K., Henderson, J. and Stuttle, M. (2006), “An ISU dialogue
system exhibiting reinforcement learning of dialogue policies: Generic slot-filling in
the talk in-car system”, in Proceedings of European Chapter of the ACL, EACL-06,
Trento, Italy, pp. 119–122.
Kamm, C., Litman, D. and Walker, M., A. (1998), “From novice to expert: The effect
of tutorials on user expertise with user dialogue systems”, in Proceedings of the In-
ternational Conference on Spoken Language Processing, ICSL-98.
Klein, D. and Manning, C., D. (2003), “Fast Exact Inference with a Factored Model
for Natural Language Parsing”, in Journal of Advances in Neural Information Pro-
cessing Systems 15, NIPS-02, Cambridge, MA, MIT Press, pp. 3–10.
Litman, D., Hirschberg, J. and Swerts, M. (2000), “Predicting automatic speech re-
cognition performance using prosodic cues”, in Proceedings of NAACL-00.
Litman, D. and Pan, S. (2000), “Predicting and adapting to poor speech recognition
in a spoken dialogue system”, in Proceedings of the Association for the Advancement
of Artificial Intelligence, AAAI-00, Austin, USA, pp. 722–728.
Pickering, M., J. and Garrod S. (2007), “Do people use language production to make predictions during comprehension?”, Journal of Trends in Cognitive Sciences, 11(3), pp.105–110.
Rudnicky, A., I., Bennett, C., Black, A., W., Chotimongkol, A., Lenzo, K., Oh, A. and
Singh R. (2000), “Task and Domain Specific Modeling in the Carnegie Mellon Com-
municator System”, in Proceedings of the International Conference on Spoken Lan-
guage Processing, ICSLP’00, Beijing, China.
Tan, C., M., Wang, Y., F. and Lee, C., D. (2001), “The use of bigrams to enhance text
categorization”, Journal of Information Processing & Management, 38(4), pp. 529–
546.
Walker M., Fromer, J., C. and Narayanan S. (1998), “Learning optimal dialogue
strategies: A case study of a spoken dialogue agent for email”, in Proceedings of the
36th Annual Meeting of the Association for Computational Linguistics, COLING/ACL-98,
pp. 1345–1352.
Walker, M., Wright, J. and Langkilde, I. (2000), “Using natural language processing
and discourse features to identify understanding errors in a spoken dialogue system”,
in Proceedings of International Conference on Machine Learning, ICML-00.
Young, S. (2004), “ATK An Application Toolkit for HTK”, 1.4.1 edn. Technical
Manual.