
Fusion in Multimodal Interactive Systems: An HMM-Based Algorithm for User-Induced Adaptation

Bruno Dumas, WISE Lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium, [email protected]

Beat Signer, WISE Lab, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium, [email protected]

Denis Lalanne, DIVA Group, Université de Fribourg, Boulevard de Pérolles 90, 1700 Fribourg, Switzerland, [email protected]

ABSTRACT
Multimodal interfaces have been shown to be ideal candidates for interactive systems that adapt to a user either automatically or based on user-defined rules. However, user-based adaptation demands advanced software architectures and algorithms. We present a novel multimodal fusion algorithm for the development of adaptive interactive systems which is based on hidden Markov models (HMM). In order to select relevant modalities at the semantic level, the algorithm is linked to temporal relationship properties. The presented algorithm has been evaluated in three use cases, from which we were able to identify the main challenges involved in developing adaptive multimodal interfaces.

Author Keywords
Multimodal interaction, multimodal fusion, HMM-based fusion, user interface adaptation

ACM Classification Keywords
H.5.2 Information Interfaces and Presentation: User Interfaces—Graphical user interfaces, Prototyping, Theory and methods

General Terms
Algorithms; Human Factors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EICS'12, June 25–26, 2012, Copenhagen, Denmark. Copyright 2012 ACM 978-1-4503-1168-7/12/06...$10.00.

INTRODUCTION
Multimodal interaction has been shown to enhance human-computer interaction by providing users with an interaction model that is closer to human-human interaction than standard interaction via mouse or keyboard. By providing users with a number of different modalities and by offering them the possibility to use these modalities either in a complementary or in a redundant manner, multimodal interfaces take advantage of the parallel processing capabilities of the human brain [25]. Furthermore, due to the different ways in which modalities can be combined, multimodal interfaces have the potential to adapt to a user by offering them one or multiple preferred schemes of interaction. However, research on the combination or fusion of different modalities has progressed more slowly than the appearance of new modalities. In particular, fusion-related topics such as the dynamic adaptation of fusion engines, for example with the help of machine learning techniques, still need to be addressed in more detail by the scientific community [21].

In this article, we present a novel algorithm for the fusion of input modalities in multimodal interfaces. In the past few years, we focussed on studying different aspects of multimodal input fusion. To this end, a framework for the prototyping of multimodal interfaces, called HephaisTK, has been developed. HephaisTK allows developers to focus on the creation of their multimodal applications, without having to worry about the integration of input libraries, input data normalisation, concurrency management or the implementation of fusion algorithms. The HephaisTK framework comes along with a number of fusion algorithms for managing input data. Among the different algorithms, a symbolic-statistical fusion algorithm based on hidden Markov models (HMMs), which offers the possibility to take a user's feedback into account, has been developed. While this novel multimodal fusion algorithm represents a first major contribution of our paper, a second contribution is the identification of a number of challenges to be overcome for the introduction of user-induced automatic multimodal interface adaptation, based on some of our initial explorations.

We start by presenting related work on prototyping tools and multimodal fusion algorithms with respect to the adaptation to the user and context. The following section introduces the HephaisTK framework and its architecture, as well as the SMUIML language, on which the presented algorithm has been built. The HMM-based fusion algorithm that we used for the user adaptation is then presented and we describe how to integrate such an algorithm in a typical multimodal system. This is followed by an evaluation of our HMM-based algorithm. Finally, we provide some conclusions and outline possible future work.

RELATED WORK
Research on fusion algorithms and prototyping tools for multimodal interactive systems has been an active field during the last two decades. In this section, we introduce the state of the art on fusion algorithms for multimodal interactive systems, provide a summary of notable results on prototyping tools and give an overview of adaptation to user and context.

Fusion of Multimodal Input Data
On a conceptual level, Sharma et al. [29] consider the data, feature and decision levels for the fusion of incoming data. Each fusion scheme can operate at a different level of analysis of the same modality channel. Data-level fusion considers data from a modality channel in its rawest form, where fusion is generally used on different channels of the same modality in order to enhance or extract new results. Adaptation of data-level fusion typically focusses on taking into account contextual information to improve results. Feature-level fusion is used on tightly coupled, synchronised modalities, such as speech and lip movement. Low-level features extracted from raw data are usually processed with machine-learning algorithms. Adaptation on the feature level can be used when multiple data sources provide data for the same modality. Decision-level fusion is the most common type of fusion in interactive multimodal applications, because of its ability to extract meaning from loosely coupled modalities. Complex forms of adaptation can be applied at the decision level, including the adaptation to the number of users, user profiles or devices. On an algorithmic level, typical decision-level fusion algorithms are frame-based fusion, unification-based fusion and hybrid symbolic-statistical fusion.

• Frame-based fusion [18] uses data structures called frames or features for the meaning representation of data coming from various sources or modalities (a minimal illustrative sketch of this approach follows the list below).

• Unification-based fusion [30] is based on the recursive merging of attribute/value structures to obtain a logical meaning representation. However, both frame-based as well as unification-based fusion rely on a predefined and non-evolutive behaviour.

• Symbolic-statistical fusion [6, 31] is an evolution of standard symbolic unification-based approaches, which adds statistical processing techniques to the fusion techniques described above. These hybrid fusion techniques have been demonstrated to achieve robust and reliable results. A classical example of a symbolic-statistical hybrid fusion technique is the Member-Team-Committee architecture used in Quickset [9]. However, these techniques need training data specific to the targeted application.
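As a concrete point of reference for the frame-based approach, the sketch below illustrates the general idea in Java: a frame declares named slots bound to modalities, and incoming events fill those slots within a time window until the frame is complete. The Frame and InputEvent types and the "put that there" slots are purely illustrative and do not come from any of the cited systems.

import java.util.*;

// Minimal illustration of frame-based fusion: a frame is a set of named
// slots, each bound to a modality; it fires once every slot has been
// filled by an input event arriving within the frame's time window.
public class FrameFusionSketch {

    record InputEvent(String modality, String value, long timeMs) {}

    static class Frame {
        final Map<String, String> slotModalities;   // slot name -> expected modality
        final Map<String, InputEvent> filled = new HashMap<>();
        final long windowMs;

        Frame(Map<String, String> slotModalities, long windowMs) {
            this.slotModalities = slotModalities;
            this.windowMs = windowMs;
        }

        // Try to bind an event to the first empty slot expecting its modality.
        void offer(InputEvent e) {
            // Discard earlier bindings that fall outside the time window.
            filled.values().removeIf(old -> e.timeMs() - old.timeMs() > windowMs);
            for (var slot : slotModalities.entrySet()) {
                if (!filled.containsKey(slot.getKey())
                        && slot.getValue().equals(e.modality())) {
                    filled.put(slot.getKey(), e);
                    return;
                }
            }
        }

        boolean complete() {
            return filled.keySet().containsAll(slotModalities.keySet());
        }
    }

    public static void main(String[] args) {
        // A "put that there"-style frame: one speech command plus two pointing events.
        Frame putThatThere = new Frame(Map.of(
                "command", "speech",
                "source", "pointing",
                "target", "pointing"), 1500);

        putThatThere.offer(new InputEvent("speech", "put that there", 0));
        putThatThere.offer(new InputEvent("pointing", "(120, 45)", 200));
        putThatThere.offer(new InputEvent("pointing", "(300, 80)", 650));

        System.out.println("frame complete: " + putThatThere.complete());
    }
}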

On a modelling level, the CARE properties as defined by Coutaz and Nigay [10] show how modalities can be composed. Note that CARE stands for complementarity, assignment, redundancy and equivalence. Complementary modalities will need all modalities for the meaning to be extracted, assigned modalities allocate one and only one modality to each meaning, redundant modalities state that all modalities can lead to the same meaning and equivalent modalities assert that any modality can lead to the same meaning. The difference between redundancy and equivalence is the way in which cases with multiple modalities occurring at the same time are dealt with.

Multimodal Authoring Frameworks
Quickset by Cohen et al. [9] is a speech/pen multimodal interface based on the Open Agent Architecture (http://www.openagent.com), which served as a test bed for unification-based and hybrid fusion methods. The integration of multimodal input fusion with the modelling of multimodal human-machine dialogues was introduced by IMBuilder and MEngine [4], which both make use of finite state machines. Flippo et al. [14] use a parallel application-independent fusion technique based on a software agent architecture; in their system, fusion has been realised by using frames. After these original explorations, the last few years have seen the appearance of a number of fully integrated tools for the creation of multimodal interfaces. Among these tools are comprehensive open source frameworks such as OpenInterface [28] and Squidy [20]. These frameworks share a similar conceptual architecture with different goals. While OpenInterface targets pure or combined modalities, Squidy was created as a particularly effective tool for streams composed of low-level data. Finally, in contrast to the linear, stream-based architecture adopted by most other solutions, the Mudra [16] framework adopts a service-based architecture. However, few of these tools provided services to test, use and compare multiple fusion algorithms.

Adaptation to User and Context
When considering user adaptation, different ways of adapting the user interface can be considered. On the one hand, the user interface can be created in such a way that it will adapt automatically and without any user intervention. The adaptation can thus be performed automatically to the context of use or to the user. On the other hand, users can be given the possibility to modify the user interface in a proactive way. More formally, Malinowski et al. [23] proposed a complete taxonomy for user interface adaptation. Their taxonomy is based on four different stages of adaptation in a given interface, namely initiative, proposal, decision and execution. Each of these adaptations at the four different stages can be performed by the user, the machine or both. Lopez-Jaquero et al. proposed the ISATINE framework [22]. This framework introduces seven stages of adaptation, including the goals, the initiative, the specification, the application, the transition, the interpretation as well as the evaluation of adaptation. Interestingly, while adaptation has been investigated for traditional WIMP interfaces [5] or context-aware mobile interfaces [1], less work has been devoted to user-induced adaptation in the context of multimodal interfaces. Octavia et al. [24] explored adaptation in virtual environments, in particular how to build the user model.

The HephaisTK framework that we are going to present in the next section, and in particular a novel adaptive multimodal fusion algorithm, contribute to this largely unexplored research on user-induced adaptive multimodal interfaces.

HEPHAISTK FRAMEWORK
Before we present our new solution for user-induced adaptation in multimodal interaction, we introduce the framework in which the new algorithm has been integrated, since its architecture, modules and scripting language influenced the choices we made in terms of our user adaptation solution.

HephaisTK is a framework that has been built to help developers in creating multimodal interfaces via a range of tools [12]. First, the HephaisTK framework shown in Figure 1 manages, stores and presents data coming from a number of input recognisers in a uniform way, with the possibility to query past input events. Second, the tool provides different multimodal input data fusion algorithms, including a classical implementation of frame-based fusion as well as our novel algorithm presented in the next section. Third, multimodal human-machine dialogue is described by means of a high-level modelling language, called SMUIML, which is linked to the CARE properties [10]. Compared to the tools in the related work section, HephaisTK focusses on the study of fusion algorithms, as well as on the high-level modelling of multimodal human-machine interaction through the SMUIML language.

Figure 1. HephaisTK architecture, showing the input recognisers and their agents, the Postman with its database, EventNotifier and LogWatcher, the integration committee (FusionManager, DialogManager, FissionManager), the SMUIML document that instantiates it, user feedback and adaptation data, and the Java client application (restitution and input)

Developers of multimodal interfaces define the behaviour of their multimodal interface in a configuration file loaded by HephaisTK. This configuration file defines which recognisers to use (input/output level), how an event is triggered and how it is linked to the human-machine dialogue (event level), and the various states of the dialogue and the transitions between these states (dialogue level). The adaptation level links the dialogue with contextual and user information. To model this human-machine dialogue for a specific client application, the SMUIML language has been developed [13].

The SMUIML language is divided into three different layers dealing with different levels of abstraction, as shown in Figure 2. The lowest level specifies the different modalities which will be used in the context of an application, as well as the particular recognisers to be used to access the different modalities. The middle level addresses input and output events. Input events are called triggers whereas output events are called actions. Triggers are defined per modality and are therefore not directly tied to specific recognisers. They can express different ways to trigger a particular event. For example, a speech trigger can be defined in such a way that "clear", "erase" and "delete" will all lead to the same event. Actions are the messages that the framework will send to the client application. The highest level of abstraction in Figure 2 describes the actual human-machine dialogue by means of defining the contexts of use and interweaving the different input events and output messages between those different contexts, as well as links to adaptation-related information. The resulting description takes the form of a state machine, in a similar way as Bourguet's IMBuilder [4]. The combination of modalities is defined based on the CARE properties as well as the (non-)sequentiality of input triggers.
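To illustrate the trigger concept, the following hypothetical Java sketch (not actual HephaisTK code; all names are made up) normalises several speech utterances to a single trigger event, which is what a SMUIML trigger expresses declaratively.

import java.util.*;

// Illustrative trigger normalisation: several utterances of one modality
// map to a single event name, mirroring what a SMUIML trigger declares.
public class TriggerSketch {

    // trigger name -> accepted utterances for the speech modality
    static final Map<String, Set<String>> SPEECH_TRIGGERS = Map.of(
            "clear_trigger", Set.of("clear", "erase", "delete"),
            "play_trigger",  Set.of("play", "start"));

    static Optional<String> toTrigger(String utterance) {
        return SPEECH_TRIGGERS.entrySet().stream()
                .filter(e -> e.getValue().contains(utterance.toLowerCase()))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(toTrigger("erase"));   // Optional[clear_trigger]
        System.out.println(toTrigger("stop"));    // Optional.empty
    }
}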

Figure 2. SMUIML and its three levels of abstraction: the input/output level with the individual recognisers and the Java client application, the event level with triggers and actions, and the dialogue level with contexts and transitions, complemented by a user and context model at the adaptation level

To come back to the overall HephaisTK architecture, an integration committee is in charge of interpreting the multimodal input data in order to create a suitable answer for the user. The different components of the integration committee are shown in Figure 1. The FusionManager and FissionManager are linked to the DialogManager, which informs them about the current state of the interaction and what they can expect as input information. The DialogManager relies on the interpretation of the application-specific SMUIML script for managing the interaction. The FusionManager focusses on the interpretation of incoming input data based on information provided by the DialogManager. It also encapsulates the different available fusion algorithms. In addition to a classic frame-based multimodal fusion algorithm, we have developed an HMM-based multimodal fusion algorithm which is described in the next section.

HMM-BASED MULTIMODAL FUSION ALGORITHM
As presented in the previous section, the HephaisTK framework has been designed to serve as a test platform for multimodal input data fusion algorithms. Based on HephaisTK, different algorithms including our new machine learning-based fusion algorithm have been implemented and tested.


The idea of mixing the adaptiveness of machine learning-based recognition algorithms with the expressive power of higher-level rules has recently been getting more attention, with examples demonstrating the overall effectiveness in recognition rates and expressiveness of such hybrid approaches [7]. However, symbolic-statistical approaches, such as the ones mentioned in the related work section, tend not to take into account temporal relationships between modalities, which we believe is a necessary feature of fusion algorithms for multimodal interactive systems. Furthermore, the adaptation to user behaviour has only been sparsely studied. We also have to stress that machine learning algorithms have mainly been used for multimodal input fusion at the feature level, typically to improve the recognition results of a specific modality or in offline analysis systems such as biometrics or meeting browsers. At the decision level, however, purely rule-based approaches such as frame-based or unification-based fusion have been the norm so far.

Goals of the Algorithm
When considering which type of machine learning should be used, hidden Markov models have been prioritised because of their easy adaptation to time-related processes. Hidden Markov models have historically been used in temporal pattern recognition tasks, such as speech [19, 26], handwriting or gesture recognition. Since the fusion of input data also focusses on time-dependent patterns, HMMs have been seen as one of the obvious choices when considering the different alternatives among statistical models. Due to the fact that SMUIML models the human-machine interaction via states, the transition from the dialogue model to a Markov process is relatively straightforward. Hidden Markov models have already been used for tasks such as video segmentation [17] or biometric person recognition [8] using multimodal input data at the data or feature level. However, to the best of our knowledge, HMMs have not yet been applied for the fusion of multimodal input events at the decision level in the context of real-time multimodal human-machine interfaces.

Our goals when designing the new multimodal fusion algorithm were threefold:

1. Consider a class of machine learning algorithms which has shown its efficiency at modelling the flow of events over time and is thus ideally suited for the modelling of multimodal human-computer interaction.

2. Take into account, in the fusion recognition process, a set of the most likely results from probability-based modality recognisers, such as speech or gesture recognisers.

3. Be able to adapt and correct results from the fusion algorithm on-the-fly based on user feedback.

Hidden Markov Models
As presented by Rabiner [26], hidden Markov models are based on discrete Markov processes, in particular discrete time-varying random phenomena for which the Markov property holds. In place of directly observing the states of the discrete Markov process, a hidden Markov model observes the (hidden) states through a set of stochastic processes that produce the sequence of observations. The most likely sequence of states, given a set of observations, is extracted by the Viterbi algorithm [15]. Finally, the training of hidden Markov models is achieved with the help of the Baum-Welch algorithm [2], a particular case of a generalised expectation-maximisation algorithm.
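For readers unfamiliar with the decoding step, the following self-contained Java sketch shows Viterbi decoding for a small discrete HMM; the transition, emission and initial probabilities are toy values chosen for illustration only (HephaisTK itself relies on the Jahmm library, as described in the next subsection).

// Plain-Java Viterbi decoding for a discrete HMM: given transition matrix A,
// emission matrix B, initial distribution pi and an observation sequence,
// return the most likely hidden state sequence. Toy numbers for illustration.
public class ViterbiSketch {

    static int[] viterbi(double[] pi, double[][] A, double[][] B, int[] obs) {
        int n = A.length, t = obs.length;
        double[][] delta = new double[t][n];   // best log-probability so far
        int[][] psi = new int[t][n];           // back-pointers

        for (int i = 0; i < n; i++) {
            delta[0][i] = Math.log(pi[i]) + Math.log(B[i][obs[0]]);
        }
        for (int k = 1; k < t; k++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int i = 0; i < n; i++) {
                    double p = delta[k - 1][i] + Math.log(A[i][j]);
                    if (p > best) { best = p; bestPrev = i; }
                }
                delta[k][j] = best + Math.log(B[j][obs[k]]);
                psi[k][j] = bestPrev;
            }
        }
        // Backtrack from the most probable final state.
        int[] path = new int[t];
        for (int j = 1; j < n; j++) {
            if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
        }
        for (int k = t - 1; k > 0; k--) path[k - 1] = psi[k][path[k]];
        return path;
    }

    public static void main(String[] args) {
        double[] pi = {0.8, 0.2};
        double[][] A = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] B = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(java.util.Arrays.toString(viterbi(pi, A, B, new int[]{0, 1, 1})));
    }
}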

Algorithm in Action
The first challenge when integrating HMMs as fusion algorithms for multimodal interaction is to map the features and states to the actual human-machine interaction model. In our case, data fed to the HMM will be semantic-level information, such as a speech utterance or the high-level description of a gesture such as "flick left". For our implementation of the algorithm, the Java Jahmm HMM library (http://code.google.com/p/jahmm/) was used. Scripts written in the SMUIML modelling language [13] serve as high-level descriptions of the HMM topologies. A second challenge was to minimise the need for training the algorithm before being able to actually use the system. As we explain later, a pre-training of our system is achieved through the simulation of expected inputs described in SMUIML.

As an example to explain how the algorithm works, let us consider the "put that there" example by Bolt [3]. In this example, a user is seated in front of a wall on which different shapes are displayed. To move a shape from one place to the other, the user points to the shape and utters "put that there" while pointing to another place on the wall. Five different input events are expected, including three speech-related triggers ("put", "that" and "there") and two gesture-related triggers (i.e. pointing events). Note that in our example, the two pointing events will be considered as events of the same type.

Figure 3. A sequence of states for a "put that there" multimodal event (idle state, "put", "that", "there" and <point> states)

The step from the high-level human-machine interaction description to the actual implementation of the HMM-based fusion algorithm is achieved as described in the following. In SMUIML, the <context> element is used to describe different application states. For example, in a drawing application, free line drawing and shape editing could be modelled as different states of the application. Different application states in SMUIML are modelled with the help of one HMM for each state. This implies that changing states in an application created with HephaisTK corresponds to changing from one <context> element to another in SMUIML.


Figure 4. Instantiation of the HMM-based fusion algorithm, with observations and HMMs created and trained based on a SMUIML script: 1. create observations from the declared triggers, 2. create HMMs, 3. create expected output sequences based on <transition> elements and CARE properties, 4. train the HMMs with the Baum-Welch algorithm

This in turn is equivalent to switching from one HMM to another HMM. The "put that there" example consists of a single interaction with only one context and thus a single HMM. In the current implementation of the algorithm, any adapted training done after the initialisation phase will not transfer directly during context switching.

At runtime, input events such as a speech utterance or a gesture are modelled as individual observations. Basically, one input event described in the SMUIML script corresponds to one observation of the HMM. Every time an input event is processed by the HephaisTK framework, it is sent to the FusionManager. For the list of all input events happening in a specified time window, the FusionManager considers potential combinations. Those combinations are then injected into the HMM. The Viterbi algorithm is subsequently used on the HMM to extract the most probable state sequences. These state sequences are compared to a set of expected observation sequences which are defined by the <transition> elements of the current <context> in the SMUIML script. For example, the following succession of input events: "put" *point* "that" *point* "there" would generate a state sequence as illustrated by the red dashed arrows in Figure 3. If a match is found, the information that a sequence of events corresponding to the SMUIML script description has been found is passed to the client application.
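The following sketch illustrates this runtime step under simplifying assumptions (hypothetical types and symbol table, not the actual FusionManager code): events arriving within a time window are mapped to discrete observation symbols and the resulting sequence is checked against the expected sequences derived from the SMUIML script. In the real algorithm the Viterbi-decoded state sequence is compared instead; since each state corresponds to exactly one trigger, the observation sequence stands in for it here.

import java.util.*;

// Illustrative runtime matching step: events inside a time window are mapped
// to discrete observation symbols and the resulting sequence is checked
// against the expected sequences derived from the SMUIML script.
public class WindowMatchSketch {

    record Event(String trigger, long timeMs) {}

    // Hypothetical symbol table for the "put that there" context.
    static final Map<String, Integer> SYMBOLS = Map.of(
            "put", 0, "that", 1, "there", 2, "point", 3);

    // Two of the orderings admitted by the "put that there" transition
    // (see Listing 1 further below for the SMUIML declaration).
    static final Set<List<Integer>> EXPECTED = Set.of(
            List.of(0, 3, 1, 3, 2),   // "put" *point* "that" *point* "there"
            List.of(0, 1, 3, 2, 3));  // "put" "that" *point* "there" *point*

    static Optional<List<Integer>> match(List<Event> events, long windowMs) {
        long start = events.get(0).timeMs();
        List<Integer> sequence = events.stream()
                .filter(e -> e.timeMs() - start <= windowMs)
                .map(e -> SYMBOLS.get(e.trigger()))
                .toList();
        return EXPECTED.contains(sequence) ? Optional.of(sequence) : Optional.empty();
    }

    public static void main(String[] args) {
        List<Event> window = List.of(
                new Event("put", 0), new Event("point", 150), new Event("that", 300),
                new Event("point", 700), new Event("there", 900));
        System.out.println(match(window, 1500));  // Optional[[0, 3, 1, 3, 2]]
    }
}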

Instantiation of the HMM-Based Fusion Algorithm
The instantiation of HMM-based fusion in HephaisTK is performed based on a given SMUIML script, as highlighted in Figure 4. First, all triggers defined in the script are parsed and states are created based on this list of triggers (step 1). Basically, one state corresponds to one trigger definition, with the addition of an idle state which serves as a starting state. Then, for each defined <context> element, one HMM is instantiated with the number of states corresponding to T + 1, T being the number of triggers declared for the current <context> (step 2). All states are initially interconnected, except the idle one. A discrete output observation distribution is used to connect the hidden layer with the observations. All <transition> elements declared in the SMUIML script are then parsed and, based on the synchronicity rules, a set of all expected output sequences is generated for each context (step 3).
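Steps 1 and 2 can be pictured as follows (an illustrative sketch, not HephaisTK code): for a context with T triggers, a (T+1)-state model is allocated with state 0 acting as the idle start state, read here, as an assumption, as a state that is never returned to; the near-uniform transition and emission probabilities chosen below are placeholders that the pre-training step described next would reshape.

// Illustrative construction of one per-context HMM: T trigger states plus an
// idle start state, uniformly initialised; the Baum-Welch pre-training step
// described below then shapes these probabilities.
public class ContextHmmSketch {

    final double[] pi;      // initial state distribution (start in idle)
    final double[][] A;     // state transition probabilities
    final double[][] B;     // discrete emission probabilities (one symbol per trigger)

    ContextHmmSketch(int triggerCount) {
        int n = triggerCount + 1;            // +1 for the idle state
        pi = new double[n];
        pi[0] = 1.0;                         // always start in idle
        A = new double[n][n];
        B = new double[n][triggerCount];
        for (int i = 0; i < n; i++) {
            for (int j = 1; j < n; j++) {
                A[i][j] = 1.0 / (n - 1);     // all states interconnected, no return to idle
            }
            for (int k = 0; k < triggerCount; k++) {
                // Idle emits all symbols uniformly; trigger state i favours "its" symbol.
                B[i][k] = (i == 0) ? 1.0 / triggerCount
                                   : (k == i - 1 ? 0.7 : 0.3 / (triggerCount - 1));
            }
        }
    }

    public static void main(String[] args) {
        ContextHmmSketch hmm = new ContextHmmSketch(4);   // "put that there": 4 triggers
        System.out.println("states: " + hmm.A.length + ", symbols: " + hmm.B[0].length);
    }
}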

The input events processed by HephaisTK all come from users interacting with a multimodal interface, for example by means of speech, gestures or gaze. Therefore, correlated input events generally happen within relatively small time windows of less than 10 seconds. Synchronicity and time ordering of modalities are modelled via the CARE property-based <par_or>, <par_and>, <seq_or> and <seq_and> elements in SMUIML. These elements are in turn used to create the list of expected output sequences for all HMMs. For example, a <transition> declared as equivalent means that any of the modalities in the transition will lead to the same next <context>.
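Step 3 can be pictured as expanding the nested synchronicity elements into the set of trigger sequences they admit. The sketch below does this for a simplified reading of the semantics (sequential combinations keep their order, parallel combinations allow their children in any order, and time windows are ignored); the node types and the example tree, which mirrors the "put that there" transition of Listing 1 further below, are hypothetical.

import java.util.*;

// Illustrative expansion of nested SMUIML-style combinators into the set of
// expected trigger sequences. Simplified semantics: SEQ_AND keeps the given
// order, PAR_AND allows the children in any order (no arbitrary interleaving),
// OR accepts any single child. Real SMUIML additionally carries time windows.
public class ExpectedSequenceSketch {

    sealed interface Node permits Trigger, Combo {}
    record Trigger(String name) implements Node {}
    record Combo(String op, List<Node> children) implements Node {}   // "SEQ_AND", "PAR_AND", "OR"

    static Set<List<String>> expand(Node node) {
        if (node instanceof Trigger t) return Set.of(List.of(t.name()));
        Combo c = (Combo) node;
        return switch (c.op()) {
            case "OR" -> {
                Set<List<String>> out = new HashSet<>();
                c.children().forEach(ch -> out.addAll(expand(ch)));
                yield out;
            }
            case "SEQ_AND" -> concatAll(c.children());
            case "PAR_AND" -> {
                Set<List<String>> out = new HashSet<>();
                permutations(c.children()).forEach(order -> out.addAll(concatAll(order)));
                yield out;
            }
            default -> throw new IllegalArgumentException(c.op());
        };
    }

    // Cartesian concatenation of the expansions of the given children, in order.
    static Set<List<String>> concatAll(List<Node> children) {
        Set<List<String>> acc = Set.of(List.of());
        for (Node ch : children) {
            Set<List<String>> next = new HashSet<>();
            for (List<String> prefix : acc)
                for (List<String> suffix : expand(ch)) {
                    List<String> joined = new ArrayList<>(prefix);
                    joined.addAll(suffix);
                    next.add(joined);
                }
            acc = next;
        }
        return acc;
    }

    static List<List<Node>> permutations(List<Node> items) {
        if (items.isEmpty()) return List.of(List.of());
        List<List<Node>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            List<Node> rest = new ArrayList<>(items);
            Node picked = rest.remove(i);
            for (List<Node> p : permutations(rest)) {
                List<Node> perm = new ArrayList<>();
                perm.add(picked);
                perm.addAll(p);
                out.add(perm);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The "put that there" transition from Listing 1.
        Node putThatThere = new Combo("SEQ_AND", List.of(
                new Trigger("put"),
                new Combo("PAR_AND", List.of(new Trigger("that"), new Trigger("point"))),
                new Combo("PAR_AND", List.of(new Trigger("there"), new Trigger("point")))));
        expand(putThatThere).forEach(System.out::println);   // four expected sequences
    }
}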

Finally, each HMM undergoes a pre-training stage, with automatically created sequences of observations injected into the applicable HMM. The Baum-Welch algorithm is used to find the unknown parameters of the HMM, hence allowing the fusion algorithm to be directly used (step 4). Basically, the internal structure of an HMM trained this way corresponds to the weighted state machine representation of the human-machine dialogue defined in the SMUIML script. The HMM can then stay this way, as a "luxury" weighted state machine, or be refined with real training data or user feedback at runtime. The HMM can also be trained at instantiation time with real training data, if it is available.

The particular strength of HMMs for the implementation of this human-machine dialogue, in comparison to ad-hoc methods, lies in the possibility to adapt the HMM behaviour at runtime. As pointed out in the introduction of this subsection, the main goal when creating an HMM-based fusion algorithm was to lay the foundations in HephaisTK for adding the ability to adapt to context and user feedback. HephaisTK was therefore modified to support a simple "I am not ok with that result" feedback from the user. If no particular result is specified, HephaisTK assumes that the last result has to be corrected. The HMM is then re-trained in order to give less weight to the erroneous result. An illustration of this feedback process is provided in the next section.
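One plausible realisation of this correction step, and not necessarily the exact mechanism implemented in HephaisTK, is to penalise the transitions along the state path that produced the rejected result and renormalise the affected rows, as in this sketch.

// Illustrative correction step after negative user feedback: the transitions
// along the state path that produced the rejected fusion result are scaled
// down and the affected rows of the transition matrix are renormalised.
public class FeedbackRetrainSketch {

    static void penalisePath(double[][] A, int[] rejectedPath, double penalty) {
        for (int k = 0; k + 1 < rejectedPath.length; k++) {
            A[rejectedPath[k]][rejectedPath[k + 1]] *= penalty;   // e.g. penalty = 0.5
            normaliseRow(A, rejectedPath[k]);
        }
    }

    static void normaliseRow(double[][] A, int row) {
        double sum = 0;
        for (double p : A[row]) sum += p;
        for (int j = 0; j < A[row].length; j++) A[row][j] /= sum;
    }

    public static void main(String[] args) {
        double[][] A = {{0.0, 0.5, 0.5}, {0.0, 0.5, 0.5}, {0.0, 0.5, 0.5}};
        penalisePath(A, new int[]{0, 1, 2}, 0.5);                 // user rejected idle -> s1 -> s2
        System.out.println(java.util.Arrays.deepToString(A));
    }
}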

Our description of the HMM-based fusion algorithm integration is based on the SMUIML modelling language. However, the presented algorithm could be integrated into any multimodal system that considers input events as normalised information atoms at the semantic level. A full modelling of the entire human-machine dialogue is not mandatory, but it helps to map the interaction flow to Markov models and improves results. Also, without such a model, the training of the algorithm with the help of recorded sequences would be required.

EVALUATION
The evaluation of our new multimodal fusion algorithm has been conducted in three steps. First, a qualitative evaluation using existing multimodal applications that were created based on HephaisTK has been performed. Then, a more formal evaluation using the framework's integrated benchmarking tool [11] was carried out. These two tests were designed to assess the accuracy and performance of our HMM-based fusion algorithm. Third, the adaptation to user input was also tested. In addition to these three evaluations, explorations of user-induced adaptation in the context of a particular application are presented. Initial conclusions based on the results of these explorations are then discussed.

Qualitative Test Using Real-World Applications
The first task was to check whether the HMM-based algorithm aligned itself correctly with the expected results for applications that had already been developed with the HephaisTK framework based on a classic frame-based fusion algorithm. Those applications did not take into account user adaptation, but they were good illustrations of use cases considering the CARE-based combination of modalities. Among the tested applications, a multimodal music player application was chosen to illustrate redundancy versus equivalence cases. Another application, a tool for document management in smart meeting rooms called Docobro [12], was used for the cases of sequential and non-sequential complementarity between modalities.

Both applications were migrated to the HMM-based fusion algorithm. Only a single line in the HephaisTK XML configuration file, specifying the fusion algorithm to be used, had to be changed. No other code in the SMUIML modelling script or the client Java application had to be modified. More importantly, since the SMUIML modelling script which defines the human-machine dialogues is used as preliminary training for the HMM-based fusion algorithm, no explicit training is required before the applications can be used. In fact, applications can even hot swap fusion algorithms while they are running, since context switching information is propagated in real time to every fusion algorithm. While this was only a qualitative evaluation, the goal was to check whether the overall behaviour of the HMM-based fusion algorithm would at least be comparable to the frame-based one. Three expert users were presented with the two applications, first running with the frame-based fusion algorithm and then with the HMM-based fusion algorithm. Users were given five minutes with each condition to work with the application. An interview was conducted with them after each condition. During these interviews, users reported a coherent behaviour between both algorithms, with fewer false positives in the case of the HMM-based algorithm. Overall, sequenced and unsequenced complementarity, redundancy as well as equivalence cases were managed in a similar way by both algorithms. Regarding the responsiveness of the HMM-based fusion algorithm, users did not report a noticeable difference between both algorithms.

Quantitative Assessment Through an EMMA Benchmark
HephaisTK's integrated benchmarking tool [11] has been used to validate our initial observations. The benchmarking tool simulates various recognisers for a number of modalities and feeds this input data into the framework at predefined times, which have been previously recorded from a user session. The benchmarking tool then collects the fusion results in place of the client application. In the test setting shown in Figure 5, the HephaisTK framework works as if real modalities and a real client application were used.
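The essence of such a benchmark run can be sketched as follows (hypothetical types and a trivial stand-in engine; the real tool reads EMMA documents and talks to the framework through its software agents): timestamped events are replayed at their recorded times, each produced interpretation is compared with the ground truth, and the corresponding latency is measured.

import java.util.*;

// Illustrative benchmark replay: timestamped input events are fed to a fusion
// engine stand-in, and each produced interpretation is compared with the
// ground truth while its latency is measured.
public class BenchmarkSketch {

    record TimedInput(long atMs, String modality, String content) {}
    record Expected(String interpretation, long inputSentAtMs) {}

    interface FusionEngine { Optional<String> feed(TimedInput input); }

    static void run(List<TimedInput> script, List<Expected> groundTruth, FusionEngine engine)
            throws InterruptedException {
        long start = System.currentTimeMillis();
        Iterator<Expected> expected = groundTruth.iterator();
        for (TimedInput input : script) {
            long wait = input.atMs() - (System.currentTimeMillis() - start);
            if (wait > 0) Thread.sleep(wait);                 // replay at recorded times
            Optional<String> result = engine.feed(input);
            if (result.isPresent() && expected.hasNext()) {
                Expected e = expected.next();
                long latency = (System.currentTimeMillis() - start) - e.inputSentAtMs();
                boolean correct = e.interpretation().equals(result.get());
                System.out.printf("%s correct=%b latency=%dms%n", result.get(), correct, latency);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<TimedInput> script = List.of(
                new TimedInput(0, "speech", "put"),
                new TimedInput(150, "pointing", "obj1"),
                new TimedInput(300, "speech", "that there"),
                new TimedInput(450, "pointing", "obj2"));
        List<Expected> truth = List.of(new Expected("put_that_there_action", 0));
        // Trivial stand-in engine: fires once the last event of the script arrives.
        run(script, truth, in -> in.atMs() == 450
                ? Optional.of("put_that_there_action") : Optional.empty());
    }
}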

Figure 5. Benchmark test setting: a testbed feeds timestamped events into the multimodal fusion engine and compares the produced interpretations against a ground truth to measure performance

The HephaisTK benchmarking tool allows for reproducible testing of the overall behaviour of the framework and of its performance and enables a comparison of different multimodal fusion algorithms. The benchmarks themselves are described via the XML-based EMMA language (http://www.w3.org/TR/emma/), which has been defined by the W3C Multimodal Interaction Working Group. Benchmarking tests were preferred over live user testing in order to compare the performance of two fusion algorithms working under the same conditions within the framework.

The results of our tests are illustrated in Figure 6 and Figure 7. Note that the screenshots of the benchmarking tool were slightly altered by adding horizontal black lines to differentiate the steps of each test. Each line corresponds to a single input sent to the HephaisTK framework. Details are, from left to right: the time in milliseconds when the data was sent, with 0 being the start of the test, the used modality, the content of the data and the expected answer from the fusion engine. In green or red is the answer which was effectively sent (or not) by the framework, whether this was the correct answer and finally how much time (in milliseconds) elapsed between sending the piece of data and receiving the result.

Figure 6. Equivalence and redundancy tests

Note that some delays are well above 100 ms. These delays in fact represent the delay between the time when the first single-modality data of a complex command was fed into the framework and the time when the final multimodal command was fused. For example, the "put" command on the first line of Figure 7 was sent to the framework 318 ms before the complete "put that there" command was fused.

Figure 7. Sequential and non-sequential complementarity tests

The CARE properties were used as a basis to test the accuracy of the HMM-based fusion algorithm. Each CARE property was first modelled with a correct case and then with some noise introduced. Finally, erroneous commands were fed into the system to check its resistance against false positives. Figure 6 shows the results for the equivalence and redundancy cases. "Hello" messages were input in different modalities, first sequentially and then at the same time. The test was executed in four sequences with an increasing amount of noise alongside the legitimate data. As shown in Figure 6, the HMM algorithm managed to pass all tests with or without added noise. The second part of this series of tests was to check whether sequential and non-sequential complementarity fusion was correctly managed by the fusion algorithm. The results are shown in Figure 7. The well-known "put that there" case was used to model these tests. Listing 1 shows the modelling of the command in SMUIML, with both parallel and sequential temporal constraints used. Four variants of the command were used: two legitimate variants, then an incomplete command and finally a garbled command. As outlined in Figure 7, the two correct commands were successfully fused, while the incomplete and garbled commands were both rejected by the algorithm.

Listing 1. "Put that there" expressed in SMUIML.

<transition leadtime="1500">
  <seq_and>
    <trigger name="put_trigger" />
    <par_and>
      <trigger name="that_trigger" />
      <trigger name="object_pointed_event" />
    </par_and>
    <par_and>
      <trigger name="there_trigger" />
      <trigger name="object_pointed_event" />
    </par_and>
  </seq_and>
  <result action="put_that_there_action" />
</transition>

Finally, a challenging multimodal fusion task was tested in the form of the "play next track" example discussed in [11]. This task involves the differentiation between three different nuances of meaning based only on the order of the input events. In [11], the frame-based fusion algorithm was used and could not correctly differentiate the various cases, because frames cannot easily represent this level of temporal nuance. Behaviour was slightly better when sequential constraints were taken into account.

Figure 8. "Play next track" example without sequential constraints in the upper part and with sequential constraints in the lower part

Figure 8 shows the same tests, but this time with the new HMM-based fusion algorithm. The first case without sequential constraints is still not completely solved, but at least provides a consistent answer compared to most results returned by frames. The two other cases are correctly fused. As for the test run with sequential constraints, the HMM-based algorithm scores perfectly. We think that these challenging cases of ambiguous input sequences emphasise the advantage HMMs have over other classes of fusion algorithms, due to their ability to model time-related processes. However, we think that a full user evaluation would be needed in order to further assess the effects on usability of the different algorithms, since our evaluation primarily focussed on the accuracy of recognition.

Performance
Multimodal interfaces should be responsive and therefore the fusion algorithms have to demonstrate reasonable performance. A full test run used the same list of commands as the tests of Figure 6 and Figure 7. In total, the test run fed 40 different pieces of information into the system and expected 20 different fusion results. The full test was run 5 times with each algorithm and the delays between the input of data and the retrieval of the corresponding results were measured. 100 different measures were thus taken for each algorithm. The average computation time over these 100 measures for the frame-based fusion algorithm was 18.2 ms, with a standard deviation of 12.7 ms. On the other hand, the average computation time for the HMM-based fusion algorithm was 16.6 ms, with a standard deviation of 11.6 ms, which is not a significant difference. Further examination showed that the actual computation time of the fusion algorithms is much lower than these average values and that a significant amount of the time is used by other HephaisTK computations. In fact, it appeared that most of HephaisTK's computation time is devoted to information storage and retrieval in the internal database as well as the information passing between the different software agents. In the future, we therefore intend to also test the algorithms with more complex situations and events.

User-Induced Adaptation
As presented in the related work section, adaptation can take a multitude of shapes. We are particularly interested in what is called user-induced automatic adaptation in Malinowski et al.'s framework [23]: initiative from the user, and proposal, decision and execution from the machine. In this form of adaptation, the user signals to the machine that a given output was not what they expected and implicitly asks the machine to revise its judgement accordingly. The identification of the "wrong" result, the decision about a correction as well as the execution of the necessary changes are left to the machine, and only the error detection part depends on the user, as shown in Figure 9. The user is also not explicitly asked to select which result they would have preferred, which guarantees minimal intrusion. Of course, such a semi-automatic adaptation introduces some challenges for the machine processing.

In HephaisTK, the definition of the human-machine dialogue through SMUIML and the HMM-based fusion algorithm makes it possible to explore user-induced automatic adaptation. User feedback was experimentally tested in the following way: vibration sensors were attached to the computer screen of a machine running the HephaisTK framework. When the computer screen is gently slapped, the vibration sensors are activated and provide input to the framework. Those vibration sensor events are tagged in the dialogue manager as user feedback expressing dissatisfaction from the user. The correction is sent to the HephaisTK FusionManager, which adapts the behaviour of the HMM-based fusion algorithm accordingly. In our test case, the erroneous result was simply considered to be the last output produced by the fusion algorithm. The HMM for the current context is trained to give less weight to the particular transition which triggered this erroneous result.

We have conducted some tests of this setting with the help of the benchmarking tool, in order to verify the correct behaviour of the algorithm. The same tests as explained above were used, but in the case of a wrongly recognised result, simulated user feedback was injected into HephaisTK. Three different cases were tested and in all three cases the algorithm successfully adapted its behaviour. The same setting was then used "live" with the Docobro document browser and the multimodal music player.

Figure 9. User-based error detection and feedback: multimodal inputs lead to system output, and a feedback channel lets the user signal erroneous results

Based on our tests, we can draw the conclusion that the integration of user-induced automatic adaptation in multimodal interfaces raises the following challenges: the identification of the problematic fusion result, the identification of the source of the error and the creation of a meaningful correction, as well as the possibility for undo operations.

When the user indicates the wish for a fusion result to be corrected, other fusion results might already have been processed before the user's feedback reaches the fusion engine. In the presented test environment, the erroneous fusion result was always considered to be the last one. Evidently, this strategy soon proved to be error-prone. Since the HMM-based fusion algorithm provides a probability for the correct recognition with each fusion result, this probability score was used to try to give less weight to the least probable fusion result among the most recent ones. This strategy was not satisfactory either, and the identification of the erroneous fusion behaviour among the latest results without further input from the user is still an open issue.

The identification of the source of error is also an open issue. Indeed, the fusion algorithm works with interpretations coming from different recognisers. If the erroneous fusion behaviour originates from a recognition mistake of a particular modality (e.g. the speech recogniser), downgrading a fusion result at the fusion algorithm level could worsen the fusion results rather than improving them. On the other hand, if the source of the error is correctly identified, the individual modality recognisers have the possibility to make use of a user's feedback to correct their results. We assume that extensive testing could help to identify thresholds at which "bad" results should be delegated either to the fusion algorithm or to individual recognisers, or be simply ignored if no source could be assigned.
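One way to picture such a threshold policy is sketched below; the routing rules and confidence values are purely hypothetical, since no such thresholds exist in HephaisTK at this point.

// Hypothetical threshold policy for routing a user's negative feedback:
// if the contributing recogniser was itself unsure, the correction is
// delegated to that recogniser; if it was confident, the fusion algorithm
// is retrained; remaining cases are ignored.
public class FeedbackRoutingSketch {

    enum Target { RECOGNISER, FUSION_ALGORITHM, IGNORE }

    static Target route(double recogniserConfidence, double fusionConfidence,
                        double low, double high) {
        if (recogniserConfidence < low) return Target.RECOGNISER;
        if (fusionConfidence < high) return Target.FUSION_ALGORITHM;
        return Target.IGNORE;   // no source of error could be assigned with confidence
    }

    public static void main(String[] args) {
        System.out.println(route(0.35, 0.9, 0.5, 0.8));   // RECOGNISER
        System.out.println(route(0.95, 0.6, 0.5, 0.8));   // FUSION_ALGORITHM
    }
}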


The creation of a meaningful correction was satisfactory throughout the tests conducted with the previously described setting. When incorrect results originating from the latest fusion result were produced by the HMM-based fusion algorithm, a retraining of the current context's HMM led to a reduction of false positives. However, if the error did not originate from the latest fusion algorithm result, the lack of some "undo" functionality proved to be problematic. Keeping two copies of each HMM and having one version trained one step later than the other should help to introduce this "undo" mechanism at the technical level.
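The proposed undo mechanism essentially amounts to keeping a copy of the model that lags one training step behind, as in the following minimal sketch (hypothetical code; a real implementation would snapshot the full HMM rather than only its transition matrix).

// Minimal sketch of the proposed undo mechanism: before each feedback-driven
// retraining, the current transition matrix is copied, so one erroneous
// adaptation step can be rolled back.
public class UndoableHmmSketch {

    double[][] current;
    double[][] previous;   // lags one training step behind

    UndoableHmmSketch(double[][] initial) {
        current = deepCopy(initial);
        previous = deepCopy(initial);
    }

    void retrain(java.util.function.UnaryOperator<double[][]> trainingStep) {
        previous = deepCopy(current);        // snapshot before adapting
        current = trainingStep.apply(current);
    }

    void undo() {
        current = deepCopy(previous);        // roll back the last adaptation
    }

    static double[][] deepCopy(double[][] m) {
        double[][] copy = new double[m.length][];
        for (int i = 0; i < m.length; i++) copy[i] = m[i].clone();
        return copy;
    }

    public static void main(String[] args) {
        UndoableHmmSketch hmm = new UndoableHmmSketch(new double[][]{{0.5, 0.5}, {0.5, 0.5}});
        hmm.retrain(a -> new double[][]{{0.9, 0.1}, {0.2, 0.8}});   // stand-in for a Baum-Welch step
        hmm.undo();
        System.out.println(java.util.Arrays.deepToString(hmm.current));   // back to the 0.5 values
    }
}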

CONCLUSION
We presented our work on a novel algorithm for decision-level fusion of input data in the context of multimodal interactive systems. Our algorithm is based on hidden Markov models and has been implemented and tested in the HephaisTK framework. A qualitative as well as quantitative assessment of the algorithm has been conducted with HephaisTK's integrated benchmarking tool. We conducted repeated tests of two multimodal fusion algorithms by feeding them with multimodal inputs organised in a series of examples. These examples represented different cases of unambiguous and ambiguous time-synchronised multimodal input. With similar processing time, our new HMM-based fusion algorithm demonstrated similar or better accuracy than a classic frame-based fusion algorithm. This improvement in accuracy, coupled with the capability of our new HMM-based algorithm to take temporal combinations of modalities into account, makes it a promising solution for decision-level fusion engines of interactive multimodal systems. Furthermore, the algorithm demonstrated its strength in adapting fusion results based on user and context information. Our initial explorations on automatic user-induced adaptation in multimodal interfaces revealed three main challenges to be overcome before the presented type of adaptivity can be fully employed in multimodal interfaces: 1) the identification of the problematic fusion result, 2) the identification of the source of error and 3) error correction and learning.

In the future, we plan to explore automatic user-induced adaptation, based on the challenges identified in this paper. We will conduct further evaluations through tests on specific tasks and hypotheses, as well as a user study with more users. We further plan to focus on making full use of the probability scores provided by individual modality recognisers and to go beyond their simple integration in HMMs. The robust handling of inputs with uncertainty needs to be tackled not only on the algorithm level but also on the dialogue level. To this end, we plan to integrate a variant of the framework proposed by Schwarz et al. [27] into HephaisTK, which should complement the management of the probability scores at the algorithm level.

ACKNOWLEDGMENTS
The work on HephaisTK and SMUIML has been funded by the Hasler Foundation in the context of the MeModules project and by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management via the NCCR IM2 project. Bruno Dumas is supported by MobiCraNT, a project forming part of the Strategic Platforms programme by the Brussels Institute for Research and Innovation (Innoviris).

REFERENCES

1. Arhippainen, L., Rantakokko, T., and Tahti, M. Navigation with an Adaptive Mobile Map-Application: User Experiences of Gesture- and Context-Sensitiveness. In Proceedings of the 2nd International Conference on Ubiquitous Computing Systems (UCS 2004) (Tokyo, Japan, November 2005), 62–73.

2. Baum, L., Petrie, T., Soules, G., and Weiss, N. A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains. The Annals of Mathematical Statistics 41, 1 (1970), 164–171.

3. Bolt, R. A. "Put-that-there": Voice and Gesture at the Graphics Interface. In Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1980) (Seattle, USA, July 1980), 262–270.

4. Bourguet, M.-L. A Toolkit for Creating and Testing Multimodal Interface Designs. In Companion of the 15th Annual Symposium on User Interface Software and Technology (UIST 2002) (Paris, France, October 2002), 29–30.

5. Cesar, P., Vaishnavi, I., Kernchen, R., Meissner, S., Hesselman, C., Boussard, M., Spedalieri, A., Bulterman, D. C., and Gao, B. Multimedia Adaptation in Ubiquitous Environments: Benefits of Structured Multimedia Documents. In Proceedings of the 8th ACM Symposium on Document Engineering (DocEng 2008) (Sao Paulo, Brazil, September 2008), 275–284.

6. Chai, J., Hong, P., and Zhou, M. A Probabilistic Approach to Reference Resolution in Multimodal User Interfaces. In Proceedings of the 9th International Conference on Intelligent User Interfaces (IUI 2004) (Funchal, Madeira, Portugal, January 2004), 70–77.

7. Chen, Q., Georganas, N. D., and Petriu, E. M. Hand Gesture Recognition Using Haar-Like Features and a Stochastic Context-Free Grammar. IEEE Transactions on Instrumentation and Measurement 57, 8 (August 2008), 1562–1571.

8. Choudhury, T., Clarkson, B., Jebara, T., and Pentland, A. Multimodal Person Recognition using Unconstrained Audio and Video. In Proceedings of the International Conference on Audio- and Video-Based Person Authentication (AVBPA 1999) (Washington DC, USA, 1999), 176–181.

9. Cohen, P. R., Johnston, M., McGee, D., Oviatt, S., Pittman, J., Smith, I., Chen, L., and Clow, J. QuickSet: Multimodal Interaction for Simulation Set-Up and Control. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLC 1997) (Washington DC, USA, April 1997), 20–24.


10. Coutaz, J., Nigay, L., Salber, D., Blandford, A., May, J., and Young, R. M. Four Easy Pieces for Assessing the Usability of Multimodal Interaction: The CARE Properties. In Proceedings of the 5th International Conference on Human-Computer Interaction (Interact 1995) (Lillehammer, Norway, June 1995), 115–120.

11. Dumas, B., Ingold, R., and Lalanne, D. Benchmarking Fusion Engines of Multimodal Interactive Systems. In Proceedings of the 11th International Conference on Multimodal Interfaces (ICMI-MLMI 2009) (Cambridge, USA, November 2009), 169–176.

12. Dumas, B., Lalanne, D., and Ingold, R. HephaisTK: A Toolkit for Rapid Prototyping of Multimodal Interfaces. In Proceedings of the 11th International Conference on Multimodal Interfaces (ICMI-MLMI 2009) (Cambridge, USA, November 2009), 231–232.

13. Dumas, B., Lalanne, D., and Ingold, R. Description Languages for Multimodal Interaction: A Set of Guidelines and its Illustration with SMUIML. Journal on Multimodal User Interfaces: "Special Issue on The Challenges of Engineering Multimodal Interaction" 3, 3 (February 2010), 237–247.

14. Flippo, F., Krebs, A., and Marsic, I. A Framework for Rapid Development of Multimodal Interfaces. In Proceedings of the 5th International Conference on Multimodal Interfaces (ICMI 2003) (Vancouver, Canada, November 2003), 109–116.

15. Forney Jr, G. The Viterbi Algorithm. Proceedings of the IEEE 61, 3 (1973), 268–278.

16. Hoste, L., Dumas, B., and Signer, B. Mudra: A Unified Multimodal Interaction Framework. In Proceedings of the 13th International Conference on Multimodal Interfaces (ICMI 2011) (Alicante, Spain, November 2011), 97–104.

17. Huang, J., Liu, Z., Wang, Y., Chen, Y., and Wong, E. Integration of Multimodal Features for Video Scene Classification Based on HMM. In Proceedings of the 3rd IEEE Workshop on Multimedia Signal Processing (MMSP 1999) (Copenhagen, Denmark, September 1999), 53–58.

18. Johnston, M., Cohen, P., McGee, D., Oviatt, S., Pittman, J., and Smith, I. Unification-Based Multimodal Integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997) (Madrid, Spain, July 1997), 281–288.

19. Juang, B., and Rabiner, L. Hidden Markov Models for Speech Recognition. Technometrics 33 (1991), 251–272.

20. Konig, W. A., Radle, R., and Reiterer, H. Squidy: A Zoomable Design Environment for Natural User Interfaces. In Proceedings of the 27th International Conference on Human Factors in Computing Systems (CHI 2009) (Boston, USA, April 2009), 4561–4566.

21. Lalanne, D., Nigay, L., Palanque, P., Robinson, P., Vanderdonckt, J., and Ladry, J. Fusion Engines for Multimodal Input: A Survey. In Proceedings of the 11th International Conference on Multimodal Interfaces (ICMI-MLMI 2009) (Cambridge, USA, September 2009), 153–160.

22. Lopez-Jaquero, V., Vanderdonckt, J., Montero, F., and Gonzalez, P. Towards an Extended Model of User Interface Adaptation: The Isatine Framework. In Engineering Interactive Systems, J. Gulliksen, M. B. Harning, P. Palanque, G. C. Veer, and J. Wesson, Eds., vol. 4940 of LNCS. Springer Verlag, 2008, 374–392.

23. Malinowski, U., Thomas, K., Dieterich, H., and Schneider-Hufschmidt, M. A Taxonomy of Adaptive User Interfaces. In Proceedings of the Conference on People and Computers VII (HCI 1992) (York, United Kingdom, September 1992), 391–414.

24. Octavia, J., Raymaekers, C., and Coninx, K. Adaptation in Virtual Environments: Conceptual Framework and User Models. Multimedia Tools and Applications 54 (2011), 121–142.

25. Oviatt, S. Human-Centered Design Meets Cognitive Load Theory: Designing Interfaces That Help People Think. In Proceedings of the 14th ACM International Conference on Multimedia (ACM MM 2006) (Santa Barbara, USA, October 2006), 871–880.

26. Rabiner, L. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.

27. Schwarz, J., Hudson, S., Mankoff, J., and Wilson, A. D. A Framework for Robust and Flexible Handling of Inputs with Uncertainty. In Proceedings of the 23rd Symposium on User Interface Software and Technology (UIST 2010) (New York, USA, October 2010), 47–56.

28. Serrano, M., Nigay, L., Lawson, J., Ramsay, A., Murray-Smith, R., and Denef, S. The OpenInterface Framework: A Tool for Multimodal Interaction. In Proceedings of the 26th International Conference on Human Factors in Computing Systems (CHI 2008) (Florence, Italy, April 2008), 3501–3506.

29. Sharma, R., Pavlovic, V., and Huang, T. Toward Multimodal Human-Computer Interface. Proceedings of the IEEE 86, 5 (1998), 853–869.

30. Vo, M., and Wood, C. Building an Application Framework for Speech and Pen Input Integration in Multimodal Learning Interfaces. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1996) (Atlanta, USA, May 1996), 3545–3548.

31. Wu, L., Oviatt, S., and Cohen, P. From Members to Teams to Committee – A Robust Approach to Gestural and Multimodal Recognition. IEEE Transactions on Neural Networks 13, 4 (2002), 972–982.