
Speech-Gesture Driven Multimodal Interfaces for Crisis Management

R. Sharma 1,2,5, M. Yeasin 2,1, N. Krahnstoever 2, I. Rauschert 2,5, G. Cai 4,5, I. Brewer 5, A. MacEachren 3,5, K. Sengupta 1

1. Advanced Interface Technologies, Inc., 403 S. Allen Street, State College, PA 16801.

2. Dept. of Computer Science and Engineering, Penn State University, University Park, PA 16802.

3. Department of Geography, Penn State University, University Park, PA 16802.

4. School of Information Sciences and Technology, Penn State University, University Park, PA 16802

5. GeoVISTA Center, Pennsylvania State University, University Park, PA 16802

Abstract

Emergency response requires strategic assessment of risks, decisions, and communications that are time-critical, while requiring teams of individuals to have fast access to large volumes of complex information and technologies that enable tightly coordinated work. Access to this information by crisis management (CM) teams in emergency operations centers can be facilitated through various human-computer interfaces. Unfortunately, these interfaces are hard to use, require extensive training, and often impede rather than support teamwork. Dialogue-enabled devices based on natural, multimodal interfaces have the potential to make a variety of information technology tools accessible during crisis management. This paper establishes the importance of multimodal interfaces in various aspects of crisis management and explores many issues in realizing successful speech-gesture driven, dialog-enabled interfaces for CM.

The paper is organized in five parts. The first part discusses the needs of CM that can potentially be met by the development of appropriate interfaces. The second part discusses issues related to the design and development of multimodal interfaces in the context of CM. The third part discusses the state of the art in both the theories and practices involving these human-computer interfaces. In particular, it describes the evolution and implementation details of two representative systems for crisis management, XISM and the Dialog Assisted Visual Environment for Geoinformation (DAVE_G). The fourth part speculates on the short-term and long-term research directions that will help address the outstanding challenges in interfaces that support dialog and collaboration. Finally, part five concludes the paper.

Correspondence Information:

Dr. Rajeev Sharma
Department of Computer Science and Engineering
220 Pond Laboratory, Pennsylvania State University
University Park, PA 16802

Email: [email protected] Phone: (814) 867-8977 Fax: (814) 867-8957

Submitted to the Proceedings of IEEE special issue on Multimodal Human-Computer Interface


Part I: Need for Multimodal Interfaces in Crisis Management

The need to develop information science and technology to support crisis management has never been more apparent. The world is increasingly vulnerable to sudden hazardous events such as terrorist attacks, chemical spills, hurricanes, tornadoes, floods, wildfires, and disease epidemics. Emergency response requires strategic assessment of risks, decisions, and communications that are time-critical, while requiring teams of individuals to have fast access to large volumes of complex information and technologies that enable tightly coordinated work. Access to this information by crisis management teams in emergency operations centers is through various human-computer interfaces that, unfortunately, are hard to use, require extensive training, and often impede rather than support teamwork. Meeting the challenges of crisis management in a rapidly changing world will require more research on fundamental information science and technology. To have an impact, that research must be linked directly with the development, implementation, and assessment of new technologies. Making information technology easier to use for crisis managers and related decision makers is expected to increase the efficiency of coordination and control in strategic assessment and crisis response activities. To be useful and usable, the interface technologies must be human-centered, designed with input from practicing crisis management personnel at all stages of development.

Crisis management scenarios (see Figure 1 for an example scenario) considered in this paper include both strategic assessment (work to prepare for and possibly prevent potential crises) and emergency response (activities designed to minimize loss of life and property). Most crisis management relies upon geospatial information (derived from location-based data) about the event itself, its causes, the people and infrastructure affected, the resources available to respond, and more. Geospatial information is essential for pre-event assessment of risk and vulnerability as well as for response during events and subsequent recovery efforts. Crisis management also relies upon teams of people who need to collaboratively derive information from geospatial data and to coordinate their subsequent activities. Current geospatial information technologies, however, have not been designed to support group work, and we have very little scientific understanding of how groups (or groups of groups) work in crisis management using geospatial information and the technologies for collecting, processing, and using it.

We believe that dialogue-enabled devices based on natural, multimodal interfaces have the potential to make a variety of information technology tools accessible during crisis management. Multimodal interfaces allow users to interact via a combination of modalities, for instance, speech, gesture, pen, touch screen, displays, keypads, pointing devices, and tactile sensors. They offer the potential for considerable flexibility, broad utility, and use by a larger and more diverse population than ever before. A particularly advantageous feature of multimodal interface design is its ability to support superior error handling, compared to unimodal recognition-based interfaces, in terms of both error avoidance and graceful recovery from errors [1-4]. Traditional human-computer interfaces, however, do not support the collaborative decision making involved in crisis management.

The ability to develop a multimodal interface system depends on knowledge of the natural integration patterns that typify people's combined use of different input modes. Developing a multimodal interface for collaborative decision-making requires systematic attention to both human and computational issues at all stages of the research. The human issues range from analysis of the ways in which humans indicate elements of a geographic problem domain (through speech and gesture) to the social aspects of group work. The computational issues range from developing robust real-time algorithms for tracking multiple people, recognizing continuous gestures, and understanding spoken words, through developing methods for syntactic and semantic analysis of speech/gesture commands, to designing an efficient dialog-based natural interface in the geospatial domain for crisis management.

Given the complex nature of users' multimodal interaction, a multidisciplinary approach is required to design a multimodal system that integrates complementary modalities to yield a highly synergistic blend. The main idea is to consider each of the input modalities in terms of the others, rather than separately.


The key to success lies in meeting the integration and synchronization requirements for combining the different modes strategically into a whole system. A well-designed multimodal architecture can support mutual disambiguation of input signals [5]. Mutual disambiguation involves recovery from unimodal recognition errors within a multimodal architecture: semantic information from each input mode supplies partial disambiguation of the other mode, leading to more stable and robust overall system performance. This integration is useful both in the disambiguation of the human input to the system and in the disambiguation of the system output.
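To make mutual disambiguation concrete, the sketch below (Python) combines hypothetical n-best lists from a speech recognizer and a gesture recognizer. All scores, labels, and the compatibility table are illustrative assumptions, not part of the systems described in this paper.

```python
# Hypothetical illustration of mutual disambiguation: each modality produces an
# n-best list with posterior scores, semantically incompatible pairings are
# pruned, and the joint score is maximized. Note that the gesture evidence can
# override the speech-only top choice (and vice versa).

SPEECH_NBEST = [("zoom to this airport", 0.52), ("zoom to this area", 0.48)]
GESTURE_NBEST = [("circle_region", 0.80), ("point", 0.20)]

# Toy compatibility table: which gesture types can resolve which referring words.
COMPATIBLE = {
    "area": {"circle_region"},
    "airport": {"point"},
}

def joint_hypotheses(speech_nbest, gesture_nbest):
    """Score every compatible speech/gesture pairing by the product of the
    unimodal posteriors (independence assumption, for illustration only)."""
    results = []
    for utterance, p_s in speech_nbest:
        referent = utterance.split()[-1]            # crude keyword extraction
        for gesture, p_g in gesture_nbest:
            if gesture in COMPATIBLE.get(referent, set()):
                results.append((utterance, gesture, p_s * p_g))
    return sorted(results, key=lambda r: r[2], reverse=True)

if __name__ == "__main__":
    # Speech alone prefers "airport", but the strong circling gesture flips the
    # joint decision to "area": a toy instance of mutual disambiguation.
    for utterance, gesture, score in joint_hypotheses(SPEECH_NBEST, GESTURE_NBEST):
        print(f"{utterance!r} + {gesture}: {score:.3f}")
```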

This paper discusses the evolution and implementation of dialog-based, speech-gesture driven multimodal interface systems developed by a group of researchers at the Pennsylvania State University and Advanced Interface Technologies (AIT). The main goal was to design HCI systems that allow a team of individuals to collaborate while easily and naturally interacting with complex geospatial information. The unified multimodal framework would include two or more people in front of a large display, agents in the field with small displays, as well as mobile robotic agents. Such a multimodal, cross-platform collaborative framework could be an important element for rapid and effective response to a wide range of crisis management activities, including homeland security emergencies. The objectives of this paper are:

i. To outline how cutting-edge information technologies, for example a speech-gesture driven multimodal interface, allow individuals as well as teams to access essential information more quickly and naturally, thus improving decision making in crisis situations.

ii. To discuss the challenges faced in designing such a system, which include the need:

- to identify and respond to the critical needs of crisis mitigation and response,

- to provide the crisis management team with a distributed environment for training and testing, including a virtual space in which distant members can collaborate in decision making,

iii. To discuss the state of the art of speech-gesture driven collaborative systems and the technological issues involved in the design of speech-gesture based interfaces. This includes speech and image analysis tasks for sensing, a multimodal fusion framework for user action recognition, and dialog design and semantics issues in the domain of crisis management.

iv. To report our progress to date, detailing the evolution of two implemented systems, namely, XISM and DAVE_G.

v. To discuss the future challenges that must be overcome to realize natural and intuitive interfaces for collaborative decision making in the context of crisis management.

A Crisis Management Scenario:

Let us consider an example scenario that could help ground the discussion of the role of multimodal interfaces in collaborative work for crisis management (see Figure 2 for a conceptual snapshot of the problem). Imagine the crisis management center of a government organization, with Center Director Jane Smith and Paul Brown, chief logistics and evacuation manager, in front of a large-screen display linked to the organization’s emergency management system, the Multimodal Interface for Collaborative Emergency Response (MICER).

“An earthquake of magnitude 7.1 has hit San Diego and many freeways and major roads are impassable. Buildings are severely damaged or collapsed and fire has broken out in many places. Shortly before the quake, seismographs indicated a fault shift and triggered alarms at emergency centers and local governments. A few minutes later, emergency operation centers are staffed and prepared to respond to the situation…”


They are assessing initially available information about the earthquake’s epicenter, its magnitude, and preliminary damage estimates. The crisis center is filled with response professionals, each with different expertise, sitting in front of displays showing new incoming information and reports from affected sites. Their main purpose is to assess all available information without missing anything important. Based on the available information, immediate decisions have to be made about where to send rescue teams, where to send resources, and how to prioritize the response effort. They decide where and how they can help most effectively with the available resources. One of the first reports reaching the center comes from a chemical plant close to the earthquake’s epicenter. A fire is threatening stored chemical tanks as well as nearby residential areas. Jane Smith, already collaborating with the fire department, is guiding firefighting groups through the partially impassable city districts to the disaster site. Paul Brown, responsible for evacuation affairs, is working together with Jane to plan the evacuation of threatened residential areas. They both interact via remote communication devices with onsite observer Bob Lewis, who is providing necessary information about conditions at the disaster site.

Jane: “System, show the status of the chemical plant and provide an area map, we need to evacuate threatened residential areas.”

MICER: The system shows plant diagrams and information about the hazardous chemicals stored on site and highlights residential areas.

Jane: To get an update on the current situation at the plant, she connects to an onsite search and rescue member with real-time video conferencing and shared information displays.

Bob: “There was a big explosion in one of the store houses and one chemical tank burst. There are several other tanks still in there and we have deployed our robot to see how badly damaged the other tanks are. There is a chance of further explosions that could destroy the entire area.”

Jane: “Ok, how bad is the chemical flow right now? Do you have an estimate of the impact that the already released chemicals might have on the area?”

Bob: “Yes. The emergency barrier is damaged and the chemicals are flowing steadily towards a residential area. Our simulation model predicts a flow direction this way, with impact in these areas (outlines threatened residential areas) within the next 12 hours.”

MICER: Shows transmitted images of the disaster site and an area map illustrating the HazMat situation. “There are 34,351 residents here. Do you want to see the simulation as well?”

Jane: “No thanks, I think we have enough information to work on an evacuation plan, right Paul?”

Paul: “Right. Now, we first need to inform local hospitals and shelter facilities.”

MICER: Displays the mentioned buildings along with detailed information about their capacities.

Paul: “The closest hospitals are these two (gestures with hand) and we might need these shelters over here as well (outlines area with hand).”

MICER: Highlights indicated shelters and provides information on available capacity and provided medical care.

Bob: “We’re receiving some initial readings from the explosion site provided by the robot.”

MICER: Shows transmitted video data from the robot along with structural reconstruction information from the remaining containers.

Bob: “The other containers don’t seem to be leaking. We can contain this area (gestures with hand). But these residential areas cannot be protected in time.”

Paul: “Thanks Bob! This means we have to get these people out first. We should evacuate this way (indicates a route) and direct them over this highway to both hospitals.”

Jane: “No, this is not possible; this highway is damaged (the system marks the indicated portion of the highway dark) and we need this route free for incoming resources to the chemical plant. I think we should evacuate the southeastern residents over this (points) bridge to those shelters (indicates area with hand gesture).”

MICER: Highlights respective routes.

Jane: “Then the rest could evacuate this way (outlines with hand gesture) to that hospital (points) and avoid traffic jams here if we use all four lanes of those (points) two ring roads.”

MICER: Splits up the screen into an overview map of directed routes from the residential areas to sheltering facilities and a close up map of the local evacuation routes within the residential area.

Figure 1: A scenario of speech-gesture driven collaborative interfaces in the context of crisis management.


The crisis management scenario illustrates a number of properties common to such practices:

i. The use of information technology in crisis management involves collaborative problem solving with the participation of both human and machine agents. Compared with the simple information retrieval problems that current multimodal systems are commonly designed for, crisis management comprises complex tasks that require multiple phases and steps at different levels of complexity.

Figure 2: A conceptual snapshot of a crisis scenario depicting the processes involved. (The components shown include audio and visual signal capture, speech and gesture recognition and interpretation, domain-independent and domain-specific fusion, dialog/collaboration management, knowledge representation, a knowledge database, and distributed systems control, linking platforms such as emergency operation center workstations, laptop and handheld speech/pen interfaces, a mobile operation center with simulation systems, a mobile robot with camera/tools, and a remote satellite station for voice/video/data.)


ii. The knowledge necessary for deriving a solution is distributed among a team of agents who must plan and coordinate their actions through communication. (This raises serious challenges for the knowledge management and planning functions of current multimodal systems.)

iii. Users direct the operation of the system using natural spoken language and free hand gestures in ways similar to communicating with other humans. The continuous streams of speech and gesture signals must be analyzed and interpreted to distill useful information from noisy input.

iv. Users’ information requests are expressed in their task-domain vocabulary and are sometimes inseparable from their reasoning process about their goals and means.

v. Information dialogues are neither system-led nor user-led. Instead, they are mixed-initiative, allowing both the system and the users to introduce new goals and to seek clarification from each other.

vi. Users’ information requests can be either explicit or implicit. Implicit requests are harder to recognize and require deep reasoning about the users’ beliefs and goals.

Part II: Issues in Designing Speech-Gesture Driven Multimodal Interfaces

In this section, we outline both the scientific and engineering challenges in designing speech-gesture driven multimodal interfaces in the context of crisis management. Our main goal is to design a dialog-enabled HCI system for collaborative decision-making, command, and control. While traditional interfaces support sequential and unambiguous input from devices such as keyboards and conventional pointing devices (e.g., mouse, trackpad), speech-gesture driven, dialog-based multimodal interfaces relax these constraints and typically incorporate a broader range of inputs (e.g., spoken language, eye and head tracking, gesture, pen, touch screen, displays, keypads, pointing devices, and tactile sensors). The ability to develop a dialog-based, speech-gesture driven interface is motivated by knowledge of the natural integration patterns that typify people's combined use of different modalities for natural communication. Recent trends in multimodal interfaces are inspired by the goal of supporting more transparent, flexible, efficient, and powerfully expressive means of human-computer interaction than ever before. Multimodal interfaces are expected to support a wider range of diverse applications, to be usable by a broader spectrum of the average population, and to function more reliably under realistic and challenging usage conditions. The main challenges related to the design of a speech-gesture driven multimodal interface for crisis management are:

i. Domain & Task analysis,

ii. Acquisition of valid multimodal data,

iii. Sensing technologies for multimodal data acquisition,

iv. Detection/localization and tracking of users,

v. Recognizing user actions (i.e., gesture recognition, speech recognition, etc.),

vi. A framework to fuse gestures and spoken words,

vii. Dialog design,

viii. Semantics,

ix. Usability studies and performance evaluation, and

x. Interoperability of devices.

We next discuss each of the above challenges in some detail.


2.1. Domain & Task analysis

Analysis of the crisis management task is of paramount importance for developing a dialog-based natural multimodal interface system. By studying the work domain, researchers can create realistic scenarios to conduct user studies with prototype systems. Because multimodal systems must be built before user studies can be conducted, a scenario-based approach lends itself well to the multiple iterations required for constructing such a system. Rosson and Carroll [6] discuss the role of scenarios in system design. Crisis management often relies upon geospatial information and technologies (e.g., determining evacuation routes, identifying locations of at-risk facilities, simulating the spread of a toxic gas released from HAZMAT facilities, and others), but, surprisingly, only limited research has been directed toward understanding the use of geospatial information and technologies for decision support [7-9]. In addition, traditional problems in usability engineering and human-computer interaction involve relatively well-defined user tasks. Thus, many of the methods developed for user task analysis in typical HCI domains are inappropriate for task analysis in the context of crisis management, where the tasks are often ill-defined [10]. As a result, analysis of tasks carried out in crisis management, particularly those involving the use of geospatial information and technologies, requires adaptation of existing methods and development of new methods that are applicable to the analysis of ill-structured decision-making tasks, often made under stress.

One context in which methods have been developed to address the use and usability of technologies designed to enable decision-making under crisis-like situations is the design of technologies to support military activities (e.g., command and control, airplane cockpit controls, etc.). Within this context, cognitive systems engineering (CSE) has proven to be an effective methodology for understanding the task domain and developing interface technologies to support performance of tasks [11-13]. The theoretical frameworks of distributed cognition [14], activity theory [15], and cognitive ergonomics [16] also have the potential to help isolate and augment specific elements of the crisis management domain for multimodal system design. We agree with Descortis [16] that each approach produces specific results based on one instance of interpretation and that scale and needs should be considered before settling on a single framework; it is therefore important to consider a variety of approaches in designing a collaborative multimodal crisis management system.

Understanding the task domain is essential to make the challenge of building a natural interface for crisis management (or other application domains) a tractable problem. This is because multimodal signification (through speech, gesture, and other modalities) is context dependent. The crisis management context provides a particular challenge for the development of integrated speech-gesture interfaces, since an important component of crisis management (response) is typically carried out under conditions of considerable stress. Although there is a growing body of research on speech-gesture interfaces to geospatial information (usually presented via maps) [17-21], little attention has been directed thus far to the specific challenges of developing these interfaces to cope with interaction in stressful situations. We believe that an integrated, multidisciplinary approach is necessary for understanding the ill-structured, highly dynamic, collaborative work domain of crisis management for the design of multimodal systems. Without such an approach, we could develop a multimodal system that meets all usability design requirements and yet have constructed, in fact, the wrong system.

An important distinction for methods and tools designed to enable collaboration in typical crisis management situations is that between tight and loose collaboration. Tight collaboration is usually synchronous (or same-time) with multiple people interacting directly to view and discuss the same information and/or to solve the same task. Loose collaboration is usually asynchronous (different-time) with multiple people performing different tasks and having different views on the CM system and situation while sharing information. For example, during a hurricane event, one person may be responsible for dam breaches, another for evacuation planning, and another for fires. The same individuals might also be periodically required to engage in tight collaboration (e.g., because flooding can prevent fire trucks from getting through or close possible evacuation routes).


In addition to the degree of direct interaction among participants in a collaborative activity, a second critical distinction exists between same-place and different-place (distributed) team work [22]. With same-place team work, interface technologies must support joint work on the task at hand; with crisis management, that task often involves integrating, understanding, and making decisions using geospatial information. Participants in same-place team work can rely upon normal methods of face-to-face communication, unless the interface technologies interfere with that communication, as they often do with immersive virtual environment technologies [23]. A multimodal interface to support same-place team work must include mechanisms for visual tracking of multiple people, speaker identification, and determination of who is speaking to whom. With different-place team work, the interface technologies, in addition to providing support for the task at hand, must also provide support for communication among the remotely located individuals on the team [24]. When that distributed team work is same-time (synchronous), interface technologies must provide tools that support real-time interaction and communication among participants. It is also important to note that determining who is talking to whom may be more difficult, since the speaker’s gaze direction is uninformative when they are talking with someone at a remote location.

2.2. Acquisition of Valid Multimodal Data

An important feature of a natural interface would be the absence of predefined speech and gesture commands. The resulting multimodal “language” thus has to be interpreted by the computer. While some progress has been made in the natural language processing of speech, there has been very little progress in the understanding of multimodal human-computer interaction [4]. Although most gestures are closely linked to speech, they still present meaning in a fundamentally different form from speech. Studies in human-to-human communication, psycholinguistics, and other fields have already generated a significant body of research on multimodal communication. However, they usually consider a different granularity of the problem. When designing a computer interface, even one that incorporates reasonably natural means of interaction, we have to consider the artificially imposed paradigms and constraints of the created environment. Therefore, studies of human-human communication cannot be transferred automatically to the design of HCI systems. Hence, a major problem in multimodal HCI research is the availability of valid data that would allow relevant studies of different issues in multimodality.

Part of the reason for the slow progress in speech-gesture driven multimodal HCI is the lack of sensing technologies that would allow natural user behavior to be captured. In addition, real HCI systems require the designer to view these levels of communication from an interaction-enabling perspective, e.g., providing timely and adequate feedback. The use of statistical techniques is considered the preferred choice for building such systems, but the patterns of face-to-face communication do not transfer automatically to HCI because of the “artificial” paradigms of information displays. Hence, the multimodal data needed to learn these patterns is unavailable before the system is built, which creates a “chicken-and-egg” problem. One solution is to use Wizard-of-Oz style experiments [25], in which the experimenter interprets user requests and simulates the system response. However, this method does not guarantee a timely and accurate system response, which is desirable for eliciting adequate user interaction. As Zue et al. [26] pointed out, while an experimenter-in-the-loop paradigm can provide important base information from which to build initial prototypes, once a prototype is developed, a system-in-the-loop (“Wizardless”) paradigm is preferable, one in which interaction is with the system acting on its own.

2.3. Sensing

The role of sensing in multimodal interfaces is to understand a user’s queries and commands through speech and gesture. The key challenges are the acquisition and recognition of speech for understanding spoken commands in natural settings and the acquisition and recognition of gesture actions.


2.3.1. Speech Acquisition

Speech acquisition is concerned with capturing verbal commands and queries from the user. Because automatic speech recognition (ASR) systems to date are still very sensitive to the quality of the captured speech signal, speech acquisition is both difficult and crucial for multimodal interfaces. There are three main conceptual approaches to capturing clean human speech signals in the presence of background noise. One approach is to bring the microphone as close to the speaker as possible; this approach is utilized by headset [27, 28], throat, and lavalier microphones [29]. If this is not feasible, one has to resort either to physically directional microphones, such as shotgun [30] or parabolic [31] microphones, or to noise cancellation techniques. Noise cancellation can be performed by having one or several additional microphones capture mainly background noise signals or, in an extreme approach, by using an array of distributed microphones [32-34].

In general, headset microphones tend to be the best choice in noisy environments but require the user to wear a dedicated device. Among long-range approaches, microphone domes seem to be the better choice but have the disadvantage of size, and the user is in general constrained to interact with the system from a fixed location. In contrast, microphone arrays can adaptively capture localized sound signals from arbitrary locations in space but tend to have a lower signal-to-noise ratio (SNR), especially in reverberant indoor environments.

2.3.2. Gesture Acquisition

Gesture acquisition is concerned with capturing a gesturer’s motion information in order to perform subsequent gesture recognition. Gestures are generally defined as movements of the body or limbs that express or emphasize ideas and concepts. In the context of multimodal systems, pen- and touch-based interfaces are also commonly viewed as falling under the gesture recognition domain. However, while gesture acquisition is merely a marginal problem for pen- and touch-based systems, it requires considerable effort for most other approaches. Aside from pen- and touch-based systems [35, 36], the most common gesture acquisition methods are based on magnetic trackers, cyber-gloves, and vision-based approaches. The suitability of the different approaches depends on the application domain and the platform. Pen-based approaches [35, 37] are the method of choice for small mobile devices and are cost effective and reliable. Acquisition using magnetic trackers [25] and/or cyber-gloves [38-43] is efficient and accurate but suffers from the constraint of having to wear restrictive devices. In contrast, vision-based approaches offer entirely contact-free interaction and are flexible enough to operate on all platforms except the smallest mobile devices.

In vision-based approaches, direct or indirect measurements of a person’s gesticulation have to be acquired visually by assuming a parameterized visual model of the gesturer [44]. The process by which the parameters of a given model are estimated from video sequences is called visual tracking. Tracking is commonly performed incrementally by adjusting the model parameters for a given video frame based on the parameters at earlier times, which improves tracking accuracy and speed; a minimal sketch of such an incremental update is given below. For this approach to be feasible, however, the tracker has to be initialized in a preliminary track initialization stage. Especially for high degree-of-freedom (DOF) articulated visual models, this step is inherently difficult and hence often performed manually. In the following, different tracking approaches are discussed in more detail.
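As an illustration of this incremental adjustment, the following sketch (Python with numpy) implements a constant-velocity Kalman filter for a 2-D hand position; the state model and noise settings are illustrative assumptions, not parameters of any system described here.

```python
# Sketch of the incremental "adjust parameters from the previous frame" idea:
# a constant-velocity Kalman filter predicting a 2-D hand position, with the
# measurement supplied by any per-frame detector. Noise values are illustrative.
import numpy as np

F = np.array([[1, 0, 1, 0],     # state: [x, y, vx, vy], unit time step
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2            # process noise
R = np.eye(2) * 4.0             # measurement noise (pixels^2)

def kalman_step(x, P, z=None):
    """One predict/update cycle; z is the detected (x, y) or None if missing."""
    x, P = F @ x, F @ P @ F.T + Q              # predict from the previous frame
    if z is not None:                          # update only when a detection exists
        y = np.asarray(z, dtype=float) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(4) - K @ H) @ P
    return x, P

if __name__ == "__main__":
    x, P = np.array([100.0, 80.0, 0.0, 0.0]), np.eye(4) * 10.0
    for z in [(102, 81), None, (106, 83)]:     # one missed detection in between
        x, P = kalman_step(x, P, z)
    print(x[:2])                               # smoothed hand position estimate
```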

2.4. Detection, Localization and Tracking of Users

Initialization of the vision component of a multimodal system can be performed manually, but for convenience and to reduce user training requirements, automatic approaches are desirable. Three main problems have to be addressed: a) user detection, b) user localization, and c) track initialization. A simple approach for detecting a user in the camera’s view is to perform foreground/background segmentation and subsequent silhouette analysis of the foreground. The major challenges for this approach are modeling the background in changing environments and handling segmentation when foreground and background happen to be similar. Motion-based approaches offer some improvement but are computationally more demanding and often work only under restricted conditions. Stereo systems, which can be used to obtain depth maps of the environment, are attractive solutions but require additional hardware and careful calibration. Another approach to initialization is based on face detection [45]. Face detection algorithms have recently advanced in both speed and detection performance such that they can be utilized even in real-time systems for user detection and localization as well as for additional tasks such as head track verification and gaze estimation [46, 47]. After appropriate initialization, the person and body parts need to be tracked over time to understand the gesture.
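As a concrete illustration of face-detection-based initialization, the sketch below uses OpenCV’s stock Haar cascade face detector to locate the user and derive a rough search region for subsequent hand tracking. The detector choice and the region heuristics are assumptions for illustration, not the detector cited above.

```python
# Sketch of face-detection-based user detection/localization to seed a tracker.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def initialize_user_track(frame_bgr):
    """Return (face_box, gesture_search_box) for the largest detected face,
    or None if no user is visible. Boxes are (x, y, w, h) in pixels."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face = nearest user
    # Heuristic: a gesturing user's hands appear in a region roughly three face
    # widths wide and four face heights tall, below and around the face.
    search = (max(0, x - w), y, 3 * w, 4 * h)
    return (x, y, w, h), search

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)          # default camera, if available
    ok, frame = cap.read()
    if ok:
        print(initialize_user_track(frame))
    cap.release()
```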

2.4.1. Visual Tracking

Visual tracking is one of the most actively researched fields in computer vision. A thorough discussion of human motion tracking methods is not possible here; the reader is referred to a number of reviews on the subject [48, 49]. Rather, we discuss in this section the challenges that vision-based tracking algorithms encounter in the context of multimodal systems and the degree to which standard approaches are suitable for different application domains. For multimodal HCI systems, a visual tracking algorithm has to fulfill the following requirements:

Real-Time: A visual tracking algorithm in the HCI domain has to be able to process incoming video information at a rate that yields sufficiently sampled motion data. Rates of 30 frames per second are in general necessary.

Occlusion: Occlusion is an inherent problem for human motion tracking algorithms; for example, when people gesture, they hold their hand in front of their body and hands often occlude each other.

Visual distractions in background: In unconstrained environments, it cannot be ensured that the user is the only object or person in the view of the system. In addition to the user, bystanders, furniture, or other objects might be visible, which have to be handled by the tracker.

Target size: Visual sensors of an HCI system often capture images of the entire user. The arms, hands, and fingers might therefore occupy only a small region of the video images, making robust tracking challenging.

Visual distractions in foreground: In addition to background distractions, the user him- or herself can be a significant source of distraction to a tracking algorithm. For example, when the tracker is designed to track a person’s hand based on skin-color information, the user’s clothing (e.g., short-sleeved shirts) can be a source of distraction.

Changing environmental conditions: Visual tracking systems work best in environments that do not change over time. For example, the natural diurnal cycle or changing lighting conditions can cause the tracker to fail if these changes are not handled appropriately.

Initialization: A visual tracker for HCI systems must in general be able to initialize automatically and to perform its task independently of the person being tracked. This means that no prior information, for example about the size or height of a person, can be assumed.

2.4.2. Tracking Methods

Visual tracking methods have unique advantages and disadvantages. In this section, we discuss a number of representative approaches and their suitability in the context of multimodal HCI. The most complex target representations are those that involve detailed models of the target in terms of articulated (skeletal) structure and volumetric shape descriptions of body parts [50-60]. These model-based representations are often parameterized by 3D locations and joint angles with many degrees of freedom. Model-based approaches are able to recover the 3D location and pose of a gesticulating subject from monocular image sequences if the underlying model is detailed enough. Unfortunately, the evaluation of these high-DOF models is still prohibitively expensive for real-time tracking systems.

Other visual tracking approaches assume much narrower and incomplete models of the gesticulating person. Feature-based approaches assume that the user’s gesture movements give rise to image features that can be detected and used for tracking. Common visual features are contours [61-63, 45], points [64-67], color [68, 69], and motion [70]. Finally, the image content itself can serve directly as a feature [71-74]. Contour-based approaches suffer from the requirement that they usually need some form of more detailed model of the target to be tracked. This often makes them unsuitable, because the inherent non-rigidity of human motion calls for non-trivial contour generators (except when the shape can be approximated well, for example by ellipses, as in head tracking [68]) and because of intra-individual shape variability. Point feature trackers are able to detect and robustly track salient and stable image features over long periods of time; unfortunately, the body parts of interest on a gesticulating person often show a surprisingly small number of salient features. Template- and exemplar-based approaches utilize typical snapshots or representative descriptions of the target in combination with template-to-image correlation to perform visual tracking. These approaches have proven to be good for applications such as head, face, or whole-person tracking but suffer if the appearance of the target changes over time or the target is small. One of the most widely used approaches to hand tracking is based on color and motion cues. Human skin color is an excellent feature for distinguishing the human hand and face from other objects in an image; combined with additional cues such as motion information, robust trackers can be designed.
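The color-plus-motion cue just described can be sketched as follows (Python with OpenCV and numpy); the HSV thresholds and the simple frame-differencing gate are illustrative assumptions, not the tuned values of any system discussed here.

```python
# Sketch of a color-plus-motion hand cue: a skin-color likelihood in HSV space
# gated by frame differencing, followed by a centroid measurement that could be
# fed to a tracker such as a Kalman filter.
import cv2
import numpy as np

SKIN_LO = np.array([0, 40, 60], dtype=np.uint8)     # rough HSV skin bounds
SKIN_HI = np.array([25, 180, 255], dtype=np.uint8)

def hand_measurement(prev_gray, frame_bgr, motion_thresh=15):
    """Return (centroid, gray): the centroid (x, y) of skin-colored, moving
    pixels (or None), plus the grayscale frame to pass as prev_gray next call."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, SKIN_LO, SKIN_HI)

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    motion = (cv2.absdiff(gray, prev_gray) > motion_thresh).astype(np.uint8) * 255

    cue = cv2.bitwise_and(skin, motion)              # skin AND moving
    cue = cv2.morphologyEx(cue, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    m = cv2.moments(cue)
    if m["m00"] < 1e3:                               # too few supporting pixels
        return None, gray
    return (m["m10"] / m["m00"], m["m01"] / m["m00"]), gray
```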

2.5. Recognizing User’s Action

The ability to develop a dialog-based multimodal interface system is motivated by knowledge of how humans naturally integrate different input modes. Integration of speech and gesture has tangible advantages in the context of HCI, especially when coping with the complexities of spatial representations [75]. Hence, the requirements of a natural interactive system include the ability to understand multiple modalities, i.e., speech and gesture, where information is distributed across the modalities.

2.5.1. Gesture Recognition

Gesture recognition is the process of inferring gestures from captured motion data. Depending on the motion acquisition method, gestures are parameterized in a variety of ways. Articulated model-based visual approaches and magnetic trackers usually yield time-varying joint angles, while pen-based approaches and trackers that operate in 2D image space yield time-varying 2D trajectory data. These direct parameters are commonly post-processed and augmented with derivatives before recognition. The discussion of gesture recognition methods, however, can be kept somewhat independent of the specific parameterization. In the following, we outline a number of challenges that gesture recognition systems face.

Gestures as the outcome of a stochastic process: The spatio-temporal evolution of different gestures performed by the same person or by two different people will differ both in spatial shape and in temporal properties. Hence, a gesture can be viewed as a realization of a stochastic process and modeled appropriately. The stochastic nature of gestures foils attempts to perform direct comparisons of gesture trajectories, especially due to the time-varying differences in spatial and temporal scale.

Meaningful versus incidental gestures: For pen and touch-screen based systems each pen stroke in general conveys meaningful information. In contrast, the continuous motion data acquired by vision or magnetic tracker based systems contains meaningful as well as meaningless motion information. This problem is somewhat analogous to human speech, where the stream of audio signals contains meaningful speech as well as noise and mumbles.


The meaningful gestures are embedded in a continuous observation stream: Except for pen- and touch-based systems, which are designed such that there is a one-to-one mapping between input stroke sequences and gestures (e.g., Graffiti on the Palm OS), general handwriting recognizers and systems that acquire motion data continuously have to divide the motion data into segments that constitute single gestures. This situation is again similar to the problem faced by ASR systems, where phoneme, word, and phrase boundaries are unknown or at least hard to detect.

Gestures are context dependent: In human-to-human communication, McNeill [76] distinguishes four major types of gestures by their relationship to speech: deictic, iconic, metaphoric, and beats. Deictic gestures are used to direct a listener's attention to a physical reference in the course of a conversation. Iconic and metaphoric gestures are associated with abstract ideas, mostly peculiar to the subjective notions of an individual. Beats serve as gestural marks of speech pace. In Weather Channel broadcasts, the last three categories roughly constitute 20% of all the gestures exhibited by the narrators. Hence, when the discourse concerns geocentric data, the use of deictic gestures is most common [77] and relatively consistent in its coupling with speech.

Gesture models: Due to their stochastic nature, it is difficult to manually find general and representative descriptions of spatio-temporal gesture motion patterns. Hence, the usual approach to gesture recognition is based on machine learning methods. As with ASR, the two main approaches to gesture recognition are based on neural networks (NNs) [78] and hidden Markov models (HMMs). The most common and successful approach to dynamic gesture recognition is based on HMMs [79-86]. HMMs model doubly stochastic processes with a state transition network. States in an HMM network are associated with stochastic observation densities, and transitions are governed by probabilistic rules. Stochastic observation streams such as gestures are then viewed as arising from a realized path through the network and from realized observations emitted at the visited states. Traditional state sequence approaches cannot be employed directly, as there is no easy method of detecting the beginning and end of the gestures embedded in the stream. There are two solutions to this problem. One approach splits the gesture stream into chunks and applies the above procedure. However, this splitting operation can cut the stream in the middle of a gesture. Overlapped splitting addresses this problem, but the fusion of ambiguous recognition results on overlapping segments is challenging. Another approach operates the state estimation procedure in a continuous mode by employing a simple yet powerful technique called token passing [87]. Token passing maintains a set of tokens that are copied and passed around in the compound transition network. As tokens are passed around, transitions and observations incur costs equal to the negative logarithm of the corresponding probability values. At each time step, and for each state, every token associated with the given state is duplicated according to how many outgoing transitions exist for that state. The state transition history of the most probable (least-cost) token is assumed to be the true sequence of performed gestures and can be determined easily at periodic intervals.
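The sketch below illustrates token passing over a compound gesture network in the spirit of [87]. The gesture models (tiny left-to-right HMMs with 1-D Gaussian emissions), transition probabilities, and feature values are hypothetical placeholders, not the models used in the systems described later.

```python
# Minimal token-passing decoder over a compound gesture network: each gesture is
# a small left-to-right HMM, the exit of every gesture connects back to the entry
# of every gesture, and tokens accumulate a history of completed gestures.
import math
from dataclasses import dataclass, field

@dataclass
class Token:
    cost: float = 0.0                       # accumulated negative log-probability
    history: list = field(default_factory=list)

def neg_log_gauss(x, mean, var):
    return 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Each gesture model: ordered list of (mean, var) emission parameters per state.
GESTURES = {
    "point":  [(0.0, 0.5), (1.0, 0.5)],
    "circle": [(2.0, 0.5), (3.0, 0.5), (2.0, 0.5)],
}
STAY, MOVE, SWITCH = 0.7, 0.3, 0.1          # illustrative transition probabilities

def decode(observations):
    # One token slot per (gesture, state); start at every gesture's first state.
    tokens = {(g, 0): Token() for g in GESTURES}
    for x in observations:
        new_tokens = {}
        for (g, s), tok in tokens.items():
            emit = neg_log_gauss(x, *GESTURES[g][s])
            moves = [((g, s), tok.cost - math.log(STAY) + emit, tok.history)]
            if s + 1 < len(GESTURES[g]):    # advance within the gesture model
                moves.append(((g, s + 1), tok.cost - math.log(MOVE) + emit, tok.history))
            else:                           # gesture completed: re-enter any model
                for g2 in GESTURES:
                    moves.append(((g2, 0), tok.cost - math.log(SWITCH) + emit,
                                  tok.history + [g]))
            for key, cost, hist in moves:   # keep only the best token per state
                if key not in new_tokens or cost < new_tokens[key].cost:
                    new_tokens[key] = Token(cost, hist)
        tokens = new_tokens
    best = min(tokens.values(), key=lambda t: t.cost)
    return best.history

if __name__ == "__main__":
    # Expected: ['point'], since "circle" is still in progress when the stream ends.
    print(decode([0.1, 0.9, 2.1, 2.9, 2.2]))
```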

2.5.2. Speech Recognition

Automatic speech recognition (ASR) systems build on three major components: a lexicon that contains mappings from words to phonemes, a language model that statistically describes the likelihood of word sequences, and an acoustic model that describes the probability of observing certain feature streams given a hypothesized word sequence [88]. The acoustic model is usually formulated on the basis of hidden Markov models [82], which reflect the doubly stochastic processes underlying human speech. Using the lexicon, phoneme models are combined into word models, which in turn are combined into sentence models by appropriately connecting hidden Markov models into larger state transition networks. Using this network representation, speech recognition is performed by determining the most likely state transition sequence through the network given the observed speech features [89].

In commercial speech recognition systems, the end user is commonly confronted only with the final, most probable utterance; internally, however, systems maintain a whole set of possible utterances defined as a confusion network. As the quality of the acquired speech signal deteriorates, the obtained confusion networks increase in size (i.e., in the number of parallel word sequences). The goal and advantage of multimodal HCI systems is that a plethora of additional information is available to further disambiguate these hypotheses. Speech recognition systems fall into two major classes: systems that can recognize speech independent of the speaker and systems that are trained to recognize the voice of a specific speaker (speaker dependent). Speaker-dependent speech recognition is much easier in general and hence associated with higher recognition rates. However, to make multimodal HCI systems operable in unconstrained public environments, where user training is infeasible, speaker-independent systems need to be employed.

2.6. A Framework for Fusion of Gestures and Speech

The world around us continuously offers huge amounts of information, from which living organisms elicit the knowledge and understanding they need for survival or well-being. A fundamental cognitive feature that makes this possible is the brain's ability to integrate the inputs it receives from different sensory modalities into a coherent description of its surrounding environment. By analogy, artificial autonomous systems are designed to continuously record large amounts of data with various sensors. A major design problem for such systems is the lack of a reference for how the information from the different sensor streams can be integrated into a consistent description, primarily because few methods adequately model the complexity of the audio/visual relationship. How, and whether at all, this integration takes place has been widely debated, and many models have been suggested. So far, there is no single theory that explains exactly how integration takes place in the brain.

We believe that to develop a successful multimodal interface it is important to focus on the development of a synergistic integration principle, supported by synchronization of the multimodal information streams based on a principle of temporal coherence. Effective representation and integration of diverse input is the key to building successful multimodal interfaces. Through a free mixture of speech, gestures, and gaze, a user can potentially refer to and control configurations, attributes, and events. The integration principle featured in this paper is based on the understanding that there are two aspects of the integration process: (1) achieving a synergistic integration of two or more sensor modalities, and (2) the actual combination (fusion) of the various information streams at particular moments of their processing. The synergistic integration relies on a hypothesis of how different percepts unify in the brain, based on evidence from temporal registration and binding experiments.

A probabilistic evaluation of all possible speech-gesture combinations promises a better estimation of the user's intent than either modality alone. The conditional probabilities of observing certain gestures given a speech utterance are based on several factors. Speech utterances first have to be analyzed for keyword classes, such as typical deictic keywords (e.g., "this", "that"). These keywords can then be associated with corresponding deictic gestures. The association needs to take gesture and utterance component classes into consideration and maintain the appropriate mapping between speech and gesture components. Because a statistical method is typically employed for continuous recognition (hypothesis search on the state transition network using token passing [87]), both the speech recognition and gesture recognition systems generate their results with time delays of typically one second. Verbal utterances from the speech recognizer have to be associated with co-occurring gestures observed by the gesture recognizer. Understanding the temporal alignment of speech and gesture is crucial in performing this association. While in pen-based systems [36] deictic gestures have been shown to occur before the associated keywords, investigations of HCI and Weather Narration data [90] showed that for large-screen display systems the deictic word occurred during or after the gesture in 97% of the cases. Modality fusion should therefore be triggered by the occurrence of verbal commands, which reduces the problem to conditionally combining speech and gesture confusion networks.
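The speech-triggered association step can be sketched as follows; the deictic keyword list, time stamps, and scoring window are illustrative assumptions rather than the trained co-occurrence statistics described below.

```python
# Sketch of speech-triggered fusion: when the recognizer emits a deictic keyword,
# find the gesture whose time span best co-occurs with the word, exploiting the
# observation that the keyword tends to fall during or shortly after the gesture.
DEICTIC_KEYWORDS = {"this", "that", "these", "those", "here", "there"}

def associate(word, word_start, word_end, gestures, max_lag=1.5):
    """Return the (gesture, score) best explaining a deictic word, or None.

    `gestures` is a list of dicts with keys: label, start, end, confidence.
    The score prefers gestures that end shortly before or during the word.
    """
    if word.lower() not in DEICTIC_KEYWORDS:
        return None
    best = None
    for g in gestures:
        lag = word_start - g["end"]          # >0: word after gesture, <0: overlap
        if lag > max_lag or g["start"] > word_end:
            continue                          # too far apart to be related
        temporal = max(0.0, 1.0 - abs(lag) / max_lag)
        score = temporal * g["confidence"]
        if best is None or score > best[1]:
            best = (g, score)
    return best

if __name__ == "__main__":
    gestures = [
        {"label": "point", "start": 2.0, "end": 2.6, "confidence": 0.8},
        {"label": "circle_region", "start": 0.1, "end": 0.7, "confidence": 0.9},
    ]
    # The pointing gesture co-occurs with the word; the earlier circling does not.
    print(associate("this", word_start=2.7, word_end=2.9, gestures=gestures))
```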


Once data associations (or a set of associations, if several are possible) have been determined, the co-occurrence module can compute a final match value between the utterance and the gesture based on temporal co-occurrence statistics. Domain-specific data are then used to train the system and obtain optimal, task-specific co-occurrence relations between speech and gesture. Figure 3 illustrates the architecture of a possible fusion strategy. Alternatively, instead of trying to answer the question of how the integration takes place, one can ask why it takes place. There is a variety of answers to this question. For example, integrating on-line, up-to-date information, which arrives at different levels of generality and is sensed from different scopes, can provide the key to adapting to a new situation and dealing with it.
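
As one possible realization of the training step mentioned above, the short sketch below fits a Gaussian model of the keyword-gesture time offset from labeled domain recordings and uses it as a co-occurrence match score; the Gaussian assumption and the function names are ours, not necessarily those of the described system.

# Sketch: learning and applying a temporal co-occurrence model (assumed Gaussian
# over the keyword-minus-gesture time offset, estimated from labeled domain data).
import math

def fit_offset_model(offsets):
    """offsets: list of (keyword_time - gesture_time) from training recordings."""
    n = len(offsets)                      # assumes at least one training sample
    mean = sum(offsets) / n
    var = sum((o - mean) ** 2 for o in offsets) / max(n - 1, 1)
    return mean, max(var, 1e-6)

def cooccurrence_match(keyword_time, gesture_time, mean, var):
    """Unnormalized Gaussian likelihood of the observed temporal offset."""
    d = (keyword_time - gesture_time) - mean
    return math.exp(-0.5 * d * d / var)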

2.7. Dialog Design

Speech-gesture driven multimodal interfaces to crisis management information systems demand careful design of the information flow between users and the subsystems that manage various knowledge and data sources. The dialogue management component of a crisis management system takes the output of the speech and gesture recognition subsystems as input and generates an action plan for communicating with other information sources. In the simplest case, where a user's request is clearly stated and includes sufficient information, dialogue handling could be "hard-coded" as a series of ordered steps (a minimal sketch follows the list):

i. Understanding user’s information request and constraints;

ii. Determining whether sufficient information is included in the request in order to contact external knowledge bases or applications;

iii. Making requests to external applications; and

iv. Communicating information (returned from external application) back to the user.
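
The following sketch illustrates such a hard-coded serial handler for the simple case; the slot names and the external query interface are hypothetical.

# A deliberately "hard-coded" serial dialogue handler for the simple case described
# above; function names and the external query API are illustrative assumptions.
def handle_request(interpreted_input, knowledge_base):
    # i. Understand the user's information request and constraints.
    request = interpreted_input.get("request")
    constraints = interpreted_input.get("constraints", {})

    # ii. Check whether enough information is present to contact external sources.
    missing = [slot for slot in ("layer", "region") if slot not in constraints]
    if missing:
        return {"type": "prompt", "ask_for": missing}

    # iii. Make the request to the external application (e.g., a GIS server).
    result = knowledge_base.query(request, **constraints)

    # iv. Communicate the returned information back to the user.
    return {"type": "answer", "content": result}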

However, handling natural, multimodal input from users is rarely so simple for a number of reasons [91].

First, a user's multimodal input may be misinterpreted or misunderstood. Multimodal interfaces rely on the ability of the system to recognize and infer the user's intended actions from multiple modes of input. Extracting meaning from continuous gesture/speech is prone to recognition errors. Dialogue management must provide adequate verification and grounding mechanisms so that misunderstandings can be communicated and corrected.

Second, a user's input may be ill-formed, incomplete, or even incorrect. Users of information systems in crisis management are commonly experts in their task domains but relatively naïve users of complex information systems. These users often do not know exactly what data items they require or how to obtain them, and hence their input may be ill-formed, incomplete, or inaccurate. Instead of simply reporting these problems back to the user and requesting a reformulation of the input, a dialogue manager should assist users by suggesting ways to correct or complete their requests.

Figure 3: Architecture for speech / gesture fusion.


Third, there is a need to handle user input with flexibility. Natural speech and gesture input can be very flexible both in requesting information and in supplying additional parameters. Flexibility must be supported not only in the choice of phrases and gestures, but also in the way they are structured in an utterance or a dialogue. In particular, a user's request may be under-specified (including less detail than necessary) or over-specified (providing more information than is needed in the current step of action). In either case, the dialogue manager should accept the input and, if necessary, initiate a new dialogue to request any missing information.

Fourth, there is a need to support collaborative planning through dialogue. In the dynamic environment of crisis management, action plans often need to be modified, extended, and negotiated by a group of participants in a dialogue in response to changes in the state of the world (the location and nature of threats, damages, resources) and in priorities. Such planning activities should be managed so that participants in a dialogue can plan their actions through collaborative interactions.

To handle all the above aspects of dialogue in crisis management, a dialogue management system must include: (a) more sophisticated methods for interpreting users' multimodal inputs; and (b) more flexible and cooperative dialogue control strategies, so that sufficient repair, clarification, confirmation, and negotiation capabilities are supported. A high-level goal of dialogue design for crisis management is to support a user's problem-solving process as it unfolds through the sequence of communicative interactions.

Dialogue design for multimodal crisis management systems is inherently a multi-faceted problem. To facilitate later discussion of the various challenges of dialogue management, Figure 4 serves as a framework that lays out multiple design dimensions and their relationships. It distinguishes a number of processing tasks as well as the contexts required for these tasks. We present this framework in a way that captures the design issues without constraining them to any particular computational architecture. The significance of this framework is that it allows actual dialogue systems to be analyzed and compared by listing which approaches are used in each of the design dimensions and how these approaches are coupled in the whole system. Next, we describe the desirable functions of each of the components in Figure 4. Issues of context are discussed separately in Section 2.8, which focuses more on the semantic aspects of multimodal systems.

2.7.1. Understanding of Multimodal Input

As detailed in Figure 4, the output from speech-gesture recognition and syntactic analysis is the input to the dialogue management system. This input is analyzed to derive a meaning representation that can be used by the dialogue control subcomponent. Understanding gesture/speech input usually requires several levels of analysis. It normally starts with analyzing the semantic content of each constituent (words, phrases, and gestures) in an input, and then moves toward understanding the whole utterance by combining small semantic fragments into larger chunks. If an input is grammatically correct and semantically self-complete, input understanding can be handled by grammar-based semantic parsing techniques developed in computational linguistics [92]. However, full parsing of inputs in spontaneous dialogues is often not possible. Instead, the goal of semantic parsing becomes the extraction of critical meaning fragments that are further analyzed by other interpretation techniques, using, for example, high-level knowledge about discourse structure, the user's focus of attention, and pragmatics of the domain. These knowledge sources are external to the captured gesture/speech input and must be explicitly represented in a form usable by the dialogue management system. The input understanding component corresponds roughly to three of the boxes in Figure 4: semantic parsing, discourse interpretation, and intention recognition.


2.7.1.1. Semantic Parsing

Semantic parsing takes the recognized words and detects the existence of meaningful phrases. Common semantic parsing methods include a feature-based semantic grammar approach, robust parsing methods, and more practical methods involving concept spotting, each described below.

Semantic grammar approach: Grammatical approaches to semantic parsing are based on the theoretical foundations of computational linguistics [92]. Normally, a feature-based description is used to represent the meaning of grammatical units (words, phrases, and sentences), and unification grammar rules are used to compose the meaning of an utterance from the meanings of its parts. This form of semantic analysis typically yields meaning represented in first-order predicate calculus (FOPC), which can be used to infer the set of valid statements. For less well-formed input, the conventional grammar approach becomes inefficient and impractical because of the difficulty of handling the large number of potential dialogue features. For this reason, more robust parsing techniques have been developed.

Robust parsing: Robust parsing aims at extracting semantic information from ungrammatical input fragments without performing a complete parse. It does not attempt to understand every word or phrase; instead, it extracts only those meaningful items essential for the communication. Some form of feature-based bottom-up parser is capable of extracting meaningful phrases without attempting a full parse of the sentence.

Figure 4: Dialogue management for multimodal crisis management systems – a conceptual view.


Concept spotting: Concept spotting is a semantics-driven approach that attempts to extract critical concepts from a user's utterance without going through sentence-level grammatical parsing. The idea is to use some form of conceptual graph to represent frequently observed concept sequences. It has the advantage of low computational cost, but it might not be able to handle more complex cases where sophisticated grammatical analysis is necessary to determine the interrelationships among disjoint constituents [93]. This lack of robustness can be compensated for by an adequate repair facility in the dialogue management, so that the system is able to ask more direct questions and prompts in order to elicit the required information.
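
A minimal concept-spotting sketch is given below; the concept names and trigger words are illustrative and far simpler than a conceptual-graph representation.

# Minimal concept-spotting sketch: scan the recognized word stream for concept
# patterns without full grammatical parsing. Patterns and slots are illustrative.
CONCEPT_PATTERNS = {
    "show_layer":  ["show", "display"],
    "zoom_action": ["zoom", "enlarge"],
    "buffer_op":   ["buffer", "within"],
}

def spot_concepts(words):
    """Return the concepts whose trigger words appear in the utterance."""
    found = []
    lowered = [w.lower() for w in words]
    for concept, triggers in CONCEPT_PATTERNS.items():
        if any(t in lowered for t in triggers):
            found.append(concept)
    return found

# e.g. spot_concepts("Zoom in on this area".split()) -> ["zoom_action"]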

2.7.1.2. Discourse Interpretation

Some items in an input cannot be interpreted without the preceding dialogue context. For example, pronouns (such as they, it, etc.) and deictic expressions (such as these, the last one) usually refer to entities that were mentioned previously in the dialogue; ellipses (clauses that are syntactically incomplete) and anaphora can only be interpreted by considering the syntactic and semantic structures of previous clauses. These issues require that the system keep a record of previously mentioned items and structures in order to assist interpretation within the context of the previous discourse. Considerable attention has been given to the issues of representing and maintaining discourse context information. A simple approach for representing discourse context is to maintain a history list of elements mentioned in the previous discourse. To update the discourse context, the concepts of centering [94] and attentional state [95] are useful.
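
The history-list idea can be sketched as follows; the entity representation and the plural-matching heuristic are deliberately crude illustrations, not a centering algorithm.

# Sketch of a discourse history list used to resolve pronouns and deictic
# expressions against recently mentioned entities (most recent first).
class DiscourseHistory:
    def __init__(self):
        self._entities = []          # most recently mentioned entity first

    def mention(self, entity):
        self._entities.insert(0, entity)

    def resolve(self, referring_expression):
        """Very rough resolution: match plural pronouns to plural entities, etc."""
        plural = referring_expression.lower() in {"they", "these", "those", "them"}
        for e in self._entities:
            if e.get("plural", False) == plural:
                return e
        return None

history = DiscourseHistory()
history.mention({"name": "shelters", "plural": True})
history.mention({"name": "Route 26", "plural": False})
print(history.resolve("them"))       # -> {'name': 'shelters', 'plural': True}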

2.7.1.3. Intention and Belief Recognition

Interpretation of a user's input may also be driven by a set of expectations about what the user will do or say next. One approach for generating such expectations is to construct a model of the user's intentions and beliefs behind their communicative behavior.

When people engage in interactions with another agent, their communicated messages carry certain intentions and imply certain beliefs. For example, a spoken request "pick up a pen from the desk, please" not only carries an intention of requesting a physical action ("pick up a pen") to be performed, it also implies the speaker's beliefs that (a) there is a pen, there is a desk, and the pen is on the desk, and (b) there is a hearer who is capable of physically picking up a pen from the desk. In natural interactions, the system should recognize the reason or intention that leads the user to make a request and subsequently use that information to guide the response planning process. Recognizing the intention of an input involves two components: (1) identifying the purpose of the input, and (2) identifying how this purpose relates to prior intentions.

2.7.2. Response Planning and Generation

The response planning and generation phase takes the interpreted input and formulates a proper response for the current stage of the dialogue. We discuss this part of dialogue management in four components: plan reasoning, information control, mixed-initiative dialogue control, and response content assembly. Although these subcomponents are commonly integrated as one functional component in practical dialogue systems, it is important to consider them as separate aspects of dialogue design. The separation of these subcomponents allows a clear design of dialogue functionalities and can guide systems designed for better portability and extension in order to serve new domains and tasks [96, 91].

2.7.2.1. Plan Reasoning

Plan reasoning is often optional for dialogue systems with simple information retrieval tasks and fixed interaction styles, but it is necessary, and even a core component, for more intelligent dialogue systems that support problem solving. The idea of having a reasoning component in a dialogue manager has been featured in a number of dialogue systems. With direct access to three knowledge sources: task knowledge (general ideas of how tasks should be done), user knowledge (what each user knows and works on), and


world knowledge (world facts, processes, and events) (as indicated in Figure 4), the plan-reasoning component serves two main purposes:

i. to establish the system’s intention and belief;

ii. to elaborate the plan of actions for the task in focus.

To accomplish this, the plan reasoning component starts with the recognized user intentions and beliefs and uses its knowledge about the tasks and the problem domain to decide the next goals to be pursued by the system itself or jointly with the user. In the process of this means-end reasoning, the plan of actions is extended. If new obstacles (such as missing information) are discovered that require further communication with the user, it notifies the dialogue controller and prepares agenda items for it (see the corresponding link in Figure 4). The system also reasons about the sets of beliefs held by the users and by the system and makes sure they mesh well. When conflicting beliefs are detected, repair mechanisms are suggested to the dialogue controller and new agenda items are added (see the corresponding link in Figure 4).

When the plan reasoner has collected enough information for the system to act on retrieving information, it sends an action item with all the necessary details to the information controller (represented by the corresponding link in Figure 4). Besides generating action items for the dialogue controller and the information controller, the plan reasoner is also responsible for maintaining the dynamic context, such as the task states, the user's mental states, and the collaboration states.

2.7.2.2. Information Control

The information control component interacts with the external information systems on one side and communicates with the dialogue control and response content assembler on the other. Communicating with external information systems can be as simple as sending a query and receiving the result. However, queries to the external system may be ill-formed, which may result in no records or too many records being returned. The problem of no records being returned can be caused by vocabulary differences, due to synonymy (multiple terms describing the same object) and polysemy (a single term carrying multiple meanings), or by conceptual differences, where the ontology (how things are categorized and related) imposed by the user on the modeled world is incompatible with that of the system. Hence, the functions of the information control component should include: (1) some capability for vocabulary and ontology translation to increase the chance of successful queries, and (2) adequate capability to report to the dialogue controller why a query failed, possibly with suggestions on how to restate the query in the next round of user input. Information control is also useful for handling ill-formed queries that return too many irrelevant results together with relevant ones. In such cases, the system may suggest narrower terms or add query constraints.
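
A rough sketch of the vocabulary-translation and failure-reporting behavior is shown below; the synonym table and the GIS query interface are assumed for illustration only.

# Sketch of the vocabulary-translation and failure-reporting role of the
# information controller; the synonym table and query interface are assumptions.
SYNONYMS = {"shelter": ["refuge", "evacuation center"], "road": ["street", "highway"]}

def run_query(gis, term, **constraints):
    rows = gis.query(term, **constraints)
    if rows:
        return {"status": "ok", "rows": rows}
    # No records: retry with synonyms before reporting back to the dialogue controller.
    for alt in SYNONYMS.get(term, []):
        rows = gis.query(alt, **constraints)
        if rows:
            return {"status": "ok", "rows": rows, "note": f"matched synonym '{alt}'"}
    return {"status": "failed", "reason": "no_records",
            "suggestion": f"try a broader term than '{term}' or relax constraints"}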

Another function of information control is to handle response delays. Dialogue systems for crisis management are likely to encounter user requests that require a long processing time for the external information system to generate the result. For example, a query to generate buffer zones for thousands of map features may cause a geographical information system (GIS) to run for several hours. Information control should be able to detect potentially long transactions, provide an estimate of the time needed, and inform the user about possible long waiting times.

Although the above information control functions can also be found in generic information retrieval systems, the purpose of designing the information control as part of dialogue management (rather than as functions of external information systems) is to enable the dialogue system to handle such issues in a more user-centered and task-centered manner.

2.7.2.3. Mixed-initiative Dialogue Control

The goal of dialogue control in a crisis management system is to maintain the information flow between the user and the information system so that communication between the two is natural and effective in advancing the user's task. A dialogue can be user-led, system-led, or mixed-initiative. Human interaction with crisis management systems, as exemplified by the scenario of Figure 1, is inherently a


mixed-initiative dialogue, which means that dialogue control is shared among human and system agents. The user and the system are equally capable of introducing new topics, and are equally responsible for engaging in cooperative dialogue with the other participants.

Commonly used dialogue control strategies include finite-state-based, frame-based, plan-based, and agent-based approaches (for a recent review, see [91]). The choice of control strategy in a dialogue system depends on the complexity of the underlying task of a dialogue [96]. Finite-state-based and frame-based methods cannot support mixed-initiative dialogue because of their fixed dialogue control structures. Artificial intelligence (AI) planning methods of dialogue management offer sufficient models of complex task structures, but they require full access to the user's task schema, which may not be possible in group collaboration processes. The full complexity of human-system-human interaction in crisis management requires the most powerful, agent-based approach to handle mixed-initiative, collaborative planning on complex tasks. An agent-based approach uses advanced models of tasks and of users' intentions and beliefs, and implements complex grounding mechanisms to manage the dynamics of collaboration.

Dialogue control plays a central role in advancing the user's tasks while dealing with the need for dialogue repair and error handling. To detect and correct recognition and understanding errors, the system must provide adequate mechanisms for clarification, verification, and confirmation. Verification can either be explicit (by asking questions to be confirmed) or implicit (by restating, or otherwise presenting, what the system understood). Both approaches give the user an opportunity to correct errors. The challenging issue in verification design is to verify sufficiently but not too much, since every verification step adds to the length of the overall dialogue. Dialogue control is also the coordinator for generating responses. While all other components have the potential of adding new agenda items for system response generation, it is the responsibility of the dialogue control component to decide what priority and strategy should be used for responding to the last user input.
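
The trade-off between verifying sufficiently and verifying too much can be captured by a simple policy like the one sketched below; the thresholds are illustrative, not empirically derived.

# Sketch: choosing a verification strategy from recognition confidence so that the
# dialogue verifies "sufficiently but not too much". Thresholds are illustrative.
def verification_strategy(confidence, cost_of_error):
    if confidence > 0.9 and cost_of_error == "low":
        return "none"        # act directly
    if confidence > 0.7:
        return "implicit"    # restate the understood command while acting
    return "explicit"        # ask a confirmation question before acting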

2.7.2.4. Response Content Assembly

The response content assembly component is responsible for organizing the multimedia and multimodal response contents into a schema that is usable by the response presentation module. The need for this component also comes from the fact that response contents are often contributed by multiple components (such as the dialogue controller and the information controller), and must be buffered and synchronized before being forwarded to the user.

2.8. Semantics

Both the interpretation of multimodal input and the generation of natural and consistent responses require access to higher-level knowledge. In general, the semantics required by multimodal systems can be categorized along two dimensions: general vs. task/domain specific, and dynamic vs. static, as shown in Table 1. Together they provide the necessary context for deep semantic analysis of multimodal input, and for maintaining the context of dialogues between users and the system as well as of collaborations among multiple users. Some of these semantics are included in Figure 4 as part of the static and dynamic contexts.

2.8.1. Static Contexts

Static contexts include knowledge that is manually compiled and stored in knowledge bases. These knowledge components provide semantic input to various processing components, and they usually do not change during the course of dialogue interactions.

Table 1: Types of semantics

         | General                                                        | Task/Domain Specific
Static   | Semantics of language; Discourse acts; Common-sense knowledge  | Task structures; User models; Domain-specific model of the world
Dynamic  | Discourse history; Attentional states                          | Task states; User's mental states; Collaboration states


Linguistic/semantic knowledge exists mostly in the form of feature-based grammars that support both syntactic and semantic analysis of spoken input. Discourse knowledge includes knowledge about discourse structures and various speech acts. Task knowledge refers to knowledge about the structure of tasks in an application domain (e.g., hurricane response). It should reflect the general problem-solving model of the target domain; in particular, it could describe objectives (goals, subgoals, and their constraints), solutions (courses of action), resources (objects, space, and time), and situations (current world status) [96]. User knowledge describes the general properties of the users in terms of what they know and what they do. In particular, knowledge about a user's level of expertise in the problem domain and the tool domain can be useful for enabling the system to adapt its behavior when interacting with different persons. World knowledge is a structured record of relevant entities, processes, and events that have some effect on the dialogue system. Knowing the situational information about the current state of the world is often a precondition for setting the goals of task-domain actions, and special events in the world (e.g., flooding) can be used to initiate new dialogues or interrupt ongoing ones.

2.8.2. Dynamic Contexts

In contrast to static contexts, dynamic contexts are data structures that represent the current state of the interaction. They serve as a temporary store of information about the task in focus, the user's beliefs, and the status of collaboration. The contents of dynamic contexts are directly manipulated by various processing components. In the dynamic contexts of Figure 4, discourse states are records of the currently open dialogue segments, a history list of mentioned concepts, the current dialogue focus, and the current speech act. Discourse state information is used by the discourse interpretation and dialogue control modules for proper handling of discourse constraints on input understanding and response generation. Task states represent the planning status of the task in focus, and are used by the intention recognition and reasoning components. A user's mental state is a model of that individual user's beliefs and intentions at any given moment. Collaboration states are established and communicated beliefs and commitments that are shared (or intended to be shared) among all participants involved in the collaboration.

2.8.3. The Use of Contexts for Semantic Analysis

Semantic fusion of gestures with speech input: Multiple modes of input can be integrated using either early 'feature-based' binding or late 'semantic' binding, or a mix of both. Early binding integrates multimodal input at the feature level and is considered most appropriate for closely coupled and synchronized speech and gestures. However, interactive dialogues in collaborative problem solving are full of interruptions and interleaving effects, resulting in complex 'lagging' phenomena between speech and related gestures. Furthermore, when modalities are used in a complementary fashion, there is often a lack of direct feature-level or temporal clues for relating one mode of input to the other. Because of these limitations of feature-based approaches, plus their modeling complexity and training difficulty, more semantically oriented, late binding approaches have been developed. With the late semantic fusion approach, speech and gesture inputs are recognized separately. Each input mode is processed in parallel to derive an individual understanding before being combined with the other modality [97]. Considering the discourse context, the task context, and the user's goal stack facilitates the fusion process.
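
A minimal sketch of late semantic fusion, assuming each modality delivers a partial attribute-value frame, is given below; the slot names and the context fallback are illustrative.

# Sketch of late semantic fusion: each modality produces a partial semantic frame,
# and fusion unifies them using discourse/task context. Frame slots are illustrative.
def fuse_frames(speech_frame, gesture_frame, task_context):
    fused = dict(speech_frame)                     # e.g. {"action": "zoom", "region": None}
    for slot, value in gesture_frame.items():      # e.g. {"region": <polygon>}
        if fused.get(slot) is None:
            fused[slot] = value                    # complementary information
        elif value is not None and fused[slot] != value:
            fused[slot + "_conflict"] = value      # flag for dialogue-level repair
    # Fill remaining gaps from the task context (e.g., the currently focused map extent).
    for slot, value in fused.items():
        if value is None and slot in task_context:
            fused[slot] = task_context[slot]
    return fused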

User-adaptive responses: An explicit model of the user's current goals and beliefs allows the system to respond to the user's requests more intelligently by adapting the way objects are described, how verification is performed, and what level of detail is returned. Maintaining the semantic contexts of a complex crisis management system can be challenging, since each component of the contexts is likely to be shared by multiple functional units whose modifications and accesses have to be coordinated. Adding to this complexity is the potential co-existence of multiple knowledge representation schemes, such as rules, frames, and objects.


2.9. Usability Studies and Performance Evaluation

Often, interface refinements and suggestions for improvement result from feedback obtained during informal demonstrations to potential users of the system. Shneiderman [98] recommends a more formalized approach to advanced system interface design in order to identify the range of critical usability concerns. However, a formalized approach for a multimodal system does not yet exist; therefore, we must piece together elements from several approaches and draw upon a suite of methods for addressing questions about individual and collaborative human work with computer systems. A user-centered evaluation procedure modeled on that proposed by Gabbard et al. [99] for the design of virtual environments has the potential to contribute to the creation of a more formalized framework for the design of multimodal systems. In [99], Gabbard et al. identify four steps: (i) user task analysis, (ii) expert guidelines-based evaluation, (iii) formative user-centered evaluation, and (iv) summative comparative evaluation. This multistage process can help reveal usability concerns early in the design process. In designing a multimodal system, the sooner real users can interact with the system and produce real usability data, the easier it will be to identify key usability issues.

In addition to a human-centered usability testing approach, the Cognitive Systems Engineering design approach can assist early in the development process by allowing designers to gain a deep understanding of the underlying work domain. This approach can help focus development on the specific usability tasks within the crisis management work domain that are critical for multimodal design. Crisis management comprises multiple activities and actions that involve, among others, distributing and redistributing resources, identifying critical infrastructure, and prioritizing traffic flow along evacuation routes [100]. Here, we consider a simplified interaction task that could be used to complete any number of planning, mitigation, response, or recovery activities: the selection of areas on the screen by making a pointing gesture accompanied by an activation action (e.g., "Select these facilities over here."). This interaction is very similar to that performed using current devices; for example, the mouse is used to move the cursor to where selection is desired, and usually a mouse button is used to activate it.

One of the problems with multimodal performance evaluation studies is that the tasks used to evaluate selection have not been consistent across studies, making them very difficult to compare. The International Standards Organization (ISO) has published an emerging standard, ISO 9241, focused on the ergonomic design of office work with visual display terminals (VDTs). Part 9 of the standard, Requirements for non-keyboard input devices [101], addresses the evaluation of performance, comfort, and effort. Several experimental studies have adopted the recommendations of this standard as a basis for usability assessment. An example is MacKenzie [102], who has used this strategy to evaluate mice, touch-pads, pens, gyro-pads, and several other input devices. These methods for evaluating performance are based on the work of Fitts [103], who in 1954 conducted experiments to measure the information capacity of human articulations. One can also draw upon methods developed to address scientific questions about human perception and cognition, many of which have focused on map-based displays that are common in crisis management activities (see [104-106]). Since the results of formative user-centered usability evaluation experiments will affect some technology decisions, it is important to include them from the early phases of system design and development. The remainder of this section summarizes usability issues we believe are pivotal in creating an effective multimodal interface.
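
For reference, ISO 9241-9 style evaluations commonly use the Shannon formulation of Fitts' law and an effective throughput measure, which can be summarized as follows (D is the distance to the target, W its width, MT the movement time, a and b empirical constants, and W_e the effective width derived from the spread sigma of the selection endpoints); this is the standard formulation, not a result specific to the systems described here:

\[
MT = a + b\,\log_2\!\left(\frac{D}{W} + 1\right), \qquad
ID = \log_2\!\left(\frac{D}{W} + 1\right)
\]
\[
TP = \frac{ID_e}{MT}, \qquad
ID_e = \log_2\!\left(\frac{D}{W_e} + 1\right), \qquad
W_e \approx 4.133\,\sigma
\]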

Flexibility and Efficiency of Use: This point concerns the time required to complete a command relative to the complexity of the procedure. This time can be analyzed at different granularities: the time to make a selection, the time required to identify the user's intentions, and the time to complete the entire interaction. Efficiency is critically important in the context of crisis management because the first moments in an emergency are the most important; unless the system can provide efficiency improvements over current modes of interaction, it will not be useful.

Error Prevention: Accuracy is a critical factor in multimodal interface design and can be analyzed at the same levels of granularity as efficiency. In most cases, the overall interaction requires accuracy near 100%. Accuracy can be improved substantially by designing a good output


interface that requires confirmations at several steps of the interaction; however, there is a trade-off between accuracy and efficiency that has to be addressed carefully.

Learning Curve / Training: Another important factor is the speed at which new users get acquainted with the system. Decision makers and crisis managers have numerous demands and responsibilities. A crisis management system should require minimal training and be intuitively designed for ease of use.

Visibility of System Status (Multimedia Visual Feedback): The design of feedback can play a critical role in a multimodal interface. A good interface design can overcome some limitations of the system. One can improve the overall interaction of the system by providing a multimedia output that includes graphics, sounds, and speech.

Ergonomics: Ergonomic issues, not identified in traditional interface design, are crucial to multimodal systems. It is important to consider the potential for user fatigue and discomfort. A large screen-based system will require the user to interact with a digital display through motions and gestures coupled with voice commands; these activities could create discomfort for users operating the system, especially during emergency response activities where emergency managers work, often non-stop and with very little rest, for days and even weeks at a time.

Here, we have not emphasized all of Nielsen's [107] usability heuristics, including match between system and real world, user control and freedom, recognition rather than recall, and aesthetic and minimalist design, because a multimodal gesture-speech interface is not subject to these traditional limitations of WIMP interfaces. He also identified consistency and standards, which will become increasingly important with future prototypes, as well as help and documentation, which will be addressed through the dialogue management system (see Section 2.7).

Identifying key usability issues is important, but it is also critical to develop a set of performance metrics to measure the individual usability issues as well as the overall performance of the system. The performance metrics should be designed to evaluate both the complete system and its individual components. At the system level one can consider at least two broad stages of evaluation: formative and summative. At the formative level, one should consider a prototype interface with a lower degree of cognitive load (less active, less adaptive) in order to elicit more multimodal input from the user. It is also possible to develop a metric that measures the performance of a system by relating a grammatical model of multimodal constructs (most likely of the form Subject-Action-Object) to interaction time and errors. The relative subjective duration (RSD), which provides a means of probing the difficulty users have in performing tasks without requiring that users be questioned about the difficulty, can be another useful measure of system performance.
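
A minimal sketch of such per-task metrics is given below; it assumes that RSD is computed as the signed relative difference between the user's retrospective time estimate and the measured task time, which should be checked against the original definition of the measure.

# Sketch of simple per-task performance metrics; the RSD formula below is an
# assumption (signed relative difference between the user's retrospective time
# estimate and the measured task time), not taken from the cited work.
def task_metrics(actual_time_s, estimated_time_s, errors, commands):
    return {
        "completion_time_s": actual_time_s,
        "error_rate": errors / max(commands, 1),
        "rsd": (estimated_time_s - actual_time_s) / actual_time_s,
    }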

2.10. Inter-operability of Devices

One of the key aspects of a crisis management system is its collaborative framework. The system should be able to link up several regions in the country (or world) and allow collaborative tasks among people present at remote sites. The computing platforms, communication devices, and network connectivity vary from location to location. For example, the collaborators at the crisis management center will be using powerful computer systems and large screen displays, with access to a wealth of databases, such as weather and other GIS information, imagery, technical information, on-scene video, digital photography,

Figure 5: The system architecture of Unified Multimedia Application System (UMAS).


and other expert information. As in the case of the Domestic Emergency Response Information Services (DERIS)1, these databases will be accessed by signing into mission-critical web-based applications over broadband network connectivity (usually T1). In sharp contrast, the agents in the field will have low-power computing and electronic devices, such as PDAs and mobile phones. These agents will need to access the same databases and communicate with other sites through voice, text messaging, and e-mail. In some crisis management situations, it is essential to upload images and videos from the field, allowing the objects of interest in the field to be viewed by the collaborators. For example, a helicopter hovering over the area of a disaster may relay a video to the operations center, or a camera mounted on the helmet of a member of a bomb squad should allow a real-time feed to the control center so that experts there can advise how to defuse the bomb. Table 2 lists the variety of people and agents involved in a typical emergency response situation and the tools/devices that would be used by each of them.

The issue of interoperability across this wide range of devices is critical for a seamless flow of information and communication. Hence, it is important to design a unified multimedia application system (UMAS) that supports multiple devices and platforms (see Figure 5). For example, audio, text, and images can be captured at the control center and sent to the agent in the field, who can retrieve the message from the web-based messaging system using a PDA. Images captured in the field can be sent back to the control center, or to other platforms, for people to evaluate the situation.

Besides the unified multimedia messaging subsystem (UMMS) that is essential in the UMAS, a unified videoconferencing subsystem (UVS) is also needed within it for crisis management situations. Video uploaded through a camera from a desktop computer (say, at the control center) needs to be analyzed by the multimedia engine before it can be streamed to a low-power device like a PDA. The multimedia engine in this case will segment each image frame into meaningful patches or Video Object Planes (such as the face of a person, the non-face areas of the person, and the background region not containing

1 http://www.niusr.org/XiiProject.htm

Table 2: List of people and agents involved and the tools/devices to be used in a crisis management situation.

Location | Possible interface devices and computers | Examples of processors | Examples of network connectivity | Mode of interaction with computer/display
Central command centers | Large screen displays, cameras, microphones, high-end computers | Ultra-high-speed parallel vector computing; 64-bit processors; Pentium IV processors | T1 | Speech; hand gestures; keyboard and mouse
Distributed centers | Smaller screen displays, desktops, laptops, cameras, microphones | Pentium IV or other 32-bit processors | T1; DSL; wireless modem | Keyboard and mouse; hand gestures; speech; pen based (laptops)
Field agents (humans) | Laptops, PDAs, mobile phones, cameras, head mounted displays (HMDs) | Pentium IV processors (laptops); StrongARM (PDAs) | Wireless modem; 2.5/3G networks | Keyboard (laptops); pen based (laptops, PDAs); speech and keypad (mobile phones); eye tracking (HMDs)
Field agents (robots) | Cameras, dedicated microcontrollers | Pentium IV processors | Wireless modem; 2.5/3G networks | N/A


the person, etc.). Next, these patches are coded separately. Finally, the result is a very low bit-rate video that can be downloaded to a PDA. Hence, one of the key problems is the design of novel model-based coding schemes suitable for each device type.

The third subsystem that would be useful within the UMAS is a unified geographic information system (UGIS). For example, both the mobile computing devices in the field and the high-end desktop computers in a crisis management center will access the GIS database. The multimedia engine would enable the server to handle requests from the entire spectrum of devices (from low-power mobile devices to supercomputers) by processing and filtering the GIS data set appropriately. The agent in the field can receive a highly compressed and "succinct" version of the 3D terrain data, while the control center can have the same data at the highest resolution.

A possible software architecture for the UMAS, covering a large number of multimedia applications in crisis management, is shown in Figure 5. The device detection engine helps the system differentiate and determine the type of device the user is using to access the system. The multimedia engine, on the other hand, processes the data appropriately for each of the devices accessing the system, so that the size and type of the data are compatible with the device. Data is usually represented in XML format. XML enables extensive reuse of material and also serves as a common transport technology for moving data around in a system-neutral format.
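
The device-detection and content-adaptation idea can be sketched as follows; the device classes, profiles, payload keys, and user-agent heuristics are illustrative assumptions, not the actual UMAS implementation.

# Sketch of the device-detection / content-adaptation idea behind the UMAS
# multimedia engine; device classes, profiles, and the filtering step are assumptions.
DEVICE_PROFILES = {
    "pda":     {"max_image_px": 320 * 240,   "video": "very_low_bitrate", "terrain": "decimated"},
    "desktop": {"max_image_px": 1280 * 1024, "video": "standard",         "terrain": "full"},
}

def detect_device(user_agent):
    """Crude device detection from an HTTP User-Agent string (illustrative)."""
    ua = user_agent.lower()
    if "windows ce" in ua or "palm" in ua:
        return "pda"
    return "desktop"

def adapt_payload(payload, device_class):
    """Filter/scale the GIS and media payload to fit the requesting device."""
    profile = DEVICE_PROFILES[device_class]
    return {
        "terrain": payload["terrain_" + profile["terrain"]],   # hypothetical payload keys
        "video_mode": profile["video"],
        "images": [img for img in payload["images"]
                   if img["pixels"] <= profile["max_image_px"]],
    }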

Part III: Evolution of Systems and Implementation Details

In this section we first briefly discuss the state of the art in multimodal systems. Following this, we discuss the research done at The Pennsylvania State University and Advanced Interface Technologies, Inc. (AIT) that led to a series of multimodal interfaces (see Figure 6), focused especially on free hand gestures and spoken commands.

Integration of speech and gesture has tangible advantages in the context of HCI, especially when coping with the complexities of spatial representations [75]. Combining speech, gesture, and context understanding improves recognition accuracy. By integrating speech and gesture recognition, Bolt [25, 108] discovered that neither had to be perfect, provided they converged on the user's intended meaning. In [109], speech, gesture, and graphics are integrated with an isolated 3D CAD package. A similar approach is used in the NASA Virtual Environment Workstation. Another interface integrating speech and graphics

Figure 6: Evolution of speech/gesture driven dialog-enabled HCI systems at Penn State and AIT, 1997-2002: Weather Narration (valid multimodal data for bootstrapping, speech/gesture co-analysis), iMAP (speech/gesture driven HCI testbed for an interactive map), XISM (real-time testbed for crisis management scenarios), DAVE_G (integration with GIS, multi-user HCI for a GIS, dialog management), and AIT systems such as Eoz TV (public information access), SuperBounz (interactive entertainment), and Gift-Finder (immersive and interactive system for the retail industry, with visual avatar navigation).


is the Boeing "Talk and Draw" project [110], an AWACS workstation that allows users to direct military air operations. The ALIVE interface developed by [111] is a gesture and full-body recognition interface that allows users to interact with autonomous agents in a virtual environment. The system uses contextual information to simplify recognition.

Recently, hand-held computing devices have been gaining popularity among users. Their mobility, along with usability augmented by pen-based "gestural" input, was found especially beneficial for interacting with spatially presented information, e.g., [112, 113]. Using state-of-the-art speech recognition, Microsoft's project MiPad has demonstrated a successful combination of speech and pen input for interacting with a hand-held computer. The pen modality has also been successfully applied to indexing audio recordings for later retrieval [114]. Since the QuickSet collaborative system [115] of the 1990s, which enabled users to create and position entities on a map with both speech and pen-based gestures, new avenues for more effective hand-held HCI have opened. Since then, a number of pen-based and hand gesture interfaces have been designed, cf. [4]. Distinct in their input functionality and multimodal architecture, all of them aimed to achieve easy and intuitive human-computer interaction. Because there are large individual differences in ability and preference to use different modes of communication, a multimodal interface permits the user to exercise selection and control over how they interact with the computer [116]. In this respect, multimodal interfaces have the potential to accommodate a broader range of users than traditional unimodal interfaces, including different age groups, skill levels, cognitive styles, and temporary disabilities associated with a particular environment. With respect to the functionality of multimodal input, other known pen-based applications, e.g., IBM's Human-Centric Word Processor [117] and NCR's Field Medic Information System [118], integrate spoken keywords with pen-based pointing events. In contrast, QuickSet [115] and Boeing's Virtual Reality Aircraft Maintenance Training Prototype [119] process speech with a limited vocabulary of symbolic gestures. Except for the Field Medic Information System, which supports unimodal recognition only, these applications perform parallel recognition of pen and spoken inputs. Multimodal integration of the inputs is achieved later by semantic-level fusion, where keywords are usually associated with pen gestures.

The speech-gesture integration framework resulting from our research is closest to the IBM VizSpace [120] prototype system, but differs in design and integration philosophy at both the conceptual and the implementation level. In contrast, the associated researchers from Penn State and AIT aim at developing multimodal systems that, by strict design, are able to operate with moderately priced, off-the-shelf hardware. Other systems related to our work include Compaq's Smart Kiosk [121], which allows interaction using vision (i.e., person detection) and touch. MIT has developed a range of prototype systems that combine aspects of visual sensing and speech recognition, but these in general rely on a large amount of dedicated hardware and distributed computing on multiple platforms. Along the same lines, Microsoft's EasyLiving system [122] aims at turning people's living space into one large multimodal interface, with omnipresent interaction between users and their surroundings.

As discussed in Part II, valid multimodal data is one of the basic design elements of multimodal interfaces. To address this issue, it is of paramount importance to develop a computational framework for the acquisition of non-predefined gestures. Building such a system involves challenges that range from low-level signal processing to high-level interpretation. We sought a solution for bootstrapping continuous gesture recognition (at the level of gesture phonemes) through the use of an analogous domain that does not require predefinition of gestures. We refer to it as the weather domain. The weather domain is derived from the weather channel on TV, which shows a person gesticulating in front of a weather map while narrating the weather conditions (Figure 7a). Here, the gestures embedded in a narration are used to draw the attention of the viewer to specific portions of the map. A similar set of gestures can also be used for the display control problem. The natural gestures in the weather domain were analyzed with the goal of applying the same recognition techniques to the design of a gesture recognition system for our first system, called iMap [123] (Figure 7b). iMap was developed in 1999 and received significant media attention as the first proof-of-concept speech/gesture interface of its kind [84]. It required manual initialization and calibration and processing on four networked SGI O2 workstations. iMap utilized an interactive campus map on a


large display that supported a variety of spatial information browsing tasks using spoken words and free hand gestures. A second system, called crisis management (XISM), was completed in 2000 and simulated an urban emergency response system for studying speech/gesture interaction under stressful and time-constrained situations [123]. XISM extended the iMap framework to explore a more dynamic environment representative of stressful crisis situations.

The XISM system was the first natural multimodal speech-gesture interface to run on a single processing platform, holistically addressing various aspects of human-computer interface design and development. The iMap and XISM systems were developed as part of the Federated Laboratory on "Advanced and Interactive Displays" funded by the U.S. Army. More recently, under a grant from the National Science Foundation, a system called Dialog Assisted Virtual Environment for Geoinformation (DAVE_G) is being developed for multimodal interaction with a GIS (geographic information system). The goals of this research are to:

• Develop a cognitive framework for guiding the design of a multimodal collaborative system, with special emphasis on context modeling, dialogue processing, and error handling techniques.

• Design and implement robust computational techniques for inferring a user’s intent by fusing speech, communicative free hand gestures, pen gestures, gaze, and body movement using computer vision, speech recognition and other pattern recognition techniques.

• Produce a formal framework for doing user task analysis in the context of geospatial information technologies applied to crisis management.

• Conduct experimental evaluation to determine causal relationships between multimodal interface characteristics and user performance/ behavior.

• Demonstrate, using the experimental test-beds, that a team of participants can conduct a simulated emergency response exercise.

In the following subsections we will discuss the XISM and DAVE_G systems.

3.1. Crisis Management (XISM) – A Simulation Test-bed

In this subsection we provide an overview of the XISM system. To study the suitability of advanced multimodal interfaces for typical crisis management tasks such as emergency response, a research test-bed system called XISM was developed.

The user takes the role of an emergency center operator and is able to use speech and gesture commands to dispatch emergency vehicles to crisis locations in a virtual city (Figure 8). The operator stands at a distance of about 5 feet from the display, in the view of a camera located on top of the unit. The operator's speech commands are captured with a microphone dome hanging from the ceiling above. The operator has a bird's-eye view of the city but has the ability to zoom in on locations (in plan view) to get a more comprehensive perspective. The goal of the operator is to acknowledge and attend to incoming emergencies, indicated by animated emergency symbols and accompanying audible

a) b)

Figure 7: Two analogous domains: a) Analysis of weather narration broadcast and b) iMap testbed.

Figure 8: XISM, a multimodal crisis management system.


alarm signals (Figure 9). Alarms in the simulation are acknowledged by pointing at the location in conjunction with appropriate verbal commands (for example, "Acknowledge this emergency." or simply "Acknowledge this!"). The acknowledgement indicates to the system that the operator is aware of the emergency and that it will be attended to shortly. The visual as well as audio signals emitted by the alarm are then reduced to a lower level. The speed at which each emergency is attended to ultimately determines the final performance of the operator.

Hence, the emergencies have to be resolved as quickly as possible by sending appropriate emergency vehicles to the crisis locations. For this, the operator has to decide which type of units to dispatch and from which station to dispatch them. Emergency stations (hospitals, police and fire stations) are spread throughout the city and have limited dispatch capacities. Hence, the operator has to schedule the dispatch resources appropriately in order to avoid resource depletion in certain areas of the city. Units are dispatched through speech-gesture commands such as "Dispatch an ambulance from this station to that location." accompanied by an appropriate deictic contour gesture.
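
A minimal sketch of how such a two-referent dispatch command could be grounded is given below; the data structures and the nearest-in-time binding rule are illustrative simplifications of the fusion strategy described in Part II.

# Sketch of grounding a dispatch command such as "Dispatch an ambulance from this
# station to that location": each deictic phrase is bound, in order of occurrence,
# to the map entity selected by the temporally closest pointing gesture.
def ground_dispatch(deictic_phrases, pointing_events, pick_entity):
    """deictic_phrases: [(phrase, time)], pointing_events: [(screen_xy, time)]."""
    bindings = []
    for phrase, t_word in deictic_phrases:
        nearest = min(pointing_events, key=lambda p: abs(p[1] - t_word))
        bindings.append((phrase, pick_entity(nearest[0])))   # map hit-test is assumed
    return bindings   # e.g. [("this station", station_3), ("that location", (x, y))]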

3.1.1. Simulation Parameters

The XISM system allows variation of many different simulation parameters, which makes it possible to expose the operator to a variety of scenarios. In particular, the task complexity can be increased by adding resource constraints (more emergencies than resources) as well as by changing the cognitive load level through the scale of the scenario (city size, urbanization density, emergency rate) or the duration of the simulation. When, due to the size of the city, the displayed information is too dense to perform accurate dispatch actions, the operator needs to be able to get more detailed views of city locations. The required speech-gesture commands to complete this task are of the form "Show this region in more detail." or "Zoom here." with an accompanying area or pointing gesture. While the pointing gesture enlarges the view to the maximum for a given location, the area gesture allows controlling the level of detail by naturally conveying which area should be enlarged.

3.1.2. Interaction Management

The tasks that the system needs to perform vary over time and are governed by an empirically evolved state transition model of an interaction session between a user and the system (Figure 10).

Interaction Session: An interaction session consists of three main phases. During the initialization phase the dialogue between a user and the system is established. It is followed by the interaction phase, where the actual communication between the user and the system takes place. Finally, the termination phase is entered when the user (or the system) decides to conclude the dialogue.

Initialization Phase: In the absence of any users within the sensor range, the system is in the wait state. User detection is achieved through face detection [46]. If at least one person has been detected, the system enters the attraction state in which it tries to establish a dialogue with one of the currently tracked

Figure 9: City with six active emergencies: One fire (yellow/red circle), four medical emergencies (red crosses) and one police incident (blue/yellow circle).

Figure 10: State transition model of an interaction session between a user and the system.


people. In addition, the system continues to detect and track new arrivals. If at any point all people leave the sensor area, the system falls back into the waiting state. Once the user has stepped into the sensor range and is found to be facing the system, the system enters the final bootstrapping state of the initialization phase. The system immediately performs hand detection to obtain an initial location of the user’s active hand and initializes the hand-tracking algorithm. Finally, it adjusts its active camera to the exact location, height and size of the user to allow optimal sensor utilization, after which the interaction phase is initiated.

Interaction Phase: During the interaction phase, the actual dialogue between the system and the user commences. The system utilizes speech recognition, motion analysis and gesture recognition as its main interaction modalities. The vision-based modalities mainly rely on robust continuous head and hand tracking based on motion and color cues. Recognized gestures are combined with speech recognition results by a speech-gesture modality fusion module. The semantic integration of the final user commands depends heavily on the application and the time varying context of the system, which constrains the set of possible user actions for increased response reliability.

Termination Phase: The system gracefully terminates an interaction session when, for example, the user has actively indicated that the session is concluded. It then resets its internal states, positions the active camera in wide-angle initialization mode and switches back to the initialization phase.

3.1.3. System Components

To capture speech and gesture commands, the XISM system utilizes a directional microphone and a single active camera. A large number of vision (face detection, palm detection, head and hand tracking) and speech (command recognition, audio feedback) related components cooperate under tight resource constraints. In order to minimize network communication overheads and hardware requirements, the system was developed to run on a single processing platform. From a system design perspective, smooth and automatic interaction initialization, robust real-time visual processing and error recovery are very important for the success of advanced interface approaches for crisis management applications, because unexpected system behavior is unacceptable for mission-critical systems. The XISM system was built on the holistic approach to multimodal HCI system design adopted in the iMap framework, and an attempt was made to address all of the above issues (see Figure 11).

3.1.3.1. Vision Components

Since all components are integrated onto a single standard PC, the allowable complexity of motion tracking methods is limited, especially because the system latency has to be minimized to avoid “sluggish” visual feedback and system response.

Face Detection: One of the most important and powerful components in the system is the face detector, used for robust user detection and continuous verification of the head track status. The implementation [46] is based on neural networks and is tuned to operate at a very low false positive rate of less than 0.5%.
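As a rough, hedged illustration of how a per-frame face detection result can gate user detection and track verification, the sketch below uses OpenCV's stock Haar cascade purely as a stand-in for the neural-network detector of [46]; the cascade, thresholds and gating logic are assumptions, not XISM components.

```python
import cv2  # OpenCV, used here only as an illustrative stand-in for the detector in [46]

# Stock frontal-face cascade shipped with OpenCV (not the detector used in XISM).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_present(frame_bgr) -> bool:
    """Return True if at least one frontal face is detected in the frame.
    A boolean like this can trigger the attraction state and also verify
    that the head tracker is still locked onto a face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                     minSize=(60, 60))
    return len(faces) > 0
```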

Figure 11: Overview of the XISM framework.


Hand Detection: With proper camera placement and a suitable skin color model extracted from the face region, strong priors can be placed on the potential appearance and location of a user’s active hand in the view of the camera. The automatic hand detection operates on the assumption that the object to be detected is a small, skin-colored, blob-like region below and slightly off center with respect to the user’s head. In addition, the hand detector favors but does not rely on the occurrence of motion at the location of the hand and integrates evidence over a sequence of 60 frames. The (time varying) location of the palm is then given by the optimal (most probable) hypothesis-connecting path through the set of frames under consideration. The score of a path combines the probability of each hypothesis with an additional cost associated with the spatial shift in location from one frame to the next. The optimal path can be found efficiently with dynamic programming, using the Viterbi algorithm [82].
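A minimal Python sketch of this kind of dynamic-programming search is given below; the per-frame hypothesis format and the shift_cost weight are illustrative assumptions, not values from the XISM implementation.

```python
import math

def best_hand_path(hypotheses, shift_cost=0.05):
    """Viterbi-style search for the most probable hypothesis-connecting path
    across a window of frames.

    hypotheses: list over frames; each frame is a list of (x, y, prob) tuples.
    shift_cost: weight penalizing spatial jumps between consecutive frames
                (an assumed constant, not taken from the paper).
    Returns the chosen (x, y) location per frame.
    """
    # cost of a single hypothesis = -log(probability)
    cost = [[-math.log(max(p, 1e-9)) for (_, _, p) in frame] for frame in hypotheses]
    back = [[0] * len(frame) for frame in hypotheses]

    for t in range(1, len(hypotheses)):
        for j, (xj, yj, _) in enumerate(hypotheses[t]):
            best_i, best_c = 0, float("inf")
            for i, (xi, yi, _) in enumerate(hypotheses[t - 1]):
                jump = math.hypot(xj - xi, yj - yi)
                c = cost[t - 1][i] + shift_cost * jump
                if c < best_c:
                    best_i, best_c = i, c
            cost[t][j] += best_c
            back[t][j] = best_i

    # backtrack from the cheapest final hypothesis
    j = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    path = []
    for t in range(len(hypotheses) - 1, -1, -1):
        x, y, _ = hypotheses[t][j]
        path.append((x, y))
        j = back[t][j]
    return list(reversed(path))
```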

Head and Hand Tracking: The algorithms for head and hand tracking are based on similar but slightly different approaches. Both trackers are based on rectangular tracking windows whose locations are continuously adapted using Kalman filters [124] to follow the user’s head and hand. While the head tracker relies solely on skin color image cues, the hand tracker is a continuous version of the palm detector and is geared towards skin-colored moving objects. Prior knowledge about the human body is utilized for avoiding and resolving conflicts and interference between the head and hand tracks. The tracking methods used are based on simple imaging cues but are very efficient and require less than 15% of the processing time of a single CPU.
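For concreteness, a minimal constant-velocity Kalman filter for a single tracking-window centre is sketched below; the motion model and noise settings are illustrative assumptions rather than the parameters used in [124] or in XISM.

```python
import numpy as np

# Minimal constant-velocity Kalman filter for a tracking-window centre (x, y).
class WindowTracker:
    def __init__(self, x, y, dt=1 / 30.0):
        self.state = np.array([x, y, 0.0, 0.0])          # [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                        # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is observed
        self.Q = np.eye(4) * 0.01                        # process noise (assumed)
        self.R = np.eye(2) * 4.0                         # measurement noise (assumed)

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                            # predicted window centre

    def update(self, zx, zy):
        z = np.array([zx, zy])
        y = z - self.H @ self.state                      # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.state = self.state + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.state[:2]
```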

Continuous Gesture Recognition: The main visual interaction modality is continuous gesture recognition. We adopt Kendon's framework [125], organizing gestures into a hierarchical structure. He proposed the notion of a gestural unit (phrase) that starts at the moment when a limb is lifted away from the body and ends when the limb moves back to the resting position. After extensive analysis of gestures in the weather narration domain and iMap [126, 90], we have selected the following strokes: contour, point, and circle.

Bootstrap and Evolve strategies were used to design the system. Based on our experience with examining weather narration broadcasts, we temporally modeled deictic gestures based on a set of fundamental gesture primitives that form a minimal and complete basis for the large-display interaction tasks considered by our applications. The statistical gesture model and continuous recognition are based on continuous observation density Hidden Markov Models [82] and are described in detail in [127].

The gesture acquisition system yields continuous streams of observations. Hence, traditional state sequence approaches cannot be employed, as one has no easy method of detecting the beginning and end of the gestures embedded in the stream. We have implemented the state estimation procedure in a continuous mode by employing a simple yet powerful approach called Token Passing [87]. Token passing operates by maintaining a set of tokens that are copied and passed around in the compound transition network. As tokens are passed around in the network, transitions and observations incur costs equal to the negative logarithm of the corresponding probability values. At each time step and for each state, every token associated with the given state is duplicated according to how many outgoing transitions exist for that state. Each duplicate is passed along one outgoing transition and the cost of this transition is added to the current cost associated with the given token. In addition, the cost associated with the observation available at this time step is obtained from the target state observation density and also added to the current cost of the token. Then, for every state, only a number of tokens with the lowest cost are maintained, while all others are discarded. In addition to this procedure, the state transition history for each token is maintained. Then, at periodic intervals, the set of tokens that share the most probable history up to a time located a certain interval in the past is determined. The state transition history of these

Figure 12: State transition model for deictic gestures.


tokens is assumed to be the true sequence of performed gestures. The procedure is then continued after discarding all but the offspring associated with this most probable history.
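The following minimal Python sketch illustrates the token passing mechanism described above over a tiny hand-built network; the states, costs and beam size are illustrative assumptions, not the trained XISM compound network of HMMs with garbage and rest models.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Token:
    cost: float = 0.0
    history: tuple = field(default_factory=tuple)   # visited states so far

def step(tokens_per_state, transitions, obs_neg_log, beam=3):
    """Advance all tokens by one observation.

    tokens_per_state: dict state -> list[Token]
    transitions:      dict state -> list of (next_state, transition_cost)
    obs_neg_log:      dict state -> -log P(observation | state) for this frame
    """
    new_tokens = {s: [] for s in transitions}
    for state, tokens in tokens_per_state.items():
        for tok in tokens:
            for nxt, t_cost in transitions[state]:
                new_tokens[nxt].append(Token(
                    cost=tok.cost + t_cost + obs_neg_log[nxt],
                    history=tok.history + (nxt,)))
    # keep only the cheapest tokens per state (beam pruning)
    for s in new_tokens:
        new_tokens[s] = sorted(new_tokens[s], key=lambda t: t.cost)[:beam]
    return new_tokens

# Tiny example: rest <-> point gesture, with assumed transition costs.
net = {"rest": [("rest", 0.1), ("point", 0.7)],
       "point": [("point", 0.1), ("rest", 0.7)]}
tokens = {"rest": [Token()], "point": []}
for frame_obs in [{"rest": 0.2, "point": 2.0}, {"rest": 1.5, "point": 0.3}]:
    tokens = step(tokens, net, frame_obs)
best = min((t for toks in tokens.values() for t in toks), key=lambda t: t.cost)
print(best.history)   # most probable state sequence so far
```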

After bootstrapping, refinement of the HMMs and the recognition network was performed by “pulling” desired gestures from a user. The system extracted gesture and speech data and automatically segmented the gesture training data obtained in this way. After training the HMMs on the isolated gesture data, a final embedded training of the compound network was performed. To accommodate incidental gesticulation and pauses in addition to meaningful gestures, garbage and rest models were added to the compound network (see Figure 12).

3.1.3.2. Audio Components

Speech Recognition: Speech recognition has improved tremendously in recent years and the robust incorporation of this technology in multimodal interfaces is becoming feasible. The XISM system utilizes a speaker-dependent voice recognition engine (ViaVoice from IBM) that allows reliable speech acquisition after a short speaker enrollment procedure. The set of all possible utterances is defined in a context-free grammar with embedded annotations. This constrains the vocabulary that has to be understood by the system while retaining flexibility in how speech commands can be formulated. The speech recognition module of the system only reports time-stamped annotations to the application front end, which is responsible for the modality fusion and context maintenance.

Audio Feedback: Audio feedback in the form of sound effects and/or speech is an important component of multimodal interfaces. The current XISM system utilizes audio effects of varying volume both to notify the operator of occurring emergencies and to create a task-appropriate noise environment (e.g., sirens) that an actual operator would be subjected to.

3.1.4. Modality Fusion

In order to correctly interpret a user’s intent from his or her utterances and gestural motions, the two modalities have to be fused appropriately. Due to the statistical methods employed for continuous recognition, both the speech recognition and gesture recognition systems emit their recognition results with time delays of typically 1 sec. Verbal utterances such as “show me this region in more detail”, taken from a typical geocentric application, have to be associated with co-occurring gestures such as “<Preparation>-<Area Gesture Stroke>-<Retraction>”. Understanding the temporal alignment of speech and gesture is crucial in performing this association. While in pen-based systems [36] gestures have been shown to occur before the associated deictic word (“this”), our investigations from HCI and weather narration [81] showed that for large screen display systems, the deictic word occurred during or after the gesture in 97% of the cases. Hence, modality fusion can reliably be triggered by the occurrence of verbal commands. The speech recognition system emits streams of time-stamped annotations embedded in the speech grammar; for the above case one would obtain (see Figure 13)

…[ZOOM, t0, t1] [LOCATION, t1, t2] [REGION, t2, t3]…

The annotation “LOCATION”, occurring around the time $t_s = (t_1 + t_2)/2$, corresponds to the occurrence of the deictic keyword “this”. Similarly, the gesture recognition might report

…[PREP, s0, s1] [AREA, s1, s2] [RETRACTION, s2, s3]…


Figure 13: Speech gesture modality fusion.


indicating that an area gesture was recognized in the time interval $[s_1, s_2]$.

Using the time stamp of the deictic keyword, a windowed search in the gesture recognition result history is performed. Each past gesture stroke is checked for co-occurrence with appropriate annotations. Given, for example, time stamps $[s_1, s_2]$ for a gesture stroke, association with a keyword that occurred at time $t_e$ is assumed if $t_e \in [s_1 - \Delta_b,\; s_2 + \Delta_a]$, where $\Delta_b$ and $\Delta_a$ are constants determined from training data. This approach allows the occurrence of the keyword a short time before the gesture and a longer time delay after the gesture. Upon a successful association, the physical content of the area gesture, namely the hand trajectory data for the time interval $[s_1, s_2]$, is used to obtain the gesture-conveyed components of the compound speech-gesture command. In the case of, for example, an area gesture, a circle is fitted to the gesture data obtained in order to determine which region of the screen should be shown in more detail. The framework presented requires only moderate computational resources. For a detailed description of the system components see [128]. The main system tasks were separated into a set of separate execution threads as shown in Figure 14. Since many of the components run on different time scales (especially the speech recognition, face detector and active camera control), the architecture was designed to take advantage of multi-threaded parallel execution. Communication between components is performed using message passing and straightforward thread synchronization.
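The windowed speech-gesture association described above can be sketched in a few lines of Python; the values chosen for the two window constants are assumed placeholders, not the constants trained for XISM.

```python
# Sketch of the windowed keyword-to-gesture association described above.
DELTA_B = 0.2   # seconds the keyword may precede the gesture stroke (assumed value)
DELTA_A = 1.0   # seconds the keyword may trail the gesture stroke (assumed value)

def associate(keyword_time, gesture_history):
    """Return the most recent gesture stroke whose extended time window
    [s1 - DELTA_B, s2 + DELTA_A] contains the deictic keyword time.

    gesture_history: list of (label, s1, s2), newest last.
    """
    for label, s1, s2 in reversed(gesture_history):
        if s1 - DELTA_B <= keyword_time <= s2 + DELTA_A:
            return (label, s1, s2)
    return None

history = [("POINT", 2.0, 2.4), ("AREA", 5.1, 6.0)]
t_this = 6.3        # timestamp of the keyword "this" from the speech stream
print(associate(t_this, history))   # -> ('AREA', 5.1, 6.0)
```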

3.1.5. Usability Study

The XISM system has been and is currently being used for conducting cognitive load studies in which different aspects of multimodal interaction can be measured accurately and compared to traditional and alternative interaction methods under variable but controlled conditions. Informal user studies with the crisis management and related multimodal applications [128] have shown that 80% of users had successful interaction experiences. In addition, observations revealed that the system behaved according to its specifications in 95% of the cases. The acceptance of the XISM system was high, with little or no difficulty in understanding the “mechanics” of multimodal interaction. Formal user studies are currently in progress.

3.2. DAVE_G

This section describes the subsequent HCI system DAVE_G, which takes XISM one step further to accommodate the need for collaborative work on geospatial data in crisis management. Geographic Information Systems are essential in all aspects of crisis management, including emergency operations. Emergency managers have to make fast and effective decisions about planning, response and recovery strategies. Also, rapid access to geospatial information is crucial to decision-making in emergency situations when decision makers with different domains of expertise need to work collaboratively, using GIS for hazard mapping and visualization as well as for improving situational awareness. Clear disadvantages of current emergency operations are the rather long, tedious, and error-prone

Figure 14: Overview of the XISM system architecture. Each of the bold framed boxes constitutes a separate thread of execution.


interactions with GIS-specialists who work in the background to produce maps requested by the decision makers [129]. In order to overcome those problems, an effective and natural-to-use interface to GIS is currently being developed that allows the individuals participating in crisis management to utilize natural language and free-hand gestures as a means of querying GIS, where gestures provide more effective expression of spatial relations (see Figure 15).

Compared to XISM, which can only handle well-structured commands, the interface of DAVE_G broadens the spectrum and complexity of expressible requests and interaction patterns tremendously. Therefore, a form of dialog management is needed to process ill-structured, incomplete, and sometimes incorrect requests. The dialogue manager in DAVE_G is able to understand and guide the user through the querying process, and to verify and clarify with the user in case of missing information or recognition errors. To provide such behavior, this form of guidance and dialog management needs to have knowledge and understanding of the current discourse context and task progress, and needs to maintain a model of users in terms of their intention, attention and information pool. Through multimedia feedback and verification questions, collaborations are always grounded in common understandings, and ambiguous requests are resolved immediately through dialogs requesting clarification from the user.

3.2.1. Evolution of DAVE_G from XISM

We are building on the XISM framework for speech- and gesture-based human-computer interaction, extending its single-user interaction interface to achieve our goal of a dialog-assisted, collaborative, group-work-supporting interface to an intelligent GIS. This task poses several challenges to the existing HCI framework of XISM. The HCI for a GIS needs to allow multiple users to access geospatial data and support their participation in a collaborative decision making process. Thus, XISM has to be able to capture each user’s interaction with the system and integrate it in a meaningful way with the GIS querying process.

The first challenge can be addressed in several ways. The XISM framework uses only a single camera but supports defining several capture zones for multiple users. This enables the HCI to process visual interactions of more than one user and allows gesture-based control over the system. However, this approach has disadvantages in computational needs as well as in the utilization of the users’ workspace and mobility. Tracking multiple people with one camera linearly increases computational costs and lowers the chances of robust real-time interactions with the system. Also, splitting the visible workspace of the camera into several capture zones for multiple users impairs their freedom of movement and makes tracking of more than two users infeasible due to occlusion and reduced image resolution.

The second challenge is to integrate spoken commands and requests of multiple users into the HCI. One easy way to tackle this is to use only one microphone input that is shared by all users. Of course, this is not a very practical solution because it hides the identity of the user (unless speech-based user identification is applied) and thus poses many challenges and ambiguities for the speech/gesture fusion and the interpretation of user actions. Therefore, each user needs to have a separate speech input channel that is directly associated with their identity. Considering the disadvantages of a single camera for multiple-user tracking and the need for separate speech input, we chose to use multiple cameras and microphones to support a high degree of freedom for the human-computer interaction front-end. We duplicated

Figure 15: DAVE_G prototype: Interaction in the loop with natural speech, gesture and dialog interface.


complementary modules for speech and gesture recognition from XISM and integrated them into the distributed interaction framework of DAVE_G.

The integration of multiple user requests into one system that pursues a common goal raises further challenges beyond the initial XISM, in which only one relatively simple user command had to be associated with one specific action that was then carried out by the system. DAVE_G, in contrast, attempts to leave behind such a command-and-control driven environment and reaches for a more natural interface to query geospatial information. In DAVE_G, dialogue is neither user-led nor system-led, but rather mixed-initiative, controlled by both the system and the users in a collaborative environment. It allows complex information needs to be incrementally specified by the user, while the system can initiate dialogues at any time to request missing information for the specification of GIS query commands. This is important since the specification of required spatial information can be quite complex, and the input of multiple people in several steps might be needed to successfully complete a single GIS query. Therefore, the HCI can no longer require the user to issue predefined commands, but needs to be flexible and intelligent enough to allow the user to specify requested information incompletely and in collaboration with other users and the system. A description of our initial prototype version of DAVE_G is given in greater detail in [130].

Figure 16 shows the system design for the current prototype of DAVE_G, which addresses some of the challenges discussed earlier in this paper. The prototype uses several instances of the speech and gesture recognition modules from XISM. These modules can be run on distributed systems or on a single machine. Each module recognizes, and interprets at a lower level, user interactions and sends recognized phrases and gesture descriptions to an action integration unit, where direct feedback information (such as hand positions and recognized utterances) is separated from actual GIS inquiries. While the former is used to give immediate feedback to the user, the latter is sent to a dialog management component that processes user requests, forms queries to the GIS, and engages in a collaborative dialog with the users. The dialog manager is built using an agent-based framework that uses semantic modeling and domain knowledge for collaborative planning. The following section discusses selected components of DAVE_G in greater detail.

3.2.2. Interpreting User Actions and Designing a Meaningful Dialog

In order to support such complex user inputs as depicted in the scenarios given at the beginning of this paper, two challenges must be addressed. The first is to achieve satisfactory recognition accuracy and robustness of the speech recognition engine used, which has a direct impact on the overall system performance. The second is that the semantic analysis of the recognized spoken and gestured input, and their interpretation and integration into a global task for intelligent information retrieval, has to be powerful enough for such a broad and complex domain as crisis management. We started out with the speech recognition engine IBM ViaVoice, which was also used in XISM. Here, the key to accurate performance is a finely tuned context-free grammar that defines syntactically and semantically correct phrases and sentences that are to be used in the spoken interaction. The following section describes some of the issues that have

Figure 16: Component Graph and Communication Flow in the current DAVE_G prototype.


to be handled when natural HCI beyond simple command-and-control environments makes use of grammar-based speech recognition.

3.2.2.1. Speech Recognition

The first step in understanding multimodal input is to recognize meaningful phrases that help in discourse interpretation and intention recognition. Several methods for semantic parsing were outlined previously that differ in the degree of structural constraints that are placed on accepted input forms. While semantic grammar parsing does have the disadvantage of imposing many constraints on well-formed input structures (e.g., it does not allow irregular or spontaneous speech), it does improve overall speech recognition accuracy compared to loose concept or keyword spotting from unconstrained speech.

To bootstrap the development of DAVE_G, several constructed sample scenarios built on the basis of onsite visits with emergency managers were analyzed. Through insight gained from interviews, a common representative structure of most user actions could be identified and compiled to create an overall speech interaction corpus. In general, the user might perform one of three actions: request, reply or inform. Requesting information from the GIS (e.g., asking for a map with certain features and properties) is the most frequently used interaction. In cases where the request is ambiguous or cannot be completely understood, the dialog manager will respond with a question that prompts the user to provide more information to resolve those ambiguities. A dialog is achieved if the user replies and allows the dialog manager to complete the initial request, thus helping to make progress on the current task. The third action a user might perform is to inform, in other words, to communicate with the GIS about facts and beliefs concerning the current task stage or simply to let the system know about the user’s intentions.

The sentence structures of user requests and replies, obtained from the scenarios, have been thoroughly analyzed to define a context-free grammar. While for XISM the grammar consisted of only a few hundred possible sentences with little breadth and depth to accomplish the recognition of simple commands like “Send a fire truck from here to there”, for DAVE_G the required spectrum of request and command utterances becomes much more complex and modeling all possible user inputs is not possible. Based on the scenario analysis, a subset of possible request utterances was chosen. The most commonly used request can be modeled as: <INTRO>-<REQUEST>-<ENTITY>-<RELATION>-<ENTITY>. A <REQUEST> can often be matched directly to a type or sequence of GIS queries (e.g., show a layer, select features). The action is applied to an <ENTITY>, which can be any feature on the map, an attribute of an entity or any other abstract object represented by the GIS. Each entity can be further described and classified by a set of qualifiers, as in “all cities” or “this area”. The most complex and challenging part of the grammar definition is the description of relations or prepositions that are possible between entities (e.g., “is going to make landfall” or “which will lay above”).
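As a hedged illustration of the <INTRO>-<REQUEST>-<ENTITY>-<RELATION>-<ENTITY> pattern, the Python sketch below matches one request shape with a toy vocabulary; the word lists and field names are small illustrative samples, not the DAVE_G grammar.

```python
import re

# Toy vocabulary standing in for the grammar's terminal classes.
INTRO    = r"(dave,?)"
REQUEST  = r"(show|create|highlight|select)"
ENTITY   = r"((a |the |all |this )?[\w\s]+?)"
RELATION = r"(around|within|near|in)"

PATTERN = re.compile(
    rf"^{INTRO}\s+{REQUEST}\s+{ENTITY}\s+{RELATION}\s+{ENTITY}$",
    re.IGNORECASE)

def parse_request(utterance: str):
    """Decompose an utterance into the INTRO/REQUEST/ENTITY/RELATION/ENTITY slots."""
    m = PATTERN.match(utterance.strip())
    if not m:
        return None
    return {"intro": m.group(1), "request": m.group(2),
            "entity_1": m.group(3).strip(), "relation": m.group(5),
            "entity_2": m.group(6).strip()}

print(parse_request("Dave, create a one mile buffer around the current surge zone layer"))
```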

Since not all combinations of qualifiers, entities and relations are meaningful, they are further decomposed into subclasses, which ideally would only occur together and thus preserve the inherent semantic meaning of defined sentences and increase precision and reliability of the

Figure 17: Structural and semantic decomposition according to grammar definition.


modeled language. In practice, however, overlapping, semantically or even structurally incorrect sentences are still accepted by the grammar and indeed produced by users. The dialog management will handle such incorrect phrases or sentences in an intelligent way, and maintain a meaningful dialog.

An example of the grammatically structured request language is depicted with the request “Dave, create a one mile buffer around the current surge zone layer” in Figure 17. The headings of the non-terminals (light boxes) represent the non-final stages and abstract definitions of their contained phrases. Headings of darker, final-stage boxes (terminals) represent the actual semantic meaning of their contained words or phrases. They are used as tags for later processing and for identifying semantically related phrases and structures. Such a semantically coded speech grammar helps to disambiguate cases in which several words have different meanings in different semantic contexts. This property is used in the interpretation and natural language processing stage of the dialog management unit.

3.2.2.2. Semantic Interpretation

The use of a grammar like the one applied in our prototype seeks a middle ground between queries that are too stiff and restricted (as is the case when commands are directly implemented in software) and the immense effort of purely semantic modeling and parsing of free natural language. By using semantic knowledge about supported phrases and sentences directly within the speech recognition process, the interpretation of user actions and the identification of matching GIS queries become more feasible. The interpretation process of user actions can be hierarchically divided into two levels. The lower level handles the fusion of all input streams and generates individual user requests as described in the previous section. The upper level makes use of task-specific context and domain knowledge to generate complete and meaningful queries to the GIS database, incorporating commands from all users. This level is embedded in a dialog management unit that resolves ambiguous and conflicting requests and guides the user through the querying process.

A mixed-initiative dialog control in an agent-based approach is chosen that allows for complex communication during problem-solving processes between users and a GIS [131, 132]. The distinctive feature of an agent-based system is that it offers cooperative and helpful assistance to humans in accomplishing their intended tasks. A so-called “GI-agent” reasons about and supports each user’s intentions during collaborative work. A database of previous dialogue interactions (dialog history) is maintained within the knowledge base for dynamic domain context, in order to use it for the interpretation of subsequent user inputs. To create a realistic dialogue, the collaborative GI-agent needs to be domain specific. The GI-agent needs to understand the domain users’ goals, task structures, and general problem-solving strategies. This information is being acquired through iterative cognitive task analyses with domain experts and potential users, as will be described in the following section. Additionally, the agent needs to know about spatial data availability and which procedures for data processing and display are valid (static domain context). Beyond static knowledge processing, knowledge generated during problem-solving processes needs to be integrated and applied dynamically. Combining these different types of knowledge allows cooperative and realistic behavior between the system and the operator during collaborative crisis management situations.

The prototype of DAVE_G is more flexible than traditional master-slave GIS interactions in several ways. First, the prototype allows users to provide partial interaction information without the command being rejected entirely. Second, the system accepts ambiguous commands. In case of unintelligible or incorrect commands, the system is able to question the user for further clarification (e.g., User: “Show population density” System: “Please indicate the area and dataset you want. There is population by block level and population by county”). In a later stage of the project, the GI-agent will also provide visual feedback to users that will maximize the chance of a successful GIS query (e.g., in case of a complex gesture input, such as the outline of a certain area, the system will highlight the matching features of the information database before performing the actual, time-consuming GIS query).
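A minimal slot-filling sketch of this clarification behavior is shown below; the slot names, datasets and prompt wording are hypothetical assumptions, not the DAVE_G dialog manager, which is agent-based and far richer.

```python
# Hypothetical catalog of datasets per theme (illustrative only).
AVAILABLE_DATASETS = {"population": ["population by block level", "population by county"]}

def handle_request(theme=None, area=None, dataset=None):
    """Return either a GIS query description or a clarification prompt."""
    missing = []
    if area is None:
        missing.append("the area of interest")
    if theme in AVAILABLE_DATASETS and dataset is None:
        options = " and ".join(AVAILABLE_DATASETS[theme])
        missing.append(f"the dataset you want (there is {options})")
    if missing:
        return {"type": "clarify", "prompt": "Please indicate " + " and ".join(missing) + "."}
    return {"type": "query", "theme": theme, "area": area, "dataset": dataset}

# User: "Show population density" (no area, no dataset yet) -> clarification prompt
print(handle_request(theme="population"))
# After the user's reply, the query can be completed:
print(handle_request(theme="population", area="county outline from gesture",
                     dataset="population by county"))
```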


3.2.3. Error Resolution using Multimodal Fusion

So far, only speech has been discussed as an input modality of DAVE_G for interacting and communicating with a GIS. Another important source for conveying information (especially geospatial information) is natural hand gestures. The HCI framework applied in XISM allows the recognition of continuous hand gestures such as pointing at a particular location, outlining an area of interest and forming contours. As discussed in section 2, weaknesses of speech recognition can be partially resolved by incorporating complementary information from the gesture cue and vice versa.

Currently, DAVE_G fuses speech and gesture information on a semantic level to resolve spatial references (e.g., “What is the capacity of these shelters?”), in a manner similar to that done in XISM. However, in the much richer semantic domain needed for GIS-enabled collaborative crisis management, the relation between a selecting gesture and a spoken reference such as “these” can no longer be resolved by simple keyword spotting. The meaning of “these” depends on the actual context of discourse and domain. “These” might refer to shelters that were already selected in a previous request (e.g., a previous request: “Dave, highlight all facilities in this area”) and thus, using discourse knowledge, DAVE_G is able to make the correct inference about the specification of “these shelters” and issue correct queries to the GIS. On the other hand, no shelters might have been specified earlier in the discourse, and thus “these” represents an unresolved reference. In the context of discourse, the system cannot find any evidence for a reference to previously used entities and therefore has two subsequent options for how to complete the query. It can search for available information in other cues (e.g., gestures), or prompt the user to specify the missing pieces.

In the first case, DAVE_G has marked “these” as a potential spatial reference and the gesture cue is searched for matching spatial descriptions that can complement the missing information. Depending on the actual spatial reference, different gesture types are favored over others to close the semantic gap of a given request. References like this, that, here are more likely to be used for selecting an individual (point-like) feature on the map, and therefore pointing gestures are preferred. References like these, those, this area are more likely to indicate several features that are spatially distributed over the map, and therefore outlining and contour-describing gestures are favored over pointing gestures.
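The sketch below captures this preference heuristic as a simple lookup; the rankings and gesture labels are illustrative assumptions rather than the rules actually used in DAVE_G.

```python
# Assumed ranking of gesture types per deictic reference (most preferred first).
PREFERENCE = {
    "this":      ["point", "outline", "contour"],
    "that":      ["point", "outline", "contour"],
    "here":      ["point", "outline", "contour"],
    "these":     ["outline", "contour", "point"],
    "those":     ["outline", "contour", "point"],
    "this area": ["outline", "contour", "point"],
}

def pick_gesture(reference, candidate_gestures):
    """Choose the candidate gesture whose type is ranked highest for the
    given spatial reference; candidate_gestures is a list of (type, data)."""
    ranking = PREFERENCE.get(reference, ["point", "outline", "contour"])
    return min(candidate_gestures,
               key=lambda g: ranking.index(g[0]) if g[0] in ranking else len(ranking),
               default=None)

print(pick_gesture("these", [("point", (120, 45)), ("outline", [(100, 40), (140, 60)])]))
```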

A second source for retrieving missing information is to prompt the user to verbally specify missing information (e.g. “the following shelters are available: hospitals, schools and hurricane. Please specify which ones you want.”). Such a dialog is initiated if no other information resources are available to explain missing parameters for a correct data query.

Dividing spatial references into two categories is not exhaustive, because in part they also depend on the given context. However, it serves as an initial solution, and user studies have to be conducted in order to develop a more realistic view of how to resolve spatial references outside and within domain and discourse context. The uni-directional resolving strategy does work to account for missing information in spoken commands using gesture cues. To some degree it also allows the system to disambiguate conflicting interpretations of a user request. In the previous example, it is not completely clear whether “these” refers to shelters previously mentioned in the dialog, or to some new, additional shelters that the user might also be interested in. Here, the presence of a recognized gesture might support the latter hypothesis. However, unless gesture recognition is perfect, the support might only be marginal. Current investigations have shown that a bottom-up approach to gesture/speech fusion on a pure signal level is promising and can lead to more robust gesture recognition, which would be beneficial to spoken command interpretation that relies on gesture cues to disambiguate and correct conflicting requests.

3.2.4. Prototype Design and Usability

The utilization of such a multimodal interface is likely to differ from the standard mouse and keyboard interfaces we are used to. Therefore, special attention has to be paid to the design of such a new HCI, to generate effective user interfaces for multi-user applications in the emergency management domain. In


designing the prototype, we have adopted a cognitive systems engineering approach that involved incorporating domain experts into the earliest stages of system design. See [133, 134] for overviews of cognitive systems engineering and work domain analysis. This approach involved conducting interviews and questionnaires as well as onsite visitations to emergency management operations centers.

First, a set of questionnaires was administered for the domain and task analysis. The questionnaires were sent to 12 emergency managers in Florida, Washington D.C., and Pennsylvania. The objective was to identify GIS-based response activities and operations during disaster events. The participants indicated that a GIS-based emergency response would need to support zoom, pan, buffer, display, and spatial selection of geospatial data. The emergency tasks for which these operations were used included transportation support, search and rescue, environmental protection, and firefighting. In a first step, this allowed us to compile the required GIS functionality into three categories: data query, viewing and drawing.

Second, onsite visitations of emergency operations centers helped to assess realistic scenarios, task distributions and their interconnections [100]. This, in turn, helped to focus the design of DAVE_G on realistic requirements. In particular, a more specific set of gestures could be identified that would be most useful in gathering geospatial information, as well as clusters of similar articulated actions that helped to bootstrap the dialog design and the natural language processing modules.

During further development of DAVE_G we will carry out constant validations to ensure the effectiveness and acceptance of this new interface design for the emergency management domain. In particular, we will conduct usability studies for the current prototype to gain insight into various interface properties such as the naturalness of request utterances, dialog form and interaction feedback, information presentation and visualization, the ease of hand gestures as a form of spatial input, and, last but not least, the overall effectiveness compared to current interfaces to GIS. Further studies and developments towards a realistic multimodal interface to GIS will be carried out on the basis of our findings in these initial studies.

The current multimodal prototype systems DAVE_G and XISM are still basic in nature and have been developed under controlled laboratory conditions, with emphasis on the basic research issues and theoretical foundations they impose on natural HCI development. The main disadvantage of these systems is their limited robustness in realistic environments, in which considerable and unpredictable noise in all input modalities makes their actual application impractical. Little error recovery has been applied to accommodate misrecognition and misinterpretation. To realize robust interfaces in real-world situations, it is crucial for the interface design to research and re-engineer theories in multimodal interaction and to develop robust speech and vision systems that work reliably. This is a non-trivial problem for computer vision and speech recognition-based multimodal systems. While the results presented show excellent promise and can be leveraged for bootstrapping purposes, a fundamental change in research direction is needed. In particular, it is important to research and advance the science of natural, multimodal HCI while addressing issues such as automatic user detection and initialization, more advanced hand and head tracking procedures, robust speech acquisition, audio-visual fusion for improved speech recognition, prosodic-feature-based co-analysis for continuous gesture recognition, and a probabilistic framework for speech/gesture fusion. Ongoing research and long-term research goals for DAVE_G, as well as multimodal interfaces for crisis management in general, will be discussed in the following section.

Part IV: The Future Challenges

The lack of an integrated real-world system that can bring together all the agencies involved in emergency response underscores the technological challenges involved in crisis management. As discussed in the previous section, multimodal human computer interface technologies, pervasive computing, device interoperability, etc., can address these limitations to some extent. In this section, we will discuss our ongoing research efforts and the grand challenges that need to be addressed to realize a practically


feasible collaborative crisis management system. We discuss the scientific and engineering/practical challenges separately in the following subsections.

4.1. Scientific Challenges

To develop a natural collaborative system for crisis management, we believe that the following scientific challenges need to be addressed in the system design:

i. development of cognitive and linguistic theories to support collaborative system design,

ii. real-time sensing, automatic initialization and tracking/management of multiple users,

iii. multimodal fusion (i.e., to improve the recognition and to deal with inexact information),

iv. semantic framework for multimodal and multi-user collaboration, and

v. usability study as well as performance metrics.

Next, we discuss each of them in detail.

4.1.1. Cognitive and Linguistic Theories

A few important steps in realizing a collaborative crisis management system are the development of cognitive theories to guide multimodal system design, and the development of effective natural language understanding, dialogue processing, and error handling techniques. Here, we believe an interdisciplinary approach (including, but not limited to, cognitive systems engineering, distributed cognition, activity theory, and cognitive ergonomics) to learning about the work domain will assist in the development of multimodal design guidelines. Moreover, the use of realistic crisis management scenarios derived from work domain analyses will be critical to designing better computer vision, gesture recognition and dialogue management algorithms. Such scenarios can also assist in the creation of usability standards and performance metrics for multimodal systems.

4.1.2. Real-time Sensing

Automatic initialization and tracking of multiple people is a key challenge for collaborative HCI system design. The most desirable properties of a visual hand and arm tracking system are robustness even in the presence of rapid motion, the ability to track multiple hands or users simultaneously, and the ability to extract 3D information such as pointing direction, all while maintaining real-time performance and automatic initialization capability.

We have investigated model-based approaches to human tracking with special emphasis on model acquisition and initialization (see [135] for details). When tracking multiple people robustly in an unconstrained environment, data association and uncertainty are major problems. Recently, we have been working to develop a framework based on a multiple hypothesis tracking (MHT) algorithm to track multiple people in real-world scenes (see [136] for details). We are also developing a statistical framework to deal with problems associated with multiple people collaborating in the same place.

4.1.3. A Framework for Multimodal Fusion

A general framework for multimodal fusion is crucial in designing a robust collaborative HCI system. We view this fusion as something that can happen at various stages of signal processing, as well as by fusing the decisions made from individual modalities. The key would be to consider each of the input modalities in terms of the others, rather than separately. But a major design problem is the lack of a reference model for how the information from the different sensor streams can be integrated into a consistent description. This is primarily due to the fact that few methods adequately model the complexity of the audio/visual relationship. For example, the issue of how


gesture and speech relate in time is critical for understanding a system that includes gesture and speech as parts of a multimodal expression. The co-occurrence analysis of the weather narration domain [90] revealed that approximately 85% of the time when any meaningful strokes are made, they are accompanied by a spoken keyword, mostly temporally aligned during or after the gesture. This implication was successfully applied in a keyword-level co-occurrence analysis to improve continuous gesture recognition in the previous weather narration study. Apart from the keyword (semantic) level, recent psycholinguistic studies [137] attempted to include spoken prosody, i.e., pauses in a fundamental frequency (F0) contour, extracted from the voice of a gesticulating subject. The study revealed that gestural cues (holds) tend to facilitate the discourse analysis of the speech to identify intonational phrase breaks (pauses). We also believe that it is possible to improve the recognition robustness of user actions (gesture and speech) by taking such cross-modal effects into account.

4.1.3.1 Speech Recognition

The speech recognition software currently used in our existing systems (e.g., XISM and DAVE_G) is IBM ViaVoice, which is speaker dependent and needs to be trained for individual users. Hence, it would be useful to experiment with other speech recognition products. One speaker-independent product is Nuance, which supports 26 different languages with dynamic language detection in real time. Nuance reports a recognition accuracy of 96% in a clean environment, and considerable effort has been made to improve its recognition rate under noisy circumstances such as mobile telephone and teleconferencing applications. However, environments such as crisis management centers usually involve variable noise levels, social interchange, multi-tasking and interruption of tasks, the Lombard effect, accent effects, etc. All these effects combined can lead to a 20%-50% drop in speech recognition accuracy.

An important factor in further improving recognition accuracy is to use all available information from multiple microphones and other modalities. Thus, signal enhancement using speaker and gaze detection, blind source separation and noise filtering will improve the quality of the speech signal significantly. An audio-visual fusion scheme can help in this scenario. State-of-the-art audio-visual speech recognition research (e.g., lip-reading) [138-141] indicates one possible solution to this problem. In similar situations, the use of microphone arrays in combination with visual cues [142] to filter the acoustic streams of several users improves speech recognition significantly. Nevertheless, such approaches require high CPU power and close-up, high-resolution images of the mouth. In contrast, little attention has been paid to less intrusive and lower-cost techniques such as speaker detection [143-145] and low-resolution signal fusion. Speech recognition for HCI systems could be improved significantly if the system were able to decide, in a more intelligent manner, when to turn the microphone on and when to turn it off. More specifically, if situations could be recognized in which the user is either not talking or not talking to the system, and situations in which humans other than the user are talking near the microphone, then a large number of misinterpretations due to noise and other interfering signals (e.g., engine sounds, people talking nearby) could be avoided. This leads to the problem of speaker detection and signal enhancement, which can be done in many ways but has not gained much attention in the past. We plan to use low-level data fusion of audio-visual signals (e.g., the speech signal from microphones and mouth movements of the user) together with a high-level decision maker that fuses beliefs about different possible speaker configurations. We also plan to make use of the strong correlation between visual and auditory signals in two ways. First, we will use it to filter the auditory spectrum to enhance the components that strongly correlate with the visual signal of the user’s mouth motion. Next, we will incorporate the audio cues together with features such as the user’s gaze and mouth motion to determine when the user is speaking to the system.

4.1.3.2. Gesture recognition

The state of the art in continuous gesture recognition is still far from meeting the “naturalness” criteria of a multimodal HCI due to poor recognition rates. Although the accuracy of isolated sign recognition has reached 95%, e.g., [146, 147], the accuracy of continuous gesture recognition in an uncontrolled setting is still very low, e.g., [146, 90, 148]. The existing continuous gesture recognition system in the weather domain has 80% accuracy, and we expect that recognition rates as high as 90% can be achieved


using prosody-based co-analysis of speech and gesture and state-of-the-art multimodal fusion. Multimodal co-analysis of visual gesture and speech signals provides an attractive means of improving continuous gesture recognition. However, the lack of understanding of the cognitive mechanisms of speech and gesture production has restricted implementation of multimodal integration to the semantic level, e.g., [149, 126, 31, 150, 113, 90].

However, using the spoken context in a top-down approach has the additional challenge of dealing with natural language processing. For natural gesticulation, this problem becomes even less tractable since gestures do not exhibit a one-to-one mapping of form to meaning [76]. For instance, the same gesture movement can exhibit different meanings when associated with different spoken contexts, and at the same time, a number of gesture forms can be used to express the same meaning. Though the spoken context is extremely important in the understanding of a multimodal message and cannot be replaced, the complexity (many-to-many mapping) of top-down improvement of gesture recognition makes it appear secondary or complementary. Not surprisingly, the issue of speech processing for natural gesture recognition continues to be a hard problem, e.g., [126, 151], or is applicable only within a well-annotated (limited) context, e.g., [31]. In ongoing research, we introduce a novel framework of gesture and speech co-analysis that is independent of the semantics of the spoken context, e.g., keywords [90]. We consider a bottom-up perspective of improving continuous gesture recognition by co-analyzing a set of prosodic features in speech and hand kinematics. We are exploring a set of prosodic features for co-analysis of hand kinematics and spoken prosody to improve recognition of natural spontaneous gesticulation. We consider co-analysis based on both physiological and articulation phenomena of gesture and speech production. Although it is difficult to formulate two different analyses that uniquely address each phenomenon, we assume that physiological constraints of coverbal production are manifested when the raw acoustic correlate of pitch (F0) is used for a feature-based co-analysis. In contrast, the co-articulation analysis utilizes the notion of co-occurrence with prosodically prominent speech segments.

4.1.4. Semantic Frameworks for Multimodal, Multi-user Collaboration

Collaborative problem solving in crisis management requires the fusion of gesture/speech inputs from multiple participants and the maintenance of a semantic model of collaboration for advancing tasks. Simple grammar-based semantic analysis is clearly inadequate. Plan-based approaches seem to be more appropriate for modeling the dialogue phenomena in a dynamic crisis management context. AI-planning methods use complex plan recognition and plan completion techniques to generate interactive system behavior. However, traditional models based on single-agent plans are not sufficient for modeling multi-agent dialogues. A fruitful direction is to model collaborative dialogues using an agent-based approach where more explicit models of a user’s beliefs, a user’s task structure, related world events and their interactions are maintained. There are two established alternative theoretical foundations for building a dialogue control agent. One is the conversational agency theory based on the BDI (Belief, Desire, and Intention) architecture of Bratman et al. [152]. The other is the collaborative planning framework of SharedPlan [153]. Lochbaum’s work [154] showed that SharedPlan is a more appropriate model for multi-agent dialogues involving complex planning. Rich and Sidner [155] have applied such an approach in building their dialogue agent, Collagen. In the current DAVE_G system development, we are working on extending these works for dialogue management in multimodal, multi-user interactions with critical geospatial information sources.

4.1.5. Usability Study and Performance Metric

Sophisticated methods for formal experimental evaluation are crucial for determining causal relationships between interface characteristics and user performance and behavior. A serious effort is needed to measure the performance of speech-gesture driven multimodal crisis management systems. While some researchers have focused on specific usability aspects of multimodal design (e.g., efficiency in Oviatt et al. [156]), the full range of issues important to the creation of a working multimodal crisis management system has not yet been addressed. We believe that the traditional usability issues in interface design will need to be expanded and adapted for multimodal system design. Moreover, usability frameworks and performance metrics for testing usability issues in multimodal systems must be established.
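As a small illustration of what such metrics might look like, the following Python sketch summarizes a hypothetical interaction log into task completion time, recognition error rate, and correction rate; the log format and metric definitions are assumptions for illustration, not an established standard.

# A minimal sketch of performance metrics a multimodal usability study might
# log and summarize.  Each record: (task_id, seconds_to_complete,
# recognizer_errors, user_corrections) -- all values here are invented.
from statistics import mean

session_log = [
    ("locate-shelters", 42.0, 1, 1),
    ("plot-evacuation-route", 95.5, 3, 2),
    ("show-floodplain", 18.2, 0, 0),
]

def summarize(log):
    return {
        "tasks": len(log),
        "mean_completion_s": mean(t for _, t, _, _ in log),
        "error_rate_per_task": mean(e for _, _, e, _ in log),
        "correction_rate_per_task": mean(c for _, _, _, c in log),
    }

print(summarize(session_log))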

4.2. Engineering and Other Practical Challenges

Apart from the above scientific challenges, there are engineering issues whose consideration may help in system design.

i. Task-specific systems: It is important to develop fully integrated, task-specific systems that take into account multi-level user studies, all aspects of a system's interaction cycle, and integration issues (static and dynamic synchronization) in order to build innovative applications. Tools developed for a task-specific system can then be extended to general system design.

ii. Choice of modality: Since individual input modalities are well suited to some situations and less ideal, or even inappropriate, in others, modality choice is an important design issue in a multimodal system. Careful modality choices are expected to help in situations where certain modalities are not useful.

iii. Multi-user collaboration: New multimodal systems are expected to function more robustly and adaptively, and with support for collaborative multi-person use. It will be crucial to incorporate theoretical foundations from the field of computer supported cooperative work into the design of multi-user multimodal systems.

iv. Framework for characterizing multimodal interaction: There should be a formal framework for characterizing and assessing various aspects of multimodal interaction, for example, the complementarities, assignment, redundancy, and equivalence that may occur between the interaction techniques available in a multimodal user interface.

v. Fusion and fission: Novel aspects of interaction must be considered, such as the fusion and fission of information and the nature of temporal constraints on the interaction. Fusion refers to the combination of several chunks of information to form new chunks; fission refers to the reverse, decomposition phenomenon. In the case of fission, information coming from a single input channel or from a single context may need to be decomposed in order to be understood at a higher level of abstraction. A sketch of temporal fusion is given after this list.

vi. Data acquisition: During a crisis, the situation is always in flux. A changing situation not only requires continuous re-evaluation of rescue and response priorities, but also increases the potential risk for rescue workers. Hence, pervasive sensing technologies that enable non-human monitoring of shifting, dangerous situations can play a key role. Pre-placed sensors capable of wireless data transmission can gather such data continuously, eliminating the need for humans to collect it.

vii. Interoperability of devices: Interoperability across a wide range of devices is critical for a seamless flow of information and communication. The diversity of devices, along with their special GUI needs, data-handling capabilities, and bandwidth requirements, creates a challenge for the multimedia engine in the interoperability architecture.

viii. Training of personnel: Although the human-computer interfaces discussed here are natural and intuitive, it is important to conduct an extensive training and education program for the operators and users of the system. Their feedback can be used (as discussed earlier under Usability Study and Performance Metric) to refine the system.
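The following Python sketch illustrates the temporal fusion mentioned in item v: a spoken command and a deictic gesture chunk are merged into a single new chunk when they fall within a time window. The event representation and the 1.5-second window are illustrative assumptions, not a prescribed fusion architecture.

# A minimal sketch of temporal fusion of speech and gesture chunks: a deictic
# gesture and a spoken command are combined into one interpreted command when
# they occur close together in time.
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str        # "speech" or "gesture"
    content: dict
    t_start: float
    t_end: float

def fuse(chunks, max_gap=1.5):
    """Pair each speech chunk with the closest gesture chunk within `max_gap` seconds."""
    speech = [c for c in chunks if c.modality == "speech"]
    gesture = [c for c in chunks if c.modality == "gesture"]
    fused = []
    for s in speech:
        near = [g for g in gesture if abs(g.t_start - s.t_start) <= max_gap]
        referent = near[0].content if near else {}
        fused.append({**s.content, **referent})      # new chunk formed from two modalities
    return fused

events = [
    Chunk("gesture", {"point_at": (40.79, -77.86)}, 2.1, 2.6),
    Chunk("speech", {"command": "zoom", "object": "here"}, 2.3, 3.0),
]
print(fuse(events))   # -> [{'command': 'zoom', 'object': 'here', 'point_at': (40.79, -77.86)}]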


Part V: Conclusion

This paper discusses the potential role of multimodal speech-gesture interfaces in addressing some of the critical needs of crisis management. Speech-gesture driven interfaces can enable dialogue-assisted information access, easing the need for user training by hiding the complex technologies underlying the information systems used in crisis management. Further, such interfaces can support the collaborative work that is a fundamental aspect of crisis management. However, many challenges must be addressed before multimodal interfaces can be used successfully for crisis management. This paper discusses both these challenges and the progress to date. It describes the evolution of two implemented prototype systems, XISM and DAVE_G, developed as a cooperative effort between Penn State and Advanced Interface Technologies, Inc. Experiments with these systems reinforce the great potential of speech-gesture driven systems for crisis management as well as for other collaborative work.

Acknowledgements

Part of this work is based upon work supported by the National Science Foundation under Grants No. 0113030, IIS-97-33644, and IIS-0081935. The financial support of the U.S. Army Research Laboratory (under Cooperative Agreement No. DAAL01-96-2-0003) is also gratefully acknowledged.

References

[1] S. L. Oviatt, "Mutual disambiguation of recognition errors in a multimodal architecture," Proc. of the Conference on Human Factors in Computing Systems (CHI'99), 1999.
[2] S. L. Oviatt, J. Bernard, and G. Levow, "Linguistic adaptation during error resolution with spoken and multimodal systems," Language and Speech, vol. 41, pp. 415-438, 1999.
[3] A. Rudnicky and A. Hauptmann, "Multimodal interactions in speech systems," in Multimedia Interface Design, R. Dannenberg, Ed. New York: ACM Press, 1992, pp. 147-172.
[4] R. Sharma, V. I. Pavlovic, and T. S. Huang, "Toward multimodal human-computer interface," Proceedings of the IEEE, vol. 86, pp. 853-869, 1998.
[5] S. L. Oviatt, "Ten myths of multimodal interaction," Communications of the ACM, vol. 42, pp. 74-81, 1999.
[6] M. B. Rosson and J. M. Carroll, Usability Engineering: Scenario-Based Development of Human-Computer Interaction. San Francisco, CA: Morgan Kaufmann Publishers, 2002.
[7] B. Buttenfield, "Usability evaluation of digital libraries," Science & Technology Libraries, vol. 17, pp. 39-59, 1999.
[8] C. Davies and D. Medyckyj-Scott, "GIS users observed," International Journal of Geographical Information Systems, vol. 10, pp. 363-384, 1996.
[9] P. Jankowski and T. Nyerges, "GIS-supported collaborative decision-making: Results of an experiment," Annals of the Association of American Geographers, vol. 91, pp. 48-70, 2001.
[10] M. P. Armstrong, "Requirements for the development of GIS-based group decision-support systems," Journal of the American Society for Information Science, vol. 45, pp. 669-677, 1994.
[11] M. D. McNeese, "Discovering how cognitive systems should be engineered for aviation domains: A developmental look at work, research, and practice," in Cognitive Systems Engineering in Military Aviation Environments: Avoiding Cogminutia Fragmentosa, M. Vidulich, Ed. Wright-Patterson Air Force Base, OH: HSIAC Press, 2001.
[12] M. D. McNeese and M. A. Vidulich, Cognitive Systems Engineering in Military Aviation Environments: Avoiding Cogminutia Fragmentosa! Wright-Patterson AFB, OH: Human Systems Information Analysis Center, 2002.

Footnote 2: Dr. David J. Nagel, George Washington University, "Pervasive Sensing," presentation on Integrated Command Environments, MSCMC monthly meeting (July and September 2000).


[13] C. Stary and M. F. Peschl, "Representation still matters: cognitive engineering and user interface design," Behavior and Information Technology, vol. 17, pp. 338-360, 1998.
[14] E. Hutchins, "How a cockpit remembers its speeds," Cognitive Science, vol. 19, pp. 265-288, 1995.
[15] B. A. Nardi, Context and Consciousness: Activity Theory and Human-Computer Interaction. Cambridge, MA: MIT Press, 1996.
[16] F. Descortis, S. Noirfalise, and B. Saudelli, "Activity theory, cognitive ergonomics and distributed cognition: three views of a transport company," Int. J. Human-Computer Studies, vol. 53, pp. 5-33, 2000.
[17] M. J. Egenhofer, "Query processing in spatial-query-by-sketch," Journal of Visual Languages and Computing, vol. 8, pp. 403-424, 1997.
[18] G. Fischer, "Articulating the task at hand by making information relevant to it," Human-Computer Interaction (special issue on context-aware computing), vol. 16, pp. 243-256, 2001.
[19] D. R. McGee, P. R. Cohen, R. M. Wesson, and S. Horman, "Comparing paper and tangible, multimodal tools," Proc. of the SIGCHI Conference on Human Factors in Computing Systems: Changing Our World, Changing Ourselves, Minneapolis, Minnesota, USA, 2002.
[20] S. L. Oviatt, "Multimodal Interactive Maps: Designing for Human Performance," Human-Computer Interaction, vol. 12, pp. 93-129, 1997.
[21] R. Sharma, I. Poddar, and S. Kettebekov, "Recognition of natural gestures for multimodal interactive map (iMAP)," Proc. of the 2000 Advanced Display Federated Laboratory Symposium, Adelphi, MD, 2000.
[22] J. E. McGrath, Groups: Interaction and Performance. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[23] A. M. MacEachren and I. Brewer, "Developing a conceptual framework for visually-enabled geocollaboration," International Journal of Geographical Information Science, in press.
[24] I. Brewer, A. M. MacEachren, H. Abdo, J. Gundrum, and G. Otto, "Collaborative Geographic Visualization: Enabling shared understanding of environmental processes," Proc. of the IEEE Information Visualization Symposium, Salt Lake City, Utah, Oct. 9-10, 2000.
[25] R. Bolt, "Put-that-there: voice and gesture at the graphics interface," Computer Graphics, vol. 14, pp. 262-270, 1980.
[26] V. W. Zue and J. R. Glass, "Conversational Interfaces: Advances and Challenges," Proceedings of the IEEE, vol. 88, pp. 1166-1180, 2000.
[27] L. B. Larsen, M. D. Jensen, and W. K. Vodzi, "Multi Modal User Interaction in an Automatic Pool Trainer," Proc. of the Fourth International Conference on Multimodal Interfaces (ICMI), Pittsburgh, PA, USA, 2002.
[28] J. Mostow and G. Aist, "When Speech Input is Not an Afterthought: A Reading Tutor that Listens," Proc. of the Workshops on Perceptual/Perceptive User Interfaces (PUI), Banff, Canada, 1997.
[29] B. Clarkson, N. Sawhney, and A. Pentland, "Auditory Context Awareness via Wearable Computing," Proc. of the Workshops on Perceptual/Perceptive User Interfaces (PUI), Banff, Canada, 1997.
[30] P. Kakumanu, R. Gutierrez-Osuna, A. Esposito, R. Bryll, A. Goshtasby, and O. N. Garcia, "Speech Driven Facial Animation," Proc. of the Workshops on Perceptual/Perceptive User Interfaces (PUI), Orlando, FL, 2001.
[31] N. Krahnstoever, S. Kettebekov, M. Yeasin, and R. Sharma, "A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays," Proc. of the International Conference on Multimodal Interfaces, Pittsburgh, PA, USA, 2002.
[32] T. Brøndsted, L. Larsen, M. Manthey, P. McKevitt, T. Moeslund, and K. Olesen, "The Intellimedia WorkBench - an environment for building multimodal systems," Proc. of the Second International Conference on Cooperative Multimodal Communication, Theory and Applications, Tilburg, 1998.
[33] H. F. Silverman, W. R. Patterson, and J. L. Flanagan, "The huge microphone array," IEEE Concurrency, pp. 36-46, 1998.
[34] K. Wilson, V. Rangarajan, N. Checka, and T. Darrell, "Audiovisual arrays for untethered spoken interfaces," Proc. of the Fourth International Conference on Multimodal Interfaces (ICMI), Pittsburgh, PA, USA, 2002.
[35] S. Chatty and P. Lecoanet, "Pen Computing for Air Traffic Control," Proc. of the Conference on Human Factors in Computing Systems (CHI '96), Vancouver, BC, Canada, 1996.
[36] S. L. Oviatt and R. vanGent, "Error resolution during multimodal human-computer interaction," Proc. of the International Conference on Spoken Language Processing, Philadelphia, 1996.
[37] A. Meyer, "PEN COMPUTING - A Technology Overview and a Vision," ACM SIGCHI, 1995.
[38] T. Baudel and M. Beaudouin-Lafon, "Charade: Remote Control of Objects Using Free-Hand Gestures," Communications of the ACM, vol. 36, pp. 28-35, 1993.


[39] K. Böhm, W. Broll, and M. Sokolewicz, "Dynamic Gesture Recognition Using Neural Networks; A Fundament for Advanced Interaction Construction," Proc. of the IS&T/SPIE Symposium on Electronic Imaging: Science & Technology 1994 (EI 94), San José, 1994.
[40] S. S. Fels and G. E. Hinton, "Glove-Talk: A Neural Network Interface Between a Data-Glove and a Speech Synthesizer," IEEE Transactions on Neural Networks, vol. 4, pp. 2-8, Jan. 1993.
[41] D. L. Quam, "Gesture Recognition with a DataGlove," in Proceedings of the 1990 IEEE National Aerospace and Electronics Conference, vol. 2, 1990.
[42] D. J. Sturman and D. Zeltzer, "A Survey of Glove-based Input," IEEE Computer Graphics and Applications, vol. 14, pp. 30-39, Jan. 1994.
[43] C. Wang and D. J. Cannon, "A Virtual End-Effector Pointing System in Point-and-Direct Robotics for Inspection of Surface Flaws Using a Neural Network Based Skeleton Transform," in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 3, May 1993, pp. 784-789.
[44] V. I. Pavlovic, R. Sharma, and T. S. Huang, "Visual interpretation of hand gestures for human-computer interaction: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 677-695, July 1997.
[45] B. Moghaddam and A. Pentland, "Maximum likelihood detection of faces and hands," pp. 122-128, June 1995.
[46] M. Yeasin and Y. Kuniyoshi, "Detecting and tracking human face using a space-varying sensor and an active head," Proc. of Computer Vision and Pattern Recognition, South Carolina, USA, 2000.
[47] Z. Zhang, L. Zhu, S. Z. Li, and H. Zhang, "Real-Time Multi-View Face Detection," Proc. of the International Conference on Automatic Face and Gesture Recognition, Washington, D.C., USA, 2002.
[48] D. M. Gavrila, "The Visual Analysis of Human Movement: A Survey," Computer Vision and Image Understanding, vol. 73, pp. 82-98, 1999.
[49] T. B. Moeslund and E. Granum, "A Survey of Computer Vision-Based Human Motion Capture," Computer Vision and Image Understanding, vol. 81, pp. 231-268, 2001.
[50] C. Bregler and J. Malik, "Tracking People with Twists and Exponential Maps," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[51] T. Drummond and R. Cipolla, "Real-time tracking of highly articulated structures in the presence of noisy measurements," in Proc. International Conference on Computer Vision, Vancouver, Canada, 2001.
[52] D. M. Gavrila and L. S. Davis, "3D model-based tracking of humans in action: a multi-view approach," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1996, pp. 73-80.
[53] L. Goncalves, E. D. Bernardo, E. Ursella, and P. Perona, "Monocular tracking of the human arm in 3D," in Proc. International Conference on Computer Vision, 1995, pp. 764-770.
[54] Y. Guo, G. Xu, and S. Tsuji, "Understanding human motion patterns," in Proc. 12th International Conference on Pattern Recognition: Computer Vision and Image Processing, vol. 2, 1994, pp. 325-329.
[55] N. R. Howe, M. E. Leventon, and W. T. Freeman, "Bayesian Reconstruction of 3D human motion from a single-camera video," October 1999.
[56] I. A. Kakadiaris and D. Metaxas, "Model-Based Estimation of 3D Human Motion with Occlusion Based on Active Multi-Viewpoint Selection," in Proc. IEEE Computer Vision and Pattern Recognition Conference, San Francisco, CA, June 1996, pp. 81-87.
[57] D. D. Morris and J. M. Rehg, "Singularity Analysis for Articulated Object Tracking," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998.
[58] J. O'Rourke and B. U. Badler, "Model-based image analysis of human motion using constraint propagation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 522-536, 1980.
[59] V. I. Pavlovic, J. Rehg, C. Tat-Jen, and K. Murphy, "A dynamic Bayesian network approach to figure tracking using learned dynamic models," in Seventh IEEE International Conference on Computer Vision, vol. 1, 1999, pp. 94-101.
[60] H. Sidenbladh and M. J. Black, "Learning Image Statistics for Bayesian Tracking," in Proc. IEEE International Conference on Computer Vision, Vancouver, vol. 2, July 2001, pp. 709-716.
[61] A. Blake, B. Bascle, M. Isard, and J. MacCormick, "Statistical models of visual shape and motion," Phil. Trans. R. Soc. Lond. A, 1998.
[62] M. Isard and A. Blake, "CONDENSATION - Conditional Density Propagation for Visual Tracking," International Journal of Computer Vision, vol. 29, pp. 5-28, 1998.
[63] M. Kass and A. Witkin, "Snakes: Active Contour Models," International Journal of Computer Vision, pp. 321-331, 1988.


[64] H. Jin, P. Favaro, and S. Soatto, "Real-Time Feature Tracking and Outlier Rejection with Changes in Illumination," in Proc. International Conference on Computer Vision, Vancouver, Canada, 2001.
[65] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'94), Seattle, Washington, June 1994.
[66] C. Tomasi and T. Kanade, "Detection and Tracking of Point Features," Technical Report CMU-CS-91-132, April 1991.
[67] Q. Zheng and R. Chellappa, "Automatic feature point extraction and tracking in image sequences for arbitrary camera motion," International Journal of Computer Vision, vol. 15, pp. 31-76, 1995.
[68] S. Birchfield, "Elliptical head tracking using intensity gradients and color histograms," IEEE Conference on Computer Vision and Pattern Recognition, pp. 232-237, 1998.
[69] H. Stern and B. Efros, "Adaptive Color Space Switching for Face Tracking in Multi-Colored Lighting Environments," Proc. of the International Conference on Automatic Face and Gesture Recognition, Washington, D.C., USA, 2002.
[70] A. Wu, M. Sha, and N. d. V. Lobo, "A Virtual 3D Blackboard: 3D finger tracking using a single camera," in Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 536-542.
[71] M. J. Black and A. Jepson, "EigenTracking: Robust matching and tracking of articulated objects using a view-based representation," in Proc. 4th European Conference on Computer Vision (ECCV'96), Part I, Cambridge, UK, 1996.
[72] M. Gleicher, "Projective registration with difference decomposition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1996.
[73] G. Hager and P. Belhumeur, "Real-time tracking of image regions with changes in geometry and illumination," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1996, pp. 403-410.
[74] A. Lipton, H. Fujiyoshi, and R. Patil, "Moving target classification and tracking from real-time video," in Proc. Fourth IEEE Workshop on Applications of Computer Vision (WACV '98), 1998, pp. 8-14.
[75] V. I. Pavlovic, R. Sharma, and T. S. Huang, "Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, July 1997.
[76] D. McNeill, Hand and Mind: What Gestures Reveal about Thought. Chicago, IL: The University of Chicago Press, 1992.
[77] S. Kettebekov and R. Sharma, "Understanding gestures in multimodal human computer interaction," International Journal on Artificial Intelligence Tools, vol. 9, pp. 205-224, 2000.
[78] K. Murakami, H. Taguchi, S. P. Robertson, G. M. Olson, and J. S. Olson, "Gesture Recognition using Recurrent Neural Networks," in Human Factors in Computing Systems, Reaching Through Technology (CHI), 1991.
[79] M. Assan and K. Grobel, "Video-Based Sign Language Recognition using Hidden Markov Models," in Gesture and Sign Language in Human-Computer Interaction, Intl. Gesture Workshop, Bielefeld, Germany, Sept. 1997.
[80] I. Poddar, "Continuous Recognition of Natural Hand Gestures for Human Computer Interaction," 1999.
[81] I. Poddar, Y. Sethi, E. Ozyildiz, and R. Sharma, "Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration," in Proc. Workshop on Perceptual User Interfaces (PUI'98), November 1998, pp. 1-6.
[82] L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.
[83] J. Schlenzig, E. Hunter, and R. Jain, "Recursive identification of gesture inputs using hidden Markov models," in Proceedings of the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL, December 5-7, 1994, pp. 187-194.
[84] R. Sharma, I. Poddar, E. Ozyildiz, S. Kettebekov, H. Kim, and T. S. Huang, "Toward Interpretation of natural speech/gesture for spatial planning on a virtual map," in Proceedings of the Advanced Display Federated Laboratory Symposium, Adelphi, MD, February 1999, pp. 35-39.
[85] T. E. Starner and A. Pentland, "Visual Recognition of American Sign Language Using Hidden Markov Models," in International Workshop on Automatic Face- and Gesture-Recognition (IWAFGR'95), June 1995.
[86] A. Wilson and A. Bobick, "Parametric hidden Markov models for gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, pp. 884-900, September 1999.
[87] S. J. Young, N. H. Russell, and J. H. S. Thornton, "Token Passing: a Conceptual Model for Connected Speech Recognition," ftp://svr-ftp.eng.cam.ac.uk, 1989.
[88] M. Padmanabhan and M. Picheny, "Large-Vocabulary Speech Recognition Algorithms," IEEE Computer, vol. 35, pp. 42-50, 2002.
[89] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 2.1). Cambridge University, 1995.


[90] R. Sharma, J. Cai, S. Chakravarthy, I. Poddar, and Y. Sethi, "Exploiting Speech/Gesture Co-occurrence for Improving Continuous Gesture Recognition in Weather Narration," Proc. of the International Conference on Face and Gesture Recognition (FG'2000), Grenoble, France, 2000.
[91] M. F. McTear, "Spoken dialogue technology: enabling the conversational user interface," ACM Computing Surveys, vol. 34, pp. 90-169, 2002.
[92] J. Allen, Natural Language Understanding, 2nd ed. Redwood City, CA: Benjamin/Cummings Publishing Company, 1995.
[93] R. C. Moore, "Integration of speech and natural language understanding," in Voice Communication Between Humans and Machines, J. Wilpon, Ed. Washington, DC: National Academy Press, 1995, pp. 254-271.
[94] B. J. Grosz, A. K. Joshi, and S. Weinstein, "Providing a unified account of definite noun phrases in discourse," Proc. of the 21st Annual Meeting of the Association for Computational Linguistics, Boston, MA, 1983.
[95] B. J. Grosz and C. L. Sidner, "Attention, intentions, and the structure of discourse," Computational Linguistics, vol. 12, pp. 175-204, 1986.
[96] J. Allen, D. Byron, M. Dzikovska, G. Ferguson, L. Galescu, and A. Stent, "Towards conversational human-computer interaction," AI Magazine, vol. 22, pp. 27-37, 2001.
[97] S. Oviatt, "Taming Speech Recognition Errors Within a Multimodal Interface," Communications of the ACM, vol. 43, no. 9, pp. 45-51, 2000.
[98] B. Shneiderman, Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd ed. Reading, MA: Addison-Wesley, 1998.
[99] J. L. Gabbard, D. Hix, and J. E. Swan II, "User-centered design and evaluation of virtual environments," IEEE Computer Graphics and Applications, 1999.
[100] I. Brewer, "Cognitive Systems Engineering and GIScience: Lessons Learned from a Work Domain Analysis for the Design of a Collaborative, Multimodal Emergency Management GIS," Proc. of GIScience 2002, Boulder, Colorado, USA, 2002.
[101] ISO, "Report Number ISO/TC 159/SC4/WG3 N147: Ergonomic requirements for office work with visual display terminals (VDTs) - Part 9: Requirements for non-keyboard input devices (ISO 9241-9)," International Organisation for Standardisation, 1998.
[102] I. S. MacKenzie and A. Oniszczak, "A comparison of three selection techniques for touchpads," Proc. of the ACM Conference on Human Factors in Computing Systems (CHI '98), 1998.
[103] P. M. Fitts, "The Information Capacity of the Human Motor System in Controlling the Amplitude of Movement," Journal of Experimental Psychology, vol. 47, pp. 381-391, 1954.
[104] A. M. MacEachren, How Maps Work: Representation, Visualization and Design. New York: Guilford Press, 1995.
[105] A. M. MacEachren, "Cartography and GIS: facilitating collaboration," Progress in Human Geography, vol. 24, pp. 445-456, 2000.
[106] T. A. Slocum, C. Blok, B. Jiang, A. Koussoulakou, D. R. Montello, S. Fuhrmann, and N. R. Hedley, "Cognitive and Usability Issues in Geovisualization," Cartography and Geographic Information Science, vol. 28, pp. 61-75, 2001.
[107] J. Nielsen, "Heuristic evaluation," in Usability Inspection Methods, R. L. Mack, Ed. New York, NY: John Wiley & Sons, 1994.
[108] R. A. Bolt, "Conversing with Computers," in Human-Computer Interaction: A Multidisciplinary Approach, A. S. Buxton, Ed. California: Morgan Kaufmann Publishers Inc., 1987.
[109] D. Weimer and S. K. Ganapathy, "A Synthetic Visual Environment with Hand Gesturing and Voice Input," Proc. of Human Factors in Computing Systems (CHI'89), 1989.
[110] M. W. Salisbury, J. H. Hendrickson, T. L. Lammers, C. M. Fu, and S. A., "Talk and Draw: Bundling Speech and Graphics," Computer, vol. 23, pp. 59-65, 1990.
[111] P. Maes, T. Darrell, B. Blumberg, and A. Pentland, "The ALIVE system: Full-body Interaction with Autonomous Agents," Proc. of Computer Animation '95, Geneva, Switzerland, April 1995.
[112] A. D. Angeli, W. Gerbino, G. Cassano, and D. Petrelli, "Visual display, pointing, and natural language: The power of multimodal interaction," Proc. of Advanced Visual Interfaces, L'Aquila, Italy, 1998.
[113] S. Oviatt, "Multimodal interfaces for dynamic interactive maps," Proc. of the Conference on Human Factors in Computing Systems (CHI'96), 1996.


[114] L. Stifelman, B. Arons, and C. Schmandt, "The Audio Notebook: Paper and Pen Interaction with Structured Speech," Proc. of Computer Human Interaction (CHI'01), 2001.
[115] P. R. Cohen, M. Johnston, D. McGee, S. L. Oviatt, J. Pittman, I. Smith, L. Chen, and J. Clow, "QuickSet: Multimodal interaction for distributed applications," Proc. of the Fifth ACM International Multimedia Conference, 1997.
[116] H. Fell, H. Delta, R. Peterson, L. Ferrier, Z. Mooraj, and M. Valleau, "Using the baby-babble-blanket for infants with motor problems," Proc. of the Conference on Assistive Technologies (ASSETS'94), Marina del Rey, CA, 1994.
[117] K. A. Papineni, S. Roukos, and R. T. Ward, "Feature-based language understanding," Proc. of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, 1997.
[118] T. Holzman, "Computer-human interface solutions for emergency medical care," Interactions, vol. 6, pp. 13-24, 1999.
[119] L. Duncan, W. Brown, C. Esposito, H. Holmback, and P. Xue, "Enhancing virtual maintenance environments with speech understanding," Boeing M&CT TechNet, 1999.
[120] M. Lucente, "Visualization Space: A Testbed for Deviceless Multimodal User Interface," Computer Graphics, vol. 31, 1997.
[121] J. M. Rehg, M. Loughlin, and K. Waters, "Vision for a Smart Kiosk," Computer Vision and Pattern Recognition, pp. 690-696, 1997.
[122] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafer, "EasyLiving: Technologies for intelligent environments," Proc. of the Second International Symposium on Handheld and Ubiquitous Computing (HUC), Bristol, UK, 2000.
[123] S. Kettebekov, N. Krahnstoever, M. Leas, E. Polat, H. Raju, E. Schapira, and R. Sharma, "i2Map: Crisis Management using a Multimodal Interface," in Proceedings of the ARL Federated Laboratory 4th Annual Symposium, 2000.
[124] M. H. Hayes, Statistical Digital Signal Processing and Modeling. John Wiley & Sons, Inc., 1996.
[125] A. Kendon, Conducting Interaction. Cambridge: Cambridge University Press, 1990.
[126] S. Kettebekov and R. Sharma, "Toward Natural Gesture/Speech Control of a Large Display," in Engineering for Human-Computer Interaction, vol. 2254, Lecture Notes in Computer Science, L. Nigay, Ed. Berlin, Heidelberg, New York: Springer-Verlag, 2001, pp. 133-146.
[127] I. Poddar, "Continuous Recognition of Deictic Gestures for Multimodal Interfaces," The Pennsylvania State University, 1999.
[128] N. Krahnstoever, S. Kettebekov, M. Yeasin, and R. Sharma, "A Real-Time Framework for Natural Multimodal Interaction with Large Screen Displays," Dept. of Computer Science and Engineering, Pennsylvania State University, 220 Pond Lab, University Park, PA, Technical Report CSE-02-010, May 2002.
[129] I. Brewer, "Cognitive Systems Engineering and GIScience: Lessons learned from a work domain analysis for the design of a collaborative, multimodal emergency management GIS," Proc. of GIScience 2002, Boulder, CO, 2002.
[130] I. Rauschert, P. Agrawal, S. Fuhrmann, I. Brewer, H. Wang, R. Sharma, G. Cai, and A. M. MacEachren, "Designing a Human-Centered, Multimodal GIS Interface to Support Emergency Management," Proc. of the 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, Virginia, USA, 2002.
[131] N. Lesh, C. Rich, and C. L. Sidner, "Using Plan Recognition in Human-Computer Collaboration," 1999.
[132] I. Lokuge and S. Ishizaki, "GeoSpace: An interactive visualization system for exploring complex information spaces," Proc. of CHI'95, 1995.
[133] J. Rasmussen, A. M. Pejtersen, and L. P. Goodstein, Cognitive Systems Engineering. New York: Wiley, 1994.
[134] K. J. Vicente, Cognitive Work Analysis: Toward Safe, Productive, and Healthy Computer-Based Work. Mahwah, New Jersey: Lawrence Erlbaum Associates, 1999.
[135] N. Krahnstoever, M. Yeasin, and R. Sharma, "Automatic Acquisition and Initialization of Articulated Models," Machine Vision and Applications, to appear, 2002.
[136] E. Polat, M. Yeasin, and R. Sharma, "Tracking Body Parts of Multiple People: A New Approach," in IEEE Workshop on Multi-Object Tracking at ICCV, 2001, pp. 35-42.
[137] F. Quek, D. McNeill, B. Bryll, S. Duncan, X. Ma, C. Kirbas, K.-E. McCullough, and R. Ansari, "Gesture and speech cues for conversational interaction," submitted to ACM Transactions on Computer-Human Interaction; also VISLab Report VISLab-01-01, 2001.


[138] K. W. Grant, B. E. Walden, and P. F. Seitz, "Auditory-visual speech recognition by hearing-impaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration," J. Acoust. Soc. Am., vol. 103, pp. 2677-2690, 1998.
[139] J. Luettin and S. Dupont, "Continuous Audio-Visual Speech Recognition," Proc. of the 5th European Conference on Computer Vision, 1998.
[140] I. Matthews, J. A. Bangham, R. Harvey, and S. Cox, "Lipreading from shape shading and scale," Proc. of Auditory-Visual Speech Processing (AVSP), Terrigal, 1998.
[141] U. Meier, R. Stiefelhagen, J. Yang, and A. Waibel, "Towards unrestricted lipreading," Proc. of the Second International Conference on Multimodal Interfaces (ICMI'99), 1999.
[142] N. Matsuo, H. Kitagawa, and S. Nagata, "Speaker Position Detection System using Audio-visual Information," FUJITSU Sci. Tech. J., vol. 35, pp. 212-220, 1999.
[143] C. Neti, P. de Cuetos, and A. Senior, "Audio-visual intent-to-speak detection for human-computer interaction," Proc. of ICASSP, Istanbul, Turkey, 2000.
[144] R. Cutler and L. Davis, "Look Who's Talking: Speaker Detection Using Video and Audio Correlation," Proc. of the IEEE International Conference on Multimedia and Expo, New York, 2000.
[145] G. Iyengar and C. Neti, "A Vision-based Microphone Switch for Speech Intent Detection," Proc. of the IEEE Workshop on Real-Time Analysis of Face and Gesture (RTAFG), Vancouver, BC, Canada, 2001.
[146] M. Assan and K. Grobel, "Video-Based Sign Language Recognition Using Hidden Markov Models," in Gesture and Sign Language in Human-Computer Interaction, vol. LNAI 1371, M. Fröhlich, Ed., 1997, pp. 97-111.
[147] G. Rigoll, A. Kosmala, and S. Eickeler, "High Performance Real-Time Gesture Recognition Using Hidden Markov Models," in Gesture and Sign Language in Human-Computer Interaction, vol. LNAI 1371, M. Fröhlich, Ed., 1997, pp. 69-80.
[148] M. Yeasin and S. Chaudhuri, "Visual understanding of dynamic hand gestures," Pattern Recognition, vol. 33, pp. 1805-1817, 2000.
[149] C. Benoit, J. C. Martin, C. Pelachaud, L. Schomaker, and B. Suhm, "Audio-Visual and Multimodal Speech Systems," in Handbook of Standards and Resources for Spoken Language Systems, D. Gibbon, Ed., 1998.
[150] Y. Nam, D. Thalmann, and K. Wohn, "A Hybrid Framework for Modelling Comprehensible Human Gesture," Proc. of the International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA '99), 1999.
[151] T. Sowa and I. Wachsmuth, "Coverbal iconic gestures for object descriptions in virtual environments: An empirical study," Proc. of the Conference on Gestures: Meaning and Use, Porto, Portugal, 1999.
[152] M. Bratman, D. Israel, and M. Pollack, "Plans and resource-bounded practical reasoning," Computational Intelligence, vol. 4, pp. 349-355, 1988.
[153] B. J. Grosz and S. Kraus, "Collaborative plans for complex group action," Artificial Intelligence, vol. 86, pp. 269-357, 1996.
[154] K. E. Lochbaum, "A collaborative planning model of intentional structure," Computational Linguistics, vol. 24, pp. 525-572, 1998.
[155] C. Rich and C. Sidner, "COLLAGEN: A collaboration manager for software interface agents," User Modeling and User-Adapted Interaction, vol. 8, pp. 315-350, 1998.
[156] S. Oviatt, A. D. Angeli, and K. Kuhn, "Integration and synchronization of input modes during multimodal human-computer interaction," Proc. of the Conference on Human Factors in Computing Systems (CHI'97), 1997.