TRANSCRIPT
Speech & Gesture Recognition Systems
Andreas Farner
Seminar Human Computer Interaction, 18.06.09
Overview
1 Multimodal Interfaces
2 Put That There
3 Finite-state Multimodal Parsing and Understanding
Multimodal Interfaces
Systems that allow input and/or output to be conveyed over multiple different channels
Making the system produce a specific output based on the voice and gesture input of the user
Introduction Commands Technologies Summary
Put That There: Voice and Gesture at the Graphics Interface
Richard A. Bolt
1980
Overview
1 Introduction
2 Commands
3 Technologies
4 Summary
Introduction
Groundbreaking paper from 1980
Approach to combined voice-input and gesture-recognition
Practical use: arranging things
Setting: the Media Room at the Massachusetts Institute of Technology (MIT)
The Media Room
The MIT Media Room I
The Media Room
The MIT Media Room II
Combines the virtual “Dataland” and the physical space → one interactive space
Allows the user to act naturally
The Media Room
Objects
Simple basic shapes
Circles
Squares
Diamonds
Variable attributes:
Color
Size (large, medium, small)
Virtual object does not represent the real shape of the object
Controls
Navigating through “Dataland”
No keyboard needed
Joysticks and touch-pads
Navigating in a helicopter-like manner
Moving the “you are here” marker with the right-hand joystick or the touch-pad on the TV screen
Zooming in and out with the left-hand joystick
Moving the marker is also possible by pointing
Controls
Put That There
Overview: Basic Commands
Create
Move
Make that smaller/bigger/like that
Delete
Naming
Create
Create
“Create a blue circle here.”
Default size: medium
Color and shape must be given
A deictic term like “here”/“there” is combined with the x,y pointing input
Make That Smaller/Bigger/Like That
Smaller I
“Make the blue circle smaller.”
Make That Smaller/Bigger/Like That
Smaller II
“Make that smaller.”
Make That Smaller/Bigger/Like That
Smaller Result
“Make the blue circle smaller.”
“Make that smaller.”
Make That Smaller/Bigger/Like That
Make That
“Make that a large red diamond.”
Delete
Everything
“Delete everything to the left of this.”
Naming
Naming
“Call that...the calendar”
“Call that” → record x,y coordinates and switch to training mode
“...” → pause needed to switch between the modes
“the calendar” → associate the coordinates with “the calendar” and switch back to recognition mode
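The three-step naming protocol above can be sketched as a tiny mode state machine. This is an illustrative sketch, not Bolt's implementation; the class and state names are invented:

```python
# Sketch of the naming protocol: "call that" records the pointing
# coordinates and enters training mode; the next phrase is stored as
# a name for those coordinates (state names are invented).
class Namer:
    def __init__(self):
        self.mode = "recognition"
        self.names = {}          # name -> recorded x,y coordinates
        self._pending = None

    def hear(self, phrase, xy=None):
        if self.mode == "recognition" and phrase == "call that":
            self._pending = xy   # record the x,y pointing input
            self.mode = "training"
        elif self.mode == "training":
            self.names[phrase] = self._pending
            self.mode = "recognition"

n = Namer()
n.hear("call that", xy=(120, 45))  # the pause falls between the two utterances
n.hear("the calendar")
print(n.names)  # {'the calendar': (120, 45)}
```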
Speech
Speech Recognizers
Two types of speech recognizers:
Discrete or isolated utterances: requires a pause between words → unnatural
Connected speech: no pause necessary, up to 5 words → more natural
Speech
Used Speech Recognizer
Response time: 300 milliseconds
Output: display of the text on an alphanumeric visual display
Vocabulary: 120 words as word reference patterns
Space
Pointing
Based on measurements made of a nutating magnetic field
Two small plastic cubes, each containing three coils (one for each axis):
1 A transmitter cube (3.8 cm on edge)
2 A sensor cube (1.9 cm on edge) attached to a wristband of the user
Together the coils act as antennas
Summary
More than a virtual desktop; example applications:
Moving ships about a harbor map
Moving battalion formations
Facilities planning
Groundbreaking ⇒ the first approach to combine speech and gesture input
May seem simple, but remember: it is from 1980
Put That There remains rather superficial
Another approach from 2000:
“Finite-state Multimodal Parsing and Understanding”
→ How is multimodal input parsed?
Introduction Finite-state Language Processing Finite-state Multimodal Grammars Applying Multimodal Transducers Conclusion
Finite-state Multimodal Parsing and Understanding
Michael Johnston & Srinivas Bangalore
2000
Contents
1 Introduction
2 Finite-state Language Processing
3 Finite-state Multimodal Grammars
4 Applying Multimodal Transducers
5 Conclusion
Multimodal Interfaces
Multimodal interfaces require effective parsing and understanding
General applicability required
Using one single finite-state device
Important concern in the ongoing migration of interaction from the desktop to wireless devices such as PDAs, next-generation phones, etc.
Finite-state Automaton (FSA)
Parsing, understanding and integration of speech and pen input performed by one device
Running on three tapes:
1 Speech input
2 Pen input
3 Combined interpretation
Example
Finite-state Transducers
Are finite-state automata (FSA)
Each transition consists of an input and an output symbol
Can be regarded as a two-tape FSA with an input tape and an output tape
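Such a two-tape machine can be sketched in a few lines: each transition consumes one input symbol and emits one output symbol. This is a minimal illustrative sketch (the class, the toy transitions, and the epsilon-as-empty-string convention are assumptions, not the authors' implementation):

```python
# A minimal finite-state transducer: transitions map
# (state, input_symbol) -> (output_symbol, next_state).
class FST:
    def __init__(self, start, finals, transitions):
        self.start = start            # initial state
        self.finals = finals          # set of accepting states
        self.transitions = transitions

    def transduce(self, tape):
        """Run the input tape through the machine; return the output
        tape, or None if the input is not accepted."""
        state, out = self.start, []
        for sym in tape:
            if (state, sym) not in self.transitions:
                return None
            out_sym, state = self.transitions[(state, sym)]
            if out_sym:               # '' models an epsilon output
                out.append(out_sym)
        return out if state in self.finals else None

# Toy transitions loosely modeled on the example grammar later on.
fst = FST(0, {0}, {
    (0, "email"):  ("email([", 0),
    (0, "this"):   ("", 0),
    (0, "person"): ("person(", 0),
})
print(fst.transduce(["email", "this", "person"]))  # ['email([', 'person(']
```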
Finite-state Models
Attractive mechanisms for language processing
Efficiently learnable from data
Generally effective for decoding
Allow straightforward integration of constraints from various levels of language processing
Enable tight integration of language processing with speech and gesture recognition
Frege’s Principle
“The meaning of a complex expression is
determined by the meanings of its constituent
expressions and the rules used to combine them.”
Parsing Multiple Input Streams
Speech and pen input require three tapes: one for speech, one for pen input, and a third for their combined meaning
A finite-state device combines the content of multiple input streams into a single semantic representation
An interface with n modes requires n + 1 tapes
The first n tapes represent the input streams
Tape n + 1 is an output stream representing their composition
Multimodal Context-free Grammar
Non-terminals are atomic symbols
Each terminal contains three components W:G:M corresponding to the n + 1 tapes:
1 W: Words
2 G: Gestures
3 M: Meaning
ǫ means that the component is empty in this terminal
Example Grammar
S → V NP ǫ:ǫ:])
NP → DET N
NP → DET N CONJ NP
CONJ → and:ǫ:,
V → email:ǫ:email([
V → page:ǫ:page([
DET → this:ǫ:ǫ
DET → that:ǫ:ǫ
N → person:Gp:person( ENTRY
N → organization:Go:org( ENTRY
N → department:Gd:dept( ENTRY
ENTRY → ǫ:e1:e1 ǫ:ǫ:)
ENTRY → ǫ:e2:e2 ǫ:ǫ:)
ENTRY → ǫ:e3:e3 ǫ:ǫ:)
ENTRY → ...
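One derivation under this grammar can be traced by projecting the terminal triples onto the three tapes. A minimal sketch for “email this person” with a pointing gesture on entry e1 (the projection helper is invented; the triples follow the grammar above):

```python
# Terminal triples W:G:M from the example grammar; "" stands for an
# empty (epsilon) component.
EPS = ""
derivation = [
    ("email",  EPS,  "email(["),   # V     -> email:eps:email([
    ("this",   EPS,  EPS),         # DET   -> this:eps:eps
    ("person", "Gp", "person("),   # N     -> person:Gp:person( ENTRY
    (EPS,      "e1", "e1"),        # ENTRY -> eps:e1:e1
    (EPS,      EPS,  ")"),         # ENTRY continued: eps:eps:)
    (EPS,      EPS,  "])"),        # S     -> V NP eps:eps:])
]

def tapes(terminals):
    """Project a terminal sequence onto the speech, gesture and
    meaning tapes, dropping epsilons."""
    words    = [w for w, g, m in terminals if w]
    gestures = [g for w, g, m in terminals if g]
    meaning  = "".join(m for w, g, m in terminals if m)
    return words, gestures, meaning

w, g, m = tapes(derivation)
print(w)  # ['email', 'this', 'person']
print(g)  # ['Gp', 'e1']
print(m)  # email([person(e1)])
```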
Example Three-Tape FSA
Finite-state Meaning Representation
Capturing Meaning
Not only capturing the structure of the language, but also its meaning
Writing symbols on the third tape (n + 1)
The concatenated symbols yield the semantic representation of an utterance
Finite-state Meaning Representation
Gesture Values
Every object that can be gestured on needs a unique identifier → in the case of persons, something like an address book entry
To avoid repeating ID-arcs like ǫ:objid123:objid123, the authors store these values in a finite set of buffers labeled e1, e2, e3, . . .
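After transduction, the buffer labels in the meaning string can be replaced by the concrete object identifiers bound during gesture recognition. A minimal sketch (the binding table and helper are invented; objid123 is the identifier style mentioned above):

```python
# Buffer labels e1, e2, ... are bound to concrete object identifiers
# during gesture recognition (the bindings here are hypothetical).
buffers = {"e1": "objid123", "e2": "objid456"}

def resolve(meaning, buffers):
    """Substitute buffer labels in the meaning string by the object
    identifiers they were bound to."""
    for label, objid in buffers.items():
        meaning = meaning.replace(label, objid)
    return meaning

print(resolve("email([person(e1)])", buffers))  # email([person(objid123)])
```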
Finite-state Meaning Representation
Capturing Meaning Example
“Email ...” ⇒ word input → output on the semantic tape: email([
Finite-state Meaning Representation
Capturing Meaning Example
“Email this ...” ⇒ word input + pen input → output on the semantic tape: email([ (e1)
Finite-state Meaning Representation
Capturing Meaning Example
“Email this person.” ⇒ word input → output on the semantic tape: email([person(e1)])
Multimodal Finite-state Transducers
Problems
Finite-state language processing tools only support finite-state transducers (two tapes)
Speech recognizers don’t support the use of a three-tape FSA
⇒ The three-tape FSA has to be converted into an FST
Multimodal Finite-state Transducers
From FSA to FST
Combining pen input & word input → one input component: (G × W)
Output component M remains the same
Resulting function: T : (G × W) → M
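The conversion can be sketched by fusing the gesture and word symbols of each three-tape transition into one composite input symbol, leaving the meaning as the output. The encoding below (transition tuples, the `fuse` helper) is an assumption for illustration, not the authors' exact construction:

```python
# Each three-tape transition carries (state, word, gesture, meaning,
# next_state). Fusing gesture and word into a composite input symbol
# (g, w) yields an ordinary two-tape FST  T : (G x W) -> M.
def fuse(transitions):
    return {(s, (g, w)): (m, t) for (s, w, g, m, t) in transitions}

three_tape = [
    (0, "email",  "eps", "email([",  1),
    (1, "this",   "eps", "eps",      2),
    (2, "person", "Gp",  "person(",  3),
]
fst = fuse(three_tape)
print(fst[(2, ("Gp", "person"))])  # ('person(', 3)
```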
Multimodal Finite-state Transducers
Transducers
R : G → W
T : (G × W ) → M
Applying Multimodal Transducers
1 Recognizing pen input first → process incoming pen gestures and construct a finite-state machine
2 Using the observed pen input to modify the language model for speech recognition
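Step 2 can be sketched as follows: each observed gesture is mapped through R : G → W to the words it licenses, and speech hypotheses inconsistent with those words are pruned. The mapping table and the pruning helper are invented for illustration (the real system composes finite-state machines instead of filtering strings):

```python
# Hypothetical R : G -> W mapping, following the example grammar's
# gesture symbols Gp, Go, Gd.
R = {"Gp": {"person"}, "Go": {"organization"}, "Gd": {"department"}}

# Words whose use must be licensed by a gesture.
GESTURE_DEPENDENT = {"person", "organization", "department"}

def licensed_words(gestures):
    """Collect all words licensed by the observed gestures."""
    words = set()
    for g in gestures:
        words |= R.get(g, set())
    return words

def prune(candidates, gestures):
    """Keep only speech hypotheses consistent with the pen input."""
    allowed = licensed_words(gestures)
    return [c for c in candidates
            if all(w in allowed or w not in GESTURE_DEPENDENT
                   for w in c.split())]

hyps = ["email this person", "email this organization"]
print(prune(hyps, ["Gp"]))  # ['email this person']
```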
Gesture Finite-state Machine
Gesture Language Transducer
Composing the gesture finite-state machine with R : G → W yields the gesture-language transducer
Output Tape of Gesture Language Transducer
Speech Recognizer
Gesture Speech FST
Final Transducer
Summary
First approach using a single finite-state device to parse and integrate spoken language and pen input
Speech and pen input recognition
Composes semantic representation from speech and pen input
Mutual compensation among input modes
Thank you very much for your attention!
Any questions?