
Speech & Gesture Recognition Systems

Andreas Farner

Seminar Human Computer Interaction, 18.06.09

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Multimodal Interfaces

Systems that allow input and/or output to be conveyed over multiple different channels

Making the system produce a specific output based on the voice and gesture input of the user

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Introduction Commands Technologies Summary

Put That There: Voice and Gesture at the

Graphics Interface

Richard A. Bolt

1980


Overview

1 Introduction

2 Commands

3 Technologies

4 Summary


Introduction

Groundbreaking paper from 1980

Approach to combined voice-input and gesture-recognition

Practical use: arranging things

Setting: the Media Room of the Massachusetts Institute of Technology (MIT)


The Media Room

The MIT Media Room I


The Media Room

The MIT Media Room II

Combines virtual “Dataland” and physical space → one interactive space

Allows the user to act naturally


The Media Room

Objects

Simple basic shapes

Circles

Squares

Diamonds

Variable attributes:

Color

Size (large, medium, small)

Virtual object does not represent the real shape of the object


Controls

Navigating through “Dataland”

No keyboard needed

Joysticks and touch-pads

Navigating in a helicopter-like manner

Moving the “you are here” marker with the right-hand joystick or touch-pad on the TV screen

Zooming in and out with the left-hand joystick

Moving the marker is also possible by pointing


Controls

Put That There


Overview: Basic Commands

Create

Move

Make that smaller/bigger/like that

Delete

Naming


Create

Create

“Create a blue circle here.”



Create

Create

Default size: medium

Color and shape must be given

“there” is combined with the x,y pointing input


Make That Smaller/Bigger/Like That

Smaller I

“Make the blue circle smaller.”


Make That Smaller/Bigger/Like That

Smaller II

“Make that smaller.”


Make That Smaller/Bigger/Like That

Smaller Result

“Make the blue circle smaller.”

“Make that smaller.”


Make That Smaller/Bigger/Like That

Make That

“Make that a large red diamond.”



Delete

Everything

“Delete everything to the left of this.”



Naming

Naming

“Call that...the calendar”


“Call that” → record x,y coordinates and switch to training mode

“...” → pause needed to switch between the modes

“the calendar” → associate the coordinates with “the calendar” and switch back to recognition mode
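The mode-switching protocol above can be sketched as a tiny state machine. This is a minimal illustrative sketch; class and method names are ours, not from Bolt's paper.

```python
# Minimal sketch of the naming protocol: "call that" records the pointing
# coordinates and enters training mode; after the pause, the next phrase
# is bound to those coordinates and recognition mode resumes.
# All names here are illustrative, not from the original system.

class NamingDemo:
    def __init__(self):
        self.mode = "recognition"
        self.pending_xy = None
        self.names = {}  # name -> (x, y)

    def handle(self, phrase, pointing_xy=None):
        if self.mode == "recognition" and phrase == "call that":
            self.pending_xy = pointing_xy  # record x,y at "that"
            self.mode = "training"         # the pause separates the modes
        elif self.mode == "training":
            self.names[phrase] = self.pending_xy  # bind name to coordinates
            self.mode = "recognition"

demo = NamingDemo()
demo.handle("call that", pointing_xy=(120, 45))
demo.handle("the calendar")
print(demo.names)  # {'the calendar': (120, 45)}
```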


Speech

Speech Recognizers

Two types of speech recognizers:

Discrete or isolated utterances: requires a pause between words → unnatural

Connected speech: no pause necessary, up to 5 words → more natural


Speech

Used Speech Recognizer

Response time: 300 milliseconds

Output: display of the text on an alphanumeric visual display

Vocabulary: 120 words as word reference patterns


Space

Pointing

Based on measurements made of a nutating magnetic field

Two small plastic cubes, each containing three coils (one for each axis)

1 A transmitter cube (3.8 cm on edge)
2 A sensor cube (1.9 cm on edge) attached to a wristband of the user

Coils together create an antenna


Summary

More than a virtual desktop; applications include:

Moving ships about a harbor map

Moving battalion formations

Facilities planning

Groundbreaking approach ⇒ first approach to combine speech and gesture input

May seem simple, but one has to remember: it is from 1980

Put That There remains rather superficial

Another approach from 2000:

“Finite-state Multimodal Parsing and Understanding”

→ How is multimodal input parsed?

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Introduction Finite-state Language Processing Finite-state Multimodal Grammars Applying Multimodal Transducers Conclusion

Finite-state Multimodal Parsing and Understanding

Michael Johnston & Srinivas Bangalore

2000


Overview

1 Introduction

2 Finite-state Language Processing

3 Finite-state Multimodal Grammars

4 Applying Multimodal Transducers

5 Conclusion


Multimodal Interfaces

Multimodal interfaces require effective parsing and understanding

General applicability required

Using a single finite-state device

Important concern in the ongoing migration of interaction from the desktop to wireless devices such as PDAs, next-generation phones, etc.


Finite-state Automaton (FSA)

Parsing, understanding and integration of speech and pen input performed by one device

Running on three tapes:

1 Speech input
2 Pen input
3 Combined interpretation


Example


Finite-state Transducers

Are finite-state automata (FSA)

Each transition consists of an input and an output symbol

Can be regarded as a two-tape FSA with an input tape and an output tape
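This two-tape view can be illustrated with a few lines of code: each transition consumes one input symbol and emits one output symbol. A minimal sketch; the class and the toy alphabet are ours, not from the paper.

```python
# A finite-state transducer as a two-tape automaton: every transition
# reads one symbol from the input tape and writes one symbol to the
# output tape. Illustrative sketch only.

class FST:
    def __init__(self, start, finals, transitions):
        # transitions: {(state, input_symbol): (next_state, output_symbol)}
        self.start, self.finals, self.trans = start, finals, transitions

    def transduce(self, symbols):
        state, out = self.start, []
        for sym in symbols:
            state, emitted = self.trans[(state, sym)]
            out.append(emitted)
        assert state in self.finals, "input not accepted"
        return out

# Toy transducer: spoken digits (input tape) to numerals (output tape).
fst = FST(0, {0}, {(0, "one"): (0, "1"), (0, "two"): (0, "2")})
print(fst.transduce(["two", "one"]))  # ['2', '1']
```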


Finite-state Models

Attractive mechanisms for language processing

Efficiently learnable from data

Generally effective for decoding

Allow straightforward integration of constraints from various levels of language processing

Enable tight integration of language processing with speechand gesture recognition


Frege’s Principle

“The meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them.”


Parsing Multiple Input Streams

Speech and pen input require three tapes: one for speech, one for pen input, and a third for their combined meaning

Finite-state device combines the content of multiple inputstreams into a single semantic representation

An interface with n modes requires n + 1 tapes

The first n tapes represent the input streams

Tape n + 1 is an output stream representing their composition


Multimodal Context-free Grammar

Non-terminals are atomic symbols

Each terminal contains three components W:G:M corresponding to the n + 1 tapes

1 W: Words
2 G: Gestures
3 M: Meaning

ǫ means that the corresponding component is empty in this terminal


Example Grammar

S → V NP ǫ:ǫ:])
NP → DET N
NP → DET N CONJ NP
CONJ → and:ǫ:,
V → email:ǫ:email([
V → page:ǫ:page([
DET → this:ǫ:ǫ
DET → that:ǫ:ǫ
N → person:Gp:person( ENTRY
N → organization:Go:org( ENTRY
N → department:Gd:dept( ENTRY
ENTRY → ǫ:e1:e1 ǫ:ǫ:)
ENTRY → ǫ:e2:e2 ǫ:ǫ:)
ENTRY → ǫ:e3:e3 ǫ:ǫ:)
ENTRY → ...
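Following the grammar above, a derivation of “email this person” plus a pointing gesture can be written as a sequence of W:G:M terminals; concatenating the M components yields the meaning. A sketch with ǫ rendered as the empty string; the tuple encoding is ours.

```python
# Terminals of the derivation as (W, G, M) triples, with ǫ as "".
EPS = ""
derivation = [
    ("email", EPS, "email(["),    # V → email:ǫ:email([
    ("this", EPS, EPS),           # DET → this:ǫ:ǫ
    ("person", "Gp", "person("),  # N → person:Gp:person( ENTRY
    (EPS, "e1", "e1"),            # ENTRY → ǫ:e1:e1 ...
    (EPS, EPS, ")"),              # ... ǫ:ǫ:)
    (EPS, EPS, "])"),             # S-final ǫ:ǫ:])
]
# Concatenating the meaning components gives the semantic representation.
meaning = "".join(m for _, _, m in derivation)
print(meaning)  # email([person(e1)])
```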


Example Three-Tape FSA


Finite-state Meaning Representation

Capturing Meaning

Not only capturing the structure of the language, but also meaning

Writing symbols on the third tape (n + 1)

The concatenated symbols yield the semantic representation of an utterance


Finite-state Meaning Representation

Gesture Values

Every object that can be gestured on needs a unique identifier → in the case of persons, something like an address book

To avoid repeating ID-arcs like ǫ:objid123:objid123, the authors store these values in a finite set of buffers labeled e1, e2, e3, . . .
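The buffering idea can be sketched as follows; the function and its names are ours, intended only to illustrate mapping raw object ids onto a fixed set of buffer labels.

```python
# Map gestured object ids onto a fixed set of buffer labels e1, e2, e3, ...
# so the grammar refers only to the labels, never to the raw ids.
def buffer_gestures(object_ids, max_buffers=3):
    return {f"e{i}": oid
            for i, oid in enumerate(object_ids[:max_buffers], start=1)}

print(buffer_gestures(["objid123", "objid456"]))
# {'e1': 'objid123', 'e2': 'objid456'}
```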


Finite-state Meaning Representation

Capturing Meaning Example

“Email ...” ⇒ Word input → Output on semantic tape: email([


Finite-state Meaning Representation

Capturing Meaning Example

“Email this ...” ⇒ Word input + pen input → Output on semantic tape: email([ (e1)


Finite-state Meaning Representation

Capturing Meaning Example

“Email this person.” ⇒ Word input → Output on semantic tape: email([person(e1)])


Multimodal Finite-state Transducers

Problems

Finite-state language processing tools only support finite-state transducers (two tapes)

Speech recognizers don’t support the use of a three-tape FSA ⇒ the three-tape FSA has to be converted into an FST


Multimodal Finite-state Transducers

From FSA to FST

Combining pen input & word input → one input component: (G × W)

Output component M remains the same

Resulting function: T : (G × W) → M
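The conversion can be pictured as relabeling transitions: the gesture and word symbols of each W:G:M label are folded into one composite input symbol, leaving the meaning symbol as the single output. A sketch; the helper function is ours.

```python
# Fold the W and G components of each W:G:M transition label into one
# composite input symbol (g, w), turning three-tape labels into ordinary
# two-tape FST labels of the form ((g, w), m).
def pair_tapes(triples):
    return [((g, w), m) for (w, g, m) in triples]

labels = [("email", "", "email(["), ("person", "Gp", "person(")]
print(pair_tapes(labels))
# [(('', 'email'), 'email(['), (('Gp', 'person'), 'person(')]
```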


Multimodal Finite-state Transducers

Transducers

R : G → W

T : (G × W ) → M


Applying Multimodal Transducers

1 Recognizing pen input first → process incoming pen gestures and construct a finite-state machine

2 Using the observed pen input to modify the language model for speech recognition
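Step 2 can be sketched as projecting the gesture transducer R : G → W onto its word side and keeping only the words licensed by the observed gestures. A simplified sketch under our own naming; the actual system composes finite-state machines rather than manipulating plain sets.

```python
# Given R as a set of (gesture, word) transition labels, collect the
# words the observed gestures license; these would be used to narrow
# the speech recognizer's language model before recognition.
def words_licensed_by_gestures(r_transitions, observed_gestures):
    return {w for (g, w) in r_transitions if g in observed_gestures}

R = [("Gp", "person"), ("Go", "organization"), ("Gd", "department")]
print(words_licensed_by_gestures(R, {"Gp"}))  # {'person'}
```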


Gesture Finite-state Machine


Gesture Language Transducer

Composing the gesture finite-state machine with R : G → W yields the gesture language transducer.


Output Tape of Gesture Language Transducer


Speech Recognizer


Gesture Speech FST


Final Transducer


Summary

First approach using a single finite-state device to parse and integrate spoken language and pen input

Speech and pen input recognition

Composes semantic representation from speech and pen input

Mutual compensation among input modes


Thank you very much for your attention!

Any questions?
