
Speech & Gesture Recognition Systems

Andreas Farner

Seminar Human Computer Interaction, 18.06.09

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Multimodal Interfaces

Systems that allow input and/or output to be conveyed over multiple different channels

Making the system produce a specific output based on the voice and gesture input of the user

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Introduction Commands Technologies Summary

Put That There: Voice and Gesture at the

Graphics Interface

Richard A. Bolt

1980


Overview

1 Introduction

2 Commands

3 Technologies

4 Summary


Introduction

Groundbreaking paper from 1980

Approach to combined voice-input and gesture-recognition

Practical use: arranging things

Setting: the Media Room of the Massachusetts Institute of Technology (MIT)


The Media Room

The MIT Media Room I


The Media Room

The MIT Media Room II

Combines virtual “Dataland” and physical space → one interactive space

Allows the user to act naturally


The Media Room

Objects

Simple basic shapes

Circles

Squares

Diamonds

Variable attributes:

Color

Size (large, medium, small)

Virtual object does not represent the real shape of the object


Controls

Navigating through “Dataland”

No keyboard needed

Joysticks and touch-pads

Navigating in a helicopter-like manner

Moving the “you are here” marker with the right-hand joystick or touch-pad on the TV screen

Zooming in and out with the left-hand joystick

Moving the marker is also possible by pointing


Controls

Put That There


Overview: Basic Commands

Create

Move

Make that smaller/bigger/like that

Delete

Naming


Create

Create

“Create a blue circle here.”



Create

Create

Default size: medium

Color and shape must be given

“there” is combined with the x,y pointing input


Make That Smaller/Bigger/Like That

Smaller I

“Make the blue circle smaller.”


Make That Smaller/Bigger/Like That

Smaller II

“Make that smaller.”


Make That Smaller/Bigger/Like That

Smaller Result

“Make the blue circle smaller.”

“Make that smaller.”


Make That Smaller/Bigger/Like That

Make That

“Make that a large red diamond.”



Delete

Everything

“Delete everything to the left of this.”



Naming

Naming

“Call that...the calendar”


“Call that” → record x,y coordinates and switch to training mode

“...” → pause needed to switch between the modes

“the calendar” → associate the coordinates with “the calendar” and switch back to recognition mode
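The mode-switching protocol above can be sketched as a tiny state machine. This is a minimal illustrative sketch; class and method names are ours, not from Bolt's paper.

```python
# Minimal sketch of the naming protocol: "call that" records the pointing
# coordinates and enters training mode; after the pause, the next phrase
# is bound to those coordinates and recognition mode resumes.
# All names here are illustrative, not from the original system.

class NamingDemo:
    def __init__(self):
        self.mode = "recognition"
        self.pending_xy = None
        self.names = {}  # name -> (x, y)

    def handle(self, phrase, pointing_xy=None):
        if self.mode == "recognition" and phrase == "call that":
            self.pending_xy = pointing_xy  # record x,y at "that"
            self.mode = "training"         # the pause separates the modes
        elif self.mode == "training":
            self.names[phrase] = self.pending_xy  # bind name to coordinates
            self.mode = "recognition"

demo = NamingDemo()
demo.handle("call that", pointing_xy=(120, 45))
demo.handle("the calendar")
print(demo.names)  # {'the calendar': (120, 45)}
```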


Speech

Speech Recognizers

Two types of speech recognizers:

Discrete or isolated utterances: requires a pause between words → unnatural

Connected speech: no pause necessary, up to 5 words → more natural


Speech

Used Speech Recognizer

Response time: 300 milliseconds

Output: display of the text on an alphanumeric visual display

Vocabulary: 120 words as word reference patterns


Space

Pointing

Based on measurements made of a nutating magnetic field

Two small plastic cubes, each containing three coils (one for each axis)

1 A transmitter cube (3.8 cm on edge)
2 A sensor cube (1.9 cm on edge) attached to a wristband of the user

Coils together create an antenna


Summary

More than a virtual desktop; applications include:

Moving ships about a harbor map

Moving battalion formations

Facilities planning

Groundbreaking approach ⇒ first approach to combine speech and gesture input

May seem simple, but one has to remember: it is from 1980

Put That There remains rather superficial

Another approach from 2000:

“Finite-state Multimodal Parsing and Understanding”

→ How is multimodal input parsed?

Overview

1 Multimodal Interfaces

2 Put That There

3 Finite-state Multimodal Parsing and Understanding

Introduction Finite-state Language Processing Finite-state Multimodal Grammars Applying Multimodal Transducers Conclusion

Finite-state Multimodal Parsing and Understanding

Michael Johnston & Srinivas Bangalore

2000


Overview

1 Introduction

2 Finite-state Language Processing

3 Finite-state Multimodal Grammars

4 Applying Multimodal Transducers

5 Conclusion


Multimodal Interfaces

Multimodal interfaces require effective parsing and understanding

General applicability required

Using a single finite-state device

Important concern in the ongoing migration of interaction from the desktop to wireless devices such as PDAs, next-generation phones, etc.


Finite-state Automaton (FSA)

Parsing, understanding and integration of speech and pen input performed by one device

Running on three tapes:

1 Speech input
2 Pen input
3 Combined interpretation


Example


Finite-state Transducers

Are finite-state automata (FSA)

Each transition consists of an input and an output symbol

Can be regarded as a two-tape FSA with an input tape and an output tape
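This two-tape view can be illustrated with a few lines of code: each transition consumes one input symbol and emits one output symbol. A minimal sketch; the class and the toy alphabet are ours, not from the paper.

```python
# A finite-state transducer as a two-tape automaton: every transition
# reads one symbol from the input tape and writes one symbol to the
# output tape. Illustrative sketch only.

class FST:
    def __init__(self, start, finals, transitions):
        # transitions: {(state, input_symbol): (next_state, output_symbol)}
        self.start, self.finals, self.trans = start, finals, transitions

    def transduce(self, symbols):
        state, out = self.start, []
        for sym in symbols:
            state, emitted = self.trans[(state, sym)]
            out.append(emitted)
        assert state in self.finals, "input not accepted"
        return out

# Toy transducer: spoken digits (input tape) to numerals (output tape).
fst = FST(0, {0}, {(0, "one"): (0, "1"), (0, "two"): (0, "2")})
print(fst.transduce(["two", "one"]))  # ['2', '1']
```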


Finite-state Models

Attractive mechanisms for language processing

Efficiently learnable from data

Generally effective for decoding

Allow straightforward integration of constraints from various levels of language processing

Enable tight integration of language processing with speechand gesture recognition


Frege’s Principle

“The meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them.”


Parsing Multiple Input Streams

Speech and pen input require three tapes: one for speech, one for pen input, and a third for their combined meaning

Finite-state device combines the content of multiple inputstreams into a single semantic representation

An interface with n modes requires n + 1 tapes

The first n tapes represent the input streams

Tape n + 1 is an output stream representing their composition


Multimodal Context-free Grammar

Non-terminals are atomic symbols

Each terminal contains three components W:G:M corresponding to the n + 1 tapes

1 W: Words
2 G: Gestures
3 M: Meaning

ǫ means that the corresponding component is empty in this terminal


Example Grammar

S → V NP ǫ:ǫ:])
NP → DET N
NP → DET N CONJ NP
CONJ → and:ǫ:,
V → email:ǫ:email([
V → page:ǫ:page([
DET → this:ǫ:ǫ
DET → that:ǫ:ǫ
N → person:Gp:person( ENTRY
N → organization:Go:org( ENTRY
N → department:Gd:dept( ENTRY
ENTRY → ǫ:e1:e1 ǫ:ǫ:)
ENTRY → ǫ:e2:e2 ǫ:ǫ:)
ENTRY → ǫ:e3:e3 ǫ:ǫ:)
ENTRY → ...
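Following the grammar above, a derivation of “email this person” plus a pointing gesture can be written as a sequence of W:G:M terminals; concatenating the M components yields the meaning. A sketch with ǫ rendered as the empty string; the tuple encoding is ours.

```python
# Terminals of the derivation as (W, G, M) triples, with ǫ as "".
EPS = ""
derivation = [
    ("email", EPS, "email(["),    # V → email:ǫ:email([
    ("this", EPS, EPS),           # DET → this:ǫ:ǫ
    ("person", "Gp", "person("),  # N → person:Gp:person( ENTRY
    (EPS, "e1", "e1"),            # ENTRY → ǫ:e1:e1 ...
    (EPS, EPS, ")"),              # ... ǫ:ǫ:)
    (EPS, EPS, "])"),             # S-final ǫ:ǫ:])
]
# Concatenating the meaning components gives the semantic representation.
meaning = "".join(m for _, _, m in derivation)
print(meaning)  # email([person(e1)])
```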


Example Three-Tape FSA


Finite-state Meaning Representation

Capturing Meaning

Not only capturing the structure of the language, but also meaning

Writing symbols on the third tape (n + 1)

The concatenated symbols yield the semantic representation of an utterance


Finite-state Meaning Representation

Gesture Values

Every object that can be gestured on needs a unique identifier → in the case of persons, something like an address book

To avoid repeating ID-arcs like ǫ:objid123:objid123, the authors store these values in a finite set of buffers labeled e1, e2, e3, . . .
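The buffering idea can be sketched as follows; the function and its names are ours, intended only to illustrate mapping raw object ids onto a fixed set of buffer labels.

```python
# Map gestured object ids onto a fixed set of buffer labels e1, e2, e3, ...
# so the grammar refers only to the labels, never to the raw ids.
def buffer_gestures(object_ids, max_buffers=3):
    return {f"e{i}": oid
            for i, oid in enumerate(object_ids[:max_buffers], start=1)}

print(buffer_gestures(["objid123", "objid456"]))
# {'e1': 'objid123', 'e2': 'objid456'}
```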


Finite-state Meaning Representation

Capturing Meaning Example

“Email ...” ⇒ Word input → Output on semantic tape: email([


Finite-state Meaning Representation

Capturing Meaning Example

“Email this ...” ⇒ Word input + pen input → Output on semantic tape: email([ (e1)


Finite-state Meaning Representation

Capturing Meaning Example

“Email this person.” ⇒ Word input → Output on semantic tape: email([person(e1)])


Multimodal Finite-state Transducers

Problems

Finite-state language processing tools only support finite-state transducers (two tapes)

Speech recognizers don’t support the use of a three-tape FSA ⇒ the three-tape FSA has to be converted into an FST


Multimodal Finite-state Transducers

From FSA to FST

Combining pen input & word input → one input component: (G × W)

Output component M remains the same

Resulting function: T : (G × W) → M
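The conversion can be pictured as relabeling transitions: the gesture and word symbols of each W:G:M label are folded into one composite input symbol, leaving the meaning symbol as the single output. A sketch; the helper function is ours.

```python
# Fold the W and G components of each W:G:M transition label into one
# composite input symbol (g, w), turning three-tape labels into ordinary
# two-tape FST labels of the form ((g, w), m).
def pair_tapes(triples):
    return [((g, w), m) for (w, g, m) in triples]

labels = [("email", "", "email(["), ("person", "Gp", "person(")]
print(pair_tapes(labels))
# [(('', 'email'), 'email(['), (('Gp', 'person'), 'person(')]
```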


Multimodal Finite-state Transducers

Transducers

R : G → W

T : (G × W ) → M


Applying Multimodal Transducers

1 Recognizing pen input first → process incoming pen gestures and construct a finite-state machine

2 Using the observed pen input to modify the language model for speech recognition
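Step 2 can be sketched as projecting the gesture transducer R : G → W onto its word side and keeping only the words licensed by the observed gestures. A simplified sketch under our own naming; the actual system composes finite-state machines rather than manipulating plain sets.

```python
# Given R as a set of (gesture, word) transition labels, collect the
# words the observed gestures license; these would be used to narrow
# the speech recognizer's language model before recognition.
def words_licensed_by_gestures(r_transitions, observed_gestures):
    return {w for (g, w) in r_transitions if g in observed_gestures}

R = [("Gp", "person"), ("Go", "organization"), ("Gd", "department")]
print(words_licensed_by_gestures(R, {"Gp"}))  # {'person'}
```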


Gesture Finite-state Machine


Gesture Language Transducer

Composing the gesture finite-state machine with R : G → W yields the gesture language transducer.


Output Tape of Gesture Language Transducer


Speech Recognizer


Gesture Speech FST


Final Transducer


Summary

First approach using a single finite-state device to parse and integrate spoken language and pen input

Speech and pen input recognition

Composes semantic representation from speech and pen input

Mutual compensation among input modes


Thank you very much for your attention!

Any questions?
