TRANSCRIPT
Speech & Gesture Recognition Systems
Andreas Farner
Seminar Human Computer Interaction, 18.06.09
Overview
1 Multimodal Interfaces
2 Put That There
3 Finite-state Multimodal Parsing and Understanding
Multimodal Interfaces
Systems that allow input and/or output to be conveyed over multiple different channels
Making the system produce a specific output based on the voice and gesture input of the user
Introduction Commands Technologies Summary
Put That There: Voice and Gesture at the Graphics Interface
Richard A. Bolt
1980
Overview
1 Introduction
2 Commands
3 Technologies
4 Summary
Introduction
Groundbreaking paper from 1980
Approach to combined voice-input and gesture-recognition
Practical use: arranging things
Setting: the Media Room at the Massachusetts Institute of Technology (MIT)
The Media Room
The MIT Media Room I
The Media Room
The MIT Media Room II
Combines the virtual “Dataland” and the physical space → one interactive space
Allows the user to act naturally
The Media Room
Objects
Simple basic shapes
Circles
Squares
Diamonds
Variable attributes:
Color
Size (large, medium, small)
Virtual object does not represent the real shape of the object
Controls
Navigating through “Dataland”
No keyboard needed
Joysticks and touch-pads
Navigating in a helicopter-like manner
Moving the “you are here” marker with the right-hand joystick or the touch-pad on the TV screen
Zooming in and out with the left-hand joystick
Moving the marker is also possible by pointing
Controls
Put That There
Overview: Basic Commands
Create
Move
Make that smaller/bigger/like that
Delete
Naming
Create
Create
“Create a blue circle here.”
Default size: medium
Color and shape must be given
A deictic term like “here”/“there” is combined with the x,y pointing input
Make That Smaller/Bigger/Like That
Smaller I
“Make the blue circle smaller.”
Make That Smaller/Bigger/Like That
Smaller II
“Make that smaller.”
Make That Smaller/Bigger/Like That
Smaller Result
“Make the blue circle smaller.”
“Make that smaller.”
Make That Smaller/Bigger/Like That
Make That
“Make that a large red diamond.”
Delete
Everything
“Delete everything to the left of this.”
Naming
Naming
“Call that...the calendar”
“Call that” → record x,y coordinates and switch to training mode
“...” → pause needed to switch between the modes
“the calendar” → associate the coordinates with “the calendar” and switch back to recognition mode
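The three-step naming protocol above can be sketched as a tiny mode state machine. This is an illustrative sketch, not Bolt's implementation; the class and state names are invented:

```python
# Sketch of the naming protocol: "call that" records the pointing
# coordinates and enters training mode; the next phrase is stored as
# a name for those coordinates (state names are invented).
class Namer:
    def __init__(self):
        self.mode = "recognition"
        self.names = {}          # name -> recorded x,y coordinates
        self._pending = None

    def hear(self, phrase, xy=None):
        if self.mode == "recognition" and phrase == "call that":
            self._pending = xy   # record the x,y pointing input
            self.mode = "training"
        elif self.mode == "training":
            self.names[phrase] = self._pending
            self.mode = "recognition"

n = Namer()
n.hear("call that", xy=(120, 45))  # the pause falls between the two utterances
n.hear("the calendar")
print(n.names)  # {'the calendar': (120, 45)}
```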
Speech
Speech Recognizers
Two types of speech recognizers:
Discrete or isolated utterances: requires a pause between words → unnatural
Connected speech: no pause necessary, up to 5 words → more natural
Speech
Used Speech Recognizer
Response time: 300 milliseconds
Output: display of the text on an alphanumeric visual display
Vocabulary: 120 words as word reference patterns
Space
Pointing
Based on measurements made of a nutating magnetic field
Two small plastic cubes, each containing three coils (one for each axis):
1 A transmitter cube (3.8 cm on edge)
2 A sensor cube (1.9 cm on edge) attached to a wristband of the user
Together the coils act as antennas
Summary
More than a virtual desktop; example applications:
Moving ships about a harbor map
Moving battalion formations
Facilities planning
Groundbreaking ⇒ the first approach to combine speech and gesture input
May seem simple, but remember: it is from 1980
Put That There remains rather superficial
Another approach from 2000:
“Finite-state Multimodal Parsing and Understanding”
→ How is multimodal input parsed?
Introduction Finite-state Language Processing Finite-state Multimodal Grammars Applying Multimodal Transducers Conclusion
Finite-state Multimodal Parsing and Understanding
Michael Johnston & Srinivas Bangalore
2000
Contents
1 Introduction
2 Finite-state Language Processing
3 Finite-state Multimodal Grammars
4 Applying Multimodal Transducers
5 Conclusion
Multimodal Interfaces
Multimodal interfaces require effective parsing and understanding
General applicability required
Using one single finite-state device
Important concern in the ongoing migration of interaction from the desktop to wireless devices such as PDAs, next-generation phones, etc.
Finite-state Automaton (FSA)
Parsing, understanding and integration of speech and pen input performed by one device
Running on three tapes:
1 Speech input
2 Pen input
3 Combined interpretation
Example
Finite-state Transducers
Are finite-state automata (FSA)
Each transition consists of an input and an output symbol
Can be regarded as a two-tape FSA with an input tape and an output tape
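Such a two-tape machine can be sketched in a few lines: each transition consumes one input symbol and emits one output symbol. This is a minimal illustrative sketch (the class, the toy transitions, and the epsilon-as-empty-string convention are assumptions, not the authors' implementation):

```python
# A minimal finite-state transducer: transitions map
# (state, input_symbol) -> (output_symbol, next_state).
class FST:
    def __init__(self, start, finals, transitions):
        self.start = start            # initial state
        self.finals = finals          # set of accepting states
        self.transitions = transitions

    def transduce(self, tape):
        """Run the input tape through the machine; return the output
        tape, or None if the input is not accepted."""
        state, out = self.start, []
        for sym in tape:
            if (state, sym) not in self.transitions:
                return None
            out_sym, state = self.transitions[(state, sym)]
            if out_sym:               # '' models an epsilon output
                out.append(out_sym)
        return out if state in self.finals else None

# Toy transitions loosely modeled on the example grammar later on.
fst = FST(0, {0}, {
    (0, "email"):  ("email([", 0),
    (0, "this"):   ("", 0),
    (0, "person"): ("person(", 0),
})
print(fst.transduce(["email", "this", "person"]))  # ['email([', 'person(']
```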
Finite-state Models
Attractive mechanisms for language processing
Efficiently learnable from data
Generally effective for decoding
Allow straightforward integration of constraints from various levels of language processing
Enable tight integration of language processing with speech and gesture recognition
Frege’s Principle
“The meaning of a complex expression is
determined by the meanings of its constituent
expressions and the rules used to combine them.”
Parsing Multiple Input Streams
Speech and pen input require three tapes: one for speech, one for pen input, and a third for their combined meaning
A finite-state device combines the content of multiple input streams into a single semantic representation
An interface with n modes requires n + 1 tapes
The first n tapes represent the input streams
Tape n + 1 is an output stream representing their composition
Multimodal Context-free Grammar
Non-terminals are atomic symbols
Each terminal contains three components W:G:M corresponding to the n + 1 tapes:
1 W: Words
2 G: Gestures
3 M: Meaning
ǫ means that the component is empty in this terminal
Example Grammar
S → V NP ǫ:ǫ:])
NP → DET N
NP → DET N CONJ NP
CONJ → and:ǫ:,
V → email:ǫ:email([
V → page:ǫ:page([
DET → this:ǫ:ǫ
DET → that:ǫ:ǫ
N → person:Gp:person( ENTRY
N → organization:Go:org( ENTRY
N → department:Gd:dept( ENTRY
ENTRY → ǫ:e1:e1 ǫ:ǫ:)
ENTRY → ǫ:e2:e2 ǫ:ǫ:)
ENTRY → ǫ:e3:e3 ǫ:ǫ:)
ENTRY → ...
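One derivation under this grammar can be traced by projecting the terminal triples onto the three tapes. A minimal sketch for “email this person” with a pointing gesture on entry e1 (the projection helper is invented; the triples follow the grammar above):

```python
# Terminal triples W:G:M from the example grammar; "" stands for an
# empty (epsilon) component.
EPS = ""
derivation = [
    ("email",  EPS,  "email(["),   # V     -> email:eps:email([
    ("this",   EPS,  EPS),         # DET   -> this:eps:eps
    ("person", "Gp", "person("),   # N     -> person:Gp:person( ENTRY
    (EPS,      "e1", "e1"),        # ENTRY -> eps:e1:e1
    (EPS,      EPS,  ")"),         # ENTRY continued: eps:eps:)
    (EPS,      EPS,  "])"),        # S     -> V NP eps:eps:])
]

def tapes(terminals):
    """Project a terminal sequence onto the speech, gesture and
    meaning tapes, dropping epsilons."""
    words    = [w for w, g, m in terminals if w]
    gestures = [g for w, g, m in terminals if g]
    meaning  = "".join(m for w, g, m in terminals if m)
    return words, gestures, meaning

w, g, m = tapes(derivation)
print(w)  # ['email', 'this', 'person']
print(g)  # ['Gp', 'e1']
print(m)  # email([person(e1)])
```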
Example Three-Tape FSA
Finite-state Meaning Representation
Capturing Meaning
Not only capturing the structure of the language, but also its meaning
Writing symbols on the third tape (n + 1)
The concatenated symbols yield the semantic representation of an utterance
Finite-state Meaning Representation
Gesture Values
Every object that can be gestured on needs a unique identifier → in the case of persons, something like an address book entry
To avoid repeating ID-arcs like ǫ:objid123:objid123, the authors store these values in a finite set of buffers labeled e1, e2, e3, . . .
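After transduction, the buffer labels in the meaning string can be replaced by the concrete object identifiers bound during gesture recognition. A minimal sketch (the binding table and helper are invented; objid123 is the identifier style mentioned above):

```python
# Buffer labels e1, e2, ... are bound to concrete object identifiers
# during gesture recognition (the bindings here are hypothetical).
buffers = {"e1": "objid123", "e2": "objid456"}

def resolve(meaning, buffers):
    """Substitute buffer labels in the meaning string by the object
    identifiers they were bound to."""
    for label, objid in buffers.items():
        meaning = meaning.replace(label, objid)
    return meaning

print(resolve("email([person(e1)])", buffers))  # email([person(objid123)])
```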
Finite-state Meaning Representation
Capturing Meaning Example
“Email ...” ⇒ word input → output on the semantic tape: email([
Finite-state Meaning Representation
Capturing Meaning Example
“Email this ...” ⇒ word input + pen input → output on the semantic tape: email([ (e1)
Finite-state Meaning Representation
Capturing Meaning Example
“Email this person.” ⇒ word input → output on the semantic tape: email([person(e1)])
Multimodal Finite-state Transducers
Problems
Finite-state language processing tools only support finite-state transducers (two tapes)
Speech recognizers don’t support the use of a three-tape FSA
⇒ The three-tape FSA has to be converted into an FST
Multimodal Finite-state Transducers
From FSA to FST
Combining pen input & word input → one input component: (G × W)
Output component M remains the same
Resulting function: T : (G × W) → M
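The conversion can be sketched by fusing the gesture and word symbols of each three-tape transition into one composite input symbol, leaving the meaning as the output. The encoding below (transition tuples, the `fuse` helper) is an assumption for illustration, not the authors' exact construction:

```python
# Each three-tape transition carries (state, word, gesture, meaning,
# next_state). Fusing gesture and word into a composite input symbol
# (g, w) yields an ordinary two-tape FST  T : (G x W) -> M.
def fuse(transitions):
    return {(s, (g, w)): (m, t) for (s, w, g, m, t) in transitions}

three_tape = [
    (0, "email",  "eps", "email([",  1),
    (1, "this",   "eps", "eps",      2),
    (2, "person", "Gp",  "person(",  3),
]
fst = fuse(three_tape)
print(fst[(2, ("Gp", "person"))])  # ('person(', 3)
```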
Multimodal Finite-state Transducers
Transducers
R : G → W
T : (G × W ) → M
Applying Multimodal Transducers
1 Recognizing pen input first → process incoming pen gestures and construct a finite-state machine
2 Using the observed pen input to modify the language model for speech recognition
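Step 2 can be sketched as follows: each observed gesture is mapped through R : G → W to the words it licenses, and speech hypotheses inconsistent with those words are pruned. The mapping table and the pruning helper are invented for illustration (the real system composes finite-state machines instead of filtering strings):

```python
# Hypothetical R : G -> W mapping, following the example grammar's
# gesture symbols Gp, Go, Gd.
R = {"Gp": {"person"}, "Go": {"organization"}, "Gd": {"department"}}

# Words whose use must be licensed by a gesture.
GESTURE_DEPENDENT = {"person", "organization", "department"}

def licensed_words(gestures):
    """Collect all words licensed by the observed gestures."""
    words = set()
    for g in gestures:
        words |= R.get(g, set())
    return words

def prune(candidates, gestures):
    """Keep only speech hypotheses consistent with the pen input."""
    allowed = licensed_words(gestures)
    return [c for c in candidates
            if all(w in allowed or w not in GESTURE_DEPENDENT
                   for w in c.split())]

hyps = ["email this person", "email this organization"]
print(prune(hyps, ["Gp"]))  # ['email this person']
```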
Gesture Finite-state Machine
Gesture Language Transducer
Composing the gesture finite-state machine with R : G → W yields the gesture-language transducer
Output Tape of Gesture Language Transducer
Speech Recognizer
Gesture Speech FST
Final Transducer
Summary
First approach using a single finite-state device to parse and integrate spoken language and pen input
Speech and pen input recognition
Composes semantic representation from speech and pen input
Mutual compensation among input modes
Thank you very much for your attention!
Any questions?