Post on 18-Dec-2015
Forschungszentrum Telekommunikation Wien [Telecommunications Research Center Vienna]
Interfaces between Speech and Non-Speech Audio Technology
Michael Pucher (FTW Vienna, ICSI Berkeley)
© ftw. 2005
Contents
Text-to-Speech Synthesis (TTS)
Automatic Speech Recognition (ASR, STT)
Dialog Systems
Multimodal Mobile Applications
Resources
Auditory representations
Affective states and attitudes
Speaker characteristics
Structural prosodic elements
Pragmatics and discourse
Sound signals
Music
Perspectival, spatial cues
Non-linguistic
Paralinguistic
Linguistic
Lexical semantics and syntax
TTS
ASR
Dialog Systems
TTS Examples
16kHz natural voice
16kHz unit selection synthesis (server-based)
8kHz diphone-based synthesis with lexicon (embedded or distributed)
8kHz diphone-based synthesis without lexicon (embedded)
Application-specific lexicon
- Gerald R. Ford: tSE-r6ld a:R fo:rd
TTS Evaluation
[Figure: ratings on a 1 to 5 scale with 95% CI, by source, for Pronunciation, Articulation, Overall, Voice, Listening and Comprehension. Sources: CorpPPC AMR, CorpPPC GSM, CorpPPC PCM, CTTS, MobileSPLex, MobileSPNoLex, Natural, SmartPPCLex, SmartPPCNoLex.]
[Figure: comprehension of words on a 0 to 100 scale with 95% CI, by source (SourceLabel).]
TTS and Non-Speech Audio
TTS | Feature | Status | Combination with non-speech audio
Comprehensible TTS | Low word error rate | Solved (diphone-based TTS) | TTS provides lexical information; add structural prosodic elements
Natural TTS | Single-style prosody | Solved (unit selection) | TTS provides structural prosodic elements; add affective states and attitudes
Expressive TTS | Various prosodic styles | Not solved (?) | Add pragmatic information, dialog turns
Limited Expressiveness of Speech 1
Limited expressiveness of Expressive TTS = Limited expressiveness of speech
Limited expressiveness of speech because of unlimited expressiveness of speech
- Because everything is expressible in language, the messages are less useful for certain purposes (too complex)
- Simpler, less expressive codes (sounds, icons) may be used in context and lead to shorter messages
Disadvantages of speech
- Seriality
- Non-universality
Types of ASR and Applications
Isolated word recognition
Large-vocabulary speech recognition
Conversational speech recognition
Speech recognition in noisy environments
Car navigation
Meeting transcription
Command & control
Broadcast news transcription
Speaker-dependent or speaker-independent
Other Related Technologies
Speech
- Speaker verification
NLP
- Dialog act detection
- Topic detection
Music Information Retrieval (MIR)
Query By Humming (Fraunhofer)
- Non-speech sound as an input pattern to search for other non-speech sounds
- http://www.musicline.de/de/melodiesuche/input
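One common textbook way to illustrate query by humming is to reduce both the hummed query and the stored melodies to a pitch contour and compare the contours with an edit distance. The sketch below follows that idea; it is not the Fraunhofer method, and the catalogue, the MIDI note values and all function names are hypothetical.

```python
def parsons_code(pitches):
    """Reduce a pitch sequence to its contour: U (up), D (down), R (repeat)."""
    code = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            code.append("U")
        elif cur < prev:
            code.append("D")
        else:
            code.append("R")
    return "".join(code)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution or match
        prev_row = row
    return prev_row[-1]


def best_match(hummed_pitches, catalogue):
    """Return the catalogue title whose contour is closest to the hummed one."""
    query = parsons_code(hummed_pitches)
    return min(catalogue,
               key=lambda title: edit_distance(query, parsons_code(catalogue[title])))


# Hypothetical toy catalogue: melodies as MIDI note numbers.
CATALOGUE = {
    "ode_to_joy": [64, 64, 65, 67, 67, 65, 64, 62],
    "twinkle_twinkle": [60, 60, 67, 67, 69, 69, 67],
}

# A hummed query in a different key still has the same contour.
hummed = [50, 50, 52, 53, 53, 52, 50, 48]
print(best_match(hummed, CATALOGUE))
```

Because only the direction of each pitch step is kept, the comparison is independent of the key the user hums in, at the cost of discarding rhythm and interval size.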
Performer Style Identification
Melody and Rhythm Extraction
Music Similarity
Genre Classification
Dialog Systems - ASR
3 Types of Recognition in state-of-the-art Dialog Systems
- Isolated word
- Recognition grammar
- Statistical Language Model (SLM) + grammar for more robustness
<rule id="commands">
  <item repeat="0-1"> move </item>
  <one-of>
    <item>forward</item>
    <item>backward</item>
  </one-of>
</rule>
<rule id="exit">
  <one-of>
    <item>exit</item>
    <item>quit</item>
  </one-of>
</rule>
"um ah to san francisco from new york"
1. Apply SLM
2. Apply grammar on results of SLM
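The two steps can be sketched as follows. Step 1 is only simulated: the SLM output is taken to be the word string from the slide, fillers included. Step 2 applies a small grammar to that string to pull out the parts that matter. The regular-expression rules, the slot names and the city list are assumptions made for illustration, not part of any particular recognizer.

```python
import re

# Hypothetical vocabulary for the grammar pass.
CITIES = ["san francisco", "new york", "vienna"]
CITY = "|".join(re.escape(city) for city in CITIES)

# One rule per slot; each rule ignores anything around it (fillers, hesitations).
SLOT_RULES = {
    "destination": re.compile(rf"\bto ({CITY})\b"),
    "origin": re.compile(rf"\bfrom ({CITY})\b"),
}


def apply_grammar(slm_hypothesis):
    """Step 2: run the grammar rules over the word string produced by the SLM."""
    slots = {}
    for slot, rule in SLOT_RULES.items():
        match = rule.search(slm_hypothesis)
        if match:
            slots[slot] = match.group(1)
    return slots


# Step 1 (simulated): the SLM pass returns a free-form word string.
hypothesis = "um ah to san francisco from new york"
print(apply_grammar(hypothesis))
# {'destination': 'san francisco', 'origin': 'new york'}
```

The robustness comes from the division of labor: the SLM tolerates fillers and word-order variation, and the grammar only has to find the meaningful fragments in its output.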
Dialog Systems – TTS and Audio
Loquendo TTS Mixer
- Play and mix TTS and audio files
- Fade-in, fade-out
- Pause and resume
- Record
Paolo Massimino (Loquendo S.p.A.): From Marked Text to Mixed Speech and Sound
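To make the mixing and fading operations concrete, here is a minimal sketch on plain lists of float samples in the range -1.0 to 1.0. It does not use or imitate the Loquendo API; the function names, the linear fade shape and the default gain are assumptions.

```python
def fade_in(samples, n):
    """Scale the first n samples linearly from silence up to full level."""
    n = min(n, len(samples))
    return [s * (i / n) if i < n else s for i, s in enumerate(samples)]


def fade_out(samples, n):
    """Scale the last n samples linearly down to silence."""
    n = min(n, len(samples))
    start = len(samples) - n
    last = len(samples) - 1
    return [s * ((last - i) / n) if i >= start else s
            for i, s in enumerate(samples)]


def mix(speech, audio, audio_gain=0.5):
    """Add an attenuated audio track to the speech track, clipping to [-1, 1]."""
    length = max(len(speech), len(audio))
    mixed = []
    for i in range(length):
        s = speech[i] if i < len(speech) else 0.0
        a = audio[i] if i < len(audio) else 0.0
        mixed.append(max(-1.0, min(1.0, s + audio_gain * a)))
    return mixed


# Example: a constant-level jingle faded at both ends, mixed under speech samples.
jingle = fade_out(fade_in([1.0] * 8, 2), 2)
speech = [0.25] * 4
print(mix(speech, jingle))
```

Keeping the non-speech track at a lower gain than the speech is one simple way to keep the prompt intelligible while the audio plays underneath.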
Dialog Management 1
Usages of non-speech audio
- Replace prompts
- Indicate dialog turns and dialog states
- Indicate menu structure (3D audio)
- Create listen & feel of the application
- System response time
Questions
- Barge-in, streaming and standardization
Dialog Management 2
A good bad example
- Uses only speech
- Audio enhancement for transitions
- Audio enhancement for states
Bob Cooper (Avaya Corporation): A Case Study on the planned and actual Use of Auditory Feedback and Audio Cues in the Realization of a Personal Virtual Assistant
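One way to picture audio enhancement for transitions and states is a dialog manager that looks up a short cue for each transition and a background sound for each state before it speaks the prompt. The sketch below is an illustration only and is not taken from the case study; the state names, file names and the `enter` function are hypothetical.

```python
# Hypothetical background sounds, one per dialog state.
STATE_SOUNDS = {
    "main_menu": "ambient_main.wav",
    "mailbox": "ambient_mail.wav",
}

# Hypothetical short cues, one per (source state, target state) transition.
TRANSITION_CUES = {
    ("main_menu", "mailbox"): "whoosh_in.wav",
    ("mailbox", "main_menu"): "whoosh_out.wav",
}


def enter(current, target):
    """Return the ordered audio events for moving from one state to another."""
    events = []
    cue = TRANSITION_CUES.get((current, target))
    if cue:
        events.append(("cue", cue))                 # marks the transition
    background = STATE_SOUNDS.get(target)
    if background:
        events.append(("background", background))   # identifies the state
    events.append(("prompt", f"You are now in {target.replace('_', ' ')}."))
    return events


for event in enter("main_menu", "mailbox"):
    print(event)
```

A speech-only system corresponds to the case where both lookup tables are empty and only the prompt event remains.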
Dialog Management 3
VoiceXML Version 2.0
- W3C (World Wide Web Consortium) standard for voice dialog design
- Form-filling paradigm similar to web forms
Speech Synthesis Markup Language (SSML) Version 1.0
<prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)"> good morning </prosody>
<voice gender="female"> Any female voice here.
<voice age="6"> A female child voice here. </voice> </voice>
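SSML 1.0 also defines an `audio` element, so a prompt can combine a non-speech sound with synthesized text in one document. The sketch below builds such a prompt with Python's standard `xml.etree.ElementTree`; the `build_prompt` helper, the file name `chime.wav` and the choice of `rate="slow"` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_prompt(text, earcon_url):
    """Build an SSML document that plays a sound file and then speaks the text."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": SSML_NS,
        "xml:lang": "en-US",
    })
    # Non-speech audio first (hypothetical earcon file).
    ET.SubElement(speak, "audio", {"src": earcon_url})
    # Then the spoken part, with a prosody setting applied.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow"})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


ssml = build_prompt("You have two new messages.", "chime.wav")
print(ssml)
```

Whether the sound and the speech can overlap rather than play in sequence depends on the synthesizer; plain SSML 1.0 renders the elements in document order.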
Limited Expressiveness of Speech 2
Limited expressiveness of human-machine voice dialog compared to a natural dialog
- Natural dialog is probably multimodal
- Role of non-speech sound in human communication
The Importance of Multimodality for Mobile Applications
Multimodal communication is perceived as natural
Disadvantages of unimodal interfaces for mobile devices
- Small displays
- No comfortable alphanumeric keyboards
- Visual access to the display is not always possible
Disadvantages cannot be overcome by increasing processor and memory capabilities
Multimodal Dialog Management
Speech Application Language Tags (http://www.saltforum.org)
Possible combination with non-speech audio at all states and transitions
Similar to (unimodal) dialog systems
Minhua Ma, Paul Mc Kevitt (University of Ulster): Lexical Semantics and Auditory Display in Virtual Storytelling
Asymmetric Multimodality
For multiparty applications
- Users select preferred modalities (e.g. speech, visual, music?)
- System is able to translate content from one modality to another
MONA (Mobile Multimodal Next Generation Applications)
- Multiuser quiz application
Input | Output preference = Speech | Output preference = Visual
Speech | Speech | Speech-to-Text
Text | Text-to-Speech | Text
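The table above can be read as a lookup from (input modality, output preference) to the conversion the system has to apply. A minimal sketch of that lookup follows; the function name, the lowercase modality labels and the use of `None` for "deliver unchanged" are assumptions, and the actual MONA implementation is not described in the slides.

```python
# Conversion needed for each (input modality, output preference) pair.
# None means the content is delivered in the modality it arrived in.
CONVERSIONS = {
    ("speech", "speech"): None,
    ("speech", "visual"): "speech-to-text",
    ("text", "speech"): "text-to-speech",
    ("text", "visual"): None,
}


def conversion_for(input_modality, output_preference):
    """Look up which conversion step, if any, a message has to go through."""
    key = (input_modality, output_preference)
    if key not in CONVERSIONS:
        raise ValueError(f"unsupported combination: {key}")
    return CONVERSIONS[key]


# One user speaks, another user prefers visual output.
print(conversion_for("speech", "visual"))  # speech-to-text
```

Adding a further modality, such as the music option the slide raises as a question, would mean adding rows to the table rather than changing the routing logic.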
Resources
TTS
- Festival 2.0, to build unit selection voices
- Festival Lite, for embedded TTS
- FreeTTS, a Java speech synthesizer
- The MBROLA project, many synthetic voices available
ASR
- Sphinx
- HTK
Multimodal Systems
- SALT implementations
Thank you for your attention
Contact: [email protected]
http://userver.ftw.at/~pucher