creating user interfaces [continue presentations as needed] speech recognition. speech synthesis...
Post on 17-Jan-2016
Creating User Interfaces
[Continue presentations as needed]
Speech recognition. Speech synthesis
Homework: Report on current products. Register on Tellme Studio. Study VoiceXML
Speech recognition
User speaks. System 'understands', at least enough to perform some action.
Related to (but not the same as)
Natural language understanding
Voice print identification
Record information to be re-played to human in compressed form for later interaction
Speech synthesis (other direction): words to speech
Natural language understanding
Skip speech altogether, but type in statements or phrases in normal language
What is normal? We tend not to speak that grammatically
Many 'natural language systems' actually use keywords
Historical: Moon rocks example
Combine speech with natural language
Continuous versus discrete
Speaker speaks 'naturally' versus
Speaker separates words
Examples
Dictation: no understanding as such, produce words/sentences in a program
(Telephone) Help desk / Information: generally restricted or directed speech, choosing from alternatives (may or may not be given). Advances the process
[Restricted] commands: actually carrying out operations
  Factory example: start and stop
  Car: radio, heat/AC
  Phone: call specific number
Training
Dictation application: user takes time to read specific text to train the system
  Note: some systems also adapt with use. If & when user corrects the results, system may do better next time.
Phone lookup: user records names. No 'understanding', just record for matching.
Audience & content
Some systems may allow adapting to audiences, for example, male versus female
Some systems have restrictions on types of content
Historical note: IBM system in 1980s & 1990s was restricted to male, American-born speakers (no speech impediments) and legal text.
Speech recognition concepts
Air pressure -> diaphragm in phone -> electrical signal -> (Fourier Transform) -> wave pattern
matched against
sets of canonical patterns (native speaker of English, perhaps male/female & young/old alternatives)
generated for the specified grammar (using a segmentation = dividing up of the parts)
Note: interplay of grammar and statistics distinguishes different approaches
Fourier Transform
(Discrete Fourier Transform, usually computed with the Fast Fourier Transform algorithm, FFT)
Takes data representing a signal
and produces numbers representing the combination of sine and cosine waves that make up the signal
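As a small illustration of the Fourier Transform slide, the sketch below computes a discrete Fourier transform directly from its definition and finds which sine wave dominates a signal. It is the slow textbook form, not the FFT algorithm itself, and the test signal (64 samples holding 5 cycles of a sine wave) and the helper names are made up for the example.

```python
import cmath
import math

def dft(samples):
    """Discrete Fourier transform of a list of real samples (direct O(n^2) form)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def dominant_bin(samples):
    """Index of the strongest frequency bin in the lower half of the spectrum, skipping bin 0."""
    spectrum = dft(samples)
    half = len(samples) // 2
    return max(range(1, half), key=lambda k: abs(spectrum[k]))

# Hypothetical input: 64 samples containing exactly 5 cycles of a sine wave.
N = 64
signal = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]

print(dominant_bin(signal))  # 5: the bin matching the 5 cycles in the signal
```

A real recognizer would run an FFT over short overlapping windows of the audio rather than one transform over the whole signal.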
Speech recognition
Works on the product of the FFT
Uses (in most cases)
  Segmentation: attempt to break up into pieces, perhaps syllables or words
  Grammar: definition of what is to be expected
  Probabilities: if first part matched X, then greater probability that the next would match Y
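The probabilities idea can be sketched in a few lines: combine how well each candidate word matches the sound with how likely that word is to follow the previous one. All numbers and words here are invented for illustration, and real systems use far larger statistical models.

```python
# Hypothetical acoustic scores: how well each candidate matches the audio segment.
acoustic = {"stop": 0.40, "shop": 0.45, "start": 0.15}

# Hypothetical transition probabilities P(next word | previous word).
follows = {
    "please": {"stop": 0.50, "start": 0.45, "shop": 0.05},
}

def best_next_word(previous, acoustic_scores, transitions):
    """Pick the candidate with the highest acoustic score times transition probability."""
    context = transitions.get(previous, {})
    return max(
        acoustic_scores,
        key=lambda word: acoustic_scores[word] * context.get(word, 0.0),
    )

# On sound alone "shop" scores highest, but after "please" the context favors "stop".
print(best_next_word("please", acoustic, follows))  # stop
```

In a factory command application like the start/stop example, the grammar would already rule out "shop", which is why restricted grammars make recognition easier.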
Current State of the Art
General, no restrictions, speech reco, good enough to act on the speech? Always about to happen?
Dictation / substitute for keyboard
  + exists and satisfies many
  Is this most important application for most users?
  May not be killer app, but may be good for motivating research
Homework: prepare brief report on [a] current product or application. Can be one you use yourself.
Speech synthesis
aka TTS (text to speech)
Application determines that the computer needs to say certain words -> lexical units (syllables of words) -> phonemes -> pre-recorded (wav) files of phonemes
Speech synthesis
This is again a segmentation process: need to divide up the words and then put together so speech sounds 'natural'.
A particular phoneme may [need to] sound different in different contexts.
Also need to deal with abbreviations & local accents
  Place names (important in travel & weather applications)
  Special case: detect and use wav file for each name.
Older methods were all synthesized
  Similar distinction between all-synthesized and sampled music
Speech synthesis
is essentially the computer reading out loud.
Easy to do most things
More and more difficult to do complete job
Different languages may be easier than English.
People who are not monolingual please comment!
Restricted / directed speech applications
We will use the Tellme Studio engine to create directed speech applications.
These make use of
  Grammars
  Options to use numbers (buttons)
  Recorded (.wav) sounds
  Text to speech
studio.tellme.com
Company that provides engine for applications
Provides developing environment
We are doing the Tellme version of VoiceXML, but it appears to be standard.
Register as a developer:
  Provide your own id; assigned a PIN
  Put VoiceXML in ScratchPad place (no audio files)
  1-800-555-VXML (8965)
  SAY id and then PIN or can give phone number.
Tellme runs either
  program in ScratchPad OR
  program at Application URL for projects with multiple files
To look at someone else's project, you change your Application URL. This is called pointing your account to a new source.
XML
Generalization of HTML
XML documents have markup.
Element: opening tag indicating type of element, possibly with attributes; content; closing tag.
Document must be well-formed.
Developers decide on element types.
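To make "well-formed" concrete, here is a minimal check using Python's standard XML parser. The two sample documents and the element names in them are invented; the point is only that tags must nest and close properly.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical documents: the element types are ours to choose, but tags must nest.
good = '<course code="hypothetical"><topic>speech</topic></course>'
bad = '<course><topic>speech</course></topic>'  # closing tags in the wrong order

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Running a check like this on a VoiceXML file before uploading can catch a missing closing tag early.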
Very brief overview
document contains and/or menu elements.
can contain , can contain or do its own audio can contain , , , etc.
NOTE: certain types of elements use built-in grammars, for example, boolean
Can have a child node that indicates what to do if there is a match
is a compressed way to use a simple grammar
, otherThese may be part of a element
Audio
Tellme Studio provides a way to record [your] speech as a wav file to upload to a website. Sends it to your email address
You upload your VoiceXML file plus any wav files (and anything else)
<audio src="mygreeting.wav">Welcome to my site</audio>
If Tellme can't find the mygreeting.wav file, it uses its Text to Speech on the string "Welcome to my site".
Note: you also can use a full URL: http://....
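The fallback rule for recorded audio can be mimicked in a few lines of Python. This only simulates the behavior described on the slide; the `render_audio` helper and the set of available files are hypothetical stand-ins, not part of Tellme or VoiceXML.

```python
import xml.etree.ElementTree as ET

def render_audio(element_text, available_files):
    """Simulate the fallback: play the wav file if found, else speak the backup text."""
    audio = ET.fromstring(element_text)
    src = audio.get("src")
    if src in available_files:
        return ("play", src)
    return ("tts", (audio.text or "").strip())

snippet = '<audio src="mygreeting.wav">Welcome to my site</audio>'

print(render_audio(snippet, {"mygreeting.wav"}))  # ('play', 'mygreeting.wav')
print(render_audio(snippet, set()))               # ('tts', 'Welcome to my site')
```

Always supplying backup text means a caller still hears something sensible if an upload goes missing.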
You put the URL for the VoiceXML file into your Tellme Studio account; this is called pointing to the URL.
TEST
VoiceXML basics, continued element can contain elements, which can contain , , other which can contain
(if not one of built-in grammars)
VoiceXML basics: typical case
a form element , made up of , with reference to recorded wav file and backup text, if NOT using built-in grammars designated by type attribute of field. This is a CDATA section. with (follow-on) code using field for nomatch, noinput cases
Caution
A form contains various elements, including a field. If a field has a grammar and the grammar is satisfied, control goes to a filled tag
recorded using Tellme Studio
backup using TTS, just in case src file missing
Example
Asks for number of credits and calculates when you/caller can register
Uses built-in grammar for number
No error recovery. You need to do better than this in your project.
Unfortunate situation: there is an element type filled and an element type field.
The < symbols are represented using &lt;
Hello there. How many credits have you earned?
Sorry. I didn't get that.
You can register on the third day
You can register on the second day
You can register on the first day
You can register on the fourth day
Good bye.
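The markup of the credits example did not survive in this copy, so the sketch below is only a guess at what a document of that shape could look like, wrapped in Python so it can be checked for well-formedness. The element layout follows standard VoiceXML 2.0 (form, field with the built-in number grammar, nomatch, filled, if/elseif/else); the credit thresholds 30, 60 and 90, the form id, and the order of the branches are assumptions, not the original values.

```python
import xml.etree.ElementTree as ET

# Illustrative reconstruction only: thresholds and ids are hypothetical.
VXML = """<vxml version="2.0">
  <form id="register">
    <field name="credits" type="number">
      <prompt>Hello there. How many credits have you earned?</prompt>
      <nomatch>Sorry. I didn't get that. <reprompt/></nomatch>
      <filled>
        <if cond="credits &lt; 30">
          <prompt>You can register on the fourth day</prompt>
        <elseif cond="credits &lt; 60"/>
          <prompt>You can register on the third day</prompt>
        <elseif cond="credits &lt; 90"/>
          <prompt>You can register on the second day</prompt>
        <else/>
          <prompt>You can register on the first day</prompt>
        </if>
        <prompt>Good bye.</prompt>
        <exit/>
      </filled>
    </field>
  </form>
</vxml>"""

root = ET.fromstring(VXML)  # raises ParseError if the document is not well-formed
field = root.find("./form/field")
conditions = [el.get("cond") for el in root.iter() if el.tag in ("if", "elseif")]

print(field.get("type"))  # number
print(conditions)         # the parser turns &lt; back into <
```

Note that filled sits inside field here, and that each comparison has to be written with `&lt;` because a bare `<` inside an attribute would make the document ill-formed.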
Homework
Do research / think about your own experiences and come prepared to report on a speech recognition / speech synthesis application
Start learning VoiceXML
Phonetic method of reading: sounding out.
Only touch on topic. The Tellme site has several tutorials and examples.
The filled element is a possible element in the field element.
Try it!