Creating User Interfaces [Continue presentations as needed] Speech recognition. Speech synthesis. Homework: Report on current products. Register on Tellme Studio. Study VoiceXML.

Post on 17-Jan-2016






  • Creating User Interfaces
    [Continue presentations as needed]
    Speech recognition. Speech synthesis.
    Homework: Report on current products. Register on Tellme Studio. Study VoiceXML.

  • Speech recognition
    User speaks. System 'understands', at least enough to perform some action.

    Related to (but not the same as):
    Natural language understanding
    Voice print identification
    Recording information to be replayed to a human, in compressed form, for later interaction
    Speech synthesis (other direction): words to speech

  • Natural language understanding
    Skip speech altogether, but type in statements or phrases in normal language
    What is normal? We tend not to speak that grammatically
    Many 'natural language systems' actually use keywords
    Historical: Moon rocks example
    Combine speech with natural language
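
The point about keywords can be illustrated with a minimal sketch. The keyword table and action names below are hypothetical, made up for illustration; the idea is only that a system can ignore grammar entirely and still pick a reasonable action from an ungrammatical utterance.

```python
# Hypothetical keyword table: action name -> words that trigger it.
KEYWORD_ACTIONS = {
    "weather": {"weather", "forecast", "rain", "temperature"},
    "traffic": {"traffic", "commute", "road", "roads"},
}

def spot_action(utterance):
    """Return the first action whose keywords appear in the utterance, else None."""
    # Lowercase each word and strip simple punctuation; no parsing, no grammar.
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    for action, keywords in KEYWORD_ACTIONS.items():
        if words & keywords:
            return action
    return None

print(spot_action("um, so, like, is it gonna rain today?"))  # prints weather
print(spot_action("Tell me a joke"))                         # prints None
```

Nothing here 'understands' the sentence, which is why such systems can seem clever on expected input and fail abruptly on anything else.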

  • Continuous versus discrete
    Continuous: speaker speaks 'naturally'
    versus
    Discrete: speaker separates words

  • Examples
    Dictation: no understanding as such; produces words/sentences in a program
    (Telephone) Help desk / Information: generally restricted or directed speech, choosing from alternatives (which may or may not be given). Advances the process
    [Restricted] commands: actually carrying out operations
        Factory example: start and stop
        Car: radio, heat/AC
        Phone: call specific number

  • Training
    Dictation application: user takes time to read a specific text to train the system
    Note: some systems also adapt with use. If and when the user corrects the results, the system may do better next time.
    Phone lookup: user records names. No 'understanding', just a recording for matching.

  • Audience & content
    Some systems may allow adapting to audiences, for example, male versus female
    Some systems have restrictions on types of content
    Historical note: IBM system in 1980s & 1990s was restricted to male, American-born speakers (no speech impediments) and legal text.

  • Speech recognition concepts
    Air pressure -> diaphragm in phone -> electrical signal -> (Fourier Transform) -> wave pattern
    matched against
    sets of canonical patterns (native speaker of English, perhaps male/female & young/old alternatives)
    generated for the specified grammar (using a segmentation = dividing up of the parts)
    Note: interplay of grammar and statistics distinguishes different approaches

  • Fourier Transform
    (Discrete Fourier Transform, usually computed with the Fast Fourier Transform, FFT)
    Takes data representing a signal

    And produces numbers representing the combination of sine and cosine waves that make up the signal
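
As a rough illustration of what the transform produces, here is a minimal sketch of the discrete Fourier transform written out directly (the slow O(n^2) form, not the FFT). The toy signal, its length, and its frequency are assumptions chosen for the example; a real recognizer works on short windows of sampled audio.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Toy signal: exactly 3 cycles of a sine wave across 32 samples.
N = 32
signal = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]

# The magnitude of each output number says how much of that frequency is present.
magnitudes = [abs(c) for c in dft(signal)]

# Only the first half is inspected; for real input the second half mirrors it.
peak_bin = max(range(N // 2), key=lambda k: magnitudes[k])
print(peak_bin)  # prints 3
```

The output is near zero everywhere except at bin 3, which is the single sine wave that makes up this signal. Speech produces many such components at once, and their pattern is what gets matched.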

  • Speech recognition
    Works on the product of the FFT
    Uses (in most cases)
        Segmentation: attempt to break up into pieces, perhaps syllables or words
        Grammar: definition of what is to be expected
        Probabilities: if the first part matched X, then there is a greater probability that the next would match Y
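
The role of probabilities can be sketched as follows. All numbers are invented for illustration, and real systems use far larger tables and more elaborate models; the sketch only shows how knowing the previous word can overturn the choice that the sound alone would suggest.

```python
# Hypothetical scores for how well each candidate matches the audio by itself.
acoustic_score = {"beach": 0.55, "speech": 0.45}

# Hypothetical table of P(next word | previous word).
transition = {"recognize": {"speech": 0.2, "beach": 0.01}}

def best_next_word(previous, acoustic, transitions, floor=0.001):
    """Pick the candidate maximizing acoustic score times transition probability."""
    following = transitions.get(previous, {})
    # Unseen word pairs get a small floor probability instead of zero.
    return max(acoustic, key=lambda w: acoustic[w] * following.get(w, floor))

# On sound alone, "beach" scores higher.
print(max(acoustic_score, key=acoustic_score.get))                 # prints beach
# After "recognize", the combined score favors "speech".
print(best_next_word("recognize", acoustic_score, transition))     # prints speech
```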

  • Current State of the Art
    General, no-restrictions speech recognition, good enough to act on the speech? Always about to happen?
    Dictation / substitute for keyboard: exists and satisfies many
        Is this the most important application for most users?
        May not be the killer app, but may be good for motivating research
    Homework: prepare a brief report on [a] current product or application. Can be one you use yourself.

  • Speech synthesis
    aka TTS (text to speech)
    Application determines that the computer needs to say certain words
    -> lexical units (syllables of words) -> phonemes -> pre-recorded (wav) files of phonemes

  • Speech synthesis
    This is again a segmentation process: need to divide up the words and then put them together so the speech sounds 'natural'. A particular phoneme may [need to] sound different in different contexts.
    Also need to deal with abbreviations & local accents
    Place names (important in travel & weather applications)
        Special case: detect and use a wav file for each name.
    Older methods were all synthesized. Similar to the distinction in music between all-synthesized sound and samples.
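
The abbreviation problem can be illustrated with a minimal text-normalization sketch, run before any sound is produced. The expansion table is hypothetical and deliberately tiny; it ignores context, which is exactly where real systems have trouble ("St." can be "Street" or "Saint").

```python
# Hypothetical expansion table, for illustration only.
ABBREVIATIONS = {"St.": "Street", "Ave.": "Avenue", "NY": "New York"}

def normalize(text):
    """Expand known abbreviations token by token before synthesis."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

print(normalize("Turn left on Main St. in Albany, NY"))
# prints Turn left on Main Street in Albany, New York
```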

  • Speech synthesis
    is essentially the computer reading out loud.
    Easy to do most things
    More and more difficult to do the complete job

    Different languages may be easier than English.
    People who are not monolingual please comment!

  • Restricted / directed speech applications
    We will use the Tellme Studio engine to create directed speech applications.
    These make use of
        Grammars
        Options to use numbers (buttons)
        Recorded (.wav) sounds
        Text to speech

  • studio.tellme.com
    Company that provides engine for applications
    Provides developing environment
    We are doing the Tellme version of VoiceXML, but it appears to be standard.
    Register as a developer:
        Provide your own id; you are assigned a PIN
        Put VoiceXML in the ScratchPad place (no audio files)
        1-800-555-VXML (8965)
        SAY id and then PIN, or can give phone number.
    Tellme runs either
        the program in ScratchPad OR
        the program at the Application URL, for projects with multiple files

    To look at someone else's project, you change your Application URL. This is called pointing your account to a new source.

  • XML
    Generalization of HTML
    XML documents have markup.
    An element is an opening tag indicating the type of element (possibly with attributes), content, and a closing tag.
    Document must be well-formed.
    Developers decide on element types.
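
What "well-formed" means in practice can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the element and attribute names in the sample strings are made up, since in XML the developer chooses them.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical element types, chosen by the developer.
good = '<course code="example"><topic>speech</topic></course>'
bad_nesting = '<course><topic>speech</course></topic>'   # tags closed in the wrong order
bad_quotes = '<course code=example></course>'            # attribute value not quoted

print(is_well_formed(good))         # prints True
print(is_well_formed(bad_nesting))  # prints False
print(is_well_formed(bad_quotes))   # prints False
```

A browser will often display sloppy HTML anyway; an XML processor, including a VoiceXML engine, is expected to reject a document with either of these errors.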

  • VoiceXML
    XML document (VXML header)
    This means proper nesting of elements, quotation marks on attributes
    VoiceXML has tags for flow-of-control and calculations.
    Also can use <script> for JavaScript
    Grammars come in different varieties. We will use the Tellme way. Grammars are included in CDATA sections to prevent XML interpretation.
    Many grammars are constructed for you. <field type="boolean"> will listen for yes or no. <field type="currency"> will listen for currency. [tag missing] for list

  • Very brief overview
    A <vxml> document contains <form> and/or menu elements.
    [tag missing] can contain [tag missing]; [tag missing] can contain [tag missing] or do its own audio; [tag missing] can contain [tags missing], etc.
    NOTE: certain types of <field> elements use built-in grammars, for example, boolean
    A <field> can have a <filled> child node that indicates what to do if there is a match
    [tag missing] is a compressed way to use a simple grammar

  • Very brief, cont.
    Logic can be done using a <script> element that contains a variant of JavaScript and/or VoiceXML logic elements, including

    [tags missing], other
    These may be part of a [tag missing] element

  • Audio
    Tellme Studio provides a way to record [your] speech as a wav file to upload to a website. It sends the file to your email address.
    You upload your VoiceXML file plus any wav files (and anything else)
    <audio src="mygreeting.wav">Welcome to my site</audio>
    If Tellme can't find the mygreeting.wav file, it uses its Text to Speech on the string "Welcome to my site". Note: you also can use a full URL: http://....

    You put the URL for the VoiceXML file into your Tellme Studio account. This is called pointing to the URL.
    TEST

  • VoiceXML basics, continued
    A <form> element can contain <field> elements, which can contain [tags missing], other [tag missing] which can contain

    <grammar> (if not one of the built-in grammars)

    [tag missing] tags can be at different levels (for example, document, block, or higher levels)
    <script> tags: elements for JavaScript (which can also appear in expressions)

  • VoiceXML basics: typical case
    a <form> element, made up of
    a <field>,
    <audio> with reference to recorded wav file and backup text,
    <grammar>, if NOT using built-in grammars designated by the type attribute of field. This is a CDATA section.
    <filled> with (follow-on) code using the field
    <nomatch> and <noinput> for the nomatch, noinput cases

  • Caution
    A form contains various elements, including a field. If a field has a grammar and the grammar is satisfied, control goes to a filled tag.

  • obligatory

    Hello, world

    recorded using Tellme Studio
    backup using TTS, just in case the src file is missing
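
The markup for this slide did not survive, so the following is only a sketch of what a minimal document of this kind might look like, not the original code. It assumes VoiceXML 2.0 element names (vxml, form, block, audio) and a hypothetical file name hello.wav; the Tellme dialect of the time may have differed in details. Python is used here just to hold the document and confirm that it is well-formed.

```python
import xml.etree.ElementTree as ET

# Assumed VoiceXML 2.0 structure; hello.wav is a hypothetical file name.
HELLO_VXML = """<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>
      <audio src="hello.wav">Hello, world</audio>
    </block>
  </form>
</vxml>"""

root = ET.fromstring(HELLO_VXML)          # raises ParseError if not well-formed
audio = root.find("./form/block/audio")

# src names the recording; the element's text is the TTS backup.
print(audio.get("src"), "->", audio.text)  # prints hello.wav -> Hello, world
```

The text inside the audio element is what the engine would read with text to speech if the recording named by src cannot be fetched.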

  • example
    Asks for number of credits and calculates when you/caller can register
    Uses built-in grammar for number
    No error recovery. You need to do better than this in your project.
    Unfortunate situation: there is an element type filled and an element type field.
    The < symbols are represented using &lt;

  • Hello there. How many credits have you earned?

    Sorry. I didn't get that.

  • You can register on the third day
    You can register on the second day
    You can register on the first day
    You can register on the fourth day
    Good bye.
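
The conditions that chose among these messages were lost, but the remark about writing < as &lt; can still be demonstrated. The sketch below assumes an if element with a cond attribute, as in VoiceXML 2.0; the threshold of 30 credits and its pairing with the fourth day are hypothetical, not the original values.

```python
import xml.etree.ElementTree as ET

# Hypothetical condition: the real thresholds are not known.
escaped = '<if cond="credits &lt; 30">You can register on the fourth day</if>'
unescaped = '<if cond="credits < 30">You can register on the fourth day</if>'

# The parser turns &lt; back into < before the condition is evaluated.
cond = ET.fromstring(escaped).get("cond")
print(cond)  # prints credits < 30

# A raw < inside an attribute value is not well-formed XML.
try:
    ET.fromstring(unescaped)
    parsed_unescaped = True
except ET.ParseError:
    parsed_unescaped = False
print(parsed_unescaped)  # prints False
```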

  • Homework
    Do research / think about your own experiences and come prepared to report on a speech recognition / speech synthesis application
    Start learning VoiceXML

    Phonetic method of reading: sounding out.
    Only touch on topic. The Tellme site has several tutorials and examples.
    The filled element is a possible element in the field element.
    Try it!