
Creating User Interfaces

[Continue presentations as needed] Speech recognition. Speech synthesis

Homework: Report on current products. Register on Tellme Studio.

Study VoiceXML

Speech recognition

• User speaks. System 'understands', at least enough to perform some action.

• Related to (but not the same as)
  – Natural language understanding
  – Voice print identification
  – Recording information to be replayed to a human in compressed form for later interaction
  – Speech synthesis (other direction): words to speech
  – ?

Natural language understanding

• Skip speech altogether, but type in statements or phrases in normal language
  – What is normal? We tend not to speak that grammatically
  – Many 'natural language systems' actually use keywords

• History

• Moon rocks example

• Combine speech to natural language …

Continuous versus discrete

• Speaker speaks 'naturally' versus

• Speaker separates words

Examples

• Dictation: no understanding as such, produce words/sentences in a program

• (Telephone) Help desk / Information: generally restricted or directed speech, choosing from alternatives (may or may not be given). Advances the process

• [Restricted] commands: actually carrying out operations
  – Factory example: start and stop
  – Car: radio, heat/AC
  – Phone: call specific number

Training

• Dictation application: user takes time to read a specific text to train the system
  – Note: some systems also adapt with use. If and when the user corrects the results, the system may do better next time.

• Phone lookup: user records names. No 'understanding', just record for matching.

Audience & content

• Some systems may allow adapting to audiences, for example, male versus female

• Some systems have restrictions on types of content
  – Historical note: IBM system in 1980s & 1990s was restricted to male, American-born speakers (no speech impediments) and legal text.

Speech recognition concepts

• Air pressure → diaphragm in phone → electrical signal → (Fourier Transform) → wave pattern, matched against

• sets of canonical patterns (native speaker of English, perhaps male/female & young/old alternatives)

• generated for the specified grammar (using a segmentation = dividing up of the parts)

Note: interplay of grammar and statistics distinguishes different approaches

Fourier Transform (Discrete Fourier Transform, usually computed with the Fast Fourier Transform algorithm, FFT)

• Takes data representing a signal

• And produces numbers representing the combination of sine and cosine waves that make up the signal

Speech recognition

• Works on the product of the FFT

• Uses (in most cases)
  – Segmentation: attempt to break up into pieces, perhaps syllables or words
  – Grammar: definition of what is to be expected
  – Probabilities: if the first part matched X, then there is a greater probability that the next would match Y

Current State of the Art

• General, no restrictions, speech recognition, good enough to act on the speech? Always about to happen?

• Dictation / substitute for keyboard
  + exists and satisfies many
  – Is this the most important application for most users?
  – May not be the killer app, but may be good for motivating research

Homework: prepare brief report on [a] current product or application. Can be one you use yourself.

Speech synthesis

• aka TTS (text to speech)

• Application determines that the computer needs to say certain words

• lexical units (syllables of words) → phonemes → pre-recorded (wav) files of phonemes

Speech synthesis

• This is again a segmentation process: need to divide up the words and then put together so speech sounds 'natural'.
  – A particular phoneme may [need to] sound different in different contexts.
  – Also need to deal with abbreviations & local accents
  – Place names (important in travel & weather applications)
    • Special case: detect and use wav file for each name.

• Older methods were all synthesized
  – Similar to the distinction between fully synthesized and sampled music

Speech synthesis

is essentially ‘the computer’ reading ‘out loud’.

Easy to do most things

More and more difficult to do complete job

Different languages may be easier than English.

People who are not monolingual please comment!

Restricted / directed speech applications

• We will use the Tellme Studio engine to create directed speech applications.

• These make use of
  – Grammars
  – Options to use numbers (buttons)
  – Recorded (.wav) sounds
  – Text to speech

studio.tellme.com

• Company that provides the ‘engine’ for applications

• Provides a development environment
  – We are doing the Tellme version of VoiceXML, but it appears to be standard.

• Register as a developer:
  – Provide your own id; you are assigned a PIN
  – Put VoiceXML in the ScratchPad place (no audio files)

• 1-800-555-VXML (8965)
  – SAY id and then PIN, or can give phone number. Tellme runs either
    • program in ScratchPad OR
    • program at Application URL for projects with multiple files

• To look at someone else's project, you change your Application URL
  – called pointing your account to a new source.

XML

• Generalization of HTML

• XML documents have markup.
  – An element consists of an opening tag indicating the type of element (possibly with attributes), content, and a closing tag.

• Document must be well-formed.

• Developers decide on element types.
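To make these points concrete, here is a minimal sketch of a well-formed XML document. The element types (course, lecture, topic, homework) and the attribute names are invented for this illustration; they do not come from any standard vocabulary.

```xml
<?xml version="1.0"?>
<!-- Hypothetical element types, chosen by the developer -->
<course title="Creating User Interfaces">
  <lecture number="1">
    <topic>Speech recognition</topic>
    <topic>Speech synthesis</topic>
  </lecture>
  <!-- An element with no content can close itself -->
  <homework due="next class"/>
</course>
```

Well-formed here means: one root element, every opening tag has a matching closer, elements are properly nested, and attribute values are in quotation marks.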

VoiceXML

• XML document (VXML header)
  – This means proper nesting of elements, quotation marks on attributes

• VoiceXML has tags for flow-of-control and calculations.
  – Also can use <script> for JavaScript

• Grammars come in different varieties. We will use the Tellme way.
  – Grammars are enclosed in CDATA sections to prevent XML interpretation.
  – Many grammars are constructed for you.
    • <field name="answer" type="boolean"> … will listen for yes or no.
    • <field name="price" type="currency"> … will listen for currency.
  – <menu> <choice> <choice> for list
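A minimal sketch of a built-in grammar in use, assuming a VoiceXML 2.0 interpreter. The form id, field name, and prompt wording are made up for illustration, and Tellme-specific behavior may differ in details.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="confirm">
    <!-- type="boolean" selects the built-in yes/no grammar,
         so no <grammar> element is needed here -->
    <field name="answer" type="boolean">
      <prompt>Would you like to hear the opening hours? Say yes or no.</prompt>
      <filled>
        <!-- For a boolean field the variable holds true or false -->
        <if cond="answer">
          <prompt>We are open from nine to five.</prompt>
        <else/>
          <prompt>Okay. Good bye.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

The field name doubles as the variable name, which is why the condition can test answer directly.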

Very brief overview

• <vxml> document contains <form> and/or <menu> elements.
  – <form> can contain <block>, <field>
    • <block> can contain <audio> or do its own audio
    • <field> can contain <prompt>, <grammar>, <noinput>, etc.
      – NOTE: certain types of <field> elements use built-in grammars, for example, boolean
      – Can have a child node <filled> that indicates what to do if there is a match
  – <menu> is a compressed way to use a simple grammar
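A hedged sketch of the <menu> shorthand, assuming standard VoiceXML 2.0 behavior. The menu choices, form ids, and spoken text are hypothetical. Each <choice> contributes its text to a simple grammar, and the dtmf attribute lets the caller press a button instead of speaking.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <prompt>Say weather or sports, or press 1 or 2.</prompt>
    <!-- next="#id" jumps to a dialog in this same document -->
    <choice dtmf="1" next="#weather">weather</choice>
    <choice dtmf="2" next="#sports">sports</choice>
    <noinput>I did not hear you. <reprompt/></noinput>
    <nomatch>I did not understand. <reprompt/></nomatch>
  </menu>

  <form id="weather">
    <block>The weather report is not available yet. <exit/></block>
  </form>

  <form id="sports">
    <block>The sports report is not available yet. <exit/></block>
  </form>
</vxml>
```

The same interaction could be written as a <form> with a <field> and an explicit grammar; <menu> trades that flexibility for brevity.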

Very brief, cont.

• Logic can be done using a <script> element that contains a variant of JavaScript and/or

• vxml logic elements, including
  – <var>
  – <if>, <else>, <elseif>
  – other

• These may be part of a <filled> element
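The two styles of logic can be combined, as in this sketch, which assumes a VoiceXML 2.0 interpreter. The function isLarge, the variable total, the ticket price, and the prompts are all invented for the example.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <script>
    <![CDATA[
      // Hypothetical helper; CDATA keeps the > from being read as markup
      function isLarge(n) { return n > 10; }
    ]]>
  </script>

  <form id="order">
    <!-- Form-level variable, visible to everything inside the form -->
    <var name="total" expr="0"/>
    <field name="count" type="number">
      <prompt>How many tickets would you like?</prompt>
      <filled>
        <!-- Assumed price of 8 dollars per ticket -->
        <assign name="total" expr="count * 8"/>
        <if cond="isLarge(count)">
          <prompt>That is a large order.</prompt>
        <elseif cond="count == 1"/>
          <prompt>One ticket.</prompt>
        <else/>
          <prompt><value expr="count"/> tickets.</prompt>
        </if>
        <prompt>The total is <value expr="total"/> dollars.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Note that <elseif/> and <else/> are empty elements that divide the content of the enclosing <if>, rather than wrappers around their own branches.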

Audio

• Tellme Studio provides a way to record [your] speech as a wav file to upload to a website. It sends the file to your email address.

• You upload your VoiceXML file plus any wav files (and anything else)

    <audio src="mygreeting.wav">Welcome to my site </audio>

  If Tellme can't find the mygreeting.wav file, it uses its Text to Speech on the string "Welcome to my site".

  Note: you also can use a full URL: http://....

• You put the URL for the VoiceXML file into your Tellme Studio account, called pointing to the URL.

• TEST

VoiceXML basics, continued

• <form> element can contain
  – <block> elements, which can contain <audio>, <goto>, other
  – <field>, which can contain
    • <prompt>
    • <grammar> (if not one of built-in grammars)
    • <filled>

• <var> tags can be at different levels (for example, document, block, or higher levels)

• <if>, <elseif>, <else> tags

• <script> elements for JavaScript (which can also appear in expressions)

VoiceXML basics: typical case

• a form element
  – <field>
    • <prompt>, made up of <audio>, with reference to recorded wav file and backup text
    • <grammar>, if NOT using built-in grammars designated by type attribute of field. This is a CDATA section.
    • <filled> with (follow-on) code using field
    • <catch> for nomatch, noinput cases

Caution

A form contains various elements, including a field.

If a field has a grammar and the grammar is satisfied, control goes to a filled tag.

obligatory…

    <?xml version="1.0"?>
    <vxml version="2.0">
      <form>
        <block>
          <audio src="prompt1.wav">Hello, world </audio>
        </block>
      </form>
    </vxml>

prompt1.wav: recorded using Tellme Studio

"Hello, world": backup using TTS, just in case the src file is missing

example

• Asks for number of credits and calculates when you/caller can register

• uses built-in grammar for number

• No error recovery. You need to do better than this in your project.

• Unfortunate situation: there is an element type filled and an element type field. Do not confuse them.

• The < symbols inside attribute values are represented using &lt;

    <?xml version="1.0"?>
    <vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
      <form id="credit">
        <var name="rest" expr="1000"/>
        <field name="bcount" type="number">
          <prompt>
            <audio src="howmanycredits.wav">Hello there. How many credits have you earned? </audio>
          </prompt>
          <grammar type="application/x-gsl" mode="voice"><![CDATA[ NATURAL_NUMBER_THRU_999]]></grammar>
          <!-- No recovery: any noinput or nomatch ends the call -->
          <catch event="noinput nomatch">
            <audio src="sorry.wav">Sorry. I didn't get that.</audio>
            <exit/>
          </catch>
          <filled>
            <assign name="rest" expr="bcount"/>
            <!-- Read back the number that was recognized -->
            <audio> <value expr="rest" /> </audio>
            <!-- < must be written as &lt; inside an attribute value -->
            <if cond="rest &lt; 30">
              <audio src="homestretch.wav">You can register on the third day </audio>
            <elseif cond="rest &lt; 60" />
              <audio src="morethanhalf.wav">You can register on the second day </audio>
            <elseif cond="rest &lt; 90" />
              <audio src="goodstart.wav">You can register on the first day</audio>
            <else/>
              <audio>You can register on the fourth day </audio>
            </if>
            <audio src="goodbye.wav">Good bye. </audio>
          </filled>
        </field>
      </form>
    </vxml>
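One possible way to add error recovery to a field like the one above is sketched here, assuming standard VoiceXML 2.0 event handling. It relies only on the built-in number grammar and on text to speech, so the prompt wording is invented and no wav files or Tellme-specific grammar are used. Instead of exiting on the first problem, the handlers ask again with <reprompt/> and give up only after repeated failures.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="credit">
    <field name="bcount" type="number">
      <prompt>How many credits have you earned?</prompt>

      <!-- Early failures: explain the problem and ask again -->
      <noinput>
        Sorry, I did not hear anything.
        <reprompt/>
      </noinput>
      <nomatch>
        Sorry, I did not get that. Please say a number, such as forty five.
        <reprompt/>
      </nomatch>

      <!-- Third failure of the same kind: give up politely.
           Each event type keeps its own counter. -->
      <noinput count="3">
        I still cannot hear you. Good bye.
        <exit/>
      </noinput>
      <nomatch count="3">
        I am still having trouble understanding. Good bye.
        <exit/>
      </nomatch>

      <filled>
        <prompt>You said <value expr="bcount"/> credits.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The handler with the highest count that does not exceed the current event count is chosen, so the plain handlers cover the first two failures and the count="3" handlers take over after that.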

Homework

• Do research / think about your own experiences and come prepared to report on a speech recognition / speech synthesis application

• Start learning VoiceXML
