
Page 1

Voice Browsers

Making the Web accessible to more of us, more of the time.

SDBI, November 2001
Shani Shalgi

GeneralMagic Demo

Page 2

What is a Voice Browser?

• Expanding access to the Web
• Will allow any telephone to be used to access appropriately designed Web-based services
• Server-based
• Voice portals

Page 3

What is a Voice Browser?

• Interaction via keypads, spoken commands, listening to prerecorded speech, synthetic speech and music.
• An advantage to people with visual impairment
• Web access while keeping hands and eyes free for other things (e.g. driving).

Page 4

What is a Voice Browser?

• Mobile Web
• Naturalistic dialogs with Web-based services.

Page 5

Motivation

• Far more people today have access to a telephone than have access to a computer with an Internet connection.
• Many of us already have, or soon will have, a mobile phone within reach wherever we go.

Page 6

Motivation

• Easy to use, even for people with no knowledge of computers or with a fear of them.
• Voice interaction can escape the physical limitations of keypads and displays as mobile devices become ever smaller.

Page 7

Motivation

• Many companies offer services over the phone via menus traversed using the phone's keypad. Voice Browsers are the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet.

Page 8

Motivation

Disadvantages of existing methods:

• WAP (cellular phones, Palm Pilots)
    • Small screens
    • Access speed
    • Limited or fragmented availability
    • Awkward input
    • Price
    • Lack of user habit

Page 9

Differences Between Graphical & Voice Browsing

• Graphical browsing is more passive, due to the persistence of the visual information.
• Voice browsing is more active, since the user has to issue commands.
• Graphical Browsers are client-based, whereas Voice Browsers are server-based.

The leading role is turned over to the USER.

Page 10

Possible Applications

Accessing business information:

• The corporate "front desk" which asks callers who or what they want
• Automated telephone ordering services
• Support desks
• Order tracking
• Airline arrival and departure information
• Cinema and theater booking services
• Home banking services

Page 11

Possible Applications (2)

Accessing public information:

• Community information such as weather, traffic conditions, school closures, directions and events
• Local, national and international news
• National and international stock market information
• Business and e-commerce transactions

Page 12

Possible Applications (3)

Accessing personal information:

• Voice mail
• Calendars, address and telephone lists
• Personal horoscope
• Personal newsletter
• To-do lists, shopping lists, and calorie counters

Page 13

Advancing Towards Voice

• Until now, speech recognition and synthesis technologies had to be handcrafted into applications.
• Voice Browsers intend for the voice technologies to be handcrafted directly into web servers.
• This demands either transforming Web content into formats better suited to the needs of voice browsing, or authoring content directly for voice browsers.

Page 14

The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.

Page 15

W3C Speech Interface Framework

• VoiceXML
• Speech Synthesis
• Speech Recognition
    • DTMF Grammars
    • Speech Grammars
    • Stochastic (N-Gram) Language Models
    • Semantic Interpretation
• Pronunciation Lexicon
• Call Control
• Voice Browser Interoperation

Page 16

VoiceXML

• VoiceXML is a dialog markup language designed for telephony applications, where users are restricted to voice and DTMF (touch tone) input.

[Diagram labels: text.html, text.vxml, Web Server, Internet, Browser]

Page 17

Speech Synthesis

The specification defines a markup language for prompting users via a combination of prerecorded speech, synthetic speech and music. You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis. There is also provision for overriding the synthesis engine's default pronunciation.
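To give a feel for this kind of markup, here is a minimal Python sketch that assembles a small prompt document. The element and attribute names (`speak`, `audio`, `voice`, `prosody`, `emphasis`) follow the style of the W3C Speech Synthesis Markup Language and are assumptions here; the draft current when these slides were written may have used different names. The helper `build_prompt` and the file `welcome.wav` are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_prompt(text: str, emphasized: str) -> str:
    """Build an SSML-style prompt string (element names are assumptions)."""
    speak = ET.Element("speak")
    # Prerecorded audio played before the synthetic speech.
    ET.SubElement(speak, "audio", {"src": "welcome.wav"})
    # Voice characteristics: gender and age.
    voice = ET.SubElement(speak, "voice", {"gender": "female", "age": "30"})
    # Speed, volume and pitch for the enclosed text.
    prosody = ET.SubElement(
        voice, "prosody", {"rate": "slow", "volume": "loud", "pitch": "high"}
    )
    prosody.text = text + " "
    # The part of the prompt that should be stressed.
    emphasis = ET.SubElement(prosody, "emphasis")
    emphasis.text = emphasized
    return ET.tostring(speak, encoding="unicode")

print(build_prompt("Your flight departs at", "nine thirty"))
```

The sketch only produces the markup; rendering it as audio is the job of the synthesis engine behind the voice browser.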

Page 18

Speech Recognition

[Diagram labels: USER, Touch Tone, Speech, DTMF Grammars, Speech Grammars, Stochastic Language Models, Semantic Interpretation]

Page 19

DTMF Grammars

• Touch tone input is often used as an alternative to speech recognition.
• Especially useful in noisy conditions or when the social context makes it awkward to speak.
• The W3C DTMF grammar format allows authors to specify the expected sequence of digits, and to bind them to the appropriate results.
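The idea of binding digit sequences to results can be sketched in a few lines of Python. This is not the W3C grammar format itself, only an illustration of the mapping; the menu entries and the function `interpret_dtmf` are hypothetical.

```python
from typing import Optional

# Hypothetical menu: each expected digit sequence is bound to a result.
DTMF_GRAMMAR = {
    "1": "account_balance",
    "2": "recent_transactions",
    "0": "operator",
    "99": "main_menu",
}

def interpret_dtmf(digits: str) -> Optional[str]:
    """Return the result bound to a digit sequence, or None if nothing matches."""
    return DTMF_GRAMMAR.get(digits)

print(interpret_dtmf("2"))   # a key sequence the grammar expects
print(interpret_dtmf("7"))   # an unexpected key: no result
```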

Page 20

Speech Grammars

• In most cases, user prompts are very carefully designed to encourage the user to answer in a form that matches context-free grammar rules.
• Speech Grammars allow authors to specify rules covering the sequences of words that users are expected to say in particular contexts. These contextual clues allow the recognition engine to focus on likely utterances, improving the chances of a correct match.
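As a rough sketch of what such rules do, the Python below defines one hypothetical rule as a sequence of slots with alternatives and checks whether an utterance is covered. It is a deliberately simplified finite grammar, not the W3C grammar format and not a full context-free parser; the rule name `order` and its vocabulary are made up for illustration.

```python
from itertools import product

# Hypothetical rule: a sequence of slots, each slot listing its alternatives.
GRAMMAR = {
    "order": [
        ["i would like", "i want"],
        ["a"],
        ["small", "medium", "large"],
        ["coffee", "tea"],
    ],
}

def sentences(rule: str) -> set:
    """Expand a rule into every word sequence it covers."""
    return {" ".join(choice) for choice in product(*GRAMMAR[rule])}

def matches(utterance: str, rule: str = "order") -> bool:
    """True if the utterance is one of the sequences the rule covers."""
    return utterance.lower().strip() in sentences(rule)

print(matches("I want a large tea"))
print(matches("give me tea"))
```

Restricting the recognizer to the 12 sentences this rule covers is what lets it focus on likely utterances.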

Page 21

Stochastic (N-Gram) Language Models

• In some applications it is appropriate to use open-ended prompts ("How can I help?"). In these cases, context-free grammars are not useful.
• The solution is to use a stochastic language model. Such models specify the probability that one word occurs following certain others. The probabilities are computed from a collection of utterances collected from many users.
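A minimal sketch of the simplest such model, a bigram model, is shown below: it estimates the probability of each word given the word before it by counting over a small corpus. The three-utterance corpus and the function name `train_bigrams` are made up for illustration; a real model would be trained on far more data and would smooth the estimates.

```python
from collections import Counter, defaultdict

def train_bigrams(utterances):
    """Estimate P(next word | previous word) from a list of utterances."""
    counts = defaultdict(Counter)
    for utterance in utterances:
        # "<s>" marks the start of an utterance.
        words = ["<s>"] + utterance.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {w: c / sum(following.values()) for w, c in following.items()}
        for prev, following in counts.items()
    }

corpus = [
    "i want to check my balance",
    "i want to pay a bill",
    "i need to check my order",
]
model = train_bigrams(corpus)
print(model["i"])    # "want" follows "i" in two of three utterances
print(model["my"])   # "balance" and "order" are equally likely after "my"
```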

Page 22

Semantic Interpretation

• The recognition process matches an utterance to a speech grammar, building a parse tree as a byproduct.
• There are two approaches to harvesting semantic results from the parse tree:
    1. Annotating grammar rules with semantic interpretation tags (ECMAScript).
    2. Representing the result in XML.

Page 23

Semantic Interpretation - Example

For example (1st approach), the user utterance:

    "I would like a medium coca cola and a large pizza with pepperoni and mushrooms."

could be converted to the following semantic result:

    {
      drink: {
        beverage: "coke",
        drinksize: "medium"
      },
      pizza: {
        pizzasize: "large",
        topping: [ "pepperoni", "mushrooms" ]
      }
    }
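The mapping from utterance to semantic result can be imitated in Python. Note the swap: the sketch below uses simple keyword spotting with regular expressions in place of grammar rules annotated with ECMAScript tags, so it illustrates the input and output only, not the W3C mechanism. The vocabularies and the function `interpret` are hypothetical.

```python
import re

# Hypothetical vocabularies for the toy ordering domain.
SIZES = {"small", "medium", "large"}
BEVERAGES = {"coca cola": "coke", "coke": "coke", "lemonade": "lemonade"}
TOPPINGS = {"pepperoni", "mushrooms", "olives"}

def interpret(utterance: str) -> dict:
    """Turn an order utterance into a nested semantic result."""
    text = re.sub(r"[^a-z ]", "", utterance.lower())
    result = {}
    for phrase, beverage in BEVERAGES.items():
        found = re.search(r"(\w+) " + phrase, text)
        if found:
            result["drink"] = {"beverage": beverage}
            # The word just before the beverage may be its size.
            if found.group(1) in SIZES:
                result["drink"]["drinksize"] = found.group(1)
            break
    found = re.search(r"(\w+) pizza", text)
    if found:
        pizza = {}
        if found.group(1) in SIZES:
            pizza["pizzasize"] = found.group(1)
        # Toppings are any known topping words after "pizza".
        pizza["topping"] = [w for w in text[found.end():].split() if w in TOPPINGS]
        result["pizza"] = pizza
    return result

print(interpret("I would like a medium coca cola and a large pizza "
                "with pepperoni and mushrooms."))
```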

Page 24

Pronunciation Lexicon

• Application developers sometimes need the ability to tune speech engines, whether for synthesis or recognition.
• W3C is developing a markup language for an open, portable specification of pronunciation information using a standard phonetic alphabet.
• The most commonly needed pronunciations are for proper nouns such as surnames or business names.

Page 25

Call Control

• Fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform.
• Will enable application developers to use markup to perform call screening, whisper call waiting, call transfer, and more.
• Can be used to transfer a user from one voice browser to another on a completely different machine.

Page 26

Voice Browser Interoperation

• Mechanisms to transfer application state, such as a session identifier, along with the user's audio connections.
• The user could start with a visual interaction on a cell phone and follow a link to switch to a VoiceXML application.
• The ability to transfer a session identifier makes it possible for the Voice Browser application to pick up user preferences and other data entered into the visual application.

Page 27

Voice Browser Interoperation (2)

• Finally, the user could transfer from a VoiceXML application to a customer service agent.
• The agent needs the ability to use their console to view information about the customer, as collected during the preceding VoiceXML application. The ability to transfer a session identifier can be used to retrieve this information from the customer database.
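The role of the session identifier can be sketched as follows: each application stores what it learns under the identifier, and whichever application receives the identifier next can look the data up. Everything here is hypothetical, including the in-memory dictionary standing in for the customer database and the function names.

```python
import uuid

# Hypothetical in-memory stand-in for the shared customer database.
SESSIONS = {}

def start_session(preferences: dict) -> str:
    """Visual application: create a session and return its identifier."""
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = dict(preferences)
    return session_id

def record(session_id: str, key: str, value: str) -> None:
    """Voice application: add data collected during the dialog."""
    SESSIONS[session_id][key] = value

def resume_session(session_id: str) -> dict:
    """Agent console: retrieve everything collected so far."""
    return SESSIONS[session_id]

sid = start_session({"language": "en"})   # user starts in the visual application
record(sid, "account", "12345")           # then continues in the voice application
print(resume_session(sid))                # the agent sees both pieces of data
```

Only the identifier travels with the call; the data itself stays on the server side.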

Page 28

Voice Style Sheets?

• Some extensions are proposed to HTML 4.0 and CSS2 to support voice browsing.
• Prerecorded content is likely to include music and different speakers. These effects can be reproduced to some extent via the aural style sheets features in CSS2.

Page 29

Voice Style Sheets!

Authors want control over how the document is rendered. Aural style sheets (part of CSS2) provide a basis for controlling a range of features:

• Volume
• Rate
• Pitch
• Direction
• Spelling out text letter by letter
• Speech fonts (male/female, adult/child etc.)
• Inserted text before and after element content
• Sound effects and music

Page 30

How Does It Work?

• How do I connect?
• Do I speak to the browser or does the browser speak to me?
• What is seen on the screen?
• How do I enter input?

Page 31

Problems

• How does the browser understand what I say?
• How can I tell it what I want?
• …what if it doesn't understand?

Page 32

Overview of Speech Technologies

• Speech Synthesis
    • Text to Speech
• Speech Recognition
    • Speech Grammars
    • Stochastic n-gram models
• Semantic Interpretation

Page 33

What is Speech Synthesis?

• Generating machine voice by arranging phonemes (k, ch, sh, etc.) into words.
• There are several algorithms for performing Speech Synthesis. The choice depends on the task they're used for.

Page 34

How is Speech Synthesis Performed?

• The easiest way is to just record the voice of a person speaking the desired phrases.
    • This is useful if only a restricted volume of phrases and sentences is used, e.g. schedule information of incoming flights. The quality depends on the way recording is done.

Page 35

How is Speech Synthesis Performed?

• Another option is to record a large database of words.
    • Requires large memory storage
    • Limited vocabulary
    • No prosodic information
• Text-To-Speech algorithms are more sophisticated, but worse in quality.

Page 36

How is Speech Synthesis Performed?
Text To Speech

• Text-To-Speech algorithms split the speech into smaller pieces. The smaller the units, the fewer of them there are, but the quality also decreases.
• An often used unit is the phoneme, the smallest linguistic unit. Depending on the language used, there are about 35-50 phonemes in western European languages, i.e. we need only 35-50 single recordings.

    february twenty fifth: f eh b r ax r iy t w eh n t iy f ih f th
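The phoneme string in the example can be produced by a simple dictionary lookup, sketched below in Python. The split of the phoneme sequence into three words is an assumption read off the slide's example; the lexicon holds only those three words, and `to_phonemes` is a hypothetical name.

```python
# Toy lexicon; the per-word split of the slide's phoneme string is an assumption.
LEXICON = {
    "february": ["f", "eh", "b", "r", "ax", "r", "iy"],
    "twenty": ["t", "w", "eh", "n", "t", "iy"],
    "fifth": ["f", "ih", "f", "th"],
}

def to_phonemes(text: str) -> list:
    """Look each word up and concatenate the phonemes."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])  # KeyError for words outside the toy lexicon
    return phonemes

print(" ".join(to_phonemes("february twenty fifth")))
# prints: f eh b r ax r iy t w eh n t iy f ih f th
```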

Page 37

Text To Speech

• The problem is that combining phonemes into fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.
• A solution is using diphones. Instead of splitting at the transitions, the cut is done at the center of the phonemes, leaving the transitions themselves intact.

Page 38

Text To Speech

• This means there are now approximately 1600 recordings needed (40*40).
• The longer the units become, the more elements there are, but the quality increases along with the memory required.
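The 40*40 figure comes from counting every ordered pair of phonemes. The short Python sketch below shows both the pairing and the count. Padding the sequence with a silence marker (`sil`) at both ends is an assumption of this sketch, as are the function names.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding with silence at both ends."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

def inventory_size(n_phonemes: int) -> int:
    """Upper bound on distinct diphones: every ordered pair of phonemes."""
    return n_phonemes * n_phonemes

print(to_diphones(["f", "ih", "f", "th"]))
# prints: ['sil-f', 'f-ih', 'ih-f', 'f-th', 'th-sil']
print(inventory_size(40))
# prints: 1600
```

Each diphone spans one transition, which is why the joins between recordings fall in the steady middle of a phoneme rather than at the transition itself.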

Page 39

Text To Speech

• Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.
• TTS is dictionary-driven. The larger the dictionary resident in the browser is, the better the quality.
• For unknown words, TTS falls back on rules for regular pronunciation.
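The dictionary-first, rules-second lookup can be sketched as below. The two-word dictionary and the one-phoneme-per-letter rules are toy assumptions; real letter-to-sound rules are context-sensitive and far larger.

```python
# Toy pronunciation dictionary for irregular words (assumed entries).
DICTIONARY = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
}

# Hypothetical letter-to-sound rules: one phoneme per letter.
LETTER_RULES = {"a": "ae", "b": "b", "d": "d", "e": "eh", "i": "ih",
                "n": "n", "o": "aa", "p": "p", "s": "s", "t": "t"}

def pronounce(word: str) -> list:
    """Use the dictionary when possible, otherwise apply the regular rules."""
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    # Letters without a rule are passed through unchanged.
    return [LETTER_RULES.get(ch, ch) for ch in word]

print(pronounce("two"))   # dictionary entry
print(pronounce("bat"))   # not in the dictionary: rules apply
```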

Page 40

Text To Speech

• Vocabulary is unlimited!!!
• But what about the prosodic information?
• Pronunciation depends on the context in which a word occurs. Limited linguistic analysis is needed.

    How can I help?
    Help is on the way!

Page 41

Text To Speech

• Another example:

    I have read the first chapter.
    I will read some more after lunch.

• For these cases, and in the cases of irregular words and name pronunciation, authors need a way to provide supplementary TTS information and to indicate when it applies.
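The "read" example shows why a little linguistic analysis is needed. A very limited version of that analysis, assuming the only clue is the word immediately before "read", might look like this in Python. The rule and the phoneme spellings are illustrative assumptions.

```python
# Auxiliaries that signal the past participle (assumed, incomplete list).
PAST_MARKERS = {"have", "has", "had"}

def pronounce_read(sentence: str) -> str:
    """Pick a pronunciation of 'read' from the word that precedes it."""
    words = sentence.lower().rstrip(".").split()
    position = words.index("read")
    previous = words[position - 1] if position > 0 else ""
    return "r eh d" if previous in PAST_MARKERS else "r iy d"

print(pronounce_read("I have read the first chapter."))
# prints: r eh d
print(pronounce_read("I will read some more after lunch."))
# prints: r iy d
```

A rule this simple fails on sentences such as "I read it yesterday", which is exactly where author-supplied pronunciation information helps.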

Page 42

Text To Speech

• But specialized representations for phonemic and prosodic information can be off-putting for non-specialist users.
• For this reason it is common to see simplified ways to write down pronunciation. For instance, the word "station" can be defined as:

    station: stay-shun

Page 43

Text To Speech

• This approach encourages users to add pronunciation information, leading to an increase in the quality of spoken documents, compared to more complex and harder to learn approaches.
• This is where W3C comes in:
    • Providing a specification to enable consistent control (generating, authoring, processing) of voice output by speech synthesizers for varying speech content, for use in voice browsing and in other contexts.

Page 44

Overview of Speech Technologies

• Speech Synthesis
    • Text to Speech
• Speech Recognition
    • Speech Grammars
    • Stochastic n-gram models
• Semantic Interpretation

Pages 45-48

Speech Recognition

[Figures not preserved in the transcript]

Page 49

Speech Recognition

• Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.
• Speech is first digitized and then matched against a dictionary of coded waveforms. The matches are converted into text.
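The "match against a dictionary" step can be illustrated with a toy nearest-template search in Python. The three-number feature vectors stand in for coded waveforms and are invented for the example, as is the function `recognize`; real recognizers work on much richer features and statistical models.

```python
import math

# Hypothetical "coded waveforms": one short feature vector per word.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.4],
    "no": [0.2, 0.8, 0.5],
}

def recognize(features) -> str:
    """Return the word whose template is closest to the digitized input."""
    return min(TEMPLATES, key=lambda word: math.dist(features, TEMPLATES[word]))

print(recognize([0.8, 0.2, 0.4]))
# prints: yes
```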


Speech Recognition

Types of voice recognition applications:

• Command systems recognize a few hundred words and eliminate using the mouse or keyboard for repetitive commands.

• Discrete voice recognition systems are used for dictation, but require a pause between each word.

• Continuous voice recognition understands natural speech without pauses and is the most processing-intensive.


Speech Recognition

A speaker dependent system is developed to operate for a single speaker.

These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.


Speech Recognition

A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English).

These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker dependent systems. However, they are more flexible.


Speech Recognition

A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.


Speech Recognition

Speech recognition technologies today are highly advanced.

There is a huge gap between the ability to recognize speech and the ability to interpret speech.


How is Speech Recognition Performed?

Speech recognition technology involves complex statistical models that characterize the properties of sounds, taking into account factors such as male vs. female voices, accents, speaking rate, background noise, etc.

The process of speech recognition includes 5 stages:
1. Capture and digital sampling
2. Spectral representation and analysis
3. Segmentation
4. Phonetic modeling
5. Search and match


How is Speech Recognition Performed?

• Speech Grammars
• HMM (Hidden Markov Modelling)
• DTW (Dynamic Time Warping)
• NNs (Neural Networks)
• Expert systems
• Combinations of techniques

HMM-based systems are currently the most commonly used and most successful approach.
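Of the techniques listed, Dynamic Time Warping is the easiest to show in a few lines. The following is only a minimal sketch, assuming each utterance has already been reduced to a short list of numbers (real systems compare spectral feature vectors); the template values and the names dtw_distance and recognize are made up for illustration.

```python
def dtw_distance(a, b):
    """Return the dynamic-time-warping alignment cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either sequence may be stretched (repeat a frame) or both advance.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(sample, templates):
    """Pick the template word whose stored sequence warps onto the sample most cheaply."""
    return min(templates, key=lambda word: dtw_distance(sample, templates[word]))


# Hypothetical one-dimensional "templates"; the numbers are invented.
TEMPLATES = {"yes": [1, 3, 5, 3, 1], "no": [5, 4, 1, 1]}
```

A slowly spoken "yes" such as [1, 1, 3, 5, 5, 3, 1] still aligns with its template at zero cost, which is the point of warping the time axis.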


Speech Grammars

The grammar allows a speech application to indicate to a recognizer what it should listen for, specifically:
• Words that may be spoken,
• Patterns in which those words may occur,
• Language of the spoken words.


Speech Grammars

In simple speech recognition/speech understanding systems, the expected input sentences are often modeled by a strict grammar (such as a context-free grammar, CFG).

In this case, the user is only allowed to utter those sentences that are explicitly covered by the grammar.
• Good for menus, form filling, ordering services, etc.


Speech Grammars

Experience shows that a context-free grammar of reasonable complexity can never foresee all the different sentence patterns users come up with in spontaneous speech input.

This approach is therefore not sufficient for robust speech recognition/understanding tasks or free text input applications such as dictation.


For Example

Possible answers to a question may be "Yes" or "No", but they could also be any other word used for a positive or negative response: "Ya," "you betch'ya," "sure," "of course" and many other expressions. It is necessary to feed the speech recognition engine with likely utterances representing the desired response.
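One way to picture "feeding the engine with likely utterances" is a lookup from many surface forms to one canonical answer. This is a minimal sketch, not how any particular engine works: the affirmative phrases come from the slide, while "yeah" and every negative phrase other than "no" are assumed additions, and interpret_yes_no is a hypothetical helper name.

```python
# Phrases from the slide plus a few assumed extras (marked below).
AFFIRMATIVE = {"yes", "ya", "you betch'ya", "sure", "of course", "yeah"}  # "yeah" is assumed
NEGATIVE = {"no", "nope", "no way", "not really"}  # all but "no" are assumed


def interpret_yes_no(utterance):
    """Map a recognized utterance to "yes", "no", or None when it is not covered."""
    text = utterance.strip().lower().rstrip(".!?")
    if text in AFFIRMATIVE:
        return "yes"
    if text in NEGATIVE:
        return "no"
    return None  # the application would reprompt here
```

Anything outside the two sets returns None, which mirrors the limitation described earlier: the user can only say what the grammar anticipates.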


Speech Grammars

What is done?
• Beta and Pilot versions
• Upgrade versions


Speech Grammars - Example

<!-- the token "very" is optional -->

<item repeat="0-1">very</item>

<!-- the rule reference to digit can occur zero, one or many times -->

<item repeat="0-"> <ruleref uri="#digit"/> </item>

<!-- the rule reference to digit can occur one or more times -->

<item repeat="1-"> <ruleref uri="#digit"/> </item>

<!-- the rule reference to digit can occur four, five or six times -->

<item repeat="4-6"> <ruleref uri="#digit"/> </item>

<!-- the rule reference to digit can occur ten or more times -->

<item repeat="10-"> <ruleref uri="#digit"/> </item>
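The repeat values above all follow the same small pattern: "m-n" for a range, "m-" for an open-ended range, and a bare "n" for an exact count. The sketch below shows one way a tool might read them; parse_repeat and repeat_allows are hypothetical helper names, not part of any grammar library.

```python
import re


def parse_repeat(value):
    """Return (minimum, maximum) for a repeat attribute; maximum None means unbounded."""
    match = re.fullmatch(r"(\d+)(?:(-)(\d*))?", value.strip())
    if match is None:
        raise ValueError(f"not a repeat value: {value!r}")
    low = int(match.group(1))
    if match.group(2) is None:  # bare "n": exactly n times
        return low, low
    high = match.group(3)
    return low, (int(high) if high else None)


def repeat_allows(value, count):
    """Check whether `count` occurrences satisfy the repeat attribute."""
    low, high = parse_repeat(value)
    return count >= low and (high is None or count <= high)
```

For instance, repeat_allows("4-6", 5) holds while repeat_allows("4-6", 7) does not, matching the "four, five or six times" comment.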


Speech Grammars - Example

<!-- Examples of the following expansion -->
<!-- "pizza" -->
<!-- "big pizza with pepperoni" -->
<!-- "very big pizza with cheese and pepperoni" -->
<item repeat="0-1">
  <item repeat="0-1"> very </item>
  big
</item>
pizza
<item repeat="0-">
  <item repeat="0-1">
    <one-of>
      <item>with</item>
      <item>and</item>
    </one-of>
  </item>
  <ruleref uri="#topping"/>
</item>


Hidden Markov Model

Notations:
$T$ = observation sequence length
$O = \{o_1, o_2, \ldots, o_T\}$ = observation sequence
$N$ = number of states (we either know or guess)
$Q = \{q_1, \ldots, q_N\}$ = finite set of possible states
$M$ = number of possible observations
$V = \{v_1, v_2, \ldots, v_M\}$ = finite set of possible observations
$X_t$ = state at time $t$ (state variable)


Hidden Markov Model

Distributional parameters
$A = \{a_{ij}\}$ where $a_{ij} = P(X_{t+1} = q_j \mid X_t = q_i)$ (transition probabilities)
$B = \{b_i(k)\}$ where $b_i(k) = P(O_t = v_k \mid X_t = q_i)$ (observation probabilities)
$\pi_i = P(X_0 = q_i)$ (initial state distribution)


Hidden Markov Model

Definitions
A Hidden Markov Model (HMM) is a five-tuple $(Q, V, A, B, \pi)$.
Let $\lambda = \{A, B, \pi\}$ denote the parameters for a given HMM with fixed $Q$ and $V$.


Hidden Markov Model

Problems

1. Find $P(O \mid \lambda)$, the probability of the observations given the model.

2. Find the most likely state trajectory $X = \{x_1, x_2, \ldots, x_T\}$ given the model and observations. (Find $X$ so that $P(O, X \mid \lambda)$ is maximized.)

3. Adjust the $\lambda$ parameters to maximize $P(O \mid \lambda)$.
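Problems 1 and 2 have standard dynamic-programming solutions, usually called the forward algorithm and the Viterbi algorithm (the slides do not name them, so treat the names as background knowledge). The sketch below uses the slide's notation where it can; the two-state model and all of its probabilities are invented toy values, and the code starts the state sequence at the first observation.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Problem 1: P(O | lambda), summing over all state paths."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][o]
            for j in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Problem 2: the state path X maximizing P(O, X | lambda), with that probability."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for j in states:
            # Best predecessor for state j at this step.
            prob, path = max(
                ((best[i][0] * trans_p[i][j], best[i][1]) for i in states),
                key=lambda t: t[0],
            )
            new_best[j] = (prob * emit_p[j][o], path + [j])
        best = new_best
    prob, path = max(best.values(), key=lambda t: t[0])
    return path, prob


# Toy model: Q = {q1, q2}, V = {v1, v2, v3}; all numbers are made up.
STATES = ("q1", "q2")
START = {"q1": 0.6, "q2": 0.4}                       # pi
TRANS = {"q1": {"q1": 0.7, "q2": 0.3},               # A
         "q2": {"q1": 0.4, "q2": 0.6}}
EMIT = {"q1": {"v1": 0.5, "v2": 0.4, "v3": 0.1},     # B
        "q2": {"v1": 0.1, "v2": 0.3, "v3": 0.6}}
```

Problem 3 (re-estimating the parameters) is normally handled by an iterative procedure such as Baum-Welch and is left out of this sketch.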


Language Models

A language model is a probability distribution over word sequences.

• P("And nothing but the truth") ≈ 0.001

• P("And nuts sing on the roof") ≈ 0


The Equation

Notation: $W' = \arg\max_W P(O \mid W)\, P(W)$
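The equation is usually justified with Bayes' rule. The short derivation below assumes the usual reading of the symbols, which the surviving slide text does not spell out: $O$ is the acoustic observation sequence and $W$ a candidate word sequence.

```latex
\begin{align*}
W' &= \arg\max_W P(W \mid O) \\
   &= \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)} \\
   &= \arg\max_W P(O \mid W)\, P(W)
\end{align*}
```

The last step drops $P(O)$ because it does not depend on $W$. Under this reading $P(O \mid W)$ is the acoustic model score and $P(W)$ is the language model score.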


The N-Gram (Markovian) Language Model

Hard to compute P(W)

• P("And nothing but the truth")

Step 1: Decompose probability -
P("And nothing but the truth") =
  P("And")
  × P("nothing" | "and")
  × P("but" | "and nothing")
  × P("the" | "and nothing but")
  × P("truth" | "and nothing but the")


The Trigram Approximation

Assume each word depends only on the previous two words (three words total: tri means three, gram means writing).

P("the" | "… whole truth and nothing but") ≈ P("the" | "nothing but")
P("truth" | "… whole truth and nothing but the") ≈ P("truth" | "but the")


N-Gram - The Markovian Model

The Markovian state machine is an automaton with statistical weights.

A state represents a phoneme, diphone or word.

We do not include all options, but only those which are related to the context or subject.

We calculate all probable paths from the beginning to the end of the phrase/word and return the one with the maximum probability.


Back to Trigrams

How do we find the probabilities?
Get real text, and start counting!

• P("the" | "nothing but") ≈ Count("nothing but the") / Count("nothing but")
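The counting recipe is short enough to write out. This is a minimal sketch with no smoothing, so any unseen trigram gets probability 0; trigram_model is a hypothetical name and the three-phrase corpus is made up for the example.

```python
from collections import Counter


def trigram_model(tokens):
    """Return a function giving P(word | w1 w2) estimated by raw counts."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))

    def probability(word, w1, w2):
        # Count("w1 w2 word") / Count("w1 w2"), as on the slide.
        context = bigrams[(w1, w2)]
        return trigrams[(w1, w2, word)] / context if context else 0.0

    return probability


# Tiny invented corpus; a real model would be trained on large texts.
CORPUS = "nothing but the truth nothing but the best nothing but trouble".split()
p = trigram_model(CORPUS)
```

Here "nothing but" occurs three times and is followed by "the" twice, so p("the", "nothing", "but") is 2/3.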


N-grams

Why stop at 3-grams?
If P(z | …rstuvwxy) ≈ P(z | xy) is good, then P(z | …rstuvwxy) ≈ P(z | vwxy) is better!

4-gram, 5-gram start to become expensive...
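A rough way to see the expense is to count how many distinct n-grams could exist. The figures below assume a vocabulary of 20,000 words, a size chosen only for illustration.

```latex
\[
\#\{\text{possible } n\text{-grams}\} = |V|^{\,n}, \qquad |V| = 2 \times 10^{4}:
\quad
|V|^{3} = 8 \times 10^{12}, \quad
|V|^{4} = 1.6 \times 10^{17}, \quad
|V|^{5} = 3.2 \times 10^{21}.
\]
```

Each step up multiplies the number of events to estimate by the vocabulary size, so the training text needed to see them often enough grows just as quickly.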


The N-Gram (Markovian) Language Model - Summary

N-Gram language models are used in large vocabulary speech recognition systems to provide the recognizer with an a-priori likelihood P(W) of a given word sequence W.

The N-Gram language model is usually derived from large training texts that share the same language characteristics as expected input.


Combining Speech Grammars and N-Gram Models

• Using an N-Gram model in the recognizer and a CFG in a (separate) understanding component

• Integrating special N-Gram rules at various levels in a CFG to allow for flexible input in specific contexts

• Using a CFG to model the structure of phrases (e.g. numeric expressions) that are incorporated in a higher-level N-Gram model (class N-Grams)


Overview of Speech Technologies

Speech Synthesis
• Text to Speech

Speech Recognition
• Speech Grammars
• Stochastic n-gram models

Semantic Interpretation


Semantic Interpretation

We have recognized the phrases and words, what now?

Problems
• What does the user mean?
• We have the right keywords, but the phrase is meaningless or unclear.


Semantic Interpretation

As stated before, the technologies of speech recognition exceed those of interpretation.

Most interpreters are based on keywords.
• Sometimes this is not good enough!
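A toy keyword spotter makes the weakness concrete. Everything in this sketch is hypothetical: the banking keywords, the intent names and the interpret function are invented for illustration and do not describe any real interpreter.

```python
# Hypothetical keyword-to-intent table for a banking voice service.
KEYWORDS = {"balance": "check_balance", "transfer": "transfer_money"}


def interpret(utterance):
    """Return the intents whose keywords appear in the utterance, ignoring all other words."""
    words = utterance.lower().split()
    return sorted({KEYWORDS[w] for w in words if w in KEYWORDS})
```

interpret("what is my balance") gives ["check_balance"], which is fine, but interpret("do not transfer anything") gives ["transfer_money"]: the keyword is right and the meaning is the opposite.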


Back To Voice Browsers

Making the Web accessible to more of us, more of the time.

Personal Browser Demo

Now we'll talk about VoiceXML, navigation and various problems


VoiceXML - Example 1

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next.


This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.


VoiceXML - Example 2

Our second example asks the user for a choice of drink and then submits it to a server script:

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/grammar+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

A field is an input field. The user must provide a value for the field before proceeding to the next element in the form.


VoiceXML - Example 2

A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
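The prompt, no-match and reprompt cycle in this interaction can be imitated in a few lines. This is a simulation sketch, not a VoiceXML interpreter: run_field is a hypothetical function, the grammar is reduced to a set of allowed words, and the max_attempts limit is an assumption added so the loop always ends.

```python
def run_field(prompt, grammar, utterances, max_attempts=3):
    """Simulate filling one field: prompt, listen, and reprompt on a no-match."""
    transcript = []
    for _, heard in zip(range(max_attempts), utterances):
        transcript.append(("C", prompt))
        transcript.append(("H", heard))
        value = heard.strip().lower()
        if value in grammar:
            return value, transcript  # field filled; the form would now submit
        # Stand-in for the platform-specific default no-match message.
        transcript.append(("C", "I did not understand what you said."))
    return None, transcript


DRINKS = {"coffee", "tea", "milk", "nothing"}
PROMPT = "Would you like coffee, tea, milk, or nothing?"
```

Calling run_field(PROMPT, DRINKS, ["Orange juice.", "Tea"]) reproduces the five turns shown above and returns "tea" as the field value.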


VoiceXML - Architectural Model

[Diagram not preserved; the only surviving label is "Web Server".]

The VoiceXML interpreter context may listen for a special escape phrase that takes the user to a high-level personal assistant, or for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration).


Scope of VoiceXML

• Output of synthesized speech (TTS)
• Output of audio files
• Recognition of spoken input
• Recognition of DTMF input
• Recording of spoken input
• Control of dialog flow
• Telephony features such as call transfer and disconnect

The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Uniform Resource Identifiers (URIs).


VoiceXML

VoiceXML is intended to be analogous to graphical surfing.

There are limitations.
• Excellent for menu applications.
• Awkward for open dialog applications.

There are other languages: VoXML, omniviewXML


Navigation

The user might be able to speak the word "follow" when she hears a hypertext link she wishes to follow.

The user could also interrupt the browser to request a short list of the relevant links.


Navigation example

User: links?

Browser: The links are:
  1 company info
  2 latest news
  3 placing an order
  4 search for product details
  Please say the number now

User: 2

Browser: Retrieving latest news...
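The exchange above amounts to numbering a page's links and mapping a spoken number back to a target. Here is a minimal sketch of that idea; the captions come from the slide, while the file names and the functions list_links and follow are made up for illustration.

```python
# Captions from the slide; the target file names are hypothetical.
LINKS = [
    ("company info", "company.html"),
    ("latest news", "news.html"),
    ("placing an order", "order.html"),
    ("search for product details", "search.html"),
]


def list_links(links):
    """Build the text the browser would speak when the user says "links?"."""
    items = " ".join(f"{i} {caption}" for i, (caption, _) in enumerate(links, start=1))
    return f"The links are: {items} Please say the number now"


def follow(links, spoken):
    """Return the target for a spoken number, or None if it is not a listed choice."""
    try:
        index = int(spoken)
    except ValueError:
        return None
    if 1 <= index <= len(links):
        return links[index - 1][1]
    return None
```

follow(LINKS, "2") selects the news page, while an out-of-range answer such as "9" returns None so the browser can reprompt.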


Navigation through Headings

Another command could be used to request a list of the document's headings. This would allow users to browse an outline form of the document as a means to get to the section that interests them.


Navigation to Specific URLs

Graphical browsers allow the user to enter a desired URL in the browser window.

How is this supported in voice browsers?

Think: What problems do you anticipate?
• Will we be able to transfer from any voice portal to any other?
• How do we know where to go?


How Slow / Fast?

If voice browsers are meant to replace human operator dialog, they must respond quickly.

The speed of speech recognition, interpretation and synthesis depends on the implementation.

When a user requests a certain document, several related documents can be downloaded in advance for faster access.


Friendly vs. Annoying

How friendly do you want the service to be?

Friendly is sometimes time-consuming.

What percentage of the time does the user talk and what percentage of the time is he listening?

What parameters can I control?


Voice and Graphics

Can I access the Voice Browser through my computer?
• Some sites are authored only for voice.
• Some will be for both. This leads to more difficulties which must be dealt with.


Inserted text

When a hypertext link is spoken by a speech synthesizer, the author may wish to insert text before and after the link's caption, to guide the user's response.

For example:

<A href="driving.html">Driving instruction</A>

May be offered by the voice browser using the following words:

For driving instructions press 1


Inserted text

The words "For" and "press 1" were added to the text embedded in the anchor element.

At first glance it looks as if this 'wrapper' text should be left for the voice browser to generate, but on further examination you can easily find problems with this approach.


Inserted text

For example, the inserted text for the following element cannot be "For":

<A href="LeaveMessage.html">Leave us a message</A>

We need to say:

To leave us a message, press 5


Inserted text

The CSS2 draft specification includes the means to provide "generated text" before and after element content.

For example:

<A accesskey="5"
   style='cue-before: "To"; cue-after: ", press 5"'
   href="LeaveMessage.html">Leave us a message</A>


Handling Errors and Ambiguities

Users might easily enter unexpected or ambiguous input, or just pause, providing no input at all.

Some examples of errors which might generate events:
• When presented with a numbered list of links, the user enters a number that is outside the range presented.
• The phrase uttered by the user matches more than one template rule.


Handling Errors and Ambiguities

The phrase or sound uttered doesn't match a known command.

The user loses track and the browser needs to time out and offer assistance.

"Um"s and "Err"s

Authors will have control over the browser response to selection errors and timeouts.

Other errors might be dealt with by the browser or platform.
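
The error cases listed above can be sketched as a small classifier that turns raw user input into an event for the author's handlers. Everything here is hypothetical: the `InputEvent` names, the `classify_input` function, and the use of simple substring matching as a stand-in for real template rules are assumptions for illustration only.

```python
from enum import Enum


class InputEvent(Enum):
    SELECTED = "selected"
    TIMEOUT = "timeout"            # user paused and gave no input
    OUT_OF_RANGE = "out_of_range"  # number outside the presented list
    AMBIGUOUS = "ambiguous"        # phrase matches more than one link
    NO_MATCH = "no_match"          # "um", "err", coughs, unknown words


def classify_input(user_input, links):
    """Return (event, chosen_link_or_None) for a numbered list of links."""
    if user_input is None or not user_input.strip():
        return InputEvent.TIMEOUT, None
    text = user_input.strip().lower()
    if text.isdecimal():
        number = int(text)
        if 1 <= number <= len(links):
            return InputEvent.SELECTED, links[number - 1]
        return InputEvent.OUT_OF_RANGE, None
    # Assumption: substring matching stands in for template rules.
    matches = [link for link in links if text in link.lower()]
    if len(matches) == 1:
        return InputEvent.SELECTED, matches[0]
    if len(matches) > 1:
        return InputEvent.AMBIGUOUS, None
    return InputEvent.NO_MATCH, None


links = ["Check balance", "Check messages", "Leave us a message"]
for spoken in ["2", "7", "check", "um", None]:
    print(repr(spoken), classify_input(spoken, links))
```

In this sketch the selection and timeout events would go to author-defined handlers, while anything else could fall through to a browser or platform default.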


Some Nice Demos

Email assistant demo

Bank service demo (cough, ambiguity)

Financial Center Demo ("um"s)

Telectronics Demo


Who has implemented VoiceXML interpreters?

BeVocal Café

General Magic

HeyAnita's FreeSpeech Developer Network

IBM Voice Server SDK Beta Program, based on VoiceXML Version 1.0

Motorola's Mobile Application Development Toolkit (MADK)


Who has implemented VoiceXML interpreters?

Nuance Developer Network

Open VXI VoiceXML interpreter

PIPEBEACH's speechWeb

Telera's DeVXchange

Tellme Studio

VoiceGenie