
Voice Browsers

Making the Web accessible to more of us, more of the time.

SDBI November 2001, Shani Shalgi

GeneralMagic Demo

What is a Voice Browser?

• Expanding access to the Web
• Will allow any telephone to be used to access appropriately designed Web-based services
• Server-based
• Voice portals


Interaction via key pads, spoken commands, listening to prerecorded speech, synthetic speech and music.

An advantage to people with visual impairment.

Web access while keeping hands and eyes free for other things (e.g. driving).


• Mobile Web
• Naturalistic dialogs with Web-based services.

Motivation

Far more people today have access to a telephone than have access to a computer with an Internet connection.

Many of us already have, or soon will have, a mobile phone within reach wherever we go.


Easy to use, even for people who have no knowledge of computers or who fear them.

Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.


Many companies offer services over the phone via menus traversed using the phone's keypad. Voice Browsers are the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet.


Disadvantages to existing methods:

• WAP (cellular phones, Palm Pilots)
  – Small screens
  – Access speed
  – Limited or fragmented availability
  – Awkward input
  – Price
  – Lack of user habit

Differences Between Graphical & Voice Browsing

Graphical browsing is more passive due to the persistence of the visual information.

Voice browsing is more active since the user has to issue commands.

Graphical browsers are client-based, whereas voice browsers are server-based.

The leading role is turned over to the USER.

Possible Applications

Accessing business information:
• The corporate "front desk" which asks callers who or what they want
• Automated telephone ordering services
• Support desks
• Order tracking
• Airline arrival and departure information
• Cinema and theater booking services
• Home banking services

Possible Applications (2)

Accessing public information:
• Community information such as weather, traffic conditions, school closures, directions and events
• Local, national and international news
• National and international stock market information
• Business and e-commerce transactions

Possible Applications (3)

Accessing personal information:
• Voice mail
• Calendars, address and telephone lists
• Personal horoscope
• Personal newsletter
• To-do lists, shopping lists, and calorie counters

Advancing Towards Voice

Until now, speech recognition and synthesis technologies had to be handcrafted into applications.

Voice browsers are intended to have the voice technologies built directly into web servers.

This demands either transforming Web content into formats better suited to the needs of voice browsing, or authoring content directly for voice browsers.

The World Wide Web Consortium (W3C) develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.

W3C Speech Interface Framework

• VoiceXML
• Speech Synthesis
• Speech Recognition
  – DTMF Grammars
  – Speech Grammars
  – Stochastic (N-Gram) Language Models
  – Semantic Interpretation
• Pronunciation Lexicon
• Call Control
• Voice Browser Interoperation

VoiceXML

VoiceXML is a dialog markup language designed for telephony applications, where users are restricted to voice and DTMF (touch tone) input.

[Figure labels: text.html, text.vxml, Web Server, Internet, Browser]

Speech Synthesis

The specification defines a markup language for prompting users via a combination of prerecorded speech, synthetic speech and music. You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis. There is also provision for overriding the synthesis engine's default pronunciation.

Speech Recognition

[Figure labels: USER, Touch Tone, Speech, DTMF Grammars, Speech Grammars, Stochastic Language Models, Semantic Interpretation]

DTMF Grammars

Touch tone input is often used as an alternative to speech recognition.

It is especially useful in noisy conditions or when the social context makes it awkward to speak.

The W3C DTMF grammar format allows authors to specify the expected sequence of digits, and to bind them to the appropriate results.
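The idea of binding expected digit sequences to results can be sketched in a few lines of Python. This is only an illustration, not the W3C DTMF grammar format itself: the menu entries, the `DTMF_RULES` table and the `interpret_dtmf` function are all hypothetical.

```python
import re

# Hypothetical menu: each entry pairs an expected digit pattern
# with the result it is bound to.
DTMF_RULES = [
    (re.compile(r"1"), "account balance"),
    (re.compile(r"2"), "recent transactions"),
    (re.compile(r"9\d{4}"), "extension"),  # 9 followed by a four-digit extension
]


def interpret_dtmf(digits):
    """Return the result bound to the first rule matching the whole key sequence."""
    for pattern, result in DTMF_RULES:
        if pattern.fullmatch(digits):
            return result
    return None  # no rule matched; a real dialog would re-prompt the caller
```

A call such as `interpret_dtmf("91234")` selects the rule for extensions, while an unexpected key sequence yields `None`.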

Speech Grammars

In most cases, user prompts are very carefully designed to encourage the user to answer in a form that matches context free grammar rules.

Speech Grammars allow authors to specify rules covering the sequences of words that users are expected to say in particular contexts. These contextual clues allow the recognition engine to focus on likely utterances, improving the chances of a correct match.

Stochastic (N-Gram) Language Models

In some applications it is appropriate to use open ended prompts ("How can I help?"). In these cases, context free grammars are not useful.

The solution is to use a stochastic language model. Such models specify the probability that one word occurs following certain others. The probabilities are computed from a collection of utterances collected from many users.

Semantic Interpretation

The recognition process matches an utterance to a speech grammar, building a parse tree as a byproduct.

There are two approaches to harvesting semantic results from the parse tree:

1. Annotating grammar rules with semantic interpretation tags (ECMAScript).

2. Representing the result in XML.

Semantic Interpretation - Example

For example (1st approach), the user utterance:

"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."

could be converted to the following semantic result:

    {
      drink: {
        beverage: "coke"
        drinksize: "medium"
      }
      pizza: {
        pizzasize: "large"
        topping: [ "pepperoni", "mushrooms" ]
      }
    }

Pronunciation Lexicon

Application developers sometimes need the ability to tune speech engines, whether for synthesis or recognition.

W3C is developing a markup language for an open portable specification of pronunciation information using a standard phonetic alphabet.

The most commonly needed pronunciations are for proper nouns such as surnames or business names.

Call Control

Fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform.

Will enable application developers to use markup to perform call screening, whisper call waiting, call transfer, and more.

Can be used to transfer a user from one voice browser to another on a completely different machine.

Voice Browser Interoperation

Mechanisms to transfer application state, such as a session identifier, along with the user's audio connections.

The user could start with a visual interaction on a cell phone and follow a link to switch to a VoiceXML application.

The ability to transfer a session identifier makes it possible for the Voice Browser application to pick up user preferences and other data entered into the visual application.

Voice Browser Interoperation (2)

Finally, the user could transfer from a VoiceXML application to a customer service agent.

The agent needs the ability to use their console to view information about the customer, as collected during the preceding VoiceXML application. The transferred session identifier can be used to retrieve this information from the customer database.

Voice Style Sheets?

Some extensions are proposed to HTML 4.0 and CSS2 to support voice browsing.

Prerecorded content is likely to include music and different speakers. These effects can be reproduced to some extent via the aural style sheets features in CSS2.

Voice Style Sheets!

Authors want control over how the document is rendered. Aural style sheets (part of CSS2) provide a basis for controlling a range of features:

• Volume
• Rate
• Pitch
• Direction
• Spelling out text letter by letter
• Speech fonts (male/female, adult/child etc.)
• Inserted text before and after element content
• Sound effects and music

How Does It Work?

• How do I connect?
• Do I speak to the browser or does the browser speak to me?
• What is seen on the screen?
• How do I enter input?

Problems

How does the browser understand what I say?

How can I tell it what I want?

…what if it doesn't understand?

Overview on Speech Technologies

• Speech Synthesis
  – Text to Speech
• Speech Recognition
  – Speech Grammars
  – Stochastic n-gram models

What is Speech Synthesis?

Generating machine voice by arranging phonemes (k, ch, sh, etc.) into words.

There are several algorithms for performing Speech Synthesis. The choice depends on the task they're used for.

How is Speech Synthesis Performed?

The easiest way is to just record the voice of a person speaking the desired phrases.
• This is useful if only a restricted volume of phrases and sentences is used, e.g. schedule information of incoming flights. The quality depends on the way recording is done.

How is Speech Synthesis Performed?

Another option is to record a large database of words.
• Requires large memory storage
• Limited vocabulary
• No prosodic information

Text-To-Speech algorithms are more sophisticated, but worse in quality.

How is Speech Synthesis Performed? Text To Speech

Text-To-Speech algorithms split the speech into smaller pieces. The smaller the units, the fewer of them there are, but the quality also decreases.

An often used unit is the phoneme, the smallest linguistic unit. Western European languages have about 35-50 phonemes, depending on the language, i.e. we need only 35-50 single recordings.

february twenty fifth: f eh b r ax r iy t w eh n t iy f ih f th

Text To Speech

The problem is that combining phonemes into fluent speech requires fluent transitions between the elements. The intelligibility is therefore lower, but the memory required is small.

A solution is using diphones. Instead of splitting at the transitions, the cut is done at the center of the phonemes, leaving the transitions themselves intact.


This means there are now approximately 1600 recordings needed (40*40).

The longer the units become, the more elements there are, but the quality increases along with the memory required.
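The 40*40 figure is the number of ordered phoneme pairs. A tiny Python sketch makes the growth explicit; the function name `unit_inventory` is hypothetical, and the counts are rough upper bounds, since not every phoneme pair occurs in a real language.

```python
def unit_inventory(num_phonemes):
    """Rough upper bounds on the number of recordings needed per unit type."""
    return {
        "phonemes": num_phonemes,
        # A diphone runs from the middle of one phoneme to the middle
        # of the next, so there is one per ordered pair of phonemes.
        "diphones": num_phonemes * num_phonemes,
    }
```

With 40 phonemes this gives 40 phoneme recordings versus 1600 diphone recordings.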


Other units which are widely used are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.

TTS is dictionary-driven. The larger the dictionary resident in the browser, the better the quality.

For unknown words, TTS falls back on rules for regular pronunciation.
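The dictionary-first lookup with a rule-based fallback can be sketched as follows. This is a minimal illustration under stated assumptions: the `LEXICON` entry, the simplified pronunciation notation, and the placeholder rule in `regular_pronunciation` (which merely spells the word out letter by letter) are all hypothetical and stand in for a real engine's data and letter-to-sound rules.

```python
# Hypothetical pronunciation dictionary using a simplified notation.
LEXICON = {
    "station": "stay-shun",
}


def regular_pronunciation(word):
    """Placeholder for letter-to-sound rules: spell the word out letter by letter."""
    return "-".join(word.lower())


def pronounce(word):
    """Look the word up in the dictionary, falling back on the rules if it is unknown."""
    entry = LEXICON.get(word.lower())
    if entry is not None:
        return entry
    return regular_pronunciation(word)
```

A larger `LEXICON` means fewer words reach the fallback, which is where the quality gain from a bigger dictionary comes from.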


Vocabulary is unlimited!!!

But what about the prosodic information?

Pronunciation depends on the context in which a word occurs. Limited linguistic analysis is needed.
• How can I help?
• Help is on the way!


Another example:
• I have read the first chapter.
• I will read some more after lunch.

For these cases, and in the cases of irregular words and name pronunciation, authors need a way to provide supplementary TTS information and to indicate when it applies.


But specialized representations for phonemic and prosodic information can be off-putting for non-specialist users.

For this reason it is common to see simplified ways to write down pronunciation. For instance, the word "station" can be defined as:

station: stay-shun

Text To Speech

This approach encourages users to add pronunciation information, leading to an increase in the quality of spoken documents, compared to more complex and harder to learn approaches.

This is where W3C comes in:

Providing a specification to enable consistent control (generating, authoring, processing) of voice output by speech synthesizers for varying speech content, for use in voice browsing and in other contexts.


• Speech Synthesis
  – Text to Speech
• Speech Recognition
  – Speech Grammars
  – Stochastic n-gram models


Automatic speech recognition is the process by which a computer maps an acoustic speech signal to text.

Speech is first digitized and then matched against a dictionary of coded waveforms. The matches are converted into text.


Types of voice recognition applications:

• Command systems recognize a few hundred words and eliminate the need to use the mouse or keyboard for repetitive commands.
• Discrete voice recognition systems are used for dictation, but require a pause between each word.
• Continuous voice recognition understands natural speech without pauses and is the most processing-intensive.


A speaker dependent system is developed to operate for a single speaker.

These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker adaptive or speaker independent systems.


A speaker independent system is developed to operate for any speaker of a particular type (e.g. American English).

These systems are the most difficult to develop and the most expensive, and their accuracy is lower than that of speaker dependent systems. However, they are more flexible.


A speaker adaptive system is developed to adapt its operation to the characteristics of new speakers. Its difficulty lies somewhere between speaker independent and speaker dependent systems.


Speech recognition technologies today are highly advanced.

There is a huge gap between the ability to recognize speech and the ability to interpret speech.

How is Speech Recognition Performed?

Speech recognition technology involves complex statistical models that characterize the properties of sounds, taking into account factors such as male vs. female voices, accents, speaking rate, background noise, etc.

The process of speech recognition includes 5 stages:
1. Capture and digital sampling
2. Spectral representation and analysis
3. Segmentation
4. Phonetic modeling
5. Search and match

How is Speech Recognition Performed?

• Speech Grammars
• HMM (Hidden Markov Modelling)
• DTW (Dynamic Time Warping)
• NNs (Neural Networks)
• Expert systems
• Combinations of techniques

HMM-based systems are currently the most commonly used and most successful approach.


The grammar allows a speech application to indicate to a recognizer what it should listen for, specifically:
• Words that may be spoken,
• Patterns in which those words may occur,
• Language of the spoken words.


In simple speech recognition/speech understanding systems, the expected input sentences are often modeled by a strict grammar (such as a CFG).

In this case, the user is only allowed to utter those sentences that are explicitly covered by the grammar.
• Good for menus, form filling, ordering services, etc.


Experience shows that a context free grammar of reasonable complexity can never foresee all the different sentence patterns users come up with in spontaneous speech input.

This approach is therefore not sufficient for robust speech recognition/understanding tasks or free text input applications such as dictation.

For Example

Possible answers to a question may be "Yes" or "No", but the answer could also be any other word used for a negative or positive response. It could be "Ya," "you betch'ya," "sure," "of course" and many other expressions. It is necessary to feed the speech recognition engine with likely utterances representing the desired response.
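Once the recognizer has produced text, mapping the many likely utterances onto the desired response can be as simple as a lookup. The sketch below is an assumption-laden illustration: the affirmative phrases come from the slide, while the negative phrases, the set names and the `classify_response` function are hypothetical.

```python
# Likely utterances for each desired response. The affirmative ones are
# taken from the slide; the negative ones are assumed examples.
AFFIRMATIVE = {"yes", "ya", "you betch'ya", "sure", "of course"}
NEGATIVE = {"no", "nope", "no way"}


def classify_response(utterance):
    """Map a recognized utterance to 'yes', 'no', or None if it is not covered."""
    text = utterance.strip().lower().rstrip(".!?").strip()
    if text in AFFIRMATIVE:
        return "yes"
    if text in NEGATIVE:
        return "no"
    return None  # not a listed utterance; the application would re-prompt
```

The `None` branch is where beta and pilot data would typically be used to extend the lists.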


What is done?
• Beta and pilot versions
• Upgrade versions

Speech Grammars - Example

<item repeat="0-1">very</item>

<item repeat="0-"> <ruleref uri="#digit"/> </item>

<item repeat="4-6"> <ruleref uri="#digit"/> </item>

Speech Grammars - Example

    <item repeat="0-1">
      <item repeat="0-1"> very </item>
      big
    </item>
    pizza
    <item repeat="0-">
      <item repeat="0-1">
        <one-of>
          <item>with</item>
          <item>and</item>
        </one-of>
      </item>
      <ruleref uri="#topping"/>
    </item>
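To see which utterances the pizza grammar accepts, its repeat counts can be translated into a regular expression. This is only a rough Python analogue, not how a recognizer processes a grammar: the contents of the `#topping` rule are not given on the slide, so the three toppings below are assumptions, as are the names `TOPPING`, `PIZZA` and `matches`.

```python
import re

# Hypothetical stand-in for the #topping rule, which the slide does not show.
TOPPING = r"(?:pepperoni|mushrooms|olives)"

PIZZA = re.compile(
    r"(?:(?:very )?big )?"                  # repeat="0-1": optional "[very] big"
    r"pizza"
    rf"(?: (?:(?:with|and) )?{TOPPING})*"   # repeat="0-": any number of "[with|and] topping"
)


def matches(utterance):
    """Return True if the whole utterance is covered by the grammar."""
    return PIZZA.fullmatch(utterance.lower()) is not None
```

For instance, "very big pizza with pepperoni and mushrooms" is accepted, while "very pizza" is rejected because "very" is only allowed in front of "big".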

Hidden Markov Model

Notations:
• $T$ = observation sequence length
• $O = \{o_1, o_2, \ldots, o_T\}$ = observation sequence
• $N$ = number of states (we either know or guess)
• $Q = \{q_1, \ldots, q_N\}$ = finite set of possible states
• $M$ = number of possible observations
• $V = \{v_1, v_2, \ldots, v_M\}$ = finite set of possible observations
• $X_t$ = state at time $t$ (state variable)


Distributional parameters

• $A = \{a_{ij}\}$ where $a_{ij} = P(X_{t+1} = q_j \mid X_t = q_i)$ (transition probabilities)
• $B = \{b_i(k)\}$ where $b_i(k) = P(O_t = v_k \mid X_t = q_i)$ (observation probabilities)
• $\pi = \{\pi_i\}$ where $\pi_i = P(X_0 = q_i)$ (initial state distribution)


Definitions

A Hidden Markov Model (HMM) is a five-tuple $(Q, V, A, B, \pi)$.

Let $\lambda = \{A, B, \pi\}$ denote the parameters for a given HMM with fixed $Q$ and $V$.


Problems

1. Find $P(O \mid \lambda)$, the probability of the observations given the model.
2. Find the most likely state trajectory $X = \{x_1, x_2, \ldots, x_T\}$ given the model and observations. (Find $X$ so that $P(O, X \mid \lambda)$ is maximized.)
3. Adjust the $\lambda$ parameters to maximize $P(O \mid \lambda)$.
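The first two problems have standard dynamic-programming solutions: the forward algorithm for problem 1 and the Viterbi algorithm for problem 2. The sketch below is a minimal illustration and is not taken from the slides. It assumes observations are given as integer indices into V, that the first observation is emitted from the initial state, and that the sequence is non-empty. The toy model values are invented.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def forward(obs: Sequence[int], A: Matrix, B: Matrix, pi: Sequence[float]) -> float:
    """Problem 1: P(O | lambda), summing over all state paths."""
    n = len(pi)
    # alpha[i] = probability of the observations so far, ending in state i
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    return sum(alpha)


def viterbi(obs: Sequence[int], A: Matrix, B: Matrix, pi: Sequence[float]) -> Tuple[List[int], float]:
    """Problem 2: the state path X maximizing P(O, X | lambda), with its probability."""
    n = len(pi)
    # delta[i] = probability of the best path so far that ends in state i
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backpointers: List[List[int]] = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backpointers.append(step)
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Invented two-state, two-symbol toy model (values are illustrative only).
A_TOY = [[0.7, 0.3], [0.4, 0.6]]
B_TOY = [[0.9, 0.1], [0.2, 0.8]]
PI_TOY = [0.6, 0.4]
OBS_TOY = [0, 1]
```

Problem 3 (re-estimating the parameters) is usually handled with the Baum-Welch procedure, which is not sketched here.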

Language Models

A language model is a probability distribution over word sequences

• P("And nothing but the truth") ≈ 0.001
• P("And nuts sing on the roof") ≈ 0

The Equation

Notation:
$W' = \arg\max_W P(O \mid W)\, P(W)$
where $O$ is the observation sequence and $W$ ranges over candidate word sequences.
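The equation can be read as "score every candidate word sequence by acoustic fit times language-model probability, and keep the best". A minimal sketch of that selection follows, working in log space. The acoustic scores are invented for illustration, and 1e-12 stands in for the "approximately zero" language-model probability of the second phrase.

```python
import math
from typing import Dict, Iterable


def decode(candidates: Iterable[str],
           acoustic_logprob: Dict[str, float],
           lm_prob: Dict[str, float]) -> str:
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    def score(w: str) -> float:
        p = lm_prob.get(w, 0.0)
        if p <= 0.0:
            return float("-inf")  # the language model rules this sequence out
        return acoustic_logprob[w] + math.log(p)
    return max(candidates, key=score)


# Language-model probabilities: 0.001 as on the earlier slide, 1e-12 as a stand-in for ~0.
LM_PROB = {
    "and nothing but the truth": 0.001,
    "and nuts sing on the roof": 1e-12,
}
# Hypothetical acoustic log-likelihoods: the two phrases sound almost alike.
ACOUSTIC_LOGPROB = {
    "and nothing but the truth": -42.0,
    "and nuts sing on the roof": -41.5,
}

best = decode(LM_PROB.keys(), ACOUSTIC_LOGPROB, LM_PROB)
```

With these numbers the second phrase fits the audio slightly better, yet the language model tips the decision toward "and nothing but the truth".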

The N-Gram (Markovian) Language Model

Hard to compute P(W)

• P("And nothing but the truth")

Step 1: Decompose probability -
P("And nothing but the truth") =
  P("And")
  × P("nothing" | "and")
  × P("but" | "and nothing")
  × P("the" | "and nothing but")
  × P("truth" | "and nothing but the")

The Trigram Approximation

Assume each word depends only on the previous two words (three words total; tri means three, gram means writing)

P("the" | "… whole truth and nothing but") ≈ P("the" | "nothing but")
P("truth" | "… whole truth and nothing but the") ≈ P("truth" | "but the")

N-Gram - The Markovian Model

The Markovian state machine is an automaton with statistical weights
A state represents a phoneme, diphone or word.
We do not include all options, but only those which are related to the context or subject.
We calculate all probable paths from beginning to end of phrase/word and return the one with the maximum probability.

Back to Trigrams

How do we find the probabilities?
Get real text, and start counting!

• P("the" | "nothing but") ≈ Count("nothing but the") / Count("nothing but")
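The counting recipe above translates almost directly into code. The sketch below estimates trigram probabilities from a tiny made-up corpus. Real systems train on far larger texts and add smoothing for unseen trigrams, which is left out here.

```python
from collections import Counter
from typing import List, Tuple


def train_trigrams(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count every trigram and every bigram in the token list."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return trigram_counts, bigram_counts


def trigram_prob(w1: str, w2: str, w3: str,
                 trigram_counts: Counter, bigram_counts: Counter) -> float:
    """Estimate P(w3 | w1 w2) as Count(w1 w2 w3) / Count(w1 w2)."""
    context = bigram_counts[(w1, w2)]
    if context == 0:
        return 0.0  # context never seen; a real model would back off or smooth
    return trigram_counts[(w1, w2, w3)] / context


# Made-up training text, for illustration only.
CORPUS = "the truth the whole truth and nothing but the truth and nothing but trouble".split()
TRI, BI = train_trigrams(CORPUS)

p_the = trigram_prob("nothing", "but", "the", TRI, BI)
p_truth = trigram_prob("but", "the", "truth", TRI, BI)
```

In this corpus "nothing but" occurs twice and is followed by "the" once, so p_the comes out as 0.5.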

The N-Gram (Markovian) Language Model - Summary

N-Gram language models are used in large vocabulary speech recognition systems to provide the recognizer with an a-priori likelihood P(W) of a given word sequence W.
The N-Gram language model is usually derived from large training texts that share the same language characteristics as expected input.

Combining Speech Grammars and N-Gram Models

Using an N-Gram model in the recognizer and a CFG in a (separate) understanding component
Integrating special N-Gram rules at various levels in a CFG to allow for flexible input in specific contexts
Using a CFG to model the structure of phrases (e.g. numeric expressions) that are incorporated in a higher-level N-Gram model (class N-Grams)


Speech Synthesis
• Text to Speech
Speech Recognition
• Speech Grammars
• Stochastic n-gram models


We have recognized the phrases and words, what now?

Problems
What does the user mean?
We have the right keywords, but the phrase is meaningless or unclear.


As stated before, the technologies of speech recognition exceed those of interpretation.
Most interpreters are based on key words.
• Sometimes this is not good enough!

Back To Voice Browsers

Making the Web accessible to more of us, more of the time.

Personal Browser Demo

Now we'll talk about VoiceXML, navigation and various problems

VoiceXML - Example 1

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next.

This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.


Our second example asks the user for a choice of drink and then submits it to a server script:

<?xml version="1.0"?>
<vxml version="2.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk, or nothing?</prompt>
      <grammar src="drink.grxml" type="application/grammar+xml"/>
    </field>
    <block>
      <submit next="http://www.drink.example.com/drink2.asp"/>
    </block>
  </form>
</vxml>

A field is an input field. The user must provide a value for the field before proceeding to the next element in the form.


A sample interaction is:

C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
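
The loop behind this interaction can be sketched in a few lines: prompt, listen, re-prompt on a mismatch, and move on once the field is filled. This is only a rough simulation, not how a VoiceXML interpreter is implemented. It assumes that drink.grxml accepts exactly the four words named in the prompt, and the function name is made up.

```python
from typing import List, Optional, Sequence, Tuple

PROMPT = "Would you like coffee, tea, milk, or nothing?"
NO_MATCH = "I did not understand what you said."


def run_drink_field(utterances: Sequence[str],
                    grammar: Sequence[str] = ("coffee", "tea", "milk", "nothing"),
                    ) -> Tuple[List[Tuple[str, str]], Optional[str]]:
    """Replay the field dialog; return the transcript and the filled value (or None)."""
    transcript: List[Tuple[str, str]] = []
    for heard in utterances:
        transcript.append(("C", PROMPT))
        transcript.append(("H", heard))
        value = heard.strip().lower().rstrip(".")
        if value in grammar:
            return transcript, value  # field filled: the form can go on to <submit>
        transcript.append(("C", NO_MATCH))
    return transcript, None  # ran out of input without filling the field


transcript, drink = run_drink_field(["Orange juice.", "Tea"])
```

Replaying the two human turns from the sample yields the same five lines of dialog and leaves the field set to "tea".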

VoiceXML - Architectural Model

[Architecture diagram; surviving label: Web Server]

The VoiceXML interpreter context may listen for a special escape phrase that takes the user to a high-level personal assistant, or for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration).

Scope of VoiceXML

Output of synthesized speech (TTS)
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.

The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).

VoiceXML

VoiceXML is intended to be analogous to graphical surfing.
There are limitations.
Excellent for menu applications.
Awkward for open dialog applications.
There are other languages: VoXML, omniviewXML

Navigation

The user might be able to speak the word "follow" when she hears a hypertext link she wishes to follow.
The user could also interrupt the browser to request a short list of the relevant links.

Navigation example

User: links?
Browser: The links are:
  1 company info
  2 latest news
  3 placing an order
  4 search for product details
  Please say the number now
User: 2
Browser: Retrieving latest news...
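The exchange above amounts to numbering the page's links, reading them out, and mapping the spoken number back to a link. A minimal sketch follows. The function names are illustrative, and how a real voice browser extracts link captions from a page is not covered here.

```python
from typing import List

LINKS = ["company info", "latest news", "placing an order", "search for product details"]


def list_links(links: List[str]) -> str:
    """Build the spoken answer to the 'links?' command."""
    numbered = " ".join(f"{i} {caption}" for i, caption in enumerate(links, start=1))
    return f"The links are: {numbered} Please say the number now"


def choose(links: List[str], spoken_number: str) -> str:
    """Map a spoken number to its link and announce the retrieval."""
    index = int(spoken_number) - 1
    if not 0 <= index < len(links):
        raise ValueError("number outside the range presented")
    return f"Retrieving {links[index]}..."


announcement = list_links(LINKS)
reply = choose(LINKS, "2")
```

Saying "2" maps to the second caption, which gives the browser's reply from the example.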

Navigation through Headings

Another command could be used to request a list of the document's headings. This would allow users to browse an outline form of the document as a means to get to the section that interests them.

Navigation to Specific URLs

Graphical browsers allow entering a desired URL in the browser window
How is this supported in Voice Browsers?
Think: What problems do you anticipate?
• Will we be able to transfer from any voice portal to any other?
• How do we know where to go?

How Slow / Fast?

If voice browsers are meant to replace human operator dialog, they must be fast in response.
The speed of speech recognition, interpretation and synthesis depends on the implementation.
When a user requests a certain document, several related documents can be downloaded for easier access.

Friendly vs. Annoying

How friendly do you want the service to be?
Friendly is sometimes time consuming.
What percentage of the time does the user talk and what percentage of the time is he listening?
What parameters can I control?

Voice and Graphics

Can I access the Voice Browser through my computer?
• Some sites are authored only for voice.
• Some will be for both. This leads to more difficulties which must be dealt with.

Inserted text

When a hypertext link is spoken by a speech synthesizer, the author may wish to insert text before and after the link's caption, to guide the user's response.

For example:

<A href="driving.html">Driving instructions</A>

may be offered by the voice browser using the following words:

For driving instructions press 1


The words "For" and "press 1" were added to the text embedded in the anchor element.

At first glance it looks as if this 'wrapper' text should be left for the voice browser to generate, but on further examination you can easily find problems with this approach.


For example, the text inserted before the following element cannot be "For":

<A href="LeaveMessage.html">Leave us a message</A>

We need to say: To leave us a message, press 5


The CSS2 draft specification includes the means to provide "generated text" before and after element content.

For example:

<A accesskey="5"
   style='cue-before: "To"; cue-after: ", press 5"'
   href="LeaveMessage.html">Leave us a message</A>
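To make the idea concrete, the sketch below assembles the spoken form of a link from its caption and the cue-before / cue-after values in the style attribute shown above. It is an illustration only: the regular expression handles just the simple double-quoted form used on this slide, and the function name is made up.

```python
import re

# Matches cue-before: "..." and cue-after: "..." in the simple form used on the slide.
CUE_RE = re.compile(r'(cue-before|cue-after)\s*:\s*"([^"]*)"')


def spoken_link(caption: str, style: str) -> str:
    """Wrap the link caption with the author-supplied cue text."""
    cues = dict(CUE_RE.findall(style))
    before = cues.get("cue-before", "")
    after = cues.get("cue-after", "")
    prefix = before + " " if before else ""
    return prefix + caption + after


STYLE = 'cue-before: "To"; cue-after: ", press 5"'
phrase = spoken_link("Leave us a message", STYLE)
```

The result is "To Leave us a message, press 5". The capital "L" makes no difference once the phrase is spoken.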

Handling Errors and Ambiguities

Users might easily enter unexpected or ambiguous input, or just pause, providing no input at all.

Some examples of errors which might generate events:
• When presented with a numbered list of links, the user enters a number that is outside the range presented.
• The phrase uttered by the user matches more than one template rule.

Handling Errors and Ambiguities

• The phrase/sound uttered doesn't match a known command.
• The user loses track and the browser needs to time out and offer assistance.
• "Um"s and "Err"s

Authors will have control over the browser response to selection errors and timeouts.
Other errors might be dealt with by the browser or platform.
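The error cases listed on these two slides can be pictured as a small classifier that turns each user turn into an event for the author or the platform to handle. The sketch below is hypothetical: the event names, the word-subset matching rule, and the sample commands are all assumptions made for illustration.

```python
from typing import Optional, Sequence

FILLERS = {"um", "err"}


def classify_input(spoken: Optional[str], commands: Sequence[str], link_count: int) -> str:
    """Map one user turn to an (illustrative) event name."""
    if spoken is None or not spoken.strip():
        return "timeout"          # the user paused or lost track
    text = spoken.strip().lower()
    if text in FILLERS:
        return "filler"           # "um"s and "err"s
    if text.isdigit():
        return "select" if 1 <= int(text) <= link_count else "out-of-range"
    # Assumed matching rule: every spoken word must appear in the command phrase.
    words = set(text.split())
    matches = [c for c in commands if words <= set(c.split())]
    if len(matches) > 1:
        return "ambiguous"        # more than one template rule matched
    if len(matches) == 1:
        return "match"
    return "nomatch"              # no known command matched


# Hypothetical command phrases.
COMMANDS = ["latest news", "news archive", "company info"]
```

Under these assumptions, saying "news" is reported as ambiguous because it fits both "latest news" and "news archive", while "7" against a four-item list is out of range.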

Some Nice Demos

Email assistant demo
Bank service demo (cough, ambiguity)
Financial Center Demo ("um"s)
Telectronics Demo

Who has implemented VoiceXML interpreters?

BeVocal Café
General Magic
HeyAnita's FreeSpeech Developer Network
IBM Voice Server SDK Beta Program based on VoiceXML Version 1.0
Motorola's Mobile Application Development Toolkit (MADK)

Who has implemented VoiceXML interpreters?

Nuance Developer Network
Open VXI VoiceXML interpreter
PIPEBEACH's speechWeb
Telera's DeVXchange
Tellme Studio
VoiceGenie
