
VoiceXML Web Browser Dissertation

2003/2004

Pierre Naquin

Document Name: VoiceXML Web Browser Dissertation
Author Name: Pierre Naquin
Author Student Number: 03120748
Author Course: MSc in Computer Science
Document Status: Final
Submission Date: 10/09/2004

Acknowledgments

I dedicate this piece of work to my parents and all the people who believed in me, to my friends, starting with the greatest of all, Yannis, and to my previous and coming loves.

I would like to thank all my friends for their support, the two “girls” in their Postgraduate Administration Room who helped me so much this year, my teachers of this year and especially Chris Cox and Mark Green, who both helped me a lot with this @°#~%* dissertation!

I sincerely hope all of you will be proud of my work.

For my “Mamy du train”, who took such great care of me these last five years.

I’m conscious that the style of this dissertation is not the well-known, well-tested, standard, approved, classic dissertation style. I apologise in advance for this, but it is much easier for me to write in this style (I know that the easy path is not always the best one but...) and I tried to make a dissertation that is enjoyable to read. I hope I will succeed!

Abstract

The Internet has become more and more accessible to people during the last three or four years. It is an incredible source of cultural and commercial profit. It has been a huge success, and this success is not likely to stop for some time.

Some very smart people known as the W3C (World Wide Web Consortium) have been working very hard to make the web (a common word for almost everything related to the Internet) easier to use for users, web developers and companies by providing standard solutions for almost any interesting feature of the web.

One of these standard technologies is VoiceXML. This technology (actually these technologies, because VoiceXML is only one piece of the voice languages castle) was mainly designed to run human/computer dialogs on telephone servers in order to make these dialogs more efficient, less boring and cheaper to create.

But a technology becomes interesting when you try to push it to its limits, to do something a little bit extreme...

In this dissertation, we will look at the creation of a Voice Web Browser using the VoiceXML technology. This browser will give the user access to any HTML data and transform it into voice.

We will see all the power of VoiceXML, its limitations, and the problems we face when dealing with very new technologies.

We will take special care to make the system understandable, usable, maintainable and configurable for both the user and the administrator of the system.

We will use Perl, a fantastic but very bloodcurdling programming language, to dynamically provide the user with full (well, almost; I hate having to say “almost”!) access to the power of the Internet directly in his/her ear!

Finally we will be very proud of ourselves but will have to admit that nothing is perfect in this world and that some work could and should be done to make our VoiceXML Web Browser totally perfect.

Table of contents

Introduction

Literature Review ... and thoughts
    Introduction
    Technical Materials
        HTML
        VoiceXML
        Speech Recognition Grammar Specification: SRGS
        Semantic Interpretation for Speech Recognition: SISR
        Speech Synthesis Markup Language: SSML
        ECMAScript
        Perl
        Regular Expressions
    Voice Interfaces
    Building Browsers
    Conclusion

Methodology
    Introduction
    The System Functionalities
    Our lovable languages!
        VoiceXML
        Speech Recognition Grammar Specification: SRGS
        Semantic Interpretation for Speech Recognition: SISR
        Speech Synthesis Markup Language: SSML
        ECMAScript
        Who is doing what?
    How does it work?
        System process
        A server side browser...
        Design: Main Loop
        Design: Downloading HTML content from a source web browser
        Design: Making a zip of all the page
        Design: HTML tag processing
        Design: Sending the output to the user
        What action for what tag?

Testing
    Introduction
    Inputs and Outputs
    “Well, but I want more explanations!”
        Initialisation (VoiceXML)
        Initialisation (SRGS)
        Switch between windows
        Listing favourites
        Opening a favourite
        Opening an URL
        Sending links and zips
        Transforming the tag
        Transforming the tag
        Transforming the tag
        Processing a listing: and tags
        Dealing with forms: the tag
        Dealing with forms: the tag
        Dealing with forms: the tag
        Dealing with forms: the
        Transforming the tag
        Cows do eat other cows! : Transforming the tag
    Conclusion

Guide
    Introduction
    Login in to the system
    Logout off the system
    While browsing a page
        Accessing another URL
        Managing windows
        Favourites
        Sending link to page
        Sending copy of the page
        Following links
        Forms
        Images
    Conclusion

Configuration
    Introduction
    What to find where?
        In the cgi-bin directory
        In the dictionaries directory
        In the user1 directory
    What can I configure if I am an Administrator
        Changing email configuration
        Changing the path of configuration files
        Managing global dictionaries
        Managing users
        Advanced modifications
    What can I configure if I am a user
        Configuration of the user’s information
        Managing your personal dictionary
        Managing your favourites
        The big part: Changing the way the system behave tag by tag
    A word about modularity
    Conclusion

Conclusion

Recommendations
    Introduction
    Now it is your turn to work!
        Testing in real case
        A graphical user-interface for configuration
        A more efficient tag auto-closing system
        Skipping the content of a tag
        Processing tags
        Titling windows
        Skipping advertising images
        A better processing
        for everyone
        Information on demand
        Help messages
        Recursive numbering for listing tags
        The “flat mode” for lists
        RSS feeds
        Favourites as web searchers
        Configuring favourites for “good information”

Code (in Appendix)


Introduction


The world is becoming smaller with each passing day thanks to the advent of the Internet and modern means of communication. A person living in one corner of the globe can now interact with a person living in the opposite corner with nothing more than a mouse and a few clicks. The Internet has become a popular channel of communication and is used not only by ten-year-old kids but also by elderly people. It is a storehouse of rich data and materials of every kind, flavour and variety. The World Wide Web (WWW) is a useful bank where you can find a huge amount of resources and information.

But despite the popularity of Internet technology and its increasing accessibility throughout the world, we cannot deny that we still use the telephone more. The telephone is a great device and the greatest way for people to interact with each other. The reason lies in its ease of use (just pick up the device and talk) and in the age-old tradition of communicating by voice. Besides that, the increasing popularity of mobile phones will make this tradition last. To access the Internet, one has to sit in front of a computer; to use the telephone, be it a landline or a mobile device, one does not need any kind of connection with a computer. With a mobile phone, one does not even have to be at home or in the office. The two communication channels are different and should not really be compared; the purpose of this “comparison” is to set the rationale and theme for this dissertation.

This dissertation is a conscious effort towards the development of a voice-based browser with which the Internet can be made (more) accessible to a huge mass of people. My system aims to let people use an ordinary telephone to browse the Internet, be it from home, the office, outside or anywhere a telephone call can be made. This would be a great evolution and advancement for Internet technology, enabling more people to access the Internet without even having to learn how to use a computer.

There are other advantages too. For instance, when one watches television or surfs the Internet, one takes an active part in it and hence cannot carry out any other work simultaneously. This is not the case when one is listening to music. Our visual sense needs focus; our auditory one needs less. A voice browsing system lets people carry on with other work while listening to the information provided by the system.

VoiceXML is a markup language designed to describe and process Human-Computer dialogs. Other languages help it in this: the Speech Recognition Grammar Specification (SRGS) language, the Semantic Interpretation for Speech Recognition (SISR) language and the Speech Synthesis Markup Language (SSML).

This dissertation is designed for people who are interested in creating voice applications; people who want to look at more advanced VoiceXML applications after reading an introduction found on the Internet; or people who want to learn how to dynamically generate VoiceXML content. Therefore, if you are already familiar with VoiceXML and the other voice platform languages (SRGS, SISR, SSML ...), you will find this dissertation easier to understand. It is also assumed that you are familiar with client-server programming models, server-side scripting and web programming.

Throughout the dissertation we will try to keep a balance between the two approaches we will mainly talk about: the technical approach (how to do things) and the user approach (how to provide the user with a not-too-bad experience; not-too-bad being very good for a voice interface!).

The objective of this dissertation is to develop a dynamic system that makes the Internet accessible to as many people as possible, and to push VoiceXML to its limits. It would also be a new way of accessing the “web”, making it more dynamic and advanced. Finally, it is for me a way to learn and to become an advanced developer in the area of voice technologies. This is a challenging dissertation, so challenging that it has me spending my last night on this island in a computer room!


Literature Review

…and thoughts


Introduction

Building a VoiceXML Web Browser is not easy. It implies a lot of thinking, but it also requires a lot of background knowledge in general programming, web programming, user interfacing, web protocols, client-server programming, scripting, and text processing.

We will first talk about the technological aspect. What materials are available about HTML, VoiceXML, SRGS, SISR, SSML, ECMAScript, Perl or Regular Expressions? What are their limits? Then we will move to the user interface aspect. Finally we will try to find some information on the development of browsers (both voice and graphical ones).

Technical materials

Technical documentation is usually not the most difficult information to find... we will see that this is not always true, especially when it is about very new languages or technologies.

• HTML

Finding information on HTML is the easiest thing you can ever imagine. The problem then is deciding which information is valuable to you and which is not. HTML is a formatting language, so whether you have the best book or the worst magazine article about HTML, the only thing you really need is a reference book. As I am a big fan of the whole O’Reilly collection, I’m using the HTML Pocket Reference by Jennifer Niederst (O’Reilly, 2002). A reference is compulsory even if you know HTML by heart and you are providing your sweetheart with some poems full of and tags! I also have to mention the HTML 4.01 Specification by the World Wide Web Consortium (W3C). It can be found on the W3C website (http://www.w3.org/TR/html4/). I found it a little long and over-explained, but maybe that is because I already know HTML. It is the specification anyway, and you should at least try to read it once.

• VoiceXML

Even if not everyone doing web development is aware of this language, VoiceXML is not that new. VoiceXML is already at version 2.0 and the W3C is now working on creating a free implementation of what they call a Voice Browser (the thing I wanted so much while doing this dissertation: a way of simply opening VoiceXML documents!). VoiceXML is a language describing a dialog between a user and a VoiceXML interpreter. VoiceXML is also the hub connecting all the different elements (different languages) that compose a voice application. Tutorials that can be found on the Internet are usually more of an introduction to the language than an explanation of how to do something. I sincerely think they won’t make you an advanced VoiceXML developer, but they do their job of bringing new developers to the joy of developing for voice.


Some very new books have been published on the subject. As I did not have the privilege of being able to read them, I will simply give a list of some of them:

VoiceXML: Professional Developer’s Guide Chetan Sharma, Jeff Kunins (Paperback, 2001) (0-471-41893-5)

Early Adopter VoiceXML Stephen Breitenbach, Tyler Burd, Nirmal Chidambaram, Eve Astrid Andersson, Xiaofei Tang, Paul Houle, Daniel Newsome, Xiaolan Zhu (Paperback, 2001) (1-861-00562-8). An overview of the book can be found at: [http://www.developer/com/voice/article.php/1565061]

Definitive VoiceXML Adam Hocek, David Cuddihy (Paperback, 2002) (0-130-46345-0)

The VoiceXML 2.0 Recommendation is available from the W3C website (http://www.w3.org/TR/2004/REC-voicexml20-20040316/) and I have to admit that I found this recommendation well written and very convenient to use when developing VoiceXML dialogs. I would also recommend Mark Green’s lecture notes on voice applications (Oxford Brookes University, MSc in Computer Science, P08786). His documents describe each W3C voice language very well, along with how they interact with each other. Finally, very good articles can be found on the web regarding voice computing and especially VoiceXML applications (these articles were published in the USA):

acm.org: VoiceXML for Web-based distributed conversational applications by Kenneth R. Abbott (Apress, 2001) (0001-0782)

[http://delivery.acm.org/10.1145/350000/348985/p53-lucas.html]

acm.org: Mixed-initiative interaction = mixed computation Naren Ramakrishnan, Robert Capra, Manuel A. Pérez-Quiñones (Apress, 2002) (0362-1340)

[http://delivery.acm.org/10/.1145/510000/503042/p119-ramakrishnan.pdf]

• Speech Recognition Grammar Specification: SRGS

The W3C quite recently (March 2004) released the recommendation for this language (http://w3.org/TR/2004/REC-speech-grammar-20040316/) and it is very difficult to find any information that does not come directly from the W3C. My personal view of this version of the recommendation is very negative, all the more so because it is the only source of information about the language. The document spends a lot of time giving the same useless examples three or four times and leaves lots of gaps wide open. SRGS is a grammar language that describes what kind of words or patterns of words should be looked for by the speech recognition module of a VoiceXML interpreter. SRGS comes in two forms: the ABNF form and the XML form. For this dissertation I used the XML form. The only information that can be found on SRGS (paper-based or web-based) is bundled with VoiceXML information. Somehow this makes sense, because SRGS by itself is totally useless.

• Semantic Interpretation for Speech Recognition: SISR

• Speech Synthesis Markup Language: SSML


I deliberately chose to group these two languages even though they have only one similarity (but a very strong one): they are both what I call documentation disasters. A good example: if you search for SISR in Google, it finds “Société Suisse des Informaticiens – Section Romande” (that is some kind of Swiss Company of Computer Guys).

SISR is a language that is supposed to define the syntax of the content of <tag> elements (in SRGS grammar documents). In practice it looks more as if it describes how data recognised by SRGS rules are put into ECMAScript variables. The W3C status of the document describing this language is Working Draft (they have been working on it since April 2003 without any public modification!) (http://www.w3.org/TR/2003/WD-semantic-interpretation-20030401/). The current version is a disaster, as there are tons of undocumented cases. This is the language that caused me the most trouble in the writing of this dissertation.

SSML also suffers from its very recent introduction but, being much more classical in its approach, its goal is easier to understand. SSML is a formatting language for voice. Its tags are now integrated directly into VoiceXML (in version 2.0), so this language is very easy to use and does not cause too much brain overheating. The SSML 1.0 Proposed Recommendation is available on the W3C website: http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/. Last Minute Change: A W3C Recommendation has been published for SSML 1.0 on the 7th of September (http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/).

• ECMAScript

ECMAScript is simply (it is almost that simple!) another name for JavaScript. Finding information about ECMAScript is therefore quite easy, as all the tutorials, documentation, examples and scripts written for JavaScript are their exact equivalent in ECMAScript. ECMAScript is a language standardised by Ecma International. The ECMAScript specification can be found on their web site, but I have to admit that I never read it: http://www.ecla-international.org/publications/files/ECMA-ST/Ecma-262.pdf. Personally, I’m using JavaScript Pocket Reference by David Flanagan (O’Reilly, 1998) and JavaScript: The Definitive Guide, also by David Flanagan (O’Reilly, 1998). I like the first one for its brevity: everything is summarised, and the organisation is very clear and well structured. You find your information quickly. The second one is a lot thicker but is very good for solving particular cases... and I do not know how I manage it, but I am always facing particular cases. I would also like to mention the very good paper by Rick Dobson: ECMAScript: the holy standard?. This article describes the advantages of ECMAScript compared with JavaScript.

• Perl


Perl has long been a very popular CGI scripting language. Even if ASP and PHP are nowadays taking serious market share, Perl is still widely used. Although it is principally known as a CGI scripting language, it is a full programming language; lots of non-web-based programs are written in Perl. I chose Perl as an opportunity to learn a new technology (even though I had been introduced to it during the P08771 module, Oxford Brookes University, MSc in Computer Science) and I discovered a fantastic, huge, very powerful, very dangerous language. This language lets you do everything you will ever want to do and even more... and this is the problem!

I used two books for my everyday work: Perl in a Nutshell (2nd Edition) by Stephen Spainhour, Ellen Siever, Nathan Patwardhan (O’Reilly, 2002) and Advanced Perl Programming by Sriram Srinivasan (O’Reilly, 1997). The first book is a reference book and contains lots of information (everything about Perl is simply not yet discovered!) about the language and the most used modules. The second focuses principally on the complex aspects of Perl programming (complex nested data structures, typeglobs, the power of Perl OO, graphical interfaces ...). If you have always wondered what is so great about Perl, read this book. I would also like to mention Perl Cookbook (2nd Edition) by Tom Christiansen and Nathan Torkington (O’Reilly, 2003), which I unfortunately discovered far too late.

If you want to have fun, some people even write poetry in Perl! I really could not believe it when I first saw this. Have a look at http://www.perlmonks.org/index.pl?node=Perl%20Poetry and enjoy! Some of the poems are really nice.

• Regular Expressions

I consider Regular Expressions neither a technology nor a language. For me Regular Expressions look more like a tool... but this tool has its own language, and a fairly complex one! I used only one book to learn about Regular Expressions and how to use them. It is quite exceptional, and it is the first time this has happened to me, but this book is perfect (I said it!): Mastering Regular Expressions by Jeffrey E.F. Friedl (O’Reilly, 1997). This book describes all the aspects of Regular Expressions: how to use them, but also how the different engines work and how to take advantage of their differences; it also stresses the great importance of efficiency.

Voice interfaces

Information on the design of voice interfaces is really difficult to find. This is quite a well-explored area of research, but the results are hard to get hold of and are mostly not available outside of labs. The rules to follow in order to build efficient voice interfaces are not yet written, and lots of people like me suffer from this lack!

My first and most accessible material for designing the voice interface was Mary Zajicek’s lecture notes on voice interface design theory (Oxford Brookes University, MSc in Computer Science, P08786). In this material you can find the real case of a weather report system. It introduced some very interesting ideas, not all of which are applicable to the VoiceXML technology.


But the main problem with this material is that it was describing a dialog. The system asked something, the user answered something, and finally the computer tried to respond intelligently to the user. In this case the initiative lies with the computer. It starts the discussion. It restricts the user’s freedom of expression to something it will be able to understand and compute. In our case we are making a system that responds to “orders”, in the sense that the system does not try to influence the user’s choice. The problem then comes when the user “gets lost”: as we (the system) are not waiting for anything in particular, we do not know what the user wants to do.

Some very interesting articles regarding voice interfaces can be found on the web, but they are very theoretical and do not really give solutions: they usually try to explain how we interact with each other. I will not describe them here, as they simply express too many things at once. A short explanation of these can be found in the bibliography part of this dissertation.

In the end I decided not to overload the system with several different ways of giving orders. There are a lot of reasons for this:

• Efficiency: The more complex the recognition, the more choices the system has to try, which makes the system slower and more fragile.

• Messiness: The more complex the recognition rules, the messier the system becomes, and so the less understandable it is for developers and administrators, the less maintainable and the less improvable. Another vicious circle: you want to improve the system and you end up blocking it!

• Too human: A web browser, however evolved you may want it to be, remains a tool for you. We are not in the business of making computer dialogs that simulate nice ladies in a call centre.

Building browsers

Information can be found on how to build browsers, but very little on how to build voice ones. When you think more about it, this does not make much difference. The process is the same: first you download the HTML content that has to be presented to the user. Then you somehow process this file to produce your output. When you continue to think about it, the main problems you have to face are:

• How to recognise what is a tag? When does it start, when does it end?

• How to process that tag that you finally found?
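To make these two questions concrete, here is a minimal sketch of one possible answer, written in Python for brevity (the dissertation's own implementation is in Perl). It assumes a deliberately naive regular expression for recognising tags and a hypothetical table, OPEN_ACTIONS, that maps tag names to spoken announcements; neither is taken from the actual system.

```python
import re

# Matches one HTML tag such as <a href="...">, </p> or <h1>.
# Deliberately naive: comments, scripts and '>' inside attribute
# values are not handled.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def tokenize(html):
    """Split HTML into ("text", str) and ("open"/"close", tag_name) tokens."""
    tokens = []
    position = 0
    for match in TAG_PATTERN.finditer(html):
        # Everything between the previous tag and this one is plain text.
        text = html[position:match.start()].strip()
        if text:
            tokens.append(("text", text))
        kind = "close" if match.group(1) else "open"
        tokens.append((kind, match.group(2).lower()))
        position = match.end()
    tail = html[position:].strip()
    if tail:
        tokens.append(("text", tail))
    return tokens


# Hypothetical per-tag actions: what to announce when a tag opens.
OPEN_ACTIONS = {
    "h1": "Heading:",
    "a": "Link:",
}


def to_speech(html):
    """Turn HTML into a flat string that a speech synthesiser could read."""
    spoken = []
    for kind, value in tokenize(html):
        if kind == "text":
            spoken.append(value)
        elif kind == "open" and value in OPEN_ACTIONS:
            spoken.append(OPEN_ACTIONS[value])
        # Tags without an entry in OPEN_ACTIONS are silently skipped.
    return " ".join(spoken)
```

For example, `to_speech('<h1>News</h1><p>Read <a href="x.html">more</a> here.</p>')` returns `"Heading: News Read Link: more here."`. Keeping the actions in a table rather than in the loop means the behaviour for a given tag can be changed without touching the tokenizer.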

If you are looking for explanations of how to build web browsers, you will get sad very quickly; the closest available information is from the Mozilla group, which provides developers with the sources of both of its browsers (Mozilla and Firefox) (http://www.mozilla.org/developer/). I have to admit that I was not able to get into it deeply enough to understand what they were doing, or even what their approach was to finding tags and processing them. So I had to find another solution, which was to look for more specific or technical content; and then I came back to the tool we were talking about before, regular expressions:

• The system uses regular expressions to separate tags from text.

• The tags are processed one by one and have a different action associated with each of them.

Conclusion

I am French and therefore a professional complainer.


Some of the technologies and languages used to build this project are really undocumented. The same can be said for the theoretical aspects of building a voice web browser, on the voice side as well as on the browser side. This problem makes the implementation of complex ideas even more complex and does not help the first-steppers. But I believe that our work (we, the first-steppers!) is for the benefit of everybody, and my soul feels peaceful hearing that!


Methodology


Introduction

In this section, I will describe the system: how it should work and how it actually works, which problems I faced and how I solved them. I will first talk about the functionalities of the system. Then I will introduce our output languages for those of us who are not familiar with them. Then I will move to a description of the system’s designs and processes.

The System Functionalities

What functionalities are provided to the user? What, from the user’s point of view, makes the system usable?

[Figure: Main criteria that make the system usable.]

We will try to provide a solution for every one of these criteria. The user is tough, but we are strong enough!

Our lovable languages!

I will try to give you some background on the languages that we are going to use as outputs of our server-side program. It is very important that you keep in mind which language does which part of the job. Pay special attention to this, especially because these languages look neither like the programming languages we use every day nor like the data structuring languages that XML brought us.

• VoiceXML

The origin of VoiceXML goes back to 1995 and a research project called Phone Markup Language at AT&T Bell Laboratories. In 1998, the W3C organised a conference on voice browsers, and the attendees of this conference included AT&T, Lucent, Motorola, IBM ... By this period, AT&T and Lucent had developed different variants of their original Phone Markup Language and IBM was developing its own speech language. Motorola, for its part, had already developed VoxML. Following these separate developments by commercial companies, the VoiceXML Forum was formed to develop and promote a standard voice markup language that developers could use to build conversational applications. The VoiceXML Forum’s main objective was to explore public domain ideas from existing work in the voice browser arena. As the standardisation process for voice browsers developed, the VoiceXML Forum would work with others to find common ground and the right solution for business needs.

In 2000, the forum wrote the VoiceXML 1.0 specification and submitted it to the World Wide Web Consortium for standardisation. In October 2001, a first version of VoiceXML 2.0 was published by the W3C’s Voice Browser Working Group; VoiceXML 2.0, which became a W3C Recommendation in March 2004, is the latest version of the specification.

VoiceXML is a markup language for describing Human-Computer dialogs, based on XML (Extensible Markup Language). Its first concrete objective is to solve the problem of Human-Computer interaction for voice servers where users fill in forms.

• Speech Recognition Grammar Specification: SRGS

The Speech Recognition Grammar Specification is a language used to describe what a VoiceXML interpreter should listen for. The syntax of the grammar format comes in two forms, namely an Augmented BNF form and an XML form. The specification makes sure that the two forms can be mapped onto each other, so that grammars can be freely converted between them. In this dissertation we will use the second form, the XML one. The W3C decided to provide an XML version of this language in order to let grammar developers use all the power of the tools developed for XML.
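As an illustration of the XML form, the following sketch (in Python, for illustration only; the real system generates its output from Perl) builds a tiny SRGS grammar that accepts one word out of a list. The command words, the rule name "command" and the language "en-GB" are assumptions made up for this example; only the element and attribute names come from the SRGS recommendation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the SRGS 1.0 recommendation.
SRGS_NS = "http://www.w3.org/2001/06/grammar"


def build_command_grammar(commands, rule_id="command"):
    """Return an SRGS (XML form) grammar accepting exactly one of `commands`."""
    grammar = ET.Element("grammar", {
        "xmlns": SRGS_NS,
        "version": "1.0",
        "mode": "voice",
        "root": rule_id,        # the rule the recogniser starts from
        "xml:lang": "en-GB",    # assumed language for this example
    })
    rule = ET.SubElement(grammar, "rule", {"id": rule_id, "scope": "public"})
    # <one-of> lists the alternatives; each <item> is one thing the user may say.
    one_of = ET.SubElement(rule, "one-of")
    for command in commands:
        item = ET.SubElement(one_of, "item")
        item.text = command
    return ET.tostring(grammar, encoding="unicode")
```

Calling `build_command_grammar(["open", "back", "favourites"])` yields a single `<grammar>` element whose root rule offers the three words as alternatives. Building the tree with an XML library rather than by string concatenation is exactly the kind of tooling benefit the XML form is meant to offer.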

• Semantic Interpretation for Speech Recognition: SISR

Semantic Interpretation is useful when it is combined with other specifications such as SRGS. Semantic Interpretation provides a way to attach to a speech recognition grammar the instructions for computing semantic results. In other words, it defines how a recognised pattern of words should be interpreted. SRGS is the pattern; SISR gives a meaning to this pattern. SISR statements should be valid ECMAScript expressions. Like SRGS, SISR comes in two forms: an ABNF form and an XML form. In the same way, the two versions can be fully mapped onto each other.

• Speech Synthesis Markup Language: SSML


Speech Synthesis Markup Language is a standard that the W3C is working on to provide formatting for voice. It has been created to provide a rich XML-based markup language for assisting the generation of synthetic speech. It provides a standard way to control different aspects of speech such as volume, pronunciation, rate and pitch across various synthesisers. This language is used to improve the quality of synthesized content. This markup language is suitable for “style” developers. It can be seen as the XML counterpart of what CSS is for graphical web interfaces. Using SSML allows the content developer to control the information given to the user in two ways: what is said and how it is said. Last Minute Change: A W3C Recommendation has been published for SSML 1.0 on the 7th of September (http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/).

• ECMAScript

ECMAScript is simply another name for JavaScript. More precisely, ECMAScript is an attempt to standardise the JavaScript language that was invented years ago by Netscape. We can say that the language is the same because the implementations of JavaScript and ECMAScript are exactly the same (the companies working on these technologies have no interest in writing different code that would do the same work). In the context of VoiceXML applications, this language is used to perform computations directly on the VoiceXML interpreter (client side).

• Who is doing what?

A typical workflow between voice languages.

- The user gives some input in the form of voice.
- The user’s voice is parsed and tested against the SRGS rules.
- The elements of the SRGS grammar file are executed. These elements are in the SISR language.
- Variables are passed back to the VoiceXML file, which can ask for some ECMAScript code to be executed.
- VoiceXML decides what output should be given to the user according to its internal rules. Outputs can contain SSML tags (or not, as in this case); SSML tags specify how the voice should sound.

• How does it work?

As described in the diagram, the VoiceXML architecture is one piece added to the web application design model. This piece is the VoiceXML Interpreter, which fits between the content server that sends the data as VoiceXML content and the user’s phone (the access medium).

System process

I will now try to explain the designs that I used for my system.

• A server side browser...

The first thing that has to be understood about the system is that the client-server idea driving normal web architectures is altered when moving to voice: in our case, the browser is actually server side.

Normal client-server web architecture versus...


The system used for our VoiceXML Web Browser.

As we can see on the diagrams, the architecture required for such a system to work is more complex and involves more components:

- The VoiceXML Interpreter: this piece of the architecture is common to every voice system that uses VoiceXML. Its role is to perform all the speech synthesis and speech recognition according to the content of the VoiceXML and SRGS data.

- What was previously the “server side” is now (for the purpose of our browser) acting like a proxy server: it makes requests to other web sites (the servers the user wants to access) and then gives the result back to the user (in VoiceXML form). This is why we can say that our browser is server-side.

• Design: Main Loop


VoiceXML Web Browser Main Loop.

Each time the system has to process a web page, it starts by downloading it from the source web server. Then it creates a zip copy of the page (with images, CSS files, Java applets ...) that can be sent by email to the user at his/her request. Then it performs all the tag transformations, depending on the type of tag and on rules and preferences set by both the user and the administrator. Finally it delivers to the user the VoiceXML document, which will be processed by the VoiceXML Interpreter.

• Design: Downloading HTML content from a source web server

After seeing the design of the global loop, let us now have a look at each action that has just been described. First let us look at what is really performed when downloading some HTML content at the user’s request.


Loop controlling the downloading of HTML content from a web server.

The first thing that the system checks is whether there is actually any data to be posted to the requested URL (as when the user sends a form). If so, the data is posted. Then the system downloads the answer from the remote web server (our server is actually, as described before, only acting like a proxy). When the data is received, it checks the HTTP headers to see if an HTTP redirection has been requested. If so, it goes to the next URL and continues the loop.
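The download loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the system: the names download_html, send and MAX_REDIRECTS are hypothetical, the transport is passed in as a function so that the sketch runs without a network, and dropping the form data after a redirection is a simplification.

```python
from typing import Callable, Dict, Optional, Tuple

# A response is (status code, headers, body). The transport is injected so the
# sketch can run without a network; a real system would perform HTTP requests.
Response = Tuple[int, Dict[str, str], str]
Transport = Callable[[str, Optional[Dict[str, str]]], Response]

MAX_REDIRECTS = 10  # assumed safety limit, not taken from the dissertation


def download_html(url: str, post_data: Optional[Dict[str, str]],
                  send: Transport) -> Tuple[str, str]:
    """Return (final_url, body), following HTTP redirections."""
    for _ in range(MAX_REDIRECTS):
        # Post the form data if there is any, otherwise do a plain request.
        status, headers, body = send(url, post_data)
        location = headers.get("Location")
        if status in (301, 302, 303, 307) and location:
            # Redirection requested: go to the next URL and loop again.
            url = location
            post_data = None  # assumption: the redirected request carries no form data
            continue
        return url, body
    raise RuntimeError("too many redirections")
```

Passing the transport as a parameter keeps the loop itself easy to test: any function that returns a status, a header dictionary and a body can stand in for the remote web server.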

• Design: Making a zip of all the page

Now we can start the second action of our main loop: saving a copy of the whole web page in a zip format.


Making a complete copy of a web page.

The data received from the remote web server is analysed to find all the tags that require some processing. For each of these tags, the system works out where the linked content is located and downloads it. When a piece of content is downloaded, its link in the source file is changed to point to the new location. Finally, when everything is processed, a zip is made from all the files.
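As a rough illustration of this step, here is a minimal Python sketch. It assumes that links are found with a single regular expression over double-quoted src, href and background attributes and that downloaded files are simply named file0, file1, ...; the names make_zip, fetch and LINK_ATTR are hypothetical, and a real implementation would decide tag by tag which links deserve to be downloaded.

```python
import io
import re
import zipfile
from typing import Callable, Dict

# Attributes that may point at linked content. The regular expression itself
# is an illustrative assumption, not the one used by the system.
LINK_ATTR = re.compile(r'\b(src|href|background)\s*=\s*"([^"]+)"', re.IGNORECASE)


def make_zip(html: str, fetch: Callable[[str], bytes]) -> bytes:
    """Download linked content, rewrite the links, and zip everything."""
    local_names: Dict[str, str] = {}
    files: Dict[str, bytes] = {}

    def relocate(match: "re.Match[str]") -> str:
        attr, url = match.group(1), match.group(2)
        if url not in local_names:
            name = "file%d" % len(local_names)   # files numbered from 0 to n
            local_names[url] = name
            files[name] = fetch(url)             # download the linked content
        # Point the attribute at the local copy instead of the remote URL.
        return '%s="%s"' % (attr, local_names[url])

    rewritten = LINK_ATTR.sub(relocate, html)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.html", rewritten)
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

The fetch function is injected for the same reason as before: the sketch can be exercised with canned data, and the same content referenced twice is only downloaded once.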

• Design: HTML tag processing

Third step and the most complex of all: finding and replacing HTML tags.


HTML tag processing by the VoiceXML Web Browser.

For each tag found in the HTML document (the system uses regular expressions to find them), the system tests whether it is an opening tag, a closing tag or a singleton tag (one with no closing counterpart). If it is a closing tag, it is deleted from the opened tag stack. If it is an opening tag, it is added to the opened tag stack and the system tries to close the nearest tag that can be closed by the opening of another tag. Then it finds out which function to import and call according to the rules.xml general configuration file and the user’s preferences.xml file. The function is called and its result is added to the output files (the VoiceXML one and the SRGS one).
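The opened tag stack can be illustrated with the following Python sketch. The tables SINGLETONS and CLOSED_BY_OPENING are hypothetical stand-ins for what rules.xml would provide, only the top of the stack is considered for an implied close, and closing a tag also closes whatever is still open inside it (an assumption of this sketch). The calls to the per-tag functions are left out.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

# Hypothetical tables: in the real system the rules live in rules.xml.
SINGLETONS = {"br", "hr", "img", "input"}
# For a tag on top of the stack: the tags whose opening closes it.
CLOSED_BY_OPENING = {"li": {"li"}, "p": {"p"}, "option": {"option"}}


def walk_tags(html: str) -> List[Tuple[str, str]]:
    """Return the sequence of ('open' | 'close', tag) events, with implied closes."""
    stack: List[str] = []
    events: List[Tuple[str, str]] = []
    for match in TAG.finditer(html):
        closing, name = match.group(1) == "/", match.group(2).lower()
        if closing:
            if name in stack:
                # Close the tag and (assumption) anything left open inside it.
                while True:
                    top = stack.pop()
                    events.append(("close", top))
                    if top == name:
                        break
        elif name in SINGLETONS:
            # Singleton tags open and close at once and never reach the stack.
            events.append(("open", name))
            events.append(("close", name))
        else:
            # Opening this tag may implicitly close the nearest open tag.
            if stack and name in CLOSED_BY_OPENING.get(stack[-1], set()):
                events.append(("close", stack.pop()))
            stack.append(name)
            events.append(("open", name))
    return events
```

In a complete system each event would be the point where the function chosen from the configuration files is imported and called, and where its result is appended to the VoiceXML and SRGS outputs.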

• Design: Sending the output to the user

Now that the VoiceXML and the SRGS contents are ready, the web server can then send this content to the user’s VoiceXML Interpreter.

Process for sending the output to the user’s VoiceXML interpreter.

When all the tag processing is over, the VoiceXML and the SRGS files are saved in the user’s temporary directory and an HTTP redirection is sent, pointing to the VoiceXML document just saved.

• What action for what tag?

Some tags execute special actions when met. I do not have enough space to describe them here, but you will find in the configuration section an explanation of all the actions that the system can perform depending on how it is configured.


Testing


Introduction

While I was programming, debugging and afterwards testing my system, I really put it in some very hard situations. For demonstration purposes I will only show the transformation of a rather simpler HTML file.

I have only been able to test HTML data (HTML files and HTML data generated by CGI scripts) on localhost (meaning my own computer, but through an HTTP server) because of a firewall problem with my internet provider; but since the code makes no difference regarding where the data is located, this should not cause any problem on a well-configured server.

I would also like to mention that both output files pass the W3C XML validator. This doesn’t ensure that the content of the files is what it should be, but it at least certifies that the output files are valid.

Because there is absolutely no free implementation of the W3C voice languages, you won’t be able to test the system in a real situation (I wasn’t able to either!). So I will try to explain, looking at the input HTML file, the user’s personal preference file, and both output files (the VoiceXML and the SRGS one), why the system is doing what it is supposed to do.

Inputs and Outputs

Here are the input and the output directly. I’ll explain them just after.

TESTING THE DISSERTATION


This time it better work! We have some

audience! It should work. It should work GREAT! IT'S SOOOOO GOOD! Why do web developer uses table for

styling! GRRR! 1 + 1 = 2 2 + 2 = 3 3 + 3 = 4


do eat other

cows!

HTML File in input.

link Here starts the form number %%ID%% Here ends the form number %%ID%% A large image is there. text field password field


file field. file fields are not supported by the system.

Personal Preference File of the user.

http://www.w3.org/2001/vxml http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd http://localhost/


TESTING THE DISSERTATION TESTING THE DISSERTATION This time it better work! We have some audience! 1 It should work. 2 It should work GREAT! 3 IT'S SOOOOO GOOD! Why do web developer uses table for styling! GRRR! Here start the form number 1 text field fill with $ += $dictionary textValue[0] = text1 ; password field fill with $ = $alphabet

http://www.brookes.ac.uk/


passwordValue[0] = password1 ; fill with $ += $dictionary textareaValue[0] = replaceReturn(textarea1) ; 1 + 1 = 2 select it checkValue[0].push( checkValuePossible[0][0]) ; 2 + 2 = 3 select it


checkValue[0].push( checkValuePossible[0][1]) ; 3 + 3 = 4 select it checkValue[0].push( checkValuePossible[0][2]) ; Here ends the form number 1 Cows do eat other cows!


]]>


The VoiceXML document in output. Only the indentation has been changed to make the document more readable. This in no way affects how the document is treated by the VoiceXML interpreter.

http://localhost/DISSERTATION/tmp.html http://localhost/DISSERTATION/tmp.html http://localhost/DISSERTATION/tmp.html http://localhost/DISSERTATION/tmp.html http://www.w3.org/2001/06/grammar http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd


Brookes $ = 0 ; in a new window switch to window number $.windows.windowNb = $digit ; if ($digit == 0) $.windows.url = " http://localhost/DISSERTATION/tmp.html" ; else $.windows.url = "ERROR" ; list favourites $.listFavorite = 'FILLED' ; $ = 0 ; in a new window open if($newWindowFav == 0) $.openFavorite.windowNb = 0 ; else $.openFavorite.windowNb = 1 ; $.openFavorite.favNb = $favorite ; $ += $alphabet

http://localhost/DISSERTATION/tmp.html


$ = 0 ; in a new window open if($newWindowURL == 0) $.openURL.windowNb = 0 ; else $.openUrl.windowNb = 1 ; $.openURL.url = $url ; $ = 0 ; in a new window follow link number if ($newWindowLink == 0) $.links.windowNb = 0 ; else $.links.windowNb = 1 ; if ($digit == 1) $.links.url = "" ; else $.url = "ERROR" ; send url $.sendLink = 'FILLED' ; send all urls $.sendAllLink = 'FILLED' ; send zip of page $.sendZip = 'FILLED' ; send all zips $.sendAllZip = 'FILLED' ; send form number 1 $.form1send = 'FILLED' ; clear form number 1 $.form1clear = 'FILLED' ; fill the number of form 1 with $.form1fill1.type = $fieldType1 ; $.form1fill1.nb = $digit ; $.form1fill1.value = $sentence ;


select the element number of the number of form 1 $.form1fill2.type = $fieldType2 ; $.form1fill2.nb = $digit ; $.form1fill2.value = $elementNumber ; $ += $dictionary text block text field password field check list list

The grammar (SRGS) document in output. Only the indentation has been changed to make the document more readable. This in no way affects how the document is treated by the grammar interpreter.

“Well, but I want more explanation!”

To show you why the system is doing what it is supposed to do, I will take parts of the output files and explain the transformation. We will go through the whole of the output files in this way. I don’t expect you to know VoiceXML and I’ll try my best to make this explanation readable and understandable without requiring too much knowledge of the W3C voice languages; but it’s undeniable that you will go much faster if you know these technologies.

• Initialisation (VoiceXML)

(1)

http://www.w3.org/2001/vxml http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd http://localhost/DISSERTATION/tmp.html


(3)

VoiceXML code.

Here, I’m only opening the VoiceXML document (1) (all VoiceXML files start like that) and initialising some variables that I’ll use later (2). The system is using a form-level grammar (3). This implies that the user and the system will have a mixed-initiative dialog. Traditionally (and before VoiceXML), the dialog was driven by the computer: the computer asks a question and the user has to answer. In a mixed-initiative dialog, the computer knows beforehand all the information it needs and tries to find this information in anything the user says. This feature of VoiceXML is very important for us because it allows a dialog driven by the user: the user asks a question and the computer answers him/her. I’m also creating an ECMAScript function that will be used to build a string containing the list of all favourites (4). This is the perfect moment to tell you that, for readability, all the parts of code that will be interpreted by the ECMAScript processor are shown in italic.

• Initialisation (SRGS)

(1) (2)

http://www.brookes.ac.uk/ http://www.w3.org/2001/06/grammar http://www.w3.org/2001/XMLSchema-instance http://www.w3.org/2001/06/grammar http://www.w3.org/TR/speech-grammar/grammar.xsd


SRGS code (grammar file).

First, I’m opening an SRGS document. As for VoiceXML, all SRGS documents start in the same way (1). Then I’m making a first special rule that will be used as the root rule to find every combination of valid sentences (2). The SRGS documentation from the W3C is not clear on how root rules should be handled in the context of a mixed-initiative form. The rule described previously might not be useful, but it makes sure that the system works.

• Switch between windows

(3) (4) (5)

VoiceXML code.

switch to window number (1) (2) $.windows.windowNb = $digit ; (3) if ($digit == 0) $.windows.url = " http://localhost/DISSERTATION/tmp.html" ; else $.windows.url = "ERROR" ;

SRGS and SISR code (grammar file).

Here, we have an example of a real recognition. The goal of this part is to recognise a sentence like: “switch to window number 0”. When this sentence is recognised, the system will open the corresponding “window”. Let’s have a look at the process:

(1) The system recognises “switch to window number x” (in the grammar file).
(2) The system then performs the ECMAScript code attached to the rule.
(3) The variable $.windows is modified (in the grammar file). This will automatically fill the windows field (in the VoiceXML file).
(4) The VoiceXML code of this field is executed.
(5) If the “window” indicated by the number exists (has already been opened), the system puts the URL of the corresponding “window” in the url variable and sends all the needed variables to the server.

The server will then reply to the user with the processed transformation of the new URL passed.

• Listing Favourites

(2) (3) (4)

VoiceXML code.

list favourites (1) $.listFavorite = 'FILLED' ; (2)


This action is much simpler than the previous one. If the grammar recognises “list favourites” (1), it fills the listFavorite field (2) of the VoiceXML document, which then calls the appropriate ECMAScript function listFavorite() (3) that we introduced at the beginning. At the end, we clear the listFavorite field (the listFavorite value) (4) so that it can be recognised again, and therefore so that the user can ask for his/her favourites again. I would like to add that line (2) of the grammar is required in this case because we are in a mixed-initiative context. The element that does the speaking simply says what is inside itself.

• Opening a Favourite

(5) (6)

VoiceXML code.

(1) (2) Brookes

(3) $ = 0 ; in a new window


open if($newWindowFav == 0) $.openFavorite.windowNb = 0 ; else $.openFavorite.windowNb = 1 ; (4) $.openFavorite.favNb = $favorite ;


Nothing is new here; it is just a little more complex. The system recognises sentences like “open brookes in a new window”. The favorite rule is a list of all the favourites configured by the user (1), meaning that the rule will try to recognise one (and only one) of the elements inside it (2). The newWindowFav rule is there to capture the optional “in a new window” (3). If the user wants to access his/her favourite in a new “window”, the system will look for the next available “window” (4); in our case, “window” number 1. The field is then filled (5) and the variable initialised at the beginning is now used to give its value to the url variable (6).

• Opening an URL

(5)

VoiceXML code.

(1) $ += $alphabet (2) (4) $ = 0 ; in a new window open if($newWindowURL == 0) $.openURL.windowNb = 0 ;


else $.openUrl.windowNb = 1 ; $.openURL.url = $url ; (3)


The idea is exactly the same as the previous one, but here we have to recognise a URL instead of a number. The system will recognise: “h t t p : / / w w w . g o o g l e . c o m /”. The important points are:

(1) The repeat attribute permits the system to find more than one letter and therefore lets it recognise the list of letters.
(2) We are constantly adding the newly found letter to the ones found previously. The $ represents the value that the url rule will return and that will be caught as (3) by the openURL rule.
(4) The newWindowURL rule recognises the optional “in a new window”.
(5) The openURL field is filled; the variables are set and sent back to the server.

The alphabet.grxml file contains a grammar that recognises every possible character of a URL.

• Sending links and zips

As they are all done in exactly the same way, I will only explain the first action... and it will be quite short because it’s very simple.

(2) (3) (3)

VoiceXML code. Send a link to the current “window” by email.

send url (1) $.sendLink = 'FILLED' ; (2)

SRGS and SISR (grammar file). Send a link to the current “window” by email.

The recognition part can be compared to the one used when we wanted to recognise “list favourites”: when the exact phrase is matched (1), the system fills the variable corresponding to the right VoiceXML field with any data (2). Then the right data is sent (3) to the send.cgi server script. I am adding here screenshots of the web page, first directly from the URL and then decompressed from the zip.



Before.

In here the image can be anywhere in the web server file tree; or even on some other web server.

After. Here all the files are in the same zip, in the same folder, from file number 0 to file number n. For this, the system parses the whole HTML file, downloads the files that should be downloaded, changes all the href, src, background, ... attributes in the HTML file, saves everything, zips everything and finally sends the zip to the user by email.

As you can see, there is no difference... meaning that the system is doing its job right! It works correctly for images (as we can see here), but also for CSS files, background images, Java applets, Flash animations, JavaScript files, ...

• Transforming the tag

TESTING THE DISSERTATION (3)

HTML tag in input.

(1) (2)

Preferences for the tag.

TESTING THE DISSERTATION (3) VoiceXML code.

As we can see in the preference file, an action is required when meeting this tag (action="on") (1). The action specified is to add a 2-second pause where the tag closes (2). And this is exactly what is done (3). The tag that produces the pause is not really a VoiceXML tag; it is part of the SSML specification (another one), which specifies how some words should be said (or, in this case, not said). SSML tags can be directly included in VoiceXML documents starting from VoiceXML version 2.0. With this tag, we show that a pause can be specified using a preference file, but the user can also choose any arbitrary text to be spoken before or after a tag.


• Transforming the tag


HTML tag in input.

(1) (2) (3)

Preferences for the tag.

(4) TESTING THE DISSERTATION

VoiceXML code.

Here it is the same idea as previously: the tag requires some processing because its action property is equal to “on” (1). The user requires a break of 2 seconds before (2) and after (3) the tag. The output corresponds to the user’s request (4).



HTML tag in input.

(2) Preferences for the tag.

TESTING THE DISSERTATION (1)

Pure text in the VoiceXML file.

You might have wondered, reading the previous section, why the tag seems not to be processed (1). The answer is to be found in the preference file: the user requires no action for this tag (action="off") (2). Of course, the user could have selected, as in the previous section, some pause to be taken or some text to be spoken. From now on I won’t explain the tags that do not require any action. This testing section is already very long... there is no need to make it even longer.

• Processing a listing: and tags

It should work. (3) It should work GREAT! (4) IT'S SOOOOO GOOD! (5)

HTML and tags in input.

(1) (2)



1 It should work. (3) 2 It should work GREAT! (4)

3 IT'S SOOOOO GOOD! (5) Pure text in the VoiceXML file.

This case is more interesting; we are trying to solve the numbering of a list. HTML has a whole bunch of tags for building different kinds of lists (one of them being a little bit special). As HTML was first designed for graphical environments, lots of different graphical choices are available to the HTML designer, but they can be catalogued as:

o Numbering listings: for example: “1, 2, 3 ...” or “a, b, c, ...”
o Non-numbering listings: for example: “•” or “□” or “-”

For us, using only voice, the choice is simple: “do I number or not?”. Numbering in our case consists of adding some text before the content. Not numbering consists of not adding anything. Let’s now have a look at our example. Some action is needed (1); the action consists of a “numbering” (the user requires the system to count all the item tags) (2). The first item is processed and counted as number 1 (3). The second is processed and counted as number 2 (4). The last is processed and counted as number 3 (5). In this case, the numbering uses “1, 2, 3 ...” because the value attribute is equal to “typeAttribute” (2). The HTML specification for list tags introduces the type attribute, which lets the HTML designer specify what characters should be used for numbering. By specifying “typeAttribute” the user asks our system to use the character specified there to count. As there is no type attribute in the list tag of the HTML file, the system uses the default value, which is “1”. The valid values for the value attribute in the preferences file are:

o “typeAttribute”: already discussed
o “1”: “1, 2, 3, 4 ...”
o “a”: “a, b, c, d ...”
o “A”: “A, B, C, D ...”
o “i”: “i, ii, iii, iv ...”
o “I”: “I, II, III, IV ...”
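To make these numbering styles concrete, here is a small Python sketch of how the text spoken before a list item could be computed. The function names (item_label, to_roman, to_letters) are hypothetical, and the behaviour after the letter “z” (continuing with “aa”) is an assumption of the sketch rather than something stated above.

```python
from typing import Optional

ROMAN = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
         (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
         (5, "v"), (4, "iv"), (1, "i")]


def to_roman(number: int) -> str:
    """Lower-case roman numeral of a positive integer."""
    result = ""
    for value, symbol in ROMAN:
        while number >= value:
            result += symbol
            number -= value
    return result


def to_letters(number: int) -> str:
    """1 -> a, 26 -> z, 27 -> aa (assumed behaviour past 'z')."""
    result = ""
    while number > 0:
        number, remainder = divmod(number - 1, 26)
        result = chr(ord("a") + remainder) + result
    return result


def item_label(position: int, preference: str,
               type_attribute: Optional[str] = None) -> str:
    """Text spoken before the list item at `position` (counted from 1)."""
    style = preference
    if preference == "typeAttribute":
        # Use the type attribute of the HTML list, or "1" when there is none.
        style = type_attribute or "1"
    if style == "1":
        return str(position)
    if style in ("a", "A"):
        label = to_letters(position)
    elif style in ("i", "I"):
        label = to_roman(position)
    else:
        raise ValueError("unknown numbering style: %r" % style)
    return label.upper() if style.isupper() else label
```

With the preference set to “typeAttribute” and no type attribute in the HTML, item_label(3, "typeAttribute") gives "3", which matches the third item of the example above.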

The system has been tested with every one of these values and has worked successfully. The system also copes very well even when the tags are not closed in the HTML, as we can see at (3), (4) and (5).

• Dealing with forms: the tag

HTML tag in input.


(1) Here start the form number %%ID%% (2) Here ends the form number %%ID%% (3)


Here starts the form number 1 (2)

Here ends the form number 1 (3)

(4)


textField passwordField selectField " /> (4) (20) (20)

VoiceXML code.

(4) send form number 1 (5) $.form1send = 'FILLED' ; (4) clear form number 1 (5) $.form1clear = 'FILLED' ; (20) (21)


fill the number (23) of form 1 with (24) $.form1fill1.type = $fieldType1 ; $.form1fill1.nb = $digit ; $.form1fill1.value = $sentence ; (20) (22) select the element number of the number of form 1 $.form1fill2.type = $fieldType2 ; $.form1fill2.nb = $digit ; $.form1fill2.value = $elementNumber ; (24) (25) $ += $dictionary (26) (23) text block text field password field (28) check list list (29)


(1) A special action is required for the tag.

(2) The user wants a sentence introducing the form. He/she also wants the sentence to state the number of the form. The %%ID%% keyword is used for that.

(3) The user also wants a sentence to announce the end of the form.

(4) Two rules and fields are created for each form: one to submit the information filled in by the user and one to clear the form.

(5) The two rules recognise “clear form number 1” and “send form number 1”.

(6) A first list of values is created. All variables in VoiceXML are actually ECMAScript variables. Setting them in a single tag lets us set them easily: all in one place. Every variable (like textValue or passwordName) is an array: one element for each item of a particular type.

- TypeName holds the name attribute of the element in the HTML (7).

- TypeValue holds the value that the field will take (8). The value can be set when the page is loaded (when the HTML already provides one) or can be set by the user at run time.

- In the case of a check list or a simple list, TypeValuePossible is used to store all the possible values of a particular field (all the value attributes of these HTML tags).

(10) A second list of values is then created to specify which input fields belong to which form. In our case this is not much trouble, as there is only one form.

- FormAction is an array saving the action attribute of every form (11). In HTML, the action attribute is the URL of the CGI that will receive the form.

- FormItem is an array saving all the fields belonging to all the forms, form by form (12). Therefore FormItem[n] is an array saving all the fields that belong to form number n (actually, as n starts from zero, it saves all the fields that belong to form number n+1).

- FormItem[n][0] is the array saving all the fields of form number n+1 that are textarea fields (13).

- FormItem[0][0][0] = 0 ; means that the first textarea element of the first form is the first textarea element of the whole document (14), 0 being the index into the textareaName and textareaValue arrays.

- FormItem[n][1] is the array saving all the fields of form number n+1 that are text fields (15).

- FormItem[n][2] is the array saving all the fields of form number n+1 that are password fields (16).

- FormItem[n][3] is the array saving all the fields of form number n+1 that are list fields (17). The system considers as a list field any list of radio input elements that share the same name attribute, or a select element with its list of options.

- FormItem[n][4] is the array saving all the fields of form number n+1 that are check list fields (18). The system considers as a check list field any list of checkbox input elements that share the same name attribute, or a select element with its list of options when the multiple attribute is provided.

- FormItem[n][5] is the array saving all the fields of form number n+1 that are hidden fields (19).

(20) The last big thing is now to be able to fill any item arbitrarily.

- This being too complex to do in one rule, the system uses two rules (for each individual form) to deal with filling the forms’ data: formnfill1 (21) and formnfill2 (22).

- The formnfill1 rule (21) recognises sentences like “fill the text field number 1 of form 1 with ‘the sky is blue today’”. To do so, it uses the fieldType1 rule, which simply matches either “text block”, “text field” or “password field” (23). It also uses a rule that recognises a sentence (24) by putting a rule that recognises one word inside a repeat (25). $ (26) represents the value that the ECMAScript variable sentence will have when it comes out of the sentence rule.

- The formnfill2 rule (22) recognises sentences like “select the element number 1 of the list number 1 of form 1”. Again, the system creates a special rule for recognising “list” or “check list” (28). The elementNumber rule (29) only exists to encapsulate a digit rule. This is required to avoid a variable clash with the other digit rule called.

- The ECMAScript code that the client will execute when the user asks to fill a text block, text field or password field (30) will first check that the number asked for corresponds to an existing field (formnfill1.nb < formItem[t][u].length); then it will replace the value in memory with the one specified by the user (31) (32) (33).

- The same kind of action is done when the user asks to fill a list or a check list, and I don’t think any more explanation is required for those (34).
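The nested FormItem layout and the existence check described above can be mimicked in Python as follows. This is a sketch only: the real code is ECMAScript executed by the VoiceXML interpreter, the example data is made up, the names form_item, values and fill_field are hypothetical, and counting forms and fields from 1 on the user’s side is an assumption.

```python
from typing import List

# Index of each field type inside form_item[n], following the order in the text.
TEXTAREA, TEXT, PASSWORD, LIST, CHECK_LIST, HIDDEN = range(6)

# One value array per field type, indexed over the whole document
# (made-up data: one textarea, two text fields, one password field).
values: List[List[str]] = [[""], ["", ""], [""], [], [], []]

# form_item[n][t] lists the document-wide indexes of the fields of type t
# that belong to form number n + 1. Here there is a single form.
form_item = [[[0], [0, 1], [0], [], [], []]]


def fill_field(form_nb: int, field_type: int, field_nb: int, value: str) -> bool:
    """Fill field number `field_nb` of the given type in form `form_nb`.

    Both numbers are counted from 1, as the user would say them (assumption).
    Returns False when the requested field does not exist.
    """
    fields = form_item[form_nb - 1][field_type]
    if not 1 <= field_nb <= len(fields):
        return False
    # Replace the value in memory with the one given by the user.
    values[field_type][fields[field_nb - 1]] = value
    return True
```

For example, fill_field(1, TEXT, 2, "the sky is blue today") stores the sentence in the second text field of the first form, while asking for a third text field is rejected because the form only has two.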

• Dealing with forms: the tag

HTML tag in input.

text field (1) password field file field. file fields are not supported by the system. (3)


text field (1) (2) (3) (4) fill with (5) $ += $dictionary (6) textValue[0] = text1 ; (7)

VoiceXML code.

When the system arrives at one of these tags, the first thing it does is to check in the preference file whether there is any announcement for this type of input (1). Then it has to give the user the ability to fill the text field instantly (2). For this the system uses a modal field (modal="true"). A modal field differs from a field using form-level grammars in that it can no longer be filled using the form-level rules. In our case the text1 field can only be filled by the inline rules. If a duration for the system to wait for the user to “barge in” is given in the preference file, this property is added to the field (3). The inline rule will recognise sentences like “fill with ‘aliens are around us’.” (4). The sentence rule (5) is exactly the same as the one we used for the forms. If the user doesn’t say anything, or says something that has nothing to do with filling the form, we want the system to continue to speak the rest of the file. This is done by artificially filling the VoiceXML textn field (6). In the opposite case, where the user wants to fill the text field, the textValue[n] ECMAScript variable is set to the value given by the user (7). (Another tag would have done the exact same thing.) This tag also benefits from all the facilities described for forms to fill a particular field of any type. The next tag works in exactly the same way, so we will skip it.


HTML tag in input.

(2)


fill with $ += $dictionary textareaValue[0] = replaceReturn(textarea1) ; (2)


var1.replace("new line", '\n') { (3) return var1 ; } ]]>

VoiceXML code.

The textarea tag globally works exactly like the previous one, except that it does the variable allocation by itself (1). The data used to fill the textarea field is first parsed in order to replace a special combination of words representing a carriage return (in our case “new line”) by a new line ('\n') (2). To do this the function uses the replace method of the ECMAScript String class (3).
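The replacement itself is a one-liner. The Python sketch below mirrors the idea of the replaceReturn function; the name replace_return and the marker parameter are illustrative choices, not the system’s code.

```python
def replace_return(text: str, marker: str = "new line") -> str:
    """Replace every spoken marker with a real line break."""
    # str.replace returns a new string with all occurrences replaced;
    # the original string is left untouched.
    return text.replace(marker, "\n")
```

One point worth keeping in mind when doing the same in ECMAScript: the replace method of String also returns a new string rather than modifying the original, and with a plain string pattern it only replaces the first occurrence, so the result has to be used and a global regular expression is needed to replace every “new line”.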


1 + 1 = 2 (1) (7) 2 + 2 = 3 3 + 3 = 4 (2)

HTML and tags in input.


1 + 1 = 2 (3) (4) select it (5) checkValue[0].push( checkValuePossible[0][0]) ; (6)

VoiceXML file.

For the system, a with its encapsulated is exactly like the same as a list of that share the same name attribute. This is done because to help the user in the sense that it is two graphical ways of representing the same thing. We will only focus our interest on the first of the (1) as they are all the same. The last one (2) has the particularity of being checked, we can verify that it has been taken care of in the tag: “checkValue[0].push("3") ;”.


(3) First, the system doesn’t forget the enclosed text.
(4) Then it makes the traditional “bargin” field.
(5) The inline rule is only capable of recognising “select it”; that is all we want!
(6) The push ECMAScript function is used to push the new value inside the checkValue[n] variable.

Here also, we can note that the system doesn’t crash when a non-closed tag appears, as is the case here (7). The tag also uses all the facilities provided by the tag to fill a particular field of any type. We don’t have an example of a single-choice list, but it works broadly in the same way. The only difference is that listValue[n] is no longer an array but a single value. Again, the system treats such a list like a list of radio buttons, as they represent the same thing.
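The difference between the two kinds of lists can be sketched as follows, in Python rather than in the ECMAScript of the generated file. The names check_value and list_value mirror checkValue[n] and listValue[n]; the SelectionState class, its methods and the duplicate check are hypothetical and only there for illustration.

```python
class SelectionState:
    """Holds the values chosen so far, mirroring checkValue[n] and listValue[n]."""

    def __init__(self, check_lists, single_lists):
        # One array per multiple-choice list.
        self.check_value = [[] for _ in range(check_lists)]
        # One single value per single-choice list.
        self.list_value = [None] * single_lists

    def tick(self, n, value):
        """Rough equivalent of checkValue[n].push(value)."""
        # Assumption: ticking twice keeps one copy; the original simply pushes.
        if value not in self.check_value[n]:
            self.check_value[n].append(value)

    def choose(self, n, value):
        """A single-choice list just overwrites its previous value."""
        self.list_value[n] = value


state = SelectionState(check_lists=1, single_lists=1)
state.tick(0, "3")    # the element that was already checked in the page
state.tick(0, "1")    # the user said "select it" after the first element
state.choose(0, "2")
state.choose(0, "1")  # a second choice replaces the first one
```

A multiple-choice list accumulates values, while a single-choice list only ever remembers the last one.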


Unfortunately, to keep this example short, I didn’t include an <a> tag. The action this tag performs can be seen as a mix of:

- Any form field tag, for the ability given to the user to “bargin” that he/she wants to follow the link that has just been announced.

- The <form> tag, for the ability to follow any link (but much simpler, because there is only one set of links whereas there can be more than one form, and because you don’t have to fill anything and therefore don’t have to “play” with hundreds of ECMAScript variables).

• Cows do eat other cows! : Transforming the tag

(1) do eat other cows!

HTML tag in input.

A large image is there. (2)


Cows do eat other cows! (1) Pure text in the VoiceXML file.

The <img> tag processing performs two separate actions:

- It first tests for the presence of an alt attribute in the tag and replaces the tag by its value (1).

- It then tests the size of the image to find out whether a message should be “prompted” or not (2). Unfortunately, the image is too small in our case for the message to be prompted, but it is a very good way of warning the user when pictures are important for the page (the case of a photo album, for example). If the user is interested in the image, he/she can then ask the system to send the page (with the picture) by email.


Even better, the system doesn’t rely on the optional width and height attributes but on the real size of the image!
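The two actions can be summarised by a small Python sketch. The function name describe_image, the 400 by 300 pixel threshold, the rule that both dimensions must reach it, and the order of the two sentences are all assumptions made for illustration; the width and height passed in stand for the real size measured from the image file, not for the HTML attributes.

```python
def describe_image(alt, width, height, min_width=400, min_height=300):
    """Return the sentences to speak for one image."""
    spoken = []
    # Action 1: the alt text, when present, replaces the tag.
    if alt:
        spoken.append(alt)
    # Action 2: warn the user when the measured image is large.
    # The threshold values are hypothetical.
    if width >= min_width and height >= min_height:
        spoken.append("A large image is there.")
    return spoken


print(describe_image("Cows do eat other cows!", 50, 50))
print(describe_image("a stupid image", 800, 600))
```

With a small image only the alt text is spoken, which matches the example above where the message is not prompted.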

I think it’s enough! If you got through all of these explanations, I’m very proud of you and very happy that they were either understandable or fun enough for you to keep reading!

Conclusion

Here is a brief table that summarises all the tests that I actually performed... If you don’t have enough time... you should check in here!

Testing in real situation: Couldn’t test
Testing the efficiency of the interface in real situation: Couldn’t test
Working on localhost: Success
Working on outside web sites: Couldn’t test
Validating by the W3C XML validator: Success
Transformation of formatting tags: Success
Transformation of special content tags: Success
Transformation of links: Success
Following links: Success
Transformation of lists: Success
Transformation of tables: Can be improved
Adding information using metas: Not implemented
Transformation of forms: Success
Filling form fields: Success
Sending form: Success
Clearing form: Success
Logging in: Success
Logging off: Success
Opening a URL: Success
Managing different “windows”: Success
Listing favourites: Success
Adding to favourites: Success
Opening a favourite: Success
Saving a zip copy of the currently viewed site (receiving by email): Success
Saving a zip copy of all currently opened sites (receiving by email): Success
Saving a link to the currently viewed site (receiving by email): Success
Saving links to all currently opened sites (receiving by email): Success
User-friendly interface for configuration: Not implemented
Having fun doing all these: Total Success!

I hope you’re now convinced that the system always does its best in every situation and has been tested as much as it could be. The W3C documentation is also very unclear on some points; for example, should a field filled in a mixed-initiative context be cleared first in order to be re-filled? If so, it would imply some minor changes for the system to work perfectly, but these changes would be necessary.


Guide


Introduction

Here we put ourselves in the position of a user of the system, to see how he/she can do his/her everyday work with it. It is always a good idea to see the user’s point of view. It is also a good way for us to discover the effort that has been put into usability and the user interface.

I would also like to remind the reader that this is a working version; it is not supposed to have all the functionalities that a commercial version would have, such as help messages. We assume that the user knows how to use the system. An effort has been made to make the system even easier to use once the user knows the program.

Logging in to the system

To log into the system, the user has to access the login.vxml file on the server.

C: Welcome to the VoiceXML Web Brother.
C: Please enter your login.
H: u s e r 1
C: Please enter your password.
H: n o p a s s
C: [...] (loading the user’s home page)

Dialog with direct access. “C” represents the computer, “H” represents the user.

C: Welcome to the VoiceXML Web Brother.
C: Please enter your login.
H: u s e r 1
C: Please enter your password.
H: [...]
C: Please enter your password.
H: [...]
C: Please enter your password.
H: [...]
C: Would you like us to send your password by email?
H: Yes
C: [...] (the naughty computer just hung up! But it still sent the password by email first)

The user has lost his/her password. “C” represents the computer, “H” represents the user.

Logging out of the system

To log out of the system, the user only has to hang up his/her phone. Politeness would have the user say “Bye! Bye!” to the computer first!

While browsing a web page

Here we describe how the user can interact with the system while browsing: which actions are available, how to launch them, and how they can be combined with other functionalities to be even more powerful.

• Accessing another URL

H: Open h t t p : / / w w w . g o o g l e . com

Opening a URL.

H: Open h t t p : / / w w w . g o o g l e . com in a new window

Opening a URL in a new window.

• Managing windows

H: Close this window

Closing the current window.

H: Close all windows

Closing all windows.

H: Close window number n

Closing the window number n.

Even if windows do not really make sense in a blind world, they are actually very useful for the user. They are a really nice way for him/her to keep track of where he/she was before (a little like the back button that we all use so much in a normal web browser).

• Favourites

H: Open brookes

Opening a favourite (assuming brookes is in the user’s favourites).

H: Open brookes in a new window

Open a favourite in a new window.

H: List favourites

List all accessible favourites.

H: Add to favourite as b r o o k e s

Add a page to favourites.

H: Add window number n as b r o o k e s

Add opened window number n to favourites.

Favourites are essential in a browser working on the VoiceXML technology. Using VoiceXML, it is very difficult to recognise an arbitrary word, and therefore recognising an arbitrary URL without having it spelled out is almost impossible. Favourites are a very good way for users to go faster by not having to spell complex URLs.

• Sending link to page

H: Send url

Send a link to the current page by email.

H: Send all urls

Send links to all opened windows by email.

Sending a link to a page is very useful for the user to be able to recover the information he/she checked using the VoiceXML Web Browser. It is also very “light” for the user’s mailbox. Unfortunately, the web page might change between the time he/she checks it using the system and the time he/she checks it using the link provided. To cover this case:

• Sending copy of the page

H: Send zip of page

Send a copy of the current page by email.

H: Send all zips

Send a copy of all opened windows by email.

The email sent to the user contains a zip file of a web page. The web page is modified to be self-contained. All the linked files (images, CSS files, JavaScript files, Flash animations, SVG files, Java applets ...) are contained in the zip file, and the links in the HTML file are modified to point to the files located in the zip file.

• Following links

C: link: click here
H: Follow

Follow a link by “bargin”.

C: link: click here
H: Follow in a new window

Follow a link in a new window by “bargin”.

H: Follow link number n

Follow the link number n.

H: Follow link number n in a new window

Follow the link number n in a new window.

There are two ways of accessing a link: the user can either say that he/she wants to follow a link when this link occurs, or follow a link using its index at any time.

• Forms

H: Send form number n

Send form number n.

H: Clear form number n

Clear all the fields of form number n.

C: Text field
H: Fill it with “I hate dissertations at 2 a.m.”

Fill a text field by “bargin”.

C: Password field
H: Fill it with n o p a s s

Fill a password field by “bargin”.

C: Text block
H: Fill it with “I hate dissertations at 2 a.m. new line but please give me good marks”

Fill a <textarea> by “bargin”.

C: You have the choice between:
C: First Element
C: Second Element
H: Select it
C: Third Element

Select an element of a radio list by “bargin”.

C: You have the choice between:
C: First Element
C: Second Element
H: Select it
C: Third Element

Select an element of a <select> by “bargin”.

C: You can select one or more elements from:
C: First Element
C: Second Element
H: Select it
C: Third Element

Tick an element of a check list by “bargin”.

C: You can select one or more elements from:
C: First Element
C: Second Element
H: Select it
C: Third Element

Select an element of a multiple <select> by “bargin”.

H: Fill the text field number n of form m with “I hate dissertations at 2 a.m.”

Fill the text field number n of form m.

H: Fill the password field number n of form m with n o p a s s

Fill the password field number n of form m.

H: Fill the text block number n of form m with “I hate dissertations at 2 a.m. new line but please give me good marks”

Fill the <textarea> number n of form m.

H: Select the element number 2 from the list number n of the form m

Select an element of the radio list number n of form m.

H: Select the element number 2 from the list number n of the form m

Select an element of the <select> number n of form m.

H: Select the element number 2 from the check list number n of the form m

Tick an element of the checkbox list number n of form m.

H: Select the element number 2 from the check list number n of the form m

Select an element of the multiple <select> number n of form m.

The system numbers forms to avoid the problems that multiple forms imply. This is not very user-friendly because the user cannot guess which form a field belongs to. To help the user, the beginning and end of each form are announced: “Here starts the form n”, “Here ends the form n”. Every field can be filled either by “bargin” immediately after it is announced, or at any time using its form number and its number in the form. To help the user, a list of radio fields and a <select> are treated in the same way because they express the same thing: a single element chosen from a list of choices. The user perceives only one thing: a list from which he/she can choose one element. Following the same idea, a list of checkbox fields and a multiple <select> are also treated in the same way. They both express a list of elements from which the user can choose one or more elements.
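To illustrate how a field is addressed by its form number and its number in the form, here is a small Python sketch that decodes the spoken commands listed above. It is only an illustration: the real system relies on VoiceXML grammar rules, not on regular expressions, and the function parse_command and the keys of the dictionaries it returns are hypothetical.

```python
import re

# Patterns follow the wording of the commands shown in this guide.
FILL = re.compile(
    r"^fill the (text field|password field|text block) "
    r"number (\d+) of form (\d+) with (.+)$",
    re.IGNORECASE,
)
SELECT = re.compile(
    r"^select the element number (\d+) from the (check list|list) "
    r"number (\d+) of the form (\d+)$",
    re.IGNORECASE,
)


def parse_command(sentence):
    """Return a dictionary describing the command, or None if unknown."""
    text = sentence.strip()
    match = FILL.match(text)
    if match:
        kind, field, form, value = match.groups()
        return {"action": "fill", "kind": kind.lower(),
                "field": int(field), "form": int(form), "value": value}
    match = SELECT.match(text)
    if match:
        element, kind, number, form = match.groups()
        return {"action": "select", "kind": kind.lower(),
                "element": int(element), "list": int(number), "form": int(form)}
    return None


print(parse_command("Fill the text field number 2 of form 1 with hello"))
```

Note that the sketch only distinguishes “list” from “check list”, exactly as the user does: whether the page used radio buttons or a selection list is hidden from him/her.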


• Images

Images are transformed by the system using the alt attribute of the <img> tag.

C: a stupid image

HTML <img> tag. The corresponding “dialog” by the computer.

The system can also be configured to announce that a large image is present:

C: A large image is there.

HTML <img> tag. The image being a large one. The corresponding “dialog” by the computer.

This functionality can be very useful when browsing an image album: the user will know that the main content of the page is an image and can therefore ask the system to send a link to the page, or a copy of it, by email. This is a good example of combining functionalities to make the system easier to use.

Conclusion

We have seen throughout this part how the user can use the system to perform complex tasks. We also saw how the system tries its best to always give the user the most straightforward way of performing actions. Many other enhancements could be made to improve the user’s experience. Testing in real situations would also be a great way of understanding how users use the system, in order to make them more efficient.


Configuration


Introduction

In this section we will see how to configure the system to make it more efficient or closer to what you expect from it. I will first explain where to find every configuration file and what the purpose of each one is. Then we will see what can be configured, and how, by both the administrator and the general user. Finally, we will have a word on modularity, a concept that has been taken care of all the way through programming.

What to find where

Here is a diagram explaining how the configuration files are organised.

The directory hierarchy in the cgi-bin directory.

Let us now have a look at what each one does:

• In the cgi-bin directory

- general.xml: this file specifies 3 things: the path of the users configuration file (here ./users.xml), the path of the rules configuration file (here ./rules.xml) and the paths and ids of the standard dictionaries (in our case, only the English dictionary at ./dictionaries/en.dic with EN as id).

- users.xml: this file specifies everything regarding users.

- rules.xml: this file specifies how each tag should be processed. It also keeps all the emailing configuration.

• In the dictionaries directory

- en.dic: this file is a list of all the valid words that compose the English dictionary.

• In the user1 directory

- personal.dic: this dictionary is a list of words that are not part of a standard dictionary but are used by the user. The user has to build his/her personal dictionary by him/herself.

- preferences.xml: this XML file lets the user configure how the system transforms the HTML tags.

- favorite.list: this file contains the user’s favourites with their full URLs.

What can I configure if I am an Administrator

Here are the files that should be changed by an administrator. As there is no graphical interface for configuration, anyone who has access to the server can change anything. We simply present here what an administrator is supposed to care about.

• Changing email configuration

(2) (3) (4)

Changing email configuration in rules.xml.

The email configuration has to be changed in the rules.xml file. The properties that can be changed are: the fancy email string of the sender (1), the real address of the sender (2), the address of the SMTP server (3) and finally the prefix of the email subject (4).

• Changing the path of configuration files

(1) (2)

Changing paths in general.xml.

The path of the users configuration file (1) and the path of the rules configuration file (2) can both be modified in the general.xml file.

(1) (2)

Changing paths in users.xml.

The path of the user’s personal dictionary (1) and of his/her favourites file (2) can be changed in the users.xml file. This shouldn’t be done without a good reason.

• Managing global dictionaries

Dictionaries in the general.xml file.

Dictionaries can be added to or removed from the system very easily. To be available to the system, a dictionary has to be declared in the general.xml file. The URL given has to match a real .dic file. Ideally, a global dictionary should be located in the dictionaries directory.


A .dic file contains a list of all the existing words of a specific language, one per line. It has to start with the language name between square brackets (‘[’ and ‘]’).
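Reading such a file is straightforward. The following Python sketch shows one possible way of doing it; the function name parse_dic, the choice of ignoring blank lines and the use of a set to hold the words are assumptions of mine, not a description of the system’s own code.

```python
def parse_dic(text):
    """Return (language, words) from the content of a .dic file."""
    # Ignore blank lines and surrounding spaces (an assumption).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first line must be the language name between square brackets.
    if not lines or not (lines[0].startswith("[") and lines[0].endswith("]")):
        raise ValueError("a .dic file must start with [language]")
    language = lines[0][1:-1]
    # Every following line is one word of the dictionary.
    return language, set(lines[1:])


language, words = parse_dic("[english]\na\nA-bomb\nA-road\n")
print(language, len(words))
```

A set makes it quick to check whether a word given by the user belongs to the dictionary.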

[english]
a
A-bomb
A-road

A dictionary example: the en.dic file.

A declared dictionary also has to have a name attribute (in our case “EN”) so that it can be selected by a user.

• Managing users

(1) Pierre Naquin (2) 20/12/1984 (3) [email protected] (4) (5) (6) (7) (8)

The user1 user in the users.xml file.

For a user to be valid it has to exist in the users.
