
MSC-based Language for Specifying Automated Web Clients

Vicente Luque Centeno, Peter T. Breuer, Luis Sánchez Fernández, Carlos Delgado Kloos, Juan Antonio Herráiz Pérez

Depto. Ingeniería Telemática, Universidad Carlos III de Madrid, Av. Universidad, 30, E-28911 Leganés (Madrid, Spain)

E-mail: {vlc, ptb, luis, cdk}@it.uc3m.es

Abstract

Programming automated Web Navigation Assistants, i.e., applications that automatically navigate the Web performing specific tasks for the user, is far from easy. Since HTML pages offered by legacy Web-based applications are designed to be manipulated only by people using browsers, and Web pages contain semi-structured information [3] whose data schema may easily change, the creation and, even worse, the maintenance of this kind of application are very expensive. However, an increasing number of information sources and online applications have been added to the Web during the last few years, so assistants for automating tasks over those Web-enabled applications are more and more needed. These assistants may automate tasks by filling in forms, following links, analyzing data embedded in Web pages and performing computations over those data on behalf of the user.

Software engineering techniques are clearly needed to reduce the cost not just of creating these programs (by significantly reducing their time-to-market) but, even more importantly, of keeping them working properly, reducing the cost of readapting them to Web site pages whose structure or navigation schemes are frequently changed. This paper proposes the well-known formal method Message Sequence Charts (MSC) [12] as a base for defining a language for programming Web Navigation Assistants which may navigate a Web site according to the user's aims. This specification language, called XPlore, is especially suited both for requirements engineering and for automatic generation of an executable, and has been successfully tested on several well-known Web sites.

1 Introduction

The World Wide Web has rapidly expanded into the largest human knowledge repository. Not only information sources, but also Web-enabled applications, like e-mail, auctions, intranet databases, online stores, hotel reservations or even procedures involving government forms, have to be used repeatedly by millions of users who sit daily in front of their browser-enabled computers and spend a lot of effort filling in forms and cutting & pasting small chunks of data between different windows in order to perform tasks that have to be repeated over and over on the same forms. Most of the information they deal with is usually stored in Web-enabled databases that are not fully exploited according to the particular needs of many users. Web browsers have traditionally been the main tool for Web navigation, though they need not be the only one. Web browsers are great for presenting multimedia pages to users and interactively collecting their data into form fields, but they are not as good as working tools when some further computing task has to be performed or when automation of repetitive retrieval of hypertext links is desirable. Fortunately, it is possible to build task-oriented Web clients that automate these tasks for the user by automatically following links and filling in forms, perhaps getting identified at a Web server and establishing a Web session involving several networked transactions, but presenting to the user only the final results. These Web clients, referred to as Web Navigation Assistants, are quite different from search engine robots, which anonymously follow almost every link they encounter. Web Navigation Assistants are programmed to perform a specific task and try to achieve a specific purpose for the user, navigating through the deep Web [18] of many Web sites by exploring only selected links and forms and ignoring all irrelevant data. However, not only developing this kind of program, but keeping it operative, is rather expensive, because remodeling of Web sites, which may include non-visualized changes, may easily and frequently break these programs.

Besides Web Navigation Assistants supporting users in the navigation process, a new kind of Web-enabled application which aggregates information from different information sources is being demanded, especially in enterprise systems, where information obtained from several heterogeneous legacy information sources needs to be combined in order to create syndicated, up-to-date information for decision making. Each information source has a different way of being accessed, and the results of each source are presented with different structures. Legacy information sources are neither required nor expected to be modified, since they provide useful information for other users who work with them directly. However, some users, especially those who have to combine information from different sources, would prefer not to open a different window for every information source. These users would probably prefer to have all the information that they need in a single homogenized window, a window which properly combines the data extracted from all the different legacy sources. Data warehousing [9] is not the best answer when the amount of data is large or frequently updated. Instead, mediator systems [13] can integrate not only data, but also services, from different applications. It is also quite common that similar Web-enabled applications from different vendors need to be used by the same user. Since each application may have different links, layouts or information structures to provide similar functionality, users spend a lot of time accessing each single Web-enabled application in a different window. Many users would find it really helpful to be able to control all the different applications through a single front-end where the differences between the accessed applications are virtually removed and information from different applications can be presented in a common, homogenized information structure.

Both Web navigation assistants and information mediator systems share common problems and solutions. These programs have in common that they support a task which is algorithmically simple but requires many interactions. In fact, their difficulty resides mostly in those interactions. Algorithmic treatments, which are not usually complex, can easily be defined in a user's routine if data are properly structured in available, well-defined, programmable repositories. So, programs navigating the Web are difficult to create, but they are even more difficult to keep updated, especially those involving Web sites outside an intranet, because changes on those sites cannot be controlled and may even be difficult to detect. It is quite common that these applications have a short life until failure, because Web sites dynamically evolve without considering that they are being accessed by specialized tools instead of browsers. Time until failure in a system integrating several sources can be measured as the period of time until the first change at any of those sources breaks the integration program, so this becomes critical for Web aggregators.

As a result, both Web navigation assistants and aggregators of information and services from different sources are new kinds of software application where the intrinsic dynamism of the Web provides an ever changing environment in which these programs may frequently need maintenance effort. Besides that, it is also highly important that these programs can be rapidly developed and that small changes do not necessarily affect their correct behaviour. In other words, short time to market, robustness at navigation (controlling not only network failures but also changes in the Web pages) and low-cost maintenance are essential requirements, which clearly indicates that software engineering techniques need to be used for building and maintaining these applications.

In this paper, we present XPlore as a language for specifying the behaviour of Web navigation assistants and Web aggregators. XPlore is based on Message Sequence Charts (MSC) [12], a formal method defined by the ITU (International Telecommunication Union) for specifying behaviours in distributed middleware systems. MSCs have mainly been used as part of SDL [11], but they can also be used in other environments. In this paper, MSCs are used to specify interactions between Web clients and Web servers in order to automate tasks on the Web. XPlore has been defined as an MSC-based language which allows the specification of user, browser and JavaScript-defined behaviours, as well as networked transactions, in order to interact with Web servers running legacy Web-based applications which have been built to be accessed by browsers. The XPlore language has been designed to be as similar as possible to the textual representation of MSC. However, it has several minor differences from the original standard due to its Web-oriented semantics. XPlore can be considered an attempt to adapt the MSC formal method to the development of task-oriented Web clients.

2 Challenges for Web navigation assistants

Web applications are very heterogeneous. Similar applications can have not only really different front-ends or visualization formats, but also different transaction protocols. While publishing an auction at Aucland's Web site [1] may involve three or four HTTP-based transactions and Web pages, performing the same task on eBay [2] involves about five or six different forms across ten different pages. Each information source or Web-based application may involve its own HTTP-based transaction protocol. Some servers ask for a login and password just once, when users enter their sites, while others ask for authentication only when an important transaction is going to be performed. Some servers maintain sessions with cookies, while others maintain them with hidden form fields. Some servers use a single but complete form; other servers divide the process of collecting data from users into several sequential steps, each one located on a different HTML page. Some servers use JavaScript-based navigation, while others do not, even though both probably provide similar functionality.
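As an illustration of the per-server protocol logic a wrapper has to encode, the following Python sketch (not XPlore; the URLs, form fields and hidden-field pattern are hypothetical) logs into one server that tracks the session with cookies and into another that threads a hidden form field through consecutive pages.

    # Illustrative sketch only: server URLs and field names are invented.
    import re
    import requests

    # Server A keeps the session in cookies set at login time.
    a = requests.Session()
    a.post("https://mail-a.example/login", data={"user": "vlc", "pass": "secret"})
    inbox_a = a.get("https://mail-a.example/inbox").text      # cookies sent automatically

    # Server B keeps the session in a hidden form field that must be echoed back.
    b = requests.Session()
    login_page = b.get("https://mail-b.example/login").text
    token = re.search(r'name="sessionid" value="([^"]*)"', login_page).group(1)
    inbox_b = b.post("https://mail-b.example/inbox",
                     data={"sessionid": token, "user": "vlc", "pass": "secret"}).text

Each of these variations forces a different dialogue structure in the corresponding wrapper, which is exactly the kind of per-server detail that has to be specified and maintained.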


This Web heterogeneity, though viewed positively by Web designers and most users, makes data from the Deep Web very difficult to collect or integrate properly with automated tools. A single program able to handle all existing differences between legacy systems cannot easily be built. In order to make these systems scalable, different server-focused Web clients need to be developed. These task-and-server-oriented programs are also known as wrapper programs [10], and their mission is to provide a similar view of all accessed information sources. Once several different wrapper programs have finally obtained the expected results from each information source, an integration program may homogenize those differently obtained data and represent them in a common, uniform, known and structured repository, to be presented to the user or involved in further computations. The purpose of the whole system (wrappers and integration program) is to present aggregated information obtained from different sources by hiding the differences between legacy systems.

Robustness of each wrapper program is highly important, since the failure of any of them may break the whole application. However, frequent remodeling of Web sites, non-observance of W3C standards at most Web sites, and server-side usability oriented only to specific graphical browsers result in a need to update these wrapper programs frequently in order to keep them working properly. Since a Web site's markup design is subject to change at any moment and is not explicitly defined in any published data schema, any minor visual-oriented change can easily break the program, thus requiring a great maintenance effort.

The complexity of these applications thus comes from the various interactions between the servers being accessed and their clients, rather than from the data being exchanged or computed. Data treatment is commonly reduced to simple comparisons or filtering, but it is also desirable that more complex treatments can be applied to data on the Web.

3 XPlore: a language for Web task specification

XPlore is a specification language for developing applications that automate tasks on the World Wide Web. Web tasks specified in the XPlore language turn out to be simple and robust. XPlore can encapsulate the details defining complete user sessions when specifying part of a task, and it can also define the structure of HTTP requests and responses when detailed specifications are required. In both cases (sessions and HTTP transactions), XPlore needs just a few lines of code. XPlore is an imperative language which defines a sequence of conditions to be tested and actions and transactions to be executed, allowing the user to specify the control flow of this execution with loops (while-do-done and foreach-in-do-done), conditional branching (if-then-else-end), variables, user-defined functions and full access to machine resources, like files, external programs, timers, threads and synchronizing mechanisms like locks. The language is both expressive and high level, so it is applicable to a wide range of problems, while still being translatable to an executable if all the needed information is provided.
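A minimal sketch of that control flow in a general-purpose language (Python here; the catalogue URL and the HTML patterns are hypothetical, and regular expressions stand in for XPlore's extraction primitives) might look as follows.

    # Paginate through a listing, collecting prices, until there is no "next" link.
    import re
    import requests

    session = requests.Session()
    url = "https://shop.example/catalogue?page=1"
    prices = []

    while url:                                                     # cf. while-do-done
        page = session.get(url).text
        for m in re.finditer(r'class="price">([0-9.]+)', page):   # cf. foreach-in-do-done
            prices.append(float(m.group(1)))
        nxt = re.search(r'rel="next" href="([^"]+)"', page)
        if nxt:                                                    # cf. if-then-else-end
            url = nxt.group(1)
        else:
            url = None

    print("cheapest item:", min(prices) if prices else "none found")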

XPlore has been defined as a Web-client-oriented adaptation of the textual representation of MSC. MSCs have traditionally been used for specifying behaviours and communications between the components of distributed systems which communicate by exchanging messages. MSCs have both graphical and textual representations. Since their birth in 1992, MSCs have periodically been extended with powerful new functionality, and they easily allow defining expected dialogues between remote software components, as well as error conditions, repetitive and conditional behaviour, concurrency, timers, and the definition and calling of functions. All these features can be used to improve the construction and maintenance of wrapper programs by using them properly in a specification language directly translatable to an executable, instead of reviewing a large amount of low-level code written in a common programming language.

The approach of defining HTTP requests, HTTP sessions and relevant data extractions each in a single line of code turns out to be difficult when using well-known programming languages like C, Perl or Java. None of these languages provides the level of abstraction required to keep the amount of code to a minimum. Most changes in the structure of a Web-based information source involve updating a single XPlore line of code. Since the automation of every HTTP transaction, the filling in of a form field, or a data extraction involves a single XPlore line of code (perhaps two or three when the data schema gets complex), maintaining these applications is not very expensive, since the number of lines to be reviewed when a modification is required is quite small.

The extraction of relevant data from visited pages is modeled as MSC actions performed at the client side between network transactions. These actions can be implemented by calling user-defined routines which can be written in any programming language managed by the user. Semi-structured data retrieval techniques for this purpose have been proposed in the literature [4, 6, 7, 15, 14]. However, most of them are not focused on getting simple, robust, expressive data extraction rules, and try to solve the issue by applying lexical analyzers and parsers based on regular expressions, a low-level yet powerful solution that cannot by itself offer all the needed functionality and is usually difficult for non-specialized users to understand and maintain. As an alternative, the user's actions can directly call data extraction primitives defined over a well-known standard, namely XPath [19], the W3C recommendation for data addressing in XML documents, which does take the document's structure into account. This is possible since HTML pages obtained from Web servers are given XML syntax on the fly by Tidy [17] when retrieved. The extended XPath-like built-in primitives included in XPlore make the process of describing relevant data computations rather easy and efficient.
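A rough Python equivalent of this pipeline, using lxml's tolerant HTML parser as a stand-in for the Tidy step and plain XPath 1.0 for the extraction (the URL and the XPath expressions are hypothetical):

    # Parse tag-soup HTML and address the interesting cells with one XPath line each.
    import requests
    from lxml import html

    page = requests.get("https://webmail.example/inbox").text
    tree = html.fromstring(page)               # broken markup is repaired on the fly

    senders  = tree.xpath('//table[@id="msgs"]/tr/td[1]/text()')
    subjects = tree.xpath('//table[@id="msgs"]/tr/td[2]/text()')
    for sender, subject in zip(senders, subjects):
        print(sender.strip(), "-", subject.strip())

When the site's markup changes, typically only these two XPath expressions need to be revisited, which is the maintenance argument made above.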

Once all relevant data are available in structured local repositories, further computations often have to be performed over them, like comparisons, accumulations, reorderings, or any kind of semantic reasoning which may take decisions such as which link should be followed next or whether an information retrieval process is close to completion. Though these computations can easily be programmed with any imperative programming language, Web site navigation skills need to be considered as well. XPlore acts as a host language for XPath-like primitives, just the way XSLT acts as a host language for XPath. However, XPlore has not been defined with XML syntax, nor is it focused only on XML transformations. XPlore is well suited for defining the basic actions of a Web task, like HTTP requests and responses, filling in forms, extracting data from documents and specifying user-defined routines for processing those data. Besides these, XPlore is flexible enough to configure low-level communication features, specify multi-threaded behaviours, use service combinators [5] for the treatment of errors during execution, and access the operating system resources of the local machine in the same way as other programming languages. Besides the usual programming data types like integers, floats, booleans, strings, lists, structured records or arrays, XPlore includes specific data types for documents and for selected ranges within a document.

Programmed navigation recipes should separate the expected parts of a document from the other, irrelevant parts, storing the former for further processing in local repositories like memory variables, local files or databases, according to the programmer's specification. This data extraction is highly dependent on the markup design used at each Web site, so different wrapper code is usually created for each server. This dependence becomes a major problem, since unexpected changes in the markup design of a Web site may easily break automated navigation programs.

Recent projects [16, 8] prefer to approach this issue by applying XML techniques, like DOM programming or XSLT transformation rules, to dynamically XML-ized Web pages (also using Tidy or similar tools). XSLT is then used to transform any incoming XML-ized Web page into a well-formed document that can be processed further. This is a higher-level approach than treating pages as plain text, as regular expressions do, because contextual treatments may be applied to different subtrees. However, something more than simple XML transformation is needed. In fact, rather than well-formed XML documents, structured data directly processable by user-defined computation routines are often preferred. XSLT can be considered a good solution for obtaining a uniform set of documents representing the different collections of data obtained from different servers. However, XSLT cannot be expected to perform efficient manipulations on small parts of incoming pages, especially if they need to be repeatedly applied to a single document. DOM is a much more powerful solution which can address this issue, but since it has a lower level of abstraction, many lines of code have to be developed and maintained whenever a simple modification in the markup design of a Web site appears. XPath, however, has been specifically designed as a language for addressing parts of XML documents by using simple expressions. XPath expressions easily fit in a single line and are well understood by many people. XPath 1.0 is a good option for selecting nodes in a document only if no complex processing needs to be specified, so it is a good fit for XSLT. More complex treatments can be specified with the new XPath 2.0 draft [20], but most of its new capabilities have not been completely defined yet.
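The abstraction gap being described can be seen by comparing a one-line XPath selection with the equivalent hand-written DOM-style traversal; both sketches assume the same hypothetical inbox layout used above.

    from lxml import html

    doc = html.fromstring(open("inbox.html").read())   # a locally saved page, for the example

    # XPath: a single line selects every subject cell.
    subjects = doc.xpath('//table[@id="msgs"]/tr/td[2]/text()')

    # DOM-style traversal: the same selection written out by hand.
    subjects_dom = []
    for table in doc.iter("table"):
        if table.get("id") == "msgs":
            for row in table.iter("tr"):
                cells = list(row.iter("td"))
                if len(cells) >= 2 and cells[1].text:
                    subjects_dom.append(cells[1].text)

A markup change that affects the DOM version forces the whole traversal to be re-read, whereas the XPath version concentrates the dependence on the markup in one expression.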

3.1 Data processing

When the amount of extracted data is small and the computation over them is reduced to simple comparisons, the user can happily perform her task manually with a browser. However, when the amount of data is large, or several pages have to be explored, or processing those data involves some logical or mathematical calculation, the user working with a browser becomes overwhelmed and error prone, perhaps trying to cut & paste small pieces of data between different windows, trying to memorize too much information and repeatedly clicking on links. Unfortunately, browsers cannot be explicitly told about the user's purposes, nor about what is relevant for the user and what is not, so they cannot offer better support for these repetitive actions. This processing is indeed a major reason for data extraction. Data management does not end with matching selected data against expected patterns; the programmer must also be able to read and manipulate the retrieved data through well-known programming data types, like lists, arrays, trees or structured records, which can easily be processed further by user-defined routines. Comparing the elements of a list or performing a user-defined action over selected elements of a document can be efficiently programmed if the data extracted from documents are stored in well-known data types.
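For instance, once node lists like the ones above have been extracted, they can be packed into ordinary records and handled with plain list operations; a small self-contained sketch (the extraction results are stand-ins):

    # Extracted node lists become structured records for further computation.
    senders  = ["boss@example.com ", " alice@example.org"]   # stand-ins for XPath results
    subjects = ["Budget ", " Lunch?"]

    messages = [{"sender": s.strip(), "subject": t.strip()}
                for s, t in zip(senders, subjects)]

    from_boss  = [m for m in messages if m["sender"] == "boss@example.com"]
    by_subject = sorted(messages, key=lambda m: m["subject"].lower())
    print(len(from_boss), "message(s) from the boss,", len(by_subject), "in total")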


4 Commented example: Web-mail aggregator

This section provides graphical representations of a Web-mail aggregator specified with XPlore. These graphics only show two well-known Web-mail servers, but the real application syndicates several more servers. Figure 1 details the authentication of the user at the aggregator application: the main page, which includes a form for getting identified, is retrieved; when the user submits his login and password, these are checked at the server side, enabling the next steps only if authentication is OK. If so, the aggregator identifies itself at the syndicated servers as the user and obtains from them the lists of messages stored at those servers. These lists are properly combined into a single page which is presented to the user.

Figure 1. Session initialization (MSC "Beginning", with instances User, Aggregator, Yahoo! and Hotmail)

Details of the Obtain messages sub-MSC are given in figure 2. For each syndicated server, a thread is created to execute that server's wrapper code. These threads maintain a short dialogue with their respective servers and get the list of messages that those servers hold for the user. These lists are finally combined at the aggregator application and stored in a database for caching purposes.
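A sketch of that per-server threading pattern in Python (the wrapper bodies are placeholders standing in for the per-server dialogues described earlier):

    # One thread per syndicated server; the results are combined afterwards.
    from concurrent.futures import ThreadPoolExecutor

    def yahoo_wrapper(user, password):
        # would log in and extract the message list, as in the earlier sketches
        return [{"server": "yahoo", "subject": "hello"}]

    def hotmail_wrapper(user, password):
        # needs an extra transitional page to be followed (cf. figure 3)
        return [{"server": "hotmail", "subject": "meeting"}]

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(w, "vlc", "secret")
                   for w in (yahoo_wrapper, hotmail_wrapper)]
        combined = [msg for f in futures for msg in f.result()]

    print(combined)   # combined list, ready to be cached in the database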

Figure 2. Message retrieval (MSC "Obtain messages", with instances Aggregator, Yahoo! and Hotmail)

Though both wrappers have a common purpose, they might perform their work very differently, because the servers return their data in different formats or a different number of links has to be followed, as illustrated in figure 3.

When the user asks again for the list of his messages, the database is consulted rapidly. However, the user may decide to get an up-to-date list of messages, in which case the prior MSC is executed again. Once the list is seen by the user, he may decide to read the contents of an e-mail. No matter which server that e-mail came from, the aggregator application is aware of which server should be accessed, so it starts a dialogue with it in order to follow all the links needed to retrieve the body of that e-mail and present it to the user. This is represented in figure 4.

Replying to, forwarding, deleting or moving mails through folders is also possible by properly filling in the form provided by each server. This form has to be retrieved and filled in just as if it had been requested by a browser. The final results are presented to the user in response to his request.
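A rough sketch of that form-replay step (URLs and field names are hypothetical): the wrapper fetches the server's own reply form, copies its pre-filled fields, including the hidden ones that carry session state, overwrites the fields the user wants to change and posts the form back, just as a browser would.

    # Replay a server-provided form with some fields filled in.
    import requests
    from lxml import html

    s = requests.Session()
    form_page = html.fromstring(s.get("https://webmail.example/reply?msg=42").text)

    # Copy every field the server pre-filled (including hidden session fields)...
    fields = {inp.get("name"): inp.get("value", "")
              for inp in form_page.xpath('//form[@id="reply"]//input[@name]')}
    # ...then overwrite the ones the user actually wants to change.
    fields["body"] = "Thanks, see you tomorrow."

    s.post("https://webmail.example/reply?msg=42", data=fields)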

5 Acknowledgements

The work reported in this paper has been partially fundedby the project TEL1999-0207 of the Spanish Ministry ofScience and Research.

References

[1] Aucland online auctions. www.aucland.com.
[2] eBay online auctions. www.ebay.com.
[3] P. Atzeni, G. Mecca, and P. Merialdo. Semistructured and structured data in the Web: Going back and forth. In Workshop on Management of Semistructured Data, 1997.


Figure 3. Wrapper differences (MSCs "Wrapper 1" and "Wrapper 2": the Yahoo! wrapper asks for the list and obtains results directly, while the Hotmail wrapper must follow an extra link through a transitional page)

Figure 4. Reading an e-mail (MSC with instances User, Aggregator, Yahoo! and Hotmail; the aggregator fetches message m from the server that holds it)

[4] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy Web data-sources using W4F. In International Conference on Very Large Databases (VLDB), 1999.
[5] L. Cardelli and R. Davies. Service combinators for Web computing. Software Engineering, 25(3):309-316, 1999.
[6] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. In 16th Meeting of the Information Processing Society of Japan, pages 7-18, Tokyo, Japan, 1994.
[7] D. Buttler, L. Liu, and C. Pu. A fully automated extraction system for the World Wide Web. In IEEE ICDCS-21, April 16-19, 2001.
[8] D. Florescu, A. Grunhagen, and D. Kossmann. XL: An XML programming language for Web service specification and composition. In 11th WWW Conference, 2002.
[9] H. Gupta. Selection of views to materialize in a data warehouse. In ICDT, pages 98-112, 1997.
[10] J. Hammer, H. García-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos. Template-based wrappers in the TSIMMIS system. Pages 532-535, 1997.
[11] ITU-T. Recommendation Z.100: Specification and Description Language (SDL). In Formal Description Techniques (FDT), Geneva, Switzerland, 1993.
[12] ITU-T. Recommendation Z.120: Message Sequence Chart (MSC). In Formal Description Techniques (FDT), Geneva, Switzerland, 1997.
[13] V. Josifovski. Design, implementation and evaluation of a distributed mediator system for data integration, 1999.
[14] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for Web information sources. In ICDE, pages 611-621, 2000.
[15] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93-114, 2001.
[16] J. Myllymaki. Effective Web data extraction with standard XML technologies. In World Wide Web 10th Conference, Hong Kong, pages 689-696, 2001.
[17] D. Raggett. Clean up your Web pages with HTML Tidy. Poster, 7th International World Wide Web Conference.
[18] S. Raghavan and H. Garcia-Molina. Crawling the hidden Web. In Proceedings of the 27th International Conference on Very Large Databases, 2001.
[19] W3C. XML Path Language (XPath) Version 1.0. W3C Recommendation, 16 November 1999.
[20] W3C. XML Path Language (XPath) 2.0. W3C Working Draft, 15 November 2002.
