
Supporting End-Users in the Creation of Dependable Web Clips

Sandeep Lingam
University of Nebraska-Lincoln

Lincoln, USA
[email protected]

Sebastian Elbaum
University of Nebraska-Lincoln

Lincoln, USA
[email protected]

ABSTRACT

Web authoring environments enable end-users to create applications that integrate information from other web sources. Users can create web sites that include built-in components to dynamically incorporate, for example, weather information, stock-quotes, or the latest news from different web sources. Recent surveys conducted among end-users have indicated an increasing interest in creating such applications. Unfortunately, web authoring environments do not provide support beyond a limited set of built-in components. This work addresses this limitation by providing end-user support for “clipping” information from a target web site to incorporate it into the end-user site. The support consists of a mechanism to identify the target clipping with multiple markers to increase robustness, and a dynamic assessment of the retrieved information to quantify its reliability. The clipping approach has been integrated as a feature into a popular web authoring tool on which we present the results of two preliminary studies.

Categories and Subject Descriptors

D.2.6 [Software Engineering]: Interactive Environments; D.2.5 [Software Engineering]: Testing and Debugging; H.5.2 [Information Interfaces and Presentation]: User Interfaces

General Terms

Experimentation, Reliability, Human Factors

Keywords

Web authoring tools, Dependability, End-users

1. INTRODUCTION

Web authoring environments have enabled end-users who are non-programmers to design and quickly construct web pages. Today, end-users can create web pages of increasing sophistication with simple drag-drop type operations, while the authoring environment keeps the underlying code infrastructure hidden. Consider the example of Tom, the owner and sole employee of a travel agency, who arranges airline tickets, car rentals, hotel reservations and package tours for his local customers. Tom has designed his own site, and since he cannot afford to pay travel consolidators and specialized programmers, his daily routine consists of monitoring various airlines’ and travel web sites for low-price alerts, checking the availability and prices of rental cars and hotel rooms for his customers, and maintaining updated information about package tours from various operators in his web site. In order to obtain that information, Tom has to go through the tedious process of navigating to the appropriate web page, filling in all the relevant details, extracting sections of desired information and integrating them into his own site.

Several aspects of the process just described could be automated, freeing the end-user to perform other activities while reducing the chances of introducing errors in this repetitive task. At a high level, we envision Tom specifying what data to obtain from what web sites and how those pieces of data should be included in his web site; an automated process would then create Tom’s site and the necessary components so that the target site’s content is automatically retrieved and integrated. Tom’s situation is a common one, experienced by several end-users who wish to repeatedly retrieve, consolidate, and tailor content such as search results, weather and financial data from other web sources. In a recent survey conducted by Rode and Rosson [21] to determine the types of web applications non-professional programmers would be interested in, 34% of the subjects surveyed expressed interest in applications that required facilities for custom collections of data generated by other web applications.

Still, the process of finding, extracting and integrating data from multiple web sites remains laborious. Several commercial web-authoring environments such as Microsoft Frontpage [7] and Dreamweaver [6] enable end-users to create increasingly sophisticated web applications capable of integrating information from various web sources in the form of built-in components. These web-authoring tools, however, do not support integration of information beyond a limited set of components. End-users attempting to retrieve information from sites not covered by these built-in components receive no further support from web authoring tools. Furthermore, deploying such functionality requires a level of technical expertise above that of an end-user like Tom. Our work addresses this limitation by assisting end-users in the creation of “Web clips”.

Web clips are components within the end-user web site which dynamically extract information from other web sources. For instance, in the example described earlier, Tom’s site would include several web clips, each extracting and incorporating information from a target web site as shown in Figure 1.


Figure 1: Tom’s travel site

There are several challenges that must be addressed to provide adequate support for end-users like Tom. First, although we can assume that the end-user is familiar with web authoring tools, we cannot expect end-users to have any programming experience or desire to learn programming. This implies that the support mechanisms for the creation of web clips must be transparently integrated within web-authoring tools, while the underlying support infrastructure and code necessary to deploy the web clips must be automatically generated and hidden from the end-user.

Second, it is reasonable to expect that the target web sources of information will change in ways that go beyond content. For example, a web source may present data in a new format, or in a different sequence, or not include it at all. A deployed web clip must be able to at least detect and alert the end-user about the degree to which the changes in the web source may impact their web site. Following Tom’s scenario, if his web clip is not robust enough to detect changes in the structure of the source web site, it might end up retrieving “cruise deals” instead of “flight deals”, thereby resulting in inconvenience, delay or even a transaction loss because of incorrect information. At the same time, we would like web clips to be resilient to, for example, cosmetic changes, to reduce the number of times a web clip must be created. Hence, it is desirable to develop clips that not only accurately identify what to clip but are also robust in the presence of changes.

To address these challenges, we have designed and prototyped an approach to support an end-user in creating a dependable web clip. The approach is unique in three fundamental aspects: 1) it maintains the web authoring tool interface metaphor so that the end user keeps operating as usual through the standard directives while imposing no additional programming requirements, 2) it increases the robustness of the web clip through a training procedure in which the end user is iteratively queried to infer multiple markers that can help to identify the correct data in the presence of change, and 3) it deploys multiple filters to increase the confidence in the correctness of the retrieved information, and assessment code to provide a measure of correctness associated with the retrieved and integrated data.

2. RELATED WORK

There have been numerous research efforts addressing automated extraction and customization of content from web sites, which we now proceed to summarize.

Multiple toolkits and specialized languages enabling content extraction through manual and automated wrapper generation have been discussed by Kuhlins et al. [22] and Laender et al. [23]. On the user end, stand-alone languages such as Jedi [18] and W4F (World Wide Web Wrapper Factory) [31] provide sophisticated modules for generating wrappers that enable web content extraction. Furthermore, GUI-based toolkits such as Lapis [26] and Sphinx [25] enable end-users who lack any programming skills to create wrappers by providing enhanced user interaction. A different type of extraction mechanism is provided by web site companies in the form of web feeds and web services. Web-based feeds such as RSS [8] provide web content or summaries of web content together with links to the full versions. Web feeds originated with news and blog sites, but more web sites are now making content available through them. Information is usually delivered as XML content, organized in a pre-defined format, which could be collected by feed aggregators. Web services [9], on the other hand, enable the creation of Mashups [20], which combine information from multiple web sites into a single one using public interfaces and APIs provided by web site companies.

There are several instances of techniques and tools that operate as web-browser extensions and provide end users with what are essentially web macro recorders of different sophistication. Smart bookmarks are shortcuts to web content that would require multiple steps of navigation to be reached. Systems supporting smart bookmarks, such as WebVCR [11], are capable of recording the steps taken during navigation, which could later be replayed automatically to reach the target web page. Furthermore, some, such as WebViews [16], also enable some level of content customization to, for example, fit smaller screens. Tools such as Greasemonkey [2], Chickenfoot [14] and Robofox [33] enable end-users to, for example, customize the look and content of a web page, associate actions with triggers in observed web sites, fill and submit forms, and filter content. Essentially, these tools provide programming language capabilities with different levels of abstraction, each for a slightly distinct range of users, to manipulate and utilize web page content as perceived by the user through the browser.

Stand-alone research systems such as Hunter Gatherer [32], Internet Scrapbook [34] and Lixto [13] enable end-users to create “within-web-page collections” which can contain text, images, links and other active components like forms. These tools enable users to make selections of the desired content within the web page as it is visited. A slightly different approach is taken by web-integration frameworks such as InfoBeans [12], CCC [17], and Robosuite [4], which enable users to create several configurable components that encapsulate information or specify an information gathering and providing process. Several such components could be integrated into a common interface to create an application where information obtained from one component can be used to drive other components. One particular feature that is worth noting in these systems is example-based clipping guided by a training process where the user demonstrates the selection while the system generates a unique characterization of the selected content. This approach partly inspired our training component.


Figure 2: Outline of web clip creation process

Characteristics such as document structure, XPATH and page fragmentation have been used by the systems discussed in [27, 15] to mark the clipped content. Alternate approaches use visual patterns within web pages [19] and tree-edit distances [30] (based on the tree structure of a web page) to identify the selected content. The component of our approach that aims to identify and mark the entities to be clipped combines the merits of the above approaches.

It is worth noting that although sufficient correctness and robustness have been cited as critical aspects of applications that integrate information from other web sources [29], the approaches just described have not directly addressed these issues. Furthermore, most of the approaches discussed above are either stand-alone applications, have been integrated into web browsers, or are constrained by the availability of a specific API or content. Our work addresses that particular niche: the dependability aspects associated with clips developed through web authoring environments that are intended to be published by end users.

3. OUR APPROACH: WEB CLIPPER

The primary goal of web clipper is to enable an end-user to create a dependable web clip without imposing additional requirements on, or expecting higher levels of expertise from, the end user. The following objectives were defined in order to achieve this goal:

1. To develop and integrate an intuitive and user-friendly selection process into the web-authoring environment that would enable an end-user to create a clipping from any given web site with ease.

2. To enable training of the web clip to better identify the clipped content within the structure of the target web site.

3. To provide a measure of the validity of the extracted information once the web clips are deployed.

4. To alert the user about structural changes in the “clipped” web site that may affect the reliability of information collected.

Figure 2 gives an overview of the various steps involved in web clip creation: clipping, training, deployment, filtering and assessment. Clipping enables the end-user to make a selection within the target web page and is followed by a training session which generates several valid extraction patterns capable of identifying the clipped content. The deployment stage results in the creation of several scripts which dynamically retrieve and filter content to create the target web clip. It is followed by an assessment of the validity of information within the generated web clip.

3.1 Clipping

The clipping process enables the end-user to select data from a target web page. Consider the example of Tom discussed earlier. One of the web clips that Tom could be interested in is the “sample airfares” section on travelocity.com. There are two basic requirements for the creation of this web clip, which are described in the following sections.

3.1.1 Target Clip Selection

Since we have prototyped our approach in Microsoft Frontpage, we will utilize this tool’s context to explain our approach. Clipping is available in the tool in the same manner as several other HTML controls, so the incorporation of the target web clip into the user’s web page is provided through a simple drag-and-drop operation or by a menu selection. The web clipper control has a custom browser associated with it, which enables the user to navigate to the desired web page and clip content from it as shown in Figure 3.

The selection operation is supported by visual feedback in the form of a border around the corresponding HTML element over which the user is hovering. As the user moves the mouse, every extractable document element is highlighted and the user can make a selection by clicking on it. Options to refine a selection, such as zoom in and zoom out, are available through a context menu. These operations are based on the DOM [1] structure of the HTML document where, for example, a zoom out operation selects the first visible HTML element which encloses the selected element. Move up and move down are also available to enable a more detailed DOM traversal.
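
To make the selection mechanics concrete, the following is a minimal sketch, in plain browser JavaScript, of how hover highlighting, click selection, and the DOM-based zoom out operation could be implemented. It is illustrative only; the outline styling and the zoomOut helper are our own names and simplifications, not the prototype's actual code.

    // Track the element the user has clicked as the current clip selection.
    let selectedElement = null;

    // Highlight the element under the mouse with a border, as the custom browser does.
    document.addEventListener('mouseover', function (e) {
      e.target.style.outline = '2px solid blue';
    });
    document.addEventListener('mouseout', function (e) {
      e.target.style.outline = '';
    });

    // A click records the hovered element as the user's selection.
    document.addEventListener('click', function (e) {
      e.preventDefault();
      selectedElement = e.target;
    });

    // "Zoom out" selects the first visible HTML element enclosing the current selection.
    function zoomOut(element) {
      let parent = element.parentElement;
      while (parent && parent.offsetWidth === 0 && parent.offsetHeight === 0) {
        parent = parent.parentElement;
      }
      return parent || element;
    }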

3.1.2 Extraction Pattern

Once a selection is made, an extraction pattern is generated. An extraction pattern constitutes a unique characteristic of the user’s selection that can later be used to dynamically extract it from the target web page. During the clipping process, the user’s selection is uniquely identified by its HTML-Path. HTML-Path is a specialized XPATH [10] expression which uniquely identifies any HTML-document portion corresponding to a DOM tree node. For example, the HTML-Path corresponding to the “sample airfares” section on travelocity.com shown in Figure 3 enables the identification and extraction of that section within the web page. The URL of the clipped web page and the HTML-Path of the clipped section are extracted during the clipping process and will be later embedded (among other components) in the user’s web page for dynamic extraction of content after deployment.
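
Since an HTML-Path is internally an XPATH expression, the embedded extraction code can re-locate the clipped node through the browser's XPath support. The sketch below is an assumption about how that lookup might look; the path shown is a made-up example, not the actual path captured for the travelocity.com clip.

    // Evaluate a stored HTML-Path against a (parsed) target document.
    function extractByHtmlPath(doc, htmlPath) {
      const result = doc.evaluate(htmlPath, doc, null,
                                  XPathResult.FIRST_ORDERED_NODE_TYPE, null);
      return result.singleNodeValue;      // null if the path no longer matches
    }

    // Hypothetical path to the table holding a clipped section.
    const clipNode = extractByHtmlPath(document, '/html/body/div[2]/table[1]');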

3.2 Training

Although an extraction pattern based on HTML-Path is capable of uniquely identifying a web page element, it is not quite robust.


Figure 3: Clipping process

Even a small change in the structure of the web page might affect the content retrieved by a web clip based on just the HTML-Path. For instance, in the example described in the previous section, if a <TABLE> element is introduced before the “sample airfares” section, the clipping based on HTML-Path will extract the newly inserted <TABLE> element instead of the selected one, thereby returning incorrect results.

The aim of the training session is to increase the robustness of the web clip by constructing several extraction patterns which uniquely characterize the end-user selection, such that if one of them fails, one of the others might be successful in identifying the selection. Once the clipping process is complete, the user can proceed by clicking “Train” to start a training session (see Figure 4). During the training session, several clips, created using different extraction patterns, are shown to the user, one after the other. The user either approves or disapproves the clips based on his or her expectations, thereby generating several unique characterizations for the desired clip. Every time the user marks a clipping as valid, the system generates a filter corresponding to the clipping. Filters are basic snippets of JavaScript code that will be embedded within the user’s web page, and which are capable of identifying the clipped content within the target web page using an extraction pattern.

For instance, continuing with the example discussed earlier (Figure 3), let us assume that the “sample airfares” section is always preceded by the element <DIV> which contains a label for the section, “Sample Round-trip fares between:”. Thus, one of the extraction patterns would be a <TABLE> element that immediately follows a <DIV> with the text “Sample Round-trip fares between:”. During the training session, the above extraction pattern is applied to the web page and the first table matching the pattern is returned as the web clip. This is presented to the user by highlighting the clipped element with a blue border.

Figure 4: Training session

The user reviews the response and indicates its correctness by clicking on the appropriate button as shown in Figure 4. In case of a correct response, the extraction pattern becomes a unique characterization of the web clip, which could be used in addition to HTML-Path to extract the desired content. The resilience of the web clip to changes within the target web page is dependent upon the existence of multiple valid extraction patterns, hence the user is encouraged to train the system better in order to obtain better results.
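
As a concrete illustration of what one such filter could look like, the snippet below implements the label-based pattern from the example above: it returns the first <TABLE> that immediately follows a <DIV> whose text is “Sample Round-trip fares between:”. This is a sketch under our own naming; the prototype generates its filters from templates whose exact code we do not reproduce here.

    // One generated filter: label-based extraction pattern.
    function labelFilter(doc) {
      const divs = doc.getElementsByTagName('div');
      for (let i = 0; i < divs.length; i++) {
        const isLabel = divs[i].textContent.trim() === 'Sample Round-trip fares between:';
        const next = divs[i].nextElementSibling;
        if (isLabel && next && next.tagName === 'TABLE') {
          return next;                    // candidate web clip
        }
      }
      return null;                        // the pattern no longer matches the page
    }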

Table 1 summarizes the characterization heuristics we used as bases for extraction patterns in our prototype.


HTML-Path
  Rationale: HTML-Path provides direct access to the clipped section within the target web page.
  Usage and internal representation: HTML-Path evaluated based on position of clipped element relative to the document root. Internally represented as an XPATH expression.
  Example: The sample airfares section on travelocity.com was extracted using its path as shown in Figure 3.

HTML Element Id
  Rationale: The ‘ID’ attribute is likely to be unique and is often used by web pages for updating content or as a stylesheet indicator.
  Usage and internal representation: The ‘ID’ attribute for the clipped element is captured and stored. The clipped element is directly referenced using ‘ID’ during dynamic extraction.
  Example: Articles on slashdot.org are usually enclosed in a <div> with ‘ID’ “articles” and were clipped using this characteristic.

Class name and other attributes
  Rationale: A combination of properties of the clipped element, such as class name, style, action, alt, could enable its identification.
  Usage and internal representation: Absolute indexes of elements in HTML-Path are replaced by name-value pairs of attributes to form HTML-Path by expressions. These are internally represented as specialized XPATH expressions.
  Example: The “top stories” section on times.com was identified by the HTML-Path by expression /html/body/div[2]/div[2]/div[1]/table/tbody/tr/td[2]/div[1]/div[1]/div[@class=‘aColumn’].

Enclosing element
  Rationale: Certain sections within web pages are easily identified by their container sections.
  Usage and internal representation: Relative XPATH expressions enable identification of clipped sections [27]. Surrounding/enclosing elements which can be referenced directly using ‘IDs’ or ‘class names’ are used to form relative XPATH expressions.
  Example: The form to obtain maps on maps.yahoo.com was extracted using the <div> enclosing it, which has its ID attribute set to “ymap-form”.

Neighboring element
  Rationale: Visual patterns such as adjacent or nearby sections, side panels, images, and advertisements can enable identification of the clipped section [19].
  Usage and internal representation: Characteristics of nearby elements such as content type, attributes, and distance from the clipped section enable its identification. Information pertaining to the above (tag name, attributes, distance, content type) is captured and internally represented as pattern-value pairs.
  Example: The “stock indexes” section on nyse.com is usually preceded by options to e-mail or print the page, and followed by an advertisement with an image.

Labels
  Rationale: Headings and labels of sections describe the nature of content within the clipping and could be used as identifiers for clipped content.
  Usage and internal representation: Sections containing text within emphasis tags such as h1-h6, b, span, i, em and shorter than a specific length are recognized as labels. Labels can be contained within the clipped section or within nearby elements and are represented by a combination of regular expressions and pattern-value pairs.
  Example: A label “Right Now For” usually precedes the weather information for a particular zip code on weather.com and enabled its identification.

Links
  Rationale: Hyperlinks within or near the clipped content could be a characteristic feature of a clipping.
  Usage and internal representation: Hyperlinks within or near the clipped sections are captured, and information pertaining to them such as link text, image attributes, referenced URL and distance is stored to enable dynamic identification of the clipped section.
  Example: The links ‘Web’, ‘News’, and ‘Images’ enabled characterization of Google’s search form.

Content structure
  Rationale: Structure of content may enable identification of a particular section within an entire web page.
  Usage and internal representation: A few patterns pertaining to the structure of content in web pages, such as number of rows/columns in a table, items in a list, and number of siblings or children, have been identified and are currently used by the prototype for clipping.
  Example: The “sample airfares” section on travelocity.com was clipped using the characteristic that it was the only table within the web page with exactly 3 columns (see Figure 3).

Table 1: A list of characteristics used as extraction patterns

For each characterization type we provide its rationale, usage, internal representation and an example where it can be identified within the sites we analyzed in Section 4. It must be noted here that this set of extraction patterns is easily extendable [24].

3.3 Deployment

The clipping and training processes are followed by the deployment of the user’s web page with the clipped content. The URL and extraction patterns captured from the previous stages are used to generate an AJAX [28] script that is linked to the user’s web page. This script uses the XMLHttp object to retrieve content dynamically from the target web pages and incorporate it into the end-user’s web page. Since retrieved HTML documents may be inconsistent with the HTML specification, their content is converted to XHTML to enable construction of a DOM tree and application of DOM and DHTML methods.

Any relative URLs are converted to absolute URLs to ensure that links are not broken. Therefore, clicking a hyperlink or submitting a form would take the user to the target web page. By default, the web clip’s style is maintained by importing the stylesheets from the target site, but it can be tailored to match the user’s style by pointing to a different stylesheet set. Any JavaScript associated with the target web clip is either imported or referenced to ensure that features such as validation checks and cookies are not lost due to the creation of the web clip.
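
The following is a minimal sketch of the kind of retrieval script described above. It assumes the target page can be fetched from the end-user site (modern browsers restrict cross-site XMLHttpRequest, so a same-origin proxy may be needed), and the function and parameter names are ours, not the generated script's.

    // Retrieve the target page, apply a filter, fix relative URLs, and insert the clip.
    function loadWebClip(targetUrl, containerId, filter) {
      const xhr = new XMLHttpRequest();
      xhr.open('GET', targetUrl, true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState !== 4 || xhr.status !== 200) { return; }
        // Parse the response so DOM methods (and the filters) can be applied to it.
        const doc = new DOMParser().parseFromString(xhr.responseText, 'text/html');
        const clip = filter(doc);        // e.g. the labelFilter sketched earlier
        if (!clip) { return; }
        // Convert relative URLs to absolute ones so links and images are not broken.
        clip.querySelectorAll('a[href]').forEach(function (a) {
          a.setAttribute('href', new URL(a.getAttribute('href'), targetUrl).href);
        });
        clip.querySelectorAll('img[src]').forEach(function (img) {
          img.setAttribute('src', new URL(img.getAttribute('src'), targetUrl).href);
        });
        document.getElementById(containerId).appendChild(document.importNode(clip, true));
      };
      xhr.send(null);
    }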

In addition to the AJAX scripts to support the data retrieval, JavaScript filters are generated from pre-defined templates for each of the extraction patterns gathered during training. Validation scripts to enable filter assessment, as explained in Section 3.4, are also embedded within the user’s web page. Filters and validation scripts enable the extraction and assessment of the clipped content, which is placed within an iframe element within the user’s web page.


Figure 6: Clips generated by filters using HTML-Path, Content Structure, and Labels

Just like any other HTML control, the user can move, resize or even annotate the web clip to suit his/her preference. This entire process is automatically triggered upon the completion of the training session and runs in the background without the user having to worry about the mechanics of the process. Web clips thus generated can currently be viewed in Internet Explorer (6.0 and above) and Mozilla Firefox (1.5 and above).

3.4 Filtering and Assessment

After deployment, when the user-designed or updated web page is loaded in a browser, a filter score is computed for each filter generated by a valid extraction pattern during training, by analyzing its performance in the presence of the other available filters. If a filter retrieved valid data in the presence of another filter, then its filter score is increased; otherwise it remains the same. At run-time, the filter with the maximum filter score is assumed to be the most trustworthy, so it is used to generate the target web clip (any ties in the filter scores are broken by random selection).
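
One plausible reading of this scoring scheme, consistent with the numbers in Table 2, is sketched below: a filter's score counts how many of the training-time patterns still hold for the clip that this filter retrieves. The pattern objects, with extract(doc) and matches(element) methods, are an assumed representation rather than the prototype's actual interface.

    // Score every filter and pick the clip from the highest-scoring one.
    function selectBestClip(doc, patterns) {
      const scored = patterns.map(function (pattern) {
        const clip = pattern.extract(doc);           // candidate clip for this pattern's filter
        const score = !clip ? 0
            : patterns.filter(function (p) { return p.matches(clip); }).length;
        return { pattern: pattern, clip: clip, score: score };
      });
      const max = Math.max.apply(null, scored.map(function (s) { return s.score; }));
      const best = scored.filter(function (s) { return s.score === max; });
      // The filter with the maximum score generates the web clip; ties break randomly.
      return best[Math.floor(Math.random() * best.length)];
    }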

For example, Figure 5 gives the web clip that Tom had initially created during the clipping process. Figure 6 gives the web clips generated a few days later by 3 different filters.

Figure 5: Tom’s desired web clip

Filters / Extraction patterns   Content Structure   HTML-Path   Label   Filter score
Content structure filter                1               0         0          1
HTML-Path filter                        0               1         0          1
Label filter                            1               0         1          2

Table 2: Filter scores and valid extraction patterns

Table 2 gives a summary of the extraction patterns observed during training and after deployment, and the scores of the corresponding filters. The filters for the clips shown in Figure 6 were generated based on content structure, HTML-Path and labels, respectively. In Figure 6a we see that though the content structure is the same as in the pattern marked valid during the training process, the label and the HTML-Path are not the same. In this example we note that the filter using labels (third row) had the highest filter score, as most of the patterns marked valid during training were observed even after deployment.

To provide the user with a measure of the validity of the content retrieved, we define confidence as the ratio of the maximum filter score to the total number of valid extraction patterns generated during the training session. For example, if six extraction patterns were marked valid during the training process, but only five extraction patterns are valid after deployment for the web clip generated by a particular filter, then the confidence score would be 0.83. Confidence thus indicates the degree of similarity between the patterns marked valid during the training process and the patterns observed after deployment. A greater similarity in patterns results in a higher confidence in the content within the dynamically generated web clip.

The current prototype provides this confidence information either by displaying it through the automated annotation of the web site with a Validity Assessment Bar above each web clip (as illustrated in Figure 7), or by reporting it to a log file. The prototype is also able to alert the user about changes in the patterns that were observed during the training process, as this might enable the user to make a further assessment of the dependability of the clip. The user can also configure the web clips to provide alerts when the confidence scores fall below a particular threshold.
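
A minimal sketch of the confidence computation and the threshold-based alert described above; the function name and the threshold parameter are our own illustrative choices.

    // Confidence: maximum filter score over the number of patterns marked valid in training.
    function assessConfidence(maxFilterScore, validTrainingPatterns, alertThreshold) {
      const confidence = validTrainingPatterns === 0
          ? 0 : maxFilterScore / validTrainingPatterns;     // e.g. 5 / 6 = 0.83
      return { confidence: confidence, alert: confidence < alertThreshold };
    }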


Figure 7: Sample web clip with the Validity Assessment Bar at the top

4. EVALUATION

The aim of our work has been to assist an end-user in creating dependable and robust web clips. To evaluate the effectiveness of web clipper in achieving this objective, the first author of the paper generated and tested web clips from 25 different web sites to assess:

• Effectiveness of the extraction patterns used in generating web clips

• Dependability of web clips in providing sufficiently correct information over time, and

• Robustness of web clips to changes in the clipped web site

4.1 List of Web sites

Web clips were generated from 25 popular web sites for evaluation purposes. The web sites used for evaluation were chosen for several reasons. First, they had content which gets updated periodically. Second, as the primary focus of our work has been on assisting end-users, we chose web sites which an end-user like Tom might access. Finally, some of these web sites are among the top commercial web sites having high hit counts [5, 3]. Web clips created from these sites include text, links, images, forms and dynamically created content.

4.2 Effectiveness of Extraction Patterns

There are several extraction patterns which could potentially be used to characterize users’ clippings, such as the ones described in Section 3.2. Clippings were generated from the 25 web sites to assess the effectiveness of these extraction patterns. The chart in Figure 8 gives an indication of the applicability of the various extraction patterns to the 25 web sites. Results indicate that many of the web clips could be ‘clipped’ using more than one extraction pattern. In all the observed sites, HTML-Path was a valid extraction pattern, as every page element has a unique path within an HTML document.

Figure 8: Applicability of extraction patterns for clipping

Extraction patterns based on HTML-Path by expressions were found to be very effective in characterizing clippings, as several of the observed sites such as slashdot.org, travelocity.com, and espn.com used “classes” and “ids” for logical separation of sections. Labels and content structure were also very effective as extraction patterns, characterizing clippings in more than 50% of the sites. usaweather.com, southwest.com, nasdaq.com, and yahoo had a well-defined central structure within the web page for clipped content annotated by explanatory labels. Neighboring and enclosing elements characterized nearly 50% of the clippings in sites such as careerbuilder.com, nyse.com and bettycrocker.com, where navigation panels, side bars and bounding elements served as extraction patterns. Links were not widely applicable as extraction patterns as they were either not present or were not unique to the clipped element. Only a couple of sites, such as astrology.yahoo.com and BBC sports, had links which could characterize the clipping.

Overall, usaweather.com and travelocity.com generated the highest number of valid extraction patterns (six) during the training session. The average number of extraction patterns valid for the observed sites was close to 4. However, the existence of multiple extraction patterns for every observed web site enabled us to generate multiple filters for clipping content, thereby increasing the probability of successful retrieval of desired information in the presence of changes in the target site.

4.3 Dependability of web clips

To assess the dependability of web clips in providing sufficiently correct information over time, the generated web clips were monitored for a period of 16 weeks. Each web clip was periodically refreshed and the confidence score was recorded. Table 3 gives the biweekly confidence scores for the clipped web sites that had undergone changes during the period. Most of the web clips were found to be quite stable, retrieving the desired information despite changes to the target web site. The average confidence score was nearly 85%.

Certain extraction patterns enabled the identification of the clipping during the training process but were not characteristic features of the clipped section.


Web sites / Week                1     3     5     7     9    11    13    15   Type of Change
www.careerbuilder.com         100   100   100    75   100    75    75    75   HTML-Path
hotjobs.yahoo.com             100    67    67    67    67    67    67    67   Label
www.cricinfo.com               67    67    67     0     0     0     0     0   Re-designed
www.amazon.com                100   100    67    67    67    67    67    67   Content structure
www.jcpenney.com              100   100   100    80    80    80    80    80   Links
www.epicurious.com            100    75    75    75    75    75    75    75   HTML-Path
www.bettycrocker.com          100   100    83    83    83    83    83    83   Neighbors
astrology.yahoo.com           100    75    50    50    50    50    50    50   HTML-Path + Links
horoscopes.astrology.com       75    75    75    75    50    50    50    50   HTML-Path + Neighbors
weather.yahoo.com             100   100   100   100   100   100   100     0   Re-designed
usa.weather.com               100   100   100   100   100   100    83    83   Neighbors
www.techdirt.com              100    75    75    75    75    75    75    75   Neighbors
www.nyse.com                  100   100   100    67    67    67    67    67   HTML-Path
www.health-fitness-tips.com/  100    75    75    75    75    75    75    75   Enclosing Element

Table 3: Biweekly confidence scores

For example, labels containing the dates on which the web pages were viewed, titles of news and recipe sections, and the number of items within lists were among the extraction patterns observed for sites such as yahoo jobs, epicurious.com and amazon.com. These were eliminated within a short time after deployment, after which the confidence scores stabilized for these sites. Certain design changes, such as the introduction of additional sections, navigation aids and sidebars, affected the HTML path, neighbors and enclosing elements of careerbuilder.com, astrology.com and bettycrocker.com. New advertisements also affected sites such as espn.com and usaweather.com. Two web sites, yahoo weather and cricinfo.com, underwent a major redesign, resulting in the web clips not being able to retrieve the desired information and reporting a zero confidence value. Overall, the web clips were able to retrieve the originally clipped content in most cases or were appropriately annotated with confidence scores, enabling an end-user to determine whether the information was sufficiently correct to use.

4.4 Robustness

The previous study observed real web sites over a period of time, but did not take into consideration the degree to which they change. We now present a more controlled study in which we manipulated the sites through a series of changes to measure the robustness of the generated clips in the presence of such changes.

For evaluation purposes, we define a ‘block’ as an HTML element which is similar in structure to the clipped element, having the same tag name and the same type of content (text, images, links). For example, Figure 9 shows a new block, the “Hot Deals” section, inserted above the initially created web clip of the “Sample round-trip fares between” section. Blocks were created either from existing elements or by modifying elements within the target web page. The following structural modifications were made to the web pages to assess the robustness of web clips:

• Block Insertion. A new block, similar in structure to the clipped element, was inserted into the web page immediately above the clipped element, as shown in Figure 9 where the “Hot Deals” section is inserted into the web page from which the “Sample Round-trip fares between” section was initially clipped. This would test the ability of web clips to extract the desired information when content is inserted into the web page.

• Block Movement. Reorganization of content within a web page might result in sections of information moving within the page. Therefore, we tested the robustness of web clips to block movements within the clipped web page. Block movement involved moving one of the existing blocks that was similar in structure to, and located above, the clipped element to immediately below the clipped element. In cases where the block below the clipped element was similar to it, the block below the clipped element was moved immediately above it. If there was no block similar to the clipped element, an element with the same tag name and same type of content was considered as a block and moved either immediately above or below the clipped element, based on its initial position.

• Block Deletion. Web clips were also tested for robustness against block deletions. To test for block deletions, the nearest block similar in structure to, and immediately above, the clipped element was deleted. In the absence of such a block, an element above the clipped section with the same tag name and similar content was deleted.

• Enclosing Element Changes. We modified the element enclosing the clipped element to determine the effect it would have on the content returned by the web clip. This helped us evaluate the robustness of web clips to changes in HTML-Path and other characteristics such as the enclosing element and element Id. The enclosing element was modified by changing its tag name, and by removing any of its attributes that could be used for its identification.

Figure 9: Block (a section similar to the clipped section)


Web sites / Structural Change   Block Insertion   Block Movement   Block Deletion   Enclosing Element Change   Target Clipping Removed
careerbuilder                          50               100              100                  50                        25
hotjobs.yahoo                          33               100              100                  67                         0
cricinfo                               67                67               67                  67                        33
sports.espn.go                         25                50               50                  50                        25
bbc.co.uk/sport                        67                67               67                  67                         0
walmart                                67                67               67                  67                         0
amazon                                 67                67               67                  67                        67
bestbuy                                67               100              100                  67                         0
jcpenney                               60                80              100                  80                        20
epicurious                             75               100               75                  50                        25
bettycrocker                           67                67               67                  50                        17
recipecenter                           75                75               75                  50                        75
astrology.yahoo                        75               100               75                  75                        25
horoscopes.astrology                   50                50               25                  75                         0
travelocity                            50                50               67                  33                        50
southwest                              75                75               75                  75                        25
rentalcars                             67               100              100                  67                         0
weather.yahoo                          50               100              100                  50                         0
usa.weather                           100               100              100                  67                         0
slashdot.org                           80                80               80                  40                        40
techdirt                               75                50               25                  50                         0
nyse                                   67                67               67                 100                         0
nasdaq                                 50                50               50                  50                        25
health-fitness-tips                    75                75               75                  50                         0
101lifestyle                           67                67               67                  67                        33

Table 4: Confidence scores observed during controlled study

• Target Clipping Removed. Due to design changes within web sites, it is quite probable that the originally clipped content may be completely removed from the source web page. To test the ability of the generated web clip to alert the user to such changes, we removed the original clipping from the source web page and evaluated the confidence scores.

The user’s web pages with the web clips were reloaded (refreshed) after making the above modifications to determine whether the changes affected the content retrieved by the web clip. Confidence scores of the web clips are indicated in Table 4. All the web clips were able to retrieve the desired content despite the structural changes made to the target web pages in the first four scenarios. The web clips were found to be robust to the modifications, indicating the importance of the training session in generating multiple ways of retrieving the desired content. The presence of multiple extraction patterns enabled the web clips to recover from the failure of one or more extraction patterns and retrieve the desired information using the remaining ones. However, confidence scores varied depending on the effect that the modifications had on the extraction patterns. The removal of the target clipping in the fifth scenario resulted in a significant drop in the confidence scores for the observed sites. A couple of sites, though, such as amazon.com and recipecenter.com, had higher confidence scores due to the presence of multiple sections that could be erroneously identified using the same set of patterns. However, the average confidence score was less than 20%, with seventeen web clips displaying alert messages within the validity assessment bar asking users to recreate their web clips.

5. CONCLUSION

In this paper, we have presented an approach to support end-users through the entire process of creating a dependable web clip. We have integrated web clipper into a popular web-authoring environment, thereby enabling end-users to incorporate web clips as components within their web pages. Web clipper addresses the shortcomings of existing tools by introducing the notion of training of web clips and of dynamic confidence evaluation for retrieved information. Training generates a set of unique characterizations which enable the extraction of desired content in multiple ways, thereby increasing the probability of obtaining desired responses despite structural changes in the target web site. Confidence evaluation provides a measure of the validity of the collected information within the web clip, enabling the user to assess sufficient correctness. A preliminary evaluation of the prototype, consisting of the creation of web clips from web pages of twenty-five commercial web sites and from modified copies of the clipped web pages, has been encouraging. Web clips appear stable over time and robust in the presence of changes in the target sites.

This work points to several research directions that we would like to explore further. First, we would like to actually observe real end-users utilizing web clipper to quantify to what degree it helps them generate more dependable clips. Furthermore, we would like to broaden the range of end-users to include those that may be operating on an intranet site, perhaps using the clipper to extract and convey data originating in a legacy system. Second, many heuristics to improve the characterization of web elements are emerging, and we have just scratched the surface of what can be used to improve the accuracy of our extraction process. Last, we would like to extend the transparency of our clip generation mechanism so that, for example, the end-user can go back to the “original” source page to re-touch it and maintain it without the need for re-clipping.


Acknowledgments

This work was supported in part by NSF CAREER Award 0347518 and the EUSES Consortium through NSF-ITR 0324861.

6. REFERENCES

[1] Document Object Model. http://www.w3.org/DOM, Oct. 2006.
[2] Grease Monkey. http://greasemonkey.mozdev.org, Oct. 2006.
[3] Infoplease - Internet Statistics and Resources. http://www.infoplease.com/ipa/A0779093.html, Oct. 2006.
[4] Kapow Robosuite. http://www.kapowtech.com/pdf/WebIntegrationWP.pdf, Oct. 2006.
[5] KeyNote. Consumer top 40 sites. http://www.keynote.com/solutions/performanceindices/consumerindex/consumer40.html, Oct. 2006.
[6] Macromedia Dreamweaver. http://www.macromedia.com, Oct. 2006.
[7] Microsoft Frontpage. http://www.microsoft.com/office/frontpage, Oct. 2006.
[8] RSS. http://web.resource.org/rss/1.0/spec, Oct. 2006.
[9] Web Services. http://www.w3.org/2002/WS, Oct. 2006.
[10] XPATH. http://www.w3.org/TR/XPATH, Oct. 2006.
[11] V. Anupam, J. Freire, B. Kumar, and D. Lieuwen. Automating Web Navigation with the WebVCR. In International Journal of Computer and Telecommunications Networking, 2000.
[12] M. Bauer, D. Dengler, and G. Paul. Instructible information agents for Web mining. In International Conference on Intelligent User Interfaces, pages 21–28, 2000.
[13] R. Baumgartner, S. Flesca, and G. Gottlob. Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto. In International Conference on Logic Programming and Nonmonotonic Reasoning, pages 21–41, 2001.
[14] M. Bolin, M. Webber, P. Rha, T. Wilson, and R. C. Miller. Automation and customization of rendered web pages. In ACM Symposium on User Interface Software and Technology, pages 163–172, 2005.
[15] B. Christos, K. Vaggelis, and M. Ioannis. A web-page fragmentation technique for personalized browsing. In ACM Symposium on Applied Computing, pages 1146–1147, 2004.
[16] J. Freire, B. Kumar, and D. Lieuwen. WebViews: accessing personalized web content and services. In International Conference on World Wide Web, pages 576–586, 2001.
[17] J. Fujima, A. Lunzer, K. Hornbaek, and Y. Tanaka. Clip, connect, clone: combining application elements to build custom interfaces for information access. In ACM Symposium on User Interface Software and Technology, pages 175–184, 2004.
[18] G. Huck, P. Fankhauser, K. Aberer, and E. Neuhold. Jedi: Extracting and Synthesizing Information from the Web. In International Conference of Cooperative Information Systems, pages 32–42, 1998.
[19] U. Irmak and T. Suel. Interactive wrapper generation with minimal user effort. In International Conference on World Wide Web, pages 553–563, 2006.
[20] A. Jhingran. Enterprise information mashups: integrating information, simply. In International Conference on Very Large Databases, pages 3–4, 2006.
[21] J. Rode and M. Rosson. Programming at runtime: Requirements and paradigms for nonprogrammer web application development. In IEEE Symposium on Human Centric Computing Languages and Environments, pages 23–30, 2003.
[22] S. Kuhlins and R. Tredwell. Toolkits for Generating Wrappers. In International Conference on Objects, Components, Architectures, Services, and Applications for a Networked World, pages 184–198, 2002.
[23] A. Laender, B. Ribeiro-Neto, A. Silva, and J. Teixeira. A brief survey of web data extraction tools. In SIGMOD Rec., volume 31, pages 84–93, 2002.
[24] S. Lingam. Supporting end-users in creation of web clips. Master's thesis, University of Nebraska, Lincoln, NE, May 2006.
[25] R. C. Miller and K. Bharat. SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers. In International World Wide Web Conference, pages 119–130, 1998.
[26] R. C. Miller and B. A. Myers. Integrating a Command Shell into a Web Browser. In USENIX 2000 Annual Technical Conference, pages 171–182, 2000.
[27] M. Kowalkiewicz, M. Orlowska, T. Kaczmarek, and W. Abramowicz. Towards more personalized Web: Extraction and integration of dynamic content from the Web. In Asia Pacific Web Conference, 2006.
[28] L. D. Paulson. Building rich web applications with Ajax. Computer, 38(10):14–17, 2005.
[29] O. Raz and M. Shaw. An Approach to Preserving Sufficient Correctness in Open Resource Coalitions. In International Workshop on Software Specification and Design, page 159, 2000.
[30] D. C. Reis, P. B. Golgher, A. S. Silva, and A. F. Laender. Automatic web news extraction using tree edit distance. In International Conference on World Wide Web, pages 502–511, 2004.
[31] A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using W4F. In International Conference on Very Large Data Bases, pages 738–741, 1999.
[32] M. C. schraefel, Y. Zhu, D. Modjeska, D. Wigdor, and S. Zhao. Hunter gatherer: interaction support for the creation and management of within-web-page collections. In International World Wide Web Conference, pages 172–181, 2002.
[33] S. Elbaum and A. Koesnandar. Robofox. http://esquared.unl.edu/wikka.php?wakka=AboutRobofox, Oct. 2006.
[34] A. Sugiura and Y. Koseki. Internet scrapbook: automating Web browsing tasks by demonstration. In ACM Symposium on User Interface Software and Technology, pages 9–18, 1998.
