

The VLDB Journal
DOI 10.1007/s00778-012-0286-6

SPECIAL ISSUE PAPER

OXPATH: A language for scalable data extraction, automation, and crawling on the deep web

Tim Furche · Georg Gottlob · Giovanni Grasso · Christian Schallhart · Andrew Sellers

Received: 10 February 2012 / Revised: 4 June 2012 / Accepted: 13 July 2012
© Springer-Verlag 2012

Abstract The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed, matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

T. Furche · G. Gottlob · G. Grasso · C. Schallhart · A. Sellers
Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD, UK

Keywords Web extraction · Crawling · Data extraction · Automation · XPath · DOM · AJAX · Web applications

1 Introduction

The dream that the wealth of information on the web is easily accessible to everyone is at the heart of the current evolution of the web. Due to the web’s rapid growth, humans can no longer find all relevant data without automation. Indeed, many invaluable web services, such as Amazon, Facebook, or Pandora, already offer limited automation, focusing on filtering or recommendation. But in many cases, we cannot expect data providers to comply with yet another interface designed for automatic processing. Neither can we afford to wait another decade for publishers to implement these interfaces. Rather, data should be extracted from existing human-oriented user interfaces. This lessens the burden for providers, yet allows automated processing of everything accessible to human users, not just arbitrary fragments exposed by providers. This approach complements initiatives, such as Linked Open Data, which push providers toward publishing in open, interlinked formats.

For automation, data accessible to humans through existing interfaces must be transformed into structured data, for example, each gray span with class source on Google News should be recognized as news source. These observations call for a new generation of web extraction tools, which (1) can interact with rich interfaces of (scripted) web applications by simulating user actions, (2) provide extraction capabilities sufficiently expressive and precise to specify the data for extraction, (3) scale well even if the number of relevant web sites is very large, and (4) are embeddable in existing programming environments for servers and clients.


Previous approaches to web extraction languages [34,46] use a declarative approach akin to OXPath; however, mainly due to their age, they often do not adequately facilitate deep web interaction, for example, form submissions. Also, they do not provide native constructs for page navigation, apart from retrieving pages for given URIs. Where scripting is addressed [12,47], the simulation of user actions is neither declarative nor succinct, but rather relies on imperative scripts and standalone, heavy-weight extraction interfaces. Lixto [12], Web Content Extractor [2], and Visual Web Ripper [3] are moving towards interactive wrapper generator frameworks, recording user actions in a browser and replaying these actions for extracting data. As large commercial extraction environments, their feature set goes beyond the scope of OXPath. They all emphasize the visual aspect of wrapper generation, which eases the design of extraction tasks by specifying only a few examples, mainly selecting layout elements on the rendered page. Further, Lixto adopts tree generalization techniques to produce more robust wrappers, whereas Web Content Extractor allows users to write complex user-defined text manipulation scripts. However, despite their feature richness, none of these systems addresses memory management, and our experimental evaluation (Sect. 6) demonstrates that such systems indeed take memory linear in the number of accessed pages.

OXPath restricts its focus to data extraction in the context of deep web crawling. This sets OXPath apart from information extraction (IE) systems that aim at extracting structured data (entities and relations) from unstructured text. Systems like [11,23] and [19] extract factual information from textual descriptions (mainly by lexico-syntactic patterns) and HTML tables on the web, respectively. Though OXPath could be extended to support libraries of user-defined functions to refine extraction from text (along the lines of procedural predicates in [46]), this is out of the scope of this paper. Whereas OXPath takes advantage of the structures on the web (possibly revealed through simulating user interaction) to extract, for example, infoboxes on Wikipedia or products on Amazon along with all their reviews, the task of extracting, for example, named entities from these reviews is not addressed by OXPath, but delegated to post-processing of the extracted information, for example, using an IE system.

As far as web automation tools are concerned, though some of them [17,31] can deal with scripted web applications, they are tailored to single action sequences and prove to be inconvenient and inefficient in large-scale extraction tasks requiring multi-way navigation (Sect. 6).

It is against this backdrop that we introduce OXPath, a careful, declarative extension of XPath for interacting with web applications to extract information revealed during such interactions. It extends XPath with a few concise extensions, yet addresses all the above desiderata:

1. Interaction. OXPath allows the simulation of user actions to interact with the scripted multi-page interfaces of web applications: (I) Actions are specified declaratively with action types and context elements, such as the links to click on, or the form field to fill. (II) In contrast to most previous web extraction and automation tools, actions have a formal semantics (Sect. 4.4) based on a (III) novel multi-page data model for web applications that captures both page navigation and modifications to a page (Sect. 4.3).

2. Expressive and precise. OXPath inherits the precise selection capabilities of XPath (rather than heuristics for element selection as in [17]) and extends them: (I) OXPath allows selection based on visual features by exposing all CSS properties via a new axis. (II) OXPath deals with navigation through page sequences, including multi-way navigation, for example, following multiple links from the same page, and unbounded navigation sequences, for example, following next links on a result page until there is no further such link. (III) OXPath provides intensional axes to relate nodes through multiple conditions, for example, to select all nodes that are at the same vertical position and have the same color as the current node. (IV) OXPath enables the identification of data for extraction, which can be assembled into (hierarchical) records, regardless of its original structure. (V) Based on the formal semantics of OXPath (Sect. 4.4), we show that its extensions considerably increase the language’s expressiveness (Sect. 4.10).

3. Scale. OXPath scales well both in time and in memory: (I) We show that OXPath’s memory requirements are independent of the number of pages visited (Sect. 5). To the best of our knowledge, OXPath is the first web extraction tool with such a guarantee, as confirmed by a comparison with five commercial and academic web extraction tools. (II) We show that the combined complexity of evaluating OXPath remains polynomial (Sect. 4.10) and is only slightly higher than that of XPath (Sect. 5). (III) We also show that OXPath is highly parallelizable (Sect. 4.10). (IV) We provide a normal form that reduces the size of the memoization tables during evaluation and rewriting rules to normalize arbitrary expressions (Sect. 4.9). (V) We verify these theoretical results in an extensive experimental evaluation (Sect. 6), showing that OXPath outperforms existing extraction tools on large-scale experiments by at least one order of magnitude.

4. Embeddable, standard API. OXPath is designed to integrate with other technologies, such as Java, XQuery, or JavaScript. Following the spirit of XPath, we provide an API and host language to facilitate OXPath’s interoperation with other systems.


Bonus: Open Source. We provide our OXPath implementation and API at http://diadem.cs.ox.ac.uk/oxpath, for distribution under the new BSD license.

OXPath has been employed within DIADEM [24], a domain-driven, large-scale data extraction framework developed at Oxford University, proving to be a practically viable tool for (1) succinctly describing web interaction and extraction tasks on sophisticated web interfaces, for (2) generating and processing such task descriptions, and for (3) efficiently executing these wrappers on the cloud.

This paper extends [25] in three main aspects:

(1) It clarifies the design of OXPath and introduces intensional axes (Sect. 4.7) as a further instrument for extracting data from rich web applications. With intensional axes, the OXPath user can on-the-fly specify new types of relations between elements of a web page, for example, to select all paragraphs with the same color and dimension.

(2) The page-at-a-time evaluation algorithm (Sect. 5) has been further refined to cater to these additions and to improve the complexity bounds from [25]: By splitting and specializing the memoization table, we achieve a reduction by up to a factor of n in time and memory. An additional factor of n reduction (at a slight increase in expression size) can be achieved by applying a new normalization rewriting (Sect. 4.9).

(3) An extensive study (Sect. 5.7) of the impact of the main features of OXPath, extraction markers, actions, and Kleene star iteration on evaluation performance is used to define a normal form for OXPath (Sect. 4.9) together with a sound and complete rewriting.

1.1 A gentle introduction to OXPath

OXPath extends XPath with five concepts: actions to navigate the interface of web applications, means for interacting with highly visual pages, intensional axes to identify nodes by multiple relations, extraction markers to specify data to extract, and the Kleene star to extract from a set of pages with unknown extent.

Actions. For simulating user actions such as clicks or mouse-overs, OXPath introduces (i) contextual, as in {click}, and (ii) absolute action steps with a trailing slash, as in {click /}. Since actions may modify or replace the DOM, we assume that they always return a new DOM. Absolute actions continue at DOM roots, and contextual actions continue at those nodes in the new DOM matched by the action-free prefix (Sect. 4.4) of the performed action. This prefix is obtained from the segment starting at the previous absolute action by dropping all intermediate contextual actions and extraction markers.
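The construction of the action-free prefix can be illustrated with a minimal Python sketch. It assumes a simplified, hypothetical representation in which an expression is a flat list of (kind, text) pairs; the kind labels and the example step strings are illustrative only and are not part of OXPath or its API. The formal definition in Sect. 4.4 may treat further cases that this sketch ignores.

```python
from typing import List, Tuple

# Hypothetical flat encoding of an expression: (kind, text) pairs, where kind is
# one of "nav", "ctx_action", "abs_action", "marker". Illustrative labels only.
Step = Tuple[str, str]

def action_free_prefix(steps: List[Step]) -> List[str]:
    """Navigation steps used to re-locate the context in the new DOM.

    The last element of `steps` is assumed to be the contextual action
    that has just been performed.
    """
    # The relevant segment starts right after the most recent absolute action.
    start = 0
    for i, (kind, _) in enumerate(steps[:-1]):
        if kind == "abs_action":
            start = i + 1
    segment = steps[start:-1]
    # Drop intermediate contextual actions and extraction markers.
    return [text for kind, text in segment if kind == "nav"]

example = [
    ("nav", "//form"),
    ("abs_action", "{click /}"),
    ("nav", "//div"),
    ("marker", ":<rec>"),
    ("ctx_action", "{mouseover}"),
    ("nav", "/a"),
    ("ctx_action", "{click}"),
]
print(action_free_prefix(example))  # ['//div', '/a']
```

Only plain navigation steps survive, which is why the prefix can be re-evaluated on the DOM returned by the action without triggering further actions.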

Style Axis and Visible Field Access. For lightweight visual navigation, we expose the computed style of rendered HTML pages with (i) a new axis for accessing CSS DOM node properties and (ii) a new node test for selecting only visible form fields. The style axis navigates the actual CSS properties of the DOM style object, for example, it is possible to select nodes by their (rendered) color or font size. To ease field navigation, we introduce the node test field() that relies on the style axis to access the computed CSS style to exclude non-visible fields, for example, /descendant::field()[1] selects the first visible field in document order.

Intensional Axis. XPath is not able to express queries where nodes from two node sets are related by more than one relation. Also, the set of relations that can be used is fixed. To overcome this limitation, OXPath introduces intensional axes: Users can identify nodes by intentionally defining relations between node pairs, as needed. For example, exploiting the style axis, the following expression selects all nodes with the same font and color as a hyperlink:

Intensional axes allow comparing node sets by more than one relation with the use of the reserved variables $lhs and $rhs.

Extraction Marker. In OXPath, we introduce a new kind of qualifier, the extraction marker, to identify nodes as representatives for records and to form attributes. For example,

extracts a story element for each current Google News story, along with its title and sources (as strings), producing:

The nesting in the result mirrors the structure of the OXPath expression: extraction markers in a particular predicate, such as title and source, yield attributes associated with the last marker outside this predicate, in our example the story marker.

Kleene Star. We follow [35] in adding the Kleene star. The following expression, for example, queries Google for “Oxford”, traverses all accessible result pages and extracts all links.


Fig. 1 Finding an OXPath through Amazon

To limit the range of the Kleene star, one can specify upper and lower bounds on the multiplicity, for example, (...)*{3,8}.
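The effect of a bounded Kleene star over "next" links can be approximated by the following Python sketch. It is not OXPath code: the three-page SITE dictionary, the function name extract_links, and the parameter names lower and upper are assumptions made for illustration. The sketch only mirrors the idea that the starred step is repeated between lower and upper times and that just one page is held at a time.

```python
from typing import Dict, Iterator, Optional

# Hypothetical result listing standing in for pages rendered by a browser.
SITE: Dict[str, dict] = {
    "p1": {"links": ["a", "b"], "next": "p2"},
    "p2": {"links": ["c"], "next": "p3"},
    "p3": {"links": ["d"], "next": None},
}

def extract_links(start: str, lower: int = 0,
                  upper: Optional[int] = None) -> Iterator[str]:
    """Follow 'next' between `lower` and `upper` times, yielding links per page."""
    current: Optional[str] = start
    iterations = 0  # how often the starred "click next" step has been applied
    while current is not None:
        page = SITE[current]          # only the current page is kept
        if iterations >= lower:       # enough repetitions: extraction applies
            yield from page["links"]
        if upper is not None and iterations >= upper:
            break                     # upper bound on the multiplicity reached
        current = page["next"]
        iterations += 1

print(list(extract_links("p1")))        # unbounded star: all pages
print(list(extract_links("p1", 1, 1)))  # exactly one repetition
```

Because the generator discards each page before moving on, its memory use does not grow with the number of result pages, which is the property the page-at-a-time evaluation is meant to guarantee.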

2 Application scenario

This section showcases some typical applications of OXPath: interacting with a visual, scripted interface to find flights on Kayak, extracting books on Amazon, information on relevant academic papers and their citations on Google Scholar, and stock quotes from Yahoo Finance.

History Books on Seattle. To extract data about history books on Seattle as offered on amazon.co.uk, a user has to perform the following sequence of actions to retrieve the page listing these books (see Fig. 1): (1) Select “Books” from the “Search in” select box, (2) enter “Seattle” into the “Search for” text field, and (3) press the “Go” button to start the search. On the returned page, (4) refine the search to only “History” books, and (5) open the details page for retrieving further details. Figure 1 shows an OXPath expression that realizes this extraction (each action is numbered according to the involved step). Lines 1–5 implement the above steps: To select the two input fields, we use OXPath’s field() node test (matching only visible form elements) and each node’s title attribute (@title). A contextual action (enclosed in {}) selects “Books” from the select box and continues the navigation from that field. The other actions are not contextual but absolute (with an added / before the closing brace) to continue at the root of the page retrieved by the action. To select the “History” link, we adopt the . notation from CSS for selecting elements with a class attribute refinementLink and use OXPath’s ~ shorthand for XPath’s contains() function to match the “History” text.

Fig. 2 Finding an OXPath through Google Scholar

For the obtained books, we extract their title, price, and publisher in step (5), as shown in Lines 6–8 of Fig. 1: The element with class result serves as indicator of book records, denoted by the record extraction marker: <book>. From there, we navigate to the contained title links, extract their value as a title attribute, and click on the link to obtain the page for the individual book, where we find and extract the publisher. Finally, we extract the price from the previous page, without caring for the order in which the pages are visited during extraction. OXPath buffers pages when necessary, yet guarantees that the number of buffered pages is independent of the number of visited pages.

WWW Papers with their citations. We might want to extract the most relevant papers of a scientific field together with other works citing them. Figure 2 shows the sequence of necessary user actions (each one numbered accordingly): (1) Enter “world wide web” in the search field and (2) press “search”. From the result page, we extract, for each paper, its title and authors and (3) click on its “cited by” link. To retrieve the next result page, we (4) click on “Next” and continue with (3).


Fig. 3 OXPath for flights to Seattle

Fig. 4 OXPath for stock quotes

Figure 2 shows an OXPath solution for this extraction task: Lines 1–3 navigate to Google Scholar, fill and submit the search form. Line 4 realizes the iteration over the set of result pages by repeatedly clicking the “Next” link, denoted with a Kleene star. Lines 5–6 identify a result record and its author and title, lines 7–9 navigate to the cited-by page and extract the papers. The expression yields nested records of the following shape:

Flights from Kayak. Our next example (Fig. 3) demonstrates using OXPath for finding non-stop flights to Hyderabad. This example illustrates the interaction with a heavily scripted page for refining the search results from any stops to only non-stop flights. Here, search results are only exposed through serial user action input, which is allowed via well-formed OXPath expressions. After submitting the form, we wait for the first result page to load and refine our results by selecting only non-stop flights and flights with a single stop. We can then extract flights, along with features such as price, airline, and plane. In this example, we use the # selector from CSS, analogous to the . selector used in the previous example, that filters elements based upon their id attributes.

Stock Quotes from Yahoo Finance. Figure 4 illustrates an OXPath expression extracting stock quotes from Yahoo Finance. In particular, note the use of optional predicates ([? ]) for conditional extraction: If the change is formatted in red, it is prefixed with a minus, otherwise with a plus.

3 Preliminaries: XPATH data model

XPath is used to query XML documents modeled as unranked and ordered trees of nodes. The set of nodes within a document is given as DOM, with nodes of seven types, namely root, element, text, comment, attribute, namespace, and processing instruction. Documents begin at a unique root node with elements as the most common non-terminal. All nodes except comment and text have an associated name (or label). Within a document, nodes x and y are ordered by document order, which is defined as the binary relation $x <_{\mathit{doc}} y$, iff the opening tag of x occurs before the opening tag of y in the well-formed XML document. Node types and labels are formally represented in this paper through a set of unary relations, $(\mathit{unary}_\nu)_{\nu \in \mathit{Unary}}$ with, for example, text, element, a ∈ Unary (all text nodes, elements, and a-labeled nodes).
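Document order can be made concrete with a short Python sketch based on the standard library module xml.etree.ElementTree, whose iter() method walks elements in the order of their opening tags. The tiny document and the helper name before are assumptions for illustration; the sketch covers element nodes only, not the other six node types.

```python
import xml.etree.ElementTree as ET

# A small illustrative document; element names are arbitrary.
doc = ET.fromstring("<r><a><b/></a><c/></r>")

# iter() visits elements in pre-order, i.e. by position of their opening tags.
order = {node: i for i, node in enumerate(doc.iter())}

def before(x: ET.Element, y: ET.Element) -> bool:
    """True iff x precedes y in document order (x <_doc y)."""
    return order[x] < order[y]

a, b, c = doc.find("a"), doc.find("a/b"), doc.find("c")
print([n.tag for n in doc.iter()])  # ['r', 'a', 'b', 'c']
print(before(b, c), before(c, a))   # True False
```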

An XPath query result has one out of four possible data types, which is either (1) an unordered collection of distinct document nodes, called a node set, or a scalar value of type (2) Boolean, (3) string, or (4) number.

Briefly, XPath queries are composed of functions and operators. Most importantly, node sets are specified by path expressions of concatenated steps: Each step evaluates the context nodes resulting from the previous step and consists of an axis (or child, if omitted), node test, and optional predicate expressions. XPath axes are binary relations, relating a context node to another node in the document according to the definitions in Fig. 5 (self is given as $\{\mathit{self}(x, x) \mid x \in \mathit{dom}\}$). In addition to self and the axes in Fig. 5, XPath has specific axes to access attribute and namespace nodes.

Fig. 5 XPath axes composed in terms of “primitive” tree relations child and nextsibling, their inverses, and the identity relation self
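How axes arise from the primitive relations can be sketched in Python with relations as sets of node pairs. The sketch uses the standard reading of the XPath axes (descendant as the transitive closure of child, following-sibling as the transitive closure of nextsibling, ancestor via the inverse of child); since Fig. 5 itself is not reproduced here, these definitions and the node names are assumptions rather than a transcription of the figure.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]

def transitive_closure(rel: Relation) -> Relation:
    """Smallest transitive relation containing `rel` (finite relations only)."""
    closure = set(rel)
    while True:
        # Extend every known pair (x, y) by one more primitive edge (y, z).
        new = {(x, z) for (x, y) in closure for (y2, z) in rel if y == y2}
        if new <= closure:
            return closure
        closure |= new

def inverse(rel: Relation) -> Relation:
    return {(y, x) for (x, y) in rel}

# Primitive relations of a tiny tree: r has children a and c; a has child b.
child: Relation = {("r", "a"), ("r", "c"), ("a", "b")}
next_sibling: Relation = {("a", "c")}

descendant = transitive_closure(child)
ancestor = transitive_closure(inverse(child))
following_sibling = transitive_closure(next_sibling)
preceding_sibling = transitive_closure(inverse(next_sibling))

print(sorted(descendant))  # includes ('r', 'b'), reached through a
```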

Axis navigation is further refined by node tests. In addition to a wildcard node test (node()) covering all document nodes, XPath defines node tests to filter by each of the unnamed node types (text(), processing-instruction(), and comment()). Node tests can also filter elements by name or by *, which is a wildcard for all elements.

Finally, steps can be filtered further with an arbitrary number of predicate expressions. Predicates contain a single input expression and return each node from the context that evaluates to true for its input expression. We leave further details on XPath to Sect. 4, where we discuss XPath implicitly as OXPath’s sublanguage.

4 Language

4.1 Design principles

The design of OXPath is guided by the following design goals derived from two core principles:

(1) Spirit of XPATH. OXPath maintains, where possible, the principles upon which XPath is built, in particular the use of a single, navigational expression, polynomial time evaluation, and concise syntax. Where we extend XPath, we do so using existing web standards such as CSS or DOM events. (I) Single Expression. OXPath expressions are path expressions just like plain XPath. We choose not to extend OXPath with a separate “construct” clause specifying the result of the expression (as in SPARQL, SQL, or XQuery), but rather to embed the specification of the result of the extraction into the path expression through extraction markers. This requires the shape of the expression to mirror the shape of the extracted result, a limitation, however, that is insignificant due to XPath’s flexible axes.¹ (II) Tree result. The result of an OXPath expression is a tree constructed from matches for the record extraction markers and their attributes. This poses only a small limitation, as data on the web are usually presented in a hierarchical way. (III) Polynomial time. We design OXPath to remain polynomial and in two-variable logic for both selection and construction (see Sect. 4.10). Though there are cases where full first-order logic is necessary for extraction (e.g., to refer back to values encountered on other pages), we believe that OXPath presents a more useful trade-off in most cases (see Sect. 2).

(2) Low memory. The second core principle is that OXPath should not require an unbounded buffer for pages, but rather be able to extract from hundreds of thousands of pages with very little memory use. This is necessary for large-scale data extraction. (I) No page identity. OXPath does not manage page “identity”: If two links lead to the same URL, OXPath considers the pages reached by clicking on those links as distinct. This avoids issues with server state where the same URL returns different results at different times or points in an interaction. It also avoids the need to maintain pages in OXPath in case they are later encountered again. (II) No back. OXPath does not allow the reverse (or “back”) navigation over pages. Once we have moved from page A to B, there is no way back to A. This is a limitation as it (together with the lack of variables) prohibits a class of wrappers that refer back to values encountered on earlier pages. However, it is essential to maintain the low memory profile of OXPath.

4.2 Syntax

The syntax of OXPath is defined in Fig. 6 atop XPath’s syntax as defined in [13,26]. We add OXPath-specific constructs to XPath non-terminals and highlight the newly introduced non-terminals, that is, the action and Kleene star expressions added to step, and the marker expression added to step-suffix. Moreover, we allow certain CSS selectors [5] to occur in OXPath expressions and maintain their normal semantics. These include ‘#’ and ‘.’, which filter by the id and the class attribute, respectively. For instance, a#myid matches a elements with id attribute equal to myid whereas a.result selects an a node if it contains result as a word in its class attribute. The containment operator ‘~=’ returns true iff the right operand is contained in the left operand as a word (cf. word containment in CSS [5]). Furthermore, we add five abbreviations: For contains(a,b), we write a ~ b (string containment), for count(a | b) = count(b), we write a subset b (subset), for [a | true], we write [? a], for (a)*{1,1}, we write (a), and for style::color we write ^color.

¹ However, classical results [41] on rewriting reverse axes such as ancestor in XPath do not extend to OXPath.

Fig. 6 OXPath syntax
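The difference between string containment (~), word containment (~=), and the subset abbreviation can be seen in a small Python sketch. The function names are hypothetical stand-ins for the operators, strings stand in for attribute values, and Python sets stand in for node sets; whitespace splitting is used as an approximation of word containment in CSS.

```python
from typing import Set

def contains(a: str, b: str) -> bool:
    """a ~ b: b occurs anywhere in a, as with XPath's contains(a, b)."""
    return b in a

def word_contains(a: str, b: str) -> bool:
    """a ~= b: b is one of the whitespace-separated words of a."""
    return b in a.split()

def subset(a: Set[str], b: Set[str]) -> bool:
    """a subset b, expressed as count(a | b) = count(b) over node sets."""
    return len(a | b) == len(b)

class_attr = "refinementLink active"
print(contains(class_attr, "Link"))       # True: substring match
print(word_contains(class_attr, "Link"))  # False: not a whole word
print(subset({"n1"}, {"n1", "n2"}))       # True: the union adds nothing to b
```

The count-based formulation works because the union a | b has the same size as b exactly when a contributes no node that b lacks.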

Even with the added selection capabilities of CSS, OXPath’s text extraction capabilities are still rather weak compared to a full IE system. We are currently extending OXPath with rich text processing operators (e.g., regular expression matching) inspired by XPath 2.0. However, proper entity or relation extraction is beyond the scope of OXPath and, if necessary, can be performed in a post-processing step.

Finally, we define the main path to a step s in an expression e as the sequence of steps occurring on the path from the root of e’s expression tree to the step s. For example, given e = a[b[c]/d]/e, we obtain for s = d and s = e, respectively, the main paths a/b/d and a/e.
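The main path definition can be checked on the example expression with a minimal Python sketch. The Step class and the main_path function are hypothetical: a path is modeled as a list of steps, each with a list of predicate paths, and steps are identified by name, which assumes that step names are unique within the expression.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Step:
    name: str
    # Each predicate is itself a path, i.e. a list of steps.
    predicates: List[List["Step"]] = field(default_factory=list)

def main_path(path: List[Step], target: str,
              prefix: Tuple[str, ...] = ()) -> Optional[str]:
    """Steps from the root of the expression tree down to the step `target`."""
    for step in path:
        prefix = prefix + (step.name,)
        if step.name == target:
            return "/".join(prefix)
        for predicate in step.predicates:
            found = main_path(predicate, target, prefix)
            if found is not None:
                return found
    return None

# e = a[b[c]/d]/e
expr = [
    Step("a", predicates=[[Step("b", predicates=[[Step("c")]]), Step("d")]]),
    Step("e"),
]
print(main_path(expr, "d"))  # a/b/d
print(main_path(expr, "e"))  # a/e
```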

The grammar in Fig. 6 omits a few restrictions necessary to avoid an undesirable interplay of our new language features with functions, predicates and sorting operators as well as to avoid unintuitive extraction:

(R1) Actions and extraction markers may not occur in other extraction markers, function, or operator arguments. Further, extraction markers may not occur inside intensional axes (see Sect. 4.7). This limitation is mainly for simplicity, although actions in operator arguments can also affect the low-memory principle of OXPath (see Sect. 4.1).

(R2) We disallow position() on the node set obtained from (Kleene-starred) bracketed expressions having actions or extraction markers in their main paths, to avoid sorting nodes that originate from different documents or relate to different extraction markers. This restriction is necessary to maintain the low-memory goal of OXPath, as it would require storing all these nodes from different pages if allowed. Expressions with extractions inside Kleene stars are rewritable into expressions satisfying this restriction, as shown in Theorem 2.

(R3) Extraction markers may only occur in a predicate if there is a marker on the path leading to the predicate and the last such marker does not extract a value. Extraction markers extracting a value may not occur outside a predicate. The value of an extraction marker must yield a scalar value. These three restrictions ensure a sane use of extraction markers such that the expressions always produce a proper output tree (“tree result” principle).

(R4) All predicates that are followed by a contextual action with no absolute action between must be free of actions. This ensures that the action-free prefix (see Sect. 4.4) of a contextual action does not contain any actions in predicates and thus can safely be executed multiple times on different pages without violating the “no page identity” principle.

4.3 Data model

OXPath’s data model extends the data model of XPath to multiple HTML pages, interconnected by actions. We evaluate OXPath expressions on relational structures, called page trees, with an input schema extending the model introduced in Sect. 3:

\[
\bigl( (\mathit{unary}_\nu)_{\nu \in \mathit{Unary}},\ \mathit{child},\ \mathit{self},\ \text{next-sibl},\ \mathit{attribute},\ \mathit{style},\ (\mathit{action}_\alpha)_{\alpha \in \mathit{Action}} \bigr)
\]

Therein, $(\mathit{unary}_\nu)_{\nu \in \mathit{Unary}}$ is the set of unary relations, indicating type and label sets, child is the parent–child relation, self is the identity relation, next-sibl is the direct sibling relation, attribute relates nodes with their attribute nodes, and analogously, style relates nodes with their style nodes, that is, the dynamically computed style values of their parent nodes. For example, the node types box-top, box-bottom, box-left, box-right, box-width, and box-height refer to the dimensions of the bounding box of the node at hand.

The remaining XPath axes, such as descendant, are derived from the basic relations, as in Sect. 3. We also add for each OXPath action α a binary relation $\mathit{action}_\alpha$. Herein, $\mathit{action}_\alpha(x, y)$ indicates that action α triggered on node x yields the page rooted at y. The actions $(\mathit{action}_\alpha)_{\alpha \in \mathit{Action}}$ and the child axis together form a tree: Each page has a unique parent node connected by a single action edge, that is, if two links point to the same URI, they yield two different pages in the page tree.

An OXPath expression E returns an unordered tree of extracted nodes represented as a single relation O such that 〈o, p, t, v〉 ∈ O indicates that there is an output node o with parent output node p, tag t, and a value v, which is possibly null. A record extraction marker R yields tuples of the form 〈r, p, R, null〉, (typically) containing tuples of the form 〈a, r, A, v〉, generated from an attribute extraction marker A and a non-null value v. This allows the extraction of (possibly nested) records with multiple values. We refer to nodes in this tree (in the page tree) as output (input) nodes.
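As an illustration of how the flat relation O encodes a tree, the following sketch groups tuples 〈o, p, t, v〉 by their parent output node and unfolds them from the root. The sample tuples, the node identifiers, and the `build_tree` helper are hypothetical and serve only to show the shape of the data.

```python
from collections import defaultdict

# Hypothetical output relation O: tuples (o, p, t, v) = (node, parent, tag, value).
O = [
    ("r1", "results", "R", None),    # record without a value
    ("a1", "r1", "A", "first"),      # attribute nested in record r1
    ("r2", "results", "R", None),
    ("a2", "r2", "A", "second"),
]


def build_tree(relation, root="results"):
    """Unfold the output relation into nested dictionaries below `root`."""
    children = defaultdict(list)
    for o, p, t, v in relation:
        children[p].append((o, t, v))

    def expand(node):
        return [
            {"tag": t, "value": v, "children": expand(o)}
            for o, t, v in children[node]
        ]

    return expand(root)


tree = build_tree(O)
```

The list order in `tree` merely reflects the order of the sample tuples; the output tree itself is unordered, as stated above.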


Fig. 7 OXPath page tree (legend: DOM node, extracted record, extracted value, page, accessed page tree; edge types: child, next-sibl, action{click /})

Figure 7 illustrates both the data model and the result of an OXPath expression. The page tree consists of the nodes of the considered pages, connected by child, next-sibl, and action edges (in the figure we use only {click/}). Each distinct path of actions and nodes leads to a distinct page. OXPath traverses action edges only in the direction of the edge (no reverse navigation), and only directly (no descendant over action edges). The part of the page tree accessed by a given OXPath expression is called the accessed page tree (dotted area in Fig. 7), consisting of accessed pages and nodes. Figure 7 also shows the result of an OXPath expression

producing (square) output nodes 〈Rd, results, R, null〉 for matching record extraction markers //r:<R>, and 〈Sk, Rd, S, null〉 for //s:<S> (nested records). It produces (diamond) nodes 〈Ap, Rd, Tp, vp〉 and 〈Ab, Sk, Tb, vb〉 for value extraction markers //p:<Tp=string(.)> and parent::b:<Tb=string(.)>. Note that the structure of the output tree reflects the structure of the extraction markers in the OXPath expression, but not necessarily the structure of the input tree. In our example, the fragment //k:<S>/parent::b:<Tb=string(.)> produces output nodes Ab as children of output nodes Sk, although the b’s are parents of the k’s in the input tree.

4.4 Semantics

The semantics of OXPath is defined with its extraction semantics $\llbracket expr \rrbracket^{\beta}_{E}(c)$, specifying the result tree for expression expr, context tuple c, and variable assignment β. Each context tuple c = 〈n, p, l〉 consists of an input node n, the parent output node p, and the last sibling output node l, accessed through the notation c.n, c.p, c.l. We define the extraction semantics atop the value semantics $\llbracket expr \rrbracket^{\beta}_{V}(c)$, which matches nodes from the page tree, akin to XPath’s semantics in [13]. The latter computes the value reached via expr, which is either a set of context tuples, a Boolean, integer, or string value. For lucidity’s sake, we first develop the semantics for the OXPath language without the intensional axes, which we add to the semantics in Sect. 4.7.

The OXPath semantics deviates from XPath in the contents of its context tuples. We maintain both the preceding parent and sibling match to organize the extracted nodes hierarchically: At context c = 〈n, p, l〉, an extraction match outside predicates (1) yields an output tuple 〈o, c.p, M, v〉, with a fresh output node o, becoming descendant of c.p and sibling of c.l, and (2) changes the context to c′ = 〈c.n, c.p, o〉. On entering a predicate, c.l replaces c.p as parent output node, such that further extraction matches yield nodes that are nested as descendants of c.l instead of c.p.
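The bookkeeping on context tuples can be sketched as follows. This is a minimal illustration under stated assumptions, not the formal rules of Tables 1 and 2: the names `Context`, `extraction_match`, and `enter_predicate` are hypothetical, output nodes are modelled as generated strings, and keeping c.l as the last sibling when entering a predicate is an assumption of this sketch.

```python
from collections import namedtuple
from itertools import count

# Context tuple <n, p, l>: input node, parent output node, last sibling output node.
Context = namedtuple("Context", ["n", "p", "l"])
_fresh = count(1)  # source of fresh output node identifiers (illustrative only)


def extraction_match(c, marker, value=None):
    """Emit <o, c.p, M, v> for a fresh o and continue with <c.n, c.p, o>."""
    o = f"o{next(_fresh)}"
    return (o, c.p, marker, value), Context(c.n, c.p, o)


def enter_predicate(c):
    """Inside a predicate, c.l replaces c.p as the parent output node.

    Retaining c.l as last sibling is an assumption of this sketch.
    """
    return Context(c.n, c.l, c.l)


c0 = Context("div", "results", "results")
t1, c1 = extraction_match(c0, "R")               # record below results
c2 = enter_predicate(c1)                          # now nested below the record
t2, c3 = extraction_match(c2._replace(n="p"), "T", "some text")
```

Here `t2` names the output node of `t1` as its parent, so the value extracted inside the predicate is nested below the record extracted before it.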

We start the evaluation of an OXPath expression with the initial context node c = 〈⊥, results, results〉, where ⊥ is an arbitrary context node and results is the root of all subsequent extraction matches. The context node is arbitrary since all expressions start with doc(uri), which returns the root of the page at uri regardless of the supplied context node.

For an OXPath expression that is also an XPath expression, the value semantics computes the exact same result as standard XPath. Tables 1 and 2 give the full OXPath value and extraction semantics, respectively, though in the extraction semantics we omit those rules that only decompose the expression. $F_f$, $B_o$, and $U_o$ are the semantic functions on the nodes corresponding to XPath’s functions, binary, and unary operators, extended by the OXPath . and # node tests and ~ and ~= operators. As the variable assignment β is passed through unchanged throughout the semantics, we show β only in Rule V6, where we evaluate variables, and drop it otherwise. For simplicity, we disallow positional functions outside qualifiers.

4.5 Value semantics

In Table 1, Rule V1, $\llbracket path \rrbracket_V$ delegates the evaluation of a path to $\llbracket path \rrbracket_N$, which handles expressions computing node sets. The Rules V2–V4 deal with functions and operators and apply the corresponding semantic functions on the evaluated subexpressions and operands. Rule V5 handles literals and Rule V6 maps a variable $var to its value β($var).

Table 1 Value semantics of OXPath

Table 2 Extraction semantics of OXPath (partial)

The major part of the value semantics, consisting of Rules N1–N10, deals with path expressions. At first, Rule N1 decomposes a given path into its first element, which is either a step or an istep, and the tail expression path. Then, the semantics evaluates the (i)step with Rules N2–N10 and evaluates path on the resulting node set recursively.

N2–N5: Axes, node tests, and predicates. In Rules N2 and N3, we handle OXPath axes and node tests as in standard XPath. The case of predicates in Rules N4 and N5 follows XPath as well, but requires additional provisions: To manage the nesting of the extracted data, upon entering a predicate, OXPath takes the last sibling output node as new parent output node, both in Rule N4 and in N5. Expressions in predicates are cast to Booleans by means of $\llbracket expr \rrbracket_B$ with Rule B1, as in XPath. Rule N5, for positional predicates, relies on two new functions, rewrite+ and rewrite−, of the form expr × C × c → expr′, where expr is an input expression, C is a context set, c ∈ C, and expr′ is a rewritten expression as follows: Both functions replace in the input expression each non-nested occurrence of last() with |C| and of position() with the position of c within C according to document order. This order is well-defined within all possible context sets at this point, since all such context sets contain only tuples with nodes from the same page with identical parent and sibling matches: If (i)step starts with an axis/node navigation or an action, this property holds. Otherwise (i)step must be a bracketed (Kleene star) expression, and in this case, the expression is not allowed to contain actions or extraction markers (Restriction (R2)), preventing any changes in the underlying page or context tuples. We write rewrite± for conciseness, where the specific function applied depends on the (i)step: For axis navigation, we choose rewrite+ for forward and rewrite− for reverse axes. Otherwise, for (Kleene-starred) bracketed or action expressions, we always select rewrite+ (see footnote 2).

N6–N7: Actions. Actions map the context node c.n to a node in a different page. Absolute actions in Rule N6 map c.n to the root n′ of some other page with $(\mathit{action}_\alpha)_{\alpha \in \mathit{Action}}$. By our data model, we assume that the node n′ is different from any other node previously reached. For contextual actions, Rule N7 first also evaluates the absolute action to obtain the root of the new page, but then attempts to move to a node that corresponds closely to the node c.n on the original page. We select the corresponding node by applying the same OXPath expression to the new page root as we used to select c.n in the original page. The action-free prefix afp(action, c.n) of action and c.n in OXPath expression expr returns the following OXPath expression: Let base be the subexpression of expr between action and its last preceding absolute action, stripped of all extraction markers and all contextual actions occurring in the main path of expr. Then, afp(action, c.n) = (base)[i] where i is the position of c.n in document order among all nodes in the current page matching base. In the original page, this expression uniquely identifies c.n. When evaluated from the root returned by the absolute action on c.n, it selects a unique node reached by the same path and in the same relative position.

N8: Extraction markers. For extraction markers, the context set of the corresponding step is modified by replacing the last sibling output node c.l with the one generated from this marker M and current node c.n. OXPath computes the new node with out(c.n, M) where out injectively maps an input node c.n and extraction marker M to an output node.

N9–N10: Kleene star. Unbounded and bounded Kleene-starred expressions match the Kleene-repeated path multiple times, enforcing optional iteration bounds.

N11: Empty expressions. Return the context node.

Footnote 2: Thus, $(path)^{*}[q_p] = \bigl(\bigcup_{i=0}^{\infty} path^{i}\bigr)[q_p]$ always holds, but $(path)^{*}[q_p] = \bigcup_{i=0}^{\infty} path^{i}[q_p]$ does not hold necessarily, since $[q_p]$ is applied to each $i$-th copy of path.


4.6 Extraction semantics

The extraction semantics $\llbracket expr \rrbracket_E(c)$ for OXPath in Table 2 takes context tuple c and extracts an output tree from the input page tree. For the sake of brevity, we omit those rules that recursively decompose the expression but have no other effect. Except for extraction markers, the extraction semantics is straightforward: For expressions with subexpressions, we compute the extraction semantics for each subexpression and take the union of all extracted results. To compute the extraction semantics for a subexpression expr, we first compute with the value semantics $\llbracket \cdot \rrbracket_V$ the context set to apply expr on, and then apply $\llbracket expr \rrbracket_E$ recursively on each obtained context node. Rules E1 and E2 exemplify this case for paths and predicates. For all other expressions not shown here, we collect all extraction markers returned by their subexpressions (if any), regardless of the (value) semantics of the involved expressions.

Extraction markers are treated in Rules E3 and E4: For markers without extracted values (Rule E3), OXPath extracts the tuple 〈out(c.n, M), c.p, M, null〉. The resulting node is thus a child of the parent match c.p. For markers with values (Rule E4), we additionally evaluate the value expression v: We take the value returned by $\llbracket v \rrbracket_V(c)$ (a string or other scalar) and output 〈out(c.n, M), c.p, M, $\llbracket v \rrbracket_V(c)$〉.

4.7 Intensional axis

In XPath, axes relate nodes through a fixed set of relations such as child or following. Together with functions and operators, these are the only means in XPath for relating two nodes. Unfortunately, these means are rather limited: for example, we cannot identify all nodes that follow the current node in document order and are displayed in the same font size as the current node. In general, XPath is not able to express queries where nodes from two node sets are related by more than one relation.

Theorem 1 Let I be the set of first-order queries of the form Q(x, y) ⇐ φ(x, y) ∧ ψ(x, y) where φ(x, y) and ψ(x, y) are non-empty first-order formulas expressible in XPath. Then there are queries in I that cannot be expressed in XPath.

Proof (sketch) From [13] and [36], this follows for navigational XPath, which expresses exactly all XPNF queries. An XPNF query is a FO² query over page trees (without action relations) built from relations between two node sets, which are limited to disjunctions of binary atomic formulas.

For full XPath, we need to show that neither relational operators, functions, aggregation, nor positional arithmetic allow us to express multiple relations between two node sets. All queries in I relate two nodes, and thus we can ignore Boolean and other value queries.

First, outside predicates, id() is the only functional operator allowed in such queries. As id() returns the same result for any context node, it does not relate multiple nodes.

Second, inside predicates, XPath allows conjunctions of functions, relational operators, and aggregations. However, the only “shared variable” between such conjuncts is the context node (see V1–V6 in Table 1). Though it is possible to build up several node sets originating from the same context node, once constructed, its individual elements are only accessible via quantification. For example, [.//a=.//b and .//a=.//c] does not require the existence of a single a node having the same value as some b and some c: In contrast, the predicate is already satisfied if there exist two a nodes that match, respectively, the values of some b and c.

Though the same context set can be matched by multiple conjuncts (by using two equivalent subexpressions), the individual nodes in these node sets cannot be related. This even holds for count() that can be used to relate entire node sets, but not individual nodes. □
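The existential reading of node-set comparisons used in the proof can be checked on a toy example. The sketch below models node sets as lists of string values (an assumption made for illustration) and follows the XPath 1.0 rule that a comparison of two node sets holds if some pair of nodes has equal string values; the helper name `general_equals` is hypothetical.

```python
def general_equals(left, right):
    """XPath 1.0 style node-set comparison: true if some pair of values matches."""
    return any(x == y for x in left for y in right)


# String values of the nodes selected by .//a, .//b, and .//c (illustrative data).
a_values = ["red", "blue"]
b_values = ["red"]
c_values = ["blue"]

# The predicate [.//a=.//b and .//a=.//c] quantifies over the a nodes twice,
# independently in each conjunct.
predicate = general_equals(a_values, b_values) and general_equals(a_values, c_values)

# What the predicate cannot express: one single a node matching both a b and a c.
single_witness = any(x in b_values and x in c_values for x in a_values)
```

On this data, `predicate` is true while `single_witness` is false, which is the gap between the two conjuncts that the proof sketch points out.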

To address this limitation, we introduce intensional axes to support OXPath users in intensionally defining relations between node pairs, as needed. We denote intensional axes like predicates (see Fig. 8), but place them in lieu of an axis, that is, before a ::. For example, if we evaluate on a context c,

we obtain the context set C containing all nodes c′ such that the expression inside the brackets evaluates to true for the variable assignment β′ = β[$lhs ← c.n, $rhs ← c′.n]. We bind $lhs to the current node c.n, try for $rhs every node in the current page, and return those for which the axis expression is satisfied. Thus, this expression solves the query from the beginning of this section: It returns all nodes that follow the current node in document order and are displayed in the same font size. The expression uses OXPath’s subset operator to test whether $rhs is among the nodes following $lhs (see Sect. 4).

Definition 1 An intensional step [φ]::nψ consists of an intensional axis of [φ], a node test n, an arbitrary number of predicates ψ, and an arbitrary OXPath expression φ that may use the reserved variables $lhs and $rhs. An intensional axis [φ] returns, for a context node c, all nodes m such that $\llbracket \varphi \rrbracket_B$ is true if $lhs is bound to c and $rhs to m.
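Definition 1 can be read operationally: bind $lhs to the context node, try every node of the page as $rhs, and keep those for which φ holds. The following sketch mimics this for the font-size query from the beginning of this section. The `Node` class, its fields, and the Python predicate standing in for φ are assumptions for illustration; the page is modelled as a flat list in document order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    order: int       # position in document order
    font_size: str   # hypothetical computed style value


def intensional_axis(page_nodes, lhs, phi):
    """Return all nodes rhs of the page for which phi(lhs, rhs) holds."""
    return [rhs for rhs in page_nodes if phi(lhs, rhs)]


def follows_with_same_font_size(lhs, rhs):
    # Two relations between the same pair of nodes, combined by conjunction.
    return rhs.order > lhs.order and rhs.font_size == lhs.font_size


page = [
    Node(0, "12px"),
    Node(1, "14px"),
    Node(2, "12px"),
    Node(3, "12px"),
    Node(4, "10px"),
]
matches = intensional_axis(page, page[0], follows_with_same_font_size)
```

Because every node of the page is tried as $rhs, a naive evaluation like this one inspects all node pairs once per context node.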

It should be noted that an intensional axis that does not refer to $lhs or the context node relates all context nodes to the same bindings for $rhs (since $\llbracket \varphi \rrbracket_B$ does not depend on $lhs). If it does not refer to $rhs, it acts as a filter to the context nodes, but each context node that matches the filter is related to all nodes in the DOM. If neither is referenced, it is either ∅ or the set of all pairs of DOM nodes, depending on whether $\llbracket \varphi \rrbracket_B$ holds (which is absolute and independent of the context node).

Table 3 Value semantics for intensional axes

Fig. 8 OXPath intensional axis

Table 4 Intensional axis examples

Table 3 shows the necessary additions to the semantics of OXPath. The existing Rule N2 already suffices to cover intensional steps. Rule N12 (and only this rule) modifies the set of variable bindings β such that $lhs is now bound to the context node and $rhs is successively bound to all nodes n in the document. With this updated variable binding, expr is evaluated under Boolean semantics, and if $\llbracket expr \rrbracket_B$ evaluates to true, n is added to the resulting context set. Consequently, in case of nested intensional axes, $lhs and $rhs are always bound according to the last axis.

Examples. Table 4 lists five applications of intensional axes in OXPath. Expression (1) selects all nodes that have the same font color and size as an a. In (2), we select all divs that are rendered northwest of an a. Expression (3) selects all books having a common author with another book that is cited by the latter one, that is, it selects books containing self-citations. Example (4) shows a nested intensional axis: It selects all em children of elements that have the same font family as a div and that are to the north of an a of that div with the same value. This case requires a nested relational axis, as the div is related to the em by more than one relation, but so is the a to the em’s parent. Expression (5) selects all elements that are visually contained in some div.

4.8 OXPath properties

As discussed in Sect. 4.1, OXPath is designed to avoid buffering many pages or result tuples at the same time. This design goal is expressed in two formal properties:

OXPath avoids sorting context sets that contain nodes from different pages, since it is unclear how to order nodes from different pages, without first retrieving (and thus buffering) those pages.

Proposition 1 (No Node Sorting across Pages) The evaluation of an OXPath expression never requires sorting context sets which contain nodes from different pages.

Proof OXPath requires sorting only for positional qualifiers in Rule N5, where the function rewrite±(q, C, c′) sorts the tuples in the context set C and determines the position of c′ within C. Thus, it suffices to show that C in Rule N5 never contains nodes from different pages.

This holds true, since in Rule N5, C = $\llbracket (i)step^{\pm} \rrbracket_V(c)$ is computed from a single axis navigation axis::nodes (N3), followed by a sequence of (positional) qualifiers (N4 and N5) and markers (N8). Our language restriction (4) from Sect. 4 ensures that actions cannot occur in (i)step±, as they are disallowed in positionally qualified steps. Since Rule N3 always results in a context set with nodes from the same page, and since N4, N5, and N8 can only remove nodes from the context set, C must contain nodes from a single page only. □

OXPath’s semantics does not require any further processing on result tuples, and hence allows them to be streamed out as they are extracted. Extracted tuples are never modified, deleted, or accessed again.

Proposition 2 (No Output Buffer) The evaluation of OXPath expressions requires no output buffer.

Proof Only Rules E3 and E4 in Table 2 create output tuples. We visit each input node c.n at most once with extraction marker M, and thus, each created output tuple is unique (and is not overwritten or altered afterwards). The output tuples are immediately added to the output relation O, regardless of whether the current expression evaluates eventually to true or not. Also, when the output tuples are created, the parent output nodes are known by construction and thus no buffering is needed to obtain all tuple values. All other rules in the extraction semantics (as in E1 and E2) only collect the tuples returned by their subexpressions. Since no duplicate tuples are created, this requires no buffering. □

More intuitively, this holds as the structure of the output tree reflects the structure of the OXPath expression and parent nodes are therefore always created before their children nodes. Also, tuples are extracted immediately upon creation, regardless of whether the current subexpression matches or not. For example, consider expr[p1][p2]. If p1 contains extraction markers, the extracted tuples are returned whether p2 matches or not.

Requiring that tuples extracted by p1 are returned only if p2 matches would require an unbounded buffer, as the visited pages and extracted results for p1 are both unbounded. Furthermore, we can achieve the same effect in the existing OXPath semantics at the cost of an increased query size: We can rewrite the above expression into expr[p′2][p1][p2] where p′2 is obtained from p2 by removing all extraction markers. Then p′2 matches if and only if p2 matches, as extraction markers do not affect matching, and tuples extracted by p1 are only returned if p2 matches.
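The rewriting of expr[p1][p2] into expr[p′2][p1][p2] can be sketched at the level of expression strings. This is only an illustration under strong assumptions: expressions are treated as plain text rather than parsed, markers are assumed to have the form `:<...>` with no angle brackets inside the value expression, and the helper names `strip_markers` and `guard_extraction` as well as the sample expressions are hypothetical.

```python
import re

# Extraction markers such as :<R> or :<T=string(.)>; assumes no nested < or >.
MARKER = re.compile(r":<[^<>]*>")


def strip_markers(expr):
    """Derive p' from p by removing all extraction markers."""
    return MARKER.sub("", expr)


def guard_extraction(expr, p1, p2):
    """Rewrite expr[p1][p2] into expr[p2'][p1][p2], with p2' marker-free."""
    return f"{expr}[{strip_markers(p2)}][{p1}][{p2}]"


rewritten = guard_extraction("//div", "./p:<T=string(.)>", "./a:<S>")
```

The marker-free copy of p2 is placed first, so that p1 is only evaluated, and its tuples only extracted, on nodes where p2 is known to match. A real implementation would perform this rewriting on the parsed expression tree.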

4.9 Normalizing OXPath

We can reduce the size of the necessary memoization tables by a factor of n, without restricting the language, at potential expense of longer queries by rewriting general OXPath into normalized OXPath, a fragment denoted OXPath$_{norm}$.

We introduce two normalization properties for OXPath expressions: Property (A) does not allow any extraction markers within Kleene star expressions, and Property (B) disallows any two extraction markers on the same expression branch. The latter property means that all extraction markers after any given marker must be nested within predicates. For example, Property (B) is violated by a:<R>b:<S> but satisfied by a:<R>[b:<S>]. In this section, we do not distinguish record markers, such as :<R>, and attribute extraction markers of the form :<R=. . .>.

To normalize general OXPath expressions, we apply two rewriting steps. The first rewriting, shown in Theorem 2, produces expressions that meet Property (A). When we cannot apply Theorem 2 anymore, we rewrite the obtained expression following Theorem 3, again until inapplicable, to meet Property (B) as well.

The proof of Theorems 2 and 3 relies on the loose coupling of value and extraction semantics in OXPath: The extraction semantics does not influence the value semantics at all, as stated in Fact 3. On the other hand, the extraction semantics depends on the value semantics, as the value semantics determines which nodes to extract. But extractions take place immediately, independently of whether the tailing expression matches any values or not, as stated in Fact 4.

Fact 3 (Extraction agnostic value semantics) The introduction or removal of extraction markers does not affect the value semantics of an OXPath expression.

Fact 4 (Tail agnostic extraction) When extraction marker <R> in a:<R>b is evaluated after having matched a, the corresponding pairs 〈n, R〉 are extracted immediately, regardless of whether b matches subsequently.

4.9.1 Extraction-free Kleene star

For the next theorem, we rely on OXPath not being short-circuited; for example, the evaluation of [a | b] cannot be aborted once a is evaluated to true since b may contain extraction markers which must be matched, even if they do not affect the value semantics.

Theorem 2 (Extraction-free Kleene Star) Let e be an OXPath expression hp*t, with h, p, and t as arbitrary subexpressions. Then the expression e is rewritable with

hp*t ≡ ht | hq*pt ≡ h[? q*p]q*t,

where q denotes the expression derived from p by removing all extraction markers.

Proof We show the theorem by proving

p* ≡ self | q*p ≡ self[? q*p]q*,

where we abbreviate self::node() with self. From this claim, the theorem follows, by adding h and t to the start and end of the three subexpressions in the claim.

First, we show the claim for the value semantics and subsequently for the extraction semantics. Using $\equiv_V$ to denote the equivalence with respect to value semantics, we obtain p* $\equiv_V$ q* $\equiv_V$ self[? x]q* $\equiv_V$ self[? q*p]q*. Therein, Step (1) holds by Fact 3, (2) holds for every expression x, since optional predicates are not required to evaluate to true (in fact, they have been introduced for conditional extraction), and (3) instantiates x with q*p. Similarly, we have p* $\equiv_V$ self | p*p $\equiv_V$ self | q*p, where Step (1) unrolls the Kleene star expression, and (2) holds again by Fact 3. Taken together, these identities show the claim for value semantics, that is, for $\equiv_V$.

Second, for extraction semantics, we show

yielding the sought-for equivalences after Steps (2) and (4). Step (1) unrolls the first iteration of the Kleene star. In (2), we replace an instance of p with q, and hence we need to show that every pair extracted by an instance of p in p*p is also extracted by q*p. But this is the case: If p*p extracts some pair, then there must exist a minimal i ≥ 0 such that $p^{i}p$ extracts this pair. Because of Fact 4, we only consider the prefix leading to the extraction, while we ignore the subsequent expressions to be matched. Since i is minimal, the extraction does not occur within $p^{i}$ but in the tailing p, and therefore, $q^{i}p$ produces the same pair. Hence, q*p does so as well, proving the soundness of this step. Step (3) holds for any y without extraction markers: All pairs extracted by q*p are also extracted by the conditional predicate self[? q*p], regardless of the tailing y. On the other hand, since y does not contain extraction markers, self[? q*p]y cannot extract more than q*p. Step (4) instantiates y with q*, which is valid since q contains no extraction markers. □
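The unrolling step used in the proof for the value semantics, p* ≡ self | p*p, can be checked on a small example by modelling the value semantics of a path as a binary relation over nodes and the Kleene star as its reflexive, transitive closure. The relation, the node set, and the helper names `compose` and `star` below are assumptions chosen for illustration.

```python
def compose(r, s):
    """Relational composition: x -> z whenever x -r-> y and y -s-> z."""
    return {(x, z) for (x, y) in r for (y2, z) in s if y == y2}


def star(r, nodes):
    """Reflexive, transitive closure of r over the given nodes."""
    closure = {(n, n) for n in nodes}            # zero iterations: self
    while True:
        extended = closure | compose(closure, r)  # one more iteration of r
        if extended == closure:
            return closure
        closure = extended


nodes = {0, 1, 2, 3}
p = {(0, 1), (1, 2), (2, 3)}                     # value semantics of p (and of q)
identity = {(n, n) for n in nodes}               # value semantics of self

# Right-hand side of the unrolling: self | p*p
unrolled = identity | compose(star(p, nodes), p)
```

Since removing extraction markers leaves the relation unchanged (Fact 3), the same relation serves for both p and q in this sketch; the extraction side of the argument is not modelled here.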

Both rewriting options in Theorem 2 have exponential upper bounds: If we rewrite hp*t with ht | hq*pt, we need to duplicate the entire expression, with additional occurrences of h, t, and p (in terms of q). On the other hand, if we use h[? q*p]q*t, we triplicate p with two additional copies of q. In our experience, if extraction markers occur in Kleene star expressions, then the Kleene-starred expressions are rather short, so the second option is usually the better choice. Finally, if the Kleene star must match at least once (e.g., when using the Kleene + instead of *), then we can use an even more efficient rewriting:

Corollary 1 (Rewriting for Kleene+) Let e be an OXPath expression hp+t, with h, p, and t arbitrary. Then the expression e is rewritable with hp+t ≡ hq*pt where q denotes the expression derived from p by removing all extraction markers.

In general, all these rewritings are exponential in the expression size. However, the overhead introduced by rewriting Kleene star expressions with bounded Kleene nesting depth is polynomial. This bound is practically relevant, as we never encountered natural OXPath expressions with a nesting depth larger than 2.

Moreover, the rewriting of Theorem 2 is naturally extended to the bounded case with (for a ∸ b = max(a − b, 0))

4.9.2 Sibling-free extraction

Theorem 3 (Sibling-free Extraction Markers) Let e be an OXPath expression a:<R>b:<S>c, with a, b, and c as arbitrary subexpressions. Then e is rewritable with

Proof We prove Theorem 3 first for value semantics, then for extraction semantics. With $\equiv_V$ for equivalence according to value semantics, we obtain

where Step (1) holds because of Fact 3, and Step (2) holds since [self:<R>] evaluates to true under value semantics for every node.

For extraction semantics, a:<R> and a[self:<R>] must extract the same pairs, since <R> is applied to the same node sets, as a and a/self select the same nodes. Again, because of Fact 4, the respective tail expressions are irrelevant. □

4.10 Complexity

Considering the complexity of OXPath, we note that expressions containing Kleene star repeated actions may require access to an unbounded number of pages. In particular, when we evaluate such an expression, we do not know whether the evaluation terminates and how many pages are accessed during evaluation. Thus, when we discuss the complexity of evaluating OXPath, we only consider expressions whose evaluations terminate and consider all accessed pages as input. Furthermore, we assume that traversing an action takes constant time, as most pages execute their actions quickly.

Theorem 4 (Complexity) OXPath evaluation without multiplication and string concatenation is in NLogSpace for data complexity. OXPath evaluation is PTime-complete for combined complexity.

Proof We show the theorem statements separately, starting with data complexity: Of all extensions over XPath, only the Kleene star causes an increase in complexity: Actions are assumed to take constant time, extraction markers do not require additional memory as they are streamed out, and the additional axis does not introduce further complexity. XPath 1.0 without string concatenation and multiplication has data complexity LogSpace [13]. Each Kleene star expression can be realized as transitive, reflexive closure of the Kleene star repeated expression; therefore, we arrive at NLogSpace data complexity for OXPath without string concatenation and multiplication.

Combined Complexity: PTime-hardness follows immediately from the PTime-hardness for XPath query evaluation [13]. To evaluate an OXPath query, we process the query left to right, and decompose it recursively. Since we show that evaluating each subexpression requires at most polynomial time, the overall evaluation runs in polynomial time as well.

For XPath subexpressions, we rely on one of the known polynomial time algorithms for XPath [26], which can be easily extended with style, intensional axes, and the other selection-only features added in OXPath. If the expression is an extraction marker, we stream out the extracted tuple, which is in polynomial time, too, while actions are assumed to take constant time.

The only remaining case is the Kleene star: If the Kleene star repeated expression contains a non-nested action, we know that each iteration of the repeated expression leads to a new page. Consequently, there are at most input size many iterations. If the Kleene star does not contain non-nested actions, we evaluate it like an ordinary Regular XPath, leading to a polynomial time algorithm [35]: If a predicate within the expression contains an action, we can evaluate this predicate in polynomial time, and since we need to evaluate this predicate at most for a polynomial number of context tuples, the theorem statement follows. □

5 Page-at-a-time evaluation

As shown in the previous section, OXPath remains close to the favorable complexity results of XPath. In this section, we show OXPath’s practical performance to be close to XPath’s performance, only slightly increasing the upper bounds. For example, consider evaluating the expression

on the (simplified) DOM fragment in Fig. 9. The expression navigates to all descendant links of p nodes, which are themselves descendants of div nodes. It extracts an <R> record for the div node and then clicks on all found links to extract the title of each reached page as <t> attribute. The evaluation first expands the abbreviated syntax to obtain

and then proceeds by processing the individual steps of theexpanded expression. Figure 10 shows the call graph dur-ing the evaluation of Expr , starting with the initial con-text tuple 〈⊥, results, results〉. For simplicity, the nodesin the call graph do not show the full context tuples〈n, p, l〉 but only the respective DOM nodes n. In the firststep, doc(uri) is applied to ⊥ to yield 0, where the lat-ter is the DOM root of uri|. The next step reaches viadescendant-or-self:node() all nodes 0 . . . 12. Since thenodes 1 . . . 12 have no child::div successor, as required inthe third step, we summarize those nodes into a single node,and continue from node 0, which is the only node with a div

Fig. 9 Page and result tree example (legend: DOM node, page, child edge, action {click /}, extracted record, extracted value)

Fig. 10 Call graph example (steps doc(uri), /desc, /div, :<R>, /desc, /p, /desc, /a; repeated calls are marked as memoized)

child. In the fourth step, the extraction marker :<R> creates an <R> record for node 1, linked to the results record that contains all extraction results. Subsequently, when we process the remaining steps in this fashion, we need to make sure that we do not perform redundant computations. For example, there are two ways to reach nodes 7 and 9, respectively, such that all computations below those two nodes would be replicated by a naive algorithm. In such cases, we avoid recomputing the same results by memoizing the previously computed results. We complete the processing by clicking on all found links (nodes 4, 7, 10, 12) and by extracting the title of the reached pages (not shown in the figure). In general, we have to assume that loading the same page a second time yields two different pages. Hence, to minimize the needed resources, we buffer the current page and load each of the four linked pages one after another into a second page buffer.

5.1 Algorithmic design goals

Starting with a standard XPath evaluation with memoization [26], only two of OXPath’s extensions demand significant additional treatment, leading to the following three design goals: (1) Actions visit different pages, and multiple actions on the same page yield branches in the page tree (see Sect. 5.4). Unfortunately, if the same page is fetched multiple times, we may obtain different results, for example, if the underlying data have changed or the page contains time-sensitive information. Thus, we need to buffer such pages, but at the same time need to minimize the number of necessary page buffers without reloading pages. (2) With extraction markers, we can return multiple, possibly related data items, requiring the evaluation to collect these items. To scale well with large-scale extraction tasks, we need to efficiently propagate matches on extraction markers and their relations. Aside from these two goals, we also need to (3) maintain the polynomial evaluation of XPath, catering for the other extensions of OXPath efficiently.

To address (1), our page-at-a-time algorithm traverses the page tree in a depth-first manner without retaining information on pages not visited again. However, a naive depth-first traversal of the DOM nodes within individual pages would cause an exponential worst-case runtime in violation of (3), necessitating memoization of intermediate results. To address (2), our algorithms stream out extraction matches, requiring no buffering at all.

Mutual Recursive Evaluation. As a solution to these design goals, we employ two mutually recursive evaluation procedures eval− and eval. Thereby, eval− evaluates simple OXPath expressions without actions or extraction markers. As such, they are regular XPath expressions (as in [35]), extended with the style axis and additional operators discussed at the beginning of Sect. 4. In our example, //div, //p//a, and //title are simple subexpressions to be evaluated with eval−. Given eval−, eval decomposes full OXPath expressions into chunks of simple expressions for delegation to eval−, and directly handles those steps that contain actions and markers. Each chunk spans from one extraction marker, action, or predicate containing markers or actions to the next such step. Both algorithms implement a recursive top-down evaluation, in the spirit of the semantics in Sect. 4.4, as shown in Algorithms 2 and 3. To compute the value semantics ⟦·⟧_V and extraction semantics ⟦·⟧_E simultaneously, eval− and eval return the result according to ⟦·⟧_V while outputting the tuples following ⟦·⟧_E. For clarity, we factor the evaluation of individual tuples from eval− into evalT− in Algorithm 1.

Memoization. As an essential design goal, we need to prevent multiple evaluations of the same expression with the same context tuple. For example, while evaluating //p//a[...] in our example, there are two a nodes (7 and 10) that are descendants of multiple p nodes. While a naive implementation processes such a nodes multiple times, we avoid this overhead by inserting memoization at two strategic positions: We (1) encapsulate the evaluation of simple expressions into eval−, extending the memoization-based XPath evaluation from [26], and (2) additionally memoize the outcome of non-simple predicates (in eval). Only recursion branches starting at these points can possibly process the same node and expression more than once. Thus, it suffices to memoize at these points (see Sect. 5.7 for details).
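The effect of memoizing on (subexpression, context node) pairs can be illustrated with a minimal Python sketch. It is not the PAAT implementation: the DOM below is a hypothetical tree (not the one in Fig. 9), only descendant steps with a tag test are supported, and the names DOM, descendants, and evaluate are chosen for this illustration only.

```python
# Hypothetical DOM (not the tree of Fig. 9): node id -> (tag, child ids).
DOM = {
    0: ("html", [1]),
    1: ("div", [2]),
    2: ("p", [3, 4]),
    3: ("p", [5]),
    4: ("a", []),
    5: ("p", [6]),
    6: ("a", []),
}


def descendants(n):
    """All proper descendants of node n in document order."""
    out = []
    for c in DOM[n][1]:
        out.append(c)
        out.extend(descendants(c))
    return out


def evaluate(steps, start, memoize):
    """Evaluate a chain of descendant::tag steps from `start`.

    Returns the result node set and the number of times the body of the
    evaluation function actually ran (memoization hits are not counted).
    """
    lookup = {}  # (step index, context node) -> result set
    calls = 0

    def ev(i, n):
        nonlocal calls
        if memoize and (i, n) in lookup:
            return lookup[(i, n)]
        calls += 1
        if i == len(steps):
            result = frozenset([n])
        else:
            result = frozenset()
            for m in descendants(n):
                if DOM[m][0] == steps[i]:
                    result |= ev(i + 1, m)
        lookup[(i, n)] = result
        return result

    return ev(0, start), calls


naive_result, naive_calls = evaluate(["p", "a"], 0, memoize=False)
memo_result, memo_calls = evaluate(["p", "a"], 0, memoize=True)
print(sorted(memo_result), naive_calls, memo_calls)
```

In this toy tree, the a node 6 lies below three nested p nodes, so the naive evaluation processes it three times, while the memoized variant processes each (step, node) pair once.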

Page Management. To keep its resource consumption low, PAAT minimizes the number of simultaneously retained pages and frees a page as soon as possible, either explicitly or implicitly by replacing it with a new page. To decide whether we can replace a page with a new one, eval recursively maintains a flag Free, which is set to true if a page is not required anymore by the caller and thus can be overwritten or freed. In our example, we recursively visit a new page for each of the nodes 4, 7, 10, and 12. In the first three of these recursive calls, Free is set to false, since we still need the current page to follow the last link to another page. Only in case of the last link, Free is set to true. When a page is removed from memory, both the browser DOM and the corresponding entries in the lookup tables are freed.
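The buffer discipline described above can be simulated with a short Python sketch. It is a simplification under stated assumptions: pages are plain strings, the page tree is given up front as a dictionary (in PAAT it is discovered by executing actions), and traverse, load, and visit are hypothetical names. The sketch only tracks how many pages are retained at the same time.

```python
def traverse(tree, root):
    """Depth-first walk over a page tree, tracking simultaneously live pages.

    `tree` maps a page to the pages reachable from it via actions.
    Returns the load order and the peak number of live page buffers.
    """
    live = set()
    order = []
    peak = 0

    def load(page):
        nonlocal peak
        live.add(page)
        order.append(page)
        peak = max(peak, len(live))

    def visit(page, free):
        links = tree.get(page, [])
        for i, child in enumerate(links):
            # Mirrors f = Free and isLast(...): before following the last
            # link, the current page is no longer needed by anyone.
            if free and i == len(links) - 1:
                live.discard(page)
            load(child)
            visit(child, True)   # the child page is freed after this call
            live.discard(child)

    load(root)
    visit(root, True)
    live.discard(root)
    return order, peak


# A result page with four links, as in the running example.
flat = {"results": ["t1", "t2", "t3", "t4"]}
# A page tree of depth 2 with branching on both levels.
deep = {"a": ["b", "c"], "b": ["d", "e"]}
# A linear chain, e.g., following a "next" link repeatedly.
chain = {"p1": ["p2"], "p2": ["p3"], "p3": []}

print(traverse(flat, "results")[1], traverse(deep, "a")[1], traverse(chain, "p1")[1])
```

For the flat example, two buffers suffice (the result page plus one linked page at a time); for the chain, a single buffer is reused throughout.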

5.2 Context tuples

eval and eval− take as input a (full or simple) OXPath expression and two sets ICtx and Ctx of context tuples, both of the same type (out of two possibilities): (1) XPath context tuples cx = ⟨cx.nx, cx.px⟩ depend on their context set to evaluate position and size and consist of the context node and its parent context node, see [26]. (2) Extraction context tuples ce = ⟨ce.cx, ce.p, ce.l⟩ consist of one XPath context tuple and the ids of the last parent and sibling extraction match, reminiscent of the context tuples in the semantics, allowing subsequent extraction matches to be nested according to the predicate nesting (Sect. 4.4).

As remarked above, XPath context tuples are always relative to some context set, which determines the position of the tuple within the set. Since we want to evaluate only part of Ctx, we maintain those tuples in ICtx ⊆ Ctx, and use Ctx only for determining the position of a tuple. For efficiency, we maintain several context sets in the same program variable. Hence, we select all tuples with the same parent context node (and the same parent and sibling extraction matches) to obtain the restricted tuple set Ctx|cx (Ctx|ce) containing only tuples from the (proper) context of cx (ce). Furthermore, we write Ctx|ce,c′e = {⟨c″e.cx, c′e.p, c′e.l⟩ | c″e ∈ Ctx|ce} to adapt parent and sibling match in a restricted context set.
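As a rough illustration of the restriction Ctx|cx, the following Python sketch stores several context sets in one collection and recovers position and size of a tuple from the tuples sharing its parent context node. The class XCtx and the functions restrict, position, and size are hypothetical names, and document order is assumed to coincide with the order of node ids.

```python
from typing import NamedTuple


class XCtx(NamedTuple):
    """XPath context tuple: context node and parent context node."""
    node: int
    parent: int


def restrict(ctx, c):
    """Ctx|c: the tuples of ctx with the same parent context node as c,
    sorted by node id (assumed to be document order)."""
    return sorted(t for t in ctx if t.parent == c.parent)


def position(ctx, c):
    """1-based position of c within its proper context set."""
    return restrict(ctx, c).index(c) + 1


def size(ctx, c):
    """Size of the proper context set of c."""
    return len(restrict(ctx, c))


# Two context sets (parents 1 and 5) kept in the same variable.
ctx = {XCtx(3, 1), XCtx(4, 1), XCtx(6, 5), XCtx(7, 5), XCtx(8, 5)}
print(position(ctx, XCtx(7, 5)), size(ctx, XCtx(7, 5)))
```

Node 7 is reported at position 2 of 3 relative to parent 5, although it would be the fourth of five tuples in the unrestricted collection.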

5.3 PAAT simple evaluation

The first procedure, evalT− (Algorithm 1), evaluates a simple expression Expr on a context tuple cx belonging to a context set Ctx. As a simple expression, Expr is free of actions or extraction markers, but may use other OXPath features such as Kleene stars. For brevity, Algorithm 1 only deals with the most important expressions, that is, axis navigation, Kleene stars, and predicates. The omitted parts (mostly functions and operators) do not affect the algorithm design and can be added analogously to predicates. In our design of evalT−, we are inspired by the polynomial time XPath evaluation algorithms from [26]: evalT− implements a dynamic programming approach by maintaining a memoization table Lookup, which maps subexpressions and context tuples to intermediate results.

First, we check for empty expressions as base case and return the input tuple cx as result {cx} (Line 2). Next, we check whether the result of applying Expr to cx has been memoized, and if so, return this result (Line 3). Otherwise,


 1  Function evalT−(Expr, cx, Ctx):
 2    if Expr ≡ ε then return {cx};
 3    if Lookup[Expr, cx] ≠ null then return Lookup[Expr, cx];
 4    Expr ≡ e t where e matches one of the following cases;
 5    ICtx ← ∅; Ctx′ ← ∅;
 6    if e ≡ axis::nodes then
 7      ICtx ← {⟨n′x, cx.nx⟩ | axis(cx.nx, n′x) ∧ n′x ∈ unarynodes};
 8      Ctx′ ← ICtx;
 9    else if e ≡ (path)∗{v,w} then
10      ICtx ← {cx}; OCtx ← ∅;
11      for i ← 0 to w − 1 do
12        if i ≥ v then
13          ICtx ← ICtx \ OCtx; OCtx ← OCtx ∪ ICtx;
14        if ICtx = ∅ then break;
15        ICtx ← eval−(path, ICtx, ICtx);
16      ICtx ← {⟨c′x.nx, cx.px⟩ | c′x ∈ ICtx ∪ OCtx};
17      Ctx′ ← ICtx;
18    else if e ≡ [q] then
19      if evalT−(q, cx, Ctx) ≠ ∅ then ICtx ← {cx};
20      Ctx′ ← {cx ∈ Ctx | evalT−(q, cx, Ctx) ≠ ∅};
21    Result ← eval−(t, ICtx, Ctx′);
22    Lookup[Expr, cx] ← Result;
23    return Result;

Algorithm 1: Evaluation of simple OXPath on tuples

 1  Function eval−(Expr, ICtx, Ctx):
 2    Result ← ∅;
 3    if Ctx is extraction context set then
 4      for ce ∈ ICtx do
 5        for cx ∈ evalT−(Expr, ce.cx, Ctx) do
 6          Result ← Result ∪ {⟨cx, ce.p, ce.l⟩};
 7    else for cx ∈ ICtx do
 8      Result ← Result ∪ evalT−(Expr, cx, Ctx);
 9    return Result;

Algorithm 2: Evaluation of simple OXPath on tuple sets

Algorithm 1 proceeds in a depth-first manner: It splits Expr into prefix e and remainder t (Line 4) to evaluate e directly and t recursively. In its main part (Lines 5–20), Algorithm 1 evaluates e on cx to obtain the context sets ICtx and Ctx′ (see Sect. 5.2), where e is either an axis with node test, a Kleene star, or a predicate. Finally, the evaluation of the trailing expression t on ICtx and Ctx′ yields the result to memoize and return (Lines 21–23).

(1) For axis navigation (Line 6), we obtain ICtx = Ctx′ via axis and unarynodes, adjusting the parent node to cx.nx.

(2) In case of Kleene star expressions without actions or extraction markers (Line 9), each successive iteration might return nodes already reached through prior iterations. Thus, in each of the w applications of path, we avoid traversing paths starting at already analyzed context tuples. More specifically, we collect in OCtx the tuples reached by the compound Kleene star expression,

 1  Function eval(Expr, ICtx, Ctx, Free):
 2    if Expr is simple then return eval−(Expr, ICtx, Ctx);
 3    Expr ≡ h e t where h is simple, e is not;
 4    ICtx ← eval−(h, ICtx, Ctx, false);
 5    if ICtx = ∅ then return ∅;
 6    Result ← ∅;
 7    if e ≡ {action/} ∨ {action} then
 8      for ce ∈ ICtx do
 9        f ← Free ∧ isLast(c, ICtx);
10        ICtx′ ← {⟨⟨getPage(action, ce.cx.n, f), ce⟩, ce.p, ce.l⟩};
11        if e ≡ {action} then
12          if isPageUnmodified() then ICtx′ ← {ce};
13          else ICtx′ ← eval−(afp(action, ce.cx.n), ICtx′, ICtx′);
14        Result ← Result ∪ eval(t, ICtx′, ICtx′, true);
15        if Free then FreeMem(pageof(ICtx′));
16    else if e ≡ ⟨M[= v]⟩ then
17      ICtx′ ← ∅;
18      for ce ∈ ICtx do
19        val ← evalT−(v, ce.cx, Ctx|ce);
20        output ⟨out(ce.cx.nx, M), ce.px, M, val⟩;
21        ICtx′ ← ICtx′ ∪ {⟨ce.cx, ce.p, out(ce.cx.nx, M)⟩};
22      Result ← eval(t, ICtx′, ICtx′, Free);
23    else if e ≡ (path)∗{v,w} then
24      if path contains an action on the main path then
25        if w > 0 then
26          if v = 0 then Result ← Result ∪ eval(t, ICtx, ICtx, false);
27          Expr′ ← path path∗{max(0, v − 1), w − 1} t;
28          Result ← Result ∪ eval(Expr′, ICtx, ICtx, Free);
29        else Result ← Result ∪ eval(t, ICtx, ICtx, Free);
30      else
31        N ← ∅;
32        for ce ∈ ICtx do
33          ICtx′ ← {ce}; OCtx ← ∅;
34          for i ← 0 to w − 1 do
35            if i ≥ v then ICtx′ ← ICtx′ \ OCtx; OCtx ← OCtx ∪ ICtx′;
36            if ICtx′ = ∅ then break;
37            ICtx′ ← eval(path, ICtx′, ICtx′, Free ∧ (i = w − 1) ∧ (t ≡ ε));
38          N ← N ∪ {⟨⟨c′e.nx, ce.px⟩, c′e.pe, c′e.le⟩ | c′e ∈ ICtx′ ∪ OCtx};
39        Result ← eval(t, N, N, Free);
40    else if e ≡ [q] then
41      ICtx′ ← ∅;
42      for ce ∈ ICtx do
43        Free′ ← Free ∧ (t ≡ ε) ∧ isLast(ce, ICtx);
44        c′e ← ⟨ce.cx, ce.le, ce.le⟩;
45        if Lookup∃[e, ce] = false then continue;
46        else if Lookup∃[e, ce] = true or
47             eval(q, {c′e}, Ctx|ce,c′e, Free′) ≠ ∅ then
48          ICtx′ ← ICtx′ ∪ {c′e}; Lookup∃[e, ce] ← true;
49        else Lookup∃[e, ce] ← false;
50      Result ← eval(t, ICtx′, ICtx′, Free);
51    return Result;

Algorithm 3: Evaluation of full OXPath on tuple sets


and maintain in ICtx the tuples to be explored. Inside the loop, if we have reached the lower bound v, we remove from ICtx all tuples already collected in OCtx (as they would be redundant), and add all new tuples to OCtx (Line 13). If there are no new tuples left, we break the loop (Line 14); otherwise, we evaluate path once (Line 15) to complete the iteration. Finally, we add OCtx to ICtx and set in all resulting tuples cx.nx as parent context (Line 16). The latter groups the tuples in the context ICtx, as ICtx might become part of a larger context set subsequently: This ensures, for example, that in an expression ψ/(φ){2,3}[i], for each node n matching ψ, the i-th node reached via φ from n is returned, instead of the single i-th node among all nodes reached via ψ/(φ){2,3}.

(3) We deal with predicate expressions [q] (Line 18) by recursively evaluating q with evalT−, since q must be simple itself. If the evaluation returns a non-empty set, we keep cx in ICtx and, likewise, determine Ctx′ as the subset of Ctx whose nodes satisfy q (Line 20).

Finally, in all cases, we evaluate the tail t of Expr with the new context sets ICtx and Ctx′, memoize the result, and return it.

Algorithm 2 iterates over a set of context tuples ICtx and calls evalT− for each tuple. It is split into two parts: The first part (Lines 3–6) covers the case that ICtx contains extraction context tuples, the other part the case of XPath context tuples. In the former case, we strip the extraction matches from the tuples to reduce the space for Lookup in evalT−, see Sect. 5.7. Either way, we call evalT− for each tuple in ICtx, providing the restriction Ctx|cx (or Ctx|ce) of Ctx to the proper context set of cx (or ce). In case of extraction tuples, the algorithm reattaches the parent and sibling matches to the returned nodes in Line 6.

5.4 PAAT full evaluation

In Algorithm 3, we show eval for handling full OXPath. Building upon eval−, eval deals with actions and extraction markers, and with expressions that contain actions and extraction markers as subexpressions, that is, Kleene stars and predicates. Besides exploiting the memoization of eval−, eval also memoizes the outcome of predicate evaluations in its own lookup table Lookup∃. As in evalT−, we omit functions and operators for clarity; they are handled analogously to predicates.

Performing the actions in the expression, eval traverses the page tree in a depth-first manner. For Kleene stars without actions on the main path, eval works similarly to evalT−, since all resulting nodes are on the same page, even if actions within the predicates of the Kleene star expressions may navigate to other pages. While the input context set Ctx only contains nodes from a single page, the result set Result may contain nodes from many pages.

If Expr is simple, we delegate its evaluation to eval− and return the result (Line 2). Otherwise, we split Expr into three parts (Line 3): h is the maximum prefix which is simple, e is either an action, an extraction marker, or a Kleene star or predicate containing an action or extraction marker, and t is the remaining expression. The prefix h is evaluated with eval− (Line 4), and the result is returned if empty (Line 5).

(1) The first main case deals with actions (Line 7), covering both absolute actions {action/} and contextual actions {action}. Roughly, we iterate over all ce in ICtx, obtaining per ce a new context set ICtx′ with a single tuple. That tuple is either the root of the page returned by the action applied to ce or the result of evaluating the action-free prefix on that root. Either way, the parent and sibling extraction matches are copied from ce. In getPage, we free the page on which the action is performed if the input flag Free is set and ce is the last tuple in the iteration (Line 9). Upon freeing a page, all memoization information in Lookup and Lookup∃ related to this page is freed, too. If the action is contextual (Line 11) and did not alter the page, we stay at ce to avoid evaluating the action-free prefix afp(action, ce.cx) (Line 12), as done otherwise. Either way, we evaluate t recursively on ICtx′, descending one step further in the depth-first traversal of the page tree (Line 14). We set Free to true, since the page and all related memoized information are freed after the invocation in any case (Line 15).

(2) For extraction markers (Line 16), first the value to be extracted is evaluated with evalT−, as extraction values are always computed from simple expressions (Line 19), and the obtained result tuple is written to the output (Line 20). Finally, we add a tuple to ICtx′ that is identical to ce up to the sibling match, which is set to out(ce.cx.nx, M). The tail is then recursively evaluated with the new context set ICtx′.

(3) Kleene stars containing actions are treated in two cases: (i) If a Kleene star contains an action on its main path, we expand the expression (Line 27) and recursively evaluate the expanded expression (Line 28). Once the expansion has reached the lower bound, that is, v became 0, we also collect results by evaluating the tail t (Line 26). The results for the final iteration are collected separately (Line 29). By evaluating the tail t at each individual recursion step, we avoid context sets with nodes from different pages and nevertheless evaluate the same expression only once for the same context, as each iteration yields nodes from different pages. (ii) Otherwise, without actions on the main path (Line 30), different iterations can never re-reach a node already processed, and hence, we use a similar strategy as in evalT−.

(4) It remains to address predicates containing actions or extraction markers (Line 40). Here, we need to evaluate the contained expression q for every ce in ICtx to test whether it yields ∅. Since q contains an action or extraction marker, we need to use eval. Doing so without memoization would lead to an exponential runtime, due to expressions such as //a[.//b[{click /}][.//c[{click /}][...]]], which requires another lookup table Lookup∃. Here, we need to memoize an entry per extraction context tuple and expression; however, as result we only store true or false. We construct the extraction context tuple c′e for evaluating the predicate by taking the sibling extraction match as new parent match (Line 44). Then q is evaluated over {c′e} with Ctx|ce,c′e (see Sect. 5.2) as new context set (Line 47). It remains to evaluate the tail t on the filtered context ICtx′ (Line 50).
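The unrolling of a bounded Kleene star with an action on its main path (Lines 25–29 of Algorithm 3) can be sketched as a small Python function that lists the expressions whose tails are eventually evaluated. Expressions are plain strings here, the function name expand and the sample path and tail are hypothetical, and unbounded stars (which the algorithm cuts off via the depth of the page tree) are not covered.

```python
def expand(path, v, w, tail):
    """Unroll (path)*{v,w} tail into the list of action chains to evaluate.

    Mirrors Lines 25-29: for w > 0, the tail is evaluated directly once the
    lower bound has been reached (v = 0), and the star is rewritten to
    path (path)*{max(0, v-1), w-1} tail; for w = 0, only the tail remains.
    """
    if w <= 0:
        return [tail]                                   # Line 29
    out = []
    if v == 0:
        out.append(tail)                                # Line 26
    for rest in expand(path, max(0, v - 1), w - 1, tail):
        out.append(path + rest)                         # Lines 27-28
    return out


# Hypothetical path and tail, e.g., following a "next" link up to two times.
for expr in expand("/a{click /}", 0, 2, "/title"):
    print(expr)
```

Each unrolled chain contains a different number of actions and therefore ends on a different page, which is why the tail is evaluated separately per recursion step.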

5.5 PAAT example

With the algorithms at hand, we now revisit the example shown in Figs. 9 and 10 to discuss its processing in detail.

1. Navigation. PAAT starts with eval(Expr, Ctx, Ctx, true) where Ctx = {⟨⟨⊥,⊥⟩, results, results⟩}. Expr is split into h = ε, e = doc(uri), and t = descendant-or-self::node()..., see Line 3 of Algorithm 3. As h is empty, the call to eval− in Line 4 returns the unchanged context tuples, continuing with e = doc(uri), an absolute action to load the new page (Line 10). Thereby, we have Free′ = Free = true, since there is only a single tuple in the context set. On loading uri, getPage returns 0, the root of the page tree in Fig. 9. The context tuple ⟨⟨0,⊥⟩, results, results⟩ is used for processing the tail recursively (Line 14).

In the recursive invocation on 0 (yielding the second box in Fig. 10), the former tail t becomes Expr and is split into h = /descendant-or-self::node()/child::div, e = :<R>, and the rest t. First, eval evaluates h with eval−, which calls evalT− for each single context node. Since evalT− has no memoized data available for h (Line 3 of Algorithm 1), it splits Expr into descendant-or-self::node() and child::div (Line 4). Evaluating the first expression, the context is expanded in Line 7 to all nodes in the page, and child::div is evaluated on these nodes with eval−, which calls evalT− once for each node. In Fig. 10, we summarize these calls with three boxes: Starting at 0, descendant-or-self::node() yields a summary box for the DOM nodes 1 … 12 (although there is one call for each node) and a box for 0. Finally, child::div leads from 0 to 1.

Thus, eval− returns ⟨⟨1,⊥⟩, results, results⟩ to eval, which continues with the evaluation of :<R> (Line 16). For that context, ⟨out(1, R), results, R, null⟩ is written to the output by the evaluation of :<R> (Line 20). In Fig. 9, the new tuple is shown as a (square) node R.

eval continues recursively with the rest of the expression using the context tuple ⟨⟨1,⊥⟩, results, out(1, R)⟩, carrying the fresh output node out(1, R) as new sibling marker.

The rest of the expression is split again, this time into the simple expression h of four XPath steps between :<R> and the predicate, the predicate (as e), and an empty tail. The evaluation of the simple expression is again delegated to evalT− (via eval−): 1 has all nodes except ⊥ as descendants, but the only nodes with a p child are 1, 5, and 8; all others subsequently return empty results. The evaluation continues in a depth-first manner, evaluating the leftmost branches first. Figure 10 shows how we first find all a descendants of p descendants of 1, memoizing the results at every step. When we later search for such nodes for 5 and 8, we find that all matching nodes have already been evaluated and the corresponding results memoized.

2. Predicate Evaluation. The evaluation of the predicate in eval starts with the context set containing one context tuple for each of 4, 7, 10, and 12. For example, for 4, we get the tuple ⟨⟨4, 2⟩, results, out(1, R)⟩. To evaluate e = [{click/}/....], eval changes the tuple to ⟨⟨4, 2⟩, out(1, R), out(1, R)⟩, such that the last sibling match out(1, R) becomes the parent in the recursive call to eval (Line 47). Thus, all tuples extracted during the predicate evaluation (i.e., within this branch of the call tree) are descendants of out(1, R).

The recursive calls to eval split Expr into e = [{click/}] and t = /desc-or-self::node()/title:<t=string(.)>. In the first three of these four calls, Free is set to false, since the page is still needed, but in the last invocation, Free is set to true, such that the page buffer of the current page can be reused. Accordingly, the action in e opens the linked page with getPage (Line 10) into either a new or the current page buffer, in the latter case overwriting the old page. In any case, the context now refers to the root node of the new page. eval evaluates t recursively (Line 14), always setting Free to true, since the newly loaded page will be freed after this invocation anyway (Line 15).

The subsequent recursive invocations on the new page navigate to the title node for value extraction. Hence, with i ranging over the title nodes, eval outputs the tuple ⟨out(i, t), out(1, R), t, val_i⟩ (diamond in Fig. 9) as child of out(1, R), associated with its textual content val_i (Line 20).
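Since every streamed tuple carries the id of its parent match, a consumer can rebuild the nested records from the output stream alone. The following Python sketch illustrates this on the consumer side; the tuple layout (id, parent id, marker, value) follows the output tuples above, while the function name nest, the concrete ids, and the title strings are hypothetical.

```python
def nest(tuples, root="results"):
    """Group streamed extraction tuples (id, parent id, marker, value)
    into nested records below the given root id."""
    children = {}
    for node_id, parent, marker, value in tuples:
        children.setdefault(parent, []).append((node_id, marker, value))

    def build(parent_id):
        return [
            {"marker": marker, "value": value, "children": build(node_id)}
            for node_id, marker, value in children.get(parent_id, [])
        ]

    return build(root)


# Hypothetical output stream for the running example: one <R> record with
# one <t> attribute per visited page (title strings are made up).
stream = [
    ("out(1,R)", "results", "R", None),
    ("out(t4,t)", "out(1,R)", "t", "Title A"),
    ("out(t7,t)", "out(1,R)", "t", "Title B"),
]
records = nest(stream)
print(records[0]["marker"], [c["value"] for c in records[0]["children"]])
```

Note that this buffering happens only in the consumer of the stream; the evaluation itself does not need to retain the extracted tuples.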

5.6 Intensional axes

So far, we did not consider the evaluation of intensional axes. As we will show in Theorem 8, we could simply assume that they are precomputed on page load. In practice, however, it is usually preferable to delay that computation and to use memoization, requiring a small modification to the PAAT algorithm: First, we must explicitly manage the environment containing the variable bindings. While technically necessary already for plain XPath, we omitted the variable bindings in the algorithms for OXPath without intensional axes for clarity, as these bindings are handed through all recursive invocations unaltered. Second, in Line 7 of evalT−, we can no longer assume that the axis is precomputed, but rather need to determine all nodes related to the current context node by calling eval with the expression defining the intensional axis and an updated environment. If the result is not empty, the node is related to the context node and added to Ctx′. To avoid multiple evaluations of the same intensional axis, we guard this evaluation with a memoization table similar to Lookup∃, but with an additional entry for the related node. In Sect. 6, we show that, implemented in this way, intensional axes have little impact on the practical performance of OXPath.

5.7 Analysis of PAAT

In Table 5, we summarize the complexity of OXPath and some sublanguages. We denote with q the size of the query, with n the (maximum) number of nodes in a document, with p the number of pages in the page tree, and with d the depth of the page tree. On the left of the table, we show which features of OXPath are allowed (X) or not allowed (–). (X) indicates that the complexity is not affected by removing or adding the particular feature. The considered features are extraction markers (normalized only or any), actions (absolute or contextual), and Kleene stars (bounded only or any). We also give the maximum number of page buffers PAAT maintains for each class. In the following, we give proofs for the most important cases. Intensional axes are considered separately in Theorem 8.

Theorem 5 Evaluating a simple OXPath expression with evalT− takes at most O(q^2 · n^4) time and O(q^2 · n^3) space, where q is the size of the expression and n the number of nodes in all documents reached by the evaluation.

In case of simple OXPath,³ n is the number of nodes in the start document.

Proof The proof closely follows the proof of Theorem 6.6 in [26], since simple OXPath is roughly comparable to XPath, adding only Kleene stars, the style axis, and a few operators. Due to the memoization, the Kleene star does not affect the complexity (it does, however, impact practical performance, as it generates on average far more Lookup entries than other expressions).

PAAT is a top-down, recursive implementation of the context value table principle from [26].

³ Simple OXPath is the restriction of OXPath to simple OXPath expressions, but we allow a doc() action at the start of the expression to set the document to be queried.

We first show that evaluating a simple OXPath expression using evalT− takes O(l_− · T_op) time and O(l_− · S_val) space, where l_− is the maximum number of entries and S_val the maximum size of an entry in the lookup table Lookup in evalT−, and T_op is the maximum cost for evaluating a function or operator. This holds as the body of evalT− runs at most once per entry in Lookup. For each entry, the time for executing evalT− is bounded by the maximum time for evaluating a function or operator, as this time dominates the other cases in evalT− (all bounded by O(n), whereas the function or operator evaluation is ≥ n). For space, observe that other than the lookup table, we only manage the three contexts Ctx, Ctx′, and Ctx′′ and the Result, the former three bounded by O(n^2), the latter by O(n), and thus dominated by the size of Lookup.

Second, we show that (1) l_− ∈ O(q · n^2). This follows immediately from the signature of Lookup. (2) S_val ∈ O(q · n). concat() and multiplication are the operations that yield the largest increase in value size, and the resulting values are bounded by O(q · n), see [26]. (3) T_op ∈ O(q · n^2). Again following Theorem 6.6 in [26], we observe that the most expensive operation is =, which compares two node sets of up to O(n) size. Together with the bound on value size, we obtain a bound of O(q · n^2) for T_op. The added axes or comparison operators in OXPath do not affect this result if we assume a precomputed table for ~ and ~= as for =. We also deviate in how we compute position() and size(), namely by projecting the context set to tuples with the same parent and sorting the result. However, this is done in O(n log n) and thus dominated by the time for =. □
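For reference, substituting the three bounds (1)–(3) into the time and space expressions from the first part of the proof yields the bounds stated in Theorem 5:

```latex
\begin{align*}
\text{time:}\quad  & O(l_- \cdot T_{op}) = O\big((q \cdot n^2) \cdot (q \cdot n^2)\big) = O(q^2 \cdot n^4),\\
\text{space:}\quad & O(l_- \cdot S_{val}) = O\big((q \cdot n^2) \cdot (q \cdot n)\big) = O(q^2 \cdot n^3).
\end{align*}
```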

Proposition 5 Evaluating an OXPath expression with eval takes O(N_Expr · N_ce + N_Expr · N_cx · T_op) time and O(N_Expr · S_Ctx + l_∃ + l_− · S_val) space, where N_Expr is the number of subexpressions evaluated by recursive calls in the evaluation, N_ce the number of extraction context tuples, N_cx the number of XPath context tuples from all reached pages, S_Ctx the maximum size of a context set in eval, l_∃ the maximum size of Lookup∃, l_− the maximum size of Lookup, and S_val, T_op as in Theorem 5.

Proof We first consider space complexity: eval uses Lookup∃ and (indirectly) the lookup table Lookup for evalT−. Additionally, we have to account for the various context sets and the result set Result. The latter can be streamed out rather than stored, as it is never processed further. The context sets are accounted for by S_Ctx · N_Expr, as each call to eval uses a constant number of such sets and the depth of the call graph (and thus the number of simultaneously stored context sets) is bounded by N_Expr. It suffices to consider l_− · S_val for the impact of evalT−, as Theorem 5 shows that this expression dominates its space complexity.


Table 5 Complexity of OXPath family

Extraction     Actions           Kleene          Time                       Space                            Page buffers    Class
Norm.  Any     Abs.  Context.    Bounded  Any
–      –       –     –           –        –      O(q^2 · n^4)               O(q^2 · n^3)                     O(1)            XPath + style, …
–      –       –     –           X        (X)    O(q^2 · n^4)               O(q^2 · n^3)                     O(1)            Simple
X      –       –     –           –        –      O(q^2 · n^4)               O(q^2 · n^3)                     O(1)
X      X       –     –           –        –      O(q^2 · n^4)               O(q^2 · n^4)                     O(1)
–      –       X     (X)         –        –      O(q^2 · (p · n)^4)         O(q^2 · (min(q, d) · n)^3)       O(min(q, d))
–      –       X     (X)         X        X      O(q^2 · d · p^2 · n^4)     O(q^2 · d^3 · n^3)               O(min(q, d))    Extraction-free
X      –       X     (X)         –        –      O(q^3 · p^4 · n^4)         O(q^6 · n^3)                     O(min(q, d))
X      X       X     (X)         –        –      O(q^3 · p^4 · n^4)         O(q^7 · n^4)                     O(min(q, d))    Kleene-free
X      –       X     (X)         X        X      O(q^2 · d · p^3 · n^4)     O(q^2 · d^4 · n^3)               O(d)            Normalized
X      X       X     (X)         X        X      O(q^3 · d · p^4 · n^4)     O(q^3 · d^5 · n^4)               O(d)            Full

For time complexity, observe that eval is called at most N_Expr · N_ce times. This holds since eval is never called twice on the same expression for the same context tuple other than in Line 47, where two calls may use the same context tuple c′e (originating from context tuples with different parent matches). However, in that case, the total number of calls is still bounded by the original ICtx set.

In each such call, eval may delegate the evaluation of the expression or some subexpression to evalT−. Due to the memoization in evalT−, the total time for all these calls is, however, bounded by N_Expr · N_cx · T_op, as any repeated calls immediately return the memoized result and the memoization tables are kept until all nodes from the page are processed.

Other than those calls and recursive calls to itself on a proper subexpression, eval only requires constant time per node if we assume that action execution is constant. □

Theorem 6 Evaluating a full OXPath expression requires O(q^3 · d · p^4 · n^4) time and O(q^3 · d^5 · n^4) space.

Proof For this proof, we first observe the invariant on eval: each context set contains only context tuples with context nodes from one page.

For full OXPath, the following bounds hold (q query size, d depth of the page tree reached by the evaluation, n maximum number of nodes on a page, p number of pages reached by the evaluation):

(1) N_Expr is bounded by O(q · d), not O(q), as the expansion of Kleene stars in Line 27 introduces new expressions. However, there can be at most d + 1 such expansions on the path to the root from any leaf expression, as each expansion includes at most one action and, after d expansions, any additional expansion yields an expression with d + 1 actions on a single path and thus an empty result, as the page tree is bounded by d. In this case, the expansion is stopped and we obtain the bound of O(q · d). For the evaluation of action-free prefixes, we at most double this number (if each step contains a contextual action).

(2) N_ce is bounded by O(q^2 · p^4 · n^4) since there are at most p · n (parent or actual) context nodes, each such pair combined with at most q · p · n different parent and sibling matches, since they must originate from an extraction marker on the path to the root and such a path has at most q distinct such expressions. Kleene star expansion may cause the same extraction marker to occur in several positions, but matches for all occurrences are indistinct.

(3) N_cx is similarly bounded by O((p · n)^2).

(4) T_op and S_val are as in Theorem 5, as we do not allow actions in operands of functions and operators and thus an operand is limited to nodes from a single page. The value size is also not affected by Kleene star expansions.

(5) l_− is bounded by O(N_Expr · (d · n)^2), since only the lookup entries from at most d pages are active at a time.

(6) l_∃ is similarly bounded by O(N_Expr · q^2 · d^4 · n^4) since there are only tuples from at most d pages with at most n nodes stored in Lookup_∃ and each of those may be combined with q · d · n parent and the same number of sibling markers.

(7) S_Ctx is bounded by O(n^2 · q^2 · d^2 · n_m^2) as each context set contains only extraction context tuples for context nodes from one page (but extraction matches may originate from any page on the current branch of the page tree).

With this, the complexity follows from Theorem 5. □

Theorem 7 Evaluating a normalized OXPath expression takes O(q^2 · d · p^3 · n^4) time and O(q^2 · d^4 · n^3) space.

Proof For normalized OXPath, we can drop the sibling extraction markers from the extraction context tuples in eval and eval−.

The complexity remains as for full OXPath except for the following: (1) N_ce is bounded by O(q · p^3 · n^3) since we do not need to maintain sibling extraction matches. (2) l_∃ is bounded by N_Expr · (q · d^3 · n^3) for the same reason. (3) S_Ctx is bounded by O(q · d · n^3) as each context set contains only extraction context tuples for context nodes from one page (but parent extraction matches may originate from any page on the current branch of the page tree). With this, the complexity follows from Theorem 5. □
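As an illustrative check, not part of the original proofs, the bounds stated above can be substituted into two of the terms used in the analysis: the number of calls to eval (N_Expr · N_ce) and the size of the Lookup_∃ table (l_∃). The sketch assumes N_Expr = O(q · d) for both full and normalized OXPath. The remaining terms of Theorem 5 are not restated in this section and are therefore not expanded here.

```latex
\begin{align*}
\text{full:}\quad
  N_{Expr} \cdot N_{ce}
    &= O(q \cdot d) \cdot O(q^2 \cdot p^4 \cdot n^4)
     = O(q^3 \cdot d \cdot p^4 \cdot n^4), \\
  l_{\exists}
    &= O(q \cdot d) \cdot O(q^2 \cdot d^4 \cdot n^4)
     = O(q^3 \cdot d^5 \cdot n^4), \\[4pt]
\text{normalized:}\quad
  N_{Expr} \cdot N_{ce}
    &= O(q \cdot d) \cdot O(q \cdot p^3 \cdot n^3)
     = O(q^2 \cdot d \cdot p^3 \cdot n^3), \\
  l_{\exists}
    &= O(q \cdot d) \cdot O(q \cdot d^3 \cdot n^3)
     = O(q^2 \cdot d^4 \cdot n^3).
\end{align*}
```

The two products for full OXPath coincide with the time and space bounds of Theorem 6, and the normalized l_∃ product coincides with the space bound of Theorem 7. The normalized call count stays below the stated time bound O(q^2 · d · p^3 · n^4), so the extra factor of n there presumably stems from one of the Theorem 5 terms that are not expanded in this sketch.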


Theorem 8 Intensional axes increase the complexity of OXPath and any sublanguage including XPath by at most a factor of O(n^2), where n is the total number of nodes in the page tree.

Proof In general, intensional axes can be evaluated as follows: First, we materialize all intensional axes in an expression bottom-up. Then, the expression is evaluated over the intensional axis as usual. The actual evaluation complexity is not affected, as the two necessary operations, testing whether a pair of nodes is in an axis and iterating over all nodes that are related to a given context node by an axis, retain their (constant, resp. linear) complexity. For the materialization of the intensional axes, we first note that there are at most q such axes. For each axis, we need to store at most O(n^2) tuples. To compute the materialization, we need to evaluate the expression inside the intensional axis at most once for each pair of nodes, binding each successively to $lhs and $rhs. □

Theorem 9 (Memory minimality) Let L be an OXPath expression without actions in predicates. Then there exists a page tree for which every algorithm that computes ⟦·⟧_V without prior knowledge of the page tree requires at least as many page buffers as PAAT.

Proof An expression from L has the shape e_d = doc(w) r_d with r_d = /φ_d/{action} r_{d−1} for d > 1 and r_1 = ε. Assume further that in the page tree of the expressions e_d, each page has at least two nodes with an action that leads again to another page of this form. e_d executes {action} on all nodes of a page w that match the corresponding φ_i, and continues recursively from all pages thus reached. It returns the roots of the pages finally reached.

When we evaluate e_d with PAAT, we access the page tree up to a depth of d and use exactly d page buffers. This holds, since the accessed page tree has at least two branches at each page.

Any other algorithm A must load the leaves of the accessed page tree of PAAT, as these nodes are the result of evaluating ⟦e_d⟧_V. To visit such a leaf node l of the accessed page tree, we have to load its parent p first, because, without prior knowledge, all children of p are only accessible by performing {action} on the respective node in p. Thus, A must have loaded all d − 1 ancestors of l to finally access l. Assume that l is the first leaf reached by A. Then, A must buffer all d − 1 ancestors in addition to l, because for each ancestor of l, there are further children to be visited. □
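The buffer count in this argument can be illustrated with a small simulation. The following is a minimal sketch, not the PAAT implementation: it assumes a complete page tree with `depth` levels of pages and a fixed `branching` factor, and it models a page buffer simply as a counter that is incremented while a page still has unexplored branches. All names (`BufferStats`, `visit`, `simulate`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BufferStats:
    """Counters collected during one simulated traversal."""
    current: int = 0   # page buffers held right now
    peak: int = 0      # maximum number held simultaneously
    leaves: int = 0    # leaf pages reached (the query result)


def visit(level: int, depth: int, branching: int, stats: BufferStats) -> None:
    """Depth-first visit of one page. The page stays buffered while its
    children are explored, because further actions must still be run on it."""
    stats.current += 1                      # load (buffer) this page
    stats.peak = max(stats.peak, stats.current)
    if level == depth:
        stats.leaves += 1                   # leaf page: part of the result
    else:
        for _ in range(branching):          # one action per matching node
            visit(level + 1, depth, branching, stats)
    stats.current -= 1                      # all branches done: release page


def simulate(depth: int, branching: int = 2) -> BufferStats:
    """Traverse a complete page tree with `depth` levels of pages."""
    if depth < 1 or branching < 1:
        raise ValueError("depth and branching must be at least 1")
    stats = BufferStats()
    visit(1, depth, branching, stats)
    return stats


if __name__ == "__main__":
    for d in (1, 2, 3, 4):
        s = simulate(d)
        print(f"depth={d}: leaves={s.leaves}, peak buffers={s.peak}")
```

In this model the peak number of simultaneously buffered pages equals the depth of the tree, independent of the branching factor and of the total number of pages visited, which mirrors the d page buffers used in the proof.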

6 Evaluation

We confirm OXPath’s scaling behavior as follows:

(1) The theoretical complexity bounds from Sect. 5.7 are confirmed in several large-scale extraction experiments in diverse settings, in particular the constant memory use even for extracting millions of records from hundreds of thousands of web pages.

(2) We illustrate that OXPath’s evaluation is dominated by the browser rendering time, even for complex queries on small web pages. None of the extensions of OXPath (Sect. 1.1) significantly affects the scaling behavior or the dominance of page rendering.

(3) In an extensive comparison with commercial and academic web extraction tools, we show that OXPath outperforms previous approaches by at least an order of magnitude. Where OXPath stays within constant memory bounds independently of the number of accessed pages, most other tools require linear memory.

Scaling: Millions of Results at Constant Memory. We validate the complexity bounds for OXPath’s PAAT algorithm and illustrate its scaling behavior by evaluating three classes of queries that require complex page buffering. Figure 11a shows the results of the first class, which searches for papers on “Seattle” in Google Scholar and repeatedly clicks on the “Cited by” links of all results using the Kleene star operator. The query descends to a depth of 3 into the citation graph and is evaluated in under 9 min (130 pages/min). The number of retrieved pages grows linearly in time, while the memory size remains constant throughout. The jitter in memory use is due to the repeated ascents and descents of the citation hierarchy. Figure 11b shows the same test mirroring Google Scholar pages on our web server (to avoid overly taxing Google’s servers). The number in brackets indicates how we scale the axes. OXPath extracts over 7 million pieces of data from 140,000 pages in 12 h (194 pages/min) with constant memory.
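The throughput quoted for the mirrored run follows directly from the page count and the running time. The snippet below is only a worked arithmetic check using the numbers stated above; the function names are hypothetical, and the result count is treated as a lower bound because the text says "over 7 million".

```python
def pages_per_minute(pages: int, hours: float) -> float:
    """Average throughput in pages per minute."""
    return pages / (hours * 60.0)


def results_per_page(results: int, pages: int) -> float:
    """Average number of extracted data items per visited page."""
    return results / pages


# Figures quoted for the mirrored Google Scholar run (Fig. 11b).
PAGES = 140_000
HOURS = 12
RESULTS = 7_000_000  # "over 7 million", so this is a lower bound

if __name__ == "__main__":
    print(f"{pages_per_minute(PAGES, HOURS):.1f} pages/min")
    print(f"at least {results_per_page(RESULTS, PAGES):.0f} results/page")
```

Here 140,000 pages over 720 minutes gives roughly 194 pages/min, matching the reported rate, and the result count corresponds to at least 50 extracted items per page on average.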

We conduct similar tests, that is, repeatedly clicking on all links on certain pages, chosen from sites with different characteristics: In one experiment, we took very large pages from Wikipedia, and in another one, we took pages from Google Products to reach different web shops with their typically visually elaborate pages. Figure 11c demonstrates OXPath’s constant memory usage in the face of an increasing number of visited pages. For large pages, the results are very similar.

Fig. 11 Scaling OXPath: Memory, visited pages, and output size versus time. (a) Many pages: memory [MB] and visited pages over time [sec]. (b) Millions of results: memory [MB], visited pages [1000], and extracted matches [100,000] over time [h]. (c) Many results: memory [MB], extracted matches [10], and visited pages over time [sec].

Fig. 12 Profiling OXPath’s components. (a) Profiling OXPath (avg.): browser initialization (13%), page rendering (85%), PAAT (2%). (b) Profiling OXPath (per query): time [sec] for browser initialization, page rendering, and PAAT evaluation per query set D1–D5. (c) Contextual actions: time [sec] versus number of actions (1–6) for absolute and contextual actions.

Fig. 13 OXPath versus plain XPath: time [sec] versus number of query expansions (1–5).

Profiling: Page Rendering is Dominant. We profile each stage of OXPath’s evaluation performing five sets of queries on the following sites: apple.com (D1), diadem-project.info (D2), bing.com (D3), www.vldb.org/2011 (D4), and the Seattle page on Wikipedia (D5). On each, we click all links and extract the html tag of the resulting pages. Figure 12a, b show the total and page-wise averages, respectively. For bing.com, the page rendering time and number of links are very low, and thus also the overall evaluation time. Wikipedia pages, on the other hand, are comparatively large and contain many links, thus the overall evaluation time is high.

The second experiment analyzes the effect of OXPath’s actions on query evaluation, comparing time performance of contextual and absolute actions. Our test performs actions on pages that do not result in new page retrievals. Figure 12c shows the results with queries containing 1–6 actions on Google’s advanced product search form. Contextual actions suffer an insignificantly small penalty in their evaluation time as compared with their absolute equivalents.

Actions (as well as extraction markers and Kleene star expressions) affect the evaluation notably, as expected. This contrasts sharply with the added selection capabilities: Neither style, field(), nor the added selectors significantly impact evaluation performance. The same holds for intensional axes. Figure 13 summarizes this observation, showing the evaluation time for a series of expansions of a simple expression. It compares the case where the expression is pure XPath (/descendant-or-self::*[self::*] repeated 1 to 5 times) with the case where the expression uses a style axis (instead of self::*). The results are nearly identical for intensional axes or any of the other added features, if expression size is properly accounted for. We show results for up to 5 repetitions, but observe that the roughly 10 % overhead holds also for much larger expressions. Note that these results are affected by our use of the XPath engine provided by the browser, which does not perform any optimization of XPath expressions.
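The series of plain XPath expressions used in this comparison can be generated mechanically. The following is a minimal sketch that only builds the expression strings by repeating the step quoted above; it does not evaluate them, and the helper names (`expand`, `benchmark_series`) are hypothetical. The style-axis variant is not generated here, because its concrete syntax is not given in this section.

```python
STEP = "/descendant-or-self::*[self::*]"  # step quoted in the text


def expand(repetitions: int, step: str = STEP) -> str:
    """Return the expression obtained by repeating `step` the given number of times."""
    if repetitions < 1:
        raise ValueError("at least one repetition is required")
    return step * repetitions


def benchmark_series(max_repetitions: int = 5) -> list[str]:
    """Expressions for 1..max_repetitions expansions, as in the Fig. 13 series."""
    return [expand(k) for k in range(1, max_repetitions + 1)]


if __name__ == "__main__":
    for k, expr in enumerate(benchmark_series(), start=1):
        print(k, len(expr), expr)
```

Because the expression length grows linearly with the number of repetitions, timing each generated expression against its counterpart with a different predicate keeps expression size comparable across the two series.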

Order-of-Magnitude Improvement. We benchmark OXPath against three commercial web extraction tools, namely, Web Content Extractor [2] (WCE), Lixto [12], and Visual Web Ripper [3] (VWR), as well as the academic web automation and extraction system Chickenfoot [17] and the open source extraction toolkit Web Harvest [4]. Where the first three can express at least the same extraction tasks as OXPath (and, for example, Lixto goes considerably beyond), Chickenfoot and Web Harvest require scripted iteration and manual memory management for many extraction tasks, particularly for realizing multi-way navigation. We do not consider tools such as CoScripter and iMacros as they focus on automation only and offer no iterative constructs as required for extraction tasks. We also disregard tools such as RoadRunner [22] or XWRAP [33], since they work on single pages and lack the ability to traverse to new web pages.

In contrast to OXPath, many of these tools cannot process scripted web sites easily. Thus, we choose an extraction task on Google Scholar as benchmark example, since it does not require scripted actions. On heavily scripted pages, the performance advantage of OXPath is even more pronounced. With each system, we navigate the citation graph to a depth of 3 for papers on “Seattle”.

Specifically, we implement the following OXPath expression in the other systems for our experiments:

An equivalent Web Harvest program takes 54 lines, whereas an equivalent Chickenfoot script takes 27, and the other tools use visual interfaces.

We record evaluation time and memory consumption for each system. We measure the normalized evaluation time, in which we discount the time for page loading, cleaning, and rendering. This allows for a more balanced comparison as the differences in the employed browser or web cleaning engines affect the overall runtime considerably. Figure 14a shows the results for each system up to 150 pages. Though Chickenfoot and Web Harvest do not render pages at all or do not manage page and browser state, OXPath still outperforms them. The systems that manage state similar to OXPath are between two and four times slower than OXPath even on this small number of pages.
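The normalized evaluation time described here is the total runtime minus the time attributed to page loading, cleaning, and rendering. The sketch below illustrates that subtraction under the assumption that the four quantities are measured separately per run; the `RunTiming` record and its field names are hypothetical, and the sample numbers are made up purely to exercise the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTiming:
    """Wall-clock measurements for one run, in seconds (hypothetical fields)."""
    total: float      # end-to-end time of the extraction run
    loading: float    # time spent fetching pages
    cleaning: float   # time spent cleaning or tidying markup
    rendering: float  # time spent rendering pages in the browser


def normalized_time(run: RunTiming) -> float:
    """Evaluation time with page loading, cleaning, and rendering discounted."""
    overhead = run.loading + run.cleaning + run.rendering
    if overhead > run.total:
        raise ValueError("discounted phases exceed the total runtime")
    return run.total - overhead


if __name__ == "__main__":
    # Made-up numbers, only to exercise the function.
    sample = RunTiming(total=120.0, loading=40.0, cleaning=5.0, rendering=60.0)
    print(normalized_time(sample))
```

Discounting these phases removes the cost that depends mostly on the embedded browser or cleaning engine, so that what remains reflects the extraction logic of each tool.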

Figure 14b illustrates the normalized evaluation time up to 850 pages. In this case, we omit WCE and VWR as they were not able to run these tests. Figure 14a, b show a considerable advantage for OXPath, amounting to at least one order of magnitude.

Figure 14c illustrates the memory use of these systems. WCE and VWR are again excluded, but they show a clearly linear memory usage in those tests we were still able to run. Among the systems in Fig. 14c, only Web Harvest comes close to the memory usage of OXPath, which is not surprising as it does not render pages. Yet, even Web Harvest shows a clear linear trend. Chickenfoot exhibits a constant memory consumption just as OXPath, though it takes about ten times more memory in absolute terms. The constant memory is due to Chickenfoot’s lack of support for multi-way navigation, which we compensate for by manually using the browser’s history whenever possible. This forces reloading when a page is no longer cached, but requires only a single active DOM instance at any time. We also tried to simulate multi-way navigation in Chickenfoot, but the resulting program was too slow for the tests shown here.

7 Related work

The automatic extraction and aggregation of web information is not a new challenge. Almost all previous approaches require either (1) service providers to deliver their data in a structured fashion, as in the Semantic Web, or (2) clients to wrap unstructured information sources to extract and aggregate relevant data. The first case levies requirements service providers have little incentive to adopt, rendering client-side wrapping as the only realistic choice.

As recognized in [6], wrapping web sites has become even more involved with the advent of AJAX-enabled web applications, which reveal the relevant data only during user interactions. Previous approaches to web extraction [34,46] do not adequately address web page scripting. Where scripting is addressed [12,15,42,47], the simulation of user actions is neither declarative nor succinct, but rather relies on imperative action scripts and standalone, heavy-weight extraction interfaces. Web automation tools such as [17,39] are increasingly able to deal with scripted web applications, but are tailored to automate a single sequence of user actions. Hence, they are neither convenient nor efficient for large-scale data extraction with its inherent multi-way navigation, necessary to reach all relevant information pieces by following multiple links on the same page.

Thus, in the following, we particularly consider (1) filling and submitting (scripted) web forms, (2) multi-way navigation, and (3) memory management for large-scale extraction. We focus on supervised extraction tools, categorized into web crawlers, web extraction IDEs, extraction languages, and web automation tools. Thus, we exclude unsupervised web extraction tools (see [21] for a survey), as they focus on automated analysis rather than extraction, rendering them rather incomparable to OXPath.

Fig. 14 Comparison. (a) Normalized evaluation time (time without page loading [sec]) versus number of pages, <150 pages: OXPath, Web Content Extractor, Lixto, Visual Web Ripper, Web Harvest, Chickenfoot. (b) Normalized evaluation time (time without page loading [sec]) versus number of pages, <850 pages: OXPath, Lixto, Web Harvest, Chickenfoot. (c) Memory consumption [MB] versus number of pages, <850 pages: OXPath, Lixto, Web Harvest, Chickenfoot.

Web Crawlers. Most commonly employed by search engines for indexing web pages, such crawlers store the relevant information on any found web pages and move on to new pages by traversing all present hyperlinks. Due to the commercial relevance of web crawlers, rather little published research exists, as compared to their prominence in industrial applications. Nevertheless, some work has been published, for example, most famously on the Google crawler [18]. Acting only on static page representations, such crawlers are unable to handle dynamic, scripted content. Thus, these web crawlers, in their current state, are incapable of extracting information from content reachable only via some user interaction. Bergman [14] first recognized that the amount of such content exceeds the quantity of data accessible by hyperlink traversal by far, and it has been assumed to grow in importance ever since [28]. OXPath expressions can specify the crawling of scripted web sites by following web links, submitting forms, etc.

Web Extraction IDEs. Web extraction IDEs, such as [12], have a wider scope than OXPath, as they provide an entire development environment (e.g., extraction cluster on Amazon Cloud EC2, full support of XPath 2.0). But in terms of extraction speed and memory consumption, OXPath outclasses these systems by a wide margin (see Sect. 6). Lixto, Visual Web Ripper [3], and Web Content Extractor [2] are interactive wrapper generator frameworks, recording user actions in browsers to replay these actions for extracting data. As our experimental evaluation (Sect. 6) demonstrates, the memory footprint of these systems grows linearly in the number of accessed pages, in contrast to OXPath’s constant memory requirements.

Deep web extraction tools such as [15] increasingly deal with scripted, highly visual web sites, but infer the extraction scripts automatically from user-provided examples. Though allowing for easy wrapper generation, such an approach lacks the precision necessary for many fully automated tasks. As another example, BODE [47] is a browser-based extraction tool, whose imperative extraction language BODED is not portable and hard to optimize. Without minimizing the memory requirements, BODE replicates complete browser instances for multi-way navigation, imposing a significant performance penalty and rendering BODE unsuitable for large-scale data extraction.

Web Extraction Languages. Most extraction languages follow a declarative approach [9,34,37,44–46], much like OXPath. However, they do not adequately facilitate deep web interaction, such as form submissions, often due to their age. Also, they do not provide native constructs for page navigation, apart from retrieving a page from a given URL. As an exception, the BODED extraction language [47] deals with modern web applications, but is unsuitable for large-scale extraction tasks, as discussed in the previous section.

In our evaluation, we compare with Web Harvest [4], a recent, open source extraction language. Extraction tasks are specified as imperative scripts, formulated in XML. Web Harvest does not deal with interactive web applications and does not give access to the rendered page, but rather to a cleaned XML view of HTML documents.

Another strand of research employed XML technologies, for example, XPath, for information extraction. As a notable example, ANDES [40] is capable of navigating modern web interfaces, but only by generating URLs from naively filled forms and feeding these URLs back to the underlying crawler. In contrast, OXPath embeds extraction and navigation into a single seamless process, handling more complicated web interfaces in a more intuitive manner.

Otherwise similar to ANDES, the approach in [7] is limited to generalizing tree traversal patterns. A third example are L-wrappers [10], albeit limited to scraping data from result pages returned on query submissions. In [20], the authors reported on an XPath-based interface for web forms, but have not released their work so far. As a final example, Web-Prospector [38] processes the Deep Web within the science domain, but appears to be limited to this domain.

XLog [46] extends the ideas from Elog (the datalog-based extraction formalism underlying Lixto [12]) for information extraction by embedding (procedural) extraction predicates. It is optimized for large-scale information extraction tasks, but does not address any kind of web interaction such as form filling and page navigation. Earlier work also explores declarative languages for specifying extraction [9,37,44], but does not sufficiently support interaction or page scripting.

W4F [44] offers wysiwyg support for wrapper specification, whereas extraction rules are specified using HEL (HTML Extraction Language), an SQL-like language for HTML element selection in the spirit of WebSQL [37] and WebOQL [9]. However, none of these languages adequately addresses web interaction.

Web Automation Tools. Web automation tools mainly focus on single navigation sequences to automate a single task, but do not consider large-scale web extraction with its need for low overhead and multi-way navigation. CoScripter [31] and iMacros [1] are examples of such tools, not supporting multi-way navigation due to their limited iterative and conditional programming constructs. Vegemite [32] is a CoScripter extension that introduces some extraction capabilities, such as querying some value for a number of inputs. However, as its authors note, such navigation patterns are expensive, since the same page might be reloaded many times. Furthermore, as the page state is not preserved, some web applications may not behave as expected.

The same applies to Chickenfoot [17], a language for web automation running its scripts in Firefox. Chickenfoot scripts are essentially imperative JavaScript programs that contain loops and iterations, enabling interaction with forms as well as loading and navigating pages. Multi-way navigation is possible, but only by explicit “back” instructions commanding the browser to return to previous pages. Thus, page buffering is unnecessary, but for a high price: Page states are lost, and thus, pages must be rendered anew for each branch leaving a page during a multi-way navigation.
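The cost of back-button navigation relative to buffered multi-way navigation can be illustrated with a simplified counting model. The sketch below is not a measurement of any of the tools discussed. It assumes a complete page tree with `depth` levels and a fixed `branching` factor, no browser cache hits, and exactly one render per branch that leaves a page when only a single live DOM is kept. All function names are hypothetical.

```python
def _check(depth: int, branching: int) -> None:
    """Validate the parameters of the simplified page-tree model."""
    if depth < 1 or branching < 1:
        raise ValueError("depth and branching must be at least 1")


def count_pages(depth: int, branching: int) -> int:
    """Number of pages in a complete page tree with `depth` levels."""
    _check(depth, branching)
    return sum(branching ** level for level in range(depth))


def renders_with_buffers(depth: int, branching: int) -> int:
    """Each page is rendered once and kept until all its branches are done."""
    return count_pages(depth, branching)


def renders_with_back_navigation(depth: int, branching: int) -> int:
    """Single live DOM: a page is rendered once per branch that leaves it,
    and leaf pages are rendered once."""
    _check(depth, branching)

    def visit(level: int) -> int:
        if level == depth:
            return 1                       # leaf page: rendered on entry only
        renders = 0
        for _ in range(branching):
            renders += 1                   # (re-)render this page for the branch
            renders += visit(level + 1)    # follow the branch
        return renders

    return visit(1)


if __name__ == "__main__":
    for d, b in ((3, 2), (4, 3)):
        print(d, b, renders_with_buffers(d, b), renders_with_back_navigation(d, b))
```

Under these assumptions every inner page is rendered `branching` times instead of once, so the overhead of back navigation grows with the fan-out of the pages, while memory stays at a single DOM instance.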

There are some other systems relying on recorded user actions, for example, WebVCR [8] and WebMacros [43], or more recently [39]. All these tools suffer from limitations on modern web pages and consider only single action sequences rather than scalable multi-path data extraction tasks.

More recent work [44] addresses the issue of filling web forms automatically. This work, however, does not offer declarative scripting and makes several simplifying assumptions we do not take; for example, they consider drop down lists as the only dynamic content.

8 Conclusion and future work

To the best of our knowledge, OXPath is the first web extraction system with strict memory guarantees, which reflect strongly in our experimental evaluation. We believe that it can become an important part of the toolset of developers interacting with the web.

We are committed to building a strong set of tools around OXPath. We provide a visual generator for OXPath expressions and a Java API based on JAXP. Some of the issues raised by OXPath that we plan to address in future work are the following: (1) OXPath is amenable to significant optimization and a good target for automated generation of web extraction programs. (2) Further, OXPath is perfectly suited for highly parallel execution: Different bindings for the same variable can be filled into forms in parallel. The effective parallel execution of actions on context sets with many nodes is an open issue. (3) We plan to further investigate language features, such as more expressive visual features and multi-property axes.

Acknowledgments The research leading to these results has received funding from the European Research Council under the European Community’s 7th Framework Programme (FP7/2007–2013)/ERC grant agreement no. 246858 (DIADEM). This work was carried out in the wider context of the networking programme FoX (Foundations of XML), FET-Open grant agreement number FP7-ICT-233599. The views expressed in this article are solely those of the authors.

References

1. http://www.iopus.com/iMacros
2. http://www.newprosoft.com/web-content-extractor.htm
3. http://www.visualwebripper.com
4. http://www.web-harvest.sourceforge.net
5. http://www.w3.org/TR/CSS2/selector.html
6. Alba, A., Bhagwan, V., Grandison, T.: Accessing the deep web: when good ideas go bad. In: OOPSLA (2008)
7. Anton, T.: XPath-wrapper induction by generalizing tree traversal patterns. In: LWA (2005)
8. Anupam, V., Freire, J., Kumar, B., Lieuwen, D.: Automating web navigation with the WebVCR. In: WWW (2000)
9. Arocena, G.O., Mendelzon, A.O.: WebOQL: Restructuring documents, databases, and webs. In: ICDE (1998)
10. Badica, C., Badica, A., Popescu, E., Abraham, A.: L-wrappers: concepts, properties and construction: A declarative approach to data extraction from web sources. Soft Comput. 11(8), 753–772 (2007)
11. Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the Web. In: IJCAI (2007)
12. Baumgartner, R., Flesca, S., Gottlob, G.: Visual web information extraction with Lixto. In: VLDB (2001)
13. Benedikt, M., Koch, C.: XPath leashed. CSUR 41, 3:1–3:54 (2009)
14. Bergman, M.K.: The deep web: Surfacing hidden value. J. Electron. Publ. 7(1), 1–17 (2001)
15. Bigham, J.P., Cavender, A.C., Kaminsky, R.S., Prince, C.M., Robison, T.S.: Transcendence: enabling a personal view of the deep web. In: IUI (2008)
16. Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Softw. Practice Experience 34, 711–726 (2004)
17. Bolin, M., Webber, M., Rha, P., Wilson, T., Miller, R.C.: Automation and customization of rendered web pages. In: UIST (2005)
18. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
19. Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)
20. Centeno, V.L., Kloos, C.D., Fernández, L.S., García, N.F.: Intelligent automated navigation through the deep web. In: Advances in Web Intelligence (2004)
21. Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. TKDE 18(10), 1411–1428 (2006)
22. Crescenzi, V., Mecca, G., Merialdo, P.: Roadrunner: automatic data extraction from data-intensive web sites. In: SIGMOD (2002)
23. Cafarella, M.J., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the Web: an experimental study. Artif. Intell. 165(1), 91–134 (2005)
24. Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G., Schallhart, C., Sellers, A., Wang, C.: DIADEM: Domain-centric, intelligent, automated data extraction methodology. In: WWW (2012)
25. Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. PVLDB 4(11), 1016–1027 (2011)
26. Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing XPath queries. In: TODS (2005)
27. Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a webfountain: an architecture for very large-scale text analytics. IBM Syst. J. 43, 64–77 (2004)
28. He, B., Patel, M., Zhang, Z., Chang, K.C.-C.: Accessing the deep web. Commun. ACM 50(5), 94–101 (2007)
29. Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2(4), 219–229 (1999)
30. Kranzdorf, J., Sellers, A., Grasso, G., Schallhart, C., Furche, T.: Spotting the tracks on the OXPath. In: WWW (2012)
31. Leshed, G., Haber, E.M., Matthews, T., Lau, T.: CoScripter: automating & sharing how-to knowledge in the enterprise. In: CHI (2008)
32. Lin, J., Wong, J., Nichols, J., Cypher, A., Lau, T.A.: End-user programming of mashups with Vegemite. In: IUI (2009)
33. Liu, L., Pu, C., Han, W.: XWRAP: an XML-enabled wrapper construction system for web information sources. In: ICDE (2000)
34. Liu, M., Ling, T.W.: A rule-based query language for HTML. In: DASFAA (2001)
35. Marx, M.: Conditional XPath. ACM Trans. Database Syst. 30(4), 929–959 (2005)
36. Marx, M., de Rijke, M.: Semantic characterizations of navigational XPath. ACM SIGMOD Rec. 34(2), 41–46 (2005)
37. Mendelzon, A.O., Mihaila, G.A., Milo, T.: Querying the world wide web. Int. J. Digit. Libr. 1(1), 54–67 (1997)
38. Mir, S., Staab, S., Rojas, I.: Web-prospector: an automatic, site-wide wrapper induction approach for scientific deep-web databases. In: BTW (2009)
39. Montoto, P., Pan, A., Raposo, J., Bellas, F., López, J.: Automating navigation sequences in AJAX websites. In: ICWE (2009)
40. Myllymaki, J.: Effective web data extraction with standard XML technologies. Comput. Netw. 39(5), 635–644 (2002)
41. Olteanu, D., Meuss, H., Furche, T., Bry, F.: XPath: looking forward. In: EDBT-XML-Based Data Management, LNCS 2490 (2002)
42. Raposo, J., Pan, A., Álvarez, M., Hidalgo, J., Viña, A.: The wargo system: semi-automatic wrapper generation in presence of complex data access modes. In: DEXA (2002)
43. Safonov, A.: Web macros by example: users managing the www of applications. In: CHI, pp. 71–72. ACM (1999)
44. Sahuguet, A., Azavant, F.: Building light-weight wrappers for legacy web data-sources using W4F. In: VLDB, pp. 738–741 (1999)
45. Sawa, N., Morishima, A., Sugimoto, S., Kitagawa, H.: Wraplet: Wrapping your web contents with a lightweight language. In: SITIS, pp. 387–394 (2007)
46. Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using datalog with embedded extraction predicates. In: VLDB (2007)
47. Su, J.-Y., Sun, D.-J., Wu, I.-C., Chen, L.-P.: On design of browser-oriented data extraction system and plug-ins. J. Mar. Sci. Technol. 18(2), 189–200 (2010)
48. Wang, Y., Hornung, T.: Deep web navigation by example. Scalable Comput. Practice Experience 9, 281–292 (2008)
