
SF-Tree: An Efficient and Flexible Structure for Selectivity Estimation

Wai-Shing Ho    Ben Kao    David W. Cheung    YIP Chi Lap [Beta]    Eric Lo

Department of Computer Science and Information Systems, The University of Hong Kong.

{wsho,kao,dcheung,clyip,ecllo}@csis.hku.hk

HKU CSIS Tech Report TR-2003-08

March 8, 2004

Abstract

Estimating the selectivity of a simple path expression (SPE) is essential for selecting the most efficient evaluation plans for XML queries. To estimate selectivity, we need an efficient and flexible structure to store a summary of the path expressions that are present in an XML document collection. In this paper we propose a new structure called SF-Tree to address the selectivity estimation problem. SF-Tree provides a flexible way for the users to choose among accuracy, space requirement and selectivity retrieval speed. It makes use of signature files to store the SPEs in a tree form to increase the selectivity retrieval speed and the accuracy of the retrieved selectivity. Our analysis shows that the probability that a selectivity estimation error occurs decreases exponentially with respect to the error size.

1 Introduction

Extensible Markup Language (XML) [16] is becoming the standard of information exchange over the Internet. The standardized and self-describing properties of XML documents make them ideal for information exchange. XML documents, ranging from product catalogs of e-Commerce Web sites to annotated documentation of applications, are now available on the Internet. An XML document can be regarded as a textual representation of a directed graph called a data graph that consists of information items such as elements and attributes. Figure 1 shows an XML document representing an invoice and its data graph. Each node in the graph corresponds to an information item, and the edges represent the nesting relationships between the information items. Each information item has a tag (or label, name) that describes its semantics. If we ignore the IDREFs, a data graph is a tree. Note that a collection of XML documents can be represented as a single data graph by merging all the root nodes of the original documents into one. Thus the data structures and algorithms designed for an XML document can be applied to an XML document collection.

In addition to texts, the structures of XML documents (i.e., how the information items nest within each other) also carry information. For example, a “name” element nested within a “buyer” element represents a buyer’s name. To extract useful information out of XML documents, we need query languages such as XQuery [18] and XPath [17] that allow users to query the structures of XML documents. Figure 2 shows an example XQuery that queries an XML-based eMarket Web site that stores the information of the products sold in supermarkets or stores. It calculates the price differences of any products between a supermarket and a store. This query first tries to find out all the products that are sold in any supermarkets or stores (as specified in the “FOR” clause). The path expression “document("*")//supermarket/product” returns all the “product” elements that are directly nested


<invoice date="1/7/2002">
  <buyer><name>ABC Corp.</name></buyer>
  <seller><name>D.com</name></seller>
  <products>
    <product>TV</product>
    <product>VCD</product>
    <product>VCR</product>
  </products>
</invoice>

(a) An XML invoice

(b) [The corresponding data graph: a root node with an "invoice" child, which has children "@date", "buyer", "seller", and "products"; "buyer" and "seller" each have a "name" child, and "products" has three "product" children.]

Figure 1: An XML invoice and its data graph

within a “supermarket” element in any XML document. If a product is sold in both a supermarket and a store (as specified in the “WHERE” clause), the price difference of this product between the supermarket and the store is calculated, wrapped into a “diff” element, and returned (as specified in the “RETURN” clause).

FOR $i IN document("*")//supermarket/product
    $j IN document("*")//store/product
WHERE $i/text()=$j/text()
RETURN <diff> $i/price - $j/price </diff>

Figure 2: An example query to an eMarket Web site

There are different ways (or evaluation plans) to evaluate the previous query. In one plan we may first find out a list of all the products that are sold in supermarkets by using a path index, and then verify whether those products are also sold in any stores by navigating the XML documents. After getting a list of common products, we can then calculate and return the price differences. In another plan we may first find out all the products that are sold in the stores and then check whether those products are also sold in any supermarkets. The only difference between the previous two evaluation plans is the order of evaluating the path expressions “//supermarket/product” and “//store/product”. To increase query processing efficiency we should first evaluate the path expression that returns fewer elements. For example, if there are fewer products sold in the stores, we should choose the second evaluation plan. In this case most of the irrelevant elements are filtered out by efficient path indices, resulting in less navigation and hence better efficiency. Thus, in order to choose an efficient evaluation plan, we have to efficiently retrieve the counts (or, selectivities) of elements that are returned by any path


expressions. Our focus in this paper is on estimating the selectivities of simple path expressions (SPEs).

A simple path expression (SPE) queries the structure of an XML document. It is denoted by a sequence of tags “//t_1/t_2/···/t_n” that represents a navigation through the structure of an XML document. The navigation may start anywhere in the document but each transition goes from a parent to a child. An SPE returns all the information items t_n that can be reached by the navigation specified in it. The selectivity σ(p) of an SPE p is the number of information items that the SPE returns.

To get the selectivities of SPEs, we need some structures to store the statistics of an XML document. There are different approaches to storing those statistics [1, 5, 10, 11, 12, 13], but they all suffer from various drawbacks. Path indices like the path tree [1], 1-index [9], and DataGuide [5] are automata that can recognize all absolute path expressions (APEs) of an XML document. An APE “/t_1/t_2/···/t_n” is a special kind of SPE in that the navigation must start from the root node. Note that these three structures are the same if they represent tree-shaped data (i.e., the navigations based on IDREFs are ignored). Figure 3 shows a path tree of our sample invoice. To retrieve the selectivity of an APE, we can navigate the path tree from the root according to the APE and return the selectivity associated with the terminating node. However, in order to retrieve the selectivity of an SPE, we need a full tree walk to determine the possible starts of the navigations before we can navigate the tree to retrieve the selectivities. Note that a unique node is associated with each APE in the XML document. If the document is very complex or has an irregular structure (e.g., consider XML documents that are collected from various data sources), there will be many APEs, and the path tree that represents the document could be very large. Retrieving the selectivities of SPEs from a path tree requires a full tree walk and is thus inefficient.

An efficient way to store the statistics is to use structures such as suffix tries [2] and hash tables [8] to store the association between SPEs and selectivity counts. Figure 4 shows the corresponding suffix trie for our example invoice document. A suffix trie can be regarded as an automaton that recognizes all SPEs in an XML document. We can find the selectivity of an SPE by navigating the trie from the root using the SPE and returning the selectivity associated with the terminating node. Only one path in the suffix trie needs to be checked. Thus the selectivity retrieval time is much shorter than that of path trees. Alternatively, we can enumerate all SPEs in the document and use a hash table to store the SPE strings and their selectivity counts. However, suffix tries and hash tables require a large amount of space, and they will be too large to be stored and accessed efficiently if the XML document is complex and has an irregular structure.

Many researchers observe that exact selectivities are usually not required [1, 2, 10, 12, 13]. In our supermarket query example, if the selectivity of the SPE “//store/product” is 100, then this SPE should be evaluated first as long as the selectivity of the SPE “//supermarket/product” is larger than 100. In this case, the selectivity does not have to be exact. Thus many researchers use approximate structures to estimate the selectivities in order to increase the space and time efficiency. However, none of these approaches provide any accuracy guarantees on the selectivities that their systems return.

In addition to the previous problems, most of the structures previously proposed for the SPE selectivity estimation problem are not flexible. A flexible structure should be easily configurable to optimize the overall performance under the different space, speed, and accuracy requirements of different applications. For example, users should be able to trade space for accuracy or speed if they have excess memory available. Moreover, a flexible structure should adapt to its workload so that its performance can be optimized for answering frequently asked queries. Most of the previously proposed structures do not provide such flexibility.

In this paper we propose a new data structure called SF-Tree that efficiently stores the counts (or, selectivities) of all SPEs in an XML document. In an SF-Tree, SPEs are divided into groups according to their selectivities. We can thus find the selectivity of an SPE by finding to which group the SPE belongs. In order to increase the space efficiency, time efficiency, and accuracy in retrieving a selectivity,


[Figure: the path tree of the invoice document. The root node &0 has a child &1 (invoice, count 1); &1 has children &2 (@date, 1), &3 (buyer, 1), &5 (seller, 1), and &7 (products, 1); &3 and &5 each have a name child (&4 and &6, count 1 each), and &7 has a product child &8 with count 3.]

Figure 3: The path tree for the XML invoice shown in Figure 1

[Figure: the suffix trie over all SPEs of the invoice document. Each trie node is labelled with a tag and stores the selectivity count of the SPE spelled out by the path from the root to that node.]

Figure 4: The suffix trie for the XML invoice shown in Figure 1

we use signature files to store the groups and organize the signature files in the form of a tree. Figure 7 shows an example SF-Tree for our sample invoice. We will show in this paper that using SF-Tree for estimating XML SPE selectivity is:

• efficient. We obtain a large speedup in selectivity estimation time over path trees (up to 1000 times in our experiments), although SF-Tree is slower than the space-inefficient suffix tries. Our analysis shows that the selectivity estimation time is independent of the data size. SF-Tree is thus efficient when compared with other structures, especially when the data is large and complex.

• accurate. Although it does not always return exact selectivities, we have a statistical accuracy guarantee on the selectivities returned by an SF-Tree. The guarantee bounds the probability for an SF-Tree to report an incorrect selectivity. Thus the guarantee makes SF-Tree superior to all previous approaches that use approximate structures to estimate selectivities. We will show in our analysis that an SF-Tree returns the exact selectivity in most cases.

• flexible. By using different parameters, SF-Tree allows an easy tradeoff among space, time, and accuracy requirements. Moreover, an SF-Tree can be workload-adaptable, and hence we can optimize its performance for answering frequently asked queries.

Therefore, SF-Tree provides a more efficient and flexible solution to the SPE selectivity estimation problem than path trees, without sacrificing much accuracy or space.

The rest of this paper is organized as follows. Section 2 discusses some previous estimation techniques. Section 3 describes the basic structure of an SF-Tree and analyzes the properties of SF-Trees using a perfect SF-Tree as an example. Various ways to improve a perfect SF-Tree are discussed in


Section 4. The performance of SF-Trees is evaluated in Section 5. Finally, Section 6 concludes the paper.

2 Related Work

In this section we discuss the previous techniques for estimating SPE selectivity. Most of the techniques proposed so far contain two main steps: a retrieval step followed by a calculation step. The retrieval step retrieves the intermediate statistics that are stored explicitly in the storage structures. Those intermediate statistics are then fed into the calculation step to calculate the estimated selectivity of the input SPE. Different techniques put emphasis on different steps. We can divide the previous techniques into two main approaches: the automata-based approach and the Markov-model-based approach.

2.1 Automata-Based Approach

One approach to solving the selectivity estimation problem is to use the existing automata-based path indices to store the selectivity counts. This approach associates the count of a path expression with an automaton state that recognizes the path expression. The count is directly retrieved and returned without going through the calculation step. DataGuides [5, 6], 1-indices [11], and path trees [1] are all path indices that can be used for selectivity estimation. All these methods build an automaton that can recognize all absolute path expressions (APEs) that appear in an XML document. Each state (node) in the automaton represents an APE that the automaton recognizes. Instead of storing references to the information items that satisfy the APE, the states in the automaton store selectivity counts. Figure 3 shows an example path tree. In the example, the node marked “&3” represents the APE “/invoice/buyer”, which is obtained by concatenating the labels (or tags) of the nodes from the root to node “&3”. In order to retrieve the selectivity of an APE, we start navigating the automaton from the root according to the APE and return the selectivity associated with the terminating state. On the other hand, to determine an SPE’s selectivity, a complete tree walk is required since a navigation may start at any node of the tree. For example, in order to find the selectivity of the SPE “//name”, we must first scan the tree to find the potential starting states, “&3” and “&5”. Then we can navigate according to “//name” to find the ending states and return the sum of all the selectivities found. The selectivity of “//name” is thus the sum of the selectivities of states “&4” and “&6”, i.e., 1 + 1 = 2. If the XML document is complex or irregular in structure (e.g., data collected by search engines or data integrators), the number of path expressions derived, and hence the automaton, is large. Determining an SPE’s selectivity is thus very time consuming since a full tree walk is required.

In order to reduce the time required for retrieving the selectivity of an SPE, we can use an automaton that can recognize not only APEs, but all SPEs in an XML document. In this type of automaton, each SPE in the document is associated with a final state and a selectivity count. To retrieve the selectivity of an SPE, we simply feed the SPE to the automaton and return the selectivity count that is associated with the terminating state of this SPE. 2-indices [11] and suffix tries [15] are example structures that can be used as this type of “SPE-recognizing automaton”. Although very time efficient, such an automaton needs to recognize all SPEs and thus requires a large amount of space.

Another method to improve query efficiency is to summarize the path tree structures into star trees [1]. With this summarization, the space required for storing the automaton is reduced and the time required for scanning the whole automaton is also reduced. Aboulnaga et al. [1] observed that nodes with small selectivity counts are less significant in the estimation. Therefore, nodes with small selectivity counts are summarized into a star node. The label of a star node is a wildcard, meaning that it matches any tag. Although the star-tree structure saves time and space by reducing the number of nodes in the tree, the error of the estimated selectivity is unbounded. There are no guarantees on the accuracy of the returned selectivity.


2.2 Markov Model Based Approach

Another main approach is to use a Markov model to estimate the selectivities of long SPEs. Previous examples include the summarized Markov table [1] and XPathLearner [10]. In these techniques, only the selectivity counts of some short SPEs are stored explicitly in a table. The selectivity of a long SPE is estimated from those of short SPEs using a Markov model. For example, if a full length-two Markov table is stored, the selectivities of all SPEs with lengths up to two, such as “//invoice/buyer” and “//buyer/name”, are stored explicitly in the table. The selectivity of the SPE “//invoice/buyer/name” is then estimated by the following formula (we abbreviate “invoice” to “I”, “buyer” to “B”, and “name” to “N” in the following discussions):

\[
\hat{\sigma}(//I/B/N) = \frac{\sigma(//I/B)\,\sigma(//B/N)}{\sigma(//B)}
\]

In the formula, σ(p) denotes the exact selectivity of the simple path expression p while σ̂(p) is the estimated selectivity of p. The fraction σ(//B/N)/σ(//B) represents the average number of “N” children a “B” element has. Therefore, the RHS of the equation estimates the number of “N” elements in “//I/B/N” using the number of “B” elements in “//I/B” and the average number of “N” children of those “B” elements.
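As a concrete illustration (with made-up counts, not values taken from any dataset in this paper), the following Java sketch applies the length-two Markov estimate to the abbreviated paths above:

    // A minimal sketch of the length-two Markov estimator described above.
    // The counts below are illustrative values only.
    import java.util.HashMap;
    import java.util.Map;

    public class MarkovEstimate {
        public static void main(String[] args) {
            // Selectivities of SPEs of length <= 2, stored explicitly in the Markov table.
            Map<String, Double> table = new HashMap<>();
            table.put("//I/B", 50.0);   // sigma(//invoice/buyer)
            table.put("//B/N", 60.0);   // sigma(//buyer/name)
            table.put("//B", 55.0);     // sigma(//buyer)

            // sigma_hat(//I/B/N) = sigma(//I/B) * sigma(//B/N) / sigma(//B)
            double estimate = table.get("//I/B") * table.get("//B/N") / table.get("//B");
            System.out.printf("estimated selectivity of //I/B/N = %.2f%n", estimate);  // ~54.55
        }
    }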

Different variations of the Markov approach have been proposed. Aboulnaga et al. [1] proposed to further reduce the space requirement by summarizing the table entries that have small selectivity counts. In XPathLearner [10], the table is built from query feedbacks only and there is no need to scan the XML documents for building the table. Since XPathLearner is built from query feedbacks, the selectivity counts of frequently asked queries contribute more to the tables and hence the selectivity counts of those queries will be estimated more accurately. In XSketch and XSketch2 [12, 13], the length of the longest SPE that has its selectivity explicitly stored is not a constant. Basically, if the estimated selectivity of a long SPE is very different from the actual selectivity, the selectivity of the long SPE is stored. The study focused on how to choose a subset of the selectivities to be stored in order to maximize the overall accuracy of the selectivity counts of all the SPEs.

In all the above methods, the selectivities of long path expressions, branching path expressions, or other more complex path expressions can be estimated. However, since the selectivity counts of only a small subset of SPEs (usually the short ones) are stored explicitly, the selectivity counts that are returned by these methods are only estimates. Although these studies showed empirically that their systems are quite accurate in their experiments, errors can occur in these estimations and there are no theoretical guarantees on the accuracy of the selectivity counts returned.

3 SF-Tree

In this section we describe the structure of an SF-Tree and analyze its properties using a perfect SF-Tree as an example. An SF-Tree stores the statistics of XML documents so that we can retrieve the selectivity of SPEs from it. In an SF-Tree, all SPEs in the documents are grouped by their selectivities. To retrieve the selectivity of an SPE, we need to find out to which group the SPE belongs. Basically, an SF-Tree is a tree structure that stores this grouping information efficiently and accurately. By efficiently we mean that SF-Trees take a small amount of space while supporting efficient retrieval. By accurately we mean that SF-Trees provide a statistical accuracy guarantee on the estimated selectivity. Before we discuss the detailed structure of an SF-Tree, let us first review a key technique used in SF-Tree called signature files.


h(//name) = {1, 4}    h(//product) = {1, 8}    h(//invoice) = {3, 4}

D = {//name, //product, //invoice}

F: [a bit vector over positions 0 to 8 with bits 1, 3, 4, and 8 set to 1]

Figure 5: A signature file with m = 2 and the path strings it summarizes. The shaded bit positions are set to 1.

3.1 Signature File

The signature file [4] is a traditional method to improve the efficiency of keyword-based information retrieval. A signature file F is a bit vector that summarizes a set D of keywords so that the existence of a keyword p_query in D can be checked efficiently using F. To create F, every keyword p in D is hashed by a hash function h() into m bit positions in F, and we set those bit positions to “1”. The hash results from all the keywords are superimposed by bitwise OR to generate F. Figure 5 shows a signature file and the three path strings it summarizes. To check the existence of a keyword p_query in D, we first hash p_query by h() into m bit positions and then check the values of those bit positions. If any of those bits is “0”, p_query is definitely not in D; if all those bits are “1”, then p_query is very likely to be in D. This is because those m bits may be set to “1” by a combination of other keywords in D instead of the keyword p_query itself. In this case a false drop occurs, and the probability for this to happen is called the false drop rate. According to the analysis shown in [4], in order to minimize the false drop rate within a given space budget of |F| bits, F, D, and m should be related by |F| = m|D|/ln 2. With this condition, the false drop rate is 1/2^m.

A signature file summarizes a set of keywords efficiently. It is space efficient because for each keyword in D, we need only m/ln 2 bits, which is much less than storing the original keyword string. A signature file is also time efficient because for each query, we only need to check m bit values in F, and m does not depend on the size of the keyword set. With different choices of m, we can control the false drop rate and achieve a flexible tradeoff between space and accuracy. Therefore, we use a signature file to summarize a group of SPEs with the same selectivity in order to make SF-Tree more efficient and flexible.
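The following Java sketch illustrates the insert and membership-test operations of a signature file as described above. The hash function, the class name, and the rounding of |F| are illustrative assumptions of this sketch, not the implementation used in our experiments:

    import java.util.BitSet;

    // A minimal signature-file sketch: each keyword is hashed to m bit positions;
    // membership is tested by checking whether all m positions are set.
    public class SignatureFile {
        private final BitSet bits;
        private final int size;   // |F|, number of bits
        private final int m;      // number of bit positions per keyword

        public SignatureFile(int expectedKeywords, int m) {
            this.m = m;
            // |F| = m * |D| / ln 2, the space-optimal sizing cited from [4], rounded up.
            this.size = Math.max(1, (int) Math.ceil(m * expectedKeywords / Math.log(2)));
            this.bits = new BitSet(size);
        }

        // Derive the i-th bit position for a keyword (illustrative hash, not the paper's h()).
        private int position(String keyword, int i) {
            return Math.floorMod((keyword + "#" + i).hashCode(), size);
        }

        public void insert(String keyword) {
            for (int i = 0; i < m; i++) bits.set(position(keyword, i));
        }

        // "false" means definitely absent; "true" means present or a false drop.
        public boolean mayContain(String keyword) {
            for (int i = 0; i < m; i++)
                if (!bits.get(position(keyword, i))) return false;
            return true;
        }

        public static void main(String[] args) {
            SignatureFile f = new SignatureFile(3, 8);
            f.insert("//name");
            f.insert("//product");
            f.insert("//invoice");
            System.out.println(f.mayContain("//name"));          // true
            System.out.println(f.mayContain("//store/product")); // almost certainly false
        }
    }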

3.2 Structure of SF-Tree

An SF-Tree is a tree structure that is designed to store the association between an SPE and its selectivity¹. Every node (except for the root) stores a signature file; thus this structure is called SF-Tree, which stands for Signature File Tree. An SF-Tree satisfies the following two criteria on its nodes:

• Leaf Nodes. Each leaf node of an SF-Tree stores a signature file F that summarizes a group D of SPEs in the document. Those SPEs have the same selectivity².

• Internal Nodes. Each internal node of an SF-Tree represents all its child nodes. Except for the root node, each internal node computes and stores a signature file that summarizes all SPEs that are contained in any of its children.

Each leaf node represents a group D of SPEs. We treat an SPE as a keyword and a group as a keyword set so that we can use a signature file F in the leaf node to summarize D. Since a signature file is used, we can determine the existence of a query SPE p_query in D efficiently through the signature file. All SPEs in the XML document are partitioned into disjoint groups, each containing SPEs with the same selectivity. Each group is associated with a leaf node in the SF-Tree. We say that a node n contains an SPE p if p is in the group D of SPEs represented by the node. Note that by the definition of internal

¹ SF-Tree can also be extended to other applications. We concentrate our discussion on using SF-Tree for SPE selectivity estimation in this paper.

² We can extend the idea of SF-Tree so that a leaf node contains SPEs that have the same quantized selectivity.


Algorithm getSelectivity
INPUT: the query path p
OUTPUT: the estimated selectivity of p
 1. currNode ← root, C ← root.children, R ← ∅
 2. while (C is not empty)
 3.     C′ ← ∅
 4.     foreach v in C
 5.         if (v contains p and v is not a leaf)
 6.             C′ ← C′ ∪ v.children
 7.         else if (v contains p and v is a leaf)
 8.             R ← R ∪ v
 9.     end foreach
10.     C ← C′
11. end while
12. if (R is empty) return 0
13. if (R contains only 1 node) return the selectivity associated with this node
14. if (R contains more than 1 node) return an interval of selectivity that covers
    the selectivities of all the nodes

Figure 6: The algorithm for getting the selectivity of an SPE in SF-Tree

nodes of an SF-Tree, if an SPE p_query is not contained in an internal node n, it must not be contained in any child node (or descendant node) of n. This property makes the retrieval of SPE selectivity from an SF-Tree very efficient, because a signature file check in an internal node can filter away many irrelevant leaf nodes.

3.3 Retrieving Selectivity from SF-Tree

In an SF-Tree, SPEs in a document are grouped according to their selectivities. Each leaf node of an SF-Tree corresponds to a group of SPEs with the same selectivity count. To retrieve the selectivity σ(p) of an SPE p from an SF-Tree, we navigate from the root through all the internal nodes containing p to the leaf node that contains p. Since every SPE has one and only one selectivity, p is contained in only one SF-Tree leaf node and all of its ancestors. Figure 6 shows the detailed selectivity retrieval algorithm. Since we use only a signature file to summarize the SPEs in a group, a node n may falsely report that it contains p. Therefore, there is a small probability that we reach more than one leaf node during the navigation. In this case we choose to report a range of selectivities that covers all the leaf nodes that contain p (as stated in line 14 of the algorithm), because we cannot distinguish a false drop leaf node from the correct leaf node. False drop leaves refer to the leaf nodes that do not contain p but falsely report that they contain p.
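For illustration, the following Java sketch renders the algorithm of Figure 6 using the SignatureFile sketch from Section 3.1; the node layout and the simplification of the returned interval to a [min, max] pair are assumptions of the sketch, not part of our implementation:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the breadth-first retrieval of Figure 6 over an SF-Tree whose non-root
    // nodes each hold a SignatureFile summarizing the SPEs they contain.
    public class SFTreeLookup {
        static class Node {
            SignatureFile sig;          // null for the root, which stores no signature file
            List<Node> children = new ArrayList<>();
            int selectivity = -1;       // meaningful for leaf nodes only
            boolean isLeaf() { return children.isEmpty(); }
            boolean contains(String spe) { return sig == null || sig.mayContain(spe); }
        }

        // Returns {low, high}: a single value when one leaf matches, an interval otherwise.
        static int[] getSelectivity(Node root, String p) {
            List<Node> frontier = new ArrayList<>(root.children);
            List<Node> results = new ArrayList<>();
            while (!frontier.isEmpty()) {
                List<Node> next = new ArrayList<>();
                for (Node v : frontier) {
                    if (!v.contains(p)) continue;        // definitely not in this subtree
                    if (v.isLeaf()) results.add(v);      // candidate selectivity group
                    else next.addAll(v.children);        // descend into the subtree
                }
                frontier = next;
            }
            if (results.isEmpty()) return new int[]{0, 0};
            int low = Integer.MAX_VALUE, high = Integer.MIN_VALUE;
            for (Node r : results) {
                low = Math.min(low, r.selectivity);
                high = Math.max(high, r.selectivity);
            }
            return new int[]{low, high};                 // equal when exactly one leaf matched
        }
    }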

3.4 Analysis on SF-Tree

There are many different types of SF-Trees because we impose no constraints on how the internal nodes are nested. A tree that contains one internal node (the root), with all leaf nodes as children of the root, is an SF-Tree. Leaf nodes and the complete binary tree built over them also form an SF-Tree. Different SF-Trees have different space requirements, selectivity retrieval speeds, and accuracies. In our analysis, we use a perfect SF-Tree as an example to demonstrate the key properties of an SF-Tree. Though the properties are derived from a perfect SF-Tree, most of them are also applicable to other


root
├── g1 ∪ g2
│   ├── g1
│   └── g2
└── g3 ∪ g4
    ├── g3
    └── g4

Figure 7: A perfect SF-Tree for the invoice in Figure 1.

Group  SPE                           Selectivity
g1     //@date                       1
       //invoice                     1
       //buyer                       1
       //seller                      1
       //products                    1
       //buyer/name                  1
       //seller/name                 1
       //invoice/@date               1
       //invoice/buyer               1
       //invoice/seller              1
       //invoice/products            1
       //invoice/buyer/name          1
       //invoice/seller/name         1
g2     //name                        2
g3     //product                     3
       //products/product            3
       //invoice/products/product    3

Table 1: The selectivity of all the SPEs that appear in the invoice in Figure 1

forms of SF-Trees. Our analysis shows that an SF-Tree is space and time efficient, and that the probability for an error to occur is bounded and decreases exponentially with respect to the SF-Tree's depth and the error size. Before presenting the analysis, we first introduce the structure of a perfect SF-Tree, the basis of our analysis.

3.4.1 Perfect SF-Tree

In an SF-Tree, each leaf node represents a group of SPEs with the same selectivity count. To build a tree over the leaf nodes, a straightforward approach is to build a perfect binary tree according to their selectivity counts. A perfect SF-Tree is a tree such that

• every leaf node has the same depth,

• every internal node has exactly two children,

• every leaf is associated with a selectivity count, and

• the selectivity counts start from 1 and they are assigned to the leaves in order.

Figure 7 shows an example of a very small perfect SF-Tree built from our invoice document. In the figure, g_k represents the group of SPEs that have a selectivity count of k. The selectivities of the SPEs are shown in Table 1. The SPEs in g_k are summarized in a signature file that is stored in the corresponding leaf node. Every leaf node has a depth of 2, and the selectivities of the leaf nodes run from 1 to 4. Note that we do not have a group of SPEs that have zero selectivity. If an SPE is not contained in any group, it has zero selectivity.


3.4.2 Accuracy

Since we use signature files to summarize the SPEs that are contained in a leaf node, there is a small probability for a leaf node n_f to falsely report that it contains an SPE p. This leads to an error in the estimated selectivity because our “getSelectivity” algorithm will report a range of selectivities that covers not only the correct selectivity but also the selectivity associated with the false drop leaf n_f. However, we will show in the analysis that the probability for this error to occur is bounded and decreases exponentially with respect to an SF-Tree's depth and the error size.

Lemma 1 Let d be the depth of a perfect SF-Tree T, and m be the number of bit positions in a signature file to which an SPE is mapped. The probability ε^-(d) for an error to occur in the estimated selectivity of a negative query p^- in a perfect SF-Tree is smaller than 1/2^{d(m-1)}.

Proof: The false drop rate of a signature file is 1/2^m. Consider a negative SPE query p^-, which has zero selectivity. p^- should not be contained in any node. According to the algorithm “getSelectivity”, for an error to occur there must be at least one root-to-leaf path in the SF-Tree such that p^- is contained in all the nodes along the path. If such a path does not exist, “getSelectivity” cannot reach any leaf node and returns the correct selectivity (zero). The probability that such a path exists can be expressed using the following recurrence equation:

\[
\varepsilon^-(d) =
\begin{cases}
1 & \text{if } d = 0,\\
2\left[\dfrac{1}{2^m}\left(1-\dfrac{1}{2^m}\right)\varepsilon^-(d-1)\right]
+ \left(\dfrac{1}{2^m}\right)^2\left[1-\left(1-\varepsilon^-(d-1)\right)^2\right] & \text{if } d > 0.
\end{cases}
\]

An error occurs when we can reach one of the leaf nodes through internal nodes that have false drops (i.e., nodes that falsely report that they contain p^-). The first term of the recurrence calculates the probability of having false drops in only one of the two subtrees of the current root node. The second term calculates the probability of having false drops in both subtrees of the current root node. We can simplify the formula:

\[
\varepsilon^-(d) = \frac{1}{2^m}\left[2 - \frac{1}{2^m}\varepsilon^-(d-1)\right]\varepsilon^-(d-1)
< \frac{2}{2^m}\,\varepsilon^-(d-1)
< \frac{1}{2^{d(m-1)}}\,\varepsilon^-(0) = \frac{1}{2^{d(m-1)}}.
\]

As shown in Lemma 1, the probability for an error to occur decreases exponentially with respect to d and m. If we choose m = 8, the error rate is less than 1/2^{16(8-1)} = 1/2^{112} ≈ 0 in a 16-level perfect SF-Tree.
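The short program below (an illustrative sanity check, not part of the analysis) evaluates the recurrence numerically and compares it with the closed-form bound of Lemma 1:

    // Evaluates the Lemma 1 recurrence for the negative-query error rate and
    // compares it with the closed-form bound 1 / 2^(d(m-1)).
    public class Lemma1Check {
        static double negError(int d, int m) {
            if (d == 0) return 1.0;
            double q = 1.0 / Math.pow(2, m);        // false drop rate of one signature file
            double prev = negError(d - 1, m);
            return 2 * q * (1 - q) * prev + q * q * (1 - (1 - prev) * (1 - prev));
        }

        public static void main(String[] args) {
            int m = 8;
            for (int d = 1; d <= 16; d++) {
                double exact = negError(d, m);
                double bound = 1.0 / Math.pow(2, d * (m - 1));
                System.out.printf("d=%2d  recurrence=%.3e  bound=%.3e%n", d, exact, bound);
            }
        }
    }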

Lemma 2 Let p^+ be an SPE query that appears in the document, n_p be the leaf node that contains p^+, n_f be a false drop leaf node (i.e., a node that falsely reports that it contains p^+), and a be the difference in levels between n_f and the least common ancestor n_a of n_p and n_f. The probability ε^+(d, a) for a positive SPE p^+ to have an error caused by n_f is smaller than 1/2^{ma-a+1}.

Proof: Since p^+ appears in the document, it is contained in one of the leaf nodes (n_p) and its ancestors. For an error to occur, there must exist a child node of n_a that does not actually contain p^+, and there must exist a path from that child node to n_f such that every node on it has a false drop. Before the navigation reaches n_a, we should not encounter any false drop child nodes that lead to another false drop leaf node; otherwise, the error is not caused by n_f. Thus, ε^+(d, a) can be expressed by the following recurrence equation:

\[
\varepsilon^+(d, a) =
\begin{cases}
\dfrac{1}{2^m}\,\varepsilon^-(d-1) & \text{if } d = a,\\
\left[\left(1-\dfrac{1}{2^m}\right) + \dfrac{1}{2^m}\left(1-\varepsilon^-(d-1)\right)\right]\varepsilon^+(d-1, a) & \text{if } d > a.
\end{cases}
\]

From the recurrence equation, we have:

\[
\varepsilon^+(d, a) < 1 \times \varepsilon^+(d-1, a)
< 1^{d-a} \times \varepsilon^+(a, a) = \frac{1}{2^m}\,\varepsilon^-(a-1)
< \frac{1}{2^m}\left(\frac{1}{2^{(a-1)(m-1)}}\right) = \frac{1}{2^{ma-a+1}}.
\]

Note that for a given a, the number of leaf nodes in the subtree rooted at n_a is 2^a. The magnitude of error, which is the difference between the selectivities of n_f and n_p, is at most 2^a − 1 because the selectivities of all leaf nodes are contiguous. Thus, by Lemma 2, the probability for an error to occur for a positive SPE p^+ decreases exponentially with respect to the magnitude of the error. This is a very nice property because it states that errors with large magnitudes are extremely unlikely to occur.

The two lemmas show that there is a theoretical bound on the probability for an error to occur in the selectivity returned from a perfect SF-Tree. We showed that the error probability decreases exponentially with respect to the tree's depth d, the error size 2^a − 1, and the number m of bit positions an SPE is mapped to. Thus the accuracy of the selectivity estimated by an SF-Tree is ensured statistically. Though we used a perfect SF-Tree in the accuracy analysis, we can apply similar arguments to other types of SF-Trees.

3.4.3 Selectivity Retrieval Time

To retrieve the selectivity of a positive SPE p^+ from an SF-Tree T, we navigate from the root node of T to a leaf node through the nodes that contain p^+. We need to check m bits in a signature file in order to test whether p is contained in a node n. The selectivity retrieval time thus depends on how many signature files we have to check before we can determine which leaf node contains p.

Lemma 3 Let d be the depth of a perfect SF-Tree T. The expected number t of signature files to be checked to retrieve the selectivity of a positive SPE query p^+ from T is approximately 2d + (2d+2)(d+2)/2^m.

Proof: During the navigation, we need to check the signature files in the two children of each internal node to decide which child link to follow. Therefore, if we ignore false drops, the number of signature files to be checked per positive SPE query is 2 × d. A false drop may happen in any node that does not contain p^+. For each false drop in an internal node, we have to check two more signature files (the two children of that internal node). If a false drop occurs in a leaf node, we do not have to check extra signature files. Let g be the number of signature files that we check during the navigation. If we encounter exactly i false drops, g must be at most 2d + 2i because we check at most two extra signature files per false drop. Since d of these signature files actually contain p^+, false drops must occur among the remaining d + 2i signature files. Thus the probability Pr_i of encountering exactly i false drops in the navigation equals \binom{g}{i}(1/2^m)^i, which is less than \binom{d+2i}{i}(1/2^m)^i. The expected number t of signature files to be checked is

\[
t = \sum_{i \geq 0} (2d + 2i)\,\mathrm{Pr}_i
< \sum_{i \geq 0} (2d + 2i)\binom{d+2i}{i}\left(\frac{1}{2^m}\right)^i
\approx 2d + \frac{(2d+2)(d+2)}{2^m}.
\]

By Lemma 3, the expected number of bit positions to be checked to answer a positive SPE query is approximately (2d + (2d+2)(d+2)/2^m) × m. Since both d and m do not depend on the size of the document, it is time efficient to retrieve a selectivity from an SF-Tree, especially when the document is huge. For a negative SPE query p^- that is not contained in the document, we usually only need to check the children of the root node before knowing that p^- is not contained in any node in the SF-Tree. Thus an SF-Tree is even more efficient for retrieving the selectivities of negative SPE queries.
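For concreteness, the following snippet (illustrative only, with arbitrary example values of d and m) evaluates the Lemma 3 approximation for the expected number of signature files and bit positions checked per positive query:

    // Evaluates t ~= 2d + (2d+2)(d+2)/2^m (Lemma 3) and the corresponding
    // number of bit positions t*m checked per positive SPE query.
    public class Lemma3Check {
        public static void main(String[] args) {
            int[] depths = {4, 8, 16};
            int[] ms = {6, 8, 10};
            for (int d : depths) {
                for (int m : ms) {
                    double t = 2.0 * d + (2.0 * d + 2) * (d + 2) / Math.pow(2, m);
                    System.out.printf("d=%2d m=%2d  signature files ~= %.2f  bits ~= %.1f%n",
                            d, m, t, t * m);
                }
            }
        }
    }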

3.4.4 Storage Requirement

In each node of an SF-Tree, two types of data are stored: tree pointers and a signature file. Since the space required by tree pointers in a tree is well understood, it is more interesting to analyze the space required to store all the signature files.

Lemma 4 Let P be the set of all SPEs in the XML document, N be the set of all leaf nodes of an SF-Tree T, n_p ∈ N be the leaf node that contains an SPE p ∈ P, |n_p| be the number of SPEs contained in n_p, d_{n_p} be the depth of n_p in T, and m be the number of bit positions that an SPE is mapped to. The total space S required by all the signature files in an SF-Tree T is (m/ln 2) Σ_{n_p∈N} |n_p| d_{n_p}.

Proof: Every SPE p in the document has exactly one selectivity and is contained in exactly one leaf node n_p of the SF-Tree T. By the definition of SF-Tree, p is also contained in all the ancestors of n_p (except the root node). Therefore, p is contained in d_{n_p} nodes. As shown in Section 3.1, the signature file size |F| is m|D|/ln 2 in the optimal case, i.e., we need m/ln 2 bits of storage per SPE per signature file. Thus, by grouping the SPEs into leaf nodes by their selectivity, we have

\[
S = \sum_{p \in P} \frac{m}{\ln 2}\, d_{n_p}
  = \frac{m}{\ln 2} \sum_{p \in P} d_{n_p}
  = \frac{m}{\ln 2} \sum_{n_p \in N} |n_p|\, d_{n_p}.
\]

By the above lemma, S = (m/ln 2) Σ_{n_p∈N} |n_p| d_{n_p}. The space requirement of an SF-Tree is linearly proportional to m, the number of bit positions that an SPE is mapped to, and to the average weighted depth of the SF-Tree. Thus, if we want to improve the space efficiency of an SF-Tree, we can change the value of m or the average weighted depth of the SF-Tree.
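As a small worked example of Lemma 4 (a sketch using the grouping of Table 1 placed at depth 2 in the perfect SF-Tree of Figure 7; g4 is empty), the total signature-file space can be computed as follows:

    // Worked example of Lemma 4: S = (m / ln 2) * sum(|n_p| * d_{n_p}),
    // using the groups of Table 1 at depth 2 in the perfect SF-Tree of Figure 7.
    public class Lemma4Example {
        public static void main(String[] args) {
            int m = 8;
            int[] groupSizes = {13, 1, 3, 0};   // |g1|..|g4| from Table 1 (g4 is empty)
            int[] leafDepths = {2, 2, 2, 2};    // every leaf of the perfect SF-Tree has depth 2

            double weightedDepth = 0;
            for (int i = 0; i < groupSizes.length; i++)
                weightedDepth += groupSizes[i] * leafDepths[i];

            double bits = m / Math.log(2) * weightedDepth;
            System.out.printf("total signature-file space ~= %.0f bits (~%.0f bytes)%n",
                    bits, Math.ceil(bits / 8));
        }
    }

With m = 8 this gives roughly 392 bits, i.e., about 50 bytes of signature-file storage for the toy invoice document.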

4 Variations of SF-Tree

In the previous section we focused on describing the structure and analyzing the properties of a perfect SF-Tree. We focused on perfect SF-Trees because their structures are simple and regular, so that it is easy to analyze their properties. However, the structure of a perfect SF-Tree may be too regular in


root
├── g1
└── g2 ∪ g3 ∪ g4
    ├── g3
    └── g2 ∪ g4
        ├── g2
        └── g4

Figure 8: A Huffman SF-Tree

some cases so that it cannot fit the requirements of some users. For example, some users may want to improve the overall selectivity estimation speed by optimizing the SF-Tree for frequently asked SPE queries. However, this cannot be done because the structure of a perfect SF-Tree depends only on the selectivities of SPEs in the XML document but not on the queries. Some users may tolerate some errors in the estimated selectivity and want to trade accuracy for space, but a perfect SF-Tree also cannot provide such flexibility in the accuracy-space tradeoff. Thus in this section we discuss some “improved” SF-Trees to fit different user requirements.

4.1 Huffman SF-Tree

Our analysis in Section 3 shows that both the space requirement and the estimation time of an SF-Tree are proportional to its depth. Hence, minimizing the depth of an SF-Tree can improve its efficiency.

Huffman trees [7] have the minimum average weighted depth among all binary code trees. The building of a Huffman tree starts with a forest of trees. Each tree initially contains only one node, with which a weight is associated. The two trees with the smallest weights are merged repeatedly until there is only one tree left.

Since Huffman trees have the minimum average weighted depth, we can use the Huffman algorithm to build a binary SF-Tree that has a minimized space requirement or selectivity estimation time. This can be achieved by assigning different weights to the leaf nodes. To optimize the space requirement of a binary SF-Tree, we should use the number of SPEs contained in a leaf node as the weight of that node. As shown in Lemma 4, the SF-Tree that has this weight assignment has the minimum space requirement. Figure 8 shows an example Huffman SF-Tree with this weight assignment. On the other hand, to optimize the selectivity estimation speed of an SF-Tree, the weight associated with a leaf node should be the probability that an SPE query falls into this node. In this case, the average weighted depth is directly proportional to the expected number of signature files to be checked in estimating a selectivity.
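The following Java sketch builds the shape of a Huffman SF-Tree from weighted leaf groups with a priority queue; the class names, the group weights (taken as the group sizes of Table 1, with g4 given a nominal weight of 1), and the tie-breaking order are illustrative assumptions rather than our actual construction code:

    import java.util.PriorityQueue;

    // Builds the shape of a Huffman SF-Tree: leaf groups are merged two at a time,
    // always combining the two subtrees with the smallest total weight.
    public class HuffmanShape {
        static class Node {
            final String label;
            final long weight;
            final Node left, right;
            Node(String label, long weight) { this(label, weight, null, null); }
            Node(String label, long weight, Node l, Node r) {
                this.label = label; this.weight = weight; this.left = l; this.right = r;
            }
        }

        static Node build(Node[] leaves) {
            PriorityQueue<Node> pq =
                    new PriorityQueue<>((a, b) -> Long.compare(a.weight, b.weight));
            for (Node leaf : leaves) pq.add(leaf);
            while (pq.size() > 1) {
                Node a = pq.poll(), b = pq.poll();
                pq.add(new Node("(" + a.label + " U " + b.label + ")",
                        a.weight + b.weight, a, b));
            }
            return pq.poll();
        }

        public static void main(String[] args) {
            // Weights = number of SPEs per selectivity group, roughly as in Table 1 / Figure 8.
            Node root = build(new Node[]{
                    new Node("g1", 13), new Node("g2", 1), new Node("g3", 3), new Node("g4", 1)});
            System.out.println(root.label);
            // prints the nested grouping, e.g. (((g2 U g4) U g3) U g1); tie order may vary
        }
    }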

4.2 Shannon-Fano SF-Tree

Although a Huffman SF-Tree can give optimal estimation speed or space requirement, it has lower accuracy than a perfect SF-Tree. By Lemma 1 and Lemma 2, the probability for an SF-Tree to return a wrong selectivity increases as the tree depth decreases. Therefore, since a Huffman SF-Tree has the minimum depth, the error rate of a Huffman SF-Tree is the largest among all binary SF-Trees. More importantly, by Lemma 2, the error rate for positive SPEs (i.e., the SPEs with non-zero selectivity) depends on a, where a is the distance in levels between the false drop leaf and its least common ancestor with the correct leaf. In a Huffman SF-Tree, subtrees are combined according to their weights. Hence,


root
├── g1
└── g2 ∪ g3 ∪ g4
    ├── g2
    └── g3 ∪ g4
        ├── g3
        └── g4

Figure 9: A Shannon-Fano SF-Tree

the sibling leaf nodes may contain groups of SPEs with very different selectivities. Therefore, a does not increase with the error size, and the error rate may increase to 1/2^m for every positive SPE in the document.

To keep the error rate for positive SPEs at a low level, we use a top-down algorithm, which is similar to a Shannon-Fano tree, to construct an SF-Tree. To construct a Shannon-Fano SF-Tree, we start with a root node that represents all the SPEs. Every node is recursively divided into two groups of SPEs with roughly the same weight. In this construction, the order of the leaf nodes is unchanged and hence the error rate for positive SPEs decreases as the error size increases. Figure 9 shows a Shannon-Fano SF-Tree. In this example, g2 and g4 are not siblings, so the probability of falsely placing an SPE with actual selectivity 2 in g4 decreases. Since Shannon-Fano trees have a depth only slightly greater than Huffman trees, a Shannon-Fano SF-Tree is a good alternative to a Huffman SF-Tree because it has better accuracy but similar size and estimation speed. Thus our experiments are based on Shannon-Fano SF-Trees.
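The following sketch illustrates the top-down construction described above: the selectivity-ordered groups are recursively split into two contiguous halves of roughly equal weight, so sibling leaves keep close selectivities. The class name and example weights are illustrative assumptions:

    // Top-down Shannon-Fano style construction: split the selectivity-ordered groups
    // into two contiguous halves with roughly equal total weight, then recurse.
    public class ShannonFanoShape {
        // names[i] / weights[i]: a selectivity group and the number of SPEs it holds.
        static String build(String[] names, long[] weights, int lo, int hi) {
            if (lo == hi) return names[lo];
            long total = 0, prefix = 0;
            for (int i = lo; i <= hi; i++) total += weights[i];
            int split = lo;
            long bestDiff = Long.MAX_VALUE;
            // Find the contiguous split point that keeps the two halves' weights closest.
            for (int i = lo; i < hi; i++) {
                prefix += weights[i];
                long diff = Math.abs((total - prefix) - prefix);
                if (diff < bestDiff) { bestDiff = diff; split = i; }
            }
            return "(" + build(names, weights, lo, split) + " U "
                       + build(names, weights, split + 1, hi) + ")";
        }

        public static void main(String[] args) {
            String[] names = {"g1", "g2", "g3", "g4"};
            long[] weights = {13, 1, 3, 1};          // illustrative group sizes
            System.out.println(build(names, weights, 0, names.length - 1));
            // prints (g1 U (g2 U (g3 U g4))), matching the shape of Figure 9
        }
    }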

4.3 Cropped SF-Tree

Different users may have different requirements on space, speed, or accuracy. For example, some users may have a very limited memory budget and may want to sacrifice some selectivity retrieval speed or estimation accuracy to reduce the space requirement. SF-Tree provides a very flexible structure for making the tradeoff between space, time, and accuracy. By removing some layers (usually top or bottom) of an SF-Tree, the space required by the SF-Tree is reduced, but the estimation accuracy is also reduced. The selectivity retrieval speed may also be reduced. We denote this type of SF-Trees as cropped SF-Trees.

Since some layers of an SF-Tree are removed, the depth of the tree is reduced. By Lemmas 1, 2, and 4, the space required by the SF-Tree is reduced and the accuracy of the estimated selectivity may also be reduced. However, note that even if the height of an SF-Tree is reduced from 16 to 3, the error rate is still very low (1/2^21) if m is 8. Therefore, in most cases, cropping an SF-Tree will not affect the accuracy much but can greatly reduce the space required for the SF-Tree.

We propose two types of cropped SF-Trees, namely, head-cropped SF-Trees and tail-cropped SF-Trees. In a head-cropped SF-Tree, the layers of nodes that are around the root are removed. All children of the cropped nodes become children of the root. To retrieve a selectivity, we have to scan all the children of the root node. Therefore, a head-cropped SF-Tree is slower than a normal SF-Tree.

In a tail-cropped SF-Tree, the layers of nodes that are around the leaves are removed. Each new leaf node represents the range of selectivities of its deleted descendant leaves. Figure 10 shows both a head-cropped SF-Tree and a tail-cropped SF-Tree. The accuracy of a tail-cropped SF-Tree is lowered because we can only find a range of selectivities to which an SPE belongs. Errors may now be caused by false drops as well as quantization. However, the selectivity estimation speed is increased because the SF-Tree contains fewer levels and we still need to check only two signature files per level.


(a) A head-cropped SF-Tree    (b) A tail-cropped SF-Tree

Figure 10: Cropped SF-Trees

We can reduce the space required by an SF-Tree by cropping its head and/or tail layers. The number of head and/or tail levels to be removed depends on how much selectivity retrieval speed and accuracy we can sacrifice. This facilitates the tradeoff between space, time, and accuracy.

5 Experiments

In this section we present the experimental evaluation of the performance of SF-Tree by using Shannon-Fano SF-Tree as an example. We evaluated the accuracy and the selectivity retrieval time of SF-Tree under different parameters that affect the space requirement. The parameters include m (the number of bit positions that an SPE is mapped to) and h (the number of top levels to be cropped). We also compared SF-Tree with some previous approaches including path tree, global-* tree [1], and suffix trie [15].

5.1 Experiment Setup

5.1.1 Datasets

We used two different datasets, XMark and MathML, in our experiments. The XMark dataset [14] is generated by the XML generator in the XML Benchmark project. It contains the data of an auction Web site. We generated a 10MB file using the 0.1 scale factor for our experiments. The MathML dataset is generated by the IBM XML Generator [3] using the DTD of MathML. It represents data with a very complex and irregular structure. We generated 100 MathML documents using the parameters r = 5 and l = 7 as inputs to the IBM XML Generator. r is the maximum number of times a child element denoted by the * or + option will be repeated at a node and l is the maximum number of levels in the resulting XML tree. Table 2 shows a summary of the properties of the two datasets we used.

5.1.2 Query Workload

We generated two sets of queries for each dataset. The first set of queries contains only positive SPEs (SPEs with non-zero selectivity) and the other set contains only negative SPEs (SPEs with zero selectivity). The positive SPEs are generated by randomly picking an SPE from all SPEs in the dataset. Each SPE in the dataset has an equal probability of being picked. The negative SPEs are generated by


                            XMark     MathML
Number of Elements          206131    42688
Number of Diff. Tags        83        197
Size of Document            11.7MB    1.07MB
Number of Absolute Paths    537       39264
Number of Simple Paths      1912      228874

Table 2: A summary of the properties of the datasets

                                          XMark    MathML
Positive Queries   Number of Queries      2000     2000
                   Average Query Length   5.94     6.793
                   Average Selectivity    351      1.04
Negative Queries   Number of Queries      2000     2000
                   Average Query Length   3.93     4.046
                   Average Selectivity    0        0

Table 3: A summary of the properties of the query workloads

randomly concatenating possible tags in the dataset and removing those SPEs with non-zero selectivity. Table 3 shows the properties of the query workloads on both datasets.

5.1.3 Evaluation Metrics

We evaluated the performance of SF-Tree and some existing approaches in three aspects, namely, accuracy, selectivity retrieval time, and space requirement. The accuracy is presented in terms of both average absolute error and average relative error. Let σ̂(p) be the estimated selectivity and σ(p) be the actual selectivity of an SPE p. The average absolute error is the average of |σ̂(p) − σ(p)| over all p. The average relative error is the average of |σ̂(p) − σ(p)|/max(σ̂(p), b), where b is a sanity bound that prevents paths with low counts from over-contributing to the value of the average relative error. We set b to the 10th percentile of the selectivity counts in our experiments, meaning that 10% of paths have counts less than b. Note that relative error is not well defined for negative SPEs, so it is not considered for negative SPEs. The selectivity retrieval time is the time required to find out the selectivity counts of all the 2,000 input SPEs. The space requirement measures the space required to store all the necessary structures of the algorithms, including the signature files and the tree pointers.
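For clarity, the following sketch (with hypothetical selectivities, not data from our experiments) computes the two error metrics exactly as defined above, including the sanity bound b:

    // Computes the average absolute error and the average relative error
    // |est - act| / max(est, b) with a sanity bound b, as defined in Section 5.1.3.
    public class ErrorMetrics {
        public static void main(String[] args) {
            double[] actual    = {100, 3, 1, 50, 7};    // hypothetical true selectivities
            double[] estimated = {100, 4, 1, 48, 7};    // hypothetical estimates
            double b = 2;                               // sanity bound (10th percentile in the paper)

            double absSum = 0, relSum = 0;
            for (int i = 0; i < actual.length; i++) {
                double absErr = Math.abs(estimated[i] - actual[i]);
                absSum += absErr;
                relSum += absErr / Math.max(estimated[i], b);
            }
            System.out.printf("average absolute error = %.3f%n", absSum / actual.length);
            System.out.printf("average relative error = %.3f%%%n", 100 * relSum / actual.length);
        }
    }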

5.1.4 Running Platform

All the experiments are run on a Pentium III 700MHz running Solaris 5.8 with 4GB memory. All the algorithms are implemented using Java 1.3.1.

5.2 Performance of SF-Tree

In this experiment we studied the performance of different SF-Trees under different scenarios. We performed the experiments on the XMark and MathML datasets. The SF-Trees showed similar results on both datasets. Thus, due to space limitations, we focused our presentation on the performance of an SF-Tree for the XMark dataset.


5.2.1 Accuracy vs. Size

In this experiment we studied the accuracy of an SF-Tree under different values of m (the number of bit positions that an SPE is mapped to) and h (the number of top levels of an SF-Tree to be cropped). Figures 11 to 13 show the accuracy of an SF-Tree versus its size under different scenarios for the XMark dataset. In the graphs, the points on different lines represent different head-cropped SF-Trees. The points on the same line represent SF-Trees with the same shape but different values of m. h (the number of cropped top levels) varied from 0 to 4 and m (the number of bit positions that an SPE is mapped to) varied from 4 to 10. The left-most point on each line represents the performance of an SF-Tree with m = 4 and the other points on the line represent m = 6, 8, and 10 respectively. Note that the size of an SF-Tree is linearly proportional to the value of m. This is because, as shown in Lemma 4, the space required by an SF-Tree is (m/ln 2) Σ_{n_p∈N} |n_p| d_{n_p}, which is linearly proportional to m. Moreover, by considering the first point on each line, we see that the size of an SF-Tree increases with d (the tree's depth). This is because the space required by an SF-Tree is also proportional to d, as shown in Lemma 4.

From the graphs, we found that the average error is nearly zero if h < 4 or m > 8. The error goes up when m or d is lower than a certain value. This agrees with our analysis that the error rate decreases exponentially with respect to d and m. From Figures 11 and 12, we can see that the error size is much lower for negative SPEs than for positive SPEs. Considering SF-Trees with depth d = 3.84, the one with m = 8 has negligible errors in the retrieved selectivities for negative SPEs but still has noticeable errors for positive SPEs. This also agrees with our analysis that, by Lemmas 1 and 2, the error rate is higher for positive SPEs than for negative SPEs.

Figure 14 shows the absolute errors for estimating the selectivities of positive SPEs in the MathML dataset. It shows similar results to those on the XMark dataset. However, due to space limitations, the other parts of the experiments on the MathML dataset are not shown in this paper.

5.2.2 Selectivity Retrieval Time vs. Size

In this experiment, the time for retrieving the selectivity of an SPE from different SF-Trees is tested. Figures 15 and 16 show the time for retrieving the estimated selectivities of 2000 SPEs from different SF-Trees. The points on different lines still represent different head-cropped SF-Trees and the points on the same line represent SF-Trees with the same shape but different m.

We can see that the time for finding an estimated selectivity of a negative SPE is only approximately half of the time for evaluating a positive SPE. This is because in order to get the selectivity of a negative SPE, we normally only need to check two signatures. These two signatures normally (without false drops) report that the query does not exist in their descendants. We do not need to traverse down the SF-Tree in this case, and hence less time is required to retrieve the selectivity of the negative SPE. Considering the points on the same line in Figure 15, the retrieval time increases with m because the number of bit positions we have to check for each selectivity retrieval is linearly proportional to m. Considering the starting points of the lines, we can see that the retrieval time increases as d decreases. This is because after cropping the top h levels, all internal nodes at level h and leaf nodes at levels less than h in the original SF-Tree become children of the new root node. To retrieve the estimated selectivity of an SPE, we need to check the children of the root first, and hence increasing the number of children of the root node increases the selectivity retrieval time.

5.3 Comparison with Previous Approaches

We also compared the performance of SF-Trees with that of some previous approaches, including path trees, global-* trees and suffix tries. We compared them in terms of accuracy, selectivity retrieval time and storage requirement. Tables 4 and 5 show the results of this experiment.



                          Path Tree     Global-* Tree   SF-Tree          SF-Tree          Suffix Trie
                          (537 nodes)   (299 nodes)     (m = 8, h = 3)   (m = 6, h = 3)
Averaged Abs. Error       0             146.2335        0                0.013            0
Averaged Rel. Error       0             198%            0                0.025%           0
Storage Requirement       7.5KB         4.2KB           24KB             20.7KB           183KB
Speed up over Path Tree   1             3.98            3.27             3.68             32.28

Table 4: A Comparison of performance for the XMark dataset.

                          Path Tree       Global-* Tree   SF-Tree          SF-Tree          Suffix Trie
                          (61545 nodes)   (39590 nodes)   (m = 8, h = 3)   (m = 6, h = 3)
Averaged Abs. Error       0               2.387           0                0.043            0
Averaged Rel. Error       0               234%            0                4.3%             0
Storage Requirement       862KB           555KB           484KB            365KB            17.2MB
Speed up over Path Tree   1               4.22            2546             2704             5064

Table 5: A Comparison of performance for the MathML dataset.

[Figure 11 plot omitted: average absolute error (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 7.84, 6.84, 5.84, 4.84, 3.84.]

Figure 11: The absolute error for positive queries against the size of the SF-Tree in the XMark dataset with different m and d.



[Figure 12 plot omitted: average absolute error (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 7.84, 6.84, 5.84, 4.84, 3.84.]

Figure 12: The absolute error for negative queries against the size of the SF-Tree in the XMark dataset with different m and d.

[Figure 13 plot omitted: average relative error (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 7.84, 6.84, 5.84, 4.84, 3.84.]

Figure 13: The relative error against the size of the SF-Tree in the XMark dataset with different m and d.

[Figure 14 plot omitted: average absolute error (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 1.254, 1.161, 1.118, 1.075, 1.037.]

Figure 14: The absolute error for positive queries against the size of the SF-Tree in the MathML dataset with different m and d.



[Figure 15 plot omitted: selectivity retrieval time in ms (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 7.84, 6.84, 5.84, 4.84, 3.84.]

Figure 15: The selectivity retrieval time for positive queries against the size of the SF-Tree in the XMark dataset with different m and d.

[Figure 16 plot omitted: selectivity retrieval time in ms (y-axis) versus SF-Tree size in KB (x-axis), one curve per depth d = 7.84, 6.84, 5.84, 4.84, 3.84.]

Figure 16: The selectivity retrieval time for negative queries against the size of the SF-Tree in the XMark dataset with different m and d.



Although we can only retrieve an estimated selectivity from an SF-Tree, by appropriately selecting m and h the estimates are very close to the exact values. As shown in Tables 4 and 5, there is no error in the estimated selectivities when m = 8 and h = 3. Even if we set m = 6 to save some space, the errors in the estimated selectivities are still very low. Thus the selectivity retrieved from an SF-Tree is very accurate under many parameter settings.

The global-* tree occupies the least space among all the methods we tested. In our experiments, we configured the global-* tree to remove around 40% of the nodes in the path tree. We found that the selectivity estimated by a global-* tree is very inaccurate: the relative error is around 200% on both the XMark and MathML datasets. This is because a global-* tree does not control the error introduced into the estimated selectivity during its summarization process. Note that if the XML document is very complex and the selectivities of the path expressions do not vary a lot (as in the MathML dataset), an SF-Tree can be smaller than a global-* tree.

Our experiments showed that an SF-Tree is generally much more efficient than a path tree. Retrieving the selectivity of an SPE from an SF-Tree can be up to 2,704 times faster than retrieving it from a path tree. While offering such a large speedup, SF-Trees do not require much more space than path trees; sometimes an SF-Tree may even require less space than a path tree.

In terms of selectivity retrieval time, the suffix trie gives the best performance, but it requires a large amount of space. Although we need more time to retrieve the selectivity of an SPE from an SF-Tree than from a suffix trie, an SF-Tree requires 10 to 50 times less space than a suffix trie. Therefore, an SF-Tree is a much more space-efficient way than a suffix trie to achieve a speedup over path trees.

6 Conclusions and Discussions

In this paper we proposed a new data structure called SF-Tree. An SF-Tree stores the statistics of an XML document using signature files, and the files are organized in a tree form so that the selectivities of SPEs can be retrieved efficiently. Our analysis shows that an SF-Tree has a statistical accuracy guarantee on the selectivities retrieved from it: the error rate decreases exponentially with respect to the error size. Moreover, an SF-Tree provides a flexible tradeoff between space, time and accuracy. Different variations that improve on a perfect SF-Tree were also discussed. Our experiments show that an SF-Tree is much more efficient than a path tree and much more space efficient than a suffix trie.

As a further study, we are investigating techniques for updating an SF-Tree. When an XML document is updated, the selectivities of the SPEs in the document change. Since the SPEs in an SF-Tree are grouped according to their selectivities, an SPE may jump from one group to another when the document is updated. However, since we use signature files to summarize the groups, it is non-trivial to remove an SPE from one group and insert it into another.
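To see why this is non-trivial, consider a toy example (illustrative only; the bit positions are made up). A group's signature is the superimposition, i.e. bitwise OR, of its members' signatures, so naively clearing a departing SPE's bits can also clear bits that the remaining members still need, and the group signature generally has to be rebuilt from the remaining members:

```python
# Signatures modelled as sets of bit positions; group signature = union of members.
group = {"//a/b": {3, 9, 17}, "//a/c": {3, 11, 17}}     # hypothetical SPE -> bits
signature = set().union(*group.values())                 # {3, 9, 11, 17}

# Naively "deleting" //a/b by clearing its bits also clears 3 and 17,
# which //a/c still needs, so lookups for //a/c would now fail.
naive = signature - group["//a/b"]                       # {11}

# A safe removal rebuilds the group signature from the remaining members.
del group["//a/b"]
rebuilt = set().union(*group.values())                   # {3, 11, 17}
print(naive, rebuilt)
```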

Another research direction is extending the SF-Tree to other applications. Basically, an SF-Tree is a new approach for efficiently storing a set of key-to-value pairs. For the selectivity estimation problem, we store the mapping from an SPE to its selectivity in an SF-Tree. An SF-Tree is especially efficient when the domain of keys is large but the domain of values is relatively small. Hence we can apply SF-Trees to other applications such as multi-dimensional histograms.
