Post on 18-Dec-2015

Presentation for Cmpe-521

VIST – Virtual Suffix Tree

Prepared by:

Evren CEYLAN – 2003700163

Aslı UYAR – 2003701321

VIST: A Dynamic Index Method for Querying XML Data by Tree Structures

Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML?

XML: Extensible Markup Language

It is of great importance in data exchange.

As a result, much research has gone into flexible query mechanisms for extracting data from XML documents.

VIST : Virtual Suffix Tree

The paper proposes VIST as a method for searching XML documents.

XML documents and XML queries are both represented as structure-encoded sequences (explained on the following slides).

With this representation, querying XML data is equivalent to finding subsequence matches.

Index Methods in XML

Previous index methods: disassemble a query into multiple sub-queries, then join the results of these sub-queries to produce the final answer.

What does VIST do?

Converts both XML data and XML queries to structure-encoded sequences.

Uses tree structures as the basic unit of query, which avoids highly expensive join operations.

In other words, it uses structure-encoded sequences instead of nodes or paths.

What does VIST do?

Matches structured queries against structured data as a whole, without breaking the queries down into sub-queries of paths or nodes and without relying on join operations.

Supports dynamic index update.

What does VIST do?

The paper shows that VIST is effective and efficient in supporting structural queries.

Introduction

XML has a growing importance in data exchange (extracting data from XML documents)

XML provides a flexible way to define semi-structured data

The paper introduces a novel index structure called VIST (Virtual Suffix Tree).

According to the paper, VIST offers better performance and usability than previous approaches to XML indexing.

In XML query language design, expressing complex structural or graphical queries is one of the major concerns.

(Figure 2 shows four sample queries in graph form.)

In previous approaches;

i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches).

ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final result.

iii. So, these methods are inefficient in handling branching queries.

In VIST approach;

Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.

Result: no need to perform expensive join operations.

Method:

XML data and XML queries are transformed into “structure-encoded sequences”.

A Virtual Suffix Tree is used to organize the structure-encoded sequences.

VIST also speeds up the matching process.

Structure:

VIST’s index structure has two parts: the D-Ancestor index and the S-Ancestor index (explained on the following slides).

VIST unifies structural indexes and value indexes into a single index.

To achieve this, a method called “dynamic virtual suffix tree labeling” is proposed, so that index updates can be performed directly on B+Trees.

Structure-Encoded Sequences

Sequential representation of both XML Data and XML Queries.

Objective: modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing.

Result: structure-encoded sequences are used instead of paths or nodes.

Mapping Data and Queries to Structure-Encoded Sequences:

Stage 1: Let’s consider the purchase record example in Figure 3.

Notation: capital letters represent names of attributes; lowercase letters represent attribute values.

To encode attribute values as integers we use a hash() function, e.g. v1 = h(“dell”) and v2 = h(“ibm”), so v1 and v2 represent “dell” and “ibm” respectively.
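As a minimal sketch of this encoding step: any stable hash will do for illustration. The slides do not say which function h is, so the use of `zlib.crc32` below is an assumption, not the authors' choice.

```python
import zlib

def h(value: str) -> int:
    """Map an attribute value to an integer (CRC32 is an assumed stand-in)."""
    return zlib.crc32(value.encode("utf-8"))

v1 = h("dell")  # integer symbol standing for "dell"
v2 = h("ibm")   # integer symbol standing for "ibm"
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the encoded values identical across runs.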

Representing an XML document by the preorder sequence of its tree structure.

e.g. preorder sequence of the tree in Figure 3 is:

PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Stage 2:

Stage 3:

Definition: A structure-encoded sequence is a sequence of (symbol,prefix) pairs:

D = (a1,p1), (a2,p2), . . . , (an,pn)

ai: a node in the XML document tree.

pi: the path from the root node to node ai.

Figure 3 can be converted into the structure-encoded sequence.

D = ... ... (Figure 4)
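The definition can be sketched in a few lines: a preorder walk that emits each node together with the path leading to it. The tuple-based tree layout and the name `encode` are assumptions made for this sketch. The sample tree is Doc1 from the naïve-algorithm example, whose shape can be read off its prefixes.

```python
from typing import List, Tuple

# A node is (symbol, [children]); this layout is an assumption for the sketch.
Element = Tuple[str, Tuple[str, ...]]  # (symbol, prefix path from the root)

def encode(tree, prefix: Tuple[str, ...] = ()) -> List[Element]:
    """Return the structure-encoded sequence of `tree` in preorder."""
    symbol, children = tree
    seq = [(symbol, prefix)]
    for child in children:
        seq.extend(encode(child, prefix + (symbol,)))
    return seq

# Doc1: P -> S -> { N -> v1, L -> v2 }
doc1_tree = ("P", [("S", [("N", [("v1", [])]), ("L", [("v2", [])])])])
doc1 = encode(doc1_tree)
print("".join(f"({s},{''.join(p) or 'e'})" for s, p in doc1))
# prints (P,e)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL)
```

Prefixes are kept as tuples of symbols rather than strings so that multi-character symbols such as v1 stay unambiguous.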

Benefits:

The benefit of modeling XML queries as sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).

Combining the results of the sub queries by join operations is expensive.

The VIST Approach:

Presented in 3 stages:

Naïve algorithm based on the suffix trees

RIST : improves the naïve algorithm by using B+Trees to index suffix tree nodes

VIST : an index structure that relies only on B+Trees

Requirements an XML indexing method needs to meet:

Should support structural queries directly. This is done by “structure-encoded sequences”.

Instead of relying on “suffix trees”, the index method uses better indexing techniques such as B+Trees.

The index structure should allow dynamic data insertion and deletion, etc.

A Naïve Algorithm Based on Suffix Trees

The most widely used index structure for subsequence matching is the suffix tree.

 

Example:

Two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, given as structure-encoded sequences:

Doc1 : (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

Doc2 : (P,e) (B,P) (L,PB) (v2,PBL)

Q1 : (P,e) (B,P) (L,PB) (v2,PBL)

Q2 : (P,e) (L,P*) (v2,P*L)

A tree structure for Doc1 and Doc2 is shown in Figure 5

Example: (Cont’d)

As shown above, the elements of the sequences are represented as nodes in the suffix tree.

Since the nodes are involved in two different trees, there are two kinds of ancestor-descendant relationships among them:

i) D-Ancestorship (ancestorship in the XML document tree), e.g. (S,P) is a D-ancestor of (L,PS)

ii) S-Ancestorship (ancestorship in the suffix tree), e.g. (v1,PSN) is an S-ancestor of (L,PS)
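The two relationships can be checked directly on (symbol, prefix) pairs. The sketch below assumes that each element occurs once in a sequence, so that position in the sequence stands in for depth on the suffix-tree path. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Element = Tuple[str, Tuple[str, ...]]  # (symbol, prefix path)

def is_d_ancestor(a: Element, b: Element) -> bool:
    """True if a is an ancestor of b in the XML document tree."""
    path_to_a = a[1] + (a[0],)           # a's prefix followed by a itself
    return b[1][:len(path_to_a)] == path_to_a

def is_s_ancestor(seq: Sequence[Element], a: Element, b: Element) -> bool:
    """True if a precedes b in seq, i.e. a lies above b on the suffix-tree path."""
    return seq.index(a) < seq.index(b)

doc1 = [("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("v1", ("P", "S", "N")),
        ("L", ("P", "S")), ("v2", ("P", "S", "L"))]
S, V1, L = doc1[1], doc1[3], doc1[4]

print(is_d_ancestor(S, L))         # True:  (S,P) is a D-ancestor of (L,PS)
print(is_d_ancestor(V1, L))        # False: v1 is not above L in the document
print(is_s_ancestor(doc1, V1, L))  # True:  (v1,PSN) is an S-ancestor of (L,PS)
```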

Example: (Cont’d)

Naïve Algorithm based on the suffix trees:

NaiveSearch is a naïve method for non-contiguous subsequence matching on the suffix tree.

For example, to match Q2:

Start with the root node, which matches the first element of Q2, (P,e).

Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).

Finally, search for

- (v2,PSL) under the node labeled (L,PS)

- (v2,PBL) under the node labeled (L,PB)

Algorithm 1 searches nodes first by S-Ancestorship and then by D-Ancestorship.
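The walk-through above can be sketched as follows. This is not the paper's Algorithm 1 verbatim: the tree of inserted sequences is modeled as a plain trie, "*" is assumed to match exactly one symbol, and identical wildcard patterns are assumed to refer to the same node. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Element = Tuple[str, Tuple[str, ...]]  # (symbol, prefix); "*" is a wildcard

class Node:
    def __init__(self) -> None:
        self.children: Dict[Element, "Node"] = {}
        self.doc_ids: List[str] = []   # documents whose sequence ends here

def insert(root: Node, seq: List[Element], doc_id: str) -> None:
    node = root
    for element in seq:
        node = node.children.setdefault(element, Node())
    node.doc_ids.append(doc_id)

def consistent(pattern, prefix, bound) -> bool:
    """Wildcard match that also agrees with earlier wildcard bindings."""
    if len(pattern) != len(prefix):
        return False
    if any(p != "*" and p != s for p, s in zip(pattern, prefix)):
        return False
    return all(prefix[:len(q)] == c
               for q, c in bound.items() if pattern[:len(q)] == q)

def docs_below(node: Node) -> Set[str]:
    found = set(node.doc_ids)
    for child in node.children.values():
        found |= docs_below(child)
    return found

def naive_search(node: Node, query: List[Element], i: int = 0, bound=None) -> Set[str]:
    bound = bound or {}
    if i == len(query):
        return docs_below(node)
    symbol, pattern = query[i]
    found: Set[str] = set()
    stack = list(node.children.items())   # visit every descendant of `node`
    while stack:
        (d_symbol, d_prefix), child = stack.pop()
        if d_symbol == symbol and consistent(pattern, d_prefix, bound):
            found |= naive_search(child, query, i + 1, {**bound, pattern: d_prefix})
        stack.extend(child.children.items())
    return found

doc1 = [("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("v1", ("P", "S", "N")),
        ("L", ("P", "S")), ("v2", ("P", "S", "L"))]
doc2 = [("P", ()), ("B", ("P",)), ("L", ("P", "B")), ("v2", ("P", "B", "L"))]
q1 = list(doc2)
q2 = [("P", ()), ("L", ("P", "*")), ("v2", ("P", "*", "L"))]

root = Node()
insert(root, doc1, "Doc1")
insert(root, doc2, "Doc2")
print(sorted(naive_search(root, q1)))  # ['Doc2']
print(sorted(naive_search(root, q2)))  # ['Doc1', 'Doc2']
```

The inner `while` loop is the costly part: for every matched element it walks the whole subtree below the current node.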

Difficulties of Naive Algorithm:

There are difficulties in using a suffix tree to index structure-encoded sequences.

The major difficulty is explained below:

Searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly, because a large portion of the subtree has to be traversed for each match.

RIST: Indexing by Ancestor-Descendant Relationships

Improves the naïve algorithm by eliminating the expensive traversal of the suffix tree.

When we reach node X after matching, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.

So we no longer need to search among the descendants of X to find the Ys one by one.

RIST Algorithm:

1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs. This index is a B+Tree.

i. It lets us search nodes by their (Symbol,Prefix) pairs, that is, by D-Ancestorship.

ii. This B+Tree is called the D-Ancestorship B+Tree.

2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones that satisfy S-Ancestorship as well.

i. Labels are created for suffix tree nodes so that the relationship between two nodes can be determined.

ii. We use B+Trees to index nodes by their labels.

iii. This B+Tree is called the S-Ancestorship B+Tree.

Labeling Notation <nx, sizex>

nx: prefix (preorder) traversal order of x in the suffix tree.

sizex: total number of descendants of x in the suffix tree.

This labeling is shown in Figure 5.

Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily:

If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex]
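A small sketch of this static labeling, assuming a tree given as nested dicts with unique node names (both assumptions made only for illustration): nodes are numbered in preorder, size counts descendants, and the ancestor test is a single interval check.

```python
from typing import Dict, Tuple

def label_tree(tree: dict) -> Dict[str, Tuple[int, int]]:
    """Assign <n, size> to every node: preorder number and descendant count."""
    labels: Dict[str, Tuple[int, int]] = {}
    counter = 0

    def visit(name: str, children: dict) -> int:
        nonlocal counter
        n = counter
        counter += 1
        size = sum(1 + visit(c, sub) for c, sub in children.items())
        labels[name] = (n, size)
        return size

    for name, children in tree.items():
        visit(name, children)
    return labels

def is_s_ancestor(x: Tuple[int, int], y: Tuple[int, int]) -> bool:
    """x is an ancestor of y iff n_y lies in (n_x, n_x + size_x]."""
    n_x, size_x = x
    n_y, _ = y
    return n_x < n_y <= n_x + size_x

labels = label_tree({"r": {"a": {"b": {}, "c": {}}, "d": {}}})
print(labels["a"])                              # (1, 2)
print(is_s_ancestor(labels["a"], labels["c"]))  # True
print(is_s_ancestor(labels["a"], labels["d"]))  # False
```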

Labeling Notation

Constructing the B+Trees:

Insert all suffix tree nodes into the D-Ancestorship B+Tree, using their (Symbol,Prefix) pairs as keys.

All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys.

Shown in FIGURE 6

Building the DocID B+Tree:

The DocID B+Tree stores, for each node x (using nx as the key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree.

Shown in DocID B+Tree

In summary:

Unlike the naïve algorithm, RIST does not use the suffix tree for subsequence matching (it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Trees).

From any node, instead of searching the entire subtree under the node, we can jump to the sub-nodes that match the next element in the query.

So RIST supports non-contiguous subsequence matching efficiently.
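To make the three indexes concrete, here is a hedged in-memory sketch. A dict stands in for the D-Ancestorship B+Tree, sorted lists searched with `bisect` stand in for the S-Ancestorship and DocID B+Trees, and the scan over dict keys replaces what would be a B+Tree key lookup. Wildcards follow the same assumptions as before ("*" matches one symbol). None of the names come from the paper.

```python
from bisect import bisect_left, bisect_right

def consistent(pattern, prefix, bound):
    """Wildcard match ("*" = any one symbol) that agrees with earlier bindings."""
    if len(pattern) != len(prefix):
        return False
    if any(p != "*" and p != s for p, s in zip(pattern, prefix)):
        return False
    return all(prefix[:len(q)] == c
               for q, c in bound.items() if pattern[:len(q)] == q)

def build_index(docs):
    # A trie of all sequences, needed only to compute the static labels.
    root = {"children": {}, "docs": []}
    for doc_id, seq in docs.items():
        node = root
        for element in seq:
            node = node["children"].setdefault(element, {"children": {}, "docs": []})
        node["docs"].append(doc_id)

    d_index = {}    # (symbol, prefix) -> sorted [(n, size), ...]
    doc_index = []  # [(n, doc_id), ...] in ascending n
    counter = 0

    def visit(node, element):
        nonlocal counter
        n = counter
        counter += 1
        doc_index.extend((n, d) for d in node["docs"])
        size = sum(1 + visit(child, el) for el, child in node["children"].items())
        if element is not None:
            d_index.setdefault(element, []).append((n, size))
        return size

    root_size = visit(root, None)
    for labels in d_index.values():
        labels.sort()
    return d_index, doc_index, root_size

def rist_search(query, d_index, doc_index, root_size):
    doc_ns = [n for n, _ in doc_index]
    results = set()

    def step(i, n, size, bound):
        if i == len(query):   # collect documents ending inside the final scope
            lo, hi = bisect_left(doc_ns, n), bisect_right(doc_ns, n + size)
            results.update(doc_id for _, doc_id in doc_index[lo:hi])
            return
        symbol, pattern = query[i]
        for (d_symbol, d_prefix), labels in d_index.items():   # D-Ancestorship
            if d_symbol != symbol or not consistent(pattern, d_prefix, bound):
                continue
            lo = bisect_right(labels, (n, float("inf")))         # n_y > n
            hi = bisect_right(labels, (n + size, float("inf")))  # n_y <= n + size
            for n_y, size_y in labels[lo:hi]:                    # S-Ancestorship
                step(i + 1, n_y, size_y, {**bound, pattern: d_prefix})

    step(0, 0, root_size, {})
    return results

docs = {
    "Doc1": [("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("v1", ("P", "S", "N")),
             ("L", ("P", "S")), ("v2", ("P", "S", "L"))],
    "Doc2": [("P", ()), ("B", ("P",)), ("L", ("P", "B")), ("v2", ("P", "B", "L"))],
}
index = build_index(docs)
q2 = [("P", ()), ("L", ("P", "*")), ("v2", ("P", "*", "L"))]
print(sorted(rist_search(docs["Doc2"], *index)))  # ['Doc2']
print(sorted(rist_search(q2, *index)))            # ['Doc1', 'Doc2']
```

Compared with the naïve search, no subtree is walked at query time: each step is a key lookup followed by a range query on n.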

VIST: The Virtual Suffix Tree

RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions.

For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x (in prefix order),

as well as the size of the subtree rooted at x, which means neither n nor size can be fixed.

VIST: The Virtual Suffix Tree

The purpose of the suffix tree is to provide a labeling mechanism to encode S-Ancestorship.

Suppose a node x is created for element di during the insertion of the sequence d1, … , di, … , dk.

VIST: The Virtual Suffix Tree

If we can estimate

i. how many different elements can possibly follow di in future insertions, and

ii. the occurrence probability of each of these elements,

then we can label x’s child nodes immediately instead of waiting until all sequences are inserted.

It also means:

the suffix tree itself is no longer needed, because its only purpose was to provide the labeling.

The resulting scheme supports dynamic data insertion and deletion.

VIST: The Virtual Suffix Tree (Cont’d)

Top down scope allocation:

A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.

Top down scope allocation:

Dynamic scope allocation has a parameter λ, the expected number of child nodes of any node.

λ is usually assumed to be 2.

Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s first inserted child.

For a node x with scope <n, size> and λ = 2:

Child1 : <n+1, size/2>

Child2 : <n+1+size/2, size/4>

Dynamic scope of a Suffix Tree Node:

The dynamic scope of a node is a triple <n,size,k>, where k is the number of subscopes allocated inside the current scope.
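A hedged sketch of top-down scope allocation with the <n, size, k> triple. It assumes that every new child receives 1/λ of whatever scope is still unallocated and starts right after the space handed to earlier children. For λ = 2 this gives a first child of <n+1, size/2> and a second child of size/4 placed directly behind it. Integer division and the function name are assumptions of the sketch.

```python
def allocate_child(scope, lam=2):
    """Allocate the next subscope inside scope = [n, size, k]; return (n, size)."""
    n, size, k = scope
    used = 0
    for _ in range(k):                 # space already given to the first k children
        used += (size - used) // lam
    child = (n + 1 + used, (size - used) // lam)
    scope[2] = k + 1                   # one more subscope now lives in this scope
    return child

parent = [0, 20480, 0]
print(allocate_child(parent))  # (1, 10240)
print(allocate_child(parent))  # (10241, 5120)
print(parent)                  # [0, 20480, 2]
```

Because k is stored with the scope, the next free subscope can be computed without looking at any sibling.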

Algorithm of VIST:

VIST uses the same sequence matching algorithm as RIST.

A dynamic method for labeling suffix tree nodes is presented that works without building the suffix tree.

Algorithm of VIST:

The method relies on rough estimates of the number of attribute values.

Because of that, the labeling mechanism is based on a virtual suffix tree.

Example:

- Let’s look at the index structure before and after an insertion.

Algorithm of VIST:

Suppose, before the insertion the index structure already contains the following sequence:

Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

The sequence to be inserted:

Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example:

There are two assumptions for the algorithm:

Max = 20480

The dynamic scope allocation method uses the parameter λ = 2

The insertion process is much like that of inserting a sequence into a suffix tree.

We follow the branches, and when there is no branch to follow we create one.
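The insertion of Doc1 followed by Doc2 can be sketched as below, with Max = 20480 and λ = 2. The in-memory dict keyed by (parent n, element) is a hypothetical stand-in for the B+Trees, and the allocation rule (each new child gets 1/λ of the still-unallocated scope) is an assumed reading of the slides. The resulting numbers illustrate the mechanism and are not taken from the paper's figures.

```python
def allocate_child(scope, lam=2):
    """Allocate the next subscope inside scope = [n, size, k]; return (n, size)."""
    n, size, k = scope
    used = 0
    for _ in range(k):
        used += (size - used) // lam
    scope[2] = k + 1
    return (n + 1 + used, (size - used) // lam)

def insert_sequence(index, seq, max_scope=20480, lam=2):
    """Follow existing branches; allocate a new scope where no branch exists."""
    parent = index.setdefault("root", [0, max_scope, 0])
    for element in seq:
        key = (parent[0], element)     # element directly under `parent`
        if key not in index:           # no branch to follow: create one
            index[key] = [*allocate_child(parent, lam), 0]
        parent = index[key]
    return parent

doc1 = [("P", ()), ("S", ("P",)), ("N", ("P", "S")), ("v1", ("P", "S", "N")),
        ("L", ("P", "S")), ("v2", ("P", "S", "L"))]
doc2 = [("P", ()), ("S", ("P",)), ("L", ("P", "S")), ("v2", ("P", "S", "L"))]

index = {}
last1 = insert_sequence(index, doc1)
last2 = insert_sequence(index, doc2)
print(last1[:2])  # [6, 320]
print(last2[:2])  # [2564, 640]
```

Doc2 shares the (P,e) and (S,P) nodes with Doc1. Its (L,PS) element has no existing branch under (S,P), so it becomes the second subscope of (S,P), and only two new labels are created.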

CONCLUSION:

VIST, a dynamic index method, is developed for XML documents.

XML data and XML queries are converted into sequences that encode their structural information.

VIST’s Pros:

Uses tree structures as the basic unit of query to avoid expensive join operations.

Supports dynamic data insertion and deletion.

Unlike some data structures used in other approaches, the index structure of VIST is based on B+Trees, which are well supported by DBMSs.

End of Presentation

Questions ?
