Presentation for Cmpe-521 VIST – Virtual Suffix Tree Prepared by : Evren CEYLAN – 2003700163 Aslı UYAR - 2003701321


Presentation for Cmpe-521

VIST – Virtual Suffix Tree

Prepared by:

Evren CEYLAN – 2003700163
Aslı UYAR - 2003701321

VIST: A Dynamic Index Method for Querying XML Data by Tree Structures

Written by: Haixun Wang, Sanghyun Park, Wei Fan, Philip S. Yu – SIGMOD 2003

What is XML?

XML: Extensible Markup Language

XML is of great importance in data exchange.

As a result, much research has gone into flexible query mechanisms for extracting data from XML documents.

VIST: Virtual Suffix Tree

The paper proposes VIST as an index for searching XML documents.

XML documents and XML queries are both represented as structure-encoded sequences (explained on the following pages).

With this representation, querying XML data is equivalent to finding (non-contiguous) subsequence matches.

Index Methods in XML

Previous index methods disassemble a query into multiple sub-queries, then join the results of these sub-queries to produce the final answers.

What does VIST do?

Converts both XML data and XML queries to structure-encoded sequences.

Uses tree structures as the basic unit of query in order to avoid expensive join operations.

In other words, it uses structure-encoded sequences instead of nodes or paths.

What does VIST do?

Matches structured queries against structured data as a whole, without breaking the queries down into sub-queries of paths or nodes and without relying on join operations.

Supports dynamic index updates.

What does VIST do?

The paper shows that VIST is effective and efficient in supporting structural queries.

Introduction

XML has growing importance in data exchange, which makes extracting data from XML documents a key task.

XML provides a flexible way to define semi-structured data.

The paper introduces a novel index structure called VIST (Virtual Suffix Tree).

According to the paper, VIST offers better performance and usability than previous approaches to XML indexing.

In XML query language design, expressing complex structural or graph queries is one of the major concerns.

(Figure 2 displays four sample queries in graph form.)

In previous approaches:

i. Indexes are created on paths (e.g. “/P/S/I/M” in Q1). Path indexes can answer simple queries efficiently (Q1 has no branches).

ii. However, queries that involve branching structures (such as Q2) have to be disassembled into sub-queries, whose results are then combined by expensive join operations to produce the final results.

iii. So these methods are inefficient in handling branching queries.

In the VIST approach:

Objective: to provide a general method so that structural XML queries need not be decomposed into sub-queries.

Result: no need to perform expensive join operations.

Method:

XML data and XML queries are transformed into “structure-encoded sequences”.

A virtual suffix tree is used to organize the structure-encoded sequences.

VIST also speeds up the matching process.

Structure:

VIST’s index structure includes two parts: the D-Ancestor index and the S-Ancestor index (explained on the following pages).

VIST unifies structural indexes and value indexes into a single index.

To achieve this, the paper proposes a method called “dynamic virtual suffix tree labeling”, with which index updates can be performed directly on B+Trees.

Structure-Encoded Sequences

A sequential representation of both XML data and XML queries.

Objective: modeling XML queries as sequence matching lets us avoid unnecessary join operations in query processing.

Result: structure-encoded sequences are used instead of paths or nodes.

Mapping Data and Queries to Structure-Encoded Sequences:

Stage 1: Consider the purchase record example in Figure 3.

Notation: Capital letters represent names of attributes. Lowercase letters represent attribute values.

To encode attribute values as integers we use a hash function h( ), e.g. v1 = h(“dell”) and v2 = h(“ibm”).

v1 and v2 are then used to represent “dell” and “ibm” respectively.

Stage 2:

Represent an XML document by the preorder sequence of its tree structure.

e.g. the preorder sequence of the tree in Figure 3 is:

PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Stage 3:

Definition: A structure-encoded sequence is a sequence of (symbol, prefix) pairs:

D = (a1,p1), (a2,p2), . . . , (an,pn)

ai: a node in the XML document tree.

pi: the path from the root node to node ai.
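The definition can be made concrete with a short sketch. The function name `encode` and the nested `(symbol, children)` tuple representation are illustrative assumptions, not notation from the paper; the empty string stands for the empty prefix of the root. The sample tree is chosen so that it produces the sequence of Doc1 used later in the slides.

```python
def encode(tree, prefix=""):
    """Return the structure-encoded sequence of `tree` in preorder.

    `tree` is a (symbol, children) pair. The prefix of a node is the
    concatenation of the symbols on the path from the root to its parent.
    """
    symbol, children = tree
    sequence = [(symbol, prefix)]
    for child in children:
        # Each child sees the path down to and including the current node.
        sequence.extend(encode(child, prefix + symbol))
    return sequence


# Hypothetical document tree: P has child S; S has children N (value v1)
# and L (value v2).
doc1 = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])

print(encode(doc1))
```

Because the traversal is preorder, every node appears after all of its ancestors, which is what later allows a query to be matched as a (non-contiguous) subsequence.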


Figure 3 can be converted into the structure-encoded sequence.

D = ... ... (Figure 4)

Benefits:

The benefit of modeling XML queries through sequence matching is that structural queries can be processed as a whole instead of being broken into smaller query units (paths or nodes of the XML document tree).

This matters because combining the results of sub-queries by join operations is expensive.

The VIST Approach:

Presented in 3 stages:

Naïve algorithm: based on suffix trees

RIST: improves the naïve algorithm by using B+Trees to index suffix tree nodes

VIST: the final index structure, which relies only on B+Trees

Requirements for an XML indexing method:

It should support structural queries directly. This is done with “structure-encoded sequences”.

Instead of relying on suffix trees, it should use better-supported indexing techniques such as B+Trees.

The index structure should allow dynamic data insertion and deletion.

A Naïve Algorithm Based on Suffix Trees

The most widely used index structure for subsequence matching is the suffix tree.

Example:

Two XML documents, Doc1 and Doc2, and two XML queries, Q1 and Q2, as structure-encoded sequences (e denotes the empty prefix of the root):

Doc1: (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
Doc2: (P,e) (B,P) (L,PB) (v2,PBL)
Q1: (P,e) (B,P) (L,PB) (v2,PBL)
Q2: (P,e) (L,P*) (v2,P*L)

Example: (Cont’d)

The tree built from the sequences of Doc1 and Doc2 is shown in Figure 5.

Example: (Cont’d)

As shown above, elements of the sequences become nodes in the suffix tree.

Since each node belongs to two different trees (the XML document tree and the suffix tree), there are two kinds of ancestor-descendant relationships among the nodes:

i) D-Ancestorship (ancestor in the document tree), e.g. (S,P) is a D-ancestor of (L,PS)

ii) S-Ancestorship (ancestor in the suffix tree), e.g. (v1,PSN) is an S-ancestor of (L,PS)

Naïve Algorithm Based on Suffix Trees:

NaiveSearch is a naïve method for non-contiguous subsequence matching on the suffix tree.

For example, to match Q2:

Start with the root node, which matches the 1st element of Q2, (P,e).

Then search under the root for all nodes that match (L,P*), which yields (L,PS) and (L,PB).

Finally, search for
- (v2,PSL) under the node labeled (L,PS)
- (v2,PBL) under the node labeled (L,PB)

Algorithm 1 searches nodes first by S-Ancestorship and then by D-Ancestorship.
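The walk-through above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's Algorithm 1: the names `Node`, `insert`, `descendants`, `prefix_matches` and `naive_search` are hypothetical, element names are assumed to be single characters, and "*" is assumed to stand for exactly one element name. Once a wildcard has been matched, the concrete path is remembered in `bound`, so (v2,P*L) is only accepted as (v2,PSL) below (L,PS).

```python
import re


class Node:
    def __init__(self, item=None):
        self.item = item        # (symbol, prefix) pair; None for the tree root
        self.children = {}      # item -> Node
        self.doc_ids = []       # documents whose sequence ends at this node


def insert(root, sequence, doc_id):
    node = root
    for item in sequence:
        node = node.children.setdefault(item, Node(item))
    node.doc_ids.append(doc_id)


def descendants(node):
    for child in node.children.values():
        yield child
        yield from descendants(child)


def prefix_matches(pattern, prefix, bound):
    if pattern in bound:                    # path already fixed by an earlier match
        return prefix == bound[pattern]
    regex = "".join("." if c == "*" else re.escape(c) for c in pattern)
    return re.fullmatch(regex, prefix) is not None


def naive_search(node, query, i=0, bound=None):
    """Return the ids of documents containing `query` as a subsequence."""
    bound = bound or {}
    if i == len(query):                     # whole query matched: collect documents
        ids = set(node.doc_ids)
        for d in descendants(node):
            ids.update(d.doc_ids)
        return ids
    symbol, pattern = query[i]
    result = set()
    for d in descendants(node):             # S-Ancestorship: d lies below node
        d_symbol, d_prefix = d.item
        if d_symbol == symbol and prefix_matches(pattern, d_prefix, bound):
            new_bound = dict(bound)
            new_bound[pattern + symbol] = d_prefix + d_symbol
            result |= naive_search(d, query, i + 1, new_bound)
    return result


root = Node()
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")], 1)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")], 2)

q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]
print(naive_search(root, q1), naive_search(root, q2))
```

The loop over `descendants(node)` is the costly part: every match step walks the whole subtree under the current node.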

Difficulties of the Naïve Algorithm:

Using a suffix tree to index structure-encoded sequences raises several difficulties.

The major one:

Searching for nodes that satisfy both S-Ancestorship and D-Ancestorship is extremely costly, because a large portion of the subtree has to be traversed for each match.

RIST: Indexing by Ancestor-Descendant Relationships

Improves the naïve algorithm by eliminating the expensive subtree traversals in the suffix tree.

When we reach node X after a match, we can jump directly to those nodes Y of which X is both a D-Ancestor and an S-Ancestor.

So there is no longer any need to search among the descendants of X to find the Ys one by one.

RIST Algorithm:

1. Index the nodes in the suffix tree by their (Symbol,Prefix) pairs, using a B+Tree.

i. This lets us search for nodes by (Symbol,Prefix) pair, that is, by D-Ancestorship.

ii. This B+Tree is called the D-Ancestorship B+Tree.

RIST Algorithm:

2. Among all the nodes satisfying D-Ancestorship, we are interested in the ones that satisfy S-Ancestorship as well.

i. Labels are created for suffix tree nodes so that the relationship between two nodes can be determined from their labels.

ii. We use B+Trees to index nodes by these labels.

iii. Such a B+Tree is called an S-Ancestorship B+Tree.

Labeling Notation <nx, sizex>

nx: the prefix traversal order of x in the suffix tree.

sizex: the total number of descendants of x in the suffix tree.

This labeling is shown in Figure 5.

Labeling Notation

Note: with this labeling, the S-Ancestorship between any two nodes can be decided easily:

If x and y are labeled <nx, sizex> and <ny, sizey>, node x is an S-Ancestor of y if ny ∈ (nx, nx + sizex].
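A minimal sketch of this static labeling, assuming the tree is given as a plain dictionary from node name to child names (the names `assign_labels` and `is_s_ancestor` and the sample tree are illustrative, not from the paper). Each node receives its preorder number n and its descendant count size, and the ancestor test reduces to a range check.

```python
def assign_labels(children, root):
    """Label every node with (n, size): preorder number and descendant count."""
    labels = {}
    counter = 0

    def visit(name):
        nonlocal counter
        n = counter
        counter += 1
        for child in children.get(name, []):
            visit(child)
        # Every number handed out since n belongs to a descendant of `name`.
        labels[name] = (n, counter - n - 1)

    visit(root)
    return labels


def is_s_ancestor(x, y):
    """x is an ancestor of y iff ny lies in the range (nx, nx + sizex]."""
    nx, size_x = x
    ny, _ = y
    return nx < ny <= nx + size_x


# Hypothetical tree: a has children b and e; b has children c and d.
tree = {"a": ["b", "e"], "b": ["c", "d"]}
labels = assign_labels(tree, "a")
print(labels)
print(is_s_ancestor(labels["b"], labels["d"]), is_s_ancestor(labels["b"], labels["e"]))
```

The check needs no tree traversal at query time, which is what makes the labels suitable as B+Tree keys.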

Constructing the B+Trees:

Insert all suffix tree nodes into the D-Ancestorship B+Tree, using their (Symbol,Prefix) pairs as keys.

All nodes x inserted with the same (Symbol,Prefix) are indexed by an S-Ancestorship B+Tree, using the nx values of their labels as keys.

Shown in Figure 6.

Building the DocID B+Tree:

The DocID B+Tree stores, for each node x (using nx as key), the document IDs of those XML sequences that end at node x when they are inserted into the suffix tree.

Shown in DocID B+Tree

In summary:

Unlike the naïve algorithm, RIST does not use the suffix tree itself for subsequence matching; it uses the D-Ancestorship B+Tree and the S-Ancestorship B+Trees.

From any node, instead of searching the entire subtree under the node, we can jump to the descendant nodes that match the next element in the query.

So RIST supports non-contiguous subsequence matching efficiently.
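The following sketch shows how the three indexes could cooperate. It is an illustration under assumptions, not the paper's implementation: a Python dict of sorted lists stands in for the D-Ancestorship and S-Ancestorship B+Trees, a sorted list stands in for the DocID B+Tree, `bisect` plays the role of a B+Tree range scan, and only queries without wildcards are handled. The names `build_index` and `search` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: [(symbol, prefix), ...]} -> (d_index, doc_index)."""
    trie, ends = {}, defaultdict(list)
    for doc_id, sequence in docs.items():
        node, path = trie, ()
        for item in sequence:
            node = node.setdefault(item, {})
            path += (item,)
        ends[path].append(doc_id)

    d_index = defaultdict(list)   # (symbol, prefix) -> [(n, size)], sorted by n
    doc_index = []                # [(n, doc_id)], sorted by n
    counter = 0

    def visit(item, children, path):
        nonlocal counter
        n = counter
        counter += 1
        for doc_id in ends.get(path, []):
            doc_index.append((n, doc_id))
        for child_item, grandchildren in children.items():
            visit(child_item, grandchildren, path + (child_item,))
        d_index[item].append((n, counter - n - 1))

    for item, children in trie.items():
        visit(item, children, (item,))
    for labels in d_index.values():
        labels.sort()
    doc_index.sort()
    return d_index, doc_index


def search(d_index, doc_index, query):
    scopes = [(-1, float("inf"))]           # virtual root above every node
    for item in query:
        labels = d_index.get(item, [])      # D-Ancestorship: same (symbol, prefix)
        next_scopes = []
        for n, size in scopes:              # S-Ancestorship: ny in (n, n + size]
            i = bisect_right(labels, (n, float("inf")))
            while i < len(labels) and labels[i][0] <= n + size:
                next_scopes.append(labels[i])
                i += 1
        scopes = next_scopes
    result = set()
    for n, size in scopes:                  # documents ending in [n, n + size]
        i = bisect_left(doc_index, (n,))
        while i < len(doc_index) and doc_index[i][0] <= n + size:
            result.add(doc_index[i][1])
            i += 1
    return result


docs = {
    1: [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")],
    2: [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
d_index, doc_index = build_index(docs)
print(search(d_index, doc_index, [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]))
```

The query in the last line skips (N,PS) and (v1,PSN), so it is matched as a non-contiguous subsequence of Doc1 purely through range lookups.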

VIST: The Virtual Suffix Tree

RIST uses a static scheme to label suffix tree nodes, which prevents it from supporting dynamic insertions.

For any node x labeled <n,size>, later insertions can change the number of nodes that appear before x in prefix order, as well as the size of the subtree rooted at x. So neither n nor size can be fixed.

VIST: The Virtual Suffix Tree

The purpose of the suffix tree is to provide a labeling mechanism that encodes S-Ancestorship.

Suppose a node x is created for element di during the insertion of the sequence d1, … , di, … , dk.

VIST: The Virtual Suffix Tree

Suppose we can estimate:

i. how many different elements may follow di in future insertions

ii. the occurrence probability of each of these elements

Then we can label x’s child nodes right away instead of waiting until all sequences are inserted.

VIST: The Virtual Suffix Tree (Cont’d)

It also means:

The suffix tree itself is no longer needed, since its only purpose was labeling and labels can now be assigned without building it.

The index supports dynamic data insertion and deletion.


Top down scope allocation:

A tree structure defines nested scopes: the scope of a child node is a subscope of its parent node, and the root node has the max scope which covers the scope of each node.

Top down scope allocation:

Dynamic scope allocation has a parameter λ, the expected number of child nodes of any node.

λ is usually assumed to be 2.

Without knowledge of the occurrence rate of each child node, 1/λ of the remaining scope is allocated to x’s 1st inserted child, 1/λ of what then remains to the 2nd, and so on.

For a node x labeled <n,size> and λ = 2:

Child1: <n+1, size/2>
Child2: <n+1+size/2, size/4>

Dynamic scope of a Suffix Tree Node:

The dynamic scope of a node is a triple <n,size,k>, where k is the number of subscopes already allocated inside the current scope.
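The allocation rule can be checked with a little arithmetic. This sketch assumes that a scope <n,size> is read as the half-open range [n, n + size), so that sibling scopes do not overlap, and that sizes are rounded down to integers; both are assumptions of this illustration, as is the function name `child_scopes`.

```python
def child_scopes(n, size, lam=2, count=3):
    """Return the first `count` child scopes carved out of the scope <n, size>."""
    scopes = []
    start = n + 1                 # n itself identifies the parent node
    remaining = size
    for _ in range(count):        # after the loop, k == count subscopes are in use
        child_size = remaining // lam   # 1/lam of what is still unallocated
        scopes.append((start, child_size))
        start += child_size
        remaining -= child_size
    return scopes


print(child_scopes(0, 20480))
# -> [(1, 10240), (10241, 5120), (15361, 2560)]
```

Each child receives a geometrically shrinking share, so a parent never needs to know in advance how many children it will eventually have.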

Algorithm of VIST:

VIST uses the same sequence matching algorithm as RIST.

The difference is a dynamic method that labels suffix tree nodes without building the suffix tree.

Algorithm of VIST:

The method relies on rough estimates of the number of attribute values.

Because labels are assigned without materializing the tree, the labeling mechanism is said to be based on a virtual suffix tree.

Example:

- Let us look at the index structure before and after an insertion.

Algorithm of VIST:

Suppose that before the insertion the index structure already contains the following sequence:

Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)

The sequence to be inserted is:

Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

Assumptions of the Example:

There are 2 assumptions for the algorithm:

Max = 20480

The dynamic scope allocation method uses the parameter λ = 2


The insertion process is much like that of inserting a sequence into a suffix tree.

We follow the branches, and when there is no branch to follow we create one.
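A rough sketch of this insertion for the example with Max = 20480 and λ = 2. Several things here are assumptions of the illustration rather than details from the paper: a Python dict keyed by (parent n, element) stands in for the B+Tree lookups that find an existing branch, a scope <n,size,k> is read as the half-open range [n, n + size), the helper names `new_child` and `insert` are hypothetical, and a virtual root of twice the maximum size is used so that the first real node receives <0, Max>. The resulting numbers follow from these assumptions and may differ from the figures in the paper.

```python
LAMBDA = 2
MAX = 20480


def new_child(scope):
    """Allocate the next subscope of a mutable [n, size, k] triple."""
    n, size, k = scope
    remaining = size * (LAMBDA - 1) ** k // LAMBDA ** k   # share not yet handed out
    scope[2] = k + 1
    return [n + 1 + (size - remaining), remaining // LAMBDA, 0]


def insert(index, sequence):
    """Follow existing branches, create missing ones, return (item, n, size) labels."""
    scope = index.setdefault("root", [-1, MAX * LAMBDA, 0])
    labels = []
    for item in sequence:
        key = (scope[0], item)          # branch for `item` below the current node
        if key not in index:
            index[key] = new_child(scope)
        scope = index[key]
        labels.append((item, scope[0], scope[1]))
    return labels


index = {}
doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]
for label in insert(index, doc1):
    print(label)
for label in insert(index, doc2):
    print(label)
```

Under these assumptions Doc2 reuses the labels of (P,e) and (S,P), and its (L,PS) becomes the second child of (S,P), with a scope that starts after the one already given to (N,PS). No existing label has to change.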

CONCLUSION:

VIST, a dynamic index method, is developed for XML documents.

XML data and XML queries are converted into sequences that encode their structural information.

VIST’s Pros:

Uses tree structures as the basic unit of query to avoid expensive join operations.

Supports dynamic data insertion and deletion.

Unlike some data structures used in other approaches, the index structure of VIST is based on B+Trees, which are well supported by DBMSs.

End of Presentation

Questions?