
8/10/2019 Optimizing keyword queries in XML tree structure


Running Head: Optimizing Keyword Queries in XML Tree Structures

Optimizing Keyword Queries in XML Tree Structure

Name:

Instructor:

Institution:

Date

ABSTRACT

XML (Extensible Markup Language) remains the most popular and frequently used format for representing and exchanging data on the World Wide Web. It is applied widely because it accommodates many different data types and applications. The data may be unstructured, heterogeneous, semi-structured or structured. XML has steadily gained functionality through successive inventions and research, up to the development of data streaming applications. These developments have attracted considerable attention from experienced web users and have made the efficient processing and querying of XML streams a central concern.

This study focuses on retrieving query results through a combination of structural constraints, using keywords as the search tool; keyword search is an essential function of XML data management systems. The approach is expected to yield best-case answers as effectively and efficiently as traditional keyword search while accounting for the additional constraints that may exist. The study defines the new problem of top-k keyword search over probabilistic XML data, with the aim of retrieving the k SLCA results that have the highest probability of existence. Finally, the study reviews various other forms of keyword search and compares them through an analysis of the algorithms used.

TABLE OF CONTENTS

1.0 INTRODUCTION
    Problem Definition
    Proposal
2.0 OVERVIEW OF RELATED WORKS
    Cost-Based Query Optimization in RDBMSs
    Query Optimization Frameworks
    XML Query Optimization
    Keyword Query in Ordinary XML Documents
    Probabilistic XML
    XQuery Streaming Optimization
        Querying XML streams
        XML temporal model
    Time Intervals and Model
    Node Relationship Evaluation
    Probability
    Dominance Lowest Common Ancestor (DLCA)
        Dominance relationship
        Dominance
3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE
    Mutual Information Concepts
        Mutual information and entropy
        Mutual information
4.0 ANSWERS RETRIEVED FROM TOP-K
    Dominating Score
    Dominated Score
    Dominance Score
5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS
    Naive Algorithm for Selection of Top-K Answers
    Top-K Dominated Algorithm (TKDD)
    Top-K Dominating Algorithm
6.0 EXPERIMENTAL EVALUATION
    Experimental Setup
    Query Sets
    Search Quality
    Efficiency and Scalability of Top-K Algorithms
7.0 CONCLUSIONS
REFERENCES

LIST OF FIGURES

Figure 1: Probabilistic XML document [18]
Figure 2: An example of an XML data tree structure, T [19]
Figure 3: Data Tree T2 [19]
Figure 4: List of candidates sorted in descending order using their M-S values [24]
Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]

LIST OF TABLES

Table 1: The join probability of the two specific nodes at context node: /paper
Table 2: The join probability of the two specific nodes at context node: /proceeding
Table 3: Two dimension set candidate data
Table 4: Search spaces LA and LB derived from a list l of candidates
Table 5: Candidates list sorted in descending order of f-S values
Table 6: Dominance checks count used in calculating dominated candidate scores
Table 7: Precision and recall of queries on mondial data
Table 8: Precision and recall of queries on auction data
Table 9: Precision and recall of queries on dblp data
Table 10: Comparisons on ranking effectiveness of the algorithms



1.0 INTRODUCTION

XML (Extensible Markup Language) has over the years evolved into a de facto standard for the exchange and representation of data, which has led to a proliferation of XML documents spread all over the internet. In the past, various query languages were used to retrieve XML documents and data, including XQuery, XPath and twig pattern queries. These languages required users to be versed in the specific query language and the relevant data schemas in order to execute XML queries efficiently [5]. This restricted their use to advanced users, since the query languages and data schemas are complex concepts to understand. Searching data through XQuery/XPath was therefore a major limiting factor.

The use of keywords to search for documents has been widely accepted as a very convenient way of retrieving resources from the various remote servers that hold such data on the internet. The majority of search engines, such as Google, Bing and Yahoo, have adopted these technologies to facilitate data mining and data warehousing efficiently. The adoption of keywords for querying databases has attracted considerable research from the database and information retrieval (IR) communities [5][2][3]. Keyword search is an efficient way of retrieving documents because it does not require the user to learn any query concepts. It is an advancement over the traditional approach, which required users to know the particular IP (Internet Protocol) addresses of documents or information content and to type them into the URL bar. This later advanced so that IP addresses could be attached to web links. From this, search engines were developed with more interactive and responsive algorithms that could handle many pieces of information, including data mining [6].

A variety of approaches to keyword queries over XML data have been assessed and reviewed. The basic approaches that currently exist use lowest common ancestor (LCA) semantics, rather than general graph theory, to identify the hit list for a given keyword query. This approach generates results composed of all candidates, also known as subtrees, containing an instance of the queried keywords [5]. The LCAs returned can be numerous, yet the user may be interested in only a portion of the whole hit list. Identifying the exact data required by the user therefore remains an unsolved issue. Ideally, the system would generate exactly the piece required by the user instead of providing a whole set of hits, which leaves the user with the extra job of filtering the content until they obtain that piece [1][2][3][4][5][6]. Research to date has not succeeded in implementing exact retrieval for search queries, so it remains an unresolved and challenging issue.

Many proposals have been drafted to improve the precision of the baseline approach. Most of them apply heuristic-based rules to steer the tree search toward the best-case result in the shortest possible time. Although this approach is intuitive, it is ad hoc in the sense that most of its operations amount to data filtration: candidate data sets that meet the specified criterion are separated from those that do not match what the user seeks. The result of this filtration is what is shown as the output of the algorithm [2][3][4][5]. However, research shows that, although these algorithms are assumed to yield best-case results, they not only miss relevant results (false negatives) but also return irrelevant results (false positives).

The results returned by XML keyword queries can, however, be made more reliable if three considerations are addressed. First, the relevance of the candidate hits should be measurable. Second, there should be a mechanism that filters the relevant candidate hits further to produce a more specific list whose finer details closely match the user's search [5][3]. Third, the candidate hits should be ranked in descending order, from the most probable to the least probable among the best-case hits. This paper focuses on the optimization of keyword queries in XML tree structures so as to yield efficient results.



Smallest Lowest Common Ancestor (SLCA) is a widely accepted semantics for keyword search results on a deterministic XML tree, here named T. A node v is considered an SLCA if:

(1) v is the root of a subtree Tsub(v) that contains all the keywords; and

(2) there is no descendant node v' of v such that Tsub(v') also contains all the keywords.

For instance, consider a keyword query {k1, k2} on a p-document as shown in the figure below. The SLCAs in this case are {MUX3, IND3}. When an algorithm attempts to perform a keyword search on such a p-document, the following challenges are encountered, illustrated cumulatively by the example of a probabilistic XML document [5][2].
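The two SLCA conditions can be checked in a single post-order traversal. The following is a minimal sketch on a deterministic tree, not the algorithm used in the cited works; the `Node` class, the `find_slcas` function and the sample `tree` are hypothetical names and data introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # unique label, used to report results
    text: str = ""                  # keywords contained directly in the node
    children: list = field(default_factory=list)


def find_slcas(root, keywords):
    """Return the labels of all SLCA nodes for a keyword query."""
    query = {k.lower() for k in keywords}
    slcas = []

    def visit(node):
        # Keywords matched directly in this node.
        found = query & set(node.text.lower().split())
        child_covers = False
        for child in node.children:
            child_found, child_full = visit(child)
            found |= child_found
            child_covers = child_covers or child_full
        covers = found == query
        # Condition (1): the subtree holds every keyword.
        # Condition (2): no child subtree (hence no descendant) does.
        if covers and not child_covers:
            slcas.append(node.label)
        return found, covers

    visit(root)
    return slcas


# Hypothetical sample tree, loosely modelled on a dblp-style document.
tree = Node("dblp", children=[
    Node("proceeding", children=[
        Node("paper[1]", children=[
            Node("title[1]", "XML keyword search"),
            Node("author[1]", "Liu"),
        ]),
        Node("paper[2]", children=[
            Node("title[2]", "Query optimization"),
            Node("author[2]", "Wang"),
        ]),
    ]),
])
```

With this sample tree, a query whose keywords both occur inside one paper returns that paper, while a query spanning two papers falls back to their common `proceeding` ancestor.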

Problem Definition

Current keyword searches in XML can be divided into tree-based and graph-based searches, both of which are largely predicated on structural document features. However, these structural approaches do not comprehensively utilize the hidden semantics within XML documents, which leads to problems in processing specific classes of keyword queries. The growing popularity of XML has intensified the need for an accessible and precise XML query interface, based on natural language and on search procedures that exploit XML structure, to simplify querying of XML databases by ordinary users. Conventional methodologies, however, process queries using ad hoc and intuitive heuristics, which frequently return false positives and unranked answers.

Proposal

This paper systematically explores XML structure-based answers and user expectations in order to identify the significance of XML keyword search semantics. It further posits a semantics-based methodology for XML keyword queries, principally through data-centric coherency ranking, which is rooted in the design of the domain and database and is predicated on data dependence and mutual information models. Consequently, keyword query results are produced under schema reorganization structures, which process, present, rank and query through coherency ranking to develop answers. Actual XML data indicate that coherency ranking achieves the highest precision, recall and ranking quality compared with other approaches.

2.0 OVERVIEW OF RELATED WORKS

This section briefly reviews relevant publications on optimization frameworks, XML query optimization and relational cost-based query optimization.

Cost-Based Query Optimization in RDBMSs

A researcher proposed the first cost-based query optimizer, which formed part of System R, the prototype relational database system. The optimizer's capabilities included the optimization of simple, linear select, project, and join (SPJ) [...] introductions of Cartesian products. Graefe and DeWitt presented the EXODUS Optimizer Generator, which was not confined to a specific data model but supported the specification of algebraic transformations as rules [5][17][25]. Combined with a concrete data model, the rules serve as input to the optimizer generator, which creates a tailor-made query optimizer.

This study explores the retrieval of query results through a combination of structural constraints that use keywords as the search tool, an essential function of XML data management systems. Through the exploration of evidence-based research and the extrapolation of related works, the study is expected to yield best-case answers as effectively and efficiently as conventional keyword search while accounting for the constraints that may exist.

Query Optimization Frameworks

Lanzelotte and Valduriez contributed an extensible framework for query optimization that models the search space independently of the particular search strategy [19][21]. Using this approach, developers can build highly extensible plan enumeration frameworks. Kabra and DeWitt proposed OPT++, an object-oriented (OO) approach to highly extensible query optimization [16]. By combining extensible search components with an extensible logical and physical algebra representation, it lifts the work of Lanzelotte and Valduriez to the object-oriented level [9].

XML Query Optimization

Existing work on XML query optimization targets the strictly limited and isolated problem of optimizing path expressions using navigational access paths, and lacks support for TJ and SJ operators. Wu et al. proposed five novel dynamic programming algorithms for structural join reordering [10]. Their approach is orthogonal; for instance, it can be used to select the most efficient join order in S[...] framework for cost-based optimization and a full-fledged DBMS does not seem to provide the solution. It described a first approach to cost-based XPath optimization. In contrast to the present proposal, which supports the optimization of XQueries, it has not considered TJ operators or advanced access paths such as CAS indexes or other indexes [9].

Figure 1: Probabilistic XML document [18]

Keyword Query in Ordinary XML Documents

Keyword search is a major way of querying XML databases. Given a keyword query over an XML data source, most related work returns the lowest common ancestors (LCAs), or the smallest LCAs (SLCAs), of the matching nodes as results. Schema-Free XQuery and XRANK compute SLCAs using stack-based algorithms [12][28]. The Indexed Lookup Eager algorithm is used when the keywords appear with significantly different frequencies, while the Scan Eager algorithm takes over when the keywords have similar frequencies. Most previous work focused on inferring the definition of returned results and on differentiating results. Some researchers provided more meaningful results by using the underlying statistics of the XML data to identify the return node types [9]. Researchers also proposed a number of algorithms for cleaning keyword queries optimally. Further work resulted in the MS approach, which computes SLCAs for keyword queries in multiple ways. Other work took the Valuable LCA as the result, deliberately avoiding the false negatives and false positives of SLCA and LCA [12][25]. Indexed Stack was also proposed as an efficient algorithm for finding answers based on Exclusive LCA semantics. In addition, other related works process keyword search by integrating keywords into structured queries. XMLQL, a new query language, separates the keywords from the structure of the query. A method was also introduced to embed keywords into XQuery for processing keyword search [9].

Probabilistic XML

Probabilistic XML has recently been studied extensively, and most of the proposed models have been developed together with the evaluation of structured queries. Nierman et al. first introduced ProTDB, with the probabilistic types MUX (mutually exclusive) and IND (independent). Researchers have also modeled probabilistic XML as acyclic graphs, which support arbitrary distributions over sets of children. A probabilistic tree approach has been adopted for data integration, in which possibility and probability nodes are similar to IND and MUX respectively [9][21][29].

A p-document, a probabilistic document written in XML, specifies a probability distribution over a space of deterministic XML documents. Each deterministic document in this space is referred to as a possible world. A p-document is represented as a labeled tree that has ordinary and distributional nodes [11][14]. Ordinary nodes are regular XML nodes and may appear in the deterministic documents, whereas distributional nodes are used only to define the probabilistic process that generates the deterministic documents and do not appear in those documents [5][12][24]. This study adopts PrXML{ind, mux} as the probabilistic XML model, in which two types of distributional nodes appear in a p-document: MUX and IND [10][13].

Example 1: Consider Figure 1(a), which shows the p-document T. Ordinary nodes are shown by their tag names, for instance C1, C2, B1 and B2. Among the distributional nodes, MUX nodes are drawn as rectangles with rounded corners and IND nodes as circles. The node IND2 has two children, B2 and C1, with existence probabilities 0.6 and 0.5 respectively [5]. The probability that neither B2 nor C1 appears is therefore

\[ (1 - 0.6) \times (1 - 0.5) = 0.2. \]

The node MUX2 has three children, IND3, E2 and D1, with existence probabilities 0.5, 0.3 and 0.1 respectively. The probability that none of them appears is therefore

\[ 1 - 0.5 - 0.3 - 0.1 = 0.1. \]

Given a p-document tree T, all possible deterministic documents can be generated by traversing T in a top-down manner. Two situations arise, and each is dealt with independently:

(1) For an IND node with m child nodes, 2^m copies of T are generated and the IND node is deleted. In each copy, the m child nodes are replaced with one distinct subset of them, and each child node in the subset is connected to the ordinary parent of the IND node. The probability of each copy is the product of the existence probabilities of the child nodes in the subset and the absence probabilities (1 minus the existence probability) of the child nodes that are not in the subset [5].

(2) For a MUX node with m child nodes, m + 1 copies of T are generated and the MUX node is deleted. In each copy, the m child nodes are replaced with either no child or one distinct child node, and the chosen child node is connected to the ordinary parent of the MUX node. The probability of each copy is the existence probability of the chosen child node or, when no child appears, the absence probability.

For every generated copy of T, the top-down traversal continues until all distributional nodes have been deleted [5].
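The two expansion rules can be sketched as a recursive enumeration of possible worlds. This is a minimal illustration under stated assumptions, not the procedure of the cited work: the tuple encoding of nodes and the helper names (`ordinary`, `ind`, `mux`, `possible_worlds`) are hypothetical, the parent label `A` is invented, and the leaf `X` merely stands in for the IND3 subtree, whose children are not given in the text.

```python
from itertools import product


def ordinary(label, *children):
    return ("ord", label, list(children))


def ind(*weighted_children):      # each item: (existence probability, child)
    return ("ind", None, list(weighted_children))


def mux(*weighted_children):      # each item: (existence probability, child)
    return ("mux", None, list(weighted_children))


def _combine(options_per_child):
    """Pick one alternative per child; multiply the probabilities."""
    results = []
    for combo in product(*options_per_child):
        forest, prob = [], 1.0
        for sub_forest, p in combo:
            forest.extend(sub_forest)
            prob *= p
        results.append((forest, prob))
    return results


def _worlds(node):
    """Return (forest, probability) alternatives to attach to the parent."""
    kind, label, kids = node
    if kind == "ord":
        per_child = [_worlds(k) for k in kids]
        return [([(label, tuple(f))], p) for f, p in _combine(per_child)]
    if kind == "ind":
        # Rule (1): every child is independently present or absent.
        per_child = []
        for prob, child in kids:
            present = [(f, prob * p) for f, p in _worlds(child)]
            per_child.append(present + [([], 1.0 - prob)])
        return _combine(per_child)
    # Rule (2): at most one child of a MUX node is present.
    options = [(f, prob * p) for prob, child in kids for f, p in _worlds(child)]
    none_prob = 1.0 - sum(prob for prob, _ in kids)
    if none_prob > 1e-12:
        options.append(([], none_prob))
    return options


def possible_worlds(root):
    """Map each deterministic document to its probability."""
    merged = {}
    for forest, p in _worlds(root):
        key = tuple(forest)
        merged[key] = merged.get(key, 0.0) + p   # merge identical worlds
    return merged


# IND2 from Example 1: children B2 (0.6) and C1 (0.5) under a parent "A".
doc_ind = ordinary("A", ind((0.6, ordinary("B2")), (0.5, ordinary("C1"))))
# MUX2 from Example 1, with the leaf "X" standing in for the IND3 subtree.
doc_mux = ordinary("A", mux((0.5, ordinary("X")),
                            (0.3, ordinary("E2")),
                            (0.1, ordinary("D1"))))
```

For `doc_ind` the enumeration yields 2^2 = 4 worlds and for `doc_mux` it yields 3 + 1 = 4 worlds; in both cases the probabilities sum to 1, and the world in which no child appears has the absence probability computed in Example 1.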

Various researchers have proposed a fuzzy tree model in which nodes are associated with conjunctions of probabilistic event variables. A full complexity analysis of queries and updates on fuzzy trees is also given in that research. They also proposed efficient algorithms for constraint satisfaction [6][18][24][29]. The sampling and query evaluation problems under a set of constraints can likewise be well defined so as to yield the expected query results efficiently. Other publications summarized and extended the previously proposed probabilistic XML models, and discussed the expressiveness of the different models and the tractability of queries on them with respect to MUX and IND [5][13][15][18][27].

Various studies have considered the problem of evaluating twig queries over probabilistic XML, which may generate partial or incomplete answers with respect to a user-specified probability threshold. Researchers have also addressed the problem of ranking the top-k answers of a twig query by probability. In summary, the work cited above focuses on structured XML queries, such as twig queries, over various probabilistic XML data models [1][7][16][23]. Our research differs in that it critically reviews and analyzes the keyword search problem over probabilistic XML data [9].

XQuery Streaming Optimization

Querying XML streams

Several streaming algorithms focus on the querying and filtering problems. Many of these algorithms center on tree-pattern queries (TPQs), which correspond to XPath queries that mainly involve the child and descendant axes [14][26]. TPQ streaming algorithms can be extended to process XPath queries with ordered axes (preceding, preceding-sibling, following and following-sibling) [10], and processing techniques have therefore been introduced for ordered axes. Streaming algorithms broadly fall into three categories: the array-based approach, the automaton-based approach and the stack-based approach.

XML temporal model

Previous studies on time-based XML models have identified several benefits and disadvantages. The bitemporal approach includes both valid time and transaction time in timestamp attributes [22][26]. Normative texts always comprise four time intervals. In an XML database, normative texts with temporal values introduce new interval attributes, for instance efficacy time and publication time. This approach to XML tree partitioning distributes the data into partitions of equal size, taking into account both the query processing load and the data storage cost [10].

Time Intervals and Model

The four intervals are publication time, efficacy time, transaction time and valid time. Transaction time is the time at which a transaction is reflected in the database; it is an important factor for all transactions in time-referenced databases. Valid time is the interval during which the data is valid for general use; outside it, the data may be invalid and unusable [8][19][20]. Efficacy time is when the data is used under particular conditions or in specialized cases only. Publication time is the time at which the data is published. Reason nodes hold sensitive data and may find utility in decision support systems [10].
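As a rough illustration of the four intervals, a temporal node could carry one interval per time dimension. The sketch below is an assumption made for illustration only: the class names, the closed-interval convention and the `visible_at` check are not taken from the cited temporal models.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Interval:
    start: date
    end: date

    def contains(self, day):
        # Closed interval: both end points are included (an assumption).
        return self.start <= day <= self.end


@dataclass(frozen=True)
class TemporalNode:
    label: str
    valid: Interval          # when the data is valid for general use
    transaction: Interval    # when the data is reflected in the database
    efficacy: Interval       # when the data applies in specialized cases
    publication: Interval    # when the data is published


def visible_at(node, day, recorded_on):
    """Bitemporal check: valid on `day` and stored as of `recorded_on`."""
    return node.valid.contains(day) and node.transaction.contains(recorded_on)


# Hypothetical node with identical valid, efficacy and publication intervals.
year_2019 = Interval(date(2019, 1, 1), date(2019, 12, 31))
article = TemporalNode(
    label="article",
    valid=year_2019,
    transaction=Interval(date(2019, 2, 1), date(9999, 12, 31)),
    efficacy=year_2019,
    publication=year_2019,
)
```

Keeping the four intervals as separate fields lets a query filter on any one time dimension without touching the others.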

Node Relationship Evaluation

This research deviates from other research, which evaluates the relationship between multiple nodes using intuitive, heuristics-based rules. Instead, it measures the relationship between nodes in a data tree by adapting the concept of mutual information, which is applied in various data mining processes [5][6][11] to capture the correlation between attributes of database relations. Because XML data trees commonly contain nodes that have the same label but occur in different contexts, prefix label paths are used to denote node types. A prefix label path is the sequence of element names along the path from the root node to the node in question. Node types are used to identify the specific nodes found in the data tree [16][22].
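Prefix label paths can be computed with one walk over the tree, grouping every element under the path from the root to it. The sketch below uses Python's standard `xml.etree.ElementTree`; the `SAMPLE` document and the function name `group_by_prefix_path` are hypothetical and serve only to illustrate the idea.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document with two papers in one proceeding.
SAMPLE = (
    "<dblp><proceeding>"
    "<paper><title>XML keyword search</title><author>Liu</author></paper>"
    "<paper><title>Query optimization</title><author>Wang</author></paper>"
    "</proceeding></dblp>"
)


def group_by_prefix_path(xml_text):
    """Map each prefix label path (node type) to the values of its instances."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}" if parent_path else elem.tag
        # The value of an instance is the text contained directly in it.
        groups[path].append((elem.text or "").strip())
        for child in elem:
            walk(child, path)

    walk(root, "")
    return dict(groups)


node_types = group_by_prefix_path(SAMPLE)
```

Here both `author` elements share the node type `dblp/proceeding/paper/author` even though they belong to different papers.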

Figure 2: An example of an XML data tree structure, T [19]

For instance, the prefix label path of node 4 in the data tree is dblp/proceeding/paper/author. A prefix label path can occur many times in an XML data tree, and these occurrences are referred to as node instances [2][7][21][25]. All instances of a node type therefore share the same prefix label path. Every instance has a value, which is the set of keywords contained directly in that instance. For instance, in the tree in Figure 2, the prefix label path dblp/proceeding/paper/author has instances 28, 20, 15, 11 and 4, which have the values Richards, Wang, Zhang, Liu and


TABLE 1

This indicates that the mutual information between two nodes varies with the context. The mutual information between the author(s) and the title in the context of a paper is higher than the mutual information between the author(s) and title of two different papers, that is, in the context of a proceeding. We can therefore conclude that mutual information (MI) is a good measure of how closely nodes are related to each other. However, the scale of MI values has no fixed range, as shown by property 5 [25], which states that MI is bounded by the minimum of the entropies of the two nodes. To apply the concept properly, we require a unified scale for measuring MI across all node sets [4][18].

Node Relationship

Considering two nodes u and v which are joined at the context node c, the relationship of the two nodes is defined as:

$$rel(u; v \mid c) = \frac{I(u; v \mid c)}{\min\{H(u), H(v)\}} \qquad (1)$$

In this case, $H(u)$ and $H(v)$ are the entropies of nodes u and v respectively, and the values are calculated in the same way as the entropies of random variables [16].

When the value of $rel(u; v \mid c)$ is high, the relationship between nodes u and v is strong at the context node c. For instance, the entropy of nodes dblp/proceeding/paper/title and dblp/proceeding/paper/author can be obtained by:

$$H(\text{dblp/proceeding/paper/title}) = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}\right] = -\log\tfrac{1}{5} = \log 5 = 0.70$$

$$H(\text{dblp/proceeding/paper/author}) = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}\right] = -\log\tfrac{1}{5} = \log 5 = 0.70$$

This therefore implies that the relationship between dblp/proceeding/paper/title and dblp/proceeding/paper/author meeting at the context node dblp/proceeding can be derived as

$$rel(\text{dblp/proceeding/paper/author};\ \text{dblp/proceeding/paper/title} \mid \text{dblp/proceeding}) = \frac{0.49}{0.70} = 0.7$$

Similarly, at the context node dblp/proceeding/paper:

$$rel(\text{dblp/proceeding/paper/title};\ \text{dblp/proceeding/paper/author} \mid \text{dblp/proceeding/paper}) = \frac{0.70}{0.70} = 1.0$$

This shows that the relationship between any two given nodes u and v at any context node c, in any XML data tree, must fall in the range [0, 1]:

$$0 \le rel(u; v \mid c) \le 1$$

This can be proved by examining Properties 3 and 5 of mutual information. Property 3 states that $I(u; v \mid c) \ge 0$, which implies that $rel(u; v \mid c) \ge 0$. Property 5 states that $I(x; y \mid c) \le H(y)$ and $I(x; y \mid c) \le H(x)$, which gives $rel(u; v \mid c) \le 1$.
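The normalisation described above can be illustrated with a minimal sketch. It assumes that the mutual information at a context node is estimated from the (u value, v value) pairs that co-occur under the instances of that context, and it uses base-10 logarithms to match the worked entropy of 0.70. The function names and the five-paper data set are hypothetical and are not taken from the original study.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base-10 log) of a list of observed values."""
    n = len(values)
    return -sum((c / n) * math.log10(c / n) for c in Counter(values).values())

def mutual_information(pairs):
    """I(u; v) estimated from (u_value, v_value) pairs seen under one context."""
    n = len(pairs)
    joint = Counter(pairs)
    pu = Counter(u for u, _ in pairs)
    pv = Counter(v for _, v in pairs)
    return sum(
        (c / n) * math.log10((c / n) / ((pu[u] / n) * (pv[v] / n)))
        for (u, v), c in joint.items()
    )

def rel(pairs):
    """MI divided by the smaller entropy; undefined if an entropy is zero."""
    h_u = entropy([u for u, _ in pairs])
    h_v = entropy([v for _, v in pairs])
    return mutual_information(pairs) / min(h_u, h_v)

# Hypothetical data: five papers, each with one distinct title and author.
paper_context = [(f"title{i}", f"author{i}") for i in range(5)]
print(round(entropy([t for t, _ in paper_context]), 2))  # 0.7
print(round(rel(paper_context), 2))                      # 1.0
```

With five equally likely distinct values the entropy is log 5, and because every title determines its author the normalised relationship reaches the upper end of the [0, 1] range.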

Dominance Lowest Common Ancestor (DLCA)

In order to retrieve a particular answer from a large mass of LCA-based candidates, this research proposes the use of a new semantics referred to as Dominance LCA. We begin by introducing the relationship between LCA-based candidates.

Dominance relationship

Query candidates are represented by their root nodes and can be depicted as sub-trees. For instance, given a keyword search query $Q = \{k_1, \ldots, k_q\}$, a specific candidate S of the query Q is represented in the form $S(n_{lca}, \{n_1; \ldots; n_q\})$, where $n_i$ refers to a leaf node that contains $k_i$ and $n_{lca}$ is the lowest common ancestor of the nodes $\{n_1; \ldots; n_q\}$. Each node in the candidate is identified by an identifier, which according to this research is encoded as a

Dewey code.

The foundations of the Dewey code are derived from the Dewey Decimal Classification, which was developed for the classification of general knowledge [1][23]. With Dewey coding, each node is assigned a vector which represents the path to the node from the tree root. Each component of the vector represents the local order of an ancestor node along the path. This is illustrated in Figure 3:

Figure 3: Data Tree T2 [19]

The researcher chose to encode the node identifiers with Dewey codes since they are very useful in representing the hierarchical relationships that exist between the nodes of a tree. The label path of a specific node can also be found from its Dewey code. For instance, in the sample data tree T2 in Figure 3, every node is identified by a Dewey code. For the node identifier [0.1.0.0], the corresponding label path is n1/n3/n4/n5. We therefore define a function ID2L(id), which takes a Dewey code id as input and returns the corresponding label path [13][24]. The keywords of a query may well occur many times in a specific sub-tree candidate $S(n_{lca}, \{n_1; \ldots; n_q\})$. Every keyword $k_i$ $(1 \le i \le q)$ therefore yields a set $L_i = \{n_i \mid val(n_i) \text{ contains the keyword } k_i\}$.

The relationship between the various keywords that are produced in the

specific search tree is given as:

$$rel(k_i, k_j) = \max_{n_i \in L_i,\ n_j \in L_j} rel\big(ID2L(n_i);\ ID2L(n_j) \mid ID2L(lca(n_i, n_j))\big) \qquad (2)$$

Here, $ID2L(n_i)$ is the function which returns the node type corresponding to the Dewey code $n_i$, and $rel(ID2L(n_i); ID2L(n_j) \mid ID2L(lca(n_i, n_j)))$ is calculated by formula (1). It measures the correlation between the node types of the nodes with Dewey codes $n_i$ and $n_j$ at their lowest common ancestor [18][25]. The relationship between the keywords $k_i$ and $k_j$ contained in a candidate is thus taken as the maximum relationship between any two nodes that contain the two keywords in that candidate [13].

For instance, taking a query $Q = \{k_1, k_2, k_3\}$ on the data tree T2, only one sub-tree candidate is present, rooted at node n3 [0.1] in T2, and it can be represented in the form S(0.1, {0.1.0.0; 0.1.1.0; 0.1.1.1}). The relationships between the query keywords in the candidate S can be calculated as follows:

rel(k2, k3) = rel(ID2L(0.1.1.0); ID2L(0.1.1.1) | ID2L(0.1.1))
            = rel(n1/n3/n6/n7; n1/n3/n6/n8 | n1/n3/n6)

rel(k1, k3) = rel(ID2L(0.1.0.0); ID2L(0.1.1.1) | ID2L(0.1))
            = rel(n1/n3/n4/n5; n1/n3/n6/n8 | n1/n3)

rel(k1, k2) = rel(ID2L(0.1.0.0); ID2L(0.1.1.0) | ID2L(0.1))
            = rel(n1/n3/n4/n5; n1/n3/n6/n7 | n1/n3)
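The Dewey code manipulation used in this example can be sketched as follows. The sketch assumes that the lowest common ancestor of several Dewey codes is their longest common prefix of components, and that the label path is obtained by looking up the element name of every prefix. The helper names lca and id2l and the NAMES dictionary are illustrative; only the seven nodes of T2 that the example mentions are included.

```python
def lca(*codes):
    """Lowest common ancestor of Dewey codes: their longest common prefix."""
    parts = [c.split(".") for c in codes]
    prefix = []
    for comps in zip(*parts):
        if len(set(comps)) != 1:
            break
        prefix.append(comps[0])
    return ".".join(prefix)

# Element names keyed by Dewey code, as far as the worked example reveals them.
NAMES = {
    "0": "n1", "0.1": "n3", "0.1.0": "n4", "0.1.0.0": "n5",
    "0.1.1": "n6", "0.1.1.0": "n7", "0.1.1.1": "n8",
}

def id2l(code):
    """Label path of a Dewey code: names of all nodes from the root to it."""
    comps = code.split(".")
    return "/".join(NAMES[".".join(comps[:i])] for i in range(1, len(comps) + 1))

print(lca("0.1.1.0", "0.1.1.1"))          # 0.1.1
print(id2l("0.1.0.0"))                    # n1/n3/n4/n5
print(id2l(lca("0.1.0.0", "0.1.1.1")))    # n1/n3
```

Because a Dewey code already contains the codes of all its ancestors, neither operation needs to touch the tree itself, which is one reason the encoding suits LCA-based processing.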

Provided the keyword query $Q = \{k_1, \ldots, k_q\}$, the relationship of each pair of query keywords in the sub-tree candidate S is stored in the vector $D_S$. The keyword relationship vector in this research is defined by:

$$D_S = [\, rel(k_i, k_j) \mid k_i, k_j \in Q \wedge (i < j) \,]$$

Since there are $C_q^2$ combinations of two keywords from a set of q keywords $\{k_1, \ldots, k_q\}$, the vector $D_S$ contains $C_q^2$ elements. This is denoted as $|D_S| = C_q^2$. For instance, the keyword relationship vector of the candidate S(0.1, {0.1.0.0, 0.1.1.0, 0.1.1.1}) of query $Q = \{k_1, k_2, k_3\}$ consists of $C_3^2 = \frac{3!}{2!\,(3-2)!} = 3$ elements:

$$D_S = [\, rel(k_1, k_2),\ rel(k_1, k_3),\ rel(k_2, k_3) \,]$$

Let $D_S$ and $D_{S'}$ be the keyword relationship vectors of the candidates S and S'. The dominance relationship between the candidates S and S' can then be defined [12].


22/45


Dominance

Letting ) and ) to become the two candidates of the XML search query 2

oer a speci$c named and gien database "# ) dominates ). "his is

represented as ) _ ) and this condition will only hold if the following aremetD

(8 Z Z d)D5F 7 D5 F and i(8 Z i Z d)D5iF Z D5 iF

In this scenario d refers to the keyword length relationship ector of ) and

) which is (d R D5 R D5 R8A4). /siF is the element in the ithector /s*andidate ) dominates ) in the relationship ;F88FA;F.
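The dominance test can be sketched in a few lines. The sketch assumes that larger relationship values are better, so that a dominating candidate is at least as good in every component and strictly better in at least one. The function names keyword_pairs and dominates are illustrative.

```python
from itertools import combinations

def keyword_pairs(q):
    """Index pairs (i, j) with i < j, in the order used by the vector D_S."""
    return list(combinations(range(1, q + 1), 2))

def dominates(d_s, d_t):
    """True if d_s dominates d_t: never smaller, strictly larger somewhere."""
    if len(d_s) != len(d_t):
        raise ValueError("vectors must have the same length")
    never_smaller = all(a >= b for a, b in zip(d_s, d_t))
    somewhere_larger = any(a > b for a, b in zip(d_s, d_t))
    return never_smaller and somewhere_larger

print(keyword_pairs(3))                      # [(1, 2), (1, 3), (2, 3)]
print(dominates([0.95, 0.9], [0.8, 0.8]))    # True
print(dominates([0.95, 0.9], [0.1, 0.95]))   # False
print(dominates([0.5, 0.4], [0.5, 0.4]))     # False
```

Two identical vectors do not dominate each other, and two vectors that are each better in a different component are incomparable, which is why the ranking scores count dominance relations instead of sorting on a single value.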

3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE

This section reviews the concept of mutual information (MI) alongside various related concepts. The concept is discussed in depth, with emphasis on its adaptation to measuring the meaningful relationship between the nodes of an XML data tree.


23/45


Mutual Information *oncepts

Mutual information and entropy

"hese are ery central and fundamental concepts that do exist in the $eld of

the information theory. Entropy therefore refers to the measure of

uncertainty of a particular random ariable. MI quanti$es the existing mutualdependence of two particular random ariables :F5F86FA0F.

EntropyD "aking a discrete random ariable x which takes the alue '-

extracted from the set dom (-) which is generali%ed and goerned by a

probability distribution function of aluep ('-)."he de$nition of entropy of-

is de$ned as followsD

*onditional Entropy of a particular random ariable sayyproided a second

ariable-"which is referred to as entropyyconditional-# which usually takes

the general form of 0 (y,-) has the following de$nitionD

In this particular type of equation#p ('y"'-) refers to Goint probability of (y='y)

and (-='-)9 whereas p('-"'y)gien (-='-)is the conditional probability of the

equation (y='y) 89F85F.

Mutual informationIn reference to two random ariables it can be referred to as a quantity that

measures mutual independence between two ariables >F. In a gien case

scenario# discrete ariables x and y which are random# the de$nition of their

mutual information can be de$ned asD

In this particular scenario#p ('-"'y)refers to the Goint probability of the

de$ned ariables (- = '-) and (y= 'y). In this particular scenario# p?x@ and

p?y@ are the probabilities of (-='-)"(y='y) respectiely.

"here are arious properties that characteri%e Mutual Information# and some

of the existing properties are detailed as follows:

Property 1:

$I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x)$

This property gives the interpretation of mutual information. It indicates that the information provided by y about x is the reduction in the uncertainty of x once y is known. Similarly, the same holds for the information provided by x about the random variable y. The larger the mutual information, the more information the variables x and y reveal about each other [16][20][24][29].

Property 2:

$I(x; y) = I(y; x)$

Mutual information is symmetric, meaning that the information provided by x about y is the same as the information y conveys about x [16][20][24].

Property 3:

$I(x; y) \ge 0$

This gives the lower bound of the mutual information. When $I(x; y) = 0$, we have $p(v_x, v_y) = p(v_x)\, p(v_y)$ for all possible values of x and y. This means that the variables x and y are independent, so knowing the value of x provides no clue about the probable or exact value of y. Their mutual information is therefore zero [5][16][20][29].

Property 4:

$I(x; x) = H(x)$

The mutual information of a variable x with itself is the entropy of x. For this reason entropy is also referred to as self-information [16][24].

Property 5:

$I(x; y) \le H(x)$ and $I(x; y) \le H(y)$

The mutual information between two variables is bounded by the minimum of their entropies [16][20][24].
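The definitions and properties above can be checked numerically on a small example. The sketch below assumes a joint distribution given as a dictionary that maps (x value, y value) pairs to probabilities, and uses base-10 logarithms, although the properties hold for any base. The joint distribution and all function names are hypothetical.

```python
import math

def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis` of the key."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(dist):
    return -sum(p * math.log10(p) for p in dist.values() if p > 0)

def conditional_entropy(joint, given_axis):
    """Entropy of the other variable given the variable at `given_axis`."""
    pg = marginal(joint, given_axis)
    return -sum(
        p * math.log10(p / pg[key[given_axis]])
        for key, p in joint.items() if p > 0
    )

def mutual_information(joint):
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(
        p * math.log10(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0
    )

# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
h_x, h_y = entropy(marginal(joint, 0)), entropy(marginal(joint, 1))
i_xy = mutual_information(joint)

# Property 1: I(x; y) = H(x) - H(x | y)
print(math.isclose(i_xy, h_x - conditional_entropy(joint, 1)))  # True
# Properties 3 and 5: 0 <= I(x; y) <= min(H(x), H(y))
print(0 <= i_xy <= min(h_x, h_y))                               # True
```

Dividing i_xy by min(h_x, h_y) gives a value between 0 and 1, which is the unified scale that the node relationship measure relies on.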


25/45

Optimizing Keyword Queries in XML Tree Structures 2"

/&' ANS+#!S !#T!I#*#D $!OM TO.0K

"he researcher obseres that /L*+ answers alter with di!erent search

queries. *onducting a data and information search# users usually are

interested in top(3 answers. "hey are sorted in descending order using theirrespectie releance degrees to the need of users information. "his section

de$nes three ranking functions used for identi$cation of the top(3 results for

a keyword(based sequential search through XML data. "he particular ranking

functions used in this study exploit di!erent and seeral aspects of

dominance relationships existing between query candidates for ranking their

releance degree to the speci$c search query8FA9FA5F.

roided 8 (Q" 1) as a set of candidates of a speci$c query Q in an existing

XML database 1# the degree of releance of a candidate based is measured

on the following three ranking scoresD

Dominating Score

Given a candidate answer S, the dominating score of S is defined as follows.

$$score_{dg}(S) = |\{S' \in C(Q, T) \mid S \succ S'\}| \qquad (3)$$

The dominating score of a candidate, $score_{dg}(S)$, is the number of candidates which S dominates. A candidate is more relevant if it dominates as many other candidates as possible. Therefore, a higher dominating score denotes that S is more significant to the query [7][12].

Example of an Instance 1:

Let $S \in C(Q, T)$ and $S' \in C(Q, T)$ be two candidates of a query Q in an XML data tree T. If $S \succ S'$, then $score_{dg}(S) \ge score_{dg}(S')$.

This can be proved using the transitive property of the dominance relationship. For any two candidates $S \in C(Q, T)$ and $S' \in C(Q, T)$ of a query Q, if $S \succ S'$, then for every $S_i \in C(Q, T)$ with $S' \succ S_i$ we also have $S \succ S_i$. Finally, $|\{S_i \in C(Q, T) \mid S \succ S_i\}| \ge |\{S_i \in C(Q, T) \mid S' \succ S_i\}|$, that is,

$score_{dg}(S) \ge score_{dg}(S')$


26/45

Optimizing Keyword Queries in XML Tree Structures 2#

"his particular example gies an assurance that candidate 5 dominates

candidate 5# which then means that 5 is ranked higher as compared to 5 in

the top 3 results that hae been returned ;F88FA;F.

/ominated )core

roided a candidate answer )# the dominated score of ) is de$ned as

followsD

scoredd(5) R O5\8 (Q" 1)5_5P

?;@

"he dominated score of the speci$ed candidate 5# scoredd (5)# shows the

number of other di!erent candidates which can dominate 5. "herefore# the

lower the dominated score# the more meaningful to the query for candidate

5 A9FA;F. "his therefore implies that candidate 5 is more releant whendominated by fewer candidates as possible.

/ominance )core

Example of Instance :

Letting5 \8 (Q" 1) and5 \8 (Q" 1) be two respectie candidates of a

speci$ed query 2 in a stated XML data tree ".

/f 5 _5" then scoredd(5) Z scoredd(5).

"his example can be proed in a similar manner like the preious examples.,or any existing two candidates 5 \8 (Q" 1) and 5 \8 (Q" 1)# if 5 _5 then

5i \8 (Q" 1)5i _5# we hae 5i 5 8>FA9FA;F. "herefore# O5i \8 (Q" 1)5i

5P Z O5i \8 (Q" 1)5i _5P# or scoredd(5) Z scoredd(5)


27/45

Optimizing Keyword Queries in XML Tree Structures 2$

1&' AL2O!IT-MS (S#D TO !#T!I#*# TO.0K!#S(LTS

"his particular section is meant as an introduction to algorithms which

identify releant search results and the top(3 answers# normally based on

arious skyline semantics in accordance to the aforementioned criteria ofranking. In order to obtain the speci$ed set of L*+(based candidates of a

particular gien keyword query# gien other signi$cant approaches in the

literature# the research adopts the inerted indexes 80F. "hese particular

indexes are built oine during a time it parsed the XML database tree

structure. )peci$cally# letting Q R O38" . . . " 34P be parsed a gien keyword

query and /Li be the inerted list consisting of keyword 3i. Eery entry

contained in the inerted list /Li is the /ewey code of a particular node

containing the keyword 3i. "he candidate set 8 of query Q is de$ned as

8 R Olca(n8" n4)n8 \/L8" . . . " n4 \/L4P#

Cien that lca(n8" . . . " n4) is an operation that gies the lo6est common

ancestor of On8" . . . " n4P# the keyword relationship ector of eery

candidate is concurrently fed as input during the candidate generation

process. "he generated candidates are stored in a speci$ed list ordered by

the alues of their releant keyword relationship ectors ;F80FA-F. "he

detailed explanations will be in the following subsections.


28/45

Optimizing Keyword Queries in XML Tree Structures 2%

=aBe +lgorithm for )election of "op(K +nswers

A6gorit8m %:=aBe +lgorithm 80F

"he naBe algorithm used for identi$cation of the top(3 results that are

desired corresponding to their respectie dominated scores ?similarly#

dominance and dominating scores@ is illustrated in the +lgorithm 8. "his

speci$c algorithm iterates through eery candidate in the speci$ed

candidate set and facilitates the calculation of its score by performing pair

wise dominance checks between these candidates and all other candidates

de$ned in the set ?lines @ 80FA8FA:F. "he resultant set is then updated

depending on the result obtained on the score compared between the

current 3Yth candidate and the new candidate in the current top(3 results

?lines ?*@@ 80FA8FA;FA5F.

"he maGor drawback of this particular algorithm is that its speci$ed

computational cost is ery high because regardless of the alue of 3# there is

need to iterate through each component candidate found in the candidate

set and calculate the score deried by each candidate by performing thespeci$ed pair wise dominance checks that occur between the candidate with

all other present candidates in the existing set 8>FA9FA;F. "his therefore

means that no matter what the deried alue of 3 is# the algorithm

exhaustiely performs and conducts all pair wise dominance tests across all

candidates.
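The exhaustive behaviour described above can be illustrated with a minimal sketch. It is not a transcription of Algorithm 1, whose listing is not reproduced here; it only ranks the ten two-dimensional candidates of Table 3 by dominated score and counts the pairwise checks performed. It assumes that larger vector values are better, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is never smaller than b and strictly larger somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def naive_top_k(candidates, k):
    """Return (top-k names, dominated scores, number of dominance checks)."""
    checks = 0
    scores = {}
    for name, vec in candidates.items():
        dominated = 0
        for other, other_vec in candidates.items():
            if other == name:
                continue
            checks += 1
            if dominates(other_vec, vec):
                dominated += 1
        scores[name] = dominated
    # Fewer dominators means more relevant; ties keep their input order.
    ranked = sorted(candidates, key=lambda n: scores[n])
    return ranked[:k], scores, checks

TABLE_3 = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}

top3, scores, checks = naive_top_k(TABLE_3, 3)
print(checks)                      # 90
print(scores["S1"], scores["S3"])  # 0 0
print(top3)
```

S1 and S3 are dominated by no other candidate, while S5 and S6 are each dominated only by S1, so the third position depends on how ties are broken. The check count does not depend on k, which is the drawback noted in the text.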


29/45

Optimizing Keyword Queries in XML Tree Structures 2&

TAL# ,:"W3 /IME=)I3= )E" *+=/I/+"E /+"+ A;F

D% D)

S% 9.6- 9.6

S) 9.8- 9.-

S, 9.8 9.6-

S/ 9.- 9.;

S1 9.5 9.5

S3 9.6 9.;

S4 9.; 9.;

S9 9.: 9.A

S 9.0 9.>

S%' 9.: 9.:

For instance, given the set of candidates in Table 3, identifying the top-3 results requires calculating the score of each candidate $S_i$ $(1 \le i \le 10)$ by iterating over the 9 other candidates and conducting a pairwise dominance check with each [24][28]. It therefore takes 10 × 9 = 90 pairwise dominance checks. Generally, calculating the score of a candidate in a set of n candidates requires pairwise dominance checks between that candidate and the (n − 1) other candidates in the set.

Top-K Dominated Algorithm (TKDD)

The chief aim of the TKDD algorithm is to find efficiently, for each candidate, the number of other candidates which dominate it, while avoiding exhaustive pairwise comparisons between the candidates [2][8][24]. After k results have been retrieved, the score of the k-th result is used as a maximum threshold, and candidates whose dominated scores exceed the threshold are pruned [24]. In addition, the algorithm can terminate safely once the scores of all the remaining candidates exceed the threshold. More specifically, TKDD proceeds through the following four steps:


30/45

Optimizing Keyword Queries in XML Tree Structures '

?i@ /nitialiAation

(line *)D the result set and min'alue are initiali%edS

?ii@ 1ermination condition

TAL# /: )E+4* )+*E) L+ +=/ L1 K)E/ "3 *+L*KL+"E scoredd(5i)+=/

scoredg(5i)4E)E*"I'ELJ /E4I'E/ ,43M + LI)" L3, *+=/I/+"E) WI*

+4E )34"E/ I= /E)*E=/I=C 34/E4 3, )E*I,I* B()'+LKE) A;F

TAL# 1: *+=/I/+"E) LI)" )34"E/ I= /E)*E=/I=C 34/E4 3, B()'+LKE)

A;F

TAL# 3: /3MI=+=*E *E*7) *3K=" K)E/ ,34 *+L*KL+"I3= 3, "E/3MI=+"E/ *+=/I/+"E )*34E) A;F

$igure /: List of candidates sorted in descending order using their M()

alues A;F


31/45


A6gorit8m ):"7// 80F

(lines C>)D roided that M?@ alue of the present candidate 5 is below the

minimum alue of the current 3(th candidate in # the algorithm terminates

and the resultant set is returnedS

?iii@ Dominance chec3s (lines ?*)D


32/45


"he pair wise dominance checks between 5E and eery other candidate 5in

the respectie search space of 5 6here the operation takes place. "he

dominated score of 5 is found to be increased by 8 eery time another

candidate dominates 80FAAF.

?i@ esult updates (lines ***?)D proided that 3 results are existent and the

dominated score of the 3(th candidate is larger than the current candidates

score# the 3(th candidate is eGected and the current candidate is put into S

otherwise if it becomes less than 3 results exist in # there is an insertion of

the current candidate into . ,inally# taking the si%e of as 3# the threshold

minFalue undergoes updating (lines *G


33/45

Optimizing Keyword Queries in XML Tree Structures

candidate found to exist in the search space proided will be

performed A;F.

*oncurrently# the dominating score of the candidate is calculatedS

?i@ 4esult set update (lines *+
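The idea of restricting the search space of each candidate by means of a sorted list can be illustrated with a simplified sketch. It is not the TKDD or TKDG algorithm of the source: the M(S) function, the threshold and the early termination are not reproduced. As an assumption made only for this illustration, candidates are sorted by the sum of their vector components, which is enough to guarantee that any dominator of a candidate has a sum at least as large. The data are the candidates of Table 3, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is never smaller than b and strictly larger somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def dominated_scores_sorted(candidates):
    """Dominated scores, checking only candidates whose sum is not smaller."""
    total = {name: sum(vec) for name, vec in candidates.items()}
    order = sorted(candidates, key=lambda n: total[n], reverse=True)
    scores, checks = {}, 0
    for name in order:
        count = 0
        for other in order:
            if total[other] < total[name]:
                break  # the rest of the list has smaller sums, cannot dominate
            if other != name:
                checks += 1
                if dominates(candidates[other], candidates[name]):
                    count += 1
        scores[name] = count
    return scores, checks

TABLE_3 = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}

scores, checks = dominated_scores_sorted(TABLE_3)
print(scores["S1"], scores["S5"], scores["S8"])  # 0 1 7
print(checks < 90)                               # True
```

The dominated scores are the same as those obtained by exhaustive comparison, but each candidate is compared only with the candidates ranked ahead of it, which roughly halves the number of checks before any threshold-based pruning is applied.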


34/45


3&' #X.#!IM#NTAL #*AL(ATION"he researcher performed and designed a couple of experiments to analy%e

the search performance of the approach. In the experiment the researcher

ealuates the outcomes and results of the arious experiments in order to

compare the e&ciency and quality of the approach that the researchers used

and other possible approaches that would hae been used:F-F>F 88F80F

A;FA5F.

Experimental Setup

The experiments were conducted on a Pentium 4, 3.2 GHz computer running Windows XP Professional with 2 GB of internal memory.

.7 MB, Mondial 1 MB and DBLP Computer Science Bibliography 877 MB. DBLP Computer Science Bibliography includes bibliographic information on major computer science proceedings and journals. Mondial, on the other hand, is a worldwide geographic database that has been integrated from the CIA World Factbook, the TERRA database and the international atlas, among many other sources. Auction is a synthetic benchmark data set that has been generated by the XML generator using the default DTD from XMark [13][25][27].


35/45

Optimizing Keyword Queries in XML Tree Structures "

2uery )ets

"he researchers asked a group of learners to submit $fty arious keyword

questions to search and ealuate on eery data set. Eery query contained a

speci$c set of search key words and also a brief description of each query

was also ery necessary in order to understand and identify the key

intension of the query:F-F>F 88F80FA;FA5F. "he researchers at the same

time obsered that searching on a speci$c domain like the three main data

sets that they were experimenting on was not e!ectie as the keyword

queries were ambiguous. "his made it had for the users to express the

search intention. /ue to this# it is sometimes di&cult to obtain the releant

results and outcomes of the queries at hand which are prerequisite for the

researchers to analy%e the performance of their approach and other

aailable approaches:F-FA;FA5F.

Search Quality

The researchers compared the quality of the DLCA approach with various existing approaches, namely ELCA, CVLCA, XReal, MLCA, SLCA and XSearch. The quality of these approaches was measured with three metrics popular in information retrieval: precision (P), recall (R) and F-measure [6][11][17][28]. To compute precision and recall, the researchers manually reformulated the keyword queries into schema-aware queries based on the schemas of the data sets and the descriptions of the keyword queries. The results of the transformed queries were then taken as the basis on which precision and recall were computed, as follows. Given a keyword query Q and its corresponding transformed XQuery [2][18][27], the result set of the XQuery is taken as the relevant results, and the result of a specific algorithm on Q is recorded as the retrieved results [2][11][24][28]. The precision and recall of this algorithm can be defined as follows.

The precision is the fraction of retrieved results that are relevant to the search:

$$P = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{retrieved results})|}$$

The recall is the fraction of the relevant results which are successfully retrieved by the search system:

$$R = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{relevant results})|}$$

The F-measure, which shows the trade-off between recall and precision, is computed as:

$$F\text{-measure} = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}$$

Where β = 1, recall and precision are weighted equally; where β < 1, precision is emphasized; and where β > 1, recall is emphasized.
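The three metrics can be computed directly from the relevant and retrieved result sets, as in the following minimal sketch. The function name precision_recall_f and the example result identifiers are hypothetical; zero denominators are mapped to 0.0 as a convention chosen for this illustration.

```python
def precision_recall_f(relevant, retrieved, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of results."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f

# Hypothetical query: four relevant results, three retrieved, two in common.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "x"})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

Passing a value of beta below 1 pulls the F-measure towards the precision, and a value above 1 pulls it towards the recall.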

From these definitions it is clear that the relevant results of each keyword query need to be determined before the evaluation metrics can be calculated. To acquire the relevant results of the tested queries, the researchers manually formed the corresponding schema-aware XQuery with the assistance of users [6][17]. The results of these queries were then used as the basis for evaluating the performance of the researchers' approach and of the other available approaches.

The researchers conducted experiments with a set of 50 keyword queries using the various approaches, and they measured the recall and precision of every approach by averaging the recall and precision values of the tested queries. The comparisons of recall and precision for the researchers' approach and the other approaches on the three data sets are shown below.

TABLE 7: PRECISION AND RECALL OF QUERIES ON MONDIAL DATA [11]

            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.503   0.712   0.635    0.670   0.712   0.721   0.922
Recall      1.000   0.624   0.943    0.910   0.903   0.647   0.939

TABLE 8: PRECISION AND RECALL OF QUERIES ON AUCTION DATA [11]

            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.478   0.706   0.623    0.640   0.699   0.703   0.901
Recall      1.000   0.650   0.931    0.920   0.907   0.650   0.931

TABLE 9: PRECISION AND RECALL OF QUERIES ON DBLP DATA [11]

            ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
Precision   0.523   0.733   0.640    0.688   0.720   0.733   0.934
Recall      1.000   0.647   0.941    0.911   0.923   0.647   0.941

TABLE 10: COMPARISONS ON RANKING EFFECTIVENESS OF THE ALGORITHMS [11]

            MAP     R-PREC  bpref   R-RANK  P@1     P@5     P@10
TKDD        0.870   0.830   0.750   0.860   0.890   0.870   0.810
TKDG        0.850   0.820   0.790   0.870   0.920   0.890   0.840
TKD-0.25    0.840   0.800   0.790   0.860   0.910   0.910   0.860
TKD-0.50    0.860   0.820   0.760   0.860   0.900   0.870   0.820
TKD-0.75    0.870   0.850   0.730   0.880   0.880   0.850   0.800
XRANK       0.670   0.750   0.610   0.710   0.690   0.680   0.650
XSEARCH     0.700   0.770   0.630   0.680   0.730   0.680   0.660


38/45

Optimizing Keyword Queries in XML Tree Structures %

+ll the ranking algorithms makes it possible to identify the top ten results at

a precision ranging between eighty to eighty $e percent. "he mean

aerage precision of the algorithm is 5- and the researcher could een

achiee more accurate precision by selecting a suitable alue which can

maximi%e the balance and relationship the dominating and dominated scores

>F 88FA;FA5F.

E&ciency and )calability of "op(K +lgorithms

"he researchers tested ten queries with arious lengths in eery data set.

"hey tested about $e thousand candidates in default scenarios and the

number of results found was thirty. "he queries which had less than required

number of results from candidates# the researchers made a replica of the

candidates repeatedly until they obtained the required candidate number:F

88F. "he researchers then selected randomly the required candidate number

from the set that was duplicated. "he cost of computation of the algorithm is

shown in the $gure below. It is clear that when the candidate number

increases the algorithm processing time also increases but at di!erent trends

:F6F8;F80F. "7// in this case is the most e!ectie and e&cient method it

is less a!ected by the increase in the number of candidates. "7// is mainly

concerned with in the results which are dominated fewest number of

candidates as possible. "his is because the results are usually located at the

top of the list of sorted candidates and as a result it searches a small portion

of the candidate list. ,or "7/C the search space is much larger and as a

result there is expected delay. 3n the other hand the lower performance of

"7// is also as a result of the score that is dominating hence it explains whyits processing time rises the same way as the "7/C which has a small

oerhead used in calculating and $nding the dominating score -F6F 8:FA;F

A5F. ,rom the results in ,igure -# it is clear that the "7// processing time is

less a!ected by the increase in number of k of the returned results and it can

return from ten to one hundred results from the set of $e thousand

candidates within a second. "he "7/C processing time algorithm is more

a!ected by the change of the parameter but it takes A.- seconds to get back

to the top one hundred results from asset of $e thousand candidates.

$igure 1: 4anking e&ciency of "7//# "7/C and "7/ algorithms 6F


39/45

Optimizing Keyword Queries in XML Tree Structures &

4&' "ON"L(SIONS

In the thesis the researchers hae studied the issue of identifying the most

accurate outcomes and results and the top(k appropriate results for XMLkeyword questions or queries in this matter. "he use of maGor keywords tosearch for documents has been widely accepted as a ery conenient way inretrieing resources from arious remote serers that hold that speci$c typeof data on the internet. MaGority of the search engines such as Coogle# 1ing#Jahoo and many more hae adopted the use of these technologies so as toe&ciently facilitate the process of data mining and data warehousing. "headaptation and use of keywords for querying arious databases has attractedarious researches to be conducted by the research community from thea!ected $elds of database and information retrieal ?I4@. XML documents arecomposed of nested XML attributes from the root elements to the nested

sub(elements. XML elements often reference other elements which arequeried as XML alues and therefore the text content is captured using thedeputation contains ?u# k@. *onsequently# the predicate returns true when theelement u has keyword k while an XML query 2 is mapped from an XMLdatabase / to XML documents that characteri%e the query output. +s aresult# when the XML database enironment is K/ is and the XML documentsequence enironment is )# the outcome is 2D K/ ). 2?/@ is the result ofquery 2 oer database / whereby the query is identi$ed using XML querylanguage for instance X2uery. "herefore considering a sequence s# then e \s is true when e is in s. *onsider a p(document# which is a probabilisticdocument written in XML speci$es a probability distribution across space of

deterministic documents written in XML. Each deterministic document thatbelongs to this space is referred to as a possible word. + probabilisticdocument referenced as a tree that has been labeled has distributional andordinary nodes. 3rdinary nodes are basically regular and normal XML nodesand their appearance may be seen in deterministic documents# whereasdistributional nodes are only used in the de$nition of probabilistic processthat inoles the generating of deterministic documents and their occurrenceis not isible in those documents. In the adaptation of rXML Oind# muxP as


40/45

Optimizing Keyword Queries in XML Tree Structures !'

part of the probabilistic XML model, two types of distributional nodes appear in a p-document: MUX and IND.

The researchers have striven to address three vital requirements for effective keyword search over XML. They introduced new methods for analyzing the relationship between query keywords in the candidates using the idea of mutual information, and from this derived a new DLCA semantics for keyword queries. They also proposed a strategy for selecting DLCA results from multiple candidates, together with three ranking methods for selecting top-k results based on skyline query semantics. Several properties were proven and used to accelerate the proposed algorithms. The experiments conducted to evaluate the approach show that it performs better than the existing approaches on the data sets tested and the evaluation metrics examined.

This is a very efficient way of facilitating the retrieval of documents because it does not require the user to learn any concepts. It is an advancement over earlier retrieval practice, which required users to know the particular IP (Internet Protocol) addresses of documents or information content and to type them into the URL bar. Later, IP addresses could be attached to web links, and from this search engines were developed with more interactive and responsive algorithms able to handle large amounts of information, including data mining.

A variety of approaches have been assessed and reviewed to find alternatives for keyword queries over XML data. The basic approaches that currently exist use lowest common ancestor (LCA) semantics, rather than general graph-theoretic methods, to identify the hit list for a given keyword query. This approach generates results composed of all candidates, also known as subtrees, that contain an instance of the queried keywords. The results returned under LCA semantics can be numerous, yet the user may be interested in only a portion of the whole hit list. Identifying the exact data required by the user therefore remains an unsolved issue. Ideally, the system would return exactly the piece the user requires rather than a whole set of hits, which leaves the user the extra job of filtering the content until the required piece is found.
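The skyline semantics mentioned above offer one way to narrow a large hit list. The following is a minimal sketch and not the researchers' algorithm: it assumes each candidate result carries a tuple of scores where higher is better, and the candidate names and score values are hypothetical.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def skyline(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: score
        for name, score in candidates.items()
        if not any(dominates(other, score)
                   for other_name, other in candidates.items()
                   if other_name != name)
    }


# Hypothetical candidates scored on two illustrative criteria
# (for example, structural compactness and keyword relevance).
scores = {
    "r1": (0.9, 0.2),
    "r2": (0.5, 0.5),
    "r3": (0.4, 0.4),  # dominated by r2 on both criteria
    "r4": (0.1, 0.8),
}
sky = skyline(scores)
print(sorted(sky))
```

Only candidates outside the skyline are discarded, so no result that is best on some trade-off between criteria is lost. How the remaining candidates are ordered for a top-k answer depends on the ranking method chosen, which this sketch does not cover.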

A simple cost model introduced by the authors was based on CPU costs and weighted I/O, and used statistics on the number of data pages occupied by relations to bind the cost model to concrete values. The dynamic programming algorithm selects an optimal operator for each specific access path. After that, an optimal join order is determined based on an assumption of local optimality. To prune the search space early, not all possible enumerations are considered. Instead, the focus is on interesting join orders, for instance orders that avoid introducing additional Cartesian products. Graefe and DeWitt presented the EXODUS Optimizer Generator; this system was not confined to a specific data model but supported the specification of algebraic transformations as rules. Combined with a concrete data model, the rules serve as input to the optimizer generator, which creates a tailor-made query optimizer.

This paper systematically explores XML structure-based answers and user expectations in order to identify the significance of XML keyword search semantics. It further posits a semantics-based methodology for XML keyword queries, principally through data-centric coherency ranking, which is rooted in the design of the domain and database and is predicated on data dependence and mutual information models. Consequently, keyword query results are produced under schema reorganization structures, with processing, ranking, and query algorithms that use coherency ranking to develop answers. Experiments on actual XML data indicate that coherency ranking achieves the highest precision, recall and ranking quality compared with the other approaches.
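The dynamic-programming search for a join order described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' optimizer: the cost of a plan is taken to be the sum of its intermediate result sizes (a stand-in for the CPU and weighted I/O model), only left-deep plans are built, and the relation names, cardinalities, and selectivities are hypothetical.

```python
from itertools import combinations


def best_join_order(cards, selectivity):
    """Dynamic programming over subsets of relations.

    cards: relation name -> row count.
    selectivity: frozenset({a, b}) -> selectivity of the join predicate on a and b.
    Returns (cost, rows, order) for the full set, or None if it is not connected.
    """
    rels = sorted(cards)
    # best[subset] = (cost so far, estimated rows, join order)
    best = {frozenset([r]): (0.0, float(cards[r]), (r,)) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for r in sorted(subset):
                rest = subset - {r}
                if rest not in best:
                    continue  # the remainder has no plan free of Cartesian products
                sels = [selectivity[frozenset((r, o))]
                        for o in rest if frozenset((r, o)) in selectivity]
                if not sels:
                    continue  # joining r here would be a Cartesian product: pruned
                cost, rows, order = best[rest]
                out_rows = rows * cards[r]
                for s in sels:
                    out_rows *= s
                new_cost = cost + out_rows
                # Local optimality: keep only the cheapest plan per subset.
                if subset not in best or new_cost < best[subset][0]:
                    best[subset] = (new_cost, out_rows, order + (r,))
    return best.get(frozenset(rels))


# Hypothetical statistics: A joins B, and B joins C; A and C share no predicate.
cards = {"A": 1024, "B": 64, "C": 8}
selectivity = {
    frozenset(("A", "B")): 1 / 64,
    frozenset(("B", "C")): 1 / 8,
}
plan = best_join_order(cards, selectivity)
print(plan)
```

Because each subset keeps only its cheapest plan, larger plans are assembled from already-optimal smaller ones, and the subset {A, C} is never planned at all since it could only be formed by a Cartesian product.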
Current keyword searches in XML can be divided into tree-based and graph-based searches, which are largely predicated on structural features of the document. However, these structure-based approaches do not comprehensively utilize the hidden semantics within XML documents, leading to issues in processing specific classes of keyword queries. The growing popularity of XML has intensified the need for an accessible and precise XML query interface, predicated on natural language and on search procedures that exploit XML structure, to simplify queries by ordinary users of XML databases.
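Tree-based keyword search of the kind discussed above commonly rests on LCA semantics, and a small example may make the structural idea concrete. The sketch below is a brute-force illustration, not an efficient or published algorithm: it assumes nodes are identified by Dewey labels (tuples of child positions from the root), and the keyword match lists are hypothetical.

```python
from itertools import product


def lca(a, b):
    """LCA of two Dewey labels is their longest common prefix."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def lca_candidates(match_lists):
    """LCAs of every combination that takes one matching node per keyword."""
    results = set()
    for combo in product(*match_lists):
        node = combo[0]
        for other in combo[1:]:
            node = lca(node, other)
        results.add(node)
    return results


def smallest_lcas(candidates):
    """Keep only candidates that have no other candidate as a proper descendant."""
    def is_proper_ancestor(a, d):
        return len(a) < len(d) and d[:len(a)] == a
    return {c for c in candidates
            if not any(is_proper_ancestor(c, o) for o in candidates)}


# Hypothetical Dewey labels of the nodes that contain each keyword.
matches = {
    "xml": [(0, 0, 1), (0, 2, 0)],
    "keyword": [(0, 0, 2), (0, 1, 0)],
}
cands = lca_candidates(list(matches.values()))
slcas = smallest_lcas(cands)
print(sorted(cands), sorted(slcas))
```

Here the root `(0,)` is an LCA of several keyword pairs, but the smallest LCA `(0, 0)` identifies the tighter subtree, which shows why plain LCA results can be numerous and why the choice of semantics matters.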



