optimizing keyword queries in xml tree structure

Upload: paul-ndeg

Post on 02-Jun-2018

  • 8/10/2019 Optimizing keyword queries in XML tree structure

    Running Head: Optimizing Keyword Queries in XML tree structures 1

    Optimizing Keyword Queries in XML Tree Structure

    Name:

    Instructor:

    Institution:

    Date

    ABSTRACT

    XML (Extensible Markup Language) remains the most popular and frequently used format for representing and exchanging data on the World Wide Web. It is applied widely across many different data types and applications. The data may be unstructured and heterogeneous, semi-structured, or structured. XML has developed progressively, gaining functionality through successive inventions and research, up to the development of data streaming applications. These inventions have received considerable attention from experienced users of the web, and they have made the efficient processing and querying of XML streams a central concern.

    This study focuses on retrieving answers to queries that combine structural constraints with keywords as a search tool, which is an essential function in XML data management systems. Such queries are expected to yield the best answers effectively and efficiently, like traditional keyword search, while factoring in any additional constraints that may exist. The study defines the new problem of top-k keyword search over probabilistic XML data, with the aim of retrieving the k SLCA results that have the highest probability of existence. Finally, the study reviews other forms of keyword search and compares them through an analysis of the algorithms that have been used.

    TABLE OF CONTENTS

    1.0 INTRODUCTION
        Problem Definition
        Proposal
    2.0 OVERVIEW OF RELATED WORKS
        Cost-Based Query Optimization in RDBMSs
        Query Optimization Frameworks
        XML Query Optimization
        Keyword Query in Ordinary XML Documents
        Probabilistic XML
        XQuery Streaming Optimization
            Querying XML streams
            XML temporal model
        Time Intervals and Model
        Mode Relationship Evaluation
        Probability
        Dominance Lowest Common Ancestor (DLCA)
            Dominance relationship
            Dominance
    3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE
        Mutual Information Concepts
            Mutual information and entropy
            Mutual information
    4.0 ANSWERS RETRIEVED FROM TOP-K
        Dominating Score
        Dominated Score
        Dominance Score
    5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS
        Naive Algorithm for Selection of Top-K Answers
        Top-K Dominated Algorithm (TKDD)
        Top-K Dominating Algorithm (TKDG)
    6.0 EXPERIMENTAL EVALUATION
        Experimental Setup
        Query Sets
        Search Quality
        Efficiency and Scalability of Top-K Algorithms
    7.0 CONCLUSIONS
    REFERENCES

    LIST OF FIGURES

    Figure 1: Probabilistic XML document [18]
    Figure 2: An example of an XML data tree structure, T [19]
    Figure 3: Data Tree T2 [19]
    Figure 4: List of candidates sorted in descending order using their M-S values [24]
    Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]

    LIST OF TABLES

    Table 1: The join probability of the two specific nodes at context node: /paper
    Table 2: The join probability of the two specific nodes at context node: /proceeding
    Table 3: Two dimension set candidate data
    Table 4: Search spaces LA and LB derived from a list l of candidates
    Table 5: Candidates list sorted in descending order of f-S values
    Table 6: Dominance checks count used in calculating dominated candidate scores
    Table 7: Precision and recall of queries on mondial data
    Table 8: Precision and recall of queries on auction data
    Table 9: Precision and recall of queries on dblp data
    Table 10: Comparisons on ranking effectiveness of the algorithms

    1.0 INTRODUCTION

    XML (Extensible Markup Language) has over the years evolved into a de facto standard for the exchange and representation of data, which has resulted in a proliferation of XML documents spread all over the internet. In the past, various query languages were used to retrieve XML documents and data, including XQuery, XPath and twig pattern queries. These languages required users to be versed in the specific query language and the relevant data schemas in order to execute XML queries efficiently [5]. This limited the user base to advanced users, since the query languages and the data schemas are complex concepts to understand. Searching data through the XQuery/XPath languages was therefore a very big limiting factor.

    The use of keywords to search for documents has been widely accepted as a very convenient way of retrieving resources from the various remote servers on the internet that hold that specific type of data. The majority of search engines, such as Google, Bing, Yahoo and many more, have adopted these technologies so as to efficiently facilitate the process of data mining and data warehousing. The adoption of keywords for querying databases has attracted considerable research from the database and information retrieval (IR) communities [5][2][3]. Keyword search is a very efficient way of retrieving documents because it does not require the user to learn any concepts. It is an advancement on traditional search, which required the user to master the particular IP (Internet Protocol) addresses of various documents or information content and to type them into the URL bar. This later advanced, and the IP addresses could be attached to web links. It is from this that search engines were developed, with more interactive and responsive algorithms able to handle many bits and pieces of information, including data mining [6].

    A variety of approaches to keyword queries over XML data have been assessed and reviewed. The basic approaches that currently exist use lowest common ancestor (LCA) semantics, as opposed to common graph theory, to identify the hit list for a given keyword query. This approach generates results composed of all candidates, also known as subtrees, that contain an instance of the queried keywords [5]. The LCAs returned can be numerous, yet the user may be interested in just a portion of the whole hit list. Identifying the exact data that the user requires therefore remains an unsolved issue. Ideally, the system would generate exactly the piece the user requires, rather than a whole set of hits that leaves the user the extra job of filtering the content until they obtain that piece [1][2][3][4][5][6]. Various studies have not succeeded in implementing exact retrieval for search queries, so it remains an unresolved and challenging issue.

    Many proposals have been drafted to improve the precision of the baseline approach. Most of them are founded on heuristic-based rules that enhance the tree search so that it reaches the best result in the shortest possible time. Although this approach is intuitive, it is ad hoc in the sense that data filtration takes place in the majority of the operations. Candidate data sets that meet the specified criterion are separated from those that do not match what the user seeks, and the outcome of this filtration is what is shown as the result of the algorithmic computation [2][3][4][5]. However, research shows that, although these algorithms are assumed to yield the best results, they not only miss relevant results (false negatives) but also return irrelevant results (false positives).

    The results received from XML queries can, however, be made more reliable if the following considerations are addressed. First, the candidate hits should be measured in terms of relevance. Secondly, there should be a mechanism that further filters the relevant candidate hits to produce a more specific list, with finer details that closely match the user's search [5][3]. The third proposition concerns positioning the candidate hits in descending order, from the most probable to the least probable among the best hits. In this paper, we focus on the optimization of keyword queries in XML tree structures so as to yield a very efficient result.

    Smallest Lowest Common Ancestor (SLCA) is a widely accepted semantics for keyword search results on a deterministic XML tree, named T in our scenario. A specific node v is considered an SLCA if:

    (1) v is the root of the subtree Tsub(v), and Tsub(v) contains all the keywords; and

    (2) there is no descendant node v' of v such that Tsub(v') contains all the keywords.

    For instance, consider a keyword query {k1, k2} on a certain p-document, as shown in the figure below. The SLCAs in this case are {MUX3, IND3}. When the algorithm attempts to perform a keyword search on this p-document, which is a cumulative example of a probabilistic XML document, the following challenges are encountered [5][2].
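    The two SLCA conditions can be illustrated with a minimal Python sketch on a deterministic tree. This is not the algorithm of any cited work: the Node class, the slca function and the small dblp-like tree are hypothetical, and a node is assumed to match a keyword only through the words in its own text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                      # unique node name, used only for reporting
    text: str = ""                 # keywords held directly by this node
    children: list = field(default_factory=list)


def slca(root, keywords):
    """Return names of nodes whose subtree holds every keyword
    while no descendant's subtree does."""
    wanted = {k.lower() for k in keywords}
    answers = []

    def visit(node):
        # Keywords matched directly by this node's own text.
        found = set(node.text.lower().split()) & wanted
        child_covers = False
        for child in node.children:
            child_found, covers = visit(child)
            found |= child_found
            child_covers = child_covers or covers
        covers_all = found == wanted
        # Condition (1): subtree contains all keywords.
        # Condition (2): no descendant subtree already contains them all.
        if covers_all and not child_covers:
            answers.append(node.name)
        return found, covers_all

    visit(root)
    return answers


# Hypothetical example tree (not taken from the paper's figures).
tree = Node("dblp", children=[
    Node("paper1", children=[Node("title1", "XML keyword search"),
                             Node("author1", "Liu")]),
    Node("paper2", children=[Node("title2", "XML streams"),
                             Node("author2", "Wang")]),
])

print(slca(tree, ["XML", "Liu"]))   # ['paper1']
print(slca(tree, ["Liu", "Wang"]))  # ['dblp']
```

    A single post-order traversal is enough here, because a node can only be an SLCA once all of its children have been ruled out.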

    Problem Definition

    Current keyword searches in XML can be divided into tree-supported and graph-supported searches, both of which are largely predicated on structural features of the document. However, these structure-based approaches do not comprehensively utilize the hidden semantics within XML documents, which leads to issues in the processing of specific classes of keyword query. The growing reputation of XML has intensified the need for an accessible and precise XML query interface that is predicated on natural language, together with search procedures that exploit XML structures to simplify queries by ordinary users of XML databases. Conventional methodologies, however, process queries based on ad hoc and intuitive heuristics, which frequently return false positive and unranked answers.

    Proposal

    This paper systematically explores XML structure-based answers and user expectations in order to identify the significance of XML keyword search semantics. It further posits a semantics-based methodology for developing XML keyword queries, principally through data-centric coherency ranking, which is rooted in the design of the domain and the database and is predicated on data dependence and mutual information models. Consequently, keyword query results are produced under schema reorganization structures, with algorithms that process, rank and present answers through coherency ranking. Experiments on actual XML data indicate that coherency ranking achieves the highest precision, recall and ranking quality compared with other approaches.

    2.0 OVERVIEW OF RELATED WORKS

    In this section, relevant publications on optimization frameworks, XML query optimization and relational cost-based query optimization are briefly reviewed.

    Cost-Based Query Optimization in RDBMSs

    A researcher proposed the first ever cost-based query optimizer, which formed part of System R, the prototype relational database system. The optimizer's capabilities included the optimization of simple, linear select, project, and join (S

    introductions of Cartesian products. Graefe and DeWitt presented the EXODUS Optimizer Generator, which was not confined to a specific data model but supported the specification of algebraic transformations as rules [5][17][25]. Combined with a concrete data model, the rules serve as input to the optimizer generator, which creates a tailor-made query optimizer.

    This study explores the retrieval of answers to queries that combine structural constraints with keywords as a search tool, which is an essential function in XML data management systems. Through the exploration of evidence-based research and the extrapolation of related works, the study is expected to yield the best answers effectively and efficiently, in a manner similar to conventional keyword search, while factoring in any constraints that may exist.

    Query Optimization Frameworks

    Lanzelotte and Valduriez contributed an extensible framework for query optimization that models the search space independently of any particular search strategy [19][21]. Using this approach, developers can build highly extensible plan enumeration frameworks. Kabra and DeWitt proposed OPT++ as an object-oriented (OO) approach to highly extensible query optimization [16]. By combining extensible search components with a physical algebra and an extensible logical representation, it lifts the work of Lanzelotte and Valduriez to the object-oriented level [9].

    XML Query Optimization

    Existing work on XML query optimization targets strictly limited and isolated problems, such as the optimization of path expressions using navigational access paths, and lacks support for TJ and SJ operators. Wu et al. proposed five novel dynamic programming algorithms for structural join reordering [10]. Their approach is orthogonal; for instance, it can be used to select the most efficient join order in S

    framework for cost-based optimization, and a full-fledged DBMS does not seem to provide the solution. It described a first approach to cost-based XPath optimization. Unlike the present proposal, which supports the optimization of XQuery, it did not consider TJ or advanced access paths such as CAS indexes [9].

    Figure 1: Probabilistic XML document [18]

    Keyword Query in Ordinary XML Documents

    Keyword search is one of the major ways of querying XML databases. Given a keyword query on an XML data source, most related work returns the lowest common ancestors (LCAs), or the smallest LCAs (SLCAs), of the matching nodes as the results. Schema-Free XQuery and XRANK compute such ancestors and develop stack-based algorithms [12][28]. The Indexed Lookup Eager algorithm is introduced for the case in which the keywords appear with significantly different frequencies, while the Scan Eager algorithm takes over once the keywords register similar frequencies. The majority of previous works focused on inferring the definition of the returned results and discussed how results can be differentiated. The researchers provided more meaningful conclusions and used the underlying statistics of the XML data to identify the return node types [9]. They also proposed that a number of algorithms for optimally cleaning keyword queries could be developed. This resulted in the design of the MS approach for computing SLCAs for keyword queries in multiple ways. They took the Valuable LCA as the result, intentionally avoiding the false negatives and false positives of SLCA and LCA [12][25]. Researchers also proposed Indexed Stack, an efficient algorithm for finding answers based on the semantics of Exclusive LCA. In addition, there are related works that process keyword search by integrating keywords into structured queries. XMLQL, a new query language, separates the structure of the query from the keywords. The research also introduced a method for embedding keywords into XQuery in order to process keyword search [9].

    Probabilistic XML

    Probabilistic XML has recently been a well-studied subject, and the majority of the proposed models have been presented together with evaluations of structured queries. Nierman et al. first introduced the concept of ProTDB, with the probabilistic types MUX (mutually exclusive) and IND (independent). Other researchers modeled probabilistic XML in the form of acyclic graphs, which support arbitrary distributions over sets of children. Further research adopted a probabilistic tree approach for the purpose of data integration, in which the possibility and probability nodes are similar to IND and MUX respectively [9][21][29].

    A p-document, which is a probabilistic XML document, specifies a probability distribution over a space of deterministic XML documents. Each deterministic document that belongs to this space is referred to as a possible world. A p-document is represented as a labeled tree that has both distributional and ordinary nodes [11][14]. Ordinary nodes are regular XML nodes and may appear in deterministic documents, whereas distributional nodes are used only to define the probabilistic process that generates the deterministic documents and do not appear in those documents [5][12][24]. In adopting PrXML{ind, mux} as the probabilistic XML model, two types of distributional node appear in a p-document: MUX and IND [10][13].

    Example 1: Consider Figure 1(a), which shows the p-document T. Tag names are used to show ordinary nodes, for instance C1, C2, B1 and B2. As for the distributional nodes, MUX nodes are shown as rectangular boxes with rounded corners, while IND nodes are depicted as circles. The IND2 node has two child nodes, B2 and C1, with existence probabilities 0.6 and 0.5 respectively [5]. Therefore, the probability that neither C1 nor B2 appears is

    (1 - 0.6) × (1 - 0.5) = 0.2.

    The MUX2 node has three children, IND3, E2 and D1, whose existence probabilities are 0.5, 0.3 and 0.1 respectively. Therefore, the probability that none of them exists is

    1 - 0.5 - 0.3 - 0.1 = 0.1.
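    The two absence probabilities in this example can be checked with a few lines of Python. The function names ind_absence and mux_absence are hypothetical helpers introduced only for this illustration.

```python
from math import prod


def ind_absence(child_probs):
    """Probability that none of the independent (IND) children appears."""
    return prod(1 - p for p in child_probs)


def mux_absence(child_probs):
    """Probability that a mutually exclusive (MUX) node yields no child."""
    return 1 - sum(child_probs)


# IND2 with children B2 (0.6) and C1 (0.5); MUX2 with children at 0.5, 0.3, 0.1.
print(round(ind_absence([0.6, 0.5]), 10))
print(round(mux_absence([0.5, 0.3, 0.1]), 10))
```

    Up to floating-point rounding, the two printed values agree with the 0.2 and 0.1 derived in the example.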

    Given a p-document tree T, all possible deterministic documents can be generated by traversing T in a top-down manner. Two situations arise, and they are dealt with independently:

    (1) For an IND node with m child nodes, 2^m copies of T are generated and the IND node is deleted. In each copy, the m child nodes are replaced with one distinct subset of them, and each child node in that subset is connected to the ordinary parent of the IND node. The probability of each copy is the product of the existence probabilities of the child nodes in the subset and the absence probabilities (that is, 1 minus the existence probability) of the child nodes that are not in the subset [5].

    (2) For a MUX node with m child nodes, m + 1 copies of T are generated and the MUX node is deleted. In each copy, the m child nodes are replaced with either no child or one distinct child node, and that child node is connected to the ordinary parent of the MUX node. The probability of each copy is the existence probability of the distinct child node or, when no child node appears, the absence probability. Every generated copy of T is then traversed top-down in the same way until all distributional nodes have been deleted [5].
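    The generation procedure above can be sketched as follows. This is an illustrative sketch rather than the implementation used in the cited work: PNode, combine and expand are hypothetical names, ordinary children are assumed to carry probability 1.0, and identical worlds produced along different routes are not merged.

```python
from dataclasses import dataclass, field
from itertools import product
from math import prod


@dataclass
class PNode:
    kind: str                      # "ord", "ind" or "mux"
    label: str = ""
    children: list = field(default_factory=list)  # (PNode, probability) pairs


def combine(options):
    """Pick one (forest, prob) per option list; join forests, multiply probs."""
    return [(tuple(t for forest, _ in combo for t in forest),
             prod(p for _, p in combo))
            for combo in product(*options)]


def expand(node):
    """List the (forest, probability) outcomes of the subtree rooted at node.
    A forest is a tuple of (label, children) trees free of distributional nodes."""
    if node.kind == "ord":
        # Ordinary node: always kept; its children expand independently.
        return [(((node.label, kids),), p)
                for kids, p in combine([expand(c) for c, _ in node.children])]
    if node.kind == "ind":
        # IND node: each child is present (p) or absent (1 - p): 2^m subsets.
        return combine([[(f, p * q) for f, q in expand(c)] + [((), 1 - p)]
                        for c, p in node.children])
    if node.kind == "mux":
        # MUX node: at most one child survives: m + 1 outcomes.
        outcomes = [((), 1 - sum(p for _, p in node.children))]
        for c, p in node.children:
            outcomes += [(f, p * q) for f, q in expand(c)]
        return outcomes
    raise ValueError(f"unknown node kind: {node.kind}")


# Hypothetical p-document: ordinary root A over an IND node with two leaves.
ind = PNode("ind", children=[(PNode("ord", "B2"), 0.6), (PNode("ord", "C1"), 0.5)])
doc = PNode("ord", "A", children=[(ind, 1.0)])

for world, prob in expand(doc):
    print(world, round(prob, 3))
```

    For this input the sketch lists 2^2 = 4 possible worlds whose probabilities sum to 1, and the world in which A has no children carries the absence probability computed in Example 1.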

    Various researchers have proposed a fuzzy trees model, in which nodes are associated with conjunctions of probabilistic event variables. A full complexity analysis of queries and updates on fuzzy trees is also referenced in that research. They also proposed efficient algorithms that solve the constraint-satisfaction problem [6][18][24][29]. The sampling problem and query evaluation under a set of constraints can be well defined so as to yield the expected query results efficiently. Other publications summarized and extended the previously proposed probabilistic XML models, and discussed the tractability of queries and the expressiveness of different models with MUX and IND in mind [5][13][15][18][27]. Various studies considered the problem of evaluating twig queries over probabilistic XML, which may generate partial and incomplete answers with respect to a user-specified probability threshold. The researchers also addressed the problem of ranking the top-k probabilities of the answers to a twig query. In summary, the cited work focused on various probabilistic XML data models under a structured XML query, for instance a twig query [1][7][16][23]. Our research differs in that it critically reviews and analyzes the keyword search problem in probabilistic XML data [9].

    XQuery Streaming Optimization

    Querying XML streams

    Several streaming algorithms exist that focus on the querying problem and the filtering procedure. Many of these algorithms center on tree-pattern queries (TPQs). TPQs correspond to XPath queries that mainly involve the descendant and child axes [14][26]. TPQ streaming algorithms can be extended to handle XPath queries with ordered axes (preceding, preceding-sibling, following and following-sibling) [10], and processing techniques are therefore introduced for the ordered axes. Streaming algorithms broadly fall into three categories: the array-based approach, the automaton-based approach and the stack-based approach.
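    As an illustration of the stack-based idea, the sketch below evaluates a pattern of the form //ancestor//target over a stream of parser events while keeping only the stack of currently open tags. It is a simplified sketch, not one of the published TPQ algorithms; the stream_match function and the sample document are hypothetical, and only the standard-library xml.etree.ElementTree.iterparse call is relied on.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample stream (not taken from the paper's data sets).
XML = (b"<dblp><proceeding>"
       b"<paper><title>XML streams</title><author>Liu</author></paper>"
       b"<paper><title>Twig joins</title><author>Wang</author></paper>"
       b"</proceeding><article><author>Zhang</author></article></dblp>")


def stream_match(source, ancestor, target):
    """Collect the text of every `target` element that has an open
    `ancestor` element above it, i.e. the pattern //ancestor//target."""
    stack, hits = [], []
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)          # element opened: push its tag
        else:
            # Element closed: test the pattern against the open ancestors.
            if elem.tag == target and ancestor in stack[:-1]:
                hits.append(elem.text)
            stack.pop()
            elem.clear()                    # release content already handled
    return hits


print(stream_match(io.BytesIO(XML), "proceeding", "author"))  # ['Liu', 'Wang']
```

    Because the match is decided from the stack of open tags alone, the document never has to be held in memory as a whole, which is the property the streaming algorithms rely on.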

    XML temporal model

    Previous studies of time-based XML models have identified several disadvantages and benefits. The bitemporal approach includes both valid time and transaction time in timestamp attributes [22][26]. Normative texts always comprise four time intervals. Normative texts with temporal values in an XML database introduce new interval attributes, for instance efficacy time and publication time. This approach to XML tree partitioning guarantees that data is distributed into partitions of equal size, taking into account both the query processing load and the data storage cost [10].

    Time Intervals and Model

    The intervals are publication time, efficacy time, transaction time and validity time. Transaction time refers to the time at which a transaction is reflected in the database, and it is an important factor for all transactions that occur in time-referenced databases. Valid time refers to the interval during which the data is valid for general use; outside it, the data may be invalid and unusable [8][19][20]. Efficacy time is when the data is used under various conditions or in specialized cases only. Publication time represents the time at which the data is published. Reason nodes hold data that is sensitive and may find utility in decision support systems [10].

    Mode Relationship Evaluation

    This research deviates from other studies, which evaluate the relationship between multiple nodes by applying intuitive, heuristics-based rules. Instead, it measures the relationship between multiple nodes in a data tree structure by adopting the concept of mutual information, which is applied in various data mining processes [5][6][11]. This operates on the correlation between the attributes of database relations. Because it is common for an XML data tree to contain nodes that have the same label but occur in different contexts, prefix label paths are used to denote the types of nodes. A prefix label path refers to the sequence of element names that appear along the path from the root node to the node in question. The node types are used to identify the specific nodes found in the data tree [16][22].

    Figure 2: An example of an XML data tree structure, T [19]

    For instance, at node 4 the prefix label path in the data tree is
    dblp/proceeding/paper/author. A prefix label path may occur many times in an XML
    data tree, and these occurrences are referred to as node instances
    [2][7][21][25]. All instances of a node type therefore share the same prefix
    label path. Every instance has a value, which is the set of keywords contained
    directly in that instance. For instance, in the tree of Figure 2 the prefix label
    path dblp/proceeding/paper/author has instances 28, 20, 15, 11 and 4, which hold
    the values Richards, Wang, Zhang, Liu and

    TABLE 1

    This indicates that the mutual information between two node types varies with the
    context. The mutual information between author(s) and title in the context of a
    paper is higher than the mutual information between the author(s) and title of
    two different papers, that is, in the context of a proceeding. We can therefore
    conclude that MI is a good measure of how closely nodes are related. The scale of
    MI values, however, has no fixed range, as Property 5 shows [25]: the MI of two
    nodes is bounded only by the minimum of their entropies. To apply the concept
    properly, we need a unified scale for measuring MI across all node sets [4][18].

    Node Relationship

    Considering two nodes u and v that are joined at the context node c, the
    relationship of the two nodes is defined as:

    In this definition, H(u) and H(v) are the entropies of nodes u and v
    respectively, and they are calculated in the same way as the entropies of random
    variables [16].
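    A form of this definition that agrees with the worked example in this section
    (0.49 / 0.70 = 0.7) and with Property 5 normalises the mutual information of the
    two node types by the smaller of their entropies. Taking the minimum in the
    denominator is an assumption made here, not something the text confirms:

```latex
\[
  \mathrm{rel}(u; v \mid c) \;=\; \frac{I(u; v \mid c)}{\min\{H(u),\, H(v)\}} \tag{1}
\]
```

    Under this reading the value is 0 when the two node types are independent at c
    and 1 when one of them fully determines the other.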

    When the value of rel(u; v | c) is high, the relationship between nodes u and v
    is strong at the context node c. For instance, the entropies of the nodes
    dblp/proceeding/paper/title and dblp/proceeding/paper/author are obtained as:

    \[ H(\text{dblp/proceeding/paper/title})
       = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}
         + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}
         + \tfrac{1}{5}\log\tfrac{1}{5}\right]
       = -\log\tfrac{1}{5} = \log 5 = 0.70 \]

    \[ H(\text{dblp/proceeding/paper/author})
       = -\left[\tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}
         + \tfrac{1}{5}\log\tfrac{1}{5} + \tfrac{1}{5}\log\tfrac{1}{5}
         + \tfrac{1}{5}\log\tfrac{1}{5}\right]
       = -\log\tfrac{1}{5} = \log 5 = 0.70 \]

    The relationship between dblp/proceeding/paper/title and
    dblp/proceeding/paper/author at the context node dblp/proceeding can therefore be
    derived as

    \[ \mathrm{rel}(\text{dblp/proceeding/paper/author};\ \text{dblp/proceeding/paper/title}
       \mid \text{dblp/proceeding}) = \frac{0.49}{0.70} = 0.7 \]

    Similarly, at the context node dblp/proceeding/paper:

    \[ \mathrm{rel}(\text{dblp/proceeding/paper/title};\ \text{dblp/proceeding/paper/author}
       \mid \text{dblp/proceeding/paper}) = \frac{0.70}{0.70} = 1.0 \]

    This shows that, for any two nodes u and v at any context node c in an XML data
    tree, the relationship falls in the range [0, 1]:

    \[ 0 \le \mathrm{rel}(u; v \mid c) \le 1 \]

    This can be proved from the properties of mutual information. Property 3 states
    that I(u; v | c) ≥ 0, which implies that rel(u; v | c) ≥ 0. Property 5 states
    that I(x; y | c) ≤ H(y) and I(x; y | c) ≤ H(x), which gives rel(u; v | c) ≤ 1.

    Dominance Lowest Common Ancestor (DLCA)

    To retrieve the relevant answers from a large number of LCA-based candidates,
    this research proposes new semantics referred to as Dominance LCA. We begin by
    introducing the relationship between LCA-based candidates.

    Dominance relationship

    Query candidates are represented by their root nodes and can be depicted as
    sub-trees. Given a keyword search query Q = {k_1, ..., k_q}, a candidate S of Q
    is represented in the form S(n_lca, {n_1; ...; n_q}), where n_i is a leaf node
    that contains k_i and n_lca is the lowest common ancestor of the nodes
    {n_1; ...; n_q}. Each node in the candidate is identified by an identifier, which
    in this research is encoded as a Dewey code.

    The Dewey code derives from the Dewey Decimal Classification, which was developed
    for the classification of general knowledge [1][23]. With Dewey coding, each node
    is assigned a vector that represents the path from the root of the tree to that
    node. Each component of the vector gives the local order of one ancestor along
    the path. This is illustrated in Figure 3:

    Figure 3: Data Tree T2 [19]

    The researcher chose to encode node identifiers with Dewey codes because they
    capture the hierarchical relationships between the nodes of a tree, which are
    central to the tree structure. The label path of a node can be recovered from its
    Dewey code. For instance, in the sample data tree T2 in Figure 3, every node is
    identified by a Dewey code. For the node with identifier [0.1.0.0], the
    corresponding label path is n1/n3/n4/n5. We therefore define a function
    ID2L(id) that takes a Dewey code as input and returns the corresponding label
    path [13][24]. A keyword may well occur many times in a candidate sub-tree
    S(n_lca, {n_1; ...; n_q}). Every keyword k_i (1 ≤ i ≤ q) therefore yields a set
    of nodes L_i = {n_i | val(n_i) contains the keyword k_i}.

    The relationship between two keywords in a candidate is given as:

    \[ \mathrm{rel}(k_i, k_j) = \max_{n_i \in L_i,\ n_j \in L_j}
       \mathrm{rel}\bigl(\mathrm{ID2L}(n_i);\ \mathrm{ID2L}(n_j) \mid
       \mathrm{ID2L}(\mathrm{lca}(n_i, n_j))\bigr) \tag{2} \]

    Here, ID2L(n_i) is a function that returns the node type of the node with Dewey
    code n_i, and rel(ID2L(n_i); ID2L(n_j) | ID2L(lca(n_i, n_j))) is calculated by
    formula (1). It measures the correlation between the nodes with Dewey codes n_i
    and n_j at their lowest common ancestor [18][25]. The relationship between the
    keywords k_i and k_j in a candidate is thus taken as the maximum relationship
    between any two nodes of that candidate that contain the two keywords [13].
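    The Dewey encoding, the ID2L function and the lca operation described above can
    be sketched in a few lines of Python. The label table below is hypothetical: it
    lists only the nodes of T2 whose label paths are quoted in the text, and the
    function names id2l and lca are illustrative choices rather than names taken
    from the source.

```python
# Hypothetical label table for the sample tree T2: Dewey code -> node label.
# Only the nodes whose paths are quoted in the text are listed.
LABELS = {
    "0": "n1",
    "0.1": "n3",
    "0.1.0": "n4",
    "0.1.0.0": "n5",
    "0.1.1": "n6",
    "0.1.1.0": "n7",
    "0.1.1.1": "n8",
}


def ancestors_and_self(dewey: str) -> list[str]:
    """Return the Dewey codes on the path from the root down to `dewey`."""
    parts = dewey.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def id2l(dewey: str) -> str:
    """Map a Dewey code to its prefix label path (the node type)."""
    return "/".join(LABELS[code] for code in ancestors_and_self(dewey))


def lca(*deweys: str) -> str:
    """Lowest common ancestor = longest common prefix of the code components."""
    prefix = []
    for components in zip(*(d.split(".") for d in deweys)):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ".".join(prefix)


print(id2l("0.1.0.0"))                        # n1/n3/n4/n5
print(lca("0.1.1.0", "0.1.1.1"))              # 0.1.1
print(lca("0.1.0.0", "0.1.1.0", "0.1.1.1"))   # 0.1
```

    Because a Dewey code spells out the whole path from the root, finding a lowest
    common ancestor needs no tree traversal, only a prefix comparison.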

    For instance, for the query Q = {k_1, k_2, k_3} on the data tree T2 there is only
    one candidate sub-tree, rooted at node n3 [0.1]. It can be represented as
    S(0.1, {0.1.0.0; 0.1.1.0; 0.1.1.1}). The relationships between the query keywords
    in candidate S are calculated as follows:

    \[ \mathrm{rel}(k_2, k_3) = \mathrm{rel}(\mathrm{ID2L}(0.1.1.0);\ \mathrm{ID2L}(0.1.1.1) \mid \mathrm{ID2L}(0.1.1))
       = \mathrm{rel}(n_1/n_3/n_6/n_7;\ n_1/n_3/n_6/n_8 \mid n_1/n_3/n_6) \]

    \[ \mathrm{rel}(k_1, k_3) = \mathrm{rel}(\mathrm{ID2L}(0.1.0.0);\ \mathrm{ID2L}(0.1.1.1) \mid \mathrm{ID2L}(0.1))
       = \mathrm{rel}(n_1/n_3/n_4/n_5;\ n_1/n_3/n_6/n_8 \mid n_1/n_3) \]

    \[ \mathrm{rel}(k_1, k_2) = \mathrm{rel}(\mathrm{ID2L}(0.1.0.0);\ \mathrm{ID2L}(0.1.1.0) \mid \mathrm{ID2L}(0.1))
       = \mathrm{rel}(n_1/n_3/n_4/n_5;\ n_1/n_3/n_6/n_7 \mid n_1/n_3) \]

    Given the keyword query Q = {k_1, ..., k_q}, the relationship of each pair of
    keywords in a candidate sub-tree S is stored in a vector D_S. The keyword
    relationship vector is defined by:

    \[ D_S = [\, \mathrm{rel}(k_i, k_j) \mid k_i, k_j \in Q \wedge (i < j) \,] \]

    Since a set of q keywords {k_1, ..., k_q} has C_q^2 two-keyword combinations, the
    vector D_S contains C_q^2 elements, which is denoted |D_S| = C_q^2. For instance,
    the keyword relationship vector of the candidate
    S(0.1, {0.1.0.0, 0.1.1.0, 0.1.1.1}) of the query Q = {k_1, k_2, k_3} consists of

    \[ C_3^2 = \frac{3!}{2!\,(3-2)!} = 3 \]

    elements:

    \[ D_S = [\, \mathrm{rel}(k_1, k_2),\ \mathrm{rel}(k_1, k_3),\ \mathrm{rel}(k_2, k_3) \,] \]

    Let D_S and D_S' be the keyword relationship vectors of the candidates S and S'.
    The dominance relationship between S and S' can then be defined [12].
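    As a minimal sketch of how such a vector could be assembled, the snippet below
    enumerates the keyword pairs with i < j and looks each one up. The relationship
    values in REL are hypothetical placeholders; in the approach described here they
    would come from formula (2).

```python
from itertools import combinations
from math import comb


def relationship_vector(keywords, rel):
    """Build D_S as [rel(k_i, k_j) for every pair with i < j]."""
    return [rel(ki, kj) for ki, kj in combinations(keywords, 2)]


# Hypothetical pairwise relationship values for one candidate S.
REL = {("k1", "k2"): 0.7, ("k1", "k3"): 0.4, ("k2", "k3"): 1.0}

keywords = ["k1", "k2", "k3"]
d_s = relationship_vector(keywords, lambda a, b: REL[(a, b)])

print(d_s)                                  # [0.7, 0.4, 1.0]
print(len(d_s) == comb(len(keywords), 2))   # True: |D_S| equals C(3, 2) = 3
```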

    Dominance

    Let S and S' be two candidates of the XML search query Q over a given database T.
    S dominates S', written S ≻ S', if and only if the following conditions are met:

    \[ \forall i\,(1 \le i \le d):\ D_{S'}[i] \le D_S[i]
       \quad\text{and}\quad
       \exists j\,(1 \le j \le d):\ D_{S'}[j] < D_S[j] \]

    Here d is the length of the keyword relationship vectors of S and S', that is
    d = |D_S| = |D_S'| = C_q^2, and D_S[i] is the i-th element of the vector D_S. In
    other words, S is at least as strong as S' on every keyword pair and strictly
    stronger on at least one [4][11][24].
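    The dominance test can be sketched as a small Python function. The sketch assumes
    that larger relationship values are preferred, so a vector dominates another when
    it is never smaller and strictly larger in at least one position; the function
    name dominates is an illustrative choice.

```python
def dominates(d_s, d_t):
    """Return True if vector d_s dominates vector d_t.

    Assumes larger values are better: d_s must be >= d_t in every position
    and strictly > in at least one.
    """
    if len(d_s) != len(d_t):
        raise ValueError("both vectors must have the same length d")
    pairs = list(zip(d_s, d_t))
    return all(a >= b for a, b in pairs) and any(a > b for a, b in pairs)


print(dominates([0.9, 0.4], [0.5, 0.4]))   # True
print(dominates([0.9, 0.4], [0.8, 0.8]))   # False: incomparable vectors
print(dominates([0.5, 0.4], [0.5, 0.4]))   # False: equal vectors
```

    Two candidates can be incomparable, which is why dominance yields a partial
    order and the ranking scores count dominance relationships instead of sorting
    on the vectors directly.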

    3.0 MEASUREMENT OF THE RELATIONSHIP BETWEEN NODES IN A DATA TREE

    This section reviews the concept of mutual information (MI) together with related
    concepts. It then discusses in detail how the concept is adapted to measure the
    meaningful relationship between nodes in an XML data tree.

    Mutual Information Concepts

    Mutual information and entropy

    These are central concepts of information theory. Entropy is a measure of the
    uncertainty of a random variable. MI quantifies the mutual dependence of two
    random variables [3][8][19][27].

    Entropy: Let x be a discrete random variable that takes values v_x from the set
    dom(x) according to a probability distribution function p(v_x). The entropy of x
    is defined as follows:

    \[ H(x) = -\sum_{v_x \in dom(x)} p(v_x) \log p(v_x) \]

    Conditional entropy: The entropy of a random variable y given a second variable
    x, referred to as the entropy of y conditional on x and written H(y | x), has the
    following definition:

    \[ H(y \mid x) = -\sum_{v_x \in dom(x)} \sum_{v_y \in dom(y)}
       p(v_y, v_x) \log p(v_y \mid v_x) \]

    In this equation, p(v_y, v_x) is the joint probability of (y = v_y) and
    (x = v_x), whereas p(v_y | v_x) is the conditional probability of (y = v_y) given
    (x = v_x) [10][18].

    Mutual information

    For two random variables, mutual information is a quantity that measures their
    mutual dependence [6]. For two discrete random variables x and y, the mutual
    information is defined as:

    \[ I(x; y) = \sum_{v_x \in dom(x)} \sum_{v_y \in dom(y)}
       p(v_x, v_y) \log \frac{p(v_x, v_y)}{p(v_x)\, p(v_y)} \]

    Here p(v_x, v_y) is the joint probability of (x = v_x) and (y = v_y), and p(v_x)
    and p(v_y) are the probabilities of (x = v_x) and (y = v_y) respectively.

    Mutual information has several characteristic properties, some of which are
    detailed as follows:

    Property 1:

    \[ I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x) \]

    This property gives the interpretation of mutual information. The information
    that y provides about x is the reduction in the uncertainty of x that results
    from knowing y. The same holds for the information that x provides about the
    random variable y. The larger the mutual information, the more the two variables
    reveal about each other [16][20][24][29].

    Property 2:

    \[ I(x; y) = I(y; x) \]

    Mutual information is symmetric: the information that x provides about y is the
    same as the information that y conveys about x [16][20][24].

    Property 3:

    \[ I(x; y) \ge 0 \]

    This property gives the lower bound of mutual information. When I(x; y) = 0, we
    get p(v_x, v_y) = p(v_x) p(v_y) for all possible values of x and y. This means
    that the variables x and y are independent, so knowing the value of x provides no
    clue about the value of y, and their mutual information is zero [5][16][20][29].

    Property 4:

    \[ I(x; x) = H(x) \]

    The mutual information of a variable x with itself is the entropy of x. For this
    reason entropy is also referred to as self-information [16][24].

    Property 5:

    \[ I(x; y) \le H(x) \quad\text{and}\quad I(x; y) \le H(y) \]

    The mutual information between two variables is bounded by the minimum of their
    entropies [16][20][24].
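    The definitions and properties above can be checked numerically. The sketch below
    estimates entropy and mutual information from observed values using empirical
    frequencies and base-10 logarithms (the base that makes log 5 come out at about
    0.70). The sample titles and authors are hypothetical and are not the contents of
    any table in this document.

```python
from collections import Counter
from math import log10


def entropy(values):
    """Shannon entropy (base 10) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log10(c / n) for c in Counter(values).values())


def mutual_information(pairs):
    """I(x; y) in base 10, estimated from a list of observed (x, y) pairs."""
    n = len(pairs)
    p_xy = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log10((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )


# Hypothetical observations: five distinct titles, each with its own author.
titles = ["t1", "t2", "t3", "t4", "t5"]
authors = ["a1", "a2", "a3", "a4", "a5"]
pairs = list(zip(titles, authors))

h_title = entropy(titles)            # log10(5), about 0.70
i_ta = mutual_information(pairs)     # equals h_title: each title fixes its author
print(round(h_title, 2), round(i_ta, 2))

# Independent variables: every (x, y) combination is equally likely, so I = 0.
independent = [(x, y) for x in (0, 1) for y in (0, 1)]
print(abs(mutual_information(independent)) < 1e-12)
```

    In the first data set the mutual information reaches the bound of Property 5,
    while in the second it sits at the lower bound of Property 3.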

    4.0 ANSWERS RETRIEVED FROM TOP-K

    The researcher observes that DLCA answers vary with different search queries.
    When searching for information, users are usually interested in the top-k
    answers, sorted in descending order of their relevance to the user's information
    need. This section defines three ranking functions for identifying the top-k
    results of a keyword-based search over XML data. The ranking functions exploit
    different aspects of the dominance relationships between query candidates to rank
    their relevance to the search query [1][20][28].

    Given C(Q, T), the set of candidates of a query Q in an XML database T, the
    relevance of a candidate is measured by the following three ranking scores:

    Dominating Score

    Given a candidate answer S, the dominating score of S is defined as follows.

    \[ score_{dg}(S) = |\{\, S' \in C(Q, T) \mid S \succ S' \,\}| \tag{3} \]

    The dominating score score_dg(S) is the number of candidates that S dominates. A
    candidate is more relevant the more other candidates it dominates. Therefore, a
    higher dominating score denotes that S is more significant to the query [7][12].

    Example of an Instance 1:

    Let S ∈ C(Q, T) and S' ∈ C(Q, T) be two candidates of a query Q in an XML data
    tree T. If S ≻ S', then score_dg(S) ≥ score_dg(S').

    This can be proved using the transitivity of the dominance relationship. For any
    two candidates S ∈ C(Q, T) and S' ∈ C(Q, T), if S ≻ S', then for every
    S_i ∈ C(Q, T) with S' ≻ S_i we also have S ≻ S_i. Hence

    \[ |\{ S_i \in C(Q, T) \mid S \succ S_i \}| \ge |\{ S_i \in C(Q, T) \mid S' \succ S_i \}| \]

    that is, score_dg(S) ≥ score_dg(S').

    This guarantees that when candidate S dominates candidate S', S is ranked higher
    than S' in the top-k results that are returned [4][11][24].

    Dominated Score

    Given a candidate answer S, the dominated score of S is defined as follows:

    \[ score_{dd}(S) = |\{\, S' \in C(Q, T) \mid S' \succ S \,\}| \tag{4} \]

    The dominated score score_dd(S) is the number of other candidates that dominate
    S. The lower the dominated score, the more meaningful candidate S is to the query
    [20][24]. A candidate is therefore more relevant when it is dominated by as few
    candidates as possible.

    Dominance Score

    Example of Instance 2:

    Let S ∈ C(Q, T) and S' ∈ C(Q, T) be two candidates of a query Q in an XML data
    tree T. If S ≻ S', then score_dd(S) ≤ score_dd(S').

    This can be proved in the same manner as the previous example. For any two
    candidates S ∈ C(Q, T) and S' ∈ C(Q, T), if S ≻ S', then for every S_i ∈ C(Q, T)
    with S_i ≻ S we also have S_i ≻ S' [16][20][24]. Therefore

    \[ |\{ S_i \in C(Q, T) \mid S_i \succ S \}| \le |\{ S_i \in C(Q, T) \mid S_i \succ S' \}| \]

    that is, score_dd(S) ≤ score_dd(S').
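    Both scores follow directly from the dominance test. The sketch below computes
    them by brute force for the ten two-dimensional example vectors S1 to S10 that
    appear in Table 3, assuming that larger relationship values are preferred. The
    function names are illustrative choices.

```python
# Two-dimensional keyword relationship vectors of the candidates S1..S10
# (the example values listed in Table 3).
CANDIDATES = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}


def dominates(a, b):
    """a dominates b: never smaller, strictly larger in at least one position."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def dominating_score(name, cands):
    """Number of other candidates that `name` dominates (formula 3)."""
    return sum(dominates(cands[name], v) for o, v in cands.items() if o != name)


def dominated_score(name, cands):
    """Number of other candidates that dominate `name` (formula 4)."""
    return sum(dominates(v, cands[name]) for o, v in cands.items() if o != name)


for name in CANDIDATES:
    print(name, dominating_score(name, CANDIDATES), dominated_score(name, CANDIDATES))
```

    S1 dominates S5 in this data, and accordingly its dominating score is higher and
    its dominated score lower than those of S5, as the two examples above state.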

    5.0 ALGORITHMS USED TO RETRIEVE TOP-K RESULTS

    This section introduces algorithms that identify the relevant search results and
    the top-k answers, based on skyline semantics and the ranking criteria defined
    above. To obtain the set of LCA-based candidates of a given keyword query, the
    research adopts inverted indexes, as do other significant approaches in the
    literature [17]. These indexes are built offline while the XML database tree is
    parsed. Specifically, let Q = {k_1, ..., k_q} be a given keyword query and IL_i
    be the inverted list of keyword k_i. Every entry in the inverted list IL_i is the
    Dewey code of a node that contains the keyword k_i. The candidate set C of query
    Q is defined as

    \[ C = \{\, \mathrm{lca}(n_1, \ldots, n_q) \mid n_1 \in IL_1, \ldots, n_q \in IL_q \,\} \]

    where lca(n_1, ..., n_q) is an operation that returns the lowest common ancestor
    of {n_1, ..., n_q}. The keyword relationship vector of every candidate is
    computed at the same time, during the candidate generation process. The
    generated candidates are stored in a list ordered by the values of their keyword
    relationship vectors [4][17][25]. The detailed explanations are given in the
    following subsections.
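    The candidate set definition can be illustrated with a brute-force sketch that
    enumerates every combination of one node per inverted list and collects the
    lowest common ancestors. The inverted lists below are hypothetical, and a real
    system would avoid the full Cartesian product; the sketch only mirrors the set
    definition.

```python
from itertools import product


def lca(*deweys):
    """Lowest common ancestor of Dewey codes = their longest common prefix."""
    prefix = []
    for components in zip(*(d.split(".") for d in deweys)):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return ".".join(prefix)


def candidate_roots(inverted_lists):
    """Enumerate C = {lca(n1, ..., nq) | n1 in IL1, ..., nq in ILq}."""
    return {lca(*nodes) for nodes in product(*inverted_lists)}


# Hypothetical inverted lists IL1..IL3: Dewey codes of nodes containing k1, k2, k3.
IL = [["0.1.0.0"], ["0.1.1.0"], ["0.1.1.1", "0.2.0"]]

print(sorted(candidate_roots(IL)))   # ['0', '0.1']
```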

    Naïve Algorithm for Selection of Top-K Answers

    Algorithm 1: Naïve Algorithm [17]

    Algorithm 1 illustrates the naïve algorithm for identifying the top-k results
    according to their dominated scores (and, similarly, their dominance and
    dominating scores). The algorithm iterates through every candidate in the
    candidate set and calculates its score by performing pairwise dominance checks
    between that candidate and all other candidates in the set [17][21][23]. The
    result set is then updated according to a comparison between the score of the new
    candidate and that of the current k-th candidate in the top-k results
    [17][21][24][28].

    The major drawback of this algorithm is its very high computational cost.
    Regardless of the value of k, it must iterate through every candidate in the
    candidate set and calculate each candidate's score by performing pairwise
    dominance checks against all other candidates in the set [16][20][24]. In other
    words, whatever the value of k, the algorithm exhaustively performs all pairwise
    dominance tests across all candidates.
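    A minimal sketch of this exhaustive strategy is shown below. It uses the ten
    example vectors S1 to S10 from Table 3 and assumes that larger relationship
    values are preferred. As a simplification, it sorts all scores at the end
    instead of maintaining the top-k set incrementally; the counter makes the cost
    visible.

```python
CANDIDATES = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}


def dominates(a, b):
    """a dominates b: never smaller, strictly larger in at least one position."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def naive_top_k_dominated(cands, k):
    """Score every candidate against all others, then keep the k lowest scores."""
    checks = 0
    scores = {}
    for name, vec in cands.items():
        score = 0
        for other, other_vec in cands.items():
            if other == name:
                continue
            checks += 1                      # one pairwise dominance check
            if dominates(other_vec, vec):
                score += 1
        scores[name] = score
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return ranked[:k], checks


top3, checks = naive_top_k_dominated(CANDIDATES, 3)
print(top3)     # S1 and S3 (score 0), then a candidate with score 1
print(checks)   # 90 = 10 * 9, independent of k
```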

    TABLE 3: TWO-DIMENSIONAL CANDIDATE DATA SET [24]

            D1      D2
    S1      0.95    0.9
    S2      0.15    0.5
    S3      0.1     0.95
    S4      0.5     0.4
    S5      0.8     0.8
    S6      0.9     0.4
    S7      0.4     0.4
    S8      0.3     0.2
    S9      0.7     0.6
    S10     0.3     0.3

    For instance, given the set of candidates in Table 3, identifying the top-3
    results requires calculating the score of each candidate S_i (1 ≤ i ≤ 10) by
    iterating over the 9 other candidates and conducting a pairwise dominance check
    with each [24][28]. This takes 10 × 9 = 90 pairwise dominance checks. In general,
    calculating the score of one candidate in a set of n candidates requires pairwise
    dominance checks between that candidate and the (n − 1) other candidates in the
    set.

    Top-K Dominated Algorithm (TKDD)

    The chief aim of the TKDD algorithm is to find efficiently, for each candidate,
    the number of other candidates that dominate it, while avoiding exhaustive
    pairwise comparisons between the candidates [2][8][24]. Once k results have been
    retrieved, the score of the k-th result is used as a maximum threshold, and
    candidates whose dominated scores exceed the threshold are pruned [24]. In
    addition, the algorithm can safely terminate once the scores of all remaining
    candidates exceed the threshold. More specifically, TKDD proceeds through the
    following four steps:

    (i) Initialization: the result set R and minValue are initialized;

    (ii) Termination condition: if the MS() value of the current candidate S is below
    the minimum value of the current k-th candidate in R, the algorithm terminates
    and the result set is returned;

    (iii) Dominance checks: pairwise dominance checks are performed between S and
    every other candidate S' in the search space of S. The dominated score of S is
    increased by 1 every time another candidate dominates it [17][22];

    (iv) Result updates: if k results exist and the dominated score of the k-th
    candidate is larger than the current candidate's score, the k-th candidate is
    ejected and the current candidate is put into R; otherwise, if fewer than k
    results exist in R, the current candidate is inserted into R. Finally, once the
    size of R is k, the threshold minValue is updated.

    TABLE 4: SEARCH SPACES LA AND LB USED TO CALCULATE score_dd(Si) AND score_dg(Si)
    RESPECTIVELY, DERIVED FROM A LIST L OF CANDIDATES WHICH ARE SORTED IN DESCENDING
    ORDER OF MS-VALUES [24]

    TABLE 5: CANDIDATES LIST SORTED IN DESCENDING ORDER OF MS-VALUES [24]

    TABLE 6: DOMINANCE CHECKS COUNT USED FOR CALCULATION OF THE DOMINATED CANDIDATE
    SCORES [24]

    Figure 4: List of candidates sorted in descending order using their MS values [24]

    Algorithm 2: TKDD [17]
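    The four TKDD steps can be sketched as follows. Several details are assumptions
    made for this sketch and are not confirmed by the text: MS(S) is taken to be the
    largest component of a candidate's vector, minValue is taken to be the smallest
    component found in any vector currently in R, and the search space of a
    candidate is taken to be all candidates whose MS value is at least as large.
    Larger relationship values are assumed to be preferred, and the sample vectors
    are the ten candidates of Table 3.

```python
CANDIDATES = {
    "S1": (0.95, 0.9), "S2": (0.15, 0.5), "S3": (0.1, 0.95), "S4": (0.5, 0.4),
    "S5": (0.8, 0.8), "S6": (0.9, 0.4), "S7": (0.4, 0.4), "S8": (0.3, 0.2),
    "S9": (0.7, 0.6), "S10": (0.3, 0.3),
}


def dominates(a, b):
    """a dominates b: never smaller, strictly larger in at least one position."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def tkdd(cands, k):
    """Top-k candidates by lowest dominated score, with early termination."""
    # Assumed MS value: the largest component. Sorted in descending order, any
    # candidate that can dominate S has an MS value >= MS(S).
    order = sorted(cands.items(), key=lambda kv: max(kv[1]), reverse=True)
    results = []       # (i) result set R: (score, name, vector), best first
    min_value = None   # (i) threshold, set once R holds k results
    checks = 0
    for name, vec in order:
        # (ii) Every vector in R is component-wise above max(vec), so all k
        # results dominate this candidate and every later one: stop.
        if len(results) == k and max(vec) < min_value:
            break
        # (iii) Dominance checks, restricted to candidates with MS >= MS(S).
        score = 0
        for other, other_vec in order:
            if max(other_vec) < max(vec):
                break
            if other == name:
                continue
            checks += 1
            if dominates(other_vec, vec):
                score += 1
        # (iv) Result update: fill R, or replace its worst entry if beaten.
        if len(results) < k:
            results.append((score, name, vec))
        elif results[-1][0] > score:
            results[-1] = (score, name, vec)
        results.sort(key=lambda r: r[0])
        if len(results) == k:
            min_value = min(min(r[2]) for r in results)
    return [(n, s) for s, n, _ in results], checks


print(tkdd(CANDIDATES, 1))   # S1 with dominated score 0, after only a few checks
print(tkdd(CANDIDATES, 3))
```

    Early termination is safe under these assumptions because every remaining
    candidate is dominated by all k current results, so its dominated score is at
    least k, while the k best candidates always have scores below k.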

    candidate found to exist in the search space provided will be performed [24].

    Concurrently, the dominating score of the candidate is calculated;

    (iv) Result set update

    6.0 EXPERIMENTAL EVALUATION

    The researcher designed and performed a set of experiments to analyze the search
    performance of the approach. The results of the experiments are evaluated in
    order to compare the efficiency and quality of the proposed approach with those
    of other possible approaches [3][5][6][11][17][24][28].

    Experimental Setup

    The experiments were conducted on a Pentium 4, 3.2 GHz computer running Windows
    XP Professional with 2 GB of internal memory. The data sets were Auction
    (….7 MB), Mondial (1 MB) and DBLP Computer Science Bibliography (877 MB). The
    DBLP Computer Science Bibliography contains bibliographic information on major
    computer science proceedings and journals. Mondial is a worldwide geographic
    database integrated from the CIA World Factbook, the TERRA database and the
    international atlas, among other sources. Auction is a synthetic benchmark data
    set generated by the XML generator from XMark using the default DTD
    [13][25][27].

    Query Sets

    The researchers asked a group of learners to submit fifty keyword queries to be
    evaluated on every data set. Every query consisted of a set of search keywords
    together with a brief description, which was needed to understand and identify
    the intention of the query [3][5][6][11][17][24][28]. The researchers also
    observed that searching a specific domain, such as the three data sets used in
    the experiments, was not always effective because keyword queries are ambiguous.
    This made it hard for users to express their search intention. As a result, it is
    sometimes difficult to obtain the relevant results of the queries, which are a
    prerequisite for analyzing the performance of the proposed approach and of the
    other available approaches [3][5][24][28].

    Search Quality

    The researchers compared the quality of the DLCA approach with that of other
    existing approaches: ELCA, CVLCA, XReal, MLCA, SLCA and XSearch. Quality was
    measured with three metrics that are popular in information retrieval: precision
    (P), recall (R) and F-measure [6][11][17][28]. To compute precision and recall,
    the researchers manually reformulated the keyword queries into schema-aware
    queries based on the schemas of the data sets and the descriptions of the keyword
    queries. The results of the transformed queries then served as the reference
    against which precision and recall were computed, as follows. Given a keyword
    query Q and its corresponding transformed XQuery, the accurate result set of the
    XQuery is taken as the relevant results [2][18][27], and the result that a
    specific algorithm returns for Q is recorded as the retrieved results
    [2][11][24][28]. The precision and recall of the algorithm are then defined as
    follows.

    The precision is the fraction of the retrieved results that are relevant to the
    search:

    \[ P = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{retrieved results})|} \]

    The recall is the fraction of the relevant results that are successfully
    retrieved by the search system:

    \[ R = \frac{|(\text{relevant results}) \cap (\text{retrieved results})|}{|(\text{relevant results})|} \]

    The F-measure, which shows the trade-off between recall and precision, is
    computed as:

    \[ F\text{-measure} = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R} \]

    Where β = 1, recall and precision are weighted equally; where β < 1, precision is
    emphasized; and where β > 1, recall is emphasized.
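    The three metrics can be computed directly from the two result sets. The sketch
    below is a straightforward transcription of the formulas; the function name and
    the sample result identifiers are hypothetical.

```python
def precision_recall_f(relevant, retrieved, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of results."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denominator = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical query: four relevant results, three retrieved, two of them relevant.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "x"})
print(round(p, 3), round(r, 3), round(f, 3))   # 0.667 0.5 0.571
```

    With beta below 1 the F-measure moves towards the precision value, and with beta
    above 1 it moves towards the recall value.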

    It follows from these definitions that the relevant results of each keyword query
    must be determined before the evaluation metrics can be calculated. To obtain the
    relevant results of the tested queries, the researchers manually formed the
    corresponding schema-aware XQuery with the assistance of users [6][17]. The
    results of these queries were then used as the basis for evaluating the
    performance of the proposed approach and of the other available approaches.

    The researchers conducted experiments with a set of 50 keyword queries using the
    various approaches, and measured the recall and precision of every approach by
    averaging the recall and precision values of the tested queries. The recall and
    precision of the approaches on the three data sets are compared below.

    TABLE 7: PRECISION AND RECALL OF QUERIES ON MONDIAL DATA [11]

                 ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
    Precision    0.503   0.712   0.635    0.670   0.712   0.721   0.922
    Recall       1.000   0.624   0.943    0.910   0.903   0.647   0.939

    TABLE 8: PRECISION AND RECALL OF QUERIES ON AUCTION DATA [11]

                 ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
    Precision    0.478   0.706   0.623    0.640   0.699   0.703   0.901
    Recall       1.000   0.650   0.931    0.920   0.907   0.650   0.931

    TABLE 9: PRECISION AND RECALL OF QUERIES ON DBLP DATA [11]

                 ELCA    SLCA    XSearch  CVLCA   MLCA    XREAL   DLCA
    Precision    0.523   0.733   0.640    0.688   0.720   0.733   0.934
    Recall       1.000   0.647   0.941    0.911   0.923   0.647   0.941

    TABLE 10: COMPARISONS ON RANKING EFFECTIVENESS OF THE ALGORITHMS [11]

                 MAP     R-PREC  bpref   R-RANK  P@1     P@5     P@10
    TKDD         0.870   0.830   0.750   0.860   0.890   0.870   0.810
    TKDG         0.850   0.820   0.790   0.870   0.920   0.890   0.840
    TKD-0.25     0.840   0.800   0.790   0.860   0.910   0.910   0.860
    TKD-0.50     0.860   0.820   0.760   0.860   0.900   0.870   0.820
    TKD-0.75     0.870   0.850   0.730   0.880   0.880   0.850   0.800
    XRANK        0.670   0.750   0.610   0.710   0.690   0.680   0.650
    XSEARCH      0.700   0.770   0.630   0.680   0.730   0.680   0.660
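    Several of the ranking-effectiveness measures reported for the
    algorithms are standard information retrieval metrics. The sketch below
    shows how precision at k (P@k), average precision, mean average
    precision (MAP) and reciprocal rank are commonly computed from a ranked
    result list. It assumes that the R-RANK column denotes reciprocal rank,
    which the source does not state, and all function names and sample data
    are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for r in ranked[:k] if r in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant result."""
    hits, total = 0, 0.0
    for rank, result in enumerate(ranked, start=1):
        if result in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, result in enumerate(ranked, start=1):
        if result in relevant:
            return 1.0 / rank
    return 0.0


def mean_average_precision(queries):
    """MAP over a list of (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Hypothetical ranked lists for two queries.
queries = [
    (["a", "x", "b", "y"], {"a", "b"}),
    (["x", "a"], {"a"}),
]
print(mean_average_precision(queries))
```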

    All the ranking algorithms make it possible to identify the top ten
    results at a precision of between eighty and eighty-five percent. The
    mean average precision of the algorithm is 85 percent, and the
    researcher could achieve even more accurate precision by selecting a
    suitable value that maximizes the balance between the dominating and
    dominated scores [6][11][24][28].

    Efficiency and Scalability of Top-K Algorithms

    The researchers tested ten queries of various lengths on every data set.
    In the default scenario they tested about five thousand candidates, and
    the number of results found was thirty. For queries that produced fewer
    candidates than required, the researchers replicated the candidates
    repeatedly until they obtained the required number of candidates
    [3][11]. They then randomly selected the required number of candidates
    from the duplicated set. The computation cost of the algorithms is shown
    in the figure below. It is clear that when the number of candidates
    increases, the processing time of the algorithms also increases, but at
    different rates [3][9][14][17]. TKDD is the most efficient method in
    this case because it is less affected by the increase in the number of
    candidates. TKDD is mainly concerned with the results that are dominated
    by the fewest possible candidates. Because these results are usually
    located at the top of the sorted candidate list, it searches only a
    small portion of that list. For TKDG the search space is much larger
    and, as a result, a delay is expected. On the other hand, the lower
    performance of TKDD is also a result of the dominating score, which
    explains why its processing time rises in the same way as that of TKDG,
    which has a small overhead for calculating the dominating score
    [5][9][13][24][28]. From the results in Figure 5, it is clear that the
    TKDD processing time is less affected by the increase in the number k of
    returned results, and it can return from ten to one hundred results from
    the set of five thousand candidates within a second. The TKDG processing
    time is more affected by the change of this parameter, and it takes 2.5
    seconds to return the top one hundred results from a set of five
    thousand candidates.
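    The dominating score discussed in this section can be illustrated with
    a naive top-k dominating query. The sketch below assumes that each
    candidate is a tuple of scores in which larger values are better, and
    that the dominating score of a candidate is the number of other
    candidates it dominates. It compares every pair of candidates, so it is
    only a reference illustration of the idea. It is not the TKDD or TKDG
    algorithm, whose pruning strategies are not reproduced here, and the
    names and data are hypothetical.

```python
def dominates(a, b):
    """True when a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def top_k_dominating(candidates, k):
    """Return the k candidates that dominate the most other candidates."""
    scored = []
    for i, cand in enumerate(candidates):
        score = sum(
            1 for j, other in enumerate(candidates)
            if j != i and dominates(cand, other)
        )
        scored.append((score, i))
    # Highest dominating score first; ties keep the original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(candidates[i], score) for score, i in scored[:k]]


# Hypothetical candidates described by two relevance scores each.
candidates = [(0.9, 0.8), (0.5, 0.5), (0.7, 0.9), (0.1, 0.2)]
print(top_k_dominating(candidates, 2))
```

    The pairwise comparison costs quadratic time in the number of
    candidates, which is why algorithms that inspect only the top of a
    sorted candidate list can scale better as the candidate set grows.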

    Figure 5: Ranking efficiency of TKDD, TKDG and TKD algorithms [9]

    7.0 CONCLUSIONS

    In this thesis the researchers have studied the problem of identifying
    the most accurate results, and the top-k most appropriate results, for
    XML keyword queries. Searching for documents by keyword has been widely
    accepted as a very convenient way of retrieving resources from the
    remote servers that hold that type of data on the internet. The majority
    of search engines, such as Google, Bing and Yahoo, have adopted these
    technologies to facilitate data mining and data warehousing efficiently.
    The adoption of keywords for querying databases has attracted
    considerable work from the research community in the fields of databases
    and information retrieval (IR).

    XML documents are composed of nested XML elements, from the root element
    down to the nested sub-elements. XML elements often reference other
    elements, which are queried as XML values, and the text content is
    therefore captured using the predicate contains(u, k). The predicate
    returns true when the element u has keyword k. An XML query Q is a
    mapping from an XML database D to a sequence of XML documents that
    characterizes the query output. As a result, when the XML database
    environment is UD and the XML document sequence environment is S, the
    query is a mapping Q: UD → S. Q(D) is the result of query Q over
    database D, where the query is expressed in an XML query language such
    as XQuery. For a sequence s, e ∈ s is true when e is in s.

    A p-document is a probabilistic XML document that specifies a
    probability distribution over a space of deterministic XML documents.
    Each deterministic document that belongs to this space is referred to as
    a possible world. A p-document is represented as a labeled tree that has
    ordinary and distributional nodes. Ordinary nodes are regular XML nodes
    and may appear in deterministic documents, whereas distributional nodes
    are used only to define the probabilistic process that generates
    deterministic documents and do not appear in those documents. In the
    PrXML{ind, mux} probabilistic XML model adopted here, two types of
    distributional nodes appear in a p-document: MUX and IND.

    Keyword search is a very efficient way of facilitating the retrieval of
    documents because it does not require the user to learn any concepts. It
    is an advancement on traditional retrieval, which required users to
    master the particular IP (Internet Protocol) addresses of documents or
    information content and to type them in the URL bar. This later
    advanced, and the IP addresses could be attached to web links. From
    this, search engines were developed with more interactive and responsive
    algorithms able to handle large amounts of information, including data
    mining.

    A variety of approaches to keyword queries over XML data have been
    assessed and reviewed. The basic approaches that currently exist use
    lowest common ancestor (LCA) semantics, as opposed to general graph
    semantics, to identify the hit list for a given keyword query. This
    approach generates results composed of all candidates, also known as
    subtrees, that contain an instance of the queried keywords. The results
    returned under LCA semantics can be numerous, yet the user may be
    interested in only a portion of the whole hit list. It therefore remains
    an unsolved issue to identify the exact data required by the user of the
    system. Ideally, the system would generate exactly the piece required by
    the user, rather than a whole set of hits that leaves the user the extra
    job of filtering the content until they obtain that piece.

    The researchers have strived to address the three vital requirements for
    effective XML keyword search. They have introduced a new method of
    analyzing the relationship between query keywords in the candidates
    using the idea of mutual information, and have proposed a new DLCA
    semantics for keyword queries. They have also proposed a method of
    selecting DLCA results from multiple candidates, and three ranking
    methods for selecting the top-k results based on skyline query
    semantics. Several properties have been proven and used to accelerate
    the proposed algorithms. Experiments have been conducted to evaluate the
    approach, and they show that it performs better than the compared
    approaches on the data sets tested and the evaluation metrics examined.
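    The p-document model described in this section, with ordinary nodes and
    the two distributional node types IND and MUX, can be made concrete by
    enumerating possible worlds. The sketch below assumes the usual reading
    of these node types: the children of an IND node are chosen
    independently, while at most one child of a MUX node is chosen. The
    Node class, the labels and the probabilities are hypothetical and
    serve only as an illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                 # "ord", "ind" or "mux"
    label: str = ""           # used by ordinary nodes only
    # List of (child, probability); the probability is ignored under "ord".
    children: list = field(default_factory=list)


def combine(option_lists):
    """Combine per-child options: concatenate forests, multiply probabilities."""
    results = [((), 1.0)]
    for options in option_lists:
        results = [(f1 + f2, p1 * p2) for f1, p1 in results for f2, p2 in options]
    return results


def worlds(node):
    """Enumerate (forest, probability) pairs generated below a node."""
    if node.kind == "ord":
        combos = combine([worlds(child) for child, _ in node.children])
        return [(((node.label, forest),), p) for forest, p in combos]
    if node.kind == "ind":
        per_child = []
        for child, prob in node.children:
            kept = [(f, prob * q) for f, q in worlds(child)]
            per_child.append([((), 1.0 - prob)] + kept)
        return combine(per_child)
    if node.kind == "mux":
        total = sum(prob for _, prob in node.children)
        options = [((), 1.0 - total)] if total < 1.0 else []
        for child, prob in node.children:
            options += [(f, prob * q) for f, q in worlds(child)]
        return options
    raise ValueError(f"unknown node kind: {node.kind}")


# Hypothetical p-document: one of two titles (or none), and an optional author.
title = Node("mux", children=[(Node("ord", "titleA"), 0.7),
                              (Node("ord", "titleB"), 0.2)])
author = Node("ind", children=[(Node("ord", "author"), 0.5)])
doc = Node("ord", "book", children=[(title, 1.0), (author, 1.0)])

for forest, prob in worlds(doc):
    print(round(prob, 3), forest)
```

    Each enumerated forest contains only ordinary nodes, matching the
    statement that distributional nodes do not appear in the deterministic
    documents, and the probabilities of all possible worlds sum to one.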

    A simple cost model introduced by the authors was based on CPU costs and
    weighted I/O, and used statistics on the number of data pages occupied
    by relations to bind the cost model to concrete values. The dynamic
    programming algorithm selects an optimal operator for specific access
    paths. After that, an optimal join order is determined based on an
    assumption of local optimality. In order to prune the search space
    early, not all possible enumerations are considered. Instead, the focus
    is on interesting join orders, for instance orders that avoid
    introducing additional Cartesian products. Graefe and DeWitt presented
    the EXODUS Optimizer Generator; this system was not confined to a
    specific data model but supported the specification of algebraic
    transformations as rules. Combined with a concrete data model, the rules
    serve as input for the optimizer generator, which creates a tailor-made
    query optimizer.

    This paper systematically explores XML structure-based answers and user
    expectations in order to identify the significance of XML keyword search
    semantics. It further proposes a semantics-based methodology for XML
    keyword queries, principally through data-centric coherency ranking,
    which is rooted in the design of the domain and database and is based on
    data dependence and mutual information models. Consequently, keyword
    query results are produced under schema reorganization structures, which
    process, present and rank queries through coherency ranking to develop
    answers. Actual XML data indicates that coherency ranking is the
    methodology with the highest precision, recall and ranking quality
    compared with the other approaches.
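    The dynamic programming search for a join order mentioned in this
    section can be sketched as follows. The sketch builds left-deep orders
    bottom-up over subsets of relations and skips any extension that would
    require a Cartesian product. As a stand-in for the CPU and weighted I/O
    cost model, it simply sums the estimated sizes of intermediate results;
    that cost function, the relation names, the cardinalities and the
    selectivities are all assumptions made for illustration.

```python
from itertools import combinations


def best_join_order(cards, selectivity):
    """Find a cheap left-deep join order by dynamic programming.

    cards: {relation: row count}
    selectivity: {frozenset({a, b}): selectivity of the join between a and b}
    Returns (cost, order) or None when the relations cannot all be joined.
    """
    rels = sorted(cards)

    def connected(subset, rel):
        # rel may extend subset only if a join predicate links them.
        return any(frozenset({s, rel}) in selectivity for s in subset)

    def size(subset):
        # Estimated cardinality of joining every relation in subset.
        rows = 1.0
        for rel in subset:
            rows *= cards[rel]
        for pair, sel in selectivity.items():
            if pair <= subset:
                rows *= sel
        return rows

    best = {frozenset({r}): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            subset = frozenset(combo)
            for rel in combo:
                rest = subset - {rel}
                if rest not in best or not connected(rest, rel):
                    continue  # would need a Cartesian product
                cost = best[rest][0] + size(subset)
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, best[rest][1] + (rel,))
    return best.get(frozenset(rels))


# Hypothetical statistics: A joins B, and B joins C.
cards = {"A": 1000, "B": 100, "C": 10}
selectivity = {frozenset({"A", "B"}): 0.01, frozenset({"B", "C"}): 0.1}
print(best_join_order(cards, selectivity))
```

    Because each subset keeps only its cheapest plan, the search relies on
    the same local optimality assumption described above: the best plan for
    a larger set of relations is built from the best plans of its subsets.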
    Current keyword searches in XML can be divided into tree-based and
    graph-based searches, which are largely predicated on structural
    document features. However, these structural approaches do not
    comprehensively utilize the hidden semantics within XML documents,
    leading to issues in the processing of specific classes of keyword
    queries. The growing popularity of XML has intensified the need for an
    accessible and precise XML query interface that is based on natural
    language, and for search procedures that exploit XML structure to
    simplify queries for ordinary users of XML databases.
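    The tree-based keyword search semantics discussed in this section can
    be illustrated with a small smallest-LCA (SLCA) computation. The sketch
    below uses Python's standard xml.etree.ElementTree module and assumes
    that contains(u, k) holds when keyword k occurs as a tag name or as a
    word of text anywhere in the subtree rooted at u. It re-scans subtrees
    and is meant only to show the semantics, not an efficient algorithm;
    the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET


def contains(u, k):
    """True when the subtree rooted at element u holds keyword k."""
    k = k.lower()
    for e in u.iter():
        words = (e.text or "").lower().split()
        if k == e.tag.lower() or k in words:
            return True
    return False


def slca(root, keywords):
    """Return the smallest elements whose subtrees contain every keyword."""
    results = []

    def visit(u):
        if not all(contains(u, k) for k in keywords):
            return False
        # u qualifies; it is a smallest answer only if no child qualifies.
        child_hits = [visit(child) for child in u]
        if not any(child_hits):
            results.append(u)
        return True

    visit(root)
    return results


doc = ET.fromstring(
    "<bib>"
    "<book><title>XML keyword search</title><author>Liu</author></book>"
    "<book><title>Graph theory</title><author>Chen</author></book>"
    "<article><title>keyword ranking</title><author>Liu</author></article>"
    "</bib>"
)
print([e.tag for e in slca(doc, ["keyword", "liu"])])
print([e.tag for e in slca(doc, ["xml", "chen"])])
```

    In the second query the two keywords occur in different books, so the
    only common ancestor is the document root. This is an example of a
    structurally valid but not very meaningful answer that semantics-aware
    approaches aim to filter out.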

    REFERENCES

    [1] Alghamdi, Norah Saleh, Wenny Rahayu, and Eric Pardede. "Object-based
    semantic partitioning for XML twig query optimization." In Advanced
    Information Networking and Applications (AINA),

    Engineering Workshop,

    7 October 2010,
    <http://download.oracle.com/docs/cd/B10500_01/server.920/a96533/sqltrace.htm#8344>
    [15] Memory Configuration and Use 2008, B28274-02, Oracle, viewed
    15 September 2010,
    <http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/memory.htm>
    [16] N. Onose et al., "Rewriting Nested XML Queries Using Nested Views," in
    Proceedings of the ACM SIGMOD International Conference on
    Management of Data, Chicago, IL, USA, 2006, pp. 443-454.
    [17] B. Stantic et al., "Handling of Current Time in Native XML Databases," in
    Proceedings of the 17th Australasian Database Conference (Volume 49),
    Hobart, Australia, 2006, pp. 175-182.
    [18] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury, "Effective keyword search in
    relational databases," in SIGMOD Conference, 2006, pp. 563-574.
    [19] V. Hristidis, N. Koudas, Y. Papakonstantinou, and D. Srivastava, "Keyword
    proximity search in xml trees," IEEE Trans. Knowl. Data Eng., vol. 18, no. 4,
    pp. 525-539, 2006.
    [20] Y. Xu and Y. Papakonstantinou, "Efficient LCA based keyword search
    in xml data," in EDBT, 2008, pp. 535-546.
    [21] Z. Liu and Y. Chen, "Identifying meaningful return information for XML
    keyword search," in SIGMOD Conference, 2007, pp. 329-340.
    [22] C. Sun, C. Y. Chan, and A. K. Goenka, "Multiway SLCA-based keyword
    search in xml data," in WWW, 2007, pp. 1043-1052.
    [23] Z. Liu and Y. Chen, "Reasoning and identifying relevant matches for xml
    keyword search," PVLDB, vol. 1, no. 1, pp. 921-932, 2008.
    [24] S. Amer-Yahia and M. Lalmas, "Xml search: languages, index and
    scoring," SIGMOD Record, vol. 35, no. 4, pp. 16-23, 2006.
    [25] Y. Luo, X. Lin, W. Wang, and X. Zhou, "Spark: top-k keyword query in
    relational databases," in SIGMOD Conference, 2007, pp. 115-126.
    [26] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung, "Efficient
    aggregation of ranked inputs," in ICDE, 2006, p. 72.

    [27] D. Xin,