[ieee 2007 international symposium on applications and the internet workshops - hiroshima, japan...



An Extended AND Operations for Retrieving a Flexible Information Unit from Tree Structured Data

Taira Yoda
Yamagata Junior College
Graduate School of Human Science and Cultural Studies, Kobe University

Hidenari Kiyomitsu, Kazuhiro Ohtsuki, Jun-ya Morishita
Faculty of Cross-Cultural Studies, Kobe University
Graduate School of Human Science and Cultural Studies, Kobe University

Abstract

Recently, an enormous volume of data has been created, and it keeps increasing. Various types of data are collected in data storages. Because each data management system provides its own individual retrieval mechanism, users have difficulty in that they must follow a different query manner for each system. In our work, we focus on the ubiquity of retrieval, and we discuss a representation of the user's query intention and a method for reflecting it in retrieval. Conceptually, a fork and a chain on a tree structure can be regarded as coordinate and subordinate relationships, respectively. Therefore, we attempt to retrieve data by overlaying the relationships between user input keywords on the tree structure. The user we assume is a person or a device that can make a tree to issue a query. We introduce an elaborate use of the tree.

1 Introduction

According to one research result, 5 EB of data was created in 2002. Data on the Internet has also exploded. The forms of data are various (HTML, XML, PDF and so on), and each kind of data is stored in its own schema. Under these circumstances, many data management systems have been developed, each providing its individual retrieval features, and users are forced to obey each system's query manner. Basically, to retrieve appropriate data from a data management system, a user needs to learn its data structure and query manner. On the other hand, users do not have a unified method for reflecting their query intentions when retrieving appropriate data from those systems. We want to equip users with an easy, useful retrieval mechanism. Web search engines provide keyword-based retrieval functions, and there is not much difference in query manner among most of them. Indeed, they are simple and useful for finding candidates for the data a user needs. However, they are not sufficient for accepting the user's query intention. In a keyword-based approach, it is necessary to consider the relationships among keywords and to return appropriate information in an appropriate volume. We therefore introduce a query method in which: 1) the user can easily express his query intention explicitly, only by arranging keywords and symbols that denote coordinate and/or subordinate conjunctions; 2) the system can easily translate data into a mediatable scheme, as DOM is derived from XML. We assume that data in a ubiquitous environment can be extracted in a tree representation.

Here, we show the outline of our work. The user illustrates his query intention as a part-of tree and represents it as a query statement, and our system interprets the statement into a query tree. At the same time, the data management system generates data trees. Then our system compares the query tree with the data trees and returns the matched nodes to the data management system (Figure 1).

Figure 1. Outline of Our Work

The rest of this paper is organised as follows. In Section 2, we describe related works and our contribution. In Section 3, we clarify the data structure and our model. Section 4 explains how our manipulation works in retrieval. In Section 5, we present our conclusions.

Proceedings of the 2007 International Symposium on Applications and the Internet Workshops (SAINTW'07) 0-7695-2757-4/07 $20.00 © 2007

2 Related Works

Several studies address retrieving an appropriate portion of a source document by using its graph structure. Tajima et al. [2] proposed a query method for Web data. They regarded the Web link structure as a graph and showed a novel methodology for deriving information units from the structure. Kinutani et al. [3] presented a technique for returning appropriate XML sub-documents to end users. They also discussed the granularity of the information unit and tried to provide an XML database system that end users can use without prior knowledge of the document structure. We share the idea that returning the whole document as a query result is too much.

The differences between these two related works and ours are: (1) we focus on the relationships among keywords and on the user's explicit intentions about the query; (2) ours is a non-attribute-oriented approach, because data is not fastened to specific formats in advance.

The meet operator in [1] is close to our RAND (explained later). Finding the lowest common ancestor node in a tree structure is a basis for analysing the structure. Both the meet operator and our RAND return sub-trees that include all input keywords. In addition to RAND, we introduce DAND, which finds the descendant of two nodes, and we discuss the combination of the two operators so that a user's query intention can be represented more explicitly. Our idea is not to infer the user's intention from an enumeration of keywords, but to reflect it in the query result by describing the relationships among keywords conceptually.

3 Data Structure

There are studies on understanding a document structure and automatically dividing the document into sub-documents according to its logical hierarchy, and on extracting topics and sub-topics from a document. On the other hand, there are some projects in which an original document is manually divided into arbitrary portions and data for retrieval is given to them according to their hierarchical structure. Even if we use the fruits of the above, it is not easy to make a common schema for various types of data in advance. We do not issue a query on a strict schema; we use only the inclusive relationships in our data structure. It is generally not difficult to organize trees from tag-based formats and databases. Also, we allow a keyword to be a term, a phrase, an acronym or an abbreviation that denotes a characteristic of the corresponding portion of data, and we describe each of them as a keyword. For convenience, we call a keyword attached by a data management system a delegate, to distinguish it from a user input keyword. In our model, we do not insist on a particular method by which delegates are given. All strings in a material could be delegates, and strings given by an arbitrary process could also be delegates even if they are not in the material.

Information Unit: The information unit of our approach is a material, from which a user wants to retrieve arbitrary portions of data appropriate to the input keyword(s). Examples are: i) entities in a logical structure, such as a chapter, a section or a paragraph in a book, a paper or a report; ii) a cut, a shot or a scene in a video; iii) a region or a sub-region from an arbitrary aspect in an image; iv) a topic, a sub-topic and so on. All of them are materials. Each book, video or image is a material too.

Data Tree: A data tree is a mapping of its original data. A node in a data tree is a mapping of a material in the original data. A data tree is defined as a tree on which nodes are mounted according to the inclusive relationships in the original data. A node in a data tree has one or more given delegate(s). One node can represent the sub-tree that consists of all the nodes under it.

Terminology: Let $T$ be a data tree for given data. Then let

$$N = \{\, n_i \mid n_i \text{ is a node in } T \,\}$$
$$D_i = \{\, d_{ij} \mid d_{ij} \text{ is a delegate of node } n_i \,\}$$
$$P_i = \{\, n_j \mid n_j \text{ is a node on the path from the root node to } n_i \,\}$$

For convenience of discussion, we define a utility function $\mathrm{path}$ as $\mathrm{path}(n_i) = P_i$. This reduces the symbols needed for a path and avoids confusion. Note that a path is approximated as the set of nodes on the path. That is because there are two ways to compose a content about two things in the same material. Under these circumstances, the existence of a node is more important than the order of the nodes, as discussed in Section 4. Finding a node in a set is also lighter than a path search.
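The terminology can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `Node` class, its field names and the example labels are hypothetical, and the sketch assumes that the path set of a node includes the node itself, which the text does not state explicitly.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so nodes can be set members
class Node:
    label: str                                   # hypothetical name of a material
    delegates: set[str] = field(default_factory=set)
    parent: Node | None = None
    children: list[Node] = field(default_factory=list)

    def add(self, label: str, *delegates: str) -> Node:
        """Mount a child node under this node (inclusive relationship)."""
        child = Node(label, set(delegates), parent=self)
        self.children.append(child)
        return child


def path(n: Node) -> frozenset[Node]:
    """Set of nodes on the path from the root node to n (n included)."""
    nodes = []
    cur = n
    while cur is not None:
        nodes.append(cur)
        cur = cur.parent
    return frozenset(nodes)


# A three-level data tree: book > chapter > section.
book = Node("book", {"Kobe guide"})
chapter = book.add("chapter 1", "Kobe")
section = chapter.add("section 1.1", "festival")

print(sorted(m.label for m in path(section)))
```

Because the path is kept as a set, testing whether a node lies on it is a membership check rather than a traversal.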

4 Querying on Data Tree

Simple Querying: A query $q(k)$ with only one keyword $k$ returns the set $\{\, n_i \mid k \in D_i \,\}$ for each data tree. Simple querying is the basic retrieval. We show a simple example. Figure 2 illustrates a data tree in which the black nodes have a delegate $k$. The results of the query $q(k)$ on this tree are all the black nodes.
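A simple query amounts to collecting every node whose delegates contain the keyword. The following is a minimal sketch; the `Node` class, the helper names `walk` and `simple_query`, and the example tree are hypothetical and are not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing, so nodes can be put in sets
class Node:
    label: str
    delegates: set = field(default_factory=set)
    children: list = field(default_factory=list)


def walk(root):
    """Yield every node of the data tree rooted at root (pre-order)."""
    yield root
    for child in root.children:
        yield from walk(child)


def simple_query(root, keyword):
    """Return the set of nodes that have keyword among their delegates."""
    return {n for n in walk(root) if keyword in n.delegates}


root = Node("report")
a = Node("section A", {"Kobe"})
b = Node("section B", {"festival"})
c = Node("paragraph B.1", {"Kobe", "traffic jam"})
root.children = [a, b]
b.children = [c]

hits = simple_query(root, "Kobe")
print(sorted(n.label for n in hits))
```

Here the query for "Kobe" matches two nodes on different levels of the tree, which corresponds to the black nodes of the example.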

��� ��������� � �� ������

Our work aims to obtain a query mechanism that can reflect the relationships between keywords. A user usually submits a query to a retrieval system according to its syntax and rules. The major problem we focus on is that he cannot represent his intentions about the relationships among keywords, although he might have a sense of a structure on which the keywords are mounted. We utilise this sense for retrieval.


Figure 2. An Example of Simple Querying

Figure 3. Candidate Trees for an Example (four candidate trees, each over the nodes “Kobe”, “festival” and “traffic jam”)

An example query with “festival”, “Kobe” and “traffic jam” is interesting to us. Basically, this query targets data that has all of “festival”, “Kobe” and “traffic jam” as delegates of its nodes. Festival and traffic jam are on the same level, since both are events, but Kobe is on another level. We attempt to interpret the user intention of this example as in Figure 3. The leftmost tree in Figure 3 may illustrate “festival and traffic jam at Kobe”. The middle-left tree may be “festival at Kobe and traffic jam caused by a festival”. It seems somewhat rough, but it should at least be a candidate interpretation. The middle-right tree in Figure 3 illustrates “festival at Kobe and traffic jam caused by the festival”. It seems strict; however, if the sequence is allowed to be out of order, it is usable as an interpretation, because there are two ways to represent a context about a festival at Kobe: one places “festival” on an upper level than “Kobe”, and the other places “festival” on a lower level than “Kobe” in the hierarchy. Also, we assume that users do not have prior knowledge about the data structure. The rightmost tree in Figure 3 means that “a festival at Kobe” and “traffic jam” may not have an immediate relationship. To retrieve every pattern in Figure 3, we draw a sketch like the right side of Figure 4. This sketch illustrates all possible shapes of the relationships among those keywords. It seems somewhat redundant for describing a query statement, so we simplify it as in the left side of Figure 4.

Accordingly, we enable our user to represent the relationships between keywords (the user can express whether a relationship is coordinate or subordinate). He then interprets his intention into a tree. We call this tree an intention tree. In Figure 5, the left side indicates a subordinate relationship between keywords $A$ and $B$, such as $A$ of $B$. The right side indicates a coordinate relationship, $A$ and $B$. These are the primitive shapes for composing an intention tree.

Figure 4. Mapping of Query Statement to Graph

Figure 5. Primitive Shapes of Intention Tree

It would be ideal to derive a query statement from natural language, but these two semantics are sufficient for retrieving from a tree. First, we show how a user expresses a query statement. The left side of Figure 5 is represented as $A$ DAND $B$, and the right side as $A$ RAND $B$. The example in Figure 4 is denoted as: “festival” DAND “Kobe” RAND “traffic jam”. Here, we introduce two advanced operators, DAND and RAND. The operator DAND connects two keywords to represent a subordinate relationship. That is, DAND is used for finding materials that have details of “Kobe” about “festival” or details of “festival” about “Kobe”. We regard this as a direct descendant relationship on a tree. The other operator, RAND, connects two keywords to represent the relationship that both exist in the same material. We consider it to be a relative relationship on a tree. We describe the details of combining the two types of operator, RAND and DAND, later. To make the features of our approach clear, we confine our explanation of data manipulation to a single data tree.

Relative Relationships: A relative relationship between two nodes, $n_i \ \mathrm{relative}\ n_j$, is defined as:

$$\mathrm{path}(n_i) \cap \mathrm{path}(n_j) \neq \emptyset$$

The n-ary extension of the relative relationship, $n_i \ \mathrm{relative}\ \cdots\ \mathrm{relative}\ n_j$, is also defined as:

$$\mathrm{path}(n_i) \cap \cdots \cap \mathrm{path}(n_j) \neq \emptyset$$

Function for RAND: Nodes in the same data tree have relative relationships among them. However, we aim to retrieve a narrow portion. So we define the relatives semantics as the smallest sub-tree that includes both nodes, and we define a relatives AND function $\mathrm{rand}$ as follows:

$$\mathrm{rand}(n_i, n_j) = n_k \quad \bigl(n_k \mid \mathrm{path}(n_k) = \mathrm{path}(n_i) \cap \mathrm{path}(n_j)\bigr)$$
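The rand function can be sketched with plain set intersection. The sketch below is an illustration, not the authors' implementation: the `Node` class and the example tree are hypothetical, and each path set is assumed to include the node itself. The same intersection works for any number of nodes.

```python
class Node:
    """Hypothetical data-tree node; only the parent link is needed here."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def path(n):
    """Set of nodes from the root down to n (n included, by assumption)."""
    nodes = set()
    while n is not None:
        nodes.add(n)
        n = n.parent
    return nodes


def rand(*nodes):
    """Lowest node on the path common to all given nodes, or None."""
    common = set.intersection(*(path(n) for n in nodes))
    if not common:  # the nodes belong to different data trees
        return None
    # The common nodes form a chain starting at the root, so the node
    # with the longest path is the lowest one.
    return max(common, key=lambda n: len(path(n)))


kobe = Node("Kobe")
festival = Node("festival", parent=kobe)
parade = Node("parade", parent=festival)
traffic = Node("traffic jam", parent=kobe)

print(rand(parade, traffic).label)
print(rand(festival, parade).label)
```

When one node is an ancestor of the other, the ancestor itself is returned, because it is the root of the smallest sub-tree that includes both.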


The n-ary extension of $\mathrm{rand}$ is:

$$\mathrm{rand}(n_i, \ldots, n_j) = n_k \quad \bigl(n_k \mid \mathrm{path}(n_k) = \mathrm{path}(n_i) \cap \cdots \cap \mathrm{path}(n_j)\bigr)$$

The $\mathrm{rand}$ function returns the lowest node on the path common to all the given nodes.

Direct Descendant Relationships: A direct descendant relationship between two nodes, $n_i \ \mathrm{direct\text{-}descendant}\ n_j$, is defined as:

$$\mathrm{path}(n_i) \subseteq \mathrm{path}(n_j) \ \lor\ \mathrm{path}(n_i) \supseteq \mathrm{path}(n_j)$$

That is, either $n_i$ or $n_j$ is on the path to the other node. The n-ary extension of the direct descendant relationship, $n_i \ \mathrm{direct\text{-}descendant}\ \cdots\ \mathrm{direct\text{-}descendant}\ n_j$, is also defined as:

$$\exists\, n_k \in \{n_i, \ldots, n_j\} : \ \mathrm{path}(n_l) \subseteq \mathrm{path}(n_k) \ \text{ for all } n_l \in \{n_i, \ldots, n_j\}$$

That is to say, there exists a node for which all the other nodes are on the path to it.

Function for DAND: Suppose that a user wants to find documents written about $k_i$ of $k_j$; he will give these two keywords to some search engine. Again, our objective is to return narrower portions corresponding to them. So we define the direct descendant semantics as the smaller sub-tree, whose root is a node in $\{\, n_i \mid k_i \in D_i \,\}$ or $\{\, n_j \mid k_j \in D_j \,\}$, and we define a direct descendant AND function $\mathrm{dand}$ as follows:

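$$\mathrm{dand}(n_i, n_j) = \begin{cases} n_j & \mathrm{path}(n_i) \subseteq \mathrm{path}(n_j) \\ n_i & \mathrm{path}(n_i) \supseteq \mathrm{path}(n_j) \end{cases}$$

By the definition of $\mathrm{dand}$, it returns the lower node. The n-ary extension of $\mathrm{dand}$ is:

$$\mathrm{dand}(n_i, \ldots, n_j) = n_k \quad \bigl(n_k \mid \mathrm{path}(n_k) \supseteq \mathrm{path}(n_i) \cup \cdots \cup \mathrm{path}(n_j),\ n_k \in \{n_i, \ldots, n_j\}\bigr)$$

The $\mathrm{dand}$ function returns the lowest node for which all the other nodes are on the path to it.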
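A sketch of dand in the same set-based style follows. The `Node` class and the example tree are hypothetical, and returning `None` when no direct descendant relationship holds is a choice made for this sketch; the paper does not specify that case.

```python
class Node:
    """Hypothetical data-tree node; only the parent link is needed here."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def path(n):
    """Set of nodes from the root down to n (n included, by assumption)."""
    nodes = set()
    while n is not None:
        nodes.add(n)
        n = n.parent
    return nodes


def dand(*nodes):
    """Lowest given node whose path contains all the others, or None."""
    # Only the deepest of the given nodes can have all the others on its path.
    lowest = max(nodes, key=lambda n: len(path(n)))
    if all(path(n) <= path(lowest) for n in nodes):  # subset test on path sets
        return lowest
    return None


kobe = Node("Kobe")
festival = Node("festival", parent=kobe)
parade = Node("parade", parent=festival)
traffic = Node("traffic jam", parent=kobe)

print(dand(kobe, festival).label)
print(dand(festival, traffic))
```

The result does not depend on the argument order, which matches the remark that dand returns the lower node whichever keyword comes first.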

Each node in a data tree is represented as the set of nodes on the path from the root node to it. So our user can compose his query statements in an easier notation than XPath. Also, we can describe our proposed functions in set theory. A set operation is much lighter than an XPath search.

��� ������ ���� �� ����

We have prepared two advanced ANDs: RAND as a coordinate conjunction and DAND as a subordinate conjunction between keywords. As mentioned above, the n-ary extensions of $\mathrm{rand}$ and $\mathrm{dand}$ are natural and trivial, and the commutative law also holds. However, it is not easy to combine $\mathrm{rand}$ and $\mathrm{dand}$ in one operation. “A DAND B RAND C” and “A DAND C RAND B” should return different results, whereas “A RAND B DAND C” and “A RAND C DAND B” should return the same result, because the shapes of the intention trees of the former two are different and those of the latter two are the same. The reason is that DAND requires one node to be on the path to the other. In this sense, DAND is stricter than RAND. Therefore, we adopt the rule that DAND takes precedence over RAND when they are combined. This protocol is also reasonable in terms of computational cost, because it proceeds from the operation that makes the intermediate result narrower first.
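The precedence rule can be sketched as a small evaluator that splits a query on RAND and evaluates every DAND chain first. This is a sketch under assumptions: the paper does not say how several nodes matching one keyword are combined, so the code tries every combination of matching nodes; the token-list query format, the names `evaluate` and `matches`, and the example tree are hypothetical. The query is assumed to be well formed.

```python
from itertools import product


class Node:
    """Hypothetical data-tree node with delegates and a parent link."""

    def __init__(self, label, delegates, parent=None):
        self.label = label
        self.delegates = set(delegates)
        self.parent = parent


def path(n):
    nodes = set()
    while n is not None:
        nodes.add(n)
        n = n.parent
    return nodes


def rand(*nodes):
    common = set.intersection(*(path(n) for n in nodes))
    return max(common, key=lambda n: len(path(n))) if common else None


def dand(*nodes):
    lowest = max(nodes, key=lambda n: len(path(n)))
    return lowest if all(path(n) <= path(lowest) for n in nodes) else None


def evaluate(tree_nodes, tokens):
    """Evaluate a token list such as ["a", "DAND", "b", "RAND", "c"]."""
    # Split on RAND first, so that every DAND chain is evaluated before RAND.
    groups, current = [], []
    for token in tokens:
        if token == "RAND":
            groups.append(current)
            current = []
        elif token != "DAND":
            current.append(token)
    groups.append(current)

    def matches(keyword):
        return [n for n in tree_nodes if keyword in n.delegates]

    # Assumption: try every combination of nodes matching the keywords.
    group_hits = []
    for keywords in groups:
        combos = product(*(matches(k) for k in keywords))
        group_hits.append({dand(*c) for c in combos} - {None})
    return {rand(*c) for c in product(*group_hits)} - {None}


kobe = Node("n1", {"Kobe"})
festival = Node("n2", {"festival"}, parent=kobe)
traffic = Node("n3", {"traffic jam"}, parent=kobe)
nodes = [kobe, festival, traffic]

q1 = evaluate(nodes, ["festival", "DAND", "Kobe", "RAND", "traffic jam"])
q2 = evaluate(nodes, ["festival", "DAND", "traffic jam", "RAND", "Kobe"])
print([n.label for n in q1], [n.label for n in q2])
```

On this tree the first query returns the “Kobe” node, while the second returns nothing because “festival” and “traffic jam” are siblings, so swapping the keywords around DAND changes the result.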

Here we show another case, a part-of relationship in which A and B are parts of C. Its intention tree would have A and B as leaves and C as the root. An example is an okonomi-yaki with pork and oyster. This can be denoted as “okonomi-yaki” DAND “pork” DAND “oyster”. At this point, that query cannot retrieve any result from data trees organised in a part-of hierarchy, but “okonomi-yaki” DAND “pork” RAND “okonomi-yaki” DAND “oyster” can. However, the latter will return results with noise. We estimate that, even in the worst case, its precision is higher than or the same as that of a keyword-based retrieval without DAND or RAND, such as full-text search. Recall is estimated to be the same at worst. The worst case is that all results are root nodes. We believe it is useful that a user can explicitly represent at least the distinction between coordinate and subordinate relationships in his query intention. From another viewpoint, we do not define a more advanced AND for each preposition and conjunction, but only distinguish whether the connector is used for a subordinate or a coordinate relationship, because, at this point, we are not attempting to derive a query statement from natural language.
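The okonomi-yaki case can be checked on a three-node part-of tree. In this sketch the `Node` class is hypothetical and each keyword is assumed to match exactly one node, so the operators are applied to nodes directly.

```python
class Node:
    """Hypothetical data-tree node; only the parent link is needed here."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def path(n):
    nodes = set()
    while n is not None:
        nodes.add(n)
        n = n.parent
    return nodes


def rand(*nodes):
    common = set.intersection(*(path(n) for n in nodes))
    return max(common, key=lambda n: len(path(n))) if common else None


def dand(*nodes):
    lowest = max(nodes, key=lambda n: len(path(n)))
    return lowest if all(path(n) <= path(lowest) for n in nodes) else None


# Part-of hierarchy: pork and oyster are both parts of the okonomi-yaki.
okonomiyaki = Node("okonomi-yaki")
pork = Node("pork", parent=okonomiyaki)
oyster = Node("oyster", parent=okonomiyaki)

# "okonomi-yaki" DAND "pork" DAND "oyster": pork and oyster are siblings,
# so no single root path contains all three nodes.
chained = dand(okonomiyaki, pork, oyster)

# "okonomi-yaki" DAND "pork" RAND "okonomi-yaki" DAND "oyster"
combined = rand(dand(okonomiyaki, pork), dand(okonomiyaki, oyster))

print(chained)
print(combined.label)
```

The combined query returns the okonomi-yaki node, that is, the whole sub-tree under it, which is one way to see where the noise mentioned above comes from.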

5 Concluding Remarks

The essence of our work is to construct a bridge between the user's query structure and the data structure. In this paper, we introduced two advanced AND operators with which a user expresses his intentions about the relationships between keywords. The syntax of our query statement without the two ANDs is similar to that of ordinary search engines. That is, our proposed method is a keyword-based retrieval in which users can easily specify more detailed semantics. The outline of our work was to extract a query tree from a user's intention, compare it with data trees, and then return appropriate portions in an appropriate volume.

References

[1] A. Schmidt, M. Kersten and M. Windhouwer: Querying XML Documents Made Easy: Nearest Concept Queries, Proc. of ICDE '01, pp. 321-329, 2001.

[2] K. Tajima, K. Hatano, T. Matsukura, R. Sano and K. Tanaka: Discovery and Retrieval of Logical Information Units in Web, Proc. of WOWS '99, pp. 13-23, Berkeley, CA, 1999.

[3] H. Kinutani, M. Yoshikawa and S. Uemura: Identify Result Subdocuments of XML Search Conditions, Proc. of KYOTODL '00, pp. 254-261, 2000.
