[ieee 2011 international conference on computer networks and information technology (iccnit) -...


Upload: mohammad-abdul

Post on 09-Feb-2017


MinG: An Efficient Algorithm to Mine Graphs for Semantic Associations

Zohaib Hassan, Mohammad Abdul Qadir

Centre for Distributed and Semantic Computing, Department of Computer Science Mohammad Ali Jinnah University, Islamabad, Pakistan

[email protected], [email protected]

Abstract- Data in the semantic web is modelled as a directed labelled graph. Vertices of the graph represent entities and edges represent relationships between those entities. The semantic web allows the discovery of relations between entities using p-operators. In this paper we propose an algorithm to answer p-operators, that is, to find all paths between any two nodes of a graph. The algorithm is based on p-index, an indexing scheme presented in the PhD thesis of Barton. Our algorithm reduces the computational and space complexity of indexing by not creating a special type of adjacency matrix, called a Path Type Matrix, at each level of indexing, as Barton's algorithm does. We need Path Type Matrices only at the first and last levels of indexing. Thus, if an index has 100 levels, Barton's algorithm requires Path Type Matrices at every level, whereas ours requires them only at level 1 and level 100.

Keywords- Graph Traversal, RDF, Indexing, Algorithm, Mining, Semantic Web

I. INTRODUCTION

The semantic web is a way to make machines intelligent by structuring data in a way that helps them understand the meaning of information. Data on the semantic web is stored along with metadata using technologies such as the Resource Description Framework (RDF). Data in RDF is modelled in terms of concepts or entities and the relationships between those concepts. Such modelling forms a directed labelled graph in which vertices represent concepts and edges represent relationships between those concepts. A graph is a very rich data structure that can model data from many real life domains. In Biology, for instance, vertices represent proteins and edges represent interactions between them. In Chemistry, vertices represent atoms and edges represent bonds between them. Such real life graphs can be very large, and a problem arises when it comes to searching for relations between the concepts. If we are to find all indirect paths between any two entities of a graph, such a search can have exponential time complexity.

In the semantic web, p-operators [3] have been proposed as a means to find complex relations [4] between concepts. Barton in his PhD thesis [1] [2] has proposed an indexing scheme called p-index and a search algorithm to make the search for complex relationships efficient.

The p-index maintains two data structures for each segment at each level, called the Path Type Matrix (PTM) and the Inverted File (INF). We have devised a search algorithm which is based on p-index. Our algorithm to find all paths between any two nodes of a graph does not require PTMs at each level of indexing. We need the PTMs only at the first and last levels of indexing. Thus, if an index has 100 levels, we need the PTMs only at level 1 and at level 100. Therefore the computational and space cost of creating and using the PTMs at every intermediate level of indexing is saved.

978-1-61284-941-6/11/$26.00 ©2011 IEEE

Our search algorithm is discussed in section III of the paper. Section II discusses the related work, section IV compares the performance of our approach with p-index and section V concludes the paper along with future work.

II. RELATED WORK

Queries over a graph can be answered in two possible ways. One approach is an algorithmic, on the fly approach in which no pre-processing is done over the data. The other is an indexing approach in which some pre-processing of the data is performed to ease the computational complexity of answering queries.

Tarjan's algorithm for the single source path expression problem [5] [6] is an on the fly query answering algorithm. Given a vertex A of a graph, the algorithm finds for each vertex V a regular expression RE(A,V) that represents all paths between the nodes A and V in the graph. The computational complexity of Tarjan's algorithm is O(|E|), which is inefficient given the size of real life graphs.

Among the indexing approaches to find all paths between any two nodes of a graph, the indexing scheme of [7] uses a data structure known as a suffix array. The suffix array is a well-known data structure used for full text search over documents constructed from one dimensional character strings. The algorithm extracts all path expressions from an RDF graph and, to make query processing faster, creates a suffix array for all extracted path expressions. The problem with this approach is that it works only for DAGs.

Another indexing scheme, proposed by [8], converts an RDF graph into a forest of trees. The algorithm assigns each node a signature which includes the pre-order rank of that node, the post-order rank of that node, a pointer to the first ancestor of the node and some other information. These signatures are then used to find all paths between any two nodes of the graph.

There are numerous algorithms that can tell us whether a complex relationship exists between any two nodes of a graph. Such algorithms are known as graph reachability query answering algorithms. These approaches return a single boolean value stating whether or not a source vertex can reach a target vertex. They include the matrix based approach [9], an interval based approach [10], the 2-Hop approach [11], the HOPI algorithm [12] [13] and the HLSS approach [14]; the indexing scheme of Dual Labelling [15] can answer reachability queries in constant time. Although efficient, these approaches can be used only to find the existence of a relationship and not the actual relationships between any two vertices of a graph.

Barton in his PhD thesis [1] [2] has proposed an indexing scheme called p-index and a search algorithm to find all paths between any two vertices from a graph.

Suppose that we have a graph of 5000 nodes which is indexed using p-index with three levels of segmentation. The first level has 5 nodes per segment, the second level has 10 nodes per segment and the third level has 5 nodes per segment. The index is created for a user defined limit of 10, that is, transitive closure is computed for paths up to length 10. The computational cost of creating the PTMs for p-index will be as follows.

Cost at level one (L1):
• L1 nodes per segment = 5
• L1 total segments = 1000
• L1 cost per segment = matrix multiplication cost * weight limit
• L1 cost per segment = 5^3 * 10 = 1250
• Total cost at L1 = cost of one segment * number of segments
• Total cost at L1 = 1250 * 1000 = 1250000

Calculating in a similar manner for all levels, the total indexing cost will be 2355000.

In the next section we will present an algorithm which is based on p-index. Unlike Barton's search algorithm, our search algorithm requires the PTMs only at the bottommost and topmost levels of indexing. Here L4 denotes the topmost level, that is, the 20-node segment graph that remains after the three levels of segmentation (20^3 * 10 = 80000). The computational cost of creating the PTMs for our algorithm, for the graph discussed above, will be:

• Total indexing cost = L1 + L4
• Total indexing cost = 1250000 + 80000
• Total indexing cost = 1330000
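The arithmetic of this example can be reproduced with a short script. This is only a sketch of the cost model as described in the text (cubic matrix multiplication cost per segment, multiplied by the path length limit); the function name `ptm_creation_costs` and its parameters are hypothetical and not part of p-index or MinG.

```python
def ptm_creation_costs(num_nodes, nodes_per_segment, limit):
    """Return the PTM creation cost of each level, bottommost first.

    Assumed cost model (taken from the worked example): a segment with
    n nodes costs n**3 * limit, and the topmost level is one unsegmented
    segment graph.
    """
    costs = []
    nodes = num_nodes
    for size in nodes_per_segment:
        segments = nodes // size          # number of segments at this level
        costs.append(size ** 3 * limit * segments)
        nodes = segments                  # each segment becomes a node one level up
    costs.append(nodes ** 3 * limit)      # topmost level: a single PTM
    return costs


levels = ptm_creation_costs(5000, [5, 10, 5], 10)
p_index_total = sum(levels)               # PTMs at every level
ming_total = levels[0] + levels[-1]       # PTMs at first and last level only
print(levels, p_index_total, ming_total)
```

For the 5000-node example the per-level costs come out as 1250000, 1000000, 25000 and 80000, which gives the totals 2355000 and 1330000 quoted above.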

III. PROPOSED ALGORITHM

In section II we discussed that, in order to find all paths between any two nodes of a graph, the search algorithm of p-index uses the PTMs and the INFs at each level of indexing. In this section we present a search algorithm to find all paths between any two nodes of a graph which is based on p-index. The difference is that our algorithm needs the PTMs only at the topmost and the bottommost levels of indexing. PTMs at intermediate levels are not required. So while indexing, we create the PTMs only at the topmost and bottommost levels and not at any intermediate level.

Barton's search algorithm goes from the bottommost to the topmost level, identifying all segments involved along the way, and then builds the paths from top to bottom [1]. Instead, we start from the bottom, identify the segments involved at the bottommost level of indexing and build the paths at that level. This generates a set of possible paths at the bottommost level. We then move on to the next level of indexing, identify the segments involved at that level and find the paths at that level. As there are no PTMs at any intermediate level, only the INFs are used there.

Our algorithm to find all paths between any two nodes from a graph performs the following steps.

1. Identify from the first level the segments of the Source and Target nodes.
2. If on level one, retrieve the entries from
   • PTM(Source)
   • INF_EDGES_OUT(Source)
   • INF_EDGES_IN(Target)
   • PTM(Target)
3. Generate a set of paths at level one from the information retrieved in step 2.
4. Identify the segment at the next level to which the source segment identified in step 1 belongs. This newly identified segment is now the Source. Update the Target by the same procedure.
5. If on a level other than the bottommost and the topmost, retrieve the entries from
   • INF_EDGES_OUT(Source)
   • INF_EDGES_IN(Target)
6. Remove the duplicate paths from the set of paths obtained in step 5.
7. For each path in the set generated so far, compute INF_EDGES_OUT(Initial) intersection INF_EDGES_IN(Terminal).
8. If the entries in the set generated in step 7 do not belong to the first level of indexing, repeat step 7.
9. If the entries in the set generated in step 7 belong to the first level of indexing, retrieve the entries from
   • PTM(Initial)
   • EDGES_OUT(Initial)
   • EDGES_IN(Terminal)
   • PTM(Terminal)
10. In step 9, do not retrieve entries from those segments which have already been processed on previous levels.
11. Add the set of paths at the current level to the set of paths generated at the previous level.
12. Generate a set of paths at the current level from the information gathered in step 11.
13. As there is no segmentation on the topmost level, entries will be retrieved only from the PTM at that level.

In steps 2 and 12, the set of paths is generated by comparing the terminal of one path with the initial of the others. If the terminal of one path is the same as the initial of another, the two paths are merged. To reduce the number of comparisons, the paths in a set can be sorted.
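The merging rule can be illustrated with a small sketch. It is not the authors' implementation: the function name `merge_fragments` is hypothetical, grouping the fragments by their initial vertex stands in for the sorting mentioned above, and the sketch assumes the target only needs to be detected as the terminal vertex of a fragment.

```python
from collections import defaultdict


def merge_fragments(fragments, source, target):
    """Join path fragments end to start and return the source-to-target paths.

    Each fragment is a tuple of vertices. A fragment is appended to a partial
    path when its initial vertex equals the terminal vertex of that path.
    """
    by_initial = defaultdict(list)            # plays the role of "sorting by initial"
    for fragment in set(fragments):           # set() removes duplicate fragments
        by_initial[fragment[0]].append(fragment)

    complete = []
    pending = list(by_initial[source])        # start from fragments leaving the source
    while pending:
        path = pending.pop()
        if path[-1] == target:
            complete.append(path)
            continue
        for fragment in by_initial[path[-1]]:
            if not set(fragment[1:]) & set(path):   # skip fragments that would form a cycle
                pending.append(path + fragment[1:])
    return complete


fragments = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "e"), ("x", "e")]
print(merge_fragments(fragments, "a", "e"))
```

Grouping by initial vertex means the terminal of a partial path is compared only with fragments that can actually extend it, which is the saving the text attributes to sorting.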

To explain the working of our algorithm, we will take a graph which we will index up to three levels. Fig. 1 shows an example graph whose segments are created by including vertices of original graph into six different segments. The vertices are included into segments randomly.

Fig. 1 An example graph and the first level of indexing

Fig. 2 shows the INFs of the six segments created in Fig. 1. The PTMs for the segments of Fig. 1 are shown in Fig. 3. In this way the graph and segments of Fig. 1, the INFs of Fig. 2 and the PTMs of Fig. 3 form the first level of indexing.

Fig. 4 shows a segment graph created from the segments in Fig. 1. This segment graph is further segmented in Fig. 4 to form the second level of indexing. Only the INFs will be created at this level, which are shown in Fig. 5.

Segment graph created from the segments of Fig. 4 is shown in Fig. 6. No further segmentation is performed for a graph of Fig. 6. As there is no segmentation at last level, the INFs cannot be created there. A segment graph in Fig. 6 is at highest level of indexing, therefore a PTM at this level will be created which is shown in Fig. 7.

Now suppose that a user poses the query Paths(C1-C9), that is, find all paths between the nodes C1 and C9 in the graph shown in Fig. 1.

Vertex C1 is the source and vertex C9 is the target. The first step of the algorithm is to determine the segments to which the source and target vertices belong. The source vertex C1 belongs to the segment S1 and the target vertex C9 belongs to the segment S4.

S1 Out: C2 → C3, C11
S2 Out: C3 → C5; C4 → C6
S2 In: C3 ← C2
S3 Out: C5 → C7; C17 → C7
S3 In: C5 ← C3; C6 ← C4
S4 In: C7 ← C5, C17; C10 ← C11
S5 Out: C13 → C11, C12
S6 Out: C11 → C10
S6 In: C11 ← C2, C13; C12 ← C13

Fig. 2 Inverted Files at first level of indexing

Fig. 3 Path Type Matrices at first level of indexing

Fig. 4 Segment graph and the second level of indexing

S2,1 Out: S2 → S3; S1 → S6
S2,2 In: S3 ← S2; S4 ← S6
S2,3 Out: S6 → S4
S2,3 In: S6 ← S1

Fig. 5 Inverted files at second level of indexing

Fig. 6 Segment graph at the third level of indexing

Fig. 7 Path Type Matrix at the last level of indexing

The second step is to generate a set of paths at the first level by retrieving entries from the PTMs and the INFs of the segments identified in the first step. From the INFs, the outgoing edges of the source segment and the incoming edges of the target segment are queried. First, the algorithm retrieves the entries from the PTM of S1 and the outgoing edges from the INF of S1. It then retrieves the incoming edges from the INF of S4 and the entries from the PTM of S4. Let us call the set of paths generated at the first level L0. The set L0 initially contains the following paths:

• PTM(S1) = C1→C2
• EDGES_OUT(S1) = C2→C3, C2→C11
• EDGES_IN(S4) = C5→C7, C17→C7, C11→C10
• PTM(S4) = C8→C9, C10→C9

Note that we have written the incoming edges retrieved from an INF in the reverse direction. The INF of segment S4 in Fig. 2 shows that vertex C7 has incoming edges from C5 and C17. The direction of the arrow is towards C7, that is, it could be written as C7←C5, but we have written it as C5→C7.
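The retrieval of the initial set for the query Paths(C1-C9) can be sketched as below. The dictionaries hold only the entries quoted in the text for S1 and S4, every edge is stored as an (initial, terminal) pair so that incoming edges are already in the forward direction, and the names `PTM`, `EDGES_OUT`, `EDGES_IN` and `level_one_set` are illustrative assumptions rather than the data layout of p-index.

```python
# Entries quoted in the text for the segments S1 and S4 (assumed layout).
PTM = {"S1": [("C1", "C2")], "S4": [("C8", "C9"), ("C10", "C9")]}
EDGES_OUT = {"S1": [("C2", "C3"), ("C2", "C11")]}
EDGES_IN = {"S4": [("C5", "C7"), ("C17", "C7"), ("C11", "C10")]}


def level_one_set(source_segment, target_segment):
    """Collect the entries of step 2 in retrieval order, without duplicates."""
    entries = (
        PTM.get(source_segment, [])
        + EDGES_OUT.get(source_segment, [])
        + EDGES_IN.get(target_segment, [])
        + PTM.get(target_segment, [])
    )
    seen, result = set(), []
    for edge in entries:
        if edge not in seen:              # also covers source and target in one segment
            seen.add(edge)
            result.append(edge)
    return result


initial_set = level_one_set("S1", "S4")
print(initial_set)
```

Storing incoming edges as (initial, terminal) pairs is what allows them to be compared directly with outgoing edges and PTM entries when the paths are merged.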

After removing any duplicate paths, the algorithm builds the partial paths from the information it has in L0. It does this by merging those paths for which the terminal vertex of one path is the same as the initial vertex of another path. Starting from the paths whose initial vertex is the source vertex C1, the algorithm compares the terminal vertex of each such path with the initial vertices of the other paths in L0.

The path in L0 that starts from the source vertex is C1→C2. The terminal vertex C2 of that path is the initial vertex of two other paths in L0, namely C2→C3 and C2→C11. By merging on C2, two partial paths C1→C2→C3 and C1→C2→C11 are formed. If the paths in L0 are sorted by their initial vertices, the terminal vertex C2 of the path C1→C2 is compared only with the initial vertices of the paths C2→C3 and C2→C11 and not with the initials of the other paths in L0.

The terminal vertex C3 of the partial path C1→C2→C3 is not the initial of any path in L0. The terminal vertex C11 of the partial path C1→C2→C11 is the initial of the path C11→C10 in L0. Thus a new partial path C1→C2→C11→C10 is formed. The terminal vertex C10 of this partial path matches the initial of the path C10→C9 in L0, thus C1→C2→C11→C10→C9 is formed. The other paths in L0 that are still not processed are C5→C7, C17→C7 and C8→C9. The terminal vertices C7 and C9 of these paths are not the initials of any path in L0, thus the algorithm has finished building the paths in L0. L0 now contains the following information:

• C1→C2→C3
• C17→C7
• C1→C2→C11→C10→C9
• C8→C9
• C5→C7

The user query was Paths(C1-C9), and from L0 the algorithm has found the path from the source to the target, which is C1→C2→C11→C10→C9. In this query the path has been found on the first level of indexing. Let us now consider a query for which no path is found on the first level, so that the algorithm has to proceed to the second level, where there are no PTMs.

Consider the user query Paths(C1-C7). The source vertex C1 is in S1 and the target vertex C7 is in S4. The set L0 is the same as for the previous query Paths(C1-C9), because the segments identified in the first step, S1 and S4, are the same for both queries. Examining L0 yields no path between C1 and C7, so the algorithm needs to proceed to the second level of indexing. It now determines the segments at the next level to which the segments identified at the first level, S1 and S4, belong. Fig. 4 shows that S1 belongs to S2,1 and S4 belongs to S2,2.
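Moving the source and target up one level amounts to a membership lookup per level. The sketch below records only the memberships stated in the text (C1 in S1, C7 and C9 in S4, S1 in S2,1, S4 in S2,2); the list `MEMBERSHIP` and the function `segment_chain` are hypothetical names used for illustration.

```python
# One dictionary per level of segmentation, bottommost first.
# Only the memberships mentioned in the text are listed.
MEMBERSHIP = [
    {"C1": "S1", "C7": "S4", "C9": "S4"},   # vertex -> first-level segment
    {"S1": "S2,1", "S4": "S2,2"},           # first-level -> second-level segment
]


def segment_chain(vertex):
    """Return the segments containing `vertex`, from the first level upwards."""
    chain = []
    current = vertex
    for level in MEMBERSHIP:
        current = level[current]            # look up the enclosing segment
        chain.append(current)
    return chain


print(segment_chain("C1"), segment_chain("C7"))
```

With these chains, the Source and Target used at each level of the query Paths(C1-C7) are simply the corresponding entries of the two lists.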

To build the set L1, that is, the set of paths at level two, the algorithm retrieves the outgoing edges from the INF of S2,1 and the incoming edges from the INF of S2,2.

• EDGES_OUT(S2,1) = S2→S3, S1→S6
• EDGES_IN(S2,2) = S2→S3, S6→S4

After removing the duplicate paths, the set L1 contains the following information:

• S2→S3
• S1→S6
• S6→S4

Now, for each path in L1, the algorithm retrieves the outgoing edges from the INF of the initial vertex of the path and the incoming edges from the INF of the terminal vertex of the path. The intersection of the two is taken in order to retain only the edges that connect the initial and the terminal vertex. For the path S2→S3,

• EDGES_OUT(S2) = C3→C5, C4→C6
• EDGES_IN(S3) = C3→C5, C4→C6

EDGES_OUT(S2) ∩ EDGES_IN(S3) = C3→C5, C4→C6. Similarly, for the paths S1→S6 and S6→S4,

• EDGES_OUT(S1) = C2→C3, C2→C11
• EDGES_IN(S6) = C2→C11, C13→C11, C13→C12
• EDGES_OUT(S6) = C11→C10
• EDGES_IN(S4) = C5→C7, C17→C7, C11→C10

EDGES_OUT(S1) ∩ EDGES_IN(S6) = C2→C11 and EDGES_OUT(S6) ∩ EDGES_IN(S4) = C11→C10.
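The intersection step maps directly onto set intersection. The following sketch uses the inverted-file entries listed above, stored as sets of (initial, terminal) pairs; the names `EDGES_OUT`, `EDGES_IN` and `connecting_edges` are assumptions made for the illustration.

```python
# First-level inverted-file entries quoted in the text.
EDGES_OUT = {
    "S1": {("C2", "C3"), ("C2", "C11")},
    "S2": {("C3", "C5"), ("C4", "C6")},
    "S6": {("C11", "C10")},
}
EDGES_IN = {
    "S3": {("C3", "C5"), ("C4", "C6")},
    "S4": {("C5", "C7"), ("C17", "C7"), ("C11", "C10")},
    "S6": {("C2", "C11"), ("C13", "C11"), ("C13", "C12")},
}


def connecting_edges(segment_path):
    """Edges that leave the initial segment and enter the terminal segment."""
    initial, terminal = segment_path
    return EDGES_OUT.get(initial, set()) & EDGES_IN.get(terminal, set())


for segment_path in [("S2", "S3"), ("S1", "S6"), ("S6", "S4")]:
    print(segment_path, sorted(connecting_edges(segment_path)))
```

The intersection discards edges such as C2→C3, which leaves S1 but does not enter S6, so only the edges relevant to the segment path survive.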

By doing all this, the algorithm has reached the bottommost level where we have the PTMs. Thus it will check the PTMs of the vertices in L1. If however the algorithm has not reached the bottommost level, it will keep on checking the INFs in the same fashion.

In L1 we have five vertices: S2, S3, S1, S6 and S4. Their PTMs are shown in Fig. 3. The algorithm queries only the PTMs of S2, S3 and S6, because it already queried the PTMs of S1 and S4 when it was building the paths in L0.

• PTM(S2) = Empty
• PTM(S3) = C5→C6, C6→C17
• PTM(S6) = C12→C11
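Skipping segments that were already processed (step 10) can be sketched by keeping a set of visited segments. The PTM contents are those quoted in the text; the function name `unprocessed_ptm_entries` and the use of a mutable `processed` set are illustrative choices, not part of the published algorithm.

```python
# PTM entries quoted in the text for the first-level segments involved.
PTM = {
    "S1": [("C1", "C2")],
    "S2": [],
    "S3": [("C5", "C6"), ("C6", "C17")],
    "S4": [("C8", "C9"), ("C10", "C9")],
    "S6": [("C12", "C11")],
}


def unprocessed_ptm_entries(segments, processed):
    """Return PTM entries of segments not yet processed and mark them processed."""
    entries = {}
    for segment in segments:
        if segment in processed:
            continue                         # queried on a previous level
        entries[segment] = PTM.get(segment, [])
        processed.add(segment)
    return entries


processed = {"S1", "S4"}                     # handled while building the first set
new_entries = unprocessed_ptm_entries(["S2", "S3", "S1", "S6", "S4"], processed)
print(new_entries)
```

Only S2, S3 and S6 are queried here, matching the list above, and a later call with the same segments would retrieve nothing.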

The algorithm now builds the set L1 from the information it has collected. L1 then contains:

• C3→C5, C4→C6
• C5→C6, C6→C17
• C2→C11
• C12→C11
• C11→C10

We now have two sets of paths, L0 and L1. The algorithm merges L0 and L1. The new set at this level becomes:

• C1→C2→C3
• C5→C7
• C1→C2→C11→C10→C9
• C6→C17
• C2→C11
• C8→C9
• C3→C5
• C11→C10
• C4→C6
• C12→C11
• C5→C6
• C17→C7

The paths are now built from this combined set in the same manner as the partial paths were built in the individual sets. After building the paths, the set L1 contains:

• C1→C2→C3→C5→C6→C17→C7
• C1→C2→C3→C5→C7
• C4→C6→C17→C7
• C1→C2→C11→C10→C9
• C8→C9
• C2→C11→C10
• C12→C11

The user query was Paths(C1-C7). The set L1 contains all paths between the source C1 and the target C7 of the graph shown in Fig. 1. The required paths are

• C1→C2→C3→C5→C6→C17→C7
• C1→C2→C3→C5→C7
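As a cross-check of this result, the combined set can be broken down into single edges and searched with a plain depth-first enumeration of simple paths. This is a verification sketch only and is not how MinG builds its paths; the edge list is the one collected in the combined set above, and the function name `all_simple_paths` is hypothetical.

```python
# Single edges contained in the combined set of the worked example.
EDGES = [
    ("C1", "C2"), ("C2", "C3"), ("C2", "C11"), ("C11", "C10"), ("C10", "C9"),
    ("C8", "C9"), ("C3", "C5"), ("C4", "C6"), ("C5", "C6"), ("C5", "C7"),
    ("C6", "C17"), ("C17", "C7"), ("C12", "C11"),
]


def all_simple_paths(edges, source, target):
    """Enumerate every cycle-free path from source to target by depth-first search."""
    adjacency = {}
    for initial, terminal in edges:
        adjacency.setdefault(initial, []).append(terminal)

    def extend(path):
        if path[-1] == target:
            yield tuple(path)
            return
        for successor in adjacency.get(path[-1], []):
            if successor not in path:        # keep the path simple
                yield from extend(path + [successor])

    return sorted(extend([source]))


for path in all_simple_paths(EDGES, "C1", "C7"):
    print("→".join(path))
```

On these edges the search finds exactly the two required paths, and for the earlier query Paths(C1-C9) it finds the single path C1→C2→C11→C10→C9.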

IV. PERFORMANCE EVALUATION

This section compares an indexing complexity of our approach with p-index. Time and space complexity is calculated based upon the PTMs only as the complexity of creating the INFs will be same for both approaches. Let us take five graphs with 100000, 200000, 300000, 400000 and 500000 nodes. These graphs are indexed up to five levels based upon the parameter settings shown in Table 1. Fifth level has only the segment graph and no further segmentation is performed at that level thus it is not shown in Table 1. Segment graph at fifth level contains 16, 32, 60, 8 and 10 nodes respectively based upon the parameter settings at previous levels. The index is created up to a user defined limit of 10, that is, transitive closure is computed up to length 10 of the paths.

TABLE I PARAMETER SETTINGS FOR INDEX CREATION

Graph Size    Level 1    Level 2    Level 3    Level 4
100000        5          5          5          10
200000        5          5          5          10
300000        5          5          5          10
400000        5          10         10         10
500000        5          10         10         10

The computational cost for the graph G100000 with the parameter settings of Table 1 will be as follows.

Cost at level one (L1):
• L1 nodes per segment = 5
• L1 total segments = 20000
• L1 cost per segment = matrix multiplication cost * weight limit
• L1 cost per segment = 5^3 * 10 = 1250
• Total cost at L1 = cost of one segment * number of segments
• Total cost at L1 = 1250 * 20000 = 25000000

Calculating in a similar manner, the costs at levels two, three, four and five will be 5000000, 1000000, 160000 and 40960 respectively. The total indexing cost of p-index for creating the PTMs will be the cost of L1 + L2 + L3 + L4 + L5 = 31200960. We have discussed in section III that our algorithm requires the PTMs only at the first and last levels of indexing. Thus the indexing cost of our algorithm for creating the PTMs will be the cost of L1 + L5 = 25040960. Table 2 compares the indexing cost of our algorithm with p-index.

TABLE II INDEXING COST

Graph Size    MinG         p-index
100000        25040960     31200960
200000        50327680     62727680
300000        77160000     101160000
400000        100005120    188805120
500000        125010000    236010000
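The MinG column of Table 2 can be recomputed from figures given in the text: 5 nodes per segment at level one, a limit of 10, and 16, 32, 60, 8 and 10 nodes in the fifth-level segment graphs. The sketch below does this and also sums the per-level costs quoted for G100000; the names `TOP_LEVEL_NODES`, `ming_ptm_cost` and `P_INDEX_LEVEL_COSTS_G100000` are hypothetical.

```python
LIMIT = 10                  # user defined path length limit
L1_SEGMENT_SIZE = 5         # nodes per segment at level one (Table 1)
TOP_LEVEL_NODES = {100000: 16, 200000: 32, 300000: 60, 400000: 8, 500000: 10}

# Per-level costs quoted in the text for G100000 (levels one to five).
P_INDEX_LEVEL_COSTS_G100000 = [25000000, 5000000, 1000000, 160000, 40960]


def ming_ptm_cost(graph_size):
    """Cost of creating PTMs at the first and last levels only."""
    first_level = (graph_size // L1_SEGMENT_SIZE) * L1_SEGMENT_SIZE ** 3 * LIMIT
    last_level = TOP_LEVEL_NODES[graph_size] ** 3 * LIMIT
    return first_level + last_level


for size in sorted(TOP_LEVEL_NODES):
    print(size, ming_ptm_cost(size))
print("p-index, G100000:", sum(P_INDEX_LEVEL_COSTS_G100000))
```

The difference between the two G100000 totals, 6160000, is the cost of the PTMs at levels two to four, which MinG does not create.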

Fig. 8 shows the efficiency of our approach over the p-index. The x-axis shows the graph size and, to keep the scale of the y-axis down, the costs in Table 2 are multiplied by 10^-7.

Our approach also saves the space required to store the PTMs at each level of indexing. To calculate the space complexity, assume that each entry of a matrix uses one memory location. Thus a 5 x 5 PTM will occupy 25 memory locations. If we take the transitive closure of such a matrix for paths up to length 10, then in the worst case each entry of the matrix will contain a set of paths up to length 10. In that case, each entry of such a PTM will occupy 10 memory locations. Therefore, a 5 x 5 PTM with transitive closure up to path length 10 will occupy 25 * 10 = 250 memory locations.

Fig. 8 Indexing cost of creating the PTMs for MinG and p-index (x-axis: graph size, 100000 to 500000; legend: MinG, p-index)

The space required by the graph G100000 for storing the PTMs with the parameter settings of Table 1 will be as follows.

Space required at level one (L1):
• L1 nodes per segment = 5
• L1 total segments = 20000
• L1 space per segment = matrix size x weight limit
• L1 space per segment = (5 x 5) x 10 = 250
• Total space at L1 = space of one segment x number of segments
• Total space at L1 = 250 x 20000 = 5000000 memory locations

Calculating in a similar manner, the space required at levels two, three, four and five will be 1000000, 200000, 40000 and 2560 memory locations respectively. The total space complexity of p-index for storing the PTMs will be the space occupied by L1 + L2 + L3 + L4 + L5 = 6242560 memory locations. The space complexity of our algorithm for storing the PTMs will be the space occupied by L1 + L5 = 5002560 memory locations.

Table 3 compares the space complexity of our algorithm with p-index.

TABLE III SPACE COMPLEXITY

Graph Size    MinG        p-index
100000        5002560     6242560
200000        10010240    12490240
300000        15036000    19236000
400000        20000640    28944000
500000        25001000    36767666
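The MinG column of Table 3 follows from the same parameters with a quadratic rather than cubic term, since a PTM over n nodes has n x n entries of up to 10 memory locations each. The sketch below recomputes that column and the locations saved for G100000; `ming_ptm_space` and the other names are illustrative assumptions.

```python
LIMIT = 10                  # memory locations per matrix entry in the worst case
L1_SEGMENT_SIZE = 5         # nodes per segment at level one (Table 1)
TOP_LEVEL_NODES = {100000: 16, 200000: 32, 300000: 60, 400000: 8, 500000: 10}

# Per-level space quoted in the text for G100000 (levels one to five).
P_INDEX_LEVEL_SPACE_G100000 = [5000000, 1000000, 200000, 40000, 2560]


def ming_ptm_space(graph_size):
    """Memory locations needed for PTMs at the first and last levels only."""
    first_level = (graph_size // L1_SEGMENT_SIZE) * L1_SEGMENT_SIZE ** 2 * LIMIT
    last_level = TOP_LEVEL_NODES[graph_size] ** 2 * LIMIT
    return first_level + last_level


saved = sum(P_INDEX_LEVEL_SPACE_G100000) - ming_ptm_space(100000)
for size in sorted(TOP_LEVEL_NODES):
    print(size, ming_ptm_space(size))
print("locations saved for G100000:", saved)
```

The saving for G100000 is the space of the intermediate levels two to four, 1240000 memory locations.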

Let us assume that each memory location occupies 1 byte. Fig. 9 shows the efficiency of our approach over the p-index. The x-axis shows the graph size and, to keep the scale of the y-axis down, the memory locations of Table 3 are converted into Mbytes.

Fig. 9 Space Complexity of storing the PTMs for MinG and p-index (x-axis: graph size, 100000 to 500000; legend: MinG, p-index)

The search algorithm of p-index creates a special type of graph, called a Transcription Graph, to answer path queries. The vertices in a transcription graph can be processed strictly from left to right. MinG is designed with parallel processing in mind. It can build the paths from both directions at the same time. In MinG, the building of paths from the source vertex and from the target vertex are independent of each other. This is not the case with the search algorithm of p-index.

V. CONCLUSIONS

We have presented an algorithm to find all paths between any two vertices of a graph. The algorithm is based on the indexing scheme of p-index. Our algorithm reduces the time and space complexity of indexing by not creating Path Type Matrices at each level of indexing. Our algorithm inherently supports parallel processing, which is an additional advantage over the search algorithm of p-index. To precisely measure the complexity of our search algorithm, we need to implement it and evaluate it further under more constraints. This is part of our future work. We are also interested in running our algorithm on real life data so that its applicability in real life can be judged.

REFERENCES

[1] S. Barton, "Indexing graph structured data," PhD Thesis, Masaryk University, Brno, Czech Republic, 2007.

[2] S. Barton and P. Zezula, "Indexing structure for graph structured data," in Mining Complex Data, Springer Berlin / Heidelberg, 2009, pp. 167-188. Studies in Computational Intelligence, Vol. 165. ISBN 978-3-540-88066-0.

[3] K. Anyanwu and A. Sheth, "The p-operator: Enabling querying for semantic associations on the semantic web," in Proceedings of the twelfth international conference on World Wide Web, pp. 690-699. ACM Press, New York, 2003.

[4] S. Thacker, A. Sheth, and S. Patel, "Complex relationships for the semantic web," in D. Fensel, J. Hendler, H. Liebermann, and W. Wahlster (eds.), Spinning the Semantic Web. MIT Press, Cambridge, 2002.

[5] R.E. Tarjan, "Fast algorithms for solving path problems," J. ACM, 28(3):594-614, 1981.

[6] R.E. Tarjan, "A unified approach to path problems," J. ACM, 28(3):577-593, 1981.

[7] A. Matono, T. Amagasa, M. Yoshikawa, and S. Uemura, "An indexing scheme for RDF and RDF Schema based on suffix arrays," in Proceedings of SWDB'03, The first International Workshop on Semantic Web and Databases, Co-located with VLDB 2003, 2003.

[8] S. Barton, "Indexing structure for discovering relationships in RDF graph recursively applying tree transformation," in Proceedings of the Semantic Web Workshop at 27th Annual International ACM SIGIR Conference, pp. 58-68, 2004.

[9] R. Agrawal, S. Dar, and H.V. Jagadish, "Direct transitive closure algorithms: design and performance evaluation," ACM Transactions on Database Systems 15(3), pp. 427-458, 1990.

[10] P.F. Dietz, "Maintaining order in a linked list," in STOC'82: Proceedings of the fourteenth annual ACM symposium on Theory of computing, pp. 122-127, New York, USA, 1982. ACM Press.

[11] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, "Reachability and distance queries via 2-hop labels," in Proceedings of the 13th annual ACM-SIAM Symposium on Discrete algorithms, pp. 937-946, 2002.

[12] R. Schenkel, A. Theobald, and G. Weikum, "HOPI: An efficient connection index for complex XML document collections," in EDBT, 2004.

[13] R. Schenkel, A. Theobald, and G. Weikum, "Efficient creation and incremental maintenance of the HOPI index for complex XML document collections," in ICDE, 2005.

[14] H. He, H. Wang, J. Yang, and P.S. Yu, "Compact reachability labeling for graph-structured data," in CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 594-601, New York, USA, 2005. ACM Press.

[15] H. Wang, H. He, J. Yang, P.S. Yu, and J.X. Yu, "Dual Labeling: Answering Graph Reachability Queries in Constant Time," in Proceedings of the 22nd International Conference on Data Engineering (ICDE), pp. 75, 2006. IEEE Computer Society.

[16] S. Barton and P. Zezula, "rhoindex - designing and evaluating an indexing structure for graph structured data," Technical Report FIMU-RS-2006-07, Faculty of Informatics, Masaryk University, 2006.