[ieee 2011 international conference on computer networks and information technology (iccnit) -...
MinG: An Efficient Algorithm to Mine Graphs for Semantic Associations
Zohaib Hassan, Mohammad Abdul Qadir
Centre for Distributed and Semantic Computing, Department of Computer Science Mohammad Ali Jinnah University, Islamabad, Pakistan
[email protected], [email protected]
Abstract- Data in the semantic web is modelled as a directed labelled graph. Vertices of that graph represent entities and edges represent relationships between those entities. The semantic web allows the discovery of relations between entities using the p-operators. In this paper we propose an algorithm to answer p-operators, that is, to find all paths between any two nodes of a graph. The algorithm is based on p-index, an indexing scheme presented in the PhD thesis of Barton. Our algorithm reduces the computational and space complexity of indexing by not creating a special type of adjacency matrix, called a Path Type Matrix, at every level of indexing, as Barton's algorithm does. We need Path Type Matrices only at the first and last levels of indexing. Thus, if an index has 100 levels, Barton's algorithm requires Path Type Matrices at each level, whereas ours requires them only at level 1 and level 100.
Keywords- Graph Traversal, RDF, Indexing, Algorithm, Mining, Semantic Web
I. INTRODUCTION
The semantic web is a way to make machines intelligent by structuring data in a way that helps them understand the meaning of information. Data on the semantic web are stored along with metadata by using technologies such as the Resource Description Framework (RDF). Data in RDF is modelled in terms of concepts or entities and the relationships between those concepts. Such a model forms a directed labelled graph in which vertices represent concepts and edges represent relationships between those concepts. A graph is a very rich data structure that can model data from many real life domains. In Biology, for instance, vertices represent proteins and edges represent interactions between them. In Chemistry, vertices represent atoms and edges represent bonds between them. Such real life graphs can be huge, and a problem arises when searching for relations between the concepts: finding all indirect paths between any two entities of a graph can take exponential time.
In the semantic web, p-operators [3] are proposed as a means to find complex relations [4] between concepts. Barton, in his PhD thesis [1] [2], proposed an indexing scheme called p-index and a search algorithm to make the search for complex relationships efficient.
The p-index maintains two data structures for each segment at each level, called the Path Type Matrix (PTM) and the Inverted File (INF). We have devised a search algorithm which is based on p-index. Our algorithm to find all paths between any two nodes of a graph does not require PTMs at each level of indexing. We need the PTMs only at the first and last levels of indexing. Thus, if an index has 100 levels, we need the PTMs only at level 1 and at level 100. The computational and space cost of creating and using the PTMs at the intermediate levels is therefore saved.
978-1-61284-941-6/11/$26.00 ©2011 IEEE
Section II discusses the related work, section III presents our search algorithm, section IV compares the performance of our approach with p-index, and section V concludes the paper along with future work.
II. RELATED WORK
Queries over a graph can be answered in two possible ways. One is an algorithmic, on-the-fly approach in which no pre-processing is done over the data. The other is an indexing approach in which the data is pre-processed to ease the computational complexity of answering queries.
Tarjan's algorithm for the single-source path expression problem [5] [6] is an on-the-fly query answering algorithm. Given a vertex A of a graph, the algorithm finds for each vertex V a regular expression RE(A,V) that represents all paths between nodes A and V. The computational complexity of Tarjan's algorithm is O(|E|), which is inefficient given the size of real life graphs.
Among the indexing approaches to find all paths between any two nodes of a graph, the indexing scheme of [7] uses a data structure known as a suffix array. The suffix array is a well-known data structure for full text search over documents, constructed on one dimensional character strings. The algorithm extracts all path expressions from an RDF graph and, to make query processing faster, creates a suffix array for all extracted path expressions. The problem with this approach is that it works only for DAGs.
Another indexing scheme, proposed by [8], converts an RDF graph into a forest of trees. The algorithm assigns each node a signature which includes the pre-order rank of that node, the post-order rank of that node, a pointer to the first ancestor of the node and some other information. These signatures are then used to find all paths between any two nodes of the graph.
There are numerous algorithms that can tell us whether a complex relationship exists between any two nodes of a graph. Such algorithms are known as graph reachability query answering algorithms. These approaches return a single boolean value stating whether a source vertex can reach a target vertex. They include the matrix based approach [9], an interval based approach [10], the 2-Hop approach [11], the HOPI algorithm [12] [13], the HLSS approach [14], and the Dual Labelling indexing scheme [15], which can answer reachability queries in constant time. Although efficient, these approaches can establish only the existence of a relationship and not the actual relationships between any two vertices of a graph.
Barton in his PhD thesis [1] [2] has proposed an indexing scheme called p-index and a search algorithm to find all paths between any two vertices from a graph.
Suppose that we have a graph of 5000 nodes which is indexed using p-index up to three levels. First level has 5 nodes per segment, second level has 10 nodes per segment and third level has 5 nodes per segment. An index is created for a user defined limit of 10. That is, transitive closure is computed up to length 10 of the paths. The computational cost for creating PTMs by p-index will be,
Cost at level one,
• L1 nodes per segment = 5
• L1 total segments = 1000
• L1 cost per segment = matrix multiplication cost * weight limit
• L1 cost per segment = 5^3 * 10 = 1250
• Total cost at L1 = cost of one segment * number of segments
• Total cost at L1 = 1250 * 1000 = 1250000
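Calculating in a similar manner, the costs at levels two and three are 1000000 and 25000, and the cost for the unsegmented segment graph of 20 nodes at the top (L4) is 80000. The total indexing cost over all levels will therefore be 2355000.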
In the next section we will present an algorithm which is based on p-index. Unlike Barton's search algorithm, our search algorithm requires the PTMs only at the bottommost and topmost level of indexing. The computational cost of creating PTMs for our algorithm for a graph we have discussed above will be,
• Total indexing cost = L1 + L4
• Total indexing cost = 1250000 + 80000
• Total indexing cost = 1330000
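The arithmetic of this example can be reproduced with a short sketch. It assumes that each segment of one level becomes a single vertex of the next level's segment graph, and that the topmost segment graph (20 vertices here) is treated as one unsegmented matrix. The function names `ptm_cost` and `level_costs` are illustrative and do not come from p-index.

```python
def ptm_cost(nodes_per_segment: int, segments: int, limit: int) -> int:
    """Cost of building the PTMs of one level: n^3 * limit per segment."""
    return nodes_per_segment ** 3 * limit * segments


def level_costs(graph_nodes: int, nodes_per_segment_by_level: list, limit: int) -> list:
    """Per-level PTM cost, bottom level first, topmost segment graph last."""
    costs = []
    nodes = graph_nodes
    for n in nodes_per_segment_by_level:
        segments = nodes // n
        costs.append(ptm_cost(n, segments, limit))
        # Assumption: every segment becomes one vertex of the next segment graph.
        nodes = segments
    # Topmost level: the remaining segment graph is not segmented any further.
    costs.append(ptm_cost(nodes, 1, limit))
    return costs


costs = level_costs(5000, [5, 10, 5], limit=10)
p_index_total = sum(costs)            # PTMs at every level
ming_total = costs[0] + costs[-1]     # PTMs at the first and last level only

print(costs)                          # [1250000, 1000000, 25000, 80000]
print(p_index_total, ming_total)      # 2355000 1330000
```

Under these assumptions the sketch yields the same totals as the text: 2355000 when PTMs are built at every level and 1330000 when they are built only at the first and last levels.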
III. PROPOSED ALGORITHM
In section II we discussed that, in order to find all paths between any two nodes of a graph, the search algorithm of p-index uses the PTMs and the INFs at each level of indexing. In this section we present a search algorithm, based on p-index, to find all paths between any two nodes of a graph. The difference is that our algorithm needs the PTMs only at the topmost and the bottommost levels of indexing. PTMs at the intermediate levels are not required. So while indexing, we create the PTMs only at the topmost and bottommost levels and not at any intermediate level.
Barton's search algorithm goes from the bottommost to the topmost level, identifying all segments involved along the way, and then builds the paths from top to bottom [1]. Instead, we start from the bottom, identify the segments involved at the bottommost level of indexing and build the paths at that level. This generates a set of possible paths at the bottommost level. We then move on to the next level of indexing, identify the segments involved at that level and find the paths at that level. As there are no PTMs at the intermediate levels, only the INFs are used there.
Our algorithm to find all paths between any two nodes from a graph performs the following steps.
1. Identify, at the first level, the segments of the Source and Target nodes.
2. If on level one, retrieve the entries from
   • PTM(Source)
   • INF_EDGES_OUT(Source)
   • INF_EDGES_IN(Target)
   • PTM(Target)
3. Generate a set of paths at level one from the information retrieved in step 2.
4. Identify the segment at the next level to which the source segment identified in step 1 belongs. This newly identified segment is now the Source. Update the Target by the same procedure.
5. If on a level other than the bottommost and the topmost, retrieve the entries from
   • INF_EDGES_OUT(Source)
   • INF_EDGES_IN(Target)
6. Remove the duplicate paths from the set of paths of step 5.
7. For each path in the set generated so far, compute
   INF_EDGES_OUT(Initial) intersection INF_EDGES_IN(Terminal)
8. If the entries in the set generated at step 7 do not belong to the first level of indexing, repeat step 7.
9. If the entries in the set generated at step 7 belong to the first level of indexing, retrieve the entries from
   • PTM(Initial)
   • EDGES_OUT(Initial)
   • EDGES_IN(Terminal)
   • PTM(Terminal)
10. In step 9, do not retrieve entries from segments that were already processed on previous levels.
11. Add the set of paths at the current level to the set of paths generated at the previous level.
12. Generate a set of paths at the current level from the information of step 11.
13. As there is no segmentation on the topmost level, entries are retrieved only from the PTM at that level.
In steps 2 and 12, the set of paths is generated by comparing the terminal of one path with the initial of the others. If the terminal of one path is the same as the initial of another, the two paths are merged. To reduce the number of comparisons, the paths in a set can be sorted.
To explain the working of our algorithm, we will take a graph which we will index up to three levels. Fig. 1 shows an example graph whose segments are created by including vertices of original graph into six different segments. The vertices are included into segments randomly.
Fig. 1 An example graph and the first level of indexing
Fig. 2 shows the INFs of the six segments created in Fig. 1. The PTMs for the segments of Fig. 1 are shown in Fig. 3. In this way the graph and segments of Fig. 1, the INFs of Fig. 2 and the PTMs of Fig. 3 form the first level of indexing.
Fig. 4 shows a segment graph created from the segments in Fig. 1. This segment graph is further segmented in Fig. 4 to form the second level of indexing. Only the INFs will be created at this level, which are shown in Fig. 5.
Segment graph created from the segments of Fig. 4 is shown in Fig. 6. No further segmentation is performed for a graph of Fig. 6. As there is no segmentation at last level, the INFs cannot be created there. A segment graph in Fig. 6 is at highest level of indexing, therefore a PTM at this level will be created which is shown in Fig. 7.
Now let us suppose that a user poses the query Paths(C1-C9), that is, find all paths between the nodes C1 and C9 of the graph shown in Fig. 1.
Vertex C1 is the source and vertex C9 is the target. The first step of the algorithm is to determine the segments to which the source and target vertices belong. The source vertex C1 belongs to segment S1 and the target vertex C9 belongs to segment S4.
S1 Out:  C2 → C3, C11
S2 Out:  C3 → C5;  C4 → C6
S2 In:   C3 ← C2
S3 Out:  C5 → C7;  C17 → C7
S3 In:   C5 ← C3;  C6 ← C4
S4 In:   C7 ← C5, C17;  C10 ← C11
S5 Out:  C13 → C11, C12
S6 Out:  C11 → C10
S6 In:   C11 ← C2, C13;  C12 ← C13
Fig. 2 Inverted Files at first level of indexing
Fig. 3 Path Type Matrices at first level of indexing
Fig. 4 Segment graph and the second level of indexing
S2,1 Out:  S2 → S3;  S1 → S6
S2,2 In:   S3 ← S2;  S4 ← S6
S2,3 Out:  S6 → S4
S2,3 In:   S6 ← S1
Fig. 5 Inverted files at second level of indexing
Fig. 6 Segment graph at the third level of indexing
        S2,1    S2,2               S2,3
S2,1            e3,1; e3,3 e3,2    e3,3
S2,2
S2,3            e3,2
Fig. 7 Path Type Matrix at the last level of indexing
The second step is to generate a set of paths at the first level by retrieving entries from the PTMs and the INFs of the segments identified in the first step. From the INFs, the outgoing edges of the source segment and the incoming edges of the target segment are queried. First, the algorithm retrieves the entries from the PTM of S1 and the outgoing edges from the INF of S1. It then retrieves the incoming edges from the INF of S4 and the entries from the PTM of S4. Let us call the set of paths generated at the first level L0. The set L0 initially contains the following paths,
• PTM(S1) = C1→C2
• EDGES_OUT(S1) = C2→C3, C2→C11
• EDGES_IN(S4) = C5→C7, C17→C7, C11→C10
• PTM(S4) = C8→C9, C10→C9
Note that we have written the incoming edges retrieved from an INF in the reverse direction. The INF of segment S4 in Fig. 2 shows that vertex C7 has incoming edges from C5 and C17. The direction of the arrow is towards C7; that is, the entry can be written as C7 ← C5, but we have written it as C5→C7.
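The reversal of incoming INF entries can be illustrated with a minimal sketch. The dictionary layout (target vertex mapped to its list of source vertices) is an assumption made for illustration; the paper does not prescribe a storage format, and `incoming_as_forward_edges` is a hypothetical helper name.

```python
# Incoming part of the INF of segment S4, stored as target -> sources
# (assumed layout; the entries themselves are the ones quoted in the text).
inf_in_s4 = {"C7": ["C5", "C17"], "C10": ["C11"]}


def incoming_as_forward_edges(inf_in: dict) -> list:
    """Rewrite 'target <- source' entries as forward (source, target) edges."""
    return [(src, dst) for dst, sources in inf_in.items() for src in sources]


edges_in_s4 = incoming_as_forward_edges(inf_in_s4)
print(edges_in_s4)  # [('C5', 'C7'), ('C17', 'C7'), ('C11', 'C10')]
```

Writing every entry in the forward direction means that PTM entries, outgoing edges and incoming edges can all be merged by the same terminal-to-initial comparison.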
After removing the duplicate paths, if any, the algorithm builds the partial paths from the information it has in L0. It does this by merging those paths for which the terminal vertex of one path is the same as the initial vertex of the other path. Starting from the paths of which the source vertex C1 is the initial, the algorithm compares the terminal vertex of that path with the initial vertices of the other paths in L0.
The path in L0 that starts from the source vertex is C1→C2. The terminal vertex C2 of that path is the initial vertex of two other paths in L0, namely C2→C3 and C2→C11. By merging on C2, two partial paths C1→C2→C3 and C1→C2→C11 are formed. If the paths in L0 are sorted by their initials, the terminal vertex C2 of the path C1→C2 is compared only with the initial vertices of the paths C2→C3 and C2→C11 and not with the initials of the other paths in L0.
The terminal vertex C3 of the partial path C1→C2→C3 is not the initial of any path in L0. The terminal vertex C11 of the partial path C1→C2→C11 is the initial of the path C11→C10 in L0. Thus a new partial path C1→C2→C11→C10 is formed. The terminal vertex C10 of this partial path matches the initial of the path C10→C9 in L0, thus C1→C2→C11→C10→C9 is formed. The other paths that are still not processed in L0 are C5→C7, C17→C7 and C8→C9. The terminal vertices C7 and C9 of these paths are not the initials of any path in L0, thus the algorithm has finished building the paths in L0. L0 now contains the following information,
• C1→C2→C3
• C17→C7
• C1→C2→C11→C10→C9
• C8→C9
• C5→C7
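The merging just described can be sketched as follows. This is an illustrative reading, not the authors' implementation: path fragments are tuples of vertex names, a dictionary keyed by initial vertex stands in for the sorting suggested in the text, and the search returns only the chains that run from the source to the target. The name `paths_between` is hypothetical.

```python
from collections import defaultdict


def paths_between(fragments, source, target):
    """Chain fragments whose terminal vertex equals another's initial vertex."""
    by_initial = defaultdict(list)
    for frag in fragments:
        by_initial[frag[0]].append(frag)

    found = []

    def extend(path):
        if path[-1] == target:
            found.append(path)
            return
        for frag in by_initial.get(path[-1], []):
            # Skip fragments that would revisit a vertex, so paths stay simple.
            if any(v in path for v in frag[1:]):
                continue
            extend(path + frag[1:])

    for frag in by_initial.get(source, []):
        extend(frag)
    return found


# Initial contents of L0 for the query Paths(C1-C9), as listed in the text.
L0 = [
    ("C1", "C2"),                                   # PTM(S1)
    ("C2", "C3"), ("C2", "C11"),                    # EDGES_OUT(S1)
    ("C5", "C7"), ("C17", "C7"), ("C11", "C10"),    # EDGES_IN(S4)
    ("C8", "C9"), ("C10", "C9"),                    # PTM(S4)
]

print(paths_between(L0, "C1", "C9"))  # [('C1', 'C2', 'C11', 'C10', 'C9')]
print(paths_between(L0, "C1", "C7"))  # []
```

Keying the fragments by their initial vertex means that the terminal C2 of C1→C2 is compared only against C2→C3 and C2→C11, which is the saving the text attributes to sorting.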
The user query was Paths(C1-C9), and from L0 the algorithm has found the path from the source to the target, which is C1→C2→C11→C10→C9. In this query the path has been found on the first level of indexing. Let us now consider a query for which no path is found on the first level, so that the algorithm has to proceed to the second level, where there are no PTMs.
Consider the user query Paths(C1-C7). The source vertex C1 is in S1 and the target vertex C7 is in S4. The set L0 is the same as for the previous query Paths(C1-C9), because the segments identified in the first step, S1 and S4, are the same for both queries. Examining L0 yields no path between C1 and C7, so the algorithm needs to proceed to the second level of indexing. It now determines the segments at the next level to which the segments identified at the first level, S1 and S4, belong. Fig. 4 shows that S1 belongs to S2,1 and S4 belongs to S2,2.
To build the set L1, that is, the set of paths at level two, the algorithm retrieves the outgoing edges from the INF of S2,1 and the incoming edges from the INF of S2,2.
• EDGES_OUT(S2,1) = S2→S3, S1→S6
• EDGES_IN(S2,2) = S2→S3, S6→S4
After removing the duplicate paths, the set L1 contains the following information,
• S2→S3
• S1→S6
• S6→S4
Now, for each path in L1, the algorithm retrieves the outgoing edges from the INF of the initial vertex of the path and the incoming edges from the INF of the terminal vertex of the path. The intersection of these two is taken in order to retrieve only the edges common to the initial and the terminal vertex. For the path S2→S3,
• EDGES_OUT(S2) = C3→C5, C4→C6
• EDGES_IN(S3) = C3→C5, C4→C6
EDGES_OUT(S2) ∩ EDGES_IN(S3) = C3→C5, C4→C6. Similarly, for the paths S1→S6 and S6→S4,
• EDGES_OUT(S1) = C2→C3, C2→C11
• EDGES_IN(S6) = C2→C11, C13→C11, C13→C12
• EDGES_OUT(S6) = C11→C10
• EDGES_IN(S4) = C5→C7, C17→C7, C11→C10
EDGES_OUT(S1) ∩ EDGES_IN(S6) = C2→C11 and EDGES_OUT(S6) ∩ EDGES_IN(S4) = C11→C10.
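The intersections above map directly onto set operations. The sketch below assumes the INFs are held as dictionaries from segment name to a set of (initial, terminal) edge tuples; this layout and the helper name `connecting_edges` are illustrative assumptions, while the edge data is taken from the lists above.

```python
# Outgoing and incoming INF entries of the first-level segments used above.
edges_out = {
    "S1": {("C2", "C3"), ("C2", "C11")},
    "S2": {("C3", "C5"), ("C4", "C6")},
    "S6": {("C11", "C10")},
}
edges_in = {
    "S3": {("C3", "C5"), ("C4", "C6")},
    "S4": {("C5", "C7"), ("C17", "C7"), ("C11", "C10")},
    "S6": {("C2", "C11"), ("C13", "C11"), ("C13", "C12")},
}


def connecting_edges(initial: str, terminal: str) -> set:
    """Edges that leave `initial` and enter `terminal` (step 7 of the algorithm)."""
    return edges_out[initial] & edges_in[terminal]


# Paths of L1 at the second level, expanded to first-level edges.
l1_paths = [("S2", "S3"), ("S1", "S6"), ("S6", "S4")]
expanded = {path: connecting_edges(*path) for path in l1_paths}

for path, edges in expanded.items():
    print(path, sorted(edges))
```

Only the edges that actually connect the two segments survive the intersection, so edges such as C2→C3 (which leaves S1 but does not enter S6) are discarded at this step.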
By doing all this, the algorithm has reached the bottommost level, where we have the PTMs. Thus it checks the PTMs of the vertices in L1. If, however, the algorithm had not reached the bottommost level, it would keep on checking the INFs in the same fashion.
In L1 we have 5 vertices: S2, S3, S1, S6 and S4. Their PTMs are shown in Fig. 3. The algorithm queries only the PTMs of S2, S3 and S6, because it already queried the PTMs of S1 and S4 when building the paths in L0, so it does not query them again at L1.
• PTM(S2) = Empty
• PTM(S3) = C5→C6, C6→C17
• PTM(S6) = C12→C11
The algorithm now builds the set L1 with the information it has collected. L1 then contains,
• C3→C5, C4→C6
• C5→C6, C6→C17
• C2→C11
• C12→C11
• C11→C10
We now have two sets of paths, L0 and L1. The algorithm merges L0 and L1. The new set at this level becomes,
• C1→C2→C3
• C5→C7
• C1→C2→C11→C10→C9
• C6→C17
• C2→C11
• C8→C9
• C3→C5
• C11→C10
• C4→C6
• C12→C11
• C5→C6
• C17→C7
The paths are now built from this combined set in the same manner as the partial paths were built in the individual sets. After building the paths, the set L1 contains,
• C1→C2→C3→C5→C6→C17→C7
• C1→C2→C3→C5→C7
• C4→C6→C17→C7
• C1→C2→C11→C10→C9
• C8→C9
• C2→C11→C10
• C12→C11
The user query was Paths(C1-C7). The set L1 contains all paths between the source C1 and the target C7 of the graph shown in Fig. 1. The required paths are
• C1→C2→C3→C5→C6→C17→C7
• C1→C2→C3→C5→C7
IV. PERFORMANCE EVALUATION
This section compares an indexing complexity of our approach with p-index. Time and space complexity is calculated based upon the PTMs only as the complexity of creating the INFs will be same for both approaches. Let us take five graphs with 100000, 200000, 300000, 400000 and 500000 nodes. These graphs are indexed up to five levels based upon the parameter settings shown in Table 1. Fifth level has only the segment graph and no further segmentation is performed at that level thus it is not shown in Table 1. Segment graph at fifth level contains 16, 32, 60, 8 and 10 nodes respectively based upon the parameter settings at previous levels. The index is created up to a user defined limit of 10, that is, transitive closure is computed up to length 10 of the paths.
TABLE I PARAMETER SETTINGS FOR INDEX CREATION
Graph Size   Level 1   Level 2   Level 3   Level 4
100000       5         5         5         10
200000       5         5         5         10
300000       5         5         5         10
400000       5         10        10        10
500000       5         10        10        10
The computational complexity for a graph G100000 with the parameter settings of Table 1 will be,
Cost at level one (L1),
• L1 nodes per segment = 5
• L1 total segments = 20000
• L1 cost per segment = matrix multiplication cost * weight limit
• L1 cost per segment = 5^3 * 10 = 1250
• Total cost at L1 = cost of one segment * number of segments
• Total cost at L1 = 1250 * 20000 = 25000000
Calculating in a similar manner, the costs at levels two, three, four and five will be 5000000, 1000000, 160000 and 40960 respectively. The total indexing cost of p-index for creating the PTMs will be the cost of L1+L2+L3+L4+L5 = 31200960. We have discussed in section III that our algorithm requires the PTMs only at the first and last levels of indexing. Thus the indexing cost of our algorithm for creating the PTMs will be the cost of L1+L5 = 25040960. Table 2 compares the indexing cost of our algorithm with p-index.
TABLE II INDEXING COST
Graph Size   MinG        p-index
100000       25040960    31200960
200000       50327680    62727680
300000       77160000    101160000
400000       100005120   188805120
500000       125010000   236010000
Fig. 8 shows the efficiency of our approach over p-index. The x-axis shows the graph size; to keep the scale of the y-axis down, the costs in Table 2 are multiplied by 10^-7.
Our approach also saves the space required to store the PTMs at each level of indexing. To calculate the space complexity, assume that each entry of a matrix uses one memory location. Thus a 5 x 5 PTM occupies 25 memory locations. If we take the transitive closure of such a matrix for paths of length 10, then in the worst case each entry of the matrix contains a set of paths up to length 10. In that case, each entry of such a PTM occupies 10 memory locations. Therefore, a 5 x 5 PTM with transitive closure up to path length 10 occupies 25 * 10 = 250 memory locations.
Fig. 8 Indexing cost of creating the PTMs for MinG and p-index (x-axis: graph size; y-axis: cost scaled by 10^-7; series: MinG, p-index)
The space complexity of a graph G100000 for storing the PTMs with the parameter settings of Table 1 will be,
Space required at level one (L1),
• L1 nodes per segment = 5
• L1 total segments = 20000
• L1 space per segment = matrix size x weight limit
• L1 space per segment = (5 x 5) x 10 = 250
• Total space at L1 = space of one segment x number of segments
• Total space at L1 = 250 x 20000 = 5000000 memory locations
Calculating in a similar manner, the space required at levels two, three, four and five will be 1000000, 200000, 40000 and 2560 memory locations respectively. The total space complexity of p-index for storing the PTMs will be the space occupied by L1+L2+L3+L4+L5 = 6242560 memory locations. The space complexity of our algorithm for storing the PTMs will be the space occupied by L1+L5 = 5002560 memory locations.
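The totals for G100000 can be checked with a short sketch. The first and last levels are computed from the formulas given in the text (n^3 x limit for time, n x n x limit for space); the figures for the intermediate levels two to four are taken as stated in the text rather than recomputed. The function names `ptm_time` and `ptm_space` are illustrative.

```python
LIMIT = 10  # user-defined path length limit


def ptm_time(n: int, segments: int = 1, limit: int = LIMIT) -> int:
    """Cost of creating the PTMs of one level: n^3 * limit per segment."""
    return n ** 3 * limit * segments


def ptm_space(n: int, segments: int = 1, limit: int = LIMIT) -> int:
    """Memory locations for the PTMs of one level: n * n * limit per segment."""
    return n * n * limit * segments


# G100000: 20000 segments of 5 nodes at level one,
# and a single 16-node segment graph at level five.
time_first, time_last = ptm_time(5, 20000), ptm_time(16)
space_first, space_last = ptm_space(5, 20000), ptm_space(16)

# Levels two to four, as stated in the text.
time_middle = [5000000, 1000000, 160000]
space_middle = [1000000, 200000, 40000]

ming_time = time_first + time_last
p_index_time = ming_time + sum(time_middle)
ming_space = space_first + space_last
p_index_space = ming_space + sum(space_middle)

print(ming_time, p_index_time)    # 25040960 31200960
print(ming_space, p_index_space)  # 5002560 6242560
```

The difference between the two approaches is exactly the sum of the intermediate levels, which is the part of the index for which MinG does not build PTMs.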
Table 3 compares the space complexity of our algorithm with p-index.
TABLE III SPACE COMPLEXITY
Graph Size   MinG       p-index
100000       5002560    6242560
200000       10010240   12490240
300000       15036000   19236000
400000       20000640   28944000
500000       25001000   36767666
Let us assume that each memory location occupies 1 byte. Fig. 9 shows the efficiency of our approach over p-index. The x-axis shows the graph size; to keep the scale of the y-axis down, the memory locations of Table 3 are converted into Mbytes.
Fig. 9 Space Complexity of storing the PTMs for MinG and p-index (x-axis: graph size; y-axis: space in Mbytes; series: MinG, p-index)
The search algorithm of p-index creates a special type of graph, called a Transcription Graph, to answer path queries. The vertices in a transcription graph can be processed strictly from left to right only. MinG is designed with parallel processing in mind. It can build the paths from both directions at the same time: in MinG, building the paths from the source vertex and from the target vertex are independent of each other. This is not the case with the search algorithm of p-index.
V. CONCLUSIONS
We have presented an algorithm to find all paths between any two vertices of a graph. The algorithm is based on the indexing scheme of p-index. Our algorithm reduces the time and space complexity of indexing by not creating Path Type Matrices at each level of indexing. Our algorithm inherently supports parallel processing, which is an additional advantage over the search algorithm of p-index. To measure the complexity of our search algorithm precisely, we need to implement it and evaluate it further under more constraints. This is part of our future work. We are also interested in running our algorithm on real life data so that its applicability in real life can be judged.
REFERENCES
[1] S. Barton, "Indexing graph structured data," PhD Thesis, Masaryk University, Brno, Czech Republic, 2007.
[2] S. Barton and P. Zezula, "Indexing structure for graph structured data," in Mining Complex Data, Studies in Computational Intelligence, Vol. 165, Springer Berlin / Heidelberg, 2009, pp. 167-188. ISBN 978-3-540-88066-0.
[3] K. Anyanwu and A. Sheth, "The p-operator: Enabling querying for semantic associations on the semantic web," in Proceedings of the twelfth international conference on World Wide Web, pp. 690-699. ACM Press, New York, 2003.
[4] S. Thacker, A. Sheth, and S. Patel, "Complex relationships for the semantic web," in D. Fensel, J. Hendler, H. Liebermann, and W. Wahlster (eds.), Spinning the Semantic Web. MIT Press, Cambridge, 2002.
[5] R.E. Tarjan, "Fast algorithms for solving path problems," J. ACM, 28(3):594-614, 1981.
[6] R.E. Tarjan, "A unified approach to path problems," J. ACM, 28(3):577-593, 1981.
[7] A. Matono, T. Amagasa, M. Yoshikawa, and S. Uemura, "An indexing scheme for RDF and RDF Schema based on suffix arrays," in Proceedings of SWDB'03, The first International Workshop on Semantic Web and Databases, Co-located with VLDB 2003, 2003.
[8] S. Barton, "Indexing structure for discovering relationships in RDF graph recursively applying tree transformation," in Proceedings of the Semantic Web Workshop at 27th Annual International ACM SIGIR Conference, pp. 58-68, 2004.
[9] R. Agrawal, S. Dar, and H.V. Jagadish, "Direct transitive closure algorithms: design and performance evaluation," ACM Transactions on Database Systems 15(3), pp. 427-458, 1990.
[10] P.F. Dietz, "Maintaining order in a linked list," in STOC'82: Proceedings of the fourteenth annual ACM symposium on Theory of computing, pp. 122-127, New York, USA, 1982. ACM Press.
[11] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, "Reachability and distance queries via 2-hop labels," in Proceedings of the 13th annual ACM-SIAM Symposium on Discrete algorithms, pp. 937-946, 2002.
[12] R. Schenkel, A. Theobald, and G. Weikum, "HOPI: An efficient connection index for complex XML document collections," in EDBT, 2004.
[13] R. Schenkel, A. Theobald, and G. Weikum, "Efficient creation and incremental maintenance of the HOPI index for complex XML document collections," in ICDE, 2005.
[14] H. He, H. Wang, J. Yang, and P.S. Yu, "Compact reachability labeling for graph-structured data," in CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, pp. 594-601, New York, USA, 2005. ACM Press.
[15] H. Wang, H. He, J. Yang, P.S. Yu, and J.X. Yu, "Dual Labeling: Answering Graph Reachability Queries in Constant Time," in Proceedings of the 22nd International Conference on Data Engineering (ICDE), pp. 75, 2006. IEEE Computer Society.
[16] S. Barton and P. Zezula, "rhoindex - designing and evaluating an indexing structure for graph structured data," Technical Report FIMU-RS-2006-07, Faculty of Informatics, Masaryk University, 2006.